19:01:18 <wumpus> #startmeeting 19:01:18 <lightningbot> Meeting started Thu Mar 30 19:01:18 2017 UTC. The chair is wumpus. Information about MeetBot at http://wiki.debian.org/MeetBot. 19:01:18 <lightningbot> Useful Commands: #action #agreed #help #info #idea #link #topic. 19:01:41 <instagibbs> hi 19:01:46 <jeremyrubin> here 19:01:56 <wumpus> BlueMatt: it shouldn't matter for the AbortNode() case though. Just make the conditional wait 200 milliseconds or so 19:02:03 <jtimon> topics? 19:02:24 <wumpus> doesn't matter if signals are slightly delayed 19:02:30 <BlueMatt> wumpus: the wait is already 200ms 19:02:30 <wumpus> yes, topics? 19:02:34 <BlueMatt> talk about abortnode 19:02:36 <BlueMatt> thats a reasonable topic 19:02:38 <wumpus> BlueMatt: yes, but make it a wait on a conditional 19:02:44 <BlueMatt> mmm, fair 19:02:56 <BlueMatt> yes, for sigterm its reasonable to wait a few 100 ms 19:03:03 <BlueMatt> for AbortNode it'll wake it immediately 19:03:06 <wumpus> right. 19:04:15 <BlueMatt> in other news, so we got 1/6 "Review Priority" PRs merged this week, that's ok, but we can do better :) 19:04:45 <BlueMatt> oh, nvm, got 2/6 19:04:49 <wumpus> FYI: https://github.com/bitcoin/bitcoin/projects/8 19:05:26 <gmaxwell> BlueMatt: 2/6. 19:05:28 <wumpus> removed the two that have been merged 19:05:28 <jtimon> I don't think anyone commented in mine, at least I got review in others 19:05:49 <gmaxwell> I forgot about the project tracker thing, but reviewed some of them anyways. 19:06:32 <wumpus> doesn't matter, we'll only add pulls that were mentioned to have priority in the meeting to that project tracker, so if you pay attention at the meeting you don't need it :) 19:07:31 <wumpus> anyhow, everyone go review them 19:08:21 <wumpus> topic: 0.14.1? 19:08:29 <bitcoin-git> [13bitcoin] 15sipa opened pull request #10126: Compensate for memory peak at flush time (06master...06peak_compensation) 02https://github.com/bitcoin/bitcoin/pull/10126 19:08:29 <gmaxwell> Yes, please. 19:08:33 <jeremyrubin> topic: slow tests? 19:08:37 <wumpus> #topic 0.14.1 19:09:02 <achow101> already? 19:09:25 <gmaxwell> yes. lots of nice fixes to ship. Minor releases are minor. 19:09:39 <sipa> we have 11 merged PRs marked 0.14.1 19:09:59 <wumpus> and three open pulls https://github.com/bitcoin/bitcoin/pulls?q=is%3Aopen+is%3Apr+milestone%3A0.14.1 19:10:32 <wumpus> sipa: only one still has the "needs backport" tag, I think MarcoFalke ported the rest 19:10:44 <gmaxwell> I think when we get those three in and those and the recent ones backported, we should be goof for a 0.14.1. 19:11:11 <gmaxwell> good too. 19:11:41 <wumpus> agreed 19:12:45 <wumpus> ok, next topic I guess 19:12:48 <wumpus> #topic slow tests 19:13:11 <wumpus> I made an overview here: https://github.com/bitcoin/bitcoin/issues/10026 of the slowest unit tests 19:13:15 <jeremyrubin> A lot of it is my fault it seems. https://github.com/bitcoin/bitcoin/pull/10099 is ready to go 19:13:27 <wumpus> some of those have already been worked on or even PRs open to make them faster 19:13:36 <gmaxwell> jnewbery: if you're around, if 10106 doesn't move forward in a few hours, I recommend you create a new pr, cherry picking that one fix, with your new tests and the fix for the other function. (just because the PR wasn't opened by a regular so they may not be responsive or may not know how to use github well enough to pull your changes.) 19:14:12 <jnewbery> gmaxwell: I'll do that this afternoon 19:14:43 <wumpus> yes, tend to agree, they may not know how to cherry-pick them. I could cherry-pick them into his branch but meh 19:14:53 <MarcoFalke_lab> wumpus: Yes, dropping GetRand() gave a 4* speedup. 19:15:02 <gmaxwell> jnewbery: and thanks for the awesome response. 19:15:08 <jnewbery> np 19:15:37 <wumpus> MarcoFalke_lab: that'll at least move it down a bit in the top 20 19:15:42 <jeremyrubin> The cuckoocache tests require a bit more discussion to decrease the parameters; I can pick something arbitrary 19:16:27 <jeremyrubin> The checkqueue tests should also speed up a lot with #9938, but I'm preparing some tweaks to those 19:16:28 <gribble> https://github.com/bitcoin/bitcoin/issues/9938 | Lock-Free CheckQueue by JeremyRubin · Pull Request #9938 · bitcoin/bitcoin · GitHub 19:16:46 <Chris_Stewart_5> re tests: Has anyone given any thought to #8469 , obviously this pull request sacrafices speed for exhaustiveness 19:16:48 <gribble> https://github.com/bitcoin/bitcoin/issues/8469 | [POC] Introducing property based testing to Core by Christewart · Pull Request #8469 · bitcoin/bitcoin · GitHub 19:17:02 <jeremyrubin> If anyone has high-core machines, they should try benching how that works 19:17:02 <wumpus> jeremyrubin: alternatively we could have an -extended mode for the unit tests, like we have for the functional tests, to do extra-thorough tests that shouldn't run every time 19:17:13 <MarcoFalke_lab> jeremyrubin: If the parameters matter, you could provide a switch to test against quick_run and an expensive one 19:17:20 <wumpus> yes ^^ 19:17:27 <gmaxwell> I think we should have an extended test, and make running it part of the release process. 19:17:42 <jeremyrubin> Agree 19:17:45 <wumpus> but for the standard 'make check' there should be a guideline of max ~1 second per test case 19:17:59 <MarcoFalke_lab> Everyone agrees! :) 19:18:07 <gmaxwell> I don't care if the tests take an hour to run if it's only a mandatory once in release thing. 19:18:09 <cfields> sgtm 19:18:11 <jeremyrubin> no, 2 seconds! 19:18:18 <wumpus> gmaxwell: yes or once per day or so on master 19:18:35 <jonasschnelli> (which is failing currently) 19:18:49 <gmaxwell> jonasschnelli: what is failing? 19:18:49 <wumpus> yes, travis is broken again, but there's a pull to fix that 19:18:52 <cfields> note to self: gitian should run whatever extended tests it can 19:18:55 <gmaxwell> oh travis 19:18:58 <jonasschnelli> gmaxwell: https://travis-ci.org/bitcoin/bitcoin/builds 19:19:13 <MarcoFalke_lab> Jup, will merge the travis fix tonight when I have access to my keys. (If no one beats me to it) 19:19:31 <wumpus> does anyone have the # perhaps? 19:19:58 <sdaftuar> i think #10114? 19:19:59 <MarcoFalke_lab> Though, we should be cautious with putting too much into the cron jobs. The extended test rely on the cache being up to date 19:19:59 <gribble> https://github.com/bitcoin/bitcoin/issues/10114 | An error has occurred and has been logged. Please contact this bot's administrator for more information. 19:20:10 <MarcoFalke_lab> Otherwise they would break the 49 minute limit on traivs 19:20:11 <wumpus> there's also #10072 19:20:12 <gribble> https://github.com/bitcoin/bitcoin/issues/10072 | Remove sources of unreliablility in extended functional tests by jnewbery · Pull Request #10072 · bitcoin/bitcoin · GitHub 19:20:39 <wumpus> but yes 114 was the one I meant 19:20:40 <gmaxwell> We should probably contemplate having some seperate process against master that does continual gitian builds or something. 19:21:10 <MarcoFalke_lab> jonasschnelli: Is doing that, gmaxwell 19:21:22 <gmaxwell> oh good. 19:21:25 <gmaxwell> ignore me. 19:21:29 <jonasschnelli> gmaxwell: https://bitcoin.jonasschnelli.ch 19:21:32 <wumpus> yes, jonasschnelli has a build server with a very nice web UI 19:21:43 <MarcoFalke_lab> jonasschnelli: Are you running the tests? 19:21:44 <jonasschnelli> (it's currently by manual trigger) 19:21:50 <jonasschnelli> no.. just gitian 19:21:55 <gmaxwell> somehow I missed this. cool. 19:22:08 <wumpus> travis already runs the tests, this is to get executables for testing 19:23:31 <jnewbery> travis is only broken now because we've set it to run the extended tests once per day, so we're currently flushing out all the extended tests that have always failed on travis. I think once #10114 and #10072 are merged the daily runs should succeed reliably 19:23:32 <gribble> https://github.com/bitcoin/bitcoin/issues/10114 | An error has occurred and has been logged. Please contact this bot's administrator for more information. 19:23:33 <gribble> https://github.com/bitcoin/bitcoin/issues/10072 | Remove sources of unreliablility in extended functional tests by jnewbery · Pull Request #10072 · bitcoin/bitcoin · GitHub 19:24:08 <jonasschnelli> thanks jnewbery for the info (and the fixes) 19:25:56 <bitcoin-git> [13bitcoin] 15sdaftuar opened pull request #10127: [0.14 backport] Mining: Prevent slowdown in CreateNewBlock on large mempools (060.14...062017-03-backport-cnb-optimizations) 02https://github.com/bitcoin/bitcoin/pull/10127 19:26:48 <wumpus> ok, any other topics? 19:27:41 <bitcoin-git> [13bitcoin] 15laanwj pushed 3 new commits to 06master: 02https://github.com/bitcoin/bitcoin/compare/cde9b1a864c1...edc62c959ab3 19:27:42 <bitcoin-git> 13bitcoin/06master 146426716 15John Newbery: Add send_await_disconnect() method to p2p-compactblocks.py... 19:27:42 <bitcoin-git> 13bitcoin/06master 146a18bb9 15John Newbery: [tests] sync_with_ping should assert that ping hasn't timed out... 19:27:43 <bitcoin-git> 13bitcoin/06master 14edc62c9 15Wladimir J. van der Laan: Merge #10114: [tests] sync_with_ping should assert that ping hasn't timed out... 19:27:49 <gmaxwell> I should make notes for topics... 19:27:57 <BlueMatt> oh, new prs 19:28:02 <BlueMatt> for review priority 19:28:12 <BlueMatt> looks like jonasschnelli already added his 19:28:16 <bitcoin-git> [13bitcoin] 15laanwj closed pull request #10114: [tests] sync_with_ping should assert that ping hasn't timed out (06master...06improve_sync_with_ping) 02https://github.com/bitcoin/bitcoin/pull/10114 19:28:21 <BlueMatt> anyone else have something to add (aside from 0.14.1-tagged things) 19:28:22 <wumpus> fine with me, but don't forget the current bunch :p 19:28:32 <wumpus> yes please review 0.14.1 tagged things 19:28:35 <wumpus> although those should be easy 19:28:36 <jonasschnelli> BlueMatt: Yes. The bumpfee refactor (to get a QT fee bump function) 19:28:53 <sipa> 0.14.1 has priority, but i'd like to see #9792 in at some point to further the get-rid-of-openssl thing :) 19:29:01 <BlueMatt> wumpus: there is a "For backport" column there... 19:29:02 <gribble> https://github.com/bitcoin/bitcoin/issues/9792 | FastRandomContext improvements and switch to ChaCha20 by sipa · Pull Request #9792 · bitcoin/bitcoin · GitHub 19:29:08 <wumpus> but I think we should be able to do 0.14.1 tomorrow so to have something to review for the rest of the week :D 19:29:22 <wumpus> BlueMatt: yes good point 19:29:53 <BlueMatt> ok, so morcos and sdaftuar get to propose new ones since they dont have ones up, anyone else want to propose ones? 19:29:55 <gmaxwell> oh someone opened a PR to do something with debug log excludes, and it adds more insane string processing in the debugging critical path. Would anyone mind if I brought back the PR I shelved to make debug categories use an enum? 19:30:17 <BlueMatt> gmaxwell: wfm, the pr was jnewbery's 19:30:20 <sipa> gmaxwell: ack enum debug categories 19:30:40 <wumpus> gmaxwell: not sure 19:30:43 <jnewbery> #10123 19:30:44 <gribble> https://github.com/bitcoin/bitcoin/issues/10123 | Allow debug logs to be excluded from specified component by jnewbery · Pull Request #10123 · bitcoin/bitcoin · GitHub 19:30:48 <gmaxwell> (the PR was just shelved because I opened it right before a release, and it is a conflict-the-universe PR) 19:30:59 <wumpus> gmaxwell: well no I guess it makes sense 19:31:00 <BlueMatt> gmaxwell: go for it now, I'd say 19:31:05 <wumpus> yes, do it 19:31:07 <sdaftuar> i'd like to get this DisconnectTip improvement wrapped up (#9208) but i need some help on resolving this annoying issue BlueMatt raised 19:31:08 <gribble> https://github.com/bitcoin/bitcoin/issues/9208 | Improve DisconnectTip performance by sdaftuar · Pull Request #9208 · bitcoin/bitcoin · GitHub 19:31:08 <cfields> gmaxwell: +1. 19:31:14 <jonasschnelli> Yes. ACK 19:31:23 <cfields> along the same lines, I'd like to do something similar for net messages 19:31:23 <wumpus> then you can also map the enum values to strings and automate the help message generation 19:31:26 <gmaxwell> #9424 19:31:27 <gribble> https://github.com/bitcoin/bitcoin/issues/9424 | Change LogAcceptCategory to use uint32_t rather than sets of strings. by gmaxwell · Pull Request #9424 · bitcoin/bitcoin · GitHub 19:31:30 <BlueMatt> sdaftuar: want to bring it up now/ 19:31:38 <jnewbery> gmaxwell: +1. I'll happily review that one 19:31:38 <BlueMatt> ? 19:31:48 <wumpus> cfields: yes, should be fairly easy to do, I already changed the syntax for net messages to be enum-like at some point 19:31:49 <jtimon> since #8855 didn't got new review nor re-review I think just leave that (just remind to potential reviewers that #8994 continues it, in case you "don't see the point") 19:31:51 <gribble> https://github.com/bitcoin/bitcoin/issues/8855 | Use a proper factory for creating chainparams by jtimon · Pull Request #8855 · bitcoin/bitcoin · GitHub 19:31:51 <gribble> https://github.com/bitcoin/bitcoin/issues/8994 | Testchains: Introduce custom chain whose constructor... by jtimon · Pull Request #8994 · bitcoin/bitcoin · GitHub 19:32:15 <wumpus> cfields: went as far as I could without making it an actual enum :) 19:32:17 <gmaxwell> cfields: I looked at adding a perfect hash for net messages... but didn't know if that would be a way you'd want to go. :) 19:32:29 <BlueMatt> new for review this week is 9792 and 9208 19:32:35 <cfields> wumpus: yes, it sure looks like an enum :) 19:32:37 <sipa> gmaxwell: just make sure all debug category names have different length 19:33:12 <gmaxwell> sipa: well we don't need to do any runtime lookup of category names at all.. so no need to do anything performant there. 19:33:27 <wumpus> at least give them consecutive values so a bit field can be used to represent the set 19:33:33 <wumpus> intead of a per-thread map :-( 19:33:40 <cfields> gmaxwell: i had considered making it an array<char, 12>. But I really don't care, as long as it's switchable and a compile-time constant 19:33:41 <gmaxwell> wumpus: that is what PR 9424 does. 19:33:52 <wumpus> gmaxwell: ok, great 19:33:53 <gmaxwell> wumpus: it assigns them each to a bit. 19:34:22 <wumpus> the current code that assigns it a thread-local variable is really strange 19:34:27 <bitcoin-git> [13bitcoin] 15JeremyRubin opened pull request #10128: Speed Up CuckooCache tests (06master...06cuckoo-tests-faster) 02https://github.com/bitcoin/bitcoin/pull/10128 19:34:30 <cfields> (and now that I think about it, std::array probably isn't switchable anyway) 19:34:39 <gmaxwell> yes. horrors from beyond time. 19:34:54 <sdaftuar> topic suggestion: dealing with abortnode / ConnectTip / DisconnectTip failures 19:35:05 <gmaxwell> in any case, if people support-- 9424 is hard to rebase because it touches the whole codebase. 19:35:30 <wumpus> cfields: in some modern languages it's possible to switch on compound types and arrays and strings, but no, not c++XX before XX=50 or so :) 19:35:33 <gmaxwell> but I think it also makes jnewbery's exclude trivial. (in fact: no runtime impact...) 19:35:50 <wumpus> gmaxwell: yes, it's how it should be done 19:36:09 <sipa> in haskell it's pretty much the only way you can inspect a compound type at all :p 19:36:14 <jnewbery> gmaxwell: agree 19:36:22 <wumpus> sipa: indeed 19:36:41 <sipa> ok, sdaftuar's topic? 19:36:46 <gmaxwell> cfields: my intution for that would be using gperf to map the messages to integers. then switching on them... but perhaps overkill. 19:36:53 <wumpus> #topic dealing with abortnode / ConnectTip / DisconnectTip failures 19:37:05 <BlueMatt> context: #9208 19:37:06 <gribble> https://github.com/bitcoin/bitcoin/issues/9208 | Improve DisconnectTip performance by sdaftuar · Pull Request #9208 · bitcoin/bitcoin · GitHub 19:37:09 <sipa> sdaftuar: i saw i was pinged in the PR, but haven't read it 19:37:13 <sdaftuar> so this issue might be hard to grasp without reviewing #9208 19:37:15 <gribble> https://github.com/bitcoin/bitcoin/issues/9208 | Improve DisconnectTip performance by sdaftuar · Pull Request #9208 · bitcoin/bitcoin · GitHub 19:37:29 <wumpus> gmaxwell: will that work for unknown messages too? 19:37:40 <sdaftuar> but basically matt brought up that we have some edge cases in our code if ConnectTip or DisconnectTip return false 19:37:54 <wumpus> gmaxwell: does a perfect hash handle collisions? I don't remember 19:38:07 <cfields> wumpus: perfect hash means no collisions :) 19:38:13 <wumpus> cfields: for known values 19:38:34 <jeremyrubin> wouldn't that not be backwards compatible? 19:38:36 <gmaxwell> wumpus: gperf can check, it's an option. 19:38:44 <wumpus> cfields: but what if 'moo' maps to the same 32-bit value as 'block' 19:38:50 <jeremyrubin> someone sends you a new type of message that hashes to something else 19:39:05 <wumpus> scary 19:39:06 <sipa> sdaftuar: hmm, i'll review the PR 19:39:10 <sdaftuar> the last issue i'm trying to resolve is if ConnectTip fails (but the block is not invalid), then we should be in an AbortNode() state. what consistency are we aiming for in this situation? 19:39:14 <cfields> wumpus: ah, true 19:39:16 <gmaxwell> wumpus: making it never have collisions for unknows is just an extra comparison with the correct value. 19:39:35 <gmaxwell> wumpus: which IIRC the gperf emitted code will do, if enabled. 19:39:47 <gmaxwell> (actually, looks like it's the default) 19:39:47 <BlueMatt> lol, ok, so lets pick a topic instead of two 19:40:16 <gmaxwell> sdaftuar: we should be able to shut down and come back up with the underlying issue resolved and continue. 19:40:26 <wumpus> yes, sorry, I was just curous about gperf 19:40:30 <gmaxwell> sdaftuar: I don't care if we lose a bunch of recent blocks. 19:40:32 <sipa> sdaftuar: it's fine if we lose progress in that case, but if at all avoidable, the on disk state should not be corrupted 19:40:44 <BlueMatt> anyway, so I further observed that shutdown upon AbortNode can take up to 200ms (see recent PR), which is somewhat frightening, given that mempool and chainstate may not be consistent which we assume in many places :/ 19:40:46 <sdaftuar> gmaxwell: sipa: what should we do with the mempool? 19:41:03 <sipa> sdaftuar: who cares? 19:41:08 <BlueMatt> so we keep running for a while normally and possibly have incorrect assumptions 19:41:11 <gmaxwell> I don't care. 19:41:16 <wumpus> just drop it 19:41:29 <sipa> sdaftuar: at restart we'll try to reimport what is on disk 19:41:36 <sipa> and re-run all necessary consistency checks 19:41:51 <sipa> so mempool.dat doesn't need to be consistent with the chainstate 19:41:59 <BlueMatt> and wallet? 19:42:01 <sdaftuar> i think the concern matt was raising was if before shutdown, it's possible for eg an RPC call to get incorrect results 19:42:04 <sdaftuar> or the wallet 19:42:08 <wumpus> yes, no tight coupling, good design 19:42:19 <BlueMatt> wumpus: point is we *have* tight coupling 19:42:22 <wumpus> wallet doesn't need to be in sync either, though it should not be ahead 19:42:24 <sipa> sdaftuar: ugh 19:42:28 <BlueMatt> and continue running after realizing something is corrupted 19:42:35 <gmaxwell> well abortnode should be instant-ish. 19:42:42 <gmaxwell> BlueMatt: so you're fixing that, what is there to discuss? 19:42:46 <wumpus> a bit behind (as long as it's not further than the prune distance) is ok, it will catch up in next start 19:42:49 <BlueMatt> gmaxwell: i mean the new pr is an improvement, but its not a fix 19:43:02 <BlueMatt> its not like you cant have something pending cs_main when coming out of Disconnect/ConnectTip 19:43:06 <wumpus> but if the wallet is ahead of the chain it will trigger a rescan IIRC 19:43:10 <gmaxwell> so perhaps abortnode needs to Abort()? 19:43:18 <BlueMatt> gmaxwell: thats pretty much the question 19:43:21 <BlueMatt> thats super shit, though, of course 19:43:24 <gmaxwell> and not faffabout witht politely shutting off threads. 19:43:24 <wumpus> I don't think so 19:43:39 <sdaftuar> i inadvertently introduced an assert failure in one of these situations. maybe that's a feature not a flaw! :) 19:43:44 <gmaxwell> Why is it shit? we should be durable across a power off, which is worse than aborting. 19:43:45 <wumpus> the point of abortnode is to be able to show a GUI dialog to the user 19:43:55 <gmaxwell> ah 19:43:57 <wumpus> if you abort() the user will never know to check the debug.log 19:44:07 <BlueMatt> wumpus: well we can show a gui dialog and abort() prior to unlocking cs_main 19:44:09 <BlueMatt> :P 19:44:15 <jeremyrubin> have a graceful shutdown falg 19:44:19 <jeremyrubin> *flag 19:44:24 <gmaxwell> wumpus: but it's not unlikely that we can't show a gui... if we can't allocate (likely cause). 19:44:27 <BlueMatt> which would effectively be sufficient 19:44:29 <wumpus> I always thought that AbortNode was for errors that could tolerate a graceful shutdown 19:44:35 <wumpus> the really terrible errors we already assert() or abort() on 19:44:36 <jeremyrubin> and if we're shutting down gracefully, write to a file "was graceful" 19:44:41 <BlueMatt> gmaxwell: AbortNode() is usually disk corruption 19:44:45 <jeremyrubin> On next start, show dialogue first thing? 19:44:46 <wumpus> please don't make it routine to abort() for everything 19:44:53 <wumpus> only for things that really warrant it 19:45:00 <wumpus> not crash on every single error 19:45:02 <BlueMatt> or too little disk space 19:45:02 <sdaftuar> or too little disk space, not quite teh same as corruption 19:45:03 <wumpus> that's just a mess 19:45:21 <gmaxwell> okay, sure. 19:45:24 <wumpus> for some problems it's fine to try to clean up normally 19:45:31 <wumpus> but we still need to exit with a message 19:45:34 <wumpus> that's what abortnode is for 19:45:56 <wumpus> if BlueMatt can make it work faster that's great, but don't silently kill the program on every error 19:46:18 <gmaxwell> wumpus: how about every other error? 19:46:26 <gmaxwell> :P 19:46:33 <wumpus> gmaxwell: hehe 19:46:59 <sipa> do #9208 and #9725 interact? 19:47:01 <gribble> https://github.com/bitcoin/bitcoin/issues/9208 | Improve DisconnectTip performance by sdaftuar · Pull Request #9208 · bitcoin/bitcoin · GitHub 19:47:03 <gribble> https://github.com/bitcoin/bitcoin/issues/9725 | CValidationInterface Cleanups by TheBlueMatt · Pull Request #9725 · bitcoin/bitcoin · GitHub 19:47:07 <gmaxwell> Y'all so worried about the dressings on the coffin. "It's already dead!" But sure. Sorry I was only thinking of OOM, it's just a recent subject. 19:47:19 <BlueMatt> sipa: they dont, afaik, aside from there now being two structs that can be merged 19:47:22 <jeremyrubin> I think deleting a file on graceful shutdown should work 19:47:42 <jeremyrubin> And then starting when that file is present shows the dialouge rather than starting 19:47:46 <BlueMatt> gmaxwell: yea, AbortNode isnt used for thigns like OOM and total death 19:47:49 <BlueMatt> disk space also does it 19:47:55 <BlueMatt> but it can also be db corruption 19:48:13 <jeremyrubin> This means you don't do anything at all on the Abort, but requires the user to restart their node to see the message 19:48:15 <BlueMatt> so maybe the solution is AbortNode gets renamed to ShutdownSoon() and use make sure disk corruption is something different? 19:48:31 <wumpus> BlueMatt: yes, makes sense to me 19:48:33 <wumpus> BlueMatt: or add a flag 19:48:38 <morcos> I haven't really thought this all the way through, but doesn't it seem that as we move more and more towards things happening in other threads, that you'd rather let those threads finish what they are doing 19:48:45 <gmaxwell> BlueMatt: you could also harden things up against your current concer by having the rpc stuff always check the shutdown flag right after taking cs_main (whenever it does)? 19:48:47 <morcos> If you're committing wallet changes for instance 19:48:59 <gmaxwell> BlueMatt: and if it finds out that its on its way down, just returning at that point. 19:49:06 <wumpus> in the long run morcos we should move toward looser coupling of things 19:49:09 <wumpus> I agree 19:49:13 <BlueMatt> gmaxwell: yea, I'm worried that if we start down that path we have to check it everywhere 19:49:22 <BlueMatt> eg wallet may make decisions based on some corrupted mempool state 19:49:27 <morcos> Let's say you just got some new address from the wallet, and the wallet hasn't committed that state yet and then you Abort() 19:49:31 <BlueMatt> (I havent thought all the way through all the potential issues here, just a potential concern) 19:50:19 <wumpus> morcos: luckily we have the keypool to handle that specific contingency 19:50:24 <jeremyrubin> Every thread could have their own unique ungraceful-close file that it should delete (via RAII) on clean exit. Starting with any present would show error. Uncoupled! 19:50:40 <gmaxwell> morcos: well at least the worst you get there is replay (due to keypool/hdwallets).. though replay can be pretty bad. 19:51:11 <wumpus> pretty bad but well at least they will never lose live keys 19:51:47 <BlueMatt> jeremyrubin: I have a feeling we can be more agressive on the super-strange issues, afaiu this stuff is pretty much hit just with out-of-disk everything else is rare enough..... 19:52:36 <sipa> maybe we should just have some std::atomic<bool> shits_on_fire_yo; which when set prevents RPCs etc 19:52:46 <wumpus> jeremyrubin: that's not what I mean with uncoupling, what I mean is that if one thread messes up it doesn't need to take the others with it because it can operate more-or-less independently 19:52:58 <gmaxwell> sipa: well shutdown can be more or less that. 19:53:14 <sipa> that can even be set from a signal handler 19:53:34 <BlueMatt> 20<BlueMatt>30 so maybe the solution is AbortNode gets renamed to ShutdownSoon() and use make sure disk corruption is something different? 19:53:53 <jeremyrubin> sipa: what do you do with an in progress RPC? 19:54:04 <BlueMatt> in ShutdownSoon cases (eg out of disk) we're ok to continue and just shut down, i think 19:54:08 <sipa> jeremyrubin: we can only do so much 19:54:08 <wumpus> you can't generally break off an in-progress RPC 19:54:11 <BlueMatt> in other cases, we can just assert(false) 19:54:13 <wumpus> although some will pay attention to fShutdown 19:54:15 <BlueMatt> because they're super rare anyway 19:54:21 <jeremyrubin> wumpus: ^ yes, O 19:54:21 <gmaxwell> but really, what probably should be done is having the rpcs being able to handle being told "No, you can't have access to [whatever chain state you wanted] right now." then these error conditions that could leave them unstable... could set them. 19:54:26 <BlueMatt> and if your disk is corrupted we probably shouldnt be flushing wallet and chainstate as a part of shutdown anyway 19:54:47 <BlueMatt> gmaxwell: yea, but thats a huge rewrite for cases that should be super rare 19:55:11 <gmaxwell> Or we take a whole different approach to failure messages, .e.g. wrap bitcoin-qt in another program that when qt exists it can go through the logs and give you a useful error. (though this doesn't answer morcos' wallet flush, but really that should be in another process.) 19:55:26 <wumpus> BlueMatt: I think there should be a clear separation between (A) I/O issues such as out of disk spae, which just happen regularly (B) rare implementation issues such as internal consistency errors 19:55:27 <cfields> iirc rpc has a IsRPCShuttingDown() or so, but only a few things (gbt only, maybe?) checks it 19:55:28 <jeremyrubin> wumpus: * yes, I'm aware that isn't quite what you meant, just making noise about having a graceful shutdown file because I think it's a reasonable way to mark if a node shut down dirty 19:55:44 <wumpus> (A) normal errors should just give the user an error as AbortNode does now and shut down properly 19:56:02 <wumpus> (B) internal state corruption errors could break off the process immediately 19:56:02 <gmaxwell> It's obnoxious that we can't preallocate resources such that we can guarentee that we never take a failure at an inoppturtune time. 19:56:14 <gmaxwell> having to slather the code in error handling to deal with that is not ideal. 19:56:41 <jeremyrubin> gmaxwell: you can preallocate lots of things? 19:56:44 <gmaxwell> wumpus: "I had to abort processing a block" is a weak kind of internal state corruption. It leaves assumed invariants violated. 19:56:57 <gmaxwell> jeremyrubin: in particular we can't make leveldb's disk usage constant. 19:56:58 <BlueMatt> wumpus: yes, thats my proposal 19:57:12 <BlueMatt> wumpus: but, specifically, B would include things like chainstate-on-disk-corruption 19:57:17 <BlueMatt> which it already partially does, just not completely 19:57:26 <wumpus> having code to handle normal errors is perfectly normal and all code has that, I agree that paranoid memory/disk corruption errors tend not to be possible to handle in a sane way 19:57:33 <wumpus> BlueMatt: yes 19:57:48 <BlueMatt> ok, soooo, acks on:<BlueMatt> <BlueMatt> so maybe the solution is AbortNode gets renamed to ShutdownSoon() and use make sure disk corruption is something different? 19:58:07 <wumpus> BlueMatt: as I said beforer, yes, that or add a flag/criticality level 19:58:11 <BlueMatt> s/use make sure/make sure/ 19:58:15 <jeremyrubin> BlueMatt: maybe if you paste it again 19:58:15 <BlueMatt> sure, that too 19:58:23 <BlueMatt> jeremyrubin: ok, <BlueMatt> ok, soooo, acks on:<BlueMatt> <BlueMatt> so maybe the solution is AbortNode gets renamed to ShutdownSoon() and use make sure disk corruption is something different? 19:58:39 <gmaxwell> wumpus: for example, if we run out of space while flushing our view of the blockchain will be corrupt, and information about the blockchain from rpcs will be wrong. In a world where error handling is free, every code path that gets access to the blockchain data would be able to gracefully handle being told it's just not available. But I think that would be an unreasonable and dangerous amount 19:58:45 <gmaxwell> of complexity. :( 19:58:50 <BlueMatt> wumpus: I missed that statement, but, yes 19:59:00 <gmaxwell> BlueMatt: that sounds okay to me. but I don't know that diskfull is really a shutdown soon though we really do need to give it a nice message. :( 19:59:03 <wumpus> gmaxwell: note that the out-of-disk error happens *before* the disk is entirely full 19:59:04 <jeremyrubin> BlueMatt: sgtm 19:59:12 <wumpus> gmaxwell: there's a threshold 19:59:22 <wumpus> gmaxwell: so that should already be mitigated 19:59:26 <gmaxwell> wumpus: yes, but it's not atomic. you can't check and then successfully allocate space. 19:59:38 <wumpus> no, it's not perfect, but it works pretty well 19:59:43 <wumpus> I've never had corruption on full disk 20:00:33 <wumpus> also leveldb write failing shouldn't generally be fatal 20:00:39 <gmaxwell> on my desktop, which runs with debug=1 it almost always gets checked at the start of the flush. It doesn't corrupt things on disk, but as matt points out the rpc would return somewhat incorrect results during that time. 20:00:39 <wumpus> it just means the last transaction is not committed 20:01:00 <jtimon> it seems it's time to abort the meeting 20:01:06 <wumpus> #endmeeting