Quiz: to be taken before you get to the end. Who is this, and why is he relevant?
The story so far: Imperial have some kind of epidemic model that was used to predict, errm, stuff[1]. After a bit, people, predictably (arf!) enough, said "where's your source code?" and Imperial said "errm, it's a bit of a mess actually, hang on a mo"; and after a fair while, and some heavy massaging from folk in the private sector which turned an unreadable, unmaintainable 15k-line single-file model bearing a powerful resemblance to S+C's MSU code into something that could be seen in public without too much embarrassment, they put it onto Github[2]. I even poked around in there for a bit before realising I didn't know how to look at Github, and getting bored.
But! Other people have looked, and are unimpressed. This is no great shock, I think. The people I found were "Lockdown Sceptics" who, despite their sensible but derivative motto "Stay sane. Protect the economy. Save livelihoods", are probably a bunch of nutters; they also have a dreadfully slow website. They have a post (archived) called Code Review of Ferguson's Model.
Most of the post is about non-determinism. This is interesting (though probably no great flaw) and I'll get to it in a moment, but first a few other things they pick out. The first is the absence of any kind of tests, which seems a fair point. Writing tests is tedious and often neglected even in the Real World, so it is unsurprising that a bunch of amateurs didn't bother. Another is poor documentation of some parts; and again, meh, so it goes. Let's go back to the interesting part, non-determinism.
There are a number of meanings for this that need to be disentangled. In running such a model, you probably want to run a pile of runs with similar but perturbed initial conditions and do some averaging of the results. In this sense the model is intended to be "stochastic", and that's fair enough. However, with a given random seed, you would rather like the model to be repeatable. It appears to be rather shaky at this. The first problem looks to be parallelism. This comes up in GCMs too, and indeed way back when I said:
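To make the distinction concrete, here's a toy sketch (in Python; nothing to do with the actual Imperial code): with a fixed seed the run is exactly repeatable, and the ensemble comes from varying the seed, not from irreproducibility.

```python
# Toy sketch: a seeded stochastic model is exactly repeatable; the ensemble
# comes from varying the seed, not from irreproducibility.
import random

def run_model(seed, steps=100):
    # Toy stochastic 'epidemic': cases grow by a noisy factor each step.
    rng = random.Random(seed)            # private RNG, no hidden global state
    cases = 100.0
    for _ in range(steps):
        cases *= 1.0 + rng.gauss(0.05, 0.02)
    return cases

assert run_model(seed=42) == run_model(seed=42)    # same seed, same answer
ensemble = [run_model(seed=s) for s in range(20)]  # vary the seed for spread
print(min(ensemble), max(ensemble))
```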
There are two sorts of repeatability: you run the model again, and you get *exactly* the same results down to the last bit. This is called bit-reproducibility. Or, you run the model again, and you get *scientifically* the same result (the same climate; probably the same response to forcing within statistical error) but the exact details of the weather differ. Because the climate is chaotic (in the sense that small initial perturbations rapidly amplify) and GCMs reproduce this well, if your model diverges even slightly from bit-reproducibility it will diverge strongly from it, because the details of the individual weather will be totally different. But the climate (statistics of the weather) will be the same. See Repeatability of GCMs for more. The problem with the parallelism stuff is that the easy way to implement it leads to numbers coming back from other processors in a different order, and so inevitably the exact last few bits of floating point calculations don't quite match. If you put effort in you can avoid that, at the cost of slowing the thing down a little[3]. Imperial's solution appears to be running the model single-threaded instead (but because they didn't really care about repeatability, and had no tests to pick up problems, they still had repeatability bugs even in single-threaded mode).
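The order-of-summation problem is easy to demonstrate (a toy example, not from any real model): floating point addition isn't associative, so adding the same partial results in a different order can change the last few bits.

```python
# Toy demo: the same numbers summed in two different orders need not agree
# to the last bit, because floating point rounding depends on the order.
import random

random.seed(0)
vals = [random.uniform(0.0, 1.0) for _ in range(100000)]
s1 = sum(vals)                  # one order
s2 = sum(sorted(vals))          # another order, as if replies arrived differently
print(s1 == s2, abs(s1 - s2))   # typically False, with a tiny but nonzero difference
```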
If you look at the report, the non-repeatability looks to produce a pretty big difference. However, I rather suspect this is like GCMs: if you look at the output one month after a trivial perturbation, it will look very different even though the long-term climate is the same. What that report / pic appears to show is sensitivity to when the initial perturbation starts to grow. This might be a problem; but it might not (I think you'd need to see a whole ensemble of runs to see what it is supposed to look like).
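A toy illustration of the "compare against the ensemble" point (illustrative numbers only): the difference between two runs only tells you something once you know the spread of many runs.

```python
# Toy demo: judge a two-run difference against the ensemble spread.
import random, statistics

def run(seed, steps=60):
    # Toy stochastic growth model; each seed is one ensemble member.
    rng = random.Random(seed)
    x = 100.0
    for _ in range(steps):
        x *= 1.0 + rng.gauss(0.05, 0.02)
    return x

ensemble = [run(s) for s in range(100)]
sigma = statistics.stdev(ensemble)
pair_diff = abs(run(1) - run(2))
print("pair difference:", pair_diff, "ensemble sigma:", sigma)
# If the pair difference sits within the ensemble spread, the 'divergence'
# is just the stochasticity you asked for, not a defect.
```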
So overall I think the criticism in scientific terms isn't exciting. As a lesson-for-our-times I think "when you look behind the curtain of just following the science you'll see some messy stuff" will do. Or even Laws are Like Sausages. Better Not to See Them Being Made.
The East is Red
Interestingly, as seen in one bug report, people are shamelessly using the concept of "Red Team" in this context. It's almost as though the concept is sensible and helpful, when used in good faith.
Update
The fight for the soul of the Imperial model continues. Nurture has "Critiqued coronavirus simulation gets thumbs up from code-checking efforts", which (once you remove the goo and dribble) amounts to pointing out "CODECHECK certificate for paper: Report 9: impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand. March 16, 2020", which looks to be part of a commendable effort by Stephen J. Eglen to re-run the code behind publications; and they have, and get the same answers (in the stochastic sense).
Meanwhile Bryan Lawrence has a somewhat unconvincing defence in On scientific software - a beginning.
Refs
* The Imperial College code - ATTP
Notes
1. I'm a big-picture man. Don't bother me with details.
2. I've been re-reading Proust.
3. The Met Office / Hadley Centre are not a bunch of amateurs and did put the effort in. Incidentally, one reason for wanting the reproducibility is rare bugs where your model blows up. If you don't have reproducibility, you can't re-run with extra debug to get to the same point and see the failure more clearly, so you're pretty well stuffed for debugging.
15 comments:
I think you're probably wrong, though maybe I'm wrong in my claim.
GCMs have chaos to deal with (as you well know!) which means that any last-digit wobble blows up - you get this from different compilers and hardware for the "same" model unless you take extreme care (which the HC did for a while, don't know if they still do).
I don't think these stochastic epi models have the same sort of chaotic sensitivity - there is indeed a growing mode but that is the basic exponential growth of the epidemic itself and a last-digit perturbation shouldn't grow faster than the underlying infection anyway. So it should be easy to get stable repeatable results. It seems a little bit shoddier than it ought to be, but of course the sceptics don't know enough about it to even make a credible and relevant criticism (no change there, then).
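A toy sketch of this point (a multiplicative-growth toy, not any real epi model): a last-digit perturbation to the state grows only at the epidemic's own rate, so the relative difference stays where it started.

```python
# Toy demo: in multiplicative (epidemic-like) growth, a tiny perturbation
# of the state grows no faster than the epidemic itself, so the *relative*
# difference between two runs stays put instead of blowing up chaotically.
import random

def grow(x0, seed, steps=50):
    rng = random.Random(seed)            # same seed -> same growth factors
    x = x0
    for _ in range(steps):
        x *= 1.0 + rng.gauss(0.05, 0.01)
    return x

a = grow(100.0, seed=7)
b = grow(100.0 * (1.0 + 1e-12), seed=7)  # last-digit perturbation of the state
print((b - a) / a)                       # stays around 1e-12
```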
I wasn't sure about how the sensitivity manifests itself in the Imperial model. I doubt it is exactly analogous to the GCM chaos and didn't mean to imply it was. It's easy to see that a multi-processor version will not be bit-accurate; and given what is said, it seems easy to see how that can lead to non-exactly-comparable results. Whether it leads to wildly divergent results is another matter - but I don't know enough about what kicks off the model to be sure.
James,
But does the chaos in GCMs change the typical state of the system? I thought WMC's argument was essentially that it mostly wouldn't (as I too had thought).
GCMs will (immediately, and at a "fixed" rate) begin to grow from a given perturbation. I think James's point is that the epidemic models have no obvious reason to show this same behaviour. On reflection that sounds plausible. In fact, looking again at that particular bug report (https://github.com/mrc-ide/covid-sim/issues/116), that issue was not two different runs; it was two runs, one started from a saved file and one not. And the solution appeared to be a bug in the RNG init. So (perhaps) there is no evidence for *significant* variations based on parallelism? I'm not sure at this point... Maybe the "red team" will produce a report?
I am so old school, but afaik once you pick a seed, all subsequent numbers are deterministic (based on the 1st seed) in a prng. That is why you want a prng with a huge period (like 2^6666), as you want to have almost zero probability of picking the starting seed twice, or even a very low probability of picking the same section of the sequence twice.
That is the main reason they are called pseudo, as in, not real random number sequences.
Numerical models are different in terms of round off errors and higher order terms, but should ideally be deterministic given the same code and input afaik.
Don't think they uncovered anything though, just something lost in translation from old code to new code. Whatever they think they found, they must show that the sequence deviates from the selected distribution(s); after all, one usually runs tens of thousands of permutations from the underlying joint pdf's.
I actually reverse engineered the DEC prng from an algorithm I found in a book in our technical library. Surprised the heck out of me: same starting seeds for both, same sequence of numbers for both. Back when Windows didn't have a prng in their old PowerFortran or whatever it was called.
The general rule is to write your own prng (or code the best existing algorithm with the longest period) so as to have some level of portability and reproducibility. Or just call the Intel libraries of prng's (which I still don't do). You do always want to be able to do an exact reproduction for coding purposes though. Always. Worry about whatever you think the prng is actually doing afterwards. Always.
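For illustration, a toy LCG (with the classic Numerical Recipes constants; not a recommendation for production use) showing the seed-determines-everything property:

```python
# Toy LCG: the whole stream is a deterministic function of the seed, so the
# same seed gives the same sequence on any machine.
def lcg(seed):
    state = seed & 0xFFFFFFFF
    while True:
        # Numerical Recipes constants; modulus 2^32 via the mask.
        state = (1664525 * state + 1013904223) & 0xFFFFFFFF
        yield state / 2**32              # uniform in [0, 1)

g1, g2 = lcg(12345), lcg(12345)
assert [next(g1) for _ in range(5)] == [next(g2) for _ in range(5)]
```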
In our water wave physical models (for running water wave makers) we always chose a random seed and used it in all subsequent runs (thinking we had perfect control of the wave tank water level; we didn't). In other words, the wave time series far from the physical wave maker never had the exact same water level time series, because of slight real world differences in total water depth from run to run. I tried to explain this to our team at one time, but whoosh, way over their heads. They were so-called black box modelers.
Been doing this stuff for too long.
That is all.
OK, chaos. Numerical weather/climate models. Differences arise because of chaos. But when run to so-called equilibrium or quasi-stationary/ergodic behaviour (e.g. PI runs) they should produce the same stochastic weather and climate, very idealistically, as compared to the real world? TIA
Ensemble averaging over many runs for weather forecasting/hindcasting (with updating using real world data for reanalysis/hindcasting) models.
I think I understand why weather/climate models produce different sequences. Even over relatively short time scales? That is my rather limited understanding anyways.
If you're interested in perturbation growth in GCMs, see Butterflies: notes for a post from 2005. I really should have knocked it up for publication; but there must be something like it actually published by now.
I dug into this non-determinism issue being raised to discredit the model.
First of all, in regard to getting two different results in single-processing mode, it was a bug introduced by the Microsoft team. There's really no reason to believe it exists in the original version of the model used by IC to generate the earlier projections used by government.
As so often happens, this bug was introduced when a new feature was added. Generating the network of individuals, households, etc is time consuming, and the developers decided to introduce the option of generating and saving a network without running the simulation, then in the future loading the network and running the simulation on it. This resulted in the pseudo-random number generator being seeded twice. If you generated and then simulated the network in one run, the seeding was only done once. Different results. Hmmm. BUT once saved, every time you run the simulation on the same network you get the same results using the same seed. The developers had never compared the generate-save-restart-load-simulate workflow with the generate-simulate workflow, and hadn't noticed the two scenarios gave different results with the same seed. It was fixed two days after it was reported and diagnosed, but the fallout has not died down.
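A hypothetical sketch of that failure mode (this is not the covid-sim code, just the shape of the bug as described): re-seeding on load puts the simulation at the start of the stream instead of where the one-run path would be.

```python
# Hypothetical sketch of the double-seeding bug: if loading a saved network
# re-seeds the generator, the simulate-after-load path consumes a different
# stream than the generate-and-simulate path, even with the same seed.
import random

def generate_network(rng):
    return [rng.random() for _ in range(3)]   # stand-in for network setup draws

def simulate(rng):
    return rng.random()                        # stand-in for the epidemic draws

# Path A: generate and simulate in one run (seeded once).
rng = random.Random(42)
net = generate_network(rng)
result_one_run = simulate(rng)

# Path B: generate/save, then load and simulate -- the buggy version
# re-seeds on load, so the simulation sees the *start* of the stream,
# not the post-generation position.
rng = random.Random(42)
net_saved = generate_network(rng)
rng = random.Random(42)                        # the spurious second seeding
result_after_load = simulate(rng)

print(result_one_run == result_after_load)    # False: same seed, different answers
```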
Now, regarding the multi-processing case, given the expense of network generation they don't serialize the processes to guarantee that each time it is run, individuals are assigned to the exact same households or schools. The assignments do follow the parameter values used to drive the algorithm, so the statistical distribution does fit those. The developers state this is intentional because the time savings is more important to them than guaranteeing reproducibility - after all, you can save a network and then rerun using that same network to your heart's content (regression testing, debugging, etc).
When run in multi-processing mode they guarantee reproducibility when you simulate the same network with the same number of threads. Again, important for regression testing etc.
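A sketch of why that guarantee holds at a fixed thread count but not across thread counts (toy chunked summation standing in for the real threading): combining per-chunk partial sums in a fixed order is repeatable, but changing the chunking changes the rounding.

```python
# Toy demo: per-thread chunks combined in a fixed order give bit-identical
# results for the same thread count, but different chunking can differ in
# the last bits.
import random

random.seed(0)
vals = [random.uniform(0.0, 1.0) for _ in range(100001)]

def chunked_sum(xs, nthreads):
    n = len(xs)
    chunks = [xs[i * n // nthreads:(i + 1) * n // nthreads] for i in range(nthreads)]
    return sum(sum(c) for c in chunks)   # combine partial sums in a fixed order

print(chunked_sum(vals, 4) == chunked_sum(vals, 4))   # True: same thread count
print(chunked_sum(vals, 4) == chunked_sum(vals, 8))   # often False: different chunking
```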
Now, I can think of two possibilities here:
1. The developers from Microsoft who are working on it haven't actually tested reproducibility under the conditions where they guarantee it, and are lying when they say they have and, indeed, depend on it. I've found no evidence for this.
2. lockdownsceptic doesn't know what he's talking about. Having read his commentary, I'll say this is definitely true in some cases, at least.
If you're referring to the LDS post I linked to an archive of, that's by "Sue Denim", who I presume is female. But has a familiarish "sceptic" biog: the olde "I've been doing this since you were a young chap, young feller-me-lad" and so on.
Thanks for looking at the details of the non-repro; I think that makes clearer what I'd vaguely thought, but hadn't pinned down.
I'm more doubtful about the multi-core stuff, though. Are you sure they make that guarantee, and if so where? It is not entirely believable, if combined with getting different results with different numbers of processors/threads.
"Sue Denim" = pseudonym
As to multithreaded and thread count I also don't know but ...
I always call a (large) block of prng's in one call before diving into the code further (e.g. the inner loops), using Intel Fortran 2020 edition. That is the fastest way to get the deterministic series of prng's, from my own testing.
https://software.intel.com/content/www/us/en/develop/documentation/fortran-compiler-developer-guide-and-reference/top/language-reference/a-to-z-reference/q-to-r/random-seed.html
https://software.intel.com/content/www/us/en/develop/documentation/fortran-compiler-developer-guide-and-reference/top/language-reference/a-to-z-reference/q-to-r/random-number.html#random-number
If RANDOM_SEED is called with no arguments, the seed is set to a different, unpredictable value on each call.
The RANDOM_NUMBER generator uses two separate congruential generators together to produce a period of approximately 10**18, and produces real pseudorandom results with a uniform distribution in [0, 1).
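For the curious, "two combined congruential generators" with a period around 10**18 looks a lot like L'Ecuyer's 1988 scheme; here's a sketch in that spirit (Intel's actual generator may differ in detail):

```python
# Sketch of a two-generator combination (L'Ecuyer 1988 constants): each
# multiplicative LCG has a modest period, but their difference has a period
# of about 2.3e18, i.e. roughly 10**18.
def combined_lcg(seed1=12345, seed2=67890):
    s1, s2 = seed1, seed2                 # seeds must be in [1, modulus - 1]
    while True:
        s1 = (40014 * s1) % 2147483563
        s2 = (40692 * s2) % 2147483399
        z = (s1 - s2) % 2147483562
        yield z / 2147483563.0            # uniform in [0, 1)

g = combined_lcg()
print([round(next(g), 6) for _ in range(3)])
```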
I use the polar method to generate a gaussian distro. Good out to at least +/- 10 sigma.
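The polar (Marsaglia) method goes like this (a generic sketch, not the commenter's actual Fortran):

```python
# Marsaglia's polar method: rejection-sample a point in the unit circle and
# transform a pair of uniforms into a pair of independent standard normals.
# The tails are exact (no truncation), hence fine out to 10 sigma.
import math, random

def polar_gauss(rng):
    while True:
        u = 2.0 * rng.random() - 1.0
        v = 2.0 * rng.random() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:                 # reject points outside the circle
            f = math.sqrt(-2.0 * math.log(s) / s)
            return u * f, v * f           # two independent N(0, 1) deviates

rng = random.Random(1)
print(polar_gauss(rng))
```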
The code only runs single threaded, so that on an AMD 3700X I run 16 copies at the same time, likewise on an i5-9400 I run six copies at the same time.
The algorithm, as is, is such that it can't possibly run multithreaded. But I'm thinking there must be a simpler compiler option to run multiple single threaded copies?
But then again, I am hopeless at coding now, in my senior years; 30 years ago, that was when I was really sharp.
Anyways, no one appears to be staying in their lane afaik. All I know is that I had no one to help me figure things out; I always had to figure it out myself from textbooks or existing code, no coding mentor so to speak.
Oh and from publications, papers, etceteras.
> "Sue Denim" = pseudonym
Duh. Yeah, OK.
BTW, as to the threading: my comments were for explicit parallelisation via MPI, so not really "threading"; in fact, Message Passing Programming with MPI.
If the model was using implicit compiler-provided parallel threads then I don't know how those would behave.
WMC,
Whoever wrote that article gave a hint, something to the effect of 'say it out loud'; they also said they were not using their real name but an anonymous one. It took me several tries, and the additional info really helps (I would have given up, or not even tried, otherwise).
I ran into this multiprocessor bit-level non-reproducibility thing in the late nineties when I was flogging a Beowulf. It wasn't really a problem for us, and we wouldn't have bothered to fix it. But for arcane reasons we actually cared about exactly which processor's MPI messages were passed in which order, and basically rewrote the MPI AlltoAll call with a homebrew. Once that was done, a grad student used the algorithm to ensure floating point flowed in the right order and such, and we had bit-level reproducibility.
I think we figured out that using the thing for bit-level repro was costing us less than 10% wallclock, testing on 256 CPUs, and was unnoticeable when we scaled up to a few thousand.
Very specialized calculation, of course, a molecular dynamics kinda thing.
sidd
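A toy sketch of the fixed-order idea (plain Python standing in for the MPI machinery): gather-then-add-in-rank-order is bit-repeatable where add-in-arrival-order is not.

```python
# Toy demo of a fixed-order reduction: adding contributions in rank order is
# bit-repeatable; adding them in (nondeterministic) arrival order is not.
import random

def arrival_order_sum(contribs, arrival_seed):
    # Add contributions in a (pseudo) arrival order -- naive message passing.
    order = list(range(len(contribs)))
    random.Random(arrival_seed).shuffle(order)
    total = 0.0
    for i in order:
        total += contribs[i]
    return total

def rank_order_sum(contribs):
    # Always add in rank order -- the homebrew fixed-order reduction.
    total = 0.0
    for c in contribs:
        total += c
    return total

rng = random.Random(9)
contribs = [rng.uniform(0.0, 1.0) for _ in range(256)]  # one partial result per 'rank'

print(arrival_order_sum(contribs, 1) == arrival_order_sum(contribs, 2))  # often False
print(rank_order_sum(contribs) == rank_order_sum(contribs))              # always True
```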
For the record, Met Office models are bit comparable on the same hardware with different distributed domain decompositions and different numbers of shared memory (OpenMP) threads.
I don't believe it's ever been possible to maintain that for different hardware though.