You may be so fortunate as to not know what a BSOD is. Just in case, I’ll tell you: BSOD is short for Blue Screen Of Death, and it’s the thing Windows operating systems do when they panic. That is, at some point, without warning, your screen turns blue and the only choice you have left is when to turn your computer off. BSODs are a thing that should happen exceedingly rarely. On the ideal spectrum of software reliability, the only things more reliable than your operating system are pretty much the things that can kill people if they fail, like flight control systems. Don’t worry, most flight control systems do not run on Windows. A BSOD should indicate that Windows has lost an assumption it holds dear, like being able to read and write a hard drive. A BSOD is Windows’ way of saying “I’m really sorry, but I can’t do the things you want me to do because something I rely on has let me down.”
BSODs also happen notoriously often. For example, a little after Windows 8 came out, I was looking around in a store where you could buy laptops with Windows 8 on them. There were 6 laptops, 5 of them showing the infamous new start screen, and 1 of them showing the infamous new BSOD.
I don’t mean to rag on Windows. I mean, I am totally ragging on Windows this entire post because of one such BSOD. But there are tradeoffs between everything, and I think there’s a very tight race between Windows, OS X, and Ubuntu right now. All three have their own unique compelling advantages and mind-expanding frustrations. I don’t want the takeaway from this post to be that Windows is bad.
This is my first BSOD story. I say “first” not to foreshadow, but because I don’t know how else I’d disambiguate another BSOD story if it ever happens.
I’m working at Bionym (when it was still called that) doing my part to create the Nymi wristband authentication solution and its surrounding ecosystem. The bands are untrackable by definition, because we want to offer a digital identity that the user can control. The bands communicate with Bluetooth Low Energy (BLE). Windows has a security feature that requires the user to manually whitelist all BLE devices they want their computer to communicate with. This is at odds with the user experience we want to create. For example, you might want to unlock your Windows computer using your band. But first you’d have to whitelist the band, because you constantly have to whitelist the band, because it’s not trackable.
So we circumvent Windows. Instead of using Windows’ BLE support, we stick a BLE dongle in the computer. USB on one side, BLE on the other. Windows notes that a USB device has attached. The USB device and Windows perform a courtship dance, and a serial port is born. Software can open a serial port, read from and write to it, and no manual user consent is needed. And so software can talk to this serial port and get BLE functionality out of it.
Serial ports have their own problems. In particular, you can only have one program use a serial port at once. This is again at odds with the experience we want to create. We want to allow many programs to communicate with the band. So instead of having each program talk directly with the serial port, we create a service that owns the serial port and knows how to multiplex it to various other programs. I called this component the ecodaemon — ecosystem daemon. It was a roundabout journey, but we finally have the functionality we want.
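To make the idea of a service that owns the port concrete, here is a minimal sketch in Python. It is not the ecodaemon's actual design: `FakePort`, `PortMultiplexer`, and the choice to broadcast every incoming chunk to every client are assumptions made purely for illustration, and a real service would talk to other processes rather than to objects in the same program.

```python
import queue
import threading


class FakePort:
    """Hypothetical stand-in for the one serial port; it echoes writes back."""

    def __init__(self):
        self._buffer = bytearray()

    def write(self, data: bytes) -> None:
        self._buffer.extend(data)

    def read_available(self) -> bytes:
        # Hand back everything buffered so far and start a fresh buffer.
        data, self._buffer = bytes(self._buffer), bytearray()
        return data


class PortMultiplexer:
    """Sole owner of the port; clients go through it instead of opening the port."""

    def __init__(self, port):
        self._port = port
        self._write_lock = threading.Lock()  # one writer at a time
        self._clients = {}  # client name -> queue of incoming chunks

    def register(self, name: str) -> queue.Queue:
        inbox = queue.Queue()
        self._clients[name] = inbox
        return inbox

    def send(self, name: str, data: bytes) -> None:
        if name not in self._clients:
            raise KeyError(f"unknown client: {name}")
        with self._write_lock:
            self._port.write(data)

    def pump(self) -> int:
        """Read whatever arrived and copy it to every client (an assumption)."""
        data = self._port.read_available()
        if data:
            for inbox in self._clients.values():
                inbox.put(data)
        return len(data)
```

Two programs would each call `register`, send through `send`, and read from their own queue after the service calls `pump`. Only the multiplexer ever touches the port, which is the whole point.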
Or so we thought. BSODs were not part of the intended experience, nor the software design. But BSODs there were. They were intermittent, so at first it wasn’t clear that we were the ones aggravating them. How can software cause a BSOD anyway? That would be like a train causing a bridge to collapse. It’s not a planned method of failure. Removing supports that hold up the bridge to demolish it is expected. Having to replace rails on top of the bridge is expected (I guess, I don’t know exactly how railway bridges work). But the supports keep the rails up, and the train runs on the rails, and the train should have no way of breaking the supports.
But we eventually zeroed in on the ecodaemon: somehow it was aggravating Windows into doing something it shouldn’t. We then put it off, because how can software prevent a BSOD? You might try reducing the mass of the train, but you’ve been guaranteed that the bridge won’t break under some mass, so why would you trust loading half that mass? Your guarantee is gone, and for all you know the bridge might randomly break under any load. Asking software to not BSOD is like asking Windows to make sure the hard drive doesn’t explode. Windows can’t ensure the hard drive doesn’t explode, and software can’t make sure Windows doesn’t BSOD.
Eventually we had to fix it. I’m no Windows expert, but I wrote the ecodaemon, so I was still best suited for the task.
Faulty hardware was pretty easy to rule out, because such a wide variety of setups were failing. My own computer never failed; somehow it was on a stable island. There were a couple other stable islands in the office, I think. So Windows couldn’t complain about failing hardware. We knew the railway bridge’s supports were sound, because we could swap out any support design for any other, and the train still failed to cross the bridge. It’s astronomically more likely that a single set of rails is broken than multiple sets of supports.
We explored the possibility that the USB driver for the BLE dongle was faulty. We reached out to the company that makes them, Bluegiga. Turns out they just use a stock Microsoft driver. So regardless of where the actual problem lay, it would be in Microsoft’s court. The problem with having Microsoft as your solution is that Microsoft is a big company. They have much bigger issues on their plate, things more interesting than rounding out rarely-visited corners in support for some old communication standard. So though we did reach out to Microsoft, given the timings involved, we were more or less resigned to fixing this in software. We had to sweep a hole in the ground under a carpet, but still, there was something about the ecodaemon that made it encounter that hole. Bluegiga suggested we investigate our hardware setup, which came from a good heuristic. Fixing a BSOD by identifying bad hardware is like cleaning a dirty scab; fixing a BSOD in software is like tattooing over a scab. But Bluegiga’s response at least confirmed that it wasn’t a well-known problem. There was something about the ecodaemon I could identify.
I tried using some Windows debug tools. I got 2 extra laptops that were between users. I created a program that was as simple as start, wait, BSOD. You can have a derivative of it if you want, here. Eventually I got pretty much the best thing I could get: a live BSOD on one laptop, and a debugger hooked up over the network on the other laptop, ready to inspect whatever I wanted. But the crash happened in Microsoft’s driver (usually; something like a tenth of the time it happened in some other kernel process). And it wasn’t clear what caused it. I wondered what I had expected.
I realized I didn’t have tools to debug anything. The Windows debug tools pointed to problems I couldn’t solve and the usual debugger wouldn’t turn anything up because there was no bug in the software, because software doesn’t cause BSODs. End users do not expect programs to crash, and software does not expect Windows to BSOD.
But something about the ecodaemon aggravated a BSOD, and I had to find it. So I did a binary search on what I did have control over: my own code. There was a huge gap between the old sans-BSOD not-even-an-ecodaemon-yet and the new BSOD-ridden ecodaemon, because we’d ignored it for so long. In retrospect, I feel privileged to have been blessed with the old sans-BSOD not-even-an-ecodaemon-yet so that I could do a binary search. I started crowdsourcing computer time because it took so long to do a test, especially if it was a configuration that didn’t cause a BSOD — how could you be sure? I got 2 more laptops that weren’t allocated yet. Slowly my passing and failing programs converged.
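The search itself can be sketched in a few lines of Python. This is an illustration of the technique, not the procedure I actually scripted: `first_bad`, `is_bad`, and the `crashes` callback are hypothetical names, and the sketch assumes the oldest revision is good, the newest is bad, and everything after the offending change stays bad. The wrinkle is the one described above: because the crash is intermittent, one crash proves a revision bad, but only several clean runs in a row count as a pass.

```python
from typing import Callable, Sequence


def is_bad(revision, crashes: Callable[[object], bool], trials: int) -> bool:
    """A single crash marks the revision bad; a pass needs `trials` clean runs."""
    return any(crashes(revision) for _ in range(trials))


def first_bad(revisions: Sequence, crashes: Callable[[object], bool], trials: int = 5):
    """Binary search for the first revision that crashes.

    Assumes revisions[0] is known good and revisions[-1] is known bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid], crashes, trials):
            hi = mid  # the offending change is at mid or earlier
        else:
            lo = mid  # mid looks clean, so the change came later
    return revisions[hi]
```

The `trials` count is the knob that makes passing configurations so expensive. Too few trials and a flaky crash gets misread as a pass, which sends the search off in the wrong direction.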
I started getting worried — could I really do a binary search on my code? What if it’s only in the optimized machine code that the aggravating pattern happens? Then I’d be at the mercy of Microsoft’s compiler. It was like trying to find a word in a dictionary but not being sure if the dictionary was alphabetical past, say, the 4th letter of a word.
I actually thought I got to this point. I had a change I could apply to the old code that made it crash. I (thought I) applied a reverse change to the new code, but it still crashed. I explained to my boss that we’re boned. I explained the binary search failure to him. He reasoned that if we could make a simple modification to Bluegiga’s example program and make it BSOD, we could leverage their help in finding a solution. Renewed with the hope this plan provided, I instead found that I had made a mistake in my reverse change to the new code.
Turns out that if you give small but nonzero timeouts to the SetCommTimeouts function, you get a BSOD. Seems like there’s a race condition in the driver or surrounding code.
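For the curious, here is one way a guard against that could look, sketched in Python with ctypes. Treat it as a sketch under assumptions: the `COMMTIMEOUTS` fields and the `SetCommTimeouts` call are the real Win32 ones, but `MIN_SAFE_TIMEOUT_MS`, `avoid_small_timeouts`, and `apply_timeouts` are made-up names, and the 100 ms floor is an arbitrary placeholder, not a threshold anyone measured.

```python
import ctypes
import sys

MIN_SAFE_TIMEOUT_MS = 100  # hypothetical floor, chosen only for illustration


class COMMTIMEOUTS(ctypes.Structure):
    """Mirror of the Win32 COMMTIMEOUTS struct (five DWORD fields)."""

    _fields_ = [
        ("ReadIntervalTimeout", ctypes.c_uint32),
        ("ReadTotalTimeoutMultiplier", ctypes.c_uint32),
        ("ReadTotalTimeoutConstant", ctypes.c_uint32),
        ("WriteTotalTimeoutMultiplier", ctypes.c_uint32),
        ("WriteTotalTimeoutConstant", ctypes.c_uint32),
    ]


def avoid_small_timeouts(timeouts, floor_ms=MIN_SAFE_TIMEOUT_MS):
    """Return a copy where every small-but-nonzero field is raised to floor_ms.

    Zero is left alone, since it is not in the problematic range.
    """
    safe = COMMTIMEOUTS()
    for name, _ctype in COMMTIMEOUTS._fields_:
        value = getattr(timeouts, name)
        if 0 < value < floor_ms:
            value = floor_ms
        setattr(safe, name, value)
    return safe


def apply_timeouts(handle, timeouts):
    """Call SetCommTimeouts with the guarded values (Windows only)."""
    if sys.platform != "win32":
        raise OSError("SetCommTimeouts is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.SetCommTimeouts.argtypes = [ctypes.c_void_p, ctypes.POINTER(COMMTIMEOUTS)]
    kernel32.SetCommTimeouts.restype = ctypes.c_int
    safe = avoid_small_timeouts(timeouts)
    if not kernel32.SetCommTimeouts(handle, ctypes.byref(safe)):
        raise ctypes.WinError(ctypes.get_last_error())
```

Note that raising a multiplier field changes the total timeout per byte, so a real fix would pick values deliberately rather than clamp blindly.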