My Hardest Bug Ever (2013)

kabdib · on June 19, 2015

Ah yes:

- An OS would write the "please wake me up on the next interrupt" to a hardware register, then a shadow RAM location. Interrupt happened between the two writes, so the task scheduler wrote the wrong value to the register and stopped. Finding the problem: Two weeks. Fixing the problem: Swapped two instructions and rebuilt. Write-only registers are evil badness.

- USB controller with a temperature problem: It would only work first thing in the morning, and after lunch. After maybe four days of correlating this the "Aha!" was visible on my face at lunch and I left the table rather abruptly. The initial fix was a bag of ice on the chip. Real fix took several weeks of increasingly loud conversations with the chip vendor.

- Cache on a disk drive had a bad bit, occasionally corrupting file contents or (even more fun) directory structures. Memo: All the ECC in the world won't help you if your I/O system is flappin' in the breeze. End-to-end checksum anything you truly care about (in fact, we caught this one with a Merkle tree).
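For the end-to-end checksum point, here is a minimal sketch of the idea in Python. It is not the code from that project: the block size, the use of SHA-256, the odd-level duplication rule, and the helper names `merkle_root` and `flip_bit` are all assumptions made for illustration. The root is computed before the data goes to the drive and again after it comes back, so a single flipped bit anywhere in between changes the root.

```python
import hashlib


def merkle_root(blocks):
    """Return the Merkle root (32 bytes) over a list of byte blocks."""
    if not blocks:
        return hashlib.sha256(b"").digest()
    # Leaf level: one hash per data block.
    level = [hashlib.sha256(block).digest() for block in blocks]
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: duplicate the last one (an assumed convention).
            level.append(level[-1])
        # Parent level: hash each adjacent pair of child hashes.
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def flip_bit(block, bit_index):
    """Simulate a bad cache bit by flipping one bit in a block."""
    data = bytearray(block)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


# Eight hypothetical 512-byte sectors.
blocks = [bytes([i]) * 512 for i in range(8)]
root_written = merkle_root(blocks)

# The "drive" hands back sector 5 with one bit flipped.
corrupted = list(blocks)
corrupted[5] = flip_bit(corrupted[5], 1234)
root_read = merkle_root(corrupted)

print("corruption detected:", root_read != root_written)
```

Keeping the intermediate levels around (instead of only the root) would also let you narrow down which block went bad by comparing hashes from the top down.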

Never mind the heap corruptions by code you don't even have source for, the async callbacks that fire long after you've nuked the objects and assumed everything was dead, the DMAs that came from outer space because you set them up hours ago and they finally fired, wiping out something random every time, the incoherent cache coherency systems, the serial ports that drop bytes if you send too fast . . . it's amazing anything sufficiently complex actually works.

tonyarkles · on June 19, 2015

I think I would love your job. That's the kind of stuff I love chasing down.

jovdg · on June 19, 2015

Agreed. Also, I usually blame the hardware quite fast... Have seen enough faulty gbics, or wrongly plugged cables. But then that's what you see as a sysadmin, not as a (game) developer...

digi_owl · on June 19, 2015

Yep. If there was one thing a networking (the wires and cards kind) teacher drilled into me, it was to check wiring first and then work my way up the stack.

soft_dev_person · on June 19, 2015

Thank you for being. Both of you.

digi_owl · on June 19, 2015

War stories from the server rack?

kabdib · on June 19, 2015

No, these were some embedded systems (mostly in consumer products) that I've worked on over the years.

I've been doing server racks for a couple of years, kind of as a hobby. Don't get me started about servers in racks :-)

13of40 · on June 18, 2015

About a thousand years ago, I bought a new 286 motherboard from the local computer shop, and when I brought it home, DOS would boot on it, things like edlin and debug worked, but anything like a game (Lemmings!) would hang. I tore my hair out over this for a while until I noticed another symptom -- when I ran the 'date' command, the system time never changed. It turned out the clock on the motherboard was faulty, so it never fired the interrupt to tell the computer time was passing, so anything that depended on that interrupt for poor-man's threading would hang. But DOS booted up and ran like a champ.

acomjean · on June 18, 2015

Speaking of time..

I inherited a bug that only showed up the last 5 minutes of an hour. Except I didn't know that. The first time I worked on it, I reproduced the error but before I got the debugger started the time went by and the new hour caused the bug to go away...(why does this bug not show when the debugger is on?)

Eventually by hand tracing I figured it out.. Some weird time function was rounding the minutes to the nearest 10 minutes. When the time crept past 55 minutes, it was rounded up to 60 and passed to a time function that didn't like it...
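A rough reconstruction of that kind of bug in Python, purely as an illustration: the original code and language are not given, so the function names and the round-half-up rule here are assumptions. Rounding to the nearest 10 turns minutes 55 through 59 into 60, which is not a valid minute value, and `datetime.time` rejects it with a `ValueError`.

```python
import datetime


def round_minutes(minute):
    """Round a minute value to the nearest 10 (55..59 become 60)."""
    return ((minute + 5) // 10) * 10


def rounded_time_buggy(hour, minute):
    # Bug: the rounded value can be 60, which is out of range for a minute
    # field, so this raises ValueError during the last 5 minutes of any hour.
    return datetime.time(hour, round_minutes(minute))


def rounded_time_fixed(hour, minute):
    # One possible fix: add the rounded minutes as a duration so that 60
    # carries into the hour (and wraps past midnight) instead of overflowing.
    base = datetime.datetime(2000, 1, 1, hour, 0)  # arbitrary placeholder date
    return (base + datetime.timedelta(minutes=round_minutes(minute))).time()


print(rounded_time_buggy(10, 54))
print(rounded_time_fixed(10, 57))
try:
    rounded_time_buggy(10, 57)
except ValueError as err:
    print("buggy version failed:", err)
```

It also shows why the debugger never caught it: by the time you attach, the clock may have rolled over to minute 0 and the same call succeeds.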

Only bug that was harder was a serial cable with flipped wires. Which kinda worked, but gave out garbage. We figured that out with a scope.

JoshTriplett · on June 18, 2015

This kind of hardware quirk is really common in old consoles. In the NES and SNES era, not only were these kinds of issues passed around by tribal knowledge, but half of them were repurposed as features and used to push the hardware further. Not random save corruption issues, obviously, but many other hardware quirks became semi-documented "features" of the console, and years later, emulators would have to reproduce them faithfully or games wouldn't run correctly.

Liru · on June 18, 2015

Got any examples of this? It seems interesting.

JoshTriplett · on June 18, 2015

Super Mario Bros 3 did diagonal scrolling, which was not an intended feature of the hardware; it also used some careful timing to split the screen between the play area and the status area on the bottom.

Many console games reprogrammed or switched palettes partway through scanout to get more colors on the screen than normally possible; some did so with careful timing in the middle of horizontal scanlines. Emulators commonly don't implement this, because it adds significant overhead to the scanout fast-path. See the screenshot of Air Strike Patrol in http://arstechnica.com/gaming/2011/08/accuracy-takes-power-o... , where failing to emulate intra-scanline changes results in the plane's shadow disappearing, which is a critical gameplay component that makes it easier to aim.

mkagenius · on June 18, 2015

This could be related:

While playing games using 8 bit NES cards powered by a stabiliser; if you increase and decrease the voltage alternately very fast it would give you 30 lives in Super Contra.

Taniwha · on June 18, 2015

I chased a similar bug a long time ago. We were putting Unix on a late '80s era platform (which I won't name), and some systems' keyboards failed if you used the floppy, but not others.

Turns out the floppy code sat in a tight loop polling the clock in a timer chip whenever it waited for sectors to pass by. When the hardware did this, the clock to the keyboard controller changed (got faster). Turns out some keyboard chips had old firmware (they swore they didn't) that couldn't tolerate the faster clock.

A couple of no-ops in the floppy loop fixed it.

As someone pointed out above, metastability issues are the real impossible bug, but the hardware guys (I wear both hats) should have got that right in the first place.

DrScump · on June 18, 2015

This reminds me of when I was a staff Consultant at an RDBMS company 25 years ago.

A major telecom company was getting crashes in the server because of alleged I/O errors (mostly writes) in the raw i/o version of the server (no filesystem)... just a few times per several hundred thousand (or million) transactions. Subsequent disk and controller diagnostics were always fine. (This was before any fault tolerance was programmed into the server... my code for this problem was the first ever.) Most failures were writes, a few were reads. No obvious pattern (time of day, load, transaction type, lunar phase, sunspots, strength of coffee, etc.) This was on a DEC with Unix System V and DEC disk hardware, so only one vendor to deal with.

Anyway, seeing nothing "wrong" with the related code (other than no tolerance for error codes from the raw i/o calls), I theorized that maybe the "errors" were spurious and no actual fault or corruption was resulting, so I wrapped the i/o calls in retry loops (with the number of retry attempts tunable by the user) and logged any "failures" and results of the retry attempts. So, I did a build with my changes, had the customer run from my directory, then wait and watch for the carnage...

Turns out that every retry was successful. In fact, all but one of the dozen or two per day was successful on first try, and none needed more than two. No actual flaw in data was ever found.
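The retry wrapper described above can be sketched roughly like this in Python. This is not the original server code (which would have wrapped raw I/O calls in a very different environment); `with_retries`, `max_retries`, and the `FlakyDevice` stand-in are hypothetical names made up for the example. The idea is just: catch the error, log it, retry up to a tunable limit, and log whether the retry worked.

```python
import logging

log = logging.getLogger("raw_io")


def with_retries(io_call, max_retries=3):
    """Run io_call(); on OSError, log it and retry up to max_retries times."""
    attempt = 0
    while True:
        try:
            result = io_call()
        except OSError as err:
            if attempt >= max_retries:
                log.error("giving up after %d retries: %s", attempt, err)
                raise
            attempt += 1
            log.warning("I/O error (%s); retry %d of %d", err, attempt, max_retries)
            continue
        if attempt:
            log.info("succeeded after %d retries", attempt)
        return result


class FlakyDevice:
    """Hypothetical stand-in that fails the first `failures` writes."""

    def __init__(self, failures):
        self.failures = failures
        self.data = []

    def write(self, payload):
        if self.failures > 0:
            self.failures -= 1
            raise OSError("spurious I/O error")
        self.data.append(payload)
        return len(payload)


dev = FlakyDevice(failures=1)
written = with_retries(lambda: dev.write(b"abc"), max_retries=2)
print("bytes written:", written)
```

The logging matters as much as the retry: it is what tells you afterwards whether the errors were spurious or whether retries were papering over real corruption.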

Anyway, it turned out to be some spurious error specific to a specific drive type with that specific controller running that specific firmware... and apparently their filesystems code knew to work around it.

Client site personnel were really nice to me, too.

bargl · on June 18, 2015

This was an awesome article. It's been up here before. For relevant discussion check https://news.ycombinator.com/item?id=6654905

vmorgulis · on June 18, 2015

It's a little like a row hammer bug (https://en.wikipedia.org/wiki/Row_hammer).

castratikron · on June 18, 2015

Quantum mechanics? Just sounds like plain old induction to me.

danso · on June 18, 2015

The OP was originally published by Baggett on Quora, and then subsequently re-published on Gamasutra. The Quora posting has a few footnotes, including him admitting that "quantum mechanics" was mostly a flourish. What he meant was, that unlike other software bugs, "the behavior was -- at least at the level of the source code -- non-deterministic"

http://www.quora.com/Whats-the-hardest-bug-youve-debugged

------

Footnotes for posterity:

A few people have pointed out that this bug really wasn't a product of quantum mechanical effects, any more than any other bug is. Of course I was being hyperbolic mentioning quantum mechanics. But this bug did feel different to me, in that the behavior was -- at least at the level of the source code -- non-deterministic.

Some people have said I should have taken more electronics classes. That is absolutely true; I consider myself a "full stack" programmer, but my stack really only goes down to hand-writing assembly code, not to playing with transistors. Perhaps some day I will learn more about the "bare metal"...

Finally, a few have questioned whether a better development methodology would have prevented this kind of bug in the first place. I don't think so, but it's possible. I use test-driven development for some coding tasks these days, but it's doubtful we could have usefully applied these techniques given the constraints of the systems and tools we were using.

vardump · on June 18, 2015

Or an issue with clock divider/multiplier, voltage drops, metastability issues or really anything. Including quantum mechanics related things, they do happen with current chips. We can only speculate.

userbinator · on June 19, 2015

Reminds me of this 30-year-old hardware bug, related to metastability: https://news.ycombinator.com/item?id=5314959

(The article's location has changed, it's now at http://www.pouet.net/prod.php?which=61024#c637759 )

More information at: http://www.linusakesson.net/scene/safevsp/index.php

GuiA · on June 18, 2015

Debugged a (somewhat similar) problem a few months back on a USB device that would only happen when it was plugged in on non grounded computers (e.g. laptops running on battery). Working in hardware is fun.

davesque · on June 19, 2015

Yeah, I remember this being posted before. This reminds me of another article which I'm pretty sure was posted on HN as well about a single bit being randomly flipped on someone's `expr` binary as it was loaded in memory. The author, sort of jokingly, suggested that the bit flip was caused by a cosmic ray:

https://blogs.oracle.com/ksplice/entry/attack_of_the_cosmic_...

rbritton · on June 19, 2015

I used to work at a hotel that ran a Nomadix gateway appliance to get around guests having static IPs configured from their office environments. Without it we were sure to get a call about them being unable to connect.

One year the hotel purchased another across the street and connected it via fiber. As our network infrastructure was somewhat old we used a converter on each end to take a standard patch cable from each switch and transmit the data stream over the fiber. Everything worked as expected except for one key part: computers attempting to access the guest network would never receive a DHCP-assigned IP address. Static IP addresses worked just fine.

After quite a bit of packet sniffing and digging into specs I found the problem: the packet size of the initial DHCP responses from the Nomadix gateway was smaller than the minimum packet size of the converters. The fix was to switch to a different model converter that operated at layer 2 instead of layer 3 and thus didn't have the packet size minimum.

bwy · on June 18, 2015

This has been reposted a few times now. Past discussion: https://news.ycombinator.com/item?id=6654905

digi_owl · on June 19, 2015

Got me thinking about a Bill Herd video about working on the C116.

https://www.youtube.com/watch?v=xPD5N43VIsk

Specifically the part where he talks about sorting out the joystick port.

More hardware than software but still...

amelius · on June 18, 2015

Lesson: make sure you can always (re)run your code in a fully deterministic environment.

Tloewald · on June 19, 2015

Such as?

windowshopping · on June 18, 2015

That is cool as hell.