The Hardest Bug I've Ever Fixed: Best of ShopTalk

Dave relates the story of one of the hardest bugs he's ever had to fix in this set of highlights from the 10-30-24 episode of ShopTalk on Dave's Attic:

Comments

I had a friend who was a customer engineer working on IBM equipment, and he got called out to a machine which had a vertically mounted 8" floppy drive as the boot drive. The report said that it wouldn't boot and was making a horrible grinding noise.

Anyway, he inserted a new 8" diskette, turned the machine on, there was a horrible grinding noise and the machine didn't boot. He pulled out the diskette and found it was badly scratched.

Tried again, same result. So he took the machine apart.

In the drive he found cigarette ends, aluminium foil and other debris.

Turned out the cleaning lady thought it was a flip top bin and was emptying the trash into it.

leightaylor

It's immensely rewarding chasing bugs to their root cause and fixing them.

kwazar

Back in about 1983 I had one of those transformative debugging experiences almost exactly as you described. I was working on an adventure game in Z-80 assembly on my TRS-80. Everything had been going just fine for a few days. Then, I added an item to a table and ... boonies. As I recall, a full on crash was not very eventful. A couple of clicks on the floppy disk and the TRS-80 prompt appeared on screen. I feel silly explaining it now as it seems so obvious, but at the time I just couldn't see it. Stared that mother down for a day and a half. Turns out that my 'items' table, (for which I had allocated what seemed like lots of memory), overwrote an adjacent pointer table by a couple of bytes. I've told this story many times, suggesting that until you've stared at a monitor for 13 hours straight trying to figure out how the simplest of changes brutally crashes the computer, you don't know what it's like to be a programmer.
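
The overflow described above can be sketched in a few lines. This is only an illustration, not the original Z-80 code: the memory layout, the 6-byte item size, and every name here (`store_item_unchecked`, `first_pointer`, and so on) are assumptions made up for the example. A `bytearray` stands in for flat memory, with a pointer table placed directly after the item table.

```python
# Hypothetical layout: 16 bytes reserved for items, pointer table right after.
ITEM_SIZE = 6
ITEMS_BASE = 0
ITEMS_LIMIT = 16      # first byte that no longer belongs to the item table
POINTERS_BASE = 16


def new_memory() -> bytearray:
    """Return 32 bytes of 'RAM' with one 16-bit pointer (0x4000) in the pointer table."""
    mem = bytearray(32)
    mem[POINTERS_BASE:POINTERS_BASE + 2] = (0x4000).to_bytes(2, "little")
    return mem


def store_item_unchecked(mem: bytearray, index: int, item: bytes) -> None:
    """Write an item with no bounds check, as assembly code typically would."""
    start = ITEMS_BASE + index * ITEM_SIZE
    padded = item.ljust(ITEM_SIZE, b" ")[:ITEM_SIZE]
    mem[start:start + ITEM_SIZE] = padded


def store_item_checked(mem: bytearray, index: int, item: bytes) -> None:
    """Same write, but refuse anything that would spill past the item table."""
    start = ITEMS_BASE + index * ITEM_SIZE
    if start + ITEM_SIZE > ITEMS_LIMIT:
        raise IndexError(f"item {index} does not fit in the item table")
    store_item_unchecked(mem, index, item)


def first_pointer(mem: bytearray) -> int:
    """Read back the first entry of the pointer table."""
    return int.from_bytes(mem[POINTERS_BASE:POINTERS_BASE + 2], "little")


mem = new_memory()
store_item_unchecked(mem, 0, b"SWORD")
store_item_unchecked(mem, 1, b"KEY")
print(hex(first_pointer(mem)))   # pointer still intact after two items
store_item_unchecked(mem, 2, b"LAMP!!")
print(hex(first_pointer(mem)))   # third item spilled two bytes into the pointer
```

In this sketch the third item occupies bytes 12 to 17, so its last two bytes land on the pointer, which is the "one more table entry and everything crashes" pattern.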

YoutubeBorkedMyOldHandle_why

I think the hardest bug I had to track down was when HP wrote a new implementation of AppleTalk as part of a new network card. I had a customized version of CAP (Columbia AppleTalk Package) which worked absolutely fine with this new implementation, except that occasionally a network connection would just hang up *forever.* After months of debugging, I finally figured out that the problem was in the error handling of a dropped packet. And what's worse, it came down to subtle and ambiguous wording in the official specification of AppleTalk.
The bug which took me the *longest* to track down was in a simple client for a chat system. It just gets lines from the chat server and writes them out while letting the person at the terminal keep typing in their own messages. Every once in a while my client would print out an extra blank line. It took me at least 18 years to track down that bug!

garanceadrosehn

This is really a blast from the past. I remember installing a hard drive on a floppy only system for one of my lab mates. I chose RLL versus MFM and the people we were buying the drive from kept questioning me. She said they didn't sell that many. I told her she should be happy that I was reducing her slow moving inventory.

robertjune

I was one of the coders on a PlayStation 2 game. It was near completion, but the game was unstable and would crash. All the coders were pointing fingers at each other, and none of them were debugging the issue. It was obvious there was memory corruption, but the PS2 tools at the time had neither memory protection nor boundary checking. I spent days making the C/C++ code compile and run on a Windows PC using MS DevStudio, and I enabled full heap and stack checking using the MS compiler options. Leaving the code to loop through the title screen, into the game levels, and back out again quickly showed that an old bit of code that handled a general-purpose linked list of objects was to blame. It was not correctly handling unlinking, which led to incorrect memory blocks being accessed and modified. In the end I made sure to point out which coder was responsible and made sure they knew exactly how much work and risk they had caused.
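
The comment does not say exactly how the unlinking went wrong, so the following is only a sketch of what a careful unlink has to cover in a doubly linked list. The class and method names (`Node`, `LinkedList`, `unlink`) are hypothetical, and Python is used for illustration even though the original code was C/C++. The points to get right are both neighbours, the head and tail references, and clearing the removed node's own links so a stale reference cannot walk back into the list.

```python
class Node:
    """One list element holding a value and links to its neighbours."""

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


class LinkedList:
    """Minimal doubly linked list with explicit head and tail references."""

    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, value):
        """Add a value at the end and return its node."""
        node = Node(value)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node
        return node

    def unlink(self, node):
        """Detach a node, fixing up neighbours, head, and tail."""
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.head = node.next      # node was the head
        if node.next is not None:
            node.next.prev = node.prev
        else:
            self.tail = node.prev      # node was the tail
        # Clear the removed node's links so stale references cannot
        # be followed back into the live list.
        node.prev = None
        node.next = None

    def values(self):
        """Return the stored values from head to tail."""
        out = []
        current = self.head
        while current is not None:
            out.append(current.value)
            current = current.next
        return out


objects = LinkedList()
a = objects.append("a")
b = objects.append("b")
c = objects.append("c")
objects.unlink(b)
print(objects.values())
```

In C or C++ the same omissions (forgetting the head or tail case, or leaving dangling `prev`/`next` pointers) would show up as writes to freed or unrelated memory, which fits the kind of corruption the heap checker caught.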

MartinPiper

@Dave's Garage: Thanks Mr. Dave. I love your content, and always enjoy the thrill of catching up on new episodes. Thanks again. Ahh, you take me down time's road, making me reminisce about my childhood growing up on an Atari, Apple IIe, Apple Mac, 6x86 Cyrix; IBM PC-DOS; IBM 486DX66Mhz MS-DOS then Windows 3.1 then Red Hat Linux.

StarOfDavidKush

Look up, look way up! I will never forget that show.

bobmurphy

The hardest bug I've addressed was in a smartphone: on a certain carrier, at certain times, everyone's phone would seem to run out of battery at around the same time. It was near the end of the day (actually around 5-6pm, on Fridays, in certain European countries), so people would just assume it was normal end-of-day drain. It turns out there's a cellular protocol that, under certain circumstances, puts the phone into certain modes to save power. Without this feature, most smartphones would be out of power within 15-30 minutes. What happened was a race condition between the different cores, the modem cores and the application cores. In mobile, both systems think they are the master and the other is secondary, ... Besides being impossible to replicate, the modem core and all of its code are heavily protected from anyone outside the modem team. Out of the entire world, it was just me and one HTC engineer who both had a binary dump of the system, and we went through it step by step to identify the bits. The modem team, which had visibility into the code, then mapped the code to those bits to determine the race condition.

The second hardest bug I've addressed was when I was a BIOS engineer and we'd have systems die randomly. This was back in the days when we first made all-in-one PC laptops. Instead of using the $300+ Intel mobile CPU, we decided to use the $100 desktop CPU. We had tens of thousands of computers in the factory running burn-in tests, and they would just fail at random. Turns out that under certain conditions, the fans wouldn't perform and would melt... yes, you read that right, melt...

jasonchen-alienroid

I like this highly edited Q&A session format - you covered a lot of ground quickly, even though I’m sure the conversation was an hour plus long IRL.

Maybe this type of content could be a 2nd channel: Dave’s Attic, as you said at the end 😉

markmuir

The simulator for the Space Shuttle was known as the "SMS" (Shuttle Mission Simulator). Back in the days before the Shuttle's first launch the simulator had to be created to train the astronauts. This SMS was based on a Sperry UNIVAC 1100 mainframe with several Perkin Elmer 3250 minicomputers performing aux functions such as payload simulations (how do I still remember this crap?). The simulator was a "motion-base" simulator, meaning it would move to simulate the astronauts' real world experience.

The bug in question was a very noticeable mechanical oscillation when the simulator would execute the "roll maneuver" after launch. The bug persisted for months, to the extent that, for the first time, NASA put out a bounty on it.

The bug was never found, and eventually launch day arrived. The Shuttle launched perfectly and, when executing the roll maneuver, exhibited exactly the same oscillation. The problem was not that there was a bug in the software, but that the simulation was too accurate.

And now you know ... the ... rest ... of the story!

dukeofearl

I had a cheap Atari ST rollerball mouse that had to be cleaned often when it stuttered or stopped. One weekend, it worked left/right but not up/down. After 3 tries at cleaning, I bought a new one, and an hour later I was back up and running. Next weekend, same thing, but I knew it wasn't dirty yet. It turned out that at that specific time of day, the sun through the window hit the side of the mouse and affected the rollerball's timing wheel. Can you imagine a user with that problem calling a help desk and being told to close the blinds?

ForbinKid

Dave. I can't help but tear up every time I see the Friendly Giant ending ... and I'm really starting to enjoy these episodes. Well done sir! Do you remember why you had latex gloves on?

michaelangellotti

The hardest bugs are always the ones that are hardest to reproduce.

If I can reproduce it in 10 seconds by running a testcase, I'll find out very fast. If it takes a couple of hours to reproduce, and only crashes half of the time... Well, good luck finding that one.

My favorite one was a user who claimed the system changed her inputted numbers, sometimes. It was a warehouse application, with numbered boxes. Most worked automatically, but they could manually fill a box and register it in the interface.

Safe to say, putting in the wrong box number caused quite some issues down the line.

I couldn't reproduce it at all, not in a test environment and not even in the live environment. So in the end, I drove a couple of hours to that warehouse in the hope I would see something by just observing what she did.

I watched her type in the number "063", and I thought I saw it change in a flash when she hit the save button.

It all came together: it was an interface bug, not a bug in one of the communication protocols.

I had never tried entering numbers with leading zeroes, but apparently the interface framework we used had a bug and treated them as octal numbers.

Finding that bug was such a strange feeling. Happy that I finally found it, but mad that I spent a couple of days in total to find a bug in a library.
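
The leading-zero behaviour is easy to reproduce in miniature. The framework in the story is not named, so this is just a sketch with made-up function names (`parse_box_number_buggy`, `parse_box_number`) that mimics a parser which takes a leading zero as an octal prefix, next to one that always reads the input as decimal.

```python
def parse_box_number_buggy(text: str) -> int:
    """Mimic a parser that treats a leading zero as an octal prefix (assumed behaviour)."""
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text, 10)


def parse_box_number(text: str) -> int:
    """Always read the input as a decimal number, leading zeroes or not."""
    return int(text, 10)


print(parse_box_number_buggy("063"))  # 51: octal 63 is 6*8 + 3
print(parse_box_number("063"))        # 63
```

Box numbers without a leading zero come out the same from both functions, which would explain why the problem only showed up "sometimes".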

sanderd

I liked the BASIC on the C128, since you could merge existing programs and then run RENUMBER to tighten up the layout. I wrote a lot of subroutines, including primitive printer drivers that I could combine for a majority of programming projects.

michaelterrell

The hardest bugs I've had to fix usually came down to hardware bugs in the ASIC that had to be worked around somehow. There is one bug fix I've always refused to sign off on, because it was all a lie by the sales team. It required an extra capacitor on the circuit, and there was no amount of code we could write that would fix that one. But somehow, that's how it was sold to the customer.

jeremiefaucher-goulet

The story of the byte reminded me of many arguments I had in the early years. At university I was taught that a BYTE = 8 bits, but a WORD is whatever the size of the registers in the CPU you're using is. So, if the registers in your CPU are 8 bits, then a WORD is 8 bits; if the registers are 16 bits, then a WORD is 16 bits, etc. It makes sense, because you can imagine it as the amount your CPU can say in one go, a WORD. The problem was that at the time 16-bit CPUs, or 16-bit computers, were very popular, and therefore a lot of people were taught or just learnt themselves that a WORD is 16 bits. Even now when you do a casual Google search you will probably be presented with WORD = 16 bits (or 2 bytes), but when you dig a bit deeper you find it's not true, or at least it's only true on 16-bit CPUs 🙂

farab

Oh, I used GFA Basic! Loved it. It was a great implementation for the Atari. I made a library of widgets so I could have standardized UI elements. 15 years later, I took that same approach to ASP web development, with libraries for database connections, UI elements, user input sanitization and a state machine template. 15 years later, I created a GTK framework for making Linux GUI apps in Python. So, I owe GFA Basic a lot of thanks. Oh, and I learned Basic on a PDP-11/70. So, you're really poppin' off for me the last month or so!

thelanavishnuorchestra

I was slinging copper tube in my attic when I was installing my heat pump AC. It was 53C up there! :O
So I know what you felt like about the "they'll find my dead body" thing! I maxed out at about 10 minutes before I had to come down to rehydrate.

DrFiero

I was involved with a bug that had an entire group stumped. A report was not printing properly. The program was quite simple, and had been in use for months. Wrong printer control ribbon.

petervanderwaart