Dev Deletes Entire Production Database, Chaos Ensues

If you're tasked with deleting a database, make sure you delete the right one.

Sources:

Notes:
1:05 - The middle bullet point about the account that had 47,000 IPs was never mentioned in the postmortem (there was an initial report the day of, and a more detailed postmortem a bit over a week later). Perhaps it was a red herring that they later figured out didn't really matter.
3:07 - I made the error say "too many open connections" since that's easier to understand than semaphores.
3:39 - This part was confusing, since the postmortem and the initial report conflicted. The postmortem said the engineers believed pg_basebackup was failing because previous attempts had left files in the data directory, but the initial report said the theory was that it failed because the data directory existed (despite being empty). Either way, the engineers clearly wanted the data directory gone before retrying, though the exact reasoning is unclear (see the example command after these notes).
4:37 - They probably didn't check the backups in this exact order. I'm sure team-member-1 immediately called out that he had taken a backup 6 hours earlier, and then they just had to verify the other backups in case there was a better one.
6:21 - Being reported by a troll does not automatically remove a user; it flags the account for manual review. The account was then incorrectly deleted after that review.
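
For context, pg_basebackup refuses to write into a target directory that already contains files, which lines up with the postmortem's theory. A rough, hypothetical sketch of the kind of command involved (hostname, user, and path are my own placeholders, not taken from the incident):

    # Re-seeding a replica from the primary; every name and path here is an assumption
    pg_basebackup -h db1.example.com -U replicator -D /var/opt/gitlab/postgresql/data -X stream -v -P
    # If earlier attempts left files behind, it aborts with something like:
    #   pg_basebackup: directory "/var/opt/gitlab/postgresql/data" exists but is not empty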

Chapters:
0:00 Seconds before disaster
0:16 Part 1: Database issues
2:21 Part 2: The rm -rf moment
4:32 Part 3: Restore from backup
6:13 Part 4: Post incident discoveries
7:27 Lessons learned
9:46 The fate of team-member-1
10:11 ???

Music:
- Finding the Balance by Kevin MacLeod
- Eyes Gone Wrong by Kevin MacLeod
- Desert City by Kevin MacLeod
- Jane Street by TrackTribe
Comments:

Damn I cannot even imagine the stress that admin was feeling after he realised he deleted DB1. He must have aged twenty years.

VestigialHead

The fact they live streamed while trying to restore the data is a truly epic move.

Chris_Cross

I think the biggest problem (seemingly addressed at 6:21) is the fact they could delete an employee account by spam reporting it.

Webmage

This is why problems like this are actually sometimes good. Of course extremely stressful, but they found sooo many issues and fixed them all. Amazing.

Nickab

"You think it's expensive to hire a professional? Wait till you hire an amateur" - some old wise businessman.

Misanthrope

So you're telling me a platform as big as GitLab went down because one engineer picked the wrong SSH session?
Damn that makes me feel way better about my mistakes lol

SIMULATAN

This reminds me of Toy Story 2, and how like a month before release the entire film was accidentally deleted, causing absolute panic and hell at Pixar. Luckily, one employee had the whole thing on a hard drive that she was taking home to work on. Her initials are on one of the number plates of one of the cars in the film.

Always make a backup.

Edit: She was a project manager who had to work from home, and the numberplate was actually "Rm Rf" in reference to the notorious line of code that did it.

dragonfire

When I was still a junior developer at some startup company, I was working on a specific PHP online store. Every time we would upgrade the site, we would first do it on Staging, then copy it over to Production. The whole process was kind of annoying, as there was no streamlined upgrade flow yet and no documentation anywhere - it was a relatively new project we had taken over. I had upgraded it before, so I knew what to do, and I just did the thing I always did.

I was close to finishing up, and we had an office meeting coming up soon with lunch afterwards, so I wanted to be done with this before that - so I rushed a bit. And when I was copying files to Production, I overlooked something: I had also copied the staging config file (which contained database access info etc.) to the production location, overwriting the production config file.

After the copying finished, thinking I was finally done, I relaxed and prepared myself for the meeting. As I was closing everything, I also tried refreshing the production site, just to see if it worked. And then I realized... Articles weren't appearing, images weren't loading, errors everywhere. Initially I didn't believe this was production at all - probably just localhost or something, RIGHT?? But after re-refreshing it and confirming I had actually broken production, panic set in.

Instead of informing anyone, I quietly moved closer to my computer and started looking at what was wrong - with 100% focus; I don't think I was ever as focused as I was then. I didn't have time to inform anyone, it would only cause unnecessary delays. I had to restore the site ASAP.

I remember sweating... the meeting was starting and colleagues were asking me if I was coming - and I just blurted "ye ye, just checking some things..." completely "calmly" while I was PANICKING to fix the site as soon as possible. Luckily I found the source of the mistake within a minute; I just had to find a backup config file, and after restoring it, everything was fixed. Followed by a huge sigh of relief. The site must have been down for only around 2 minutes.

No one actually noticed what I had done - I just joined the meeting as if nothing had happened - and even though I was sweating and breathing quickly to calm myself down, I hid it pretty well.

This was a long time ago, and to this day I still remember that panic very well. Now I always make sure I have quick recovery options available in case something goes wrong - and, where possible, I automate the upgrade process to minimize human error.

CryShana

Something my first boss taught me (when I broke something big in production in my first few weeks) is that post mortems are there to identify problems in a system and how to prevent them, not to assign blame to individuals.

This is huge. Making sure to identify why it was even possible for something like this to happen and how to prevent it in the future is a great way to handle a post mortem like this. Good on the GitLab team.

maxcohn

Given the trouble they were in after the deletion, a recovery time of 24h and a recovery point of 6h is actually pretty heroic. Especially considering the stress they would have been under. 😰

rosscads

Ugh, felt that "he slammed CTRL+C harder than he ever had before" (3:55). The only thing worse than deleting your own data is deleting everyone else's. In this case the poor guy kinda did both. Great story arc.

JeffThePoustman

There is an awful lot that could be learned from this.

1) You should "soft delete", i.e. use mv to rename the data (e.g. MyData to something like MyData_old or MyData_backup) or just mv it out of the way so you can restore it later if needed (see the sketch after this list). Don't just rm -rf it from orbit.

2) Script all your changes. Everything you need to do should be wrapped in a peer-reviewed script, and you just run the script, so that the pre-agreed actions are all that gets done. Don't go off-piste, and don't just SSH into prod boxes and start flinging arbitrary commands around.

3) Change Control - as above

4) If you have Server A and Server B, you should NOT have both shell sessions open on the same machine. Either use a separate machine entirely or - better still - get a buddy to log onto Server A from their end and you get on Server B from yours. Total separation

5) Don't ever just su into root. Use sudo, or some kind of carefully managed solution such as CyberArk, to get root creds when needed.
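
A minimal sketch of the "soft delete" idea from point 1, in plain shell, with a made-up data directory path (not the actual commands from the incident):

    # Move the directory aside instead of deleting it; the path is an assumption
    DATA_DIR=/var/opt/gitlab/postgresql/data
    sudo mv "$DATA_DIR" "${DATA_DIR}_old_$(date +%Y%m%d_%H%M%S)"
    # Only rm -rf the *_old_* copy once the new replica is confirmed healthy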

mxbx

_Luckily team 1 took a snapshot 6 hours before..._
This happened to me. I copied a client's database to my development environment about 2 hours before they accidentally wiped it.
They called our company explaining what happened and it got around that I had a copy. Our company looked like a hero that day, and I got a bunch of credit for good luck.

jarrod

The rule I apply for backups is that no one should connect to both a backup server and a primary at the same time; two people should be working together. The employee who was logged on to both DBs should really have been two physically separated employees.

ludoviclagouardette

One of my most stressful moments as a software designer was when I accidentally broke a test environment right before a meeting with our client; I managed to get the project running on a 2nd test environment, but that really taught me the importance of backups and of telling the rest of the staff about a problem ASAP.

Dairunt

The best practice is to rename the directory or file to something else. Idk how developers stay so calm when using deletion commands.

TheDrTrouble

Imagine for a moment that you're that guy. That feeling of pure dread and the adrenaline rush immediately after the realization of what you've just done. We've all felt it at some point.

gosnooky

A helpful hack is to set the production terminal to red and the test terminal to blue, or something like that. Just a small helper to avoid human f’ups if you need to run manual commands on sensitive systems.
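
A minimal bash version of that trick, assuming production hostnames contain "prod" (the naming scheme is a placeholder):

    # In ~/.bashrc on each server: red prompt banner on prod, blue elsewhere
    if [[ "$(hostname)" == *prod* ]]; then
        PS1='\[\e[1;41m\][PROD]\[\e[0m\] \u@\h:\w\$ '   # red background
    else
        PS1='\[\e[1;44m\][TEST]\[\e[0m\] \u@\h:\w\$ '   # blue background
    fi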

randomgeocacher

I'm glad they didn't fire the engineer. It goes to show the difference in mindset at organizations that care about it being a learning experience (albeit an expensive one). Many corporations would have fired the engineer without hesitation as soon as the issue was resolved. Thanks to the orgs that care about their team members and are more concerned with the lessons learned.

robbybankston

Ultimate workplace comeback: "At least I've never nuked the entire database"

TmccreightGaming