How GitHub's Database Self-Destructed in 43 Seconds

A brief maintenance accident takes a turn for the worse as GitHub's database automatically fails over and breaks the website.

Sources:

Chapters:
0:00 Part 1: Intro
1:25 Part 2: GitHub's database explained
3:40 Part 3: The 43 seconds
5:04 Part 4: Fail back or not?
6:54 Part 5: Recovery process
10:32 Part 6: Aftermath

Notes:

Music:
- Hitman by Kevin MacLeod
- Blue Mood by Robert Munzinger
- Pixelland by Kevin MacLeod
- Dumb as a Box by Dan Lebowitz
Comments

"We can't delete user data, we aren't gitlab"
This video is a goldmine

sollybunn

It's bold to assume that
a) 50% of Github users are active on any given day
b) Their time is worth an average of $50/hr
c) Not syncing with remote for one day would affect the average user

kalebbruwer

These problems always occur during routine maintenance. That's why I don't do any maintenance whatsoever and my systems have never experienced downtime (although I've never checked)

RichieYT

The assumption that 50% of total github users are active is too optimistic

MaxwellHay

As a former bitbucket employee I can confirm we have disaster recovery plans for a lunar data center outage

Justin-jmfd

Interplanetary failovers are a struggle, not gonna lie.

axelboberg

Honestly I'm impressed that Bitbucket was able to lower the Earth-Mars latency down to 60 milliseconds.

ericlizama

What a goldmine of a channel. I'm here with you all, witnessing the birth of a great channel

riddixdan

This is one of those things where, in hindsight, it's so easy to see how they set themselves up for failure. But I bet you a lot of brilliant people looked at this and still did not see the issue until it (inevitably) blew up. It do be like that sometimes...

manzenshaaegis

11:55 I'd say getting 60ms of latency over a 10 light-minute distance is still pretty good

TuMadre

I worked at a website that handles millions of write transactions per day across like 7 global data centers. We were starting to think of a way to drop into a “read only” mode in the event something like this happened. Then we wouldn’t need to paw through the mess of uncommitted transactions…

CoryKing
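As an editorial aside on the read-only fallback idea in the comment above: one way to implement it is a process-wide flag that every write path checks before touching the database, so the site degrades to read-only instead of accepting writes whose fate is uncertain during a failover. The sketch below is a minimal illustration in Python; the flag object, the `create_gist` function, and the `gists` table are hypothetical and not taken from the video or from GitHub's actual stack.

```python
class ReadOnlyMode:
    """Process-wide switch that failover tooling (or an operator) can flip."""

    def __init__(self) -> None:
        self._enabled = False

    def enable(self) -> None:
        self._enabled = True

    def disable(self) -> None:
        self._enabled = False

    @property
    def enabled(self) -> bool:
        return self._enabled


READ_ONLY = ReadOnlyMode()


class ServiceReadOnlyError(RuntimeError):
    """Raised instead of writing while the datastore's state is uncertain."""


def create_gist(db, user_id: int, content: str) -> None:
    # Refuse the write up front rather than accepting work that might land on
    # a primary whose latest transactions never replicated anywhere else.
    if READ_ONLY.enabled:
        raise ServiceReadOnlyError("Writes are temporarily disabled during failover.")
    db.execute(
        "INSERT INTO gists (user_id, content) VALUES (?, ?)",
        (user_id, content),
    )
```

In practice the flag would live in a shared configuration store or be driven by health checks, so every application server degrades at the same time rather than one process at a time.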

One of the greatest "history" channels on YouTube, love the content.

dybdab

The ending was hilarious. Great video overall.

thebeber

imagine being github and being unable to... MERGE two databases

edhahaz

Considering the scope of the GitHub disaster, it seems to me that recovery within 30 hours is very impressive. I've had to engineer recoveries from much smaller disasters, and every one of them took me at least 48 hours if I remember correctly.

JohnAlbertRigali

When the east coast database recovered and started accepting writes again from applications, they dodged the very common bullet of those apps pushing work at the database as fast as they can and overwhelming it, causing a second wave of outage. In this case, it looks like the controls over the work rate (whether implicit in the nature and scale of the apps, or an explicit mechanism) were sufficient to prevent that.

ccthomas
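An explicit version of the work-rate control described in the comment above could be as simple as a token bucket in front of the write path, so backlogged work drains at a bounded rate instead of hammering the freshly recovered primary. The Python sketch below is illustrative only; the rate, the `gists` table, and the function names are assumptions, not details from the incident.

```python
import time


class TokenBucket:
    """Refill `rate_per_sec` tokens per second up to `burst`; callers wait when empty."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            time.sleep((1.0 - self.tokens) / self.rate)


def drain_backlog(db, backlog, limiter: TokenBucket) -> None:
    """Replay queued writes against a recovered database at a bounded rate."""
    for user_id, content in backlog:
        limiter.acquire()  # wait for capacity instead of writing as fast as possible
        db.execute(
            "INSERT INTO gists (user_id, content) VALUES (?, ?)",
            (user_id, content),
        )


# Example: drain at roughly 50 writes/sec with small bursts.
# drain_backlog(db, backlog, TokenBucket(rate_per_sec=50, burst=10))
```

When no shared limiter is available, exponential backoff with jitter on the retry path gives a similar smoothing effect from the client side.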

I love these videos. I work in IT, but for a much smaller national company, so it's really interesting to learn some lessons from this, plus the editing and storytelling make it very entertaining.

Hopgop

I love how in the last 30 sec, Kevin was not only able to explain how an interplanetary network would work but also how a random command would blow everything up in exactly 30 sec 😆

rajarshichattopadhyay

Thank you! This was perfect. I love this. And the amount of explosions is tasteful and not overdone

hchris

This video is full of explosions and memes but in a tempered manner and it hits all the nerves in my brain. I need more videos like this.

acoolnameemm