This is painful to read. It's easy to say that they should have tested their backups better, and so on, but there is another lesson here, one that's far more important and easily missed.
When doing something really critical (such as playing with the master database late at night) ALWAYS work with a checklist. Write down WHAT you are going to do, and if possible, talk to a coworker about it so you can vocalize the steps. If there is no coworker, talk to your rubber ducky or stapler on your desk. This will help you catch mistakes. Then when the entire plan looks sensible, go through the steps one by one. Don't deviate from the plan. Don't get distracted and start switching between terminal windows. While making the checklist, ask yourself whether what you're doing is A) absolutely necessary and B) at risk of making things worse. Even when the angry emails are piling up you can't allow that pressure to cloud your judgment.
Every startup has moments when last-minute panic-patching of a critical part of the server infrastructure is needed, but if you use a checklist you're not likely to mess up badly, even when tired.
Yes, a good tip from "Turn the Ship Around" by David Marquet is to use the "I intend to" model. For every critical action you are going to take, first announce your intention and give others enough time to react before following through.
Yup, it's never the fault of a person, always of the system. Once we get this resolved we'll definitely look at ways to prevent anything like it in the future.
BBC's Horizon has a really good episode about checklists and how they're used to prevent mistakes in hospitals, and how they're being adopted in other environments in light of that success. It's called "How To Avoid Mistakes In Surgery", for the interested.
> YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.
Also, as a safety net, sometimes you don't need to run `rm -rf` (a command which should always be prefaced with 5 minutes of contemplation on a production system). In this case, `rmdir` would have been much safer, as it errors on non-empty directories.
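A quick illustration of why (the path here is just an example):

    $ rmdir /path/to/data
    rmdir: failed to remove '/path/to/data': Directory not empty
    # ...whereas rm -rf would have happily emptied it without a word.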
These days, I've been very deliberate in how I run rm. To the extent that I don't run rm -rf or rmdir (edit: immediately), but separate commands, something like:
    pushd dir
    find . -type f -ls | less
    find . -type f -exec rm '{}' \;
    popd
    rm -rf dir
It takes a lot longer to do, but I've seen and made enough mistakes over the years that the forced extra time spent feels necessary. It's worked pretty well so far -- knock knock.
I would add a step where you dump the output of find (after filtering) into a text file, so you have a record of exactly what you deleted. Especially when deleting files recursively based on a regular expression, that extra step is very worthwhile.
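Something like this, as a rough sketch (the filter pattern and paths are placeholders):

    find . -type f -name '*.cache' > /tmp/to-delete.txt   # build the candidate list
    less /tmp/to-delete.txt                               # eyeball it before touching anything
    xargs -d '\n' rm -- < /tmp/to-delete.txt              # delete exactly what was reviewed
    cp /tmp/to-delete.txt ~/deleted-$(date +%F).txt       # keep the record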
It's also a good practice to rename instead of delete whenever possible. Rename first, and the next day when you're fresh walk through the list of files you've renamed and only then nuke them for good.
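In shell terms, a minimal sketch of that pattern (names are made up):

    mv old-reports old-reports.trash-$(date +%F)   # "delete" by renaming
    # ...come back tomorrow, confirm nothing misses it, then:
    rm -rf old-reports.trash-2017-02-01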
All of my non-production machines have emojis in PS1 somewhere. It sounds ridiculous, but I know that if I see a cheeseburger or a burrito I'm not about to completely mess everything up. Silly terminal = silly data that I can obliterate.
I do this too, but in this case both machines were production, so this alone would not have sufficed. The system-default prompts on the other hand are universally garbage.
Why would it matter? In my last job we had user home directories synced via puppet (I am overly simplifying this), which enabled any ops guy to have the same set of shell and vim configuration settings on production machines too.
I daresay having the hostname as part of the prompt saves a lot of trouble.
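For bash users, a sketch of both the emoji trick and the hostname-in-prompt idea (colours and emoji are obviously a matter of taste):

    # In ~/.bashrc on a non-production box: hostname plus a silly marker
    export PS1='🌮 \u@\h:\w\$ '
    # On production, keep it stern and unmistakable (bold red hostname)
    export PS1='\u@\[\e[1;31m\]\h\[\e[0m\]:\w\$ '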
Also a good lesson for testing your availability and disaster recovery measures for effectiveness.
Far, far too many companies get production going and then, as far as safety nets go, just check that certain things "completed successfully" or didn't throw an overt alert.
Just because things seem to be working doesn't mean they are or that they are working in a way that is recoverable.
>1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
>2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
>3. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
>4. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
> The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
> The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
>5. Our backups to S3 apparently don’t work either: the bucket is empty
>So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.
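On point 3 in particular, a small guard would have turned the silent failure into a loud one. A hedged sketch, not GitLab's actual backup code (connection details omitted, database name is a placeholder):

    #!/usr/bin/env bash
    # Refuse to dump if client and server major versions disagree; never fail silently.
    set -euo pipefail

    server_ver=$(psql -At -c "SHOW server_version;" | cut -d. -f1,2)
    client_ver=$(pg_dump --version | grep -oE '[0-9]+\.[0-9]+' | head -1)

    if [ "$server_ver" != "$client_ver" ]; then
        echo "pg_dump $client_ver does not match server $server_ver, aborting" >&2
        exit 1
    fi

    pg_dump -Fc mydb > "/backups/db-$(date +%F).dump"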
Sounds like it was only a matter of time before something like this happened. How could so many systems not be working without anyone noticing?
Seems like very basic mistakes were made, not at the event but long before. If you don't test restoring your backups, you don't have backups. How does it go unnoticed for so long that the S3 backups don't work?
> How does it go unnoticed that S3 backups don't work for so long?
My uneducated guess (this one hit a friend of mine): expired/revoked AWS credentials combined with a backup script that doesn't exit(1) on failure and just writes the exception trace to stderr.
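A hedged sketch of the opposite behaviour (bucket, paths and address are placeholders): abort with a non-zero exit and an alert on any failure, then verify the upload instead of trusting the exit code.

    #!/usr/bin/env bash
    set -euo pipefail
    trap 'echo "backup FAILED on $(hostname)" | mail -s "backup failure" ops@example.com' ERR

    key="db-$(date +%F).dump.gz"
    aws s3 cp /backups/db.dump.gz "s3://example-backup-bucket/$key"

    # Check the object actually landed and isn't a few bytes of nothing
    size=$(aws s3api head-object --bucket example-backup-bucket --key "$key" \
           --query ContentLength --output text)
    test "$size" -gt 1000000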
I applaud their forthrightness and hope that it's recoverable so that most of the disaster is averted.
To me the most illuminating lesson is that debugging 'weird' issues is enough of a minefield; doing it in production is fraught with even more peril. Perhaps we as users (or developers with our 'user' hat on) expect so much availability as to cause companies to prioritize it so high, but (casually, without really being on the hook for any business impact) I'd say availability is nice to have, while durability is mandatory. To me, an emergency outage would've been preferable to give the system time to catch up or recover, with the added bonus of also kicking off the offending user causing spurious load.
My other observation is that troubleshooting -- the entire workflow -- is inevitably pure garbage. We engineer systems to work well -- these days often with elaborate instrumentation, spinning up containers of managed services and whatnot -- but once they no longer work well we have to dip down to the lowest adminable levels, tune obscure flags, restart processes to see if it's any better, muck about with temp files, and use shell commands that were designed 40 years ago, in a different time. This is a terrible idea. I don't have an easy solution for the 'unknown unknowns', but the collective state of 'what to do if this application is fucking up' feels like it's in the stone ages compared to what we've accomplished on the side of when things are actually working.
Engineering things to work well and the troubleshooting process at a low level are one and the same. It's just that in some cases other people found these bugs and issues before you and fixed them. But this is the cost of OSS, you get awesome stuff for free (as in beer) and are expected to be on the hook for helping with this process. If you don't like it, pay somebody.
Really, everyone could benefit from learning more about the systems they rely on, such as Linux and observability tools like vmstat, etc. The fewer lucky guesses or cargo-culted solutions you rely on, the better.
Unfortunately, this kind of situation, "only the ideal case ever worked at all", is not uncommon. I've seen it before ... when doing things the right way, dotting 'I's and crossing 'T's, takes an experienced employee a good week or two, it's very tempting for a lean startup to bang out something that seems to work in a couple of days and move on.
Not sure if the doc here is refreshing or scary. But Godspeed, GitLab team. I've loved the product for about two years now, so I'm curious to see how this plays out.
I very much appreciate their forthrightness and the way they conduct their company generally. Having said that, I have the code I work on, related content, and a number of clients on the service.
[edit for additional point]
They need the infrastructure guy they've been looking for sooner rather than later. I hope there's good progress on that front.
> At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.
This is why I'm not a fan of emergency pager duty.
The best system administrator is the one that has learned from their catastrophic fuck up.
To that effect, I still have the same job as I did before I ran "yum update" without knowing it attempts to do in-place kernel upgrades, which resulted in a corrupted RedHat installation on a server we could not turn off.
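For what it's worth, a common guard against that particular surprise (behaviour varies between RHEL/CentOS versions, so treat this as a sketch):

    # Skip kernel packages for a routine update run
    yum update --exclude='kernel*'

    # Or make it permanent in /etc/yum.conf:
    #   exclude=kernel*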
There is learning from a catastrophic fuck up, and then there is incompetence. Backups are Day 1, SysAdmin 101. I can't quite grasp how so many different backup systems were left unchecked. Every morning I receive messages saying everything is fine, yet I still go into the backup systems to make sure they actually did run, in case there was an issue with the system alerting me.
Does this mean whatever was in that database is gone, with no available backups?
Is this an SOA where important data might lie in another service or data store, or is this a monolithic app and DB that is responsible for many (or all) things?
What was stored in that database? Does this affect user data? Code?
We have snapshots, but they're not very recent (see the document for more info). The most recent snapshot is roughly 6 hours old (relative to the data loss). The data loss only affects database data, Git repositories and Wikis still exist (though they are fairly useless without a corresponding project).
The doc says that there is an LVM snapshot that is 6 hours old. <strike>And there should be a regular logical backup at most 24 hours old as well (they just can't find it for whatever reason).</strike> (Scratch that, my doc did not update, despite Google saying it should update automatically.)
Regarding what's gone: the production PostgreSQL database. This suggests that the code itself is fine, but the mappings to the users are gone. But git is a distributed VCS after all, so all the code should be on the developers' machines as well.
As much as I appreciate GitLab's extreme openness, that's maybe something that by policy shouldn't be part of published reports. Internal process is one thing; if something goes really bad, customers might not be so good at "blameless postmortems" if they have a name to blame.
It seems to me that, as a customer, it is blame-shifting away from the company to a particular individual. Blameless post-mortems are great, but when speaking to people outside the company I think it is important to own it collectively: "after a second or two we notice we ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com." I believe this isn't your intention, but that is how I interpreted it.
I'm going to add to your message box unnecessarily, but I want to say I love GitLab and it's a shining example of a transparent company. I still have ambitions to work there someday, and this event is hopefully a net gain in the end, in that everyone here and there learns about backups.
Thanks for the transparency. Doesn't always feel good to have missteps aired in public, but it makes us all a little better as a community to be clear about where mistakes can be made.
I once rm-ed my home directory when I was writing and testing a script, but it turned out that stuff like .m2 and .ivy2 is huge, and those are the first things 'rm -rf' gets to by default. So they kind of gave me some buffer time to figure out that something was wrong.
I am :/ ... currently maintaining service for 500 high-volume businesses 24x7x365 across 9 timezones. Luckily the product and infrastructure are pretty stable and problems occur maybe once a quarter.
But the constant nagging in the back of your head that shit can go wrong at any second is draining and has been the biggest stressor in my life for a long time now.
My S.O. still gets mildly upset when I pack up the laptop on our way out to a fancy dinner, or disappear with my laptop when visiting her parents, but the fact that our life goals are aligned is the saving grace of all these situations. We both know what we want out of the next 5 years of our lives and are willing to sacrifice to achieve this goal (long term financial security).
Backups sucked starting in 8.15 on our instances of GLE, because someone decided to add a "readable" date stamp in addition to the unix timestamp in the backup file name without proper testing, which caused many issues. It was somewhat fixed, but I still see issues in 8.16.
I'm not complaining, but backup/restore is an important part, and it deserves 100% test coverage and daily backup/restore runs.
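A daily restore run doesn't have to be elaborate. Something along these lines would do, as a rough sketch (database and table names are placeholders, not anyone's actual setup):

    #!/usr/bin/env bash
    # Restore yesterday's dump into a scratch database and run one sanity query.
    set -euo pipefail

    createdb restore_test
    pg_restore -d restore_test "/backups/db-$(date -d yesterday +%F).dump"
    rows=$(psql -At -d restore_test -c "SELECT count(*) FROM projects;")
    test "$rows" -gt 0    # an empty table means the backup is junk
    dropdb restore_test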
> Our backups to S3 apparently don’t work either: the bucket is empty
followed by
> So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.
is no way to be running a public service with customer data. Did the person who set up that S3 job simply write a script or something and just go "yep, it's done" and walk away? Seriously?
Applicants for this position can expect the hiring process to follow the order below. Please keep in mind that applicants can be declined from the position at any stage of the process. To learn more about someone who may be conducting the interview, find her/his job title on our team page.
Qualified applicants receive a short questionnaire and coding exercise from our Global Recruiters
The review process for this role can take a little longer than usual but if in doubt, check in with the Global recruiter at any point.
Selected candidates will be invited to schedule a 45min screening call with our Global Recruiters
Next, candidates will be invited to schedule a first 45 minute behavioral interview with the Infrastructure Lead
Candidates will then be invited to schedule a 45 minute technical interview with a Production Engineer
Candidates will be invited to schedule a third interview with our VP of Engineering
Finally, candidates will have a 50 minute interview with our CEO
Successful candidates will subsequently be made an offer via email
> Did the person who set up that S3 job simply write a script or something and just go "yep, it's done" and walk away?
I don't know of course, but one failure mode that has to be explicitly tested for is continual monitoring that the existing backup process is still working. We had a backup process at Blekko that stopped working once when a seemingly unrelated S3 credential was removed; as I recall, it was a Nagios test that detected the next set of backups was too small and got it fixed.
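That kind of check is cheap to write. A hedged sketch of a Nagios-style plugin (thresholds and paths invented for illustration):

    #!/usr/bin/env bash
    # Alert if the newest backup is too old or suspiciously small.
    latest=$(ls -t /backups/*.dump.gz 2>/dev/null | head -1)
    if [ -z "$latest" ]; then
        echo "CRITICAL: no backups found in /backups"; exit 2
    fi

    age=$(( $(date +%s) - $(stat -c %Y "$latest") ))
    size=$(stat -c %s "$latest")

    if [ "$age" -gt 90000 ] || [ "$size" -lt 1000000 ]; then
        echo "CRITICAL: $latest is ${age}s old and ${size} bytes"; exit 2
    fi
    echo "OK: $latest (${size} bytes, ${age}s old)"; exit 0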