GitLab Database Incident Report (docs.google.com)
152 points by sbuttgereit 1 hour ago | 96 comments





This is painful to read. It's easy to say that they should have tested their backups better, and so on, but there is another lesson here, one that's far more important and easily missed.

When doing something really critical (such as playing with the master database late at night), ALWAYS work with a checklist. Write down WHAT you are going to do, and if possible, talk to a coworker about it so you can vocalize the steps. If there is no coworker, talk to your rubber ducky or stapler on your desk. This will help you catch mistakes. Then, when the entire plan looks sensible, go through the steps one by one. Don't deviate from the plan. Don't get distracted and start switching between terminal windows. While making the checklist, ask yourself whether each step is A) absolutely necessary and B) at risk of making things worse. Even when the angry emails are piling up, you can't allow that pressure to cloud your judgment.

Every startup has moments when last-minute panic-patching of a critical part of the server infrastructure is needed, but if you use a checklist you're not likely to mess up badly, even when tired.


Yes, a good tip from "Turn the Ship Around" by David Marquet is to use the "I intend to" model. For every critical action you are going to undertake, first announce your intentions and give others enough time to react before following through.

Yup, it's never the fault of a person, always of the system. Once we get this resolved we'll definitely look at ways to prevent anything like it in the future.

BBC's Horizon has a really good episode about checklists, how they're used to prevent mistakes in hospitals, and how they're being adopted in other environments in light of that success. It's called "How to Avoid Mistakes in Surgery", for those interested.

Here's the book on this topic:

The Checklist Manifesto https://smile.amazon.com/Checklist-Manifesto-How-Things-Righ...


And a logbook

23:00-ish

YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.


Also, as a safety net, sometimes you don't need to run `rm -rf` (a command which should always be prefaced with 5 minutes of contemplation on a production system). In this case, `rmdir` would have been much safer, as it errors on non-empty directories.

Or use `mv x x.bak` when `rmdir` fails
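As a rough illustration of both approaches (the directory names here are made up):

  mkdir -p data && touch data/keep.me
  rmdir data          # refuses with "Directory not empty" instead of silently destroying it
  mv data data.bak    # non-destructive fallback; delete data.bak later, once you're sure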

These days, I've been very deliberate in how I run rm, to the extent that I don't do rm -rf or rmdir immediately, but in separate steps, something like:

  pushd dir
  find . -type f -ls | less
  find . -type f -exec rm '{}' \;
  popd
  rm -rf dir
It takes a lot longer to do, but I've seen and made enough mistakes over the years that the forced extra time spent feels necessary. It's worked pretty well so far -- knock knock.

Good advice.

I would add a step where you dump the output of find (after filtering) into a text file, so you have a record of exactly what you deleted. Especially when deleting files recursively based on a regular expression, that extra step is very worthwhile.
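Something along these lines, for instance (the paths and the filter are just placeholders):

  # Keep a record of exactly what is about to be deleted, then delete only that.
  find . -type f -name '*.tmp' | tee /tmp/to-delete.txt | less
  xargs -d '\n' rm -- < /tmp/to-delete.txt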

It's also a good practice to rename instead of delete whenever possible. Rename first, and the next day when you're fresh walk through the list of files you've renamed and only then nuke them for good.
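A minimal version of that, with made-up names and dates:

  mv important_dir important_dir.trash-$(date +%F)   # reversible "delete"
  # ...come back the next day, confirm nothing broke, then:
  rm -rf important_dir.trash-2017-01-31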


Good lesson on making command prompts always tell you exactly which machine you're working on.

I like to color code my terminal. Production systems are always red. Dev are blue/green. Staging is yellow.

All of my non-production machines have emojis in PS1 somewhere. It sounds ridiculous, but I know that if I see a cheeseburger or a burrito I'm not about to completely mess everything up. Silly terminal = silly data that I can obliterate.
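Something like this in ~/.bashrc on the non-production boxes only (the emoji is obviously arbitrary):

  # Dev/staging only: the burrito means "safe to break things".
  PS1='🌯 \u@\h:\w\$ '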

I've been color-coding my PS1 for years, but this is seriously brilliant, thanks!

It seems Gitlab has noticed your comment.

Recovery item 3f currently says:

> Create issue to change terminal PS1 format/colours to make it clear whether you’re using production or staging (red production, yellow staging)


In this case it looks like it was confusion between two different replicated production databases, so this would not have helped.

I use iTerm2's "badging" to put a large text badge with the name of the system on the terminal, as part of my SSH-into-EC2-systems alias:

    # Set the iTerm2 badge to the (base64-encoded) first argument.
    i2-badge ()
    {
      printf "\e]1337;SetBadgeFormat=%s\a" "$(echo -n "$1" | base64)"
    }
It's not quite as good as having a separate terminal theme, but then I haven't been able to use that feature properly. :(

I do this too, but in this case both machines were production, so this alone would not have sufficed. The system-default prompts on the other hand are universally garbage.

Yep, good idea. The same thing has been suggested by team members http://imgur.com/a/TPt7O

It's a really good idea, and one of the improvements likely to be put in place as soon as possible. It's already listed in the document.

How do you go about colour coding your terminal?

I assume he color coded the prompt. You can use ANSI color escape codes in there to e.g. color your hostname.

Here's a generator for Bash: http://bashrcgenerator.com/; the prompt's format string is stored in the $PS1 variable.
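For example, a hand-rolled version (using red for production, matching the colour scheme above, which is my assumption rather than anything GitLab actually uses):

  # ~/.bashrc on a production host: hostname in bold red so it's impossible to miss.
  PS1='\u@\[\e[1;31m\]\h\[\e[0m\]:\w\$ '
  # Staging could use yellow (33) and dev green (32) instead of red (31).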


How exactly do you color code it?

This doesn't really help if there are multiple production databases. It could be sharded, replicated, multi-tenant, etc.

Why would it matter? In my last job we had user home directories synced via Puppet (I'm oversimplifying), which let any ops guy have the same set of shell and vim configuration settings on production machines too.

I daresay having the hostname as part of the prompt saves a lot of trouble.


Uh, no! Don't rely on the command prompt; there are hardcoded ones out there, and cloning scripts have duplicated them.

uname -n

Takes seconds.


Colour is more noticeable than words. I do both, though.

> Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.

In these situations, I always keep the following xkcd in mind: https://xkcd.com/349/


Also a good lesson for testing your availability and disaster recovery measures for effectiveness.

Far, far too many companies get production going and then, as far as their safety nets are concerned, just check that certain things "completed successfully" or didn't throw an overt alert.

Just because things seem to be working doesn't mean they are or that they are working in a way that is recoverable.


> 1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage.

> 2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.

> 3. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.

> 4. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers. The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost. The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented.

> 5. Our backups to S3 apparently don’t work either: the bucket is empty.

> So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.
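Item 3 is a particularly nasty silent-failure mode, and one you can guard against up front. A rough sanity check before dumping might look like this (connection details omitted; the database name and backup path are illustrative, not GitLab's actual setup):

  # Refuse to back up if the pg_dump client doesn't match the server's major version.
  server_ver=$(psql -At -c 'SHOW server_version;' | cut -d. -f1-2)
  client_ver=$(pg_dump --version | awk '{print $3}' | cut -d. -f1-2)
  if [ "$server_ver" != "$client_ver" ]; then
    echo "pg_dump $client_ver vs server $server_ver, refusing to back up" >&2
    exit 1
  fi
  pg_dump -Fc gitlabhq_production > /var/backups/gitlabhq_production.dump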

Sounds like it was only a matter of time before something like this happened. How could so many systems not be working without anyone noticing?


Seems like very basic mistakes were made, not during the event but long before. If you don't test restoring your backups, you don't have a backup. How does it go unnoticed for so long that the S3 backups don't work?

Yeah, the "You don't have backups unless you can restore them" strikes again.

Virtually the only way to lose data is to not have backups. We live in such fancy times that there's no reason to ever lose data that you care about.


Helpful hint: Have an employee who regularly accidentally deletes folders. I have a couple; it's why I know my backups work. :D

> How does it go unnoticed that S3 backups don't work for so long?

My uneducated guess (this one hit a friend of mine): expired/revoked AWS credentials combined with a backup script that doesn't exit(1) on failure and just writes the exception trace to stderr.
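Exactly that, and the fix is boring: make the wrapper fail loudly. A rough sketch, with made-up paths, bucket name, and size threshold (the alerting would be whatever you actually monitor):

  #!/usr/bin/env bash
  set -euo pipefail   # any failing command aborts the script with a non-zero exit
  trap 'echo "backup FAILED on $(hostname)" >&2' ERR

  dump=/var/backups/db-$(date +%F).dump
  pg_dump -Fc gitlabhq_production > "$dump"

  # Refuse to call a near-empty file a backup.
  [ "$(stat -c%s "$dump")" -gt 1000000 ] || { echo "dump suspiciously small: $dump" >&2; exit 1; }

  aws s3 cp "$dump" s3://example-backup-bucket/db/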


I applaud their forthrightness and hope that it's recoverable so that most of the disaster is averted.

To me the most illuminating lesson is that debugging 'weird' issues is enough of a minefield; doing it in production is fraught with even more peril. Perhaps we as users (or developers with our 'user' hat on) expect so much availability as to cause companies to prioritize it so high, but (casually, without really being on the hook for any business impact) I'd say availability is nice to have, while durability is mandatory. To me, an emergency outage would've been preferable to give the system time to catch up or recover, with the added bonus of also kicking off the offending user causing spurious load.

My other observation is that troubleshooting -- the entire workflow -- is inevitably pure garbage. We engineer systems to work well -- these days often with elaborate instrumentation to spin up containers of managed services and whatnot -- but once they no longer work well we have to dip down to the lowest adminable levels, tune obscure flags, restart processes to see if it's any better, muck about with temp files, and use shell commands that were designed 40 years ago for a different time. This is a terrible idea. I don't have an easy solution for the 'unknown unknowns', but the collective state of 'what to do if this application is fucking up' feels like it's in the stone ages compared to what we've accomplished for when things are actually working.


Engineering things to work well and the troubleshooting process at a low level are one and the same; it's just that in some cases other people found these bugs and issues before you and fixed them. But this is the cost of OSS: you get awesome stuff for free (as in beer) and are expected to be on the hook for helping with this process. If you don't like it, pay somebody.

Really, everyone could benefit from learning more about the systems they rely on, such as Linux and observability tools like vmstat. The fewer lucky guesses or cargo-culted solutions you rely on, the better.


Amazingly transparent and honest.

Unfortunately, this kind of situation, "only the ideal case ever worked at all", is not uncommon. I've seen it before: when doing things the right way, dotting the i's and crossing the t's, takes an experienced employee a good week or two, it's very tempting for a lean startup to bang out something that seems to work in a couple of days and move on.


For item 3h under recovery, consider:

  chattr +i /var/opt/gitlab/postgresql/data
Yes, it doesn't completely stop foot-guns, but it means you have to shoot twice [0].

[0]:

  chattr -i /whatever
  rm /whatever

Not sure if the doc here is refreshing or scary. But Godspeed GitLab team. I've loved the product for about two years now, so curious to see how this plays out.

It's both.

I very much appreciate their forthrightness and the way they conduct their company generally. Having said that, I have the code I work on, related content, and a number of clients on the service.

[edit for additional point]

They need the infrastructure guy they've been looking for sooner rather than later. I hope there's good progress on that front.


We've hired some great new people recently, but as you can see there is still a lot of work to do. https://about.gitlab.com/jobs/production-engineer/

Start at:

> At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.

This is why I'm not a fan of emergency pager duty.


"Our backups to S3 apparently don’t work either: the bucket is empty"

6/6 failed backup procedures. Looks like they are going to be hiring a new sysadmin/devops person...


The best system administrator is the one that has learned from their catastrophic fuck up.

To that effect, I still have the same job as I did before I ran "yum update" without knowing it attempts in-place kernel upgrades, which resulted in a corrupted Red Hat installation on a server we could not turn off.


There is learning from a catastrophic fuck up, and then there is incompetence. Backups are Day 1, SysAdmin 101. I can't quite grasp how so many different backup systems were left unchecked. Every morning I receive messages saying everything is fine, yet I still go into the backup systems to make sure they actually did run, in case there was an issue with the system alerting me.
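That manual morning check can itself be scripted as a second, independent alarm. Roughly (the path and thresholds are placeholders):

  # Independent check: the newest dump must exist, be recent, and be non-trivial in size.
  latest=$(ls -t /var/backups/*.dump 2>/dev/null | head -1)
  [ -n "$latest" ] || { echo "no backups found at all" >&2; exit 1; }
  age_h=$(( ( $(date +%s) - $(stat -c%Y "$latest") ) / 3600 ))
  size=$(stat -c%s "$latest")
  if [ "$age_h" -gt 26 ] || [ "$size" -lt 1000000 ]; then
    echo "latest backup $latest is ${age_h}h old and ${size} bytes, investigate" >&2
    exit 1
  fi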

It's good when companies are open and honest about problems.

I imagine they will have a great multi-level tested backup process in the next day or two!


It'll definitely be a priority now!

"So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."

Does this mean whatever was in that database is gone, with no available backups?

Is this an SOA where important data might lie in another service or data store, or is this a monolithic app and DB that is responsible for many (or all) things?

What was stored in that database? Does this affect user data? Code?


We have snapshots, but they're not very recent (see the document for more info). The most recent snapshot is roughly 6 hours old (relative to the data loss). The data loss only affects database data, Git repositories and Wikis still exist (though they are fairly useless without a corresponding project).

Best of luck with the recovery! I know this must be stressful. :(

The doc says that there is an LVM snapshot that is 6 hours old. <strike>And there should be a regular logical backup at most 24 hours old as well (they just can't find it for whatever reason).</strike> (Scratch that, my copy of the doc did not update, despite Google saying it should update automatically.)

Regarding what's gone: the production PostgreSQL database. This suggests that the code itself is fine, but the mappings to the users are gone. But git is a distributed VCS after all, so all the code should be on the developers' machines as well.


If you haven't tested your backups, you don't have backups.

So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.

Ouch. That is so harsh. Sorry to hear about the incident. Testing one's backups can be a pain, but it is so very important.
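The only test that really counts is an actual restore. Even something as crude as the following, run on a schedule against a scratch database, would have caught most of those failures (the dump path, database, and table names are illustrative):

  # Restore last night's dump into a throwaway database and run a sanity query.
  createdb restore_test
  pg_restore -d restore_test /var/backups/gitlabhq_production.dump
  psql -At -d restore_test -c 'SELECT count(*) FROM projects;'   # expect a sane, non-zero count
  dropdb restore_test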


Hang in there GitLab, it sucks now but mistakes like this can happen to anyone. The important thing is how you deal with it.

Are 'YP' the initials of an employee or is this an acronym I don't know?

Yes, those are the initials of an employee here. Sorry for the confusion!

As much as I appreciate GitLab's extreme openness, that's maybe something that by policy shouldn't be part of published reports. Internal process is one thing, but if something goes really bad, customers might not be so good at "blameless postmortems" if they have a name to blame.

That is why we went with initials. And I hope customers understand the blame is with all of us, starting with me. Not with the person in the arena. https://twitter.com/sytses/status/826598260831842308

It seems to me, as a customer, that it shifts blame away from the company and onto a particular individual. Blameless post-mortems are great, but when speaking to people outside the company I think it is important to own it collectively: "after a second or two we notice we ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com." I believe this isn't your intention, but that is how I interpreted it.


Is your username a Spin reference?

Haha, it wasn't intentional. I'm just a space nerd. That book ranks pretty highly on my list of things every space nerd should read though.

I quite enjoyed it! Also, +1 for space. And Greek.

I think it's a staff member. Can't remember first name, Yuri maybe, who is fairly active with the project.

Nope, that would be me.

Tough night dude. I'll buy you a drink or three if you're ever in Sydney...

Sorry for the rough night Yorick. This could happen to all of us but of course it happens to the person that is working the hardest. <3

I'm going to add to your message box unnecessarily, but I want to say I love GitLab and it's a shining example of a transparent company. I still have ambitions to work there someday, and this event is hopefully a net gain in the end, in that everyone here and there learns about backups.

Alas, poor Yorick!

Thanks for the transparency. Doesn't always feel good to have missteps aired in public, but it makes us all a little better as a community to be clear about where mistakes can be made.

I'll pour one out for you next time I go out.

Unlucky mate. Even monkeys fall out of trees. Good luck with the fixing.

From looking at the context of the way YP is referenced (link to a slack archive), I believe YP is an employee.

Thank you for the transparency. This is a good read and I'm going to be sharing it with coworkers tomorrow. :)

I once rm-ed my home directory when I was writing and testing a script, but it turned out that things like .m2 and .ivy2 are huge, and they were among the first to be deleted by 'rm -rf'. So they kind of gave me some buffer time to figure out that something was wrong.

This is the stuff my nightmares consist of after 900 consecutive days of being on call (and counting).

Are you a one man team or...? My wife would probably leave me if I was on-call for that long.

I am :/ ... currently maintaining service for 500 high-volume businesses 24x7x365 in 9 timezones. Luckily the product and infrastructure is pretty stable and problems occur maybe once a quarter.

But the constant nagging in the back of your head that shit can go wrong at any second is draining and has been the biggest stressor in my life for a long time now.

My S.O. still gets mildly upset when I pack up the laptop on our way out to a fancy dinner, or disappear with my laptop when visiting her parents, but the fact that our life goals are aligned is the saving grace of all these situations. We both know what we want out of the next 5 years of our lives and are willing to sacrifice to achieve this goal (long term financial security).


I hope you are being SERIOUSLY compensated.

Cash salary today is well under market for my skill set. But, I do own 1/3 of the company so it's not all bad :).

With all of that clout, why aren't you hiring?

Backups have sucked since 8.15 on our instances of GLE, because someone decided to add a "readable" date stamp in addition to the Unix timestamp in the backup file name without proper testing, which caused many issues. It was somewhat fixed, but I still see issues in 8.16.

I'm not complaining, but backup/restore is an important part, deserving 100% test coverage and daily backup/restore runs.


> Our backups to S3 apparently don’t work either: the bucket is empty

followed by

> So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.

is no way to be running a public service with customer data. Did the person who set up that S3 job simply write a script or something and just go "yep, it's done" and walk away? Seriously?


Apparently the following insane interviewing process wasn't enough to find someone competent enough to cover the basics.

https://about.gitlab.com/jobs/production-engineer/

-------------------

Applicants for this position can expect the hiring process to follow the order below. Please keep in mind that applicants can be declined from the position at any stage of the process. To learn more about someone who may be conducting the interview, find her/his job title on our team page.

Qualified applicants receive a short questionnaire and coding exercise from our Global Recruiters

The review process for this role can take a little longer than usual but if in doubt, check in with the Global recruiter at any point.

Selected candidates will be invited to schedule a 45min screening call with our Global Recruiters

Next, candidates will be invited to schedule a first 45 minute behavioral interview with the Infrastructure Lead

Candidates will then be invited to schedule a 45 minute technical interview with a Production Engineer

Candidates will be invited to schedule a third interview with our VP of Engineering

Finally, candidates will have a 50 minute interview with our CEO

Successful candidates will subsequently be made an offer via email


   > Did the person who set up that S3 job simply write a 
   > script or something and just go "yep, it's done" and 
   > walk away?
I don't know of course, but one thing that has to be explicitly in place is continual monitoring that the existing backup process is still working. We had a backup process at Blekko which stopped working once when an S3 credential that appeared unrelated was removed; as I recall, a Nagios test detected that the next set of backups were too small, and we got that fixed.
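Right, and that monitoring has to live outside the thing it is monitoring. A crude standalone check against the bucket (the bucket name and thresholds are invented) is often enough:

  # Cron job on a separate host: alert unless today's backup object exists and is non-tiny.
  latest=$(aws s3 ls s3://example-backup-bucket/db/ --recursive | sort | tail -1)
  echo "$latest" \
    | awk -v today="$(date +%F)" 'NF < 4 || $1 != today || $3 < 1000000 { bad = 1 } END { exit bad }' \
    || echo "S3 backups look stale, tiny, or missing entirely" >&2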

To be fair, the script is probably the same as what is in their repos; it probably just wasn't set to report to a logging service that would raise red flags.

> is no way to be running a public service with paying customers

Is it even possible to pay for the hosted GitLab.com instance?


GitLab.com is free, not paid. https://about.gitlab.com/products/

The only paid versions are self-hosted EE and independent cloud hosting.


Very well; my apologies. I've updated my comment.

If you haven't done so recently, TEST YOUR BACKUPS.

I noticed the issue when I was pushing code earlier today. Hopefully this gets resolved soon. You guys are doing a great job. Keep up the good work!

Thanks, not feeling great about the job we're doing today, but we'll learn from this.

And we're sorry for the inconvenience this caused to your workflow today!

I deployed my code before the backup issue. So, no worries.

Wow. This is a joke. Gitlab is a joke. Just moved my stuff to BitBucket.


