This is painful to read. It's easy to say that they should have tested their backups better, and so on, but there is another lesson here, one that's far more important and easily missed.
When doing something really critical (such as playing with the master database late at night) ALWAYS work with a checklist. Write down WHAT you are going to do, and if possible, talk to a coworker about it so you can vocalize the steps. If there is no coworker, talk to your rubber ducky or stapler on your desk. This will help you catch mistakes. Then when the entire plan looks sensible, go through the steps one by one. Don't deviate from the plan. Don't get distracted and start switching between terminal windows. While making the checklist, ask yourself whether what you're doing is A) absolutely necessary and B) at risk of making things worse. Even when the angry emails are piling up you can't allow that pressure to cloud your judgment.
Every startup has moments when last-minute panic-patching of a critical part of the server infrastructure is needed, but if you use a checklist you're not likely to mess up badly, even when tired.
Yes, a good tip from "Turn the Ship Around" by David Marquet is to use the "I intend to" model. For every critical action you are going to take, first announce your intention and give others enough time to react before following through.
Yup, it's never the fault of a person, always of the system. Once we get this resolved we'll definitely look at ways to prevent anything like it in the future.
BBC's Horizon has a really good episode about checklists and how they're used to prevent mistakes in hospitals, and how they're being adopted in other environments in light of that success. It's called "How To Avoid Mistakes In Surgery", for the interested.
> YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.
Also, as a safety net, sometimes you don't need to run `rm -rf` (a command which should always be prefaced with 5 minutes of contemplation on a production system). In this case, `rmdir` would have been much safer, as it errors on non-empty directories.
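A quick illustration of why (the path here is just an example):

    $ rmdir /path/to/data
    rmdir: failed to remove '/path/to/data': Directory not empty
    # ...whereas rm -rf would have happily emptied it without a word.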
These days, I've been very deliberate in how I run rm. To the extent that I don't run rm -rf or rmdir (edit: immediately), but separate commands, something like:
    pushd dir
    find . -type f -ls | less
    find . -type f -exec rm '{}' \;
    popd
    rm -rf dir
It takes a lot longer to do, but I've seen and made enough mistakes over the years that the forced extra time spent feels necessary. It's worked pretty well so far -- knock knock.
I would add a step where you dump the output of find (after filtering) into a text file, so you have a record of exactly what you deleted. Especially when deleting files recursively based on a regular expression, that extra step is very worthwhile.
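Something like this, as a rough sketch (the filter pattern and paths are placeholders):

    find . -type f -name '*.cache' > /tmp/to-delete.txt   # build the candidate list
    less /tmp/to-delete.txt                               # eyeball it before touching anything
    xargs -d '\n' rm -- < /tmp/to-delete.txt              # delete exactly what was reviewed
    cp /tmp/to-delete.txt ~/deleted-$(date +%F).txt       # keep the record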
It's also a good practice to rename instead of delete whenever possible. Rename first, and the next day when you're fresh walk through the list of files you've renamed and only then nuke them for good.
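In shell terms, a minimal sketch of that pattern (names are made up):

    mv old-reports old-reports.trash-$(date +%F)   # "delete" by renaming
    # ...come back tomorrow, confirm nothing misses it, then:
    rm -rf old-reports.trash-2017-02-01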
All of my non-production machines have emojis in PS1 somewhere. It sounds ridiculous, but I know that if I see a cheeseburger or a burrito I'm not about to completely mess everything up. Silly terminal = silly data that I can obliterate.
I do this too, but in this case both machines were production, so this alone would not have sufficed. The system-default prompts on the other hand are universally garbage.
Why would it matter? In my last job we had user home directories synced via puppet (I am overly simplifying this), which enabled any ops guy to have the same set of shell and vim configuration settings on production machines too.
I daresay having the hostname as part of the prompt saves a lot of trouble.
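For bash users, a sketch of both the emoji trick and the hostname-in-prompt idea (colours and emoji are obviously a matter of taste):

    # In ~/.bashrc on a non-production box: hostname plus a silly marker
    export PS1='🌮 \u@\h:\w\$ '
    # On production, keep it stern and unmistakable (bold red hostname)
    export PS1='\u@\[\e[1;31m\]\h\[\e[0m\]:\w\$ '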
Also a good lesson for testing your availability and disaster recovery measures for effectiveness.
Far, far too many companies get production going and then, as far as safety nets go, just check that certain things "completed successfully" or didn't throw an overt alert.
Just because things seem to be working doesn't mean they are or that they are working in a way that is recoverable.
>1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
>2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
>3. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
>4. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
> The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
> The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
>5. Our backups to S3 apparently don’t work either: the bucket is empty
>So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.
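On point 3 in particular, a small guard would have turned the silent failure into a loud one. A hedged sketch, not GitLab's actual backup code (connection details omitted, database name is a placeholder):

    #!/usr/bin/env bash
    # Refuse to dump if client and server major versions disagree; never fail silently.
    set -euo pipefail

    server_ver=$(psql -At -c "SHOW server_version;" | cut -d. -f1,2)
    client_ver=$(pg_dump --version | grep -oE '[0-9]+\.[0-9]+' | head -1)

    if [ "$server_ver" != "$client_ver" ]; then
        echo "pg_dump $client_ver does not match server $server_ver, aborting" >&2
        exit 1
    fi

    pg_dump -Fc mydb > "/backups/db-$(date +%F).dump"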
Sounds like it was only a matter of time before something like this happened. How could so many systems not be working without anyone noticing?
Seems like very basic mistakes were made, not at the event but long before. If you don't test restoring your backups, you don't have backups. How does it go unnoticed for so long that the S3 backups don't work?
> How does it go unnoticed that S3 backups don't work for so long?
My uneducated guess (this one hit a friend of mine): expired/revoked AWS credentials combined with a backup script that doesn't exit(1) on failure and just writes the exception trace to stderr.
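A hedged sketch of the opposite behaviour (bucket, paths and address are placeholders): abort with a non-zero exit and an alert on any failure, then verify the upload instead of trusting the exit code.

    #!/usr/bin/env bash
    set -euo pipefail
    trap 'echo "backup FAILED on $(hostname)" | mail -s "backup failure" ops@example.com' ERR

    key="db-$(date +%F).dump.gz"
    aws s3 cp /backups/db.dump.gz "s3://example-backup-bucket/$key"

    # Check the object actually landed and isn't a few bytes of nothing
    size=$(aws s3api head-object --bucket example-backup-bucket --key "$key" \
           --query ContentLength --output text)
    test "$size" -gt 1000000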
I applaud their forthrightness and hope that it's recoverable so that most of the disaster is averted.
To me the most illuminating lesson is that debugging 'weird' issues is enough of a minefield; doing it in production is fraught with even more peril. Perhaps we as users (or developers with our 'user' hat on) expect so much availability as to cause companies to prioritize it so high, but (casually, without really being on the hook for any business impact) I'd say availability is nice to have, while durability is mandatory. To me, an emergency outage would've been preferable to give the system time to catch up or recover, with the added bonus of also kicking off the offending user causing spurious load.
My other observation is that troubleshooting -- the entire workflow -- is inevitably pure garbage. We engineer systems to work well -- these days often with elaborate instrumentation, spinning up containers of managed services and whatnot -- but once they no longer work well we have to dip down to the lowest adminable levels, tune obscure flags, restart processes to see if it's any better, muck about with temp files, and use shell commands that were designed 40 years ago, in a different time. This is a terrible idea. I don't have an easy solution for the 'unknown unknowns', but the collective state of 'what to do if this application is fucking up' feels like it's in the stone ages compared to what we've accomplished on the side of when things are actually working.
Engineering things to work well and the troubleshooting process at a low level are one and the same. It's just that in some cases other people found these bugs and issues before you and fixed them. But this is the cost of OSS, you get awesome stuff for free (as in beer) and are expected to be on the hook for helping with this process. If you don't like it, pay somebody.
Really, everyone could benefit from learning more about the systems they rely on, such as Linux and observability tools like vmstat, etc. The fewer lucky guesses or cargo-culted solutions you rely on, the better.
Unfortunately, this kind of situation, "only the ideal case ever worked at all", is not uncommon. I've seen it before ... when doing things the right way, dotting 'I's and crossing 'T's, takes an experienced employee a good week or two, it's very tempting for a lean startup to bang out something that seems to work in a couple of days and move on.
Not sure if the doc here is refreshing or scary. But Godspeed, GitLab team. I've loved the product for about two years now, so I'm curious to see how this plays out.
I very much appreciate their forthrightness and the way they conduct their company generally. Having said that, I have the code I work on, related content, and a number of clients on the service.
[edit for additional point]
They need the infrastructure guy they've been looking for sooner rather than later. I hope there's good progress on that front.
> At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.
This is why I'm not a fan of emergency pager duty.
The best system administrator is the one that has learned from their catastrophic fuck up.
To that effect, I still have the same job as I did before I ran "yum update" without knowing it attempts to do in-place kernel upgrades, which resulted in a corrupted RedHat installation on a server we could not turn off.
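For what it's worth, a common guard against that particular surprise (behaviour varies between RHEL/CentOS versions, so treat this as a sketch):

    # Skip kernel packages for a routine update run
    yum update --exclude='kernel*'

    # Or make it permanent in /etc/yum.conf:
    #   exclude=kernel*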
There is learning from a catastrophic fuck up, and then there is incompetence. Backups are Day 1, SysAdmin 101. I can't quite grasp how so many different backup systems were left unchecked. Every morning I receive messages saying everything is fine, yet I still go into the backup systems to make sure they actually did run, in case there was an issue with the system alerting me.
Does this mean whatever was in that database is gone, with no available backups?
Is this an SOA where important data might lie in another service or data store, or is this a monolithic app and DB that is responsible for many (or all) things?
What was stored in that database? Does this affect user data? Code?
We have snapshots, but they're not very recent (see the document for more info). The most recent snapshot is roughly 6 hours old (relative to the data loss). The data loss only affects database data, Git repositories and Wikis still exist (though they are fairly useless without a corresponding project).
The doc says that there is an LVM snapshot that is 6 hours old. <strike>And there should be a regular logical backup at most 24 hours old as well (they just can't find it for whatever reason).</strike> (Scratch that, my doc did not update, despite Google saying it should update automatically.)
Regarding what's gone: the production PostgreSQL database. This suggests that the code itself is fine, but the mappings to the users are gone. But git is a distributed VCS after all, so all the code should be on the developers' machines as well.
As much as I appreciate GitLab's extreme openness, that's maybe something that by policy shouldn't be part of published reports. Internal process is one thing; if something goes really bad, customers might not be so good at "blameless postmortems" if they have a name to blame.
It seems to me that, as a customer, it is blame-shifting away from the company to a particular individual. Blameless post-mortems are great, but when speaking to people outside the company I think it is important to own it collectively: "after a second or two we notice we ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com." I believe this isn't your intention, but that is how I interpreted it.
I'm going to add to your message box unnecessarily, but I want to say I love GitLab and it's a shining example of a transparent company. I still have ambitions to work there someday, and this event is hopefully a net gain in the end, in that everyone here and there learns about backups.
Thanks for the transparency. Doesn't always feel good to have missteps aired in public, but it makes us all a little better as a community to be clear about where mistakes can be made.
I once rm-ed my home directory when I was writing and testing a script, but it turned out that stuff like .m2 and .ivy2 is huge, and those are the first things 'rm -rf' gets to by default. So they kind of gave me some buffer time to figure out that something was wrong.
I am :/ ... currently maintaining service for 500 high-volume businesses 24x7x365 across 9 timezones. Luckily the product and infrastructure are pretty stable and problems occur maybe once a quarter.
But the constant nagging in the back of your head that shit can go wrong at any second is draining and has been the biggest stressor in my life for a long time now.
My S.O. still gets mildly upset when I pack up the laptop on our way out to a fancy dinner, or disappear with my laptop when visiting her parents, but the fact that our life goals are aligned is the saving grace of all these situations. We both know what we want out of the next 5 years of our lives and are willing to sacrifice to achieve this goal (long term financial security).
Backups sucked starting in 8.15 on our instances of GLE, because someone decided to add a "readable" date stamp in addition to the unix timestamp in the backup file name without proper testing, which caused many issues. It was somewhat fixed, but I still see issues in 8.16.
I'm not complaining, but backup/restore is an important part, and it deserves 100% test coverage and daily backup/restore runs.
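A daily restore run doesn't have to be elaborate. Something along these lines would do, as a rough sketch (database and table names are placeholders, not anyone's actual setup):

    #!/usr/bin/env bash
    # Restore yesterday's dump into a scratch database and run one sanity query.
    set -euo pipefail

    createdb restore_test
    pg_restore -d restore_test "/backups/db-$(date -d yesterday +%F).dump"
    rows=$(psql -At -d restore_test -c "SELECT count(*) FROM projects;")
    test "$rows" -gt 0    # an empty table means the backup is junk
    dropdb restore_test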
> Our backups to S3 apparently don’t work either: the bucket is empty
followed by
> So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.
is no way to be running a public service with customer data. Did the person who set up that S3 job simply write a script or something and just go "yep, it's done" and walk away? Seriously?
Applicants for this position can expect the hiring process to follow the order below. Please keep in mind that applicants can be declined from the position at any stage of the process. To learn more about someone who may be conducting the interview, find her/his job title on our team page.
Qualified applicants receive a short questionnaire and coding exercise from our Global Recruiters
The review process for this role can take a little longer than usual but if in doubt, check in with the Global recruiter at any point.
Selected candidates will be invited to schedule a 45min screening call with our Global Recruiters
Next, candidates will be invited to schedule a first 45 minute behavioral interview with the Infrastructure Lead
Candidates will then be invited to schedule a 45 minute technical interview with a Production Engineer
Candidates will be invited to schedule a third interview with our VP of Engineering
Finally, candidates will have a 50 minute interview with our CEO
Successful candidates will subsequently be made an offer via email
> Did the person who set up that S3 job simply write a script or something and just go "yep, it's done" and walk away?
I don't know of course, but one failure mode that has to be explicitly tested for is continual monitoring that the existing backup process is still working. We had a backup process at Blekko that stopped working once when a seemingly unrelated S3 credential was removed; as I recall, it was a Nagios test that detected the next set of backups was too small and got it fixed.
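That kind of check is cheap to write. A hedged sketch of a Nagios-style plugin (thresholds and paths invented for illustration):

    #!/usr/bin/env bash
    # Alert if the newest backup is too old or suspiciously small.
    latest=$(ls -t /backups/*.dump.gz 2>/dev/null | head -1)
    if [ -z "$latest" ]; then
        echo "CRITICAL: no backups found in /backups"; exit 2
    fi

    age=$(( $(date +%s) - $(stat -c %Y "$latest") ))
    size=$(stat -c %s "$latest")

    if [ "$age" -gt 90000 ] || [ "$size" -lt 1000000 ]; then
        echo "CRITICAL: $latest is ${age}s old and ${size} bytes"; exit 2
    fi
    echo "OK: $latest (${size} bytes, ${age}s old)"; exit 0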