GitLab.com melts down after wrong directory deleted, backups fail
Upstart said it had outgrown the cloud – now five out of five restore tools have failed
Source-code hub GitLab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups.
On Tuesday evening, Pacific Time, the startup issued the sobering series of tweets listed below. Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had accidentally deleted a directory on the wrong server during a frustrating database replication process: he wiped a folder containing 300GB of live production data that was due to be replicated.
Just 4.5GB remained by the time he cancelled the deletion command. The last potentially viable backup was taken six hours beforehand.
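Before we get to those tweets, the obvious fix for the wrong-box half of this story is a guard that refuses to run anything destructive unless the machine is the one you think it is. A minimal sketch in Python, with a made-up hostname and no relation to GitLab's actual tooling:

# Minimal "wrong box" guard: refuse to run a destructive step unless the
# hostname matches the machine we intended to operate on.
# The hostname below is hypothetical, purely for illustration.
import socket
import sys

EXPECTED_HOST = "db-secondary"  # the box whose data directory is safe to wipe

def require_host(expected: str = EXPECTED_HOST) -> None:
    actual = socket.gethostname()
    if actual != expected:
        sys.exit(f"refusing to run: this is {actual!r}, not {expected!r}")

if __name__ == "__main__":
    require_host()
    # ...only now run the destructive cleanup step...

It isn't clever, but it is the difference between a bad evening and losing six hours of production data.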
We are performing emergency database maintenance, https://t.co/r11UmmDLDE will be taken offline
— GitLab.com Status (@gitlabstatus) January 31, 2017
we are experiencing issues with our production database and are working to recover
— GitLab.com Status (@gitlabstatus) February 1, 2017
We accidentally deleted production data and might have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8
— GitLab.com Status (@gitlabstatus) February 1, 2017
That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)."
So some solace there for users because not all is lost. But the document concludes with the following:
So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.
The world doesn't contain enough faces and palms to even begin to offer a reaction to that sentence. Or, perhaps, to summarise the mistakes the startup candidly details as follows:
- LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
- Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
- SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
- Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
- The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
- The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
- Our backups to S3 apparently don’t work either: the bucket is empty
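None of those failures is hard to catch before you actually need the backups. A rough sanity check along the lines below, run after every dump, would have flagged both the few-bytes output files and the silently wrong pg_dump binary. The paths, size threshold and version string are invented for illustration; this is a sketch, not GitLab's tooling.

# Post-backup sanity check (illustrative): fail loudly if the newest dump is
# suspiciously small or if the pg_dump on the PATH isn't the expected version.
import subprocess
from pathlib import Path

BACKUP_DIR = Path("/var/opt/backups/postgres")  # hypothetical dump location
MIN_SIZE_BYTES = 100 * 1024 * 1024              # anything smaller counts as a failed dump

def check_pg_dump_version(expected: str = "9.6") -> None:
    # pg_dump --version prints e.g. "pg_dump (PostgreSQL) 9.6.1"
    out = subprocess.run(["pg_dump", "--version"],
                         capture_output=True, text=True, check=True)
    if expected not in out.stdout:
        raise RuntimeError(f"wrong pg_dump on PATH: {out.stdout.strip()}")

def check_latest_dump() -> None:
    dumps = sorted(BACKUP_DIR.glob("*.sql.gz"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        raise RuntimeError("no dump files found at all")
    latest = dumps[-1]
    if latest.stat().st_size < MIN_SIZE_BYTES:
        raise RuntimeError(f"{latest} is only {latest.stat().st_size} bytes; "
                           "the dump almost certainly failed")

if __name__ == "__main__":
    check_pg_dump_version()
    check_latest_dump()
    print("backup sanity check passed")

Wire something like that into whatever pages a human, and an empty S3 bucket stops being something you only discover mid-restore.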
Making matters worse is the fact that GitLab last year decreed it had outgrown the cloud and would build and operate its own Ceph clusters. GitLab's infrastructure lead Pablo Carranza said the decision to roll its own infrastructure “will make GitLab more efficient, consistent, and reliable as we will have more ownership of the entire infrastructure.”
At the time of writing, GitLab says it has no estimated restore time but is working to restore from a staging server that may be “without webhooks” but is “the only available snapshot.” That snapshot is six hours old, so there will be some data loss.
Last year, GitLab, founded in 2014, scored US$20m of venture funding. Those investors may just be a little more ticked off than its users right now.
The Reg will update this story as more information comes to hand. The sysadmin who accidentally nuked the live data reckons "it’s best for him not to run anything with sudo any more today." ®