Katy Kelly
posted this on Mar 21, 4:11 AM
3/21/16
Gliffy Online is currently experiencing a system failure. We are working as quickly as possible to restore access to our system. We will update this message when we have further information and also at the top of https://www.gliffy.com/.
Thank you for your patience, and we apologize for the inconvenience this causes. No data has been lost in this outage.
We will update this post with an ETA and additional information as it becomes available.
3/21/16 8:30am PST
We are still working on the issue. Unfortunately, Support cannot access your diagrams; all updates and ETA information will be posted here. Thanks for your patience.
3/21/16 11:00am PST
We have located the issue and are actively working to resolve it. We appreciate everyone's continued patience and we will post a confirmed ETA once we have one.
3/21/16 1:00pm PST
We have further pinpointed the issue and are close to a resolution. We hope to be able to give an ETA in the next update.
3/21/16 4:00pm PST
Hi everyone...we're truly sorry for the inconvenience this is causing, and we have all hands on deck trying to correct the problem. Rest assured, this is our top priority and we are doing everything possible to get everyone access to their diagrams. We hope to have this resolved in the next several hours. We appreciate the continued support and patience everyone has expressed.
If you would like automatic notifications when this is updated, click the "subscribe" link in the upper right corner of this article (must be logged into Zendesk). Thanks again.
3/21/16 5:30pm PST
We hope to have this resolved tonight. Unfortunately, we cannot provide an exact ETA at this time, as several variables are involved. Keep checking back for updates. Thanks again for your continued patience.
3/21/16 8:00pm PST
We discovered an issue in one of our backup systems last Thursday night (03/17). Maintenance was scheduled to resolve the issue over the weekend. While working to resolve the issue, an administrator accidentally deleted the production database.
The good news is that we have copies of our database that are replicated daily, up to the exact point in time when the database was deleted. We are working hard to retrieve all of your data.
We have been in the process of restoring our database. Due to its sheer size, a restore can take days to complete. While that runs in the background, we've been attempting different tactics in parallel to restore the database and your data more quickly. If one of the current attempts is successful, we can be online as early as tomorrow morning, Pacific Standard Time (PST). However, we will not know for certain until it has completed.
We feel like we have failed you, our customers, and you expected better from us. After the restoration is completed, we will be taking a hard look at our processes and procedures to understand how and why this happened in the first place, and if there are other issues to be resolved as well. It is important to us that we meet the needs of our customers and ensure there isn’t a recurrence of this issue or others that could hinder your productivity.
Please stay tuned for further updates.
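For readers curious what a restore like the one described above involves: the post doesn't name Gliffy's database or tooling, but restoring a large daily replicated logical backup typically looks like the sketch below. Every path, hostname, and database name here is a hypothetical stand-in, not something Gliffy has published; the script only prints the command it would run, so it is safe to execute anywhere.

```shell
#!/bin/sh
# Sketch: restoring a daily logical backup onto a fresh database host.
# All names below are assumptions for illustration only.
BACKUP=/backups/production-2016-03-19.sql.gz   # last daily replicated copy (assumed)
TARGET_HOST=restore-db.internal                # dedicated restore host (assumed)
DB=production

# Build the restore pipeline as a string and print it instead of running
# it, so this sketch has no side effects. A real run would execute the
# pipeline directly; on a multi-terabyte database this step alone can
# take many hours, which is why the post mentions multi-day restores.
restore_cmd="gunzip -c $BACKUP | mysql --host=$TARGET_HOST $DB"
echo "$restore_cmd"
```

Running several such restores in parallel on separate hosts, as described in the later updates, trades extra hardware for a better chance that at least one finishes without running out of disk or failing mid-stream.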
3/22/16 3:19am PST
We have 3 parallel and redundant restore processes running to increase our chances of successfully bringing up the system today. We are investigating starting a 4th restore process in the event the first 3 fail for some reason. We believe one of the 3 restore processes will be complete in the next several hours. Once one of the restore processes is complete, there will be additional work that our engineering team must do to ensure data integrity.
I'm not able to provide an ETA for complete system restoration, since we won't know whether a restore process has succeeded until it completes. I can say that it's unlikely we'll be up and running within the next 3 hours.
Chris Kohlhardt, CEO
3/22/16 5:32am PST
The three restore processes are between 70% and 80% complete. Again, once one of them finishes, there will be additional work our engineering team must do to ensure data integrity and get the system running.
3/22/16 7:24am PST
Unfortunately one of the restore processes failed because it used significantly more disk space than we anticipated. The other restore processes have been configured with more disk space to reduce the chance of this problem happening again.
We have added another restore process, so we again have 3 running in parallel.
We estimate that one of the restore processes will complete in 12 hours and additional work will be required to fully bring that one online if it succeeds.
A second restore process could take as long as 24-36 hours to succeed.
A third restore process was just started and we do not have an estimate for completion at this time.
We are actively looking into other options that will bring the system online more quickly.
It goes without saying that we are very sorry for the impact this has had on our customers. We will report more as we have more to share.
Chris Kohlhardt, CEO
3/22/16 12:00pm PST
Our restore processes are running smoothly.
I wanted to give everyone a little more information about the 3 restore processes that Chris has discussed.
The first restore process is running significantly faster than the others because we have provisioned much more powerful hardware to accelerate it. However, this hardware exists outside of our production facility. Rest assured that we have maintained the same strict security measures we have in place for our production environment. We anticipate this restore will complete in ~8 hours. We will then need to prepare it and move it into our production facility, which I estimate will take ~4 hours.
The second restore process is running directly in our production facility and is anticipated to take over 24 hours to complete. That falls outside our target timeframe for restoring availability to our customers, but we are keeping it running as a backup plan.
Our third restore process is a backup to the first and will only be needed if the first one fails for any reason.
In summary, we’re hoping to have our systems back up and running in the early hours of tomorrow morning.
Our engineers are working around the clock along with our hosting provider to get our systems up and running as soon as possible. We apologize for the inconvenience and appreciate your patience through this process.
Thanks,
Eric Chiang
Head of Engineering
3/22/16 3:30pm PST
Our restore progress continues to be on track and we're feeling confident that we'll be able to meet our previously proposed timeframes.
We're currently working with our hosting provider to provision more space to accommodate our restored database. This will most likely be the long pole in our process, but we're hoping they will come through for us before the end of the day today.
Thanks,
Eric Chiang
Head of Engineering
3/22/16 5:00pm PST
Some good news to share... our hosting provider was able to provision the storage we needed in the timeframe we requested. Our recovery processes are also running smoothly.
Things continue to be on track!
Stay tuned.
Thanks,
Eric Chiang
Head of Engineering
3/22/16 9:00pm PST
Our recovery process has completed and we're beginning to send over our restored database to our production facility. This process should take 4+ hours to complete.
We'll post another update when this has finished.
Thanks,
Eric Chiang
Head of Engineering
3/23/16 12:30am PST
The data transfer is taking longer than expected. At this rate, we expect it to complete closer to 11am.
3/23/16 8:30am PST
The data transfer is going at the previously estimated rate. An update will be provided when that completes.
3/23/16 11:00am PST
We've gotten our restored database backup from Saturday night into our production environment. We are now attempting to restore the remaining data, from Saturday night to Sunday night, from the local system's binary logs. This will recover all of your data up to the point of our outage.
We'll ensure that our replication process is up again before starting the application and restoring access to your diagrams.
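The mention of "binary logs" suggests a MySQL-style point-in-time recovery: replay the logged statements on top of the restored backup, stopping just before the accidental deletion. A sketch of that step follows; the binlog directory, database name, and cutoff timestamp are all hypothetical placeholders, and the script only prints the command it would run rather than executing it.

```shell
#!/bin/sh
# Sketch: point-in-time recovery by replaying binary logs on top of a
# restored backup. All names and times below are assumptions.
BINLOG_DIR=/var/lib/mysql            # where the local binary logs live (assumed)
STOP_AT="2016-03-21 03:00:00"        # moment of the accidental deletion (assumed)
DB=production

# mysqlbinlog decodes the binary logs back into SQL statements;
# --stop-datetime cuts the replay off just before the deletion, so the
# delete itself is never re-applied. Printed here instead of executed.
replay_cmd="mysqlbinlog --stop-datetime='$STOP_AT' $BINLOG_DIR/mysql-bin.0* | mysql $DB"
echo "$replay_cmd"
```

This is why the binary logs had to survive on the local system: the nightly backup alone only captures data up to Saturday night, and the binlog replay bridges the gap from there to the moment of the outage.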