Katy Kelly
posted this on Mar 21, 4:11 AM
3/21/16
Gliffy Online is currently experiencing a system failure. We are working as quickly as possible to restore access to our system. We will update this message when we have further information and also at the top of https://www.gliffy.com/.
Thank you for your patience, and we apologize for the inconvenience this causes. No data has been lost in this outage.
We will update this post with an ETA and additional information as it becomes available.
3/21/16 8:30am PST
We are still working on the issue. Unfortunately, Support cannot access your diagrams; all updates and ETA information will be posted here. Thanks for your patience.
3/21/16 11:00am PST
We have located the issue and are actively working to resolve it. We appreciate everyone's continued patience and we will post a confirmed ETA once we have one.
3/21/16 1:00pm PST
We have further pinpointed the issue and are close to a resolution. We hope to be able to give an ETA in the next update.
3/21/16 4:00pm PST
Hi everyone...we're truly sorry for the inconvenience this is causing, and we have all hands on deck trying to correct the problem. Rest assured, this is our top priority and we are doing everything possible to get everyone access to their diagrams. We hope to have this resolved in the next several hours. We appreciate the continued support and patience everyone has expressed.
If you would like automatic notifications when this is updated, click the "subscribe" link in the upper right corner of this article (must be logged into Zendesk). Thanks again.
3/21/16 5:30pm PST
We hope to have this resolved tonight. Unfortunately, we cannot provide an exact ETA at this time, as several variables are involved. Keep checking back for updates. Thanks again for your continued patience.
3/21/16 8:00pm PST
We discovered an issue in one of our backup systems last Thursday night (03/17). Maintenance was scheduled to resolve the issue over the weekend. While working to resolve the issue, an administrator accidentally deleted the production database.
The good news is that we have copies of our database that are replicated daily, up to the exact point in time when the database was deleted. We are working hard to retrieve all of your data.
We have been in the process of restoring our database. Due to its sheer size, a restore can take days to complete. While that runs in the background, we've been attempting different tactics in parallel to restore the database and your data more quickly. If one of the current attempts is successful, we can be online as early as tomorrow morning, Pacific Standard Time (PST). However, we will not know for certain until it has completed.
We feel like we have failed you, our customers, and you expected better from us. After the restoration is completed, we will be taking a hard look at our processes and procedures to understand how and why this happened in the first place, and if there are other issues to be resolved as well. It is important to us that we meet the needs of our customers and ensure there isn’t a recurrence of this issue or others that could hinder your productivity.
Please stay tuned for further updates.
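For readers curious what a restore like the one described above involves: the post doesn't name Gliffy's database or tooling, but restoring a large daily replicated logical backup typically looks like the sketch below. Every path, hostname, and database name here is a hypothetical stand-in, not something Gliffy has published; the script only prints the command it would run, so it is safe to execute anywhere.

```shell
#!/bin/sh
# Sketch: restoring a daily logical backup onto a fresh database host.
# All names below are assumptions for illustration only.
BACKUP=/backups/production-2016-03-19.sql.gz   # last daily replicated copy (assumed)
TARGET_HOST=restore-db.internal                # dedicated restore host (assumed)
DB=production

# Build the restore pipeline as a string and print it instead of running
# it, so this sketch has no side effects. A real run would execute the
# pipeline directly; on a multi-terabyte database this step alone can
# take many hours, which is why the post mentions multi-day restores.
restore_cmd="gunzip -c $BACKUP | mysql --host=$TARGET_HOST $DB"
echo "$restore_cmd"
```

Running several such restores in parallel on separate hosts, as described in the later updates, trades extra hardware for a better chance that at least one finishes without running out of disk or failing mid-stream.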
3/22/16 3:19am PST
We have 3 parallel and redundant restore processes running to increase our chances of successfully bringing up the system today. We are investigating starting a 4th restore process in the event the first 3 fail for some reason. We believe one of the 3 restore processes will be complete in the next several hours. Once one of the restore processes is complete, there will be additional work that our engineering team must do to ensure data integrity.
I'm not able to provide an ETA for complete system restoration, since we won't know whether a restore process has succeeded until it completes. I can say that it's unlikely we'll be up and running within the next 3 hours.
Chris Kohlhardt, CEO
3/22/16 5:32am PST
The three restore processes are between 70% and 80% complete. Again, once one of them finishes, there will be additional work our engineering team must do to ensure data integrity and get the system running.
3/22/16 7:24am PST
Unfortunately one of the restore processes failed because it used significantly more disk space than we anticipated. The other restore processes have been configured with more disk space to reduce the chance of this problem happening again.
We have added another restore process, so we again have 3 running in parallel.
We estimate that one of the restore processes will complete in 12 hours and additional work will be required to fully bring that one online if it succeeds.
A second restore process could take as long as 24-36 hours to succeed.
A third restore process was just started and we do not have an estimate for completion at this time.
We are actively looking into other options that will bring the system online more quickly.
It goes without saying that we are very sorry for the impact this has had on our customers. We will report more as we have more to share.
Chris Kohlhardt, CEO
3/22/16 12:00pm PST
Our restore processes are running smoothly.
I wanted to give everyone a little more information about the 3 restore processes that Chris has discussed.
The first restore process is running significantly faster than the others because we have provisioned much more powerful hardware to accelerate it. However, this hardware exists outside of our production facility. Rest assured that we have maintained the same strict security measures we have in place for our production environment. We anticipate this restore will complete in ~8 hours. We will then need to prepare it and move it into our production facility, which I estimate will take ~4 hours.
The second restore process is running directly in our production facility and is anticipated to take over 24 hours to complete. That falls outside our target timeframe for restoring availability to our customers, but we are keeping it running as a backup plan.
Our third restore process is a backup to the first and will only be needed if the first one fails for any reason.
In summary, we’re hoping to have our systems back up and running in the early hours of tomorrow morning.
Our engineers are working around the clock along with our hosting provider to get our systems up and running as soon as possible. We apologize for the inconvenience and appreciate your patience through this process.
Thanks,
Eric Chiang
Head of Engineering
3/22/16 3:30pm PST
Our restore progress continues to be on track and we're feeling confident that we'll be able to meet our previously proposed timeframes.
We're currently working with our hosting provider to provision more space to accommodate our restored database. This will most likely be the long pole in our process, but we're hoping they will come through for us before the end of the day today.
Thanks,
Eric Chiang
Head of Engineering
3/22/16 5:00pm PST
Some good news to share... our hosting provider was able to provision the storage we needed in the timeframe we requested. Our recovery processes are also running smoothly.
Things continue to be on track!
Stay tuned.
Thanks,
Eric Chiang
Head of Engineering
3/22/16 9:00pm PST
Our recovery process has completed and we're beginning to send over our restored database to our production facility. This process should take 4+ hours to complete.
We'll post another update when this has finished.
Thanks,
Eric Chiang
Head of Engineering
3/23/16 12:30am PST
The data transfer is taking longer than expected. At this rate, we expect it to complete closer to 11am.
3/23/16 8:30am PST
The data transfer is going at the previously estimated rate. An update will be provided when that completes.
3/23/16 11:00am PST
We've gotten our restored database backup from Saturday night into our production environment. We are now attempting to restore the remaining data, from Saturday night to Sunday night, from the local system's binary logs. This will recover all of your data up to the point of our outage.
We'll ensure that our replication process is up again before starting the application and restoring access to your diagrams.
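The mention of "binary logs" suggests a MySQL-style point-in-time recovery: replay the logged statements on top of the restored backup, stopping just before the accidental deletion. A sketch of that step follows; the binlog directory, database name, and cutoff timestamp are all hypothetical placeholders, and the script only prints the command it would run rather than executing it.

```shell
#!/bin/sh
# Sketch: point-in-time recovery by replaying binary logs on top of a
# restored backup. All names and times below are assumptions.
BINLOG_DIR=/var/lib/mysql            # where the local binary logs live (assumed)
STOP_AT="2016-03-21 03:00:00"        # moment of the accidental deletion (assumed)
DB=production

# mysqlbinlog decodes the binary logs back into SQL statements;
# --stop-datetime cuts the replay off just before the deletion, so the
# delete itself is never re-applied. Printed here instead of executed.
replay_cmd="mysqlbinlog --stop-datetime='$STOP_AT' $BINLOG_DIR/mysql-bin.0* | mysql $DB"
echo "$replay_cmd"
```

This is why the binary logs had to survive on the local system: the nightly backup alone only captures data up to Saturday night, and the binlog replay bridges the gap from there to the moment of the outage.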