
I want to create a digital archive of photos, documents and other important stuff to store in the cloud (likely Amazon Glacier). Preferably one year per archive, up to 10 gigabytes each. I want to make sure storage and network transfer errors won't break anything, so I want to include a solid amount of recovery data as overhead.

Do you have any recommended best practices and tools here? RAR with recovery data? Is it worth storing a checksum for each file along with the archive? Any other suggestions?

CC BY-SA 3.0

3 Answers


If you want to include additional recovery data with your backups, you could use Parchive-type solutions. You specify the amount of redundancy/recovery data that you want to generate and how (if at all) to split it. The benefit of using this method is that it's agnostic to the actual backup and storage methods you choose. You can use zip or tar or Windows Backup or anything else that generates files and feed them through Parchive tools to generate additional recovery files.
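
For example, assuming the command-line par2cmdline tool is installed (QuickPar and MultiPar produce compatible .par2 files), a minimal sketch of wrapping an already-built archive in recovery data could look like this; the archive name and the 10% redundancy figure are just placeholders:

    import subprocess

    # Hypothetical yearly archive produced by whatever backup tool you prefer.
    archive = "photos-2012.tar"

    # Ask par2 for roughly 10% recovery data; this writes photos-2012.tar.par2
    # plus several .vol*.par2 recovery volumes next to the archive.
    subprocess.run(
        ["par2", "create", "-r10", f"{archive}.par2", archive],
        check=True,
    )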

Keep in mind that both Amazon Glacier and S3 generate a checksum for each uploaded file, so once you upload a file, you can compare the local and remote checksums to make sure it was transferred without errors.
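
As a sketch of the local half of that comparison: Glacier reports a SHA-256 tree hash for every uploaded archive (SHA-256 of each 1 MiB chunk, then pairwise hashing of the digests up to a single root). Below is a minimal Python version of that documented scheme, using a hypothetical archive name; you would compare its output against the checksum Glacier returns.

    import hashlib

    MIB = 1024 * 1024

    def glacier_tree_hash(path):
        """SHA-256 tree hash as documented for Amazon Glacier archives."""
        # Leaf level: SHA-256 digest of every 1 MiB chunk of the file.
        digests = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(MIB)
                if not chunk:
                    break
                digests.append(hashlib.sha256(chunk).digest())
        if not digests:  # empty file
            return hashlib.sha256(b"").hexdigest()

        # Combine digests pairwise until only the root digest remains.
        while len(digests) > 1:
            paired = [
                hashlib.sha256(digests[i] + digests[i + 1]).digest()
                for i in range(0, len(digests) - 1, 2)
            ]
            if len(digests) % 2 == 1:
                paired.append(digests[-1])  # odd digest is carried up unchanged
            digests = paired
        return digests[0].hexdigest()

    print(glacier_tree_hash("photos-2012.tar"))  # hypothetical local archive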

Furthermore, this is what Amazon has to say on this topic:

Durable – Amazon Glacier is designed to provide average annual durability of 99.999999999% for an archive. The service redundantly stores data in multiple facilities and on multiple devices within each facility. To increase durability, Amazon Glacier synchronously stores your data across multiple facilities before returning SUCCESS on uploading archives. Unlike traditional systems which can require laborious data verification and manual repair, Glacier performs regular, systematic data integrity checks and is built to be automatically self-healing.

This means that there’s only a 0.00000000001 (1e-11) probability of any given file going poof over the course of a single year. Put another way, if you store 100 billion files in Glacier for one year, you can expect to lose one of them.
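
To put that arithmetic next to the scale in the question (a handful of yearly archives rather than 100 billion files), the expected number of losses is just the file count times that probability:

    # Expected annual losses = number of stored files x per-file loss probability.
    p_loss = 1e-11

    print(100e9 * p_loss)  # 100 billion files -> about 1 expected loss per year
    print(50 * p_loss)     # e.g. 50 yearly archives -> 5e-10, effectively zero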

If you want additional assurance, consider uploading your data to multiple Glacier regions or to a totally different service provider in another geo region.

CC BY-SA 3.0
  • Combined with PAR archives, this could be a great solution -- depending on affordability. Also, I prefer MultiPar for PAR archives over QuickPar. Its main defining feature is that it will generate parity for subfolders, something QuickPar doesn't do.
    – afrazier
    Sep 7, 2012 at 12:58

Generally if you don't fully trust your storage medium's reliability, you want to introduce your own repair-capable redundancy.

A brute-force and quick-and-dirty way of doing this is merely uploading everything twice. You probably don't want to do that.

It's more involved, but if you split your files into small blocks and create "par2" files using a tool such as QuickPar (here's a tutorial), then I believe a missing or damaged block can be recovered from the parity data. This technique is usually used to increase the reliability of binary files transferred and "retrieved" over Usenet (which was never really designed for that), but it can be used anywhere you want this level of redundancy.
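
For instance, assuming par2cmdline is installed (it reads the same .par2 files QuickPar creates) and that a recovery set was generated before upload, a rough verify-and-repair pass after downloading could look like this; the file name is hypothetical:

    import subprocess

    # Hypothetical recovery set created alongside the archive before upload.
    recovery_set = "photos-2012.tar.par2"

    # par2 exits with 0 when every file verifies cleanly; a non-zero exit code
    # signals missing or damaged blocks, in which case we attempt a repair.
    result = subprocess.run(["par2", "verify", recovery_set])
    if result.returncode != 0:
        subprocess.run(["par2", "repair", recovery_set], check=True)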

CC BY-SA 3.0

There are alternatives to the old PAR format: DVDisaster, DAR and pyFileFixity (which I developed). That said, cloud services should have their own systems for data preservation, of course, because at the storage scale they offer, the rate of data corruption would otherwise grow frighteningly high; so in any case you should be safe.

CC BY-SA 3.0
