Par2cmdline is a PAR 2.0 compatible file verification and repair tool (github.com/parchive)



Made a GUI around it for keeping large file trees (pictures, music) healthy, have a look: https://github.com/brenthuisman/par2deep


"This tool will generate one parity file (plus a file for the recovery blocks) per file that you protect."

Maybe I used it differently: I'd have a collection of dozens or hundreds of files, and if some small portion of those files were missing or corrupt, the recovery would draw on the overall parity files.

I might be misreading, but I think that with this tool, if a single file went missing, its recovery files would only cover that single file, so recovery would be impossible.
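
For reference, the whole-collection workflow I mean looks roughly like this - just a sketch, calling par2cmdline (assumed to be on PATH) from Python; the paths and the 10% redundancy level are placeholders:

    # One recovery set covers the whole collection, so a missing or damaged
    # file can be rebuilt from the recovery blocks plus the surviving files.
    import glob, subprocess

    files = sorted(glob.glob("photos/*.jpg"))
    subprocess.run(["par2", "create", "-r10", "photos/recovery.par2", *files],
                   check=True)

    # Later: verify, and only repair if par2 reports damage (non-zero exit).
    if subprocess.run(["par2", "verify", "photos/recovery.par2"]).returncode != 0:
        subprocess.run(["par2", "repair", "photos/recovery.par2"], check=True)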


True. It doesn't 'freeze' a filesystem. It's written to handle a tree in flux (you might move/delete files), and all you have to do is move the par2 along. Backups would protect against deletions, par2deep against bitrot.


Hi. I tried it out with some random pictures and par2deep wasn't able to generate the par2 file for one of them. The error was createdfiles_err. I used the GUI on Win10; the file names are of normal length and file permissions are the same as for the others. I can send it to you for testing if you want.


Sure, if you can, file an Issue on Github!

Did you try manually creating parity for that file (using par2cmdline)?


I almost did, but I tried something else first and found the likely reason: the file name contained an "ä". Par2deep also fails when there's a CJK character in the filename. After removing these non-ASCII characters, it works. But this bug is going to cause problems for many people.

Edit: filed a bug report


The joys of pre-UTF-8 software :) You're right to point it out; however, given the speed at which par2cmdline accepts PRs (~0/yr), consider it unfixable for now. I'm looking into https://github.com/akalin/gopar, which may solve this.


You are right. The way I use it is like you say.

My backups consist of compressed and encrypted archive files, which are then processed with PAR, to add redundancy that will allow the recovery of any corrupted files from the archive.
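
In case it's useful, here's a rough sketch of that pipeline (tar, gpg, and par2cmdline assumed to be installed; the file names and the 15% redundancy level are arbitrary):

    # Archive, encrypt, then add PAR2 recovery data so a corrupted backup
    # file can still be repaired before decryption/extraction.
    import subprocess

    subprocess.run(["tar", "-czf", "backup.tar.gz", "documents/"], check=True)
    subprocess.run(["gpg", "--symmetric", "--output", "backup.tar.gz.gpg",
                    "backup.tar.gz"], check=True)
    subprocess.run(["par2", "create", "-r15", "backup.tar.gz.gpg.par2",
                    "backup.tar.gz.gpg"], check=True)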


This would be useful for keeping my unraid data from rotting. I have definitely seen small bits of corruption in my multi-terabyte storage, even with ECC in place.


If you're serious about data integrity and want to keep the archives offline, consider using dvdisaster to burn to Blu-Ray disks.

Always remember that tools like PAR are of limited use if the filesystem itself becomes corrupt.

Also worth mentioning:

https://pypi.org/project/pyFileFixity/

https://en.wikipedia.org/wiki/Dar_(disk_archiver)


Most of these tools (rsbep, etc.) seem to use Reed-Solomon.

Is that because of the era they were written in, or is Reed-Solomon more suitable than other codes?

Would using e.g. LDPC codes work just as well?


Why would we need to change something that works? Isn't Reed-Solomon already at (or very close to) the theoretical maximum efficiency for redundancy codes?


I'm not talking about changing anything, I'd like to understand whether LDPC, Turbo codes, Reed-Solomon etc. "do the same thing".


Reed-Solomon, as implemented in par2, splits the original file into a number of blocks.

The total possible number of blocks I believe is 16k for par2, or some similar number. The specific number of blocks for each file can be configured, and the tradeoff is that the more blocks you want per file, the slower the generation will be.

Then it generates some parity data. Using that parity data it is possible (and guaranteed, if the parity data itself is not corrupt) to detect and completely restore 1 or more complete blocks of the target file.

The more parity data is saved, the larger the number of blocks that can be restored with it. But it doesn't matter which blocks get corrupted, or how they are corrupted. For example, if the parity data supports restoring 10 blocks, any 10 blocks of the source file can become corrupted in any way (or even go completely missing); as long as the corruption is contained in those blocks, the complete original file can be restored.

AFAIK, some other codes are possibly better suited for other tasks, such as stream redundancy measures in radio networks, or other applications where the data corruption has certain known and restricted parameters. But for general-purpose file protection - bitrot, bad blocks, bad drives, scratches on an optical disc, download errors, etc. - Reed-Solomon is very well suited for the task.
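
A concrete (if toy) illustration of that "any N blocks" property, using the third-party Python reedsolo package rather than par2 itself - par2 applies the same idea, just at the level of file blocks instead of individual bytes:

    # Sketch only: RSCodec(10) appends 10 recovery symbols, and any 10 known
    # erasure positions in the codeword can be reconstructed exactly.
    from reedsolo import RSCodec

    rsc = RSCodec(10)
    original = bytes(range(1, 51))              # 50 "data blocks", one byte each
    codeword = bytearray(rsc.encode(original))  # 50 data + 10 recovery bytes

    lost = [0, 5, 7, 12, 20, 33, 41, 48, 55, 59]  # wipe any 10 positions
    for i in lost:
        codeword[i] = 0

    # Recent reedsolo versions return (message, message+ecc, errata positions).
    recovered = rsc.decode(codeword, erase_pos=lost)[0]
    assert bytes(recovered) == original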


This is nifty! I used par2 when I was shooting video and archiving to optical media; I'd include the parity files on the media, and also keep copies elsewhere. I think the parity files would be able to pull in data even from corrupt, re-ripped iso files if the media started failing.


Optical media already use similar error correction, but the redundancy level isn't configurable, and it's stored physically close to the data, so it's somewhat less effective than what you're doing.

That said, having recently looked at some 20-year-old discs, professionally pressed discs held up well, and bargain basement CD-Rs had a lot of physical failures, even when stored indoors in a dark case. So par2 is good, but make sure you're getting high-quality media, too. Maybe M-Discs?


I still use it for our family photos (150 gigs). The par2 files cost about 5% additional storage, which I don't mind. My backups are online, on external SSDs and hard drives. I've had unexplained corrupted photos before and I wish to take no chances.


I've always wondered why I don't come across mention of PAR more frequently, since I'm fairly active in some backup and sysadmin-related communities. For some customers I maintain archival data and verifying PARs is a crucial step.


For a long time I used the "TBB" variant of par2cmdline (a parallelized version based on Intel's Threading Building Blocks library). This looks like another fork of the original; does anyone know how well it scales to many CPUs?


Great for archiving on spinning rust (and even more so for restoring), as bad blocks can happen and this embeds redundancy.

For long-term storage, I would recommend ZIP files + PAR2 on an exFAT partition.
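
A minimal sketch of that layout (Python's zipfile plus par2cmdline assumed on PATH; the names and the 10% redundancy level are just examples):

    # Build the ZIP, then put ~10% worth of PAR2 recovery data next to it on
    # the exFAT volume, so the archive can be repaired if sectors go bad.
    import subprocess, zipfile
    from pathlib import Path

    archive = Path("photos-2019.zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in Path("photos-2019").rglob("*"):
            if p.is_file():
                zf.write(p)

    subprocess.run(["par2", "create", "-r10", f"{archive}.par2", str(archive)],
                   check=True)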


While I've been doing this, I'm contemplating just using ZFS with copies=2 and calling it a day.


While looking up the ZFS copies option to see what it actually does, I stumbled onto this blog post:

https://jrs-s.net/2016/05/02/zfs-copies-equals-n/

"zfs: copies=n is not a substitute for device redundancy!"


Correct, because ZFS may place all of the copies onto the same disk.

The point of copies is belt and suspenders: you have your typical RAID protection as well as two individual copies of the file, so if one copy becomes corrupted, the second can be referenced.

Additionally, it's useful when you have a single disk (like a laptop). That way, if you end up with a corrupted block, you can still recover your file.


No, but that's not the goal of parity files either. They protect against bitrot.


PAR2 files use much more advanced error correction than simple redundancy; they can compensate for errors across a much wider spectrum of data-loss situations while using the same amount of storage space.


My concern would be reading the data back later. Zip files are still ubiquitous and easily supported by all major OSes (though not so well on mobile, so be slightly careful). With ZFS, you might first have to deal with reading off old hardware (luckily SATA looks like it has at least 10 more years) and using an old OS version to mount the pool.


ZFS seems to me like it will be a well-supported filesystem for decades to come, but in any case I wouldn't actually stuff a drive in a cabinet somewhere and have a look 30 years later. Somewhat regular checks and moving to newer media are still generally a good idea, I think.


I've read a rumor here of AWS S3 silently dropping objects when your collections get to really gargantuan sizes (billions of objects), and now I wonder if the people who ran into that are using PAR2 to compensate.


Since you mentioned AWS: I was backing up a lot of data to Glacier, so I mailed them a hard drive. I included par files in case something happened. There was a read issue with the hard drive (IIRC - I couldn't reproduce it when I got the drive back, but they had wiped it, so who knows), so they stopped the import halfway through. A couple of months later, I noticed a surprisingly large Glacier bill, but I couldn't find the objects. After contacting support, I found out that the temp files were sitting in what was essentially a hidden archive in my account.

I was pretty annoyed by the whole thing.


I had wrongly assumed chain-of-custody manifests were a thing in Snowball and AWS Import/Export. We just had a discussion on here about ECC memory, where the consensus was that ECC RAM is good even in laptops (especially at 32 GB+), yet somehow mass storage, dealing with orders of magnitude more data, gets a pass in transit. Thanks for sharing your experience; good to know I have to preserve our manifest discipline outside our data lakes' borders.

Also interesting to find out that AWS uses customers' own resources for temporary storage during the transfer process, instead of elastically using process-bound ephemeral resources outside the customer's space in a more cloud-native fashion. Temporary consumption in the customer's resource space gives me nightmare scenarios of stray code scribbling the temporary objects over customer-owned data, or dropping them into the wrong location where customer processes read them. I'd be curious to hear the trade-offs involved in that decision; they could not have made it lightly. I always try to choose fail-safe design modes, and at that level of solutioning I'm sure their teams are way smarter than I am, so I'd love to learn from this use case.


I used to use par about 20 years ago; it was fairly popular in certain warez FTP communities back when you were still downloading over modems.

What are people using it for these days?


Protection against bitrot and other data corruption in backups and other storage of important information, where corruption would be too costly.


Yes, I also use it precisely for this in all my backups.


My experience is that the scene used SFV (which is CRC32 and provides no redundancy) and Usenet used PAR(2). Together with a fill server (a block account of X TB), it helps against bitrot.


Protect archival data like pictures and music from bitrot. At minimum, I'll know about it, even though some filesystems now have checksumming to help with that.



