Content-addressable storage is always neat. Does anyone know if using a truncated md5 like this is somehow more robust than using a non-crypto hash like SipHash, which already produces 64-bit hashes?
Why rename them at all? There are already good tools for duplicate detection. An example is fdupes [1], which is smart enough to rule out non-duplicates with tricks like size checks and partial hashes, so it can avoid fully hashing many of the files.
Using md5 is only a problem here if someone has actually gained access to your files and then gone to the trouble of secretly adding new files and calculating/brute-forcing the correct 'chosen-prefixes' to ensure a clash. It would be a pretty weird attack to mount, that's for sure.
md5 is fine for deduplicating. It's extremely improbable you'd 'organically' get an md5 hash clash between two different files.
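To put a rough number on "extremely improbable", here's a back-of-the-envelope birthday-bound estimate (my own illustrative Python sketch, not from the original link; the function name is made up):

    import math

    # Birthday bound: P(at least one collision) ~= 1 - exp(-n*(n-1) / (2 * 2**bits))
    def collision_probability(n_files, bits):
        expected_colliding_pairs = n_files * (n_files - 1) / (2.0 * 2 ** bits)
        return -math.expm1(-expected_colliding_pairs)  # expm1 keeps precision for tiny values

    print(collision_probability(10**9, 128))  # full md5, a billion files: ~1.5e-21
    print(collision_probability(10**9, 64))   # truncated to 64 bits: ~0.027

Even truncated to 64 bits you'd need on the order of a billion files before an accidental clash becomes a realistic worry; at the full 128 bits it's nowhere close.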
If you had a copy of the two image files from my second link, this 'dupe detector' would erroneously flag one as a dupe.
Also, what of truncating the hashes?
I don't get why people try to justify using severely weakened things when the non-broken (i.e., secure) version is a /trivial/ drop-in replacement...
I'm not trying to justify anything. I'm just trying to suggest you're labouring under a misapprehension. And this has nothing to do with security. I'm guessing you've heard the (good) advice that md5 is not a secure hashing function for, say, storing passwords, and then promptly joined the 'md5 is bad for all the things' cargo cult.
So while you're correct about the two images on that blog, the only reason you'd get a clash is that the author of that blog post spent ~15 hours on an AWS GPU instance generating the collision blocks which, when appended to those files, result in a clash.
So, I guess if you are in the habit of grabbing random files from your hdd, loading them onto an AWS GPU instance for 15 hours (per file) and generating hash collisions, then yeah, don't use fdupes.
fdupes is not a problem assuming Wikipedia's description [1] is correct: "It first compares file sizes, partial MD5 signatures, full MD5 signatures, and then performs a byte-by-byte comparison for verification."
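For anyone curious what that staged pipeline looks like, here's a rough Python sketch of the idea (my own illustration, not fdupes' actual code; names like find_dupes and PARTIAL_BYTES are made up, and the final stage reads whole files into memory for simplicity):

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    PARTIAL_BYTES = 4096  # size of the leading chunk used for the cheap partial hash

    def md5_of(path, limit=None):
        # md5 of a file, optionally only of its first `limit` bytes
        h = hashlib.md5()
        with open(path, "rb") as f:
            remaining = limit
            while True:
                chunk = f.read(65536 if remaining is None else min(65536, remaining))
                if not chunk:
                    break
                h.update(chunk)
                if remaining is not None:
                    remaining -= len(chunk)
        return h.hexdigest()

    def split_groups(groups, key):
        # re-bucket each candidate group by key(path), keeping only buckets of 2+ files
        out = []
        for group in groups:
            buckets = defaultdict(list)
            for p in group:
                buckets[key(p)].append(p)
            out.extend(b for b in buckets.values() if len(b) > 1)
        return out

    def find_dupes(root):
        groups = [[p for p in Path(root).rglob("*") if p.is_file()]]
        groups = split_groups(groups, lambda p: p.stat().st_size)          # 1. file size
        groups = split_groups(groups, lambda p: md5_of(p, PARTIAL_BYTES))  # 2. partial md5
        groups = split_groups(groups, lambda p: md5_of(p))                 # 3. full md5
        for group in groups:                                               # 4. byte-by-byte check
            reference = group[0].read_bytes()
            for other in group[1:]:
                if other.read_bytes() == reference:
                    print(f"{other} is a duplicate of {group[0]}")

    find_dupes(".")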
I was unimpressed by the hashing in the shell script at the original link, which uses a truncated md5...
Ok, fair enough. I would agree with the view that using md5, presumably for the faster performance, is probably not the best trade-off to be making here. Unless we're dealing with an NVMe drive (or something more exotic), you're likely to be I/O bound even with more computationally intensive hashing functions.
And if you are deduping on really fast storage, you'd get way better performance (with comparable safety) using something like xxHash64 (https://cyan4973.github.io/xxHash/).
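For reference, the third-party Python xxhash package (pip install xxhash; I'm assuming a Python setup here) exposes a hashlib-style xxh64 object, so swapping it in is about this much work:

    import xxhash  # third-party: pip install xxhash

    def xxh64_file(path, chunk_size=1 << 20):
        # streaming xxHash64 over the file, 1 MiB at a time; returns a 16-char hex digest
        h = xxhash.xxh64()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(xxh64_file("some_photo.jpg"))  # hypothetical filename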
Why not just have a program that iterates through all of the files, hashes them, stores the hashes in a map/dict, and reports if there's a duplicate? Seems easier than renaming everything multiple times.
That's basically fdupes. Also, you only have to hash files with the same length; if the lengths differ, the files can't be identical, so there's no need to hash them at all.
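Something like this untested sketch (the names are mine, and the hash could just as well be md5 or xxh64 per the discussion above):

    import hashlib
    import os
    from collections import defaultdict

    def report_duplicates(paths):
        # group by size first: files of different lengths can't be identical,
        # so most files never get hashed at all
        by_size = defaultdict(list)
        for p in paths:
            by_size[os.path.getsize(p)].append(p)

        for same_size in by_size.values():
            if len(same_size) < 2:
                continue
            seen = {}  # digest -> first path seen with that content
            for p in same_size:
                with open(p, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                if digest in seen:
                    print(f"duplicate: {p} == {seen[digest]}")
                else:
                    seen[digest] = p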
Even such a simple optimization can make a huge difference on a large directory of images or MP3s.