94

Surely there must be a way to do this easily!

I've tried the Linux command-line apps such as sha1sum and md5sum but they seem only to be able to compute hashes of individual files and output a list of hash values, one for each file.

I need to generate a single hash for the entire contents of a folder (not just the filenames).

I'd like to do something like

sha1sum /folder/of/stuff > singlehashvalue

Edit: to clarify, my files are at multiple levels in a directory tree, they're not all sitting in the same root folder.

  • 1
    By 'entire contents' do you mean the logical data of all files in the directory, or the data along with the metadata, while arriving at the root hash? Since the selection criteria of your use case are quite broad, I've tried to address a few practical ones in my answer. – six-k Jan 10 '18 at 18:04

15 Answers

122

One possible way would be:

sha1sum path/to/folder/* | sha1sum

If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be

find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum

And, finally, if you also need to take account of permissions and empty directories:

(find path/to/folder -type f -print0  | sort -z | xargs -0 sha1sum;
 find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
   xargs -0 stat -c '%n %a') \
| sha1sum

The arguments to stat cause it to print the name of each file, followed by its octal permissions. The two finds run one after the other, causing roughly double the amount of disk IO: the first finds all file names and checksums their contents, the second finds all file and directory names and prints name and mode. The list of "file names and checksums", followed by "names and directories, with permissions", is then checksummed itself, yielding a single, smaller checksum.

  • 2
    and don't forget to set LC_ALL=POSIX, so the various tools create locale independent output. – David Schmitt Feb 15 '09 at 12:28
  • 2
    I found cat | sha1sum to be considerably faster than sha1sum | sha1sum. YMMV, try each of these on your system: time find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum; time find path/to/folder -type f -print0 | sort -z | xargs -0 cat | sha1sum – Bruno Bronosky Apr 28 '11 at 17:02
  • 5
    @RichardBronosky - Let us assume we have two files, A and B. A contains "foo" and B contains "bar was here". With your method, we would not be able to separate that from two files C and D, where C contains "foobar" and D contains " was here". By hashing each file individually and then hash all "filename hash" pairs, we can see the difference. – Vatine Dec 18 '12 at 10:18
  • 2
    To make this work irrespective of the directory path (i.e. when you want to compare the hashes of two different folders), you need to use a relative path and change to the appropriate directory, because the paths are included in the final hash: find ./folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum – robbles Feb 14 '13 at 20:30
  • 3
    @robbles That is correct and why I did not put an initial / on the path/to/folder bit. – Vatine Feb 15 '13 at 10:58
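
Putting David Schmitt's and robbles' comments together: export a fixed locale and use a relative path, so that two copies of the same tree hash identically wherever they live. A sketch (assumes GNU find, sort, xargs and coreutils):

export LC_ALL=POSIX
cd /path/to
find ./folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum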
25
  • Use a file system intrusion detection tool like aide.

  • hash a tar ball of the directory:

    tar cvf - /path/to/folder | sha1sum

  • Code something yourself, like vatine's oneliner:

    find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum

  • 3
    +1 for the tar solution. That is the fastest, but drop the v. verbosity only slows it down. – Bruno Bronosky Feb 5 '13 at 20:47
  • 6
    note that the tar solution assumes the files are in the same order when you compare them. Whether they are depends on the file system the files reside in when doing the comparison. – nos Feb 25 '13 at 14:19
  • 5
    The git hash is not suitable for this purpose since file contents are only a part of its input. Even for the initial commit of a branch, the hash is affected by the commit message and the commit metadata as well, like the time of the commit. If you commit the same directory structure multiple times, you will get a different hash every time, thus the resulting hash is not suitable for determining whether two directories are exact copies of each other by only sending the hash over. – Zoltan May 17 '18 at 19:11
  • 1
    @Zoltan the git hash is perfectly fine, if you use a tree hash and not a commit hash. – hobbs May 30 '19 at 2:44
  • @hobbs The answer originally stated "commit hash", which is certainly not fit for this purpose. The tree hash sounds like a much better candidate, but there could still be hidden traps. One that comes to my mind is that having the executable bit set on some files changes the tree hash. You have to issue git config --local core.fileMode false before committing to avoid this. I don't know whether there are any more caveats like this. – Zoltan May 30 '19 at 7:45
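
Following up on hobbs' suggestion, a tree hash can be obtained without creating a commit by using git plumbing. A rough sketch, assuming the folder sits inside a git work tree (and note Zoltan's caveat that the executable bit changes the result):

git add -A path/to/folder
git write-tree --prefix=path/to/folder/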
14

You can do tar -c /path/to/folder | sha1sum

  • 16
    If you want to replicate that checksum on a different machine, tar might not be a good choice, as the format seems to have room for ambiguity and exists in many versions, so the tar on another machine might produce different output from the same files. – slowdog Jan 27 '11 at 18:42
  • 2
    slowdog's valid concerns notwithstanding, if you care about file contents, permissions, etc. but not modification time, you can add the --mtime option like so: tar -c /path/to/folder --mtime="1970-01-01" | sha1sum. – Binary Phile Dec 17 '15 at 19:44
  • @S.Lott if the directory is very large, zipping it and getting an md5 of the result will take more time – Kasun Siyambalapitiya Jul 24 '17 at 9:38
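
Building on Binary Phile's comment, GNU tar (1.28+) can also pin the member order and ownership, which makes the archive, and therefore the hash, more reproducible. A sketch, still not guaranteed to be byte-identical across different tar implementations:

tar --sort=name --owner=0 --group=0 --numeric-owner --mtime='1970-01-01' -c /path/to/folder | sha1sum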
13

If you just want to check if something in the folder changed, I'd recommend this one:

ls -alR --full-time /folder/of/stuff | sha1sum

It will just give you a hash of the ls output, which contains folders, sub-folders, their files, their timestamps, sizes and permissions. That is pretty much everything you would need to determine whether something has changed.

Please note that this command does not generate a hash for each file, but that is also why it should be faster than using find.

  • 1
    I'm unsure why this doesn't have more upvotes given the simplicity of the solution. Can anyone explain why this wouldn't work well? – Dave C Mar 15 '17 at 1:02
  • 1
    I suppose this isn't ideal as the generated hash will be based on file owner, date-format setup, etc. – Ryota Mar 15 '17 at 22:06
  • 1
    The ls command can be customized to output whatever you want. You can replace -l with -gG to omit the group and the owner. And you can change the date format with the --time-style option. Basically check out the ls man page and see what suits your needs. – Shumoapp Mar 16 '17 at 15:52
  • @DaveC Because it's pretty much useless. If you want to compare filenames, just compare them directly. They're not that big. – Navin Aug 18 '18 at 1:51
  • 7
    @Navin From the question it is not clear whether it is necessary to hash file contents or detect a change in a tree. Each case has its uses. Storing 45K filenames in a kernel tree, for example, is less practical than a single hash. ls -lAgGR --block-size=1 --time-style=+%s | sha1sum works great for me – yashma Aug 21 '18 at 2:26
5

A robust and clean approach

  • First things first: don't hog the available memory! Hash a file in chunks rather than reading the entire file into memory.
  • Different approaches suit different needs/purposes (use all of the below, or pick whatever applies; a plain-shell sketch of a couple of these points follows this list):
    • Hash only the entry name of all entries in the directory tree
    • Hash the file contents of all entries (leaving out metadata such as inode number, ctime, atime, mtime, size, etc.)
    • For a symbolic link, its content is the referent name; hash that, or choose to skip it
    • Decide whether or not to follow the symlink (the resolved name) while hashing the contents of the entry
    • If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually, but should the directory entry names of that level be hashed to tag this directory? This helps in use cases where the hash needs to identify a change quickly without having to traverse deeply to hash the contents, for example when a file's name changes but the rest of the contents remain the same and they are all fairly large files
    • Handle large files well (again, mind the RAM)
    • Handle very deep directory trees (mind the open file descriptors)
    • Handle non-standard file names
    • Decide how to proceed with files that are sockets, pipes/FIFOs, block devices, or char devices: must they be hashed as well?
    • Don't update the access time of any entry while traversing, because that side effect is counter-productive (and counter-intuitive) for certain use cases

This is what I have off the top of my head; anyone who has spent some time working on this in practice will have caught other gotchas and corner cases.
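
As a rough plain-shell illustration of two of these points (handling non-standard file names via -print0, and folding symlink referent names into the root hash), here is a sketch in the spirit of the accepted answer; it assumes GNU find, stat and coreutils:

(find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum;
 find path/to/folder -type l -print0 | sort -z | xargs -0 -r stat -c '%N') | sha1sum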

Here's a tool that is very light on memory and addresses most of these cases. It might be a bit rough around the edges but has been quite helpful.

An example usage and output of dtreetrawl.

Usage:
  dtreetrawl [OPTION...] "/trawl/me" [path2,...]

Help Options:
  -h, --help                Show help options

Application Options:
  -t, --terse               Produce a terse output; parsable.
  -j, --json                Output as JSON
  -d, --delim=:             Character or string delimiter/separator for terse output(default ':')
  -l, --max-level=N         Do not traverse tree beyond N level(s)
  --hash                    Enable hashing(default is MD5).
  -c, --checksum=md5        Valid hashing algorithms: md5, sha1, sha256, sha512.
  -R, --only-root-hash      Output only the root hash. Blank line if --hash is not set
  -N, --no-name-hash        Exclude path name while calculating the root checksum
  -F, --no-content-hash     Do not hash the contents of the file
  -s, --hash-symlink        Include symbolic links' referent name while calculating the root checksum
  -e, --hash-dirent         Include hash of directory entries while calculating root checksum

A snippet of human friendly output:

...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
        Base name                    : CREDITS
        Level                        : 1
        Type                         : regular file
        Referent name                :
        File size                    : 98443 bytes
        I-node number                : 290850
        No. directory entries        : 0
        Permission (octal)           : 0644
        Link count                   : 1
        Ownership                    : UID=0, GID=0
        Preferred I/O block size     : 4096 bytes
        Blocks allocated             : 200
        Last status change           : Tue, 21 Nov 17 21:28:18 +0530
        Last file access             : Thu, 28 Dec 17 00:53:27 +0530
        Last file modification       : Tue, 21 Nov 17 21:28:18 +0530
        Hash                         : 9f0312d130016d103aa5fc9d16a2437e

Stats for /home/lab/linux-4.14-rc8:
        Elapsed time     : 1.305767 s
        Start time       : Sun, 07 Jan 18 03:42:39 +0530
        Root hash        : 434e93111ad6f9335bb4954bc8f4eca4
        Hash type        : md5
        Depth            : 8
        Total,
                size           : 66850916 bytes
                entries        : 12484
                directories    : 763
                regular files  : 11715
                symlinks       : 6
                block devices  : 0
                char devices   : 0
                sockets        : 0
                FIFOs/pipes    : 0
  • 1
    Can you give a brief example to get a robust and clean sha256 of a folder, maybe for a Windows folder with three subdirectories and a few files in there each? – Ferit May 10 at 0:44
3

If you just want to hash the contents of the files, ignoring the filenames, then you can use

cat $FILES | md5sum

Make sure you have the files in the same order when computing the hash:

cat $(echo $FILES | sort) | md5sum

But you can't have directories in your list of files.

  • 2
    Moving the end of one file into the beginning of the file that follows it alphabetically would not affect the hash but should. A file-delimiter or file lengths would need to be included in the hash. – Jason Stangroome Mar 12 '12 at 3:35
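
One way to address Jason Stangroome's point while keeping the cat approach is to interleave each file's name and size with its contents, so the file boundaries are encoded in the hashed stream. A sketch that, like the original, assumes filenames without whitespace:

for f in $(echo $FILES | tr ' ' '\n' | sort); do
  printf '%s %s\n' "$f" "$(wc -c < "$f")"
  cat "$f"
done | md5sum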
3

Another tool to achieve this:

http://md5deep.sourceforge.net/

As it sounds: like md5sum but also recursive, plus other features.

  • 1
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Mamoun Benghezal Jul 29 '15 at 13:51
3

If this is a git repo and you want to ignore any files in .gitignore, you might want to use this:

git ls-files <your_directory> | xargs sha256sum | cut -d" " -f1 | sha256sum | cut -d" " -f1

This is working well for me.
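
If any tracked paths contain spaces, a NUL-separated variant of the same idea may be more robust (assumes GNU xargs):

git ls-files -z <your_directory> | xargs -0 sha256sum | cut -d" " -f1 | sha256sum | cut -d" " -f1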

2

There is a python script for that:

http://code.activestate.com/recipes/576973-getting-the-sha-1-or-md5-hash-of-a-directory/

If you change the name of a file without changing the files' alphabetical order, the hash script will not detect it. But if you change the order of the files or the contents of any file, running the script will give you a different hash than before.

1

Try to make it in two steps:

  1. create a file with hashes for all files in a folder
  2. hash this file

Like so:

# for FILE in `find /folder/of/stuff -type f | sort`; do sha1sum $FILE >> hashes; done
# sha1sum hashes

Or do it all at once:

# cat `find /folder/of/stuff -type f | sort` | sha1sum
  • for F in 'find ...' ... doesn't work when you have spaces in names (which you always do nowadays). – mivk Apr 10 '12 at 10:38
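
A whitespace-safe rewrite of the same two-step idea, borrowing -print0/-0 from the accepted answer (assumes GNU find and xargs):

find /folder/of/stuff -type f -print0 | sort -z | xargs -0 sha1sum > hashes
sha1sum hashes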
1

I would pipe the results for the individual files through sort (to prevent a mere reordering of the files from changing the hash) into md5sum or sha1sum, whichever you choose.
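
For instance, a minimal sketch of that idea (assumes GNU find and coreutils):

find /folder/of/stuff -type f -exec sha1sum {} + | sort -k 2 | sha1sum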

1

I've written a Groovy script to do this:

import java.security.MessageDigest

public static String generateDigest(File file, String digest, int paddedLength){
    MessageDigest md = MessageDigest.getInstance(digest)
    md.reset()
    def files = []
    def directories = []

    if(file.isDirectory()){
        file.eachFileRecurse(){sf ->
            if(sf.isFile()){
                files.add(sf)
            }
            else{
                directories.add(file.toURI().relativize(sf.toURI()).toString())
            }
        }
    }
    else if(file.isFile()){
        files.add(file)
    }

    files.sort({a, b -> return a.getAbsolutePath() <=> b.getAbsolutePath()})
    directories.sort()

    files.each(){f ->
        println file.toURI().relativize(f.toURI()).toString()
        f.withInputStream(){is ->
            byte[] buffer = new byte[8192]
            int read = 0
            while((read = is.read(buffer)) > 0){
                md.update(buffer, 0, read)
            }
        }
    }

    directories.each(){d ->
        println d
        md.update(d.getBytes())
    }

    byte[] digestBytes = md.digest()
    BigInteger bigInt = new BigInteger(1, digestBytes)
    return bigInt.toString(16).padLeft(paddedLength, '0')
}

println "\n${generateDigest(new File(args[0]), 'SHA-256', 64)}"

You can customize the usage to avoid printing each file, change the message digest, take out directory hashing, etc. I've tested it against the NIST test data and it works as expected. http://www.nsrl.nist.gov/testdata/

gary-macbook:Scripts garypaduana$ groovy dirHash.groovy /Users/garypaduana/.config
.DS_Store
configstore/bower-github.yml
configstore/insight-bower.json
configstore/update-notifier-bower.json
filezilla/filezilla.xml
filezilla/layout.xml
filezilla/lockfile
filezilla/queue.sqlite3
filezilla/recentservers.xml
filezilla/sitemanager.xml
gtk-2.0/gtkfilechooser.ini
a/
configstore/
filezilla/
gtk-2.0/
lftp/
menus/
menus/applications-merged/

79de5e583734ca40ff651a3d9a54d106b52e94f1f8c2cd7133ca3bbddc0c6758
1

I had to check a whole directory for file changes.

But excluding timestamps and directory ownerships.

The goal is to get a sum that is identical anywhere, as long as the files are identical.

That includes copies hosted on other machines, regardless of anything other than the files, or of a change to them.

md5sum * | md5sum | cut -d' ' -f1

It generates a list of hashes, one per file, then condenses those hashes into one.

This is way faster than the tar method.

For stronger hashes, we can use sha512sum with the same recipe.

sha512sum * | sha512sum | cut -d' ' -f1

The hashes are likewise identical anywhere using sha512sum, and there is no known way to reverse them.

  • This seems much simpler than the accepted answer for hashing a directory. I wasn't finding the accepted answer reliable. One issue... is there a chance the hashes could come out in a different order? sha256sum /tmp/thd-agent/* | sort is what I'm trying for a reliable ordering, then just hashing that. – thinktt Jan 30 at 19:57
  • Hi, it looks like the hashes come in alphabetical order by default. What do you mean by reliable ordering? You have to organize all that yourself. For example using associative arrays, entry + hash. Then you sort this array by entry; this gives a list of computed hashes in the sort order. I believe you can use a json object otherwise, and hash the whole object directly. – NVRM Jan 31 at 1:27
  • If I understand correctly, you're saying it hashes the files in alphabetical order. That seems right. Something in the accepted answer above was giving me intermittent different orders sometimes, so I'm just trying to make sure that doesn't happen again. I'm going to stick with putting sort at the end. Seems to be working. The only issue I see with this method vs the accepted answer is that it doesn't deal with nested folders. In my case I don't have any folders so this works great. – thinktt Jan 31 at 17:23
  • what about ls -r | sha256sum ? – NVRM Jan 31 at 22:27
  • @NVRM tried it and it just checked for file name changes, not the file content – Gi0rgi0s Aug 14 at 15:32
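
To extend the same recipe to nested folders, one option (again with GNU find and coreutils) is to feed every regular file to md5sum and sort the per-file lines before the final hash:

find . -type f -exec md5sum {} + | sort -k 2 | md5sum | cut -d' ' -f1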
0

You could use sha1sum to generate the list of hash values and then sha1sum that list again; it depends on what exactly you want to accomplish.

0

Here's a simple, short variant in Python 3 that works fine for small files (e.g. a source tree, where every file individually fits into RAM easily). It ignores empty directories and is based on the ideas from the other solutions:

import os, hashlib

def hash_for_directory(path, hashfunc=hashlib.sha1):
    filenames = sorted(os.path.join(dp, fn)
                       for dp, _, fns in os.walk(path)
                       for fn in fns)
    index = '\n'.join('{}={}'.format(os.path.relpath(fn, path),
                                     hashfunc(open(fn, 'rb').read()).hexdigest())
                      for fn in filenames)
    return hashfunc(index.encode('utf-8')).hexdigest()

It works like this:

  1. Find all files in the directory recursively and sort them by name
  2. Calculate the hash (default: SHA-1) of every file (reads whole file into memory)
  3. Make a textual index with "filename=hash" lines
  4. Encode that index back into a UTF-8 byte string and hash that

You can pass in a different hash function as second parameter if SHA-1 is not your cup of tea.
