@sneakers-the-rat
Last active May 27, 2024 03:44
    Strip PDF Metadata
    # --------------------------------------------------------------------
    # Recursively find pdfs from the directory given as the first argument,
    # otherwise search the current directory.
    # Use exiftool and qpdf (both must be installed and locatable on $PATH)
    # to strip all top-level metadata from PDFs.
    #
    # Note - This only removes file-level metadata, not any metadata
    # in embedded images, etc.
    #
    # Code is provided as-is, I take no responsibility for its use,
    # and I make no guarantee that this code works
    # or makes your PDFs "safe," whatever that means to you.
    #
    # You may need to enable execution of this script before using,
    # eg. chmod +x clean_pdf.sh
    #
    # example:
    # clean current directory:
    # >>> ./clean_pdf.sh
    #
    # clean specific directory:
    # >>> ./clean_pdf.sh some/other/directory
    # --------------------------------------------------------------------
    # Color Codes so that warnings/errors stick out
    GREEN="\e[32m"
    RED="\e[31m"
    CLEAR="\e[0m"
    # loop through all PDFs in first argument ($1),
    # or use '.' (this directory) if not given
    DIR="${1:-.}"
    echo "Cleaning PDFs in directory $DIR"
    # use find to locate files, pipe to while read to get the
    # whole line instead of space delimited
    # Note -- this will find pdfs recursively!!
    find "$DIR" -type f -name "*.pdf" | while IFS= read -r i
    do
        # output file as original filename with suffix _clean.pdf
        TMP="${i%.*}_clean.pdf"
        # remove the temporary file if it already exists
        if [ -f "$TMP" ]; then
            rm "$TMP"
        fi
        # strip all metadata tags, writing the result to $TMP
        exiftool -q -q -all:all= "$i" -o "$TMP"
        # rewrite the file with qpdf, since exiftool's PDF edits alone are reversible
        qpdf --linearize --deterministic-id --replace-input "$TMP"
        printf "${GREEN}Processed ${RED}%s ${CLEAR}as ${GREEN}%s${CLEAR}\n" "$i" "$TMP"
    done
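
The header comment says exiftool and qpdf must both be installed and on $PATH, but the script does not check for them. A minimal pre-flight check might look like the sketch below; `require_tools` is a hypothetical helper name, not part of the original gist.

```shell
# Report every required tool that cannot be found on $PATH.
# Returns 0 if all tools were found, 1 if any is missing.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing required tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `require_tools exiftool qpdf || exit 1` before the `find` loop would stop the script early instead of failing once per PDF.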
    @muddynat

    How would one change this to replace the existing file, rather than creating a new one with the _clean.pdf suffix?

    @RooneyMcNibNug

    @muddynat you could probably just do something like the following one-liner for this:

    for f in ./*.pdf; do exiftool -q -q -all:all= "$i" && qpdf --linearize --replace-input; done
    

    @sneakers-the-rat
    Author

    that^^ would work, just need to add "$i" to the qpdf part, i believe. most of this script is just to add comments and tell the person running it what's going on. I have never gotten the hang of writing arguments for shell scripts, but it would be nice to have a --suffix flag (that you could just give "").
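
A rough sketch of what such a --suffix flag could look like, using a plain `case` loop. The function names `parse_args` and `output_path`, and the `--suffix=VALUE` spelling, are assumptions for illustration and are not in the script above.

```shell
# Parse "[--suffix VALUE] [DIR]" into the SUFFIX and DIR variables.
# Defaults mirror the script above: suffix "_clean", directory ".".
parse_args() {
    SUFFIX="_clean"
    DIR="."
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --suffix)
                if [ "$#" -lt 2 ]; then
                    echo "--suffix needs a value (use \"\" for none)" >&2
                    return 1
                fi
                SUFFIX="$2"
                shift 2
                ;;
            --suffix=*)
                SUFFIX="${1#--suffix=}"
                shift
                ;;
            *)
                # anything else is treated as the directory to clean
                DIR="$1"
                shift
                ;;
        esac
    done
}

# Build the output path for one input file using the current SUFFIX.
output_path() {
    printf '%s%s.pdf\n' "${1%.*}" "$SUFFIX"
}
```

With `parse_args --suffix "" some/dir`, `output_path` returns the input path unchanged, so the loop body would then need to edit the file in place rather than pass the same path to exiftool's `-o`.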

    @muddynat

    @RooneyMcNibNug & @sneakers-the-rat thanks! I don't know much about bash scripting - where would this "$i" go in the qpdf part?

    @sneakers-the-rat
    Author

    @muddynat "$i" is a variable expansion: the shell replaces "$i" with whatever value you are currently iterating over in the for or while loop. taking a second look at the code in the above comment, it also needs its variable renamed and to use the while pattern, so it would be like this:

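    find "$DIR" -type f -name "*.pdf" | while IFS= read -r i
    do
      # -overwrite_original stops exiftool from leaving a "<name>_original" backup that still holds the metadata
      exiftool -q -q -overwrite_original -all:all= "$i"
      qpdf --linearize --replace-input "$i"
    done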
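
The `while read` pattern reads one line per file, so it would still trip on a filename that contains a newline. One alternative, sketched here under the assumption that running one command per file is acceptable, is to let `find` invoke the command itself with `-exec`. `for_each_pdf` is a hypothetical helper name.

```shell
# Run a command once for every PDF found recursively under a directory.
# The file path is appended as the last argument of the command.
for_each_pdf() {
    local dir="$1"
    shift
    find "$dir" -type f -name '*.pdf' -exec "$@" {} \;
}
```

For example, `for_each_pdf "$DIR" qpdf --linearize --replace-input` would run qpdf on each file in turn, with no word splitting on spaces or newlines in the path.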

    @bigfakelaugh

    @sneakers-the-rat just saw this user on reddit recommending the use of the --deterministic-id option from QPDF to achieve cleaner results: https://reddit.com/r/Piracy/comments/12ai3so/how_to_remove_all_metadata_identifiers_when/. From what I understood, this way each cleaned up file generated from a certain source pdf would have the exact same ID

    End result in line 52 would be simply qpdf --linearize --deterministic-id --replace-input "$TMP"

    @WeAreLegion999

    WeAreLegion999 commented Apr 8, 2023

    the QPDF documentation for --deterministic-id is here, along with a thread explaining it more clearly. the same article from Elsvr downloaded through multiple institutional accesses will generate byte-for-byte identical outputs from ExifTool+QPDF when using this method.
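
One way to check the byte-for-byte claim on your own files is to compare the cleaned outputs with `cmp`. This is only a sketch; `all_identical` is a hypothetical helper name and the file names in the usage note are placeholders.

```shell
# Return 0 if every file given is byte-for-byte identical to the first one,
# 1 as soon as any file differs.
all_identical() {
    local first="$1" f
    shift
    for f in "$@"; do
        cmp -s "$first" "$f" || return 1
    done
    return 0
}
```

Usage would be something like `all_identical copy1_clean.pdf copy2_clean.pdf && echo "identical"`, run on two cleaned copies of the same article obtained from different sources.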

    @sneakers-the-rat
    Author

    @bigfakelaugh yes good addition, edited!
