How to batch convert pdf files to text
Frequently I am asked: I have a bunch of pdf files, how can I convert them to plain text so that analyze them using quantitative techniques? Here is my recommendation.
Download the xpdf suite of tools for your platform. This includes the part we will use, pdftotext.
Alternatives are the Apache PDFBox Java pdf library, and the Python-based PDFminer.[Windows only – Mac and Linux/Unix have this built in to the Terminal or shell already]: You will need a bash shell for your platform. (It is possible to do what I suggest below using the Windows shell, but it’s been so long since I programmed in the Windows DOS/command line script language that I won’t even attempt it now.) The main options seem to be win-bash and Cygwin.
Create a folder called pdfs in your home folder (for this example – of course it can be elsewhere). Copy your pdf files to this folder.
In a text edtor, create a text file called
convertmyfiles.sh
with the following contents:#!/bin/bash FILES=~//pdfs/*.pdf for f in $FILES do echo "Processing $f file..." pdftotext -enc UTF-8 $f done
(I am not providing a link because if you cannot create a text file and copy this text to it — and crucially edit it slightly for your own needs — then you probably won’t have much luck with these steps anyway.)
* Open the bash shell (Terminal.app or win-bash or equivalent) and execute the following: cd pdfs
./convertmyfiles.sh
Now you will have a set of text files (ending with .txt) converted as a set. These will probably need tidying up, as the conversion tends to include cruft like headers, page numbers, etc. Sometimes the layout: single
pdftotext -h
Note that in the file provided, the extracted text is given a UTF-8 (Unicode) character encoding, which is what you should be using whenever possible.
Example: (from Terminal.app on my Mac)
Last login: Thu Jul 31 11:29:44 on ttys001
KBs-MBP13:~ kbenoit$ cd pdfs
KBs-MBP13:pdfs kbenoit$ pwd
/Users/kbenoit/pdfs
KBs-MBP13:pdfs kbenoit$ rm *txt
KBs-MBP13:pdfs kbenoit$ ls
11centerpartiet2004.pdf
11folkpartiet2004.pdf
11kristdemokraterna2004.pdf
11kristdemokraterna2004_300k.pdf
11miljopartiet_de_grone2004.pdf
13radikale_venste2004_ENGL.pdf
13socialdemokraterne2004.pdf
21Ecolo_programme_2004.pdf
21Mouvement_Reformateur_100_propositions_pour_2_Θlect_Vlaams_en_europe.PDF
21SPA_europeesprogramma2004.pdf
convertmyfiles.sh
KBs-MBP13:pdfs kbenoit$ ./convertmyfiles.sh
Processing /Users/kbenoit//pdfs/11centerpartiet2004.pdf file...
Processing /Users/kbenoit//pdfs/11folkpartiet2004.pdf file...
Processing /Users/kbenoit//pdfs/11kristdemokraterna2004.pdf file...
Processing /Users/kbenoit//pdfs/11kristdemokraterna2004_300k.pdf file...
Processing /Users/kbenoit//pdfs/11miljopartiet_de_grone2004.pdf file...
Processing /Users/kbenoit//pdfs/13radikale_venste2004_ENGL.pdf file...
Processing /Users/kbenoit//pdfs/13socialdemokraterne2004.pdf file...
Processing /Users/kbenoit//pdfs/21Ecolo_programme_2004.pdf file...
Processing /Users/kbenoit//pdfs/21SPA_europeesprogramma2004.pdf file...
KBs-MBP13:pdfs kbenoit$ ls
11centerpartiet2004.pdf
11centerpartiet2004.txt
11folkpartiet2004.pdf
11folkpartiet2004.txt
11kristdemokraterna2004.pdf
11kristdemokraterna2004.txt
11kristdemokraterna2004_300k.pdf
11kristdemokraterna2004_300k.txt
11miljopartiet_de_grone2004.pdf
11miljopartiet_de_grone2004.txt
13radikale_venste2004_ENGL.pdf
13radikale_venste2004_ENGL.txt
13socialdemokraterne2004.pdf
13socialdemokraterne2004.txt
21Ecolo_programme_2004.pdf
21Ecolo_programme_2004.txt
21Mouvement_Reformateur_100_propositions_pour_2_Θlect_Vlaams_en_europe.PDF
21SPA_europeesprogramma2004.pdf
21SPA_europeesprogramma2004.txt
convertmyfiles.sh
KBs-MBP13:pdfs kbenoit$
Update 12 November 2015 for Windows (thanks Thomas)
For Windows, one way to do the is to use Windows PowerShell ISE (Integrated scripting environment) in Programs/Accessories as follows:
cd mypdffolder
$FILES= ls *.pdf
foreach ($f in $FILES) {
C:\Program` Files\xpdf\bin32\pdftotext -enc UTF-8 $f
}