I have a collection of .pdf files with comments that were added in Adobe Acrobat. I would like to be able to analyze these comments, but I'm kind of stuck on extracting them. I've looked at the pdftools package, but it seems to only be able to extract the text and not the comments. Is there a method available for extracting the comments within R?
3 Answers
PyMuPDF (https://pymupdf.readthedocs.io/en/latest/) is the only Python library I have found that works.
Installation in Debian/Ubuntu-based distributions:
apt-get install python3-fitz
Script:
import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
for page in doc:
    # page.annots() yields each annotation (comment) on the page
    for annot in page.annots():
        print(annot.info["content"])
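Since the goal is to analyze the comments in R, one option is to write them to a CSV file that R can load with read.csv(). The following is only a minimal sketch: it assumes PyMuPDF's annot.info dictionary (with the "content" and "title" keys), annot.type, and page.number behave as documented, and the names annotation_rows, write_comments_csv, and comments.csv are hypothetical.

```python
import csv

FIELDS = ["page", "type", "author", "content"]


def annotation_rows(pages):
    """Yield one dict per annotation from an iterable of PyMuPDF pages."""
    for page in pages:
        # page.annots() yields the page's annotations (comments, highlights, ...)
        for annot in page.annots():
            info = annot.info
            yield {
                "page": page.number + 1,          # page.number is 0-based
                "type": annot.type[1],            # type name, e.g. "Text"
                "author": info.get("title", ""),  # PDF stores the author as "title"
                "content": info.get("content", ""),
            }


def write_comments_csv(pdf_path, csv_path):
    """Hypothetical helper: dump all annotations of one PDF to a CSV file."""
    # Imported here so annotation_rows() can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(annotation_rows(doc))
    finally:
        doc.close()
```

Calling write_comments_csv("example.pdf", "comments.csv") and then read.csv("comments.csv") in R should give a data.frame with one row per comment.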
-
BTW, people may find it useful to know that to install fitz in a conda environment, you should activate the environment, then run pip install fitz. See github.com/kastman/fitz/blob/master/doc/source/installing.rst. Or, even better, pip install pymupdf (it installs fitz, and avoids errors like this: github.com/pymupdf/PyMuPDF/issues/523#issuecomment-830746585). May 20, 2022 at 3:23
-
Is there a way to make fitz extract the highlighted content as well? I created a related question: stackoverflow.com/questions/72311956/… May 20, 2022 at 3:27
Did you try PoDoFo or another open-source tool that can access the PDF elements? You can also look at "Extracting PDF annotations/comments" here on Stack Overflow if you are willing to do a little programming.
-
I've tried a few tools, but they all seem focused on extracting images and text. The Python method you linked to, combined with the reticulate package, looks promising, and I'd actually played around with that a bit last week, but the poppler module doesn't seem to want to install. I guess there isn't a native solution in R. Jun 14, 2018 at 15:06
-
I got it. Sometimes it's hard to find a working solution for such specific cases. Have you tried looking for a paid solution that would work? Some of them offer a free trial. Which programming language and platform do you prefer? – PDFix Jun 15, 2018 at 13:49
-
My preference would be a method that imported the comments into R on Windows as a data.frame. I was finally able to get poppler working using the Linux subsystem on Windows, which is less than optimal but better than nothing. Jun 18, 2018 at 19:11
Export the comments as an Excel file, then import it into R? E.g., in PDF-XChange Editor, go to Comment > Summarize Comments > Export and export into whatever format you want. The process is similar in Adobe.
-
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Community Bot Sep 14, 2021 at 5:34