There is a common view that extracting text from a PDF document should not be too difficult. After all, the text is right there in front of our eyes and humans consume PDF content all the time with great success. Why would it be difficult to automatically extract the text data?
Turns out, much like how working with human names is difficult due to numerous edge cases and incorrect assumptions, working with PDFs is difficult due to the extreme flexibility granted by the PDF format.
The main problem is that PDF was never really designed as a data input format, but rather, it was designed as an output format giving fine grained control over the resulting document.
At its core, the PDF format consists of a stream of instructions describing how to draw on a page. In particular, text data isn’t stored as paragraphs – or even words – but as characters which are painted at certain locations on the page. As a result, most of the content semantics are lost when a text or word document is converted to PDF – all the implied text structure is converted into an almost amorphous soup of characters floating on pages.
As part of building FilingDB, we’ve extracted text data from tens of thousands of PDF documents. In the process, we have seen how every single assumption we had about how PDF files are structured was proven incorrect. Our mission was particularly difficult as we had to process PDF documents coming from a variety of sources, with wildly different styling, typesetting and presentation choices.
The list below documents some of the ways PDF files have made it difficult (or even impossible) to extract text contents.
PDF read protection
You may have come across PDF files which refuse to let you copy their text content. For example, here is what SumatraPDF shows when attempting to copy text from a copy-protected document.
Interestingly, the text is already visible, yet the PDF viewer is refusing to populate the clipboard with the highlighted text.
This is implemented through a set of “access permissions” flags, one of which controls whether copying content is allowed. It’s important to keep in mind that this restriction is not enforced by the PDF file itself – the actual PDF contents are unaffected, and it is up to the PDF renderer to honor this flag.
Needless to say, this offers no real protection against extracting the text out of the PDF, as any reasonably sophisticated PDF handling library will allow the user to either toggle the flags or ignore them.
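Concretely, these flags live in the /P entry of the document’s encryption dictionary: a signed 32-bit integer in which, per the PDF specification (ISO 32000), bit 3 gates printing and bit 5 gates copying or extracting content. A minimal sketch of decoding such a value:

```python
# Decode a PDF permission flags value (the /P entry of the encryption
# dictionary, a signed 32-bit integer). Bit numbers follow the PDF
# spec convention (1-based): bit 3 = print, bit 5 = copy/extract.

def permission_allowed(p: int, bit: int) -> bool:
    """True if the given 1-based spec bit is set in the /P value."""
    return bool(p & (1 << (bit - 1)))

def can_print(p: int) -> bool:
    return permission_allowed(p, 3)

def can_copy(p: int) -> bool:
    return permission_allowed(p, 5)

# A /P value of -60 (low byte 0b11000100) allows printing but not copying.
print(can_print(-60), can_copy(-60))
```

Since the flags are just bits in the file, any library that parses the encryption dictionary can read them back out, or ignore them entirely.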
Text outside the page
It is not uncommon for PDF files to contain more textual data than is actually displayed on the page. Take this page from the Nestle annual report. There is more text associated with this page than meets the eye. In particular, the following can be found in the content data associated with this page:
This text is actually positioned outside the page’s bounding box, so it is not displayed by most PDF viewers, but the data is there and will appear when programmatically extracting the text.
This sometimes happens due to last minute decisions to remove or replace text during the type setting process.
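A simple way to drop such stray text during extraction is to discard any character positioned outside the page’s MediaBox. A minimal sketch, assuming per-character coordinates have already been extracted (the Char type and the sample data are illustrative):

```python
from typing import NamedTuple

class Char(NamedTuple):
    c: str    # the character itself
    x: float  # position on the page, in PDF points
    y: float

def visible_text(chars: list[Char],
                 media_box: tuple[float, float, float, float]) -> str:
    """Keep only characters positioned inside the page's MediaBox."""
    x0, y0, x1, y1 = media_box
    return "".join(ch.c for ch in chars
                   if x0 <= ch.x <= x1 and y0 <= ch.y <= y1)

page = [Char("H", 100, 700), Char("i", 106, 700),
        Char("!", 9999, 700)]  # this one is placed far off the page
print(visible_text(page, (0, 0, 595, 842)))  # A4-sized MediaBox
```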
Small / invisible characters on page
PDFs sometimes contain very small or hidden text on the page. For example, here is a page from the Nestle annual report.
The page contains small white text on white background with the following contents:
“Wyeth Nutrition logo Identity Guidance to markets
Vevey Octobre RCC / CI & D”
This is sometimes done for the benefit of accessibility, similar to how the alt attribute is used in HTML.
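Whatever the motivation, such text can be flagged with simple heuristics once per-character styling is available: very small font sizes, or a fill colour equal to the page background. A sketch under those assumptions (the threshold and data layout are illustrative):

```python
def is_hidden(size_pt: float, fill_rgb,
              background_rgb=(1.0, 1.0, 1.0),
              min_size_pt: float = 2.0) -> bool:
    """Heuristic: tiny text, or text painted in the background colour."""
    return size_pt < min_size_pt or fill_rgb == background_rgb

chars = [
    ("W", 4.0, (1.0, 1.0, 1.0)),   # white-on-white
    ("x", 0.5, (0.0, 0.0, 0.0)),   # sub-pixel sized
    ("A", 10.0, (0.0, 0.0, 0.0)),  # normally styled text
]
visible = "".join(c for c, size, rgb in chars if not is_hidden(size, rgb))
print(visible)  # only the normally-styled characters survive
```

Whether hidden text should be kept or dropped depends on the use case: for accessibility it is content, for data extraction it is usually noise.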
Too many spaces
Sometimes PDFs include extra spaces between the letters of a word. This is most likely done for kerning purposes. (“Kerning” is the process of adjusting the distances between characters during typesetting.)
For example, the Hikma Pharma annual report contains the following text:
“ch airman’s ss tat em en t”
Reconstructing the original text is a difficult problem to solve generally. Our most successful approach has been applying OCR techniques.
Not enough spaces
Sometimes PDFs do not contain spaces or replace them with a different character.
Example 1:
The following extract is from the SEB annual report.
The extracted text shows:
“Tenyearsafterthefinancialcrisisstarted”
Example 2: The Eurobank annual report shows the following:
Extracting the text gives:
“On_April_7, _ , _ the_competent_authorities ”
Again, our most successful solution was to run OCR on these pages.
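Short of OCR, both spacing problems can sometimes be tackled geometrically: ignore the space characters the PDF claims exist and re-derive word breaks from the horizontal gaps between glyphs. A sketch, with illustrative coordinates and gap threshold:

```python
def rebuild_words(chars, gap_threshold=1.5):
    """chars: list of (character, x_start, x_end) on one text line,
    sorted left to right. Emit a space only where a real gap exists."""
    out = []
    prev_end = None
    for c, x0, x1 in chars:
        if c == " ":          # drop the spaces the PDF claims exist
            continue
        if prev_end is not None and x0 - prev_end > gap_threshold:
            out.append(" ")   # a genuinely large gap: a word break
        out.append(c)
        prev_end = x1
    return "".join(out)

# "ch airman" with a spurious space but no real gap, then a true gap:
line = [("c", 0, 1), ("h", 1, 2), (" ", 2, 2.2), ("a", 2.2, 3.2),
        ("i", 3.2, 4), ("r", 4, 5), ("m", 5, 6), ("a", 6, 7),
        ("n", 7, 8), ("s", 12, 13)]
print(rebuild_words(line))  # "chairman s"
```

The threshold would in practice be scaled to the font size of the line, which is why this only works when the glyph geometry is reliable.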
Custom fonts and encodings
PDF font handling is complex, to say the least. To understand how PDF files store text data, we must first know about glyphs, glyph names and fonts:
- A glyph is a set of instructions describing how to draw a symbol or character.
- A glyph name is the name associated with that glyph. For example, “trademark” for the “™” glyph and “a” for the “a” glyph.
- Fonts are lists of glyphs with associated glyph names. For example, most fonts have a glyph that most humans would recognize as the letter “a”, with different fonts showing various ways of drawing that letter.
In a PDF, the characters are stored as numbers, called “codepoints”. To decide what to draw on the screen, a renderer has to go:
codepoint -> glyph name -> glyph
For example, a PDF document can contain codepoint 116, which it maps into the glyph name “t” which, in turn, maps into the glyph describing how to draw “t” on the screen.
Now, most PDF files use a standard codepoint encoding. A codepoint encoding is a set of rules that assign meaning to the codepoints themselves. For example, ASCII and Unicode both use codepoint 116 to represent the letter “t”. Unicode maps codepoint 9786 to “white smiling face”, rendered as ☺, whereas ASCII is not defined at that codepoint.
However, PDF documents sometimes use their own custom encoding together with custom fonts. It might seem strange, but a document can use codepoint 1 to represent the letter “t”. It will map codepoint 1 into the glyph name “c1”, which will map into a glyph describing how to draw the letter “t”.
While for a human the end result looks the same, a machine will get confused by the codepoints it is seeing. If the codepoints do not follow a standard encoding, then it is virtually impossible to programmatically know what codepoints 1, 2 and 3 represent.
Why would a PDF document contain nonstandard fonts and encodings? One reason is to make text extraction more difficult. Another is the use of subfonts: most fonts contain glyphs for a very large number of codepoints, and a PDF might only use a subset of these. To save space, a PDF creator can strip away all unneeded glyphs and create a compact subfont, which will most likely use a non-standard encoding.
One workaround is to extract the font glyphs from the document, run them through OCR software and build a map from font glyph to Unicode. This then lets you translate from the font-specific encoding to the Unicode encoding. For example: codepoint 1 is mapped to name “c1” which, based on looking at the glyph, should be a “t”, which is Unicode codepoint 116.
The encoding map that you’ve just generated, the one going from 1 to 116, is called a ToUnicode map in the PDF standard. PDF documents can provide their own ToUnicode map, but it’s optional and many do not.
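The glyph-OCR workaround boils down to building a small lookup table per font. A toy sketch, with the OCR step simulated by a hand-written glyph-name-to-character map:

```python
# Custom encoding from a hypothetical subfont: codepoint -> glyph name.
encoding = {1: "c1", 2: "c2", 3: "c3"}

# What OCR-ing each glyph's drawing might tell us: glyph name -> character.
# (Hand-written here; in practice this comes from rendering each glyph
# and running OCR on the result.)
ocr_result = {"c1": "t", "c2": "h", "c3": "e"}

# The ToUnicode-style map: font codepoint -> Unicode codepoint.
to_unicode = {cp: ord(ocr_result[name]) for cp, name in encoding.items()}

def decode(codepoints):
    """Translate font-specific codepoints into a Unicode string."""
    return "".join(chr(to_unicode[cp]) for cp in codepoints)

print(to_unicode[1])     # 116, i.e. "t"
print(decode([1, 2, 3])) # "the"
```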
Word and paragraph detection
Reconstructing paragraphs and even words from the amorphous character soup of PDF files is a difficult task.
The PDF document provides a list of characters on a page and it is up to the consumer to identify words and paragraphs. Humans are naturally effective at doing this as reading is a widespread skill.
The common approach is to use a grouping or clustering algorithm that compares letter sizes, positions and alignments to determine what constitutes a word or paragraph. Naive implementations can easily have complexity greater than O(n²), resulting in long processing times on busy pages.
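One way to keep the clustering tractable is to bucket characters into lines by their (rounded) vertical position and only then sort each line horizontally, avoiding the all-pairs comparison. A sketch with an illustrative tolerance:

```python
from collections import defaultdict

def group_into_lines(chars, y_tolerance=2.0):
    """chars: list of (character, x, y). Returns text lines, top to bottom.
    Bucketing by a rounded baseline avoids O(n^2) pairwise comparisons."""
    lines = defaultdict(list)
    for c, x, y in chars:
        key = round(y / y_tolerance)  # chars on ~the same baseline share a key
        lines[key].append((x, c))
    ordered = []
    for key in sorted(lines, reverse=True):  # PDF y coordinates grow upward
        ordered.append("".join(c for _, c in sorted(lines[key])))
    return ordered

# Characters arrive in arbitrary order, with slightly jittered baselines:
chars = [("l", 10, 700), ("o", 16, 700.8), ("H", 0, 700.4),
         ("e", 5, 700.2), ("l", 13, 699.6), ("!", 0, 650)]
print(group_into_lines(chars))  # ['Hello', '!']
```

Real implementations also have to handle varying font sizes, superscripts and rotated text, which is where the tolerances stop being simple constants.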
Text and paragraph order
Deciding on text and paragraph order is difficult on two levels.
First, sometimes there is no correct answer. While documents with conventional, single column typesetting have a natural order of reading, documents with more adventurous layouts are challenging. As an example, it is not clear if the following inset should appear before, after, or during the article it is placed next to:
Second, even when the answer is clear to a human, determining robust paragraph order is a very difficult problem to solve, perhaps even AI-hard. This might sound like an extreme statement, however there are cases where the correct paragraph order can only be decided by understanding the text content.
Consider the following two-column layout, describing how to prepare a vegetable salad:
In the western world, a reasonable assumption is that reading is done left to right and top to bottom. So the best we can do without looking at the contents is to reduce the answer to two options: A B C D and A C B D.
By looking at the content, understanding what it is talking about and knowing that vegetables are washed before chopping, we can determine that A C B D is the correct order. Determining this algorithmically is a difficult problem.
That being said, a “works most times” approach is to rely on the order in which the text is stored inside the PDF document. This usually corresponds to the order in which the text was inserted at creation time and, for large bodies of text spanning multiple paragraphs, it tends to reflect the writer-intended order.
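When the stored order is clearly wrong, a geometry-based fallback is to detect columns first and then read each column top to bottom, which recovers the A C B D order in layouts like the salad example above. A sketch with illustrative block coordinates:

```python
def column_order(blocks, column_gap=50):
    """blocks: list of (label, x, y) with y measured from the top of the page.
    Group blocks into columns by x position, then read each column downward."""
    columns = []  # list of (column_x, [(y, label), ...])
    for label, x, y in sorted(blocks, key=lambda b: b[1]):
        if columns and abs(x - columns[-1][0]) < column_gap:
            columns[-1][1].append((y, label))  # same column as the previous block
        else:
            columns.append((x, [(y, label)]))  # start a new column
    order = []
    for _, col in columns:
        order.extend(label for _, label in sorted(col))
    return order

# The salad example: A and C in the left column, B and D in the right.
blocks = [("A", 0, 0), ("B", 300, 0), ("C", 0, 400), ("D", 300, 400)]
print(column_order(blocks))  # column-first reading gives ['A', 'C', 'B', 'D']
```

This is essentially a simplified column detection pass; it still cannot tell whether A C B D is semantically right, only that it is the column-consistent reading.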
Embedded images
It is not uncommon for some (or all) of the PDF content to actually be a scan. In these cases, there is no text data to extract directly, so we have to resort to OCR techniques.
As an example, the Yell annual report is only available as a document scan: