PDF text extraction | FilingDB, Hacker News

There is a common view that extracting text from a PDF document should not be too difficult. After all, the text is right there in front of our eyes and humans consume PDF content all the time with great success. Why would it be difficult to automatically extract the text data?

It turns out that, much like working with human names is difficult due to numerous edge cases and incorrect assumptions, working with PDFs is difficult due to the extreme flexibility of the PDF format.

The main problem is that PDF was never really designed as a data input format. Rather, it was designed as an output format that gives fine-grained control over the resulting document.

At its core, the PDF format consists of a stream of instructions describing how to draw on a page. In particular, text data isn’t stored as paragraphs – or even words – but as characters which are painted at certain locations on the page. As a result, most of the content semantics are lost when a text or word document is converted to PDF – all the implied text structure is converted into an almost amorphous soup of characters floating on pages.
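To make the "stream of instructions" idea concrete, here is a minimal sketch in Python. It assumes a heavily simplified, uncompressed content stream that uses only the BT/ET, Tf, Td and Tj operators; the sample stream and the function name `painted_text` are made up for illustration. Real PDFs also involve font encodings, text matrices, kerning arrays and compressed streams, all of which this sketch ignores.

```python
import re

# A token is either a literal string "(...)" or a run of non-space characters.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|[^\s()]+")

def painted_text(content_stream):
    """Return (x, y, text) for each Tj operator in a simplified content stream."""
    x = y = 0.0          # origin of the current text line
    operands = []        # operands seen since the last operator
    fragments = []
    for tok in TOKEN.findall(content_stream):
        if tok == "BT":                      # begin text object: reset position
            x = y = 0.0
            operands.clear()
        elif tok == "Td":                    # move relative to current line start
            x += float(operands[-2])
            y += float(operands[-1])
            operands.clear()
        elif tok == "Tj":                    # paint a string at the current position
            fragments.append((x, y, operands[-1][1:-1]))
            operands.clear()
        elif tok in ("Tf", "ET"):            # font selection / end of text: ignored here
            operands.clear()
        else:
            operands.append(tok)
    return fragments

# Hypothetical stream: the word "Hello" painted as two separate fragments.
sample = "BT /F1 12 Tf 72 720 Td (Hel) Tj 18 0 Td (lo) Tj ET"
for x, y, text in painted_text(sample):
    print(x, y, repr(text))
```

Note that nothing in the stream says "Hel" and "lo" belong to the same word. An extractor has to infer that from the positions alone.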

As part of building FilingDB, we’ve extracted text data from tens of thousands of PDF documents. In the process, we have seen every single assumption we had about how PDF files are structured proven incorrect. Our mission was particularly difficult as we had to process PDF documents coming from a variety of sources, with wildly different styling, typesetting and presentation choices.

The list below documents some of the ways PDF files have made it difficult (or even impossible) to extract text contents.

PDF read protection

You may have come across PDF files which refuse to let you copy their text content. For example, here is what SumatraPDF shows when attempting to copy text from a copy-protected document.

Interestingly, the text is already visible, yet the PDF viewer is refusing to populate the clipboard with the highlighted text.

This is implemented through several “access permissions” flags, one of which controls whether copying content is allowed. It’s important to keep in mind that this restriction is not enforced by the PDF file itself: the actual PDF contents are unaffected, and it is up to the PDF renderer to honor this flag.

Needless to say, this offers no real protection against extracting the text out of the PDF, as any reasonably sophisticated PDF handling library will allow the user to either toggle the flags or ignore them.
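As a rough illustration of how little is involved, the sketch below decodes the permissions integer (the /P entry of the encryption dictionary in the PDF specification) into named flags. The bit positions follow the standard security handler's permission table; the function name `decode_permissions` and the sample values are our own, and the sketch deliberately skips everything else a real library has to do, such as decrypting the file.

```python
# Permission bits of the /P integer, as (name, mask).
# The spec numbers bits from 1, so "bit 5" corresponds to mask 1 << 4.
PERMISSION_BITS = [
    ("print",                 1 << 2),   # bit 3
    ("modify",                1 << 3),   # bit 4
    ("copy",                  1 << 4),   # bit 5: copy or extract text and graphics
    ("annotate",              1 << 5),   # bit 6
    ("fill_forms",            1 << 8),   # bit 9
    ("extract_accessibility", 1 << 9),   # bit 10
    ("assemble",              1 << 10),  # bit 11
    ("print_high_quality",    1 << 11),  # bit 12
]

def decode_permissions(p):
    """Map a (usually negative) /P value to a dict of allowed operations."""
    # Python's & works on two's-complement semantics for negative ints,
    # so the signed 32-bit value can be tested directly.
    return {name: bool(p & mask) for name, mask in PERMISSION_BITS}

# -3904 clears all of the permission bits above; -1 sets all of them.
for p in (-3904, -1):
    print(p, decode_permissions(p)["copy"])
```

The flag is advisory: a tool that wants the text can read the `copy` bit and simply carry on regardless of its value.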

Off-page characters

It is not uncommon for PDF files to contain more textual data than is actually displayed on the page. Take this page from the 4466 Nestle annual report.
