There is a common view that extracting text from a PDF document should not be too difficult. After all, the text is right there in front of our eyes and humans consume PDF content all the time with great success. Why would it be difficult to automatically extract the text data?
Turns out, much like how working with human names is difficult due to numerous edge cases and incorrect assumptions, working with PDFs is difficult due to the extreme flexibility granted by the PDF format.
The main problem is that PDF was never really designed as a data input format, but rather, it was designed as an output format giving fine grained control over the resulting document.
At its core, the PDF format consists of a stream of instructions describing how to draw on a page. In particular, text data isn’t stored as paragraphs – or even words – but as characters which are painted at certain locations on the page. As a result, most of the content semantics are lost when a text or word document is converted to PDF – all the implied text structure is converted into an almost amorphous soup of characters floating on pages.
As part of building FilingDB, we’ve extracted text data from tens of thousands of PDF documents. In the process, we have seen how every single assumption we had about how PDF files are structured was proven incorrect. Our mission was particularly difficult as we had to process PDF documents coming from a variety of sources, with wildly different styling, typesetting and presentation choices.
The list below documents some of the ways PDF files have made it difficult (or even impossible) to extract text contents.
PDF read protection
You may have come across PDF files which refuse to let you copy their text content. For example, here is what SumatraPDF shows when attempting to copy text from a copy-protected document.
Interestingly, the text is already visible, yet the PDF viewer is refusing to populate the clipboard with the highlighted text.
This is implemented through a set of “access permissions” flags, one of which controls whether copying content is allowed. It’s important to keep in mind that this restriction is not enforced by the PDF file itself – the actual PDF contents are unaffected, and it is up to the PDF renderer to honor this flag.
Needless to say, this offers no real protection against extracting the text out of the PDF, as any reasonably sophisticated PDF handling library will allow the user to either toggle the flags or ignore them.
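Concretely, these flags live in the /P entry of the document’s encryption dictionary: a signed 32-bit integer in which, per the PDF specification (ISO 32000), bit 3 gates printing and bit 5 gates copying or extracting content. A minimal sketch of decoding such a value:

```python
# Decode a PDF permission flags value (the /P entry of the encryption
# dictionary, a signed 32-bit integer). Bit numbers follow the PDF
# spec convention (1-based): bit 3 = print, bit 5 = copy/extract.

def permission_allowed(p: int, bit: int) -> bool:
    """True if the given 1-based spec bit is set in the /P value."""
    return bool(p & (1 << (bit - 1)))

def can_print(p: int) -> bool:
    return permission_allowed(p, 3)

def can_copy(p: int) -> bool:
    return permission_allowed(p, 5)

# A /P value of -60 (low byte 0b11000100) allows printing but not copying.
print(can_print(-60), can_copy(-60))
```

Since the flags are just bits in the file, any library that parses the encryption dictionary can read them back out, or ignore them entirely.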
Text outside the page
It is not uncommon for PDF files to contain more textual data than is actually displayed on the page. Take this page from the Nestle annual report. There is more text associated with this page than meets the eye. In particular, the following can be found in the content data associated with this page:
This text is actually positioned outside the page’s bounding box, so it is not displayed by most PDF viewers, but the data is there and will appear when programmatically extracting the text.
This sometimes happens due to last minute decisions to remove or replace text during the type setting process.
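A simple way to drop such stray text during extraction is to discard any character positioned outside the page’s MediaBox. A minimal sketch, assuming per-character coordinates have already been extracted (the Char type and the sample data are illustrative):

```python
from typing import NamedTuple

class Char(NamedTuple):
    c: str    # the character itself
    x: float  # position on the page, in PDF points
    y: float

def visible_text(chars: list[Char],
                 media_box: tuple[float, float, float, float]) -> str:
    """Keep only characters positioned inside the page's MediaBox."""
    x0, y0, x1, y1 = media_box
    return "".join(ch.c for ch in chars
                   if x0 <= ch.x <= x1 and y0 <= ch.y <= y1)

page = [Char("H", 100, 700), Char("i", 106, 700),
        Char("!", 9999, 700)]  # this one is placed far off the page
print(visible_text(page, (0, 0, 595, 842)))  # A4-sized MediaBox
```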
Small / invisible characters on page
PDFs sometimes contain very small or hidden text on the page. For example, here is a page from the Nestle annual report.
The page contains small white text on white background with the following contents:
“Wyeth Nutrition logo Identity Guidance to markets
Vevey Octobre RCC / CI & D”
This is sometimes done for the benefit of accessibility, similar to how the alt attribute is used in HTML.
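Whatever the motivation, such text can be flagged with simple heuristics once per-character styling is available: very small font sizes, or a fill colour equal to the page background. A sketch under those assumptions (the threshold and data layout are illustrative):

```python
def is_hidden(size_pt: float, fill_rgb,
              background_rgb=(1.0, 1.0, 1.0),
              min_size_pt: float = 2.0) -> bool:
    """Heuristic: tiny text, or text painted in the background colour."""
    return size_pt < min_size_pt or fill_rgb == background_rgb

chars = [
    ("W", 4.0, (1.0, 1.0, 1.0)),   # white-on-white
    ("x", 0.5, (0.0, 0.0, 0.0)),   # sub-pixel sized
    ("A", 10.0, (0.0, 0.0, 0.0)),  # normally styled text
]
visible = "".join(c for c, size, rgb in chars if not is_hidden(size, rgb))
print(visible)  # only the normally-styled characters survive
```

Whether hidden text should be kept or dropped depends on the use case: for accessibility it is content, for data extraction it is usually noise.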
Too many spaces
Sometimes PDFs include extra spaces between the letters of a word. This is most likely done for kerning purposes. (“Kerning” is the process of adjusting the distances between characters during typesetting.)
For example, the Hikma Pharma annual report contains the following text:
“ch airman’s ss tat em en t”
Reconstructing the original text is a difficult problem to solve generally. Our most successful approach has been applying OCR techniques.
Not enough spaces
Sometimes PDFs do not contain spaces or replace them with a different character.
Example 1:
The following extract is from the SEB annual report.
The extracted text shows:
“Tenyearsafterthefinancialcrisisstarted”
Example 2: The Eurobank annual report shows the following:
Extracting the text gives:
“On_April_7, _ , _ the_competent_authorities ”
Again, our most successful solution was to run OCR on these pages.
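Short of OCR, both spacing problems can sometimes be tackled geometrically: ignore the space characters the PDF claims exist and re-derive word breaks from the horizontal gaps between glyphs. A sketch, with illustrative coordinates and gap threshold:

```python
def rebuild_words(chars, gap_threshold=1.5):
    """chars: list of (character, x_start, x_end) on one text line,
    sorted left to right. Emit a space only where a real gap exists."""
    out = []
    prev_end = None
    for c, x0, x1 in chars:
        if c == " ":          # drop the spaces the PDF claims exist
            continue
        if prev_end is not None and x0 - prev_end > gap_threshold:
            out.append(" ")   # a genuinely large gap: a word break
        out.append(c)
        prev_end = x1
    return "".join(out)

# "ch airman" with a spurious space but no real gap, then a true gap:
line = [("c", 0, 1), ("h", 1, 2), (" ", 2, 2.2), ("a", 2.2, 3.2),
        ("i", 3.2, 4), ("r", 4, 5), ("m", 5, 6), ("a", 6, 7),
        ("n", 7, 8), ("s", 12, 13)]
print(rebuild_words(line))  # "chairman s"
```

The threshold would in practice be scaled to the font size of the line, which is why this only works when the glyph geometry is reliable.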
Custom fonts and encodings
PDF font handling is complex, to say the least. To understand how PDF files store text data, we must first know about glyphs, glyph names and fonts:
- A glyph is a set of instructions describing how to draw a symbol or character.
- A glyph name is the name associated with that glyph. For example, “trademark” for the “™” glyph and “a” for the “a” glyph.
- Fonts are lists of glyphs with associated glyph names. For example, most fonts have a glyph that most humans would recognize as the letter “a”, with different fonts showing various ways of drawing that letter.
In a PDF, the characters are stored as numbers, called “codepoints”. To decide what to draw on the screen, a renderer has to go:
codepoint -> glyph name -> glyph
For example, a PDF document can contain codepoint 116, which it maps into the glyph name “t” which, in turn, maps into the glyph describing how to draw “t” on the screen.
Now, most PDF files use a standard codepoint encoding. A codepoint encoding is a set of rules that assign meaning to the codepoints themselves. For example, ASCII and Unicode both use codepoint 116 to represent the letter “t”. Unicode maps codepoint 9786 to “white smiling face”, rendered as ☺, whereas ASCII is not defined at that codepoint.
However, PDF documents sometimes use their own custom encoding together with custom fonts. It might seem strange, but a document can use codepoint 1 to represent the letter “t”. It will map codepoint 1 into the glyph name “c1”, which will map into a glyph describing how to draw the letter “t”.
While for a human the end result looks the same, a machine will get confused by the codepoints it is seeing. If the codepoints do not follow a standard encoding, then it is virtually impossible to programmatically know what codepoints 1, 2 and 3 represent.
Why would a PDF document contain nonstandard fonts and encodings? One reason is to make text extraction more difficult. Another is the use of subfonts: most fonts contain glyphs for a very large number of codepoints, and a PDF might only use a subset of these. To save space, a PDF creator can strip away all unneeded glyphs and create a compact subfont, which will most likely use a non-standard encoding.
One workaround is to extract the font glyphs from the document, run them through OCR software and build a map from font glyph to Unicode. This then lets you translate from the font-specific encoding to the Unicode encoding. For example: codepoint 1 is mapped to name “c1” which, based on looking at the glyph, should be a “t”, which is Unicode codepoint 116.
The encoding map that you’ve just generated, the one going from 1 to 116, is called a ToUnicode map in the PDF standard. PDF documents can provide their own ToUnicode map, but it’s optional and many do not.
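The glyph-OCR workaround boils down to building a small lookup table per font. A toy sketch, with the OCR step simulated by a hand-written glyph-name-to-character map:

```python
# Custom encoding from a hypothetical subfont: codepoint -> glyph name.
encoding = {1: "c1", 2: "c2", 3: "c3"}

# What OCR-ing each glyph's drawing might tell us: glyph name -> character.
# (Hand-written here; in practice this comes from rendering each glyph
# and running OCR on the result.)
ocr_result = {"c1": "t", "c2": "h", "c3": "e"}

# The ToUnicode-style map: font codepoint -> Unicode codepoint.
to_unicode = {cp: ord(ocr_result[name]) for cp, name in encoding.items()}

def decode(codepoints):
    """Translate font-specific codepoints into a Unicode string."""
    return "".join(chr(to_unicode[cp]) for cp in codepoints)

print(to_unicode[1])     # 116, i.e. "t"
print(decode([1, 2, 3])) # "the"
```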
Word and paragraph detection
Reconstructing paragraphs and even words from the amorphous character soup of PDF files is a difficult task.
The PDF document provides a list of characters on a page and it is up to the consumer to identify words and paragraphs. Humans are naturally effective at doing this as reading is a widespread skill.
The common approach is to use a grouping or clustering algorithm that compares letter sizes, positions and alignments to determine what constitutes a word or paragraph. Naive implementations can easily have complexity greater than O(n²), resulting in long processing times on busy pages.
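One way to keep the clustering tractable is to bucket characters into lines by their (rounded) vertical position and only then sort each line horizontally, avoiding the all-pairs comparison. A sketch with an illustrative tolerance:

```python
from collections import defaultdict

def group_into_lines(chars, y_tolerance=2.0):
    """chars: list of (character, x, y). Returns text lines, top to bottom.
    Bucketing by a rounded baseline avoids O(n^2) pairwise comparisons."""
    lines = defaultdict(list)
    for c, x, y in chars:
        key = round(y / y_tolerance)  # chars on ~the same baseline share a key
        lines[key].append((x, c))
    ordered = []
    for key in sorted(lines, reverse=True):  # PDF y coordinates grow upward
        ordered.append("".join(c for _, c in sorted(lines[key])))
    return ordered

# Characters arrive in arbitrary order, with slightly jittered baselines:
chars = [("l", 10, 700), ("o", 16, 700.8), ("H", 0, 700.4),
         ("e", 5, 700.2), ("l", 13, 699.6), ("!", 0, 650)]
print(group_into_lines(chars))  # ['Hello', '!']
```

Real implementations also have to handle varying font sizes, superscripts and rotated text, which is where the tolerances stop being simple constants.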
Text and paragraph order
Deciding on text and paragraph order is difficult on two levels.
First, sometimes there is no correct answer. While documents with conventional, single column typesetting have a natural order of reading, documents with more adventurous layouts are challenging. As an example, it is not clear if the following inset should appear before, after, or during the article it is placed next to:
Second, even when the answer is clear to a human, determining robust paragraph order is a very difficult problem to solve, perhaps even AI-hard. This might sound like an extreme statement, however there are cases where the correct paragraph order can only be decided by understanding the text content.
Consider the following two-column layout, describing how to prepare a vegetable salad:
In the western world, a reasonable assumption is that reading is done left to right and top to bottom. So the best we can do without looking at the contents is to reduce the answer to two options: A B C D and A C B D.
By looking at the content, understanding what it is talking about and knowing that vegetables are washed before chopping, we can determine that A C B D is the correct order. Determining this algorithmically is a difficult problem.
That being said, a “works most times” approach is to rely on the order in which the text is stored inside the PDF document. This usually corresponds to the order in which the text was inserted at creation time and, for large bodies of text spanning multiple paragraphs, it tends to reflect the writer-intended order.
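When the stored order is clearly wrong, a geometry-based fallback is to detect columns first and then read each column top to bottom, which recovers the A C B D order in layouts like the salad example above. A sketch with illustrative block coordinates:

```python
def column_order(blocks, column_gap=50):
    """blocks: list of (label, x, y) with y measured from the top of the page.
    Group blocks into columns by x position, then read each column downward."""
    columns = []  # list of (column_x, [(y, label), ...])
    for label, x, y in sorted(blocks, key=lambda b: b[1]):
        if columns and abs(x - columns[-1][0]) < column_gap:
            columns[-1][1].append((y, label))  # same column as the previous block
        else:
            columns.append((x, [(y, label)]))  # start a new column
    order = []
    for _, col in columns:
        order.extend(label for _, label in sorted(col))
    return order

# The salad example: A and C in the left column, B and D in the right.
blocks = [("A", 0, 0), ("B", 300, 0), ("C", 0, 400), ("D", 300, 400)]
print(column_order(blocks))  # column-first reading gives ['A', 'C', 'B', 'D']
```

This is essentially a simplified column detection pass; it still cannot tell whether A C B D is semantically right, only that it is the column-consistent reading.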
Embedded images
It is not uncommon for some (or all) of the PDF content to actually be a scan. In these cases, there is no text data to extract directly, so we have to resort to OCR techniques.
As an example, the Yell annual report is only available as a document scan: