Parsing and indexing PDF in Python
I have a Doxie Go scanner and I scan all the documents I receive on paper. That's nice, but it creates another problem: all the resulting PDF files have to be named, organized and stored... Doing that manually is boring and time-consuming. Of course that's something I want to automate!
I even bought Hazel a while ago. It's a nice application that monitors files in a folder and performs specific actions based on the rules you define. It works well, but I felt a bit limited and thought I could probably write something more tailored to my use case. And that would be more fun :-)
Parsing PDF in Python
A quick solution I found was to run pdftotext using subprocess. I looked at PDFMiner, a pure Python PDF parser, but I found the pdftotext output to be more accurate. On macOS, you can install it using Homebrew:
$ brew install Caskroom/cask/pdftotext
Here is a simple Python function to do that:
import subprocess


def parse_pdf(filename):
    try:
        content = subprocess.check_output(["pdftotext", '-enc', 'UTF-8', filename, "-"])
    except subprocess.CalledProcessError as e:
        print('Skipping {} (pdftotext returned status {})'.format(filename, e.returncode))
        return None
    return content.decode('utf-8')
Let's try to parse a PDF file. We'll use requests to download a sample file.
import requests
url = 'http://www.cbu.edu.zm/downloads/pdf-sample.pdf'
response = requests.get(url)
with open('/tmp/pdf-sample.pdf', 'wb') as f:
    f.write(response.content)
Let's first look at the PDF:
from IPython.display import IFrame
IFrame('http://www.cbu.edu.zm/downloads/pdf-sample.pdf', width=600, height=870)
Nothing complex. It should be easy to parse.
content = parse_pdf('/tmp/pdf-sample.pdf')
content
This works quite well. The layout is not respected, but it's the text that matters. It would be easy to write a few regular expressions to define rules based on the PDF content.
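As a quick illustration (a sketch with made-up rules and folder names, nothing from my actual setup), a few regular expressions are enough to decide where a document should go based on its text:
import re

# Hypothetical rules: a pattern to look for in the extracted text and a target folder
RULES = [
    (re.compile(r'invoice|facture', re.IGNORECASE), 'Invoices'),
    (re.compile(r'bank statement', re.IGNORECASE), 'Bank'),
]


def classify(text):
    for pattern, folder in RULES:
        if pattern.search(text):
            return folder
    return 'Unsorted'


classify(content)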
This could be the first step in naming and organizing the scanned documents. But it would be nice to have an interface to easily search all the files. I've already used MongoDB full text search in a webapp I wrote and it worked well for my use case. But I had read about Elasticsearch and always wanted to give it a try.
Elasticsearch Ingest Attachment Processor Plugin
I could just index the result from pdftotext, but I know there is a plugin that can parse PDF files.
The Mapper Attachments Type plugin is deprecated in 5.0.0. It has been replaced with the ingest-attachment plugin. So let's look at that.
Running Elasticsearch
To run Elasticsearch, the easiest way is to use Docker. As the official image from Docker Hub comes with no plugins, we'll create our own image. See Elasticsearch Plugin Management with Docker for more information.
Here is our Dockerfile:
FROM elasticsearch:5
RUN /usr/share/elasticsearch/bin/elasticsearch-plugin install ingest-attachment
Create the elasticsearch-ingest Docker image:
$ docker build -t elasticsearch-ingest .
We can now run Elasticsearch with the ingest-attachment plugin:
$ docker run -d -p 9200:9200 elasticsearch-ingest
Python Elasticsearch Client
We'll use elasticsearch-py to interact with our Elasticsearch cluster.
from elasticsearch import Elasticsearch
es = Elasticsearch()
Let's first check that our Elasticsearch cluster is alive by asking about its health:
es.cat.health()
Nice! We can start playing with our ES cluster.
As described in the documentation, we first have to create a pipeline to use the Ingest Attachment Processor Plugin:
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
OK, how do we do that using the Python client?
body = {
    "description" : "Extract attachment information",
    "processors" : [
        {
            "attachment" : {
                "field" : "data"
            }
        }
    ]
}
es.index(index='_ingest', doc_type='pipeline', id='attachment', body=body)
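There is also a more explicit way to do this if your version of elasticsearch-py exposes the ingest namespace (a sketch, equivalent to the call above, assuming es.ingest.put_pipeline is available):
# Same pipeline creation through the dedicated ingest API
es.ingest.put_pipeline(id='attachment', body=body)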
Now, we can send a document to our pipeline. Let's start by using the same example as in the documentation:
PUT my_index/my_type/my_id?pipeline=attachment
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
Using the Python client, this gives:
result1 = es.index(index='my_index', doc_type='my_type', pipeline='attachment',
                   body={'data': "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="})
result1
Let's try to get the created document based on its id:
es.get(index='my_index', doc_type='my_type', id=result1['_id'])
We can see that the binary data passed to the pipeline was a Rich Text Format file and that the content was extracted: Lorem ipsum dolor sit amet
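We can check this by decoding the data field ourselves:
import base64

base64.b64decode("e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=")
# b'{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }'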
Displaying the binary data is not very useful. It doesn't matter in this example as it's quite small, but it would be much bigger even for small files. We can exclude it from the result using _source_exclude:
es.get(index='my_index', doc_type='my_type', id=result1['_id'], _source_exclude=['data'])
Indexing PDF files
Let's try to parse the same sample PDF as before.
url = 'http://www.cbu.edu.zm/downloads/pdf-sample.pdf'
response = requests.get(url)
Note that we have to encode the content of the PDF before passing it to ES. The source field must be a base64 encoded binary.
import base64
data = base64.b64encode(response.content).decode('ascii')
result2 = es.index(index='my_index', doc_type='my_type', pipeline='attachment',
                   body={'data': data})
result2
We can get the document based on its id:
doc = es.get(index='my_index', doc_type='my_type', id=result2['_id'], _source_exclude=['data'])
doc
Or with a basic search:
es.search(index='my_index', doc_type='my_type', q='Adobe', _source_exclude=['data'])
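The same search can also be expressed with the query DSL, for instance as a match query on the extracted text (a quick sketch using the attachment.content field we saw in the document above):
query = {
    "query": {
        "match": {
            "attachment.content": "Adobe"
        }
    }
}
es.search(index='my_index', doc_type='my_type', body=query, _source_exclude=['data'])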
Of course Elasticsearch allows much more complex queries. But that's something for another time.
One interesting thing is that by printing the content, we can see that even the layout is quite accurate! Much better than the pdftotext output:
print(doc['_source']['attachment']['content'])
The ingest-attachment plugin uses the Apache text extraction library Tika. It's really powerful. It detects and extracts metadata and text from many file types.
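Besides the text itself, the attachment object in our document contains some of that metadata (a quick look; which fields are present, like content_type or content_length, depends on the file):
# Everything the attachment processor extracted, except the text itself
{key: value for key, value in doc['_source']['attachment'].items() if key != 'content'}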
Sending the file directly to Elasticsearch is nice, but in my use case, I'd like to process the file (change its title, move it to a specific location...) based on its content. I could of course update the document in ES after processing it.
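Such an update could look like this (a sketch with a made-up filename field, just to illustrate the idea):
# Hypothetical follow-up: store the name we chose for the file in the indexed document
es.update(index='my_index', doc_type='my_type', id=result2['_id'],
          body={'doc': {'filename': 'pdf-sample.pdf'}})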
It might be better in some cases to decouple the parsing and processing from the indexing. So let's check how to use Tika from Python.
Apache Tika
Tika-Python makes Apache Tika available as a Python library. It can even start a Tika REST server in the background, but this requires Java 7+ to be installed. I prefer to run the server myself using the prebuilt Docker image docker-tikaserver. That way I have control over what is running.
$ docker run --rm -p 9998:9998 logicalspark/docker-tikaserver
We can then set Tika-Python to use Client mode only:
import tika
tika.TikaClientOnly = True
from tika import parser
parsed = parser.from_file('/tmp/pdf-sample.pdf', 'http://localhost:9998/tika')
parsed
print(parsed['content'].strip())
Not sure why we get the title of the PDF inside the content. Anyway, the text is extracted properly and we even get a lot of metadata:
parsed['metadata']
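To use it programmatically, we can just pick the keys we care about (a sketch, assuming common Tika keys such as Content-Type are present for this PDF):
# Hypothetical: a few common Tika metadata keys, if present
{key: parsed['metadata'].get(key) for key in ('Content-Type', 'Author', 'Creation-Date')}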
Conclusion
We saw different methods to extract text from PDFs in Python. Depending on what you want to do, one might suit you better. And this was of course not exhaustive.
If you want to index PDFs, Elasticsearch might be all you need. The ingest-attachment plugin uses Apache Tika which is very powerful.
And thanks to Tika-Python, it's very easy to use Tika directly from Python. You can let the library start the server for you or use Docker to run your own.