Build your own Resume Parser Using Python and NLP

A step-by-step guide to building your own resume parser using Python and natural language processing (NLP).

Let’s start by making one thing clear: a resume is a brief summary of your skills and experience, usually one or two pages long, while a CV is a longer, more detailed account of what the applicant is capable of. With that said, let’s dive into building a parser tool using Python and basic natural language processing techniques.

Resumes are a great example of unstructured data. There is no widely accepted resume layout: each resume has its own formatting style and its own text blocks, and even the category titles vary a lot. Parsing multilingual resumes is an even bigger challenge.

One common misconception about building a resume parser is that it is an easy task. It is not.

Consider just one sub-problem: predicting the applicant’s name from the resume.

There are millions of personal names around the world, ranging from Björk Guðmundsdóttir to 毛泽东, from Наина Иосифовна to Nguyễn Tấn Dũng. Many cultures use middle initials, as in Maria J. Sampson, while others use prefixes extensively, as in Ms. Maria Brown. Trying to build a database of names is a hopeless effort because you’ll never keep up with it.

So, now that we understand how complex the problem is, shall we start building our own resume parser?

Understanding the tools

We’ll use Python 3 for its wide range of readily available libraries and its general acceptance in the data science field.

We’ll also use nltk for NLP (natural language processing) tasks such as stop word filtering and tokenization, and docx2txt and pdfminer.six for extracting text from MS Word and PDF files.

We assume you already have Python 3 and pip3 on your system, possibly inside a virtualenv; we won’t go into the details of installing them. We also assume you are running a POSIX-based system such as Linux (Debian-based) or macOS.

Converting resumes into plain text

Extracting text from PDF files

Let’s start by extracting text from PDF files with pdfminer. You can install it with the pip3 (Python package installer) utility or compile it from source code (not recommended). With pip, it is as simple as running pip3 install pdfminer.six at the command prompt.

With pdfminer installed, extracting text from a PDF file takes only a few lines of code.
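A minimal sketch of such an extraction function, assuming pdfminer.six is installed and using its pdfminer.high_level.extract_text helper. The function name extract_text_from_pdf and the up-front missing-file check are illustrative choices, not part of the library.

```python
from pathlib import Path


def extract_text_from_pdf(pdf_path):
    """Return the text content of a PDF file as a single string."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")
    # Imported lazily so the missing-file check above runs even when
    # pdfminer.six is not installed.
    from pdfminer.high_level import extract_text
    return extract_text(str(path))
```

Calling extract_text_from_pdf("resume.pdf") should return the whole document as one string, which is the input the field extractors later in this guide expect.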

Pretty simple, right? PDF is a very popular format for resumes, but some people prefer docx and doc. Let’s move on to extracting text from these formats as well.

Extracting text from docx files

Extracting text from docx files is very similar to what we’ve done for PDF files. Install the required dependency, docx2txt, with pip (pip3 install docx2txt) and then write some code to do the actual work.

And the code is as follows:
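A minimal sketch, assuming the docx2txt package is installed and using its docx2txt.process function. The helper names and the tab-to-space cleanup are illustrative choices.

```python
from pathlib import Path


def clean_text(text):
    """Replace tabs with spaces so later regexes see uniform whitespace."""
    return text.replace("\t", " ")


def extract_text_from_docx(docx_path):
    """Return the text of a .docx file, or None if the file has no text."""
    path = Path(docx_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such docx file: {path}")
    # Imported lazily so the helpers above work without docx2txt installed.
    import docx2txt
    text = docx2txt.process(str(path))
    return clean_text(text) if text else None
```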

So simple. The problem arises when we try to extract text from old-fashioned doc files. The docx2txt package does not handle this format correctly, so we’ll need a trick to extract text from it.

Extracting text from doc files

In order to extract text from doc files, we’ll use the neat but extremely powerful catdoc command line tool from Pete Warden.

Catdoc reads an MS Word file and prints readable ASCII text to stdout, just like the Unix cat command. We’ll install it with apt (sudo apt-get install catdoc), as we’re running on Ubuntu Linux. Use your favorite package manager instead if you prefer, or build the utility from source code.

Once it is installed, we can write the code that spawns a subprocess, captures its stdout into a string variable and returns it, as we did for the pdf and docx formats.

Now that we’ve got the resume in text format, we can start extracting specific fields from it.
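The subprocess approach could be sketched as follows, assuming the catdoc binary is on the PATH. The -w flag turns off catdoc’s line wrapping; the function name and the error handling are illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def extract_text_from_doc(doc_path):
    """Run catdoc on a .doc file and return its stdout as a string."""
    path = Path(doc_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such doc file: {path}")
    if shutil.which("catdoc") is None:
        raise RuntimeError("catdoc is not installed or not on PATH")
    # capture_output collects stdout/stderr; text=True decodes them to str;
    # check=True raises CalledProcessError if catdoc exits with an error.
    result = subprocess.run(
        ["catdoc", "-w", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```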

Extracting fields from resumes

Extracting names from resumes

This may seem easy, but in reality one of the most challenging tasks in resume parsing is extracting the person’s name. There are millions of names around the world and, living in a globalized world, we may come across a resume from anywhere.

This is where natural language processing comes into play. Let’s start by installing a new library called nltk (Natural Language Toolkit), which is quite popular for such tasks: pip3 install nltk.

Now it’s time to write some code to test what nltk’s “named entity recognition” (NER) functionality is capable of doing.
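A minimal sketch of such a test, assuming nltk is installed and its tokenizer, part-of-speech tagger and named entity chunker models have already been fetched with nltk.download (the exact resource names depend on the nltk version). The function names are illustrative.

```python
def join_leaves(leaves):
    """Join the (token, POS tag) pairs of one entity chunk into a name."""
    return " ".join(token for token, _tag in leaves)


def extract_names(text):
    """Return every PERSON entity that nltk's chunker finds in the text."""
    # Imported lazily so join_leaves works without nltk installed.
    import nltk

    names = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # ne_chunk returns a tree; named entities are labeled subtrees,
        # while ordinary words stay as plain (token, tag) tuples.
        for chunk in nltk.ne_chunk(tagged):
            if hasattr(chunk, "label") and chunk.label() == "PERSON":
                names.append(join_leaves(chunk.leaves()))
    return names
```

On a resume, the applicant’s name is often the first PERSON entity found, but that is a heuristic rather than a guarantee.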

In practice, nltk’s person name detection is far from accurate. Run this code and see if it works for you. If it doesn’t, you may try the NER model from Stanford University. A detailed tutorial is available at Listendata.

Extracting phone numbers from resumes

Compared with person names, phone numbers are much easier to extract. Generally speaking, a simple regex will do fine for most cases. Try the following code to extract phone numbers from a resume. You may choose to modify the regex to your taste.
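A minimal sketch using only the standard library. The pattern is one illustrative choice among many: an optional + or opening parenthesis, a non-zero digit, then at least eight digits, spaces, dots, hyphens or parentheses, ending in a digit.

```python
import re

# Illustrative pattern; adjust it to the phone formats you expect.
PHONE_REGEX = re.compile(r"[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]")


def extract_phone_number(resume_text):
    """Return the first phone-like string in the text, or None."""
    match = PHONE_REGEX.search(resume_text)
    return match.group(0) if match else None
```

For example, extract_phone_number("Call me: +1 (555) 123-4567 today") returns "+1 (555) 123-4567". Note that any long run of digits, such as a postal code followed by a year, can also match, so some post-filtering may be needed.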

Extracting email addresses from resumes

Similar to phone number extraction, this is also pretty straightforward. Just fire up a regular expression and extract the email addresses from the resume. The first match is generally the applicant’s actual email address, since people tend to place their contact details in the header section of their resumes.
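A minimal sketch of that idea with the standard library. The pattern is deliberately simple and illustrative; it does not cover every address that the email standards allow.

```python
import re

# Local part, "@", domain labels, then a top-level domain of 2+ letters.
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(resume_text):
    """Return all email-like strings in the order they appear."""
    return EMAIL_REGEX.findall(resume_text)


def extract_primary_email(resume_text):
    """Return the first email found (usually in the header), or None."""
    emails = extract_emails(resume_text)
    return emails[0] if emails else None
```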

Extracting skills from the resumes

You’ve done well so far. This is the section where things get trickier. Extracting skills from a text is a very challenging task and, in order to increase accuracy, you’ll need a database or an API to verify whether a piece of text is a skill or not.

Check out the following code. It first uses the nltk library to filter out the stop words and generate tokens.

It’s not easy to maintain an up-to-date database of skills for each and every industry. You may wish to have a look at the Skills API, which gives you an easy and affordable alternative to maintaining your own skills database. The Skills API features 70,000+ skills, well organized and updated often. Just check the following code and see how easy it would be to extract the skills from a resume if the Skills API was used.
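A minimal sketch of this approach, assuming nltk and its stop word and tokenizer data are installed. The tiny SKILLS_DB set is a stand-in for a real skills database, and matching one- to three-word phrases against it is an illustrative design choice.

```python
# Illustrative stand-in for a real skills database.
SKILLS_DB = {"machine learning", "data science", "python", "word", "excel"}


def candidate_phrases(tokens, max_n=3):
    """Build all lowercase 1..max_n word phrases from consecutive tokens."""
    lowered = [t.lower() for t in tokens]
    phrases = set()
    for n in range(1, max_n + 1):
        for i in range(len(lowered) - n + 1):
            phrases.add(" ".join(lowered[i:i + n]))
    return phrases


def match_skills(tokens, skills_db=SKILLS_DB):
    """Return the candidate phrases that appear in the skills database."""
    return candidate_phrases(tokens) & {s.lower() for s in skills_db}


def extract_skills(text, skills_db=SKILLS_DB):
    """Tokenize, drop stop words and punctuation, then match skills."""
    # Imported lazily so the pure helpers above work without nltk.
    import nltk
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))
    tokens = [
        t for t in nltk.word_tokenize(text)
        if t.isalpha() and t.lower() not in stop_words
    ]
    return match_skills(tokens, skills_db)
```

Multi-word phrases matter here because a skill such as "machine learning" would be missed if tokens were only checked one at a time.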

Before proceeding, you’ll first need to install a new dependency called requests, again using the pip tool (pip3 install requests).

Now below is the source code using the Skills API.
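The lookup could be sketched as follows. Everything about the service here is an assumption made for illustration: the placeholder URL, the q query parameter, the apikey header and the JSON list of skill names in the response are hypothetical, so check the Skills API documentation for the real endpoint and response format.

```python
# Hypothetical placeholder; replace with the endpoint from the API docs.
SKILLS_API_URL = "https://example.com/skills"


def skill_exists(phrase, api_key, http_get=None, url=SKILLS_API_URL):
    """Return True if the (assumed) API lists the phrase as a skill."""
    if http_get is None:
        # Imported lazily; a custom http_get can be passed for testing.
        import requests
        http_get = requests.get
    response = http_get(
        url,
        params={"q": phrase},          # assumed query parameter
        headers={"apikey": api_key},   # assumed authentication header
        timeout=10,
    )
    response.raise_for_status()
    # Assumed response shape: a JSON list of matching skill names.
    matches = response.json()
    return any(phrase.lower() == str(m).lower() for m in matches)
```

Each candidate phrase from the resume would be passed to skill_exists in turn. Caching the results is worth considering, since the same words recur across resumes.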

Extracting education and schools from resumes

If you’ve understood the principles of skills extraction, you’ll be more comfortable with the education and school extraction topic.

Not surprisingly, there are many ways of doing it.

First, you can use a database that contains all (or nearly all) of the school names from around the world.

You may use our school names database to train your own NER model using spaCy or any other NLP framework, but we’ll follow a much simpler way. In the following code, we look for words such as “university”, “college”, etc. in named entities labeled as organizations. Believe it or not, it performs quite well in most cases. You may extend the reserved_words list in the code as you wish.

Similar to person name extraction we’ll first filter out the reserved words and punctuation. Secondly, we’ll store all the “organization typed” named entities into a list and check if they contain reserved words or not.

Check the following code, as it speaks for itself.
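A minimal sketch of the reserved-word approach, assuming nltk and its tokenizer, tagger and named entity chunker data are installed. The entries in RESERVED_WORDS (the reserved words list described above) and the function names are illustrative.

```python
# Words that suggest an organization is a school; extend as needed.
RESERVED_WORDS = [
    "school", "college", "university", "academy", "faculty", "institute",
]


def looks_like_school(org_name, reserved_words=RESERVED_WORDS):
    """Return True if the organization name contains a reserved word."""
    lowered = org_name.lower()
    return any(word in lowered for word in reserved_words)


def extract_education(text):
    """Return ORGANIZATION entities that look like schools, in order."""
    # Imported lazily so looks_like_school works without nltk installed.
    import nltk

    schools = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for chunk in nltk.ne_chunk(tagged):
            if hasattr(chunk, "label") and chunk.label() == "ORGANIZATION":
                name = " ".join(token for token, _tag in chunk.leaves())
                # Keep each school once, preserving first-seen order.
                if looks_like_school(name) and name not in schools:
                    schools.append(name)
    return schools
```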

Last words:

Resume parsing is tricky, and there are hundreds of ways of doing it. We’ve covered just one easy way, so unfortunately you should not expect miracles: it may work for some layouts and fail for others.

If you need a professional solution, have a look at our hosted solution, the Resume Parser API. It is well maintained and supported by its API provider, which also maintains the Skills API. It is pre-trained on thousands of different resume layouts and is the most affordable solution on the market.

A free tier is available and no credit card is required during registration. Give it a try to see whether it suits your needs. Feel free to leave comments below. Any contribution is appreciated.
