A step-by-step guide to building your own resume parser using Python and natural language processing (NLP).
Let’s start by making one thing clear. A resume is a brief, one- or two-page summary of your skills and experience, while a CV is a longer, more detailed account of what the applicant is capable of doing. With that settled, let’s dive into building a parser tool using Python and basic natural language processing techniques.
Resumes are a great example of unstructured data. There is no widely accepted resume layout: each resume has its own formatting style and its own text blocks, and even the category titles vary a lot. I don’t even need to mention how big a challenge it is to parse multilingual resumes.
One of the misconceptions about building a resume parser is thinking that it is an easy task. It is not.
Let’s just talk about predicting the applicant’s name by looking at the resume.
There are millions of person names around the world, varying from Björk Guðmundsdóttir to 毛泽东, from Наина Иосифовна to Nguyễn Tấn Dũng. Many cultures are accustomed to using middle initials such as Maria J. Sampson while some cultures use prefixes extensively such as Ms. Maria Brown. Trying to build a database of names is a desperate effort because you’ll never keep up with it.
So, with an understanding of how complex this task is, shall we start building our own resume parser?
Understanding the tools
We’ll use Python 3 for the wide range of libraries already available and for its general acceptance in the data science field.
We’ll also be using nltk for NLP (natural language processing) tasks such as stop word filtering and tokenization, and docx2txt and pdfminer.six for extracting text from MS Word and PDF formats.
We assume you already have Python 3 and pip3 on your system and are possibly enjoying the marvels of virtualenv; we won’t go into the details of installing those. We also assume you are running on a POSIX-based system such as Linux (Debian-based) or macOS.
Converting resumes into plain text
Extracting text from PDF files
Let’s start by extracting text from PDF files with pdfminer. You can install it using the pip3 (Python package installer) utility or compile it from source code (not recommended). With pip it is as simple as running the following at the command prompt.
pip install pdfminer.six
With pdfminer installed, you can easily extract text from PDF files using the following code.
# example_01.py
from pdfminer.high_level import extract_text


def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)


if __name__ == '__main__':
    print(extract_text_from_pdf('./resume.pdf'))  # noqa: T001
Pretty simple, right? PDF is a very popular format for resumes, but some people prefer docx and doc formats. Let’s move on to extracting text from those formats as well.
Extracting text from docx files
Extracting text from docx files follows a procedure very similar to what we did for PDF files. Let’s install the required dependency (docx2txt) using pip and then write some code to do the actual work.
pip install docx2txt
And the code is as follows:
# example_02.py
import docx2txt


def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)

    if txt:
        return txt.replace('\t', ' ')
    return None


if __name__ == '__main__':
    print(extract_text_from_docx('./resume.docx'))  # noqa: T001
So simple. The problem arises when we try to extract text from old-fashioned doc files, which the docx2txt package does not handle, so we’ll use a small trick to extract text from them. Read on.
Extracting text from doc files
To extract text from doc files, we’ll use the small but powerful catdoc command-line tool.
Catdoc reads an MS Word file and prints readable ASCII text to stdout, just like the Unix cat command. We’ll install it with the apt tool, as we’re running on Ubuntu Linux; you can use your favorite package manager instead, or build the utility from source code.
apt-get update
yes | apt-get install catdoc
Once it is installed, we can write the code, which spawns a subprocess, captures its stdout into a string variable and returns it, just as we did for the pdf and docx formats.
# example_03.py
import subprocess  # noqa: S404
import sys


def doc_to_text_catdoc(file_path):
    try:
        process = subprocess.Popen(  # noqa: S607,S603
            ['catdoc', '-w', file_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except (
        FileNotFoundError,
        ValueError,
        subprocess.TimeoutExpired,
        subprocess.SubprocessError,
    ) as err:
        return (None, str(err))
    else:
        stdout, stderr = process.communicate()

    return (stdout.strip(), stderr.strip())


if __name__ == '__main__':
    text, err = doc_to_text_catdoc('./resume-word.doc')

    if err:
        print(err)  # noqa: T001
        sys.exit(2)

    print(text)  # noqa: T001
Now that we’ve got the resume in text format, we can start extracting specific fields from it.
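Before we do, it can be handy to wrap the three extractors behind a single helper that picks the right one based on the file extension. The following is only a sketch: the module names example_01, example_02 and example_03 assume you saved the earlier snippets under those filenames, and extract_resume_text is a name made up for this example.

# dispatch_sketch.py -- a minimal sketch, not one of the original examples
import os

# these imports assume the earlier example files sit in the same folder
from example_01 import extract_text_from_pdf
from example_02 import extract_text_from_docx
from example_03 import doc_to_text_catdoc


def extract_resume_text(file_path):
    # route the resume to the right extractor based on its file extension
    ext = os.path.splitext(file_path)[1].lower()

    if ext == '.pdf':
        return extract_text_from_pdf(file_path)
    if ext == '.docx':
        return extract_text_from_docx(file_path)
    if ext == '.doc':
        text, err = doc_to_text_catdoc(file_path)
        if err:
            raise RuntimeError(err)
        return text

    raise ValueError('Unsupported resume format: ' + ext)


if __name__ == '__main__':
    print(extract_resume_text('./resume.pdf'))  # noqa: T001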
Extracting fields from resumes
Extracting names from resumes
This may seem easy, but in reality extracting the person’s name is one of the most challenging tasks in resume parsing. There are millions of names around the world, and living in a globalized world, we may receive a resume from anywhere.
This is where natural language processing comes into play. Let’s start by installing a new library called nltk (Natural Language Toolkit), which is quite popular for such tasks.
pip install nltk
pip install numpy  # also required by nltk, for running the following code
Now it’s time to write some code to test what nltk’s “named entity recognition” (NER) functionality is capable of doing.
# example_04.py
import docx2txt
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')


def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)

    if txt:
        return txt.replace('\t', ' ')
    return None


def extract_names(txt):
    person_names = []

    for sent in nltk.sent_tokenize(txt):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'PERSON':
                person_names.append(
                    ' '.join(chunk_leave[0] for chunk_leave in chunk.leaves())
                )

    return person_names


if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    names = extract_names(text)

    if names:
        print(names[0])  # noqa: T001
Frankly, nltk’s person name detection is far from perfect. Run this code and see whether it works for you. If it doesn’t, you may try the NER model from Stanford University; a detailed tutorial is available at Listendata.
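Another option, if you’d rather stay in pure Python, is spaCy’s pre-trained statistical NER (the same library we mention later for training a custom model). The snippet below is only a sketch and is not part of the original examples; it assumes you have already run pip install spacy and python -m spacy download en_core_web_sm.

# spacy_names_sketch.py -- an alternative sketch using spaCy's pre-trained NER
import spacy

# assumes the en_core_web_sm model has been downloaded beforehand
nlp = spacy.load('en_core_web_sm')


def extract_names_spacy(txt):
    # collect every span the model labels as a person
    doc = nlp(txt)
    return [ent.text for ent in doc.ents if ent.label_ == 'PERSON']


if __name__ == '__main__':
    # in practice, pass the text extracted from the resume instead
    print(extract_names_spacy('Maria Brown is a data scientist from Boston.'))  # noqa: T001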
Extracting phone numbers from resumes
Unlike person names, phone numbers are much easier to deal with. Generally speaking, a simple regex will do fine for most cases. Try the following code to extract phone numbers from a resume; you may modify the regex to your taste.
# example_05.py
import re
import subprocess  # noqa: S404
import sys

PHONE_REG = re.compile(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]')


def doc_to_text_catdoc(file_path):
    try:
        process = subprocess.Popen(  # noqa: S607,S603
            ['catdoc', '-w', file_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except (
        FileNotFoundError,
        ValueError,
        subprocess.TimeoutExpired,
        subprocess.SubprocessError,
    ) as err:
        return (None, str(err))
    else:
        stdout, stderr = process.communicate()

    return (stdout.strip(), stderr.strip())


def extract_phone_number(resume_text):
    phone = re.findall(PHONE_REG, resume_text)

    if phone:
        number = ''.join(phone[0])

        if resume_text.find(number) >= 0 and len(number) < 16:
            return number
    return None


if __name__ == '__main__':
    # doc_to_text_catdoc returns a (text, error) tuple, so unpack it here
    text, err = doc_to_text_catdoc('./resume-word.doc')

    if err:
        print(err)  # noqa: T001
        sys.exit(2)

    phone_number = extract_phone_number(text)
    print(phone_number)  # noqa: T001
Extracting email addresses from resumes
Similar to phone number extraction, this is also pretty straightforward. Just fire up a regular expression and extract the email addresses from the resume. The one that appears first is generally the applicant’s actual email address, since people tend to place their contact details in the header section of their resumes.
# example_06.py
import re

from pdfminer.high_level import extract_text

EMAIL_REG = re.compile(r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+')


def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)


def extract_emails(resume_text):
    return re.findall(EMAIL_REG, resume_text)


if __name__ == '__main__':
    text = extract_text_from_pdf('resume.pdf')
    emails = extract_emails(text)

    if emails:
        print(emails[0])  # noqa: T001
Extracting skills from resumes
Well, so far so good. This is the section where things get trickier. Extracting skills from text is a very challenging task, and to increase accuracy you’ll need a database or an API to verify whether a piece of text is a skill or not.
Check out the following code. It first uses the nltk library to generate tokens and filter out the stop words.
# example_07.py
import docx2txt
import nltk

nltk.download('punkt')
nltk.download('stopwords')

# you may read the skills database from a csv file or some other source
SKILLS_DB = [
    'machine learning',
    'data science',
    'python',
    'word',
    'excel',
    'english',
]


def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)

    if txt:
        return txt.replace('\t', ' ')
    return None


def extract_skills(input_text):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    word_tokens = nltk.tokenize.word_tokenize(input_text)

    # remove the stop words
    filtered_tokens = [w for w in word_tokens if w not in stop_words]

    # remove the punctuation
    filtered_tokens = [w for w in filtered_tokens if w.isalpha()]

    # generate bigrams and trigrams (such as artificial intelligence)
    bigrams_trigrams = list(map(' '.join, nltk.everygrams(filtered_tokens, 2, 3)))

    # we create a set to keep the results in
    found_skills = set()

    # we search for each token in our skills database
    for token in filtered_tokens:
        if token.lower() in SKILLS_DB:
            found_skills.add(token)

    # we search for each bigram and trigram in our skills database
    for ngram in bigrams_trigrams:
        if ngram.lower() in SKILLS_DB:
            found_skills.add(ngram)

    return found_skills


if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    skills = extract_skills(text)

    print(skills)  # noqa: T001
It’s not easy to maintain an up-to-date database of skills for each and every industry. You may wish to have a look at the Skills API, which gives you an easy and affordable alternative to maintaining your own skills database. The Skills API features 70,000+ skills, well organized and updated often. Just check the following code to see how easy it is to extract skills from a resume when the Skills API is used.
Before proceeding, you’ll first need to install a new dependency called requests, again using the pip tool.
pip install requests
Now below is the source code using the Skills API.
# example_08.py
import docx2txt
import nltk
import requests

nltk.download('punkt')
nltk.download('stopwords')


def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)

    if txt:
        return txt.replace('\t', ' ')
    return None


def skill_exists(skill):
    url = f'https://api.apilayer.com/skills?q={skill}&count=1'
    headers = {'apikey': 'YOUR API KEY'}
    response = requests.request('GET', url, headers=headers)
    result = response.json()

    if response.status_code == 200:
        return len(result) > 0 and result[0].lower() == skill.lower()
    raise Exception(result.get('message'))


def extract_skills(input_text):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    word_tokens = nltk.tokenize.word_tokenize(input_text)

    # remove the stop words
    filtered_tokens = [w for w in word_tokens if w not in stop_words]

    # remove the punctuation
    filtered_tokens = [w for w in filtered_tokens if w.isalpha()]

    # generate bigrams and trigrams (such as artificial intelligence)
    bigrams_trigrams = list(map(' '.join, nltk.everygrams(filtered_tokens, 2, 3)))

    # we create a set to keep the results in
    found_skills = set()

    # we look up each token in the Skills API
    for token in filtered_tokens:
        if skill_exists(token.lower()):
            found_skills.add(token)

    # we look up each bigram and trigram in the Skills API
    for ngram in bigrams_trigrams:
        if skill_exists(ngram.lower()):
            found_skills.add(ngram)

    return found_skills


if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    skills = extract_skills(text)

    print(skills)  # noqa: T001
Extracting education and schools from resumes
If you’ve understood the principles of skills extraction, you’ll be more comfortable with the education and school extraction topic.
Not surprisingly, there are many ways of doing it.
First, you can use a database that contains (supposedly) all the school names from all around the world.
You may use our school names database to train your own NER model with spaCy or any other NLP framework, but we’ll follow a much simpler route. In the following code, we look for words such as “university”, “college”, etc. in named entities labeled as organizations. Believe it or not, it performs quite well for most cases. You may enrich the RESERVED_WORDS list in the code as you wish.
Similar to person name extraction, we’ll first gather all the named entities labeled as organizations into a list, and then check whether each of them contains one of the reserved words.
Check the following code, as it speaks for itself.
# example_09.py
import docx2txt
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

RESERVED_WORDS = [
    'school',
    'college',
    'univers',
    'academy',
    'faculty',
    'institute',
    'faculdades',
    'schola',
    'schule',
    'lise',
    'lyceum',
    'lycee',
    'polytechnic',
    'kolej',
    'ünivers',
    'okul',
]


def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)

    if txt:
        return txt.replace('\t', ' ')
    return None


def extract_education(input_text):
    organizations = []

    # first get all the organization names using nltk
    for sent in nltk.sent_tokenize(input_text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'label') and chunk.label() == 'ORGANIZATION':
                organizations.append(' '.join(c[0] for c in chunk.leaves()))

    # then search each organization name for reserved words
    # (college, university etc...)
    education = set()
    for org in organizations:
        for word in RESERVED_WORDS:
            if org.lower().find(word) >= 0:
                education.add(org)

    return education


if __name__ == '__main__':
    text = extract_text_from_docx('resume.docx')
    education_information = extract_education(text)

    print(education_information)  # noqa: T001
Resume parsing is tricky. There are hundreds of ways of doing it, and we’ve just covered one easy approach, so don’t expect miracles. It may work for some layouts and fail for others.
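Still, to show how the pieces fit together, here is a rough sketch that glues the earlier examples into a single helper. The module names assume you saved the snippets under the filenames used in this article, and parse_resume is just an illustrative name, not a finished product.

# parse_resume_sketch.py -- a rough sketch tying the earlier examples together;
# it assumes the example files from this article sit in the same folder
from example_01 import extract_text_from_pdf
from example_04 import extract_names
from example_05 import extract_phone_number
from example_06 import extract_emails
from example_07 import extract_skills
from example_09 import extract_education


def parse_resume(pdf_path):
    # convert the resume to plain text, then run each field extractor on it
    text = extract_text_from_pdf(pdf_path)

    return {
        'name': (extract_names(text) or [None])[0],
        'phone': extract_phone_number(text),
        'email': (extract_emails(text) or [None])[0],
        'skills': extract_skills(text),
        'education': extract_education(text),
    }


if __name__ == '__main__':
    print(parse_resume('./resume.pdf'))  # noqa: T001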
If you need a professional solution, have a look at our hosted offering, the Resume Parser API. It is well maintained and supported by its API provider, which also maintains the Skills API. It is pre-trained on thousands of different resume layouts and is the most affordable solution on the market.
A free tier is available, and no credit card is required during registration. Just check whether it suits your needs. Feel free to leave comments below; any contribution is appreciated.