
Python is a practical starting point for web scraping. Its classes and objects are far easier to work with than those of many other programming languages, which is why Python is one of the most commonly used languages for web scraping projects.
Furthermore, numerous libraries make writing web scraping scripts in Python very easy. This tutorial explains everything needed for a straightforward scraping application. Let’s dig into the web scraping process, starting from the fundamentals so we get a complete understanding of the whole thing.

What Are Some Important Web Fundamentals?
First of all, we should be clear that it takes several different technologies to display even a simple web page. This tutorial does not cover each of them in detail, but it explains the few that matter for web scraping with Python code. Let’s dig into it.
Python Web Scraping: HyperText Transfer Protocol
HTTP uses a client/server model. An HTTP client opens a connection and sends a request message to the server; the server returns a response, and the connection is closed.
When we type a website address into the browser, the browser sends a request like this:
GET /product/ HTTP/1.1
Host: example.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch, br
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36
Since we want to fetch data, the request uses “GET” at the top. GET is one of the HTTP methods; others include POST, PUT, HEAD, and DELETE.
Next comes the path of the resource we want to interact with.
The request also specifies the HTTP version; in this tutorial we focus on HTTP/1.1. After that come the request headers, such as Connection and User-Agent. Some important headers are listed below (a short Python sketch of setting a few of them follows the list):
Host
- Indicates the hostname for which we send the request
- Important for name-based virtual hosting
User-Agent
- User-agent has information about the client that initiates the request
- It is used for statistics or to detect and block abusive bots
- Modifiable
Accept
- The list of MIME types the client accepts as a response from the server
- There are many content types, such as application/json, image/jpeg, and text/html
Cookie
- Includes a list of name-value pairs
- Describes how the website stores data in the browser. A cookie can have an expiration date (a standard cookie) or last only until the browser is closed (a session cookie)
- Cookies are used for purposes like authentication, user preferences, user tracking, and much more
Referer
- Contains the URL of the page from which the current URL was requested
- It is used to change the behavior of websites based on where the user came from
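As a quick illustration, here is a minimal sketch of sending a few of these headers explicitly from Python. It uses the requests library (covered in more detail later in this tutorial) and httpbin.org, which simply echoes back the headers it receives; the User-Agent value is just a made-up example.
import requests

# A minimal sketch: send some of the headers described above explicitly.
# httpbin.org echoes back the headers it receives; the User-Agent string
# here is only an illustrative example.
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)",
    "Accept": "text/html",
    "Referer": "https://www.google.com/",
}
response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.text)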
Once we send the request to the server, we get a response like this:
HTTP/1.1 200 OK
Server: nginx/1.4.6 (Ubuntu)
Content-Type: text/html; charset=utf-8
Content-Length: 3352

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
...[HTML CODE]
The status line “200 OK” shows that the request was handled properly. It is followed by the response headers.
After the response headers comes a blank line, and then the data we requested.
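To make that structure concrete, here is a minimal sketch (an illustration, not the response above) that splits a raw response string on that blank line:
# A minimal sketch: split a raw HTTP response into its header section and body
# using the blank line (\r\n\r\n) that separates them
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<!DOCTYPE html><html>...</html>"
)

header_section, body = raw_response.split("\r\n\r\n", 1)
status_line = header_section.split("\r\n")[0]
print(status_line)  # HTTP/1.1 200 OK
print(body)         # the HTML payload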
Let’s explore different ways of sending HTTP requests to fetch data using Python.
How To Send HTTP Requests By Manually Opening A Socket?
Socket
Opening a TCP socket to send an HTTP request manually is the most basic way to perform web scraping in Python. For example:
import socket

HOST = 'www.google.com'  # Server hostname or IP address
PORT = 80                # The standard port for HTTP is 80, for HTTPS it is 443

client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_address = (HOST, PORT)
client_socket.connect(server_address)

request_header = b'GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n'
client_socket.sendall(request_header)

response = ''
while True:
    recv = client_socket.recv(1024)
    if not recv:
        break
    response += str(recv)

print(response)
client_socket.close()
Regular Expressions
Once we receive the HTTP response, we can use regular expressions to extract data from it. A regular expression is a versatile tool: a string that uses a standard syntax to define a search pattern.
It helps with validating, handling, and parsing data. For example, regular expressions can help when the response contains a fragment like this:
<p>Price : 19.99$</p>
We could use an XPath expression to select this text node and then apply the regular expression below to extract the price:
^Price\s*:\s*(\d+\.\d{2})\$
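As a quick illustration, here is a minimal sketch that applies this pattern with Python's re module, assuming the text of the node has already been extracted:
import re

# A minimal sketch: apply the price pattern to the extracted text node
text = 'Price : 19.99$'
match = re.search(r'^Price\s*:\s*(\d+\.\d{2})\$', text)
if match:
    print(match.group(1))  # 19.99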
Matching HTML tags with regular expressions can be a little tricky, but a simple case looks like this:
import re

html_content = '<p>Price : 19.99$</p>'

m = re.match(r'<p>(.+)</p>', html_content)
if m:
    print(m.group(1))
This low-level approach quickly becomes complicated, but we can use high-level APIs to do the job easily. Let's check the easiest way to do it.
How To Web Scrape In Python Easily Using The Zenscrape API?
Zenscrape is a reliable web scraping tool that is easy to use. We can scrape data without getting blocked and perform extraction at scale. It has easy-to-follow Zenscrape documentation that developers and web scrapers can use without complications.
import requests

headers = {
    "apikey": "YOUR-APIKEY"
}

params = (
    ("url", "https://httpbin.org/ip"),
    ("premium", "true"),
    ("country", "de"),
    ("render", "true"),
)

response = requests.get('https://app.zenscrape.com/api/v1/get', headers=headers, params=params)
print(response.text)
What Is The Role Of urllib3 & LXML In Python Web Scraping?
The Python standard library ships with urllib (and, in Python 2, urllib2). urllib3, despite the similar name, is a separate third-party package. It is a high-level package that lets us do a great deal with an HTTP request in very little code.
Simply put, urllib3 needs far fewer lines of code than the socket approach from the previous section to extract data. For example:
import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://www.google.com')
print(r.data)
Now, if we want to use a proxy and add some headers, the code looks like this:
import urllib3

user_agent_header = urllib3.make_headers(user_agent="<USER AGENT>")
pool = urllib3.ProxyManager('<PROXY IP>', headers=user_agent_header)
r = pool.request('GET', 'https://www.google.com/')
This is how we perform web scraping using very few lines of code. Now let’s move to XPath.
XPath
XPath is a technology similar in spirit to CSS selectors. It uses path expressions to select nodes or node sets in an XML or HTML document. To extract data with XPath, we need three things:
- An XPath expression
- An HTML document
- An XPath engine to run the expression
We will use lxml to assist in the whole process. It is a fast XML and HTML library that supports XPath.
First, we install it:
pip install lxml
Then we can run:
from lxml import html

# We reuse the response from urllib3
data_string = r.data.decode('utf-8', errors='ignore')

# We instantiate a tree object from the HTML
tree = html.fromstring(data_string)

# We run the XPath against this HTML
# This returns a list of elements
links = tree.xpath('//a')

for link in links:
    # For each element we can easily get back the URL
    print(link.get('href'))
The output looks like this:
https://books.google.fr/bkshp?hl=fr&tab=wp
https://www.google.fr/shopping?hl=fr&source=og&tab=wf
https://www.blogger.com/?tab=wj
https://photos.google.com/?tab=wq&pageId=none
http://video.google.fr/?hl=fr&tab=wv
https://docs.google.com/document/?usp=docs_alc
...
https://www.google.fr/intl/fr/about/products?tab=wh
This is just a simple example; XPath can express much more powerful selections.
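For instance, here is a minimal sketch, using a made-up HTML snippet rather than the Google page above, that keeps only the links whose href starts with “https”:
from lxml import html

# A minimal sketch: filter links by attribute with an XPath predicate.
# The HTML string below is only an illustration.
doc = html.fromstring(
    '<div><a href="https://example.com">secure</a><a href="/local">local</a></div>'
)
secure_links = doc.xpath('//a[starts-with(@href, "https")]/@href')
print(secure_links)  # ['https://example.com']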

How To Use The Beautiful Soup & Requests Libraries?
Requests
Requests is one of the most commonly used Python packages, with more than 11,000,000 downloads. First, we install it:
pip install requests
Then we make a request:
import requests

r = requests.get('https://www.scrapingninja.co')
print(r.text)
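The Response object exposes the pieces of the HTTP response we discussed earlier. Here is a small sketch (using httpbin.org rather than the site above) that prints the status code, a header, and the body:
import requests

# A small sketch: inspect the parts of the response (status line, headers, body)
r = requests.get('https://httpbin.org/ip')
print(r.status_code)              # e.g. 200
print(r.headers['Content-Type'])  # e.g. application/json
print(r.text)                     # the response body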
Authentication to Hacker News
Let’s suppose we want to build a Python scraper that submits a blog post to a forum; we will use Hacker News as the example site. We need to authenticate on the forum before posting anything. This is where Beautiful Soup and Requests help us. The Hacker News login form looks like this:

We see three <input> tags with a name attribute here. The first one is a hidden input named “goto,” while the other two are the username and password fields. When we submit the form in Chrome, the server sets a cookie so that it knows we are authenticated.
With Requests, handling cookies in such a case is done through the Session object, which makes the whole process straightforward.
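Here is a minimal sketch of that idea, using httpbin.org: the Session object stores the cookie set by the first request and sends it automatically with the second one.
import requests

# A minimal sketch: a Session persists cookies across requests
s = requests.Session()
s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('https://httpbin.org/cookies')
print(r.text)  # the cookie set above is echoed back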
Beautiful Soup
First, we need to install it:
1 |
pip install beautifulsoup4 |
Then we post the three form inputs:
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://news.ycombinator.com'
USERNAME = ""
PASSWORD = ""

s = requests.Session()

data = {"goto": "news", "acct": USERNAME, "pw": PASSWORD}
r = s.post(f'{BASE_URL}/login', data=data)

soup = BeautifulSoup(r.text, 'html.parser')
if soup.find(id='logout') is not None:
    print('Successfully logged in')
else:
    print('Authentication Error')
Then, we need to inspect the Hacker News page’s HTML content, as given in the screenshot below:

We need to find the <tr> tags with the class athing, using the code below:
links = soup.findAll('tr', class_='athing')
Then, we extract the ID, title, URL, and rank with the code below:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.findAll('tr', class_='athing')

formatted_links = []

for link in links:
    data = {
        'id': link['id'],
        'title': link.find_all('td')[2].a.text,
        'url': link.find_all('td')[2].a['href'],
        'rank': int(link.find_all('td')[0].span.text.replace('.', ''))
    }
    formatted_links.append(data)

print(formatted_links)
Now let’s make the scraper a bit more robust!
Storing Web Scraping Data In PostgreSQL
First of all, we pick the appropriate PostgreSQL package for our operating system, install it, and set up a database. After this, we add a table for the Hacker News links:
CREATE TABLE "hn_links" (
    "id" INTEGER NOT NULL,
    "title" VARCHAR NOT NULL,
    "url" VARCHAR NOT NULL,
    "rank" INTEGER NOT NULL
);
Next, install Psycopg, the PostgreSQL adapter for Python:
pip install psycopg2
We open the connection as follows:
con = psycopg2.connect(host="127.0.0.1", port="5432", user="postgres", password="", database="scrape_demo")
Then we get a database cursor:
cur = con.cursor()
Then we use execute to run an SQL command:
cur.execute("INSERT INTO table [HERE-GOES-OUR-DATA]")
Finally, we have to commit the implicit database transaction by calling con.commit(), and we are ready to go.
Here is the complete code to store data in the database:
import psycopg2
import requests
from bs4 import BeautifulSoup

# Establish database connection
con = psycopg2.connect(host="127.0.0.1",
                       port="5432",
                       user="postgres",
                       password="",
                       database="scrape_demo")

# Get a database cursor
cur = con.cursor()

r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.findAll('tr', class_='athing')

for link in links:
    cur.execute("""
        INSERT INTO hn_links (id, title, url, rank)
        VALUES (%s, %s, %s, %s)
        """,
        (
            link['id'],
            link.find_all('td')[2].a.text,
            link.find_all('td')[2].a['href'],
            int(link.find_all('td')[0].span.text.replace('.', ''))
        )
    )

# Commit the data
con.commit()

# Close our database connections
cur.close()
con.close()
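As an optional check (a sketch, not part of the original walkthrough), we can read a few rows back to confirm they were stored:
import psycopg2

# A small sketch: query the stored Hacker News links back from PostgreSQL
con = psycopg2.connect(host="127.0.0.1", port="5432", user="postgres",
                       password="", database="scrape_demo")
cur = con.cursor()
cur.execute('SELECT id, title, url, "rank" FROM hn_links ORDER BY "rank" LIMIT 5')
for row in cur.fetchall():
    print(row)
cur.close()
con.close()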
Python Web Scraping: Conclusion
The examples above show that Beautiful Soup and Requests are great libraries with complementary roles, and how easily Python libraries let us scrape websites. But when scraping this way, we still have to handle many parts of the process ourselves.
FAQs
Is Python Good for Web Scraping?
Yes. Python is one of the most widely used languages for web scraping.
What Can You Do With Python Web Scraping?
We can extract data from different sites using Python web scraping tools and turn unstructured page content into usable, structured data.
Is Python Web Scraping Easy to Learn?
Yes. It is easier to learn and understand than most other languages.
How Long Does It Take To Learn Python Web Scraping?
It depends on our existing Python skills. The basics can be picked up in a day, while mastering every aspect can take much longer.
Sign up for free now to scrape data using Zenscrape and get 1,000 requests per month on the free plan.