
Let’s say there is a webpage on the internet with many email addresses, and you want to write a Python script that can extract all the email addresses. This email extractor in Python is a small application of Python web scraping where we access data from the Internet.
Whenever we talk about web scraping with Python, the first library that comes to mind is requests, but in this tutorial, we will not be using the Python requests library directly. Instead, we will use the requests-html library, which is built on top of requests and supports all of its features and more. You might be wondering why we should use the requests-html library if web scraping can be performed using requests.
The main reason for using requests-html is that it supports JavaScript. On some websites, the data is rendered in the browser by JavaScript code, but when we request a webpage with the requests library, that JavaScript code does not execute. With requests-html, however, we can execute the JavaScript code of the response object.
Required Libraries and Dependencies
Alright, now let’s discuss and install the libraries that we will be using to develop an email extractor in Python.
1) Python requests-html Library
The requests-html library is an open-source HTML parsing library for Python, and in this tutorial, we will be using it as an alternative to the Python requests library. To install the requests-html library for your Python environment, run the following pip install command in your terminal or command prompt:
pip install requests-html
2) Python beautifulsoup4 Library
Beautiful Soup is an open-source Python library that is used to extract or pull data from HTML and XML files. The program below imports BeautifulSoup from this library, although the extraction itself is done with requests-html and re. To install the beautifulsoup4 library for your Python environment, run the following pip install command:
pip install beautifulsoup4
3) Python re Module
The name of the Python re module stands for regular expression. It is part of the Python standard library, so it needs no installation, and it is used to match string patterns in text using regular expressions.
In this tutorial, we will extract emails from a webpage. An email is a specific sequence of characters, and by using the regular expression, we can grab only that text or string data that matches the specific sequence or pattern.
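As a small illustration of this idea, the sketch below runs the same pattern that is used later in this tutorial against a short string. The sample text and the addresses in it are made up for this example and do not come from the target webpage.

```python
import re

# Same pattern that the tutorial defines later on
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# Hypothetical sample text, invented for this illustration
sample_text = "Contact alice@example.com or bob.smith@mail.example.org for details."

# findall returns every non-overlapping match as a list of strings
emails = re.findall(pattern, sample_text)
print(emails)
```

Only the two substrings that look like email addresses are returned; the surrounding words are ignored because they never satisfy the part of the pattern that requires an @ character.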
Random Email Generator
For this tutorial, we will be extracting emails from the https://www.randomlists.com/email-addresses URL, which generates random emails with every request. If you want, you can use any other webpage URL to extract emails.
How to Make an Email Extractor in Python?
Let’s start with importing all the modules.
from requests_html import HTMLSession
import re
from bs4 import BeautifulSoup
Now set the url and pattern identifiers that represent the webpage URL and the regular expression pattern for the emails.
# page url
url = r"https://www.randomlists.com/email-addresses"

# regex pattern
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
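One thing worth noting about this pattern is that its character classes list only lowercase letters, so an address written with capital letters may be skipped. A possible workaround, sketched here as an assumption rather than as part of the original program, is to pass the re.IGNORECASE flag. The sample addresses are made up.

```python
import re

pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# Hypothetical text containing one mixed-case and one lowercase address
sample_text = "Write to Jane.Doe@Example.COM or support@example.com"

# Without the flag, the mixed-case address is not matched
strict = re.findall(pattern, sample_text)

# With re.IGNORECASE, [a-z] also accepts uppercase letters
relaxed = re.findall(pattern, sample_text, flags=re.IGNORECASE)

print(strict)
print(relaxed)
```

Whether the flag is needed depends on the page being scraped; if its addresses are all lowercase, the pattern works as written.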
Next, initialize the HTMLSession() object, which keeps cookies and connection settings persistent across requests.
# initialize the session
session = HTMLSession()
After initializing the session, let’s send a GET request to the page URL.
# send the get request
response = session.get(url)
After sending the GET request, we get the response, or HTML data, from the server. Now, let’s run all the JavaScript code of the response object using the html.render() method.
# simulate JS running code
response.html.render()
The first time render() runs, it downloads Chromium for your Python environment, so do not worry when you see a download in progress during code execution. The data you see on a webpage is generally placed inside the HTML <body> tag. So, let’s grab the body tag from the response object.
# get body element
body = response.html.find("body")[0]
The find("body") function returns a list of <body> elements. Since an HTML page can have only one body, we use the [0] index to grab the first result. Next, let’s extract the list of emails from the body text and print all the emails.
# extract emails
emails = re.findall(pattern, body.text)

for index, email in enumerate(emails):
    print(index + 1, "---->", email)
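A page may list the same address more than once, in which case re.findall returns it more than once as well. The sketch below shows one possible way to drop duplicates while keeping the order in which addresses first appear. The emails list here is a made-up stand-in for the list produced in the step above, so that the snippet runs on its own.

```python
# Hypothetical stand-in for the list returned by re.findall above
emails = ["a@example.com", "b@example.com", "a@example.com"]

# dict.fromkeys keeps the first occurrence of each key in insertion order
unique_emails = list(dict.fromkeys(emails))

# enumerate(start=1) numbers the results from 1 instead of 0
for index, email in enumerate(unique_emails, start=1):
    print(index, "---->", email)
```

Using a set would also remove duplicates, but it does not preserve the original order, which is why a dict is used in this sketch.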
Now let us put all the code together and execute it.
Python Program to Extract Emails from a Webpage
from requests_html import HTMLSession
import re
from bs4 import BeautifulSoup

# page url
url = r"https://www.randomlists.com/email-addresses"

# regex pattern
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# initialize the session
session = HTMLSession()

# send the get request
response = session.get(url)

# simulate JS running code
response.html.render()

# get body element
body = response.html.find("body")[0]

# extract emails
emails = re.findall(pattern, body.text)

for index, email in enumerate(emails):
    print(index + 1, "---->", email)
Output
1 ----> horrocks@yahoo.com
2 ----> leocharre@live.com
3 ----> howler@gmail.com
4 ----> naoya@me.com
5 ----> gfxguy@gmail.com
6 ----> kalpol@outlook.com
7 ----> scato@hotmail.com
8 ----> tkrotchko@live.com
9 ----> citizenl@aol.com
10 ----> sagal@mac.com
11 ----> afeldspar@sbcglobal.net
12 ----> maneesh@gmail.com
Conclusion
In this Python tutorial, we learned how to make an email extractor in Python that can extract emails from a webpage using the requests-html, beautifulsoup4, and re Python libraries. You can also extract emails from a text file using Python file handling methods and regular expressions, as we have done above.
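For the text file case, a minimal sketch might look like the following. The function name extract_emails_from_file, the file name contacts.txt, and the sample addresses are all hypothetical. The snippet writes a throwaway file first so that it runs on its own.

```python
import re
import tempfile
from pathlib import Path

pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"


def extract_emails_from_file(path):
    """Return every match of the email pattern found in the text file at path."""
    with open(path, encoding="utf-8") as file:
        return re.findall(pattern, file.read())


# Create a temporary sample file so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "contacts.txt"
    sample_path.write_text("sales@example.com\nhelp@example.org\n", encoding="utf-8")
    found = extract_emails_from_file(sample_path)

print(found)
```

For very large files, reading line by line instead of calling file.read() may be preferable, since it avoids loading the whole file into memory.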
We hope you like this article, and if you have any queries or suggestions related to the above article or program, please let us know by commenting below. Thanks for reading!