Let’s say there is a webpage on the internet with many email addresses, and you want to write a Python script that can extract all the email addresses. This email extractor in Python is a small application of Python web scraping where we access data from the Internet.
Whenever we talk about web scraping with Python, the first library that comes to mind is requests, but in this tutorial, we will not be using the Python requests library. Instead, we will use the requests-html library, which supports all the features of the requests library and more. You might be wondering why we should use the requests-html library if web scraping can be performed using requests. The main reason behind using requests-html is that it can run the JavaScript on a page, which this tutorial relies on when it calls response.html.render().
Required Libraries and Dependencies
Alright, now let’s discuss and install the libraries that we will be using to develop an email extractor in Python.
The requests-html library is an open-source Python library for parsing HTML, and in this tutorial, we will use it as an alternative to the Python requests library. To install the requests-html library for your Python environment, run the following pip install command in your terminal or command prompt:
pip install requests-html
Beautiful Soup is an open-source Python library used to extract data from HTML and XML files. In this tutorial, we install the beautifulsoup4 library and import it in the program, although the emails themselves are read from the body text that requests-html returns. To install the beautifulsoup4 library for your Python environment, run the following pip install command:
pip install beautifulsoup4
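As a rough sketch of how Beautiful Soup pulls text out of HTML, the snippet below parses a small hard-coded HTML string and searches its body text with the email pattern used later in this tutorial. The HTML string and the addresses in it are made up for this illustration and are not taken from the target page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this illustration.
html_doc = """
<html><body>
  <h1>Team</h1>
  <ul>
    <li>erin@example.com</li>
    <li>frank@example.org</li>
  </ul>
</body></html>
"""

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the visible text of the <body> element, separating pieces with spaces.
body_text = soup.body.get_text(separator=" ")

# Search the text with the same email pattern this tutorial uses.
found = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", body_text)
print(found)
```

This only works on HTML that is already present in the string. It does not run JavaScript, which is why the tutorial fetches the page with requests-html.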
The re module (short for regular expression) is part of the Python standard library and is used to find text that matches a pattern written as a regular expression.
In this tutorial, we will extract emails from a webpage. An email address follows a specific sequence of characters, and with a regular expression we can pick out only the strings that match that pattern.
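To make the idea concrete, here is a minimal sketch that applies the email pattern used later in this tutorial to a short string. The sample sentence and its addresses are hypothetical, chosen only for the illustration.

```python
import re

# The pattern used in this tutorial:
# local part, "@", domain, ".", top-level domain (lowercase letters only).
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# Hypothetical sample text, made up for this illustration.
sample_text = "Write to alice@example.com or bob.smith@mail.example.org for help"

# findall() returns every non-overlapping match as a list of strings.
emails = re.findall(EMAIL_PATTERN, sample_text)
print(emails)
```

Because the character classes list only lowercase letters, an address written with capital letters would not be matched in full. Passing flags=re.IGNORECASE to re.findall() is one way to relax that, if you need it.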
Random Email Generator
For this tutorial, we will be extracting emails from the https://www.randomlists.com/email-addresses URL, which generates random emails with every request. If you want, you can use any other webpage URL to extract emails.
How to Make an Email Extractor in Python?
Let’s start with importing all the modules.
from requests_html import HTMLSession
import re
from bs4 import BeautifulSoup
Now set the url and pattern identifiers, which hold the webpage URL and the regular expression pattern for the emails.
# page URL
url = "https://www.randomlists.com/email-addresses"

# regex pattern for email addresses
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
Next, initialize the HTMLSession() object, which keeps cookies and reuses the connection across requests.
# initialize the session
session = HTMLSession()
After initializing the session, let’s send a GET request to the page URL.
# send the GET request
response = session.get(url)
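After sending the GET request, we get the response object. Calling the render() method on response.html runs the page's JavaScript, so that content generated by scripts is present in the HTML.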
# run the page's JavaScript
response.html.render()
The first time render() runs, it downloads Chromium for your Python environment, so do not worry when you see a download in progress during code execution. The data you see on a webpage is generally placed inside the HTML <body> tag, so let’s grab the body element from the response object.
# get the body element (find() returns a list, so take the first item)
body = response.html.find("body")[0]
The find("body") function returns a list of <body> elements. As an HTML page can have only one body, we use the [0] index to grab the first result. Next, let’s extract the list of emails from the body text and print all of them.
# extract the emails and print them, numbered from 1
emails = re.findall(pattern, body.text)
for index, email in enumerate(emails):
    print(index + 1, "---->", email)
Now let us put all the code together and execute it.
Python Program to Extract Emails from a Webpage
from requests_html import HTMLSession
import re
from bs4 import BeautifulSoup

# page URL
url = "https://www.randomlists.com/email-addresses"

# regex pattern for email addresses
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# initialize the session
session = HTMLSession()

# send the GET request
response = session.get(url)

# run the page's JavaScript (downloads Chromium on first use)
response.html.render()

# find() returns a list, so take the first (and only) <body> element
body = response.html.find("body")[0]

# extract the emails from the body text and print them, numbered from 1
emails = re.findall(pattern, body.text)
for index, email in enumerate(emails):
    print(index + 1, "---->", email)
Output (the addresses differ on every run because the page generates random emails):
1 ----> firstname.lastname@example.org
2 ----> email@example.com
3 ----> firstname.lastname@example.org
4 ----> email@example.com
5 ----> firstname.lastname@example.org
6 ----> email@example.com
7 ----> firstname.lastname@example.org
8 ----> email@example.com
9 ----> firstname.lastname@example.org
10 ----> email@example.com
11 ----> firstname.lastname@example.org
12 ----> email@example.com
In this Python tutorial, we learned how to make an email extractor in Python that extracts emails from a webpage using the requests-html and re Python libraries. You can also extract emails from a text file by combining Python's file handling methods with the same regular expression.
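The text-file variant could look something like the sketch below. It assumes a UTF-8 text file. The function name extract_emails_from_file, the file name contacts.txt, and the sample addresses are all hypothetical, and the demo writes its own temporary file so that it can run anywhere.

```python
import re
import tempfile
from pathlib import Path

# Same email pattern as in the tutorial, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+")


def extract_emails_from_file(path):
    """Return every substring of the file at `path` that matches EMAIL_PATTERN."""
    text = Path(path).read_text(encoding="utf-8")
    return EMAIL_PATTERN.findall(text)


# Demo: write a small hypothetical file, then extract the emails from it.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = Path(tmp_dir) / "contacts.txt"
    sample_file.write_text(
        "carol@example.com\nno email on this line\ndave_1@example.net\n",
        encoding="utf-8",
    )
    file_emails = extract_emails_from_file(sample_file)

print(file_emails)
```

Reading the whole file at once keeps the sketch short. For very large files, looping over the lines and calling findall() on each one would use less memory.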
We hope you like this article, and if you have any queries or suggestions related to the above article or program, please let us know by commenting below. Thanks for reading!