How to Make an Email Extractor in Python?

By | November 15, 2021
Let’s say there is a webpage on the internet with many email addresses, and you want to write a Python script that can extract all the email addresses. This email extractor in Python is a small application of Python web scraping where we access data from the Internet.

When we talk about web scraping with Python, the first library that comes to mind is requests. In this tutorial, however, we will not be using the Python requests library. Instead, we will use the requests-html library, which supports all the features of the requests library and more. You might be wondering why you should use requests-html if web scraping can be performed with requests.

The main reason for using requests-html is that it supports JavaScript. On some websites, the data is rendered in the browser by JavaScript code, and when we request such a page with the requests library, that code does not execute. With requests-html, we can execute the JavaScript code of the response object.

Required Libraries and Dependencies

Alright, now let’s discuss and install the libraries that we will be using to develop an email extractor in Python.

1) Python requests-html Library

The requests-html library is an open-source, HTML parsing Python library, and in this tutorial, we will be using this library as an alternative for the Python requests library. To install the requests-html library for your Python environment, run the following pip install command on your terminal or command prompt:

pip install requests-html

2) Python beautifulsoup4 Library

Beautiful Soup is an open-source Python library that is used to extract or pull data from HTML and XML files. The script in this tutorial imports BeautifulSoup, although the extraction shown below is done by requests-html and re. To install the beautifulsoup4 library for your Python environment, run the following pip install command:

pip install beautifulsoup4

3) Python re Module

The re module (short for regular expression) is part of the Python standard library, so there is nothing to install. It is used to match string patterns in text using regular expressions.

In this tutorial, we will extract emails from a webpage. An email is a specific sequence of characters, and by using the regular expression, we can grab only that text or string data that matches the specific sequence or pattern.
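As a rough illustration of the idea, the sketch below runs re.findall() over a short string. The sample text, the made-up addresses, and the simplified pattern are assumptions chosen for the example; they are not the page or the exact pattern used later in this tutorial.

```python
import re

# Hypothetical sample text; the addresses are made up for illustration.
sample_text = "Contact alice@example.com or bob.smith@mail.example.org for details."

# A deliberately simple pattern: local part, "@", domain, ".", top-level domain.
simple_pattern = r"[a-z0-9._+-]+@[a-z0-9.-]+\.[a-z]+"

# findall() returns every non-overlapping match as a list of strings.
matches = re.findall(simple_pattern, sample_text)
print(matches)
```

Only the two substrings that fit the pattern are returned; the surrounding words are ignored.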

Random Email Generator

For this tutorial, we will be extracting emails from the https://www.randomlists.com/email-addresses URL, which generates random emails with every request. If you want, you can use any other webpage URL to extract emails.

How to Make an Email Extractor in Python?

Let’s start with importing all the modules.

from requests_html import HTMLSession
import re
from bs4 import BeautifulSoup

Now set the url and pattern identifiers that represent the webpage URL and regular expression pattern for the emails.

# page url
url = "https://www.randomlists.com/email-addresses"

# regex pattern
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
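One thing worth knowing about this pattern is that it lists only lowercase letters, so an address written with capitals will not match unless the search is made case-insensitive. The sketch below compares both behaviours on a hypothetical sample string (the addresses are made up); passing re.IGNORECASE is a suggestion here, not something the original script does.

```python
import re

# The tutorial's pattern, which lists only lowercase letters.
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# Hypothetical sample text with one mixed-case address.
sample_text = "Write to Jane.Doe@Example.COM or to sam@example.net"

# Default matching is case-sensitive, so the mixed-case address is skipped.
case_sensitive = re.findall(pattern, sample_text)

# With re.IGNORECASE, [a-z] also matches uppercase letters.
case_insensitive = re.findall(pattern, sample_text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

If the page you scrape may contain capitalised addresses, the case-insensitive call is probably the safer choice.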

Next, initialize the HTMLSession() object, which persists cookies across the requests made through it.

#initialize the session
session = HTMLSession()

After initializing the session, let’s send a GET request to the page URL.

#send the get request
response = session.get(url)

After sending the GET request, we get the response or HTML data from the server. Now, let’s run all the JavaScript code of the response object using the html.render() method.

#simulate JS running code
response.html.render()

The first time render() runs, it downloads Chromium for your Python environment, so do not worry if you see a download in progress during code execution. The data you see on a webpage generally sits inside the HTML <body> tag, so let’s grab the body element from the response object.

#get body element
body = response.html.find("body")[0]

The find("body") call returns a list of matching <body> elements. Because an HTML page has only one body, we use the [0] index to take the first result. Next, let’s extract the list of emails from the body text and print all the emails.

# extract emails using the pattern defined earlier
emails = re.findall(pattern, body.text)

# number the results starting from 1
for index, email in enumerate(emails, start=1):
    print(index, "---->", email)
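re.findall() returns every match, so an address that appears more than once in the page text is listed more than once. If that is not what you want, one possible approach is to drop the repeats while keeping the original order. The sketch below assumes a hypothetical page_text string in place of body.text so that it runs without a network request.

```python
import re

pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# Hypothetical page text in which one address appears twice.
page_text = "howler@gmail.com, naoya@me.com, howler@gmail.com"

emails = re.findall(pattern, page_text)

# dict.fromkeys keeps the first occurrence of each key in insertion order.
unique_emails = list(dict.fromkeys(emails))

for index, email in enumerate(unique_emails, start=1):
    print(index, "---->", email)
```

A set would also remove the duplicates, but it would not preserve the order in which the addresses appear on the page.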

Now let us put all the code together and execute it.

Python Program to Extract Emails from a Webpage

from requests_html import HTMLSession
import re
from bs4 import BeautifulSoup

# page url
url = "https://www.randomlists.com/email-addresses"

# regex pattern
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

#initialize the session
session = HTMLSession()

#send the get request
response = session.get(url)

#simulate JS running code
response.html.render()

#get body element
body = response.html.find("body")[0]

# extract emails using the pattern defined above
emails = re.findall(pattern, body.text)

# number the results starting from 1
for index, email in enumerate(emails, start=1):
    print(index, "---->", email)

Output

1 ----> horrocks@yahoo.com
2 ----> leocharre@live.com
3 ----> howler@gmail.com
4 ----> naoya@me.com
5 ----> gfxguy@gmail.com
6 ----> kalpol@outlook.com
7 ----> scato@hotmail.com
8 ----> tkrotchko@live.com
9 ----> citizenl@aol.com
10 ----> sagal@mac.com
11 ----> afeldspar@sbcglobal.net
12 ----> maneesh@gmail.com

Conclusion

In this Python tutorial, we learned how to make an email extractor in Python that can extract emails from a webpage using the requests-html and re libraries. You can also extract emails from a text file by combining Python file handling with the same regular expression.
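For the text file case, a minimal sketch could look like the following. The helper name extract_emails_from_file and the file name contacts.txt are hypothetical, and the demo writes its own sample file to a temporary directory so that it runs as-is.

```python
import re
import tempfile
from pathlib import Path

pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"


def extract_emails_from_file(path):
    """Return every match of the pattern found in the text file at `path`."""
    text = Path(path).read_text(encoding="utf-8")
    return re.findall(pattern, text)


# Demo: write a small sample file to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "contacts.txt"
    sample_path.write_text(
        "sagal@mac.com\nno address here\nmaneesh@gmail.com\n",
        encoding="utf-8",
    )
    file_emails = extract_emails_from_file(sample_path)

print(file_emails)
```

To use it on your own data, call extract_emails_from_file() with the path to an existing text file.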

We hope you like this article, and if you have any queries or suggestions related to the above article or program, please let us know by commenting below. Thanks for reading!
