How to Make an Email Extractor in Python?

Vinay Khatri
Last updated on April 26, 2024

    Suppose a webpage contains many email addresses, and you want a Python script that extracts all of them. This email extractor is a small application of Python web scraping, where we access data from the Internet.

    When we talk about web scraping with Python, the first library that comes to mind is requests, but in this tutorial we will not use it directly. Instead, we will use the requests-html library, which builds on requests and adds HTML parsing and JavaScript rendering.

    You might wonder why we use requests-html if web scraping can be done with requests alone. The main reason is that requests-html can render JavaScript.

    On some websites, the data is rendered in the browser by JavaScript, but when we request a webpage with the requests library, that JavaScript does not execute. With requests-html, we can execute the JavaScript of the response object.

    Required Libraries and Dependencies

    Alright, now let's discuss and install the libraries that we will be using to develop an email extractor in Python.

    1) Python requests-html Library

    The requests-html library is an open-source HTML parsing library for Python, and in this tutorial we will use it as an alternative to the Python requests library. To install requests-html in your Python environment, run the following pip install command in your terminal or command prompt:

    pip install requests-html

    2) Python beautifulsoup4 Library

    Beautiful Soup is an open-source Python library used to extract data from HTML and XML files. The program in this tutorial imports it, although the body lookup itself is done through requests-html, so beautifulsoup4 is only needed if you want to parse the HTML further. To install the beautifulsoup4 library in your Python environment, run the following pip install command:

    pip install beautifulsoup4

    3) Python re Module

    The name of the Python re module is short for regular expression. It is part of the Python standard library, so it needs no installation, and it is used to match string patterns in text.

    In this tutorial, we will extract emails from a webpage. An email address is a specific sequence of characters, and with a regular expression we can grab only the text that matches that sequence or pattern.
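    To make the idea concrete, here is a minimal sketch of re.findall() applied to a plain string. The pattern is the one this tutorial uses later; the sample text and the example.com addresses are made up for illustration.

```python
import re

# The same email pattern used later in this tutorial.
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# Hypothetical sample text, not taken from any real page.
sample_text = "Contact alice@example.com or bob.smith@mail.example.org for details."

# findall() returns every non-overlapping match as a list of strings.
found = re.findall(EMAIL_PATTERN, sample_text)
print(found)
```

    Only the two address-shaped substrings are returned; the surrounding words are ignored because they contain no @ sign.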

    Random Email Generator

    For this tutorial, we will be extracting emails from the https://www.randomlists.com/email-addresses URL, which generates random emails with every request. If you want, you can use any other webpage URL to extract emails.

    How to Make an Email Extractor in Python?

    Let's start with importing all the modules.

    from requests_html import HTMLSession
    import re
    from bs4 import BeautifulSoup

    Now set the url and pattern variables, which hold the webpage URL and the regular expression pattern for the emails.

    # page URL
    url = "https://www.randomlists.com/email-addresses"
    
    # regex pattern for email addresses (lowercase letters only)
    pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
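    One thing to keep in mind about this pattern is that its character classes list only lowercase letters, so an address written with capitals is skipped. The sketch below shows one possible workaround, passing re.IGNORECASE; the sample text is hypothetical, and whether you want case-insensitive matching depends on the pages you scrape.

```python
import re

pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# Hypothetical text containing an address with uppercase letters.
text = "Write to Jane.Doe@Example.COM today"

# As written, the pattern fails here: the character right after @ is uppercase.
strict = re.findall(pattern, text)

# With re.IGNORECASE, [a-z] also matches A-Z.
relaxed = re.findall(pattern, text, flags=re.IGNORECASE)

print(strict)
print(relaxed)
```

    The first call finds nothing, while the second returns the full address.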

    Next, initialize the HTMLSession() object, which keeps cookies and reuses the connection across requests.

    #initialize the session
    session = HTMLSession()

    After initializing the session, let's send a GET request to the page URL.

    #send the get request
    response = session.get(url)

    After sending the GET request, we get the response or HTML data from the server. Now, let's run all the JavaScript code of the response object using the html.render() method.

    #simulate JS running code
    response.html.render()

    The first time you call render(), requests-html downloads a copy of Chromium, so do not worry if you see a download in progress during code execution. The data you see on a webpage generally sits inside the HTML <body> tag, so let's grab the body element from the response object.

    #get body element
    body = response.html.find("body")[0]

    The find("body") call returns a list of matching <body> elements. Since an HTML page has only one body, we use index [0] to take the first result. Next, let's extract the list of emails from the body text and print them.

    #extract emails using the pattern defined earlier
    emails = re.findall(pattern, body.text)
    
    for index, email in enumerate(emails):
        print(index+1, "---->", email)
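    A page may list the same address more than once, in which case re.findall() returns it each time. If that matters for your use case, a small sketch like the following removes duplicates while keeping the original order. The emails list here is a hypothetical stand-in for the result of re.findall().

```python
# Hypothetical result of re.findall() with one repeated address.
emails = ["a@example.com", "b@example.com", "a@example.com"]

# dict.fromkeys() keeps the first occurrence of each key, in insertion order.
unique_emails = list(dict.fromkeys(emails))

# enumerate(..., start=1) avoids the manual index+1.
for index, email in enumerate(unique_emails, start=1):
    print(index, "---->", email)
```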

    Now let us put all the code together and execute it.

    Python Program to Extract Emails from a Webpage

    from requests_html import HTMLSession
    import re
    from bs4 import BeautifulSoup
    
    #page url
    url =r"https://www.randomlists.com/email-addresses"
    
    #regex pattern
    pattern =r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
    
    #initialize the session
    session = HTMLSession()
    
    #send the get request
    response = session.get(url)
    
    #simulate JS running code
    response.html.render()
    
    #get body element
    body = response.html.find("body")[0]
    
    #extract emails using the pattern defined above
    emails = re.findall(pattern, body.text)
    
    for index, email in enumerate(emails):
        print(index+1, "---->", email)

    Output

    1 ----> horrocks@yahoo.com
    2 ----> leocharre@live.com
    3 ----> howler@gmail.com
    4 ----> naoya@me.com
    5 ----> gfxguy@gmail.com
    6 ----> kalpol@outlook.com
    7 ----> scato@hotmail.com
    8 ----> tkrotchko@live.com
    9 ----> citizenl@aol.com
    10 ----> sagal@mac.com
    11 ----> afeldspar@sbcglobal.net
    12 ----> maneesh@gmail.com
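    If you plan to reuse the script, one option is to split it into two functions: one that fetches and renders the page, and one that only runs the regular expression. This is a sketch, not the only way to structure it. The function names extract_emails and fetch_rendered_text are hypothetical, and the raise_for_status() check and the session.close() call are additions that the program above does not have.

```python
import re

EMAIL_PATTERN = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+")


def extract_emails(text):
    """Return every substring of text that matches EMAIL_PATTERN."""
    return EMAIL_PATTERN.findall(text)


def fetch_rendered_text(url):
    """Fetch url, run its JavaScript, and return the text of its <body>."""
    # Imported here so extract_emails() can be used without requests-html.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        response.raise_for_status()  # stop early on HTTP 4xx/5xx responses
        response.html.render()       # downloads Chromium on first use
        body = response.html.find("body", first=True)
        return body.text if body is not None else ""
    finally:
        session.close()
```

    Calling extract_emails(fetch_rendered_text(url)) should then give the same list as the program above, and extract_emails() can be tried on any string without a network connection.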

    Conclusion

    In this Python tutorial, we learned how to make an email extractor in Python that extracts emails from a webpage using the requests-html, beautifulsoup4, and re Python libraries. You can also extract emails from a text file by combining Python file handling with the same regular expression.
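    As a rough sketch of the text-file variant, the snippet below reads a file and applies the same pattern. The function name extract_emails_from_file, the file name contacts.txt, and its contents are all hypothetical; the demo writes a temporary file so that it runs on its own.

```python
import re
import tempfile
from pathlib import Path

EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"


def extract_emails_from_file(path):
    """Read a text file and return all email-like strings found in it."""
    text = Path(path).read_text(encoding="utf-8")
    return re.findall(EMAIL_PATTERN, text)


# Demo: create a throwaway file with made-up addresses, then extract them.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "contacts.txt"
    sample.write_text(
        "Sales: sales@example.com\nSupport: help@example.org\n",
        encoding="utf-8",
    )
    emails = extract_emails_from_file(sample)

print(emails)
```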

    We hope you like this article, and if you have any queries or suggestions related to the above article or program, please let us know by commenting below.

    Thanks for reading!
