How to Extract All Website Links in Python?

September 26, 2021

A web page is a collection of data, and that data can be anything: text, images, videos, files, links, and so on. With the help of web scraping, we can extract that data from the web page.

Let’s say there is a web page on the internet and you want to extract only the URLs or links from that page, for example to analyze the number of internal and external links. There are many web applications on the internet that charge hundreds of dollars for such features, extracting valuable data from other web pages to gain insights into their strategies.


You do not need to buy or rely on other applications to perform such a trivial task when you can write a Python script that extracts all the URL links from a web page.


In this Python tutorial, I will walk you through a Python program that can extract links or URLs from a web page. But before diving into the code, let’s install the required libraries that we will be using in this tutorial.

Install Required Libraries

Here is the list of all the required libraries that we will use in this tutorial.

Python requests library 

requests is the de facto Python library for making HTTP requests. We will be using this library to send GET requests to the web page URL.

You can install the requests library for your Python environment using the pip install command:

pip install requests
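To check that the installation worked, here is a minimal sketch of a GET request; the URL is just the one we will scrape later in this tutorial:

import requests

# send a GET request and check that the page loaded successfully
response = requests.get("https://www.techgeekbuzz.com/")
print(response.status_code)    # 200 means the request succeeded
print(response.text[:200])     # first 200 characters of the HTML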

Python beautifulsoup4 library 

beautifulsoup4 is an open-source library used to extract or pull data from HTML and XML pages. In this tutorial, we will be using it to extract <a> tag href links from the web page’s HTML.

To install beautifulsoup4 for your Python environment, run the following pip install command.

pip install beautifulsoup4
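As a quick sketch of what beautifulsoup4 does, the following parses a small hard-coded HTML snippet and pulls the href out of its <a> tag:

from bs4 import BeautifulSoup

# parse a tiny HTML snippet and read the href of its <a> tag
snippet = '<p><a href="https://example.com">a link</a></p>'
soup = BeautifulSoup(snippet, "html.parser")
print(soup.find("a").get("href"))    # https://example.com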

Python Colorama Library

The colorama library is used to print colored text output on the terminal or command prompt.

This library is optional for this tutorial; we will use it only to print the output in a colorful format.

To install colorama for your Python environment, run the following pip install command.

pip install colorama
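Here is a minimal sketch of how colorama is used; autoreset=True is an optional flag that resets the color after every print:

from colorama import Back, init

init(autoreset=True)    # on Windows this also converts ANSI codes to Win32 calls
print(Back.GREEN + "printed on a green background")
print(Back.RED + "printed on a red background")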

Alright, we are all set. Now open your favorite Python IDE or text editor and code along.

How to Extract URLs from a Web Page in Python?

Let’s begin by importing the required modules.

# modules
from colorama import Back
import requests
from bs4 import BeautifulSoup

# for Windows: convert ANSI escape sequences to Win32 calls
from colorama import init
init()

I am on Windows, which is why I need the two additional statements from colorama import init and init(). On Windows, colorama must filter ANSI escape sequences out of any text sent to stdout or stderr and replace them with equivalent Win32 calls.

If you are on Mac or Linux, you do not need to write the above two statements. Even if you do, they will have no effect.

After initializing colorama with init(), let’s define the web page URL in the url identifier and send a GET request to it.

# page url
url = "https://www.techgeekbuzz.com/"

# send get request
response = requests.get(url)

Now we can parse the response HTML text with the BeautifulSoup() constructor and find all the <a> tags present in the response HTML page.

# parse html page
html_page = BeautifulSoup(response.text, "html.parser")

# get all <a> tags
all_urls = html_page.findAll("a")

The findAll() function returns a list of all the <a> tags present in html_page. (In newer versions of beautifulsoup4, the same method is also available as find_all().)
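If you want to see what findAll() gives back before writing the full loop, a quick sanity check (assuming the page has at least one <a> tag) looks like this:

# quick sanity check on the scraped <a> tags
print(len(all_urls))              # number of <a> tags found on the page
print(all_urls[0])                # the first <a> tag as parsed HTML
print(all_urls[0].get("href"))    # its href attribute, or None if missing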

As we want to separate the internal and external URLs present on the web page, let’s define two empty Python sets, internal_urls and external_urls.

internal_urls = set()
external_urls = set()

Now we will loop through every <a> tag in the all_urls list and read its href attribute value with the get() function, because the href attribute holds the link’s URL.

for link in all_urls:
    href = link.get('href')

    if href:
        if "techgeekbuzz.com" in href:    # internal link
            internal_urls.add(href)

        elif href[0] == "#":    # same-page anchor link
            internal_urls.add(f"{url}{href}")

        else:                             # external link
            external_urls.add(href)

add() is the set method that adds an element to the set object.
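Note that the loop above only special-cases hrefs that start with #, so relative paths such as /category/python/ would land in external_urls. As an optional refinement, not part of the original script, Python’s standard urllib.parse module can resolve every href against the page URL and compare domains instead:

from urllib.parse import urljoin, urlparse

for link in all_urls:
    href = link.get("href")
    if not href:
        continue

    # resolve relative links ("#top", "/about/") against the page URL
    full_url = urljoin(url, href)

    # compare domains instead of substring-matching the href
    if urlparse(full_url).netloc == urlparse(url).netloc:
        internal_urls.add(full_url)
    else:
        external_urls.add(full_url)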

Now let’s print all the internal URLs on a green background and the external URLs on a red background.

print(Back.MAGENTA + f"Total Internal URLs: {len(internal_urls)}\n")
for url in internal_urls:
    print(Back.GREEN + f"Internal URL {url}")


print(Back.MAGENTA + f"\n\nTotal External URLs: {len(external_urls)}\n")
for url in external_urls:
    print(Back.RED + f"External URL {url}")

Now put all the code together and execute it.

Python Program to Extract URLs from a Web Page

# modules
from colorama import Back
import requests
from bs4 import BeautifulSoup

# for Windows: convert ANSI escape sequences to Win32 calls
from colorama import init
init()

# page url
url = "https://www.techgeekbuzz.com/"

# send get request
response = requests.get(url)

# parse html page
html_page = BeautifulSoup(response.text, "html.parser")

# get all <a> tags
all_urls = html_page.findAll("a")

internal_urls = set()
external_urls = set()

for link in all_urls:
    href = link.get('href')

    if href:
        if "techgeekbuzz.com" in href:    # internal link
            internal_urls.add(href)

        elif href[0] == "#":    # same-page anchor link
            internal_urls.add(f"{url}{href}")

        else:                             # external link
            external_urls.add(href)

print(Back.MAGENTA + f"Total Internal URLs: {len(internal_urls)}\n")
for url in internal_urls:
    print(Back.GREEN + f"Internal URL {url}")


print(Back.MAGENTA + f"\n\nTotal External URLs: {len(external_urls)}\n")
for url in external_urls:
    print(Back.RED + f"External URL {url}")
  

Output

When you run the program, the totals print on a magenta background, each internal URL prints on a green background, and each external URL prints on a red background.

Conclusion

In this Python tutorial, you learned how to extract the internal and external URLs of a web page in Python. The above program is an application of web scraping with Python. I would recommend reading the official documentation of beautifulsoup4 and requests to learn more about web data extraction with Python.

