How to Extract All Website Links in Python?

September 26, 2021

A web page is a collection of data, and that data can be anything: text, images, videos, files, links, and so on. With the help of web scraping, we can extract that data from the web page.

Let’s say there is a web page on the internet and you want to extract only the URLs or links from that page, for example to analyze the number of internal and external links. There are many web applications on the internet that charge hundreds of dollars for such features, extracting valuable data from other web pages to gain insights into their strategies.


You do not need to buy or rely on other applications to perform such a trivial task when you can write a Python script that extracts all the URL links from a web page.


In this Python tutorial, I will walk you through a Python program that can extract links or URLs from a web page. But before diving into the code, let’s install the required libraries that we will be using in this tutorial.

Install Required Libraries

Here is the list of all the required libraries that we will use in this tutorial.

Python requests library 

requests is the de facto Python library for making HTTP requests. We will be using this library to send GET requests to the web page URL.

You can install the requests library for your Python environment using the pip install command:

pip install requests
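To check that the installation worked, here is a minimal sketch of a GET request; the URL is just the one we will scrape later in this tutorial:

import requests

# send a GET request and check that the page loaded successfully
response = requests.get("https://www.techgeekbuzz.com/")
print(response.status_code)    # 200 means the request succeeded
print(response.text[:200])     # first 200 characters of the HTML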

Python beautifulsoup4 library 

beautifulsoup4 is an open-source library used to extract or pull data from HTML and XML pages. In this tutorial, we will be using it to extract <a> tag href links from the web page’s HTML.

To install beautifulsoup4 for your Python environment, run the following pip install command.

pip install beautifulsoup4
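As a quick sketch of what beautifulsoup4 does, the following parses a small hard-coded HTML snippet and pulls the href out of its <a> tag:

from bs4 import BeautifulSoup

# parse a tiny HTML snippet and read the href of its <a> tag
snippet = '<p><a href="https://example.com">a link</a></p>'
soup = BeautifulSoup(snippet, "html.parser")
print(soup.find("a").get("href"))    # https://example.com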

Python Colorama Library

The colorama library is used to print colored text output on the terminal or command prompt.

This library is optional for this tutorial; we will use it only to print the output in a colorful format.

To install colorama for your Python environment, run the following pip install command.

pip install colorama
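Here is a minimal sketch of how colorama is used; autoreset=True is an optional flag that resets the color after every print:

from colorama import Back, init

init(autoreset=True)    # on Windows this also converts ANSI codes to Win32 calls
print(Back.GREEN + "printed on a green background")
print(Back.RED + "printed on a red background")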

Alright, we are all set. Now open your favorite Python IDE or text editor and code along.

How to Extract URLs from a Web Page in Python?

Let’s begin by importing the required modules.

# modules
from colorama import Back
import requests
from bs4 import BeautifulSoup

# for Windows: convert ANSI escape sequences to Win32 calls
from colorama import init
init()

I am on Windows, which is why I need the two additional statements from colorama import init and init(). On Windows, colorama must filter ANSI escape sequences out of any text sent to stdout or stderr and replace them with equivalent Win32 calls.

If you are on Mac or Linux, you do not need to write the above two statements. Even if you do, they will have no effect.

After initializing colorama with init(), let’s define the web page URL in the url identifier and send a GET request to it.

# page url
url = "https://www.techgeekbuzz.com/"

# send get request
response = requests.get(url)

Now we can parse the response HTML text with the BeautifulSoup() constructor and find all the <a> tags present in the response HTML page.

# parse html page
html_page = BeautifulSoup(response.text, "html.parser")

# get all <a> tags
all_urls = html_page.findAll("a")

The findAll() function returns a list of all the <a> tags present in html_page. (In newer versions of beautifulsoup4, the same method is also available as find_all().)
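If you want to see what findAll() gives back before writing the full loop, a quick sanity check (assuming the page has at least one <a> tag) looks like this:

# quick sanity check on the scraped <a> tags
print(len(all_urls))              # number of <a> tags found on the page
print(all_urls[0])                # the first <a> tag as parsed HTML
print(all_urls[0].get("href"))    # its href attribute, or None if missing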

As we want to separate the internal and external URLs present on the web page, let’s define two empty Python sets, internal_urls and external_urls.

internal_urls = set()
external_urls = set()

Now we will loop through every <a> tag in the all_urls list and read its href attribute value with the get() function, because the href attribute holds the link’s URL.

for link in all_urls:
    href = link.get('href')

    if href:
        if "techgeekbuzz.com" in href:    # internal link
            internal_urls.add(href)

        elif href[0] == "#":    # same-page anchor link
            internal_urls.add(f"{url}{href}")

        else:                             # external link
            external_urls.add(href)

add() is the set method that adds an element to the set object.
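Note that the loop above only special-cases hrefs that start with #, so relative paths such as /category/python/ would land in external_urls. As an optional refinement, not part of the original script, Python’s standard urllib.parse module can resolve every href against the page URL and compare domains instead:

from urllib.parse import urljoin, urlparse

for link in all_urls:
    href = link.get("href")
    if not href:
        continue

    # resolve relative links ("#top", "/about/") against the page URL
    full_url = urljoin(url, href)

    # compare domains instead of substring-matching the href
    if urlparse(full_url).netloc == urlparse(url).netloc:
        internal_urls.add(full_url)
    else:
        external_urls.add(full_url)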

Now let’s print all the internal URLs on a green background and the external URLs on a red background.

print(Back.MAGENTA + f"Total Internal URLs: {len(internal_urls)}\n")
for url in internal_urls:
    print(Back.GREEN + f"Internal URL {url}")


print(Back.MAGENTA + f"\n\nTotal External URLs: {len(external_urls)}\n")
for url in external_urls:
    print(Back.RED + f"External URL {url}")

Now put all the code together and execute it.

Python Program to Extract URLs from a Web Page

# modules
from colorama import Back
import requests
from bs4 import BeautifulSoup

# for Windows: convert ANSI escape sequences to Win32 calls
from colorama import init
init()

# page url
url = "https://www.techgeekbuzz.com/"

# send get request
response = requests.get(url)

# parse html page
html_page = BeautifulSoup(response.text, "html.parser")

# get all <a> tags
all_urls = html_page.findAll("a")

internal_urls = set()
external_urls = set()

for link in all_urls:
    href = link.get('href')

    if href:
        if "techgeekbuzz.com" in href:    # internal link
            internal_urls.add(href)

        elif href[0] == "#":    # same-page anchor link
            internal_urls.add(f"{url}{href}")

        else:                             # external link
            external_urls.add(href)

print(Back.MAGENTA + f"Total Internal URLs: {len(internal_urls)}\n")
for url in internal_urls:
    print(Back.GREEN + f"Internal URL {url}")


print(Back.MAGENTA + f"\n\nTotal External URLs: {len(external_urls)}\n")
for url in external_urls:
    print(Back.RED + f"External URL {url}")
  

Output

When you run the program, the totals print on a magenta background, each internal URL prints on a green background, and each external URL prints on a red background.

Conclusion

In this Python tutorial, you learned how to extract the internal and external URLs of a web page in Python. The above program is an application of web scraping with Python. I would recommend reading the official documentation of beautifulsoup4 and requests to learn more about web data extraction with Python.

