How to Extract All PDF Links in Python?

By | September 27, 2021

Portable Document Format (PDF) is a file format that generally contains text and images. The text can also carry links that lead to websites or web pages.

There are many Python libraries that can read and write PDF files, but only a few of them are useful when it comes to extracting specific data such as images and links.


In this Python tutorial, I will walk you through a Python program that extracts all the external links from a PDF.

A PDF can also have internal links that take the reader to a specific page or section of the same document. This tutorial does not cover them in detail, but the program below includes commented-out code that shows how to access them.

Before diving into the program let’s install the required library.

Install Required Library

For this program, we will be using the open-source PyMuPDF library, a powerful and straightforward tool for reading PDFs and other e-book formats.

To install the PyMuPDF library, run the following pip command in your terminal or command prompt.

pip install PyMuPDF

You will also need a PDF from which to extract links. I suggest storing the PDF in the same directory as your Python script so you can load it with a relative path; otherwise, you have to specify the absolute path to the PDF file.

Now you are all set. Open your favorite Python IDE or text editor and start coding.

Python Code Implementation

Let’s begin with importing the required module.

import fitz # PyMuPDF

Now specify the filename as a string.

#filename
filename = "book.pdf"

Here my PDF file "book.pdf" resides in the same directory as the Python script, which is why I am specifying only the file name as a relative path. If your PDF file is located in some other directory or drive, specify its absolute path. A relative path also works, but it must be correct with respect to the directory from which you run the script.

Now open the PDF file with the fitz.open() method and loop through its pages.
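If you do not want to depend on the directory from which the script is launched, one option is to build the path from the script's own location. The following is a minimal sketch that assumes the PDF sits next to the script; the helper name pdf_path_beside and the script path "scripts/extract_links.py" are made up for illustration.

```python
from pathlib import Path


def pdf_path_beside(script_file, pdf_name):
    """Return the path of pdf_name in the same directory as script_file."""
    # Path(...).parent is the directory that contains the script
    return Path(script_file).parent / pdf_name


# Inside a real script you would normally pass __file__ as the first argument;
# a literal (hypothetical) script path is used here so the example runs anywhere.
filename = pdf_path_beside("scripts/extract_links.py", "book.pdf")
print(filename)
```

If your PyMuPDF version does not accept Path objects, passing str(filename) to fitz.open() should be a safe choice.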

with fitz.open(filename) as my_pdf_file:

    #loop through every page
    for page_number in range(1, len(my_pdf_file)+1):

        # access individual page
        page = my_pdf_file[page_number-1]

        for link in page.links():
            #if the link is an external link with http or https (URI)
            if "uri" in link:
                #access link url
                url = link["uri"]
                print(f'Link: "{url}" found on page number --> {page_number}')
            #if the link is internal or file with no URI
            else:
                pass
                # if "page" in link:
                #     print("Internal page linking to page no", link["page"])
                # else:
                #     print("File linking", link["file"])

The with fitz.open(filename) as my_pdf_file statement opens the PDF file and closes it automatically when the block ends.

The page.links() call returns a generator that yields every link present on the page, one at a time.

Each link is a dictionary with keys such as uri, page, file, and kind.

A link has the uri (Uniform Resource Identifier) key when it points to an external address, for example one that starts with http, https, or mailto.
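To make the role of these keys more concrete, here is a small sketch that describes a link based on which keys it carries. The function name describe_link and the sample dictionaries are hand-written for illustration (the file name "appendix.pdf" is hypothetical); they only mimic the keys mentioned above and were not produced by PyMuPDF. As far as I know, a link that points to another PDF can carry both a file and a page key, so the sketch checks file before page.

```python
def describe_link(link):
    """Return a short description of one link dictionary."""
    if "uri" in link:
        # external link, for example a web address or a mailto address
        return f'external: {link["uri"]}'
    if "file" in link:
        # link that points at another file
        return f'file: {link["file"]}'
    if "page" in link:
        # internal jump; the stored value is assumed to be a 0-based page index
        return f'internal: page index {link["page"]}'
    return "other"


# Hand-written sample dictionaries (assumptions, not real PyMuPDF output)
sample_links = [
    {"uri": "https://twoscoopspress.com"},
    {"page": 4},
    {"file": "appendix.pdf"},
]

for link in sample_links:
    print(describe_link(link))
```

In the real program you would call describe_link(link) inside the for link in page.links() loop instead of iterating over sample_links.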

Now put all the code together and execute.

#Python program to Extract All PDF Links in Python

import fitz # PyMuPDF

#filename
filename = "book.pdf"

with fitz.open(filename) as my_pdf_file:

    #loop through every page
    for page_number in range(1, len(my_pdf_file)+1):

        # access individual page
        page = my_pdf_file[page_number-1]

        for link in page.links():
            #if the link is an external link with http or https (URI)
            if "uri" in link:
                url = link["uri"]
                print(f'Link: "{url}" found on page number --> {page_number}')
            #if the link is internal or file with no URI
            else:
                pass
                # if "page" in link:
                #     print("Internal page linking to page no", link["page"])
                # else:
                #     print("File linking", link["file"])

Output

Link: "https://twoscoopspress.com" found on page number --> 4
Link: "http://2scoops.co/malcolm-tredinnick-memorial" found on page number --> 7
Link: "https://github.com/twoscoops/two-scoops-of-django-1.8/issues" found on page number --> 32
Link: "http://www.2scoops.co/1.8-errata/" found on page number --> 32
Link: "https://docs.djangoproject.com/en/1.8/intro/tutorial01/" found on page number --> 33
Link: "http://www.2scoops.co/1.8-code-examples/" found on page number --> 34
Link: "https://docs.djangoproject.com/en/1.8/misc/design-philosophies/" found on page number --> 36
Link: "http://12factor.net" found on page number --> 37
Link: "http://www.2scoops.co/1.8-change-list/" found on page number --> 37
Link: "https://github.com/twoscoops/two-scoops-of-django-1.8/issues" found on page number --> 38
Link: "https://github.com/twoscoops/two-scoops-of-django-1.8/issues" found on page number --> 38
Link: "http://www.python.org/dev/peps/pep-0008/" found on page number --> 40
Link: "http://2scoops.co/hobgoblin-of-little-minds" found on page number --> 40
Link: "http://www.python.org/dev/peps/pep-0008/#maximum-line-length" found on page number --> 41
Link: "http://2scoops.co/guido-on-pep-8-vs-pep-328" found on page number --> 45
Link: "http://www.python.org/dev/peps/pep-0328/" found on page number --> 45
Link: "http://2scoops.co/1.8-coding-style" found on page number --> 47
Link: "https://github.com/rwaldron/idiomatic.js/" found on page number --> 48
Link: "https://github.com/madrobby/pragmatic.js" found on page number --> 48
Link: "https://github.com/airbnb/javascript" found on page number --> 48
............
.........
.......
....
...
.
Link: "http://ponycheckup.com/" found on page number --> 506

As you can see from the above output, we extracted only the URI links, which are the external links whose URLs start with a scheme such as http, https, or mailto.
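The output above also shows that the same URL can be reported more than once, sometimes even on the same page. If you would rather see each URL a single time, one possible approach is to collect (url, page number) pairs and group them afterwards. The sketch below is only an illustration: the function name group_pages_by_url is made up, and the found_links list is typed in by hand from a few lines of the output above rather than read from a PDF.

```python
def group_pages_by_url(found_links):
    """Map each URL to the sorted list of distinct pages it appears on."""
    pages_by_url = {}
    for url, page_number in found_links:
        # a set drops repeated page numbers for the same URL
        pages_by_url.setdefault(url, set()).add(page_number)
    return {url: sorted(pages) for url, pages in pages_by_url.items()}


# (url, page_number) pairs copied by hand from the sample output
found_links = [
    ("https://github.com/twoscoops/two-scoops-of-django-1.8/issues", 32),
    ("http://www.2scoops.co/1.8-errata/", 32),
    ("https://github.com/twoscoops/two-scoops-of-django-1.8/issues", 38),
    ("https://github.com/twoscoops/two-scoops-of-django-1.8/issues", 38),
]

for url, pages in group_pages_by_url(found_links).items():
    print(f'Link: "{url}" found on pages --> {pages}')
```

In the full program, you could append (url, page_number) to a list inside the loop instead of printing, and pass that list to the function once the loop has finished.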

Conclusion

In this Python tutorial, we learned how to access all the links in a PDF. You can also extract links from a specific page: instead of looping over every page, access only the page you need and read its links.

I have also written a Python tutorial on how to extract images from a PDF using Python and the PyMuPDF library. I suggest you read it if you want to keep playing with Python and PDFs.
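As a rough sketch of that tweak, the helper below reads the external links of a single page. The function names uri_links_on_page and print_links_on_page are my own and not part of PyMuPDF; the first one only assumes an opened document that supports len() and indexing, with pages that offer the links() method used earlier in this tutorial.

```python
def uri_links_on_page(pdf_document, page_number):
    """Return the external URLs found on one page (page_number starts at 1)."""
    if not 1 <= page_number <= len(pdf_document):
        raise ValueError(f"page_number must be between 1 and {len(pdf_document)}")
    # pages are indexed from 0, so page 1 is at index 0
    page = pdf_document[page_number - 1]
    return [link["uri"] for link in page.links() if "uri" in link]


def print_links_on_page(filename, page_number):
    """Open the PDF with PyMuPDF and print the external links of one page."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    with fitz.open(filename) as my_pdf_file:
        for url in uri_links_on_page(my_pdf_file, page_number):
            print(f'Link: "{url}" found on page number --> {page_number}')
```

With the same "book.pdf" as before, calling print_links_on_page("book.pdf", 32) would print only the links found on page 32.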
