How to Extract All PDF Links in Python?

Vinay Khatri
Last updated on April 27, 2024

    In this tutorial, we will discuss how to extract all PDF links in Python. Portable Document Format (PDF) is a file format that generally contains text and image data. Parts of that text can also be links that lead to websites or web pages.

    There are many Python libraries that can read and write PDF files, but only a few of them are useful for extracting specific data such as images and links.

    In this Python tutorial, we will walk you through a Python program that extracts all the external links from a PDF. A PDF can also contain internal links that take the reader to a specific page or section of the same document. We do not cover internal links in detail here, but the program below includes commented-out code that shows how to access them. Before diving into the program, let's install the required library.

    Install Required Library to Extract All PDF Links in Python

    For this program, we will use PyMuPDF, an open-source Python library that is a powerful and straightforward tool for reading PDFs and other book formats. To install PyMuPDF, run the following pip command in your terminal or command prompt:

    pip install PyMuPDF

    You will also need a PDF from which you wish to extract the links. We suggest storing the PDF file in the same directory as your Python script so that you can load it using a relative path. Otherwise, you have to specify the absolute path to the PDF file. Now you are all set, so open your favorite Python IDE or text editor and start coding.
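
    A relative path is resolved against the directory you run Python from, which is not always the script's directory. The sketch below shows one way to turn a filename into an absolute path using only the standard library. The helper name resolve_pdf_path and its base_dir parameter are illustrative assumptions, not part of PyMuPDF.

```python
from pathlib import Path


def resolve_pdf_path(filename, base_dir=None):
    """Return an absolute path for `filename`.

    Relative names are resolved against `base_dir`, which defaults to the
    current working directory. Absolute paths are returned unchanged.
    (Hypothetical helper, written for illustration only.)
    """
    path = Path(filename)
    if path.is_absolute():
        return path
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return (base / path).resolve()


# Resolve "book.pdf" against the current working directory.
print(resolve_pdf_path("book.pdf"))
```

    Inside a script, passing base_dir=Path(__file__).parent would resolve the name against the script's own folder, regardless of where the script is launched from.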

    How to Extract All PDF Links in Python?

    Let's begin with importing the required module.

    import fitz # PyMuPDF

    Now specify the filename as a string.

    #filename
    filename = "book.pdf"

    Here, our PDF file, "book.pdf", resides in the same directory as the Python script, which is why we specify a relative path. If your PDF file is located in another directory or drive, specify the absolute path, or a relative path that leads to it exactly. Now open the PDF file with fitz.open().

    with fitz.open(filename) as my_pdf_file:
    
        # loop through every page (page numbers shown to the user start at 1)
        for page_number in range(1, len(my_pdf_file)+1):
    
            # access the individual page (pages are indexed from 0)
            page = my_pdf_file[page_number-1]
    
            for link in page.links():
                # if the link is an external link (URI), e.g. http, https, or mailto
                if "uri" in link:
                    # access the link URL
                    url = link["uri"]
                    print(f'Link: "{url}" found on page number --> {page_number}')
                # if the link is internal or points to a file, it has no URI
                else:
                    pass
                    # if "page" in link:
                    #     print("Internal page linking to page no", link["page"])
                    # else:
                    #     print("File linking", link["file"])

    • The with fitz.open(filename) as my_pdf_file statement opens the PDF file and closes it automatically when the block ends.
    • page.links() lets you iterate over all the links present on the page.
    • Each link is a dictionary object with keys such as uri, page, file, kind, and so on.
    • A link has a uri key when it points to an external resource, for example an address starting with http, https, or mailto.
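
    The key checks described in the list above can be tried out without opening a PDF. The following is a minimal sketch that classifies a link dictionary by the keys it carries. The function name describe_link and the sample dictionaries are hypothetical; real dictionaries returned by PyMuPDF contain additional keys.

```python
def describe_link(link):
    """Classify a link dictionary by the keys it carries.

    Hypothetical helper for illustration. "file" is checked before "page"
    because a link to another document may carry both keys.
    """
    if "uri" in link:
        return f"external: {link['uri']}"
    if "file" in link:
        return f"file: {link['file']}"
    if "page" in link:
        # The value is reported as stored, without converting it.
        return f"internal: page index {link['page']}"
    return "other"


# Made-up sample dictionaries, shaped like the ones discussed above.
samples = [
    {"uri": "https://twoscoopspress.com"},
    {"page": 4},
    {"file": "other.pdf", "page": 0},
]
for sample in samples:
    print(describe_link(sample))
```

    Using a chain of membership tests keeps the function safe for link types that have neither a page nor a file key, which a direct link["file"] lookup would not be.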

    Now, put all the code together and execute.

    #A Simple Program to Extract All PDF Links in Python

    import fitz # PyMuPDF
    
    #filename
    filename = "book.pdf"
    
    with fitz.open(filename) as my_pdf_file:
    
        # loop through every page (page numbers shown to the user start at 1)
        for page_number in range(1, len(my_pdf_file)+1):
    
            # access the individual page (pages are indexed from 0)
            page = my_pdf_file[page_number-1]
    
            for link in page.links():
                # if the link is an external link (URI), e.g. http, https, or mailto
                if "uri" in link:
                    url = link["uri"]
                    print(f'Link: "{url}" found on page number --> {page_number}')
                # if the link is internal or points to a file, it has no URI
                else:
                    pass
                    # if "page" in link:
                    #     print("Internal page linking to page no", link["page"])
                    # else:
                    #     print("File linking", link["file"])

    Output

    Link: "https://twoscoopspress.com" found on page number --> 4
    Link: "http://2scoops.co/malcolm-tredinnick-memorial" found on page number --> 7
    Link: "https://github.com/twoscoops/two-scoops-of-django-1.8/issues" found on page number --> 32
    Link: "http://www.2scoops.co/1.8-errata/" found on page number --> 32
    Link: "https://docs.djangoproject.com/en/1.8/intro/tutorial01/" found on page number --> 33
    Link: "http://www.2scoops.co/1.8-code-examples/" found on page number --> 34
    Link: "https://docs.djangoproject.com/en/1.8/misc/design-philosophies/" found on page number --> 36
    Link: "http://12factor.net" found on page number --> 37
    Link: "http://www.2scoops.co/1.8-change-list/" found on page number --> 37
    Link: "https://github.com/twoscoops/two-scoops-of-django-1.8/issues" found on page number --> 38
    Link: "https://github.com/twoscoops/two-scoops-of-django-1.8/issues" found on page number --> 38
    Link: "http://www.python.org/dev/peps/pep-0008/" found on page number --> 40
    Link: "http://2scoops.co/hobgoblin-of-little-minds" found on page number --> 40
    Link: "http://www.python.org/dev/peps/pep-0008/#maximum-line-length" found on page number --> 41
    Link: "http://2scoops.co/guido-on-pep-8-vs-pep-328" found on page number --> 45
    Link: "http://www.python.org/dev/peps/pep-0328/" found on page number --> 45
    Link: "http://2scoops.co/1.8-coding-style" found on page number --> 47
    Link: "https://github.com/rwaldron/idiomatic.js/" found on page number --> 48
    Link: "https://github.com/madrobby/pragmatic.js" found on page number --> 48
    Link: "https://github.com/airbnb/javascript" found on page number --> 48
    ...
    Link: "http://ponycheckup.com/" found on page number --> 506

    From the above output, you can see that we extracted only the URI links, that is, external links such as URLs starting with http or https, or mailto addresses.
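
    The output above repeats some URLs, and a PDF may also contain mailto links. As a possible post-processing step, the sketch below keeps only web links and drops repeats while preserving order. The function name unique_web_links is an assumption made for illustration, and the mailto address in the sample is made up. The other sample URLs are taken from the output above.

```python
from urllib.parse import urlsplit


def unique_web_links(urls, schemes=("http", "https")):
    """Keep the first occurrence of each URL whose scheme is allowed.

    Hypothetical helper: `urls` is any iterable of URL strings, such as
    the values collected from link["uri"].
    """
    seen = set()
    result = []
    for url in urls:
        # Skip anything that is not a web link, e.g. mailto addresses.
        if urlsplit(url).scheme.lower() not in schemes:
            continue
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


found = [
    "https://github.com/twoscoops/two-scoops-of-django-1.8/issues",
    "https://github.com/twoscoops/two-scoops-of-django-1.8/issues",
    "mailto:someone@example.com",  # made-up address for the example
    "http://12factor.net",
]
print(unique_web_links(found))
```

    A set tracks what has already been seen, while the list preserves the order in which links first appear in the document.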

    Conclusion

    In this Python tutorial, we learned how to extract all PDF links in Python. You can also extract the links from a specific page: instead of looping over every page, access only the page you need and iterate over its links.

    We have also written a Python tutorial on how to extract images from a PDF using Python and the PyMuPDF library. We suggest you read it if you want to work with Python and PDFs.
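
    As a rough sketch of that single-page variation, the function below returns the URI links of one page. It assumes doc is a document already opened with fitz.open(), as in the program above. The function name uri_links_on_page is hypothetical, and the code relies only on len(), indexing, and page.links(), all of which the tutorial's program already uses.

```python
def uri_links_on_page(doc, page_number):
    """Return the URI links found on one page, using 1-based page numbers.

    `doc` is assumed to be an open PyMuPDF document. Any object that
    supports len(), integer indexing, and pages with a links() method
    behaves the same way here.
    """
    if not 1 <= page_number <= len(doc):
        raise ValueError(f"page_number must be between 1 and {len(doc)}")
    page = doc[page_number - 1]  # pages are indexed from 0
    return [link["uri"] for link in page.links() if "uri" in link]
```

    With PyMuPDF installed, you could call it inside the same with fitz.open(filename) as my_pdf_file: block, for example uri_links_on_page(my_pdf_file, 32). The range check guards against page numbers that do not exist in the document.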
