How to Extract Images from PDF in Python?

Posted in /  

How to Extract Images from PDF in Python?

Vinay Khatri
Last updated on November 14, 2022

    A Portable Document Format (PDF) is a file format that presents documents containing text and image data. Reading text from a PDF document is straightforward in Python. However, when it comes to images, there is no direct and standard way in Python to read the images.

    We often come across various PDFs from which we want to extract images, and for that, we can use some PDF reader or PDF image extraction application. But as a Python developer why install applications for such trivial tasks when we can write a Python script that extracts images from a PDF file.

    In this Python tutorial, we will walk you through the Python code that can extract images from PDF files and save them in the same directory as that of the code file. But before we write the Python program to extract images from PDF, we need to install certain Python libraries.

    Install Required Libraries

    Here, we will be using three Python libraries, namely pillow , io , and PyMuPDF . Among these three libraries, io is a part of the Python standard library, whereas pillow and PyMuPDF are open-source third-party libraries.

    To install pillow and PyMuPDF libraries for your Python environment, you need to run the following pip install commands on the command prompt or terminal app on your system: pillow: Pillow is a popular Python image handling library.

    pip install Pillow

    PyMuPDF : PyMuPDF library is used to access PDF, XPS, OpenXPS, epub, comic, and fiction book format files.

    pip install PyMuPDF

    io: io library is used to deal with various I/O streams.

    Python Implementation

    Once you have successfully installed the required libraries, you need to open your favorite Python IDE or code editor and start coding. Let's start with importing the required module.

    import fitz #the PyMuPDF module
    from PIL import Image
    import io

    Now, open the pdf file my_file.pdf with fitz.open() method, loop through every page, and extract images from every page and save them locally.

    filename = "my_file.pdf"
    # open file
    with fitz.open(filename) as my_pdf_file:
    
        #loop through every page
        for page_number in range(1, len(my_pdf_file)+1):
    
            # acess individual page
            page = my_pdf_file[page_number-1]
    
            # accesses all images of the page
            images = page.getImageList()
    
            # check if images are there
            if images:
                print(f"There are {len(images)} image/s on page number {page_number}[+]")
            else:
                print(f"There are No image/s on page number {page_number}[!]")
    
            # loop through all images present in the page 
            for image_number, image in enumerate(page.getImageList(), start=1):
    
                #access image xerf
                xref_value = image[0]
                
                #extract image information
                base_image = my_pdf_file.extractImage(xref_value)
    
                # access the image itself
                image_bytes = base_image["image"]
    
                #get image extension
                ext = base_image["ext"]
    
                #load image
                image = Image.open(io.BytesIO(image_bytes))
    
                #save image locally
                image.save(open(f"Page{page_number}Image{image_number}.{ext}", "wb"))

    Here's a brief overview of the functions and methods mentioned in the above code: The fitz.open(filename) as my_pdf_file statement opens the PDF file. page.getImageList() returns a list of all images present on the single page.

    The my_pdf_file.extractImage(xref_value) method returns all the information about the image, including its byte code and image extension. io.BytesIO(image_bytes) changes the image bytes-like object to proper bytes object. Image.open(io.BytesIO(image_bytes)) method opens the image byte object.

    image.save(open(f"Page{page_number}Image{image_number}.{ext}", "wb")) method saves the image locally.

    Now, put all the code together and execute.

    Python Program to Extract Images from the PDF File

    import fitz # PyMuPDF
    import io
    from PIL import Image
    
    #filename
    filename = "my_file.pdf"
    
    # open file
    with fitz.open(filename) as my_pdf_file:
    
        #loop through every page
        for page_number in range(1, len(my_pdf_file)+1):
    
            # acess individual page
            page = my_pdf_file[page_number-1]
    
            # accesses all images of the page
            images = page.getImageList()
    
            # check if images are there
            if images:
                print(f"There are {len(images)} image/s on page number {page_number}[+]")
            else:
                print(f"There are No image/s on page number {page_number}[!]")
    
            # loop through all images present in the page
            for image_number, image in enumerate(page.getImageList(), start=1):
    
                #access image xerf
                xref_value = image[0]
    
                #extract image information
                base_image = my_pdf_file.extractImage(xref_value)
    
                # access the image itself
                image_bytes = base_image["image"]
    
                #get image extension
                ext = base_image["ext"]
    
                #load image
                image = Image.open(io.BytesIO(image_bytes))
    
                #save image locally
                image.save(open(f"Page{page_number}Image{image_number}.{ext}", "wb"))

    Output: When you execute the above program, you will see an output similar to the one as follows (output depends on the images in the PDF file that you have chosen):

    There are 2 image/s on page number 1[+]
    There are 2 image/s on page number 2[+]
    There are 2 image/s on page number 3[+]
    There are 2 image/s on page number 4[+]
    There are 2 image/s on page number 5[+]
    There are 2 image/s on page number 6[+]
    There are 2 image/s on page number 7[+]
    There are 2 image/s on page number 8[+]
    There are 2 image/s on page number 9[+]
    There are 2 image/s on page number 10[+]
    There are 2 image/s on page number 11[+]
    There are 2 image/s on page number 12[+]
    There are 2 image/s on page number 13[+]
    There are 2 image/s on page number 14[+]
    There are 2 image/s on page number 15[+]
    There are 2 image/s on page number 16[+]
    There are 2 image/s on page number 17[+]
    There are 2 image/s on page number 18[+]
    There are 2 image/s on page number 19[+]
    There are 2 image/s on page number 20[+]
    There are 2 image/s on page number 21[+]
    There are 2 image/s on page number 22[+]
    There are 2 image/s on page number 23[+]
    There are 2 image/s on page number 24[+]
    There are 2 image/s on page number 25[+]
    There are 2 image/s on page number 26[+]
    There are 2 image/s on page number 27[+]
    There are 2 image/s on page number 28[+]
    There are 2 image/s on page number 29[+]
    There are 2 image/s on page number 30[+]

    The PDF that we have selected contains 2 images per page that's why we got the output shown above. If you check the directory where your Python script is present, you will see that all the images have been saved there.

    Conclusion

    In this Python tutorial, we learned how to access all the images in a PDF file using the PyMuPDF library and save them locally using the Python Pillow library. You can simply copy and paste the aforementioned Python program and replace the my_file.pdf filename with your own PDF file name and extract all the images present in it.

    To learn the Python language in-depth, purchase this course here.

    People are also reading:

    Leave a Comment on this Post

    0 Comments