How to Extract Wikipedia Data in Python?

Vinay Khatri
Last updated on March 29, 2024

    Wikipedia is the world's largest online encyclopedia. It contains millions of articles on almost every topic you can name, and its content is publicly available. Many of us have used Wikipedia for school projects and assignments, and thanks to its free articles we can read and learn about almost anything at no cost. Because Wikipedia holds millions of articles, a Python program is a convenient way to fetch data about specific topics.

    Let's say you are building a Python project that needs data about a specific topic from Wikipedia. In that case, you can either scrape the Wikipedia web pages yourself or use the Python wikipedia library, which wraps the Wikipedia API.

    In this Python tutorial, I will guide you through how to use the Python wikipedia library to get data for a specific topic. Before we start coding, let's install the library.

    To install the wikipedia library in your Python environment, run the following pip install command in your terminal (Linux/macOS) or command prompt (Windows):

    pip install wikipedia
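
    Once the command finishes, a quick check can confirm that the package is importable. The sketch below uses only the standard library; is_installed is a hypothetical helper name chosen for this example, not part of the wikipedia library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be imported in this environment."""
    # find_spec returns None when no top-level module by that name is found.
    return importlib.util.find_spec(module_name) is not None


if is_installed("wikipedia"):
    print("wikipedia is installed")
else:
    print("wikipedia is missing; run: pip install wikipedia")
```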

    Search Wikipedia topics with Python

    Let's start by searching Wikipedia topics with Python. The wikipedia module provides a search() function that returns a list of relevant page titles for a search query. The search(query, results=10, suggestion=False) function accepts three parameters: query is the topic we want to search for, and results is the maximum number of results the function should return; by default, its value is 10.

    The suggestion parameter is False by default. If it is set to True, the function returns a tuple instead of a plain list: the list of results together with a suggested query (or None when there is no suggestion).

    Now let's use the search() function to look up the topic "Python" and see what results we get.
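
    To make the tuple return value concrete, here is a minimal sketch. The wrapper name search_with_suggestion and its search_fn argument are assumptions made for this example (search_fn only exists so the wrapper can be tried with a stand-in function); the underlying call is wikipedia.search() with suggestion=True.

```python
def search_with_suggestion(query, limit=5, search_fn=None):
    """Return (titles, suggestion) for a query.

    search_fn is a hypothetical injection point: it defaults to
    wikipedia.search, but any callable with the same signature works.
    """
    if search_fn is None:
        import wikipedia  # imported lazily; needs `pip install wikipedia`
        search_fn = wikipedia.search

    # With suggestion=True the library returns a 2-tuple, not a plain list.
    titles, suggestion = search_fn(query, results=limit, suggestion=True)
    return titles, suggestion
```

    A call such as titles, suggestion = search_with_suggestion("Pythn") would then give you both the result list and the suggested spelling, if Wikipedia offers one.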

    import wikipedia

    topic = "Python"

    # search for Python and ask for up to 15 results
    results = wikipedia.search(topic, results=15)

    print(results)

    Output

    ['Python (programming language)', 'Python', 'Monty Python', 'Burmese python', 'Ball python', 'PYTHON', 'History of Python', 'Reticulated python', 'Python (genus)', 'Monty Python and the Holy Grail', 'Python molurus', 'Colt Python', 'Python (missile)', 'African rock python', 'Burmese pythons in Florida']

    From the output, you can see that the search() function returns a list of 15 elements for the query "Python". Each result returned by search() is the title of a Wikipedia page.
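
    Because each result is a page title, it can be turned into a link. The sketch below assumes English Wikipedia and its usual /wiki/<Title> URL scheme; title_to_url and BASE_URL are names invented for this example. When you load a page with the library, its url property should be preferred over a hand-built link.

```python
from urllib.parse import quote

# Assumption: English Wikipedia and its usual /wiki/<Title> URL scheme.
BASE_URL = "https://en.wikipedia.org/wiki/"


def title_to_url(title):
    """Build an article URL from a page title returned by search()."""
    # Wikipedia URLs use underscores in place of spaces.
    slug = title.strip().replace(" ", "_")
    # Percent-encode unsafe characters, but keep parentheses readable.
    return BASE_URL + quote(slug, safe="()")


for title in ["Python (programming language)", "Monty Python"]:
    print(title_to_url(title))
```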

    Fetch Wikipedia topic data with Python

    Using the search() function, we can find the most relevant topics for a query. Now let's say we also want a short summary or description of a topic. For that, we use the summary() function, which returns the summary of the specified page or topic as a text string.

    summary(query, sentences=0, chars=0, auto_suggest=True, redirect=True): the query parameter specifies the page or topic name. The sentences parameter limits the summary to that many sentences; 0 means no limit. The chars parameter limits the summary to roughly that many characters; 0 means no limit. The auto_suggest parameter lets Wikipedia pick a valid page title for the query.

    With redirect=True, the function follows page redirects automatically instead of raising a RedirectError. Now, let's print a 100-character summary for each of the top 3 search results.

    import wikipedia

    topic = "Python"

    # top 3 search results
    results = wikipedia.search(topic, results=3)

    # use a separate loop variable so the original topic is not overwritten
    for title in results:
        print("Page---->", title, ":")
        print(wikipedia.summary(title, chars=100))
        print()  # blank line between pages

    Output

    Page----> Python (programming language) :
    Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy...
    
    Page----> Python :
    Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy...
    
    Page----> Monty Python :
    Monty Python (also collectively known as the Pythons) were a British surreal comedy troupe who created...
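
    The loop above assumes that every title resolves to exactly one page. Ambiguous titles such as "Python" can instead raise DisambiguationError, and unknown titles raise PageError (both are defined in wikipedia.exceptions). The following is a hedged sketch of one way to handle them; safe_summary is a hypothetical helper name, and the import sits inside the function only so the helper can be defined where the package is not installed.

```python
def safe_summary(title, chars=100):
    """Return a short summary, or None if the title is ambiguous or missing."""
    import wikipedia  # imported lazily; needs `pip install wikipedia`

    try:
        return wikipedia.summary(title, chars=chars)
    except wikipedia.exceptions.DisambiguationError as err:
        # err.options holds the candidate page titles for the ambiguous query.
        print(f"{title!r} is ambiguous; candidates include: {err.options[:3]}")
    except wikipedia.exceptions.PageError:
        print(f"No page found for {title!r}")
    return None
```

    Inside the loop, you could call safe_summary(title) and skip any title for which it returns None.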

    Fetch Wikipedia page data with Python

    A Wikipedia page contains not only text but also images, links, references, a page ID, and more. Now let's see how we can get this data from a Wikipedia page using the Python wikipedia module.

    The wikipedia module provides the WikipediaPage class, which represents a single Wikipedia page and exposes properties such as categories, content, coordinates, images, links, and references.

    WikipediaPage(title=None, pageid=None, redirect=True, preload=False, original_title=u'')

    The WikipediaPage class needs either a page title or a pageid to identify the page. The title parameter is the page title, and pageid is the numeric ID of the page. The redirect parameter allows redirects to be followed without an error. If preload is True, page data such as the summary, images, content, and links is loaded when the object is created instead of on first access. Now let's get the Wikipedia data for the page "Python (programming language)".

    import wikipedia
    title = "Python (programming language)"
    page = wikipedia.WikipediaPage(title)
    
    # get page content
    print(page.content)
    
    # get page images
    print(f"The page {title} has {len(page.images)} images: ")
    for image_url in page.images:
        print(image_url)
    
    # page links
    print(page.links)

    Output

    Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.Python was..........................
    The page Python (programming language) has 20 images: 
    https://upload.wikimedia.org/wikipedia/commons/b/b5/DNC_training_recall_task.gif
    https://upload.wikimedia.org/wikipedia/commons/3/31/Free_and_open-source_software_logo_%282009%29.svg
    https://upload.wikimedia.org/wikipedia/commons/9/94/Guido_van_Rossum_OSCON_2006_cropped.png
    https://upload.wikimedia.org/wikipedia/commons/5/52/Merge-arrows.svg
    https://upload.wikimedia.org/wikipedia/commons/6/6f/Octicons-terminal.svg
    https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg
    https://upload.wikimedia.org/wikipedia/commons/1/10/Python_3._The_standard_type_hierarchy.png
    https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg
    https://upload.wikimedia.org/wikipedia/commons/8/89/Symbol_book_class2.svg
    https://upload.wikimedia.org/wikipedia/commons/d/df/Wikibooks-logo-en-noslogan.svg
    https://upload.wikimedia.org/wikipedia/commons/f/fa/Wikibooks-logo.svg
    https://upload.wikimedia.org/wikipedia/commons/f/ff/Wikidata-logo.svg
    https://upload.wikimedia.org/wikipedia/commons/f/fa/Wikiquote-logo.svg
    https://upload.wikimedia.org/wikipedia/commons/0/0b/Wikiversity_logo_2017.svg
    https://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
    https://upload.wikimedia.org/wikipedia/en/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg
    https://upload.wikimedia.org/wikipedia/en/9/96/Symbol_category_class.svg
    https://upload.wikimedia.org/wikipedia/en/d/db/Symbol_list_class.svg
    https://upload.wikimedia.org/wikipedia/en/e/e2/Symbol_portal_class.svg
    https://upload.wikimedia.org/wikipedia/en/9/94/Symbol_support_vote.svg

    With WikipediaPage properties such as categories, content, images, links, and references, and the html() method, you can fetch the data from a Wikipedia page. In the above example, I have listed all the image URLs present on the page.
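
    As a rough illustration of working with these properties, the sketch below condenses a page into a small dictionary and keeps only raster images, since many of the .svg entries in the output above are icons and logos. page_overview and its image_exts parameter are hypothetical names for this example; the function only assumes an object with title, url, images, links, and references attributes, which a WikipediaPage provides. The demo uses a stand-in object so it runs without network access.

```python
from types import SimpleNamespace  # only needed for the offline demo below


def page_overview(page, image_exts=(".png", ".jpg", ".jpeg", ".gif")):
    """Summarize an object exposing title, url, images, links, references."""
    # Keep raster images only; str.endswith accepts a tuple of suffixes.
    photos = [u for u in page.images if u.lower().endswith(image_exts)]
    return {
        "title": page.title,
        "url": page.url,
        "image_count": len(page.images),
        "photo_urls": photos,
        "link_count": len(page.links),
        "reference_count": len(page.references),
    }


# Stand-in for wikipedia.WikipediaPage("Python (programming language)").
demo_page = SimpleNamespace(
    title="Python (programming language)",
    url="https://en.wikipedia.org/wiki/Python_(programming_language)",
    images=[
        "https://upload.wikimedia.org/wikipedia/commons/9/94/Guido_van_Rossum_OSCON_2006_cropped.png",
        "https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg",
    ],
    links=["Guido van Rossum", "Monty Python"],
    references=["https://www.python.org/"],
)

print(page_overview(demo_page))
```

    With the library installed, passing a real WikipediaPage object instead of demo_page should work the same way.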

    Conclusion

    In this Python tutorial, you learned how to use the Python wikipedia library to extract data from Wikipedia pages. With this library, we do not need to write our own web scraper, which is usually slower and more fragile, to extract data from Wikipedia. I recommend reading the library's official documentation to learn more about its functions.
