How to Extract Wikipedia Data in Python?

By | February 21, 2021
Wikipedia is the world’s largest online encyclopedia. It contains millions of articles on a wide range of topics, and whatever subject you name, Wikipedia probably has public data about it. Many of us have used Wikipedia to build school projects and write assignments. Thanks to its free online articles, we can read and learn about almost anything at no cost.

Because Wikipedia contains millions of articles, we can use a Python program to fetch data about specific topics. Let’s say you are building a Python project where you want to extract data for a specific topic from Wikipedia. In that case, you can either scrape the Wikipedia web pages with Python or use the Python wikipedia library, which is a wrapper around the Wikipedia API.

In this Python tutorial, I will guide you through how to use the Python wikipedia library to get data for a specific topic.

Before we start coding, let’s install the library.

To install the wikipedia library in your Python environment, run the following pip command in your terminal (Linux/macOS) or command prompt (Windows):

pip install wikipedia
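
To confirm that the install worked before running the examples, a quick check like the following can help. This is a minimal sketch that uses only the standard library; the helper name is_installed is my own and is not part of the wikipedia package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    # find_spec returns None when no top-level module with that name is found
    return importlib.util.find_spec(module_name) is not None


if not is_installed("wikipedia"):
    print("wikipedia is missing; run: pip install wikipedia")
```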

Search Wikipedia topics with Python

Let’s start with searching Wikipedia topics with Python.

The wikipedia module provides a search() function that returns a list of relevant results based on the search query.

The search(query, results=10, suggestion=False) function accepts 3 parameters:

query is the topic we want to search for.

results is the maximum number of results that the search function should return; by default, its value is 10.

suggestion controls whether a spelling suggestion is returned. If its value is True, the function returns a tuple containing the list of results and Wikipedia’s suggested query (or None if there is no suggestion). By default its value is False, and only the list is returned.

Now let’s use the search() function to look up the topic “Python” and see what results we get.

import wikipedia
topic = "Python"
# search for Python
results = wikipedia.search(topic, results=15)

print(results)

Output

['Python (programming language)', 'Python', 'Monty Python', 'Burmese python', 'Ball python', 'PYTHON', 'History of Python', 'Reticulated python', 'Python (genus)', 'Monty Python and the Holy Grail', 'Python molurus', 'Colt Python', 'Python (missile)', 'African rock python', 'Burmese pythons in Florida']

From the output, you can see that the search() function returns a list of 15 elements for the query “Python”. Each result is the title of a Wikipedia page.
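
The suggestion parameter described earlier changes the return value from a list to a tuple, which is easy to trip over. The following is a minimal sketch of how that tuple could be unpacked. The helper search_with_suggestion, its search_fn argument, and the fake_search stand-in with its made-up data are my own assumptions, added so the sketch runs without network access; only wikipedia.search and its results and suggestion parameters come from the library.

```python
def search_with_suggestion(query, limit=5, search_fn=None):
    """Return (titles, suggestion) for `query`.

    `search_fn` is a hypothetical injection point so the helper can be
    tried offline; when omitted, the real wikipedia.search is used.
    """
    if search_fn is None:
        import wikipedia  # third-party package: pip install wikipedia
        search_fn = wikipedia.search

    # With suggestion=True the call returns a (titles, suggestion) tuple;
    # suggestion is None when Wikipedia has no alternative query to offer.
    titles, suggestion = search_fn(query, results=limit, suggestion=True)
    return list(titles), suggestion


def fake_search(query, results=10, suggestion=False):
    """Offline stand-in for wikipedia.search, returning made-up data."""
    titles = ["Python (programming language)", "Monty Python"][:results]
    return (titles, "python") if suggestion else titles


titles, hint = search_with_suggestion("pythn", limit=2, search_fn=fake_search)
print(titles)
print("Did you mean:", hint)
```

With the library installed and a network connection, calling search_with_suggestion("pythn") without search_fn would query Wikipedia itself.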

Fetch Wikipedia topic data with Python

The search() function gives us the most relevant page titles for a query. Now let’s say we also want a summary or description of the topic itself. How would we get that? The answer is the summary() function.

The summary() function returns a plain-text summary of the specified page or topic.

summary(query, sentences=0, chars=0, auto_suggest=True, redirect=True)

The query parameter specifies the page or topic name.

sentences specifies the number of sentences to return; 0 returns the whole summary.

chars specifies the number of characters to return from the summary; 0 returns all the characters. The returned text may be slightly longer than the requested length.

auto_suggest lets Wikipedia find a valid page title for the query.

redirect allows redirects to be followed without raising a RedirectError.

Now let’s print a 100-character summary for each of the top 3 search results.

import wikipedia

topic = "Python"

# top 3 results
results = wikipedia.search(topic, results=3)

# use a separate loop variable so the original query in `topic` is not overwritten
for title in results:
    print("Page---->", title, ":")
    print(wikipedia.summary(title, chars=100))
    print()  # blank line between pages

Output

Page----> Python (programming language) :
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy...

Page----> Python :
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy...

Page----> Monty Python :
Monty Python (also collectively known as the Pythons) were a British surreal comedy troupe who created...
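
A bare title such as “Python” can refer to several pages, and in that case the library raises a DisambiguationError whose options attribute lists the candidate titles. The following is a minimal sketch of one way to handle that. The helper safe_summary, its summary_fn argument, and the FakeDisambiguation and fake_summary stand-ins are my own assumptions so the sketch runs offline. It checks for the options attribute instead of naming the exception class; with the library installed, catching wikipedia.exceptions.DisambiguationError directly would be the more explicit choice.

```python
def safe_summary(title, chars=100, summary_fn=None):
    """Return (summary, options).

    summary is None when the title is ambiguous; options then holds the
    candidate page titles to choose from.
    """
    if summary_fn is None:
        import wikipedia  # third-party package: pip install wikipedia
        summary_fn = wikipedia.summary
    try:
        return summary_fn(title, chars=chars), []
    except Exception as exc:
        # DisambiguationError stores the candidate titles in `.options`
        options = getattr(exc, "options", None)
        if options is None:
            raise  # some other problem; let the caller see it
        return None, list(options)


class FakeDisambiguation(Exception):
    """Offline stand-in that mimics the `.options` attribute."""

    def __init__(self, options):
        super().__init__("ambiguous title")
        self.options = options


def fake_summary(title, chars=0):
    """Offline stand-in for wikipedia.summary, returning made-up data."""
    if title == "Python":
        raise FakeDisambiguation(["Python (programming language)", "Python (genus)"])
    return "Monty Python were a British comedy troupe."[:chars or None]


for name in ["Python", "Monty Python"]:
    text, options = safe_summary(name, summary_fn=fake_summary)
    print(name, "->", text if text is not None else f"ambiguous, try one of {options}")
```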

Fetch Wikipedia page data with Python

A Wikipedia page contains not only text but also images, links, references, a page ID, and more.

Now let’s see how we can get all of that data from a Wikipedia page using the Python wikipedia module.

The wikipedia module provides the WikipediaPage class, which represents a Wikipedia page and exposes properties such as categories, content, coordinates, images, links, and references.

WikipediaPage(title=None, pageid=None, redirect=True, preload=False, original_title=u'')

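The WikipediaPage class identifies a page by its title or by its pageid; one of the two must be provided.

pageid is the numeric ID of the Wikipedia page and can be used instead of the title.

redirect allows redirects to be followed without raising a RedirectError.

preload, when True, loads the page data such as summary, images, content, and links as soon as the object is created instead of on first access.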

Now let’s get the Wikipedia data for the page “Python (programming language)”.

import wikipedia
title = "Python (programming language)"
page = wikipedia.WikipediaPage(title)

#get page content
print(page.content)

#get page images
print(f"The page {title} has {len(page.images)}: ")
for image_url in page.images:
    print(image_url)

#page links
print(page.links)

Output

Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.Python was..........................
The page Python (programming language) has 20: 
https://upload.wikimedia.org/wikipedia/commons/b/b5/DNC_training_recall_task.gif
https://upload.wikimedia.org/wikipedia/commons/3/31/Free_and_open-source_software_logo_%282009%29.svg
https://upload.wikimedia.org/wikipedia/commons/9/94/Guido_van_Rossum_OSCON_2006_cropped.png
https://upload.wikimedia.org/wikipedia/commons/5/52/Merge-arrows.svg
https://upload.wikimedia.org/wikipedia/commons/6/6f/Octicons-terminal.svg
https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg
https://upload.wikimedia.org/wikipedia/commons/1/10/Python_3._The_standard_type_hierarchy.png
https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg
https://upload.wikimedia.org/wikipedia/commons/8/89/Symbol_book_class2.svg
https://upload.wikimedia.org/wikipedia/commons/d/df/Wikibooks-logo-en-noslogan.svg
https://upload.wikimedia.org/wikipedia/commons/f/fa/Wikibooks-logo.svg
https://upload.wikimedia.org/wikipedia/commons/f/ff/Wikidata-logo.svg
https://upload.wikimedia.org/wikipedia/commons/f/fa/Wikiquote-logo.svg
https://upload.wikimedia.org/wikipedia/commons/0/0b/Wikiversity_logo_2017.svg
https://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
https://upload.wikimedia.org/wikipedia/en/8/8a/OOjs_UI_icon_edit-ltr-progressive.svg
https://upload.wikimedia.org/wikipedia/en/9/96/Symbol_category_class.svg
https://upload.wikimedia.org/wikipedia/en/d/db/Symbol_list_class.svg
https://upload.wikimedia.org/wikipedia/en/e/e2/Symbol_portal_class.svg
https://upload.wikimedia.org/wikipedia/en/9/94/Symbol_support_vote.svg

With WikipediaPage properties such as categories, content, images, links, and references, and the html() method, you can fetch the data from a Wikipedia page.

In the above example, I have listed all the image URLs present on the page. If you want to know how to download images from a web page, click here.
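
Many of the URLs in the output above are small SVG icons and not pictures from the article itself. If only raster images are of interest, the list could be filtered by file extension. This is a minimal sketch; the helper filter_image_urls and its default extension list are my own assumptions, and the sample URLs are copied from the output above.

```python
from urllib.parse import urlparse


def filter_image_urls(urls, extensions=(".png", ".jpg", ".jpeg", ".gif")):
    """Keep only the URLs whose path ends with one of `extensions`."""
    kept = []
    for url in urls:
        # compare against the URL path only, ignoring case
        path = urlparse(url).path.lower()
        if path.endswith(tuple(extensions)):
            kept.append(url)
    return kept


sample = [
    "https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg",
    "https://upload.wikimedia.org/wikipedia/commons/9/94/Guido_van_Rossum_OSCON_2006_cropped.png",
    "https://upload.wikimedia.org/wikipedia/commons/b/b5/DNC_training_recall_task.gif",
]

for url in filter_image_urls(sample):
    print(url)
```

In a real program, the list passed in would be page.images from the WikipediaPage object.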

Conclusion

In this Python tutorial, you learned how to use the Python wikipedia library to extract data from Wikipedia pages. With this library, we do not need to rely on slower and more fragile web scraping to get data from Wikipedia. I recommend reading the wikipedia library’s official documentation to learn more about its functions.
