
How to Extract Wikipedia Data (WikiCorpus) Using Python: Guideline

Last updated: September 23, 2023

Python is a versatile language with strong support for web scraping, which makes it a powerful tool for extracting valuable information from websites. This blog post will guide you through extracting data from Wikipedia using Python, covering two distinct methods: the Wikipedia API (via the wikipedia library) and Gensim’s WikiCorpus.

Method 1: Leveraging the Wikipedia API with Python Libraries

Wikipedia offers an API that enables us to access and parse data from its vast repository of pages. To interact with this API, we need to install the wikipedia library in Python. You can install it using pip:

pip install wikipedia

Extracting a Topic Summary with Python

One of the most straightforward applications of the wikipedia library is extracting the summary of a specific topic. Here’s a Python code snippet that fetches the summary of “Python Programming”:
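A minimal sketch of such a snippet, using the wikipedia library’s summary() function (the optional sentences limit shown here is purely illustrative):

import wikipedia

# Fetch a short summary of the topic; sentences=3 caps the length (optional)
summary = wikipedia.summary("Python Programming", sentences=3)
print(summary)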

This Python script will print a concise summary of Python Programming sourced directly from Wikipedia.

Detailed Information Extraction from Wikipedia Pages

The wikipedia library also allows us to extract more detailed information from a Wikipedia page, such as the page’s content, categories, links, and images. Here’s an example that retrieves the page’s HTML and parses it with BeautifulSoup:
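A sketch of one possible script, assuming BeautifulSoup is installed (pip install beautifulsoup4); the specific elements pulled out below are illustrative:

import wikipedia
from bs4 import BeautifulSoup

# Fetch the page object and parse its rendered HTML
page = wikipedia.page("Python Programming")
soup = BeautifulSoup(page.html(), "html.parser")

# Paragraph text
for paragraph in soup.find_all("p"):
    print(paragraph.get_text())

# Image URLs and links collected by the wikipedia library
print(page.images)
print(page.links)

# Section headings from the parsed HTML
for heading in soup.find_all(["h2", "h3"]):
    print(heading.get_text())

# Remaining plain-text content of the page
print(page.content)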

This Python script will print the paragraphs, images, links, headings, and remaining content of the Wikipedia page for Python Programming.

Method 2: Utilizing Gensim’s WikiCorpus for Offline Wikipedia Data Extraction

The gensim library in Python provides a WikiCorpus class that allows you to work with the Wikipedia dump. This is particularly useful if you want to work with a large amount of data from Wikipedia offline, or if you want to perform more complex operations that are not supported by the Wikipedia API.

Here’s an example of how you can use WikiCorpus to extract text from the Wikipedia dump:
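A sketch of one way this is commonly done (the dump filename and output path are placeholders, and gensim must be installed with pip install gensim; in gensim 4.x get_texts yields lists of strings, while older versions may yield bytes):

from gensim.corpora.wikicorpus import WikiCorpus

# Path to the compressed dump, downloadable from https://dumps.wikimedia.org/
dump_path = "enwiki-latest-pages-articles.xml.bz2"

# An empty dictionary skips the (slow) vocabulary-building pass over the dump
wiki = WikiCorpus(dump_path, dictionary={})

with open("wiki_text.txt", "w", encoding="utf-8") as output:
    # get_texts() streams the dump, yielding one article at a time as a list of word tokens
    for i, tokens in enumerate(wiki.get_texts()):
        output.write(" ".join(tokens) + "\n")
        if (i + 1) % 10000 == 0:
            print(f"Processed {i + 1} articles")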

In this example, enwiki-latest-pages-articles.xml.bz2 is the Wikipedia dump file, which you can download from the Wikipedia dumps page (https://dumps.wikimedia.org/). The get_texts method of WikiCorpus returns an iterator that yields one article at a time as a list of words.

Please note that the Wikipedia dump is very large (the English dump is tens of gigabytes), so processing it will take a long time and require a lot of memory. Also, the dump contains all articles from Wikipedia, so the resulting text file will be very large as well.

Conclusion

Python provides powerful tools for web scraping, and the Wikipedia API and Gensim’s WikiCorpus make it easy to extract useful information from Wikipedia pages. Whether you need a brief summary or detailed page content, Python and these libraries have you covered. Learn more about Python web scraping and data extraction on my Tech Blog.