How to Extract Wikipedia Data (WikiCorpus) Using Python: Guideline
Last updated: September 23, 2023
Python, a versatile language known for its applications in web scraping, is a powerful tool for extracting valuable information from websites. This blog post will guide you through the process of extracting data from Wikipedia using Python. We will explore two distinct methods: using the Wikipedia API and Gensim’s WikiCorpus.
Method 1: Leveraging the Wikipedia API with Python Libraries
Wikipedia offers an API that enables us to access and parse data from its vast repository of pages. To interact with this API, we need to install the wikipedia library in Python. You can install it using pip:
pip install wikipedia
Extracting a Topic Summary with Python
One of the most straightforward applications of the wikipedia library is extracting the summary of a specific topic. Here’s a Python code snippet that fetches the summary of “Python Programming”:
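The snippet below is a minimal sketch of that idea. Note that the actual page title on Wikipedia is “Python (programming language)”, so that title is used here, and the three-sentence limit is an arbitrary choice for illustration.

import wikipedia

# Fetch a short summary of the page. auto_suggest=False prevents the
# library from silently redirecting to a different, loosely matched page.
summary = wikipedia.summary(
    "Python (programming language)",  # assumed page title
    sentences=3,                      # arbitrary summary length
    auto_suggest=False,
)
print(summary)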
This Python script will print a concise summary of Python Programming sourced directly from Wikipedia.
Detailed Information Extraction from Wikipedia Pages
The wikipedia library also allows us to extract more detailed information from a Wikipedia page, such as the page’s content, categories, links, and images. Here’s an example using BeautifulSoup:
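One possible sketch of such a script is shown below. It combines the wikipedia library’s page object with BeautifulSoup parsing of the rendered HTML; the page title and the number of items printed are illustrative assumptions, and BeautifulSoup must be installed separately (pip install beautifulsoup4).

import wikipedia
from bs4 import BeautifulSoup

# Load the full page object for the article (title is an assumption).
page = wikipedia.page("Python (programming language)", auto_suggest=False)

# Structured data exposed directly by the wikipedia library.
print("Categories:", page.categories[:10])
print("Links:", page.links[:10])
print("Images:", page.images[:5])

# Parse the rendered HTML with BeautifulSoup to pull out paragraphs and headings.
soup = BeautifulSoup(page.html(), "html.parser")

paragraphs = [p.get_text(strip=True) for p in soup.find_all("p") if p.get_text(strip=True)]
headings = [h.get_text(strip=True) for h in soup.find_all(["h2", "h3"])]

print("First paragraph:", paragraphs[0])
print("Headings:", headings)

# The remaining plain-text body of the article is also available directly.
print(page.content[:500])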
This Python script will print the paragraphs, images, links, headings, and remaining content of the Wikipedia page for Python Programming.
Method 2: Utilizing Gensim’s WikiCorpus for Offline Wikipedia Data Extraction
The gensim library in Python provides a WikiCorpus class that allows you to work with the Wikipedia dump. This is particularly useful if you want to work with a large amount of data from Wikipedia offline, or if you want to perform more complex operations that are not supported by the Wikipedia API.
Here’s an example of how you can use WikiCorpus to extract text from the Wikipedia dump:
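One way this might look is the sketch below. The output filename wiki_text.txt is an assumption, and the constructor arguments can vary slightly between gensim versions (older releases also accept a lemmatize flag, and may yield tokens as bytes rather than strings).

from gensim.corpora.wikicorpus import WikiCorpus

# Path to the Wikipedia dump downloaded from the Wikipedia dumps page.
dump_path = "enwiki-latest-pages-articles.xml.bz2"

# Passing an empty dictionary skips vocabulary construction, since we
# only want the raw article text here.
wiki = WikiCorpus(dump_path, dictionary={})

# Stream articles one at a time and write each as a single line of
# space-separated words to a plain-text file (filename is an assumption).
with open("wiki_text.txt", "w", encoding="utf-8") as out:
    for i, tokens in enumerate(wiki.get_texts()):
        # Older gensim versions yield bytes; decode them if necessary.
        words = [w.decode("utf-8") if isinstance(w, bytes) else w for w in tokens]
        out.write(" ".join(words) + "\n")
        if (i + 1) % 10000 == 0:
            print(f"Processed {i + 1} articles")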
In this example, enwiki-latest-pages-articles.xml.bz2 is the Wikipedia dump file, which you can download from the Wikipedia dumps page. The get_texts method of WikiCorpus returns an iterator that yields one article at a time as a list of words.
Please note that the Wikipedia dump is very large (the English dump is tens of gigabytes), so processing it will take a long time and require a lot of memory. Also, the dump contains all articles from Wikipedia, so the resulting text file will be very large as well.
Conclusion
Python provides powerful tools for web scraping, and the Wikipedia API and Gensim’s WikiCorpus make it easy to extract useful information from Wikipedia pages. Whether you’re interested in a brief summary or detailed information, Python and these libraries have you covered. Learn more about Python web scraping and data extraction with my Tech Blog.