Dataset Summary
Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback.
This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markup and non-prose sections (references, etc.).
Invitation for Feedback
The dataset is built as part of the Structured Contents initiative and based on the Wikimedia Enterprise HTML snapshots. It is an early beta release to improve transparency in the development process and request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates, follow the project’s blog and our MediaWiki Quarterly software updates on MediaWiki.
As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.
The content of this dataset of Wikipedia articles is collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.
The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking.
We would love to hear more about your use cases.
Data Fields
The data fields are the same across all records. Noteworthy fields include:
name - title of the article.
identifier - ID of the article.
url - URL of the article.
version - metadata related to the latest specific revision of the article.
version.editor - editor-specific signals that can help contextualize the revision.
version.scores - assessments by ML models of the likelihood of a revision being reverted.
main_entity - Wikidata QID the article is related to.
abstract - lead section, summarizing what the article is about.
description - one-sentence description of the article for quick reference.
image - main image representing the article's subject.
infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
sections - parsed sections of the article, including links.
Note: excludes other media/images, lists, tables and references or similar non-prose sections.
Full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
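The fields listed above can be read one article at a time, since each JSON line holds one article. The following is a minimal sketch rather than an official loader: the helper names (iter_articles, summarize) and the two sample lines are hypothetical, and it assumes that infoboxes and sections are JSON arrays. Check the data dictionary for the exact nested shapes before relying on it.

```python
import io
import json

def iter_articles(lines):
    """Yield one parsed article (a dict) per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def summarize(article):
    """Pick a few top-level fields; keys missing from a record become None."""
    return {
        "name": article.get("name"),
        "identifier": article.get("identifier"),
        "url": article.get("url"),
        "description": article.get("description"),
        "abstract": article.get("abstract"),
        # ASSUMPTION: infoboxes and sections are lists when present.
        "n_infoboxes": len(article.get("infoboxes") or []),
        "n_sections": len(article.get("sections") or []),
    }

# Hypothetical two-line sample standing in for a real file from the dataset.
sample = io.StringIO(
    '{"name": "Example", "identifier": 1, '
    '"url": "https://en.wikipedia.org/wiki/Example", '
    '"abstract": "An example article.", "infoboxes": [], '
    '"sections": [{"name": "History"}]}\n'
    '{"name": "Exemple", "identifier": 2, '
    '"url": "https://fr.wikipedia.org/wiki/Exemple"}\n'
)

summaries = [summarize(a) for a in iter_articles(sample)]
for s in summaries:
    print(s["identifier"], s["name"], s["n_sections"])
```

For a real file, the same generator can be fed an open file handle instead of the in-memory sample, which keeps memory use flat because only one article is parsed at a time.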
Curation Rationale
This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise with the aim of making Wikimedia data more machine readable. These efforts focus both on pre-parsing Wikipedia snippets and on connecting the different projects more closely.
Even if Wikipedia is very structured to the human eye, it is a non-trivial task to extract the knowledge lying within in a machine-readable manner. Projects, languages and domains all have their own specific community experts and ways of structuring data, bolstered by various templates and best practices. A specific example we’ve addressed in this release is article infoboxes. Infoboxes are panels that commonly appear in the top right corner of a Wikipedia article and summarize key facts and statistics of the article’s subject. The editorial community works hard to keep infoboxes populated with the article’s most pertinent and current metadata, and we’d like to lower the barrier to entry significantly so that this data is also accessible at scale without the need for bespoke parsing systems.
We also include the link to the Wikidata Q Identifier (the corresponding Wikidata entity) and links to the main and infobox images, to make it easier to access additional information on specific topics.
You will also find Credibility Signals fields included. These can help you decide when, how, and why to use what is in the dataset. These fields mirror the over 20 years of editorial policies created and kept by the Wikipedia editing communities, taking publicly available information and structuring it. As with article structures, this information is hard to access because it is not centralized (neither on a single project nor across them). Credibility signals shine a light on that blind spot. You will find most of these signals under the ‘version’ object, but other objects like ‘protection’ and ‘watchers_count’ offer similar insight.
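One way to use these signals is as a filter before further processing. The sketch below is illustrative only: the nested key path in REVERT_PATH is an assumption about where the revert-likelihood score lives under ‘version’, the thresholds are arbitrary, and the helper names and sample records are hypothetical. Verify the path against the data dictionary.

```python
def get_path(obj, path, default=None):
    """Walk nested dicts along `path`; return `default` if any key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj

# ASSUMPTION: nested location of the revert-likelihood score; not confirmed here.
REVERT_PATH = ("version", "scores", "revertrisk", "probability", "true")

def passes_signals(article, max_revert_prob=0.5, min_watchers=0):
    """Keep an article unless its latest revision looks likely to be reverted
    or it has fewer watchers than requested. Thresholds are illustrative."""
    prob = get_path(article, REVERT_PATH)
    watchers = article.get("watchers_count") or 0
    if prob is not None and prob > max_revert_prob:
        return False
    return watchers >= min_watchers

# Hypothetical records; real ones carry many more fields.
articles = [
    {"name": "A", "watchers_count": 40,
     "version": {"scores": {"revertrisk": {"probability": {"true": 0.9}}}}},
    {"name": "B", "watchers_count": 12,
     "version": {"scores": {"revertrisk": {"probability": {"true": 0.1}}}}},
    {"name": "C"},  # no signals present at all
]

kept = [a["name"] for a in articles if passes_signals(a, min_watchers=5)]
print(kept)
```

Records with no score are not rejected on that ground alone, so a missing signal is treated as unknown rather than as bad; whether that default suits a given use case is a judgment call.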
This is an early beta release of pre-parsed Wikipedia articles in bulk. It is meant to improve transparency in the development process, to gather insights into current use cases so that we can follow where the AI community needs us most, and to collect feedback so that we can develop this further through collaboration. There will be limitations (see ‘known limitations’ section below), but in line with our values, we believe it is better to share early and often, and to respond to feedback.
You can also test out more languages on an article-by-article basis through our beta Structured Contents On-demand endpoint with a free account.
Attribution is core to the sustainability of the Wikimedia projects. It is what drives new editors and donors to Wikipedia. With consistent attribution, this cycle of content creation and reuse ensures that encyclopedic content of high quality, reliability, and verifiability will continue being written on Wikipedia and ultimately remain available for reuse via datasets such as these.
As such, we require all users of this dataset to conform to our expectations for proper attribution. Detailed attribution requirements for use of this dataset are outlined below.
Beyond attribution, there are many ways of contributing to and supporting the Wikimedia movement. To discuss your specific circumstances, please contact Nicholas Perry from the Wikimedia Foundation technical partnerships team at nperry@wikimedia.org. You can also contact us on either the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset on Kaggle.
Attribution Information
Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (available at https://foundation.wikimedia.org/wiki/Visual_identity_guidelines) when identifying Wikimedia as the source of content.
| date | Views |
|---|---|
| Jul 15, 2025 | 49 |
| Jul 16, 2025 | 99 |
| Jul 17, 2025 | 88 |
| Jul 18, 2025 | 51 |
| Jul 19, 2025 | 47 |
| Jul 20, 2025 | 24 |
| Jul 21, 2025 | 52 |
| Jul 22, 2025 | 53 |
| Jul 23, 2025 | 54 |
| Jul 24, 2025 | 63 |
| Jul 25, 2025 | 46 |
| Jul 26, 2025 | 32 |
| Jul 27, 2025 | 27 |
| Jul 28, 2025 | 36 |
| Jul 29, 2025 | 46 |
| Jul 30, 2025 | 43 |
| Jul 31, 2025 | 53 |
| Aug 1, 2025 | 41 |
| Aug 2, 2025 | 36 |
| Aug 3, 2025 | 35 |
| Aug 4, 2025 | 52 |
| Aug 5, 2025 | 74 |
| Aug 6, 2025 | 63 |
| Aug 7, 2025 | 85 |
| Aug 8, 2025 | 51 |
| Aug 9, 2025 | 24 |
| Aug 10, 2025 | 43 |
| Aug 11, 2025 | 53 |
| date | Downloads |
|---|---|
| Jul 15, 2025 | 190 |
| Jul 16, 2025 | 33 |
| Jul 17, 2025 | 68 |
| Jul 18, 2025 | 5 |
| Jul 19, 2025 | 2 |
| Jul 20, 2025 | 45 |
| Jul 21, 2025 | 1 |
| Jul 22, 2025 | 310 |
| Jul 23, 2025 | 82 |
| Jul 24, 2025 | 73 |
| Jul 25, 2025 | 8 |
| Jul 26, 2025 | 7 |
| Jul 27, 2025 | 1 |
| Jul 28, 2025 | 81 |
| Jul 29, 2025 | 5 |
| Jul 30, 2025 | 14 |
| Jul 31, 2025 | 50 |
| Aug 1, 2025 | 37 |
| Aug 2, 2025 | 2 |
| Aug 3, 2025 | 5 |
| Aug 4, 2025 | 11 |
| Aug 5, 2025 | 10 |
| Aug 6, 2025 | 8 |
| Aug 7, 2025 | 23 |
| Aug 8, 2025 | 5 |
| Aug 10, 2025 | 7 |
| Aug 11, 2025 | 7 |