Wikimedia and 7 collaborators · Updated 4 months ago

Wikipedia Structured Contents

Pre-parsed English and French Wikipedia Articles, Including Infoboxes

About Dataset

Dataset Summary
Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback.

This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article, stripped of extra markup and non-prose sections (references, etc.).
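Since each line of a file holds one article as a JSON object, the files can be streamed one record at a time instead of being loaded whole. The following is a minimal sketch of that idea; the function name `iter_articles` and the file name in the usage note are illustrative assumptions, not part of the dataset.

```python
import json
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed article per non-empty JSON line.

    `lines` can be an open text file or any iterable of strings.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            # Tolerate blank lines, e.g. a trailing newline at end of file.
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON") from exc
```

A typical call would look like `with open("articles.jsonl", encoding="utf-8") as f: for article in iter_articles(f): ...`, where `articles.jsonl` is a placeholder for whichever file you downloaded. Streaming keeps memory use flat even for very large files.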

Invitation for Feedback
The dataset is built as part of the Structured Contents initiative and is based on the Wikimedia Enterprise HTML snapshots. It is an early beta release intended to improve transparency in the development process and to request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates, follow the project’s blog and our quarterly software updates on MediaWiki.
As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.

The contents of this dataset of Wikipedia articles are collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-ShareAlike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking.
We would love to hear more about your use cases.

Data Fields
The data fields are the same across all files. Noteworthy fields include:
name - title of the article.
identifier - ID of the article.
url - URL of the article.
version - metadata related to the latest revision of the article.
version.editor - editor-specific signals that can help contextualize the revision.
version.scores - assessments by ML models of the likelihood of a revision being reverted.
main_entity - Wikidata QID the article is related to.
abstract - lead section, summarizing what the article is about.
description - one-sentence description of the article for quick reference.
image - main image representing the article's subject.
infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
sections - parsed sections of the article, including links.
Note: the dataset excludes other media/images, lists, tables, and references or similar non-prose sections.
Full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
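To get a feel for the fields listed above, it can help to reduce each record to a small summary. The sketch below assumes that `sections` and `infoboxes` are lists and that `abstract` is a string; the function name `summarize_article` and the sample record are made up for illustration, so check the data dictionary before relying on any of these shapes.

```python
def summarize_article(article: dict) -> dict:
    """Build a compact overview of one article record.

    Every field is read with .get() so that records missing a field
    (for example, articles without an infobox) do not raise KeyError.
    """
    sections = article.get("sections") or []
    infoboxes = article.get("infoboxes") or []
    return {
        "name": article.get("name"),
        "url": article.get("url"),
        "description": article.get("description"),
        "abstract_chars": len(article.get("abstract") or ""),
        "has_image": bool(article.get("image")),
        "infobox_count": len(infoboxes),
        "section_count": len(sections),
    }


# Made-up record for illustration only; real records carry many more fields.
example = {
    "name": "Example article",
    "url": "https://en.wikipedia.org/wiki/Example_article",
    "abstract": "An example.",
    "sections": [{}, {}],
}
summary = summarize_article(example)
```

Here `summary["section_count"]` is 2 and `summary["has_image"]` is `False`, because the sample record has two sections and no `image` field.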

Curation Rationale
This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise, with the aim of making Wikimedia data more machine-readable. These efforts focus both on pre-parsing Wikipedia snippets and on connecting the different projects more closely.
Even though Wikipedia is highly structured to the human eye, extracting the knowledge within it in a machine-readable manner is a non-trivial task. Projects, languages and domains all have their own community experts and ways of structuring data, bolstered by various templates and best practices. A specific example we’ve addressed in this release is article infoboxes. Infoboxes are panels that commonly appear in the top right corner of a Wikipedia article and summarize key facts and statistics of the article’s subject. The editorial community works hard to keep infoboxes populated with the article’s most pertinent and current metadata, and we’d like to lower the barrier to entry significantly so that this data is also accessible at scale without the need for bespoke parsing systems.
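As an illustration of what working with pre-parsed infoboxes could look like, the sketch below flattens one infobox into a simple name-to-value mapping. It assumes a nested layout in which each part may carry `name`, `value` and a list of child parts under `has_parts`; that layout and the helper name `flatten_infobox` are assumptions, so confirm the actual structure in the data dictionary.

```python
from typing import Optional


def flatten_infobox(part: object, out: Optional[dict] = None) -> dict:
    """Collect name/value pairs from a nested infobox into one flat dict.

    Assumed layout (not confirmed by the dataset description): each part is
    a dict that may have "name", "value", and child parts under "has_parts".
    When a name repeats, the first value found is kept.
    """
    if out is None:
        out = {}
    if not isinstance(part, dict):
        return out
    name, value = part.get("name"), part.get("value")
    if isinstance(name, str) and isinstance(value, str):
        out.setdefault(name, value)
    for child in part.get("has_parts") or []:
        flatten_infobox(child, out)
    return out


# Made-up infobox for illustration only.
sample_infobox = {
    "name": "Infobox person",
    "has_parts": [
        {"name": "Born", "value": "1952"},
        {"has_parts": [{"name": "Occupation", "value": "Writer"}]},
    ],
}
facts = flatten_infobox(sample_infobox)
```

Flattening loses the grouping of fields into infobox sections, which is usually acceptable for lookups but not for reproducing the panel's layout.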

We also include the link to the Wikidata Q Identifier (corresponding Wikidata entity), and the link to (main and infobox) images to facilitate easier access to additional information on the specific topics.
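A common first step when joining articles with Wikidata is pulling out the QID. The sketch below assumes the entity is stored under a `main_entity` key as an object with an `identifier` (such as `Q42`) and/or a `url` ending in the QID; both the key names and the helper name `wikidata_qid` are assumptions to verify against the data dictionary.

```python
import re
from typing import Optional

# A Wikidata QID is the letter Q followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9]\d*")


def wikidata_qid(article: dict) -> Optional[str]:
    """Return the article's Wikidata QID, or None if it cannot be found.

    Assumes (hypothetically) a "main_entity" object with "identifier"
    and/or "url" keys.
    """
    entity = article.get("main_entity")
    if not isinstance(entity, dict):
        return None
    identifier = entity.get("identifier")
    if isinstance(identifier, str) and QID_PATTERN.fullmatch(identifier):
        return identifier
    url = entity.get("url")
    if isinstance(url, str):
        # Fall back to the last path segment of the entity URL.
        tail = url.rstrip("/").rsplit("/", 1)[-1]
        if QID_PATTERN.fullmatch(tail):
            return tail
    return None
```

Validating against the QID pattern, rather than trusting the field blindly, keeps malformed or empty values from leaking into downstream joins.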

You will also find Credibility Signals fields included. These can help you decide when, how, and why to use what is in the dataset. These fields reflect over 20 years of editorial policies created and maintained by the Wikipedia editing communities, taking publicly available information and structuring it. As with article structures, because this information is not centralized (neither within a single project nor across projects), it is hard to access. Credibility signals shine a light on that blind spot. You will find most of these signals under the ‘version’ object, but other objects like ‘protection’ and ‘watchers_count’ offer similar insight.
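One way to use such signals is to filter articles by a revert-risk score before including them in a pipeline. The sketch below reads a score from a caller-supplied key path; the default path under `version.scores` is a hypothetical guess at the nesting, and the names `get_nested`, `REVERT_RISK_PATH` and `is_low_revert_risk` are illustrative. Adjust the path to match the data dictionary.

```python
from typing import Optional, Sequence


def get_nested(record: object, path: Sequence[str], default=None):
    """Walk nested dicts along `path`, returning `default` if any key is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical location of the revert-risk probability; verify before use.
REVERT_RISK_PATH = ("version", "scores", "revertrisk", "probability", "true")


def is_low_revert_risk(
    article: dict,
    threshold: float = 0.5,
    path: Sequence[str] = REVERT_RISK_PATH,
) -> Optional[bool]:
    """Return True/False when a numeric score is present, None when it is not."""
    score = get_nested(article, path)
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return score < threshold
```

Returning `None` for a missing score, instead of `False`, lets the caller decide whether unscored revisions should be kept or dropped. The 0.5 threshold is an arbitrary placeholder, not a recommended value.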

This is an early beta release of pre-parsed Wikipedia articles in bulk. We are sharing it to improve transparency in the development process, to learn about current use cases so that we can focus where the AI community needs us most, and to gather feedback so that we can develop this further through collaboration. There will be limitations (see ‘known limitations’ section below), but in line with our values, we believe it is better to share early and often, and to respond to feedback.

You can also test out more languages on an article-by-article basis through our beta Structured Contents On-demand endpoint with a free account.

Attribution is core to the sustainability of the Wikimedia projects. It is what drives new editors and donors to Wikipedia. With consistent attribution, this cycle of content creation and reuse ensures that encyclopedic content of high quality, reliability, and verifiability will continue to be written on Wikipedia and will ultimately remain available for reuse via datasets such as this one.

As such, we require all users of this dataset to conform to our expectations for proper attribution. Detailed attribution requirements for use of this dataset are outlined below.

Beyond attribution, there are many ways of contributing to, supporting and participating in the Wikimedia movement. To discuss your specific circumstances, please contact Nicholas Perry from the Wikimedia Foundation technical partnerships team at nperry@wikimedia.org. You can also contact us on either the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset on Kaggle.

Attribution Information
Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (available at https://foundation.wikimedia.org/wiki/Visual_identity_guidelines) when identifying Wikimedia as the source of content.

Usability: 8.13

License: CC BY-SA 4.0

Expected update frequency: Not specified


Activity Overview

Views: 65.9K total, 1420 in the last 30 days
Downloads: 6925 total, 1090 in the last 30 days
Engagement: 0.10513 downloads per view
Comments: 43 posted

[Charts: daily views and daily downloads, Jul 15 to Aug 11, 2025]
