Wikidata is the Wikimedia Foundation's new, large-scale knowledge base, which can be edited by anyone. Its knowledge is increasingly used within Wikipedia as well as in all kinds of information systems, which imposes high demands on its integrity. Nevertheless, Wikidata is frequently vandalized, exposing all its users to the risk of spreading vandalized and falsified information.
Given a Wikidata revision, compute a vandalism score denoting the likelihood of this revision being vandalism (or similarly damaging).
The three best-performing approaches submitted by eligible participants, as determined by the performance measures used for this task, will receive the following awards, kindly sponsored by Adobe Systems, Inc.:
Furthermore, Wikimedia Germany supports the transfer of the scientific insights gained in this task by inviting the eligible participants who submitted the best-performing approaches to visit them for a couple of days to work together on planning a potential integration of the approach into Wikidata.
The goal of the vandalism detection task is to detect vandalism in near real time, as soon as it happens. Hence, the following rules apply:
To develop your software, we provide you with a training corpus consisting of Wikidata revisions along with labels indicating whether each revision is considered vandalism.
The Wikidata Vandalism Corpus 2016 contains revisions of the knowledge base Wikidata. The corpus comprises manual revisions only; all revisions by official bots have been filtered out. For each revision, we indicate whether it is considered vandalism (ROLLBACK_REVERTED) or not. Unlike the Wikidata dumps, revisions are ordered chronologically by REVISION_ID (i.e., in the order they arrived at Wikidata). For training, we provide data until February 29, 2016. The evaluation will be conducted on later data.
The provided training data consists of 23 files in total. You can check their validity via their MD5 or SHA-1 checksums.
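Checksum verification can be done with standard tools such as `md5sum` and `sha1sum`, or programmatically. The following sketch computes both digests for a file in one streaming pass (the `file_checksums` helper name is ours, not part of the corpus distribution):

```python
import hashlib

def file_checksums(path, chunk_size=1 << 20):
    """Compute MD5 and SHA-1 hex digests of a file in one streaming pass,
    reading in 1 MiB chunks so large corpus files never fill memory."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```

Compare the returned digests against the published checksum files before training on the data.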
Name | Type | Description |
---|---|---|
REVISION_ID | Integer | The Wikidata revision id |
REVISION_SESSION_ID | Integer | The Wikidata revision id of the first revision in this session |
USER_COUNTRY_CODE | String | Country code for IP address (only available for unregistered users) |
USER_CONTINENT_CODE | String | Continent code for IP address (only available for unregistered users) |
USER_TIME_ZONE | String | Time zone for IP address (only available for unregistered users) |
USER_REGION_CODE | String | Region code for IP address (only available for unregistered users) |
USER_CITY_NAME | String | City name for IP address (only available for unregistered users) |
USER_COUNTY_NAME | String | County name for IP address (only available for unregistered users) |
REVISION_TAGS | List<String> | The Wikidata revision tags |
Name | Type | Description |
---|---|---|
REVISION_ID | Integer | The Wikidata revision id |
ROLLBACK_REVERTED | Boolean | Whether this revision was reverted via the rollback feature |
UNDO_RESTORE_REVERTED | Boolean | Whether this revision was reverted via the undo/restore feature |
The ROLLBACK_REVERTED field encodes the official ground truth for this competition. The UNDO_RESTORE_REVERTED field serves informational purposes only.
The truth file is only available for the training dataset, not for the test datasets.
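For training, the truth file has to be joined with the revision data by REVISION_ID. A minimal loader might look as follows; note that the exact serialization of the boolean fields (e.g., "true"/"false" vs. "T"/"F") is an assumption here and should be checked against the actual files:

```python
import csv

# Assumption: accepted spellings for a "true" boolean in the truth file.
TRUTHY = {"true", "t", "1"}

def load_labels(truth_csv_path):
    """Map each REVISION_ID to its ROLLBACK_REVERTED label
    (True = vandalism, the official ground truth)."""
    labels = {}
    with open(truth_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            value = row["ROLLBACK_REVERTED"].strip().lower()
            labels[int(row["REVISION_ID"])] = value in TRUTHY
    return labels
```

The UNDO_RESTORE_REVERTED column can be read the same way if you want it as an auxiliary signal during training.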
The corpus can be processed, for example, with Wikidata Toolkit.
For each Wikidata revision in the test corpus, your software shall output a vandalism score in the range [0,1]. The output shall be formatted as a CSV file conforming to RFC 4180 and consist of two columns: the first column contains the Wikidata revision id as an integer, and the second column contains the vandalism score as a float32. Here are a few example rows:
Revision Id | Vandalism Score |
---|---|
123 | 0.95 |
124 | 0.30 |
125 | 12.e-5 |
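Python's `csv` module produces RFC 4180-compliant output, including the CRLF line endings the standard requires. A small sketch (whether a header row is expected is not specified above, so this version emits data rows only):

```python
import csv
import sys

def write_scores(rows, out=sys.stdout):
    """Write (revision_id, score) pairs as RFC 4180 CSV.
    lineterminator="\r\n" gives the CRLF row endings RFC 4180 mandates."""
    writer = csv.writer(out, lineterminator="\r\n")
    for revision_id, score in rows:
        # Fixed-point formatting keeps scores as plain float literals.
        writer.writerow([revision_id, f"{score:.6f}"])
```

Scientific notation such as `12.e-5` in the example above is also a valid float literal, so either formatting style should parse.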
For determining the winner, we use ROC-AUC as the primary evaluation measure.
For informational purposes, we might compute further evaluation measures such as PR-AUC and the runtime of the software.
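ROC-AUC equals the probability that a randomly chosen vandalism revision receives a higher score than a randomly chosen benign one (ties counted half). Libraries such as scikit-learn provide this as `roc_auc_score`; for self-evaluation without dependencies, a direct pairwise sketch (quadratic in the number of revisions, so only suitable for small samples):

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs where the
    positive example is scored higher; ties contribute half credit."""
    pos = [s for label, s in zip(labels, scores) if label]
    neg = [s for label, s in zip(labels, scores) if not label]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A score of 1.0 means every vandalism revision outranks every benign revision; 0.5 is chance level.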
Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.
During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
After the competition, the test corpus will be made available, including the ground truth data. This way, you will have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.
We ask you to prepare your software so that it can be executed via a command line call without any parameters.
> mySoftware
You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:
Once your software is deployed on your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the WSDM Cup 2017. We agree not to share your software with a third party or use it for other purposes than the WSDM Cup 2017.
Your software will receive the uncompressed Wikidata revisions on stdin, and the uncompressed metadata on the named pipe meta. Your program must produce its results on stdout in the form of an RFC 4180 CSV file containing the two columns REVISION_ID and VANDALISM_SCORE. You will only receive new revisions on stdin and meta after having reported your vandalism score on stdout. To enable fast and concurrent processing of data, we will introduce a backpressure window of k revisions, i.e., you will receive revision n + k on stdin as soon as you have reported your result for revision n on stdout (the exact constant k is still to be determined, but you can expect it to be around 16 revisions). In the following days, we will provide a demo program against which you can test your software.
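The overall shape of such a client is a read-score-report loop. The sketch below leaves the revision framing abstract (the exact wire format of the revision and meta streams is not specified above): `read_revisions` stands for any iterator yielding `(revision_id, payload)` pairs you parse from stdin and the meta pipe, and `score` is your classifier. The one protocol-critical detail is flushing stdout after every row, since the server withholds revision n + k until your score for revision n has actually arrived:

```python
import csv
import sys

def serve(read_revisions, score, out=sys.stdout):
    """Scoring loop for the streaming protocol. `read_revisions` yields
    (revision_id, payload) pairs; `score` maps a payload to [0, 1]."""
    writer = csv.writer(out, lineterminator="\r\n")
    for revision_id, payload in read_revisions:
        writer.writerow([revision_id, f"{score(payload):.6f}"])
        # Flush immediately: an unflushed score stalls the backpressure
        # window and no further revisions will be delivered.
        out.flush()
```

Buffered output is the classic failure mode here: a program that only flushes on exit will deadlock once the first window of k revisions is exhausted.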