Wikidata is the Wikimedia Foundation's new, large-scale knowledge base, which can be edited by anyone. Its knowledge is increasingly used within Wikipedia as well as in all kinds of information systems, which imposes high demands on its integrity. Nevertheless, Wikidata is frequently vandalized, exposing all its users to the risk of spreading vandalized and falsified information.
Given a Wikidata revision, compute a vandalism score denoting the likelihood of this revision being vandalism (or similarly damaging).
The three best-performing approaches submitted by eligible participants, as determined by the performance measures used for this task, will receive the following awards, kindly sponsored by Adobe Systems, Inc.:
Furthermore, Wikimedia Germany supports the transfer of the scientific insights gained in this task by inviting the eligible participants who submitted the best-performing approaches to visit them for a couple of days to work together on planning a potential integration of the approach into Wikidata.
The goal of the vandalism detection task is to detect vandalism in near real time, as soon as it happens. Hence, the following rules apply:
To develop your software, we provide you with a training corpus consisting of Wikidata revisions along with labels indicating whether each revision is considered vandalism.
The Wikidata Vandalism Corpus 2016 contains revisions of the knowledge base Wikidata. The corpus comprises manual revisions only; all revisions by official bots have been filtered out. For each revision, we indicate whether it is considered vandalism (ROLLBACK_REVERTED) or not. Unlike the Wikidata dumps, revisions are ordered chronologically by REVISION_ID (i.e., in the order they arrived at Wikidata). For training, we provide data until February 29, 2016. The evaluation will be conducted on later data.
The provided training data consists of 23 files in total. You can check their validity via their MD5 or SHA-1 checksums.
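Checksum verification can be done with standard tools such as `md5sum` and `sha1sum`, or programmatically. The following sketch computes both digests for a file in one streaming pass (the `file_checksums` helper name is ours, not part of the corpus distribution):

```python
import hashlib

def file_checksums(path, chunk_size=1 << 20):
    """Compute MD5 and SHA-1 hex digests of a file in one streaming pass,
    reading in 1 MiB chunks so large corpus files never fill memory."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```

Compare the returned digests against the published checksum files before training on the data.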
Name | Type | Description |
---|---|---|
REVISION_ID | Integer | The Wikidata revision id |
REVISION_SESSION_ID | Integer | The Wikidata revision id of the first revision in this session |
USER_COUNTRY_CODE | String | Country code for IP address (only available for unregistered users) |
USER_CONTINENT_CODE | String | Continent code for IP address (only available for unregistered users) |
USER_TIME_ZONE | String | Time zone for IP address (only available for unregistered users) |
USER_REGION_CODE | String | Region code for IP address (only available for unregistered users) |
USER_CITY_NAME | String | City name for IP address (only available for unregistered users) |
USER_COUNTY_NAME | String | County name for IP address (only available for unregistered users) |
REVISION_TAGS | List<String> | The Wikidata revision tags |
Name | Type | Description |
---|---|---|
REVISION_ID | Integer | The Wikidata revision id |
ROLLBACK_REVERTED | Boolean | Whether this revision was reverted via the rollback feature |
UNDO_RESTORE_REVERTED | Boolean | Whether this revision was reverted via the undo/restore feature |
The ROLLBACK_REVERTED field encodes the official ground truth for this competition. The UNDO_RESTORE_REVERTED field serves informational purposes only.
The truth file is only available for the training dataset, not for the test datasets.
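For training, the truth file has to be joined with the revision data by REVISION_ID. A minimal loader might look as follows; note that the exact serialization of the boolean fields (e.g., "true"/"false" vs. "T"/"F") is an assumption here and should be checked against the actual files:

```python
import csv

# Assumption: accepted spellings for a "true" boolean in the truth file.
TRUTHY = {"true", "t", "1"}

def load_labels(truth_csv_path):
    """Map each REVISION_ID to its ROLLBACK_REVERTED label
    (True = vandalism, the official ground truth)."""
    labels = {}
    with open(truth_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            value = row["ROLLBACK_REVERTED"].strip().lower()
            labels[int(row["REVISION_ID"])] = value in TRUTHY
    return labels
```

The UNDO_RESTORE_REVERTED column can be read the same way if you want it as an auxiliary signal during training.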
The corpus can be processed, for example, with Wikidata Toolkit.
For each Wikidata revision in the test corpus, your software shall output a vandalism score in the range [0,1]. The output shall be formatted as a CSV file conforming to RFC 4180 and consist of two columns: the first column contains the Wikidata revision id as an integer, and the second column contains the vandalism score as a float32. Here are a few example rows:
Revision Id | Vandalism Score |
---|---|
123 | 0.95 |
124 | 0.30 |
125 | 12.e-5 |
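Python's `csv` module produces RFC 4180-compliant output, including the CRLF line endings the standard requires. A small sketch (whether a header row is expected is not specified above, so this version emits data rows only):

```python
import csv
import sys

def write_scores(rows, out=sys.stdout):
    """Write (revision_id, score) pairs as RFC 4180 CSV.
    lineterminator="\r\n" gives the CRLF row endings RFC 4180 mandates."""
    writer = csv.writer(out, lineterminator="\r\n")
    for revision_id, score in rows:
        # Fixed-point formatting keeps scores as plain float literals.
        writer.writerow([revision_id, f"{score:.6f}"])
```

Scientific notation such as `12.e-5` in the example above is also a valid float literal, so either formatting style should parse.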
For determining the winner, we use ROC-AUC as the primary evaluation measure.
For informational purposes, we might compute further evaluation measures such as PR-AUC and the runtime of the software.
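ROC-AUC equals the probability that a randomly chosen vandalism revision receives a higher score than a randomly chosen benign one (ties counted half). Libraries such as scikit-learn provide this as `roc_auc_score`; for self-evaluation without dependencies, a direct pairwise sketch (quadratic in the number of revisions, so only suitable for small samples):

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs where the
    positive example is scored higher; ties contribute half credit."""
    pos = [s for label, s in zip(labels, scores) if label]
    neg = [s for label, s in zip(labels, scores) if not label]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A score of 1.0 means every vandalism revision outranks every benign revision; 0.5 is chance level.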
Once you have finished tuning your approach to achieve satisfactory performance on the training corpus, you should run your software on the test corpus.
During the competition, the test corpus will not be released publicly. Instead, we ask you to submit your software for evaluation at our site as described below.
After the competition, the test corpus will be made available, including the ground truth data. This way, you will have everything you need to evaluate your approach on your own while remaining comparable to those who took part in the competition.
We ask you to prepare your software so that it can be executed via a command line call without any parameters.
> mySoftware
You can choose freely among the available programming languages and among the operating systems Microsoft Windows and Ubuntu. We will ask you to deploy your software onto a virtual machine that will be made accessible to you after registration. You will be able to reach the virtual machine via SSH and via remote desktop. More information about how to access the virtual machines can be found in the user guide below:
Once your software is deployed on your virtual machine, we ask you to access TIRA at www.tira.io, where you can self-evaluate your software on the test data.
Note: By submitting your software you retain full copyrights. You agree to grant us usage rights only for the purpose of the WSDM Cup 2017. We agree not to share your software with a third party or use it for other purposes than the WSDM Cup 2017.
Your software will receive the uncompressed Wikidata revisions on stdin, and the uncompressed metadata on the named pipe meta. Your program must produce its results on stdout in the form of an RFC 4180 CSV file containing the two columns REVISION_ID and VANDALISM_SCORE. You will only receive new revisions on stdin and meta after having reported your vandalism score on stdout. To enable fast and concurrent processing of data, we will introduce a backpressure window of k revisions, i.e., you will receive revision n + k on stdin as soon as you have reported your result for revision n on stdout (the exact constant k is still to be determined, but you can expect it to be around 16 revisions). In the following days, we will provide a demo program against which you can test your software.
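The overall shape of such a client is a read-score-report loop. The sketch below leaves the revision framing abstract (the exact wire format of the revision and meta streams is not specified above): `read_revisions` stands for any iterator yielding `(revision_id, payload)` pairs you parse from stdin and the meta pipe, and `score` is your classifier. The one protocol-critical detail is flushing stdout after every row, since the server withholds revision n + k until your score for revision n has actually arrived:

```python
import csv
import sys

def serve(read_revisions, score, out=sys.stdout):
    """Scoring loop for the streaming protocol. `read_revisions` yields
    (revision_id, payload) pairs; `score` maps a payload to [0, 1]."""
    writer = csv.writer(out, lineterminator="\r\n")
    for revision_id, payload in read_revisions:
        writer.writerow([revision_id, f"{score(payload):.6f}"])
        # Flush immediately: an unflushed score stalls the backpressure
        # window and no further revisions will be delivered.
        out.flush()
```

Buffered output is the classic failure mode here: a program that only flushes on exit will deadlock once the first window of k revisions is exhausted.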