Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia

This repo contains the processing code we used to create the Wiki-Reliability dataset.

The processing notebooks should be run in this order:

1. MatchTemplatesUDF.ipynb

  • MatchTemplatesUDF.ipynb runs in PySpark and uses an AVRO version of the XML Wikipedia dumps (dumps.wikimedia.org). The code can be adapted to the XML version by embedding the getTemplatesRegexReliability() function into the mwxml package.

  • Wikipedia dumps in XML can be downloaded from dumps.wikimedia.org. To get all the revisions you need to download the "All pages with complete edit history" files.

  • If you decide to work in PySpark, you can use this repository to transform the XML dump to AVRO.

  • A recent version of the mediawiki_history table can be downloaded from https://dumps.wikimedia.org/other/mediawiki_history/
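The template-matching step above can be sketched as a plain-Python function. The template names below are hypothetical placeholders (the real pipeline uses a curated list of reliability templates), and the notebook applies the equivalent logic as a PySpark UDF over the AVRO dump:

```python
import re

# Hypothetical examples; the real pipeline uses a curated list of
# reliability-related maintenance templates.
RELIABILITY_TEMPLATES = ["unreliable sources", "original research", "hoax"]

def get_templates_regex_reliability(wikitext):
    """Return the set of reliability templates present in a revision's wikitext.

    MediaWiki template names are case-insensitive and treat spaces and
    underscores as interchangeable, so both sides are normalised.
    """
    pattern = re.compile(
        r"\{\{\s*("
        + "|".join(re.escape(t) for t in RELIABILITY_TEMPLATES)
        + r")\s*[|}]",
        re.IGNORECASE,
    )
    found = set()
    for match in pattern.finditer(wikitext.replace("_", " ")):
        found.add(match.group(1).lower())
    return found
```

Running this over every revision of every page yields, per revision, which reliability templates it carries; that per-revision signal is what the next notebook consumes.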

2. Process_Templates.ipynb

  • Processes the full history of revisions returned by MatchTemplatesUDF to extract positive and negative template pairs. Positive examples are article revisions that contain a reliability issue, signalled by the addition of the template; negative examples are article revisions where the issue has been resolved, signalled by the removal of the template.

  • For each revision example, extracts a set of metadata features.
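The pairing logic above can be sketched as follows, assuming each page's revisions arrive in chronological order as (rev_id, has_template) tuples. This is a minimal sketch: the actual notebook also extracts the metadata features and handles real dump records rather than tuples.

```python
def extract_template_pairs(revisions):
    """Pair each template addition (positive example: the reliability issue
    is flagged) with the next template removal (negative example: the issue
    is considered resolved).

    revisions: chronologically ordered list of (rev_id, has_template) tuples.
    Returns a list of (positive_rev_id, negative_rev_id) pairs.
    """
    pairs = []
    pending_positive = None
    prev_has = False
    for rev_id, has_template in revisions:
        if has_template and not prev_has:
            # Template was just added: this revision is a positive example.
            pending_positive = rev_id
        elif prev_has and not has_template and pending_positive is not None:
            # Template was just removed: this revision is the matching negative.
            pairs.append((pending_positive, rev_id))
            pending_positive = None
        prev_has = has_template
    return pairs
```

Additions that are never followed by a removal produce no pair, which matches the idea that only resolved issues yield a positive/negative example pair.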

3. Process_Templates_Text.ipynb

  • For each revision example in the dataset, parses the article's textual contents to be included as part of a text dataset for NLP tasks.

  • For each revision, extracts the:

    • full text of the revision
    • diff'd version of the revision text, containing only the changed sections of text between each revision pair
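The diff'd text extraction can be approximated with the stdlib difflib; this is a minimal stand-in, and the notebook's exact diff algorithm and output format may differ:

```python
import difflib

def diff_revision_text(old_text, new_text):
    """Return the lines removed from and added to an article between a
    revision pair, i.e. only the changed sections of text."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" changes lines on both sides; "delete"/"insert" touch one side.
        if tag in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return removed, added
```

Keeping only the changed lines gives NLP models a much shorter input than the full article text while still capturing what the template addition or removal reacted to.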

Wiki-Reliability Multilingual

We expanded the Wiki-Reliability processing code beyond English to support the creation of similar template-based datasets for other Wikipedia language editions. The processing pipeline was rewritten in Spark and is available as multilingual/Process_Multilingual_Templates.ipynb. The Spark version of the pipeline is also significantly faster and has slightly higher recall of template addition/removal pairs. We plan to release the multilingual datasets soon.
