This repo contains the processing code we used to create the Wiki-Reliability dataset.
The processing notebooks should be run in the following order:
- The MatchTemplatesUDF.ipynb notebook runs in PySpark and uses an AVRO version of the XML Wikipedia dumps (dumps.wikimedia.org). The code can be adapted to the XML version by using the "getTemplatesRegexReliability()" function with the mwxml package.
- Wikipedia dumps in XML can be downloaded here. To get all the revisions, you need to download the "All pages with complete edit history" files.
- If you decide to work in PySpark, you can use this repository to transform the XML dump to AVRO.
- A recent version of the mediawiki_history table can be downloaded from https://dumps.wikimedia.org/other/mediawiki_history/
- Processes the full history of revisions returned by MatchTemplatesUDF to extract positive and negative template pairs. Positive examples are versions of an article which contain a reliability issue, signalled by the addition of the template, while negative examples are article revisions where the issue has been resolved, signalled by the removal of the template.
- For each revision example, extracts a set of metadata features.
- For each revision example in the dataset, parses the article's textual contents to be included as part of a text dataset for NLP tasks.
- For each revision, extracts the:
  - full text of the revision
  - diff'd version of the revision text, containing only the changed sections of text between each revision pair
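
As a rough illustration of the steps above (template matching, positive/negative pairing, and diffing), the following minimal Python sketch runs on a toy in-memory revision history. It is not the notebooks' code: the function names, the template list, and the sample revisions are all assumptions made for this example, and the actual pipeline reads revisions from the AVRO or XML dumps.

```python
import difflib
import re

# Hypothetical template names, chosen only for this illustration.
RELIABILITY_TEMPLATES = ["unreliable sources", "pov"]


def find_templates(wikitext, templates=RELIABILITY_TEMPLATES):
    """Return the set of tracked templates that appear in the wikitext."""
    found = set()
    for name in templates:
        # Match "{{name}}" or "{{name|...}}", ignoring case and padding.
        pattern = r"\{\{\s*" + re.escape(name) + r"\s*(\||\}\})"
        if re.search(pattern, wikitext, flags=re.IGNORECASE):
            found.add(name)
    return found


def label_revisions(revisions, template):
    """Label revisions where the template is added (positive) or removed (negative).

    `revisions` is a chronologically ordered list of (rev_id, wikitext) tuples.
    """
    events = []
    previously_present = False
    for rev_id, text in revisions:
        present = template in find_templates(text, [template])
        if present and not previously_present:
            events.append(("positive", rev_id))  # template was added
        elif previously_present and not present:
            events.append(("negative", rev_id))  # template was removed
        previously_present = present
    return events


def pair_events(events):
    """Pair each template addition with the next removal that follows it."""
    pairs = []
    pending = None
    for label, rev_id in events:
        if label == "positive":
            pending = rev_id
        elif pending is not None:
            pairs.append((pending, rev_id))
            pending = None
    return pairs


def changed_lines(old_text, new_text):
    """Return (removed, added) lines between two revision texts."""
    old, new = old_text.splitlines(), new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(old[i1:i2])
            added.extend(new[j1:j2])
    return removed, added


# Toy revision history (invented for the example).
revisions = [
    (1, "Intro text.\nA claim."),
    (2, "{{Unreliable sources|date=May 2020}}\nIntro text.\nA claim."),
    (3, "{{Unreliable sources|date=May 2020}}\nIntro text.\nA claim with more detail."),
    (4, "Intro text.\nA sourced claim.<ref>Book</ref>"),
]

events = label_revisions(revisions, "unreliable sources")
pairs = pair_events(events)
print(pairs)  # [(2, 4)]

texts = dict(revisions)
for positive_id, negative_id in pairs:
    removed, added = changed_lines(texts[positive_id], texts[negative_id])
    print(removed, added)
```

The sketch compares whole lines for simplicity; a finer-grained diff (for example at sentence or section level) may be closer to what a given downstream task needs.
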
We expanded the Wiki-Reliability processing code beyond English to support the creation of similar template-based datasets for other language projects on Wikipedia. The processing pipeline was rewritten in Spark and is available at multilingual/Process_Multilingual_Templates.ipynb. The Spark version of the pipeline is also significantly faster and has slightly higher recall of template addition/removal pairs. We plan to release the multilingual datasets soon.
- Authors: KayYen Wong, Miriam Redi and Diego Saez-Trumper.
- Paper: Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia
- Datasets are available for download on Figshare.
- For more details, see the Research Project Page.