Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia

This repo contains the processing code we used to create the Wiki-Reliability dataset.

The processing notebooks should be run in this order:

1. MatchTemplatesUDF.ipynb

  • MatchTemplatesUDF.ipynb runs in PySpark and uses an AVRO version of the XML Wikipedia dumps (dumps.wikimedia.org). The code can be adapted to the XML version by embedding the getTemplatesRegexReliability() function into the mwxml package.

  • Wikipedia dumps in XML can be downloaded from dumps.wikimedia.org. To get all the revisions you need to download the "All pages with complete edit history" files.

  • If you decide to work in PySpark, you can use this repository to transform the XML dump to AVRO.

  • A recent version of the mediawiki_history table can be downloaded from https://dumps.wikimedia.org/other/mediawiki_history/
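The template-matching step above can be sketched as a plain-Python function. The template names below are hypothetical placeholders (the real pipeline uses a curated list of reliability templates), and the notebook applies the equivalent logic as a PySpark UDF over the AVRO dump:

```python
import re

# Hypothetical examples; the real pipeline uses a curated list of
# reliability-related maintenance templates.
RELIABILITY_TEMPLATES = ["unreliable sources", "original research", "hoax"]

def get_templates_regex_reliability(wikitext):
    """Return the set of reliability templates present in a revision's wikitext.

    MediaWiki template names are case-insensitive and treat spaces and
    underscores as interchangeable, so both sides are normalised.
    """
    pattern = re.compile(
        r"\{\{\s*("
        + "|".join(re.escape(t) for t in RELIABILITY_TEMPLATES)
        + r")\s*[|}]",
        re.IGNORECASE,
    )
    found = set()
    for match in pattern.finditer(wikitext.replace("_", " ")):
        found.add(match.group(1).lower())
    return found
```

Running this over every revision of every page yields, per revision, which reliability templates it carries; that per-revision signal is what the next notebook consumes.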

2. Process_Templates.ipynb

  • Processes the full history of revisions returned by MatchTemplatesUDF to extract positive and negative template pairs. Positive examples are article revisions that contain a reliability issue, signalled by the addition of the template; negative examples are article revisions where the issue has been resolved, signalled by the removal of the template.

  • For each revision example, extracts a set of metadata features.
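The pairing logic above can be sketched as follows, assuming each page's revisions arrive in chronological order as (rev_id, has_template) tuples. This is a minimal sketch: the actual notebook also extracts the metadata features and handles real dump records rather than tuples.

```python
def extract_template_pairs(revisions):
    """Pair each template addition (positive example: the reliability issue
    is flagged) with the next template removal (negative example: the issue
    is considered resolved).

    revisions: chronologically ordered list of (rev_id, has_template) tuples.
    Returns a list of (positive_rev_id, negative_rev_id) pairs.
    """
    pairs = []
    pending_positive = None
    prev_has = False
    for rev_id, has_template in revisions:
        if has_template and not prev_has:
            # Template was just added: this revision is a positive example.
            pending_positive = rev_id
        elif prev_has and not has_template and pending_positive is not None:
            # Template was just removed: this revision is the matching negative.
            pairs.append((pending_positive, rev_id))
            pending_positive = None
        prev_has = has_template
    return pairs
```

Additions that are never followed by a removal produce no pair, which matches the idea that only resolved issues yield a positive/negative example pair.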

3. Process_Templates_Text.ipynb

  • For each revision example in the dataset, parses the article's textual contents to be included as part of a text dataset for NLP tasks.

  • For each revision, extracts the:

    • full text of the revision
    • diff'd version of the revision text, containing only the changed sections of text between each revision pair
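The diff'd text extraction can be approximated with the stdlib difflib; this is a minimal stand-in, and the notebook's exact diff algorithm and output format may differ:

```python
import difflib

def diff_revision_text(old_text, new_text):
    """Return the lines removed from and added to an article between a
    revision pair, i.e. only the changed sections of text."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" changes lines on both sides; "delete"/"insert" touch one side.
        if tag in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return removed, added
```

Keeping only the changed lines gives NLP models a much shorter input than the full article text while still capturing what the template addition or removal reacted to.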

Wiki-Reliability Multilingual

We expanded the Wiki-Reliability processing code beyond English to support the creation of similar template-based datasets for other Wikipedia language editions. The processing pipeline was rewritten in Spark and is available as multilingual/Process_Multilingual_Templates.ipynb. The Spark version of the pipeline is also significantly faster and has slightly higher recall of template addition/removal pairs. We plan to release the multilingual datasets soon.
