Doc2Vec CQADupStack

This repo provides code for reproducing the results of J. Lau and T. Baldwin in their paper, "An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation".

It currently only works with the dataset used for the forum question duplication task, CQADupStack. The dataset for this is available from http://nlp.cis.unimelb.edu.au/resources/cqadupstack/ and the script for processing it is included as a submodule of this repo.

The notebook.ipynb illustrates plotting cosine-similarity results on the learned vectors for one of the forums, as well as visualizing dimensionality-reduced vectors using t-SNE.
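
The gist of that analysis, as a minimal sketch rather than the notebook's actual code (the vectors file name and its layout of one whitespace-separated vector per line are assumptions here):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical file of inferred vectors, one whitespace-separated vector per line.
vectors = np.loadtxt("some/path/doc_vectors.txt")

# Cosine similarity of the first document against all others.
sims = cosine_similarity(vectors[:1], vectors)[0]
print("Most similar to doc 0:", sims[1:].argmax() + 1)

# Reduce to 2-D with t-SNE and plot.
reduced = TSNE(n_components=2).fit_transform(vectors)
plt.scatter(reduced[:, 0], reduced[:, 1], s=2)
plt.show()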

There is one script in this repo, run.py, which performs all the necessary steps: extracting a forum dataset, extracting a small training set to work with (as described in the Lau paper), extracting a test set of 10M documents, training a doc2vec model, and inferring document vectors for all documents. It can also infer document vectors using Lau et al.'s pretrained models built from external corpora, Associated Press News and Wikipedia. These pretrained models are linked from https://github.com/jhlau/doc2vec.

In addition to the pretrained doc2vec models from external corpora, Lau et al. created pretrained word2vec word embeddings from AP News and Wikipedia. These word embeddings are also linked from the above GitHub repo and can be used with this script to train new models.

It is also possible to use GloVe embeddings after using the script to convert them to the correct format. See instructions below.

Requirements

Lau et al. forked gensim to add the ability to train document vectors using pre-trained word embeddings. This repo provides a conda environment.yml file so you can create the environment needed to use this forked version of gensim.

Usage

To create the required python environment and activate it, run the following from the command line:

$ cd doc2vec_cqadup
$ conda env create -f environment.yml
$ source activate doc2vec
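
With the environment active, a quick way to sanity-check that the forked gensim is the one installed is to look for the pretrained_emb argument it adds to Doc2Vec (an assumption about the fork's extension point; stock gensim releases don't have it):

import inspect
from gensim.models import Doc2Vec

try:
    params = inspect.signature(Doc2Vec.__init__).parameters   # Python 3
except AttributeError:
    params = inspect.getargspec(Doc2Vec.__init__).args        # Python 2
print("forked gensim" if "pretrained_emb" in params else "stock gensim")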

Assuming you have downloaded the zip files for the CQADupStack forum data to "path/to/cqadup/zip/files", to extract a particular forum dataset and run the train/test split provided by the CQADupStack script, run:

$ python run.py --name="english" --location="some/path" --cqadup-path="path/to/cqadup/zip/files" extract-dataset

This will place the extracted files for the "english" forum at "some/path". You will use this location for all of the extracted/processed files, including trained models and inferred vectors.

To extract a tiny training set of roughly 3000 negative and 300 positive examples, run:

$ python run.py --name="english" --location="some/path" extract-train-set

To extract a test set of 10M docs (using uniform random sampling, per the paper), run:

$ python run.py --name="english" --location="some/path" extract-test-set

To extract the text for all the documents into a file that has one document per line (as required by doc2vec), run:

$ python run.py --name="english" --location="some/path" extract-doc-text

To train a model from scratch using all of the documents:

$ python run.py --name="english" --location="some/path" train-model

This will result in a model.bin file being placed at "some/path". To infer vectors for all the docs in the forum based on this model, run:

$ python run.py --name="english" --location="some/path" infer-doc-vectors

To infer vectors for some other set of documents based on this model, run:

$ python run.py --name="english" --location="some/path" --docs="path/to/some/docs/file" infer-doc-vectors

To use a different model, e.g. one of the pretrained doc2vec models, to infer vectors for all the docs in the forum, run:

$ python run.py --name="english" --location="some/path" --model="path/to/pretrained/doc2vec/model/file" infer-doc-vectors
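
Under the hood these commands wrap gensim's Doc2Vec API. Roughly (file names and hyperparameters below are illustrative assumptions, and the parameter names follow the older gensim API the fork is based on, not necessarily the exact settings in run.py):

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# One document per line, as produced by extract-doc-text (file name assumed).
docs = TaggedLineDocument("some/path/docs.txt")

# Train a dbow model for 20 iterations and save it.
model = Doc2Vec(docs, dm=0, size=300, window=15, min_count=1, iter=20, workers=4)
model.save("some/path/model.bin")

# Inference: re-estimate a vector for each document against the frozen model.
model = Doc2Vec.load("some/path/model.bin")
with open("some/path/docs.txt") as f:
    doc_vectors = [model.infer_vector(line.split(), steps=1000) for line in f]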

To train a model using pre-trained word embeddings, the embeddings need to be in the non-binary word2vec format. The files linked from the Lau repo are not in this format, but you can use the convert-pretrained command to convert them:

$ python run.py --words="path/to/apnews_sg/word2vec.bin" convert-pretrained

This will produce "path/to/apnews_sg/word2vec.txt", which can then be used to train new document vectors:

$ python run.py --words="path/to/apnews_sg/word2vec.txt" --name="english" --location="some/location" train-model
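
For reference, the conversion and the pretrained-embedding training roughly correspond to the gensim calls below. This is a hedged sketch: the pretrained_emb keyword is the extension added by the Lau fork, the paths and hyperparameters are placeholders, and the older gensim API the fork is based on is used (newer gensim moves the load/save calls to KeyedVectors):

from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# Convert binary word2vec embeddings to the plain-text word2vec format.
emb = Word2Vec.load_word2vec_format("path/to/apnews_sg/word2vec.bin", binary=True)
emb.save_word2vec_format("path/to/apnews_sg/word2vec.txt", binary=False)

# Train doc2vec on top of the pretrained word embeddings (pretrained_emb is
# the forked gensim's addition; it is not in stock gensim).
docs = TaggedLineDocument("some/location/docs.txt")
model = Doc2Vec(docs, dm=0, size=300, iter=20,
                pretrained_emb="path/to/apnews_sg/word2vec.txt")
model.save("some/location/model.bin")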

To use GloVe embeddings, they first need to be converted to word2vec format. Instead of --words, use the --gloves option when running convert-pretrained:

$ python run.py --gloves="path/to/glove/embeddings.txt" convert-pretrained

This will create a new file at "path/to/glove/embeddings.word2vec.txt", which can then be used to train a doc2vec model:

$ python run.py --words="path/to/glove/embeddings.word2vec.txt" --name="english" --location="some/location" train-model
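
The GloVe conversion itself is small: the word2vec text format is the same as GloVe's, except for a leading header line giving the vocabulary size and vector dimensionality. gensim ships a glove2word2vec helper that does exactly this; using it here is an assumption about how run.py implements the conversion:

from gensim.scripts.glove2word2vec import glove2word2vec

# Prepend the "vocab_size dimensionality" header that word2vec format expects.
glove2word2vec("path/to/glove/embeddings.txt",
               "path/to/glove/embeddings.word2vec.txt")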

The default number of iterations is 20 when training a model and 1000 for inference. For either command, this can be overridden with the --iter option.
