@@ -4,7 +4,7 @@
 At CSET, we aim to produce a more comprehensive set of scholarly literature by ingesting multiple sources and then
 deduplicating articles. This repository contains CSET's current method of cross-dataset article linking. Note that we
 use "article" very loosely, although in a way that to our knowledge is fairly consistent across the datasets we draw
-from. Books, for example, are included. We currently include articles from arXiv, Web of Science, Papers With Code,
+from. Books, for example, are included. We currently include articles from arXiv, Papers With Code,
 Semantic Scholar, The Lens, and OpenAlex. Some of these sources are largely duplicative (e.g. arXiv is well covered by
 other corpora) but are included to aid in linking to additional metadata (e.g. arXiv fulltext).
 
@@ -15,12 +15,11 @@ article linkage, see the [ETO documentation](https://eto.tech/dataset-docs/mac/)
 
 To match articles, we need to extract the data that we want to use in matching and put it in a consistent format. The
 SQL queries specified in the `sequences/generate_{dataset}_data.tsv` files are run in the order they appear in those
-files. For OpenAlex we exclude documents with a `type` of Dataset, Peer Review, or Grant. Additionally, we take every
-combination of the Web of Science titles, abstracts, and pubyear so that a match on any of these combinations will
-result in a match on the shared WOS id. Finally, for Semantic Scholar, we exclude any documents that have a non-null
+files. For OpenAlex we exclude documents with a `type` of Dataset, Peer Review, or Grant. Finally, for Semantic Scholar,
+we exclude any documents that have a non-null
 publication type that is one of Dataset, Editorial, LettersAndComments, News, or Review.
 
-For each article in arXiv, Web of Science, Papers With Code, Semantic Scholar, The Lens, and OpenAlex
+For each article in arXiv, Papers With Code, Semantic Scholar, The Lens, and OpenAlex
 we [normalized](utils/clean_corpus.py) titles, abstracts, and author last names to remove whitespace, punctuation,
 and other artifacts thought to not be useful for linking. For the purpose of matching, we filtered out titles,
 abstracts, and DOIs that occurred more than 10 times in the corpus. We then considered each group of articles
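
The normalization and frequency filter described in the README text above can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the repository's code: the actual logic lives in `utils/clean_corpus.py`, which this diff does not show, and the names `normalize_text` and `drop_frequent` are hypothetical. The exact set of characters removed is also an assumption; only the threshold of 10 comes from the text.

```python
import re
import unicodedata
from collections import Counter


def normalize_text(text):
    """Hypothetical normalizer: lowercase and strip whitespace and punctuation.

    The real rules in utils/clean_corpus.py may differ.
    """
    if text is None:
        return None
    text = unicodedata.normalize("NFKC", text).lower()
    # \W matches any non-word character; "_" counts as a word character,
    # so it is listed explicitly to be removed as well.
    return re.sub(r"[\W_]+", "", text)


def drop_frequent(values, max_count=10):
    """Replace values seen more than max_count times with None.

    Very common titles, abstracts, or DOIs are treated as too
    uninformative to match on, so they are blanked rather than used.
    """
    counts = Counter(v for v in values if v)
    return [v if v and counts[v] <= max_count else None for v in values]


titles = ["Deep  Learning: A Review!", "deep learning, a review"]
normalized = [normalize_text(t) for t in titles]
filtered = drop_frequent(normalized)
```

In this sketch both example titles normalize to the same string, which is what lets them match across datasets. A value is blanked only once it appears more than 10 times.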