
Commit aa4698f

Merge pull request #55 from georgetown-cset/54-update-sources
Update sources
2 parents 281b7ff + 1cf3af3 commit aa4698f

15 files changed, +21 -149 lines changed

.github/workflows/main.yml

Lines changed: 2 additions & 2 deletions
@@ -9,10 +9,10 @@ jobs:
 
     steps:
     - uses: actions/checkout@v2
-    - name: Set up Python 3.7
+    - name: Set up Python 3.9
      uses: actions/setup-python@v1
      with:
-       python-version: 3.7
+       python-version: 3.9
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip

.github/workflows/pythonapp.yml

Lines changed: 2 additions & 2 deletions
@@ -9,10 +9,10 @@ jobs:
 
     steps:
     - uses: actions/checkout@v2
-    - name: Set up Python 3.7
+    - name: Set up Python 3.9
      uses: actions/setup-python@v1
      with:
-       python-version: 3.7
+       python-version: 3.9
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip

README.md

Lines changed: 4 additions & 5 deletions
@@ -4,7 +4,7 @@
 At CSET, we aim to produce a more comprehensive set of scholarly literature by ingesting multiple sources and then
 deduplicating articles. This repository contains CSET's current method of cross-dataset article linking. Note that we
 use "article" very loosely, although in a way that to our knowledge is fairly consistent across the datasets we draw
-from. Books, for example, are included. We currently include articles from arXiv, Web of Science, Papers With Code,
+from. Books, for example, are included. We currently include articles from arXiv, Papers With Code,
 Semantic Scholar, The Lens, and OpenAlex. Some of these sources are largely duplicative (e.g. arXiv is well covered by
 other corpora) but are included to aid in linking to additional metadata (e.g. arXiv fulltext).
 
@@ -15,12 +15,11 @@ article linkage, see the [ETO documentation](https://eto.tech/dataset-docs/mac/)
 
 To match articles, we need to extract the data that we want to use in matching and put it in a consistent format. The
 SQL queries specified in the `sequences/generate_{dataset}_data.tsv` files are run in the order they appear in those
-files. For OpenAlex we exclude documents with a `type` of Dataset, Peer Review, or Grant. Additionally, we take every
-combination of the Web of Science titles, abstracts, and pubyear so that a match on any of these combinations will
-result in a match on the shared WOS id. Finally, for Semantic Scholar, we exclude any documents that have a non-null
+files. For OpenAlex we exclude documents with a `type` of Dataset, Peer Review, or Grant. Finally, for Semantic Scholar,
+we exclude any documents that have a non-null
 publication type that is one of Dataset, Editorial, LettersAndComments, News, or Review.
 
-For each article in arXiv, Web of Science, Papers With Code, Semantic Scholar, The Lens, and OpenAlex
+For each article in arXiv, Papers With Code, Semantic Scholar, The Lens, and OpenAlex
 we [normalized](utils/clean_corpus.py) titles, abstracts, and author last names to remove whitespace, punctuation,
 and other artifacts thought to not be useful for linking. For the purpose of matching, we filtered out titles,
 abstracts, and DOIs that occurred more than 10 times in the corpus. We then considered each group of articles
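The normalization and frequency filtering the README describes could look roughly like the sketch below. This is an illustrative approximation, not the contents of `utils/clean_corpus.py`; the function names and exact cleaning rules are assumptions.

```python
import re
from collections import Counter
from typing import List, Optional

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (approximate cleaning rules)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def filter_frequent(values: List[str], max_count: int = 10) -> List[Optional[str]]:
    """Null out values (titles, abstracts, DOIs) that occur more than max_count times."""
    counts = Counter(values)
    return [v if counts[v] <= max_count else None for v in values]

# Example: two differently formatted titles normalize to the same string before matching.
titles = [normalize(t) for t in ["A Survey of   Article Linkage!", "a survey of article linkage"]]
titles = filter_frequent(titles)
```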

linkage_dag.py

Lines changed: 4 additions & 4 deletions
@@ -46,7 +46,7 @@
 
 production_dataset = "literature"
 staging_dataset = f"staging_{production_dataset}"
-args = get_default_args(pocs=["Jennifer"])
+args = get_default_args(pocs=["James"])
 args["retries"] = 1
 
 with DAG(
@@ -79,7 +79,7 @@
     # standard format
     metadata_sequences_start = []
     metadata_sequences_end = []
-    for dataset in ["arxiv", "wos", "papers_with_code", "openalex", "s2", "lens"]:
+    for dataset in ["arxiv", "papers_with_code", "openalex", "s2", "lens"]:
         ds_commands = []
         query_list = [
             t.strip()
@@ -407,12 +407,12 @@
 
     prep_environment = BashOperator(
        task_id="prep_environment",
-       bash_command=f'gcloud compute ssh jm3312@{gce_resource_id} --zone {GCP_ZONE} --command "{prep_environment_vm_script}"',
+       bash_command=f'gcloud compute ssh airflow@{gce_resource_id} --zone {GCP_ZONE} --command "{prep_environment_vm_script}"',
    )
 
    create_cset_ids = BashOperator(
        task_id="create_cset_ids",
-       bash_command=f'gcloud compute ssh jm3312@{gce_resource_id} --zone {GCP_ZONE} --command "bash run_ids_scripts.sh &> log &"',
+       bash_command=f'gcloud compute ssh airflow@{gce_resource_id} --zone {GCP_ZONE} --command "bash run_ids_scripts.sh &> log &"',
        inlets=[
            BigQueryTable(
                project_id=project_id, dataset_id=production_dataset, table_id="sources"
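For context on the dataset loop changed above: the README notes that the queries in `sequences/generate_{dataset}_data.tsv` run in the order they appear. A minimal sketch of reading one of those sequence files follows; the file layout and helper name are assumptions, not the DAG's exact code.

```python
from pathlib import Path
from typing import List

def load_query_sequence(dataset: str, sequence_dir: str = "sequences") -> List[str]:
    """Return the query names for one dataset, preserving file order."""
    sequence_file = Path(sequence_dir) / f"generate_{dataset}_data.tsv"
    return [line.strip() for line in sequence_file.read_text().splitlines() if line.strip()]

# The updated dataset list no longer includes "wos".
for dataset in ["arxiv", "papers_with_code", "openalex", "s2", "lens"]:
    queries = load_query_sequence(dataset)  # each query would become one sequential task
```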

requirements.txt

Lines changed: 8 additions & 47 deletions
@@ -1,49 +1,10 @@
-apache-beam[gcp]>2.19.0
-attrs==19.3.0
-avro-python3==1.9.2.1
-cachetools==3.1.1
-certifi==2019.11.28
-chardet==3.0.4
-crcmod==1.7
-dill==0.3.1.1
-docopt==0.6.2
-docutils==0.15.2
-fastavro==0.21.24
-fasteners==0.15
-future==0.18.2
-gensim==3.8.1
-google-cloud-bigquery>=1.17.1
-hdfs==2.5.8
-idna==2.9
-importlib-metadata==1.5.0
-jmespath==0.9.5
-mock==2.0.0
-monotonic==1.5
-more-itertools==8.2.0
-packaging==20.1
-pbr==5.4.4
-pluggy==0.13.1
-py>=1.10.0
-pyarrow==0.15.1
-pyasn1==0.4.8
-pyasn1-modules==0.2.8
-pycld2==0.41
-pydot==1.4.1
-pymongo==3.10.1
-pyparsing==2.4.6
-pytest==5.3.5
-python-dateutil==2.8.1
-pytz==2019.3
-requests==2.23.0
-rsa>=4.7
-s3transfer==0.3.3
-scipy==1.4.1
-six==1.14.0
-smart-open==1.9.0
-tqdm==4.43.0
-typing==3.7.4.1
-typing-extensions==3.7.4.1
-wcwidth==0.1.8
-zipp==3.0.0
+apache-beam[gcp]
+chardet
+gensim
+google-cloud-bigquery
+pycld2
+requests
+tqdm
 pre-commit
 coverage
+pytest

sequences/generate_wos_metadata.tsv

Lines changed: 0 additions & 6 deletions
This file was deleted.

sql/ids_to_drop.sql

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@ SELECT DISTINCT merged_id
 FROM
   literature.sources
 WHERE
-  orig_id IN (SELECT id1 FROM staging_literature.unlink)
+  orig_id IN (SELECT id1 FROM {{ staging_dataset }}.unlink)
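The change above swaps the hard-coded `staging_literature` dataset for the templated `{{ staging_dataset }}` variable. A minimal sketch of how such a template renders is below, using Jinja (which Airflow applies to templated fields); the rendering context here is illustrative, not the DAG's actual configuration.

```python
from jinja2 import Template

# Render the templated query with a concrete staging dataset name.
sql = Template(
    "SELECT DISTINCT merged_id FROM literature.sources "
    "WHERE orig_id IN (SELECT id1 FROM {{ staging_dataset }}.unlink)"
).render(staging_dataset="staging_literature")
print(sql)  # the staging dataset name is filled in at run time
```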

sql/union_ids.sql

Lines changed: 0 additions & 2 deletions
@@ -1,8 +1,6 @@
 -- glue all the ids together (used in validation)
 SELECT id FROM {{ staging_dataset }}.arxiv_ids
 UNION ALL
-SELECT id FROM {{ staging_dataset }}.wos_ids
-UNION ALL
 SELECT id FROM {{ staging_dataset }}.papers_with_code_ids
 UNION ALL
 SELECT id FROM {{ staging_dataset }}.openalex_ids

sql/union_metadata.sql

Lines changed: 0 additions & 11 deletions
@@ -11,17 +11,6 @@ WITH meta AS (
     "arxiv" AS dataset
   FROM {{ staging_dataset }}.arxiv_metadata
   UNION ALL
-  SELECT
-    cast(id AS STRING) AS id,
-    title,
-    abstract,
-    clean_doi,
-    cast(year AS INT64) AS year,
-    last_names,
-    references,
-    "wos" AS dataset
-  FROM {{ staging_dataset }}.wos_metadata
-  UNION ALL
   SELECT
     cast(id AS STRING) AS id,
     title,

sql/wos_abstracts.sql

Lines changed: 0 additions & 6 deletions
This file was deleted.
