This repository contains a Python-based analysis to assess journal coverage overlap between JSTOR collections and other library collections, based on subscription metadata.
The input data represents journal subscriptions over time, where each row corresponds to a journal × period × collection triple.
Using journal identifiers (print and online ISSNs), the analysis examines whether the same journal-period appears in other JSTOR collections or in non-JSTOR collections, with full, partial, or complementary temporal coverage.
Special attention is given to journals that would be lost if JSTOR access were reduced or cancelled.
The input is a TSV file containing all journal subscriptions held by the library. Each row must represent a single subscription period in a specific collection.
Required columns:
| Column name | Description |
|---|---|
| `oclc_collection_name` | Name of the collection (used to identify JSTOR and sub-collections) |
| `publication_title` | Journal title (may vary slightly across collections) |
| `print_identifier` | Print ISSN (used for matching) |
| `online_identifier` | Online ISSN (used for matching) |
| `date_first_issue_online` | Start of coverage period |
| `date_last_issue_online` | End of coverage period (missing = assumed ongoing) |
Notes:
- Journals are matched using both identifiers, with fallback logic if identifiers are swapped across sources.
- Only records marked as "fulltext" are used for the analysis.
- Missing end dates are treated as coverage through 2026.
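The notes above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: `parse_year`, `load_subscriptions`, and `same_journal` are hypothetical names, dates are assumed to start with a four-digit year, and the "fallback" for swapped identifiers is assumed to mean that two records match when they share any ISSN. The "fulltext" filter is left out because the column that carries it is not listed above.

```python
import csv
import io

ASSUMED_END_YEAR = 2026  # from the notes: a missing end date means coverage through 2026


def parse_year(value):
    """Return the leading four-digit year of a date string, or None if absent."""
    value = (value or "").strip()
    return int(value[:4]) if value[:4].isdigit() else None


def load_subscriptions(handle):
    """Read the TSV and return one dict per row with integer start/end years."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        start = parse_year(row.get("date_first_issue_online"))
        if start is None:
            continue  # assumption: rows without a start year cannot be compared
        end = parse_year(row.get("date_last_issue_online")) or ASSUMED_END_YEAR
        issns = {
            (row.get("print_identifier") or "").strip(),
            (row.get("online_identifier") or "").strip(),
        } - {""}
        rows.append({
            "collection": row.get("oclc_collection_name"),
            "title": row.get("publication_title"),
            "issns": issns,
            "start": start,
            "end": end,
        })
    return rows


def same_journal(a, b):
    """Assumed matching rule: any shared ISSN, so swapped print/online values still match."""
    return bool(a["issns"] & b["issns"])


# Made-up sample row with an empty online ISSN and an empty end date
sample = (
    "oclc_collection_name\tpublication_title\tprint_identifier\tonline_identifier\t"
    "date_first_issue_online\tdate_last_issue_online\n"
    "JSTOR Arts & Sciences IV\tExample Journal\t1234-5678\t\t1990-01-01\t\n"
)
rows = load_subscriptions(io.StringIO(sample))
```

Comparing sets of ISSNs rather than column against column is one simple way to tolerate sources that record the print ISSN in the online field or the reverse.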
For each journal-period pair, matches are classified as:
- Full: the comparison period completely covers the source period
- Partial: periods overlap, but coverage is incomplete
- Complementary: no overlap; the comparison period covers different years, either adjacent to or separated from the source period
Overlap percentages (share of years covered) are also calculated.
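The classification can be illustrated with a small function. This is a sketch under stated assumptions: periods are treated as inclusive year ranges, and the percentage is taken relative to the source period. The name `classify_overlap` is hypothetical.

```python
def classify_overlap(src_start, src_end, cmp_start, cmp_end):
    """Classify how a comparison period covers a source period (inclusive years).

    Returns (overlap_type, percentage of source years covered).
    """
    overlap_start = max(src_start, cmp_start)
    overlap_end = min(src_end, cmp_end)
    source_years = src_end - src_start + 1
    shared_years = max(0, overlap_end - overlap_start + 1)
    pct = round(100 * shared_years / source_years, 1)

    if shared_years == 0:
        return "Complementary", pct  # no shared years at all
    if shared_years == source_years:
        return "Full", pct           # every source year is covered
    return "Partial", pct            # some, but not all, source years are covered


full = classify_overlap(2000, 2009, 1995, 2026)
partial = classify_overlap(2000, 2009, 2005, 2026)
complementary = classify_overlap(2000, 2009, 2010, 2020)
```

With these inputs, a source period of 2000 to 2009 is fully covered by 1995 to 2026, half covered by 2005 to 2026, and not covered at all by 2010 to 2020.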
For each JSTOR journal-period pair (excluding Books):
- Check for matches in all other collections
- Flag whether it has:
  - any match
  - at least one full match
  - at least one partial match
  - at least one complementary match
- Count total matches by overlap type
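As a rough sketch of this step, the flags and counts for one JSTOR journal-period can be derived from the list of overlap types found for it. The function name `summarise_matches` and the output keys are hypothetical, not the column names used in the workbook.

```python
from collections import Counter


def summarise_matches(overlap_types):
    """Turn a list of overlap types for one journal-period into flags and counts."""
    counts = Counter(overlap_types)
    return {
        "any_match": len(overlap_types) > 0,
        "has_full": counts["Full"] > 0,
        "has_partial": counts["Partial"] > 0,
        "has_complementary": counts["Complementary"] > 0,
        "n_full": counts["Full"],
        "n_partial": counts["Partial"],
        "n_complementary": counts["Complementary"],
    }


# One full match and two partial matches in other collections
summary = summarise_matches(["Full", "Partial", "Partial"])
# A journal-period that appears nowhere else
unmatched = summarise_matches([])
```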
Analyzes overlap between JSTOR sub-collections themselves.
For each pair of JSTOR collections:
- Counts how many unique journal-periods co-occur
- Distinguishes full, partial, and complementary coverage
For each JSTOR collection:
- Identifies journals that also appear in non-JSTOR collections
- Classifies overlap type
Creates a fully expanded match table intended for interactive use in Power BI.
Users can select a specific journal-period in a JSTOR collection and see:
- All other collections where it appears
- Periods covered
- Overlap type and percentage
Identifies two mutually exclusive categories:
- No matches: journals not present in any other collection
- JSTOR matches only: journals present in JSTOR collections but not outside JSTOR
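The two categories can be expressed as a small decision function. This sketch assumes that JSTOR collections can be recognised by a name starting with "JSTOR"; the actual rule applied to `oclc_collection_name` may differ. Both function names are hypothetical, and the non-JSTOR collection name in the example is made up.

```python
def is_jstor(collection_name):
    # Assumption: JSTOR collection names begin with "JSTOR"
    return collection_name.upper().startswith("JSTOR")


def at_risk_category(matching_collections):
    """Return the at-risk category for a journal, or None if it is held outside JSTOR.

    matching_collections: names of the *other* collections in which the journal appears.
    """
    if not matching_collections:
        return "No matches"
    if all(is_jstor(name) for name in matching_collections):
        return "JSTOR matches only"
    return None  # at least one non-JSTOR collection also holds the journal


no_match = at_risk_category([])
jstor_only = at_risk_category(["JSTOR Arts & Sciences II"])
safe = at_risk_category(["JSTOR Arts & Sciences II", "Example Publisher Package"])
```

Because the first condition is checked before the second, every journal falls into at most one category, which keeps the two groups mutually exclusive.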
For at-risk journals, the analysis can be enriched with:
- Number of VU publications (last 10 years) from PURE as a proxy for local usage
- SJR indicator as a proxy for journal importance/prestige
Matching is performed using multiple ISSN columns with priority rules.
Performs a dedicated analysis comparing JSTOR Arts & Sciences IV collection against all other collections in the main file.
This analysis provides:
- All individual matches: A complete row-by-row listing showing each Arts & Sciences IV journal-period and every collection it matches with, including:
  - Source journal details (title, ISSNs, coverage period)
  - Matching collection and journal details
  - Overlap type and percentage of years covered
- Journal-level summary: Aggregated statistics per Arts & Sciences IV journal-period, including:
  - Match flags (any match, full, partial, complementary)
  - Match counts by overlap type
- JSTOR overlap summary: How Arts & Sciences IV journals overlap with other JSTOR collections
- Non-JSTOR overlap summary: How Arts & Sciences IV journals overlap with non-JSTOR collections
- Unique analysis: Arts & Sciences IV journals with either:
  - No matches in any collection
  - Matches only within JSTOR (at risk if JSTOR is cancelled)
When run as a script, the analysis produces a single Excel workbook, `journal_overlap_analysis_results.xlsx`, with the following sheets:

| Sheet name | Description |
|---|---|
| `Overall` | JSTOR journal-periods with co-occurrence flags and counts |
| `JSTOR_InterCollection` | Overlap between JSTOR sub-collections |
| `JSTOR_vs_NonJSTOR` | Overlap between JSTOR and non-JSTOR collections |
| `Drilldown` | Detailed journal-period match table |
| `Unique_with_VU_Pubs` | At-risk journals enriched with VU publications and SJR |
| (or) `Unique` | At-risk journals without enrichment |
| `ArtsSci_All_Matches` | Every individual match for Arts & Sciences IV journals |
| `ArtsSci_Summary` | Summary statistics per Arts & Sciences IV journal |
| `ArtsSci_JSTOR_Overlap` | Arts & Sciences IV overlap with JSTOR collections |
| `ArtsSci_NonJSTOR_Overlap` | Arts & Sciences IV overlap with non-JSTOR collections |
| `ArtsSci_Unique_Analysis` | Arts & Sciences IV journals with no or JSTOR-only matches |
A separate notebook (`Citations_from_VU.ipynb`) performs a citation analysis using the OpenAlex API to quantify how frequently VU researchers cite articles from at-risk journals.
This enriches the overlap analysis with a measure of actual use: it shows which journals VU researchers cite, not only which journals the library subscribes to.
The citation analysis consists of three main steps:
For each at-risk journal from the overlap analysis:
- Resolves print and online ISSNs to OpenAlex source IDs
- Handles cases where ISSNs resolve to the same or different sources
- Collects all works published in each journal source
- Creates checkpointed CSV files for incremental processing
Outputs:
- `journal_source_resolution.csv`: ISSN to OpenAlex source mapping
- `journal_works.csv`: All works published in analyzed journals
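The ISSN resolution step might look roughly like the sketch below. It relies on the OpenAlex lookup of a source by ISSN at `/sources/issn:<ISSN>`; everything else is an assumption for illustration: the function names are hypothetical, the API key, rate limiting, and checkpointing mentioned elsewhere in this README are omitted, and the fetch function is passed in so the logic can be tried without network access.

```python
import json
import urllib.error
import urllib.request

OPENALEX = "https://api.openalex.org"


def source_url(issn):
    """Build the OpenAlex URL that looks up a source (journal) by ISSN."""
    return f"{OPENALEX}/sources/issn:{issn}"


def fetch_json(url):
    """Fetch a URL and decode JSON; return None when OpenAlex has no such record."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError:
        return None


def resolve_sources(print_issn, online_issn, fetch=fetch_json):
    """Return the distinct OpenAlex source IDs that a journal's ISSNs resolve to."""
    source_ids = []
    for issn in (print_issn, online_issn):
        if not issn:
            continue  # the journal has no ISSN of this kind
        record = fetch(source_url(issn))
        if record and record["id"] not in source_ids:
            source_ids.append(record["id"])
    return source_ids


# Offline demonstration with a stand-in for the API (made-up ISSNs and IDs)
fake_api = {
    source_url("1111-1111"): {"id": "https://openalex.org/S1"},
    source_url("2222-2222"): {"id": "https://openalex.org/S1"},
    source_url("3333-3333"): {"id": "https://openalex.org/S2"},
}
same_source = resolve_sources("1111-1111", "2222-2222", fetch=fake_api.get)
two_sources = resolve_sources("1111-1111", "3333-3333", fetch=fake_api.get)
```

When both ISSNs resolve to the same source, the result holds a single ID; when they resolve to different sources, both IDs are kept so that works can be collected from each.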
- Retrieves all VU publications from 2016-2026 using institutional affiliation
- Extracts all referenced works from VU publications
- Creates a citation network showing VU works → referenced works
Output:
- `vu_2016_2026_citation_edges.csv`: Citation edges from VU works
- Matches VU references against journal works
- Counts how many times VU researchers cited each journal
- Produces final citation counts per journal
Output:
- `vu_references_per_journal.csv`: Citation counts by journal
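The matching and counting step amounts to a lookup of each referenced work in the set of journal works. The following sketch uses plain Python with made-up identifiers; the function name and the in-memory data layout are assumptions, and the notebook itself reads these relations from the CSV files listed above.

```python
from collections import Counter


def count_references_per_journal(citation_edges, work_to_journal):
    """Count how often VU works reference works from each analyzed journal.

    citation_edges: iterable of (vu_work_id, referenced_work_id) pairs
    work_to_journal: dict mapping a work ID to the journal it was published in
    """
    counts = Counter()
    for _vu_work, referenced in citation_edges:
        journal = work_to_journal.get(referenced)
        if journal is not None:  # ignore references to works outside the analyzed journals
            counts[journal] += 1
    return counts


# Made-up example data
work_to_journal = {"W10": "Journal A", "W11": "Journal A", "W20": "Journal B"}
edges = [("V1", "W10"), ("V1", "W20"), ("V2", "W10"), ("V2", "W11"), ("V3", "W99")]
counts = count_references_per_journal(edges, work_to_journal)
```

Each edge is counted separately, so a journal work cited by two different VU publications contributes two citations to its journal.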
- OpenAlex API key (insert in script)
- Input: `Unique_with_VU_Pubs` sheet from overlap analysis
- Python packages: `requests`, `pandas`, `tqdm`
- Checkpointing: Saves progress incrementally to handle API interruptions
- Rate limiting: Includes delays to respect API limits
- ISSN resolution: Handles multiple ISSNs per journal and source merging
- Large dataset handling: Processes VU citation data in batches
The final citation counts indicate:
- High citation count: Journal is actively used by VU researchers
- Low/zero citation count: Journal may have coverage but limited actual use
- Combined with overlap analysis, helps prioritize which at-risk journals are truly essential
This citation metric provides empirical evidence of journal importance beyond subscription coverage, supporting data-driven collection management decisions.