ubvu/jstor-collections

Journal Overlap Analysis for JSTOR Collections

This repository contains a Python-based analysis to assess journal coverage overlap between JSTOR collections and other library collections, based on subscription metadata.

The input data represents journal subscriptions over time, where each row corresponds to a journal × period × collection triple.

Using journal identifiers (print and online ISSNs), the analysis examines whether the same journal-period appears in other JSTOR collections or in non-JSTOR collections, with full, partial, or complementary temporal coverage.

Special attention is given to journals that would be lost if JSTOR access were reduced or cancelled.


Input Data

A TSV file containing all journal subscriptions held by the library. Each row must represent a single subscription period in a specific collection.

Required columns:

| Column name | Description |
| --- | --- |
| oclc_collection_name | Name of the collection (used to identify JSTOR and sub-collections) |
| publication_title | Journal title (may vary slightly across collections) |
| print_identifier | Print ISSN (used for matching) |
| online_identifier | Online ISSN (used for matching) |
| date_first_issue_online | Start of coverage period |
| date_last_issue_online | End of coverage period (missing = assumed ongoing) |

Notes:

  • Journals are matched using both identifiers, with fallback logic if identifiers are swapped across sources.
  • Only records marked as "fulltext" are used for the analysis.
  • Missing end dates are treated as coverage through 2026.
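
A minimal sketch of how such a file could be loaded, assuming the date columns are strings whose first four characters are the year. The helper name `load_subscriptions` and the derived columns `start_year` and `end_year` are illustrative and not taken from the repository. The "fulltext" filter is left out because the column carrying that flag is not named here.

```python
import io

import pandas as pd

ONGOING_YEAR = 2026  # missing end dates are treated as coverage through 2026

REQUIRED = [
    "oclc_collection_name", "publication_title", "print_identifier",
    "online_identifier", "date_first_issue_online", "date_last_issue_online",
]


def load_subscriptions(source):
    """Read the TSV and derive start/end years (hypothetical helper)."""
    df = pd.read_csv(source, sep="\t", dtype=str)
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # Assumption: dates start with a four-digit year, e.g. "1995-01-01".
    df["start_year"] = pd.to_numeric(
        df["date_first_issue_online"].str[:4], errors="coerce"
    )
    end_year = pd.to_numeric(df["date_last_issue_online"].str[:4], errors="coerce")
    df["end_year"] = end_year.fillna(ONGOING_YEAR).astype(int)
    return df


# Tiny in-memory example: the first row has no end date.
sample = "\n".join([
    "\t".join(REQUIRED),
    "\t".join(["JSTOR Arts & Sciences I", "Journal A", "1234-5678",
               "8765-4321", "1995-01-01", ""]),
    "\t".join(["Other Collection", "Journal A", "1234-5678",
               "8765-4321", "2000-01-01", "2010-12-31"]),
])
subscriptions = load_subscriptions(io.StringIO(sample))
print(subscriptions[["oclc_collection_name", "start_year", "end_year"]])
```

Passing a file path instead of the `io.StringIO` object reads a TSV from disk in the same way.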

Methodology

Period overlap classification

For each journal-period pair, matches are classified as:

  • Full: the comparison period completely covers the source period
  • Partial: periods overlap, but coverage is incomplete
  • Complementary: the periods do not overlap; they are either adjacent or separated by a gap

Overlap percentages (share of years covered) are also calculated.
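
The classification can be sketched as a small function. This is an illustration under two assumptions that the description above does not spell out: periods are inclusive ranges of whole years, and the percentage is measured against the source period. The name `classify_overlap` is hypothetical.

```python
def classify_overlap(src_start, src_end, cmp_start, cmp_end):
    """Return (overlap_type, pct_of_source_years) for two inclusive year ranges."""
    if src_start > src_end or cmp_start > cmp_end:
        raise ValueError("start year must not exceed end year")
    src_years = src_end - src_start + 1
    # Number of years present in both ranges (zero or negative means none).
    shared = min(src_end, cmp_end) - max(src_start, cmp_start) + 1
    if shared <= 0:
        return "complementary", 0.0
    if cmp_start <= src_start and cmp_end >= src_end:
        return "full", 100.0
    return "partial", round(100.0 * shared / src_years, 1)


print(classify_overlap(1990, 1999, 1985, 2005))  # comparison covers the source
print(classify_overlap(1990, 1999, 1995, 2005))  # 5 of 10 source years shared
print(classify_overlap(1990, 1999, 2000, 2010))  # adjacent, no shared year
```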


Analysis

1. Overall co-occurrences

For each JSTOR journal-period pair (excluding Books):

  • Checks for matches in all other collections
  • Flags whether it has:
    • any match
    • at least one full match
    • at least one partial match
    • at least one complementary match
  • Counts total matches by overlap type
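
One way to derive these flags and counts with pandas is sketched below. It assumes a long-format match table with one row per match; the column names `source_id` and `overlap_type`, the output column names, and the function name are all hypothetical.

```python
import pandas as pd

TYPES = ["full", "partial", "complementary"]


def summarize_matches(source_ids, matches):
    """Per-source match counts and flags, including sources without any match."""
    counts = pd.crosstab(matches["source_id"], matches["overlap_type"])
    # Make sure every source and every overlap type is present, filling with 0.
    counts = counts.reindex(index=source_ids, columns=TYPES, fill_value=0)
    counts.columns = [f"n_{t}" for t in TYPES]
    for t in TYPES:
        counts[f"has_{t}"] = counts[f"n_{t}"] > 0
    counts["n_matches"] = counts[[f"n_{t}" for t in TYPES]].sum(axis=1)
    counts["has_any_match"] = counts["n_matches"] > 0
    return counts


matches = pd.DataFrame({
    "source_id": ["a", "a", "b"],
    "overlap_type": ["full", "partial", "complementary"],
})
summary = summarize_matches(["a", "b", "c"], matches)
print(summary)
```

Reindexing on the full list of source IDs keeps journal-periods without any match in the result, which matters for the later "no matches" category.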

2. JSTOR inter-collection overlap

Analyzes overlap between JSTOR sub-collections themselves.

For each pair of JSTOR collections:

  • Counts how many unique journal-periods co-occur
  • Distinguishes full, partial, and complementary coverage
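
As a rough sketch, the pairwise counts could be produced with a grouped distinct count. The column names (`source_collection`, `match_collection`, `source_id`, `overlap_type`) and the function name are assumptions made for this example.

```python
import pandas as pd


def collection_pair_overlap(matches):
    """Unique journal-periods per collection pair, one column per overlap type."""
    return (
        matches
        .groupby(["source_collection", "match_collection", "overlap_type"])["source_id"]
        .nunique()  # a journal-period counts once per pair and type
        .unstack("overlap_type", fill_value=0)
    )


pairs = pd.DataFrame({
    "source_collection": ["I", "I", "I", "II"],
    "match_collection": ["II", "II", "II", "III"],
    "source_id": ["a", "a", "b", "c"],
    "overlap_type": ["full", "full", "partial", "complementary"],
})
pair_table = collection_pair_overlap(pairs)
print(pair_table)
```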

3. JSTOR vs non-JSTOR overlap

For each JSTOR collection:

  • Identifies journals that also appear in non-JSTOR collections
  • Classifies overlap type

4. Detailed drill-down table

Creates a fully expanded match table intended for interactive use in Power BI.

Users can select a specific journal-period in a JSTOR collection and see:

  • All other collections where it appears
  • Periods covered
  • Overlap type and percentage

5. Unique and at-risk journals

Identifies two mutually exclusive categories of at-risk journals:

  • No matches: journals not present in any other collection
  • JSTOR matches only: journals that match in other JSTOR collections but in no non-JSTOR collection
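
The two categories can be expressed as a simple rule over the list of collections in which a journal matches. The sketch below assumes that JSTOR collections can be recognized by a name starting with "JSTOR"; that rule, the function name, and the category labels are illustrative only.

```python
def is_jstor(collection_name):
    """Assumption: JSTOR collections have names that start with 'JSTOR'."""
    return collection_name.strip().upper().startswith("JSTOR")


def risk_category(match_collections):
    """Classify a journal by the collections (other than its own) where it matches."""
    if not match_collections:
        return "no_matches"
    if all(is_jstor(name) for name in match_collections):
        return "jstor_matches_only"
    return "covered_outside_jstor"  # not at risk under this definition


print(risk_category([]))
print(risk_category(["JSTOR Arts & Sciences II"]))
print(risk_category(["JSTOR Arts & Sciences II", "Some Publisher Package"]))
```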

6. Enrichment: VU publications and SJR

For at-risk journals, the analysis can be enriched with:

  • Number of VU publications (last 10 years) from PURE as a proxy for local usage
  • SJR indicator as a proxy for journal importance/prestige

Matching is performed using multiple ISSN columns with priority rules.
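
The exact priority rules are not listed here. The sketch below shows one plausible reading, in which the print ISSN is tried first and the online ISSN second, after normalizing both to the `NNNN-NNNC` form. The function names and the lookup-table shape are hypothetical.

```python
import re


def normalize_issn(value):
    """Return an ISSN as 'NNNN-NNNC', or None if the value is not a valid shape."""
    if not value:
        return None
    compact = re.sub(r"[^0-9Xx]", "", str(value)).upper()
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return None
    return f"{compact[:4]}-{compact[4:]}"


def lookup_by_issn(journal, table,
                   columns=("print_identifier", "online_identifier")):
    """Return the first table hit, trying ISSN columns in priority order."""
    for column in columns:
        issn = normalize_issn(journal.get(column))
        if issn is not None and issn in table:
            return table[issn]
    return None


sjr_by_issn = {"1234-5678": 1.2, "8765-432X": 0.4}  # made-up values
journal = {"print_identifier": "0000-0000", "online_identifier": "8765432x"}
print(lookup_by_issn(journal, sjr_by_issn))
```

The same lookup can be reused for both enrichment sources by passing a different table.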


7. Arts & Sciences IV comparison

Performs a dedicated analysis comparing the JSTOR Arts & Sciences IV collection against all other collections in the main file.

This analysis provides:

  • All individual matches: A complete row-by-row listing showing each Arts & Sciences IV journal-period and every collection it matches with, including:

    • Source journal details (title, ISSNs, coverage period)
    • Matching collection and journal details
    • Overlap type and percentage of years covered
  • Journal-level summary: Aggregated statistics per Arts & Sciences IV journal-period, including:

    • Match flags (any match, full, partial, complementary)
    • Match counts by overlap type
  • JSTOR overlap summary: How Arts & Sciences IV journals overlap with other JSTOR collections

  • Non-JSTOR overlap summary: How Arts & Sciences IV journals overlap with non-JSTOR collections

  • Unique analysis: Arts & Sciences IV journals with either:

    • No matches in any collection
    • Matches only within JSTOR (at risk if JSTOR is cancelled)

Outputs

When run as a script, the analysis produces a single Excel workbook:

journal_overlap_analysis_results.xlsx

with the following sheets:

| Sheet name | Description |
| --- | --- |
| Overall | JSTOR journal-periods with co-occurrence flags and counts |
| JSTOR_InterCollection | Overlap between JSTOR sub-collections |
| JSTOR_vs_NonJSTOR | Overlap between JSTOR and non-JSTOR collections |
| Drilldown | Detailed journal-period match table |
| Unique_with_VU_Pubs | At-risk journals enriched with VU publications and SJR |
| (or) Unique | At-risk journals without enrichment |
| ArtsSci_All_Matches | Every individual match for Arts & Sciences IV journals |
| ArtsSci_Summary | Summary statistics per Arts & Sciences IV journal |
| ArtsSci_JSTOR_Overlap | Arts & Sciences IV overlap with JSTOR collections |
| ArtsSci_NonJSTOR_Overlap | Arts & Sciences IV overlap with non-JSTOR collections |
| ArtsSci_Unique_Analysis | Arts & Sciences IV journals with no or JSTOR-only matches |

Citation Analysis

A separate notebook (Citations_from_VU.ipynb) performs citation analysis using the OpenAlex API to quantify how frequently VU researchers cite articles from at-risk journals.

Purpose

This analysis complements the overlap analysis by measuring actual citation usage. It shows which journals VU researchers actively cite, which subscription coverage alone cannot reveal.

Methodology

The citation analysis consists of three main steps:

1. Journal resolution and work collection

For each at-risk journal from the overlap analysis:

  • Resolves print and online ISSNs to OpenAlex source IDs
  • Handles cases where ISSNs resolve to the same or different sources
  • Collects all works published in each journal source
  • Creates checkpointed CSV files for incremental processing

Outputs:

  • journal_source_resolution.csv: ISSN to OpenAlex source mapping
  • journal_works.csv: All works published in analyzed journals
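
The resolution and collection step might look roughly like the sketch below. It uses the standard library (`urllib`) instead of `requests` so that it is self-contained, and it takes the HTTP function as a parameter so it can be exercised without network access. The endpoint patterns (`/sources/issn:...`, cursor paging on `/works`) reflect the public OpenAlex API as understood at the time of writing; the function names are hypothetical, and API key handling and rate limiting are omitted.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

BASE = "https://api.openalex.org"


def http_get_json(url):
    """Default fetcher: GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def resolve_source_id(issn, fetch=http_get_json):
    """Resolve an ISSN to an OpenAlex source ID, or None if it is unknown."""
    try:
        return fetch(f"{BASE}/sources/issn:{issn}").get("id")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def iter_works(source_id, fetch=http_get_json, per_page=200):
    """Yield all works of one source, following cursor pagination."""
    cursor = "*"
    while cursor:
        query = urllib.parse.urlencode({
            "filter": f"primary_location.source.id:{source_id}",
            "per-page": per_page,
            "cursor": cursor,
        })
        page = fetch(f"{BASE}/works?{query}")
        yield from page.get("results", [])
        # The cursor is null once the last page has been returned.
        cursor = page.get("meta", {}).get("next_cursor")
```

When the print and online ISSN resolve to the same source ID, the works only need to be collected once for that journal.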

2. VU publication citation extraction

  • Retrieves all VU publications from 2016-2026 using institutional affiliation
  • Extracts all referenced works from VU publications
  • Creates a citation network showing VU works → referenced works

Output:

  • vu_2016_2026_citation_edges.csv: Citation edges from VU works
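
Once the VU works have been retrieved, turning them into citation edges is a small transformation. The sketch below assumes each work is a dict in the OpenAlex shape, with an `id` and a `referenced_works` list; retrieving the works themselves (by institution and publication year) is not shown, and the function and column names are illustrative.

```python
import csv
import io


def citation_edges(works):
    """Yield (citing_work_id, referenced_work_id) pairs."""
    for work in works:
        for referenced in work.get("referenced_works") or []:
            yield work["id"], referenced


def write_edges(works, handle):
    """Write the edges as CSV to an open text handle; return the edge count."""
    writer = csv.writer(handle)
    writer.writerow(["vu_work_id", "referenced_work_id"])
    count = 0
    for edge in citation_edges(works):
        writer.writerow(edge)
        count += 1
    return count


vu_works = [
    {"id": "W10", "referenced_works": ["W1", "W2"]},
    {"id": "W11", "referenced_works": []},
    {"id": "W12"},  # works without a reference list are skipped
]
buffer = io.StringIO()
edge_count = write_edges(vu_works, buffer)
print(edge_count)
```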

3. Citation matching

  • Matches VU references against journal works
  • Counts how many times VU researchers cited each journal
  • Produces final citation counts per journal

Output:

  • vu_references_per_journal.csv: Citation counts by journal
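
The matching step amounts to a join followed by a count. In this sketch the column names (`referenced_work_id`, `work_id`, `journal`) and the function name are assumptions. Duplicate journal/work rows are dropped first, which guards against double counting when two ISSNs of a journal resolve to the same source.

```python
import pandas as pd


def references_per_journal(edges, journal_works):
    """Count VU references per journal, keeping journals with zero references."""
    works = journal_works.drop_duplicates(["journal", "work_id"])
    matched = edges.merge(
        works, left_on="referenced_work_id", right_on="work_id", how="inner"
    )
    counts = matched.groupby("journal").size()
    return counts.reindex(works["journal"].unique(), fill_value=0).rename("vu_references")


edges = pd.DataFrame({
    "vu_work_id": ["V1", "V2", "V2"],
    "referenced_work_id": ["W1", "W1", "W9"],
})
journal_works = pd.DataFrame({
    "journal": ["J1", "J1", "J2"],
    "work_id": ["W1", "W1", "W2"],
})
reference_counts = references_per_journal(edges, journal_works)
print(reference_counts)
```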

Requirements

  • OpenAlex API key (insert in script)
  • Input: Unique_with_VU_Pubs sheet from overlap analysis
  • Python packages: requests, pandas, tqdm

Features

  • Checkpointing: Saves progress incrementally to handle API interruptions
  • Rate limiting: Includes delays to respect API limits
  • ISSN resolution: Handles multiple ISSNs per journal and source merging
  • Large dataset handling: Processes VU citation data in batches
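
Checkpointing and rate limiting can be combined in a small loop, sketched here with the standard library only. The file layout (one CSV row per processed ISSN), the function names, and the `delay` parameter are illustrative and not taken from the notebook.

```python
import csv
import os
import tempfile
import time


def load_done(path, key="issn"):
    """Return the set of keys already present in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[key] for row in csv.DictReader(handle)}


def process_with_checkpoint(items, path, worker, fields=("issn", "result"), delay=0.0):
    """Run worker on each new item, appending each result to the checkpoint."""
    done = load_done(path, key=fields[0])
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        if write_header:
            writer.writeheader()
        for item in items:
            if item in done:
                continue  # already processed in an earlier run
            writer.writerow({fields[0]: item, fields[1]: worker(item)})
            handle.flush()  # keep progress on disk if the run is interrupted
            time.sleep(delay)  # simple pause between API calls


calls = []


def fake_worker(issn):
    """Stand-in for an API call; records which items were processed."""
    calls.append(issn)
    return len(issn)


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "checkpoint.csv")
    process_with_checkpoint(["1111-1111", "2222-2222"], checkpoint, fake_worker)
    # The second run only processes the item missing from the checkpoint.
    process_with_checkpoint(["1111-1111", "2222-2222", "3333-3333"], checkpoint, fake_worker)
    finished = load_done(checkpoint)
```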

Output interpretation

The final citation counts can be read as follows:

  • High citation count: the journal is actively used by VU researchers
  • Low or zero citation count: the journal may be covered but sees limited actual use

Combined with the overlap analysis, these counts help prioritize which at-risk journals are truly essential.

This citation metric provides empirical evidence of journal importance beyond subscription coverage, supporting data-driven collection management decisions.
