This repository contains scripts for analyzing and validating cinema data for the Clusterflick project. These scripts help discover new venues, validate cinema IDs, and check coordinate accuracy.
Install dependencies:
npm install

This will install the scripts package from GitHub along with other required
dependencies.
Some scripts require environment variables. Copy the example file and fill in your values:
cp .env.example .env

Then edit .env with your API keys:

MAPS_API_KEY=your_google_maps_api_key

These scripts check that cinema IDs in the database match the IDs used by cinema chain websites.
npm run check:cineworld-ids # Validate Cineworld cinema IDs
npm run check:curzon-ids # Validate Curzon cinema IDs
npm run check:everyman-ids # Validate Everyman cinema IDs
npm run check:myvue-ids # Validate MyVue cinema IDs
npm run check:odeon-ids # Validate Odeon cinema IDs
npm run check:omniplex-ids # Validate Omniplex cinema IDs
npm run check:picturehouse-ids # Validate Picturehouse cinema IDs

npm run check:coordinates # Validate cinema coordinates using Google Maps API

Requires the MAPS_API_KEY environment variable.
These scripts discover new venues from event platforms that may need to be added to the cinema database.
npm run discover:designmynight # Discover venues from DesignMyNight
npm run discover:dice # Discover venues from Dice.fm
npm run discover:eventbrite # Discover venues from Eventbrite
npm run discover:outsavvy # Discover venues from Outsavvy
npm run discover:ticketsource # Discover venues from TicketSource

These scripts find potential new cinemas from external data sources.
npm run find:openstreetmap # Find cinemas from OpenStreetMap data
npm run find:mycommunitycinema # Find cinemas from MyCommunity Cinema
npm run find:independentcinemaoffice # Find cinemas from Independent Cinema Office
npm run find:pearl-and-dean # Find cinemas from Pearl & Dean

npm run generate:map # Generate a KML map file of all cinemas

npm run compare:releases -- <current-dir> <previous-dir> <current-tag> <previous-tag>

Compares two transformed data releases to identify changes between pipeline runs.
Compares our transformed accessibility data against Accessible Screenings UK (UKCA) to identify gaps in our accessibility tagging.
npm run download:accessible-screenings # Fetch UKCA data (extracts JWT tokens from their website)
npm run compare:accessible-screenings -- <ukca-data-path> <transformed-data-dir>

The comparison:
- Matches venues by coordinates (within 250m) then name similarity, using a greedy best-match algorithm.
- Matches performances in three tiers: normalised booking URL, then performance ID extracted from URL query parameters (e.g. id, perfcode, showtimeId), then a fallback on title similarity (Jaccard ≥ 0.3) + time (within 15 minutes). The fallback is guarded so that two performances with different explicit IDs are never cross-matched.
- Compares accessibility tags on matched performances, mapping UKCA tags (AudioDescription, AutismFriendly, DementiaFriendly, Subtitled, ClosedCaption, OpenCaption) to our fields (audioDescription, relaxed, subtitled, hardOfHearing, babyFriendly).
- Reports mismatches, UKCA-only performances with accessibility tags, and venues with no gaps. Outputs a JSON log to output/.
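As a rough illustration of the last-resort tier, the sketch below computes token-level Jaccard similarity between two titles and combines it with the 15-minute window and the explicit-ID guard. The function names, the tokenisation rules, and the shape of the performance objects (title, time, perfId) are assumptions made for this example, not the repository's actual code.

```javascript
// Hypothetical helpers: names and tokenisation are assumptions for illustration.
function tokenize(title) {
  return new Set(
    title
      .toLowerCase()
      .replace(/[^\p{L}\p{N}\s]/gu, " ") // drop punctuation, keep letters and digits
      .split(/\s+/)
      .filter(Boolean),
  );
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets.
function jaccard(a, b) {
  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let intersection = 0;
  for (const token of setA) if (setB.has(token)) intersection += 1;
  return intersection / (setA.size + setB.size - intersection);
}

const FIFTEEN_MINUTES_MS = 15 * 60 * 1000;

function isFallbackMatch(ours, theirs, threshold = 0.3) {
  // Guard: two performances with different explicit IDs are never cross-matched.
  if (ours.perfId && theirs.perfId && ours.perfId !== theirs.perfId) return false;
  const gapMs = Math.abs(new Date(ours.time) - new Date(theirs.time));
  return gapMs <= FIFTEEN_MINUTES_MS && jaccard(ours.title, theirs.title) >= threshold;
}
```

Token sets make the score insensitive to word order, so "Godfather, The" and "The Godfather" count as identical, while the low 0.3 threshold tolerates suffixes such as "(AD)".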
UKCA's data has some known inaccuracies that would otherwise produce false positives. The comparison rolls these up as informational notes rather than real mismatches:
- Cineworld screen-level audio description: UKCA propagates AUDIO_DESCRIPTION from screen capabilities to all showtimes on that screen, even when AD isn't active for a specific showing. Performances where the only mismatch is missing audioDescription and the screen is listed as AD-capable in UKCA's own theater data are treated as info.
- Cineworld stale audio description: UKCA sometimes retains AudioDescription tags on Cineworld performances after Cineworld has removed them (e.g. when a film moves to a different screen). The comparison verifies AD-only mismatches against Cineworld's showtimes API; if Cineworld confirms the performance does not have audio-described, the mismatch is treated as stale UKCA data. Verification is capped at 25 per venue to avoid excessive API calls; if exceeded, mismatches are kept unverified with a warning. If the API is unreachable (e.g. Cloudflare blocking the runner IP), the mismatches are also kept with a warning.
- Cineworld stale listings: UKCA-only accessible performances for Cineworld are verified against Cineworld's order API to check if the session still exists. Stale listings (removed from Cineworld but still in UKCA) are filtered out and reported as info.
- Vue 10am baby-friendly vs autism-friendly: Vue's 10am "Mini Mornings" screenings are baby-friendly. UKCA categorises these as AutismFriendly (mapping to relaxed). The comparison detects Vue 10am showings where the only relevant mismatch is AutismFriendly → relaxed and rolls them up as info.
- Extra tags we have that UKCA lacks: Where our data has accessibility flags that UKCA doesn't track (e.g. babyFriendly), these are separated from real mismatches since they represent us being more detailed, not a gap.
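The roll-up logic above can be pictured as a small classifier. The sketch below is only an approximation under stated assumptions: the mapping table is deliberately partial (only pairs the text states or strongly implies), the level names (match, info, mismatch), the input shape, and the reading of "10am" as any start time in the 10 o'clock hour, London time, are all hypothetical.

```javascript
// Partial, assumed mapping: the remaining UKCA tags are left out on purpose.
const UKCA_TO_OURS = {
  AudioDescription: "audioDescription",
  AutismFriendly: "relaxed",
  Subtitled: "subtitled",
};

// Split differences into fields UKCA expects that we lack, and flags only we have.
function diffTags(ukcaTags, ourFlags) {
  const expected = new Set(ukcaTags.map((tag) => UKCA_TO_OURS[tag]).filter(Boolean));
  const actual = new Set(Object.keys(ourFlags).filter((key) => ourFlags[key]));
  return {
    missing: [...expected].filter((field) => !actual.has(field)),
    extra: [...actual].filter((field) => !expected.has(field)),
  };
}

// Hour of day in London for a UTC timestamp (handles GMT and BST).
function londonHour(isoTime) {
  const parts = new Intl.DateTimeFormat("en-GB", {
    timeZone: "Europe/London",
    hour: "2-digit",
    hourCycle: "h23",
  }).formatToParts(new Date(isoTime));
  return Number(parts.find((part) => part.type === "hour").value);
}

function classify({ chain, time, ukcaTags, ourFlags }) {
  const { missing, extra } = diffTags(ukcaTags, ourFlags);
  if (missing.length === 0) {
    // Extra flags on our side are informational, never a gap.
    return { level: extra.length > 0 ? "info" : "match", missing, extra };
  }
  const vueMiniMorning =
    chain === "vue" &&
    londonHour(time) === 10 &&
    missing.length === 1 &&
    missing[0] === "relaxed";
  return { level: vueMiniMorning ? "info" : "mismatch", missing, extra };
}
```

Keeping "missing" and "extra" apart is what lets the report treat our more detailed flags as notes while still surfacing genuine gaps.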
The comparison runs automatically after each transform via
.github/workflows/compare-accessible-screenings.yml, triggered by
repository_dispatch (compare_releases) or manually via workflow_dispatch.
The JSON report is uploaded as a GitHub Actions artifact (retained 14 days).
Compares our transformed screening data against CinemaGuide to identify differences in coverage and screentimes.
npm run download:cinemaguide-screenings
npm run compare:cinemaguide-screenings -- cinemaguide-data/cinemaguide-data.json transformed-data/current/

The download fetches all London venues from the CinemaGuide API and saves the
result to cinemaguide-data/cinemaguide-data.json.
The comparison:
- Matches venues using URL overlap as the primary signal, falling back to name similarity (threshold: 0.8) for venues without bookable links. For chains where URL formats differ between CinemaGuide and our data, a venue-level key is extracted from the URL instead of comparing full URLs:
  - Picturehouse: site code from /movie-details/{code}/ or /showtimes/{code}-
  - Vue: venue ID from /book-tickets/summary/{id}/ (CinemaGuide's URLs also have a spurious double-slash which is normalised out)
  - One hard-coded alias handles "Electric Cinema Notting Hill" → electriccinema.co.uk-portobello, where CinemaGuide uses the neighbourhood name and we use the street name.
- Matches screenings in six tiers (each only runs on unmatched entries):
  - Normalised booking URL exact match
  - Performance ID extracted from URL query parameters (id, perfcode, showtimeId, eid)
  - Normalised showing URL + time within 15 minutes (for venues like Barbican where CinemaGuide links to the event page rather than a booking system)
  - Normalised showing URL + BST-adjusted time within 15 minutes, Barbican only (see carve-outs below)
  - URL slug tokens vs title tokens, Jaccard ≥ 0.5 + time within 15 minutes (for venues like BFI where CinemaGuide uses clean slug URLs whose words match our title words reordered)
  - Title similarity, Jaccard ≥ 0.3 + time within 15 minutes (last resort; prevented from cross-matching entries with conflicting explicit perf IDs)
- Reports per-venue differences: screenings in CinemaGuide only (possible gaps in our data) and screenings in our data only (events CinemaGuide doesn't cover, not treated as failures).
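The "each tier only runs on unmatched entries" behaviour of the screening matcher can be sketched as a simple cascade. This is a minimal illustration, not the project's implementation: matchInTiers, the tier objects ({ name, test }), and the entry shapes are all hypothetical, and the first candidate that passes a tier's test is taken without any scoring.

```javascript
// Hypothetical cascade: tiers are tried in order, each seeing only what is left.
function matchInTiers(cgEntries, ourEntries, tiers) {
  const matches = [];
  let cgLeft = [...cgEntries];
  const oursLeft = [...ourEntries];

  for (const { name, test } of tiers) {
    const stillUnmatched = [];
    for (const cg of cgLeft) {
      const index = oursLeft.findIndex((ours) => test(cg, ours));
      if (index === -1) {
        stillUnmatched.push(cg);
      } else {
        // Record which tier produced the match, then remove our entry
        // so it cannot be matched a second time.
        matches.push({ cg, ours: oursLeft[index], tier: name });
        oursLeft.splice(index, 1);
      }
    }
    cgLeft = stillUnmatched;
  }

  return { matches, cgOnly: cgLeft, oursOnly: oursLeft };
}

// Usage with two toy tiers standing in for the real six.
const tiers = [
  { name: "booking-url", test: (cg, ours) => cg.url === ours.url },
  { name: "title", test: (cg, ours) => cg.title === ours.title },
];
const result = matchInTiers(
  [{ url: "a", title: "X" }, { url: "zz", title: "Y" }, { url: "q", title: "Z" }],
  [{ url: "a", title: "X" }, { url: "b", title: "Y" }, { url: "c", title: "W" }],
  tiers,
);
console.log(result.matches.map((match) => match.tier));
```

Running strict tiers first means the looser title-based tiers can only pair entries that nothing more reliable has already claimed.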
Each matched venue in the output shows the match method (url overlap: X%,
name-only, or hard-coded alias) so low-confidence matches can be spotted at
a glance.
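For the chains where a venue-level key replaces full-URL comparison, the extraction might look roughly like the sketch below. The path patterns come from the description above; the hostnames checked, the key format (picturehouse:…, vue:…), the assumption that codes contain no hyphen or slash, and the example URLs are all illustrative guesses rather than the real implementation.

```javascript
// Hypothetical helper: returns a chain-specific venue key, or null if none applies.
function venueKeyFromUrl(rawUrl) {
  const url = new URL(rawUrl);
  const host = url.hostname.replace(/^www\./, "");
  // Collapse repeated slashes, e.g. the spurious double-slash in CinemaGuide's Vue URLs.
  const path = url.pathname.replace(/\/{2,}/g, "/");

  if (host.endsWith("picturehouses.com")) {
    // /movie-details/{code}/ or /showtimes/{code}-
    const match = path.match(/\/(?:movie-details|showtimes)\/([^/-]+)/);
    return match ? `picturehouse:${match[1]}` : null;
  }
  if (host.endsWith("myvue.com")) {
    // /book-tickets/summary/{id}/
    const match = path.match(/\/book-tickets\/summary\/([^/]+)/);
    return match ? `vue:${match[1]}` : null;
  }
  return null;
}

// Made-up URLs, used only to exercise the patterns.
console.log(venueKeyFromUrl("https://www.picturehouses.com/movie-details/010/HO00001/some-film"));
console.log(venueKeyFromUrl("https://www.myvue.com//book-tickets/summary/10032/abc"));
```

Two venues then count as the same when their keys are equal, regardless of how the rest of the booking URL is laid out.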
CinemaGuide's data has some known inaccuracies and structural quirks that would otherwise produce false positives:
- Known mismatches — sports and live events: We deliberately exclude sports screenings (e.g. football, rugby, Grand Prix) and FANPARK events. When CinemaGuide lists these, they appear as "Expected gaps" in the report rather than real failures. Patterns matched: cup/league screenings, Union Jack Classic, Super Bowl, Six Nations, AFCON, Grand Prix, FANPARK.
- Garden Cinema parser artifacts: CinemaGuide's parser sometimes fails to read the date for Garden Cinema screenings and defaults to January 1st while keeping the time. Any CG-only Garden Cinema entry dated January 1st is treated as a parser artifact and folded into "Expected gaps".
- CinemaGuide duplicate entries: CinemaGuide sometimes lists the same TicketSource event under multiple venue slugs, once under the programme title and once under the individual film title. Entries with identical link+time per venue are deduplicated before matching.
- Stale listing verification: For certain venues, CG-only screenings are verified against the venue's own API to check whether they've been removed. Stale listings (removed from the venue but still in CG) are reported as informational rather than failures. Verification is capped at 25 per venue; if exceeded, entries are kept unverified with a warning.
  - Cineworld: verified via experience.cineworld.co.uk/api/OrderMedia
  - The Nickel: verified via thenickel.co.uk/api/screenings/{id}
  - Vue: verified via myvue.com/api/microservice/showings/cinemas/{id}/showings/{showingId} using Playwright (to avoid 401s from direct fetch calls)
- Barbican BST time offset: Barbican's website has incorrect datetime attributes during British Summer Time: they write the local time as if it were UTC (e.g. 18:30:00Z for an event that is actually 18:30 BST = 17:30 UTC). CinemaGuide trusts this attribute; our pipeline reads the display text and stores the correct UTC time. For Barbican events during BST, CG times are therefore 1 hour ahead of ours. The comparison accounts for this by adding a dedicated matching tier that shifts CG's time back by 1 hour for BST-period Barbican events before checking URL + time.
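The Barbican adjustment can be illustrated with a short sketch. It treats CinemaGuide's timestamp as London wall-clock time mislabelled as UTC and subtracts the London UTC offset for that moment (60 minutes during BST, 0 otherwise). The function names are hypothetical, and deriving the offset through Intl.DateTimeFormat is one possible approach, not necessarily the one the comparison script uses.

```javascript
// Minutes that London is ahead of UTC at the given instant (0 in GMT, 60 in BST).
function londonOffsetMinutes(date) {
  const parts = new Intl.DateTimeFormat("en-GB", {
    timeZone: "Europe/London",
    hourCycle: "h23",
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
    hour: "2-digit",
    minute: "2-digit",
  }).formatToParts(date);
  const get = (type) => Number(parts.find((part) => part.type === type).value);
  // Rebuild the London wall-clock reading as if it were UTC, then compare.
  const wallClockAsUtc = Date.UTC(get("year"), get("month") - 1, get("day"), get("hour"), get("minute"));
  const minuteStart = Math.floor(date.getTime() / 60000) * 60000;
  return (wallClockAsUtc - minuteStart) / 60000;
}

// Hypothetical helper: shift a CG Barbican time back by the BST offset.
function barbicanAdjustedTime(cgIsoTime) {
  const cgTime = new Date(cgIsoTime);
  return new Date(cgTime.getTime() - londonOffsetMinutes(cgTime) * 60000);
}

console.log(barbicanAdjustedTime("2025-07-10T18:30:00Z").toISOString());
console.log(barbicanAdjustedTime("2025-01-10T18:30:00Z").toISOString());
```

In this sketch the summer example moves to 17:30 UTC while the winter one is left alone, so the adjusted value can be fed into the usual URL + 15-minute check.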
The data/ directory contains reference data files:
- London_GLA_Boundary.geojson - GeoJSON boundary of Greater London
- openstreetmap.json - Cinema data exported from OpenStreetMap
- mycommunitycinema.json - Cinema data from MyCommunity Cinema
- independentcinemaoffice.json - Cinema data from Independent Cinema Office
This project uses the scripts package as a dependency. The scripts package provides:
- scripts/common/utils - Utility functions (readJSON, fetchText, fetchJson, etc.)
- scripts/common/geo-utils - Geographic utilities (isInLondon)
- scripts/common/cache - Caching utilities (dailyCache)
- scripts/common/distance-in-km-between-coordinates - Geo calculations
- scripts/common/source-utils - Source matching utilities
- scripts/cinemas - Cinema data access (getAllCinemaNames, getCinemaAttributes, etc.)
- scripts/sources - Event source access (getSourceDiscoverVenues, etc.)
MIT