This repository contains a Jupyter notebook that was completed as part of the NHS DigData Step-Up Challenge.
The aim of the project is to explore and visualise national antidepressant prescribing using the NHS Business Services Authority (NHSBSA) Prescription Cost Analysis (PCA) dataset.
The notebook walks through:
- Loading and understanding the PCA regional drug summary data
- Cleaning and preparing the YEAR_MONTH and drug fields
- Answering specific challenge questions about antidepressant prescribing
- Creating visualisations for key trends
- Generating summary metrics and insights on volume and cost
- Source: NHSBSA Open Data Portal – Prescription Cost Analysis (PCA) monthly data
- Main file used: BSA_ODP_PCA_REGIONAL_DRUG_SUMMARY.csv (cloned from the nhsengland/Digdata GitHub repo)
Key columns used in the analysis include:
- YEAR_MONTH – Year and month in YYYYMM format
- REGION – NHS region
- BNF_CHEMICAL_SUBSTANCE – Drug / chemical substance name
- ITEMS – Number of prescription items
- COST – Prescribing cost (in GBP)
The data is openly available and anonymised.
- Clone the official Digdata GitHub repository:

      git clone https://github.com/nhsengland/Digdata

- Load BSA_ODP_PCA_REGIONAL_DRUG_SUMMARY.csv into a pandas DataFrame.
- Inspect the structure and summary statistics to understand the size and layout of the dataset.
- Treat YEAR_MONTH as a string to handle inconsistencies.
- Remove invalid YEAR_MONTH entries where the month part is 00 or greater than 12.
- Convert valid YEAR_MONTH values into proper datetime objects.
- Create cleaned datasets such as:
  - National monthly total prescribing cost for all drugs
  - Monthly cost time series for specific antidepressants (e.g. escitalopram)
  - Subsets by region and drug name
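The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, the helper name `clean_year_month` and the `DATE` column are assumptions, and only the column names come from the list earlier in this README.

```python
import pandas as pd

# Made-up sample rows; the real data would come from something like
# pd.read_csv("Digdata/BSA_ODP_PCA_REGIONAL_DRUG_SUMMARY.csv", dtype={"YEAR_MONTH": str})
raw = pd.DataFrame({
    "YEAR_MONTH": [202401, 202401, 202402, 202400, 202413],
    "REGION": ["LONDON", "MIDLANDS", "LONDON", "LONDON", "MIDLANDS"],
    "BNF_CHEMICAL_SUBSTANCE": ["Escitalopram", "Mirtazapine", "Escitalopram",
                               "Escitalopram", "Mirtazapine"],
    "ITEMS": [100, 80, 110, 5, 7],
    "COST": [250.0, 160.0, 275.5, 10.0, 14.0],
})


def clean_year_month(df):
    """Hypothetical helper: keep rows with a valid YYYYMM value and add a DATE column."""
    out = df.copy()
    # Treat YEAR_MONTH as text so malformed values can be inspected safely.
    out["YEAR_MONTH"] = out["YEAR_MONTH"].astype(str).str.strip()
    # Keep only six-digit values whose month part is 01..12 (drops 00 and 13+).
    valid = out["YEAR_MONTH"].str.fullmatch(r"\d{4}(0[1-9]|1[0-2])")
    out = out[valid].copy()
    # Convert to a datetime set to the first day of each month.
    out["DATE"] = pd.to_datetime(out["YEAR_MONTH"], format="%Y%m")
    return out


clean = clean_year_month(raw)

# National monthly total prescribing cost across all drugs.
national_monthly_cost = clean.groupby("DATE", as_index=False)["COST"].sum()

# Monthly cost time series for one drug (escitalopram in this sample).
escitalopram_monthly_cost = (
    clean[clean["BNF_CHEMICAL_SUBSTANCE"] == "Escitalopram"]
    .groupby("DATE", as_index=False)["COST"]
    .sum()
)
```

Validating the month with a pattern before calling `pd.to_datetime` means invalid rows are dropped explicitly rather than raising an error or being silently coerced.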
The notebook answers a number of specific analytical questions, for example:
- Calculate the monthly national cost of mirtazapine prescribing.
- Calculate the annual spend on sertraline hydrochloride in a specific region (e.g. the Midlands).
- Identify which antidepressants are most frequently prescribed nationally by total number of items and total cost.
These questions are answered using grouped aggregations, filtering by BNF_CHEMICAL_SUBSTANCE, and summarising ITEMS and COST.
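As a rough sketch of how such questions might be answered with filtering and grouped aggregation: the function names, the `DATE` column (assumed to be the datetime built from YEAR_MONTH during cleaning) and the sample figures below are all illustrative assumptions, not the notebook's code or real prescribing data.

```python
import pandas as pd

# Made-up sample rows using the README's column names plus an assumed DATE column.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-01-01",
                            "2024-01-01", "2024-01-01", "2024-02-01"]),
    "REGION": ["MIDLANDS", "MIDLANDS", "LONDON", "MIDLANDS", "LONDON", "MIDLANDS"],
    "BNF_CHEMICAL_SUBSTANCE": ["Sertraline hydrochloride", "Sertraline hydrochloride",
                               "Mirtazapine", "Sertraline hydrochloride",
                               "Mirtazapine", "Mirtazapine"],
    "ITEMS": [100, 120, 50, 130, 60, 40],
    "COST": [200.0, 240.0, 150.0, 260.0, 180.0, 120.0],
})


def monthly_national_cost(data, drug):
    """Total cost per month for one drug, summed over all regions."""
    mask = data["BNF_CHEMICAL_SUBSTANCE"].str.casefold() == drug.casefold()
    return data[mask].groupby("DATE")["COST"].sum()


def annual_regional_cost(data, drug, region):
    """Total cost per calendar year for one drug in one region."""
    mask = (
        (data["BNF_CHEMICAL_SUBSTANCE"].str.casefold() == drug.casefold())
        & (data["REGION"].str.casefold() == region.casefold())
    )
    subset = data[mask]
    return subset.groupby(subset["DATE"].dt.year)["COST"].sum()


def rank_drugs(data):
    """Total items and cost per drug, highest item volume first."""
    totals = data.groupby("BNF_CHEMICAL_SUBSTANCE")[["ITEMS", "COST"]].sum()
    return totals.sort_values("ITEMS", ascending=False)


mirtazapine_monthly = monthly_national_cost(df, "mirtazapine")
sertraline_midlands = annual_regional_cost(df, "sertraline hydrochloride", "midlands")
ranking = rank_drugs(df)
```

The case-insensitive comparison is a defensive choice in this sketch; if the drug and region names in the real file are consistently cased, a plain equality filter would do.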
Several visualisations are created to make the results easier to interpret, including:
- A horizontal bar chart of the top 5 most prescribed drugs in 2024, sorted in descending order of item volume.
- A vertical bar chart showing the total annual cost of sertraline prescribing for a chosen region (for example, the North West).
- A line chart of the national monthly cost of escitalopram, rounded to the nearest pound.
- A line chart of the total national monthly prescribing cost across all drugs.
These plots use matplotlib and seaborn to show trends, highlight peaks and troughs, and compare different years.
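For illustration, a horizontal "top 5" bar chart like the first one listed could be produced along these lines. The drug labels and item counts are placeholders, `plot_top_n` is a hypothetical helper, and the sketch uses matplotlib alone rather than reproducing the notebook's exact styling.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder totals per drug for one year (not real prescribing figures).
items_2024 = pd.Series(
    {"Drug A": 900, "Drug B": 750, "Drug C": 1200,
     "Drug D": 300, "Drug E": 640, "Drug F": 980},
    name="ITEMS",
)


def plot_top_n(items, n=5, title="Top drugs by items"):
    """Hypothetical helper: horizontal bar chart of the n largest values."""
    top = items.nlargest(n)  # sorted in descending order of item volume
    fig, ax = plt.subplots(figsize=(8, 4))
    # barh draws the first bar at the bottom, so reverse to put the largest on top.
    ax.barh(top.index[::-1], top.values[::-1])
    ax.set_xlabel("Items")
    ax.set_title(title)
    fig.tight_layout()
    return fig, top


fig, top5 = plot_top_n(items_2024, n=5, title="Top 5 drugs by items (sample data)")
fig.savefig("top5_items_sample.png")
```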
The notebook then moves into higher-level metrics and insight generation:
- Annual summary statistics (min, Q1, median, Q3, max) for the monthly national prescribing cost, grouped by year.
- A grouped boxplot comparing the distribution of monthly costs between years (e.g. 2021–2024), to see how variability and central tendency change over time.
- An antidepressant-focused summary that:
  - Aggregates total items and total cost by BNF_CHEMICAL_SUBSTANCE
  - Calculates each drug’s percentage share of the total antidepressant volume and cost
  - Computes the mean cost per item for each antidepressant
- A more detailed look at a specific antidepressant (such as escitalopram), where:
  - Monthly items and cost are aggregated over time
  - Trends in volume and cost are plotted to highlight changes and possible seasonality
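The summary metrics above might be computed roughly as shown below. This is a sketch on made-up numbers: the derived column names (`ITEMS_SHARE_PCT`, `COST_SHARE_PCT`, `MEAN_COST_PER_ITEM`) are assumptions, and "mean cost per item" is interpreted here as total cost divided by total items, which may differ from the notebook's definition.

```python
import pandas as pd

# Made-up monthly national cost series (assumed output of the cleaning stage).
monthly = pd.DataFrame({
    "DATE": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01",
                            "2024-01-01", "2024-02-01", "2024-03-01"]),
    "COST": [100.0, 200.0, 300.0, 150.0, 250.0, 350.0],
})

# Five-number summary of monthly cost for each year.
annual_summary = (
    monthly.groupby(monthly["DATE"].dt.year)["COST"]
    .describe()[["min", "25%", "50%", "75%", "max"]]
)

# Made-up antidepressant rows with placeholder drug names.
antidepressants = pd.DataFrame({
    "BNF_CHEMICAL_SUBSTANCE": ["Drug A", "Drug A", "Drug B", "Drug C"],
    "ITEMS": [300, 300, 300, 100],
    "COST": [600.0, 600.0, 1500.0, 300.0],
})

# Total items and cost per drug.
summary = antidepressants.groupby("BNF_CHEMICAL_SUBSTANCE")[["ITEMS", "COST"]].sum()

# Each drug's percentage share of overall volume and spend.
summary["ITEMS_SHARE_PCT"] = 100 * summary["ITEMS"] / summary["ITEMS"].sum()
summary["COST_SHARE_PCT"] = 100 * summary["COST"] / summary["COST"].sum()

# Mean cost per item, taken as total cost over total items.
summary["MEAN_COST_PER_ITEM"] = summary["COST"] / summary["ITEMS"]
```

In this sample, Drug B accounts for a smaller share of items than Drug A but a larger share of cost, which is the kind of pattern the mean cost per item is meant to surface.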
Some key messages that emerge from the analysis include:
- A small set of antidepressants (for example, escitalopram and fluoxetine) accounts for a large share of total items and total spend, making them important for both clinical and cost planning.
- Some medicines have a high mean cost per item despite lower prescribing volumes, which can still have a significant budget impact and may warrant closer monitoring.
- The distribution of monthly national cost changes over time, with different years showing different ranges and medians rather than a flat or static pattern.
- Time-series plots for key antidepressants suggest ongoing trends and potential seasonal patterns, which could be related to service demand, guideline changes or mental health awareness activities.
These insights are intended as a starting point for further exploration rather than definitive clinical conclusions.
- Clone this repository

      git clone https://github.com/ahmedkansulum/NHS_DigData_Analysis.git
      cd NHS_DigData_Analysis

- (Optional) Create and activate a virtual environment

  Using venv:

      python -m venv .venv
      # On Windows:
      .venv\Scripts\activate
      # On macOS / Linux:
      source .venv/bin/activate

- Install dependencies

  You can either use a requirements.txt file or install the main libraries directly:

      pip install pandas numpy matplotlib seaborn jupyter

- Download the data

  Option A – Let the notebook clone the Digdata repo (recommended):
  - Ensure git is installed.
  - In the notebook, run the first cell:

        git clone https://github.com/nhsengland/Digdata

  - Confirm that BSA_ODP_PCA_REGIONAL_DRUG_SUMMARY.csv is available under the Digdata/ folder.

  Option B – Manual download:
  - Visit the NHSBSA Open Data Portal and download the PCA file.
  - Save BSA_ODP_PCA_REGIONAL_DRUG_SUMMARY.csv into a Digdata/ folder next to the notebook.

- Launch Jupyter and open the notebook

      jupyter notebook

  Then open NHS_DigData_Step_Up_Challenge_Analysis.ipynb and run the cells from top to bottom.
- NHS_DigData_Step_Up_Challenge_Analysis.ipynb
  Main analysis notebook that contains all data preparation, analysis, visualisation and commentary for the NHS DigData Step-Up Challenge.
- (Optional, generated by the notebook)
  - national_prescribing_monthly_cost.csv – Cleaned monthly national prescribing cost across all drugs.
  - escitalopram_monthly_cost.csv – Cleaned monthly national cost for escitalopram.
You can choose whether or not to commit the generated CSV files to your GitHub repository.
- Python
- pandas for data loading, cleaning and aggregation
- NumPy for numerical operations
- Matplotlib and Seaborn for visualisations
- Jupyter Notebook for interactive analysis
- Data provided by NHS Business Services Authority (NHSBSA) via the NHSBSA Open Data Portal.
- Notebook created as part of the NHS DigData Step-Up Challenge – Step Up programme.
This project is for learning and demonstration purposes only.
The analysis is based on openly available data and does not represent official NHS analytics, advice or policy.



