This project performs a comprehensive analysis of educational texts from Option2 and Option1 perspectives in Northern Ireland, including content analysis, topic modeling with BERTopic, and sentiment analysis using OpenAI. The analysis compares content across document types: textbooks, policy documents, and teacher interviews.
Northern_Ireland_Education_Text/
├── README.md
├── environment.yml
├── environment_setup.md
├── scripts/
│   ├── config.py
│   ├── utils.py
│   ├── file_reader.py
│   ├── main.py
│   ├── preprocess.py
│   ├── stopwords.txt
│   └── phrases.txt
├── analysis/
│   ├── BERTopic.ipynb
│   ├── descriptives.ipynb
│   ├── descriptives_no_interview.ipynb
│   └── sentiment.ipynb
├── data/
│   ├── metadata.csv
│   └── strand1/
│       ├── both/
│       │   ├── GCSE History (2017)-specification-Standard.docx
│       │   └── Reconciled_interviews/
│       │       ├── TeacherF_reconciled.docx
│       │       └── TeacherI_reconciled.docx
│       ├── option1/
│       │   ├── GCSE Planning Framework History Unit 1 Section B Option 1.docx
│       │   ├── Madden (2007) History for CCEA GCSE Revision Guide - Chapter 3 (Peace, War and Neutrality).docx
│       │   ├── Madden (2009) History for CCEA GCSE Second Edition - Chapter 2 (Peace, War and Neutrality) (1).docx
│       │   ├── Madden (2011) ccea revision guide chp 2 peace war neutrality.docx
│       │   ├── Madden and McBride History for CCEA GCSE Chapter 2.docx
│       │   ├── option1_combined all.docx
│       │   ├── TeacherC_reconciled.docx
│       │   ├── TeacherD_reconciled.docx
│       │   ├── TeacherE_reconciled.docx
│       │   ├── TeacherH_reconciled.docx
│       │   ├── TeacherL_reconciled.docx
│       │   ├── TeacherN_reconciled.docx
│       │   ├── TeacherO_reconciled.docx
│       │   ├── TeacherP_reconciled.docx
│       │   ├── TeacherQ_reconciled.docx
│       │   └── Updated Johnston&Johnston no textboxes.docx
│       └── option2/
│           ├── Doherty (2001) Northern Ireland since c.1960 .docx
│           ├── GCSE Planning Framework History Unit 1 Section B Option 2.docx
│           ├── Madden (2007) History for CCEA GCSE Revision Guide - Chapter 4 (Changing Relationships) (2).docx
│           ├── Madden (2009) History for CCEA GCSE Second Edition - Chapter 3 (Changing Relationships).docx
│           ├── Madden (2011) CCEA revision guide Chp 3. Changing Relationships.docx
│           ├── Madden and McBride History for CCEA GCSE Chapter 3.docx
│           ├── option2_combined all.docx
│           ├── TeacherA_reconciled.docx
│           ├── TeacherB_reconciled.docx
│           ├── TeacherG_reconciled.docx
│           ├── TeacherJ_reconciled.docx
│           ├── TeacherK_reconciled.docx
│           ├── TeacherM_reconciled.docx
│           └── Updated Doherty (2001) no textboxes.docx
└── outputs/
    ├── processed_text_data.csv
    ├── cleaned_text_data.csv
    └── analysis_results/
        ├── models/
        ├── raw/
        ├── no_interview/
        ├── reduce_outlier/
        ├── all_visuals/
        ├── option1_sentiment_analysis_fixed/
        └── option2_sentiment_analysis_fixed/
- option2/: All Option2 perspective documents (textbooks, teacher interviews, etc.)
- option1/: All Option1 perspective documents (textbooks, teacher interviews, etc.)
- both/: All shared/interview/policy documents (e.g., reconciled teacher interviews, policy docs)
- Textbooks: Educational materials by Madden, Doherty, Johnston
- Policy Documents: GCSE Planning Frameworks and specifications
- Combined Resources: Comprehensive resource collections
- Teacher Interviews: Teacher interview transcripts (can be under option2/, option1/, or both/)
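The folder convention above can be turned into a small lookup. The following is only a sketch: `perspective_of` and `is_interview` are hypothetical helpers, not functions from this repository, and the project's own scripts may rely on `data/metadata.csv` instead. The interview check assumes the `Teacher*_reconciled.docx` naming seen in the directory listing.

```python
from pathlib import PurePosixPath

# Folder names taken from the data/strand1/ layout described above.
PERSPECTIVE_FOLDERS = ("option1", "option2", "both")


def perspective_of(path: str) -> str:
    """Return the perspective folder a document sits under, or 'unknown'."""
    for part in PurePosixPath(path).parts:
        if part in PERSPECTIVE_FOLDERS:
            return part
    return "unknown"


def is_interview(path: str) -> bool:
    """Heuristic (assumption): interviews are named Teacher*_reconciled.docx."""
    name = PurePosixPath(path).name
    return name.startswith("Teacher") and name.endswith("_reconciled.docx")
```

A path such as `data/strand1/both/Reconciled_interviews/TeacherF_reconciled.docx` would be tagged `both` and recognised as an interview under these assumptions.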
This project uses conda for dependency management. Follow these steps to set up your environment:
# Clone or download the project
cd Northern_Ireland_Education_Text
# Create the conda environment (recommended)
conda env create -f environment.yml
# Activate the environment
conda activate bertopic_env
To use sentiment analysis features, you'll need an OpenAI API key:
- Get an API key from OpenAI
- Create a .env file in the project root:
# In your .env file
OPENAI_API_KEY=your_actual_api_key_here

# Read and process raw documents
python -m scripts.main
# Clean and preprocess text
python scripts/preprocess.py
# Generate descriptive statistics
jupyter notebook analysis/descriptives.ipynb
# Run Jupyter notebook for topic analysis
jupyter notebook analysis/BERTopic.ipynb
# Run sentiment analysis on key terms
jupyter notebook analysis/sentiment.ipynb

Your analysis will generate:
- outputs/processed_text_data.csv: Processed document data
- outputs/cleaned_text_data.csv: Cleaned text for analysis
- outputs/analysis_results/: Topic modeling results and visualizations; sentiment analysis results
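After a run, it can be useful to confirm that the two CSV outputs exist and are non-empty. The sketch below uses only the standard library and makes no assumption about column names; `summarize_csv` is a hypothetical helper, not part of the project's scripts.

```python
import csv
from pathlib import Path

# Full document texts may exceed the csv module's default field size limit,
# so raise it to a large value that is safe on all platforms.
csv.field_size_limit(2**31 - 1)


def summarize_csv(path):
    """Return (column_names, data_row_count) for a CSV file."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # first row is treated as the header
        row_count = sum(1 for _ in reader)
    return header, row_count


if __name__ == "__main__":
    for name in ("outputs/processed_text_data.csv", "outputs/cleaned_text_data.csv"):
        if Path(name).exists():
            columns, rows = summarize_csv(name)
            print(f"{name}: {rows} rows, columns={columns}")
        else:
            print(f"{name}: not found (run the pipeline first)")
```

Run it from the project root so that the relative `outputs/` paths resolve.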
The pipeline includes URL processing capabilities for combined documents:
- Raw Content Fetching: Uses enhanced web scraping to fetch live content from URLs
- AI Knowledge-Based Fallback: When raw fetching fails, uses OpenAI to generate summaries based on training data
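The two-stage behaviour listed above (raw fetch first, AI summary only on failure) can be sketched as follows. This is an illustration rather than the pipeline's actual code: `get_url_text`, `fetch_raw`, and `summarize` are hypothetical names, and the fetcher and summarizer are passed in as callables so the control flow can be read and tested without network access or an API key. The `max_chars` default mirrors the `MAX_URL_CHARS` setting.

```python
from typing import Callable, Optional


def get_url_text(
    url: str,
    fetch_raw: Callable[[str], str],
    summarize: Optional[Callable[[str], str]] = None,
    max_chars: int = 8000,
) -> Optional[str]:
    """Try raw fetching first; fall back to a summary only if that fails."""
    try:
        text = fetch_raw(url)
        if text and text.strip():
            # Raw content succeeded: truncate and return it.
            return text[:max_chars]
    except Exception:
        # Any fetch error (timeout, HTTP error, parse failure) triggers the fallback.
        pass
    if summarize is None:
        # Fallback disabled: report that nothing could be retrieved.
        return None
    return summarize(url)
```

Passing `summarize=None` corresponds to running with the AI fallback switched off: failed URLs simply yield `None`.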
URL processing can be configured in scripts/config.py:
# URL processing parameters
FETCH_URLS = False # Set this to False to skip all URL processing
MAX_URL_CHARS = 8000 # Maximum characters to extract from each URL
URL_TIMEOUT = 15 # Timeout for URL requests in seconds
# OpenAI fallback parameters
USE_OPENAI_FALLBACK = False # Set this to False to disable AI completely
OPENAI_MODEL = "gpt-4o-mini" # OpenAI model to use for summarization
# OpenAI API key will be loaded from .env file or environment variable
MAX_AI_SUMMARY_CHARS = 2000 # Maximum characters for AI-generated summaries

To use the AI fallback functionality:
- Install the required libraries:
pip install openai python-dotenv
- Set your OpenAI API key in the .env file:
# In your .env file
OPENAI_API_KEY=your-api-key-here
- With FETCH_URLS and USE_OPENAI_FALLBACK enabled in scripts/config.py, the system will automatically:
  - Try to fetch raw content from URLs using enhanced web scraping
  - If raw fetching fails, use OpenAI to generate knowledge-based summaries
  - Focus on Northern Ireland education and history relevance
  - Provide summaries based on the AI's training data about the domain
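Before enabling the fallback, it may help to check that the key from the .env step is actually visible to Python. The sketch below is an assumption about how such a check could look, not the project's own loader: `load_openai_key` is a hypothetical helper, and it degrades to plain environment variables when python-dotenv is not installed.

```python
import os
from typing import Optional


def load_openai_key(env_var: str = "OPENAI_API_KEY") -> Optional[str]:
    """Return the API key from the environment (or a .env file), else None."""
    try:
        from dotenv import load_dotenv  # provided by the python-dotenv package

        # Loads variables from a .env file if one is found; by default,
        # variables already set in the environment are not overridden.
        load_dotenv()
    except ImportError:
        # python-dotenv is not installed: rely on the environment alone.
        pass
    key = os.environ.get(env_var, "").strip()
    return key or None


if __name__ == "__main__":
    status = "found" if load_openai_key() else "missing"
    print(f"OPENAI_API_KEY is {status}")
```

The script prints only whether the key was found, never the key itself, so it is safe to run in a shared terminal or notebook.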