We present VideoRAC, an adaptive chunking methodology for lecture videos within Retrieval-Augmented Generation (RAG) pipelines. Using CLIP embeddings and SSIM to detect coherent slide transitions, plus entropy-based keyframe selection, we construct multimodal chunks that align audio transcripts with visual frames.
Alongside the method, we release EduViQA, a slide-centric, bilingual (Persian/English) lecture dataset containing 20 videos from 5 professors across STEM and education topics. Each lecture is paired with 50 synthetic QA items, and the collection is stratified by duration (40% mid-length, ~20–40 minutes) to support controlled RAG benchmarking.
This repository is the official implementation of the CSICC 2025 paper by Hemmat et al.
Hemmat, A., Vadaei, K., Shirian, M., Heydari, M. H., Fatemi, A. "Adaptive Chunking for VideoRAG Pipelines with a Newly Gathered Bilingual Educational Dataset." Proceedings of the 2025 29th International Computer Conference, Computer Society of Iran (CSICC), University of Isfahan.
This framework underpins the EduViQA bilingual dataset, designed for evaluating lecture-based RAG systems in both Persian and English. The dataset and code form a unified ecosystem for multimodal question generation and retrieval evaluation.
Key Contributions:
- 🎥 Adaptive Hybrid Chunking – Combines CLIP cosine similarity with SSIM-based visual comparison (see the sketch after this list).
- 🧮 Entropy-Based Keyframe Selection – Extracts high-information frames for retrieval.
- 🗣️ Transcript–Frame Alignment – Synchronizes ASR transcripts with visual semantics.
- 🔍 Multimodal Retrieval – Integrates visual and textual embeddings for RAG.
- 🧠 Benchmark Dataset – 20 bilingual educational videos with 50 QA pairs each.
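To make the first two contributions concrete, here is a minimal sketch of the two scoring signals. This is illustrative only, not the packaged `HybridChunker`: the CLIP model id and the `alpha` blend mirror the quickstart defaults below, `transformers`, `scikit-image`, and `Pillow` are assumed installed, and the blended-similarity boundary rule stands in for the exact scoring used in the paper.

```python
import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(frame_a: Image.Image, frame_b: Image.Image) -> float:
    """Cosine similarity between CLIP image embeddings of two frames."""
    inputs = processor(images=[frame_a, frame_b], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])

def ssim_similarity(frame_a: Image.Image, frame_b: Image.Image) -> float:
    """Structural similarity on grayscale versions of the frames."""
    a = np.asarray(frame_a.convert("L"))
    b = np.asarray(frame_b.convert("L"))
    return structural_similarity(a, b)

def is_boundary(frame_a, frame_b, alpha=0.6, threshold=0.85) -> bool:
    """Blend the two signals; low blended similarity marks a chunk boundary."""
    score = (alpha * clip_similarity(frame_a, frame_b)
             + (1 - alpha) * ssim_similarity(frame_a, frame_b))
    return score < threshold

def frame_entropy(frame: Image.Image) -> float:
    """Shannon entropy of the grayscale histogram; higher = more information."""
    hist, _ = np.histogram(np.asarray(frame.convert("L")), bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Within each detected chunk, the highest-entropy frame serves as its keyframe:
# keyframe = max(chunk_frames, key=frame_entropy)
```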
*Dataset composition highlighting topic distribution and lecture duration proportions.*
| Metric | Value |
|---|---|
| Total Videos | 20 (10 Persian, 10 English) |
| Professors | 5 |
| Duration Focus | 40% mid-length (20–40 minutes) |
| QA Pairs per Video | 50 synthetic QA pairs |
| Format | JSON annotations |
Topics covered:
- Computer Architecture
- Data Structures
- System Dynamics and Control
- Teaching Skills
- Descriptive Research
- Regions in Human Geography
- Differentiated Instruction
- Business
The dataset also captures slide transitions and keyframes extracted via CLIP+SSIM chunking, enabling multimodal retrieval experiments with aligned visuals and transcripts.
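For orientation, a single QA annotation might look roughly like the following. The field names here are hypothetical placeholders, not the published schema; consult the dataset card for the actual layout.

```python
# Hypothetical shape of one QA annotation (field names are placeholders,
# not the published schema; see the Hugging Face card for the real one).
qa_item = {
    "video_id": "lecture_03",
    "language": "fa",              # "fa" (Persian) or "en" (English)
    "question": "...",
    "answer": "...",
    "chunk_span": [512.0, 587.5],  # assumed start/end time in seconds
}
```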
📥 Access Dataset: Hugging Face – EduViQA
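If the dataset is published as a standard Hub dataset, it should load with the `datasets` library. The repo id below is a placeholder, not the real path; copy the exact id from the Hugging Face card.

```python
from datasets import load_dataset

# "<org>/EduViQA" is a placeholder; use the exact repo id from the Hub card.
eduviqa = load_dataset("<org>/EduViQA")
print(eduviqa)
```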
Install from PyPI:

```bash
pip install VideoRAC
```

Chunk a lecture video:

```python
from VideoRAC.Modules import HybridChunker

chunker = HybridChunker(
    clip_model='openai/clip-vit-base-patch32',
    alpha=0.6,                 # blend weight between CLIP and SSIM similarity
    threshold_embedding=0.85,  # CLIP cosine-similarity threshold
    threshold_ssim=0.8,        # SSIM threshold
    interval=1,                # frame sampling interval
)

chunks, timestamps, duration = chunker.chunk("lecture.mp4")
chunker.evaluate()
```

Generate QA pairs with your own LLM callable:

```python
from VideoRAC.Modules import VideoQAGenerator
def my_llm_fn(messages):
    # Any callable that takes chat messages and returns a string works here.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

urls = ["https://www.youtube.com/watch?v=2uYu8nMR5O4"]
qa = VideoQAGenerator(video_urls=urls, llm_fn=my_llm_fn)
qa.process_videos()
```

| Method | AR | CR | F | Notes |
|---|---|---|---|---|
| VideoRAC (CLIP+SSIM) | 0.87 | 0.82 | 0.91 | Best performance overall |
| CLIP-only | 0.80 | 0.75 | 0.83 | Weaker temporal segmentation |
| Simple Slicing | 0.72 | 0.67 | 0.76 | Time-based only |
Evaluated using RAGAS metrics: Answer Relevance (AR), Context Relevance (CR), and Faithfulness (F).
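For reference, a minimal RAGAS evaluation loop looks roughly like this. Metric names track `ragas` versions (`context_relevancy` was renamed in newer releases), so adjust the imports to your installed version; the sample row is illustrative only.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_relevancy, faithfulness

# One row per generated answer; "contexts" holds the retrieved chunk texts.
data = Dataset.from_dict({
    "question": ["What does SSIM compare between consecutive frames?"],
    "answer": ["The structural similarity of the slide content."],
    "contexts": [["SSIM measures structural similarity between two images..."]],
})

scores = evaluate(data, metrics=[answer_relevancy, context_relevancy, faithfulness])
print(scores)
```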
Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
You may share and adapt this work with attribution. Please cite our paper when using VideoRAC or EduViQA:
```bibtex
@INPROCEEDINGS{10967455,
  author={Hemmat, Arshia and Vadaei, Kianoosh and Shirian, Melika and Heydari, Mohammad Hassan and Fatemi, Afsaneh},
  booktitle={2025 29th International Computer Conference, Computer Society of Iran (CSICC)},
  title={Adaptive Chunking for VideoRAG Pipelines with a Newly Gathered Bilingual Educational Dataset},
  year={2025},
  pages={1-7},
  doi={10.1109/CSICC65765.2025.10967455}
}
```

University of Isfahan – Department of Computer Engineering
- Kianoosh Vadaei – kia.vadaei@gmail.com
- Melika Shirian – mel.shirian@gmail.com
- Arshia Hemmat – amirarshia.hemmat@kellogg.ox.ac.uk
- Mohammad Hassan Heydari – heidary0081@gmail.com
- Afsaneh Fatemi – a.fatemi@eng.ui.ac.ir
⭐ Official CSICC 2025 Implementation – Give it a star if you use it in your research! ⭐ Made with ❤️ at University of Isfahan

