daria425/simulated-streaming


Simulated Streaming Video Understanding

My mini weekend project: a proof of concept (POC) of a system that simulates real-time video understanding using models that don't natively support video input or streaming.

What It Does

Instead of analyzing a video all at once, this system processes it frame-by-frame and emits events only when something changes — mimicking live video understanding.

How It Works

  1. Sample frames from the video (e.g., every 10th frame)
  2. Describe each frame with a vision-to-text model
  3. Extract structured state (subject, attributes, motion) via a tool call
  4. Diff against the previous state deterministically
  5. Stream narration of changes only (no repeated descriptions)
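The steps above can be sketched as a single loop. This is a minimal illustration, not the repository's actual code: `describe_frame`, `extract_state`, and `narrate_change` are stand-ins for the three LLM calls, stubbed out here since the real calls depend on the model provider.

```python
def run_pipeline(frames, describe_frame, extract_state, narrate_change, frame_step=10):
    """Simulate streaming: sample frames, track state, narrate only changes."""
    prev_state = None
    for frame in frames[::frame_step]:          # 1. sample every Nth frame
        description = describe_frame(frame)     # 2. vision-to-text call
        state = extract_state(description)      # 3. structured state via tool call
        if prev_state is not None and state != prev_state:  # 4. deterministic diff
            yield narrate_change(prev_state, state)         # 5. narrate the change
        prev_state = state
```

Because the generator only yields when the extracted state differs from the previous one, unchanged frames produce no output, which is exactly the "silence means nothing changed" behavior below.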

Example Output

The LEGO block changes color from red to pink.

(Silence means nothing changed)

Architecture

The system uses three separate LLM calls with distinct responsibilities:

| Component  | Input       | Output                | Purpose                               |
| ---------- | ----------- | --------------------- | ------------------------------------- |
| Perception | Frame image | Descriptive text      | Vision-to-text, no temporal reasoning |
| Extractor  | Description | Structured JSON state | Canonical labels (temperature=0)      |
| Narrator   | State diff  | Streamed sentence     | Natural-language change description   |

A deterministic processor (not an LLM) handles:

  • State persistence across frames
  • Diffing logic (ignores noise like phrasing variations)
  • Event emission decisions
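The diffing step itself needs no model at all. A minimal sketch of what such a deterministic comparison could look like (the function name and tuple shape are illustrative, not taken from the repository):

```python
def diff_states(prev, curr):
    """Return (field, old, new) tuples for fields that changed.

    Purely deterministic: identical values (and fields missing from
    either state) produce no events, so unchanged frames stay silent.
    """
    changes = []
    for field in sorted(set(prev) & set(curr)):
        if prev[field] != curr[field]:
            changes.append((field, prev[field], curr[field]))
    return changes
```

An empty return value is the "emit nothing" decision; only a non-empty diff reaches the Narrator.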

State Model

{
  "subject": "lego_block",
  "attributes": {
    "color": "red"
  },
  "state": {
    "motion": "stationary"
  }
}

Only the relevant fields (color, motion) participate in diffing; jitter in shape, orientation, and size is ignored.
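Restricting the diff to those fields could be done by flattening the nested state and filtering, roughly like this (`RELEVANT_FIELDS` mirrors the color/motion choice above; the helper name is hypothetical):

```python
RELEVANT_FIELDS = {"color", "motion"}

def diffable_view(state):
    """Flatten attributes/state and keep only fields that may trigger events."""
    flat = {**state.get("attributes", {}), **state.get("state", {})}
    return {k: v for k, v in flat.items() if k in RELEVANT_FIELDS}
```

Comparing `diffable_view(prev)` with `diffable_view(curr)` then ignores noisy fields like shape or size by construction.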

Local Setup

Prerequisites

  • Python 3
  • A Mistral API key (used as MISTRAL_API_KEY in the .env step below)

Installation

  1. Clone the repository and navigate to the project directory

  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

    Note: This includes PyTorch, transformers, and OpenCV. Installation may take a few minutes.

  4. Create a .env file in the project root:

    MISTRAL_API_KEY=your_api_key_here
  5. Add video files to the input/ directory:

    mkdir -p input
    # Place your .mp4 files in input/

Running

Process a video and get streaming narration:

python process.py

By default, this processes input/sample2.mp4 with a frame step of 10. After processing, you'll enter an interactive Q&A mode where you can ask questions about what happened in the video.

To process a different video or adjust the frame sampling, modify the __main__ block in process.py:

video_context = get_stream("./input/your_video.mp4", frame_step=10)
