A Thousand Hands: AI Agentic Video Generation System with Scene Consistency

An end-to-end application that transforms text scripts into complete video productions with synchronized audio, narration, and visual effects using Claude, Gemini, ElevenLabs, and LumaAI.

Features

  • Script analysis and scene breakdown
  • Physical environment generation
  • Video generation with custom durations (5-9s)
  • Sound effect and narration synthesis
  • Automated video stitching
  • LoRA training and frame generation
  • Multi-model support (Gemini/Claude)
  • User-friendly Gradio web interface
  • Initial frame customization options:
    • Upload local images as starting frames
    • Generate initial frames from Luma AI or FAL prompts
    • First frame generation for each scene
    • Choice between Luma AI and FAL for image generation
    • Seamless integration with the video generation pipeline
  • Support for custom starting frames with the LTX engine
  • Random script generation with customizable elements

⚠️ Important Note (February 11, 2025): The Luma AI Ray-2 video generation model currently does not support first and last frame keyframe generation. As a result:

  • Video generation is only available with the LTX video engine
  • Attempts to use Luma will result in errors
  • This limitation is temporary and will be resolved in an upcoming Luma API update
  • Please use the LTX video engine in the meantime

System Architecture

High-Level Flow

graph TD
    A[Movie Script] --> B[Scene Analysis]
    B --> C[Generate Metadata]
    C --> D[Generate Environment]
    D --> E[Generate Frames]
    E --> F[Generate Videos]
    F --> G[Generate Audio]
    G --> H[Final Video]
    
    B --> I[Gemini/Claude API]
    D --> J[LumaAI API]
    E --> K[FAL API]
    G --> L[ElevenLabs API]

Video Generation Process

graph TD
    A[Script Input] --> B[Scene Metadata Generation]
    B --> C[Environment Generation]
    C --> D[Frame Generation]
    D --> E[Video Generation]
    
    F[Sound Effects] --> H[Audio Mixing]
    G[Narration] --> H
    
    E --> I[Video Stitching]
    H --> I
    I --> J[Final Video]

LoRA-Enhanced Video Generation Process

The video_generation_with_lora.py script provides an enhanced version of the video generation process that uses LoRA (Low-Rank Adaptation) models for better scene consistency:

graph TD
    A[Movie Script] --> B[Scene Analysis]
    B --> C[Scene Metadata]
    B --> D[Environment Prompts]
    
    D --> E[Environment Image Generation]
    E --> F[LoRA Training Data Preparation]
    F --> G[Train Environment LoRAs]
    
    C --> H[Frame Generation with LoRAs]
    G --> H
    
    H --> I[First Frame]
    H --> J[Last Frame]
    
    I --> K[Video Generation]
    J --> K
    
    L[Sound Effects] --> N[Final Video Assembly]
    M[Narration] --> N
    K --> N
    
    subgraph LORA_TRAINING[LoRA Training Process]
        E
        F
        G
    end
    
    subgraph FRAME_GEN[Frame Generation]
        H
        I
        J
    end
    
    style LORA_TRAINING fill:#f9f,stroke:#333,stroke-width:2px
    style FRAME_GEN fill:#bbf,stroke:#333,stroke-width:2px

Key differences from the basic video generation:

  1. Environment LoRA Training (see the sketch after this list):

    • Generates training images for each environment
    • Trains custom LoRA models for consistent scene aesthetics
    • Creates environment-specific trigger words
  2. Frame Generation:

    • Uses trained LoRAs to generate consistent first/last frames
    • Maintains visual style across scene transitions
    • Better environment and character consistency
  3. Video Generation:

    • Uses LoRA-generated frames as keyframes
    • Ensures smooth transitions between scenes
    • Maintains consistent visual style throughout
  4. Reusability:

    • Trained LoRAs can be saved and reused
    • Supports pre-trained LoRA directory input
    • Enables consistent style across multiple videos
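
As a rough illustration of step 1 above, the sketch below shows how an environment LoRA could be trained through the fal_client library. The endpoint id, argument names, and trigger-word convention are assumptions for illustration, not the repository's actual code; see video_generation_with_lora.py and the FAL documentation for the real interface.

# Illustrative sketch only: the endpoint id and argument names are assumptions,
# not necessarily what video_generation_with_lora.py does.
import fal_client

def train_environment_lora(images_zip_url: str, environment_name: str) -> dict:
    """Train a LoRA for one physical environment and return the result payload."""
    trigger_word = f"env_{environment_name}"  # assumed trigger-word convention
    result = fal_client.subscribe(
        "fal-ai/flux-lora-fast-training",  # assumed training endpoint
        arguments={
            "images_data_url": images_zip_url,  # zip of generated training images
            "trigger_word": trigger_word,
        },
    )
    # The payload is expected to contain a URL to the trained LoRA weights.
    return {"trigger_word": trigger_word, "training_result": result}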

Directory Structure

The system creates the following directory structure during execution:

generated_videos/
└── videos_[TIMESTAMP]/
    ├── scene_metadata_[TIMESTAMP].json
    ├── scene_physical_environment_[TIMESTAMP].json
    ├── narration_text_[TIMESTAMP].txt
    ├── narration_audio_[TIMESTAMP].mp3
    ├── narration_audio_adjusted_[TIMESTAMP].mp3
    ├── final_video_[TIMESTAMP].mp4
    ├── lora_training_data/
    │   └── environment_[N]/
    │       └── [training images]
    ├── scene_frames/
    │   └── scene_[N]/
    │       ├── first_frame.jpg
    │       └── last_frame.jpg
    └── scene_[N]_all_vid_[TIMESTAMP]/
        ├── scene_[N]_[TIMESTAMP].mp4
        └── scene_[N]_sound.mp3

Prompt Flow: see the prompt-flow screenshot in the repository ("Screenshot 2025-02-09 at 11 03 36 PM").

Prompt Structure and Video Generation Flow

The system uses a sophisticated prompting structure to ensure consistency across generated video segments:

Environment Description Prompts

  • Each scene has a single, comprehensive environment description prompt that spans all video segments
  • This "anchor prompt" ensures physical environment consistency across the entire scene
  • Contains detailed descriptions of:
    • Physical layout and architecture
    • Lighting conditions
    • Atmospheric elements
    • Key environmental features
    • Time of day and weather conditions
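
As a purely hypothetical example (the wording below is invented here, not taken from the repository's prompt templates), an anchor prompt might look like this:

# Hypothetical anchor environment prompt; illustrative wording only.
ENVIRONMENT_ANCHOR_PROMPT = """
A narrow cobblestone alley in an old coastal town at dusk. Two-story stone
buildings with wooden shutters line both sides, strings of warm lanterns hang
overhead, and light fog drifts in from the harbor. Soft golden-hour light from
the west, wet pavement reflecting the lanterns, a calm and slightly melancholic
atmosphere with a light breeze and no rain.
"""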

Scene Metadata Prompts

  • Multiple metadata prompts per scene guide the specific actions and movements
  • Each prompt corresponds to a video segment within the scene
  • Contains:
    • Character positions and movements
    • Camera angles and transitions
    • Specific actions and events
    • Temporal flow markers
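
A hypothetical metadata entry for a single segment could look like the following; the field names are assumptions for illustration, and the real schema is whatever the scripts write to scene_metadata_[TIMESTAMP].json:

# Hypothetical segment metadata entry; field names are illustrative assumptions.
segment_metadata = {
    "scene_number": 1,
    "segment_index": 2,
    "duration_seconds": 7,  # each segment is 5-9 seconds
    "character_action": "The protagonist turns from the window and walks to the desk",
    "camera": "slow dolly-in from a medium shot to a close-up",
    "temporal_marker": "moments after the phone call ends",
}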

Video Segmentation

Due to technical limitations of current video generation models (including LumaAI):

  • Each scene is broken into multiple shorter video segments (5-9 seconds each)
  • Videos are generated with overlapping elements for smooth transitions
  • The consistent environment prompt ensures visual continuity
  • Multiple videos are stitched together to form complete scenes
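
A minimal sketch of the segmentation idea (not the repository's actual logic) could look like this:

from typing import List

def split_scene_duration(total_seconds: int, min_len: int = 5, max_len: int = 9) -> List[int]:
    """Split a scene's duration into segment lengths of min_len to max_len seconds.

    Rough sketch only; the real segmentation logic lives in the generation scripts.
    """
    segments = []
    remaining = total_seconds
    while remaining > 0:
        if remaining <= max_len:
            segments.append(max(remaining, min_len))  # pad a very short tail up to min_len
            break
        segments.append(max_len)
        remaining -= max_len
    return segments

# An 18-second scene -> [9, 9]; a 14-second scene -> [9, 5]
print(split_scene_duration(18), split_scene_duration(14))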

Sound Effects

  • Each video segment has corresponding sound effects
  • Effects are synchronized with specific actions
  • Mixed with narration track during final assembly

Final Assembly

  • Individual video segments are combined using the overlapping sections
  • Sound effects and narration are synchronized
  • Environment consistency across segments creates seamless longer scenes
  • Multiple scenes are combined into the final video production (a stitching sketch follows the diagram below)

graph TD
    A[Environment Description] --> B[Scene 1 Videos]
    A --> C[Scene 1 Metadata]
    C --> B
    B --> D[Video Segment 1]
    B --> E[Video Segment 2]
    B --> F[Video Segment 3]
    D --> G[Scene Assembly]
    E --> G
    F --> G
    H[Sound Effects] --> G
    G --> I[Final Scene]
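
The stitching sketch referenced above, assuming a tool like moviepy is available; the file names are hypothetical and the repository's actual assembly code may differ:

# Minimal stitching sketch; assumes moviepy, which may differ from the project's tooling.
from moviepy.editor import (
    AudioFileClip, CompositeAudioClip, VideoFileClip, concatenate_videoclips,
)

segment_paths = ["scene_1_part_1.mp4", "scene_1_part_2.mp4"]  # hypothetical paths
clips = [VideoFileClip(path) for path in segment_paths]
scene = concatenate_videoclips(clips, method="compose")

narration = AudioFileClip("narration_audio_adjusted.mp3")  # hypothetical path
sound_fx = AudioFileClip("scene_1_sound.mp3")              # hypothetical path
scene = scene.set_audio(CompositeAudioClip([narration, sound_fx]))

scene.write_videofile("final_video.mp4", codec="libx264", audio_codec="aac")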

Where:

  • [TIMESTAMP]: Format YYYYMMDD_HHMMSS
  • [N]: Scene number
  • Each scene can have multiple video segments based on duration
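
For reference, the run directory and timestamp could be produced along these lines (a sketch, not necessarily the scripts' exact code):

import os
from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # YYYYMMDD_HHMMSS
run_dir = os.path.join("generated_videos", f"videos_{timestamp}")
os.makedirs(os.path.join(run_dir, "lora_training_data"), exist_ok=True)
os.makedirs(os.path.join(run_dir, "scene_frames"), exist_ok=True)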

Installation

Prerequisites

  1. Python 3.8 or higher
  2. Required Python packages:
pip install -r requirements.txt

API Keys Required

The following API keys and credentials are required to use the system:

  1. Gemini API Key - For text generation and scene analysis

  2. ElevenLabs API Key - For voice synthesis and sound effects

    • Get from: ElevenLabs
    • Used for: Narration and sound effects generation
  3. LumaAI API Key - For video generation

    • Get from: LumaAI
    • Used for: Video generation and scene rendering
  4. Anthropic API Key - For Claude model access

    • Get from: Anthropic
    • Used for: Alternative scene analysis model
  5. FAL API Key - For LoRA training and inference

    • Get from: FAL.AI
    • Used for: Training custom models for scene consistency
  6. Google Cloud Storage:

    • Create from: Google Cloud Console
    • Required:
      • Bucket Name
      • Service Account Credentials JSON file
    • Used for: Storing and managing generated assets

Setting up Google Cloud Credentials

  1. Go to the Google Cloud Console
  2. Create a new project or select an existing one
  3. Enable the Cloud Storage API
  4. Go to IAM & Admin > Service Accounts
  5. Create a new service account or select an existing one
  6. Create a new key (JSON type)
  7. Download the JSON file - you'll need this for the application
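
With the downloaded JSON key, the application can reach Cloud Storage roughly as follows (a minimal sketch using the google-cloud-storage client; the bucket and file names are placeholders):

from google.cloud import storage

# Placeholder names; use the values from your own .env and credentials file.
client = storage.Client.from_service_account_json("your-credentials-file.json")
bucket = client.bucket("your-bucket-name")

# Upload a generated asset so it can be stored and referenced later.
blob = bucket.blob("generated_videos/final_video.mp4")
blob.upload_from_filename("generated_videos/final_video.mp4")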

Usage

Using the Gradio Web Interface

  1. Start the Gradio app:
python video_generation_app.py
  2. Open your browser and navigate to http://localhost:7860

  3. In the "API Keys Setup" tab:

    • Enter all required API keys
    • Upload your Google Cloud Service Account credentials JSON file
    • Enter your GCP bucket name
    • Click "Save API Keys" (this will save the keys to the .env file)
    • Saving API keys is not necessary to use the video generation app
    • If you do not have the API keys, you can enter "none" and the app will choose the default keys
  4. In the "Video Generation" tab:

    • Enter your movie script or use the random script generation feature:
      • Check "Generate Random Script" to use a randomly generated script
      • Click "Preview Random Script" to see what will be generated before creating a video
      • The random script will use the selected model (Gemini or Claude)
    • Choose the model (Gemini or Claude)
    • Select video engine (Luma or LTX)
    • Optional settings:
      • Skip narration generation
      • Skip sound effects generation
      • Generate metadata only
      • Customize maximum scenes and environments
    • Click "Generate Video"

Different Video Generation Scripts

The project includes two main video generation scripts:

  1. video_generation.py (Used by Gradio App)

    • Basic video generation without LoRA training
    • Suitable for simpler video generation needs
    • Used by the Gradio web interface
    • Does not require FAL API key
    • Faster generation but less consistent scene transitions
  2. video_generation_with_lora.py (Advanced CLI Version)

    • Advanced version with LoRA training capabilities
    • Better scene consistency through custom model training
    • Must be run from command line
    • Requires FAL API key for LoRA training
    • Command line usage:
    # Generate video with LoRA training
    python video_generation_with_lora.py --model gemini
    
    # Use pre-trained LoRAs
    python video_generation_with_lora.py --model gemini --trained_lora_dir /path/to/lora/dir
    
    # Generate only narration
    python video_generation_with_lora.py --narration_only
    
    # Skip narration generation
    python video_generation_with_lora.py --skip_narration

Note: If you don't need LoRA-based scene consistency, use the Gradio app with video_generation.py. The FAL API key is only required for video_generation_with_lora.py when using LoRA training.

Using Command Line

Alternatively, you can use the command line interface for basic video generation:

# Generate scene metadata only, using the default script file
python video_generation.py --model gemini --metadata_only

# Use FAL for image generation (default)
python video_generation.py --model gemini --first_frame_image_gen

# Use Luma AI for image generation
python video_generation.py --model gemini --first_frame_image_gen --image_gen_model luma

# Generate initial image with FAL
python video_generation.py --model gemini --initial_image_prompt "your prompt" --image_gen_model fal

# Generate initial image with Luma AI
python video_generation.py --model gemini --initial_image_prompt "your prompt" --image_gen_model luma

Command Line Arguments

  • --model: Choose between 'gemini' and 'claude' for scene analysis (default: gemini)
  • --video_engine: Choose between 'luma' and 'ltx' for video generation (default: luma)
  • --image_gen_model: Choose between 'luma' and 'fal' for image generation (default: fal)
  • --metadata_only: Generate only scene metadata without video
  • --script_file: Path to your movie script file
  • --random_script: Generate a random script instead of using a script file
  • --skip_narration: Skip generating narration audio
  • --skip_sound_effects: Skip generating sound effects
  • --max_scenes: Maximum number of scenes to generate (default: 5)
  • --max_environments: Maximum number of unique environments to use (default: 3)
  • --first_frame_image_gen: Generate first frame images for each scene
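
For reference, a parser with the flags above could be sketched as follows (an illustration of the documented interface, not a copy of the script's actual parser):

import argparse

# Sketch of the CLI surface documented above; defaults mirror the documented ones.
parser = argparse.ArgumentParser(description="Agentic video generation")
parser.add_argument("--model", choices=["gemini", "claude"], default="gemini")
parser.add_argument("--video_engine", choices=["luma", "ltx"], default="luma")
parser.add_argument("--image_gen_model", choices=["luma", "fal"], default="fal")
parser.add_argument("--metadata_only", action="store_true")
parser.add_argument("--script_file")
parser.add_argument("--random_script", action="store_true")
parser.add_argument("--skip_narration", action="store_true")
parser.add_argument("--skip_sound_effects", action="store_true")
parser.add_argument("--max_scenes", type=int, default=5)
parser.add_argument("--max_environments", type=int, default=3)
parser.add_argument("--first_frame_image_gen", action="store_true")
args = parser.parse_args()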

For random script generation:

python random_script_generator.py --help
  • --model: Choose between 'gemini' and 'claude' for script generation (default: gemini)
  • --output: Output file for the generated script (default: random_script.txt)
  • --video_gen: Generate a video using the random script
  • --video_engine: Choose between 'luma' and 'ltx' for video generation (default: luma)
  • --max_scenes: Maximum number of scenes to generate (default: 5)
  • --max_environments: Maximum number of environments to use (default: 3)
  • --skip_narration: Skip narration generation
  • --skip_sound_effects: Skip sound effects generation

Generation Process

  1. Script Analysis

    • Input script is analyzed by Gemini/Claude
    • Determines optimal number of scenes
    • Generates scene metadata including:
      • Physical environment descriptions
      • Movement descriptions
      • Camera instructions
      • Emotional atmosphere
  2. Environment Generation

    • Creates detailed environment descriptions
    • Generates training data for LoRA models
    • Trains custom models for scene consistency
  3. Frame Generation

    • Generates key frames for each scene
    • Uses trained LoRA models
    • Ensures visual consistency
  4. Video Generation

    • Generates video segments using LumaAI
    • Handles scene transitions
    • Manages video duration constraints
  5. Audio Generation

    • Creates narration using ElevenLabs
    • Generates scene-specific sound effects
    • Adjusts audio timing to match video (see the sketch after this list)
  6. Final Assembly

    • Stitches video segments together
    • Synchronizes audio and video
    • Produces final output video
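
One common way to perform the timing adjustment in step 5 is ffmpeg's atempo filter, sketched below; this is an assumption about the approach, not necessarily what the scripts actually do:

import subprocess

def adjust_narration_speed(in_path: str, out_path: str, audio_len: float, video_len: float) -> None:
    """Stretch or compress narration so its length matches the video.

    Sketch of one possible approach (ffmpeg's atempo filter); the repository's
    implementation may differ.
    """
    tempo = audio_len / video_len      # >1 speeds the audio up, <1 slows it down
    tempo = min(max(tempo, 0.5), 2.0)  # a single atempo pass accepts factors of 0.5-2.0
    subprocess.run(
        ["ffmpeg", "-y", "-i", in_path, "-filter:a", f"atempo={tempo}", out_path],
        check=True,
    )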

Security Note

  • Never commit the .env file or credentials to version control
  • Keep your API keys and credentials secure
  • Add both .env and your GCP credentials JSON file to .gitignore
  • Store your credentials securely and never share them

Getting Started

  1. Clone the repository
  2. Install dependencies: pip install -r requirements.txt
  3. Set up your Google Cloud project and get your credentials JSON file
  4. Start the Gradio app: python video_generation_app.py
  5. Enter your API keys and upload your GCP credentials in the web interface
  6. Start generating videos!

Troubleshooting

Common Issues

  1. API Key Errors

    • Verify all API keys are correctly entered
    • Check for any spaces or special characters
    • Ensure keys have necessary permissions
  2. Storage Issues

    • Ensure GCP bucket exists and is accessible
    • Verify service account has proper permissions
    • Check available storage space
  3. Generation Failures

    • Check API quotas and limits
    • Verify script length and complexity
    • Monitor system resources

Error Messages

  • "GCP credentials file not found": Upload credentials JSON in API Keys Setup
  • "Invalid scene duration": Scene duration must be 5, 9, 14, or 18 seconds
  • "Generation failed": Check API quotas and error details

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

TODO and Future Updates

Improved Scene Consistency

We plan to implement several advanced techniques to significantly improve scene and character consistency across video generations:

  1. Advanced Environment Frame Generation

    • Implement Gaussian Splatting or specialized video generation for physical environments
    • Generate high-quality first and last frames for each environment type
    • Support various camera movements:
      • Panning (left/right)
      • Tilting (up/down)
      • Zooming (in/out)
      • Dolly (forward/backward)
    • Create a dictionary of pre-generated environment transitions
    • Allow users to customize and select preferred environment transitions
  2. Character Generation and Integration

    • Separate character generation from environment generation
    • Generate characters independently with specific movement prompts
    • Maintain consistent character appearance across scenes
    • Support multiple characters with distinct characteristics
    • Enable complex character interactions
  3. Foreground-Background Separation

    • Integrate Segment Anything Model (SAM) by Meta
    • Extract and separate foreground (characters) from background (environment)
    • Enable precise character placement in environments
    • Improve character-environment interaction
    • Support dynamic lighting and shadow adjustments
  4. Enhanced Scene Composition

    • Match characters with appropriate environment frames
    • Align character movements with camera transitions
    • Ensure lighting consistency between characters and environments
    • Maintain spatial relationships across scene transitions
  5. LoRA Training Improvements

    • Train specialized LoRAs for each physical environment type
    • Develop character-specific LoRAs for consistent appearance
    • Create transition-specific LoRAs for smooth scene changes
    • Enable fine-tuning of existing LoRAs for custom requirements

Implementation Flow

graph TD
    A[Environment Generation] --> B[Frame Extraction]
    B --> C[Camera Movement Library]
    D[Character Generation] --> E[SAM Processing]
    E --> F[Character Library]
    C --> G[Scene Composition]
    F --> G
    G --> H[LoRA Training]
    H --> I[Final Video Generation]

Technical Components

  1. Environment Frame Database

    environment_frames = {
        'forest': {
            'pan_left': {'first_frame': 'url1', 'last_frame': 'url2'},
            'pan_right': {'first_frame': 'url3', 'last_frame': 'url4'},
            'zoom_in': {'first_frame': 'url5', 'last_frame': 'url6'},
            # ... more camera movements
        },
        'city_street': {
            # ... similar structure for each environment
        }
    }
  2. Character Generation Control

    character_config = {
        'character_id': 'protagonist',
        'appearance': {
            'gender': 'female',
            'age': '30s',
            'clothing': 'business_suit',
            'distinct_features': ['red_hair', 'tall']
        },
        'movements': ['walking', 'running', 'sitting'],
        'emotions': ['happy', 'serious', 'concerned']
    }
  3. Scene Composition Rules

    scene_rules = {
        'lighting_match': True,
        'perspective_match': True,
        'scale_match': True,
        'shadow_generation': True,
        'depth_consistency': True
    }

Benefits

  • More consistent and high-quality video generation
  • Better control over scene transitions
  • Improved character consistency across scenes
  • More realistic character-environment integration
  • Smoother camera movements
  • Enhanced storytelling capabilities

Getting Started with Development

If you'd like to contribute to these improvements:

  1. Fork the repository
  2. Choose a feature from the TODO list
  3. Create a feature branch
  4. Implement and test your changes
  5. Submit a pull request

We welcome contributions and suggestions for additional improvements!

Environment Setup

This project requires several environment variables to be set up. Create a .env file in the root directory with the following variables:

# Google Cloud Storage Configuration
BUCKET_NAME=your-bucket-name
CREDENTIALS_FILE=your-credentials-file.json

# API Keys and Configuration
LUMA_API_TOKEN=your_luma_token
GEMINI_API_KEY=your_gemini_api_key
FAL_API_KEY=your_fal_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
ELEVEN_LABS_API_KEY=your_eleven_labs_api_key

Note: For image generation, you'll need either:

  • LUMA_API_TOKEN if using Luma AI for image generation
  • FAL_API_KEY if using FAL for image generation (default)

Make sure to replace the your_* placeholders with your actual API keys and credentials.
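
If the scripts load these values with python-dotenv (a common pattern), startup looks roughly like this; the variable names match the .env keys above:

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the project root

BUCKET_NAME = os.getenv("BUCKET_NAME")
CREDENTIALS_FILE = os.getenv("CREDENTIALS_FILE")
LUMA_API_TOKEN = os.getenv("LUMA_API_TOKEN")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
FAL_API_KEY = os.getenv("FAL_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
ELEVEN_LABS_API_KEY = os.getenv("ELEVEN_LABS_API_KEY")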

Security Note

  • Never commit the .env file to version control
  • Keep your API keys and credentials secure
  • The .env file should be added to your .gitignore

Getting Started

  1. Copy the .env.example file to .env
  2. Fill in your actual API keys and credentials
  3. Ensure you have the Google Cloud credentials JSON file in your project root
