Switch audio transcription from openai-whisper to HuggingFace transformers#535

Open
Copilot wants to merge 9 commits into main from copilot/switch-to-huggingface-whisper

Conversation

Contributor

Copilot AI commented Nov 17, 2025

Migration from openai-whisper to Hugging Face transformers

Switched from the prepackaged openai-whisper pip package to the Hugging Face transformers library with the whisper-large-v3 model, addressing issues where long silences would break or hang transcription.

✅ Changes Completed:

  • Create a Python wrapper script for Hugging Face Whisper transcription (whisper_hf.py)
  • Update download_model.py to download the Hugging Face model
  • Update Dockerfile to install transformers and dependencies instead of openai-whisper
  • Update Rust code to call the new Python script
  • Verify code compiles and passes linting
  • Improve error handling and device detection
  • Add comprehensive migration guide documentation
  • Add optional runtime model download for AWS Batch (via EFS/S3)
  • Implement initial_prompt support using transformers prompt_ids
  • Update CDK to provision EFS and set HF_HOME for audio transcriber
  • Add EFS troubleshooting and model reset documentation
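
For reference, the wrapper's command-line surface can be sketched from the whisper_hf.py excerpts quoted in the review comments below (a reconstruction, not the exact script):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # CLI surface of the whisper_hf.py wrapper, reconstructed from the
    # argument definitions quoted in the review; a sketch, not the real file.
    parser = argparse.ArgumentParser(
        description="Transcribe audio using Hugging Face Whisper"
    )
    parser.add_argument("audio_file", help="Audio file to transcribe (use '-' for stdin)")
    parser.add_argument("--model", required=True, help="Model name (e.g., 'large-v3')")
    parser.add_argument("--model_dir", default="/model", help="Directory containing models")
    parser.add_argument("--output_format", default="json", help="Output format")
    parser.add_argument("--output_dir", required=True, help="Output directory")
    parser.add_argument("--task", default="transcribe", help="Task to perform")
    return parser

# Example invocation mirroring how the Rust code would call the script.
args = build_parser().parse_args(
    ["episode.wav", "--model", "large-v3", "--output_dir", "/tmp/transcripts"]
)
```

Keeping the argument names identical to the old whisper CLI is what lets the Rust caller swap in the new script with minimal changes.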

Documentation Updates:

Added comprehensive documentation for EFS model caching including:

  • How EFS model caching works (file system, access point, volume mount, environment)
  • First run behavior and expected timing
  • Cached model structure
  • Troubleshooting guide for common issues:
    • Mount target not found
    • Job hanging during model download
    • Permission denied errors
    • Corrupted/incomplete models
  • Model reset options (delete specific model, clear all cache, force re-download)
  • EFS monitoring (storage size, CloudWatch metrics)
  • Comparison of deployment approaches
Original prompt

Can you switch from using the prepackaged openai whisper pip package to using the model from https://huggingface.co/openai/whisper-large-v3? This is suggested as something that doesn't have the same problems we're seeing now with long silences breaking/hanging the transcription.



Copilot AI changed the title [WIP] Switch to Hugging Face Whisper model for transcription Switch audio transcription from openai-whisper to HuggingFace transformers Nov 17, 2025
Copilot AI requested a review from saebyn November 17, 2025 05:23
@saebyn
Owner

saebyn commented Nov 28, 2025

@copilot 1. is there a better way to handle downloading the model rather than baking it into the docker image, given this image is used in an AWS Batch environment? 2. are there any good ways to provide the initial context to the model?

Contributor Author

Copilot AI commented Nov 28, 2025

@copilot 1. is there a better way to handle downloading the model rather than baking it into the docker image, given this image is used in an AWS Batch environment? 2. are there any good ways to provi...

Addressed in commit 9f98217:

1. Model downloading for AWS Batch:
Added DOWNLOAD_MODEL_AT_BUILD build arg (default: true). For AWS Batch with EFS:

# Build without baking model
docker buildx bake audio_transcriber --set audio_transcriber.args.DOWNLOAD_MODEL_AT_BUILD=false

# Mount EFS and set HF_HOME in job definition
{"environment": [{"name": "HF_HOME", "value": "/mnt/efs/models"}]}

Model downloads once on first run, then cached. See MIGRATION_GUIDE.md "AWS Batch Deployment Options" section for EFS/S3 details.
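
The build-time/runtime split boils down to one decision in the wrapper: where to look for (or download) the model. A minimal sketch, assuming HF_HOME is the only switch and /model is the baked-in default path:

```python
import os

def model_cache_dir(default: str = "/model") -> str:
    # In AWS Batch, HF_HOME points at the shared EFS mount set in the job
    # definition (e.g. /mnt/efs/models); when the model was baked into the
    # image at build time, fall back to the image's default path.
    return os.environ.get("HF_HOME", default)
```

Passing this as cache_dir to from_pretrained() makes the first run download into the mount and every later run reuse the cache.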

2. Initial prompt support:
Now implemented using processor.get_prompt_ids(initial_prompt) which converts the prompt text to token IDs that condition the model's generation. This helps with consistent terminology and style.
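
A minimal sketch of that prompt conditioning, factored as a helper (the processor is assumed to be a transformers WhisperProcessor or anything exposing get_prompt_ids; the device handling is an assumption for CUDA runs):

```python
def prompt_generate_kwargs(processor, initial_prompt, device="cpu"):
    """Build extra kwargs for model.generate() from an optional initial prompt.

    The prompt text is converted to token IDs that condition the decoder
    toward the prompt's terminology and style; an empty prompt adds nothing.
    """
    if not initial_prompt:
        return {}
    prompt_ids = processor.get_prompt_ids(initial_prompt, return_tensors="pt")
    # Keep the prompt tensor on the same device as the model so CUDA runs
    # don't mix CPU and GPU tensors.
    return {"prompt_ids": prompt_ids.to(device)}
```

Generation would then look like model.generate(input_features, **prompt_generate_kwargs(processor, prompt, device)).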

Copilot AI and others added 4 commits November 28, 2025 15:42
Co-authored-by: saebyn <185030+saebyn@users.noreply.github.com>
Co-authored-by: saebyn <185030+saebyn@users.noreply.github.com>
Co-authored-by: saebyn <185030+saebyn@users.noreply.github.com>
Co-authored-by: saebyn <185030+saebyn@users.noreply.github.com>
@saebyn saebyn force-pushed the copilot/switch-to-huggingface-whisper branch from 9f98217 to 4a50b46 Compare November 28, 2025 23:42
@saebyn
Owner

saebyn commented Nov 28, 2025

@copilot can we update the CDK code that deploys this for use in Batch to provision and provide HF_HOME?

Co-authored-by: saebyn <185030+saebyn@users.noreply.github.com>
Contributor Author

Copilot AI commented Nov 28, 2025

@copilot can we update the CDK code that deploys this for use in Batch to provision and provide HF_HOME?

Updated CDK in commit 9e2f41f:

  • Created EFS file system in BatchEnvironmentConstruct for model caching
  • Added EFS volume mount at /mnt/efs/models in audio transcriber job
  • Set HF_HOME=/mnt/efs/models environment variable
  • Configured access point with correct POSIX permissions (UID 10001)
  • Added security group rules for NFS traffic from Batch instances
  • EFS has RETAIN policy and 30-day lifecycle for infrequently accessed files

First job downloads the ~3GB model to EFS, subsequent jobs reuse the cache.

@saebyn
Owner

saebyn commented Nov 29, 2025

@copilot can you add documentation about how EFS is used, how to troubleshoot if there are issues with the job, and how to reset/redownload the model?

Co-authored-by: saebyn <185030+saebyn@users.noreply.github.com>
Contributor Author

Copilot AI commented Nov 29, 2025

@copilot can you add documentation about how EFS is used, how to troubleshoot if there are issues with the job, and how to reset/redownload the model?

Added comprehensive EFS documentation in commit 79e43fa to MIGRATION_GUIDE.md:

EFS Usage:

  • How model caching works (file system, access point, volume mount)
  • First run behavior and expected timing (~5-10 min initial download)
  • Cached model directory structure

Troubleshooting:

  • Mount target not found errors
  • Job hanging during model download
  • Permission denied errors (POSIX user mismatch)
  • Corrupted/incomplete model detection

Model Reset Options:

  1. Delete specific model: rm -rf /mnt/efs/models/models--openai--whisper-large-v3/
  2. Clear all cache: rm -rf /mnt/efs/models/*
  3. Using EFS access point with mount helper
  4. Force re-download via environment override
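
The corrupted/incomplete-model troubleshooting above can be automated with a quick completeness check before a job trusts the cache. The directory layout here is an assumption based on the paths quoted in the reset options, with config.json used as a marker file:

```python
import os

def cached_model_complete(cache_root, repo="models--openai--whisper-large-v3"):
    """Best-effort check that a cached model looks complete.

    Assumed layout: <cache_root>/<repo>/snapshots/<revision>/config.json.
    An interrupted download typically leaves a snapshot directory without
    its config.json, so its absence is treated as "incomplete".
    """
    snapshots = os.path.join(cache_root, repo, "snapshots")
    if not os.path.isdir(snapshots):
        return False
    return any(
        os.path.isfile(os.path.join(snapshots, rev, "config.json"))
        for rev in os.listdir(snapshots)
    )
```

A job could run this at startup and fall back to a forced re-download (reset option 4) when it returns False.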

Monitoring:

  • Check EFS storage size via AWS CLI
  • CloudWatch metrics to monitor (BurstCreditBalance, ClientConnections, I/O)

@saebyn saebyn marked this pull request as ready for review December 24, 2025 19:49
Copilot AI review requested due to automatic review settings December 24, 2025 19:49
Contributor

Copilot AI left a comment


Pull request overview

This PR migrates the audio transcription system from the openai-whisper pip package to HuggingFace's transformers library to address issues with long silences breaking or hanging transcription. The migration includes implementing EFS-based model caching for AWS Batch deployments to avoid baking the ~3GB model into Docker images.

Key Changes:

  • Replaced openai-whisper with HuggingFace transformers (whisper-large-v3 model)
  • Added EFS file system infrastructure for model caching in AWS Batch
  • Created Python wrapper script (whisper_hf.py) for CLI compatibility with existing Rust code
  • Added comprehensive migration guide with EFS troubleshooting documentation

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
cdk/lib/batch/environment.ts Added EFS file system, security group, and access point for model caching
cdk/lib/batch/audioTranscriberJob.ts Configured EFS volume mount and HF_HOME environment variable for batch jobs
cdk/lib/appStack.ts Passed EFS resources to audio transcriber job construct
audio_transcriber/whisper_hf.py New Python wrapper providing CLI-compatible interface to HuggingFace transformers
audio_transcriber/src/whisper.rs Updated command invocation to call Python script instead of whisper CLI
audio_transcriber/download_model.py Switched from openai-whisper to transformers for model downloading
audio_transcriber/Pipfile Added transformers and dependencies
audio_transcriber/Pipfile.lock Locked dependency versions for reproducible builds
Dockerfile Updated to install dependencies via pipenv and support conditional model download
audio_transcriber/MIGRATION_GUIDE.md Comprehensive documentation for migration, EFS setup, and troubleshooting


# Conditionally download model at build time
# If DOWNLOAD_MODEL_AT_BUILD=false, model will be downloaded at runtime
RUN if [ "$DOWNLOAD_MODEL_AT_BUILD" = "true" ]; then python3 /app/download_model.py; fi

Copilot AI Dec 24, 2025


The quoted test [ "$DOWNLOAD_MODEL_AT_BUILD" = "true" ] is already safe when the variable is unset, because the quotes make it compare an empty string; adding braces alone ([ "${DOWNLOAD_MODEL_AT_BUILD}" = "true" ]) behaves identically. To make the fallback explicit, use a default expansion instead.

Suggested change
RUN if [ "$DOWNLOAD_MODEL_AT_BUILD" = "true" ]; then python3 /app/download_model.py; fi
RUN if [ "${DOWNLOAD_MODEL_AT_BUILD:-false}" = "true" ]; then python3 /app/download_model.py; fi

# This conditions the model to generate text consistent with the prompt style
if initial_prompt:
    try:
        prompt_ids = processor.get_prompt_ids(initial_prompt, return_tensors="pt")

Copilot AI Dec 24, 2025


The prompt_ids should be moved to the correct device before being used in generation. When using CUDA, the prompt_ids tensor is created on CPU but needs to be moved to the GPU device. Add .to(device_str) after getting the prompt_ids to ensure they're on the correct device.

Suggested change
prompt_ids = processor.get_prompt_ids(initial_prompt, return_tensors="pt")
prompt_ids = processor.get_prompt_ids(initial_prompt, return_tensors="pt").to(device_str)

containerPath: '/mnt/efs/models',
fileSystem: props.modelCacheFileSystem,
accessPointId: props.modelCacheAccessPoint.accessPointId,
rootDirectory: '/',

Copilot AI Dec 24, 2025


The rootDirectory parameter is set to / which means the volume mount starts at the EFS root. However, the access point already specifies path: '/models' in environment.ts (line 60). This double configuration could lead to confusion about the actual path. When using an access point, the rootDirectory should typically be / (which is correct), but the documentation should clarify that the access point's path configuration takes precedence.

Comment on lines +206 to +208
parser.add_argument("--output_format", default="json", help="Output format")
parser.add_argument("--output_dir", required=True, help="Output directory")
parser.add_argument("--task", default="transcribe", help="Task to perform")

Copilot AI Dec 24, 2025


The --task and --output_format arguments are defined but never used in the script. These should either be implemented or removed from the argument parser to maintain a clean API surface.

// Create EFS for model caching (used by audio transcriber for HuggingFace models)
const modelCacheSecurityGroup = new ec2.SecurityGroup(this, 'ModelCacheSecurityGroup', {
vpc,
allowAllOutbound: true,

Copilot AI Dec 24, 2025


The EFS security group allows all outbound traffic but this is unnecessary for an EFS file system which only needs to accept inbound NFS connections. Consider removing allowAllOutbound: true to follow the principle of least privilege, as EFS doesn't initiate outbound connections.

Suggested change
allowAllOutbound: true,
allowAllOutbound: false,

parser = argparse.ArgumentParser(description="Transcribe audio using Hugging Face Whisper")
parser.add_argument("audio_file", help="Audio file to transcribe (use '-' for stdin)")
parser.add_argument("--model", required=True, help="Model name (e.g., 'large-v3')")
parser.add_argument("--model_dir", default="/model", help="Directory containing models")

Copilot AI Dec 24, 2025


The --model_dir argument is defined but never used in the script. The cache_dir parameter uses os.environ.get("HF_HOME", "/model") instead of args.model_dir. Either use the command-line argument or remove it from the argument parser to avoid confusion.

Comment on lines +254 to +255
# Clean up temporary file
os.unlink(audio_path)

Copilot AI Dec 24, 2025


The temporary file cleanup in the finally block could fail if os.unlink() raises an exception (e.g., file already deleted or permission denied). Consider wrapping the cleanup in a try-except block to prevent masking the original exception from the transcription.

Suggested change
# Clean up temporary file
os.unlink(audio_path)
# Clean up temporary file without masking any prior exceptions
try:
    os.unlink(audio_path)
except OSError as cleanup_err:
    # Best-effort cleanup; log in verbose mode but do not raise
    if verbose:
        print(
            f"Warning: failed to delete temporary file {audio_path}: {cleanup_err}",
            file=sys.stderr,
        )

Comment on lines 313 to 315
AudioTranscriberError::WhisperExecutableNotFoundError(
"whisper executable not found in PATH".to_string(),
"python3 or whisper_hf.py script not found".to_string(),
)

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message should be more specific. When python3 is not found versus when whisper_hf.py is missing, these are different issues requiring different troubleshooting. Consider checking for python3 availability first and providing a more helpful error message that distinguishes between missing Python interpreter and missing script file.


RUN python3 /app/download_model.py
WORKDIR /app
RUN pipenv install --deploy

Copilot AI Dec 24, 2025


The pipenv install --deploy command should be run with the --system flag when installing in a Docker container. Without this flag, packages are installed in a virtual environment which may not be accessible to the Python interpreter when running scripts directly with python3. This could cause import errors at runtime.

Suggested change
RUN pipenv install --deploy
RUN pipenv install --deploy --system
