Switch audio transcription from openai-whisper to HuggingFace transformers #535
Conversation
@copilot
1. Is there a better way to handle downloading the model rather than baking it into the Docker image, given this image is used in an AWS Batch environment?
2. Are there any good ways to provide the initial context to the model?
Addressed in commit 9f98217:

1. Model downloading for AWS Batch:

```shell
# Build without baking the model into the image
docker buildx bake audio_transcriber --set audio_transcriber.args.DOWNLOAD_MODEL_AT_BUILD=false
```

Then mount EFS and set HF_HOME in the job definition:

```json
{"environment": [{"name": "HF_HOME", "value": "/mnt/efs/models"}]}
```

The model downloads once on first run, then is cached. See the "AWS Batch Deployment Options" section of MIGRATION_GUIDE.md for EFS/S3 details.

2. Initial prompt support:
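The initial-prompt mechanism can be sketched roughly as follows. This is a hedged sketch, not code from the PR: the `build_generate_kwargs` helper and the `--initial_prompt` flag wiring are assumptions, and the commented lines mirror the real `WhisperProcessor.get_prompt_ids` call without requiring a model download.

```python
import argparse

# Sketch only: how an --initial_prompt flag could be threaded into the
# transformers generate() call. build_generate_kwargs is hypothetical;
# the commented lines mirror WhisperProcessor.get_prompt_ids.

def build_generate_kwargs(initial_prompt, processor=None):
    """Return kwargs for generate(), adding prompt conditioning if requested."""
    kwargs = {"task": "transcribe"}
    if initial_prompt and processor is not None:
        # Tokenize the prompt so generation is conditioned on its style:
        # kwargs["prompt_ids"] = processor.get_prompt_ids(
        #     initial_prompt, return_tensors="pt")
        kwargs["prompt_text"] = initial_prompt  # placeholder for the sketch
    return kwargs

parser = argparse.ArgumentParser()
parser.add_argument("--initial_prompt", default=None,
                    help="Text that conditions the model's style and vocabulary")
args = parser.parse_args(["--initial_prompt", "Speaker names: Alice, Bob"])
print(build_generate_kwargs(args.initial_prompt, processor=object()))
```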
Co-authored-by: saebyn <185030+saebyn@users.noreply.github.com>
@copilot can we update the CDK code that deploys this for use in Batch to provision and provide HF_HOME?
Updated CDK in commit 9e2f41f: the first job downloads the ~3GB model to EFS; subsequent jobs reuse the cache.
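The caching behavior can be illustrated with a minimal sketch. Only the `HF_HOME` convention comes from the PR; the `/model` fallback path and the helper name are assumptions for illustration.

```python
import os

# Sketch: resolve the model cache directory the way the batch job would.
# HF_HOME is set by the job definition to the EFS mount; the "/model"
# default is an assumption standing in for the image's baked-in location.

def resolve_cache_dir(default="/model"):
    return os.environ.get("HF_HOME", default)

os.environ["HF_HOME"] = "/mnt/efs/models"
print(resolve_cache_dir())
```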
@copilot can you add documentation about how EFS is used, how to troubleshoot if there are issues with the job, and how to reset/redownload the model?
Added comprehensive EFS documentation in commit 79e43fa to MIGRATION_GUIDE.md, covering:
- EFS usage
- Troubleshooting
- Model reset options
- Monitoring
Removed the '--system' flag from the pipenv install command.
Pull request overview
This PR migrates the audio transcription system from the openai-whisper pip package to HuggingFace's transformers library to address issues with long silences breaking or hanging transcription. The migration includes implementing EFS-based model caching for AWS Batch deployments to avoid baking the ~3GB model into Docker images.
Key Changes:
- Replaced openai-whisper with HuggingFace transformers (whisper-large-v3 model)
- Added EFS file system infrastructure for model caching in AWS Batch
- Created Python wrapper script (whisper_hf.py) for CLI compatibility with existing Rust code
- Added comprehensive migration guide with EFS troubleshooting documentation
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 12 comments.
Summary per file:
| File | Description |
|---|---|
| cdk/lib/batch/environment.ts | Added EFS file system, security group, and access point for model caching |
| cdk/lib/batch/audioTranscriberJob.ts | Configured EFS volume mount and HF_HOME environment variable for batch jobs |
| cdk/lib/appStack.ts | Passed EFS resources to audio transcriber job construct |
| audio_transcriber/whisper_hf.py | New Python wrapper providing CLI-compatible interface to HuggingFace transformers |
| audio_transcriber/src/whisper.rs | Updated command invocation to call Python script instead of whisper CLI |
| audio_transcriber/download_model.py | Switched from openai-whisper to transformers for model downloading |
| audio_transcriber/Pipfile | Added transformers and dependencies |
| audio_transcriber/Pipfile.lock | Locked dependency versions for reproducible builds |
| Dockerfile | Updated to install dependencies via pipenv and support conditional model download |
| audio_transcriber/MIGRATION_GUIDE.md | Comprehensive documentation for migration, EFS setup, and troubleshooting |
```dockerfile
# Conditionally download model at build time
# If DOWNLOAD_MODEL_AT_BUILD=false, model will be downloaded at runtime
RUN if [ "$DOWNLOAD_MODEL_AT_BUILD" = "true" ]; then python3 /app/download_model.py; fi
```
The conditional download logic uses a shell string comparison. Because the expansion is quoted, [ "$DOWNLOAD_MODEL_AT_BUILD" = "true" ] is already safe when the variable is unset; using [ "${DOWNLOAD_MODEL_AT_BUILD}" = "true" ] makes the expansion explicit and is a common style preference.
Suggested change:
```dockerfile
RUN if [ "${DOWNLOAD_MODEL_AT_BUILD}" = "true" ]; then python3 /app/download_model.py; fi
```
```python
# This conditions the model to generate text consistent with the prompt style
if initial_prompt:
    try:
        prompt_ids = processor.get_prompt_ids(initial_prompt, return_tensors="pt")
```
The prompt_ids should be moved to the correct device before being used in generation. When using CUDA, the prompt_ids tensor is created on CPU but needs to be moved to the GPU device. Add .to(device_str) after getting the prompt_ids to ensure they're on the correct device.
Suggested change:
```python
prompt_ids = processor.get_prompt_ids(initial_prompt, return_tensors="pt").to(device_str)
```
```typescript
containerPath: '/mnt/efs/models',
fileSystem: props.modelCacheFileSystem,
accessPointId: props.modelCacheAccessPoint.accessPointId,
rootDirectory: '/',
```
The rootDirectory parameter is set to / which means the volume mount starts at the EFS root. However, the access point already specifies path: '/models' in environment.ts (line 60). This double configuration could lead to confusion about the actual path. When using an access point, the rootDirectory should typically be / (which is correct), but the documentation should clarify that the access point's path configuration takes precedence.
```python
parser.add_argument("--output_format", default="json", help="Output format")
parser.add_argument("--output_dir", required=True, help="Output directory")
parser.add_argument("--task", default="transcribe", help="Task to perform")
```
The --task and --output_format arguments are defined but never used in the script. These should either be implemented or removed from the argument parser to maintain a clean API surface.
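If the maintainers choose to implement the flags rather than remove them, a minimal sketch might look like the following. The `render` helper is hypothetical, and the supported formats ("json", "txt") are illustrative assumptions.

```python
import json

# Hypothetical sketch: honoring --output_format instead of dropping it.
# render() is an assumed helper name, not code from the PR.

def render(result, output_format="json"):
    """Serialize a transcription result in the requested format."""
    if output_format == "json":
        return json.dumps(result)
    if output_format == "txt":
        return result["text"]
    raise ValueError(f"unsupported output format: {output_format}")

print(render({"text": "hello world"}, "txt"))
```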
```typescript
// Create EFS for model caching (used by audio transcriber for HuggingFace models)
const modelCacheSecurityGroup = new ec2.SecurityGroup(this, 'ModelCacheSecurityGroup', {
  vpc,
  allowAllOutbound: true,
```
The EFS security group allows all outbound traffic, but this is unnecessary for an EFS file system, which only needs to accept inbound NFS connections. Consider setting allowAllOutbound: false to follow the principle of least privilege, since EFS does not initiate outbound connections.
Suggested change:
```typescript
allowAllOutbound: false,
```
```python
parser = argparse.ArgumentParser(description="Transcribe audio using Hugging Face Whisper")
parser.add_argument("audio_file", help="Audio file to transcribe (use '-' for stdin)")
parser.add_argument("--model", required=True, help="Model name (e.g., 'large-v3')")
parser.add_argument("--model_dir", default="/model", help="Directory containing models")
```
The --model_dir argument is defined but never used in the script. The cache_dir parameter uses os.environ.get("HF_HOME", "/model") instead of args.model_dir. Either use the command-line argument or remove it from the argument parser to avoid confusion.
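One way to reconcile the two, sketched below as an assumption rather than the PR's code, is to let --model_dir serve as the fallback when HF_HOME is unset:

```python
import argparse
import os

# Sketch: give --model_dir a purpose by using it as the cache fallback.
# The precedence order (HF_HOME first, then the flag) is an assumption.

parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", default="/model",
                    help="Fallback cache directory when HF_HOME is unset")
args = parser.parse_args([])

cache_dir = os.environ.get("HF_HOME", args.model_dir)
print(cache_dir)
```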
```python
# Clean up temporary file
os.unlink(audio_path)
```
The temporary file cleanup in the finally block could fail if os.unlink() raises an exception (e.g., file already deleted or permission denied). Consider wrapping the cleanup in a try-except block to prevent masking the original exception from the transcription.
Suggested change:
```python
# Clean up temporary file without masking any prior exceptions
try:
    os.unlink(audio_path)
except OSError as cleanup_err:
    # Best-effort cleanup; log in verbose mode but do not raise
    if verbose:
        print(
            f"Warning: failed to delete temporary file {audio_path}: {cleanup_err}",
            file=sys.stderr,
        )
```
```rust
AudioTranscriberError::WhisperExecutableNotFoundError(
    "whisper executable not found in PATH".to_string(),
    "python3 or whisper_hf.py script not found".to_string(),
)
```
The error message should be more specific. When python3 is not found versus when whisper_hf.py is missing, these are different issues requiring different troubleshooting. Consider checking for python3 availability first and providing a more helpful error message that distinguishes between missing Python interpreter and missing script file.
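The distinction could be implemented as a pre-flight check, sketched here in Python for brevity; in the PR the check would live in the Rust caller, and the script path shown is an assumption.

```python
import shutil
from pathlib import Path

# Sketch: distinguish a missing interpreter from a missing wrapper script
# so the error message points at the right fix. The default path is an
# assumption for illustration.

def diagnose(script="/app/whisper_hf.py"):
    if shutil.which("python3") is None:
        return "python3 not found in PATH"
    if not Path(script).exists():
        return f"wrapper script missing: {script}"
    return "ok"

print(diagnose("/definitely/not/there.py"))
```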
```dockerfile
RUN python3 /app/download_model.py
WORKDIR /app
RUN pipenv install --deploy
```
The pipenv install --deploy command should be run with the --system flag when installing in a Docker container. Without this flag, packages are installed in a virtual environment which may not be accessible to the Python interpreter when running scripts directly with python3. This could cause import errors at runtime.
Suggested change:
```dockerfile
RUN pipenv install --deploy --system
```
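Whichever install strategy is chosen, an importability smoke check (an assumption, not part of the PR) can surface the venv-vs-system mismatch early, before a batch job fails at runtime:

```python
import importlib.util

# Sketch: verify that required packages are visible to the interpreter that
# will run whisper_hf.py. The module names transformers/torch come from the
# PR's dependency list; the check itself is an assumption.

def missing_modules(names=("transformers", "torch")):
    return [n for n in names if importlib.util.find_spec(n) is None]

# A non-empty result means the packages landed somewhere (e.g. a pipenv
# venv) that the system python3 cannot see.
print(missing_modules(("json", "os")))  # stdlib modules: always found
```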
Migration from openai-whisper to Hugging Face transformers
Successfully switched from the prepackaged openai-whisper pip package to using the Hugging Face transformers library with the whisper-large-v3 model to address issues with long silences breaking/hanging transcription.
✅ Changes Completed:
Documentation Updates:
Added comprehensive documentation for EFS model caching including: