
🎨 MultiModalHugs

MultimodalHugs is a lightweight, modular framework built on top of Hugging Face for training, evaluating, and deploying multimodal AI models with minimal code.

It supports diverse input modalities—including text, images, video, and pose sequences—and integrates seamlessly with the Hugging Face ecosystem (Trainer API, model hub, evaluate, etc.).


Key Features

  • Minimal boilerplate: Standardized TSV format for datasets and YAML-based configuration.
  • 🔁 Reproducible pipelines: Consistent setup for training, evaluation, and inference.
  • 🔌 Modular design: Easily extend or swap models, processors, and modalities.
  • 📦 Hugging Face native: Built to work out-of-the-box with existing models and tools.
  • Examples Included: Refer to the examples/ directory for guided scripts, configurations, and best practices.

Whether you're working on sign language translation, image-to-text, or token-free language modeling, MultimodalHugs simplifies experimentation while keeping your codebase clean.
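
Because trained checkpoints and processors are saved in Hugging Face format, they can in principle be reloaded with the standard Transformers Auto classes. The snippet below is a minimal sketch under that assumption; the checkpoint path is hypothetical, and the specific Auto classes (a seq2seq model here) depend on the model you actually train.

    from transformers import AutoModelForSeq2SeqLM, AutoProcessor

    # Hypothetical path: point this at the output_dir of your own training run.
    ckpt_path = "outputs/pose2text/checkpoint-best"

    # Assumes the checkpoint registers a seq2seq architecture and that the
    # processor was saved alongside it; trust_remote_code=True is needed if
    # the model or processor uses custom code.
    processor = AutoProcessor.from_pretrained(ckpt_path, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt_path, trust_remote_code=True)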

For more details, refer to the documentation.


Installation

  1. Clone the repository:

    git clone https://github.com/GerrySant/multimodalhugs.git
  2. Navigate into the repository and install the package:

    • Standard installation:
       cd multimodalhugs
       pip install .
    • Developer installation:
       cd multimodalhugs
       pip install -e .[dev]
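
To quickly verify the installation, you can check that the package imports and report its version. This sketch assumes the installed distribution is named multimodalhugs, matching the import name.

    import importlib.metadata

    import multimodalhugs  # noqa: F401  # raises ImportError if the install failed

    # Assumes the installed distribution is named "multimodalhugs".
    print(importlib.metadata.version("multimodalhugs"))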

Usage

🚀 Getting Started

To set up, train, and evaluate a model, follow these steps:

Steps Overview

1. Dataset Preparation

For each partition (train, val, test), create a metadata TSV file that records the essential details of each sample in a consistent format; a minimal sketch of building such a file follows the field list below.

Metadata File Requirements

The metadata.tsv files for each partition must include the following fields:

  • signal: The primary input to the model, either as raw text or a file path pointing to a multimodal resource (e.g., an image, pose sequence, or audio file).
  • signal_start: Start timestamp of the input segment (commonly in milliseconds); can be left empty or set to 0 if not required by the setup.
  • signal_end: End timestamp of the input segment (commonly in milliseconds); can be left empty or set to 0 if not required by the setup.
  • encoder_prompt: An optional text field providing additional context for the input; this may include instructions (e.g., Translate the pose to English), modality tags (e.g., __asl__ for American Sign Language, ASL), or any other text relevant to the task.
  • decoder_prompt: An optional textual prompt used during decoding to guide the model’s output generation, corresponding to Hugging Face’s decoder_input_ids.
  • output: The expected textual output corresponding to the input signal.
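
As a concrete illustration, the sketch below writes a tiny metadata.tsv using only the Python standard library. The column names come from the list above; the file paths, prompts, and output sentence are hypothetical placeholders for a pose-to-text setup.

    import csv

    # Required metadata columns (see the field descriptions above).
    FIELDS = ["signal", "signal_start", "signal_end",
              "encoder_prompt", "decoder_prompt", "output"]

    # One hypothetical sample; in practice you would write one metadata.tsv
    # per partition (train, val, test) with one row per sample.
    rows = [
        {
            "signal": "data/poses/sample_0001.pose",  # path to a pose sequence
            "signal_start": 0,                        # or leave empty if unused
            "signal_end": 0,
            "encoder_prompt": "__asl__",              # e.g. a modality tag
            "decoder_prompt": "",
            "output": "Hello, how are you?",
        },
    ]

    with open("metadata.tsv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)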

2. Setup Datasets, Model, and Processors

multimodalhugs-setup --modality {pose2text,signwriting2text,image2text,etc} --config_path $CONFIG_PATH --output_dir $OUTPUT_PATH

3. Train a Model

multimodalhugs-train --task <task_name> --config_path $CONFIG_PATH --output_dir $OUTPUT_PATH

4. Generate Outputs with a Trained Model

multimodalhugs-generate --task <task_name> \
      --metric_name $METRIC_NAME \
      --config_path $CONFIG_PATH \
      --model_name_or_path $CKPT_PATH \
      --processor_name_or_path $PROCESSOR_PATH \
      --dataset_dir $DATASET_PATH \
      --output_dir $GENERATION_OUTPUT_DIR
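
Beyond the --metric_name passed above, you can also score generations yourself with the Hugging Face evaluate library. The sketch below assumes you have already collected the generated hypotheses and the reference translations as Python lists; only the evaluate API calls are standard.

    import evaluate

    # Hypothetical data: in practice, read these from the files produced in
    # $GENERATION_OUTPUT_DIR by multimodalhugs-generate.
    hypotheses = ["hello how are you"]
    references = [["Hello, how are you?"]]  # one list of references per sample

    # sacrebleu is one of the metrics available through evaluate.
    sacrebleu = evaluate.load("sacrebleu")
    result = sacrebleu.compute(predictions=hypotheses, references=references)
    print(f"BLEU: {result['score']:.2f}")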

For more details, refer to the CLI documentation.

Sample end-to-end experimentation pipelines can be found in the examples/ directory.

Directory Overview

multimodalhugs/
├── README.md               # Project overview
├── LICENSE                 # License information
├── pyproject.toml          # Package dependencies and setup
├── .gitignore              # Git ignore rules
├── .github/                # GitHub actions and workflows
│   └── workflows/
├── docs/                   # Documentation
│   ├── README.md
│   ├── customization/      # Guides for custom extensions
│   ├── data/               # Data configs and dataset docs
│   ├── general/            # General framework documentation
│   ├── media/              # Visual guides
│   └── models/             # Model documentation
├── examples/               # Example scripts and configurations
│   └── multimodal_translation/
│       ├── image2text_translation/
│       ├── pose2text_translation/
│       └── signwriting2text_translation/
├── multimodalhugs/         # Core framework
│   ├── custom_datasets/    # Custom datasets
│   ├── data/               # Data handling utilities
│   ├── models/             # Model implementations
│   ├── modules/            # Custom components (adapters, embeddings, etc.)
│   ├── processors/         # Preprocessing modules
│   ├── tasks/              # Task-specific logic (e.g., translation)
│   ├── training_setup/     # Training pipeline setup
│   ├── multimodalhugs_cli/ # Command-line interface for training/inference
│   └── utils/              # Helper functions
├── scripts/                # Utility scripts (e.g., docs generation, metrics)
└── tests/                  # Unit and integration tests

For a detailed breakdown of each directory, see docs/README.md.

Contributing

All contributions—bug reports, feature requests, or pull requests—are welcome. Please see our GitHub repository to get involved.

License

This project is licensed under the terms of the MIT License.

Citing this Work

If you use MultimodalHugs in your research or applications, please cite:

@misc{sant2025multimodalhugs,
  title        = {MultimodalHugs: Enabling Sign Language Processing in Hugging Face},
  author       = {Sant, Gerard and Jiang, Zifan and Escolano, Carlos and Moryossef, Amit and Müller, Mathias and Sennrich, Rico and Ebling, Sarah},
  year         = {2024},
  note         = {Manuscript submitted for publication},
  howpublished = {https://github.com/GerrySant/multimodalhugs},
}
