🔮 Project: Multi-Vector Retrieval (ColBERT) #2

@MotzWanted

WHY
Currently, VOD training only supports document-level embeddings. Each document is represented by a single vector, which limits the granularity of the contextual information that can be captured.

ColBERT introduced a more complex interaction by encoding each passage into a matrix of token-level embeddings. During search, it further embeds every query into another matrix, allowing efficient passage retrieval that contextually matches the query using scalable vector-similarity operators.

The rich interactions enabled by ColBERT have been shown to outperform single-vector representation models in retrieval quality. However, scaling it efficiently to large corpora is not trivial.

HOW
The project will address the aforementioned goals through the following means:

Utilizing Fine-Grained Contextual Late Interaction:

  • Leverage ColBERT's ability to encode queries and passages into sequences of token-level embeddings.
  • Improve vod's on-disk data structures to handle 3-dimensional tensors with a variable-length token dimension (e.g., shape N x ? x H)
  • Implement ColBERT's MaxSim operator in the loss layer
  • Implement ColBERT's two-stage retrieval
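
As a rough illustration of the MaxSim operator listed above, here is a minimal NumPy sketch. It assumes the variable-length token dimension is handled by padding to a fixed length plus boolean masks; the function name `maxsim_scores` and the argument layout are hypothetical and not part of vod. For each query token it takes the maximum dot product over the document's real tokens, then sums over the query's real tokens.

```python
import numpy as np

def maxsim_scores(q, q_mask, d, d_mask):
    """Late-interaction (MaxSim) scores between queries and documents.

    q:      [B, Lq, H] query token embeddings (padded)
    q_mask: [B, Lq]    True for real query tokens
    d:      [N, Ld, H] document token embeddings (padded)
    d_mask: [N, Ld]    True for real document tokens
    Returns an array of shape [B, N].
    Assumes every document has at least one real token.
    """
    # Token-level dot products for every query/document pair: [B, N, Lq, Ld]
    sim = np.einsum("bqh,ndh->bnqd", q, d)
    # Padded document tokens must never win the max.
    sim = np.where(d_mask[None, :, None, :], sim, -np.inf)
    best = sim.max(axis=-1)  # [B, N, Lq]
    # Padded query tokens contribute nothing to the sum.
    best = np.where(q_mask[:, None, :], best, 0.0)
    return best.sum(axis=-1)
```

In a loss layer, the resulting [B, N] matrix could be used as logits over candidate passages; that wiring is left out here because it depends on how vod samples and weights candidates.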

Combine T5 Models with ColBERT:

  • Benchmark ColT5 against ColBERT
  • Benchmark the end-to-end search latency in a search engine such as Raffle.

Implement XTR: Contextualized Token Retriever:

  • Implement XTR loss
  • Implement XTR one-stage retrieval
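
The XTR loss item can be sketched as follows. This is a simplified reading of the in-batch token retrieval objective, not a reference implementation: each query token retrieves its top-k document tokens from all in-batch documents, only retrieved tokens count toward a document's score, and the score is normalised by the number of query tokens that retrieved something from that document. The names `xtr_in_batch_scores`, `xtr_loss` and `k_train` are hypothetical, and padding is ignored for brevity.

```python
import numpy as np

def xtr_in_batch_scores(q, d, k_train):
    """Score in-batch documents for one query using only retrieved tokens.

    q: [Lq, H] query token embeddings (no padding assumed)
    d: [N, Ld, H] token embeddings of N in-batch documents (no padding assumed)
    k_train: number of document tokens each query token retrieves
    Returns an array of shape [N].
    """
    lq = q.shape[0]
    n, ld, _ = d.shape
    sim = np.einsum("qh,ndh->qnd", q, d)  # [Lq, N, Ld]
    flat = sim.reshape(lq, n * ld)
    # k-th largest similarity per query token over all in-batch document tokens.
    kth = np.partition(flat, -k_train, axis=1)[:, -k_train][:, None]
    # Alignment mask; ties at the threshold may retrieve more than k_train tokens.
    retrieved = (flat >= kth).reshape(lq, n, ld)
    hit = retrieved.any(axis=-1)  # [Lq, N]
    best = np.where(retrieved, sim, -np.inf).max(axis=-1)
    best = np.where(hit, best, 0.0)  # no retrieved token means no contribution
    z = np.maximum(hit.sum(axis=0), 1)  # avoid dividing by zero
    return best.sum(axis=0) / z

def xtr_loss(scores, positive_idx):
    """Cross-entropy over in-batch documents, given the positive's index."""
    logits = scores - scores.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[positive_idx]
```

The point of this design is that a positive document whose tokens are never retrieved receives a score of zero, so the loss pushes the encoder to retrieve the right tokens in the first place, which is what makes one-stage retrieval feasible.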

Refinements:

  • Investigate Robust Multi-Hop Reasoning at Scale via Condensed Retrieval.
  • Explore effective and efficient retrieval via Lightweight Late Interaction (e.g., PLAID)

WHAT
The anticipated outcomes of this project include:

  1. State-of-the-art retrieval for RAG models (T5 + XTR)
  2. A scalable solution capable of handling large corpora without compromising efficiency.
