Data Ingestion Pipeline #84

@emincalyakaisskar

Description

WHY: As a user, I want a relevant chatbot: the data the RAG relies on must be quickly accessible and kept up to date by a real data ingestion system.

DoD:

  • Identify the data sources for each data type (audio, video, text, web, PDF) in our use case
  • Build one pipeline per data type: Video/Audio Pipeline, PDF Pipeline, Web Pipeline, Plain Text Pipeline
  • Add connectors that retrieve data from these sources (with the correct access rights, a full initial retrieval, then updates), one connector per pipeline
  • Define triggers according to source type (event-driven, cron, manual)
  • Manage data transformation (Video -> Audio -> Transcript, PDF -> OCR -> Text, HTML parsing -> Text, ...)
  • Chunking strategies: intelligent, adapted to each data type
  • Quality control
  • Metadata (origin and provenance of data)
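
To make the DoD more concrete, here is a minimal sketch of how the pieces could fit together: one connector per pipeline, a trigger type, a chain of transformation steps, a chunker chosen per data type, a basic quality check, and provenance metadata on every chunk. Every name in it (`Document`, `Chunk`, `Connector`, `Pipeline`, `Trigger`, `InMemoryTextConnector`, `strip_lines`, `split_paragraphs`) is hypothetical and only illustrates the plain text case; real connectors, OCR, transcription, and scheduling are out of scope here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Callable, Iterable, Protocol


class Trigger(Enum):
    """How a pipeline run is started (assumed names)."""
    EVENT = "event"
    CRON = "cron"
    MANUAL = "manual"


@dataclass
class Document:
    content: str
    source_type: str  # e.g. "audio", "video", "text", "web", "pdf"
    origin: str       # where the data came from (path, URL, ...)
    fetched_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class Chunk:
    text: str
    metadata: dict


class Connector(Protocol):
    """One connector per pipeline: it only knows how to fetch documents."""
    def fetch(self) -> Iterable[Document]: ...


@dataclass
class Pipeline:
    connector: Connector
    transforms: list[Callable[[str], str]]  # e.g. OCR, HTML parsing
    chunker: Callable[[str], list[str]]     # strategy chosen per data type
    trigger: Trigger = Trigger.MANUAL

    def run(self) -> list[Chunk]:
        chunks: list[Chunk] = []
        for doc in self.connector.fetch():
            text = doc.content
            for step in self.transforms:
                text = step(text)
            if not text.strip():
                continue  # basic quality control: skip empty results
            for index, piece in enumerate(self.chunker(text)):
                chunks.append(Chunk(piece, {
                    "origin": doc.origin,
                    "source_type": doc.source_type,
                    "fetched_at": doc.fetched_at.isoformat(),
                    "chunk_index": index,
                }))
        return chunks


class InMemoryTextConnector:
    """Toy connector for the plain text pipeline (illustration only)."""
    def __init__(self, items: dict[str, str]):
        self.items = items

    def fetch(self) -> Iterable[Document]:
        for origin, content in self.items.items():
            yield Document(content=content, source_type="text", origin=origin)


def strip_lines(text: str) -> str:
    """Transformation step: trim whitespace on every line."""
    return "\n".join(line.strip() for line in text.splitlines())


def split_paragraphs(text: str) -> list[str]:
    """Chunking strategy for plain text: one chunk per paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


pipeline = Pipeline(
    connector=InMemoryTextConnector({
        "notes.txt": "First paragraph.\n\nSecond paragraph.  ",
        "empty.txt": "   ",
    }),
    transforms=[strip_lines],
    chunker=split_paragraphs,
)
chunks = pipeline.run()
```

With this shape, the Video/Audio, PDF, and Web pipelines would differ only in their connector, their list of transforms, and their chunker, while the metadata attached to each chunk stays uniform.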
