Data Ingestion Pipeline

WHY: As a user, I want the chatbot to give relevant answers, so the data exposed to the RAG must be quick to access and kept up to date by a real data ingestion system.

DoD:

- [x] Identify the data sources for the different types (audio, video, text, web, pdf) in our use case
- [ ] Build one pipeline per data type: Video/Audio Pipeline, PDF Pipeline, Web Pipeline, Plain Text Pipeline
- [ ] Add one connector per pipeline, capable of retrieving data from these sources (with the appropriate access rights, fetching all data, updating data)
- [ ] Define triggers according to source type (event-driven, cron, manual)
- [ ] Manage data transformation (Video -> Audio -> Transcript, PDF -> OCR -> Text, HTML parsing -> Text, ...)
- [ ] Chunking strategies: adapted intelligently to each data type
- [ ] Quality control
- [ ] Metadata (origin and provenance of data)
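
To make the checklist above more concrete, here is a minimal sketch of how a per-type pipeline could chain transformation, chunking, a basic quality check, and provenance metadata. Everything in it is an assumption for illustration: the names `PIPELINES`, `ingest`, `chunk_words`, and the metadata keys are hypothetical, the regex-based HTML stripping is only a placeholder for a real parser, and the video/audio and PDF steps (transcription, OCR) are left out because they depend on external tools that this issue does not specify.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, str]  # origin and provenance of the chunk


def strip_html(raw: str) -> str:
    # Placeholder: a real Web pipeline would use a proper HTML parser.
    return re.sub(r"<[^>]+>", " ", raw)


def normalize_whitespace(raw: str) -> str:
    return " ".join(raw.split())


# One ordered list of transformation steps per data type (hypothetical registry).
PIPELINES: Dict[str, List[Callable[[str], str]]] = {
    "web": [strip_html, normalize_whitespace],
    "text": [normalize_whitespace],
}


def chunk_words(text: str, max_words: int = 50, overlap: int = 10) -> List[str]:
    # Fixed-size word windows with overlap; a stand-in for type-aware chunking.
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, last_start, step)]


def ingest(source_type: str, source_uri: str, raw: str) -> List[Chunk]:
    text = raw
    for step in PIPELINES[source_type]:
        text = step(text)
    ingested_at = datetime.now(timezone.utc).isoformat()
    chunks = []
    for index, piece in enumerate(chunk_words(text)):
        if not piece.strip():  # minimal quality control: skip empty chunks
            continue
        chunks.append(Chunk(piece, {
            "source_type": source_type,
            "source_uri": source_uri,
            "ingested_at": ingested_at,
            "chunk_index": str(index),
        }))
    return chunks
```

A call such as `ingest("web", "https://example.com/page", html)` would return a list of `Chunk` objects ready to be embedded. Keeping the steps in a registry keyed by data type means a new pipeline can be added without touching `ingest` itself.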

Data Ingestion Pipeline #84