Implement incremental Confluence data collection with change detection and deletion handling

WHY: We want a data collection system that fetches only the Confluence pages added or updated since the last sync and that handles page deletions. This reduces processing time and API calls, and keeps the indexed data consistent without reprocessing unchanged content.

DoD:

  - [ ] System tracks last successful data collection timestamp

  - [ ] Confluence API integration detects pages added/modified since last sync

  - [ ] System detects and handles deleted Confluence pages

  - [ ] Only changed pages are fetched and processed for embedding

  - [ ] Deleted pages are removed from vector database

  - [ ] Unchanged pages are skipped to optimize performance

  - [ ] **Research and decide**: run collection as a separate scheduled job, or integrate it into the RAG service as a decoupled component that is launched only when needed

  - [ ] Clear separation between data collection and RAG processing pipeline

  - [ ] Robust error handling if Confluence API is unavailable

  - [ ] Logging shows which pages were updated/added/deleted/skipped, i.e. logs are clear and free of errors and warnings

  - [ ] Configuration allows manual full refresh when needed
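
The change detection, deletion, and skip items above can be sketched as a small planning step. This is a minimal illustration, not a chosen design: `SyncPlan`, `build_cql`, and `plan_sync` are hypothetical names, the page version numbers are assumed to be stored alongside the embeddings, and the CQL date format and time zone handling would need to be checked against the Confluence instance in use. Note that a "modified since" query cannot report deleted pages, so the sketch assumes a separate listing of all current page IDs is available for the deletion check.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class SyncPlan:
    """Page IDs grouped by the action the collector should take."""
    added: List[str] = field(default_factory=list)
    updated: List[str] = field(default_factory=list)
    deleted: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)


def build_cql(space_key: str, last_sync: Optional[datetime]) -> str:
    """Build a CQL query for pages changed since last_sync.

    last_sync=None means a full refresh, so no date filter is added.
    """
    cql = f'space = "{space_key}" AND type = page'
    if last_sync is not None:
        stamp = last_sync.strftime("%Y-%m-%d %H:%M")
        cql += f' AND lastmodified >= "{stamp}"'
    return cql


def plan_sync(indexed: Dict[str, int], remote: Dict[str, int]) -> SyncPlan:
    """Compare indexed page versions with the versions Confluence reports.

    indexed: page ID -> version currently stored in the vector database
    remote:  page ID -> current version for every page that still exists
    """
    plan = SyncPlan()
    for page_id in sorted(remote):
        if page_id not in indexed:
            plan.added.append(page_id)
        elif remote[page_id] > indexed[page_id]:
            plan.updated.append(page_id)
        else:
            plan.skipped.append(page_id)
    # Pages we indexed earlier that no longer exist remotely were deleted.
    plan.deleted = sorted(set(indexed) - set(remote))
    return plan


if __name__ == "__main__":
    print(build_cql("DOCS", datetime(2024, 1, 2, 3, 4)))
    print(plan_sync({"1": 3, "2": 1, "3": 7}, {"1": 3, "2": 2, "4": 1}))
```

The four lists in the plan map directly onto the updated/added/deleted/skipped categories the logging item asks for.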

WHAT: A data collection system that syncs only Confluence changes, plus a research task to choose the architecture (separate job or integrated service) based on technical analysis.
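
Whichever architecture the research selects, the collector can be kept decoupled by passing the fetch and apply steps in as callables. The sketch below is one possible shape under stated assumptions: `run_sync`, `load_last_sync`, `save_last_sync`, and the JSON state file are all hypothetical, and an unavailable Confluence API is assumed to surface as `ConnectionError`. The timestamp advances only after a fully successful run, and a `full_refresh` flag ignores the stored timestamp.

```python
import json
import logging
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable, Optional

log = logging.getLogger("confluence_sync")


def load_last_sync(path: Path) -> Optional[datetime]:
    """Return the stored timestamp, or None if no sync has succeeded yet."""
    if not path.exists():
        return None
    return datetime.fromisoformat(json.loads(path.read_text())["last_sync"])


def save_last_sync(path: Path, when: datetime) -> None:
    path.write_text(json.dumps({"last_sync": when.isoformat()}))


def run_sync(
    state_path: Path,
    fetch_changes: Callable[[Optional[datetime]], Iterable[str]],
    apply_changes: Callable[[list], None],
    full_refresh: bool = False,
    now: Optional[datetime] = None,
) -> bool:
    """Run one sync; return True on success, False if the API was unreachable."""
    # Record the start time so pages edited during the run are seen next time.
    started = now or datetime.now(timezone.utc)
    since = None if full_refresh else load_last_sync(state_path)
    try:
        changes = list(fetch_changes(since))
    except ConnectionError as exc:
        # Leave the stored timestamp untouched so the next run retries.
        log.warning("Confluence unavailable, sync skipped: %s", exc)
        return False
    apply_changes(changes)
    save_last_sync(state_path, started)
    log.info("Sync finished: %d changed pages since %s", len(changes), since)
    return True
```

Because `run_sync` knows nothing about the scheduler or the RAG service, the same function could be called from a cron-style job or from a trigger inside the service.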

Implement incremental Confluence data collection with change detection and deletion handling #86
