Description
WHY: We want a data collection system that fetches only the Confluence pages added or updated since the last sync and handles page deletions, so that we reduce processing time and API calls and keep data consistent without reprocessing unchanged content.
DoD:
- System tracks the timestamp of the last successful data collection
- Confluence API integration detects pages added or modified since the last sync
- System detects and handles deleted Confluence pages
- Only changed pages are fetched and processed for embedding
- Deleted pages are removed from the vector database
- Unchanged pages are skipped to optimize performance
- Research and decide: a separate scheduled job vs. an approach integrated into the RAG service but decoupled from it and triggered intelligently
- Clear separation between data collection and the RAG processing pipeline
- Robust error handling if the Confluence API is unavailable
- Logging shows which pages were updated, added, deleted, or skipped; logs are clear and free of errors and warnings
- Configuration allows a manual full refresh when needed
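The change-detection items above could be sketched roughly as follows. This is a minimal illustration, not a decided design: `PageInfo`, `SyncPlan`, and `plan_sync` are hypothetical names, and the sketch assumes that page IDs and last-modified timestamps have already been obtained from Confluence and that the IDs currently stored in the vector database are known.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class PageInfo:
    """Minimal page metadata (hypothetical shape, not the Confluence API schema)."""
    page_id: str
    last_modified: datetime


@dataclass
class SyncPlan:
    to_fetch: List[str] = field(default_factory=list)   # new or modified pages
    to_delete: List[str] = field(default_factory=list)  # indexed but gone remotely
    skipped: List[str] = field(default_factory=list)    # unchanged pages


def plan_sync(
    remote_pages: Iterable[PageInfo],
    indexed_ids: Set[str],
    last_sync: Optional[datetime],
    full_refresh: bool = False,
) -> SyncPlan:
    """Decide which pages to fetch, delete, or skip for one sync run."""
    pages = sorted(remote_pages, key=lambda p: p.page_id)
    plan = SyncPlan()
    for page in pages:
        is_new = page.page_id not in indexed_ids
        # With no previous successful sync, every page counts as changed.
        is_changed = last_sync is None or page.last_modified > last_sync
        if full_refresh or is_new or is_changed:
            plan.to_fetch.append(page.page_id)
        else:
            plan.skipped.append(page.page_id)
    # Anything indexed locally that no longer exists remotely was deleted.
    remote_ids = {p.page_id for p in pages}
    plan.to_delete = sorted(set(indexed_ids) - remote_ids)
    return plan


if __name__ == "__main__":
    last = datetime(2024, 1, 1, tzinfo=timezone.utc)
    remote = [
        PageInfo("a", datetime(2023, 12, 31, tzinfo=timezone.utc)),
        PageInfo("b", datetime(2024, 1, 2, tzinfo=timezone.utc)),
    ]
    print(plan_sync(remote, {"a", "b", "d"}, last))
```

Passing `full_refresh=True` puts every remote page in `to_fetch`, which corresponds to the manual full refresh item. The three lists also map directly onto the updated/added, deleted, and skipped categories that the logging item asks for.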
WHAT: A data collection system that syncs only Confluence changes, plus research to choose the optimal architecture (separate job or integrated service) based on technical analysis.
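Whichever architecture is chosen, the sync needs to remember when it last succeeded and must not advance that marker when Confluence is unreachable. A minimal sketch under stated assumptions: the JSON state file, the `last_successful_sync` key, and the `collect` callback are all hypothetical, and a real implementation might keep this state in a database instead.

```python
import json
import logging
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger("confluence_sync")


def load_last_sync(state_file: Path) -> Optional[datetime]:
    """Return the last successful sync time, or None if no sync has succeeded yet."""
    if not state_file.exists():
        return None
    data = json.loads(state_file.read_text(encoding="utf-8"))
    return datetime.fromisoformat(data["last_successful_sync"])


def run_sync(
    state_file: Path,
    collect: Callable[[Optional[datetime]], None],
    now: Optional[datetime] = None,
) -> bool:
    """Run one collection pass; advance the timestamp only if it succeeds.

    `collect` is a hypothetical callback that fetches and processes changes
    since the given time and raises ConnectionError if Confluence is down.
    """
    # Record the start time, so pages edited during the run are picked up next time.
    started_at = now or datetime.now(timezone.utc)
    last_sync = load_last_sync(state_file)
    try:
        collect(last_sync)
    except ConnectionError as exc:
        logger.warning(
            "Confluence unavailable, keeping last sync at %s: %s", last_sync, exc
        )
        return False
    state_file.write_text(
        json.dumps({"last_successful_sync": started_at.isoformat()}),
        encoding="utf-8",
    )
    logger.info("Sync succeeded, last sync advanced to %s", started_at)
    return True
```

Because `run_sync` depends only on a callback and a state file, the same function could be invoked from a scheduled job or from within the RAG service, which keeps the architecture decision open.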