Implement incremental Confluence data collection with change detection and deletion handling #86

@emincalyakaisskar

Description

WHY: We want a data collection system that fetches only the Confluence pages created or updated since the last sync and that handles page deletions. This reduces processing time and API calls, and keeps the data consistent without reprocessing unchanged content.

DoD:

  •  System tracks last successful data collection timestamp

  •  Confluence API integration detects pages added/modified since last sync

  •  System detects and handles deleted Confluence pages

  •  Only changed pages are fetched and processed for embedding

  •  Deleted pages are removed from vector database

  •  Unchanged pages are skipped to optimize performance

  •  Research and decide between two approaches: a separate scheduled job, or a collector integrated into the RAG service but decoupled from it and triggered only when needed

  •  Clear separation between data collection and RAG processing pipeline

  •  Robust error handling if Confluence API is unavailable

  •  Logging shows which pages were added, updated, deleted, or skipped; logs are clear, and a normal run produces no errors or warnings

  •  Configuration allows manual full refresh when needed
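
The change-detection and deletion criteria above could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: `SyncPlan`, `plan_sync`, and `cql_modified_since` are hypothetical names, and the sketch assumes we store a page ID to version number mapping for everything already embedded. Note that a "modified since" query alone cannot reveal deletions, so the sketch assumes deletions are found by comparing a lightweight listing of current page IDs against the stored IDs. The CQL string uses the `lastmodified` field with a `yyyy-MM-dd HH:mm` date; time-zone handling is left open here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class SyncPlan:
    """Hypothetical result type: page IDs grouped by the action to take."""
    added: List[str] = field(default_factory=list)
    updated: List[str] = field(default_factory=list)
    deleted: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)


def plan_sync(remote: Dict[str, int], stored: Dict[str, int],
              full_refresh: bool = False) -> SyncPlan:
    """Compare page versions in Confluence (remote) with embedded ones (stored).

    Both arguments map page ID -> version number (an assumed storage layout).
    With full_refresh=True, every page that still exists is reprocessed.
    """
    plan = SyncPlan()
    for page_id, version in sorted(remote.items()):
        if page_id not in stored:
            plan.added.append(page_id)
        elif full_refresh or version > stored[page_id]:
            plan.updated.append(page_id)
        else:
            plan.skipped.append(page_id)
    # Pages we embedded earlier that no longer exist remotely: remove from the vector DB.
    plan.deleted = sorted(set(stored) - set(remote))
    return plan


def cql_modified_since(space_key: str, last_sync: Optional[datetime]) -> str:
    """Build a CQL filter for pages changed since the last successful sync.

    last_sync=None means no successful sync is recorded, so no date filter is added.
    """
    base = f'type = page AND space = "{space_key}"'
    if last_sync is None:
        return base
    stamp = last_sync.strftime("%Y-%m-%d %H:%M")
    return f'{base} AND lastmodified >= "{stamp}"'


if __name__ == "__main__":
    plan = plan_sync(remote={"1": 3, "2": 1, "4": 1}, stored={"1": 2, "2": 1, "3": 5})
    print(plan)
    print(cql_modified_since("DOCS", datetime(2024, 1, 2, 3, 4)))
```

Keeping the plan as plain data (lists of page IDs) means the collector can log exactly what was added, updated, deleted, or skipped, and the embedding step can consume the plan without knowing how it was produced, whichever architecture the research settles on.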

WHAT: A data collection system that syncs only Confluence changes, plus a research task to choose the architecture (separate job or integrated service) based on technical analysis.
