@r-dh r-dh commented Dec 15, 2025

Discussion - Deletion PostgreSQL and DuckDB

@emilradix
To the best of my knowledge, there seems to be no elegant solution to support deletion for DuckDB.

DuckDB checks FK constraints immediately, so the current implementation uses a manual cascade and commits after each step.
This is a known issue: duckdb/duckdb#13819
It is also documented as a known limitation: https://duckdb.org/docs/stable/sql/indexes#over-eager-constraint-checking-in-foreign-keys

Additionally, DuckDB does not automatically update its keyword search index, so this has also been explicitly implemented.

The current implementation for DuckDB is NOT atomic - failures may leave partial deletions.
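As a rough illustration of the manual cascade described above (not the code in this PR), the sketch below deletes child rows before their parents and commits after every step. It uses the standard-library sqlite3 module with foreign keys enabled as a stand-in for DuckDB so that it runs without extra dependencies. The table names, column names, and the `delete_document` helper are assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE document (id TEXT PRIMARY KEY);
CREATE TABLE chunk (
    id TEXT PRIMARY KEY,
    document_id TEXT NOT NULL REFERENCES document(id)
);
CREATE TABLE chunk_embedding (
    id INTEGER PRIMARY KEY,
    chunk_id TEXT NOT NULL REFERENCES chunk(id)
);
"""

# Leaf tables first, the document row last. Each statement takes one document id.
DELETE_STEPS = [
    "DELETE FROM chunk_embedding WHERE chunk_id IN "
    "(SELECT id FROM chunk WHERE document_id = ?)",
    "DELETE FROM chunk WHERE document_id = ?",
    "DELETE FROM document WHERE id = ?",
]


def delete_document(conn, doc_id):
    """Delete one document leaf-first, committing after every step."""
    for statement in DELETE_STEPS:
        conn.execute(statement, (doc_id,))
        conn.commit()  # Not atomic: a failure after this point leaves a partial deletion.


def make_demo_db():
    """Create an in-memory database holding one document, one chunk, one embedding."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled.
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO document VALUES ('doc-1')")
    conn.execute("INSERT INTO chunk VALUES ('c-1', 'doc-1')")
    conn.execute("INSERT INTO chunk_embedding VALUES (1, 'c-1')")
    conn.commit()
    return conn


conn = make_demo_db()
delete_document(conn, "doc-1")
```

Deleting the `document` row first would be rejected by the FK check, which is why the order of `DELETE_STEPS` matters.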

Questions

  1. I would like feedback on my implementation of deletion for DuckDB. Is there a more elegant solution?
  2. If not - is the current approach (NOT atomic) acceptable?
  3. If not - should we support deletion for DuckDB?

@emilradix
Collaborator

emilradix commented Dec 15, 2025

In terms of whether it is ok that it is not atomic:
If I understand correctly, if an error happens mid-deletion, some elements related to the doc will still be in the DB, but the Document itself will still be there too, as it is the last to be deleted.
So if deletion fails, you could just rerun with the same doc id and have it be cleaned up? Correct? If that is the case, I think it is ok.
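To make the rerun argument concrete, here is a minimal sketch, assuming a hypothetical two-table schema and a `crash_after_chunks` flag that simulates a failure halfway through. It uses sqlite3 as a stand-in for DuckDB and is not the PR's implementation. Because the document row is removed last, the same document id can simply be passed in again.

```python
import sqlite3


def make_db():
    """In-memory database with one document and two chunks (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        "CREATE TABLE document (id TEXT PRIMARY KEY);"
        "CREATE TABLE chunk (id TEXT PRIMARY KEY,"
        " document_id TEXT NOT NULL REFERENCES document(id));"
    )
    conn.execute("INSERT INTO document VALUES ('doc-1')")
    conn.executemany("INSERT INTO chunk VALUES (?, 'doc-1')", [("c-1",), ("c-2",)])
    conn.commit()
    return conn


def delete_document(conn, doc_id, crash_after_chunks=False):
    """Delete chunks, then the document, committing after each step."""
    conn.execute("DELETE FROM chunk WHERE document_id = ?", (doc_id,))
    conn.commit()
    if crash_after_chunks:
        # Simulated failure between the two steps (illustration only).
        raise RuntimeError("simulated failure mid-deletion")
    conn.execute("DELETE FROM document WHERE id = ?", (doc_id,))
    conn.commit()


def count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = make_db()
try:
    delete_document(conn, "doc-1", crash_after_chunks=True)
except RuntimeError:
    pass
# Chunks are gone, but the document row survives and can still be targeted.
partial = (count(conn, "chunk"), count(conn, "document"))

# Rerunning with the same id finishes the clean-up; the chunk delete is a no-op.
delete_document(conn, "doc-1")
final = (count(conn, "chunk"), count(conn, "document"))
```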

The main thing I would like to see changed in the DuckDB implementation is that we have to manually specify everything associated with the main document. Is there any way to automatically discover all dependent/child tables through the foreign key relationships, and then loop through them, deleting from each?

The reason is that if we change / add something in the future, we will always have to go back to this deletion file and add it here as well for clean up -> I just have a feeling we will forget, and then have a mismatch.

@r-dh
Author

r-dh commented Dec 22, 2025

Thanks for the feedback. I looked into automatic FK discovery to address your concern about maintenance.

It's possible to automatically discover FK relationships by querying DuckDB's duckdb_constraints() catalog. However, to actually perform deletions, we also need to map table names to SQLModel classes, and that mapping would still need manual maintenance.

So we'd have more complex code with the same maintenance burden.

Personally, I would be in favour of keeping the explicit deletion order.

The current implementation is ~20 lines, easy to understand, and uses efficient bulk deletes. If someone adds a new table with an FK to document or chunk and forgets to update the deletion logic, the FK constraint will cause deletion to fail loudly - not silently leave orphaned data.

I can add a comment here noting that new FK tables require an update.

Of course, fully automatic discovery is possible (using SQLModel's registry to build the table-to-model mapping dynamically), but I'd argue the added complexity (DFS traversal, regex parsing, 100+ lines) isn't worth it for a 4-table schema. Or perhaps there is a more elegant solution?
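For reference, the ordering part of automatic discovery is fairly small once the FK edges are known. The sketch below assumes the (child, parent) pairs have already been extracted from the catalog, which is the part that needs the parsing discussed above. The table names are hypothetical. It uses the standard-library `graphlib` module instead of a hand-written DFS.

```python
from graphlib import TopologicalSorter

# Hypothetical FK edges as (child_table, parent_table) pairs. In practice these
# would have to be extracted from the database catalog first.
FK_EDGES = [
    ("chunk", "document"),
    ("chunk_embedding", "chunk"),
    ("eval", "document"),
]


def deletion_order(fk_edges):
    """Return table names so that every child table precedes the table it references."""
    sorter = TopologicalSorter()
    for child, parent in fk_edges:
        # add(node, *predecessors): predecessors are emitted before the node,
        # so the child table is deleted from before its parent.
        sorter.add(parent, child)
    return list(sorter.static_order())


order = deletion_order(FK_EDGES)
```

A circular FK relationship would raise `graphlib.CycleError`. This sketch only produces an order and does not solve the table-to-model mapping problem.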

Thoughts?

@emilradix
Collaborator

OK, if it is not easily possible, let's keep it as is.

In that case, could you add a test that creates and then deletes something?

If someone modifies the setup then the test will fail, and we are reminded we have to go in and fix it here.
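One possible shape for such a test is a round trip that compares row counts for every table before insertion and after deletion. The sketch below runs against sqlite3 as a stand-in. The schema and the `insert_document` and `delete_document` helpers are hypothetical placeholders, not RAGLite functions. Because the table list is read from the catalog rather than hard-coded, a newly added table is included in the comparison automatically.

```python
import sqlite3

# Hypothetical schema and helpers, used only to make the test runnable.
SCHEMA = """
CREATE TABLE document (id TEXT PRIMARY KEY);
CREATE TABLE chunk (
    id TEXT PRIMARY KEY,
    document_id TEXT NOT NULL REFERENCES document(id)
);
"""


def table_row_counts(conn):
    """Map every user table found in the catalog to its current row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


def insert_document(conn, doc_id):
    conn.execute("INSERT INTO document VALUES (?)", (doc_id,))
    conn.execute("INSERT INTO chunk VALUES (?, ?)", (f"{doc_id}-chunk-0", doc_id))
    conn.commit()


def delete_document(conn, doc_id):
    conn.execute("DELETE FROM chunk WHERE document_id = ?", (doc_id,))
    conn.execute("DELETE FROM document WHERE id = ?", (doc_id,))
    conn.commit()


def test_insert_then_delete_restores_database():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    before = table_row_counts(conn)
    insert_document(conn, "doc-1")
    assert table_row_counts(conn) != before  # The insert must actually add rows.
    delete_document(conn, "doc-1")
    assert table_row_counts(conn) == before  # Every table is back to its start.
```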

Contributor

Copilot AI left a comment

Pull request overview

This PR adds document deletion functionality for both PostgreSQL and DuckDB databases in RAGLite. The implementation includes two primary functions: delete_documents for deleting by document IDs and delete_documents_by_metadata for deleting by metadata filters. Due to DuckDB's immediate foreign key constraint checking, the PostgreSQL implementation is atomic while the DuckDB implementation requires multiple commits and is non-atomic.

Key changes:

  • Adds deletion module with database-specific implementations for PostgreSQL (atomic) and DuckDB (non-atomic with manual cascading)
  • Implements metadata table cleanup to remove orphaned metadata values
  • Adds index rebuilding logic for DuckDB after deletions
  • Improves logging in _rag.py by replacing warnings.warn with proper logger usage

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| src/raglite/_delete.py | New module implementing document deletion with separate code paths for PostgreSQL (atomic, ORM-based cascade) and DuckDB (non-atomic, manual cascade with intermediate commits); includes metadata cleanup and index rebuilding |
| tests/test_delete.py | Adds comprehensive tests for document deletion by ID and metadata filter, verifying cascade deletion of related entities and proper metadata cleanup |
| src/raglite/__init__.py | Exports new deletion functions delete_documents and delete_documents_by_metadata in the public API |
| src/raglite/_rag.py | Replaces warnings.warn with logger.warning for better logging consistency and removes unused warnings import |

@jirastorza
Contributor

  • I’ve implemented the deletion tests to verify that the database returns to its original state after documents are inserted and then deleted. This ensures all related tables (including those with foreign keys) are fully cleared and will flag if future schema changes require updated deletion logic.

  • I’ve updated the warning in _clip (in _rag.py) to use the logging module instead of warnings.warn. This prevents test failures when the context window limit is exceeded, while the information is still captured in the logs.
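
The effect of the logging change can be sketched as follows, assuming the test suite escalates warnings to errors (the thread does not say how the tests are configured). The two `clip_*` functions are toy stand-ins, not the real `_clip`, and the logger name is made up for the example.

```python
import logging
import warnings

logger = logging.getLogger("clip_demo")  # Hypothetical logger name.


def clip_with_warning(messages, limit):
    """Toy stand-in: report clipping through the warnings module."""
    if len(messages) > limit:
        warnings.warn("Context window limit exceeded; clipping messages.", stacklevel=2)
    return messages[-limit:]


def clip_with_logging(messages, limit):
    """Toy stand-in: report clipping through the logging module."""
    if len(messages) > limit:
        logger.warning("Context window limit exceeded; clipping messages.")
    return messages[-limit:]


def raises_under_strict_warnings(func):
    """Return True if func raises once warnings are escalated to errors."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # Similar in effect to a strict test config.
        try:
            func(["a", "b", "c"], 2)
        except UserWarning:
            return True
    return False


warning_version_fails = raises_under_strict_warnings(clip_with_warning)
logging_version_fails = raises_under_strict_warnings(clip_with_logging)
```

Under that assumption, only the `warnings.warn` variant turns into an exception. The logging variant still emits the message but leaves control flow untouched.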

@jirastorza jirastorza requested a review from emilradix January 7, 2026 14:18
@emilradix
Collaborator

Thanks.

Could you verify that if you use vector_search_multivector = False, deletion works for DuckDB?
https://github.com/superlinear-ai/raglite/blob/main/src/raglite/_config.py#L70

@jirastorza
Contributor

> Thanks.
>
> Could you verify that if you use vector_search_multivector = False, deletion works for DuckDB? https://github.com/superlinear-ai/raglite/blob/main/src/raglite/_config.py#L70

I implemented an extra test case with vector_search_multivector = False and the test passed.
