-
Notifications
You must be signed in to change notification settings - Fork 101
feat: support deletion for PostgreSQL and DuckDB #170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
In terms of whether it is ok that it is not atomic: Main thing I would like to see different with the DuckDB implementation is that we have to manually specify everything associated with the main document -> Is there any way to automatically discover all dependent/child tables in the foreign key relationships? And then loop through them deleting them. The reason is that if we change / add something in the future, we will always have to go back to this deletion file and add it here as well for clean up -> I just have a feeling we will forget, and then have a mismatch. |
|
Thanks for the feedback. I looked into automatic FK discovery to address your concern about maintenance. It's possible to automatically discover FK relationships by querying DuckDB's So we'd have more complex code with the same maintenance burden. Personally I would be in favour to keep the explicit deletion order. The current implementation is ~20 lines, easy to understand, and uses efficient bulk deletes. If someone adds a new table with an FK to I can add a comment here noting that new FK tables require an update. Of course, fully automatic discovery is possible (using SQLModel's registry to build the table-to-model mapping dynamically), but I'd argue the implementation and complexity (DFS traversal, regex parsing, 100+ lines) would be questionable and doesn't seem worth it for a 4-table schema. Or perhaps there is a more elegant solution? Thoughts? |
|
OK, if it is not easily possible, lets keep it as is. In that case, could you add a test that makes and deletes something? If someone modifies the setup then the test will fail, and we are reminded we have to go in and fix it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds document deletion functionality for both PostgreSQL and DuckDB databases in RAGLite. The implementation includes two primary functions: delete_documents for deleting by document IDs and delete_documents_by_metadata for deleting by metadata filters. Due to DuckDB's immediate foreign key constraint checking, the PostgreSQL implementation is atomic while the DuckDB implementation requires multiple commits and is non-atomic.
Key changes:
- Adds deletion module with database-specific implementations for PostgreSQL (atomic) and DuckDB (non-atomic with manual cascading)
- Implements metadata table cleanup to remove orphaned metadata values
- Adds index rebuilding logic for DuckDB after deletions
- Improves logging in
_rag.pyby replacingwarnings.warnwith proper logger usage
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/raglite/_delete.py | New module implementing document deletion with separate code paths for PostgreSQL (atomic, ORM-based cascade) and DuckDB (non-atomic, manual cascade with intermediate commits); includes metadata cleanup and index rebuilding |
| tests/test_delete.py | Adds comprehensive tests for document deletion by ID and metadata filter, verifying cascade deletion of related entities and proper metadata cleanup |
| src/raglite/init.py | Exports new deletion functions delete_documents and delete_documents_by_metadata in the public API |
| src/raglite/_rag.py | Replaces warnings.warn with logger.warning for better logging consistency and removes unused warnings import |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
Thanks. Could you verify that if you use vector_search_multivector = False, deletion works for DuckDB? |
I implemented an extra test case with vector_search_multivector = False and the test passed. |
Discussion - Deletion PostgreSQL and DuckDB
@emilradix
To the best of my knowledge, there seems to be no elegant solution to support deletion for DuckDB.
DuckDB checks FK constraints immediately, so the current implementation uses a manual cascade and commits after each step.
This is a known issue: duckdb/duckdb#13819
It is also documented as a known limitation: https://duckdb.org/docs/stable/sql/indexes#over-eager-constraint-checking-in-foreign-keys
Additionally, DuckDB does not automatically update its keyword search index, so this has also been explicitly implemented.
The current implementation for DuckDB is NOT atomic - failures may leave partial deletions.
Questions