-
Notifications
You must be signed in to change notification settings - Fork 4
fix(search): Cap overly large dialog content to avoid errors when indexing dialogs. #3133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughAdds an EF Core migration that creates a tsvector aggregate and applies a new version (V3) of the search."VDialogDocument" SQL view on Up and reverts to V2 on Down; the V3 view builds weighted, language-aware tsvector documents with truncation and token cleanup. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/View.VDialogDocument.V3.sql (1)
12-12: Document the rationale for the 100,000 character limit.The truncation to 100,000 characters per content piece addresses the PR objective to cap overly large content. However, since multiple content pieces are aggregated, the total tsvector size could still grow large.
Consider documenting:
- Why 100,000 characters was chosen
- Whether aggregate size should also be monitored
- Any PostgreSQL tsvector size limits that informed this decision
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/20251212140020_CapSearchVectorSize.Designer.csis excluded by!**/Migrations/**/*Designer.cs
📒 Files selected for processing (2)
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/20251212140020_CapSearchVectorSize.cs(1 hunks)src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/View.VDialogDocument.V3.sql(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.cs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.cs: Use file-scoped namespaces withusingdirectives outside the namespace
Use PascalCase for classes and methods in C#
Use camelCase for variables and parameters in C#
Prefer expression bodies for single-line members in C#
Usevarwhen the type is apparent in C#
Enable nullable reference types and keep entities immutable in C#
Use OneOf for union returns when applicable in C#
All code must compile withTreatWarningsAsErrors=trueand pass .NET analyzers
Files:
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/20251212140020_CapSearchVectorSize.cs
**/*.{cs,json,yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Use 4-space indentation with LF line endings
Files:
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/20251212140020_CapSearchVectorSize.cs
🧠 Learnings (3)
📓 Common learnings
Learnt from: elsand
Repo: Altinn/dialogporten PR: 2849
File: src/Digdir.Domain.Dialogporten.Janitor/README.md:35-88
Timestamp: 2025-10-13T18:40:25.376Z
Learning: In the Dialogporten project, when reviewing documentation changes, prefer to maintain consistency with existing formatting conventions in the file, even if unconventional, rather than suggesting formatting changes that are out-of-scope for the PR's main objectives.
📚 Learning: 2025-10-13T08:14:27.518Z
Learnt from: MagnusSandgren
Repo: Altinn/dialogporten PR: 2841
File: src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Repositories/DialogSearchRepository.cs:82-83
Timestamp: 2025-10-13T08:14:27.518Z
Learning: In PostgreSQL SQL queries within the Dialogporten codebase, prefer explicit type casts (e.g., `''::tsvector`) over relying on implicit conversions for clarity and to avoid ambiguity.
Applied to files:
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/View.VDialogDocument.V3.sql
📚 Learning: 2025-10-09T08:50:14.740Z
Learnt from: oskogstad
Repo: Altinn/dialogporten PR: 2841
File: src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/20251009083408_AddDialogSearch.cs:9-10
Timestamp: 2025-10-09T08:50:14.740Z
Learning: Files in `src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/*/**` are auto-generated Entity Framework Core migrations and should not be reviewed for formatting or coding style issues.
Applied to files:
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/20251212140020_CapSearchVectorSize.cs
🧬 Code graph analysis (1)
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/20251212140020_CapSearchVectorSize.cs (1)
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/MigrationSqlLoader.cs (1)
MigrationSqlLoader(5-30)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Dry run deploy apps / Deploy web-api-so to test
- GitHub Check: Dry run deploy apps / Deploy service to test
- GitHub Check: Dry run deploy apps / Deploy web-api-eu to test
- GitHub Check: Dry run deploy apps / Deploy graphql to test
- GitHub Check: build / build-and-test
🔇 Additional comments (2)
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/View.VDialogDocument.V3.sql (1)
10-10: Good use of explicit regconfig cast.The explicit cast to
regconfigfollows PostgreSQL best practices and matches the coding patterns established in this codebase.Based on learnings, explicit type casts are preferred in this codebase for clarity.
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/20251212140020_CapSearchVectorSize.cs (1)
14-22: Migration structure looks correct.The Up migration properly loads and executes the V3 SQL script using
MigrationSqlLoader, which aligns with the established pattern in this codebase.
...Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/View.VDialogDocument.V3.sql
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/20251212140020_CapSearchVectorSize.cs(1 hunks)src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/Aggregate.TsVector_Agg.sql(1 hunks)src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/View.VDialogDocument.V3.sql(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Migrations/20251212140020_CapSearchVectorSize.cs
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: MagnusSandgren
Repo: Altinn/dialogporten PR: 2841
File: src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Repositories/DialogSearchRepository.cs:82-83
Timestamp: 2025-10-13T08:14:27.518Z
Learning: In PostgreSQL SQL queries within the Dialogporten codebase, prefer explicit type casts (e.g., `''::tsvector`) over relying on implicit conversions for clarity and to avoid ambiguity.
📚 Learning: 2025-10-13T08:14:27.518Z
Learnt from: MagnusSandgren
Repo: Altinn/dialogporten PR: 2841
File: src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Repositories/DialogSearchRepository.cs:82-83
Timestamp: 2025-10-13T08:14:27.518Z
Learning: In PostgreSQL SQL queries within the Dialogporten codebase, prefer explicit type casts (e.g., `''::tsvector`) over relying on implicit conversions for clarity and to avoid ambiguity.
Applied to files:
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/View.VDialogDocument.V3.sqlsrc/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/Aggregate.TsVector_Agg.sql
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Dry run deploy apps / Deploy web-api-eu to test
- GitHub Check: Dry run deploy apps / Deploy service to test
- GitHub Check: Dry run deploy apps / Deploy graphql to test
- GitHub Check: Dry run deploy apps / Deploy web-api-so to test
- GitHub Check: build / build-and-test
🔇 Additional comments (3)
src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/Aggregate.TsVector_Agg.sql (1)
1-5: Excellent implementation of tsvector aggregation.This aggregate correctly implements the pattern recommended in the previous review using
pg_catalog.tsvector_concatto preserve position information and weights when aggregating tsvectors across rows.
Note: This implementation directly addresses the previous review concern about using
string_aggwith text casting.src/Digdir.Domain.Dialogporten.Infrastructure/Persistence/Sql/Dialog/Search/View.VDialogDocument.V3.sql (2)
6-10: Language-aware search configuration is well implemented.The use of language-specific
regconfigwith a fallback to'simple'ensures proper tokenization and stemming for multilingual content while gracefully handling unmapped languages.
Note: The explicit type casts (
::"char"and::regconfig) align with the codebase preference for clarity.
12-15: Aggregation scope is correctly constrained.The
WHERE c."DialogId" = d."Id"clause properly ensures each dialog's document only includes its own content, and theLEFT JOINappropriately handles content without mapped language codes.
| -- Produces weighted tsvectors per dialog so upserts remain a simple INSERT ... SELECT. | ||
| CREATE OR REPLACE VIEW search."VDialogDocument" AS | ||
| SELECT d."Id" AS "DialogId" | ||
| , ( | ||
| SELECT public.tsvector_agg( | ||
| SETWEIGHT( | ||
| -- Fall back to simple when the language map lacks a match. | ||
| TO_TSVECTOR(COALESCE(isomap."TsConfigName", 'simple')::regconfig, c."Value"), | ||
| c."Weight"::"char" | ||
| ) | ||
| ) | ||
| FROM search."VDialogContent" c | ||
| LEFT JOIN search."Iso639TsVectorMap" isomap | ||
| ON c."LanguageCode" = isomap."IsoCode" | ||
| WHERE c."DialogId" = d."Id" | ||
| ) AS "Document" | ||
| , d."Party" AS "Party" | ||
| FROM "Dialog" d; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing size capping logic contradicts PR objectives.
The view correctly uses the new tsvector_agg aggregate (addressing the previous review concern), but the PR is titled "Cap overly large dialog content to avoid errors when indexing dialogs" and references issue #3100. However, this implementation contains no size limiting, truncation, or cleanup logic.
The AI summary mentions "truncated to 1,048,575 characters" and "REGEXP_REPLACE cleanup," but neither appears in this code. Without capping, this PR may not resolve the indexing errors it's intended to fix.
Verify whether:
- Size capping logic was accidentally omitted from this version
- The truncation is applied elsewhere in the pipeline
- The summary description is outdated
If truncation is needed, consider wrapping the aggregated document with a length check and substring operation, for example:
, (
SELECT
CASE
WHEN LENGTH(doc_text) > 1048575
THEN REGEXP_REPLACE(
SUBSTRING(doc_text, 1, 1048575),
'\s+\S*$', -- Remove trailing partial token
''
)::tsvector
ELSE doc_text::tsvector
END
FROM (
SELECT public.tsvector_agg(
SETWEIGHT(
TO_TSVECTOR(COALESCE(isomap."TsConfigName", 'simple')::regconfig, c."Value"),
c."Weight"::"char"
)
)::text AS doc_text
FROM search."VDialogContent" c
LEFT JOIN search."Iso639TsVectorMap" isomap
ON c."LanguageCode" = isomap."IsoCode"
WHERE c."DialogId" = d."Id"
) agg
) AS "Document"Run this script to check if there are references to truncation or size limits in related migration or configuration files:
#!/bin/bash
# Search for truncation, size limits, or the specific 1048575 limit mentioned in summary
rg -i "1048575|truncat|substring.*tsvector|regexp_replace.*tsvector" --type sql --type cs -C 3
|



Uh oh!
There was an error while loading. Please reload this page.