feat(transcript): add configurable sentence capitalization#6
Conversation
Default transcript assembly now applies sentence-case normalization and promotes standalone i/i'm forms to I/I'm. Adds transcript.capitalize_sentences config support for both JSONC and legacy parsers, wires options through the transcriber pipeline, and expands transcript/config/pipeline tests.
📝 WalkthroughWalkthroughAdds configurable sentence-capitalization to transcript assembly: new Changes
Sequence Diagram(s)sequenceDiagram
participant Config as Configuration\nParser
participant Transcriber as Transcriber
participant Assemble as Transcript\nAssembler
Config->>Transcriber: Provide TranscriptConfig\n(TrailingSpace, CapitalizeSentences)
Transcriber->>Assemble: Assemble(segments, Options{TrailingSpace, CapitalizeSentences})
Assemble->>Assemble: Join segments & normalize whitespace
rect rgba(100,150,255,0.5)
Assemble->>Assemble: If CapitalizeSentences: apply sentence-case rules\n(start-of-sentence, boundary classifier, pronoun "I")
end
rect rgba(100,200,150,0.5)
Assemble->>Assemble: If TrailingSpace: append trailing space
end
Assemble->>Transcriber: Return formatted transcript
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4000b99710
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 51-73: The capitalizeSentenceStarts function treats every '.' as a
sentence boundary; change the logic so punctuation only resets capitalizeNext
when the '.' is a true sentence terminator (e.g., followed by whitespace and an
uppercaseable letter) and not when it's part of a decimal/number. In practice,
when iterating runes in capitalizeSentenceStarts, detect a '.' and check
surrounding runes (previous and next) to skip setting capitalizeNext if the dot
is between digits (decimal like '2.0') or if the next non-space rune is a digit;
otherwise keep the existing behavior for '.', '!' and '?' to set capitalizeNext.
Ensure you reference the existing capitalizeNext variable and the switch
handling of '.', '!', '?' while adding the additional rune-context checks.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
README.mdapps/sotto/internal/config/defaults.goapps/sotto/internal/config/parser_jsonc.goapps/sotto/internal/config/parser_legacy.goapps/sotto/internal/config/parser_test.goapps/sotto/internal/config/types.goapps/sotto/internal/pipeline/transcriber.goapps/sotto/internal/pipeline/transcriber_test.goapps/sotto/internal/transcript/assemble.goapps/sotto/internal/transcript/assemble_test.godocs/architecture.mddocs/configuration.md
Only capitalize after .!? when a whitespace boundary follows, preventing domain/decimal fragments like example.com or v2.1 from being mutated. Adds regression tests for boundary handling.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
apps/sotto/internal/transcript/assemble_test.go (1)
55-73: Add a regression test for leading-boundary inputs to prevent double capitalization.The suite covers domain/decimal/quote boundaries well, but it misses cases like
2. hello there, which should become2. Hello there(not2. HEllo there).🧪 Suggested test addition
func TestAssembleSentenceCaseHandlesQuoteAfterBoundary(t *testing.T) { t.Parallel() got := Assemble([]string{"he said. \"hello there\" and left."}, Options{ TrailingSpace: false, CapitalizeSentences: true, }) require.Equal(t, "He said. \"Hello there\" and left.", got) } + +func TestAssembleSentenceCaseLeadingBoundaryDoesNotDoubleCapitalize(t *testing.T) { + t.Parallel() + + got := Assemble([]string{"2. hello there"}, Options{ + TrailingSpace: false, + CapitalizeSentences: true, + }) + require.Equal(t, "2. Hello there", got) +}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/sotto/internal/transcript/assemble_test.go` around lines 55 - 73, Add a regression test that ensures sentence-capitalization does not double-capitalize after a leading boundary like "2." by creating a new test (e.g., TestAssembleSentenceCaseHandlesLeadingBoundary) that calls Assemble with input such as "2. hello there" using Options{TrailingSpace:false, CapitalizeSentences:true} and asserts the output equals "2. Hello there"; refer to the Assemble function and Options type to locate where other sentence-capitalization tests live and mirror their structure for parallel execution and assertion style.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 60-63: When handling rune capitalization in assemble.go, ensure
that when capitalizeStart is true and you uppercase the rune (the block
referencing capitalizeStart and unicode.ToUpper), you also clear pendingBoundary
so the sentence-boundary state doesn't also trigger further capitalization;
update the branch that sets r = unicode.ToUpper(r) to set pendingBoundary =
false (or otherwise reset the boundary flag) immediately after capitalization to
prevent double-capitalization.
---
Nitpick comments:
In `@apps/sotto/internal/transcript/assemble_test.go`:
- Around line 55-73: Add a regression test that ensures sentence-capitalization
does not double-capitalize after a leading boundary like "2." by creating a new
test (e.g., TestAssembleSentenceCaseHandlesLeadingBoundary) that calls Assemble
with input such as "2. hello there" using Options{TrailingSpace:false,
CapitalizeSentences:true} and asserts the output equals "2. Hello there"; refer
to the Assemble function and Options type to locate where other
sentence-capitalization tests live and mirror their structure for parallel
execution and assertion style.
Reset pending sentence-boundary flags when start-of-text capitalization fires to avoid double-capitalization on inputs like '2. hello'. Adds regression coverage for leading-boundary handling.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
apps/sotto/internal/transcript/assemble.go (1)
100-107: Consider adding left curly quotes for completeness.The current set includes straight quotes and right curly quotes, but omits left curly quotes (
'U+2018,"U+201C). While the whitespace requirement handles common cases like. "Hello", adding the opening curly quotes would be more robust for edge cases where ASR outputs left curly quotes directly after punctuation without a space.💡 Optional: add left curly quotes
func isSentencePrefixRune(r rune) bool { switch r { - case ')', ']', '}', '\'', '"', ''', '"': + case ')', ']', '}', '\'', '"', ''', '"', ''', '"': return true default: return false } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/sotto/internal/transcript/assemble.go` around lines 100 - 107, The isSentencePrefixRune function currently treats right/straight quotes as sentence-prefix runes but omits left curly quotes; update the switch in isSentencePrefixRune to also return true for the left single quotation mark ‘ (U+2018) and the left double quotation mark “ (U+201C) so they are recognized the same as '\'' and '"' for sentence-prefix detection.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 100-107: The isSentencePrefixRune function currently treats
right/straight quotes as sentence-prefix runes but omits left curly quotes;
update the switch in isSentencePrefixRune to also return true for the left
single quotation mark ‘ (U+2018) and the left double quotation mark “ (U+201C)
so they are recognized the same as '\'' and '"' for sentence-prefix detection.
Refine abbreviation/initialism boundary handling for pronoun and locative contexts, and split transcript casing internals into focused files for maintainability.
Add coverage for pronoun-start and locative initialism cases while preserving conjunction non-boundary behavior around etc./u.s. periods.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
apps/sotto/internal/transcript/boundary_classifier.go (1)
270-285: Consolidate duplicated reverse token scanning.Line [270]-Line [285] and Line [352]-Line [367] implement near-identical backward scans. Extracting a shared helper would reduce drift risk as token rules evolve.
♻️ Proposed refactor
+func scanSentenceTokenStart(runes []rune, from int) int { + start := from + for start >= 0 { + if r := runes[start]; unicode.IsLetter(r) || r == '.' { + start-- + continue + } + break + } + return start + 1 +} + func sentenceTokenStart(runes []rune, idx int) int { if idx <= 0 || idx >= len(runes) { return -1 } - - start := idx - 1 - for start >= 0 { - if r := runes[start]; unicode.IsLetter(r) || r == '.' { - start-- - continue - } - break - } - - return start + 1 + return scanSentenceTokenStart(runes, idx-1) } func sentenceTokenBeforePeriod(runes []rune, idx int) string { if idx <= 0 || idx >= len(runes) { return "" } - - start := idx - 1 - for start >= 0 { - if r := runes[start]; unicode.IsLetter(r) || r == '.' { - start-- - continue - } - break - } - - return strings.Trim(string(runes[start+1:idx]), ".") + start := scanSentenceTokenStart(runes, idx-1) + return strings.Trim(string(runes[start:idx]), ".") }As per coding guidelines, "Write canonical, idiomatic Go with gofmt clean, straightforward naming, and small focused functions".
Also applies to: 352-367
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/sotto/internal/transcript/boundary_classifier.go` around lines 270 - 285, The backward-scanning logic is duplicated (in sentenceTokenStart and the other reverse scan in the same file around lines 352-367); extract a small helper (e.g., scanBackwardTokenStart(runes []rune, idx int) int) that encapsulates the loop and character rules (unicode.IsLetter || r == '.') and return the computed start index, then replace both sentenceTokenStart and the other backward-scan code to call this helper; keep the same signature/behavior (return -1 for out-of-range idx) and update callers accordingly and add/adjust a small unit test exercising edge cases.apps/sotto/internal/transcript/assemble.go (1)
7-10: Add an explicit default-options constructor to avoid zero-value surprises.Line [7]-Line [10] makes
CapitalizeSentencesdefault tofalseforOptions{}, while runtime config default istrue. A constructor/default helper would make call-site intent explicit and reduce accidental behavior drift.♻️ Proposed refactor
type Options struct { TrailingSpace bool CapitalizeSentences bool } + +func DefaultOptions() Options { + return Options{ + TrailingSpace: false, + CapitalizeSentences: true, + } +}As per coding guidelines, "Use explicit constructors and dependency wiring with no hidden globals".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/sotto/internal/transcript/assemble.go` around lines 7 - 10, Add an explicit constructor for Options to avoid zero-value surprises: implement a NewOptions (or DefaultOptions) function that returns an Options instance with CapitalizeSentences set to true (and TrailingSpace set to the intended default, e.g., false), and update call-sites to use it when they rely on runtime defaults; reference the Options type in assemble.go and create func NewOptions() Options { return Options{TrailingSpace: false, CapitalizeSentences: true} } (or return *Options if callers expect pointers).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 7-10: Add an explicit constructor for Options to avoid zero-value
surprises: implement a NewOptions (or DefaultOptions) function that returns an
Options instance with CapitalizeSentences set to true (and TrailingSpace set to
the intended default, e.g., false), and update call-sites to use it when they
rely on runtime defaults; reference the Options type in assemble.go and create
func NewOptions() Options { return Options{TrailingSpace: false,
CapitalizeSentences: true} } (or return *Options if callers expect pointers).
In `@apps/sotto/internal/transcript/boundary_classifier.go`:
- Around line 270-285: The backward-scanning logic is duplicated (in
sentenceTokenStart and the other reverse scan in the same file around lines
352-367); extract a small helper (e.g., scanBackwardTokenStart(runes []rune, idx
int) int) that encapsulates the loop and character rules (unicode.IsLetter || r
== '.') and return the computed start index, then replace both
sentenceTokenStart and the other backward-scan code to call this helper; keep
the same signature/behavior (return -1 for out-of-range idx) and update callers
accordingly and add/adjust a small unit test exercising edge cases.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
apps/sotto/internal/transcript/assemble.goapps/sotto/internal/transcript/assemble_test.goapps/sotto/internal/transcript/boundary_classifier.goapps/sotto/internal/transcript/pronoun_case.goapps/sotto/internal/transcript/sentence_case.go
Summary
transcript.capitalize_sentencesconfig support (JSONC + legacy parser), defaulting totrueTest Plan
just ci-checknix build 'path:.#sotto'Summary by CodeRabbit
New Features
Documentation
Tests