feat(transcript): add configurable sentence capitalization by rbright · Pull Request #6 · rbright/sotto

rbright · 2026-02-27T23:50:10Z

Summary

add transcript sentence-case normalization in assembly so output starts sentences with capitals
add transcript.capitalize_sentences config support (JSONC + legacy parser), defaulting to true
wire transcript formatting options through the transcriber pipeline and expand config/transcript/pipeline tests
update README/configuration/architecture docs to reflect the new transcript formatting behavior

Test Plan

just ci-check
nix build 'path:.#sotto'

Summary by CodeRabbit

New Features
- Added transcript.capitalize_sentences (enabled by default) to apply sentence-case normalization to generated transcripts while preserving trailing-space behavior.
Documentation
- Updated configuration docs and examples to include the new capitalization option and its default.
Tests
- Added tests covering parsing the new option (JSONC and legacy) and many sentence-case behaviors and edge cases.

Default transcript assembly now applies sentence-case normalization and promotes standalone i/i'm forms to I/I'm. Adds transcript.capitalize_sentences config support for both JSONC and legacy parsers, wires options through the transcriber pipeline, and expands transcript/config/pipeline tests.

coderabbitai · 2026-02-27T23:50:27Z

📝 Walkthrough

Walkthrough

Adds configurable sentence-capitalization to transcript assembly: new CapitalizeSentences config, updated parsing, pipeline passing an Options struct to Assemble, and implementation of sentence-case logic (including pronoun "I") with tests and docs updated.

Changes

Cohort / File(s)	Summary
Config types & defaults `apps/sotto/internal/config/types.go`, `apps/sotto/internal/config/defaults.go`	Add `CapitalizeSentences bool` to `TranscriptConfig` and set default true.
Config parsing & tests `apps/sotto/internal/config/parser_jsonc.go`, `apps/sotto/internal/config/parser_legacy.go`, `apps/sotto/internal/config/parser_test.go`	Parse `transcript.capitalize_sentences` from JSONC and legacy formats; add tests for both parsing paths.
Transcript assembly core `apps/sotto/internal/transcript/assemble.go`, `apps/sotto/internal/transcript/assemble_test.go`	Change `Assemble` to accept `Options { TrailingSpace, CapitalizeSentences }`; implement capitalization and trailing-space handling; update and add many tests for sentence-case behavior and edge cases.
New transcript helpers `apps/sotto/internal/transcript/boundary_classifier.go`, `apps/sotto/internal/transcript/sentence_case.go`, `apps/sotto/internal/transcript/pronoun_case.go`	Introduce sentence-boundary classification, sentence-start capitalization, and pronoun "I" capitalization logic.
Pipeline integration & tests `apps/sotto/internal/pipeline/transcriber.go`, `apps/sotto/internal/pipeline/transcriber_test.go`	Pass `transcript.Options` from pipeline; update test expectations to reflect capitalization.
Docs & README `README.md`, `docs/architecture.md`, `docs/configuration.md`	Document new `transcript.capitalize_sentences` option, update feature summary, architecture note, and example config.

Sequence Diagram(s)

sequenceDiagram
    participant Config as Configuration\nParser
    participant Transcriber as Transcriber
    participant Assemble as Transcript\nAssembler

    Config->>Transcriber: Provide TranscriptConfig\n(TrailingSpace, CapitalizeSentences)
    Transcriber->>Assemble: Assemble(segments, Options{TrailingSpace, CapitalizeSentences})
    Assemble->>Assemble: Join segments & normalize whitespace
    rect rgba(100,150,255,0.5)
    Assemble->>Assemble: If CapitalizeSentences: apply sentence-case rules\n(start-of-sentence, boundary classifier, pronoun "I")
    end
    rect rgba(100,200,150,0.5)
    Assemble->>Assemble: If TrailingSpace: append trailing space
    end
    Assemble->>Transcriber: Return formatted transcript

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through code and found the start,
I nudged each sentence to play its part,
I lifted little "i" to stand up tall,
I kept the trailing space if you call,
— a rabbit's nibble, tidy and small

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 6.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(transcript): add configurable sentence capitalization' clearly and specifically summarizes the main change: adding a new configurable feature for sentence capitalization in transcripts.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/transcript-sentence-capitalization

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4000b99710

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

apps/sotto/internal/transcript/assemble.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 51-73: The capitalizeSentenceStarts function treats every '.' as a
sentence boundary; change the logic so punctuation only resets capitalizeNext
when the '.' is a true sentence terminator (e.g., followed by whitespace and an
uppercaseable letter) and not when it's part of a decimal/number. In practice,
when iterating runes in capitalizeSentenceStarts, detect a '.' and check
surrounding runes (previous and next) to skip setting capitalizeNext if the dot
is between digits (decimal like '2.0') or if the next non-space rune is a digit;
otherwise keep the existing behavior for '.', '!' and '?' to set capitalizeNext.
Ensure you reference the existing capitalizeNext variable and the switch
handling of '.', '!', '?' while adding the additional rune-context checks.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44b2d3b and 4000b99.

📒 Files selected for processing (12)

README.md
apps/sotto/internal/config/defaults.go
apps/sotto/internal/config/parser_jsonc.go
apps/sotto/internal/config/parser_legacy.go
apps/sotto/internal/config/parser_test.go
apps/sotto/internal/config/types.go
apps/sotto/internal/pipeline/transcriber.go
apps/sotto/internal/pipeline/transcriber_test.go
apps/sotto/internal/transcript/assemble.go
apps/sotto/internal/transcript/assemble_test.go
docs/architecture.md
docs/configuration.md

apps/sotto/internal/transcript/assemble.go

Only capitalize after .!? when a whitespace boundary follows, preventing domain/decimal fragments like example.com or v2.1 from being mutated. Adds regression tests for boundary handling.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

apps/sotto/internal/transcript/assemble_test.go (1)

55-73: Add a regression test for leading-boundary inputs to prevent double capitalization.

The suite covers domain/decimal/quote boundaries well, but it misses cases like 2. hello there, which should become 2. Hello there (not 2. HEllo there).

🧪 Suggested test addition

 func TestAssembleSentenceCaseHandlesQuoteAfterBoundary(t *testing.T) {
 	t.Parallel()

 	got := Assemble([]string{"he said. \"hello there\" and left."}, Options{
 		TrailingSpace:       false,
 		CapitalizeSentences: true,
 	})
 	require.Equal(t, "He said. \"Hello there\" and left.", got)
 }
+
+func TestAssembleSentenceCaseLeadingBoundaryDoesNotDoubleCapitalize(t *testing.T) {
+	t.Parallel()
+
+	got := Assemble([]string{"2. hello there"}, Options{
+		TrailingSpace:       false,
+		CapitalizeSentences: true,
+	})
+	require.Equal(t, "2. Hello there", got)
+}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/sotto/internal/transcript/assemble_test.go` around lines 55 - 73, Add a
regression test that ensures sentence-capitalization does not double-capitalize
after a leading boundary like "2." by creating a new test (e.g.,
TestAssembleSentenceCaseHandlesLeadingBoundary) that calls Assemble with input
such as "2. hello there" using Options{TrailingSpace:false,
CapitalizeSentences:true} and asserts the output equals "2. Hello there"; refer
to the Assemble function and Options type to locate where other
sentence-capitalization tests live and mirror their structure for parallel
execution and assertion style.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 60-63: When handling rune capitalization in assemble.go, ensure
that when capitalizeStart is true and you uppercase the rune (the block
referencing capitalizeStart and unicode.ToUpper), you also clear pendingBoundary
so the sentence-boundary state doesn't also trigger further capitalization;
update the branch that sets r = unicode.ToUpper(r) to set pendingBoundary =
false (or otherwise reset the boundary flag) immediately after capitalization to
prevent double-capitalization.

---

Nitpick comments:
In `@apps/sotto/internal/transcript/assemble_test.go`:
- Around line 55-73: Add a regression test that ensures sentence-capitalization
does not double-capitalize after a leading boundary like "2." by creating a new
test (e.g., TestAssembleSentenceCaseHandlesLeadingBoundary) that calls Assemble
with input such as "2. hello there" using Options{TrailingSpace:false,
CapitalizeSentences:true} and asserts the output equals "2. Hello there"; refer
to the Assemble function and Options type to locate where other
sentence-capitalization tests live and mirror their structure for parallel
execution and assertion style.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4000b99 and 798eb26.

📒 Files selected for processing (2)

apps/sotto/internal/transcript/assemble.go
apps/sotto/internal/transcript/assemble_test.go

apps/sotto/internal/transcript/assemble.go

Reset pending sentence-boundary flags when start-of-text capitalization fires to avoid double-capitalization on inputs like '2. hello'. Adds regression coverage for leading-boundary handling.

coderabbitai

🧹 Nitpick comments (1)

apps/sotto/internal/transcript/assemble.go (1)
100-107: Consider adding left curly quotes for completeness.

The current set includes straight quotes and right curly quotes, but omits left curly quotes (' U+2018, " U+201C). While the whitespace requirement handles common cases like . "Hello", adding the opening curly quotes would be more robust for edge cases where ASR outputs left curly quotes directly after punctuation without a space.
💡 Optional: add left curly quotes
 func isSentencePrefixRune(r rune) bool {
 	switch r {
-	case ')', ']', '}', '\'', '"', ''', '"':
+	case ')', ']', '}', '\'', '"', ''', '"', ''', '"':
 		return true
 	default:
 		return false
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/sotto/internal/transcript/assemble.go` around lines 100 - 107, The
isSentencePrefixRune function currently treats right/straight quotes as
sentence-prefix runes but omits left curly quotes; update the switch in
isSentencePrefixRune to also return true for the left single quotation mark ‘
(U+2018) and the left double quotation mark “ (U+201C) so they are recognized
the same as '\'' and '"' for sentence-prefix detection.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 100-107: The isSentencePrefixRune function currently treats
right/straight quotes as sentence-prefix runes but omits left curly quotes;
update the switch in isSentencePrefixRune to also return true for the left
single quotation mark ‘ (U+2018) and the left double quotation mark “ (U+201C)
so they are recognized the same as '\'' and '"' for sentence-prefix detection.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 798eb26 and 9b48152.

📒 Files selected for processing (2)

apps/sotto/internal/transcript/assemble.go
apps/sotto/internal/transcript/assemble_test.go

Refine abbreviation/initialism boundary handling for pronoun and locative contexts, and split transcript casing internals into focused files for maintainability.

Add coverage for pronoun-start and locative initialism cases while preserving conjunction non-boundary behavior around etc./u.s. periods.

coderabbitai

🧹 Nitpick comments (2)

apps/sotto/internal/transcript/boundary_classifier.go (1)

270-285: Consolidate duplicated reverse token scanning.

Line [270]-Line [285] and Line [352]-Line [367] implement near-identical backward scans. Extracting a shared helper would reduce drift risk as token rules evolve.

♻️ Proposed refactor

+func scanSentenceTokenStart(runes []rune, from int) int {
+	start := from
+	for start >= 0 {
+		if r := runes[start]; unicode.IsLetter(r) || r == '.' {
+			start--
+			continue
+		}
+		break
+	}
+	return start + 1
+}
+
 func sentenceTokenStart(runes []rune, idx int) int {
 	if idx <= 0 || idx >= len(runes) {
 		return -1
 	}
-
-	start := idx - 1
-	for start >= 0 {
-		if r := runes[start]; unicode.IsLetter(r) || r == '.' {
-			start--
-			continue
-		}
-		break
-	}
-
-	return start + 1
+	return scanSentenceTokenStart(runes, idx-1)
 }
 
 func sentenceTokenBeforePeriod(runes []rune, idx int) string {
 	if idx <= 0 || idx >= len(runes) {
 		return ""
 	}
-
-	start := idx - 1
-	for start >= 0 {
-		if r := runes[start]; unicode.IsLetter(r) || r == '.' {
-			start--
-			continue
-		}
-		break
-	}
-
-	return strings.Trim(string(runes[start+1:idx]), ".")
+	start := scanSentenceTokenStart(runes, idx-1)
+	return strings.Trim(string(runes[start:idx]), ".")
 }

As per coding guidelines, "Write canonical, idiomatic Go with gofmt clean, straightforward naming, and small focused functions".

Also applies to: 352-367

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/sotto/internal/transcript/boundary_classifier.go` around lines 270 -
285, The backward-scanning logic is duplicated (in sentenceTokenStart and the
other reverse scan in the same file around lines 352-367); extract a small
helper (e.g., scanBackwardTokenStart(runes []rune, idx int) int) that
encapsulates the loop and character rules (unicode.IsLetter || r == '.') and
return the computed start index, then replace both sentenceTokenStart and the
other backward-scan code to call this helper; keep the same signature/behavior
(return -1 for out-of-range idx) and update callers accordingly and add/adjust a
small unit test exercising edge cases.

apps/sotto/internal/transcript/assemble.go (1)

7-10: Add an explicit default-options constructor to avoid zero-value surprises.

Line [7]-Line [10] makes CapitalizeSentences default to false for Options{}, while runtime config default is true. A constructor/default helper would make call-site intent explicit and reduce accidental behavior drift.
♻️ Proposed refactor
 type Options struct {
 	TrailingSpace       bool
 	CapitalizeSentences bool
 }
+
+func DefaultOptions() Options {
+	return Options{
+		TrailingSpace:       false,
+		CapitalizeSentences: true,
+	}
+}
As per coding guidelines, "Use explicit constructors and dependency wiring with no hidden globals".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/sotto/internal/transcript/assemble.go` around lines 7 - 10, Add an
explicit constructor for Options to avoid zero-value surprises: implement a
NewOptions (or DefaultOptions) function that returns an Options instance with
CapitalizeSentences set to true (and TrailingSpace set to the intended default,
e.g., false), and update call-sites to use it when they rely on runtime
defaults; reference the Options type in assemble.go and create func NewOptions()
Options { return Options{TrailingSpace: false, CapitalizeSentences: true} } (or
return *Options if callers expect pointers).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 7-10: Add an explicit constructor for Options to avoid zero-value
surprises: implement a NewOptions (or DefaultOptions) function that returns an
Options instance with CapitalizeSentences set to true (and TrailingSpace set to
the intended default, e.g., false), and update call-sites to use it when they
rely on runtime defaults; reference the Options type in assemble.go and create
func NewOptions() Options { return Options{TrailingSpace: false,
CapitalizeSentences: true} } (or return *Options if callers expect pointers).

In `@apps/sotto/internal/transcript/boundary_classifier.go`:
- Around line 270-285: The backward-scanning logic is duplicated (in
sentenceTokenStart and the other reverse scan in the same file around lines
352-367); extract a small helper (e.g., scanBackwardTokenStart(runes []rune, idx
int) int) that encapsulates the loop and character rules (unicode.IsLetter || r
== '.') and return the computed start index, then replace both
sentenceTokenStart and the other backward-scan code to call this helper; keep
the same signature/behavior (return -1 for out-of-range idx) and update callers
accordingly and add/adjust a small unit test exercising edge cases.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b48152 and 5796980.

📒 Files selected for processing (5)

apps/sotto/internal/transcript/assemble.go
apps/sotto/internal/transcript/assemble_test.go
apps/sotto/internal/transcript/boundary_classifier.go
apps/sotto/internal/transcript/pronoun_case.go
apps/sotto/internal/transcript/sentence_case.go

rbright added 2 commits February 27, 2026 16:43

docs(transcript): document sentence capitalization behavior

4000b99

chatgpt-codex-connector bot reviewed Feb 27, 2026

View reviewed changes

apps/sotto/internal/transcript/assemble.go Outdated Show resolved Hide resolved

coderabbitai bot reviewed Feb 27, 2026

View reviewed changes

apps/sotto/internal/transcript/assemble.go Outdated Show resolved Hide resolved

fix(transcript): avoid false sentence boundaries in casing

798eb26

Only capitalize after .!? when a whitespace boundary follows, preventing domain/decimal fragments like example.com or v2.1 from being mutated. Adds regression tests for boundary handling.

coderabbitai bot reviewed Feb 28, 2026

View reviewed changes

apps/sotto/internal/transcript/assemble.go Outdated Show resolved Hide resolved

fix(transcript): clear boundary state after initial capitalization

9b48152

Reset pending sentence-boundary flags when start-of-text capitalization fires to avoid double-capitalization on inputs like '2. hello'. Adds regression coverage for leading-boundary handling.

coderabbitai bot reviewed Mar 1, 2026

View reviewed changes

rbright added 2 commits March 1, 2026 00:16

fix(transcript): harden sentence-boundary capitalization logic

217d96f

Refine abbreviation/initialism boundary handling for pronoun and locative contexts, and split transcript casing internals into focused files for maintainability.

test(transcript): expand abbreviation boundary regression corpus

5796980

Add coverage for pronoun-start and locative initialism cases while preserving conjunction non-boundary behavior around etc./u.s. periods.

coderabbitai bot reviewed Mar 1, 2026

View reviewed changes

rbright merged commit 971b3ef into main Mar 1, 2026
2 checks passed

rbright deleted the feat/transcript-sentence-capitalization branch March 1, 2026 07:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(transcript): add configurable sentence capitalization#6

feat(transcript): add configurable sentence capitalization#6
rbright merged 6 commits intomainfrom
feat/transcript-sentence-capitalization

rbright commented Feb 27, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 27, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rbright commented Feb 27, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rbright commented Feb 27, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 27, 2026 •

edited

Loading