Skip to content

feat(transcript): add configurable sentence capitalization#6

Merged
rbright merged 6 commits intomainfrom
feat/transcript-sentence-capitalization
Mar 1, 2026
Merged

feat(transcript): add configurable sentence capitalization#6
rbright merged 6 commits intomainfrom
feat/transcript-sentence-capitalization

Conversation

@rbright
Copy link
Owner

@rbright rbright commented Feb 27, 2026

Summary

  • add transcript sentence-case normalization in assembly so output starts sentences with capitals
  • add transcript.capitalize_sentences config support (JSONC + legacy parser), defaulting to true
  • wire transcript formatting options through the transcriber pipeline and expand config/transcript/pipeline tests
  • update README/configuration/architecture docs to reflect the new transcript formatting behavior

Test Plan

  • just ci-check
  • nix build 'path:.#sotto'

Summary by CodeRabbit

  • New Features

    • Added transcript.capitalize_sentences (enabled by default) to apply sentence-case normalization to generated transcripts while preserving trailing-space behavior.
  • Documentation

    • Updated configuration docs and examples to include the new capitalization option and its default.
  • Tests

    • Added tests covering parsing the new option (JSONC and legacy) and many sentence-case behaviors and edge cases.

Default transcript assembly now applies sentence-case normalization and promotes standalone i/i'm forms to I/I'm.

Adds transcript.capitalize_sentences config support for both JSONC and legacy parsers, wires options through the transcriber pipeline, and expands transcript/config/pipeline tests.
@coderabbitai
Copy link

coderabbitai bot commented Feb 27, 2026

📝 Walkthrough

Walkthrough

Adds configurable sentence-capitalization to transcript assembly: new CapitalizeSentences config, updated parsing, pipeline passing an Options struct to Assemble, and implementation of sentence-case logic (including pronoun "I") with tests and docs updated.

Changes

Cohort / File(s) Summary
Config types & defaults
apps/sotto/internal/config/types.go, apps/sotto/internal/config/defaults.go
Add CapitalizeSentences bool to TranscriptConfig and set default true.
Config parsing & tests
apps/sotto/internal/config/parser_jsonc.go, apps/sotto/internal/config/parser_legacy.go, apps/sotto/internal/config/parser_test.go
Parse transcript.capitalize_sentences from JSONC and legacy formats; add tests for both parsing paths.
Transcript assembly core
apps/sotto/internal/transcript/assemble.go, apps/sotto/internal/transcript/assemble_test.go
Change Assemble to accept Options { TrailingSpace, CapitalizeSentences }; implement capitalization and trailing-space handling; update and add many tests for sentence-case behavior and edge cases.
New transcript helpers
apps/sotto/internal/transcript/boundary_classifier.go, apps/sotto/internal/transcript/sentence_case.go, apps/sotto/internal/transcript/pronoun_case.go
Introduce sentence-boundary classification, sentence-start capitalization, and pronoun "I" capitalization logic.
Pipeline integration & tests
apps/sotto/internal/pipeline/transcriber.go, apps/sotto/internal/pipeline/transcriber_test.go
Pass transcript.Options from pipeline; update test expectations to reflect capitalization.
Docs & README
README.md, docs/architecture.md, docs/configuration.md
Document new transcript.capitalize_sentences option, update feature summary, architecture note, and example config.

Sequence Diagram(s)

sequenceDiagram
    participant Config as Configuration\nParser
    participant Transcriber as Transcriber
    participant Assemble as Transcript\nAssembler

    Config->>Transcriber: Provide TranscriptConfig\n(TrailingSpace, CapitalizeSentences)
    Transcriber->>Assemble: Assemble(segments, Options{TrailingSpace, CapitalizeSentences})
    Assemble->>Assemble: Join segments & normalize whitespace
    rect rgba(100,150,255,0.5)
    Assemble->>Assemble: If CapitalizeSentences: apply sentence-case rules\n(start-of-sentence, boundary classifier, pronoun "I")
    end
    rect rgba(100,200,150,0.5)
    Assemble->>Assemble: If TrailingSpace: append trailing space
    end
    Assemble->>Transcriber: Return formatted transcript
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through code and found the start,
I nudged each sentence to play its part,
I lifted little "i" to stand up tall,
I kept the trailing space if you call,
— a rabbit's nibble, tidy and small

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(transcript): add configurable sentence capitalization' clearly and specifically summarizes the main change: adding a new configurable feature for sentence capitalization in transcripts.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/transcript-sentence-capitalization

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4000b99710

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 51-73: The capitalizeSentenceStarts function treats every '.' as a
sentence boundary; change the logic so punctuation only resets capitalizeNext
when the '.' is a true sentence terminator (e.g., followed by whitespace and an
uppercaseable letter) and not when it's part of a decimal/number. In practice,
when iterating runes in capitalizeSentenceStarts, detect a '.' and check
surrounding runes (previous and next) to skip setting capitalizeNext if the dot
is between digits (decimal like '2.0') or if the next non-space rune is a digit;
otherwise keep the existing behavior for '.', '!' and '?' to set capitalizeNext.
Ensure you reference the existing capitalizeNext variable and the switch
handling of '.', '!', '?' while adding the additional rune-context checks.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44b2d3b and 4000b99.

📒 Files selected for processing (12)
  • README.md
  • apps/sotto/internal/config/defaults.go
  • apps/sotto/internal/config/parser_jsonc.go
  • apps/sotto/internal/config/parser_legacy.go
  • apps/sotto/internal/config/parser_test.go
  • apps/sotto/internal/config/types.go
  • apps/sotto/internal/pipeline/transcriber.go
  • apps/sotto/internal/pipeline/transcriber_test.go
  • apps/sotto/internal/transcript/assemble.go
  • apps/sotto/internal/transcript/assemble_test.go
  • docs/architecture.md
  • docs/configuration.md

Only capitalize after .!? when a whitespace boundary follows, preventing domain/decimal fragments like example.com or v2.1 from being mutated. Adds regression tests for boundary handling.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
apps/sotto/internal/transcript/assemble_test.go (1)

55-73: Add a regression test for leading-boundary inputs to prevent double capitalization.

The suite covers domain/decimal/quote boundaries well, but it misses cases like 2. hello there, which should become 2. Hello there (not 2. HEllo there).

🧪 Suggested test addition
 func TestAssembleSentenceCaseHandlesQuoteAfterBoundary(t *testing.T) {
 	t.Parallel()

 	got := Assemble([]string{"he said. \"hello there\" and left."}, Options{
 		TrailingSpace:       false,
 		CapitalizeSentences: true,
 	})
 	require.Equal(t, "He said. \"Hello there\" and left.", got)
 }
+
+func TestAssembleSentenceCaseLeadingBoundaryDoesNotDoubleCapitalize(t *testing.T) {
+	t.Parallel()
+
+	got := Assemble([]string{"2. hello there"}, Options{
+		TrailingSpace:       false,
+		CapitalizeSentences: true,
+	})
+	require.Equal(t, "2. Hello there", got)
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/sotto/internal/transcript/assemble_test.go` around lines 55 - 73, Add a
regression test that ensures sentence-capitalization does not double-capitalize
after a leading boundary like "2." by creating a new test (e.g.,
TestAssembleSentenceCaseHandlesLeadingBoundary) that calls Assemble with input
such as "2. hello there" using Options{TrailingSpace:false,
CapitalizeSentences:true} and asserts the output equals "2. Hello there"; refer
to the Assemble function and Options type to locate where other
sentence-capitalization tests live and mirror their structure for parallel
execution and assertion style.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 60-63: When handling rune capitalization in assemble.go, ensure
that when capitalizeStart is true and you uppercase the rune (the block
referencing capitalizeStart and unicode.ToUpper), you also clear pendingBoundary
so the sentence-boundary state doesn't also trigger further capitalization;
update the branch that sets r = unicode.ToUpper(r) to set pendingBoundary =
false (or otherwise reset the boundary flag) immediately after capitalization to
prevent double-capitalization.

---

Nitpick comments:
In `@apps/sotto/internal/transcript/assemble_test.go`:
- Around line 55-73: Add a regression test that ensures sentence-capitalization
does not double-capitalize after a leading boundary like "2." by creating a new
test (e.g., TestAssembleSentenceCaseHandlesLeadingBoundary) that calls Assemble
with input such as "2. hello there" using Options{TrailingSpace:false,
CapitalizeSentences:true} and asserts the output equals "2. Hello there"; refer
to the Assemble function and Options type to locate where other
sentence-capitalization tests live and mirror their structure for parallel
execution and assertion style.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4000b99 and 798eb26.

📒 Files selected for processing (2)
  • apps/sotto/internal/transcript/assemble.go
  • apps/sotto/internal/transcript/assemble_test.go

Reset pending sentence-boundary flags when start-of-text capitalization fires to avoid double-capitalization on inputs like '2. hello'. Adds regression coverage for leading-boundary handling.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
apps/sotto/internal/transcript/assemble.go (1)

100-107: Consider adding left curly quotes for completeness.

The current set includes straight quotes and right curly quotes, but omits left curly quotes (' U+2018, " U+201C). While the whitespace requirement handles common cases like . "Hello", adding the opening curly quotes would be more robust for edge cases where ASR outputs left curly quotes directly after punctuation without a space.

💡 Optional: add left curly quotes
 func isSentencePrefixRune(r rune) bool {
 	switch r {
-	case ')', ']', '}', '\'', '"', ''', '"':
+	case ')', ']', '}', '\'', '"', ''', '"', ''', '"':
 		return true
 	default:
 		return false
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/sotto/internal/transcript/assemble.go` around lines 100 - 107, The
isSentencePrefixRune function currently treats right/straight quotes as
sentence-prefix runes but omits left curly quotes; update the switch in
isSentencePrefixRune to also return true for the left single quotation mark ‘
(U+2018) and the left double quotation mark “ (U+201C) so they are recognized
the same as '\'' and '"' for sentence-prefix detection.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 100-107: The isSentencePrefixRune function currently treats
right/straight quotes as sentence-prefix runes but omits left curly quotes;
update the switch in isSentencePrefixRune to also return true for the left
single quotation mark ‘ (U+2018) and the left double quotation mark “ (U+201C)
so they are recognized the same as '\'' and '"' for sentence-prefix detection.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 798eb26 and 9b48152.

📒 Files selected for processing (2)
  • apps/sotto/internal/transcript/assemble.go
  • apps/sotto/internal/transcript/assemble_test.go

rbright added 2 commits March 1, 2026 00:16
Refine abbreviation/initialism boundary handling for pronoun and locative contexts, and split transcript casing internals into focused files for maintainability.
Add coverage for pronoun-start and locative initialism cases while preserving conjunction non-boundary behavior around etc./u.s. periods.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
apps/sotto/internal/transcript/boundary_classifier.go (1)

270-285: Consolidate duplicated reverse token scanning.

Line [270]-Line [285] and Line [352]-Line [367] implement near-identical backward scans. Extracting a shared helper would reduce drift risk as token rules evolve.

♻️ Proposed refactor
+func scanSentenceTokenStart(runes []rune, from int) int {
+	start := from
+	for start >= 0 {
+		if r := runes[start]; unicode.IsLetter(r) || r == '.' {
+			start--
+			continue
+		}
+		break
+	}
+	return start + 1
+}
+
 func sentenceTokenStart(runes []rune, idx int) int {
 	if idx <= 0 || idx >= len(runes) {
 		return -1
 	}
-
-	start := idx - 1
-	for start >= 0 {
-		if r := runes[start]; unicode.IsLetter(r) || r == '.' {
-			start--
-			continue
-		}
-		break
-	}
-
-	return start + 1
+	return scanSentenceTokenStart(runes, idx-1)
 }
 
 func sentenceTokenBeforePeriod(runes []rune, idx int) string {
 	if idx <= 0 || idx >= len(runes) {
 		return ""
 	}
-
-	start := idx - 1
-	for start >= 0 {
-		if r := runes[start]; unicode.IsLetter(r) || r == '.' {
-			start--
-			continue
-		}
-		break
-	}
-
-	return strings.Trim(string(runes[start+1:idx]), ".")
+	start := scanSentenceTokenStart(runes, idx-1)
+	return strings.Trim(string(runes[start:idx]), ".")
 }

As per coding guidelines, "Write canonical, idiomatic Go with gofmt clean, straightforward naming, and small focused functions".

Also applies to: 352-367

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/sotto/internal/transcript/boundary_classifier.go` around lines 270 -
285, The backward-scanning logic is duplicated (in sentenceTokenStart and the
other reverse scan in the same file around lines 352-367); extract a small
helper (e.g., scanBackwardTokenStart(runes []rune, idx int) int) that
encapsulates the loop and character rules (unicode.IsLetter || r == '.') and
return the computed start index, then replace both sentenceTokenStart and the
other backward-scan code to call this helper; keep the same signature/behavior
(return -1 for out-of-range idx) and update callers accordingly and add/adjust a
small unit test exercising edge cases.
apps/sotto/internal/transcript/assemble.go (1)

7-10: Add an explicit default-options constructor to avoid zero-value surprises.

Line [7]-Line [10] makes CapitalizeSentences default to false for Options{}, while runtime config default is true. A constructor/default helper would make call-site intent explicit and reduce accidental behavior drift.

♻️ Proposed refactor
 type Options struct {
 	TrailingSpace       bool
 	CapitalizeSentences bool
 }
+
+func DefaultOptions() Options {
+	return Options{
+		TrailingSpace:       false,
+		CapitalizeSentences: true,
+	}
+}

As per coding guidelines, "Use explicit constructors and dependency wiring with no hidden globals".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/sotto/internal/transcript/assemble.go` around lines 7 - 10, Add an
explicit constructor for Options to avoid zero-value surprises: implement a
NewOptions (or DefaultOptions) function that returns an Options instance with
CapitalizeSentences set to true (and TrailingSpace set to the intended default,
e.g., false), and update call-sites to use it when they rely on runtime
defaults; reference the Options type in assemble.go and create func NewOptions()
Options { return Options{TrailingSpace: false, CapitalizeSentences: true} } (or
return *Options if callers expect pointers).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@apps/sotto/internal/transcript/assemble.go`:
- Around line 7-10: Add an explicit constructor for Options to avoid zero-value
surprises: implement a NewOptions (or DefaultOptions) function that returns an
Options instance with CapitalizeSentences set to true (and TrailingSpace set to
the intended default, e.g., false), and update call-sites to use it when they
rely on runtime defaults; reference the Options type in assemble.go and create
func NewOptions() Options { return Options{TrailingSpace: false,
CapitalizeSentences: true} } (or return *Options if callers expect pointers).

In `@apps/sotto/internal/transcript/boundary_classifier.go`:
- Around line 270-285: The backward-scanning logic is duplicated (in
sentenceTokenStart and the other reverse scan in the same file around lines
352-367); extract a small helper (e.g., scanBackwardTokenStart(runes []rune, idx
int) int) that encapsulates the loop and character rules (unicode.IsLetter || r
== '.') and return the computed start index, then replace both
sentenceTokenStart and the other backward-scan code to call this helper; keep
the same signature/behavior (return -1 for out-of-range idx) and update callers
accordingly and add/adjust a small unit test exercising edge cases.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b48152 and 5796980.

📒 Files selected for processing (5)
  • apps/sotto/internal/transcript/assemble.go
  • apps/sotto/internal/transcript/assemble_test.go
  • apps/sotto/internal/transcript/boundary_classifier.go
  • apps/sotto/internal/transcript/pronoun_case.go
  • apps/sotto/internal/transcript/sentence_case.go

@rbright rbright merged commit 971b3ef into main Mar 1, 2026
2 checks passed
@rbright rbright deleted the feat/transcript-sentence-capitalization branch March 1, 2026 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant