#710 Make the schema change renderer nested schema friendly. #713
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can change this behavior in the settings.
Walkthrough

compareSchemas was rewritten to recursively traverse nested StructType and ArrayType schemas with path-aware names, accumulating new, deleted, and changed fields; dataTypeToString was enhanced to represent nested arrays, structs, and varchar(length). A unit test validating nested change detection was added.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala (1)
217-233: Dead `Seq.empty` expressions in `processArray`

`processArray` has return type `Unit`. The `Seq.empty` expressions on lines 221 and 224 are evaluated and immediately discarded; they carry no meaning and will confuse readers into thinking there is an intentional return value.

♻️ Proposed cleanup
```diff
 def processArray(array1: ArrayType, array2: ArrayType, metadata1: Metadata, metadata2: Metadata, path: String = ""): Unit = {
   (array1.elementType, array2.elementType) match {
     case (st1: StructType, st2: StructType) =>
       processStruct(st1, st2, s"$path[].")
-      Seq.empty
     case (ar1: ArrayType, ar2: ArrayType) =>
       processArray(ar1, ar2, metadata1, metadata2, s"$path[]")
-      Seq.empty
     case _ =>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala` around lines 217-233: remove the dead Seq.empty expressions inside processArray: when matching on (st1: StructType, st2: StructType) and (ar1: ArrayType, ar2: ArrayType), simply call processStruct(st1, st2, s"$path[].") and processArray(ar1, ar2, metadata1, metadata2, s"$path[]") respectively, and delete the trailing Seq.empty tokens so the method remains Unit-returning and clearer; keep unchanged the logic that computes dt1/dt2 and records FieldChange.ChangedType when types differ.

pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala (1)
291-349: LGTM; consider adding a mixed-case field-name test case

The new test thoroughly exercises nested struct recursion, array-of-struct element changes, new/deleted fields at multiple depth levels, and the `array<struct<...>>` display format.

One gap: because all field names in both schemas are lowercase, the test does not exercise the case-insensitive lookup path (i.e., the bug flagged in `processStruct`). A small additional assertion with a schema that has a field named, say, `"City"` vs `"city"` would both catch the existing bug and guard against regressions.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala` around lines 291-349: add a small mixed-case field-name scenario to the "detect nested type changes" test to exercise the case-insensitive lookup path in processStruct: create a variant of schema1/schema2 (or a tiny additional pair) where a nested field differs only by case (e.g., "City" vs "city") and assert that compareSchemas still detects the change (use ChangedType/NewField/DeletedField assertions similar to existing ones); reference compareSchemas and processStruct so the test triggers the case-insensitive branch and prevents regressions.
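As a standalone illustration of the `Unit`-returning cleanup suggested for `processArray` above, here is a minimal sketch on a toy type model (the `StructT`/`ArrayT` names and the change-string format are hypothetical, not Spark's API): each match branch runs purely for its side effect, so no trailing placeholder expression such as `Seq.empty` is needed.

```scala
import scala.collection.mutable.ListBuffer

// Toy stand-in for Spark's DataType tree (hypothetical, for illustration only).
sealed trait FieldType
case class StructT(fields: Map[String, FieldType]) extends FieldType
case class ArrayT(element: FieldType) extends FieldType
case object IntT extends FieldType
case object StringT extends FieldType

val changed = ListBuffer.empty[String]

// Unit-returning recursive comparison: each branch is evaluated for its side
// effect only, so no dead trailing expressions are required.
def compare(t1: FieldType, t2: FieldType, path: String): Unit =
  (t1, t2) match {
    case (StructT(f1), StructT(f2)) =>
      f1.keySet.intersect(f2.keySet).foreach(k => compare(f1(k), f2(k), s"$path.$k"))
    case (ArrayT(e1), ArrayT(e2)) =>
      compare(e1, e2, s"$path[]")
    case (a, b) if a != b =>
      changed += s"$path: $a -> $b"
    case _ => ()
  }

compare(StructT(Map("a" -> ArrayT(IntT))), StructT(Map("a" -> ArrayT(StringT))), "")
```

Running the comparison on the two toy schemas records a single change at path `.a[]`.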
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`:
- Around line 178-214: fields1 and fields2 are keyed with lowercased names but
subsequent contains/lookup use the original field name causing incorrect
new/deleted/changed detection and potential NoSuchElementException; update all
lookups to use the lowercased key (e.g., replace f.name with f.name.toLowerCase
when calling fields1.contains, fields2.contains and when indexing
fields2(f1.name) use fields2(f1.name.toLowerCase)), ensuring newColumns,
deletedColumns and changedType (and the branch that calls
processStruct/processArray and dataTypeToString) reference the lowercased map
keys while still using the original f.name for human-readable path strings added
to newFields, deletedFields and changedFields.
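The consistent-lowercase-key approach described in the prompt above can be sketched without Spark. This is a simplified model: `Field` and the diff logic are illustrative only, and note that the commit message elsewhere in this PR suggests the project may ultimately prefer fully case-sensitive comparison instead.

```scala
case class Field(name: String, dataType: String)

val schema1 = Seq(Field("City", "string"), Field("zip", "int"))
val schema2 = Seq(Field("city", "string"), Field("zip", "long"))

// Key both lookup maps by the lowercased name, and use the same lowercased
// key for every lookup; keep the original name for human-readable output.
val fields1 = schema1.map(f => f.name.toLowerCase -> f).toMap
val fields2 = schema2.map(f => f.name.toLowerCase -> f).toMap

val newFields     = schema2.filterNot(f => fields1.contains(f.name.toLowerCase)).map(_.name)
val deletedFields = schema1.filterNot(f => fields2.contains(f.name.toLowerCase)).map(_.name)
val changedFields = schema1.flatMap { f1 =>
  fields2.get(f1.name.toLowerCase).collect {
    case f2 if f2.dataType != f1.dataType => s"${f1.name}: ${f1.dataType} -> ${f2.dataType}"
  }
}
```

With consistent keys, "City" vs "city" resolve to the same field (no spurious new/deleted entries, and no `NoSuchElementException`), and only the zip type change is reported.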
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala
pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala (outdated comment, resolved)
Unit Test Coverage
🧹 Nitpick comments (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala (2)
169-174: `dataTypeToString` for nested arrays returns `"array<array>"`, losing the inner element type.

When a field is `ArrayType(ArrayType(StringType))`, line 170 produces `"array<array>"` (since `typeName` of `ArrayType` is `"array"`). This is only relevant for the leaf comparison in `processArray` when both sides are non-matching nested array types that don't recurse further. In practice this is unlikely to matter, but for completeness you could handle it:

Optional improvement
```diff
- case a: ArrayType => s"array<${a.elementType.typeName}>"
+ case a: ArrayType => s"array<${dataTypeToString(a.elementType, Metadata.empty)}>"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala` around lines 169-174: the dataTypeToString logic loses the inner element type for nested arrays (e.g., ArrayType(ArrayType(StringType))) because it uses a.elementType.typeName; update the ArrayType case in dataTypeToString to detect when a.elementType is itself an ArrayType (or other complex type) and recursively call dataTypeToString(a.elementType) to produce a full nested string like "array<array<...>>" or "array<varchar(n)>" instead of "array<array>"; adjust any callers such as processArray that rely on this output to use the new recursive representation.
217-231: Nullable mismatch in arrays is silently ignored.

`processArray` recurses on `elementType` but never compares `array1.containsNull` vs `array2.containsNull`. If the nullability of the array elements changes, it won't be reported. This may be intentional (the existing `transformSchemaForCatalog` normalizes nullability), but worth confirming.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala` around lines 217 - 231, processArray currently compares elementType but ignores array nullability; add a comparison of array1.containsNull vs array2.containsNull after determining element types (inside processArray) and, when they differ, append a FieldChange to changedFields (e.g., use a nullable-change variant like FieldChange.ChangedNullable or add a new FieldChange case) with the same path (include "[]" to indicate array) and the two boolean values so nullability changes are reported; update processArray to reference array1.containsNull, array2.containsNull, changedFields and FieldChange accordingly.
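To make the containsNull suggestion concrete, here is a minimal sketch with a toy array type (the `ArrayT` model and change-string format are hypothetical, not Spark's `ArrayType`): element-type and element-nullability differences are recorded separately, both with a `[]` path marker.

```scala
import scala.collection.mutable.ListBuffer

// Toy array type carrying element nullability, loosely mirroring
// Spark's ArrayType(elementType, containsNull). Illustrative only.
case class ArrayT(elementType: String, containsNull: Boolean)

val changedFields = ListBuffer.empty[String]

def processArray(a1: ArrayT, a2: ArrayT, path: String): Unit = {
  // Report a nullability change on the array elements.
  if (a1.containsNull != a2.containsNull)
    changedFields += s"$path[]: containsNull ${a1.containsNull} -> ${a2.containsNull}"
  // Report an element type change.
  if (a1.elementType != a2.elementType)
    changedFields += s"$path[]: ${a1.elementType} -> ${a2.elementType}"
}

processArray(ArrayT("int", containsNull = true), ArrayT("int", containsNull = false), "tags")
```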
…n purposes. We want to know when schema changes even if the change is casing of fields.
275a26d to 790a100 (Compare)
🧹 Nitpick comments (1)
pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala (1)
291-349: Solid test coverage for nested schema comparison.

The 7 expected diffs are correctly accounted for, and the per-path assertions cover all three change categories for both nested structs and arrays-of-structs. A couple of observations:

- `val df` on line 347: naming the `DeletedField` collection `df` is a minor readability hazard in a Spark test file where `df` is universally understood to mean DataFrame. Consider renaming it to, e.g., `deletedFieldsList` or `deleted`.
- Missing coverage for case-sensitivity: no test covers a field rename that differs only in case (e.g., `"City"` in schema1 vs `"city"` in schema2) to document/assert the current case-sensitive comparison behaviour. Adding such a test would pin the semantics and prevent silent regressions if case-insensitive matching is introduced later.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala` around lines 291-349: rename the local val `df` (the DeletedField collection) to a clearer name like `deletedFieldsList` in the "detect nested type changes" test to avoid confusion with Spark DataFrame; update the reference in the pattern match and assertions (the val is created where the code does `val df = diff.collect { case df: DeletedField => df }`). Also add a small assertion block in the same SparkUtilsSuite test (or a new test) that calls compareSchemas with two schemas differing only by field name case (e.g., "City" vs "city") to assert the current case-sensitive behavior (use compareSchemas, inspect returned NewField/DeletedField/ChangedType entries, and assert expected counts and columnName values) so the intended case-sensitivity is documented and prevented from regressing.
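The case-sensitivity behaviour the reviewer wants pinned can be sketched with plain sets (a toy model; real code would inspect the NewField/DeletedField results returned by compareSchemas): under case-sensitive comparison, a name differing only in case shows up as one new and one deleted field.

```scala
// Case-SENSITIVE name diff: "City" and "city" are different fields.
// (Illustrative stand-in for inspecting compareSchemas results.)
def diffFieldNames(schema1: Set[String], schema2: Set[String]): (Set[String], Set[String]) =
  (schema2 -- schema1, schema1 -- schema2) // (new fields, deleted fields)

val (newFields, deletedFields) = diffFieldNames(Set("City", "zip"), Set("city", "zip"))
```

A test asserting exactly this shape would document the semantics and fail loudly if case-insensitive matching were introduced later.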
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala
pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala
Actionable comments posted: 1
```scala
case a: ArrayType if a.elementType.isInstanceOf[StructType] => "array<struct<...>>"
case a: ArrayType => s"array<${a.elementType.typeName}>"
case _: StructType => "struct<...>"
```
🧩 Analysis chain
🌐 Web query:
Apache Spark Scala API ArrayType nested typeName catalogString simpleString behavior documentation
💡 Result:
In Spark's Scala API (org.apache.spark.sql.types), these methods behave as follows for ArrayType (including nested arrays):

- `typeName: String`
  Returns the JSON type name for the data type. For `ArrayType`, this is just `"array"` (it does not include the element type, even when nested). [1][2]
- `simpleString: String`
  Returns a readable type string and recursively uses the nested type's `simpleString`. For an array, it renders like `array<elementTypeSimpleString>`. Example: `ArrayType(ArrayType(IntegerType))` → `array<array<int>>`. [1][2]
- `catalogString: String`
  Returns the string representation used when saving the type in external catalogs and recursively uses nested types' `catalogString`. For an array, it renders like `array<elementTypeCatalogString>`. Example: `ArrayType(StructType(...))` → `array<struct<...>>`. [1][2]
- SQL docs alignment
  Spark SQL documentation describes `ArrayType` as `ARRAY<element_type>`, matching the angle-bracket form used by `simpleString`/`catalogString`. [3]

Sources: Spark DataType API docs (defines typeName, simpleString, catalogString) [1], Spark ScalaDoc for ArrayType [2], Spark SQL data types reference [3].
🏁 Script executed:

```shell
cd pramen/core/src/main/scala/za/co/absa/pramen/core/utils && sed -n '165,190p' SparkUtils.scala
```

Repository: AbsaOSS/pramen
Length of output: 1387
Preserve nested array detail in rendered type strings.

Using `a.elementType.typeName` returns just `"array"` for nested arrays, collapsing type detail (e.g., `array<array<int>>` becomes `array<array>`). This reduces the fidelity of new/deleted type notifications.
💡 Proposed fix

```diff
 def dataTypeToString(dt: DataType, metadata: Metadata): String = {
   val maxLength = getLengthFromMetadata(metadata).getOrElse(0)
   dt match {
-    case a: ArrayType if a.elementType.isInstanceOf[StructType] => "array<struct<...>>"
-    case a: ArrayType => s"array<${a.elementType.typeName}>"
+    case a: ArrayType => s"array<${dataTypeToString(a.elementType, metadata)}>"
     case _: StructType => "struct<...>"
     case _: StringType if maxLength > 0 => s"varchar($maxLength)"
     case _ => dt.typeName
   }
 }
```

The recursive call handles both nested arrays and struct arrays correctly, consolidating the special case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`
around lines 174 - 176, The array handling in the pattern match (the cases
matching "case a: ArrayType if a.elementType.isInstanceOf[StructType]" and "case
a: ArrayType => s\"array<${a.elementType.typeName}>\"") loses nested array
detail by using a.elementType.typeName; update the second ArrayType branch to
render the element type recursively using the same renderer used for other
branches (i.e., call the surrounding type-to-string helper on a.elementType
instead of .typeName) so nested arrays and struct arrays produce full strings
like "array<array<int>>" or "array<struct<...>>".
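The recursive rendering idea in the proposed fix can be shown on a toy type tree (the `DType` names are illustrative stand-ins, not Spark's API): the array branch calls itself on the element type, so nested detail is preserved.

```scala
// Minimal stand-in for Spark's DataType hierarchy (illustrative names only).
sealed trait DType
case object IntT extends DType
case class VarcharT(maxLength: Int) extends DType
case class ArrayT(elementType: DType) extends DType
case class StructT(fields: Seq[(String, DType)]) extends DType

// Recursive rendering: nested arrays keep full detail instead of
// collapsing to "array<array>".
def dataTypeToString(dt: DType): String = dt match {
  case ArrayT(e)   => s"array<${dataTypeToString(e)}>"
  case StructT(_)  => "struct<...>"
  case VarcharT(n) => s"varchar($n)"
  case IntT        => "int"
}
```

A single recursive branch covers both `array<array<int>>` and `array<struct<...>>`, which is why the special StructType-element case can be dropped.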
Summary by CodeRabbit: Refactor, Tests