#710 Make the schema change renderer nested schema friendly. #713

Merged
yruslan merged 4 commits into main from feature/710-handle-nested-structs-in-notifications
Feb 25, 2026

Conversation

@yruslan
Collaborator

@yruslan yruslan commented Feb 25, 2026

Summary by CodeRabbit

  • Refactor

    • Schema comparison now recursively detects changes inside nested structs and arrays, reporting additions, deletions, and type changes with full path-aware names (including array element notation) for clearer diffs.
    • Type stringification improved: arrays, nested structs and length-constrained strings are rendered more readably (e.g., array<struct<...>>, varchar(n)).
  • Tests

    • Added a test validating detection of nested additions, deletions and type changes (including arrays of primitives and structs).

@coderabbitai

coderabbitai bot commented Feb 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

compareSchemas was rewritten to recursively traverse nested StructType and ArrayType schemas with path-aware names, accumulating new, deleted, and changed fields; dataTypeToString was enhanced to represent nested arrays, structs, and varchar(length). A unit test validating nested change detection was added.
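The recursive, path-aware traversal described in the walkthrough can be illustrated with a minimal, self-contained sketch. The ADT below (`SType`, `Field`, `Diff`) is a simplified stand-in for Spark's `StructType`/`ArrayType` and for the actual `SparkUtils` code; all names and shapes here are assumptions for illustration only.

```scala
// Stand-in type model, not Spark's org.apache.spark.sql.types.
sealed trait SType
case class Primitive(name: String) extends SType
case class Struct(fields: List[Field]) extends SType
case class Arr(element: SType) extends SType
case class Field(name: String, dataType: SType)

sealed trait Diff
case class NewField(path: String) extends Diff
case class DeletedField(path: String) extends Diff
case class ChangedType(path: String, from: String, to: String) extends Diff

// Render a type for diff messages; arrays recurse into their element type.
def render(t: SType): String = t match {
  case Primitive(n) => n
  case Arr(e)       => s"array<${render(e)}>"
  case Struct(_)    => "struct<...>"
}

// Walk two structs in parallel, carrying a path prefix such as "address."
// or "phones[]." so reported field names are fully qualified.
def diffStruct(a: Struct, b: Struct, path: String = ""): List[Diff] = {
  val aMap = a.fields.map(f => f.name -> f.dataType).toMap
  val bMap = b.fields.map(f => f.name -> f.dataType).toMap
  val deleted = a.fields.collect { case f if !bMap.contains(f.name) => DeletedField(path + f.name) }
  val added   = b.fields.collect { case f if !aMap.contains(f.name) => NewField(path + f.name) }
  val changed = a.fields.filter(f => bMap.contains(f.name)).flatMap { f =>
    (f.dataType, bMap(f.name)) match {
      case (s1: Struct, s2: Struct)           => diffStruct(s1, s2, s"$path${f.name}.")
      case (Arr(s1: Struct), Arr(s2: Struct)) => diffStruct(s1, s2, s"$path${f.name}[].")
      case (t1, t2) if t1 != t2               => List(ChangedType(path + f.name, render(t1), render(t2)))
      case _                                  => Nil
    }
  }
  deleted ++ added ++ changed
}
```

For example, diffing a struct where `address.city` changes type and `address.state` is added yields a `ChangedType("address.city", ...)` and a `NewField("address.state")`, mirroring the path-aware names the PR's test asserts.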

Changes

Cohort / File(s) Summary
Schema Comparison Refactor
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala
Rewrote compareSchemas(schemaA, schemaB) to use recursive processStruct and processArray with path-tracking; introduced ListBuffer accumulators (newFields, deletedFields, changedFields); improved dataTypeToString to render array<struct<...>>, array<elementType>, struct<...>, and varchar(n); parameter names changed to schemaA/schemaB.
Nested Schema Comparison Test
pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala
Added "detect nested type changes" test constructing nested StructTypes and arrays; asserts 9 diffs total (3 ChangedType, 4 NewField, 2 DeletedField) and verifies path-aware changes (e.g., address.city, tags, phones[].number) and new/deleted fields (Id, address.state, phones[].country, additional_properties, id, error_info).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped through schemas, layer by layer,
Found changed types and fields laid bare,
Paths with brackets, structs curled tight,
New and lost in morning light,
A tiny rabbit cheers the recursive sight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: making schema change detection handle nested structures (structs and arrays) in addition to top-level fields.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files; docstring coverage check skipped.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala (1)

217-233: Dead Seq.empty expressions in processArray

processArray has return type Unit. The Seq.empty expressions on lines 221 and 224 are evaluated and immediately discarded — they carry no meaning and will confuse readers into thinking there is an intentional return value.

♻️ Proposed cleanup
   def processArray(array1: ArrayType, array2: ArrayType, metadata1: Metadata, metadata2: Metadata, path: String = ""): Unit = {
     (array1.elementType, array2.elementType) match {
       case (st1: StructType, st2: StructType) =>
         processStruct(st1, st2, s"$path[].")
-        Seq.empty
       case (ar1: ArrayType, ar2: ArrayType) =>
         processArray(ar1, ar2, metadata1, metadata2, s"$path[]")
-        Seq.empty
       case _ =>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`
around lines 217 - 233, Remove the dead Seq.empty expressions inside
processArray: when matching on (st1: StructType, st2: StructType) and (ar1:
ArrayType, ar2: ArrayType) simply call processStruct(st1, st2, s"$path[].") and
processArray(ar1, ar2, metadata1, metadata2, s"$path[]") respectively and delete
the trailing Seq.empty tokens so the method remains Unit-returning and clearer;
keep unchanged the logic that computes dt1/dt2 and records
FieldChange.ChangedType when types differ.
pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala (1)

291-349: LGTM — consider adding a mixed-case field-name test case

The new test thoroughly exercises nested struct recursion, array-of-struct element changes, new/deleted fields at multiple depth levels, and the array<struct<...>> display format.

One gap: because all field names in both schemas are lowercase, the test does not exercise the case-insensitive lookup path (i.e., the bug flagged in processStruct). A small additional assertion with a schema that has a field named, say, "City" vs "city" would both catch the existing bug and guard against regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala`
around lines 291 - 349, Add a small mixed-case field-name scenario to the
"detect nested type changes" test to exercise the case-insensitive lookup path
in processStruct: create a variant of schema1/schema2 (or a tiny additional
pair) where a nested field differs only by case (e.g., "City" vs "city") and
assert that compareSchemas still detects the change (use
ChangedType/NewField/DeletedField assertions similar to existing ones);
reference compareSchemas and processStruct so the test triggers the
case-insensitive branch and prevents regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`:
- Around line 178-214: fields1 and fields2 are keyed with lowercased names but
subsequent contains/lookup use the original field name causing incorrect
new/deleted/changed detection and potential NoSuchElementException; update all
lookups to use the lowercased key (e.g., replace f.name with f.name.toLowerCase
when calling fields1.contains, fields2.contains and when indexing
fields2(f1.name) use fields2(f1.name.toLowerCase)), ensuring newColumns,
deletedColumns and changedType (and the branch that calls
processStruct/processArray and dataTypeToString) reference the lowercased map
keys while still using the original f.name for human-readable path strings added
to newFields, deletedFields and changedFields.


ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8f9e63 and 80fbbae.

📒 Files selected for processing (2)
  • pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala
  • pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala

@github-actions

github-actions bot commented Feb 25, 2026

Unit Test Coverage

Overall Project: 84.35% (-0.04%) 🍏
Files changed: 89.3% 🍏

Module coverage:
  • pramen:core (Jacoco Report): 86.3% (-0.05%) 🍏

File coverage:
  • pramen:core (Jacoco Report) SparkUtils.scala: 87.3% (-3.06%) 🍏

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala (2)

169-174: dataTypeToString for nested arrays returns "array<array>" — losing inner element type.

When a field is ArrayType(ArrayType(StringType)), line 170 produces "array<array>" (since typeName of ArrayType is "array"). This is only relevant for the leaf comparison in processArray when both sides are non-matching nested array types that don't recurse further. In practice this is unlikely to matter, but for completeness you could handle it:

Optional improvement
-        case a: ArrayType                                           => s"array<${a.elementType.typeName}>"
+        case a: ArrayType                                           => s"array<${dataTypeToString(a.elementType, Metadata.empty)}>"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`
around lines 169 - 174, The dataTypeToString logic loses inner element type for
nested arrays (e.g., ArrayType(ArrayType(StringType))) because it uses
a.flatType.typeName; update the ArrayType case in dataTypeToString to detect
when a.elementType is itself an ArrayType (or other complex type) and
recursively call dataTypeToString(a.elementType) to produce a full nested string
like "array<array<...>>" or "array<varchar(n)>" instead of "array<array>" —
adjust any callers such as processArray that rely on this output to use the new
recursive representation.

217-231: Nullable mismatch in arrays is silently ignored.

processArray recurses on elementType but never compares array1.containsNull vs array2.containsNull. If the nullability of the array elements changes, it won't be reported. This may be intentional (the existing transformSchemaForCatalog normalizes nullability), but worth confirming.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`
around lines 217 - 231, processArray currently compares elementType but ignores
array nullability; add a comparison of array1.containsNull vs
array2.containsNull after determining element types (inside processArray) and,
when they differ, append a FieldChange to changedFields (e.g., use a
nullable-change variant like FieldChange.ChangedNullable or add a new
FieldChange case) with the same path (include "[]" to indicate array) and the
two boolean values so nullability changes are reported; update processArray to
reference array1.containsNull, array2.containsNull, changedFields and
FieldChange accordingly.

ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between 80fbbae and 275a26d.

📒 Files selected for processing (1)
  • pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala

…n purposes.

We want to know when the schema changes, even if the change is only the casing of field names.
@yruslan yruslan force-pushed the feature/710-handle-nested-structs-in-notifications branch from 275a26d to 790a100 on February 25, 2026 at 09:51
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala (1)

291-349: Solid test coverage for nested schema comparison.

The 7 expected diffs are correctly accounted for, and the per-path assertions cover all three change categories for both nested structs and arrays-of-structs. A couple of observations:

  1. val df on Line 347 — naming the DeletedField collection df is a minor readability hazard in a Spark test file where df is universally understood to mean DataFrame. Consider deletedFields (the variable is already available as a val in the broader test scope, so rename to e.g., deletedFieldsList or deleted).

  2. Missing coverage for case-sensitivity — no test covers a field rename that differs only in case (e.g., "City" in schema1 vs "city" in schema2) to document/assert the current case-sensitive comparison behaviour. Adding such a test would pin the semantics and prevent silent regressions if case-insensitive matching is introduced later.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala`
around lines 291 - 349, Rename the local val `df` (the DeletedField collection)
to a clearer name like `deletedFieldsList` in the "detect nested type changes"
test to avoid confusion with Spark DataFrame; update the reference in the
pattern match and assertions (the val is created where the code does `val df =
diff.collect { case df: DeletedField => df }`). Also add an additional small
assertion block in the same SparkUtilsSuite test (or a new test) that calls
compareSchemas with two schemas differing only by field name case (e.g., "City"
vs "city") to assert the current case-sensitive behavior (use compareSchemas,
inspect returned NewField/DeletedField/ChangedType entries, and assert expected
counts and columnName values) so the intended case-sensitivity is documented and
prevented from regressing.

ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between 275a26d and 790a100.

📒 Files selected for processing (2)
  • pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala
  • pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1


ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between 484fea4 and eac3188.

📒 Files selected for processing (1)
  • pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala

Comment on lines +174 to +176
case a: ArrayType if a.elementType.isInstanceOf[StructType] => "array<struct<...>>"
case a: ArrayType => s"array<${a.elementType.typeName}>"
case _: StructType => "struct<...>"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Apache Spark Scala API ArrayType nested typeName catalogString simpleString behavior documentation

💡 Result:

In Spark’s Scala API (org.apache.spark.sql.types), these methods behave as follows for ArrayType (including nested arrays):

  • typeName: String
    Returns the JSON type name for the data type. For ArrayType, this is just "array" (it does not include the element type, even when nested). [1][2]

  • simpleString: String
    Returns a readable type string and recursively uses the nested type’s simpleString. For an array, it renders like:
    array<elementTypeSimpleString>
    Example: ArrayType(ArrayType(IntegerType)) → array<array<int>>. [1][2]

  • catalogString: String
    Returns the string representation used when saving the type in external catalogs and recursively uses nested types’ catalogString. For an array, it renders like:
    array<elementTypeCatalogString>
    Example: ArrayType(StructType(...)) → array<struct<...>>. [1][2]

  • SQL docs alignment
    Spark SQL documentation describes ArrayType as ARRAY<element_type>, matching the angle-bracket form used by simpleString / catalogString. [3]

Sources: Spark DataType API docs (defines typeName, simpleString, catalogString) [1], Spark ScalaDoc for ArrayType [2], Spark SQL data types reference [3].


🏁 Script executed:

cd pramen/core/src/main/scala/za/co/absa/pramen/core/utils && sed -n '165,190p' SparkUtils.scala

Repository: AbsaOSS/pramen

Length of output: 1387


Preserve nested array detail in rendered type strings.

Using a.elementType.typeName returns just "array" for nested arrays, collapsing type detail (e.g., array<array<int>> becomes array<array>). This reduces fidelity of new/deleted type notifications.

💡 Proposed fix
     def dataTypeToString(dt: DataType, metadata: Metadata): String = {
       val maxLength = getLengthFromMetadata(metadata).getOrElse(0)

       dt match {
-        case a: ArrayType if a.elementType.isInstanceOf[StructType] => "array<struct<...>>"
-        case a: ArrayType                                           => s"array<${a.elementType.typeName}>"
+        case a: ArrayType                                           => s"array<${dataTypeToString(a.elementType, metadata)}>"
         case _: StructType                                          => "struct<...>"
         case _: StringType if maxLength > 0                         => s"varchar($maxLength)"
         case _                                                      => dt.typeName
       }
     }

The recursive call handles both nested arrays and struct arrays correctly, consolidating the special case.
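To see why the recursive form preserves nested detail, here is a minimal stand-alone sketch with a stand-in ADT (not Spark's DataType; the type names here are illustrative assumptions):

```scala
// Stand-in type ADT, not Spark's org.apache.spark.sql.types.
sealed trait T
case object IntT extends T
case class VarcharT(maxLength: Int) extends T
case class ArrayT(element: T) extends T
case object StructT extends T

// A single recursive renderer: arrays delegate to their element's rendering,
// so nested arrays and struct arrays come out fully qualified instead of
// collapsing to a bare "array".
def toTypeString(t: T): String = t match {
  case IntT        => "int"
  case VarcharT(n) => s"varchar($n)"
  case ArrayT(e)   => s"array<${toTypeString(e)}>"
  case StructT     => "struct<...>"
}
```

With this shape, `ArrayT(ArrayT(IntT))` renders as `array<array<int>>` and `ArrayT(StructT)` as `array<struct<...>>`, matching the simpleString/catalogString behaviour described in the web query above.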

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`
around lines 174 - 176, The array handling in the pattern match (the cases
matching "case a: ArrayType if a.elementType.isInstanceOf[StructType]" and "case
a: ArrayType => s\"array<${a.elementType.typeName}>\"") loses nested array
detail by using a.elementType.typeName; update the second ArrayType branch to
render the element type recursively using the same renderer used for other
branches (i.e., call the surrounding type-to-string helper on a.elementType
instead of .typeName) so nested arrays and struct arrays produce full strings
like "array<array<int>>" or "array<struct<...>>".

@yruslan yruslan merged commit 52743c4 into main Feb 25, 2026
7 checks passed
@yruslan yruslan deleted the feature/710-handle-nested-structs-in-notifications branch February 25, 2026 11:10