#710 Make the schema change renderer nested schema friendly. #713
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can change this behavior in the settings.
Walkthrough

compareSchemas was rewritten to recursively traverse nested StructType and ArrayType schemas with path-aware names, accumulating new, deleted, and changed fields; dataTypeToString was enhanced to represent nested arrays, structs, and varchar(length). A unit test validating nested change detection was added.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala (1)
217-233: Dead `Seq.empty` expressions in `processArray`

`processArray` has return type `Unit`. The `Seq.empty` expressions on lines 221 and 224 are evaluated and immediately discarded; they carry no meaning and will confuse readers into thinking there is an intentional return value.

♻️ Proposed cleanup
```diff
 def processArray(array1: ArrayType, array2: ArrayType, metadata1: Metadata, metadata2: Metadata, path: String = ""): Unit = {
   (array1.elementType, array2.elementType) match {
     case (st1: StructType, st2: StructType) =>
       processStruct(st1, st2, s"$path[].")
-      Seq.empty
     case (ar1: ArrayType, ar2: ArrayType) =>
       processArray(ar1, ar2, metadata1, metadata2, s"$path[]")
-      Seq.empty
     case _ =>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala` around lines 217-233: remove the dead Seq.empty expressions inside processArray: when matching on (st1: StructType, st2: StructType) and (ar1: ArrayType, ar2: ArrayType), simply call processStruct(st1, st2, s"$path[].") and processArray(ar1, ar2, metadata1, metadata2, s"$path[]") respectively, and delete the trailing Seq.empty tokens so the method remains Unit-returning and clearer; keep unchanged the logic that computes dt1/dt2 and records FieldChange.ChangedType when types differ.

pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala (1)
291-349: LGTM; consider adding a mixed-case field-name test case

The new test thoroughly exercises nested struct recursion, array-of-struct element changes, new/deleted fields at multiple depth levels, and the `array<struct<...>>` display format.

One gap: because all field names in both schemas are lowercase, the test does not exercise the case-insensitive lookup path (i.e., the bug flagged in `processStruct`). A small additional assertion with a schema that has a field named, say, `"City"` vs `"city"` would both catch the existing bug and guard against regressions.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala` around lines 291-349: add a small mixed-case field-name scenario to the "detect nested type changes" test to exercise the case-insensitive lookup path in processStruct: create a variant of schema1/schema2 (or a tiny additional pair) where a nested field differs only by case (e.g., "City" vs "city") and assert that compareSchemas still detects the change (use ChangedType/NewField/DeletedField assertions similar to existing ones); reference compareSchemas and processStruct so the test triggers the case-insensitive branch and prevents regressions.
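As a standalone illustration of the `Unit`-returning cleanup suggested for `processArray` above, here is a minimal sketch on a toy type model (the `StructT`/`ArrayT` names and the change-string format are hypothetical, not Spark's API): each match branch runs purely for its side effect, so no trailing placeholder expression such as `Seq.empty` is needed.

```scala
import scala.collection.mutable.ListBuffer

// Toy stand-in for Spark's DataType tree (hypothetical, for illustration only).
sealed trait FieldType
case class StructT(fields: Map[String, FieldType]) extends FieldType
case class ArrayT(element: FieldType) extends FieldType
case object IntT extends FieldType
case object StringT extends FieldType

val changed = ListBuffer.empty[String]

// Unit-returning recursive comparison: each branch is evaluated for its side
// effect only, so no dead trailing expressions are required.
def compare(t1: FieldType, t2: FieldType, path: String): Unit =
  (t1, t2) match {
    case (StructT(f1), StructT(f2)) =>
      f1.keySet.intersect(f2.keySet).foreach(k => compare(f1(k), f2(k), s"$path.$k"))
    case (ArrayT(e1), ArrayT(e2)) =>
      compare(e1, e2, s"$path[]")
    case (a, b) if a != b =>
      changed += s"$path: $a -> $b"
    case _ => ()
  }

compare(StructT(Map("a" -> ArrayT(IntT))), StructT(Map("a" -> ArrayT(StringT))), "")
```

Running the comparison on the two toy schemas records a single change at path `.a[]`.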
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`:
- Around line 178-214: fields1 and fields2 are keyed with lowercased names but
subsequent contains/lookup use the original field name causing incorrect
new/deleted/changed detection and potential NoSuchElementException; update all
lookups to use the lowercased key (e.g., replace f.name with f.name.toLowerCase
when calling fields1.contains, fields2.contains and when indexing
fields2(f1.name) use fields2(f1.name.toLowerCase)), ensuring newColumns,
deletedColumns and changedType (and the branch that calls
processStruct/processArray and dataTypeToString) reference the lowercased map
keys while still using the original f.name for human-readable path strings added
to newFields, deletedFields and changedFields.
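The consistent-lowercase-key approach described in the prompt above can be sketched without Spark. This is a simplified model: `Field` and the diff logic are illustrative only, and note that the commit message elsewhere in this PR suggests the project may ultimately prefer fully case-sensitive comparison instead.

```scala
case class Field(name: String, dataType: String)

val schema1 = Seq(Field("City", "string"), Field("zip", "int"))
val schema2 = Seq(Field("city", "string"), Field("zip", "long"))

// Key both lookup maps by the lowercased name, and use the same lowercased
// key for every lookup; keep the original name for human-readable output.
val fields1 = schema1.map(f => f.name.toLowerCase -> f).toMap
val fields2 = schema2.map(f => f.name.toLowerCase -> f).toMap

val newFields     = schema2.filterNot(f => fields1.contains(f.name.toLowerCase)).map(_.name)
val deletedFields = schema1.filterNot(f => fields2.contains(f.name.toLowerCase)).map(_.name)
val changedFields = schema1.flatMap { f1 =>
  fields2.get(f1.name.toLowerCase).collect {
    case f2 if f2.dataType != f1.dataType => s"${f1.name}: ${f1.dataType} -> ${f2.dataType}"
  }
}
```

With consistent keys, "City" vs "city" resolve to the same field (no spurious new/deleted entries, and no `NoSuchElementException`), and only the zip type change is reported.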
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala
pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala (outdated comment, resolved)
Unit Test Coverage
🧹 Nitpick comments (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala (2)
169-174: `dataTypeToString` for nested arrays returns `"array<array>"`, losing the inner element type.

When a field is `ArrayType(ArrayType(StringType))`, line 170 produces `"array<array>"` (since `typeName` of `ArrayType` is `"array"`). This is only relevant for the leaf comparison in `processArray` when both sides are non-matching nested array types that don't recurse further. In practice this is unlikely to matter, but for completeness you could handle it:

Optional improvement
```diff
- case a: ArrayType => s"array<${a.elementType.typeName}>"
+ case a: ArrayType => s"array<${dataTypeToString(a.elementType, Metadata.empty)}>"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala` around lines 169-174: the dataTypeToString logic loses the inner element type for nested arrays (e.g., ArrayType(ArrayType(StringType))) because it uses a.elementType.typeName; update the ArrayType case in dataTypeToString to detect when a.elementType is itself an ArrayType (or other complex type) and recursively call dataTypeToString(a.elementType) to produce a full nested string like "array<array<...>>" or "array<varchar(n)>" instead of "array<array>"; adjust any callers such as processArray that rely on this output to use the new recursive representation.
217-231: Nullable mismatch in arrays is silently ignored.

`processArray` recurses on `elementType` but never compares `array1.containsNull` vs `array2.containsNull`. If the nullability of the array elements changes, it won't be reported. This may be intentional (the existing `transformSchemaForCatalog` normalizes nullability), but worth confirming.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala` around lines 217 - 231, processArray currently compares elementType but ignores array nullability; add a comparison of array1.containsNull vs array2.containsNull after determining element types (inside processArray) and, when they differ, append a FieldChange to changedFields (e.g., use a nullable-change variant like FieldChange.ChangedNullable or add a new FieldChange case) with the same path (include "[]" to indicate array) and the two boolean values so nullability changes are reported; update processArray to reference array1.containsNull, array2.containsNull, changedFields and FieldChange accordingly.
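To make the containsNull suggestion concrete, here is a minimal sketch with a toy array type (the `ArrayT` model and change-string format are hypothetical, not Spark's `ArrayType`): element-type and element-nullability differences are recorded separately, both with a `[]` path marker.

```scala
import scala.collection.mutable.ListBuffer

// Toy array type carrying element nullability, loosely mirroring
// Spark's ArrayType(elementType, containsNull). Illustrative only.
case class ArrayT(elementType: String, containsNull: Boolean)

val changedFields = ListBuffer.empty[String]

def processArray(a1: ArrayT, a2: ArrayT, path: String): Unit = {
  // Report a nullability change on the array elements.
  if (a1.containsNull != a2.containsNull)
    changedFields += s"$path[]: containsNull ${a1.containsNull} -> ${a2.containsNull}"
  // Report an element type change.
  if (a1.elementType != a2.elementType)
    changedFields += s"$path[]: ${a1.elementType} -> ${a2.elementType}"
}

processArray(ArrayT("int", containsNull = true), ArrayT("int", containsNull = false), "tags")
```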
…n purposes. We want to know when schema changes even if the change is casing of fields.
275a26d to 790a100 (Compare)
🧹 Nitpick comments (1)
pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala (1)
291-349: Solid test coverage for nested schema comparison.

The 7 expected diffs are correctly accounted for, and the per-path assertions cover all three change categories for both nested structs and arrays-of-structs. A couple of observations:

- `val df` on line 347: naming the `DeletedField` collection `df` is a minor readability hazard in a Spark test file where `df` is universally understood to mean DataFrame. Consider renaming it to, e.g., `deletedFieldsList` or `deleted`.
- Missing coverage for case-sensitivity: no test covers a field rename that differs only in case (e.g., `"City"` in schema1 vs `"city"` in schema2) to document/assert the current case-sensitive comparison behaviour. Adding such a test would pin the semantics and prevent silent regressions if case-insensitive matching is introduced later.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala` around lines 291-349: rename the local val `df` (the DeletedField collection) to a clearer name like `deletedFieldsList` in the "detect nested type changes" test to avoid confusion with Spark DataFrame; update the reference in the pattern match and assertions (the val is created where the code does `val df = diff.collect { case df: DeletedField => df }`). Also add a small assertion block in the same SparkUtilsSuite test (or a new test) that calls compareSchemas with two schemas differing only by field name case (e.g., "City" vs "city") to assert the current case-sensitive behavior (use compareSchemas, inspect returned NewField/DeletedField/ChangedType entries, and assert expected counts and columnName values) so the intended case-sensitivity is documented and prevented from regressing.
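The case-sensitivity behaviour the reviewer wants pinned can be sketched with plain sets (a toy model; real code would inspect the NewField/DeletedField results returned by compareSchemas): under case-sensitive comparison, a name differing only in case shows up as one new and one deleted field.

```scala
// Case-SENSITIVE name diff: "City" and "city" are different fields.
// (Illustrative stand-in for inspecting compareSchemas results.)
def diffFieldNames(schema1: Set[String], schema2: Set[String]): (Set[String], Set[String]) =
  (schema2 -- schema1, schema1 -- schema2) // (new fields, deleted fields)

val (newFields, deletedFields) = diffFieldNames(Set("City", "zip"), Set("city", "zip"))
```

A test asserting exactly this shape would document the semantics and fail loudly if case-insensitive matching were introduced later.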
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala
pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala
Actionable comments posted: 1
```scala
case a: ArrayType if a.elementType.isInstanceOf[StructType] => "array<struct<...>>"
case a: ArrayType => s"array<${a.elementType.typeName}>"
case _: StructType => "struct<...>"
```
🧩 Analysis chain
🌐 Web query:
Apache Spark Scala API ArrayType nested typeName catalogString simpleString behavior documentation
💡 Result:
In Spark's Scala API (org.apache.spark.sql.types), these methods behave as follows for ArrayType (including nested arrays):

- `typeName: String`
  Returns the JSON type name for the data type. For `ArrayType`, this is just `"array"` (it does not include the element type, even when nested). [1][2]
- `simpleString: String`
  Returns a readable type string and recursively uses the nested type's `simpleString`. For an array, it renders like `array<elementTypeSimpleString>`. Example: `ArrayType(ArrayType(IntegerType))` → `array<array<int>>`. [1][2]
- `catalogString: String`
  Returns the string representation used when saving the type in external catalogs and recursively uses nested types' `catalogString`. For an array, it renders like `array<elementTypeCatalogString>`. Example: `ArrayType(StructType(...))` → `array<struct<...>>`. [1][2]
- SQL docs alignment
  Spark SQL documentation describes `ArrayType` as `ARRAY<element_type>`, matching the angle-bracket form used by `simpleString`/`catalogString`. [3]

Sources: Spark DataType API docs (defines typeName, simpleString, catalogString) [1], Spark ScalaDoc for ArrayType [2], Spark SQL data types reference [3].
🏁 Script executed:

```shell
cd pramen/core/src/main/scala/za/co/absa/pramen/core/utils && sed -n '165,190p' SparkUtils.scala
```

Repository: AbsaOSS/pramen
Length of output: 1387
Preserve nested array detail in rendered type strings.

Using `a.elementType.typeName` returns just `"array"` for nested arrays, collapsing type detail (e.g., `array<array<int>>` becomes `array<array>`). This reduces the fidelity of new/deleted type notifications.
💡 Proposed fix

```diff
 def dataTypeToString(dt: DataType, metadata: Metadata): String = {
   val maxLength = getLengthFromMetadata(metadata).getOrElse(0)
   dt match {
-    case a: ArrayType if a.elementType.isInstanceOf[StructType] => "array<struct<...>>"
-    case a: ArrayType => s"array<${a.elementType.typeName}>"
+    case a: ArrayType => s"array<${dataTypeToString(a.elementType, metadata)}>"
     case _: StructType => "struct<...>"
     case _: StringType if maxLength > 0 => s"varchar($maxLength)"
     case _ => dt.typeName
   }
 }
```

The recursive call handles both nested arrays and struct arrays correctly, consolidating the special case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`
around lines 174 - 176, The array handling in the pattern match (the cases
matching "case a: ArrayType if a.elementType.isInstanceOf[StructType]" and "case
a: ArrayType => s\"array<${a.elementType.typeName}>\"") loses nested array
detail by using a.elementType.typeName; update the second ArrayType branch to
render the element type recursively using the same renderer used for other
branches (i.e., call the surrounding type-to-string helper on a.elementType
instead of .typeName) so nested arrays and struct arrays produce full strings
like "array<array<int>>" or "array<struct<...>>".
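The recursive rendering idea in the proposed fix can be shown on a toy type tree (the `DType` names are illustrative stand-ins, not Spark's API): the array branch calls itself on the element type, so nested detail is preserved.

```scala
// Minimal stand-in for Spark's DataType hierarchy (illustrative names only).
sealed trait DType
case object IntT extends DType
case class VarcharT(maxLength: Int) extends DType
case class ArrayT(elementType: DType) extends DType
case class StructT(fields: Seq[(String, DType)]) extends DType

// Recursive rendering: nested arrays keep full detail instead of
// collapsing to "array<array>".
def dataTypeToString(dt: DType): String = dt match {
  case ArrayT(e)   => s"array<${dataTypeToString(e)}>"
  case StructT(_)  => "struct<...>"
  case VarcharT(n) => s"varchar($n)"
  case IntT        => "int"
}
```

A single recursive branch covers both `array<array<int>>` and `array<struct<...>>`, which is why the special StructType-element case can be dropped.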
Summary by CodeRabbit: Refactor, Tests