#822 Allow '_corrupt_records' to extract data in HEX instead of binary data type. (#825)

Merged
yruslan merged 3 commits into master from feature/822-generate-corrupt-fields-as-hex on Feb 24, 2026
Conversation

@yruslan
Collaborator

@yruslan yruslan commented Feb 24, 2026

Closes #822

Summary by CodeRabbit

  • New Features

    • Corrupt fields can be emitted as hex strings or binary via a new corrupt-fields policy and a decode-as-hex option.
    • Schema and builder APIs updated to expose the corrupt-fields policy and decode-as-hex toggle; raw_value type now reflects the chosen mode.
  • Tests

    • Added/updated tests for binary vs hex corrupt-field behavior and a hex conversion utility.
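In Spark terms, the feature summarized above is presumably toggled via reader options; the option names below are taken from this review page, and the snippet is an unverified sketch rather than documented Cobrix usage:

```scala
// Hypothetical usage sketch; the option names appear elsewhere in this review,
// but their exact spelling should be checked against the Cobrix documentation.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("generate_corrupt_fields", "true")
  .option("binary_as_hex", "true")   // corrupt raw_value emitted as an uppercase hex String
  .load("/path/to/data")
```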

@coderabbitai
Contributor

coderabbitai bot commented Feb 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 970c6ba and 9af73e8.

📒 Files selected for processing (10)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/DebugFieldsAdder.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/raw/FixedWithRecordLengthExprRawRecordExtractor.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/utils/StringUtils.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecodersSpec.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/utils/StringUtilsSuite.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala
💤 Files with no reviewable changes (2)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecodersSpec.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala
🚧 Files skipped from review as they are similar to previous changes (4)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/raw/FixedWithRecordLengthExprRawRecordExtractor.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/utils/StringUtilsSuite.scala

Walkthrough

Replaces boolean corrupt-field handling with a sealed CorruptFieldsPolicy (Disabled, Binary, Hex), threads the policy through parameters, schema, iterators, and record extraction, adds option to emit corrupt-field raw values as uppercase hex, and introduces StringUtils.convertArrayToHex plus related test updates.
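Based on the walkthrough, the policy ADT can be sketched as follows. The case names come from this summary; the companion-object layout and comments are assumptions:

```scala
// Sketch of the sealed policy described in the walkthrough.
// Member names follow the PR summary; the exact definitions are assumptions.
sealed trait CorruptFieldsPolicy

object CorruptFieldsPolicy {
  case object Disabled extends CorruptFieldsPolicy // no corrupt-fields column emitted
  case object Binary   extends CorruptFieldsPolicy // raw_value emitted as Array[Byte]
  case object Hex      extends CorruptFieldsPolicy // raw_value emitted as an uppercase hex String
}
```

A sealed trait lets the compiler check that every pattern match over the policy is exhaustive, which is safer than threading two independent booleans through the call chain.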

Changes

  • Core Policy Definition (cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CorruptFieldsPolicy.scala): Adds the sealed trait CorruptFieldsPolicy with Disabled, Binary, and Hex.
  • Parameters & Parsing (cobol-parser/src/main/scala/.../CobolParametersParser.scala, cobol-parser/src/main/scala/.../ReaderParameters.scala): Replaces the boolean generateCorruptFields with corruptFieldsPolicy; the parser computes the policy from existing flags and wires it into ReaderParameters.
  • Schema & Builder (cobol-parser/src/main/scala/.../schema/CobolSchema.scala, spark-cobol/src/main/scala/.../schema/CobolSchema.scala, spark-cobol/src/main/scala/.../schema/CobolSchemaBuilder.scala): CobolSchema now accepts corruptSchemaPolicy; the builder gains decodeBinaryAsHex and withDecodeBinaryAsHex() and derives the policy; the corrupt-fields rawValue type depends on the policy (Hex -> StringType, otherwise BinaryType).
  • Record Extraction (cobol-parser/src/main/scala/.../extractors/record/RecordExtractors.scala): Adds the parameter generateCorruptFieldsAsHex: Boolean to extractRecord/applyRecordPostProcessing; when enabled, corrupt entries emit hex strings instead of raw byte arrays.
  • Iterators / Callers (cobol-parser/src/main/scala/.../iterator/FixedLenNestedRowIterator.scala, cobol-parser/src/main/scala/.../iterator/VarLenNestedIterator.scala, spark-cobol/src/main/scala/.../builder/SparkCobolOptionsBuilder.scala): Derives generateCorruptFields and generateCorruptFieldsAsHex from corruptFieldsPolicy and passes them to extractRecord; call sites are updated accordingly.
  • String utilities & decoders (cobol-parser/src/main/scala/.../utils/StringUtils.scala, cobol-parser/src/main/scala/.../parser/decoders/DecoderSelector.scala, cobol-parser/src/main/scala/.../parser/decoders/StringDecoders.scala, cobol-parser/src/main/scala/.../parser/asttransform/DebugFieldsAdder.scala): Adds StringUtils.convertArrayToHex, replaces prior hex helpers and usages with this utility, and removes StringDecoders.decodeHex.
  • Raw extractor messages (cobol-parser/src/main/scala/.../raw/FixedWithRecordLengthExprRawRecordExtractor.scala): Uses StringUtils.convertArrayToHex in error messages and removes an internal hex helper.
  • Tests & Fixtures (spark-cobol/src/test/scala/.../CobolSchemaSpec.scala, .../DummyCobolSchema.scala, .../Test41CorruptFieldsSpec.scala, cobol-parser/src/test/scala/.../StringUtilsSuite.scala, cobol-parser/src/test/scala/.../StringDecodersSpec.scala): Adds StringUtils tests; updates schema and integration tests to cover binary vs hex corrupt-field modes; adjusts test fixtures to the new constructor parameters; removes the old decodeHex test.
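The body of StringUtils.convertArrayToHex is not shown in this summary. A plausible sketch, under the assumption that it returns an uppercase hex string and that each byte is masked before formatting (the real signature and behavior may differ):

```scala
object StringUtilsSketch {
  // Hypothetical stand-in for StringUtils.convertArrayToHex.
  // Masks each byte with 0xFF so values >= 0x80 don't sign-extend to 8 digits.
  def convertArrayToHex(bytes: Array[Byte]): String = {
    val sb = new StringBuilder(bytes.length * 2)
    bytes.foreach(b => sb.append(f"${b & 0xFF}%02X"))
    sb.toString
  }
}
```

For example, `StringUtilsSketch.convertArrayToHex(Array(0x00.toByte, 0xD3.toByte))` yields `"00D3"`.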

Sequence Diagram

sequenceDiagram
    participant User
    participant Parser as CobolParametersParser
    participant ReaderParams as ReaderParameters
    participant Schema as CobolSchema
    participant Iterator as NestedRowIterator
    participant Extractor as RecordExtractors
    participant Utils as StringUtils

    User->>Parser: set options (generate_corrupt_fields, binary_as_hex)
    Parser->>Parser: derive corruptFieldsPolicy (Disabled|Binary|Hex)
    Parser->>ReaderParams: pass corruptFieldsPolicy
    ReaderParams->>Schema: fromReaderParameters(corruptFieldsPolicy)
    Schema->>Iterator: iterator created (policy available)
    Iterator->>Iterator: derive booleans: generateCorruptFields, generateCorruptFieldsAsHex
    Iterator->>Extractor: extractRecord(..., generateCorruptFields, generateCorruptFieldsAsHex)
    alt generateCorruptFields && generateCorruptFieldsAsHex
        Extractor->>Utils: convertArrayToHex(rawBytes)
        Utils-->>Extractor: hexString
        Extractor-->>Iterator: record with corrupt_fields raw_value as String
    else generateCorruptFields && !generateCorruptFieldsAsHex
        Extractor-->>Iterator: record with corrupt_fields raw_value as Binary
    end
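The "derive booleans" step in the diagram could be a simple pattern match over the policy. A self-contained sketch (the trait is redeclared here for completeness; the actual iterator code may differ):

```scala
// Sketch: deriving the two booleans the iterators pass to extractRecord.
// The policy cases mirror the PR; this exact helper is hypothetical.
sealed trait CorruptFieldsPolicy
case object Disabled extends CorruptFieldsPolicy
case object Binary extends CorruptFieldsPolicy
case object Hex extends CorruptFieldsPolicy

// Returns (generateCorruptFields, generateCorruptFieldsAsHex).
def deriveFlags(policy: CorruptFieldsPolicy): (Boolean, Boolean) = policy match {
  case Disabled => (false, false) // no corrupt fields emitted
  case Binary   => (true, false)  // emit raw byte arrays
  case Hex      => (true, true)   // emit uppercase hex strings
}
```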

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I nibble bytes and make them shine,
Corrupt fields dressed in hex so fine,
From raw to string with one small hop,
I humbly patch each data drop,
Happy hops — hex feast on time!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately describes the main feature: allowing corrupt_records to extract data as HEX instead of binary. It directly matches the core objective from issue #822.
  • Linked Issues check: ✅ Passed. The PR implements the feature requirements from issue #822: it reuses the 'binary_as_hex' option to control the HEX representation of corrupt field values, adding a generateCorruptFieldsAsHex flag and supporting the Hex policy throughout the extraction pipeline.
  • Out of Scope Changes check: ✅ Passed. All changes directly support the core feature: introducing the CorruptFieldsPolicy enum, updating extractors to respect hex formatting, refactoring parameters, and consolidating hex conversion logic into StringUtils. No unrelated modifications detected.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.




@github-actions

github-actions bot commented Feb 24, 2026

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.42% -0.2% 🍏
Files changed 68.81% 🍏

File Coverage
StringUtils.scala 100% 🍏
StringDecoders.scala 94.34% 🍏
DebugFieldsAdder.scala 88.89% -11.11% 🍏
CobolSchema.scala 80.67% -1% 🍏
RecordExtractors.scala 76.96% -7.83% 🍏
FixedLenNestedRowIterator.scala 73.42% -3.29% 🍏
FixedWithRecordLengthExprRawRecordExtractor.scala 71.27% -4.61%
DecoderSelector.scala 69.68% -0.18%
VarLenNestedIterator.scala 68.25% -3.34% 🍏

@github-actions

github-actions bot commented Feb 24, 2026

JaCoCo code coverage report - 'spark-cobol'

Overall Project 82.35% -0.23% 🍏
Files changed 81.65% 🍏

File Coverage
CobolSchema.scala 95.06% -0.65% 🍏
SparkCobolOptionsBuilder.scala 85% -7.5%

@coderabbitai
Contributor

coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`:
- Around line 523-527: The hex conversion for corruptFields in
RecordExtractors.scala uses "%02X" format on signed bytes
(CorruptField.rawValue: Array[Byte]) which can produce sign-extended hex like
FFFFFFD3; fix by masking each byte with 0xFF before formatting so each element
is treated as an unsigned 0-255 value. Update the branch that builds the hex
string (the handler.create call that maps
corruptFields(i).rawValue.map(...).mkString) to map each byte via (b & 0xFF)
prior to formatting, ensuring two-digit hex output for values >= 0x80.
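The sign-extension behavior described above is easy to reproduce: formatting a negative byte that has been widened to Int yields eight hex digits, while masking with 0xFF first gives the intended two. A minimal standalone demonstration (not the PR's actual code):

```scala
// Reproduces the reported bug: "%02X" on a sign-extended byte.
val b: Byte = 0xD3.toByte          // 0xD3 is -45 as a signed byte

val wrong = f"${b.toInt}%02X"      // widened to Int first: "FFFFFFD3"
val right = f"${b & 0xFF}%02X"     // masked to 0-255 first: "D3"

println(wrong)  // FFFFFFD3
println(right)  // D3
```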

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala`:
- Line 42: The docstring for parameter corruptSchemaPolicy in CobolSchema.scala
contains a typo: change the string '_corrput_fileds' to the correct
'_corrupt_fields' in the comment for corruptSchemaPolicy (locate the doc block
that documents `@param` corruptSchemaPolicy in the CobolSchema.scala file and
replace the misspelled token).

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala`:
- Line 47: Update the Scaladoc for the parameter corruptFieldsPolicy in
CobolSchema to fix the typo and use the correct field name: replace the
misspelled "_corrput_fileds" (and any occurrences of "_corrupted_fields") with
"_corrupt_fields" so the documentation matches the intended Spark convention
(e.g., "_corrupt_record"); ensure the description still mentions that the field
will be generated when the policy is set.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 904c05f and 60d2708.

📒 Files selected for processing (12)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/FixedLenNestedRowIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VarLenNestedIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CorruptFieldsPolicy.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/builder/SparkCobolOptionsBuilder.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/CobolSchemaSpec.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/base/impl/DummyCobolSchema.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test41CorruptFieldsSpec.scala

@coderabbitai
Contributor

coderabbitai bot left a comment

♻️ Duplicate comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (1)

47-47: Fix Scaladoc field name typo. The doc still says '_corrput_fields'; it should be '_corrupt_fields' to match convention.

Based on learnings: Rename the field from _corrupted_fields to _corrupt_fields to align with Apache Spark's naming convention (e.g., _corrupt_record).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala`
at line 47, Scaladoc and code use a misspelled/incorrect field name: update the
Scaladoc param for corruptFieldsPolicy to use '_corrupt_fields' (not
'_corrput_fields'), and rename any code symbols from _corrupted_fields to
_corrupt_fields to match Spark convention; search for usages in CobolSchema (and
related references) and refactor the symbol name consistently (Scaladoc tag,
field/column constant, and any tests or consumers) so all references use
'_corrupt_fields'.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60d2708 and 970c6ba.

📒 Files selected for processing (10)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/DebugFieldsAdder.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/raw/FixedWithRecordLengthExprRawRecordExtractor.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/utils/StringUtils.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecodersSpec.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/utils/StringUtilsSuite.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala
💤 Files with no reviewable changes (2)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecodersSpec.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala
🚧 Files skipped from review as they are similar to previous changes (1)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala

yruslan force-pushed the feature/822-generate-corrupt-fields-as-hex branch from 970c6ba to 9af73e8 on February 24, 2026 at 10:19
yruslan merged commit 9025915 into master on Feb 24, 2026
7 checks passed
yruslan deleted the feature/822-generate-corrupt-fields-as-hex branch on February 24, 2026 at 10:41


Development

Successfully merging this pull request may close these issues.

Allow '_corrupt_records' raw values to be in HEX rather than RAW

1 participant