#822 Allow '_corrupt_records' to extract data in HEX instead of binary data type. (#825)

Merged
yruslan merged 3 commits into master from feature/822-generate-corrupt-fields-as-hex on Feb 24, 2026
Conversation

@yruslan
Collaborator

@yruslan yruslan commented Feb 24, 2026

Closes #822

Summary by CodeRabbit

  • New Features

    • Corrupt fields can be emitted as hex strings or binary via a new corrupt-fields policy and a decode-as-hex option.
    • Schema and builder APIs updated to expose the corrupt-fields policy and decode-as-hex toggle; raw_value type now reflects the chosen mode.
  • Tests

    • Added/updated tests for binary vs hex corrupt-field behavior and a hex conversion utility.
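In Spark terms, the feature summarized above is presumably toggled via reader options; the option names below are taken from this review page, and the snippet is an unverified sketch rather than documented Cobrix usage:

```scala
// Hypothetical usage sketch; the option names appear elsewhere in this review,
// but their exact spelling should be checked against the Cobrix documentation.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("generate_corrupt_fields", "true")
  .option("binary_as_hex", "true")   // corrupt raw_value emitted as an uppercase hex String
  .load("/path/to/data")
```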

@coderabbitai
Contributor

coderabbitai bot commented Feb 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 970c6ba and 9af73e8.

📒 Files selected for processing (10)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/DebugFieldsAdder.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/raw/FixedWithRecordLengthExprRawRecordExtractor.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/utils/StringUtils.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecodersSpec.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/utils/StringUtilsSuite.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala
💤 Files with no reviewable changes (2)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecodersSpec.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala
🚧 Files skipped from review as they are similar to previous changes (4)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/raw/FixedWithRecordLengthExprRawRecordExtractor.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/utils/StringUtilsSuite.scala

Walkthrough

Replaces boolean corrupt-field handling with a sealed CorruptFieldsPolicy (Disabled, Binary, Hex), threads the policy through parameters, schema, iterators, and record extraction, adds option to emit corrupt-field raw values as uppercase hex, and introduces StringUtils.convertArrayToHex plus related test updates.
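Based on the walkthrough, the policy ADT can be sketched as follows. The case names come from this summary; the companion-object layout and comments are assumptions:

```scala
// Sketch of the sealed policy described in the walkthrough.
// Member names follow the PR summary; the exact definitions are assumptions.
sealed trait CorruptFieldsPolicy

object CorruptFieldsPolicy {
  case object Disabled extends CorruptFieldsPolicy // no corrupt-fields column emitted
  case object Binary   extends CorruptFieldsPolicy // raw_value emitted as Array[Byte]
  case object Hex      extends CorruptFieldsPolicy // raw_value emitted as an uppercase hex String
}
```

A sealed trait lets the compiler check that every pattern match over the policy is exhaustive, which is safer than threading two independent booleans through the call chain.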

Changes

  • Core Policy Definition (cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CorruptFieldsPolicy.scala): Adds the sealed trait CorruptFieldsPolicy with Disabled, Binary, and Hex.
  • Parameters & Parsing (cobol-parser/src/main/scala/.../CobolParametersParser.scala, cobol-parser/src/main/scala/.../ReaderParameters.scala): Replaces the boolean generateCorruptFields with corruptFieldsPolicy; the parser computes the policy from existing flags and wires it into ReaderParameters.
  • Schema & Builder (cobol-parser/src/main/scala/.../schema/CobolSchema.scala, spark-cobol/src/main/scala/.../schema/CobolSchema.scala, spark-cobol/src/main/scala/.../schema/CobolSchemaBuilder.scala): CobolSchema now accepts corruptSchemaPolicy; the builder gains decodeBinaryAsHex and withDecodeBinaryAsHex() and derives the policy; the corrupt-fields rawValue type depends on the policy (Hex -> StringType, otherwise BinaryType).
  • Record Extraction (cobol-parser/src/main/scala/.../extractors/record/RecordExtractors.scala): Adds the parameter generateCorruptFieldsAsHex: Boolean to extractRecord/applyRecordPostProcessing; when enabled, corrupt entries emit hex strings instead of raw byte arrays.
  • Iterators / Callers (cobol-parser/src/main/scala/.../iterator/FixedLenNestedRowIterator.scala, cobol-parser/src/main/scala/.../iterator/VarLenNestedIterator.scala, spark-cobol/src/main/scala/.../builder/SparkCobolOptionsBuilder.scala): Derives generateCorruptFields and generateCorruptFieldsAsHex from corruptFieldsPolicy and passes them to extractRecord; call sites are updated accordingly.
  • String utilities & decoders (cobol-parser/src/main/scala/.../utils/StringUtils.scala, cobol-parser/src/main/scala/.../parser/decoders/DecoderSelector.scala, cobol-parser/src/main/scala/.../parser/decoders/StringDecoders.scala, cobol-parser/src/main/scala/.../parser/asttransform/DebugFieldsAdder.scala): Adds StringUtils.convertArrayToHex, replaces prior hex helpers and usages with this utility, and removes StringDecoders.decodeHex.
  • Raw extractor messages (cobol-parser/src/main/scala/.../raw/FixedWithRecordLengthExprRawRecordExtractor.scala): Uses StringUtils.convertArrayToHex in error messages and removes an internal hex helper.
  • Tests & Fixtures (spark-cobol/src/test/scala/.../CobolSchemaSpec.scala, .../DummyCobolSchema.scala, .../Test41CorruptFieldsSpec.scala, cobol-parser/src/test/scala/.../StringUtilsSuite.scala, cobol-parser/src/test/scala/.../StringDecodersSpec.scala): Adds StringUtils tests; updates schema and integration tests to cover binary vs hex corrupt-field modes; adjusts test fixtures to the new constructor parameters; removes the old decodeHex test.
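The body of StringUtils.convertArrayToHex is not shown in this summary. A plausible sketch, under the assumption that it returns an uppercase hex string and that each byte is masked before formatting (the real signature and behavior may differ):

```scala
object StringUtilsSketch {
  // Hypothetical stand-in for StringUtils.convertArrayToHex.
  // Masks each byte with 0xFF so values >= 0x80 don't sign-extend to 8 digits.
  def convertArrayToHex(bytes: Array[Byte]): String = {
    val sb = new StringBuilder(bytes.length * 2)
    bytes.foreach(b => sb.append(f"${b & 0xFF}%02X"))
    sb.toString
  }
}
```

For example, `StringUtilsSketch.convertArrayToHex(Array(0x00.toByte, 0xD3.toByte))` yields `"00D3"`.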

Sequence Diagram

sequenceDiagram
    participant User
    participant Parser as CobolParametersParser
    participant ReaderParams as ReaderParameters
    participant Schema as CobolSchema
    participant Iterator as NestedRowIterator
    participant Extractor as RecordExtractors
    participant Utils as StringUtils

    User->>Parser: set options (generate_corrupt_fields, binary_as_hex)
    Parser->>Parser: derive corruptFieldsPolicy (Disabled|Binary|Hex)
    Parser->>ReaderParams: pass corruptFieldsPolicy
    ReaderParams->>Schema: fromReaderParameters(corruptFieldsPolicy)
    Schema->>Iterator: iterator created (policy available)
    Iterator->>Iterator: derive booleans: generateCorruptFields, generateCorruptFieldsAsHex
    Iterator->>Extractor: extractRecord(..., generateCorruptFields, generateCorruptFieldsAsHex)
    alt generateCorruptFields && generateCorruptFieldsAsHex
        Extractor->>Utils: convertArrayToHex(rawBytes)
        Utils-->>Extractor: hexString
        Extractor-->>Iterator: record with corrupt_fields raw_value as String
    else generateCorruptFields && !generateCorruptFieldsAsHex
        Extractor-->>Iterator: record with corrupt_fields raw_value as Binary
    end
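The "derive booleans" step in the diagram could be a simple pattern match over the policy. A self-contained sketch (the trait is redeclared here for completeness; the actual iterator code may differ):

```scala
// Sketch: deriving the two booleans the iterators pass to extractRecord.
// The policy cases mirror the PR; this exact helper is hypothetical.
sealed trait CorruptFieldsPolicy
case object Disabled extends CorruptFieldsPolicy
case object Binary extends CorruptFieldsPolicy
case object Hex extends CorruptFieldsPolicy

// Returns (generateCorruptFields, generateCorruptFieldsAsHex).
def deriveFlags(policy: CorruptFieldsPolicy): (Boolean, Boolean) = policy match {
  case Disabled => (false, false) // no corrupt fields emitted
  case Binary   => (true, false)  // emit raw byte arrays
  case Hex      => (true, true)   // emit uppercase hex strings
}
```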

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I nibble bytes and make them shine,
Corrupt fields dressed in hex so fine,
From raw to string with one small hop,
I humbly patch each data drop,
Happy hops — hex feast on time!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately describes the main feature: allowing corrupt_records to extract data as HEX instead of binary. It directly matches the core objective from issue #822.
  • Linked Issues check: ✅ Passed. The PR implements the feature requirements from issue #822: it reuses the 'binary_as_hex' option to control the HEX representation of corrupt field values, adding a generateCorruptFieldsAsHex flag and supporting the Hex policy throughout the extraction pipeline.
  • Out of Scope Changes check: ✅ Passed. All changes directly support the core feature: introducing the CorruptFieldsPolicy enum, updating extractors to respect hex formatting, refactoring parameters, and consolidating hex conversion logic into StringUtils. No unrelated modifications detected.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.




@github-actions

github-actions bot commented Feb 24, 2026

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.42% -0.2% 🍏
Files changed 68.81% 🍏

File Coverage
StringUtils.scala 100% 🍏
StringDecoders.scala 94.34% 🍏
DebugFieldsAdder.scala 88.89% -11.11% 🍏
CobolSchema.scala 80.67% -1% 🍏
RecordExtractors.scala 76.96% -7.83% 🍏
FixedLenNestedRowIterator.scala 73.42% -3.29% 🍏
FixedWithRecordLengthExprRawRecordExtractor.scala 71.27% -4.61%
DecoderSelector.scala 69.68% -0.18%
VarLenNestedIterator.scala 68.25% -3.34% 🍏

@github-actions

github-actions bot commented Feb 24, 2026

JaCoCo code coverage report - 'spark-cobol'

Overall Project 82.35% -0.23% 🍏
Files changed 81.65% 🍏

File Coverage
CobolSchema.scala 95.06% -0.65% 🍏
SparkCobolOptionsBuilder.scala 85% -7.5%

@coderabbitai
Contributor

coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`:
- Around line 523-527: The hex conversion for corruptFields in
RecordExtractors.scala uses "%02X" format on signed bytes
(CorruptField.rawValue: Array[Byte]) which can produce sign-extended hex like
FFFFFFD3; fix by masking each byte with 0xFF before formatting so each element
is treated as an unsigned 0-255 value. Update the branch that builds the hex
string (the handler.create call that maps
corruptFields(i).rawValue.map(...).mkString) to map each byte via (b & 0xFF)
prior to formatting, ensuring two-digit hex output for values >= 0x80.
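The sign-extension behavior described above is easy to reproduce: formatting a negative byte that has been widened to Int yields eight hex digits, while masking with 0xFF first gives the intended two. A minimal standalone demonstration (not the PR's actual code):

```scala
// Reproduces the reported bug: "%02X" on a sign-extended byte.
val b: Byte = 0xD3.toByte          // 0xD3 is -45 as a signed byte

val wrong = f"${b.toInt}%02X"      // widened to Int first: "FFFFFFD3"
val right = f"${b & 0xFF}%02X"     // masked to 0-255 first: "D3"

println(wrong)  // FFFFFFD3
println(right)  // D3
```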

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala`:
- Line 42: The docstring for parameter corruptSchemaPolicy in CobolSchema.scala
contains a typo: change the string '_corrput_fileds' to the correct
'_corrupt_fields' in the comment for corruptSchemaPolicy (locate the doc block
that documents `@param` corruptSchemaPolicy in the CobolSchema.scala file and
replace the misspelled token).

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala`:
- Line 47: Update the Scaladoc for the parameter corruptFieldsPolicy in
CobolSchema to fix the typo and use the correct field name: replace the
misspelled "_corrput_fileds" (and any occurrences of "_corrupted_fields") with
"_corrupt_fields" so the documentation matches the intended Spark convention
(e.g., "_corrupt_record"); ensure the description still mentions that the field
will be generated when the policy is set.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 904c05f and 60d2708.

📒 Files selected for processing (12)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/FixedLenNestedRowIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VarLenNestedIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CorruptFieldsPolicy.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/builder/SparkCobolOptionsBuilder.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/CobolSchemaSpec.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/base/impl/DummyCobolSchema.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test41CorruptFieldsSpec.scala

@coderabbitai
Contributor

coderabbitai bot left a comment

♻️ Duplicate comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (1)

47-47: Fix Scaladoc field name typo. The doc still says '_corrput_fields'; it should be '_corrupt_fields' to match convention.

Based on learnings: Rename the field from _corrupted_fields to _corrupt_fields to align with Apache Spark's naming convention (e.g., _corrupt_record).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala`
at line 47, Scaladoc and code use a misspelled/incorrect field name: update the
Scaladoc param for corruptFieldsPolicy to use '_corrupt_fields' (not
'_corrput_fields'), and rename any code symbols from _corrupted_fields to
_corrupt_fields to match Spark convention; search for usages in CobolSchema (and
related references) and refactor the symbol name consistently (Scaladoc tag,
field/column constant, and any tests or consumers) so all references use
'_corrupt_fields'.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60d2708 and 970c6ba.

📒 Files selected for processing (10)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/asttransform/DebugFieldsAdder.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/DecoderSelector.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/raw/FixedWithRecordLengthExprRawRecordExtractor.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/utils/StringUtils.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecodersSpec.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/utils/StringUtilsSuite.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala
💤 Files with no reviewable changes (2)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecodersSpec.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/StringDecoders.scala
🚧 Files skipped from review as they are similar to previous changes (1)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala

yruslan force-pushed the feature/822-generate-corrupt-fields-as-hex branch from 970c6ba to 9af73e8 on February 24, 2026 at 10:19
yruslan merged commit 9025915 into master on Feb 24, 2026
7 checks passed
yruslan deleted the feature/822-generate-corrupt-fields-as-hex branch on February 24, 2026 at 10:41


Development

Successfully merging this pull request may close these issues.

Allow '_corrupt_records' raw values to be in HEX rather than RAW

1 participant