Skip to content

Conversation

@mscasso-scanoss
Copy link
Contributor

@mscasso-scanoss mscasso-scanoss commented Jan 13, 2026

Summary by CodeRabbit

  • New Features

    • Many new CLI options for scan tuning: snippet matching, ranking, SBOM, attribution, license controls, engine flags, and max-files.
  • Improvements

    • Expanded README and help output with detailed configuration sections, examples and clearer usage; version bumped to 5.4.20.
    • Unified license handling and optional component ranking in JSON output.
  • Behavioral Changes

    • Configurable file-fetch limit and vulnerability cap; refined snippet/range tolerance, dynamic range handling, and rank-based component filtering.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 13, 2026

📝 Walkthrough

Walkthrough

Extended CLI and configuration options added; per-scan snippet thresholds and ranking controls introduced; multiple public APIs updated to accept new scan parameters; component filtering by rank and a multi-criteria component comparator (SBOM, hints, rank, dates, extensions, purl, age) implemented; documentation/help expanded.

Changes

Cohort / File(s) Summary
Documentation
README.md
Rewrote configuration/help: added long-form CLI flags, env vars, engine flags table, usage examples; reorganized Configuration Options.
Headers & macros
inc/scanoss.h, inc/limits.h, inc/component.h, inc/match.h, inc/parse.h, inc/scan.h
Added/removed macros and defaults (SNIPPETS_DEFAULT_*, FETCH_MAX_FILES_DEFAULT, MAX_FILE_PATH, etc.); introduced IDENTIFIED_FILTERED; removed match_data_t.from; added includes, typedefs, externs; extended scan_data_t fields and function signatures.
CLI parsing & help
src/main.c, src/help.c
Added getopt_long with many long options; wired new CLI flags for tolerance/ranking/SBOM/attribution; expanded help text and output constants.
Scan init & callers
src/scan.c, src/binary_scan.c, src/debug.c
Extended scan_data_init, wfp_scan, hash_scan signatures to accept adjust_tolerance, ranking threshold, snippet thresholds and honor-file-ext; updated call sites.
Snippet selection & tolerance
src/snippet_selection.c, src/snippets.c
Replaced global snippet thresholds with per-scan fields; updated ranges_assemble/ranges_join_overlapping APIs to accept min_range_lines, range_tolerance and dynamic_ranges; tolerance adjustment made conditional.
Component parsing & URL handling
src/component.c, src/url.c
Shifted CSV rank column parsing; added component_rank_max and logic to mark components as IDENTIFIED_FILTERED when rank exceeds threshold; improved cleanup/logging.
Matching & ranking engine
src/match.c, src/match_list.c
Implemented multi-criteria comparator with SBOM and age handling, lazy age init, defensive alloc checks; switched fetch limits to fetch_max_files; added verbose decision logging.
Reporting / JSON output
src/report.c
Added ranking_enabled; emit "filtered" status and conditional rank; include snippet "hits" when tolerance disabled.
License handling
src/license.c
Consolidated license storage into struct license_type with id; license_add_to_list gains license_id; JSON emission adjusted and scancode filtering applied.
Limits, files & versions
src/limits.c, src/file.c, src/versions.c, src/vulnerability.c
Removed several global snippet-tuning vars; added fetch_max_files (default 12000) and max_vulnerabilities; updated loops and array bounds to new defaults.
Utilities & minor fixes
src/util.c, src/match_list.c
Safer path handling via strdup/dirname, updated third-party pattern list, ensured consistent frees; minor whitespace/formatting tweaks.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI
    participant Scanner
    participant MatchLoader
    participant ComponentDB
    participant Comparator
    participant Reporter

    User->>CLI: invoke scan with flags (--tolerance, --rank, --max-files, ...)
    CLI->>Scanner: start scan(target, config)
    Scanner->>MatchLoader: load matches (fetch_max_files)
    MatchLoader->>ComponentDB: query candidate components
    ComponentDB-->>MatchLoader: return candidates
    MatchLoader->>Comparator: evaluate candidates (SBOM, path-rank, dates, purl, extension, age)
    Comparator->>ComponentDB: lazy fetch metadata/age if needed
    Comparator-->>MatchLoader: select best match or mark IDENTIFIED_FILTERED
    MatchLoader->>Reporter: emit match + rank/status
    Reporter->>User: output JSON/report
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped through flags and files tonight,

Tuned snippets, ranks, and matching light,
Filtered high and picked the best,
JSON gleams and docs express the rest,
A rabbit cheers for v5.4.20 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Title check ⚠️ Warning The title '5.4.20' is a version number that does not describe the actual changes in the pull request. The changeset contains extensive configuration enhancements, new CLI options, license handling improvements, and component ranking logic—none of which are conveyed by a version string. Replace the title with a descriptive summary of the main changes, such as 'Add extensible configuration options and component ranking' or 'Introduce configurable snippet matching and component filtering'. The version can be mentioned in commit messages or tags.
Docstring Coverage ⚠️ Warning Docstring coverage is 64.44% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/license.c (1)

447-454: Dead code: Inner loop serves no purpose.

This loop iterates up to 4 times but only checks if licenses_by_type.count > 0. Since the count doesn't change within the loop, it will either:

  • Break immediately on first iteration (if count > 0), or
  • Run 4 iterations doing nothing (if count == 0)

The loop variable i also shadows the outer loop variable from line 435. This appears to be vestigial code from the refactoring that should be simplified or removed.

Suggested fix
 		if (records)
 		{
-			//Look if someone of the prefered liceses ids already has a match
-			for (int i = 0; i < 4; i++)
-			{
-				if (licenses_by_type.count > 0)
-				{
-					scanlog("Stop searching for licenses\n");
-					break;
-				}
-			}
+			if (licenses_by_type.count > 0)
+			{
+				scanlog("Stop searching for licenses\n");
+				break;
+			}
 		}
inc/component.h (1)

105-109: Typo: component_date_comparation should likely be component_date_comparison.

Minor naming issue: "comparation" is not a standard English word; "comparison" is the correct term.

Suggested fix
-bool component_date_comparation(component_data_t * a, component_data_t * b);
+bool component_date_comparison(component_data_t * a, component_data_t * b);

Note: This would require updating the function definition and all call sites.

src/url.c (1)

351-357: Misleading log message in failure path.

The log message "ignoring component with rank %d" is misleading because fill_component returns false for incomplete metadata (missing url, version, or purl), not due to rank. The actual rank value may not even be meaningful at this point.

🐛 Suggested fix
 		if (!result)
 		{
-			scanlog("ignoring component with rank %d\n", comp->rank);
+			scanlog("ignoring component with incomplete metadata\n");
 			free(url);
 			component_data_free(comp);
 			return false;
 		}
🤖 Fix all issues with AI agents
In @inc/scan.h:
- Around line 69-74: The struct field snippet_honor_file_extension is an int but
the functions scan_data_init, wfp_scan, and hash_scan expect a bool; change the
field declaration snippet_honor_file_extension to type bool so it matches the
function signatures and avoids implicit conversions, and update any
initializations or assignments to use true/false accordingly (refer to symbol
snippet_honor_file_extension and the functions scan_data_init, wfp_scan,
hash_scan).

In @src/help.c:
- Line 111: Update the hardcoded copyright year in the string literal in
src/help.c (the line containing "Copyright (C) 2018-2022 SCANOSS.COM") to
reflect the current year range (e.g., "2018-2026") or replace it with a dynamic
build-time value; locate the string in help.c (the help/usage text) and either
change the year range to 2018-2026 or hook in a macro/define (e.g.,
COPYRIGHT_YEAR or generated during build) so the year does not become stale.

In @src/report.c:
- Line 51: The static flag ranking_enabled persists across file scans causing
ranking to remain enabled after a file with component_ranking_threshold >= 0;
reset ranking_enabled to false at the start of each file scan by assigning
ranking_enabled = false in either json_open_file() or json_close_file() so each
file's threshold is applied independently (locate the static bool
ranking_enabled and add the reset in json_open_file() or json_close_file() to
ensure per-file isolation).

In @src/util.c:
- Around line 342-343: strdup result is not checked before calling dirname,
which can cause undefined behavior if strdup returns NULL; after calling
strdup(comp->file) into full_path, add a NULL check and handle allocation
failure (e.g., log/report error and return an error code or clean up and exit
the function) before calling dirname(full_path), and ensure any allocated
resources are freed (and avoid calling dirname when full_path is NULL);
reference symbols: strdup, full_path, dirname, path, comp->file.

In @src/versions.c:
- Around line 48-54: The arrays purl_indirection_reference and purl_version_list
are statically sized to FETCH_MAX_FILES_DEFAULT but bounds are checked against
the user-controlled fetch_max_files, risking overflow in purl_latest_version_add
when doing assignments like purl_indirection_reference[purl_indirection_index] =
strdup(...); fix by clamping fetch_max_files to at most FETCH_MAX_FILES_DEFAULT
right after parsing the --max-files option (or alternatively allocate these
arrays dynamically using fetch_max_files); ensure all checks using
purl_indirection_index compare against the clamped value
(FETCH_MAX_FILES_DEFAULT or the allocated size) so writes cannot exceed the
actual buffer capacity.
🧹 Nitpick comments (11)
src/license.c (1)

263-269: Minor: Comment/code casing mismatch.

The comment on line 267 says "license-ref" but the code checks for "LicenseRef" (PascalCase). The code logic is correct for filtering scancode LicenseRef entries, but the comment should match:

-	//skip scancode licenses starting with "license-ref"	
+	//skip scancode licenses starting with "LicenseRef"	
src/match_list.c (1)

99-106: Whitespace-only changes look fine.

The formatting adjustments in this segment are acceptable. However, note that the static analysis warning about potential null pointer dereference from calloc (line 100) is a pre-existing concern throughout this file—all calloc calls lack null checks. Consider adding null checks in a follow-up.

♻️ Optional: Add null check for allocation failure
 struct comp_entry *nn = calloc(1, sizeof(struct comp_entry)); /* Insert after. */
+if (!nn) {
+    component_data_free(new_comp);
+    return false;
+}
 nn->component = new_comp;
README.md (1)

50-50: Consider hyphenating compound adjective.

For consistency with style guidelines, "High Precision" could be written as "High-Precision" when used as a compound adjective modifying "Snippet Match mode."

-* `-H, --hpsm` - Enable High Precision Snippet Match mode (requires 'libhpsm.so' in the system)
+* `-H, --hpsm` - Enable High-Precision Snippet Match mode (requires 'libhpsm.so' in the system)
inc/scan.h (1)

79-85: Consider grouping configuration parameters into a struct.

These function signatures have grown to 9 parameters, which reduces readability and maintainability. Consider introducing a scan_config_t struct to bundle the configuration options (max_snippets, max_components, adjust_tolerance, component_ranking_threshold, snippet_min_hits, snippet_min_lines, snippet_range_tolerance, snippet_honor_file_extension).

This would simplify signatures and make future extensions easier.

src/report.c (1)

336-341: Consider typo: scan_ower should likely be scan_owner.

The field name scan_ower appears to be a typo for scan_owner. While this is likely a pre-existing issue and the code works with the current field name, it affects readability.

src/url.c (1)

92-97: Consider extracting rank filtering to a helper function.

The rank filtering logic is duplicated between handle_url_record (lines 92-96) and get_oldest_url (lines 362-366). Consider extracting to a helper function to reduce duplication and ensure consistent behavior.

♻️ Suggested helper function
static void apply_rank_filter(component_data_t *comp) {
    if (component_rank_max > 0 && comp->rank > component_rank_max) {
        scanlog("Setting component with rank %d as filtered\n", comp->rank);
        comp->identified = IDENTIFIED_FILTERED;
    }
}

Then call apply_rank_filter(new_comp); in both locations.

Also applies to: 362-366

inc/limits.h (2)

38-53: Duplicate macro definitions for fetch max files default.

FETCH_MAX_FILES_DEFAULT (line 38) and DEFAULT_FETCH_MAX_FILES (line 53) both define the same value (12000). Consider consolidating to a single macro to avoid confusion and maintenance overhead.

♻️ Suggested consolidation
 #define MAX_FILE_PATH 1024
-#define FETCH_MAX_FILES_DEFAULT 12000
+#define DEFAULT_FETCH_MAX_FILES 12000
 #define MIN_FILE_SIZE 256 // files below this size will be ignored
 #define CRC_LIST_LEN 1024 // list of crc checksums to avoid metadata duplicates
 
 /* Snippets */
 ...
 #define SNIPPETS_DEFAULT_HONOR_FILE_EXTENSION true /** Honor file extension during snippet matching */
-#define DEFAULT_FETCH_MAX_FILES 12000 /** Maximum number of files to fetch during component matching */

22-25: Move #include inside the include guard.

The #include <stdint.h> is placed before the include guard, which means it will be processed on every inclusion of this header. While stdint.h has its own guards, it's better practice to place all includes inside the guard.

♻️ Suggested fix
-#include <stdint.h>
-
 #ifndef __LIMITS_H
     #define __LIMITS_H
+
+#include <stdint.h>
+
 /* Constants */
src/scan.c (1)

56-77: Consider adding null checks for memory allocations.

Static analysis flags potential null pointer dereferences if calloc/malloc fails (lines 59-63). While allocation failures are rare, adding a guard would improve robustness.

♻️ Proposed defensive check
 scan_data_t * scan_data_init(char *target, int max_snippets, int max_components, bool adjust_tolerance, int component_ranking_threshold, int snippet_min_hits, int snippet_min_lines, int snippet_range_tolerance, bool snippet_honor_file_extension)
 {
 	scanlog("Scan Init\n");
 	scan_data_t * scan = calloc(1, sizeof(*scan));
+	if (!scan) {
+		fprintf(stderr, "Failed to allocate scan_data_t\n");
+		return NULL;
+	}
 	scan->file_path = strdup(target);
 	scan->file_size = malloc(32);
 	scan->hashes = calloc(MAX_FILE_SIZE,1);
 	scan->lines  = malloc(MAX_FILE_SIZE);
+	if (!scan->file_path || !scan->file_size || !scan->hashes || !scan->lines) {
+		fprintf(stderr, "Failed to allocate scan fields\n");
+		// Note: Would need cleanup of partially allocated fields
+		return NULL;
+	}
src/snippet_selection.c (1)

293-339: Consider adding null check for allocation.

Line 299 allocates out_ranges without checking for failure. While rare, allocation failure would cause undefined behavior.

♻️ Proposed fix
 	matchmap_range *out_ranges = calloc(out_size, sizeof(matchmap_range));
+	if (!out_ranges) {
+		scanlog("Failed to allocate out_ranges\n");
+		return NULL;
+	}

The function logic itself is correct - properly parameterized with tolerance and dynamic_ranges flags.

src/main.c (1)

455-481: Consider validating numeric CLI arguments.

The handlers use atoi() without validating the result. Invalid input (e.g., --max-files=abc) would silently set the value to 0.

♻️ Example validation pattern
 case 257: /* --max-files */
-	fetch_max_files = atoi(optarg);
+	fetch_max_files = atoi(optarg);
+	if (fetch_max_files <= 0) {
+		fprintf(stderr, "Invalid value for --max-files: %s\n", optarg);
+		exit(EXIT_FAILURE);
+	}
 	scanlog("Max files to fetch set to %d\n", fetch_max_files);
 	break;

The logic for disabling scan_adjust_tolerance when explicit values are provided is correct - this ensures user-specified values aren't overwritten by the auto-adjustment algorithm.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d23df88 and c649df4.

📒 Files selected for processing (25)
  • README.md
  • inc/component.h
  • inc/limits.h
  • inc/match.h
  • inc/parse.h
  • inc/scan.h
  • inc/scanoss.h
  • src/binary_scan.c
  • src/component.c
  • src/debug.c
  • src/file.c
  • src/help.c
  • src/license.c
  • src/limits.c
  • src/main.c
  • src/match.c
  • src/match_list.c
  • src/report.c
  • src/scan.c
  • src/snippet_selection.c
  • src/snippets.c
  • src/url.c
  • src/util.c
  • src/versions.c
  • src/vulnerability.c
💤 Files with no reviewable changes (1)
  • inc/match.h
🧰 Additional context used
🧬 Code graph analysis (6)
inc/scan.h (2)
src/scan.c (5)
  • scan_data_init (56-77)
  • scan_data_free (82-99)
  • ldb_scan (444-545)
  • wfp_scan (224-365)
  • hash_scan (197-212)
src/snippets.c (2)
  • adjust_tolerance (106-138)
  • ldb_scan_snippets (318-600)
src/url.c (3)
src/component.c (1)
  • component_data_free (43-70)
src/scan.c (1)
  • asset_declared (141-190)
src/scanlog.c (1)
  • scanlog (71-107)
src/component.c (1)
src/parse.c (1)
  • extract_csv (302-326)
src/scan.c (1)
src/snippets.c (1)
  • adjust_tolerance (106-138)
src/snippet_selection.c (2)
src/match.c (1)
  • match_data_free (63-76)
src/scanlog.c (1)
  • scanlog (71-107)
src/binary_scan.c (1)
src/scan.c (1)
  • scan_data_init (56-77)
🪛 ast-grep (0.40.4)
src/match.c

[warning] 263-263: This hashing algorithm is insecure. If this hash is used in a security context, such as password hashing, it should be converted to a stronger hashing algorithm.
Context: MD5((uint8_t *)comp->purls[0], strlen(comp->purls[0]), comp->purls_md5[0]);
Note: [CWE-328] Use of Weak Hash. [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(insecure-hash-c)

🪛 Cppcheck (2.19.0)
src/match_list.c

[warning] 101-101: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)

src/scan.c

[warning] 65-65: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 66-66: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 67-67: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 68-68: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 69-69: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 70-70: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 71-71: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)

src/snippet_selection.c

[warning] 127-127: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 128-128: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 129-129: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 130-130: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 131-131: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 132-132: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 307-307: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)

🪛 LanguageTool
README.md

[uncategorized] ~50-~50: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ctual extension * -H, --hpsm - Enable High Precision Snippet Match mode (requires 'libhpsm.s...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

🔇 Additional comments (34)
src/license.c (4)

55-92: LGTM!

The translator function correctly maps license source IDs to human-readable names. The switch-case groupings (e.g., cases 0/35 and 7/9) are clear and the NULL return for unknown IDs is properly handled by the caller at line 265-266.

Minor note: case 8 appearing after case 10 is slightly unusual ordering but doesn't affect correctness.


101-108: Padding logic is correct for CRC32C alignment.

The formula ((len + 8) / 8) * 8 ensures at least 8 bytes beyond the string length, and calloc properly zero-initializes the padding bytes.

Minor edge case: if calloc fails at line 105-107, ptr->licenses has already been reallocated with an additional slot containing an uninitialized pointer. Since ptr->count is not incremented on failure, subsequent operations should be safe, but you may want to consider setting ptr->licenses[ptr->count] = NULL before the calloc for defensive consistency.


386-386: LGTM!

Clean refactoring to use the unified license list structure.


416-416: LGTM!

Clean initialization of the consolidated license list structure.

inc/parse.h (1)

7-7: LGTM!

The component.h include is correctly added to provide the component_item type required by the get_components function declaration on line 15.

src/util.c (1)

396-411: LGTM!

The special case handling for root directory (".") correctly returns numPatterns (lowest third-party rank), and memory cleanup with free(full_path) is consistent across all exit paths. The expanded pattern list appropriately captures common third-party directory conventions.

README.md (2)

46-81: Excellent documentation improvements!

The reorganized configuration options with clear subsections (Basic Configuration, SBOM and Filtering, Attribution and Licenses, General Options) significantly improves usability. The documentation accurately reflects the CLI capabilities.


87-108: Engine flags table is accurate and well-formatted.

The flag values align with the constants defined in inc/scanoss.h. The table provides clear documentation for users configuring the scanning engine.

inc/scanoss.h (1)

36-36: Version bump to 5.4.20 is correct.

Verification confirms that relocated declarations are properly accessible in their new locations:

  • Type definitions (component_item, keywords, file_recordset, len_rank) are in inc/component.h
  • Macro definitions (MAX_FILE_PATH, FETCH_MAX_FILES_DEFAULT, MIN_FILE_SIZE, CRC_LIST_LEN) are in inc/limits.h
  • Extern declarations (ignore_components, declared_components) are properly declared in inc/component.h

The modularization effort is sound.

src/vulnerability.c (1)

43-43: LGTM!

The max_vulnerabilities constant is appropriately defined with clear documentation. Moving this from the public header to the source file improves encapsulation.

src/file.c (2)

195-196: LGTM!

The change from the hard-coded FETCH_MAX_FILES constant to the configurable fetch_max_files variable enables runtime configuration. The boundary check logic remains consistent.


234-237: LGTM!

Consistent with the collect_all_files change, using the configurable fetch_max_files variable.

inc/component.h (2)

21-26: LGTM!

Adding IDENTIFIED_FILTERED = -1 as a sentinel value for filtered components is a clean approach for the new component-rank-based filtering logic.


68-99: LGTM!

The typedefs (keywords, file_recordset, len_rank, component_item) and extern declarations are appropriately moved to this header, improving modularity by co-locating component-related types.

src/component.c (2)

295-301: LGTM: Simplified rank validation is appropriate.

Removing the arbitrary strlen(rank) < 3 constraint allows all valid rank values. The commented-out scanlog on line 298 could be removed to reduce noise, but this is a minor cleanup.


256-256: Remove this review comment—it is based on incorrect information.

The rank extraction at column 14 is not a change from column 13; this is the initial code introduced in commit c649df4 (2026-01-12). Git history shows no prior version using column 13. The codebase consistently extracts columns 1–7 (vendor, component, version, release_date, license, purl, url), columns 8–12 via the url_stats loop, and deliberately skips column 13 before extracting rank from column 14. Without evidence of a bug or incorrect behavior, no action is required.

Likely an incorrect or invalid review comment.

src/debug.c (1)

141-141: LGTM: Benchmark updated to use new scan_data_init signature.

The parameter values are appropriate for benchmark testing. Using SNIPPETS_DEFAULT_RANGE_TOLERANCE constant ensures consistency with default behavior.

src/binary_scan.c (1)

273-273: LGTM: Binary scan updated to use new scan_data_init signature.

The parameters are appropriate for binary file matching where single component results are expected.

src/limits.c (1)

12-12: LGTM: Simplified global configuration.

The refactoring from multiple global tuning variables to parameterized function calls is a cleaner approach. The remaining fetch_max_files global with default 12000 aligns with the documented CLI default.

src/report.c (1)

217-225: LGTM: Clean early-exit for filtered components.

The filtered component output with minimal JSON (status and rank only) is appropriate for rank-filtered items. The JSON structure is correctly handled.

src/help.c (1)

54-61: LGTM: Comprehensive CLI documentation added.

The new long-form options are well-documented with clear descriptions and default values. The defaults reference constants (SCAN_MAX_COMPONENTS_DEFAULT, COMPONENT_DEFAULT_RANK + 1) which ensures consistency with actual behavior.

src/url.c (2)

54-55: LGTM: Global rank threshold with sensible default.

Using COMPONENT_DEFAULT_RANK + 1 as the default means components at or below the default rank pass through, which is appropriate default behavior.


86-88: LGTM: Safe basename usage with temporary copy.

Creating a temporary copy before calling basename() is correct since basename() may modify its argument. The memory is properly freed after use.

src/snippets.c (2)

106-138: LGTM - Tolerance adjustment refactored to use configuration.

The adjust_tolerance function now properly initializes from SNIPPETS_DEFAULT_* macros and stores computed values in the scan structure (scan->snippet_min_hits, scan->snippet_min_lines, scan->snippet_range_tolerance). This enables the new CLI-driven configuration while maintaining backward compatibility with auto-adjustment.


332-334: LGTM - Conditional tolerance adjustment.

The tolerance adjustment is now gated by scan->snippet_adjust_tolerance, allowing CLI users to override the automatic adjustment when they provide explicit values via --min-snippet-hits, --min-snippet-lines, or --range-tolerance.

src/scan.c (1)

461-467: LGTM - Component ranking threshold logic.

Clear three-way handling:

  • Negative: disables ranking
  • Zero: accepts all (default + 1)
  • Positive: uses explicit threshold

This provides good flexibility for CLI users to control component filtering.

src/match.c (4)

259-267: LGTM - Component age initialization with purl MD5.

The MD5 usage here is appropriate as a database lookup key (not for security purposes). The lazy initialization pattern avoids computing the hash until needed.

Note: The static analysis warning about weak hash is a false positive in this context - MD5 is being used for content addressing/lookup, not cryptographic security.


287-319: LGTM - File extension comparison helper.

The function properly handles all NULL combinations and returns appropriate values to influence component selection based on extension matching with the reference file.


341-669: LGTM - Comprehensive component comparison logic.

The multi-criteria comparison function is well-structured with clear numbered stages (1-9) and detailed comments. The hierarchical evaluation order (SBOM → hints → path rank → dates → extensions → ranking → tiebreakers) is logical and well-documented.


715-720: LGTM - Iteration limit integration with fetch_max_files.

The iteration limit now properly uses the configurable fetch_max_files value (settable via --max-files CLI option), with appropriate scaling for high accuracy mode.

src/snippet_selection.c (1)

124-145: LGTM - Snippet filtering with configurable parameters.

The filtering logic properly uses the scan configuration fields:

  • scan->snippet_min_hits for minimum hit threshold
  • scan->snippet_honor_file_extension for extension-based filtering
  • scan->snippet_min_lines for minimum matched lines

This enables CLI users to tune snippet matching sensitivity.

src/main.c (3)

67-73: LGTM - Configuration variables with sensible defaults.

The new global configuration variables are initialized with the centralized defaults from limits.h. This ensures consistent behavior whether options are specified via CLI or not.

Note: These are file-scope globals (implicitly static at file level), which is acceptable for a main.c configuration pattern.


275-303: LGTM - Well-structured long options definition.

The long_options array is properly defined:

  • Short-form options mapped to their characters
  • Long-only options use numeric codes 256+ to avoid ASCII conflicts
  • Properly terminated with null entry

218-224: LGTM - Configuration propagation through scan paths.

The scan initialization calls consistently pass all configuration parameters, ensuring the CLI options affect all scanning modes (directory recursion, single file, WFP, hash).

scanoss --sbom my_sbom.json TARGET Scan TARGET including SBOM assets\n\
\n\
Copyright (C) 2018-2022 SCANOSS.COM\n", DEFAULT_MATCHMAP_FILES, ENGINE_FLAGS_FILE);
Copyright (C) 2018-2022 SCANOSS.COM\n", SCAN_MAX_COMPONENTS_DEFAULT, COMPONENT_DEFAULT_RANK + 1, DEFAULT_MATCHMAP_FILES, ENGINE_FLAGS_FILE);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Update copyright year.

The copyright notice shows 2018-2022, but the current year is 2026. Consider updating to 2018-2026 or using a dynamic approach.

🤖 Prompt for AI Agents
In @src/help.c at line 111, Update the hardcoded copyright year in the string
literal in src/help.c (the line containing "Copyright (C) 2018-2022
SCANOSS.COM") to reflect the current year range (e.g., "2018-2026") or replace
it with a dynamic build-time value; locate the string in help.c (the help/usage
text) and either change the year range to 2018-2026 or hook in a macro/define
(e.g., COPYRIGHT_YEAR or generated during build) so the year does not become
stale.


uint64_t engine_flags = 0;
char kb_version[MAX_INPUT];
static bool ranking_enabled = false;
Copy link

@coderabbitai coderabbitai bot Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n 'ranking_enabled' --type=c -B2 -A2

Repository: scanoss/engine

Length of output: 619


🏁 Script executed:

rg -n 'print_json_match' --type=c -B2 -A2 | head -100

Repository: scanoss/engine

Length of output: 1249


🏁 Script executed:

# Check for any reset or initialization of ranking_enabled in main/scan functions
rg -n 'ranking_enabled\s*=' --type=c -B1 -A1

Repository: scanoss/engine

Length of output: 310


🏁 Script executed:

# Look for file processing loops to understand scan boundaries
rg -n 'for.*file|while.*file|scan_file|process_file' --type=c | head -20

Repository: scanoss/engine

Length of output: 1694


🏁 Script executed:

# Check json_close_file and file scope boundaries
rg -n 'json_close_file|json_open_file|report_open' --type=c -B3 -A5

Repository: scanoss/engine

Length of output: 1388


🏁 Script executed:

# Look for main scan loop to understand multi-file processing
rg -n 'main|for.*scan|while.*scan' --type=c | grep -E '(main|scan_data|for|while)' | head -30

Repository: scanoss/engine

Length of output: 2488


🏁 Script executed:

# Check component_ranking_threshold to see if it's per-scan or global
rg -n 'component_ranking_threshold' --type=c -B2 -A2

Repository: scanoss/engine

Length of output: 4250


🏁 Script executed:

# Verify the exact flow: look at json_open_file and json_close_file to see if there's any reset
cat -n src/report.c | sed -n '74,110p'

Repository: scanoss/engine

Length of output: 992


🏁 Script executed:

# Check if there's any other place ranking_enabled is set to false
rg -n 'ranking_enabled\s*=\s*false' --type=c

Repository: scanoss/engine

Length of output: 110


Reset ranking_enabled between file scans to prevent state persistence across multiple files with different thresholds.

The static variable ranking_enabled is set to true when processing files with component_ranking_threshold >= 0 (line 337) but is never reset to false. In wfp_scan(), multiple files are processed sequentially, each with potentially different component_ranking_threshold values. Once ranking is enabled for one file, it remains enabled for all subsequent files regardless of their individual threshold settings, causing rank output to appear on files where it should be suppressed.

Reset ranking_enabled = false in json_close_file() or json_open_file() to ensure per-file threshold isolation.

🤖 Prompt for AI Agents
In @src/report.c at line 51, The static flag ranking_enabled persists across
file scans causing ranking to remain enabled after a file with
component_ranking_threshold >= 0; reset ranking_enabled to false at the start of
each file scan by assigning ranking_enabled = false in either json_open_file()
or json_close_file() so each file's threshold is applied independently (locate
the static bool ranking_enabled and add the reset in json_open_file() or
json_close_file() to ensure per-file isolation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's intentional

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mscasso-scanoss, understood! I've noted that the ranking_enabled static variable is intentionally designed to persist across file scans in src/report.c.


✏️ Learnings added
Learnt from: mscasso-scanoss
Repo: scanoss/engine PR: 103
File: src/report.c:51-51
Timestamp: 2026-01-13T11:45:11.443Z
Learning: In the SCANOSS engine project (src/report.c), the static variable `ranking_enabled` is intentionally designed to persist its state across multiple file scans and is not reset between files.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +342 to +343
char * full_path = strdup(comp->file);
char * path = dirname(full_path);
Copy link

@coderabbitai coderabbitai bot Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add null check after strdup to avoid undefined behavior.

If strdup fails and returns NULL, passing it to dirname results in undefined behavior. Add a null check to handle allocation failure gracefully.

🐛 Proposed fix
 char * full_path = strdup(comp->file);
+if (!full_path)
+    return 0;
 char * path = dirname(full_path);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
char * full_path = strdup(comp->file);
char * path = dirname(full_path);
char * full_path = strdup(comp->file);
if (!full_path)
return 0;
char * path = dirname(full_path);
🤖 Prompt for AI Agents
In @src/util.c around lines 342 - 343, strdup result is not checked before
calling dirname, which can cause undefined behavior if strdup returns NULL;
after calling strdup(comp->file) into full_path, add a NULL check and handle
allocation failure (e.g., log/report error and return an error code or clean up
and exit the function) before calling dirname(full_path), and ensure any
allocated resources are freed (and avoid calling dirname when full_path is
NULL); reference symbols: strdup, full_path, dirname, path, comp->file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full_path can´t be null

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mscasso-scanoss, understood. While strdup can technically return NULL if insufficient memory is available, I acknowledge that in your environment this is not a practical concern. I'll defer to your judgment on this.

src/versions.c Outdated
Comment on lines 48 to 54
static char * purl_indirection_reference[FETCH_MAX_FILES_DEFAULT];
static int purl_indirection_index = 0;
static release_version * purl_version_list[FETCH_MAX_FILES];
static release_version * purl_version_list[FETCH_MAX_FILES_DEFAULT];

void purl_latest_version_add(component_data_t * component)
{
if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == FETCH_MAX_FILES)
if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == fetch_max_files)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find FETCH_MAX_FILES_DEFAULT and fetch_max_files usage
echo "=== FETCH_MAX_FILES_DEFAULT definition ==="
rg -n "FETCH_MAX_FILES_DEFAULT" --type c

echo ""
echo "=== fetch_max_files definition/initialization ==="
rg -n "fetch_max_files\s*=" --type c -A 2

echo ""
echo "=== Context around src/versions.c lines 48-54 ==="
head -70 src/versions.c | tail -30

Repository: scanoss/engine

Length of output: 1724


🏁 Script executed:

#!/bin/bash
# Check for any validation/clamping of fetch_max_files after it's set from atoi
echo "=== Context around src/main.c line 456 (fetch_max_files assignment) ==="
sed -n '450,470p' src/main.c

echo ""
echo "=== Check for any MIN/MAX clamping macros or functions ==="
rg -n "fetch_max_files" --type c | head -20

Repository: scanoss/engine

Length of output: 1717


Critical buffer overflow: static array size vs. user-controlled runtime limit.

The static arrays purl_indirection_reference and purl_version_list are sized with FETCH_MAX_FILES_DEFAULT (12000), but the boundary check uses fetch_max_files, which is user-controllable via the --max-files command-line argument. The fetch_max_files value is set directly from atoi(optarg) with no validation or clamping. If a user passes --max-files with a value greater than 12000, the code will write beyond array bounds when executing purl_indirection_reference[purl_indirection_index] = strdup(...), causing undefined behavior and a potential heap corruption vulnerability.

Either:

  1. Clamp fetch_max_files to FETCH_MAX_FILES_DEFAULT after reading from command line, or
  2. Size the arrays dynamically based on the user-provided value, or
  3. Use the constant for both array sizing and boundary checking
Suggested fix (option 3)
-	if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == fetch_max_files)
+	if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == FETCH_MAX_FILES_DEFAULT)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
static char * purl_indirection_reference[FETCH_MAX_FILES_DEFAULT];
static int purl_indirection_index = 0;
static release_version * purl_version_list[FETCH_MAX_FILES];
static release_version * purl_version_list[FETCH_MAX_FILES_DEFAULT];
void purl_latest_version_add(component_data_t * component)
{
if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == FETCH_MAX_FILES)
if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == fetch_max_files)
static char * purl_indirection_reference[FETCH_MAX_FILES_DEFAULT];
static int purl_indirection_index = 0;
static release_version * purl_version_list[FETCH_MAX_FILES_DEFAULT];
void purl_latest_version_add(component_data_t * component)
{
if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == FETCH_MAX_FILES_DEFAULT)
🤖 Prompt for AI Agents
In @src/versions.c around lines 48 - 54, The arrays purl_indirection_reference
and purl_version_list are statically sized to FETCH_MAX_FILES_DEFAULT but bounds
are checked against the user-controlled fetch_max_files, risking overflow in
purl_latest_version_add when doing assignments like
purl_indirection_reference[purl_indirection_index] = strdup(...); fix by
clamping fetch_max_files to at most FETCH_MAX_FILES_DEFAULT right after parsing
the --max-files option (or alternatively allocate these arrays dynamically using
fetch_max_files); ensure all checks using purl_indirection_index compare against
the clamped value (FETCH_MAX_FILES_DEFAULT or the allocated size) so writes
cannot exceed the actual buffer capacity.

Copy link
Contributor

@scanoss-qg scanoss-qg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just consider array bounds advisory from CodeRabbit

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/snippet_selection.c (1)

2-3: File header comment has incorrect filename.

The comment says src/snippets.c but this file is src/snippet_selection.c.

Proposed fix
-* src/snippets.c
+* src/snippet_selection.c
src/license.c (1)

454-461: Dead code: inner loop serves no purpose.

This loop has multiple issues:

  1. Variable shadowing: The inner int i shadows the outer loop's i (line 442).
  2. Unused loop variable: The condition licenses_by_type.count > 0 doesn't depend on i, so looping 4 times is pointless.
  3. Misleading comment: The comment mentions checking "preferred license ids" but the code only checks if count > 0.
  4. Break scope: The break only exits the inner loop, not the outer purl loop.

Consider removing this dead code or implementing the intended logic.

🔧 If intent is to break outer loop when licenses exist
-			//Look if someone of the prefered liceses ids already has a match
-			for (int i = 0; i < 4; i++)
-			{
-				if (licenses_by_type.count > 0)
-				{
-					scanlog("Stop searching for licenses\n");
-					break;
-				}
-			}
+			if (licenses_by_type.count > 0)
+			{
+				scanlog("Stop searching for licenses\n");
+				break;
+			}

Note: This break will still only exit the inner for loop at line 442. If you need to exit the purl iteration entirely, you'll need a different control flow mechanism (e.g., goto, flag variable, or restructuring).

🤖 Fix all issues with AI agents
In @src/license.c:
- Around line 270-276: Update the inline comment to match the actual SPDX case
used in the condition: replace "license-ref" with "LicenseRef" so the comment
accurately reflects the check around license_source_id and the subsequent strcmp
against "LicenseRef" in the block that uses license_id_to_source_name and the
license variable.

In @src/snippet_selection.c:
- Around line 387-391: The code forcibly sets
match->scan_ower->snippet_adjust_tolerance = true and then computes ranges/
ranges_number assuming that override; remove the hard assignment so the
user-provided match->scan_ower->snippet_adjust_tolerance is respected, call
ranges_join_overlapping with match->scan_ower->snippet_range_tolerance and the
actual snippet_adjust_tolerance value, and compute ranges_number from
matchmap_reg->ranges_number when snippet_adjust_tolerance is false or
MATCHMAP_RANGES otherwise; if you must temporarily override user input, gate the
override behind a compile-time flag (e.g., #ifdef DISABLE_DYNAMIC_RANGES) or
emit a warning/log before changing the field instead of silently overwriting it.
- Around line 383-384: The format specifier in the scanlog call is wrong:
match->matchmap_reg->range[i].from and .to are uint16_t (matchmap_range) but the
format uses %ld; change the format to use an unsigned specifier (e.g., %u) or
cast the values to unsigned int when logging so the types match (update the
scanlog format string and arguments in the call that prints
match->matchmap_reg->range[i].from and .to).
- Around line 406-409: The log uses "%ld" for uint16_t fields causing wrong
formatting; update the scanlog call in the loop over ranges (variables ranges
and ranges_number) to use the correct specifier for uint16_t (e.g., "%hu" for
ranges[i].from and ranges[i].to) or use the PRIu16 macro with an appropriate
cast, so the format matches the type.
- Around line 295-307: The calloc allocation for out_ranges in the snippet
inside ranges_join_overlapping (using MATCHMAP_RANGES/dynamic_ranges and size)
lacks a NULL check; after allocating out_ranges, validate it and on failure
handle gracefully (e.g., free any prior resources if needed and return NULL or
an error indicator) and document that callers such as compile_ranges must check
for a NULL return; ensure subsequent uses (memset, loops using out_ranges) only
occur after a successful allocation.
- Around line 126-132: After calling calloc for match_new (match_data_t
*match_new = calloc(...)) add a NULL check immediately before any dereference
(memcpy, assigning fields); if calloc returns NULL, log or handle the allocation
failure and skip processing this match (e.g., continue the loop or return an
error) to avoid dereferencing a NULL pointer and leaking state; ensure any
upstream resources are left consistent when choosing to continue or return.
🧹 Nitpick comments (1)
inc/scan.h (1)

79-85: Consider grouping scan configuration into a struct.

These function signatures now accept 9 parameters each, which hurts readability and makes call sites error-prone. A configuration struct would improve maintainability:

typedef struct scan_config_t {
    int max_snippets;
    int max_components;
    bool adjust_tolerance;
    int component_ranking_threshold;
    int snippet_min_hits;
    int snippet_min_lines;
    int snippet_range_tolerance;
    bool snippet_honor_file_extension;
} scan_config_t;

scan_data_t * scan_data_init(char *target, scan_config_t *config);

This also makes future parameter additions non-breaking.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c649df4 and fdac26b.

📒 Files selected for processing (4)
  • inc/scan.h
  • src/license.c
  • src/snippet_selection.c
  • src/versions.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/versions.c
🧰 Additional context used
🧬 Code graph analysis (1)
src/snippet_selection.c (2)
src/match.c (1)
  • match_data_free (63-76)
src/scanlog.c (1)
  • scanlog (71-107)
🪛 Cppcheck (2.19.0)
src/snippet_selection.c

[warning] 127-127: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 128-128: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 129-129: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 130-130: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 131-131: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 132-132: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)


[warning] 307-307: If memory allocation fails, then there is a possible null pointer dereference

(nullPointerOutOfMemory)

🔇 Additional comments (12)
inc/scan.h (1)

69-74: LGTM - New configuration fields are well-documented.

The inline comments clarify the purpose of each field. The naming is consistent with the existing codebase conventions.

src/snippet_selection.c (5)

124-145: LGTM - Configuration parameters properly integrated.

The snippet filtering logic correctly uses the new per-scan configuration fields (snippet_min_hits, snippet_honor_file_extension, snippet_min_lines) instead of global constants.


245-277: LGTM - Parameterized range assembly.

The function now correctly accepts min_range_lines and ranges_number as parameters, enabling per-scan configuration of range filtering.


323-334: LGTM - Dynamic ranges early-exit behavior.

The conditional break logic correctly differentiates between dynamic and fixed-size range modes.


359-363: LGTM - Defensive early exit for empty ranges.

Good addition of bounds check with appropriate logging.


412-416: LGTM - Parameter threading through range assembly.

The new parameters are correctly propagated to ranges_assemble.

src/license.c (6)

49-58: LGTM!

The struct-based approach cleanly encapsulates license text with its source ID, improving type safety over the previous flat array design.


60-97: LGTM!

The function correctly returns static string literals and handles unknown IDs by returning NULL, which is properly checked by callers.


99-119: LGTM!

The padding calculation correctly ensures 8-byte alignment for CRC32C hardware reads. The memory allocation pattern is appropriate for the expected small number of licenses per component.


121-132: LGTM!

Proper cleanup with NULL checks makes this safe to call multiple times.


391-394: LGTM!

Correctly passes the source ID to the updated license_add_to_list function.


492-501: LGTM!

The final iteration correctly uses the new struct fields (.text and .id), and memory is properly cleaned up with license_free_list.

Comment on lines 387 to 391
//TODO: Re-enable dynamic ranges when feature is complete
// For now, we force adjust_tolerance to ensure stable behavior
match->scan_ower->snippet_adjust_tolerance = true;
matchmap_range *ranges = ranges_join_overlapping(match->matchmap_reg->range, match->matchmap_reg->ranges_number, match->scan_ower->snippet_range_tolerance, !match->scan_ower->snippet_adjust_tolerance);
int ranges_number = !match->scan_ower->snippet_adjust_tolerance ? match->matchmap_reg->ranges_number : MATCHMAP_RANGES;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

User configuration is silently overwritten.

Lines 388-389 force snippet_adjust_tolerance = true, ignoring the value passed through the CLI/API. This means the dynamic_ranges feature is effectively disabled and users cannot enable it. Additionally, line 391's ternary will always evaluate to MATCHMAP_RANGES since line 389 guarantees the condition is true.

If the feature isn't ready, consider:

  1. Removing the parameter from the public API until ready, or
  2. Logging a warning when overriding user input, or
  3. Using a compile-time flag to control the behavior
🤖 Prompt for AI Agents
In @src/snippet_selection.c around lines 387 - 391, The code forcibly sets
match->scan_ower->snippet_adjust_tolerance = true and then computes ranges/
ranges_number assuming that override; remove the hard assignment so the
user-provided match->scan_ower->snippet_adjust_tolerance is respected, call
ranges_join_overlapping with match->scan_ower->snippet_range_tolerance and the
actual snippet_adjust_tolerance value, and compute ranges_number from
matchmap_reg->ranges_number when snippet_adjust_tolerance is false or
MATCHMAP_RANGES otherwise; if you must temporarily override user input, gate the
override behind a compile-time flag (e.g., #ifdef DISABLE_DYNAMIC_RANGES) or
emit a warning/log before changing the field instead of silently overwriting it.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/snippet_selection.c (1)

298-346: Potential buffer overflow in dynamic_ranges mode.

When dynamic_ranges is true, the allocation at line 304 uses out_size = size. However, the loop at line 330 increments out_ranges_index without a bounds check in dynamic mode (the check at line 331 only applies when !dynamic_ranges). If multiple non-overlapping ranges exist, out_ranges_index could exceed out_size - 1, causing out-of-bounds writes at lines 333-335.

🐛 Proposed fix
 				else
 				{
 					out_ranges_index++;
-					if (out_ranges_index == MATCHMAP_RANGES && !dynamic_ranges)
+					if (out_ranges_index >= out_size)
 						break;
 					out_ranges[out_ranges_index].from = ranges[i].from;
 					out_ranges[out_ranges_index].to = ranges[i].to;
 					out_ranges[out_ranges_index].oss_line = ranges[i].oss_line;	
 				}
src/license.c (1)

451-461: Dead code: inner loop is non-functional and shadows outer variable.

This loop has multiple issues:

  1. Variable shadowing: Inner int i shadows the outer purl iteration variable i
  2. Pointless iteration: The loop checks licenses_by_type.count > 0 four times without any state change between iterations
  3. Ineffective break: The break only exits the inner loop, not the outer purl loop
  4. Missing logic: The comment references "prefered liceses ids" but no preference logic exists after the refactoring to a unified list

This appears to be vestigial code from a previous implementation. If the intent is to stop searching once licenses are found, the entire inner loop should be replaced with a direct check and a goto or flag-based outer loop exit.

Suggested fix: Remove dead loop, add proper early exit
 		if (records)
 		{
-			//Look if someone of the prefered liceses ids already has a match
-			for (int i = 0; i < 4; i++)
-			{
-				if (licenses_by_type.count > 0)
-				{
-					scanlog("Stop searching for licenses\n");
-					break;
-				}
-			}
+			if (licenses_by_type.count > 0)
+			{
+				scanlog("Stop searching for licenses\n");
+				break;  /* Exit the purl loop */
+			}
 		}
🤖 Fix all issues with AI agents
In @src/license.c:
- Around line 99-119: The realloc in license_add_to_list can succeed while the
subsequent calloc fails, leaving an allocated extra slot at
ptr->licenses[ptr->count] that is uninitialized; after calling realloc (tmp) but
before calling calloc, initialize the new slot (e.g., set
ptr->licenses[ptr->count].text = NULL and ptr->licenses[ptr->count].id = 0) so a
failed calloc won't leave stale data, and only increment ptr->count after
successful allocation and copy; ensure you use the tmp result to set the slot
prior to allocating the text and rollback/cleanup on failure if needed.

In @src/match.c:
- Around line 259-267: initialize_component_age allocates comp->purls_md5[0]
with malloc but never checks for NULL before using it; update
initialize_component_age to check the return of malloc for NULL and handle
allocation failure (e.g., skip MD5 computation and leave comp->age unchanged or
set a sentinel value and/or log an error) before calling MD5 and
get_component_age; ensure references to comp->purls_md5[0], the MD5(...) call,
and get_component_age(...) are guarded so no null pointer is dereferenced.
🧹 Nitpick comments (1)
src/snippet_selection.c (1)

394-399: TODO: Dynamic ranges feature is forcibly disabled.

The code forces snippet_adjust_tolerance = true regardless of the configured value, with a TODO indicating the dynamic ranges feature is incomplete. This is acceptable as a temporary measure, but consider tracking this as technical debt.

Would you like me to open an issue to track the completion of the dynamic ranges feature?

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fdac26b and 6af0ad9.

📒 Files selected for processing (3)
  • src/license.c
  • src/match.c
  • src/snippet_selection.c
🧰 Additional context used
🧬 Code graph analysis (1)
src/snippet_selection.c (2)
src/scanlog.c (1)
  • scanlog (71-107)
src/match.c (1)
  • match_data_free (63-76)
🪛 ast-grep (0.40.5)
src/match.c

[warning] 263-263: This hashing algorithm is insecure. If this hash is used in a security context, such as password hashing, it should be converted to a stronger hashing algorithm.
Context: MD5((uint8_t *)comp->purls[0], strlen(comp->purls[0]), comp->purls_md5[0]);
Note: [CWE-328] Use of Weak Hash. [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(insecure-hash-c)

🔇 Additional comments (19)
src/match.c (8)

263-265: MD5 usage for internal keying is acceptable here.

The static analysis tool flagged MD5 as insecure. However, in this context, MD5 is used as a lookup key for component age retrieval, not for security purposes (password hashing, signatures, etc.). This is an acceptable use of MD5 for non-cryptographic hashing/keying.


276-285: LGTM!

The compare_int_values helper is clean and correctly implements the documented behavior for both "higher is better" and "lower is better" comparison modes.


341-371: LGTM on SBOM and hint evaluation logic.

The SBOM evaluation (step 1) and component hint evaluation (step 2) are well-structured with clear accept/reject paths and appropriate logging.


530-545: Path depth comparison has asymmetric conditions.

The conditions at lines 534 and 540 use asymmetric comparisons (b->path_depth + 2 < a->path_depth/2 vs a->path_depth + 2 < b->path_depth/2). While this may be intentional for hysteresis, the threshold is quite strict—requiring one path to be more than twice as short plus 2. Consider verifying this is the intended behavior.


559-648: Tiebreaker logic is well-structured.

The equal release date tiebreakers (section 8) implement a sensible hierarchy: source check → health metrics → vendor check → component age → version comparison. The use of compare_int_values helper keeps the code consistent.


715-720: LGTM on dynamic iteration_max.

The change from a fixed constant to a configurable fetch_max_files value with a 4x multiplier for high accuracy mode aligns with the PR's goal of per-scan thresholds.


1107-1111: Good addition of null check after calloc.

This defensive check prevents null pointer dereference if memory allocation fails. The early return with logging is appropriate.


287-319: The extension() function does not allocate memory. It returns pointers to substrings within the input string using strrchr(), not newly allocated memory. No memory leak exists in this function.

Likely an incorrect or invalid review comment.

src/snippet_selection.c (4)

124-131: Good defensive null check after calloc.

The null check for match_new allocation prevents crashes on memory exhaustion.


250-282: LGTM on ranges_assemble signature expansion.

The new min_range_lines and ranges_number parameters enable per-scan thresholds. The logic correctly skips ranges below the minimum at lines 265-266.


366-370: Good early exit for empty ranges.

The defensive check for ranges_number <= 0 prevents unnecessary processing and potential issues downstream.


398-399: Logic inversion between adjust_tolerance and dynamic_ranges.

The parameter passed to ranges_join_overlapping is !match->scan_ower->snippet_adjust_tolerance, meaning dynamic_ranges=true when adjust_tolerance=false. Given that adjust_tolerance is now forced to true at line 396, dynamic_ranges will always be false. Ensure this inversion is intentional and the naming isn't confusing for future maintainers.

src/license.c (7)

47-58: LGTM!

The new license_type struct cleanly pairs license text with its source ID. The refactored license_list structure is a sound approach for the unified license handling.

Minor style nit: Line 56 is missing a space (struct license_type*licensesstruct license_type *licenses).


60-97: LGTM!

Clean mapping function with appropriate NULL return for unknown IDs. The grouping of related source IDs (e.g., 71-74 for underlying_license_file) is logical.


121-132: LGTM!

Proper cleanup of the refactored structure with appropriate null guards.


270-276: LGTM!

The NULL check on line 272 properly guards against unknown source IDs. The scancode/LicenseRef filter aligns with standard SPDX practices where LicenseRef identifiers from scancode may need special handling.


492-495: LGTM!

Clean iteration using the unified license list structure.


501-502: LGTM!

Proper cleanup of the unified license list at the end of the function.


330-336: LGTM!

The updated signature properly propagates the source ID through the JSON generation flow.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/match.c (1)

1128-1129: Missing null check for second calloc in windows line endings path.

For consistency with the fix at line 1112, consider adding a null check here as well.

Proposed fix
 		if (scan->windows_line_endings)
 		{
 			match_data_t *match_new = calloc(1, sizeof(match_data_t));
+			if (!match_new)
+			{
+				scanlog("Error allocating memory for match data (windows)\n");
+				return;
+			}
 			match_new->type = scan->match_type;
🤖 Fix all issues with AI agents
In `@src/license.c`:
- Line 109: Remove the stray '+' diff marker before the assignment so the line
assigning to ptr->licenses[ptr->count].id is valid C; locate the statement using
the symbol pattern "ptr->licenses[ptr->count].id = 0" (e.g., in the
initialization of license entries) and delete the leading '+' character so the
compiler sees a normal assignment.

In `@src/match.c`:
- Around line 639-652: Update the inline comment above the version comparison in
src/match.c (the block that compares a->version and b->version when b->age ==
a->age and !strcmp(a->component, b->component)) to state that the code uses a
lexicographic tie-breaker for deterministic ordering rather than a semantic
version comparison (e.g., replace the current "Lower version string is
preferred..." comment with "Lower version string is preferred as a deterministic
tie-breaker (lexicographic, not semantic)"). If semantic ordering is ever
required, note that semver_compare_version() in external/src/semver.c can be
used instead.
🧹 Nitpick comments (2)
src/license.c (1)

456-463: Consider removing redundant inner loop.

This inner loop appears to be dead code or a logic remnant from previous per-type array handling:

  1. Variable shadowing: The loop variable i shadows the outer loop's i from line 444.
  2. Redundant logic: The loop checks licenses_by_type.count > 0 up to 4 times, but since count doesn't change within this loop, the condition will either break on the first iteration or never break.

The entire for block can likely be simplified to a single if statement:

♻️ Suggested simplification
 		if (records)
 		{
-			//Look if someone of the prefered liceses ids already has a match
-			for (int i = 0; i < 4; i++)
-			{
-				if (licenses_by_type.count > 0)
-				{
-					scanlog("Stop searching for licenses\n");
-					break;
-				}
-			}
+			if (licenses_by_type.count > 0)
+			{
+				scanlog("Stop searching for licenses\n");
+				break;  // Also break from outer loop if intended
+			}
 		}
src/match.c (1)

292-292: Consider adding static keyword for internal linkage.

This function appears to be an internal helper (not declared in a header). Adding static would make its scope explicit and could enable better compiler optimizations.

Suggested fix
-int compare_file_extension(component_data_t *a, component_data_t *b)
+static int compare_file_extension(component_data_t *a, component_data_t *b)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6af0ad9 and a1c3015.

📒 Files selected for processing (2)
  • src/license.c
  • src/match.c
🧰 Additional context used
🧬 Code graph analysis (1)
src/match.c (6)
src/scanlog.c (1)
  • scanlog (71-107)
src/query.c (1)
  • get_component_age (196-207)
src/ignorelist.c (1)
  • extension (45-54)
src/util.c (2)
  • path_is_third_party (335-412)
  • binary_file_to_purl (443-606)
src/url.c (2)
  • purl_vendor_component_check (428-445)
  • purl_source_check (447-461)
src/health.c (1)
  • print_health (110-115)
🪛 ast-grep (0.40.5)
src/match.c

[warning] 268-268: This hashing algorithm is insecure. If this hash is used in a security context, such as password hashing, it should be converted to a stronger hashing algorithm.
Context: MD5((uint8_t *)comp->purls[0], strlen(comp->purls[0]), comp->purls_md5[0]);
Note: [CWE-328] Use of Weak Hash. [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(insecure-hash-c)

🪛 GitHub Actions: build
src/license.c

[error] 109-109: Compiler error: lvalue required as left operand of assignment at src/license.c:109 while compiling with gcc. The step 'make all' failed (exit code 2).

🔇 Additional comments (14)
src/license.c (7)

47-58: LGTM!

The introduction of struct license_type to store both text and id together, and updating struct license_list to use it, properly addresses the previous review concern about preserving the license source ID during consolidation.


60-97: LGTM!

The static helper function provides a clean mapping from license IDs to human-readable source names, with appropriate NULL return for unknown IDs.


99-121: Improved memory safety with slot initialization.

The initialization of the new slot immediately after realloc (lines 108-109) properly addresses the previous concern about partial allocation failures. If calloc fails at line 114, the slot is in a safe state with text = NULL and id = 0.


123-134: LGTM!

The cleanup function correctly handles the updated struct layout by freeing each entry's text field before freeing the array itself.


272-278: LGTM!

The updated license source handling now correctly uses license_id_to_source_name() to derive the source name from the ID, with proper NULL checking. The comment at line 276 now correctly matches the "LicenseRef" check in the code.


373-401: LGTM!

The callback now correctly passes the source ID (src) to license_add_to_list, ensuring the license source is preserved for accurate JSON output.


494-497: LGTM!

The iteration now correctly uses licenses_by_type.licenses[i].text and licenses_by_type.licenses[i].id, ensuring the actual license source ID is passed to license_to_json rather than the array index. This properly addresses the previously identified critical issue.

src/match.c (7)

259-272: Good fix for the memory allocation check.

The null check after malloc has been properly added, addressing the previous concern about potential null pointer dereference. The early return with logging is appropriate.

Regarding the static analysis warning about MD5: this usage is for content-based indexing/lookup (component age retrieval), not for security purposes like password hashing or integrity verification. MD5 is acceptable in this context.


281-290: LGTM!

Clean helper function with clear semantics. The prefer_higher parameter provides good flexibility for different comparison scenarios.


326-344: Excellent documentation.

The comprehensive docstring clearly explains the multi-stage comparison logic with all 9 hierarchical criteria. This greatly aids maintainability.


720-725: LGTM!

The iteration limit is now properly tied to fetch_max_files with configurable behavior. The 4x multiplier for high accuracy mode is clearly expressed.


769-769: LGTM!

Useful diagnostic logging for troubleshooting match loading behavior.


1112-1116: Good addition of null check after calloc.

This prevents potential null pointer dereference if memory allocation fails.


517-517: No null pointer dereference issue at this location.

The component field is always initialized via strdup() when a component_data_t is created (component.c:278), ensuring it is never NULL. Even if the component name is empty, strdup("") returns a valid pointer to an empty string, not NULL. Calling strcmp() on two valid pointers—whether they point to empty strings or populated strings—is safe and will not cause undefined behavior.

Likely an incorrect or invalid review comment.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +639 to 652
if (b->age == a->age && !strcmp(a->component, b->component))
{
scanlog("Component %s rejected by version comparison\n", b->purls[0]);
return false;
int version_cmp = strcmp(a->version, b->version);
if (version_cmp > 0)
{
scanlog("Component %s prefered over %s by version\n", b->purls[0], a->purls[0]);
return true;
}
if (version_cmp < 0)
{
scanlog("Component %s rejected by version comparison\n", b->purls[0]);
return false;
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there's an existing version comparison utility in the codebase
echo "=== Searching for version comparison patterns ==="
rg -n 'version.*compar|compar.*version' --type c -i

echo -e "\n=== Searching for versions.h/c files ==="
fd -e c -e h versions

echo -e "\n=== Checking full context in src/match.c around line 639 ==="
if [ -f src/match.c ]; then
  sed -n '630,660p' src/match.c
fi

Repository: scanoss/engine

Length of output: 2328


🏁 Script executed:

#!/bin/bash
# Check what's in versions.h/c
echo "=== Checking inc/versions.h ==="
head -50 inc/versions.h

echo -e "\n=== Checking src/versions.c (first 80 lines) ==="
head -80 src/versions.c

Repository: scanoss/engine

Length of output: 3365


Improve the clarity of the version comparison comment to distinguish between lexicographic and semantic ordering.

The code correctly uses lexicographic comparison for tie-breaking components with equal age and name, which is intentional and documented at line 637. However, line 638's comment states "Lower version string is preferred (usually represents older/more stable versions)" which could be misleading, since lexicographic ordering does not guarantee semantic version ordering. For example, "1.10.0" < "1.9.0" lexicographically, but 1.10.0 > 1.9.0 semantically.

Update line 638 to clarify that this is a lexicographic tie-breaker for deterministic ordering rather than semantic version comparison:

// Lower version string is preferred as a deterministic tie-breaker (lexicographic, not semantic)

If semantic version comparison becomes a requirement, the external semver_compare_version() function is available in external/src/semver.c.

🤖 Prompt for AI Agents
In `@src/match.c` around lines 639 - 652, Update the inline comment above the
version comparison in src/match.c (the block that compares a->version and
b->version when b->age == a->age and !strcmp(a->component, b->component)) to
state that the code uses a lexicographic tie-breaker for deterministic ordering
rather than a semantic version comparison (e.g., replace the current "Lower
version string is preferred..." comment with "Lower version string is preferred
as a deterministic tie-breaker (lexicographic, not semantic)"). If semantic
ordering is ever required, note that semver_compare_version() in
external/src/semver.c can be used instead.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/license.c (2)

454-464: Variable shadowing: inner loop reuses outer loop variable i.

The inner for (int i = 0; i < 4; i++) at line 456 shadows the outer loop variable i from line 444. While this compiles, it's confusing and the inner loop's logic appears to be checking licenses_by_type.count > 0 repeatedly without changing anything—it will always break on the first iteration if count > 0, making the loop redundant.

This looks like leftover code from a refactor. Consider simplifying to:

 		if (records)
 		{
-			//Look if someone of the prefered liceses ids already has a match
-			for (int i = 0; i < 4; i++)
-			{
-				if (licenses_by_type.count > 0)
-				{
-					scanlog("Stop searching for licenses\n");
-					break;
-				}
-			}
+			if (licenses_by_type.count > 0)
+			{
+				scanlog("Stop searching for licenses\n");
+			}
 		}

Note: The outer loop doesn't break here, so licenses will continue to be fetched even after this log message. Verify if an early exit was intended.


486-491: Missing NULL check after calloc.

If calloc fails at line 486, result will be NULL, and the subsequent sprintf at line 490 will dereference it, causing a crash.

🐛 Proposed fix
 	char * result = calloc(MAX_FIELD_LN * 1024, 1);
+	if (!result)
+	{
+		license_free_list(&licenses_by_type);
+		return;
+	}
 	char * buffer = result;
🧹 Nitpick comments (1)
src/license.c (1)

60-97: Consider documenting the license source ID meanings.

The mapping function is clear and handles unknown IDs gracefully by returning NULL. However, the magic numbers (0, 35, 31, 71-74, etc.) are not self-documenting. Consider adding a brief comment or defining named constants to clarify what each ID represents in the source system.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1c3015 and 84ccb97.

📒 Files selected for processing (1)
  • src/license.c
🔇 Additional comments (8)
src/license.c (8)

49-58: LGTM on the struct refactor.

The new license_type struct correctly preserves both the license text and source ID, addressing the prior design limitation. Minor nit: line 56 has inconsistent spacing (struct license_type*licenses vs typical struct license_type *licenses), but this is purely cosmetic.


99-121: LGTM!

The implementation correctly addresses the prior feedback about partial allocation failure by initializing the new slot immediately after realloc. The padded allocation for CRC32C hardware reads is properly calculated, and the function safely handles allocation failures without corrupting the list state.


123-134: LGTM!

Memory cleanup correctly adapted for the new license_type struct—frees each text field before the array itself.


272-278: LGTM!

The license source ID resolution and early-return on NULL properly handles unknown source IDs. The scancode LicenseRef filtering is correctly implemented, and the comment now matches the code (addressing prior feedback).


332-338: LGTM!

The function correctly delegates to json_from_license or split_in_json_array with the source ID parameter.


373-401: LGTM!

The callback correctly extracts the source ID from CSV and passes it to the updated license_add_to_list function.


494-497: LGTM!

The loop now correctly uses licenses_by_type.licenses[i].id for the source ID instead of the array index, properly addressing the prior critical issue about incorrect license source attribution.


503-504: LGTM!

Proper cleanup of the unified license list structure.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/license.c (1)

471-481: Dead loop logic: condition doesn't vary across iterations.

The loop at lines 473-480 iterates i from 0 to 3 but:

  1. The condition licenses_by_type.count > 0 doesn't depend on the loop variable i
  2. The break only exits the inner loop, not the outer purl loop
  3. The comment mentions checking "preferred license IDs" but the code doesn't actually compare against any ID values

This loop will either break immediately on the first iteration (if count > 0) or complete all 4 iterations doing nothing. It appears to be incomplete refactoring or dead code.

🐛 Suggested fix: Remove or implement the intended logic

If the intent was to break the outer loop when licenses are found:

 		if (records)
 		{
-			//Look if someone of the prefered liceses ids already has a match
-			for (int i = 0; i < 4; i++)
-			{
-				if (licenses_by_type.count > 0)
-				{
-					scanlog("Stop searching for licenses\n");
-					break;
-				}
-			}
+			if (licenses_by_type.count > 0)
+			{
+				scanlog("Stop searching for licenses\n");
+				break;  // Break outer purl loop
+			}
 		}
🧹 Nitpick comments (1)
src/license.c (1)

60-97: Consider defining named constants for license source IDs.

The switch statement uses magic numbers (0, 1, 2, 3, etc.) that appear throughout the codebase (e.g., IDs 5 and 6 are special-cased in license_compare_by_id). Defining an enum or constants would improve readability and reduce the risk of inconsistencies.

♻️ Example refactor
// Near the top of the file or in license.h
enum license_source_id {
    LICENSE_SRC_COMPONENT_DECLARED = 0,
    LICENSE_SRC_FILE_SPDX_TAG = 1,
    LICENSE_SRC_FILE_HEADER = 2,
    LICENSE_SRC_LICENSE_FILE = 3,
    LICENSE_SRC_SCANCODE_FILE = 4,
    LICENSE_SRC_SCANCODE = 5,
    LICENSE_SRC_COMPONENT_DECLARED_ALT = 6,
    // ... etc
};
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84ccb97 and f8e6f9e.

📒 Files selected for processing (1)
  • src/license.c
🔇 Additional comments (8)
src/license.c (8)

49-58: LGTM: Struct definitions are well-designed.

The new license_type struct cleanly encapsulates both the license text and its source ID, addressing the previous issue where the source ID was lost during consolidation. Minor nit: line 56 has inconsistent spacing (license_type*licenses vs typical license_type *licenses).


99-121: LGTM: Memory management improvements look good.

The function now properly initializes the new slot immediately after realloc (lines 108-109), addressing the previous review concern about partial allocation failures. The padded allocation for CRC32C hardware reads is correctly implemented.


123-134: LGTM: Cleanup function correctly handles the new struct.

The function properly frees each text field before freeing the array, with appropriate NULL checks.


136-151: LGTM: Sorting comparator correctly prioritizes license sources.

The logic to push IDs 5 and 6 to the end works correctly. The subtraction-based comparison (la->id - lb->id) is safe here since license IDs are small positive integers (0-74 based on license_id_to_source_name).

Minor note: The magic numbers 5 and 6 would benefit from named constants for consistency with the earlier suggestion.


289-296: LGTM: Source name lookup and filtering logic is correct.

The NULL check for unknown license IDs and the filtering of scancode LicenseRef licenses are properly implemented. The comment now correctly matches the "LicenseRef" check (addressing the previous review concern about case mismatch).


409-412: LGTM: License source ID is now correctly preserved.

The actual license source ID from the CSV data is now properly stored with each license entry, addressing the previous critical issue where the loop index was incorrectly used as the source ID.


511-518: LGTM: Sorting and iteration correctly use preserved license IDs.

The licenses are sorted by ID before JSON generation, and the iteration now correctly uses licenses_by_type.licenses[i].id instead of the loop index. This fully resolves the previous critical issue about incorrect license source IDs in JSON output.


503-525: LGTM: Memory management and cleanup are correct.

The result buffer is properly allocated and its ownership is transferred to comp->license_text. The unified license list is correctly freed at the end of the function.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@mscasso-scanoss mscasso-scanoss merged commit eddc4e3 into main Jan 14, 2026
3 checks passed
@mscasso-scanoss mscasso-scanoss deleted the 5.4.20 branch January 14, 2026 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants