-
Notifications
You must be signed in to change notification settings - Fork 20
5.4.20 #103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
5.4.20 #103
Conversation
📝 WalkthroughWalkthroughExtended CLI and configuration options added; per-scan snippet thresholds and ranking controls introduced; multiple public APIs updated to accept new scan parameters; component filtering by rank and a multi-criteria component comparator (SBOM, hints, rank, dates, extensions, purl, age) implemented; documentation/help expanded. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant CLI
participant Scanner
participant MatchLoader
participant ComponentDB
participant Comparator
participant Reporter
User->>CLI: invoke scan with flags (--tolerance, --rank, --max-files, ...)
CLI->>Scanner: start scan(target, config)
Scanner->>MatchLoader: load matches (fetch_max_files)
MatchLoader->>ComponentDB: query candidate components
ComponentDB-->>MatchLoader: return candidates
MatchLoader->>Comparator: evaluate candidates (SBOM, path-rank, dates, purl, extension, age)
Comparator->>ComponentDB: lazy fetch metadata/age if needed
Comparator-->>MatchLoader: select best match or mark IDENTIFIED_FILTERED
MatchLoader->>Reporter: emit match + rank/status
Reporter->>User: output JSON/report
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/license.c (1)
447-454: Dead code: Inner loop serves no purpose.This loop iterates up to 4 times but only checks if
licenses_by_type.count > 0. Since the count doesn't change within the loop, it will either:
- Break immediately on first iteration (if count > 0), or
- Run 4 iterations doing nothing (if count == 0)
The loop variable
ialso shadows the outer loop variable from line 435. This appears to be vestigial code from the refactoring that should be simplified or removed.Suggested fix
if (records) { - //Look if someone of the prefered liceses ids already has a match - for (int i = 0; i < 4; i++) - { - if (licenses_by_type.count > 0) - { - scanlog("Stop searching for licenses\n"); - break; - } - } + if (licenses_by_type.count > 0) + { + scanlog("Stop searching for licenses\n"); + break; + } }inc/component.h (1)
105-109: Typo:component_date_comparationshould likely becomponent_date_comparison.Minor naming issue: "comparation" is not a standard English word; "comparison" is the correct term.
Suggested fix
-bool component_date_comparation(component_data_t * a, component_data_t * b); +bool component_date_comparison(component_data_t * a, component_data_t * b);Note: This would require updating the function definition and all call sites.
src/url.c (1)
351-357: Misleading log message in failure path.The log message "ignoring component with rank %d" is misleading because
fill_componentreturns false for incomplete metadata (missing url, version, or purl), not due to rank. The actual rank value may not even be meaningful at this point.🐛 Suggested fix
if (!result) { - scanlog("ignoring component with rank %d\n", comp->rank); + scanlog("ignoring component with incomplete metadata\n"); free(url); component_data_free(comp); return false; }
🤖 Fix all issues with AI agents
In @inc/scan.h:
- Around line 69-74: The struct field snippet_honor_file_extension is an int but
the functions scan_data_init, wfp_scan, and hash_scan expect a bool; change the
field declaration snippet_honor_file_extension to type bool so it matches the
function signatures and avoids implicit conversions, and update any
initializations or assignments to use true/false accordingly (refer to symbol
snippet_honor_file_extension and the functions scan_data_init, wfp_scan,
hash_scan).
In @src/help.c:
- Line 111: Update the hardcoded copyright year in the string literal in
src/help.c (the line containing "Copyright (C) 2018-2022 SCANOSS.COM") to
reflect the current year range (e.g., "2018-2026") or replace it with a dynamic
build-time value; locate the string in help.c (the help/usage text) and either
change the year range to 2018-2026 or hook in a macro/define (e.g.,
COPYRIGHT_YEAR or generated during build) so the year does not become stale.
In @src/report.c:
- Line 51: The static flag ranking_enabled persists across file scans causing
ranking to remain enabled after a file with component_ranking_threshold >= 0;
reset ranking_enabled to false at the start of each file scan by assigning
ranking_enabled = false in either json_open_file() or json_close_file() so each
file's threshold is applied independently (locate the static bool
ranking_enabled and add the reset in json_open_file() or json_close_file() to
ensure per-file isolation).
In @src/util.c:
- Around line 342-343: strdup result is not checked before calling dirname,
which can cause undefined behavior if strdup returns NULL; after calling
strdup(comp->file) into full_path, add a NULL check and handle allocation
failure (e.g., log/report error and return an error code or clean up and exit
the function) before calling dirname(full_path), and ensure any allocated
resources are freed (and avoid calling dirname when full_path is NULL);
reference symbols: strdup, full_path, dirname, path, comp->file.
In @src/versions.c:
- Around line 48-54: The arrays purl_indirection_reference and purl_version_list
are statically sized to FETCH_MAX_FILES_DEFAULT but bounds are checked against
the user-controlled fetch_max_files, risking overflow in purl_latest_version_add
when doing assignments like purl_indirection_reference[purl_indirection_index] =
strdup(...); fix by clamping fetch_max_files to at most FETCH_MAX_FILES_DEFAULT
right after parsing the --max-files option (or alternatively allocate these
arrays dynamically using fetch_max_files); ensure all checks using
purl_indirection_index compare against the clamped value
(FETCH_MAX_FILES_DEFAULT or the allocated size) so writes cannot exceed the
actual buffer capacity.
🧹 Nitpick comments (11)
src/license.c (1)
263-269: Minor: Comment/code casing mismatch.The comment on line 267 says "license-ref" but the code checks for "LicenseRef" (PascalCase). The code logic is correct for filtering scancode LicenseRef entries, but the comment should match:
- //skip scancode licenses starting with "license-ref" + //skip scancode licenses starting with "LicenseRef"src/match_list.c (1)
99-106: Whitespace-only changes look fine.The formatting adjustments in this segment are acceptable. However, note that the static analysis warning about potential null pointer dereference from
calloc(line 100) is a pre-existing concern throughout this file—allcalloccalls lack null checks. Consider adding null checks in a follow-up.♻️ Optional: Add null check for allocation failure
struct comp_entry *nn = calloc(1, sizeof(struct comp_entry)); /* Insert after. */ +if (!nn) { + component_data_free(new_comp); + return false; +} nn->component = new_comp;README.md (1)
50-50: Consider hyphenating compound adjective.For consistency with style guidelines, "High Precision" could be written as "High-Precision" when used as a compound adjective modifying "Snippet Match mode."
-* `-H, --hpsm` - Enable High Precision Snippet Match mode (requires 'libhpsm.so' in the system) +* `-H, --hpsm` - Enable High-Precision Snippet Match mode (requires 'libhpsm.so' in the system)inc/scan.h (1)
79-85: Consider grouping configuration parameters into a struct.These function signatures have grown to 9 parameters, which reduces readability and maintainability. Consider introducing a
scan_config_tstruct to bundle the configuration options (max_snippets,max_components,adjust_tolerance,component_ranking_threshold,snippet_min_hits,snippet_min_lines,snippet_range_tolerance,snippet_honor_file_extension).This would simplify signatures and make future extensions easier.
src/report.c (1)
336-341: Consider typo:scan_owershould likely bescan_owner.The field name
scan_owerappears to be a typo forscan_owner. While this is likely a pre-existing issue and the code works with the current field name, it affects readability.src/url.c (1)
92-97: Consider extracting rank filtering to a helper function.The rank filtering logic is duplicated between
handle_url_record(lines 92-96) andget_oldest_url(lines 362-366). Consider extracting to a helper function to reduce duplication and ensure consistent behavior.♻️ Suggested helper function
static void apply_rank_filter(component_data_t *comp) { if (component_rank_max > 0 && comp->rank > component_rank_max) { scanlog("Setting component with rank %d as filtered\n", comp->rank); comp->identified = IDENTIFIED_FILTERED; } }Then call
apply_rank_filter(new_comp);in both locations.Also applies to: 362-366
inc/limits.h (2)
38-53: Duplicate macro definitions for fetch max files default.
FETCH_MAX_FILES_DEFAULT(line 38) andDEFAULT_FETCH_MAX_FILES(line 53) both define the same value (12000). Consider consolidating to a single macro to avoid confusion and maintenance overhead.♻️ Suggested consolidation
#define MAX_FILE_PATH 1024 -#define FETCH_MAX_FILES_DEFAULT 12000 +#define DEFAULT_FETCH_MAX_FILES 12000 #define MIN_FILE_SIZE 256 // files below this size will be ignored #define CRC_LIST_LEN 1024 // list of crc checksums to avoid metadata duplicates /* Snippets */ ... #define SNIPPETS_DEFAULT_HONOR_FILE_EXTENSION true /** Honor file extension during snippet matching */ -#define DEFAULT_FETCH_MAX_FILES 12000 /** Maximum number of files to fetch during component matching */
22-25: Move#includeinside the include guard.The
#include <stdint.h>is placed before the include guard, which means it will be processed on every inclusion of this header. Whilestdint.hhas its own guards, it's better practice to place all includes inside the guard.♻️ Suggested fix
-#include <stdint.h> - #ifndef __LIMITS_H #define __LIMITS_H + +#include <stdint.h> + /* Constants */src/scan.c (1)
56-77: Consider adding null checks for memory allocations.Static analysis flags potential null pointer dereferences if
calloc/mallocfails (lines 59-63). While allocation failures are rare, adding a guard would improve robustness.♻️ Proposed defensive check
scan_data_t * scan_data_init(char *target, int max_snippets, int max_components, bool adjust_tolerance, int component_ranking_threshold, int snippet_min_hits, int snippet_min_lines, int snippet_range_tolerance, bool snippet_honor_file_extension) { scanlog("Scan Init\n"); scan_data_t * scan = calloc(1, sizeof(*scan)); + if (!scan) { + fprintf(stderr, "Failed to allocate scan_data_t\n"); + return NULL; + } scan->file_path = strdup(target); scan->file_size = malloc(32); scan->hashes = calloc(MAX_FILE_SIZE,1); scan->lines = malloc(MAX_FILE_SIZE); + if (!scan->file_path || !scan->file_size || !scan->hashes || !scan->lines) { + fprintf(stderr, "Failed to allocate scan fields\n"); + // Note: Would need cleanup of partially allocated fields + return NULL; + }src/snippet_selection.c (1)
293-339: Consider adding null check for allocation.Line 299 allocates
out_rangeswithout checking for failure. While rare, allocation failure would cause undefined behavior.♻️ Proposed fix
matchmap_range *out_ranges = calloc(out_size, sizeof(matchmap_range)); + if (!out_ranges) { + scanlog("Failed to allocate out_ranges\n"); + return NULL; + }The function logic itself is correct - properly parameterized with tolerance and dynamic_ranges flags.
src/main.c (1)
455-481: Consider validating numeric CLI arguments.The handlers use
atoi()without validating the result. Invalid input (e.g.,--max-files=abc) would silently set the value to 0.♻️ Example validation pattern
case 257: /* --max-files */ - fetch_max_files = atoi(optarg); + fetch_max_files = atoi(optarg); + if (fetch_max_files <= 0) { + fprintf(stderr, "Invalid value for --max-files: %s\n", optarg); + exit(EXIT_FAILURE); + } scanlog("Max files to fetch set to %d\n", fetch_max_files); break;The logic for disabling
scan_adjust_tolerancewhen explicit values are provided is correct - this ensures user-specified values aren't overwritten by the auto-adjustment algorithm.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
README.mdinc/component.hinc/limits.hinc/match.hinc/parse.hinc/scan.hinc/scanoss.hsrc/binary_scan.csrc/component.csrc/debug.csrc/file.csrc/help.csrc/license.csrc/limits.csrc/main.csrc/match.csrc/match_list.csrc/report.csrc/scan.csrc/snippet_selection.csrc/snippets.csrc/url.csrc/util.csrc/versions.csrc/vulnerability.c
💤 Files with no reviewable changes (1)
- inc/match.h
🧰 Additional context used
🧬 Code graph analysis (6)
inc/scan.h (2)
src/scan.c (5)
scan_data_init(56-77)scan_data_free(82-99)ldb_scan(444-545)wfp_scan(224-365)hash_scan(197-212)src/snippets.c (2)
adjust_tolerance(106-138)ldb_scan_snippets(318-600)
src/url.c (3)
src/component.c (1)
component_data_free(43-70)src/scan.c (1)
asset_declared(141-190)src/scanlog.c (1)
scanlog(71-107)
src/component.c (1)
src/parse.c (1)
extract_csv(302-326)
src/scan.c (1)
src/snippets.c (1)
adjust_tolerance(106-138)
src/snippet_selection.c (2)
src/match.c (1)
match_data_free(63-76)src/scanlog.c (1)
scanlog(71-107)
src/binary_scan.c (1)
src/scan.c (1)
scan_data_init(56-77)
🪛 ast-grep (0.40.4)
src/match.c
[warning] 263-263: This hashing algorithm is insecure. If this hash is used in a security context, such as password hashing, it should be converted to a stronger hashing algorithm.
Context: MD5((uint8_t *)comp->purls[0], strlen(comp->purls[0]), comp->purls_md5[0]);
Note: [CWE-328] Use of Weak Hash. [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures
(insecure-hash-c)
🪛 Cppcheck (2.19.0)
src/match_list.c
[warning] 101-101: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
src/scan.c
[warning] 65-65: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 66-66: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 67-67: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 68-68: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 69-69: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 70-70: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 71-71: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
src/snippet_selection.c
[warning] 127-127: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 128-128: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 129-129: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 130-130: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 131-131: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 132-132: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 307-307: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
🪛 LanguageTool
README.md
[uncategorized] ~50-~50: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ctual extension * -H, --hpsm - Enable High Precision Snippet Match mode (requires 'libhpsm.s...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
🔇 Additional comments (34)
src/license.c (4)
55-92: LGTM!The translator function correctly maps license source IDs to human-readable names. The switch-case groupings (e.g., cases 0/35 and 7/9) are clear and the NULL return for unknown IDs is properly handled by the caller at line 265-266.
Minor note: case 8 appearing after case 10 is slightly unusual ordering but doesn't affect correctness.
101-108: Padding logic is correct for CRC32C alignment.The formula
((len + 8) / 8) * 8ensures at least 8 bytes beyond the string length, andcallocproperly zero-initializes the padding bytes.Minor edge case: if
callocfails at line 105-107,ptr->licenseshas already been reallocated with an additional slot containing an uninitialized pointer. Sinceptr->countis not incremented on failure, subsequent operations should be safe, but you may want to consider settingptr->licenses[ptr->count] = NULLbefore the calloc for defensive consistency.
386-386: LGTM!Clean refactoring to use the unified license list structure.
416-416: LGTM!Clean initialization of the consolidated license list structure.
inc/parse.h (1)
7-7: LGTM!The
component.hinclude is correctly added to provide thecomponent_itemtype required by theget_componentsfunction declaration on line 15.src/util.c (1)
396-411: LGTM!The special case handling for root directory (
".") correctly returnsnumPatterns(lowest third-party rank), and memory cleanup withfree(full_path)is consistent across all exit paths. The expanded pattern list appropriately captures common third-party directory conventions.README.md (2)
46-81: Excellent documentation improvements!The reorganized configuration options with clear subsections (Basic Configuration, SBOM and Filtering, Attribution and Licenses, General Options) significantly improves usability. The documentation accurately reflects the CLI capabilities.
87-108: Engine flags table is accurate and well-formatted.The flag values align with the constants defined in
inc/scanoss.h. The table provides clear documentation for users configuring the scanning engine.inc/scanoss.h (1)
36-36: Version bump to 5.4.20 is correct.Verification confirms that relocated declarations are properly accessible in their new locations:
- Type definitions (
component_item,keywords,file_recordset,len_rank) are ininc/component.h- Macro definitions (
MAX_FILE_PATH,FETCH_MAX_FILES_DEFAULT,MIN_FILE_SIZE,CRC_LIST_LEN) are ininc/limits.h- Extern declarations (
ignore_components,declared_components) are properly declared ininc/component.hThe modularization effort is sound.
src/vulnerability.c (1)
43-43: LGTM!The
max_vulnerabilitiesconstant is appropriately defined with clear documentation. Moving this from the public header to the source file improves encapsulation.src/file.c (2)
195-196: LGTM!The change from the hard-coded
FETCH_MAX_FILESconstant to the configurablefetch_max_filesvariable enables runtime configuration. The boundary check logic remains consistent.
234-237: LGTM!Consistent with the
collect_all_fileschange, using the configurablefetch_max_filesvariable.inc/component.h (2)
21-26: LGTM!Adding
IDENTIFIED_FILTERED = -1as a sentinel value for filtered components is a clean approach for the new component-rank-based filtering logic.
68-99: LGTM!The typedefs (
keywords,file_recordset,len_rank,component_item) and extern declarations are appropriately moved to this header, improving modularity by co-locating component-related types.src/component.c (2)
295-301: LGTM: Simplified rank validation is appropriate.Removing the arbitrary
strlen(rank) < 3constraint allows all valid rank values. The commented-out scanlog on line 298 could be removed to reduce noise, but this is a minor cleanup.
256-256: Remove this review comment—it is based on incorrect information.The rank extraction at column 14 is not a change from column 13; this is the initial code introduced in commit c649df4 (2026-01-12). Git history shows no prior version using column 13. The codebase consistently extracts columns 1–7 (vendor, component, version, release_date, license, purl, url), columns 8–12 via the url_stats loop, and deliberately skips column 13 before extracting rank from column 14. Without evidence of a bug or incorrect behavior, no action is required.
Likely an incorrect or invalid review comment.
src/debug.c (1)
141-141: LGTM: Benchmark updated to use new scan_data_init signature.The parameter values are appropriate for benchmark testing. Using
SNIPPETS_DEFAULT_RANGE_TOLERANCEconstant ensures consistency with default behavior.src/binary_scan.c (1)
273-273: LGTM: Binary scan updated to use new scan_data_init signature.The parameters are appropriate for binary file matching where single component results are expected.
src/limits.c (1)
12-12: LGTM: Simplified global configuration.The refactoring from multiple global tuning variables to parameterized function calls is a cleaner approach. The remaining
fetch_max_filesglobal with default 12000 aligns with the documented CLI default.src/report.c (1)
217-225: LGTM: Clean early-exit for filtered components.The filtered component output with minimal JSON (
statusandrankonly) is appropriate for rank-filtered items. The JSON structure is correctly handled.src/help.c (1)
54-61: LGTM: Comprehensive CLI documentation added.The new long-form options are well-documented with clear descriptions and default values. The defaults reference constants (
SCAN_MAX_COMPONENTS_DEFAULT,COMPONENT_DEFAULT_RANK + 1) which ensures consistency with actual behavior.src/url.c (2)
54-55: LGTM: Global rank threshold with sensible default.Using
COMPONENT_DEFAULT_RANK + 1as the default means components at or below the default rank pass through, which is appropriate default behavior.
86-88: LGTM: Safe basename usage with temporary copy.Creating a temporary copy before calling
basename()is correct sincebasename()may modify its argument. The memory is properly freed after use.src/snippets.c (2)
106-138: LGTM - Tolerance adjustment refactored to use configuration.The
adjust_tolerancefunction now properly initializes fromSNIPPETS_DEFAULT_*macros and stores computed values in the scan structure (scan->snippet_min_hits,scan->snippet_min_lines,scan->snippet_range_tolerance). This enables the new CLI-driven configuration while maintaining backward compatibility with auto-adjustment.
332-334: LGTM - Conditional tolerance adjustment.The tolerance adjustment is now gated by
scan->snippet_adjust_tolerance, allowing CLI users to override the automatic adjustment when they provide explicit values via--min-snippet-hits,--min-snippet-lines, or--range-tolerance.src/scan.c (1)
461-467: LGTM - Component ranking threshold logic.Clear three-way handling:
- Negative: disables ranking
- Zero: accepts all (default + 1)
- Positive: uses explicit threshold
This provides good flexibility for CLI users to control component filtering.
src/match.c (4)
259-267: LGTM - Component age initialization with purl MD5.The MD5 usage here is appropriate as a database lookup key (not for security purposes). The lazy initialization pattern avoids computing the hash until needed.
Note: The static analysis warning about weak hash is a false positive in this context - MD5 is being used for content addressing/lookup, not cryptographic security.
287-319: LGTM - File extension comparison helper.The function properly handles all NULL combinations and returns appropriate values to influence component selection based on extension matching with the reference file.
341-669: LGTM - Comprehensive component comparison logic.The multi-criteria comparison function is well-structured with clear numbered stages (1-9) and detailed comments. The hierarchical evaluation order (SBOM → hints → path rank → dates → extensions → ranking → tiebreakers) is logical and well-documented.
715-720: LGTM - Iteration limit integration with fetch_max_files.The iteration limit now properly uses the configurable
fetch_max_filesvalue (settable via--max-filesCLI option), with appropriate scaling for high accuracy mode.src/snippet_selection.c (1)
124-145: LGTM - Snippet filtering with configurable parameters.The filtering logic properly uses the scan configuration fields:
scan->snippet_min_hitsfor minimum hit thresholdscan->snippet_honor_file_extensionfor extension-based filteringscan->snippet_min_linesfor minimum matched linesThis enables CLI users to tune snippet matching sensitivity.
src/main.c (3)
67-73: LGTM - Configuration variables with sensible defaults.The new global configuration variables are initialized with the centralized defaults from
limits.h. This ensures consistent behavior whether options are specified via CLI or not.Note: These are file-scope globals (implicitly
staticat file level), which is acceptable for a main.c configuration pattern.
275-303: LGTM - Well-structured long options definition.The
long_optionsarray is properly defined:
- Short-form options mapped to their characters
- Long-only options use numeric codes 256+ to avoid ASCII conflicts
- Properly terminated with null entry
218-224: LGTM - Configuration propagation through scan paths.The scan initialization calls consistently pass all configuration parameters, ensuring the CLI options affect all scanning modes (directory recursion, single file, WFP, hash).
| scanoss --sbom my_sbom.json TARGET Scan TARGET including SBOM assets\n\ | ||
| \n\ | ||
| Copyright (C) 2018-2022 SCANOSS.COM\n", DEFAULT_MATCHMAP_FILES, ENGINE_FLAGS_FILE); | ||
| Copyright (C) 2018-2022 SCANOSS.COM\n", SCAN_MAX_COMPONENTS_DEFAULT, COMPONENT_DEFAULT_RANK + 1, DEFAULT_MATCHMAP_FILES, ENGINE_FLAGS_FILE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update copyright year.
The copyright notice shows 2018-2022, but the current year is 2026. Consider updating to 2018-2026 or using a dynamic approach.
🤖 Prompt for AI Agents
In @src/help.c at line 111, Update the hardcoded copyright year in the string
literal in src/help.c (the line containing "Copyright (C) 2018-2022
SCANOSS.COM") to reflect the current year range (e.g., "2018-2026") or replace
it with a dynamic build-time value; locate the string in help.c (the help/usage
text) and either change the year range to 2018-2026 or hook in a macro/define
(e.g., COPYRIGHT_YEAR or generated during build) so the year does not become
stale.
|
|
||
| uint64_t engine_flags = 0; | ||
| char kb_version[MAX_INPUT]; | ||
| static bool ranking_enabled = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
rg -n 'ranking_enabled' --type=c -B2 -A2Repository: scanoss/engine
Length of output: 619
🏁 Script executed:
rg -n 'print_json_match' --type=c -B2 -A2 | head -100Repository: scanoss/engine
Length of output: 1249
🏁 Script executed:
# Check for any reset or initialization of ranking_enabled in main/scan functions
rg -n 'ranking_enabled\s*=' --type=c -B1 -A1Repository: scanoss/engine
Length of output: 310
🏁 Script executed:
# Look for file processing loops to understand scan boundaries
rg -n 'for.*file|while.*file|scan_file|process_file' --type=c | head -20Repository: scanoss/engine
Length of output: 1694
🏁 Script executed:
# Check json_close_file and file scope boundaries
rg -n 'json_close_file|json_open_file|report_open' --type=c -B3 -A5Repository: scanoss/engine
Length of output: 1388
🏁 Script executed:
# Look for main scan loop to understand multi-file processing
rg -n 'main|for.*scan|while.*scan' --type=c | grep -E '(main|scan_data|for|while)' | head -30Repository: scanoss/engine
Length of output: 2488
🏁 Script executed:
# Check component_ranking_threshold to see if it's per-scan or global
rg -n 'component_ranking_threshold' --type=c -B2 -A2Repository: scanoss/engine
Length of output: 4250
🏁 Script executed:
# Verify the exact flow: look at json_open_file and json_close_file to see if there's any reset
cat -n src/report.c | sed -n '74,110p'Repository: scanoss/engine
Length of output: 992
🏁 Script executed:
# Check if there's any other place ranking_enabled is set to false
rg -n 'ranking_enabled\s*=\s*false' --type=cRepository: scanoss/engine
Length of output: 110
Reset ranking_enabled between file scans to prevent state persistence across multiple files with different thresholds.
The static variable ranking_enabled is set to true when processing files with component_ranking_threshold >= 0 (line 337) but is never reset to false. In wfp_scan(), multiple files are processed sequentially, each with potentially different component_ranking_threshold values. Once ranking is enabled for one file, it remains enabled for all subsequent files regardless of their individual threshold settings, causing rank output to appear on files where it should be suppressed.
Reset ranking_enabled = false in json_close_file() or json_open_file() to ensure per-file threshold isolation.
🤖 Prompt for AI Agents
In @src/report.c at line 51, The static flag ranking_enabled persists across
file scans causing ranking to remain enabled after a file with
component_ranking_threshold >= 0; reset ranking_enabled to false at the start of
each file scan by assigning ranking_enabled = false in either json_open_file()
or json_close_file() so each file's threshold is applied independently (locate
the static bool ranking_enabled and add the reset in json_open_file() or
json_close_file() to ensure per-file isolation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's intentional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscasso-scanoss, understood! I've noted that the ranking_enabled static variable is intentionally designed to persist across file scans in src/report.c.
✏️ Learnings added
Learnt from: mscasso-scanoss
Repo: scanoss/engine PR: 103
File: src/report.c:51-51
Timestamp: 2026-01-13T11:45:11.443Z
Learning: In the SCANOSS engine project (src/report.c), the static variable `ranking_enabled` is intentionally designed to persist its state across multiple file scans and is not reset between files.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
| char * full_path = strdup(comp->file); | ||
| char * path = dirname(full_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add null check after strdup to avoid undefined behavior.
If strdup fails and returns NULL, passing it to dirname results in undefined behavior. Add a null check to handle allocation failure gracefully.
🐛 Proposed fix
char * full_path = strdup(comp->file);
+if (!full_path)
+ return 0;
char * path = dirname(full_path);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| char * full_path = strdup(comp->file); | |
| char * path = dirname(full_path); | |
| char * full_path = strdup(comp->file); | |
| if (!full_path) | |
| return 0; | |
| char * path = dirname(full_path); |
🤖 Prompt for AI Agents
In @src/util.c around lines 342 - 343, strdup result is not checked before
calling dirname, which can cause undefined behavior if strdup returns NULL;
after calling strdup(comp->file) into full_path, add a NULL check and handle
allocation failure (e.g., log/report error and return an error code or clean up
and exit the function) before calling dirname(full_path), and ensure any
allocated resources are freed (and avoid calling dirname when full_path is
NULL); reference symbols: strdup, full_path, dirname, path, comp->file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
full_path can´t be null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscasso-scanoss, understood. While strdup can technically return NULL if insufficient memory is available, I acknowledge that in your environment this is not a practical concern. I'll defer to your judgment on this.
src/versions.c
Outdated
| static char * purl_indirection_reference[FETCH_MAX_FILES_DEFAULT]; | ||
| static int purl_indirection_index = 0; | ||
| static release_version * purl_version_list[FETCH_MAX_FILES]; | ||
| static release_version * purl_version_list[FETCH_MAX_FILES_DEFAULT]; | ||
|
|
||
| void purl_latest_version_add(component_data_t * component) | ||
| { | ||
| if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == FETCH_MAX_FILES) | ||
| if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == fetch_max_files) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find FETCH_MAX_FILES_DEFAULT and fetch_max_files usage
echo "=== FETCH_MAX_FILES_DEFAULT definition ==="
rg -n "FETCH_MAX_FILES_DEFAULT" --type c
echo ""
echo "=== fetch_max_files definition/initialization ==="
rg -n "fetch_max_files\s*=" --type c -A 2
echo ""
echo "=== Context around src/versions.c lines 48-54 ==="
head -70 src/versions.c | tail -30Repository: scanoss/engine
Length of output: 1724
🏁 Script executed:
#!/bin/bash
# Check for any validation/clamping of fetch_max_files after it's set from atoi
echo "=== Context around src/main.c line 456 (fetch_max_files assignment) ==="
sed -n '450,470p' src/main.c
echo ""
echo "=== Check for any MIN/MAX clamping macros or functions ==="
rg -n "fetch_max_files" --type c | head -20Repository: scanoss/engine
Length of output: 1717
Critical buffer overflow: static array size vs. user-controlled runtime limit.
The static arrays purl_indirection_reference and purl_version_list are sized with FETCH_MAX_FILES_DEFAULT (12000), but the boundary check uses fetch_max_files, which is user-controllable via the --max-files command-line argument. The fetch_max_files value is set directly from atoi(optarg) with no validation or clamping. If a user passes --max-files with a value greater than 12000, the code will write beyond array bounds when executing purl_indirection_reference[purl_indirection_index] = strdup(...), causing undefined behavior and a potential heap corruption vulnerability.
Either:
- Clamp
fetch_max_filestoFETCH_MAX_FILES_DEFAULTafter reading from command line, or - Size the arrays dynamically based on the user-provided value, or
- Use the constant for both array sizing and boundary checking
Suggested fix (option 3)
- if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == fetch_max_files)
+ if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == FETCH_MAX_FILES_DEFAULT)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| static char * purl_indirection_reference[FETCH_MAX_FILES_DEFAULT]; | |
| static int purl_indirection_index = 0; | |
| static release_version * purl_version_list[FETCH_MAX_FILES]; | |
| static release_version * purl_version_list[FETCH_MAX_FILES_DEFAULT]; | |
| void purl_latest_version_add(component_data_t * component) | |
| { | |
| if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == FETCH_MAX_FILES) | |
| if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == fetch_max_files) | |
| static char * purl_indirection_reference[FETCH_MAX_FILES_DEFAULT]; | |
| static int purl_indirection_index = 0; | |
| static release_version * purl_version_list[FETCH_MAX_FILES_DEFAULT]; | |
| void purl_latest_version_add(component_data_t * component) | |
| { | |
| if (!component->purls[0] || !component->release_date || !component->version || purl_indirection_index == FETCH_MAX_FILES_DEFAULT) |
🤖 Prompt for AI Agents
In @src/versions.c around lines 48 - 54, The arrays purl_indirection_reference
and purl_version_list are statically sized to FETCH_MAX_FILES_DEFAULT but bounds
are checked against the user-controlled fetch_max_files, risking overflow in
purl_latest_version_add when doing assignments like
purl_indirection_reference[purl_indirection_index] = strdup(...); fix by
clamping fetch_max_files to at most FETCH_MAX_FILES_DEFAULT right after parsing
the --max-files option (or alternatively allocate these arrays dynamically using
fetch_max_files); ensure all checks using purl_indirection_index compare against
the clamped value (FETCH_MAX_FILES_DEFAULT or the allocated size) so writes
cannot exceed the actual buffer capacity.
scanoss-qg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just consider array bounds advisory from CodeRabbit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/snippet_selection.c (1)
2-3: File header comment has incorrect filename.The comment says
src/snippets.cbut this file issrc/snippet_selection.c.Proposed fix
-* src/snippets.c +* src/snippet_selection.csrc/license.c (1)
454-461: Dead code: inner loop serves no purpose.This loop has multiple issues:
- Variable shadowing: The inner
int ishadows the outer loop'si(line 442).- Unused loop variable: The condition
licenses_by_type.count > 0doesn't depend oni, so looping 4 times is pointless.- Misleading comment: The comment mentions checking "preferred license ids" but the code only checks if count > 0.
- Break scope: The
breakonly exits the inner loop, not the outer purl loop.Consider removing this dead code or implementing the intended logic.
🔧 If intent is to break outer loop when licenses exist
- //Look if someone of the prefered liceses ids already has a match - for (int i = 0; i < 4; i++) - { - if (licenses_by_type.count > 0) - { - scanlog("Stop searching for licenses\n"); - break; - } - } + if (licenses_by_type.count > 0) + { + scanlog("Stop searching for licenses\n"); + break; + }Note: This
breakwill still only exit the innerforloop at line 442. If you need to exit the purl iteration entirely, you'll need a different control flow mechanism (e.g., goto, flag variable, or restructuring).
🤖 Fix all issues with AI agents
In @src/license.c:
- Around line 270-276: Update the inline comment to match the actual SPDX case
used in the condition: replace "license-ref" with "LicenseRef" so the comment
accurately reflects the check around license_source_id and the subsequent strcmp
against "LicenseRef" in the block that uses license_id_to_source_name and the
license variable.
In @src/snippet_selection.c:
- Around line 387-391: The code forcibly sets
match->scan_ower->snippet_adjust_tolerance = true and then computes ranges/
ranges_number assuming that override; remove the hard assignment so the
user-provided match->scan_ower->snippet_adjust_tolerance is respected, call
ranges_join_overlapping with match->scan_ower->snippet_range_tolerance and the
actual snippet_adjust_tolerance value, and compute ranges_number from
matchmap_reg->ranges_number when snippet_adjust_tolerance is false or
MATCHMAP_RANGES otherwise; if you must temporarily override user input, gate the
override behind a compile-time flag (e.g., #ifdef DISABLE_DYNAMIC_RANGES) or
emit a warning/log before changing the field instead of silently overwriting it.
- Around line 383-384: The format specifier in the scanlog call is wrong:
match->matchmap_reg->range[i].from and .to are uint16_t (matchmap_range) but the
format uses %ld; change the format to use an unsigned specifier (e.g., %u) or
cast the values to unsigned int when logging so the types match (update the
scanlog format string and arguments in the call that prints
match->matchmap_reg->range[i].from and .to).
- Around line 406-409: The log uses "%ld" for uint16_t fields causing wrong
formatting; update the scanlog call in the loop over ranges (variables ranges
and ranges_number) to use the correct specifier for uint16_t (e.g., "%hu" for
ranges[i].from and ranges[i].to) or use the PRIu16 macro with an appropriate
cast, so the format matches the type.
- Around line 295-307: The calloc allocation for out_ranges in the snippet
inside ranges_join_overlapping (using MATCHMAP_RANGES/dynamic_ranges and size)
lacks a NULL check; after allocating out_ranges, validate it and on failure
handle gracefully (e.g., free any prior resources if needed and return NULL or
an error indicator) and document that callers such as compile_ranges must check
for a NULL return; ensure subsequent uses (memset, loops using out_ranges) only
occur after a successful allocation.
- Around line 126-132: After calling calloc for match_new (match_data_t
*match_new = calloc(...)) add a NULL check immediately before any dereference
(memcpy, assigning fields); if calloc returns NULL, log or handle the allocation
failure and skip processing this match (e.g., continue the loop or return an
error) to avoid dereferencing a NULL pointer and leaking state; ensure any
upstream resources are left consistent when choosing to continue or return.
🧹 Nitpick comments (1)
inc/scan.h (1)
79-85: Consider grouping scan configuration into a struct.These function signatures now accept 9 parameters each, which hurts readability and makes call sites error-prone. A configuration struct would improve maintainability:
typedef struct scan_config_t { int max_snippets; int max_components; bool adjust_tolerance; int component_ranking_threshold; int snippet_min_hits; int snippet_min_lines; int snippet_range_tolerance; bool snippet_honor_file_extension; } scan_config_t; scan_data_t * scan_data_init(char *target, scan_config_t *config);This also makes future parameter additions non-breaking.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
inc/scan.hsrc/license.csrc/snippet_selection.csrc/versions.c
🚧 Files skipped from review as they are similar to previous changes (1)
- src/versions.c
🧰 Additional context used
🧬 Code graph analysis (1)
src/snippet_selection.c (2)
src/match.c (1)
match_data_free(63-76)src/scanlog.c (1)
scanlog(71-107)
🪛 Cppcheck (2.19.0)
src/snippet_selection.c
[warning] 127-127: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 128-128: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 129-129: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 130-130: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 131-131: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 132-132: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
[warning] 307-307: If memory allocation fails, then there is a possible null pointer dereference
(nullPointerOutOfMemory)
🔇 Additional comments (12)
inc/scan.h (1)
69-74: LGTM - New configuration fields are well-documented.The inline comments clarify the purpose of each field. The naming is consistent with the existing codebase conventions.
src/snippet_selection.c (5)
124-145: LGTM - Configuration parameters properly integrated.The snippet filtering logic correctly uses the new per-scan configuration fields (
snippet_min_hits,snippet_honor_file_extension,snippet_min_lines) instead of global constants.
245-277: LGTM - Parameterized range assembly.The function now correctly accepts
min_range_linesandranges_numberas parameters, enabling per-scan configuration of range filtering.
323-334: LGTM - Dynamic ranges early-exit behavior.The conditional break logic correctly differentiates between dynamic and fixed-size range modes.
359-363: LGTM - Defensive early exit for empty ranges.Good addition of bounds check with appropriate logging.
412-416: LGTM - Parameter threading through range assembly.The new parameters are correctly propagated to
ranges_assemble.src/license.c (6)
49-58: LGTM!The struct-based approach cleanly encapsulates license text with its source ID, improving type safety over the previous flat array design.
60-97: LGTM!The function correctly returns static string literals and handles unknown IDs by returning NULL, which is properly checked by callers.
99-119: LGTM!The padding calculation correctly ensures 8-byte alignment for CRC32C hardware reads. The memory allocation pattern is appropriate for the expected small number of licenses per component.
121-132: LGTM!Proper cleanup with NULL checks makes this safe to call multiple times.
391-394: LGTM!Correctly passes the source ID to the updated
license_add_to_listfunction.
492-501: LGTM!The final iteration correctly uses the new struct fields (
.textand.id), and memory is properly cleaned up withlicense_free_list.
| //TODO: Re-enable dynamic ranges when feature is complete | ||
| // For now, we force adjust_tolerance to ensure stable behavior | ||
| match->scan_ower->snippet_adjust_tolerance = true; | ||
| matchmap_range *ranges = ranges_join_overlapping(match->matchmap_reg->range, match->matchmap_reg->ranges_number, match->scan_ower->snippet_range_tolerance, !match->scan_ower->snippet_adjust_tolerance); | ||
| int ranges_number = !match->scan_ower->snippet_adjust_tolerance ? match->matchmap_reg->ranges_number : MATCHMAP_RANGES; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User configuration is silently overwritten.
Lines 388-389 force snippet_adjust_tolerance = true, ignoring the value passed through the CLI/API. This means the dynamic_ranges feature is effectively disabled and users cannot enable it. Additionally, line 391's ternary will always evaluate to MATCHMAP_RANGES since line 389 guarantees the condition is true.
If the feature isn't ready, consider:
- Removing the parameter from the public API until ready, or
- Logging a warning when overriding user input, or
- Using a compile-time flag to control the behavior
🤖 Prompt for AI Agents
In @src/snippet_selection.c around lines 387 - 391, The code forcibly sets
match->scan_ower->snippet_adjust_tolerance = true and then computes ranges/
ranges_number assuming that override; remove the hard assignment so the
user-provided match->scan_ower->snippet_adjust_tolerance is respected, call
ranges_join_overlapping with match->scan_ower->snippet_range_tolerance and the
actual snippet_adjust_tolerance value, and compute ranges_number from
matchmap_reg->ranges_number when snippet_adjust_tolerance is false or
MATCHMAP_RANGES otherwise; if you must temporarily override user input, gate the
override behind a compile-time flag (e.g., #ifdef DISABLE_DYNAMIC_RANGES) or
emit a warning/log before changing the field instead of silently overwriting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/snippet_selection.c (1)
298-346: Potential buffer overflow in dynamic_ranges mode.When
dynamic_rangesis true, the allocation at line 304 usesout_size = size. However, the loop at line 330 incrementsout_ranges_indexwithout a bounds check in dynamic mode (the check at line 331 only applies when!dynamic_ranges). If multiple non-overlapping ranges exist,out_ranges_indexcould exceedout_size - 1, causing out-of-bounds writes at lines 333-335.🐛 Proposed fix
else { out_ranges_index++; - if (out_ranges_index == MATCHMAP_RANGES && !dynamic_ranges) + if (out_ranges_index >= out_size) break; out_ranges[out_ranges_index].from = ranges[i].from; out_ranges[out_ranges_index].to = ranges[i].to; out_ranges[out_ranges_index].oss_line = ranges[i].oss_line; }src/license.c (1)
451-461: Dead code: inner loop is non-functional and shadows outer variable.This loop has multiple issues:
- Variable shadowing: Inner
int ishadows the outer purl iteration variablei- Pointless iteration: The loop checks
licenses_by_type.count > 0four times without any state change between iterations- Ineffective break: The
breakonly exits the inner loop, not the outer purl loop- Missing logic: The comment references "prefered liceses ids" but no preference logic exists after the refactoring to a unified list
This appears to be vestigial code from a previous implementation. If the intent is to stop searching once licenses are found, the entire inner loop should be replaced with a direct check and a
gotoor flag-based outer loop exit.Suggested fix: Remove dead loop, add proper early exit
if (records) { - //Look if someone of the prefered liceses ids already has a match - for (int i = 0; i < 4; i++) - { - if (licenses_by_type.count > 0) - { - scanlog("Stop searching for licenses\n"); - break; - } - } + if (licenses_by_type.count > 0) + { + scanlog("Stop searching for licenses\n"); + break; /* Exit the purl loop */ + } }
🤖 Fix all issues with AI agents
In @src/license.c:
- Around line 99-119: The realloc in license_add_to_list can succeed while the
subsequent calloc fails, leaving an allocated extra slot at
ptr->licenses[ptr->count] that is uninitialized; after calling realloc (tmp) but
before calling calloc, initialize the new slot (e.g., set
ptr->licenses[ptr->count].text = NULL and ptr->licenses[ptr->count].id = 0) so a
failed calloc won't leave stale data, and only increment ptr->count after
successful allocation and copy; ensure you use the tmp result to set the slot
prior to allocating the text and rollback/cleanup on failure if needed.
In @src/match.c:
- Around line 259-267: initialize_component_age allocates comp->purls_md5[0]
with malloc but never checks for NULL before using it; update
initialize_component_age to check the return of malloc for NULL and handle
allocation failure (e.g., skip MD5 computation and leave comp->age unchanged or
set a sentinel value and/or log an error) before calling MD5 and
get_component_age; ensure references to comp->purls_md5[0], the MD5(...) call,
and get_component_age(...) are guarded so no null pointer is dereferenced.
🧹 Nitpick comments (1)
src/snippet_selection.c (1)
394-399: TODO: Dynamic ranges feature is forcibly disabled.The code forces
snippet_adjust_tolerance = trueregardless of the configured value, with a TODO indicating the dynamic ranges feature is incomplete. This is acceptable as a temporary measure, but consider tracking this as technical debt.Would you like me to open an issue to track the completion of the dynamic ranges feature?
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/license.csrc/match.csrc/snippet_selection.c
🧰 Additional context used
🧬 Code graph analysis (1)
src/snippet_selection.c (2)
src/scanlog.c (1)
scanlog(71-107)src/match.c (1)
match_data_free(63-76)
🪛 ast-grep (0.40.5)
src/match.c
[warning] 263-263: This hashing algorithm is insecure. If this hash is used in a security context, such as password hashing, it should be converted to a stronger hashing algorithm.
Context: MD5((uint8_t *)comp->purls[0], strlen(comp->purls[0]), comp->purls_md5[0]);
Note: [CWE-328] Use of Weak Hash. [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures
(insecure-hash-c)
🔇 Additional comments (19)
src/match.c (8)
263-265: MD5 usage for internal keying is acceptable here.The static analysis tool flagged MD5 as insecure. However, in this context, MD5 is used as a lookup key for component age retrieval, not for security purposes (password hashing, signatures, etc.). This is an acceptable use of MD5 for non-cryptographic hashing/keying.
276-285: LGTM!The
compare_int_valueshelper is clean and correctly implements the documented behavior for both "higher is better" and "lower is better" comparison modes.
341-371: LGTM on SBOM and hint evaluation logic.The SBOM evaluation (step 1) and component hint evaluation (step 2) are well-structured with clear accept/reject paths and appropriate logging.
530-545: Path depth comparison has asymmetric conditions.The conditions at lines 534 and 540 use asymmetric comparisons (
b->path_depth + 2 < a->path_depth/2vsa->path_depth + 2 < b->path_depth/2). While this may be intentional for hysteresis, the threshold is quite strict—requiring one path to be more than twice as short plus 2. Consider verifying this is the intended behavior.
559-648: Tiebreaker logic is well-structured.The equal release date tiebreakers (section 8) implement a sensible hierarchy: source check → health metrics → vendor check → component age → version comparison. The use of
compare_int_valueshelper keeps the code consistent.
715-720: LGTM on dynamic iteration_max.The change from a fixed constant to a configurable
fetch_max_filesvalue with a 4x multiplier for high accuracy mode aligns with the PR's goal of per-scan thresholds.
1107-1111: Good addition of null check after calloc.This defensive check prevents null pointer dereference if memory allocation fails. The early return with logging is appropriate.
287-319: Theextension()function does not allocate memory. It returns pointers to substrings within the input string usingstrrchr(), not newly allocated memory. No memory leak exists in this function.Likely an incorrect or invalid review comment.
src/snippet_selection.c (4)
124-131: Good defensive null check after calloc.The null check for
match_newallocation prevents crashes on memory exhaustion.
250-282: LGTM on ranges_assemble signature expansion.The new
min_range_linesandranges_numberparameters enable per-scan thresholds. The logic correctly skips ranges below the minimum at lines 265-266.
366-370: Good early exit for empty ranges.The defensive check for
ranges_number <= 0prevents unnecessary processing and potential issues downstream.
398-399: Logic inversion between adjust_tolerance and dynamic_ranges.The parameter passed to
ranges_join_overlappingis!match->scan_ower->snippet_adjust_tolerance, meaningdynamic_ranges=truewhenadjust_tolerance=false. Given thatadjust_toleranceis now forced totrueat line 396,dynamic_rangeswill always befalse. Ensure this inversion is intentional and the naming isn't confusing for future maintainers.src/license.c (7)
47-58: LGTM!The new
license_typestruct cleanly pairs license text with its source ID. The refactoredlicense_liststructure is a sound approach for the unified license handling.Minor style nit: Line 56 is missing a space (
struct license_type*licenses→struct license_type *licenses).
60-97: LGTM!Clean mapping function with appropriate NULL return for unknown IDs. The grouping of related source IDs (e.g., 71-74 for underlying_license_file) is logical.
121-132: LGTM!Proper cleanup of the refactored structure with appropriate null guards.
270-276: LGTM!The NULL check on line 272 properly guards against unknown source IDs. The scancode/LicenseRef filter aligns with standard SPDX practices where LicenseRef identifiers from scancode may need special handling.
492-495: LGTM!Clean iteration using the unified license list structure.
501-502: LGTM!Proper cleanup of the unified license list at the end of the function.
330-336: LGTM!The updated signature properly propagates the source ID through the JSON generation flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/match.c (1)
1128-1129: Missing null check for secondcallocin windows line endings path.For consistency with the fix at line 1112, consider adding a null check here as well.
Proposed fix
if (scan->windows_line_endings) { match_data_t *match_new = calloc(1, sizeof(match_data_t)); + if (!match_new) + { + scanlog("Error allocating memory for match data (windows)\n"); + return; + } match_new->type = scan->match_type;
🤖 Fix all issues with AI agents
In `@src/license.c`:
- Line 109: Remove the stray '+' diff marker before the assignment so the line
assigning to ptr->licenses[ptr->count].id is valid C; locate the statement using
the symbol pattern "ptr->licenses[ptr->count].id = 0" (e.g., in the
initialization of license entries) and delete the leading '+' character so the
compiler sees a normal assignment.
In `@src/match.c`:
- Around line 639-652: Update the inline comment above the version comparison in
src/match.c (the block that compares a->version and b->version when b->age ==
a->age and !strcmp(a->component, b->component)) to state that the code uses a
lexicographic tie-breaker for deterministic ordering rather than a semantic
version comparison (e.g., replace the current "Lower version string is
preferred..." comment with "Lower version string is preferred as a deterministic
tie-breaker (lexicographic, not semantic)"). If semantic ordering is ever
required, note that semver_compare_version() in external/src/semver.c can be
used instead.
🧹 Nitpick comments (2)
src/license.c (1)
456-463: Consider removing redundant inner loop.This inner loop appears to be dead code or a logic remnant from previous per-type array handling:
- Variable shadowing: The loop variable
ishadows the outer loop'sifrom line 444.- Redundant logic: The loop checks
licenses_by_type.count > 0up to 4 times, but since count doesn't change within this loop, the condition will either break on the first iteration or never break.The entire
forblock can likely be simplified to a singleifstatement:♻️ Suggested simplification
if (records) { - //Look if someone of the prefered liceses ids already has a match - for (int i = 0; i < 4; i++) - { - if (licenses_by_type.count > 0) - { - scanlog("Stop searching for licenses\n"); - break; - } - } + if (licenses_by_type.count > 0) + { + scanlog("Stop searching for licenses\n"); + break; // Also break from outer loop if intended + } }src/match.c (1)
292-292: Consider addingstatickeyword for internal linkage.This function appears to be an internal helper (not declared in a header). Adding
staticwould make its scope explicit and could enable better compiler optimizations.Suggested fix
-int compare_file_extension(component_data_t *a, component_data_t *b) +static int compare_file_extension(component_data_t *a, component_data_t *b)
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/license.csrc/match.c
🧰 Additional context used
🧬 Code graph analysis (1)
src/match.c (6)
src/scanlog.c (1)
scanlog(71-107)src/query.c (1)
get_component_age(196-207)src/ignorelist.c (1)
extension(45-54)src/util.c (2)
path_is_third_party(335-412)binary_file_to_purl(443-606)src/url.c (2)
purl_vendor_component_check(428-445)purl_source_check(447-461)src/health.c (1)
print_health(110-115)
🪛 ast-grep (0.40.5)
src/match.c
[warning] 268-268: This hashing algorithm is insecure. If this hash is used in a security context, such as password hashing, it should be converted to a stronger hashing algorithm.
Context: MD5((uint8_t *)comp->purls[0], strlen(comp->purls[0]), comp->purls_md5[0]);
Note: [CWE-328] Use of Weak Hash. [REFERENCES]
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures
(insecure-hash-c)
🪛 GitHub Actions: build
src/license.c
[error] 109-109: Compiler error: lvalue required as left operand of assignment at src/license.c:109 while compiling with gcc. The step 'make all' failed (exit code 2).
🔇 Additional comments (14)
src/license.c (7)
47-58: LGTM!The introduction of
struct license_typeto store bothtextandidtogether, and updatingstruct license_listto use it, properly addresses the previous review concern about preserving the license source ID during consolidation.
60-97: LGTM!The static helper function provides a clean mapping from license IDs to human-readable source names, with appropriate NULL return for unknown IDs.
99-121: Improved memory safety with slot initialization.The initialization of the new slot immediately after
realloc(lines 108-109) properly addresses the previous concern about partial allocation failures. Ifcallocfails at line 114, the slot is in a safe state withtext = NULLandid = 0.
123-134: LGTM!The cleanup function correctly handles the updated struct layout by freeing each entry's
textfield before freeing the array itself.
272-278: LGTM!The updated license source handling now correctly uses
license_id_to_source_name()to derive the source name from the ID, with proper NULL checking. The comment at line 276 now correctly matches the "LicenseRef" check in the code.
373-401: LGTM!The callback now correctly passes the source ID (
src) tolicense_add_to_list, ensuring the license source is preserved for accurate JSON output.
494-497: LGTM!The iteration now correctly uses
licenses_by_type.licenses[i].textandlicenses_by_type.licenses[i].id, ensuring the actual license source ID is passed tolicense_to_jsonrather than the array index. This properly addresses the previously identified critical issue.src/match.c (7)
259-272: Good fix for the memory allocation check.The null check after
mallochas been properly added, addressing the previous concern about potential null pointer dereference. The early return with logging is appropriate.Regarding the static analysis warning about MD5: this usage is for content-based indexing/lookup (component age retrieval), not for security purposes like password hashing or integrity verification. MD5 is acceptable in this context.
281-290: LGTM!Clean helper function with clear semantics. The
prefer_higherparameter provides good flexibility for different comparison scenarios.
326-344: Excellent documentation.The comprehensive docstring clearly explains the multi-stage comparison logic with all 9 hierarchical criteria. This greatly aids maintainability.
720-725: LGTM!The iteration limit is now properly tied to
fetch_max_fileswith configurable behavior. The 4x multiplier for high accuracy mode is clearly expressed.
769-769: LGTM!Useful diagnostic logging for troubleshooting match loading behavior.
1112-1116: Good addition of null check aftercalloc.This prevents potential null pointer dereference if memory allocation fails.
517-517: No null pointer dereference issue at this location.The
componentfield is always initialized viastrdup()when acomponent_data_tis created (component.c:278), ensuring it is never NULL. Even if the component name is empty,strdup("")returns a valid pointer to an empty string, not NULL. Callingstrcmp()on two valid pointers—whether they point to empty strings or populated strings—is safe and will not cause undefined behavior.Likely an incorrect or invalid review comment.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| if (b->age == a->age && !strcmp(a->component, b->component)) | ||
| { | ||
| scanlog("Component %s rejected by version comparison\n", b->purls[0]); | ||
| return false; | ||
| int version_cmp = strcmp(a->version, b->version); | ||
| if (version_cmp > 0) | ||
| { | ||
| scanlog("Component %s prefered over %s by version\n", b->purls[0], a->purls[0]); | ||
| return true; | ||
| } | ||
| if (version_cmp < 0) | ||
| { | ||
| scanlog("Component %s rejected by version comparison\n", b->purls[0]); | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if there's an existing version comparison utility in the codebase
echo "=== Searching for version comparison patterns ==="
rg -n 'version.*compar|compar.*version' --type c -i
echo -e "\n=== Searching for versions.h/c files ==="
fd -e c -e h versions
echo -e "\n=== Checking full context in src/match.c around line 639 ==="
if [ -f src/match.c ]; then
sed -n '630,660p' src/match.c
fiRepository: scanoss/engine
Length of output: 2328
🏁 Script executed:
#!/bin/bash
# Check what's in versions.h/c
echo "=== Checking inc/versions.h ==="
head -50 inc/versions.h
echo -e "\n=== Checking src/versions.c (first 80 lines) ==="
head -80 src/versions.cRepository: scanoss/engine
Length of output: 3365
Improve the clarity of the version comparison comment to distinguish between lexicographic and semantic ordering.
The code correctly uses lexicographic comparison for tie-breaking components with equal age and name, which is intentional and documented at line 637. However, line 638's comment states "Lower version string is preferred (usually represents older/more stable versions)" which could be misleading, since lexicographic ordering does not guarantee semantic version ordering. For example, "1.10.0" < "1.9.0" lexicographically, but 1.10.0 > 1.9.0 semantically.
Update line 638 to clarify that this is a lexicographic tie-breaker for deterministic ordering rather than semantic version comparison:
// Lower version string is preferred as a deterministic tie-breaker (lexicographic, not semantic)
If semantic version comparison becomes a requirement, the external semver_compare_version() function is available in external/src/semver.c.
🤖 Prompt for AI Agents
In `@src/match.c` around lines 639 - 652, Update the inline comment above the
version comparison in src/match.c (the block that compares a->version and
b->version when b->age == a->age and !strcmp(a->component, b->component)) to
state that the code uses a lexicographic tie-breaker for deterministic ordering
rather than a semantic version comparison (e.g., replace the current "Lower
version string is preferred..." comment with "Lower version string is preferred
as a deterministic tie-breaker (lexicographic, not semantic)"). If semantic
ordering is ever required, note that semver_compare_version() in
external/src/semver.c can be used instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/license.c (2)
454-464: Variable shadowing: inner loop reuses outer loop variablei.The inner
for (int i = 0; i < 4; i++)at line 456 shadows the outer loop variableifrom line 444. While this compiles, it's confusing and the inner loop's logic appears to be checkinglicenses_by_type.count > 0repeatedly without changing anything—it will always break on the first iteration if count > 0, making the loop redundant.This looks like leftover code from a refactor. Consider simplifying to:
if (records) { - //Look if someone of the prefered liceses ids already has a match - for (int i = 0; i < 4; i++) - { - if (licenses_by_type.count > 0) - { - scanlog("Stop searching for licenses\n"); - break; - } - } + if (licenses_by_type.count > 0) + { + scanlog("Stop searching for licenses\n"); + } }Note: The outer loop doesn't break here, so licenses will continue to be fetched even after this log message. Verify if an early exit was intended.
486-491: Missing NULL check after calloc.If
callocfails at line 486,resultwill be NULL, and the subsequentsprintfat line 490 will dereference it, causing a crash.🐛 Proposed fix
char * result = calloc(MAX_FIELD_LN * 1024, 1); + if (!result) + { + license_free_list(&licenses_by_type); + return; + } char * buffer = result;
🧹 Nitpick comments (1)
src/license.c (1)
60-97: Consider documenting the license source ID meanings.The mapping function is clear and handles unknown IDs gracefully by returning NULL. However, the magic numbers (0, 35, 31, 71-74, etc.) are not self-documenting. Consider adding a brief comment or defining named constants to clarify what each ID represents in the source system.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/license.c
🔇 Additional comments (8)
src/license.c (8)
49-58: LGTM on the struct refactor.The new
license_typestruct correctly preserves both the license text and source ID, addressing the prior design limitation. Minor nit: line 56 has inconsistent spacing (struct license_type*licensesvs typicalstruct license_type *licenses), but this is purely cosmetic.
99-121: LGTM!The implementation correctly addresses the prior feedback about partial allocation failure by initializing the new slot immediately after
realloc. The padded allocation for CRC32C hardware reads is properly calculated, and the function safely handles allocation failures without corrupting the list state.
123-134: LGTM!Memory cleanup correctly adapted for the new
license_typestruct—frees eachtextfield before the array itself.
272-278: LGTM!The license source ID resolution and early-return on NULL properly handles unknown source IDs. The scancode
LicenseReffiltering is correctly implemented, and the comment now matches the code (addressing prior feedback).
332-338: LGTM!The function correctly delegates to
json_from_licenseorsplit_in_json_arraywith the source ID parameter.
373-401: LGTM!The callback correctly extracts the source ID from CSV and passes it to the updated
license_add_to_listfunction.
494-497: LGTM!The loop now correctly uses
licenses_by_type.licenses[i].idfor the source ID instead of the array index, properly addressing the prior critical issue about incorrect license source attribution.
503-504: LGTM!Proper cleanup of the unified license list structure.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/license.c (1)
471-481: Dead loop logic: condition doesn't vary across iterations.The loop at lines 473-480 iterates
ifrom 0 to 3 but:
- The condition
licenses_by_type.count > 0doesn't depend on the loop variablei- The
breakonly exits the inner loop, not the outer purl loop- The comment mentions checking "preferred license IDs" but the code doesn't actually compare against any ID values
This loop will either break immediately on the first iteration (if count > 0) or complete all 4 iterations doing nothing. It appears to be incomplete refactoring or dead code.
🐛 Suggested fix: Remove or implement the intended logic
If the intent was to break the outer loop when licenses are found:
if (records) { - //Look if someone of the prefered liceses ids already has a match - for (int i = 0; i < 4; i++) - { - if (licenses_by_type.count > 0) - { - scanlog("Stop searching for licenses\n"); - break; - } - } + if (licenses_by_type.count > 0) + { + scanlog("Stop searching for licenses\n"); + break; // Break outer purl loop + } }
🧹 Nitpick comments (1)
src/license.c (1)
60-97: Consider defining named constants for license source IDs.The switch statement uses magic numbers (0, 1, 2, 3, etc.) that appear throughout the codebase (e.g., IDs 5 and 6 are special-cased in
license_compare_by_id). Defining an enum or constants would improve readability and reduce the risk of inconsistencies.♻️ Example refactor
// Near the top of the file or in license.h enum license_source_id { LICENSE_SRC_COMPONENT_DECLARED = 0, LICENSE_SRC_FILE_SPDX_TAG = 1, LICENSE_SRC_FILE_HEADER = 2, LICENSE_SRC_LICENSE_FILE = 3, LICENSE_SRC_SCANCODE_FILE = 4, LICENSE_SRC_SCANCODE = 5, LICENSE_SRC_COMPONENT_DECLARED_ALT = 6, // ... etc };
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/license.c
🔇 Additional comments (8)
src/license.c (8)
49-58: LGTM: Struct definitions are well-designed.The new
license_typestruct cleanly encapsulates both the license text and its source ID, addressing the previous issue where the source ID was lost during consolidation. Minor nit: line 56 has inconsistent spacing (license_type*licensesvs typicallicense_type *licenses).
99-121: LGTM: Memory management improvements look good.The function now properly initializes the new slot immediately after realloc (lines 108-109), addressing the previous review concern about partial allocation failures. The padded allocation for CRC32C hardware reads is correctly implemented.
123-134: LGTM: Cleanup function correctly handles the new struct.The function properly frees each
textfield before freeing the array, with appropriate NULL checks.
136-151: LGTM: Sorting comparator correctly prioritizes license sources.The logic to push IDs 5 and 6 to the end works correctly. The subtraction-based comparison (
la->id - lb->id) is safe here since license IDs are small positive integers (0-74 based onlicense_id_to_source_name).Minor note: The magic numbers 5 and 6 would benefit from named constants for consistency with the earlier suggestion.
289-296: LGTM: Source name lookup and filtering logic is correct.The NULL check for unknown license IDs and the filtering of scancode
LicenseReflicenses are properly implemented. The comment now correctly matches the"LicenseRef"check (addressing the previous review concern about case mismatch).
409-412: LGTM: License source ID is now correctly preserved.The actual license source ID from the CSV data is now properly stored with each license entry, addressing the previous critical issue where the loop index was incorrectly used as the source ID.
511-518: LGTM: Sorting and iteration correctly use preserved license IDs.The licenses are sorted by ID before JSON generation, and the iteration now correctly uses
licenses_by_type.licenses[i].idinstead of the loop index. This fully resolves the previous critical issue about incorrect license source IDs in JSON output.
503-525: LGTM: Memory management and cleanup are correct.The result buffer is properly allocated and its ownership is transferred to
comp->license_text. The unified license list is correctly freed at the end of the function.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Summary by CodeRabbit
New Features
Improvements
Behavioral Changes
✏️ Tip: You can customize this high-level summary in your review settings.