allow data freeze parsing by ccaio · Pull Request #2 · genomehubs/data

ccaio · 2025-09-25T13:41:49Z

Summary by Sourcery

Introduce optional data freeze parsing into the NCBI assemblies workflow by adding tasks to fetch, default, and apply freeze information per assembly, controlled via a new CLI flag.

New Features:

Add fetch_data_freeze_file task to read assembly-to-freeze mappings from a TSV.
Add set_data_freeze_default task to assign a default data freeze when none is provided.
Add process_datafreeze_info task to update parsed assembly entries with freeze subsets and adjust assembly IDs.
Introduce --data_freeze_path CLI option and parameterize parse_ncbi_assemblies and wrapper scripts to accept extra keyword args

Enhancements:

Log processing steps for each assembly and data freeze operation to aid debugging

sourcery-ai · 2025-09-25T13:41:55Z

Reviewer's Guide

This PR extends the NCBI assemblies parsing flow with support for ‘data freeze’ lists by adding new Prefect tasks for reading and applying freeze subsets, updating the main flow signature to conditionally invoke them, enhancing the CLI to accept a data_freeze_path flag, and adjusting helper-parsers for compatibility.

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Introduce data freeze handling tasks | Added `fetch_data_freeze_file` task to load and parse a TSV of freeze lists; implemented `set_data_freeze_default` to assign a fallback freeze; created `process_datafreeze_info` to map and tag assemblies with specific freezes and build their IDs | `flows/parsers/parse_ncbi_assemblies.py` |
| Integrate data freeze into `parse_ncbi_assemblies` flow | Extended flow signature with optional `data_freeze_path`; branched to use default or fetched freeze data based on presence of the path; invoked new tasks and updated `write_to_tsv` accordingly | `flows/parsers/parse_ncbi_assemblies.py` |
| Expand CLI to accept `data_freeze_path` | Defined `DATA_FREEZE_PATH` flag in `shared_args`; added the flag to the parser args for NCBI assemblies | `flows/lib/shared_args.py` `flows/parsers/args.py` |
| Make parser wrappers accept extra kwargs | Updated `parse_refseq_organelles` to include kwargs; modified `parse_sequencing_status` to accept kwargs; adjusted `parse_skip_parsing` signature for `**kwargs` | `flows/parsers/parse_refseq_organelles.py` `flows/parsers/parse_sequencing_status.py` `flows/parsers/parse_skip_parsing.py` |
| Add logging for report processing | Inserted print in `process_assembly_reports` to log the current accession; tagged prints with `log_prints` for Prefect visibility | `flows/parsers/parse_ncbi_assemblies.py` |
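
The optional-path branching and the kwargs-tolerant wrappers listed above can be sketched roughly as follows. This is only an illustration under assumptions: the functions are plain stand-ins rather than the Prefect flows in this PR, their signatures and return values are invented for the example, and `run_parser` is a hypothetical shared caller.

```python
from typing import Optional


def parse_ncbi_assemblies(input_path: str, data_freeze_path: Optional[str] = None) -> str:
    """Stand-in for the flow: branch on whether a freeze list was supplied."""
    if data_freeze_path:
        return f"{input_path}: freeze list from {data_freeze_path}"
    return f"{input_path}: default freeze"


def parse_skip_parsing(input_path: str, **kwargs) -> str:
    """Stand-in for a wrapper that accepts and ignores options it does not use."""
    return f"{input_path}: skipped"


def run_parser(parser, input_path: str, **options) -> str:
    """Hypothetical shared caller that forwards every option to any parser."""
    return parser(input_path, **options)
```

Accepting `**kwargs` in every wrapper means a shared caller can pass `data_freeze_path` to all parsers without raising `TypeError` in the ones that have no use for it.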

sourcery-ai

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `flows/parsers/parse_ncbi_assemblies.py:398` </location>
<code_context>
+    # local_path = "../vgp_phase1_data_freeze.tsv"
+    # local_path = "/tmp/data_freeze_list.tsv"
+    # fetch_from_s3(data_freeze_path, local_path)
+    local_path = os.path.abspath(data_freeze_path)
+    data_freeze = {}
+    with open(local_path, "r") as f:
</code_context>

<issue_to_address>
**issue (bug_risk):** Using os.path.abspath may not be appropriate for S3 paths.

os.path.abspath does not support S3 URIs. Handle S3 paths separately, such as by downloading them to a local file before opening.
</issue_to_address>
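
One possible way to handle both cases is to check the URI scheme before touching the path. The sketch below is an assumption-laden illustration, not the PR's code: `resolve_local_path` is a hypothetical helper, and `download` stands in for whatever S3 fetch function the project provides, assumed here to take `(source, destination)` like the commented-out `fetch_from_s3` call above.

```python
import os
from typing import Callable
from urllib.parse import urlparse


def resolve_local_path(
    path: str,
    download: Callable[[str, str], None],
    tmp_path: str = "/tmp/data_freeze_list.tsv",
) -> str:
    """Return a local filesystem path for a local path or an s3:// URI."""
    if urlparse(path).scheme == "s3":
        # Remote object: copy it to a local file first (download is hypothetical).
        download(path, tmp_path)
        return tmp_path
    # Plain filesystem path: abspath is safe here.
    return os.path.abspath(path)
```

Passing the downloader in as an argument keeps the helper testable without network access.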

### Comment 2
<location> `flows/parsers/parse_ncbi_assemblies.py:399-404` </location>
<code_context>
+    data_freeze = {}
+    with open(local_path, "r") as f:
+        for line in f:
+            parts = re.split(r"\s*\t\s*", line.strip())
+            if len(parts) < 2:
+                continue
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Splitting on tab with optional whitespace can collapse empty fields.

Because `\s` also matches a tab, the pattern '\s*\t\s*' treats consecutive tabs as a single delimiter, so empty columns disappear and later values shift left. It also strips spaces that belong to a field. Use a strict '\t' delimiter for TSV files to ensure accurate parsing.

```suggestion
    data_freeze = {}
    with open(local_path, "r") as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) < 2:
                continue
```
</issue_to_address>
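
The difference between the two delimiters is easiest to see on a row with an empty column. A minimal sketch; the helper names and the sample accession are made up for illustration:

```python
import re


def split_loose(line: str) -> list:
    """Split the way the original code does: tabs plus surrounding whitespace."""
    return re.split(r"\s*\t\s*", line.strip())


def split_strict(line: str) -> list:
    """Split on single tab characters only, keeping empty fields."""
    return line.rstrip("\n").split("\t")


row = "GCA_000001.1\t\tfreeze_a\n"  # made-up row with an empty middle column
loose = split_loose(row)    # the two tabs are consumed as one delimiter
strict = split_strict(row)  # the empty middle column is preserved
```

Here `loose` has two elements and `strict` has three, so only the strict split keeps column positions stable.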

### Comment 3
<location> `flows/parsers/parse_ncbi_assemblies.py:381` </location>
<code_context>
             continue


+@task(log_prints=True)
+def fetch_data_freeze_file(data_freeze_path: str) -> dict:
+    """
</code_context>

<issue_to_address>
**issue (complexity):** Consider refactoring the data freeze logic to use the csv module, compute the freeze name once, and unify default and explicit freeze handling in a single pass.

You can simplify and DRY-up these tasks in three steps:

1. use the `csv` module to parse your TSV
2. compute `data_freeze_name` once outside the loop
3. unify the “default” vs. explicit-freeze logic in a single pass

For example:

```python
import csv
import os
import re

@task(log_prints=True)
def fetch_data_freeze_file(data_freeze_path: str) -> dict:
    """Fetch a 2‐column TSV and return {accession: [freeze,…]}."""
    print(f"Fetching data freeze file from {data_freeze_path}")
    data_freeze = {}
    with open(os.path.abspath(data_freeze_path), newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for row in reader:
            if len(row) < 2:
                continue
            acc = row[0].strip()
            freezes = [x.strip() for x in row[1].split(",") if x.strip()]
            data_freeze[acc] = freezes
    return data_freeze
```

```python
@task(log_prints=True)
def process_datafreeze_info(
    processed_report: dict, data_freeze: dict, config: Config
):
    """Annotate each record with its dataFreeze + assemblyID."""
    # compute once
    df_name = (
        re.sub(r"\.tsv(\.gz)?$", "", os.path.basename(config.meta["file_name"]))
        if config.meta.get("file_name")
        else "data_freeze"
    )
    print(f"Processing data freeze info {df_name}")
    for rec in processed_report.values():
        # pick explicit status or default to [df_name]
        status = (
            data_freeze.get(rec["refseqAccession"])
            or data_freeze.get(rec["genbankAccession"])
            or [df_name]
        )
        rec["dataFreeze"] = status
        # choose accession and append df_name
        accession = rec.get("refseqAccession") or rec["genbankAccession"]
        rec["assemblyID"] = f"{accession}_{df_name}"
        print(f"{rec['assemblyID']} => {status}")
```

Finally, in your flow you can collapse default vs. explicit into one call:

```python
if data_freeze_path:
    df = fetch_data_freeze_file(data_freeze_path)
else:
    df = {}
process_datafreeze_info(parsed, df, config)
```

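This removes inline regex loops and centralizes `data_freeze_name`. Note that it is not strictly behavior-preserving: the original code skips records with no entry in `data_freeze`, whereas this version tags every record with `[df_name]` and rewrites its `assemblyID`.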
</issue_to_address>

### Comment 4
<location> `flows/parsers/parse_ncbi_assemblies.py:446` </location>
<code_context>
        status = data_freeze.get(line["refseqAccession"], None) or data_freeze.get(

</code_context>

<issue_to_address>
**suggestion (code-quality):** Replace `dict.get(x, None)` with `dict.get(x)` ([`remove-none-from-default-get`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/remove-none-from-default-get))

```suggestion
        status = data_freeze.get(line["refseqAccession"]) or data_freeze.get(
```

<br/><details><summary>Explanation</summary>When using a dictionary's `get` method you can specify a default to return if
the key is not found. This defaults to `None`, so it is unnecessary to specify
`None` if this is the required behaviour. Removing the unnecessary argument
makes the code slightly shorter and clearer.
</details>
</issue_to_address>

### Comment 5
<location> `flows/parsers/parse_ncbi_assemblies.py:446-448` </location>
<code_context>
        status = data_freeze.get(line["refseqAccession"], None) or data_freeze.get(
            line["genbankAccession"], None
        )

</code_context>

<issue_to_address>
**suggestion (code-quality):** Replace `dict.get(x, None)` with `dict.get(x)` ([`remove-none-from-default-get`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/remove-none-from-default-get))

```suggestion
        status = data_freeze.get(line["refseqAccession"]) or data_freeze.get(line["genbankAccession"])

```

<br/><details><summary>Explanation</summary>When using a dictionary's `get` method you can specify a default to return if
the key is not found. This defaults to `None`, so it is unnecessary to specify
`None` if this is the required behaviour. Removing the unnecessary argument
makes the code slightly shorter and clearer.
</details>
</issue_to_address>

### Comment 6
<location> `flows/parsers/parse_ncbi_assemblies.py:453-458` </location>
<code_context>
@task(log_prints=True)
def process_datafreeze_info(processed_report: dict, data_freeze: dict, config: Config):
    """
    Process the data freeze information for a given assembly report.
    Rename the assembly

    Args:
        processed_report (dict): A dictionary containing processed assembly data.
        data_freeze (dict): A dictionary containing data freeze information.
    """
    data_freeze_name = (
        re.sub(r"\.tsv(\.gz)?$", "", os.path.basename(config.meta["file_name"]))
        if config.meta["file_name"]
        else "data_freeze"
    )
    print(f"Processing data freeze info for {data_freeze_name}")
    for line in processed_report.values():
        print(
            f"Processing data freeze info for {line['refseqAccession']} - "
            f"{line['genbankAccession']}"
        )
        status = data_freeze.get(line["refseqAccession"], None) or data_freeze.get(
            line["genbankAccession"], None
        )
        if not status:
            continue
        line["dataFreeze"] = status

        accession_name = (
            line["refseqAccession"]
            if line["refseqAccession"] in data_freeze.keys()
            else line["genbankAccession"]
        )
        line["assemblyID"] = accession_name + "_" + data_freeze_name
        print(line["assemblyID"])

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Remove unnecessary call to keys() ([`remove-dict-keys`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-dict-keys/))
- Use f-string instead of string concatenation [×2] ([`use-fstring-for-concatenation`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-fstring-for-concatenation/))
</issue_to_address>
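
Applied to the ID-building step, the two refactorings could look like the sketch below. `assign_assembly_id` is a hypothetical helper extracted only for illustration; in the PR this logic sits inline in `process_datafreeze_info`.

```python
def assign_assembly_id(line: dict, data_freeze: dict, data_freeze_name: str) -> None:
    """Set line["assemblyID"] using dict membership and an f-string."""
    accession_name = (
        line["refseqAccession"]
        if line["refseqAccession"] in data_freeze  # membership test, no .keys() call
        else line["genbankAccession"]
    )
    # f-string instead of accession_name + "_" + data_freeze_name
    line["assemblyID"] = f"{accession_name}_{data_freeze_name}"
```

Both changes are behavior-preserving: `x in d` and `x in d.keys()` test the same thing, and the f-string builds the same string as the concatenation.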

flows/parsers/parse_ncbi_assemblies.py

- Introduced `parse_s3_file` function to fetch and parse TSV files from S3.
- Updated `fetch_data_freeze_file` to use the new parsing method.
- Renamed `fetch_data_freeze_file` to `parse_data_freeze_file` for clarity.
- Modified `fetch_ncbi_datasets_summary` to accept an optional data freeze path.
- Enhanced error handling and logging for better traceability.

allow data freeze parsing

bdbb733

sourcery-ai bot reviewed Sep 25, 2025

rjchallis added 3 commits September 25, 2025 15:37

update wrapper and add data freeze deployments

72bf2a3

gzip jsonl on s3

4713e51

rjchallis merged commit a835a4e into main Sep 25, 2025
3 checks passed

rjchallis merged 4 commits into main from vgp-freeze
