allow data freeze parsing#2

Merged
rjchallis merged 4 commits into main from vgp-freeze on Sep 25, 2025

ccaio commented Sep 25, 2025

Summary by Sourcery

Introduce optional data freeze parsing into the NCBI assemblies workflow by adding tasks to fetch, default, and apply freeze information per assembly, controlled via a new CLI flag.

New Features:

  • Add fetch_data_freeze_file task to read assembly-to-freeze mappings from a TSV.
  • Add set_data_freeze_default task to assign a default data freeze when none is provided.
  • Add process_datafreeze_info task to update parsed assembly entries with freeze subsets and adjust assembly IDs.
  • Introduce --data_freeze_path CLI option and parameterize parse_ncbi_assemblies and wrapper scripts to accept extra keyword args

Enhancements:

  • Log processing steps for each assembly and data freeze operation to aid debugging

sourcery-ai bot commented Sep 25, 2025

Reviewer's Guide

This PR extends the NCBI assemblies parsing flow with support for ‘data freeze’ lists by adding new Prefect tasks for reading and applying freeze subsets, updating the main flow signature to conditionally invoke them, enhancing the CLI to accept a data_freeze_path flag, and adjusting helper-parsers for compatibility.

File-Level Changes

Change: Introduce data freeze handling tasks
Details:
  • Added fetch_data_freeze_file task to load and parse a TSV of freeze lists
  • Implemented set_data_freeze_default to assign a fallback freeze
  • Created process_datafreeze_info to map and tag assemblies with specific freezes and build their IDs
Files:
  • flows/parsers/parse_ncbi_assemblies.py

Change: Integrate data freeze into parse_ncbi_assemblies flow
Details:
  • Extended flow signature with optional data_freeze_path
  • Branched to use default or fetched freeze data based on presence of the path
  • Invoked new tasks and updated write_to_tsv accordingly
Files:
  • flows/parsers/parse_ncbi_assemblies.py

Change: Expand CLI to accept data_freeze_path
Details:
  • Defined DATA_FREEZE_PATH flag in shared_args
  • Added the flag to the parser args for NCBI assemblies
Files:
  • flows/lib/shared_args.py
  • flows/parsers/args.py

Change: Make parser wrappers accept extra kwargs
Details:
  • Updated parse_refseq_organelles to include **kwargs
  • Modified parse_sequencing_status to accept **kwargs
  • Adjusted parse_skip_parsing signature for **kwargs
Files:
  • flows/parsers/parse_refseq_organelles.py
  • flows/parsers/parse_sequencing_status.py
  • flows/parsers/parse_skip_parsing.py

Change: Add logging for report processing
Details:
  • Inserted print in process_assembly_reports to log the current accession
  • Tagged prints with log_prints for Prefect visibility
Files:
  • flows/parsers/parse_ncbi_assemblies.py
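
The CLI and wrapper changes listed above can be illustrated with a minimal sketch. It assumes an `argparse`-based command line and a hypothetical `PARSERS` registry with a `run_parser` dispatcher; only the `--data_freeze_path` flag and the parser function names come from this PR. The function bodies are stand-ins, and the real signatures under `flows/` may differ.

```python
import argparse


def parse_ncbi_assemblies(input_path, yaml_path, data_freeze_path=None, **kwargs):
    """Stand-in for the real flow: report which freeze branch would run."""
    return "explicit freeze" if data_freeze_path else "default freeze"


def parse_skip_parsing(input_path, yaml_path, **kwargs):
    """Stand-in for a parser with no freeze support; **kwargs absorbs the flag."""
    return "skipped"


# Hypothetical registry, not the project's actual dispatch mechanism.
PARSERS = {
    "ncbi_assemblies": parse_ncbi_assemblies,
    "skip_parsing": parse_skip_parsing,
}


def build_arg_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--parser", choices=sorted(PARSERS), required=True)
    parser.add_argument("--input_path", required=True)
    parser.add_argument("--yaml_path", required=True)
    # Optional: when omitted, the flow falls back to the default freeze.
    parser.add_argument("--data_freeze_path", default=None)
    return parser


def run_parser(argv):
    """Pass every CLI option to the selected parser as keyword arguments."""
    args = vars(build_arg_parser().parse_args(argv))
    parser_fn = PARSERS[args.pop("parser")]
    return parser_fn(**args)
```

Because the dispatcher forwards the same keyword arguments to every parser, a wrapper without `**kwargs` would raise a `TypeError` as soon as `data_freeze_path` is passed, which is one plausible reason for the signature changes.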

sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `flows/parsers/parse_ncbi_assemblies.py:398` </location>
<code_context>
+    # local_path = "../vgp_phase1_data_freeze.tsv"
+    # local_path = "/tmp/data_freeze_list.tsv"
+    # fetch_from_s3(data_freeze_path, local_path)
+    local_path = os.path.abspath(data_freeze_path)
+    data_freeze = {}
+    with open(local_path, "r") as f:
</code_context>

<issue_to_address>
**issue (bug_risk):** Using os.path.abspath may not be appropriate for S3 paths.

os.path.abspath does not support S3 URIs. Handle S3 paths separately, such as by downloading them to a local file before opening.
</issue_to_address>
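
One way to handle S3 paths separately is sketched below. The name `fetch_from_s3` and its `(remote, local)` argument order are taken from the commented-out call in the code context; its behaviour is an assumption, so it is passed in as a callable. `resolve_local_path` is a hypothetical helper, not part of the PR.

```python
import os
import tempfile
from urllib.parse import urlparse


def resolve_local_path(path, fetch_from_s3):
    """Return a readable local path, downloading first if `path` is an S3 URI."""
    if urlparse(path).scheme == "s3":
        # Reserve a temporary file name, then let the fetcher fill it.
        fd, local_path = tempfile.mkstemp(suffix=".tsv")
        os.close(fd)
        fetch_from_s3(path, local_path)  # assumed signature: (remote, local)
        return local_path
    # Plain filesystem paths can be normalised as before.
    return os.path.abspath(path)
```

Calling `os.path.abspath` directly on an `s3://` URI joins it to the current working directory, so the result no longer starts with `s3://` and cannot be opened as intended.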

### Comment 2
<location> `flows/parsers/parse_ncbi_assemblies.py:399-404` </location>
<code_context>
+    data_freeze = {}
+    with open(local_path, "r") as f:
+        for line in f:
+            parts = re.split(r"\s*\t\s*", line.strip())
+            if len(parts) < 2:
+                continue
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Splitting on tabs with optional surrounding whitespace can silently drop empty fields.

Because `\s` also matches a tab, '\s*\t\s*' treats a run of adjacent tabs as a single delimiter, so an empty column disappears and later columns shift left. Use a strict '\t' delimiter for TSV files to ensure accurate parsing.

```suggestion
    data_freeze = {}
    with open(local_path, "r") as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) < 2:
                continue
```
</issue_to_address>
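
The difference between the two delimiters can be checked with a small, self-contained comparison. The row below is made up for illustration (a placeholder accession, an empty second column, and a freeze name); it is not taken from the real data freeze file.

```python
import re

# Made-up row: accession, an empty column, then a freeze name.
line = "GCA_000000001.1\t\tphase1\n"

# Pattern from the PR: \s also matches a tab, so both tabs form one delimiter.
loose = re.split(r"\s*\t\s*", line.strip())

# Strict TSV split: every tab is a delimiter, so the empty column survives.
strict = line.rstrip("\n").split("\t")

print(loose)   # ['GCA_000000001.1', 'phase1']
print(strict)  # ['GCA_000000001.1', '', 'phase1']
```

With the loose pattern the freeze name lands in column 2 instead of column 3, which is the kind of silent column shift a strict delimiter avoids.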

### Comment 3
<location> `flows/parsers/parse_ncbi_assemblies.py:381` </location>
<code_context>
             continue


+@task(log_prints=True)
+def fetch_data_freeze_file(data_freeze_path: str) -> dict:
+    """
</code_context>

<issue_to_address>
**issue (complexity):** Consider refactoring the data freeze logic to use the csv module, compute the freeze name once, and unify default and explicit freeze handling in a single pass.

You can simplify and DRY-up these tasks in three steps:

1. use the `csv` module to parse your TSV  
2. compute `data_freeze_name` once outside the loop  
3. unify the “default” vs. explicit‐freeze logic in a single pass  

For example:

```python
import csv
import os
import re

@task(log_prints=True)
def fetch_data_freeze_file(data_freeze_path: str) -> dict:
    """Fetch a 2‐column TSV and return {accession: [freeze,…]}."""
    print(f"Fetching data freeze file from {data_freeze_path}")
    data_freeze = {}
    with open(os.path.abspath(data_freeze_path), newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for row in reader:
            if len(row) < 2:
                continue
            acc = row[0].strip()
            freezes = [x.strip() for x in row[1].split(",") if x.strip()]
            data_freeze[acc] = freezes
    return data_freeze
```

```python
@task(log_prints=True)
def process_datafreeze_info(
    processed_report: dict, data_freeze: dict, config: Config
):
    """Annotate each record with its dataFreeze + assemblyID."""
    # compute once
    df_name = (
        re.sub(r"\.tsv(\.gz)?$", "", os.path.basename(config.meta["file_name"]))
        if config.meta.get("file_name")
        else "data_freeze"
    )
    print(f"Processing data freeze info {df_name}")
    for rec in processed_report.values():
        # pick explicit status or default to [df_name]
        status = (
            data_freeze.get(rec["refseqAccession"])
            or data_freeze.get(rec["genbankAccession"])
            or [df_name]
        )
        rec["dataFreeze"] = status
        # choose accession and append df_name
        accession = rec.get("refseqAccession") or rec["genbankAccession"]
        rec["assemblyID"] = f"{accession}_{df_name}"
        print(f"{rec['assemblyID']} => {status}")
```

Finally, in your flow you can collapse default vs. explicit into one call:

```python
if data_freeze_path:
    df = fetch_data_freeze_file(data_freeze_path)
else:
    df = {}
process_datafreeze_info(parsed, df, config)
```

This keeps all functionality, removes inline regex loops, and centralizes `data_freeze_name`.
</issue_to_address>

### Comment 4
<location> `flows/parsers/parse_ncbi_assemblies.py:446` </location>
<code_context>
        status = data_freeze.get(line["refseqAccession"], None) or data_freeze.get(

</code_context>

<issue_to_address>
**suggestion (code-quality):** Replace `dict.get(x, None)` with `dict.get(x)` ([`remove-none-from-default-get`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/remove-none-from-default-get))

```suggestion
        status = data_freeze.get(line["refseqAccession"]) or data_freeze.get(
```

<br/><details><summary>Explanation</summary>When using a dictionary's `get` method you can specify a default to return if
the key is not found. This defaults to `None`, so it is unnecessary to specify
`None` if this is the required behaviour. Removing the unnecessary argument
makes the code slightly shorter and clearer.
</details>
</issue_to_address>

### Comment 5
<location> `flows/parsers/parse_ncbi_assemblies.py:446-448` </location>
<code_context>
        status = data_freeze.get(line["refseqAccession"], None) or data_freeze.get(
            line["genbankAccession"], None
        )

</code_context>

<issue_to_address>
**suggestion (code-quality):** Replace `dict.get(x, None)` with `dict.get(x)` ([`remove-none-from-default-get`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/remove-none-from-default-get))

```suggestion
        status = data_freeze.get(line["refseqAccession"]) or data_freeze.get(line["genbankAccession"])

```

<br/><details><summary>Explanation</summary>When using a dictionary's `get` method you can specify a default to return if
the key is not found. This defaults to `None`, so it is unnecessary to specify
`None` if this is the required behaviour. Removing the unnecessary argument
makes the code slightly shorter and clearer.
</details>
</issue_to_address>

### Comment 6
<location> `flows/parsers/parse_ncbi_assemblies.py:453-458` </location>
<code_context>
@task(log_prints=True)
def process_datafreeze_info(processed_report: dict, data_freeze: dict, config: Config):
    """
    Process the data freeze information for a given assembly report.
    Rename the assembly

    Args:
        processed_report (dict): A dictionary containing processed assembly data.
        data_freeze (dict): A dictionary containing data freeze information.
    """
    data_freeze_name = (
        re.sub(r"\.tsv(\.gz)?$", "", os.path.basename(config.meta["file_name"]))
        if config.meta["file_name"]
        else "data_freeze"
    )
    print(f"Processing data freeze info for {data_freeze_name}")
    for line in processed_report.values():
        print(
            f"Processing data freeze info for {line['refseqAccession']} - "
            f"{line['genbankAccession']}"
        )
        status = data_freeze.get(line["refseqAccession"], None) or data_freeze.get(
            line["genbankAccession"], None
        )
        if not status:
            continue
        line["dataFreeze"] = status

        accession_name = (
            line["refseqAccession"]
            if line["refseqAccession"] in data_freeze.keys()
            else line["genbankAccession"]
        )
        line["assemblyID"] = accession_name + "_" + data_freeze_name
        print(line["assemblyID"])

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Remove unnecessary call to keys() ([`remove-dict-keys`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-dict-keys/))
- Use f-string instead of string concatenation [×2] ([`use-fstring-for-concatenation`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-fstring-for-concatenation/))
</issue_to_address>
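
Applied to the loop in the code context, the two rules could look like the sketch below. `apply_data_freeze` is a hypothetical plain function (no Prefect `@task` decorator) that takes `data_freeze_name` as a parameter instead of deriving it from `config`, so that it runs on its own. It also drops the `None` defaults flagged in Comments 4 and 5.

```python
def apply_data_freeze(processed_report, data_freeze, data_freeze_name):
    """Tag matching records with their freeze list and a freeze-specific ID."""
    for line in processed_report.values():
        refseq = line["refseqAccession"]
        genbank = line["genbankAccession"]
        status = data_freeze.get(refseq) or data_freeze.get(genbank)
        if not status:
            # No freeze entry for either accession: leave the record untouched.
            continue
        line["dataFreeze"] = status
        # Membership test on the dict itself; no .keys() call needed.
        accession_name = refseq if refseq in data_freeze else genbank
        # f-string instead of string concatenation.
        line["assemblyID"] = f"{accession_name}_{data_freeze_name}"
    return processed_report
```

Behaviour is intended to match the original loop: records without a freeze entry are skipped, and the accession that appears in the freeze mapping is the one used to build `assemblyID`.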


- Introduced `parse_s3_file` function to fetch and parse TSV files from S3.
- Updated `fetch_data_freeze_file` to use the new parsing method.
- Renamed `fetch_data_freeze_file` to `parse_data_freeze_file` for clarity.
- Modified `fetch_ncbi_datasets_summary` to accept an optional data freeze path.
- Enhanced error handling and logging for better traceability.
rjchallis merged commit a835a4e into main on Sep 25, 2025
3 checks passed