Merged
Reviewer's Guide
This PR extends the NCBI assemblies parsing flow with support for ‘data freeze’ lists. It adds new Prefect tasks for reading and applying freeze subsets, updates the main flow signature to invoke them conditionally, adds a data_freeze_path flag to the CLI, and adjusts the helper parsers for compatibility.
File-Level Changes
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `flows/parsers/parse_ncbi_assemblies.py:398` </location>
<code_context>
+    # local_path = "../vgp_phase1_data_freeze.tsv"
+    # local_path = "/tmp/data_freeze_list.tsv"
+    # fetch_from_s3(data_freeze_path, local_path)
+    local_path = os.path.abspath(data_freeze_path)
+    data_freeze = {}
+    with open(local_path, "r") as f:
</code_context>
<issue_to_address>
**issue (bug_risk):** Using os.path.abspath may not be appropriate for S3 paths.
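`os.path.abspath` does not understand S3 URIs: it treats a value such as `s3://bucket/key` as a relative path and prefixes the current working directory, so the subsequent `open` will normally fail. Handle S3 paths separately, for example by downloading them to a local file before opening.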
</issue_to_address>
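One way to act on this comment is to branch on the URI scheme before touching the filesystem. The sketch below is a minimal illustration, not the project's implementation: `resolve_local_path` is a hypothetical helper, and the `fetch_from_s3(remote, local)` callable is an assumed signature modelled on the commented-out call in the snippet above.

```python
import os
import tempfile
from typing import Callable, Optional
from urllib.parse import urlparse


def resolve_local_path(
    path: str,
    fetch_from_s3: Optional[Callable[[str, str], None]] = None,
) -> str:
    """Return a local filesystem path for `path`.

    Hypothetical helper: `fetch_from_s3(remote_uri, local_path)` is an
    assumed downloader signature, injected so the branching logic can be
    exercised without network access.
    """
    if urlparse(path).scheme == "s3":
        if fetch_from_s3 is None:
            raise ValueError(f"No S3 downloader supplied for {path}")
        # Download to a temporary file and hand back its local path.
        fd, local_path = tempfile.mkstemp(suffix=".tsv")
        os.close(fd)
        fetch_from_s3(path, local_path)
        return local_path
    # Plain local paths can safely be normalised with abspath.
    return os.path.abspath(path)
```

Injecting the downloader keeps the path logic separate from whichever S3 client the flow actually uses.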
### Comment 2
<location> `flows/parsers/parse_ncbi_assemblies.py:399-404` </location>
<code_context>
+    data_freeze = {}
+    with open(local_path, "r") as f:
+        for line in f:
+            parts = re.split(r"\s*\t\s*", line.strip())
+            if len(parts) < 2:
+                continue
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Splitting on tab with optional surrounding whitespace can merge or drop fields.
Because `\s` also matches tabs, '\s*\t\s*' treats consecutive tabs as a single delimiter, so empty fields are silently dropped. Use a strict '\t' delimiter for TSV files to ensure accurate parsing.
```suggestion
    data_freeze = {}
    with open(local_path, "r") as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) < 2:
                continue
```
</issue_to_address>
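To see why the delimiter matters, the following sketch compares the regex split with a strict tab split on a row that has an empty middle field. It uses only the standard library, and the sample row is invented for illustration.

```python
import csv
import io
import re

# Invented sample row with an empty second field.
row = "GCA_000001.1\t\tvgp_phase1\n"

# \s also matches tabs, so the two consecutive tabs are consumed as one
# delimiter and the empty field disappears.
regex_parts = re.split(r"\s*\t\s*", row.strip())

# A strict tab split keeps the empty field in place.
strict_parts = row.rstrip("\n").split("\t")

# csv.reader with a tab delimiter agrees with the strict split.
csv_parts = next(csv.reader(io.StringIO(row), delimiter="\t"))

print(regex_parts)   # two fields
print(strict_parts)  # three fields, the middle one empty
print(csv_parts)     # three fields, the middle one empty
```

For a two-column file the difference only shows up on malformed rows, but a strict delimiter makes such rows visible instead of silently shifting columns.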
### Comment 3
<location> `flows/parsers/parse_ncbi_assemblies.py:381` </location>
<code_context>
continue
+@task(log_prints=True)
+def fetch_data_freeze_file(data_freeze_path: str) -> dict:
+ """
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the data freeze logic to use the csv module, compute the freeze name once, and unify default and explicit freeze handling in a single pass.
You can simplify and DRY up these tasks in three steps:
1. Use the `csv` module to parse your TSV.
2. Compute `data_freeze_name` once, outside the loop.
3. Unify the "default" vs. explicit-freeze logic in a single pass.

For example:
```python
import csv
import os
import re

from prefect import task


@task(log_prints=True)
def fetch_data_freeze_file(data_freeze_path: str) -> dict:
    """Fetch a 2-column TSV and return {accession: [freeze,…]}."""
    print(f"Fetching data freeze file from {data_freeze_path}")
    data_freeze = {}
    with open(os.path.abspath(data_freeze_path), newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for row in reader:
            # skip blank or single-column rows
            if len(row) < 2:
                continue
            acc = row[0].strip()
            # second column holds a comma-separated list of freeze names
            freezes = [x.strip() for x in row[1].split(",") if x.strip()]
            data_freeze[acc] = freezes
    return data_freeze
```
```python
@task(log_prints=True)
def process_datafreeze_info(
    processed_report: dict, data_freeze: dict, config: Config
):
    """Annotate each record with its dataFreeze + assemblyID."""
    # compute once
    df_name = (
        re.sub(r"\.tsv(\.gz)?$", "", os.path.basename(config.meta["file_name"]))
        if config.meta.get("file_name")
        else "data_freeze"
    )
    print(f"Processing data freeze info {df_name}")
    for rec in processed_report.values():
        # pick explicit status or default to [df_name]
        status = (
            data_freeze.get(rec["refseqAccession"])
            or data_freeze.get(rec["genbankAccession"])
            or [df_name]
        )
        rec["dataFreeze"] = status
        # choose accession and append df_name
        accession = rec.get("refseqAccession") or rec["genbankAccession"]
        rec["assemblyID"] = f"{accession}_{df_name}"
        print(f"{rec['assemblyID']} => {status}")
```
Finally, in your flow you can collapse default vs. explicit into one call:
```python
if data_freeze_path:
    df = fetch_data_freeze_file(data_freeze_path)
else:
    df = {}
process_datafreeze_info(parsed, df, config)
```
This keeps all functionality, removes inline regex loops, and centralizes `data_freeze_name`.
</issue_to_address>
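The `process_datafreeze_info` quoted under Comment 6 skips records whose accessions are absent from the freeze list and names the assembly after whichever accession matched. If that skip-and-match behaviour is wanted rather than a default, a minimal sketch might look like the following. `annotate_with_freeze` is a hypothetical plain function (no Prefect decorator, no `Config`), the sample records are invented, and it is not guaranteed to match the original in every edge case (for example an accession listed with an empty freeze list).

```python
def annotate_with_freeze(records: dict, data_freeze: dict, freeze_name: str) -> None:
    """Annotate records in place; leave records outside the freeze untouched."""
    for rec in records.values():
        # Try the RefSeq accession first, then fall back to GenBank.
        for key in ("refseqAccession", "genbankAccession"):
            accession = rec.get(key)
            if accession in data_freeze:
                rec["dataFreeze"] = data_freeze[accession]
                rec["assemblyID"] = f"{accession}_{freeze_name}"
                break


# Invented sample data for illustration only.
records = {
    "a": {"refseqAccession": "GCF_1.1", "genbankAccession": "GCA_1.1"},
    "b": {"refseqAccession": None, "genbankAccession": "GCA_2.1"},
    "c": {"refseqAccession": None, "genbankAccession": "GCA_3.1"},
}
freeze = {"GCA_1.1": ["phase1"], "GCA_2.1": ["phase1", "phase2"]}
annotate_with_freeze(records, freeze, "vgp_phase1_data_freeze")
```

Here records `a` and `b` gain `dataFreeze` and `assemblyID` keys, while `c` is left unchanged because neither of its accessions appears in the freeze list.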
### Comment 4
<location> `flows/parsers/parse_ncbi_assemblies.py:446` </location>
<code_context>
status = data_freeze.get(line["refseqAccession"], None) or data_freeze.get(
</code_context>
<issue_to_address>
**suggestion (code-quality):** Replace `dict.get(x, None)` with `dict.get(x)` ([`remove-none-from-default-get`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/remove-none-from-default-get))
```suggestion
status = data_freeze.get(line["refseqAccession"]) or data_freeze.get(
```
<br/><details><summary>Explanation</summary>When using a dictionary's `get` method you can specify a default to return if
the key is not found. This defaults to `None`, so it is unnecessary to specify
`None` if this is the required behaviour. Removing the unnecessary argument
makes the code slightly shorter and clearer.
</details>
</issue_to_address>
### Comment 5
<location> `flows/parsers/parse_ncbi_assemblies.py:446-448` </location>
<code_context>
status = data_freeze.get(line["refseqAccession"], None) or data_freeze.get(
    line["genbankAccession"], None
)
</code_context>
<issue_to_address>
**suggestion (code-quality):** Replace `dict.get(x, None)` with `dict.get(x)` ([`remove-none-from-default-get`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/remove-none-from-default-get))
```suggestion
status = data_freeze.get(line["refseqAccession"]) or data_freeze.get(line["genbankAccession"])
```
<br/><details><summary>Explanation</summary>When using a dictionary's `get` method you can specify a default to return if
the key is not found. This defaults to `None`, so it is unnecessary to specify
`None` if this is the required behaviour. Removing the unnecessary argument
makes the code slightly shorter and clearer.
</details>
</issue_to_address>
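A quick check of the rule, plus one subtlety of chaining lookups with `or`: any falsy value, not just `None`, makes the chain fall through to the second lookup. The dictionary contents are invented for illustration.

```python
# Invented sample: the RefSeq accession is listed but has an empty freeze list.
data_freeze = {"GCF_1.1": [], "GCA_1.1": ["phase1"]}

# The explicit None default is redundant: both calls return None for a missing key.
assert data_freeze.get("missing", None) is None
assert data_freeze.get("missing") is None

# `or` falls through on any falsy value, including an empty list,
# so the GenBank entry is used here.
status = data_freeze.get("GCF_1.1") or data_freeze.get("GCA_1.1")
print(status)
```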
### Comment 6
<location> `flows/parsers/parse_ncbi_assemblies.py:453-458` </location>
<code_context>
@task(log_prints=True)
def process_datafreeze_info(processed_report: dict, data_freeze: dict, config: Config):
    """
    Process the data freeze information for a given assembly report.
    Rename the assembly

    Args:
        processed_report (dict): A dictionary containing processed assembly data.
        data_freeze (dict): A dictionary containing data freeze information.
    """
    data_freeze_name = (
        re.sub(r"\.tsv(\.gz)?$", "", os.path.basename(config.meta["file_name"]))
        if config.meta["file_name"]
        else "data_freeze"
    )
    print(f"Processing data freeze info for {data_freeze_name}")
    for line in processed_report.values():
        print(
            f"Processing data freeze info for {line['refseqAccession']} - "
            f"{line['genbankAccession']}"
        )
        status = data_freeze.get(line["refseqAccession"], None) or data_freeze.get(
            line["genbankAccession"], None
        )
        if not status:
            continue
        line["dataFreeze"] = status
        accession_name = (
            line["refseqAccession"]
            if line["refseqAccession"] in data_freeze.keys()
            else line["genbankAccession"]
        )
        line["assemblyID"] = accession_name + "_" + data_freeze_name
        print(line["assemblyID"])
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Remove unnecessary call to keys() ([`remove-dict-keys`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-dict-keys/))
- Use f-string instead of string concatenation [×2] ([`use-fstring-for-concatenation`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-fstring-for-concatenation/))
</issue_to_address>
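Applied to the accession-naming lines quoted in Comment 6, the two refactorings could look like this. The sketch is self-contained, and the record and freeze values are invented.

```python
# Invented sample values for illustration only.
data_freeze = {"GCA_1.1": ["phase1"]}
data_freeze_name = "vgp_phase1_data_freeze"
line = {"refseqAccession": "GCF_1.1", "genbankAccession": "GCA_1.1"}

# Membership tests work on the dict directly; .keys() is unnecessary.
accession_name = (
    line["refseqAccession"]
    if line["refseqAccession"] in data_freeze
    else line["genbankAccession"]
)

# An f-string replaces the chained "+" concatenation.
line["assemblyID"] = f"{accession_name}_{data_freeze_name}"
print(line["assemblyID"])
```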
- Introduced `parse_s3_file` function to fetch and parse TSV files from S3.
- Updated `fetch_data_freeze_file` to use the new parsing method.
- Renamed `fetch_data_freeze_file` to `parse_data_freeze_file` for clarity.
- Modified `fetch_ncbi_datasets_summary` to accept an optional data freeze path.
- Enhanced error handling and logging for better traceability.
Summary by Sourcery
Introduce optional data freeze parsing into the NCBI assemblies workflow by adding tasks that fetch the freeze list, assign a default freeze, and apply freeze information to each assembly, controlled via a new CLI flag.
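As a rough illustration of an optional flag gating the extra step, the sketch below uses `argparse`. The flag name `--data_freeze_path` follows the parameter named in the reviewer's guide; the parser layout, `build_parser`, and `main` are assumptions and do not reflect the project's actual CLI.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser with a single optional freeze flag."""
    parser = argparse.ArgumentParser(description="Parse NCBI assemblies (sketch).")
    # Optional: when omitted, the freeze step is skipped entirely.
    parser.add_argument(
        "--data_freeze_path",
        default=None,
        help="Path to a two-column TSV data freeze list.",
    )
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    """Report which branch the flow would take for the given arguments."""
    args = build_parser().parse_args(argv)
    if args.data_freeze_path:
        return f"apply freeze from {args.data_freeze_path}"
    return "no freeze applied"
```

Calling `main([])` takes the default branch, while `main(["--data_freeze_path", "freeze.tsv"])` takes the freeze branch.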
New Features:
Enhancements: