-
Notifications
You must be signed in to change notification settings - Fork 27
Feat : Multiple download functionality #271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Feat : Multiple download functionality #271
Conversation
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Avantol13
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had time to review other Python code, will get to that soon
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
533d187 to
97c29eb
Compare
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
gen3/cli/download.py
Outdated
| @click.argument("guid") | ||
| @click.option( | ||
| "--download-path", | ||
| default=".", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make this default more dynamic and a folder by default.
maybe let's do a timestamped folder name like:
from datetime import datetime
...
default=f"download_{datetime.now().strftime("%d_%b_%Y")}",
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do the same thing for multiple download async
gen3/cli/download.py
Outdated
| ) | ||
| @click.option( | ||
| "--max-concurrent-requests", | ||
| default=300, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really, really high. Let's scale this default to something like 20 and people can manually bump this up as needed.
| file_client = Gen3File(auth_provider=auth) | ||
|
|
||
| # Debug logging for input parameters | ||
| logging.debug( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this before the prompt above and you need to use our logger. Put this at the top of the file. See other files for reference:
from cdislogging import get_logger
logging = get_logger("__name__")
| type=int, | ||
| ) | ||
| @click.option( | ||
| "--skip-completed", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is working as intended. I get a lot of warnings when I try this:
$ poetry run gen3 -vv download-multiple-async --manifest MIDRC_case_manifest.json --download-path ./downloads --skip-completed
[2025-09-02 14:28:36,954][ DEBUG] Initializing auth..
Found 5178 files to download
Continue with async download? [y/N]: y
Downloading: 0%| | 0/5178 [00:00<?, ?it/s][2025-09-02 14:28:40,004][WARNING] File will be overwritten: downloads/10041569-u_H3HaB1lES6-HXVpiEfMA/2.16.840.1.114274.1818.48858790339993589885669552017090272896/2.16.840.1.114274.1818.56909342758958044433235758152671341713.zip
[2025-09-02 14:28:40,010][WARNING] File will be overwritten: downloads/10041569-FtpnR4GEPEe9JztuDUqrXg/2.16.840.1.114274.1818.544490635779373958610541144202466729913/2.16.840.1.114274.1818.55741368199706394947919032241968638388.zip
[2025-09-02 14:28:40,085][WARNING] File will be overwritten: downloads/10041569-CooqJZVIRkywB6_o9CJi0Q/2.16.840.1.114274.1818.567640554746210778914433462824885428113/2.16.840.1.114274.1818.523353596424499437510054502719882590104.zip
Actually... it seems like maybe it is actually skipping but we need to suppress all these warnings and/or not overwrite existing files, just skip downloading again.
gen3/cli/download.py
Outdated
| ) | ||
| @click.option( | ||
| "--skip-completed", | ||
| is_flag=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you can have an is_flag and default to True b/c how do I set this to false then? it just needs to be a boolean but default to true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could reverse the logic here and just call it: redownload-completed and then leave is_flag=True but default to false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure to update docs if you change this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't change it, actually. We need to ensure we're matching previous commands exactly.
Here's the help from gen3-client:
Flags:
--download-path string The directory in which to store the downloaded files (default ".")
--filename-format string The format of filename to be used, including "original", "guid" and "combined" (default "original")
-h, --help help for download-multiple
--manifest string The manifest file to read from. A valid manifest can be acquired by using the "Download Manifest" button in Data Explorer from a data common's portal
--no-prompt If set to true, will not display user prompt message for confirmation
--numparallel int Number of downloads to run in parallel (default 1)
--protocol string Specify the preferred protocol with --protocol=s3
--rename Only useful when "--filename-format=original", will rename file by appending a counter value to its filename if set to true, otherwise the same filename will be used
--skip-completed If set to true, will check for filename and size before download and skip any files in "download-path" that matches both
Let's change the overall command name to match too. download-multiple instead of download-multiple-async
Keep skip completed but make it a boolean so I can pass in false.
I'm okay with --numparallel not existing and being different b/c the parallelization scheme in Python is a bit different... maybe we can allow that to be passed in though and just use the value as max-concurrent-requests if that isn't supplied. So people can easily port their commands over and have them work right away
docs/howto/asyncDownloadMultiple.md
Outdated
| - **High-bandwidth networks**: Increase the number of worker processes | ||
| - **Limited memory**: Reduce queue sizes to manage memory usage | ||
|
|
||
| ### Memory Management |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove this section, I feel it's redundant given the above config info
docs/howto/asyncDownloadMultiple.md
Outdated
| - **Batch Size**: Balance between memory usage and processing overhead | ||
| - **Process Count**: Match available CPU cores for optimal performance | ||
|
|
||
| ### Network Optimization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also remove this section
docs/howto/asyncDownloadMultiple.md
Outdated
|
|
||
| - Check network bandwidth and server limits | ||
| - Reduce concurrent request limits if server is overwhelmed | ||
| - Verify authentication token is valid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auth token would not be a reason for slow downloads, let's remove this
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Avantol13
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to add unit tests for this new functionality
gen3/cli/__main__.py
Outdated
| main.add_command(drs_pull.drs_pull) | ||
| main.add_command(file.file) | ||
| main.add_command(download.download_single, name="download-single") | ||
| main.add_command(download.download_multiple_async, name="download-multiple-async") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| main.add_command(download.download_multiple_async, name="download-multiple-async") | |
| main.add_command(download.download_multiple, name="download-multiple") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to remember to retest after any changes to your code, this was broken b/c the the function doesn't exist anymore
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
| ] | ||
|
|
||
|
|
||
| def test_download_single_success(gen3_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a docstring for each test describing what you're testing
docs/howto/asyncDownloadMultiple.md
Outdated
| gen3 --endpoint data.commons.io --auth creds.json download-multiple \ | ||
| --manifest large_dataset.json \ | ||
| --download-path ./large_downloads \ | ||
| --max-concurrent-requests 20 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's change to:
| --max-concurrent-requests 20 \ | |
| --max-concurrent-requests 64 \ | |
| --max-concurrent-requests 8 |
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Avantol13
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will also need to update this test. I got it working:
def test_download_single_basic_functionality(gen3_file):
"""
Test download_single basic functionality with synchronous download.
Verifies that download_single downloads a file successfully using
synchronous requests and returns True.
"""
gen3_file._auth_provider._refresh_token = {"api_key": "123"}
with patch.object(gen3_file, 'get_presigned_url') as mock_presigned, \
patch('gen3.file.requests.get') as mock_get, \
patch('gen3.index.Gen3Index.get_record') as mock_index:
mock_presigned.return_value = {"url": "https://fake-url.com/file"}
mock_index.return_value = {"file_name": "test-file.txt"}
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.headers = {"content-length": "12"}
mock_response.iter_content = lambda size: [b"test content"]
mock_get.return_value = mock_response
result = gen3_file.download_single(object_id="test-guid", path="/tmp", protocol="s3")
assert result == True
mock_presigned.assert_called_once_with("test-guid", protocol="s3")
mock_index.assert_called_once_with("test-guid")| try: | ||
| file_client = Gen3File(auth_provider=auth) | ||
|
|
||
| result = file_client.download_single( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function does not take these params and it's failing. I believe I got it working.
@click.command()
@click.argument("guid")
@click.option(
"--download-path",
default=f"./download_{datetime.now().strftime('%d_%b_%Y')}",
help="Directory to download file to (default: timestamped folder)",
)
@click.option(
"--protocol",
default=None,
help="Protocol for presigned URL (e.g., s3) (default: auto-detect)",
)
@click.pass_context
def download_single(
ctx,
guid,
download_path,
protocol,
):
"""Download a single file by GUID."""
auth = ctx.obj["auth_factory"].get()
download_path = os.path.abspath(download_path)
os.makedirs(download_path, exist_ok=True)
try:
file_client = Gen3File(auth_provider=auth)
is_successful = file_client.download_single(
object_id=guid,
path=download_path,
protocol=protocol,
)
result = {
"status": "downloaded" if is_successful else "failed",
"reason": "Failed to download GUID." if not is_successful else "",
}
if result["status"] == "downloaded":
click.echo(f"✓ Downloaded to path: {download_path}")
else:
click.echo(f"✗ Failed: {result.get('error', 'See logs for error')}")
except Exception as e:
logging.error(f"Download failed: {e}")
raise click.ClickException(f"Download failed: {e}")And update the download_single function as well:
def download_single(self, object_id, path, protocol=None):
"""
Download a single file using its GUID.
Args:
object_id (str): The file's unique ID
path (str): Path to store the downloaded file at
Returns:
bool: True if download successful, False otherwise
"""
try:
url = self.get_presigned_url(object_id, protocol=protocol)
except Exception as e:
logging.critical(f"Unable to get a presigned URL for download: {e}")
return False
response = requests.get(url["url"], stream=True)
if response.status_code != 200:
logging.error(f"Response code: {response.status_code}")
if response.status_code >= 500:
for _ in range(MAX_RETRIES):
logging.info("Retrying now...")
# NOTE could be updated with exponential backoff
time.sleep(1)
response = requests.get(url["url"], stream=True)
if response.status_code == 200:
break
if response.status_code != 200:
logging.critical("Response status not 200, try again later")
return False
else:
return False
response.raise_for_status()
total_size_in_bytes = int(response.headers.get("content-length"))
total_downloaded = 0
index = Gen3Index(self._auth_provider)
record = index.get_record(object_id)
filename = record["file_name"]
out_path = os.path.join(path, filename)
Gen3File._ensure_dirpath_exists(Path(os.path.dirname(out_path)))
with open(out_path, "wb") as f:
for data in response.iter_content(4096):
total_downloaded += len(data)
f.write(data)
if total_size_in_bytes == total_downloaded:
logging.info(f"File {filename} downloaded successfully")
else:
logging.error(f"File {filename} not downloaded successfully")
return False
return TrueSigned-off-by: Dhiren-Mhatre <kp064669@gmail.com>
New Features
Dependency updates
Updated fastavro from 1.8.4 to 1.11.1
Updated pypfb to include extras: pypfb = {extras = ["gen3"], version = "^0.5.33"}
Updated importlib-metadata from 8.5.0 to 4.13.0