Conversation

@Dhiren-Mhatre

New Features

  • Enhanced asynchronous multiple-file download: introduces a high-performance async_download_multiple method that uses hybrid concurrency (multiprocessing + asyncio) with memory-efficient, queue-based task management
  • New CLI command: adds a gen3 download-multiple-async command with options for resuming downloads (--skip-completed), flexible filename formatting, and progress bars
  • Performance testing suite: adds a comprehensive benchmarking framework comparing the new implementation against the CDIS Data Client on download speed, memory usage, and concurrency efficiency

Dependency updates
Updated fastavro from 1.8.4 to 1.11.1
Updated pypfb to include extras: pypfb = {extras = ["gen3"], version = "^0.5.33"}
Downgraded importlib-metadata from 8.5.0 to 4.13.0

Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Contributor

@Avantol13 left a comment

I haven't had time to review other Python code, will get to that soon

Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
@Dhiren-Mhatre force-pushed the feat/multiple-download-performance-testing branch from 533d187 to 97c29eb on August 26, 2025 at 13:12
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
@click.argument("guid")
@click.option(
"--download-path",
default=".",
Contributor

let's make this default more dynamic and use a folder by default.

maybe let's do a timestamped folder name like:

from datetime import datetime

...

default=f"download_{datetime.now().strftime('%d_%b_%Y')}",

Contributor

do the same thing for multiple download async

)
@click.option(
"--max-concurrent-requests",
default=300,
Contributor

this is really, really high. Let's scale this default down to something like 20; people can bump it up manually as needed.
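
For context on why a lower ceiling is a one-flag change for users: with asyncio, the concurrency cap is typically just a semaphore around the request coroutine. A minimal sketch, with illustrative names rather than the SDK's actual internals, and a sleep standing in for the real HTTP request:

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 20  # conservative default; users can raise it explicitly

async def fetch(url, sem):
    # At most MAX_CONCURRENT_REQUESTS of these bodies run at once.
    async with sem:
        await asyncio.sleep(0.001)  # stand-in for the real HTTP request
        return url

async def download_all(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

results = asyncio.run(download_all([f"file-{i}" for i in range(50)]))
print(len(results))
```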

file_client = Gen3File(auth_provider=auth)

# Debug logging for input parameters
logging.debug(
Contributor

put this before the prompt above, and you need to use our logger. Put the logger setup at the top of the file; see other files for reference:

from cdislogging import get_logger

logging = get_logger(__name__)

type=int,
)
@click.option(
"--skip-completed",
Contributor

I'm not sure this is working as intended. I get a lot of warnings when I try this:

$ poetry run gen3 -vv download-multiple-async --manifest MIDRC_case_manifest.json --download-path ./downloads --skip-completed 
[2025-09-02 14:28:36,954][  DEBUG] Initializing auth..
Found 5178 files to download
Continue with async download? [y/N]: y
Downloading:   0%|                                                                                                                                                                                                                      | 0/5178 [00:00<?, ?it/s][2025-09-02 14:28:40,004][WARNING] File will be overwritten: downloads/10041569-u_H3HaB1lES6-HXVpiEfMA/2.16.840.1.114274.1818.48858790339993589885669552017090272896/2.16.840.1.114274.1818.56909342758958044433235758152671341713.zip
[2025-09-02 14:28:40,010][WARNING] File will be overwritten: downloads/10041569-FtpnR4GEPEe9JztuDUqrXg/2.16.840.1.114274.1818.544490635779373958610541144202466729913/2.16.840.1.114274.1818.55741368199706394947919032241968638388.zip
[2025-09-02 14:28:40,085][WARNING] File will be overwritten: downloads/10041569-CooqJZVIRkywB6_o9CJi0Q/2.16.840.1.114274.1818.567640554746210778914433462824885428113/2.16.840.1.114274.1818.523353596424499437510054502719882590104.zip

Actually... it seems like it is skipping after all, but we need to suppress all these warnings and/or not overwrite existing files; just skip downloading them again.
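
One way to keep --skip-completed quiet is to check name and size before ever opening the output file, mirroring gen3-client's documented semantics ("check for filename and size before download"). A hypothetical helper, not code from the PR:

```python
import os
import tempfile

def should_skip(out_path, expected_size):
    """True when a file with the same path and byte size already exists,
    i.e. the download is considered complete and should be skipped silently."""
    return os.path.isfile(out_path) and os.path.getsize(out_path) == expected_size

# tiny demo with a throwaway file
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"x" * 10)
tmp.close()
skip_match = should_skip(tmp.name, 10)         # same name and size: skip
skip_size_differs = should_skip(tmp.name, 99)  # size mismatch: re-download
os.unlink(tmp.name)
print(skip_match, skip_size_differs)
```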

)
@click.option(
"--skip-completed",
is_flag=True,
Contributor

I don't think you can have is_flag and a default of True, b/c then how would I set it to false? It just needs to be a boolean that defaults to true.
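
A sketch of what a boolean (non-flag) option could look like; click's BOOL type accepts true/false, 1/0, and yes/no on the command line. The option name mirrors the PR, but the command body and help text are invented here:

```python
import click
from click.testing import CliRunner

@click.command()
@click.option(
    "--skip-completed",
    type=bool,
    default=True,
    show_default=True,
    help="Skip files already present with matching name and size; pass 'false' to re-download.",
)
def download_multiple(skip_completed):
    click.echo(f"skip_completed={skip_completed}")

# quick check with click's test runner
runner = CliRunner()
default_out = runner.invoke(download_multiple, []).output
explicit_out = runner.invoke(download_multiple, ["--skip-completed", "false"]).output
print(default_out.strip(), explicit_out.strip())
```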

Contributor

you could reverse the logic here and just call it --redownload-completed, then keep is_flag=True with a default of false

Contributor

make sure to update docs if you change this

Contributor

don't change it, actually. We need to ensure we're matching previous commands exactly.

Here's the help from gen3-client:

Flags:
      --download-path string     The directory in which to store the downloaded files (default ".")
      --filename-format string   The format of filename to be used, including "original", "guid" and "combined" (default "original")
  -h, --help                     help for download-multiple
      --manifest string          The manifest file to read from. A valid manifest can be acquired by using the "Download Manifest" button in Data Explorer from a data common's portal
      --no-prompt                If set to true, will not display user prompt message for confirmation
      --numparallel int          Number of downloads to run in parallel (default 1)
      --protocol string          Specify the preferred protocol with --protocol=s3
      --rename                   Only useful when "--filename-format=original", will rename file by appending a counter value to its filename if set to true, otherwise the same filename will be used
      --skip-completed           If set to true, will check for filename and size before download and skip any files in "download-path" that matches both

Let's change the overall command name to match too: download-multiple instead of download-multiple-async.

Keep --skip-completed, but make it a boolean so I can pass in false.

I'm okay with --numparallel not existing and being different, since the parallelization scheme in Python is a bit different... but maybe we can accept it anyway and use its value as max-concurrent-requests when that isn't supplied. That way people can easily port their existing commands over and have them work right away.
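
The --numparallel fallback could be wired up along these lines. This is a hypothetical sketch; the precedence rule and names are my assumption, not settled API:

```python
import click
from click.testing import CliRunner

DEFAULT_CONCURRENCY = 20

@click.command()
@click.option("--max-concurrent-requests", type=int, default=None,
              help=f"Maximum simultaneous downloads (default: {DEFAULT_CONCURRENCY}).")
@click.option("--numparallel", type=int, default=None,
              help="gen3-client compatibility alias; used only when "
                   "--max-concurrent-requests is not supplied.")
def download_multiple(max_concurrent_requests, numparallel):
    # Precedence: explicit flag > gen3-client alias > conservative default.
    effective = max_concurrent_requests or numparallel or DEFAULT_CONCURRENCY
    click.echo(f"concurrency={effective}")

runner = CliRunner()
alias_only = runner.invoke(download_multiple, ["--numparallel", "8"]).output.strip()
both = runner.invoke(download_multiple,
                     ["--numparallel", "8", "--max-concurrent-requests", "64"]).output.strip()
neither = runner.invoke(download_multiple, []).output.strip()
print(alias_only, both, neither)
```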

- **High-bandwidth networks**: Increase the number of worker processes
- **Limited memory**: Reduce queue sizes to manage memory usage

### Memory Management
Contributor

let's remove this section, I feel it's redundant given the above config info

- **Batch Size**: Balance between memory usage and processing overhead
- **Process Count**: Match available CPU cores for optimal performance

### Network Optimization
Contributor

let's also remove this section


- Check network bandwidth and server limits
- Reduce concurrent request limits if server is overwhelmed
- Verify authentication token is valid
Contributor

auth token would not be a reason for slow downloads, let's remove this

Contributor

@Avantol13 left a comment

Make sure to add unit tests for this new functionality

main.add_command(drs_pull.drs_pull)
main.add_command(file.file)
main.add_command(download.download_single, name="download-single")
main.add_command(download.download_multiple_async, name="download-multiple-async")
Contributor

Suggested change:

- main.add_command(download.download_multiple_async, name="download-multiple-async")
+ main.add_command(download.download_multiple, name="download-multiple")

Contributor

Try to remember to retest after any changes to your code; this was broken b/c the function doesn't exist anymore

@Dhiren-Mhatre changed the title from "Feat : Multiple download functionality with performance testing" to "Feat : Multiple download functionality" on Sep 9, 2025
]


def test_download_single_success(gen3_file):
Contributor

please add a docstring for each test describing what you're testing

gen3 --endpoint data.commons.io --auth creds.json download-multiple \
--manifest large_dataset.json \
--download-path ./large_downloads \
--max-concurrent-requests 20 \
Contributor

let's change to:

Suggested change:

- --max-concurrent-requests 20 \
+ --max-concurrent-requests 64 \
+ --max-concurrent-requests 8

Dhiren-Mhatre and others added 5 commits October 2, 2025 18:18
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Contributor

@Avantol13 left a comment

You will also need to update this test. I got it working:

def test_download_single_basic_functionality(gen3_file):
    """
    Test download_single basic functionality with synchronous download.

    Verifies that download_single downloads a file successfully using
    synchronous requests and returns True.
    """
    gen3_file._auth_provider._refresh_token = {"api_key": "123"}

    with patch.object(gen3_file, 'get_presigned_url') as mock_presigned, \
         patch('gen3.file.requests.get') as mock_get, \
         patch('gen3.index.Gen3Index.get_record') as mock_index:

        mock_presigned.return_value = {"url": "https://fake-url.com/file"}
        mock_index.return_value = {"file_name": "test-file.txt"}
        mock_response = MagicMock()
        mock_response.status_code = 200
        mock_response.headers = {"content-length": "12"}
        mock_response.iter_content = lambda size: [b"test content"]
        mock_get.return_value = mock_response

        result = gen3_file.download_single(object_id="test-guid", path="/tmp", protocol="s3")

        assert result is True
        mock_presigned.assert_called_once_with("test-guid", protocol="s3")
        mock_index.assert_called_once_with("test-guid")

try:
file_client = Gen3File(auth_provider=auth)

result = file_client.download_single(
Contributor

this function does not take these params and it's failing. I believe I got it working.

@click.command()
@click.argument("guid")
@click.option(
    "--download-path",
    default=f"./download_{datetime.now().strftime('%d_%b_%Y')}",
    help="Directory to download file to (default: timestamped folder)",
)
@click.option(
    "--protocol",
    default=None,
    help="Protocol for presigned URL (e.g., s3) (default: auto-detect)",
)
@click.pass_context
def download_single(
    ctx,
    guid,
    download_path,
    protocol,
):
    """Download a single file by GUID."""
    auth = ctx.obj["auth_factory"].get()

    download_path = os.path.abspath(download_path)
    os.makedirs(download_path, exist_ok=True)

    try:
        file_client = Gen3File(auth_provider=auth)

        is_successful = file_client.download_single(
            object_id=guid,
            path=download_path,
            protocol=protocol,
        )

        result = {
            "status": "downloaded" if is_successful else "failed",
            "reason": "Failed to download GUID." if not is_successful else "",
        }

        if result["status"] == "downloaded":
            click.echo(f"✓ Downloaded to path: {download_path}")
        else:
            click.echo(f"✗ Failed: {result.get('reason', 'See logs for error')}")

    except Exception as e:
        logging.error(f"Download failed: {e}")
        raise click.ClickException(f"Download failed: {e}")

And update the download_single function as well:

    def download_single(self, object_id, path, protocol=None):
        """
        Download a single file using its GUID.

        Args:
            object_id (str): The file's unique ID
            path (str): Path to store the downloaded file at
            protocol (str, optional): Preferred protocol for the presigned URL (e.g. "s3")

        Returns:
            bool: True if download successful, False otherwise
        """
        try:
            url = self.get_presigned_url(object_id, protocol=protocol)
        except Exception as e:
            logging.critical(f"Unable to get a presigned URL for download: {e}")
            return False

        response = requests.get(url["url"], stream=True)
        if response.status_code != 200:
            logging.error(f"Response code: {response.status_code}")
            if response.status_code >= 500:
                for _ in range(MAX_RETRIES):
                    logging.info("Retrying now...")
                    # NOTE could be updated with exponential backoff
                    time.sleep(1)
                    response = requests.get(url["url"], stream=True)
                    if response.status_code == 200:
                        break
                if response.status_code != 200:
                    logging.critical("Response status not 200, try again later")
                    return False
            else:
                return False

        response.raise_for_status()

        total_size_in_bytes = int(response.headers.get("content-length"))
        total_downloaded = 0

        index = Gen3Index(self._auth_provider)
        record = index.get_record(object_id)

        filename = record["file_name"]

        out_path = os.path.join(path, filename)
        Gen3File._ensure_dirpath_exists(Path(os.path.dirname(out_path)))

        with open(out_path, "wb") as f:
            for data in response.iter_content(4096):
                total_downloaded += len(data)
                f.write(data)

        if total_size_in_bytes == total_downloaded:
            logging.info(f"File {filename} downloaded successfully")
        else:
            logging.error(f"File {filename} not downloaded successfully")
            return False

        return True
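
On the "NOTE could be updated with exponential backoff" comment in the retry loop above: a hedged sketch of what that could look like, factored into a hypothetical helper that is not part of the PR. `fetch` here is a stand-in for the presigned-URL request:

```python
import time

MAX_RETRIES = 3

def retry_with_backoff(fetch, base_delay=1.0):
    """Retry `fetch` (a zero-arg callable returning (status_code, body)) with
    exponentially increasing sleeps: base_delay, 2*base_delay, 4*base_delay, ...
    Only 5xx responses are retried; client errors fail immediately."""
    for attempt in range(MAX_RETRIES + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status < 500:
            break  # client errors are not retryable
        if attempt < MAX_RETRIES:
            time.sleep(base_delay * (2 ** attempt))
    return None

# demo: fail twice with 503, then succeed
calls = iter([(503, None), (503, None), (200, b"payload")])
result = retry_with_backoff(lambda: next(calls), base_delay=0.01)
print(result)
```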

Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>