Conversation

@Dhiren-Mhatre

New Features

  • Enhanced asynchronous multiple-file download: introduces a high-performance async_download_multiple method that uses hybrid concurrency (multiprocessing + asyncio) with memory-efficient, queue-based task management
  • New CLI command: adds a gen3 download-multiple-async command with options for resuming downloads (--skip-completed), flexible filename formatting, and progress bars
  • Performance testing suite: adds a comprehensive benchmarking framework comparing the new implementation against the CDIS Data Client on download speed, memory usage, and concurrency efficiency

Dependency updates
Updated fastavro from 1.8.4 to 1.11.1
Updated pypfb to include extras: pypfb = {extras = ["gen3"], version = "^0.5.33"}
Downgraded importlib-metadata from 8.5.0 to 4.13.0

Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Contributor

@Avantol13 left a comment

I haven't had time to review other Python code, will get to that soon

Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
@Dhiren-Mhatre force-pushed the feat/multiple-download-performance-testing branch from 533d187 to 97c29eb on August 26, 2025 at 13:12
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
@click.argument("guid")
@click.option(
"--download-path",
default=".",
Contributor

let's make this default more dynamic and use a folder by default.

maybe let's do a timestamped folder name like:

from datetime import datetime

...

default=f"download_{datetime.now().strftime('%d_%b_%Y')}",

Contributor

do the same thing for multiple download async

)
@click.option(
"--max-concurrent-requests",
default=300,
Contributor

this is really, really high. Let's scale this default down to something like 20; people can bump it up manually as needed.
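
For context on why a lower ceiling is a one-flag change for users: with asyncio, the concurrency cap is typically just a semaphore around the request coroutine. A minimal sketch, with illustrative names rather than the SDK's actual internals, and a sleep standing in for the real HTTP request:

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 20  # conservative default; users can raise it explicitly

async def fetch(url, sem):
    # At most MAX_CONCURRENT_REQUESTS of these bodies run at once.
    async with sem:
        await asyncio.sleep(0.001)  # stand-in for the real HTTP request
        return url

async def download_all(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

results = asyncio.run(download_all([f"file-{i}" for i in range(50)]))
print(len(results))
```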

file_client = Gen3File(auth_provider=auth)

# Debug logging for input parameters
logging.debug(
Contributor

put this before the prompt above, and you need to use our logger. Put the logger setup at the top of the file; see other files for reference:

from cdislogging import get_logger

logging = get_logger(__name__)

type=int,
)
@click.option(
"--skip-completed",
Contributor

I'm not sure this is working as intended. I get a lot of warnings when I try this:

$ poetry run gen3 -vv download-multiple-async --manifest MIDRC_case_manifest.json --download-path ./downloads --skip-completed 
[2025-09-02 14:28:36,954][  DEBUG] Initializing auth..
Found 5178 files to download
Continue with async download? [y/N]: y
Downloading:   0%|                                                                                                                                                                                                                      | 0/5178 [00:00<?, ?it/s][2025-09-02 14:28:40,004][WARNING] File will be overwritten: downloads/10041569-u_H3HaB1lES6-HXVpiEfMA/2.16.840.1.114274.1818.48858790339993589885669552017090272896/2.16.840.1.114274.1818.56909342758958044433235758152671341713.zip
[2025-09-02 14:28:40,010][WARNING] File will be overwritten: downloads/10041569-FtpnR4GEPEe9JztuDUqrXg/2.16.840.1.114274.1818.544490635779373958610541144202466729913/2.16.840.1.114274.1818.55741368199706394947919032241968638388.zip
[2025-09-02 14:28:40,085][WARNING] File will be overwritten: downloads/10041569-CooqJZVIRkywB6_o9CJi0Q/2.16.840.1.114274.1818.567640554746210778914433462824885428113/2.16.840.1.114274.1818.523353596424499437510054502719882590104.zip

Actually... it seems like it is skipping after all, but we need to suppress all these warnings and/or not overwrite existing files; just skip downloading them again.
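
One way to keep --skip-completed quiet is to check name and size before ever opening the output file, mirroring gen3-client's documented semantics ("check for filename and size before download"). A hypothetical helper, not code from the PR:

```python
import os
import tempfile

def should_skip(out_path, expected_size):
    """True when a file with the same path and byte size already exists,
    i.e. the download is considered complete and should be skipped silently."""
    return os.path.isfile(out_path) and os.path.getsize(out_path) == expected_size

# tiny demo with a throwaway file
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"x" * 10)
tmp.close()
skip_match = should_skip(tmp.name, 10)         # same name and size: skip
skip_size_differs = should_skip(tmp.name, 99)  # size mismatch: re-download
os.unlink(tmp.name)
print(skip_match, skip_size_differs)
```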

)
@click.option(
"--skip-completed",
is_flag=True,
Contributor

I don't think you can have is_flag and a default of True, b/c then how would I set it to false? It just needs to be a boolean that defaults to true.
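
A sketch of what a boolean (non-flag) option could look like; click's BOOL type accepts true/false, 1/0, and yes/no on the command line. The option name mirrors the PR, but the command body and help text are invented here:

```python
import click
from click.testing import CliRunner

@click.command()
@click.option(
    "--skip-completed",
    type=bool,
    default=True,
    show_default=True,
    help="Skip files already present with matching name and size; pass 'false' to re-download.",
)
def download_multiple(skip_completed):
    click.echo(f"skip_completed={skip_completed}")

# quick check with click's test runner
runner = CliRunner()
default_out = runner.invoke(download_multiple, []).output
explicit_out = runner.invoke(download_multiple, ["--skip-completed", "false"]).output
print(default_out.strip(), explicit_out.strip())
```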

Contributor

you could reverse the logic here and just call it --redownload-completed, then keep is_flag=True with a default of false

Contributor

make sure to update docs if you change this

Contributor

don't change it, actually. We need to ensure we're matching previous commands exactly.

Here's the help from gen3-client:

Flags:
      --download-path string     The directory in which to store the downloaded files (default ".")
      --filename-format string   The format of filename to be used, including "original", "guid" and "combined" (default "original")
  -h, --help                     help for download-multiple
      --manifest string          The manifest file to read from. A valid manifest can be acquired by using the "Download Manifest" button in Data Explorer from a data common's portal
      --no-prompt                If set to true, will not display user prompt message for confirmation
      --numparallel int          Number of downloads to run in parallel (default 1)
      --protocol string          Specify the preferred protocol with --protocol=s3
      --rename                   Only useful when "--filename-format=original", will rename file by appending a counter value to its filename if set to true, otherwise the same filename will be used
      --skip-completed           If set to true, will check for filename and size before download and skip any files in "download-path" that matches both

Let's change the overall command name to match too: download-multiple instead of download-multiple-async.

Keep --skip-completed, but make it a boolean so I can pass in false.

I'm okay with --numparallel not existing and being different, since the parallelization scheme in Python is a bit different... but maybe we can accept it anyway and use its value as max-concurrent-requests when that isn't supplied. That way people can easily port their existing commands over and have them work right away.
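
The --numparallel fallback could be wired up along these lines. This is a hypothetical sketch; the precedence rule and names are my assumption, not settled API:

```python
import click
from click.testing import CliRunner

DEFAULT_CONCURRENCY = 20

@click.command()
@click.option("--max-concurrent-requests", type=int, default=None,
              help=f"Maximum simultaneous downloads (default: {DEFAULT_CONCURRENCY}).")
@click.option("--numparallel", type=int, default=None,
              help="gen3-client compatibility alias; used only when "
                   "--max-concurrent-requests is not supplied.")
def download_multiple(max_concurrent_requests, numparallel):
    # Precedence: explicit flag > gen3-client alias > conservative default.
    effective = max_concurrent_requests or numparallel or DEFAULT_CONCURRENCY
    click.echo(f"concurrency={effective}")

runner = CliRunner()
alias_only = runner.invoke(download_multiple, ["--numparallel", "8"]).output.strip()
both = runner.invoke(download_multiple,
                     ["--numparallel", "8", "--max-concurrent-requests", "64"]).output.strip()
neither = runner.invoke(download_multiple, []).output.strip()
print(alias_only, both, neither)
```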

- **High-bandwidth networks**: Increase the number of worker processes
- **Limited memory**: Reduce queue sizes to manage memory usage

### Memory Management
Contributor

let's remove this section, I feel it's redundant given the above config info

- **Batch Size**: Balance between memory usage and processing overhead
- **Process Count**: Match available CPU cores for optimal performance

### Network Optimization
Contributor

let's also remove this section


- Check network bandwidth and server limits
- Reduce concurrent request limits if server is overwhelmed
- Verify authentication token is valid
Contributor

auth token would not be a reason for slow downloads, let's remove this

Contributor

@Avantol13 left a comment

Make sure to add unit tests for this new functionality

main.add_command(drs_pull.drs_pull)
main.add_command(file.file)
main.add_command(download.download_single, name="download-single")
main.add_command(download.download_multiple_async, name="download-multiple-async")
Contributor

Suggested change:

- main.add_command(download.download_multiple_async, name="download-multiple-async")
+ main.add_command(download.download_multiple, name="download-multiple")

Contributor

Try to remember to retest after any changes to your code; this was broken b/c the function doesn't exist anymore

@Dhiren-Mhatre changed the title from "Feat : Multiple download functionality with performance testing" to "Feat : Multiple download functionality" on Sep 9, 2025
]


def test_download_single_success(gen3_file):
Contributor

please add a docstring for each test describing what you're testing

gen3 --endpoint data.commons.io --auth creds.json download-multiple \
--manifest large_dataset.json \
--download-path ./large_downloads \
--max-concurrent-requests 20 \
Contributor

let's change to:

Suggested change:

- --max-concurrent-requests 20 \
+ --max-concurrent-requests 64 \
+ --max-concurrent-requests 8

Dhiren-Mhatre and others added 5 commits October 2, 2025 18:18
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Contributor

@Avantol13 left a comment

You will also need to update this test. I got it working:

def test_download_single_basic_functionality(gen3_file):
    """
    Test download_single basic functionality with synchronous download.

    Verifies that download_single downloads a file successfully using
    synchronous requests and returns True.
    """
    gen3_file._auth_provider._refresh_token = {"api_key": "123"}

    with patch.object(gen3_file, 'get_presigned_url') as mock_presigned, \
         patch('gen3.file.requests.get') as mock_get, \
         patch('gen3.index.Gen3Index.get_record') as mock_index:

        mock_presigned.return_value = {"url": "https://fake-url.com/file"}
        mock_index.return_value = {"file_name": "test-file.txt"}
        mock_response = MagicMock()
        mock_response.status_code = 200
        mock_response.headers = {"content-length": "12"}
        mock_response.iter_content = lambda size: [b"test content"]
        mock_get.return_value = mock_response

        result = gen3_file.download_single(object_id="test-guid", path="/tmp", protocol="s3")

        assert result is True
        mock_presigned.assert_called_once_with("test-guid", protocol="s3")
        mock_index.assert_called_once_with("test-guid")

try:
file_client = Gen3File(auth_provider=auth)

result = file_client.download_single(
Contributor

this function does not take these params and it's failing. I believe I got it working.

@click.command()
@click.argument("guid")
@click.option(
    "--download-path",
    default=f"./download_{datetime.now().strftime('%d_%b_%Y')}",
    help="Directory to download file to (default: timestamped folder)",
)
@click.option(
    "--protocol",
    default=None,
    help="Protocol for presigned URL (e.g., s3) (default: auto-detect)",
)
@click.pass_context
def download_single(
    ctx,
    guid,
    download_path,
    protocol,
):
    """Download a single file by GUID."""
    auth = ctx.obj["auth_factory"].get()

    download_path = os.path.abspath(download_path)
    os.makedirs(download_path, exist_ok=True)

    try:
        file_client = Gen3File(auth_provider=auth)

        is_successful = file_client.download_single(
            object_id=guid,
            path=download_path,
            protocol=protocol,
        )

        result = {
            "status": "downloaded" if is_successful else "failed",
            "reason": "Failed to download GUID." if not is_successful else "",
        }

        if result["status"] == "downloaded":
            click.echo(f"✓ Downloaded to path: {download_path}")
        else:
            click.echo(f"✗ Failed: {result.get('reason', 'See logs for error')}")

    except Exception as e:
        logging.error(f"Download failed: {e}")
        raise click.ClickException(f"Download failed: {e}")

And update the download_single function as well:

    def download_single(self, object_id, path, protocol=None):
        """
        Download a single file using its GUID.

        Args:
            object_id (str): The file's unique ID
            path (str): Path to store the downloaded file at
            protocol (str, optional): Preferred protocol for the presigned URL (e.g. "s3")

        Returns:
            bool: True if download successful, False otherwise
        """
        try:
            url = self.get_presigned_url(object_id, protocol=protocol)
        except Exception as e:
            logging.critical(f"Unable to get a presigned URL for download: {e}")
            return False

        response = requests.get(url["url"], stream=True)
        if response.status_code != 200:
            logging.error(f"Response code: {response.status_code}")
            if response.status_code >= 500:
                for _ in range(MAX_RETRIES):
                    logging.info("Retrying now...")
                    # NOTE could be updated with exponential backoff
                    time.sleep(1)
                    response = requests.get(url["url"], stream=True)
                    if response.status_code == 200:
                        break
                if response.status_code != 200:
                    logging.critical("Response status not 200, try again later")
                    return False
            else:
                return False

        response.raise_for_status()

        total_size_in_bytes = int(response.headers.get("content-length"))
        total_downloaded = 0

        index = Gen3Index(self._auth_provider)
        record = index.get_record(object_id)

        filename = record["file_name"]

        out_path = os.path.join(path, filename)
        Gen3File._ensure_dirpath_exists(Path(os.path.dirname(out_path)))

        with open(out_path, "wb") as f:
            for data in response.iter_content(4096):
                total_downloaded += len(data)
                f.write(data)

        if total_size_in_bytes == total_downloaded:
            logging.info(f"File {filename} downloaded successfully")
        else:
            logging.error(f"File {filename} not downloaded successfully")
            return False

        return True
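
On the "NOTE could be updated with exponential backoff" comment in the retry loop above: a hedged sketch of what that could look like, factored into a hypothetical helper that is not part of the PR. `fetch` here is a stand-in for the presigned-URL request:

```python
import time

MAX_RETRIES = 3

def retry_with_backoff(fetch, base_delay=1.0):
    """Retry `fetch` (a zero-arg callable returning (status_code, body)) with
    exponentially increasing sleeps: base_delay, 2*base_delay, 4*base_delay, ...
    Only 5xx responses are retried; client errors fail immediately."""
    for attempt in range(MAX_RETRIES + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status < 500:
            break  # client errors are not retryable
        if attempt < MAX_RETRIES:
            time.sleep(base_delay * (2 ** attempt))
    return None

# demo: fail twice with 503, then succeed
calls = iter([(503, None), (503, None), (200, b"payload")])
result = retry_with_backoff(lambda: next(calls), base_delay=0.01)
print(result)
```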

Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>