Add AioS3FileSystem using asyncio for parallel S3 operations #675

@laughingman7743


Summary

The current S3FileSystem uses ThreadPoolExecutor for parallel S3 operations (range reads, batch deletes, multipart uploads). Add an AioS3FileSystem variant that uses asyncio.gather + asyncio.to_thread instead, providing better integration with the asyncio event loop for aio cursors.

Motivation

When aio cursors use the current S3FileSystem, operations are double-wrapped:

  1. The cursor wraps the result set creation in asyncio.to_thread()
  2. Inside that thread, S3FileSystem spawns more threads via ThreadPoolExecutor

An AioS3FileSystem would allow aio cursors to use async S3 operations directly, eliminating the thread-in-thread pattern and integrating naturally with the event loop.
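The difference between the two shapes can be sketched with the standard library alone. This is only an illustration: `blocking_get`, `nested_fetch`, `double_wrapped` and `flat` are hypothetical names, and `blocking_get` stands in for a synchronous boto3 call by returning the name of the thread it ran on.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


def blocking_get(part: int) -> str:
    # Stand-in for a synchronous boto3 call; reports which thread ran it.
    return threading.current_thread().name


def nested_fetch(parts: int) -> list[str]:
    # Current shape: this already runs inside a to_thread worker,
    # and then spawns a second pool of threads.
    with ThreadPoolExecutor(max_workers=4, thread_name_prefix="inner") as pool:
        return list(pool.map(blocking_get, range(parts)))


async def double_wrapped(parts: int) -> list[str]:
    # Thread-in-thread: one outer worker thread drives an inner pool.
    return await asyncio.to_thread(nested_fetch, parts)


async def flat(parts: int) -> list[str]:
    # Proposed shape: the event loop fans out directly,
    # one to_thread call per blocking operation.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_get, p) for p in range(parts))
    )
```

In the nested version the calls run on the `inner` pool while an outer worker thread sits blocked waiting for them; in the flat version every call runs on the event loop's default executor.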

Design

  • Keep S3FileSystem as-is — no breaking changes to the existing synchronous implementation
  • Add AioS3FileSystem — uses asyncio.gather + asyncio.to_thread for individual boto3 calls instead of ThreadPoolExecutor
  • Follow fsspec's async implementation pattern (AsyncFileSystem with async_impl = True): implement async methods (_cat_file, _ls, etc.) so fsspec provides sync wrappers automatically
  • No new dependencies — uses asyncio.to_thread to wrap synchronous boto3 calls (no aiobotocore needed)
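A minimal sketch of how these design points might fit together, assuming a stub client in place of boto3. `FakeS3Client`, `SketchAioFileSystem` and the public `cat` wrapper are hypothetical; the wrapper is written by hand here only to show the idea, whereas the proposal relies on fsspec to generate the sync wrappers.

```python
import asyncio


class FakeS3Client:
    """Hypothetical stand-in for a boto3 client; only a simplified get_object."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self._objects = objects

    def get_object(self, key: str) -> bytes:
        # A real boto3 call would block on network I/O here.
        return self._objects[key]


class SketchAioFileSystem:
    """Illustrative only; not the proposed AioS3FileSystem."""

    def __init__(self, client: FakeS3Client) -> None:
        self._client = client

    async def _cat_file(self, key: str) -> bytes:
        # Run one blocking client call in a worker thread.
        return await asyncio.to_thread(self._client.get_object, key)

    async def _cat(self, keys: list[str]) -> dict[str, bytes]:
        # Fan out one _cat_file per key and wait for all of them.
        bodies = await asyncio.gather(*(self._cat_file(k) for k in keys))
        return dict(zip(keys, bodies))

    def cat(self, keys: list[str]) -> dict[str, bytes]:
        # Hand-written sync wrapper for illustration only;
        # asyncio.run must not be called from inside a running event loop.
        return asyncio.run(self._cat(keys))
```

Because the blocking calls are pushed through `asyncio.to_thread`, the sketch needs nothing beyond the standard library, which matches the "no new dependencies" goal.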

ThreadPoolExecutor usage to replace

| Location | Current usage | Async replacement |
| --- | --- | --- |
| S3File._fetch_range() | Parallel range GETs | asyncio.gather(*[asyncio.to_thread(...)]) |
| S3FileSystem._delete_objects() | Batch deletes | asyncio.gather(*[asyncio.to_thread(...)]) |
| S3File._upload_chunk() | Multipart uploads | asyncio.gather(*[asyncio.to_thread(...)]) |
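As a rough illustration of the first row, a parallel range read could be shaped like the sketch below. `split_range`, `get_range` and `fetch_range` are hypothetical helpers, and `get_range` slices an in-memory `bytes` object where a real implementation would issue a blocking ranged GET.

```python
import asyncio


def split_range(start: int, end: int, block_size: int) -> list[tuple[int, int]]:
    """Split the half-open byte range [start, end) into blocks of at most block_size."""
    return [(s, min(s + block_size, end)) for s in range(start, end, block_size)]


def get_range(data: bytes, start: int, end: int) -> bytes:
    # Stand-in for a blocking ranged GET. Note that an HTTP Range header
    # is inclusive, so a real request would ask for bytes=start-(end - 1).
    return data[start:end]


async def fetch_range(data: bytes, start: int, end: int, block_size: int) -> bytes:
    blocks = split_range(start, end, block_size)
    # One worker-thread call per block, all awaited together.
    parts = await asyncio.gather(
        *(asyncio.to_thread(get_range, data, s, e) for s, e in blocks)
    )
    # gather returns results in argument order, so the parts join directly.
    return b"".join(parts)
```

The same gather-over-to_thread shape would apply to the other two rows, with batches of keys or upload parts in place of byte ranges.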

User-facing API

Users can choose which filesystem to use. Aio cursors could default to AioS3FileSystem when available:

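import asyncio

from pyathena.filesystem.s3 import AioS3FileSystem  # proposed class, not yet implemented


async def main(connection):
    # Direct usage (proposed API); `connection` is an existing pyathena connection
    fs = AioS3FileSystem(connection=connection)
    async with fs.open("s3://bucket/key", "rb") as f:
        data = await f.read()
    return data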

Future consideration

If benchmarks show the aio version performs better, it could become the default implementation for aio cursors or even replace the ThreadPoolExecutor-based implementation entirely.
