SPARC FUSE is a file unification and standardization engine for SPARC datasets.
Whether you're on the command line, in a Python notebook, or on the SPARC Portal itself, FUSE converts the 40+ eclectic imaging & time-series formats scattered across SPARC datasets into one clean, cloud-native Zarr layout (or .mat / .npz, if you prefer).
✅ Out of the box, SPARC FUSE supports over 82% of all imaging and time-series file types found in public SPARC datasets.
- CLI / Python API – one call turns raw files into analysis-ready arrays that slot straight into Xarray + Dask, MATLAB, PyTorch, etc.
- Browser extension – adds a “Convert & Download” button to every dataset page so you can grab ready-to-analyze Zarr bundles without leaving your browser.
- Cloud-first – outputs stream directly from S3 for zero-copy workflows that scale from your laptop to HPC or Lambda.
Spend your time on science, not on hunting converters and deciphering proprietary internal file structures. 🦾
- Quick start
- Why SPARC FUSE?
- Zarr + AWS: super-charging SPARC data
- Cloud-first Demo: Prepare & Consume Data from S3
- Supported File Formats
Every SPARC FUSE export – whether .zarr, .npz, or .mat – automatically bundles all SPARC metadata available for the source file or dataset.
In addition, we append detailed conversion-specific metadata, including:
meta = {
"time_units": "seconds",
"time_auto_generated": True,
"source_format": "acq",
"database_id": "1337",
"sampling_frequency": 1000.0,
"channel_names": ["CH1", "CH2", "CH3"],
"channel_units": ["mV", "mV", "mV"],
"version": "v1.0",
"upload_date": "2025-07-15T12:34:56Z",
"conversion_date": "2025-08-03T10:01:22Z",
"auto_mapped": True,
"doi": "10.12345/sparc.1337",
"original_file_name": "20_1021.acq",
"sparc_subject_id": "sub-ChAT-Male-Subject-1",
"anatomical_location": "vagus nerve",
"sweep_mode": False,
"notes": "Mapped using SPARC FUSE"
}
💡 This metadata is embedded directly into .zarr attributes or stored as a structure / dictionary in .mat / .npz files. Every export is fully self-describing and ready for downstream use or publication.
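For example, you can read these attributes straight back from a Zarr export (a minimal sketch; the store name follows the `20_1021_std.zarr` pattern used in the demos below and will differ for your own files):

```python
import zarr

# Open a FUSE export read-only and inspect the embedded metadata.
store = zarr.open("./converted/20_1021_std.zarr", mode="r")
meta = dict(store.attrs)
print(meta["source_format"], meta["sampling_frequency"], meta["channel_names"])
```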
SPARC FUSE uses a hybrid mapping system to extract standardized signals, time vectors, metadata, and annotations from a wide range of raw data formats:
- 🧠 Handcrafted descriptors: Expert-written mappings for known formats like .smr, .abf, .adicht, etc.
- 🤖 Heuristic auto-mapping: If no direct match is found, FUSE evaluates all known descriptors using a scoring system and applies the best match.
- 📊 Mapping score: Each descriptor is scored based on required fields (like signals, time, sampling_frequency). The descriptor with the highest score is selected.
result = evaluate_mapping_fields(descriptor, context)
score = score_mapping_result(result, descriptor)
💡 The selected descriptor is recorded in the output metadata, along with auto_mapped = true if it was selected heuristically. This gives you full transparency into how the conversion was performed.
If time is missing in the original file, FUSE will auto-generate a time vector from signal length and set:
"time_auto_generated": true
This mapping system ensures that even unknown or partially supported formats can be converted with best-effort accuracy – and that you always know how the output was produced.
# Clone and install
git clone https://github.com/SPARC-FAIR-Codeathon/2025-team-B.git
# (optional) create and activate a conda environment
conda env create -f 2025-team-B/conda_env_py310.yaml
conda activate py310
cd 2025-team-B/sparcfuse
pip install -e .
cd ..
# Bulk convert primary files of an entire dataset
sparc-fuse 224 --output-dir ./converted
*(Demo video: sparc-fuse-terminal-bulk-download_x4.mov)*
# Convert a file
sparc-fuse 224 primary/sub-ChAT-Male-Subject-1/20_1021.acq --output-dir ./converted
*(Demo video: sparc-fuse-terminal-single-file-download_x2.mov)*
# View options
sparc-fuse --help
💡 Note: Run the install_sparc_fuse.ipynb notebook to install all dependencies before working with sparc-fuse, and make sure you have your own SciCrunch API key and AWS credentials ready.
git clone https://github.com/SPARC-FAIR-Codeathon/2025-team-B.git
cd 2025-team-B
from sparc_fuse_core import download_and_convert_sparc_data, list_primary_files
DATASET_ID = 224 # Any valid SPARC dataset ID
files, _ = list_primary_files(DATASET_ID)
print("primary files:", [f["path"] for f in files])
download_and_convert_sparc_data(
DATASET_ID,
primary_paths=files[0]["path"].replace("files/", ""),
output_dir="./output_single",
file_format="zarr"
)
*(Demo video: sparc-fuse-notebook-single-file-download_x1_5.mov)*
from sparc_fuse_core import download_and_convert_sparc_data, list_primary_files
DATASET_ID = 224 # Any valid SPARC dataset ID
bulk_report = download_and_convert_sparc_data(
DATASET_ID,
output_dir="./output_bulk",
file_format="zarr"
)
from pprint import pprint
pprint(bulk_report)
*(Demo video: sparc-fuse-notebook-bulk-download_x4.mov)*
from sparc_fuse_core import download_and_convert_sparc_data, list_primary_files
DATASET_ID = 224 # Any valid SPARC dataset ID
# Grab (for example) the first three primary files
files, _ = list_primary_files(DATASET_ID)
subset_paths = [f["path"].replace("files/", "") for f in files[:3]]
report = download_and_convert_sparc_data(
DATASET_ID,
primary_paths=subset_paths, # any iterable works
output_dir="./output_subset",
file_format="npz",
overwrite=True # regenerate if outputs already exist
)
from pprint import pprint
pprint(report)
*(Demo video: sparc-fuse-notebook-multifile-download_x4.mov)*
💡 Tip: file_format accepts "zarr", "zarr.zip", "npz", or "mat". Choose the one that best matches your downstream workflow.
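The non-Zarr exports load with the usual tools, as in this hedged sketch (the output file name mirrors the `20_1021_std` pattern from the demos and will differ for your dataset; depending on the MATLAB file version, `.mat` files may need `h5py` instead of `scipy.io.loadmat`):

```python
import numpy as np
from scipy.io import loadmat

# .npz: arrays plus the metadata dictionary (stored as a pickled object).
npz = np.load("./output_subset/20_1021_std.npz", allow_pickle=True)
print(npz.files)

# .mat: the same content as a MATLAB-compatible structure.
mat = loadmat("./output_subset/20_1021_std.mat")
print(mat.keys())
```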
📖 Note: The docstring-generated API documentation for SPARC FUSE is available in our GitHub Wiki.
To use the Firefox plugin, you must first start the local server: the plugin relies on this backend for data conversion and download, so it must be running for the extension to function.
In your terminal, navigate to the server directory and start the server:
cd 2025-team-B/server
python server.py
The server will run locally at http://127.0.0.1:5000.
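Before loading the extension, you can sanity-check that the backend is listening (a minimal sketch; the root route is not a documented endpoint, so even an HTTP error response proves the server is up):

```python
import urllib.error
import urllib.request

# Poke the local FUSE backend; any HTTP response means it is listening.
try:
    urllib.request.urlopen("http://127.0.0.1:5000/", timeout=2)
    print("Server is up.")
except urllib.error.HTTPError:
    print("Server is up (the route returned an HTTP error, which is fine).")
except OSError:
    print("Server is not reachable - did you run python server.py?")
```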
Open your Firefox browser and navigate to about:debugging#/runtime/this-firefox
Click Load Temporary Add-on and select the manifest.json file from the plugin directory.
The extension will appear in your browser extensions area.
Once the Firefox plugin is installed and the server is running, you can use it to download datasets directly from the SPARC website.
The plugin integrates into the SPARC website interface and provides two types of download options:
- Download the Full Dataset
Use the Download & Convert Dataset button to retrieve the entire dataset. This button is located near the top of the dataset page.
*(Demo video: ff_download_dataset.mov)*
- Download Individual Files
For selective downloading, each file listed in the dataset has its own Download & Convert icon. Clicking this button lets you fetch only the specific file you need.
*(Demo video: ff_download_single_file.mov)*
The plugin processes and converts the files through your local server before saving them to your machine. Make sure the server is active while downloading.
- SPARC hosts 40+ heterogeneous file formats and countless sub-variants (custom internal structures) – each with its own quirks.
- Researchers lose precious hours hunting converters and writing glue code instead of analysing data.
- This format jungle breaks reproducibility and puts FAIR principles at risk.
See Figures 1 and 2 for the distribution of file extensions in SPARC datasets.
Figure 1. Relative frequency of every time-series and imaging file extension found in public SPARC datasets (log-scaled word cloud).
Figure 2. The SPARC database contains 20+ distinct time-series formats and 20+ imaging formats, each hiding additional proprietary structures inside the files.
- SPARC FUSE automatically remaps any supported file (time-series & imaging) into a uniform, chunked Zarr store – optionally also .mat or .npz for legacy tools.
✅ Currently, out of the box, SPARC FUSE supports over 82% of all imaging and time-series file types found in public SPARC datasets. (See Figure 3 for an overview of selected supported formats and unified export targets.)
Figure 3. Overview of selected file formats supported by SPARC FUSE for conversion into a unified, cloud-native Zarr representation. Shown here is a non-exhaustive subset of imaging formats (e.g., .czi, .nd2, .tiff) and time-series/electrophysiology formats (e.g., .mat, .csv, .smr, .rhd, .acq). Optional exports to .mat and .npz are also supported for downstream analysis.
- Works three ways:
- Python API – bulk-convert or cherry-pick files in a single call.
- CLI – one-liner on the command line.
- Browser button – “Convert & Download” directly from the SPARC portal.
- Keeps full provenance: every conversion is logged, making pipelines fully reproducible.
- ⏱ Hours → seconds: spend time on science, not format wrangling.
- 🔗 Interoperability out of the box: the unified layout means the same loader works for every dataset.
- ☁️ Cloud-ready chunks: Zarr's design unlocks scalable, parallel analysis on HPC or S3-style storage.
- 🚀 FAIR boost: data become immediately Accessible, Interoperable, and Reusable across toolchains.
SPARC FUSE: One data format to unite them all

TL;DR – Zarr is a cloud-native chunked-array format that lets you stream only the bytes you need.
SPARC datasets are now mirrored on Amazon S3 via the AWS Registry of Open Data, so Zarr fits like a glove.
| Why Zarr? | Why now? |
|---|---|
| “Zarr is like Parquet for arrays.” It stores N-D data in tiny, independent chunks – perfect for parallel reads/writes and lazy loading. | SPARC just announced that all public datasets are directly accessible on AWS S3 (Requester Pays) and even have a listing on the AWS Open Data Registry. |
| Plays nicely with xarray, Dask, PyTorch, TensorFlow, MATLAB (via zarr-matlab), and more. | With data already in S3, a converted Zarr store can be queried in place from an EC2, Lambda, or SageMaker job – no re-download cycles. |
| Open spec, community-driven, language-agnostic. | SPARC FUSE's one-line `sparc-fuse <id> … --file-format zarr` command gives you an analysis-ready, cloud-optimised dataset in seconds. |

Figure 4. Zarr overview. Diagram adapted from the Earthmover blog post “What is Zarr?”.
- Chunked storage – data are broken into independently readable/writable tiles (see the sketch after this list).
- Cloud-optimised layout – each chunk is just an object in S3 / GCS, so you stream only the bytes you need.
- Parallel-ready – Dask, Ray, Spark, etc. slurp different chunks concurrently for massive speed-ups.
- Open spec – language-agnostic, community-governed, and already adopted by NASA, OME-Zarr, Pangeo, and more.
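These properties are easy to see locally with the zarr-python API (a minimal sketch; the array shape and chunk size are arbitrary choices for illustration):

```python
import numpy as np
import zarr

# One hour of a 1 kHz signal, stored as independently readable 100 s tiles.
z = zarr.open("chunk_demo.zarr", mode="w", shape=(3_600_000,),
              chunks=(100_000,), dtype="f4")
z[:] = np.random.randn(3_600_000).astype("f4")

# Slicing reads only the chunks that overlap the requested window; on S3
# that means streaming just those objects, never the whole array.
segment = z[500_000:650_000]
print(z.chunks, segment.shape)
```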
This demo illustrates a simple, end-to-end workflow for cloud-based data handling:
Convert SPARC primary files into Zarr format, upload them to S3, consolidate metadata, and open them directly in Xarray without downloading the entire dataset. You stream only the slices you need, making your analysis quicker and easier.
- Convert a SPARC primary file to Zarr.
- Upload the converted data to an S3 bucket.
- Consolidate metadata for fast remote access.
- Wrap into an Xarray-compatible Zarr store, ready to use with xr.open_zarr(...).
- Lazily open and stream data slices directly from S3.
from sparc_fuse_core import (
list_primary_files, download_and_convert_sparc_data,
upload_to_s3, consolidate_s3_metadata,
create_xarray_zarr_from_raw, generate_and_upload_manifest
)
# Parameters
DATASET_ID = 224
BUCKET = "sparc-fuse-demo-ab-2025"
REGION = "eu-north-1"
RAW_ZARR = "20_1021_std.zarr"
XARRAY_ZARR = "20_1021_std_xarray.zarr"
# Convert SPARC file to Zarr locally
files, _ = list_primary_files(DATASET_ID)
primary_path = files[0]["path"].replace("files/", "")
download_and_convert_sparc_data(
DATASET_ID,
primary_paths=primary_path,
output_dir="./output_single",
file_format="zarr"
)
# Upload raw Zarr to S3
upload_to_s3(f"./output_single/{RAW_ZARR}", BUCKET, RAW_ZARR, REGION)
# Consolidate metadata
consolidate_s3_metadata(BUCKET, RAW_ZARR, REGION)
# Create Xarray-compatible Zarr and upload to S3
create_xarray_zarr_from_raw(BUCKET, RAW_ZARR, XARRAY_ZARR, REGION)
# Generate discovery manifest and upload
generate_and_upload_manifest(DATASET_ID, BUCKET, XARRAY_ZARR, REGION)
print("β
Preparation complete.")
*(Demo video: s3-bucket-overview_x4.mov)*
from sparc_fuse_core import open_zarr_from_s3
import time
import matplotlib.pyplot as plt
# Open dataset lazily from S3
ds = open_zarr_from_s3(bucket="sparc-fuse-demo-ab-2025", zarr_path="20_1021_std_xarray.zarr")
print(ds) # Immediately available metadata, lazy data loading
# Example: load a subset of channel 1 for the first 100,000 time points
start = time.perf_counter()
subset_ch1 = ds["signals"].sel(channel=1).isel(time=slice(0, 100000)).load()
elapsed = time.perf_counter() - start
# plot
plt.figure(figsize=(6, 1.5))
plt.plot(subset_ch1.time.values, subset_ch1.values)
plt.xlabel("Time")
plt.ylabel("CH1")
plt.tight_layout()
plt.show()
print(f"Subset load time: {elapsed:.3f} s")load-data-from-s3_x1.mov
⏳ Took only 0.3 s to load the 100,000 time points from S3.
import json
z = open_zarr_from_s3(BUCKET, XARRAY_ZARR, region="eu-north-1")
metadata = dict(z.attrs)
print(json.dumps(metadata['sparc_metadata'], indent=2))
*(Demo video: load-sparc-metadata-from-s3-xarr_x1.mov)*
The S3 slice is roughly 33× faster than a fresh SPARC download-and-slice of the same data (Figure 5).
Figure 5. Latency comparison of SPARC download & slice (~9.8 s) vs. S3 slice (~0.3 s), showing a ~33× speedup.
- 🟢 Fully supported and tested
- 🟡 Expected to work via auto-mapping or heuristic parsing
- 🔴 Not yet supported
| Extension(s) | Description | Support Status |
|---|---|---|
| `.mat` | MathWorks MATLAB file | 🟢 |
| `.smr` | CED Spike2 binary recording | 🟢 |
| `.csv` | Comma-separated values text (generic) | 🟢 |
| `.adicht` | ADInstruments LabChart binary trace | 🟢 |
| `.hdf5` | Hierarchical Data Format v5 container | 🟢 |
| `.h5` | Same as `.hdf5` | 🟢 |
| `.abf` | Molecular Devices Axon Binary File (pClamp) | 🟢 |
| `.rhd` | Intan RHD2000 amplifier data | 🟢 |
| `.nev` | Blackrock NeuroPort event file | 🔴 |
| `.ns5` | Blackrock continuous 30 kHz signal | 🟢 |
| `.ns2` | Blackrock 1 kHz LFP signal | 🔴 |
| `.ns1` | Blackrock low-rate summary signal | 🔴 |
| `.smrx` | CED Spike2 v9+ extended recording | 🟢 |
| `.wav` | Waveform audio (PCM) | 🟢 |
| `.acq` | AxoScope raw acquisition | 🟢 |
| `.tdx`, `.tev`, `.tnt`, `.tsq` | TDT Synapse time-series (multi-file) | 🔴 |
| `.eeg`, `.vmrk`, `.vhdr` | BrainVision EEG dataset (multi-file) | 🔴 |
| `.sev` | TDT RS4 single-channel stream | 🔴 |
| Extension(s) | Description | Support Status |
|---|---|---|
| `.tif` | Tagged Image File Format (high-bit-depth microscopy) | 🟢 |
| `.tiff` | Same as `.tif` | 🟢 |
| `.czi` | Carl Zeiss ZEN container | 🟢 |
| `.nd2` | Nikon NIS-Elements microscope image | 🟢 |
| `.lsm` | Zeiss laser-scanning-microscope stack | 🔴 |
| `.jpx` | JPEG-2000 (JPX) image | 🟡 |
| `.svs` | Aperio/Leica whole-slide image | 🔴 |
| `.ims` | Bitplane Imaris 3-D/4-D scene | 🟢 |
| `.png` | Portable Network Graphics (lossless) | 🟡 |
| `.jpg` | JPEG compressed image | 🟢 |
| `.jpeg` | Same as `.jpg` | 🟡 |
| `.bmp` | Windows bitmap | 🟡 |
| `.vsi` | Olympus virtual-slide “wrapper” file | 🟡 |
| `.ets` | Olympus VS series full-resolution tile set | 🟡 |
| `.jp2` | JPEG-2000 codestream | 🟡 |
| `.roi` | ImageJ/Fiji region-of-interest set | 🟡 |
| `.dm3` | Gatan DigitalMicrograph EM image | 🟡 |
| `.pxp` | Igor Pro packed experiment (can embed images) | 🟡 |
| `.ipf` | Igor Pro procedure/data file | 🟡 |
| `.lif` | Leica Image File (LAS X) | 🟡 |
| `.ima` | Amira/Avizo volumetric raw image | 🟡 |
| `.mrxs` | 3DHISTECH Mirax whole-slide image | 🟡 |
| `.obj` | Wavefront 3-D mesh | 🟡 |
| `.avi` | Uncompressed/codec AVI video (time-lapse stacks) | 🟡 |
| `.exf` | Zeiss experiment file (ZEN) | 🟡 |
| `.cxd` | Olympus cellSens dataset | 🟡 |
| Dataset ID | Type | Source Format(s) | Success |
|---|---|---|---|
| 108 | Time Series | `.csv` | ✅ |
| 126 | Time Series | `.acq` | ✅ |
| 142 | Time Series | `.csv` | ✅ |
| 148 | Time Series | `.acq` | ✅ |
| 149 | Time Series | `.smr` | ✅ |
| 150 | Time Series | `.smr` | ✅ |
| 224 | Time Series | `.acq` | ✅ |
| 297 | Time Series | `.abf` | ✅ |
| 301 | Time Series | `.csv` | ✅ |
| 305 | Time Series | `.csv` | ✅ |
| 309 | Time Series | `.mat` | ✅ |
| 310 | Time Series | `.mat` | ✅ |
| 315 | Time Series | `.smrx` | ✅ |
| 316 | Time Series | `.rhd` | ✅ |
| 323 | Time Series | `.csv` | ✅ |
| 327 | Time Series | `.mat` | ✅ |
| 338 | Time Series | `.smrx` | ✅ |
| 349 | Time Series | `.hdf5` | ✅ |
| 350 | Time Series | `.csv` | ✅ |
| 351 | Time Series | `.csv` | ✅ |
| 357 | Time Series | `.mat` | ✅ |
| 375 | Time Series | `.mat` | ✅ |
| 376 | Time Series | `.mat` | ✅ |
| 378 | Time Series | `.adicht`, `.adidat`, `.adidatx` | ✅ |
| 380 | Time Series | `.hdf5` | ✅ |
| 391 | Time Series | `.hdf5` | ✅ |
| 400 | Time Series | `.adi`, `.mat` | ✅ |
| 406 | Time Series | `.dat`, `.wav` | ✅ |
| 425 | Time Series | `.csv` | ✅ |
| 435 | Time Series | `.abf` | ✅ |
| 436 | Time Series | `.ns5` | ✅ |
| 117 | Imaging | `.rhd` | ✅ |
| 65 | Imaging | `.nd2`, `.tif` | ✅ |
| 132 | Imaging | `.ima` | ✅ |
| 187 | Imaging | `.jpg` | ✅ |
| 290 | Imaging | `.tif` | ✅ |
| 296 | Imaging | `.ims` | ✅ |
Max Haberbusch: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing; David Lung: Data curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – review & editing; Philipp Heute: Investigation, Software, Validation, Visualization, Writing – original draft, Writing – review & editing; Sebastian Hochreiter: Investigation, Software, Validation, Writing – review & editing; Laurenz Berger: Validation, Writing – review & editing







