This repository was archived by the owner on Aug 5, 2025. It is now read-only.

SPARC-FAIR-Codeathon/2025-team-B


What is SPARC FUSE?

SPARC FUSE is a file unification and standardization engine for SPARC datasets.

Whether you're on the command line, in a Python notebook, or on the SPARC Portal itself, FUSE converts the 40+ eclectic imaging & time-series formats scattered across SPARC datasets into one clean, cloud-native Zarr layout (or .mat / .npz, if you prefer).

✅ Out of the box, SPARC FUSE supports over 82% of all imaging and time-series file types found in public SPARC datasets.

  • CLI / Python API – one call turns raw files into analysis-ready arrays that slot straight into Xarray + Dask, MATLAB, PyTorch, etc.
  • Browser extension – adds a “Convert & Download” button to every dataset page so you can grab ready-to-analyze Zarr bundles without leaving your browser.
  • Cloud-first – outputs stream directly from S3 for zero-copy workflows that scale from your laptop to HPC or Lambda.

Spend your time on science, not on hunting converters and deciphering proprietary internal file structures. 🦾

🧬 Full Metadata, Always Included

Every SPARC FUSE export – whether .zarr, .npz, or .mat – automatically bundles all SPARC metadata available for the source file or dataset.

In addition, we append detailed conversion-specific metadata, including:

meta = {
    "time_units": "seconds",
    "time_auto_generated": True,
    "source_format": "acq",
    "database_id": "1337",
    "sampling_frequency": 1000.0,
    "channel_names": ["CH1", "CH2", "CH3"],
    "channel_units": ["mV", "mV", "mV"],
    "version": "v1.0",
    "upload_date": "2025-07-15T12:34:56Z",
    "conversion_date": "2025-08-03T10:01:22Z",
    "auto_mapped": True,
    "doi": "10.12345/sparc.1337",
    "original_file_name": "20_1021.acq",
    "sparc_subject_id": "sub-ChAT-Male-Subject-1",
    "anatomical_location": "vagus nerve",
    "sweep_mode": False,
    "notes": "Mapped using SPARC FUSE"
}

💡 This metadata is embedded directly into .zarr attributes or stored as a struct / dictionary in .mat / .npz files. Every export is fully self-describing and ready for downstream use or publication.
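To illustrate what "self-describing" means for the .npz case, here is a minimal sketch that round-trips a metadata dictionary as a JSON blob inside the archive. The `meta` key and JSON encoding are assumptions for this example; the actual SPARC FUSE layout may differ.

```python
import json
import numpy as np

# A (truncated) metadata dictionary like the one SPARC FUSE attaches.
meta = {"source_format": "acq", "sampling_frequency": 1000.0}

# Store signals together with a JSON-encoded metadata blob in one archive.
np.savez("export_demo.npz", signals=np.zeros((3, 10)), meta=json.dumps(meta))

# Later, anyone can recover the metadata without knowing the source format.
with np.load("export_demo.npz") as data:
    recovered = json.loads(str(data["meta"]))
print(recovered["sampling_frequency"])  # 1000.0
```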

🧠 Smart Mapping with Confidence Scoring

SPARC FUSE uses a hybrid mapping system to extract standardized signals, time vectors, metadata, and annotations from a wide range of raw data formats:

  • 🔧 Handcrafted descriptors: Expert-written mappings for known formats like .smr, .abf, .adicht, etc.
  • 🤖 Heuristic auto-mapping: If no direct match is found, FUSE evaluates all known descriptors using a scoring system and applies the best match.
  • 📊 Mapping score: Each descriptor is scored based on required fields (like signals, time, sampling_frequency). The descriptor with the highest score is selected.

Scoring and fallback logic

result = evaluate_mapping_fields(descriptor, context)
score = score_mapping_result(result, descriptor)

💡 The selected descriptor is recorded in the output metadata, along with auto_mapped = true if it was selected heuristically. This gives you full transparency into how the conversion was performed.
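Conceptually, the fallback works like a tournament over all known descriptors: score each one against what was actually found in the file, then keep the winner. The helper below is an illustrative sketch of that idea, not the actual SPARC FUSE implementation (`score_descriptor` and the descriptor dictionaries are invented for this example):

```python
def score_descriptor(descriptor, context,
                     required=("signals", "time", "sampling_frequency")):
    """Count how many required fields this descriptor can resolve in the file's context."""
    return sum(1 for field in required if descriptor.get(field) in context)

# Hypothetical descriptors: each maps required fields to keys expected in the raw file.
descriptors = [
    {"name": "abf-like", "signals": "data", "time": "t", "sampling_frequency": "fs"},
    {"name": "csv-like", "signals": "columns", "time": "index", "sampling_frequency": None},
]

# Keys actually discovered while parsing the raw file.
context = {"data", "t", "fs"}

# Pick the descriptor with the highest score.
best = max(descriptors, key=lambda d: score_descriptor(d, context))
print(best["name"], score_descriptor(best, context))  # abf-like 3
```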

If time is missing in the original file, FUSE will auto-generate a time vector from signal length and set:

"time_auto_generated": true

This mapping system ensures that even unknown or partially supported formats can be converted with best-effort accuracy – and that you always know how the output was produced.
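For illustration, a minimal sketch of such a time-vector auto-generation (the function name `auto_time_vector` is hypothetical; the real implementation derives the sample count and sampling frequency from the source file):

```python
import numpy as np

def auto_time_vector(n_samples, sampling_frequency):
    # Evenly spaced timestamps in seconds, starting at t = 0.
    return np.arange(n_samples) / sampling_frequency

t = auto_time_vector(5, 1000.0)
print(t.tolist())  # [0.0, 0.001, 0.002, 0.003, 0.004]
```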


🚀 Quick start

Command-line interface

# Clone and install
git clone https://github.com/SPARC-FAIR-Codeathon/2025-team-B.git

# (optional) create a conda environment
conda env create -f 2025-team-B/conda_env_py310.yaml
conda activate py310

cd 2025-team-B/sparcfuse
pip install -e .
cd ..

# Bulk-convert all primary files of an entire dataset
sparc-fuse 224 --output-dir ./converted

🎬 Demo: sparc-fuse-terminal-bulk-download_x4.mov

# Convert a single file
sparc-fuse 224 primary/sub-ChAT-Male-Subject-1/20_1021.acq --output-dir ./converted

🎬 Demo: sparc-fuse-terminal-single-file-download_x2.mov

# View options
sparc-fuse --help

Use as a Python library

Try it out on oSPARC

💡 Note: You will have to run the install_sparc_fuse.ipynb notebook to install all dependencies before working with sparc-fuse. Also, make sure you have your own SciCrunch API key and AWS credentials ready.

Or proceed locally

0 – Clone the project
git clone https://github.com/SPARC-FAIR-Codeathon/2025-team-B.git
cd 2025-team-B
1 – Convert a single primary file
from sparc_fuse_core import download_and_convert_sparc_data, list_primary_files

DATASET_ID = 224  # Any valid SPARC dataset ID

files, _ = list_primary_files(DATASET_ID)
print("primary files:", [f["path"] for f in files])

download_and_convert_sparc_data(
    DATASET_ID,
    primary_paths=files[0]["path"].replace("files/", ""),
    output_dir="./output_single",
    file_format="zarr"
)
🎬 Demo: sparc-fuse-notebook-single-file-download_x1_5.mov
2 – Bulk-convert an entire dataset
from sparc_fuse_core import download_and_convert_sparc_data, list_primary_files

DATASET_ID = 224  # Any valid SPARC dataset ID

bulk_report = download_and_convert_sparc_data(
    DATASET_ID,
    output_dir="./output_bulk",
    file_format="zarr"
)

from pprint import pprint
pprint(bulk_report)
🎬 Demo: sparc-fuse-notebook-bulk-download_x4.mov
3 – Convert a subset of primary files
from sparc_fuse_core import download_and_convert_sparc_data, list_primary_files

DATASET_ID = 224  # Any valid SPARC dataset ID

# Grab (for example) the first three primary files
files, _ = list_primary_files(DATASET_ID)
subset_paths = [f["path"].replace("files/", "") for f in files[:3]]

report = download_and_convert_sparc_data(
    DATASET_ID,
    primary_paths=subset_paths,   # any iterable works
    output_dir="./output_subset",
    file_format="npz",
    overwrite=True                # regenerate if outputs already exist
)

from pprint import pprint
pprint(report)
🎬 Demo: sparc-fuse-notebook-multifile-download_x4.mov

💡 Tip: file_format accepts "zarr", "zarr.zip", "npz", or "mat". Choose the one that best matches your downstream workflow.

📄 Note: The docstring-generated API documentation of SPARC FUSE can be accessed through our GitHub Wiki.


Firefox Plugin

Start the Server

To use the Firefox plugin, you must first start the local server: the plugin relies on this backend to convert and serve data, so the extension cannot function while the server is down.

In your terminal, navigate to the server directory and start the server with:

cd 2025-team-B/server
python server.py

The server will run locally at http://127.0.0.1:5000.
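Before installing the extension, you can verify that the server is reachable with a few lines of Python. The helper below is an illustrative sketch, not part of SPARC FUSE; it treats any HTTP response, even an error page, as "up", and only a connection failure or timeout as "down".

```python
import urllib.error
import urllib.request

def server_is_up(url="http://127.0.0.1:5000", timeout=2.0):
    """Return True if anything answers at `url`, even with an HTTP error code."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # the server responded, just not with 200 OK
    except OSError:
        return False  # connection refused or timed out

print(server_is_up())
```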

Install the Firefox plugin

Open your Firefox browser and navigate to about:debugging#/runtime/this-firefox

Click Load Temporary Add-on and select the manifest.json file from the plugin directory.

The extension will appear in your browser extensions area.

Using the Plugin to Download Data from the SPARC Website

Once the Firefox plugin is installed and the server is running, you can use it to download datasets directly from the SPARC website.

The plugin integrates into the SPARC website interface and provides two types of download options:

  1. Download the Full Dataset
     Use the Download & Convert Dataset button to retrieve the entire dataset. This button is located near the top of the dataset page.

🎬 Demo: ff_download_dataset.mov

  2. Download Individual Files
     For selective downloading, each file listed in the dataset has its own Download & Convert icon. Clicking this button lets you fetch only the specific file you need.

🎬 Demo: ff_download_single_file.mov

The plugin processes and converts the files through your local server before saving them to your machine. Make sure the server is active while downloading.

❓ Why SPARC FUSE?

The headache

  • SPARC hosts 40+ heterogeneous file formats and countless sub-variants (custom internal structures) – each with its own quirks.
  • Researchers lose precious hours hunting converters and writing glue code instead of analysing data.
  • This format jungle breaks reproducibility and puts FAIR principles at risk.

See Figures 1 and 2 for the distribution of file extensions in SPARC datasets.

Figure 1. Relative frequency of every time-series and imaging file extension found in public SPARC datasets (log-scaled word cloud).

Figure 2. The SPARC database contains 20+ distinct time-series formats and 20+ imaging formats, each hiding additional proprietary structures inside the files.

The cure

  • SPARC FUSE automatically remaps any supported file (time-series & imaging) into a uniform, chunked Zarr store
    – optionally also .mat or .npz, for legacy tools.

✅ Currently, out of the box, SPARC FUSE supports over 82% of all imaging and time-series file types found in public SPARC datasets. (See Figure 3 for an overview of selected supported formats and unified export targets.)

Many formats in – one clean Zarr/MAT/NPZ out
Figure 3. Overview of selected file formats supported by SPARC FUSE for conversion into a unified, cloud-native Zarr representation. Shown here is a non-exhaustive subset of imaging formats (e.g., .czi, .nd2, .tiff) and time-series/electrophysiology formats (e.g., .mat, .csv, .smr, .rhd, .acq). Optional exports to .mat and .npz are also supported for downstream analysis.

  • Works three ways:
    1. Python API – bulk-convert or cherry-pick files in a single call.
    2. CLI – one-liner on the command line.
    3. Browser button – “Convert & Download” directly from the SPARC portal.
  • Keeps full provenance: every conversion is logged, making pipelines fully reproducible.

Why it matters

  • ✅ Hours → seconds: spend time on science, not format wrangling.
  • 🔄 Interoperability out of the box: unified layout means the same loader works for every dataset.
  • ☁️ Cloud-ready chunks: Zarr's design unlocks scalable, parallel analysis on HPC or S3-style storage.
  • 🌐 FAIR boost: data become immediately Accessible, Interoperable and Reusable across toolchains.

Supported File Formats at a Glance

Time-series: .mat, .smr, .csv, .adicht, .hdf5, .h5, .abf, .rhd, .ns5, .smrx, .wav, .acq, .tnt, .tsq, .sev
Imaging: .ets, .tif, .tiff, .czi, .nd2, .lsm, .jpx, .svs, .ims, .png, .jpg, .jpeg, .bmp, .vsi, .jp2, .roi, .dm3, .pxp, .ipf, .lif, .ima, .mrxs, .obj, .avi, .exf, .cxd

SPARC FUSE: one data format to unite them all (artwork)


🌩️ Zarr + AWS: super-charging SPARC data


TL;DR – Zarr is a cloud-native chunked-array format that lets you stream only the bytes you need.
SPARC datasets are now mirrored on Amazon S3 via the AWS Registry of Open Data, so Zarr fits like a glove.

Why Zarr? Why now?
“Zarr is like Parquet for arrays.” It stores N-D data in tiny, independent chunks – perfect for parallel reads/writes and lazy loading. SPARC just announced that all public datasets are directly accessible on AWS S3 (Requester Pays) and even have a listing on the AWS Open Data Registry.
Zarr plays nicely with xarray, Dask, PyTorch, TensorFlow, MATLAB (via zarr-matlab), and more. With data already in S3, a converted Zarr store can be queried in place from an EC2, Lambda, or SageMaker job – no re-download cycles.
The spec is open, community-driven, and language-agnostic. SPARC FUSE's one-line sparc-fuse <id> … --file-format zarr command gives you an analysis-ready, cloud-optimised dataset in seconds.

What is Zarr?

Schematic of Zarr: JSON metadata plus many tiny chunk files
Figure 4. Zarr overview. Diagram adapted from the Earthmover blog post “What is Zarr?”.

  • Chunked storage – data are broken into independently readable/writable tiles.
  • Cloud-optimised layout – each chunk is just an object in S3 / GCS, so you stream only the bytes you need.
  • Parallel-ready – Dask, Ray, Spark, etc. slurp different chunks concurrently for massive speed-ups.
  • Open spec – language-agnostic, community-governed, and already adopted by NASA, OME-Zarr, Pangeo, and more.
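To see why this layout pays off, here is a back-of-the-envelope sketch (plain Python, no Zarr required) of which chunk objects a slice of a 1-D array actually has to fetch – a small slice only ever touches a handful of the store's objects:

```python
def chunks_touched(start, stop, chunk_size):
    """Indices of the chunk objects a [start, stop) slice must fetch."""
    return list(range(start // chunk_size, (stop - 1) // chunk_size + 1))

# A 10-million-sample signal in 100k-sample chunks is 100 chunk objects,
# but a 100k-sample slice touches at most two of them.
print(chunks_touched(250_000, 350_000, 100_000))  # [2, 3]
```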

☁️ Cloud-first Demo: Prepare & Consume Data from S3

This demo illustrates a simple, end-to-end workflow for cloud-based data handling:
Convert SPARC primary files into Zarr format, upload them to S3, consolidate metadata, and open them directly in Xarray without downloading the entire dataset. You stream only the slices you need, making your analysis quicker and easier.

🚀 What's going on?

  • Convert a SPARC primary file to Zarr.
  • Upload the converted data to an S3 bucket.
  • Consolidate metadata for fast remote access.
  • Wrap into an Xarray-compatible Zarr store, ready to use with xr.open_zarr(...).
  • Lazily open and stream data slices directly from S3.

πŸ› οΈ Prepare S3 Bucket

from sparc_fuse_core import (
    list_primary_files, download_and_convert_sparc_data,
    upload_to_s3, consolidate_s3_metadata,
    create_xarray_zarr_from_raw, generate_and_upload_manifest
)

# Parameters
DATASET_ID = 224
BUCKET = "sparc-fuse-demo-ab-2025"
REGION = "eu-north-1"
RAW_ZARR = "20_1021_std.zarr"
XARRAY_ZARR = "20_1021_std_xarray.zarr"

# Convert SPARC file to Zarr locally
files, _ = list_primary_files(DATASET_ID)
primary_path = files[0]["path"].replace("files/", "")
download_and_convert_sparc_data(
    DATASET_ID,
    primary_paths=primary_path,
    output_dir="./output_single",
    file_format="zarr"
)

# Upload raw Zarr to S3
upload_to_s3(f"./output_single/{RAW_ZARR}", BUCKET, RAW_ZARR, REGION)

# Consolidate metadata
consolidate_s3_metadata(BUCKET, RAW_ZARR, REGION)

# Create Xarray-compatible Zarr and upload to S3
create_xarray_zarr_from_raw(BUCKET, RAW_ZARR, XARRAY_ZARR, REGION)

# Generate discovery manifest and upload
generate_and_upload_manifest(DATASET_ID, BUCKET, XARRAY_ZARR, REGION)

print("✅ Preparation complete.")
🎬 Demo: s3-bucket-overview_x4.mov

📋 Consume Data

Load 100,000 time points and plot

from sparc_fuse_core import open_zarr_from_s3
import time
import matplotlib.pyplot as plt

# Open dataset lazily from S3
ds = open_zarr_from_s3(bucket="sparc-fuse-demo-ab-2025", zarr_path="20_1021_std_xarray.zarr")
print(ds)  # Immediately available metadata, lazy data loading

# Example: load a subset of channel 1 for the first 100,000 time points
start = time.perf_counter()
subset_ch1 = ds["signals"].sel(channel=1).isel(time=slice(0, 100000)).load()
elapsed = time.perf_counter() - start

# Plot the loaded slice
plt.figure(figsize=(6, 1.5))
plt.plot(subset_ch1.time.values, subset_ch1.values)
plt.xlabel("Time")
plt.ylabel("CH1")
plt.tight_layout()
plt.show()
print(f"Subset load time: {elapsed:.3f} s")
🎬 Demo: load-data-from-s3_x1.mov

⏳ It took only 0.3 s to load the 100,000 time points from S3.

Get SPARC Metadata from Zarr

import json

z = open_zarr_from_s3(BUCKET, XARRAY_ZARR, region="eu-north-1")
metadata = dict(z.attrs)
print(json.dumps(metadata["sparc_metadata"], indent=2))

🎬 Demo: load-sparc-metadata-from-s3-xarr_x1.mov

⏱ S3 slice speedup vs. SPARC download & slice

Slicing from S3 is roughly 33× faster than doing a fresh SPARC download and slice for the same data (Figure 5).

Figure 5. Latency comparison for SPARC download & slice (~9.8 s) vs. S3 slice (~0.3 s), showing a ~33× speedup.


Supported File Formats

Format Support Legend

  • 🟢 Fully supported and tested
  • 🟡 Expected to work via auto-mapping or heuristic parsing
  • 🔴 Not yet supported

Time-Series Formats

Extension(s) Description Support Status
.mat MathWorks MATLAB file 🟢
.smr CED Spike2 binary recording 🟢
.csv Comma-separated values text (generic) 🟢
.adicht ADInstruments LabChart binary trace 🟢
.hdf5 Hierarchical Data Format v5 container 🟢
.h5 Same as .hdf5 🟢
.abf Molecular Devices Axon Binary File (pClamp) 🟢
.rhd Intan RHD2000 amplifier data 🟢
.nev Blackrock NeuroPort event file 🔴
.ns5 Blackrock continuous 30 kHz signal 🟢
.ns2 Blackrock 1 kHz LFP signal 🔴
.ns1 Blackrock low-rate summary signal 🔴
.smrx CED Spike2 v9+ extended recording 🟢
.wav Waveform audio (PCM) 🟢
.acq BIOPAC AcqKnowledge raw acquisition 🟢
.tdx, .tev, .tnt, .tsq TDT Synapse time-series (multi-file) 🔴
.eeg, .vmrk, .vhdr BrainVision EEG dataset (multi-file) 🔴
.sev TDT RS4 single-channel stream 🔴

Imaging Formats

Extension(s) Description Support Status
.tif Tagged Image File Format (high-bit-depth microscopy) 🟢
.tiff Same as .tif 🟢
.czi Carl Zeiss ZEN container 🟢
.nd2 Nikon NIS-Elements microscope image 🟢
.lsm Zeiss laser-scanning-microscope stack 🔴
.jpx JPEG-2000 (JPX) image 🟡
.svs Aperio/Leica whole-slide image 🔴
.ims Bitplane Imaris 3-D/4-D scene 🟢
.png Portable Network Graphics (lossless) 🟡
.jpg JPEG compressed image 🟢
.jpeg Same as .jpg 🟡
.bmp Windows bitmap 🟡
.vsi Olympus virtual-slide “wrapper” file 🟡
.ets Olympus VS series full-resolution tile set 🟡
.jp2 JPEG-2000 codestream 🟡
.roi ImageJ/Fiji region-of-interest set 🟡
.dm3 Gatan DigitalMicrograph EM image 🟡
.pxp Igor Pro packed experiment (can embed images) 🟡
.ipf Igor Pro procedure/data file 🟡
.lif Leica Image File (LAS X) 🟡
.ima Amira/Avizo volumetric raw image 🟡
.mrxs 3DHISTECH Mirax whole-slide image 🟡
.obj Wavefront 3-D mesh 🟡
.avi Uncompressed/codec AVI video (time-lapse stacks) 🟡
.exf Zeiss experiment file (ZEN) 🟡
.cxd Olympus cellSens dataset 🟡

Tested Single File and Bulk Conversion

Dataset ID Type Source Format(s) Success
108 Time Series .csv ✅
126 Time Series .acq ✅
142 Time Series .csv ✅
148 Time Series .acq ✅
149 Time Series .smr ✅
150 Time Series .smr ✅
224 Time Series .acq ✅
297 Time Series .abf ✅
301 Time Series .csv ✅
305 Time Series .csv ✅
309 Time Series .mat ✅
310 Time Series .mat ✅
315 Time Series .smrx ✅
316 Time Series .rhd ✅
323 Time Series .csv ✅
327 Time Series .mat ✅
338 Time Series .smrx ✅
349 Time Series .hdf5 ✅
350 Time Series .csv ✅
351 Time Series .csv ✅
357 Time Series .mat ✅
375 Time Series .mat ✅
376 Time Series .mat ✅
378 Time Series .adicht, .adidat, .adidatx ✅
380 Time Series .hdf5 ✅
391 Time Series .hdf5 ✅
400 Time Series .adi, .mat ✅
406 Time Series .dat, .wav ✅
425 Time Series .csv ✅
435 Time Series .abf ✅
436 Time Series .ns5 ✅
117 Imaging .rhd ✅
65 Imaging .nd2, .tif ✅
132 Imaging .ima ✅
187 Imaging .jpg ✅
290 Imaging .tif ✅
296 Imaging .ims ✅

CRediT

Max Haberbusch: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing; David Lung: Data curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – review & editing; Philipp Heute: Investigation, Software, Validation, Visualization, Writing – original draft, Writing – review & editing; Sebastian Hochreiter: Investigation, Software, Validation, Writing – review & editing; Laurenz Berger: Validation, Writing – review & editing
