Merged
Changes from all commits
115 commits
80bf43a
Merge pull request #80 from jermp/master
jermp Sep 3, 2025
b042476
some comments on random lookup benchmark
jermp Sep 3, 2025
5a47677
another comment on random lookup benchmark
jermp Sep 3, 2025
41071fe
some results on random lookup benchmark
jermp Sep 4, 2025
e894045
updated pthash; simplified hash utils
jermp Sep 8, 2025
21c2e85
updated hash utils
jermp Sep 8, 2025
d0e39db
tripartition of offsets
jermp Sep 12, 2025
550dbd9
fix
jermp Sep 12, 2025
80c9d00
using 32-bit words for buckets.start_lists_of_size
jermp Sep 12, 2025
a0140aa
lookup for canonical indexes
jermp Sep 12, 2025
34af717
a note on presence of minimizers when lookup is resolved via the skew…
jermp Sep 13, 2025
42c3d1d
fixed constants for skew index; merge parse and skew index construction
jermp Sep 14, 2025
c49df82
new results taken on 14/09/25: slightly faster construction, faster q…
jermp Sep 14, 2025
2c2ccc0
new results.png
jermp Sep 14, 2025
7c2e9c2
new results.png
jermp Sep 14, 2025
2155fff
a note on SIMD for encoding in dictionary::lookup; optimized string_t…
jermp Sep 16, 2025
c907f6d
a note about loop-unrolling in string_to_uint_kmer
jermp Sep 16, 2025
041d6d5
removed useless line
jermp Sep 16, 2025
c716fe7
minor fix to num. partitions in skew index; better access
jermp Sep 19, 2025
1ec6110
use a bits::compact_vector for (iteration to be fixed)
jermp Sep 19, 2025
bbcc2b6
updated external/bits
jermp Sep 20, 2025
27d8b72
updated external/bits and using bits::endpoints_sequence
jermp Sep 22, 2025
6b48d47
added missing include for compilation on Linux
jermp Sep 22, 2025
dd1a7d2
added missing include for compilation on Linux
jermp Sep 22, 2025
053f012
results 22-09-25 for k=31
jermp Sep 23, 2025
a972335
a note in readme
jermp Sep 23, 2025
cfc22a2
perf lookup by list size
jermp Sep 23, 2025
3c698d7
updated results to 22/09/25
jermp Sep 25, 2025
43dd436
added endpoints.hpp
jermp Sep 26, 2025
4390b13
minor
jermp Oct 1, 2025
df7438b
using encoded offsets
jermp Oct 3, 2025
3ccbbf4
clean up
jermp Oct 4, 2025
2efac6c
clean up and implemented endpoints::id_to_offset
jermp Oct 4, 2025
6e4d9aa
fixed CMakeLists.txt
jermp Oct 4, 2025
f66ed13
fixed endpoints and parallel correctness check
jermp Oct 5, 2025
571e3d4
added bioconda badge
jermp Oct 5, 2025
b8f589c
implemented all miscellaneous fixes by Oleksandr Kulkov
jermp Oct 6, 2025
eb4d1c4
updated external/pthash
jermp Oct 6, 2025
ac4abe6
set offsets using a single thread
jermp Oct 7, 2025
26f48a5
removed unused code
jermp Oct 7, 2025
e12fc8d
minor
jermp Oct 7, 2025
d22d01d
back to previous scheme
jermp Oct 10, 2025
4c07bde
more
jermp Oct 11, 2025
91677e7
more (needs fixing)
jermp Oct 12, 2025
14c832f
fix
jermp Oct 12, 2025
4733dae
fix perf test iterator
jermp Oct 12, 2025
a9055e2
big refactoring
jermp Oct 15, 2025
0f31776
minor
jermp Oct 15, 2025
13360a4
optimized num. locate queries
jermp Oct 16, 2025
858e71b
optimized num. locate queries
jermp Oct 16, 2025
127ca04
minor
jermp Oct 16, 2025
091f244
minor
jermp Oct 16, 2025
64c8443
XXH128 does not work on AMD processor: rewritten hashers for minimize…
jermp Oct 18, 2025
f5215ef
added cityhash
jermp Oct 19, 2025
09244aa
parallel checks
jermp Oct 21, 2025
5bb6ef3
print cmd; build and bench scripts updated
jermp Oct 21, 2025
d5987b2
build and bench scripts updated
jermp Oct 21, 2025
46d2118
new benchmarks logs: 21/10/25
jermp Oct 22, 2025
1b373f3
cap kmers to scan in perf_test_iterator to 10^8
jermp Oct 22, 2025
813b9bc
updated scripts
jermp Oct 22, 2025
2e42570
minor
jermp Oct 22, 2025
f66ce60
fixed build script and new results (22/10/25); also, noted that encod…
jermp Oct 22, 2025
a028972
added results
jermp Oct 23, 2025
db02c17
compute min by scan is actually faster than using a min-heap
jermp Oct 24, 2025
e67257e
scripts updated
jermp Oct 25, 2025
e9a525d
simplified file_merging_iterator
jermp Oct 25, 2025
ff33ec7
optimized merging with a looser tree (faster then a min-heap because …
jermp Oct 25, 2025
7f0b05d
avoid branch in tight loop
jermp Oct 27, 2025
ac04609
wrong namespace
jermp Oct 27, 2025
0ee2aa7
minor
jermp Oct 31, 2025
33020f4
quiet build
jermp Oct 31, 2025
70ceef1
quiet build
jermp Oct 31, 2025
2ae21fc
refctoring of build steps
jermp Nov 2, 2025
007ca31
json stats and refactored dictionary_builder
jermp Nov 3, 2025
c31d22f
minor
jermp Nov 3, 2025
e275d51
prefetching experiment: a little gain
jermp Nov 3, 2025
c41bdb8
json stats for perf benchmark
jermp Nov 3, 2025
2efb5d4
prefetching helps indeed random lookup
jermp Nov 3, 2025
7530305
prefetching also for canonical lookup
jermp Nov 4, 2025
0c53a23
updated external/pthash and refactored offsets.hpp
jermp Nov 4, 2025
e644a94
step 7.1 and 7.2 timed as well
jermp Nov 4, 2025
6fb7925
minor
jermp Nov 4, 2025
7d29302
examples in the readme updated
jermp Nov 5, 2025
fe05a41
minor
jermp Nov 6, 2025
f264b9b
minor
jermp Nov 6, 2025
5a12f40
build.py
jermp Nov 6, 2025
9219105
bench.py
jermp Nov 6, 2025
3e01643
build.py
jermp Nov 6, 2025
d5fb57c
deleted old scripts
jermp Nov 6, 2025
2b751b2
fix build.py script
jermp Nov 7, 2025
bd8be44
fix build.py script
jermp Nov 7, 2025
8c8562a
fix script
jermp Nov 7, 2025
eefc24e
updated essentials; fixed script
jermp Nov 10, 2025
efc3212
fix streaming query multiline fasta
jermp Nov 10, 2025
b7f815f
more stats to json
jermp Nov 10, 2025
2e6c05b
bench results 10/11/25
jermp Nov 10, 2025
99900cb
updated results; better streaming query script
jermp Nov 11, 2025
c4218d1
different query file for SE
jermp Nov 11, 2025
0502751
results updated
jermp Nov 12, 2025
71a6a93
benchmarks subfolder refactored
jermp Nov 12, 2025
98ee7a8
print version number in main tool
jermp Nov 12, 2025
3ed53ad
a note on benchmarks
jermp Nov 12, 2025
4365223
minor
jermp Nov 15, 2025
c769e05
sbwt results for k=63
jermp Nov 16, 2025
8c6ac62
prefetching does not actually help but writing offsets to an array fi…
jermp Nov 16, 2025
6f225a2
added results for sshash-v3 to compare against
jermp Nov 28, 2025
b59a3e4
removed empty json files
jermp Dec 8, 2025
32fd510
minor name cleanup
jermp Dec 18, 2025
de450db
minor name cleanup
jermp Dec 18, 2025
ad0cac9
removed some old comments
jermp Dec 18, 2025
1fbb593
README UPDATED
jermp Dec 18, 2025
4d9786f
Merge branch 'master' into bench
jermp Dec 19, 2025
ae61108
resolved some conflicts for merging into master
jermp Dec 19, 2025
63f5927
resolved some conflicts for merging into master
jermp Dec 19, 2025
aa95d7f
removed some old comments
jermp Dec 19, 2025
2 changes: 0 additions & 2 deletions .clang-format
Original file line number Diff line number Diff line change
Expand Up @@ -148,5 +148,3 @@ StatementMacros:
TabWidth: 8
UseTab: Never
...


3 changes: 1 addition & 2 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ if (UNIX)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -Wno-missing-braces -Wno-unknown-attributes -Wno-unused-function")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -Werror -Wno-missing-braces -Wno-unknown-attributes -Wno-unused-function")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread")

if (SSHASH_USE_SANITIZERS)
Expand Down Expand Up @@ -63,7 +63,6 @@ set(SSHASH_SOURCES
src/dictionary.cpp
src/query.cpp
src/info.cpp
src/statistics.cpp
)

set(SSHASH_INCLUDE_DIRS
Expand Down
74 changes: 28 additions & 46 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7772316.svg)](https://doi.org/10.5281/zenodo.7772316)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7239205.svg)](https://doi.org/10.5281/zenodo.7239205)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17582116.svg)](https://doi.org/10.5281/zenodo.17582116)

<picture>
<source media="(prefers-color-scheme: dark)" srcset="img/sshash_on_dark.png">
Expand All @@ -24,8 +25,8 @@ The data structure is described in the following papers:
For a dictionary of n k-mers,
two basic queries are supported:

- i = **Lookup**(g), where i is in [0,n) if the k-mer g is found in the dictionary or i = -1 otherwise;
- g = **Access**(i), where g is the k-mer associated to the identifier i.
- i = **Lookup**(x), where i is in [0,n) if the k-mer x is found in the dictionary, or i = -1 otherwise;
- x = **Access**(i), where x is the k-mer associated with the identifier i.
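
The contract between these two queries can be sketched with a toy model. This is plain Python used purely for illustration — SSHash itself is a C++ library and exposes no such Python API; the names `lookup`/`access` are hypothetical:

```python
kmers = ["ACGTA", "CGTAC", "GTACG"]  # n = 3 distinct k-mers, k = 5
index = {g: i for i, g in enumerate(kmers)}  # stands in for the real data structure

def lookup(x):
    """Return an identifier in [0, n) if x is in the dictionary, else -1."""
    return index.get(x, -1)

def access(i):
    """Return the k-mer associated with identifier i (the inverse of lookup)."""
    return kmers[i]

# lookup and access are inverses on the indexed k-mers
assert all(access(lookup(g)) == g for g in kmers)
assert lookup("AAAAA") == -1  # absent k-mer
```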

If the weights of the k-mers (their frequency counts) are also stored in the dictionary, then the dictionary is said to be *weighted* and it additionally supports:

Expand All @@ -36,9 +37,9 @@ Other supported queries are:
- **Membership Queries**: determine if a given k-mer is present in the dictionary or not.
- **Streaming Queries**: stream through all k-mers of a given DNA file
(.fasta or .fastq formats) to determine their membership in the dictionary.
- **Navigational Queries**: given a k-mer g[1..k] determine if g[2..k]+x is present (forward neighbourhood) and if x+g[1..k-1] is present (backward neighbourhood), for x = A, C, G, T ('+' here means string concatenation).
SSHash internally stores a set of strings, called *contigs* in the following, each associated to a distinct identifier.
If a contig identifier is specified for a navigational query (rather than a k-mer), then the backward neighbourhood of the first k-mer and the forward neighbourhood of the last k-mer in the contig are returned.
- **Navigational Queries**: given a k-mer x[1..k] determine if x[2..k]+c is present (forward neighbourhood) and if c+x[1..k-1] is present (backward neighbourhood), for c in {A,C,G,T} ('+' here means string concatenation).
SSHash internally stores a set of strings, each associated with a distinct identifier.
If a string identifier is specified for a navigational query (rather than a k-mer), then the backward neighbourhood of the first k-mer and the forward neighbourhood of the last k-mer in the string are returned.
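
The neighbourhood computation described above can be sketched as follows. This is an illustrative Python model of the query semantics, not the SSHash API; the `present` predicate stands in for the dictionary:

```python
def neighbours(x, present):
    """Forward and backward neighbourhoods of k-mer x, given a
    membership predicate `present` over k-mers."""
    fwd = [x[1:] + c for c in "ACGT" if present(x[1:] + c)]  # x[2..k]+c
    bwd = [c + x[:-1] for c in "ACGT" if present(c + x[:-1])]  # c+x[1..k-1]
    return fwd, bwd

kmers = {"ACG", "CGT", "TAC"}  # a toy k-mer set, k = 3
fwd, bwd = neighbours("ACG", kmers.__contains__)
# "CG"+T = "CGT" is present forward; T+"AC" = "TAC" is present backward
assert fwd == ["CGT"] and bwd == ["TAC"]
```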

If you are interested in a **membership-only** version of SSHash, have a look at [SSHash-Lite](https://github.com/jermp/sshash-lite). It also works for input files with duplicate k-mers (e.g., [matchtigs](https://github.com/algbio/matchtigs) [4]). For a query sequence S and a given coverage threshold E in [0,1], the sequence is considered to be present in the dictionary if at least E*(|S|-k+1) of the k-mers of S are positive.
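
The coverage rule used by SSHash-Lite amounts to a one-line check; the helper below is hypothetical and only makes the threshold arithmetic concrete:

```python
def sequence_present(hits, seq_len, k, E):
    """S is considered present iff at least E*(|S|-k+1) of its
    (|S|-k+1) k-mers are found in the dictionary."""
    return hits >= E * (seq_len - k + 1)

# |S| = 100 and k = 31 give 70 k-mers; with E = 0.8, at least 56 must be found
assert sequence_present(56, 100, 31, 0.8)
assert not sequence_present(55, 100, 31, 0.8)
```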

Expand Down Expand Up @@ -76,6 +77,8 @@ To compile the code for a release environment (see file `CMakeLists.txt` for the
cmake ..
make -j

**NOTE**: For best performance on `x86` architectures, the option `-D SSHASH_USE_ARCH_NATIVE` can be specified as well.

For a testing environment, use the following instead:

mkdir debug_build
Expand Down Expand Up @@ -142,18 +145,6 @@ Tools and Usage
There is one executable called `sshash` after the compilation, which can be used to run a tool.
Run `./sshash` as follows to see a list of available tools.

== SSHash: (S)parse and (S)kew (Hash)ing of k-mers =========================

Usage: ./sshash <tool> ...

Available tools:
build build a dictionary
query query a dictionary
check check correctness of a dictionary
bench run performance tests for a dictionary
permute permute a weighted input file
compute-statistics compute index statistics

For large-scale indexing, it may be necessary to increase the number of file descriptors that can be opened simultaneously:

ulimit -n 2048
Expand All @@ -179,50 +170,50 @@ such collections of stitched unitigs can be obtained from raw FASTA files.

### Example 1

./sshash build -i ../data/unitigs_stitched/salmonella_enterica_k31_ust.fa.gz -k 31 -m 13 --check --bench -o salmonella_enterica.index
./sshash build -i ../data/unitigs_stitched/salmonella_enterica_k31_ust.fa.gz -k 31 -m 13 --check --bench -o salmonella_enterica.sshash

This example builds a dictionary for the k-mers read from the file `../data/unitigs_stitched/salmonella_enterica_k31_ust.fa.gz`,
with k = 31 and m = 13. It also checks the correctness of the dictionary (`--check` option), runs a performance benchmark (`--bench` option), and serializes the index on disk to the file `salmonella_enterica.index`.
with k = 31 and m = 13. It also checks the correctness of the dictionary (`--check` option), runs a performance benchmark (`--bench` option), and serializes the index on disk to the file `salmonella_enterica.sshash`.

To run a performance benchmark after construction of the index,
use:

./sshash bench -i salmonella_enterica.index
./sshash bench -i salmonella_enterica.sshash

To also store the weights, use the option `--weighted`:

./sshash build -i ../data/unitigs_stitched/with_weights/salmonella_enterica.ust.k31.fa.gz -k 31 -m 13 --weighted --check --verbose

### Example 2

./sshash build -i ../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz -k 31 -m 15 -l 2 -o salmonella_100.index
./sshash build -i ../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz -k 31 -m 15 -o salmonella_100.sshash

This example builds a dictionary from the input file `../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz` (a pangenome consisting of 100 genomes of *Salmonella Enterica*), with k = 31, m = 15, and l = 2. It also serializes the index on disk to the file `salmonella_100.index`.
This example builds a dictionary from the input file `../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz` (a pangenome consisting of 100 genomes of *Salmonella Enterica*), with k = 31 and m = 15. It also serializes the index on disk to the file `salmonella_100.sshash`.

To perform some streaming membership queries, use:

./sshash query -i salmonella_100.index -q ../data/queries/SRR5833294.10K.fastq.gz
./sshash query -i salmonella_100.sshash -q ../data/queries/SRR5833294.10K.fastq.gz

if your queries are meant to be read from a FASTQ file, or

./sshash query -i salmonella_100.index -q ../data/queries/salmonella_enterica.fasta.gz --multiline
./sshash query -i salmonella_100.sshash -q ../data/queries/salmonella_enterica.fasta.gz --multiline

if your queries are to be read from a (multi-line) FASTA file.

### Example 3

./sshash build -i ../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz -k 31 -m 13 -l 4 -s 347692 --canonical -o salmonella_100.canon.index
./sshash build -i ../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz -k 31 -m 13 --canonical -o salmonella_100.canon.sshash

This example builds a dictionary from the input file `../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz` (the same as in Example 2), with k = 31, m = 13, l = 4, using the seed 347692 for construction (`-s 347692`), and with the canonical parsing modality (option `--canonical`). The dictionary is serialized on disk to the file `salmonella_100.canon.index`.
This example builds a dictionary from the input file `../data/unitigs_stitched/salmonella_100_k31_ust.fa.gz` (the same as in Example 2), with k = 31, m = 13, and the canonical parsing modality (option `--canonical`). The dictionary is serialized on disk to the file `salmonella_100.canon.sshash`.

The "canonical" version of the dictionary offers more speed for only a little space increase (for a suitable choice of parameters m and l), especially under low-hit workloads -- when the majority of k-mers are not found in the dictionary. (For all details, refer to the paper.)
The "canonical" version of the dictionary offers more speed for only a little space increase, especially under low-hit workloads -- when the majority of k-mers are not found in the dictionary. (For all details, refer to the paper.)
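
The usual convention behind canonical k-mers is to identify a k-mer with its reverse complement by always taking the lexicographically smaller of the two. The sketch below illustrates this convention only; it is not code from SSHash:

```python
COMP = str.maketrans("ACGT", "TGCA")

def canonical(x):
    """Canonical form of k-mer x: the lexicographic minimum of x and
    its reverse complement."""
    rc = x.translate(COMP)[::-1]  # reverse complement
    return min(x, rc)

# a k-mer and its reverse complement share the same canonical form
assert canonical("GATTACA") == canonical("TGTAATC")
assert canonical("TTTT") == "AAAA"
```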

Below is a comparison between the dictionary built in Example 2 (non-canonical)
and the one just built (Example 3, canonical).

./sshash query -i salmonella_100.index -q ../data/queries/SRR5833294.10K.fastq.gz
./sshash query -i salmonella_100.sshash -q ../data/queries/SRR5833294.10K.fastq.gz

./sshash query -i salmonella_100.canon.index -q ../data/queries/SRR5833294.10K.fastq.gz
./sshash query -i salmonella_100.canon.sshash -q ../data/queries/SRR5833294.10K.fastq.gz

Both queries should produce the following report (shown here for reference):

Expand Down Expand Up @@ -262,33 +253,24 @@ Input Files

SSHash is meant to index k-mers from collections that contain **neither duplicates
nor invalid k-mers** (strings containing symbols other than {A,C,G,T}).
These collections can be obtained, for example, by extracting the maximal unitigs of a de Bruijn graph.

To do so, we can use the tool [BCALM2](https://github.com/GATB/bcalm).
This tool builds a compacted de Bruijn graph and outputs its maximal unitigs.
From the output of BCALM2, we can then *stitch* (i.e., glue) some unitigs to reduce the number of nucleotides. The stitching process is carried out using the [UST](https://github.com/jermp/UST) tool.
These collections can be obtained, for example, by extracting the maximal unitigs of a de Bruijn graph, or eulertigs, using the [GGCAT](https://github.com/algbio/ggcat) algorithm.

**NOTE**: Input files are expected to have **one DNA sequence per line**. If a sequence spans multiple lines (e.g., multi-fasta), the lines should be concatenated before indexing.
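
A multi-line FASTA file can be flattened to one sequence per line with a few lines of code; this is a minimal sketch of the pre-processing step described in the note above, not a tool shipped with SSHash:

```python
def one_sequence_per_line(fasta_lines):
    """Concatenate the lines of each multi-line FASTA record into a
    single sequence string; header lines (starting with '>') are dropped."""
    seqs, cur = [], []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if cur:  # close the previous record
                seqs.append("".join(cur))
                cur = []
        elif line:
            cur.append(line)
    if cur:
        seqs.append("".join(cur))
    return seqs

assert one_sequence_per_line([">s1", "ACGT", "ACGT", ">s2", "TTTT"]) == ["ACGTACGT", "TTTT"]
```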

Below we provide a complete example (assuming both BCALM2 and UST are installed correctly) that downloads the Human (GRCh38) Chromosome 13 and extracts the maximal stitched unitigs for k = 31.

mkdir DNA_datasets
wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz -O DNA_datasets/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
~/bcalm/build/bcalm -in ~/DNA_datasets/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz -kmer-size 31 -abundance-min 1 -nb-cores 8
~/UST/ust -k 31 -i ~/Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa
gzip Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa.ust.fa
rm ~/Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa

#### Datasets

The script `scripts/download_and_preprocess_datasets.sh`
The script `scripts/download_and_preprocess_datasets.sh` of [this release](https://github.com/jermp/sshash/releases/tag/v3.0.0)
contains all the needed steps to download and pre-process
the datasets that we used in [1].

For the experiments in [2] and [3], we used the datasets available on [Zenodo](https://doi.org/10.5281/zenodo.7772316).
For the experiments in [2] and [3], we used the datasets available at [https://doi.org/10.5281/zenodo.7772316](https://doi.org/10.5281/zenodo.7772316).

For the latest benchmarks maintained in [this other repository](https://github.com/jermp/kmer_sets_benchmark)
we used the datasets described at [https://zenodo.org/records/17582116](https://zenodo.org/records/17582116).

#### Weights
Using the option `-all-abundance-counts` of BCALM2, it is possible to also include the abundance counts of the k-mers in the BCALM2 output. Then, use the option `-a 1` of UST to include such counts in the stitched unitigs.

Using the option `-all-abundance-counts` of [BCALM2](https://github.com/GATB/bcalm), it is possible to also include the abundance counts of the k-mers in the BCALM2 output. Then, use the option `-a 1` of [UST](https://github.com/jermp/UST) to include such counts in the stitched unitigs.

Create a New Release
--------------------
Expand Down
39 changes: 17 additions & 22 deletions benchmarks/README.md
Original file line number Diff line number Diff line change
@@ -1,34 +1,29 @@
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7239205.svg)](https://doi.org/10.5281/zenodo.7239205)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17582116.svg)](https://doi.org/10.5281/zenodo.17582116)

Benchmarks
----------

For these benchmarks we used the whole genomes of the following organisms:
For these benchmarks we used the datasets available here
[https://zenodo.org/records/17582116](https://zenodo.org/records/17582116).

- Gadus Morhua ("Cod")
- Falco Tinnunculus ("Kestrel")
- Homo Sapiens ("Human")

for k = 31 and 63.
To run the benchmarks, from within the `build` directory, run

The datasets and queries used in these benchmarks can be downloaded
by running the script
python3 ../script/build.py <log_label> <input_datasets_dir> <output_index_dir>
python3 ../script/bench.py <log_label> <input_index_dir>
python3 ../script/streaming-query-high-hit.py <log_label> <input_index_dir> <input_queries_dir>

```
bash download-datasets.sh
```
where `<log_label>` should be replaced by a suitable basename, e.g., the current date.

To run the benchmarks, from within the `build` directory, run
These are the results obtained on 10/11/25 (see logs [here](results-10-11-25))
on a machine equipped with an AMD Ryzen Threadripper PRO 7985WX processor clocked at 5.40GHz.
The code was compiled with `gcc` 13.3.0.

```
bash ../script/build.sh [prefix]
bash ../script/bench.sh [prefix]
bash ../script/streaming-query-high-hit.sh [prefix]
bash ../script/streaming-query-low-hit.sh [prefix]
```
The indexes were built with a max RAM usage of 16 GB and 64 threads.
Queries, instead, were run using a single thread.

where `[prefix]` should be replaced by a suitable basename, e.g., the current date.
![](results-10-11-25/results.png)

These are the results obtained on 22/08/25 (see logs [here](results-22-08-25)).
The results can be exported to CSV format with

![](results-22-08-25/results.png)
python3 ../script/print_csv.py ../benchmarks/results-10-11-25/k31
python3 ../script/print_csv.py ../benchmarks/results-10-11-25/k63
16 changes: 0 additions & 16 deletions benchmarks/download-datasets.sh

This file was deleted.
