

Getting Started


Introduction

BriteKit (Bioacoustic Recognizer Toolkit) is a Python package that facilitates the development of bioacoustic recognizers using deep learning. It provides a command-line interface (CLI) and a Python API that support tasks such as:

  • downloading recordings from Xeno-Canto, iNaturalist, and YouTube (optionally using Google Audioset metadata)
  • managing training data in a SQLite database
  • training models
  • testing, tuning and calibrating models
  • reporting
  • deployment and inference

To view a list of BriteKit commands, type britekit --help. You can also get help for individual commands, e.g. britekit train --help describes the train command. When accessing BriteKit from Python, the britekit.commands namespace contains a function for each command, as documented here. The classes used by the commands can also be accessed, and are documented here.

License

BriteKit is distributed under the terms of the MIT license.

Installation

It is best to install BriteKit in a virtual environment, such as a Python venv. Once you have that set up, install the BriteKit package using pip:

pip install britekit

On Windows, you then need to uninstall and reinstall PyTorch to get a CUDA-enabled build:

pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

Note that cu126 refers to CUDA 12.6.
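
To confirm that the CUDA-enabled build is active, you can run a quick check (this uses only standard PyTorch calls):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If this prints False, the CPU-only build is still installed or no compatible GPU/driver was found.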
Once BriteKit is installed, initialize a working environment using the init command:

britekit init --dest=<directory path>

This creates the directories needed and installs sample files. If you omit --dest, it will create directories under the current working directory, which is the most common use case.

Configuration

Configuration parameters are documented here. After running britekit init, the file yaml/base_config.yaml contains all parameters in YAML format. Most CLI commands have a --config argument that allows you to specify the path to a YAML file that overrides selected parameters. For example, when running the train command, you could provide a YAML file containing the following:

train:
  model_type: "effnet.4"
  learning_rate: .002
  drop_rate: 0.1
  num_epochs: 20

This overrides the default values for model_type, learning_rate, drop_rate and num_epochs. When using the API, you can update configuration parameters like this:

import britekit as bk
cfg = bk.get_config()
cfg.train.model_type = "effnet.4"

Downloading Recordings

The inat, xeno and youtube commands make it easy to download recordings from Xeno-Canto, iNaturalist and YouTube. For iNaturalist it is important to provide the scientific name. For example, to download recordings of the American Green Frog (Lithobates clamitans), type:

britekit inat --name "lithobates clamitans" --output <output-path>

For Xeno-Canto, use --name for the common name or --sci for the scientific name. For YouTube, specify the ID of the corresponding video. For example, specify --id K_EsxukdNXM to download the audio from https://www.youtube.com/watch?v=K_EsxukdNXM.

The audioset command lets you download using Google Audioset, which is metadata that classifies sounds in YouTube videos. Audioset was released in March 2017, so any videos uploaded later than that are not included. Also, some videos that are tagged in Audioset are no longer available. Type britekit audioset --help for more information.

Managing Training Data

Once you have a collection of recordings, the steps to prepare it for training are:

  1. Extract spectrograms from recordings and insert them into the training database.
  2. Curate the training spectrograms.
  3. Create a pickle file from the training data.
  4. Provide the path to the pickle file when running training.

Suppose we have a folder called recordings/cow. To generate spectrograms and insert them into the training database, we could type britekit extract-all --name Cow --dir recordings/cow. This will create a SQLite database in data/training.db and populate it with spectrograms using the default configuration, assigning the class name "Cow". To browse the database, you can use DB Browser for SQLite, or a similar application. That will reveal the following tables:

  • Class: classes that the recognizer will be trained to identify, e.g. species of bird, mammal or amphibian
  • Category: categories such as Bird, Mammal or Amphibian
  • Source: sources of recordings, e.g. Xeno-Canto or iNaturalist.
  • Recording: individual recordings
  • Segment: fixed-length sections of recordings
  • SpecGroup: groups of spectrograms that share spectrogram parameters
  • SpecValue: spectrograms, each referencing a Segment and SpecGroup
  • SegmentClass: associations between Segment and Class, to identify the classes that occur in a segment
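
Since training.db is a plain SQLite file, you can also inspect it from Python. Here is a minimal sketch that assumes the default database path and uses only the table names listed above:

import sqlite3

con = sqlite3.connect("data/training.db")
cur = con.cursor()

# Count rows in each of the tables described above.
for table in ["Class", "Category", "Source", "Recording",
              "Segment", "SpecGroup", "SpecValue", "SegmentClass"]:
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    print(table, cur.fetchone()[0])

con.close()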

There are commands to add or delete database records, e.g. add-cat and del-cat to add or delete a category record. In addition, specifying the --cat argument with the extract-all or extract-by-image commands will add the required category record if it does not exist. You can plot database spectrograms using plot-db, or plot spectrograms for recordings using plot-rec or plot-dir. Once you have a folder of spectrogram images, you can manually delete or copy some of them. The extract-by-image command will then extract only the spectrograms corresponding to the given images. Similarly, the del-seg command will delete segments, and their spectrograms, corresponding to the images in a directory.

It is important to tune spectrogram parameters such as height, width, maximum/minimum frequency and window length for your specific application. This is discussed more in the tuning section below, but for now be aware that you can set specific parameters in a YAML file to pass to an extract or plot command. For example:

audio:
  min_freq: 350
  max_freq: 4000
  win_length: .08
  spec_height: 192
  spec_width: 256

The FFT window length is specified as a fraction of a second: .08 seconds in this example. That way the real window length does not vary if you change the sampling rate. As a rule of thumb, the sampling rate should be about 2.1 times the maximum frequency. Using the theoretically correct value of 2.0 can sometimes lead to implementation artifacts in the highest frequencies. Before training your first model, it is advisable to examine some spectrogram images and choose settings that seem reasonable as a starting point. For example, the frequency range needed for your application may be greater or less than the defaults.
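
As a worked example of the rule of thumb above (plain arithmetic, not a BriteKit API):

# Sampling rate of about 2.1 times the maximum frequency of interest.
max_freq = 4000                          # Hz, from the YAML example above
sample_rate = int(2.1 * max_freq)        # 8400 Hz

# win_length is given in seconds, so the FFT window in samples is:
win_length = 0.08                        # seconds
n_fft = round(win_length * sample_rate)  # 672 samples
print(sample_rate, n_fft)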

The SpecGroup table allows you to easily experiment with different spectrogram settings. Running extract-all or extract-by-image creates spectrograms assigned to the default SpecGroup, if none is specified. Once you have curated some training data, use the reextract command to create another set of spectrograms, assigned to a different SpecGroup. That way you can keep spectrograms with different settings for easy experimentation.

Training

The pickle-train command creates a binary pickle file (data/training.pkl by default), which is the source of training data for the train command. Reading a binary file is much faster than querying the database, so this speeds up the training process. It also provides a simple way to select a SpecGroup and/or a subset of classes for training. For training, you should always provide a config file to override some defaults. Here is an expanded version of the earlier example:

train:
  train_pickle: "data/low_freq.pkl"
  model_type: "effnet.4"
  head_type: "basic_sed"
  learning_rate: .002
  drop_rate: 0.1
  drop_path_rate: 0.1
  val_portion: 0.1
  num_epochs: 20

The model_type parameter can be "timm.x" for any model x supported by timm. However, many bioacoustic recognizers benefit from a smaller model than typical timm models. Therefore BriteKit provides a set of scalable models, such as "effnet.3" and "effnet.4", where larger numbers indicate larger models. The scalable models are:

| Model | Original Name | Comments | Original Paper |
|-------|---------------|----------|----------------|
| bknet | GENet | A custom implementation of the timm gernet, which is an implementation of GENet. Fast, useful for all but the smallest models. | here |
| dla | DLA | Slow and not good for large models, but works well for some very small models. | here |
| effnet | EfficientNetV2 | Medium speed, widely used, useful for all sizes. | here |
| gernet | GENet | Scalable configurations of the timm gernet model. Fast, useful for all but the smallest models. | here |
| hgnet | HgNetV2 | Fast, useful for all but the smallest models. | not published |
| vovnet | VovNet | Medium-fast, useful for all sizes. | here |

For very small models, say with fewer than 10 classes and just a few thousand training spectrograms, DLA and VovNet are good candidates. As model size increases, DLA becomes slower and less appropriate. Of course, it is best to try different models and model sizes to see which works best for your application.

If head_type is not specified, BriteKit uses the default classifier head defined by the model. However, you can also specify any of the following head types:

| Head Type | Description |
|-----------|-------------|
| basic | A basic MLP classifier head. |
| effnet | The MLP classifier head used in EfficientNetV2. |
| hgnet | The MLP classifier head used in HgNetV2. |
| basic_sed | SED head with attention-weighted frame classification applied directly to backbone features. |
| bitemporal_sed | SED head that uses bidirectional (forward and backward) convolutions for temporal context. |
| scalable_sed | SED head with grouped channel reduction and a temporal convolution stack. |

The hidden_channels parameter controls the size of all three SED heads.

See the discussion of Backbones and Classifier Heads below for more information.

You can specify val_portion > 0 to run validation on a portion of the training data, or num_folds > 1 to run k-fold cross-validation. In the latter case, training output will be in logs/fold-0/version_x etc. Otherwise it is under logs/version_x. Output from the first training run is saved in version_0, and the version number is incremented in subsequent runs. To view graphs of the loss and learning rate, type tensorboard --logdir <log directory>. This will launch an embedded web server and display a URL that you can use to view graphs such as the learning rate in a web browser.

Testing

To run a test, you need to annotate a set of test recordings, analyze them with your model or ensemble, and then run the rpt-test command (although test annotations are also used by the calibrate, ensemble, plot-test, rpt-ann, rpt-epochs and tune commands). Annotations must be saved in a CSV file with a defined format. For initial testing and tuning it is best to annotate each relevant sound (per-segment), but for later usage you may wish to use per-block (e.g. minute) or per-recording annotations. Per-recording annotations are defined in a CSV file with these columns:

| Column | Description |
|--------|-------------|
| recording | Just the stem of the recording name, e.g. XC12345, not XC12345.mp3. |
| classes | Defined classes found in the recording, separated by commas. For example: AMCR,BCCH,COYE. |

Per-block annotations are defined in a CSV file with these columns:

| Column | Description |
|--------|-------------|
| recording | Just the stem of the recording name, as above. |
| block | 1 for the first block (e.g. minute), 2 for the second, etc. |
| classes | Defined classes found in that block, if any, separated by commas. |

Per-segment annotations are recommended, and are defined in a CSV file with these columns:

| Column | Description |
|--------|-------------|
| recording | Just the stem of the recording name, as above. |
| class | Identified class. |
| start_time | Where the sound starts, in seconds from the start of the recording. |
| end_time | Where the sound ends, in seconds from the start of the recording. |
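
For example, a per-segment annotations.csv can be written with nothing more than the standard csv module (the recording names, classes and times below are made up for illustration):

import csv

rows = [
    # recording, class, start_time, end_time
    ("XC12345", "AMCR", 3.5, 6.0),
    ("XC12345", "BCCH", 10.2, 11.8),
    ("meadow-2024-06-01", "COYE", 0.0, 2.5),
]

with open("annotations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["recording", "class", "start_time", "end_time"])
    writer.writerows(rows)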

Use the analyze command to analyze the recordings with your model or ensemble. For testing, be sure to specify --min_score 0. That way all predictions will be saved, not just those above a particular threshold, which is important when calculating metrics. See Metrics (PR-AUC and ROC-AUC) for more information.

It's usually best for a test to consist of a single directory of recordings, containing a file called annotations.csv. If that directory is called recordings and you run analyze specifying --output recordings/labels, you could generate test reports as follows:

britekit rpt-test -a recordings/annotations.csv -l labels -o <output-dir>

If your annotations were per-block or per-recording, you would specify the --granularity block or --granularity recording argument (--granularity segment is the default).

Tuning

Before tuning your model, you need to create a good test, as described in the previous section. Then you can use the tune command to find optimal settings for a given test. If you are only tuning inference parameters, you can run many iterations very quickly, since no training is needed. To tune training hyperparameters, many training runs are needed, which takes longer. You can also use the tune command to tune audio and spectrogram settings. In that case, every iteration extracts a new set of spectrograms, which takes even longer.

Here is a practical approach:

  1. Review spectrogram plots with different settings, especially spec_duration, spec_width, spec_height, min_freq, max_freq and win_length. Then choose reasonable-looking initial settings. For example, if all the relevant sounds fall between 1000 and 5000 Hz, set min_freq and max_freq accordingly.
  2. Tune the main training hyperparameters, especially model_type, head_type and num_epochs.
  3. Tune the audio/spectrogram hyperparameters.
  4. Tune data augmentation hyperparameters, which are described in the Data Augmentation section below.
  5. Tune the inference audio_power hyperparameter.
  6. Perform a second tuning pass, starting at step 2 above.

This usually leads to a substantial improvement in scores (see Metrics (PR-AUC and ROC-AUC) below). If you are using a SED classifier head, it is also worth tuning segment_len and overlap.

To run the tune command, you would typically use a config YAML file as described earlier, plus a special tuning YAML file, as in this example:

- name: spec_width
  type: int
  bounds:
  - 256
  - 512
  step: 64

This gives the name of the parameter to tune, its data type, and the bounds and step size to try. In this case, we want to try spec_width values of 256, 320, 384, 448 and 512. You can also tune multiple parameters at the same time by simply appending more definitions like this one. Parameters that have a set of defined choices rather than a numeric range are specified like this:

- name: head_type
  type: categorical
  choices:
  - "effnet"
  - "hgnet"
  - "basic_sed"

When running the tune command, you can ask it to test all defined combinations based on the input, or to test a random sample. To try 100 random combinations, add the argument --tries 100. To tune audio/spectrogram parameters, add the --extract argument. To tune inference only, add the --notrain argument.
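
To see how many combinations the two example definitions above produce, here is a quick sketch in plain Python (independent of the tune command itself):

from itertools import product

spec_widths = list(range(256, 512 + 1, 64))   # 256, 320, 384, 448, 512
head_types = ["effnet", "hgnet", "basic_sed"]

combos = list(product(spec_widths, head_types))
print(len(combos))   # 15 combinations; --tries lets you sample instead of testing all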

Training is non-deterministic, and results for a given group of settings can vary substantially across multiple training runs. Therefore it is important to specify the --runs argument, indicating how many times training should be run for each set of values.

As an example, to find the best spec_width value, we could type a command like this:

britekit tune -c yaml/my_train.yml -p yaml/my_tune.yml -a my_test/annotations.csv -o output/tune-spec-width --runs 5 --extract

This will perform an extract before each trial, and use the average score from 5 training runs in each case. Scores will be based on the given test, using macro-averaged ROC-AUC, although this can be changed with the --metric argument.

Ensembling

Combining multiple checkpoints in an ensemble is a quick and easy way to improve classifier results. This can be especially powerful when different model architectures are used, but even with the same model type and training protocol, ensembling almost always improves results.

Using an ensemble is very easy - just copy all the ensemble checkpoint files to the data/ckpt directory (or whichever directory is specified by the ckpt_folder configuration parameter). With too many models in an ensemble, inference will become very slow, and at some point there is no benefit to adding more checkpoints anyway. In most cases an ensemble of 3-6 checkpoints is best.

Given a per-segment test and a directory containing checkpoints, use the ensemble command to find the highest-scoring ensemble of a given size.

Calibrating

By default, the scores or predictions generated by your models may not align well with probabilities. Ideally, a score of .8 should be correct about 80% of the time, but for a given ensemble it might actually be correct 70% or 90% of the time. Aligning output predictions with probabilities is called calibration.

Use the calibrate command to calibrate your model or ensemble. Given a per-segment test and inference output it will generate a graph showing the uncalibrated and calibrated scores. Calibrated scores are based on a corresponding scaling coefficient and intercept. To use these values, set the scaling_coefficient and scaling_intercept parameters in your inference configuration.
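
As an illustration of how a scaling coefficient and intercept can map raw scores to calibrated probabilities, here is a Platt-style sketch. The exact formula is an assumption for illustration only; rely on the calibrate command and its documentation for the actual transform BriteKit applies.

import math

def calibrate_score(score, coefficient, intercept, eps=1e-7):
    # Assumed Platt-style scaling: a linear transform in logit space,
    # mapped back to a probability with a sigmoid.
    score = min(max(score, eps), 1.0 - eps)
    logit = math.log(score / (1.0 - score))
    return 1.0 / (1.0 + math.exp(-(coefficient * logit + intercept)))

# Hypothetical values; use the ones reported by the calibrate command.
print(calibrate_score(0.8, coefficient=1.3, intercept=-0.2))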

More Information


Spectrograms

Height and Width

When choosing spectrogram height and width, keep the following considerations in mind:

  1. Some models require image height and width to be multiples of 32.
  2. If the width is not divisible by the duration in seconds, temporal alignment issues may arise.
  3. If the sampling rate is not divisible by the width-per-second, temporal alignment issues may arise.

Temporal alignment is usually only a problem when issues 2 and 3 are very pronounced. Small deviations are usually not a problem.
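
A small sanity check for the three considerations above (plain arithmetic; the example values are illustrative):

def check_spec_settings(spec_width, spec_height, spec_duration, sample_rate):
    # 1. Many backbones expect dimensions that are multiples of 32.
    if spec_width % 32 or spec_height % 32:
        print("warning: width/height not multiples of 32")
    # 2. Width should divide evenly by the duration in seconds.
    if spec_width % spec_duration:
        print("warning: width not divisible by duration")
    # 3. Sampling rate should divide evenly by the width per second.
    width_per_second = spec_width // spec_duration
    if sample_rate % width_per_second:
        print("warning: sampling rate not divisible by width per second")

check_spec_settings(spec_width=256, spec_height=192, spec_duration=4, sample_rate=32000)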

Window Length

As described earlier, BriteKit differs from most packages in specifying FFT window length in fractional seconds. For example, a window length of 2048 with a sampling rate of 32000 would be specified as .064 (2048/32000). If it was specified as 2048 and you later changed the sampling rate, the effective window length would also change, which might not be what you intended.

Shorter window lengths give better temporal resolution and longer ones give better frequency resolution. This is easy to see by plotting a few examples. Window length is therefore an important tuning parameter, and the optimal value depends on the application.

Frequency Scales

BriteKit supports three frequency scales: mel, linear and log, with mel as the default. The mel scale is very close to linear up to 1 kHz and very close to log above that. Implementations differ from theory, so it is always best to look at a few samples before choosing a scale. For example, mel scale with a frequency range of 0-200 Hz should be linear in theory, but is not useful due to implementation issues.

Amplitude Scales

By default, spectrograms use a magnitude scale. Squaring the magnitudes gives a power scale, and taking 10*log10(power) gives a decibel scale. BriteKit normalizes all spectrogram values to the range [0, 1], since neural networks work best with values in a narrow range. This plot shows the different normalized amplitude scales:

Compared to the magnitude scale, the power scale reduces most values, which makes spectrograms look much fainter. The decibel scale increases most values, so spectrograms look very bright and quiet sounds become visible. Taking the square root of the magnitude gives a result between the magnitude and decibel scales. In BriteKit this can be achieved by tuning the audio_power parameter, where a value of 0.5 will yield a square root as shown above. In practice, a magnitude scale with a tuned audio_power usually works best.
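
The relationships described above are easy to express in numpy. A minimal sketch (the normalization to [0, 1] here is simple min-max scaling, chosen for illustration):

import numpy as np

def amplitude_scales(magnitude, audio_power=1.0):
    power = magnitude ** 2                     # power scale
    decibel = 10 * np.log10(power + 1e-10)     # decibel scale
    tuned = magnitude ** audio_power           # audio_power=0.5 gives the square root

    def norm(x):                               # normalize to [0, 1]
        return (x - x.min()) / (x.max() - x.min() + 1e-10)

    return norm(magnitude), norm(power), norm(decibel), norm(tuned)

mag = np.abs(np.random.randn(192, 256))        # stand-in for a spectrogram
print([round(float(a.mean()), 3) for a in amplitude_scales(mag, audio_power=0.5)])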

Backbones and Classifier Heads

Image classifiers are constructed as layers, which are grouped into a "backbone" and a "classifier head". Most of the layers form the backbone, with just the last few forming the classifier head. The backbone converts the input to a simplified representation that can then be classified by the head. Traditionally, image classifier heads used a design called MLP (multi-layer perceptron), named after the perceptron, one of the earliest neural networks. For spectrogram classification, SED (sound event detection) heads have proven to be effective.

The backbone outputs a feature map, which has time in the horizontal dimension and features in the vertical dimension. Compared to the input spectrogram, the time dimension is coarser, and frequencies are replaced with features. Typically a spectrogram has 64 to 128 samples per second, while the backbone output has only 4. MLP heads use matrix multiplication to process the whole feature map at once. SED heads process the input frames as a time series, and output a score for each frame. For example, with 3-second spectrograms and 4 frames per second, an MLP head outputs 1 score and a SED head outputs 12. The BriteKit Predictor class therefore returns two values: a score array and a frame map. The frame map is null when MLP heads are used, and it contains individual frame scores for SED heads.
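
To make the shapes concrete, here is an illustrative PyTorch sketch of attention-weighted frame pooling over a backbone feature map. It shows the general SED idea described above, not BriteKit's actual head implementations:

import torch
import torch.nn as nn

class TinySedHead(nn.Module):
    # Illustrative only: per-frame class scores plus attention weights,
    # pooled into one clip-level score per class.
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.classifier = nn.Conv1d(num_features, num_classes, kernel_size=1)
        self.attention = nn.Conv1d(num_features, num_classes, kernel_size=1)

    def forward(self, features):               # (batch, features, frames)
        frame_scores = torch.sigmoid(self.classifier(features))
        weights = torch.softmax(self.attention(features), dim=-1)
        clip_scores = (frame_scores * weights).sum(dim=-1)
        return clip_scores, frame_scores

# 3-second spectrogram at 4 frames per second -> 12 frames.
features = torch.randn(1, 256, 12)
clip_scores, frame_scores = TinySedHead(256, 10)(features)
print(clip_scores.shape, frame_scores.shape)    # (1, 10) and (1, 10, 12)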

Metrics (PR-AUC and ROC-AUC)

In the context of a bioacoustic recognizer with a given score threshold, precision is the proportion of predictions that are correct. Recall is the proportion of actual sounds that are detected. For example, suppose a recording has 100 segments containing American Crow sounds, and suppose a recognizer flags 80 segments as containing American Crow. If 60 of those flagged segments actually contain crow sounds, precision is 75% (60/80). And since 60 of the 100 actual crow segments were detected, recall is 60% (60/100). Note that raising the threshold typically increases precision but decreases recall, meaning fewer false alarms but more missed detections.

Using metrics that depend on a threshold causes problems though. For example, when comparing two models at a threshold, if one has higher recall but lower precision, how do you decide which is better? Therefore it's better to use threshold-independent metrics, the most popular of which are PR-AUC and ROC-AUC.

PR-AUC is the area under the precision-recall curve, as in this example:

Conceptually, the precision and recall are calculated at each threshold, the curve is constructed and then the area is computed. In practice a different algorithm is used, but it results in the same value.

ROC-AUC is the area under the receiver operating characteristic curve. Like the PR curve, it uses recall on one axis, but it replaces precision with 1 - FPR (the false positive rate), like this:

Both PR-AUC and ROC-AUC can be calculated using macro-averaging or micro-averaging. Macro-averaging gives all classes equal weight, while micro-averaging aggregates predictions across all classes, making it more sensitive to classes with more annotations or more false positives. ROC-AUC has the nice property that it equals the probability that a randomly selected positive example will be scored higher than a randomly selected negative example. Also, ROC-AUC has lower variance than PR-AUC. So for initial tuning, it is usually best to use ROC-AUC. However, PR-AUC is more sensitive to high-scoring false positives, and tuning it can help to reduce those. Also, users generally care more about precision and recall than about the relative ranking of randomly selected examples, so once ROC-AUC is showing good results it's best to focus on PR-AUC.
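
Both metrics and both averaging modes are available in scikit-learn, which can be handy for spot checks outside of rpt-test. A minimal multi-label sketch with made-up data:

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# One row per segment, one column per class; y_true is 0/1, y_score is in [0, 1].
y_true = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.1], [0.8, 0.7, 0.3], [0.2, 0.1, 0.6], [0.3, 0.8, 0.2]])

print("macro ROC-AUC:", roc_auc_score(y_true, y_score, average="macro"))
print("micro ROC-AUC:", roc_auc_score(y_true, y_score, average="micro"))
print("macro PR-AUC: ", average_precision_score(y_true, y_score, average="macro"))
print("micro PR-AUC: ", average_precision_score(y_true, y_score, average="micro"))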

Data Augmentation

Data augmentation reduces overfitting by artificially expanding the training dataset. During training, BriteKit randomly transforms spectrograms, so the model sees different variations of the same training samples across epochs. This helps the model generalize to real-world recordings.

Augmentation Types

BriteKit supports the following augmentation types:

| Augmentation | Description |
|--------------|-------------|
| add_real_noise | Merges a noise spectrogram from the training database with the current spectrogram. Requires a "Noise" class in the training data. |
| add_white_noise | Adds random Gaussian noise to the spectrogram. |
| blur | Applies Gaussian blur to soften spectrogram details. |
| flip_horizontal | Flips the spectrogram horizontally (time reversal). |
| freq_mask | Masks random frequency bands with zeros (SpecAugment-style). |
| shift_horizontal | Shifts the spectrogram left or right in time. |
| speckle | Adds multiplicative noise (speckle pattern). |
| time_mask | Masks random time segments with zeros (SpecAugment-style). |

Configuration

Augmentation is enabled by default. To configure it, create a YAML file like this:

train:
  augment: true
  max_augmentations: 1
  prob_simple_merge: 0.32
  prob_fade1: 0.5
  min_fade1: 0.1
  max_fade1: 1.0
  augmentations:
    - name: add_real_noise
      prob: 0.34
      params:
        prob_fade2: 0.5
        min_fade2: 0.2
        max_fade2: 0.8
    - name: shift_horizontal
      prob: 0.6
      params:
        max_shift: 8

The prob field specifies the probability of applying each augmentation. Set prob: 0 to disable an augmentation. The params field contains augmentation-specific parameters.

When prob_simple_merge is non-zero, there is a chance of merging two spectrograms from different classes. This teaches the model to recognize overlapping sounds. The prob_fade1, min_fade1 and max_fade1 parameters control random fading applied after augmentation.
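
The merge-and-fade idea is easy to illustrate with numpy. This sketch is a simplified stand-in for BriteKit's internal augmentation code; the parameter names follow the YAML above, but the exact merging behavior is an assumption:

import numpy as np

rng = np.random.default_rng()

def simple_merge(spec_a, spec_b, min_fade=0.1, max_fade=1.0):
    # Scale the second spectrogram by a random fade factor, merge by taking
    # the element-wise maximum, then re-normalize to [0, 1].
    fade = rng.uniform(min_fade, max_fade)
    merged = np.maximum(spec_a, fade * spec_b)
    return merged / max(float(merged.max()), 1e-10)

spec_a = rng.random((192, 256))   # stand-ins for two training spectrograms
spec_b = rng.random((192, 256))
print(simple_merge(spec_a, spec_b).shape)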

Practical Tips

  • Start with the default augmentation settings and tune them later.
  • The add_real_noise augmentation requires spectrograms labeled as "Noise" in your training database. These should contain background noise without target sounds.
  • For small datasets, more aggressive augmentation can help. Try increasing max_augmentations to 2.
  • Time and frequency masking (time_mask, freq_mask) are based on SpecAugment and can improve robustness.
  • Use the tune command to optimize augmentation parameters for your specific application.

Development Environment

These instructions have been tested in Linux only. To create a BriteKit development environment, install hatch at the user level (not in a virtual environment), and verify it's installed:

sudo apt install pipx -y
pipx ensurepath
pipx install hatch
hatch --version

Then clone the BriteKit repo and type the following inside the BriteKit directory:

hatch env create britekit
hatch shell

That will take some time to run. After that, the britekit package will be installed and you will be able to modify code and test without re-installing it. To confirm the package is available, type:

python -c "import britekit; print(britekit.__version__)"

To run the unit tests, type:

pytest

To activate the environment in future, just type hatch shell, which should run almost instantly. The virtual environment is managed by hatch and defined by the pyproject.toml file. If that file changes and you want to update your environment, you first need to type exit to leave the hatch shell if you are in it. Then type:

hatch env remove britekit
hatch env create britekit
hatch shell

The .gitignore file includes rules to ignore the data and yaml directories. If you have other inputs, outputs or scripts you'd like to keep under BriteKit but hide from git, create a _local directory and put them there.

You can run a code-check in either of the following two ways:

./scripts/check_code.sh
hatch run britekit:check

In both cases, the only warning you should see is: "By default the bodies of untyped functions are not checked, consider using --check-untyped-defs".
