This is the source code for my master's degree at École de Technologie Supérieure, in partnership with Desjardins.
The goal was to create a proof of concept to determine whether it is possible to anonymize French audio recordings.
The code is organized as follows:
- fa: Code used to evaluate two Forced Alignment (FA) algorithms
- ner: Code used to train and evaluate three Named Entity Recognition (NER) models
- pipeline: Code used to create the Docker image that anonymizes audio recordings
- annotations: Gold annotations for the manually annotated speech corpora. See the Datasets section for more details.
The pipeline is used through a Docker image. Please refer to the official Docker documentation for details on how to install Docker.
First, build the Docker image.
cd pipeline
docker build --tag pipeline .
Second, download the trained NER models from Zenodo.
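For example, assuming the models are published as a single archive on the Zenodo record (the file URL and archive name below are placeholders, not the actual values), you could download and extract them into the directory that will later be mounted as /ner_models:

wget -O ner_models.zip [ZENODO_RECORD_FILE_URL]
unzip ner_models.zip -d [PATH_TO_NER_MODELS_DIR]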
Third, allow the Docker container to use your GPU by installing the NVIDIA Container Runtime.
# Add the NVIDIA container runtime repository and its signing key
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list |\
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
# Install the runtime and restart the Docker daemon
sudo apt-get update
sudo apt-get install nvidia-container-runtime
sudo systemctl stop docker
sudo systemctl start docker
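To check that containers can access the GPU, you can optionally run nvidia-smi inside a CUDA base image (the image tag below is only an example; any CUDA base image available on Docker Hub will do):

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi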
Now, you can use the pipeline.
docker run -it -v [PATH_TO_DATA_TO_ANONYMIZE]:/input \
-v [PATH_TO_TMP_FA_ALGO_OUTPUT]:/align \
-v [PATH_TO_NER_MODELS_DIR]:/ner_models \
-v [PATH_TO_PIPELINE_OUTPUT]:/redact \
--gpus device=0 pipeline
The input directory contains the audio files with their corresponding transcriptions.
Audio files must be in WAV format and transcriptions in TextGrid format. Both files must have the same base name. For example, if the audio file is named example.wav, the transcription must be named example.TextGrid.
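For instance, the directory mounted as /input could look like this:

[PATH_TO_DATA_TO_ANONYMIZE]/
├── example.wav
└── example.TextGrid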
Here is an example of a TextGrid file:
File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0.000000
xmax = 4.803000
tiers? <exists>
size = 1
item []:
item [1]:
class = "IntervalTier"
name = "spkr_1_1-trans"
xmin = 0.000000
xmax = 4.803000
intervals: size = 3
intervals [1]:
xmin = 0.000000
xmax = 0.500000
text = ""
intervals [2]:
xmin = 0.500000
xmax = 4.303000
text = "This is an example of someone talking for approximately four seconds"
intervals [3]:
xmin = 4.303000
xmax = 4.803000
text = ""
Note that, to work directly with our pipeline, the tier containing the transcription must be named spkr_1_1-trans.
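If your TextGrid files use a different tier name, one simple way to rename the tier is a text substitution on the name line (the original tier name below is a hypothetical placeholder to replace with your own):

sed -i 's/name = "[ORIGINAL_TIER_NAME]"/name = "spkr_1_1-trans"/' example.TextGrid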
I used two datasets for this project.
The first one is FrenNER, used to train the NER models.
The second one is based on NCCFR and is used to evaluate the FA algorithms and the pipeline. For more details on how to generate the dataset, check the annotations folder.
Part of the matching between the predictions and the gold-standard annotations was done by hand to ensure a reliable evaluation of the pipeline.