- checkpoints — directory for locally stored model weights; each model has its own subdirectory
- data — directory for datasets
- image — images displayed in the README.md
- models — Python files containing model classes
- notebooks — Jupyter notebooks with experiments
  - base_analysis — EDA
  - 2d_mfcc — training a CNN on MFCC
  - 2d_spectrogramm — training a CNN on spectrograms
  - catboost — CatBoost training
  - data_splitting — data splitting
  - ml_test_all_features — experiments with classical ML
  - swishnet — training SwishNet on audio chunks
  - wav2vec_train — training Wav2Vec
  - wav2vec_test — testing Wav2Vec
- python scripts — code for running on a remote cluster
- src — helper functions/classes and Streamlit web app
Assistive systems are among the most in-demand applications of machine learning.
Even today, some doctors use artificial intelligence in their daily practice: it simplifies diagnosis
and enables personalized treatment for each patient.
Our work focuses on building a model that predicts whether a patient has aphasia. Aphasia is a language disorder
that impairs the production and comprehension of speech. It often occurs after a stroke, traumatic brain injury, or diseases
of the central nervous system. The condition can severely impact a person's ability to communicate,
especially in elderly individuals. However, if therapy starts early enough, recovery is possible,
so a tool that can detect the first signs of aphasia is crucial.
The dataset was provided by the Laboratory of Social Cognitive Informatics. It includes 353 participants with aphasia
and 101 without, with roughly two audio recordings per participant. Participants span different age groups:
the average age of aphasic participants is 58 and their age distribution is close to normal, while the non-aphasic group's
age distribution is more uniform, containing both young and elderly subjects.
Below is the distribution of aphasia severity:

*Figure: distribution of aphasia severity*
As a baseline, classical machine learning was chosen, since in some cases it is sufficient.
FLAML was used because it automatically selects models and tunes their hyperparameters.
Feature sets included MFCC+ZCR, Prosody Features+ZCR, and a combination of several types
(MFCC, Chromagram, Spectral Features, Prosody Features, ZCR, Timestamps). Additionally, Optuna was used to tune CatBoost.
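A minimal sketch of this AutoML step, assuming the acoustic features have already been extracted into a tabular matrix (the synthetic data and variable names here are illustrative, not the repo's actual pipeline):

```python
from flaml import AutoML
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted acoustic features (MFCC + ZCR, etc.).
X, y = make_classification(n_samples=500, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    metric="roc_auc",
    time_budget=60,  # seconds FLAML may spend on model search and tuning
)
print(automl.best_estimator, automl.best_config)
y_pred = automl.predict(X_test)
```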
MFCCs represent audio as a compact set of coefficients per time frame: the power spectrogram is passed
through a mel filter bank, the result is log-scaled, and a discrete cosine transform is applied.
The mel scale approximates how human hearing resolves frequencies (as in Mel-spectrograms),
so for speech recordings this representation captures relevant
speech-related features.
In the literature, both classical ML and 1D CNNs are commonly used with MFCCs.
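For illustration, per-recording MFCC + ZCR features can be computed with librosa roughly as follows (the path and the choice of summary statistics are assumptions, not the repo's exact code):

```python
import librosa
import numpy as np

# Illustrative path; the recording is resampled to 16 kHz.
signal, sr = librosa.load("data/recording.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # shape (13, n_frames)
zcr = librosa.feature.zero_crossing_rate(signal)         # shape (1, n_frames)

# Collapse the time axis into summary statistics so that every recording
# becomes a fixed-length vector suitable for classical ML models.
features = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    zcr.mean(axis=1), zcr.std(axis=1),
])
```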
One straightforward idea is to feed raw audio directly into a transformer; this is exactly what Wav2Vec does:

*Figure: Wav2Vec architecture*
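A minimal sketch of running a Wav2Vec 2.0 classifier via the transformers library; the `facebook/wav2vec2-base` checkpoint and the two-label head are assumptions for illustration (the actual fine-tuned weights live under `checkpoints`):

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

name = "facebook/wav2vec2-base"  # illustrative base checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
# num_labels=2: aphasia vs. no aphasia. The classification head starts
# randomly initialized and is what fine-tuning trains.
model = Wav2Vec2ForSequenceClassification.from_pretrained(name, num_labels=2)

waveform = torch.randn(16000 * 5)  # stand-in for a 5-second, 16 kHz recording
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prob_aphasia = logits.softmax(dim=-1)[0, 1].item()
```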
Spectrograms remain one of the most commonly used audio representations, so it was reasonable to test them as well.
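For reference, a log-mel spectrogram like the ones fed to the 2D CNNs can be computed as follows (all parameters are illustrative):

```python
import librosa
import numpy as np

signal, sr = librosa.load("data/recording.wav", sr=16000)  # illustrative path
S = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)  # log scale, shape (128, n_frames)
```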
Various methods were tested. For the final Streamlit application,
Wav2Vec was chosen for its accuracy, and MobileNet on MFCC for its speed and good performance.
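A sketch, under assumptions, of how an ImageNet-style MobileNet can be adapted to single-channel MFCC "images" (the layer surgery and input shape are illustrative, not necessarily the repo's exact model):

```python
import torch
import torchvision

# Two output classes: aphasia vs. no aphasia.
model = torchvision.models.mobilenet_v2(num_classes=2)
# Swap the stem convolution so the network accepts 1 channel instead of RGB.
model.features[0][0] = torch.nn.Conv2d(
    1, 32, kernel_size=3, stride=2, padding=1, bias=False
)

batch = torch.randn(8, 1, 40, 128)  # (batch, channels, n_mfcc, frames)
logits = model(batch)               # shape (8, 2)
```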
Although the classifier itself is complete, there is still room for exploration. For example,
severity prediction remains an open goal. If more datasets in other languages were publicly available,
we could train on them and test on Russian speech to see whether the model focuses more on what is said or how it is said.
- SwishNet: A Fast Convolutional Neural Network for Speech, Music and Noise Classification and Segmentation. Md. Shamim Hussain and Mohammad Ariful Haque. 2018.
- Automatic Assessment of Aphasic Speech Sensed by Audio Sensors for Classification into Aphasia Severity Levels to Recommend Speech Therapies. Herath Mudiyanselage Dhammike Piyumal Madhurajith Herath, Weraniyagoda Arachchilage Sahanaka Anuththara Weraniyagoda, Rajapakshage Thilina Madhushan Rajapaksha, Patikiri Arachchige Don Shehan Nilmantha Wijesekara, Kalupahana Liyanage Kushan Sudheera and Peter Han Joo Chong. 2022.
- A Comparison of Data Augmentation Methods in Voice Pathology Detection. Farhad Javanmardi, Sudarsana Reddy Kadiri and Paavo Alku. 2022.
- Predicting Severity in People with Aphasia: A Natural Language Processing and Machine Learning Approach. Marjory Day, Rupam Kumar Dey, Matthew Baucum, Eun Jin Paek, Hyejin Park and Anahita Khojandi. 2021.
- An End-to-End Approach to Automatic Speech Assessment for Cantonese-speaking People with Aphasia. Ying Qin, Yuzhong Wu, Tan Lee and Anthony Pak Hin Kong. 2019.

