Total observations: 7442
Number of features: 41
Summary statistics for the numeric features (count = 7,442 for every column):

Feature                     Mean       Std        Min        25%        50%        75%        Max
---------------------------------------------------------------------------------------------------
actor_id                    1046.08    26.24      1001.00    1023.00    1046.00    1069.00    1091.00
audio_duration (s)          2.54       0.51       1.27       2.20       2.50       2.84       5.01
sample_rate (Hz)            22050      0.00       22050      22050      22050      22050      22050
mfcc_1_mean                 -387.89    56.91      -1131.37   -428.00    -399.77    -354.02    -162.54
mfcc_1_std                  81.15      30.24      0.0001     58.29      76.24      101.77     179.53
mfcc_2_mean                 131.25     15.56      0.00       122.30     134.07     142.58     167.17
mfcc_2_std                  26.35      6.19       0.00       22.11      25.89      30.23      63.15
mfcc_3_mean                 7.23       11.61      -52.34     0.67       8.94       15.25      38.95
mfcc_3_std                  31.62      11.70      0.00       22.75      30.02      39.46      73.55
mfcc_4_mean                 50.16      11.13      0.00       42.66      50.71      58.22      83.30
...                         ...        ...        ...        ...        ...        ...        ...
mfcc_13_std                 6.49       1.68       0.00       5.39       6.22       7.24       24.73
spectral_centroid_mean      1391.39    254.03     0.00       1213.18    1335.98    1510.74    2873.93
spectral_centroid_std       569.86     284.31     0.00       356.60     507.37     725.22     1699.91
spectral_rolloff_mean       2959.97    471.24     0.00       2648.81    2926.67    3211.56    5258.16
spectral_bandwidth_mean     1748.98    115.95     0.00       1679.01    1748.88    1815.43    2163.02
rms_mean                    0.0275     0.0283     0.0000     0.0109     0.0167     0.0320     0.2230
rms_std                     0.0272     0.0317     0.0000     0.0080     0.0150     0.0331     0.2202
zcr_mean                    0.0633     0.0238     0.0000     0.0472     0.0566     0.0720     0.2338
chroma_mean                 0.3895     0.0453     0.0000     0.3591     0.3899     0.4201     0.5538
chroma_std                  0.3011     0.0123     0.0000     0.2935     0.3015     0.3096     0.3351

(The describe() output covers 38 numeric columns; mfcc_4_std through mfcc_13_mean are elided above, as in the original printout.)
Missing values per column:
actor_id 0
sentence 0
emotion 0
intensity 0
audio_duration 0
sample_rate 0
mfcc_1_mean 0
mfcc_1_std 0
mfcc_2_mean 0
mfcc_2_std 0
mfcc_3_mean 0
mfcc_3_std 0
mfcc_4_mean 0
mfcc_4_std 0
mfcc_5_mean 0
mfcc_5_std 0
mfcc_6_mean 0
mfcc_6_std 0
mfcc_7_mean 0
mfcc_7_std 0
mfcc_8_mean 0
mfcc_8_std 0
mfcc_9_mean 0
mfcc_9_std 0
mfcc_10_mean 0
mfcc_10_std 0
mfcc_11_mean 0
mfcc_11_std 0
mfcc_12_mean 0
mfcc_12_std 0
mfcc_13_mean 0
mfcc_13_std 0
spectral_centroid_mean 0
spectral_centroid_std 0
spectral_rolloff_mean 0
spectral_bandwidth_mean 0
rms_mean 0
rms_std 0
zcr_mean 0
chroma_mean 0
chroma_std 0
dtype: int64
Acoustic Emotion Recognition
Classification accuracy of supervised machine learning methods for recognizing emotion from acoustic features
Abstract
This study investigates the capability of machine learning models to classify emotional states from acoustic features. Using the CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset), we focused on six target emotions: neutral, happy, sad, angry, fear, and disgust. Numerical audio features were extracted via librosa, standardized, and reduced using PCA to retain 98% of the variance. We evaluated and compared a standalone algorithm (SVM), an ensemble method (Random Forest), and a neural network (MLP) for multi-class emotion recognition. Results indicate that the MLP achieved the highest macro F1-score (0.5345), demonstrating a superior ability to capture non-linear patterns and the most balanced performance across emotions.
Introduction
Automatic emotion classification from speech remains a challenging problem due to the subtlety and variability of acoustic cues. This project leverages the CREMA-D dataset, comprising 7,442 .wav clips from 91 actors portraying six basic emotions. Prior research demonstrates that features such as MFCCs and spectral properties are informative for emotion detection, yet there is no consensus on optimal feature selection or model architecture (Banerjee, Huang, & Lettiere, n.d.).
We transformed raw audio into quantitative features using librosa and applied a standardized preprocessing pipeline. The goal is to assess the effectiveness of both traditional and neural network-based models for multi-class emotion recognition, providing a comparative evaluation of standalone algorithms, ensemble methods, and neural networks.
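The report does not reproduce the extraction script, but the feature names in the data summary suggest per-clip aggregation of frame-level descriptors. The sketch below is a minimal, assumed reconstruction using librosa; the aggregation choices (mean and standard deviation over frames) and the helper name extract_features are illustrative rather than the project's actual code.

import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=13):
    # Load the clip at a fixed sample rate and record its duration
    y, sr = librosa.load(path, sr=sr)
    feats = {"audio_duration": librosa.get_duration(y=y, sr=sr), "sample_rate": sr}
    # 13 MFCCs, each summarized by its mean and standard deviation over frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    for i in range(n_mfcc):
        feats[f"mfcc_{i+1}_mean"] = float(np.mean(mfcc[i]))
        feats[f"mfcc_{i+1}_std"] = float(np.std(mfcc[i]))
    # Spectral, energy, and chroma descriptors aggregated the same way
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    feats["spectral_centroid_mean"] = float(np.mean(centroid))
    feats["spectral_centroid_std"] = float(np.std(centroid))
    feats["spectral_rolloff_mean"] = float(np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr)))
    feats["spectral_bandwidth_mean"] = float(np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr)))
    rms = librosa.feature.rms(y=y)
    feats["rms_mean"], feats["rms_std"] = float(np.mean(rms)), float(np.std(rms))
    feats["zcr_mean"] = float(np.mean(librosa.feature.zero_crossing_rate(y)))
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    feats["chroma_mean"], feats["chroma_std"] = float(np.mean(chroma)), float(np.std(chroma))
    return feats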
Research Question
- Q1. What is the classification accuracy of supervised methods for emotion recognition from acoustic features?
- Q2. How do standalone algorithms, ensemble approaches, and neural networks compare in terms of accuracy, robustness, and computational efficiency for emotion classification from audio?
Exploratory Analysis
Initial Data Load & Observations
Target Class Distribution
We analyzed class distribution across the six emotions (neutral, happy, sad, angry, disgust, and fear). Our analysis showed that the “neutral” class has approximately 14.5% fewer samples than the other classes. Given this mild imbalance, no resampling was performed.
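A minimal check of this distribution, assuming the extracted features live in a pandas DataFrame named df with the label column 'emotion' (both names are assumptions):

class_counts = df['emotion'].value_counts()
print(class_counts)
print((class_counts / len(df) * 100).round(1))  # share of each emotion in percent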
Data Preprocessing
Data Cleaning & Transformation
Data pre-processing included handling missing values, removing irrelevant columns, and transforming numerical features to reduce skewness. The Yeo–Johnson power transformation was applied to achieve more symmetric distributions (skewness < ±0.5), improving suitability for downstream modeling. The categorical target variable (‘emotion’) was encoded into numerical labels for compatibility with machine learning algorithms.
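A sketch of this step using scikit-learn; the DataFrame name df, the set of columns dropped as irrelevant, and the new column name emotion_encoded are assumptions:

from sklearn.preprocessing import PowerTransformer, LabelEncoder

# Drop columns treated as irrelevant for modeling (assumed names)
df = df.drop(columns=['actor_id', 'sample_rate'])

# Yeo-Johnson power transform to pull skewed feature distributions toward symmetry
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = PowerTransformer(method='yeo-johnson').fit_transform(df[numeric_cols])

# Encode the categorical target into integer labels
le = LabelEncoder()
df['emotion_encoded'] = le.fit_transform(df['emotion'])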
Splitting the dataset
The data was split using the train_test_split function, with 20% of the data reserved for testing. For the features (X), we dropped the target column and its encoded variant; for the labels (y), we retained only the encoded emotion.
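A minimal sketch of the split; the column names and the stratify and random_state arguments are assumptions (the near-proportional per-class test supports reported later suggest a stratified split):

from sklearn.model_selection import train_test_split

X = df.drop(columns=['emotion', 'emotion_encoded'])
y = df['emotion_encoded']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)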
Data Scaling and Dimension Reduction
Before training our models, we applied data scaling and dimensionality reduction to the numeric features. First, we identified all numeric columns in the dataset and applied Standard Scaling to ensure each feature has zero mean and unit variance.
Next, we performed Principal Component Analysis (PCA) to reduce dimensionality while retaining 98% of the variance. PCA transforms the scaled features into a set of orthogonal components, capturing most of the information in fewer dimensions. We evaluated the explained variance ratio for each component and visualized it using a scree plot to confirm the number of components selected.
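A sketch of the scaling and PCA steps, fitting both transforms on the training split only; the 98% variance threshold comes from the text, while the variable names and plotting details are assumptions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize to zero mean and unit variance, fit on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Keep the smallest number of components explaining at least 98% of the variance
pca = PCA(n_components=0.98)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
print(f'{pca.n_components_} components retained')

# Scree plot: cumulative explained variance per component
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()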
Model Training and Evaluation
To address our research questions on emotion recognition from acoustic features, we trained and evaluated multiple machine learning models, including standalone algorithms, ensemble methods, and neural networks. The models were trained on the PCA-reduced and standardized feature set to improve convergence, reduce dimensionality, and mitigate potential overfitting.
For each model, we used the evaluate_model function, which trains the model and reports comprehensive performance metrics. These metrics include overall accuracy, macro-averaged precision, recall, and F1-score, as well as per-class performance for each emotion in the target set. To provide a visual assessment of prediction quality, confusion matrices were generated for all models.
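The evaluate_model helper is referenced but not reproduced in this report; the sketch below is an assumed reconstruction of what it reports, built on standard scikit-learn metrics:

from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             classification_report, confusion_matrix,
                             ConfusionMatrixDisplay)

def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    # Train on the PCA-reduced features and predict the held-out test set
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Overall metrics: accuracy plus macro-averaged precision, recall, and F1
    acc = accuracy_score(y_test, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='macro')
    print(f'{name} Overall Performance:')
    print(f'Accuracy: {acc:.4f}')
    print(f'Macro Precision: {prec:.4f}')
    print(f'Macro Recall: {rec:.4f}')
    print(f'Macro F1-Score: {f1:.4f}')

    # Per-class metrics and a confusion matrix for visual inspection
    print(classification_report(y_test, y_pred, digits=4))
    ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
    return model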
Specifically, we evaluated:
Random Forest (RF): An ensemble method configured with 1000 trees, maximum depth of 15, and out-of-bag scoring to provide robust predictions while controlling for overfitting.
Support Vector Machine (SVM): A kernel-based model with RBF kernel, class-weight balancing for mild class imbalance, and probability estimates enabled.
Multilayer Perceptron (MLP): A neural network with three hidden layers (128, 64, 32 neurons), early stopping, and a 10% validation split to monitor convergence.
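The hyperparameters above map onto scikit-learn as follows; arguments not stated in the text (such as random_state) are assumptions, and the training loop reuses the evaluate_model sketch from the previous section:

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

models = {
    'Random Forest': RandomForestClassifier(n_estimators=1000, max_depth=15,
                                            oob_score=True, random_state=42),
    'SVM': SVC(kernel='rbf', class_weight='balanced', probability=True,
               random_state=42),
    'MLP(Neural Net)': MLPClassifier(hidden_layer_sizes=(128, 64, 32),
                                     early_stopping=True, validation_fraction=0.1,
                                     random_state=42),
}

for name, model in models.items():
    print(f'Training {name}')
    evaluate_model(name, model, X_train_pca, y_train, X_test_pca, y_test)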
Target emotion label mapping: {0: 'angry', 1: 'disgust', 2: 'fear', 3: 'happy', 4: 'neutral', 5: 'sad'}
Training Random Forest
==================================================
Random Forest Overall Performance:
Accuracy: 0.5104
Macro Precision: 0.5012
Macro Recall: 0.5104
Macro F1-Score: 0.4990
Random Forest Per-Class Performance:
Emotion Precision Recall F1-Score Support
--------------------------------------------------
angry 0.5947 0.7913 0.6791 254
disgust 0.4701 0.4331 0.4508 254
fear 0.4615 0.2835 0.3512 254
happy 0.5000 0.4510 0.4742 255
neutral 0.4587 0.5092 0.4826 218
sad 0.5225 0.5945 0.5562 254
Training SVM
==================================================
SVM Overall Performance:
Accuracy: 0.5285
Macro Precision: 0.5258
Macro Recall: 0.5280
Macro F1-Score: 0.5256
SVM Per-Class Performance:
Emotion Precision Recall F1-Score Support
--------------------------------------------------
angry 0.6541 0.7520 0.6996 254
disgust 0.4960 0.4921 0.4941 254
fear 0.4478 0.4724 0.4598 254
happy 0.5174 0.4667 0.4907 255
neutral 0.4825 0.5046 0.4933 218
sad 0.5571 0.4803 0.5159 254
Training MLP(Neural Net)
==================================================
MLP(Neural Net) Overall Performance:
Accuracy: 0.5400
Macro Precision: 0.5374
Macro Recall: 0.5392
Macro F1-Score: 0.5345
MLP(Neural Net) Per-Class Performance:
Emotion Precision Recall F1-Score Support
--------------------------------------------------
angry 0.6644 0.7559 0.7072 254
disgust 0.4946 0.5433 0.5178 254
fear 0.5294 0.3543 0.4245 254
happy 0.4857 0.5333 0.5084 255
neutral 0.4912 0.5092 0.5000 218
sad 0.5592 0.5394 0.5491 254
Model Comparison
We compared the performance of the three model types (Random Forest, SVM, and the multilayer perceptron) on the emotion recognition task using macro-averaged metrics and per-emotion performance. Across the overall metrics, the MLP outperformed both Random Forest and SVM, achieving the highest accuracy (0.5400), macro precision (0.5374), macro recall (0.5392), and macro F1-score (0.5345). This indicates that the neural network is the most effective of the three at capturing complex patterns in the PCA-reduced acoustic feature space.
Per-emotion analysis revealed more nuanced differences among the models. For Angry, all models performed relatively well, with the MLP achieving the highest F1-score (0.7072). Disgust and Fear were more challenging, with lower F1-scores overall; the MLP improved Disgust (0.5178 versus 0.4941 for SVM), while the SVM handled Fear best (0.4598) because the MLP's recall on that class dropped to 0.3543. For Happy and Neutral, the MLP again achieved the best F1-scores (0.5084 and 0.5000). Sad was the one class where Random Forest led (F1-score 0.5562), with the MLP close behind (0.5491).
Overall, while ensemble methods like Random Forest and kernel-based SVM provide competitive performance for certain classes, the MLP’s ability to model non-linear interactions across multiple dimensions makes it the best-performing approach in this task. These results highlight that neural network-based models may be better suited for emotion recognition from acoustic features, addressing both classification accuracy and balanced performance across all emotions.
MODEL COMPARISON SUMMARY
==================================================
Model Accuracy Precision (Macro) Recall (Macro) \
0 Random Forest 0.5104 0.5012 0.5104
1 SVM 0.5285 0.5258 0.5280
2 MLP(Neural Net) 0.5400 0.5374 0.5392
F1-Score (Macro)
0 0.4990
1 0.5256
2 0.5345
Best Performing Model: MLP(Neural Net)
F1-Score (Macro): 0.5345
PER-EMOTION PERFORMANCE ACROSS MODELS
==================================================
ANGRY Performance:
Model Precision Recall F1-Score
--------------------------------------------------
Random Forest 0.5947 0.7913 0.6791
SVM 0.6541 0.7520 0.6996
MLP(Neural Net) 0.6644 0.7559 0.7072
DISGUST Performance:
Model Precision Recall F1-Score
--------------------------------------------------
Random Forest 0.4701 0.4331 0.4508
SVM 0.4960 0.4921 0.4941
MLP(Neural Net) 0.4946 0.5433 0.5178
FEAR Performance:
Model Precision Recall F1-Score
--------------------------------------------------
Random Forest 0.4615 0.2835 0.3512
SVM 0.4478 0.4724 0.4598
MLP(Neural Net) 0.5294 0.3543 0.4245
HAPPY Performance:
Model Precision Recall F1-Score
--------------------------------------------------
Random Forest 0.5000 0.4510 0.4742
SVM 0.5174 0.4667 0.4907
MLP(Neural Net) 0.4857 0.5333 0.5084
NEUTRAL Performance:
Model Precision Recall F1-Score
--------------------------------------------------
Random Forest 0.4587 0.5092 0.4826
SVM 0.4825 0.5046 0.4933
MLP(Neural Net) 0.4912 0.5092 0.5000
SAD Performance:
Model Precision Recall F1-Score
--------------------------------------------------
Random Forest 0.5225 0.5945 0.5562
SVM 0.5571 0.4803 0.5159
MLP(Neural Net) 0.5592 0.5394 0.5491
Conclusion
This study investigated the capability of machine learning models to classify emotional states from acoustic features using the CREMA-D dataset. Across the three approaches (Random Forest, SVM, and a multilayer perceptron), the MLP demonstrated the strongest overall accuracy and macro-averaged metrics, highlighting its ability to capture complex nonlinear patterns in the PCA-reduced feature space. While the ensemble and kernel-based methods provided competitive results for certain emotions (Random Forest led on Sad and SVM on Fear), the neural network offered the most balanced performance across classes, with the best F1-scores for Angry, Disgust, Happy, and Neutral.
These findings suggest that for multi-class emotion recognition from audio, neural networks are better suited to leveraging nuanced acoustic features than standalone or ensemble methods. Future work could explore integrating temporal modeling with recurrent or transformer-based architectures, combining audio and visual modalities, or experimenting with more advanced feature extraction methods to further improve classification accuracy and robustness.
Works Cited
Banerjee, Gaurab, et al. Understanding Emotion Classification in Audio Data. Stanford CS224N Custom Project.
Livingstone, Steven R., and Frank A. Russo. “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English.” PLOS ONE, vol. 13, no. 5, 16 May 2018, p. e0196391, www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio, https://doi.org/10.1371/journal.pone.0196391. Accessed 4 Aug. 2025.
Lok, Eu Jin. “CREMA-D.” Kaggle.com, 2019, www.kaggle.com/datasets/ejlok1/cremad. Accessed 4 Aug. 2025.
El Ayadi, Moataz, et al. “Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases.” Pattern Recognition, vol. 44, no. 3, 14 Oct. 2010, pp. 572–587, ui.adsabs.harvard.edu/abs/2011PatRe..44..572E/abstract, https://doi.org/10.1016/j.patcog.2010.09.020. Accessed 19 Aug. 2025.