Acoustic Emotion Recognition

Accuracy of supervised methods for recognizing emotion from acoustic features

An emotion classification project that uses machine learning to identify emotions from speech audio recordings, comparing different algorithms to determine the most effective approach for acoustic emotion recognition.
Author

Ralph Andrade

Abstract

This study investigates the capability of machine learning models to classify emotional states from acoustic features. Using the CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset), we focused on six target emotions: neutral, happy, sad, angry, fear, and disgust. Numerical audio features were extracted via librosa, standardized, and reduced using PCA to retain 98% of variance. We evaluated and compared standalone algorithms (SVM), ensemble methods (Random Forest), and neural networks (MLP) for multi-class emotion recognition. Results indicate that the MLP achieved the highest macro F1-score (0.5345), demonstrating superior ability to capture non-linear patterns and balanced performance across all emotions.

Introduction

Automatic emotion classification from speech remains a challenging problem due to the subtlety and variability of acoustic cues. This project leverages the CREMA-D dataset, comprising 7,442 .wav clips from 91 actors portraying six basic emotions. Prior research demonstrates that features such as MFCCs and spectral properties are informative for emotion detection, yet there is no consensus on optimal feature selection or model architecture (Banerjee, Huang, & Lettiere, n.d.).

We transformed raw audio into quantitative features using librosa and applied a standardized preprocessing pipeline. The goal is to assess the effectiveness of both traditional and neural network-based models for multi-class emotion recognition, providing a comparative evaluation of standalone algorithms, ensemble methods, and neural networks.
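To make the extraction step concrete, the sketch below shows one way the per-clip summary statistics in the feature table (MFCC means and standard deviations, spectral centroid, rolloff, bandwidth, RMS energy, zero-crossing rate, and chroma) could be computed with librosa. The function name and exact aggregation choices are illustrative assumptions, not the project's actual code.

```python
import librosa

def extract_features(path):
    """Extract one fixed-length vector of summary statistics per clip.

    A simplified sketch: the real pipeline may use a different feature set
    or aggregation, but the idea is one row of features per audio file.
    """
    y, sr = librosa.load(path, sr=22050)  # clips loaded at 22.05 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    rms = librosa.feature.rms(y=y)
    zcr = librosa.feature.zero_crossing_rate(y)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)

    feats = {"audio_duration": librosa.get_duration(y=y, sr=sr), "sample_rate": sr}
    for i in range(13):
        feats[f"mfcc_{i+1}_mean"] = mfcc[i].mean()
        feats[f"mfcc_{i+1}_std"] = mfcc[i].std()
    feats.update({
        "spectral_centroid_mean": centroid.mean(),
        "spectral_centroid_std": centroid.std(),
        "spectral_rolloff_mean": rolloff.mean(),
        "spectral_bandwidth_mean": bandwidth.mean(),
        "rms_mean": rms.mean(), "rms_std": rms.std(),
        "zcr_mean": zcr.mean(),
        "chroma_mean": chroma.mean(), "chroma_std": chroma.std(),
    })
    return feats
```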

Research Question

  • Q1. What is the classification accuracy of supervised methods for emotion recognition from acoustic features?
  • Q2. How do standalone algorithms, ensemble approaches, and neural networks compare in terms of accuracy, robustness, and computational efficiency for emotion classification from audio?

Exploratory Analysis

Initial Data Load & Observations

Total observations: 7442
Number of features: 41
          actor_id  audio_duration  sample_rate  mfcc_1_mean   mfcc_1_std  \
count  7442.000000     7442.000000       7442.0  7442.000000  7442.000000   
mean   1046.084117        2.542910      22050.0  -387.893237    81.152242   
std      26.243152        0.505980          0.0    56.912883    30.241790   
min    1001.000000        1.267982      22050.0 -1131.370700     0.000122   
25%    1023.000000        2.202222      22050.0  -428.004615    58.292865   
50%    1046.000000        2.502540      22050.0  -399.767640    76.243485   
75%    1069.000000        2.836190      22050.0  -354.018697   101.771385   
max    1091.000000        5.005034      22050.0  -162.543350   179.528880   

       mfcc_2_mean   mfcc_2_std  mfcc_3_mean   mfcc_3_std  mfcc_4_mean  ...  \
count  7442.000000  7442.000000  7442.000000  7442.000000  7442.000000  ...   
mean    131.246557    26.349166     7.226425    31.621393    50.164769  ...   
std      15.557340     6.193413    11.605281    11.700738    11.128262  ...   
min       0.000000     0.000000   -52.340374     0.000000     0.000000  ...   
25%     122.297443    22.112123     0.674029    22.746978    42.658050  ...   
50%     134.065410    25.892035     8.938592    30.019723    50.710929  ...   
75%     142.583675    30.226819    15.246822    39.455707    58.216644  ...   
max     167.168330    63.146930    38.951794    73.551254    83.296500  ...   

       mfcc_13_std  spectral_centroid_mean  spectral_centroid_std  \
count  7442.000000             7442.000000            7442.000000   
mean      6.486291             1391.389433             569.861009   
std       1.678969              254.030203             284.309588   
min       0.000000                0.000000               0.000000   
25%       5.387114             1213.176905             356.601232   
50%       6.221665             1335.982229             507.373775   
75%       7.244229             1510.743971             725.217080   
max      24.734776             2873.927831            1699.906329   

       spectral_rolloff_mean  spectral_bandwidth_mean     rms_mean  \
count            7442.000000              7442.000000  7442.000000   
mean             2959.971329              1748.984424     0.027548   
std               471.242990               115.947135     0.028312   
min                 0.000000                 0.000000     0.000000   
25%              2648.811001              1679.011486     0.010933   
50%              2926.667949              1748.878905     0.016707   
75%              3211.563802              1815.428594     0.032007   
max              5258.158543              2163.024688     0.223023   

           rms_std     zcr_mean  chroma_mean   chroma_std  
count  7442.000000  7442.000000  7442.000000  7442.000000  
mean      0.027249     0.063343     0.389487     0.301066  
std       0.031667     0.023750     0.045327     0.012338  
min       0.000000     0.000000     0.000000     0.000000  
25%       0.008040     0.047160     0.359097     0.293473  
50%       0.015029     0.056566     0.389878     0.301533  
75%       0.033126     0.072023     0.420108     0.309572  
max       0.220164     0.233774     0.553840     0.335121  

[8 rows x 38 columns]
Missing values per column:
 actor_id                   0
sentence                   0
emotion                    0
intensity                  0
audio_duration             0
sample_rate                0
mfcc_1_mean                0
mfcc_1_std                 0
mfcc_2_mean                0
mfcc_2_std                 0
mfcc_3_mean                0
mfcc_3_std                 0
mfcc_4_mean                0
mfcc_4_std                 0
mfcc_5_mean                0
mfcc_5_std                 0
mfcc_6_mean                0
mfcc_6_std                 0
mfcc_7_mean                0
mfcc_7_std                 0
mfcc_8_mean                0
mfcc_8_std                 0
mfcc_9_mean                0
mfcc_9_std                 0
mfcc_10_mean               0
mfcc_10_std                0
mfcc_11_mean               0
mfcc_11_std                0
mfcc_12_mean               0
mfcc_12_std                0
mfcc_13_mean               0
mfcc_13_std                0
spectral_centroid_mean     0
spectral_centroid_std      0
spectral_rolloff_mean      0
spectral_bandwidth_mean    0
rms_mean                   0
rms_std                    0
zcr_mean                   0
chroma_mean                0
chroma_std                 0
dtype: int64

Target Class Distribution

We analyzed class distribution across the six emotions (neutral, happy, sad, angry, disgust, and fear). Our analysis showed that the “neutral” class has approximately 14.5% fewer samples than the other classes. Given this mild imbalance, no resampling was performed.
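A minimal sketch of this check, assuming the extracted features are held in a pandas DataFrame named df with an emotion column:

```python
import pandas as pd

# Inspect relative class frequencies to decide whether resampling is needed
# (df and the 'emotion' column name are assumptions about the project's DataFrame).
print(df["emotion"].value_counts(normalize=True).round(3))
```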

Data Preprocessing

Data Cleaning & Transformation

Data pre-processing included handling missing values, removing irrelevant columns, and transforming numerical features to reduce skewness. The Yeo–Johnson power transformation was applied to achieve more symmetric distributions (skewness < ±0.5), improving suitability for downstream modeling. The categorical target variable (‘emotion’) was encoded into numerical labels for compatibility with machine learning algorithms.
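A sketch of this transformation step, assuming the same DataFrame df; the exact set of columns transformed is an assumption:

```python
from sklearn.preprocessing import PowerTransformer, LabelEncoder

# Yeo-Johnson transform of the skewed numeric features
# (excluding identifier-like columns is an illustrative choice).
numeric_cols = df.select_dtypes(include="number").columns.drop(
    ["actor_id", "sample_rate"], errors="ignore")
pt = PowerTransformer(method="yeo-johnson")
df[numeric_cols] = pt.fit_transform(df[numeric_cols])

# Encode the categorical target into integer labels
# (alphabetical order matches the mapping printed in the results below).
le = LabelEncoder()
df["emotion_encoded"] = le.fit_transform(df["emotion"])
```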

Splitting the dataset

The data was split using the train_test_split function, with 20% of the data reserved for testing. For the features (X), we dropped the target column and its encoded variant; for the labels (y), we retained only the encoded emotion.
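A minimal sketch of this split; the stratify and random_state arguments are assumptions added for reproducibility:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["emotion", "emotion_encoded"])
y = df["emotion_encoded"]

# Stratification keeps per-emotion proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
```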

Data Scaling and Dimension Reduction

Before training our models, we applied data scaling and dimensionality reduction to the numeric features. First, we identified all numeric columns in the dataset and applied Standard Scaling to ensure each feature has zero mean and unit variance.

Next, we performed Principal Component Analysis (PCA) to reduce dimensionality while retaining 98% of the variance. PCA transforms the scaled features into a set of orthogonal components, capturing most of the information in fewer dimensions. We evaluated the explained variance ratio for each component and visualized it using a scree plot to confirm the number of components selected.
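The following sketch shows one way to implement this scaling and PCA step with scikit-learn, continuing from the split above; fitting on the training split only and the plotting details are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale numeric features on the training split, then apply the same transform to the test split.
numeric_cols = X_train.select_dtypes(include="number").columns
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[numeric_cols])
X_test_scaled = scaler.transform(X_test[numeric_cols])

# Keep enough principal components to explain 98% of the variance.
pca = PCA(n_components=0.98)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Scree plot of per-component explained variance.
plt.plot(np.arange(1, pca.n_components_ + 1), pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.title(f"PCA scree plot ({pca.n_components_} components retained)")
plt.show()
```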

Model Training and Evaluation

To address our research questions on emotion recognition from acoustic features, we trained and evaluated multiple machine learning models, including standalone algorithms, ensemble methods, and neural networks. The models were trained on the PCA-reduced and standardized feature set to improve convergence, reduce dimensionality, and mitigate potential overfitting.

For each model, we used the evaluate_model function, which trains the model and reports comprehensive performance metrics. These metrics include overall accuracy, macro-averaged precision, recall, and F1-score, as well as per-class performance for each emotion in the target set. To provide a visual assessment of prediction quality, confusion matrices were generated for all models.
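The sketch below approximates what such an evaluate_model helper could look like using scikit-learn metrics; the project's actual implementation may differ in signature and output format:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             classification_report, ConfusionMatrixDisplay)

def evaluate_model(model, name, X_train, y_train, X_test, y_test, class_names):
    """Train a model and report overall and per-class metrics.

    An approximation of the project's evaluate_model helper; details may differ.
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="macro")
    print(f"{name} Overall Performance:")
    print(f"Accuracy: {acc:.4f}  Macro Precision: {prec:.4f}  "
          f"Macro Recall: {rec:.4f}  Macro F1-Score: {f1:.4f}")

    # Per-class breakdown and confusion matrix.
    print(classification_report(y_test, y_pred, target_names=class_names))
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=class_names)
    plt.title(f"{name} confusion matrix")
    plt.show()
    return {"model": name, "accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```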

Specifically, we evaluated:

  • Random Forest (RF): An ensemble method configured with 1000 trees, maximum depth of 15, and out-of-bag scoring to provide robust predictions while controlling for overfitting.

  • Support Vector Machine (SVM): A kernel-based model with RBF kernel, class-weight balancing for mild class imbalance, and probability estimates enabled.

  • Multilayer Perceptron (MLP): A neural network with three hidden layers (128, 64, 32 neurons), early stopping, and a 10% validation split to monitor convergence.
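The listed configurations translate directly into scikit-learn estimators. The sketch below mirrors the settings above and reuses the hypothetical evaluate_model helper and PCA-reduced splits from the earlier sketches; any hyperparameter not stated in the text (for example, random_state) is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=1000, max_depth=15, oob_score=True, random_state=42),
    "SVM": SVC(
        kernel="rbf", class_weight="balanced", probability=True, random_state=42),
    "MLP(Neural Net)": MLPClassifier(
        hidden_layer_sizes=(128, 64, 32), early_stopping=True,
        validation_fraction=0.10, random_state=42),
}

results = [
    evaluate_model(model, name, X_train_pca, y_train, X_test_pca, y_test,
                   class_names=["angry", "disgust", "fear", "happy", "neutral", "sad"])
    for name, model in models.items()
]
```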

Target emotion label mapping: {0: 'angry', 1: 'disgust', 2: 'fear', 3: 'happy', 4: 'neutral', 5: 'sad'}
Training Random Forest
==================================================

Random Forest Overall Performance:
Accuracy: 0.5104
Macro Precision: 0.5012
Macro Recall: 0.5104
Macro F1-Score: 0.4990

Random Forest Per-Class Performance:
Emotion    Precision  Recall     F1-Score   Support   
--------------------------------------------------
angry      0.5947     0.7913     0.6791     254       
disgust    0.4701     0.4331     0.4508     254       
fear       0.4615     0.2835     0.3512     254       
happy      0.5000     0.4510     0.4742     255       
neutral    0.4587     0.5092     0.4826     218       
sad        0.5225     0.5945     0.5562     254       
Training SVM
==================================================

SVM Overall Performance:
Accuracy: 0.5285
Macro Precision: 0.5258
Macro Recall: 0.5280
Macro F1-Score: 0.5256

SVM Per-Class Performance:
Emotion    Precision  Recall     F1-Score   Support   
--------------------------------------------------
angry      0.6541     0.7520     0.6996     254       
disgust    0.4960     0.4921     0.4941     254       
fear       0.4478     0.4724     0.4598     254       
happy      0.5174     0.4667     0.4907     255       
neutral    0.4825     0.5046     0.4933     218       
sad        0.5571     0.4803     0.5159     254       
Training MLP(Neural Net)
==================================================

MLP(Neural Net) Overall Performance:
Accuracy: 0.5400
Macro Precision: 0.5374
Macro Recall: 0.5392
Macro F1-Score: 0.5345

MLP(Neural Net) Per-Class Performance:
Emotion    Precision  Recall     F1-Score   Support   
--------------------------------------------------
angry      0.6644     0.7559     0.7072     254       
disgust    0.4946     0.5433     0.5178     254       
fear       0.5294     0.3543     0.4245     254       
happy      0.4857     0.5333     0.5084     255       
neutral    0.4912     0.5092     0.5000     218       
sad        0.5592     0.5394     0.5491     254       

Model Comparison

We compared the performance of three model types (Random Forest, SVM, and a multilayer perceptron) on the emotion recognition task using macro-averaged metrics and per-emotion performance. Across overall metrics, the MLP outperformed both Random Forest and SVM, achieving the highest accuracy (0.5400), macro precision (0.5374), macro recall (0.5392), and macro F1-score (0.5345). This indicates that the neural network is most effective at capturing complex patterns in the PCA-reduced acoustic feature space.

Per-emotion analysis revealed nuanced differences among models. For Angry, all models performed relatively well, with the MLP achieving the highest F1-score (0.7072). Disgust and Fear were more challenging emotions, with lower F1-scores overall; the MLP gave the best Disgust F1-score (0.5178), while the SVM was strongest on Fear (0.4598). For Happy and Neutral, the MLP again showed the best F1-scores (0.5084 and 0.5000), driven largely by improved recall on Happy (0.5333). For Sad, Random Forest achieved the highest F1-score (0.5562), narrowly ahead of the MLP (0.5491).

Overall, while ensemble methods like Random Forest and kernel-based SVM provide competitive performance for certain classes, the MLP’s ability to model non-linear interactions across multiple dimensions makes it the best-performing approach in this task. These results highlight that neural network-based models may be better suited for emotion recognition from acoustic features, addressing both classification accuracy and balanced performance across all emotions.

MODEL COMPARISON SUMMARY
==================================================
             Model  Accuracy  Precision (Macro)  Recall (Macro)  \
0    Random Forest    0.5104             0.5012          0.5104   
1              SVM    0.5285             0.5258          0.5280   
2  MLP(Neural Net)    0.5400             0.5374          0.5392   

   F1-Score (Macro)  
0            0.4990  
1            0.5256  
2            0.5345  

Best Performing Model: MLP(Neural Net)
F1-Score (Macro): 0.5345
PER-EMOTION PERFORMANCE ACROSS MODELS
==================================================

ANGRY Performance:
Model           Precision  Recall     F1-Score  
--------------------------------------------------
Random Forest   0.5947     0.7913     0.6791    
SVM             0.6541     0.7520     0.6996    
MLP(Neural Net) 0.6644     0.7559     0.7072    

DISGUST Performance:
Model           Precision  Recall     F1-Score  
--------------------------------------------------
Random Forest   0.4701     0.4331     0.4508    
SVM             0.4960     0.4921     0.4941    
MLP(Neural Net) 0.4946     0.5433     0.5178    

FEAR Performance:
Model           Precision  Recall     F1-Score  
--------------------------------------------------
Random Forest   0.4615     0.2835     0.3512    
SVM             0.4478     0.4724     0.4598    
MLP(Neural Net) 0.5294     0.3543     0.4245    

HAPPY Performance:
Model           Precision  Recall     F1-Score  
--------------------------------------------------
Random Forest   0.5000     0.4510     0.4742    
SVM             0.5174     0.4667     0.4907    
MLP(Neural Net) 0.4857     0.5333     0.5084    

NEUTRAL Performance:
Model           Precision  Recall     F1-Score  
--------------------------------------------------
Random Forest   0.4587     0.5092     0.4826    
SVM             0.4825     0.5046     0.4933    
MLP(Neural Net) 0.4912     0.5092     0.5000    

SAD Performance:
Model           Precision  Recall     F1-Score  
--------------------------------------------------
Random Forest   0.5225     0.5945     0.5562    
SVM             0.5571     0.4803     0.5159    
MLP(Neural Net) 0.5592     0.5394     0.5491    

Conclusion

This study investigated the capability of machine learning models to classify emotional states from acoustic features using the CREMA-D dataset. Among the approaches evaluated (Random Forest, SVM, and a multilayer perceptron), the MLP demonstrated the strongest overall performance, highlighting its ability to capture complex non-linear patterns in the PCA-reduced feature space. While the ensemble and kernel-based methods provided competitive results for certain emotions, the neural network offered the most balanced performance across classes, with the largest gains on challenging emotions such as Disgust and Neutral.

These findings suggest that, for multi-class emotion recognition from audio, neural networks are better suited to leveraging nuanced acoustic features than standalone or ensemble methods. Future work could explore temporal modeling with recurrent neural networks or transformer-based architectures, combining audio and visual modalities, or more advanced feature extraction methods to further improve classification accuracy and robustness.

Works Cited

Banerjee, Gaurab, et al. Understanding Emotion Classification in Audio Data. Stanford CS224N Custom Project, n.d.

Livingstone, Steven R., and Frank A. Russo. “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English.” PLOS ONE, vol. 13, no. 5, 16 May 2018, p. e0196391, www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio, https://doi.org/10.1371/journal.pone.0196391. Accessed 4 Aug. 2025.

Lok, Eu Jin. “CREMA-D.” Kaggle.com, 2019, www.kaggle.com/datasets/ejlok1/cremad. Accessed 4 Aug. 2025.

El Ayadi, Moataz, et al. “Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases.” Pattern Recognition, vol. 44, no. 3, 14 Oct. 2010, pp. 572–587, https://doi.org/10.1016/j.patcog.2010.09.020. Accessed 19 Aug. 2025.