Acoustic Emotion Recognition

Ralph Andrade

Abstract

  • Dataset: CREMA-D, six emotions (neutral, happy, sad, angry, fear, disgust)
  • Features: extracted with librosa, standardized, reduced with PCA (98% variance retained)
  • Models: SVM, Random Forest, MLP
  • MLP achieved macro F1-score: 0.5534

Introduction

  • Automatic emotion classification is challenging
  • CREMA-D: 7,442 clips, 91 actors
  • Features: MFCCs, spectral properties
  • Goal: Compare a traditional classifier (SVM), an ensemble method (Random Forest), and a neural network (MLP)

Research Question

  • Q1. What classification accuracy can supervised machine learning methods achieve for emotion recognition from acoustic features?
  • Q2. How do standalone algorithms, ensemble approaches, and neural networks compare in terms of accuracy, robustness, and computational efficiency for emotion classification from audio?

Exploratory Analysis

EDA

Code
# number of observations and variables in the data
print(f"Total observations: {df.shape[0]}")
print(f"Number of features: {df.shape[1]}")

# count missing values in each column
missing_df = df.isna().sum()

# summary statistics of the numeric features
print("Numeric Description:\n", df.describe())
Total observations: 7442
Number of features: 41
Numeric Description:
           actor_id  audio_duration  sample_rate  mfcc_1_mean   mfcc_1_std  \
count  7442.000000     7442.000000       7442.0  7442.000000  7442.000000   
mean   1046.084117        2.542910      22050.0  -387.893237    81.152242   
std      26.243152        0.505980          0.0    56.912883    30.241790   
min    1001.000000        1.267982      22050.0 -1131.370700     0.000122   
25%    1023.000000        2.202222      22050.0  -428.004615    58.292865   
50%    1046.000000        2.502540      22050.0  -399.767640    76.243485   
75%    1069.000000        2.836190      22050.0  -354.018697   101.771385   
max    1091.000000        5.005034      22050.0  -162.543350   179.528880   

       mfcc_2_mean   mfcc_2_std  mfcc_3_mean   mfcc_3_std  mfcc_4_mean  ...  \
count  7442.000000  7442.000000  7442.000000  7442.000000  7442.000000  ...   
mean    131.246557    26.349166     7.226425    31.621393    50.164769  ...   
std      15.557340     6.193413    11.605281    11.700738    11.128262  ...   
min       0.000000     0.000000   -52.340374     0.000000     0.000000  ...   
25%     122.297443    22.112123     0.674029    22.746978    42.658050  ...   
50%     134.065410    25.892035     8.938592    30.019723    50.710929  ...   
75%     142.583675    30.226819    15.246822    39.455707    58.216644  ...   
max     167.168330    63.146930    38.951794    73.551254    83.296500  ...   

       mfcc_13_std  spectral_centroid_mean  spectral_centroid_std  \
count  7442.000000             7442.000000            7442.000000   
mean      6.486291             1391.389433             569.861009   
std       1.678969              254.030203             284.309588   
min       0.000000                0.000000               0.000000   
25%       5.387114             1213.176905             356.601232   
50%       6.221665             1335.982229             507.373775   
75%       7.244229             1510.743971             725.217080   
max      24.734776             2873.927831            1699.906329   

       spectral_rolloff_mean  spectral_bandwidth_mean     rms_mean  \
count            7442.000000              7442.000000  7442.000000   
mean             2959.971329              1748.984424     0.027548   
std               471.242990               115.947135     0.028312   
min                 0.000000                 0.000000     0.000000   
25%              2648.811001              1679.011486     0.010933   
50%              2926.667949              1748.878905     0.016707   
75%              3211.563802              1815.428594     0.032007   
max              5258.158543              2163.024688     0.223023   

           rms_std     zcr_mean  chroma_mean   chroma_std  
count  7442.000000  7442.000000  7442.000000  7442.000000  
mean      0.027249     0.063343     0.389487     0.301066  
std       0.031667     0.023750     0.045327     0.012338  
min       0.000000     0.000000     0.000000     0.000000  
25%       0.008040     0.047160     0.359097     0.293473  
50%       0.015029     0.056566     0.389878     0.301533  
75%       0.033126     0.072023     0.420108     0.309572  
max       0.220164     0.233774     0.553840     0.335121  

[8 rows x 38 columns]

Target Class Count Plot
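
A minimal sketch of how this count plot could be produced, assuming the emotion labels live in a column named emotion (the actual column name may differ):

import matplotlib.pyplot as plt

# count clips per emotion and plot as a bar chart
class_counts = df["emotion"].value_counts()
class_counts.plot(kind="bar")
plt.title("Target Class Counts")
plt.xlabel("Emotion")
plt.ylabel("Number of clips")
plt.tight_layout()
plt.show()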

Data Preprocessing

  • Remove irrelevant columns, handle missing values
  • Apply Yeo-Johnson Power Transformation for numeric skew
  • Encode target labels
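
A minimal sketch of these preprocessing steps with scikit-learn; the dropped column names ("actor_id", "sample_rate") and the label column name ("emotion") are assumptions and may differ from the original notebook:

import pandas as pd
from sklearn.preprocessing import PowerTransformer, LabelEncoder

# drop columns that carry no emotional information and rows with missing values
X = df.drop(columns=["actor_id", "sample_rate", "emotion"]).dropna()
y = df.loc[X.index, "emotion"]
print("Skewness before:\n", X.skew().abs().describe())

# Yeo-Johnson power transform to reduce skew in the numeric features
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_yj = pd.DataFrame(pt.fit_transform(X), columns=X.columns, index=X.index)
print("Skewness after:\n", X_yj.skew().abs().describe())

# encode the six emotion labels as integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)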

Skewness Before & After Yeo-Johnson Transformation

Feature Engineering

  • Spectral Contrast: Measures amplitude differences between spectral peaks and valleys, capturing timbral characteristics that distinguish emotional expressions
  • MFCCs (Mel-frequency cepstral coefficients): Extract 13 coefficients representing the short-term power spectrum, fundamental for speech emotion recognition
  • Chroma Features: Capture pitch class energy distribution, providing harmonic content information relevant to emotional prosody
  • Zero-Crossing Rate: Quantifies signal noisiness by measuring zero-axis crossings, distinguishing between voiced and unvoiced speech segments
  • Root Mean Square (RMS) Energy: Measures overall signal energy, correlating with loudness and emotional intensity
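
The sketch below shows how such per-clip features could be extracted with librosa; the feature names follow the columns seen in the EDA output, but the exact feature set and summary statistics of the original extraction script may differ:

import librosa

def extract_features(path, sr=22050, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    feats = {}

    # 13 MFCCs: mean and std over time frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    for i in range(n_mfcc):
        feats[f"mfcc_{i + 1}_mean"] = mfcc[i].mean()
        feats[f"mfcc_{i + 1}_std"] = mfcc[i].std()

    # spectral shape, timbre, energy, noisiness, and harmonic content summaries
    feats["spectral_centroid_mean"] = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    feats["spectral_rolloff_mean"] = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()
    feats["spectral_bandwidth_mean"] = librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()
    feats["spectral_contrast_mean"] = librosa.feature.spectral_contrast(y=y, sr=sr).mean()
    feats["rms_mean"] = librosa.feature.rms(y=y).mean()
    feats["zcr_mean"] = librosa.feature.zero_crossing_rate(y).mean()
    feats["chroma_mean"] = librosa.feature.chroma_stft(y=y, sr=sr).mean()
    return feats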

Principal Component Analysis

PCA: retain 98% variance
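
A minimal sketch, assuming the transformed feature matrix X_yj from the preprocessing sketch above; passing a float to n_components asks scikit-learn for the smallest number of components whose cumulative explained variance reaches that fraction:

from sklearn.decomposition import PCA

# keep the fewest components explaining at least 98% of the variance
pca = PCA(n_components=0.98)
X_pca = pca.fit_transform(X_yj)
print(f"Components retained: {pca.n_components_}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.4f}")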

Model Training & Evaluation

Model Evaluation Function

  • Inputs:
    • model → ML model instance (Random Forest, SVM, MLP)
    • X_train, X_test → feature matrices
    • y_train, y_test → labels
    • model_name → string for labeling outputs
  • Output:
    • Console print of metrics & confusion matrix
    • Dictionary with overall and per-class performance
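
A sketch of an evaluation helper matching the inputs and outputs listed above; the function name, train/test split, and model hyperparameters shown here are assumptions, not the original code:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)

def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # console report: per-class metrics and confusion matrix
    print(f"=== {model_name} ===")
    print(classification_report(y_test, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

    # dictionary with overall and per-class performance
    return {
        "model": model_name,
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_precision": precision_score(y_test, y_pred, average="macro"),
        "macro_recall": recall_score(y_test, y_pred, average="macro"),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
        "per_class": classification_report(y_test, y_pred, output_dict=True),
    }

Example usage with the three models compared in this work (split ratio and hyperparameters are placeholders):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y_encoded, test_size=0.2, stratify=y_encoded, random_state=42)

for name, clf in [("RF", RandomForestClassifier(random_state=42)),
                  ("SVM", SVC(kernel="rbf")),
                  ("MLP", MLPClassifier(max_iter=500, random_state=42))]:
    evaluate_model(clf, X_train, X_test, y_train, y_test, name)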

Model Comparison

Model Comparison Summary

Model Accuracy Precision* Recall* F1-Score*
RF 0.5124 0.5045 0.5121 0.5014
SVM 0.5285 0.5258 0.5280 0.5256
MLP 0.5567 0.5538 0.5574 0.5534

* macro

Model Metric Comparison

Model Metric (Accuracy, Precision, Recall, F1-Score) Comparison
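
The comparison chart can be reproduced directly from the summary table above; this sketch assumes matplotlib for plotting:

import pandas as pd
import matplotlib.pyplot as plt

# scores copied from the Model Comparison Summary table
# (precision, recall, and F1 are macro-averaged)
summary = pd.DataFrame(
    {"Accuracy":  [0.5124, 0.5285, 0.5567],
     "Precision": [0.5045, 0.5258, 0.5538],
     "Recall":    [0.5121, 0.5280, 0.5574],
     "F1-Score":  [0.5014, 0.5256, 0.5534]},
    index=["RF", "SVM", "MLP"])

summary.plot(kind="bar")
plt.title("Model Metric Comparison")
plt.ylabel("Score")
plt.ylim(0, 1)
plt.tight_layout()
plt.show()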

Model Prediction & Performance

Random Forest Overall Performance

Metric Value
Accuracy 0.5104
Macro Precision 0.5012
Macro Recall 0.5104
Macro F1-Score 0.4990

Random Forest Per-Class Performance

Emotion Precision Recall F1-Score Support
angry 0.5947 0.7913 0.6791 254
disgust 0.4701 0.4331 0.4508 254
fear 0.4615 0.2835 0.3512 254
happy 0.5000 0.4510 0.4742 255
neutral 0.4587 0.5092 0.4826 218
sad 0.5225 0.5945 0.5562 254

Random Forest Predictions

Random Forest; Confusion Matrix

SVM Overall Performance

Metric Value
Accuracy 0.5285
Macro Precision 0.5258
Macro Recall 0.5280
Macro F1-Score 0.5256

SVM Per-Class Performance

Emotion Precision Recall F1-Score Support
angry 0.6541 0.7520 0.6996 254
disgust 0.4960 0.4921 0.4941 254
fear 0.4478 0.4724 0.4598 254
happy 0.5152 0.4667 0.4897 255
neutral 0.4846 0.5046 0.4944 218
sad 0.5571 0.4803 0.5159 254

Support Vector Machine Predictions

Support Vector Machine; Confusion Matrix

MLP (Neural Net) Overall Performance

Metric Value
Accuracy 0.5567
Macro Precision 0.5538
Macro Recall 0.5574
Macro F1-Score 0.5534

MLP (Neural Net) Per-Class Performance

Emotion Precision Recall F1-Score Support
angry 0.6996 0.7795 0.7374 254
disgust 0.5205 0.4488 0.4820 254
fear 0.4845 0.4921 0.4883 254
happy 0.5463 0.4627 0.5011 255
neutral 0.4792 0.5826 0.5259 218
sad 0.5927 0.5787 0.5857 254

Multi Layer Perceptron Predictions

Multi Layer Perceptron; Confusion Matrix

Conclusion

Key Findings

  • MLP outperformed other approaches in both overall accuracy and per-emotion metrics
  • Neural networks showed stronger ability to capture complex, non-linear patterns
  • The MLP delivered the most balanced performance across all classes, especially for the difficult emotions (disgust, fear, neutral)

Future Directions

  • Explore temporal modeling (RNNs, Transformers) to capture sequence information
  • Integrate multi-modal inputs (audio + visual features)
  • Experiment with advanced feature extraction beyond PCA
  • Aim to improve classification accuracy and robustness in real-world scenarios