Acoustic Emotion Recognition

Ralph Andrade

Abstract

  • Dataset: CREMA-D, six emotions (neutral, happy, sad, angry, fear, disgust)
  • Features: extracted with librosa, standardized, reduced with PCA (98% variance retained)
  • Models: SVM, Random Forest, MLP
  • MLP achieved macro F1-score: 0.5534

Introduction

  • Automatic emotion classification is challenging
  • CREMA-D: 7,442 clips, 91 actors
  • Features: MFCCs, spectral properties
  • Goal: Compare a traditional classifier (SVM), an ensemble method (Random Forest), and a neural network (MLP)

Research Question

  • Q1. What classification accuracy can supervised machine learning methods achieve for emotion recognition from acoustic features?
  • Q2. How do standalone algorithms, ensemble approaches, and neural networks compare in terms of accuracy, robustness, and computational efficiency for emotion classification from audio?

Exploratory Analysis

EDA

Code
# number of observations and variables in the data
print(f"Total observations: {df.shape[0]}")
print(f"Number of features: {df.shape[1]}")

# count missing values in each column
missing_df = df.isna().sum()

# summary statistics of the numeric features
print("Numeric Description:\n", df.describe())
Total observations: 7442
Number of features: 41
Numeric Description:
           actor_id  audio_duration  sample_rate  mfcc_1_mean   mfcc_1_std  \
count  7442.000000     7442.000000       7442.0  7442.000000  7442.000000   
mean   1046.084117        2.542910      22050.0  -387.893237    81.152242   
std      26.243152        0.505980          0.0    56.912883    30.241790   
min    1001.000000        1.267982      22050.0 -1131.370700     0.000122   
25%    1023.000000        2.202222      22050.0  -428.004615    58.292865   
50%    1046.000000        2.502540      22050.0  -399.767640    76.243485   
75%    1069.000000        2.836190      22050.0  -354.018697   101.771385   
max    1091.000000        5.005034      22050.0  -162.543350   179.528880   

       mfcc_2_mean   mfcc_2_std  mfcc_3_mean   mfcc_3_std  mfcc_4_mean  ...  \
count  7442.000000  7442.000000  7442.000000  7442.000000  7442.000000  ...   
mean    131.246557    26.349166     7.226425    31.621393    50.164769  ...   
std      15.557340     6.193413    11.605281    11.700738    11.128262  ...   
min       0.000000     0.000000   -52.340374     0.000000     0.000000  ...   
25%     122.297443    22.112123     0.674029    22.746978    42.658050  ...   
50%     134.065410    25.892035     8.938592    30.019723    50.710929  ...   
75%     142.583675    30.226819    15.246822    39.455707    58.216644  ...   
max     167.168330    63.146930    38.951794    73.551254    83.296500  ...   

       mfcc_13_std  spectral_centroid_mean  spectral_centroid_std  \
count  7442.000000             7442.000000            7442.000000   
mean      6.486291             1391.389433             569.861009   
std       1.678969              254.030203             284.309588   
min       0.000000                0.000000               0.000000   
25%       5.387114             1213.176905             356.601232   
50%       6.221665             1335.982229             507.373775   
75%       7.244229             1510.743971             725.217080   
max      24.734776             2873.927831            1699.906329   

       spectral_rolloff_mean  spectral_bandwidth_mean     rms_mean  \
count            7442.000000              7442.000000  7442.000000   
mean             2959.971329              1748.984424     0.027548   
std               471.242990               115.947135     0.028312   
min                 0.000000                 0.000000     0.000000   
25%              2648.811001              1679.011486     0.010933   
50%              2926.667949              1748.878905     0.016707   
75%              3211.563802              1815.428594     0.032007   
max              5258.158543              2163.024688     0.223023   

           rms_std     zcr_mean  chroma_mean   chroma_std  
count  7442.000000  7442.000000  7442.000000  7442.000000  
mean      0.027249     0.063343     0.389487     0.301066  
std       0.031667     0.023750     0.045327     0.012338  
min       0.000000     0.000000     0.000000     0.000000  
25%       0.008040     0.047160     0.359097     0.293473  
50%       0.015029     0.056566     0.389878     0.301533  
75%       0.033126     0.072023     0.420108     0.309572  
max       0.220164     0.233774     0.553840     0.335121  

[8 rows x 38 columns]

Target Class Count Plot
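
A minimal sketch of how this count plot could be produced, assuming the emotion labels live in a column named emotion (the actual column name may differ):

import matplotlib.pyplot as plt

# count clips per emotion and plot as a bar chart
class_counts = df["emotion"].value_counts()
class_counts.plot(kind="bar")
plt.title("Target Class Counts")
plt.xlabel("Emotion")
plt.ylabel("Number of clips")
plt.tight_layout()
plt.show()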

Data Preprocessing

  • Remove irrelevant columns, handle missing values
  • Apply Yeo-Johnson Power Transformation for numeric skew
  • Encode target labels
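
A minimal sketch of these preprocessing steps with scikit-learn; the dropped column names ("actor_id", "sample_rate") and the label column name ("emotion") are assumptions and may differ from the original notebook:

import pandas as pd
from sklearn.preprocessing import PowerTransformer, LabelEncoder

# drop columns that carry no emotional information and rows with missing values
X = df.drop(columns=["actor_id", "sample_rate", "emotion"]).dropna()
y = df.loc[X.index, "emotion"]
print("Skewness before:\n", X.skew().abs().describe())

# Yeo-Johnson power transform to reduce skew in the numeric features
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_yj = pd.DataFrame(pt.fit_transform(X), columns=X.columns, index=X.index)
print("Skewness after:\n", X_yj.skew().abs().describe())

# encode the six emotion labels as integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)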

Skewness Before & After Yeo-Johnson Transformation

Feature Engineering

  • Spectral Contrast: Measures amplitude differences between spectral peaks and valleys, capturing timbral characteristics that distinguish emotional expressions
  • MFCCs (Mel-frequency cepstral coefficients): Extract 13 coefficients representing the short-term power spectrum, fundamental for speech emotion recognition
  • Chroma Features: Capture pitch class energy distribution, providing harmonic content information relevant to emotional prosody
  • Zero-Crossing Rate: Quantifies signal noisiness by measuring zero-axis crossings, distinguishing between voiced and unvoiced speech segments
  • Root Mean Square (RMS) Energy: Measures overall signal energy, correlating with loudness and emotional intensity
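
The sketch below shows how such per-clip features could be extracted with librosa; the feature names follow the columns seen in the EDA output, but the exact feature set and summary statistics of the original extraction script may differ:

import librosa

def extract_features(path, sr=22050, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    feats = {}

    # 13 MFCCs: mean and std over time frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    for i in range(n_mfcc):
        feats[f"mfcc_{i + 1}_mean"] = mfcc[i].mean()
        feats[f"mfcc_{i + 1}_std"] = mfcc[i].std()

    # spectral shape, timbre, energy, noisiness, and harmonic content summaries
    feats["spectral_centroid_mean"] = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    feats["spectral_rolloff_mean"] = librosa.feature.spectral_rolloff(y=y, sr=sr).mean()
    feats["spectral_bandwidth_mean"] = librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()
    feats["spectral_contrast_mean"] = librosa.feature.spectral_contrast(y=y, sr=sr).mean()
    feats["rms_mean"] = librosa.feature.rms(y=y).mean()
    feats["zcr_mean"] = librosa.feature.zero_crossing_rate(y).mean()
    feats["chroma_mean"] = librosa.feature.chroma_stft(y=y, sr=sr).mean()
    return feats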

Principal Component Analysis

PCA: retain 98% variance
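
A minimal sketch, assuming the transformed feature matrix X_yj from the preprocessing sketch above; passing a float to n_components asks scikit-learn for the smallest number of components whose cumulative explained variance reaches that fraction:

from sklearn.decomposition import PCA

# keep the fewest components explaining at least 98% of the variance
pca = PCA(n_components=0.98)
X_pca = pca.fit_transform(X_yj)
print(f"Components retained: {pca.n_components_}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.4f}")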

Model Training & Evaluation

Model Evaluation Function

  • Inputs:
    • model → ML model instance (Random Forest, SVM, MLP)
    • X_train, X_test → feature matrices
    • y_train, y_test → labels
    • model_name → string for labeling outputs
  • Output:
    • Console print of metrics & confusion matrix
    • Dictionary with overall and per-class performance
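
A sketch of an evaluation helper matching the inputs and outputs listed above; the function name, train/test split, and model hyperparameters shown here are assumptions, not the original code:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)

def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # console report: per-class metrics and confusion matrix
    print(f"=== {model_name} ===")
    print(classification_report(y_test, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

    # dictionary with overall and per-class performance
    return {
        "model": model_name,
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_precision": precision_score(y_test, y_pred, average="macro"),
        "macro_recall": recall_score(y_test, y_pred, average="macro"),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
        "per_class": classification_report(y_test, y_pred, output_dict=True),
    }

Example usage with the three models compared in this work (split ratio and hyperparameters are placeholders):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y_encoded, test_size=0.2, stratify=y_encoded, random_state=42)

for name, clf in [("RF", RandomForestClassifier(random_state=42)),
                  ("SVM", SVC(kernel="rbf")),
                  ("MLP", MLPClassifier(max_iter=500, random_state=42))]:
    evaluate_model(clf, X_train, X_test, y_train, y_test, name)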

Model Comparison

Model Comparison Summary

Model Accuracy Precision* Recall* F1-Score*
RF 0.5124 0.5045 0.5121 0.5014
SVM 0.5285 0.5258 0.5280 0.5256
MLP 0.5567 0.5538 0.5574 0.5534

* macro

Model Metric Comparison

Model Metric (Accuracy, Precision, Recall, F1-Score) Comparison
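
The comparison chart can be reproduced directly from the summary table above; this sketch assumes matplotlib for plotting:

import pandas as pd
import matplotlib.pyplot as plt

# scores copied from the Model Comparison Summary table
# (precision, recall, and F1 are macro-averaged)
summary = pd.DataFrame(
    {"Accuracy":  [0.5124, 0.5285, 0.5567],
     "Precision": [0.5045, 0.5258, 0.5538],
     "Recall":    [0.5121, 0.5280, 0.5574],
     "F1-Score":  [0.5014, 0.5256, 0.5534]},
    index=["RF", "SVM", "MLP"])

summary.plot(kind="bar")
plt.title("Model Metric Comparison")
plt.ylabel("Score")
plt.ylim(0, 1)
plt.tight_layout()
plt.show()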

Model Prediction & Performance

Random Forest Overall Performance

Metric Value
Accuracy 0.5104
Macro Precision 0.5012
Macro Recall 0.5104
Macro F1-Score 0.4990

Random Forest Per-Class Performance

Emotion Precision Recall F1-Score Support
angry 0.5947 0.7913 0.6791 254
disgust 0.4701 0.4331 0.4508 254
fear 0.4615 0.2835 0.3512 254
happy 0.5000 0.4510 0.4742 255
neutral 0.4587 0.5092 0.4826 218
sad 0.5225 0.5945 0.5562 254

Random Forest Predictions

Random Forest; Confusion Matrix

SVM Overall Performance

Metric Value
Accuracy 0.5285
Macro Precision 0.5258
Macro Recall 0.5280
Macro F1-Score 0.5256

SVM Per-Class Performance

Emotion Precision Recall F1-Score Support
angry 0.6541 0.7520 0.6996 254
disgust 0.4960 0.4921 0.4941 254
fear 0.4478 0.4724 0.4598 254
happy 0.5152 0.4667 0.4897 255
neutral 0.4846 0.5046 0.4944 218
sad 0.5571 0.4803 0.5159 254

Support Vector Machine Predictions

Support Vector Machine; Confusion Matrix

MLP (Neural Net) Overall Performance

Metric Value
Accuracy 0.5567
Macro Precision 0.5538
Macro Recall 0.5574
Macro F1-Score 0.5534

MLP (Neural Net) Per-Class Performance

Emotion Precision Recall F1-Score Support
angry 0.6996 0.7795 0.7374 254
disgust 0.5205 0.4488 0.4820 254
fear 0.4845 0.4921 0.4883 254
happy 0.5463 0.4627 0.5011 255
neutral 0.4792 0.5826 0.5259 218
sad 0.5927 0.5787 0.5857 254

Multi Layer Perceptron Predictions

Multi Layer Perceptron; Confusion Matrix

Conclusion

Key Findings

  • MLP outperformed other approaches in both overall accuracy and per-emotion metrics
  • Neural networks showed stronger ability to capture complex, non-linear patterns
  • The MLP delivered the most balanced performance across all classes, especially for the difficult emotions (disgust, fear, neutral)

Future Directions

  • Explore temporal modeling (RNNs, Transformers) to capture sequence information
  • Integrate multi-modal inputs (audio + visual features)
  • Experiment with advanced feature extraction beyond PCA
  • Aim to improve classification accuracy and robustness in real-world scenarios