Motivation & Questions

  • Challenges in classification (Tessler et al., 2022; van der Gulik et al., 2023)
  • Machine learning for taxonomy is new, needs evaluation (Alipour et al., 2024)
  • Stepping stone: refine models, improve performance
  • Potential: narrow candidate taxa, save time, improve relevance
  1. How accurately can a machine learning model classify animal taxa based on the binary presence of sexually selected traits?
  2. Do evolutionary origin rates of sexual traits provide stronger predictive power than binary trait presence when classifying animal taxa?

Data Overview

Evolution dataset shape: (84, 12)
Family dataset shape: (1087, 13)

Evolution dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Tree    84 non-null     int64  
 1   Phylum  84 non-null     object 
 2   A       84 non-null     float64
 3   G       84 non-null     float64
 4   O       84 non-null     float64
 5   T       84 non-null     float64
 6   V       84 non-null     float64
 7   C       84 non-null     float64
 8   F       84 non-null     float64
 9   K       84 non-null     float64
 10  M       84 non-null     float64
 11  S       84 non-null     float64
dtypes: float64(10), int64(1), object(1)
memory usage: 8.0+ KB

Family dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1087 entries, 0 to 1086
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Tree_Label  1087 non-null   object
 1   Phylum      1087 non-null   object
 2   SS          1087 non-null   int64 
 3   A           1087 non-null   int64 
 4   G           1087 non-null   int64 
 5   O           1087 non-null   int64 
 6   T           1087 non-null   int64 
 7   V           1087 non-null   int64 
 8   C           1087 non-null   int64 
 9   F           1087 non-null   int64 
 10  K           1087 non-null   int64 
 11  M           1087 non-null   int64 
 12  S           1087 non-null   int64 
dtypes: int64(11), object(2)
memory usage: 110.5+ KB
  • SS — Any sexually selected trait
  • A — Auditory trait
  • G — Gustatory trait
  • O — Olfactory trait
  • T — Tactile trait
  • V — Visual trait
  • C — Male–male competition trait
  • F — Female choice trait
  • K — Female–female competition trait
  • M — Male choice trait
  • S — Intersexual conflict trait

EDA — Family

Non-zero trait prevalence by phylum (df1):

                   SS         A         G         O         T         V  \
Phylum                                                                    
Annelida     0.500000       NaN       NaN  0.500000  0.500000       NaN   
Arthropoda   0.295529  0.037077  0.025082  0.114504  0.149400  0.098146   
Chordata     0.346154  0.134615       NaN  0.076923  0.076923  0.211538   
Mollusca     0.014286       NaN       NaN       NaN       NaN  0.014286   
Rotifera     1.000000       NaN       NaN  1.000000       NaN       NaN   

                    C         F         K         M         S  
Phylum                                                         
Annelida     0.250000  0.250000  0.250000  0.250000       NaN  
Arthropoda   0.114504  0.199564  0.006543  0.122137  0.018539  
Chordata     0.230769  0.250000  0.019231  0.057692       NaN  
Mollusca     0.014286       NaN       NaN       NaN       NaN  
Rotifera          NaN       NaN       NaN  1.000000       NaN  

EDA - Family Plots

EDA — Evolution


=== Skewness ===
A    :  3.5053
G    :  9.0968
O    :  3.3938
T    :  4.4907
V    :  4.2405
C    :  3.4515
F    :  3.4248
K    :  4.0070
M    :  3.8618
S    :  5.0951
Outlier summary (IQR):
Trait  n_outliers            Phyla_with_outliers
    C           9 Arthropoda, Chordata, Mollusca
    V           9 Arthropoda, Chordata, Mollusca
    A           6           Arthropoda, Chordata
    F           6           Arthropoda, Chordata
    K           6           Arthropoda, Chordata
    M           6           Arthropoda, Chordata
    O           6           Arthropoda, Chordata
    T           6           Arthropoda, Chordata
    G           3                     Arthropoda
    S           3                     Arthropoda
                   A         G         O         T         V         C  \
Phylum                                                                   
Arthropoda  0.000227  0.001774  0.000384  0.001187  0.000923  0.000997   
Chordata    0.000299       NaN  0.000387  0.000377  0.002300  0.000800   
Mollusca         NaN       NaN       NaN       NaN  0.000059  0.000059   

                   F         K         M         S  
Phylum                                              
Arthropoda  0.001152  0.000043  0.000506  0.000108  
Chordata    0.001333  0.000087  0.000276       NaN  
Mollusca         NaN       NaN       NaN       NaN  

EDA - Evolution Plots

Modeling Approach

Target encoding - Created feature: Superphylum (label-encoded, 5 groups)

  • Ecdysozoa
  • Lophotrochozoa
  • Deuterostomia
  • Basal Metazoa & Non-Bilaterians
  • Basal Bilateria
  • Family dataset
    • Binary trait presence (0/1) kept
    • Class weights applied
  • Evolution dataset
    • Trait rates log-transformed (log1p)
  • Standardized features (skewness)
  • Train/test split = 1/3
  • Stratified by superphyla
  • Excluded ID & class columns
  • Models:
    • Logistic Regression
    • Random Forest
    • Decision Tree
  • Accuracy (overall correct)
  • Balanced Accuracy (avg recall per class)
  • Macro F1 (avg F1 per class)

Results

=== FAMILY (Superphylum) RESULTS ===

Decision Tree: acc=0.147 | bal_acc=0.280 | macro-F1=0.087
              precision    recall  f1-score   support

           0       0.01      1.00      0.01         1
           1       0.00      0.00      0.00         2
           2       0.12      0.25      0.16        12
           3       0.93      0.15      0.26       186
           4       0.00      0.00      0.00        17

    accuracy                           0.15       218
   macro avg       0.21      0.28      0.09       218
weighted avg       0.80      0.15      0.23       218

Confusion matrix:
 [[  1   0   0   0   0]
 [  2   0   0   0   0]
 [  8   0   3   1   0]
 [136   0  22  28   0]
 [ 16   0   0   1   0]]

Random Forest: acc=0.229 | bal_acc=0.283 | macro-F1=0.128
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.01      1.00      0.02         2
           2       0.33      0.17      0.22        12
           3       0.94      0.25      0.39       186
           4       0.00      0.00      0.00        17

    accuracy                           0.23       218
   macro avg       0.26      0.28      0.13       218
weighted avg       0.82      0.23      0.35       218

Confusion matrix:
 [[  0   1   0   0   0]
 [  0   2   0   0   0]
 [  0   8   2   2   0]
 [  0 136   4  46   0]
 [  0  16   0   1   0]]

Logistic Regression: acc=0.220 | bal_acc=0.296 | macro-F1=0.132
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.01      1.00      0.02         2
           2       0.27      0.25      0.26        12
           3       0.98      0.23      0.37       186
           4       0.00      0.00      0.00        17

    accuracy                           0.22       218
   macro avg       0.25      0.30      0.13       218
weighted avg       0.85      0.22      0.33       218

Confusion matrix:
 [[  0   1   0   0   0]
 [  0   2   0   0   0]
 [  0   8   3   1   0]
 [  0 136   7  43   0]
 [  0  16   1   0   0]]

[FAMILY Superphylum] RF 4-fold CV macro-F1: 0.109 ± 0.011

=== EVOLUTION RESULTS ===

Decision Tree: acc=0.143 | bal_acc=0.229 | macro-F1=0.093
              precision    recall  f1-score   support

           0       0.12      1.00      0.21         3
           1       0.00      0.00      0.00         4
           2       0.00      0.00      0.00         3
           3       1.00      0.14      0.25         7
           4       0.00      0.00      0.00        11

    accuracy                           0.14        28
   macro avg       0.22      0.23      0.09        28
weighted avg       0.26      0.14      0.09        28

Confusion matrix:
 [[ 3  0  0  0  0]
 [ 4  0  0  0  0]
 [ 2  0  0  0  1]
 [ 5  0  1  1  0]
 [11  0  0  0  0]]

Random Forest: acc=0.214 | bal_acc=0.324 | macro-F1=0.232
              precision    recall  f1-score   support

           0       0.12      1.00      0.21         3
           1       0.00      0.00      0.00         4
           2       1.00      0.33      0.50         3
           3       1.00      0.29      0.44         7
           4       0.00      0.00      0.00        11

    accuracy                           0.21        28
   macro avg       0.42      0.32      0.23        28
weighted avg       0.37      0.21      0.19        28

Confusion matrix:
 [[ 3  0  0  0  0]
 [ 4  0  0  0  0]
 [ 2  0  1  0  0]
 [ 5  0  0  2  0]
 [11  0  0  0  0]]

Logistic Regression: acc=0.500 | bal_acc=0.324 | macro-F1=0.311
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.00      0.00      0.00         4
           2       1.00      0.33      0.50         3
           3       1.00      0.29      0.44         7
           4       0.44      1.00      0.61        11

    accuracy                           0.50        28
   macro avg       0.49      0.32      0.31        28
weighted avg       0.53      0.50      0.40        28

Confusion matrix:
 [[ 0  0  0  0  3]
 [ 0  0  0  0  4]
 [ 0  0  1  0  2]
 [ 0  0  0  2  5]
 [ 0  0  0  0 11]]

[EVOLUTION] RF 5-fold CV macro-F1: 0.193 ± 0.065

=== FAMILY vs EVOLUTION — MODEL COMPARISON ===
                          acc          bal_acc         macro_F1       
dataset             evolution family evolution family evolution family
model                                                                 
Decision Tree           0.143  0.147     0.229  0.280     0.093  0.087
Logistic Regression     0.500  0.220     0.324  0.296     0.311  0.132
Random Forest           0.214  0.229     0.324  0.283     0.232  0.128

SHAP Interpretation


[SHAP] Random Forest — FAMILY


[SHAP] FAMILY — top features by average absolute contribution
 1. SS  —  mean|SHAP|=0.1858
 2. F  —  mean|SHAP|=0.1131
 3. T  —  mean|SHAP|=0.0595
 4. C  —  mean|SHAP|=0.0525
 5. M  —  mean|SHAP|=0.0441
 6. V  —  mean|SHAP|=0.0289
 7. O  —  mean|SHAP|=0.0192
 8. A  —  mean|SHAP|=0.0158
 9. G  —  mean|SHAP|=0.0062
10. S  —  mean|SHAP|=0.0026

[SHAP] Random Forest — EVOLUTION (global)


[SHAP] EVOLUTION — top features by average absolute contribution
 1. V  —  mean|SHAP|=0.0802
 2. C  —  mean|SHAP|=0.0550
 3. A  —  mean|SHAP|=0.0398
 4. F  —  mean|SHAP|=0.0365
 5. K  —  mean|SHAP|=0.0310
 6. M  —  mean|SHAP|=0.0309
 7. O  —  mean|SHAP|=0.0262
 8. T  —  mean|SHAP|=0.0190
 9. G  —  mean|SHAP|=0.0111
10. S  —  mean|SHAP|=0.0089

Conclusion

  • Conclusion: evolution has stronger overall predictive power than family (binary)
  • Future works or potential applications:
    • Remove SS traits and redo modeling
    • Improve data quality
    • Consider potential predictive signal overlaps
    • Different models?
    • SHAP applications to biological interpretations

Thank you for listening!