Predicting Animal Phyla from Sexually Selected Traits

INFO 523 - Summer 2025 - Final Project

Matthew Qi Lan Thompson

Motivation & Questions

Objectives

- Challenges in classification (Tessler et al., 2022; van der Gulik et al., 2023)
- Machine learning for taxonomy is new and needs evaluation (Alipour et al., 2024)
- Stepping stone: refine models, improve performance
- Potential: narrow candidate taxa, save time, improve relevance

Research Questions

- How accurately can a machine learning model classify animal taxa based on the binary presence of sexually selected traits?
- Do evolutionary origin rates of sexual traits provide stronger predictive power than binary trait presence when classifying animal taxa?

Data Overview

Evolution dataset shape: (84, 12)
Family dataset shape: (1087, 13)

Evolution dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Tree    84 non-null     int64  
 1   Phylum  84 non-null     object 
 2   A       84 non-null     float64
 3   G       84 non-null     float64
 4   O       84 non-null     float64
 5   T       84 non-null     float64
 6   V       84 non-null     float64
 7   C       84 non-null     float64
 8   F       84 non-null     float64
 9   K       84 non-null     float64
 10  M       84 non-null     float64
 11  S       84 non-null     float64
dtypes: float64(10), int64(1), object(1)
memory usage: 8.0+ KB

Family dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1087 entries, 0 to 1086
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Tree_Label  1087 non-null   object
 1   Phylum      1087 non-null   object
 2   SS          1087 non-null   int64 
 3   A           1087 non-null   int64 
 4   G           1087 non-null   int64 
 5   O           1087 non-null   int64 
 6   T           1087 non-null   int64 
 7   V           1087 non-null   int64 
 8   C           1087 non-null   int64 
 9   F           1087 non-null   int64 
 10  K           1087 non-null   int64 
 11  M           1087 non-null   int64 
 12  S           1087 non-null   int64 
dtypes: int64(11), object(2)
memory usage: 110.5+ KB

SS — Any sexually selected trait
A — Auditory trait
G — Gustatory trait
O — Olfactory trait
T — Tactile trait
V — Visual trait

C — Male–male competition trait
F — Female choice trait
K — Female–female competition trait
M — Male choice trait
S — Intersexual conflict trait

EDA — Family

Traits by Phylum

Non-zero trait prevalence by phylum (family dataset, df1; zero prevalence shown as NaN):

                   SS         A         G         O         T         V  \
Phylum                                                                    
Annelida     0.500000       NaN       NaN  0.500000  0.500000       NaN   
Arthropoda   0.295529  0.037077  0.025082  0.114504  0.149400  0.098146   
Chordata     0.346154  0.134615       NaN  0.076923  0.076923  0.211538   
Mollusca     0.014286       NaN       NaN       NaN       NaN  0.014286   
Rotifera     1.000000       NaN       NaN  1.000000       NaN       NaN   

                    C         F         K         M         S  
Phylum                                                         
Annelida     0.250000  0.250000  0.250000  0.250000       NaN  
Arthropoda   0.114504  0.199564  0.006543  0.122137  0.018539  
Chordata     0.230769  0.250000  0.019231  0.057692       NaN  
Mollusca     0.014286       NaN       NaN       NaN       NaN  
Rotifera          NaN       NaN       NaN  1.000000       NaN
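
A prevalence table of this kind could be built roughly as follows. This is a minimal sketch, assuming a DataFrame laid out like the family dataset; the toy rows and the names `family` and `prevalence` are illustrative and are not the project's data or code.

```python
import pandas as pd

# Toy stand-in for the family dataset (hypothetical rows, not the real data).
family = pd.DataFrame({
    "Phylum": ["Annelida", "Annelida", "Mollusca", "Mollusca", "Rotifera"],
    "SS":     [1, 0, 0, 0, 1],
    "O":      [1, 0, 0, 0, 1],
    "V":      [0, 0, 0, 0, 0],
})
traits = ["SS", "O", "V"]

# The mean of a 0/1 column is the share of families that show the trait.
prevalence = family.groupby("Phylum")[traits].mean()

# Keep only non-zero prevalence (zeros become NaN), then drop phyla
# in which no trait occurs at all.
prevalence = prevalence.where(prevalence > 0).dropna(how="all")
print(prevalence)
```

In this toy input, Mollusca disappears from the output because none of its rows carries a trait, which mirrors how phyla without any recorded trait would be absent from such a table.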

EDA - Family Plots

Plots: Distribution, Prevalence, Binary Presence, Correlation Matrix

EDA — Evolution

Skewness
Outliers
Traits by Phylum


=== Skewness ===
A    :  3.5053
G    :  9.0968
O    :  3.3938
T    :  4.4907
V    :  4.2405
C    :  3.4515
F    :  3.4248
K    :  4.0070
M    :  3.8618
S    :  5.0951

Outlier summary (IQR):
Trait  n_outliers            Phyla_with_outliers
    C           9 Arthropoda, Chordata, Mollusca
    V           9 Arthropoda, Chordata, Mollusca
    A           6           Arthropoda, Chordata
    F           6           Arthropoda, Chordata
    K           6           Arthropoda, Chordata
    M           6           Arthropoda, Chordata
    O           6           Arthropoda, Chordata
    T           6           Arthropoda, Chordata
    G           3                     Arthropoda
    S           3                     Arthropoda
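
The skewness and IQR outlier summaries could be computed along these lines. This is a sketch under assumptions: the data are synthetic, the helper `iqr_outliers` is hypothetical, and the common 1.5 × IQR fence is assumed because the slides only say "IQR".

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed rates (hypothetical; not the project's data).
rng = np.random.default_rng(0)
evo = pd.DataFrame({
    "Phylum": ["Arthropoda"] * 40 + ["Chordata"] * 40,
    "V": rng.exponential(scale=0.001, size=80),
})
evo.loc[0, "V"] = 0.05  # plant one extreme value so an outlier is present

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Boolean mask: True where a value lies more than 1.5 * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

skew = evo["V"].skew()            # sample skewness; > 0 means a long right tail
mask = iqr_outliers(evo["V"])
phyla = sorted(evo.loc[mask, "Phylum"].unique())
print(f"skew={skew:.4f}  n_outliers={int(mask.sum())}  phyla={', '.join(phyla)}")
```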

Traits by phylum (evolution dataset):

                   A         G         O         T         V         C  \
Phylum                                                                   
Arthropoda  0.000227  0.001774  0.000384  0.001187  0.000923  0.000997   
Chordata    0.000299       NaN  0.000387  0.000377  0.002300  0.000800   
Mollusca         NaN       NaN       NaN       NaN  0.000059  0.000059   

                   F         K         M         S  
Phylum                                              
Arthropoda  0.001152  0.000043  0.000506  0.000108  
Chordata    0.001333  0.000087  0.000276       NaN  
Mollusca         NaN       NaN       NaN       NaN

EDA - Evolution Plots

Plots: Distribution, Prevalence, Correlation Matrix

Modeling Approach

Data Preprocessing

Target: created feature Superphylum (label-encoded, 5 groups)
- Ecdysozoa
- Lophotrochozoa
- Deuterostomia
- Basal Metazoa & Non-Bilaterians
- Basal Bilateria

Data Preprocessing (Continued)

Family dataset
- Binary trait presence (0/1) kept
- Class weights applied
Evolution dataset
- Trait rates log-transformed (log1p)

Features standardized (skewness)
Held-out test set: 1/3 of rows for the evolution dataset (28 of 84); the family test set reported in Results has 218 of 1,087 rows (about 20%)
Split stratified by superphylum

Modeling Approaches

ID and class columns excluded from features
Models:
- Logistic Regression
- Random Forest
- Decision Tree

Evaluation Metrics

Accuracy (overall correct)
Balanced Accuracy (avg recall per class)
Macro F1 (avg F1 per class)
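
The preprocessing and modeling steps listed above can be sketched as follows. This is an illustration under assumptions, not the project's code: the data are synthetic, the `SUPERPHYLUM` map covers only three phyla and is my own stand-in, and `class_weight="balanced"` is just one way to apply class weights (the slides list class weights for the family dataset).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Illustrative phylum -> superphylum map (assumption; the project's map is not shown).
SUPERPHYLUM = {
    "Arthropoda": "Ecdysozoa",
    "Mollusca": "Lophotrochozoa",
    "Chordata": "Deuterostomia",
}

# Synthetic stand-in for the evolution dataset (hypothetical values).
rng = np.random.default_rng(0)
evo = pd.DataFrame({
    "Tree": np.arange(90),
    "Phylum": np.repeat(list(SUPERPHYLUM), 30),
    "V": rng.exponential(0.001, 90),
    "C": rng.exponential(0.001, 90),
})

# Label-encode the target; exclude ID and class columns from the features.
y = LabelEncoder().fit_transform(evo["Phylum"].map(SUPERPHYLUM))
X = np.log1p(evo.drop(columns=["Tree", "Phylum"]))  # log1p compresses the right tail

# Hold out 1/3 of the rows, stratified so each group keeps its share.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0
)

# The scaler sits inside the pipeline, so it is fit on the training split only.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.3f}")
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training; the same pipeline shape would accept a Random Forest or Decision Tree in place of the logistic regression.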

Results

Family
Evolution
Comparison

=== FAMILY (Superphylum) RESULTS ===

Decision Tree: acc=0.147 | bal_acc=0.280 | macro-F1=0.087
              precision    recall  f1-score   support

           0       0.01      1.00      0.01         1
           1       0.00      0.00      0.00         2
           2       0.12      0.25      0.16        12
           3       0.93      0.15      0.26       186
           4       0.00      0.00      0.00        17

    accuracy                           0.15       218
   macro avg       0.21      0.28      0.09       218
weighted avg       0.80      0.15      0.23       218

Confusion matrix:
 [[  1   0   0   0   0]
 [  2   0   0   0   0]
 [  8   0   3   1   0]
 [136   0  22  28   0]
 [ 16   0   0   1   0]]

Random Forest: acc=0.229 | bal_acc=0.283 | macro-F1=0.128
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.01      1.00      0.02         2
           2       0.33      0.17      0.22        12
           3       0.94      0.25      0.39       186
           4       0.00      0.00      0.00        17

    accuracy                           0.23       218
   macro avg       0.26      0.28      0.13       218
weighted avg       0.82      0.23      0.35       218

Confusion matrix:
 [[  0   1   0   0   0]
 [  0   2   0   0   0]
 [  0   8   2   2   0]
 [  0 136   4  46   0]
 [  0  16   0   1   0]]

Logistic Regression: acc=0.220 | bal_acc=0.296 | macro-F1=0.132
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.01      1.00      0.02         2
           2       0.27      0.25      0.26        12
           3       0.98      0.23      0.37       186
           4       0.00      0.00      0.00        17

    accuracy                           0.22       218
   macro avg       0.25      0.30      0.13       218
weighted avg       0.85      0.22      0.33       218

Confusion matrix:
 [[  0   1   0   0   0]
 [  0   2   0   0   0]
 [  0   8   3   1   0]
 [  0 136   7  43   0]
 [  0  16   1   0   0]]

[FAMILY Superphylum] RF 4-fold CV macro-F1: 0.109 ± 0.011
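
A cross-validated macro-F1 of this form could be obtained as sketched below. The data are synthetic, the hyperparameters are placeholders, and reporting the spread as a standard deviation across folds is an assumption; the slides do not say how the ± value was computed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced 3-class data (hypothetical stand-in for the real features).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0,
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)

# "f1_macro" averages per-class F1 without weighting by class size.
scores = cross_val_score(rf, X, y, cv=cv, scoring="f1_macro")
print(f"RF 4-fold CV macro-F1: {scores.mean():.3f} ± {scores.std():.3f}")
```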


=== EVOLUTION RESULTS ===

Decision Tree: acc=0.143 | bal_acc=0.229 | macro-F1=0.093
              precision    recall  f1-score   support

           0       0.12      1.00      0.21         3
           1       0.00      0.00      0.00         4
           2       0.00      0.00      0.00         3
           3       1.00      0.14      0.25         7
           4       0.00      0.00      0.00        11

    accuracy                           0.14        28
   macro avg       0.22      0.23      0.09        28
weighted avg       0.26      0.14      0.09        28

Confusion matrix:
 [[ 3  0  0  0  0]
 [ 4  0  0  0  0]
 [ 2  0  0  0  1]
 [ 5  0  1  1  0]
 [11  0  0  0  0]]

Random Forest: acc=0.214 | bal_acc=0.324 | macro-F1=0.232
              precision    recall  f1-score   support

           0       0.12      1.00      0.21         3
           1       0.00      0.00      0.00         4
           2       1.00      0.33      0.50         3
           3       1.00      0.29      0.44         7
           4       0.00      0.00      0.00        11

    accuracy                           0.21        28
   macro avg       0.42      0.32      0.23        28
weighted avg       0.37      0.21      0.19        28

Confusion matrix:
 [[ 3  0  0  0  0]
 [ 4  0  0  0  0]
 [ 2  0  1  0  0]
 [ 5  0  0  2  0]
 [11  0  0  0  0]]

Logistic Regression: acc=0.500 | bal_acc=0.324 | macro-F1=0.311
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.00      0.00      0.00         4
           2       1.00      0.33      0.50         3
           3       1.00      0.29      0.44         7
           4       0.44      1.00      0.61        11

    accuracy                           0.50        28
   macro avg       0.49      0.32      0.31        28
weighted avg       0.53      0.50      0.40        28

Confusion matrix:
 [[ 0  0  0  0  3]
 [ 0  0  0  0  4]
 [ 0  0  1  0  2]
 [ 0  0  0  2  5]
 [ 0  0  0  0 11]]
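
As a worked check of the three metrics, the Logistic Regression scores for the evolution dataset can be recomputed by hand from its confusion matrix. The sketch below uses plain Python and the standard definitions (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Confusion matrix for Logistic Regression on the evolution test set,
# copied from the report (rows = true class, columns = predicted class).
cm = [
    [0, 0, 0, 0, 3],
    [0, 0, 0, 0, 4],
    [0, 0, 1, 0, 2],
    [0, 0, 0, 2, 5],
    [0, 0, 0, 0, 11],
]
n = len(cm)
total = sum(map(sum, cm))

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(cm[i][i] for i in range(n)) / total

recalls, f1s = [], []
for i in range(n):
    tp = cm[i][i]
    support = sum(cm[i])                          # rows that truly belong to class i
    predicted = sum(cm[r][i] for r in range(n))   # rows predicted as class i
    recall = tp / support if support else 0.0
    precision = tp / predicted if predicted else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    recalls.append(recall)
    f1s.append(f1)

balanced_accuracy = sum(recalls) / n   # mean per-class recall
macro_f1 = sum(f1s) / n                # unweighted mean of per-class F1
print(f"acc={accuracy:.3f} | bal_acc={balanced_accuracy:.3f} | macro-F1={macro_f1:.3f}")
# prints: acc=0.500 | bal_acc=0.324 | macro-F1=0.311
```

The accuracy of 0.500 comes almost entirely from class 4 (11 of the 14 correct predictions), which is why balanced accuracy and macro-F1 are much lower.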


[EVOLUTION] RF 5-fold CV macro-F1: 0.193 ± 0.065


=== FAMILY vs EVOLUTION — MODEL COMPARISON ===
                          acc          bal_acc         macro_F1       
dataset             evolution family evolution family evolution family
model                                                                 
Decision Tree           0.143  0.147     0.229  0.280     0.093  0.087
Logistic Regression     0.500  0.220     0.324  0.296     0.311  0.132
Random Forest           0.214  0.229     0.324  0.283     0.232  0.128
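
A comparison table of this shape can be assembled from long-form results with a pivot. The sketch below reuses the test-set scores reported above; the names `results`, `table`, and `gain` are illustrative.

```python
import pandas as pd

# Long-form results; the numbers are the test-set scores reported above.
rows = [
    ("Decision Tree", "evolution", 0.143, 0.229, 0.093),
    ("Decision Tree", "family", 0.147, 0.280, 0.087),
    ("Logistic Regression", "evolution", 0.500, 0.324, 0.311),
    ("Logistic Regression", "family", 0.220, 0.296, 0.132),
    ("Random Forest", "evolution", 0.214, 0.324, 0.232),
    ("Random Forest", "family", 0.229, 0.283, 0.128),
]
results = pd.DataFrame(rows, columns=["model", "dataset", "acc", "bal_acc", "macro_F1"])

# One row per model, one (metric, dataset) column per score.
table = results.pivot(index="model", columns="dataset",
                      values=["acc", "bal_acc", "macro_F1"])

# Macro-F1 difference: positive means the evolution features scored higher.
gain = table["macro_F1"]["evolution"] - table["macro_F1"]["family"]
print(table)
print(gain.round(3))
```

On these numbers the macro-F1 difference favors the evolution dataset for all three models, though for the Decision Tree it is only 0.006.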

SHAP Interpretation

Family - Plot
Family - Scores
Evolution - Plot
Evolution - Scores


[SHAP] Random Forest — FAMILY


[SHAP] FAMILY — top features by average absolute contribution
 1. SS  —  mean|SHAP|=0.1858
 2. F  —  mean|SHAP|=0.1131
 3. T  —  mean|SHAP|=0.0595
 4. C  —  mean|SHAP|=0.0525
 5. M  —  mean|SHAP|=0.0441
 6. V  —  mean|SHAP|=0.0289
 7. O  —  mean|SHAP|=0.0192
 8. A  —  mean|SHAP|=0.0158
 9. G  —  mean|SHAP|=0.0062
10. S  —  mean|SHAP|=0.0026


[SHAP] Random Forest — EVOLUTION (global)


[SHAP] EVOLUTION — top features by average absolute contribution
 1. V  —  mean|SHAP|=0.0802
 2. C  —  mean|SHAP|=0.0550
 3. A  —  mean|SHAP|=0.0398
 4. F  —  mean|SHAP|=0.0365
 5. K  —  mean|SHAP|=0.0310
 6. M  —  mean|SHAP|=0.0309
 7. O  —  mean|SHAP|=0.0262
 8. T  —  mean|SHAP|=0.0190
 9. G  —  mean|SHAP|=0.0111
10. S  —  mean|SHAP|=0.0089
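
A ranking by mean|SHAP| can be derived from an array of SHAP values as sketched below. This shows only the aggregation step under assumptions: the values are random placeholders rather than output from an explainer, the array layout (samples, features, classes) is assumed, and averaging over both samples and classes is one plausible convention that the slides do not confirm.

```python
import numpy as np

features = ["A", "G", "O", "T", "V"]

# Hypothetical SHAP values with shape (n_samples, n_features, n_classes).
# Real values would come from a SHAP explainer; random numbers stand in here,
# with a larger spread for "V" so that it ranks first.
rng = np.random.default_rng(0)
scale = np.array([[0.01], [0.005], [0.02], [0.01], [0.08]])  # one scale per feature
shap_values = rng.normal(scale=scale, size=(50, 5, 3))

# mean|SHAP| per feature: average absolute contribution over samples and classes.
mean_abs = np.abs(shap_values).mean(axis=(0, 2))

# Sort features from largest to smallest average absolute contribution.
order = np.argsort(mean_abs)[::-1]
for rank, j in enumerate(order, start=1):
    print(f"{rank:2d}. {features[j]}  mean|SHAP|={mean_abs[j]:.4f}")
```

Because the absolute value is taken before averaging, the ranking reflects how strongly a feature moves predictions, not the direction in which it moves them.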

Conclusion

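Conclusion: by macro-F1, evolutionary origin rates showed stronger predictive power than binary family-level trait presence for Logistic Regression (0.311 vs 0.132) and Random Forest (0.232 vs 0.128), and were about equal for Decision Tree (0.093 vs 0.087)
Future work and potential applications:
- Remove the aggregate SS trait and redo modeling
- Improve data quality
- Consider potential overlaps in predictive signal
- Try different models
- Apply SHAP to biological interpretation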

Thank you for listening!