Evolution Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tree 84 non-null int64
1 Phylum 84 non-null object
2 A 84 non-null float64
3 G 84 non-null float64
4 O 84 non-null float64
5 T 84 non-null float64
6 V 84 non-null float64
7 C 84 non-null float64
8 F 84 non-null float64
9 K 84 non-null float64
10 M 84 non-null float64
11 S 84 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 8.0+ KB
Family Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1087 entries, 0 to 1086
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tree_Label 1087 non-null object
1 Phylum 1087 non-null object
2 SS 1087 non-null int64
3 A 1087 non-null int64
4 G 1087 non-null int64
5 O 1087 non-null int64
6 T 1087 non-null int64
7 V 1087 non-null int64
8 C 1087 non-null int64
9 F 1087 non-null int64
10 K 1087 non-null int64
11 M 1087 non-null int64
12 S 1087 non-null int64
dtypes: int64(11), object(2)
memory usage: 110.5+ KB
Trait-Based Prediction of Animal Taxa
Proposal
Dataset
Family-related metadata: 1 indicates trait presence, whereas 0 indicates trait absence. SS: Combined (any sexually selected trait), A: Auditory, G: Gustatory, O: Olfactory, T: Tactile, V: Visual, C: Male-male competition, F: Female choice, K: Female-female competition, M: Male choice, S: Intersexual conflict.
Rates of trait evolution metadata: A: Auditory, G: Gustatory, O: Olfactory, T: Tactile, V: Visual, C: Male-male competition, F: Female choice, K: Female-female competition, M: Male choice, S: Intersexual conflict.
1. Evolution Rate Dataset
- File: animals_rateof_evolution.csv
- Dimensions: 84 rows × 12 columns
- Description: Contains continuous values representing the evolutionary rate of various sexual traits across different animal taxa.
2. Family-Level Trait Dataset
- File: family-related_data.csv
- Dimensions: 1087 rows × 13 columns
- Description: Encodes presence (1) or absence (0) of various sexually selected traits (e.g., visual, auditory, male-male competition) at the family level.
I chose these two datasets because the family-related data contains binary values (0 or 1) for various sexually selected traits, which makes it good for training machine learning models across a wide range of animal families. This gives me hands-on experience with multi-class classification and feature selection, and I may use SHAP to interpret the model’s predictions and feature contributions depending on how the models perform. The evolutionary rates data, on the other hand, includes continuous values (some zeros and some greater than zero), which lets me compare whether binary presence/absence data or continuous evolutionary rates give better predictive power. The phylum-level data seems more redundant with the family-level data, but the evolutionary rates dataset provides a different kind of information. Even though it is model-derived, I can still use the rates as input features to predict superphyla classes, since they reflect repeated origins of sexually selected traits across lineages.
Both datasets have a large number of distinct phyla. To reduce sparsity and improve interpretability, I will group these into higher-level clades called superphyla. This grouping stays below the kingdom level (since all taxa fall under Animalia) but still captures important evolutionary structure. Specifically, I will classify them into:
Ecdysozoa
Lophotrochozoa
Deuterostomia
Basal Metazoa & Non-Bilaterians
Basal Bilateria
So overall, the family-level dataset gives raw binary trait data, while the evolutionary rates dataset gives continuous, model-derived estimates of trait origins. Using both together lets me compare which type of data—binary or continuous—works better for predicting evolutionary groupings.
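The superphylum grouping described above can be sketched as a simple lookup. This is a minimal illustration, not the final mapping: the dictionary name `SUPERPHYLUM_BY_PHYLUM`, the helper `to_superphylum`, the `"Unassigned"` fallback, and the exact spelling of phylum names in the datasets are all assumptions, and only a handful of well-known phyla are listed.

```python
# Hypothetical phylum -> superphylum lookup (illustrative subset only).
# The full table would need one entry for every phylum in the datasets,
# and the placement of contested phyla is a decision for the analysis.
SUPERPHYLUM_BY_PHYLUM = {
    "Arthropoda": "Ecdysozoa",
    "Nematoda": "Ecdysozoa",
    "Tardigrada": "Ecdysozoa",
    "Mollusca": "Lophotrochozoa",
    "Annelida": "Lophotrochozoa",
    "Brachiopoda": "Lophotrochozoa",
    "Chordata": "Deuterostomia",
    "Echinodermata": "Deuterostomia",
    "Hemichordata": "Deuterostomia",
    "Porifera": "Basal Metazoa & Non-Bilaterians",
    "Cnidaria": "Basal Metazoa & Non-Bilaterians",
    "Ctenophora": "Basal Metazoa & Non-Bilaterians",
    "Placozoa": "Basal Metazoa & Non-Bilaterians",
    "Xenacoelomorpha": "Basal Bilateria",
}


def to_superphylum(phylum: str) -> str:
    """Return the superphylum for a phylum name, or 'Unassigned' if unknown."""
    return SUPERPHYLUM_BY_PHYLUM.get(phylum.strip(), "Unassigned")


print(to_superphylum("Arthropoda"))
```

Assuming the Phylum column holds names spelled as in the lookup, the new target column could then be created with something like `family_df["Superphylum"] = family_df["Phylum"].map(to_superphylum)`. Keeping an explicit "Unassigned" fallback makes it easy to spot phyla that the table does not yet cover.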
Dataset Source: https://frontiersin.figshare.com/articles/dataset/Data_Sheet_3_Evolution_of_sexually_selected_traits_across_animals_XLSX/21921510?file=38886321
(DataCite) Citation: Tuschhoff, E.; Wiens, John J. (2023). Data_Sheet_3_Evolution of sexually selected traits across animals.XLSX. Frontiers. Dataset. https://doi.org/10.3389/fevo.2023.1042747.s003
Other Sources:
“Animal.” Wikipedia, 16 Aug. 2025. Wikipedia, https://en.wikipedia.org/w/index.php?title=Animal&oldid=1306191709.
Collins, Allen G., et al. “Phylogenetic Context and Basal Metazoan Model Systems.” Integrative and Comparative Biology, vol. 45, no. 4, Aug. 2005, pp. 585–94. academic.oup.com, https://doi.org/10.1093/icb/45.4.585.
“Deuterostome.” Wikipedia, 9 Jul. 2025. Wikipedia, https://en.wikipedia.org/w/index.php?title=Deuterostome&oldid=1299664913.
“Ecdysozoa.” Wikipedia, 12 Aug. 2025. Wikipedia, https://en.wikipedia.org/w/index.php?title=Ecdysozoa&oldid=1305526962.
“Lophotrochozoa.” Wikipedia, 26 Jul. 2025. Wikipedia, https://en.wikipedia.org/w/index.php?title=Lophotrochozoa&oldid=1302526625.
Motivation
I was motivated to pursue this machine learning project on animal classification because accurately classifying animals remains challenging, whether the classification is based on physiology, geographical distribution, and other traits, or on genomic sequence similarity (Tessler et al., 2022; van der Gulik et al., 2023). Machine learning has already been applied to taxonomic classification, but it still seems to be a fairly new approach, so I wanted to evaluate how effective it can be in this area (Alipour et al., 2024). If the project works out, it could be a stepping stone toward further refinement before applying the approach to other taxonomic classification problems that use numeric traits in a practical setting. Assuming the model accuracy is good, this would help narrow down the most relevant candidate taxa during classification while also saving time.
Sources:
Alipour, Fatemeh, et al. “Leveraging Machine Learning for Taxonomic Classification of Emerging Astroviruses.” Frontiers in Molecular Biosciences, vol. 10, Jan. 2024. Frontiers, https://doi.org/10.3389/fmolb.2023.1305506.
Tessler, Michael, et al. “Phylogenomics and the First Higher Taxonomy of Placozoa, an Ancient and Enigmatic Animal Phylum.” Frontiers in Ecology and Evolution, vol. 10, Dec. 2022. Frontiers, https://doi.org/10.3389/fevo.2022.1016357.
van der Gulik, Peter T. S., et al. “Renewing Linnaean Taxonomy: A Proposal to Restructure the Highest Levels of the Natural System.” Biological Reviews, vol. 98, no. 2, 2023, pp. 584–602. Wiley Online Library, https://doi.org/10.1111/brv.12920.
Questions
- How accurately can a machine learning model classify animal taxa based on the binary presence of sexually selected traits?
- Do evolutionary origin rates of sexual traits provide stronger predictive power than binary trait presence when classifying animal taxa?
Analysis plan
Overview
The primary objective is to assess and compare the predictive power and interpretability of the two datasets.
Dataset Descriptions
1. Family-Level Trait Data (family_df)
Format: Binary indicators (0 or 1) for the presence or absence of sexually selected traits.
Traits include: SS, A, G, O, T, V, C, F, K, M, S
Each row represents a taxonomic family.
The target variable will likely be a categorical label such as Class or a user-defined taxon.
2. Evolutionary Rates Data (evolution_df)
Format: Continuous numerical values representing the origin rates of the same traits listed above, except SS.
Each row represents a taxonomic group.
The target variable will match the one used in the binary dataset for comparison.
Exploratory Data Analysis (EDA)
EDA will be performed separately for both datasets:
Histograms showing the distribution of each trait
Visualization of the binary presence of sexually selected traits, if feasible
Correlation matrix of the sexually selected traits
Summary statistics for continuous traits.
Count plots or bar charts showing trait prevalence by taxonomic group (e.g., class or family).
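Two of the EDA steps above, trait prevalence by taxonomic group and the trait correlation matrix, can be sketched with pandas alone. This is a rough illustration under assumptions: the toy frame stands in for family_df, and the helper names `trait_prevalence` and `trait_correlations` are hypothetical.

```python
import pandas as pd

# Trait columns shared by both datasets (SS is omitted here).
TRAITS = ["A", "G", "O", "T", "V", "C", "F", "K", "M", "S"]


def trait_prevalence(df: pd.DataFrame, group_col: str = "Phylum") -> pd.DataFrame:
    """Fraction of rows in each group where each binary trait is present."""
    return df.groupby(group_col)[TRAITS].mean()


def trait_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlation between the trait columns."""
    return df[TRAITS].corr()


# Toy stand-in for family_df (made-up values, not real data).
toy = pd.DataFrame({
    "Phylum": ["Chordata", "Chordata", "Arthropoda", "Arthropoda"],
    "A": [1, 0, 1, 1], "G": [0, 0, 0, 1], "O": [1, 1, 0, 0],
    "T": [0, 1, 1, 0], "V": [1, 1, 0, 1], "C": [1, 1, 1, 0],
    "F": [1, 0, 0, 1], "K": [0, 0, 1, 0], "M": [0, 1, 0, 0],
    "S": [0, 0, 0, 1],
})

prevalence = trait_prevalence(toy)
correlations = trait_correlations(toy)
print(prevalence)
print(correlations.round(2))
```

The prevalence table can be passed directly to a bar chart, and the correlation table to a heatmap, with whichever plotting library the analysis ends up using.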
Modeling Approach
For both datasets:
1. Data Preparation
I will double-check for any null values.
I will encode the target variable (animal taxa) using LabelEncoder.
For the evolutionary rates, I will also create a binarized copy by converting any value > 0 to 1, so that each trait column indicates the presence (1) or absence (0) of a sexually selected trait.
For evolution_df, I will standardize the continuous trait features using StandardScaler.
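The preparation steps above might look like the following sketch. It assumes a target column named Superphylum has already been added, and the function name `prepare`, its `mode` parameter, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

TRAITS = ["A", "G", "O", "T", "V", "C", "F", "K", "M", "S"]


def prepare(df: pd.DataFrame, target_col: str = "Superphylum", mode: str = "scaled"):
    """Return (X, y, encoder) for modeling.

    mode="binary" maps every rate > 0 to 1 (presence) and 0 to 0 (absence);
    mode="scaled" standardizes each trait column to zero mean, unit variance.
    """
    if df[TRAITS + [target_col]].isnull().any().any():
        raise ValueError("Null values found; handle them before modeling.")

    encoder = LabelEncoder()
    y = encoder.fit_transform(df[target_col])

    values = df[TRAITS].to_numpy(dtype=float)
    if mode == "binary":
        X = (values > 0).astype(int)
    elif mode == "scaled":
        X = StandardScaler().fit_transform(values)
    else:
        raise ValueError(f"Unknown mode: {mode!r}")
    return X, y, encoder


# Toy stand-in for evolution_df (random values, not real rates).
rng = np.random.default_rng(0)
toy_rates = pd.DataFrame(rng.random((6, len(TRAITS))), columns=TRAITS)
toy_rates.iloc[0, :3] = 0.0  # a few zero rates, as in the real data
toy_rates["Superphylum"] = [
    "Ecdysozoa", "Deuterostomia", "Ecdysozoa",
    "Lophotrochozoa", "Deuterostomia", "Lophotrochozoa",
]

X_bin, y, encoder = prepare(toy_rates, mode="binary")
X_scaled, _, _ = prepare(toy_rates, mode="scaled")
print(encoder.classes_, y)
```

One caveat with this sketch: fitting StandardScaler on the full dataset before splitting lets test-set statistics leak into training. In the actual analysis, placing the scaler inside a scikit-learn Pipeline would keep it fitted on training folds only.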
2. Models to Use
DecisionTreeClassifier
RandomForestClassifier
LogisticRegression
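As a sketch of how the three models could be set up side by side, the example below compares them with 5-fold cross-validation. The synthetic data from make_classification only stands in for the real trait matrix, and the helper `build_models` and all hyperparameters are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_models(random_state: int = 0) -> dict:
    """Return the three candidate classifiers keyed by name."""
    return {
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=random_state),
        "RandomForestClassifier": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
        # Scaling lives inside the pipeline so it is refit on each training fold.
        "LogisticRegression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
    }


# Synthetic stand-in: 10 features (one per trait), 3 classes.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in build_models().items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Running the same loop once on the binary features and once on the continuous rates would give a like-for-like comparison between the two data formats.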
3. Evaluation Metrics
Accuracy score
Confusion matrix
Classification report
Cross-validation score (cross_val_score)
ROC-AUC score
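The held-out metrics listed above could be computed roughly as follows. This is a sketch on synthetic data with a single model; the split size and the one-vs-rest, macro-averaged ROC-AUC are assumptions about how the multi-class case might be handled.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the trait features and encoded superphylum labels.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
# Stratify so every class appears in both the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
# Multi-class ROC-AUC needs class probabilities and an explicit strategy.
auc = roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro")

print(f"Accuracy: {accuracy:.3f}")
print(f"ROC-AUC (one-vs-rest, macro): {auc:.3f}")
print(cm)
print(report)
```

With several superphyla and unequal group sizes, accuracy alone can be misleading, so the per-class precision and recall in the classification report and the confusion matrix are worth reading alongside it.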
4. Model Interpretation
I will use SHAP to interpret trained models.
I will generate SHAP summary plots and bar plots to highlight influential traits.
I will compare interpretability between models trained on binary and continuous data for the evolutionary rates.
Comparison and Interpretation
I will interpret the classification performance across both datasets and answer both questions. Then, I will draw conclusions about which data format (binary or continuous) provides stronger predictive power for question #2.
I will analyze SHAP outputs to identify which traits contribute most to predictions.
I may also examine which families have sexually selected traits and explain the results (if I have more time during the presentation).