import pandas as pd


def load_data(path):
    """
    Load data from a CSV file.

    Parameters
    ----------
    path : str
        Path to the CSV file to be loaded.

    Returns
    -------
    pandas.DataFrame
    """
    # Read the CSV, treating '?' as a missing value
    df = pd.read_csv(path, na_values='?')
    return df


data = load_data('data/risk_factors_cervical_cancer.csv')
Cervical Cancer Risk Prediction
Proposal
Research Goal
To build a predictive model that identifies women at risk of cervical cancer using demographic, social and behavioral history as well as clinical factors from the UCI Cervical Cancer Risk Factors dataset.
Motivation
While the prevalence of cervical cancer has decreased considerably in the past decade, it remains a significant global health problem, especially in low-income settings (Gopalkrishnan and Karim (2025)). Besides geographical differences, age-specific differences in cervical cancer incidence persist (Islami, Fedewa, and Jemal (2019), Gargano et al. (2025)). Further, there is a strong socioeconomic gradient in both the risk and the outcome of cervical cancer: lower-income and marginalized groups have higher incidence and mortality because of limited access to healthcare, education, and preventive interventions (Singh D (2023)). Early identification of high-risk individuals can facilitate timely intervention and appreciably decrease mortality. This project proposes a data-driven, machine learning–based predictive model to evaluate the risk of cervical cancer from established risk factors such as age, sexual history, contraceptive use, smoking, STD history, and socioeconomic indicators.
Questions
The rationale for this study is that, while HPV vaccination significantly reduces the risk of cervical cancer, routine screening remains essential for early detection, even among vaccinated individuals, because of incomplete vaccine coverage and the continued incidence of HPV-related cancers. Cervical cancer can still occur in older women who were never vaccinated and in younger women despite vaccination, which highlights the age variability in diagnosis and the need for continuous monitoring of at-risk individuals. There is therefore a need to explore non-invasive, cost-effective alternatives to biopsy for early detection that improve outcomes, enhance accessibility, and support screening across diverse age groups.
- Can we accurately predict cervical cancer biopsy outcomes using only non-invasive risk factors and lifestyle history?
- Which factors contribute the most to cervical cancer risk?
Ethical Consideration
The data source does not explicitly state whether participants gave informed consent for their data to be made publicly available, nor how the data was anonymized. Moreover, including children under 18 in research presents unique ethical challenges because of their limited autonomy and capacity for informed consent. Unauthorized use of sensitive medical data can violate ethical standards, such as those established under HIPAA in the U.S. (Edemekong et al. (2025)). Further, even when anonymized, this data can be misused to reinforce harmful stereotypes or biases, especially against women, sexual minorities, or certain cultural groups, because the dataset includes intimate and potentially stigmatizing information. Behavioral factors in the dataset (e.g., sexual behavior, contraceptive use) also carry different social meanings across cultures, so results must be interpreted in context.
Dataset
For this study, we will use the Cervical Cancer Risk Factors dataset from the UCI Machine Learning Repository. The data is well suited to this analysis because it was collected in a resource-limited setting at the Hospital Universitario de Caracas in Caracas, Venezuela. It contains both demographic and behavioral data for women, along with results from four cervical cancer screening tests: Hinselmann, Schiller, Cytology, and Biopsy. Our goal is to assess patterns that lead to a positive biopsy result, the most definitive of these screening measures.
This dataset was selected because it reflects real-world medical data from a low socioeconomic setting and offers a variety of relevant features, including lifestyle factors (e.g., smoking, contraception), prior medical diagnoses, and social determinants of health. In total, the dataset has 36 attributes.
Overview of the data
Target Variable
The target variable is the result of the biopsy test, which is the most definitive indicator for cervical cancer in this dataset. It is a binary variable: 1 indicates cancer detected, 0 indicates no cancer. Below is the information about the target variable:
Index | Column Name | Non-Null Count | Data Type |
---|---|---|---|
0 | Biopsy | 858 | int64 |
Covariates
Possible covariates include age, number of pregnancies, age at first intercourse, smoking history, contraception use, and STD history. Below is information about these variables:
Index | Column Name | Non-Null Count | Data Type |
---|---|---|---|
0 | Age | 858 | int64 |
1 | Number of sexual partners | 832 | float64 |
2 | First sexual intercourse | 851 | float64 |
3 | Num of pregnancies | 802 | float64 |
4 | Smokes | 845 | float64 |
5 | Smokes (years) | 845 | float64 |
6 | Smokes (packs/year) | 845 | float64 |
7 | Hormonal Contraceptives | 750 | float64 |
8 | Hormonal Contraceptives (years) | 750 | float64 |
9 | IUD | 741 | float64 |
10 | IUD (years) | 741 | float64 |
11 | STDs | 753 | float64 |
12 | STDs (number) | 753 | float64 |
13 | STDs:condylomatosis | 753 | float64 |
14 | STDs:cervical condylomatosis | 753 | float64 |
15 | STDs:vaginal condylomatosis | 753 | float64 |
16 | STDs:vulvo-perineal condylomatosis | 753 | float64 |
17 | STDs:syphilis | 753 | float64 |
18 | STDs:pelvic inflammatory disease | 753 | float64 |
19 | STDs:genital herpes | 753 | float64 |
20 | STDs:molluscum contagiosum | 753 | float64 |
21 | STDs:AIDS | 753 | float64 |
22 | STDs:HIV | 753 | float64 |
23 | STDs:Hepatitis B | 753 | float64 |
24 | STDs:HPV | 753 | float64 |
25 | STDs: Number of diagnosis | 858 | int64 |
26 | STDs: Time since first diagnosis | 71 | float64 |
27 | STDs: Time since last diagnosis | 71 | float64 |
28 | Dx:Cancer | 858 | int64 |
29 | Dx:CIN | 858 | int64 |
30 | Dx:HPV | 858 | int64 |
31 | Dx | 858 | int64 |
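A table like the one above (column name, non-null count, data type) could be produced with a small helper. The sketch below is illustrative only: `summarize_columns` is a hypothetical name, and the toy frame stands in for the real dataset with made-up values.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its non-null count and dtype."""
    return pd.DataFrame(
        {
            "Column Name": df.columns,
            "Non-Null Count": df.notna().sum().to_numpy(),
            "Data Type": df.dtypes.astype(str).to_numpy(),
        }
    )


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame(
    {
        "Age": [18, 25, 34, 52],
        "Smokes": [0.0, None, 1.0, 0.0],  # None becomes NaN (missing)
        "Biopsy": [0, 0, 1, 0],
    }
)
summary = summarize_columns(toy)
print(summary)
```

Columns whose non-null count falls well below the number of rows, such as the two "time since diagnosis" columns above, are the ones where the choice of missing-data strategy matters most.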
Study Population
The population for this study consists of 858 female patients from the Hospital Universitario de Caracas in Caracas, Venezuela. The majority of the patients (approximately 94%) have a negative biopsy result, which indicates a class imbalance in the target variable (Figure 1). Patient ages range from 13 to 84 with a notably right-skewed distribution, indicating a larger representation of younger individuals in the sample (Figure 2).
Figure 1: Distribution of the target variable
Figure 2: Distribution of age in the cohort
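The class imbalance and the age skew described for the study population can be checked numerically as well as visually. The following is a minimal sketch on a made-up toy frame, not the real data; only the column names `Age` and `Biopsy` come from the dataset.

```python
import pandas as pd

# Toy stand-in for the cohort: mostly negative biopsies, mostly younger patients.
toy = pd.DataFrame(
    {
        "Age": [15, 18, 19, 21, 22, 24, 27, 30, 45, 70],
        "Biopsy": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    }
)

# Share of each class in the target; a large gap signals class imbalance.
class_share = toy["Biopsy"].value_counts(normalize=True)

# Sample skewness of age; a positive value means a longer right tail.
age_skew = toy["Age"].skew()

print(class_share)
print(f"Age skewness: {age_skew:.2f}")
```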
Analysis Plan
Missing values will be evaluated and handled using strategies such as mean/mode imputation, KNN imputation, multiple imputation, or complete removal, depending on the missingness patterns. Variables will be transformed, normalized, and encoded as appropriate. We will apply different classification algorithms and compare them using cross-validation and performance metrics such as ROC-AUC, F1-score, precision, and recall to determine the best model. Proposed algorithms include Logistic Regression, Random Forest, XGBoost, and Support Vector Machine.
For feature importance and interpretation, we will use SHAP values and visualize the top contributing risk factors.
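As a rough illustration of this plan, the sketch below compares two of the proposed algorithms with stratified cross-validation using scikit-learn. It makes several assumptions: the data is synthetic (generated with `make_classification`, with missing values injected at random), only mean imputation is shown, and XGBoost and SVM are left out for brevity. It is not the project's final pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and Biopsy target.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # set roughly 10% of values to missing

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["roc_auc", "f1", "precision", "recall"]

results = {}
for name, model in models.items():
    # Imputation and scaling sit inside the pipeline so they are
    # fit on the training folds only, avoiding leakage into validation folds.
    pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), model)
    scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

for name, metrics in results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

Swapping `SimpleImputer` for `KNNImputer` in the same pipeline would allow the imputation strategies to be compared under identical folds.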
Proposed Timeline
Since this is a single-author project, I will be responsible for all aspects of the study, including data acquisition, preprocessing, analysis, model development, evaluation, interpretation, and reporting. The proposed timeline is as follows:
Time | Overall Goal | Specific Tasks |
---|---|---|
Week 1: | Data acquisition, literature review, and initial exploration | Load the dataset and perform a thorough exploratory data analysis, with a special focus on the distribution and patterns of missing values. |
Week 2: | Data cleaning, preprocessing, and feature engineering | Create new variables if necessary - Implement and compare several imputation strategies. - Split the data into training and testing sets |
Week 3: | Model selection, training, hyperparameter tuning | Train several classification models (e.g., Logistic Regression, Random Forest, Gradient Boosting) - Address class imbalance by applying techniques like SMOTE or class weighting during model training. - Use cross-validation to fine-tune model hyperparameters and select the best model. |
Week 4: | Model evaluation | Evaluate the final models on the test set and compare their performance using AUC-ROC, F1-score, precision, and recall. - Perform a SHAP analysis on the best model to determine the most predictive covariates |
Week 5: | Interpretation of results, visualization, and report writing | Interpret the results, then write and finalize the project report |
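The splitting, imbalance-handling, and tuning tasks in the timeline above could fit together roughly as follows. This is a sketch under stated assumptions: the data is synthetic, class weighting is used in place of SMOTE (which would require the separate imbalanced-learn package), and the hyperparameter grid is an arbitrary example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the real features and Biopsy target.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Stratify so the rare positive class keeps the same share in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency.
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # example grid, not a recommendation
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# The held-out test set is touched once, after tuning is finished.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(test_auc, 3))
```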
Repo Organization
Path/File | Purpose and Description |
---|---|
.github/ | Contains GitHub-specific configurations, including workflows, actions, and issue templates that automate and streamline repository management. |
_extra/ | Serves as a flexible storage space for miscellaneous or supplementary files that do not fit into other predefined project categories. |
_freeze/ | Stores frozen environment snapshots, capturing the exact package versions and setup used during specific stages of the project for reproducibility. |
_analysis/ | Hosts Jupyter notebooks outlining the project’s analytical framework, including exploratory data analysis, modeling strategies, and evaluation plans. |
data/ | Central repository for all raw and processed data files essential to the project, including datasets, input files, and metadata. |
images/ | Contains visual assets such as diagrams, charts, and screenshots used throughout the project for documentation, presentations, and analysis. |
.gitignore | Specifies files and directories to exclude from Git tracking, helping maintain a clean and efficient version control history. |
README.md | Provides a comprehensive overview of the project, including setup instructions, usage guidelines, objectives, and scope. Serves as the project’s landing document. |
_quarto.yml | Configuration file for Quarto, defining global settings for document rendering, output formats, and styling across all .qmd files. |
about.qmd | Supplementary Quarto document offering background on the project’s purpose, team member bios, and contextual information. |
index.qmd | Main Quarto document that will serve as the project’s homepage, integrating code, visualizations, narrative, and final results. |
presentation.qmd | Quarto file designed to generate the final project presentation in slideshow format, summarizing key findings and insights. |
proposal.qmd | Initial project planning document detailing the dataset, metadata, research questions, and a week-by-week roadmap. Updated regularly to reflect progress. |
requirements.txt | Lists all Python dependencies and their versions required to run the project, ensuring consistent environment setup across collaborators. |