Cervical Cancer Risk Prediction

Proposal

Info 523 Final Project

Author: Team Okhawere (Team Member: Kennedy)

Affiliation: College of Information Science, University of Arizona

Research Goal

To build a predictive model that identifies women at risk of cervical cancer using demographic, social and behavioral history as well as clinical factors from the UCI Cervical Cancer Risk Factors dataset.

Motivation

Although the prevalence of cervical cancer has decreased considerably in the past decade, it remains a significant global health problem, especially in low-income settings (Gopalkrishnan and Karim 2025). Besides geographical differences, age-specific differences in cervical cancer incidence persist (Islami, Fedewa, and Jemal 2019; Gargano et al. 2025). Further, there is a strong socioeconomic gradient in both the risk and outcome of cervical cancer: lower-income and marginalized groups have higher incidence and mortality because of limited access to healthcare, education, and preventive interventions (Singh, Lorenzoni, and Vignat 2023). Early identification of high-risk individuals can facilitate timely intervention and can appreciably decrease mortality. This project proposes a data-driven, machine learning–based predictive model that evaluates the risk of cervical cancer from established risk factors: age, sexual history, contraceptive use, smoking, STD history, and socioeconomic indicators.

Questions

The rationale for this study is that, while HPV vaccination significantly reduces the risk of cervical cancer, routine screening remains essential for early detection, even among vaccinated individuals, because vaccine coverage is incomplete and HPV-related cancers continue to occur. Cervical cancer can still occur in older women who were never vaccinated and in younger women despite vaccination, which highlights the age variability in diagnosis and the need for continuous monitoring of at-risk individuals. There is therefore a need to explore non-invasive, cost-effective alternatives to biopsy for early detection, in order to improve outcomes, enhance accessibility, and support screening across diverse age groups.

  1. Can we accurately predict cervical cancer biopsy outcomes using only non-invasive risk factors and lifestyle history?
  2. Which factors contribute the most to cervical cancer risk?

Ethical Consideration

The data source does not explicitly state whether participants gave informed consent for their data to be made publicly available, nor how the data were anonymized. Moreover, including children under 18 in research presents unique ethical challenges because of their limited autonomy and capacity for informed consent. Unauthorized use of sensitive medical data can violate ethical standards such as those established under HIPAA in the U.S. (Edemekong et al. 2025). Further, even when anonymized, these data can be misused to reinforce harmful stereotypes or biases, especially against women, sexual minorities, or certain cultural groups, because the dataset includes intimate and potentially stigmatizing information. Just as importantly, behavioral factors in the dataset (e.g., sexual behavior, contraceptive use) have different social meanings across cultures, so results must be interpreted in context.

Dataset

import pandas as pd

def load_data(path):
  """
  Load the dataset from a CSV file.

  Parameters:
     path : str
     Path to the CSV file to be loaded.

  Returns:
     DataFrame
     The loaded data, with '?' entries read as missing values (NaN).
  """
  # Read the CSV, treating '?' as a missing value
  df = pd.read_csv(path, na_values='?')

  return df

data = load_data('data/risk_factors_cervical_cancer.csv')

For this study, we will use the Cervical Cancer Risk Factors dataset from the UCI Machine Learning Repository. The data are well suited to this analysis because they were collected in a resource-limited setting, the Hospital Universitario de Caracas in Caracas, Venezuela. The dataset contains demographic and behavioral data for women, along with results from four cervical cancer screening tests: Hinselmann, Schiller, Cytology, and Biopsy. Our goal is to assess patterns that lead to positive biopsy results, the most definitive of these four measures.

This dataset was selected because it reflects real-world medical data from a low-socioeconomic setting and offers a variety of relevant features, including lifestyle factors (e.g., smoking, contraception), prior medical diagnoses, and social determinants of health. In total, the dataset has 36 attributes.

Overview of the data

Target Variable

The target variable is the result of the biopsy test, which is the most definitive indicator for cervical cancer in this dataset. It is a binary variable: 1 indicates cancer detected, 0 indicates no cancer. Below is the information about the target variable:

# | Column Name | Non-Null Count | Data Type
0 | Biopsy | 858 | int64

Covariates

Possible covariates include age, number of pregnancies, age at first intercourse, smoking history, contraception use, and STD history. Information about these variables is shown below:

# | Column Name | Non-Null Count | Data Type
0 | Age | 858 | int64
1 | Number of sexual partners | 832 | float64
2 | First sexual intercourse | 851 | float64
3 | Num of pregnancies | 802 | float64
4 | Smokes | 845 | float64
5 | Smokes (years) | 845 | float64
6 | Smokes (packs/year) | 845 | float64
7 | Hormonal Contraceptives | 750 | float64
8 | Hormonal Contraceptives (years) | 750 | float64
9 | IUD | 741 | float64
10 | IUD (years) | 741 | float64
11 | STDs | 753 | float64
12 | STDs (number) | 753 | float64
13 | STDs:condylomatosis | 753 | float64
14 | STDs:cervical condylomatosis | 753 | float64
15 | STDs:vaginal condylomatosis | 753 | float64
16 | STDs:vulvo-perineal condylomatosis | 753 | float64
17 | STDs:syphilis | 753 | float64
18 | STDs:pelvic inflammatory disease | 753 | float64
19 | STDs:genital herpes | 753 | float64
20 | STDs:molluscum contagiosum | 753 | float64
21 | STDs:AIDS | 753 | float64
22 | STDs:HIV | 753 | float64
23 | STDs:Hepatitis B | 753 | float64
24 | STDs:HPV | 753 | float64
25 | STDs: Number of diagnosis | 858 | int64
26 | STDs: Time since first diagnosis | 71 | float64
27 | STDs: Time since last diagnosis | 71 | float64
28 | Dx:Cancer | 858 | int64
29 | Dx:CIN | 858 | int64
30 | Dx:HPV | 858 | int64
31 | Dx | 858 | int64
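
A summary like the one above (column name, non-null count, data type) could be produced with a small helper along the following lines. This is a minimal sketch, not necessarily how the table was generated: the function name summarize_columns and the three-row toy CSV are hypothetical and stand in for the real file.

```python
import io

import pandas as pd


def summarize_columns(df, columns=None):
    """Return one row per column: name, non-null count, and dtype.

    Hypothetical helper for illustration only.
    """
    cols = list(df.columns) if columns is None else list(columns)
    return pd.DataFrame({
        "Column Name": cols,
        "Non-Null Count": [int(df[c].notna().sum()) for c in cols],
        "Data Type": [str(df[c].dtype) for c in cols],
    })


# Toy data (made up) with one '?' entry, read the same way as the real file
toy_csv = io.StringIO("Age,Smokes,Biopsy\n18,0,0\n25,?,0\n34,1,1\n")
toy = pd.read_csv(toy_csv, na_values="?")

summary = summarize_columns(toy)
print(summary)
```

Because '?' is read as NaN, a column containing it is stored as float64 even when its valid entries are whole numbers, which is consistent with the float64 types of the partially missing covariates listed above.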

Study Population

The population for this study consists of 858 female patients from the Hospital Universitario de Caracas in Caracas, Venezuela. The majority of the patients (approximately 94%) have a negative biopsy result, so the target variable is strongly imbalanced (Figure 1). Patient ages range from 13 to 84 and follow a notably right-skewed distribution, meaning that younger individuals are more heavily represented in the sample (Figure 2).

Distribution of the target variable

Figure 1: Distribution of Biopsy Results in the Cohort

Distribution of the Age in the Cohort

Figure 2: Distribution of Age in the Cohort
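
The practical consequence of the imbalance described in this section can be illustrated with a short calculation. The sketch below uses 100 made-up labels split 94/6 to mirror the approximate 94% negative rate; these are not the actual dataset counts, and the helper name balanced_class_weights is hypothetical. It shows that a model which always predicts "negative" reaches high accuracy while detecting no positive cases, and it computes the common "balanced" class-weight heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Illustrative labels only: 94 negatives and 6 positives (not real data)
labels = [0] * 94 + [1] * 6

# Baseline that always predicts the majority class
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and t == 1 for p, t in zip(predictions, labels))
recall = true_positives / sum(t == 1 for t in labels)

weights = balanced_class_weights(labels)

print(f"baseline accuracy: {accuracy:.2f}")
print(f"baseline recall:   {recall:.2f}")
print({cls: round(w, 3) for cls, w in weights.items()})
```

This is why metrics such as recall, precision, F1, and ROC-AUC are more informative than accuracy for this cohort.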

Analysis Plan

Missing values will be evaluated and handled with a strategy chosen according to the missingness pattern: mean/mode imputation, KNN imputation, multiple imputation, or complete removal. Variables will be transformed, normalized, and encoded as appropriate. We will apply several classification algorithms and compare them using cross-validation and performance metrics such as ROC-AUC, F1-score, precision, and recall to determine the best model. Proposed algorithms include Logistic Regression, Random Forest, XGBoost, and Support Vector Machine.
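
One way the comparison described above might be set up with scikit-learn is sketched below. It is an illustration under stated assumptions, not the final analysis: the data are synthetic (generated with make_classification, with roughly 6% positives and 10% of entries blanked out), only two of the proposed algorithms are shown, and mean imputation stands in for whichever strategy is eventually chosen. Placing the imputer and scaler inside the pipeline means they are refit on each training fold, so no information from the validation fold leaks into preprocessing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: imbalanced target, some missing values
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.94, 0.06], random_state=0,
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

# Preprocessing lives inside each pipeline so it is fit per training fold
models = {
    "logistic_regression": make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "random_forest": make_pipeline(
        SimpleImputer(strategy="mean"),
        RandomForestClassifier(
            n_estimators=100, class_weight="balanced", random_state=0
        ),
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["roc_auc", "f1", "precision", "recall"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}

for name, metrics in results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

XGBoost and a Support Vector Machine could be added to the models dictionary in the same way; XGBoost requires the separate xgboost package.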

For feature importance and interpretation, we will compute SHAP values and visualize the top contributing risk factors.
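
To make concrete what a SHAP value represents, the sketch below uses the closed form available for a linear model when features are treated as independent: the contribution of feature i is coef_i * (x_i - mean_i), measured on the model's linear (log-odds) scale. This is a hand-rolled illustration, not a call into the shap library, and every number in it (coefficients, intercept, background rows, patient values) is made up. The function name linear_shap_values is hypothetical.

```python
import numpy as np


def linear_shap_values(coef, X_background, x):
    """Closed-form SHAP values for a linear model with independent features.

    Each value is coef_i * (x_i - mean of feature i in the background data).
    """
    coef = np.asarray(coef, dtype=float)
    baseline = np.asarray(X_background, dtype=float).mean(axis=0)
    return coef * (np.asarray(x, dtype=float) - baseline)


# Made-up model and data, for illustration only
feature_names = ["Age", "Smokes (years)", "Hormonal Contraceptives (years)"]
coef = np.array([0.03, 0.10, 0.05])
intercept = -3.0
X_background = np.array([
    [20.0, 0.0, 0.0],
    [30.0, 2.0, 1.0],
    [40.0, 4.0, 5.0],
])
x = np.array([40.0, 4.0, 5.0])  # one hypothetical patient

phi = linear_shap_values(coef, X_background, x)

# Additivity: contributions sum to f(x) minus the average prediction
f_x = intercept + coef @ x
f_mean = intercept + (X_background @ coef).mean()

# Rank features by absolute contribution, largest first
ranking = [feature_names[i] for i in np.argsort(-np.abs(phi))]

for name, value in zip(feature_names, phi):
    print(f"{name}: {value:+.2f}")
print("sum of contributions:", round(float(phi.sum()), 2))
print("f(x) - mean f:       ", round(float(f_x - f_mean), 2))
```

For tree-based models such as Random Forest or XGBoost there is no such simple closed form, and the shap package's tree-specific explainers would be used instead.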

Proposed Timeline

Since this is a single-author project, I will be responsible for all aspects of the study, including data acquisition, preprocessing, analysis, model development, evaluation, interpretation, and reporting. The proposed timeline is as follows:

Time | Overall Goal | Specific Tasks
Week 1 | Data acquisition, literature review, and initial exploration | Load the dataset and perform a thorough exploratory data analysis, with a special focus on the distribution and patterns of missing values.
Week 2 | Data cleaning, preprocessing, and feature engineering | Create new variables if necessary; implement and compare several imputation strategies; split the data into training and testing sets.
Week 3 | Model selection, training, and hyperparameter tuning | Train several classification models (e.g., Logistic Regression, Random Forest, Gradient Boosting); address class imbalance with techniques such as SMOTE or class weighting during model training; use cross-validation to tune model hyperparameters and select the best model.
Week 4 | Model evaluation | Evaluate the final models on the test set and compare their performance using ROC-AUC, F1-score, precision, and recall; perform a SHAP analysis on the best model to determine the most predictive covariates.
Week 5 | Interpretation of results, visualization, and report writing | Interpret the results, then write and finalize the project report.
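
The split, tuning, and evaluation steps planned for Weeks 2 through 4 could fit together roughly as follows. This is a sketch on synthetic data, assuming scikit-learn; the parameter grid, the 25% test fraction, and the choice of Logistic Regression with class weighting are illustrative assumptions rather than decisions already made. SMOTE is not shown because it comes from the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the cleaned dataset (about 6% positives)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.94, 0.06], random_state=42,
)

# Stratify so the rare positive class appears in both splits at a similar rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Tune the regularization strength with cross-validation on the training set only
search = GridSearchCV(
    estimator=LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

# The held-out test set is touched once, after model selection is finished
test_scores = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)

print("best C:", search.best_params_["C"])
print("cross-validated ROC-AUC:", round(search.best_score_, 3))
print("test ROC-AUC:", round(test_auc, 3))
```

Keeping the test set out of the tuning loop is what makes the Week 4 comparison an honest estimate of performance on unseen patients.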

Repo Organization

Path/File | Purpose and Description
.github/ | Contains GitHub-specific configurations, including workflows, actions, and issue templates that automate and streamline repository management.
_extra/ | Serves as a flexible storage space for miscellaneous or supplementary files that do not fit into other predefined project categories.
_freeze/ | Stores frozen environment snapshots, capturing the exact package versions and setup used during specific stages of the project for reproducibility.
_analysis/ | Hosts Jupyter notebooks outlining the project’s analytical framework, including exploratory data analysis, modeling strategies, and evaluation plans.
data/ | Central repository for all raw and processed data files essential to the project, including datasets, input files, and metadata.
images/ | Contains visual assets such as diagrams, charts, and screenshots used throughout the project for documentation, presentations, and analysis.
.gitignore | Specifies files and directories to exclude from Git tracking, helping maintain a clean and efficient version control history.
README.md | Provides a comprehensive overview of the project, including setup instructions, usage guidelines, objectives, and scope. Serves as the project’s landing document.
_quarto.yml | Configuration file for Quarto, defining global settings for document rendering, output formats, and styling across all .qmd files.
about.qmd | Supplementary Quarto document offering background on the project’s purpose, team member bios, and contextual information.
index.qmd | Main Quarto document that will serve as the project’s homepage, integrating code, visualizations, narrative, and final results.
presentation.qmd | Quarto file designed to generate the final project presentation in slideshow format, summarizing key findings and insights.
proposal.qmd | Initial project planning document detailing the dataset, metadata, research questions, and a week-by-week roadmap. Updated regularly to reflect progress.
requirements.txt | Lists all Python dependencies and their versions required to run the project, ensuring consistent environment setup across collaborators.

References

Edemekong, Paul F, Pradeep Annamaraju, Muhammad Afzal, et al. 2025. “Health Insurance Portability and Accountability Act (HIPAA) Compliance.” https://www.ncbi.nlm.nih.gov/books/NBK500019/.
Gargano, Jay, Rachel Stefanos, Rachael M Dahl, et al. 2025. “Trends in Cervical Precancers Identified Through Population-Based Surveillance — Human Papillomavirus Vaccine Impact Monitoring Project, Five Sites, United States, 2008–2022.” MMWR Morb Mortal Wkly Rep 74: 96–101. https://doi.org/10.15585/mmwr.mm7406a4.
Gopalkrishnan, K., and R. Karim. 2025. “Addressing Global Disparities in Cervical Cancer Burden: A Narrative Review of Emerging Strategies.” Current HIV/AIDS Reports 22 (1): 18. https://doi.org/10.1007/s11904-025-00727-2.
Islami, Farhad, Stacey A Fedewa, and Ahmed Jemal. 2019. “Trends in Cervical Cancer Incidence Rates by Age, Race/Ethnicity, Histological Subtype, and Stage at Diagnosis in the United States.” Prev Med 123 (June): 316–23. https://doi.org/10.1016/j.ypmed.2019.04.010.
Singh, D., V. Lorenzoni, and J. Vignat. 2023. “Global Estimates of Incidence and Mortality of Cervical Cancer in 2020: A Baseline Analysis of the WHO Global Cervical Cancer Elimination Initiative.” Lancet Glob Health 11 (2): e197–206. https://doi.org/10.1016/S2214-109X(22)00501-0.