Cervical Cancer Risk Prediction

Proposal

Info 523 Final Project

Author

Affiliation

Team Okhawere Team Member: Kennedy

College of Information Science, University of Arizona

Research Goal

To build a predictive model that identifies women at risk of cervical cancer using demographic, social and behavioral history as well as clinical factors from the UCI Cervical Cancer Risk Factors dataset.

Motivation

Although the prevalence of cervical cancer has decreased considerably in the past decade, it remains a significant global health problem, especially in low-income settings (Gopalkrishnan and Karim 2025). Besides geographical differences, age-specific differences in cervical cancer incidence persist (Islami, Fedewa, and Jemal 2019; Gargano et al. 2025). There is also a strong socioeconomic gradient in both the risk and the outcome of cervical cancer: lower-income and marginalized groups have higher incidence and mortality because of limited access to healthcare, education, and preventive interventions (Singh, Lorenzoni, and Vignat 2023). Early identification of high-risk individuals can facilitate timely intervention and appreciably decrease mortality. This project proposes a data-driven, machine learning–based predictive model that evaluates the risk of cervical cancer from established risk factors: age, sexual history, contraceptive use, smoking, STD history, and socioeconomic indicators.

Questions

The rationale for this study is that, although HPV vaccination significantly reduces the risk of cervical cancer, routine screening remains essential for early detection, even among vaccinated individuals, because vaccine coverage is incomplete and HPV-related cancers continue to occur. Cervical cancer can still occur in older women who were never vaccinated and in younger women despite vaccination, which highlights both the age variability in diagnosis and the need for continuous monitoring of at-risk individuals. There is therefore a need to explore non-invasive, cost-effective alternatives to biopsy for early detection that improve outcomes, enhance accessibility, and support screening across diverse age groups.

1. Can we accurately predict cervical cancer biopsy outcomes using only non-invasive risk factors and lifestyle history?
2. Which factors contribute the most to cervical cancer risk?

Ethical Consideration

The data source does not explicitly state whether participants gave informed consent for their data to be made publicly available, nor how the data were anonymized. Moreover, including children under 18 in research presents unique ethical challenges because of their limited autonomy and capacity for informed consent. Unauthorized use of sensitive medical data can violate ethical standards, such as those established under HIPAA in the U.S. (Edemekong et al. 2025). Further, even when anonymized, these data can be misused to reinforce harmful stereotypes or biases, especially against women, sexual minorities, or certain cultural groups, because the dataset includes intimate and potentially stigmatizing information. In addition, behavioral factors in the dataset (e.g., sexual behavior, contraceptive use) have different social meanings across cultures, so the results must be interpreted in context.

Dataset

import pandas as pd


def load_data(path):
    """
    Load the cervical cancer risk factors CSV file.

    Parameters
    ----------
    path : str
        Path to the CSV file to be loaded.

    Returns
    -------
    pandas.DataFrame
        The loaded data, with '?' entries treated as missing values.
    """
    # Read the CSV, treating '?' as the missing-value marker
    df = pd.read_csv(path, na_values='?')
    return df


data = load_data('data/risk_factors_cervical_cancer.csv')

For this study, we will use the UCI Cervical Cancer Risk Factors dataset (https://archive.ics.uci.edu/dataset/383/cervical+cancer+risk+factors) from the UCI Machine Learning Repository. The data are well suited to this analysis because they were collected in a resource-limited setting, at the Hospital Universitario de Caracas in Caracas, Venezuela. The dataset contains demographic and behavioral data for women, along with results from four cervical cancer screening tests: Hinselmann, Schiller, cytology, and biopsy. Our goal is to assess patterns that lead to a positive biopsy result, which is the most definitive of these measures.

This dataset was selected because it reflects real-world medical data from a low socioeconomic setting and offers a variety of relevant features, including lifestyle factors (e.g., smoking, contraception), prior medical diagnoses, and social determinants of health. The dataset contains 36 attributes in total.

Overview of the data

Target Variable

The target variable is the result of the biopsy test, which is the most definitive indicator for cervical cancer in this dataset. It is a binary variable: 1 indicates cancer detected, 0 indicates no cancer. Below is the information about the target variable:

	Column Name	Non-Null Count	Data Type
0	Biopsy	858	int64

Covariates

Possible covariates include age, number of pregnancies, age at first intercourse, smoking history, contraception use, and STD history. Below is the information about these variables:

	Column Name	Non-Null Count	Data Type
0	Age	858	int64
1	Number of sexual partners	832	float64
2	First sexual intercourse	851	float64
3	Num of pregnancies	802	float64
4	Smokes	845	float64
5	Smokes (years)	845	float64
6	Smokes (packs/year)	845	float64
7	Hormonal Contraceptives	750	float64
8	Hormonal Contraceptives (years)	750	float64
9	IUD	741	float64
10	IUD (years)	741	float64
11	STDs	753	float64
12	STDs (number)	753	float64
13	STDs:condylomatosis	753	float64
14	STDs:cervical condylomatosis	753	float64
15	STDs:vaginal condylomatosis	753	float64
16	STDs:vulvo-perineal condylomatosis	753	float64
17	STDs:syphilis	753	float64
18	STDs:pelvic inflammatory disease	753	float64
19	STDs:genital herpes	753	float64
20	STDs:molluscum contagiosum	753	float64
21	STDs:AIDS	753	float64
22	STDs:HIV	753	float64
23	STDs:Hepatitis B	753	float64
24	STDs:HPV	753	float64
25	STDs: Number of diagnosis	858	int64
26	STDs: Time since first diagnosis	71	float64
27	STDs: Time since last diagnosis	71	float64
28	Dx:Cancer	858	int64
29	Dx:CIN	858	int64
30	Dx:HPV	858	int64
31	Dx	858	int64
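The non-null counts above show that missingness varies widely across columns. A minimal sketch of how the per-column share of missing values could be tabulated follows. The tiny inline CSV and the helper name `missing_summary` are illustrative assumptions, not part of the project code; only the `'?'` missing-value convention comes from the loader shown earlier.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the '?' missing-value marker
# used in the raw file; the values are made up for illustration.
sample_csv = io.StringIO(
    "Age,Smokes,IUD\n"
    "25,0,?\n"
    "31,?,?\n"
    "40,1,0\n"
)
toy = pd.read_csv(sample_csv, na_values="?")


def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    summary = pd.DataFrame({
        "Missing Count": df.isna().sum(),
        "Missing %": (df.isna().mean() * 100).round(1),
    })
    # Columns with the most missing values come first
    return summary.sort_values("Missing Count", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Applied to the full dataset, a table like this would make it easier to decide which columns are candidates for imputation and which are too sparse to keep.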

Study Population

The study population consists of 858 female patients from the Hospital Universitario de Caracas in Caracas, Venezuela. Most patients (approximately 94%) have a negative biopsy result, so the target classes are strongly imbalanced (Figure 1). Patient ages range from 13 to 84, and the distribution is right-skewed, with younger patients making up most of the sample (Figure 2).

Distribution of the target variable

Figure 1: Distribution of Biopsy Results in the Cohort

Distribution of the Age in the Cohort

Figure 2: Distribution of Age in the Cohort
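One common way to handle an imbalance like the one in the biopsy results is to weight each class inversely to its frequency. The sketch below is an illustration under stated assumptions: the helper name `balanced_class_weights` is hypothetical, the labels are a made-up 94/6 split rather than the actual biopsy column, and the formula is the inverse-frequency heuristic that scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Illustrative labels with a 94/6 split, not the actual biopsy column
labels = [0] * 94 + [1] * 6
weights = balanced_class_weights(labels)
print(weights)
```

With weights like these, an error on a positive case costs a model far more than an error on a negative case, which discourages it from simply predicting the majority class.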

Analysis Plan

Missing values will be assessed and handled with several strategies, such as mean/mode imputation, KNN imputation, multiple imputation, or complete removal, depending on the missingness patterns. Variables will be transformed, normalized, and encoded as appropriate. We will fit several classification algorithms and compare them using cross-validation and performance metrics such as ROC-AUC, F1-score, precision, and recall to determine the best model. Proposed algorithms include Logistic Regression, Random Forest, XGBoost, and Support Vector Machine.
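The workflow described above could be sketched roughly as follows. This is not the project's final pipeline: the data are synthetic (generated with scikit-learn's `make_classification`), median imputation and logistic regression stand in for whichever strategies are eventually chosen, and the 10% missingness rate is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 10% positives); NOT the UCI data
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Knock out roughly 10% of the entries to mimic missing values
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

# Imputation and scaling sit inside the pipeline so they are re-fit on each
# training fold, which keeps validation-fold information out of preprocessing
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Stratified folds preserve the class ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"ROC-AUC per fold: {np.round(scores, 3)}")
print(f"Mean ROC-AUC: {scores.mean():.3f}")
```

Swapping the final step for another estimator, or the imputer for `KNNImputer`, would let the same cross-validation loop compare the alternatives on equal footing.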

For feature importance, we will use SHAP values to interpret the model and visualize the top contributing risk factors.

Proposed Timeline

Since this is a single author project, I will be responsible for all aspects of the study, including data acquisition, preprocessing, analysis, model development, evaluation, interpretation, and reporting. The proposed timeline is as follows:

Time	Overall Goal	Specific Tasks
Week 1:	Data acquisition, literature review, and initial exploration	Load the dataset and perform a thorough exploratory data analysis, with a special focus on the distribution and patterns of missing values.
Week 2:	Data cleaning, preprocessing, and feature engineering	Create new variables if necessary - Implement and compare several imputation strategies. - Split the data into training and testing sets
Week 3:	Model selection, training, hyperparameter tuning	Train several classification models (e.g., Logistic Regression, Random Forest, Gradient Boosting) - Address class imbalance by applying techniques like SMOTE or class weighting during model training. - Use cross-validation to fine-tune model hyperparameters and select the best model.
Week 4:	Model evaluation	Evaluate the final models on the test set and compare their performance using AUC-ROC, F1-score, precision, and recall. - Perform a SHAP analysis on the best model to determine the most predictive covariates
Week 5:	Interpretation of results, visualization, and report writing	Interpret, write, and finalize the project report
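The Week 4 evaluation relies on precision, recall, and F1 rather than accuracy alone. The sketch below illustrates why, using made-up labels with a 94/6 split and a trivial classifier that always predicts "negative". The function name `classification_metrics` is hypothetical, and undefined ratios are reported as 0.0 by assumption.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels: 94 negatives and 6 positives (not the real data)
y_true = [0] * 94 + [1] * 6
# A trivial "classifier" that predicts negative for everyone
all_negative = [0] * 100

metrics = classification_metrics(y_true, all_negative)
print(metrics)
```

The always-negative baseline reaches 94% accuracy while finding none of the positive cases, so recall and F1 are both zero. That gap is why the comparison above leans on ROC-AUC, F1-score, precision, and recall.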

Repo Organization

Path/File	Purpose and Description
.github/	Contains GitHub-specific configurations, including workflows, actions, and issue templates that automate and streamline repository management.
_extra/	Serves as a flexible storage space for miscellaneous or supplementary files that do not fit into other predefined project categories.
_freeze/	Stores Quarto's frozen computation results, so that documents can be re-rendered reproducibly without re-executing their code.
_analysis/	Hosts Jupyter notebooks outlining the project’s analytical framework, including exploratory data analysis, modeling strategies, and evaluation plans.
data/	Central repository for all raw and processed data files essential to the project, including datasets, input files, and metadata.
images/	Contains visual assets such as diagrams, charts, and screenshots used throughout the project for documentation, presentations, and analysis.
.gitignore	Specifies files and directories to exclude from Git tracking, helping maintain a clean and efficient version control history.
README.md	Provides a comprehensive overview of the project, including setup instructions, usage guidelines, objectives, and scope. Serves as the project’s landing document.
_quarto.yml	Configuration file for Quarto, defining global settings for document rendering, output formats, and styling across all .qmd files.
about.qmd	Supplementary Quarto document offering background on the project’s purpose, team member bios, and contextual information.
index.qmd	Main Quarto document that will serve as the project’s homepage, integrating code, visualizations, narrative, and final results.
presentation.qmd	Quarto file designed to generate the final project presentation in slideshow format, summarizing key findings and insights.
proposal.qmd	Initial project planning document detailing the dataset, metadata, research questions, and a week-by-week roadmap. Updated regularly to reflect progress.
requirements.txt	Lists all Python dependencies and their versions required to run the project, ensuring consistent environment setup across collaborators.

References

Edemekong, Paul F, Pradeep Annamaraju, Muhammad Afzal, et al. 2025. “Health Insurance Portability and Accountability Act (HIPAA) Compliance.” https://www.ncbi.nlm.nih.gov/books/NBK500019/.

Gargano, Jay, Rachel Stefanos, Rachael M Dahl, et al. 2025. “Trends in Cervical Precancers Identified Through Population-Based Surveillance — Human Papillomavirus Vaccine Impact Monitoring Project, Five Sites, United States, 2008–2022.” MMWR Morb Mortal Wkly Rep 74: 96–101. https://doi.org/10.15585/mmwr.mm7406a4.

Gopalkrishnan, K., and R. Karim. 2025. “Addressing Global Disparities in Cervical Cancer Burden: A Narrative Review of Emerging Strategies.” Current HIV/AIDS Reports 22 (1): 18. https://doi.org/10.1007/s11904-025-00727-2.

Islami, Farhad, Stacey A Fedewa, and Ahmed Jemal. 2019. “Trends in Cervical Cancer Incidence Rates by Age, Race/Ethnicity, Histological Subtype, and Stage at Diagnosis in the United States.” Prev Med 123 (June): 316–23. https://doi.org/10.1016/j.ypmed.2019.04.010.

Singh, D., V. Lorenzoni, and J. Vignat. 2023. “Global Estimates of Incidence and Mortality of Cervical Cancer in 2020: A Baseline Analysis of the WHO Global Cervical Cancer Elimination Initiative.” Lancet Glob Health 11 (2): e197–206. https://doi.org/10.1016/S2214-109X(22)00501-0.
