Modeling and Predicting YouTube Engagement

Proposal

This project analyzes YouTube video metadata such as title, category, and publish time to explore their impact on engagement metrics like views, likes, and comments. The goal is to uncover patterns that help explain why certain videos trend or perform better.

Author: Marimuthu - Ashok kumar Marimuthu

Affiliation: College of Information Science, University of Arizona

import numpy as np
import pandas as pd

Dataset

## Dataset 1: YouTube Trending Video Dataset (Kaggle – India)

df = pd.read_csv("data/IN_youtube_trending_data.csv")
print("shape:\n",df.shape)
print("==============================================================")
print("sample:\n",df.head())
print("==============================================================")
print("Info:")
df.info()

shape:
 (251277, 16)
==============================================================
sample:
       video_id                                              title  \
0  Iot0eF6EoNA  Sadak 2 | Official Trailer | Sanjay | Pooja | ...   
1  x-KbnJ9fvJc  Kya Baat Aa : Karan Aujla (Official Video) Tan...   
2  KX06ksuS6Xo  Diljit Dosanjh: CLASH (Official) Music Video |...   
3  UsMRgnTcchY  Dil Ko Maine Di Kasam Video | Amaal M Ft.Ariji...   
4  WNSEXJJhKTU  Baarish (Official Video) Payal Dev,Stebin Ben ...   

            publishedAt                 channelId    channelTitle  categoryId  \
0  2020-08-12T04:31:41Z  UCGqvJPRcv7aVFun-eTsatcA    FoxStarHindi          24   
1  2020-08-11T09:00:11Z  UCm9SZAl03Rev9sFwloCdz1g  Rehaan Records          10   
2  2020-08-11T07:30:02Z  UCZRdNleCgW-BGUJf-bbjzQg  Diljit Dosanjh          10   
3  2020-08-10T05:30:49Z  UCq-Fj5jknLsUf-MWSy4_brA        T-Series          10   
4  2020-08-11T05:30:13Z  UCye6Oz0mg46S362LwARGVcA   VYRLOriginals          10   

          trending_date                                               tags  \
0  2020-08-12T00:00:00Z  sadak|sadak 2|mahesh bhatt|vishesh films|pooja...   
1  2020-08-12T00:00:00Z                                             [None]   
2  2020-08-12T00:00:00Z  clash diljit dosanjh|diljit dosanjh|diljit dos...   
3  2020-08-12T00:00:00Z  hindi songs|2020 hindi songs|2020 new songs|t-...   
4  2020-08-12T00:00:00Z  VYRL Original|Mohsin Khan|Shivangi Joshi|Payal...   

   view_count   likes  dislikes  comment_count  \
0     9885899  224925   3979409         350210   
1    11308046  655450     33242         405146   
2     9140911  296533      6179          30058   
3    23564512  743931     84162         136942   
4     6783649  268817      8798          22984   

                                   thumbnail_link  comments_disabled  \
0  https://i.ytimg.com/vi/Iot0eF6EoNA/default.jpg              False   
1  https://i.ytimg.com/vi/x-KbnJ9fvJc/default.jpg              False   
2  https://i.ytimg.com/vi/KX06ksuS6Xo/default.jpg              False   
3  https://i.ytimg.com/vi/UsMRgnTcchY/default.jpg              False   
4  https://i.ytimg.com/vi/WNSEXJJhKTU/default.jpg              False   

   ratings_disabled                                        description  
0             False  Three Streams. Three Stories. One Journey. Sta...  
1             False  Singer/Lyrics: Karan Aujla Feat Tania Music/ D...  
2             False  CLASH official music video performed by DILJIT...  
3             False  Gulshan Kumar and T-Series presents Bhushan Ku...  
4             False  VYRL Originals brings to you ‘Baarish’ - the b...  
==============================================================
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251277 entries, 0 to 251276
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   video_id           251277 non-null  object
 1   title              251277 non-null  object
 2   publishedAt        251277 non-null  object
 3   channelId          251277 non-null  object
 4   channelTitle       251276 non-null  object
 5   categoryId         251277 non-null  int64 
 6   trending_date      251277 non-null  object
 7   tags               251277 non-null  object
 8   view_count         251277 non-null  int64 
 9   likes              251277 non-null  int64 
 10  dislikes           251277 non-null  int64 
 11  comment_count      251277 non-null  int64 
 12  thumbnail_link     251277 non-null  object
 13  comments_disabled  251277 non-null  bool  
 14  ratings_disabled   251277 non-null  bool  
 15  description        231822 non-null  object
dtypes: bool(2), int64(5), object(9)
memory usage: 27.3+ MB

Dataset 1: YouTube Trending Video Dataset (Kaggle – India)

Source and Provenance

Source: Kaggle – rsrishav/youtube-trending-video-dataset
Collected by: Kaggle user @rsrishav
Date Collected: The dataset was last updated in 2023
How it was collected: Video metadata was collected daily from YouTube’s trending page for India using the YouTube API and stored as structured CSV files.

Data Access

The dataset used in this project, IN_youtube_trending_data.csv, exceeds GitHub’s file size limit (100 MB) and is therefore not included in the repository.

To access the dataset, please use the following Google Drive link:

Download IN_youtube_trending_data.csv

After downloading, place the file in the data folder.

Description of Observations

This file contains metadata for trending YouTube videos in India. Each row represents a video trending on a specific day. Videos that trend across multiple days appear multiple times in the dataset.

The dataset includes approximately 251,000 rows and 16 columns. Key variables include:

title – video title
channelTitle – channel name
publishedAt – original video upload time
view_count, likes, comment_count – performance metrics
tags, description, categoryId – contextual info

This dataset supports both categorical and quantitative analysis. It’s suitable for time-based, text-based, and engagement-based exploration.

Ethical Considerations

All metadata is collected from publicly accessible video pages
No private or personally identifiable user information (PII) is included
The data is shared under Kaggle’s open use policy for academic and non-commercial use

Research Question

1. What video characteristics (e.g., publish time, title structure, tags, category) are associated with higher engagement (views, likes, comments)?

2. Can we build a predictive model to identify whether a video will be high-performing based on its metadata?

I will examine how the following video characteristics influence engagement metrics (views, likes, comments):

Publish timing (hour and day of the week)
Title length and patterns (e.g., keyword use, clickbait phrases, presence of numbers)
Video category (categoryId)
Tags (number of tags, presence of specific keywords)

These characteristics may influence performance metrics such as:

view_count
likes
comment_count

Note: The dataset does not include video duration or thumbnail image content. These may be considered in future work using the YouTube Data API.

Why This Matters

This question is relevant for both established creators and new channels aiming to optimize content strategy for discoverability and engagement.

Analysis Plan

To answer the research question, I will:

- Preprocess the data:

Convert publishedAt to datetime format
Extract features such as hour, weekday, and create daypart buckets (morning/afternoon/evening)
Handle duplicate trending entries (the same video_id on several trending dates) by keeping the first appearance or by aggregating views/likes per video
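The preprocessing steps above can be sketched as follows. This is a minimal illustration on made-up rows that mimic the Dataset 1 column names. The bucket edges (including the extra "night" bucket), the `publish_hour` and `publish_weekday` column names, and the choice to keep timestamps in UTC rather than convert them to Indian Standard Time are all assumptions, not decisions fixed by this proposal.

```python
import pandas as pd

# Toy rows that mimic the Dataset 1 schema; the values are invented for illustration.
df = pd.DataFrame({
    "video_id": ["a1", "a1", "b2", "c3"],
    "publishedAt": ["2020-08-12T04:31:41Z", "2020-08-12T04:31:41Z",
                    "2020-08-11T13:00:11Z", "2020-08-10T19:30:49Z"],
    "trending_date": ["2020-08-12T00:00:00Z", "2020-08-13T00:00:00Z",
                      "2020-08-12T00:00:00Z", "2020-08-12T00:00:00Z"],
    "view_count": [100, 250, 300, 50],
})

# Parse the ISO 8601 strings into timezone-aware (UTC) datetimes.
df["publishedAt"] = pd.to_datetime(df["publishedAt"], utc=True)
df["trending_date"] = pd.to_datetime(df["trending_date"], utc=True)

# Hour and weekday of publication (still in UTC here).
df["publish_hour"] = df["publishedAt"].dt.hour
df["publish_weekday"] = df["publishedAt"].dt.day_name()

# Assumed bucket edges: [0, 6) night, [6, 12) morning, [12, 18) afternoon, [18, 24) evening.
df["upload_hour_bucket"] = pd.cut(
    df["publish_hour"],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["night", "morning", "afternoon", "evening"],
)

# One option for duplicates: keep each video's first trending appearance.
first_seen = (
    df.sort_values("trending_date", kind="stable")
      .drop_duplicates(subset="video_id", keep="first")
      .reset_index(drop=True)
)
print(first_seen[["video_id", "publish_weekday", "upload_hour_bucket", "view_count"]])
```

Keeping the first appearance captures a video's metrics at the moment it began trending; keeping the last appearance instead would capture its metrics at the end of its trending run.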

- Create new variables:

title_length: total number of characters in the video title
has_numbers_in_title: binary indicator for numbers in title (e.g., “Top 5”, “2023”)
upload_hour_bucket: categorical variable (e.g., morning, afternoon, evening)
tag_count: number of tags used
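A rough sketch of how these variables might be derived is shown below. It assumes that tags are separated by `|` and that `[None]` marks a video without tags, which is how they appear in the sample output above. The helper `count_tags` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows; titles and tags are invented for illustration.
df = pd.DataFrame({
    "title": ["Top 5 Goals of 2023", "Baarish (Official Video)"],
    "tags": ["football|goals|top 5", "[None]"],
})

# Number of characters in the title.
df["title_length"] = df["title"].str.len()

# 1 if the title contains at least one digit, else 0.
df["has_numbers_in_title"] = df["title"].str.contains(r"\d", regex=True).astype(int)


def count_tags(raw) -> int:
    """Count pipe-separated tags; treat missing values and '[None]' as zero tags."""
    if pd.isna(raw) or raw == "[None]":
        return 0
    return len([tag for tag in raw.split("|") if tag.strip()])


df["tag_count"] = df["tags"].apply(count_tags)
print(df[["title_length", "has_numbers_in_title", "tag_count"]])
```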

- Explore patterns:

Group by categoryId, upload_hour_bucket, and title_length to visualize how engagement (views/likes/comments) varies
Use bar plots, boxplots, and heatmaps to show relationships
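As an illustration of the grouping step, the sketch below summarizes engagement by category and upload bucket and reshapes the result into a table that could feed a heatmap. The toy values are invented, and the use of the median (rather than the mean) is an assumption motivated by the heavy right skew typical of view counts.

```python
import pandas as pd

# Toy data; in the project these columns would come from the cleaned Dataset 1.
df = pd.DataFrame({
    "categoryId": [10, 10, 24, 24, 24],
    "upload_hour_bucket": ["morning", "evening", "morning", "morning", "evening"],
    "view_count": [100, 300, 50, 150, 400],
    "likes": [10, 30, 5, 15, 20],
})

# Count videos and take median engagement per (category, bucket) group.
summary = (
    df.groupby(["categoryId", "upload_hour_bucket"])
      .agg(
          videos=("view_count", "size"),
          median_views=("view_count", "median"),
          median_likes=("likes", "median"),
      )
      .reset_index()
)

# Wide table (categories as rows, buckets as columns) suitable for a heatmap.
heatmap_table = summary.pivot(
    index="categoryId", columns="upload_hour_bucket", values="median_views"
)
print(heatmap_table)
```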

- Build a predictive model:

Define the target variable:
- high_performer: binary variable = 1 if video is in the top 25% by view count, else 0
Use supervised learning models (e.g., logistic regression, decision tree, or random forest)
Split the data into training and test sets and evaluate the model using accuracy, precision, recall, and ROC-AUC
Identify the most important features contributing to video performance
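The modeling steps above might look roughly like the sketch below. It assumes scikit-learn (the proposal names the model families but not a library), uses synthetic data in place of the real features, and fixes the 75th-percentile threshold described above. The feature set and random seeds are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features; not real YouTube data.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "title_length": rng.integers(10, 100, size=n),
    "tag_count": rng.integers(0, 30, size=n),
    "publish_hour": rng.integers(0, 24, size=n),
})
df["view_count"] = rng.lognormal(mean=12, sigma=1, size=n) * (1 + df["tag_count"] / 30)

# Target: 1 if the video is in the top 25% by view count, else 0.
threshold = df["view_count"].quantile(0.75)
df["high_performer"] = (df["view_count"] >= threshold).astype(int)

X = df[["title_length", "tag_count", "publish_hour"]]
y = df["high_performer"]

# Stratify so both splits keep the 25% / 75% class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred, zero_division=0),
    "recall": recall_score(y_test, pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, proba),
}
print(metrics)
```

Because the positive class is only about a quarter of the data, accuracy alone can look high for a model that rarely predicts a high performer, which is why precision, recall, and ROC-AUC are reported alongside it. If duplicate trending rows are not removed first, the same video could land in both the training and test sets and inflate these scores.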

- Interpret results:

Use model outputs (coefficients or feature importance) to explain which video characteristics are most predictive of success
Relate findings back to the research question and practical implications for content creators

Note: The final set of features may evolve as the analysis progresses, based on data quality, correlations, or insights from EDA. While the initial modeling plan includes classification using logistic regression or decision trees, the specific model and feature set will be finalized based on what proves most effective during the modeling phase. The target variable is currently defined as videos in the top 25% of view count, but this threshold may be adjusted after reviewing the distribution.

## Dataset 2: 1000 Most Trending YouTube Videos (Kaggle)

df2 = pd.read_csv("data/top-1000-trending-youtube-videos.csv")
print("shape:\n",df2.shape)
print("==============================================================")
print("sample:\n",df2.head())
print("==============================================================")
print("Info:")
df2.info()

shape:
 (1000, 7)
==============================================================
sample:
    rank                                              Video    Video views  \
0     1  20 Tennis shots if they were not filmed, NOBOD...      3,471,237   
1     2  Lil Nas X - Old Town Road (Official Movie) ft....     54,071,677   
2     3                 JoJo Siwa - Karma (Official Video)     34,206,747   
3     4  Wiz Khalifa - See You Again ft. Charlie Puth [...  6,643,904,918   
4     5                       伊賀の天然水強炭酸水「家族で、シュワシェア。」篇　15秒    236,085,971   

        Likes Dislikes Category  published  
0      19,023      859      NaN       2017  
1   3,497,955   78,799    Music       2019  
2     293,563      NaN    Music       2024  
3  44,861,602      NaN    Music       2015  
4          38      NaN      NaN       2021  
==============================================================
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   rank         1000 non-null   int64 
 1   Video        1000 non-null   object
 2   Video views  1000 non-null   object
 3   Likes        973 non-null    object
 4   Dislikes     687 non-null    object
 5   Category     820 non-null    object
 6   published    1000 non-null   int64 
dtypes: int64(2), object(5)
memory usage: 54.8+ KB

Dataset 2: Top 1000 Trending YouTube Videos (Kaggle)

Source and Provenance

Source: Kaggle – 1000 Most Trending YouTube Videos
Collected by: Samith Sachidanandan
Date Collected: Not specified; likely compiled as a snapshot of all-time trending videos
How it was collected: Curated list of the most viewed and liked YouTube videos, possibly scraped from YouTube’s top charts. The dataset includes basic metadata and performance metrics.

Description of Observations

This dataset contains 1,000 records, each representing a globally popular YouTube video. It includes the following columns:

rank: position in the top 1000
Video: title of the video
Video views: number of views
Likes: number of likes
Dislikes: number of dislikes
Category: general topic (e.g., Music, Sports)
published: year the video was published

Although compact, this dataset captures extremely successful videos and is useful for identifying characteristics shared by top performers across different time periods and content types.

Ethical Considerations

The dataset contains only publicly available metadata from YouTube
No personal or user-level data is included
It is shared under Kaggle’s community license for academic and non-commercial use

Research Question

What common characteristics do top-performing YouTube videos share across categories and publishing years?

This dataset will help explore whether video success correlates with:

Category (e.g., Music vs. Gaming)
Year of publication (older vs. newer content)
View-to-like ratios or audience engagement patterns

Why This Matters:

This dataset provides a snapshot of top-tier performers, helping validate whether trends found in the larger India-specific dataset (Dataset 1) hold true at the global, all-time level.

Variables to Explore

Quantitative: Video views, Likes, Dislikes
Categorical: Category, published (as a proxy for video age)

Analysis Plan

Convert Video views, Likes, and Dislikes to numeric format (they may be strings with commas)

- Create new derived variables:

like_ratio = Likes / Video views
engagement_score = (Likes + Dislikes) / Video views
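A minimal sketch of the conversion and the two derived variables follows, using the first three rows of the sample output shown earlier. Coercing unparseable values to NaN is an assumption about how bad entries should be handled.

```python
import pandas as pd

# First three rows of the Dataset 2 sample shown above (counts stored as strings).
df2 = pd.DataFrame({
    "Video views": ["3,471,237", "54,071,677", "34,206,747"],
    "Likes": ["19,023", "3,497,955", "293,563"],
    "Dislikes": ["859", "78,799", None],
})

# Strip thousands separators, then convert; unparseable values become NaN.
for col in ["Video views", "Likes", "Dislikes"]:
    df2[col] = pd.to_numeric(
        df2[col].str.replace(",", "", regex=False), errors="coerce"
    )

df2["like_ratio"] = df2["Likes"] / df2["Video views"]
df2["engagement_score"] = (df2["Likes"] + df2["Dislikes"]) / df2["Video views"]
print(df2)
```

Note that a missing Dislikes value makes engagement_score missing for that row. Since only 687 of the 1,000 rows have a Dislikes value, this score would be unavailable for roughly a third of the dataset unless missing dislikes are handled explicitly.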

- Analyze view counts and like ratios by Category and published year

Visualize trends in engagement over time and across categories
Compare the findings with those from Dataset 1 to see if the characteristics of top-trending videos align with broader trending patterns

- Role in Final Project

This dataset will serve as a focused benchmark of top-performing content. While it won’t be used for predictive modeling, it provides valuable insight into common characteristics of high-success videos and supports cross-checking of patterns discovered in the larger primary dataset.

## Dataset 3: YouTube Trending Videos via API (India)

df3 = pd.read_csv("data/youtube_api_sample.csv")
print("shape:\n",df3.shape)
print("==============================================================")
print("sample:\n",df3.head())
print("==============================================================")
print("Info:")
df3.info()

shape:
 (50, 9)
==============================================================
sample:
        videoId                                              title  \
0  FbXOsVByKmk  They Call Him OG - Firestorm Lyric Video | Paw...   
1  qeVfT2iLiu0  Coolie - Official Trailer | Superstar Rajinika...   
2  VCqOcfGebaY  2025 PMWC at EWC Grand Finals D2 | English Co ...   
3  enjkcCdAlXc  Aavan Jaavan Song | WAR 2 | Hrithik Roshan, Ki...   
4  KkggRAFMg5c  Coolie | Trailer Reaction | Superstar Rajinika...   

       channelTitle  categoryId           publishedAt  viewCount  likeCount  \
0  Sony Music South          10  2025-08-02T08:53:07Z    4880098     641168   
1            Sun TV          24  2025-08-02T13:30:25Z    9825816     588809   
2       Snax Gaming          20  2025-08-02T15:55:41Z    1076294      54896   
3               YRF          10  2025-07-31T05:41:16Z   21057184     320699   
4     LifeofShazzam          24  2025-08-02T14:15:38Z     267028      21712   

   commentCount   duration  
0         27050     PT4M6S  
1         26836     PT3M2S  
2           115  PT4H51M1S  
3         15007       PT4M  
4          1126    PT8M20S  
==============================================================
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   videoId       50 non-null     object
 1   title         50 non-null     object
 2   channelTitle  50 non-null     object
 3   categoryId    50 non-null     int64 
 4   publishedAt   50 non-null     object
 5   viewCount     50 non-null     int64 
 6   likeCount     50 non-null     int64 
 7   commentCount  50 non-null     int64 
 8   duration      50 non-null     object
dtypes: int64(4), object(5)
memory usage: 3.6+ KB

Weekly Plan of Attack

Week 1: Finalize Data & Preprocessing (Aug 2–Aug 8)

Submit proposal and finalize dataset files in /data folder
Clean and standardize variables (e.g., dates, durations, text)
Engineer new features like:
- title_length, upload_hour, like_ratio, duration_minutes
Save cleaned version for modeling in /notebooks or /src
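For the duration_minutes feature, the duration column in the API sample (Dataset 3) holds ISO 8601 duration strings such as PT4M6S and PT4H51M1S. One way to convert them is sketched below using only the standard library. The helper name `duration_to_minutes` is hypothetical, and the pattern covers only the day, hour, minute, and second components.

```python
import re

# Matches ISO 8601 durations limited to days, hours, minutes, and seconds,
# e.g. "PT4M6S", "PT4H51M1S", "P1DT2H".
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_minutes(value: str) -> float:
    """Convert an ISO 8601 duration string to minutes (hypothetical helper)."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"Unrecognized duration format: {value!r}")
    # Components absent from the string count as zero.
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    total_seconds = (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
    return total_seconds / 60


# Values taken from the Dataset 3 sample output.
for raw in ["PT4M6S", "PT3M2S", "PT4H51M1S", "PT4M", "PT8M20S"]:
    print(raw, round(duration_to_minutes(raw), 2))
```

On a DataFrame this could be applied as `df3["duration"].apply(duration_to_minutes)`.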

Week 2: Modeling & Exploratory Analysis (Aug 9–Aug 15)

Conduct exploratory analysis:
- Views by category, upload time, title length
- Correlations and class imbalance check
Define target variable:
- Binary (e.g., top 25% views = “high performer”) or regression
Train and evaluate models:
- Logistic Regression, Decision Tree, Random Forest
Use classification metrics:
- Accuracy, ROC-AUC, precision, recall
Review feature importance to guide interpretation

Week 3: Report Writing & Presentation (Aug 16–Aug 21)

Create presentation.qmd with key visualizations and model insights
Tell a clear story: problem → data → features → modeling → takeaways
Add a reflection section:
- What I learned, what I would improve with more time
Clean up GitHub repo:
- Add README, remove unused files, ensure reproducibility
