Modeling and Predicting YouTube Engagement
Proposal
Dataset
## Dataset 1: YouTube Trending Video Dataset (Kaggle – India)
import pandas as pd

df = pd.read_csv("data/IN_youtube_trending_data.csv")
print("shape:\n",df.shape)
print("==============================================================")
print("sample:\n",df.head())
print("==============================================================")
print("Info:")
df.info()
shape:
(251277, 16)
==============================================================
sample:
video_id title \
0 Iot0eF6EoNA Sadak 2 | Official Trailer | Sanjay | Pooja | ...
1 x-KbnJ9fvJc Kya Baat Aa : Karan Aujla (Official Video) Tan...
2 KX06ksuS6Xo Diljit Dosanjh: CLASH (Official) Music Video |...
3 UsMRgnTcchY Dil Ko Maine Di Kasam Video | Amaal M Ft.Ariji...
4 WNSEXJJhKTU Baarish (Official Video) Payal Dev,Stebin Ben ...
publishedAt channelId channelTitle categoryId \
0 2020-08-12T04:31:41Z UCGqvJPRcv7aVFun-eTsatcA FoxStarHindi 24
1 2020-08-11T09:00:11Z UCm9SZAl03Rev9sFwloCdz1g Rehaan Records 10
2 2020-08-11T07:30:02Z UCZRdNleCgW-BGUJf-bbjzQg Diljit Dosanjh 10
3 2020-08-10T05:30:49Z UCq-Fj5jknLsUf-MWSy4_brA T-Series 10
4 2020-08-11T05:30:13Z UCye6Oz0mg46S362LwARGVcA VYRLOriginals 10
trending_date tags \
0 2020-08-12T00:00:00Z sadak|sadak 2|mahesh bhatt|vishesh films|pooja...
1 2020-08-12T00:00:00Z [None]
2 2020-08-12T00:00:00Z clash diljit dosanjh|diljit dosanjh|diljit dos...
3 2020-08-12T00:00:00Z hindi songs|2020 hindi songs|2020 new songs|t-...
4 2020-08-12T00:00:00Z VYRL Original|Mohsin Khan|Shivangi Joshi|Payal...
view_count likes dislikes comment_count \
0 9885899 224925 3979409 350210
1 11308046 655450 33242 405146
2 9140911 296533 6179 30058
3 23564512 743931 84162 136942
4 6783649 268817 8798 22984
thumbnail_link comments_disabled \
0 https://i.ytimg.com/vi/Iot0eF6EoNA/default.jpg False
1 https://i.ytimg.com/vi/x-KbnJ9fvJc/default.jpg False
2 https://i.ytimg.com/vi/KX06ksuS6Xo/default.jpg False
3 https://i.ytimg.com/vi/UsMRgnTcchY/default.jpg False
4 https://i.ytimg.com/vi/WNSEXJJhKTU/default.jpg False
ratings_disabled description
0 False Three Streams. Three Stories. One Journey. Sta...
1 False Singer/Lyrics: Karan Aujla Feat Tania Music/ D...
2 False CLASH official music video performed by DILJIT...
3 False Gulshan Kumar and T-Series presents Bhushan Ku...
4 False VYRL Originals brings to you ‘Baarish’ - the b...
==============================================================
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251277 entries, 0 to 251276
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 video_id 251277 non-null object
1 title 251277 non-null object
2 publishedAt 251277 non-null object
3 channelId 251277 non-null object
4 channelTitle 251276 non-null object
5 categoryId 251277 non-null int64
6 trending_date 251277 non-null object
7 tags 251277 non-null object
8 view_count 251277 non-null int64
9 likes 251277 non-null int64
10 dislikes 251277 non-null int64
11 comment_count 251277 non-null int64
12 thumbnail_link 251277 non-null object
13 comments_disabled 251277 non-null bool
14 ratings_disabled 251277 non-null bool
15 description 231822 non-null object
dtypes: bool(2), int64(5), object(9)
memory usage: 27.3+ MB
Dataset 1: YouTube Trending Video Dataset (Kaggle – India)
Source and Provenance
- Source: Kaggle – rsrishav/youtube-trending-video-dataset
- Collected by: Kaggle user @rsrishav
- Date Collected: The dataset was last updated in 2023
- How it was collected: Video metadata was scraped daily from YouTube’s trending page in India using YouTube API and stored as structured CSVs.
Data Access
The dataset used in this project, `IN_youtube_trending_data.csv`, exceeds GitHub’s file size limit (100 MB) and is therefore not included in the repository.
To access the dataset, please use the following Google Drive link:
Download IN_youtube_trending_data.csv
After downloading, place the file in the `data` folder.
Description of Observations
This file contains metadata for trending YouTube videos in India. Each row represents a video trending on a specific day. Videos that trend across multiple days appear multiple times in the dataset.
The dataset includes 251,277 rows and 16 columns. Key variables include:
- `title` – video title
- `channelTitle` – channel name
- `publishedAt` – original video upload time
- `view_count`, `likes`, `comment_count` – performance metrics
- `tags`, `description`, `categoryId` – contextual info
This dataset supports both categorical and quantitative analysis. It’s suitable for time-based, text-based, and engagement-based exploration.
Ethical Considerations
- All metadata is collected from publicly accessible video pages
- No private or personally identifiable user information (PII) is included
- The data is shared under Kaggle’s open use policy for academic and non-commercial use
Research Question
1. What video characteristics (e.g., publish time, title structure, tags, category) are associated with higher engagement (views, likes, comments)?
2. Can we build a predictive model to identify whether a video will be high-performing based on its metadata?
I will examine how the following video characteristics influence engagement metrics (views, likes, comments):
- Publish timing (hour and day of the week)
- Title length and patterns (e.g., keyword use, clickbait phrases, presence of numbers)
- Video category (`categoryId`)
- Tags (number of tags, presence of specific keywords)
These characteristics may influence performance metrics such as:
- `view_count`
- `likes`
- `comment_count`
Note: The dataset does not include video duration or thumbnail image content. These may be considered in future work using the YouTube Data API.
Why This Matters
This question is relevant for both established creators and new channels aiming to optimize content strategy for discoverability and engagement.
Analysis Plan
To answer the research question, I will:
- Preprocess the data:
- Convert `publishedAt` to `datetime` format
- Extract features such as `hour` and `weekday`, and create `daypart` buckets (morning/afternoon/evening)
- Handle duplicate trending entries by keeping the first appearance or aggregating views/likes
- Create new variables:
- `title_length`: total number of characters in the video title
- `has_numbers_in_title`: binary indicator for numbers in title (e.g., “Top 5”, “2023”)
- `upload_hour_bucket`: categorical variable (e.g., morning, afternoon, evening)
- `tag_count`: number of tags used
- Explore patterns:
- Group by `categoryId`, `upload_hour_bucket`, and `title_length` to visualize how engagement (views/likes/comments) varies
- Use bar plots, boxplots, and heatmaps to show relationships
- Build a predictive model:
- Define the target variable:
- `high_performer`: binary variable = 1 if video is in the top 25% by view count, else 0
- Use supervised learning models (e.g., logistic regression, decision tree, or random forest)
- Train/test split and evaluate model using accuracy, precision, recall, and ROC-AUC
- Identify the most important features contributing to video performance
- Interpret results:
- Use model outputs (coefficients or feature importance) to explain which video characteristics are most predictive of success
- Relate findings back to the research question and practical implications for content creators
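The preprocessing and feature steps above can be sketched as follows. This is a minimal illustration, not the final pipeline: the `daypart` cut points, the extra "night" bucket, and the conversion of the UTC `publishedAt` timestamps to India time (`Asia/Kolkata`) are assumptions made here, and `engineer_features` is a hypothetical helper name.

```python
import pandas as pd


def daypart(hour: int) -> str:
    """Map an hour (0-23) to a coarse bucket. Cut points are an assumption."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "night"  # assumed extra bucket for late-night uploads


def engineer_features(df: pd.DataFrame, tz: str = "Asia/Kolkata") -> pd.DataFrame:
    """Add timing, title, and tag features to a copy of the trending data."""
    out = df.copy()

    # publishedAt is stored in UTC (trailing "Z"); convert to local time
    # so that hour-of-day reflects the audience's clock (assumption).
    published = pd.to_datetime(out["publishedAt"], utc=True).dt.tz_convert(tz)
    out["hour"] = published.dt.hour
    out["weekday"] = published.dt.day_name()
    out["upload_hour_bucket"] = out["hour"].map(daypart)

    # Title features.
    out["title_length"] = out["title"].str.len()
    out["has_numbers_in_title"] = out["title"].str.contains(r"\d").astype(int)

    # Tags are pipe-separated; the literal "[None]" marks a video with no tags.
    tags = out["tags"].fillna("[None]")
    out["tag_count"] = tags.str.split("|").str.len().where(tags != "[None]", 0)
    return out
```

Calling `engineer_features(df)` on the loaded trending data returns a new frame and leaves `df` unchanged, which makes it easy to compare alternative bucket definitions during EDA.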
Note: The final set of features may evolve as the analysis progresses, based on data quality, correlations, or insights from EDA. While the initial modeling plan includes classification using logistic regression or decision trees, the specific model and feature set will be finalized based on what proves most effective during the modeling phase. The target variable is currently defined as videos in the top 25% of view count, but this threshold may be adjusted after reviewing the distribution.
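A rough sketch of how the target and a first model might be set up is shown below. It assumes scikit-learn is available, uses a random forest purely as one of the candidate models named in the plan, and keeps the quantile threshold as a parameter so it can be adjusted later. The helper names (`first_appearance`, `add_high_performer`, `fit_and_score`) and the feature columns passed in are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def first_appearance(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per video: the earliest day it appeared as trending."""
    return df.sort_values("trending_date").drop_duplicates("video_id", keep="first")


def add_high_performer(df: pd.DataFrame, quantile: float = 0.75) -> pd.DataFrame:
    """Label videos at or above the given view_count quantile as 1, else 0."""
    out = df.copy()
    threshold = out["view_count"].quantile(quantile)
    out["high_performer"] = (out["view_count"] >= threshold).astype(int)
    return out


def fit_and_score(df: pd.DataFrame, feature_cols: list, seed: int = 0):
    """Train a random forest and report the metrics listed in the plan."""
    X, y = df[feature_cols], df["high_performer"]
    # Stratify so the roughly 25/75 class balance is preserved in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)

    predicted = model.predict(X_test)
    scores = model.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_test, predicted),
        "precision": precision_score(y_test, predicted, zero_division=0),
        "recall": recall_score(y_test, predicted, zero_division=0),
        "roc_auc": roc_auc_score(y_test, scores),
    }
    importances = pd.Series(model.feature_importances_, index=feature_cols)
    return model, metrics, importances.sort_values(ascending=False)
```

Because the positive class is only about a quarter of the rows by construction, accuracy alone can look good for a model that rarely predicts "high performer", which is why precision, recall, and ROC-AUC are reported alongside it.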
## Dataset 2: 1000 Most Trending YouTube Videos (Kaggle)
df2 = pd.read_csv("data/top-1000-trending-youtube-videos.csv")
print("shape:\n",df2.shape)
print("==============================================================")
print("sample:\n",df2.head())
print("==============================================================")
print("Info:")
df2.info()
shape:
(1000, 7)
==============================================================
sample:
rank Video Video views \
0 1 20 Tennis shots if they were not filmed, NOBOD... 3,471,237
1 2 Lil Nas X - Old Town Road (Official Movie) ft.... 54,071,677
2 3 JoJo Siwa - Karma (Official Video) 34,206,747
3 4 Wiz Khalifa - See You Again ft. Charlie Puth [... 6,643,904,918
4 5 伊賀の天然水強炭酸水「家族で、シュワシェア。」篇 15秒 236,085,971
Likes Dislikes Category published
0 19,023 859 NaN 2017
1 3,497,955 78,799 Music 2019
2 293,563 NaN Music 2024
3 44,861,602 NaN Music 2015
4 38 NaN NaN 2021
==============================================================
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 rank 1000 non-null int64
1 Video 1000 non-null object
2 Video views 1000 non-null object
3 Likes 973 non-null object
4 Dislikes 687 non-null object
5 Category 820 non-null object
6 published 1000 non-null int64
dtypes: int64(2), object(5)
memory usage: 54.8+ KB
Dataset 2: Top 1000 Trending YouTube Videos (Kaggle)
Source and Provenance
- Source: Kaggle – 1000 Most Trending YouTube Videos
- Collected by: Samith Sachidanandan
- Date Collected: Not specified; likely compiled as a snapshot of all-time trending videos
- How it was collected: Curated list of the most viewed and liked YouTube videos, possibly scraped from YouTube’s top charts. The dataset includes basic metadata and performance metrics.
Description of Observations
This dataset contains 1,000 records, each representing a globally popular YouTube video. It includes the following columns:
- `rank`: position in the top 1000
- `Video`: title of the video
- `Video views`: number of views
- `Likes`: number of likes
- `Dislikes`: number of dislikes
- `Category`: general topic (e.g., Music, Sports)
- `published`: year the video was published
Although compact, this dataset captures extremely successful videos and is useful for identifying characteristics shared by top performers across different time periods and content types.
Ethical Considerations
- The dataset contains only publicly available metadata from YouTube
- No personal or user-level data is included
- It is shared under Kaggle’s community license for academic and non-commercial use
Research Question
What common characteristics do top-performing YouTube videos share across categories and publishing years?
This dataset will help explore whether video success correlates with:
- Category (e.g., Music vs. Gaming)
- Year of publication (older vs. newer content)
- View-to-like ratios or audience engagement patterns
Why This Matters:
This dataset provides a snapshot of top-tier performers, helping validate whether trends found in the larger India-specific dataset (Dataset 1) hold true at the global, all-time level.
Variables to Explore
- Quantitative: `Video views`, `Likes`, `Dislikes`
- Categorical: `Category`, `published` (as a proxy for video age)
Analysis Plan
- Convert `Video views`, `Likes`, and `Dislikes` to numeric format (they are loaded as strings with comma thousands separators)
- Create new derived variables:
- `like_ratio = Likes / Video views`
- `engagement_score = (Likes + Dislikes) / Video views` (undefined for the 313 rows where `Dislikes` is missing)
- Analyze view counts and like ratios by `Category` and `published` year
- Visualize trends in engagement over time and across categories
- Compare the findings with those from Dataset 1 to see if the characteristics of top-trending videos align with broader trending patterns
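The numeric conversion and the two derived variables could look roughly like this. It is a sketch under the assumption that the count columns use commas as thousands separators, as in the sample output; `to_number` and `add_ratios` are hypothetical helper names, and missing `Dislikes` values are left as `NaN` rather than imputed.

```python
import pandas as pd


def to_number(series: pd.Series) -> pd.Series:
    """Turn strings like '3,497,955' into numbers; missing values stay NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


def add_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric counts plus like_ratio and engagement_score."""
    out = df.copy()
    for col in ["Video views", "Likes", "Dislikes"]:
        out[col] = to_number(out[col])

    out["like_ratio"] = out["Likes"] / out["Video views"]
    # NaN propagates: rows without a Dislikes value get no engagement_score.
    out["engagement_score"] = (out["Likes"] + out["Dislikes"]) / out["Video views"]
    return out
```

With the columns numeric, a per-category summary such as `add_ratios(df2).groupby("Category")["like_ratio"].median()` becomes a one-liner; the median is suggested here only because view counts in this file span several orders of magnitude.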
Role in Final Project
This dataset will serve as a focused benchmark of top-performing content. While it won’t be used for predictive modeling, it provides insight into common characteristics of high-success videos and supports cross-checking of patterns discovered in the larger primary dataset.
## Dataset 3: YouTube Trending Videos via API (India)
df3 = pd.read_csv("data/youtube_api_sample.csv")
print("shape:\n",df3.shape)
print("==============================================================")
print("sample:\n",df3.head())
print("==============================================================")
print("Info:")
df3.info()
shape:
(50, 9)
==============================================================
sample:
videoId title \
0 FbXOsVByKmk They Call Him OG - Firestorm Lyric Video | Paw...
1 qeVfT2iLiu0 Coolie - Official Trailer | Superstar Rajinika...
2 VCqOcfGebaY 2025 PMWC at EWC Grand Finals D2 | English Co ...
3 enjkcCdAlXc Aavan Jaavan Song | WAR 2 | Hrithik Roshan, Ki...
4 KkggRAFMg5c Coolie | Trailer Reaction | Superstar Rajinika...
channelTitle categoryId publishedAt viewCount likeCount \
0 Sony Music South 10 2025-08-02T08:53:07Z 4880098 641168
1 Sun TV 24 2025-08-02T13:30:25Z 9825816 588809
2 Snax Gaming 20 2025-08-02T15:55:41Z 1076294 54896
3 YRF 10 2025-07-31T05:41:16Z 21057184 320699
4 LifeofShazzam 24 2025-08-02T14:15:38Z 267028 21712
commentCount duration
0 27050 PT4M6S
1 26836 PT3M2S
2 115 PT4H51M1S
3 15007 PT4M
4 1126 PT8M20S
==============================================================
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 videoId 50 non-null object
1 title 50 non-null object
2 channelTitle 50 non-null object
3 categoryId 50 non-null int64
4 publishedAt 50 non-null object
5 viewCount 50 non-null int64
6 likeCount 50 non-null int64
7 commentCount 50 non-null int64
8 duration 50 non-null object
dtypes: int64(4), object(5)
memory usage: 3.6+ KB
Dataset 3: YouTube Trending Videos via API (India)
Source and Provenance
- Source: YouTube Data API v3 – Google Developer Platform
- Collected by: Custom script using Google’s official API
- Date Collected: 08/01/2025
- How it was collected:
A Python script was used to query the YouTube Data API for the top 50 trending videos in India (`regionCode="IN"`) using the `videos().list()` endpoint with `chart="mostPopular"`. Please refer to the script `/code/youtube_api_loader.ipynb` and use your own API key to reproduce the dataset.
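For orientation, a request of this kind can be sketched as below. This is not the project notebook itself but an illustration that assumes the `google-api-python-client` package; the helper names `flatten_item` and `fetch_trending`, and the choice of `part` values, are assumptions made here to match the nine columns in the saved CSV.

```python
def _to_int(value):
    """The API returns counts as strings; some may be absent (e.g. hidden likes)."""
    return int(value) if value is not None else None


def flatten_item(item: dict) -> dict:
    """Flatten one videos().list() result item into a single CSV-style row."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    details = item.get("contentDetails", {})
    return {
        "videoId": item.get("id"),
        "title": snippet.get("title"),
        "channelTitle": snippet.get("channelTitle"),
        "categoryId": _to_int(snippet.get("categoryId")),
        "publishedAt": snippet.get("publishedAt"),
        "viewCount": _to_int(stats.get("viewCount")),
        "likeCount": _to_int(stats.get("likeCount")),
        "commentCount": _to_int(stats.get("commentCount")),
        "duration": details.get("duration"),
    }


def fetch_trending(api_key: str, region_code: str = "IN", max_results: int = 50) -> list:
    """Query the mostPopular chart for one region and return flattened rows."""
    # Imported lazily so flatten_item can be used without the client installed.
    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey=api_key)
    response = youtube.videos().list(
        part="snippet,statistics,contentDetails",
        chart="mostPopular",
        regionCode=region_code,
        maxResults=max_results,
    ).execute()
    return [flatten_item(item) for item in response.get("items", [])]
```

The rows returned by `fetch_trending` can be passed to `pandas.DataFrame(...)` and written out with `to_csv`; keeping the API key outside the repository (for example in an environment variable) avoids committing credentials.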
Description of Observations
The dataset contains metadata for 50 currently trending videos in India, including:
- `videoId`, `title`, `channelTitle`, `categoryId`
- `publishedAt`, `viewCount`, `likeCount`, `commentCount`
- `duration`
This data was saved locally as `youtube_api_sample.csv` and is included in the `/data` folder. This file reflects a real-time snapshot and complements the static, historical datasets used in the project.
Ethical Considerations
- Data is collected via the official YouTube Data API v3, in accordance with Google’s API Terms of Service
- Only public metadata is accessed; no user-level or private data is collected
- The API is used for read-only academic and exploratory purposes
Research Use Case
Can real-time YouTube video metadata validate or enrich findings from historical datasets, and how might it be used to support future business applications?
Why This Matters
Unlike static CSVs, the YouTube Data API enables ongoing and scalable access to up-to-date video trends. This can:
- Help validate whether patterns found in Dataset 1 still apply today
- Provide real-time insights for content creators
- Support future business tools like dashboards, content calendars, or trend detection systems
This small sample demonstrates the ability to connect this project to live data streams — a powerful step toward applied analytics.
Variables in the Sample
- Quantitative: `viewCount`, `likeCount`, `commentCount`
- Temporal: `publishedAt`
- Categorical: `categoryId`, `channelTitle`, `duration`
Analysis Plan
- Use this file primarily for enrichment and validation, not modeling
- Convert `duration` from ISO 8601 to minutes and add as a new variable
- Visualize and compare distribution of `viewCount` and `likeCount` to Dataset 1
- Check if shorter/longer durations correlate with engagement levels
- Optionally explore:
- View-to-like ratios
- Category-level performance
- Frame as a proof-of-concept for automated, ongoing data integration
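The duration conversion can be done with the standard library alone. The sketch below handles the day/hour/minute/second forms seen in the sample (for example `PT4M6S` and `PT4H51M1S`); it is an assumption that no other ISO 8601 components such as weeks or months occur in this data, and `duration_to_minutes` is a hypothetical helper name.

```python
import re

# Matches durations such as PT4M6S, PT4H51M1S, PT4M, or P1DT2H.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_minutes(value: str) -> float:
    """Convert an ISO 8601 duration string to minutes as a float."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"Unsupported duration format: {value!r}")
    # Components that are absent from the string count as zero.
    parts = {name: int(text) if text else 0 for name, text in match.groupdict().items()}
    total_seconds = (
        parts["days"] * 86_400
        + parts["hours"] * 3_600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
    return total_seconds / 60
```

Applied to the loaded sample, `df3["duration"].map(duration_to_minutes)` would yield the proposed `duration_minutes` column. Livestream recordings such as the 4-hour-51-minute entry in the sample will be strong outliers, so a log scale or duration buckets may be worth considering when comparing against engagement.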
Role in Final Project
This dataset demonstrates how the YouTube API can be used to build real-time or on-demand analytics pipelines. While it won’t be used for model training, it plays a key role in validating generalizability, supporting feature engineering, and establishing a future-facing path for creator-focused analytics tools.
Weekly Plan of Attack
Week 1: Finalize Data & Preprocessing (Aug 2–Aug 8)
- Submit proposal and finalize dataset files in `/data` folder
- Clean and standardize variables (e.g., dates, durations, text)
- Engineer new features like: `title_length`, `upload_hour`, `like_ratio`, `duration_minutes`
- Save cleaned version for modeling in `/notebooks` or `/src`
Week 2: Modeling & Exploratory Analysis (Aug 9–Aug 15)
- Conduct exploratory analysis:
- Views by category, upload time, title length
- Correlations and class imbalance check
- Define target variable:
- Binary (e.g., top 25% views = “high performer”) or regression
- Train and evaluate models:
- Logistic Regression, Decision Tree, Random Forest
- Use classification metrics:
- Accuracy, ROC-AUC, precision, recall
- Review feature importance to guide interpretation
Week 3: Report Writing & Presentation (Aug 16–Aug 21)
- Create `presentation.qmd` with key visualizations and model insights
- Tell a clear story: problem → data → features → modeling → takeaways
- Add a reflection section:
- What I learned, what I would improve with more time
- Clean up GitHub repo:
- Add README, remove unused files, ensure reproducibility