Modeling and Predicting YouTube Engagement

Proposal

This project analyzes YouTube video metadata, such as title, category, and publish time, to explore how these attributes relate to engagement metrics such as views, likes, and comments. The goal is to uncover patterns that help explain why certain videos trend or perform better than others.

Author: Marimuthu - Ashok kumar Marimuthu

Affiliation: College of Information Science, University of Arizona

import numpy as np
import pandas as pd

Dataset

Dataset 1: YouTube Trending Video Dataset (Kaggle – India)

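# Load the India trending-video CSV (path is relative to the project root).
df = pd.read_csv("data/IN_youtube_trending_data.csv")

# Show the dimensions, the first five rows, and the per-column summary.
print("shape:\n", df.shape)
print("==============================================================")
print("sample:\n", df.head())
print("==============================================================")
print("Info:")
df.info()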
shape:
 (251277, 16)
==============================================================
sample:
       video_id                                              title  \
0  Iot0eF6EoNA  Sadak 2 | Official Trailer | Sanjay | Pooja | ...   
1  x-KbnJ9fvJc  Kya Baat Aa : Karan Aujla (Official Video) Tan...   
2  KX06ksuS6Xo  Diljit Dosanjh: CLASH (Official) Music Video |...   
3  UsMRgnTcchY  Dil Ko Maine Di Kasam Video | Amaal M Ft.Ariji...   
4  WNSEXJJhKTU  Baarish (Official Video) Payal Dev,Stebin Ben ...   

            publishedAt                 channelId    channelTitle  categoryId  \
0  2020-08-12T04:31:41Z  UCGqvJPRcv7aVFun-eTsatcA    FoxStarHindi          24   
1  2020-08-11T09:00:11Z  UCm9SZAl03Rev9sFwloCdz1g  Rehaan Records          10   
2  2020-08-11T07:30:02Z  UCZRdNleCgW-BGUJf-bbjzQg  Diljit Dosanjh          10   
3  2020-08-10T05:30:49Z  UCq-Fj5jknLsUf-MWSy4_brA        T-Series          10   
4  2020-08-11T05:30:13Z  UCye6Oz0mg46S362LwARGVcA   VYRLOriginals          10   

          trending_date                                               tags  \
0  2020-08-12T00:00:00Z  sadak|sadak 2|mahesh bhatt|vishesh films|pooja...   
1  2020-08-12T00:00:00Z                                             [None]   
2  2020-08-12T00:00:00Z  clash diljit dosanjh|diljit dosanjh|diljit dos...   
3  2020-08-12T00:00:00Z  hindi songs|2020 hindi songs|2020 new songs|t-...   
4  2020-08-12T00:00:00Z  VYRL Original|Mohsin Khan|Shivangi Joshi|Payal...   

   view_count   likes  dislikes  comment_count  \
0     9885899  224925   3979409         350210   
1    11308046  655450     33242         405146   
2     9140911  296533      6179          30058   
3    23564512  743931     84162         136942   
4     6783649  268817      8798          22984   

                                   thumbnail_link  comments_disabled  \
0  https://i.ytimg.com/vi/Iot0eF6EoNA/default.jpg              False   
1  https://i.ytimg.com/vi/x-KbnJ9fvJc/default.jpg              False   
2  https://i.ytimg.com/vi/KX06ksuS6Xo/default.jpg              False   
3  https://i.ytimg.com/vi/UsMRgnTcchY/default.jpg              False   
4  https://i.ytimg.com/vi/WNSEXJJhKTU/default.jpg              False   

   ratings_disabled                                        description  
0             False  Three Streams. Three Stories. One Journey. Sta...  
1             False  Singer/Lyrics: Karan Aujla Feat Tania Music/ D...  
2             False  CLASH official music video performed by DILJIT...  
3             False  Gulshan Kumar and T-Series presents Bhushan Ku...  
4             False  VYRL Originals brings to you ‘Baarish’ - the b...  
==============================================================
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251277 entries, 0 to 251276
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   video_id           251277 non-null  object
 1   title              251277 non-null  object
 2   publishedAt        251277 non-null  object
 3   channelId          251277 non-null  object
 4   channelTitle       251276 non-null  object
 5   categoryId         251277 non-null  int64 
 6   trending_date      251277 non-null  object
 7   tags               251277 non-null  object
 8   view_count         251277 non-null  int64 
 9   likes              251277 non-null  int64 
 10  dislikes           251277 non-null  int64 
 11  comment_count      251277 non-null  int64 
 12  thumbnail_link     251277 non-null  object
 13  comments_disabled  251277 non-null  bool  
 14  ratings_disabled   251277 non-null  bool  
 15  description        231822 non-null  object
dtypes: bool(2), int64(5), object(9)
memory usage: 27.3+ MB

Weekly Plan of Attack

Week 1: Finalize Data & Preprocessing (Aug 2–Aug 8)

  • Submit proposal and finalize dataset files in /data folder
  • Clean and standardize variables (e.g., dates, durations, text)
  • Engineer new features like:
    • title_length, upload_hour, like_ratio, duration_minutes
  • Save cleaned version for modeling in /notebooks or /src
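
The Week 1 cleaning and feature-engineering steps could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `engineer_features` and the toy rows are hypothetical, `like_ratio` is assumed to mean likes divided by likes plus dislikes, and `upload_hour` is taken in UTC. `duration_minutes` is left out because none of the 16 columns listed by `df.info()` holds a duration, so that feature would need an additional data source.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few illustrative engineered columns."""
    out = df.copy()

    # Parse ISO 8601 strings such as "2020-08-12T04:31:41Z" into UTC timestamps.
    out["publishedAt"] = pd.to_datetime(out["publishedAt"], utc=True)
    out["trending_date"] = pd.to_datetime(out["trending_date"], utc=True)

    # description has missing values in df.info(); treat them as empty text.
    out["description"] = out["description"].fillna("")

    out["title_length"] = out["title"].str.len()
    # Hour of upload in UTC; converting to local time is a later choice.
    out["upload_hour"] = out["publishedAt"].dt.hour

    # Assumed definition: share of likes among all ratings, NaN when there are none.
    ratings = out["likes"] + out["dislikes"]
    out["like_ratio"] = out["likes"] / ratings.replace(0, np.nan)
    return out


# Toy rows in the same format as the dataset; the values are made up.
toy = pd.DataFrame({
    "title": ["Example trailer", "Song"],
    "publishedAt": ["2020-08-12T04:31:41Z", "2020-08-11T09:00:11Z"],
    "trending_date": ["2020-08-12T00:00:00Z", "2020-08-12T00:00:00Z"],
    "likes": [90, 0],
    "dislikes": [10, 0],
    "description": ["Some text", None],
})

features = engineer_features(toy)
print(features[["title_length", "upload_hour", "like_ratio"]])
```

The same function could be applied to the full `df` before saving the cleaned file. Because the trending data records a row per video per trending day, deciding whether to keep one row per `video_id` is a separate choice that this sketch does not make.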

Week 2: Exploratory Analysis & Modeling (Aug 9–Aug 15)

  • Conduct exploratory analysis:
    • Views by category, upload time, title length
    • Correlations and class imbalance check
  • Define target variable:
    • Binary classification (e.g., top 25% of videos by views labeled “high performer”) or regression on view count
  • Train and evaluate models:
    • Logistic Regression, Decision Tree, Random Forest
  • Use classification metrics:
    • Accuracy, ROC-AUC, precision, recall
  • Review feature importance to guide interpretation
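
A minimal sketch of the Week 2 workflow follows, assuming scikit-learn is used for modeling. The data is a synthetic stand-in with made-up values, so the printed scores will sit near chance and say nothing about the real dataset. The column choices, hyperparameters, and the `high_performer` name are illustrative assumptions. Columns such as `likes` and `comment_count` are deliberately left out of the features, since they are recorded alongside views and would likely leak the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data; every value here is made up.
rng = np.random.default_rng(0)
n = 1000
data = pd.DataFrame({
    "title_length": rng.integers(10, 100, n),
    "upload_hour": rng.integers(0, 24, n),
    "categoryId": rng.choice([10, 22, 24], n),
})
data["view_count"] = rng.lognormal(mean=13, sigma=1, size=n).astype(int)

# Exploratory checks: median views by category and simple correlations.
print(data.groupby("categoryId")["view_count"].median())
print(data[["title_length", "upload_hour", "view_count"]].corr())

# Binary target: 1 if the video is in the top 25% of views.
threshold = data["view_count"].quantile(0.75)
data["high_performer"] = (data["view_count"] >= threshold).astype(int)
print(data["high_performer"].value_counts(normalize=True))  # class balance

# One-hot encode the category; keep the numeric features as they are.
X = pd.get_dummies(data[["title_length", "upload_hour", "categoryId"]],
                   columns=["categoryId"], dtype=int)
y = data["high_performer"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
    }
print(pd.DataFrame(results).T.round(3))

# Impurity-based importances from the random forest, as a first look only.
importances = pd.Series(models["random_forest"].feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```

With a 25% positive class, accuracy alone can look good for a model that predicts "not high performer" every time, which is why ROC-AUC, precision, and recall are reported next to it.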

Week 3: Report Writing & Presentation (Aug 16–Aug 21)

  • Create presentation.qmd with key visualizations and model insights
  • Tell a clear story: problem → data → features → modeling → takeaways
  • Add a reflection section:
    • What I learned, what I would improve with more time
  • Clean up GitHub repo:
    • Add README, remove unused files, ensure reproducibility