
DATASETS
The public dataset can be found here:
https://www.kaggle.com/datasets/ramjasmaurya/top-1000-social-media-channels?resource=download
ABOUT THE DATASET
The dataset I’m working with covers the top 1,000 social media influencers on Instagram and YouTube from March to December 2022. Sourced from Kaggle, it offers insights into how different categories and audience sizes affect engagement. For Instagram, I focused on usernames, follower counts, average engagement rates, and associated categories. For YouTube, the dataset includes channel names, subscriber counts, average views, likes, comments, and corresponding categories. The two platforms come as separate datasets, which I compared against each other.
This data lets me observe how users engage with social media and how that affects their consumption habits. In Social Media Trend 2023: Short-form VS Long-form Video, Rugrien (2023) discusses how social media serves different purposes for different users, whether it’s staying connected with friends and family, cultivating an online persona, or keeping up with news and trends. These motivations shape each user’s experience and journey through social media. With this dataset, I can explore how categories perform differently on each platform and how user behavior adapts to different content formats.

Top Instagram Influencers 2022 Dataset (Partial)
Cleaning Dataset
To make the datasets usable, I cleaned and reorganized the information using Python in Google Colab. One challenge was that influencers were often associated with multiple categories, either combined into one data point or scattered across several columns. To address this, I split the category strings on non-alphabetic characters and “exploded” the data so that each of an influencer’s categories had its own row. Some of the categories were written as phrases, so I also replaced those with a single, more succinct word that is easier to analyze (e.g., “sports with a ball” became “sports”).
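The split-and-explode step described above might look something like the sketch below. It assumes pandas, and the column names, sample rows, delimiter pattern, and phrase mapping are all hypothetical rather than taken from the Kaggle files. The pattern splits only on symbols such as “&” or “/” and leaves spaces alone, so multi-word phrases survive until they are replaced.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumptions, not the Kaggle schema.
df = pd.DataFrame({
    "influencer": ["alpha", "beta", "gamma"],
    "category": ["Music & Lifestyle", "Sports with a ball", "Fashion/Beauty"],
})

# Split on runs of characters that are neither letters nor whitespace,
# so "music & lifestyle" splits but "sports with a ball" stays whole.
df["category"] = df["category"].str.lower().str.split(r"\s*[^a-z\s]+\s*", regex=True)

# Explode: one row per (influencer, category) pair.
exploded = df.explode("category", ignore_index=True)

# Collapse wordy phrases into a single succinct label (illustrative mapping).
phrase_map = {"sports with a ball": "sports"}
exploded["category"] = exploded["category"].str.strip().replace(phrase_map)

print(exploded)
```

Doing the phrase replacement after the explode keeps the mapping simple, since each cell then holds exactly one category.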
Another hurdle was that there were more than 50 categories, some with fewer than five influencers. This made it hard to analyze trends meaningfully. To simplify this, I grouped the categories into seven broader bins: entertainment (art, music, movies, humor), lifestyle (fitness, health, travel, daily vlogs), fashion & beauty (fashion, modeling, beauty), food & cooking, technology & science, business & finance, and animals & nature. These general categories made it easier to identify trends and understand how each genre performs across platforms.
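One way to express this binning is a lookup table from each fine-grained category to its bin. The sketch below assumes pandas and uses only the member categories named above for three of the bins; the other four would be filled in the same way. The sample rows, the column names, and the “unmapped” label are hypothetical.

```python
import pandas as pd

# Bin -> member categories, taken from the examples in the text.
# The remaining four bins would be filled in the same way.
bins = {
    "entertainment": ["art", "music", "movies", "humor"],
    "lifestyle": ["fitness", "health", "travel", "daily vlogs"],
    "fashion & beauty": ["fashion", "modeling", "beauty"],
}

# Invert to a flat category -> bin lookup.
category_to_bin = {cat: b for b, cats in bins.items() for cat in cats}

# Hypothetical exploded rows (one category per row).
df = pd.DataFrame({
    "influencer": ["alpha", "alpha", "beta", "gamma"],
    "category": ["music", "travel", "modeling", "chess"],
})

# Map each category to its bin; flag anything not yet assigned.
df["bin"] = df["category"].map(category_to_bin).fillna("unmapped")

# Count distinct influencers per bin to check that no bin is too sparse.
bin_counts = df.groupby("bin")["influencer"].nunique()
print(bin_counts)
```

Flagging unmapped categories instead of dropping them makes it easy to spot anything the lookup table missed.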
ABOUT SCRAPED DATA
Along with the dataset I sourced from Kaggle, I supplemented my research with additional data I scraped using Apify. This provided a crucial middle ground for analyzing the rise of short-form content on Instagram and YouTube, specifically through Reels and Shorts. For my scraped data, I focused on influencers who were either already included in my Kaggle dataset or who aligned with one of the seven general category bins I had identified earlier. I selected two influencers per platform for each category, resulting in approximately 3,700 data points on Instagram Reels and about 4,000 data points on YouTube Shorts.
This additional dataset allows me to directly compare how each platform treats the same content format across different categories. With short-form content becoming increasingly popular, Instagram and YouTube have each introduced their own take on the format, Reels and Shorts respectively, to cater to evolving audience preferences. By including this data, I can explore how engagement varies not only between platforms but also within categories for short-form video content.
Why This Data is Relevant
This data is key to understanding how content format impacts influencer engagement across platforms and categories. By examining short-form content, I can investigate whether its rise aligns with shifts in audience behavior and platform strategies. For instance, do categories like entertainment or fashion & beauty perform better as Reels on Instagram, a platform that is photo-centric at its core, or as Shorts on YouTube, a video-first platform? Additionally, this data allows me to analyze whether short-form content levels the playing field for smaller influencers or whether audience size still largely dictates engagement.
Cleaning Dataset
For this scraped dataset, I didn’t need to do extensive cleaning since I controlled which data I collected during the scraping process. However, one step I did take was hydrating the dataset, that is, filling in fields the scraper left out. The Reels scraper didn’t include follower counts for each user, so I retrieved this data separately and mapped it back to the dataset using Python in Google Colab. From there, I was able to calculate and add columns for average engagement, enabling a deeper analysis through data visualizations.
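A minimal sketch of this hydration step follows, assuming pandas. The column names and values are hypothetical, and the engagement formula, (likes + comments) / followers per reel averaged per influencer, is an assumed definition since the exact calculation is not stated here.

```python
import pandas as pd

# Hypothetical scraped Reels rows; the scraper output lacks follower counts.
reels = pd.DataFrame({
    "username": ["alpha", "alpha", "beta"],
    "likes": [100, 300, 50],
    "comments": [10, 30, 5],
})

# Follower counts retrieved separately (illustrative values).
followers = pd.DataFrame({
    "username": ["alpha", "beta"],
    "followers": [1000, 500],
})

# Hydrate: left-join so every reel is kept even if a lookup is missing.
# validate raises an error if a username appears twice in the lookup table.
hydrated = reels.merge(followers, on="username", how="left", validate="many_to_one")

# Assumed definition: (likes + comments) / followers for each reel.
hydrated["engagement_rate"] = (
    hydrated["likes"] + hydrated["comments"]
) / hydrated["followers"]

# Average engagement per influencer.
avg_engagement = hydrated.groupby("username")["engagement_rate"].mean()
print(avg_engagement)
```

A left join keeps every scraped reel, so any row whose follower count is still missing shows up as NaN and can be checked before plotting.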