Predicting popular YouTube videos among K-12 students
The California-based company GoGuardian helps teachers and school administrators manage Chromebooks for millions of K-12 students nationwide. These Chromebooks represent enormous educational potential, but we all know just how distracting unlimited access to the Internet can be. To further the mission of educators, GoGuardian provides a variety of services that help schools filter content and protect students online. Among its key products is a smart YouTube filter. This feature is important because at any hour of the day approximately 35% of students are watching YouTube. YouTube hosts a large amount of valuable educational content, so blocking the entire domain would be overly restrictive. Unfortunately, it hosts even more content that is distracting and not useful in an educational context. To separate the wheat from the chaff, GoGuardian would like to surface a small number of trending videos to administrators and teachers in a dashboard, so that they can decide whether any particular video should be accessible on school-provided laptops. I consulted with GoGuardian to create the model underlying this dashboard. The resulting model pulls data from YouTube's API and accurately predicts which videos will be popular, so that each school can make its own final decision about what its students may watch.
Trying to filter all the videos on YouTube at any given time by hand would be impossible. Luckily, a large and fairly constant proportion of students are usually watching a small number of top videos.
Just 5 of the top 100 videos are being watched by almost ⅕ of the viewers at any given time, so evaluating those 5 videos would have a large impact with limited effort. Also, by focusing on just the top 5 videos I can smooth out the weekly and daily variation and concentrate on predicting which videos will be popular in each hour. I split out a test set of data from the last two weeks, on which I validated the performance of different prediction algorithms.
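To make the "share of viewers" figure concrete, here is a minimal sketch of how it could be computed for a single hour. The function name `top_k_share` and the input format (a dictionary mapping a video ID to its number of unique viewers in that hour) are assumptions for illustration, not GoGuardian's actual data layout.

```python
from collections import Counter

def top_k_share(views_by_video, k=5):
    """Return (ids of the k most-watched videos, their share of all viewers).

    views_by_video is a hypothetical mapping {video_id: unique_viewers}
    for one hour of data.
    """
    total = sum(views_by_video.values())
    if total == 0:
        return [], 0.0
    # most_common sorts by count, highest first.
    top = Counter(views_by_video).most_common(k)
    top_ids = [video_id for video_id, _ in top]
    share = sum(count for _, count in top) / total
    return top_ids, share

# Toy example with made-up counts, using k=2 to keep it small.
ids, share = top_k_share({"a": 3, "b": 2, "c": 1}, k=2)
```

Repeating this for every hour and averaging the share would give a number comparable to the ⅕ quoted above.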
To build my intuition about viewership trends, below I plot the number of unique users watching a typical top 5 video. This video reached the top 5 at 4 PM UTC on June 20, and stayed popular for the rest of the day. At and before 3 PM, no one was watching the video. By querying the YouTube API, I learned that this video was released a few minutes after 4 PM. In essence, this video became popular immediately after it was released.
That is just one example, but is it really representative of the data set at large? Below I plot the median number of viewers watching the top 5 videos, aligned to the hour in which each video reached the top 5. The area shaded in gray contains half of all the top 5 videos. From this plot we can see that one hour before reaching the top 5, 25% of videos were being watched by zero students. Two hours before, 50% of top 5 videos had zero viewers, and three hours before, 75% had no viewers. I conclude that videos become popular extremely rapidly, and remain popular for at least several hours after their initial peak. This makes it difficult to predict whether a video will become popular based on its past viewership, because quite often that viewership was zero.
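The alignment behind such a plot can be sketched as follows. This is only an illustration under assumed inputs: `series` maps each video ID to its hourly viewer counts (keyed by an integer hour index), and `entry_hour` gives the hour in which each video first reached the top 5. Both names are hypothetical, and at least two videos are needed for the quartiles to be defined.

```python
from statistics import quantiles

def aligned_quartiles(series, entry_hour, offsets=range(-3, 4)):
    """Quartiles of viewers at each hour offset relative to top-5 entry.

    series:     {video_id: {hour_index: viewers}}  (hypothetical layout)
    entry_hour: {video_id: hour_index when the video entered the top 5}
    Returns {offset: (first quartile, median, third quartile)}.
    """
    result = {}
    for offset in offsets:
        # Hours with no recorded viewers count as zero.
        values = sorted(
            series[vid].get(entry_hour[vid] + offset, 0) for vid in entry_hour
        )
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        result[offset] = (q1, median, q3)
    return result

# Toy data: three videos, two of which had no viewers before entering the top 5.
series = {
    "x": {10: 50, 11: 60},
    "y": {7: 40, 8: 45},
    "z": {4: 4, 5: 30, 6: 35},
}
entry_hour = {"x": 10, "y": 7, "z": 5}
stats = aligned_quartiles(series, entry_hour)
```

The median traces the central line of the plot, and the first and third quartiles bound the shaded band that contains half of the videos.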
Although the abruptness of increases in viewership makes it difficult to make predictions based on past viewership, it is possible that there are other indicators that foretell popularity. I turned to the YouTube API to get additional information about the videos in my data set. The most striking feature of top 5 videos is that they are recent. In fact, 26% of all videos in the top 5 for any given hour were released less than an hour previously! If you are interested in the distribution, I have plotted a cumulative histogram below:
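A cumulative histogram of this kind reduces to counting, for each age threshold, the fraction of top 5 appearances whose video was younger than that threshold. The sketch below assumes the ages have already been computed as the difference between the observation hour and the release time; the helper name and the sample ages are made up for illustration.

```python
from datetime import timedelta

def cumulative_age_fractions(ages, thresholds):
    """For each threshold, the fraction of ages strictly below it."""
    return [sum(age < t for age in ages) / len(ages) for t in thresholds]

# Made-up ages of four top-5 appearances (time since release).
ages = [
    timedelta(minutes=20),
    timedelta(hours=3),
    timedelta(hours=30),
    timedelta(days=10),
]
thresholds = [timedelta(hours=1), timedelta(days=1), timedelta(days=7)]
fractions = cumulative_age_fractions(ages, thresholds)
```

On the real data, the value at the one-hour threshold would correspond to the 26% quoted above.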
Because videos become popular almost immediately after being posted, a simple and amazingly effective algorithm for identifying the top 5 videos an hour from now is to look at the top 5 videos right now. Deploying this algorithm on my two-week test set, I achieve 80% average accuracy in predicting the top 5 videos in each hour. The distribution of this performance is plotted below, where you can see that I get 4 or 5 out of 5 right over ¾ of the time.
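This "predict the next hour from the current hour" rule is often called a persistence baseline. Here is a minimal sketch of how it could be scored, assuming the data is a list of per-hour dictionaries of viewer counts in consecutive order. Accuracy is taken to be the fraction of the predicted top 5 that actually appears in the next hour's top 5; that definition and the function names are my assumptions for illustration.

```python
def top_k(counts, k=5):
    """Set of the k video ids with the most viewers in one hour."""
    return set(sorted(counts, key=counts.get, reverse=True)[:k])

def persistence_accuracy(hourly_counts, k=5):
    """Average overlap between this hour's top k and the next hour's top k.

    hourly_counts: list of {video_id: viewers}, one dict per consecutive hour
    (a hypothetical layout).
    """
    if len(hourly_counts) < 2:
        raise ValueError("need at least two consecutive hours")
    scores = []
    for now, nxt in zip(hourly_counts, hourly_counts[1:]):
        predicted = top_k(now, k)   # the prediction is simply the current top k
        actual = top_k(nxt, k)
        scores.append(len(predicted & actual) / k)
    return sum(scores) / len(scores)

# Toy example with made-up counts, using k=2 to keep it small.
hours = [
    {"a": 5, "b": 4, "c": 1},
    {"a": 5, "c": 4, "b": 1},
    {"c": 9, "a": 1, "b": 0},
]
accuracy = persistence_accuracy(hours, k=2)
```

The appeal of this baseline is that it needs no training and no features beyond the counts already being collected.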
By including a few more features, such as the change in viewership between two hours and one hour previously, the number of distinct schools with students watching the video, and the time since the video was released, I can modestly improve the accuracy to 82%, but I doubt these features are worth the added complexity of the model.
Had the simple algorithm been applied to the full ~6-week dataset, an administrator would have been able to filter almost 500,000 unique views of potentially distracting content by evaluating just 5 videos each hour (red segment in the following plot):
In conclusion, even seemingly unpredictable trends can be used to build smart tools for administrators to filter content for students.
Peter Weir (@ptweir) is an Insight Data Science Fellow located in San Francisco, CA.
Note on the 35% figure: for the dataset I had access to, ~12,000 unique users were typically online in an hour, and ~4,300 of them were on YouTube.