Big Data 2.0 – YouTube Meets Text Analytics – Part 1


Benjamin Spiegel

An actionable example, focusing on YouTube comments, of how to go about using the data behind text to better understand consumer trends and behavior.

Last month I shared my perspective on the future of big data and the crucial role text analytics will play in it. The article talks about leveraging text analytics to go beyond the numeric signals and truly understand consumer choices. Shortly after it was published, I received an overwhelming number of inquiries asking for an actionable example of how to go about using the masses of text – and the data behind all that content – to develop a deeper understanding of consumer behavior and trends.

Since I am not the biggest fan of written how-to guides in this video world, I will try my best to make this as actionable as possible, but my main objective is to walk you through the process of how to leverage linguistic analysis to improve your consumer understanding. To make this applicable for most readers, I decided to use YouTube data as our example today.

Why YouTube? YouTube is an amazing platform that bridges the gap between search, discovery, content, video, and social. This potent mix makes YouTube highly engaging, and in addition to valuable text content it also collects a wide variety of traditional data points (Views, Likes, Votes, and Comments). And speaking of “big,” visitors upload more than 100 hours of video to YouTube every minute, and we can only imagine the amount of text comments related to this content.

And this is what we want to focus on today – YouTube Comments.

For today’s example, let’s assume we are an automotive manufacturer trying to enter the high-end SUV market. In order to have a successful product and marketing campaign, we need to understand what today’s consumers are looking for, what matters to them, and what gets them excited. If we bring this thinking to YouTube, we might want to look at review videos and commercials of the latest high-end SUVs and the associated comments (text to analyze) left by consumers. We will evaluate the Porsche Cayenne, BMW X5, Range Rover, and the Audi Q7.

Building on the last article, we will go through the following steps in today’s piece:

  1. Acquisition: Collecting the required data from YouTube
  2. Transforming and Preprocessing: Cleansing and filtering the data
  3. Enrichment: Enhancing the original data
  4. Processing: Running analysis on the raw data
  5. Frequencies and Analysis: Calculating frequencies and occurrences
  6. Mining: Extracting the insights and findings

While this is by no means a linear or complete order, it will help you frame your own projects.

Data Source and Process

Before we jump into the actual processes, I want to quickly address the source of the data. For today’s exercise we will be using the YouTube API version 2.0. Although version 2.0 has been deprecated as of March 4, 2014 (the current version is YouTube Data API 3.0), we will be using version 2.0 here because it is simpler and does not require authentication. As for the method of data acquisition, there are hundreds of ways to connect to APIs. In order to keep this simple for non-technical readers, we will not discuss the actual acquisition methods, but instead focus on the process and share the progress results via Google Docs.


For our automotive example, the first thing we need to do is to identify the videos covering the competitive car models whose consumers we are trying to better understand. The simplest way to do this is to use the YouTube API to perform a search. The way to call the YouTube search API is a fairly straightforward GET request:

If you click on the above URL you will see an XML feed containing the search results for the above Query (Porsche Cayenne). The Query contains three sections we can focus on:

Orange: The actual search terms

Red: How do we want to sort it (Views, Date, and Relevance)

Green: Max Results, how many results do we want (I would load 100s, but for this exercise I am limiting it to 10)

To learn more about the call and the filter options, click here.
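For readers who do want to script the call, a minimal Python sketch of assembling such a GET request might look like this. The endpoint below is a placeholder rather than the real API address, and the parameter names (`q`, `orderby`, `max-results`) are my assumption for the three sections described above, so check the API reference for the version you are using.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption): substitute the real search URL
# of the API version you are working with.
SEARCH_ENDPOINT = "https://api.example.invalid/videos"


def build_search_url(terms, order_by="viewCount", max_results=10):
    """Assemble a search URL from the three query sections."""
    params = {
        "q": terms,                  # the actual search terms
        "orderby": order_by,         # sort order (assumed parameter name)
        "max-results": max_results,  # number of results (assumed parameter name)
    }
    # urlencode escapes spaces and special characters for us
    return SEARCH_ENDPOINT + "?" + urlencode(params)


search_url = build_search_url("Porsche Cayenne")
print(search_url)
```

Building the query from a dictionary keeps the three sections separate, so switching the car model or raising the result limit is a one-argument change.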

Once you perform the above search and parse the XML, you will be left with a table like this:

Google Doc for your reference.

This table contains about 85 columns of data on the videos – everything from rating and rankings to thumbnails, geolocation, and other engagement measures. This output alone is a pretty neat set of data to play with; we could look at the engagement rates to determine what type of content is engaging and what is not. We could look at the numerical sentiment (positive-to-negative ratio) to understand reactions and behavior. But today we want to go a level deeper and in order to do that we need to extract the comments for each of those videos. We can go about this very similarly to the earlier approach and leverage a Google API call:

This call takes the VideoID that is part of the YouTube URL (or Column “BY” in our Google Doc) and adds it into the URL; we are also defining the limit and sort order. Again, if you were to click on the URL above, you would receive an XML feed with all the comments for the video “2013 Porsche Cayenne GTS Test Drive & Review” (ko5C82Iwcxg).

Once you have parsed the XML again, you should find a table similar to the one below:


And another Google Doc for your reference and to play with.
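If you would rather do the parsing yourself, here is a rough Python sketch of turning a comments feed into table rows. I am assuming the feed is Atom-style XML, and the sample feed and its field names below are made up for illustration; a real response carries many more columns.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # assumed namespace of the feed

# Made-up sample standing in for a real comments feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <author><name>driver_one</name></author>
    <published>2013-05-01T10:00:00Z</published>
    <content>Love the sound of that engine</content>
  </entry>
  <entry>
    <author><name>driver_two</name></author>
    <published>2013-05-02T11:30:00Z</published>
    <content>Too expensive for what you get</content>
  </entry>
</feed>"""


def parse_comments(feed_xml, video_id):
    """Return one dictionary (table row) per comment entry in the feed."""
    root = ET.fromstring(feed_xml)
    rows = []
    for entry in root.iter(ATOM + "entry"):
        rows.append({
            "video_id": video_id,
            "author": entry.findtext(f"{ATOM}author/{ATOM}name", default=""),
            "published": entry.findtext(ATOM + "published", default=""),
            "comment": entry.findtext(ATOM + "content", default=""),
        })
    return rows


rows = parse_comments(SAMPLE_FEED, "ko5C82Iwcxg")
for row in rows:
    print(row["author"], "|", row["comment"])
```

Tagging every row with its video ID means the rows from several videos can simply be appended into one combined table.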

Depending on how you are connecting, you will have to repeat this call for each of your videos. After performing some cleanup (removing unnecessary columns and running it for all the videos) I came up with something that looked like this:

Google Doc for reference.

What this leaves us with is a collection of comments left on the top videos mentioning Porsche Cayenne. Obviously now it is time for some cleanup. I would recommend removing videos that:

  • Are in foreign languages
  • Compare vehicles (look for “vs.” in title)
  • Are viral videos (speeding, crashes, etc.)
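As a rough illustration, the title-based rules above could be scripted along these lines. The keyword list is my own guess at what signals a viral video, and language detection is left out because it usually needs a dedicated library.

```python
import re

# Illustrative keyword list (an assumption); extend it for your own data.
VIRAL_WORDS = {"crash", "crashes", "speeding", "accident"}


def keep_video(title):
    """Return True if the title passes the comparison and viral-video rules."""
    lowered = title.lower()
    # Comparison videos: look for "vs", "vs." or "versus" as a whole word
    if re.search(r"\b(vs|versus)\b", lowered):
        return False
    words = set(re.findall(r"[a-z]+", lowered))
    return not (words & VIRAL_WORDS)


titles = [
    "2013 Porsche Cayenne GTS Test Drive & Review",
    "Porsche Cayenne vs. BMW X5",
    "Cayenne crash on the highway",
]
kept_titles = [t for t in titles if keep_video(t)]
print(kept_titles)
```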

After you have re-collected the comments, I recommend you limit them further by removing comments that:

  • Are less than two words long
  • Contain vulgar language
  • Are in foreign languages (the comment, not the video)
  • Contain certain spam signal words (Viagra, RX, etc.)
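A small Python sketch of these comment filters might look like the following. Both word lists are placeholders you would replace with your own, and the foreign-language check is again omitted since it typically relies on a language-detection library.

```python
import re

SPAM_WORDS = {"viagra", "rx"}         # spam signal words from the list above
BLOCKED_WORDS = {"darn", "heck"}      # placeholder for a real vulgar-word list


def keep_comment(text):
    """Return True if the comment survives the length, vulgarity and spam rules."""
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 2:                # less than two words long
        return False
    if any(w in SPAM_WORDS or w in BLOCKED_WORDS for w in words):
        return False
    return True


comments = ["Nice", "Cheap Viagra here", "Love the interior"]
kept_comments = [c for c in comments if keep_comment(c)]
print(kept_comments)
```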

Then you should further clean up your comments by:

  • Transforming everything into lower case
  • Stripping out symbols
  • Removing numbers
  • Stemming the terms
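The four cleanup steps can be sketched in a few lines of Python. Note that the stemmer here is a deliberately crude suffix stripper I wrote for illustration, not an established algorithm such as Porter's; for real projects a text mining library will do a much better job. The regular expression also assumes English-only text, since it drops every character outside a-z.

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")  # crude, illustrative suffix list


def naive_stem(word):
    """Strip one common suffix, keeping at least three leading letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def normalize(comment):
    """Lower-case, strip symbols and numbers, then stem each term."""
    text = comment.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes symbols and digits together
    return [naive_stem(word) for word in text.split()]


tokens = normalize("Driving the 2013 Cayenne GTS was AMAZING!!!")
print(tokens)
```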

From here on there are two different approaches you can take to move forward. One of them is to look at

N-grams
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words, or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.

And the other approach is to break out the words by using

Bag of Words
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Recently, the bag-of-words model has also been used for computer vision.

Let’s start with the n-gram approach. If you do not have any text mining tools handy, I recommend using an online n-gram analysis tool. Simply copy and paste the extracted comments into the tool and it will perform an analysis for you. For me, the result looks like this:

Google Doc for reference.

Generally these tools will include frequencies and density measures, which will help you to identify the common themes and what we can exclude.
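If you want to see what such a tool does under the hood, here is a minimal Python sketch that counts word-level n-grams. I am treating "density" as the share of all n-grams that a given n-gram accounts for, which is my assumption; individual tools may define it differently.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_table(comments, n=2, top=10):
    """Count n-grams per comment and return (n-gram, frequency, density) rows."""
    counts = Counter()
    for comment in comments:
        # Counting per comment keeps n-grams from spanning two comments
        counts.update(ngrams(comment.lower().split(), n))
    total = sum(counts.values())
    return [(" ".join(gram), freq, freq / total)
            for gram, freq in counts.most_common(top)]


sample_comments = ["love the sound", "love the look", "the sound is great"]
table = ngram_table(sample_comments, n=2, top=2)
for gram, freq, density in table:
    print(f"{gram}: {freq} ({density:.0%})")
```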

The other option, Bag of Words, will take the individual terms instead of the string patterns. Running it will produce a giant list of words. The simplest way to leverage this approach would be to generate a Word Cloud. You can either use a Desktop Data Visualization Tool or use something like Wordle. I took the bag of words result above and copied and pasted it into Wordle. This led to the visual below:


Wordle Project to play with.
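For those who prefer code over copy and paste, a bag of words is just a word count. The sketch below uses a tiny stop-word list of my own choosing (an optional extra, so that filler words do not dominate the cloud); the resulting counts are what a word cloud tool turns into font sizes.

```python
from collections import Counter

# Small illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "is", "and", "it", "of", "to"}


def bag_of_words(comments):
    """Count every word across all comments, ignoring order and grammar."""
    bag = Counter()
    for comment in comments:
        bag.update(word for word in comment.lower().split()
                   if word not in STOP_WORDS)
    return bag


bag = bag_of_words(["love the sound", "love the look", "the sound is great"])
print(bag.most_common(3))
```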

In order to follow along with our original game plan, I took the comments from the top videos for the Porsche Cayenne, BMW X5, Range Rover, and the Audi Q7. Then I refined the selection, limited it to the top terms, and applied some simple tagging and linguistic analysis. This left me with the following visual:


This visual is color coded by SUV model. Can you guess which is which?

The next step from here is to start filtering and refining the data further. One way to do this is by applying some POS (Part of Speech) tagging in order to group the content by type. Performing POS tagging, which labels each word with its part of speech, does two things: it identifies the grammatical role each word plays and allows you to filter and group by part of speech. For example, you could separate comparative adjectives from superlatives (faster vs. fastest) or filter it down to only look at simple adjectives (quiet, beautiful, etc.).

To learn more about POS click here.
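To give a feel for the filtering step, here is a small Python sketch that groups words by their tag. It assumes the words have already been tagged using the Penn Treebank tag set (JJ for adjectives, JJR for comparatives, JJS for superlatives); the tagged list below is hand-made for illustration, and in practice a POS tagger would produce it.

```python
from collections import defaultdict

# Hand-tagged sample (an assumption); a real tagger would generate these pairs.
TAGGED = [
    ("quiet", "JJ"), ("faster", "JJR"), ("fastest", "JJS"),
    ("engine", "NN"), ("beautiful", "JJ"), ("drives", "VBZ"),
]


def group_by_pos(tagged_tokens):
    """Group words by their part-of-speech tag, preserving word order."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[tag].append(word)
    return dict(groups)


groups = group_by_pos(TAGGED)
print("Simple adjectives:", groups.get("JJ", []))
print("Comparatives:", groups.get("JJR", []))
print("Superlatives:", groups.get("JJS", []))
```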

I hope this little guide has gotten you excited about starting to work with linguistic analysis. Next month we will follow up with part two, in which we will go deeper into the POS analysis as well as using learners/predictors and merging the numerical with the textual data. And if this is of interest, we will also look at the connected Google+ profile of each commenter in order to evaluate the commenter's importance (Profile Views) and purchase potential (region, career, etc.) for the vehicle.