Grit & Grid Search: A Data Science Blog

YouTopics


Overview

YouTopics lets a user quickly assess the relevance of a YouTube video. Given a user-generated list of keywords representing a topic, YouTopics measures the similarity between the user's topic and a YouTube video's topic.

Motivation

YouTube holds a wealth of text that can be analyzed with natural language processing, thanks to transcripts and auto-generated captions. Despite the platform's popularity, I was unable to find example projects working with this data, so YouTube appears to be a largely untouched trove of human communication waiting to be processed.

Final Product

YouTopics was built as a web app, but the word-embeddings file is too large to be hosted by services like Heroku and Google's App Engine. I could host it on an AWS server; if I end up doing so, a link will be posted here.

Approach

My idea is to extract a summary of the transcripts or captions and then assign an appropriate topic label. The go-to topic models are LDA and SVD.

LDA

LDA takes a collection of documents and, given an assumed number of topics, an assumed distribution of topics across documents, and an assumed distribution of words across topics, probabilistically generates collections of words representing topics. Intuitively, given the high variance of topics and distributions across billions of videos, LDA won't generalize well with so many assumptions.

SVD

I really like SVD, which is sometimes called the fundamental theorem of linear algebra. It's the base technique for many analytical methods. When applied to NLP it is often referred to as LSA, Latent Semantic Analysis. SVD takes a matrix, in this case a term-by-document matrix, and decomposes it into three matrices: a term-by-topic matrix, a topic-by-topic matrix, and a topic-by-document matrix. Once again, however, the number of topics is guessed or assumed.
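
To make the decomposition concrete, here's a toy LSA sketch with scikit-learn's TruncatedSVD; the documents and the choice of two topics are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "cats and dogs are popular pets",
    "dogs chase cats around the yard",
    "stocks and bonds are financial assets",
    "investors trade stocks on the market",
]

# Build the term/document count matrix (documents as rows here).
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Decompose into 2 assumed topics; each topic is a weighted word list.
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:4]
    print(f"topic {i}:", [terms[j] for j in top])
```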

Both methods take a sample of documents to generate word lists representing topics. Finding a sample large enough to generalize these models to all of YouTube would be impractical. So, I came up with a different approach.

The outcome of LDA and SVD is just a list of keywords representing a topic, and the Python library Gensim has a built-in TextRank algorithm that extracts keywords from a given document. So I use TextRank keywords to represent a video's topic, then assign a label based on the similarity between those keywords and pre-established topic keywords. I use pre-trained word vectors to compute this similarity.
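
For example, keyword extraction is nearly a one-liner (a sketch assuming gensim 3.x, where the summarization module still exists; it was removed in gensim 4.0):

```python
from gensim.summarization import keywords

transcript = (
    "Machine learning lets computers learn patterns from data. "
    "Neural networks are a popular machine learning technique, and "
    "training a network requires data, compute, and patience."
)

# Extract the top-ranked words to represent the video's topic.
video_keywords = keywords(transcript, words=5, split=True)
print(video_keywords)
```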

Word Vectors

Word2Vec and GloVe word embeddings map words to a vector space where the distance between words reflects semantic similarity. Word2Vec derives similarity from the parameters of a neural network trained to predict surrounding words. GloVe takes a more direct approach: its objective is to predict the log probability of word co-occurrence across the whole corpus. GloVe generally produces better results than Word2Vec, though it is more difficult to train. Since I am going to use pre-trained models anyway, I decided to use GloVe.
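
Here's a rough sketch of the similarity computation; the GloVe file is the standard Stanford glove.6B.50d.txt download, and the keyword lists are invented for illustration:

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.split()
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def topic_similarity(words_a, words_b, vectors):
    """Cosine similarity between the mean vectors of two keyword lists."""
    mean = lambda ws: np.mean([vectors[w] for w in ws if w in vectors], axis=0)
    a, b = mean(words_a), mean(words_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

glove = load_glove("glove.6B.50d.txt")
print(topic_similarity(["stocks", "market"], ["finance", "trading"], glove))
```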

I now have a method to extract keywords and compare them. The last step is to extract a summary. NLP summarization methods are divided into extractive and abstractive approaches. Abstraction uses a lot of computing power to generate new summary sentences. I prefer extraction, which forms summaries by pulling important sentences straight from the document. It conveys much the same meaning while using less computing power. Gensim has a built-in summarizer based on TextRank, with BM25 as the similarity measure.
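
Using the summarizer looks something like this (again assuming gensim 3.x; the transcript file is a placeholder):

```python
from gensim.summarization import summarize

transcript = open("transcript.txt").read()  # hypothetical transcript file

# Keep roughly 20% of the sentences; gensim ranks them with TextRank
# using BM25 as the sentence-similarity measure.
summary = summarize(transcript, ratio=0.2)
print(summary)
```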

TextRank

TextRank is the same algorithm as PageRank, which assigns rank based on the number of unique incoming links. In TextRank, the "links" are co-occurrences between words in immediate proximity.

BM25

BM25 ranks sentences based on the frequency and inverse document frequency of the keywords from TextRank. BM25 is still somewhat obscure to me, but it seems intuitively linked to the idea of information entropy: a sentence's value is based not just on the frequency of keywords, but on the information value those keywords carry in the sentence. Regardless, the results pass the eye test.
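
To make the intuition concrete, here's a toy BM25 scorer, written from the standard formula with common default parameters rather than gensim's internals:

```python
import math

def bm25_score(sentence, keywords, sentences, k1=1.5, b=0.75):
    """Score one sentence: rare keywords (high IDF) that appear in the
    sentence (saturating term frequency) contribute the most."""
    words = sentence.split()
    avg_len = sum(len(s.split()) for s in sentences) / len(sentences)
    score = 0.0
    for kw in keywords:
        # Inverse document frequency: rare keywords are worth more.
        n = sum(kw in s.split() for s in sentences)
        idf = math.log((len(sentences) - n + 0.5) / (n + 0.5) + 1)
        # Term frequency with saturation and length normalization.
        tf = words.count(kw)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(words) / avg_len))
    return score

sentences = ["the cat sat", "the dog ran far away today", "cats are cats"]
for s in sentences:
    print(round(bm25_score(s, ["cat", "cats"], sentences), 3), "|", s)
```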

The next step is to extract the transcripts from YouTube. This can be done through the YouTube API, and there is always a kind soul in the Python community who will implement a Python wrapper for a popular API. In this case, I found PyTube.

PyTube

Works great. Thanks, stranger. Link to the repo.
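
Pulling a caption track looks roughly like this; note that pytube's API has shifted across releases, so this assumes a version that exposes get_by_language_code, and the URL is a placeholder:

```python
from pytube import YouTube

# Placeholder URL; swap in a real video.
yt = YouTube("https://www.youtube.com/watch?v=VIDEO_ID")

# 'en' is a manually uploaded English track; auto-generated captions
# usually appear under a code like 'a.en'.
caption = yt.captions.get_by_language_code("en")
if caption:
    print(caption.generate_srt_captions())
```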

Future Work

  • Add a database storage function for user-defined topics
  • Apply a database of topics for more robust topic analysis
  • Add multi-topic analysis functionality

Catboosterization



Overview

This project uses data from Kaggle's Amazon.com - Employee Access Challenge and applies the CatBoost algorithm to predict whether an employee will be granted access to a particular resource, based on attributes of the employee's role such as manager ID and department. These role attributes are all categorical. CatBoost is a boosted-tree algorithm similar to XGBoost, but designed to work well with largely categorical data.

Motivation

Generally, machine learning models perform better with purely quantitative inputs and outputs, and large amounts of categorical data are often ignored. However, categorical data is common and important for many processes and decisions (e.g. yellow bananas are good… brown, not so much).

Having algorithms that handle categorical data, such as XGBoost and CatBoost, in the machine learning toolbox expands the data we can work with and ultimately expands what can be accomplished through data science.

Why not XGBoost

XGBoost, as it is typically used with sklearn, requires categorical data to be "one hot encoded". One hot encoding takes each unique category (minus one) and turns it into a feature with a binary value. This is a problem when a feature contains thousands of unique categories. An example is the feature RoleID, which contains thousands of different role categories.

A large number of features leads to a problem known as the curse of dimensionality: as dimensions, or features, increase, data points become more distant from one another, making it harder to discover relationships because everything is far away. Tree-based algorithms in particular must select some features to be roots and some to be branches. To make this determination, they can go through every possible combination, which is prohibitively expensive, or take a random subset, which is unlikely to select anything of value since most features have been encoded into dummy variables of zero or one.
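
A quick illustration of the blow-up, with made-up role IDs standing in for the Amazon data:

```python
import pandas as pd

# 5,000 distinct role IDs standing in for the ROLE_ID column.
df = pd.DataFrame({"ROLE_ID": [f"role_{i}" for i in range(5000)]})

one_hot = pd.get_dummies(df["ROLE_ID"])
print(one_hot.shape)  # (5000, 5000): one binary column per unique role
```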


Enter CatBoost

CatBoost was developed and open-sourced by Yandex. Instead of one-hot encoding, it encodes each category with the formula (countInClass + prior) / (totalCount + 1).

Here, countInClass is the cumulative count of the target class for objects with the given category value; in my case the classes are approved and rejected, so countInClass is the number of times a given combination was approved (or rejected). totalCount is the cumulative count of objects with the given category value. prior is a constant set from the model's initial parameters, which exists to avoid zero values.

More information here: https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html
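
To see the formula in action, here's a hand-rolled sketch of the encoding on made-up data, ignoring the random permutations CatBoost also applies:

```python
import pandas as pd

df = pd.DataFrame({
    "ROLE_ID":  ["a", "a", "b", "a", "b", "c"],
    "APPROVED": [1, 1, 0, 1, 1, 0],
})
prior = 0.5  # illustrative value

# countInClass = approvals per category; totalCount = rows per category.
stats = df.groupby("ROLE_ID")["APPROVED"].agg(["sum", "count"])
encoding = (stats["sum"] + prior) / (stats["count"] + 1)
print(encoding)  # a: (3+0.5)/4 = 0.875, b: (1+0.5)/3 = 0.5, c: 0.5/2 = 0.25
```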

Evaluation

The data set is imbalanced, with approvals outnumbering rejections 16 to 1. Accuracy would be a poor evaluation metric, since a model that predicted approval every time would score 94% accuracy. The ROC-AUC score is a good alternative.

ROC-AUC Score

In place of accuracy, the ROC-AUC score evaluates a model using the true positive rate, or Recall, and the false positive rate. High Recall indicates a low occurrence of false negatives; a low false positive rate indicates a low occurrence of false positives. There is a natural trade-off between the two, which can be adjusted by lowering or raising a decision threshold. The ROC curve (receiver operating characteristic curve) plots the true positive rate against the false positive rate. The point (1,1) represents a threshold with perfect Recall, no false negatives, but the maximum number of false positives. Computing the AUC (area under the curve) produces the ROC-AUC score, which measures overall model performance across all possible decision thresholds.
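
In practice the score is one scikit-learn call; the labels and predicted probabilities below are invented:

```python
from sklearn.metrics import roc_auc_score

y_true  = [1, 1, 1, 1, 1, 1, 1, 1, 0]                      # skewed toward approvals
y_score = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.9, 0.75, 0.3]

# The score is the probability that a random positive is ranked above
# a random negative, across all decision thresholds.
print(roc_auc_score(y_true, y_score))  # 1.0: every positive outranks the negative
```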

The best submissions scored around .92 ROC-AUC. If the model always predicted approve, the ROC-AUC would be .85. So, in evaluating CatBoost, a score around .92 would be very good and .85 would be poor.

Performance

When I first applied CatBoost to my data, the model scored a ROC-AUC of 0.89. I attempted to improve it with a grid search, a loop through the model's hyperparameters, but this did not produce higher scores. It did, however, produce notable differences in the confusion matrices, indicating that the models were producing different results despite similar scores and suggesting potential for ensembling.
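
The grid search looked something like the sketch below; the file path and grid values are illustrative, not my exact settings:

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

data = pd.read_csv("train.csv")          # hypothetical path to the Kaggle file
X = data.drop(columns=["ACTION"])        # ACTION is the approve/reject target
y = data["ACTION"]
cat_features = list(range(X.shape[1]))   # every column is categorical

grid = {
    "depth": [4, 6, 8],
    "learning_rate": [0.03, 0.1],
    "l2_leaf_reg": [1, 3, 9],
}
model = CatBoostClassifier(cat_features=cat_features, verbose=0)
search = GridSearchCV(model, grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```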

Ensembling is the process of combining models. Ensembles of models with significant diversity tend to produce better results, likely because the variances and biases of the individual models balance each other out. Models can be ensembled via stacking or via a voting system. Stacking takes the output of one model and feeds it into another. Boosted models are already stacked: CatBoost's algorithm continuously adds trees until model performance stops improving, so I don't think there is much to gain from additional stacking.

I decided to use a voting system to strengthen the model. I selected three models with similar overall ROC-AUC scores but different accuracy and precision, and ensembled them using max voting, also known as majority voting. The overall improvement was 0.002. Kaggle competitions are often won by building very large ensembles to gain small incremental score improvements, so I suspect ensembling more models would be an effective path to a higher score, provided I can find models with sufficient diversity.
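
Max voting itself is only a few lines; a sketch assuming three already-fitted binary classifiers (model_a, model_b, model_c are hypothetical names):

```python
import numpy as np

def majority_vote(models, X):
    """Each fitted model votes a 0/1 label; the majority label wins."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return (votes.sum(axis=0) > len(models) / 2).astype(int)

# y_pred = majority_vote([model_a, model_b, model_c], X_test)
```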

Future Work

  • Build a grid search that selects for model diversity within a high-score threshold, for better ensembling
  • Try a tree-based anomaly detection algorithm like Isolation Forest. The class imbalance makes this task an anomaly detection problem. However, prebuilt implementations of Isolation Forest use one-hot encoding; a custom implementation with CatBoost's encoding might be very effective