Transform anything into a vector
entity2vec: Using cooperative learning approaches to generate entity vectors
Javed Qadrud-Din previously worked as a business architect at IBM Watson. At Insight, he developed a new method that allows businesses to efficiently represent users, customers, and other entities in order to better understand, predict, and serve them. He is now a Machine Learning Engineer at CaseText.
Businesses commonly need to understand, organize, and make predictions about their users and partners: for example, predicting which users will leave the platform (churn prediction), or identifying different types of advertising partners (clustering). The challenge lies in representing these entities in a meaningful and compact way so that they can be fed into, say, a machine learning classifier.
Below, I present the way I tackled this challenge; all of the code is available on GitHub here.
Drawing inspiration from NLP
One of the most significant recent advances in Natural Language Processing (NLP) came from a team of researchers at Google (Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean) who created word2vec, a technique for representing words as continuous vectors called embeddings.
The embeddings they trained on 100 billion words (and then open sourced) managed to capture much of the semantic meaning of the words they represent. For example, you can take the embedding for ‘king’, subtract the embedding for ‘man’, add the embedding for ‘woman’, and the result of those operations will be very close to the embedding for ‘queen’ — an almost spooky result that shows the extent to which the Google team managed to encode the meanings of human words.
Ever since, word2vec has been a staple of Natural Language Processing, providing an easy and efficient building block for many text based applications such as classification, clustering, and translation. The question I asked myself while at Insight was how techniques similar to word embeddings might be employed for other types of data, such as people or businesses.
Let’s first think about what an embedding is. Physically, an embedding is just a list of numbers (a vector) that represent some entity. For word2vec, the entities were English words. Each word had its own list of numbers.
These lists of numbers are optimized to be useful representations of the entities they stand for by adjusting them through gradient descent on a training task. If the training task requires remembering general information about the entities of interest, then the embeddings will end up absorbing that general information.
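To make this concrete, here is a toy sketch (my illustration, not code from the project): three entities each get a small vector, and gradient descent on a simple "these two are related, these two are not" objective reshapes those vectors until they encode that relationship.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Three entities, each represented by a 4-dimensional vector (its embedding).
emb = rng.normal(scale=0.1, size=(3, 4))

# Toy training task: entities 0 and 1 should look similar, 0 and 2 dissimilar.
lr = 0.5
for _ in range(300):
    p_pos = sigmoid(emb[0] @ emb[1])   # want this probability near 1
    p_neg = sigmoid(emb[0] @ emb[2])   # want this probability near 0
    # Gradient descent on the logistic loss for both pairs.
    g0 = -(1 - p_pos) * emb[1] + p_neg * emb[2]
    g1 = -(1 - p_pos) * emb[0]
    g2 = p_neg * emb[0]
    emb[0] -= lr * g0
    emb[1] -= lr * g1
    emb[2] -= lr * g2
```

After training, the lists of numbers have absorbed the information the task required: the model is now confident that 0 and 1 go together and that 0 and 2 do not.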
Embeddings for words
In the word2vec case, the training task involved taking a word (call it Word A) and predicting the probability that another word (Word B) appeared in a 10-word window around Word A somewhere in a massive corpus of text (100 billion words from Google News).
Each word would have this done tens of thousands of times during training, with words that commonly appear around it and with words that never appear in the same context (a technique called negative sampling).
This task forces the embedding for each word to encode information about the other words that co-occur with the embedded word. Words that co-occurred with similar sets of words would end up having similar embeddings. For example, the word ‘smart’ and the word ‘intelligent’ are often used interchangeably, so the set of words typically found around them in a large corpus will be a very similar set. As a result, the embeddings for ‘smart’ and ‘intelligent’ will be very similar to each other.
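The co-occurrence objective described above can be sketched in a few lines. The tiny corpus, dimensions, and hyperparameters below are made up for illustration; real word2vec adds subsampling, frequency-weighted negative sampling, and vastly more data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

corpus = ("the smart student passed the exam "
          "the intelligent student passed the test").split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(1)
dim = 8
W = rng.normal(scale=0.1, size=(len(vocab), dim))  # center-word embeddings
C = rng.normal(scale=0.1, size=(len(vocab), dim))  # context embeddings

lr, window, n_neg = 0.05, 2, 3
for _ in range(2000):
    pos = rng.integers(len(corpus))                 # pick a center position
    ctx = pos + rng.integers(1, window + 1) * rng.choice([-1, 1])
    if not 0 <= ctx < len(corpus):
        continue
    w, c = idx[corpus[pos]], idx[corpus[ctx]]
    # Positive pair: push the co-occurrence probability toward 1.
    g = 1.0 - sigmoid(W[w] @ C[c])
    W[w], C[c] = W[w] + lr * g * C[c], C[c] + lr * g * W[w]
    # Negative samples: push randomly drawn words' probabilities toward 0.
    for n in rng.integers(len(vocab), size=n_neg):
        g = sigmoid(W[w] @ C[n])
        W[w], C[n] = W[w] - lr * g * C[n], C[n] - lr * g * W[w]
```

Because 'smart' and 'intelligent' appear in near-identical contexts in this toy corpus, repeated updates of this kind pull their rows of `W` toward similar values, which is exactly the effect described above.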
Embeddings created with this task are forced to encode so much general information about the word that they can be used to stand in for the word in unrelated tasks. The Google word2vec embeddings are used in a wide range of natural language processing applications, such as sentiment analysis and text classification.
Embeddings for anything
Word vectors are essential tools for a wide variety of NLP tasks. But pre-trained word vectors don’t exist for the types of entities businesses often care the most about. Where there are pre-trained word2vec embeddings for words like ‘red’ and ‘banana’, there are no pre-trained word2vec embeddings for users of a social network, local businesses, or any other entity that isn’t frequently mentioned in the Google News corpus from which the word2vec embeddings were derived.
Businesses care about their customers, their employees, their suppliers, and other entities for which there are no pre-trained embeddings. Once trained, vectorized representations of entities can be used as inputs to a wide range of machine learning models. For example, they could be used in models predicting which ads users are likely to click on, which university applicants are likely to graduate with honors, or which politician is likely to win an election.
Entity embeddings allow us to accomplish these types of tasks by leveraging the bodies of natural language text associated with these entities that businesses frequently have. For example, we can create entity embeddings from the posts a user has written, the personal statement a university applicant wrote, or the tweets and blog posts people write about a politician.
Any business that has entities paired with text could make use of entity embeddings, and when you think about it, most businesses have this in one form or another: Facebook has users and the text they post or are tagged in; LinkedIn has users and the text of their profiles; Yelp has users and the reviews they write, along with businesses and the reviews written about them; Airbnb has places to stay along with descriptions and reviews; universities have applicants and the admissions essays they write; and the list goes on. In fact, Facebook recently published a paper detailing an entity embedding technique.
The aim with my entity2vec project was to find a way to use text associated with entities to create general-use embeddings that represent those entities. To do this, I used a technique somewhat similar to word2vec’s negative sampling to squeeze the information from a large body of text known to be associated with a certain entity into entity embeddings.
Example 1: Famous People
To develop and test the technique, I tried training embeddings to represent prominent people (e.g. Barack Obama, Lady Gaga, Angelina Jolie, Bill Gates). Prominent people were a good starting point because pre-trained Google word2vec embeddings exist and are freely available for these very famous people's names, so I'd be able to compare my embeddings' performance against the word2vec embeddings for those same names.
Like with word2vec, I needed a training task that would force the entity embeddings to learn general information about the entities they stand for. I decided to train a classifier that would take a snippet of text from a person’s Wikipedia article and learn to guess who that snippet is about.
The training task would take several entity embeddings as input and would output the position of the entity embedding that the text snippet is about. In the following example, the classifier would see as input a text snippet about Obama, along with the embeddings for Obama and three other randomly chosen people. The classifier would output a number representing which of its inputs is the Obama embedding.
All of the embeddings would be trainable in each step, so, not only would the correct person embedding learn information about what that person is, but the other incorrect embeddings would also learn something about what their people are not.
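A minimal version of this training loop might look like the following sketch. This is my illustration, not the project's actual code: the snippet encoder is replaced by a synthetic stand-in (a "true" direction per entity plus noise), and all names and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, dim, n_candidates = 50, 16, 4

# "True" directions stand in for the information carried by each entity's
# Wikipedia text; snippets are noisy views of them (an assumption for the demo).
true_dirs = rng.normal(size=(n_entities, dim))
entity_emb = rng.normal(scale=0.1, size=(n_entities, dim))

def snippet_vector(entity_id):
    # Stand-in for an encoded text snippet about the entity.
    return true_dirs[entity_id] + rng.normal(scale=0.3, size=dim)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.1
for _ in range(5000):
    target = rng.integers(n_entities)
    others = np.delete(np.arange(n_entities), target)
    distractors = rng.choice(others, size=n_candidates - 1, replace=False)
    cands = np.concatenate(([target], distractors))
    x = snippet_vector(target)
    probs = softmax(entity_emb[cands] @ x)  # which candidate is the snippet about?
    grad = probs.copy()
    grad[0] -= 1.0                          # the correct answer sits at position 0
    # The correct embedding is pulled toward the snippet; the incorrect
    # embeddings are pushed away, learning what their entities are not.
    entity_emb[cands] -= lr * grad[:, None] * x[None, :]
```

Every candidate embedding receives a gradient at each step, which matches the point above: the wrong embeddings also learn something from each example.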
This technique seemed sensible intuitively, but, in order to validate my results, I needed to try the resulting embeddings out on some other tasks to see if they’d actually learned general information about their entities.
To do this, I trained simple classifiers on several other tasks that took entity embeddings as inputs and outputted classifications like the gender or occupation of the entity. Here is the architecture of these classifiers:
And here are the results obtained, compared against guessing and against doing the same thing with word2vec embeddings.
My embeddings performed pretty much on par with the word2vec embeddings even though mine were trained on far less text: about 30 million words versus 100 billion. That is more than three orders of magnitude less text required!
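The classifier architecture itself is shown in the figure rather than the text, but a downstream probe of this kind can be sketched as a simple softmax classifier over frozen embeddings. Everything below is synthetic and illustrative, not the project's data or model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, dim, n_occupations = 300, 16, 5

# Stand-ins for trained person embeddings and occupation labels; in the
# project these came from the Wikipedia training task, not random numbers.
embeddings = rng.normal(size=(n_people, dim))
true_W = rng.normal(size=(dim, n_occupations))
occupation = (embeddings @ true_W).argmax(axis=1)  # synthetic labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Probe: a softmax classifier trained on the frozen embeddings.
W = np.zeros((dim, n_occupations))
onehot = np.eye(n_occupations)[occupation]
lr = 0.5
for _ in range(1000):
    probs = softmax(embeddings @ W)
    W -= lr * embeddings.T @ (probs - onehot) / n_people

accuracy = (softmax(embeddings @ W).argmax(axis=1) == occupation).mean()
```

If the probe reaches high accuracy, the attribute must be recoverable from the embeddings alone, which is the sense in which they contain general information about their entities.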
Example 2: Yelp Businesses
Next, I wanted to see if this technique was generalizable. Did it just work on people from Wikipedia, or does the technique work more generally? I tested it by trying exactly the same technique to train embeddings that represent businesses using the Yelp dataset.
Yelp makes a slice of its dataset available online that contains businesses along with all the tips and reviews written about those businesses. I trained embeddings using precisely the same technique as I used with the Wikipedia people, except this time the text consisted of Yelp reviews about businesses and the entities were the businesses themselves. The task looked like this:
Once trained, I tested the embeddings on a new task: figuring out which type of business a given business is, e.g. CVS Pharmacy falls in the ‘health’ category whereas McDonald’s falls in the ‘restaurants’ category. There were ten possible categories, and a single business could fall into several at once, so it was a challenging multi-label classification task with ten labels. The results, as compared with educated guessing, were as follows:
This is a great result considering the difficulty of such a task!
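For a sense of what the multi-label setup involves, here is a sketch with one independent sigmoid per category rather than a single softmax. The data is synthetic and the names are illustrative; this is not the Yelp experiment itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_businesses, dim, n_labels = 200, 16, 10

# Stand-ins for trained business embeddings and their category labels
# (e.g. 'health', 'restaurants'); a business can carry several labels at once.
embeddings = rng.normal(size=(n_businesses, dim))
true_W = rng.normal(size=(dim, n_labels))
labels = (embeddings @ true_W > 0.5).astype(float)  # synthetic multi-label targets

# One independent sigmoid per category: each label is its own yes/no decision.
W = np.zeros((dim, n_labels))
b = np.zeros(n_labels)
lr = 0.5
for _ in range(2000):
    probs = sigmoid(embeddings @ W + b)
    grad = probs - labels                  # binary cross-entropy gradient
    W -= lr * embeddings.T @ grad / n_businesses
    b -= lr * grad.mean(axis=0)

preds = sigmoid(embeddings @ W + b) > 0.5
accuracy = (preds == labels.astype(bool)).mean()
```

The key design point is the independent sigmoids: unlike softmax, they let a single business belong to ‘health’ and ‘restaurants’ at the same time.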
Altogether, it was a successful experiment. I trained embeddings to capture the information in natural language text, and I was able to get useful information back out of them by validating them on other tasks. Any business that has entities paired with text could use this technique to run predictive tasks on its data.
Next Steps and Code
While these results are promising, the idea can be taken further by incorporating structured data into the embeddings along with text, which I will be looking to explore in the future.
Anyone can now use this technique on their own data using a Python package I created and just a few lines of code. You can find the package on GitHub here.