I got intrigued by the concept of embeddings recently and decided to read up on it. It was a handful just trying to grasp the basics, and there was quite a bit of information out there. After all that reading I think I’m a tiny bit further along from where I started, with a long way to go. This post takes stock of my understanding so far before I proceed any further. I’m also hoping it can be useful for anyone needing a brief head start into this interesting concept!
Disclaimer: I’m not an expert in embeddings nor do I have working experience in embeddings (yet). This is a compilation of my understanding of the stuff I’ve read over the past few weeks on embeddings. There may be inaccuracies, so let me know if you spot some, thank you!
To be able to start answering questions with data, you’d need to be able to represent the information. Algorithms work with numerical data, so the representation has to be in a numerical format. If the variable in question is already of a numerical format things are slightly easier. Examples would be number of items bought or the temperature at a given hour. They fall on a unidirectional scale, and statistics and algorithms can make sense of the relationship between one data point and the other. It is either smaller, larger, or the same. Or, across time, it has either increased, decreased, or remained the same.
There is also a whole world of categorical variables – information that you’d describe more as qualitative than quantitative. An example might be fruits. Or colors. You can’t assign them to a single scale that would fully reflect what they are. You could assign 1 to apples, 2 to oranges and 3 to bananas. From a mathematical standpoint it would be saying that the difference between an orange and an apple is exactly the same as a banana to an orange. On the other hand leaving it as text would leave little room for delving deeper in the data, as you can’t (yet) directly perform mathematical operations on text data. So a single column wouldn’t cut it for representing categorical variables.
One way to represent such data, known as one-hot encoding, is to have as many columns as there are unique values. Each entry is a binary 0 or 1. If you start with one column of fruits with 5 unique values, you’d have 5 one-hot columns. Each column will correspond to one fruit. Each row will be filled with 0s, except for the column corresponding to the fruit. E.g. apples may take the first column, oranges the second, and so on. For a data point of oranges, the second of five columns will be a ‘1’ and the other columns ‘0’s.
This allows the algorithm to learn that apples and oranges are different concepts that don’t lie on a single scale.
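To make the column layout concrete, here is a minimal sketch in plain Python. The function name `one_hot_encode` and the sample fruit list are my own made-up illustrations, and real projects would more likely reach for a library encoder.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row has a single 1 in its category's column."""
    # dict.fromkeys keeps first-seen order, so columns follow the order of the data.
    categories = list(dict.fromkeys(values))
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1
        rows.append(row)
    return categories, rows


# Made-up sample data: 5 unique fruits, so 5 columns.
fruits = ["apple", "orange", "banana", "lemon", "grape", "orange"]
categories, encoded = one_hot_encode(fruits)

print(categories)
for fruit, row in zip(fruits, encoded):
    print(f"{fruit:>6} {row}")
```

With this ordering, apples take the first column and oranges the second, so both orange rows have their ‘1’ in the second column.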
Limitations of one-hot encoding
This works fine for variables with few possible values, such as gender. However, if there are thousands of possible values or more, one-hot encoding becomes computationally expensive. You’d end up with a huge dataframe/matrix/array to compute over.
Moreover, while it may not be feasible to describe all the values on a single scale, the values may still have some relationship between them. One-hot encoding eliminates the possibility of showing such relationships to the model, as the values are treated as separate, unrelated variables. Being able to tell the model that the data points are related can help the algorithm converge faster. Conversely, not being able to tell the algorithm that may hinder eventual performance. For example, lemons and oranges, while different fruits, are both citruses. Pink and orange both have an element of red in them. With one-hot encoding the algorithm starts off with no relationship between lemons and oranges, and none between pink and orange, just as there is none between either of them and blue. It may eventually learn such underlying details, but only after much longer training.
Finally, when an input column is zero for almost all data points and non-zero for just a few, the algorithm will have a hard time learning about that particular column. In effect, one-hot encoding splits a single variable into as many columns/variables as there are unique values. The algorithm has to separately learn how each of these variables affects the output, with very little data on each variable. The model is not likely to perform well, especially as the number of unique values increases (and with it the number of zeros in the input).
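A quick way to see the "no relationship" problem is to measure distances between one-hot vectors. In this small sketch (the three-fruit encoding is a made-up example), every pair of fruits ends up exactly the same distance apart, so the citrus connection between lemons and oranges is invisible.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical one-hot codes for three fruits.
one_hot = {
    "lemon": [1, 0, 0],
    "orange": [0, 1, 0],
    "banana": [0, 0, 1],
}

# Distance for every unordered pair of fruits.
distances = {
    (a, b): euclidean(one_hot[a], one_hot[b])
    for a, b in combinations(one_hot, 2)
}

for pair, distance in distances.items():
    print(pair, round(distance, 3))
```

Lemon is as far from orange as it is from banana, which is the starting point the model has to unlearn.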
Representing categorical data as concepts or dimensions
On one extreme we have representing categorical data on a single scale (1 dimensional), which in most cases is inappropriate. On the other extreme we have representing categorical data in as many dimensions as there are unique values. This too is inappropriate in most cases. What about something in between?
Most of these categorical data, such as words, colors, or movie titles, can’t be sufficiently described in one single dimension. However, chances are, given sufficient dimensions, each on a unidirectional scale, most of these categorical data can be sufficiently described.
For example, one way to represent colors is using 3 dimensions or channels, namely red, green and blue. Since these are the primary colors of light, other colors can be formed by mixing the right intensities of red, green and blue.
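As a rough illustration of why a few meaningful dimensions help, here is a sketch that treats colors as 3-dimensional RGB vectors and measures how far apart they are. The RGB triples are the standard CSS named-color values; using plain Euclidean distance as a stand-in for "similarity" is my own simplifying assumption, not a claim about how humans perceive color.

```python
import math

# CSS named-color values as (red, green, blue), each from 0 to 255.
colors = {
    "pink": (255, 192, 203),
    "orange": (255, 165, 0),
    "blue": (0, 0, 255),
}


def color_distance(a, b):
    """Euclidean distance between two named colors in RGB space."""
    return math.dist(colors[a], colors[b])


for first, second in [("pink", "orange"), ("pink", "blue"), ("orange", "blue")]:
    print(first, second, round(color_distance(first, second), 1))
```

Unlike one-hot codes, this representation places pink and orange closer to each other than either is to blue, because both share a strong red channel.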
A possible dimension for describing a word could be how positive or negative it is. It could also be how ancient or futuristic the word is, or how masculine or feminine it is. But how many dimensions do we need to fully and effectively describe every word in the entire English vocabulary? How should we know what dimensions are important in representing words? How can we assign weights to each dimension without allowing our biases to affect the magnitude of each, especially in subjective concepts?
Given the concerns with manually coded dimensions, this seems like a task ripe for machines to help with. With embeddings, we allow algorithms to work out what these dimensions should be, and the magnitude of each.
The downside of this approach is that each of these dimensions may not be intuitively interpretable. Furthermore, the quality of these embeddings and the presence of bias depend on the quality and quantity of input data. On the upside, this is an approach that is massively scalable and way more effective than manually coming up with dimensions and magnitudes, especially for data with a huge number of possible values.
Much of the world’s curated information is in text. However, in order to leverage it, the text has to be transformed into a structured form. As a result, significant research has gone into the effective representation of words in numerical formats. Being able to convert free text into numerical data and vice versa would also open up possibilities in human-machine interactions. Given the amount of work that has gone into the representation of words, I’ll focus on word embeddings in subsequent sections.
Word2Vec is one of the earlier popular word embedding models. It was created by a team at Google in 2013. The underlying idea is that the meaning of a word can be inferred from the words around it. If word A is often seen around words X, Y and Z, and word B is also often seen around words X, Y and Z, then words A and B are likely to be similar in meaning. For example, given 2 sentences ‘I ate bread for breakfast’ and ‘I ate cereal for breakfast’, it can be inferred that ‘bread’ and ‘cereal’ have similarities between them, since they appear in similar contexts (exactly the same in this case, ‘I ate .. for breakfast’).
How Word2Vec works
There are 2 model architectures for Word2Vec. In continuous bag-of-words (CBOW), the model takes in the surrounding words (context), passes them through the network and predicts the target word. Skip-gram, on the other hand, takes in the target word and predicts the context words.
Either way, the model generally takes the input and passes it to an embedding layer. The embedding layer transforms the input and passes the result on to an output/prediction layer, which maps the transformed data to the target or context words. Taking skip-gram and the earlier 2 sentences as an example, the embedding layer should eventually learn that inputs of ‘bread’ and ‘cereal’ should be transformed similarly to be able to produce the same predicted context.
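To see what the two architectures are actually fed, here is a small sketch that builds training examples from the two breakfast sentences. The function names and the fixed window of 2 words on each side are my own assumptions for illustration, and the real Word2Vec implementation handles windows and sampling in more elaborate ways.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: (target word, one context word)."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW examples: (list of context words, target word)."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


sentences = ["I ate bread for breakfast", "I ate cereal for breakfast"]
for sentence in sentences:
    tokens = sentence.split()
    middle_word = tokens[2]  # 'bread' or 'cereal'
    print(cbow_examples(tokens)[2])
    print([pair for pair in skipgram_pairs(tokens) if pair[0] == middle_word])
```

Both ‘bread’ and ‘cereal’ come out paired with exactly the same context words, which is the signal that pushes their embeddings towards each other during training.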
Word2Vec typically has 100-300 dimensions. That’s tiny compared to the number of words it can represent, which runs into the tens or hundreds of thousands. In the end, we are not really concerned about the word predictions, or whether the model is predicting target words or context. Only the middle layer, the embedding, is extracted and saved. The embeddings can be reused elsewhere, provided the usage of words is similar to that in the training context.
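One way I found helpful to think about the extracted embedding layer is as a table with one row per word. The sketch below uses a made-up 6-word vocabulary, 3 dimensions and random, untrained weights (all assumptions for illustration) to show that multiplying a one-hot vector by the weight matrix is the same as looking up a row.

```python
import random

vocab = ["i", "ate", "bread", "cereal", "for", "breakfast"]
embedding_dim = 3  # toy size; real Word2Vec models use far more dimensions

random.seed(0)
# One row of weights per word. These are random, untrained placeholders.
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]


def one_hot(word):
    """One-hot vector for `word` over the toy vocabulary."""
    return [1 if entry == word else 0 for entry in vocab]


def embed_by_multiplication(word):
    """Multiply the one-hot row vector by the embedding matrix."""
    vector = one_hot(word)
    return [
        sum(v * row[d] for v, row in zip(vector, embedding_matrix))
        for d in range(embedding_dim)
    ]


def embed_by_lookup(word):
    """Pick out the word's row directly, which gives the same numbers."""
    return embedding_matrix[vocab.index(word)]


print(embed_by_multiplication("bread"))
print(embed_by_lookup("bread"))
```

This is why a saved embedding can be shipped as a simple word-to-vector table: once training is done, the rest of the network is no longer needed to get a word's vector.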
Concept representation in Word2Vec
Each dimension on its own may not make meaningful sense for manual analysis. As a whole, on the other hand, they can represent many different concepts. One of the common examples is the difference between ‘man’ and ‘woman’. In the n-dimensional vector space, applying that same vector offset to ‘king’ should bring you somewhere in the vicinity of ‘queen’. I.e. man is to woman as king is to queen.
There is a lot more detail on Word2Vec in the paper here and here, or in the Wikipedia article for a summary. There are also several newer, more advanced embedding models that came after and address some of Word2Vec’s limitations. Some of these are GloVe (Stanford, 2014), fastText (Facebook, 2015), flair (Zalando, 2018), ELMo (AllenNLP, 2018) and BERT (Google, 2019).
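Here is a toy sketch of that vector arithmetic. The 3-dimensional vectors are hand-picked by me so that the analogy works out, they are NOT real Word2Vec output, and the `analogy` helper is a hypothetical name. Excluding the three input words from the candidates is a common convention in analogy tests.

```python
import math

# Hand-picked toy vectors for illustration only, not trained embeddings.
vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [-1.0, 0.0, 0.2],
    "king": [1.0, 1.0, 0.1],
    "queen": [-1.0, 1.0, 0.1],
    "prince": [0.9, 0.8, -0.5],
    "apple": [0.0, -1.0, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vector(b) - vector(a) + vector(c)."""
    target = [
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # Leave out the input words so the answer is a new word.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))


# man is to woman as king is to ...?
print(analogy("man", "woman", "king"))
```

With real embeddings the result would land only near ‘queen’ rather than exactly on it, so a nearest-neighbor search like this is needed to read off the answer.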
Given that these words are projected onto a hundred-or-more dimensional space, it is difficult to visualize these embeddings directly. For machine tasks this is not a problem, as machines can deal with dimensions way beyond 3. For humans to get a glimpse of what is under the hood, however, we need to bring the number of dimensions down to 3 or fewer. One possible way is PCA (Principal Component Analysis). The other popular way to visualize embeddings is t-SNE (t-Distributed Stochastic Neighbor Embedding). Either discards some of the information in the embeddings, but this is inevitable for visualization purposes. Once the embeddings are in 3 dimensions, we have a more tangible way of understanding what they are doing. Google provides an intuitive interface to visualize Word2Vec using PCA or t-SNE through a really cool application, the embedding projector.
One way to appreciate the huge number of dimensions without visualizing them is to find nearby words. Regardless of the number of dimensions, the distance between two words can be summarized in a single metric, such as cosine similarity or Euclidean distance. You can find similar words by these metrics by searching for a word on the right sidebar of the embedding projector. Sometimes words that appear very close in 3-d space might in fact be further apart in 100-d space. This shows the limitation of reducing dimensions for visualization purposes. On the other hand, it also shows the power of using these 100-d embeddings in your natural language tasks.
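For a feel of what the PCA step does, here is a minimal sketch using NumPy's singular value decomposition. The function name `pca_project` is my own, and the 50 random 100-d vectors merely stand in for real word embeddings, so the resulting points carry no meaning. t-SNE is more involved and is not shown.

```python
import numpy as np


def pca_project(embeddings, n_components=3):
    """Project rows of `embeddings` onto their top principal components."""
    # PCA works on mean-centered data.
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


rng = np.random.default_rng(0)
# Random stand-in for 50 word vectors with 100 dimensions each.
embeddings = rng.normal(size=(50, 100))

projected, explained = pca_project(embeddings)
print(projected.shape)
print("share of variance kept:", explained.sum())
```

The share of variance kept is a rough indicator of how much is lost by squeezing the embeddings into 3 dimensions for plotting.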
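Both metrics are short enough to write out by hand. In the sketch below, the two 6-d vectors are made up so that they agree on their first 3 coordinates only, and simply keeping those first 3 coordinates is my crude stand-in for a proper dimensionality reduction like PCA.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; 0 means the vectors are identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two made-up 6-d vectors that agree on the first 3 coordinates only.
word_a = [1.0, 2.0, 3.0, 4.0, -4.0, 0.0]
word_b = [1.0, 2.0, 3.0, -4.0, 4.0, 0.0]

full = (euclidean_distance(word_a, word_b), cosine_similarity(word_a, word_b))
reduced = (
    euclidean_distance(word_a[:3], word_b[:3]),
    cosine_similarity(word_a[:3], word_b[:3]),
)

print("all 6 dimensions  :", full)
print("first 3 dimensions:", reduced)
```

In the reduced view the two points sit on top of each other, while in the full space they are far apart and even point in roughly opposite directions.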
To be able to train an embedding layer for a sizable vocabulary you’d need a significant amount of data for each word, in a variety of contexts. And to be able to train on that huge amount of data you’d need significant computing resources. Neither is readily available to most people. That is why most teams that have worked on embedding models have also released pre-trained embedding layers for general use. With pre-trained embedding layers, you can download these files, plug them into your model, and use them almost immediately.
You can obtain a pre-trained, 300-dimensional Word2Vec model from Google here. The model was trained on Google News data with 100 billion words. It has a vocabulary of 3 million words and phrases. As long as the usage of words in your target context is similar to the usage of words in the Google News context, you’d likely get better results using this model than training one from scratch. Chances are, most words in your target context are used similarly to those in the Google News context.
Other applications of embeddings
The concept of embeddings can be applied to other categorical data with a huge number of possible values and some relationship between the values. To train the embeddings you’ll need a context in which the target variables occur. Armed with a target and context, you’ll be able to feed these into an embeddings network such as Word2Vec and map the variables out into a denser, more effective representation. It could be modeling users of an app (target) based on the friends they have (context), the music they listen to (another context) or the food they like (yet another context). If you have a huge list of products, being able to map products (target) to a denser embeddings space based on the other items bought in that purchase (context) may be helpful for structuring product data to predict sales or make recommendations.
I have ideas on applying embeddings to some of the data science projects I’m handling in and outside of work, and will update when I make some progress!
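As a sketch of the product idea, the snippet below turns purchase baskets into (target, context) pairs, treating every other item in the same purchase as context. The baskets and the function name `basket_pairs` are made up for illustration, and pairing every item with every other item in the basket is just one possible design choice.

```python
from itertools import permutations


def basket_pairs(baskets):
    """Pair each product (target) with every other product in its basket (context)."""
    pairs = []
    for basket in baskets:
        # permutations gives both (a, b) and (b, a), so each item is a target once.
        pairs.extend(permutations(basket, 2))
    return pairs


# Made-up purchases for illustration.
baskets = [
    ["bread", "butter", "jam"],
    ["cereal", "milk"],
    ["bread", "milk"],
]

pairs = basket_pairs(baskets)
for target, context in pairs:
    print(target, "->", context)
```

These pairs play the same role as the word and context pairs in skip-gram, so they could be fed into a Word2Vec-style network to learn one vector per product.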