Finally, back from a deep slumber!
Herein, I attempt to give some intuition about the Continuous Bag of Words (CBOW) model, which produces word embeddings. For details, please check the linked paper.
Now let's imagine a world where you are trying to create these word embeddings. What's a word embedding? .. you may ask. The answer is simple: it's just a vector representation of a word. Behold! you say. Why a vector representation? And I'd say it can be used to explore relationships between words or groups of words.
Now think about it this way: a word has two representations .. one of its own and another when it appears in the context of other words. What's a context? .. you may ask. It just means the surrounding words (within a finite symmetrical window to the left and right) of a given word in a text corpus. And what's the goal? .. you may ask. I'd say the goal is to find the word embeddings for all known words in the vocabulary. Wow! .. you exclaim, and then ask how? And the answer would be to use the text corpus to train a machine learning model to figure out the word embeddings.
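If a concrete example helps, here's a rough Python sketch of what "context" means (the toy sentence and window size are mine, purely for illustration): for each word, just grab its neighbours within a symmetric window.

```python
# A toy corpus and window size, made up for illustration.
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2  # 2 words to the left, 2 to the right

for i, word in enumerate(corpus):
    # The context is the surrounding words within the symmetric window.
    context = corpus[max(0, i - window):i] + corpus[i + 1:i + 1 + window]
    print(context, "->", word)
# e.g. ['the', 'brown', 'fox'] -> quick
```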
That doesn't make much sense, you say, and I hear you. To keep things simple, imagine we are scanning the text corpus one word at a time, where the current word has vector representation v. And our context for word v is the immediately preceding word, with context vector representation u. So basically, a single-word context is all we have. What do we do now? Well, first of all, we initialize all vectors (a v and a u for each word in the vocabulary) randomly to some small numbers. And what might be the dimension of these vectors? Well, that's up to you, as you can embed the words into any N-dimensional space you want.
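As a minimal sketch of that setup (the vocabulary, names, and sizes are my own choices, not from the paper): every word in the vocabulary gets two small random vectors, a v and a u, in whatever N-dimensional space you fancy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; in practice this comes from scanning the corpus.
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
V, N = len(vocab), 50                    # N is up to you: any embedding dimension

v_vecs = rng.normal(0.0, 0.01, size=(V, N))   # one small random v per word
u_vecs = rng.normal(0.0, 0.01, size=(V, N))   # one small random u per word
```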
Given the current context word with vector representation u, we simply take the dot product of u with every v vector in the vocabulary. That gives you a bunch of scalar numbers, doesn't it? You then apply softmax to this bunch of scalars to get a meaningful score between 0 and 1 for each word in the vocabulary. What do these numbers between 0 and 1 represent? They represent the probability of the word with v.r. v occurring given the word with context v.r. u. Well, this is your estimated probability, and you also have the true probability, i.e. you know the word actually occurring to the right of the context word because you have the text corpus. So ideally, your estimated probabilities at this point in time should look like 1 for the true word occurring to the right of the context word and 0 for all other words in the vocabulary. That most certainly won't be the case, so you use some calculus to update the vectors v and u to move closer to the desired probability vector at this point in time. Voila! What do you do next? Rinse and repeat, i.e. move on to the next word in the text corpus, recalculate the probabilities, and adjust the vectors to better reflect the true probability vector (1 for the right word and 0 for all others). Once you have scanned the whole text corpus, you will be left with certain values of v and u for each word in the vocabulary, and those, my friend, are your embeddings.
[Note: v.r. = vector representation]
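If some code makes the step clearer, here's a rough numpy sketch of one such update for the single-word-context case. The toy vocabulary, learning rate, and variable names are all mine, and the "some calculus" part is plain gradient descent on the softmax (cross-entropy) loss described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a tiny vocabulary with small random v (word) and u (context) vectors.
vocab = ["the", "quick", "brown", "fox", "jumps"]
idx = {w: i for i, w in enumerate(vocab)}
V, N, lr = len(vocab), 10, 0.1           # vocab size, embedding dim, learning rate
v_vecs = rng.normal(0, 0.01, (V, N))
u_vecs = rng.normal(0, 0.01, (V, N))

def softmax(scores):
    e = np.exp(scores - scores.max())    # shift for numerical stability
    return e / e.sum()

# One step: context word "quick" (to the left), true next word "brown".
ctx, tgt = idx["quick"], idx["brown"]
u = u_vecs[ctx]

scores = v_vecs @ u                      # dot product of u with every v
p = softmax(scores)                      # estimated P(word | context), sums to 1

true_p = np.zeros(V)                     # desired: 1 for the true word, 0 elsewhere
true_p[tgt] = 1.0
err = p - true_p                         # how far off the estimate is

# Gradient descent: nudge the vectors toward the desired probability vector.
u_vecs[ctx] -= lr * (v_vecs.T @ err)     # move the context vector u
v_vecs -= lr * np.outer(err, u)          # move every word vector v
```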
What if your context is multi-word? In that case, your u will just be the average of the context vectors of all the words in the context. And the calculus changes slightly: to move closer to the true probability vector, you adjust each context word's u by an equal amount (uniformly distributing the error across the context words).
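And here's a compact sketch of the whole multi-word story put together, again with a made-up toy corpus, window size, and learning rate: average the context vectors, compare the softmax output to the true word, and split the correction equally among the context words.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, N, lr, window = len(vocab), 10, 0.1, 2

v_vecs = rng.normal(0, 0.01, (V, N))     # word vectors v
u_vecs = rng.normal(0, 0.01, (V, N))     # context vectors u

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

for t, word in enumerate(corpus):
    ctx_words = corpus[max(0, t - window):t] + corpus[t + 1:t + 1 + window]
    if not ctx_words:
        continue
    ctx_ids, tgt = [idx[w] for w in ctx_words], idx[word]

    u = u_vecs[ctx_ids].mean(axis=0)     # u = average of the context vectors
    p = softmax(v_vecs @ u)              # estimated P(word | context)
    err = p.copy()
    err[tgt] -= 1.0                      # estimated minus true probability

    grad_u = v_vecs.T @ err
    for c in ctx_ids:                    # distribute the error uniformly
        u_vecs[c] -= lr * grad_u / len(ctx_ids)
    v_vecs -= lr * np.outer(err, u)
```

After scanning the corpus this way (once, or for several passes), the rows of v_vecs and u_vecs are the embeddings the post is talking about.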
This, my friend, is CBOW. And you can represent it as a shallow, single-layer feed-forward neural network. But even if you do that, the math still doesn't change. Simple? Yes. Easy to get your head around? Not quite. Hopefully this makes things slightly less cloudy than they were before and provides you a few iotas of intuition.
I'll be back, till then .. ciao!