Sunday, September 18, 2016

Word2vec CBOW

Finally, back from a deep slumber!

Herein, I attempt to give some intuition on the Continuous Bag of Words (CBOW) model, which produces word embeddings. For details, please check the linked paper.

Now let's imagine a world where you are trying to create these word embeddings. What's a word embedding? .. you may ask. The answer is simple: it's just a vector representation of the word. Behold! you say. Why a vector representation? And I'd say it can be used to explore relationships between words or groups of words.

Now think about it this way: a word has two representations .. one of its own and the other when it is in the context of other words. What's a context? .. you may ask. It just means the surrounding words (within a finite symmetrical window to the left and right) of a given word in a text corpus. And what's the goal? .. you may ask. I'd say the goal is to find the word embeddings for all known words in the vocabulary. Wow! .. you exclaim, and then ask: How? And the answer would be to use the text corpus to train a machine learning model that figures out the word embeddings.

That doesn't make much sense, you say, and I hear you. To keep things simple, imagine we are scanning the text corpus one word at a time, each word having a vector representation v. And our context for word v is the immediately preceding word, with context vector representation u. So basically, a single-word context is all we have. What do we do now? Well, first of all, we initialize all vectors (v and u for each word in the vocabulary) randomly to some small numbers. And what might be the dimension of these vectors? Well, up to you, as you can embed them into any N-dimensional space you want.

Given the current context word with vector representation u, we simply take the dot product of u with all v vectors in the vocabulary. That gives you multiple scalar numbers, doesn't it? And then you apply softmax to this bunch of scalars to get a meaningful score between 0 and 1 for each word in the vocabulary. What do these numbers between 0 and 1 represent? They represent the probability of the word with v.r. v occurring given the word with context v.r. u. Well, this is your estimated probability, and you do have the true probability, i.e. you know the word occurring to the right of the context word, since you have the text corpus. So ideally, your estimated probability at this point should look like 1 for the true word occurring to the right of the context word and 0 for all other words in the vocabulary. That most certainly won't be the case, and so you use some calculus to update all the vectors v and u, moving closer to the desired probability vector at this point in time. Voila! What do you do next? Rinse and repeat, i.e. move on to the next word in the text corpus, recalculate the probabilities, and adjust the vectors to be closer to the true probability vector (1 for the right word and 0 for all others). Once you have scanned the whole text corpus, you are left with certain values of v and u for each word in the vocabulary, and those, my friend, are your embeddings.

[Note: v.r. = vector representation]

What if your context is multi-word? In that case, your u will just be the average of the context v.r.s of all the words in the context. And the calculus changes a little, so that to get close to the true probability vector you adjust the vector of each context word by an equal amount (uniformly distributing the error across the context words). A small code sketch covering both cases follows below.
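For the code-minded, here is a minimal numpy sketch of a single CBOW training step, covering both the single-word and the multi-word context case. The sizes, the learning rate and every variable name below are my own inventions for illustration; real word2vec implementations are far more careful about efficiency (hierarchical softmax, negative sampling and so on).

import numpy as np

vocab_size, dim, lr = 10, 5, 0.1              # toy sizes and learning rate (illustrative)
V = np.random.randn(vocab_size, dim) * 0.01   # the words' "own" vectors v, one row per word
U = np.random.randn(vocab_size, dim) * 0.01   # the context vectors u, one row per word

def cbow_step(context_ids, target_id):
  u = U[context_ids].mean(axis=0)      # average of the context vectors (a single-id list also works)
  scores = V.dot(u)                    # dot product of u with every v in the vocabulary
  exp = np.exp(scores - scores.max())
  probs = exp / exp.sum()              # softmax: estimated P(word | context)
  error = probs.copy()
  error[target_id] -= 1.0              # desired vector is 1 for the true word, 0 elsewhere
  grad_u = V.T.dot(error)              # gradient with respect to the averaged context vector
  V -= lr * np.outer(error, u)         # nudge every v toward the true distribution ...
  U[context_ids] -= lr * grad_u / len(context_ids)   # ... and share the update equally among context words

cbow_step(context_ids=[3], target_id=7)        # single-word context
cbow_step(context_ids=[2, 3, 5], target_id=7)  # multi-word context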

This, my friend, is CBOW. And you can represent it as a shallow, one-layer feed-forward neural network. But even if you do that, the math still doesn't change. Simple? Yes. Easy to get your head around? No. Hopefully this makes things slightly less cloudy than they were before and provides you some iotas of intuition.

I'll be back, till then .. ciao!


Monday, July 20, 2015

Principal Component Analysis - The Truth (I)

In the context of "Data Science", Principal Component Analysis, or PCA, is a technique used for dimensionality reduction of data.

"Dimensionality reduction" .. what is that you may ask?

Well .. imagine you are given a bunch of data where each data point is represented by an n-dimensional vector. Now all you want to do is map your data into a new space (a new set of orthonormal axes) of lower dimensionality than n. And if you want to do that, you could use PCA.

That makes sense, you say, but why would I do that? Why project my data into a lower-dimensional space? Well, the answer my friend is blowing in the wind :)

The answer is simple: there are two main advantages of PCA:
1) It helps reduce data redundancy
2) It finds an "interesting" set of new axes along which your data might make more sense

That does not make much sense. Could you elaborate more? ..  you ask. Certainly, my dear Watson.

Let's talk about reducing data redundancy. What does that mean? Imagine you have a vector of n dimensions where each dimension is a measurement of some physical quantity .. say the first dimension represents height, the second weight, the third waist size of a human being, and so on. Now imagine that the nth dimension represents twice the weight. Wait a minute .. you'd say. The second dimension already represents weight, so why would I have an nth dimension representing twice the weight? Well, you would not design it that way, I say .. but in the real world, that's how the data is, my friend. Noisy, not so beautiful in its raw state, and redundant. Redundant in this example means that the data in the nth dimension adds no information, since you are already measuring the weight in the second dimension, and getting twice, thrice, four or five times the weight from it is a piece of cake. So if you could have your ideal vector, it would not have the nth dimension, but unfortunately this is the real world and that's what you get .. redundant data.

Now PCA can pitch in here and overcome this .. and thus, when you project the data into a lower-dimensional space, voila! you get a reduced-dimension vector for each data point with minimal redundancy.
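If you want to see the redundancy story in a few lines of numpy, here is a toy sketch with invented height/weight numbers and a deliberately redundant 2×weight column; the covariance matrix of such data has a (near-)zero eigenvalue, which is exactly the direction PCA will happily throw away.

import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=100)              # cm, invented
weight = rng.normal(70, 8, size=100)                # kg, invented
X = np.column_stack([height, weight, 2 * weight])   # the third column is pure redundancy

eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))
print(eigenvalues)   # one eigenvalue is ~0: that direction carries no information at all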

That's fine, you say, but what's this talk about finding an "interesting set" of new axes? Well, first, to make it clear, we are talking about finding a set of axes which are orthonormal to each other (all at right angles to each other, each of unit length). Having set that understanding, let's embark on the voyage of understanding this bit about "interesting axes".

Let's say you have a lot of two-dimensional data points which lie close to the line Y=X. If it makes things easier, imagine a plot where a lot of points hug the line Y=X. Now all this data is represented by 2 dimensions, right? So here is the catch. If I could choose the line Y=X as my horizontal axis and the line perpendicular to it (Y=-X) as my vertical axis, then all the points would lie very close to my new horizontal axis (even though they could be spread out along it). And this, my friend, is great, because if you think about it, that horizontal axis by itself is now a good representation of the data, since the points are clustered around it. You don't need two axes anymore. Voila! We just found an interesting set of new axes and also performed dimensionality reduction, because we can now use just one axis to represent the data points.
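And here is a minimal numpy sketch of that very picture (the numbers are invented; I use the SVD of the centered data, which is one standard way of computing the principal axes):

import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(0, 5, size=200)
points = np.column_stack([t, t + rng.normal(0, 0.3, size=200)])  # points hugging Y = X

centered = points - points.mean(axis=0)
_, spread, axes = np.linalg.svd(centered, full_matrices=False)
print(axes[0])   # first principal axis: roughly [0.71, 0.71] (up to sign), i.e. the line Y = X
print(spread)    # the first value dwarfs the second: one axis explains almost everything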

That, my friend, is what PCA is and does. It reduces redundancy, finds this interesting set of axes (which, if you think about it, are the directions of maximum variance) for representing your data, and in the process reduces your data's dimensionality.

All this is great, but it's just blabbering .. nothing concrete .. nothing really mathy, you say? Well, that's true, and I will write another part once I wake from my deep slumber. That part will give a mathematical backbone to these blabberings so that the intuition is more complete :). Till then, happy intuition'alising PCA :)

Sunday, November 3, 2013

Randomized Series: Karger's Min-Cut

I am quite intrigued by randomized algorithms. I hope you are as well. If not, why would you be here?

To begin with, imagine an undirected, connected and unweighted graph G. A cut in G is simply a set of edges whose removal would break the graph into two separate pieces. And a min-cut is the smallest such set of edges. And to pump you with some more info, we already have deterministic algorithms to calculate the min-cut. So why a randomized algorithm, you say? Well, because it's beautiful and simple, and you must have heard that saying: 'Simple is beautiful'.

The Algorithm
The algorithm is quite simple actually. You uniformly and randomly pick an edge (u, v) in G, shrink that edge, and combine u and v into a new vertex which we shall call {u, v}. Clever readers will note that this leaves the graph G with one less vertex than before. (One step, one less vertex. Way to go!)
Now what you do is keep repeating this process of randomly picking an edge and collapsing the two vertices across it into one, until you have just two vertices left. And then what, you say? Simple. Just return the set of edges across the last two vertices left as your answer for the min-cut.
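Here is a minimal Python sketch of a single run of the contraction algorithm. The representation - an edge list plus a union-find structure to remember which vertices have been merged - is my own choice for brevity, not the only way to do it.

import random

def karger_min_cut_once(n, edges):
  '''One run of the contraction algorithm.
     Args:
       n: number of vertices, labelled 0..n-1
       edges: list of (u, v) pairs
     Returns:
       the edges crossing the final two super-vertices (a candidate cut)
  '''
  parent = list(range(n))            # union-find to remember which vertices got merged

  def find(x):
    while parent[x] != x:
      parent[x] = parent[parent[x]]  # path compression
      x = parent[x]
    return x

  vertices_left = n
  while vertices_left > 2:
    u, v = random.choice(edges)      # pick an edge uniformly at random
    ru, rv = find(u), find(v)
    if ru == rv:
      continue                       # both endpoints already merged: a self-loop, skip it
    parent[ru] = rv                  # shrink the edge: combine u and v into one super-vertex
    vertices_left -= 1

  # every original edge still crossing the two remaining super-vertices forms the cut
  return [(u, v) for u, v in edges if find(u) != find(v)]

# toy usage: a square with one diagonal (a made-up example graph)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(karger_min_cut_once(4, edges))

A single run only gives a candidate cut; as the analysis below shows, you repeat the run many times and keep the smallest answer.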

The Why?
Why does this even give a cut, forget about the min-cut? Well, think about it this way. At the end of the algorithm, you are left with only two vertices. Each of these two vertices is actually a set of original vertices, formed as we shrank edges while executing the algorithm. Now here is the interesting and obvious thing to note, but before that, label the last two vertices left as sets S1 and S2.
All edges between any vertex in S1 and any vertex in S2 are untouched when the algorithm finishes (otherwise their two endpoints would have been combined into one set). And so, if you remove the set of edges across S1 and S2, you separate out these two sets of vertices .. thus forming a cut in G.

The Analysis
The really interesting question is: what is the probability of finding the min-cut via the above algorithm? And here is the analysis:

Let's say that the real min-cut is of size k. Now if we are lucky and we do not choose any of those k edges to shrink while executing the algorithm, voila! we would find a min-cut. And what must be that probability?

Well, let's think. At each step we need to choose an edge other than those k edges, out of the total edges left. Total edges left, you say, and despair? How do I know that, since the total number of edges changes at each step of the algorithm? Patience, my friend. Imagine that s vertices are left after some shrinking steps. Now each vertex in the graph at that point must have degree at least k, otherwise you could get a cut of size less than k by simply cutting that vertex off from the rest of the vertices. So that must mean that the total number of edges in the graph is at least s·k / 2.

Awesome and then what? Well, so the probability that you pick one of the k edges out of the total edges left when we have s vertices is: p <= k / (s.k/2) or p <= 2 / s. (the inequality kicks in because the total edges is atleast s.k / 2)
And the probability that you do not pick any of the k edge: q >= (1 - p) or q >= (s - 2) / s.

Now you run the algorithm for (n - 2) steps if there are n vertices in G initially. (Remember, one step, one vertex gone .. n - 2 steps, n - 2 vertices gone and only 2 left.) So the probability that we find the min-cut is the probability that we don't pick any of the k edges in shrinking step 1, and also in shrinking step 2, and also in shrinking step 3 ... and also in shrinking step (n - 2). I think you get the idea. So all you do is take the product of (s - 2) / s, letting s run from 3 to n. So what is that?

It is a telescoping product:
[(n-2)/n] · [(n-3)/(n-1)] · [(n-4)/(n-2)] · ... · (2/4) · (1/3) = 2 / [n·(n-1)] = 1 / [n·(n-1)/2]
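If you don't quite trust the telescoping, here is a tiny sanity check in Python (n = 10 is an arbitrary choice):

from fractions import Fraction

n = 10
product = Fraction(1)
for s in range(3, n + 1):
  product *= Fraction(s - 2, s)
print(product, Fraction(2, n * (n - 1)))   # both print 1/45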

Conclusion
So the probability that a single run finds the min-cut is at least 2 / [n·(n-1)], i.e. on the order of 1/n^2. Well, you know that you can run the experiment a large number of times and get to the min-cut with high probability. How many times, you ask? I'd say O(n^2) times (n·(n-1)/2, to be a little more precise). And that means a quadratic (in n) number of repetitions of the single run. Yes, that is a lot of runs, but it also gives a very good chance of success (~63%), based on some simple analysis which I am not going to spell out.
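And here is where that ~63% comes from, in a quick back-of-the-envelope Python snippet (n = 50 is arbitrary; the bound p >= 2 / [n·(n-1)] is the one derived above):

n = 50                                # an arbitrary choice
p_single = 2 / (n * (n - 1))          # lower bound on the success probability of one run
runs = n * (n - 1) // 2               # a quadratic number of independent runs
print(1 - (1 - p_single) ** runs)     # ~0.632, i.e. roughly 1 - 1/e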

But ain't the algorithm beautiful and simple? Ah, right! :)

Monday, August 8, 2011

Bayes Theorem and Predicting the Future

I think Bayes theorem is one of the most beautiful things I have come across - it's so simple and insanely effective. In this post, I am going to present an idea about how Bayes theorem can be used for predicting the future ... Yes! you heard it right - for predicting the future.

The idea came to me while attending a talk by Peter Norvig. If you haven't already, you must, and I insist, take a look at the spell checker written by him which uses Bayes theorem. (Search Google!)

Before the idea, what is Bayes theorem? Well, it relates two events A and B in the following way:
Pr(A | B) · Pr(B) = Pr(B | A) · Pr(A)

How, you may ask? And I must explain that intuitively, upholding the spirit of this blog.
So here it is:
Pr(A | B) means the probability of A happening given that event B has already taken place. So imagine two circles which intersect and represent A and B. Stretch your imagination and think that both the circles are invisible for now. What happens next?
B.
Event B happens next.
At this point in time, you should imagine a circle representing event B emerging. What happens next?
A.
Event A happens next.
At this point in time, you should be careful. The whole of the circle representing event A should not emerge. Since B has happened, we are restricted to playing around only inside B. So what should emerge is the intersecting area between A and B - which is already a part of B.

Now at this point in time, either you are confused enough and about to leave, or you have in your imagination a circle representing B with a shaded area that represents A and B happening together. Voila! What have we got?
Pr(A | B) = Pr(A intersecting B) / Pr(B)  -- see the circle in your head - this is simple maths.

Swap A and B and what do you get:
Pr(B | A) = Pr(B intersecting A) / Pr(A)

Well, Pr(A intersecting B) = Pr(B intersecting A), so if you substitute that in, you get the form I originally mentioned:

Pr(A | B) = Pr(B | A) · Pr(A) / Pr(B)
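If you like to see numbers agree, here is a tiny sanity check with some invented probabilities (30% for A, 40% for B, 12% for both):

p_a, p_b, p_a_and_b = 0.30, 0.40, 0.12        # invented numbers for illustration

p_a_given_b = p_a_and_b / p_b                 # 0.3
p_b_given_a = p_a_and_b / p_a                 # 0.4
print(p_a_given_b, p_b_given_a * p_a / p_b)   # both 0.3: the two sides agree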



Predicting the future - how?
This is the interesting part. Let B represent your present (or past) and A represent one possible state of your future.

What you ought to do is:

Pr(Your future state | (Given) Your present state) =
Pr(Your present state | (Given) Your future state) · Pr(Your future state) / Pr(Your present state)

Now note that, given your present state (pathetic as it is - reading this blog), your future state could be one of many. So don't forget to apply a max function and take the future state with the maximum probability.

Also note that Pr(Your present state) is the common denominator across all your future states - so it can safely be ignored, or, if you are a stickler for pure maths - could you still ignore it (humble request)?

But the million $ question would be: WTH does this term on the RHS mean:

Pr(Your present state | (Given) Your future state)?

What it means is what it actually says! Given your future state, what's the probability of your present state? Does that make any sense? If you knew your future state, you wouldn't be reading this blog already.

True.

Now the answer is: resolving this term depends on the context of the problem and how we model it in the Bayes theorem setting. However, since I won't let you go intuitionally starved, here's a way: why don't you observe your past and estimate this term from it? So basically, you spread out your past as a set of discrete events that have already happened; you can easily pick out the past states that are similar to your present state and the states you eventually went into from them. Having done that, you get a fair idea of the term that invoked a WTH from you a while back.
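Here is one way to code that intuition. The states and the "past observations" below are completely made up, purely for illustration; all the snippet does is estimate the two terms on the RHS from counts and take the max:

from collections import Counter

# invented (state_then, state_next) pairs observed "in the past"
past = [("studying", "exam_passed"), ("studying", "exam_passed"),
        ("studying", "exam_failed"), ("slacking", "exam_failed"),
        ("slacking", "exam_failed"), ("slacking", "exam_passed")]

future_counts = Counter(nxt for _, nxt in past)
joint_counts = Counter(past)
total = len(past)

def predict(present):
  '''Pick the future state maximising Pr(future | present) via Bayes theorem.
     Pr(present) is the same for every candidate future, so it is dropped.'''
  best, best_score = None, -1.0
  for future, f_count in future_counts.items():
    pr_future = f_count / total
    pr_present_given_future = joint_counts[(present, future)] / f_count
    score = pr_present_given_future * pr_future
    if score > best_score:
      best, best_score = future, score
  return best

print(predict("studying"))   # exam_passed, on this toy data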

See, you can model your future given your present - isn't Bayes theorem insanely beautiful? :)

Tuesday, May 3, 2011

Expecto Patronum - writing a sudoku solver (Part 2)

Qn: So how much have we progressed in writing a Sudoku solver?
Ans: 33.33 %

Let's finish the other two parts as well. I'll start with the easy one first: getting all valid choices for a cell in the sudoku map.

Getting valid choices for a cell in sudoku map
The sudoku is partially filled and I am analysing a cell (x, y). What I'd want is a list of all possible valid choices for (x, y). Simple! Eliminate all numbers (between 1 and 9) that already appear in row x, column y and the 3 X 3 square containing (x, y). Whatever remains, however improbable, are the valid choices, my friend. Here is the swashbuckling code:


def get_choices(s, x, y):
  '''Get possible choices for cell [x,y]
     Args:
       s: sudoku map
       x: row
       y: column
    Returns:
      list of possible choices
  '''
  l = 9
  all_choices = set(range(1, l+1))
  invalid_choices = set([])
  for i in range(1, l+1):
    r_ele = s[10*x + i]
    if r_ele:
      invalid_choices.add(r_ele)
    c_ele = s[10*i + y]
    if c_ele:
      invalid_choices.add(c_ele)

  # Top-left cell of the 3 X 3 square containing (x, y) (rows and columns are numbered 1-9)
  sq_x = 3 * ((x - 1) // 3) + 1
  sq_y = 3 * ((y - 1) // 3) + 1

  for i in range(3):
    for j in range(3):
      key = 10 * (sq_x + i) + (sq_y + j)
      if s[key]:
        invalid_choices.add(s[key])
    
  return list(all_choices - invalid_choices)  

And now comes the grand finale!!!


Main Algorithm
                                      "To iterate is human, to recurse divine"

And since we all want to write divine code, we would use recursion to solve this problem. How exactly? 

# Termination condition
If all cells in sudoku map are filled:
  if sudoku is solved:
     return the sudoku map

# Main recursion 
As long as there are unfilled cells in sudoku map:
1. Go to an unfilled cell
2. Iterate over all possible valid values for that cell.
3. Fill that cell with the current value (thus modifying the sudoku map).
4. Recursively solve this new, modified sudoku map.
5. If the recursion returns a solution (it will return the sudoku map object):
      return the currently filled sudoku map as it is
   else (the recursion returned None because no solution exists with the value we filled):
      restore the sudoku map to its state before the recursion
      repeat from step 2, moving on to the next value in the iteration

Heuristic
I also included a heuristic in the actual program for the order of visiting the unfilled cells. You could visit them sequentially, but it turns out that is not a good strategy. A simple heuristic is to start from the cell which has the fewest possible valid choices. Using this simple heuristic, you cut your search space exponentially. (Think, think!)


All of this (including the heuristic) translates to the following code:

def solve_sudoku(s):
  '''Solves sudoku
     Args: 
       s: sudoku map
    Returns:
      Solved sudoku map
  '''
  l = 9

  # Termination: every cell filled means we either have a solution or a dead end
  if all(s.values()):
    return s if check_sudoku(s) else None

  # Heuristic: find the unfilled cell with the fewest valid choices
  min_choices = None
  min_choice_key = None
  for i in range(1, l+1):
    for j in range(1, l+1):
      key = 10*i + j
      if s[key]:
        continue
      choices = get_choices(s, i, j)
      if min_choices is None or len(choices) < len(min_choices):
        min_choices = choices
        min_choice_key = key

  # Try each candidate value for that cell; backtrack on failure
  for k in min_choices:
    s[min_choice_key] = k
    if solve_sudoku(s):
      return s
    s[min_choice_key] = 0   # undo and try the next value
  return None               # no value worked: signal the caller to backtrack

So this is the end, beautiful friend! 
Here was a simple, recursive sudoku solver. And it solved the world's hardest sudoku in around 3-4 seconds. (http://www.usatoday.com/news/offbeat/2006-11-06-sudoku_x.htm)

There are more optimal implementations out there than the one I have given, but my aim was just to explain the process of writing a simple sudoku solver, and hopefully you have got a feel for it now!
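If you want to see the pieces wired together, a short driver along these lines should do. (read_sudoku is the file-reading helper sketched in Part 1 below, and 'puzzle.txt' is just a placeholder filename.)

if __name__ == '__main__':
  s = read_sudoku('puzzle.txt')   # build the sudoku map from a file (see Part 1)
  solved = solve_sudoku(s)
  if solved:
    for i in range(1, 10):
      print(' '.join(str(solved[10*i + j]) for j in range(1, 10)))
  else:
    print('No solution found')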

Acknowledgement:
I figured out the logic myself, except for the heuristic, which is taken from http://norvig.com/sudoku.html and which made a really big performance difference.

Expecto Patronum - writing a sudoku solver (Part 1)

Writing a sudoku solver is not a dark art. It appears so. And appearances can be deceptive.

So this post is about how to write a sudoku solver. And I will show you how to write it in Python - though the choice of programming language really does not matter.

This post is about writing a solver for a 9 X 9 sudoku puzzle. At this point, I assume you all know what a sudoku puzzle is and what its rules are - if you do not, stop reading and go back to sleep (sigh! yawn!)

Input
First of all, how do we represent the input? The answer comes naturally - read it from a file. Cool! And what should the file format be for a 9 X 9 sudoku puzzle? That's your choice. Not wanting to be too fancy with the file format, I will keep it simple: there will be 9 lines in the file, each line having 9 numbers from 0-9, separated by single spaces. An example of the content of the input file is:

#############

0 0 3 0 7 5 0 0 0
0 0 0 0 2 0 0 8 6
9 0 2 0 0 1 0 0 3
0 4 0 1 8 0 0 0 0
0 0 7 0 0 0 0 0 0
2 0 0 0 0 0 6 0 0
0 0 0 7 0 0 3 0 1
0 0 4 0 0 0 0 0 7
0 3 0 4 5 0 2 0 0

#############


**0 represents the unfilled entries in the sudoku, which we have to figure out.


Representing Sudoku in the program
A 2D array? A grid? Well, well, well, hold your horses. I am writing this in Python, and Python gives me the list as an awesome data structure. Since the natural way to represent the sudoku puzzle in the program would be some sort of 2D matrix, I would have to use a collection of lists. Now wait! What I want is to be able to glide down any row, column or 3 X 3 square, and with a collection of lists that's a bit tedious. So I am going to use Python's hero of all time - a dictionary - to represent the sudoku.

What about the keys in the dictionary? Easy one, mate! Let's churn a key out of (row, column) for each sudoku cell. Well, you can make any key; it does not matter as long as you can manufacture 81 different keys with the same hash function. (It might vary in efficiency, but that would not be a bottleneck unless the extent of one's stupidity is proportional to the size of the universe.)

I would use the first function that came to my mind:
                                                   f(row, column) -> 10 * row + column

Now, using this hash function, just read from the file and construct the sudoku map (sudoku hash, sudoku dictionary - any name is fine as long as you understand it to be the same thing I am talking about).
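In case you want it spelled out, here is a short sketch of that reading step (the function name is my own choice):

def read_sudoku(filename):
  '''Reads the 9-line file format described above into the sudoku map
     Args:
       filename: path to the puzzle file
     Returns:
       sudoku map keyed by 10 * row + column (rows and columns numbered 1-9)
  '''
  s = {}
  with open(filename) as f:
    for row, line in enumerate(f, start=1):
      for col, val in enumerate(line.split(), start=1):
        s[10 * row + col] = int(val)
  return s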

Checking if sudoku is solved
This post is getting too long and I'd like to continue in another. But before that, let's add one real function to our code which checks whether a currently filled sudoku map solves the sudoku or not. The algorithm is pretty easy - check that every row has all the numbers from 1-9 (no repetitions), and do the same for all columns. So here is the first real function code in this blog:

def check_sudoku(s):
  '''Checks if sudoku is solved
     Args: 
       s: sudoku map
    Returns:
      True if sudoku is solved
  '''
  l = 9
  # Check each row and column
  for i in range(1, l+1):
    r = []
    c = []
    for j in range(1, l+1):
      r_ele = s[10*i + j]
      c_ele = s[10*j + i]
      if r_ele in r or c_ele in c:
        return False
      r.append(r_ele)
      c.append(c_ele)
  return True

I got really bored now (Yaaaawn!). So more on this in part 2.

Saturday, March 5, 2011

Dynamic Programming - This, That and the Universe

Divide and Conquer versus Dynamic Programming

Let me outline two very important truths here:

  • D&C is a top-down approach while DP is a bottom-up approach to solve problems.
  • D&C can be used to solve a problem instance of DP.
What then, is the difference, you may ask? Why DP? Why not D&C? The answer, my friend, is blowing in the wind. :)

I will take a shot at explaining this as simply as I can:

In D&C, you divide the main problem, say P into smaller parts. Lets consider, for the sake of simplicity, that P is broken down into two smaller sub-problems S1 and S2. Simply think of P as being: P = S1+ S2. Now, in a pure D&C setting, you would solve S1 and S2 separately, combine their solutions in a certain way and get the answer to P. Perfect!

In DP, the first realization you should have is that S1 and S2 overlap. Let's say that S2 depends on S1 (overlap!). This means that S1 has to be solved before S2 can be solved. Now you may ask: why? Why isn't there any other way to solve S2? Because if P is a DP problem, then P has an optimal sub-structure, and its sub-problems also have an optimal sub-structure. And if S1 and S2 have an optimal sub-structure and S2 depends on S1, you should optimally solve S1 first, which leads you to an optimal solution for S2.

And how does all this say that you cannot use D&C? As a matter of fact, you can. You could solve S1 in one part, S2 in another ... in a D&C setting. But that would involve more work, because you are going to solve S1 twice: once by itself and once while solving S2. That loosely-defined extra work grows exponentially as the overlapping sub-problems increase.

All that said and done, what's a simple, intuitive way to solve such a problem P, which has overlapping sub-problems and optimal sub-structure? It's simple, isn't it? Solve S1 first and remember the result. Then solve S2, reusing the remembered result for S1 (S2 depends on S1). Then solve P using the solutions of S2 and S1. All you had to do was start from the bottom (S1) and move gradually (through S2) towards the top (P). That makes your computation easier, gives you a solution and a chance to boast about your DP prowess.
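To make the "remember the result" bit concrete, here is the smallest example I could think of - Fibonacci numbers (my choice of illustration, nothing sacred about it). The naive D&C version re-solves the same sub-problems over and over; the bottom-up DP version solves each one once, remembers it and builds upwards:

def fib_divide_and_conquer(n):
  '''Pure top-down recursion: the same sub-problems are re-solved over and over.'''
  if n < 2:
    return n
  return fib_divide_and_conquer(n - 1) + fib_divide_and_conquer(n - 2)

def fib_dp(n):
  '''Bottom-up: solve the smallest sub-problem first, remember it, build upwards.'''
  if n < 2:
    return n
  prev, curr = 0, 1
  for _ in range(2, n + 1):
    prev, curr = curr, prev + curr
  return curr

print(fib_dp(30), fib_divide_and_conquer(30))   # same answer, wildly different amount of work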

So long, DP!