Monday, July 20, 2015

Principal Component Analysis - The Truth (I)

In the context of "Data Science", Principal Component Analysis or PCA is a technique used for dimensionality reduction of data.

"Dimensionality reduction" .. what is that you may ask?

Well .. imagine you are given a bunch of data where each data point is represented by an n-dimensional vector. Now all you want to do is map your data into a new space (a new set of orthonormal axes) of lower dimensionality than n. And if you want to do that, you could use PCA.

That makes sense, you say, but why would I do that? Why project my data into a lower dimensional space? Well, the answer my friend is blowing in the wind :)

The answer is simple: there are two main advantages of PCA:
1) It helps reduce data redundancy
2) It finds an "interesting" set of new axes along which your data might make more sense

That does not make much sense. Could you elaborate? .. you ask. Certainly, my dear Watson.

Let's talk about reducing data redundancy. What does that mean? Imagine you have a vector of n dimensions where each dimension is a measurement of some physical quantity .. say the first dimension represents height, the second weight, the third the waist size of a human being, and so on. Now imagine that the nth dimension represents twice the weight. Wait a minute .. you'd say. The second dimension already represents weight, so why would I have an nth dimension representing twice the weight? Well, you would never design it that way, I say .. but in the real world, that's how the data is, my friend: noisy, not so beautiful in its raw state, and redundant. Redundant in this example means the data in the nth dimension adds nothing new, since you are already measuring the weight in the second dimension, and getting twice, thrice, four or five times the weight is just a piece of cake. So if you could have your ideal vector, it would not have the nth dimension, but unfortunately, this is the real world and that's what you get .. redundant data.

Now PCA could pitch in here and overcome this .. and thus when you project the data into the lower-dimensional space it finds, voila! you get a reduced-dimension vector for each data point, with the redundancy minimized.
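Here is a minimal sketch of that idea in Python (assuming numpy and scikit-learn are available; the "measurements" are made up for illustration): a data matrix where one column is exactly twice another, and PCA reporting that one of the new directions carries essentially no variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical measurements: height, weight, waist (arbitrary units).
base = rng.normal(size=(100, 3))

# A redundant fourth column: exactly twice the weight column.
data = np.column_stack([base, 2 * base[:, 1]])

pca = PCA()
pca.fit(data)
print(pca.explained_variance_ratio_)
# The last ratio is ~0: the fourth direction carries no new information,
# so keeping only the first three components loses (almost) nothing.
```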

That's fine, you say, but what's the talk about finding an "interesting set" of new axes? Well, first, to make it clear, we are talking about finding a set of axes which are orthonormal to each other (mutually perpendicular, unit-length directions). Having set that understanding, let's embark on the voyage of understanding this bit about "interesting axes".

Let's say you have a lot of two-dimensional data points which lie close to the line Y=X. If it makes things easier, imagine a plot where a lot of points are scattered around the line Y=X. Now all this data is represented by 2 dimensions, right? So here is the catch. If I could choose my horizontal axis to be the Y=X line and my vertical axis to be the line perpendicular to it (Y=-X), then all the points would lie very close to my new horizontal axis (even though they could be spread out along that axis). And this, my friend, is great, because if you think about it, that horizontal axis by itself is now a good representation of the data, since the points are clustered around it. You don't need two axes anymore. Voila! We just found a set of interesting new axes and also performed dimensionality reduction, because now we can use just one axis to represent the data points.
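Again, a small sketch (assuming numpy and scikit-learn, with synthetic points scattered around Y=X): PCA recovers the Y=X and Y=-X directions as the new axes, and keeping just the first one gives a one-dimensional representation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Points scattered tightly around the line Y = X.
x = rng.normal(size=200)
points = np.column_stack([x, x + rng.normal(scale=0.1, size=200)])

pca = PCA(n_components=2)
pca.fit(points)
print(pca.components_)                 # rows ~ [0.707, 0.707] and [0.707, -0.707], up to sign
print(pca.explained_variance_ratio_)   # the first component carries ~99% of the variance

# Keep just the first component: each 2-D point becomes a single
# coordinate along the Y = X direction.
reduced = PCA(n_components=1).fit_transform(points)
print(reduced.shape)                   # (200, 1)
```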

That, my friend, is what PCA is and does. It reduces redundancy, finds this interesting set of new axes (which, if you think about it, are the directions of maximum variance in your data) for representing your data and, in the process, reduces your data's dimensionality.
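The proper math comes in the next part, but as a tiny preview of where those "directions of maximum variance" come from, here is a sketch (assuming numpy): center the data, form the covariance matrix, and take its eigenvectors .. the eigenvector with the largest eigenvalue is the first principal component.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
points = np.column_stack([x, x + rng.normal(scale=0.1, size=500)])

# Center the data, form the covariance matrix, and eigendecompose it.
centered = points - points.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# The eigenvector with the largest eigenvalue is the direction of maximum
# variance -- the first principal component (~ the Y = X direction here).
print(eigvecs[:, np.argmax(eigvals)])
```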

All this is great, but it's just blabbering .. nothing concrete .. nothing really mathy, you say? Well, that's true, and I will write another part once I wake from my deep slumber. That part will give a mathematical intuition for these blabberings so that the picture is more complete :). Till then, happy intuition'alising PCA :)