Principal Components Analysis

There are lots of reasons why we might want to reduce the number of features in our dataset. For example, we might want to visualise the data in 2- or 3-D, or we might only have a limited number of examples and seek to reduce the risk of overfitting.

However, if we discard features from our dataset, there's a chance we might be throwing away information which is essential for our model to achieve good predictive accuracy. We can use principal components analysis (PCA) to reduce the size of our feature set whilst retaining important information. Rather than dropping features outright, PCA replaces them with a smaller set of new features (the principal components), each a linear combination of the originals, chosen to capture as much of the variance in the data as possible. In this context, 'important information' means that observations which differed in our original feature set will, as far as possible, still differ in our transformed dataset, and that points which were far away from each other in the original dataset will tend to remain far away from each other.
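To make the idea concrete, here is a minimal sketch of PCA using NumPy's singular value decomposition. It is an illustration only and is not taken from the accompanying notebooks: the function name `pca`, the synthetic dataset and the variable names are all assumptions made for this example.

```python
import numpy as np


def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # Centre each feature so the components describe variance around the mean.
    X_centred = X - X.mean(axis=0)
    # SVD of the centred data: the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of the total variance captured by each retained component.
    explained_variance_ratio = (S ** 2)[:n_components] / np.sum(S ** 2)
    return X_centred @ components.T, components, explained_variance_ratio


# Synthetic data (an assumption for illustration): 200 points in 3-D that lie
# close to a 2-D plane, so one direction carries almost no information.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, -0.3]])
X = latent @ mixing + 0.01 * rng.normal(size=(200, 3))

X_reduced, components, ratio = pca(X, n_components=2)

# Compare the distance between two observations before and after reduction.
d_original = np.linalg.norm(X[0] - X[1])
d_reduced = np.linalg.norm(X_reduced[0] - X_reduced[1])

print(X_reduced.shape)  # (200, 2)
print(ratio.sum())      # share of variance kept by the two components
print(d_original, d_reduced)
```

Because the retained directions are orthonormal, the distance between two points in the reduced space can never exceed their original distance. When the discarded directions carry little variance, as in this synthetic dataset, the two distances are almost equal, which is the sense in which far-apart points remain far apart.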

Online resources

  • An example of PCA being applied to the classic Iris dataset (notes from a course at the University of Maryland);
  • An excellent visualisation tool created by Victor Powell.

Click the links below to access the Jupyter Notebooks for PCA.