- About These Resources
- Getting Started
- How To Use The Notebooks
- General Prerequisites
- What To Do If You Get Stuck
About These Resources
With libraries like scikit-learn and tensorflow at our fingertips, training and using machine learning models has never been easier. However, in order to understand the results these models give us and their respective strengths and limitations, it's important that we understand how they are obtained.
One of the best ways of understanding how an algorithm works is by implementing it ourselves - if we can implement the algorithm from scratch, then we must understand it on some level. This is often more easily said than done, particularly when first starting out; selecting an appropriate dataset, managing the general structure of the code and verifying that the algorithm has been implemented successfully can all be difficult.
On this website we've implemented a series of Machine Learning algorithms in Jupyter Notebooks. For each algorithm we've provided a series of 'skeleton' notebooks, where we leave it to you to fill in the redacted portions of code using your knowledge of the algorithm and the comments we've left. Depending on your level of confidence, you can choose how much of the algorithm you want to implement for yourself and if you get stuck you can view the solution provided.
Please feel free to get in touch with any comments or suggestions. If there's an algorithm we don't currently provide notebooks for that you'd like to see, please let us know and we'll try to include it.
Getting Started
All of the exercises on this website are written in the Python programming language. Python is an open-source (i.e. free) language that is widely used across the Machine Learning and Data Science community. If you've never used Python before but have some experience using other programming tools such as R or Matlab, then hopefully you'll find it reasonably straightforward to adapt to Python's syntax. If you've had limited exposure to programming before, we would recommend taking an introductory course in Python before attempting to complete the exercises on this website; gaining a firm foundation in the principles of programming will stand you in good stead in the long run!
Jupyter Notebooks
Jupyter Notebooks are extremely useful tools for writing code in Python. They allow us to easily experiment with small snippets of code and interactively develop a larger program. They're widely used in Statistics and Machine Learning because they allow us to easily visualise and annotate the results of our calculations. You can find out more about Jupyter Notebooks here and experiment with using one in your browser without needing to download anything.
Accessing The Notebooks
There are (at least) two ways you can use the Jupyter Notebooks, each with its own pros and cons. The first way is to use one of the online notebooks available on each module page, hosted by The Binder Project. You can edit and run these notebooks online and they have all the required modules installed for you to be able to complete the exercises. Whilst using the online notebooks doesn't require you to download and install anything yourself, you won't be able to save and reload the notebooks: once you leave the page, your work is lost. The online notebooks are therefore primarily for people new to coding who just want to get started; if you start to use Python regularly then you'll want to have it installed on your own computer.
If you already have Python set up on your own computer then we would recommend downloading the .ipynb file (also available on each module page) and loading it as a Jupyter Notebook. If you haven't got Python installed on your own computer yet but want to complete the notebooks on your own computer, the 'Anaconda' section below contains a link to a set of instructions for getting up and running.
Anaconda
To be able to run Jupyter Notebooks on your personal computer, we'd recommend downloading Anaconda. The Anaconda Navigator makes it simple to access Jupyter Notebooks and also allows you to manage different Python environments using Conda. If you find that Anaconda takes up too much disk space for your liking, you can download Miniconda instead: a smaller version that takes up less space, although you might find it a little less easy to work with than the full Anaconda.
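Once Python is installed, a quick check like the sketch below can confirm that the packages a notebook needs are available. The package list here is an assumption based on the libraries mentioned on this page, not an official requirements list, and check_packages is a hypothetical helper name.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Assumed list of packages; adjust it to match the notebook you are working on.
# Note that scikit-learn is imported under the name "sklearn".
required = ["numpy", "pandas", "matplotlib", "sklearn"]

for name, found in check_packages(required).items():
    print(f"{name}: {'installed' if found else 'MISSING'}")
```

Any package reported as missing can usually be installed through the Anaconda Navigator or from the command line with Conda.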
How To Use The Notebooks
Each of the modules contains a series of Jupyter Notebooks which you can download and edit. For most of the modules, there are three different notebooks: Complete, Empty and Redacted.
- "Complete" contains the fully implemented algorithm and is intended to serve as a solution which can be viewed if you get stuck on part of the algorithm. In some cases it might be clear that our implementation is slower than the equivalent model in the scikit-learn package. This is primarily because we have tried to provide a version of the algorithm which is as straightforward to implement as possible, so we might have neglected to include a computational trick which improves the algorithm's efficiency but does little to aid our understanding of the algorithm.
- "Empty" is a slight misnomer as these notebooks still contain lots of code. We still provide the dataset to train the algorithm on, as well as a way of assessing the algorithm's performance once the model has been fit. We have, however, deleted all of the code that implements the algorithm itself, leaving it to you to fill in.
- In the "Redacted" notebook, we have deleted the lines of code which we consider to be essential to understanding how the algorithm works, but have left some of the other code in and on some occasions have provided helpful comments describing what needs to be done at a certain stage.
- On each of the module pages, we have also provided an html version of the completed notebooks, which can be easily viewed in your browser.
Which notebook you should attempt to fill in depends on your levels of experience and confidence. If you're just starting out with Machine Learning, then we would recommend starting on the "Redacted" notebooks. Once you feel that you understand the general structure of how an algorithm should be implemented, try one of the "Empty" ones and see how you get on! The most accessible modules are probably the Linear Regression, PCA and K-Means ones, whilst the Decision Tree and Neural Networks modules are amongst the most challenging.
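To give a feel for the difference between a "Redacted" and a "Complete" notebook, here is a purely hypothetical illustration using linear regression. It is not taken from the actual notebooks; the function names and the toy data are assumptions made for this sketch.

```python
import numpy as np


def fit_redacted(X, y):
    """Hypothetical 'Redacted'-style cell: the structure is given, the key step is not."""
    # Add a column of ones so that the model can learn an intercept
    X_b = np.column_stack([np.ones(len(X)), X])
    # TODO: find the coefficients that minimise the squared error
    raise NotImplementedError("Fill in the least-squares step")


def fit_complete(X, y):
    """Hypothetical 'Complete'-style cell: the same structure with the key step filled in."""
    # Add a column of ones so that the model can learn an intercept
    X_b = np.column_stack([np.ones(len(X)), X])
    # Solve the least-squares problem for the coefficients
    coef, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return coef


# Toy dataset generated from y = 1 + 2x, so we know what answer to expect
X_toy = np.array([[0.0], [1.0], [2.0]])
y_toy = np.array([1.0, 3.0, 5.0])

coef = fit_complete(X_toy, y_toy)
print(coef)  # intercept first, then slope
```

In an "Empty"-style notebook, the whole body of the fitting function would be left for you to write.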
General Prerequisites
Machine Learning theory borrows heavily from Linear Algebra, Multivariate Calculus and Statistics. Whilst we have tried to keep our focus on intuitively understanding how an algorithm works rather than the rigorous maths, some familiarity with the following would be beneficial:
- Matrices
- Differentiation and finding the minimum/maximum of a function
- Differentiating with respect to more than one variable
- Random Variables (especially the normal distribution)
- Maximum Likelihood
- Bayes' Theorem and the basics of Bayesian Statistics
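As a small taste of how several of these topics fit together, the sketch below estimates the mean of a normal distribution by maximum likelihood, using the derivative of the negative log-likelihood to step towards its minimum. The unit variance, the toy data, the learning rate and the function names are all assumptions chosen for illustration.

```python
def neg_log_likelihood_grad(mu, data):
    """Derivative with respect to mu of the negative log-likelihood.

    Assumes the data are normally distributed with unknown mean mu and
    variance 1, so the negative log-likelihood is sum((x - mu)^2) / 2
    plus a constant.
    """
    return sum(mu - x for x in data)


def fit_mean(data, learning_rate=0.1, steps=200):
    """Minimise the negative log-likelihood by gradient descent."""
    mu = 0.0
    for _ in range(steps):
        # Step downhill; this converges when learning_rate < 2 / len(data)
        mu -= learning_rate * neg_log_likelihood_grad(mu, data)
    return mu


data = [2.0, 3.0, 4.0, 7.0]
mu_hat = fit_mean(data)
sample_mean = sum(data) / len(data)

print(mu_hat, sample_mean)
```

Setting the derivative to zero by hand gives the sample mean as the maximum likelihood estimate, which is the value the numerical procedure approaches.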
On the programming side of things, it's well worth being familiar with:
- Manipulating numpy arrays and pandas dataframes
- Writing functions
- Python classes
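If you can comfortably read a snippet like the one below, you should have enough Python to get going. It is only a rough sketch of the kind of code involved: the data, the standardise function and the MeanPredictor class are made up for illustration and do not come from the notebooks.

```python
import numpy as np
import pandas as pd

# numpy arrays support arithmetic on every element at once
heights = np.array([1.5, 1.6, 1.7, 1.8])
centred = heights - heights.mean()

# pandas dataframes hold named columns and can be filtered by condition
df = pd.DataFrame({"height": heights, "weight": [50.0, 60.0, 65.0, 80.0]})
tall = df[df["height"] > 1.65]


def standardise(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


class MeanPredictor:
    """A deliberately trivial 'model' that always predicts the training mean."""

    def fit(self, X, y):
        # Store what was learned from the data as an attribute
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Return one prediction per row of X
        return np.full(len(X), self.mean_)


model = MeanPredictor().fit(df[["height"]], df["weight"])
predictions = model.predict(df[["height"]])
print(predictions)
```

The fit and predict method names mirror the convention used by scikit-learn, which makes it easier to compare your own implementation with the library's.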
What To Do If You Get Stuck
If you get stuck on one of the modules, don't worry! If you really want to know the answer straight away, you can peek at the "Complete" notebook. In general, we'd recommend doing this only as a last resort, as reading someone else's solution is much less beneficial for your understanding than creating your own.
Below is a general checklist of things to try before reading the solution:
- Try checking out some of the resources on the module page or elsewhere on the internet
- Take a pencil and paper and sketch out how the algorithm works on a toy dataset
- Explain in words how the algorithm works to a friend or classmate
- See if you can implement a simpler version of the algorithm (e.g. with a single feature or less depth in the case of a decision tree)
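As an example of the last suggestion, the sketch below shrinks a decision tree down to a single split on a single feature (often called a decision stump). It is a simplified illustration under stated assumptions: it chooses the split by counting misclassifications rather than using a measure such as Gini impurity, and the function names and toy data are made up.

```python
def fit_stump(xs, ys):
    """Find the single threshold on one feature that misclassifies the fewest points.

    Returns (threshold, left_label, right_label), where left_label is predicted
    for x <= threshold and right_label otherwise.
    """
    values = sorted(set(xs))
    if len(values) < 2:
        raise ValueError("Need at least two distinct feature values to split")

    best = None
    # Candidate thresholds are the midpoints between neighbouring feature values
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        # Each side predicts its most common label
        left_label = max(set(left), key=left.count)
        right_label = max(set(right), key=right.count)
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, x):
    """Predict the label for a single feature value x."""
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


# Toy dataset: small values belong to class 0, large values to class 1
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]

stump = fit_stump(xs, ys)
print(stump)
print(predict_stump(stump, 2.0), predict_stump(stump, 5.0))
```

Once a version like this works on a toy dataset you can check by hand, extending it to several features and greater depth is usually a much smaller step.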