Higher-Dimension KNN

Hello! Check out the full project and the code here.

Implementing algorithms from scratch is one of the best ways to understand what’s going on when we use them. Just plugging in the scikit-learn implementation shouldn’t be enough.

Theory

When doing time series forecasting, the quantity we want to forecast is sometimes of higher dimension than usual: a whole vector of values rather than a single number.

For example, when forecasting the price of electricity, it can be useful to predict the hourly price of a whole day instead of forecasting each hourly price recursively.

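The scikit-learn KNN regressor is typically used to predict a single float per sample (although `KNeighborsRegressor` also accepts multi-output targets). The implementation here is written from scratch and predicts arrays directly.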

The algorithm is simple.

Fit: store the training data and set a value for n, the number of neighbours.
Predict: find the n closest points in the training data (with respect to the chosen distance) and return an average (or weighted average) of their corresponding dependent variables.
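
The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not necessarily the project's actual code: the class name `VectorKNN`, its parameters, the Euclidean distance default, and the inverse-distance weighting are all assumptions made for the example.

```python
import math


class VectorKNN:
    """Hypothetical KNN regressor whose targets are vectors, not single floats."""

    def __init__(self, n=2, weighted=True, distance=math.dist):
        self.n = n                # number of neighbours to use
        self.weighted = weighted  # inverse-distance weights if True (an assumption)
        self.distance = distance  # Euclidean distance by default

    def fit(self, X, Y):
        # "Training" only stores the data; there are no parameters to learn.
        self.X = [list(x) for x in X]
        self.Y = [list(y) for y in Y]
        return self

    def predict_one(self, x):
        # Distance from x to every training point, keeping the n smallest.
        nearest = sorted(
            (self.distance(x, xi), i) for i, xi in enumerate(self.X)
        )[: self.n]
        if self.weighted:
            # The small constant avoids dividing by zero on an exact match.
            weights = [1.0 / (d + 1e-12) for d, _ in nearest]
        else:
            weights = [1.0] * len(nearest)
        total = sum(weights)
        dim = len(self.Y[0])
        # Average the neighbours' target vectors component by component.
        return [
            sum(w * self.Y[i][k] for w, (_, i) in zip(weights, nearest)) / total
            for k in range(dim)
        ]

    def predict(self, X):
        return [self.predict_one(x) for x in X]
```

Calling `VectorKNN(n=2).fit(X_train, Y_train).predict(X_test)` returns one predicted vector per test point, with the same length as the training target vectors.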

Example: hourly GOOG stock price

First, we need the stock price for every hour. We use the yfinance library to download the last 2 years of hourly data for GOOG.


Then we build the dataset: each sample's features are the prices of three consecutive days, and the prices of the fourth day are the dependent variable.
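
One way to build such samples is a sliding window over the days. The sketch below assumes the hourly prices have already been grouped into one list per trading day, all of the same length, and that windows overlap (they advance one day at a time). Both are assumptions for illustration, and `make_windows` is a hypothetical helper name.

```python
def make_windows(days, n_input_days=3):
    """Turn a chronological list of per-day price lists into (X, Y) samples."""
    X, Y = [], []
    for start in range(len(days) - n_input_days):
        window = days[start : start + n_input_days]
        # Flatten the input days into a single feature vector.
        X.append([price for day in window for price in day])
        # The day right after the window is the target vector.
        Y.append(list(days[start + n_input_days]))
    return X, Y


# Toy data: five "days" with two prices each.
days = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
X, Y = make_windows(days)
```

With these five toy days, the first sample's features are the six prices of days 1 to 3 and its target is day 4. The second sample shifts everything forward by one day.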


We split the data randomly, using 30% for testing and the rest for training. Training is straightforward: there are no parameters to fit, so we only need to store the training data.
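
A random 70/30 split can be sketched with the standard library alone. The helper name `random_split` and the fixed seed are assumptions added for reproducibility, not part of the original project.

```python
import random


def random_split(X, Y, test_fraction=0.3, seed=0):
    """Shuffle sample indices and split features and targets consistently."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(Y, train_idx), pick(Y, test_idx)
```

Shuffling indices rather than the data itself keeps each feature vector paired with its own target vector.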

Let’s see how predicting works:

Take a point from the test dataset and get the 2 closest neighbors (from the training dataset).
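
The neighbour search itself is a distance computation followed by a sort. The sketch below assumes Euclidean distance and uses made-up toy points. `nearest_neighbours` is a hypothetical helper name.

```python
import math


def nearest_neighbours(x, X_train, n=2):
    """Return (distance, index) pairs for the n training points closest to x."""
    dists = [(math.dist(x, xi), i) for i, xi in enumerate(X_train)]
    return sorted(dists)[:n]


# Toy training set with three 2-D points.
X_train = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
neighbours = nearest_neighbours([0.0, 0.0], X_train, n=2)
```

Returning the indices alongside the distances makes it easy to look up the matching target vectors afterwards.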


We look up the dependent variables of those nearest neighbours.


Finally, we take the weighted average of those dependent variables as the prediction.
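
The post does not say which weights are used, so the sketch below assumes inverse-distance weights, a common choice where closer neighbours count more. The name `weighted_average` and the small `eps` constant are illustrative assumptions.

```python
def weighted_average(targets, distances, eps=1e-12):
    """Combine the neighbours' target vectors using inverse-distance weights."""
    # eps keeps the weight finite when a neighbour is an exact match.
    weights = [1.0 / (d + eps) for d in distances]
    total = sum(weights)
    dim = len(targets[0])
    return [
        sum(w * t[k] for w, t in zip(weights, targets)) / total
        for k in range(dim)
    ]


# Two neighbours with 2-D targets; the first is three times closer.
prediction = weighted_average([[0.0, 0.0], [4.0, 8.0]], [1.0, 3.0])
```

Here the closer neighbour gets three times the weight of the farther one, so the prediction lands a quarter of the way from the first target to the second.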


Notes

This is just an example to illustrate how the KNN algorithm can be adapted to higher-dimensional targets. It does not compare its performance with that of other algorithms.