K-Nearest Neighbors

Posted: 24/09/2025

A K-NN learner is a type of instance-based (lazy) learner. It represents each instance of the training set as a point in an n-dimensional space. The data is simply stored as is, and for each new instance provided as input, its class/value is estimated by looking at the labels of the k points closest to it.

If it’s a classification task, the output corresponds to the majority vote among the k neighbors. With two classes labeled 0 and 1, the output is calculated from the mean label $\text{avg}_k(x)$ (defined below) as:

\[ h(x) = \begin{cases} 1 & \text{if } \text{avg}_k(x) > 0.5 \\ 0 & \text{otherwise} \end{cases}\]

Or, in the case of multiple classes:

\[h(x) = \arg \max_v \sum_{x_i \in N_k(x)} 1_{v, y_i}, \quad 1_{v, y_i} = \begin{cases} 1 & \text{if } v = y_i \\ 0 & \text{otherwise} \end{cases}\]

If it’s a regression task, the output is calculated as the mean of the neighbors’ labels:

\[h(x) = \text{avg}_k(x) = \frac{\sum_{x_i \in N_k(x)} y_i}{k}\]
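
To make the procedure concrete, here is a minimal sketch of a brute-force K-NN predictor written from scratch in Python with NumPy. The function name `knn_predict` and the toy data are illustrative, not taken from any particular library.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, task="classification"):
    """Predict the label/value of a single query point x from its k nearest neighbors."""
    # Euclidean distance from x to every stored training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_labels = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_labels.tolist()).most_common(1)[0][0]
    # Regression: mean of the neighbors' labels
    return neighbor_labels.mean()

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))  # -> 1
```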

K-NN is not a model in the traditional sense: it does not produce a single hypothesis that can be reused for any new input; instead, it constructs a new hypothesis ad hoc for each query.

A key aspect to consider when using K-NN is setting an appropriate value of k. If k is too large, for example k close to l (the number of training instances), the learner performs poorly: every input, including the training points themselves, maps to roughly the same output (the majority vote among all points, or the mean of all labels), which leads to underfitting.

If k is instead too small, for example k = 1, we get overfitting: each training point is classified correctly, but the generalization error on unseen data is high, since the output is estimated by looking only at the single closest point. As a rule of thumb, k is often set to $\sqrt{l}$, although this heuristic is not infallible; in practice it is safer to compare a few values of k on held-out data, as sketched below.
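
As an illustrative sketch, the snippet below reuses the `knn_predict` function from the previous example on synthetic data, compares a few candidate values of k around $\sqrt{l}$ on a plain holdout split (a stand-in for proper cross-validation), and keeps the one with the lowest validation error. All names and the data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: class 0 around (0, 0), class 1 around (2, 2)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Simple holdout split (a proper evaluation would use cross-validation)
idx = rng.permutation(len(X))
X_tr, y_tr = X[idx[:70]], y[idx[:70]]
X_val, y_val = X[idx[70:]], y[idx[70:]]

def val_error(k):
    """Fraction of validation points misclassified by k-NN with the given k."""
    preds = np.array([knn_predict(X_tr, y_tr, x, k) for x in X_val])
    return np.mean(preds != y_val)

start_k = max(1, int(round(np.sqrt(len(X_tr)))))       # the sqrt(l) rule of thumb
candidates = sorted({1, 3, start_k, 2 * start_k + 1})  # a few values around it
best_k = min(candidates, key=val_error)
print(best_k, val_error(best_k))
```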

A variant of K-NN uses weighted voting, where the weight of each neighbor is the inverse of its squared distance to the input, so that closer points influence the result much more than distant ones.

The hypothesis for classification becomes:

\[h(x) = \arg \max_v \sum_{x_i \in N_k(x)} \frac{1_{v, y_i}}{d(x, x_i)^2}\]

While the hypothesis for regression becomes:

\[h(x) = \frac{\sum_{x_i \in N_k(x)} \frac{y_i}{d(x, x_i)^2}}{\sum_{x_i \in N_k(x)} \frac{1}{d(x, x_i)^2}}\]

Here the division by k is replaced by a normalization over the sum of the weights, so that the prediction remains a proper weighted average of the neighbors’ labels.
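
A minimal sketch of this distance-weighted variant, in the same brute-force style as the first snippet (`weighted_knn_predict` and the small `eps` term, which only guards against division by zero when the query coincides with a training point, are illustrative choices):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k, task="classification", eps=1e-12):
    """Distance-weighted K-NN: closer neighbors receive larger weights 1/d^2."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    weights = 1.0 / (distances[nearest] ** 2 + eps)  # eps avoids division by zero
    labels = y_train[nearest]
    if task == "classification":
        # Sum the weights of the neighbors voting for each class, return the argmax
        classes = np.unique(labels)
        scores = [weights[labels == c].sum() for c in classes]
        return classes[int(np.argmax(scores))]
    # Regression: weighted mean of the neighbors' labels
    return float(np.sum(weights * labels) / np.sum(weights))
```

It can be called exactly like `knn_predict` from the first snippet.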

The main issue with K-NN is that it’s susceptible to the curse of dimensionality: as the number of dimensions in the data increases, the density of the points decreases, so the estimation for new inputs becomes less and less accurate because the nearest neighbors are no longer truly local. K-NN is also sensitive to feature scaling, since it is based on distances: a feature with a large numeric range can dominate the distance computation unless the features are rescaled, as in the sketch below.
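
As a quick sketch of why scaling matters (the toy features below are made up), standardizing each feature with the training statistics before computing distances keeps a large-range feature from dominating the neighbor search:

```python
import numpy as np

# Toy data where the second feature has a much larger numeric range than the first
X_feats = np.array([[0.1, 1000.0], [0.2, 1500.0], [0.9, 1200.0]])
x_new = np.array([0.15, 1400.0])

# Standardize to zero mean and unit variance using the *training* statistics only
mu, sigma = X_feats.mean(axis=0), X_feats.std(axis=0) + 1e-12
X_feats_scaled = (X_feats - mu) / sigma
x_new_scaled = (x_new - mu) / sigma
# Distances computed on the scaled data are no longer dominated by the second
# feature, so the chosen neighbors reflect both features rather than just the
# large-scale one.
```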

The main advantages of K-NN are that it’s very simple to implement, and can be easily “retrained” by adding or removing instances. It can also create decision boundaries of any shape.

The main disadvantages (besides the ones explained above) come from it being a lazy learner: all the training data must be available to make an estimation, so it is very demanding in terms of space, as well as computational time, since estimating a single output with a naive search has a complexity of $O(l n)$ (where $l$ is the number of instances and $n$ is the number of dimensions). It also does not handle missing values.
