Machine learning uses computers to turn data into insight and action. One subdomain of machine learning, called supervised learning, focuses on training a machine to learn from prior examples. When the concept to be learned is a set of categories, the task is called classification. From identifying diseases to predicting the weather or detecting whether an image contains a cat, classification tasks are diverse yet common.
In these notes, we will learn classification methods while exploring some real-world applications.
If your experiences on the road are anything like mine, self-driving cars cannot get here soon enough. It is easy to imagine aspects of autonomous driving that involve classification. For example, when a vehicle's camera observes an object, it must classify the object before it can react. Though the algorithms that govern autonomous cars are sophisticated, we can simulate aspects of their behaviour. In this example, we will suppose the vehicle can see, but not distinguish, the roadway signs. Our job will be to use machine learning to classify each sign's type.
To start training a self-driving car, we might supervise it by demonstrating the desired behaviour as it observes each type of sign. We stop at intersections, yield to pedestrians and change speed as needed. After some time under our instruction, the vehicle has built a database that records each sign as well as the target behaviour.
The image below simulates such a dataset: groups of similar-looking signs.
We can spot some similarities among the signs, and the machine can too! A nearest neighbour classifier takes advantage of the fact that signs that look alike should be similar to, or "nearby", other signs of the same type. For example, if the car observes a sign that seems similar to those in the group of stop signs, the car will probably need to stop.
So how does a nearest neighbour learner decide whether two signs are similar?
It measures the similarity between them by measuring the distance between them. That is not to say it measures the distance between signs in physical space; a stop sign in Perak is the same as a stop sign in Melaka. Instead, it imagines the properties of the signs as coordinates in what is called a feature space.
Consider, for instance, the sign's colour. By imagining the colour as a 3-dimensional feature space measuring levels of red, green and blue, signs of similar colour are naturally located close to one another. Once the feature space has been constructed in this way, we can measure distance using a formula like those you may have seen in geometry class. Many nearest neighbour learners use the Euclidean distance formula below, which measures the straight-line distance between two points.
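The formula referenced here is the standard Euclidean distance: the square root of the sum of squared differences between the coordinates of two points. A minimal R sketch, using made-up RGB values for two signs (the vectors sign_a and sign_b are hypothetical and not part of the dataset used later):
# Euclidean distance between two points p and q in feature space
euclidean <- function(p, q) sqrt(sum((p - q)^2))
# hypothetical red, green and blue levels for two signs
sign_a <- c(r = 204, g = 10, b = 12)
sign_b <- c(r = 210, g = 15, b = 5)
euclidean(sign_a, sign_b)  # a small distance means similar colours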
An algorithm called k-Nearest Neighbours, or k-NN, uses the principle of nearest neighbours to classify unlabelled examples. We will get into the specifics later in these notes, but for now it suffices to know that, by default, R's k-NN function searches the dataset for the historic observation most similar to the newly observed one.
The k-NN function is part of the "class" package and requires three parameters: the first is the set of training data, the second is the test data to be classified, and the third is the labels for the training data.
library(class)
pred <- knn(training_data, testing_data, training_labels)
Recognizing a road sign with KNN.
After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone.
As it begins to drive away, its camera captures the following image:
We will apply a k-NN classifier to help the car recognize this sign.
Loading the dataset into the R environment:
signs <- read.csv("signs.csv")
dim(signs)
## [1] 206 51
head(signs$sample)
## [1] train train train train train train
## Levels: example test train
There are 206 observations and 51 variables in the signs dataset. The dataset already indicates whether each observation belongs to the training or test sample, so we will split it into separate training and test sets.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sign_train <- signs %>% filter(sample == "train") %>%
  select(-id, -sample)
sign_test <- signs %>% filter(sample == "test") %>%
  select(-id, -sample)
Now the training dataset is sign_train and the test dataset is sign_test. After dropping the id and sample columns, sign_train has 146 observations of 49 variables. The first 6 variables in the dataset are:
head(names(sign_train))
## [1] "sign_type" "r1" "g1" "b1" "r2" "g2"
We will create a vector of sign labels to use with k-NN by extracting the sign_type column from sign_train.
# creating a vector of labels
sign_types <- sign_train$sign_type
Next, we will classify the newly observed signs using the knn() function. First, we set the train argument equal to the sign_train data frame without the first column (the sign_type label). Then we set the test argument equal to sign_test, also without its first column. Lastly, we pass the vector of labels we created as the cl argument.
#load the "class" package
library(class)
#Classify the next sign observed
knn(train = sign_train[-1], test = sign_test[-1], cl = sign_types)
## [1] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [7] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [13] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [19] pedestrian stop pedestrian speed speed speed
## [25] speed speed speed stop pedestrian speed
## [31] speed speed speed speed speed speed
## [37] speed speed speed speed stop stop
## [43] stop stop stop stop stop stop
## [49] stop stop stop stop stop stop
## [55] stop stop stop stop stop
## Levels: pedestrian speed stop
To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used.
Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue levels for each of the 16 center pixels were recorded, as illustrated here.
The result is a dataset that records the sign_type as well as 16 x 3 = 48 color properties of each sign.
# Examine the structure of the signs dataset
str(sign_train)
## 'data.frame': 146 obs. of 49 variables:
## $ sign_type: Factor w/ 3 levels "pedestrian","speed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ r1 : int 155 142 57 22 169 75 136 149 13 123 ...
## $ g1 : int 228 217 54 35 179 67 149 225 34 124 ...
## $ b1 : int 251 242 50 41 170 60 157 241 28 107 ...
## $ r2 : int 135 166 187 171 231 131 200 34 5 83 ...
## $ g2 : int 188 204 201 178 254 89 203 45 21 61 ...
## $ b2 : int 101 44 68 26 27 53 107 1 11 26 ...
## $ r3 : int 156 142 51 19 97 214 150 155 123 116 ...
## $ g3 : int 227 217 51 27 107 144 167 226 154 124 ...
## $ b3 : int 245 242 45 29 99 75 134 238 140 115 ...
## $ r4 : int 145 147 59 19 123 156 171 147 21 67 ...
## $ g4 : int 211 219 62 27 147 169 218 222 46 67 ...
## $ b4 : int 228 242 65 29 152 190 252 242 41 52 ...
## $ r5 : int 166 164 156 42 221 67 171 170 36 70 ...
## $ g5 : int 233 228 171 37 236 50 158 191 60 53 ...
## $ b5 : int 245 229 50 3 117 36 108 113 26 26 ...
## $ r6 : int 212 84 254 217 205 37 157 26 75 26 ...
## $ g6 : int 254 116 255 228 225 36 186 37 108 26 ...
## $ b6 : int 52 17 36 19 80 42 11 12 44 21 ...
## $ r7 : int 212 217 211 221 235 44 26 34 13 52 ...
## $ g7 : int 254 254 226 235 254 42 35 45 27 45 ...
## $ b7 : int 11 26 70 20 60 44 10 19 25 27 ...
## $ r8 : int 188 155 78 181 90 192 180 221 133 117 ...
## $ g8 : int 229 203 73 183 110 131 211 249 163 109 ...
## $ b8 : int 117 128 64 73 9 73 236 184 126 83 ...
## $ r9 : int 170 213 220 237 216 123 129 226 83 110 ...
## $ g9 : int 216 253 234 234 236 74 109 246 125 74 ...
## $ b9 : int 120 51 59 44 66 22 73 59 19 12 ...
## $ r10 : int 211 217 254 251 229 36 161 30 13 98 ...
## $ g10 : int 254 255 255 254 255 34 190 40 27 70 ...
## $ b10 : int 3 21 51 2 12 37 10 34 25 26 ...
## $ r11 : int 212 217 253 235 235 44 161 34 9 20 ...
## $ g11 : int 254 255 255 243 254 42 190 44 23 21 ...
## $ b11 : int 19 21 44 12 60 44 6 35 18 20 ...
## $ r12 : int 172 158 66 19 163 197 187 241 85 113 ...
## $ g12 : int 235 225 68 27 168 114 215 255 128 76 ...
## $ b12 : int 244 237 68 29 152 21 236 54 21 14 ...
## $ r13 : int 172 164 69 20 124 171 141 205 83 106 ...
## $ g13 : int 235 227 65 29 117 102 142 229 125 69 ...
## $ b13 : int 244 237 59 34 91 26 140 46 19 9 ...
## $ r14 : int 172 182 76 64 188 197 189 226 85 102 ...
## $ g14 : int 228 228 84 61 205 114 171 246 128 67 ...
## $ b14 : int 235 143 22 4 78 21 140 59 21 6 ...
## $ r15 : int 177 171 82 211 125 123 214 235 85 106 ...
## $ g15 : int 235 228 93 222 147 74 221 252 128 69 ...
## $ b15 : int 244 196 17 78 20 22 201 67 21 9 ...
## $ r16 : int 22 164 58 19 160 180 188 237 83 43 ...
## $ g16 : int 52 227 60 27 183 107 211 254 125 29 ...
## $ b16 : int 53 237 60 29 187 26 227 53 19 11 ...
Use table() to count the number of observations of each sign type by passing it the column containing the labels.
# Count the number of signs of each type
table(sign_train$sign_type)
##
## pedestrian speed stop
## 46 49 51
Run the aggregate() command below to see whether the average red level might vary by sign type.
# Check r10's average red level by sign type
aggregate(r10 ~ sign_type, data = sign_train, mean)
## sign_type r10
## 1 pedestrian 113.71739
## 2 speed 80.63265
## 3 stop 132.39216
Now that the autonomous vehicle has successfully stopped on its own, our team feels confident allowing the car to continue the test course.
The test course includes 59 additional road signs divided into three types: pedestrian, speed limit, and stop.
At the conclusion of the trial, we will measure the car's overall performance at recognizing these signs.
The class package and the signs dataset are already loaded in the workspace, as is the data frame sign_test, which holds the set of observations we will test the model on.
# Use kNN to identify the test road signs
sign_types <- sign_train$sign_type
signs_pred <- knn(train = sign_train[-1], test = sign_test[-1], cl = sign_types)
# Create a confusion matrix of the predicted versus actual values
signs_actual <- sign_test$sign_type
table(signs_pred, signs_actual)
## signs_actual
## signs_pred pedestrian speed stop
## pedestrian 19 2 0
## speed 0 17 0
## stop 0 2 19
# Compute the accuracy
mean(signs_pred == signs_actual)
## [1] 0.9322034
We may be wondering why k-NN is called "k" Nearest Neighbours, and what exactly "k" is. The letter k is a variable that specifies the number of neighbours to consider when making the classification. We can imagine it as determining the size of the neighbourhood.
Until now, we have ignored k, and thus R has used the default value of 1. This means that only the single nearest, most similar neighbour was used to classify the unlabelled example. While this seems reasonable on the surface, let's work through an example to see why the value of k may have a substantial impact on the performance of our classifier.
Suppose our vehicle observed the sign at the center of the image here.
Its 5 nearest neighbours are depicted. The single nearest neighbour is a speed limit sign, which shares a very similar background colour. Unfortunately, in this case, a k-NN classifier with k set to 1 would make an incorrect classification.
Slightly further away are the second, third and fourth nearest neighbours, which are all pedestrian crossing signs. Suppose we set k equal to 3. What would happen?
The 3 nearest neighbours, a speed limit sign and 2 pedestrian crossing signs, would take a vote. The category with the majority of nearest neighbours, in this case the pedestrian crossing sign, is the winner.
Increasing k to 5 allows the five nearest neighbours to vote. The pedestrian crossing sign still wins, with a margin of 3 to 2. Note that in the case of a tie, the winner is typically decided at random!
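To make the vote concrete, here is a toy R sketch with hypothetical labels for the five nearest neighbours, ordered from closest to farthest (the neighbours vector is invented purely for illustration):
# hypothetical labels of the five nearest neighbours, closest first
neighbours <- c("speed", "pedestrian", "pedestrian", "pedestrian", "speed")
# k = 1: only the single closest neighbour decides (incorrect here)
neighbours[1]
# k = 3: majority vote among the three closest neighbours
names(which.max(table(neighbours[1:3])))  # "pedestrian" wins 2 to 1
# k = 5: pedestrian crossing still wins, 3 votes to 2
names(which.max(table(neighbours[1:5])))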
In the example above, setting k to a higher value resulted in a correct prediction. But it is not always the case that bigger is better. A small k creates very small neighbourhoods, so the classifier is able to discover very subtle patterns.
As this image illustrates, we might imagine it as being able to distinguish between groups even when their boundary is somewhat “fuzzy”.
On the other hand, sometimes a "fuzzy" boundary is not a true pattern at all, but rather the result of some other factor that adds randomness to the data. This is called noise.
Setting “k” larger, as this image shows, ignores some potentially noisy points in an effort to discover a broader, more general pattern.
So, how should we set “K”?
Unfortunately, there is no universal rule. In practice, the optimal value depends on the complexity of the pattern to be learned, as well as the impact of noisy data. A commonly suggested rule of thumb is to start with k equal to the square root of the number of observations in the training data. For example, if the car had observed 100 previous road signs, we might set k equal to 10.
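As a quick check, this rule of thumb can be computed directly from our training data:
# rule-of-thumb starting value for k: square root of the training set size
sqrt(nrow(sign_train))  # roughly 12 for the 146 training signs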
An even better approach is to test several different values of “k” and compare the performance on data it has not seen before.
By default, the knn() function in the class package uses only the single nearest neighbor.
Setting a k parameter allows the algorithm to consider additional nearby neighbors. This enlarges the collection of neighbors which will vote on the predicted class.
Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.
# Compute the accuracy of the baseline model (default k = 1)
k_1 <- knn(train = sign_train[-1], test = sign_test[-1], cl = sign_types)
mean(signs_actual == k_1)
## [1] 0.9322034
# Modify the above to set k = 7
k_7 <- knn(train = sign_train[-1], test = sign_test[-1], cl = sign_types, k = 7)
mean(signs_actual == k_7)
## [1] 0.9491525
# Set k = 15 and compare to the above
k_15 <- knn(train = sign_train[-1], test = sign_test[-1], cl = sign_types, k = 15)
mean(signs_actual == k_15)
## [1] 0.9152542
When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated.
For example, knowing more about the voters' confidence in the classification could allow an autonomous vehicle to use caution if there is any chance at all that a stop sign is ahead.
In this example, we will learn how to obtain the voting results from the knn() function.
# Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- knn(train = sign_train[-1], test = sign_test[-1], cl = sign_types, k = 7, prob = TRUE)
# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")
# Examine the first several predictions
head(sign_pred)
## [1] pedestrian pedestrian pedestrian stop pedestrian pedestrian
## Levels: pedestrian speed stop
# Examine the proportion of votes for the winning class
head(sign_prob)
## [1] 0.5714286 0.5714286 0.8571429 0.5714286 0.8571429 0.5714286
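As a small illustration of how these vote proportions might be used (the 2/3 threshold below is an arbitrary choice for this sketch), we could flag any prediction decided by a weak majority so the vehicle knows to proceed with caution:
# pair each prediction with the share of neighbour votes it received
sign_conf <- data.frame(prediction = sign_pred, vote_share = sign_prob)
# flag predictions decided by a weak majority (less than 2/3 of the votes)
head(subset(sign_conf, vote_share < 2/3))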
So far, we have seen the k-NN algorithm in action while simulating aspects of a self-driving vehicle. We have gained an understanding of the impact of k on the algorithm's performance, and we know how to examine the neighbours' votes to better understand which predictions are closer to unanimous.
But before applying k-NN to our own projects, we need to know one important thing: how to prepare our data for nearest neighbours.
As noted previously, nearest neighbour learners use a distance function to identify the most similar, or nearest, examples. Many common distance functions assume that the data are in numeric format, as it is difficult to define the distance between categories.
For example, there is no obvious way to define the distance between "red" and "yellow"; consequently, the traffic sign dataset represented colour using numeric intensities. But suppose we have a property that cannot be measured numerically, such as whether a road sign is a rectangle, diamond or octagon. A common solution uses 1/0 indicators to represent these categories; this is called dummy coding. A binary dummy variable is created for each category except one, and is set to 1 if the category applies and 0 otherwise. The category that is left out can easily be deduced: if a sign is not a rectangle (rectangle = 0) and not a diamond (diamond = 0), then it must be an octagon.
Dummy-coded data can be used directly in a distance function: two rectangle signs, both having values of 1, will be found to be closer together than a rectangle and a diamond.
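A minimal sketch of dummy coding in R, using a made-up shape variable (these shape values are hypothetical and not part of the signs dataset):
# hypothetical sign shapes
shape <- c("rectangle", "diamond", "octagon", "octagon")
# one 1/0 indicator per category, leaving octagon as the implied category
is_rectangle <- ifelse(shape == "rectangle", 1, 0)
is_diamond   <- ifelse(shape == "diamond", 1, 0)
data.frame(shape, is_rectangle, is_diamond)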
It is also important to be aware that, when calculating distance, each feature of the input data should be measured on the same range of values. This was true for the traffic sign data: each colour component ranged from a minimum of 0 to a maximum of 255. However, suppose we added the 1/0 dummy variables for sign shape to the distance calculation. Two different shapes can differ by at most one unit, but two different colours may differ by as much as 255 units! Such a difference in scale allows features with a wider range to have more influence over the distance calculation, as this figure illustrates.
Here, the topmost speed limit sign is closer to the pedestrian sign than it is to its correct neighbours, simply because the range of blue values is wider than the 0 to 1 range of shape values.
Compressing the blue axis so that it also follows a 0 to 1 range corrects this issue, and the speed limit sign is now closer to its true neighbours.
R does not have a built-in function to rescale data to a given range, so we need to create one ourselves.
The code below defines a function called normalize which can be used to perform min-max normalization.
# define the min-max normalize() function
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
This rescales a vector x so that its minimum value is zero and its maximum value is one. It does this by subtracting the minimum value from each value of x and dividing by the range of x values.
After applying this function to “r1”, one of the color vectors, we can use the summary function to see that the new minimum and maximum values are 0 and 1 respectively.
# normalized version of r1
summary(normalize(sign_train$r1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1894 0.3571 0.4237 0.6483 1.0000
Calculating the same summary statistics for the unnormalized data shows a minimum of 3 and a maximum of 234.
# unnormalized version of r1
summary(sign_train$r1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 46.75 85.50 100.87 152.75 234.00
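As a sketch of how this might be applied before classification, normalize() can be run over every colour column at once (the label column is excluded; in practice, the test set should be rescaled using the training set's minimum and maximum values rather than its own):
# apply min-max normalization to all 48 colour columns of the training data
sign_train_n <- as.data.frame(lapply(sign_train[-1], normalize))
summary(sign_train_n$r1)  # now ranges from 0 to 1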
. . . . . . . . . . . . . .
Disclaimer: These notes are not my original work; they are adapted from the DataCamp platform for my own learning of machine learning classifiers!