Support Vector Machines in R
Kailash Awati - DataCamp
Course Description
This course will introduce a powerful classifier, the support vector machine (SVM), using an intuitive, visual approach. Support Vector Machines in R will help students develop an understanding of the SVM model as a classifier and gain practical experience using R's libsvm implementation from the e1071 package. Along the way, students will gain an intuitive understanding of important concepts, such as hard and soft margins, the kernel trick, different types of kernels, and how to tune SVM parameters. Get ready to classify data with this impressive model.
1 Introduction
1.1 Sugar content of soft drinks
1.1.1 Visualizing a sugar content dataset
In this exercise, you will create a 1-dimensional scatter plot of 25 soft drink sugar content measurements. The aim is to visualize distinct clusters in the dataset as a first step towards identifying candidate decision boundaries.
The dataset with 25 sugar content measurements is stored in the sugar_content column of the data frame df, which has been preloaded for you.
Load the ggplot2 package.
# Load ggplot2
library(ggplot2)
Print the variable names of df.
library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
df=data.frame(sugar_content=c(10.9,10.9,10.6,10,8,8.2,8.6,10.9,10.7,8,7.7,7.8,8.4,11.5,11.2,8.9,8.7,7.4,10.9,10,11.4,10.8,8.5,8.2,10.6)) %>% mutate(sample.=row_number())
# Print variable names
colnames(df)## [1] "sugar_content" "sample."
Using the df dataset, plot the sugar content of samples along the x-axis (at y equal to zero).
# Plot sugar content along the x-axis
plot_df <- ggplot(data = df, aes(x = sugar_content, y = 0)) + 
    geom_point() + 
    geom_text(aes(label = sugar_content), size = 2.5, vjust = 2, hjust = 0.5)
Run the ggplot() code to display sugar content in df as a scatter plot. Can you spot two distinct clusters corresponding to high and low sugar content samples?
# Display plot
plot_df
Nice work! Notice the gap between 9 and 10. Samples with sugar content below 9 form a “low sugar” cluster, and samples above 10 form a “high sugar” cluster.
1.1.2 Identifying decision boundaries
Based on the plot you created in the previous exercise (reproduced on the right), which of the following points is not a legitimate decision boundary?
- 9 g/100 ml
- 9.1 g/100 ml
- 9.8 g/100 ml
- 8.9 g/100 ml
That’s correct! 8.9 g/100ml is not a legitimate decision boundary as it is part of the lower sugar content cluster.
1.1.3 Find the maximal margin separator
Recall that the dataset we are working with consists of measurements of sugar content of 25 randomly chosen samples of two soft drinks, one regular and the other reduced sugar. In one of our earlier plots, we identified two distinct clusters (classes). A dataset in which the classes do not overlap is called separable, the classes being separated by a decision boundary. The maximal margin separator is the decision boundary that is furthest from both classes. It is located at the mean of the relevant extreme points from each class. In this case the relevant points are the highest valued point in the low sugar content class and the lowest valued point in the high sugar content class. This exercise asks you to find the maximal margin separator for the sugar content dataset.
Assign the maximal margin separator to the variable mm_separator.
#The maximal margin separator is at the midpoint of the two extreme points in each cluster.
mm_separator <- (8.9+10)/2
Well done! We’ll visualize the separator in the next exercise.
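As an aside, the separator can also be computed from the data rather than typed in by hand. A minimal sketch, assuming the data frame df from the earlier exercise and using the gap at 9 g/100 ml as a hypothetical cluster threshold:
#locate the highest point of the low sugar cluster and the lowest point of the high sugar cluster
low_max <- max(df$sugar_content[df$sugar_content < 9])
high_min <- min(df$sugar_content[df$sugar_content > 9])
#the maximal margin separator is their midpoint
(low_max + high_min)/2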
1.1.4 Visualize the maximal margin separator
In this exercise you will add the maximal margin separator to the scatter plot you created in an earlier exercise. The plot has been reproduced on the right.
Create a data frame separator containing the maximal margin separator. This is available in the variable mm_separator (enter mm_separator to see it).
#create data frame containing the maximum margin separator
separator <- data.frame(sep = mm_separator)
#the scatter plot from the earlier exercise is available as plot_
plot_ <- plot_df
#add separator to sugar content scatterplot
plot_sep <- plot_ + geom_point(data = separator, aes(x = sep, y = 0), color = "blue", size = 4)
#display plot
plot_sep
Well done! It should be clear from the plot that the blue point is the best possible separator. Why?
1.2 Linearly separable dataset
1.2.1 Generate a 2d uniformly distributed dataset.
The aim of this lesson is to create a dataset that will be used to illustrate the basic principles of support vector machines. In this exercise we will do the first step, which is to create a 2-dimensional uniformly distributed dataset containing 600 data points.
Set the seed and the number of data points, n.
#set seed
set.seed(42)
#set number of data points. 
n <- 600
Create a dataframe df with two uniformly distributed variables, x1 and x2, lying in (0, 1).
#Generate data frame with two uniformly distributed predictors lying between 0 and 1.
df <- data.frame(x1 = runif(n), 
                 x2 = runif(n))
Good work. Next we’ll divide the dataset into two classes that are separated by a linear decision boundary.
1.2.2 Create a decision boundary
The dataset you created in the previous exercise is available to you in the dataframe df
 (recall that it consists of two uniformly distributed variables x1 and 
x2, lying between 0 and 1). In this exercise you will add a class 
variable to that dataset. You will do this by creating a variable y whose value is -1 or +1 depending on whether the point (x1, x2) lies below or above the straight line that passes through the origin and has slope 1.4.
Create a new variable y in the dataframe df with the following specs:
y = -1 if x2 < 1.4*x1
y = 1 if x2 > 1.4*x1
#classify data points depending on location
df$y <- factor(ifelse(df$x2-1.4*df$x1 < 0, -1, 1), 
    levels = c(-1, 1))
Nice work. Next we’ll introduce a margin in the dataset and visualize it.
1.2.3 Introduce a margin in the dataset
Your final task for Chapter 1 is to create a margin in the dataset that 
you generated in the previous exercise and then display the margin in a 
plot. The ggplot2 library has been preloaded for you. 
Recall that the slope of the linear decision boundary you created in the
 previous exercise is 1.4.
Create a margin delta of 0.07 units in your dataset.
#set margin
delta <- 0.07
# retain only those points that lie outside the margin
df1 <- df[abs(1.4*df$x1 - df$x2) > delta, ]
#build plot
plot_margins <- ggplot(data = df1, aes(x = x1, y = x2, color = y)) + geom_point() + 
    scale_color_manual(values = c("red", "blue")) + 
    geom_abline(slope = 1.4, intercept = 0)+
    geom_abline(slope = 1.4, intercept = delta, linetype = "dashed") +
    geom_abline(slope = 1.4, intercept = -delta, linetype = "dashed")
 
#display plot
plot_margins
Nice work! We will use this dataset to learn about linear support vector machines in the next chapter.
2 Linear Kernels
2.1 Linear SVM
2.1.1 Creating training and test datasets
Splitting a dataset into training and test sets is an important step in building and testing a classification model. The training set is used to build the model and the test set to evaluate its predictive accuracy.
In this exercise, you will split the dataset you created in the previous
 chapter into training and test sets. The dataset has been loaded in the
 dataframe df and a seed has already been set to ensure reproducibility.
Create a new column train in df and randomly assign 80% of the rows in df a value of 1 for this column (and the remaining rows a value of 0).
#split train and test data in an 80/20 proportion
df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)train == 1 to the dataframe trainset and those with train == 0 to the dataframe testset.
#assign training rows to data frame trainset
trainset <- df[df$train == 1, ]
#assign test rows to data frame testset
testset <- df[df$train == 0, ]
Remove the train column from the training and test datasets by index.
#find index of "train" column
trainColNum <- grep("train", names(df))
#remove "train" column from train and test dataset
trainset <- trainset[, -trainColNum]
testset <- testset[, -trainColNum]
Nice work! In the next exercise we will use these datasets to build our first SVM model.
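Note that the runif() approach above gives an approximately (not exactly) 80/20 split. If an exact split is preferred, a sketch using sample() on the row indices could look like this (train_rows, trainset_exact, and testset_exact are names introduced here for illustration):
#draw exactly 80% of the row indices at random for the training set
train_rows <- sample(nrow(df), size = round(0.8 * nrow(df)))
trainset_exact <- df[train_rows, ]
testset_exact <- df[-train_rows, ]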
2.1.2 Building a linear SVM classifier
In this exercise, you will use the svm() function from the e1071 library to build a linear SVM classifier using the training dataset you created in the previous exercise. The training dataset has been loaded for you in the dataframe trainset.
Load the e1071 library.
library(e1071)
#build svm model, setting required parameters
svm_model<- svm(y ~ ., 
                data = trainset, 
                type = "C-classification", 
                kernel = "linear", 
                scale = FALSE)
Nice work! In the next exercise we will explore the contents of the model.
2.1.3 Exploring the model and calculating accuracy
In this exercise you will explore the contents of the model and 
calculate its training and test accuracies. The training and test data 
are available in the data frames trainset and testset respectively, and the SVM model is stored in the variable svm_model.
- List the components of your SVM model.
 
#list components of model
names(svm_model)##  [1] "call"            "type"            "kernel"          "cost"           
##  [5] "degree"          "gamma"           "coef0"           "nu"             
##  [9] "epsilon"         "sparse"          "scaled"          "x.scale"        
## [13] "y.scale"         "nclasses"        "levels"          "tot.nSV"        
## [17] "nSV"             "labels"          "SV"              "index"          
## [21] "rho"             "compprob"        "probA"           "probB"          
## [25] "sigma"           "coefs"           "na.action"       "fitted"         
## [29] "decision.values" "terms"
List the contents of SV, index, and rho.
#list values of the SV, index and rho
head(svm_model$SV)##           x1        x2
## 7  0.7365883 0.7688522
## 8  0.1346666 0.1639289
## 14 0.2554288 0.3517920
## 19 0.4749971 0.4866429
## 40 0.6117786 0.7146319
## 45 0.4317512 0.5203398
svm_model$index##   [1]   5   6  10  14  30  33  45  47  54  72  83  85  95  97 113 116 117 118
##  [19] 132 138 142 146 150 160 164 165 170 175 202 207 214 216 221 225 229 235
##  [37] 242 243 253 269 277 279 284 293 296 316 325 332 335 336 338 345 352 366
##  [55] 372 383 385 387 388 397 406 414 416 434 435 453 468 485 489 490 492  13
##  [73]  27  37  46  50  61  68  70  74  76  81  86  90  93  94 101 107 122 130
##  [91] 144 147 156 162 176 178 179 181 183 192 196 197 200 204 205 213 231 241
## [109] 254 255 257 290 297 302 311 312 321 340 354 359 364 371 384 390 391 395
## [127] 398 404 420 421 433 438 452 457 460 464 467 477 481 486
svm_model$rho## [1] -0.2796884
Calculate the training accuracy of the model.
#compute training accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)## [1] 0.9776423
Calculate the test accuracy of the model.
#compute test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)## [1] 0.9814815
Excellent! You are now ready for the next lesson in which we’ll visually explore the model.
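Accuracy is a single summary number; if you want to see where the errors occur, a quick sketch of a confusion matrix for the test set (using the pred_test object computed above) is:
#cross-tabulate predicted versus actual classes on the test set
table(predicted = pred_test, actual = testset$y)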
2.2 Visualizing Linear SVMs
2.2.1 Visualizing support vectors using ggplot
In this exercise you will plot the training dataset you used to build a 
linear SVM and mark out the support vectors. The training dataset has 
been preloaded for you in the dataframe trainset and the SVM model is stored in the variable svm_model.
Load ggplot2.
#load ggplot
library(ggplot2)
#build scatter plot of training dataset
scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
    geom_point() + 
    scale_color_manual(values = c("red", "blue"))
#add plot layer marking out the support vectors 
layered_plot <- 
    scatter_plot + geom_point(data = trainset[svm_model$index, ], aes(x = x1, y = x2), color = "purple", size = 4, alpha = 0.5)
#display plot
layered_plot
Well done! Now let’s add the decision and margin boundaries to the plot.
2.2.2 Visualizing decision & margin bounds using ggplot2
In this exercise, you will add the decision and margin boundaries to the
 support vector scatter plot created in the previous exercise. The SVM 
model is available in the variable svm_model and the weight vector has been precalculated for you and is available in the variable w. The ggplot2 library has also been preloaded.
#calculate the weight vector from the model coefficients and support vectors
w <- t(svm_model$coefs) %*% svm_model$SV
#calculate slope and intercept of decision boundary from weight vector and svm model
slope_1 <- -w[1]/w[2]
intercept_1 <- svm_model$rho/w[2]
#build scatter plot of training dataset
scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
    geom_point() + scale_color_manual(values = c("red", "blue"))
#add decision boundary 
plot_decision <- scatter_plot + geom_abline(slope = slope_1, intercept = intercept_1) 
#add margin boundaries
plot_margins <- plot_decision + 
 geom_abline(slope = slope_1, intercept = intercept_1 - 1/w[2], linetype = "dashed")+
 geom_abline(slope = slope_1, intercept = intercept_1 + 1/w[2], linetype = "dashed")
#display plot
plot_margins
Excellent! We’ll now visualize the decision regions and support vectors using the svm plot function.
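As a sanity check on the margin boundaries drawn above: with this parameterization the decision boundary is w.x - rho = 0 and the margin boundaries are w.x - rho = +/-1, so the dashed lines are offset vertically by 1/w[2] and the perpendicular margin width is 2/||w||. A short sketch using the weight vector w computed above:
#perpendicular distance between the two dashed margin boundaries
margin_width <- 2/sqrt(sum(w^2))
margin_width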
2.2.3 Visualizing decision & margin bounds using plot()
In this exercise, you will rebuild the SVM model (as a refresher) and use the built-in SVM plot() function to visualize the decision regions and support vectors. The training data is available in the dataframe trainset.
#load required library
library(e1071)
#build svm model
svm_model<- 
    svm(y ~ ., data = trainset, type = "C-classification", 
        kernel = "linear", scale = FALSE)#plot decision boundaries and support vectors for the training data
plot(x = svm_model, data = trainset)Excellent! We’re now ready for the next lesson in which we’ll learn how to tune linear SVMs.
2.3 Tuning linear SVMs
2.3.1 Tuning a linear SVM
In this exercise you will study the influence of varying cost on the 
number of support vectors for linear SVMs. To do this, you will build 
two SVMs, one with cost = 1 and the other with cost = 100 and find the 
number of support vectors. A model training dataset is available in the 
dataframe trainset.
#build svm model, cost = 1
svm_model_1 <- svm(y ~ .,
                   data = trainset,
                   type = "C-classification",
                   cost = 1,
                   kernel = "linear",
                   scale = FALSE)
#print model details
svm_model_1## 
## Call:
## svm(formula = y ~ ., data = trainset, type = "C-classification", 
##     cost = 1, kernel = "linear", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  140
#build svm model, cost = 100
svm_model_100 <- svm(y ~ .,
                   data = trainset,
                   type = "C-classification",
                   cost = 100,
                   kernel = "linear",
                   scale = FALSE)
#print model details
svm_model_100## 
## Call:
## svm(formula = y ~ ., data = trainset, type = "C-classification", 
##     cost = 100, kernel = "linear", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  100 
## 
## Number of Support Vectors:  32
Excellent! The number of support vectors decreases as cost increases because the margin becomes narrower.
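To see this trend over a wider range, here is a sketch that loops over a few (arbitrarily chosen) cost values and records the number of support vectors; tot.nSV is the total count stored in each fitted model:
#count support vectors for a range of cost values
costs <- c(0.1, 1, 10, 100, 1000)
n_sv <- sapply(costs, function(cost_val){
    model <- svm(y ~ ., data = trainset, type = "C-classification",
                 cost = cost_val, kernel = "linear", scale = FALSE)
    model$tot.nSV
})
data.frame(cost = costs, support_vectors = n_sv)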
2.3.2 Visualizing decision boundaries and margins
In the previous exercise you built two linear classifiers for a linearly separable dataset, one with cost = 1 and the other cost = 100.
 In this exercise you will visualize the margins for the two classifiers
 on a single plot. The following objects are available for use:
- The training dataset: trainset.
- The cost = 1 and cost = 100 classifiers in svm_model_1 and svm_model_100, respectively.
- The slope and intercept for the cost = 1 classifier, stored in slope_1 and intercept_1.
- The slope and intercept for the cost = 100 classifier, stored in slope_100 and intercept_100.
- Weight vectors for the two costs, stored in w_1 and w_100, respectively.
- A basic scatter plot of the training data, stored in train_plot.
The ggplot2 library has been preloaded.
#weight vector for the cost = 1 classifier
w_1 <- t(svm_model_1$coefs) %*% svm_model_1$SV
train_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
    geom_point() + scale_color_manual(values = c("red", "blue"))
#add decision boundary and margins for cost = 1 to training data scatter plot
train_plot_with_margins <- train_plot + 
    geom_abline(slope = slope_1, intercept = intercept_1) +
    geom_abline(slope = slope_1, intercept = intercept_1-1/w_1[2], linetype = "dashed")+
    geom_abline(slope = slope_1, intercept = intercept_1+1/w_1[2], linetype = "dashed")
#display plot
train_plot_with_margins
#build svm model, cost = 100
svm_model_100 <- svm(y ~ .,
                   data = trainset,
                   type = "C-classification",
                   cost = 100,
                   kernel = "linear",
                   scale = FALSE)
w_100=t(svm_model_100$coefs) %*% svm_model_100$SV
#calculate slope and intercept of decision boundary from weight vector and svm model
slope_100 <- -w_100[1]/w_100[2]
intercept_100 <- svm_model_100$rho/w_100[2]
train_plot_100=train_plot_with_margins
#add decision boundary and margins for cost = 100 to training data scatter plot
train_plot_with_margins <- train_plot_100 + 
    geom_abline(slope = slope_100, intercept = intercept_100, color = "goldenrod") +
    geom_abline(slope = slope_100, intercept = intercept_100-1/w_100[2], linetype = "dashed", color = "goldenrod")+
    geom_abline(slope = slope_100, intercept = intercept_100+1/w_100[2], linetype = "dashed", color = "goldenrod")
#display plot 
train_plot_with_margins
Well done! The plot clearly shows the effect of increasing the cost on linear classifiers.
2.3.3 When are soft margin classifiers useful?
In this lesson, we looked at an example in which a soft margin linear SVM (low cost, wide margin) had a better accuracy than its hard margin counterpart (high cost, narrow margin). Which of the phrases listed best completes the following statement:
Linear soft margin classifiers are most likely to be useful when:
- Working with a linearly separable dataset.
- Dealing with a dataset that has a highly nonlinear decision boundary.
- Working with a dataset that is almost linearly separable.
2.4 Multiclass problems
2.4.1 A multiclass classification problem
In this exercise, you will use the svm() function from the e1071 library to build a linear multiclass SVM classifier for a dataset that is known to be perfectly
 linearly separable. Calculate the training and test accuracies, and 
plot the model using the training data. The training and test datasets 
are available in the dataframes trainset and testset. Use the default setting for the cost parameter.
- Load the required library and build a default cost linear SVM.
 
#load library and build svm model
library(e1071)
svm_model<- 
    svm(y ~ ., data = trainset, type = "C-classification", 
        kernel = "linear", scale = FALSE)Calculate training accuracy.
#compute training accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)## [1] 0.9776423
Calculate test accuracy.
#compute test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)## [1] 0.9814815
Plot classifier against training data.
#plot
plot(svm_model, trainset)
Well done! The model performs very well even for default settings. The actual separators are lines that pass through the origin at angles of 30 and 60 degrees to the horizontal.
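For reference, libsvm (and therefore e1071’s svm()) handles multiclass classification with a one-against-one scheme: it fits a binary classifier for every pair of classes and predicts by majority vote. A quick sketch to check the number of classes and the implied number of pairwise classifiers for the model above:
#number of classes and number of pairwise (one-against-one) binary classifiers
k <- svm_model$nclasses
c(classes = k, pairwise_classifiers = choose(k, 2))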
2.4.2 Iris redux - a more robust accuracy measure
In this exercise, you will build linear SVMs for 100 distinct training/test partitions of the iris dataset. You will then evaluate the performance of your model by calculating the mean accuracy and standard deviation. This procedure, which is quite general, will give you a far more robust measure of model performance than the ones obtained from a single partition.
For each trial:
- Partition the dataset into training and test sets in a random 80/20 split.
- Build a default cost linear SVM on the training dataset.
- Evaluate the accuracy of your model.
accuracy=NULL
for (i in 1:100){ 
    #assign 80% of the data to the training set
    iris[, "train"] <- ifelse(runif(nrow(iris)) < 0.8, 1, 0)
    trainColNum <- grep("train", names(iris))
    trainset <- iris[iris$train == 1, -trainColNum]
    testset <- iris[iris$train == 0, -trainColNum]
    #build model using training data
    svm_model <- svm(Species~ ., data = trainset, 
                     type = "C-classification", kernel = "linear")
    #calculate accuracy on test data
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$Species)
}
mean(accuracy)## [1] 0.9642514
sd(accuracy)## [1] 0.03151982
Well done! The high accuracy and low standard deviation confirms that the dataset is almost linearly separable.
3 Polynomial Kernels
3.1 Radially separable dataset
3.1.1 Generating a 2d radially separable dataset
In this exercise you will create a 2d radially separable dataset containing 400 uniformly distributed data points.
Create a dataframe df with:
- 400 points with variables x1 and x2.
- x1 and x2 uniformly distributed in (-1, 1).
#set number of variables and seed
n <- 400
set.seed(1)
#Generate data frame with two uniformly distributed predictors, x1 and x2
df <- data.frame(x1 = runif(n, min = -1, max = 1), 
                 x2 = runif(n, min = -1, max = 1))
#We want a circular boundary. Set boundary radius 
radius <- 0.8
radius_squared <- radius^2
Create the dependent variable df$y, which takes value -1 or 1 depending on whether a point lies within or outside the circle.
#create dependent categorical variable, y, with value -1 or 1 depending on whether point lies
#within or outside the circle.
df$y <- factor(ifelse(df$x1^2 + df$x2^2 < radius_squared, -1, 1), levels = c(-1, 1))
Excellent! Now let’s visualize the dataset.
3.1.2 Visualizing the dataset
In this exercise you will use ggplot() to visualize the dataset you created in the previous exercise. The dataset is available in the dataframe df. Use color to distinguish between the two classes.
Load the ggplot2 library.
#load ggplot
library(ggplot2)
#build scatter plot, distinguish class by color
scatter_plot <- ggplot(data = df, aes(x = x1, y = x2, color = y)) + 
    geom_point() +
    scale_color_manual(values = c("red", "blue"))
#display plot
scatter_plot
Nice work! We’ll use this dataset extensively in this chapter.
3.2 Linear SVMs on radial data
3.2.1 Linear SVM for a radially separable dataset
In this exercise you will build two linear SVMs, one for cost = 1 
(default) and the other for cost = 100, for the radially separable 
dataset you created in the first lesson of this chapter. You will also 
calculate the training and test accuracies for both costs. The e1071 library has been loaded, and test and training datasets have been created for you and are available in the data frames trainset and testset.
#split train and test data in an 80/20 proportion
df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)
#assign training rows to data frame trainset
trainset <- df[df$train == 1, ]
#assign test rows to data frame testset
testset <- df[df$train == 0, ]
#find index of "train" column
trainColNum <- grep("train", names(df))
#remove "train" column from train and test dataset
trainset <- trainset[, -trainColNum]
testset <- testset[, -trainColNum]
#default cost model
svm_model_1 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 1, kernel = "linear")
#training accuracy
pred_train <- predict(svm_model_1, trainset)
mean(pred_train == trainset$y)## [1] 0.5673981
#test accuracy
pred_test <- predict(svm_model_1, testset)
mean(pred_test == testset$y)## [1] 0.4938272
- Set cost = 100 and repeat.
 
#cost = 100 model
svm_model_2 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 100, kernel = "linear")
#accuracy
pred_train <- predict(svm_model_2, trainset)
mean(pred_train == trainset$y)## [1] 0.5673981
pred_test <- predict(svm_model_2, testset)
mean(pred_test == testset$y)## [1] 0.4938272
Good work! Next, we’ll get a more reliable measure of accuracy for one of the models.
3.2.2 Average accuracy for linear SVM
In this exercise you will calculate the average accuracy for a default 
cost linear SVM using 100 different training/test partitions of the 
dataset you generated in the first lesson of this chapter. The e1071 library has been preloaded and the dataset is available in the dataframe df. Use random 80/20 splits of the data in df when creating training and test datasets for each iteration.
- Create a vector to hold accuracies for each step.
 
# Create vector to store accuracies and set random number seed
accuracy <- rep(NA, 100)
set.seed(2)
Create training/test datasets, build default cost SVMs, and calculate the test accuracy for each iteration.
# Calculate accuracies for 100 training/test partitions
for (i in 1:100){
    df[, "train"] <- ifelse(runif(nrow(df)) < 0.8, 1, 0)
    trainset <- df[df$train == 1, ]
    testset <- df[df$train == 0, ]
    trainColNum <- grep("train", names(trainset))
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]
    svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear")
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$y)
}
Compute the average accuracy and standard deviation over all iterations.
# Print average accuracy and its standard deviation
mean(accuracy)## [1] 0.5554571
sd(accuracy)## [1] 0.04243524
3.3 The kernel trick
3.3.1 Visualizing transformed radially separable data
In this exercise you will transform the radially separable dataset you created earlier in this chapter and visualize it in the x1^2-x2^2 plane. As a reminder, the separation boundary for the data is the circle x1^2 + x2^2 = 0.64 (radius = 0.8 units). The dataset has been loaded for you in the dataframe df.
#transform data
df1 <- data.frame(x1sq = df$x1^2, x2sq = df$x2^2, y = df$y)
#plot data points in the transformed space
plot_transformed <- ggplot(data = df1, aes(x = x1sq, y = x2sq, color = y)) + 
    geom_point() + guides(color = "none") + 
    scale_color_manual(values = c("red", "blue"))
#add decision boundary and visualize
plot_decision <- plot_transformed + geom_abline(slope = -1, intercept = 0.64)
plot_decision
Excellent! As expected, the data is linearly separable in the x1^2 - x2^2 plane.
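To confirm the point numerically, here is a sketch (assuming e1071 is loaded; train_rows and svm_sq are names introduced here) that fits a linear SVM on the transformed data frame df1 and checks its accuracy on a held-out 20%:
#a linear SVM in the x1^2-x2^2 space should separate the classes almost perfectly
train_rows <- sample(nrow(df1), size = round(0.8 * nrow(df1)))
svm_sq <- svm(y ~ ., data = df1[train_rows, ], type = "C-classification", kernel = "linear")
mean(predict(svm_sq, df1[-train_rows, ]) == df1$y[-train_rows])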
3.3.2 SVM with polynomial kernel
In this exercise you will build an SVM with a quadratic kernel (polynomial of degree 2) for the radially separable dataset you created earlier in this chapter. You will then calculate the training and test accuracies and create a plot of the model using the built-in plot() function. The training and test datasets are available in the dataframes trainset and testset, and the e1071 library has been preloaded.
#build svm model with a quadratic (degree 2) polynomial kernel
svm_model <- 
    svm(y ~ ., data = trainset, type = "C-classification", 
        kernel = "polynomial", degree = 2)
#measure training and test accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)## [1] 0.9692308
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)## [1] 0.9466667
#plot
plot(svm_model, trainset)
Well done! The decision boundary using default parameters looks good.
3.4 Tuning SVMs
3.4.1 Using tune.svm()
This exercise will give you hands-on practice with using the tune.svm() function. You will use it to obtain the optimal values for the cost, gamma, and coef0
 parameters for an SVM model based on the radially separable dataset you
 created earlier in this chapter. The training data is available in the 
dataframe trainset, the test data in testset, and the e1071 library has been preloaded for you. Remember that the class variable y is stored in the third column of the trainset and testset.
Also recall that in the video, Kailash used cost = 10^(1:3) to get a range of the cost parameter from 10 (10^1) to 1000 (10^3) in multiples of 10.
Set parameter search ranges as follows:
- cost: from 0.1 (10^(-1)) to 100 (10^2) in multiples of 10.
- gamma and coef0: one of the following values: 0.1, 1 and 10.
#tune model
tune_out <- 
    tune.svm(x = trainset[, -3], y = trainset[, 3], 
             type = "C-classification", 
             kernel = "polynomial", degree = 2, cost = 10^(-1:2), 
             gamma = c(0.1, 1, 10), coef0 = c(0.1, 1, 10))
#list optimal values
tune_out$best.parameters$cost## [1] 10
tune_out$best.parameters$gamma## [1] 10
tune_out$best.parameters$coef0## [1] 1
Well done! You have obtained the optimal parameters for the specified parameter ranges.
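The tune object holds more than the best parameters. If you want to see how the other combinations fared, the cross-validation results for the whole grid are stored in tune_out$performances; a quick sketch to list the best few combinations by error:
#inspect cross-validation error across the parameter grid, best combinations first
perf <- tune_out$performances
head(perf[order(perf$error), ])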
3.4.2 Building and visualizing the tuned model
In the final exercise of this chapter, you will build a polynomial SVM 
using the optimal values of the parameters that you obtained from tune.svm() in the previous exercise. You will then calculate the training and test accuracies and visualize the model using the built-in plot() function. The e1071 library has been preloaded and the test and training datasets are available in the dataframes trainset and testset. The output of tune.svm() is available in the variable tune_out.
Build the model using the optimal values of cost, gamma, and coef0 obtained from tune.svm().
#Build tuned model
svm_model <- svm(y~ ., data = trainset, type = "C-classification", 
                 kernel = "polynomial", degree = 2, 
                 cost = tune_out$best.parameters$cost, 
                 gamma = tune_out$best.parameters$gamma, 
                 coef0 = tune_out$best.parameters$coef0)
#Calculate training and test accuracies
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)## [1] 1
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)## [1] 0.9866667
#plot model
plot(svm_model, trainset)
Excellent! Tuning the parameters has given us a considerably better accuracy.
4 Radial Basis Function Kernels
4.1 Generating a complex dataset
4.1.1 Generating a complex dataset - part 1
In this exercise you will create a dataset that has two attributes x1 and x2, with x1 normally distributed (mean = -0.5, sd = 1) and x2 uniformly distributed in (-1, 1).
Create a dataframe df with 1000 points (x1, x2) distributed as follows:
#number of data points
n <- 1000
x1: normally distributed with mean = -0.5 and standard deviation 1.
#set seed
set.seed(1)
x2: uniformly distributed in (-1, 1).
#create dataframe
df <- data.frame(x1 = rnorm(n, mean = -0.5, sd = 1), 
                 x2 = runif(n, min = -1, max = 1))
Excellent! Now let’s create a complex decision boundary.
4.1.2 Generating a complex dataset - part 2
In this exercise, you will create a decision boundary for the dataset 
you created in the previous exercise. The boundary consists of two 
circles of radius 0.8 units with centers at (x1 = -0.8, x2 = 0) and (x1 = 0.8, x2 = 0) that just touch each other at the origin. Define a binary classification variable y such that points that lie within either of the circles have y = -1 and those that lie outside both circles have y = 1.
The dataset created in the previous exercise is available in the dataframe df.
#set radius and centers 
radius <- 0.8
center_1 <- c(-0.8, 0)
center_2 <- c(0.8, 0)
radius_squared <- radius^2
Create a column in df containing the binary classification variable y.
#create binary classification variable 
df$y <- factor(ifelse((df$x1-center_1[1])^2 + (df$x2-center_1[2])^2 < radius_squared|
                      (df$x1-center_2[1])^2 + (df$x2-center_2[2])^2 < radius_squared, -1, 1),
                      levels = c(-1, 1))
Well done! Now let’s visualize the decision boundary.
4.1.3 Visualizing the dataset
In this exercise you will use ggplot() to visualise the complex dataset you created in the previous exercises. The dataset is available in the dataframe df. You are not required to visualize the decision boundary.
Here you will use coord_equal() to give the x and y axes 
the same physical representation on the plot, making the circles appear 
as circles rather than ellipses.
Plot x2 vs. x1, assigning y to the color aesthetics parameter.
# Load ggplot2
library(ggplot2)
Add the appropriate geom_ function for a scatter plot.
Add coord_equal() without arguments.
# Plot x2 vs. x1, colored by y
scatter_plot <- ggplot(data = df, aes(x = x1, y = x2 , color = y)) + 
    # Add a point layer
    geom_point() + 
    scale_color_manual(values = c("red", "blue")) +
    # Specify equal coordinates
    coord_equal()
 
scatter_plot
Excellent! In the next lesson we will see how linear and quadratic kernels perform against this dataset.
4.2 Motivating the RBF kernel
4.2.1 Linear SVM for complex dataset
In this exercise you will build a default cost linear SVM for the 
complex dataset you created in the first lesson of this chapter. You 
will also calculate the training and test accuracies and plot the 
classification boundary against the test dataset. The e1071 library has been loaded, and test and training datasets have been created for you and are available in the data frames trainset and testset.
- Build a linear SVM using the default value of cost.
 
#build model
svm_model<- 
    svm(y ~ ., data = trainset, type = "C-classification", 
        kernel = "linear")Calculate training and test accuracies.
#accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)## [1] 0.6153846
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)## [1] 0.5866667
Plot decision boundary against the test data.
#plot model against testset
plot(svm_model, testset)
Nice work! As expected, the accuracy is poor and the plot clearly shows why a linear boundary will never work well.
4.2.2 Quadratic SVM for complex dataset
In this exercise you will build a default quadratic (polynomial, degree = 2) kernel SVM for the complex dataset you created in the first lesson of this chapter. You will also calculate the training and test accuracies and plot the classification boundary against the training
dataset. The e1071 library has been loaded, and test and training datasets have been created for you and are available in the data frames trainset and testset.
- Build a polynomial SVM of degree 2 using the default parameters.
 
#build model
svm_model<- 
    svm(y ~ ., data = trainset, type = "C-classification", 
        kernel = "polynomial", degree = 2)Calculate training and test accuracies.
#accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)## [1] 0.9692308
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)## [1] 0.9466667
Plot decision boundary against the training data.
#plot model
plot(svm_model, trainset)
Well done! The model accuracy is not too bad, but the plot shows that it is impossible to capture the figure of 8 shape of the actual boundary using a degree 2 polynomial.
4.3 The RBF Kernel
4.3.1 Polynomial SVM on a complex dataset
Calculate the average accuracy for a degree 2 polynomial
 kernel SVM using 100 different training/test partitions of the complex 
dataset you generated in the first lesson of this chapter. Use default 
settings for the parameters. The e1071 library has been preloaded and the dataset is available in the dataframe df. Use random 80/20 splits of the data in df when creating training and test datasets for each iteration.
- Create a vector to hold accuracies for each step.
 
#create vector to store accuracies and set random number seed
accuracy <- rep(NA, 100)
set.seed(2)
Create training/test datasets, build default cost polynomial SVMs of degree 2, and calculate the test accuracy for each iteration.
#calculate accuracies for 100 training/test partitions
for (i in 1:100){
    df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)
    trainset <- df[df$train == 1, ]
    testset <- df[df$train == 0, ]
    trainColNum <- grep("train", names(trainset))
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]
    svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$y)
}
Compute the average accuracy and standard deviation over all iterations.
#print average accuracy and standard deviation
mean(accuracy)## [1] 0.804765
sd(accuracy)## [1] 0.02398396
Nice work! Please note down the average accuracy and standard deviation. We’ll compare these to the default RBF kernel SVM next.
4.3.2 RBF SVM on a complex dataset
Calculate the average accuracy for a RBF kernel SVM 
using 100 different training/test partitions of the complex dataset you 
generated in the first lesson of this chapter. Use default settings for 
the parameters. The e1071 library has been preloaded and the dataset is available in the dataframe df. Use random 80/20 splits of the data in df when creating training and test datasets for each iteration.
- Create a vector of length 100 to hold accuracies for each step.
 
#create vector to store accuracies and set random number seed
accuracy <- rep(NA, 100)
set.seed(2)
Create training/test datasets, build RBF SVMs with default settings for all parameters, and calculate the test accuracy for each iteration.
#calculate accuracies for 100 training/test partitions
for (i in 1:100){
    df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)
    trainset <- df[df$train == 1, ]
    testset <- df[df$train == 0, ]
    trainColNum <- grep("train", names(trainset))
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]
    svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial")
    pred_test <- predict(svm_model, testset)
    accuracy[i] <- mean(pred_test == testset$y)
}
Compute the average accuracy and standard deviation over all iterations.
#print average accuracy and standard deviation
mean(accuracy)## [1] 0.9034203
sd(accuracy)## [1] 0.01786378
Well done! Note that the average accuracy is almost 10% better than the one obtained in the previous exercise (polynomial kernel of degree 2).
4.3.3 Tuning an RBF kernel SVM
In this exercise you will build a tuned RBF kernel SVM for the given training dataset (available in dataframe trainset) and calculate the accuracy on the test dataset (available in dataframe testset). You will then plot the tuned decision boundary against the test dataset.
- Use tune.svm() to build a tuned RBF kernel SVM.
#tune model
tune_out <- tune.svm(x = trainset[, -3], y = trainset[, 3], 
                     gamma = 5*10^(-2:2), 
                     cost = c(0.01, 0.1, 1, 10, 100), 
                     type = "C-classification", kernel = "radial")
Rebuild SVM using optimal values of cost and gamma.
#build tuned model
svm_model <- svm(y~ ., data = trainset, type = "C-classification", kernel = "radial", 
                 cost = tune_out$best.parameters$cost, 
                 gamma = tune_out$best.parameters$gamma)
Calculate the accuracy of your model using the test dataset.
#calculate test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)## [1] 0.9891892
Plot the decision boundary against testset.
#Plot decision boundary against test data
plot(svm_model, testset)
Well done! That’s it for this course. I hope you found it useful.