Support Vector Machines in R
Kailash Awati - DataCamp
Course Description
This course will introduce a powerful classifier, the support vector machine (SVM), using an intuitive, visual approach. Support Vector Machines in R will help students develop an understanding of the SVM model as a classifier and gain practical experience using R’s libsvm implementation from the e1071 package. Along the way, students will gain an intuitive understanding of important concepts, such as hard and soft margins, the kernel trick, different types of kernels, and how to tune SVM parameters. Get ready to classify data with this impressive model.
1 Introduction
1.1 Sugar content of soft drinks
1.1.1 Visualizing a sugar content dataset
In this exercise, you will create a 1-dimensional scatter plot of 25 soft drink sugar content measurements. The aim is to visualize distinct clusters in the dataset as a first step towards identifying candidate decision boundaries.
The dataset with 25 sugar content measurements is stored in the sugar_content column of the data frame df, which has been preloaded for you.
- Load the ggplot2 package.
# Load ggplot2
library(ggplot2)
- Print the names of the columns in df.
library(tidyverse)
df <- data.frame(sugar_content = c(10.9, 10.9, 10.6, 10, 8, 8.2, 8.6, 10.9, 10.7, 8, 7.7, 7.8, 8.4, 11.5, 11.2, 8.9, 8.7, 7.4, 10.9, 10, 11.4, 10.8, 8.5, 8.2, 10.6)) %>%
  mutate(sample. = row_number())

# Print variable names
colnames(df)
## [1] "sugar_content" "sample."
- Using the df dataset, plot the sugar content of samples along the x-axis (at y equal to zero).
# Plot sugar content along the x-axis
plot_df <- ggplot(data = df, aes(x = sugar_content, y = 0)) +
  geom_point() +
  geom_text(aes(label = sugar_content), size = 2.5, vjust = 2, hjust = 0.5)
- Run the ggplot() code to display the sugar content in df as a scatter plot. Can you spot two distinct clusters corresponding to high and low sugar content samples?
# Display plot
plot_df
Nice work! Notice the gap between 9 and 10. Samples with sugar content below 9 form a “low sugar” cluster, and samples above 10 form a “high sugar” cluster.
1.1.2 Identifying decision boundaries
Based on the plot you created in the previous exercise (reproduced on the right), which of the following points is not a legitimate decision boundary?
- 9 g/100 ml
- 9.1 g/100 ml
- 9.8 g/100 ml
- 8.9 g/100 ml
That’s correct! 8.9 g/100 ml is not a legitimate decision boundary because it lies within the low sugar content cluster.
1.1.3 Find the maximal margin separator
Recall that the dataset we are working with consists of measurements of sugar content of 25 randomly chosen samples of two soft drinks, one regular and the other reduced sugar. In one of our earlier plots, we identified two distinct clusters (classes). A dataset in which the classes do not overlap is called separable, the classes being separated by a decision boundary. The maximal margin separator is the decision boundary that is furthest from both classes. It is located at the mean of the relevant extreme points from each class. In this case the relevant points are the highest valued point in the low sugar content class and the lowest valued point in the high sugar content class. This exercise asks you to find the maximal margin separator for the sugar content dataset.
- Find the maximal margin separator and assign it to the variable mm_separator.
#The maximal margin separator is at the midpoint of the two extreme points in each cluster.
mm_separator <- (8.9 + 10)/2
Well done! We’ll visualize the separator in the next exercise.
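As an aside, the same separator can be derived programmatically instead of reading the extreme points off the plot. The sketch below is not part of the course; it assumes df is the data frame created above, and the 9.5 split point is simply a convenient value inside the gap between the two clusters.
# Sketch: derive the separator from the data (9.5 is an assumed split point inside the gap)
low_max  <- max(df$sugar_content[df$sugar_content < 9.5])   # highest "low sugar" value (8.9)
high_min <- min(df$sugar_content[df$sugar_content >= 9.5])  # lowest "high sugar" value (10)
(low_max + high_min) / 2                                    # same value as mm_separator: 9.45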
1.1.4 Visualize the maximal margin separator
In this exercise you will add the maximal margin separator to the scatter plot you created in an earlier exercise. The plot has been reproduced on the right.
- Create a data frame named separator containing the maximal margin separator, which is available in the variable mm_separator (enter mm_separator to see its value).
#create data frame containing the maximal margin separator
separator <- data.frame(sep = mm_separator)

#add separator to sugar content scatter plot
plot_sep <- plot_df + geom_point(data = separator, aes(x = sep, y = 0), color = "blue", size = 4)

#display plot
plot_sep
Well done! It should be clear from the plot that the blue point is the best possible separator. Why?
1.2 Linearly separable dataset
1.2.1 Generate a 2d uniformly distributed dataset.
The aim of this lesson is to create a dataset that will be used to illustrate the basic principles of support vector machines. In this exercise we will take the first step, which is to create a 2-dimensional uniformly distributed dataset containing 600 data points.
- Set a seed and assign the number of data points, 600, to the variable n.
#set seed
set.seed(42)
#set number of data points.
n <- 600
- Create a dataframe df with two uniformly distributed variables, x1 and x2, lying in (0, 1).
#Generate data frame with two uniformly distributed predictors lying between 0 and 1.
df <- data.frame(x1 = runif(n),
                 x2 = runif(n))
Good work. Next we’ll divide the dataset into two classes that are separated by a linear decision boundary.
1.2.2 Create a decision boundary
The dataset you created in the previous exercise is available to you in the dataframe df
(recall that it consists of two uniformly distributed variables x1 and
x2, lying between 0 and 1). In this exercise you will add a class
variable to that dataset. You will do this by creating a variable y
whose value is -1 or +1 depending on whether the point (x1, x2)
lies below or above the straight line that passes through the origin and has slope 1.4.
- Create a new variable y in the dataframe df with the following specs:
  - y = -1 if x2 < 1.4*x1
  - y = 1 if x2 > 1.4*x1
#classify data points depending on location
df$y <- factor(ifelse(df$x2 - 1.4*df$x1 < 0, -1, 1),
               levels = c(-1, 1))
Nice work. Next we’ll introduce a margin in the dataset and visualize it.
1.2.3 Introduce a margin in the dataset
Your final task for Chapter 1 is to create a margin in the dataset that
you generated in the previous exercise and then display the margin in a
plot. The ggplot2
library has been preloaded for you.
Recall that the slope of the linear decision boundary you created in the
previous exercise is 1.4.
- Create a margin delta of 0.07 units in your dataset.
#set margin
delta <- 0.07

# retain only those points that lie outside the margin
df1 <- df[abs(1.4*df$x1 - df$x2) > delta, ]
#build plot
plot_margins <- ggplot(data = df1, aes(x = x1, y = x2, color = y)) +
  geom_point() +
  scale_color_manual(values = c("red", "blue")) +
  geom_abline(slope = 1.4, intercept = 0) +
  geom_abline(slope = 1.4, intercept = delta, linetype = "dashed") +
  geom_abline(slope = 1.4, intercept = -delta, linetype = "dashed")
#display plot
plot_margins
Nice work! We will use this dataset to learn about linear support vector machines in the next chapter.
2 Linear Kernels
2.1 Linear SVM
2.1.1 Creating training and test datasets
Splitting a dataset into training and test sets is an important step in building and testing a classification model. The training set is used to build the model and the test set to evaluate its predictive accuracy.
In this exercise, you will split the dataset you created in the previous
chapter into training and test sets. The dataset has been loaded in the
dataframe df
and a seed has already been set to ensure reproducibility.
- Create a new column called train in df and randomly assign 80% of the rows in df a value of 1 for this column (and the remaining rows a value of 0).
#split train and test data in an 80/20 proportion
"train"] <- ifelse(runif(nrow(df))<0.8, 1, 0) df[,
- Assign the rows with train == 1 to the dataframe trainset and those with train == 0 to the dataframe testset.
#assign training rows to data frame trainset
trainset <- df[df$train == 1, ]

#assign test rows to data frame testset
testset <- df[df$train == 0, ]
- Remove the train column from the training and test datasets by index.
#find index of "train" column
<- grep("train", names(df))
trainColNum
#remove "train" column from train and test dataset
<- trainset[, -trainColNum]
trainset <- testset[, -trainColNum] testset
Nice work! In the next exercise we will use these datasets to build our first SVM model.
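As an optional sanity check (a sketch, not part of the exercise), you can confirm that the split is roughly 80/20 using the objects created above.
# Proportion of rows that ended up in the training set (should be close to 0.8)
nrow(trainset) / (nrow(trainset) + nrow(testset))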
2.1.2 Building a linear SVM classifier
In this exercise, you will use the svm() function from the e1071 library to build a linear SVM classifier using the training dataset you created in the previous exercise. The training dataset has been loaded for you in the dataframe trainset.
- Load the e1071 library.
library(e1071)
#build svm model, setting required parameters
svm_model <- svm(y ~ .,
                 data = trainset,
                 type = "C-classification",
                 kernel = "linear",
                 scale = FALSE)
Nice work! In the next exercise we will explore the contents of the model.
2.1.3 Exploring the model and calculating accuracy
In this exercise you will explore the contents of the model and
calculate its training and test accuracies. The training and test data
are available in the data frames trainset
and testset
respectively, and the SVM model is stored in the variable svm_model
.
- List the components of your SVM model.
#list components of model
names(svm_model)
## [1] "call" "type" "kernel" "cost"
## [5] "degree" "gamma" "coef0" "nu"
## [9] "epsilon" "sparse" "scaled" "x.scale"
## [13] "y.scale" "nclasses" "levels" "tot.nSV"
## [17] "nSV" "labels" "SV" "index"
## [21] "rho" "compprob" "probA" "probB"
## [25] "sigma" "coefs" "na.action" "fitted"
## [29] "decision.values" "terms"
- List the contents of SV, index, and rho.
#list values of the SV, index and rho
head(svm_model$SV)
## x1 x2
## 7 0.7365883 0.7688522
## 8 0.1346666 0.1639289
## 14 0.2554288 0.3517920
## 19 0.4749971 0.4866429
## 40 0.6117786 0.7146319
## 45 0.4317512 0.5203398
svm_model$index
## [1] 5 6 10 14 30 33 45 47 54 72 83 85 95 97 113 116 117 118
## [19] 132 138 142 146 150 160 164 165 170 175 202 207 214 216 221 225 229 235
## [37] 242 243 253 269 277 279 284 293 296 316 325 332 335 336 338 345 352 366
## [55] 372 383 385 387 388 397 406 414 416 434 435 453 468 485 489 490 492 13
## [73] 27 37 46 50 61 68 70 74 76 81 86 90 93 94 101 107 122 130
## [91] 144 147 156 162 176 178 179 181 183 192 196 197 200 204 205 213 231 241
## [109] 254 255 257 290 297 302 311 312 321 340 354 359 364 371 384 390 391 395
## [127] 398 404 420 421 433 438 452 457 460 464 467 477 481 486
svm_model$rho
## [1] -0.2796884
Calculate the training accuracy of the model.
#compute training accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.9776423
Calculate the test accuracy of the model.
#compute test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.9814815
Excellent! You are now ready for the next lesson in which we’ll visually explore the model.
2.2 Visualizing Linear SVMs
2.2.1 Visualizing support vectors using ggplot
In this exercise you will plot the training dataset you used to build a
linear SVM and mark out the support vectors. The training dataset has
been preloaded for you in the dataframe trainset
and the SVM model is stored in the variable svm_model
.
- Load ggplot2.
#load ggplot
library(ggplot2)
#build scatter plot of training dataset
scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) +
  geom_point() +
  scale_color_manual(values = c("red", "blue"))

#add plot layer marking out the support vectors
layered_plot <- scatter_plot +
  geom_point(data = trainset[svm_model$index, ], aes(x = x1, y = x2), color = "purple", size = 4, alpha = 0.5)
#display plot
layered_plot
Well done! Now let’s add the decision and margin boundaries to the plot.
2.2.2 Visualizing decision & margin bounds using ggplot2
In this exercise, you will add the decision and margin boundaries to the
support vector scatter plot created in the previous exercise. The SVM
model is available in the variable svm_model
and the weight vector has been precalculated for you and is available in the variable w
. The ggplot2
library has also been preloaded.
w <- t(svm_model$coefs) %*% svm_model$SV

#calculate slope and intercept of decision boundary from weight vector and svm model
#(the decision boundary is w1*x1 + w2*x2 - rho = 0, i.e. x2 = rho/w2 - (w1/w2)*x1)
slope_1 <- -w[1]/w[2]
intercept_1 <- svm_model$rho/w[2]
#build scatter plot of training dataset
scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) +
  geom_point() +
  scale_color_manual(values = c("red", "blue"))

#add decision boundary
plot_decision <- scatter_plot + geom_abline(slope = slope_1, intercept = intercept_1)

#add margin boundaries
plot_margins <- plot_decision +
  geom_abline(slope = slope_1, intercept = intercept_1 - 1/w[2], linetype = "dashed") +
  geom_abline(slope = slope_1, intercept = intercept_1 + 1/w[2], linetype = "dashed")
#display plot
plot_margins
Excellent! We’ll now visualize the decision regions and support vectors using the svm plot function.
2.2.3 Visualizing decision & margin bounds using plot()
In this exercise, you will rebuild the SVM model (as a refresher) and use the built-in SVM plot()
function to visualize the decision regions and support vectors. The training data is available in the dataframe trainset
.
#load required library
library(e1071)
#build svm model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification",
                 kernel = "linear", scale = FALSE)
#plot decision boundaries and support vectors for the training data
plot(x = svm_model, data = trainset)
Excellent! We’re now ready for the next lesson in which we’ll learn how to tune linear SVMs.
2.3 Tuning linear SVMs
2.3.1 Tuning a linear SVM
In this exercise you will study the influence of varying cost on the
number of support vectors for linear SVMs. To do this, you will build
two SVMs, one with cost = 1 and the other with cost = 100 and find the
number of support vectors. A model training dataset is available in the
dataframe trainset
.
#build svm model, cost = 1
svm_model_1 <- svm(y ~ .,
                   data = trainset,
                   type = "C-classification",
                   cost = 1,
                   kernel = "linear",
                   scale = FALSE)
#print model details
svm_model_1
##
## Call:
## svm(formula = y ~ ., data = trainset, type = "C-classification",
## cost = 1, kernel = "linear", scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 140
#build svm model, cost = 100
svm_model_100 <- svm(y ~ .,
                     data = trainset,
                     type = "C-classification",
                     cost = 100,
                     kernel = "linear",
                     scale = FALSE)
#print model details
svm_model_100
##
## Call:
## svm(formula = y ~ ., data = trainset, type = "C-classification",
## cost = 100, kernel = "linear", scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 100
##
## Number of Support Vectors: 32
Excellent! The number of support vectors decreases as cost increases because the margin becomes narrower.
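To see the trend directly, here is a small sketch (not part of the exercise) that counts support vectors over a range of costs. It assumes trainset and the e1071 package are loaded as above; tot.nSV is one of the model components listed earlier.
# Count support vectors for a range of costs
costs <- 10^(0:3)
sapply(costs, function(cst) {
  m <- svm(y ~ ., data = trainset, type = "C-classification",
           kernel = "linear", cost = cst, scale = FALSE)
  m$tot.nSV
})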
2.3.2 Visualizing decision boundaries and margins
In the previous exercise you built two linear classifiers for a linearly separable dataset, one with cost = 1
and the other cost = 100
.
In this exercise you will visualize the margins for the two classifiers
on a single plot. The following objects are available for use:
- The training dataset: trainset.
- The cost = 1 and cost = 100 classifiers in svm_model_1 and svm_model_100, respectively.
- The slope and intercept for the cost = 1 classifier, stored in slope_1 and intercept_1.
- The slope and intercept for the cost = 100 classifier, stored in slope_100 and intercept_100.
- Weight vectors for the two costs, stored in w_1 and w_100, respectively.
- A basic scatter plot of the training data, stored in train_plot.
The ggplot2 library has been preloaded.
#weight vector for the cost = 1 model (this is the vector w calculated earlier)
w_1 <- w

train_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) +
  geom_point() +
  scale_color_manual(values = c("red", "blue"))

#add decision boundary and margins for cost = 1 to training data scatter plot
train_plot_with_margins <- train_plot +
  geom_abline(slope = slope_1, intercept = intercept_1) +
  geom_abline(slope = slope_1, intercept = intercept_1 - 1/w_1[2], linetype = "dashed") +
  geom_abline(slope = slope_1, intercept = intercept_1 + 1/w_1[2], linetype = "dashed")
#display plot
train_plot_with_margins
#build svm model, cost = 100
svm_model_100 <- svm(y ~ .,
                     data = trainset,
                     type = "C-classification",
                     cost = 100,
                     kernel = "linear",
                     scale = FALSE)
w_100 <- t(svm_model_100$coefs) %*% svm_model_100$SV

#calculate slope and intercept of decision boundary from weight vector and svm model
slope_100 <- -w_100[1]/w_100[2]
intercept_100 <- svm_model_100$rho/w_100[2]

train_plot_100 <- train_plot_with_margins

#add decision boundary and margins for cost = 100 to training data scatter plot
train_plot_with_margins <- train_plot_100 +
  geom_abline(slope = slope_100, intercept = intercept_100, color = "goldenrod") +
  geom_abline(slope = slope_100, intercept = intercept_100 - 1/w_100[2], linetype = "dashed", color = "goldenrod") +
  geom_abline(slope = slope_100, intercept = intercept_100 + 1/w_100[2], linetype = "dashed", color = "goldenrod")
#display plot
train_plot_with_margins
Well done! The plot clearly shows the effect of increasing the cost on linear classifiers.
2.3.3 When are soft margin classifiers useful?
In this lesson, we looked at an example in which a soft margin linear SVM (low cost, wide margin) had a better accuracy than its hard margin counterpart (high cost, narrow margin). Which of the phrases listed best completes the following statement:
Linear soft margin classifiers are most likely to be useful when:
- Working with a linearly separable dataset.
- Dealing with a dataset that has a highly nonlinear decision boundary.
- Working with a dataset that is almost linearly separable.
2.4 Multiclass problems
2.4.1 A multiclass classification problem
In this exercise, you will use the svm()
function from the e1071
library to build a linear multiclass SVM classifier for a dataset that is known to be perfectly
linearly separable. Calculate the training and test accuracies, and
plot the model using the training data. The training and test datasets
are available in the dataframes trainset
and testset
. Use the default setting for the cost parameter.
- Load the required library and build a default cost linear SVM.
#load library and build svm model
library(e1071)
svm_model <- svm(y ~ ., data = trainset, type = "C-classification",
                 kernel = "linear", scale = FALSE)
Calculate training accuracy.
#compute training accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.9776423
Calculate test accuracy.
#compute test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.9814815
Plot classifier against training data.
#plot
plot(svm_model, trainset)
Well done! The model performs very well even for default settings. The actual separators are lines that pass through the origin at angles of 30 and 60 degrees to the horizontal.
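If you want to check that claim visually, the sketch below (not part of the exercise) overlays lines through the origin at 30 and 60 degrees on the training data. It assumes the multiclass training data is in trainset with predictors x1 and x2 and class column y, and that ggplot2 is loaded.
# Overlay the stated class boundaries: lines through the origin at 30 and 60 degrees
ggplot(trainset, aes(x = x1, y = x2, color = y)) +
  geom_point() +
  geom_abline(slope = tan(pi/6), intercept = 0) +   # 30 degrees
  geom_abline(slope = tan(pi/3), intercept = 0)     # 60 degrees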
2.4.2 Iris redux - a more robust accuracy.
In this exercise, you will build linear SVMs for 100 distinct training/test partitions of the iris dataset. You will then evaluate the performance of your model by calculating the mean accuracy and standard deviation. This procedure, which is quite general, will give you a far more robust measure of model performance than the ones obtained from a single partition.
- For each trial:
- Partition the dataset into training and test sets in a random 80/20 split.
- Build a default cost linear SVM on the training dataset.
- Evaluate the accuracy of your model.
accuracy <- NULL
for (i in 1:100){
  #assign 80% of the data to the training set
  iris[, "train"] <- ifelse(runif(nrow(iris)) < 0.8, 1, 0)
  trainColNum <- grep("train", names(iris))
  trainset <- iris[iris$train == 1, -trainColNum]
  testset <- iris[iris$train == 0, -trainColNum]
  #build model using training data
  svm_model <- svm(Species ~ ., data = trainset,
                   type = "C-classification", kernel = "linear")
  #calculate accuracy on test data
  pred_test <- predict(svm_model, testset)
  accuracy[i] <- mean(pred_test == testset$Species)
}
mean(accuracy)
## [1] 0.9642514
sd(accuracy)
## [1] 0.03151982
Well done! The high accuracy and low standard deviation confirm that the dataset is almost linearly separable.
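If you would like to see the spread of results rather than just summarize it, a quick histogram of the 100 accuracies (a sketch using base R graphics, reusing the accuracy vector from the loop above) works:
# Distribution of test accuracies over the 100 partitions
hist(accuracy, main = "Linear SVM on iris: test accuracy over 100 splits", xlab = "Accuracy")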
3 Polynomial Kernels
3.1 Radially separable dataset
3.1.1 Generating a 2d radially separable dataset
In this exercise you will create a 2d radially separable dataset containing 400 uniformly distributed data points.
- Create a data frame df with:
  - 400 points with variables x1 and x2.
  - x1 and x2 uniformly distributed in (-1, 1).
#set number of data points and seed
n <- 400
set.seed(1)

#Generate data frame with two uniformly distributed predictors, x1 and x2
df <- data.frame(x1 = runif(n, min = -1, max = 1),
                 x2 = runif(n, min = -1, max = 1))

#We want a circular boundary. Set boundary radius
radius <- 0.8
radius_squared <- radius^2
- Create a binary classification variable df$y, which takes value -1 or 1 depending on whether a point lies within or outside the circle.
#create dependent categorical variable, y, with value -1 or 1 depending on whether point lies
#within or outside the circle.
df$y <- factor(ifelse(df$x1^2 + df$x2^2 < radius_squared, -1, 1), levels = c(-1, 1))
Excellent! Now let’s visualize the dataset.
3.1.2 Visualizing the dataset
In this exercise you will use ggplot()
to visualize the dataset you created in the previous exercise. The dataset is available in the dataframe df
. Use color
to distinguish between the two classes.
- Load the ggplot2 library.
#load ggplot
library(ggplot2)
#build scatter plot, distinguish class by color
scatter_plot <- ggplot(data = df, aes(x = x1, y = x2, color = y)) +
  geom_point() +
  scale_color_manual(values = c("red", "blue"))
#display plot
scatter_plot
Nice work! We’ll use this dataset extensively in this chapter.
3.2 Linear SVMs on radial data
3.2.1 Linear SVM for a radially separable dataset
In this exercise you will build two linear SVMs, one for cost = 1
(default) and the other for cost = 100, for the radially separable
dataset you created in the first lesson of this chapter. You will also
calculate the training and test accuracies for both costs. The e1071
library has been loaded, and test and training datasets have been created for you and are available in the data frames trainset
and testset
.
#split train and test data in an 80/20 proportion
"train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)
df[,
#assign training rows to data frame trainset
<- df[df$train == 1, ]
trainset #assign test rows to data frame testset
<- df[df$train == 0, ]
testset
#find index of "train" column
<- grep("train", names(df))
trainColNum
#remove "train" column from train and test dataset
<- trainset[, -trainColNum]
trainset <- testset[, -trainColNum] testset
#default cost model
svm_model_1 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 1, kernel = "linear")
#training accuracy
pred_train <- predict(svm_model_1, trainset)
mean(pred_train == trainset$y)
## [1] 0.5673981
#test accuracy
pred_test <- predict(svm_model_1, testset)
mean(pred_test == testset$y)
## [1] 0.4938272
- Set cost = 100 and repeat.
#cost = 100 model
svm_model_2 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 100, kernel = "linear")
#accuracy
pred_train <- predict(svm_model_2, trainset)
mean(pred_train == trainset$y)
## [1] 0.5673981
pred_test <- predict(svm_model_2, testset)
mean(pred_test == testset$y)
## [1] 0.4938272
Good work! Next, we’ll get a more reliable measure of accuracy for one of the models.
3.2.2 Average accuracy for linear SVM
In this exercise you will calculate the average accuracy for a default
cost linear SVM using 100 different training/test partitions of the
dataset you generated in the first lesson of this chapter. The e1071
library has been preloaded and the dataset is available in the dataframe df
. Use random 80/20 splits of the data in df
when creating training and test datasets for each iteration.
- Create a vector to hold accuracies for each step.
# Create vector to store accuracies and set random number seed
accuracy <- rep(NA, 100)
set.seed(2)
Create training / test datasets, build default cost SVMs and calculate the test accuracy for each iteration.
# Calculate accuracies for 100 training/test partitions
for (i in 1:100){
  df[, "train"] <- ifelse(runif(nrow(df)) < 0.8, 1, 0)
  trainset <- df[df$train == 1, ]
  testset <- df[df$train == 0, ]
  trainColNum <- grep("train", names(trainset))
  trainset <- trainset[, -trainColNum]
  testset <- testset[, -trainColNum]
  svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear")
  pred_test <- predict(svm_model, testset)
  accuracy[i] <- mean(pred_test == testset$y)
}
Compute the average accuracy and standard deviation over all iterations.
# Print average accuracy and its standard deviation
mean(accuracy)
## [1] 0.5554571
sd(accuracy)
## [1] 0.04243524
3.3 The kernel trick
3.3.1 Visualizing transformed radially separable data
In this exercise you will transform the radially separable dataset you created earlier in this chapter and visualize it in the x1^2-x2^2
plane. As a reminder, the separation boundary for the data is the circle x1^2 + x2^2 = 0.64
(radius = 0.8 units). The dataset has been loaded for you in the dataframe df
.
#transform data
df1 <- data.frame(x1sq = df$x1^2, x2sq = df$x2^2, y = df$y)
#plot data points in the transformed space
plot_transformed <- ggplot(data = df1, aes(x = x1sq, y = x2sq, color = y)) +
  geom_point() +
  guides(color = "none") +
  scale_color_manual(values = c("red", "blue"))
#add decision boundary and visualize
plot_decision <- plot_transformed + geom_abline(slope = -1, intercept = 0.64)
plot_decision
Excellent! As expected, the data is linearly separable in the x1^2-x2^2 plane.
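As a quick numerical check (a sketch, not part of the course), fitting a linear SVM in the transformed space should give a near-perfect fit, confirming what the plot shows. It assumes df1 from the chunk above and the e1071 package.
# Linear SVM on the squared coordinates
svm_transformed <- svm(y ~ x1sq + x2sq, data = df1,
                       type = "C-classification", kernel = "linear")
mean(predict(svm_transformed, df1) == df1$y)   # expect a value close to 1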
3.3.2 SVM with polynomial kernel
In this exercise you will build an SVM with a quadratic kernel (polynomial of degree 2) for the radially separable dataset you created earlier in this chapter. You will then calculate the training and test accuracies and create a plot of the model using the built-in plot()
function. The training and test datasets are available in the dataframes trainset
and testset
, and the e1071
library has been preloaded.
svm_model <- svm(y ~ ., data = trainset, type = "C-classification",
                 kernel = "polynomial", degree = 2)
#measure training and test accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.9692308
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.9466667
#plot
plot(svm_model, trainset)
Well done! The decision boundary using default parameters looks good.
3.4 Tuning SVMs
3.4.1 Using tune.svm()
This exercise will give you hands-on practice with using the tune.svm()
function. You will use it to obtain the optimal values for the cost
, gamma
, and coef0
parameters for an SVM model based on the radially separable dataset you
created earlier in this chapter. The training data is available in the
dataframe trainset
, the test data in testset
, and the e1071
library has been preloaded for you. Remember that the class variable y
is stored in the third column of the trainset
and testset
.
Also recall that in the video, Kailash used cost = 10^(1:3) to get a range of the cost parameter from 10 (= 10^1) to 1000 (= 10^3) in multiples of 10.
- Set parameter search ranges as follows:
  - cost: from 0.1 (10^(-1)) to 100 (10^2) in multiples of 10.
  - gamma and coef0: one of the following values: 0.1, 1, and 10.
#tune model
tune_out <- tune.svm(x = trainset[, -3], y = trainset[, 3],
                     type = "C-classification",
                     kernel = "polynomial", degree = 2, cost = 10^(-1:2),
                     gamma = c(0.1, 1, 10), coef0 = c(0.1, 1, 10))
#list optimal values
tune_out$best.parameters$cost
## [1] 10
tune_out$best.parameters$gamma
## [1] 10
tune_out$best.parameters$coef0
## [1] 1
Well done! You have obtained the optimal parameters for the specified parameter ranges.
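Optionally, you can look behind the chosen values: the object returned by tune.svm() also records the cross-validation error for every parameter combination. A short sketch:
# Cross-validation error of the best combination, and the full grid of results
tune_out$best.performance
summary(tune_out)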
3.4.2 Building and visualizing the tuned model
In the final exercise of this chapter, you will build a polynomial SVM
using the optimal values of the parameters that you obtained from tune.svm()
in the previous exercise. You will then calculate the training and test accuracies and visualize the model using the built-in plot() function
. The e1071
library has been preloaded and the test and training datasets are available in the dataframes trainset
and testset
. The output of tune.svm()
is available in the variable tune_out
.
- Build the tuned model using the optimal values of cost, gamma, and coef0 obtained from tune.svm().
#Build tuned model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification",
                 kernel = "polynomial", degree = 2,
                 cost = tune_out$best.parameters$cost,
                 gamma = tune_out$best.parameters$gamma,
                 coef0 = tune_out$best.parameters$coef0)
#Calculate training and test accuracies
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 1
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.9866667
#plot model
plot(svm_model, trainset)
Excellent! Tuning the parameters has given us a considerably better accuracy.
4 Radial Basis Function Kernels
4.1 Generating a complex dataset
4.1.1 Generating a complex dataset - part 1
In this exercise you will create a dataset that has two attributes x1
and x2
, with x1
normally distributed (mean = -0.5, sd = 1) and x2
uniformly distributed in (-1, 1).
- Create a dataframe df with 1000 points (x1, x2) distributed as follows:
#number of data points
n <- 1000
- x1: normally distributed with mean = -0.5 and standard deviation 1.
#set seed
set.seed(1)
- x2: uniformly distributed in (-1, 1).
#create dataframe
df <- data.frame(x1 = rnorm(n, mean = -0.5, sd = 1),
                 x2 = runif(n, min = -1, max = 1))
Excellent! Now let’s create a complex decision boundary.
4.1.2 Generating a complex dataset - part 2
In this exercise, you will create a decision boundary for the dataset you created in the previous exercise. The boundary consists of two circles of radius 0.8 units with centers at (x1 = -0.8, x2 = 0) and (x1 = 0.8, x2 = 0) that just touch each other at the origin. Define a binary classification variable y such that points that lie within either of the circles have y = -1 and those that lie outside both circles have y = 1.
The dataset created in the previous exercise is available in the dataframe df
.
#set radius and centers
radius <- 0.8
center_1 <- c(-0.8, 0)
center_2 <- c(0.8, 0)
radius_squared <- radius^2
- Create a column in df containing the binary classification variable y.
#create binary classification variable
df$y <- factor(ifelse((df$x1 - center_1[1])^2 + (df$x2 - center_1[2])^2 < radius_squared |
                      (df$x1 - center_2[1])^2 + (df$x2 - center_2[2])^2 < radius_squared, -1, 1),
               levels = c(-1, 1))
Well done! Now let’s visualize the decision boundary.
4.1.3 Visualizing the dataset
In this exercise you will use ggplot()
to visualise the complex dataset you created in the previous exercises. The dataset is available in the dataframe df
. You are not required to visualize the decision boundary.
Here you will use coord_equal()
to give the x and y axes
the same physical representation on the plot, making the circles appear
as circles rather than ellipses.
- Plot x2 vs. x1, distinguishing the two classes by color via the aesthetics parameter.
# Load ggplot2
library(ggplot2)
- Add the appropriate geom_ function for a scatter plot.
- Specify equal coordinates using coord_equal() without arguments.
# Plot x2 vs. x1, colored by y
scatter_plot <- ggplot(data = df, aes(x = x1, y = x2, color = y)) +
  # Add a point layer
  geom_point() +
  scale_color_manual(values = c("red", "blue")) +
  # Specify equal coordinates
  coord_equal()

scatter_plot
Excellent! In the next lesson we will see how linear and quadratic kernels perform against this dataset.
4.2 Motivating the RBF kernel
4.2.1 Linear SVM for complex dataset
In this exercise you will build a default cost linear SVM for the
complex dataset you created in the first lesson of this chapter. You
will also calculate the training and test accuracies and plot the
classification boundary against the test dataset. The e1071
library has been loaded, and test and training datasets have been created for you and are available in the data frames trainset
and testset
.
- Build a linear SVM using the default value of cost.
#build model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification",
                 kernel = "linear")
Calculate training and test accuracies.
#accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.6153846
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.5866667
Plot decision boundary against the test data.
#plot model against testset
plot(svm_model, testset)
Nice work! As expected, the accuracy is poor and the plot clearly shows why a linear boundary will never work well.
4.2.2 Quadratic SVM for complex dataset
In this exercise you will build a default quadratic (polynomial, degree = 2) SVM for the complex dataset you created in the first lesson of this chapter. You will also calculate the training and test accuracies and plot the classification boundary against the training dataset. The e1071
library has been loaded, and test and training datasets have been created for you and are available in the data frames trainset
and testset
.
- Build a polynomial SVM of degree 2 using the default parameters.
#build model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification",
                 kernel = "polynomial", degree = 2)
Calculate training and test accuracies.
#accuracy
pred_train <- predict(svm_model, trainset)
mean(pred_train == trainset$y)
## [1] 0.9692308
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.9466667
Plot decision boundary against the training data.
#plot model
plot(svm_model, trainset)
Well done! The model accuracy is not too bad, but the plot shows that it is impossible to capture the figure of 8 shape of the actual boundary using a degree 2 polynomial.
4.3 The RBF Kernel
4.3.1 Polynomial SVM on a complex dataset
Calculate the average accuracy for a degree 2 polynomial
kernel SVM using 100 different training/test partitions of the complex
dataset you generated in the first lesson of this chapter. Use default
settings for the parameters. The e1071
library has been preloaded and the dataset is available in the dataframe df
. Use random 80/20 splits of the data in df
when creating training and test datasets for each iteration.
- Create a vector to hold accuracies for each step.
#create vector to store accuracies and set random number seed
accuracy <- rep(NA, 100)
set.seed(2)
Create training/test datasets, build default cost polynomial SVMs of degree 2, and calculate the test accuracy for each iteration.
#calculate accuracies for 100 training/test partitions
for (i in 1:100){
  df[, "train"] <- ifelse(runif(nrow(df)) < 0.8, 1, 0)
  trainset <- df[df$train == 1, ]
  testset <- df[df$train == 0, ]
  trainColNum <- grep("train", names(trainset))
  trainset <- trainset[, -trainColNum]
  testset <- testset[, -trainColNum]
  svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)
  pred_test <- predict(svm_model, testset)
  accuracy[i] <- mean(pred_test == testset$y)
}
Compute the average accuracy and standard deviation over all iterations.
#print average accuracy and standard deviation
mean(accuracy)
## [1] 0.804765
sd(accuracy)
## [1] 0.02398396
Nice work! Please note down the average accuracy and standard deviation. We’ll compare these to the default RBF kernel SVM next.
4.3.2 RBF SVM on a complex dataset
Calculate the average accuracy for a RBF kernel SVM
using 100 different training/test partitions of the complex dataset you
generated in the first lesson of this chapter. Use default settings for
the parameters. The e1071
library has been preloaded and the dataset is available in the dataframe df
. Use random 80/20 splits of the data in df
when creating training and test datasets for each iteration.
- Create a vector of length 100 to hold accuracies for each step.
#create vector to store accuracies and set random number seed
accuracy <- rep(NA, 100)
set.seed(2)
Create training/test datasets, build RBF SVMs with default settings for all parameters and calculate the test accuracy for each iteration.
#calculate accuracies for 100 training/test partitions
for (i in 1:100){
  df[, "train"] <- ifelse(runif(nrow(df)) < 0.8, 1, 0)
  trainset <- df[df$train == 1, ]
  testset <- df[df$train == 0, ]
  trainColNum <- grep("train", names(trainset))
  trainset <- trainset[, -trainColNum]
  testset <- testset[, -trainColNum]
  svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial")
  pred_test <- predict(svm_model, testset)
  accuracy[i] <- mean(pred_test == testset$y)
}
Compute the average accuracy and standard deviation over all iterations.
#print average accuracy and standard deviation
mean(accuracy)
## [1] 0.9034203
sd(accuracy)
## [1] 0.01786378
Well done! Note that the average accuracy is almost 10% better than the one obtained in the previous exercise (polynomial kernel of degree 2).
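Before tuning, it helps to recall what gamma does. The RBF kernel is K(u, v) = exp(-gamma * ||u - v||^2), so larger gamma values make each support vector's influence more local. The sketch below is illustrative only (not part of the course) and assumes ggplot2 is loaded; it plots the kernel value against distance for a few gammas.
# RBF kernel value as a function of the distance between two points
d <- seq(0, 3, length.out = 200)
rbf <- function(gamma) exp(-gamma * d^2)
df_k <- data.frame(distance = rep(d, 3),
                   K = c(rbf(0.1), rbf(1), rbf(10)),
                   gamma = factor(rep(c(0.1, 1, 10), each = length(d))))
ggplot(df_k, aes(x = distance, y = K, color = gamma)) + geom_line()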
4.3.3 Tuning an RBF kernel SVM
In this exercise you will build a tuned RBF kernel SVM for the given training dataset (available in dataframe trainset
) and calculate the accuracy on the test dataset (available in dataframe testset
). You will then plot the tuned decision boundary against the test dataset.
- Use tune.svm() to build a tuned RBF kernel SVM.
#tune model
tune_out <- tune.svm(x = trainset[, -3], y = trainset[, 3],
                     gamma = 5*10^(-2:2),
                     cost = c(0.01, 0.1, 1, 10, 100),
                     type = "C-classification", kernel = "radial")
- Rebuild the SVM using the optimal values of cost and gamma.
#build tuned model
svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial",
                 cost = tune_out$best.parameters$cost,
                 gamma = tune_out$best.parameters$gamma)
Calculate the accuracy of your model using the test dataset.
#calculate test accuracy
pred_test <- predict(svm_model, testset)
mean(pred_test == testset$y)
## [1] 0.9891892
- Plot the decision boundary against testset.
#Plot decision boundary against test data
plot(svm_model, testset)
Well done! That’s it for this course. I hope you found it useful.