Support Vector Machines in R

Kailash Awati - DataCamp


Course Description

This course will introduce a powerful classifier, the support vector machine (SVM), using an intuitive, visual approach. Support Vector Machines in R will help students develop an understanding of the SVM model as a classifier and gain practical experience using R’s libsvm implementation from the e1071 package. Along the way, students will gain an intuitive understanding of important concepts, such as hard and soft margins, the kernel trick, different types of kernels, and how to tune SVM parameters. Get ready to classify data with this impressive model.

1 Introduction

1.1 Sugar content of soft drinks

1.1.1 Visualizing a sugar content dataset

In this exercise, you will create a 1-dimensional scatter plot of 25 soft drink sugar content measurements. The aim is to visualize distinct clusters in the dataset as a first step towards identifying candidate decision boundaries.

The dataset with 25 sugar content measurements is stored in the sugar_content column of the data frame df, which has been preloaded for you.

  • Load the ggplot2 package.
  • # Load ggplot2
    library(ggplot2)
  • List the variables in dataframe df.
  • library(tidyverse)
    df <- data.frame(sugar_content = c(10.9, 10.9, 10.6, 10, 8, 8.2, 8.6, 10.9, 10.7, 8, 7.7, 7.8, 8.4,
                                       11.5, 11.2, 8.9, 8.7, 7.4, 10.9, 10, 11.4, 10.8, 8.5, 8.2, 10.6)) %>% 
        mutate(sample. = row_number())
    # Print variable names
    colnames(df)
    ## [1] "sugar_content" "sample."
  • Complete the scatter plot code. Using the df dataset, plot the sugar content of samples along the x-axis (at y equal to zero).
  • # Plot sugar content along the x-axis
    plot_df <- ggplot(data = df, aes(x = sugar_content, y = 0)) + 
        geom_point() + 
        geom_text(aes(label = sugar_content), size = 2.5, vjust = 2, hjust = 0.5)
  • Write ggplot() code to display sugar content in df as a scatter plot. Can you spot two distinct clusters corresponding to high and low sugar content samples?
  • # Display plot
    plot_df

    Nice work! Notice the gap between 9 and 10. Samples with sugar content below 9 form a “low sugar” cluster, and samples above 10 form a “high sugar” cluster.

    1.1.2 Identifying decision boundaries

    Based on the plot you created in the previous exercise (reproduced on the right), which of the following points is not a legitimate decision boundary?

    • 9 g/100 ml
    • 9.1 g/100 ml
    • 9.8 g/100 ml
    • 8.9 g/100 ml

    That’s correct! 8.9 g/100ml is not a legitimate decision boundary as it is part of the lower sugar content cluster.

    1.1.3 Find the maximal margin separator

    Recall that the dataset we are working with consists of measurements of sugar content of 25 randomly chosen samples of two soft drinks, one regular and the other reduced sugar. In one of our earlier plots, we identified two distinct clusters (classes). A dataset in which the classes do not overlap is called separable, the classes being separated by a decision boundary. The maximal margin separator is the decision boundary that is furthest from both classes. It is located at the mean of the relevant extreme points from each class. In this case the relevant points are the highest valued point in the low sugar content class and the lowest valued point in the high sugar content class. This exercise asks you to find the maximal margin separator for the sugar content dataset.

  • Find the maximal margin separator and assign it to the variable mm_separator.
  • #The maximal margin separator is at the midpoint of the two extreme points in each cluster.
    mm_separator <- (8.9+10)/2
  • Use the displayed plot to find the sugar content values of the relevant extremal data points in each class.
  • Well done! We’ll visualize the separator in the next exercise.
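
    As a cross-check, the two relevant extreme points can also be picked out programmatically. A minimal sketch, assuming df from the first exercise and using 9.5 as a rough cut between the two clusters:

    #highest value in the low sugar cluster and lowest value in the high sugar cluster
    low_max <- max(df$sugar_content[df$sugar_content < 9.5])
    high_min <- min(df$sugar_content[df$sugar_content > 9.5])

    #the maximal margin separator is their midpoint
    (low_max + high_min)/2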

    1.1.4 Visualize the maximal margin separator

    In this exercise you will add the maximal margin separator to the scatter plot you created in an earlier exercise. The plot has been reproduced on the right.

  • Create a data frame called separator containing the maximal margin separator. This is available in the variable mm_separator (enter mm_separator to see it).
  • #create data frame containing the maximum margin separator
    separator <- data.frame(sep = mm_separator)
  • Use the data frame created to add the maximal margin separator to the sugar content scatterplot created in the earlier exercise and display the result.
  • plot_ <- plot_df
    #add separator to sugar content scatterplot
    plot_sep <- plot_ + geom_point(data = separator, aes(x = mm_separator, y = 0), color = "blue", size = 4)
    
    #display plot
    plot_sep

    Well done! It should be clear from the plot that the blue point is the best possible separator. Why?

    1.2 Linearly separable dataset

    1.2.1 Generate a 2d uniformly distributed dataset.

The aim of this lesson is to create a dataset that will be used to illustrate the basic principles of support vector machines. In this exercise we will do the first step, which is to create a 2-dimensional uniformly distributed dataset containing 600 data points.

  • Set the number of data points, n.
  • #set seed
    set.seed(42)
    
    #set number of data points. 
    n <- 600
  • Generate a dataframe df with two uniformly distributed variables, x1 and x2 lying in (0, 1).
  • #Generate data frame with two uniformly distributed predictors lying between 0 and 1.
    df <- data.frame(x1 = runif(n), 
                     x2 = runif(n))

    Good work. Next we’ll divide the dataset into two classes that are separated by a linear decision boundary.

    1.2.2 Create a decision boundary

    The dataset you created in the previous exercise is available to you in the dataframe df (recall that it consists of two uniformly distributed variables x1 and x2, lying between 0 and 1). In this exercise you will add a class variable to that dataset. You will do this by creating a variable y whose value is -1 or +1 depending on whether the point (x1, x2) lies below or above the straight line that passes through the origin and has slope 1.4.

  • Create a new column y in the dataframe df with the following specs:
  • y = -1 if x2 < 1.4*x1
  • y = 1 if x2 > 1.4*x1
  • #classify data points depending on location
    df$y <- factor(ifelse(df$x2-1.4*df$x1 < 0, -1, 1), 
        levels = c(-1, 1))

    Nice work. Next we’ll introduce a margin in the dataset and visualize it.

    1.2.3 Introduce a margin in the dataset

    Your final task for Chapter 1 is to create a margin in the dataset that you generated in the previous exercise and then display the margin in a plot. The ggplot2 library has been preloaded for you. Recall that the slope of the linear decision boundary you created in the previous exercise is 1.4.

  • Introduce a margin delta of 0.07 units in your dataset.
  • #set margin
    delta <- 0.07
  • Replot the dataset, displaying the margin boundaries as dashed lines and the decision boundary as a solid line.
  • # retain only those points that lie outside the margin
    df1 <- df[abs(1.4*df$x1 - df$x2) > delta, ]
    
    #build plot
    plot_margins <- ggplot(data = df1, aes(x = x1, y = x2, color = y)) + geom_point() + 
        scale_color_manual(values = c("red", "blue")) + 
        geom_abline(slope = 1.4, intercept = 0)+
        geom_abline(slope = 1.4, intercept = delta, linetype = "dashed") +
        geom_abline(slope = 1.4, intercept = -delta, linetype = "dashed")
     
    #display plot
    plot_margins

    Nice work! We will use this dataset to learn about linear support vector machines in the next chapter.

    2 Linear Kernels

    2.1 Linear SVM

    2.1.1 Creating training and test datasets

    Splitting a dataset into training and test sets is an important step in building and testing a classification model. The training set is used to build the model and the test set to evaluate its predictive accuracy.

    In this exercise, you will split the dataset you created in the previous chapter into training and test sets. The dataset has been loaded in the dataframe df and a seed has already been set to ensure reproducibility.

  • Create a column called train in df and randomly assign 80% of the rows in df a value of 1 for this column (and the remaining rows a value of 0).
  • #split train and test data in an 80/20 proportion
    df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)
  • Assign the rows with train == 1 to the dataframe trainset and those with train == 0 to the dataframe testset.
  • #assign training rows to data frame trainset
    trainset <- df[df$train == 1, ]
    #assign test rows to data frame testset
    testset <- df[df$train == 0, ]
  • Remove train column from training and test datasets by index.
  • #find index of "train" column
    trainColNum <- grep("train", names(df))
    
    #remove "train" column from train and test dataset
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]

    Nice work! In the next exercise we will use these datasets to build our first SVM model.
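
    Note that the runif() approach gives an approximately 80/20 split, since each row lands in the training set with probability 0.8. If an exact split is preferred, one alternative (a sketch, not part of the course code) is to sample row indices directly:

    #sample exactly 80% of the row indices for training; the rest form the test set
    train_rows <- sample(nrow(df), size = round(0.8 * nrow(df)))
    trainset <- df[train_rows, ]
    testset <- df[-train_rows, ]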

    2.1.2 Building a linear SVM classifier

    In this exercise, you will use the svm() function from the e1071 library to build a linear SVM classifier using the training dataset you created in the previous exercise. The training dataset has been loaded for you in the dataframe trainset.

  • Load the e1071 library.
  • library(e1071)
  • Build an SVM model using a linear kernel.
  • Do not scale the variables (this is to allow comparison with the original dataset later).
  • #build svm model, setting required parameters
    svm_model<- svm(y ~ ., 
                    data = trainset, 
                    type = "C-classification", 
                    kernel = "linear", 
                    scale = FALSE)

    Nice work! In the next exercise we will explore the contents of the model.

    2.1.3 Exploring the model and calculating accuracy

    In this exercise you will explore the contents of the model and calculate its training and test accuracies. The training and test data are available in the data frames trainset and testset respectively, and the SVM model is stored in the variable svm_model.

    • List the components of your SVM model.
    #list components of model
    names(svm_model)
    ##  [1] "call"            "type"            "kernel"          "cost"           
    ##  [5] "degree"          "gamma"           "coef0"           "nu"             
    ##  [9] "epsilon"         "sparse"          "scaled"          "x.scale"        
    ## [13] "y.scale"         "nclasses"        "levels"          "tot.nSV"        
    ## [17] "nSV"             "labels"          "SV"              "index"          
    ## [21] "rho"             "compprob"        "probA"           "probB"          
    ## [25] "sigma"           "coefs"           "na.action"       "fitted"         
    ## [29] "decision.values" "terms"

    List the contents of SV, index, and rho.

    #list values of the SV, index and rho
    head(svm_model$SV)
    ##           x1        x2
    ## 7  0.7365883 0.7688522
    ## 8  0.1346666 0.1639289
    ## 14 0.2554288 0.3517920
    ## 19 0.4749971 0.4866429
    ## 40 0.6117786 0.7146319
    ## 45 0.4317512 0.5203398
    svm_model$index
    ##   [1]   5   6  10  14  30  33  45  47  54  72  83  85  95  97 113 116 117 118
    ##  [19] 132 138 142 146 150 160 164 165 170 175 202 207 214 216 221 225 229 235
    ##  [37] 242 243 253 269 277 279 284 293 296 316 325 332 335 336 338 345 352 366
    ##  [55] 372 383 385 387 388 397 406 414 416 434 435 453 468 485 489 490 492  13
    ##  [73]  27  37  46  50  61  68  70  74  76  81  86  90  93  94 101 107 122 130
    ##  [91] 144 147 156 162 176 178 179 181 183 192 196 197 200 204 205 213 231 241
    ## [109] 254 255 257 290 297 302 311 312 321 340 354 359 364 371 384 390 391 395
    ## [127] 398 404 420 421 433 438 452 457 460 464 467 477 481 486
    svm_model$rho
    ## [1] -0.2796884
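
    In brief: SV holds the support vectors themselves, index gives their row positions in the training data, and rho is the constant subtracted in the decision function (so the intercept of the decision function is -rho). A quick consistency check (a sketch, using the fitted model above):

    #the number of stored support vectors should match the number of stored indices
    nrow(svm_model$SV) == length(svm_model$index)

    #total number of support vectors
    svm_model$tot.nSV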

    Calculate the training accuracy of the model.

    #compute training accuracy
    pred_train <- predict(svm_model, trainset)
    mean(pred_train == trainset$y)
    ## [1] 0.9776423

    Calculate the test accuracy of the model.

    #compute test accuracy
    pred_test <- predict(svm_model, testset)
    mean(pred_test == testset$y)
    ## [1] 0.9814815
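
    Beyond a single accuracy number, a confusion matrix shows how any errors are distributed across the two classes. A quick sketch using base R's table():

    #cross-tabulate predicted versus actual classes on the test set
    table(predicted = pred_test, actual = testset$y)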

    Excellent! You are now ready for the next lesson in which we’ll visually explore the model.

    2.2 Visualizing Linear SVMs

    2.2.1 Visualizing support vectors using ggplot

    In this exercise you will plot the training dataset you used to build a linear SVM and mark out the support vectors. The training dataset has been preloaded for you in the dataframe trainset and the SVM model is stored in the variable svm_model.

  • Load ggplot2.
  • #load ggplot
    library(ggplot2)
  • Plot the training dataset.
  • #build scatter plot of training dataset
    scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
        geom_point() + 
        scale_color_manual(values = c("red", "blue"))
  • Mark out the support vectors on the plot using their indices from the SVM model.
  • #add plot layer marking out the support vectors 
    layered_plot <- 
        scatter_plot + geom_point(data = trainset[svm_model$index, ], aes(x = x1, y = x2), color = "purple", size = 4, alpha = 0.5)
    
    #display plot
    layered_plot

    Well done! Now let’s add the decision and margin boundaries to the plot.

    2.2.2 Visualizing decision & margin bounds using ggplot2

    In this exercise, you will add the decision and margin boundaries to the support vector scatter plot created in the previous exercise. The SVM model is available in the variable svm_model and the weight vector has been precalculated for you and is available in the variable w. The ggplot2 library has also been preloaded.

  • Calculate the slope and intercept of the decision boundary.
  • #calculate weight vector from model coefficients and support vectors
    w <- t(svm_model$coefs) %*% svm_model$SV
    #calculate slope and intercept of decision boundary from weight vector and svm model
    slope_1 <- -w[1]/w[2]
    intercept_1 <- svm_model$rho/w[2]
    
    #build scatter plot of training dataset
    scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
        geom_point() + scale_color_manual(values = c("red", "blue"))
  • Add the decision boundary to the plot.
  • Add the margin boundaries to the plot.
  • #add decision boundary 
    plot_decision <- scatter_plot + geom_abline(slope = slope_1, intercept = intercept_1) 
    
    #add margin boundaries
    plot_margins <- plot_decision + 
     geom_abline(slope = slope_1, intercept = intercept_1 - 1/w[2], linetype = "dashed")+
     geom_abline(slope = slope_1, intercept = intercept_1 + 1/w[2], linetype = "dashed")
    
    #display plot
    plot_margins
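
    For reference, with a linear kernel the decision function is w.x - rho, so the boundary w[1]*x1 + w[2]*x2 - rho = 0 rearranges to x2 = rho/w[2] - (w[1]/w[2])*x1, which is where the slope and intercept formulas above come from. A quick sanity check (a sketch, using the objects defined above):

    #a point constructed to lie on the fitted boundary should give a decision value near zero
    x1_check <- 0.5
    x2_check <- intercept_1 + slope_1 * x1_check
    sum(w * c(x1_check, x2_check)) - svm_model$rho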

    Excellent! We’ll now visualize the decision regions and support vectors using the svm plot function.

    2.2.3 Visualizing decision & margin bounds using plot()

    In this exercise, you will rebuild the SVM model (as a refresher) and use the built in SVM plot() function to visualize the decision regions and support vectors. The training data is available in the dataframe trainset.

  • Load the library needed to build an SVM model.
  • #load required library
    library(e1071)
  • Build a linear SVM model using the training data.
  • #build svm model
    svm_model<- 
        svm(y ~ ., data = trainset, type = "C-classification", 
            kernel = "linear", scale = FALSE)
  • Plot the decision regions and support vectors.
  • #plot decision boundaries and support vectors for the training data
    plot(x = svm_model, data = trainset)

    Excellent! We’re now ready for the next lesson in which we’ll learn how to tune linear SVMs.

    2.3 Tuning linear SVMs

    2.3.1 Tuning a linear SVM

    In this exercise you will study the influence of varying cost on the number of support vectors for linear SVMs. To do this, you will build two SVMs, one with cost = 1 and the other with cost = 100 and find the number of support vectors. A model training dataset is available in the dataframe trainset.

  • Build a linear SVM with cost = 1 (default setting).
  • #build svm model, cost = 1
    svm_model_1 <- svm(y ~ .,
                       data = trainset,
                       type = "C-classification",
                       cost = 1,
                       kernel = "linear",
                       scale = FALSE)
  • Print the model to find the number of support vectors.
  • #print model details
    svm_model_1
    ## 
    ## Call:
    ## svm(formula = y ~ ., data = trainset, type = "C-classification", 
    ##     cost = 1, kernel = "linear", scale = FALSE)
    ## 
    ## 
    ## Parameters:
    ##    SVM-Type:  C-classification 
    ##  SVM-Kernel:  linear 
    ##        cost:  1 
    ## 
    ## Number of Support Vectors:  140
  • Build the model again with cost = 100.
  • #build svm model, cost = 100
    svm_model_100 <- svm(y ~ .,
                       data = trainset,
                       type = "C-classification",
                       cost = 100,
                       kernel = "linear",
                       scale = FALSE)
  • Print the model.
  • #print model details
    svm_model_100
    ## 
    ## Call:
    ## svm(formula = y ~ ., data = trainset, type = "C-classification", 
    ##     cost = 100, kernel = "linear", scale = FALSE)
    ## 
    ## 
    ## Parameters:
    ##    SVM-Type:  C-classification 
    ##  SVM-Kernel:  linear 
    ##        cost:  100 
    ## 
    ## Number of Support Vectors:  32

    Excellent! The number of support vectors decreases as cost increases because the margin becomes narrower.
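
    This can be made quantitative: for a linear SVM the margin width is 2/||w||, where w is the weight vector, so a narrower margin corresponds to a larger ||w||. A quick sketch (assuming svm_model_1 and svm_model_100 from above) that computes both widths:

    #weight vectors for the two models
    w_1 <- t(svm_model_1$coefs) %*% svm_model_1$SV
    w_100 <- t(svm_model_100$coefs) %*% svm_model_100$SV

    #margin width = 2/||w||; expect a smaller value for cost = 100
    2/sqrt(sum(w_1^2))
    2/sqrt(sum(w_100^2))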

    2.3.2 Visualizing decision boundaries and margins

    In the previous exercise you built two linear classifiers for a linearly separable dataset, one with cost = 1 and the other cost = 100. In this exercise you will visualize the margins for the two classifiers on a single plot. The following objects are available for use:

    • The training dataset: trainset.
    • The cost = 1 and cost = 100 classifiers in svm_model_1 and svm_model_100, respectively.
    • The slope and intercept for the cost = 1 classifier is stored in slope_1 and intercept_1.
    • The slope and intercept for the cost = 100 classifier is stored in slope_100 and intercept_100.
    • Weight vectors for the two costs are stored in w_1 and w_100, respectively.
    • A basic scatter plot of the training data is stored in train_plot.

    The ggplot2 library has been preloaded.

  • Add the decision boundary and margins for the cost = 1 classifier to the training data plot.
  • #calculate weight vector for the cost = 1 model
    w_1 <- t(svm_model_1$coefs) %*% svm_model_1$SV
    #build basic scatter plot of the training data
    train_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + 
        geom_point() + scale_color_manual(values = c("red", "blue"))
    #add decision boundary and margins for cost = 1 to training data scatter plot
    train_plot_with_margins <- train_plot + 
        geom_abline(slope = slope_1, intercept = intercept_1) +
        geom_abline(slope = slope_1, intercept = intercept_1 - 1/w_1[2], linetype = "dashed") +
        geom_abline(slope = slope_1, intercept = intercept_1 + 1/w_1[2], linetype = "dashed")
  • Display the resulting plot.
  • #display plot
    train_plot_with_margins

  • Add the decision boundary and margins for the cost = 100 classifier to the plot you created in the first step.
  • #build svm model, cost = 100
    svm_model_100 <- svm(y ~ .,
                       data = trainset,
                       type = "C-classification",
                       cost = 100,
                       kernel = "linear",
                       scale = FALSE)
    
    w_100 <- t(svm_model_100$coefs) %*% svm_model_100$SV
    #calculate slope and intercept of decision boundary from weight vector and svm model
    slope_100 <- -w_100[1]/w_100[2]
    intercept_100 <- svm_model_100$rho/w_100[2]
    train_plot_100 <- train_plot_with_margins
    #add decision boundary and margins for cost = 100 to training data scatter plot
    train_plot_with_margins <- train_plot_100 + 
        geom_abline(slope = slope_100, intercept = intercept_100, color = "goldenrod") +
        geom_abline(slope = slope_100, intercept = intercept_100-1/w_100[2], linetype = "dashed", color = "goldenrod")+
        geom_abline(slope = slope_100, intercept = intercept_100+1/w_100[2], linetype = "dashed", color = "goldenrod")
  • Display the final plot showing decision boundaries and margins for both classifiers.
  • #display plot 
    train_plot_with_margins

    Well done! The plot clearly shows the effect of increasing the cost on linear classifiers.

    2.3.3 When are soft margin classifiers useful?

    In this lesson, we looked at an example in which a soft margin linear SVM (low cost, wide margin) had a better accuracy than its hard margin counterpart (high cost, narrow margin). Which of the phrases listed best completes the following statement:

    Linear soft margin classifiers are most likely to be useful when:

    • Working with a linearly separable dataset.
    • Dealing with a dataset that has a highly nonlinear decision boundary.
    • Working with a dataset that is almost linearly separable.

    That’s right! A soft margin linear classifier would work well for a nearly linearly separable dataset.

    2.4 Multiclass problems

    2.4.1 A multiclass classification problem

    In this exercise, you will use the svm() function from the e1071 library to build a linear multiclass SVM classifier for a dataset that is known to be perfectly linearly separable. Calculate the training and test accuracies, and plot the model using the training data. The training and test datasets are available in the dataframes trainset and testset. Use the default setting for the cost parameter.

    • Load the required library and build a default cost linear SVM.
    #load library and build svm model
    library(e1071)
    svm_model<- 
        svm(y ~ ., data = trainset, type = "C-classification", 
            kernel = "linear", scale = FALSE)

    Calculate training accuracy.

    #compute training accuracy
    pred_train <- predict(svm_model, trainset)
    mean(pred_train == trainset$y)
    ## [1] 0.9776423

    Calculate test accuracy.

    #compute test accuracy
    pred_test <- predict(svm_model, testset)
    mean(pred_test == testset$y)
    ## [1] 0.9814815

    Plot classifier against training data.

    #plot
    plot(svm_model, trainset)

    Well done! The model performs very well even for default settings. The actual separators are lines that pass through the origin at angles of 30 and 60 degrees to the horizontal.
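
    For context, libsvm (and hence e1071's svm()) handles multiclass problems with a one-against-one scheme, fitting a binary classifier for every pair of classes and combining them by voting. A quick look at how the fitted model records the classes (a sketch):

    #number of classes and their labels as recorded in the fitted model
    svm_model$nclasses
    svm_model$levels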

    2.4.2 Iris redux - a more robust accuracy.

    In this exercise, you will build linear SVMs for 100 distinct training/test partitions of the iris dataset. You will then evaluate the performance of your model by calculating the mean accuracy and standard deviation. This procedure, which is quite general, will give you a far more robust measure of model performance than the ones obtained from a single partition.

    • For each trial:
      • Partition the dataset into training and test sets in a random 80/20 split.
      • Build a default cost linear SVM on the training dataset.
      • Evaluate the accuracy of your model.
    #create vector to store accuracies
    accuracy <- rep(NA, 100)
    for (i in 1:100){ 
        #assign 80% of the data to the training set
        iris[, "train"] <- ifelse(runif(nrow(iris)) < 0.8, 1, 0)
        trainColNum <- grep("train", names(iris))
        trainset <- iris[iris$train == 1, -trainColNum]
        testset <- iris[iris$train == 0, -trainColNum]
        #build model using training data
        svm_model <- svm(Species~ ., data = trainset, 
                         type = "C-classification", kernel = "linear")
        #calculate accuracy on test data
        pred_test <- predict(svm_model, testset)
        accuracy[i] <- mean(pred_test == testset$Species)
    }
    mean(accuracy)
    ## [1] 0.9642514
    sd(accuracy)
    ## [1] 0.03151982

    Well done! The high accuracy and low standard deviation confirm that the dataset is almost linearly separable.

    3 Polynomial Kernels

    3.1 Radially separable dataset

    3.1.1 Generating a 2d radially separable dataset

    In this exercise you will create a 2d radially separable dataset containing 400 uniformly distributed data points.

  • Generate a data frame df with:
    • 400 points with variables x1 and x2.
    • x1 and x2 uniformly distributed in (-1, 1).
  • #set number of variables and seed
    n <- 400
    set.seed(1)
    
    #Generate data frame with two uniformly distributed predictors, x1 and x2
    df <- data.frame(x1 = runif(n, min = -1, max = 1), 
                     x2 = runif(n, min = -1, max = 1))
  • Introduce a circular boundary of radius 0.8, centred at the origin.
  • #We want a circular boundary. Set boundary radius 
    radius <- 0.8
    radius_squared <- radius^2
  • Create df$y, which takes value -1 or 1 depending on whether a point lies within or outside the circle.
  • #create dependent categorical variable, y, with value -1 or 1 depending on whether point lies
    #within or outside the circle.
    df$y <- factor(ifelse(df$x1^2 + df$x2^2 < radius_squared, -1, 1), levels = c(-1, 1))

    Excellent! Now let’s visualize the dataset.
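
    As an aside, the circle of radius 0.8 covers a fraction pi * 0.8^2 / 4, roughly 0.50, of the square (-1, 1) x (-1, 1), so the two classes should be roughly balanced. A quick check (a sketch):

    #class counts and proportions
    table(df$y)
    prop.table(table(df$y))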

    3.1.2 Visualizing the dataset

    In this exercise you will use ggplot() to visualize the dataset you created in the previous exercise. The dataset is available in the dataframe df. Use color to distinguish between the two classes.

  • Load ggplot2 library.
  • #load ggplot
    library(ggplot2)
  • Create 2d scatter plot and color the two classes (y = -1 and y = 1) red and blue.
  • #build scatter plot, distinguish class by color
    scatter_plot <- ggplot(data = df, aes(x = x1, y = x2, color = y)) + 
        geom_point() +
        scale_color_manual(values = c("red", "blue"))
    
    #display plot
    scatter_plot

    Nice work! We’ll use this dataset extensively in this chapter.

    3.2 Linear SVMs on radial data

    3.2.1 Linear SVM for a radially separable dataset

    In this exercise you will build two linear SVMs, one for cost = 1 (default) and the other for cost = 100, for the radially separable dataset you created in the first lesson of this chapter. You will also calculate the training and test accuracies for both costs. The e1071 library has been loaded, and test and training datasets have been created for you and are available in the data frames trainset and testset.

  • Build a linear SVM using the default cost.
  • #split train and test data in an 80/20 proportion
    df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)
    
    #assign training rows to data frame trainset
    trainset <- df[df$train == 1, ]
    #assign test rows to data frame testset
    testset <- df[df$train == 0, ]
    
    #find index of "train" column
    trainColNum <- grep("train", names(df))
    
    #remove "train" column from train and test dataset
    trainset <- trainset[, -trainColNum]
    testset <- testset[, -trainColNum]
    #default cost model
    svm_model_1 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 1, kernel = "linear")
  • Calculate training and test accuracies.
  • #training accuracy
    pred_train <- predict(svm_model_1, trainset)
    mean(pred_train == trainset$y)
    ## [1] 0.5673981
    #test accuracy
    pred_test <- predict(svm_model_1, testset)
    mean(pred_test == testset$y)
    ## [1] 0.4938272
    • Set cost = 100 and repeat.
    #cost = 100 model
    svm_model_2 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 100, kernel = "linear")
    
    #accuracy
    pred_train <- predict(svm_model_2, trainset)
    mean(pred_train == trainset$y)
    ## [1] 0.5673981
    pred_test <- predict(svm_model_2, testset)
    mean(pred_test == testset$y)
    ## [1] 0.4938272

    Good work! Next, we’ll get a more reliable measure of accuracy for one of the models.

    3.2.2 Average accuracy for linear SVM

    In this exercise you will calculate the average accuracy for a default cost linear SVM using 100 different training/test partitions of the dataset you generated in the first lesson of this chapter. The e1071 library has been preloaded and the dataset is available in the dataframe df. Use random 80/20 splits of the data in df when creating training and test datasets for each iteration.

    • Create a vector to hold accuracies for each step.
    # Create vector to store accuracies and set random number seed
    accuracy <- rep(NA, 100)
    set.seed(2)

    Create training / test datasets, build default cost SVMs and calculate the test accuracy for each iteration.

    # Calculate accuracies for 100 training/test partitions
    for (i in 1:100){
        df[, "train"] <- ifelse(runif(nrow(df)) < 0.8, 1, 0)
        trainset <- df[df$train == 1, ]
        testset <- df[df$train == 0, ]
        trainColNum <- grep("train", names(trainset))
        trainset <- trainset[, -trainColNum]
        testset <- testset[, -trainColNum]
        svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear")
        pred_test <- predict(svm_model, testset)
        accuracy[i] <- mean(pred_test == testset$y)
    }

    Compute the average accuracy and standard deviation over all iterations.

    # Print average accuracy and its standard deviation
    mean(accuracy)
    ## [1] 0.5554571
    sd(accuracy)
    ## [1] 0.04243524


    3.3 The kernel trick

    3.3.1 Visualizing transformed radially separable data

    In this exercise you will transform the radially separable dataset you created earlier in this chapter and visualize it in the x1^2-x2^2 plane. As a reminder, the separation boundary for the data is the circle x1^2 + x2^2 = 0.64 (radius = 0.8 units). Under the transformation u = x1^2, v = x2^2, this circle becomes the straight line u + v = 0.64, which has slope -1 and intercept 0.64. The dataset has been loaded for you in the dataframe df.

  • Transform data to the x1^2-x2^2 plane.
  • #transform data
    df1 <- data.frame(x1sq = df$x1^2, x2sq = df$x2^2, y = df$y)
  • Visualize data in terms of transformed coordinates.
  • #plot data points in the transformed space
    plot_transformed <- ggplot(data = df1, aes(x = x1sq, y = x2sq, color = y)) + 
        geom_point() + guides(color = "none") + 
        scale_color_manual(values = c("red", "blue"))
  • Add a boundary that is linear in terms of transformed coordinates.
  • #add decision boundary and visualize
    plot_decision <- plot_transformed + geom_abline(slope = -1, intercept = 0.64)
    plot_decision

    Excellent! As expected, the data is linearly separable in the x1^2-x2^2 plane.

    3.3.2 SVM with polynomial kernel

    In this exercise you will build an SVM with a quadratic kernel (polynomial of degree 2) for the radially separable dataset you created earlier in this chapter. You will then calculate the training and test accuracies and create a plot of the model using the built-in plot() function. The training and test datasets are available in the dataframes trainset and testset, and the e1071 library has been preloaded.

  • Build SVM model on the training data using a polynomial kernel of degree 2.
  • svm_model<- 
        svm(y ~ ., data = trainset, type = "C-classification", 
            kernel = "polynomial", degree = 2)
  • Calculate training and test accuracy for the given training/test partition.
  • #measure training and test accuracy
    pred_train <- predict(svm_model, trainset)
    mean(pred_train == trainset$y)
    ## [1] 0.9692308
    pred_test <- predict(svm_model, testset)
    mean(pred_test == testset$y)
    ## [1] 0.9466667
  • Plot the model against the training data.
  • #plot
    plot(svm_model, trainset)

    Well done! The decision boundary using default parameters looks good.

    3.4 Tuning SVMs

    3.4.1 Using tune.svm()

    This exercise will give you hands-on practice with using the tune.svm() function. You will use it to obtain the optimal values for the cost, gamma, and coef0 parameters for an SVM model based on the radially separable dataset you created earlier in this chapter. The training data is available in the dataframe trainset, the test data in testset, and the e1071 library has been preloaded for you. Remember that the class variable y is stored in the third column of the trainset and testset.

    Also recall that in the video, Kailash used cost = 10^(1:3) to get a range of cost values from 10 (10^1) to 1000 (10^3) in multiples of 10.
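
    Since the exponent sequence -1:2 expands to -1, 0, 1, 2, the cost grid used below covers four values spanning three orders of magnitude. A quick check (a sketch):

    #expand the cost grid used in this exercise
    10^(-1:2)   #0.1, 1, 10, 100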

    • Set parameter search ranges as follows:
      • cost - from 0.1 (10^(-1)) to 100 (10^2) in multiples of 10.
      • gamma and coef0 - one of the following values: 0.1, 1 and 10.
    #tune model
    tune_out <- 
        tune.svm(x = trainset[, -3], y = trainset[, 3], 
                 type = "C-classification", 
                 kernel = "polynomial", degree = 2, cost = 10^(-1:2), 
                 gamma = c(0.1, 1, 10), coef0 = c(0.1, 1, 10))
    
    #list optimal values
    tune_out$best.parameters$cost
    ## [1] 10
    tune_out$best.parameters$gamma
    ## [1] 10
    tune_out$best.parameters$coef0
    ## [1] 1

    Well done! You have obtained the optimal parameters for the specified parameter ranges.
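
    As an aside, the object returned by tune.svm() also records the cross-validation error achieved by the best parameter combination. A quick check (a sketch):

    #cross-validation error of the best parameter combination
    tune_out$best.performance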

    3.4.2 Building and visualizing the tuned model

    In the final exercise of this chapter, you will build a polynomial SVM using the optimal values of the parameters that you obtained from tune.svm() in the previous exercise. You will then calculate the training and test accuracies and visualize the model using the plot() function. The e1071 library has been preloaded and the test and training datasets are available in the dataframes trainset and testset. The output of tune.svm() is available in the variable tune_out.

  • Build an SVM using a polynomial kernel of degree 2.
  • Use the optimal parameters calculated using tune.svm().
  • #Build tuned model
    svm_model <- svm(y~ ., data = trainset, type = "C-classification", 
                     kernel = "polynomial", degree = 2, 
                     cost = tune_out$best.parameters$cost, 
                     gamma = tune_out$best.parameters$gamma, 
                     coef0 = tune_out$best.parameters$coef0)
  • Obtain training and test accuracies.
  • #Calculate training and test accuracies   
    pred_train <- predict(svm_model, trainset)
    mean(pred_train == trainset$y)
    ## [1] 1
    pred_test <- predict(svm_model, testset)
    mean(pred_test == testset$y)
    ## [1] 0.9866667
  • Plot the decision boundary against the training data.
  • #plot model
    plot(svm_model, trainset)

    Excellent! Tuning the parameters has given us a considerably better accuracy.

    4 Radial Basis Function Kernels

    4.1 Generating a complex dataset

    4.1.1 Generating a complex dataset - part 1

    In this exercise you will create a dataset that has two attributes x1 and x2, with x1 normally distributed (mean = -0.5, sd = 1) and x2 uniformly distributed in (-1, 1).

  • Generate a data frame df with 1000 points (x1, x2) distributed as follows:
  • #number of data points
    n <- 1000
  • x1 - normally distributed with mean = -0.5 and std deviation 1.
  • #set seed
    set.seed(1)
  • x2 uniformly distributed in (-1, 1).
  • #create dataframe
    df <- data.frame(x1 = rnorm(n, mean = -0.5, sd = 1), 
                     x2 = runif(n, min = -1, max = 1))

    Excellent! Now let’s create a complex decision boundary.

    4.1.2 Generating a complex dataset - part 2

    In this exercise, you will create a decision boundary for the dataset you created in the previous exercise. The boundary consists of two circles of radius 0.8 units with centers at (x1 = -0.8, x2 = 0) and (x1 = 0.8, x2 = 0) that just touch each other at the origin. Define a binary classification variable y such that points that lie within either of the circles have y = -1 and those that lie outside both circles have y = 1.

    The dataset created in the previous exercise is available in the dataframe df.

  • Set radii and centers of circles.
  • #set radius and centers 
    radius <- 0.8
    center_1 <- c(-0.8, 0)
    center_2 <- c(0.8, 0)
    radius_squared <- radius^2
  • Add a column to df containing the binary classification variable y.
  • #create binary classification variable 
    df$y <- factor(ifelse((df$x1-center_1[1])^2 + (df$x2-center_1[2])^2 < radius_squared|
                          (df$x1-center_2[1])^2 + (df$x2-center_2[2])^2 < radius_squared, -1, 1),
                          levels = c(-1, 1))

    Well done! Now let’s visualize the decision boundary.

    4.1.3 Visualizing the dataset

    In this exercise you will use ggplot() to visualize the complex dataset you created in the previous exercises. The dataset is available in the dataframe df. You are not required to visualize the decision boundary.

    Here you will use coord_equal() to give the x and y axes the same physical representation on the plot, making the circles appear as circles rather than ellipses.

  • Load the required plot library.
  • Set the arguments of the aesthetics parameter.
  • # Load ggplot2
    library(ggplot2)
  • Set the appropriate geom_ function for a scatter plot.
  • Specify equal coordinates by adding coord_equal() without arguments.
  • # Plot x2 vs. x1, colored by y
    scatter_plot <- ggplot(data = df, aes(x = x1, y = x2 , color = y)) + 
        # Add a point layer
        geom_point() + 
        scale_color_manual(values = c("red", "blue")) +
        # Specify equal coordinates
        coord_equal()
     
    scatter_plot 

    Excellent! In the next lesson we will see how linear and quadratic kernels perform against this dataset.

    4.2 Motivating the RBF kernel

    4.2.1 Linear SVM for complex dataset

    In this exercise you will build a default cost linear SVM for the complex dataset you created in the first lesson of this chapter. You will also calculate the training and test accuracies and plot the classification boundary against the test dataset. The e1071 library has been loaded, and test and training datasets have been created for you and are available in the data frames trainset and testset.

    • Build a linear SVM using the default value of cost.
    #build model
    svm_model<- 
        svm(y ~ ., data = trainset, type = "C-classification", 
            kernel = "linear")

    Calculate training and test accuracies.

    #accuracy
    pred_train <- predict(svm_model, trainset)
    mean(pred_train == trainset$y)
    ## [1] 0.6153846
    pred_test <- predict(svm_model, testset)
    mean(pred_test == testset$y)
    ## [1] 0.5866667

    Plot decision boundary against the test data.

    #plot model against testset
    plot(svm_model, testset)

    Nice work! As expected, the accuracy is poor and the plot clearly shows why a linear boundary will never work well.
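
    A useful baseline here is the accuracy of always predicting the majority class; the linear SVM barely improves on it. A quick sketch that computes this baseline from the training data:

    #accuracy of a classifier that always predicts the majority class
    max(prop.table(table(trainset$y)))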

    4.2.2 Quadratic SVM for complex dataset

    In this exercise you will build a default quadratic (polynomial, degree = 2) SVM for the complex dataset you created in the first lesson of this chapter. You will also calculate the training and test accuracies and plot the classification boundary against the training dataset. The e1071 library has been loaded, and test and training datasets have been created for you and are available in the data frames trainset and testset.

    • Build a polynomial SVM of degree 2 using the default parameters.
    #build model
    svm_model<- 
        svm(y ~ ., data = trainset, type = "C-classification", 
            kernel = "polynomial", degree = 2)

    Calculate training and test accuracies.

    #accuracy
    pred_train <- predict(svm_model, trainset)
    mean(pred_train == trainset$y)
    ## [1] 0.9692308
    pred_test <- predict(svm_model, testset)
    mean(pred_test == testset$y)
    ## [1] 0.9466667

    Plot decision boundary against the training data.

    #plot model
    plot(svm_model, trainset)

    Well done! The model accuracy is not too bad, but the plot shows that it is impossible to capture the figure of 8 shape of the actual boundary using a degree 2 polynomial.

    4.3 The RBF Kernel

    4.3.1 Polynomial SVM on a complex dataset

    Calculate the average accuracy for a degree 2 polynomial kernel SVM using 100 different training/test partitions of the complex dataset you generated in the first lesson of this chapter. Use default settings for the parameters. The e1071 library has been preloaded and the dataset is available in the dataframe df. Use random 80/20 splits of the data in df when creating training and test datasets for each iteration.

    • Create a vector to hold accuracies for each step.
    #create vector to store accuracies and set random number seed
    accuracy <- rep(NA, 100)
    set.seed(2)

    Create training/test datasets, build default cost polynomial SVMs of degree 2, and calculate the test accuracy for each iteration.

    #calculate accuracies for 100 training/test partitions
    for (i in 1:100){
        df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)
        trainset <- df[df$train == 1, ]
        testset <- df[df$train == 0, ]
        trainColNum <- grep("train", names(trainset))
        trainset <- trainset[, -trainColNum]
        testset <- testset[, -trainColNum]
        svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2)
        pred_test <- predict(svm_model, testset)
        accuracy[i] <- mean(pred_test == testset$y)
    }

    Compute the average accuracy and standard deviation over all iterations.

    #print average accuracy and standard deviation
    mean(accuracy)
    ## [1] 0.804765
    sd(accuracy)
    ## [1] 0.02398396

    Nice work! Please note down the average accuracy and standard deviation. We’ll compare these to the default RBF kernel SVM next.

    4.3.2 RBF SVM on a complex dataset

    Calculate the average accuracy for an RBF kernel SVM using 100 different training/test partitions of the complex dataset you generated in the first lesson of this chapter. Use default settings for the parameters. The e1071 library has been preloaded and the dataset is available in the dataframe df. Use random 80/20 splits of the data in df when creating training and test datasets for each iteration.

    • Create a vector of length 100 to hold accuracies for each step.
    #create vector to store accuracies and set random number seed
    accuracy <- rep(NA, 100)
    set.seed(2)

    Create training/test datasets, build RBF SVMs with default settings for all parameters and calculate the test accuracy for each iteration.

    #calculate accuracies for 100 training/test partitions
    for (i in 1:100){
        df[, "train"] <- ifelse(runif(nrow(df))<0.8, 1, 0)
        trainset <- df[df$train == 1, ]
        testset <- df[df$train == 0, ]
        trainColNum <- grep("train", names(trainset))
        trainset <- trainset[, -trainColNum]
        testset <- testset[, -trainColNum]
        svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial")
        pred_test <- predict(svm_model, testset)
        accuracy[i] <- mean(pred_test == testset$y)
    }

    Compute the average accuracy and standard deviation over all iterations.

    #print average accuracy and standard deviation
    mean(accuracy)
    ## [1] 0.9034203
    sd(accuracy)
    ## [1] 0.01786378

    Well done! Note that the average accuracy is almost 10% better than the one obtained in the previous exercise (polynomial kernel of degree 2).

    4.3.3 Tuning an RBF kernel SVM

    In this exercise you will build a tuned RBF kernel SVM for the given training dataset (available in dataframe trainset) and calculate the accuracy on the test dataset (available in dataframe testset). You will then plot the tuned decision boundary against the test dataset.

    • Use tune.svm() to build a tuned RBF kernel SVM.
    #tune model
    tune_out <- tune.svm(x = trainset[, -3], y = trainset[, 3], 
                         gamma = 5*10^(-2:2), 
                         cost = c(0.01, 0.1, 1, 10, 100), 
                         type = "C-classification", kernel = "radial")

    Rebuild SVM using optimal values of cost and gamma.

    #build tuned model
    svm_model <- svm(y~ ., data = trainset, type = "C-classification", kernel = "radial", 
                     cost = tune_out$best.parameters$cost, 
                     gamma = tune_out$best.parameters$gamma)

    Calculate the accuracy of your model using the test dataset.

    #calculate test accuracy
    pred_test <- predict(svm_model, testset)
    mean(pred_test == testset$y)
    ## [1] 0.9891892

    Plot the decision boundary against testset.

    #Plot decision boundary against test data
    plot(svm_model, testset)

    Well done! That’s it for this course. I hope you found it useful.