Labeling nodes
In this exercise, you will label the nodes in the network. You are given a dataframe called customers with the same customer IDs as in the network. Each customer has an indication of whether they churned or not, given by 1 or 0 respectively. You will add the churn status to the nodes of the network and visualize it.
Note that a network can have both node and edge attributes. The node attributes are represented by the function V() (for vertex) and the edge attributes by the function E(). The node attributes of the churn network are V(network).
# Inspect the customers dataframe
head(customers)
id churn
1 1 0
2 250 0
3 308 0
4 413 0
5 525 0
6 803 0
# Count the number of churners and non-churners
table(customers$churn)
0 1
926 30
# Add a node attribute called churn
V(network)$churn <- customers$churn
# Visualize the network
plot(network, vertex.label = NA, edge.label = NA,
edge.color = 'black', vertex.size = 2)
Great work adding the churn attribute to the network and visualizing it!
Coloring nodes
In this exercise, you will color the nodes in the network depending on the churn status. The customers who have churned will be colored red and the non-churners will be colored white. Then you will visualize the network again.
# Add a node attribute called color
V(network)$color <- V(network)$churn
# Change the color of churners to red and non-churners to white
V(network)$color <- gsub("1","red",V(network)$color)
V(network)$color <- gsub("0","white",V(network)$color)
# Plot the network
plot(network, vertex.label = NA, edge.label = NA,
edge.color = "black", vertex.size = 2)
Great work on getting the colors set up! However, in a large network like this it is difficult to see what is going on.
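As a side note, the same color mapping can be done in a single step with ifelse(); this is an equivalent alternative to the gsub() approach above, not part of the original exercise:
# Map churn status directly to colors in one step
V(network)$color <- ifelse(V(network)$churn == 1, "red", "white")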
Visualizing Churners
In the previous exercises, you noticed that it can be difficult to visualize a large network. Now you will visualize only the churners in the network by conditioning on the churn attribute.
# Create a subgraph with only churners
churnerNetwork <- induced_subgraph(network,
v = V(network)[which(V(network)$churn == 1)])
# Plot the churner network
plot(churnerNetwork, vertex.label = NA, vertex.size = 2)
Nice! You have successfully made a network of churners! As you can see, some of them are connected!
Relational Neighbor Classifier
In this exercise, you will apply a simple network-based classifier called the relational neighbor classifier. It uses the class labels of neighboring nodes to compute a churn probability for each node in the network. For example, in the course figure (not reproduced here), where red nodes denote churners and white nodes denote non-churners, the churn probability of the blue node is 0.4.
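As a minimal worked example (toy counts chosen to reproduce that 0.4, not read off the figure):
# A node with 2 churned and 3 non-churned neighbors
2 / (2 + 3)  # churn probability: 0.4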
You are given two vectors: ChurnNeighbors and NonChurnNeighbors, with each customer's number of neighbors that have churned and not churned, respectively.
# Compute the churn probabilities
churnProb <- ChurnNeighbors / (ChurnNeighbors + NonChurnNeighbors)
# Find who is most likely to churn
mostLikelyChurners <- which(churnProb == max(churnProb))
# Extract the IDs of the most likely churners
customers$id[mostLikelyChurners]
[1] 948 478
Great job using the relational neighbor classifier to discover that customers with IDs 948 and 478 are most likely to churn!
Examples of collective inference techniques: iterative classification and relaxation labeling.
Probabilistic Relational Neighbor Classifier
In this exercise, you will apply the probabilistic relational neighbor classifier to infer churn probabilities based on the prior churn probability of the other nodes.
Instead of knowing the nodes' labels, suppose you know each node's probability of churn (in the course figure, C stands for churn and NC for non-churn). Then, as before, you can update the churn probability of the nodes by finding the average of the neighboring nodes' churn probabilities.
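A minimal sketch of one update step, using hypothetical probabilities rather than the course data:
# A node whose three neighbors have churn probabilities 0.2, 0.4, and 0.3
neighborProbs <- c(0.2, 0.4, 0.3)
mean(neighborProbs)  # updated churn probability: 0.3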
# Find churn probability of the 44th customer
churnProb[44]
758
0.2
# Update the churn probabilities by averaging over each node's neighbors
# (neighbors is the pre-loaded vector of node degrees, i.e. rowSums(AdjacencyMatrix))
churnProb_updated <- as.vector((AdjacencyMatrix %*% churnProb) / neighbors)
# Find updated churn probability of the 44th customer
churnProb_updated[44]
[1] 0.2733333
Well done applying the probabilistic relational neighbor classifier! The churn probability of the 44th customer increased from 0.2 to 0.2733.
Collective Inferencing
Collective inferencing is a procedure to simultaneously label nodes in interconnected data to reduce classification error.
In this exercise you will perform collective inferencing and see the effect it has on the churn prediction using the AUC performance measure. AUC, or area under the ROC curve, is commonly used to assess the performance of classification techniques.
AUC = the probability that a randomly chosen churner is ranked higher by the model than a randomly chosen non-churner
AUC = a number between 0.5 and 1, where a higher number means a better model
Does collective inferencing increase the AUC value?
# Load the pROC package and data
library(pROC)
load("Nex132.RData")
# Compute the AUC
auc(customers$churn, as.vector(churnProb))
Area under the curve: 0.5813
# Write a for loop to update the probabilities
for(i in 1:10){
churnProb <- as.vector((AdjacencyMatrix %*% churnProb) / neighbors)
}
# Compute the AUC again
auc(customers$churn, as.vector(churnProb))
Area under the curve: 0.7114
Great job doing collective inferencing! The AUC increased from 0.5813 to 0.7114!
Homophily explained
birds of a feather flock together
Extracting types of edges
In this exercise, you will match the customer IDs in the customers dataframe with the customer edgelist to find out whether each edge is a churn, non-churn, or mixed edge. Using the function match(), you will add two columns to the edgelist:
- FromLabel with the churn status of the from column
- ToLabel with the churn status of the to column
The command match(x, y) returns a vector with the location of (each element of) x in y. In the course figure, match(edgeList$from, customers$id) is 1, 1, 1, 2, 2. For example, the fourth line in edgeList$from, which is the customer with id 393, is the second element in customers$id. The churn label of this customer is, therefore, customers[2, 2], or 0. Similarly, the churn label of everyone in edgeList$from is customers[match(edgeList$from, customers$id), 2].
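A tiny illustration of how match() works, using toy vectors rather than the course data:
# For each element of the first vector, return its position in the second
match(c("b", "a", "b"), c("a", "b", "c"))  # 2 1 2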
# Add the column edgeList$FromLabel
edgeList$FromLabel <- customers[match(edgeList$from, customers$id), 2]
# Add the column edgeList$ToLabel
edgeList$ToLabel <- customers[match(edgeList$to, customers$id), 2]
# Add the column edgeList$edgeType
edgeList$edgeType <- edgeList$FromLabel + edgeList$ToLabel
# Count the number of each type of edge
table(edgeList$edgeType)
0 1 2
1589 70 4
Great job extracting the edge types! In your churn network, you have 1589 edges of type 0, 70 edges of type 1, and 4 edges of type 2.
Counting types of edges
In this exercise, you will count the number of each edge type using the edgeType column in the edgeList dataframe. In the edgeList$edgeType column, there are three different values:
- 0, when neither of the connected customers churned (non-churn edge)
- 1, when exactly one of them churned (mixed edge)
- 2, when both churned (churn edge)
# Count churn edges
ChurnEdges <- sum(edgeList$edgeType == 2)
# Count non-churn edges
NonChurnEdges <- sum(edgeList$edgeType == 0)
# Count mixed edges
MixedEdges <- sum(edgeList$edgeType == 1)
# Count all edges
edges <- ChurnEdges + NonChurnEdges + MixedEdges
#Print the number of edges
edges
[1] 1663
Well done! Now you know the number of all edge types. There are 1663 in total.
Counting nodes and computing connectance
In this exercise, you will count the number of each type of node using the customers dataframe. The churn column has two different values:
- 1, for customers who churned
- 0, for customers who did not churn
You will also compute the network's connectance using the formula p = 2E / (N(N - 1)), where N is the number of nodes and E the number of edges in the network.
# Count the number of churn nodes
ChurnNodes <- sum(customers$churn == 1)
# Count the number of non-churn nodes
NonChurnNodes <- sum(customers$churn == 0)
# Count the total number of nodes
nodes <- ChurnNodes + NonChurnNodes
# Compute the network connectance
connectance <- 2 * edges / nodes / (nodes - 1)
# Print the value
connectance
[1] 0.003643015
Great job computing the connectance of the network! Its value is 0.00364. This network is rather sparse!
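As a quick sanity check on the formula, using the node and edge counts from the earlier exercises:
# 926 non-churners + 30 churners = 956 nodes; 1663 edges
2 * 1663 / (956 * 955)  # 0.003643015, matching the value above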
Dyadicity
Dyadicity measures the connectedness between nodes with the same label compared to what is expected in a random configuration of the network. The expected number of same-label edges is n(n - 1)p / 2, where n is the number of nodes with that label and p the connectance. Dyadicity equals the actual number of same-label edges divided by the expected number of same-label edges.
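A worked example with hypothetical numbers (standing in for the example lost from the slides):
# 10 same-label nodes, connectance 0.05, 9 observed same-label edges
expected <- 10 * (10 - 1) * 0.05 / 2  # 2.25 expected same-label edges
9 / expected                          # dyadicity: 4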
Dyadicity of churners
In this exercise, you will compute the dyadicity amongst the churners in the network to see if they share more or fewer edges than expected in a random configuration of the network.
The variables ChurnNodes, ChurnEdges, and connectance are available for you to use.
For the expected dyadicity, use the formula n_c(n_c - 1)p / 2, where n_c is the number of churners and p is the connectance. The dyadicity of the churners is the ratio between the actual number of churn edges and the expected number of churn edges.
# Compute the expected churn dyadicity
ExpectedDyadChurn <- ChurnNodes * (ChurnNodes-1) * connectance / 2
# Compute the churn dyadicity
DyadChurn <- ChurnEdges / ExpectedDyadChurn
# Inspect the value
DyadChurn
[1] 2.524119
Great job computing the dyadicity! Because the value is greater than 1, there is dyadicity amongst the churners!
Heterophilicity
Heterophilicity measures the connectedness between nodes with different labels compared to what is expected for a random configuration of the network.
Compute heterophilicity
Similar to dyadicity, heterophilicity is the ratio between the actual number of mixed label edges and the expected number of mixed label edges. In this exercise, you will compute the network’s heterophilicity.
For the expected number of mixed edges, use the formula n_c * n_nc * p, where n_nc is the number of non-churners, n_c the number of churners, and p the connectance.
You can use the variables NonChurnNodes, ChurnNodes, connectance, and MixedEdges.
# Compute the expected heterophilicity
ExpectedHet <- ChurnNodes * NonChurnNodes * connectance
# Compute the heterophilicity
Het <- MixedEdges / ExpectedHet
# Inspect the heterophilicity
Het
[1] 0.6916794
Great job computing the heterophilicity! It is less than one, which means that the network is heterophobic.
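A quick check of the formula with the counts from this network:
# 30 churners, 926 non-churners, connectance ~0.003643015, 70 mixed edges
70 / (30 * 926 * 0.003643015)  # ~0.6917, matching the value above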
Can I do predictive analytics with my network?
Are the relationships between nodes important? Are the labels randomly spread through the network, or is there some structure? Is the network homophilic?
Simple network features
In this exercise, you will extract simple network features from the churn network and add them to the network object as node attributes. The igraph object called network, with the node attribute Churn indicating customers who have already churned, has been pre-loaded. The igraph library has also been loaded for you.
# Extract network degree
V(network)$degree <- degree(network, normalized=TRUE)
# Extract the second-order network degree
degree2 <- neighborhood.size(network, 2)
# Normalize the second-order network degree
V(network)$degree2 <- degree2 / (length(V(network)) - 1)
# Extract number of triangles
V(network)$triangles <- count_triangles(network)
Great job! You’ve computed the first and second order degree, as well as the number of triangles, and added all these features to the network object.
Centrality features
In this exercise, you will compute and extract the centrality features betweenness, closeness, and eigenvector centrality.
# Extract the betweenness
V(network)$betweenness <- betweenness(network, normalized=TRUE)
# Extract the closeness
V(network)$closeness <- closeness(network, normalized=TRUE)
# Extract the eigenvector centrality
V(network)$eigenCentrality <- eigen_centrality(network, scale = TRUE)$vector
Great job! You have added the features betweenness, closeness, and eigenvector centrality.
Transitivity
In this exercise, you will compute the network's transitivity, or clustering coefficient, both on the node level and on the global level.
# Extract the local transitivity
V(network)$transitivity <- transitivity(network, type='local', isolates='zero')
# Compute the network's transitivity
transitivity(network)
[1] 0.1002653
Great job! The global transitivity is 0.1002653.
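Local transitivity is the number of triangles through a node divided by the number of pairs of its neighbors. As a worked check, using customer 1's values as they appear in the flat dataset later in this chapter (6 neighbors, 2 triangles):
# 2 triangles among choose(6, 2) = 15 neighbor pairs
2 / choose(6, 2)  # 0.1333..., customer 1's local transitivity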
Adjacency matrices
In this exercise, you will extract and compute the first and second order adjacency matrices of the network. You've already seen how to extract the first order adjacency matrix using the as_adjacency_matrix() function in the slides. For the second-order adjacency matrix, you need to multiply the first order matrix with itself and replace all the positive values with 1, since we are working with unweighted networks only. You also need to make sure the elements on the diagonal are 0, since we do not allow self-edges.
# Extract the adjacency matrix
AdjacencyMatrix <- as_adjacency_matrix(network)
# Compute the second order matrix
SecondOrderMatrix_adj <- AdjacencyMatrix %*% AdjacencyMatrix
# Adjust the second order matrix
SecondOrderMatrix <- ((SecondOrderMatrix_adj) > 0) + 0
diag(SecondOrderMatrix) <- 0
# Inspect the second order matrix
SecondOrderMatrix[1:10, 1:10]
1 0 . 1 1 . 1 1 . . .
250 . 0 1 1 1 1 1 . . .
308 1 1 0 1 1 1 1 . . .
413 1 1 1 0 1 1 1 . . .
525 . 1 1 1 0 1 1 . . .
803 1 1 1 1 1 0 1 . . .
894 1 1 1 1 1 1 0 . . .
2 . . . . . . . 0 . .
332 . . . . . . . . 0 1
433 . . . . . . . . 1 0
Great job computing the adjacency matrices! Now you are ready to add some link-based features!
Link-based features
In this exercise, you will compute first order link-based features by multiplying the Churn attribute of the network with the network's adjacency matrix.
Note that since churn is a binary indicator, the attribute Churn has 1 for churners and 0 for non-churners. Consequently, the attribute 1 - Churn has 1 for non-churners and 0 for churners. This is helpful when computing the number of non-churn neighbors.
# Compute the number of churn neighbors
V(network)$ChurnNeighbors <- as.vector(AdjacencyMatrix %*% V(network)$Churn)
# Compute the number of non-churn neighbors
V(network)$NonChurnNeighbors <- as.vector(AdjacencyMatrix %*% (1 - V(network)$Churn))
# Compute the relational neighbor probability
V(network)$RelationalNeighbor <- as.vector(V(network)$ChurnNeighbors /
(V(network)$ChurnNeighbors + V(network)$NonChurnNeighbors))
Great job obtaining the link-based features! Next we go for the second order neighborhood!
Second order link-based features
In this exercise, you will compute the number and ratio of churn and non-churn neighbors in the second order neighborhood. The procedure is the same as in the previous exercise, except now you use the second order adjacency matrix.
# Compute the number of churners in the second order neighborhood
V(network)$ChurnNeighborsSecond <- as.vector(SecondOrderMatrix %*% V(network)$Churn)
# Compute the number of non-churners in the second order neighborhood
V(network)$NonChurnNeighborsSecond <- as.vector(SecondOrderMatrix %*% (1 - V(network)$Churn))
# Compute the relational neighbor probability in the second order neighborhood
V(network)$RelationalNeighborSecond <- as.vector(V(network)$ChurnNeighborsSecond /
    (V(network)$ChurnNeighborsSecond + V(network)$NonChurnNeighborsSecond))
Great job! You’ve now extracted second order link-based features!
Neighborhood link-based features
Sometimes, the feature values of neighboring nodes have an effect on behavior. In this exercise, you will look at the attribute value of neighboring nodes and compute their average. You will do this for degree, triangles, transitivity, and betweenness.
You need to:
- divide by each node's degree, using the vector degree, which has been pre-loaded
- add the average values to network as node attributes
# Extract the average degree of neighboring nodes
V(network)$averageDegree <-
as.vector(AdjacencyMatrix %*% V(network)$degree) / degree
# Extract the average number of triangles of neighboring nodes
V(network)$averageTriangles <-
as.vector(AdjacencyMatrix %*% V(network)$triangles) / degree
# Extract the average transitivity of neighboring nodes
V(network)$averageTransitivity<-
as.vector(AdjacencyMatrix %*% V(network)$transitivity) / degree
# Extract the average betweenness of neighboring nodes
V(network)$averageBetweenness <-
as.vector(AdjacencyMatrix %*% V(network)$betweenness) / degree
You’ve now computed the average of degree, triangles, transitivity, and betweenness in the neighborhood. Well done!
Changes in PageRank
The PageRank formula can be solved iteratively. In each iteration, the current value of the PageRank vector is used to compute a new value that is closer to the true value. This means that the difference between the PageRank vectors of every two subsequent iterations becomes smaller and smaller, until the vector converges to the true value and the difference becomes (almost) zero. In this exercise, you will inspect the PageRank algorithm and how it converges.
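For reference, the equation being iterated is the standard textbook PageRank fixed point (the course slides may write it slightly differently):
PR(i) = (1 - d)/N + d * sum over neighbors j of i of PR(j)/deg(j)
where d is the damping factor, N the number of nodes, and deg(j) the degree of neighbor j (out-degree in a directed network).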
# Compute one iteration of PageRank
iter1 <- page.rank(network, algo = 'power', options = list(niter = 1))$vector
# Compute two iterations of PageRank
iter2 <- page.rank(network, algo='power', options = list(niter = 2))$vector
# Inspect the change between one and two iterations
sum(abs(iter2 - iter1))
[1] 0.2790693
# Inspect the change between nine and ten iterations
# (iter9 and iter10 were computed analogously, with niter = 9 and niter = 10)
sum(abs(iter10 - iter9))
[1] 0.009208855
Great job! The difference between iterations decreased from 0.279 to 0.009!
Convergence of PageRank
In this exercise, you will visually inspect how the PageRanks converge by plotting the differences of every two subsequent iterations.
# Create an empty vector
value <- c()
# Write a loop to compute PageRank
for(i in 1:15){
value <- cbind(value, page.rank(network, algo = 'power',options = list(niter = i))$vector)
}
# Compute the differences
difference <- colSums(abs(value[,1:14] - value[,2:15]))
# Plot the differences
plot(1:14, difference)
Great job! Did you see how the difference between iterations became less and less?
Personalized PageRank
In this exercise, you will study the difference between the PageRank and the personalized PageRank algorithms. You can use the function boxplots, which shows the score distributions of churners and non-churners with two separate boxplots. The function has two arguments:
- damping, indicating the value of the damping factor. The default value is set to 0.85.
- personalized, a Boolean parameter that indicates whether the personalized PageRank algorithm should be used. When TRUE, the restart vector has 1 for the churners in the network and 0 for the non-churners. The default value is FALSE, i.e. not personalized.
boxplots <- function(damping = 0.85, personalized = FALSE){
  if(personalized){
    V(network)$pp <- page.rank(network, damping = damping, personalized = V(network)$Churn)$vector
  } else {
    V(network)$pp <- page.rank(network, damping = damping)$vector
  }
  boxplot(V(network)$pp ~ V(network)$Churn)
}
# Look at the distribution of standard PageRank scores
boxplots(damping = 0.85)
# Inspect the distribution of personalized PageRank scores
boxplots(damping = 0.85, personalized = TRUE)
# Look at the standard PageRank with damping factor 0.2
boxplots(damping = 0.2)
# Inspect the personalized PageRank scores with a damping factor 0.99
boxplots(damping = 0.99, personalized = TRUE)
Great job! There is a clear difference between the personalized and non-personalized PageRank scores.
Extract PageRank features
In this exercise, you will compute a few PageRank scores and add them as features to the network.
When personalizing, use the prior churners in the network as the restart vector. That is, include the argument personalized = V(network)$Churn.
The default damping value is 0.85.
# Compute the default PageRank score
V(network)$pr_0.85 <- page.rank(network)$vector
# Compute the PageRank score with damping 0.2
V(network)$pr_0.20 <- page.rank(network, damping = 0.2)$vector
# Compute the personalized PageRank score
V(network)$perspr_0.85 <- page.rank(network, personalized = V(network)$Churn)$vector
# Compute the personalized PageRank score with damping 0.99
V(network)$perspr_0.99 <- page.rank(network, damping = 0.99, personalized = V(network)$Churn)$vector
You now have the final PageRank score and are finished featurizing the network! Well done!
Getting a flat dataset
In this exercise, you will turn your network into a dataframe, where the rows are the people in the network and the columns are the network features you computed in the previous chapter. You will also prepare the dataset for the pre-processing.
# Extract the dataset
studentnetworkdata_full <- as_data_frame(network, what = "vertices")
# Inspect the dataset
head(studentnetworkdata_full)
name Churn Future degree degree2 triangles betweenness closeness
1 1 0 0 0.006282723 0.02198953 2 0.008143888 0.08535931
250 250 0 0 0.004188482 0.01884817 0 0.005203912 0.08324616
308 308 0 0 0.003141361 0.01361257 1 0.002343329 0.08361790
413 413 0 0 0.005235602 0.02094241 1 0.006856084 0.08570403
525 525 0 0 0.003141361 0.01256545 0 0.002902738 0.08065197
803 803 0 0 0.006282723 0.02722513 1 0.014510436 0.08777574
eigenCentrality transitivity ChurnNeighbors NonChurnNeighbors
1 0.16674956 0.13333333 0 6
250 0.12172320 0.00000000 0 4
308 0.08880306 0.33333333 0 3
413 0.15380274 0.10000000 0 5
525 0.04339212 0.00000000 0 3
803 0.22528120 0.06666667 0 6
RelationalNeighbor ChurnNeighborsSecond NonChurnNeighborsSecond
1 0 0 18
250 0 0 13
308 0 0 11
413 0 0 16
525 0 0 8
803 0 0 21
RelationalNeighborSecond averageDegree averageTriangles averageTransitivity
1 0 0.004363002 0.8333333 0.13888889
250 0 0.004450262 1.0000000 0.12857143
308 0 0.004886562 1.3333333 0.15555556
413 0 0.004397906 1.0000000 0.10666667
525 0 0.003839442 0.6666667 0.04444444
803 0 0.004886562 0.8333333 0.10555556
averageBetweenness pr_0.85 pr_0.20 perspr_0.85 perspr_0.99
1 0.005713676 0.0016432968 0.0011679051 0.0005999950 0.0016861963
250 0.005733908 0.0012545054 0.0011611762 0.0002994797 0.0010956510
308 0.006030621 0.0008559875 0.0009801438 0.0003068000 0.0008458014
413 0.006727380 0.0014219824 0.0011938511 0.0005301201 0.0014117851
525 0.003698440 0.0009856206 0.0010436938 0.0001917604 0.0008062519
803 0.009493972 0.0016022681 0.0011561387 0.0015758243 0.0018944739
# Remove customers who already churned
studentnetworkdata_filtered <- studentnetworkdata_full[-which(studentnetworkdata_full$Churn == 1), ]
# Remove the name and Churn columns, which carry no predictive information
studentnetworkdata <- studentnetworkdata_filtered[, -c(1, 2)]
Great job extracting the dataset! The next step is preprocessing!
Missing Values
You can check if a dataset has missing values per column using the apply() function. It takes three arguments:
- The first argument is the dataframe you are inspecting, in this case, studentnetworkdata.
- The second argument is 2, meaning the function is applied to each column (1 would mean each row).
- The third argument is a function of the form function(x) ..., where ... indicates the operation to perform. You want to count the number of missing values in each column, so use sum(is.na(x)) to find all the missing values and count them.
How many columns in studentnetworkdata have missing values?
apply(studentnetworkdata, 2, function(x) sum(is.na(x)))
Future degree degree2
0 0 0
triangles betweenness closeness
0 0 0
eigenCentrality transitivity ChurnNeighbors
0 0 0
NonChurnNeighbors RelationalNeighbor ChurnNeighborsSecond
0 0 0
NonChurnNeighborsSecond RelationalNeighborSecond averageDegree
0 6 0
averageTriangles averageTransitivity averageBetweenness
0 0 0
pr_0.85 pr_0.20 perspr_0.85
0 0 0
perspr_0.99
0
The column RelationalNeighborSecond has six missing values.
Replace missing values
In the last exercise, you noticed that for six observations, the value of RelationalNeighborSecond was missing. In this exercise, you will replace those missing values with 0.
# Inspect the feature
summary(studentnetworkdata$RelationalNeighborSecond)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00000 0.00000 0.00000 0.02399 0.00000 1.00000 6
# Find the indices of the missing values
toReplace <- which(is.na(studentnetworkdata$RelationalNeighborSecond))
# Replace the missing values with 0
studentnetworkdata$RelationalNeighborSecond[toReplace] <- 0
# Inspect the feature again
summary(studentnetworkdata$RelationalNeighborSecond)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.00000 0.00000 0.02383 0.00000 1.00000
Great job! You have dealt with all the missing values.
Correlated variables
In this exercise, you will inspect the dataset with respect to correlated variables. It is important to remove them before applying a binary classifier, especially in the case of logistic regression. When two or more variables are highly correlated you should remove all except for one.
First, we will use the corrplot() function in the corrplot package to visualize the correlations. In the correlation plot, blue represents a positive correlation and red a negative correlation. A darker color indicates a higher correlation. Finally, you will remove the highly correlated variables from the dataset.
# Remove the Future column from studentnetworkdata
no_future <- studentnetworkdata[, -1]
# Load the corrplot package
library(corrplot)
# Generate the correlation matrix
M <- cor(no_future)
# Plot the correlations
corrplot(M, method = "circle")
# Print the column names
colnames(studentnetworkdata)
[1] "Future" "degree"
[3] "degree2" "triangles"
[5] "betweenness" "closeness"
[7] "eigenCentrality" "transitivity"
[9] "ChurnNeighbors" "NonChurnNeighbors"
[11] "RelationalNeighbor" "ChurnNeighborsSecond"
[13] "NonChurnNeighborsSecond" "RelationalNeighborSecond"
[15] "averageDegree" "averageTriangles"
[17] "averageTransitivity" "averageBetweenness"
[19] "pr_0.85" "pr_0.20"
[21] "perspr_0.85" "perspr_0.99"
# Create toRemove
toRemove <- c(10, 13, 19, 22)
# Remove the columns
studentnetworkdata_no_corrs <- studentnetworkdata[, -toRemove]
Well done! You removed the highly correlated features: NonChurnNeighbors, NonChurnNeighborsSecond, pr_0.85, and perspr_0.99.
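If you prefer to pick the columns programmatically rather than reading them off the plot, caret's findCorrelation() is one option. A sketch, assuming the caret package is installed (it is not used in this course, and it indexes the columns of no_future rather than studentnetworkdata):
library(caret)
# Indices of columns whose pairwise |correlation| exceeds 0.9 (threshold is a choice)
toRemoveAuto <- findCorrelation(M, cutoff = 0.9)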
Split into train and test
Now that we have a dataframe, we can apply standard techniques for modeling. In this exercise, you will split the data into training and test sets.
# Set the seed
set.seed(7)
# Create the index vector
index_train <- sample(1:nrow(studentnetworkdata), 2 / 3 * nrow(studentnetworkdata))
# Make the training set
training_set <- studentnetworkdata[index_train, ]
# Make the test set
test_set <- studentnetworkdata[-index_train, ]
Well done! Now you have a training set and a test set.
Logistic regression model
In this exercise, you will build churn prediction models using logistic regression. These models predict which customers will churn in the future. You will build three models with different sets of features. The target variable is called Future. You will build the models using the training dataset training_set and the function glm().
# Make firstModel
firstModel <- glm(Future ~ degree + degree2 + triangles + betweenness + closeness + transitivity, family = "binomial", data = training_set)
firstModel
Call: glm(formula = Future ~ degree + degree2 + triangles + betweenness +
closeness + transitivity, family = "binomial", data = training_set)
Coefficients:
(Intercept) degree degree2 triangles betweenness
-6.9805 -48.6542 -156.1732 0.4957 -137.5345
closeness transitivity
74.7736 -0.8865
Degrees of Freedom: 616 Total (i.e. Null); 610 Residual
Null Deviance: 202.9
Residual Deviance: 182.7 AIC: 196.7
# Build the model
secondModel <- glm(Future ~ ChurnNeighbors + RelationalNeighbor + ChurnNeighborsSecond + RelationalNeighborSecond + averageDegree + averageTriangles + averageTransitivity + averageBetweenness,
family = "binomial", data = training_set)
secondModel
Call: glm(formula = Future ~ ChurnNeighbors + RelationalNeighbor +
ChurnNeighborsSecond + RelationalNeighborSecond + averageDegree +
averageTriangles + averageTransitivity + averageBetweenness,
family = "binomial", data = training_set)
Coefficients:
(Intercept) ChurnNeighbors RelationalNeighbor
-4.1202 -0.2709 3.6742
ChurnNeighborsSecond RelationalNeighborSecond averageDegree
0.1167 -1.4032 775.7303
averageTriangles averageTransitivity averageBetweenness
-0.9539 1.8714 -323.5743
Degrees of Freedom: 616 Total (i.e. Null); 608 Residual
Null Deviance: 202.9
Residual Deviance: 187.9 AIC: 205.9
# Build the model
thirdModel <- glm(Future ~ ., family = "binomial", data = training_set)
thirdModel
Call: glm(formula = Future ~ ., family = "binomial", data = training_set)
Coefficients:
(Intercept) degree degree2
-24.8438 -1980.5282 0.1002
triangles betweenness closeness
0.2696 -235.0185 152.1249
eigenCentrality transitivity ChurnNeighbors
10.9705 -1.1261 1.9778
RelationalNeighbor ChurnNeighborsSecond RelationalNeighborSecond
0.6155 1.4341 -6.8080
averageDegree averageTriangles averageTransitivity
443.9935 -1.2387 4.7712
averageBetweenness pr_0.20 perspr_0.85
-228.4347 13966.7768 -181.8855
Degrees of Freedom: 616 Total (i.e. Null); 599 Residual
Null Deviance: 202.9
Residual Deviance: 162.6 AIC: 198.6
You have now made three logistic regression models! Great job!
Random forest model
In this exercise, you will use the randomForest() function in the randomForest package to build a random forest model for predicting churn of the customers in the training dataset, training_set. The target variable is called Future. You will also inspect and visualize the importance of the variables in the model.
# Load package
library(randomForest)
# Set seed
set.seed(863)
# Build model
rfModel <- randomForest(as.factor(Future) ~ ., data = training_set)
rfModel
Call:
randomForest(formula = as.factor(Future) ~ ., data = training_set)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 3.4%
Confusion matrix:
0 1 class.error
0 596 0 0
1 21 0 1
# Plot variable importance
varImpPlot(rfModel)
You trained a random forest on your data! Well done! In the variable importance plot, you can see that perspr_0.85 and eigenCentrality are the most important features.
Predicting churn
In this exercise, you will use the predict() function to predict the churn probability of the customers in the test set, test_set. The function has three arguments:
- The model the predictions are based on.
- newdata: the dataset the model should be applied to.
- type: the type of prediction. We want the probability of churn, so we choose response for the logistic regression models and prob for the random forest model.
Use the function predict() to predict the churn probability of the customers in test_set.
# Load the package
library(pROC)
# Predict with the first model
firstPredictions <- predict(firstModel, newdata = test_set, type = "response")
# Predict with the second model
secondPredictions <- predict(secondModel, newdata = test_set, type = "response")
# Predict with the third model
thirdPredictions <- predict(thirdModel, newdata = test_set, type = "response")
# Predict with the rfModel
rfPredictions<- predict(rfModel, newdata = test_set, type= "prob")
Great! You computed churn probabilities using the random forest model!
Measure AUC
In this exercise, you will compute the AUC of your churn prediction models to find the best one. Use the auc() function in the pROC package. The function has two arguments:
- the true churn labels, given by test_set$Future
- the churn predictions obtained with the predict() function
The objects firstPredictions, secondPredictions, thirdPredictions, and rfPredictions have been loaded for you.
Which model has the highest AUC value?
auc(test_set$Future, firstPredictions)
Area under the curve: 0.7452
auc(test_set$Future, secondPredictions)
Area under the curve: 0.74
auc(test_set$Future, thirdPredictions)
Area under the curve: 0.7221
auc(test_set$Future, rfPredictions[,2])
Area under the curve: 0.7548
The random forest model has the highest AUC: 0.7548.
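To see the full ROC curve behind these numbers, pROC's roc() and plot() can be used; a sketch, not part of the original exercise:
# Build and plot the ROC curve for the random forest predictions
rfROC <- roc(test_set$Future, rfPredictions[, 2])
plot(rfROC)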
Measure top decile lift
Compute and compare the top decile lift of the churn prediction models using the function TopDecileLift() in the lift package. Top decile lift is the churn rate among the 10% of customers with the highest predicted churn probability, divided by the overall churn rate; a lift above 1 means the model concentrates churners at the top of the ranking. The function takes two arguments:
- the true churn labels, given by test_set$Future
- the churn predictions obtained with the predict() function
The objects firstPredictions, secondPredictions, thirdPredictions, and rfPredictions, and the package lift, have been loaded for you.
Which model has the highest top decile lift?
library(lift)
TopDecileLift(test_set$Future, firstPredictions)
[1] 1.053
TopDecileLift(test_set$Future, secondPredictions)
[1] 1.179
TopDecileLift(test_set$Future, thirdPredictions)
[1] 1.055
TopDecileLift(test_set$Future, rfPredictions[,2])
[1] 1.19
The random forest model also has the highest top decile lift: 1.19.
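To make the definition concrete, top decile lift can also be computed by hand. A sketch, equivalent in spirit to lift::TopDecileLift() (which handles ties and rounding, so results may differ slightly):
# Churn rate in the top 10% highest-scored customers, divided by the overall churn rate
topN <- ceiling(0.1 * nrow(test_set))
topIdx <- order(rfPredictions[, 2], decreasing = TRUE)[1:topN]
mean(test_set$Future[topIdx]) / mean(test_set$Future)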