Similarity agrees with the following thesis:
Two nodes are similar to each other if they share many neighbors.
In case of directed graphs, we may replace neighbors with either predecessor or successor nodes. For instance, is a friendship network like Facebook, two people are similar if they have many friends in common. In a directed social network like Twitter, two users are similar if they share many followers or many followees.
Let be the adjacency matrix of a possibly weighted graph. The general idea is to associate each node with a th row of the adjacency matrix. If the graph is directed, we can associate node with either the -th adjacency row if we focus on out-going links or with the -th adjacency column if we focus on in-coming links. Hence a node is associated with a vector in , where is the number of nodes. The similarity (or dissimilarity) of two nodes and is then measured using a similarity (or distance) function among the associated vectors.
Let us focus on unweighted undirected graphs with nodes. We denote with the degree of node and with the number of neighbors that nodes and have in common. Moreover, is the -th row of , a boolean (0-1) vector of length .
We are going to describe a pool of similarity measures. The first one is cosine similarity: the similarity of nodes and is the cosine of the angle between vectors and :
An alternative similarity measure in Pearson correlation coefficient: the similarity of nodes and is the correlation coefficient between vectors and :
similarity = function(g, type = "cosine", mode = "col" ) {
A = as_adjacency_matrix(g, sparse = FALSE)
if (mode == "row") {A = t(A)}
if (type == "cosine") {
euclidean = function(x) {sqrt(x %*% x)}
d = apply(A, 2, euclidean)
D = diag(1/d)
S = D %*% t(A) %*% A %*% D
}
if (type == "pearson") {
S = cor(A)
}
return(S)
}