Similarity
Similarity agrees with the following thesis:
Two nodes are similar to each other if they share many neighbors.
In the case of directed graphs, we may replace neighbors with either
predecessor or successor nodes. For instance, in a friendship network
like Facebook, two people are similar if they have many friends in
common. In a directed social network like Twitter, two users are similar
if they share many followers or many followees.
Let $A = (a_{i,j})$ be the adjacency matrix of a possibly weighted graph. The general idea is to associate each node $i$ with the $i$-th row of the adjacency matrix. If the graph is directed, we can associate node $i$ either with the $i$-th adjacency row, if we focus on outgoing links, or with the $i$-th adjacency column, if we focus on incoming links. Hence each node is associated with a vector in $\mathbb{R}^n$, where $n$ is the number of nodes. The similarity (or dissimilarity) of two nodes $i$ and $j$ is then measured using a similarity (or distance) function on the associated vectors.
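For concreteness, here is a minimal sketch in R with the igraph package (the example graphs and node names are invented for illustration) showing how a node corresponds to a row, or a column, of the adjacency matrix:

library(igraph)

# small undirected example graph
g <- make_graph(~ A-B, A-C, B-C, B-D, C-D)
Adj <- as_adjacency_matrix(g, sparse = FALSE)
Adj["B", ]    # the vector representing node B by its neighbors

# small directed example: rows describe outgoing links, columns incoming links
dg <- make_graph(~ A-+B, B-+C, A-+C)
DAdj <- as_adjacency_matrix(dg, sparse = FALSE)
DAdj["A", ]   # successors (followees) of A
DAdj[, "C"]   # predecessors (followers) of C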
Let us focus on unweighted undirected graphs with $n$ nodes. We denote by $k_i = \sum_k A_{i,k}$ the degree of node $i$ and by $n_{i,j} = \sum_k A_{i,k} A_{j,k}$ the number of neighbors that nodes $i$ and $j$ have in common. Moreover, $A_i$ is the $i$-th row of $A$, a Boolean (0-1) vector of length $n$.
We are going to describe a pool of similarity measures. The first one is cosine similarity: the similarity $\sigma_{i,j}$ of nodes $i$ and $j$ is the cosine of the angle between vectors $A_i$ and $A_j$:
$$\sigma_{i,j} = \cos(A_i, A_j) = \frac{A_i \cdot A_j}{\|A_i\| \, \|A_j\|} = \frac{\sum_k A_{i,k} A_{j,k}}{\sqrt{\sum_k A_{i,k}^2} \, \sqrt{\sum_k A_{j,k}^2}}$$
The measure runs from 0 (orthogonal vectors or maximum dissimilarity) to
1 (parallel vectors or maximum similarity). Since the involved vectors
are 0-1 vectors, we have that:
$$\sigma_{i,j} = \frac{n_{i,j}}{\sqrt{k_i k_j}}$$
That is, the cosine similarity between $i$ and $j$ is the number of neighbors shared by $i$ and $j$ divided by the geometric mean of their degrees.
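As a quick sanity check, the following R sketch (the example graph and the pair of nodes are arbitrary) compares the cosine definition with the common-neighbors formula:

library(igraph)

g <- make_graph(~ A-B, A-C, B-C, B-D, C-D)
Adj <- as_adjacency_matrix(g, sparse = FALSE)
i <- "B"; j <- "C"

# cosine of the angle between the two adjacency rows
cosine <- sum(Adj[i, ] * Adj[j, ]) /
  (sqrt(sum(Adj[i, ]^2)) * sqrt(sum(Adj[j, ]^2)))

# common neighbors divided by the geometric mean of the degrees
n_ij <- sum(Adj[i, ] * Adj[j, ])
k_i <- sum(Adj[i, ]); k_j <- sum(Adj[j, ])
cosine - n_ij / sqrt(k_i * k_j)   # (numerically) zero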
An alternative similarity measure is the Pearson correlation coefficient: the similarity $\sigma_{i,j}$ of nodes $i$ and $j$ is the correlation coefficient between vectors $A_i$ and $A_j$:
$$\sigma_{i,j} = \frac{\mathrm{cov}(A_i, A_j)}{\mathrm{sd}(A_i)\,\mathrm{sd}(A_j)} = \frac{\sum_k (A_{i,k} - \langle A_i \rangle)(A_{j,k} - \langle A_j \rangle)}{\sqrt{\sum_k (A_{i,k} - \langle A_i \rangle)^2} \, \sqrt{\sum_k (A_{j,k} - \langle A_j \rangle)^2}}$$
where $\langle A_i \rangle = k_i / n$ is the mean of vector $A_i$.
The measure runs from -1 (maximum negative correlation or maximum
dissimilarity) to 1 (maximum positive correlation or maximum
similarity). Notice that values close to 0
indicate no correlation, hence neither similarity nor dissimilarity.
Again, since the involved vectors are 0-1 vectors, it is not difficult
to see that the numerator of the correlation coefficient, that is, the
covariance between vectors $A_i$ and $A_j$, is:
$$\mathrm{cov}(A_i, A_j) = n_{i,j} - \frac{k_i k_j}{n}$$
Notice that $k_i k_j / n$ is the expected number of common neighbors between $i$ and $j$ if they chose their neighbors at random: the probability that a random neighbor of $i$ is also a neighbor of $j$ is $k_j / n$, hence the expected number of common neighbors between $i$ and $j$ is $k_i k_j / n$.
Hence a positive covariance, or a similarity between $i$ and $j$, holds when $i$ and $j$ share more neighbors than we would expect by chance, while a negative covariance, or a dissimilarity between $i$ and $j$, happens when $i$ and $j$ have fewer neighbors in common than we would expect by chance.
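The identity for the covariance can be checked on a small example; this R sketch (example graph and node pair chosen arbitrarily) compares the sum of centered products with the closed form $n_{i,j} - k_i k_j / n$:

library(igraph)

g <- make_graph(~ A-B, A-C, B-C, B-D, C-D)
Adj <- as_adjacency_matrix(g, sparse = FALSE)
n <- vcount(g)
i <- "B"; j <- "C"

# covariance as the sum of products of centered entries (the numerator above)
covariance <- sum((Adj[i, ] - mean(Adj[i, ])) * (Adj[j, ] - mean(Adj[j, ])))

# closed form in terms of common neighbors and degrees
n_ij <- sum(Adj[i, ] * Adj[j, ])
k_i <- sum(Adj[i, ]); k_j <- sum(Adj[j, ])
covariance - (n_ij - k_i * k_j / n)   # (numerically) zero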
library(igraph)

# node similarity from the adjacency matrix
# type: "cosine" or "pearson"
# mode: "col" compares nodes by their adjacency columns (incoming links),
#       "row" by their adjacency rows (outgoing links); irrelevant for undirected graphs
similarity = function(g, type = "cosine", mode = "col") {
  A = as_adjacency_matrix(g, sparse = FALSE)
  if (mode == "row") {A = t(A)}
  if (type == "cosine") {
    # Euclidean norm of a vector
    euclidean = function(x) {sqrt(sum(x^2))}
    d = apply(A, 2, euclidean)
    D = diag(1 / d)
    # scale each column to unit norm, then take all pairwise dot products
    S = D %*% t(A) %*% A %*% D
  }
  if (type == "pearson") {
    # Pearson correlation between all pairs of columns
    S = cor(A)
  }
  return(S)
}
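A brief usage sketch (the graph is an invented toy example):

g <- make_graph(~ A-B, A-C, B-C, B-D, C-D)
similarity(g, type = "cosine")    # cosine similarity between all pairs of nodes
similarity(g, type = "pearson")   # Pearson correlation between all pairs of nodes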