Similarity

Similarity is based on the following thesis:

Two nodes are similar to each other if they share many neighbors.

In the case of directed graphs, we may replace neighbors with either predecessor or successor nodes. For instance, in a friendship network like Facebook, two people are similar if they have many friends in common. In a directed social network like Twitter, two users are similar if they share many followers or many followees.

Let $A = (a_{i,j})$ be the adjacency matrix of a possibly weighted graph. The general idea is to associate each node $i$ with the $i$-th row of the adjacency matrix. If the graph is directed, we can associate node $i$ with either the $i$-th adjacency row, if we focus on out-going links, or with the $i$-th adjacency column, if we focus on in-coming links. Hence each node is associated with a vector in $\mathbb{R}^n$, where $n$ is the number of nodes. The similarity (or dissimilarity) of two nodes $i$ and $j$ is then measured using a similarity (or distance) function on the associated vectors.
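As a concrete illustration, here is a minimal R sketch of this association, using igraph; the toy graph and node names are chosen arbitrarily:

```r
library(igraph)

# a small directed toy graph (purely illustrative)
g = graph_from_literal(a -+ b, a -+ c, b -+ c, c -+ a)
A = as_adjacency_matrix(g, sparse = FALSE)

A["a", ]  # out-going links of node a (adjacency row)
A[, "a"]  # in-coming links of node a (adjacency column)
```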

Let us focus on unweighted undirected graphs with $n$ nodes. We denote with $k_i = \sum_k A_{i,k}$ the degree of node $i$ and with $n_{i,j} = \sum_k A_{i,k} A_{j,k}$ the number of neighbors that nodes $i$ and $j$ have in common. Moreover, $A_i$ is the $i$-th row of $A$, a Boolean (0-1) vector of length $n$.
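Both quantities can be read directly off the adjacency matrix; a minimal R sketch, with the graph chosen arbitrarily for illustration:

```r
library(igraph)

g = sample_gnp(10, 0.4)                    # a random toy graph
A = as_adjacency_matrix(g, sparse = FALSE)

k = rowSums(A)  # degrees: k[i] = sum over l of A[i,l]
N = A %*% A     # common neighbors: N[i,j] = sum over l of A[i,l] * A[j,l]
                # (A is symmetric for undirected graphs)
```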

We are going to describe a pool of similarity measures. The first one is cosine similarity: the similarity $\sigma_{i,j}$ of nodes $i$ and $j$ is the cosine of the angle between vectors $A_i$ and $A_j$:

$$\sigma_{i,j} = \cos(A_i, A_j) = \frac{A_i \cdot A_j}{\|A_i\| \, \|A_j\|} = \frac{\sum_k A_{i,k} A_{j,k}}{\sqrt{\sum_k A_{i,k}^2} \, \sqrt{\sum_k A_{j,k}^2}}$$
The measure runs from 0 (orthogonal vectors or maximum dissimilarity) to 1 (parallel vectors or maximum similarity). Since the involved vectors are 0-1 vectors, we have that:
$$\sigma_{i,j} = \frac{n_{i,j}}{\sqrt{k_i k_j}}$$
That is, cosine similarity between i and j is the number of neighbors shared by i and j divided by the geometric mean of their degrees.
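As a numerical sanity check, a short sketch comparing the vector definition with the simplified formula; the graph and node pair are chosen arbitrarily, and both nodes are assumed to have positive degree:

```r
library(igraph)

g = sample_gnp(10, 0.5)
A = as_adjacency_matrix(g, sparse = FALSE)
k = rowSums(A)
N = A %*% A

i = 1; j = 2
# cosine of the angle between rows A[i,] and A[j,]
cos_def = sum(A[i, ] * A[j, ]) / (sqrt(sum(A[i, ]^2)) * sqrt(sum(A[j, ]^2)))
# shared neighbors over the geometric mean of the degrees
cos_simple = N[i, j] / sqrt(k[i] * k[j])
all.equal(cos_def, cos_simple)  # TRUE (when both degrees are positive)
```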

An alternative similarity measure is the Pearson correlation coefficient: the similarity $\sigma_{i,j}$ of nodes $i$ and $j$ is the correlation coefficient between vectors $A_i$ and $A_j$:

$$\sigma_{i,j} = \frac{\mathrm{cov}(A_i, A_j)}{\mathrm{sd}(A_i) \, \mathrm{sd}(A_j)} = \frac{\sum_k (A_{i,k} - \bar{A}_i)(A_{j,k} - \bar{A}_j)}{\sqrt{\sum_k (A_{i,k} - \bar{A}_i)^2} \, \sqrt{\sum_k (A_{j,k} - \bar{A}_j)^2}}$$

where $\bar{A}_i$ denotes the mean of the elements of $A_i$.
The measure runs from -1 (maximum negative correlation or maximum dissimilarity) to 1 (maximum positive correlation or maximum similarity). Notice that values close to 0 indicate no correlation, hence neither similarity nor dissimilarity. Again, since the involved vectors are 0-1 vectors, it is not difficult to see that the numerator of the correlation coefficient, that is, the covariance between vectors $A_i$ and $A_j$, is:
$$\mathrm{cov}(A_i, A_j) = n_{i,j} - \frac{k_i k_j}{n}$$
Notice that $k_i k_j / n$ is the expected number of common neighbors between $i$ and $j$ if they chose their neighbors at random: the probability that a random neighbor of $i$ is also a neighbor of $j$ is $k_j / n$, hence the expected number of common neighbors between $i$ and $j$ is $k_i k_j / n$. Hence a positive covariance, or a similarity between $i$ and $j$, holds when $i$ and $j$ share more neighbors than we would expect by chance, while a negative covariance, or a dissimilarity between $i$ and $j$, happens when $i$ and $j$ have fewer neighbors in common than we would expect by chance.
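Again, a quick numerical check of this identity, under the same illustrative setup as before:

```r
library(igraph)

g = sample_gnp(10, 0.5)
A = as_adjacency_matrix(g, sparse = FALSE)
n = vcount(g)
k = rowSums(A)
N = A %*% A

i = 1; j = 2
# numerator of the correlation coefficient (covariance, up to normalization)
covariance = sum((A[i, ] - mean(A[i, ])) * (A[j, ] - mean(A[j, ])))
expected   = N[i, j] - k[i] * k[j] / n  # shared neighbors minus the chance level
all.equal(covariance, expected)  # TRUE
```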

```r
library(igraph)

similarity = function(g, type = "cosine", mode = "col") {
  A = as_adjacency_matrix(g, sparse = FALSE)
  # with mode = "row" the node vectors are the rows of A
  if (mode == "row") {
    A = t(A)
  }
  if (type == "cosine") {
    # normalize each column by its Euclidean norm;
    # note that isolated nodes have zero norm, yielding undefined similarities
    euclidean = function(x) {sqrt(x %*% x)}
    d = apply(A, 2, euclidean)
    D = diag(1/d)
    S = D %*% t(A) %*% A %*% D
  }
  if (type == "pearson") {
    # correlation coefficients among the columns of A
    S = cor(A)
  }
  return(S)
}
```
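A possible usage of the function above, on an arbitrary random graph:

```r
library(igraph)

g = sample_gnp(20, 0.3)
S_cos = similarity(g, type = "cosine")   # cosine similarity matrix
S_cor = similarity(g, type = "pearson")  # Pearson correlation matrix
```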