Similarity

Similarity is based on the following thesis:

Two nodes are similar to each other if they share many neighbors.

In the case of directed graphs, we may replace neighbors with either predecessor or successor nodes. For instance, in a friendship network like Facebook, two people are similar if they have many friends in common. In a directed social network like Twitter, two users are similar if they share many followers or many followees.

Let $A = (a_{i,j})$ be the adjacency matrix of a possibly weighted graph. The general idea is to associate each node $i$ with the $i$-th row of the adjacency matrix. If the graph is directed, we can associate node $i$ with either the $i$-th adjacency row, if we focus on out-going links, or with the $i$-th adjacency column, if we focus on in-coming links. Hence each node is associated with a vector in $\mathbb{R}^n$, where $n$ is the number of nodes. The similarity (or dissimilarity) of two nodes $i$ and $j$ is then measured using a similarity (or distance) function on the associated vectors.
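As a concrete illustration, here is a minimal R sketch of this association, using igraph; the toy graph and node names are chosen arbitrarily:

```r
library(igraph)

# a small directed toy graph (purely illustrative)
g = graph_from_literal(a -+ b, a -+ c, b -+ c, c -+ a)
A = as_adjacency_matrix(g, sparse = FALSE)

A["a", ]  # out-going links of node a (adjacency row)
A[, "a"]  # in-coming links of node a (adjacency column)
```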

Let us focus on unweighted undirected graphs with $n$ nodes. We denote with $k_i = \sum_k A_{i,k}$ the degree of node $i$ and with $n_{i,j} = \sum_k A_{i,k} A_{j,k}$ the number of neighbors that nodes $i$ and $j$ have in common. Moreover, $A_i$ is the $i$-th row of $A$, a Boolean (0-1) vector of length $n$.
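Both quantities can be read directly off the adjacency matrix; a minimal R sketch, with the graph chosen arbitrarily for illustration:

```r
library(igraph)

g = sample_gnp(10, 0.4)                    # a random toy graph
A = as_adjacency_matrix(g, sparse = FALSE)

k = rowSums(A)  # degrees: k[i] = sum over l of A[i,l]
N = A %*% A     # common neighbors: N[i,j] = sum over l of A[i,l] * A[j,l]
                # (A is symmetric for undirected graphs)
```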

We are going to describe a pool of similarity measures. The first one is cosine similarity: the similarity $\sigma_{i,j}$ of nodes $i$ and $j$ is the cosine of the angle between vectors $A_i$ and $A_j$:

$$\sigma_{i,j} = \cos(A_i, A_j) = \frac{A_i \cdot A_j}{\|A_i\| \, \|A_j\|} = \frac{\sum_k A_{i,k} A_{j,k}}{\sqrt{\sum_k A_{i,k}^2} \, \sqrt{\sum_k A_{j,k}^2}}$$
The measure runs from 0 (orthogonal vectors or maximum dissimilarity) to 1 (parallel vectors or maximum similarity). Since the involved vectors are 0-1 vectors, we have that:
$$\sigma_{i,j} = \frac{n_{i,j}}{\sqrt{k_i k_j}}$$
That is, cosine similarity between i and j is the number of neighbors shared by i and j divided by the geometric mean of their degrees.
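As a numerical sanity check, a short sketch comparing the vector definition with the simplified formula; the graph and node pair are chosen arbitrarily, and both nodes are assumed to have positive degree:

```r
library(igraph)

g = sample_gnp(10, 0.5)
A = as_adjacency_matrix(g, sparse = FALSE)
k = rowSums(A)
N = A %*% A

i = 1; j = 2
# cosine of the angle between rows A[i,] and A[j,]
cos_def = sum(A[i, ] * A[j, ]) / (sqrt(sum(A[i, ]^2)) * sqrt(sum(A[j, ]^2)))
# shared neighbors over the geometric mean of the degrees
cos_simple = N[i, j] / sqrt(k[i] * k[j])
all.equal(cos_def, cos_simple)  # TRUE (when both degrees are positive)
```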

An alternative similarity measure is the Pearson correlation coefficient: the similarity $\sigma_{i,j}$ of nodes $i$ and $j$ is the correlation coefficient between vectors $A_i$ and $A_j$:

$$\sigma_{i,j} = \frac{\mathrm{cov}(A_i, A_j)}{\mathrm{sd}(A_i) \, \mathrm{sd}(A_j)} = \frac{\sum_k (A_{i,k} - \bar{A}_i)(A_{j,k} - \bar{A}_j)}{\sqrt{\sum_k (A_{i,k} - \bar{A}_i)^2} \, \sqrt{\sum_k (A_{j,k} - \bar{A}_j)^2}}$$

where $\bar{A}_i$ denotes the mean of the elements of $A_i$.
The measure runs from -1 (maximum negative correlation or maximum dissimilarity) to 1 (maximum positive correlation or maximum similarity). Notice that values close to 0 indicate no correlation, hence neither similarity nor dissimilarity. Again, since the involved vectors are 0-1 vectors, it is not difficult to see that the numerator of the correlation coefficient, that is, the covariance between vectors $A_i$ and $A_j$, is:
$$\mathrm{cov}(A_i, A_j) = n_{i,j} - \frac{k_i k_j}{n}$$
Notice that $k_i k_j / n$ is the expected number of common neighbors between $i$ and $j$ if they chose their neighbors at random: the probability that a random neighbor of $i$ is also a neighbor of $j$ is $k_j / n$, hence the expected number of common neighbors between $i$ and $j$ is $k_i k_j / n$. Hence a positive covariance, or a similarity between $i$ and $j$, holds when $i$ and $j$ share more neighbors than we would expect by chance, while a negative covariance, or a dissimilarity between $i$ and $j$, happens when $i$ and $j$ have fewer neighbors in common than we would expect by chance.
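Again, a quick numerical check of this identity, under the same illustrative setup as before:

```r
library(igraph)

g = sample_gnp(10, 0.5)
A = as_adjacency_matrix(g, sparse = FALSE)
n = vcount(g)
k = rowSums(A)
N = A %*% A

i = 1; j = 2
# numerator of the correlation coefficient (covariance, up to normalization)
covariance = sum((A[i, ] - mean(A[i, ])) * (A[j, ] - mean(A[j, ])))
expected   = N[i, j] - k[i] * k[j] / n  # shared neighbors minus the chance level
all.equal(covariance, expected)  # TRUE
```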

```r
library(igraph)

similarity = function(g, type = "cosine", mode = "col") {
  A = as_adjacency_matrix(g, sparse = FALSE)
  # with mode = "row" the node vectors are the rows of A
  if (mode == "row") {
    A = t(A)
  }
  if (type == "cosine") {
    # normalize each column by its Euclidean norm;
    # note that isolated nodes have zero norm, yielding undefined similarities
    euclidean = function(x) {sqrt(x %*% x)}
    d = apply(A, 2, euclidean)
    D = diag(1/d)
    S = D %*% t(A) %*% A %*% D
  }
  if (type == "pearson") {
    # correlation coefficients among the columns of A
    S = cor(A)
  }
  return(S)
}
```
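A possible usage of the function above, on an arbitrary random graph:

```r
library(igraph)

g = sample_gnp(20, 0.3)
S_cos = similarity(g, type = "cosine")   # cosine similarity matrix
S_cor = similarity(g, type = "pearson")  # Pearson correlation matrix
```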