network analysis in R

Nodes/vertices are connected via edges. a 1 in the adjacency matrix indicates an edge exists between nodes in the matrix

each row represents an edge between two individual vertices in each column

first number indicates number of vertices (7)
second number indicates number of edges (7)
final row shows all edges in the network

number and which vertices
- V(g)
number and which edges
- E(g)
number of vertices
- gorder(g)
number of edges
- gsize(g)

Creating an igraph object

Here you will learn how to create an igraph ‘object’ from data stored in an edgelist. The data are friendships in a group of students. You will also learn how to make a basic visualization of the network.

Each row of the friends dataframe represents an edge in the network.

friends <- read.csv("_data/friends.csv")

# Load igraph
library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

# Inspect the first few rows of the dataframe 'friends'
head(friends)

##    name1  name2
## 1 Jessie Sidney
## 2 Jessie  Britt
## 3 Sidney  Britt
## 4 Sidney Donnie
## 5   Karl  Berry
## 6 Sidney   Rene

# Convert friends dataframe to a matrix
friends.mat <- as.matrix(friends)

# Convert friends matrix to an igraph object
g <- graph.edgelist(friends.mat, directed = FALSE)

# Make a very basic plot of the network
plot(g)

Counting vertices and edges

A lot of basic information about a network can be extracted from an igraph object. In this exercise you will learn how to count the vertices and edges from a network by applying several functions to the graph object g.

Each row of the friends dataframe represents an edge in the network.

# Subset vertices and edges
V(g)

## + 16/16 vertices, named, from 9e2d785:
##  [1] Jessie  Sidney  Britt   Donnie  Karl    Berry   Rene    Shayne  Elisha 
## [10] Whitney Odell   Lacy    Eugene  Jude    Rickie  Tommy

E(g)

## + 27/27 edges from 9e2d785 (vertex names):
##  [1] Jessie --Sidney  Jessie --Britt   Sidney --Britt   Sidney --Donnie 
##  [5] Karl   --Berry   Sidney --Rene    Britt  --Rene    Sidney --Shayne 
##  [9] Sidney --Elisha  Sidney --Whitney Jessie --Whitney Donnie --Odell  
## [13] Sidney --Odell   Rene   --Whitney Donnie --Shayne  Jessie --Lacy   
## [17] Rene   --Lacy    Elisha --Eugene  Eugene --Jude    Berry  --Odell  
## [21] Odell  --Rickie  Karl   --Odell   Britt  --Lacy    Elisha --Jude   
## [25] Whitney--Lacy    Britt  --Whitney Karl   --Tommy

# Count number of edges
gsize(g)

## [1] 27

# Count number of vertices
gorder(g)

## [1] 16

Network attributes

attr indicates that this igraph object has attribute name defined

the most common edge attribute is weight between edges, with line thickness indicating weights.

Node attributes and subsetting

In this exercise you will learn how to add attributes to vertices in the network and view them.

library(igraph)

# Inspect the objects 'genders' and 'ages'

genders <- c("M", "F", "F", "M", "M", "M", "F", "M", "M", "F", "M", "F", "M", "F", "M", "M")
#dput(genders)
ages <- c(18, 19, 21, 20, 22, 18, 23, 21, 22, 20, 20, 22, 21, 18, 19, 20)
#dput(ages)

genders

##  [1] "M" "F" "F" "M" "M" "M" "F" "M" "M" "F" "M" "F" "M" "F" "M" "M"

ages

##  [1] 18 19 21 20 22 18 23 21 22 20 20 22 21 18 19 20

# Create new vertex attribute called 'gender'
g <- set_vertex_attr(g, "gender", value = genders)

# Create new vertex attribute called 'age'
g <- set_vertex_attr(g, "age", value = ages)

# View all vertex attributes in a list
vertex_attr(g)

## $name
##  [1] "Jessie"  "Sidney"  "Britt"   "Donnie"  "Karl"    "Berry"   "Rene"   
##  [8] "Shayne"  "Elisha"  "Whitney" "Odell"   "Lacy"    "Eugene"  "Jude"   
## [15] "Rickie"  "Tommy"  
## 
## $gender
##  [1] "M" "F" "F" "M" "M" "M" "F" "M" "M" "F" "M" "F" "M" "F" "M" "M"
## 
## $age
##  [1] 18 19 21 20 22 18 23 21 22 20 20 22 21 18 19 20

# View attributes of first five vertices in a dataframe
V(g)[[1:5]]

## + 5/16 vertices, named, from 9e2d785:
##     name gender age
## 1 Jessie      M  18
## 2 Sidney      F  19
## 3  Britt      F  21
## 4 Donnie      M  20
## 5   Karl      M  22

Edge attributes and subsetting

In this exercise you will learn how to add attributes to edges in the network and view them. For instance, we will add the attribute ‘hours’ that represents how many hours per week each pair of friends spend with each other.

library(igraph)

# View hours
hours <- c(1, 2, 2, 1, 2, 5, 5, 1, 1, 3, 2, 1, 1, 5, 1, 2, 4, 1, 3, 1, 1, 1, 4, 1, 3, 3, 4)
#dput(hours)
hours

##  [1] 1 2 2 1 2 5 5 1 1 3 2 1 1 5 1 2 4 1 3 1 1 1 4 1 3 3 4

# Create new edge attribute called 'hours'
g <- set_edge_attr(g, "hours", value = hours)

# View edge attributes of graph object
edge_attr(g)

## $hours
##  [1] 1 2 2 1 2 5 5 1 1 3 2 1 1 5 1 2 4 1 3 1 1 1 4 1 3 3 4

# Find all edges that include "Britt"
E(g)[[inc('Britt')]]

## + 5/27 edges from 9e2d785 (vertex names):
##      tail    head tid hid hours
## 2  Jessie   Britt   1   3     2
## 3  Sidney   Britt   2   3     2
## 7   Britt    Rene   3   7     5
## 23  Britt    Lacy   3  12     4
## 26  Britt Whitney   3  10     3

# Find all pairs that spend 4 or more hours together per week
E(g)[[hours>=4]]

## + 6/27 edges from 9e2d785 (vertex names):
##      tail    head tid hid hours
## 6  Sidney    Rene   2   7     5
## 7   Britt    Rene   3   7     5
## 14   Rene Whitney   7  10     5
## 17   Rene    Lacy   7  12     4
## 23  Britt    Lacy   3  12     4
## 27   Karl   Tommy   5  16     4

Visualizing attributes

In this exercise we will learn how to create igraph objects with attributes directly from dataframes and how to visualize attributes in plots. We will use a second network of friendship connections between students.

library(igraph)

friends1_edges <- read.csv("_data/friends1_edges.csv")
friends1_nodes <- read.csv("_data/friends1_nodes.csv")

# Create an igraph object with attributes directly from dataframes
g1 <- graph_from_data_frame(d = friends1_edges, vertices = friends1_nodes, directed = FALSE)


# Subset edges greater than or equal to 5 hours
E(g1)[[hours >= 5]]

## + 4/25 edges from 9e77c77 (vertex names):
##         tail      head tid hid hours
## 5     Kelley Valentine   3   6     5
## 8     Ronald   Jasmine   4   8     5
## 12 Valentine     Perry   6  15     5
## 15   Jasmine      Juan   8   9     6

# Set vertex color by gender
V(g1)$color <- ifelse(V(g1)$gender == "F", "orange", "dodgerblue")

# Plot the graph
plot(g1, vertex.label.color = "black")

Network visualization

Choosing the appropriate layout

minimize edge crossing
do not allow vertices to overlap
make edge lengths as uniform as possible
increase symmetry of the network as much as possible
position more influential nodes towards the center

igraph network layouts

The igraph package provides several built in layout algorithms for network visualization. Depending upon the size of a given network different layouts may be more effective in communicating the structure of the network. Ideally the best layout is the one that minimizes the number of edges that cross each other in the network. In this exercise you will explore just a few of the many default layout algorithms. Re-executing the code for each plot will lead to a slightly different version of the same layout type. Doing this a few times can help to find the best looking visualization for your network.

library(igraph)

# Plot the graph object g1 in a circle layout
plot(g1, vertex.label.color = "black", layout = layout_in_circle(g1))

# Plot the graph object g1 in a Fruchterman-Reingold layout 
plot(g1, vertex.label.color = "black", layout = layout_with_fr(g1))

# Plot the graph object g1 in a Tree layout 
m <- layout_as_tree(g1)
plot(g1, vertex.label.color = "black", layout = m)

# Plot the graph object g1 using igraph's chosen layout 
m1 <- layout_nicely(g1)
plot(g1, vertex.label.color = "black", layout = m1)

Visualizing edges

In this exercise you will learn how to change the size of edges in a network based on their weight, as well as how to remove edges from a network which can sometimes be helpful in more effectively visualizing large and highly clustered networks. In this introductory chapter, we have just scratched the surface of what’s possible in visualizing igraph networks. You will continue to develop these skills in future chapters.

library(igraph)

# Create a vector of weights based on the number of hours each pair spend together
w1 <- E(g1)$hours

# Plot the network varying edges by weights
m1 <- layout_nicely(g1)
plot(g1, 
        vertex.label.color = "black", 
        edge.color = 'black',
        edge.width = w1,
        layout = m1)

# Create a new igraph object by deleting edges that are less than 2 hours long 
g2 <- delete_edges(g1, E(g1)[hours < 2])


# Plot the new graph 
w2 <- E(g2)$hours
m2 <- layout_nicely(g2)

plot(g2, 
     vertex.label.color = "black", 
     edge.color = 'black',
     edge.width = w2,
     layout = m2)

Directed networks

Directed igraph objects

In this exercise you will learn how to create a directed graph from a dataframe, how to inspect whether a graph object is directed and/or weighted and how to extract those vertices at the beginning and end of directed edges.

library(igraph)

measles <- read.csv("_data/measles.csv")
head(measles)

##   from to
## 1   45  1
## 2   45  2
## 3  172  3
## 4  180  4
## 5   45  5
## 6  180  6

tail(measles)

##     from  to
## 179  184 182
## 180  184 183
## 181   82 185
## 182   45 186
## 183   82 187
## 184  175 188

# Get graph object
g <- graph_from_data_frame(measles, directed = TRUE)

# Is the graph directed?
is.directed(g)

## [1] TRUE

# Is the graph weighted?
is.weighted(g)

## [1] FALSE

# Where does each edge originate from?
table(head_of(g, E(g)))

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  17  18  19  20  21 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  61  62  63 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 184 185 186 187 
##   1   1   1   1

Identifying edges for each vertex

In this exercise you will learn how to identify particular edges. You will learn how to determine if an edge exists between two vertices as well as finding all vertices connected in either direction to a given vertex.

library(igraph)

# Make a basic plot
plot(g, 
     vertex.label.color = "black", 
     edge.color = 'gray77',
     vertex.size = 0,
     edge.arrow.size = 0.1,
     layout = layout_nicely(g))

# Is there an edge going from vertex 184 to vertex 178?
g['184', '178']

## [1] 1

# Is there an edge going from vertex 178 to vertex 184?
g['178', '184']

## [1] 0

# Show all edges going to or from vertex 184
incident(g, '184', mode = c("all"))

## + 6/184 edges from 9f376dc (vertex names):
## [1] 184->45  184->182 184->181 184->178 184->183 184->177

# Show all edges going out from vertex 184
incident(g, '184', mode = c("out"))

## + 6/184 edges from 9f376dc (vertex names):
## [1] 184->45  184->182 184->181 184->178 184->183 184->177

Relationships between vertices

Neighbors

Often in network analysis it is important to explore the patterning of connections that exist between vertices. One way is to identify neighboring vertices of each vertex. You can then determine which neighboring vertices are shared even by unconnected vertices indicating how two vertices may have an indirect relationship through others. In this exercise you will learn how to identify neighbors and shared neighbors between pairs of vertices.

library(igraph)

# Identify all neighbors of vertex 12 regardless of direction
neighbors(g, '12', mode = c('all'))

## + 5/187 vertices, named, from 9f376dc:
## [1] 45  13  72  89  109

# Identify other vertices that direct edges towards vertex 12
neighbors(g, '12', mode = c('in'))

## + 1/187 vertex, named, from 9f376dc:
## [1] 45

# Identify any vertices that receive an edge from vertex 42 and direct an edge to vertex 124
n1 <- neighbors(g, '42', mode = c('out'))
n2 <- neighbors(g, '124', mode = c('in'))
intersection(n1, n2)

## + 1/187 vertex, named, from 9f376dc:
## [1] 7

Distances between vertices

The inter-connectivity of a network can be assessed by examining the number and length of paths between vertices. A path is simply the chain of connections between vertices. The number of intervening edges between two vertices represents the geodesic distance between vertices. Vertices that are connected to each other have a geodesic distance of 1. Those that share a neighbor in common but are not connected to each other have a geodesic distance of 2 and so on. In directed networks, the direction of edges can be taken into account. If two vertices cannot be reached via following directed edges they are given a geodesic distance of infinity. In this exercise you will learn how to find the longest paths between vertices in a network and how to discern those vertices that are withinconnections of a given vertex. For disease transmission networks such as the measles dataset this helps you to identify how quickly the disease spreads through the network.

library(igraph)

#- Which two vertices are the furthest apart in the graph ?
farthest_vertices(g)

## $vertices
## + 2/187 vertices, named, from 9f376dc:
## [1] 184 162
## 
## $distance
## [1] 5

#- Shows the path sequence between two furthest apart vertices.
get_diameter(g)

## + 6/187 vertices, named, from 9f376dc:
## [1] 184 178 42  7   123 162

# Identify vertices that are reachable within two connections from vertex 42
ego(g, 2, '42', mode = c('out'))

## [[1]]
## + 13/187 vertices, named, from 9f376dc:
##  [1] 42  7   106 43  123 101 120 124 125 128 129 108 127

# Identify vertices that can reach vertex 42 within two connections
ego(g, 2, '42', mode = c('in'))

## [[1]]
## + 3/187 vertices, named, from 9f376dc:
## [1] 42  178 184

Important and influential vertices

degree: how many connections
eigenvector centrality: vertices that are in turn connected to other vertices that are highly interconnected in the network
betweenness: how frequently a vertex lies on the shortest path(s) between any two vertices in the network.

Identifying key vertices

Perhaps the most straightforward measure of vertex importance is the degree of a vertex. The out-degree of a vertex is the number of other individuals to which a vertex has an outgoing edge directed to. The in-degree is the number of edges received from other individuals. In the measles network, individuals that infect many other individuals will have a high out-degree. In this exercise you will identify whether individuals infect equivalent amount of other children or if there are key children who have high out-degrees and infect many other children.

library(igraph)

# Calculate the out-degree of each vertex
g.outd <- degree(g, mode = c("out"))

# View a summary of out-degree
table(g.outd)

## g.outd
##   0   1   2   3   4   6   7   8  30 
## 125  21  16  12   6   2   3   1   1

# Make a histogram of out-degrees
hist(g.outd, breaks = 30)

# Find the vertex that has the maximum out-degree
which.max(g.outd)

## 45 
##  1

Betweenness

Another measure of the importance of a given vertex is its betweenness. This is an index of how frequently the vertex lies on shortest paths between any two vertices in the network. It can be thought of as how critical the vertex is to the flow of information through a network. Individuals with high betweenness are key bridges between different parts of a network. In our measles transmission network, vertices with high betweenness are those children who were central to passing on the disease to other parts of the network. In this exercise, you will identify the betweenness score for each vertex and then make a new plot of the network adjusting the vertex size by its betweenness score to highlight these key vertices.

library(igraph)

# Calculate betweenness of each vertex
g.b <- betweenness(g, directed = TRUE)

# Show histogram of vertex betweenness
hist(g.b, breaks = 80)

# Create plot with vertex size determined by betweenness score
plot(g, 
     vertex.label = NA,
     edge.color = 'black',
     vertex.size = sqrt(g.b)+1,
     edge.arrow.size = 0.05,
     layout = layout_nicely(g))

Visualizing important nodes and edges

One issue with the measles dataset is that there are three individuals for whom no information is known about who infected them. One of these individuals (vertex 184) appears ultimately responsible for spreading the disease to many other individuals even though they did not directly infect too many individuals. However, because vertex 184 has no incoming edge in the network they appear to have low betweenness. One way to explore the importance of this vertex is by visualizing the geodesic distances of connections going out from this individual. In this exercise you shall create a plot of these distances from this patient zero.

# Make an ego graph
g184 <- make_ego_graph(g, diameter(g), nodes = '184', mode = c("all"))[[1]]

# Get a vector of geodesic distances of all vertices from vertex 184 
dists <- distances(g184, "184")

# Create a color palette of length equal to the maximal geodesic distance plus one.
colors <- c("black", "red", "orange", "blue", "dodgerblue", "cyan")

# Set color attribute to vertices of network g184.
V(g184)$color <- colors[dists+1]

# Visualize the network based on geodesic distance from vertex 184 (patient zero).
plot(g184, 
     vertex.label = dists, 
     vertex.label.color = "white",
     vertex.label.cex = .6,
     edge.color = 'black',
     vertex.size = 7,
     edge.arrow.size = .05,
     main = "Geodesic Distances from Patient Zero"
     )

Characterizing network structures - introduction

density: proportion of edges that actually do exist in a network out of all those that potentially exist between every pair of vertices
- a measure of how interconnected a network is

average path length: the mean lengths between the shortest paths of all pairs of vertices in the network

Forrest Gump network

In this chapter you will use a social network based on the movie Forrest Gump. Each edge of the network indicates that those two characters were in at least one scene of the movie together. Therefore this network is undirected. To familiarize yourself with the network, you will first create the network object from the raw dataset. Then, you will identify key vertices using a measure called eigenvector centrality. Individuals with high eigenvector centrality are those that are highly connected to other highly connected individuals. You will then make an exploratory visualization of the network.

gump <- read.csv("_data/gump.csv")

library(igraph)

# Inspect Forrest Gump Movie dataset
head(gump)

##              V1        V2
## 1 ABBIE HOFFMAN     JENNY
## 2 ABBIE HOFFMAN POLICEMAN
## 3     ANCHORMAN   FORREST
## 4     ANCHORMAN    LT DAN
## 5     ANCHORMAN     MARGO
## 6     ANCHORMAN  MRS GUMP

# Make an undirected network
g <- graph_from_data_frame(gump, directed = FALSE)

# Identify key nodes using eigenvector centrality
g.ec <- eigen_centrality(g)
which.max(g.ec$vector)

## FORREST 
##      36

# Plot Forrest Gump Network
plot(g,
vertex.label.color = "black", 
vertex.label.cex = 0.6,
vertex.size = 25*(g.ec$vector),
edge.color = 'gray88',
main = "Forrest Gump Network"
)

Network density and average path length

The first graph level metric you will explore is the density of a graph. This is essentially the proportion of all potential edges between vertices that actually exist in the network graph. It is an indicator of how well connected the vertices of the graph are.

Another measure of how interconnected a network is average path length. This is calculated by determining the mean of the lengths of the shortest paths between all pairs of vertices in the network. The longest path length between any pair of vertices is called the diameter of the network graph. You will calculate the diameter and average path length of the original graph g.

library(igraph)

# Get density of a graph
gd <- edge_density(g)
gd

## [1] 0.06199954

#Get the diameter of the graph g
diameter(g, directed = FALSE)

## [1] 4

#Get the average path length of the graph g
g.apl <- mean_distance(g, directed = FALSE)
g.apl

## [1] 1.994967

Understanding network structures

Random graphs & randomization tests

Generate 1000 random graphs based on the original network

e.g., with the same number of vertices and approximate density.

Calculate the average path length of the original network.
Calculate the average path length of the 1000 random networks.
Determine how many random networks have an average path length greater or less than the original network’s average path length.

Random graphs

Generating random graphs is an important method for investigating how likely or unlikely other network metrics are likely to occur given certain properties of the original graph. The simplest random graph is one that has the same number of vertices as your original graph and approximately the same density as the original graph. Here you will create one random graph that is based on the original Forrest Gump Network.

library(igraph)

# Create one random graph with the same number of nodes and edges as g
g.random <- erdos.renyi.game(n = gorder(g), p.or.m = gd, type = "gnp")

g.random

## IGRAPH a17f8cc U--- 94 272 -- Erdos renyi (gnp) graph
## + attr: name (g/c), type (g/c), loops (g/l), p (g/n)
## + edges from a17f8cc:
##  [1]  2-- 3  2-- 5  4-- 5  1-- 9  2--10  7--13  7--15  4--17  3--18  6--18
## [11] 12--19 18--19  5--20  3--22 11--22 11--23 14--25  9--26 21--26  1--27
## [21] 14--27  2--28 14--28 16--28 19--28 17--29  2--30  9--30  8--31 17--31
## [31] 21--31 26--31  4--33 22--33  4--34 15--34 32--34 33--35 15--37 30--37
## [41] 33--37  3--38  9--38 30--38 14--39 16--39 34--39 10--40 15--40 29--40
## [51] 35--41 17--43 30--43  5--45 32--45 38--45 42--45 13--46 25--46 30--46
## [61] 36--46 42--46 11--47 16--47 17--47 39--47  7--48 26--48 37--48 24--49
## [71] 25--50 30--50 15--51 19--51 20--51 32--51 11--52 14--52 15--52 23--52
## + ... omitted several edges

plot(g.random)

# Get density of new random graph `g.random`
edge_density(g.random)

## [1] 0.06222832

#Get the average path length of the random graph g.random
mean_distance(g.random, directed = FALSE)

## [1] 2.749256

You can also use average.path.length() to compute this value.

Network randomizations

In the previous exercise you may have noticed that the average path length of the Forrest Gump network was smaller than the average path length of the random network. If you ran the code a few times you will have noticed that it is nearly always lower in the Forrest Gump network than the random network. What this suggests is that the Forrest Gump network is more highly interconnected than each random network even though the random networks have the same number of vertices and approximately identical graph densities. Rather than re-running this code many times, you can more formally address this by creating 1000 random graphs based on the number of vertices and density of the original Forrest Gump graph. Then, you can see how many times the average path length of the random graphs is less than the original Forrest Gump network. This is called a randomization test.

The graph g, and its average path length (that you calculated in the previous exercise), g.apl are in your workspace.

library(igraph)

# Generate 1000 random graphs
gl <- vector('list',1000)
  
for(i in 1:1000){
  gl[[i]] <- erdos.renyi.game(n = gorder(g), p.or.m = gd, type = "gnp")
}

# Calculate average path length of 1000 random graphs
gl.apls <- unlist(lapply(gl, mean_distance, directed = FALSE))

# Plot the distribution of average path lengths
hist(gl.apls, xlim = range(c(1.5, 6)))
abline(v = g.apl, col = "red", lty = 3, lwd = 2)

# Calculate the proportion of graphs with an average path length lower than our observed
mean(gl.apls < g.apl)

## [1] 0

As you can see, the Forrest Gump network is far more interconnected than we would expect by chance as zero random networks have an average path length smaller than the Forrest Gump network’s average path length.

Network substructures

global transitivity: probability that the adjacent vertices of a given vertex are connected

local transitivity: calculates the proportion of closed triangles that the vertex is a part of out of the theoretical number of closed triangles it could be a part of, given its connections.

clique: every vertex is connected to every other vertex.
- all of their triangles are closed.

Triangles and transitivity

Another important measure of local connectivity in a network graph involves investigating triangles (also known as triads). In this exercise you will find all closed triangles that exist in a network. This means that an edge exists between three given vertices. You can then calculate the transitivity of the network. This is equivalent to the proportion of all possible triangles in the network that are closed. You will also learn how to identify the number of closed triangles that any given vertex is a part of and its local transitivity - that is, the proportion of closed triangles that the vertex is a part of given the theoretical number of triangles it could be a part of.

library(igraph)

# Show first 20 triangles in the network.
net_tri <- matrix(triangles(g), nrow = 3)
cat("There are", ncol(net_tri), "triangles in this network. Showing you the first 20 below")

## There are 384 triangles in this network. Showing you the first 20 below

head(net_tri, c(3,20))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]   36   36   36   36   36   36   36   36   36    36    36    36    36    36
## [2,]    1    1    1    1    2    4    4    6    6     6     6     7     7     8
## [3,]   83   38   39   66   68   57   24   27   75    40    45     8    69    69
##      [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]    36    36    36    36    36    36
## [2,]    11    11    11    12    12    13
## [3,]    12    13    70    70    13    70

# Count the number of triangles that vertex "BUBBA" is in.
count_triangles(g, vids = 'BUBBA')

## [1] 37

# Calculate the global transitivity of the network.
g.tr <- transitivity(g)
g.tr

## [1] 0.1918082

# Calculate the local transitivity for vertex BUBBA.
transitivity(g, vids = 'BUBBA', type = "local")

## [1] 0.6727273

Transitivity randomizations

As you did for the average path length, let’s investigate if the global transitivity of the Forrest Gump network is significantly higher than we would expect by chance for random networks of the same size and density. You can compare Forrest Gump’s global transitivity to 1000 other random networks.

library(igraph)

# Calculate average transitivity of 1000 random graphs
gl.tr <- lapply(gl, transitivity)
gl.trs <- unlist(gl.tr)

# Get summary statistics of transitivity scores
summary(gl.trs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03156 0.05375 0.06126 0.06143 0.06881 0.09492

# Calculate the proportion of graphs with a transitivity score higher than Forrest Gump's network
mean(gl.trs > g.tr)

## [1] 0

Cliques

Identifying cliques is a common practice in undirected networks. In a clique every two unique nodes are adjacent - that means that every individual node is connected to every other individual node in the clique. In this exercise you will identify the largest cliques in the Forrest Gump network. You will also identify the number of maximal cliques of various sizes. A clique is maximal if it cannot be extended to a larger clique.

library(igraph)

# Identify the largest cliques in the network
largest_cliques(g)

## [[1]]
## + 9/94 vertices, named, from a130cd0:
## [1] FORREST   STRONGARM BUBBA     DALLAS    LT DAN    MAN       SGT SIMS 
## [8] SOLDIER   SONG     
## 
## [[2]]
## + 9/94 vertices, named, from a130cd0:
## [1] FORREST JENNY   EMCEE   MAN #   MAN #1  MAN #2  MAN #3  MAN #5  MEN

# Determine all maximal cliques in the network and assign to object 'clq'
clq <- max_cliques(g)

# Calculate the size of each maximal clique.
table(unlist(lapply(clq, length)))

## 
##  2  3  4  5  6  7  9 
## 12 24  7  2  4  2  2

Visualize largest cliques

Often in network visualization you will need to subset part of a network to inspect the inter-connections of particular vertices. Here, you will create a visualization of the largest cliques in the Forrest Gump network. In the last exercise you determined that there were two cliques of size 9. You will plot these side-by-side after creating two new igraph objects by subsetting out these cliques from the main network. The function subgraph() enables you to choose which vertices to keep in a new network object.

library(igraph)

# Assign largest cliques output to object 'lc'
lc <- largest_cliques(g)

# Create two new undirected subgraphs, each containing only the vertices of each largest clique.
gs1 <- as.undirected(induced_subgraph(g, lc[[1]]))
gs2 <- as.undirected(induced_subgraph(g, lc[[2]]))


# Plot the two largest cliques side-by-side

par(mfrow=c(1,2)) # To plot two plots side-by-side

plot(gs1,
     vertex.label.color = "black", 
     vertex.label.cex = 0.9,
     vertex.size = 0,
     edge.color = 'gray28',
     main = "Largest Clique 1",
     layout = layout.circle(gs1)
)

plot(gs2,
     vertex.label.color = "black", 
     vertex.label.cex = 0.9,
     vertex.size = 0,
     edge.color = 'gray28',
     main = "Largest Clique 2",
     layout = layout.circle(gs2)
)

Close relationships: assortativity & reciprocity

reciprocity: the proportion of edges that are symmetrical
- the proportion of outgoing edges that also have an incoming edge

Assortativity

In this exercise you will determine the assortativity() of the second friendship network from the first chapter. This is a measure of how preferentially attached vertices are to other vertices with identical attributes. You will also determine the degree assortativity which determines how preferentially attached are vertices to other vertices of a similar degree.

# Plot the network
plot(g1)

# Convert the gender attribute into a numeric value
values <- as.numeric(factor(V(g1)$gender))

# Calculate the assortativity of the network based on gender
assortativity(g1, values)

## [1] 0.1319444

# Calculate the assortativity degree of the network
assortativity.degree(g1, directed = FALSE)

## [1] 0.4615385

Using randomizations to assess assortativity

In this exercise you will determine how likely the observed assortativity in the friendship network is given the genders of vertices by performing a randomization procedure. You will randomly permute the gender of vertices in the network 1000 times and recalculate the assortativity for each random network.

# Calculate the observed assortativity
observed.assortativity <- assortativity(g1, values)

# Calculate the assortativity of the network randomizing the gender attribute 1000 times
results <- vector('list', 1000)
for(i in 1:1000){
  results[[i]] <- assortativity(g1, sample(values))
}

# Plot the distribution of assortativity values and add a red vertical line at the original observed value
hist(unlist(results))
abline(v = observed.assortativity, col = "red", lty = 3, lwd=2)

Reciprocity

The reciprocity of a directed network reflects the proportion of edges that are symmetrical. That is, the proportion of outgoing edges that also have an incoming edge. It is commonly used to determine how inter-connected directed networks are. An example of a such a network may be grooming exchanges in chimpanzees. Certain chimps may groom another but do not get groomed by that individual, whereas other chimps may both groom each other and so would have a reciprocal tie.

library(igraph)

# Make a plot of the chimp grooming network
plot(g,
     edge.color = "black",
     edge.arrow.size = 0.3,
     edge.arrow.width = 0.5)

# Calculate the reciprocity of the graph
reciprocity(g)

[1] 0.2711864

Community detection

works by building a larger community in each iteration, and computing a modularity score at each iteration.

works by dividing the network in each iteration based on vertices with higher relative betweenness centrality scores.

Fast-greedy community detection

The first community detection method you will try is fast-greedy community detection. You will use the Zachary Karate Club network. This social network contains 34 club members and 78 edges. Each edge indicates that those two club members interacted outside the karate club as well as at the club. Using this network you will determine how many sub-communities the network has and which club members belong to which subgroups. You will also plot the networks by community membership.

library(igraphdata)

data(karate, package = "igraphdata")
g <- karate

# Perform fast-greedy community detection on network graph
kc <- fastgreedy.community(g)

# Determine sizes of each community
sizes(kc)

## Community sizes
##  1  2  3 
## 18 11  5

# Determine which individuals belong to which community
membership(kc)

##    Mr Hi  Actor 2  Actor 3  Actor 4  Actor 5  Actor 6  Actor 7  Actor 8 
##        2        2        2        2        3        3        3        2 
##  Actor 9 Actor 10 Actor 11 Actor 12 Actor 13 Actor 14 Actor 15 Actor 16 
##        1        1        3        2        2        2        1        1 
## Actor 17 Actor 18 Actor 19 Actor 20 Actor 21 Actor 22 Actor 23 Actor 24 
##        3        2        1        2        1        2        1        1 
## Actor 25 Actor 26 Actor 27 Actor 28 Actor 29 Actor 30 Actor 31 Actor 32 
##        1        1        1        1        1        1        1        1 
## Actor 33   John A 
##        1        1

# Plot the community structure of the network
plot(kc, g)

Edge-betweenness community detection

An alternative community detection method is edge-betweenness. In this exercise you will repeat the community detection of the karate club using this method and compare the results visually to the fast-greedy method.

# Perform edge-betweenness community detection on network graph
gc = edge.betweenness.community(g)

## Warning in edge.betweenness.community(g): At community.c:460 :Membership vector
## will be selected based on the lowest modularity score.

## Warning in edge.betweenness.community(g): At community.c:467 :Modularity
## calculation with weighted edge betweenness community detection might not make
## sense -- modularity treats edge weights as similarities while edge betwenness
## treats them as distances

# Determine sizes of each community
sizes(gc)

## Community sizes
##  1  2  3  4  5  6 
##  9  4  5 10  4  2

# Plot community networks determined by fast-greedy and edge-betweenness methods side-by-side
par(mfrow = c(1, 2)) 
plot(kc, g)
plot(gc, g)

leading.eigenvector.community() is another community algorithm you can explore

Interactive network visualizations

Interactive networks with threejs

In this course you have exclusively used igraph to make basic static network plots. There are many packages available to make network plots. One very useful one is threejs which allows you to make interactive network visualizations. This package also integrates seamlessly with igraph. In this exercise you will make a basic interactive network plot of the karate club network using the threejs package. Once you have produced the visualization be sure to move the network around with your mouse. You should be able to scroll in and out of the network as well as rotate the network.

library(igraph)
library(threejs)

# Set a vertex attribute called 'color' to 'dodgerblue' 
g <- set_vertex_attr(g, "color", value = "dodgerblue")

# Redraw the graph and make the vertex size 1
graphjs(g, vertex.size = 1)

The Javascript did not render for RStudio in Windows.

Sizing vertices in threejs

As with all network visualizations it is often worth adjusting the size of vertices to illustrate their relative importance. This is also straightforward in threejs. In this exercise you will create an interactive threejs plot of the karate club network and size vertices based on their relative eigenvector centrality.

# Create numerical vector of vertex eigenvector centralities 
ec <- as.numeric(eigen_centrality(g)$vector)

# Create new vector 'v' that is equal to the square-root of 'ec' multiplied by 5
v <- 5*sqrt(ec)

# Plot threejs plot of graph setting vertex size to v
graphjs(g, vertex.size = v)

The Javascript did not render for RStudio in Windows.

3D community network graph

Finally in this exercise you will create an interactive threejs plot with the vertices based on their community membership as produced by the fast-greedy community detection method.

# Create an object 'i' containing the memberships of the fast-greedy community detection
i <-  membership(kc)

# Check the number of different communities
sizes(kc)

## Community sizes
##  1  2  3 
## 18 11  5

# Add a color attribute to each vertex, setting the color based on community membership
g <- set_vertex_attr(g, "color", value = c("yellow", "blue", "red")[i])

# Plot the graph using threejs
graphjs(g)

The Javascript did not render for RStudio in Windows.

network analysis in R

DataCamp - James Curley

11/28/2020