Exploring your data set

Dyads

Triads

3-digit code

  1. count of pairs of vertices connected by a mutual (bidirectional) edge
  2. count of pairs of vertices connected by an asymmetric edge
  3. count of pairs of unconnected vertices

Letter code

C. Cyclic
D. Single edges go Down
U. Single edges come Up
T. Transitive - if there is an edge from one vertex to a second and from the second to a third, then there is also an edge from the first to the third
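
Putting the two together: each triad class reported by triad_census() is labeled with three digits (the counts of mutual, asymmetric, and null dyad pairs within the triad), plus an optional letter. As a reference, a short sketch listing the 16 classes in the order igraph reports them:

# The 16 triad classes, in triad_census() output order
triad_classes <- c(
  "003", "012", "102", "021D", "021U", "021C", "111D", "111U",
  "030T", "030C", "201", "120D", "120U", "120C", "210", "300"
)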

Finding Dyads and Triads

Let’s think a little bit more about what we can learn from dyad and triad censuses of the overall graph. Because we are interested in understanding how items are purchased together and whether or not they are reciprocally purchased, dyad and triad censuses can provide a useful initial look. A dyad census will tell us how many items are purchased reciprocally vs. asymmetrically. The triad census will tell us which items might be important, specifically patterns 4, 9, and 12: each of these has one vertex with two out-edges and no in-edges. Edge density should also give some insight into how much clustering to expect in the graph.

The Amazon co-purchase graph, amzn_g, is available.

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
amzn_g <- read.graph("_data/amzn_g.gml",format=c("gml"))

# The graph, amzn_g, is available
amzn_g
## IGRAPH 40001a4 DN-- 10245 10754 -- 
## + attr: id (v/n), name (v/c)
## + edges from 40001a4 (vertex names):
##  [1] 44  ->42   179 ->71   410 ->730  415 ->741  656 ->1267 669 ->672 
##  [7] 672 ->669  689 ->690  689 ->1284 690 ->689  690 ->1284 730 ->410 
## [13] 741 ->909  786 ->1767 802 ->806  806 ->802  856 ->205  857 ->211 
## [19] 867 ->866  868 ->866  909 ->741  911 ->748  921 ->190  1015->151 
## [25] 1016->1015 1047->1049 1049->1047 1204->1491 1267->656  1272->669 
## [31] 1278->152  1282->943  1284->689  1285->1286 1286->1285 1290->1293
## [37] 1293->1290 1293->1606 1294->1295 1295->1294 1312->730  1350->2783
## [43] 1362->156  1366->190  1438->1580 1438->1581 1467->3996 1479->158 
## + ... omitted several edges
# Perform dyad census
dyad_census(amzn_g)
## $mut
## [1] 3199
## 
## $asym
## [1] 4356
## 
## $null
## [1] 52467335
# Perform triad census
triad_census(amzn_g)
##  [1] 179089386743     44610360     32763436          215         1906
##  [6]          507         1198          457          118            0
## [11]          301          170          119           33          239
## [16]          288
# Find the edge density
edge_density(amzn_g)
## [1] 0.0001024681

The dyad census shows us there were 3199 mutual connections, meaning pairs of items that were reciprocally purchased. There were also 215 out-star triads (class 021D), or items that drive other purchases.
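
If we want just that out-star count, we can index the census vector directly; position 4 corresponds to the 021D class in igraph's ordering:

# Count of 021D ("out-star") triads, position 4 of the census vector
triad_census(amzn_g)[4]
## [1] 215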

Clustering and Reciprocity

Our previous work looking at the dyad census should give some intuition about how we expect other graph-level metrics like reciprocity and clustering in our co-purchase graph to look. Recall that there are 10,754 edges in our graph of 10,245 vertices, and more than 3,000 pairs of those edges are mutual, meaning that almost 60 percent of the edges are reciprocated. What do you expect the clustering and reciprocity measures to look like given this information? We can test our intuition against a null model by simulating random graphs. In light of the results of our previous simulation, what do you expect to see here? Will reciprocity also be much higher than expected by chance?

Reciprocity is a measure of the likelihood of vertices in a directed network to be mutually linked.
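
To tie this definition to the dyad census above: in igraph's default mode, reciprocity is the proportion of directed edges that are reciprocated, so we can recompute it by hand from the census counts (each mutual dyad contributes two edges, each asymmetric dyad one):

dc <- dyad_census(amzn_g)
# Proportion of directed edges that are reciprocated
2 * dc$mut / (2 * dc$mut + dc$asym)
## [1] 0.5949414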

The graph, amzn_g, is available.

# Calculate reciprocity
actual_recip <- reciprocity(amzn_g)

# Calculate the order
n_nodes <- gorder(amzn_g)

# Calculate the edge density
(edge_dens <- edge_density(amzn_g))
## [1] 0.0001024681
# Run the simulation
simulated_recip <- rep(NA, 1000)
for(i in 1:1000) {
  # Generate an Erdos-Renyi simulated graph
  simulated_graph <- erdos.renyi.game(n_nodes, edge_dens, directed = TRUE)
  # Calculate the reciprocity of the simulated graph
  simulated_recip[i] <- reciprocity(simulated_graph)
}

# Reciprocity of the original graph
actual_recip
## [1] 0.5949414
# Calculate quantile of simulated reciprocity
quantile(simulated_recip , c(0.025, 0.5, 0.975))
##         2.5%          50%        97.5% 
## 0.0000000000 0.0000000000 0.0003769896

Resplendent reciprocity calculating! Notice, though, how the reciprocity of the simulations is much lower than the reciprocity of the original graph.

Important Products

We’ve now seen that there’s a clear pattern in our graph. Let’s take the next step and move beyond just understanding the structure. Given the context of graph structure, what can we learn from it? For example, what drives purchases? A place to start might be to look for “important products”, e.g. those products that someone purchases and then purchases something else. We can make inferences about this using in degree and out degree. First, we’ll look at the distribution of in degree and out degree in our graph, and then use that to set up a working definition of an “important product” (something that has more than X out-edges and fewer than Z in-edges). We’ll then build a subgraph of these products to visualize what they look like.

For a directed graph and a vertex, the Out-Degree refers to the number of arcs directed away from the vertex. The In-Degree refers to the number of arcs directed towards the vertex.

# Calculate the "out" degrees
out_degree <- degree(amzn_g, mode = "out")

## ... and "in" degrees
in_degree <- degree(amzn_g, mode = "in")

# See the distribution of out_degree
table(out_degree)
## out_degree
##    0    1    2    3    4    5 
## 1899 6350 1635  313   45    3
## ... and of in_degree
table(in_degree)
## in_degree
##    0    1    2    3    4    5    6    7    8    9   11   12   17 
## 2798 5240 1549  424  139   50   20    7    9    5    1    2    1
# Create condition of out degree greater than 3
# and in degree less than 3
is_important <- out_degree > 3 & in_degree < 3

# Subset vertices by is_important
imp_prod <- V(amzn_g)[is_important]

# Output the vertices
print(imp_prod)
## + 8/10245 vertices, named, from 40001a4:
## [1] 1629   4545   6334   20181  62482  64344  155513 221085

Awesome, now you can use in-degree and out-degree to infer what products are important!

What Makes an Important Product?

Now that we’ve come up with a working definition of an important product, let’s see if important products have any properties that might be correlated. One candidate pairing is salesrank.from and salesrank.to. We can ask if important products tend to have higher sales ranks than the products people purchase downstream. We’ll look at this by first subsetting out the important vertices, joining those back to the initial data frame, and then creating a new data frame using the package dplyr. We’ll create a new graph, and color each edge blue if it goes from a higher-ranked product (e.g. rank 1, 2, 3) to a lower-ranked one (e.g. rank 20, 21, 22), and red for the opposite. If rank is correlated with downstream purchasing, then we’ll see mostly blue links, and if there’s no relationship, it will be about equally red and blue.

The dataset ip_df contains the information about important products.

# Select the from and to columns from ip_df
ip_df_from_to <- ip_df[c("from", "to")]

# Create a directed graph from the data frame
ip_g <- graph_from_data_frame(ip_df_from_to, directed = TRUE)

# Set the edge color. If salesrank.from is less than or 
# equal to salesrank.to then blue else red.
edge_color <- ifelse(
  ip_df$salesrank.from <= ip_df$salesrank.to, 
  yes = "blue", 
  no = "red"
)

plot(
  # Plot the graph
  ip_g, 
  # Set the edge color
  edge.color = edge_color,
  edge.arrow.width = 1, edge.arrow.size = 0, edge.width = 4, 
  vertex.label = NA, vertex.size = 4, vertex.color = "black"
)
legend(
  "bottomleft", 
  # Set the edge color
  fill = unique(edge_color), 
  legend = c("Lower to Higher Rank", "Higher to Lower Rank"), cex = 0.7
)

Congratulations! You can see that sales rank is correlated with product importance.

Exploring temporal structure

Metrics through time

So far, we have been looking at products that drive other purchases by examining their out degree. However, up until the last lesson we’ve just been looking at a single snapshot in time. One question is: do these products show similar out degrees at each time step? After all, a product driving other purchases at a single time step could just be idiosyncratic, whereas if it were more stable through time, that would be stronger evidence that the product is driving co-purchases. To get at this question, we’re going to build off the code we’ve already walked through that generates a list with a graph at each time step.

time_graph <- readRDS("_data/time_graph.rds")

d <- as.Date(c("2003-03-02", "2003-03-12", "2003-05-05", "2003-06-01"))
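
The readRDS() call above loads a precomputed list with one graph per date in d. As a minimal sketch of how such a list could be built, assuming a hypothetical edge-list data frame trans_df with columns from, to, and date:

# Sketch only: trans_df is a hypothetical transactions data frame
time_graph <- lapply(d, function(day) {
  edges_on_day <- trans_df[trans_df$date == day, c("from", "to")]
  graph_from_data_frame(edges_on_day, directed = TRUE)
})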

# Loop over time graphs calculating out degree
degree_count_list <- lapply(time_graph, degree, mode = "out")

# Flatten it
degree_count_flat <- unlist(degree_count_list)

degree_data <- data.frame(
  # Use the flattened counts
  degree_count = degree_count_flat,
  # Use the names of the flattened counts
  vertex_name = names(degree_count_flat),
  # Use the lengths of the count list
  date = rep(d, lengths(degree_count_list))
)

important_vertices <- c(1629, 132757, 117841)

important_degree_data <- degree_data %>% 
  # Filter for rows where vertex_name is
  # in the set of important vertices
  dplyr::filter(vertex_name %in% important_vertices)

# Using important_degree_data, plot degree_count vs. date, colored by vertex_name 
library(ggplot2)
ggplot(important_degree_data, aes(x = date, y = degree_count, color = vertex_name)) + 
  # Add a path layer
  geom_path()

Transcendent temporal analysis! Only one product, 132757, has a consistently high out degree, indicating it may be consistently driving secondary transactions. The other two might just have a high out degree by chance.

Plotting Metrics Over Time

We can also examine how metrics for the overall graph change (or don’t) through time. Earlier we looked at two important ones, clustering and reciprocity. Each was quite high, as we expected after visually inspecting the graph structure. However, over time, each of these might change. Are global purchasing patterns on Amazon stable? If we think so, then we expect plots of these metrics to be essentially horizontal lines, indicating that reciprocity is about the same every day and there’s a consistently high degree of clustering structure. Let’s see what we can find here.

Code to calculate the transitivity by graph is shown.

Transitivity of a relation means that when there is a tie from i to j, and also from j to h, then there is also a tie from i to h: friends of my friends are my friends. Transitivity depends on triads, subgraphs formed by 3 nodes.
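
As a toy check of this definition (a sketch, not part of the exercise): a closed triangle has global transitivity 1, while an open two-edge path has transitivity 0.

# A closed triangle: every connected triple is closed
transitivity(graph_from_literal(A - B - C - A), type = "global")
## Expected: 1
# An open path: the triple centered on B is never closed
transitivity(graph_from_literal(A - B - C), type = "global")
## Expected: 0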

# Examine this code
transitivity_by_graph <- data.frame(
  date = d,
  metric = "transitivity",
  score = sapply(time_graph, transitivity)
)

# Calculate reciprocity by graph
reciprocity_by_graph <- data.frame(
  date = d,
  metric = "reciprocity",
  score = sapply(time_graph, reciprocity)
)

metrics_by_graph <- dplyr::bind_rows(transitivity_by_graph, reciprocity_by_graph)

# Using metrics_by_graph, plot score  vs. date, colored by metric
ggplot(metrics_by_graph, aes(x = date, y = score, color = metric)) +
  # Add a path layer
  geom_path()

Marvelous metric plotting! Reciprocity was fairly stable over time. Transitivity decreased after the first time point. This supports the idea that co-purchase networks are relatively stable over this time window.

Creating your retweet graph

Exploring the data

Visualize the graph

Now that we’ve thought a bit about how we constructed our network, let’s size our graph and make an initial visualization. Before you make the plot, you’ll calculate the number of vertices and edges. This allows you to know if you can actually plot the entire network. Also, the ratio of nodes to edges will give you an intuition of just how dense or sparse your plot might be. As you create your plot, take a moment to hypothesize what you think the plot will look like based on these metrics.

retweet_graph <- read.graph("_data/rt_g.gml", format=c("gml"))

# Count the number of nodes in retweet_graph
gorder(retweet_graph)
## [1] 4118
# Count the number of edges in retweet_graph
gsize(retweet_graph)
## [1] 6052
# Calculate the graph density of retweet_graph
graph.density(retweet_graph)
## [1] 0.00035697
# Plot retweet_graph
plot(retweet_graph)

Cool! This is the basic graph; now let’s check it out with some more visual information.

Visualize nodes by degree

Now that we’ve taken a look at our graph and explored some of its basic properties, let’s think a bit more about our network. We observed that there are some highly connected nodes and many outlier points. We can visualize this by conditionally plotting the graph and coloring some of the nodes by in and out degree. Let’s think about the nodes as three different types:

  1. users who were retweeted exactly once but never retweeted anyone else (blue below)
  2. users who retweeted exactly once but were never themselves retweeted (green)
  3. everyone else (black)

This will help us get more information about what’s going on in the ring around the cluster of highly connected nodes.

# Calculate the "in" degree distribution
in_deg <- degree(retweet_graph, mode = "in")

# Calculate the "out" degree distribution
out_deg <- degree(retweet_graph, mode = "out")

# Find the case with one "in" degree and zero "out" degrees
has_tweeted_once_never_retweeted <- in_deg == 1 & out_deg == 0

# Find the case with zero "in" degrees and one "out" degree
has_never_tweeted_retweeted_once <- in_deg == 0 & out_deg == 1

# The default color is set to black
vertex_colors <- rep("black", gorder(retweet_graph))

# Set the color of nodes that were retweeted just once to blue
vertex_colors[has_tweeted_once_never_retweeted] <- "blue"

# Set the color of nodes that were retweeters just once to green 
vertex_colors[has_never_tweeted_retweeted_once] <- "green"

# See the result
head(vertex_colors)
## [1] "black" "green" "black" "green" "green" "green"
plot(
  # Plot the network
  retweet_graph, 
  # Set the vertex colors
  vertex.color = vertex_colors
)

Looks awesome! Now we can quantify this visual information a bit more.

What’s the distribution of centrality?

Recall that there are many ways that you can assess centrality of a graph. We will use two different methods you learned earlier: betweenness and eigen-centrality. Remember that betweenness is a measure of how often a given vertex is on the shortest path between other vertices, whereas eigen-centrality is a measure of how many other important vertices a given vertex is connected to. Before we overlay centrality on our graph plots, let’s get a sense of how centrality is distributed.

Note that due to algorithmic rounding errors, we can’t check for eigen-centrality equaling a specific value; instead, we check a range.

# Calculate directed betweenness of vertices
retweet_btw <- betweenness(retweet_graph, directed = TRUE)

# Get a summary of retweet_btw
summary(retweet_btw)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   109.4     0.0 69549.4
# Calculate proportion of vertices with zero betweenness
mean(retweet_btw == 0)
## [1] 0.9499757
# Calculate eigen-centrality using eigen_centrality()
retweet_ec <- eigen_centrality(retweet_graph, directed = TRUE)$vector

# Get a summary of retweet_ec
summary(retweet_ec)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.000000 0.006067 0.000000 1.000000
# Calc proportion of vertices with eigen-centrality close to zero
almost_zero <- 1e-10
mean(retweet_ec < almost_zero)
## [1] 0.9730452

Wow! The distributions are highly skewed. 95% of nodes have zero betweenness, but a few have a large value (with the highest close to 70k). Likewise, 97% of vertices have eigen-centrality close to zero.

Who is important in the conversation?

Different measures of centrality all try to get at the similar concept of “which vertices are most important.” As we discussed earlier, these two metrics approach it slightly differently. Keep in mind that while each may give a similar distribution of centrality measures, how an individual vertex ranks according to both may be different. Now we’re going to compare the top ranking vertices of Twitter users.

The eigen-centrality and betweenness vectors are stored as retweet_ec and retweet_btw, respectively.

# Get 0.99 quantile of betweenness 
betweenness_q99 <- quantile(retweet_btw, 0.99)

# Get top 1% of vertices by betweenness
top_btw <- retweet_btw[retweet_btw > betweenness_q99]

# Get 0.99 quantile of eigen-centrality
eigen_centrality_q99 <- quantile(retweet_ec, 0.99)

# Get top 1% of vertices by eigen-centrality
top_ec <- retweet_ec[retweet_ec > eigen_centrality_q99]

# See the results as a data frame
data.frame(
  Rank = seq_along(top_btw), 
  Betweenness = names(sort(top_btw, decreasing = TRUE)), 
  EigenCentrality = names(sort(top_ec, decreasing = TRUE))
)
##    Rank    Betweenness EigenCentrality
## 1     1  hadleywickham       ma_salmon
## 2     2        kierisi     rstudiotips
## 3     3           drob         opencpu
## 4     4        opencpu    AchimZeileis
## 5     5      ma_salmon       dataandme
## 6     6       rmflight            drob
## 7     7      dataandme       _ColinFay
## 8     8      _ColinFay        rOpenSci
## 9     9     juliasilge       kearneymw
## 10   10      revodavid   RobertMylesMc
## 11   11       rOpenSci       ptrckprry
## 12   12     nj_tierney        rmflight
## 13   13     jonmcalder     thosjleeper
## 14   14      Md_Harris       revodavid
## 15   15   mauro_lepore      juliasilge
## 16   16       sckottie   RLadiesGlobal
## 17   17  RLadiesGlobal   hadleywickham
## 18   18      kearneymw    mauro_lepore
## 19   19      lenkiefer      JennyBryan
## 20   20       NumFOCUS        tudosgar
## 21   21         tjmahr        cboettig
## 22   22  TheRealEveret        jasdumas
## 23   23     RLadiesMAD        antuki13
## 24   24       jasdumas       Rbloggers
## 25   25     JennyBryan        rensa_co
## 26   26       hrbrmstr        timtrice
## 27   27       antuki13        daattali
## 28   28        Voovarb        johnlray
## 29   29       timtrice      joranelias
## 30   30      thinkR_fr    StatsbyLopez
## 31   31     benmarwick         kierisi
## 32   32  RosanaFerrero   joshua_ulrich
## 33   33     clquezadar          pssGuy
## 34   34       drsimonj    bastistician
## 35   35        zentree       thinkR_fr
## 36   36      thomasp85          ledell
## 37   37       OilGains         zentree
## 38   38    yodacomplex  brookLYNevery1
## 39   39  annakrystalli       Md_Harris
## 40   40 davidhughjones        sckottie
## 41   41       noamross      jonmcalder
## 42   42       AlexaLFH      nj_tierney

Awesome job, now you can see who the most central screen names are.

Plotting important vertices

Lastly, we’ll visualize the graph with the size of the vertex corresponding to centrality. However, we already know the graph is highly skewed, so there will likely be a lot of visual noise. So next, we’ll want to look at how connected the most central vertices are. We can do this by creating a subgraph based on centrality criteria. This is an important technique when dealing with large graphs. Later, we’ll look at alternative visualization methods, but another powerful technique is visualizing subgraphs.

The graph, retweet_graph, its vertex betweenness, retweet_btw, and the 0.99 betweenness quantile, betweenness_q99, are available.

# Transform betweenness: add 2 then take natural log
transformed_btw <- log(retweet_btw + 2)

# Make transformed_btw the size attribute of the vertices
V(retweet_graph)$size <- transformed_btw

# Plot the graph
plot(
  retweet_graph, vertex.label = NA, edge.arrow.width = 0.2,
  edge.arrow.size = 0.0, vertex.color = "red"
)

# Subset nodes for betweenness greater than 0.99 quantile
vertices_high_btw <- V(retweet_graph)[retweet_btw > betweenness_q99]

# Induce a subgraph of the vertices with high betweenness
retweet_subgraph <- induced_subgraph(retweet_graph, vertices_high_btw)

# Plot the subgraph
plot(retweet_subgraph)

You can see that the most central vertices are all highly connected to each other, and by creating a sub-graph, we can more easily visualize the network. Now that we’ve explored the retweet graph let’s move on to looking at another type of Twitter graph.

Building a mentions graph

Comparing mention and retweet graph

By looking at the ratio of in degree to out degree, we can learn something slightly different about each network. In the case of a retweet network, it will show us users who are often retweeted but don’t retweet (high values), or those who often retweet but aren’t retweeted (low values). Similarly, if a user has an in/out ratio close to 1 in a mention graph, then the conversation is relatively equitable. However, a low ratio would imply that a given user often starts conversations but isn’t responded to. When you compare the density plots of the different networks, consider what you’d expect. Which network do you expect to be more skewed, and which do you expect to have a ratio closer to 1?

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::as_data_frame() masks tibble::as_data_frame(), igraph::as_data_frame()
## x purrr::compose()       masks igraph::compose()
## x tidyr::crossing()      masks igraph::crossing()
## x dplyr::filter()        masks stats::filter()
## x dplyr::groups()        masks igraph::groups()
## x dplyr::lag()           masks stats::lag()
## x purrr::simplify()      masks igraph::simplify()
mention_graph <- read.graph("_data/ment_g.gml", format=c("gml"))

# Read this code
(mention_data <- tibble(
  graph_type = "mention",
  degree_in = degree(mention_graph, mode = "in"),
  degree_out = degree(mention_graph, mode = "out"),
  io_ratio = degree_in / degree_out
))
## # A tibble: 955 x 4
##    graph_type degree_in degree_out io_ratio
##    <chr>          <dbl>      <dbl>    <dbl>
##  1 mention            1          8    0.125
##  2 mention            1          0  Inf    
##  3 mention            0          1    0    
##  4 mention            0          1    0    
##  5 mention           10         13    0.769
##  6 mention            2          0  Inf    
##  7 mention            1          5    0.2  
##  8 mention            0          2    0    
##  9 mention            1          0  Inf    
## 10 mention            0          2    0    
## # ... with 945 more rows
# Create a dataset of retweet ratios
(retweet_data <- tibble(
  graph_type = "retweet",
  degree_in = degree(retweet_graph, mode = "in"),
  degree_out = degree(retweet_graph, mode = "out"),
  io_ratio = degree_in / degree_out
))
## # A tibble: 4,118 x 4
##    graph_type degree_in degree_out io_ratio
##    <chr>          <dbl>      <dbl>    <dbl>
##  1 retweet           55          2     27.5
##  2 retweet            0          1      0  
##  3 retweet            2          1      2  
##  4 retweet            0          1      0  
##  5 retweet            0          1      0  
##  6 retweet            0          1      0  
##  7 retweet            0          3      0  
##  8 retweet            0          3      0  
##  9 retweet            0          1      0  
## 10 retweet            0          3      0  
## # ... with 4,108 more rows
# Bind the datasets by row
io_data <- bind_rows(mention_data, retweet_data) %>% 
  # Filter for finite, positive io_ratio
  filter(is.finite(io_ratio), io_ratio > 0)

# Using io_data, plot io_ratio colored by graph_type
ggplot(io_data, aes(x = io_ratio, color = graph_type)) + 
  # Add a geom_freqpoly layer
  geom_freqpoly() + 
  scale_x_continuous(breaks = 0:10, limits = c(0, 10))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 24 rows containing non-finite values (stat_bin).
## Warning: Removed 4 row(s) containing missing values (geom_path).

io_data %>% 
  # Group by graph_type
  group_by(graph_type) %>% 
  summarize(
    # Calculate the mean of io_ratio
    mean_io_ratio = mean(io_ratio),
    # Calculate the median of io_ratio
    median_io_ratio = median(io_ratio)
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   graph_type mean_io_ratio median_io_ratio
##   <chr>              <dbl>           <dbl>
## 1 mention             1.39            1   
## 2 retweet             4.68            1.44

Congratulations! Now you can see the difference in the ratio of in and out degrees of mention and retweet graphs.

Assortativity and reciprocity

Another two key components of understanding our graphs are reciprocity and degree assortativity. Recall that reciprocity is the proportion of connections that go in both directions. Therefore, in the retweet graph, reciprocity measures the overall share of users who retweet each other. In the mention graph, it tells us how many users are conversing back and forth.

Assortativity is a bit less obvious. Values greater than 0 indicate that vertices with high degrees tend to be connected to each other, whereas values less than zero indicate a more degree-disassortative graph. If you visualize the graph and see a hub-and-spoke type pattern, it is likely to be disassortative.

Assortativity, or assortative mixing is a preference for a network’s nodes to attach to others that are similar in some way. For instance, in social networks, nodes tend to be connected with other nodes with similar degree values. This tendency is referred to as assortative mixing, or assortativity.
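
As a quick illustration of the hub-and-spoke intuition (a sketch, not part of the exercise): a tree grown by preferential attachment develops hubs by construction, and typically shows negative degree assortativity.

set.seed(42)
# A hub-and-spoke style tree grown by preferential attachment
g_pa <- sample_pa(100, directed = FALSE)
# Expect a negative value: hubs attach mostly to low-degree leaves
assortativity_degree(g_pa)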

# Find the reciprocity of the retweet graph
reciprocity(retweet_graph) 
## [1] 0.005948447
# Find the reciprocity of the mention graph
reciprocity(mention_graph)
## [1] 0.01846154
# Find the directed assortivity of the retweet graph
assortativity.degree(retweet_graph, directed = TRUE)
## [1] -0.1502212
# Find the directed assortivity of the mention graph
assortativity.degree(mention_graph, directed = TRUE)
## [1] -0.07742748

Great job! Reciprocity is higher in the mention graph compared to the retweet graph, and both graphs have negative assortativity.

Finding who is talking to whom

Recall from the first lesson that cliques are complete subgraphs within a larger undirected graph. When we look at the clique structure of a conversational graph in Twitter, it tells us who is talking to whom. One way we can use this information is to see who might be interested in talking to other people. It’s easy to see how this basic information could be used to construct models of suggested users to follow or interact with.

# Get size 3 cliques
list_of_clique_vertices <- cliques(mention_graph, min = 3, max = 3)
## Warning in cliques(mention_graph, min = 3, max = 3): At igraph_cliquer.c:
## 56 :Edge directions are ignored for clique calculations
# Loop over cliques, getting IDs
clique_ids <- lapply(list_of_clique_vertices, as_ids)

# See the result
head(clique_ids)
## [[1]]
## [1] "lenkiefer"   "dvaughan32"  "daniel_egan"
## 
## [[2]]
## [1] "lenkiefer" "dataandme" "revodavid"
## 
## [[3]]
## [1] "lenkiefer"      "bass_analytics" "rstudio"       
## 
## [[4]]
## [1] "lenkiefer" "dataandme" "rstudio"  
## 
## [[5]]
## [1] "unsorsodicorda" "revodavid"      "VizMonkey"     
## 
## [[6]]
## [1] "bhaskar_vk" "cboettig"   "Docker"
# Loop over cliques, finding cases where revodavid is in the clique
has_revodavid <- sapply(
  clique_ids, 
  function(clique) {
    "revodavid" %in% clique
  }
)

# Subset cliques that have revodavid
cliques_with_revodavid <- clique_ids[has_revodavid]

# Unlist cliques_with_revodavid and get unique values
(people_in_cliques_with_revodavid <- unique(unlist(cliques_with_revodavid)))
## [1] "lenkiefer"      "dataandme"      "revodavid"      "unsorsodicorda"
## [5] "VizMonkey"      "ibartomeus"     "frod_san"
# Induce subgraph of mention_graph with people_in_cliques_with_revodavid 
revodavid_cliques_graph <- induced_subgraph(mention_graph, people_in_cliques_with_revodavid)

# Plot the subgraph
plot(revodavid_cliques_graph)

Classy clique analysis! Revodavid is in six cliques of size three.

Finding communities

variation of information (how much do two community assignments differ across vertices?)

Comparing community algorithms

There are many ways that you can find communities in a graph. Unfortunately, different community detection algorithms will give different results, and the best algorithm to choose depends on some of the properties of your graph (see Yang et al. for a comparison).

You can compare the resulting communities using compare(). This returns a score (the “variation of information”), which quantifies how much two community assignments differ. A lower score means that the two community structures are more similar.

You can see if two vertices are in the same community using membership(). If the vertices have the same membership number, then they are in the same community.
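
As a quick sanity check (a toy sketch): two partitions that group the vertices identically, even under different labels, have a variation of information of 0.

# Identical groupings with swapped labels: distance should be 0
compare(c(1, 1, 2, 2), c(2, 2, 1, 1), method = "vi")
## Expected: 0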

# Make retweet_graph undirected
retweet_graph_undir <- as.undirected(retweet_graph)

# Find communities with fast greedy clustering
communities_fast_greedy <- cluster_fast_greedy(retweet_graph_undir)

# Find communities with infomap clustering
communities_infomap <- cluster_infomap(retweet_graph_undir)

# Find communities with louvain clustering
communities_louvain <- cluster_louvain(retweet_graph_undir)

# Compare fast greedy communities with infomap communities
compare(communities_fast_greedy, communities_infomap)
## [1] 2.308336
# Compare fast greedy with louvain
compare(communities_fast_greedy, communities_louvain)
## [1] 2.144703
# Compare infomap with louvain
compare(communities_infomap, communities_louvain)
## [1] 1.615744
two_users <- c("bass_analytics", "big_data_flow")

# Subset membership of communities_fast_greedy by two_users
membership(communities_fast_greedy)[two_users]
## bass_analytics  big_data_flow 
##              3              3
# Subset membership of communities_infomap by two_users
membership(communities_infomap)[two_users]
## bass_analytics  big_data_flow 
##            100             76
# Subset membership of communities_louvain by two_users
membership(communities_louvain)[two_users]
## bass_analytics  big_data_flow 
##             81             70

Delightful community detection! compare() returned a smaller distance between Infomap and Louvain communities, meaning that those algorithms gave more similar results than “Fast and Greedy”. In “Fast and Greedy”, the users bass_analytics and big_data_flow were placed in the same community but the other algorithms placed them in different communities.

Visualizing the communities

Now that we’ve found communities, we’ll visualize our results. Before we plot, we’ll assign each community membership to each vertex and a crossing value to each edge. The crossing() function in igraph will return true if a particular edge crosses communities. This is useful when we want to see certain vertices that are bridges between communities. You may just want to look at certain communities because the whole graph is a bit of a hairball. In this case, we’ll create a subgraph of communities only of a certain size (number of members).

# Color vertices by community membership, as a factor
V(retweet_graph)$color <- factor(membership(communities_louvain))

# Find edges that cross between commmunities
is_crossing <- igraph::crossing(communities_louvain, retweet_graph)

# Set edge linetype: solid for crossings, dotted otherwise 
E(retweet_graph)$lty <- ifelse(is_crossing, "solid", "dotted")

# Get the sizes of each community
community_size <- sizes(communities_louvain)

# Find some mid-size communities
in_mid_community <- unlist(communities_louvain[community_size > 50 & community_size < 90])

# Induce a subgraph of retweet_graph using in_mid_community
retweet_subgraph <- induced_subgraph(retweet_graph, in_mid_community)

# Plot those mid-size communities
plot(retweet_subgraph, vertex.label = NA, edge.arrow.width = 0.8, edge.arrow.size = 0.2, 
  layout = layout_with_fr(retweet_subgraph), margin = 0, vertex.size = 3)

Wow! Most of the edges connect members of a community to each other. There are only a few lines connecting each community to other communities.

Creating our graph from raw data

Making Graphs of Different User Types

Let’s compare graphs of people who subscribe to Divvy bikes vs. more casual, non-subscribing customers.

It’s convenient to use dplyr to manipulate the data before using graph_from_data_frame() to create the graph. One useful dplyr function you’ll need is n(), which gives the number of rows in that group of the data frame.

bike_dat <- read.csv("_data/divvy_bike_sample.csv", stringsAsFactors = TRUE)

subscribers <- bike_dat %>% 
  # Filter for rows where usertype is Subscriber
  filter(usertype == "Subscriber")

# Count the number of subscriber trips
n_subscriber_trips <- nrow(subscribers)

subscriber_trip_graph <- subscribers %>% 
  # Group by from_station_id and to_station_id
  group_by(from_station_id, to_station_id) %>% 
  # Calculate summary statistics
  summarize(
    # Set weights as proportion of total trips 
    weights = n() / n_subscriber_trips
  ) %>%
  # Make a graph from the data frame
  graph_from_data_frame()
## `summarise()` regrouping output by 'from_station_id' (override with `.groups` argument)
customers <- bike_dat %>% 
  # Filter for rows where usertype is Customer
  filter(usertype == "Customer")

# Count the number of customer trips
n_customer_trips <- nrow(customers)

customer_trip_graph <- customers %>% 
  # Group by from_station_id and to_station_id
  group_by(from_station_id, to_station_id) %>% 
  # Calculate summary statistics
  summarize(
    # Set weights as proportion of total trips 
    weights = n() / n_customer_trips
  ) %>%
  # Make a graph from the data frame
  graph_from_data_frame()
## `summarise()` regrouping output by 'from_station_id' (override with `.groups` argument)
# Calculate number of different trips by subscribers
gsize(subscriber_trip_graph)
## [1] 14679
# Calculate number of different trips by customers
gsize(customer_trip_graph)
## [1] 9528

Keep on pedalling! Subscribers made over 14 thousand different trips, compared to over 9 thousand for (non-subscribing) customers.

Compare Graphs of Different User Types

One of the easiest ways we can see differences between the graphs is by plotting a small subgraph of each one. If different patterns are strong, they should be easy to visually detect. As you look at these two different plots, what stands out? It’s worth taking a moment to think about the differences between these two populations. While there’s doubtless overlap between these two groups, what are some different traits between each population? Subscribers are probably regular users, local, and might be more likely to travel to more remote parts of the graph. Customers are likely tourists or locals who don’t bike regularly. Their use is probably more centered in major stations, and they might ride more for leisure in some of the touristy parts of the city.

# Induce a subgraph from subscriber_trip_graph of the first 12 vertices
twelve_subscriber_trip_graph <- induced_subgraph(subscriber_trip_graph, 1:12)

# Plot the subgraph
plot(
  twelve_subscriber_trip_graph, 
  main = "Subscribers"
)

# Induce a subgraph from customer_trip_graph of the first 12 vertices
twelve_customer_trip_graph <- induced_subgraph(customer_trip_graph, 1:12)

# Plot the subgraph
plot(
  twelve_customer_trip_graph, 
  main = "Customers"
)

That was a loop of fun, but let’s look at the full distance.

Compare graph distance vs. geographic distance

Compare Subscriber vs. Non-Subscriber Distances

Let’s compare subscriber to non-subscriber graphs by distance. Remember we can think of subscribers as local Chicago residents who regularly use the bikes, whereas non-subscribers are likely to be more casual users or tourists. Also it’s important to keep in mind that this graph is a representation of geography. Which graph do you think has a further geographic distance? Why?

get_diameter() and farthest_vertices() both provide the vertices in the graph that have the longest “shortest route” between them – get_diameter() provides all the intermediate vertices, whereas farthest_vertices() provides the end vertices and the number of nodes between them.

calc_physical_distance_m(), a function that takes in two station IDs as inputs and calculates the physical distance between the stations (in meters), is also provided. You can view the function by running calc_physical_distance_m in the console.

library(geosphere)

calc_physical_distance_m <- function(station_1, station_2) {
  station_1_long_lat <- bike_dat %>% 
    filter(from_station_id == !!station_1) %>% 
    top_n(1) %>% 
    select(from_longitude, from_latitude) %>%
    as.numeric()
  station_2_long_lat <- bike_dat %>% 
    filter(from_station_id == !!station_2) %>% 
    top_n(1) %>% 
    select(from_longitude, from_latitude) %>%
    as.numeric()
  as.numeric(geosphere::distm(station_1_long_lat, station_2_long_lat, distHaversine))
}

# Get the diameter of the subscriber graph
get_diameter(subscriber_trip_graph)
## + 7/300 vertices, named, from 695e350:
## [1] 200 336 267 150 45  31  298
# Get the diameter of the customer graph
get_diameter(customer_trip_graph)
## + 6/299 vertices, named, from 698064b:
## [1] 116 31  25  137 135 281
# Find the farthest vertices of the subscriber graph
farthest_vertices(subscriber_trip_graph)
## $vertices
## + 2/300 vertices, named, from 695e350:
## [1] 200 298
## 
## $distance
## [1] 6
# Find the farthest vertices of the customer graph
farthest_vertices(customer_trip_graph)
## $vertices
## + 2/299 vertices, named, from 698064b:
## [1] 116 281
## 
## $distance
## [1] 5
# From previous step
farthest_vertices(subscriber_trip_graph)
## $vertices
## + 2/300 vertices, named, from 695e350:
## [1] 200 298
## 
## $distance
## [1] 6
farthest_vertices(customer_trip_graph)
## $vertices
## + 2/299 vertices, named, from 698064b:
## [1] 116 281
## 
## $distance
## [1] 5
# Calc physical distance between end stations for subscriber graph
calc_physical_distance_m(200, 298)
## Selecting by geo_distance
## Selecting by geo_distance
## [1] 17078.31
# Calc physical distance between end stations for customer graph
calc_physical_distance_m(116, 281)
## Selecting by geo_distance
## Selecting by geo_distance
## [1] 7465.656

You’re going the distance! In the subscriber graph, the end stations of the diameter were about 17 km apart, compared to about 7.5 km for non-subscribing customers. Now on to the most traveled to and from stations.

Most Traveled To and From Stations

Here we’ll look at which stations are most commonly traveled to and from, as well as the ratio of out to in degree. This will tell us which stations are skewed as either having many stations pulling bikes from them or leaving bikes at them. In order for a bike sharing graph like this to work effectively, you can’t have too many source or sink stations, otherwise the owner of the network would need to constantly move bikes around! Ideally, the network is designed to self-correct, and if it’s doing that, we expect to see almost all the stations with an out-to-in degree ratio of around one. First, we’re going to look at this in the unweighted case.

bike_dat_filtered <- bike_dat %>% filter(from_station_id != to_station_id)
total_trips <- nrow(bike_dat_filtered)

trip_g_simp <- bike_dat_filtered %>%
  group_by(from_station_id, to_station_id) %>% 
  summarize(
    weights = n() / total_trips
  ) %>%
  # Make a graph from the data frame
  graph_from_data_frame()
## `summarise()` regrouping output by 'from_station_id' (override with `.groups` argument)
trip_deg <- data_frame(
  # Find the "out" degree distribution
  trip_out = degree(trip_g_simp, mode = "out"), 
  # ... and the "in" degree distribution
  trip_in = degree(trip_g_simp, mode = "in"),
  # Calculate the ratio of out / in
  ratio = trip_out / trip_in
)
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
trip_deg_filtered <- trip_deg %>%
  # Filter for rows where trips in and out are both over 10
  filter(trip_out > 10, trip_in > 10) 

# Plot histogram of filtered ratios
hist(trip_deg_filtered$ratio)

Most Traveled To and From Stations with Weights

So far, we’ve only looked at our network with unweighted edges. But our edge weights are actually the number of trips, so it seems logical that we would want to extend our analysis of degrees by adding a weighted degree distribution. This is important because while a balanced degree ratio is important, the item that would need to be rebalanced is bikes. If the weights are the same across all stations, then an unweighted degree ratio would work. But if we want to know how many bikes are actually flowing, we need to consider weights.

The weighted analog to degree distribution is strength. We can calculate this with the strength() function, which presents a weighted degree distribution based on the weight attribute of a graph’s edges.

trip_strng <- data_frame(
  # Find the "out" strength distribution
  trip_out = strength(trip_g_simp, mode = "out"), 
  # ... and the "in" strength distribution
  trip_in = strength(trip_g_simp, mode = "in"),
  # Calculate the ratio of out / in
  ratio = trip_out / trip_in
)

trip_strng_filtered <- trip_strng %>%
  # Filter for rows where trips in and out are both over 10
  filter(trip_out > 10, trip_in > 10) 

# Plot histogram of filtered ratios
hist(trip_strng_filtered$ratio)

Hopefully those weights didn’t make this too heavy! Now on to more visualization.

Visualize central vertices

As we saw in the last lesson, station 275 had the lowest out/in degree ratio. We can visualize this using make_ego_graph() to see all the outbound paths from this station. It’s also useful to plot this on a geographic coordinate layout, not the default igraph layout. By default, igraph uses the layout_nicely() function to display your graph, making an algorithmic guess about what the best layout should be. However, in this case we want to specify the coordinates of each station, because when a vertex is above another it means it’s actually north of it.

from_longitude <- c(-87.65650, -87.66100, -87.65549, -87.64275, -87.67328, -87.66154, -87.62373, -87.66875, -87.65103, -87.66651, -87.66661)

from_latitude <- c(41.85817, 41.86942, 41.86948, 41.88042, 41.87501, 41.85756, 41.86406, 41.85790, 41.87174, 41.86523, 41.89107)

latlong <- as.matrix(data.frame(from_longitude, from_latitude))

# Make an ego graph of the least traveled graph
g275 <- make_ego_graph(trip_g_simp, 1, nodes = "275", mode = "out")[[1]]

# Plot ego graph
plot(
  g275, 
  # Weight the edges by the weights attribute 
  edge.width = E(g275)$weights
)

# Plot ego graph
plot(
  g275, 
  edge.width = E(g275)$weights,
  # Use geographic coordinates
  layout = latlong
)

Awesome work! Now let’s keep on riding.

Weighted Measures of Centrality

Another common measure of important vertices is centrality. There are a number of ways that we can measure centrality, but for this lesson, we’ll consider two metrics: eigen-centrality and closeness. While eigen-centrality has already been covered, closeness is another way of assessing centrality: it considers how close any vertex is to all the other vertices. In earlier lessons, we haven’t explicitly considered weighted versus unweighted versions of centrality. In this lesson, we’ll calculate both weighted and unweighted versions and see whether the results change.

In the below example, do you expect to see the same vertex each time? What do you think will be the biggest difference between metrics or between weighted and unweighted versions?

# This calculates weighted eigen-centrality 
ec_weight <- eigen_centrality(trip_g_simp, directed = TRUE)$vector

# Calculate unweighted eigen-centrality 
ec_unweight <- eigen_centrality(trip_g_simp, directed = TRUE, weights = NA)$vector

# This calculates weighted closeness
close_weight <- closeness(trip_g_simp)

# Calculate unweighted closeness
close_unweight <- closeness(trip_g_simp, weights = NA)

# Get vertex names
vertex_names <- names(V(trip_g_simp))

# Complete the data frame to see the results
data_frame(
  "Weighted Eigen Centrality" = vertex_names[which.min(ec_weight)],
  "Unweighted Eigen Centrality" = vertex_names[which.min(ec_unweight)],
  "Weighted Closeness" = vertex_names[which.min(close_weight)],
  "Unweighted Closeness" = vertex_names[which.min(close_unweight)]
)
## # A tibble: 1 x 4
##   `Weighted Eigen Cen~ `Unweighted Eigen Ce~ `Weighted Closen~ `Unweighted Clos~
##   <chr>                <chr>                 <chr>             <chr>            
## 1 204                  204                   336               336

Connectivity

Find the minimum cut 1

Connectivity tells us the minimum number of cuts needed to split the graph into two different subgraphs. igraph has two functions that we can use to tell us which vertices end up in those two subgraphs and how many cuts are required. The first is min_cut(), which returns the edges cut, the number of cuts, and the two different subgraphs created. The number of cuts differs between directed and undirected graphs. In directed graphs, the minimum number of cuts only counts inbound edges, whereas in an undirected graph, all edges between the two partitions count.

# Create an undirected version of the trip graph
trip_g_ud <- as.undirected(trip_g_simp)

# Calculate the minimum number of cuts
ud_cut <- min_cut(trip_g_ud, value.only = FALSE)

# See the result
ud_cut
## $value
## [1] 5
## 
## $cut
## + 5/12972 edges from e947f39 (vertex names):
## [1] 71 --281 135--281 167--281 203--281 281--305
## 
## $partition1
## + 1/300 vertex, named, from e947f39:
## [1] 281
## 
## $partition2
## + 299/300 vertices, named, from e947f39:
##   [1] 5   13  14  15  16  17  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  66  67  68  69  71  72  73  74  75  76  77  80  81  84  85  86  87  88  90  91  92  93  94  97  98  99  100 106 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 126 127 128 129 130 131 132 134 135 136 137 138 140 141 143 144 146 147 148 149 150 152 153 154 156 157 158 159 160 162 163 164 165 166 167 168 169 170 171 173 174 175 176 177 178 179 181 183 184 185 186 188 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 222 223 224 225 226 227 228 229 230 231 232 233 234 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 267 268 271 272 273 274 275 276 277 278 279 280 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## [249] 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351

The number of cuts is shown in value, and the vertices in each new subgraph are listed in partition1 and partition2.

Find the minimum cut 2

Another function for cutting graphs into several smaller graphs is stMincuts(). This requires the graph and the IDs of two vertices, and tells you the minimum number of cuts needed in the graph to disconnect them (given by the value element of the function output). The syntax for this function is:

stMincuts(graph, "node1", "node2")

# Make an ego graph from the first partition
ego_partition1 <- make_ego_graph(trip_g_ud, nodes = ud_cut$partition1)[[1]]

# Plot the ego graph
plot(ego_partition1)

# Find the number of cuts needed to disconnect nodes "231" and "321"
stMincuts(trip_g_simp, "231", "321")
## $value
## [1] 54
## 
## $cuts
## $cuts[[1]]
## + 54/18773 edges from 6a598a9 (vertex names):
##  [1] 231->31  231->76  231->85  231->93  231->94  231->113 231->114 231->117
##  [9] 231->144 231->152 231->154 231->165 231->173 231->176 231->177 231->230
## [17] 231->232 231->234 231->238 231->239 231->240 231->242 231->243 231->244
## [25] 231->245 231->249 231->251 231->254 231->256 231->257 231->268 231->277
## [33] 231->293 231->294 231->295 231->296 231->297 231->298 231->299 231->302
## [41] 231->303 231->304 231->306 231->311 231->312 231->314 231->316 231->323
## [49] 231->325 231->326 231->329 231->334 231->340 231->349
## 
## 
## $partition1s
## $partition1s[[1]]
## + 1/300 vertex, named, from 6a598a9:
## [1] 231
# Find the number of cuts needed to disconnect nodes "231" and "213"
stMincuts(trip_g_simp, "231", "213")
## $value
## [1] 49
## 
## $cuts
## $cuts[[1]]
## + 49/18773 edges from 6a598a9 (vertex names):
##  [1] 16 ->213 17 ->213 20 ->213 27 ->213 31 ->213 37 ->213 54 ->213 58 ->213
##  [9] 60 ->213 61 ->213 69 ->213 73 ->213 74 ->213 81 ->213 85 ->213 87 ->213
## [17] 90 ->213 92 ->213 93 ->213 113->213 116->213 122->213 123->213 130->213
## [25] 141->213 160->213 173->213 181->213 183->213 199->213 210->213 220->213
## [33] 222->213 239->213 240->213 244->213 258->213 260->213 276->213 277->213
## [41] 290->213 302->213 305->213 307->213 315->213 316->213 333->213 340->213
## [49] 350->213
## 
## 
## $partition1s
## $partition1s[[1]]
## + 299/300 vertices, named, from 6a598a9:
##   [1] 351 350 349 348 347 346 345 344 343 342 341 340 339 338 337 336 335 334
##  [19] 333 332 331 330 329 328 327 326 325 324 323 322 321 320 319 318 317 316
##  [37] 315 314 313 312 311 310 309 308 307 306 305 304 303 302 301 300 299 298
##  [55] 297 296 295 294 293 292 291 290 289 288 287 286 285 284 283 282 281 280
##  [73] 279 278 277 276 275 274 273 272 271 268 267 265 264 263 262 261 260 259
##  [91] 258 257 256 255 254 253 252 251 250 249 248 247 246 245 244 243 242 241
## [109] 240 239 238 237 236 234 233 232 231 230 229 228 227 226 225 224 223 222
## [127] 220 219 218 217 216 215 214 212 211 210 209 208 207 206 205 204 203 202
## [145] 201 200 199 198 197 196 195 194 193 192 191 190 188 186 185 184 183 181
## [163] 179 178 177 176 175 174 173 171 170 169 168 167 166 165 164 163 162 160
## + ... omitted several vertices

Stations that are further apart require more cuts to disconnect them from the network.

Unweighted Clustering Randomizations

We’ve seen that the bike graph has very low connectivity relative to a random graph. This is unsurprising because we expect that a graph representing geographic space will have some parts connected by small corridors, and therefore it wouldn’t take much to disconnect the graph. It follows that there are likely geographic clusters that are highly connected internally and less connected to other clusters. We can test this hypothesis by looking at the transitivity of the network, or the clustering coefficient, a concept introduced in our introductory lesson. Several types of clustering coefficients exist, but we’ll be looking at the global definition (essentially the proportion of connected triples that form fully closed triangles), which is the same one covered earlier. First, we will look at an unweighted version of the graph and compare it to a random graph.

To calculate the global transitivity of a network, you’ll need to set type to "global" in your call to transitivity().

The bike trip network, trip_g_simp, is available.

# Calculate global transitivity
actual_global_trans <- transitivity(trip_g_simp, type = "global")

# See the result
actual_global_trans
## [1] 0.5515575
# Calculate the order
n_nodes <- gorder(trip_g_simp)

# Calculate the edge density
edge_dens <- edge_density(trip_g_simp)

# Run the simulation
simulated_global_trans <- rep(NA, 300)
for(i in 1:300) {
  # Generate an Erdos-Renyi simulated graph
  simulated_graph <- erdos.renyi.game(n_nodes, edge_dens, directed = TRUE)
  # Calculate the global transitivity of the simulated graph
  simulated_global_trans[i] <- transitivity(simulated_graph, type = "global")
}

# Plot a histogram of simulated global transitivity
hist(
  simulated_global_trans, 
  xlim = c(0.35, 0.6), 
  main = "Unweighted clustering randomization"
)

# Add a vertical line at the actual global transitivity
abline(v = actual_global_trans, col = "red")

What a simulating exercise! The global transitivity of each simulation is much lower than the global transitivity of the original graph. Let’s try it again but with weighted clustering.

Weighted Clustering Randomizations

We can see support for the hypothesis that a graph with low connectivity would also have very high clustering, much higher than by chance. But our graph is more than just an undirected graph, it also has weights that represent the number of trips taken. So now we have several things to consider in our randomization. First, the weighted version of the metric is local only, so a transitivity value is calculated for each vertex. Second, the random graph doesn’t include weights. To solve both of these problems, we’ll look at the mean vertex transitivity, and implement a slightly more complicated randomization scheme.

To calculate the weighted vertex transitivity of a network, you’ll need to set type to "weighted" in your call to transitivity().

The bike trip network, trip_g_simp, is available.

# Find the mean local weighted clustering coefficient using transitivity()
# (pass the weights explicitly, since the edge attribute is named "weights")
actual_mean_weighted_trans <- mean(
  transitivity(trip_g_simp, type = "weighted", weights = E(trip_g_simp)$weights),
  na.rm = TRUE
)

# Calculate the order
n_nodes <- gorder(trip_g_simp)

# Calculate the edge density
edge_dens <- edge_density(trip_g_simp)

# Get edge weights
edge_weights <- E(trip_g_simp)$weights

# Run the simulation
simulated_mean_weighted_trans <- rep(NA, 300)
for(i in 1:300) {
  # Generate an Erdos-Renyi simulated graph
  simulated_graph <- erdos.renyi.game(n_nodes, edge_dens, directed = TRUE)
  # Get number of edges in simulated graph
  n_simulated_edges <- gsize(simulated_graph)
  # Sample existing weights and add them to the random graph
  E(simulated_graph)$weights <- sample(edge_weights, n_simulated_edges, replace = TRUE)
  # Get the mean transitivity of the simulated graph, using the sampled weights
  simulated_mean_weighted_trans[i] <- mean(
    transitivity(simulated_graph, type = "weighted", weights = E(simulated_graph)$weights),
    na.rm = TRUE
  )
}

# Plot a histogram of simulated mean weighted transitivity
hist(
  simulated_mean_weighted_trans, 
  xlim = c(0.35, 0.7), 
  main = "Mean weighted clustering randomization"
)

# Add a vertical line at the actual mean weighted transitivity
abline(v = actual_mean_weighted_trans, col = "red")

And we’ve reached the end of the ride for this bike chapter, but now let’s wrap it up with some alternative plotting types.

Other packages for plotting graphs!

ggnet Basics

There are many ways to visualize a graph beyond basic igraph plotting. A common framework is ggplot2, which provides a way to make high-quality plots with minimal setting of graphical parameters. In this lesson, we’ll cover the basics of creating a plot with ggnet2. This package builds graph plots using the ggplot2 framework, but it relies on graph objects from a different R library, network. It produces plots with a ggplot2 aesthetic without quite the formal grammar of graphics used by ggnetwork, as we’ll see in the next lesson. You’ll notice a lot of overlap in the syntax with igraph (node.size corresponds to vertex.size, and edge.size to edge.width).

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(intergraph)
library(network)
## network: Classes for Relational Data
## Version 1.16.1 created on 2020-10-06.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##                     Mark S. Handcock, University of California -- Los Angeles
##                     David R. Hunter, Penn State University
##                     Martina Morris, University of Washington
##                     Skye Bender-deMoll, University of Washington
##  For citation information, type citation("network").
##  Type help("network-package") to get started.
## 
## Attaching package: 'network'
## The following objects are masked from 'package:igraph':
## 
##     %c%, %s%, add.edges, add.vertices, delete.edges, delete.vertices,
##     get.edge.attribute, get.edges, get.vertex.attribute, is.bipartite,
##     is.directed, list.edge.attributes, list.vertex.attributes,
##     set.edge.attribute, set.vertex.attribute
library(sna)
## Loading required package: statnet.common
## 
## Attaching package: 'statnet.common'
## The following object is masked from 'package:base':
## 
##     order
## sna: Tools for Social Network Analysis
## Version 2.6 created on 2020-10-5.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##  For citation information, type citation("sna").
##  Type help(package="sna") to get started.
## 
## Attaching package: 'sna'
## The following objects are masked from 'package:igraph':
## 
##     betweenness, bonpow, closeness, components, degree, dyad.census,
##     evcent, hierarchy, is.connected, neighborhood, triad.census
library(ggplot2)

# Create a subgraph of retweet_graph (here all vertices are kept)
retweet_samp <- induced_subgraph(retweet_graph, vids = V(retweet_graph))

# Plot using igraph (edge widths are set with edge.width, not edge.size)
plot(retweet_samp, vertex.label = NA, edge.arrow.size = 0.2, edge.width = 0.5, vertex.color = "black", vertex.size = 1)

# Convert to a network object
retweet_net <- asNetwork(retweet_samp)

# Plot using GGally 
ggnet2(retweet_net, edge.size = 0.5, node.color = "black", node.size = 1)

GGood plotting! Let’s move on to the next package.

ggnetwork Basics

In the last lesson, you saw that the ggnet2 package produces ggplot2-like plots with syntax reasonably familiar to igraph users. The ggnetwork package works a bit differently: it converts igraph objects into data frames that are easily plotted by ggplot2, and it adds several new geoms for building plots. The ggnetwork() function does the conversion, folding some parameters into the data frame (in this case, the arrow gap), and the result can then be passed to ggplot(). From there, you build your plot up with geom_edges() for edges and geom_nodes() for vertices. In this lesson we’ll make two basic plots of the retweet graph: one with ggplot defaults, and one with some basic theming to look a bit nicer.

library(ggnetwork)
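
To see what ggnetwork() actually produces, it helps to peek at the converted data frame first (a quick check, assuming retweet_samp from the earlier lesson is still loaded):

# Each row holds a vertex position (x, y) plus, for edge rows, the
# position of the other endpoint (xend, yend) and any vertex attributes
head(ggnetwork(retweet_samp))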

# Call ggplot
ggplot(
  # Convert retweet_samp to a ggnetwork
  ggnetwork(retweet_samp), 
  # Specify x, y, xend, yend
  aes(x = x, y = y, xend = xend, yend = yend)) +
  # Add a node layer
  geom_nodes() +
  # Add an edge layer
  geom_edges()

# Call ggplot
ggplot(
  # Convert retweet_samp to a ggnetwork
  ggnetwork(retweet_samp), 
  # Specify x, y, xend, yend
  aes(x = x, y = y, xend = xend, yend = yend)) +
  # Add a node layer
  geom_nodes() +
  # Change the edges to arrows of length 6 pts
  geom_edges(arrow = arrow(length = unit(6, "pt"))) +
  # Use the blank theme
  theme_blank()

Priceless plotting! The x and y scales don’t have any meaning when plotting a graph, so theme_blank() is a common choice to remove them.

More ggnet Plotting Options

The real power of ggnet2 and the other alternatives to igraph is that they can generate advanced plots with just a little parameterization. In the earlier example comparing the two plotting methods, the results were fairly similar. In this lesson, however, we’ll show how to make more advanced plots. We’ll take the Twitter data set, use igraph to add two vertex attributes (centrality and community membership), and plot them quickly using ggnet2.

The centrality and community membership attributes you created in the last exercise are still present.
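
If you’re following along outside the course environment, those attributes could be recreated along these lines (a hypothetical sketch; the original exercise may have used different measures):

# Hypothetical reconstruction: community membership via walktrap
# clustering and centrality via degree; note that sna masks igraph's
# degree(), so the call is qualified
V(retweet_graph)$comm <- as.character(membership(cluster_walktrap(retweet_graph)))
V(retweet_graph)$cent <- igraph::degree(retweet_graph)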

# Convert to a network object
retweet_net <- asNetwork(retweet_graph)

# Plot with ggnet2
ggnet2(
  retweet_net, 
  # Set the node size to cent
  node.size = "cent", 
  # Set the node color to comm
  node.color = "comm", 
  # Set the color palette to spectral
  color.palette = "Spectral"
)

# Update the plot
ggnet2(
  retweet_net, 
  node.size = "cent", 
  node.color = "comm", 
  color.palette = "Spectral",
  # Set the edge color
  edge.color = c("color", "grey90")
) +
  # Turn off the size guide
  guides(size = FALSE)

Our retweet graph is like a colorful bouquet! Great work; now let’s try it with the next plotting package.

More ggnetwork Plotting Options

Just like ggnet2 in the last lesson, ggnetwork offers methods to quickly generate nice plots. It’s important to keep in mind that each package has a different style, which may or may not appeal to you. Recall that ggnetwork works by converting a graph to a data frame to be plotted by ggplot2. Therefore, all the ggplot2 parameterizations you’re used to are available. This gives you a great degree of flexibility, but can also mean more effort to achieve the aesthetic you desire. We’ll repeat the previous lesson’s exercise, this time using ggnetwork, giving you a good point of comparison to decide which package you like best.

The centrality and community membership attributes you created in the last exercise are still present.

ggplot(
  ggnetwork(retweet_graph, arrow.gap = 0.01), 
  aes(x = x, y = y, xend = xend, yend = yend)
) + 
  geom_edges(
    arrow = arrow(length = unit(6, "pt"), type = "closed"), 
    curvature = 0.2, color = "black"
  ) + 
  # Add a node layer, mapping color to comm and setting the size to 4
  geom_nodes(aes(color = comm), size = 4) + 
  theme_blank()

ggplot(
  # Draw a ggnetwork of retweet_graph
  ggnetwork(retweet_graph, arrow.gap = 0.01), 
  aes(x = x, y = y, xend = xend, yend = yend)
) + 
  geom_edges(
    # Add a color aesthetic, mapped to comm; a fixed color setting here
    # would override the mapping, so it is omitted
    aes(color = comm),
    arrow = arrow(length = unit(6, "pt"), type = "closed"), 
    curvature = 0.2
  ) + 
  # Make the size an aesthetic, mapped to cent 
  geom_nodes(aes(color = comm, size = cent)) + 
  theme_blank() +
  guides(
    color = guide_legend(title = "Community"), 
    # Add a guide to size titled "Centrality"
    size = guide_legend(title = "Centrality")
  )

Pretty plotting! ggnetwork provides powerful plot customization options.

Interactive visualizations

Interactive plots with ggiraph

Up until now, we’ve been making static plots of our graphs. However, there are many features of our graph that we may want to visualize, and if we displayed them all at once, the image would be overwhelming. That’s where interactive graphs can truly stand out. You can plot the basic graph structure and allow the user to see different vertex and edge properties based on how they interact with the plot. In this lesson we will build on work we’ve done with ggnetwork. First, we’ll take a 1% subsample of the bike sharing network (cut down for ease of visualization) and create a ggnetwork plot. Then, we’ll add betweenness centrality as a vertex property and create an interactive plot where centrality is displayed when the pointer hovers over a vertex.
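
As a reminder, the “previous step” referenced below stored betweenness centrality in a vertex attribute named cent; a minimal sketch of that step (assuming these names) might look like:

# Store betweenness centrality as the vertex attribute "cent" used for
# node sizing; sna also defines betweenness(), so qualify igraph's version
V(trip_g_simp)$cent <- igraph::betweenness(trip_g_simp)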

library(ggiraph)

# From previous step
static_network <- ggplot(
  ggnetwork(trip_g_simp, arrow.gap = 0.01), 
  aes(x = x, y = y, xend = xend, yend = yend)
) + 
  geom_edges() + 
  geom_nodes(aes(size = cent)) + 
  theme_blank() 

interactive_network <- static_network + 
  # Add an interactive point layer
  geom_point_interactive(
    # Map tooltip and data_id to centrality
    aes(tooltip = cent, data_id = cent)
  ) 

# Print the interactive network
girafe(code = print(interactive_network)) %>%
  # Set girafe options
  girafe_options(
    # Set hover options
    opts_hover(css = "cursor: pointer; fill: red; stroke: red; r: 5pt"),
    # Set tooltip options; give x-offset of 10 px 
    opts_tooltip(offx = 10)
  )
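
Because girafe() returns a standard HTML widget, you can also save the interactive plot for sharing (a sketch; the file name is just an example):

library(htmlwidgets)
# Write the widget to a self-contained HTML file
saveWidget(girafe(code = print(interactive_network)), "trip_network.html")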

Interactive javascript plots

Another widely used framework for creating interactive plots is D3.js. It has a specific standard for network plots that we can generate automatically from R. On the one hand, this is highly convenient: with just a few lines of code, you can create fully interactive D3.js plots. The drawback is that real customization only comes from directly editing the JavaScript source output from R (which is beyond the scope of this course). Nonetheless, it is quick and easy to create a nice D3.js network plot in R using the networkD3 library. In this lesson we’ll load up the #rstats Twitter dataset and add community membership. Then we’ll create a subgraph of just a few communities and render a D3.js network graph.

The tweet graph object, retweet_samp, is available.
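
For reference, a subgraph like this could be built as follows (a hypothetical sketch; it assumes the community attribute comm from the earlier exercises is present on retweet_graph):

# Keep only the vertices belonging to the three largest communities
big3 <- names(sort(table(V(retweet_graph)$comm), decreasing = TRUE))[1:3]
retweet_samp <- induced_subgraph(retweet_graph, V(retweet_graph)[comm %in% big3])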

# igraph_to_networkD3() and forceNetwork() come from networkD3
# (the successor to the older d3Network package)
library(networkD3)

# Run this to see the static version of the plot
ggplot(ggnetwork(retweet_samp, arrow.gap = 0.01), 
       aes(x = x, y = y, xend = xend, yend = yend)) + 
       geom_edges(color = "black") + 
       geom_nodes(aes(color = as.factor(comm))) + 
       theme_blank()  

# Convert retweet_samp to a networkD3 object
nd3 <- igraph_to_networkD3(retweet_samp, V(retweet_samp)$comm)

# View the data structure
str(nd3)

# Render your D3.js network
forceNetwork(
  # Define links from nd3 object
  Links = nd3$links, 
  # Define nodes from nd3 object
  Nodes = nd3$nodes, 
  # Specify the source column
  Source = "source", 
  # Specify the target column
  Target = "target", 
  NodeID = "name", 
  Group = "group", 
  legend = TRUE, 
  fontSize = 20
)
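
If you want to embed the result in a web page, networkD3 can write the widget out for you (a sketch; the file name is illustrative):

# Save the force-directed plot as a self-contained HTML file
d3_plot <- forceNetwork(Links = nd3$links, Nodes = nd3$nodes,
                        Source = "source", Target = "target",
                        NodeID = "name", Group = "group")
saveNetwork(d3_plot, file = "retweet_d3.html")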

Fantastic work, now you can put these network plots into web pages, or just impress your friends!

Alternative visualizations

Alternative ways to visualize a graph: Hive plots

Hive plots are another method for visualizing graphs. They place vertices along a set of axes defined by categories in the data. Position along an axis is determined by a graph feature such as centrality, while edge width and color can be set by other graph properties. Unlike hairball plots, a hive plot’s layout is determined by properties of the graph itself, which makes it easier to compare and interpret than other approaches. In this exercise, we’ll visualize some of the bike data by assigning each vertex to an axis by its geography (north, central, south). Vertices are positioned along each axis by centrality, and edges are colored in a heat map by geographic distance. This is why we see more red lines (longer distances) attached to vertices at the far ends of the axes (greater centrality).
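
The exercise below relies on a custom helper, dist_gradient(), to map geographic distance onto a heat-map gradient. Its definition isn’t shown here, but one plausible version (hypothetical) is:

# Hypothetical helper: bin a numeric distance vector and map each bin
# onto a yellow-to-red color ramp
dist_gradient <- function(d, n = 100) {
  pal <- colorRampPalette(c("yellow", "red"))(n)
  pal[cut(d, breaks = n, labels = FALSE)]
}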

library(HiveR)

# Convert trip_df to hive object using edge2HPD()
bike_hive <- edge2HPD(trip_df, axis.cols = rep("black", 3))

# Set edge color using our custom function
bike_hive$edges$color <- dist_gradient(trip_df$geodist)

# Set node radius based on centrality
bike_hive$nodes$radius <- ifelse(bike_cent > 0, bike_cent, runif(1000, 0, 3))

# Set node axis to station axis
bike_hive$nodes$axis <- dist_stations$axis

# Plot the hive
plotHive(bike_hive, method = "norm", bkgnd = "white")

Whoa, awesome work! Hive plots can be trickier than a hive of bees, but for the right data they are a powerful visualization.

BioFabric as an HTML widget

BioFabric plots are another way to visualize a graph without the typical hairball approach. Instead, vertices are laid out as horizontal lines, one per row, and edges as vertical lines: when two vertices are connected, a vertical line is drawn between their rows. By default, vertices are sorted by degree, which produces the triangular groupings. When using this method on large graphs, it’s best to save the output as HTML or PDF; the layout prevents the overlapping tangle often seen when visualizing large graphs. In this lesson we’ll take the #rstats Twitter dataset, subset it to a few communities, and visualize it with both ggnetwork and RBioFabric so we can compare the two approaches.

library(RBioFabric)
library(htmlwidgets)

# Make a Biofabric plot of retweet_samp
retweet_bf <- bioFabric(retweet_samp)

# Create HTMLwidget of retweet_bf
# bioFabric_htmlwidget(retweet_bf)  # this widget is extremely memory-hungry; leave commented out
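
For large graphs, a safer alternative (following the RBioFabric README workflow) is to draw straight to a PDF device; the file name and dimensions here are illustrative:

# Render the BioFabric plot directly into a wide PDF instead of an
# in-memory HTML widget
pdf("retweet_biofabric.pdf", width = 20, height = 10)
bioFabric(retweet_samp)
dev.off()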

I’m not telling any fabrications: these are tricky plots, but R makes them easy, and they’re an interesting alternative to the hairball plot.

Plotting graphs on a map

In this lesson, we’ll build a plot that overlays the bike graph on a map of Chicago, coloring the edges by subscriber status and scaling line thickness by the number of trips. Recall that subscribers to the bike sharing network are regular users and probably more likely to be locals, whereas non-subscribers are more likely to be tourists. Trips by non-subscribers tend to be limited to the lakefront portions of the city, which have more tourist attractions. Another pattern we’ll see is that the further west a station is, the thinner the lines, meaning fewer trips were taken. Overlaying the graph on the map makes certain network patterns much clearer than a regular hairball plot would.

The bike data, bike_dat, and a map of Chicago, chicago, are available.

library(ggmap)
library(dplyr)  # provides %>%, group_by(), and summarize() used below
chicago <- get_map("Chicago")

weighted_trips_by_usertype <- bike_dat %>% 
  group_by(
    from_station_id, to_station_id, 
    from_latitude, from_longitude, 
    to_latitude, to_longitude, 
    usertype
  ) %>% 
  # Weight each journey by number of trips
  summarize(weight = n())

# Create a base map
ggmap(chicago) + 
  # Add a line segment layer
  geom_segment( 
    aes(
      x = from_longitude, y = from_latitude, 
      xend = to_longitude, yend = to_latitude, 
      # Color by user type
      color = usertype, 
      # Set the line width to the weight
      size = weight
    ), 
    data = weighted_trips_by_usertype,
    alpha = 0.5
  )

Congratulations, you did it! That’s the end of the course, but it’s really just the beginning of all the amazing things you can do with networks.