In this lab, we will be testing a number of hypotheses about a
network’s structure using exponential random graph modeling (ERGM)
techniques using the statnet
package in R.1 Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris (2003). statnet
: Software tools for the Statistical Modeling of Network Data. statnetproject.org; see also ?? statnet
. For more information about ERGMs, see generally D. Lusher, J. Koskinen, & G. Robins (2012) Exponential Random Graph Models for Social Networks. statnet
provides a comprehensive framework for ERGM-based network modeling,
including tools for model estimation, model evaluation, model-based
network simulation, and network visualization. This functionality is
powered by a central Markov chain Monte Carlo Maximum Likelihood
Estimation (MCMCMLE) algorithm.2 For a great introduction to MCMC grounded in graph theory, see Jeremy Kun, Markov Chain Monte Carlo Without All the Bullshit.
statnet
resources:
● Tutorial
We will analyze the communication behaviors within a team of seventeen members who were involved in designing military installations.
CRIeq.txt
: each team member’s communication to
retrieve information from other team members on the topic of
environmental quality (eq). This is a directed, binary relation.
CAIeq.txt
: each team member’s communication to
allocate information to other team members on the topic of environmental
quality (eq). This is a directed, binary relation.
EXeq_cons.txt
: each team member’s expertise on the
topic of environmental quality (eq) as perceived on average by all team
members. This is a continuous attribute.
We will test various hypotheses based on the Theory of Transactive Memory.3 See Monge & Contractor (2003) Theories of Communication Networks, 198—203.
Hypothesis 1: Individuals are less likely to retrieve information from those who retrieve information from them.
Hypothesis 2a: Information retrieval tends to be transitive. That is, if individual i retrieves information from individual k, and individual k retrieves information from individual j, then individual i is more likely to retrieve information from individual j.
Hypothesis 2b: Transitivity increases at a sub-linear rate as a function of the number of ties.
Hypothesis 3a: Individuals tend to retrieve information from other members with high expertise.
Hypothesis 3b: Individuals with low expertise tend to retrieve information from many others.
Hypothesis 4: Individuals tend to retrieve information from members to whom they allocate information.
The analysis will use three files: the CRIeq.txt
as the network file, EXeq_cons.txt
as the attribute file, and CAIeq.txt
as the co-variate network file. To begin, we must convert the data
files into matrices, transform those matrices into networks, and attach
the attribute file to our base network.
Let’s begin by looking at the summary of our base network.
In your own words, explain what this network respresents and
its relationship to our attribute information and the other network.
ANSWER:
This network represents 17 nodes, or workers, who go to each other in
search of information. In total, there are 41 links between the nodes.
These relationships are directed, meaning they are not necessarily
reciprocal. Some nodes are more sought after than others. The other
network shows the relationships between nodes when allocating
information to others.
Before we conduct further analysis, let’s visualize our base network. Similar to our approach in Lab 2, we will begin by establishing set coordinates for our nodes in order to simplify visual comparisons.
Base Graph Structure
Next we can visualize our base network.
Base Network: Retrieval of Environmental Quality
Next, we will size the nodes by the their expertise value.
Base Network: Retrieval of Environmental Quality, Nodes Sized by Expertise Score
Let’s compare this visualization to sizing by in-degree centrality.
Base Network: Nodes by Indegree Centrality
Consider hypothesis 3a from Part I. Do these visualizations
prove or disprove the hypothesis? In your own words, interpret the
graphs and explain how they support or reject the hypothesis.
ANSWER:
It supports the hypothesis. Individual with larger expertise value tend
to have higher in-degree centrality (9, 6, 16, 5, 4)
Let’s explore the summary statistics of our co-variate network.
## Network attributes:
## vertices = 17
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges = 23
## missing edges = 0
## non-missing edges = 23
## density = 0.08455882
##
## Vertex attributes:
## vertex.names:
## character valued attribute
## 17 valid vertex names
##
## No edge attributes
##
## Network edgelist matrix:
## [,1] [,2]
## [1,] 7 2
## [2,] 6 4
## [3,] 7 4
## [4,] 9 4
## [5,] 14 4
## [6,] 6 5
## [7,] 17 5
## [8,] 2 6
## [9,] 14 6
## [10,] 7 8
## [11,] 2 9
## [12,] 4 9
## [13,] 5 9
## [14,] 6 9
## [15,] 14 9
## [16,] 17 9
## [17,] 4 10
## [18,] 7 11
## [19,] 4 13
## [20,] 5 16
## [21,] 6 16
## [22,] 4 17
## [23,] 5 17
In your own words, explain what this network represents and its relationship to the other network and the attribute information. ANSWER: This network represents the same 17 nodes, or workers, who allocate information to other nodes. There are fewer links between nodes – 23. These relationships are also directed, meaning they are not necessarily reciprocal. Again, some nodes are more sought after than others.
We will repeat the visualization process for our co-variate network. Observe the location and distribution of edges in the following visualization.
Covariate Network: Allocation of Environmental Quality
Next, we will size the nodes by their expertise scores.
Base Network: Allocation of Environmental Quality, Nodes Sized by Expertise Score
Allocation
Consider hypothesis 4 from Part I. Do the visualizations of
the retrieval and allocation networks sized by expertise support or
disprove the hypothesis based on visual inspection? In your own words,
interpret the graphs and explain how they support or reject the
hypothesis.
ANSWER: Partially, but not as strongly as
hypothesis 3a Nodes with higher expertise tend to have higher in-degree
centrality, but it’s not the case for every high-expertise node. There
are also examples of low expertise nodes that have high in-degree
centrality.
Next, we’re going to construct an Exponential Random Graph Model. Note that model construction is integral to this process; ERGM is not a single method of analysis, but a type of modelling that requires a theoretical grounding specific to the network and the hypotheses posed by the researcher. While the base ERGM simulation is set to take up to twenty iterations of simulations to fit the model by estimating parameters, it will stop at fewer if the generated networks converge on estimates of our coefficient values or paramters. ERGMs are used to predict ties as a function of individual covariates (i.e. attribute data, like “EX” in our example) or network structure.
We will start with a very basic model, looking only at the
probability of edge formation, otherwise known as density, to
demonstrate how co-efficients can be translated into odds ratios.4 The term for tie density (edges
) is often used similarly to an intercept term in a linear regression or other linear model such as R’s glm
.
Keep in mind that the base network is about information retrieval. Our
model will primarily allow us to ask what the probability is that an
information retrieval relationship will form between two nodes.
model0 <- ergm(CRIeq ~ edges)
## Evaluating log-likelihood at the estimate.
summary(model0)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula: CRIeq ~ edges
##
## Iterations: 5 out of 20
##
## Monte Carlo MLE Results:
## Estimate Std. Error MCMC % p-value
## edges -1.7288 0.1695 0 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 377.1 on 272 degrees of freedom
## Residual Deviance: 230.6 on 271 degrees of freedom
##
## AIC: 232.6 BIC: 236.3 (Smaller is better.)
What do coefficients mean? Coefficients are the change in the (log-odds) likelihood of a tie for a unit change in a predictor. In our basic model above, our only predictor is information on the number of edges in the network. We can see that the coefficient estimate is negative, suggesting that a tie is more likely not to form than form (i.e. density is less than .5.) To get a better sense of how less likely it is for a tie to form, we can translate our log-odds into a probability.
To translate our estimated coefficient for the edges
parameter into a probability, we can take the inverse-logit. We are
thereby finding the probability that a tie will form in the network,
looking only at the number of ties in the base network to build our
model. We can see below that this probability is equal to our network
density.
plogis(coef(model0)[[1]])
## [1] 0.1507353
network.density(CRIeq)
## [1] 0.1507353
Conceptually, this should be fairly easy to follow: if all we know
about a network is the number of ties we have and we’re attempting to
predict the probability that an edge will exist solely based on that
information, then the probability of an edge existing at any point in
the network equals the density of the network as a whole.5 As we move on from this toy model, keep in mind that the estimate for our edges
parameter will change as we add additional predictors or network
statistics as additional terms will partially explain tie formation.
As we build a network, we can evaluate whether individual network statistics or node attributes prove our hypotheses and whether they do so in a way that is significantly different from random chance. Later, we will evaluate whether the model does a good job of explaining our observed network.
model1 <- ergm(CRIeq ~ edges # Set the base term based on density/edge formation.
+ mutual # H1
+ transitive # H2a: Transitive triads ( type 120D, 030T, 120U, or 300) # What if the tie is part of a transitive triad?
+ nodeicov("EX") # H3a
+ nodeocov("EX") # H3b
+ edgecov(CAIeq) # H4
)
## Starting maximum likelihood estimation via MCMLE:
## Iteration 1 of at most 20:
## Optimizing with step length 1.
## The log-likelihood improved by 0.2307.
## Step length converged once. Increasing MCMC sample size.
## Iteration 2 of at most 20:
## Optimizing with step length 1.
## The log-likelihood improved by 0.00728.
## Step length converged twice. Stopping.
## Evaluating log-likelihood at the estimate. Using 20 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 .
## This model was fit using MCMC. To examine model diagnostics and check for degeneracy, use the mcmc.diagnostics() function.
summary(model1)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula: CRIeq ~ edges + mutual + transitive + nodeicov("EX") + nodeocov("EX") +
## edgecov(CAIeq)
##
## Iterations: 2 out of 20
##
## Monte Carlo MLE Results:
## Estimate Std. Error MCMC % p-value
## edges -6.9744 1.1490 0 < 1e-04 ***
## mutual -1.4515 0.9915 0 0.14436
## transitive 0.2817 0.1200 0 0.01969 *
## nodeicov.EX 9.5449 2.0638 0 < 1e-04 ***
## nodeocov.EX 1.2598 1.5984 0 0.43131
## edgecov.CAIeq 2.0994 0.6816 0 0.00229 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 377.1 on 272 degrees of freedom
## Residual Deviance: 123.8 on 266 degrees of freedom
##
## AIC: 135.8 BIC: 157.4 (Smaller is better.)
kable(plogis(coef(model1)))
edges | 0.0009346 |
mutual | 0.1897633 |
transitive | 0.5699572 |
nodeicov.EX | 0.9999284 |
nodeocov.EX | 0.7789891 |
edgecov.CAIeq | 0.8908474 |
Take a look at the ERGM equation discussed in the week 5 slides, reproduced below:
What term in the equation do the ERGM terms or network statistics correspond to?
ANSWER: g(y) is a vector of network statistics.
Explain each ERGM term and its relationship to your hypotheses. Test your hypotheses (think about whether the sign of your coefficient suggests a tie is more or less likely) and report whether your results are significant.6 A parameter is significant if its absolute value is more than twice its Standard Error.
Hypothesis 1: Individuals are less likely to retrieve information from those who retrieve information from them.
ANSWER:
According to the documentaion, “mutual” is “equal to the number of
pairs of actors i and j for which (i,j) and (j,i) both exist.” That
number is negative (-1.4515), so the estimate alone would confirm the
hypothesis. However, the std. error is relatively high (0.9915), which
means we can’t say that conclusively.
Hypothesis 2a: Information retrieval tends to be transitive. That is, if individual i retrieves information from individual k, and individual k retrieves information from individual j, then individual i is more likely to retrieve information from individual j.
ANSWER:
“Transitive”, which according to the documentation is “the number of
triads in the network that are transitive” is positive (0.2817) and the
standard error is just small enough so that value is significant.
Hypothesis 3a: Individuals tend to retrieve information from other members with high expertise.
ANSWER:
According to the documentation, nodeicov calculates the “main effect of
a covariate for in-edges”, in this case, for EX, which is the team
member’s expertise. This estimate is very positive and much high er than
the standard error, meaning the data supports the hypothesis with quite
a large certainty.
Hypothesis 3b: Individuals with low expertise tend to retrieve information from many others.
ANSWER:
According to the documentation, nodeocov calculates the “main effect of
a covariate for out-edges”, also in this case, for EX, which is the
team member’s expertise. The estimate value for this is positive, but
the standard error is even larger than the estimate. Meaning the data
does not support this hypothesis.
Hypothesis 4: Individuals tend to retrieve information from members to whom they allocate information.
ANSWER:
The funcion “edgecov”, which in this case uses data from CAIeq, has a
positive value much larger than its standard error. So the hypothesis is
strongly supported.
Now we will change our model slightly.
model2 <- ergm(CRIeq ~ edges
+ mutual # H1
+ dgwesp(0.5, fixed=T, type="OTP") # H2b: OTP "transitive shared partner" ordered pair (i,j) iff i->k->j.
+ nodeicov("EX") # H3a
+ nodeocov("EX") # H3b
+ edgecov(CAIeq) # H4
)
## Starting maximum likelihood estimation via MCMLE:
## Iteration 1 of at most 20:
## Optimizing with step length 1.
## The log-likelihood improved by 0.1652.
## Step length converged once. Increasing MCMC sample size.
## Iteration 2 of at most 20:
## Optimizing with step length 1.
## The log-likelihood improved by 0.002943.
## Step length converged twice. Stopping.
## Evaluating log-likelihood at the estimate. Using 20 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 .
## This model was fit using MCMC. To examine model diagnostics and check for degeneracy, use the mcmc.diagnostics() function.
summary(model2)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula: CRIeq ~ edges + mutual + dgwesp(0.5, fixed = T, type = "OTP") +
## nodeicov("EX") + nodeocov("EX") + edgecov(CAIeq)
##
## Iterations: 2 out of 20
##
## Monte Carlo MLE Results:
## Estimate Std. Error MCMC % p-value
## edges -6.8740 1.2136 0 < 1e-04 ***
## mutual -1.5142 1.0369 0 0.145398
## gwesp.OTP.fixed.0.5 0.6104 0.3613 0 0.092309 .
## nodeicov.EX 8.9890 2.3300 0 0.000144 ***
## nodeocov.EX 1.3097 1.7263 0 0.448719
## edgecov.CAIeq 2.1134 0.6710 0 0.001821 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 377.1 on 272 degrees of freedom
## Residual Deviance: 124.3 on 266 degrees of freedom
##
## AIC: 136.3 BIC: 157.9 (Smaller is better.)
kable(plogis(coef(model2)))
edges | 0.0010333 |
mutual | 0.1803157 |
gwesp.OTP.fixed.0.5 | 0.6480383 |
nodeicov.EX | 0.9998752 |
nodeocov.EX | 0.7874616 |
edgecov.CAIeq | 0.8922022 |
Evaluate the remaining hypothesis with your model.
Hypothesis 2b: Transitivity increases at a sub-linear rate as a function of the number of ties.
ANSWER:
Again, another case in which the estimate points to a direction of the
hypothesis, but the standard error does not allow for support of the
hypothesis. gwesp has a estimate of 0.6104, but a std. error of 0.3613.
Next, judge convergence of the MCMC processes of the first model, using the mcmc.diagnostics()
function. The function will plot the change of model statistics during the last iteration of the MCMC estimation procedure.7
Note that although the edge graphs appear to be periodic, the dips
between whole numbers are due to the fact that edges are always whole
numbers. For each model statistic, the left hand side plot gives
the change of the statistic with iterations, and the right hand side
plot is a histogram of the statistic values. Both are normalized, so the
observed data locate at 0.
mcmc.diagnostics(model1) # Performs the markov chain monte carlo diagnostics
## Sample statistics summary:
##
## Iterations = 16384:4209664
## Thinning interval = 1024
## Number of chains = 1
## Sample size per chain = 4096
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## edges 0.63086 7.375 0.11523 0.17067
## mutual 0.22803 2.473 0.03863 0.05567
## transitive 3.00269 31.115 0.48617 0.71710
## nodeicov.EX 0.32909 3.927 0.06136 0.09246
## nodeocov.EX 0.19260 2.483 0.03880 0.05832
## edgecov.CAIeq 0.04272 1.784 0.02787 0.03580
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## edges -12.000 -4.000 0.000e+00 5.000 17.000
## mutual -3.000 -2.000 0.000e+00 2.000 6.000
## transitive -40.000 -19.000 -4.000e+00 19.000 82.000
## nodeicov.EX -6.706 -2.412 5.882e-02 2.824 8.919
## nodeocov.EX -4.294 -1.529 -2.000e-09 1.824 5.471
## edgecov.CAIeq -4.000 -1.000 0.000e+00 1.000 3.000
##
##
## Sample statistics cross-correlations:
## edges mutual transitive nodeicov.EX nodeocov.EX
## edges 1.0000000 0.7440785 0.9183834 0.9909533 0.9222568
## mutual 0.7440785 1.0000000 0.8654203 0.7483071 0.8720016
## transitive 0.9183834 0.8654203 1.0000000 0.9245335 0.9371334
## nodeicov.EX 0.9909533 0.7483071 0.9245335 1.0000000 0.9110763
## nodeocov.EX 0.9222568 0.8720016 0.9371334 0.9110763 1.0000000
## edgecov.CAIeq 0.5273759 0.4952026 0.5245660 0.5121310 0.5631972
## edgecov.CAIeq
## edges 0.5273759
## mutual 0.4952026
## transitive 0.5245660
## nodeicov.EX 0.5121310
## nodeocov.EX 0.5631972
## edgecov.CAIeq 1.0000000
##
## Sample statistics auto-correlation:
## Chain 1
## edges mutual transitive nodeicov.EX nodeocov.EX
## Lag 0 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000
## Lag 1024 0.29602714 0.28999608 0.37011059 0.30805949 0.30786156
## Lag 2048 0.13494980 0.11044553 0.15565527 0.14686320 0.13138072
## Lag 3072 0.07822541 0.07382388 0.08220402 0.08408586 0.07803388
## Lag 4096 0.01954890 0.03750356 0.03831785 0.01996449 0.02589681
## Lag 5120 0.03131580 0.02918752 0.03555072 0.03433520 0.03801852
## edgecov.CAIeq
## Lag 0 1.000000000
## Lag 1024 0.199076878
## Lag 2048 0.086168510
## Lag 3072 0.026845574
## Lag 4096 -0.005624284
## Lag 5120 -0.004873216
##
## Sample statistics burn-in diagnostic (Geweke):
## Chain 1
##
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5
##
## edges mutual transitive nodeicov.EX nodeocov.EX
## 0.7790 -0.1910 0.5870 0.6246 0.7619
## edgecov.CAIeq
## -0.3076
##
## Individual P-values (lower = worse):
## edges mutual transitive nodeicov.EX nodeocov.EX
## 0.4360008 0.8485307 0.5572247 0.5322518 0.4461489
## edgecov.CAIeq
## 0.7584113
## Joint P-value (lower = worse): 0.04277244 .
## Warning in formals(fun): argument is not a function
MCMC Diagnostics, Model 1.
MCMC Diagnostics, Model 1.
##
## MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).
Repeat the process for the second model.
mcmc.diagnostics(model2) # Performs the markov chain monte carlo diagnostics
## Sample statistics summary:
##
## Iterations = 16384:4209664
## Thinning interval = 1024
## Number of chains = 1
## Sample size per chain = 4096
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## edges -0.310791 5.357 0.08371 0.08845
## mutual -0.005127 1.711 0.02674 0.02896
## gwesp.OTP.fixed.0.5 -0.478587 10.028 0.15669 0.17049
## nodeicov.EX -0.164206 2.864 0.04474 0.04765
## nodeocov.EX -0.078800 1.689 0.02639 0.02858
## edgecov.CAIeq -0.079346 1.684 0.02631 0.02908
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## edges -10.000 -4.000 0.00000 3.000 10.000
## mutual -3.000 -1.000 0.00000 1.000 3.000
## gwesp.OTP.fixed.0.5 -19.321 -7.343 -0.82945 6.107 19.445
## nodeicov.EX -5.684 -2.118 -0.11765 1.765 5.529
## nodeocov.EX -3.353 -1.191 -0.05882 1.059 3.176
## edgecov.CAIeq -4.000 -1.000 0.00000 1.000 3.000
##
##
## Sample statistics cross-correlations:
## edges mutual gwesp.OTP.fixed.0.5 nodeicov.EX
## edges 1.0000000 0.5313516 0.8982144 0.9821594
## mutual 0.5313516 1.0000000 0.7194566 0.5381319
## gwesp.OTP.fixed.0.5 0.8982144 0.7194566 1.0000000 0.9165649
## nodeicov.EX 0.9821594 0.5381319 0.9165649 1.0000000
## nodeocov.EX 0.8614649 0.7380090 0.8902859 0.8316911
## edgecov.CAIeq 0.4102323 0.3498729 0.4224946 0.3871053
## nodeocov.EX edgecov.CAIeq
## edges 0.8614649 0.4102323
## mutual 0.7380090 0.3498729
## gwesp.OTP.fixed.0.5 0.8902859 0.4224946
## nodeicov.EX 0.8316911 0.3871053
## nodeocov.EX 1.0000000 0.4554200
## edgecov.CAIeq 0.4554200 1.0000000
##
## Sample statistics auto-correlation:
## Chain 1
## edges mutual gwesp.OTP.fixed.0.5 nodeicov.EX
## Lag 0 1.000000000 1.00000000 1.000000000 1.000000000
## Lag 1024 0.054919942 0.07963608 0.084058581 0.062741563
## Lag 2048 0.002266933 0.01325322 0.007484444 0.012055499
## Lag 3072 -0.006670161 -0.01343285 -0.004961040 -0.014162831
## Lag 4096 -0.005378409 0.01433994 -0.009229691 -0.006125102
## Lag 5120 -0.007475345 -0.01042801 -0.016320508 -0.015197072
## nodeocov.EX edgecov.CAIeq
## Lag 0 1.000000000 1.0000000000
## Lag 1024 0.079578621 0.0995903768
## Lag 2048 -0.013145788 0.0244695122
## Lag 3072 -0.001161323 0.0005197621
## Lag 4096 0.005133018 -0.0272918852
## Lag 5120 0.009118882 -0.0020513647
##
## Sample statistics burn-in diagnostic (Geweke):
## Chain 1
##
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5
##
## edges mutual gwesp.OTP.fixed.0.5
## -0.348152 -0.915231 -0.353861
## nodeicov.EX nodeocov.EX edgecov.CAIeq
## -0.109243 -0.511664 -0.003743
##
## Individual P-values (lower = worse):
## edges mutual gwesp.OTP.fixed.0.5
## 0.7277261 0.3600702 0.7234432
## nodeicov.EX nodeocov.EX edgecov.CAIeq
## 0.9130100 0.6088865 0.9970136
## Joint P-value (lower = worse): 0.8045628 .
## Warning in formals(fun): argument is not a function
MCMC Diagnostics, Model 2.
MCMC Diagnostics, Model 2.
##
## MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).
Has the MCMC process converged to a desired state for each ERGM term? Explain how you interpret the plots.8 D. Lusher et al, ERGM,
12.3 (“[T]he inferential goal is to center the distribution of
statistics over those of the observed network, thus fitting a model that
we say gives maximal support to the data.”)
ANSWER: Yes, it converged to the desired state. The charts show a distribution around the terms’ values.
To evaluate the goodness-of-fit for our model, we need to simulate many variations of the model.9 See ?? simulate
for more information.
Let’s visually inspect two of our random networks based on our first model.
ggnet2(sim[[1]], mode = baseLayout, label = TRUE, arrow.size = 8, edge.color="blue", arrow.gap = 0.03) + theme(panel.background = element_rect(fill = "#fffff8")) + guides(color = FALSE, size = FALSE)
Random Graph Variant, Example 1
ggnet2(sim[[10]], mode = baseLayout, label = TRUE, arrow.size = 8, edge.color="orange", arrow.gap = 0.03) + theme(panel.background = element_rect(fill = "#fffff8")) + guides(color = FALSE, size = FALSE)
Random Graph Variant, Example 1
Next we’re going to extract the number of triangles from each of the 100 samples, create a histogram of that model, and place a red arrow at the value of the observed network.
model.tridist <- sapply(1:100, function(x) summary(sim[[x]] ~triangle)) # Extracts the tiangle data from the simulated networks
hist(model.tridist, xlim=c(0,140), breaks = 20) # Plots that triangle distribution as a histogram
CRIeq.tri <- summary(CRIeq ~ triangle) # Saves the CRIeq triangle data from the summary to the CRI.eq variable
arrows(CRIeq.tri,20, CRIeq.tri, 5, col="red", lwd=3) # Adds an arrow to the plotted histogram
Triangle Distribution
Is the distribution of triangles in your simulation a good match with the distribution of triangles in your observed network?
ANSWER:
The arrow points to a place in the histogram which is close, but not
exactly, the most frequent distribution. So it seems close, but not
exactly accurate.
Next, we will calculate the Goodness of Fit for several of the parameters in our model. A p-value closer to one is better; this represents the difference between the observed networks and simulations.
gof
##
## Goodness-of-fit for in-degree
##
## obs min mean max MC p-value
## 0 9 6 8.93 11 1.00
## 1 3 0 2.02 6 0.66
## 2 0 0 0.70 4 0.92
## 3 0 0 0.50 2 1.00
## 4 0 0 0.52 3 1.00
## 5 2 0 0.60 4 0.16
## 6 0 0 0.75 3 0.88
## 7 0 0 0.74 3 0.86
## 8 1 0 0.74 4 1.00
## 9 1 0 0.64 3 0.88
## 10 0 0 0.44 3 1.00
## 11 1 0 0.21 2 0.40
## 12 0 0 0.14 2 1.00
## 13 0 0 0.07 1 1.00
##
## Goodness-of-fit for out-degree
##
## obs min mean max MC p-value
## 0 4 0 1.77 6 0.18
## 1 1 1 3.54 9 0.24
## 2 3 0 3.78 8 0.94
## 3 4 0 3.86 8 1.00
## 4 3 0 2.59 7 0.94
## 5 2 0 1.15 4 0.62
## 6 0 0 0.30 2 1.00
## 7 0 0 0.01 1 1.00
##
## Goodness-of-fit for edgewise shared partner
##
## obs min mean max MC p-value
## esp0 10 3 11.75 21 0.82
## esp1 13 2 12.02 20 0.90
## esp2 11 2 9.97 20 0.96
## esp3 5 0 4.93 18 0.86
## esp4 2 0 1.69 8 0.86
## esp5 0 0 0.28 4 1.00
## esp6 0 0 0.02 2 1.00
##
## Goodness-of-fit for minimum geodesic distance
##
## obs min mean max MC p-value
## 1 41 26 40.66 56 0.98
## 2 19 20 37.96 73 0.00
## 3 4 0 13.37 37 0.30
## 4 0 0 3.52 22 0.62
## 5 0 0 0.92 7 1.00
## 6 0 0 0.09 3 1.00
## 7 0 0 0.01 1 1.00
## Inf 208 106 175.47 220 0.14
# -------------------------------------------------------------------------------------------------
# Test the goodness of fit of the model
# Compiles statistics for these simulations as well as the observed network, and calculates p-values
# -------------------------------------------------------------------------------------------------
par(mfrow=c(2,2)) # Separate the plot window into a 2 by 2 orientation
plot(gof) # Plot the goodness of fit
Goodness of Fit
Evaluate the plots and summary statitistics of the Goodness
of Fit measures for Model 1. Are the four terms evaluated show a good
fit between the simulated networks and the observed network?10 In general, for configurations in the model, the fit is considered good if │t│≤ 0.1. For configurations not included in the model, the fit is considered good if 0.1<│t│≤ 1, and not extreme if 1< │t│≤
2. For your plot, the dark black line represents the data for the
observed network. The boxplots represent the distribution of
corresponding degrees across the simulated networks, and the soft lines
are the 95% confidence intervals.
ANSWER: Out of the
four models, two of them show a good fit: in degree (with the exception
of 5 and 11) and edgewise shared partner. For the other two, the results
are not that good. For out-degree and minimum geodesic distance, almost
half do not fit.
After knitting your file to RPubs, copy the URL and paste it into the comment field of the Lab 2 Assignment on Canvas. Save this .Rmd file and submit it in the file portion of your Canvas assignment. Make sure to review your file and its formatting. Run spell check (built into RStudio) and proofread your answers before submitting. If you can’t publish to RPubs, save your HTML file as a PDF and submit that instead.11 There are many different ways to do this with different browsers. Google it.