Social Network Analysis

Worksheet 10: ERGMs I

Author

Termeh Shafie

Introduction

We’re going to follow the ERGM modelling outline:

  • specify and estimate model parameters that are assumed to govern the evolution of the network
  • simulate random networks from the specified model
  • compare the observed network to the simulated networks to assess goodness of fit.

The following resource is useful for looking up different model terms: ERGM terms.

Note that we are now performing stochastic simulation – in some cases your output will differ slightly from mine and between different runs (you can, however, use set.seed() to get exactly the same results).

Packages needed

library(statnet) # also loads the ERGM package
library(igraph)
library(ggraph)
library(intergraph)
library(patchwork)
library(networkdata)

Object types

We will primarily be working with matrix, network and graph objects. Note that ergm works with network objects and adjacency matrices, but since we will be using ggraph to visualize networks we also need graph objects. To keep things clear, we use the suffixes g, net and mat to indicate the type of each object assignment.

Florentine marriage network

We start by loading the Florentine marriage network (available in the statnet package) and then create the adjacency matrix from the loaded network object, using the code below. Note that some node attributes are available: priorates, totalties, vertex.names and wealth. We’ll be using these attributes later when modelling ERGMs.

data(florentine) # loads flomarriage and flobusiness data
flom_net <- flomarriage # look at the flomarriage network data
flom_mat <- as.matrix(flomarriage)

To visualize the network we create a graph object (note that using geom_node_text includes the vertex/family names but you can exclude this if you prefer):

flom_g <- asIgraph(flom_net)
flom_p <- ggraph(flom_g, layout = "stress") + 
  geom_edge_link0(edge_colour = "#666060", 
                  edge_width = 0.8, edge_alpha = 1) +
  geom_node_point(fill = "#808080", colour = "#808080",  
                  size = 7, shape = 21, stroke = 0.9) +
  theme_graph() + 
  theme(legend.position = "none") +
  geom_node_text(aes(label = vertex.names), colour = "#000000",
    size = 5, family = "sans") +
  ggtitle("Florentine marriage network")
flom_p

Model 1: Dyadic independence/Bernoulli graph

Estimation

We begin by specifying a Bernoulli model using the ergm function. This is done by including only the number of edges as a term in the model (recall from lecture that this implies dyadic independence). Run the model and print a summary of the model fit using the code below:

flom_mod1 <- ergm(flom_net ~ edges) # fit the model
summary(flom_mod1) # get a summary of model
Call:
ergm(formula = flom_net ~ edges)

Maximum Likelihood Results:

      Estimate Std. Error MCMC % z value Pr(>|z|)    
edges  -1.6094     0.2449      0  -6.571   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 166.4  on 120  degrees of freedom
 Residual Deviance: 108.1  on 119  degrees of freedom
 
AIC: 110.1  BIC: 112.9  (Smaller is better. MC Std. Err. = 0)

You can also print just the estimated coefficient by typing flom_mod1.
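
For example:

flom_mod1        # prints the call and the estimated coefficient
coef(flom_mod1)  # extracts the coefficient vector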

Q1. How can you interpret the parameter estimate?

The log-odds of any tie occurring is: \[ -1.609 \times \textrm{change in the number of ties} = -1.609 \times 1 \] for all ties, since the addition of any tie to the network changes the number of ties by 1. The corresponding probability is: \[\frac{\exp{(-1.609)}}{1+\exp{(-1.609)}}=0.1667\] which is what you would expect, since 20 of the 120 possible ties are present.
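
You can verify this directly in R (a minimal sketch; network.density() comes from the network package loaded via statnet):

plogis(coef(flom_mod1)["edges"])  # logistic transform of the edges coefficient
network.density(flom_net)         # observed density: 20 / 120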

Model 2: Transitivity effect added

Estimation

Next, we add a term for the number of completed triangles/triads (which would indicate transitivity).

set.seed(1) #include if you want the same results shown here
flom_mod2 <- ergm(flom_net ~ edges + triangle)
summary(flom_mod2) 
Call:
ergm(formula = flom_net ~ edges + triangle)

Monte Carlo Maximum Likelihood Results:

         Estimate Std. Error MCMC % z value Pr(>|z|)    
edges     -1.6913     0.3219      0  -5.254   <1e-04 ***
triangle   0.1808     0.5567      0   0.325    0.745    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 166.4  on 120  degrees of freedom
 Residual Deviance: 108.1  on 118  degrees of freedom
 
AIC: 112.1  BIC: 117.6  (Smaller is better. MC Std. Err. = 0.01061)

Q2 How can you interpret the parameter estimates?

Q3 What do the parameter estimates tell us about the configurations specified in the model?

Conditional log-odds of two actors forming a tie is:

  • \(-1.691\times\) change in the number of ties + \(0.181 \times\) change in the number of triangles
  • if the tie will not add any triangles to the network, its log-odds is: -1.691
  • if it will add one triangle to the network, its log-odds is: -1.691 + 0.181
  • if it will add two triangles to the network, its log-odds is: -1.691 + 0.181 \(\times\) 2
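
These log-odds can be converted to probabilities with plogis(); a small sketch using the coefficients estimated above:

b <- coef(flom_mod2)
plogis(b["edges"])                      # tie that closes no triangle
plogis(b["edges"] + b["triangle"])      # tie that closes one triangle
plogis(b["edges"] + 2 * b["triangle"])  # tie that closes two triangles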

MCMC diagnostics

You can use mcmc.diagnostics(flom_mod2) to observe the behavior of the MCMC estimation algorithm and check for degeneracy. What you want to see in the MCMC diagnostics is the MCMC sample statistics varying randomly around the observed values at each step in the trace plots (which means the chain is mixing well), and the differences between the observed and simulated values of the sample statistics having a roughly bell-shaped distribution centered at 0 (which means no systematic difference):

mcmc.diagnostics(flom_mod2, center = TRUE)

Sample statistics summary:

Iterations = 14336:262144
Thinning interval = 1024 
Number of chains = 1 
Sample size per chain = 243 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

           Mean    SD Naive SE Time-series SE
edges    0.2058 4.957   0.3180         0.3180
triangle 0.2222 2.866   0.1839         0.1839

2. Quantiles for each variable:

         2.5% 25% 50% 75% 97.5%
edges      -9  -3   0   4  9.95
triangle   -3  -2   0   1  7.95


Are sample statistics significantly different from observed?
               edges  triangle    (Omni)
diff.      0.2057613 0.2222222        NA
test stat. 0.6471018 1.2086209 1.6694087
P-val.     0.5175661 0.2268085 0.4367585

Sample statistics cross-correlations:
             edges  triangle
edges    1.0000000 0.7771561
triangle 0.7771561 1.0000000

Sample statistics auto-correlation:
Chain 1 
                edges    triangle
Lag 0     1.000000000  1.00000000
Lag 1024  0.056425983 -0.04043396
Lag 2048  0.006273791  0.01940035
Lag 3072 -0.051649675 -0.02455474
Lag 4096 -0.034487041  0.02913158
Lag 5120 -0.023329356  0.01178056

Sample statistics burn-in diagnostic (Geweke):
Chain 1 

Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5 

   edges triangle 
1.458524 1.483872 

Individual P-values (lower = worse):
    edges  triangle 
0.1446962 0.1378428 
Joint P-value (lower = worse):  0.1011719 

Note: MCMC diagnostics shown here are from the last round of
  simulation, prior to computation of final parameter estimates.
  Because the final estimates are refinements of those used for this
  simulation run, these diagnostics may understate model performance.
  To directly assess the performance of the final model on in-model
  statistics, please use the GOF command: gof(ergmFitObject,
  GOF=~model).

Q4 How would you interpret these results?

Simulation

When we have estimated the coefficients of an ERGM, we have defined a probability distribution over all networks of the same size. If the model is a good fit to the observed data, networks drawn from this distribution will resemble the observed data. To draw networks from this distribution we use the simulate() function. We draw ten networks from the specified model and use the command below to get a summary of the network statistics (edges and triangles) for each of the ten sampled networks.

flom_mod2.sim <- simulate(flom_mod2, nsim = 10)
summary(flom_mod2.sim)
Number of Networks: 10 
Model: flom_net ~ edges + triangle 
Reference: ~Bernoulli 
Constraints: ~. ~. - observed 
Stored network statistics:
      edges triangle
 [1,]    16        3
 [2,]    26        7
 [3,]    18        1
 [4,]    17        1
 [5,]    22        1
 [6,]    18        1
 [7,]    11        1
 [8,]    22        4
 [9,]    20        3
[10,]    26        6
attr(,"monitored")
[1] FALSE FALSE
Number of Networks: 10 
Model: flom_net ~ edges + triangle 
Reference: ~Bernoulli 
Constraints: ~. ~. - observed 

This gives you a list of the ten networks, with columns showing how many edges and triangles are present in each simulated network. Since all simulated networks are stored in this list, you can call each one of them individually. For example, below we call simulated networks 1 and 2:

flom_mod2.sim[[1]]
flom_mod2.sim[[2]]

You can also choose one of the networks to visualize; below is an example for the tenth (i.e. the last) simulated network in the list:

flom.sim_g <- asIgraph(flom_mod2.sim[[10]])
flom.sim_p <- ggraph(flom.sim_g, layout = "stress") + 
  geom_edge_link0(edge_colour = "#666060", 
                  edge_width = 0.8, edge_alpha = 1) +
  geom_node_point(fill = "#808080", colour = "#808080",  
                  size = 7, shape = 21, stroke = 0.9) +
  theme_graph() + 
  theme(legend.position = "none") +
  ggtitle("Simulated network")
flom.sim_p

These simulations are crucial for examining the goodness of fit which we will do next.
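
As a quick first check, you can compare the observed statistics to the simulated ones directly (a sketch; summary() on an ergm formula returns the network statistics):

obs_stats <- summary(flom_net ~ edges + triangle)
sim_stats <- t(sapply(flom_mod2.sim, function(net) summary(net ~ edges + triangle)))
rbind(observed = obs_stats, simulated_mean = colMeans(sim_stats))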

Goodness of fit

The MCMC algorithm draws a dyad at random at each step and evaluates the probability of a tie from the perspective of these two nodes. That probability is governed by the ergm terms specified in the model and the current estimates of the coefficients on these terms. Once the estimates converge, simulations from the model will produce networks that are centered on the observed model statistics, i.e. those we control for (otherwise it is a sign that something has gone wrong in the estimation process). The networks will also have other emergent global properties that are not represented by explicit terms in the model. Thus, goodness of fit can be assessed in two ways, where the first is to be preferred:

  • evaluate the fit to the specified terms in the model (done by default)

  • evaluate the fit to emergent global network properties using terms not specified in the model

If the first does not indicate anything off in the estimation process, you can use the second, where the fit to emergent global network properties can be evaluated at three levels:

  1. the node level (degree)

  2. the edge level (esp: edgewise shared partners)

  3. the dyad level (geodesic distances)

We check now whether the specified model above fits the observed data and how well it reproduces it. We do this by choosing a network statistic (that is not specified in the model), and comparing the value of this statistic to the distribution of values we get in simulated networks from our model. We use the gof() function.

flom_mod2.gof <- gof(flom_mod2) # this will produce 4 plots
par(mfrow=c(2,2)) # figure orientation with 2 rows and 2 columns
plot(flom_mod2.gof) # gof plots

To get an output containing the summary of the gof:

flom_mod2.gof # summary output of gof
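
By default gof() evaluates the degree, shared partner and geodesic distance distributions together with the model statistics, but you can also choose the statistics yourself via a GOF formula (a sketch using the term names documented in gof.ergm):

gof(flom_mod2, GOF = ~ model)                          # fit to the terms in the model itself
gof(flom_mod2, GOF = ~ degree + espartners + distance) # fit to emergent properties only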

Q5 How would you interpret the goodness of fit here?

Lazega’s lawyers (from lecture)

We will use the same subset of this network as in the previous lab: we want to check whether or not the partners of the firm more frequently work together with other partners having the same practice. We import the data as a graph object from the networkdata package:

data("law_cowork")

We then create an adjacency matrix from the directed graph for the first 36 lawyers in the network, corresponding to the partners of the firm (see the attribute ‘status’). To test homophily, we only consider reciprocal ties, so we need to symmetrize the matrix to create an undirected graph:

law_mat_cwdir <- as_adjacency_matrix(law_cowork, sparse = FALSE)
law_mat_cwdir <- law_mat_cwdir[1:36,1:36]
law_mat_cw <- (law_mat_cwdir == t(law_mat_cwdir) & law_mat_cwdir ==1) + 0

Next we save the binary attribute ‘practice’ (1 = litigation, 2 = corporate) from the graph object as a vector (again only for the first 36 lawyers, who are partners):

law_attr.pract <- vertex_attr(law_cowork)$pract[1:36] - 1

Since observed attribute values are 1/2, we subtract 1 to get a 0/1 variable which is easier to interpret in an ERGM context.
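
A quick check that the recoding worked:

table(law_attr.pract)  # should contain only 0s and 1s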

We create a network object and add the node attribute ‘practice’:

law_net <- as.network(law_mat_cw, directed = FALSE) 
law_net %v% "practice" <- law_attr.pract
law_net
 Network attributes:
  vertices = 36 
  directed = FALSE 
  hyper = FALSE 
  loops = FALSE 
  multiple = FALSE 
  bipartite = FALSE 
  total edges= 115 
    missing edges= 0 
    non-missing edges= 115 

 Vertex attribute names: 
    practice vertex.names 

No edge attributes

Model 1: homophily and clustering

Estimation

We are interested in running an ERGM with the following statistics (as done during lecture)

  • nodecov(“practice”)

  • match(“practice”)

  • gwesp(decay = 0.693)

Q6 Can you recall what these statistics represent? To run the ERGM:

law_mod1 <- ergm(law_net ~ edges
  + nodecov("practice") + match("practice")
  + gwesp(0.693, fixed = TRUE)
)
summary(law_mod1)
Call:
ergm(formula = law_net ~ edges + nodecov("practice") + match("practice") + 
    gwesp(0.693, fixed = TRUE))

Monte Carlo Maximum Likelihood Results:

                   Estimate Std. Error MCMC % z value Pr(>|z|)    
edges              -4.38591    0.32189      0 -13.626  < 1e-04 ***
nodecov.practice    0.18221    0.07121      0   2.559 0.010498 *  
nodematch.practice  0.60364    0.16760      0   3.602 0.000316 ***
gwesp.fixed.0.693   1.14125    0.16112      0   7.083  < 1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 873.4  on 630  degrees of freedom
 Residual Deviance: 503.4  on 626  degrees of freedom
 
AIC: 511.4  BIC: 529.2  (Smaller is better. MC Std. Err. = 0.3047)

See lecture slides for the interpretation of these coefficients.
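
As a rough numerical illustration (a sketch that ignores the gwesp contribution, i.e. it assumes the tie closes no shared partnerships), the conditional tie probabilities by practice are:

b <- coef(law_mod1)
plogis(b["edges"] + 2 * b["nodecov.practice"] + b["nodematch.practice"]) # two corporate partners
plogis(b["edges"] + b["nodematch.practice"])                             # two litigation partners
plogis(b["edges"] + b["nodecov.practice"])                               # one of each practice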

MCMC diagnostics

Check the model by running MCMC diagnostics to observe what is happening with the simulation algorithm:

mcmc.diagnostics(law_mod1, center = TRUE)

Sample statistics summary:

Iterations = 122880:2424832
Thinning interval = 2048 
Number of chains = 1 
Sample size per chain = 1125 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                    Mean    SD Naive SE Time-series SE
edges              3.588 30.18   0.8997          2.050
nodecov.practice   1.178 29.10   0.8677          1.771
nodematch.practice 2.573 18.51   0.5520          1.187
gwesp.fixed.0.693  6.930 59.05   1.7605          4.047

2. Quantiles for each variable:

                     2.5%    25%  50%   75% 97.5%
edges               -70.0 -14.00  8.0 25.00  49.0
nodecov.practice    -75.0 -12.00  6.0 21.00  43.9
nodematch.practice  -41.8  -8.00  4.0 15.00  33.0
gwesp.fixed.0.693  -132.5 -27.95 12.5 49.26 101.6


Are sample statistics significantly different from observed?
                edges nodecov.practice nodematch.practice gwesp.fixed.0.693
diff.      3.58755556        1.1777778         2.57333333        6.93039339
test stat. 1.74979703        0.6652089         2.16772023        1.71264367
P-val.     0.08015334        0.5059169         0.03017998        0.08677811
                 (Omni)
diff.                NA
test stat. 1.890437e+01
P-val.     9.736672e-04

Sample statistics cross-correlations:
                       edges nodecov.practice nodematch.practice
edges              1.0000000        0.8720346          0.9463973
nodecov.practice   0.8720346        1.0000000          0.8292965
nodematch.practice 0.9463973        0.8292965          1.0000000
gwesp.fixed.0.693  0.9942961        0.8753042          0.9426539
                   gwesp.fixed.0.693
edges                      0.9942961
nodecov.practice           0.8753042
nodematch.practice         0.9426539
gwesp.fixed.0.693          1.0000000

Sample statistics auto-correlation:
Chain 1 
              edges nodecov.practice nodematch.practice gwesp.fixed.0.693
Lag 0     1.0000000       1.00000000          1.0000000         1.0000000
Lag 2048  0.7091548       0.61237659          0.6442048         0.6814578
Lag 4096  0.4717707       0.37544290          0.4192158         0.4478894
Lag 6144  0.3156917       0.22673439          0.2935639         0.2997408
Lag 8192  0.2111325       0.13342547          0.1785633         0.2016442
Lag 10240 0.1487237       0.07132966          0.1299507         0.1454400

Sample statistics burn-in diagnostic (Geweke):
Chain 1 

Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5 

             edges   nodecov.practice nodematch.practice  gwesp.fixed.0.693 
          1.493081           0.917114           1.500024           1.640398 

Individual P-values (lower = worse):
             edges   nodecov.practice nodematch.practice  gwesp.fixed.0.693 
         0.1354159          0.3590829          0.1336083          0.1009225 
Joint P-value (lower = worse):  0.005343654 

Note: MCMC diagnostics shown here are from the last round of
  simulation, prior to computation of final parameter estimates.
  Because the final estimates are refinements of those used for this
  simulation run, these diagnostics may understate model performance.
  To directly assess the performance of the final model on in-model
  statistics, please use the GOF command: gof(ergmFitObject,
  GOF=~model).

Q7 Do you see any problems with model degeneracy here? Is the estimation process working as it should?

Goodness of fit

Goodness of fit can be checked as done earlier:

law_mod1.gof <- gof(law_mod1) # this will produce 4 plots
par(mfrow = c(2, 2)) # figure orientation with 2 rows and 2 columns
plot(law_mod1.gof)

Note that you should not use esp to assess goodness of fit since it was explicitly modeled via the gwesp term in the specified model.
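
If you want to leave the shared partner statistics out of the assessment altogether, you can restrict the GOF formula (a sketch):

law_mod1.gof2 <- gof(law_mod1, GOF = ~ degree + distance)
plot(law_mod1.gof2)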

Knecht’s Friendship Network

For the last part, we will fit an ERGM to a directed network to check for reciprocity. We use a friendship network (Knecht, 2008) which can be loaded using the networkdata package. You can read about the network by typing ?knecht. Note that the network is longitudinal and observed over four time periods. We will here focus on the last time period. To load the wave 4 data and to visualize it:

data("knecht")
knecht4_g <- knecht[[4]]
knecht4_p <- ggraph(knecht4_g, layout = "stress") + 
  geom_edge_link(edge_colour = "#666060", end_cap = circle(9,"pt"), 
                         n = 2, edge_width = 0.4, edge_alpha = 1, 
                         arrow = arrow(angle = 15, 
                         length = unit(0.1, "inches"), 
                         ends = "last", type = "closed"))  +
  geom_node_point(fill = "#000000", colour = "#000000", 
                  size = 7, shape = 21, stroke = 0.9) +
  theme_graph() + 
  theme(legend.position = "none") +
  ggtitle("Observed network (wave 4)")
knecht4_p

Next we create the network object to fit ERGMs:

# create the network object
knecht4_net <- asNetwork(knecht[[4]])

Model 1: Reciprocity effect

Estimation

knecht4_mod1 <- ergm(knecht4_net ~ edges + mutual)
summary(knecht4_mod1) 
Call:
ergm(formula = knecht4_net ~ edges + mutual)

Monte Carlo Maximum Likelihood Results:

       Estimate Std. Error MCMC % z value Pr(>|z|)    
edges   -2.1889     0.1450      0 -15.100   <1e-04 ***
mutual   2.4115     0.3212      0   7.507   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 901.1  on 650  degrees of freedom
 Residual Deviance: 562.9  on 648  degrees of freedom
 
AIC: 566.9  BIC: 575.8  (Smaller is better. MC Std. Err. = 0.7652)

Q8 How do you interpret these results?
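
As a starting point for the interpretation, the coefficients can be turned into conditional tie probabilities (a sketch):

b <- coef(knecht4_mod1)
plogis(b["edges"])               # tie that does not reciprocate an existing tie
plogis(b["edges"] + b["mutual"]) # tie that reciprocates an existing tie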

MCMC diagnostics

mcmc.diagnostics(knecht4_mod1)

Sample statistics summary:

Iterations = 14336:262144
Thinning interval = 1024 
Number of chains = 1 
Sample size per chain = 243 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

          Mean     SD Naive SE Time-series SE
edges  -1.0988 11.869   0.7614         0.9010
mutual -0.3333  5.357   0.3436         0.4182

2. Quantiles for each variable:

         2.5%  25% 50% 75% 97.5%
edges  -26.00 -9.5   0   7    19
mutual -10.95 -4.0   0   3    11


Are sample statistics significantly different from observed?
                edges     mutual    (Omni)
diff.      -1.0987654 -0.3333333        NA
test stat. -1.2194889 -0.7970729 1.8580854
P-val.      0.2226587  0.4254087 0.3982094

Sample statistics cross-correlations:
           edges    mutual
edges  1.0000000 0.8121613
mutual 0.8121613 1.0000000

Sample statistics auto-correlation:
Chain 1 
               edges      mutual
Lag 0     1.00000000  1.00000000
Lag 1024  0.16476519  0.19190028
Lag 2048 -0.03275616 -0.03862647
Lag 3072 -0.04234861 -0.10047043
Lag 4096  0.03109986  0.06411610
Lag 5120  0.05887512  0.04907514

Sample statistics burn-in diagnostic (Geweke):
Chain 1 

Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5 

      edges      mutual 
-0.09833024 -0.92194343 

Individual P-values (lower = worse):
    edges    mutual 
0.9216701 0.3565581 
Joint P-value (lower = worse):  0.3677926 

Note: MCMC diagnostics shown here are from the last round of
  simulation, prior to computation of final parameter estimates.
  Because the final estimates are refinements of those used for this
  simulation run, these diagnostics may understate model performance.
  To directly assess the performance of the final model on in-model
  statistics, please use the GOF command: gof(ergmFitObject,
  GOF=~model).

Q9 How do you interpret these results?

Goodness of fit

Note that since we now are considering a directed network, we need to separate in- and out-degree when assessing the goodness of fit:

knecht4_mod1.gof <- gof(knecht4_mod1) # this will produce 4 plots
par(mfrow = c(3,2)) # figure orientation with 3 rows and 2 columns
plot(knecht4_mod1.gof)

Q10 How do you interpret these results?

Model 2: Reciprocity and homophily effect

Now we also include a homophily effect, i.e. do students tend to befriend others of the same gender?

Q11 Run the usual steps of fitting an ERGM, checking the estimation algorithm and assessing the goodness of fit. The ERGM syntax is shown below. What can you conclude?

knecht4_mod2 <- ergm(knecht4_net ~ edges +  nodecov("gender") + 
                       nodematch("gender") + mutual)
summary(knecht4_mod2) 
Call:
ergm(formula = knecht4_net ~ edges + nodecov("gender") + nodematch("gender") + 
    mutual)

Monte Carlo Maximum Likelihood Results:

                 Estimate Std. Error MCMC % z value Pr(>|z|)    
edges             -4.1981     0.3843      0 -10.925   <1e-04 ***
nodecov.gender     0.4966     0.1192      0   4.167   <1e-04 ***
nodematch.gender   1.2061     0.2257      0   5.344   <1e-04 ***
mutual             2.0843     0.3735      0   5.580   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 901.1  on 650  degrees of freedom
 Residual Deviance: 520.4  on 646  degrees of freedom
 
AIC: 528.4  BIC: 546.3  (Smaller is better. MC Std. Err. = 0.8236)
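
The usual follow-up checks would then look like this (a sketch mirroring the earlier steps):

mcmc.diagnostics(knecht4_mod2)        # check the behavior of the estimation algorithm
knecht4_mod2.gof <- gof(knecht4_mod2) # goodness of fit (in- and out-degree separated)
par(mfrow = c(3, 2))                  # figure orientation with 3 rows and 2 columns
plot(knecht4_mod2.gof)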

Exercises

Exercise 1 Use the undirected cowork network of the lawyer data.

(a) Focus on the attribute ‘gender’ now. Fit an ERGM that potentially could answer whether or not the partners of the firm more frequently work together with other partners of the same gender.

(b) Focus now instead on all 71 lawyers of the data and fit an ERGM that potentially could answer whether or not the partners of the firm more frequently work together with other partners (i.e. homophily on the attribute ‘status’).

(c) Can you understand why maximum pseudolikelihood estimation (MPLE) is used when only including nodecov() and match() terms?

Exercise 2: Import the data on Kapferer’s Tailors from the package networkdata. You can read about the data by typing ?mine. Note that the data is imported as graph objects so to convert it to a network object, type the following:

mine_net <- asNetwork(mine)

Perform the following tasks:

(a) Fit an ERGM with edges and triangles included. How do you interpret the coefficients?

(b) Run MCMC diagnostics on the model in (a). Do you note any problems in the estimation process?

(c) Perform a goodness of fit assessment on the model in (a). What can you conclude?

Exercise 3: Use the directed Knecht friendship networks for this exercise.

(a) We wish to compare the reciprocity effect from wave 1 to wave 4. We fitted the ERGM for wave 4 above. Run the same ERGM for wave 1 and run the usual checks of the fitted model. Can you notice a difference in reciprocity over time?

(b) Run the same comparison as in (a) but also include gender homophily in your model specification. Can you notice a difference in the effects from wave 1 to wave 4?