Social Network Analysis

Worksheet 8: Conditional Uniform Graph Distributions

Author

Termeh Shafie

Introduction

In this session, we will be using conditional uniform graph distributions to simulate random networks. These random networks correspond to the null model and generate the null distribution to which we can compare our observed features to. Thus, we can conclude whether or not an observed feature of interested is significantly different than those from the null model. Most of the examples here are those presented in the lecture.

Packages needed

library(statnet)
library(igraph)
library(ggraph)
library(intergraph)
library(patchwork)
library(networkdata)

Object types

We will be primarily be working with matrix, network and graph objects. It is important that you can understand and pay attention to these since some functions only work with graph objects, and others with network/matrix objects. We try to keep it clear here by using suffix g, net and mat to clarify object assignment.

The Coleman Data

Load a dataset and extract adjacency matrix

We are going to use a data set, coleman, which is automatically loaded with the package statnet. To get information about it type ?coleman and select Colemans High School Friendship Data. This should open a help file with information about the data set. Read the description of the data in the help file in order to know what you are working with. To load the data in your session:

data(coleman, package = "sna")

As described in the help file, the data set is an array with 2 observations on the friendship nominations of 73 students (one for fall and one for spring). We will start by focusing on the fall network here, and create the adjacency matrix for the network:

fall_mat <- coleman[1,,] 

Q1: How can you check whether the network is directed or undirected?

Q2: How can you calculate the number of ties you have in the fall network?

Visualize the network

Create a graph object from the adjacency matrix and visualize the network:

fall_g <- graph_from_adjacency_matrix(fall_mat, "directed")
fall_p <- ggraph(fall_g , layout = "nicely") + 
          geom_edge_link(edge_colour = "#666060", end_cap = circle(9,"pt"), 
                         n = 2, edge_width = 0.4, edge_alpha = 1, 
                         arrow = arrow(angle = 15, 
                         length = unit(0.1, "inches"), 
                         ends = "last", type = "closed"))  +
            geom_node_point(fill = "#525252",colour = "#FFFFFF", 
                           size = 5, stroke = 1.1, shape = 21) + 
            theme_graph() + 
          ggtitle("fall friendship network") +
            theme(legend.position = "none")
fall_p

Uniform graph distribution given expected density: \({\cal{U}}|E(L)\)

Calculate the density of the Coleman fall network. Density is given as the number of present ties divided by the total number of possible ties in the network. We can use the adjacency matrix to calculate this

sum(fall_mat)/(dim(fall_mat)[1]*(dim(fall_mat)[1]-1))
[1] 0.04623288

but we can also use the graph object and call a function from igraph

edge_density(fall_g)
[1] 0.04623288

To generate one random graph with the same density on average as the observed fall network, we write:

sim1_mat <- rgraph(n = dim(fall_mat)[1], m = 1, 
                 tprob = edge_density(fall_g), mode = "digraph")
sim1_g <- graph_from_adjacency_matrix(sim1_mat, "directed")

Make sure you understand all arguments included. The random network and the observed network may not have the exact same number of edges but stochastically, it has the same density:

sum(sim1_mat)
[1] 267
sum(fall_mat)
[1] 243

Now we can plot the random network we generated next to the observed one to compare them:

sim1_p <- ggraph(sim1_g, layout = "nicely") + 
  geom_edge_link(edge_colour = "#666060", 
                 end_cap = circle(9,"pt"), n = 2, 
                 edge_width = 0.4, edge_alpha = 1, 
                 arrow = arrow(angle = 15,
                               length = unit(0.1, "inches"), 
                               ends = "last", type = "closed"))  +
  geom_node_point(fill = "#525252", colour = "#FFFFFF", 
                  size = 5, stroke = 1.1, shape = 21) + 
  theme_graph() + 
  ggtitle("random network") +
  theme(legend.position = "none")
fall_p + sim1_p # 'patchwork' required for this

Q3: Can you note any obvious differences in structure?

Uniform graph distribution given exact density (number of edges): \({\cal{U}}|L\)

Now we generate a random network with exactly the same density. This is done with a slightly different function called rgnm():

sim2_mat <- rgnm(n = 1, nv = dim(fall_mat)[1], m = sum(fall_mat), mode = "digraph")

Make sure you understand all arguments included. Now we calculate the out-degrees for this random network. This can be done using the adjacency matrix sim2_mat or by converting the matrix into a graph and using an igraph function:

sim2_outdeg <- rowSums(sim2_mat)
# using graph object:
sim2_g <- graph_from_adjacency_matrix(sim2_mat, "directed")
degree(sim2_g, mode = "out")

We can do the same to calculate the out-degree distribution for the observed network:

fall_outdeg <- rowSums(fall_mat)
# using graph object:
igraph::degree(fall_g, mode = "out")

Let’s plot these two out-degree distributions next to each other (note the difference in x axes when interpreting results):

fall_outdeg <- as.data.frame(fall_outdeg)
p_fall_outdeg <- ggplot(fall_outdeg, aes(x=fall_outdeg)) +
  geom_bar(color="darkgrey", fill="lightgrey") +
  labs(title = "out-degree distribution observed network", x = "degree") 

sim2_outdeg <- as.data.frame(sim2_outdeg)
p_sim2_outdeg <- ggplot(sim2_outdeg, aes(x=sim2_outdeg)) +
  geom_bar(color="darkgrey", fill="lightgrey") +
  labs(title = "out-degree distribution random network", x = "degree") 

p_fall_outdeg + p_sim2_outdeg 

Q4: If you interpret being nominated many times as ‘being active’, are there actors in the observed data more active than by pure chance?

Dyad census and triad census

Tabulate the number of dyads that do not have any ties, have exactly one tie, and that have two ties (i.e. are reciprocated or mutual dyads). We calculate these numbers for both the observed and the random network using igraph:

dyad_census(fall_g) 
dyad_census(sim2_g)
# using sna package with matrix objects instead
sna::dyad.census(fall_mat) 
sna::dyad.census(sim2_mat)

Q4: Where do you note the strongest difference between the observed and simulated network?

Now we do the same to compare the number of transitive triads. Using the sna function for calculating triad census might be easier since it also includes the triad labels:

triad_census(fall_g) 
triad_census(sim2_g)
# using sna package with matrix objects instead
sna::triad.census(fall_mat) 
sna::triad.census(sim2_mat)

Q4: Where do you note the strongest difference between the observed and simulated network in terms of transitivity?

Note that the number of complete triads (MAN: 300) is 22 in the observed data and 0 for the random network. However, this might be an unfair comparison as the complete 300 triangle contains three reciprocated ties and we already saw that the Coleman data had a much higher number mutual dyads than the random network. But is this just a coincidence? To answer this we need to generate a world of hypothetical networks by generating many many random networks.

To see just how unusual mutual ties are in the alternative world, we can generate 1000 random networks while conditioning on the observed number of ties (the exact density):

sim2.1000_mat <- rgnm(n = 1000, nv = dim(fall_mat)[1], m = sum(fall_mat), mode = "digraph")
sim2.1000_dc <- as.data.frame(sna::dyad.census(sim2.1000_mat))

Now we can draw the histogram for the distribution of mutual dyads through:

p_sim2.1000_dc <- ggplot(sim2.1000_dc, aes(x= Mut))  +
  geom_histogram(binwidth = 1, color="darkgrey", fill="lightgrey") +
  coord_cartesian(ylim=c(0,200)) +
  labs(title = "", x = "number of mutual ties") 
p_sim2.1000_dc

Q5: Do any of the 1000 random networks have as large a number of mutual dyads as in the observed fall network?

Uniform graph distribution given dyad census: \({\cal{U}}|\textrm{MAN}\)

Generate a new type of random graph, namely one that is random but has the exact same number of null, mutual and asymmetric dyads as the observed network:

sim3_mat <- rguman(n = 1, nv = 73, mut = 62, asym = 119, null = 2447, method = "exact")
sim3_g <- graph_from_adjacency_matrix(sim3_mat, "directed")

Now repeat the check of triad census:

triad.census(fall_g) 
triad.census(sim3_g)
# using sna package with matrix objects instead
sna::triad.census(fall_mat) 
sna::triad.census(sim3_mat)

As seen from the results, this still did not manage to produce any complete 300 triangles. How do we interpret this? We can say that

“had allocation of ties in the network been completely random given the ‘dyadic processes’, it would be unlikely that we would observe any complete triangles”

Can we then say that there are more complete triangles than we expect by chance? Just how likely or unlikely is it to observe this many triangles? In order to answer this we again need to produce a world of hypothetical networks.

We generate 1000 random graphs from the null model (that is given the fixed dyad census we observed). We also calculate the triad census of the randomly generated graphs:

sim3.1000_mat <- rguman(n = 1000, nv = dim(fall_mat)[1], 
            mut = 62, asym = 119, null = 2447, method = "exact")
sim3.1000_tc <- as.data.frame(sna::triad.census(sim3.1000_mat, mode = 'digraph'))

Finally, we plot the distribution of number of transitive triads (030T) under the null model, and include a red line showing where the observed number (23) falls in this distribution (you can tidy up the histogram if needed):

p_sim3.1000_tc <- ggplot(sim3.1000_tc, aes(x= `030T`))  +
  geom_histogram(binwidth = 1, color="darkgrey", fill="lightgrey") +
  geom_vline(xintercept = 23, lwd=0.5, colour="red") +
  coord_cartesian(ylim=c(0,200)) +
  labs(title = "", x = "number of 030T triads")
p_sim3.1000_tc

Q7: What can be concluded?

A homophily test: Lazega’s lawyers

Data

This data set comes from a network study of corporate law partnership that was carried out in a Northeastern US corporate law firm. You can read about this data set here. We will go through two examples for testing homophily using this data set. Thus, the null and alternative hypotheses for both tests are: \(H_0\): observed homophily effect is from \(\mathcal{U}|L\) model

\(H_1\): observed homophily effect is not from \(\mathcal{U}|L\) model

but we will use different relations and check for social selection based on different attributes.

Analysing homophily using non-parametric null distribution

Test 1: Friendship based on gender

We start with a simple homophily test: do lawyers befriend those with the same gender? To load the friendship network:

data("law_friends")

We then create an adjacency matrix from the directed graph, calculate number of ties and nodes:

law_mat.frn <- as_adjacency_matrix(law_friends, sparse = FALSE)
law_nodes <- dim(law_mat.frn)[1]
law_ties.frn <- sum(law_mat.frn)

Next we save the binary attribute ‘gender’ from the loaded graph object as a vector, which we then convert into a data frame:

law_attr.gend <- as.data.frame(vertex_attr(law_cowork)$gender)

To calculate the number of observed homophilous ties:

homoph_obs.frn <- sum(law_mat.frn[law_attr.gend == 1, law_attr.gend == 1]) + 
              sum(law_mat.frn[law_attr.gend == 2, law_attr.gend == 2])

Next, generate 1000 random graphs with the same number of ties as the observed one, i.e. the null model \({\cal{U}}|L\):

law_sim1000.frn <- rgnm(1000, law_nodes, law_ties.frn, mode='digraph')

For each random network generated, calculate the number of homophilous ties in the same way:

homoph_sim.frn <- apply(law_sim1000.frn, 1, function(x) {
                sum(x[law_attr.gend == 1,law_attr.gend == 1]) + 
                sum(x[law_attr.gend == 2, law_attr.gend == 2])})

To plot the distribution of homophilous ties under the null hypothesis \(H_0\):

homoph_sim.frn <- as.data.frame(homoph_sim.frn)
p_lawsim1000.frn <- ggplot(homoph_sim.frn, aes(x= homoph_sim.frn))  +
  geom_histogram(binwidth = 5, color="darkgrey", fill="lightgrey") +
  coord_cartesian(ylim=c(0,200)) +
  labs(title = "", x = "number of homophilious ties")
p_lawsim1000.frn 

Q8: Can you reject the null? Why or why not?

Test 1: Cowork based on law practice

We want to check whether or not the partners of the firm more frequently work together with other partners having the same practice. We import the data as a graph object from the networkdata package:

data("law_cowork")

We then create an adjacency matrix from the directed graph for the first 36 lawyers in the network corresponding to the partners of the firm (see attribute ‘status’). To test homophily now, we only consider the reciprocal ties so we need to symmetrize the matrix to create and undirected graph:

law_mat_cwdir <- as_adjacency_matrix(law_cowork, sparse = FALSE)
law_mat_cwdir <- law_mat_cwdir[1:36,1:36]
law_mat_cw <- (law_mat_cwdir == t(law_mat_cwdir) & law_mat_cwdir ==1) + 0
law_nodes_cw <- dim(law_mat_cw)[1]
law_ties_cw <- sum(law_mat_cw)/2

Next we save the binary attribute ‘practice’ (1 = litigation, 2 = corporate) from the graph object as a vector, which is then in turn converted into a data frame (again only for the first 36 lawyers who are partners):

law_attr.pract <- as.data.frame(vertex_attr(law_cowork)$pract[1:36])

To calculate the number of observed homophilous ties:

homoph_obs.cw <- sum(
  law_mat_cw[law_attr.pract == 1, law_attr.pract == 1])/2 + 
  sum(law_mat_cw[law_attr.pract == 2, law_attr.pract == 2])/2
homoph_obs.cw
[1] 72

Next, generate 1000 random graphs with the same number of ties as the observed one, i.e. the null model \({\cal{U}}|L\):

set.seed(7722) # so that we all get the same results
law_sim1000.cw <- rgnm(1000, law_nodes_cw, law_ties_cw, mode='graph')

For each random network generated, calculate the number of homophilous ties in the same way:

homoph_sim.cw <- apply(law_sim1000.cw, 1, function(x) {
                sum(x[law_attr.pract == 1,law_attr.pract == 1])/2 + 
                sum(x[law_attr.pract == 2, law_attr.pract == 2])/2})

The distribution of homophilous ties under the null hypothesis \(H_0\) is plotted below with a red line indicating where the observed value falls in this distribution:

homoph_sim.cw <- as.data.frame(homoph_sim.cw)
p_lawsim1000.cw <- ggplot(homoph_sim.cw , aes(x= homoph_sim.cw))  +
  geom_histogram(binwidth = 1, color="darkgrey", fill="lightgrey") +
  coord_cartesian(ylim=c(0,100)) +
  geom_vline(xintercept = homoph_obs.cw, lwd=0.5, colour="red") +
  labs(title = "", x = "number of homophilious ties")
p_lawsim1000.cw

Q9: Can you reject the null? Why or why not? We can also calculate the \(p\)-value for this test. Thus, you can use this distribution to calculate probability \(P(\text{test statistic} > \textrm{ observed value } |H_0 \textrm{ true })\):

sum(homoph_sim.cw  > homoph_obs.cw)/1000
[1] 0.002

Q10: How do you interpret this value?

Exercises

Exercise 1: Compare the in-degree distribution (interpreted as ‘popularity’) of the students in the observed Coleman network to one generated random network given expected density. Can you note major differences between the two?

Exercise 2: Generate 1000 random networks given number of edges to test reciprocity and transitivity effects in the Coleman spring network. Given these effects, can you detect a change from the fall to spring network?

Exercise 3: Repeat the homophily test of friendship relation given gender in the lawyer data set, but assume the null model is \(\mathcal{U}|\textrm{MAN}\) instead. What can you conclude?

Exercise 4: Use the third type of relation in the lawyer data set ‘advice’ and test for homophily using your attribute of choice. Assume the null model is \(\mathcal{U}|L\) model. What can you conclude?