Social Network Analysis

Worksheet 9: Random Graph Models

Author

Termeh Shafie

Introduction

In this session, we introduces a range of models used to represent and understand the structure of social and relational networks. We begin with the classic \(G(n,p)\) random graph model, in which each pair of nodes is connected independently with a fixed probability. While analytically tractable and conceptually simple, \(G(n,p)\) falls short in capturing the structural complexity of real-world networks; it fails to account for common features such as clustering, skewed degree distributions, or short average path lengths.

To address these shortcomings, we explore several alternative models that incorporate more realistic structural constraints. The small-world model captures the coexistence of high local clustering and global reachability observed in many social systems. The configuration model enables the generation of random graphs with a fixed degree sequence, offering more control over node-level connectivity patterns. For each model, we discuss its definition, how to simulate or estimate it, and its relevance for modeling real network data.

Packages Needed

library(igraph)
library(ggraph)
library(graphlayouts)
library(networkdata)
library(tidyverse)

The Erdős–Rényi Model: \(G(n, p)\)

To see why the \(G(n, p)\) model is often an inadequate representation of real-world networks, we can compare its properties to those of an actual empirical network. A typical social or informational network displays three features that are not captured well by \(G(n, p)\): a right-skewed degree distribution (with hubs), high clustering or triadic closure, and short average path lengths. While \(G(n, p)\) can match the density of a network, it assumes a binomial (or normal) degree distribution, minimal clustering, and does not account for structural heterogeneity.

The example below uses the igraph package in R and a real network dataset of moderate-to-large size to illustrate these differences. We load a real-world network; a network of co-appearances of characters in Victor Hugo’s novel “Les Miserables” which can be loaded from the networkdata package.

We compute this its key structural properties, then generate a random graph with the same number of nodes and expected density using sample_gnp(). We then compare the two in terms of degree distribution, transitivity (clustering), and average geodesic distance.

The code below summarizes key structural properties of the observed Les Misérables network and the corresponding \(G(n, p)\) random graph. These include the global clustering coefficient (measuring the tendency of nodes to form closed triads), the average geodesic distance (a measure of path efficiency), and the maximum degree (the highest number of connections any single node has).

The observed network shows substantially higher clustering, a slightly shorter average path length, and a much larger maximum degree. These results highlight that the empirical network is both more locally cohesive and more hierarchically structured than its random counterpart. The presence of hubs and local clusters—common in real-world networks—is not reproduced by the \(G(n, p)\) model, which assumes uniform and independent edge probabilities.

Together, these differences support the conclusion that random tie formation alone cannot explain the structure of this network.

library(knitr)
# Load the Les Misérables network from networkdata
data("miserables")
g_obs <- miserables

# Basic stats of the observed network
n <- vcount(g_obs)
m <- ecount(g_obs)
density_obs <- edge_density(g_obs)
deg_obs <- degree(g_obs)
clustering_obs <- transitivity(g_obs, type = "global")
dist_obs <- mean_distance(g_obs, directed = FALSE, unconnected = TRUE)

# Generate a G(n, p) graph with the same density
set.seed(123)
g_gnp <- sample_gnp(n = n, p = density_obs, directed = FALSE)

deg_gnp <- degree(g_gnp)
clustering_gnp <- transitivity(g_gnp, type = "global")
dist_gnp <- mean_distance(g_gnp, directed = FALSE, unconnected = TRUE)

# Combine comparison into a data frame
comparison <- data.frame(
  Model = c("Observed", "G(n, p)"),
  Clustering = c(clustering_obs, clustering_gnp),
  AvgPathLength = c(dist_obs, dist_gnp),
  MaxDegree = c(max(deg_obs), max(deg_gnp))
)

# Format with kable
kable(comparison, caption = "Comparison of structural features: Observed vs G(n, p)",
      digits = 3, align = "c")

Comparison of structural features: Observed vs G(n, p)
Model	Clustering	AvgPathLength	MaxDegree
Observed	0.499	4.861	36
G(n, p)	0.086	2.636	12

To further illustrate the limitations of the \(G(n, p)\) model, we also examine the degree distributions of the observed network and the simulated random graph. Real-world networks often exhibit right-skewed degree distributions, with many nodes having few connections and a small number of hubs with very high degree. In contrast, the \(G(n, p)\) model produces a binomial (and approximately normal) degree distribution, where most nodes have degrees clustered around the mean. By comparing these two distributions side by side, we can observe how poorly the random model captures the heterogeneity present in the empirical network.

# Plot the degree distributions
df_deg <- data.frame(
  Degree = c(deg_obs, deg_gnp),
  Type = rep(c("Observed", "G(n, p)"), times = c(length(deg_obs), length(deg_gnp)))
)

ggplot(df_deg, aes(x = Degree, fill = Type)) +
  geom_histogram(position = "identity", bins = 20, alpha = 0.6, color = "white") +
  facet_wrap(~Type, scales = "free_y") +
  labs(x = "Node Degree", y = "Frequency") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "tomato")) +
  theme(legend.title = element_blank())

This example underscores the need for more realistic network models that can capture multiple structural properties simultaneously. While the \(G(n, p)\) model offers a useful theoretical baseline, its assumptions of uniform edge probability and independent tie formation lead to networks with unrealistic degree distributions. In particular, it fails to capture the heterogeneity observed in many real-world systems, where some nodes act as hubs while others have very few connections. To address this limitation, we turn to the configuration model, which allows us to fix the degree sequence of the network and thereby preserve node-level connectivity patterns. This model represents a natural next step toward our second random graph model.

Note: \(G(n, p)\) and CUG Given Density

The \(G(n, p)\) model is mathematically equivalent to a Conditional Uniform Graph (CUG) test given density. In both cases, edges are formed between node pairs independently with fixed probability \(p\), and the overall network density is preserved on average across simulations.

However, there are key differences in interpretation and usage:

The \(G(n, p)\) model is a generative model used to define a probability distribution over the space of graphs with \(n\) nodes and tie probability \(p\). It is often used in theoretical network science as a baseline or null model.
A CUG test given density is a hypothesis testing framework. It conditions on the observed number of nodes and the expected density, and tests whether an observed network statistic (e.g., mutual ties, clustering) deviates significantly from what would be expected by chance.

In practice, simulating random graphs under the \(G(n, p)\) model is functionally identical to conducting a CUG test with fixed density. The distinction lies in whether the model is used for generative modeling or for evaluating the statistical significance of observed network features.

The Configuration Model

we continue with the Les Misérables co-appearance network and compare it to a random network generated from the configuration model. The goal is to assess how well the configuration model replicates key structural features of the observed network when it exactly preserves the degree sequence but randomizes the specific tie configuration.

We use the igraph package to compute network properties and simulate the configuration model using sample_degseq(). The configuration model guarantees that each node retains its observed degree. We create a comparison table and visualize the degree distribution as before.

# Get observed degree sequence
deg_seq <- degree(g_obs)

# Compute observed properties
clustering_obs <- transitivity(g_obs, type = "global")
dist_obs <- mean_distance(g_obs, directed = FALSE, unconnected = TRUE)

# Simulate configuration model with the same degree sequence
set.seed(123)
g_conf <- sample_degseq(deg_seq, method = "fast.heur.simple")

# Compute simulated properties
clustering_conf <- transitivity(g_conf, type = "global")
dist_conf <- mean_distance(g_conf, directed = FALSE, unconnected = TRUE)

# Degree distribution comparison
deg_conf <- degree(g_conf)
deg_df <- data.frame(
  Degree = c(deg_seq, deg_conf),
  Type = rep(c("Observed", "Configuration Model"), times = c(length(deg_seq), length(deg_conf)))
)

# Summary table
conf_comparison <- data.frame(
  Model = c("Observed", "Configuration Model"),
  Clustering = c(clustering_obs, clustering_conf),
  AvgPathLength = c(dist_obs, dist_conf),
  MaxDegree = c(max(deg_seq), max(deg_conf))
)

# Display comparison table with kable
kable(conf_comparison, caption = "Comparison of structural features: Observed vs Configuration Model",
      digits = 3, align = "c")

Comparison of structural features: Observed vs Configuration Model
Model	Clustering	AvgPathLength	MaxDegree
Observed	0.499	4.861	36
Configuration Model	0.239	2.503	36

# degree distribution plot
ggplot(deg_df, aes(x = Degree, fill = Type)) +
  geom_histogram(position = "identity", bins = 20, alpha = 0.6, color = "white") +
  facet_wrap(~Type, scales = "free_y") +
  labs(x = "Node Degree", y = "Frequency") +
  scale_fill_manual(values = c("skyblue", "tomato")) +
  theme_minimal() +
  theme(legend.title = element_blank())

As expected, the degree distribution of the simulated network matches that of the original exactly. However, when we examine higher-order properties, such as the global clustering coefficient and average path length, we find notable differences. The observed network has significantly more clustering, suggesting the presence of structured triadic closure that is not reproduced by the configuration model’s randomized pairing process. The average path length may also differ, although it often remains in the same general range.

These results highlight an important distinction: while the configuration model controls for degree-based features, it does not account for clustering, community structure, or other forms of structural dependency. As such, it is useful as a baseline or null model for testing whether observed patterns can be explained by degree alone.

Note: Configuration Model vs. CUG Given Degree

The configuration model and a Conditional Uniform Graph (CUG) test given degree both generate random networks that preserve the observed degree sequence. In this sense, they are conceptually aligned: both assume that node-level connectivity (i.e., degrees) is fixed and use this constraint to explore how other structural features might arise by chance.

The key distinction lies in how each is used. The configuration model is a generative model; it produces random graphs that exactly match a specified degree sequence, often for theoretical or simulation purposes. A CUG test given degree, on the other hand, is a hypothesis testing framework. It evaluates whether a particular network statistic (such as clustering or transitivity) in the observed network is unusually high or low compared to what would be expected under random tie arrangement, given the same degree sequence.

The Small-World Model

To evaluate how well the small-world model captures structural features of a real network, we simulate a small-world graph using the same number of nodes and approximate average degree as the Les Misérables co-appearance network. We then compare the simulated graph to the observed one in terms of degree distribution, clustering, and average path length.

The simulation uses igraph::sample_smallworld(), which generates a Watts–Strogatz small-world graph by starting from a regular ring lattice and randomly rewiring edges with a given probability \(p\). We set \(p = 0.05\) to introduce moderate randomness while maintaining local structure (we discuss the choice of \(p\) in more detail below).

avg_deg_obs <- mean(deg_obs)
k <- round(avg_deg_obs / 2)  # average degree per side for ring lattice

# Simulate small-world graph
set.seed(123)
g_sw <- sample_smallworld(dim = 1, size = n, nei = k, p = 0.05)

# Compute properties
clustering_obs <- transitivity(g_obs, type = "global")
clustering_sw <- transitivity(g_sw, type = "global")

dist_obs <- mean_distance(g_obs, directed = FALSE, unconnected = TRUE)
dist_sw <- mean_distance(g_sw, directed = FALSE, unconnected = TRUE)

deg_sw <- degree(g_sw)

# Comparison table
sw_comparison <- data.frame(
  Model = c("Observed", "Small-World"),
  Clustering = c(clustering_obs, clustering_sw),
  AvgPathLength = c(dist_obs, dist_sw),
  MaxDegree = c(max(deg_obs), max(deg_sw))
)

# Print formatted table
kable(sw_comparison, caption = "Comparison of structural features: Observed vs Small-World Model",
      digits = 3, align = "c")

Comparison of structural features: Observed vs Small-World Model
Model	Clustering	AvgPathLength	MaxDegree
Observed	0.499	4.861	36
Small-World	0.498	3.778	7

# Degree distribution
deg_df <- data.frame(
  Degree = c(deg_sw, deg_obs),
  Type = rep(c("Small-World","Observed"), times = c(length(deg_obs), length(deg_sw)))
)
 # Reverse the factor levels
deg_df$Type <- factor(deg_df$Type, levels = c("Small-World", "Observed"))

ggplot(deg_df, aes(x = Degree, fill = Type)) +
  geom_histogram(position = "identity", bins = 20, alpha = 0.6, color = "white") +
  facet_wrap(~Type, scales = "free_y") +
  labs(x = "Node Degree", y = "Frequency") +
  scale_fill_manual(values = c("skyblue","tomato")) +
  theme_minimal() +
  theme(legend.title = element_blank())

The simulation demonstrates how the small-world model approximates certain properties of the observed network. As shown in the table, the simulated network achieves a relatively short average path length, similar to that of the Les Misérables network, due to the introduction of random long-range ties. The clustering coefficient remains substantial, reflecting the model’s ability to preserve local neighborhood structure.

However, the degree distribution in the small-world model shown in remains relatively narrow, with most nodes having degrees close to the average. This limitation highlights that while the small-world model captures some global and local properties, it does not account for degree heterogeneity. The results show that the small-world model offers a useful structural middle ground between regular and fully random graphs but still lacks the full complexity observed in empirical networks.

Note that the choice of the rewiring probability \(p\) in the small-world model is crucial, as it balances regularity and randomness. Small values of \(p\) (e.g., between 0.01 and 0.2) are typically chosen to introduce enough randomness to significantly reduce path lengths, while still preserving high clustering. If \(p\) is too low, the network remains overly regular; if \(p\) is too high, the network behaves like a random graph and loses its local structure. In practice, \(p\) is often selected empirically to achieve small-world characteristics (high clustering and short average path length) relative to the number of nodes and degree.

To illustrate how the small-world model transitions between regular and random structure, we simulate multiple networks with the same number of nodes as Les Misérables network with varying values of the rewiring probability \(p\) and track how two key properties (clustering and average path length) change. This helps identify a “sweet spot” for \(p\) where the network retains high clustering but achieves short global paths, capturing the essence of small-world structure. The results are shown in ?@fig-rewire.

The plot shows a sharp transition in network structure as \(p\) increases. At \(p = 0\), the network is a regular lattice: clustering is high, but average path length is long. As \(p\) increases slightly (e.g., \(p \approx 0.1\)), the average path length drops rapidly due to the introduction of long-range shortcuts, while clustering remains relatively high. This intermediate range is where small-world characteristics emerge.

As \(p\) approaches 1, the network becomes increasingly random (think \(G(n,p)\)): clustering drops off, and path length stabilizes at a low level. This demonstrates the trade-off between local cohesion and global efficiency controlled by the rewiring parameter \(p\).

Exercise

Choose a network of yourself and analyze and compare results using the three introduced random graph models.