Exponential random graph models in python
I have had some recent experience estimating exponential random graph models (ERGMs) in python using PyMC, and I thought it could be useful to put some notes about it up here.
For readers unfamiliar with ERGM, it is a modeling framework for network or graph data. An unfortunate fact of statistical inference on networks is that the independence assumptions of ordinary least squares are violated in deep, complex and interconnected ways (one of the core insights of social network analysis is that whether or not I am friends with you is tightly related to whether or not I am friends with your friends). ERGMs attempt to bring a linear modeling approach to network estimation by assuming that these inherent interdependencies depend mostly on network characteristics that a savvy modeler can explicitly specify. If you have reason to believe that the major network dependencies in your data can be controlled for by reciprocity and transitivity, for instance, you can simply include those terms in your model specification and hope (read: assume) that the rest of your errors are independent. While they are not perfect, ERGMs have become a widely accepted method among social network analysts.
Most often, ERGMs are estimated using the ergm package for the R statistical environment. This is a wonderful package put together by many of the top researchers in the field of network inference. Recently I needed to estimate a model that allowed more low-level control of the modelling than this package allowed however, so I turned to PyMC to see if I could implement ERGM estimation myself. PyMC is an invaluable python package that makes Markov-chain Monte-Carlo (MCMC) estimation straightforward and, importantly, very fast to implement. Getting estimates for pretty much any model in PyMC takes barely any more work than just specifying that model, and it’s well designed enough that writing your own step methods (and even sampling algorithms) in python is a snap.
Gagnon & MacRae prison friendship network
As a sanity check on my work, I decided to estimate the same model on the same data with ergm in R and then with PyMC in python. The data I used is the prison friendship network from Gagnon & MacRae1, which I got from the UCINET IV datasets page. I was only interested in directed friendship ties between inmates, so I discarded any information about the prisoners themselves; you can download my cleaned up adjacency matrix.
I specify a pretty simple model for the network. I use \(Y\) to refer to the adjacency matrix of the network and \(y_{i,j}\) to refer to the \((i,j)\mathrm{th}\) element of \(Y\). The value of \(y_{i,j}\) is one if prisoner \(i\) listed prisoner \(j\) as a friend, and zero otherwise. The ERGM is specified as
\[\mathrm{Pr}(Y|\theta)\propto\exp\left[\theta'\Psi(Y)\right]\]
where \(\theta\) is our vector of coefficients, and \(\Psi(Y)\) is a vector of network statistics on \(Y\). With ERGMs, \(\Psi\) is where the modeler includes all of the statistics she deems relevant to the network ties. I include a few straightforward statistics:
The number of pairs in the network for which both of the directed edges \((i,j)\) and \((j,i)\) exist. (How many pairs of prisoners mutually nominated each other as friends)
The total number of edges present in the graph. (How many total friendship nominations were made)
The total number of nodes with exactly two incoming edges. (How many prisoners were listed as friends by exactly two other prisoners)
The total number of nodes with exactly three incoming edges. (How many prisoners were listed as friends by exactly three other prisoners)
The total number of nodes with exactly two outgoing edges. (How many prisoners listed exactly two other prisoners as friends)
Estimating this model in R using ergm is dead simple:
library(ergm) # load the data prisonAdj <- matrix(scan('prison.dat'),nrow=67,byrow=T) # create a 'network' object prisonNet <- network(prisonAdj) # estimate the model prisonEst <- ergm(prisonNet ~ mutual+istar(1:3)+ostar(2))
In the final line there, mutual is the mutual edge count, ostar(2) indicates the 2-outstar count, and istar(1:3) tells ergm to include a term for 1-, 2-, and 3-instars. The number of 1-instars is the same thing as the density.
Estimating the model in python will require a bit more work. A standard result from ERGMs is that the model can be reduced to a simple logistic regression with a case for each edge in the network. In this case, each covariate represents the change in a particular network statistic if the edge in question were present versus not present. Formally, let \(Y_{i,j}^+\) be the network obtained by setting \(y_{i,j}=1\) in \(Y\), and let \(Y_{i,j}^-\) be the network obtained by setting \(y_{i,j}=0\). For a particular network statistic \(\Phi_k(Y)\) the corresponding change statistic for edge \((i,j)\) is defined as \(\phi_k(Y,i,j)=\Phi_k(Y_{i,j}^+)-\Phi_k(Y_{i,j}^-)\).
So for edge \((i,j)\) the change in the mutual edge count, \(\phi_M(Y,i,j)\) would depend on the value of \(y_{j,i}\). If \(y_{j,i}=0\), then holding the rest of the network fixed edge \((i,j)\) would be unable to change the total mutual edge count, so \(\phi(Y,i,j)=0\). If \(y_{j,i}=1\), then \(y_{i,j}\) would determine whether the edge was mutual and \(\phi(Y,i,j)=1\).
The first step in the python implementation, then, is to construct these \(\phi\) covariates for each of our network statistics.
import scipy as sp from scipy.misc import comb from itertools import product from pymc import Normal, Bernoulli, InvLogit, MCMC,deterministic # functions to get delta matrices def mutualDelta(am): return(am.copy().transpose()) def istarDelta(am,k): if k == 1: # if k == 1 then this is just density res = sp.ones(am.shape) return(res) res = sp.zeros(am.shape,dtype=int) n = am.shape[0] for i,j in product(xrange(n),xrange(n)): if i!=j: nin = am[:,j].sum()-am[i,j] res[i,j] = comb(nin,k-1,exact=True) return(res) def ostarDelta(am,k): if k == 1: # if k == 1 then this is just density res = sp.ones(am.shape) return(res) res = sp.zeros(am.shape,dtype=int) n = am.shape[0] for i,j in product(xrange(n),xrange(n)): if i!=j: nin = am[i,:].sum()-am[i,j] res[i,j] = comb(nin,k-1,exact=True) return(res)
The mutualDelta function takes an input adjacency matrix and returns an indicator of whether the reversed tie is present in the network. This is simply the transpose of the original adjacency matrix. The istarDelta and ostarDelta functions are a little more complicated, but in essence return a matrix that how many \(k\)-instars or -outstars depend on each edge in the network.
From here it is a simple matter to define the model in PyMC.
def makeModel(adjMat): # define and name the deltas termDeltas = { 'deltamutual':mutualDelta(adjMat), 'deltaistar1':istarDelta(adjMat,1), 'deltaistar2':istarDelta(adjMat,2), 'deltaistar3':istarDelta(adjMat,3), 'deltaostar2':ostarDelta(adjMat,2) } # create term list with coefficients termList = [] coefs = {} for dName,d in termDeltas.items(): tName = 'theta'+dName[5:] coefs[tName] = Normal(tName,0,0.001,value=sp.rand()-0.5) termList.append(d*coefs[tName]) # get individual edge probabilities @deterministic(trace=False,plot=False) def probs(termList=termList): probs = 1./(1+sp.exp(-1*sum(termList))) probs[sp.diag_indices_from(probs)]= 0 return(probs) # define the outcome as outcome = Bernoulli('outcome',probs,value=adjMat,observed=True) return(locals())
(This is a little bit kludgey with the renaming of the coefficient keys in the coefs dictionary). The definition of probs contains one important step that defines the probability along the diagonal to be zero, because inmates were not allowed to list themselves.
Finally, we load the data and estimate the model:
if __name__ == '__main__': # load the prison data with open('prison.dat','r') as f: rowList = list() for l in f: rowList.append([int(x) for x in l.strip().split(' ')]) adjMat = sp.array(rowList) # make the model as an MCMC object m = makeModel(adjMat) mc = MCMC(m) # estimate mc.sample(30000,1000,50)
From there PyMC handles sampling and collecting traces—remarkably easy to get use.
All the python code is combined in this python script.
We can examine the estimates made by ergm and my PyMC approach to verify that the python version is working. Although the two methods are not formally comparable—ergm approximates the maximum-likelihood estimates while PyMC generates samples from a posterior distribution—they should yield relatively similar point estimates and ranges.
Estimates and 95-percent intervals from ergm (top, orange) and PyMC (bottom, red).
The above figure shows the estimates and the comparable 95-percent intervals for the two methods. For each model term the ergm estimate is on top (with an orange dot) and the PyMC result is on the bottom (with a red dot). In the case of ergm the central dot is the maximum-likelihood estimate of the coefficient and the interval is derived from 1.96 times the estimated standard error. For PyMC the bars indicate the 2.5, 50 and 97.5 percentiles of the MCMC sample. As expected, the results are not identical. The point estimates and the ranges differ noticeably for each term. However both methods produce estimates of the same sign, approximate magnitude, and ‘confidence’. The table below lists the point estimates and standard error/deviation for the methods.
statistic ergm (standard error) PyMC (standard deviation) mutual ties 3.51 (0.34) 3.57 (0.19) density -4.09 (0.33) -4.50 (0.28) 2-instars 0.35 (0.16) 0.47 (0.13) 3-instars -0.06 (0.05) -0.10 (0.04) 2-outstars -0.13 (0.09) -0.06 (0.06)
From these tests we can say that PyMC should provide a straightforward alternative to ergm for estimating exponential graph models. The model I used here was deliberately simplistic and arbitrary—I only needed to verify my work. In this case it probably would have been easier to just stick with the ergm estimates and move on.
But there are situations where ergm won’t work quite so easily. If you need to define a specialized network statistic not included in the package, for instance (I believe there is a way to incorporate user-defined network statistics in ergm, though I’ve never tried). Or you might need to extend the model (as I recently did) to include multiple networks or time-dependence. In each of these cases the PyMC approach is a simple matter of defining the ‘delta’ matrices that indicate how individual edges affect the statistics of interest.
Has anybody else had experience using PyMC for network statistics like this? I’d love to hear how you went about it.
MacRae J. (1960). Direct factor analysis of sociometric data. Sociometry, 23, 360-371↩