Understanding the uncertainty in the presence or absence of links given our observations will be important when attempting to analyse the dynamics of the follower network. In a first analysis of full follower network snapshots over the period of observation, it became apparent that many links had remained unobserved for a long period. It is reasonable to assume that some of those links may have been subsequently broken. A principled strategy was needed to account for links for which we have no fresh observations. To enable such a strategy, an analysis of observed link lifetimes and link dissolution patterns was undertaken. Such an analysis is also of interest in its own right and to my knowledge has not previously been attempted.⁴
One possible strategy to account for stale links (those which have not been observed for some time) is to introduce link weights related to the estimated probability that they have since been removed. However, some network analysis techniques, including those performed in Section 6, do not admit link weights. In this case, a binary approximation (presence or absence) of link survival must be used. A reasonable cut-off would be the median of the inferred link lifetime distribution: links unobserved for longer than the median would be discarded.
⁴ [Myers and Leskovec 2014] looked at bursts of link creation and destruction related to retweet events but did not present a survival analysis, and [Xu et al. 2013] examined factors related to unfriending but had only 4 whole network snapshots to work with.
As a first step, links for which we do not have upper and lower bounds for both a creation event and a destruction event were discarded. This excludes users who have three or fewer tweets recorded in the data, which is desirable since such users are not significant contributors to any social groups present. Of greater concern are stable links extant for the whole period of observation, or links for which one of the bounds on the creation or destruction event would have been just outside the observation period. Removal of these links introduces a bias, but one that we can control for: both the creation and destruction events are less likely to be observed for longer-lasting links than for those that are broken quickly, because the effective time window in which we can observe both ends of a link's lifespan is the total observation time minus the link's lifespan. To be more precise, a link can be included only if the initial and final records of the link's absence (with presence recorded in between) lie within the observation period. To correct for this bias, the number of links of a given age should be scaled by w/(w − l), where w is the length of the observation window (the total amount of time over which data was collected) and l is the upper bound on the link's lifetime (the time between the lower bound on the link's creation and the upper bound on its destruction).
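The window correction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the bin edges and the example lifetimes are hypothetical, and the weight w/(w − l) is applied per link before binning, which approximates scaling each bin count by the same factor.

```python
import numpy as np

def corrected_lifetime_histogram(lifetimes, window, bin_edges):
    """Histogram of link lifetimes with each link weighted by w / (w - l).

    lifetimes : upper bounds on link lifetimes (same time unit as window)
    window    : length of the observation window, w
    bin_edges : histogram bin edges (hypothetical choice, not from the source)
    """
    lifetimes = np.asarray(lifetimes, dtype=float)
    # A link with l >= w cannot have both ends observed, so it is dropped;
    # this also avoids division by zero in the weight.
    observable = lifetimes < window
    l = lifetimes[observable]
    weights = window / (window - l)
    counts, edges = np.histogram(l, bins=bin_edges, weights=weights)
    return counts, edges

# Hypothetical example: a 200-day window and four observed links.
counts, edges = corrected_lifetime_histogram(
    [10.0, 10.0, 100.0, 150.0], window=200.0, bin_edges=[0, 50, 100, 150, 200]
)
```

Long-lived links receive the largest weights, reflecting how rarely both of their endpoints fall inside the window.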
Figure 4.6: Corrected link lifetimes.
The resulting link lifetimes are presented as a histogram in Figure 4.6. Observing the near linearity of the corrected histogram heights on a log-scale y-axis, we can expect a reasonable fit from an exponential. This is typical of lifetime data with a constant hazard rate (the probability per unit time that a surviving link will be broken). A least squares fit of log10 of the number of links against the midpoints between the upper and lower bounds on link duration can be seen in Figure 4.7. The exponent (slope of Figure 4.7), intercept and Pearson's r correlation coefficient are provided in Table 4.1.
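A fit of this kind might be computed as in the following sketch. It assumes the corrected histogram is available as bin midpoints and counts; the function name and the synthetic data are hypothetical, and the numbers are not those behind Figure 4.7.

```python
import numpy as np

def fit_exponential_decay(midpoints_days, corrected_counts):
    """Least squares line through (days, log10(count)).

    Returns (slope, intercept, pearson_r), all in log10 units.
    """
    x = np.asarray(midpoints_days, dtype=float)
    y = np.asarray(corrected_counts, dtype=float)
    keep = y > 0                      # log10 is undefined for empty bins
    x, log_y = x[keep], np.log10(y[keep])
    slope, intercept = np.polyfit(x, log_y, 1)
    r = np.corrcoef(x, log_y)[0, 1]
    return slope, intercept, r

# Synthetic, noise-free data with a known slope (illustration only).
days = np.arange(5.0, 200.0, 10.0)
synthetic_counts = 10 ** (5.0 - 0.003 * days)
slope, intercept, r = fit_exponential_decay(days, synthetic_counts)
```

On real histogram data the correlation would fall short of a perfect −1, and empty bins are skipped rather than treated as zero.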
The high level of correlation indicates a very good fit, suggesting that link lifetimes in this data exhibit a constant dissolution rate: the probability of a given link dissolving in a given time period is very close to constant. This is an interesting observation, as one might expect age-dependent effects, such as younger links being more prominent in a user's mind and so more dynamic, appearing and disappearing with greater frequency, whereas an older link may be forgotten and left untouched.
Figure 4.7: Linear fit in log space to link lifetimes.
exponent       -0.00313
y-intercept    5.696
correlation    -0.980
Table 4.1: Linear regression and correlation coefficients, with units in days and log10 of the number of links.
The estimated exponent in Table 4.1 equates to a constant attrition probability of 0.0072 per day (0.00313 × ln(10) ≈ 0.0072). As a binary approximation (presence or absence) of link survival, I employed the median of the inferred exponential⁵, shown in Table 4.2, as a cut-off: links that had not been observed for more than this value were taken to no longer exist.
Median link age 96.11 days
Table 4.2: Cutoff for binary approximation of link survival.
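The conversion from the fitted log10 slope to a daily rate and a median cut-off can be sketched as follows. The function names are hypothetical. With the rounded exponent from Table 4.1 this gives a median of roughly 96.2 days, slightly above the 96.11 days in Table 4.2; the difference is presumably a rounding effect, though that is an assumption here.

```python
import math

def rate_from_log10_slope(slope_log10_per_day):
    """Convert a slope in log10(count) per day to a natural-log decay rate."""
    return -slope_log10_per_day * math.log(10)

def median_lifetime(rate_per_day):
    """Median of an exponential lifetime distribution: ln(2) / rate."""
    return math.log(2) / rate_per_day

def survival_probability(days_unobserved, rate_per_day):
    """Probability that a link still exists after the given unobserved time."""
    return math.exp(-rate_per_day * days_unobserved)

def is_presumed_present(days_unobserved, rate_per_day):
    """Binary approximation: keep a link unobserved for at most the median."""
    return days_unobserved <= median_lifetime(rate_per_day)

rate = rate_from_log10_slope(-0.00313)   # rounded exponent from Table 4.1
median = median_lifetime(rate)
```

The `survival_probability` value could serve as a link weight where weights are admissible, while `is_presumed_present` gives the binary approximation; at the median the survival probability is 0.5 by construction.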
The above analysis is essentially a survival function estimation, as is often done in medical trials, and there are more sophisticated statistical methods [Miller et al. 1981; Radke 2003], able to utilise extra information from censored data⁶, that could have been applied. Though such an analysis would be of interest in its own right, it was considered unnecessary for the purposes of approximate correction of long-unobserved links, since a good estimate could be achieved with the more prosaic approach presented here.
⁵ Note that the median is the inverse of the attrition probability multiplied by ln(2).
⁶ In this situation, we have interval censored data, where we only have bounds on link
In estimating expected link lifetimes, I considered removing links for which we do not have a reasonable level of certainty, in addition to requiring upper and lower bounds on their lifetimes. That is, links for which the difference between the upper and lower bounds on the link's lifetime (the uncertainty in the link's duration) is large. Figure 4.8 shows the number of links with a given duration uncertainty in days (upper plot) and the ratio between [...]. As can be seen in Figure 4.8, the lifespan of a significant number of the sampled links is highly uncertain. As a simple heuristic, I investigated links with uncertainty less than 10% of the observation window. This absolute threshold was chosen in preference to a relative uncertainty cut-off, as we wish to investigate dynamics at the scale of the observation window, not relative to link lifetimes. As can be seen in Figure 4.9, however, this approach introduced a complex bias, retaining more links with lifetimes close to zero or to the observation window than with central values, and hence it was abandoned.
Figure 4.8: Histogram of the uncertainty in link duration.
Figure 4.9: Corrected link lifetimes — all links vs. subset without discarded links. Data points are the midpoint of upper and lower link lifetime bounds.
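For reference, the abandoned uncertainty heuristic amounts to a simple filter on the gap between the lifetime bounds. The sketch below assumes per-link lower and upper lifetime bounds in days; the function name and the example values are hypothetical.

```python
import numpy as np

def low_uncertainty_mask(lower_bounds, upper_bounds, window, max_fraction=0.10):
    """True for links whose lifetime uncertainty is below a fraction of the window."""
    lower = np.asarray(lower_bounds, dtype=float)
    upper = np.asarray(upper_bounds, dtype=float)
    uncertainty = upper - lower          # width of the lifetime interval
    return uncertainty < max_fraction * window

# Hypothetical example: three links in a 200-day window (threshold: 20 days).
mask = low_uncertainty_mask([5, 40, 90], [15, 80, 120], window=200.0)
```

Because the threshold is absolute rather than relative to each link's lifetime, the links that pass are not a uniform sample across lifetimes, which is consistent with the bias reported for Figure 4.9.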