CAPÍTULO III Indagando Respuestas
4.5. Cronograma de Gantt
Sampling airports
The structure of air transport network is well-known to be scale-free [ZL13a] (i.e. with a long-tailed degree probability distribution), characterised by few highly connected nodes (hubs) and a large number of secondary nodes (regional airports). This structural particularity of air transport network is confirmed in Fig. 4.2, in which the normal and cumulative distribution of degrees have been displayed for each of the 10 air transport network considered.
The sampling properties of scale-free networks are well understood [SWM05]. However, the sampling is usually performed randomly, which is not the case in air transportation networks, where large airports are over-represented before smaller ones. Such is the case whether the sampling is purposive (e.g. select the most connected airports to lower the computational cost) or reflecting the limitation of the data set (e.g. RITA dataset, where only airlines generating more than 1% of passenger revenues are bound to report their data).
In order to mimic the bias commonly present in air transportation network representations, we here introduce a sequential sampling process where nodes (airports) are successively added to the considered network according to their number of connections - highly connected airports first, then the sparsely connected ones. This approach allows us to display the evolution of the networks topology as and when nodes are being added - that is, from the core backbone structure (i.e. composed of the most connected airports) to the whole structure. From this, the
0.00 0.05 0.10 0.15 0.20 0.25 0.30
Figure 4.2: Normal (top panel) and cumulative (bottom panel) degree distribution of the 10 air transport network studied. Reprinted with permission from [BCPZ16].
error made when considering incomplete data sets can be assessed, for example when focusing on the most connected airports.
Fig. 4.3 represents the evolution of seven relevant topological metrics (link density Ld, maxi-mum degree Md, clustering coefficient CC, degree correlation Dc, entropy of the degree distri-bution Edd, efficiency E and the information content IC - whose definitions can be found in Section 2.4.2) as a function of the fraction of nodes sampled from the original data set according
0 , 0
Figure 4.3: Evolution of network topological metrics, as a function of the fraction of airports included in the network. Airports are included sequentially in function of its number of oper-ation (in a decreasing order). Thus, values close to zero in the x-axis describe the core bone of the network; values close to one signify that all airports are considered. Reprinted with permission from [BCPZ16].
to the aforementioned sequential sampling process.
The metrics typically vary quite strongly as a function of the number of airports included -the most notable example being -the entropy of -the degree distribution Edd, which starts at
0.0, because the first considered airports are inter-connected, and stabilises when the totality of them is included. Additionally, whilst most of the metrics shown demonstrate a monotonic behaviour, some (Edd and IC) change the direction of their evolution. Notably, many of them do not saturate - that is, that they do not reach a stable value even when all airport at disposal are included. All these observations - and more particularly the non convergence of the metrics - reinforce the idea that the network topology is sensitive to the addition of airports, even small ones, thereupon there is no obvious threshold by which nodes may be safely discarded when a sequential weighted (i.e. from highly to sparsely connected airports) sampling strategy is adopted. This might be surprising, knowing that past studies have demonstrated that such a threshold exists and is set to 0.6, i.e. sampling 60% of the nodes is enough to reach a good approximation of a full scale-free network (e.g. the Internet and the online pre-print repository arXiv, see [LKJ06]). This result does not translate to air transport network for two possible reasons: (a) the sampling strategy adopted in air transport not being random might cause a tardiness in the saturation of the metrics and (b) the actual number of airports is much higher than the ones considered by our data sets, therefore the complete dataset considered is not even representing the 60% of the existing airports. The 1854 reported airports in Europe by ALL_FT+ represents approximatively half of the real existing number of airports in Europe (3347) as mentioned in the introduction of the Chapter.
Besides, it is quite remarkable that China and India present a qualitatively different behaviour.
While we had no particular information about India allowing to explain such pattern, we can
0.00 0.25 0.50 0.75 1.00
Figure 4.4: Maximum degree by airport sampling fraction for Europe, China and the US (OpenFlights data sets).
bring more explanations on the case of China. It may be observed in Fig. 4.4 that China ends up with a higher maximum degree in comparison with OpenFlights European and US data sets.
Such a behaviour may be ascribed to the market structure of China (see Section 4.1.2), where the regionalisation of the country and the national policies have led more airports to have high connectivity.
Finally, let us go back to the information displayed at the bottom of Fig. 4.3, where the topologies obtained from two different data sources for the European and US networks are compared - that is, RITA and OpenFlights for the US and ALL_FT+ and OpenFlights for Europe. Roughly speaking, both pair of data sets present different characteristics - please refer to Tab. 3.1 for more details - as OpenFlights is more complete for the US but includes fewer airports than ALL_FT+ for Europe. In Fig. 4.3, two vertical lines respectively in Europe (ALL_FT+) and US (OpenFlights) represent when the number of airports included has reached the number of airport considered in the smaller data set (respectively OpenFlights for Europe and RITA for the US). If both data set had been collected the same way, then one would expect the value of the metrics at these vertical lines to be equal to the value for the smaller data sets when all available airports are being incorporated. Specifically, the topological metrics for the full OpenFlights European data set (resp. the full RITA US data set) should be equal to the value obtained in the ALL_FT+ European data set when 497 airports are included (resp. the OpenFlights US data set when 286 airports are included) identified by the vertical line. On the contrary, this does not happen, notwithstanding the final convergence to similar values. Given that the RITA data set only includes 16 airlines, one might ask whether an airline sampling might be the cause behind such behaviour.
Sampling airlines
As commented before, the sampling might also be done across other dimensions, and in par-ticular along the airline dimension, whether it is done intentionally to focus the study on one airline or alliance (see for example [LC04, LSS14, RSNC09]); or is the result of data collection procedures (e.g. only 16 airports are required to report their movement to RITA - see Section
4.1.3). It is therefore of interest to look at the bias introduced by a carrier sampling.
Similarly to Fig. 4.3, Fig. 4.5 plots the evolution of the seven considered metrics as a function of the fraction of airlines included (largest first), with somewhat contrasting results. Specifically, it seems that a low(er) proportion of airlines is enough to approximate the complete topology.
Such behaviour is probably caused by the fact that an airline encompasses both big and small airports, therefore better representing the heterogeneity of the system (in comparison with the usual larger-airports only sampling strategy).
Here again, the Chinese network stands out as an outlier, as few airlines do not suffice for the metrics to converge anymore. Notably, they do stabilise only when one third of the airlines are considered (except for the IC, which seems to converge much later), further reflecting the state-control route and region allocation of the Chinese market - that is, even major airlines might leave some regions of the network under-explored. Also, it can be extracted from Fig. 4.2 that the average degree for the Chinese network is 15, right in between of European (19) and US (10) values, revealing the good connectivity of Chinese airports, given a smaller number of airports, and further suggesting a point-to-point structure rather than a hub-and-spoke system [ZR09]. However, it must be noted that China possesses a comparable number of airports11 (64) with more than 1 million passengers than US12 (87) and the same proportion of them as Europe13 (221 airports out of 609 for Europe and 64 out of 202 for China).
A set of airlines with some common characteristics (e.g. region of operation, alliance member-ship, etc.) might be chosen for the specific needs of a study, therefore rendering inefficient the number of operated flights sampling criterion. The same might be said if the airline sampling corresponds to the limitation of the data set, where only some predetermined airlines provides data (we can imagine that happening in the context of a private investigation project). There is thus a need to contemplate an alternative strategy for sampling airlines that includes more randomness - in other words, not only based on the number of flights operated by the airline.
We propose the following two options to emulate such process:
11Civil Aviation Administration of China, 2014 [oC16]
12http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/ (Ac-cessed May 2016).
13ACI EUROPE, personal communication.
0 . 0
Figure 4.5: Evolution of network topological metrics, as a function of the fraction of airlines considered to reconstruct the network. Airlines are included in decreasing order of number of operations. Reprinted with permission from [BCPZ16].
1. A totally random process where airlines are drawn from the complete pool of available possibilities.
2. A substitution process based on the sequential original sampling. While, at first, airlines are selected sequentially in decreasing order of operations, a second step consists in dis-carding and randomly replacing some of them by other airlines. To illustrate this process
let us imagine a set with the busiest airlines in Europe (DLH, RYR, EZY, AFR and BER - which are the codes for Deutsche Lufthansa, Ryanair, EasyJet, Air France and Air Berlin). Then one of them is discarded (let us say, randomly, RYR) to be replaced by a, here again, randomly chosen smaller airline, for example SAS - Scandinavian Airlines.
The substitution process is based on the rationale that research studies, while not following the sequential sampling as for airports, seldom have completely random airlines and more generally work with a little set of big airlines.
Fig. 4.6 presents the results for both processes for the European Network. Four metrics (C, Edd, E and IC) are displayed as a function of the number of airlines sampled. The blue line reports the evolution of the metrics through a random drawing process (option 1), while the red and grey circles and whiskers respectively represent the substitution process mean value and standard deviation. Grey circles indicates a 5:1 substitution ratio - that is that for every 5 airlines, one is substituted - while the red circles a 10:1 one. It can be seen that the random
0.0
Figure 4.6: Topological metrics evolution as a function of a mixed sequential-random airline sampling process for the European (ALL-FT+) network – see main text for details. Blue lines indicate the evolution of metrics for a completely random airline sampling process. Reprinted with permission from [BCPZ16].
process converges lately, as there is a high probability of picking small airlines when the sampling is done randomly. On the contrary, the 5:1 substitution process approximates the final metric values with only one step, as it ensures that the core of network (created by the biggest airlines) is represented.
Sampling aircraft types and time windows
Air transport networks are seldom filtered or sampled according to aircraft types or specific time windows (at least, purposively), however, for the sake of completeness, this analysis will here be performed.
Excluding military aircrafts - which may bias the following results, but for which data was not available - we are here to study the effect that the different types of aircrafts have on the network. It can be appreciated in Fig. 4.7 that none of the four main topological metrics converges before all aircraft types are included (each aircraft type is introduced in decreasing
0.10
Figure 4.7: Evolution of topological metrics of the European (ALL-FT+) network, as a function of the fraction of aircraft types included. Aircraft are included in decreasing order of number of operations of their type. Reprinted with permission from [BCPZ16].
order of frequency). Specifically, note how the mesoscale structure of the system only appears above 0.6, that is when turboprops aircraft - smaller aircraft commonly connecting hubs with regional airports - are included.
Also, the time window considered can strongly affect the topology of the system, as traffic changes during the peak hours and night - the same can be said about summer and winter (more weather problems in winter) or across the different days of the week (less traffic during weekends). The focus on a given time window yield useful knowledge about the dynamics of the system under specific conditions, see [FRE13]. However, in order to synthesise the topological characteristic of the whole network representation to the system, Fig. 4.8 shows that large time intervals are necessary - to include weekly and seasonal effects. Indeed, for a number of days considered, a window have been dragged all over the dataset selecting every consecutive days combination possible in order to avoid eventual biases. It resulted in 200 calculations for each number of days considered (resumed in Fig. 4.8 by its mean value and standard deviation).