116 Exploratory Analysis of Time-Space patterns in Smart Card Data
• Time instants are modeled by choosing a temporal unit (e.g. minutes or seconds) and a time offset (for Unix timestamps this is the start of the first of January 1970). As a result, we can use common arithmetic on the natural numbers to work with time instants.
• Time durations are also modeled using the natural numbers, with the same temporal unit as we are using for time instants.
• A set of locations L that occur within our dataset.
Using our chosen representation of time and space, we can now define how to model activities:
Definition 2. An activity $A_i$ is a 3-tuple $(a_i, d_i, l_i)$ where $a_i \in \mathbb{N}$ is the time at which the activity starts, $d_i \in \mathbb{N}$ is the time at which the activity ends and $l_i \in L$ is the location where the activity takes place. The duration of an activity is computed as $d_i - a_i$.
Suppose that some individual performs $n$ activities at different locations. We can then consider the following activity sequence for this individual:
\[ A_1 = (a_1, d_1, l_1),\; A_2 = (a_2, d_2, l_2),\; \ldots,\; A_n = (a_n, \infty, l_n) \]
In this sequence, it must hold that $a_i \le d_i \le a_{i+1}$ for all $i$, since time always goes forward. Associated with a sequence of activities, we have a corresponding sequence of journeys. A journey can be defined as follows:
Definition 3. A journey $J_i$ is a 4-tuple $(d_{i-1}, a_i, l_{i-1}, l_i)$ where $d_{i-1}$ is the departure time at which the journey starts, $a_i$ is the arrival time at which the journey ends, $l_{i-1} \in L$ is the origin location of the journey and $l_i \in L$ is the destination location of the journey.
Suppose that we know the activity sequence of a certain individual. The sequence of journeys of the same individual will then have the following structure:
\[ J_1 = (?, a_1, ?, l_1),\; J_2 = (d_1, a_2, l_1, l_2),\; \ldots,\; J_n = (d_{n-1}, a_n, l_{n-1}, l_n) \]
The chronological order defined on the $a_i$ and $d_i$ also applies when we consider a sequence of journeys. Given a sequence of activities, we do not know the precise departure time and origin of the first journey, which is indicated with the question mark symbol in the above sequence.
The relationship between journeys and activities is exploited in this chapter. Although smart card datasets typically contain information on journeys, we can transform a sequence of journeys into a sequence of activities and vice versa. This can be done in a streaming fashion: assuming the journeys (or activities) are ordered in time, we only need to keep two of them in memory to produce an activity (or journey).
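The streaming transformation described above can be illustrated with a minimal Python sketch. The class names `Journey` and `Activity`, the use of `None` for the unknown "?" entries and the use of string location labels are assumptions made for this illustration only; they are not part of the original framework.

```python
from dataclasses import dataclass
from math import inf
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Journey:
    departure: Optional[float]  # d_{i-1}; None stands for the unknown "?"
    arrival: int                # a_i
    origin: Optional[str]       # l_{i-1}; None stands for the unknown "?"
    destination: str            # l_i


@dataclass(frozen=True)
class Activity:
    start: int       # a_i
    end: float       # d_i; math.inf for the open-ended last activity
    location: str    # l_i


def journeys_to_activities(journeys: Iterable[Journey]) -> Iterator[Activity]:
    """Turn time-ordered journeys into activities, holding one journey in memory."""
    previous = None
    for journey in journeys:
        if previous is not None:
            # The activity at the previous destination lasts from the previous
            # arrival until the departure of the current journey.
            yield Activity(previous.arrival, journey.departure, previous.destination)
        previous = journey
    if previous is not None:
        # The last activity has no known end time.
        yield Activity(previous.arrival, inf, previous.destination)


def activities_to_journeys(activities: Iterable[Activity]) -> Iterator[Journey]:
    """Turn time-ordered activities into journeys, holding one activity in memory."""
    previous = None
    for activity in activities:
        if previous is None:
            # Departure time and origin of the first journey are unknown.
            yield Journey(None, activity.start, None, activity.location)
        else:
            yield Journey(previous.end, activity.start, previous.location, activity.location)
        previous = activity
```

Because both functions are generators, they can be chained over a file or database cursor without loading the full sequence into memory.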
6.3 Synthesis of Artificial Smart Card Data
One of the complications that arises in research related to mobility data in general, and smart card data in particular, is the availability of the data. Real-life data sets are often considered to be of significant importance by the companies that collect and own the data, because of the competitive advantage they provide or the privacy-sensitive nature of the data. As a result, many researchers have difficulty obtaining such data. In order to develop and test ways to deal with the data, it is important to have such data in the first place, but the best way to convince companies that they should provide access to the data is to show them that the developed methodologies will indeed be beneficial to them and their customers.
A second issue is that too much reliance on these company-owned data sets is detrimental to the scientific process, as it makes it extremely difficult to reproduce experiments. Furthermore, as real-life data is not based on any model, validation of proposed methodologies is difficult due to the lack of a ground truth. In this section we provide an alternative to company-owned data sets by introducing a framework that can generate synthetic data sets that can be shared and discussed by researchers.
The framework uses activity type definitions and generator definitions to generate sequences of journeys and activities for individuals over any desired number of days. It assumes that an individual has a home activity type that serves as a starting point and ending point for a tour of activities connected by journeys. The performed activity types are selected using a Markov chain, while the choice of time, duration and location of each individual activity is dependent on the activity types.
A generated tour always starts at the home activity. For the current day, it is then decided if there are potential starting activities to be performed or whether the individual decides to stay at home for the day and try again the next day. If a starting activity is selected, the location and time for the first activity is then determined. For consecutive activities during the same tour, the starting time of the next activity is based on the ending time of the previous activity plus travel time. If at some point the home activity is selected as the next activity, the tour ends. This procedure also implements a number of rules that we will discuss in further detail in the appropriate sections.
6.3.1 Activity Types
In order to generate chains of activities, we define generators that are able to produce tours of activity types. We then fill in the location for each of these activity types on the generated tour. By design, activity types play a central role in how the data is generated. An activity type declaration contains information related to the temporal properties of related activities, as well as possible locations at which the activity can be performed. For example, we may define one activity type for being at home, with possible locations in a residential area and within the city center, and define a second activity type work that can occur in the city center or in an industrial area. For every activity type, our definition needs to specify the following properties:
• A unique identifier. Every activity can be identified by a unique name.
• A starting time distribution. If the activity is selected as the first activity of a tour, the starting time is sampled using the specified distribution. If an activity never occurs as the first non-home activity in a tour, this distribution is never used in the sampling process.
• A duration distribution. In order to determine how long a non-home activity is performed, durations are sampled from this distribution.
• The days of the week during which this activity can begin. Some activities may only be available on certain days of the week. Work and school activities are typically only available from Monday until Friday, while activities such as events and music festivals are typically only available on weekends.
• Whether an individual chooses a fixed location to perform this activity type. If the activity type has the fixed attribute, the same individual always performs this activity at the same location (although the location may be different for different individuals). Typical activity types with a fixed attribute are home, work and school, while shopping or entertainment are examples of activity types without fixed locations.
• The skip probability of this activity type. This is used to determine whether an individual leaves home in order to perform activities. Suppose the only activity type that is considered by an individual for a certain day has a skip probability of 0.3. Then with probability 0.3 that individual stays at home that day. If multiple activity types are considered at the beginning of the day, we use the lowest skip probability among those activity types as the probability to stay at home, but other implementations are possible.
• A set of possible locations L at which this activity type can be performed.
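As an illustration of the skip probability rule in the list above, the following Python sketch decides whether an individual stays at home on a given day. The function name `stays_home` and the treatment of a day without any candidate activity types are assumptions of this sketch, not part of the original framework.

```python
import random


def stays_home(skip_probabilities, rng=random):
    """Return True if the individual stays at home today.

    skip_probabilities holds the skip probability of every activity type
    that is considered as a starting activity for the current day.
    """
    if not skip_probabilities:
        # Assumption: without candidate activities there is no reason to leave home.
        return True
    # The lowest skip probability among the candidates is used as the
    # probability to stay at home.
    p_stay = min(skip_probabilities)
    return rng.random() < p_stay
```

With a single candidate activity type with skip probability 0.3, `stays_home([0.3])` returns `True` on roughly 30% of the calls.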
When an activity is generated for a certain individual, it is also necessary to determine the location for that individual. For this purpose, a set of possible locations is associated with each activity type. If an activity type lacks the fixed attribute, a location is selected randomly each time any individual wants to perform this activity. If it has the attribute, a new location is selected randomly only when a new individual performs the activity for the first time.
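The effect of the fixed attribute can be sketched in Python as follows. The function `choose_location`, the `memory` dictionary that remembers earlier choices and the use of weighted random selection are hypothetical names and choices made for this sketch.

```python
import random


def choose_location(individual, activity_type, locations, weights, fixed, memory, rng=random):
    """Select a location for an activity of the given type.

    memory maps (individual, activity_type) to the location chosen earlier
    and is only consulted and updated for activity types with the fixed attribute.
    """
    key = (individual, activity_type)
    if fixed and key in memory:
        # The same individual always returns to the same fixed location.
        return memory[key]
    # Weighted random selection among the locations of this activity type.
    location = rng.choices(locations, weights=weights, k=1)[0]
    if fixed:
        memory[key] = location
    return location
```

Two different individuals sharing the same `memory` dictionary may still end up at different fixed locations, since the key includes the individual.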
For each location l in the set L of an activity type definition, we define a number of attributes. Note that each activity type has its own set of locations, and that the same location can be defined for multiple activity types.
• A name useful for identifying the location.
• A latitude/longitude pair $p_l$ to indicate the exact position of this location on the globe for the purpose of computing travel times.
• A weight $w_l$, useful to indicate that some locations are more popular for this activity type than others.
In order to determine the location for an activity type, we have two procedures. The first procedure is the simplest, as it only depends on the weights of the locations. The probability p(l) to select location l is defined to be proportional to the weight of the location relative to the other locations. Formally this is defined as follows:
\[ p(l) = w_l \left( \sum_{k \in L} w_k \right)^{-1} \tag{6.1} \]
In certain cases the individual for whom we are selecting a location is already at a different location $o$, which may influence the probability to select another location. For this, we use a distance function $d(o, l)$ which estimates how much time is required to travel from $o$ to location $l$ (our basic estimate assumes a speed of 50 km/h as the crow flies, but more elaborate functions are possible). This distance function is used together with a logit model with a parameter $\lambda$. The probability to select a location $l$ is defined as
\[ p(l) = e^{-\lambda w_l^{-1} d(o,l)} \left( \sum_{k \in L} e^{-\lambda w_k^{-1} d(o,k)} \right)^{-1} \tag{6.2} \]
For practical purposes, we always use Equation 6.1 to select a location in case $\lambda \le 0$. In case $\lambda > 0$, we use Equation 6.2 if a previous location is available. If no previous location is available, such as when determining a home location, Equation 6.1 is used.
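The two selection procedures can be sketched in Python as shown below. This is an illustration under stated assumptions: the exponent of Equation 6.2 is read as $-\lambda\, d(o,l) / w_l$, travel time is expressed in hours, the great-circle distance is computed with the haversine formula, and the names `travel_time_hours` and `location_probabilities` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
SPEED_KMH = 50.0          # basic estimate: 50 km/h as the crow flies


def travel_time_hours(p, q):
    """Estimated travel time between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    distance_km = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))
    return distance_km / SPEED_KMH


def location_probabilities(locations, lam=0.0, origin=None):
    """Map each location name to its selection probability.

    locations maps a name to a ((latitude, longitude), weight) pair.
    Weights are assumed to be strictly positive.
    """
    if lam <= 0 or origin is None:
        # Equation 6.1: probability proportional to the weight.
        scores = {name: w for name, (_, w) in locations.items()}
    else:
        # Equation 6.2 (assumed reading): logit model on weight-scaled travel time.
        scores = {name: math.exp(-lam * travel_time_hours(origin, pos) / w)
                  for name, (pos, w) in locations.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}
```

A location can then be drawn with, for example, `random.choices(list(probs), weights=list(probs.values()))` where `probs` is the returned dictionary.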
6.3.2 Generators
When all the activity types are declared, the second part of the model defines what types of activity chains will and will not occur. To determine the sequence of activities performed by an individual, we use Markov chains in a similar way as they are used to generate semi-realistic random natural language sentences. Instead of using only a single Markov chain to generate chains of activities, our procedure derives multiple Markov chains from the input data. As the input data contains blueprints instead of exact Markov chains, we use the more intuitive term generators. We define the following properties for a generator:
• A weight, indicating how likely it is that a new individual uses this generator.
• A home activity type, that is used as a starting and ending activity for every tour generated by this generator. In many applications this should be an activity type with a fixed location, although this is not strictly necessary.
• A list of transitions. It is highly recommended that these transitions are defined such that the home activity can be reached from any other activity and that all cycles in the Markov chain contain the home activity. This is to ensure that any started tour returns to the home activity after a finite number of steps.
Every time we start creating a chain of activities for a new individual, we first decide at random which of the generators to use with probabilities proportional to their defined weights. The Markov chain associated with that generator is defined by the transitions and the properties of the activity types. All transitions that are not defined explicitly are assumed to have a probability of zero. The transitions that are defined have the following properties:
• The from activity type, indicating from which current activity this transition is to be considered during the generation of journeys.
• The to activity type, indicating which activity is performed next when this transition is selected.
• The weight of the transition. The probability that a certain transition is selected is proportional to the weight of a transition among the relevant transitions that share the same from activity type.
Note that although a single Markov chain is defined, not all activities may be available every day. As a result, the actual transition probabilities to go from one activity type to another depend on the availability of that activity during the day for which we are currently generating activities.
Suppose that we have a set of activity types $A$ and transition weights $t_{ij}$ for $i \in A$, $j \in A$. The probability that activity $j$ is selected to succeed activity $i$ during the current day $d$ is defined as follows:
\[ p_d(i, j) = \begin{cases} t_{ij} \left( \sum_{k \in A_d} t_{ik} \right)^{-1} & \text{if } j \text{ is available during day } d \\ 0 & \text{otherwise} \end{cases} \tag{6.3} \]
Here, $A_d$ is the set of activity types that are available during day $d$. Note that during some days, it could be the case that there are no valid transitions. In those cases, the current activity is extended until a day is reached where a valid transition exists.
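Equation 6.3 can be sketched in Python as a renormalisation of the transition weights over the activity types available on the current day. The function name `transition_probabilities`, the dictionary layout of the transitions and the empty result for days without valid transitions are assumptions of this sketch.

```python
def transition_probabilities(current, transitions, available):
    """Compute p_d(current, j) for every activity type j reachable today.

    transitions maps (from_type, to_type) to a weight; pairs that are not
    listed have probability zero. available is the set A_d of activity
    types that are available during the current day.
    """
    weights = {
        to: w
        for (frm, to), w in transitions.items()
        if frm == current and to in available and w > 0
    }
    total = sum(weights.values())
    if total == 0:
        # No valid transition today: the caller extends the current activity
        # and tries again on a later day.
        return {}
    return {to: w / total for to, w in weights.items()}
```

For example, with weights 3 for home to work and 1 for home to shop, a day on which work is unavailable yields probability 1 for the shop transition.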
6.3.3 Sampling Process
Using all the properties defined in the previous sections, we can generate random data. For a certain individual, we generate T tours, starting from a given point in time, advancing the days until the individual returns to the home activity T times. The following procedure can then be repeated as often as desired to obtain the desired amount of data:
1. Pick a generator to use with probability proportional to the weights defined for the generators.
2. Repeat the following until the specified number of tours T has been generated:
   a) Determine a day during which the new tour starts. First try the day after