Much research on data centre transport protocols has been motivated by the results of a limited number of small traffic studies as described in Section 2.4. These studies suggest that data centre traffic broadly splits into two groups, short foreground message traffic and longer-running background traffic. It is important to note that these studies specifically focus on single tenant data centres. Such data centres are used by online companies to host search engines, social media sites and online retail sites. They may also be used by large corporations to host their internal systems.
Data centre traffic generally comes from three types of application:
Partition-aggregate applications: These applications send small queries out to a large number of nodes (the partition stage). Each query then generates a small response, often only a single packet long. These arrive at the aggregation node where they are combined to give the result. The efficiency is directly related to the time taken to retrieve all the responses from the partition stage.
Online data-intensive (OLDI) applications: These include web applications such as search engines and e-commerce where the results of a transaction have to be sent to an end user. These applications generate short flows of messages that may be a few tens of packets long. A study conducted in 2009 suggested that there is a direct correlation between increase in latency and a reduction in revenue. Consequently OLDI applications have strict deadlines by which their results must be passed on to the user.
Long-running applications: These include applcations transferring data sets to and from storage as well as maintenance related tasks such as creating backups or replicas
and installing new software images. As a general rule it has been assumed that such traffic is not latency sensitive. However, sometimes this storage traffic may be seen by the end-user, leading to an increased use of SSD storage in many data centres for so-called “hot” storage. Examples of this might include the first page of emails in someone’s online email account and the content for adverts that appear on search engines and social media pages.
The authors of data center TCP (DCTCP) studied the traffic generated by a 6,000 node production cluster over a one month period [5]. This cluster is dedicated to partition- aggregate style traffic. In common with most data centre traffic studies, most of the logs were at the granularity of sockets or flows, although they did collect some packet and application level logs to extract latency information. They identified two of the above classes of traffic which they term query traffic and background traffic. They further sub- divide the background traffic into update flows which are used to refresh the data at each worker node and short message flows which are used for controlling the cluster.
The traffic pattern generated by this combination of traffic sources is far from simple. This is borne out by other traffic studies which have tried (and largely failed) to characterise data centre traffic mathematically.
3.3.1
The importance of low latency
Figure 3.2 attempts to show how different applications are sensitive to bandwidth, to delay or to both. As can be seen, data centre applications tend to be sensitive to both, and OLDI or partition-aggregate applications are especially sensitive to latency.
Figure 3.2: Comparing the latency requirements of different traffic types
In the case of partition-aggregate applications, delaying messages exacerbates the problem of “stragglers”, or partition nodes that return their results late, leading to an overall delay
in the aggregation stage. While some partition-aggregate applications can choose to ignore results that arrive late, others may be highly sensitive. Imagine for instance searching a distributed key-value store for a key that only occurs once. In this case you may have to wait until every query result returns before finding the value you want. As mentioned above, a response message is generally only a single small packet and so this traffic is highly sensitive to per-packet latency. Partition-aggregate is also especially sensitive to TCP incast [27] where the arrival rate of packets causes the buffers at a single node to overflow, leading to packet losses across multiple flows and triggering time outs.
OLDI traffic is characterised by short message flows that may consist of tens of packets. The entire message is critical and so this traffic is sensitive to what might be termed “message latency” or “flow completion time”—the total time taken to receive the entire message. Often this latency requirement is imposed externally. Take for instance a web search engine. When the user submits a search request they have come to expect to get the response within less than a second. In that time the entire index has to have been searched (using some form of partition-aggregate scheme), the search results have to have been ranked and if the search engine uses an advertising revenue model a relevant set of adverts must have been chosen (which may even have involved an instant auction). In order to hit the final deadline, each stage of the process has a strict deadline. Search results that return late are simply ignored. However, the success of a search engine is measured by the accuracy and relevance of the results it returns to the user and so it pays to not ignore too many partition results. Such stragglers also represent a waste of compute resource and reduce the overall efficiency of the data centre.