CHAPTER I: CRIMINAL PROCEDURE CODE OF 2004.
2. CONSTITUTIONALIZATION OF THE CRIMINAL PROCEDURE CODE.
2.2. The Constitutionalization of Criminal Procedure.
Most studies that use TCP/IP headers to study the Web focus on web monitoring, modeling, and performance evaluation. We next provide an overview of the most closely related prior work that primarily uses TCP/IP data.
TCP/IP for Monitoring and Characterizing Web Usage
Crovella and Bestavros [1997] performed one of the earliest studies of the workload characteristics of web traffic using primarily TCP/IP headers. They found that web workloads exhibit many of the properties of self-similar processes; similar observations had already been made about Internet traffic as a whole [Willinger et al., 1997, Leland et al., 1994]. Self-similar traffic is associated with heavy-tailed distributions and is autocorrelated. Crovella and Bestavros [1997] noted that users sharing the same links and accessing similar web pages are among the primary factors that explain the self-similar characteristics of web traffic. Other factors include the underlying distributions of file sizes and the fact that users pause, or think, before requesting new web pages. Many of these factors still influence web traffic properties today [Ihm and Pai, 2011].
Smith et al. [2001] also conducted a measurement study of web traffic using TCP/IP headers and showed that these headers can be used to make inferences about how the Web is used. For example, using only TCP/IP headers, they were able to observe the increased adoption of banner ads by content providers and the increased adoption of Web-based email services. Other studies have tracked the evolution of the Web using only TCP/IP headers [Hernández-Campos et al., 2003a, Newton et al., 2013]. The goals and key results of these studies are similar to those of the HTTP header based studies previously discussed [Ihm and Pai, 2011, Callahan et al., 2010].
Paxson [1999] proposed a real-time monitor, called Bro, that can be used for security and network performance analysis. Bro processes TCP/IP headers in real time and generates logs of events that a network administrator may want to investigate further. Moore et al. [2001] and Roesch et al. [1999] proposed tools, called CoralReef and Snort respectively, that perform similar tasks. These tools can also perform deep packet inspection to obtain more information about how the network is being used, which can help detect when a network has been compromised. Gu et al. [2008] proposed a similar tool that analyzes TCP/IP headers to determine whether a network is being used in a malicious manner. They relied on TCP/IP headers because most applications on the Internet use TCP and because the system is intended to be independent of the application layer protocol. Gu et al. [2008] showed that host communication patterns derived from TCP/IP headers can be used to detect malicious activity without application layer protocol headers. Together, these studies show that application layer headers are not always needed to analyze web traffic.
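As a rough illustration of what a host communication pattern derived from TCP/IP headers might look like, the sketch below counts, for each source host, how many distinct destination hosts and destination ports it contacts. This is not the method of Gu et al. [2008] or of any tool cited above; the function name `host_fanout` and the `(src_ip, dst_ip, dst_port)` record layout are hypothetical choices made only for this example.

```python
from collections import defaultdict


def host_fanout(headers):
    """Summarize per-host fan-out from header-only records.

    `headers` is an iterable of (src_ip, dst_ip, dst_port) tuples, a
    hypothetical record layout assumed for this sketch. No payload or
    application layer header is consulted.
    Returns {src_ip: (distinct_destinations, distinct_ports)}.
    """
    peers = defaultdict(set)
    ports = defaultdict(set)
    for src, dst, dport in headers:
        peers[src].add(dst)
        ports[src].add(dport)
    return {src: (len(peers[src]), len(ports[src])) for src in peers}


if __name__ == "__main__":
    records = [
        ("10.0.0.1", "192.0.2.10", 80),
        ("10.0.0.1", "192.0.2.10", 443),
        ("10.0.0.1", "192.0.2.11", 80),
        ("10.0.0.2", "192.0.2.10", 80),
    ]
    for host, (n_peers, n_ports) in sorted(host_fanout(records).items()):
        print(host, n_peers, n_ports)
```

A host whose fan-out differs sharply from that of its neighbors is the kind of signal a header-only monitor could flag for further inspection; what counts as anomalous is a policy decision this sketch does not attempt to make.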
One of the most popular applications of web traffic monitoring is web traffic modeling and traffic generation. Barford and Crovella [1998] proposed one of the first tools, called SURGE, that generates realistic web traffic. SURGE generates traffic using statistical models that approximate the distributions of the network features that shape the traffic observed on a real link. Examples of the modeled features include file size and the duration of active and inactive periods in traffic. Barford and Crovella [1998] showed that theoretical distributions, such as Weibull, Pareto, and lognormal, can be used to model these features in a manner that approximates real web traffic. However, theoretical distributions alone are not enough to model web traffic as it becomes more complex. Hernandez-Campos [2006] acknowledged this issue and proposed models for replaying web traffic at the level of HTTP requests and responses.¹⁸ Weigle et al. [2006] used the ideas behind the models by Hernandez-Campos [2006] to develop a web traffic generator, called TMIX, that can replay web traffic measured “in the wild” in a manner that approximates the properties of real web traffic.
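The idea of generating traffic from theoretical distributions can be sketched in a few lines. The example below draws file sizes from a Pareto distribution and inactive (think) periods from a Weibull distribution using Python's standard `random` module. It is not SURGE, and every parameter value (shapes, scales, the minimum file size) is an arbitrary placeholder chosen for illustration, not a fitted value from any of the cited studies.

```python
import random


def sample_file_size(rng, shape=1.2, min_bytes=1000.0):
    """Heavy-tailed file size in bytes.

    paretovariate(shape) returns values >= 1, so the result is >= min_bytes.
    Both parameters are illustrative assumptions.
    """
    return min_bytes * rng.paretovariate(shape)


def sample_think_time(rng, scale=1.5, shape=0.5):
    """Inactive period in seconds, drawn from a Weibull distribution.

    random.weibullvariate(alpha, beta) takes the scale first (alpha)
    and the shape second (beta). Both values are illustrative assumptions.
    """
    return rng.weibullvariate(scale, shape)


def synthetic_session(n_requests, seed=0):
    """Return a list of (request_time_seconds, file_size_bytes) pairs."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    clock = 0.0
    events = []
    for _ in range(n_requests):
        clock += sample_think_time(rng)  # user pauses before the next request
        events.append((clock, sample_file_size(rng)))
    return events


if __name__ == "__main__":
    for when, size in synthetic_session(5):
        print(f"t={when:8.2f}s  size={size:12.0f} bytes")
```

A lognormal component could be added in the same way with `rng.lognormvariate(mu, sigma)`. Because the distributions are heavy-tailed, a small number of very large files and very long pauses will dominate any short sample.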
This past research, which is included in Table 2.3, is relevant to this dissertation because it shows that TCP/IP headers can be used to study the properties of web traffic. In this dissertation, we build upon this past research by expanding the scope in which TCP/IP headers can be used to study the modern Web to include web page traffic classification. We also use many of the analysis methods and tools, including TMIX and theoretical distributions, to study and model web page traffic.
Evaluating Web Performance using TCP/IP
TCP/IP headers are indispensable when evaluating the performance impact that protocol enhancements at the network and transport layers have on the Internet. Nielsen et al. [1997] designed experiments to determine whether the adoption of new technology would impact web performance. In particular, this study determined that HTTP/1.1 outperformed HTTP/1.0. It also showed that the widespread use of CSS style sheets and of the more compact PNG image representation improved web performance. These results were obtained by analyzing only TCP/IP header data via a carefully designed experimental methodology. In essence, the authors ran multiple experiments, where one enabled an existing feature, say HTTP/1.0, and another enabled a different feature, say HTTP/1.1. Conclusions were then drawn by comparing the TCP/IP data generated by each experiment. This simple approach is heavily used for TCP/IP based analysis of web traffic [Le et al., 2007, Wang et al., 2014, Christiansen et al., 2000].
¹⁸ This method is more generally referred to as replaying traffic at the source level.
Le et al. [2007] and Christiansen et al. [2000] conducted studies similar to that of Nielsen et al. [1997], except that their focus was to analyze the impact that different active queue management methods have on web performance. Active queue management (AQM) refers to a class of approaches that attempt to improve the performance of the Web at the network layer by managing the amount of datagram queuing at routers. These methods typically achieve this goal by having routers either (i) drop datagrams¹⁹ or (ii) send explicit notifications to end-hosts to reduce their sending rates. Christiansen et al. [2000] found that actively managing router queues by simply dropping datagrams does not, on its own, significantly improve web performance. Le et al. [2007] found that active queue management approaches that send explicit notifications to end-hosts in addition to dropping datagrams can noticeably improve web performance in many scenarios. Le et al. [2007] noted that performance improvements were less significant for TCP connections that have a high variance in RTTs. Recent performance evaluations that use TCP/IP data investigate whether SPDY improves web performance over HTTP/1.1 [Wang et al., 2014, Erman et al., 2013]. We discussed these studies in the prior section since they also leverage HTTP headers.
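The two router behaviors described above, dropping a datagram or sending an explicit notification, can be sketched with a simplified queue model loosely patterned on Random Early Detection. This is an assumption made for illustration: the text does not say which AQM algorithms the cited studies evaluated, and the class name, thresholds, and averaging weight below are hypothetical. Real implementations also include refinements that are omitted here.

```python
import random


class RedStyleQueue:
    """Toy AQM queue: signals congestion before the buffer is full.

    All parameter defaults are illustrative assumptions, not values
    taken from any cited study.
    """

    def __init__(self, min_th=5, max_th=15, max_p=0.1, weight=0.2, seed=0):
        self.min_th = min_th      # below this average, never signal
        self.max_th = max_th      # at or above this average, always signal
        self.max_p = max_p        # signal probability just below max_th
        self.weight = weight      # smoothing weight for the moving average
        self.avg = 0.0            # smoothed queue length
        self.queue_len = 0        # instantaneous queue length
        self.rng = random.Random(seed)

    def signal_probability(self):
        """Probability of signaling congestion at the current average."""
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def enqueue(self, ecn_capable):
        """Handle one arriving datagram; return 'enqueue', 'mark', or 'drop'."""
        self.avg = (1 - self.weight) * self.avg + self.weight * self.queue_len
        if self.rng.random() < self.signal_probability():
            if ecn_capable:
                self.queue_len += 1   # explicit notification: mark and forward
                return "mark"
            return "drop"             # implicit notification: discard
        self.queue_len += 1
        return "enqueue"

    def dequeue(self):
        """Remove one datagram from the queue if it is not empty."""
        if self.queue_len > 0:
            self.queue_len -= 1
```

In this sketch, a marked datagram still reaches the receiver, whereas a dropped datagram must be retransmitted by the sender.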
This body of work, which is included in Table 2.3, is related to this dissertation because it highlights that differences at other layers of the Internet Protocol Stack, say at the network and application layers, can be studied using TCP/IP headers. In this dissertation, we focus on developing techniques that can classify such differences using anonymized TCP/IP headers.
¹⁹ TCP will reduce the sending rates of end-hosts when datagrams are dropped/lost. Dropping a datagram is an indirect and
Comments on Anonymization of Web Traffic
TCP/IP headers include IP address information, which can be used to determine sensitive hostname information and therefore has implications for user privacy. The studies by Hernández-Campos et al. [2003a], Newton et al. [2013], and Smith et al. [2001] make extra efforts to preserve user privacy by anonymizing the IP addresses so hostnames cannot be easily obtained. However, the details of the anonymization procedures used in Web studies are rarely described [Sicker et al., 2007]. This lack of detail is a serious issue because weakly anonymized data can still be a privacy concern. There have been a number of instances where a content provider released “anonymized” user data, only for others to analyze the data and recover the private information that the content provider was trying to protect [He and Naughton, 2009]. Such cases have made the owners of web traffic data (e.g., researchers, content providers, ISPs, etc.) more reluctant to share or release it. Content providers are also increasingly encrypting network content to address privacy issues [White et al., 2013]. There have been a number of efforts to improve anonymization approaches [Fan et al., 2004b, Koukis et al., 2006, Le Blond et al., 2013, Schneier, 2013, Chen et al., 2013]. Despite these efforts to improve user privacy, it is still possible to deanonymize portions of anonymized web traffic and to infer the content of encrypted communications. We discuss these approaches when we discuss the related work on traffic classification.
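To make the idea of IP address anonymization concrete, the sketch below replaces each IPv4 address with a pseudonym derived from a keyed hash (HMAC-SHA256). This is one simple scheme chosen for illustration only. It is not the procedure used by any of the studies cited above, whose details, as noted, are rarely described, and the function name `pseudonymize_ipv4` is hypothetical.

```python
import hashlib
import hmac
import ipaddress


def pseudonymize_ipv4(address: str, key: bytes) -> str:
    """Map an IPv4 address to a pseudonymous IPv4 address.

    The mapping is deterministic for a given secret key, so all packets
    of the same host keep the same pseudonym within a trace.
    Limitations of this sketch: truncating the digest to 32 bits means
    two addresses can collide, and network prefixes are not preserved.
    """
    packed = ipaddress.IPv4Address(address).packed       # 4 raw bytes
    digest = hmac.new(key, packed, hashlib.sha256).digest()
    return str(ipaddress.IPv4Address(digest[:4]))        # first 4 bytes as an address


if __name__ == "__main__":
    secret = b"replace-with-a-random-secret"  # placeholder key for the demo
    for ip in ["192.0.2.1", "192.0.2.2", "192.0.2.1"]:
        print(ip, "->", pseudonymize_ipv4(ip, secret))
```

The key matters: an unkeyed hash of an IPv4 address can be reversed by hashing all 2^32 possible addresses, which is one way in which data can end up only weakly anonymized.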