CAPITULO II MARCOTEÓRICO
OBJETIVOS ESTRATÉGICOS Y ESPECÍFICOS Objetivo Estratégico:
6. Falta de sistema computarizado de registro de hechos vitales
5.1.5.8 Presentación y Análisis de los Resultados del Cuestionario de Control Interno por Componente.
Over the past decade, the web has become a dominant communication channel, and its popularity has fueled the rise of malicious websites (Xu et al., 2013a) and malvertising as a vector for infecting vulnerable hosts. Provos et al. (2007) examined the ways in which web page components could be used to exploit web browsers and infect clients through drive-by downloads. That study was later extended (Provos et al., 2008) to include an understanding of large-scale infrastructures of malware delivery networks, and provided overall statistics on the impact of these networks at a macro level. Their analysis found that ad syndication significantly contributed to drive-by downloads. Similarly, Zarras et al. (2014) performed a large scale study of the prevalence of malvertising in ad networks. They showed that certain ad networks are more prone to serving malware than others. Grier et al. (2012) studied the emergence of the exploit-as-a-service model for drive-by browser compromise and found that many of the most prominent families of malware are propagated through drive-by downloads from a handful of exploit kit flavors.
Detecting malicious landing pages has been a hot topic of research. The most popular approach involves crawling the web for malicious content using known malicious websites as a seed (Invernizzi et al., 2012; Li et al., 2012, 2013; Eshete and Venkatakrishnan, 2014). The crawled websites are verified using statistical analysis techniques (Li et al., 2012) or by deploying honeyclients in virtual machines to monitor OS and browser changes (Provos et al., 2008). Other approaches include the use of a PageRank algorithm to rank the
maliciousness of crawled sites (Li et al., 2013) and the use of mutual information to detect similarities among content-based features derived from malicious websites (Wang et al., 2013). Eshete and Venkatakrishnan (2014) identified content and structural features using samples of 38 exploit kits to build a set of classifiers that can analyze URLs by visiting them through a honey client. These approaches are complimentary to ours, but require significant resources to comb the Internet at scale.
Other approaches involve analyzing the source code of exploit kits to understand their behavior. For example, De Maio et al. (2014) studied 50 kits to understand the conditions which triggered redirections to certain exploits. Such information can be leveraged for drive-by download detection. Stock et al. (2015) clustered exploit kit samples to build host-based signatures for anti-virus engines and web browsers. Closer to our work are approaches that try to detect malicious websites using HTTP traffic. Cova et al. (2010), for example, designed a system to instrument JavaScript run-time environments to detect malicious code execution while Rieck et al. (2010) described an online approach that extracts all code snippets from web pages and loads them into a JavaScript sandbox for inspection. Unfortunately, these techniques do not scale well, and require precise client environment conditions to be most effective.
Other approaches focus on using statistical machine learning techniques to detect malicious pages by training a classifier with malicious samples and analyzing traffic in a network environment (Rieck et al., 2010; Canali et al., 2011; Blum et al., 2010; Ma et al., 2009, 2011; Mekky et al., 2014; Nelms et al., 2015). More comprehensive techniques focus on extracting javascript elements that are heavily obfuscated oriframes
that link to known malicious sites (Provos et al., 2007; Cova et al., 2010). Cova et al. (2010) and Mekky et al. (2014) note that malicious websites often require a number of redirections, and build a set of features around that fact. Canali et al. (2011) describes a static prefilter based on HTML, javascript, URL and host features while Ma et al. (2009, 2011) use mainly URL characteristics to identify malicious sites. Some of these approaches are used as pre-filter steps to eliminate likely benign websites from further dynamic analysis (Provos et al., 2008, 2007; Canali et al., 2011). Unfortunately, these techniques take broad strokes in terms of specifying suspicious activity, and as such, tend to have high false positive rates. They also require large training sets that are often not available. By contrast, this chapter proposes a framework for detecting flavors of exploit kits, and utilizing the interactions between HTTP flows to reduce false positives from a small seed of examples.
Yegneswaran et al. (2005) describe a framework for building semantic signatures for client-side vulnera- bilities packet traces collected from a honeypot. The work shares the similar observation that correlating
flows can help to reduce false positives; however, the work described in this chapter focuses on the specific problem of detecting server-side exploit kits using the structure of HTTP traffic by modeling kits as trees, and take advantage of structural similarity properties to reduce FPs. More recently, Stringhini et al. (2013) proposed a learning approach to detect malicious redirection chains using a proprietary dataset. The technique requires traffic from a large crowd of diverse users from different countries, using different browsers and OSes to visit the same malicious websites in order to train a classifier. Unfortunately, as shown in the work, the approach leads to high false positives and negatives with modest data labels and can only detect chains whereby the last node is deemed malicious. By contrast, the work in this chapter does not model client usage patterns and is not limited to the presence of redirection chains to identify exploit kits. The technique is based on structural similarity; therefore, the last node in the structure does not need to be malicious. Finally, the approach is designed to specifically reduce false positives and negatives in light of a small amount of malicious training data.
Subtree Similarity Search Problem: Note that the subtree similarity-search problem on large datasets remains an open research problem. Most proposals require scanning each tree in the dataset and then applying tree edit distance techniques to prune the search space. Cohen (2013) combined the general structural commonalities of trees as well as the uncommon elements to reduce the number of trees checked in a similarity search. The drawback of that work is that the indices are 10x the size of the input data and only works with single-labeled nodes. The work in this chapter is based on similar ideas to Cohen (2013) but works on trees where the nodes themselves have a large number of features. To make the approach practical, the work leverages the sparsity of the feature space in network traffic.