In this Chapter, we are interested in automatically labeling mobile network requests that either track the user or request ads for the user. Prior art, such as NoMoAds (Chapter 5), was limited in its ability to label A&T requests: it required a human to first detect an ad and then to manually come up with a rule in AdblockPlus format [84] that would block future occurrences of such ads. We illustrate this manual and iterative process in Fig. 6.1(a). Not only is this approach prone to human error, it also does not scale, and it is limited to ads because tracking activity is invisible to a human observer.
AutoLabel aims to solve the problem of manual labeling of mobile A&T requests. We start by noting that tracking and advertising on mobile devices is usually done by third-party libraries that app developers include in their apps to generate revenue. Throughout this Chapter, we will refer to a network request as an A&T request if it was generated by a library whose primary purpose is advertising or analytics (A&T libraries). Another key observation is that it is possible to determine whether a network request came from the application itself or from a library by examining the Java stack trace leading to the network API call. For example, consider the trimmed stack traces shown in Listings 6.2(a) and 6.2(b). Both were captured within the ZEDGE™ Ringtones & Wallpapers app, which has the package name net.zedge.android. Listing 6.2(a) shows the app starting an SSL handshake, as indicated by the presence of the package name net.zedge.android in the stack trace. On the other hand, Listing 6.2(b) shows the MoPub ad library starting an SSL handshake, as indicated by the com.mopub package name. Prior art, such as PmP [14], has also used stack trace analysis to infer which third-party libraries were accessing sensitive APIs. However, hooking into networking APIs is challenging (see Sec. 6.3.1.1), and thus PmP was unable to fetch such detailed traces for network access [14]. How we obtain the stack traces leading to each network request is described in Sec. 6.3.1.1.
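The attribution idea above can be illustrated with a minimal sketch. It assumes stack traces arrive as lists of textual frames in the usual "at package.Class.method(File.java)" form, ordered from the most recent call downward. The frames, class names, and the helper names frame_path and originating_package are hypothetical and are not taken from Listings 6.2(a) and 6.2(b); only the package names net.zedge.android and com.mopub come from the text.

```python
from typing import Iterable, Optional, Sequence


def frame_path(frame: str) -> str:
    """Return the dotted 'package.Class.method' part of one stack frame."""
    text = frame.strip()
    if text.startswith("at "):
        text = text[3:]
    # Drop the trailing "(File.java:123)" part, if present.
    return text.split("(", 1)[0]


def originating_package(frames: Iterable[str],
                        known_packages: Sequence[str]) -> Optional[str]:
    """Return the known package owning the top-most matching frame.

    Assumption: the first (most recent) frame that belongs to a known
    package identifies the code that triggered the network call.
    """
    for frame in frames:
        path = frame_path(frame)
        for package in known_packages:
            # Match on a package boundary so "com.mopub" != "com.mopubx".
            if path == package or path.startswith(package + "."):
                return package
    return None


# Hypothetical, trimmed traces for illustration only.
KNOWN = ["net.zedge.android", "com.mopub"]
APP_TRACE = [
    "at com.example.tls.ExampleSocket.startHandshake(ExampleSocket.java)",
    "at net.zedge.android.ExampleFetcher.run(ExampleFetcher.java)",
]
LIB_TRACE = [
    "at com.example.tls.ExampleSocket.startHandshake(ExampleSocket.java)",
    "at com.mopub.ExampleAdLoader.load(ExampleAdLoader.java)",
]

print(originating_package(APP_TRACE, KNOWN))
print(originating_package(LIB_TRACE, KNOWN))
```

Matching on a package boundary rather than a raw string prefix is a design choice of this sketch; it avoids attributing a frame to a library whose name merely shares leading characters with another package.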
In order to use stack traces for labeling A&T requests, we also need a list of package names belonging to libraries, and we need to know which ones are A&T libraries. To minimize manual list curation efforts, we use advances in static analysis of apps to help us identify A&T libraries’ package names. Android apps are structured in a way where classes belonging to different entities (e.g., app vs. library) are separated into different folders (packages). One such structure is shown in the “APK” box in Fig. 6.1(b). As with the stack traces shown in Fig. 6.2, the APK pictured in Fig. 6.1(b) also belongs to the ZEDGE™ app. Note how the APK is split between packages belonging to Google, MoPub, and ZEDGE™ itself. We can use this splitting to extract package names belonging to third-party libraries. In fact, that is exactly what LibRadar [20] does: it builds signatures for each package folder and then uses clustering to identify third-party libraries. Using this technique, its authors built an initial database of 29k libraries. Based on the extracted library signatures, LibRadar can identify libraries in new apps and can provide the corresponding package names even when package name obfuscation is used. Recently, an updated version of LibRadar was released: LibRadar++ [82]. This version of the tool is built over a larger set of apps (six million) and libraries (5,102).
Thus, AutoLabel uses LibRadar++ [82] to analyze apps and automatically produce a list of library package names contained within (Fig. 6.1(b)). Note that LibRadar++ provides two package names as output: the first is the package name used in the APK and the second is the original package name of the library. Most of the time the two names are the same. If an app uses package-name obfuscation, then the names are different, but LibRadar++ is still able to identify the library via its database of library signatures. Although there are some cases in which this identification fails (see Sec. 6.5.2), the LibRadar++ approach is still more resilient than matching against a static list of library package names. Furthermore, if an updated version of LibRadar++ becomes available, it can easily be plugged into AutoLabel. Based on the original package name that LibRadar++ provides, it is trivial to identify popular A&T libraries: one can simply search the Internet or use an existing list of library names and their purposes, such as AppBrain [1]. To identify A&T libraries, we use the list prepared by LibRadar [85], which maps library package names to their primary purpose.
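As a rough sketch of how the two package names might be combined with a purpose list, consider the following. Everything here is an assumption made for illustration: the DetectedLibrary record, the dotted name format, the purpose labels, and the example entries do not reflect the actual LibRadar++ output format or the contents of the list in [85]. The sketch also assumes that stack frames captured at run time carry the name used in the APK, so that name is the one kept for matching, while the original name is used only to look up the purpose.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class DetectedLibrary:
    """Hypothetical record for one library found in an APK."""
    apk_package: str       # name as it appears in the APK (may be obfuscated)
    original_package: str  # name resolved via the signature database


# Hypothetical purpose labels that this sketch treats as A&T.
AT_PURPOSES = {"Advertisement", "Mobile Analytics"}


def at_packages(detected: Iterable[DetectedLibrary],
                purpose_by_package: Dict[str, str]) -> Set[str]:
    """Return in-APK package names of libraries whose purpose is A&T."""
    result = set()
    for lib in detected:
        # Look up the purpose by the original (de-obfuscated) name.
        purpose = purpose_by_package.get(lib.original_package)
        if purpose in AT_PURPOSES:
            result.add(lib.apk_package)
    return result


# Illustrative inputs; names other than com.mopub are made up.
DETECTED = [
    DetectedLibrary("com.mopub", "com.mopub"),
    DetectedLibrary("a.b.c", "com.example.analytics"),  # obfuscated in APK
    DetectedLibrary("com.example.imageloader", "com.example.imageloader"),
]
PURPOSES = {
    "com.mopub": "Advertisement",
    "com.example.analytics": "Mobile Analytics",
    "com.example.imageloader": "Utility",
}

print(sorted(at_packages(DETECTED, PURPOSES)))
```

Libraries whose original name is missing from the purpose mapping are simply left out of the result in this sketch; a real pipeline might instead flag them for manual review.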
Fig. 6.1(b) summarizes the AutoLabel method: we match the collected stack traces against a list of package names belonging to A&T libraries produced by LibRadar++. We note that, unlike the approach depicted in Fig. 6.1(a), our method has minimal human involvement. The only point where manual effort might be needed is in mapping package names to their primary purpose. However, we note that, unlike a list of rules to match against URLs, a mapping of library names to their purpose will always be available and will remain up to date so that app developers can select which libraries to use. In order to perform this labeling, we need a system that can collect network requests and map them to the stack traces that led to each networking API call. The next section describes how AutoLabel achieves this mapping.