Textual Readability Indexes: These estimate how difficult an English passage is to comprehend; each is a different approximation of the U.S. grade level⁴ needed to understand the text. We use the Stanford NLP Parser⁵ to extract sentences and words, and TeX hyphenation [Lia83] to obtain syllables. The indexes include: Automated Readability Index [SSS67], Coleman-Liau Index [CL75], Gunning Fog Index [Gun52], SMOG Grade [McL69], Flesch Reading Ease Score, and Flesch-Kincaid Grade Level [Fle48];
Code Readability Index: an index devised by Buse and Weimer [BW10] to evaluate the readability of Java-like code samples, which considers metrics such as identifier length, for loops, if blocks, etc.
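As an illustration of how such indexes are derived from sentence, word, and syllable counts, the following minimal sketch computes the two Flesch measures using their standard published formulas. The function names are ours, and the counts are assumed to be supplied by the caller (in our setting they would come from the parser and the hyphenation step); this is not the StORMeD implementation.

```python
def flesch_reading_ease(sentences: int, words: int, syllables: int) -> float:
    """Flesch Reading Ease Score: higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(sentences: int, words: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate U.S. grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a short passage: 3 sentences, 45 words, 60 syllables.
ease = flesch_reading_ease(3, 45, 60)
grade = flesch_kincaid_grade(3, 45, 60)
```

Both measures depend only on the average sentence length and the average number of syllables per word, which is why they behave poorly on source code, where neither quantity is meaningful.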
The aforementioned meta-information types are the set computed by default. Not every meta-information type is suitable for every type of information unit. Text Readability and its code counterpart, Code Readability, are two examples. The former is designed to work with narrative, while the latter is designed to work with source code. If they are applied to a different input type, they do not work properly. For example, source code would be rated as unreadable according to text readability metrics.
Conversely, meta-information concerning code elements (e.g., method invocations, declarators) might help discover structural and semantic links between textual units and code units, due to its general applicability. For example, in the discussion depicted in Figure 7.2, the Types meta-information would contain StorageManagerBean both for the first code unit and for the last text unit. With StORMeD this information can be uncovered with a simple traversal of the meta-information model, without reprocessing the data.
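The kind of traversal we have in mind can be sketched as follows. The InformationUnit class and the shared_types function are hypothetical stand-ins introduced only for this illustration; they do not reproduce the actual StORMeD model or its API.

```python
from dataclasses import dataclass, field


@dataclass
class InformationUnit:
    """Hypothetical stand-in for one information unit of a discussion."""
    kind: str                                 # "code" or "text"
    types: set = field(default_factory=set)   # contents of the Types meta-information


def shared_types(units):
    """Return the type names found in at least one code unit and one text unit."""
    code_types = set().union(*(u.types for u in units if u.kind == "code"))
    text_types = set().union(*(u.types for u in units if u.kind == "text"))
    return code_types & text_types


# A discussion shaped like the example above: the same type appears
# in the first code unit and in the last text unit.
discussion = [
    InformationUnit("code", {"StorageManagerBean", "String"}),
    InformationUnit("text"),
    InformationUnit("text", {"StorageManagerBean"}),
]
links = shared_types(discussion)
```

Since the types are already stored as meta-information, the link between the code unit and the text unit is found without parsing the discussion again.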
The model can be easily generalized, allowing custom analyses to decorate an information unit with their results, stored as ad-hoc meta-information types. For example, traditional source code metrics [LM10] could be calculated when applicable and modeled as meta-information of code units. The organization of the meta-information model, together with the ready-made nature of StORMeD, favors the customization and reuse of Stack Overflow data to perform analyses tailored to specific needs. For example, Stack Overflow can be analyzed to discover information about undocumented libraries (e.g., bugs, usages, and patterns), a task that would otherwise require a full-blown analysis of the Stack Overflow dataset.
7.2 Usages of sun.misc.Unsafe in Stack Overflow
Unbeknownst to many application developers, the Java runtime includes a “backdoor” that allows expert library and framework developers to circumvent Java's safety guarantees. This backdoor is there by design, and is well known to experts, as it enables them to write high-performance “systems-level” code in Java. It is provided through an unofficial and undocumented API that gives the developer access to low-level, unsafe features of the Java Virtual Machine (JVM) and the underlying hardware, features that are unavailable in safe Java bytecode. This API is exposed by an undocumented class, sun.misc.Unsafe, in the Java reference implementation produced by Oracle.
Identifying Stack Overflow discussions concerning the usage of sun.misc.Unsafe cannot be done by relying solely on the tagging system provided by Stack Overflow. The topic is rarely discussed, and the only related tag, <unsafe>, is used to identify unsafe usages in code in general, not only in the Java programming language. Even if we consider the tag pair <java> plus <unsafe>, the contents do not focus only on sun.misc.Unsafe. An analysis of the contents is thus required to understand whether a discussion tackles sun.misc.Unsafe.

⁴ http://en.wikipedia.org/wiki/Grade_levels
⁵ http://nlp.stanford.edu/software/index.shtml
Without relying on any tagging of the discussion, we need to discover specific constructs in the contents that suggest the usage of sun.misc.Unsafe. This information can be obtained by analyzing both the text and the code contained in a discussion. For example, a discussion could report a code sample using some features of the sun.misc.Unsafe class, or a user could mention the class in an answer to a question concerning a specific problem that the class can solve. Identifying the pieces of a discussion that match the information concerning sun.misc.Unsafe requires an in-depth analysis of the text. StORMeD proves to be an ideal tool for this type of analysis. As a proof of reusability, in this section we employ StORMeD to analyze discussions on Stack Overflow, discover the ones concerning sun.misc.Unsafe, and avoid false positives.
7.2.1 Identifying discussions by type and method names
To identify Stack Overflow discussions concerning the sun.misc.Unsafe class, we start by analyzing all the discussions whose tags contain at least one among java, scala, android, and jvm. One possible way to understand whether a discussion concerns sun.misc.Unsafe is to (i) discover the usage of one of the methods exposed by the class, or (ii) identify any mention of the type Unsafe. We focus on the following AST nodes to check whether a discussion matches one of the two criteria:
Method Invocations: each method invocation node is analyzed to understand whether the invoked method name belongs to sun.misc.Unsafe. In case of a match, the post is marked as containing a method name of Unsafe. We also check the callee to understand whether the method is actually invoked on the Unsafe class, applying the same rules used for qualified identifiers.
Strict Method Name Identifiers: we consider identifiers that respect the Java naming convention for methods. Every identifier beginning with a lowercase letter and containing a case change (e.g., fieldObject) is taken into consideration as a method name. The method name must then match one of the methods declared in the class Unsafe.
Qualified Identifiers: qualified identifiers are nodes that are generally used inside other constructs. For example, they are used in import declarations, method invocations (before the method name), and stack trace lines (between “at” and the line number). For this reason, we check whether the qualified identifier matches a value such as Unsafe, unsafe, UNSAFE, or the fully qualified type sun.misc.Unsafe. In case of a match, the post is marked as declaring the type Unsafe.
Strict Qualified Identifiers: as with strict method name identifiers, we also check the strict qualified identifiers appearing in the natural language. We look for all the occurrences of qualified identifiers composed of at least three identifiers (e.g., sun.misc.Unsafe). Whatever matches this construct is treated as a normal qualified identifier.
String Literals: we verify whether the fully qualified type sun.misc.Unsafe is present in the literal. We also verify whether the literal matches the string “theUnsafe”. This string is the name of a special field in the Unsafe class that is used on the HotSpot VM to get the instance via reflection. Both rules suggest a usage of the class via reflection and the presence of sun.misc.Unsafe.
Stack Traces: we keep track of full stack traces and lone stack trace lines to either identify
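The strict identifier rules listed above can be approximated with regular expressions, as in the following minimal sketch. It works on plain text rather than on AST nodes, so it is not the StORMeD implementation. The set of method names is only a small subset of those declared in sun.misc.Unsafe, and treating a qualified identifier as a match when any of its components equals Unsafe, unsafe, or UNSAFE is our assumption about how the rule is applied.

```python
import re

# Small illustrative subset of the method names declared in sun.misc.Unsafe.
UNSAFE_METHODS = {"allocateMemory", "freeMemory", "compareAndSwapInt",
                  "objectFieldOffset", "getInt", "putInt", "defineClass"}
TYPE_NAMES = {"Unsafe", "unsafe", "UNSAFE"}

# Starts with a lowercase letter and contains an uppercase letter later on.
STRICT_METHOD_NAME = re.compile(r"\b[a-z][a-z0-9]*[A-Z][A-Za-z0-9]*\b")
# At least three dot-separated identifiers, e.g. sun.misc.Unsafe.
STRICT_QUALIFIED = re.compile(r"\b[A-Za-z_]\w*(?:\.[A-Za-z_]\w*){2,}")


def mentions_unsafe_type(qualified: str) -> bool:
    """Assumption: one matching component is enough to flag the type."""
    return any(part in TYPE_NAMES for part in qualified.split("."))


def unsafe_method_names(text: str) -> set:
    """Strict method name identifiers that are also Unsafe method names."""
    return {m for m in STRICT_METHOD_NAME.findall(text) if m in UNSAFE_METHODS}


def unsafe_type_mentions(text: str) -> list:
    """Strict qualified identifiers that refer to the type Unsafe."""
    return [q for q in STRICT_QUALIFIED.findall(text) if mentions_unsafe_type(q)]


sample = "Use sun.misc.Unsafe and call allocateMemory, not java.util.List or fooBar."
methods = unsafe_method_names(sample)
types = unsafe_type_mentions(sample)
```

In this sample, fooBar satisfies the naming convention but is discarded because it is not declared in Unsafe, and java.util.List is a strict qualified identifier that does not refer to the type.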
7.2.2 Refining sun.misc.Unsafe.park usages
Whenever a thread is put in the idle state, a call to the park method is made. If an exception occurs in the thread, sun.misc.Unsafe.park is likely to appear among the method invocations of the stack trace. In this case, the presence of the method park does not represent a relevant usage of Unsafe, yet it would make park the most used method on Stack Overflow. For this reason, we ignore occurrences of park inside stack traces.
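A minimal sketch of this filter is shown below. It assumes the usual textual layout of a Java stack trace line ("at", a qualified method name, and a source location in parentheses) and operates on raw text; the regular expression and the function name are ours and are not taken from StORMeD.

```python
import re

# "at <qualified class>.<method>(<source location>)"
TRACE_LINE = re.compile(
    r"^\s*at\s+(?P<cls>[\w$.]+)\.(?P<method>[\w$<>]+)\((?P<src>[^)]*)\)\s*$"
)


def unsafe_methods_in_trace(trace: str) -> list:
    """Methods of sun.misc.Unsafe found in stack trace lines, ignoring park."""
    found = []
    for line in trace.splitlines():
        match = TRACE_LINE.match(line)
        if match is None or match.group("cls") != "sun.misc.Unsafe":
            continue
        if match.group("method") == "park":
            continue  # an idle thread, not a relevant usage of Unsafe
        found.append(match.group("method"))
    return found


trace = """java.lang.IllegalStateException: boom
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at sun.misc.Unsafe.allocateMemory(Native Method)"""
relevant = unsafe_methods_in_trace(trace)
```

Only the allocateMemory frame is retained: the exception header is not a stack trace line, the LockSupport frame does not belong to Unsafe, and the park frame is dropped by the filter.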
7.2.3 Refining Parsing Results
The analysis performed on the AST allows us to identify whether a post contains the type or a method name of the class Unsafe. We collected 20915 discussions matching at least one of the two criteria, out of which 560 discussions report the type Unsafe and 20426 report a method name of Unsafe. However, while the presence of the type Unsafe guarantees that the discussion is actually about sun.misc.Unsafe, the presence of a method name alone could lead to misclassifying the discussion. For example, methods like getInt, getFloat, and getShort can be found in other classes like ByteBuffer⁶, while a method name like defineClass can be found in the Java ClassLoader⁷.
Conversely, the absence of the type in our parsing results does not guarantee that the discussion does not involve sun.misc.Unsafe. Indeed, to avoid false positives, we do not check at parsing time whether the lone term “unsafe” is mentioned in the natural language parts. To compensate for this conservative restriction in the parser, we take all the discussions with a method name of the class Unsafe, and we perform a pure text search for the term “unsafe”.
Out of 20426 discussions mentioning an Unsafe method, only 49 contain the term “unsafe” in the text. We proceeded by manually inspecting and verifying each of them, which resulted in 18 discussions that actually report a usage of sun.misc.Unsafe. Thus, our final dataset contains 560 discussions explicitly using the type Unsafe, and 18 discussions reporting only a method name together with the term “unsafe”, for a total of 578 discussions that actually concern sun.misc.Unsafe.
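The refinement step can be summarized by the following minimal sketch. The ParsedDiscussion record is a hypothetical summary of the parser output for one discussion, and two details are our assumptions rather than facts about the original analysis: the text search is case-insensitive, and discussions already reporting the type are excluded from the text search.

```python
from dataclasses import dataclass


@dataclass
class ParsedDiscussion:
    """Hypothetical summary of the parser output for one discussion."""
    text: str
    has_unsafe_type: bool
    has_unsafe_method: bool


def split_candidates(discussions):
    """Separate confirmed discussions from those needing manual inspection."""
    confirmed = [d for d in discussions if d.has_unsafe_type]
    to_inspect = [
        d for d in discussions
        if not d.has_unsafe_type          # assumption: avoid counting twice
        and d.has_unsafe_method
        and "unsafe" in d.text.lower()    # assumption: case-insensitive search
    ]
    return confirmed, to_inspect


sample = [
    ParsedDiscussion("import sun.misc.Unsafe;", True, False),
    ParsedDiscussion("buffer.getInt(0) returns garbage", False, True),
    ParsedDiscussion("Is getInt on Unsafe faster than ByteBuffer?", False, True),
]
confirmed, to_inspect = split_candidates(sample)

# Totals reported in this section: 560 by type plus 18 verified manually.
final_dataset_size = 560 + 18
```

The second discussion in the sample mentions getInt but never the term “unsafe”, so it is discarded without manual inspection, as happens for a ByteBuffer question.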
7.2.4 Stack Overflow Discussions
To understand which topics are related to posts mentioning sun.misc.Unsafe and its methods, we started by analyzing the tags of the corresponding questions. Table 7.1 shows the overall occurrences.
Popularity of Repliers
To understand how difficult the topics are that might require the use of sun.misc.Unsafe, we collected all the repliers to the questions in our final dataset. The answers may or may not contain references to sun.misc.Unsafe, but they are representative of the possible topics involved in posts mentioning this undocumented class. At the time of the Stack Overflow dump, the resulting set of users had an average reputation of 18,000. This is just below the level of trusted user⁸, which is the highest reputation rank granting special privileges on Stack Overflow.
We refined our selection to include only the answers mentioning sun.misc.Unsafe or one of its methods. In this case, the average reputation of the corresponding repliers is around 21,770, which is even above the maximum privilege threshold. Assuming a correlation between Stack Overflow popularity and user expertise, one might conclude that this is evidence that the topics related to sun.misc.Unsafe, and even more so the techniques to exploit and use sun.misc.Unsafe, require significant development experience.

Table 7.1. Most frequent tags

Tag               Occurrences    Tag                  Occurrences
java              366            arrays               16
multithreading    32             memory-management    15
android           27             jni                  14
jvm               26             bytebuffer           13
concurrency       24             c++                  12
memory            20             reflection           11
unsafe            18             serialization        10
performance       18             atomic               10

⁶ http://goo.gl/uH4oJZ
⁷ http://goo.gl/iN3dhm
To get further evidence that the topics related to sun.misc.Unsafe attract popular and expert users, we computed the distribution of the repliers' reputation among the ranks defined in the Stack Overflow reputation league⁹, as shown in Table 7.2.
Table 7.2. Distribution of Repliers' Reputation.

Reputation Range    All Users    Repliers (all)    Repliers (with sun.misc.Unsafe)
1–199               3,276,655    120 (0.0%)        31 (0.0%)
200–499             80,105       51 (0.1%)         10 (0.0%)
500–999             49,825       62 (0.1%)         18 (0.0%)
1,000–1,999         30,833       103 (0.3%)        31 (0.1%)
2,000–2,999         11,847       65 (0.5%)         17 (0.1%)
3,000–4,999         10,151       93 (0.9%)         35 (0.3%)
5,000–9,999         7,462        133 (1.8%)        34 (0.5%)
10,000–24,999       4,278        140 (3.3%)        42 (1.0%)
25,000–49,999       1,271        72 (5.7%)         19 (1.5%)
50,000–99,999       444          41 (9.2%)         17 (3.8%)
100,000+            234          39 (16.7%)        14 (6.0%)
According to the distribution reported in Table 7.2, the topics discussed in posts where sun.misc.Unsafe is mentioned attracted 39 top-ranked users, corresponding to 16.7% of all top-ranked users, and 14 of them discussed and mentioned sun.misc.Unsafe or one of its methods (corresponding to 6.0% of top-ranked users).
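The percentages in Table 7.2 are the share of all Stack Overflow users in a reputation range who appear among the repliers. The following minimal sketch reproduces the figures for the top rank from the counts in the table; the helper name is ours.

```python
def share(part: int, total: int) -> float:
    """Percentage of `total` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / total, 1)


# Counts for the 100,000+ reputation range, taken from Table 7.2.
top_all_users = 234
top_repliers = 39
top_repliers_unsafe = 14

top_share = share(top_repliers, top_all_users)
top_unsafe_share = share(top_repliers_unsafe, top_all_users)
```

The same computation on the 1–199 range (120 repliers out of 3,276,655 users) rounds to 0.0%, which shows how strongly the share of participating users grows with the reputation rank.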