Stack Overflow is a source of ready-made natural language content and tagged source code, which makes it a potential resource to better evaluate our approach. The data provided by Stack Overflow would allow us to create more realistic cases by harnessing the combination of human-generated narrative and code. However, as explained in Section 6.1, it is not possible to completely trust the fragments tagged with the <code> tag (see Figure 6.2). In the following, we tackle this issue: we present a methodology to mitigate the problem, test our approach in a practical setting, and further analyze RQ2.
<p>But <code>getWidth()</code> and <code>getHeight()</code> returns 0. Is it a problem with the inheritance or the constructor?</p>
Listing 6.9. Fragment with Tagged Code
Fragment Extraction
The contents of Stack Overflow are tagged with a subset of HTML5. Code elements can be tagged with the <code> tag, either at the top level of the post (out-paragraph) or inside any other HTML tag (in-paragraph). By analyzing the DOM of a post, we separate code-tagged elements (i.e., <code>) from the rest of the contents. In the end, each Stack Overflow post is fragmented into tagged textual and code parts. Each top-level node in the body of the post is treated separately.
Consider the fragment in Listing 6.9. The contents of the paragraph are collected until a <code> tag is encountered, and marked as textual. Then, we extract the contents of the <code> element and mark them as code. We repeat this process until the end of the input. In this case, the fragmentation would generate the sequence “But ”, “getWidth()”, “ and ”, “getHeight()”, and “ returns 0. Is it a problem with the inheritance or the constructor?”. This process is applied to every post of a discussion tagged as <java> in Stack Overflow. In total, we extracted 18,993,221 sub-fragments.
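The fragmentation of a single top-level node can be illustrated with a minimal sketch based on Python's standard `html.parser` module. This is not the tooling used in our study: the class `FragmentExtractor` and the function `extract_fragments` are hypothetical names introduced only for illustration, and the sketch assumes well-formed markup for one top-level node at a time.

```python
from html.parser import HTMLParser


class FragmentExtractor(HTMLParser):
    """Split the HTML of one top-level node into (kind, text) sub-fragments.

    Hypothetical helper for illustration only; kind is "text" or "code".
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []    # collected (kind, text) pairs, in document order
        self._code_depth = 0   # greater than 0 while inside a <code> element

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._code_depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._code_depth > 0:
            self._code_depth -= 1

    def handle_data(self, data):
        # Character data inside <code> is marked as code, everything else as text.
        kind = "code" if self._code_depth > 0 else "text"
        self.fragments.append((kind, data))


def extract_fragments(node_html):
    """Return the sub-fragments of a single top-level node."""
    parser = FragmentExtractor()
    parser.feed(node_html)
    parser.close()
    return parser.fragments


listing_6_9 = (
    "<p>But <code>getWidth()</code> and <code>getHeight()</code> returns 0. "
    "Is it a problem with the inheritance or the constructor?</p>"
)
for kind, text in extract_fragments(listing_6_9):
    print(kind, repr(text))
```

Applied to the paragraph of Listing 6.9, the sketch yields three textual sub-fragments interleaved with two code sub-fragments, with whitespace preserved.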
Tagging Agreement
After isolating fragments from Stack Overflow discussions, we need to check whether the fragments respect the tagging assigned by the users. Even though there is a degree of uncertainty about the behavior of the island grammar in ambiguous cases, in Section 6.2.1 we extensively tested the grammar of each language implemented, which gives us a reasonable level of confidence for complete constructs. Furthermore, the large number of fragments to be analyzed mitigates and distributes the possible ambiguity errors, while still providing a realistic approximation.
The island parser can be used to estimate the code coverage of the fragments, that is, the percentage of characters parsed as valid code elements out of the total number of characters of the fragments. The calculation for the tagging agreement is relatively simple: If a fragment is marked as text and has 0% code coverage, the tagging agreement is 100%; if its code coverage is 100%, the agreement is 0%. The dual holds for fragments marked as code: agreement is 100% with 100% code coverage, and 0% with 0% code coverage.
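The calculation can be sketched as follows. The text above fixes only the two extremes (0% and 100% coverage); treating intermediate values as a linear function of the coverage is an assumption of this sketch, and the function names `code_coverage` and `tagging_agreement` are hypothetical.

```python
def code_coverage(code_chars, total_chars):
    """Percentage of characters parsed as valid code elements."""
    if total_chars <= 0:
        raise ValueError("fragment must contain at least one character")
    if not 0 <= code_chars <= total_chars:
        raise ValueError("code_chars must lie between 0 and total_chars")
    return 100.0 * code_chars / total_chars


def tagging_agreement(tag, coverage):
    """Agreement (in percent) between a human tag and the parser's coverage.

    Assumption: agreement varies linearly between the two extremes
    described in the text.
    """
    if not 0.0 <= coverage <= 100.0:
        raise ValueError("coverage must be a percentage")
    if tag == "code":
        return coverage            # code tag agrees when everything parses as code
    if tag == "text":
        return 100.0 - coverage    # text tag agrees when nothing parses as code
    raise ValueError("tag must be 'code' or 'text'")


# A fragment tagged as text in which 10 of 40 characters parse as code.
print(tagging_agreement("text", code_coverage(10, 40)))
```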
Disentangling Stack Overflow Posts
The island grammar can be tested by analyzing to what extent it is capable of disentangling natural language from code elements in a subset of Stack Overflow posts. Having extracted and analyzed the fragments of every Stack Overflow discussion concerning Java, it is possible to select the posts whose tagging reaches full (i.e., 100%) agreement for both text and code elements. Assuming enough confidence in the correctness of our approach, this selection eliminates the unavoidable ambiguity cases where the contents are wrongly tagged (e.g., untagged code elements within narrative), by keeping only the posts whose fragments fully agree with their tagging when parsed in isolation.
The main idea is to select these posts, their fragments (that can be parsed in isolation), and verify that the island parser can reconstruct the human tagging when all fragments are merged together. The process to follow is conceptually similar to the one used in Section 6.2.1:
1. for each fragment, we parse it and create the H-AST;
2. we combine all the H-ASTs in one sequence;
3. we merge all the raw text of fragments to create one unique document;
4. we parse the document with the island parser to obtain another sequence of H-ASTs;
5. we verify that the two sequences are identical.
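The five steps can be sketched with a toy stand-in for the island parser. The function `toy_parse` below is purely hypothetical: it recognizes only zero-argument method calls such as `getWidth()` as code and treats everything else as narrative, and it represents each H-AST as a simple (kind, lexeme) pair. Coalescing adjacent narrative nodes before comparing is also an assumption of the sketch.

```python
import re

# Toy island: a zero-argument method call. The real island grammar is far richer.
CALL = re.compile(r"[A-Za-z_]\w*\(\)")


def toy_parse(text):
    """Return a sequence of (kind, lexeme) nodes standing in for H-ASTs."""
    nodes, pos = [], 0
    for match in CALL.finditer(text):
        if match.start() > pos:
            nodes.append(("text", text[pos:match.start()]))
        nodes.append(("code", match.group()))
        pos = match.end()
    if pos < len(text):
        nodes.append(("text", text[pos:]))
    return nodes


def coalesce(nodes):
    """Merge adjacent narrative nodes so that both sequences are comparable."""
    merged = []
    for kind, lexeme in nodes:
        if kind == "text" and merged and merged[-1][0] == "text":
            merged[-1] = ("text", merged[-1][1] + lexeme)
        else:
            merged.append((kind, lexeme))
    return merged


def disentangles(fragments, parse=toy_parse):
    # Steps 1-2: parse each fragment in isolation and combine the results.
    expected = coalesce([node for frag in fragments for node in parse(frag)])
    # Step 3: merge the raw text of all fragments into one document.
    document = "".join(fragments)
    # Step 4: parse the merged document.
    actual = coalesce(parse(document))
    # Step 5: the two sequences must be identical.
    return expected == actual


print(disentangles(["But ", "getWidth()", " and ", "getHeight()", " returns 0."]))
print(disentangles(["get", "Width()"]))  # fragment boundary splits an identifier
```

The second call shows how a mismatch can arise: the fragments parse differently in isolation than they do once merged.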
In the Stack Overflow dump of March 2016, there are 863,500 posts whose tagging has perfect agreement. We ran the island parser on them by following the aforementioned process. Table 6.3 shows the results of the disentangling process.
Disentanglement Posts Percentage
Success 823,866 95.41%
Failures 39,634 4.59%
Total Posts 863,500
Table 6.3. Disentangling Stack Overflow Results
The island parser correctly disentangles 95.41% of the posts. When we inspected the results, we found that the failures were due to a complex rule that matches and groups sequences of isolated statements. Some constraints on the first statement of these sequences, which are used to avoid capturing natural language constructs that resemble variable declarations, were indeed too strict. While we are able to capture single statements in isolation, the reference structure exhibits a mismatch, causing the failure. After fixing this issue, we were able to disentangle all the considered Stack Overflow posts, increasing our confidence in the ability of our approach to correctly disentangle unambiguous structured fragments from narrative.
Partial Agreement Analysis
Another interesting analysis can be done if we consider all the possible top-level paragraphs of posts and their contents. We can extend the agreement analysis to reveal some insights about both the correctness of how people tag code, and the limitations of our approach itself.
Consider the three types of tagging performed by humans:
(1) top-level paragraphs completely tagged as code (i.e., enclosed by the <code> tag);
(2) paragraphs tagged as pure natural language, with no in-paragraph code tagging (i.e., enclosed in any tag but <code>, like <p>);
(3) in-paragraph tagging, where paragraphs tagged as narrative exhibit some sub-fragments tagged as code (i.e., as in Listing 6.9).
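The three categories can be told apart by looking at the tags of a top-level node, as in the following minimal sketch. The function `paragraph_type` and its return labels are hypothetical, and the sketch assumes that its input is the markup of exactly one top-level node.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Record the start tags of a node in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def paragraph_type(node_html):
    """Classify one top-level node as 'code', 'text', or 'in-paragraph'."""
    collector = _TagCollector()
    collector.feed(node_html)
    collector.close()
    if not collector.tags:
        raise ValueError("expected the markup of one top-level element")
    if collector.tags[0] == "code":
        return "code"           # type (1): enclosed by the <code> tag
    if "code" in collector.tags[1:]:
        return "in-paragraph"   # type (3): narrative with tagged code inside
    return "text"               # type (2): no code tagging at all


print(paragraph_type("<code>int x = 0;</code>"))
print(paragraph_type("<p>Is it a problem with the constructor?</p>"))
print(paragraph_type("<p>But <code>getWidth()</code> returns 0.</p>"))
```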
Table 6.4 shows the agreement analysis for paragraphs completely tagged by people as code (2,928,766 paragraphs).
Agreement
Type None Partial Full
Code 8.15% 34.70% 57.15%
Table 6.4. Agreement for Paragraphs Tagged as Code
Our approach obtains full coverage on around 57% of the fragments, meaning that for 57% of the paragraphs tagged as code, our island parser reconstructs a full-fledged H-AST (note that this includes the case of islands with lakes). Partial agreement (i.e., the parser finds some narrative mixed with code) is found in 35% of the paragraphs, and no agreement (i.e., the parser finds only narrative) in 8% of the cases. By manually inspecting the cases with no agreement, we mostly found examples of other programming languages (e.g., SQL, CSS). Partial agreement, instead, is mostly due to cases where people tag as code console logs or error output other than stack traces, which contain incomplete code elements mixed with narrative.
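The three columns of the agreement tables correspond to a simple bucketing of per-paragraph agreement values, sketched below. The function names are hypothetical, and the sample values are made up for illustration; they are not data from our study.

```python
from collections import Counter


def agreement_bucket(agreement):
    """Map an agreement percentage to the column it is counted under."""
    if agreement == 0:
        return "None"
    if agreement == 100:
        return "Full"
    return "Partial"


def agreement_distribution(agreements):
    """Percentage of paragraphs falling in each bucket."""
    if not agreements:
        raise ValueError("at least one agreement value is required")
    counts = Counter(agreement_bucket(a) for a in agreements)
    total = len(agreements)
    return {b: 100.0 * counts[b] / total for b in ("None", "Partial", "Full")}


# Illustrative values only.
sample = [0.0, 100.0, 100.0, 62.5]
print(agreement_distribution(sample))
```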
Table 6.5 shows the agreement results for the paragraphs completely tagged as natural language, with no in-paragraph code (a total of 7,704,072 paragraphs).
Agreement
Type None Partial Full
Textual 1.38% 18.74% 79.88%
Table 6.5. Agreement for Paragraphs Tagged as Natural Language
As expected, a large part (79.88%, i.e., about 6M paragraphs) of the content tagged as natural language is coherent, that is, it contains only narrative without code. However, a significant part of the remaining paragraphs (18.74%, i.e., about 1.4M paragraphs) is reported to contain some valid constructs that are very likely to be code elements of the languages we support. Assuming the correctness of the island parser, this is evidence that users tend to forget to tag code elements, or deliberately avoid the <code> tag and use alternative markup to emphasize the code within a discussion (e.g., the <strong> or <blockquote> HTML tags). Only a small fraction of the paragraphs (1.38%, i.e., around 100K) is reported to be completely code by our parser. By manual inspection, we found that the top two untagged constructs found by our approach are reference types and qualified identifiers.
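The approximate paragraph counts quoted above follow directly from the percentages in Table 6.5 and the total of 7,704,072 paragraphs, as the following short computation shows.

```python
TOTAL_TEXT_PARAGRAPHS = 7_704_072

# Percentages from Table 6.5 (paragraphs tagged as natural language).
table_6_5 = {"None": 1.38, "Partial": 18.74, "Full": 79.88}

# Convert each percentage into an approximate number of paragraphs.
counts = {
    bucket: round(TOTAL_TEXT_PARAGRAPHS * pct / 100)
    for bucket, pct in table_6_5.items()
}

for bucket, count in counts.items():
    print(f"{bucket:8s} {count:>10,d}")
```

The computation gives roughly 6.15M paragraphs with full agreement, 1.44M with partial agreement, and 106K with no agreement.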
Finally, Table 6.6 reports the agreement values for paragraphs tagged as narrative that contain elements tagged as code (i.e., 1,665,340 paragraphs). We report the agreement aggregated by sub-fragment type, i.e., text or code.
Agreement
Type None Partial Full
Code 54.53% 6.76% 38.71%
Textual 0.04% 3.87% 96.09%
Table 6.6. Agreement for Fragments with in-paragraph Code Fragments.
We adopt the same fragmentation process that we applied to whole posts to evaluate the ability of our approach to disentangle posts. According to the island parser output, only 38.71% of the in-paragraph tagged code is in total agreement with the human tagging, and more than half of the sub-fragments (54.53%) are in total disagreement. Again, we found examples of other programming languages (e.g., SQL, CSS), as well as actual limitations of our approach, such as types with no real camel case (like String) or isolated primitive types. These constructs cannot be easily identified with a purely syntactic/lexical approach like ours, but require a technique that integrates domain knowledge and natural language processing.
Finally, only a minority of the in-paragraph sentences that remain untagged are completely recognized as code (i.e., 0.04% of fragments) or contain some code elements (i.e., 3.87% of fragments). As with paragraphs completely tagged as narrative, we found that the top two untagged constructs found by our approach are reference types and qualified identifiers.