

Our grammar supports all Java 8 constructs, including lambdas and type annotations. A proper definition of islands is the main challenge in our approach. Code fragments can be ambiguous with natural language, so the rules must be devised methodically. In general, the more complex a grammar rule, the less ambiguity with natural language it can generate. Thus, constructs like type declarations (e.g., classes) or non-empty blocks can be parsed and modeled as islands without restrictions. However, these constructs are too coarse-grained.

For example, on Stack Overflow users tag code fragments in various ways. Generally, fragments are used to focus the reader's attention on the part of the code where the problem is supposed to be located. Therefore, instead of reporting whole classes, users tend to report simpler constructs, e.g., methods, variable declarations, statements, or type names. Moreover, these constructs are often incomplete (e.g., methods missing their body), making the disambiguation from natural language even harder. In the following, we pick one example for each category of constructs that our approach handles, focusing on how we parse it with PEGs to extract it from the narrative and how we model it in the H-AST.

6.1 Multilingual Island Grammar 95

Fragments and Sub-constructs

Our first example focuses on method declarations. Consider the two following (simplified) rules for method and constructor declarations:

MethodDecl ← Modifier∗ Type Identifier Args Body

ConstructorDecl ← Modifier∗ Identifier Args Body

public Foo() { } // a constructor

private void aMethod() { } // a method

Listing 6.1. Example Declarations

The two cases are not problematic, since they are complete and do not require grammar modifications, even when immersed in the narrative. If modifiers are present, there is no ambiguity in deciding that the first declaration is a constructor and that the second is a method (since it has a return type). If visibility modifiers are absent, things are more complicated.

Foo() { } // a constructor

void aMethod() { } // a method

Listing 6.2. Declarations without Modifiers

Listing 6.2 shows a case where the constructs are missing modifiers. From a grammatical point of view the constructs are valid, since modifiers are not mandatory. Due to the prioritization mechanism of PEG parsing, the method declaration rule must take precedence over the constructor declaration rule in the island definition; otherwise, the parser would match aMethod as a constructor, ignoring the return type. Furthermore, when visibility modifiers (i.e., public and private) are absent, ambiguity arises. Consider the same two declarations interleaved with narrative, as in Listing 6.3.

Consider the constructor Foo() { } and the method void aMethod() { }.

Listing 6.3. Declarations Immersed in Narrative

In this case, by following the standard Java grammar, there is no way to distinguish between a constructor and a method, since the word constructor is a valid Java identifier and thus a syntactically valid return type. The fragment constructor Foo() { } would therefore parse as a method named Foo whose return type is constructor.

MethodDecl ← Modifier+ Type Identifier Args Body | Type MethodIdentifier Args Body

ConstructorDecl ← Modifier+ Identifier Args Body | ClassIdentifier Args Body

To solve this issue, we must take into account the lexical structure of identifiers, since we cannot rely on purely syntactic aspects of the Java grammar. Lexical constraints can be enforced to help disambiguate such cases; in other words, ambiguity can be mitigated by enforcing naming conventions. In the rules above, this is the role of MethodIdentifier and ClassIdentifier in the alternatives without modifiers.

In Java, naming conventions discriminate constructors from methods. The former have a capital letter at the beginning of the name (since they share the same conventions as class names), while the latter start with a lowercase letter and implement the camel case convention whenever their name is composed of two or more words.
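The effect of these lexical constraints can be illustrated with a minimal Java sketch. It is not the PEG implementation described here: it approximates the two modifier-free alternatives with regular expressions, restricts Args and Body to the empty forms `()` and `{ }`, and tries the method rule before the constructor rule to mimic ordered choice. The class name `DeclarationIslands`, the method `find`, and the patterns are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: finds modifier-free declarations immersed in narrative. */
final class DeclarationIslands {
    // Simplification (assumption): only empty argument lists and empty bodies.
    private static final String ARGS_BODY = "\\s*\\(\\s*\\)\\s*\\{\\s*\\}";
    // Type MethodIdentifier Args Body: the method name must start with a lowercase letter.
    private static final Pattern METHOD_DECL =
        Pattern.compile("([A-Za-z_$][\\w$]*)\\s+([a-z][\\w$]*)" + ARGS_BODY);
    // ClassIdentifier Args Body: the constructor name must start with an uppercase letter.
    private static final Pattern CONSTRUCTOR_DECL =
        Pattern.compile("([A-Z][\\w$]*)" + ARGS_BODY);
    private static final Pattern WORD = Pattern.compile("\\S+");

    static List<String> find(String text) {
        List<String> islands = new ArrayList<>();
        Matcher word = WORD.matcher(text);
        int pos = 0;
        while (pos < text.length() && word.find(pos)) {
            int start = word.start();
            Matcher method = METHOD_DECL.matcher(text).region(start, text.length());
            Matcher ctor = CONSTRUCTOR_DECL.matcher(text).region(start, text.length());
            if (method.lookingAt()) {          // ordered choice: method rule first
                islands.add("method:" + method.group(2));
                pos = method.end();
            } else if (ctor.lookingAt()) {     // then the constructor rule
                islands.add("constructor:" + ctor.group(1));
                pos = ctor.end();
            } else {
                pos = word.end();              // water: skip this word
            }
        }
        return islands;
    }
}
```

Applied to the sentence of Listing 6.3, `DeclarationIslands.find` returns `[constructor:Foo, method:aMethod]`. The word constructor is no longer accepted as a return type for Foo, because Foo does not satisfy the lowercase constraint assumed for method names.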

Incomplete Productions

The second example concerns incomplete productions, such as incomplete declarations. Users tend to focus on the important aspects of the code by hiding the parts they deem irrelevant for the discussion. For example, a method body can be removed and replaced with an ellipsis (“...”), and an ellipsis can also stand in for the parameter declarations (e.g., int aMethod(...)). Figure 6.1 shows an example.

Figure 6.1. Example of Stack Overflow discussions with code tagged by users

The author is asking about an inheritance problem with the JPanel class, and reports just the partial signatures of the classes Entity and TextEntity, without their bodies.


The incomplete class declaration is a good example of the challenges faced when designing the island grammar. The incompleteness of the class must be handled without introducing ambiguity.

ClassDecl ← ... | Modifier∗ class Identifier extends Type | Modifier+ class Identifier | class ClassIdentifier

The rules above show some of the alternatives used to parse incomplete class declarations. Incomplete class declarations can appear in even simpler forms like class SomeTypeName, without modifiers. In this case, the ambiguity with English is relatively high. If no restrictions are enforced on the lexical structure of the identifier, whatever word comes after the keyword class in a text would be considered a class name. Once again, naming conventions can help, and we can consider only identifiers respecting the Java conventions for types.
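As a rough illustration of the last alternative (class ClassIdentifier), the following Java sketch accepts the word after the keyword class only if it starts with an uppercase letter. The class `IncompleteClassDecls`, the method `findClassNames`, and the pattern are hypothetical; the actual lexical definition of ClassIdentifier may be stricter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the "class ClassIdentifier" alternative. */
final class IncompleteClassDecls {
    // Assumed lexical constraint: the name after "class" starts with an uppercase letter.
    private static final Pattern CLASS_DECL =
        Pattern.compile("\\bclass\\s+([A-Z][A-Za-z0-9_$]*)");

    /** Returns the names of the incomplete class declarations found in the text. */
    static List<String> findClassNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = CLASS_DECL.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}
```

For the text "This class hierarchy is wrong, but class TextEntity works", the sketch returns only `TextEntity`: the lowercase word hierarchy is treated as narrative.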

Incompleteness must also be taken into account on the modeling side. Listing 6.4 shows a simplified version of a class declaration in the H-AST.

case class ClassDeclarationNode(
  modifiers: Seq[ModifierNode],
  identifier: IdentifierNode,
  typeParams: Option[TypeParamsNode],
  superTypes: Option[TypeNode],
  interfaces: Option[TypeListNode],
  body: Option[ClassBodyNode]) // None when the declaration is reported without a body

Listing 6.4. Modeling a Class Declaration

Contrary to a normal Java AST, in our H-AST the class body is modeled as an optional construct.

In-Paragraph Fragments

The hardest fragments to disentangle from natural language are in-paragraph code fragments, which take the role of parts of the discourse in the narrative. An example can be found in Figure 6.1, where the method invocations getWidth() and getHeight() are the subjects of a sentence. Moreover, Figure 6.1 also contains non-tagged class names like TextEntity and JTextArea. It is also possible to find fully qualified identifiers (e.g., java.util.Date) and entire statements. Our approach parses and models these cases as well.

As in the case of incomplete class declarations, the lexical structure of words, and in particular their compliance with naming conventions, can be used to identify in-paragraph fragments. For example, naming conventions can be leveraged to identify type names like TextEntity and JTextArea in Figure 6.1. We devised additional rules to deal with specific in-paragraph fragments.

Qualified Identifiers

According to the Java grammar, a qualified identifier is a sequence of identifiers separated by dots. Unfortunately, this structure is highly ambiguous with English text, in particular when a full stop separates two sentences. Consider the sentence “Hi Mike. How are you?”: the rule would match the qualified identifier Mike.How, since whitespace is normally ignored by Java parsers. To mitigate these false positives, one simple solution is to disallow spaces between identifiers and dots, since such spaces very rarely appear when users mention qualified identifiers.
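The difference between the two readings can be sketched in Java with two regular expressions, one tolerating whitespace around dots (as a Java parser would) and one forbidding it. The class `QualifiedIdentifiers` and its methods are hypothetical names, and the patterns only approximate Java identifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: qualified identifiers with and without whitespace around dots. */
final class QualifiedIdentifiers {
    private static final String ID = "[A-Za-z_$][A-Za-z0-9_$]*";
    // Lenient reading: whitespace around dots is ignored.
    private static final Pattern LENIENT =
        Pattern.compile("\\b" + ID + "(?:\\s*\\.\\s*" + ID + ")+");
    // Strict reading: no whitespace allowed between identifiers and dots.
    private static final Pattern STRICT =
        Pattern.compile("\\b" + ID + "(?:\\." + ID + ")+");

    static List<String> findLenient(String text) {
        return findAll(LENIENT, text);
    }

    static List<String> findStrict(String text) {
        return findAll(STRICT, text);
    }

    private static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }
}
```

On the text "Hi Mike. How are you? Use java.util.Date instead.", the lenient pattern reports both `Mike. How` and `java.util.Date`, while the strict pattern reports only `java.util.Date`.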

Method and Class Names

Class names are characterized by starting with an uppercase letter and being written in camel case notation. Extracting class names based only on an uppercase first letter would introduce noise, since every word at the beginning of a sentence would be identified as a potential class name. For this reason, we require class names to have at least a second word in camel case notation, e.g., TextEntity [BWYS11, BLR10, BCLM11, RR13]. Similarly, an in-paragraph method name is an identifier that respects the Java naming convention or exhibits features of C-like naming conventions. In other words, we extract likely method names that start with a lowercase letter and contain at least one case change (e.g., aMethodName) or an underscore (e.g., a_method_name) [BLR10, BWYS11, RR13].
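These two lexical heuristics can be approximated in Java as follows. This is a sketch under assumptions: the class `InParagraphNames`, its method names, and the exact patterns are illustrative, and the additional requirement that a class name contain at least one lowercase letter (to skip acronyms) is not taken from the text.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the in-paragraph naming heuristics. */
final class InParagraphNames {
    // Uppercase start plus at least one further capitalized word, e.g. TextEntity.
    // The lookahead requiring a lowercase letter is an added assumption to skip acronyms.
    private static final Pattern CLASS_NAME =
        Pattern.compile("(?=.*[a-z])[A-Z][a-z0-9]*(?:[A-Z][a-z0-9]*)+");
    // Lowercase start plus a case change (aMethodName) or an underscore (a_method_name).
    private static final Pattern METHOD_NAME =
        Pattern.compile("[a-z][a-z0-9]*(?:[A-Z]|_[A-Za-z0-9])[A-Za-z0-9_]*");

    static boolean isLikelyClassName(String word) {
        return CLASS_NAME.matcher(word).matches();
    }

    static boolean isLikelyMethodName(String word) {
        return METHOD_NAME.matcher(word).matches();
    }
}
```

With these patterns, `TextEntity` and `JTextArea` are accepted as likely class names while a sentence-initial word such as `Consider` is rejected; `aMethodName` and `a_method_name` are accepted as likely method names while `size` is rejected.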

Method Invocations

Method invocations have a peculiar structure, but they still retain a non-negligible level of ambiguity with English. The presence of elements like type arguments (e.g., write<String>("a")) clearly distinguishes them from natural language. In other cases, a strict qualified identifier followed by a method name respecting the naming conventions for methods is enough (e.g., object.aMethod()). However, our approach also considers the case where the name is a single word with no camel case (e.g., size). In this case, we resort to the arguments of the invocation to decide whether the fragment is a likely method invocation.

We adopt the following heuristic. If more than one argument is provided, the fragment is considered an invocation. If there is only one argument, we discriminate by its type: every argument type except qualified identifiers is considered safe. If the argument is a qualified identifier, it must consist of at least two identifiers separated by a dot, because allowing a lone identifier would introduce noise. For example, in the sentence “the color of the apples (red) is not yellow” the grammar would match apples (red) as a method invocation.
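The heuristic can be written down as a small Java predicate over the already-separated argument strings. This is only a sketch: `InvocationHeuristic` and `isLikelyInvocation` are hypothetical names, the short list of literal keywords is an assumption, and the empty-argument case is rejected as out of scope because the heuristic as stated does not cover it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch of the argument-based heuristic for single-word method names. */
final class InvocationHeuristic {
    private static final Pattern LONE_IDENTIFIER =
        Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*");
    // Assumed short list of words that look like identifiers but are literals or keywords.
    private static final Set<String> NOT_IDENTIFIERS =
        new HashSet<>(Arrays.asList("true", "false", "null", "this"));

    static boolean isLikelyInvocation(List<String> args) {
        if (args.isEmpty()) {
            // Not covered by the heuristic as described, so this sketch does not decide it.
            throw new IllegalArgumentException("empty argument list is out of scope");
        }
        if (args.size() > 1) {
            return true; // more than one argument: considered an invocation
        }
        String arg = args.get(0).trim();
        boolean loneIdentifier =
            LONE_IDENTIFIER.matcher(arg).matches() && !NOT_IDENTIFIERS.contains(arg);
        // A lone identifier is rejected; dotted identifiers and other argument kinds are safe.
        return !loneIdentifier;
    }
}
```

For the fragment apples (red), the single argument `red` is a lone identifier, so the predicate returns false; for a single argument such as `"a"` or `shop.name`, or for two or more arguments, it returns true.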

Listing 6.5. Example of island with lakes

@Entity
@Table(name = "shops")
public class Shop {
    ...
    @ManyToOne(fetch=FetchType.EAGER)
    @JoinColumn(name = "shop_type_id")
    private TypeShop typeShop;
    ...
}

Island with Lakes

According to Moonen [Moo01], islands with lakes are another type of construct that island grammars can parse. An island with lakes is a valid construct that may contain water (lakes), in our case natural language narrative or punctuation marks. Listing 6.5 shows an example where “...” would be a lake.


The sample is taken from a Stack Overflow question and represents a typical way for users to strip away unneeded code. Our grammar supports these types of constructs to avoid missing relevant information, like the class reported in Listing 6.5. For the sake of simplicity in engineering the grammar, we limited the support to blocks (e.g., method bodies, class bodies) containing water. In doing so, we can support both code stripping and minor errors in statements (e.g., a missing semicolon). A complete class or method construct like the one in Listing 6.5 is parsed as a complete construct even if it contains minor errors inside its body. Allowing lakes complicates the modeling side of the H-AST. Essentially, we allow textual fragments to be interchangeable with statements in blocks (e.g., in method bodies) and with member declarations in classes.
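To make the modeling consequence concrete, the following Java sketch shows one possible shape for a body whose elements are either code or water. All names (`BodyElement`, `MemberNode`, `LakeNode`, `ClassBodySketch`) are hypothetical and do not correspond to the actual H-AST classes. The line-based classifier is a toy stand-in for the parser, and it treats only a bare ellipsis as a lake.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical H-AST fragment: a body element is either code or water. */
interface BodyElement {
    String text();
}

/** A member declaration (or statement) recognized as code. */
final class MemberNode implements BodyElement {
    private final String source;
    MemberNode(String source) { this.source = source; }
    public String text() { return source; }
}

/** A lake: text inside a block that is not parsed as code. */
final class LakeNode implements BodyElement {
    private final String water;
    LakeNode(String water) { this.water = water; }
    public String text() { return water; }
}

final class ClassBodySketch {
    /** Toy classifier (assumption): a line holding only "..." is a lake, anything else is code. */
    static List<BodyElement> split(String body) {
        List<BodyElement> elements = new ArrayList<>();
        for (String line : body.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.equals("...")) {
                elements.add(new LakeNode(trimmed));
            } else {
                elements.add(new MemberNode(trimmed));
            }
        }
        return elements;
    }
}
```

Because lakes and members share the `BodyElement` type, a body such as the one in Listing 6.5 can be stored as a single ordered list in which the two ellipses appear as `LakeNode` entries alongside the field declaration.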

Figure 6.2. A Stack Overflow discussion with multilingual contents
