RECOMENDACIONES - FACULTAD DE INGENIERÍA Y ARQUITECTURA

Figure 7.4 shows the architecture of the Stack Overflow retagger. The right side depicts the plugin installed on the Chrome web browser, while the left side depicts the StORMeD parsing service. The Chrome plugin is composed of two components. The first one, the Contents Extractor, is activated when the visited page matches the Stack Overflow domain. The extractor navigates the DOM of the web page and selects the HTML element enclosing the question, the answers, and their related comments, while excluding the top and side bars, which contain irrelevant contents. The extracted contents are sent to the StORMeD parsing service, where the Multi-lingual Island Parser parses the HTML and creates an H-AST model of the contents.

[Figure 7.4 diagram; labels: Multi-lingual Island Parser, Contents Extractor, H-AST Transformer, StORMeD Service, Chrome Browser, Contents Injector, StackOverflow Retagger, Raw HTML, Retagged HTML]

Figure 7.4. The Stack Overflow Retagger Architecture.

With the H-AST model, it is possible to visit each part of the HTML (i.e., the DOM) and perform a second analysis to understand whether the textual fragments in the HTML actually contain valid code elements. The visit is performed by the H-AST Transformer, which excludes all the branches underneath a <code> tag and analyzes all the free text encountered during the visit.

Listing 7.1. Example of HTML Tagging

<p>
Some tagged
<code>aMethod()</code>
and untagged code anotherMethod()
</p>

Listing 7.1 shows an example of a paragraph with both tagged and untagged code. In this case, the H-AST Transformer excludes the <code> tag from the visit of the DOM, thus ignoring the text fragment “aMethod()”, and visits the two remaining text fragments, “Some tagged” and “and untagged code anotherMethod()”. To understand whether the free text contains untagged code elements, the H-AST Transformer renders the HTML escapes (e.g., &lt;) to obtain plain text, and runs the multi-lingual island parser to identify code elements. If the parser returns some element, the corresponding text is enclosed in <code> tags.

Listing 7.2. Example of HTML Tagging

<p>
Some tagged
<code>aMethod()</code>
and untagged code
<code>anotherMethod()</code>
</p>

Listing 7.2 shows the final transformed HTML. The text fragment “and untagged code anotherMethod()” is transformed so that the part anotherMethod() is enclosed in <code> tags, while the rest of the H-AST remains untouched. Once the transformation of the H-AST is completed, the retagged HTML is sent to the Contents Injector, which substitutes the original HTML in the web browser and adds the StORMeD logo to the discussion’s title.
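The transformation from Listing 7.1 to Listing 7.2 can be approximated with a minimal Python sketch. It is not the StORMeD implementation: the names `Retagger`, `retag`, and `CODE_PATTERN` are hypothetical, and a simple regular expression that only recognizes call-like tokens stands in for the multi-lingual island parser. The sketch walks the HTML, copies everything underneath a <code> tag unchanged, renders the HTML escapes of the free text, and wraps each detected code element in <code> tags. Comments and other non-element markup are dropped for brevity.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical stand-in for the multi-lingual island parser: it only
# recognizes (possibly dotted) identifiers followed by an argument list
# without nested parentheses, e.g. anotherMethod().
CODE_PATTERN = re.compile(r"\b[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*\([^()]*\)")


class Retagger(HTMLParser):
    """Wraps untagged code found in free text in <code> tags."""

    def __init__(self):
        # convert_charrefs=True renders escapes such as &lt; into plain
        # text before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.out = []
        self.code_depth = 0  # greater than 0 while inside a <code> branch

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.code_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "code" and self.code_depth > 0:
            self.code_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.code_depth > 0:
            # Already tagged: skip the analysis and copy the text through.
            self.out.append(escape(data, quote=False))
            return
        pos = 0
        for match in CODE_PATTERN.finditer(data):
            # Free text before the code element is re-escaped as is.
            self.out.append(escape(data[pos:match.start()], quote=False))
            code = escape(match.group(), quote=False)
            self.out.append(f"<code>{code}</code>")
            pos = match.end()
        self.out.append(escape(data[pos:], quote=False))


def retag(html_text):
    """Returns html_text with untagged code elements enclosed in <code>."""
    parser = Retagger()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling `retag` on the paragraph of Listing 7.1 written on a single line leaves `aMethod()` untouched and encloses `anotherMethod()` in <code> tags. A real island parser would recognize far more constructs than this pattern does.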


7.4 Conclusions

We presented StORMeD, a dataset and service that models Stack Overflow posts by building an H-AST for each discussion in a publicly available data dump. Our dataset enables the navigation of the contents of a discussion by differentiating among Java code, XML, JSON, stack traces, and natural language fragments. We described a ready-made meta-information model that describes and leverages the heterogeneity of Stack Overflow. The meta-information model describes several aspects of the information concerning code and text, like term frequency vectors, readability indexes, and code constructs either mentioned or standalone.

We conducted an exploratory study to discover usages of sun.misc.Unsafe, showing how the StORMeD dataset can be reused without having to analyze a dataset of considerable size like Stack Overflow from scratch. We discussed how the aggregation of metadata concerning the community (e.g., reputation) in StORMeD can be harnessed as well to discover that sun.misc.Unsafe catches the attention of the most reputable and experienced users on Stack Overflow.

Last but not least, we also provided another proof of concept by building a Stack Overflow retagger that automatically sanitizes untagged code elements in the narrative. The approach followed in this second application could potentially be integrated as a tool in the Stack Overflow pipeline to automatically tag posts without requiring human intervention, or as a helper tool to sanitize posts at creation time.

Reflections

This chapter presented a set of applications and analyses that can be built on top of the multi-lingual island parser and the H-AST model described in Chapter 6. The usefulness of the H-AST model, and of the resulting StORMeD dataset and service, is demonstrated by the applications and analyses themselves. The whole chapter can be summarized in a single word: modeling.

In Chapter 6 we started a first phase of low-level modeling by devising the concept of H-AST, which allowed us to preserve the structure of the contents. The StORMeD dataset described in Section 7.1 would not be possible without such low-level modeling. The meta-information model of StORMeD is implicitly built on top of the H-AST and, in turn, provides an additional layer of modeling abstraction, focusing on the information. The analysis performed in Section 7.2 and the retagging tool described in Section 7.3 both take advantage of this two-sided modeling phase, which makes it possible to (1) analyze information without having to recompute data, and (2) reshape the contents of a development artifact like a Stack Overflow discussion on the fly.

In the next chapter we take advantage of the multi-lingual island parser and its H-AST model to model the information of artifacts whose primary nature is not textual. We take the first steps towards the definition of H-RSSE by cross-recommending items of heterogeneous nature, like Stack Overflow discussions and YouTube videos, in the same application.

8 Extracting Relevant Fragments from Software Development Video Tutorials

The approaches and applications described in the previous chapters focus on textual artifacts like Stack Overflow discussions. Even though most of the artifacts perused by developers are of textual nature (e.g., bug reports, development emails), the knowledge needed by developers to understand a concept can also be acquired from other types of resources.

Video tutorials are a prominent example: a new and emerging source of information that can be effective in providing a general and thorough introduction to a new technology, while offering a learning perspective different from and complementary to that offered by traditional, text-based sources of information [MSB15].

Despite these benefits, there is still limited support for helping developers to find the relevant information they require within a video. In many cases, video tutorials are lengthy, and lack an index to allow finding specific fragments of interest.

In this chapter we present CodeTube, an approach to leverage the information found in video tutorials and other online resources. Given a textual query (e.g., “implementing an Android listener”) and the type(s) of video tutorial a developer is interested in (e.g., “theoretical concepts”, “code implementation”, “working environment setup”), CodeTube recommends video tutorial fragments relevant to the query and to the specific developer’s needs, and complements the recommended video tutorial fragments with related Stack Overflow discussions.

Structure of the Chapter

Section 8.1 reports the design and results of a study we ran with the aim of identifying categories of development video tutorial fragments and investigating how video tutorials are composed. Section 8.2 details CodeTube, while Section 8.3, Section 8.4, and Section 8.5 describe and report the results of the three evaluations. Threats to validity are discussed in Section 8.6, while Section 8.7 concludes the chapter.

8.1 Investigating the Structure of Video Tutorials

Previous research on development video tutorials [MSB15] investigated the motivation and purpose of the whole tutorial, rather than looking deeper at its structure and content. Even if not explicitly stated, a video tutorial has an intrinsic structure embedded in the flow of actions performed by the tutor.

Table 8.1. Participants’ Occupation.

Occupation                        Total      %
Faculty                               1     2%
PhD Student                           3     7%
Master Student                        4    10%
Undergraduate Student                31    76%
Professional Software Developer       2     5%
Total                                41   100%

When devising an automated approach to analyze, fragment, classify, and index video tutorials, understanding the aforementioned structure of the original video is essential to provide, for example, advanced searching features.

The goal of this study is to understand the typical parts/sections that compose a software development video tutorial (e.g., setting up the IDE, writing code). The context consists of objects, i.e., 150 video tutorials collected from YouTube, and participants, i.e., 41 computer science students/professors and professional developers, who manually tagged the different parts of the tutorials (e.g., “from 1:00 to 3:30 the tutorial shows how to set up the IDE”).
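To make the shape of such a manual tag concrete, the following minimal Python sketch models one tagged fragment. It is only an illustration: the annotation format is an assumption modeled on the example above, and the names `TaggedFragment`, `parse_annotation`, and `ANNOTATION` are hypothetical, not part of the study's tooling.

```python
import re
from dataclasses import dataclass

# Assumed annotation format, modeled on the example in the text:
# "from M:SS to M:SS <free-text description>".
ANNOTATION = re.compile(r"from (\d+):(\d{2}) to (\d+):(\d{2}) (.+)")


@dataclass
class TaggedFragment:
    start: int  # seconds from the beginning of the video
    end: int  # seconds from the beginning of the video
    description: str

    @property
    def duration(self):
        """Length of the fragment in seconds."""
        return self.end - self.start


def parse_annotation(text):
    """Turns one textual tag into a TaggedFragment."""
    match = ANNOTATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized annotation: {text!r}")
    m1, s1, m2, s2, description = match.groups()
    start = int(m1) * 60 + int(s1)
    end = int(m2) * 60 + int(s2)
    if end <= start:
        raise ValueError("a fragment must end after it starts")
    return TaggedFragment(start, end, description)
```

Storing the boundaries in seconds keeps fragments of the same video easy to sort and compare, which is what an index over tutorial parts would need.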
