for the full review copying.
3.2.2 Mobile app page metadata sourcing
A typical mobile app page contains the metadata of the app. Figure 6 shows the metadata that this web sourcing code mines.
Once the target data have been decided, a developer tool is needed to locate the tags that hold them. When this code was written, between 2016 and 2018, Firebug was used to obtain the XPath of page elements. By the time this thesis was submitted, Firebug had been deprecated. Firefox and Chrome now both ship developer tools with functions similar to Firebug's. In Firefox, it is “Inspector” under “Web Developer”. In Chrome, it is “Developer Tools” under “More Tools”. The developer tools in these two browsers are useful not only for finding the XPath of an element but also for copying it.
They both use the same icon, highlighted in Figure 7.
Figure 6 Metadata being mined on a typical mobile app page
Figure 7 Icon for the tool locating webpage elements
However, the XPath values returned by the copy functions of these developer tools are absolute, and are therefore not always useful, because the page structure differs from one app page to the next. In most cases the developer tools help to locate the right elements; the users of the system then need to analyse the XPath and derive relative XPath values that allow the code to find the elements across all app pages.
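The difference between a position-based path and a relative path can be illustrated with a small sketch. It is not part of the sourcing code: it uses Python's standard xml.etree.ElementTree module (which supports only a subset of XPath and needs paths written relative to the root element) in place of Selenium, so that it runs without a browser. The two page fragments and the banner element are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical app pages: the second has an extra banner <div>
# before the title, so the title sits at a different position.
PAGE_A = """<html><body>
  <div><h1 class="AHFaub">App One</h1></div>
</body></html>"""

PAGE_B = """<html><body>
  <div class="banner">Sale</div>
  <div><h1 class="AHFaub">App Two</h1></div>
</body></html>"""

# Position-based path, similar in spirit to what a copy function returns.
ABSOLUTE = "./body/div[1]/h1"
# Relative path keyed on the element's class attribute.
RELATIVE = ".//h1[@class='AHFaub']"


def title(page_source, path):
    """Return the text of the first element matching path, or None."""
    root = ET.fromstring(page_source)
    element = root.find(path)
    return element.text if element is not None else None


absolute_results = [title(page, ABSOLUTE) for page in (PAGE_A, PAGE_B)]
relative_results = [title(page, RELATIVE) for page in (PAGE_A, PAGE_B)]
print(absolute_results)
print(relative_results)
```

On the second page the position-based path lands on the banner and finds no title, whereas the relative path finds the title on both pages.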
Having decided the relative XPath values, elements can be found by groups or individually through the following two methods: find_elements_by_xpath() and find_element_by_xpath(). The former returns all elements that match the XPath. The latter method will return the first element that matches the condition.
Code Snippet 4 Getting elements by group or individually
apps = brw.find_elements_by_xpath('//div[@class="Vpfmgd"]')
price = brw.find_element_by_xpath('//span[@class="oocvOe"]//button').text
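The contrast between the two lookups can be tried without a browser. The sketch below is an analogy only: it uses find() and findall() from the standard xml.etree.ElementTree module as stand-ins for find_element_by_xpath() and find_elements_by_xpath(), and the page fragment is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing page; the class name reuses the one in the text.
PAGE = """<html><body>
  <div class="Vpfmgd">App One</div>
  <div class="Vpfmgd">App Two</div>
  <div class="Vpfmgd">App Three</div>
</body></html>"""

root = ET.fromstring(PAGE)
path = ".//div[@class='Vpfmgd']"

all_matches = root.findall(path)   # every match, like find_elements_by_xpath
first_match = root.find(path)      # first match only, like find_element_by_xpath

names = [element.text for element in all_matches]
print(names)
print(first_match.text)
```

One behavioural difference is worth keeping in mind: when nothing matches, ElementTree's find() returns None, whereas Selenium's single-element lookup raises NoSuchElementException and its multi-element lookup returns an empty list.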
Table 12 lists the relative XPath values of typical metadata on a mobile app page that were valid on 27th December 2019.
Table 12 XPath values of typical metadata on 27th December 2019
Metadata                  XPath
Mobile app name           //h1[@class="AHFaub"]
Developer Company         //div[@class="jdjqLd"]//span[1]
Number of Ratings         //span[@class="AYi5wd TBRnV"]/span[1]
Pricing                   //span[@class="oocvOe"]//button
Average Stars             //div[@class="BHMmbe"]
Updated Date              //div[@class="JHTxhe IQ1z0d"]//*[text()="Updated"]/following-sibling::span[@class="htlgb"]
Size                      //div[@class="JHTxhe IQ1z0d"]//*[text()="Size"]/following-sibling::span[@class="htlgb"]
Number of Installs        //div[@class="JHTxhe IQ1z0d"]//*[text()="Installs"]/following-sibling::span[@class="htlgb"]
Current Version           //div[@class="JHTxhe IQ1z0d"]//*[text()="Current Version"]/following-sibling::span[@class="htlgb"]
Required Android Version  //div[@class="JHTxhe IQ1z0d"]//*[text()="Requires Android"]/following-sibling::span[@class="htlgb"]
Content Rating            //div[@class="JHTxhe IQ1z0d"]//*[text()="Content Rating"]/following-sibling::span[@class="htlgb"]//span[@class="htlgb"]/div[1]
In-app Products Pricing   //div[@class="JHTxhe IQ1z0d"]//*[text()="In-app Products"]/following-sibling::span
Developer Website         //div[@class="JHTxhe IQ1z0d"]//*[text()="Developer"]/following-sibling::span[@class="htlgb"]//a[starts-with(@href, "http")]
Developer Email           //div[@class="JHTxhe IQ1z0d"]//*[text()="Developer"]/following-sibling::span[@class="htlgb"]//a[starts-with(@href, "mailto:")]
This component uses the webdriver module from Selenium. Selenium provides a number of methods to locate elements in a webpage: by id, name, xpath, link_text, partial_link_text, tag_name, class_name, and css_selector. Of these, only xpath and css_selector can currently be copied directly from the developer tools of both Firefox and Chrome. A copied css_selector is less consistent across web pages than an XPath, because it consists mainly of CSS element styles followed by the positions of the target element within its lists of siblings. As the css_selector example below indicates, such a selector is very difficult to keep stable across webpages:
li.card-outer:nth-child(1) > a:nth-child(1) > child(1) > child(1) > div:nth-child(1)
If the users of this system have not yet created the database as defined in the first step of Appendix B, Steps to run this framework, this is the latest point at which to do so, because the sourcing code for the mobile app page writes the metadata to a number of text files and to the database at the same time.
The input of this piece of code (C.3) is a list of apps that have passed the inclusion criteria in the previous step. The code goes through each of the apps in turn and extracts the same metadata fields. The outputs of this code are a number of text files, named after the apps, and the populated “apps” table in the database.
The input file of the app list holds three pieces of information for each app: the serial number “ID”, the “app name”, and the “link” to the app page. Each line represents one app. The simplest way to produce this input file is to copy the list of mobile apps that have passed the inclusion criteria from the spreadsheet in the previous step, keeping only these three columns. The figure below shows an example of the resulting input text file that the author used in the third dataset.
Figure 8 An example of the input file containing a list of apps
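Reading such a file can be sketched as follows. This is an illustration rather than the code in C.3: the tab delimiter is an assumption (the separator should be adjusted to match the actual file), and the app names and links in the sample are hypothetical.

```python
import csv
import io

# Hypothetical sample in the three-column layout described above.
# Assumption: columns are tab-separated; adjust the delimiter to match the file.
SAMPLE = (
    "1\tExample Theory Test\thttps://play.google.com/store/apps/details?id=com.example.one\n"
    "\n"
    "2\tAnother Theory App\thttps://play.google.com/store/apps/details?id=com.example.two\n"
)


def read_app_list(handle, delimiter="\t"):
    """Return a list of (app_id, app_name, link) tuples, one per valid line."""
    apps = []
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) != 3:
            continue  # skip blank lines and rows with missing or extra columns
        app_id, app_name, link = (field.strip() for field in row)
        apps.append((app_id, app_name, link))
    return apps


apps = read_app_list(io.StringIO(SAMPLE))
for app_id, app_name, link in apps:
    print(app_id, app_name, link)
```

With a real file, the io.StringIO object would be replaced by an opened file handle, for example open(path, newline="", encoding="utf-8").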
Although some data types, such as numbers and strings, are converted automatically when values pass from Python to MySQL, this does not apply to a date that has been scraped as display text. The code in Code Snippet 5 shows an example of reading the “updated date” string by XPath and converting it into the “date” type for MySQL.
The XPath passed to the find_element_by_xpath method in Code Snippet 5 is a relative XPath value. Firstly, it looks for a div element with the class name “X”. Secondly, it looks for any descendant of that div element, regardless of its type, whose text value is “Updated”. Thirdly, among the following siblings of this descendant, it looks for a span with the class name “Y”. The text value of this span is the target value, namely the “updated date” from the app page.
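The three steps just described can also be traced by hand. The sketch below does so with the standard xml.etree.ElementTree module, which has no following-sibling axis, so the sibling step is written out as a loop. The helper value_after_label and the page fragment are hypothetical; the actual code lets the browser evaluate the XPath directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an app page; "X" and "Y" stand in for the class names.
PAGE = """<html><body>
  <div class="X">
    <div><div>Updated</div><span class="Y">August 15, 2019</span></div>
    <div><div>Size</div><span class="Y">12M</span></div>
  </div>
</body></html>"""


def value_after_label(root, label):
    """Emulate //div[@class="X"]//*[text()=label]/following-sibling::span[@class="Y"]."""
    # Step 1: every div with class "X".
    for container in root.iter("div"):
        if container.get("class") != "X":
            continue
        # Step 2: any descendant whose text equals the label. Walking through
        # each parent gives access to the candidate's siblings for step 3.
        for parent in container.iter():
            children = list(parent)
            for index, child in enumerate(children):
                if child.text != label:
                    continue
                # Step 3: the first later sibling that is a span with class "Y".
                for sibling in children[index + 1:]:
                    if sibling.tag == "span" and sibling.get("class") == "Y":
                        return sibling.text
    return None


root = ET.fromstring(PAGE)
updated = value_after_label(root, "Updated")
size = value_after_label(root, "Size")
print(updated, size)
```

Because the lookup is keyed on the visible label rather than on position, the same function serves every row of the block, which is why most entries in Table 12 differ only in the label text.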
Because the updated-date string contains a comma, the comma has to be removed, which is achieved by “.replace(",", "")”.
The next line of code converts the string value into a date value, because “updatedOnDate” in the “apps” table is defined as the MySQL type “date”. The second parameter of the “strptime” method is the format of the source date; here it tells the method to interpret the first parameter as “month day year”.
It is worth mentioning that '%B' matches a full month name, such as December, and '%b' matches an abbreviated month name, such as Dec. '%Y' matches a four-digit year, whereas '%y' matches a two-digit year. The second parameter defines exactly how the first parameter should be interpreted, including spaces and punctuation. As the preceding line of code produces a string in the format “August 15 2019”, the second parameter has to be '%B %d %Y'. If the preceding line of code did not remove the comma from “August 15, 2019”, the second parameter would have to be '%B %d, %Y'.
Code Snippet 5 Getting “update date” by XPath and converting to “date” type for MySQL
from datetime import datetime
## Omitted code not a part of this snippet, see the list of code C.3
updatedDate = \
    brw.find_element_by_xpath('//div[@class="X"]//*[text()="Updated"]/following-sibling::span[@class="Y"]').text.replace(",", "")
updatedOnDate = datetime.strptime(updatedDate, '%B %d %Y')
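The effect of the format string can be checked in isolation, without a browser. The following sketch uses a hard-coded sample string in place of the scraped text and assumes an English locale, since '%B' and '%b' match month names in the current locale.

```python
from datetime import datetime

raw = "August 15, 2019"   # sample text as it might appear on the app page

# Option 1: strip the comma first, then parse with a comma-free format.
without_comma = raw.replace(",", "")
parsed_a = datetime.strptime(without_comma, "%B %d %Y")

# Option 2: leave the comma in and include it in the format string.
parsed_b = datetime.strptime(raw, "%B %d, %Y")

# An abbreviated month and a two-digit year need %b and %y instead.
parsed_c = datetime.strptime("Aug 15 19", "%b %d %y")

# A MySQL DATE value is written in the form YYYY-MM-DD.
iso_date = parsed_a.date().isoformat()
print(iso_date)

# A format that does not match the string raises ValueError.
try:
    datetime.strptime(raw, "%B %d %Y")
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```

All three parses yield the same date. The final check shows that a mismatch between the string and the format fails loudly instead of producing a wrong date.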
Code Snippet 6 Outputting metadata to a text file
## Omitted code not a part of this snippet, see the list of code C.3
outputFilename = 'GooglePlayDrivingTheoryReviews/00'+appId+'_'+data[1]+'.txt'
url = data[2]
brw.get(url)
outputFile = open(outputFilename,'w')
outputFile.write('\n'+url+'\n'+'\n')
appName = brw.find_element_by_xpath('//h1[@class="AHFaub"]').text
outputFile.write("App Name: " + '\n')
outputFile.write(appName + '\n' + '\n')
## Omitted code not a part of this snippet, see the list of code C.3
outputFile.close()
The code in Code Snippet 6 writes the metadata, such as appName, into a series of text files whose names are formed from the serial number of the app and the app name.
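The same output layout can be produced with a with-block, which closes the file even if a write fails part-way. This is a sketch under assumptions rather than the code in C.3: the helper write_metadata, the temporary directory, and the sample values are hypothetical.

```python
import os
import tempfile


def write_metadata(directory, app_id, app_name, url, fields):
    """Write one app's metadata to '<directory>/00<id>_<name>.txt' and return the path."""
    path = os.path.join(directory, "00" + app_id + "_" + app_name + ".txt")
    # The with-block closes the file even if a write raises an exception.
    with open(path, "w", encoding="utf-8") as output_file:
        output_file.write("\n" + url + "\n\n")
        for label, value in fields.items():
            output_file.write(label + ": \n")
            output_file.write(value + "\n\n")
    return path


# Hypothetical values, for illustration only.
with tempfile.TemporaryDirectory() as directory:
    path = write_metadata(
        directory,
        "1",
        "Example Theory Test",
        "https://play.google.com/store/apps/details?id=com.example.one",
        {"App Name": "Example Theory Test", "Size": "12M"},
    )
    with open(path, encoding="utf-8") as handle:
        written = handle.read()
    file_name = os.path.basename(path)

print(file_name)
```

Passing the fields as a dictionary keeps the label and value writes in one loop instead of repeating a pair of write calls per field.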
The code in Code Snippet 7 inserts data from Python 3 into a MySQL database. The “add_app” assignment defines the SQL statement. The “data_app” assignment defines, in a dictionary format, the mappings between the metadata and the fields in the “apps” table. The keys of the dictionary are the placeholder names used in the SQL statement, which here match the field names, and the values are the extracted metadata.
Code Snippet 7 Outputting metadata to the database
import pymysql
myConnection = \
    pymysql.connect(host=hostname,user=username,password=password,db=database)
cursor = myConnection.cursor()
## Omitted code not a part of this snippet, see the list of code C.3
add_app = ("INSERT INTO apps (appName, ……, link) "
           "VALUES (%(appName)s, ……, %(link)s)")
data_app = {'appName':appName, ……, 'link':url}
cursor.execute(add_app, data_app)
myConnection.commit()
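The dictionary-based insert can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module as a stand-in, so it is an analogy rather than the code in C.3: sqlite3 writes named placeholders as :name where pymysql writes %(name)s, and the three-column table and sample values are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL so the sketch runs anywhere.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical, shortened version of the "apps" table.
cursor.execute("CREATE TABLE apps (appName TEXT, updatedOnDate TEXT, link TEXT)")

# sqlite3 writes named placeholders as :name; pymysql writes them as %(name)s.
add_app = (
    "INSERT INTO apps (appName, updatedOnDate, link) "
    "VALUES (:appName, :updatedOnDate, :link)"
)
data_app = {
    "appName": "Example Theory Test",
    "updatedOnDate": "2019-08-15",
    "link": "https://play.google.com/store/apps/details?id=com.example.one",
}

# The driver binds the values; they are never pasted into the SQL string.
cursor.execute(add_app, data_app)
connection.commit()

rows = cursor.execute("SELECT appName, updatedOnDate FROM apps").fetchall()
print(rows)
connection.close()
```

Keeping the statement and the data separate in this way means that app names containing quotes or other special characters are stored as plain values rather than being interpreted as SQL.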