Algorithm 1 describes how MapTube calculates a “topicality index” for each of its maps. This is used to promote data that is currently in the news, but it could also be used in reverse to identify where information is lacking. The idea that you can detect what you do not know and spot gaps in a body of information is an interesting one. Taking the London maps as an example and using the spatial context of the data, any news stories about London that match none of the maps would hint at missing information.
The first stage of the topicality index algorithm builds a “bag of words” list from all the words contained in the text of the title, description and XML content fields for all of today’s posts from every RSS feed on the feed list. As the content is HTML, all of the tags must be stripped from the data to turn it into plain text before the text is split into a list of words. During this process, any words of fewer than 3 characters are dropped, along with non-alpha characters. The word split is performed on the space, comma, full-stop and newline characters.
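The tokenising step described above can be sketched in Python as follows. This is a minimal illustration rather than the MapTube implementation: the function names `strip_html` and `split_words` are hypothetical, and converting words to lower case is an assumption that the text does not state.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join(parser.parts)


def split_words(text, min_length=3):
    """Split on space, comma, full stop and newline, then filter the tokens."""
    words = []
    for token in re.split(r"[ ,.\n]+", text):
        # Drop non-alpha characters; lower-casing is an assumption.
        word = re.sub(r"[^A-Za-z]", "", token).lower()
        # Discard words of fewer than min_length characters.
        if len(word) >= min_length:
            words.append(word)
    return words


example = "<p>Tube map of <b>London</b>, updated in 2011.</p>"
print(split_words(strip_html(example)))
```

Here “of” and “in” are dropped for being too short, and “2011” disappears because it contains no alphabetic characters.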
Having created a list of words from current news articles, the next stage turns this into a dictionary of unique words and their frequencies. Here, the Porter Stemming algorithm is used so that only the stem of the word is used as the unique key [JW97] [Por80] [Por91]. For example, if the text contained the words “computer”, “computers” and “computing”, then all three are reduced to the stem “comput”. Compared to vector embedding techniques like Mikolov’s “Word2Vec” [Mik+13], stemming is basic but effective. Also, the implementation developed here pre-dates Mikolov’s original paper on word vector embedding by two years.
Finally, the algorithm pseudo-code refers to “spoof” keywords. After implementation on the live server, it was discovered that certain combinations of keywords appearing in the media caused unusual maps to appear at the top of the list. This led to the addition of manual weightings applied to words. For example, the keyword “test” appearing in the media is weighted down in the following section; otherwise it causes lots of test maps to appear on the homepage, where the user probably hasn’t finished uploading the data for them yet.
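To illustrate how stemming merges word variants into a single dictionary key, the sketch below counts words by stem. The function `toy_stem` is a hypothetical stand-in that strips only a few suffixes; it is not the Porter algorithm, and a real implementation would pass a proper Porter stemmer as the `stem` argument instead.

```python
from collections import Counter


def toy_stem(word):
    """Hypothetical stand-in for a Porter stemmer: strips a few suffixes only."""
    for suffix in ("ers", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_stems(words, stem=toy_stem):
    """Build a dictionary of stem -> frequency from a list of words."""
    counts = Counter()
    for word in words:
        counts[stem(word)] += 1
    return counts


words = ["computer", "computers", "computing", "map"]
print(count_stems(words))
```

The three variants of “computer” collapse onto one key, so the dictionary of stems can hold fewer entries than the list of distinct words.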
To complete the topicality calculation, the stemmed words from the media are transformed into probabilities by normalising by total word count. Then they are matched up to a set of stemmed words generated from the title, short description, long description and keywords stored for every MapTube map. This now means that there is a word probability for the media words and a count of word frequency for the words describing MapTube maps. The topicality index can then be calculated for every MapTube map by taking each map’s word list in turn and scoring it against the media word list.
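The normalisation step, together with the manual keyword weightings mentioned earlier, might look like the sketch below. How the weightings are applied is not specified in the text, so treating each one as a multiplier on the word’s probability is an assumption, as are the names `normalise` and `apply_manual_weights`.

```python
def normalise(stem_counts):
    """Turn stem counts into probabilities by dividing by the total count."""
    total = sum(stem_counts.values())
    if total == 0:
        return {}
    return {stem: count / total for stem, count in stem_counts.items()}


def apply_manual_weights(probabilities, manual_weights):
    """Scale selected words by a manual weight.

    Assumption: each weight is a multiplier, and unlisted words keep weight 1.
    """
    return {
        stem: p * manual_weights.get(stem, 1.0)
        for stem, p in probabilities.items()
    }


news = normalise({"flood": 3, "test": 1})
print(news)
print(apply_manual_weights(news, {"test": 0.0}))
```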
\[
T_m = \sum_{i < |W_m|} \frac{W_{mi}\, W_{nj}}{\sum_{j < |W_n|} W_{nj}}
\quad \text{where } W_{mi} \text{ and } W_{nj} \text{ are the same word}
\tag{4.1}
\]

$T_m$: topicality index for map $m$
$W_m$: list of words describing map $m$, e.g. [“this”, “is”, “a”, “map”]
$W_{mi}$: word $i$ in the list, e.g. $i = 3 \implies$ “map” in the example above
$W_n$: list of words in all news articles
$W_{nj}$: word $j$ from the list of words in news articles
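A small worked example may help to make equation 4.1 concrete. The sketch below assumes that both the map words and the news words are held as dictionaries of stem to count; the function name `topicality` and the example data are illustrative only.

```python
def topicality(map_counts, news_counts):
    """Score one map's word counts against the news word counts (eq. 4.1)."""
    total_news = sum(news_counts.values())
    if total_news == 0:
        return 0.0
    score = 0.0
    for word, map_count in map_counts.items():
        # Only words that appear in both lists contribute to the sum.
        score += map_count * news_counts.get(word, 0) / total_news
    return score


map_counts = {"london": 2, "flood": 1}
news_counts = {"london": 3, "flood": 1, "elect": 4}
print(topicality(map_counts, news_counts))  # 2*3/8 + 1*1/8 = 0.875
```

The news word “elect” matches nothing in the map and so contributes only to the total word count in the denominator.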
Although equation 4.1 gives the basic equation, this is calculated for each map in turn using the words from the “title”, “keyword” and “description” fields separately, with the results added together to give the final topicality index as follows:
\[
T_m = T_{m,\mathrm{title}} + T_{m,\mathrm{keywords}} + 0.5\, T_{m,\mathrm{description}}
\]
The reason for the 0.5 weight on the description is that it contains plain text describing the map, with many more words than the title and keywords. The additional weighting was an empirical fix to make this work effectively in practice. The title and keywords have greater weight in describing what the map is about, while the plain text description is less focussed.
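The per-field weighting can be sketched as below, assuming the news words have already been normalised to probabilities. The names `FIELD_WEIGHTS`, `field_score` and `combined_topicality` are hypothetical, and the example values are invented purely to show the arithmetic.

```python
# Weights taken from the text: title and keywords count fully, description half.
FIELD_WEIGHTS = {"title": 1.0, "keywords": 1.0, "description": 0.5}


def field_score(field_counts, news_probabilities):
    """Sum of (word count in the field) * (word probability in the news)."""
    return sum(
        count * news_probabilities.get(word, 0.0)
        for word, count in field_counts.items()
    )


def combined_topicality(map_fields, news_probabilities):
    """Weighted sum of the title, keywords and description scores."""
    return sum(
        weight * field_score(map_fields.get(field, {}), news_probabilities)
        for field, weight in FIELD_WEIGHTS.items()
    )


news_probabilities = {"flood": 0.25, "london": 0.5}
map_fields = {
    "title": {"london": 1},
    "keywords": {"flood": 1, "london": 1},
    "description": {"flood": 2, "river": 3},
}
print(combined_topicality(map_fields, news_probabilities))  # 0.5 + 0.75 + 0.5*0.5 = 1.5
```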
Finally, the topicality index, Tm, for every map is obtained. In addition, algorithm 1 also returns, in TopicalWords, the words and counts for each map that went into this calculation. Not only does this provide an explanation for the topicality index values, but it can also be used as a cross-check against the words from the news media, Wn, which were not matched against any maps. As stated at the start of this section, any media words not matched against any maps suggest a need to locate and add new data, for example, if the word “Spanish” appears and there are no Spanish maps on MapTube. Given the automatic nature of the data mapping system developed for MapTube, one idea was for the system to be extended to find its own data on the Internet.
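The cross-check for missing data could be sketched as follows. The structure assumed for the topical words (a dictionary of map id to the list of matched stems) and the ranking of unmatched words by news probability are assumptions for illustration, as is the function name.

```python
def unmatched_news_words(news_probabilities, topical_words):
    """Return news words that matched no map, most prominent first."""
    matched = set()
    for words in topical_words.values():
        matched.update(words)
    missing = [word for word in news_probabilities if word not in matched]
    # Ranking by probability is an assumption, not stated in the text.
    return sorted(missing, key=lambda w: news_probabilities[w], reverse=True)


news_probabilities = {"london": 0.5, "spanish": 0.3, "flood": 0.2}
topical_words = {1: ["london"], 2: ["flood", "london"]}
print(unmatched_news_words(news_probabilities, topical_words))
```

In this example only “spanish” is returned, which would hint at a gap in the available maps.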
On a related point, the data from the topicality keyword matching can also be reused in another way. Rather than matching the words describing maps to news items, the maps can be related to each other by number of shared words and word probability. This is future work, but a form of graph search could be constructed to find similar maps based on concept, rather like Internet recommendation systems based on association rules.
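As this is future work, no design is given in the text. Purely as an illustration of the idea, the sketch below ranks maps by the number of stemmed words they share with a chosen map. The function names and the scoring by shared-word count alone are assumptions; a fuller version might also weight each shared word by its probability.

```python
def shared_words(words_a, words_b):
    """Stems that appear in the descriptions of both maps."""
    return set(words_a) & set(words_b)


def most_similar(map_id, map_words, top_n=3):
    """Rank other maps by how many stems they share with map_id."""
    target = map_words[map_id]
    scores = []
    for other_id, words in map_words.items():
        if other_id == map_id:
            continue
        count = len(shared_words(target, words))
        if count > 0:
            scores.append((count, other_id))
    # Highest shared-word count first; ties broken by map id.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scores[:top_n]]


map_words = {
    1: ["london", "flood", "river"],
    2: ["london", "flood"],
    3: ["london", "tube"],
    4: ["spain"],
}
print(most_similar(1, map_words))
```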
Algorithm 1 Topicality Index
Require: A list of RSS sites: e.g. BBC, Guardian, CASA Blogs
RSS_Keywords<string, int> ← empty hash
for all RSS_Site in RSS_Site_List do
    for post in RSS_Site do
        for xml = (title, keywords, description) in post do
            text ← html stripped from xml
            words ← split_words(text)
            for word in words do
                if not(is_stopword(word)) and len(word) > 2 then
                    RSS_Keywords[word] ← RSS_Keywords[word] + 1
                end if
            end for
        end for
    end for
end for
Post-Cond: RSS_Keywords contains a list of keywords from the RSS feed, along with a count of the number of times each word appears
StemWordCounts<stem word, int> ← empty hash
for all word, count in RSS_Keywords do
    Create StemWord from word using Porter Algorithm
    Note: two words can result in the same stem word, so StemWordCounts may hold fewer entries than RSS_Keywords
    if StemWordCounts contains StemWord then
        StemWordCounts[StemWord] ← StemWordCounts[StemWord] + count
    else
        StemWordCounts[StemWord] ← count
    end if
end for
TotalCount ← sum of all word counts in StemWordCounts
(*Normalisation step*)
StemRSSKeywords ← StemWordCounts / TotalCount
Add spoof keywords and counts to StemRSSKeywords
MatchedKeywords ← hashtable matching mapid, title, keywords, desc to StemRSSKeywords
(*Topicality calculation*)
Topicality<int, float> ← every map id from MapTube (int) with a zero value (float)
TopicalWords<int, List of StemWords> ← every map id from MapTube (int) with an empty word list
for all mapid in Maps do
    T_title ← StemRSSKeywords ∗ MatchedKeywords(mapid, title).count ∀ word matches
    T_keywords ← StemRSSKeywords ∗ MatchedKeywords(mapid, keywords).count ∀ word matches
    T_desc ← StemRSSKeywords ∗ MatchedKeywords(mapid, desc).count ∀ word matches
    Topicality(mapid) ← T_title + T_keywords + 0.5 T_desc
    TopicalWords(mapid) ← all matching words
end for