We apply this function to our sample messages with
sampleSplit = lapply(sampleEmail, splitMessage)
We have found the body of the message, and we next tackle the removal of any attachments.
3.5.2 Removing Attachments from the Message Body
We saw in Section 3.2 that when an email message has attachments, the MIME type is multipart and the Content-Type field provides a boundary string that can be used to locate the attachments. In the example provided there, the Content-Type field is
Content-Type: MULTIPART/Mixed;
BOUNDARY="_===669732====calmail-me.berkeley.edu===_"
It seems our first step is to find the Content-Type key and use its value to determine whether or not an attachment is present. If so, then we find the boundary string and use this string to locate the attachments.
We work with the first message in our sample and use the grep() function to locate Content-Type in the header with
header = sampleSplit[[1]]$header
grep("Content-Type", header)
[1] 46
We have successfully found the Content-Type key in the 46th element of header. We next use this Content-Type’s value to determine whether the message has any attachments, i.e., we check whether or not the Content-Type is "multipart". When we examine the messages in sampleEmail we see that the MIME type is not consistently capitalized so we convert header[46] to lower case before searching for the term multipart. We again use grep() to do this, i.e.,
grep("multi", tolower(header[46]))
integer(0)
It appears this message has no attachments. We double check with
header[46]
[1] "Content-Type: text/plain; charset=us-ascii"
Indeed, it has only a plain text body.
We can apply this call to grep() to all of the headers in the list of sample messages with
headerList = lapply(sampleSplit, function(msg) msg$header)
CTloc = sapply(headerList, grep, pattern = "Content-Type")
CTloc
[[1]]
[1] 46
...
[[6]]
[1] 54
[[7]]
integer(0)
...
The sapply() did not return a vector as expected because the seventh element has no Content-Type key. To remedy this, we can check for a missing Content-Type field and return 0 or NA in this case so that we have a numeric vector to work with, i.e.,
sapply(headerList, function(header) {
  CTloc = grep("Content-Type", header)
  if (length(CTloc) == 0) return(NA)
  CTloc
})
[1] 46 45 42 30 44 54 NA 21 17 52 31 52 52 27 31
Finally, we add the check for a multipart MIME type with
hasAttach = sapply(headerList, function(header) {
  CTloc = grep("Content-Type", header)
  if (length(CTloc) == 0) return(FALSE)
  grepl("multi", tolower(header[CTloc]))
})
hasAttach
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
[10] TRUE TRUE TRUE TRUE TRUE TRUE
Note that grepl() returns a logical indicating whether there was a match or not. Several of the messages in our sample have attachments.
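As a small self-contained illustration of this check, the following sketch applies the same logic to three made-up headers. The list toyHeaders and its contents are hypothetical and are not part of the sample data.

```r
# Hypothetical toy headers, each a character vector of header lines
toyHeaders = list(
  c("From: a@example.com", "Content-Type: text/plain; charset=us-ascii"),
  c("From: b@example.com", "Content-Type: MULTIPART/Mixed;"),
  c("From: c@example.com", "Subject: no content type here")
)

toyHasAttach = sapply(toyHeaders, function(header) {
  CTloc = grep("Content-Type", header)
  # No Content-Type key: treat the message as having no attachment
  if (length(CTloc) == 0) return(FALSE)
  # Convert to lower case because capitalization of the MIME type varies
  grepl("multi", tolower(header[CTloc]))
})
toyHasAttach
```

Only the second toy header should be flagged, since its MIME type is multipart once the case is normalized.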
We have used grep() and grepl() to search for specific literals in our header strings. For example, grepl("multi", header) searches in each element of the character vector header for an m followed by a u then by an l and so on. This sequence of 5 literals can appear anywhere in the string; grep() returns the indices of the elements where it found a match, and grepl() returns TRUE for those elements and FALSE for the others. This is a very simple example of pattern matching using regular expressions. The first argument to grep() is a regular expression and the second is the vector of strings in which to search. Regular expression matching is far more powerful and flexible than this simple example demonstrates. We use more of the features of regular expressions next as we search for the boundary string.
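To make the difference between the two functions concrete, here is a minimal sketch on a made-up character vector (the vector x is an assumption for illustration only). It also shows why we convert to lower case before matching.

```r
# Made-up strings for illustration; not taken from the sample messages
x = c("Content-Type: MULTIPART/Mixed;",
      "Content-Type: text/plain",
      "multipart/alternative")

idx = grep("multi", x)                # indices of the elements that match
found = grepl("multi", x)             # one TRUE/FALSE per element
idxLower = grep("multi", tolower(x))  # match after converting to lower case
```

Matching is case sensitive by default, so the first element is found only after the call to tolower().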
We need to extract the boundary string from those messages that have attachments in order to locate and remove the attachments. There are several ways to extract the boundary string from the Content-Type value. We leave a string manipulation approach to the exercises and use regular expressions and the sub() function here. Essentially we want to discard all of the string except for the boundary so our goal is to create a regular expression that identifies that part of the string which is the boundary. We locate the boundary string in our sixth message as follows:
header = sampleSplit[[6]]$header
boundaryIdx = grep("boundary=", header)
header[boundaryIdx]
[1] " boundary=\"==_Exmh_-1317289252P\";"
The boundary string begins after ‘boundary="’ and ends before the ‘;’ character.
In pseudocode, we want to create a pattern like the following:
any characters followed by
boundary="(string we are looking for)";
any characters
The actual pattern we use is ‘.*boundary="(.*)";.*’. This pattern uses many of the special characters available in the regular expression language. The pattern begins with a ‘.’ character, which stands for any literal. It is followed by the ‘*’ quantifier, meaning any number of times so the pattern ‘.*’ matches any number of arbitrary literals. However, these must be followed by the literals ‘boundary=’ and a quotation mark. The boundary value is the string that follows, up to a quotation mark, followed by a semicolon and then any characters. That part of the pattern within the parentheses is our boundary string. That is, the pattern (.*) does not match the literal parentheses, but uses them to group together the characters that match the pattern within them. Note that it matches any characters any number of times, but it must be followed by a quotation mark and a semicolon. The use of the parentheses to identify a sub-pattern gives us access to these matching characters later. They can be referred to using a variable, specifically \\1. In the following call to sub(), we use the contents of this variable as a substitute for the entire string. That is,
sub(".*boundary=\"(.*)\";.*", "\\1", header[boundaryIdx])
[1] "==_Exmh_-1317289252P"
The first argument to sub() is the pattern that we search for in header[boundaryIdx], and the second argument contains the substitution for the matching substring. The sub() function allows us to modify part of the input string. Here we are processing all of it to remove the pieces we do not want. That is, if we have written our first pattern correctly, we match the entire string and replace it with the piece that contains the boundary string.
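The same idea can be tried on a smaller string. The sketch below uses a made-up string, toy, to show the sub-pattern being captured and referred to as \\1. It also shows that sub() hands back its input unchanged when the pattern does not match, which is worth keeping in mind when testing a pattern on new strings.

```r
# Made-up string for illustration
toy = "key=\"value\"; other=stuff"

# The parentheses capture the text between the quotation marks,
# and "\\1" in the replacement refers back to that captured text
extracted = sub(".*key=\"(.*)\";.*", "\\1", toy)

# When the pattern does not match, sub() returns its input unchanged
unchanged = sub(".*missing=\"(.*)\";.*", "\\1", toy)
```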
Although our first application of our pattern successfully extracted the boundary string from the header, pattern matching can be tricky and we want to try it on other strings. For example, we apply the call to sub() to the ninth message in our sample, i.e.,
header2 = headerList[[9]]
boundaryIdx2 = grep("boundary=", header2)
header2[boundaryIdx2]
[1] "Content-Type: multipart/alternative;
boundary=Apple-Mail-2-874629474"
Notice that the boundary string does not appear in quotes and there is no semicolon at the end. Our pattern matching fails, i.e.,
sub('.*boundary="(.*)";.*', "\\1", header2[boundaryIdx2])
[1] "Content-Type: multipart/alternative;
boundary=Apple-Mail-2-874629474"
We have not successfully located the boundary string because of the missing quotation marks and semicolon. Searching for quotation marks is potentially problematic as not all boundaries appear in quotes. If we eliminate quotation marks from the string then we can drop them from our pattern as well. This is a simpler approach than searching for optional quotation marks. We eliminate them with
boundary2 = gsub('"', "", header2[boundaryIdx2])
The substitution string is empty so this is equivalent to eliminating the quotation marks from header2[boundaryIdx2]. Notice that we use the gsub() function, rather than sub(). The “g” stands for global, which means that all occurrences of a quotation mark in the string are found and substituted, rather than only the first occurrence.
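A short sketch on a made-up string (the variable quoted is hypothetical) shows the difference between the two functions.

```r
# Made-up string containing two quotation marks
quoted = "boundary=\"abc\""

firstOnly = sub('"', "", quoted)   # removes only the first quotation mark
allGone = gsub('"', "", quoted)    # removes every quotation mark
```

With sub() the closing quotation mark is left behind, so gsub() is the appropriate choice here.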
We have not yet solved the problem of correctly identifying that portion of the string that contains the boundary information because we have not addressed the case of a Content-Type value that has no semicolon. Let’s change our pattern to make the semicolon optional by adding a ‘?’ after the semicolon in the pattern. Let’s also allow any number of blanks (0 or more) between boundary= and the boundary string, i.e.,
sub(".*boundary= *(.*);?.*", "\\1", boundary2)
[1] "Apple-Mail-2-874629474"
That seems to have done it!
Let’s check that this revised pattern successfully finds the boundary string in our first example. When we do, we find that the pattern no longer finds the boundary string in that message’s Content-Type value. It worked before, but we have broken the pattern matching, i.e.,
boundary = gsub('"', "", header[boundaryIdx])
sub(".*boundary= *(.*);?.*", "\\1", boundary)
[1] "==_Exmh_-1317289252P;"
Our pattern no longer correctly finds the end of the boundary string, but instead includes the semicolon. This is a case of greedy matching. We are allowing any character within our parentheses, including the semicolon, and the semicolon at the end of the string is now optional. We can exclude the semicolon from matching by using [^;] in the expression, which matches all characters except the semicolon. Our revised pattern is
sub(".*boundary= *([^;]*);?.*", "\\1", boundary)
[1] "==_Exmh_-1317289252P"
Now we have again successfully located the boundary string from the sixth message, and when we try our revised regular expression on the ninth message, we find that it still works.
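The effect of greedy matching can be reproduced on made-up strings. In the sketch below, withSemi and noSemi are hypothetical Content-Type fragments, one with text after a semicolon and one with no semicolon at all.

```r
# Made-up strings for illustration
withSemi = "boundary=abc123; charset=us-ascii"
noSemi = "boundary=abc123"

# (.*) may consume the semicolon and everything after it
greedy = sub(".*boundary= *(.*);?.*", "\\1", withSemi)

# ([^;]*) stops at the first semicolon, if there is one
restricted = sub(".*boundary= *([^;]*);?.*", "\\1", withSemi)
restrictedNoSemi = sub(".*boundary= *([^;]*);?.*", "\\1", noSemi)
```

The greedy version keeps the semicolon in its result, whereas the version with [^;] gives the same boundary value for both strings.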
Although we did not initially identify the task of finding the boundary string as a separate function, we can wrap this code into its own function, which we call getBoundary(). The only input required is the header and the function returns the boundary string. We do this with
getBoundary = function(header) {
  boundaryIdx = grep("boundary=", header)
  boundary = gsub('"', "", header[boundaryIdx])
  gsub(".*boundary= *([^;]*);?.*", "\\1", boundary)
}
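As a quick check, we can try the function on a few made-up headers. The headers h1, h2, and h3 below are assumptions for illustration, and the function is repeated so that the sketch runs on its own.

```r
getBoundary = function(header) {
  boundaryIdx = grep("boundary=", header)
  boundary = gsub('"', "", header[boundaryIdx])
  gsub(".*boundary= *([^;]*);?.*", "\\1", boundary)
}

# Hypothetical headers: quoted with semicolon, unquoted, and no boundary
h1 = c("Content-Type: multipart/signed;", " boundary=\"==_Toy_123\";")
h2 = c("Content-Type: multipart/alternative; boundary=Toy-Mail-2")
h3 = c("Content-Type: text/plain; charset=us-ascii")

b1 = getBoundary(h1)
b2 = getBoundary(h2)
b3 = getBoundary(h3)   # no boundary= line, so nothing is returned
```

Note that a header with no boundary yields a character vector of length zero, so code that calls getBoundary() may want to check the length of the result.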
We are now ready to search through the body of the message for attachments. To get a better sense of the format of these bodies and attachments, we examine a few more messages, e.g.,
sampleSplit[[6]]$body
[1] "--==_Exmh_-1317289252P"
[2] "Content-Type: text/plain; charset=us-ascii"
[3] ""
[4] "> From: Chris Garrigues <[email protected]>"
[5] "> Date: Wed, 21 Aug 2002 10:40:39 -0500"
[6] ">"
...
[43] " World War III: The Wrong-Doers Vs. the Evil-Doers."
[44] ""
[45] ""
[46] ""
[47] ""
[48] "--==_Exmh_-1317289252P"
[49] "Content-Type: application/pgp-signature"
[50] ""
[51] "---BEGIN PGP SIGNATURE---"
[52] "Version: GnuPG v1.0.6 (GNU/Linux)"
[53] "Comment: Exmh version 2.2_20000822 06/23/2000"
[54] ""
[55] "iD8DBQE9ZQJ/K9b4h5R0IUIRAiPuAJwL4mUus5whLNQZC8MsDlGpEdK..."
[56] "PcGgN9frLIM+C5Z3vagi2wE="
[57] "=qJoJ"
[58] "---END PGP SIGNATURE---"
[59] ""
[60] "--==_Exmh_-1317289252P--"
[61] ""
[62] ""
[63] ""
[64] "_______________________________________________"
[65] "Exmh-workers mailing list"
[66] "[email protected]"
[67] "https://listman.redhat.com/mailman/listinfo/exmh-workers"
[68] ""
We see that this body contains one attachment, which is a PGP signature, and each body part has its own short header. Also, there are 8 lines following the end of the attachment.
Another message body in our sample appears as
[1] "This is a multi-part message in MIME format."
[2] ""
[3] "---=_NextPart_000_0005_01C26412.7545C1D0"
[4] "Content-Type: text/plain;"
[5] "\tcharset=\"iso-8859-1\""
[6] "Content-Transfer-Encoding: 7bit"
[7] ""
[8] "liberalism"
...
[27] " http://www.english.upenn.edu/~afilreis/50s/schleslib.html"
[28] ""
[29] "---=_NextPart_000_0005_01C26412.7545C1D0"
[30] "Content-Type: application/octet-stream;"
[31] "\tname=\"Liberalism in America.url\""
[32] "Content-Transfer-Encoding: 7bit"
[33] "Content-Disposition: attachment;"
[34] "\tfilename=\"Liberalism in America.url\""
[35] ""
[36] "[DEFAULT]"
[37] "BASEURL=http://www.english.upenn.edu/~afilreis/50s/sch..."
[38] "[InternetShortcut]"
[39] "URL=http://www.english.upenn.edu/~afilreis/50s/schlesl---"
[40] "Modified=E0824ED43364C201DE"
[41] ""
[42] "---=_NextPart_000_0005_01C26412.7545C1D0--"
[43] ""
[44] ""
[45] ""
Here we find that there are a few lines in the body preceding the first boundary string and a few lines after the closing string. Lines 4, 5, and 6 contain header information for the first part of the body, i.e., the message. Lines 30 through 34 are header lines for the attachment.
Also note that there is an empty line between the header information for each portion of the body and the content itself, e.g., lines 7 and 35 are empty. That is, each body part has a structure that mimics the structure of the message with header information separated from the content with a blank line.
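Because each body part mimics the structure of a message, it can be split at its first empty line. The following is a minimal sketch on a made-up body part. The vector part is hypothetical, and the code assumes that the part contains at least one empty line.

```r
# Hypothetical body part: short header, blank line, then content
part = c("Content-Type: text/plain;",
         "\tcharset=\"iso-8859-1\"",
         "Content-Transfer-Encoding: 7bit",
         "",
         "liberalism",
         "",
         "more text")

# The first empty line separates the part's header from its content
splitPoint = match("", part)
partHeader = part[seq_len(splitPoint - 1)]
partContent = part[-seq_len(splitPoint)]
```

Only the first empty line is used as the separator, so later blank lines remain part of the content.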
We examine one more message, the 11th in our sample:
[1] ""
[2] "---090602010909000705010009"
[3] "Content-Type: text/plain; charset=ISO-8859-1; format=flowed"
[4] "Content-Transfer-Encoding: 8bit"
[5] ""
[6] "Geege wrote:"
...
[63] "Check out the pictures."
[64] ""
[65] ""
[66] ""
[67] ""
[68] "---090602010909000705010009--"
[69] ""
[70] ""
Note that this body contains no attachment. There are two occurrences of the boundary string: one at the start of the body and one at the end. That is, there is no boundary string to separate the message text from an attachment. The header information within the body indicates that the format is flowed.
With these examples in hand, we can begin to design a way to extract the attachments from the body of the message. Our investigation has shown that some messages do not have an attachment even though their header indicates that they are supposed to. We also must decide what to do with the lines that appear before the first boundary string and after the closing boundary string. Additionally, we might want to address the situation when the last boundary string is not found. We have not come across such a case yet, but it seems like a reasonable precaution to take. Let’s write our function to do the following.
• Drop the blank lines before the first boundary string.
• Keep the lines following the closing string as part of the first portion of the body and not the attachments.
• Use the last line of the email as the end of the attachment if we find no closing boundary string.
We prepare the last message in our sample with
boundary = getBoundary(headerList[[15]])
body = sampleSplit[[15]]$body
We search in the body for the boundary string preceded by 2 hyphens with
bString = paste("--", boundary, sep = "")
bStringLocs = which(bString == body)
bStringLocs
[1] 2 35
These lines in the body mark the start of each portion of the email. Next, we find the closing boundary with
eString = paste("--", boundary, "--", sep = "")
eStringLoc = which(eString == body)
eStringLoc
[1] 77
We can locate the first part of the message from the body, excluding the attachments, with
msg = body[ (bStringLocs[1] + 1) : (bStringLocs[2] - 1)]
tail(msg)
[1] ">" ">Yuck" "> " ">" "" ""
To add the lines that appear after the last attachment to this part of the message, we do the following
msg = c(msg, body[ (eStringLoc + 1) : length(body) ])
tail(msg)
[1] "" "" "" "" "" ""
It appears we have added several empty lines.
We leave as an exercise the creation of the dropAttach() function. It follows the basic operations explored in this section. However, the special cases described earlier need to be addressed, e.g., when there is no attachment despite the header supplying a MIME type of multipart and a boundary string.
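For readers who want a starting point, one possible shape for such a function is sketched below. This is only a sketch under stated assumptions: the argument names, the toy bodies, and the choice to discard every line before the first boundary string (not only the blank ones) are ours. It keeps the short header of the first body part, as in the steps shown above.

```r
# One hypothetical sketch of dropAttach(); other designs are possible
dropAttach = function(body, boundary) {
  bString = paste("--", boundary, sep = "")
  eString = paste("--", boundary, "--", sep = "")
  bStringLocs = which(bString == body)
  eStringLoc = which(eString == body)

  # No opening boundary found: nothing to remove
  if (length(bStringLocs) == 0) return(body)

  n = length(body)
  # Without a closing boundary, treat the attachment as running to the last line
  if (length(eStringLoc) == 0) eStringLoc = n + 1

  # The first part ends before the second opening boundary if there is one,
  # and otherwise before the closing boundary (multipart but no attachment)
  if (length(bStringLocs) > 1) {
    endFirst = bStringLocs[2] - 1
  } else {
    endFirst = eStringLoc[1] - 1
  }
  msg = body[seq(bStringLocs[1] + 1, length.out = endFirst - bStringLocs[1])]

  # Keep the lines after the closing boundary as part of the message
  if (eStringLoc[1] < n) msg = c(msg, body[(eStringLoc[1] + 1):n])
  msg
}

# Made-up bodies: with an attachment, without one, and with no closing string
toyBody = c("", "--B", "Content-Type: text/plain", "", "hello",
            "--B", "Content-Type: application/octet-stream", "", "DATA",
            "--B--", "footer")
toyNoAttach = c("", "--B", "Content-Type: text/plain", "", "hello", "--B--", "")
toyNoClose = c("--B", "text", "--B", "DATA")

res1 = dropAttach(toyBody, "B")
res2 = dropAttach(toyNoAttach, "B")
res3 = dropAttach(toyNoClose, "B")
```

The three toy bodies exercise the special cases listed earlier: an ordinary attachment, a multipart message with no attachment, and a missing closing boundary string.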
Next we explore how to extract the words from a message.