To each variable that requires correction, the tool applies one of four possible corrections: three for XSS and one for SQLi.
Regarding SQLi, the tool always inserts a call to a MySQL (or MySQLi) string escaping function. The function used is mysqli_real_escape_string when the sensitive sink has a database connection as its first parameter, and mysql_real_escape_string in any other case. Currently, the only sensitive sinks considered that take a connection as their first parameter are mysqli_query, mysqli_real_query and pg_send_query. When a connection is available, the correction consists of the following code: $t = mysqli_real_escape_string($c, $t);, where $c is the name of the connection and $t is the name of a tainted variable (it may also be an access to an array). When a connection is not available, the following correction is inserted instead: $t = mysql_real_escape_string($t);, where $t is once again the name of a tainted variable.
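The selection rule for the SQLi correction can be sketched as a small helper. This is an illustrative sketch only, not the tool's actual code: the function name sqli_correction and its parameters are hypothetical, and it merely builds the text of the line that would be inserted, so no database is needed to run it.

```php
<?php
// Hypothetical helper (not part of the tool): returns the source line that
// would be inserted before a SQLi sensitive sink.
function sqli_correction(string $sink, string $taintedVar, ?string $connVar = null): string
{
    // Sinks that the text lists as taking a connection as their first parameter.
    $sinksWithConnection = ['mysqli_query', 'mysqli_real_query', 'pg_send_query'];

    if ($connVar !== null && in_array($sink, $sinksWithConnection, true)) {
        // Connection available: use the MySQLi escaping function.
        return sprintf('%1$s = mysqli_real_escape_string(%2$s, %1$s);', $taintedVar, $connVar);
    }
    // Any other case: fall back to the legacy escaping function.
    return sprintf('%1$s = mysql_real_escape_string(%1$s);', $taintedVar);
}

echo sqli_correction('mysqli_query', '$t', '$c'), "\n";
echo sqli_correction('mysql_query', '$_GET["id"]'), "\n";
```

Note that mysql_real_escape_string belongs to the legacy mysql extension, which was removed in PHP 7.0, so the second form of the correction only runs on older PHP versions.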
For XSS, the tool has three corrections at its disposal: an HTML-encoding correction, a URL-encoding correction and a special correction for format strings. The format string correction is more complex and is described in detail in Subsection 4.4.1. The HTML-encoding correction consists of a call to the htmlentities function with a tainted variable as its first argument and the ENT_QUOTES flag as its second argument. This correction is applied whenever the other corrections cannot be applied, and it consists of adding the following code: $t = htmlentities($t, ENT_QUOTES);, where $t is the name of a tainted variable (as with SQLi, this can be an access to an array). htmlentities was chosen as the HTML-encoding function for this type of correction because it is widely recommended by the community. In the case of the URL-encoding correction, the tool instead adds the following code: $t = rawurlencode($t);, where $t has the same meaning as before. The URL-encoding correction is applied when the variable is included inside a <script> or <style> tag. rawurlencode was chosen over urlencode as the URL-encoding function for this type of correction because its output never contains the + character, which is a valid operator in JavaScript: urlencode encodes spaces as +, whereas rawurlencode encodes them as %20.
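The effect of the two simple XSS corrections can be seen in a short sketch. The sample values and variable names are made up for illustration; only htmlentities, rawurlencode and urlencode come from the text. Both URL-encoding functions turn a literal + into %2B; they differ in how they encode spaces.

```php
<?php
// Sample tainted value; the content is invented for illustration.
$t = "<a href='x'>1 + 1</a>";

// HTML-encoding correction: ENT_QUOTES also converts single quotes.
$htmlEncoded = htmlentities($t, ENT_QUOTES);

// URL-encoding correction, with urlencode shown for comparison.
$rawEncoded = rawurlencode("1 + 1");   // spaces become %20
$plainEncoded = urlencode("1 + 1");    // spaces become +

echo $htmlEncoded, "\n";   // &lt;a href=&#039;x&#039;&gt;1 + 1&lt;/a&gt;
echo $rawEncoded, "\n";    // 1%20%2B%201
echo $plainEncoded, "\n";  // 1+%2B+1
```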
4.4.1 Format String Correction
The format string correction is applied to tainted variables that are part of a format string in a call to the printf sensitive sink, or in a call to sprintf that is an argument to another XSS sensitive sink. Both of these functions expect to receive a format string (similar to the format strings used in the C language) as their first argument. The format string cannot be corrected using the two previously described XSS corrections because they would encode some of the format specifiers, thus breaking the application's output. For this reason, we developed a way of correcting format strings without breaking their original output. This is the only correction applied by our tool that involves the addition of more than one line of code. Listing 4.7 presents an example of the correction applied to format strings. The correction itself starts on line 6 of the listing and ends on line 20, inclusive.
 1  <?php
 2
 3  $format = $_GET["f"];
 4  $input = $_GET["i"];
 5
 6  $matches = array();
 7  preg_match_all("/(?<=%')./", $format, $matches);
 8
 9  $format = preg_replace("/(?<=%')./", "#", $format);
10  $format = htmlentities($format);
11  $format = preg_replace("/(?<!%)'/", "&#039;", $format);
12
13  $matchIdx = 0;
14  for ($ptr = 2; $ptr < strlen($format); $ptr++) {
15      if ($format[$ptr] == "#" && $format[$ptr - 1] == "'" &&
16              $format[$ptr - 2] == "%") {
17          $format[$ptr] = $matches[0][$matchIdx];
18          $matchIdx++;
19      }
20  }
21
22  printf($format, $input);
Listing 4.7: Example of the correction applied to a format string.
In PHP, out of all characters that can form a format specifier, only the single quote is converted by a call to htmlentities (if the ENT_QUOTES flag is in use). The single quote is used in a format string to specify a character to be used as padding for the argument. The padding character is the one immediately to the right of the single quote. For example, the format specifier %'09s will pad a string with zeroes on the left until it is 9 characters in length. If the original string is 9 or more characters in length, no padding will be added. Because the padding is part of the application's output, it is important that the structure of these format specifiers is not modified by a correction.
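A short sketch illustrates the padding behaviour and why encoding the single quote is harmful. The variable names and sample strings are chosen for illustration only.

```php
<?php
// %'09s: pad with the character after the single quote ('0') up to width 9.
$padded = sprintf("%'09s", "abc");
// A string that is already longer than 9 characters is left unpadded.
$unpadded = sprintf("%'09s", "abcdefghij");

// Encoding with ENT_QUOTES rewrites the quote, so the specifier is no longer valid.
$broken = htmlentities("%'09s", ENT_QUOTES);
// ENT_COMPAT leaves single quotes alone, so the specifier survives.
$intact = htmlentities("%'09s", ENT_COMPAT);

echo $padded, "\n";    // 000000abc
echo $unpadded, "\n";  // abcdefghij
echo $broken, "\n";    // %&#039;09s
echo $intact, "\n";    // %'09s
```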
Taking this into consideration, our correction does the following: In lines 6 and 7, it locates all padding characters and saves them to the $matches array. In line 9, all padding characters are replaced by the # character, to prevent them from being modified next. In line 10, a call to htmlentities is made without any flag, to prevent it from modifying single quotes (all other applicable characters are still converted). In line 11, any single quote that is not part of a format specifier (not preceded by %) is replaced by its equivalent HTML-encoded representation (&#039;). Lastly, in lines 13 to 20, the padding characters that were saved to the $matches array are put back in their place to be part of the output. This process HTML-encodes any applicable character while maintaining the original paddings.
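To trace these steps on a concrete input, the correction can be wrapped in a function. This is a sketch under stated assumptions, not the tool's output: the function name correct_format_string is hypothetical (the tool inserts the statements inline), and ENT_COMPAT is passed explicitly because, starting with PHP 8.1, the default flags of htmlentities include ENT_QUOTES and a call without flags would also encode single quotes.

```php
<?php
// Hypothetical wrapper around the steps of the format string correction.
function correct_format_string(string $format): string
{
    // Save every padding character (the character right after %').
    $matches = array();
    preg_match_all("/(?<=%')./", $format, $matches);

    // Mask the padding characters so that encoding cannot alter them.
    $format = preg_replace("/(?<=%')./", "#", $format);

    // Assumption: ENT_COMPAT is explicit so single quotes are left untouched
    // on every PHP version.
    $format = htmlentities($format, ENT_COMPAT);

    // Encode the single quotes that do not follow a %.
    $format = preg_replace("/(?<!%)'/", "&#039;", $format);

    // Put the saved padding characters back in place.
    $matchIdx = 0;
    for ($ptr = 2; $ptr < strlen($format); $ptr++) {
        if ($format[$ptr] == "#" && $format[$ptr - 1] == "'" &&
                $format[$ptr - 2] == "%") {
            $format[$ptr] = $matches[0][$matchIdx];
            $matchIdx++;
        }
    }
    return $format;
}

$corrected = correct_format_string("<b>%'*9s</b> it's");
echo $corrected, "\n";                  // &lt;b&gt;%'*9s&lt;/b&gt; it&#039;s
echo sprintf($corrected, "abc"), "\n";  // &lt;b&gt;******abc&lt;/b&gt; it&#039;s
```

The markup and the stray single quote are encoded, while the %'*9s specifier still pads the argument with asterisks.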
Three aspects of this correction should be noted: Firstly, this correction uses an HTML-encoding function and thus inherits all limitations of this type of function. Secondly, the use of URL-encoding functions is not possible in this situation because they would encode many more characters (including the ones that form the format specifiers) than HTML-encoding ones. Lastly, in order to prevent the names of the variables created by this correction from interfering with others that already exist in the program, our tool appends a random number to the end of the name of each variable created by this correction. As an example, the $matches array will be named $matches_NNNN in a real correction, where NNNN is a random number (greater than zero) generated by the tool. Despite the use of this technique, it is still possible for the added variables to have the same name as ones that already exist in the program. We believe this is not a problem because the correction for format strings is rarely applied and, when it is applied, the use of random numbers makes the chances of interfering with existing variables extremely low.
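The naming scheme can be sketched as follows. The function name fresh_variable_name, the underscore separator and the range of the random number are assumptions made for illustration; the text only states that a random number greater than zero is appended.

```php
<?php
// Hypothetical sketch of the naming scheme; the tool's real generator and the
// range of its random number are not specified in the text.
function fresh_variable_name(string $base): string
{
    // random_int(1, 9999) is an assumed range chosen to mirror "NNNN".
    return '$' . $base . '_' . random_int(1, 9999);
}

// Prints a name such as $matches_4821 (the suffix varies on every run).
echo fresh_variable_name('matches'), "\n";
```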