3. ESTUDIO TÉCNICO
3.4. NECESIDADES Y REQUERIMIENTOS
In many cases the average value of a set of data is not a consistent sampling of the data. Wide ranges of values can skew average results and provide an unrealistic assessment of data. Consider the following chart representing the continents and their populations:
121 Name Population Asia 3,705,025,700 Africa 784,475,000 Europe 730,074,600 North America 482,993,000 South America 345,780,000 Oceania 30,401,150 Antarctica 0
Based on the data in this chart, the average population for the continents is 868,392,779. A quick glance at the data reveals this is not accurate. Asia is so much larger in population than the other continents along with Antarctica having no population. This results in a skewed outcome and not representative of the data. A more accurate calculation would be the median value.
Calculating the median value
The median value is calculated by using the middle value of a data set. If there was a grouping of values that consisted of 1,3,5,20,99, the median value would be 5. In the case of the population chart, the single middle value is North America with a population of 482,993,000. If there was an even number of continents, this would require averaging the two populations located in the center of the data. Easy enough to determine when doing it by hand, but how is this done with SQL.
Multiple paths to the median
In MySQL there are multiple ways to accomplish the same task. Some are more efficient than others, while others are easier to create. Either way, in MySQL determining the median value can be accomplished in one of three ways:
• Creating an implicit temporary table
• Creating a stored procedure
Thi Nguyet Tran (moonฺtran@spts21ฺcomฺvn) has a non-transferable
license to use this Student Guideฺ
5.5.1 Median Value via Implicit Temporary Tables
Using an SQL user variable and nested SELECT can provide creative solutions to perceived limitations of SQL. In the following SQL statements, a user variable called @counter is created for use in a nested SELECT statement:
mysql> SET @counter := 0;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT FLOOR(AVG(table_a.Population)) AS median_pop -> FROM (SELECT @counter := @counter+1 AS Count, -> Population
-> FROM Continents ORDER BY Population) AS table_a, -> (SELECT COUNT(*) AS totalrows FROM Continents) AS table_b -> WHERE Count IN (FLOOR((totalrows+1)/2), FLOOR(totalrows/2)+1); +---+
| median_pop | +---+ | 482993000 | +---+
1 row in set (#.## sec)
For the most part the SQL is pretty standard; however, the following explains portions of the SQL that may seem a little awkward:
122
123
• AVG(value) - The first portion AVG(value) will take the values that are returned and
average them together. If there are an odd number of rows, the values returned would be the same values so averaging them would simply provide the one value in response. If there are an even number of rows, the values returned would be the middle two values which would then be averaged together.
• (SELECT @counter := @counter+1 AS Count, Population FROM
Continents ORDER BY Population) AS table_a - This portion of the script acts as the first table (table_a) that is being used in the SQL statement. This table acts as the counter table which will produce a value from 1 to 7 based on the values in the Continents table.
• (SELECT COUNT(*) AS totalrows FROM Continents) AS table_b - This
portion of the script acts as the second table (table_b) that is being used in the SQL. The only column in this table, called totalrows, will contain the number of rows from the incomes table. This totalrows column will be used to determine the middle rows which is necessary for determining the median value.
5-10
_________________________________________________________________________________________________
_________________________________________________________________________________________________ _________________________________________________________________________________________________
Thi Nguyet Tran (moonฺtran@spts21ฺcomฺvn) has a non-transferable
license to use this Student Guideฺ
• WHERE counter IN (FLOOR((totalrows+1)/2), FLOOR(totalrows/2)+1) - The
totalrows column that is obtained from the second table (table_b), is used here to determine the middle rows.
o Odd number of rows - When there are an odd number of rows, the formula FLOOR((totalrows+1)/2) will evaluate to (totalrows+1)/2. The second formula FLOOR(totalrows/2)+1 will also evaluate to (totalrows+1)/2. Thus providing two values that are identical to the AVG(value) aggregate function. In a list that contains 1, 2, 3, 4, 5 the following results would exist:
FLOOR((totalrows+1)/2) => FLOOR((5+1)/2) => 3
FLOOR(totalrows/2)+1 => FLOOR(5/2)+1 => 3
o Even number of rows - Where there are an even number of rows, the formula FLOOR((totalrows+1)/2) will evaluate to totalrows/2. The second formula FLOOR(totalrows/2)+1 will evaluate to (position/2+1). This results in the two values returned being the values from the two middle rows in the list. In a list that contains 1, 2, 3, 4, 5, 6 the following results would exist:
FLOOR((totalrows+1)/2) => FLOOR((6+1)/2) => 3 FLOOR(totalrows/2)+1 => FLOOR(6/2)+1 => 4
FLOOR() function
The SQL FLOOR() function returns the largest integer that is less than or equal of the numeric value that is supplied. This function will return the same data type of the numeric value that is supplied.
Thi Nguyet Tran (moonฺtran@spts21ฺcomฺvn) has a non-transferable
license to use this Student Guideฺ
5.5.1.1 Evaluating Performance
The SQL statement that was used to determine the median value for the Continents table produced the correct output but at what cost to performance. Utilizing the EXPLAIN function can provide clues to performance issues:
mysql> EXPLAIN
-> SELECT FLOOR(AVG(table_a.Population)) AS median_pop -> FROM (SELECT @counter := @counter+1 AS Count, Population -> FROM Continents ORDER BY Population)
-> AS table_a,
-> (SELECT COUNT(*) AS totalrows FROM Continents) -> AS table_b
-> WHERE Count IN (FLOOR((totalrows+1)/2), FLOOR(totalrows/2)+1);
+----+---+---+----+----+---+----+----+---+---+ | id | selec.. | table | .. | .. | key | .. | .. | rows | Extra | +----+---+---+----+----+---+----+----+---+---+ | 1 | PRIMARY | <derived3> | .. | .. | NULL | .. | .. | 1 | | | 1 | PRIMARY | <derived2> | .. | .. | NULL | .. | .. | 7 | Using where | | 3 | DERIVED | Continents | .. | .. | Pop.. | .. | .. | 7 | Select tables optimized away | | 2 | DERIVED | Continents | .. | .. | Pop.. | .. | .. | 7 | Using index | +----+---+---+----+----+---+----+----+---+---+ 4 rows in set (0.00 sec)
Note: Some of the content from this output has been removed for printing purposes to ensure that the more important columns for this discussion are displayed.
The first nested SELECT statement (id #2) is using the Continents.Population index, but the second nested SELECT statement (id #3) is able to completely remove the SELECT statement due to the MyISAM storage engine having an internal row count.
Other storage engines (such as InnoDB) would have to attempt to utilize an index. With the SELECT statement as it currently is, no index would be able to be used when requesting a count on all columns: COUNT(*). To overcome this limitation in non-MyISAM storage engines, an indexed column can be used instead of * to utilize optimization features:
mysql> EXPLAIN
-> SELECT FLOOR(AVG(table_a.Population)) AS median_pop -> FROM (SELECT @counter := @counter+1 AS Count, -> Population
-> FROM Continents ORDER BY Population) AS table_a,
-> (SELECT COUNT(Population) AS totalrows FROM Continents) AS table_b -> WHERE Count IN (FLOOR((totalrows+1)/2), FLOOR(totalrows/2)+1);
+----+---+---+----+----+---+----+----+---+---+ | id | selec.. | table | .. | .. | key | .. | .. | rows | Extra | +----+---+---+----+----+---+----+----+---+---+ | 1 | PRIMARY | <derived3> | .. | .. | NULL | .. | .. | 1 | | | 1 | PRIMARY | <derived2> | .. | .. | NULL | .. | .. | 7 | Using where | | 3 | DERIVED | Continents | .. | .. | Population | .. | .. | 7 | Using index | | 2 | DERIVED | Continents | .. | .. | Population | .. | .. | 7 | Using index | +----+---+---+----+----+---+----+----+---+---+ 4 rows in set (#.## sec)
Using the Continents.Population index in the COUNT function has allowed the second table (id #3) to now use the index and thus improves the performance of the SELECT statement.
124 125 5-12 _________________________________________________________________________________________________ _________________________________________________________________________________________________ _________________________________________________________________________________________________
Thi Nguyet Tran (moonฺtran@spts21ฺcomฺvn) has a non-transferable
license to use this Student Guideฺ
5.5.2 Creating a Median Stored Procedure
The MySQL stored programming language is a block-structured programming language based on the ANSI SQL:2003 SQL/PSM (Persistent Stored Module) specification. This language includes many of the common programming tools expected in a block-structured programming language such as manipulating variables, managing execution through conditions, repeated routine processing along with handling errors. In the course of using MySQL, a developer may find it easier to produce a stored procedure to eliminate external coding or repetition of tasks. Thus, if there is a need to obtain the median value on a regular basis against datasets, it may make sense to create a stored procedure to handle the task.
A median procedure
Using the process shown for obtaining a median value using implicit temporary tables can be moved easily into a stored procedure that can be reused over and over again. The following is the median stored procedure in its entirety which will be followed up with a detailed look at how it works:
126
mysql> DELIMITER //
mysql> CREATE PROCEDURE median (src_field VARCHAR (32), src_table VARCHAR (32)) -> SQL SECURITY INVOKER
-> BEGIN
-> SET @counter = 0; -> SET @stmt = CONCAT(
-> 'SELECT FLOOR(AVG(table_a.', src_field,')) AS MEDIAN '> FROM (SELECT @counter := @counter+1 Count, ', -> src_field, -> ' FROM ', -> src_table, -> ' ORDER BY ', -> src_field, -> ') AS table_a,
'> (SELECT COUNT(*) totalrows FROM ',src_table, ') AS table_b '> WHERE Count IN (FLOOR((totalrows+1)/2), FLOOR(totalrows/2)+1) '> ');
-> PREPARE stmt_exec FROM @stmt; -> EXECUTE stmt_exec;
-> DEALLOCATE PREPARE stmt_exec; -> END //
Query OK, 0 rows affected (#.## sec) mysql> DELIMITER ;
mysql> CALL median('Population','Continents'); +---+
| MEDIAN | +---+ | 482993000 | +---+
1 row in set (#.## sec)
127
Thi Nguyet Tran (moonฺtran@spts21ฺcomฺvn) has a non-transferable
license to use this Student Guideฺ
DELIMITER
Within the BEGIN … END syntax, statements must be terminated with a semicolon (;). This is due to
the fact that this is the same default terminating character for SQL statements, it is important to change
the SQL terminating character with the DELIMITER statement when using the MySQL command line
client or batch processing.