• No se han encontrado resultados

NECESIDADES Y REQUERIMIENTOS

In document Follow this and additional works at: (página 112-117)

3. ESTUDIO TÉCNICO

3.4. NECESIDADES Y REQUERIMIENTOS

In many cases the average value of a set of data is not a consistent sampling of the data. Wide ranges of values can skew average results and provide an unrealistic assessment of data. Consider the following chart representing the continents and their populations:

121 Name Population Asia 3,705,025,700 Africa 784,475,000 Europe 730,074,600 North America 482,993,000 South America 345,780,000 Oceania 30,401,150 Antarctica 0

Based on the data in this chart, the average population for the continents is 868,392,779. A quick glance at the data reveals this is not accurate. Asia is so much larger in population than the other continents along with Antarctica having no population. This results in a skewed outcome and not representative of the data. A more accurate calculation would be the median value.

Calculating the median value

The median value is calculated by using the middle value of a data set. If there was a grouping of values that consisted of 1,3,5,20,99, the median value would be 5. In the case of the population chart, the single middle value is North America with a population of 482,993,000. If there was an even number of continents, this would require averaging the two populations located in the center of the data. Easy enough to determine when doing it by hand, but how is this done with SQL.

Multiple paths to the median

In MySQL there are multiple ways to accomplish the same task. Some are more efficient than others, while others are easier to create. Either way, in MySQL determining the median value can be accomplished in one of three ways:

• Creating an implicit temporary table

• Creating a stored procedure

Thi Nguyet Tran (moonฺtran@spts21ฺcomฺvn) has a non-transferable

license to use this Student Guideฺ

5.5.1 Median Value via Implicit Temporary Tables

Using an SQL user variable and nested SELECT can provide creative solutions to perceived limitations of SQL. In the following SQL statements, a user variable called @counter is created for use in a nested SELECT statement:

mysql> SET @counter := 0;

Query OK, 0 rows affected (0.00 sec)

mysql> SELECT FLOOR(AVG(table_a.Population)) AS median_pop -> FROM (SELECT @counter := @counter+1 AS Count, -> Population

-> FROM Continents ORDER BY Population) AS table_a, -> (SELECT COUNT(*) AS totalrows FROM Continents) AS table_b -> WHERE Count IN (FLOOR((totalrows+1)/2), FLOOR(totalrows/2)+1); +---+

| median_pop | +---+ | 482993000 | +---+

1 row in set (#.## sec)

For the most part the SQL is pretty standard; however, the following explains portions of the SQL that may seem a little awkward:

122

123

AVG(value) - The first portion AVG(value) will take the values that are returned and

average them together. If there are an odd number of rows, the values returned would be the same values so averaging them would simply provide the one value in response. If there are an even number of rows, the values returned would be the middle two values which would then be averaged together.

(SELECT @counter := @counter+1 AS Count, Population FROM

Continents ORDER BY Population) AS table_a - This portion of the script acts as the first table (table_a) that is being used in the SQL statement. This table acts as the counter table which will produce a value from 1 to 7 based on the values in the Continents table.

(SELECT COUNT(*) AS totalrows FROM Continents) AS table_b - This

portion of the script acts as the second table (table_b) that is being used in the SQL. The only column in this table, called totalrows, will contain the number of rows from the incomes table. This totalrows column will be used to determine the middle rows which is necessary for determining the median value.

5-10

_________________________________________________________________________________________________

_________________________________________________________________________________________________ _________________________________________________________________________________________________

Thi Nguyet Tran (moonฺtran@spts21ฺcomฺvn) has a non-transferable

license to use this Student Guideฺ

WHERE counter IN (FLOOR((totalrows+1)/2), FLOOR(totalrows/2)+1) - The

totalrows column that is obtained from the second table (table_b), is used here to determine the middle rows.

o Odd number of rows - When there are an odd number of rows, the formula FLOOR((totalrows+1)/2) will evaluate to (totalrows+1)/2. The second formula FLOOR(totalrows/2)+1 will also evaluate to (totalrows+1)/2. Thus providing two values that are identical to the AVG(value) aggregate function. In a list that contains 1, 2, 3, 4, 5 the following results would exist:

ƒ FLOOR((totalrows+1)/2) => FLOOR((5+1)/2) => 3

ƒ FLOOR(totalrows/2)+1 => FLOOR(5/2)+1 => 3

o Even number of rows - Where there are an even number of rows, the formula FLOOR((totalrows+1)/2) will evaluate to totalrows/2. The second formula FLOOR(totalrows/2)+1 will evaluate to (position/2+1). This results in the two values returned being the values from the two middle rows in the list. In a list that contains 1, 2, 3, 4, 5, 6 the following results would exist:

ƒ FLOOR((totalrows+1)/2) => FLOOR((6+1)/2) => 3 ƒ FLOOR(totalrows/2)+1 => FLOOR(6/2)+1 => 4

FLOOR() function

The SQL FLOOR() function returns the largest integer that is less than or equal of the numeric value that is supplied. This function will return the same data type of the numeric value that is supplied.

Thi Nguyet Tran (moonฺtran@spts21ฺcomฺvn) has a non-transferable

license to use this Student Guideฺ

5.5.1.1 Evaluating Performance

The SQL statement that was used to determine the median value for the Continents table produced the correct output but at what cost to performance. Utilizing the EXPLAIN function can provide clues to performance issues:

mysql> EXPLAIN

-> SELECT FLOOR(AVG(table_a.Population)) AS median_pop -> FROM (SELECT @counter := @counter+1 AS Count, Population -> FROM Continents ORDER BY Population)

-> AS table_a,

-> (SELECT COUNT(*) AS totalrows FROM Continents) -> AS table_b

-> WHERE Count IN (FLOOR((totalrows+1)/2), FLOOR(totalrows/2)+1);

+----+---+---+----+----+---+----+----+---+---+ | id | selec.. | table | .. | .. | key | .. | .. | rows | Extra | +----+---+---+----+----+---+----+----+---+---+ | 1 | PRIMARY | <derived3> | .. | .. | NULL | .. | .. | 1 | | | 1 | PRIMARY | <derived2> | .. | .. | NULL | .. | .. | 7 | Using where | | 3 | DERIVED | Continents | .. | .. | Pop.. | .. | .. | 7 | Select tables optimized away | | 2 | DERIVED | Continents | .. | .. | Pop.. | .. | .. | 7 | Using index | +----+---+---+----+----+---+----+----+---+---+ 4 rows in set (0.00 sec)

Note: Some of the content from this output has been removed for printing purposes to ensure that the more important columns for this discussion are displayed.

The first nested SELECT statement (id #2) is using the Continents.Population index, but the second nested SELECT statement (id #3) is able to completely remove the SELECT statement due to the MyISAM storage engine having an internal row count.

Other storage engines (such as InnoDB) would have to attempt to utilize an index. With the SELECT statement as it currently is, no index would be able to be used when requesting a count on all columns: COUNT(*). To overcome this limitation in non-MyISAM storage engines, an indexed column can be used instead of * to utilize optimization features:

mysql> EXPLAIN

-> SELECT FLOOR(AVG(table_a.Population)) AS median_pop -> FROM (SELECT @counter := @counter+1 AS Count, -> Population

-> FROM Continents ORDER BY Population) AS table_a,

-> (SELECT COUNT(Population) AS totalrows FROM Continents) AS table_b -> WHERE Count IN (FLOOR((totalrows+1)/2), FLOOR(totalrows/2)+1);

+----+---+---+----+----+---+----+----+---+---+ | id | selec.. | table | .. | .. | key | .. | .. | rows | Extra | +----+---+---+----+----+---+----+----+---+---+ | 1 | PRIMARY | <derived3> | .. | .. | NULL | .. | .. | 1 | | | 1 | PRIMARY | <derived2> | .. | .. | NULL | .. | .. | 7 | Using where | | 3 | DERIVED | Continents | .. | .. | Population | .. | .. | 7 | Using index | | 2 | DERIVED | Continents | .. | .. | Population | .. | .. | 7 | Using index | +----+---+---+----+----+---+----+----+---+---+ 4 rows in set (#.## sec)

Using the Continents.Population index in the COUNT function has allowed the second table (id #3) to now use the index and thus improves the performance of the SELECT statement.

124 125 5-12 _________________________________________________________________________________________________ _________________________________________________________________________________________________ _________________________________________________________________________________________________

Thi Nguyet Tran (moonฺtran@spts21ฺcomฺvn) has a non-transferable

license to use this Student Guideฺ

5.5.2 Creating a Median Stored Procedure

The MySQL stored programming language is a block-structured programming language based on the ANSI SQL:2003 SQL/PSM (Persistent Stored Module) specification. This language includes many of the common programming tools expected in a block-structured programming language such as manipulating variables, managing execution through conditions, repeated routine processing along with handling errors. In the course of using MySQL, a developer may find it easier to produce a stored procedure to eliminate external coding or repetition of tasks. Thus, if there is a need to obtain the median value on a regular basis against datasets, it may make sense to create a stored procedure to handle the task.

A median procedure

Using the process shown for obtaining a median value using implicit temporary tables can be moved easily into a stored procedure that can be reused over and over again. The following is the median stored procedure in its entirety which will be followed up with a detailed look at how it works:

126

mysql> DELIMITER //

mysql> CREATE PROCEDURE median (src_field VARCHAR (32), src_table VARCHAR (32)) -> SQL SECURITY INVOKER

-> BEGIN

-> SET @counter = 0; -> SET @stmt = CONCAT(

-> 'SELECT FLOOR(AVG(table_a.', src_field,')) AS MEDIAN '> FROM (SELECT @counter := @counter+1 Count, ', -> src_field, -> ' FROM ', -> src_table, -> ' ORDER BY ', -> src_field, -> ') AS table_a,

'> (SELECT COUNT(*) totalrows FROM ',src_table, ') AS table_b '> WHERE Count IN (FLOOR((totalrows+1)/2), FLOOR(totalrows/2)+1) '> ');

-> PREPARE stmt_exec FROM @stmt; -> EXECUTE stmt_exec;

-> DEALLOCATE PREPARE stmt_exec; -> END //

Query OK, 0 rows affected (#.## sec) mysql> DELIMITER ;

mysql> CALL median('Population','Continents'); +---+

| MEDIAN | +---+ | 482993000 | +---+

1 row in set (#.## sec)

127

Thi Nguyet Tran (moonฺtran@spts21ฺcomฺvn) has a non-transferable

license to use this Student Guideฺ

DELIMITER

Within the BEGINEND syntax, statements must be terminated with a semicolon (;). This is due to

the fact that this is the same default terminating character for SQL statements, it is important to change

the SQL terminating character with the DELIMITER statement when using the MySQL command line

client or batch processing.

In document Follow this and additional works at: (página 112-117)