EL MISTERIO DE LOS SIETE RAYOS - Curso completo de Magia Negra

Statistics is the systematic study of data (Riffenburgh, 2012, Baldi and Moore, 2014, Moore et al., 2012). Both structured and unstructured data is available to researchers through computer technologies in overwhelming amounts. Below, I give some examples of data resources associated with global research:

Human Genome Project (http://www.genome.gov/10001772),

Human Plasma Proteome Project (http://www.peptideatlas.org/hupo/hppp/),

Gallup-Healthways Well-Being Index (http://www.well-beingindex.com/),

studies regarding non-communicable diseases (http://www.lstmed.ac.uk/news-events/media/the-lstm-leverhulme-lecture-2015).


Highly automated equipment, ubiquitous apps, and software make it relatively easy, fast, and cheap to collect and record data automatically. However, cleaning the data, retrieving the meaningful part of it, and making sense of it (mining it) is still an extremely time-consuming and expensive task. Therefore, re-thinking statistical methods and the ability to implement them effectively in highly variable scientific communities are of strategic importance for high-quality study design, analysis, and reporting. Statistics can be viewed as learning from actual data expressed as numbers (Riffenburgh, 2012, Baldi and Moore, 2014, Moore et al., 2012). Statistical analysis often starts with deciding which cases, and how many of them, to include in the final analysis (Riffenburgh, 2012, Baldi and Moore, 2014, Moore et al., 2012). The measurements recorded across a set of cases form a higher-level data entity called a variable. In this thesis, the cases are measurements made on individual hyphae, actin filaments, and bacteria, respectively. Examples of variables studied here are apical extension velocities, deflection angles, turning angles, branching angles, branching distances, filament persistence lengths, and pixel intensities. Importantly, the distribution of a variable describes the range of its values and how frequently each value occurs. It is also of crucial importance to take into account all qualitative information associated with the recorded variables. Some of the variables are very straightforward, e.g. apical extension velocities, while others, like the filament persistence length, require a deeper understanding of the underlying problem, e.g. the physics of semi-flexible polymers in that case. It is critical to ensure that a variable measures the desired phenomenon, effect, or outcome in a sample or a population, as otherwise it might lead to inaccurate inferences. In life sciences, a significant amount of data is recorded on each variable at randomization, together with outcome, periodic, or endpoint data at various time points.
The data can be both quantitative, e.g. continuous variables such as apical extension velocity and branching angle, and qualitative, e.g. image data, fungus morphology, metadata, and the classification of hyphae into parents, daughters, and further generations (Riffenburgh, 2012, Baldi and Moore, 2014, Moore et al., 2012). Therefore, understanding data types and practical experience with data processing are crucial for choosing appropriate statistical methods.

Example

The Neurospora crassa data inventory includes information on various aspects of the growth patterns of this organism. The data contains not only quantitative variables, such as apical extension velocity or branching angle, but also digital photos and movies attached to the records, each of which represents a unique database (experimental) case. A snapshot of the fungi data inventory is shown in Figure 131 in Appendix 6.2. I used File Maker and Excel spreadsheets for the calculations and data analysis. The numerical values of extension velocities, branching angles, and branching distances describe Neurospora crassa quantitatively. Quantitative variables can take any value on a continuous numerical scale (continuous variables) or only a finite number of values (discrete variables). Categorical variables can be classified as nominal or ordinal. Nominal variables are purely qualitative and unordered, while ordinal variables are ranked. Importantly, the intervals between successive ranks might not be equal. Thus, ordinally ranked variables are not quantitative. An example of a ranked (ordinal) variable is the hyphal generation: ‘1-parents’, ‘2-daughters’, and ‘3-further generations’. The raw data used for the study of growth patterns of Neurospora crassa is available here: Fungi Numerical Data. The data includes 792 frames taken from 6 different movies (132 frames per movie), 18 + 70 + 34 + 55 + 77 + 86 = 340 hyphae, 3 variables (apical extension velocity [m/min], branching angles [°], and branching distances [m]), 1 213 + 3 161 + 1 499 + 2 609 + 1 876 + 2 190 = 12 548 cases for apical extension velocity, 36 + 75 + 69 + 81 + 32 + 44 = 337 cases for branching angles, and 5 + 16 + 15 + 25 + 15 + 19 = 95 cases for branching distances. The purpose of the data collection is to use it for computer simulations and to make inferences on how statistical methods can influence experimental conclusions.
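The per-movie counts quoted above can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: the counts are copied from the text, while the variable names are illustrative and are not part of the original data inventory.

```python
# Per-movie counts copied from the text; the names are illustrative only.
frames_per_movie = 132
hyphae = [18, 70, 34, 55, 77, 86]
velocity_cases = [1213, 3161, 1499, 2609, 1876, 2190]
angle_cases = [36, 75, 69, 81, 32, 44]
distance_cases = [5, 16, 15, 25, 15, 19]

totals = {
    "frames": frames_per_movie * len(hyphae),   # one entry per movie
    "hyphae": sum(hyphae),
    "apical extension velocity cases": sum(velocity_cases),
    "branching angle cases": sum(angle_cases),
    "branching distance cases": sum(distance_cases),
}

for name, total in totals.items():
    print(f"{name}: {total}")
```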

Data characterization starts with exploratory data analysis. It usually begins with plotting the numbers to assess the presence of any general patterns or obvious deviations from these patterns, e.g. outliers. Whether a deviation is present is often a matter of judgment, and this judgment may be crucial for providing an unbiased analysis and drawing correct conclusions (Moore et al., 2012, Wang and Bakhai, 2006). Importantly, a data distribution is characterized by its shape, central tendency, and spread (Riffenburgh, 2012, Baldi and Moore, 2014, Moore et al., 2012). The central tendency measure is usually of main interest, and it can be characterized by the midpoint, defined as the value situated in the middle of the ranked (semi-qualitative) observations or in the middle of the measured (quantitative) observations put into numerical order. The minimum and maximum observed values give an idea of the data spread. Distribution shapes are often classified according to the number of peaks as unimodal or multimodal. Another important distribution feature is skewness: a distribution curve can be symmetrical, skewed to the right, or skewed to the left. It is also always worth plotting the observed data in time order to detect any changes in patterns over time. Although graphs are the right starting point for exploratory data analysis, numerical summaries allow for clearly defined comparisons.


Thus, a descriptive summary of the data should include both the shape and numerical descriptors of the data center and spread. Histograms or relative frequency plots provide the key information on the distribution shape and allow an appropriate measure of the data center and spread to be selected. The most important quantity is the central tendency measure, which is usually the mean or the median or, less often, the mode (Riffenburgh, 2012, Baldi and Moore, 2014, Moore et al., 2012). The mean, or “average value”, is calculated by summing all observation values and dividing the sum by the number of observations. The formula for calculating the mean is given below (Baldi and Moore, 2014, Moore et al., 2012):

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $\bar{x}$ is the mean, $n$ is the number of observations, and $x_1, x_2, \dots, x_n$ are the individual observations. Importantly, the subscripts in $x_i$ only indicate that the observations are separate; they do not indicate any particular ranking or order.

The mean is the most popular central tendency measure in research practice. However, it has one major drawback: it is strongly affected by outliers and extreme values in the distribution tails, especially in the case of skewed distributions, in which the mean is shifted in the direction of the longer tail. Therefore, the mean is said to be a non-resistant measure of the data center. It is relatively easy to manipulate the mean by changing the value of as few as one extreme observation (Baldi and Moore, 2014). Using the mean as a central tendency measure without a deeper understanding of the data can result in biased inferences, especially in life sciences, as outliers and extreme values occur relatively often in biological systems. An alternative measure is the median, also called the “middle value” or the distribution midpoint (Moore et al., 2012). Calculating the median involves several steps. Firstly, all observations are sorted from the smallest to the largest value. Secondly, for an odd number of observations, the median is identified as the value situated in the center of the ordered list, i.e. at position (n+1)/2 from the bottom of the ordered list. For an even number of observations, the median is the mean of the two observations situated in the middle of the ordered list. Medians of small data sets are easy to find manually; for larger data sets, the task is usually automated with computer software. Importantly, the main advantage of the median as a central tendency measure is its resistance to extreme values and outliers.

For perfectly symmetric distributions the mean and the median are the same. For roughly symmetrical distributions the mean and the median are situated close to each other on the horizontal axis. For skewed distributions the mean is shifted towards the longer tail, compared to the median. The mode is a less popular central tendency measure that is very sensitive to any changes in the data distribution. The mode is defined as the value that occurs most frequently (Wang and Bakhai, 2006, Baldi and Moore, 2014). A measure of center should ideally be complemented by a measure of spread in a sample or a population. Making inferences based on central tendency alone can be very misleading (Moore et al., 2012, Bland, 1998). For example, two different populations of Neurospora crassa with the same median branching angle will display very different morphologies if one population reaches extremely high or low branching angle values and the other has little variation in branching angle among hyphae. A straightforward and efficient way of characterizing data spread and variability is to use percentiles, which describe the proportions of the data falling above or below a certain value (Wang and Bakhai, 2006, Moore et al., 2012, Riffenburgh, 2012). The pth percentile is defined as a value found amongst the measured values such that at most p% of the measured values are less than it and at most (100-p)% are greater, e.g. the 50th percentile is the median (Wang and Bakhai, 2006, Moore et al., 2012, Riffenburgh, 2012). Analogously, the 25th, 50th, and 75th percentiles are called quartiles, as they divide the data into four equal parts (Moore et al., 2012). Using quartiles is a good way of bypassing the normality assumption and of describing the variability of a distribution for which the use of parametric methods cannot be easily justified. However, one should keep in mind that some of the observed values might be the same and that there are a few different rules for calculating percentiles that can give slightly different outcomes (Moore et al., 2012).
In the study presented here, for preliminary assessments of data spread and variability, I consistently used “the five-number summary” approach suggested by Moore et al. (Moore et al., 2012). The summary gives the following measures in order: the minimum observed value, the 25th percentile (1st quartile), the median (2nd quartile), the 75th percentile (3rd quartile), and the maximum observed value. According to the literature, “the five-number summary”, despite its advantages, does not seem to be a frequently used method for characterizing data spread and variability (Moore et al., 2012).
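The five-number summary, and the different behaviour of the mean and the median when one extreme value is present, can be illustrated with a short Python sketch. The branching angles below are invented for illustration and are not taken from the thesis data. The quartile rule used (the median of the observations below and above the position of the overall median) is one common convention and is an assumption here; as noted above, other rules can give slightly different quartiles.

```python
from statistics import mean, median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two observations.

    Assumed quartile rule: Q1 and Q3 are the medians of the observations
    lying below and above the position of the overall median.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two observations are required")
    half = n // 2
    lower = ordered[:half]           # observations below the median position
    upper = ordered[half + n % 2:]   # observations above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical branching angles in degrees (illustrative values only).
angles = [38, 41, 44, 45, 47, 52, 55]
print(five_number_summary(angles))   # (38, 41, 45, 52, 55)
print(mean(angles), median(angles))  # mean 46, median 45

# Replacing a single observation with an extreme value moves the mean
# from 46 to 58, while the median stays at 45.
with_outlier = [38, 41, 44, 45, 47, 52, 139]
print(mean(with_outlier), median(with_outlier))
```

The second data set differs from the first in one observation only, which is enough to shift the mean by 12 degrees while leaving the median unchanged.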


The most popular way of describing the spread of phenotypical data is the standard deviation, which gives information on how far the observed values are from the distribution mean. Unfortunately, in practice this concept is very often applied without checking the normality assumption. This can result in a biased perception of the experimental results, incorrect inferences, and wrong decisions.

Assessing normality should be a prerequisite for using any statistical method that assumes normally distributed data, especially when the data sample is small. It is possible to check whether the data is normally distributed in two ways: (a) by plotting it, thus visualizing the difference between the theoretical and the experimental distribution, or (b) by using formal statistical tests. Histograms can be used for assessing normality; however, they can be misleading for small data samples. Therefore, the most common graphical method is the Q-Q plot, in which the ordered values of a particular variable are plotted against the equivalent quantiles of the theoretical normal (Gaussian) distribution. If the experimental and theoretical quantiles match, the points form a linear pattern. The visual assessment of normality can be very subjective at times. Therefore, numerical methods (such as the Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, or Cramér-von Mises tests) are often used instead (Bland, 1998, Kirkwood, 2003). The most commonly used and well-established family of density curves is the one describing normally distributed data. The normal (Gauss) density curve has a general shape specified by the mean value μ and the standard deviation value σ. For a perfectly symmetrical normal (Gaussian) distribution the mean equals the median. Regarding the visual properties of the normal (Gauss) density curve, changing the mean value (while keeping the standard deviation constant) shifts the curve along the horizontal axis without changing its spread, whereas changing the standard deviation value (while keeping the mean value constant) changes the spread of the curve.
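The construction of a Q-Q plot can be sketched with the Python standard library alone. The sketch below pairs the ordered observations with standard normal quantiles and summarizes the linearity of the pattern with a correlation coefficient. The plotting positions (i - 0.5)/n, the function names, and both example samples are assumptions made for illustration; the correlation is only a rough numerical companion to the visual check and is not one of the formal tests named above.

```python
from math import sqrt
from statistics import NormalDist


def qq_points(values):
    """Pair theoretical standard normal quantiles with the ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n are an assumed convention; others exist.
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def qq_correlation(values):
    """Pearson correlation of the Q-Q points; values near 1 suggest a linear pattern."""
    points = qq_points(values)
    n = len(points)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean_t = sum(t for t, _ in points) / n
    mean_o = sum(o for _, o in points) / n
    cov = sum((t - mean_t) * (o - mean_o) for t, o in points)
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    var_o = sum((o - mean_o) ** 2 for _, o in points)
    if var_o == 0:
        raise ValueError("observations must not all be identical")
    return cov / sqrt(var_t * var_o)


# Hypothetical samples (illustrative values only).
roughly_symmetric = [4.1, 4.6, 4.9, 5.0, 5.2, 5.3, 5.6, 5.8, 6.3]
right_skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]
print(round(qq_correlation(roughly_symmetric), 3))
print(round(qq_correlation(right_skewed), 3))
```

The pairs returned by `qq_points` are the coordinates one would plot. Formal tests such as Shapiro-Wilk are available in statistical packages (for example `scipy.stats.shapiro` in SciPy) and are not reimplemented here.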

Figure 8 Illustrating median and mean as a central tendency measure: median vs mean


The main idea behind the standard deviation is to assess the data spread by measuring how far the observed values are from the central value of the distribution. Geometrically, the standard deviation is a kind of average distance (strictly, a root-mean-square distance) between the observation points (plotted in the Cartesian coordinate system) and the central (mean) value. Graphically, the standard deviation can be found by locating the two points at which the curvature changes (the inflection points) along the normal (Gauss) distribution curve. The standard deviation (σ) is the distance between either of these points and the mean (μ).

The procedure for calculating the standard deviation involves several steps. It begins with calculating the deviations between the observations and the mean value, followed by squaring them, calculating the variance and, finally, taking the square root of the variance. What are the reasons behind squaring the deviations (Moore et al., 2012, Wang and Bakhai, 2006)? Firstly, this simple operation is useful for optimization purposes, as the sum of squared deviations from the mean is smaller than the sum of squared deviations from any other value. Secondly, squaring the deviations is useful for representation purposes, as the standard deviation is the natural measure of spread for the normal (Gauss) distribution. For the same reason, the standard deviation s, rather than the variance, is most commonly used for reporting. Also, s has the same measurement unit as the observations.
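The steps described above can be sketched in Python as follows. The observations are invented for illustration, the function names are hypothetical, and the divisor n - 1 follows the sample variance used in this section. The result is compared with `statistics.stdev` from the standard library, which also uses n - 1.

```python
from math import sqrt
from statistics import stdev


def standard_deviation_steps(values):
    """Compute the sample standard deviation step by step."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    x_bar = sum(values) / n                    # mean
    deviations = [x - x_bar for x in values]   # deviations from the mean
    squared = [d ** 2 for d in deviations]     # squared deviations
    variance = sum(squared) / (n - 1)          # sample variance s^2
    s = sqrt(variance)                         # standard deviation s
    return {"mean": x_bar, "deviations": deviations,
            "squared": squared, "variance": variance, "s": s}


def sum_of_squares_about(values, center):
    """Sum of squared deviations of the observations from an arbitrary center."""
    return sum((x - center) ** 2 for x in values)


# Hypothetical observations (illustrative values only).
values = [2, 4, 4, 4, 5, 5, 7, 9]
steps = standard_deviation_steps(values)
print(steps["mean"], steps["variance"], steps["s"])
print(abs(steps["s"] - stdev(values)) < 1e-12)   # agrees with the library routine

# The sum of squared deviations is smallest when taken about the mean.
print(sum_of_squares_about(values, steps["mean"]),
      sum_of_squares_about(values, 4.5))
```

For these values the mean is 5 and the sum of squared deviations about the mean is 32, compared with 34 about 4.5, which illustrates the optimization property mentioned above.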


The mean and standard deviation fully define the shape of the normal (Gauss) distribution curve (Moore et al., 2012, Wang and Bakhai, 2006). Thus, the curve can be expressed as N(__{and given by the following equation:}

procedure name mathematical representation Comments

deviations from the

mean value 1 2 ( ) ( ) ... ( _n ) x x x x x x   

Equation 11 Calculating deviations from the central tendency measure

assessing how much the observations deviate from the

(global) central tendency measure squared deviations 2 1 2 2 2 ( ) ( ) ... ( _n ) x x x x x x   

Equation 12 Calculating squares of deviations

simplifying mathematical operations Variance 2 2 2 2 ( 1 ) ( 2 ) ... ( ) 1 n x x x x x x s n        

Equation 13 Calculating variance (1)

2 1 ₍ ₎2 1 i s x x n   



Equation 14 Calculating variance (2)

variance gives information about the

data spread standard deviation 2 1 ( ) 1 i s x x n   



Equation 15 Calculating standard deviation

the larger the spread, the larger the standard deviation





2 2 1 ( ) exp 2 2 0, , x f x x       _{ }     _ _           

Equation 16 Mathematical formula for normal (Gaussian) distribution

Table 1 Calculating Standard Deviation, Information source: (Wang and Bakhai, 2006, Moore et al., 2012, Riffenburgh, 2012)
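The normal (Gaussian) density formula of Equation 16 can be written directly as a small Python function. This is a minimal sketch: the function name and the parameter values in the example are illustrative assumptions, and the result is compared against `statistics.NormalDist.pdf` from the standard library as a sanity check.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_density(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma) evaluated at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


# Illustrative parameters only: mean 45, standard deviation 5.
mu, sigma = 45.0, 5.0
for x in (35.0, 45.0, 50.0):
    own = normal_density(x, mu, sigma)
    reference = NormalDist(mu, sigma).pdf(x)
    print(x, own, abs(own - reference) < 1e-12)
```

The density is highest at x = μ and takes equal values at points that lie the same distance on either side of μ, which reflects the symmetry of the curve.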


The popularity of normal (also called Gaussian) distributions across various science disciplines is due to several reasons. Firstly, repeated measurements often follow the normal (Gaussian) distribution. Secondly, the normal (Gaussian) distribution often approximates well observations involving chance outcomes. Finally, and most importantly, a significant number of statistical tests and procedures require the data to be normally distributed (Baldi and Moore, 2014). One of the most useful features of the normal (Gaussian) distribution is the so-called ‘68-95-99.7’ rule. It says that approximately 68% of observations fall within the interval between μ − 1σ and μ + 1σ, 95% between μ − 2σ and μ + 2σ, and 99.7% between μ − 3σ and μ + 3σ, respectively. The 68-95-99.7 rule is illustrated in Figure 10.

Figure 10 Illustration of the 68-95-99.7 rule. For the normal (Gaussian) distribution approximately 68% of the observations fall within σ of the mean μ, 95% within 2σ of μ, and 99.7% within 3σ of μ, respectively.
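The three proportions in the rule can be checked numerically with the cumulative distribution function of the normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name is an illustrative assumption.

```python
from statistics import NormalDist


def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal N(mu, sigma) distribution within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {proportion_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

The proportions do not depend on the particular values of μ and σ, so the same figures are obtained for any normal distribution.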


In the practice of statistics, there is often a dilemma whether to choose non-parametric methods, e.g. the five-number summary (described earlier in this section), or parametric methods, such as describing the central tendency and spread by the mean and standard deviation. For distributions that are strongly skewed, with different spreads on either side of the center, the standard deviation is not an appropriate description. In such situations, the five-number summary will characterize the data much more accurately and precisely. The mean and standard deviation work well mainly for symmetrical distributions with few outliers. Also, it is always worth plotting the data, as no numerical measure can describe the shape in full, e.g. numerical characteristics might not reveal several modes or gaps (Baldi and Moore, 2014). The distribution of data is of central importance for a complete statistical description of a phenotype, as it contains all the information needed for making inferences based on statistical methods (Wang and Bakhai, 2006, Moore et al., 2012, Riffenburgh, 2012). The sample distribution tells what proportion of the measurement outcomes falls into the interval of interest, whereas the population distribution gives the probability that a random measurement outcome comes from that interval. Methods for estimating population distribution are described in the previous section of this chapter. Regarding practical aspects, to calculate any given area under the normal (Gauss) density curve one needs to use either computer software or statistical tables. In both cases, the area under the curve is the cumulative proportion of the observation values situated between the specified boundary values.
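The difference between a sample proportion and a population probability for the same interval can be sketched as follows. The measurements, the interval, and the normal model parameters are invented for illustration, and the function names are hypothetical. The area under the normal curve is obtained from the cumulative distribution function in `statistics.NormalDist`, which plays the role of the statistical tables mentioned above.

```python
from statistics import NormalDist


def sample_proportion(values, low, high):
    """Proportion of observed values falling in the closed interval [low, high]."""
    return sum(low <= v <= high for v in values) / len(values)


def normal_probability(mu, sigma, low, high):
    """Area under the N(mu, sigma) density curve between low and high."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(high) - dist.cdf(low)


# Hypothetical measurements and an assumed normal model (illustrative only).
measurements = [41.0, 43.5, 44.0, 45.5, 46.0, 47.5, 49.0, 52.0]
low, high = 43.0, 48.0
print(sample_proportion(measurements, low, high))         # 5 of 8 values, i.e. 0.625
print(normal_probability(mu=46.0, sigma=3.5, low=low, high=high))
```

The first number describes the sample actually observed, while the second depends entirely on the assumed model, so the two agree only to the extent that the model describes the population well.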
