A pharmacist needs to be aware of how scientific data is generated and interpreted in modern “evidence-based medicine.” This is important not only for the adequate appreciation and interpretation of new research findings, but also for an understanding of conventionally well-established practices in medicine. This section outlines the basic concepts used in the generation and interpretation of data. It assumes background knowledge of experimental design and random sampling.
4.6.1 Measures of Central Tendency
When a collection of data is available, it can be arranged in an array. An array is a collection of data arranged in a systematic manner, such as listing a set of values in an ascending or descending order of their magnitude. The data can be analyzed in terms of its frequency distribution. The frequency distribution is constructed by identifying the number of times a value repeats itself (frequency of occurrence of such value). This information can be plotted in a two-dimensional x–y plot with the x-axis representing the increasing order of values and the y-axis representing their frequency of occurrence. The frequency distribution can also be organized to represent a set of ranges of values, rather than individual values, with the frequency representing all data points that fall within the given ranges. An x–y plot of this range of values can produce a series of columns, called a histogram. These approaches both reduce and organize the data for easy interpretation.
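The construction of an array and a frequency distribution can be sketched in a few lines of Python. The data values and the bin width of 3 below are hypothetical and chosen only for illustration; this is one possible way of tabulating the counts, not a prescribed method.

```python
from collections import Counter

# Hypothetical raw observations (illustrative values only).
values = [3, 1, 2, 3, 5, 3, 2, 4, 8, 9]

# An array: the data arranged in ascending order of magnitude.
array = sorted(values)

# Frequency distribution: how many times each individual value occurs.
frequency = Counter(array)

# Grouped frequency distribution (the basis of a histogram): each value is
# assigned to a range identified by its lower edge (0-2, 3-5, 6-8, 9-11).
bin_width = 3  # assumed range width
binned = Counter((v // bin_width) * bin_width for v in array)

for lower_edge in sorted(binned):
    upper_edge = lower_edge + bin_width - 1
    print(f"{lower_edge}-{upper_edge}: {'#' * binned[lower_edge]}")
```

Each printed row corresponds to one column of the histogram, with the number of "#" marks standing in for the column height.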
[Figure 4.4: x-axis, data values; y-axis, frequency.]
FIGURE 4.4 A normal distribution. Normal distribution of data can be represented by a frequency distribution (histogram), a curve passing through the medians of the frequency distribution, or discrete data points.
74 Pharmaceutical Dosage Forms and Drug Delivery
Frequently, when the data is organized in a frequency distribution, a normal distribution is obtained (Figure 4.4). A review of the normal distribution curve indicates that the data tends to be more frequent for a given set of values, which are usually towards the center of the numerical distribution of data values. This is called central tendency. The numeric location of the central tendency can be stated in one of three ways: mean, median, and mode.
• Mean. The arithmetic mean of a data set is the sum of observations divided by the number of observations. The mean describes the central location of the data.
• Median. The median is the numeric value of a data point that falls in the middle when counting the set of values after arranging them in an ascending or descending order.
• Mode. Mode is the value that occurs most frequently in a set of data.
Each of these values indicates the numeric point in the spread of the data that all observations tend to lean towards, which can be interpreted as the expected value of a data set. The expected value of a distribution is the average, or the first moment, over the entire distribution. The reason why each and every value in the data set is not the expected value is considered to be due to random variation or errors in experimentation or data collection.
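As a brief illustration, the three measures of central tendency can be computed with Python's standard `statistics` module. The data set is hypothetical and deliberately skewed so that the mean, median, and mode differ from one another.

```python
import statistics

# Hypothetical, right-skewed data set (illustrative values only).
data = [1, 2, 2, 3, 4, 7, 9]

mean = statistics.mean(data)      # sum of observations / number of observations
median = statistics.median(data)  # middle value after sorting
mode = statistics.mode(data)      # most frequently occurring value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

For a perfectly symmetric distribution such as the normal distribution, the three values coincide; the more skewed the data, the further apart they tend to lie.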
4.6.2 Measures of Dispersion
In addition to knowing the central tendency of the data, one needs to appreciate the level of distribution or variation in the individual data values. This indicates how closely the data set represents a central tendency or value. For example, the four sets of data represented by the normal distribution curves in Figure 4.5 show increasing level of dispersion from the central tendency in the order a < b < c < d.
The dispersion of a set of data can be quantified by one or more of the following numerical values:
• Range. It represents the difference between the highest and the lowest values in a data set.
• Variance and standard deviation. Variance represents the mean of the squared deviations of all individual values in the data set from the mean of the data set. It is calculated by subtracting each individual value from the mean, squaring the difference, and dividing the sum of these squared differences by n − 1, where n is the number of samples in the data set. Standard deviation is the square root of the variance.
Standard deviation is commonly used to interpret the spread of the data. As indicated in Figure 4.6, assuming a normal sample distribution, the standard deviation of a sample set (symbol: s) indicates the percentage of data set values that fall on either side of the mean value of this data set. As illustrated in the figure, 68.26% of values fall within ±1 s of the mean, 95.44% fall within ±2 s of the mean, and 99.72% fall within ±3 s of the mean. It should be noted that the greater the value of s compared to the mean, the greater the spread of the data. This could indicate lower precision of measurement, greater error in data collection, or both.
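The measures of dispersion described above can be sketched as follows. The data set is hypothetical, and the helper `fraction_within` is an illustrative name introduced here; it uses the error function to compute the fraction of a normal distribution lying within k standard deviations of the mean.

```python
import math
import statistics

# Hypothetical data set (illustrative values only).
data = [1, 2, 2, 3, 4, 7, 9]

data_range = max(data) - min(data)    # highest value minus lowest value
variance = statistics.variance(data)  # sum of squared deviations / (n - 1)
std_dev = statistics.stdev(data)      # square root of the variance

def fraction_within(k):
    """Fraction of a normal distribution within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))

print("range:", data_range)
print("variance:", round(variance, 4))
print("standard deviation:", round(std_dev, 4))
for k in (1, 2, 3):
    print(f"within +/-{k} s: {100 * fraction_within(k):.2f}%")
```

The computed percentages agree with those quoted from Figure 4.6 to within rounding of the last digit.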
4.6.3 Sample Probability Distributions
A probability distribution represents the probability of occurrence of each value of a discrete random variable or the probability of each value of a continuous random variable falling within a given interval. Hence, a probability distribution can be either of the following:
• Discrete probability distribution. It reflects a finite or countable set of possible values whose probabilities sum to one.
• Continuous probability distribution. It reflects the probability of occurrence of a value in terms of its probability density function, which can be defined within an interval.
[Figure 4.5: normal distribution curves for data sets a, b, c, and d; x-axis from −5 to 5, y-axis from 0 to 1.]
FIGURE 4.5 Illustration of variability in four different data sets following normal distribution. The level of dispersion from the central tendency is d > c > a, even though their means are the same. Data set b represents a difference of mean in addition to dispersion.
[Figure 4.6: normal curve with the x-axis marked at −4σ, −3σ, −2σ, −1σ, 0, +1σ, +2σ, +3σ, and +4σ; the areas between successive markers are 0.13%, 2.14%, 13.59%, 34.13%, 34.13%, 13.59%, 2.14%, and 0.13%.]
FIGURE 4.6 Illustration of spread of data (from the hypothetical mean of 0) in a normal distribution as a function of the standard deviation of the population (σ). The probability of finding data values at illustrated multiples of standard deviation is indicated in the figure as a % number.
4.6.3.1 Normal Distribution
The preceding examples assumed a normal frequency or probability distribution of the data set. Normal distribution, also known as the Gaussian distribution, reflects the tendency of the data to cluster around the mean from both directions. It is a continuous probability distribution and forms a typical bell-shaped curve. A data set following a normal distribution is indicative of the additive nature of underlying factors.
4.6.3.2 Log-Normal Distribution
A log-normal distribution refers to the probability distribution of a variable whose logarithm is normally distributed, such that for a variable y, log y is normally distributed. The base of the logarithmic function does not make a difference to the distribution pattern of the variable. A log-normal distribution typically represents a multiplicative effect of underlying factors.
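The link between multiplicative effects and the log-normal distribution can be illustrated with a small simulation. This is only a sketch under assumed conditions: each simulated value is the product of 50 hypothetical random factors drawn uniformly between 0.8 and 1.25, and the random seed is fixed so that the run is repeatable.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

def product_of_factors(n_factors=50):
    """Multiply n_factors hypothetical random factors between 0.8 and 1.25."""
    y = 1.0
    for _ in range(n_factors):
        y *= random.uniform(0.8, 1.25)
    return y

ys = [product_of_factors() for _ in range(5000)]
log_ys = [math.log(y) for y in ys]

# y itself is right-skewed: its mean lies well above its median.
print("y:     mean", statistics.mean(ys), "median", statistics.median(ys))
# log y is roughly symmetric: its mean and median nearly coincide.
print("log y: mean", statistics.mean(log_ys), "median", statistics.median(log_ys))
```

Because the logarithm turns a product of factors into a sum, log y behaves like the additive case described for the normal distribution, while y itself remains skewed.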
4.6.3.3 Binomial Distribution
Binomial distribution is a discrete probability distribution that reflects the number of occurrences of a given outcome in a sequence of independent experiments, each of which has only two possible outcomes and yields the given outcome with a defined probability. Such an experiment is frequently called a success/failure experiment or Bernoulli experiment with n repetitions and p as the probability of each successful outcome.
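A minimal sketch of the binomial probability calculation follows. The function name `binomial_pmf` and the numbers (n = 10 tablets inspected, p = 0.1 probability that any one tablet is defective) are hypothetical and serve only to illustrate the formula P(k) = C(n, k) p^k (1 − p)^(n − k).

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 10   # hypothetical number of tablets inspected
p = 0.1  # hypothetical probability that a single tablet is defective

probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k, prob in enumerate(probabilities):
    print(f"P({k} defective) = {prob:.4f}")
```

Since the number of defective tablets can only be 0, 1, …, n, the listed probabilities cover every possible outcome and sum to one.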
4.6.3.4 Poisson Distribution
Poisson distribution represents the probability of n occurrences of an event over a period of time or space given the average number of occurrences of the event. For example, if the lyophilization process fails on average in five batches per year, the Poisson distribution can be used to calculate the probability of 0, 1, 2, 3, 4, 5, … failed lyophilization processes in a given year. Although both Poisson and binomial distributions are based on discrete random variables, the binomial distribution assumes a finite number of possible outcomes, while the Poisson distribution does not. The Poisson distribution is usually applied in cases where the mean is much smaller than the maximum data value possible, such as radioactive decay.
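The lyophilization example can be worked through with the Poisson formula P(k) = m^k e^(−m) / k!, where m is the average number of occurrences. The function name `poisson_pmf` is introduced here for illustration; the mean of five failures per year is taken from the example above.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """Probability of exactly k occurrences given the average number."""
    return mean**k * exp(-mean) / factorial(k)

mean_failures = 5  # average failed lyophilization batches per year

for k in range(11):
    print(f"P({k} failures in a year) = {poisson_pmf(k, mean_failures):.4f}")
```

Unlike the binomial case, k has no upper limit here; the probabilities for larger k simply become vanishingly small.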
4.6.3.5 Student’s t-Distribution
The Student’s t-distribution is a continuous probability distribution that is used to estimate the mean of a normally distributed population when the sample size is small and the population standard deviation is unknown. The t-distribution is based on the central limit theorem that the sampling distribution of a sample statistic, such as the sample mean (x̄), follows a normal distribution as n gets large. The t-distribution is a continuous probability distribution of the t-statistic or t-score, defined as
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
where
x̄ is the sample mean
μ is the population mean
s is the sample standard deviation
n is the sample size
The shape of the t-distribution varies with the sample size, or the number of degrees of freedom (DF) of the sample. The DF represents the number of values in the final calculation of a statistic that can freely vary, and is calculated as n − 1 for n number of samples. It is used as a measure of the amount of data that is used for the estimation of a given statistical parameter.
The t-distribution is characterized by a mean of 0 and a variance that is always greater than 1. The variance approaches 1 and the t-distribution approaches the standard normal distribution at high sample sizes.
Knowing the sample mean, standard deviation, and size, and the (assumed) population mean, a t-score or t-statistic can be calculated. Each t-score is associated with a unique cumulative probability of finding a sample mean less than or equal to the chosen sample mean for a random sample of the same size. The term tα denotes a t-score that has a cumulative probability of (1 − α). For example, for a cumulative probability of occurrence of 95%, α = (1 − 95/100) = 0.05. Hence, the t-score corresponding to this probability would be represented as t0.05. The t-score for a given probability varies with DF of the sample. Thus, t0.05 at DF of 2 is 2.92, while t0.05 at DF of 20 is 1.725.
Also, since t-distribution is symmetric with a mean of zero, t0.05 = −t0.95, or vice versa.
The t-statistic helps determine the probability of occurrence of a given sample mean when the (hypothetical or target) population mean is known. In other words, it can help determine the probability that the selected sample comes from the population with the given (hypothetical or target) mean. For example, during tablet compression for a target average tablet weight of 100 mg, a sample of 10 tablets is weighed. The average weight of the 10 tablets was 90 mg with a standard deviation of 35 mg. What is the probability that the tablet compression operation is proceeding at its target average tablet weight of 100 mg? To compute this probability, a t-score can be calculated as follows:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{90 - 100}{35 / \sqrt{10}} = -0.9035$$
This t-score corresponds to a 19% probability of occurrence (using standard probability distribution tables). Thus, if the tableting operation is performing at target, there is a 19% chance that the sample mean would fall below 90 based on a sample of 10 tablets. Therefore, there is no evidence that the machine is off target. However, due to the large variability and small sample size, we cannot conclude that it is at target either. A confidence interval would show that the population mean could be any value over a large range that includes 100. Thus, the data is consistent with the tableting unit operation performing at the target average tablet weight of 100 mg. On the other hand, if the sample of 10 tablets had a standard deviation of 15 mg, the t-score would be −2.1082, which corresponds to a probability of occurrence of 3%. This data would indicate that the tableting unit operation is probably not performing at its target average tablet weight of 100 mg.
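The tablet weight calculation can be reproduced without probability tables. The sketch below is one possible approach, not a standard library routine: the helper names `t_score`, `t_pdf`, and `t_cdf` are introduced here, and the cumulative probability is obtained by integrating the t-distribution density numerically with Simpson's rule, with DF = n − 1 = 9.

```python
import math

def t_score(sample_mean, target_mean, s, n):
    """t = (sample mean - population mean) / (s / sqrt(n))."""
    return (sample_mean - target_mean) / (s / math.sqrt(n))

def t_pdf(t, df):
    """Probability density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), by Simpson's rule on [0, |t|] plus symmetry about 0."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

n = 10
for s in (35, 15):  # the two standard deviations used in the example
    t = t_score(90, 100, s, n)
    p = t_cdf(t, n - 1)
    print(f"s = {s} mg: t = {t:.4f}, P(sample mean <= 90) = {p:.3f}")
```

The two printed probabilities should be close to the 19% and 3% read from tables in the example.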
This distribution forms the basis of the t-test of significance, which can help determine
• Statistical significance of the difference between two sample means
• Confidence intervals for the difference between two population means
4.6.3.6 Chi-Square Distribution
Chi-square distribution represents the squared ratio of sample to population standard deviation as a function of the sample size used for computing the sample standard deviation. This distribution is used to estimate the probability ranges for the standard deviation values for a given sample size.
Mathematically, the chi-square (χ²) distribution represents the distribution of the chi-square statistic, which represents the squared ratio of the standard deviation of a sample (s) to that of the population (σ) multiplied by the DF of the sample:
$$\chi^2 = (n - 1) \times \frac{s^2}{\sigma^2}$$
The shape of the chi-square distribution curve varies as a function of the sample size, or the DF. As the number of DF increases, the chi-square curve approaches a normal distribution.
The χ² distribution is constructed such that the total area under the curve is 1. This allows the estimation of cumulative probability of a given value of the χ² parameter. Given this value, the probability of occurrence of the χ² parameter above the obtained value can be obtained.
For example, if the standard deviation obtained for a larger sample (e.g., N = 100) is assumed to be the population standard deviation (σ = 5), one can define the probability of obtaining a sample of a given standard deviation (e.g., s > 6) for a given number of samples tested (e.g., n = 10). This is done by calculating the χ² parameter:
$$\chi^2 = (n - 1) \times \frac{s^2}{\sigma^2} = (10 - 1) \times \frac{6^2}{5^2} = 12.96$$
Using the χ² distribution for the given DF, the probability of occurrence of a χ² parameter below 12.96 is 0.84. Hence, the probability of occurrence of s > 6 is 1 − 0.84 = 0.16, or 16%.
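The same calculation can be checked numerically. The following is a sketch under stated assumptions: the helper names `chi2_pdf` and `chi2_cdf` are introduced here, the cumulative probability is obtained by Simpson's rule integration of the chi-square density, and this simple integration is intended only for DF of 2 or more (the density is unbounded at zero for 1 DF).

```python
import math

def chi2_pdf(x, df):
    """Probability density of the chi-square distribution with df DF."""
    return x ** (df / 2 - 1) * math.exp(-x / 2) / (2 ** (df / 2) * math.gamma(df / 2))

def chi2_cdf(x, df, steps=2000):
    """P(chi-square <= x) by Simpson's rule on [0, x]; assumes df >= 2."""
    h = x / steps  # steps must be even for Simpson's rule
    total = chi2_pdf(0.0, df) + chi2_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_pdf(i * h, df)
    return total * h / 3

n = 10       # number of samples tested
s = 6.0      # sample standard deviation of interest
sigma = 5.0  # assumed population standard deviation

chi2_stat = (n - 1) * s**2 / sigma**2
p_below = chi2_cdf(chi2_stat, n - 1)

print(f"chi-square statistic = {chi2_stat:.2f}")
print(f"P(chi-square below statistic) = {p_below:.3f}")
print(f"P(s > 6) = {1 - p_below:.3f}")
```

The printed probabilities should agree with the tabulated values of about 0.84 and 0.16 used in the example.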