Binding Database
The goal of this step is to associate a transcription factor with given gene sets based on the binding strength of that transcription factor with the promoter region of the genes in each gene set. In other words we want to find out whether a transcription factor may regulate a given gene set. To achieve this we ran a functional class scoring with the transcription factor binding signals associated to human genes and the gene sets obtained from MSigDB and KEGG.
As described in the previous chapter, a functional class scoring approach has three steps. First is to calculate the gene level statistic. Next step is to find a gene set level statistic using the gene level statistic and finally using null hypothesis testing we find out the statistical significance of the gene sets.
3.3. REPA DESCRIPTION
A step by step description of the process follows.
Inputs
1. Gene sets in the format described in the section 3.3.1.
2. Transcription factor binding data from the output obtained in step 1 de- scribed in section 3.3.2.
Step 2.1: Parsing and loading input files The two input files are parsed and loaded in memory. A hash table is used to store the data to quickly access the values based on transcription factor name and the gene’s Entrez Id.
Steps 2.2 through 2.6 are performed for each transcription factor - gene set pair. Step 2.2: Map TFBSs to strongest signal Get the signal values from the tran- scription factor binding sites database for each of the gene in the gene set. If multiple values exist from different cell lines, the maximum value is considered. If there are values for at least 5 genes in the gene set we proceed to the next step else we skip to the next gene set - transcription factor pair.
Steps 2.3 and 2.4 are executed 1000 times.
Step 2.3: Generate vector with random signal values To run the permutation test, we first generate a gene set of randomly selected genes of the same size as that of the real gene set. We collect the signal values for this randomly chosen gene set from the transcription factor binding database as done in step 2.2 for the real gene sets.
Step 2.4: Statistical hypothesis testing We use the following three statistical tests in our program with the real gene set values and the random gene set values as input.
3.3. REPA DESCRIPTION
Two-sample pooled test (ALGLIB project ; Broad 2014)
This test checks hypotheses about the fact that the means of two random variables X and Y which are represented by samples xS and yS are equal.
In our case xS represents the signals from real gene sets we found in step
2.2 and yS represents the randomly selected signal in the step 2.3. The test
works correctly under the following conditions:
• Both random variables have a normal distribution • Dispersions are equal (or slightly different)
• Samples are independent.
During its work, the test calculates t-statistic:
t =
xS−yS r P (xi−xS)2+P (yi−yS)2 Nx+Ny−2 1 Nx+Ny1NX and NY are the sizes of X and Y respectively.
Note 1 If X and Y have a normal distribution, the t-statistic will have Student’s distribution with NX+NY−2 degrees of freedom. This allows the
use of the Student’s distribution to define a significance level corresponding to the value of the t-statistic.
Note 2 If X or Y are not normal, t will have an unknown distribution and, strictly speaking, the t-test is inapplicable. However, according to the central limit theorem, as the sample sizes increase, the distribution of t tends to be normal. Therefore, if sample sizes are big enough, we can use the t-test even if X or Y is not normal. But there is no way to find what
3.3. REPA DESCRIPTION
values for NX and NY are big enough. These values depend on how X and
Y deviate from the normal distribution.
After running this test we store the right-tailed p-value. The null hypoth- esis for the right-tailed test is that the mean of yS is less than or equal to
the mean of xS.
Two-sample unpooled test (ALGLIB project ) Similar to the two-sample
pooled test, this test checks hypotheses about the fact that the means of two random variables X and Y which are represented by samples xS and
yS are equal. The test works correctly under the following conditions:
• Both random variables have a normal distribution • Samples are independent.
Unlike the previous test, for this test, dispersion equality is not required. During its work, the test calculates the t-statistic:
t =
xS−ySr
V ar(xS)
NX +V ar(yS)NY
If X and Y have a normal distribution, the t-statistic will have Student’s distribution with DF degrees of freedom:
DF =
(NX−1)(NY−1) (NY−1)c2+(N X−1)(1−c2)c =
V ar(XS) NX V ar(XS) NX +V ar(XS)NY3.3. REPA DESCRIPTION
This allows the use of the Student’s distribution to define the significance level corresponding to the value of the t-statistic. Note 2 from the two sampled pooled test section is also applicable to this test.
Mann-Whitney U test The Mann-Whitney U-test (Mann, Whitney, et al. 1947; McKnight and Najab 2010; Bochkanov and Bystritsky 1947) is a non-parametric method which is an alternative to the two-sample Student’s t-test. This test is used to compare medians of non-normal distributions X and Y . The test works correctly under the following conditions:
• X and Y are continuous distributions (or discrete distributions well- approximating continuous distributions)
• X and Y have the same shape. The only possible difference is their position (i.e. the value of the median)
• The number of elements in each sample is not less than 5 • The samples are independent
• Scale of measurement should be ordinal, interval or ratio (i.e. test could not be applied to nominal variables)
Here is a simple step by step description of how the Mann-Whitney U test works:
• Both samples (having sizes N and M) are combined into one array which is sorted in ascending order keeping track of which sample the element had come from.
• After sorting, each element is replaced by its rank (its index in array, from 1 to N + M ).
• Then the ranks of the first sample elements are summarized and the U-value is calculated using the following formula:
3.3. REPA DESCRIPTION
U = N × M +
N (N +1)2−P
xi
Rank(x
i)
The mean of U equals 0.5 × N × M . If U is close to this value, the medians of X and Y are close to each other. If we know distribution quantiles, we can get the significance level corresponding to the value of U .
• For a big enough N and M , U could be approximated by the normal distribution with a mean of 0.5 × N × M and a standard deviation of
σ =
q
N ×M (N +M +1) 12
The p-value is calculated from the mean and standard deviation.
All the three tests mentioned above namely two-sample pooled test, two-sample unpooled test and Mann-Whitney U-test returns three p-values:
p-value for two-tailed test The null hypothesis here is that the medians for the two samples are equal.
p-value for left-tailed test The null hypothesis here is that the median of yS
is greater than or equal to the median of xS.
p-value for right-tailed test The null hypothesis is that the median of yS is
less than or equal to the median of xS.
Step 2.5 For three pre-determined thresholds of significance namely 0.05, 0.01 and 0.001 we count number of times the mean / median of the values of the actual gene set is to the right of the mean / median of the values of the random gene set with a p-value of:
• less than 0.001
3.3. REPA DESCRIPTION
• between 0.05 and 0.01 • greater than 0.05
Transcription Factor E2F6
Gene set name DOANE BREAST CANCER CLASSES UP
Total genes in the gene set 72
Genes for which values were found 14
Pooled tTest [< 0.001] 11 UnPooled tTest [< 0.001] 7 MWU test [< 0.001] 5 Pooled tTest [< 0.01] 74 UnPooled tTest [< 0.01] 75 MWU test [< 0.01] 70 Pooled tTest [< 0.05] 238 UnPooled tTest [< 0.05] 237 MWU test [< 0.05] 249
Pooled tTest [Remaining] 677
UnPooled tTest [Remaining] 681
MWU test [Remaining] 676
Pooled tTest Average p-value 0.1937
UnPooled tTest Average p-value 0.194217
MWU test Average p-value 0.18896
Genes in gene set for
which values were found 347, 253190, 51181 . . . . Table 3.4: Sample of the gene set - transcription factor binding database
Results After running the program for all the transcription factors and gene sets available, the generated results are as shown in table 3.4. Due to space con- strains, the columns are presented as rows and rows as columns.