Finally, we describe how we construct portfolios from the fundamental signals. The portfolios sorted on fundamental signals are rebalanced annually at the end of June, based on signals from the fiscal year ending in the previous calendar year. They are either value- or equal-weighted and are constructed by buying stocks in the top decile of a signal and shorting stocks in the bottom decile. Portfolios based on published anomalies are always constructed to have positive returns, in line with the findings in the original papers.15 The portfolios are zero-cost and their returns correspond to the monetary payoff each month. They thus differ from what an investor would earn by trading on the signals, because the investor would have to hold some collateral. The reason for this choice is that the value of the collateral would often drop below zero within the 12 months before the annual rebalancing. The only solution would be to introduce leverage constraints and more frequent rebalancing, which would unnecessarily complicate the analysis.
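The construction described above can be illustrated for a single period. This is a minimal sketch, not the procedure used in the study: the function name `decile_long_short`, the dictionary inputs, and the sample data are hypothetical, and the decile cut-offs are taken over all supplied stocks, which is an assumption.

```python
def decile_long_short(signals, returns, market_caps=None):
    """One-period return of a zero-cost portfolio: long top decile, short bottom decile.

    signals, returns, market_caps: dicts keyed by a stock identifier.
    market_caps=None gives equal weights; otherwise weights are proportional
    to market capitalisation (value weights).
    """
    ranked = sorted(signals, key=lambda s: signals[s])  # ascending by signal
    n = max(1, len(ranked) // 10)                       # stocks per decile
    bottom, top = ranked[:n], ranked[-n:]

    def leg_return(stocks):
        if market_caps is None:
            weights = {s: 1.0 for s in stocks}
        else:
            weights = {s: market_caps[s] for s in stocks}
        total = sum(weights.values())
        return sum(weights[s] * returns[s] for s in stocks) / total

    # Long leg minus short leg: the payoff of a zero-cost position.
    return leg_return(top) - leg_return(bottom)


# Hypothetical data: 20 stocks whose return rises with the signal.
signals = {f"S{i}": float(i) for i in range(20)}
returns = {f"S{i}": 0.001 * i for i in range(20)}
caps = {f"S{i}": float(i + 1) for i in range(20)}

equal_weighted = decile_long_short(signals, returns)
value_weighted = decile_long_short(signals, returns, caps)
```

Passing `market_caps` switches from equal to value weights, which mirrors the two weighting schemes mentioned in the text.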
2.2 Multiple Hypothesis Tests - Bootstrap Methods
When testing the statistical significance of new anomalies, it is important to take into account the full universe of potential anomalies and to try to include those that have not been published. The justification is simple: the value of the t-statistic required for significance is higher when 20 strategies are tested than when only one is tested. The difference arises because, at the 5% significance level, there is on average one false positive discovery among 20 tested strategies when none of them is truly profitable. The false discovery appears to be significant in its individual test, while in fact it is not. It is important to control for these false discoveries in order to maintain the same rate of type I errors in the statistical tests.
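The arithmetic behind the 20-strategy example can be checked directly. The sketch below assumes that all null hypotheses are true and, for the second function, that the tests are independent; both assumptions and the function names are illustrative rather than taken from the study.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false rejections when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false rejections, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_expected = expected_false_positives(20)            # about one false discovery
p_any = prob_at_least_one_false_positive(20)         # roughly 0.64
```

Under these assumptions, a researcher who tests 20 unprofitable strategies has roughly a 64% chance of finding at least one that looks significant at the 5% level.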
Datastream classification which sorts industries into 19 groups instead. This has one main reason. The industry classification in Datastream is available only from the static file which means that only the latest value is available. Variation over time for individual firms between closely related SIC codes would thus cause problems.
14 Constructing the portfolios on the large-cap universe, but with the same restrictions as in the original studies, has no effect on the main results of this study.
Every test in a classical statistical framework is framed in terms of type I (size) and type II (power) errors.
                 Null hypothesis
Decision         True             False
Reject           Type I error     OK
Not reject       OK               Type II error
The goal is to select a test that has the required size, typically 5%, and the largest possible power. There is always some trade-off between power and size unless the sample size is increasing. Tests with a smaller size tend to under-reject truly significant, and thus profitable, signals. In the present study, this means that fewer fundamental signals are deemed significant. It is therefore important to apply appropriate methods with the largest possible power.
Harvey et al. (2016) studied the problem of identifying significant anomalies in a multiple hypothesis setting. They collected p-values reported in the original studies and generated a hypothetical sample of p-values for all tried signals, thereby recreating the original sample of p-values before most of the tried strategies were discarded. However, that sample of p-values depends on strong underlying assumptions about the structure of correlation among the anomalies. We take a more structured approach in this study by generating a universe of possible data-mined fundamental signals instead. This allows us to study the relation between individual anomalies in much greater detail. Specifically, it allows us to study the role of cross-sectional dependence between the signals. There are 93 published and 772, 1,497, or 48,387 data-mined signals in our sample, a ratio of about 1:8, 1:16, or 1:520, respectively. This should provide a very reasonable setting for the multiple hypothesis tests. Harvey et al. (2016) estimated that 71.1% of the tried signals were not published, which translates to about 322 signals overall in our case with 93 published anomalies. This is fewer than the 865 signals (93 published plus 772 data-mined) in our smallest universe, but we will show that the main results do not depend on the number of data-mined signals, and the larger number is more reasonable given the number of active researchers in the area over the years.
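The figures quoted in this paragraph follow from simple arithmetic, reproduced here as a check. The variable names are illustrative; the inputs are the numbers stated in the text.

```python
published = 93
unpublished_share = 0.711  # estimate attributed to Harvey et al. (2016) in the text

# If 71.1% of tried signals were never published, the 93 published ones
# make up the remaining 28.9% of all tried signals.
implied_total = published / (1.0 - unpublished_share)

# Data-mined signals per published anomaly for the three universes.
data_mined_per_published = {n: n / published for n in (772, 1497, 48387)}

smallest_universe = published + 772
```

Rounded, `implied_total` is 322, the three ratios are 8, 16, and 520, and `smallest_universe` is 865, matching the text.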
Harvey et al. (2016) reported: "We find that the difference in rejections rates produced by single and multiple hypothesis testing is such that most rejections of the null of no out-performance under single hypothesis testing are likely false." They then propose that the proper cut-off for t-statistics should be three. We will show in the rest of this section that this conclusion depends greatly on the precise specification of the tests. In the most favourable setting, 63% of the anomalies are significant and the cut-off t-statistic is close to two, whereas none of the anomalies is significant in the most conservative setting.
There are many simple methods for correcting individual p-values to make them valid in a multiple testing framework, but these usually lead to poor power.16 Harvey et al. (2016) had to rely on these methods since they did not have ready access to the original data. We present three of the most frequently used methods. The simplest method is the Bonferroni correction,
where the p-values of the individual tests are multiplied by the number of tests (M). Equivalently, the individual p-values have to be M times smaller than the size required in single hypothesis tests. Holm (1979) provided a refinement by introducing a stepwise method in which all the p-values are ordered from smallest to largest and the penalty decreases with their size. Specifically, for tests of size $\alpha$, the method proceeds from the smallest p-value and rejects hypothesis $i$, $1 \le i \le M$, as long as $p_i (M + 1 - i) < \alpha$, stopping at the first p-value that fails this condition. It tends to reject additional hypotheses that are true positives and is less strict for larger p-values. Benjamini and Yekutieli (2001) provide a further refinement. The test again proceeds by first sorting the p-values from the smallest to the largest so that $p_1 \le p_2 \le \dots \le p_i \le \dots \le p_M$. False discovery rate (FDR)
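adjusted p-values are determined by backward induction, where

$$p^{FDR}_M = p_M \sum_{1 \le j \le M} \frac{1}{j} \quad \text{and} \quad p^{FDR}_i = \min\left( p^{FDR}_{i+1},\; p_i \frac{M}{i} \sum_{1 \le j \le M} \frac{1}{j} \right). \tag{2.1}$$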
The individual hypotheses are rejected with an FDR of 5% if their adjusted p-values $p^{FDR}_i$ are smaller than 5%.
The methods presented so far have focused on the standard testing framework that controls the probability of at least one false positive discovery (type I error), but in practice this rapidly becomes too strict. The approach in which we control the probability of even one false positive discovery is denoted the family-wise error rate (FWER). This requirement becomes too restrictive when there are many signals, as is the case here, and it is advantageous to allow for some false discoveries if doing so leads to the acceptance of many true discoveries. In our case of trading strategies, this means that several unprofitable strategies are accepted in order to select many more truly profitable strategies. The increase in the number of profitable strategies should lead to a more profitable meta-strategy. This approach to testing is defined by the maximum FDR, which is the expected proportion of false positive discoveries among all signals that were deemed significant. The rest of this section discusses FDR methods that require the bootstrap but should lead to greater power in the tests.
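The distinction between the two error criteria can be made concrete with a toy example in which the truth is known. The function names and data below are hypothetical, and the realised proportion computed here is the quantity whose expectation the FDR bounds.

```python
def false_discovery_proportion(rejected, truly_null):
    """Share of rejected hypotheses whose null hypothesis is actually true."""
    n_rejected = sum(rejected)
    if n_rejected == 0:
        return 0.0  # convention: no rejections means no false discoveries
    false_pos = sum(1 for r, null in zip(rejected, truly_null) if r and null)
    return false_pos / n_rejected


def any_false_discovery(rejected, truly_null):
    """True if at least one true null was rejected (the event FWER guards against)."""
    return any(r and null for r, null in zip(rejected, truly_null))


# Hypothetical outcome: 40 signals, the first 20 deemed significant,
# exactly one of which (index 0) is in fact unprofitable.
rejected = [True] * 20 + [False] * 20
truly_null = [True] + [False] * 19 + [True] * 20

fdp = false_discovery_proportion(rejected, truly_null)
fwer_event = any_false_discovery(rejected, truly_null)
```

Here the false discovery proportion is 5%, which an FDR criterion at 5% tolerates on average, while an FWER criterion counts the same outcome as a failure because one false discovery occurred.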