PONTIFICIA UNIVERSIDAD CATÓLICA DE CHILE
FACULTAD DE MATEMÁTICAS – DEPARTAMENTO DE ESTADÍSTICA

Interfaces Between Statistical Learning and Risk Management

By Rodrigo Esteban Rubio Varas
January 2020

A dissertation submitted to the Department of Statistics, Faculty of Mathematics, in partial fulfillment of the requirements for the degree of Doctor in Statistics.

UC advisor: Manuel Galea, Pontificia Universidad Católica de Chile
External advisor: Miguel de Carvalho, The University of Edinburgh
External co-advisor: Raphaël Huser, King Abdullah University of Science and Technology

Members of the Examination Committee: Manuel Galea (Pontificia Universidad Católica de Chile), Rodrigo Herrera (Universidad de Talca), Alejandro Jara (Pontificia Universidad Católica de Chile), Wilfredo Palma (Pontificia Universidad Católica de Chile).

Acknowledgments

First of all I would like to thank God, who has given me life and the opportunity to develop my skills and creativity during my stay in the Doctoral Program in Statistics. My most sincere and deepest thanks go to all those who have been by my side over this journey, including the Faculty of the Statistics Department of Pontificia Universidad Católica de Chile. I am especially grateful to my supervisor and tutor, Professor Miguel de Carvalho, director of this dissertation, for the guidance, monitoring and supervision delivered, as well as for the motivation, support, and encouragement throughout this journey. I am eternally grateful to Miguel and will forever treasure each of the moments we spent together. I would also like to thank my co-advisor, Professor Raphaël Huser, and colleague, Professor Manuele Leonelli, for their support and motivation as well as for their collaboration in the research developed over the course of this thesis. Many thanks to José Quinlan and Bastián Galasso for discussions, suggestions, and comments, as well as for their unconditional and fraternal friendship. I would also like to extend my gratitude to my colleagues in the Doctoral Program in Statistics and the Doctorate in Mathematics, especially Erik Contreras, Álvaro Ferrada and Diana Torres, for their friendship, support and collaboration. This thesis was developed under the umbrella of the FCT project PTDC/MATSTA/28649/2017. I would also like to thank the Conicyt and VRI programs for partially funding this thesis. Finally, I would like to thank my wife, Daniela Saavedra, for her forbearance, patience, support and, above all, for her love. Thanks everyone! Rodrigo Rubio V. Santiago, Chile.

Abstract

The recent hype on Artificial Intelligence, Data Science, and Machine Learning has been leading to a revolution in the industries of Banking and Finance. Motivated by this revolution, this thesis develops novel statistical methodologies tailored for learning about financial risk in the Big Data era. Specifically, the methodologies proposed in this thesis build on ideas, concepts, and methods that relate to cluster analysis, copulas, and extreme value theory. I start this thesis by working in the framework of extreme value theory and propose novel statistical methodologies that identify the time series which resemble each other the most in terms of the magnitude and dynamics of their extreme losses. A cluster analysis algorithm is proposed for the setup of heteroscedastic extremes as a way to learn about the similarity of extremal features of time series. The proposed method pioneers the development of cluster analysis in a product space between a Euclidean space and a space of functions. In the second contribution of this thesis, I introduce a novel class of distributions, to which we refer as diagonal distributions. Similarly to the spectral density of a bivariate extreme value distribution, the latter class consists of a mean-constrained univariate distribution function on [0, 1], which summarizes key features of the dependence structure of a random vector. Yet, despite their similarities, spectral and diagonal densities are constructed from very different principles. In particular, diagonal densities extend the concept of a marginal distribution by suitably projecting pseudo-observations onto a line segment; diagonal densities also have a direct link with copulas, and their variance has connections with Spearman's rho. Finally, I close the thesis by proposing a density ratio model for modeling extreme values of non-identically distributed observations. The proposed model can be regarded as a proportional tails model for multisample settings. A semiparametric specification is devised to link all elements in a family of scedasis densities through a tilt from a baseline scedasis. Inference is conducted using empirical likelihood methods.

List of Figures

1.1 Simulated independent observations from the nonstationary GEV model in (1.3). The top panels correspond to GEV(t, 2t, 1) (left) and GEV(t², 2t, 1) (right); the bottom panels correspond to GEV(sin² t, t, 1) (left) and GEV(cos t², t, 1) (right).
1.2 Simulated trajectories of random functions belonging to three different families.
1.3 Simulated data from Gaussian, t, Clayton and Gumbel bivariate copulas.
2.1 Outputs of a simulated single-run experiment. Scenarios A, B and C (left to right) are presented for sample sizes T = 500, 1000, 2000, 5000 (top to bottom). Scedasis function estimates (gray) and true scedasis functions (solid black) defined in Table 2.1 are displayed. The dashed blue lines depict the cluster center estimates produced by our algorithm.
2.2 Illustration of the elbow method. The graph displays the sum of squared errors W_K in (2.16) for Scenarios A, B (left), and C (right) described in Section 2.7.1 with T = 5000 and α = 0.5, as a function of the number of clusters K = 1, ..., 8, and averaged over 1000 Monte Carlo simulations. The blue and red curves correspond to Scenarios A and B respectively, while the magenta curve corresponds to Scenario C.
2.3 Negative log-returns for 26 selected stocks.
2.4 Sum of squared errors W_K in (2.16) for the data described in Section 2.9.2, plotted as a function of the number of clusters K = 1, ..., 20 for various values of the weight parameter α = 0, 0.1, ..., 0.9, 1 (bottom to top).
2.5 (i) Scedasis function estimates (gray lines) for the stocks under analysis; the dashed blue lines correspond to cluster center estimates obtained by our method. (ii) Mode mass function. The gray rectangles correspond to contraction periods in the US economy, as dated by the NBER. The black solid line represents the mode mass function using ν = 75; recall (2.15). The dashed and dotted lines represent the mode mass function with ν = 100 and ν = 120, respectively. The dots along the x-axis correspond to the modes of the estimated scedasis functions.
2.6 Extremal index estimate for the stocks under analysis for different values of the threshold u.
2.7 Empirical extremogram for stocks 1–9 (left to right, top to bottom). The threshold is chosen as in the application in Section 2.9.
2.8 Empirical extremogram for stocks 10–26 (left to right, top to bottom). The threshold is chosen as in the application in Section 2.9.
2.9 Scedasis functions (grey) partitioned per cluster, and cluster center scedasis functions (blue); clustering is obtained with the K-means algorithm for heteroscedastic extremes using K = 3 and with α = 0, 0.1, 0.3 and 0.5, from top to bottom, respectively.
2.10 Scedasis functions (grey) partitioned per cluster, and cluster center scedasis functions (blue); clustering is obtained with the K-means algorithm for heteroscedastic extremes using K = 3 and with α = 0.7, 0.9 and 1, from top to bottom, respectively.
2.11 Scedasis functions (grey) partitioned per cluster, and cluster center scedasis functions (blue); clustering is obtained with the K-means algorithm for heteroscedastic extremes using K = 5 and with α = 0, 0.1, 0.3 and 0.5, from top to bottom, respectively.
2.12 Scedasis functions (grey) partitioned per cluster, and cluster center scedasis functions (blue); clustering is obtained with the K-means algorithm for heteroscedastic extremes using K = 5 and with α = 0.7, 0.9 and 1, from top to bottom, respectively.
2.13 Value-at-risk functions (grey) partitioned per cluster, and value-at-risk cluster center function (blue), obtained with the K-geometric means algorithm for heteroscedastic extremes with K = 3 and p = 0.95.
2.14 Value-at-risk functions (grey) partitioned per cluster, and value-at-risk cluster center function (blue); clustering is obtained with the K-geometric means algorithm for heteroscedastic extremes with K = 5 and p = 0.95.
2.15 Extended analysis: scedasis functions (grey) partitioned per cluster and cluster center scedasis functions (blue), obtained with the K-means algorithm for heteroscedastic extremes using K = 9 and with α = 0.5.
2.16 Extended analysis: value-at-risk functions (grey) partitioned per cluster and value-at-risk cluster center functions (blue), obtained with the K-geometric means algorithm for heteroscedastic extremes with K = 9 and p = 0.95.
3.1 Diagonals and projection set: diagonals d_{π/6}, d_{π/4}, and d_{π/3} along with projection set G (gray) as defined in (3.2).
3.2 Diagonal distribution function bounds (gray solid lines). The solid black line (F_θ⁺) corresponds to the perfect positive dependence diagonal distribution function. R1 to R6 correspond to the regions delimited by the black and gray lines.
3.3 Normal, Farlie–Gumbel–Morgenstern, and Clayton main diagonal densities from Examples 3.2.1, 3.2.3 and 3.2.4: for each configuration, pseudo-observations are projected over the main diagonal (above), and the true main diagonal densities are shown along with the mean-constrained histogram to be introduced in Section 3.3.
3.4 Left: Bates diagonal density function for D = 2, ..., 20 (less to more concentrated). Right: Monte Carlo approximation of the true diagonal density (black solid lines) from Example 3.2.5 obtained using (3.20) with N = 5000 for D = 3, along with the true densities (grey lines) corresponding to α = (1, 1, 1, 0), α = (0, 0, 0, 1) and α = (0, 0, 0, −1), respectively.
3.5 Left: diagonal densities from Example 3.2.6; the solid, dashed and dotted lines correspond to Σ1, Σ2, and Σ3 as defined in (3.21). Right: diagonal densities from Example 3.2.7; the solid, dashed and dotted lines correspond to θ = 1, θ = 2 and θ = 5, respectively.
3.6 Farlie–Gumbel–Morgenstern, Normal, and Clayton main diagonal density estimation from Examples 3.2.1, 3.2.3 and 3.2.4: for each configuration, the true main diagonal densities using (3.20) (solid gray lines) along with the smoothed mean-constrained density estimator (dashed line) and the beta-kernel density estimator (3.47) (dotted line) introduced in Section 3.3.
3.7 Negative log-returns for assets under analysis.
3.8 Mean-constrained smooth estimator (solid line), mean-constrained histogram estimates, and the Bates diagonal density (gray) with D = 5 for FAANG (left), MCTQM (middle), and crypto-assets (right). The black solid line (each panel) corresponds to the naive mean-constrained estimator.
4.1 One-shot experiment: estimated integrated scedasis obtained using the scedasis density ratio model (grey line), empirical integrated scedasis (black line) and true integrated scedasis (blue line); the reference diagonal dashed line corresponds to the case of constant frequency of extreme losses over time.
4.2 Negative log-returns of 8 leading cryptocurrencies.
4.3 Estimated integrated scedasis obtained using the scedasis density ratio model (solid) and empirical integrated scedasis (grey); the reference diagonal dashed line corresponds to the case of constant frequency of extreme losses over time, and the rug of points represents the times of exceedances.

List of Tables

1.1 Summary of three one-parameter (α) Archimedean copulas for D > 2.
2.1 Data-generating scenarios. For Scenarios A, B and C, each row reports the number of clusters in the product-space (K), the number of clusters in the respective profile-spaces (K_c and K_γ), the different scedasis functions (c_i, i = 1, 2, 3) and extreme-value indices (γ_j, j = 1, 2) involved, and the cluster sizes N_ij. By abuse of notation, N_ij here refers to the number of time series simulated independently in a specific cluster characterized by the scedasis function c_i and extreme-value index γ_j.
2.2 Validity measures for Scenarios A, B, and C as defined in Section 2.7.1, obtained from 1000 Monte Carlo simulations. The Rand and silhouette indices evaluate the clustering of the K-means algorithm for heteroscedastic extremes as a function of the sample size T = 500, 1000, 2000, 5000 and the weight parameter α = 0.1, 0.3, 0.5, 0.7, 0.9. Values closer to one correspond to better performance.
2.3 Estimated scale parameters (â), extreme-value indices (γ̂), and scedasis functions (ĉ) (solid black) plotted over time, for the stocks under analysis and corresponding economic sectors. The dashed blue curves correspond to the scedasis cluster centers estimated by our algorithm. The daggers (†) represent FTSE 100 companies.
2.4 Cluster center estimates corresponding to the nine extreme-value indices, for the partition of stocks obtained with α = 0.5. Ticker symbols are the same as in Table 2.3.
2.5 Nine clusters of heteroscedastic extremes of companies over a grid of values of α. Ticker symbols are the same as in Table 2.3.
2.6 Nine clusters of value-at-risk functions of companies over a grid of values of p. Ticker symbols are the same as in Table 2.3.
2.7 Extremal index estimate θ̂ for the stocks under analysis. The threshold is chosen as in the application in Section 2.9.
3.1 Mean integrated squared error estimates computed over 1000 samples for the data-generating configurations in Scenarios A, B and C with different sample sizes (500, 1000, 2000 and 5000) and choices of the parameters α, β and ρ, where MCH is the estimator in (3.25), NMC is the estimator in (3.34), and BMC is a beta kernel estimator.
3.2 Effective number of dimensions: D̂_eff and I(D̂_eff) estimated over the sets under study.
3.3 Expressions for pseudo-observations T_i for some well-known kernels, under the probit transformation; see (3.48) for details on notation.
4.1 Monte Carlo average total variation computed over 1000 samples for the data-generating configurations in Section 4.3.1 with sample sizes 500, 1000, 2000 and 5000.
4.2 Estimated tilting parameters, extreme value index, and standard errors. Bitcoin is the baseline cryptocurrency, with estimated extreme value index of 0.35. Tilting parameters are estimated as in (4.11) and the extreme value index is estimated using the Hill estimator.

Contents

Acknowledgments
Abstract
List of Figures
List of Tables

1 Introduction and background
  1.1 Introduction
  1.2 Background on extreme value theory
  1.3 Cluster analysis
  1.4 Learning about dependence via copulas
  1.5 Background on empirical likelihood and on NPMLE
  1.6 Problems to be addressed and main contributions
  1.7 Thesis outline, structure, and organization

2 Cluster analysis for heteroscedastic extremes
  2.1 Introduction
  2.2 K-cluster proportional tails models
  2.3 Estimation and inference
  2.4 Similarity-based clustering for heteroscedastic extremes
  2.5 Clustering risk loss patterns
  2.6 Mode mass function
  2.7 Numerical experiments
    2.7.1 Simulation setting and preliminary experiments
  2.8 Monte Carlo simulations
  2.9 A case study on the London stock exchange
    2.9.1 Data description and exploratory analysis
    2.9.2 Preliminary analysis and selection of stocks
    2.9.3 Do clusters mirror economic sectors?
    2.9.4 Economic contraction periods and the mode mass function
  2.10 Final remarks
  2.11 Appendix
    2.11.1 Derivation of cluster centres in product-space
    2.11.2 Clustering performance measures
    2.11.3 Diagnostics of extremal dependence
    2.11.4 Sensitivity analysis
    2.11.5 Extended analysis

3 Diagonal distributions
  3.1 Introduction
  3.2 Marginals and diagonals
    3.2.1 Diagonals
    3.2.2 Main diagonal
    3.2.3 D-dimensional extensions
    3.2.4 Effective number of dimensions
  3.3 Estimation
    3.3.1 Mean-constrained histogram
    3.3.2 Smoothed mean-constrained density estimators
  3.4 Simulation study
    3.4.1 Data generating processes and preliminary experiments
    3.4.2 Monte Carlo study
  3.5 Financial data application
    3.5.1 Background, context, and motivation for the analysis
    3.5.2 Diagonal densities
  3.6 Discussion
  3.7 Appendix

4 Exponential tilts for proportional tail model
  4.1 Introduction
  4.2 Scedasis density ratio model
    4.2.1 Multisample heteroscedastic extremes
    4.2.2 Semiparametric modeling of families of scedasis measures
    4.2.3 Estimation
  4.3 Simulation study
    4.3.1 Preliminary experiments and computing
  4.4 Monte Carlo simulation study
  4.5 Application to cryptocurrency data
    4.5.1 Motivation for the analysis and data description
    4.5.2 Modeling frequency and magnitude of extremes

5 Discussion
  5.1 Final notes and comments
  5.2 Self-criticism
  5.3 Plans and directions for future research

Bibliography

Chapter 1

Introduction and background

This chapter outlines and introduces the Statistical Learning toolbox required for introducing and solving the open problems to be addressed in this thesis.

1.1 Introduction

This thesis lies at the interface between Statistical Learning (Hastie et al., 2009; Vapnik, 2013) and Quantitative Risk Management (McNeil et al., 2015). These two modern fields of Statistics have bold ambitions and goals, both in terms of theory and applications. Roughly speaking, Statistical Learning is the bridge between Statistics and Machine Learning, whereas Quantitative Risk Management is concerned with assessing and managing risk from a statistically-oriented perspective. And why are we interested in the interface between these two fields? At the moment, the industries of Banking and Finance are experiencing a revolution in terms of the quantitative techniques that are employed for learning from data (Kolanovic and Krishnamachari, 2017). This paradigm shift has been driven by the exponential increase in the amount of data¹ and has led investors to change their strategies, adopt new methods of analysis, and address problems from a "Big Data" and "Machine Learning" perspective. Financial companies capitalize on the opportunities offered by these tools to boost growth and innovation, as well as to manage the new and challenging regulatory environment and to improve efficiency and productivity.

¹ Around 2015, George Lee (Goldman Sachs) claimed that 90% of the world's data had been created in the last two years (http://www.goldmansachs.com/our-thinking/trends-in-our-business).

For many, this paradigm shift is part of a "Fourth Industrial Revolution", where automation and the study and analysis of large amounts of data are becoming the main protagonists of the new business models in the financial industry. Instigated by these revolutions, this thesis proposes novel statistical methods to learn about financial risk. Specifically, the contributions of this thesis exploit and develop ideas, concepts, and methods in cluster analysis, copula modeling, and extreme value theory. Below, I will introduce some background on these fields and will describe the main contributions of this thesis.

1.2 Background on extreme value theory

Extremes of stationary sequences

The main contribution of Chapter 2 will require ideas and methods from extreme value theory, and thus we start with some preparations on the subject.

Extreme value theory is a field of Statistics focused on modeling rare but catastrophic events associated with the tails of a distribution (Coles, 2001; Beirlant et al., 2004; de Haan and Ferreira, 2006; Resnick, 2007; Davison and Huser, 2015a). The statistical analysis of extreme events is applied in many fields, such as Hydrology (floods, maximum expected rainfall over the next years) (Katz et al., 2002; Thibaud et al., 2013; Huser and Davison, 2014; Castro-Camilo and Huser, 2019), Climatology (hurricanes, extreme changes in temperature) (Gong, 2012; Huser and Genton, 2016; Huser, 2019), Public Health (Vettori et al., 2019a,b), Engineering (resistance of materials) (Castillo, 2012), and Finance (stock market crashes) (Danielsson and de Vries, 1997; Longin and Solnik, 2001; Poon et al., 2003; Herrera and Schipp, 2013; Hilal et al., 2014; Chavez-Demoulin et al., 2014), among others. Within this latter context (i.e. Finance), and specifically in the study of the returns of a financial asset, extreme value theory may be useful to accurately assess rare but catastrophic stock market crashes, which is key from the point of view of risk (Brodin and Klüppelberg, 2014; Longin, 2016). Below, I will follow closely Coles (2001). There are two classical methods for modeling

extreme events. The first method is based on fitting the distribution of maximum or minimum values, while the second method is based on the analysis of observations that exceed a large threshold. Let us start with the first method; in this case the interest is in studying the statistical behavior of the sample maximum

\[ M_n = \max\{X_1, \ldots, X_n\}, \]

where X_1, ..., X_n is a random sample, i.e. a sequence of independent random variables with common distribution F. The distribution of M_n can be obtained exactly from the distribution of the n variables, using independence:

\[ F_{M_n}(z) = P(M_n \le z) = P(X_1 \le z, \ldots, X_n \le z) = \prod_{i=1}^{n} P(X_i \le z) = \{F(z)\}^n; \]

if the distribution admits a density, then f_{M_n}(z) = n {F(z)}^{n−1} f(z). Note that F_{M_n} converges to zero as n tends to infinity for z < z*, and converges to one for z ≥ z*, with z* = sup{z : F(z) < 1}. To obtain a non-degenerate limit distribution, we can linearly standardize the maximum, in a similar fashion as we proceed with the sample mean in the central limit theorem. This is made rigorous in the following theorem, anticipated by Fisher and Tippett (1928) and extended by Gnedenko (1948).

Theorem 1.2.1 (Fisher–Tippett–Gnedenko). If there exist constants {a_n > 0} and {b_n} such that

\[ P\left( \frac{M_n - b_n}{a_n} \le z \right) \to G(z), \quad \text{as } n \to \infty, \]

for a non-degenerate distribution G, then G belongs to the family of generalized extreme value (GEV) distributions, defined by

\[ G(z) = \exp\left[ -\left\{ 1 + \xi \left( \frac{z - \mu}{\sigma} \right) \right\}_{+}^{-1/\xi} \right], \quad z \in \mathbb{R}, \tag{1.1} \]

with ξ ∈ R, µ ∈ R, σ > 0 and x_+ = max{0, x}.

Proof: See de Haan and Ferreira (2006), pp. 7–8.
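As a quick numerical illustration of Theorem 1.2.1 (a hedged sketch, not part of the thesis), the following Python snippet simulates block maxima from a standard exponential distribution, for which the classical normalizing constants are a_n = 1 and b_n = log n, and compares the empirical distribution of the standardized maxima with the Gumbel limit, i.e. the GEV with ξ = 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_block = 1000        # block size n
n_blocks = 5000       # number of simulated blocks

# Block maxima of Exp(1) samples; exact distribution is F(z)^n = (1 - exp(-z))^n.
maxima = rng.exponential(size=(n_blocks, n_block)).max(axis=1)

# Known normalizing sequences for the exponential case: a_n = 1, b_n = log n.
a_n, b_n = 1.0, np.log(n_block)
standardized = (maxima - b_n) / a_n

# Compare the empirical distribution of the standardized maxima with the
# Gumbel limit G(z) = exp(-exp(-z)), i.e. the GEV with xi = 0.
z = np.linspace(-2, 8, 6)
empirical = np.array([(standardized <= v).mean() for v in z])
gumbel = stats.gumbel_r.cdf(z)
print(np.round(np.c_[z, empirical, gumbel], 3))
```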

The parameter ξ of the GEV distribution determines the tail behavior, in the sense that depending on its value we have heavy (ξ > 0), light (ξ = 0), or short (ξ < 0) tails; these cases are respectively known as the Fréchet, Gumbel, and Weibull distributions, and the parameter ξ is known as the tail index. The GEV family of distributions is useful for modeling the distribution of block maxima, such as weekly maxima or annual maxima. The block maxima approach groups the raw data into blocks of equal size, and then fits a GEV distribution to the set of maxima corresponding to each block. The main challenge with this approach lies in the choice of the block size, which involves a bias–variance tradeoff. The choice of very small blocks leads to a poor approximation by the limiting model, which increases the bias when estimating and extrapolating; on the other hand, choosing very large blocks increases the variance of the estimates. Theorem 1.2.1 can be easily adapted for block minima m_n = min{X_1, ..., X_n} = −max{−X_1, ..., −X_n}, and consequently it is possible to obtain a version of the generalized extreme value (GEV) distribution for m_n; see Coles (2001), Theorem 3.3.

Inference and model checking for block maxima can be conducted using likelihood-based approaches. Specifically, let z_1, ..., z_k be a random sample from a GEV distribution. The log-likelihood for ξ ≠ 0 is given by

\[ \ell(\mu, \sigma, \xi) = -k \log \sigma - \left( \frac{1}{\xi} + 1 \right) \sum_{i=1}^{k} \log\left[ 1 + \xi \left( \frac{z_i - \mu}{\sigma} \right) \right] - \sum_{i=1}^{k} \left[ 1 + \xi \left( \frac{z_i - \mu}{\sigma} \right) \right]^{-1/\xi}, \]

under the condition that 1 + ξ(z_i − µ)/σ > 0, for i = 1, ..., k. For ξ = 0,

\[ \ell(\mu, \sigma) = -k \log \sigma - \sum_{i=1}^{k} \left( \frac{z_i - \mu}{\sigma} \right) - \sum_{i=1}^{k} \exp\left\{ -\left( \frac{z_i - \mu}{\sigma} \right) \right\}. \]

The maximum likelihood estimator has no closed-form solution, and thus estimates have to be obtained using numerical optimization algorithms.
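The numerical optimization alluded to above can be carried out with generic optimizers. The sketch below (an illustration under assumed parameter values, not code from the thesis) codes the GEV negative log-likelihood for ξ ≠ 0 and minimizes it with a Nelder–Mead search; scipy's genextreme distribution, whose shape parameter is c = −ξ, is used only to simulate the data.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)

# Simulate 200 block maxima from a GEV with xi = 0.2, mu = 10, sigma = 2.
xi_true, mu_true, sigma_true = 0.2, 10.0, 2.0
z = stats.genextreme.rvs(c=-xi_true, loc=mu_true, scale=sigma_true,
                         size=200, random_state=rng)

def gev_negloglik(par, z):
    """Negative GEV log-likelihood for xi != 0 (see the expression above)."""
    mu, log_sigma, xi = par
    sigma = np.exp(log_sigma)            # keep sigma > 0
    t = 1.0 + xi * (z - mu) / sigma
    if np.any(t <= 0):                   # support constraint
        return np.inf
    return (len(z) * np.log(sigma)
            + (1.0 / xi + 1.0) * np.sum(np.log(t))
            + np.sum(t ** (-1.0 / xi)))

start = np.array([np.mean(z), np.log(np.std(z)), 0.1])
fit = optimize.minimize(gev_negloglik, start, args=(z,), method="Nelder-Mead")
mu_hat, sigma_hat, xi_hat = fit.x[0], np.exp(fit.x[1]), fit.x[2]
print(mu_hat, sigma_hat, xi_hat)
```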

Modeling only block maxima is an inefficient approach in the analysis of extreme values if some of the blocks contain more than just one extreme event. In this case, the use of threshold exceedance models is adequate. Let X ∼ F be a variable of interest (e.g. the loss of a certain portfolio). In a threshold exceedance model, extreme events are defined as events that exceed a threshold u, and the exceedances above this threshold are denoted by Y = X − u. The stochastic behavior of the exceedances above a large threshold u is then characterized by

\[ P(X > u + y \mid X > u) = \frac{1 - F(u + y)}{1 - F(u)}, \quad y > 0. \]

If the distribution F were known, the distribution of threshold exceedances would also be known; however, in practice this is often not the case, and thus it is common to use the GP distribution as an approximation. This yields the following theorem:

Theorem 1.2.2 (Pickands–Balkema–de Haan). Let X_1, X_2, ... be a sequence of independent and identically distributed random variables with distribution function F, obeying the assumptions of Theorem 1.2.1. Then, for a sufficiently large threshold u, the distribution function of Y_i = X_i − u, conditional on X_i > u, may be approximated by a distribution of the form

\[ H(y) = \begin{cases} 1 - \left( 1 + \dfrac{\xi y}{\tilde{\sigma}} \right)^{-1/\xi}, & \xi \neq 0, \\[4pt] 1 - \exp\left( -\dfrac{y}{\tilde{\sigma}} \right), & \xi = 0, \end{cases} \tag{1.2} \]

defined on {y : y > 0 and (1 + ξy/σ̃) > 0}, where σ̃ = σ + ξ(u − µ).

Proof: See McNeil et al. (2015), p. 159.

According to Theorem 1.2.2, if block maxima follow a GEV distribution, then the distribution of threshold exceedances follows a generalized Pareto distribution. The distribution in (1.2) is known as the generalized Pareto (GP) distribution, with scale parameter σ̃ and shape ξ ∈ R. The behavior of the tail is determined by the parameter ξ, which is again the so-called extreme value index. If ξ < 0, the distribution of exceedances has an upper limit u − σ̃/ξ whereas, if ξ ≥ 0, no upper limit exists. For threshold selection it is necessary to take into account that a threshold that is too low increases bias, while a threshold that is too high increases variance. Thus, threshold selection entails a bias–variance tradeoff. Given a value of the threshold u and the number of points, k, of the original sample X_1, ..., X_n that exceed the threshold, the estimation of the parameters σ̃ and ξ can be carried out by different methods such as maximum likelihood, Bayesian inference, the weighted moments method or the basic percentile method, among others.
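As a concrete illustration of the peaks-over-threshold approach (a sketch with illustrative choices of data and threshold, not taken from the thesis), the snippet below fits a GP distribution to exceedances of a high empirical quantile and uses the fit to estimate a tail probability.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Heavy-tailed "losses": Student-t with 4 degrees of freedom (tail index ~ 1/4).
losses = stats.t.rvs(df=4, size=20000, random_state=rng)

# Pick a high threshold (here the empirical 95% quantile) and form exceedances.
u = np.quantile(losses, 0.95)
exceedances = losses[losses > u] - u

# Fit a generalized Pareto distribution to the exceedances.
# scipy's genpareto shape parameter c plays the role of xi; loc is fixed at 0.
xi_hat, _, sigma_hat = stats.genpareto.fit(exceedances, floc=0)
print(f"threshold u = {u:.3f}, xi_hat = {xi_hat:.3f}, sigma_hat = {sigma_hat:.3f}")

# Tail probability estimate P(X > x) ~ (k/n) * {1 - H(x - u)} for x above u.
n, k = losses.size, exceedances.size
x = u + 2.0
p_tail = (k / n) * stats.genpareto.sf(x - u, c=xi_hat, loc=0, scale=sigma_hat)
print(f"estimated P(X > {x:.2f}) = {p_tail:.5f}")
```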

There is a parameter called the "extremal index", θ ∈ [0, 1], which governs the temporal dependence of the highest observations, and whose reciprocal measures the degree of clustering of the extremes. For the definition and properties of θ see, for instance, Coles (2001, p. 92).

Extremes for nonstationary sequences

The methods discussed above assume identically distributed observations. This section discusses versions of the latter methods for a nonstationary setting; such versions are natural for studying the time-changing nature of extremes. The probabilistic setup is based on nonstationary processes, which have characteristics that change over time. For example, in the context of Finance the mechanisms driving and governing stock markets are known to be rather turbulent and volatile, and thus their dynamics cannot be captured using a stationary process. Extreme value theory for stationary random sequences has been extensively studied (Chernick et al., 1991; Coles, 2001; Beirlant et al., 2004; de Haan and Ferreira, 2006). In a certain sense, stationarity implies a time-invariant behavior of the extremes, whereas for a nonstationary process the marginal distribution changes over time, and thus so may the behavior of the extremes. The theory for nonstationary extremes is rather intricate; yet, given the importance of modeling the dynamics of extremes over time, several statistical models have been proposed (Davison and Smith, 1990; Coles, 2001; Chavez-Demoulin and Davison, 2005; Eastoe and Tawn, 2009; Jonathan et al., 2014; Randell et al., 2016; Einmahl et al., 2016; Castro et al., 2018; Opitz et al., 2018). A popular approach for extending the theory of extremes of stationary sequences entails indexing the parameters of an extreme value distribution over time (Davison and Smith, 1990; Coles, 2001). For instance, a nonstationary GEV model can be used to describe the limiting distribution of the linearly standardized maximum X_t as

\[ X_t \sim \mathrm{GEV}(\mu(t), \sigma(t), \xi(t)), \quad t = 1, \ldots, T; \tag{1.3} \]

here, µ(t), σ(t) and ξ(t) are time-varying parameters controlling the location, scale, and shape of

the marginal distributions over time, respectively. Figure 1.1 illustrates, through four examples, the model in (1.3). We may consider a parametric structure for θ(t) = (µ(t), σ(t), ξ(t)), modeling each parameter using, for instance, a generalized linear model (GLM), in which case θ(t) = h(B_t^T β), where h is a specified inverse-link function, β is a vector of parameters, and B_t is a vector of basis functions (e.g. B_t = (1, t)). Now, if we model as in (1.3), the likelihood is

\[ L(\beta) = \prod_{t=1}^{T} g(x_t; \mu(t), \sigma(t), \xi(t)), \tag{1.4} \]

where g is the density of a GEV distribution. Hence, if ξ(t) ≠ 0 for all t = 1, ..., T, the log-likelihood is given by

\[ \ell(\beta) = -\sum_{t=1}^{T} \log \sigma(t) - \sum_{t=1}^{T} \left( \frac{1}{\xi(t)} + 1 \right) \log\left[ 1 + \xi(t) \left( \frac{x_t - \mu(t)}{\sigma(t)} \right) \right] - \sum_{t=1}^{T} \left[ 1 + \xi(t) \left( \frac{x_t - \mu(t)}{\sigma(t)} \right) \right]^{-1/\xi(t)}, \tag{1.5} \]

under the condition that 1 + ξ(t){x_t − µ(t)}/σ(t) > 0, for t = 1, ..., T, where µ(t), σ(t) and ξ(t) are replaced using the GLM specification above. Maximization of this log-likelihood using numerical methods can then be used to estimate the parameter β. Model selection for nested models under this setup is typically based on the deviance statistic

\[ D = 2\{\ell_1(\mathcal{M}_1) - \ell_0(\mathcal{M}_0)\}, \tag{1.6} \]

where ℓ_0(M_0) and ℓ_1(M_1) are the maximized log-likelihoods under models M_0 and M_1 respectively, with M_0 ⊂ M_1. Large values of D suggest that model M_1 explains substantially more of the variation in the data than M_0, whereas small values of D suggest that the increase in model size does not bring significant improvements in the model's capacity to explain the data; D is (asymptotically) chi-squared distributed. Now, in terms of threshold exceedances, similar techniques can be adopted for the GP distribution. We could consider a set of time-varying thresholds u(t), for t = 1, ..., T, leading to threshold excesses modeled using the generalized Pareto distribution GPD(σ̃(t), ξ(t)) (Davison and Smith, 1990). Analogously to (1.3), we can use a GLM to model the vector of parameters θ(t) = (σ̃(t), ξ(t)).
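To fix ideas, here is a minimal maximum likelihood sketch of the nonstationary model (1.3)–(1.6) with a linear trend in the location parameter only; the trend specification, the simulated values, and the use of a Nelder–Mead search are illustrative assumptions rather than choices made in the thesis.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(4)
T = 500
t = np.arange(1, T + 1) / T                    # rescaled time in (0, 1]

# Simulate X_t ~ GEV(mu(t), sigma, xi) with a linear trend mu(t) = b0 + b1 * t.
b0_true, b1_true, sigma_true, xi_true = 1.0, 2.0, 1.0, 0.1
x = stats.genextreme.rvs(c=-xi_true, loc=b0_true + b1_true * t,
                         scale=sigma_true, size=T, random_state=rng)

def negloglik(par):
    """Negative log-likelihood (1.5) with mu(t) = b0 + b1 * t, constant sigma, xi."""
    b0, b1, log_sigma, xi = par
    sigma = np.exp(log_sigma)
    mu = b0 + b1 * t
    s = 1.0 + xi * (x - mu) / sigma
    if np.any(s <= 0):
        return np.inf
    return np.sum(np.log(sigma) + (1.0 / xi + 1.0) * np.log(s) + s ** (-1.0 / xi))

start = np.array([x.mean(), 0.0, np.log(x.std()), 0.1])
fit = optimize.minimize(negloglik, x0=start, method="Nelder-Mead")
b0_hat, b1_hat, sigma_hat, xi_hat = fit.x[0], fit.x[1], np.exp(fit.x[2]), fit.x[3]
print(b0_hat, b1_hat, sigma_hat, xi_hat)

# A nested-model comparison as in (1.6): refit with b1 fixed at 0 and compute
# the deviance D = 2 * (loglik_full - loglik_reduced).
fit0 = optimize.minimize(lambda p: negloglik(np.array([p[0], 0.0, p[1], p[2]])),
                         x0=np.array([x.mean(), np.log(x.std()), 0.1]),
                         method="Nelder-Mead")
D = 2.0 * (fit0.fun - fit.fun)
print("deviance:", D, "p-value:", stats.chi2.sf(D, df=1))
```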

Figure 1.1: Simulated independent observations from the nonstationary GEV model in (1.3). The top panels correspond to GEV(t, 2t, 1) (left) and GEV(t², 2t, 1) (right); the bottom panels correspond to GEV(sin² t, t, 1) (left) and GEV(cos t², t, 1) (right).

Another related nonstationary model is that of Chavez-Demoulin and Davison (2005), who model the number of exceedances above a high threshold u using a Poisson distribution with mean λt_0, where the function λ is defined by

\[ \lambda(t) = \{1 + \xi(t)(u - \mu(t))/\sigma(t)\}_{+}^{-1/\xi(t)}, \tag{1.7} \]

and t_0 is the upper extreme point of the interval where the data are observed. Then, similarly to Coles (2001), the threshold excesses Y_{t_1}, ..., Y_{t_T} over the time-varying thresholds u(t), for t = 1, ..., T, have the generalized Pareto distribution in (1.2). Specifically, Chavez-Demoulin and Davison (2005) suggest modeling the intensity as well as the time-varying shape and scale by using a smooth nonstationary generalized additive specification

where

\[ \begin{cases} \lambda(t) = \exp\{x^{T} \alpha + f(t)\}, \\ \xi(t) = x^{T} \beta + g(t), \\ \sigma(t) = \exp\{x^{T} \gamma + s(t)\}; \end{cases} \tag{1.8} \]

here x is a covariate, α, β and γ are parameter vectors, and f, g and s are smooth functions. Inference for λ(t), ξ(t) and σ(t) can be conducted using a penalized likelihood for the log-likelihood ℓ(y_t; λ(t), σ(t), ξ(t)) = ℓ_N(y_t; λ(t)) + ℓ(y_t; σ(t), ξ(t)).

This completes our introduction to nonstationary extremes. Chapters 2 and 4 will offer details on yet another paradigm for nonstationary extremes, known as heteroscedastic extremes, which has been recently proposed by Einmahl et al. (2016).

Brief comments on multivariate extremes

Theorem 1.2.1 can be extended to the bivariate case (Coles, 2001, Theorem 8.1, p. 144); this leads to the definition of bivariate extreme value distributions and to the set of distribution functions H called spectral distributions. Similarly to the angular density of a bivariate extreme value distribution, one of the objects to be proposed in this thesis (the diagonal density, Chapter 3) consists of a mean-constrained univariate distribution function on the unit interval, which summarizes key features of the dependence structure. Yet, despite their similarities, spectral and diagonal densities are constructed from very different principles. We now switch gears and offer preparations on cluster analysis.

1.3 Cluster analysis

Motivated by the fact that in Chapter 2 we have to cluster random functions, I will next review some cluster analysis methods for that setting. Whereas cluster analysis in Euclidean spaces is by now well understood, cluster analysis in function spaces is much more challenging and is still a subject of active research. Clustering methods aim to learn about meaningful partitions of data. Their goal is to order objects (people, variables, functions, etc.) into homogeneous

groups (clusters) such that the degree of association or similarity between members of the same cluster is stronger than the degree of association or similarity between members of different clusters, without using any prior knowledge of the group labels of the data. Different variants lead to different concepts of a cluster 'center', the most well-known being K-means (MacQueen and others, 1967) and K-medoids (Kaufman and Rousseeuw, 1987). Beyond similarity-based clustering (such as K-means and K-medoids), the other mainstream approaches include model-based clustering (i.e. based on mixture models) and hierarchical clustering (i.e. based on dendrograms); see (Hastie et al., 2009, Section 14.4) for an overview of the latter approaches. The results of a cluster analysis can contribute to devising a taxonomy for a set of objects, to suggesting statistical models to describe populations, to assigning new individuals to classes for diagnosis and identification, and so on. In spite of all the developments made in cluster analysis, there remain many open problems due to the great variety of types of data that can be used, including vectors, time series, categorical data, images, functions, etc. Below we discuss some recent advances in cluster analysis for time series and functional data.

Cluster analysis for random processes

When the data of interest evolve over time (i.e. are realizations of a stochastic process), the clustering techniques to be employed must take into account the ordered nature of the sequences of records. There are different ways to capture these characteristics, for instance considering time series models underlying the data that take into account their dynamic nature, or defining criteria of distance or dissimilarity between observations that also consider affinity between structures of temporal dependence. As in the case of data without underlying temporal dependence, the cluster analysis of time series has been widely discussed in the literature. An excellent review on this topic can be found in Liao (2005); see also Wang and Fu (2005) for a review of related contributions from artificial intelligence and data mining. More specifically, Liao (2005) summarizes previous work on clustering time series data in various application domains, including general-purpose clustering algorithms, the criteria for evaluating the performance of clustering results, and the measures used to determine the similarity/dissimilarity between two time series being compared. And how shall we proceed when data are random functions, rather than time series? In this case we may resort to methods and techniques from a field of Statistics known as

functional data analysis (Ramsay, 2004; Ferraty and Vieu, 2006; Horváth and Kokoszka, 2012). A main goal of this field is to model data that consist of random functions, i.e., data whose trajectories live on functional spaces. When the basic unit of information is a random function, the classical clustering approaches for vector-valued multivariate data can typically be extended to functional data, where various additional considerations arise, such as discrete approximations of distance measures and dimension reduction of the infinite-dimensional functional data objects. Figure 1.2 shows trajectories of random functions belonging to three different functional families.

Figure 1.2: Simulated trajectories of random functions belonging to three different families.

In particular, the K-means clustering algorithm is an approach that can be applied to functional data. It is natural to view cluster mean functions as the cluster centers in functional clustering. In this case, we consider the classification problem for a sample of functional data y_1(t), ..., y_T(t); K-means functional clustering allows us to find a set of cluster centers µ_1(t), ..., µ_K(t), assuming there are K clusters, by minimizing the sum of squared distances between {y_i(t)}_{i=1}^{T} and the cluster centers, for a suitable functional distance D. That is, the T observations are partitioned into K groups such that D(y_j(t), µ_i(t)) is minimized over all possible sets of functions µ_1(t), ..., µ_K(t), allocating each function using the encoder L : {1, ..., T} → {1, ..., K}, given by

\[ \mathcal{L}(j) = \underset{i=1,\ldots,K}{\arg\min}\; D(y_j(t), \mu_i(t)), \quad j = 1, \ldots, T, \tag{1.9} \]

where µ_i(t) = Σ_{j=1}^{T} y_j(t) I(L(j) = i)/N_i, and N_i = Σ_{j=1}^{T} I(L(j) = i). Typically, the distance D is chosen as the L² norm.
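The following sketch implements the functional K-means idea in (1.9) for densely observed curves, approximating the L² distance by the Euclidean distance between curves evaluated on a common grid; the simulated families and all tuning choices are illustrative assumptions, not part of the thesis.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate T curves from three functional families, observed on a common grid.
grid = np.linspace(0, 1, 100)
curves = []
for _ in range(30):
    curves.append(np.sin(2 * np.pi * grid) + 0.3 * rng.normal(size=grid.size))
    curves.append(grid ** 2 + 0.3 * rng.normal(size=grid.size))
    curves.append(np.cos(4 * np.pi * grid) + 0.3 * rng.normal(size=grid.size))
Y = np.array(curves)                      # shape (T, len(grid))

def functional_kmeans(Y, K, n_iter=50, seed=0):
    """Lloyd-type K-means with the discretized L2 distance; returns labels, centers."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=K, replace=False)]
    for _ in range(n_iter):
        # Encoder (1.9): assign each curve to the nearest cluster mean function.
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update each cluster center as the pointwise mean of its member curves.
        centers = np.array([Y[labels == k].mean(axis=0) for k in range(K)])
    return labels, centers

labels, centers = functional_kmeans(Y, K=3)
print(np.bincount(labels))
```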

Since functional data are discretely recorded, frequently contaminated with measurement errors, and can be sparsely or irregularly sampled, a common approach to minimizing the distances D(y_j(t), µ_i(t)) is to project the infinite-dimensional functional data onto a low-dimensional space spanned by a set of basis functions, similarly to functional correlation and regression (Wang et al., 2015). The traditional K-means clustering for vector-valued multivariate data has thus been extended to functional data using mean functions as cluster centers. There are two approaches based on the functional decomposition of the observations. The first approach is based on a functional basis expansion: we consider a set of pre-specified basis functions {φ_1(t), φ_2(t), ...} of a functional space, and use the first L projections M_{i,k} = ⟨y_i(t), φ_k(t)⟩ of the observed trajectories onto the space spanned by the basis functions to represent the functional data, for k = 1, ..., L. We can then apply a clustering algorithm for multivariate data, such as the K-means algorithm, to partition the estimated sets of coefficients {M_{i,k}}. When the K-means algorithm is applied to the sets of coefficients {M_{i,k}}, we obtain the cluster centers M_1, ..., M_K in the projected space, and thus the set of cluster centers {µ_1(t), ..., µ_K(t)} in the functional space, where µ_j(t) = Σ_{k=1}^{L} M_{j,k} φ_k(t). This approach is used, for instance, in Abraham et al. (2003) with B-spline basis functions and in Serban and Wasserman (2005) with Fourier basis functions, coupled with the K-means algorithm. I now discuss the so-called functional PCA (principal component analysis) approach. Whereas the functional basis expansion method needs a pre-specified set of basis functions, the finite approximation by functional PCA uses data-adaptive basis functions that are determined by the covariance function of the functional data (Jacques and Preda, 2014). This approach consists of reducing the infinite-dimensional problem to a finite one by approximating the data with elements from some finite-dimensional space. Then, clustering algorithms for finite-dimensional data, such as K-means, can be applied. Jacques and Preda (2014) consider the estimators of µ(t) and of the covariance function C(w, t) given by the sample mean and the sample

covariance function, respectively,

\[ \hat{\mu}(t) = \frac{1}{T} \sum_{i=1}^{T} y_i(t), \qquad \hat{C}(w, t) = \frac{1}{T-1} \sum_{i=1}^{T} \{y_i(w) - \hat{\mu}(w)\}\{y_i(t) - \hat{\mu}(t)\}. \]

We assume that each functional datum y_i(t), for i = 1, ..., T, belongs to a finite-dimensional space spanned by some basis of functions, such that

\[ y_i(t) = \sum_{j=1}^{L} \phi_j(t)\, \theta_{ij} = \Phi(t)^{T} \theta_i, \tag{1.10} \]

where θ_i = (θ_{i1}, ..., θ_{iL})^T are the expansion coefficients of the observed curve y_i(t) in the basis Φ = {φ_1, ..., φ_L}. Let A be the matrix whose rows are the elements θ_i; thus, the function Ĉ(w, t) can be written as

\[ \hat{C}(w, t) = \frac{1}{T-1} \sum_{i=1}^{T} \{y_i(w) - \hat{\mu}(w)\}\{y_i(t) - \hat{\mu}(t)\} = \frac{1}{T-1} \Phi(w)^{T} A^{T} A\, \Phi(t). \tag{1.11} \]

Now, we assume that each eigenfunction f_i belongs to the linear space spanned by the basis Φ, such that f_i(t) = Φ(t)^T ψ_i, with ψ_i = (ψ_{i1}, ..., ψ_{iL})^T. Using the estimate Ĉ of C, the eigenproblem (C f_i = λ_i f_i) becomes

\[ \int_{0}^{b} \hat{C}(w, t)\, f_i(t)\, \mathrm{d}t = \lambda_i f_i(w), \tag{1.12} \]

and using the expansion of f_i(t) in the basis Φ and (1.11), we can see that the above equation is equivalent to

\[ \frac{1}{T-1} \Phi(w)^{T} A^{T} A \int_{0}^{b} \Phi(t) \Phi(t)^{T} \, \mathrm{d}t \; \psi_i = \lambda_i \Phi(w)^{T} \psi_i. \]

Therefore, we have

\[ (T-1)^{-1} A^{T} A W \psi_i = \lambda_i \psi_i, \tag{1.13} \]

where W = ∫_0^b Φ(t)Φ(t)^T dt is a symmetric L × L matrix of inner products between the basis functions. Now, defining z_i = W^{1/2} ψ_i, the multivariate functional principal component analysis is reduced to the usual PCA of the matrix (T − 1)^{−1/2} A W^{1/2}:

\[ \frac{1}{T-1} W^{1/2} A^{T} A W^{1/2} z_i = \lambda_i z_i. \tag{1.14} \]

Therefore, the coefficients ψ_i, i ≥ 1, are obtained as ψ_i = (W^{1/2})^{−1} z_i, and the principal component scores are given by C_i = A W ψ_i, i ≥ 1. A similar approach can be used for multivariate functional data (Jacques and Preda, 2014).
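A rough numerical translation of the above recipe is sketched below, assuming densely observed curves, a simple polynomial basis, and a crude Riemann-sum quadrature for W; these are illustrative simplifications rather than the choices of Jacques and Preda (2014).

```python
import numpy as np

rng = np.random.default_rng(6)

# Densely observed curves on [0, 1]; any array of shape (T, n_grid) would do.
grid = np.linspace(0, 1, 200)
T = 60
Y = (np.sin(2 * np.pi * grid)[None, :] * rng.normal(1, 0.5, size=(T, 1))
     + np.cos(4 * np.pi * grid)[None, :] * rng.normal(0, 0.5, size=(T, 1))
     + 0.1 * rng.normal(size=(T, grid.size)))

# Basis Phi: polynomial basis functions evaluated on the grid (L columns).
L = 6
Phi = np.vander(grid, N=L, increasing=True)             # shape (n_grid, L)

# Expansion coefficients theta_i by least squares; rows of A are centered coefficients.
Theta = np.linalg.lstsq(Phi, Y.T, rcond=None)[0].T       # shape (T, L)
A = Theta - Theta.mean(axis=0)

# W = integral of Phi(t) Phi(t)^T dt, approximated by a Riemann sum on the grid.
W = Phi.T @ Phi * (grid[1] - grid[0])

# Eigenproblem (1.14): (T-1)^{-1} W^{1/2} A^T A W^{1/2} z = lambda z.
Wv, We = np.linalg.eigh(W)
W_half = We @ np.diag(np.sqrt(Wv)) @ We.T
M = W_half @ A.T @ A @ W_half / (T - 1)
lam, Z = np.linalg.eigh(M)
lam, Z = lam[::-1], Z[:, ::-1]                           # decreasing order

# Eigenfunction coefficients psi_i and principal component scores C_i = A W psi_i.
Psi = np.linalg.solve(W_half, Z)
scores = A @ W @ Psi
print("leading eigenvalues:", np.round(lam[:3], 3))
print("scores matrix shape:", scores.shape)
```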

We now offer some preparations on copulas.

1.4 Learning about dependence via copulas

Copulas are widely applied in Quantitative Risk Management (McNeil et al., 2015), and will be key for Chapter 3. Copulas originate from the need to rigorously define the relationship between a multidimensional distribution function and its underlying one-dimensional marginal distributions. This problem has been addressed by several authors, such as Fréchet (1951) and Sklar (1959), who contributed important results towards its solution. The previously mentioned relationship between a multidimensional distribution function and its one-dimensional marginals is given by a function (that Sklar called a "copula") that summarizes the entire dependence structure of a random vector, and it has become a powerful tool for multivariate analysis. In general terms, a copula is a multivariate distribution function defined on the unit hypercube [0, 1]^D with uniformly distributed marginals. This definition is natural if the copula is obtained from a continuous multivariate distribution function, in which case the unique copula is simply the original multivariate distribution function with univariate transformed marginals.

Definition 1.4.1 (Copula). A D-dimensional copula is a function C : [0, 1]^D → [0, 1], i.e. a mapping of the unit hypercube into the unit interval, such that:

1. C(u_1, ..., u_D) = 0 if u_i = 0 for any i.

2. C(1, ..., 1, u_i, 1, ..., 1) = u_i for all i ∈ {1, ..., D}, u_i ∈ [0, 1].

3. For all (a_1, ..., a_D), (b_1, ..., b_D) ∈ [0, 1]^D with a_i ≤ b_i we have

\[ \sum_{i_1=1}^{2} \cdots \sum_{i_D=1}^{2} (-1)^{i_1 + \cdots + i_D}\, C(u_{1 i_1}, \ldots, u_{D i_D}) \ge 0, \tag{1.15} \]

where u_{j1} = a_j and u_{j2} = b_j for all j ∈ {1, ..., D}.

The second property in Definition 1.4.1 requires that the marginal distributions are uniform. The so-called rectangle inequality in (1.15) ensures that if the random vector (U_1, ..., U_D) has distribution function C, then P(a_1 ≤ U_1 ≤ b_1, ..., a_D ≤ U_D ≤ b_D) is always non-negative. Also, it is important to observe that the k-dimensional margins of a D-dimensional copula are themselves copulas, for 1 ≤ k ≤ D. Copulas have a probabilistic interpretation that follows from Sklar's Theorem (Sklar, 1959); specifically, the latter theorem ensures that copulas are joint distribution functions, and that any joint distribution function can be rewritten in terms of the marginals and a single copula. Thus, most studies of joint distribution functions can be reduced to the study of copulas.

Theorem 1.4.1 (Sklar, 1959). Let F be a joint distribution function with margins F_1, ..., F_D. Then there exists a copula C : [0, 1]^D → [0, 1] such that, for all x_1, ..., x_D ∈ R̄ = [−∞, ∞],

\[ F(x_1, \ldots, x_D) = C(F_1(x_1), \ldots, F_D(x_D)). \tag{1.16} \]

If the margins are continuous, then C is unique; otherwise C is uniquely determined on Ran(F_1) × ··· × Ran(F_D), where Ran(F_i) = F_i(R̄) denotes the range of F_i. Conversely, if C is a copula and F_1, ..., F_D are univariate distribution functions, then the function F defined in (1.16) is a joint distribution function with margins F_1, ..., F_D.

Below, u = (u_1, ..., u_D) unless mentioned otherwise.

Theorem 1.4.2 (Fréchet–Hoeffding bounds). For every copula C(u) we have the bounds

\[ \max\left\{ \sum_{i=1}^{D} u_i + 1 - D,\ 0 \right\} \le C(u) \le \min\{u_1, \ldots, u_D\}. \tag{1.17} \]

We now provide some examples of copulas, subdivided into three groups: fundamental copulas represent a number of important special dependence structures; implicit copulas are extracted from well-known multivariate distributions using Sklar's Theorem, but do not necessarily have simple closed-form expressions; explicit copulas have simple closed-form expressions and follow mathematical constructions known to yield copulas. Another important class of copulas is that of Archimedean copulas; the particularity of Archimedean copulas is that there is a function that generates them, which is called the copula generator.
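As a small empirical illustration of these results (not part of the thesis), one can build pseudo-observations from ranks, which by Sklar's Theorem carry the dependence structure free of the margins, and check that the resulting empirical copula respects the Fréchet–Hoeffding bounds (1.17); the data-generating mechanism below is an arbitrary assumption made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Bivariate data with dependence and very different margins: a latent normal
# factor pushed through exponential and lognormal transforms.
n = 5000
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.7], [0.7, 1.0]], size=n)
x1 = stats.expon.ppf(stats.norm.cdf(z[:, 0]))            # exponential margin
x2 = stats.lognorm.ppf(stats.norm.cdf(z[:, 1]), s=1)     # lognormal margin

# Pseudo-observations: ranks rescaled to (0, 1); by Sklar's theorem these carry
# the dependence structure, free of the marginal distributions.
u1 = stats.rankdata(x1) / (n + 1)
u2 = stats.rankdata(x2) / (n + 1)

# Empirical copula evaluated at a few points, checked against the
# Frechet-Hoeffding bounds max(u + v - 1, 0) <= C(u, v) <= min(u, v) in (1.17).
for u, v in [(0.25, 0.25), (0.5, 0.5), (0.5, 0.9), (0.9, 0.9)]:
    c_emp = np.mean((u1 <= u) & (u2 <= v))
    lower, upper = max(u + v - 1.0, 0.0), min(u, v)
    print(f"C_emp({u},{v}) = {c_emp:.3f}  in [{lower:.3f}, {upper:.3f}]")
```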

Fundamental copulas

• The independence copula is

\[ \Pi(u) = \prod_{i=1}^{D} u_i. \tag{1.18} \]

• The comonotonicity copula, which corresponds to perfect positive dependence, is the Fréchet upper bound copula from (1.17):

\[ M(u) = \min\{u_1, \ldots, u_D\}. \tag{1.19} \]

• The countermonotonicity copula, which corresponds to perfect negative dependence, is the two-dimensional Fréchet lower bound copula from (1.17), given by

\[ W(u) = \max\{u_1 + u_2 - 1, 0\}, \quad u = (u_1, u_2). \tag{1.20} \]

Implicit copulas

• The Gaussian copula is a distribution over [0, 1]^D constructed from a multivariate normal distribution over R^D. For a given correlation matrix Σ ∈ M_D([−1, 1]) (the space of square matrices with entries in [−1, 1]), the Gaussian copula with parameter Σ can be written as C_Σ(u) = Φ_Σ(Φ^{−1}(u_1), ..., Φ^{−1}(u_D)), where Φ^{−1} is the inverse cumulative distribution function of a standard normal and Φ_Σ is the joint cumulative distribution function of a multivariate normal distribution with mean vector zero and covariance matrix equal to Σ. Note that both the independence and comonotonicity copulas are special cases of the Gaussian copula. If Σ = I_D, we obtain the independence copula in (1.18); if Σ = 1_D, the D × D matrix consisting entirely of ones, then we obtain the comonotonicity copula in (1.19). In the same way that we can extract a copula from the multivariate normal distribution, we can obtain an implicit copula from any other distribution with continuous marginal density functions.
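The construction of the Gaussian copula suggests a simple sampling recipe, sketched below: draw from a multivariate normal with correlation matrix Σ, apply Φ to each margin to obtain a sample from C_Σ, and then, by Sklar's Theorem, attach any desired margins via their quantile functions. The correlation matrix and the margins used here are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Sample from a 3-dimensional Gaussian copula C_Sigma: draw Z ~ N(0, Sigma) with
# a correlation matrix Sigma and set U_i = Phi(Z_i), so each U_i is Uniform(0, 1).
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
Z = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=10000)
U = stats.norm.cdf(Z)

# By Sklar's theorem, any margins can be attached to this dependence structure,
# e.g. generalized Pareto, gamma and normal margins (illustrative choices).
X = np.column_stack([
    stats.genpareto.ppf(U[:, 0], c=0.3, scale=1.0),
    stats.gamma.ppf(U[:, 1], a=2.0),
    stats.norm.ppf(U[:, 2], loc=0.0, scale=2.0),
])

# The margins change, but the copula (and hence rank correlations) is preserved.
print("Spearman rho (U):\n", np.round(stats.spearmanr(U).correlation, 2))
print("Spearman rho (X):\n", np.round(stats.spearmanr(X).correlation, 2))
```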

Figure 1.3: Simulated data from Gaussian, t, Clayton and Gumbel bivariate copulas.

• The t copula is a distribution over [0, 1]^D constructed from a multivariate t distribution over R^D. For a given correlation matrix Σ ∈ M_D([−1, 1]) and degrees of freedom ν, the t copula with parameters Σ and ν can be written as C_{Σ,ν}(u) = t_{Σ,ν}(t_ν^{−1}(u_1), ..., t_ν^{−1}(u_D)), where t_ν^{−1} is the inverse cumulative distribution function of a standard univariate t distribution with ν degrees of freedom, and t_{Σ,ν} is the joint cumulative distribution function of a multivariate t distribution with mean vector zero, ν degrees of freedom and covariance matrix equal to Σ.

Explicit copulas

The previous examples are copulas implied by well-known multivariate distribution functions, but they do not themselves have simple closed forms; yet there are various examples of copulas that do have simple closed forms, and these are known as explicit copulas. An example is

the D-dimensional Farlie–Gumbel–Morgenstern copula with 2^D − D − 1 parameters, for D ≥ 2:

\[ C_{\alpha}(u) = u_1 \cdots u_D \left[ 1 + \sum_{d=2}^{D} \sum_{1 \le j_1 < \cdots < j_d \le D} \alpha_{j_1 j_2 \cdots j_d} (1 - u_{j_1}) \cdots (1 - u_{j_d}) \right], \]

where the parameters α = (α_1, ..., α_{2^D − D − 1}) must satisfy the following 2^D constraints:

\[ 1 + \sum_{d=2}^{D} \sum_{1 \le j_1 < \cdots < j_d \le D} \alpha_{j_1 \cdots j_d}\, \varepsilon_{j_1} \cdots \varepsilon_{j_d} \ge 0, \qquad \varepsilon_1, \ldots, \varepsilon_D \in \{-1, +1\}. \]

Thus, the constraints imply that each parameter must satisfy |α_{j_1···j_d}| ≤ 1. For further examples of copulas, see Joe (2014).

Archimedean copulas

The family of Archimedean copulas has been extensively studied and applied, for instance in the modelling of portfolio credit risk (Frey and McNeil, 2003; Rogge and Schönbucher, 2003). Most common Archimedean copulas admit an explicit formula, and can model the dependence structure in arbitrarily high dimensions. Formally, a copula C is called D-Archimedean (D > 2) if it admits the representation

\[ C(u) = \varphi^{-1}(\varphi(u_1) + \cdots + \varphi(u_D)), \]

where ϕ^{−1} is the inverse of the generator ϕ. A generator uniquely determines an Archimedean copula (up to a scalar multiple). Table 1.1 shows three examples of Archimedean copulas defined through their corresponding generators.

Family     Parameter space    Generator ϕ(t)                       Generator inverse ϕ^{−1}(s)
Clayton    α ≥ 0              t^{−α} − 1                           (1 + s)^{−1/α}
Frank      α ≥ 0              −log{(e^{−αt} − 1)/(e^{−α} − 1)}     −α^{−1} log{1 + e^{−s}(e^{−α} − 1)}
Gumbel     α ≥ 1              (−log t)^{α}                         exp(−s^{1/α})

Table 1.1: Summary of three one-parameter (α) Archimedean copulas for D > 2.
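The generators in Table 1.1 also lead to simple simulation algorithms. The sketch below (an illustration, not code from the thesis) samples from the Clayton copula by the Marshall–Olkin method, exploiting the fact that its inverse generator is the Laplace transform of a gamma distribution, and checks the result against the known relation τ = α/(α + 2) for the bivariate Clayton copula.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

def sample_clayton(n, d, alpha, rng):
    """Marshall-Olkin sampling for the Clayton copula.

    The inverse generator from Table 1.1, psi(s) = (1 + s)^(-1/alpha), is the
    Laplace transform of a Gamma(1/alpha, 1) variable V, so
    U_j = psi(E_j / V) with E_j ~ Exp(1) has the Clayton copula.
    """
    v = rng.gamma(shape=1.0 / alpha, scale=1.0, size=(n, 1))
    e = rng.exponential(size=(n, d))
    return (1.0 + e / v) ** (-1.0 / alpha)

alpha = 2.0
U = sample_clayton(n=20000, d=2, alpha=alpha, rng=rng)

# For the bivariate Clayton copula, Kendall's tau equals alpha / (alpha + 2).
tau_hat = stats.kendalltau(U[:, 0], U[:, 1]).correlation
print(f"empirical tau = {tau_hat:.3f}, theoretical tau = {alpha / (alpha + 2):.3f}")
```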

One of the fundamental aspects of copulas is that they can be used to generate measures of association that differ from Pearson's correlation coefficient. The latter is merely a measure of linear association, so if the relationship between random variables is not linear, this coefficient does not correctly measure their association; in addition, it is well known that a Pearson correlation equal to zero does not imply independence, and, on top of that, Pearson's correlation depends on marginal features. Using copulas, alternative measures of association can be defined, such as Kendall's tau $\tau$ (Nelsen, 2006, p. 158), Spearman's rho $\rho_S$ (Nelsen, 2006, p. 167), and the tail dependence coefficient $\lambda_U$ (Nelsen, 2006, p. 214).

1.5 Background on empirical likelihood and on NPMLE

To avoid parametric assumptions, nonparametric methods will be used throughout to learn about the models proposed in this thesis. We thus discuss below some background on empirical likelihood (Owen, 2001). Empirical likelihood (EL) is an inference method that can be used to construct estimators known as nonparametric maximum likelihood estimators (NPMLE); the basic idea is the following. Let $Z_1, \ldots, Z_n \sim F$ and let $p_i = \mathrm{d}F(Z_i)$, for $i = 1, \ldots, n$. The nonparametric log-likelihood is defined as
$$\ell(p_1, \ldots, p_n) = \sum_{i=1}^{n} \log p_i, \qquad (p_1, \ldots, p_n) \in \Delta, \qquad (1.21)$$
where $\Delta = \{(p_1, \ldots, p_n) : \sum_{i=1}^{n} p_i = 1,\; 0 \leq p_i \leq 1,\; i = 1, \ldots, n\}$ is the unit simplex. Equation (1.21) can be interpreted as the log-likelihood of a multinomial model whose support is given by the empirical observations $Z_1, \ldots, Z_n$, even though the distribution $F$ of the $Z_i$ is not assumed to be multinomial. The maximum of the log-likelihood in (1.21) is attained at $p_i = 1/n$ (as follows, for instance, from a Lagrange multiplier argument), and hence the empirical measure $F_n = n^{-1} \sum_{i=1}^{n} \delta_{Z_i}$, with $\delta_z$ denoting a unit point mass at $z$, can be regarded as the nonparametric maximum likelihood estimator (NPMLE) of $F$.

Roughly speaking, empirical likelihood is a nonparametric maximum likelihood procedure that has many properties in common with conventional parametric likelihood when applied to moment-constrained models. Let
$$E\{g_{\theta}(Z_i)\} = \int g_{\theta}(z)\, \mathrm{d}F(z) = 0, \qquad \theta \in \Theta \subset \mathbb{R}^k, \qquad (1.22)$$
where $g_{\theta}$ is a known $\mathbb{R}^s$-valued function and both the parameter $\theta$ and $F$ are unknown.
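As a concrete illustration, which is not spelled out in the text above, taking $g_{\theta}(z) = z - \theta$ in (1.22) identifies $\theta$ as the mean of $F$; stacking estimating equations, say $g_{\theta}(z) = (z - \theta_1, (z - \theta_1)^2 - \theta_2)^{\top}$ with $s = 2$, constrains several moments simultaneously.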

The empirical version of (1.22) consists of
$$\sum_{i=1}^{n} p_i\, g_{\theta}(Z_i) = 0. \qquad (1.23)$$
The value of $(\theta, p_1, \ldots, p_n) \in \Theta \times \Delta$ that maximizes $\ell$, as defined in (1.21), subject to (1.23), is called the maximum empirical likelihood estimator and is denoted by $(\hat{\theta}, \hat{p}_1, \ldots, \hat{p}_n)$. To solve this constrained maximization problem, instead of maximizing $\ell$ with respect to $(\theta, p_1, \ldots, p_n)$ jointly, one first fixes $\theta$ and maximizes the log-likelihood with respect to $(p_1, \ldots, p_n)$; the empirical likelihood procedure is thus represented by
$$\max_{p_1, \ldots, p_n} \sum_{i=1}^{n} \log p_i \quad \text{s.t.} \quad \sum_{i=1}^{n} p_i = 1, \quad \sum_{i=1}^{n} p_i\, g_{\theta}(Z_i) = 0. \qquad (1.24)$$
The problem in (1.24) can be solved using the method of Lagrange multipliers, with the corresponding Lagrangian given by
$$L = \sum_{i=1}^{n} \log p_i + \gamma \Bigl(1 - \sum_{i=1}^{n} p_i\Bigr) - n \lambda^{\top} \sum_{i=1}^{n} p_i\, g_{\theta}(Z_i), \qquad (1.25)$$
where $\gamma \in \mathbb{R}$ and $\lambda \in \mathbb{R}^s$ are Lagrange multipliers. It is possible to show that the first-order conditions for $L$ are solved by
$$\hat{\gamma} = n, \qquad \hat{\lambda} = \arg\min_{\lambda \in \mathbb{R}^s} \; - \sum_{i=1}^{n} \log\{1 + \lambda^{\top} g_{\theta}(Z_i)\}, \qquad \hat{p}_i = \frac{1}{n\{1 + \hat{\lambda}^{\top} g_{\theta}(Z_i)\}}.$$
Thus, $\ell(\theta) = \min_{\lambda \in \mathbb{R}^s} - \sum_{i=1}^{n} \log\{1 + \lambda^{\top} g_{\theta}(Z_i)\} - n \log n$, and so the empirical likelihood estimator of $\theta$ is
$$\hat{\theta}_{\mathrm{EL}} = \arg\max_{\theta \in \Theta} \ell(\theta) = \arg\max_{\theta \in \Theta} \min_{\lambda \in \mathbb{R}^s} \; - \sum_{i=1}^{n} \log\{1 + \lambda^{\top} g_{\theta}(Z_i)\}.$$
Numerical methods, such as the Newton–Raphson algorithm, can be used to solve the above maximization problem.
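A minimal numerical sketch of this inner/outer optimization, which is not part of the thesis, is given below; it assumes the simple mean moment condition $g_{\theta}(z) = z - \theta$ and a grid search over $\theta$, and the bounds on $\lambda$ simply enforce $1 + \lambda g_{\theta}(Z_i) > 0$.

```python
# Hedged sketch (not from the thesis): profile empirical log-likelihood for the
# mean, using g_theta(z) = z - theta and the inner optimization over lambda.
import numpy as np
from scipy.optimize import minimize_scalar

def profile_el(theta, z):
    """Profile empirical log-likelihood l(theta), up to the -n*log(n) constant."""
    g = z - theta
    if g.min() >= 0 or g.max() <= 0:              # 0 outside the convex hull of g: -inf
        return -np.inf
    eps = 1e-10                                   # keep 1 + lambda * g_i strictly positive
    lo, hi = -1.0 / g.max() + eps, -1.0 / g.min() - eps
    def inner(lam):
        return -np.sum(np.log1p(lam * g))         # objective minimized over lambda
    res = minimize_scalar(inner, bounds=(lo, hi), method="bounded")
    return res.fun                                # min_lambda of -sum log(1 + lambda * g)

rng = np.random.default_rng(0)
z = rng.exponential(scale=2.0, size=200)          # toy data, true mean = 2
grid = np.linspace(z.mean() - 0.5, z.mean() + 0.5, 201)
theta_hat = grid[np.argmax([profile_el(t, z) for t in grid])]
print(theta_hat, z.mean())
```

In this one-dimensional example the EL estimate reproduces the sample mean, as expected, since the profile log-likelihood is maximized exactly when $\lambda = 0$ solves the inner problem.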

1.6 Problems to be addressed and main contributions

The following work packages describe the challenges that this thesis aims to tackle, along with the strategies implemented for approaching the problems of interest.

Contribution 1: Cluster analysis for nonstationary extremes

The question motivating the research in this work package was recently mentioned by de Carvalho (2016b), who emphasized the need to cluster time series that share similar extremal properties, such as their scedasis function and extreme-value index. The subject of clustering time series has received considerable attention in the recent literature, but none of the available methods allows clustering to be conducted on the basis of such extremal features. In finance, interest in the topic stems from the importance of determining financial assets with similar dynamic patterns and of capturing the level of similarity between financial time series. Beyond Ando and Bai (2017), other recent contributions on clustering financial time series include, for example, D'Urso et al. (2013), who propose a fuzzy clustering method, Bastos and Caiado (2014), who suggest partitioning time series according to a variance ratio statistic, and Dias et al. (2015), who propose a model-based clustering approach built over regime-switching models. With the exception of Durante et al. (2014), none of the available approaches takes into consideration the important need for clustering to explicitly carry information on extreme losses. It is the goal of Chapter 2 to address this gap in the literature.

Contribution 2: Learning and visualizing D-dimensional dependence structures

Several financial applications require working with the joint distribution of a random vector. In particular, in the context of portfolio risk management it is important to quantify risk while accounting for the dependence between the assets that make up a portfolio (Embrechts et al., 2002), especially since one of the main issues in risk management (McNeil et al., 2015) is the aggregation of individual risks.

While copulas fully describe the dependence structure within a D-dimensional random vector, from a statistical outlook it would be desirable to ‘compress’ essential features of the dependence structure into a lower-dimensional function that could be depicted regardless of D; by learning about the shape of that lower-dimensional function from data, one would aim at having a consolidated picture of the dependence structure within a random vector. Yet, none of the available approaches is tailored for such an inquiry. Chapter 3 will be focused on addressing this gap.

Contribution 3: Learning about the frequency of extremes in a K-sample setting

The development of statistical models for nonstationary extremes has been a field of active research since the seminal paper by Davison and Smith (1990), who popularized models based on indexing the parameters of the generalized Pareto distribution with a covariate x. Related approaches can be found in Coles (2001), Chavez-Demoulin and Davison (2005), Eastoe and Tawn (2009), and Opitz et al. (2018), and similar paradigms have recently been developed for the setting of nonstationary multivariate extremes. However, none of these approaches addresses a multisample setting, nor does it explore how to borrow strength across samples. Chapter 4 will be focused on addressing this gap.

1.7 Thesis outline, structure, and organization

This thesis is organized as follows. Chapters 2, 3 and 4 include the proposed solutions for the problems described in Contributions 1, 2 and 3, as discussed in Section 1.6. Each chapter aims to be self-contained in terms of notation, definitions, and results. For convenience of the reader, some parts mentioned in Section 4.1 may be revisited in later chapters. In more detail:

• Chapter 2 develops statistical methods of similarity-based clustering for heteroscedastic extremes, which allow us to group time series according to their extreme-value index and scedasis function. The material in this chapter has led to the paper Rubio et al. (2020a) (submitted).

• Chapter 3 presents a new class of distributions that extends the concept of marginal distribution, to which we refer as diagonal distributions.

The main diagonal, which is studied in detail, consists of a mean-constrained univariate distribution function on [0, 1] that summarizes key features of the dependence structure of a random vector, and whose variance connects with Spearman's rho and the tail dependence coefficient. The material in this chapter has led to the paper Rubio et al. (2020b) (in preparation).

• Chapter 4 develops a model for the extreme values of multisample, non-identically distributed observations. The model builds over de Carvalho and Davison (2014) and Einmahl et al. (2016), and can be regarded as a proportional tails model for multisample settings. A semiparametric specification is made, linking all elements in a family of scedasis densities through the action of a baseline scedasis density.

Finally, Chapter 5 summarizes the main findings, puts the research into perspective, and comments on possible directions for future research.

Chapter 2

Cluster analysis for heteroscedastic extremes

Statistical modelling of the magnitude and the frequency of extreme losses in a stock market is essential for institutional investors, professional money managers, and traders. In this chapter, we develop statistical methods of similarity-based clustering for heteroscedastic extremes, which allow us to group stocks according to their extreme-value index and scedasis function, i.e., the magnitude and frequency of extreme losses, respectively. Clustering is performed here in a product-space, and a tuning parameter is used to control whether more emphasis should be put on the latter or the former. The proposed approach also allows for clustering stocks with similar risk loss patterns, by identifying affinities in time-varying value-at-risk functions.

2.1 Introduction

Small-probability events, such as a stock market crash, often lead to devastating economic and financial aftershocks. The need to assess the likelihood of such rare events is now widely understood, and statistics of extremes (Embrechts et al., 1997; Coles, 2001; Beirlant et al., 2004; de Haan and Ferreira, 2006; Resnick, 2007; Balkema and Embrechts, 2007; Davison and Huser, 2015b) provides an appropriate probabilistic framework to address this issue in a mathematically rigorous manner.

An overarching principle of statistics of extremes is that any sensible assessment of risk requires the application of resilient methods that are able to extrapolate into the tails of a distribution, often beyond the extremes observed in a dataset. For random samples, a key result in statistics of extremes is that if there is a nontrivial limiting distribution for the normalized sample maxima, then it must be a generalized extreme-value (GEV) distribution (McNeil et al., 2015, Theorem 7.3). The shape parameter of the GEV distribution (1.1) governs the rate of tail decay, and is also known as the extreme-value index, or tail index.

The main goal of this chapter is to develop cluster analysis methods for highly volatile time series of extremes, and to apply them to gain more insight into the synchronization of extreme losses on the London Stock Exchange. Applying methods from statistics of extremes to finance has a long history, including Danielsson and de Vries (1997), Longin and Solnik (2001), Poon et al. (2003), Herrera and Schipp (2013), Hilal et al. (2014), and Chavez-Demoulin et al. (2014). The monograph recently edited by Longin (2016) contains an up-to-date survey of methods and applications for statistical modeling of extreme values in finance. To model the dynamics of extremes over time, parametric methods to handle nonstationary data have been proposed by Davison and Smith (1990), Chavez-Demoulin and Davison (2005), and Eastoe and Tawn (2009), among others; for more details, see Section 1.2. More recently, Einmahl et al. (2016) have developed a semiparametric modeling and inference framework for heteroscedastic extremes; their two main objects of interest are the scedasis function, which describes the dynamics of extremes over time, and the extreme-value index, which describes the magnitude of extremes. A recent paper by Ando and Bai (2017), which relates to our approach in terms of financial motivation, considers the issue of clustering financial time series based on observable and unobservable factors. In contrast, our approach clusters financial time series based on their resemblance in terms of risk. More precisely, we build on Einmahl et al. (2016) and develop a clustering algorithm which allows us to group stocks that share similarities in their univariate distributions in terms of the overall magnitude and the temporal dynamics of extreme losses. Since the extreme-value index carries information on the rate of tail decay while the scedasis tracks the evolution of extreme losses over time, these parameters form a natural basis for clustering stocks to address our motivating question.
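To make these two ingredients concrete, the sketch below, which is not part of the thesis, illustrates one way of estimating them from a single series of losses: the extreme-value index with the Hill estimator, and the scedasis by kernel smoothing of the exceedance times, in the spirit of Einmahl et al. (2016); the biweight kernel, the bandwidth $h$, and the number of upper order statistics $k$ are illustrative choices.

```python
# Hedged sketch (not from the thesis): Hill estimator for the extreme-value index
# and a kernel-smoothed scedasis estimate, in the spirit of Einmahl et al. (2016).
# The biweight kernel, the bandwidth h, and the choice of k are illustrative.
import numpy as np

def hill(x, k):
    """Hill estimator of the extreme-value index based on the k largest observations."""
    xs = np.sort(x)
    return np.mean(np.log(xs[-k:]) - np.log(xs[-k - 1]))

def scedasis(x, k, h=0.1, grid_size=100):
    """Kernel estimate of the scedasis c(s), s in (0, 1), from exceedance times."""
    n = len(x)
    threshold = np.sort(x)[-k - 1]                           # (n - k)-th order statistic
    times = (np.arange(1, n + 1) / n)[x > threshold]         # rescaled times of exceedances
    s = np.linspace(0.01, 0.99, grid_size)
    u = (s[:, None] - times[None, :]) / h
    K = np.where(np.abs(u) <= 1, 0.9375 * (1 - u ** 2) ** 2, 0.0)   # biweight kernel
    return s, K.sum(axis=1) / (k * h)                        # c_hat(s), integrates to about 1

rng = np.random.default_rng(1)
losses = rng.pareto(a=3, size=2000) * (1 + 0.5 * np.sin(np.linspace(0, 3, 2000)))  # toy heteroscedastic losses
gamma_hat = hill(losses, k=100)
s, c_hat = scedasis(losses, k=100)
print(gamma_hat, c_hat[:5])
```

Repeating this for each stock yields the pairs of scedasis and extreme-value index estimates on which clustering can then operate.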

Clustering is an unsupervised learning problem, in the sense that the ‘true’ cluster labels are unknown and need to be estimated from the data. Introductions to the subject of data clustering include Hastie et al. (2009), Everitt et al. (2011), and King (2014); for more details, see Section 1.3. The huge literature on data clustering is difficult to survey in a few paragraphs, but a concise description of mainstream approaches is offered by Hennig and Liao (2013). The most popular clustering approaches may be classified among similarity-based, model-based, and/or hierarchical clustering techniques. Similarity-based methods include K-means (MacQueen, 1967) and K-medoids (Kaufman and Rousseeuw, 1987). Model-based clustering is typically based on mixture models (Fraley and Raftery, 2002), whereas hierarchical clustering builds hierarchies of clusters, often represented in so-called dendrograms (Hastie et al., 2009). Hennig and Liao (2013) recently highlighted the lack of clear guidance on choosing an appropriate clustering algorithm for a given problem, despite the many approaches that have been proposed in the literature. The literature discussing extremes is much sparser, although much has been written about the clustering of extreme events within time series induced by short-term temporal dependence (see, e.g., Leadbetter et al., 2012). Clustering algorithms for parallel time series, tailored to specific extreme-value applications, are often intrinsically related to the concept of dimension reduction. For example, Bernard et al. (2013) developed a clustering method for spatial climate extremes, and Chautru (2015) and Vettori et al. (2019a) proposed clustering and dimension reduction techniques for multivariate extreme events based on tree mixtures. Clustering stocks with a similar scedasis function and extreme-value index entails clustering objects defined by the combination of a function (scedasis) and a scalar (extreme-value index); recently, there has been much interest in clustering complex objects such as functions, and the development of methods for clustering functional data is still an area of ongoing research, both from applied and theoretical viewpoints (Peng and Müller, 2008; Delaigle et al., 2012; Wang et al., 2015). To tackle this challenging problem, we develop a clustering approach that can be seen as a K-means method, but where clustering is executed in a product-space defined in terms of the space of all scedasis functions and the positive real line. A weight parameter, specified by the analyst, is then used to control whether the clustering algorithm should prioritize the frequency or the magnitude of extreme losses, which are respectively controlled by the scedasis function and the extreme-value index. The proposed approach also allows for clustering stocks with similar risk loss patterns, by identifying affinities in time-varying value-at-risk functions.
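Purely as a rough sketch, and not the algorithm actually proposed in this chapter, the idea can be mimicked by combining a distance between estimated scedasis functions with a distance between estimated extreme-value indices into a weighted dissimilarity, and then applying a K-medoids-type partitioning; the weight $\omega$, the L2-type distance, and the rescaling below are illustrative assumptions.

```python
# Hedged sketch (not the algorithm proposed in this chapter): a weighted
# product-space dissimilarity combining scedasis curves and extreme-value indices,
# followed by a naive K-medoids partitioning of the resulting dissimilarity matrix.
import numpy as np

def product_space_dissimilarity(c_hats, gamma_hats, omega=0.5):
    """c_hats: (m, G) scedasis estimates on a common grid; gamma_hats: (m,) tail indices."""
    d_c = np.sqrt(((c_hats[:, None, :] - c_hats[None, :, :]) ** 2).mean(axis=2))  # L2-type distance
    d_g = np.abs(gamma_hats[:, None] - gamma_hats[None, :])
    return omega * d_c / max(d_c.max(), 1e-12) + (1 - omega) * d_g / max(d_g.max(), 1e-12)

def k_medoids(D, K, n_iter=50, seed=0):
    """Naive K-medoids (PAM-like) on an m x m dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=K, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)            # assign to closest medoid
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.flatnonzero(labels == k)
            if members.size:                                  # keep old medoid if cluster empties
                new_medoids[k] = members[np.argmin(D[np.ix_(members, members)].sum(axis=0))]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1), medoids

# Toy usage: 10 stocks, scedasis curves on a grid of 50 points
rng = np.random.default_rng(2)
c_all = np.abs(rng.normal(1.0, 0.2, size=(10, 50)))
c_all /= c_all.mean(axis=1, keepdims=True)                   # each curve averages one, like a scedasis
gamma_all = rng.uniform(0.1, 0.5, size=10)
labels, medoids = k_medoids(product_space_dissimilarity(c_all, gamma_all, omega=0.7), K=3)
print(labels)
```

In this sketch, a weight $\omega$ close to one makes the partition track the frequency of extreme losses (scedasis), whereas $\omega$ close to zero makes it track their magnitude (extreme-value index), mirroring the role of the tuning parameter described above.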
