This dissertation, and specifically the publications presented herein, contributes to the high-frequency econometrics literature. It does so from three different perspectives involving three different areas, which are nevertheless linked to each other, as addressed in Section 1.3. First, this dissertation looks at the problem of mid-price prediction in limit order book data and presents a new publicly available limit order book dataset for machine learning applications. Second, it explores the fractal nature of different duration time-series extracted for several order book events. Third, it deals with the prediction of daily realized measures of volatility under a copula-based approach.
Publication I makes two contributions, a major one and a minor one. The major contribution is not a contribution to the literature proper, but rather to the research community involved in ML applications in high-frequency finance: a Limit Order Book (LOB) dataset is one of the outcomes of this research. Publication I is motivated by the substantial heterogeneity in the datasets so far utilized in high-frequency ML applications and by the challenge of accessing free but detailed and accurate LOB data. As a consequence, comparability and reproducibility of earlier studies are particularly challenging, also because the overall experimental protocol design is not uniformly addressed. With Publication I we disclose a publicly available
Level-II LOB dataset including the 10 best order book levels on both sides of the book, along with a number of features extracted from the raw data that capture static and dynamic aspects of the message inflow. As a second contribution, the problem of mid-price prediction is addressed (for this specific data, for the specific output classification labels and estimation-forecasting scheme, in terms of the movement over the next 1, 2, 3, 5 and 10 events, and for different data normalization procedures). For this purpose, two standard ML methods (ridge regression and a single-layer feed-forward network) are implemented, and the respective performance measures (and errors) are reported. A well-specified experimental protocol guaranteeing the replicability of the results is carefully described. In this way, the dataset, along with the outputs of the mid-price prediction task and the performance results for standard models, constitutes a reference for future machine learning applications.
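To make the prediction target concrete, the following minimal sketch labels the mid-price movement over the next k events. It is not the labeling rule of Publication I: the function name `midprice_labels`, the threshold `alpha`, the three-class coding and the toy quotes are all assumptions introduced here for illustration.

```python
def midprice_labels(best_ask, best_bid, horizon, alpha=0.0):
    """Label each event as up (1), stationary (0) or down (-1).

    Hypothetical rule: compare the mid-price `horizon` events ahead with
    the current mid-price; relative changes within +/- alpha count as
    stationary.
    """
    mid = [(a + b) / 2.0 for a, b in zip(best_ask, best_bid)]
    labels = []
    # The last `horizon` events have no future mid-price and get no label.
    for t in range(len(mid) - horizon):
        change = (mid[t + horizon] - mid[t]) / mid[t]
        if change > alpha:
            labels.append(1)
        elif change < -alpha:
            labels.append(-1)
        else:
            labels.append(0)
    return labels


# Toy best quotes (made-up numbers, not taken from the dataset).
ask = [10.5, 10.5, 11.0, 11.0, 10.0, 10.0]
bid = [10.0, 10.0, 10.5, 10.5, 9.5, 9.5]
for k in (1, 2, 5):
    print(k, midprice_labels(ask, bid, k))
```

Longer horizons leave fewer labeled events, since the last k events of the sample have no future mid-price to compare against.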
The natural tensor representation of a time-series constitutes an attractive direction for time-series modeling. Several ML algorithms exploit its classical representation in terms of time-specific vectors of features, in which intertemporal connections are disregarded. From this perspective, Publication II develops and utilizes ML strategies capable of dealing with tensor inputs. Classical Fisher’s Linear Discriminant Analysis (LDA) (e.g. Welling, 2005) relies on vector inputs, but its extension known as Multilinear Discriminant Analysis (MDA) is capable of accommodating tensors as inputs. LDA and MDA are applied to the mid-price forecasting problem on the very same dataset of Publication I, and the performance measures are shown to improve noticeably when switching from LDA to MDA. However, MDA is complex in its implementation and parameter estimation. An alternative estimator based on the tensor representation of the time-series is therefore developed:
Weighted Multi-channel Time-series Regressor/Regression (WMTR).¹ WMTR is easy to implement and fast to calibrate. However, a number of parameters need to be properly tuned. A method for optimal parameter selection based on the algorithm learning rate is devised and applied. With respect to the ML methods implemented in Publication I and the results of Passalis, Tsantekidis et al. (2017), WMTR leads to the best-performing F1 measure. A tensor-based representation of the time-series, together with appropriate ML methods capable of exploiting it, therefore seems preferable to input vector-representations.

¹ The term “channel” is popular in signal processing language, the background of the co-authors of Publication II, while “multi-channel” means multivariate inputs. A “channel” is a signal that is observed or acquired from different sources. It might represent different frequency bands or, e.g., inputs from different sensors placed at different positions in space (conveying spatial information). For the dataset Publication II uses, “channel” refers to the 144 different features (characteristics, data representations) that are extracted from the data (Section C therein), while “multi-channel” refers to the actual nature of the inputs feeding the algorithm, i.e. collections of multiple channels (vectors of features).
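The difference between vector and tensor inputs can be illustrated with a minimal sketch. It stacks consecutive event-wise feature vectors into matrices whose rows are features (channels) and whose columns are time steps, so that the temporal ordering is kept within each sample. The function name `to_tensor_samples`, the sliding-window construction and the window length are assumptions made for illustration, not the exact input construction of Publication II.

```python
def to_tensor_samples(feature_vectors, window):
    """Stack `window` consecutive feature vectors into (features x window) matrices."""
    samples = []
    for end in range(window, len(feature_vectors) + 1):
        block = feature_vectors[end - window:end]
        # Transpose so that rows index features (channels) and columns index time.
        samples.append([list(row) for row in zip(*block)])
    return samples


# Four events with two hypothetical features each.
events = [[1, 10], [2, 20], [3, 30], [4, 40]]
tensor_samples = to_tensor_samples(events, window=3)
print(tensor_samples[0])  # first sample: 2 features x 3 time steps
```

A vector-based method would instead see each event (or a flattened window) as one unstructured feature vector, which is the representation that disregards intertemporal connections.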
Publication III contributes to the literature by providing an analysis of the fractal properties of duration time-series extracted from the limit order book. Earlier studies have separately analyzed inter-trade durations (e.g. Ivanov, Yuen and Perakakis, 2014) and inter-cancellation durations (e.g. Gu et al., 2014), and detected crossovers in the scaling exponent for inter-trade durations. Along their lines, we adopt detrended fluctuation analysis to characterize the long-range correlation of inter-order, inter-trade and inter-cancel durations, as well as of the cross-event durations from order submissions to their cancellation or trade (order-to-cancel and order-to-trade durations). This is done for the best book levels on both sides, for different securities all listed at NASDAQ Nordic but traded on different exchanges. The analyses point out a ubiquitous presence of long-range autocorrelation in all the series analyzed, consistency in the scaling exponent estimates within a single exchange and some heterogeneity between exchanges. For all the series, crossovers are identified at day, week and month horizons, whereas Ivanov, Yuen, Podobnik et al. (2004), Ivanov, Yuen and Perakakis (2014) and Tiwari et al. (2017) reported two scaling exponents for the inter-trade durations only. This indicates that fractal properties in duration series are more complex than previously thought. Following and widely expanding Ivanov, Yuen and Perakakis (2014), we explore the association between some relevant but generic economic variables and the scaling exponent. The most relevant association we find is that between the scaling exponent and volatility, which has a strong financial interpretation.
Furthermore, associations between the scaling exponents in the order-to-order and cancel-to-cancel series and the economic variables show complex and widespread patterns for the ten time-series involved (5 stocks, bid and ask side for each), underlining the complex nature of long-range autocorrelation in the order book, specifically for the duration series analyzed.
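For readers unfamiliar with detrended fluctuation analysis, the following minimal sketch outlines the standard first-order procedure: integrate the mean-centered series, detrend it linearly within non-overlapping windows, and estimate the scaling exponent as the slope of log F(n) against log n. The function names, window sizes and the simulated input are assumptions for illustration; this is not the implementation used in Publication III, and it fits a single slope rather than detecting crossovers.

```python
import math
import random


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dfa_fluctuation(series, window):
    """Root-mean-square fluctuation F(n) of the linearly detrended profile."""
    mean = sum(series) / len(series)
    profile, total = [], 0.0
    for x in series:
        total += x - mean
        profile.append(total)
    xs = list(range(window))
    squared = []
    # Non-overlapping windows; trailing points that do not fill a window are dropped.
    for start in range(0, len(profile) - window + 1, window):
        segment = profile[start:start + window]
        slope, intercept = linear_fit(xs, segment)
        squared.extend((y - (slope * x + intercept)) ** 2
                       for x, y in zip(xs, segment))
    return math.sqrt(sum(squared) / len(squared))


def dfa_exponent(series, windows):
    """Slope of log F(n) versus log n over the given window sizes."""
    log_n = [math.log(n) for n in windows]
    log_f = [math.log(dfa_fluctuation(series, n)) for n in windows]
    slope, _ = linear_fit(log_n, log_f)
    return slope


# Simulated uncorrelated noise (not duration data); the estimate
# should be close to 0.5 for such a series.
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
alpha = dfa_exponent(noise, [16, 32, 64, 128, 256])
print(round(alpha, 2))
```

An exponent above 0.5 would indicate long-range positive autocorrelation. A crossover would appear as a change in slope between ranges of window sizes, which can be probed by calling `dfa_exponent` separately on each range.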
Publication IV presents a novel method for volatility modeling and forecasting, in particular for modeling and forecasting daily measures of realized volatility. Several methods have been proposed as extensions of the HAR model of Corsi (2009). Among them, A. J. Patton and Sheppard (2015) separate the positive and negative contributions of intraday returns to the realized volatility, Bollerslev et al. (2016) exploit the discrepancy between the Realized Variance (RV) measure and the Integrated Variance (IV) in finite samples, Andersen, Bollerslev and Diebold (2007) account for a jump component and, for instance, Hillebrand et al. (2007), McAleer et al. (2011) and Buccheri et al. (2017) introduce non-linearities in the HAR specification. In this regard, the econometric literature lacks a copula-based approach. Publication IV shows a close connection between the linear regression in the HAR model formulation and the modeling of conditional expectations of general multivariate distributions, achievable with the so-called pair-copula construction method. This approach is inspired by the work of Sokolinskiy et al. (2011), which makes use of simple bivariate copulas for modeling joint distributions. However, the actual framework of the HAR model, which serves as a basis and benchmark for the CV-HAR model, calls for a flexible yet parsimonious multivariate copula construction method, i.e. Vine copulas. Publication IV shows how to tackle the problem of predicting tomorrow’s volatility given the past information in this setting, with several different approaches to, e.g., modeling the marginal CDFs or selecting the pair-copulas. Importantly, it shows that the conditional expectation which serves as a predictor can be quickly and efficiently computed by numerical integration: despite the complexity of the distribution (obtained as a mixture of copulas and marginal CDFs), no simulation methods are invoked. A reliable and wide dataset involving 10 stocks confirms the improvement of the CV-HAR model over the HAR model in forecasting daily realized measures. Following the practice of (e.g.
Andersen, Bollerslev and Diebold, 2007; Bollerslev et al., 2016), different performance measures, in-sample and out-of-sample schemes for their evaluation, and proper statistical testing are considered, confirming that the CV-HAR model is able to capture features of the volatility dynamics that the HAR model cannot. Furthermore, the CV-HAR model has a number of smaller advantages over the HAR model, e.g. it does not lead to negative volatility forecasts.
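The idea of obtaining a copula-based conditional expectation by numerical integration, without simulation, can be sketched in a deliberately simplified bivariate setting. The example below uses a single Gaussian copula with normal margins, for which E[Y | X = x] is known to equal rho times x, so the numerical result can be checked. The copula family, the margins, the integration bounds and all names are assumptions for illustration; the CV-HAR model of Publication IV relies on Vine copulas and its own choices of marginal CDFs, which this sketch does not reproduce.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gaussian_copula_density(u, v, rho):
    """Density of the bivariate Gaussian copula with correlation rho."""
    a = STD_NORMAL.inv_cdf(u)
    b = STD_NORMAL.inv_cdf(v)
    det = 1.0 - rho * rho
    exponent = -(rho * rho * (a * a + b * b) - 2.0 * rho * a * b) / (2.0 * det)
    return math.exp(exponent) / math.sqrt(det)


def conditional_mean(x, margin_x, margin_y, rho, lo=-6.0, hi=6.0, steps=2000):
    """E[Y | X = x] as the integral of y * c(F_X(x), F_Y(y)) * f_Y(y) dy.

    Computed with the trapezoid rule on [lo, hi]; the bounds are
    hypothetical and must cover the bulk of the margin of Y.
    """
    u = margin_x.cdf(x)
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * width
        value = y * gaussian_copula_density(u, margin_y.cdf(y), rho) * margin_y.pdf(y)
        # End points receive half weight under the trapezoid rule.
        total += value * (0.5 if i in (0, steps) else 1.0)
    return total * width


# Standard normal margins and rho = 0.5: the exact answer is rho * x.
estimate = conditional_mean(1.0, NormalDist(), NormalDist(), rho=0.5)
print(estimate)
```

With other margins or copula families the same integral generally has no closed form, and the quadrature step is what replaces simulation in this simplified setting.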