
The Time Series Processing Framework (TSPF) has the following user-controlled options:

1. Social media data read-in selection. The user chooses whether to read in raw social media data for a particular financial-instrument/Twitter-Filter combination from the Twitter Collection Framework (TCF) for the first time, or to open TCF data that has been read in on a previous occasion. Reading data for the first time is more time-consuming, as the TSPF has to convert the .txt file data into MATLAB’s own .m file data line by line, which takes place at a rate of up to 2,500 rows per second on a standard desktop machine. Opening previously read data on any subsequent occasion is near-instantaneous. There is no limit on the size of the data files that can be read in.

2. Financial data read-in selection. The user selects the underlying file which contains the raw price data for a particular financial-instrument/Twitter-Filter combination. Financial data are sourced either from Dukascopy (in which case the data are in the form of a CSV file), or from Fulcrum Asset Management (in which case the data are in the form of an .m file), as discussed in Chapter 4.3. Whilst there is no restriction on the granularity of the financial data that can be used, all financial data considered in this study were presented in 5-minute tick intervals.

3. Discretisation-window selection. The user selects the size of the window into which the social media and financial data are aggregated. This allows for the conversion of raw data, which is continuous, into discretised time frames, as discussed in Chapter 5.3. The choice of discretisation frequency in the financial services industry is often ad hoc, typically dictated by the observation intervals of the available data [79]. As discussed in Chapter 4.1, the development of SocialSTORM [57] provided preliminary access to Twitter data for initial exploration of the relationships between social media data and financial data. Whilst the Twitter data provided by SocialSTORM was continuous, as is the case with the TCF, the financial data used during this preliminary investigation was not available at discretised resolutions finer than an hour [a] [80]. Because of this past data limitation, it was decided that relationships between Twitter data and financial data would be evaluated as discretised to the hourly level, followed by testing the robustness of the relationships at different discretisation levels (as discussed in Chapter 7.1).

For example, if the user selects a window size of 1 hour, the system performs the following calculations:

a) A discretised time-series T of time-stamps with elements Ti is created, where T1 = 00:00:00 on 11th December 2012 and Tn = 23:59:59 on 11th March 2013 (bringing the data-capture period up to 12th March 2013, giving a total of 90 days).

b) The number of periods per 24 hours is determined as a function of the desired window size W, expressed in hours (in this example, W = 1):

$$N_{\text{periods}} = \frac{24}{W} = \frac{24}{1} = 24$$

The number of elements in the discretised time-series T is therefore:

$$n = N_{\text{periods}} \times 90 = 24 \times 90 = 2160$$

[a] Financial data used for the preliminary investigation was sourced from Thomson Reuters and from Fulcrum Asset Management, and was discretised to hourly windows due to the unavailability of higher-resolution data.

c) Each data-point in the input time-series of price, sentiment and message volume (Iprice, Isentiment, Imessage volume) is then assigned to a location in the discretised time-series T. An input data-point I is deemed to belong to location Ti if its time-stamp is later than (but not equal to) the time-stamp of the chronologically previous location, Ti−1, and earlier than or equal to the time-stamp of the current location, Ti.

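d) For each location in the discretised time-series T, the discretised means of the values of each of the corresponding input data series of price, sentiment and message volume (Iprice, Isentiment, Imessage volume) are determined. Denoted $\bar{D}_{\text{price},T_i}$, $\bar{D}_{\text{sentiment},T_i}$ and $\bar{D}_{\text{message volume},T_i}$ respectively, these are calculated as:

$$\bar{D}_{\text{price},T_i} = \frac{I_{\text{price},1} + I_{\text{price},2} + \cdots + I_{\text{price},n}}{n}$$

$$\bar{D}_{\text{sentiment},T_i} = \frac{I_{\text{sentiment},1} + I_{\text{sentiment},2} + \cdots + I_{\text{sentiment},n}}{n}$$

$$\bar{D}_{\text{message volume},T_i} = \frac{I_{\text{message volume},1} + I_{\text{message volume},2} + \cdots + I_{\text{message volume},n}}{n}$$

where n in these expressions denotes the number of input data-points of the given series belonging to location Ti, rather than the number of elements of T.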

e) Finally, the changes in these discretised mean values of Iprice, Isentiment and Imessage volume are calculated. Denoted $\Delta\bar{D}_{\text{price},T_i}$, $\Delta\bar{D}_{\text{sentiment},T_i}$ and $\Delta\bar{D}_{\text{message volume},T_i}$ respectively, these are calculated as:

$$\Delta\bar{D}_{\text{price},T_i} = \bar{D}_{\text{price},T_i} - \bar{D}_{\text{price},T_{i-1}}$$

$$\Delta\bar{D}_{\text{sentiment},T_i} = \bar{D}_{\text{sentiment},T_i} - \bar{D}_{\text{sentiment},T_{i-1}}$$

$$\Delta\bar{D}_{\text{message volume},T_i} = \bar{D}_{\text{message volume},T_i} - \bar{D}_{\text{message volume},T_{i-1}}$$

In this manner, this methodology not only discretises the input data, but also normalises the data by the volume of data-points for each element in the time-series T.


f) Note that the values of $\Delta\bar{D}_{\text{price},T_1}$, $\Delta\bar{D}_{\text{sentiment},T_1}$ and $\Delta\bar{D}_{\text{message volume},T_1}$ (i.e., for element T1) are empty, as these are the first entries in the discretised time-series T and there are therefore no prior elements from which to calculate the changes in the discretised mean values of Iprice, Isentiment and Imessage volume.
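The calculations in steps (a) to (f) can be illustrated with a minimal sketch. It is written in Python rather than the MATLAB of the TSPF itself, and it is not the code given in the Appendix. The function names (`window_edges`, `discretised_means`, `changes`) are hypothetical. Two assumptions are made that the text does not specify: each Ti is treated as the closing edge of its window, with the first window ending one window-length after the start of the capture period, and a window containing no data-points is given an empty value (`None`).

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def window_edges(start, days, window_hours):
    """Steps (a)-(b): closing time-stamps T_1..T_n of each window."""
    periods_per_day = round(24 / window_hours)  # N_periods = 24 / W
    step = timedelta(hours=window_hours)
    return [start + step * i for i in range(1, periods_per_day * days + 1)]


def discretised_means(points, start, edges):
    """Steps (c)-(d): mean of the values whose time-stamp t satisfies
    T_{i-1} < t <= T_i. Windows without data yield None (an assumption)."""
    sums = [0.0] * len(edges)
    counts = [0] * len(edges)
    for stamp, value in points:
        if stamp <= start or stamp > edges[-1]:
            continue  # outside the capture period
        i = bisect_left(edges, stamp)  # first edge that is >= stamp
        sums[i] += value
        counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def changes(means):
    """Steps (e)-(f): change from the previous window; the first entry
    has no predecessor and is therefore empty."""
    out = [None]
    for prev, cur in zip(means, means[1:]):
        out.append(None if prev is None or cur is None else cur - prev)
    return out


# Illustrative data only: three made-up price observations.
start = datetime(2012, 12, 11)
edges = window_edges(start, days=90, window_hours=1)
prices = [
    (start + timedelta(minutes=30), 1.0),
    (start + timedelta(hours=1), 3.0),  # exactly on the edge: first window
    (start + timedelta(hours=1, seconds=1), 5.0),
]
means = discretised_means(prices, start, edges)
deltas = changes(means)
print(len(edges), means[:3], deltas[:3])
```

The same three functions would be applied separately to the price, sentiment and message-volume series. Locating the window with a binary search over the sorted edges keeps the half-open rule of step (c) explicit: a data-point stamped exactly on an edge falls in the window that the edge closes.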

The TSPF also calculates the net sentiment for each Tweet, as described in Chapter 5.1. This is calculated by subtracting the negative sentiment from the positive sentiment for each message, and is ranked on a scale of -4 (most negative) through 0 (neutral) to +4 (most positive).
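As a small illustration of the subtraction described above, the following Python sketch computes a net score for one message. The function name `net_sentiment` is hypothetical, and the component ranges are an assumption: the sketch supposes that positive and negative sentiment are each expressed as a magnitude from 1 to 5, which is one scoring that would produce the stated range of -4 to +4. The actual component scales are those defined in Chapter 5.1.

```python
def net_sentiment(positive, negative):
    """Net sentiment = positive strength minus negative strength.

    Assumes (hypothetically) that both inputs are magnitudes in 1..5,
    so the result lies between -4 and +4.
    """
    for score in (positive, negative):
        if not 1 <= score <= 5:
            raise ValueError("sentiment strengths are assumed to lie in 1..5")
    return positive - negative


# Illustrative values only.
print(net_sentiment(5, 1), net_sentiment(3, 3), net_sentiment(1, 5))
```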

A full copy of the code underpinning the TSPF is available in the Appendix (see Chapter 11.2).
