2. Related Works 11
2.2. The value of data
“Data” has become a valuable production factor comparable in importance to capital, land, or technology. An increasing number of business models rely on data to provide services to end users or to improve decision-making, and hence estimating the value of data has become a central problem that has attracted the attention of practitioners from different disciplines. From accounting and estimating the value of data assets of a firm in a due diligence process [190] to estimating the value of data samples to train specific ML models [75], the range of data valuation tasks and their targets is wide and covers very different goals. Due to the elusive nature of “data” as an economic good, it is far from obvious how to estimate its value, and its unique characteristics dramatically condition how it is exploited and traded in the economy.
Before summarising the literature dealing with this topic, Sect. 2.2.1 discusses some metaphors comparing data to other well-known economic goods in an attempt to explain its peculiar economics. Then we review the distinctive attributes of data as an economic good in Sect. 2.2.2. Finally, Sect. 2.2.3 provides an overview of the state of the art and highlights relevant publications dealing with the value of data. The existing research around this topic is immense and comprises works from very different disciplines. It is not the objective of this section to provide a thorough list of publications dealing with the topic, but to point the reader to valuable survey works and to frame the scope of our contributions.
2.2.1. “Data is like...” – Metaphors about the value of data
It is precisely the elusive nature of data as an economic good and an asset that has inspired a number of metaphors in industry in recent years. Perhaps the most famous (and abused) quotation is that “Data is the new oil”, originally attributed to Clive Humby’s keynote in the Senior Marketer’s Summit at Kellogg School of the American Advertisers Association back in 2006 [30].
He emphasised that data is easily available to marketers and that it has huge value if adequately “refined” and transformed into relevant information for making decisions and taking actions. He used the term
commodity, which is a gross oversimplification of what data is. Unlike oil, data can be copied, transmitted, and processed at close to zero cost. Notice that whereas two litres of gasoline yield a similar mileage on two similar cars under similar driving styles, nothing of this sort applies to data, since 1) two datasets of equal volume may carry vastly different amounts of usable information, 2) the same information may have tremendously different value for Service A than for Service B, and 3) even if the per-usage value of two services is the same, Service A may use the data 1,000 times more intensely than Service B, leading to extremely different produced benefits.
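The third point can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that the benefit a service obtains from a dataset is its per-usage value multiplied by the number of uses; the function name and all figures are hypothetical and are not taken from the cited works.

```python
def produced_benefit(value_per_use: float, uses: int) -> float:
    """Toy model (an assumption): benefit = per-usage value x number of uses."""
    return value_per_use * uses

# Hypothetical figures: identical per-usage value, but Service A uses
# the data 1,000 times more intensely than Service B.
service_a = produced_benefit(value_per_use=0.01, uses=1_000_000)
service_b = produced_benefit(value_per_use=0.01, uses=1_000)

ratio = service_a / service_b
print(f"Service A: {service_a:.2f}, Service B: {service_b:.2f}, ratio: {ratio:.0f}")
```

Under this simple model the same dataset yields benefits that differ by three orders of magnitude, which is why a single per-unit price, as for oil, says little about the value of data.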
Ed King, CEO of OpenPrise, compared data to water [103] in the sense that data is everywhere, and companies live on data but can also be “drowned” in useless data. Data, like water, needs to be cleansed and processed for consumption. As in selling water, packaging is very important. He also touches upon the value of data for specific purposes and states that it is not necessarily tied to its price. For example, free open data can be used to enrich and significantly enhance expensive datasets. However, unlike water or oil, data is non-depletable, meaning that a data source cannot be exhausted by repetitive and continuous consumption and usage.
Some authors go as far as comparing data to a currency, in the sense that it has economic value and can be purchased, sold and traded, as long as ownership is clearly established and protected [73].
Other works question and review the role of governments as a data producer, consumer and facilitator of this data market that will enable the new currency [61]. The idea of providing micropayments to users for their personal data as a way to overcome the abuse of privacy on the Internet has received a lot of public attention [114, 155]. More recent work describes fundamental technological challenges that need to be addressed for the above vision to be fulfilled [115].
In that scenario, individuals would be able to earn an income from their personal data and from their contribution to the digital economy. This transforms data into a production factor comparable to labour [11]. Should this possibility materialise, users may need to join data unions that defend their rights over the exploitation of their personal data. Some PIMS are already positioning themselves as a data union of “users practising their data rights” [174]. Membership of such unions should probably be recognised as a digital right, and supporting them may require regulatory intervention upon large digital firms, similar to the obligations regarding labour unions that have been imposed on large traditional enterprises. Data unions would probably be required to act as not-for-profit organisations (i.e., as fiduciaries) for passive data users to trust them [54].
However, data as labour is not an accurate metaphor for data in the economy, either. Unlike labour, data is a non-rivalrous good, meaning that its supply is not affected by its consumption, and thus selling data to a consumer A does not prevent a data provider from selling (a copy of) the same data to a consumer B.
More recently, Tim O’Reilly compared data to sand, thereby highlighting the huge effort required to develop digital products and services from raw data and sending a radically different message regarding the distribution of the value of data across the chain. Raw data from an individual is worthless unless it is combined with that of millions of other individuals and adequately treated through complex processing and data pipelines to feed an innovative use and business case [180].
In the end, data is a digital sub-product, and hence likely to benefit from the recommendations on the monetisation of this kind of economic good [164]. Still, some authors have pointed to specific characteristics of data that are not strictly applicable to other digital goods, such as the uncertainty as to the units of data to be traded (or rather their dependency on the type of data and the context in which it is used), its inherently combinatorial and aggregated value, and the fact that data is often consumed by machines, whereas digital products and services target people and enterprises. Finally, data and digital goods are sold in different ways. Unlike digital goods, data is often an intermediate production factor to be combined and processed with other data to build digital services [153].
In conclusion, the economics of “data” are peculiar, and no single metaphor or comparison is able to accurately capture its specific characteristics and define its behaviour in the economy.
Rather than being accurate in all situations, the intention of such metaphors has been to highlight specific features of “data” in keynote speeches, position papers, reports, or scientific papers.
2.2.2. Data as an economic good
Many works discuss the characteristics of data as an economic good. In this section, we point to some of them and summarise key features of data, which will serve as preliminary background to understand the challenges of marketplaces and entities dealing with data and trading it in the market [47, 113]. As an economic good, data is:
Freely replicable, meaning it can be copied and transmitted at close to zero cost.
Non-depletable or non-perishable, meaning that it is not exhausted by repetitive and continuous consumption.
Reusable for different purposes and clients.
Non-rivalrous, that is, selling data to a consumer A does not prevent selling (a copy of) the same data to another consumer B.
Permanent, unlike information, which is perishable and depreciates over time.
Unsurprisingly, these peculiar characteristics determine the behaviour of data in the economy.
Next we summarise some considerations found in the literature about the value of data:
Since the marginal cost of processing data is reduced, its value increases with its use.
It has an inherently combinatorial value: the more data is combined and used, the more value it usually generates.
However, its value depends on the context and on the existing information: more data does not necessarily mean better data.
Its value is affected by quality, but features affecting that quality (e.g., completeness, accuracy, timeliness) also depend on the context.
Uniqueness also affects the value of data: the more you share it, the lower its value becomes.
Packaging and delivery methods (linkability, interoperability) are important.
Its value is affected by externalities, such as the implications of data sharing in terms of privacy.
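The combinatorial and context-dependent nature of the value of data listed above can be illustrated with a toy set function. The sketch assumes a hypothetical service whose utility depends only on which distinct attributes it can access; the dataset names and the utility figures are invented for illustration and do not come from the cited literature.

```python
from typing import Iterable

# Hypothetical datasets, each described by the attributes it covers.
DATASETS = {
    "locations": {"where"},
    "timestamps": {"when"},
    "locations_copy": {"where"},  # redundant duplicate of "locations"
}

def utility(selected: Iterable[str]) -> float:
    """Toy utility (an assumption): one attribute alone is worth 1,
    whereas 'where' and 'when' together are worth 5."""
    covered = set()
    for name in selected:
        covered |= DATASETS[name]
    if covered == {"where", "when"}:
        return 5.0
    return float(len(covered))

combined = utility(["locations", "timestamps"])                   # complementary data
separate = utility(["locations"]) + utility(["timestamps"])       # sold in isolation
redundant = utility(["locations", "locations_copy"])              # more data, same information

print(combined, separate, redundant)
```

In this toy setting the combined datasets are worth more than the sum of their parts, while adding a redundant copy contributes nothing, which mirrors the observation that more data does not necessarily mean better data.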
In a nutshell, data is a key factor for producing digital services and goods, and the goal of data pipelines and value chains is to transform data into information and then into more valuable digital services and goods that benefit from the increasing value of combining different pieces of data.
Hence, enabling a market around data means trading intermediate inputs along these pipelines and allowing third parties to combine them into more complex use cases aimed at improving decision making, increasing the efficiency of productive and operational processes, and creating new innovative services for end users.
Furthermore, the characteristics of data and the former considerations regarding its value affect how it is traded in the market [97, 164] and shape the data economy and the nascent data markets. Since data is easily replicable and its value decreases as it is shared in the market, companies are often reluctant to share data and give up monetising it for fear of losing competitive advantages in doing so. This is having severe consequences and undermining the potential benefits of data in the market:
1. Firms tend to hoard huge amounts of data, avoiding any sharing with third parties. As a result, most data being collected by companies nowadays remains in silos.
2. Since it is often difficult to find data in the market, most companies look for innovative ways to collect those data themselves. As a result, data collection is inefficient.
3. The abuse of personal data collection with little or no consent and the amount of personal data stored by different parties are fuelling a sense of lack of privacy on the Internet.
Regardless of its nature, data is becoming a cornerstone of the digital economy, and it is now considered a key asset of data-driven companies. Therefore, it is necessary to measure its value to understand the importance of firms to the economy, or just to monitor the development of the data economy. The practice of data governance intends to improve the management of data, make an inventory of data assets in firms, and apply asset valuation methodologies to measuring and understanding the value of data. In the next section, we provide an overview of these valuation methodologies and we frame the scope of the work and contributions of this thesis in this field.
2.2.3. Measuring the value of data
For most people, the ‘value of data’ has been linked to that of personal data, and more specifically to its application in marketing and advertising. As we will show in chapters 3 and 4, this view is very restrictive, and there are many different types of data being traded in data marketplaces that can be used in very different use cases other than targeting users to improve the performance of online advertising campaigns [87]. Due to the wide spectrum of data and use cases that can be found in the market [118], many different methodologies and works attempt to estimate its value, often resulting in apparently contradictory estimations.
From a macroeconomic perspective, the OECD published an interesting survey summarising different approaches to determine the monetary value of personal data [139], including very heterogeneous methods such as examining market capitalisation, revenues or net income of data-driven firms per individual, analysing revenues or net income per record/user, or assessing the cost of data breaches, which in turn treats personal data as a liability. Another common methodology to approach this problem is through economic experiments and surveys on users’ willingness to pay to protect their data [34]. A more recent literature survey adds impact-based valuation, which also considers the social and economic outcomes of data use cases [57].
In chapter 4, we address the value of data from a market perspective, which assumes the value is related to the prices of data in data markets, and the volume of money traded in such markets.
Previous papers have presented similar market-based solutions close to the one we followed, but focused on prices observed in online advertising [100]. Some tools like this were implemented later on and are able to estimate the relative value of different user profiles and the revenue users generate for social networks like Facebook [31].
Other works have assessed the value of data for specific tasks from a perspective on the border between microeconomics and computer science. As opposed to macroeconomic approaches, these works provide a detailed valuation of data for specific purposes and contexts. Finding scalable and fair ways to compute value-based contributions to an ML problem is key to calculating the contribution of individuals to the data economy, and is therefore a prerequisite for moving towards a human-centric data economy.
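One notion of fair value-based contribution is the Shapley value, which this section later names as the notion used in this thesis. The following is a minimal sketch of its exact computation, not the method implemented in the thesis: the data sources `A`, `B` and `C` and the `toy_accuracy` utility are hypothetical, standing in for the accuracy an ML model would reach when trained on each subset of sources.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence

def shapley_values(players: Sequence[str],
                   utility: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution of each player
    over all orderings. The cost grows factorially with the number of players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

def toy_accuracy(coalition: FrozenSet[str]) -> float:
    """Hypothetical utility: A and B carry the same information (worth 0.5
    once), while C adds an independent 0.3."""
    score = 0.0
    if "A" in coalition or "B" in coalition:
        score += 0.5
    if "C" in coalition:
        score += 0.3
    return score

values = shapley_values(["A", "B", "C"], toy_accuracy)
print(values)
```

In this toy case the redundant sources A and B split their shared contribution equally, C keeps its independent contribution, and the values add up to the utility of the full set. The factorial number of orderings is what makes the exact computation impractical beyond a handful of sources, hence the need for scalable approximations.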
In particular, in this thesis we have focused on the value of spatio-temporal data for ML prediction tasks. The use of spatio-temporal data in transportation and smart city applications has attracted much attention from the research community. Different works look at how knowledge extraction from spatio-temporal data can significantly improve the effectiveness of transportation [198], mobility prediction [10], or last mile delivery [46], among others.
In this context, some authors have also studied the intrinsic value of spatio-temporal information, and calculate it as the reduction of the uncertainty about the position of an individual [136].
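As a rough illustration of value as uncertainty reduction, the sketch below measures uncertainty with Shannon entropy over a discrete set of candidate positions. This choice is an assumption made here for illustration; the exact formulation used in [136] is not reproduced, and the probabilities are hypothetical.

```python
import math
from typing import Sequence

def entropy_bits(probabilities: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical example: an individual may be in one of four areas.
prior = [0.25, 0.25, 0.25, 0.25]     # before observing the data
posterior = [0.5, 0.5, 0.0, 0.0]     # after the data rules out two areas

uncertainty_reduction = entropy_bits(prior) - entropy_bits(posterior)
print(f"Uncertainty reduced by {uncertainty_reduction:.1f} bit(s)")
```
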
Similar to ours, other works have dealt with the extrinsic value of spatio-temporal data for a specific problem in a specific context [8, 9], while others have introduced the notion of privacy in pricing spatio-temporal data [135]. These valuable works differ from ours in that 1) they adopt different notions of value instead of using the more generic Shapley value used in our work, 2) they