Relational databases - Study of technologies


In this chapter we will often discuss methods suitable for classifying and grouping observations into homogeneous groups. In other words, we will consider the relationships between the rows of the data matrix, which correspond to observations. In order to compare observations, we need to introduce the idea of a distance measure, or proximity, among them. The indexes of proximity between pairs of observations furnish indispensable preliminary information for identifying homogeneous groups. More precisely, an index of proximity between any two observations $x_i$ and $x_j$ can be defined as a function of the corresponding row vectors in the data matrix:

$$IP_{ij} = f(x_i, x_j)$$

We will use an example from Chapter 8 as a running example in this chapter. We have $n = 22\,527$ visitors to a website and $p = 35$ dichotomous variables that define the behaviour of each visitor. In this case a proximity index will be a function of two 35-dimensional row vectors. Knowing the proximity indexes for every pair of visitors allows us to identify those that are most similar, or at least the least different, so that they can be gathered into groups that are as homogeneous as possible.

When the considered variables are quantitative, the proximity indexes are typically known as distances. If the variables are qualitative, the distance between the observations can be measured by indexes of similarity. If the data are contained in a contingency table, the chi-squared distance can also be employed. There are also indexes of proximity that can be used on a mixture of qualitative and quantitative variables. We will examine the Euclidean distance for quantitative variables and some indexes of similarity for qualitative variables.

4.1.1 Euclidean distance

Consider a data matrix containing only quantitative (or binary) variables. If $x$ and $y$ are rows from the data matrix, then a function $d(x, y)$ is said to be a distance between two observations if it satisfies the following properties:

• Non-negativity: $d(x, y) \ge 0$ for all $x$ and $y$

• Identity: $d(x, y) = 0 \Leftrightarrow x = y$ for all $x$ and $y$

• Symmetry: $d(x, y) = d(y, x)$ for all $x$ and $y$

• Triangle inequality: $d(x, y) \le d(x, z) + d(y, z)$ for all $x$, $y$ and $z$
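The four properties listed above can be checked numerically on a finite set of points. The following is a minimal sketch, not part of the original text: the function names and the sample points are illustrative assumptions, and a check on a handful of points can only refute the properties, never prove them in general.

```python
import itertools
import math


def satisfies_distance_properties(d, points, tol=1e-12):
    """Check the four distance properties on every pair and triple of points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                          # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):           # identity
            return False
        if abs(d(x, y) - d(y, x)) > tol:         # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) > d(x, z) + d(y, z) + tol:    # triangle inequality
            return False
    return True


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Hypothetical observations lying on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

euclidean_ok = satisfies_distance_properties(euclidean, points)
squared_ok = satisfies_distance_properties(squared_euclidean, points)
```

On these points the Euclidean function passes all four checks, while the squared Euclidean function fails the triangle inequality (4 > 1 + 1), which shows that the fourth property is a genuine restriction.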

To achieve a grouping of all observations, the distance is usually considered between all observations present in the data matrix. All such distances can be represented in a matrix of distances. A distance matrix can be represented in the following way:

$$\begin{pmatrix} 0 & \cdots & d_{1i} & \cdots & d_{1n} \\ \vdots & \ddots & \vdots & & \vdots \\ d_{i1} & \cdots & 0 & \cdots & d_{in} \\ \vdots & & \vdots & \ddots & \vdots \\ d_{n1} & \cdots & d_{ni} & \cdots & 0 \end{pmatrix}$$

where the generic element $d_{ij}$ is a measure of distance between the row vectors $x_i$ and $x_j$. The Euclidean distance is the most widely used distance measure. It is defined, for any two units indexed by $i$ and $j$, as the square root of the sum of the squared differences between the components of the corresponding vectors, in the $p$-dimensional Euclidean space:

$${}_2d_{ij} = d(x_i, x_j) = \left[ \sum_{s=1}^{p} (x_{is} - x_{js})^2 \right]^{1/2}$$
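As an illustration of the definition above, the following sketch computes the Euclidean distance between two row vectors and fills a symmetric distance matrix for a small data matrix. The function names and the three sample observations are assumptions made for the example, not taken from the text.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared component differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(rows):
    """Symmetric n x n matrix of pairwise distances with a zero diagonal."""
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean_distance(rows[i], rows[j])
            matrix[i][j] = d
            matrix[j][i] = d   # symmetry: d_ij = d_ji
    return matrix


# Hypothetical data matrix: n = 3 observations, p = 2 variables.
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(data)
```

Only the upper triangle is computed, since symmetry gives the lower triangle for free.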

The Euclidean distance can be strongly influenced by a single large difference in one dimension of the values, because the square will greatly magnify that difference. Dimensions having different scales (e.g. some values measured in centimetres, others measured in metres) are often the source of these overstated differences. To overcome this limitation, the Euclidean distance is often calculated not on the original variables, but on useful transformations of them. The commonest choice is to standardise the variables (Section 2.5). After standardisation, every transformed variable contributes to the distance calculation with equal weight. When the variables are standardised, they have a zero mean and unit variance; furthermore, it can be shown that, for $i, j = 1, \dots, p$,

$${}_2d_{ij}^2 = 2(1 - r_{ij})$$

$$r_{ij} = 1 - \frac{{}_2d_{ij}^2}{2}$$

where $r_{ij}$ indicates the correlation coefficient between the observations $x_i$ and $x_j$. The previous relationships show that the Euclidean distance between two observations is a function of the correlation coefficient between them.
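The effect of mixed scales on the Euclidean distance, and the remedy of standardising the variables, can be seen in a small sketch. The data are hypothetical (first variable in centimetres, second in metres), the function names are illustrative, and the population variance (divisor $n$) is assumed because the text does not say which variance is meant.

```python
import math


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def standardise(rows):
    """Rescale every column to zero mean and unit variance.

    Assumption: the population variance (divisor n) is used.
    """
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[s] for row in rows) / n for s in range(p)]
    stds = [
        math.sqrt(sum((row[s] - means[s]) ** 2 for row in rows) / n)
        for s in range(p)
    ]
    if any(sd == 0.0 for sd in stds):
        raise ValueError("a constant column cannot be standardised")
    return [[(row[s] - means[s]) / stds[s] for s in range(p)] for row in rows]


# Hypothetical observations: (length in cm, length in m).
raw = [[150.0, 1.50], [160.0, 1.80], [170.0, 1.50], [180.0, 1.80]]
z = standardise(raw)

# Distances from the first observation, before and after standardisation.
raw_01 = euclidean_distance(raw[0], raw[1])
raw_02 = euclidean_distance(raw[0], raw[2])
std_01 = euclidean_distance(z[0], z[1])
std_02 = euclidean_distance(z[0], z[2])
```

On the raw data the centimetre column dominates, so the second observation looks closer to the first than the third does. After standardisation the ordering reverses, because the difference in the metre column now carries equal weight.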

4.1.2 Similarity measures

Given a finite set of observations $u_i \in U$, a function $S(u_i, u_j) = S_{ij}$ from $U \times U$ to $\mathbb{R}$ is called an index of similarity if it satisfies the following properties:

• Non-negativity: $S_{ij} \ge 0$, $\forall u_i, u_j \in U$

• Normalisation: $S_{ii} = 1$, $\forall u_i \in U$

• Symmetry: $S_{ij} = S_{ji}$, $\forall u_i, u_j \in U$

Unlike distances, the indexes of similarity can be applied to all kinds of variables, including qualitative variables. They are defined with reference to the observation indexes, rather than to the corresponding row vectors, and they assume values in the closed interval [0, 1], rather than any non-negative value, which facilitates interpretation. The complement of an index of similarity is called an index of dissimilarity and represents a class of indexes of proximity wider than that of the distances. Like a distance, a dissimilarity index satisfies the properties of non-negativity and symmetry. However, the property of normalisation is not equivalent to the property of identity of the distances. Finally, dissimilarities do not have to satisfy the triangle inequality.

Indexes of similarity can be calculated, in principle, for quantitative variables. But they would be of limited use, since they would distinguish only whether two observations had, for the different variables, observed values equal or different, without saying anything about the size of the difference. From an operational viewpoint, the principal indexes of similarity make reference to data matrices containing binary variables. More general cases, with variables having more than two levels, can be brought into this framework through binarisation (Section 2.3).

Consider data regarding $n$ visitors to a website, which has $P$ pages. Correspondingly, there are $P$ binary variables, which assume the value 1 if the specific page has been visited, or else the value 0. To demonstrate the application of similarity indexes, we now analyse only data concerning the behaviour of the first two visitors (2 of the $n$ observations) to the website described in Chapter 8, among the $P = 28$ webpages they can visit. Table 4.1 summarises the behaviour of the two visitors, treating each page as a binary variable.

Table 4.1 Classification of the visited webpages.

              Visitor A
Visitor B     1          0          Total
1             CP = 2     AP = 1     3
0             PA = 4     CA = 21    25
Total         6          22         P = 28

Note that, of the 28 considered pages ($P = 28$), 2 have been visited by both visitors. In other words, 2 represents the absolute frequency of simultaneous occurrences (CP, for co-presence, or positive matches) for the two observations. In the lower right corner of the table there is a frequency of 21, equal to the number of pages that are visited neither by A nor by B. This frequency corresponds to simultaneous absences in the two observations (CA, for co-absences or negative matches). Finally, the frequencies of 4 and 1 indicate the number of pages that only one of the two navigators visits (PA indicates presence-absence and AP absence-presence, where the first letter refers to visitor A and the second to visitor B).
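The four frequencies of Table 4.1 can be obtained by comparing the two visitors' binary vectors page by page. The sketch below assumes a hypothetical ordering of the 28 pages that reproduces the counts in the table; the actual page-level data are not given in the text, and the function name is illustrative.

```python
def match_counts(a, b):
    """Return (CP, PA, AP, CA) for two binary vectors of equal length.

    The first letter refers to vector a, the second to vector b.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    cp = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # co-presence
    pa = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # presence-absence
    ap = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # absence-presence
    ca = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # co-absence
    return cp, pa, ap, ca


# Hypothetical page order chosen only to match the counts in Table 4.1.
visitor_a = [1] * 2 + [1] * 4 + [0] * 1 + [0] * 21
visitor_b = [1] * 2 + [0] * 4 + [1] * 1 + [0] * 21

cp, pa, ap, ca = match_counts(visitor_a, visitor_b)
```

The four counts always sum to the number of pages, here $P = 28$.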

The latter two frequencies denote the differential aspects between the two visitors and therefore must be treated in the same way, being symmetrical. The co-presence is what establishes the similarity between the two visitors, a fundamental condition for them to belong to the same group. The co-absence is less important, perhaps negligible, for determining the similarities between the two units. In fact, the indexes of similarity developed in the statistical literature differ in how they treat the co-absence, as we now describe.

Similarity index of Russel and Rao

$$S_{ij} = \frac{CP}{P}$$

This index is a function of the co-presences and is equal to the ratio between the number of co-presences and the total number of considered binary variables, $P$. From Table 4.1 we have

$$S_{ij} = \frac{2}{28} \approx 0.07$$

Similarity index of Jaccard

$$S_{ij} = \frac{CP}{CP + PA + AP}$$

This index is the ratio between the number of co-presences and the total number of variables, excluding those that manifest co-absences. Note that this index is undefined when the two visitors, or more generally the two observations, manifest only co-absences ($CA = P$). In our example we have

$$S_{ij} = \frac{2}{7} \approx 0.286$$

Similarity index of Sokal and Michener

$$S_{ij} = \frac{CP + CA}{P}$$

This represents the ratio between the number of co-presences or co-absences and the total number of variables. In our example we have

$$S_{ij} = \frac{23}{28} \approx 0.82$$
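The three indexes can be reproduced from the four frequencies of Table 4.1 (CP = 2, PA = 4, AP = 1, CA = 21). The following is a minimal sketch; the function names and the common argument order are illustrative choices, not part of the text.

```python
def russel_rao(cp, pa, ap, ca):
    """Co-presences over the total number of binary variables."""
    return cp / (cp + pa + ap + ca)


def jaccard(cp, pa, ap, ca):
    """Co-presences over the variables that are not co-absences."""
    denominator = cp + pa + ap
    if denominator == 0:
        # Only co-absences: the index is undefined.
        raise ValueError("Jaccard index is undefined when CA equals P")
    return cp / denominator


def sokal_michener(cp, pa, ap, ca):
    """Matches (co-presences plus co-absences) over the total."""
    return (cp + ca) / (cp + pa + ap + ca)


# Frequencies taken from Table 4.1.
counts = (2, 4, 1, 21)

s_rr = russel_rao(*counts)       # 2 / 28
s_j = jaccard(*counts)           # 2 / 7
s_sm = sokal_michener(*counts)   # 23 / 28
```

The only difference between the three functions is how the co-absences enter: ignored in the numerator (Russel and Rao), dropped altogether (Jaccard), or counted as matches (Sokal and Michener).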

For the index of Sokal and Michener (also called the simple matching coefficient) it is simple to demonstrate that its complement to one (a dissimilarity index) corresponds to the average of the squared Euclidean distance between the two vectors of binary variables associated with the observations:

$$1 - S_{ij} = \frac{1}{P} \left( {}_2d_{ij}^2 \right)$$

This relationship shows that the complement to one of the index of Sokal and Michener is a distance. The index itself is one of the most widely used indexes of similarity. It is sometimes also called the 'binary distance', although that is a slight abuse of terminology. Chapter 12 contains a real application of the index of Sokal and Michener.
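The relationship between the Sokal and Michener index and the squared Euclidean distance can be verified numerically. The sketch below uses two hypothetical binary vectors arranged to match the counts of Table 4.1; for binary data each squared difference is 1 for a mismatch and 0 for a match, so the squared distance equals PA + AP.

```python
import math

# Hypothetical page order chosen only to match the counts in Table 4.1.
visitor_a = [1] * 2 + [1] * 4 + [0] * 1 + [0] * 21
visitor_b = [1] * 2 + [0] * 4 + [1] * 1 + [0] * 21

P = len(visitor_a)

# Sokal and Michener: proportion of pages on which the visitors agree.
matches = sum(1 for x, y in zip(visitor_a, visitor_b) if x == y)
s_sm = matches / P

# Squared Euclidean distance between the two binary vectors.
squared_distance = sum((x - y) ** 2 for x, y in zip(visitor_a, visitor_b))

complement = 1 - s_sm
mean_squared_distance = squared_distance / P

relation_holds = math.isclose(complement, mean_squared_distance)
```

Here both sides equal 5/28, since the two visitors disagree on 4 + 1 = 5 of the 28 pages.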

4.1.3 Multidimensional scaling

We have seen how to calculate proximities between observations, on the basis of a given data matrix, or a table derived from it. Sometimes only the proximities between observations are available, for instance in terms of a distance matrix, and it is desired to reconstruct the values of the observations. In other cases the proximities are calculated using a dissimilarity measure and it is desired to reproduce them in terms of a Euclidean distance, to obtain a representation of the observations in a two-dimensional plane. Multidimensional scaling methods are aimed at representing observations whose observed values are unknown (or not expressed numerically) in a low-dimensional Euclidean space (usually in $\mathbb{R}^2$). The representation is achieved by preserving the original distances as far as possible.

Section 3.5 explained how to use the method of principal components on a quantitative data matrix in a Euclidean space. It turns the data matrix into a lower-dimensional Euclidean projection by minimising the Euclidean distance between the original observations and the projected ones. Similarly, multidimensional scaling methods look for low-dimensional Euclidean representations of the observations, representations which minimise an appropriate distance between the original distances and the new Euclidean distances. Multidimensional scaling methods differ in how such distance is defined. The most common choice is the stress function, defined by

$$\sum_{i=1}^{n} \sum_{j=1}^{n} (\delta_{ij} - d_{ij})^2$$

where $\delta_{ij}$ are the original distances (or dissimilarities) between each pair of observations, and $d_{ij}$ are the corresponding distances between the reproduced coordinates.
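The stress function defined above can be evaluated directly for any candidate configuration. In this sketch the dissimilarities are those of a hypothetical 3-4-5 triangle, and the function names are illustrative; a configuration that reproduces the triangle exactly has zero stress, while one that squeezes the three points onto a line does not.

```python
import math


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def stress(delta, points):
    """Sum over all ordered pairs of (delta_ij - d_ij) squared."""
    n = len(points)
    return sum(
        (delta[i][j] - distance(points[i], points[j])) ** 2
        for i in range(n)
        for j in range(n)
    )


# Hypothetical original dissimilarities between three observations.
delta = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
]

exact = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]      # reproduces every delta_ij
on_a_line = [[0.0, 0.0], [3.0, 0.0], [4.0, 0.0]]  # third distance becomes 1

stress_exact = stress(delta, exact)
stress_line = stress(delta, on_a_line)
```

Because the double sum runs over ordered pairs, each mismatch is counted twice: the collinear configuration misses the distance 5 by 4, giving a stress of 2 × 16 = 32.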

Metric multidimensional scaling methods look for $k$ real-valued $n$-dimensional vectors, each representing one coordinate measurement of the $n$ observations, such that the $n \times n$ distance matrix between the observations, expressed by $d_{ij}$, minimises the squared stress function. Typically $k = 2$, so the results of the procedure can be conveniently represented in a scatterplot. The illustrated solution is also known as least squares scaling. A variant of least squares scaling is Sammon mapping, which minimises

$$\sum_{i=1}^{n} \sum_{j=1}^{n} \frac{(\delta_{ij} - d_{ij})^2}{\delta_{ij}}$$

thereby preserving smaller distances.
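One simple way to carry out least squares scaling is plain gradient descent on the stress function. This is only a sketch under stated assumptions: the text does not prescribe an optimisation method, and the fixed step size, the iteration count, the starting configuration and the function names are all illustrative choices. The dissimilarities are those of a hypothetical 3-4-5 triangle, with $k = 2$.

```python
import math


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def stress(delta, points):
    n = len(points)
    return sum(
        (delta[i][j] - distance(points[i], points[j])) ** 2
        for i in range(n)
        for j in range(n)
    )


def least_squares_scaling(delta, start, step=0.01, iterations=2000):
    """Minimise the stress by fixed-step gradient descent (assumed method)."""
    points = [list(p) for p in start]
    n, k = len(points), len(points[0])
    for _ in range(iterations):
        grads = [[0.0] * k for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = distance(points[i], points[j])
                if d == 0.0:
                    continue  # gradient undefined for coincident points
                # Pairs (i, j) and (j, i) both depend on point i: factor 4.
                coeff = 4.0 * (d - delta[i][j]) / d
                for s in range(k):
                    grads[i][s] += coeff * (points[i][s] - points[j][s])
        for i in range(n):
            for s in range(k):
                points[i][s] -= step * grads[i][s]
    return points


# Hypothetical dissimilarities and a deliberately poor starting layout.
delta = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
start = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

initial_stress = stress(delta, start)
fitted = least_squares_scaling(delta, start)
final_stress = stress(delta, fitted)
```

The fitted coordinates are determined only up to rotation, reflection and translation, since the stress depends on the points solely through their mutual distances. Dividing each squared term by $\delta_{ij}$ in `stress` and in the gradient would turn the same loop into a sketch of Sammon mapping.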

When the proximities between objects are expressed by a Euclidean distance, it can be shown that the solution of the previous problem corresponds to the principal component scores that would be obtained if the data matrix were available. It is possible to define non-metric multidimensional scaling methods, where the preserved relationship between the original and the reproduced distances is not necessarily Euclidean. Chapter 12 contains some applications of multidimensional scaling methods. For further information see Mardia, Kent and Bibby (1979).

In document Aplicaciones web para móvil: Estudio y desarrollo de una plataforma web orientada a la telemedicina (pages 44-50)