As mentioned above, the most widespread similarity measures for comparing numerical attributes are based on the Euclidean distance. However, this measure is not adequate for other types of data, such as categorical (or nominal) data, because in general there is no known ordering between the values of a categorical attribute. The study of similarity between data objects with categorical attributes has a long history (Ahmad & Dey, 2011).
Categorical attributes take values from a discrete and fixed set of linguistic terms. As presented in Chapter 2, attributes of this type can also be represented as one of the following:
• nominal attributes: each attribute $a_j$ takes values that are elements of a predefined list/domain (or universe of discourse) $U_j$ and not of a continuous domain. For example, the PropType attribute in Example 1 can be classified into six main categories: house, flat/apartment, bungalow, land or commercial property. For any $x_{ij}, y_{ij} \in U_j$, either $x_{ij} = y_{ij}$ or $x_{ij} \neq y_{ij}$. For this data type there is no way to calculate the dissimilarity between attribute values other than as a binary value. In other words, the dissimilarity degree here is a comparison that yields 0 when the two values are the same and 1 when they are different, because we are only interested in whether they are the same or not.
• binary attributes: a special case of the nominal attribute in which the domain (or universe of discourse) is limited to the values True and False (i.e., each value of such an attribute $a_j$ belongs to the set $\{0, 1\}$).
• ordinal attributes: an ordinal (or rank-order) attribute is similar to a nominal attribute. The difference between the two is that the values of an ordinal attribute have a clear ordering: the order matters, but the difference between values does not. For example, a quality attribute can be measured as low, average or high. Hence, the categories of this attribute type are given by a specified number of linguistic terms in an ordered way. If these categories were equally spaced, then the attribute would be an interval attribute.
• interval attributes: this type of attribute is similar to an ordinal attribute, except that the intervals between the values of the attribute are equally spaced. For example, a quality attribute can be represented by equally spaced categories. In this case, measures such as the Euclidean distance or the average of the attribute values are meaningful.
Now, let $C_1$ and $C_2$ be two objects described by $m$ categorical attributes $at_{C_1} = \{a_1, a_2, \ldots, a_m\}$ and $at_{C_2} = \{b_1, b_2, \ldots, b_m\}$, respectively, where each attribute takes its values from a domain $U_j$ of linguistic terms, $j = 1, 2, \ldots, m$, and $m_j$ is the number of linguistic terms that represent the $j$th attribute.
In this research, we focus on the overlap measure $\delta_1 : U_j \times U_j \to \{0, 1\}$ for nominal and binary attributes, defined as $\delta_1(a_{ij}, b_{ij}) = 0$ if $a_{ij} = b_{ij}$ and $\delta_1(a_{ij}, b_{ij}) = 1$ otherwise. For ordinal and interval attributes, however, we define a normalised dissimilarity measure $\delta_2 : U_j \times U_j \to [0, 1]$ based on the Gower formula, which maps the ordinal values to their ranking values and then measures the dissimilarity between their ranking positions (Gower, 1971). The closer two values are in their ranking positions, the less dissimilar they are. This is defined as follows:
$$\delta_2(a_{ij}, b_{ij}) = \frac{|a_{ij} - b_{ij}|}{|L - U|} \tag{5.5}$$
where $L$ and $U$ are the minimal and maximal values by which the attribute $a_j$ can be ranked. The distance function between any two attribute values can therefore be chosen according to the data representation of the categorical attribute. It should be noted, however, that Eq. 5.1 can be reduced to the Gower formula: if the Minkowski distance is used, then for any $p$ both approaches are almost equivalent, particularly if we assume that $d_m(a_j, b_j) = 0$ in Eq. 5.1.
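The two dissimilarity measures can be sketched in Python as follows. This is a minimal illustration rather than the author's implementation: the function names are hypothetical, and it assumes that the linguistic terms of an ordinal attribute are ranked 1 to $m_j$ in the order given, so that $L = 1$ and $U = m_j$ in Eq. 5.5.

```python
def overlap_dissimilarity(a, b):
    """delta_1: 0 if the two nominal/binary values match, 1 otherwise."""
    return 0 if a == b else 1


def rank_dissimilarity(a, b, ordered_terms):
    """delta_2 (Eq. 5.5): normalised gap between the ranks of two values."""
    # Assumed encoding: each linguistic term is mapped to its rank 1..m_j.
    rank = {term: pos for pos, term in enumerate(ordered_terms, start=1)}
    lowest, highest = 1, len(ordered_terms)  # L and U in Eq. 5.5
    if highest == lowest:
        return 0.0  # a single-term domain cannot distinguish values
    return abs(rank[a] - rank[b]) / abs(lowest - highest)


quality = ["low", "average", "high"]
print(overlap_dissimilarity("house", "bungalow"))     # 1
print(rank_dissimilarity("low", "average", quality))  # 0.5
print(rank_dissimilarity("low", "high", quality))     # 1.0
```

Neighbouring ranks give a smaller dissimilarity than the two extremes of the scale, which is the behaviour described for the ranking positions.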
Consequently, the normalised dissimilarity is defined according to the related categorical attribute values as follows:
Definition 5.2.6. Let $d_C : at_{C_1} \times at_{C_2} \to [0, 1]$ denote a normalised distance between two corresponding attributes $a_j$ and $b_j$. Then,
$$d_C(a_j, b_j) = \delta_1(a_{ij}, b_{ij}) \tag{5.6}$$
if $a_j, b_j$ are nominal or binary, and
$$d_C(a_j, b_j) = \delta_2(a_{ij}, b_{ij}) \tag{5.7}$$
if $a_j, b_j$ are ordinal or interval.
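The case distinction in Definition 5.2.6 amounts to a dispatch on the attribute type. The sketch below is one possible reading, not the author's code: `AttributeSpec` and `categorical_distance` are hypothetical names, and interval attributes are assumed to be given as an ordered list of equally spaced categories, so that they can be handled by the same rank-based formula as ordinal ones.

```python
from dataclasses import dataclass
from typing import Hashable, Optional, Sequence


@dataclass(frozen=True)
class AttributeSpec:
    """Hypothetical descriptor of one categorical attribute."""
    kind: str  # "nominal", "binary", "ordinal" or "interval"
    ordered_terms: Optional[Sequence[Hashable]] = None  # for ordinal/interval


def categorical_distance(a, b, spec: AttributeSpec) -> float:
    """d_C: overlap for nominal/binary values, normalised rank gap otherwise."""
    if spec.kind in ("nominal", "binary"):
        return 0.0 if a == b else 1.0
    if spec.kind in ("ordinal", "interval"):
        if not spec.ordered_terms:
            raise ValueError("ordinal/interval attributes need ordered terms")
        terms = list(spec.ordered_terms)
        span = len(terms) - 1  # |L - U| when ranks run from 1 to m_j
        if span == 0:
            return 0.0
        return abs(terms.index(a) - terms.index(b)) / span
    raise ValueError(f"unknown attribute kind: {spec.kind!r}")


prop_type = AttributeSpec("nominal")
quality = AttributeSpec("ordinal", ("low", "average", "high"))
print(categorical_distance("house", "land", prop_type))  # 1.0
print(categorical_distance("average", "high", quality))  # 0.5
```

Keeping the attribute type in a small descriptor means the same distance function can be applied attribute by attribute to objects with mixed categorical types.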
Again, we use the definition in Eq. 4.4 to define the similarity between categorical attributes:
Definition 5.2.7. A similarity between a pair of corresponding categorical attributes is a mapping $S_C : at_{C_1} \times at_{C_2} \to [0, 1]$ such that
$$S_C(a_j, b_j) = \frac{1 - d_C(a_j, b_j)}{1 + k_j\, d_C(a_j, b_j)}, \quad k_j \geq 0, \tag{5.8}$$
and when $k_j = 0$ we get the linear transformation
$$S_C(a_j, b_j) = 1 - d_C(a_j, b_j). \tag{5.9}$$
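The transformation from distance to similarity in Eq. 5.8 is a one-line computation. The following sketch uses a hypothetical function name and adds input checks that are not part of the definition; it is meant only to show how the parameter $k_j$ shapes the result.

```python
def similarity_from_distance(distance: float, k: float = 0.0) -> float:
    """S_C from Eq. 5.8; with k = 0 this is the linear form of Eq. 5.9."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must lie in [0, 1]")
    if k < 0.0:
        raise ValueError("k must be non-negative")
    return (1.0 - distance) / (1.0 + k * distance)


# The same distance yields a lower similarity as k grows.
for k in (0.0, 1.0, 4.0):
    print(k, similarity_from_distance(0.5, k))
```

For every admissible $k_j$, a distance of 0 maps to a similarity of 1 and a distance of 1 maps to a similarity of 0; only the values in between are affected by the choice of $k_j$.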
Definition 5.2.8. Let $C = \{C_1, C_2, \ldots, C_k\}$ be a set of $k$ categorical objects. Then a similarity between any two objects $C_s, C_t \in C$ is a mapping $\mathrm{CatSim} : C \times C \to [0, 1]$ such that
$$\mathrm{CatSim}(C_s, C_t) = \otimes\big(S_C(a_1, b_1), S_C(a_2, b_2), \ldots, S_C(a_m, b_m)\big),$$
where $\otimes : [0, 1]^m \to [0, 1]$ is an aggregation function defined as in equations 4.5
5.2 Crisp Similarity Measures
Figure 5.2: Calculating similarity between two categorical objects C1 and C2.
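The object-level similarity of Definition 5.2.8 can be sketched end to end as below. Several things here are assumptions made for illustration only: the aggregation function is taken to be the arithmetic mean (the aggregation functions of Chapter 4 are not reproduced in this section), the `schema` dictionary and the `Furnished` attribute are hypothetical, a single $k$ is shared by all attributes, and every ordinal domain is assumed to contain at least two terms.

```python
from statistics import fmean


def attribute_similarity(a, b, ordered_terms=None, k=0.0):
    """S_C for one attribute pair; ordered_terms=None means nominal/binary."""
    if ordered_terms is None:
        distance = 0.0 if a == b else 1.0  # overlap measure
    else:
        terms = list(ordered_terms)  # assumed to hold at least two terms
        distance = abs(terms.index(a) - terms.index(b)) / (len(terms) - 1)
    return (1.0 - distance) / (1.0 + k * distance)


def cat_sim(obj_s, obj_t, schema, k=0.0, aggregate=fmean):
    """CatSim: aggregate the per-attribute similarities of two objects."""
    similarities = [
        attribute_similarity(obj_s[name], obj_t[name], terms, k)
        for name, terms in schema.items()
    ]
    return aggregate(similarities)


# Hypothetical schema: None marks a nominal/binary attribute,
# a tuple lists the ordered terms of an ordinal attribute.
schema = {
    "PropType": None,
    "Quality": ("low", "average", "high"),
    "Furnished": None,
}
c1 = {"PropType": "house", "Quality": "average", "Furnished": True}
c2 = {"PropType": "house", "Quality": "high", "Furnished": False}
print(cat_sim(c1, c2, schema))                 # 0.5
print(cat_sim(c1, c2, schema, aggregate=min))  # 0.0
```

Passing the aggregation as a parameter makes it easy to compare, for instance, the mean with the minimum: the per-attribute similarities here are 1.0, 0.5 and 0.0, so the mean gives 0.5 while the minimum gives 0.0.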
Proposition 5.2.9. The similarity $\mathrm{CatSim}(C_s, C_t)$ between two objects $C_s$ and $C_t$ described by categorical attributes satisfies the properties in Definition 3.4.7 of a similarity relation.
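Proof. Let $\delta$ stand for either $\delta_1$ or $\delta_2$. For (i), since $\delta(a_{ij}, a_{ij}) = 0$ for all $a_{ij} \in U_j$, we have $d_C(a_j, a_j) = 0$ and therefore $S_C(a_j, a_j) = 1$. For (ii), we have $\delta(a_{ij}, b_{ij}) = \delta(b_{ij}, a_{ij})$ for all $a_{ij}, b_{ij} \in U_j$, so $d_C(a_j, b_j) = d_C(b_j, a_j)$ and thus $S_C(a_j, b_j) = S_C(b_j, a_j)$. Hence $\mathrm{CatSim}(C_s, C_t) = \mathrm{CatSim}(C_t, C_s)$. For (iii), for all $a_j, b_j \in U_j$ with $a_j \neq b_j$, we have $d_C(a_j, b_j) > 0 = d_C(a_j, a_j)$, which implies $S_C(a_j, b_j) < 1$. Therefore, $\mathrm{CatSim}(C_s, C_t) < 1$.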
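The three properties used in the proof can also be checked by brute force at the attribute level. The sketch below is a sanity check on one small ordinal domain, not a substitute for the proof; the helper names and the value of `k` are arbitrary choices made for illustration.

```python
from itertools import product


def rank_distance(a, b, terms):
    """Normalised rank gap between two values of an ordinal attribute."""
    return abs(terms.index(a) - terms.index(b)) / (len(terms) - 1)


def similarity(distance, k):
    """Distance-to-similarity transformation with parameter k >= 0."""
    return (1.0 - distance) / (1.0 + k * distance)


terms = ["low", "average", "high"]
k = 1.5  # arbitrary non-negative value
violations = []
for a, b in product(terms, repeat=2):
    s_ab = similarity(rank_distance(a, b, terms), k)
    s_ba = similarity(rank_distance(b, a, terms), k)
    if a == b and s_ab != 1.0:
        violations.append(("(i) identical values", a, b))
    if s_ab != s_ba:
        violations.append(("(ii) symmetry", a, b))
    if a != b and not s_ab < 1.0:
        violations.append(("(iii) distinct values", a, b))

print(violations)  # []
```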
Now, let us recall the accommodation comparison problem presented in Example 1.3.1. Since the attribute values are mixed (numeric and symbolic), the similarity approaches for comparing numerical and categorical objects introduced above can be used for the comparison. These two measures are referred to as the crisp similarity model. However, this model is not able to handle the
5.3 A Unified Framework of Similarity Measures for Objects