

Analysing local structure by means of existential features over groups has a number of characteristics that set it apart from what one might be used to in Propositional Data Mining (PDM). These peculiarities of MRDM are important because of their impact on the design of MRDM algorithms, and because they affect how the output of those algorithms should be interpreted. We summarise the most crucial ones below. Note that these characteristics derive directly from predicate logic:

• Refinements may not always produce patterns with a lower coverage. This is especially true for association refinements, where nodes that are totally redundant may be added to a selection graph. This redundancy is the result of declarative bias that indicates that there will always be a part corresponding to the new node. In our molecular example, simply adding a new atom-node through the association between atom and molecule does not logically change the pattern at hand, as all molecules must contain atoms. Although such irrelevant nodes produce no immediate benefit, they are essential because they may be the starting point for further refinements.

• Associations will often appear multiple times as edges from a single node in a selection graph. This indicates that parts with different characteristics, but of the same class, are required to appear. These different sets of characteristics need not be mutually exclusive. They may appear in a single part. Multiple edges of the same association only make sense (and should be allowed as refinements) if the data model indicates a one-to-many relationship.

• In Data Mining, it is often desirable to split a particular subgroup into mutually exclusive subgroups, for example to build hierarchical models of the data, such as decision trees. In Propositional Data Mining, this can be done, for example, on the basis of the values of a nominal attribute. Each value represents a subgroup, and the union of the subgroups thus formed represents the original subgroup. There is no overlap between subgroups, so the size of the original subgroup equals the sum of the sizes of the splits. This simple and attractive feature does not hold for splits on nominal attributes in MRDM. Due to the one-to-many associations and the existential nature of conditions, individuals will often exhibit multiple values for the nominal attribute at the same time. This places individuals in several of the subgroups represented by the nominal values, causing overlap, so that the subgroup sizes no longer add up to the size of the original subgroup. In general, nominal attributes can only be used to form true splits if they occur in the target table. As a result, it is often not possible to use the distribution of values in the nominal attribute to predict the size of subgroups. For example, if we list all atoms in a database of organic compounds, we notice that hydrogen atoms occur more often than carbon atoms. However, if we use these values to produce subgroups, we notice that the two subgroups are equal: in this case, all organic molecules contain both carbon and hydrogen.

• A similar effect occurs with numeric splits. The intuition from PDM is that, as the numeric threshold is moved through the increasing values that occur in the numeric attribute, the size of the resulting subgroup will either monotonically increase or decrease. In MRDM, however, often only a few of the encountered values will actually be proper thresholds. Of all the numeric values occurring within a single individual, only the minimum and maximum are valid as thresholds to include or exclude the individual. Handling numeric data is covered in Section 4.4.

• As a corollary of the previous two effects, it should be noted that complementary operators, such as = and ≠, or < and ≥, do not generally produce complementary subgroups. This is clear if we think of the two operators as representing the two values of a binary attribute, which is a special case of a nominal attribute. Each operator should be considered independently. For numeric attributes, each inequality operator should be treated separately, together with separate sets of valid thresholds. See again Section 4.4.

• The most obscure peculiarity of MRDM has to do with interactions between multiple conditions. An interesting effect of this is that two conditions may be irrelevant in isolation, whereas combined they do produce a proper refinement. This is demonstrated by the database in Figure 4.1. The database describes two hands of playing cards. If we query this database for suit = ♠, we obtain two hands. Two hands are also obtained if queried for rank = A. We would expect the conjunction of these conditions to produce the same result, as both conditions appear not to be selective. However, there is only a single hand with suit = ♠ ∧ rank = A. This example shows how a refinement may change (beneficially or not) the usefulness of a prior refinement, thus producing interactions between conditions. One way to understand this is to look at conditions not just as selections on the level of individuals, but also on the level of parts within the individual. This change in local structure produces the described effect. Especially with numeric attributes this may cause surprising effects. As was noted before, there is a limited set of proper numeric thresholds, which is a subset of the available values. Adding new refinements may have an influence on this set of thresholds.
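
The effects on nominal splits, numeric thresholds and complementary operators can be made concrete with a minimal Python sketch. The toy molecule database, the partial-charge values and the helper `covered` are hypothetical and were invented for illustration; they are not taken from the data discussed in the text. The sketch only assumes the existential reading of conditions described above: a molecule is selected if at least one of its atoms satisfies the condition.

```python
# Hypothetical toy database: each molecule (individual) owns several atoms (parts).
# Atoms are (element, partial_charge) pairs; all values are invented for illustration.
molecules = {
    "m1": [("C", -0.12), ("H", 0.06), ("H", 0.07)],
    "m2": [("C", -0.10), ("H", 0.05), ("O", -0.38)],
    "m3": [("C", 0.20), ("H", 0.10), ("H", 0.11), ("H", 0.12)],
}


def covered(predicate):
    """Molecules with at least one atom satisfying the predicate (existential test)."""
    return {
        name
        for name, atoms in molecules.items()
        if any(predicate(atom) for atom in atoms)
    }


# Nominal attribute: hydrogen atoms outnumber carbon atoms in the atom table ...
all_atoms = [atom for atoms in molecules.values() for atom in atoms]
n_hydrogen = sum(1 for element, _ in all_atoms if element == "H")
n_carbon = sum(1 for element, _ in all_atoms if element == "C")

# ... yet both element values select exactly the same molecules.
with_carbon = covered(lambda atom: atom[0] == "C")
with_hydrogen = covered(lambda atom: atom[0] == "H")

# Complementary operators (= and !=) do not give complementary subgroups.
with_non_carbon = covered(lambda atom: atom[0] != "C")

# Numeric attribute: "charge < t" depends only on each molecule's minimum charge,
# "charge >= t" only on its maximum charge.
min_charge = {name: min(c for _, c in atoms) for name, atoms in molecules.items()}
max_charge = {name: max(c for _, c in atoms) for name, atoms in molecules.items()}

below_zero = covered(lambda atom: atom[1] < 0.0)
at_least_zero = covered(lambda atom: atom[1] >= 0.0)

print(n_hydrogen, n_carbon)
print(sorted(with_carbon), sorted(with_hydrogen), sorted(with_non_carbon))
print(sorted(below_zero), sorted(at_least_zero))
```

In this toy data the conditions charge < 0 and charge ≥ 0 both select m1 and m2, so the two "halves" of the numeric split overlap, just as the subgroups for element = C and element ≠ C do.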

Figure 4.1 Two hands of playing cards.
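
The contents of Figure 4.1 are not reproduced here, so the following Python sketch uses two assumed hands that were chosen to show the same behaviour as the card example in the text. The card values, the names `hands` and `select`, and the plain-text suit names are hypothetical. The key assumption is that several conditions on a card must hold for the same card (the same part) within a hand.

```python
# Hypothetical stand-in for Figure 4.1: two hands, each a list of (suit, rank) cards.
hands = {
    "hand1": [("spades", "A"), ("hearts", "K")],
    "hand2": [("spades", "K"), ("hearts", "A")],
}


def is_spade(card):
    return card[0] == "spades"


def is_ace(card):
    return card[1] == "A"


def select(*conditions):
    """Hands containing at least one card that satisfies all conditions together."""
    return {
        name
        for name, cards in hands.items()
        if any(all(cond(card) for cond in conditions) for card in cards)
    }


with_spade = select(is_spade)                  # neither condition is selective alone
with_ace = select(is_ace)
with_ace_of_spades = select(is_spade, is_ace)  # the conjunction is selective

print(sorted(with_spade), sorted(with_ace), sorted(with_ace_of_spades))
```

Each condition alone keeps both hands, but only the first hand holds a single card that is both a spade and an ace, which is the interaction between conditions described for the card database.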
