Matching techniques according to the case of correspondence

R ELATED WORK

2.3. Matching of geospatial data

2.3.2. Matching techniques according to the case of correspondence

The supported case of correspondence for each method guides this classification into one-to-one (1:1), one-to-many (1:n), and many-to-many (m:n). One-to-one matching is the simplest approach present from early studies of conflation (Lynch and Saalfeld 1985) to more recent research (Safra et al. 2013). This conventional set also comprises the null matching cases, although some approaches do not support it. At the next level, one-to-many (and also many-to-one) matching methods are able to deal with aggregate objects. This case is particularly useful in a multi-scale environment whose data has different LoD. In the most general case, many-to-many matching is able to manipulate all possible cases. Figure 2.10 illustrates the proposed assortment.

Below we present and review some related papers using the proposed classification.

2.3.2.1. One-to-one matching

The one-to-one matching case (1:1) is the simplest approach to matching geospatial data. Beginning with the work of Saalfeld (1988), this approach was followed by later studies (Beeri et al. 2004, Safra et al. 2010, Song et al. 2011). According to Li and Goodchild (2010), this strategy is unable to handle certain real application cases, such as when one object has several parts in another dataset (1:n), or various objects corresponding to a different number of objects (m:n). Notwithstanding this limitation, the 1:1 case is the prevailing situation in near-scale matching experiments (Huh et al. 2011, Fan et al. 2014), but some authors argue that these ideal cases are rare (Song et al.

2006).

A new approach to dealing with one-to-one matching cases, called the Opt method, was presented by Li and Goodchild (2010). The authors proposed a matching algorithm developed by converting the matching problem into an assignment problem (in operations research) formulated as an objective function that can be solved by an optimization model. At the core of this objective function lies the constraint that permits only 1:1 cases. Despite this limitation, the approach was later extended to accommodate 1:n cases (Li and Goodchild 2011) and even m:n cases (Tong et al.

2014). The main gain of the Opt method is that it focuses on the search for corresponding objects rather than choosing some measures and using a greedy strategy.

Based on the iterative relaxation labelling algorithm initially proposed for point matching by Ranade and Rosenfeld (1980), Song et al. (2011) integrated a distance metric with the compatibility coefficient, which designates the correlation of one candidate matching pair to another. This point-based approach matches road intersections but it has limited scope since it ignores multiple cases.

Another interesting one-to-one matching approach was proposed by Zhang et al.

(2012), aiming at the maintenance of a multiple representation database. The authors Figure 2.10. Classification according to case (Xavier et al. 2016a).

transferred the data matching problem into a pattern classification problem, which they felt should be able to solve the problem of normalization and weighting multiple measures. They tested the approach with four supervised classifiers: C4.5 algorithms (Quinlan 1993), the classification and regression tree (Breiman et al. 1984), the Naive Bayes classifier (Rish 2001) and Support Vector Machines (Hsu and Lin 2002). The probabilistic Naive Bayes model was also used in a later study (Zhang et al. 2014).

Although the four classifiers outperformed the weighted average of applied measures in terms of precision, these classifiers require large training samples, of almost the same size as the test data.

Although some authors feel that the one-to-one matching case is insufficient to handle datasets which differ in topology (Walter and Fritsch 1999, Zhang 2009), there are studies on matching road networks limited to 1:1 cases (Diez et al. 2008). This limitation may decrease the recall rates (see Van Rijsbergen 1979) of a method (e.g. Olteanu 2007), since multiple cases are treated as null matching, or even mismatching. In truth, the research indicates that there are authentic one-to-many and many-to-many cases using different geometric primitives or distinguished classes, like buildings and roads (Fu and Wu 2008).

2.3.2.2. One-to-many matching

One-to-many (and also many-to-one) matching methods (1:n, n:1) can manipulate composite objects, a common case while using a data in different LoDs. One-to-many cases are also common while manipulating networking data, like roads or rivers. Some studies using linear data sometimes split polylines and create virtual vertices (Volz 2006, Stankute and Asche 2009, Yang et al. 2013) in order to handle the multiplicity of corresponding objects. Sometimes existing one-to-many enabled approaches are derivatives from previous studies limited to one-to-one cases.

In her first study using the theory of evidence, Olteanu (2007) was only able to handle 1:1 matching cases. The support for multiple (1:n) cases was added in the later version (Olteanu-Raimond and Mustière 2008). In this study the authors developed a networking matching approach method on the Belief Theory (Dempster 1967, Shafer 1976). Using this technique Olteanu-Raimond and Mustière (2008) combined various matching criteria including positional, geometric, semantic, attribute, and topological (neighbourhood) criteria, in order to match French road networks at different scales.

However, this approach requires a detailed analysis of data and expert knowledge in order to model the masses of belief (weights among criteria).

Another two papers describe one-to-many approaches using a greedy method: Tang et al. (2008) and Kieler et al. (2009). In the first study the authors proposed a method for areal feature matching that selects various characteristics of areal features: position (Euclidean distance between MBBs), shape and size. The final similarity is calculated by a weighted mean that can be obtained by training or by human expertise. The study

by Kieler et al. (2009) aimed to match river datasets at different scales. The authors developed a three-step procedure to find corresponding river features, including using different representation (areas and lines). Under this method, the areal features are collapsed to center-lines using a skeletonization algorithm that preserves the topology.

Then the subsequent matching steps occur in the line network, by using distances between nodes and angles between arcs.

In other one-to-many method Li and Goodchild (2011) play on the asymmetry of directed Hausdorff distance to settle m:1 and 1:m correspondences extending their Opt method (Li and Goodchild 2010). In this later paper (Li and Goodchild 2011), the authors focus more on measure than method, also using other measures like the Hamming distance to feature names and angles between edges.

Matching methods that support one-to-many cases can attain results impossible for the one-to-one approach. Although this class of techniques has increased the matching capabilities, they are not adequate to handle all possible cases in real world applications. Only the most generic approaches (m:n) can reach some cases (see Figure 2.11).

2.3.2.3. Many-to-many matching

Many-to-many matching (m:n) is the broadest approach that encompasses all possible corresponding cases. The support for m:n cases has been seen as fundamental to reaching authentic complex cases present from early work (Brown et al. 1995) until more recent research (Fan et al. 2014). Figure 2.11 illustrates how the same phenomena in real world ((a) source imagery) can be acquired in different manners from two distinct specifications ((b) and (c)), and in diverse representations ((1)-(3)).

And many times these discrepancies lead to m:n matching cases.

One of the first studies to enable the management of many-to-many matching cases was Van Wijngaarden et al. (1997). Their goal was to propagate building updates between topographic maps at two different scales. The authors used an area overlapping percentage criterion and a geometry-enabled database management system to find corresponding buildings.

Another many-to-many approach for areal features was proposed by Butenuth et al.

(2007). The authors expanded the work of Göesseln and Sester (2004) who aimed to develop a framework for the integration of geo-scientific datasets in a federated database environment. Their matching strategy first creates potential matching pairs using intersections. Then these candidates are assessed using symmetric difference (the union of both geometries minus their intersection) and azimuth histograms. In the case of many-to-many relations being identified these features are grouped together into a 'relation set' that can be processed as a simple one-to-one case.

A new matching approach that can handle many-to-many cases, termed the Delimited- Stroke-Oriented (DSO) algorithm, was developed by Zhang (2009). DSO utilizes three assisting methodologies: matching guided by structure (geometric and topologic properties), by semantics (attributes), and by spatial index (to increase matching speed). In order to deal with more generic cases, Zhang (2009) also proposes the concepts of partial correspondence (incomplete matching) and equivalent correspondence (topological inconsistency).

Based on Christmas et al. (1995), Yang et al. (2013) developed a probabilistic relaxation matching approach. The authors establish an initial probabilistic matrix using distance, Figure 2.11. Same real world phenomena acquired using different specifications (Xavier

et al. 2016a).

length and direction. The compatibility coefficient is used during the relaxation process to iteratively recalculate the initial matrix until it reaches convergence values. In order to handle many-to-many cases, Yang et al. (2013) propose an additional five-step procedure based on structural similarity – the sum of matching probabilities of one candidate road matching pair and its neighbouring pairs. This method was also used to integrate VGI points of interest and street linear data (Yang et al. 2014). The framework described in the later work first executes a linear clustering algorithm based on DBSCAN (Ester et al. 1996) to determine the line patterns in VGI POIs, then it constructs a complete graph structure which represents these virtual lines. Finally, the matching problem between the 'lines' and a set of professional road data is solved by the probabilistic relaxation method.

Another two recent approaches to dealing with m:n cases are those developed by Zhang et al. (2014) and Tong et al. (2014). Zhang et al. (2014) extended the relaxation labelling techniques of Parent and Zucker (1989) which aimed to integrate areal building features using a context measure. Tong et al. (2014) developed a new technique by extending the Optimization method (Li and Goodchild 2010) with logistic regression models. The logistic regression model enables both one-to-many and many-to-many matching relationship cases, in this case proving useful to integrate linear road data.

Although m:n cases sometimes indicates an inconsistent matching (Van Wijngaarden et al. 1997), it is possible to find genuine cases of many-to-many matches (Huh et al.

2011, Ying et al. 2011). Some authors have argued that it is necessary to go beyond m:n cases, using the concepts of partial correspondence and equivalent correspondence (Zhang 2009). However, a deeper analysis indicates that these are particular to cases of many-to-many matching, so they can be framed within this classification.

In document Automatic evaluation of geospatial data quality using web services (página 32-37)