Recently, the number of natural catastrophes has increased all over the world. The most dangerous natural catastrophes, those with the highest death tolls, are hurricanes and tornadoes, but avalanches, forest fires, flood waves, and floods can also cause epidemics, lead to crop loss, or threaten human life. Therefore, weather forecasting is an important tool for the early detection of hazards.
For our weather forecast system we obtained weather data from the website [Rad10], which provides freely available weather data for the city of Freiburg in Germany ranging from January 1st, 2005 until December 31st, 2009. The database therefore consisted of 1826 days. The data of one day corresponds to one object in the database, each represented by a 4-dimensional nGMM containing the temperature, humidity, barometric pressure, and wind speed of one day in Freiburg. We obtained the nGMMs of the weather data using, again, the EM algorithm as described in the previous subsection. EM is a two-step algorithm consisting of the Expectation and the Maximization step. The output of the EM algorithm was one nGMM for each day.
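The two EM steps mentioned above can be illustrated with a minimal sketch. It is deliberately simplified: it fits a one-dimensional, two-component Gaussian mixture, whereas the nGMMs in this chapter are 4-dimensional with full covariance matrices. The function names, the quantile-based initialization, and the sample temperature readings are assumptions made for illustration only and are not taken from the original implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, k=2, iterations=50):
    """Fit a 1-D Gaussian mixture with k components by plain EM.

    Returns (weights, means, variances). Initialization by quantiles is an
    assumption of this sketch, chosen to keep the result deterministic.
    """
    n = len(data)
    ordered = sorted(data)
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [max(overall_var, 1e-6)] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # Expectation step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [1.0 / k] * k)

        # Maximization step: re-estimate weight, mean, and variance per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # component received no mass; keep its old parameters
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor avoids degenerate components

    return weights, means, variances


# Hypothetical temperature readings of one day (cold morning, mild afternoon).
readings = [1.0, 1.5, 2.0, 2.5, 3.0, 9.0, 9.5, 10.0, 10.5, 11.0]
weights, means, variances = em_gmm_1d(readings)
```

For the full 4-dimensional case, the Maximization step would additionally estimate a covariance matrix per component, which is what makes the resulting mixture non-axis-parallel.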
The goal was to find the MLIQ/NN for one specific day, given the weather data of the past 5 years, in order to predict the weather of the following day as precisely as possible. We randomly selected one day from the year 2010 (March 2nd, 2010) for which we performed a weather forecast. Then, we performed a 1-MLIQ/NN search for the previous day (March 1st, 2010) using the approximation k-MLIQ search, the axis-parallel k-MLIQ search, and the k-NN search. Based on the results we built our weather prediction, always using the day following each hit as source point for predicting the weather of March 2nd, 2010.
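The prediction scheme described above (find the most similar past day, then use the day that followed it as the forecast) can be sketched as follows. This is an illustration under stated assumptions, not the method used in the experiments: each day is reduced to a plain feature vector and compared with the Euclidean distance, whereas the actual searches operate on nGMMs. The function names and the sample data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def forecast_from_nearest_day(history, query):
    """Return (hit_label, source_label, source_features).

    history is a chronologically ordered list of (label, feature_vector).
    The last day has no recorded successor, so it cannot serve as a hit.
    """
    candidates = range(len(history) - 1)
    best = min(candidates, key=lambda i: euclidean(history[i][1], query))
    hit_label = history[best][0]
    source_label, source_features = history[best + 1]
    return hit_label, source_label, source_features


# Hypothetical history: (temperature, humidity, pressure, wind speed) per day.
history = [
    ("day 1", (2.0, 70.0, 1019.0, 3.0)),
    ("day 2", (8.0, 69.0, 997.0, 37.0)),
    ("day 3", (5.0, 70.0, 1009.0, 11.0)),
    ("day 4", (1.8, 90.0, 988.0, 1.3)),
]
hit, source, predicted = forecast_from_nearest_day(history, (5.1, 70.0, 1010.0, 10.0))
```

Swapping the distance function for a probability-based similarity on mixture models would turn the same loop into a 1-MLIQ style search, with `min` replaced by `max`.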
Method                        Hit                 Probability/Distance
Approximation k-MLIQ search   March 6th, 2009     0.372
Axis-parallel k-MLIQ search   Feb. 6th, 2009      0.144
k-NN search                   March 11th, 2008    28.386

Table 4.1: The k-MLIQ hit and probability for March 1st, 2010 with k = 1 for our approximation k-MLIQ search and the axis-parallel k-MLIQ search, and the k-NN hit and distance for March 1st, 2010 with k = 1 for the k-NN search.

Source                        Feature               Mean (SD)       Min       Max
March 2nd, 2010               Temp. (°C)            2.7 (4.1)       -2.4      11.0
                              Humidity (%)          71.7 (14.1)     42.0      86.0
                              Barometric p. (hPa)   1019.2 (2.5)    1014.0    1023.0
                              Wind speed (km/h)     2.7 (3.7)       0.0       16.0
Approximation k-MLIQ search   Temp. (°C)            5.2 (2.9)       1.9       11.4
                              Humidity (%)          69.7 (13.5)     43.0      87.0
                              Barometric p. (hPa)   1009.4 (3.1)    1003.0    1013.0
                              Wind speed (km/h)     11.2 (6.2)      0.0       30.3
Axis-parallel k-MLIQ search   Temp. (°C)            1.8 (1.1)       -0.3      3.2
                              Humidity (%)          90.3 (1.5)      84.0      92.0
                              Barometric p. (hPa)   988.6 (1.8)     987.0     994.0
                              Wind speed (km/h)     1.3 (2.6)       0.0       12.5
k-NN search                   Temp. (°C)            8.4 (1.7)       5.6       12.0
                              Humidity (%)          69.0 (11.2)     44.0      91.0
                              Barometric p. (hPa)   997.1 (5.6)     991.0     1008.0
                              Wind speed (km/h)     37.3 (11.3)     10.5      72.1

Table 4.2: Real mean, standard deviation, minimum, and maximum values of the temperature, humidity, barometric pressure, and wind speed of March 2nd, 2010, as well as the predicted values using the day following the 1-MLIQ/NN hit of the approximation k-MLIQ, the axis-parallel k-MLIQ, and the k-NN search as source point.

The probability P and the distances between March 1st, 2010 and the MLIQ/NN hit of all three methods are shown in Table 4.1. For the weather prediction of March 2nd, 2010, the day after the identified MLIQ/NN was always chosen. Figure 4.13 (gray bars) shows the frequency of each feature value of March 2nd, 2010, meaning how often each temperature, humidity, barometric pressure, and wind speed value occurred on March 2nd, 2010. To visually compare the weather prediction quality of the approximation k-MLIQ search (magenta), the axis-parallel k-MLIQ search (green), and the k-NN search (light-blue), we plotted their predicted feature value distributions on top of the real distribution. Furthermore, Table 4.2 contains the mean
values, standard deviations, minimum, and maximum values of March 2nd, 2010 and the predictions of the three methods.

[Figure 4.13: four histogram panels showing frequency against temperature (°C), humidity (%), barometric pressure (hPa), and wind speed (km/h); legend: March 2nd 2010, k-NN search, axis-parallel k-MLIQ search, approx. k-MLIQ search]

Figure 4.13: Weather prediction for March 2nd, 2010. The gray bars show the real frequency distributions of the temperature, humidity, barometric pressure, and the wind speed of March 2nd, 2010. The colored distributions are the predicted weather distributions of March 2nd, 2010 by the approximation k-MLIQ search (magenta), the axis-parallel k-MLIQ search (green), and the k-NN search (light-blue).
Based on the data distributions (Figure 4.13) and the mean values (Table 4.2), the approximation k-MLIQ method obtained the best overall results, meaning its mean values and data distributions were closer to the real values than the predictions of the other two methods. The approximation k-MLIQ method had a mean temperature difference of 2.5 °C, a mean humidity difference of 2 %, a mean barometric pressure difference of 10 hPa, and a mean wind speed difference of 8.5 km/h. In contrast, the axis-parallel k-MLIQ search and the k-NN search obtained the following values, respectively: mean temperature difference 0.9 °C and 5.7 °C; mean humidity difference 18.6 % and 2.7 %; mean barometric pressure difference 30.6 hPa and 22.1 hPa; and mean wind speed difference 1.4 km/h and 34.6 km/h. Even though the axis-parallel k-MLIQ search produced slightly better results than the approximation k-MLIQ search for the temperature and the wind speed, our method was the only one that scored well in all 4 dimensions. Hence, the approximation k-MLIQ search produced the best overall results and would, therefore, be the method of choice for further weather predictions.
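The differences discussed above can be recomputed from the mean values listed in Table 4.2. The following short sketch does only that arithmetic; the dictionary layout and the function name are illustrative choices, and the numbers are the means from the table.

```python
# Mean values taken from Table 4.2 (real day and the three predictions).
REAL = {"temperature": 2.7, "humidity": 71.7, "pressure": 1019.2, "wind_speed": 2.7}

PREDICTED = {
    "approximation k-MLIQ": {
        "temperature": 5.2, "humidity": 69.7, "pressure": 1009.4, "wind_speed": 11.2},
    "axis-parallel k-MLIQ": {
        "temperature": 1.8, "humidity": 90.3, "pressure": 988.6, "wind_speed": 1.3},
    "k-NN": {
        "temperature": 8.4, "humidity": 69.0, "pressure": 997.1, "wind_speed": 37.3},
}


def mean_differences(real, predicted):
    """Absolute difference between predicted and real means, per method and feature."""
    return {
        method: {feature: round(abs(values[feature] - real[feature]), 1)
                 for feature in real}
        for method, values in predicted.items()
    }


differences = mean_differences(REAL, PREDICTED)
for method, per_feature in differences.items():
    print(method, per_feature)
```

Comparing the three result rows side by side shows which method stays close to the real values in every feature rather than in only one or two.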
4.6 Conclusion
We proposed a new efficient and accurate similarity search for uncertain data. Existing approaches do not consider correlations between different features, which leads to a loss of information and, therefore, introduces inaccuracy into the search. To overcome this problem we extended the similarity measure to handle very precise Probability Density Functions consisting of non-axis-parallel Gaussian Mixture Models. To our knowledge this has not been done so far. Since the calculation of the Mahalanobis distance is very time consuming, we introduced a combination of GMM approximation and angle clustering to speed up the procedure while keeping 100 % filter selectivity. The angle clustering step minimizes the approximation error by clustering those Gaussian distributions with a similar orientation in space. The newly determined coordinate systems are then used to rotate the Gaussians according to their major orientation in space. These rotated, more axis-parallel distributions are subsequently approximated using our filter-refinement architecture.
The conservative approximation, in combination with the accurate Mahalanobis distance that considers correlations in the similarity search, led to more accurate results for our approach compared with methods that ignore correlations entirely. This was demonstrated in our detailed experimental section, which included various synthetic and real-world data sets. Furthermore, owing to our filter-refinement architecture, we demonstrated 100 % filter selectivity, resulting in no false dismissals, while achieving on average a tenfold runtime reduction in comparison with the complete calculation of all exact Mahalanobis distances.
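To illustrate why ignoring correlations loses information, the following minimal sketch compares the Mahalanobis distance under a correlated (non-axis-parallel) covariance matrix with the distance obtained when the off-diagonal entries are dropped. It is restricted to two dimensions so that the matrix inverse can be written out by hand; the function name and the example covariance values are assumptions for illustration.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of point x from mean under a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form v^T * inverse(cov) * v with inverse = 1/det * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


mean = (0.0, 0.0)
correlated = ((1.0, 0.9), (0.9, 1.0))     # strongly correlated features
axis_parallel = ((1.0, 0.0), (0.0, 1.0))  # same variances, correlation dropped

along = (1.0, 1.0)     # lies along the direction of correlation
against = (1.0, -1.0)  # lies across the direction of correlation

print(mahalanobis_2d(along, mean, correlated))       # small: point fits the correlation
print(mahalanobis_2d(against, mean, correlated))     # large: point contradicts it
print(mahalanobis_2d(along, mean, axis_parallel))    # identical for both points
print(mahalanobis_2d(against, mean, axis_parallel))
```

Under the axis-parallel covariance both points are equally far from the mean, whereas the full covariance separates them clearly. This is the information that an axis-parallel model cannot represent.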
Similarity Search Based Glioma Grading
5.1 Introduction
The ongoing development of medical imaging techniques has produced a large amount of high-resolution three-dimensional image data. In particular, the high volume of non-invasive measurements acquired during clinical routine, such as structural and functional Magnetic Resonance Imaging (MRI), has revealed new possibilities to study the functioning of the human brain. For the field of brain imaging, data mining techniques have proven to be very useful since the large amount of image data cannot be processed directly for efficiency reasons. Especially for brain tumors, the use of different structural as well as functional MRI techniques has become a considerable research area aimed at improving non-invasive diagnosis, grading, and post-therapeutic follow-up.
The correct assignment of tumor malignancy is important because brain tumors of different histological grades differ in prognosis and therapy planning. To date, histological evidence provided by biopsy, the gold standard for glioma grading, is necessary to confirm the validity of the diagnosis. However, biopsy requires non-invasive determination of tumor hot spots, since a single tumor mass can be histologically heterogeneous; extracting parts of the tumor that are not representative (sampling error) would hence lead to an incorrect diagnosis and an inadequate treatment [KTE+11].
Furthermore, biopsy implies risks associated with anesthesia and surgery. Standard MRI protocols for the diagnosis of glioma patients mainly rely on the interpretation of contrast-enhanced T1-weighted images for tumor grading, with contrast enhancement used as an indicator of tumor malignancy. Since some low-grade gliomas show contrast enhancement while a considerable subgroup of high-grade gliomas does not [LYW+03], more sophisticated techniques for glioma grading are needed. Many studies have extracted single features of the structural tumor images, such as location, volume, size, and shape, in order to find relevant features for tumor grading [BJS10, KKK+10, LKK+01, MFS+00]. Some techniques have been proposed that try to classify tumors by considering the spatial information of three-dimensional tumor Regions of Interest (ROIs) [MDH99, PML+05]. Others have used non-invasive functional dynamic MRI techniques like perfusion MR, Diffusion Tensor Imaging (DTI), or MR spectroscopy in order to find meaningful criteria to improve non-invasive glioma grading [KIN+01, MAA+03, MJSA+04, PMB06, ZWC+09].
Several research groups have demonstrated that perfusion MRI can be used to distinguish between different tumor grades [BJS10, BSW06, LYB+04, LKK+01, PMB06], mainly differentiating between grade II and a combined group of grade III and IV brain tumors. It has been shown that Cerebral Blood Volume reliably correlates with tumor grade and with histological findings of increased tumor vascularity [BJS10, BSW06, PP00, SKK+98, WJH+98]. Nevertheless, the non-invasive grading of low-grade versus anaplastic glioma has so far remained very difficult.
The goal of this work was the development of a semi-automatic classifier-based method for differentiating between low-grade (grade I/II) and anaplastic (grade III) gliomas using perfusion-weighted MR imaging in combination with post-contrast T1-weighted imaging (T1CE). For data preprocessing we included the outlier detection algorithm described in Chapter 3, and for the similarity search we utilized the algorithm introduced in Chapter 4, which, among other things, considers feature correlations for the grading of the tumors. The database used for the similarity search consisted of four-dimensional non-axis-parallel Gaussian Mixture Models (GMMs), where the four dimensions comprised the three perfusion parameters Cerebral Blood Volume, Cerebral Blood Flow, and Mean Transit Time, as well as the T1CE image. In our approach we considered the entire intensity distribution information embedded in the tumor ROIs in order to make our method more accurate.