
Figure 3.2: Plot of PC1 and PC2. The squares are actual tumour cells; the circles actual stroma cells. The filled (black) symbols are correctly classified; the open symbols incorrectly classified

These results indicate that combining FTIR spectral data sets with PCA and clustering algorithms helps in classifying the data. They also show that a few PCs capture the majority of the information about the spread of the data, which helps decrease the computational cost for the larger data sets anticipated in future experiments.

3.2 Second Data Set Description

For this data set, a total of 25 breast cancer tissue samples from 5 different biological subtypes of breast cancer were identified from the archives of the Nottingham Tenovus Primary Breast Cancer Series. These samples were collected by a team at the breast cancer research

3.2. Second Data Set Description 68

grade for each sample. The samples belong to different breast cancer categories, classified as G-I, G-II or G-III. The aim of these experiments is to classify the grade of a sample with the help of standard, commonly used clustering algorithms, i.e. k-means and fuzzy c-means (FCM).

3.2.1 Methods

Figure 3.3 shows the flow of the work for our experiments in the form of a stepwise pipeline. We will now describe each stage of the pipeline in detail, pointing out the difficulties

Figure 3.3: Pipeline of the experiments (stages: Samples, FTIR, Pre-processing, Dimension Reduction, Clustering Algorithm; sites: Hospital, Chemistry, Computer Science)


3.2.2 Samples

After detailed discussions with clinical experts at Nottingham City Hospital, we selected one sample of each case for our experiments. The experts recommended these samples as clinically better and more reliable.

3.2.3 FTIR of Samples

FTIR was carried out on a Nicolet Continuum FTIR Microscope in the School of Chemistry, University of Nottingham. For this purpose, the selected samples were mounted on a slide and placed within a slide holder on the microscope, and the spectrum was recorded within the region of 800-4000 cm⁻¹.

Each spectrum obtained by this method contained 3319 absorbance values, and the total size of the data for each sample was very large: 25944 spectra × 3319 wavenumbers for the G-I sample, 18400 × 3319 for the G-II sample and 9393 × 3319 for the G-III sample. The data size varied because the samples differed in size. Processing such a large amount of data is computationally expensive; therefore, a section of the cancerous region in each sample was identified with the help of pathologists, and the data for those regions were extracted using a script written in Matlab version 7.02 (Mathworks, Natick, MA, USA). These regions are shown as boxes in Figure 3.4 for the selected G-I, G-II and G-III cases. The number of spectra in each region varies, because the different shapes of the samples make it impossible to obtain sections of the same size. In order to use the best spectral data available, only 100 spectra from each section were used for our experiments. These spectra were selected by visual inspection of each spectrum with the help of the clinical experts. In total, 300 spectra (100 of each grade) were used as the data set for the current work. It is also important to note that the sections identified were not the only cancerous regions in the samples. The samples also contained non-cancerous


(a) G-I sample with selected area in box (b) G-II sample with selected area in box

Figure 3.4: Samples of second data set with selected areas in box

3.2.4 Pre-processing

Baseline correction was performed to correct the sloping baseline that is present in cell spectra. Subsequently, all data underwent vector normalisation to remove effects arising from the thickness of the sample. This was achieved by scaling all spectra such that the sum of the squared deviations over the indicated wavenumber range equals unity. An example of normalised spectra is shown in Figure 3.6. The two spectra belonging to G-II were clearly different from each other before pre-processing, as shown in Figure 3.5. After pre-processing, the blue and red spectra overlap so closely that only one spectrum appears to be visible in Figure 3.6. Normalisation has brought the raw spectra onto a common scale, so they can now be used for analysis to distinguish them from the spectra of other grades. All of the corrections mentioned were achieved using a script


to the region between 900-1800 cm⁻¹ as a fingerprint region. Each spectrum consists of 934 absorbance values over this region, significantly fewer than the 3319 found in the original data set.
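The vector normalisation and fingerprint-region selection described in this subsection can be illustrated with a minimal Python sketch. It is not the authors' Matlab script: the function names `crop_region` and `vector_normalise`, the toy spectra, and the order of the two steps are assumptions made for illustration, and baseline correction is left out because the text does not specify the method used.

```python
import math

def crop_region(wavenumbers, spectrum, low=900.0, high=1800.0):
    """Keep only the points whose wavenumber lies in [low, high] cm^-1."""
    kept = [(w, a) for w, a in zip(wavenumbers, spectrum) if low <= w <= high]
    return [w for w, _ in kept], [a for _, a in kept]

def vector_normalise(spectrum):
    """Mean-centre a spectrum and scale it so its squared deviations sum to 1."""
    mean = sum(spectrum) / len(spectrum)
    centred = [a - mean for a in spectrum]
    norm = math.sqrt(sum(c * c for c in centred))
    if norm == 0.0:
        raise ValueError("a flat spectrum cannot be vector-normalised")
    return [c / norm for c in centred]

# Hypothetical toy spectra: same shape, but one sample is three times "thicker".
wavenumbers = [800.0, 900.0, 1200.0, 1500.0, 1800.0, 2500.0]
thin = [0.10, 0.20, 0.50, 0.30, 0.20, 0.10]
thick = [3.0 * a for a in thin]

_, thin_fp = crop_region(wavenumbers, thin)
_, thick_fp = crop_region(wavenumbers, thick)
thin_norm = vector_normalise(thin_fp)
thick_norm = vector_normalise(thick_fp)
```

After normalisation the two toy spectra coincide, which mirrors the overlap of the two G-II spectra reported for Figure 3.6.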


Figure 3.6: Example of processed spectra

A data set was created by combining the selected 100 spectra from the cancerous region of each case. The final data set was of size 300 spectra × 934 wavenumbers.

3.2.5 Dimension Reduction

For this data set, we selected PCA as the standard method for dimension reduction. Although our data set is not very large (300 × 934), we applied PCA as a standard procedure. We used the first 10 PCs for our experiments, because the first 10 PCs contained more than 95% of the variation in the overall data set. After dimension reduction, the final data set was of size 300 spectra × 10 PCs.
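The selection of PCs by retained variation can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `pca_scores` is a hypothetical helper, the data are synthetic, and the sketch picks the smallest number of PCs that reaches the 95% threshold rather than fixing the count at 10.

```python
import numpy as np

def pca_scores(data, variance_target=0.95):
    """Project the rows of `data` onto the fewest PCs reaching `variance_target`."""
    centred = data - data.mean(axis=0)
    # SVD of the centred matrix: the rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # fraction of variance per PC
    cumulative = np.cumsum(explained)
    # First index at which the cumulative fraction reaches the target.
    n_components = min(int(np.searchsorted(cumulative, variance_target)) + 1, len(s))
    scores = centred @ vt[:n_components].T
    return scores, explained[:n_components]

# Synthetic stand-in for the spectra: 300 rows driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 50))
spectra = latent @ mixing + 0.01 * rng.normal(size=(300, 50))

scores, explained = pca_scores(spectra)
print(scores.shape, float(explained.sum()))
```

Applied to the real 300 × 934 matrix, the same function would return the score matrix that is passed on to clustering.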

3.2.6 Clustering Algorithms

We have used k-means and FCM clustering algorithms for our experiments. The selec-


algorithm (FCM).

For the FCM clustering algorithm, squared Euclidean distance was used. The fuzziness index was set to 2 and the minimum amount of improvement was set to 10⁻⁵. For the k-means clustering algorithm, squared Euclidean distance was again used as the distance measure, and the maximum number of iterations was set to 100. The number of clusters for both the FCM and k-means clustering algorithms was set to 3. The results obtained from clustering were compared with the classification made by the expert histopathologists.
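The two algorithms with the settings listed above can be sketched in NumPy as follows. This is an illustrative sketch, not the code used in the experiments. Assumptions that are not in the text: the function names, the synthetic data, a deterministic farthest-first initialisation (chosen so the example is reproducible), and a cap of 100 iterations for FCM as well as k-means.

```python
import numpy as np

def squared_distances(data, centres):
    """Squared Euclidean distance from every row of data to every centre."""
    diff = data[:, None, :] - centres[None, :, :]
    return np.sum(diff**2, axis=2)

def farthest_first_centres(data, n_clusters):
    """Assumed initialisation: first row, then the row farthest from chosen centres."""
    centres = [data[0]]
    for _ in range(n_clusters - 1):
        nearest = squared_distances(data, np.array(centres)).min(axis=1)
        centres.append(data[np.argmax(nearest)])
    return np.array(centres)

def kmeans(data, n_clusters=3, max_iter=100):
    centres = farthest_first_centres(data, n_clusters)
    for _ in range(max_iter):
        labels = np.argmin(squared_distances(data, centres), axis=1)
        new_centres = np.array([
            data[labels == k].mean(axis=0) if np.any(labels == k) else centres[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    labels = np.argmin(squared_distances(data, centres), axis=1)
    return labels, centres

def fcm(data, n_clusters=3, m=2.0, min_improvement=1e-5, max_iter=100):
    centres = farthest_first_centres(data, n_clusters)
    previous = np.inf
    for _ in range(max_iter):
        d2 = np.fmax(squared_distances(data, centres), 1e-12)
        # Membership update: u_ik proportional to d2_ik^(-1/(m-1)); rows sum to 1.
        inverse = d2 ** (-1.0 / (m - 1.0))
        u = inverse / inverse.sum(axis=1, keepdims=True)
        objective = np.sum((u**m) * d2)
        if abs(previous - objective) < min_improvement:
            break
        previous = objective
        # Centre update: mean of the data weighted by u^m.
        weights = u**m
        centres = (weights.T @ data) / weights.sum(axis=0)[:, None]
    return u, centres

# Synthetic stand-in for 300 spectra x 10 PCs: three well-separated groups of 100.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 10)) for c in (0.0, 10.0, 20.0)])

km_labels, _ = kmeans(data)
u, _ = fcm(data)
fcm_labels = np.argmax(u, axis=1)  # harden fuzzy memberships for comparison
```

Hardening the FCM memberships with `argmax` is one simple way to obtain cluster labels that can be compared with the histopathologists' grades.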

3.2.7 Results

Table 3.3 shows the results of the FCM and k-means clustering algorithms with 3 clusters. Table 3.3a shows the results of the FCM clustering algorithm. Cluster 1 mainly contains members of G-I. Twenty-four members of G-I became part of cluster 3, which mainly contains G-II members. Cluster 2 successfully separated the G-III members from the rest of the data set; only one G-III member was misclassified and became part of cluster 1. Cluster 3 contains 87 members of G-II, whereas the remaining 13 were part of cluster 1. Table 3.3b describes the results obtained by the k-means clustering algorithm. Cluster 1 has a majority of G-II members and only one G-III member, whereas cluster 2 consists mainly of G-I members. G-III is clearly separated by cluster 3. The results of both the FCM and k-means clustering algorithms indicate that the spectral data of G-I and G-II had less variation between them, so members of these two grades were mixed across clusters. In the case of G-III, both algorithms were able to clearly distinguish it from the rest of the grades.
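The FCM counts reported in Table 3.3a can be summarised by assigning each cluster to its majority grade and counting the spectra that agree with that assignment. The sketch below is only an illustration of this arithmetic: the helper `majority_agreement` is hypothetical, and the agreement figure is derived here from the table counts rather than reported in the text.

```python
# Counts copied from Table 3.3a (FCM): one row per cluster, columns G-I, G-II, G-III.
GRADES = ["G-I", "G-II", "G-III"]
fcm_table = {
    1: [73, 13, 1],
    2: [3, 0, 99],
    3: [24, 87, 0],
}

def majority_agreement(table, grades=GRADES):
    """Map each cluster to its majority grade and count the agreeing spectra."""
    majority = {cluster: grades[row.index(max(row))] for cluster, row in table.items()}
    matched = sum(max(row) for row in table.values())
    total = sum(sum(row) for row in table.values())
    return majority, matched, total

majority, matched, total = majority_agreement(fcm_table)
print(majority)                                        # {1: 'G-I', 2: 'G-III', 3: 'G-II'}
print(f"{matched}/{total} = {matched / total:.1%}")    # 259/300 = 86.3%
```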

Table 3.3: Results with FCM and k-means clustering algorithm with data set 2

(a) FCM Clustering

Cluster (members)   G-I   G-II   G-III
1 (87)               73     13       1
2 (102)               3      0      99
3 (111)              24     87       0

(b) K-means Clustering

Cluster (members)   G-I   G-II   G-III
1 (113)              40     73       1
2 (87)               60     27       0
