Figure 3.2: Plot of PC1 and PC2. The squares are actual tumour cells; the circles actual stroma cells. The filled (black) symbols are correctly classified; the open symbols incorrectly classified.
These results indicate that combining FTIR spectral data sets with PCA and
clustering algorithms is helpful in classifying the data. They also show that a
few PCs capture the majority of the information about the spread of the data, which helps
decrease the computational cost for the larger data sets anticipated in future
experiments.
3.2 Second Data Set Description
For this data set, a total of 25 breast cancer tissue samples from 5 different breast cancer
biological subtypes were identified from the archives of the Nottingham Tenovus Primary
Breast Cancer Series. These samples were collected by a team at the breast cancer research
3.2. Second Data Set Description 68
grade for each sample. The samples belong to different breast cancer categories, classified
as G-I, G-II or G-III. The aim of these experiments is to classify the grade of a
sample with the help of standard, commonly used clustering algorithms, i.e. k-means and
FCM.
3.2.1 Methods
Figure 3.3 shows the flow of the work for our experiments in the form of a stepwise
pipeline. We will now describe each stage of the pipeline in detail, pointing out the difficulties
[Figure 3.3: Pipeline of the experiments. Stage labels: Samples, FTIR, Pre-processing, Dimension Reduction, Clustering Algorithm. Location labels: Hospital, Chemistry, Computer Science.]
3.2.2 Samples
After detailed discussions with clinical experts at Nottingham City Hospital, we selected
one sample of each case for our experiments. These samples were recommended by the experts as
clinically better and more reliable.
3.2.3 FTIR of Samples
FTIR was carried out on a Nicolet Continuum FTIR Microscope in the School of Chemistry,
University of Nottingham. For this purpose, the selected samples were mounted on a slide
and placed within a slide holder on the microscope, and the spectrum was recorded
within the region of 800-4000 cm⁻¹.
Each spectrum obtained by this method contained 3319 absorbance values, and the
total size of the data for each sample was very large. For the G-I sample, the data size was
25944 × 3319 (spectra × wavenumbers); for the G-II sample, it was 18400 × 3319; and for
the G-III sample, it was 9393 × 3319. The data size varied between samples
because the samples differ in size. Processing such
a large amount of data is computationally expensive; therefore, a section of the cancerous region of each sample was
identified with the help of pathologists, and the data for those cancerous regions were extracted
using a script written in Matlab version 7.02 (Mathworks, Natick, MA, USA). These
regions are shown as boxes in Figure 3.4 for the selected G-I, G-II and G-III
cases. The number of spectra in each region varies, as it is not possible to get the same size of
section for each sample because of the different shapes of the samples. In order to get the best
spectral data available, only 100 spectra from each section were used for our experiments.
These spectra were selected by visual inspection of each spectrum
with the help of the clinical experts. In total, 300 spectra (100 of each grade) were used as
the data set for the current work. It is also important to note that the sections identified were
not the only cancerous regions in the samples. The samples also contained non-cancerous
(a) G-I sample with selected area in box (b) G-II sample with selected area in box
(c) G-III sample with selected area in box
Figure 3.4: Samples of second data set with selected areas in box
3.2.4 Pre-processing
Baseline correction was performed to correct the sloping baseline that is present in
cell spectra. Subsequently, all data underwent vector normalisation to remove effects
arising from the thickness of the sample. This was achieved by scaling all spectra such that
the sum of the squared deviations over the indicated wavenumber range equals unity. An example
of normalised spectra is shown in Figure 3.6. The two spectra belonging to G-II
were clearly different from each other before pre-processing, as shown in Figure 3.5.
After pre-processing, the blue and red spectra overlap so closely
that only one spectrum appears to be visible in Figure 3.6. Normalisation has
brought the raw spectra onto a common scale, so they can now be used in the analysis to distinguish them from
spectra of other grades. All of the corrections mentioned were achieved using a script
to the region between 900 and 1800 cm⁻¹ as a fingerprint region. Each spectrum consists of
934 absorbance values over this region, which is significantly fewer than the 3319 found in the
original data set.
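The normalisation and cropping steps described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the Matlab script used for the experiments. The function names, the synthetic spectra and the order of the two steps (crop first, then normalise over the cropped range) are assumptions made for the example, and baseline correction is omitted because the method used is not specified here.

```python
import numpy as np


def crop_region(spectra, wavenumbers, low=900.0, high=1800.0):
    """Keep only the columns whose wavenumber lies in [low, high] cm^-1."""
    mask = (wavenumbers >= low) & (wavenumbers <= high)
    return spectra[:, mask], wavenumbers[mask]


def vector_normalise(spectra):
    """Mean-centre each spectrum (row) and scale it so that the sum of
    its squared deviations over the retained range equals one."""
    centred = spectra - spectra.mean(axis=1, keepdims=True)
    norms = np.sqrt((centred ** 2).sum(axis=1, keepdims=True))
    return centred / norms


# Synthetic stand-in data (assumption): two spectra with the same band
# shape but different intensity, mimicking a difference in thickness.
wavenumbers = np.linspace(800.0, 4000.0, 3319)
band = np.exp(-((wavenumbers - 1650.0) / 40.0) ** 2)
raw = np.vstack([1.0 * band, 2.5 * band])

cropped, wn = crop_region(raw, wavenumbers)
normalised = vector_normalise(cropped)

# After normalisation the two spectra coincide.
print(np.allclose(normalised[0], normalised[1]))
```

Because each spectrum is divided by its own norm, a purely multiplicative difference such as sample thickness cancels out, which is why two spectra of the same grade can overlap after pre-processing.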
Figure 3.6: Example of processed spectra
A data set was created by combining the selected 100 spectra from the cancerous
region of each case. The final data set was of size 300 × 934 (spectra × wavenumbers).
3.2.5 Dimension Reduction
For this data set, we selected PCA as the standard method for dimension reduction.
Although our data set is not very large (300 × 934), we applied PCA as a standard
procedure. We used the first 10 PCs for our experiments, because
the first 10 PCs accounted for more than 95% of the variation in the overall data set.
After dimension reduction, the final data set was of size 300 spectra × 10 PCs.
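The selection of PCs by cumulative explained variance can be sketched as below. This is an illustrative Python/NumPy example under stated assumptions: the function name `pca_reduce`, the 0.95 threshold argument and the synthetic low-rank data are hypothetical, and the code picks the smallest number of PCs reaching the threshold, which is not necessarily how the figure of 10 PCs was chosen for the real data.

```python
import numpy as np


def pca_reduce(data, variance_target=0.95):
    """Project the data onto the fewest principal components whose
    cumulative explained variance reaches variance_target."""
    centred = data - data.mean(axis=0)
    # The rows of vt are the principal axes; the singular values s
    # give the variance carried by each axis.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(explained)
    # First position at which the cumulative variance reaches the target.
    n_components = int(np.searchsorted(cumulative, variance_target) + 1)
    scores = centred @ vt[:n_components].T
    return scores, cumulative, n_components


# Synthetic stand-in for the 300 x 934 spectral matrix (assumption):
# three latent factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
loadings = rng.normal(size=(3, 934))
data = latent @ loadings + 0.01 * rng.normal(size=(300, 934))

scores, cumulative, n_components = pca_reduce(data)
print(scores.shape, n_components)
```

With the real spectra, inspecting `cumulative` would show how much of the variation the first 10 PCs retain.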
3.2.6 Clustering Algorithms
We have used k-means and FCM clustering algorithms for our experiments. The selec-
algorithm (FCM).
For the FCM clustering algorithm, squared Euclidean distance was used. The fuzziness
index was set to a value of 2 and the minimal amount of improvement was set to 10⁻⁵. For the
k-means clustering algorithm, squared Euclidean distance was again used as the distance
measure and the maximum number of iterations was set to 100. The number of clusters for
both the FCM and k-means clustering algorithms was set to 3. The results obtained from
clustering were compared with the classification made by the expert histopathologists.
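The settings above can be illustrated with a compact sketch of both algorithms in Python with NumPy. This is not the implementation used for the experiments. The function names, the random initialisation, the iteration cap for FCM and the synthetic 300 × 10 score matrix are all assumptions; only the fuzziness index of 2, the minimal improvement of 10⁻⁵, the 100 k-means iterations, the squared Euclidean distance and the 3 clusters come from the text.

```python
import numpy as np


def squared_distances(data, centres):
    """Squared Euclidean distance between every sample and every centre."""
    diff = data[:, None, :] - centres[None, :, :]
    return (diff ** 2).sum(axis=2)


def fcm(data, n_clusters=3, m=2.0, min_improvement=1e-5, max_iter=100, seed=0):
    """Fuzzy c-means. max_iter is an assumed safety cap."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to one.
    u = rng.random((data.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    previous = np.inf
    for _ in range(max_iter):
        w = u ** m
        centres = (w.T @ data) / w.sum(axis=0)[:, None]
        d2 = np.maximum(squared_distances(data, centres), 1e-12)
        objective = float((w * d2).sum())
        # Membership is proportional to d2 ** (-1 / (m - 1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Stop once the objective improves by less than the threshold.
        if abs(previous - objective) < min_improvement:
            break
        previous = objective
    return centres, u


def kmeans(data, n_clusters=3, max_iter=100, seed=0):
    """Lloyd's k-means with centres initialised from random samples."""
    rng = np.random.default_rng(seed)
    centres = data[rng.choice(data.shape[0], n_clusters, replace=False)]
    labels = None
    for _ in range(max_iter):
        new_labels = squared_distances(data, centres).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(n_clusters):
            members = data[labels == k]
            if len(members):  # keep the old centre if a cluster is empty
                centres[k] = members.mean(axis=0)
    return centres, labels


# Synthetic stand-in for the 300 x 10 matrix of PC scores (assumption).
rng = np.random.default_rng(1)
offsets = np.array([-5.0, 0.0, 5.0])
scores = np.vstack([rng.normal(loc=o, scale=1.0, size=(100, 10)) for o in offsets])

fcm_centres, u = fcm(scores)
fcm_labels = u.argmax(axis=1)  # harden the memberships for comparison
km_centres, labels = kmeans(scores)
print(np.bincount(fcm_labels, minlength=3), np.bincount(labels, minlength=3))
```

Hardening the FCM memberships with `argmax` gives one cluster label per spectrum, which is the form needed to compare the clusters with the grades assigned by the histopathologists.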
3.2.7 Results
Table 3.3 shows the results of the FCM and k-means clustering algorithms with 3 clusters.
Table 3.3a shows the results of the FCM clustering algorithm. It indicates that cluster 1
mainly contains members of G-I. Twenty-four members of G-I became part of cluster
3, which mainly contains G-II members. Cluster 2 successfully separated
the G-III members from the rest of the data set: only one G-III member was misclassified and became
part of cluster 1. Cluster 3 contains 87 members of G-II, whereas the remaining 13
were part of cluster 1. Table 3.3b describes the results obtained by the k-means clustering
algorithm. Cluster 1 contains a majority of G-II members and only one G-III member, whereas
cluster 2 consists mainly of G-I members. G-III is clearly separated by cluster 3. The results of both the FCM
and k-means clustering algorithms indicate that the spectral data of G-I and G-II differed
little from each other, so members of these two grades were mixed across clusters. In the case of the G-III data,
both the FCM and k-means clustering algorithms were able to clearly distinguish it from the rest
of the grades.
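As a worked check of the FCM figures in Table 3.3a, the agreement between clusters and grades can be computed by labelling each cluster with its majority grade. The sketch below is illustrative Python; the function name `summarise` and the majority-labelling rule are assumptions, and only the FCM counts are used.

```python
# Counts copied from Table 3.3a (FCM clustering): cluster -> grade -> members.
fcm_table = {
    1: {"G-I": 73, "G-II": 13, "G-III": 1},
    2: {"G-I": 3, "G-II": 0, "G-III": 99},
    3: {"G-I": 24, "G-II": 87, "G-III": 0},
}


def summarise(table):
    """Label each cluster with its majority grade and count the spectra
    whose expert grade agrees with that label."""
    total = sum(sum(row.values()) for row in table.values())
    majority = {cluster: max(row, key=row.get) for cluster, row in table.items()}
    agreeing = sum(row[majority[cluster]] for cluster, row in table.items())
    return majority, agreeing, total


majority, agreeing, total = summarise(fcm_table)
print(majority)
print(agreeing, total, round(agreeing / total, 3))
```

Under this labelling, 259 of the 300 spectra (73 + 99 + 87) fall in the cluster associated with their grade, and most of the disagreement comes from the overlap between G-I and G-II.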
Table 3.3: Results with FCM and k-means clustering algorithms with data set 2
(a) FCM clustering
Cluster (members)   G-I   G-II   G-III
1 (87)               73     13       1
2 (102)               3      0      99
3 (111)              24     87       0
(b) k-means clustering
Cluster (members)   G-I   G-II   G-III
1 (113)              40     73       1
2 (87)               60     27       0