Feature 5, Smoking Description, in Table 5.11 contained free-text data (doctors' notes). This feature was pre-processed to extract information about the patient's smoking history, and the resulting categorical feature was further broken down into the following six features.
SMPNUM = length(smoketext);
smokeattr = nan(SMPNUM, 6);

IND_EX_CIGARETTE  = 1;  % 1-Ex Smoker: How Many Cigarettes Per Day?
IND_EX_CIGAR      = 2;  % 2-Ex Smoker: How Many Cigars Per Day?
IND_EX_TIME       = 3;  % 3-Ex Smoker: Stopped How Many Weeks?
IND_SMK_CIGARETTE = 4;  % 4-Smoker: How Many Cigarettes Per Day?
IND_SMK_CIGAR     = 5;  % 5-Smoker: How Many Cigars Per Day?
IND_SMK_TIME      = 6;  % 6-Smoker: How Long?
The number of patients in each final diagnosis category is given in Table 5.12. The patient data used in this study are imbalanced: most of the classes (final diagnoses) did not have a substantial number of patient cases for data classification work.
Final Diagnosis Assessment Type    Number of Patients
-------------------------------    ------------------
Acute Coronary Syndrome                            17
New Exertional Angina                             101
Non-cardiac Symptoms                              176
Other                                              20
Possible Exertional Angina                        294
Total Number of Patients                          608

Table 5.12: Final Diagnoses
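To make the imbalance concrete, the class shares can be computed directly from the counts in Table 5.12. The snippet below is only an illustrative sketch; the variable names (`counts`, `shares`, `imbalance_ratio`) are hypothetical and do not come from the original analysis.

```python
# Patient counts per final diagnosis, copied from Table 5.12.
counts = {
    "Acute Coronary Syndrome": 17,
    "New Exertional Angina": 101,
    "Non-cardiac Symptoms": 176,
    "Other": 20,
    "Possible Exertional Angina": 294,
}

total = sum(counts.values())
# Fraction of all patients that falls into each class.
shares = {label: n / total for label, n in counts.items()}
# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Print the classes from smallest to largest share.
for label, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{label:28s} {counts[label]:4d}  {share:6.1%}")
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```

The two smallest classes (Acute Coronary Syndrome and Other) together account for 37 of the 608 patients, roughly 6% of the data.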
5.3.3 Expectation Maximisation (EM) Approach
To make effective use of the missing/incomplete data, we applied and extended a probabilistic mixture model suited to the RACPC dataset with missing values. We regarded the class label as a categorical feature of the sample and estimated the joint distribution of the variables from the training samples. For a test sample, the class label to be predicted is simply regarded as missing data: we work out the likelihood of the sample under each candidate label and assign the label with the maximal likelihood. Because the features in the RACPC dataset were transformed into binary values during the data analysis phase, we implemented a model consisting of a mixture of several Bernoulli variables and a categorical variable.

The mixture model is described as follows. The data are assumed to be generated from a mixture of $M$ densities, where each component is a joint density composed of multiple Bernoulli variables and a categorical variable. Since there are $D = 17$ binary features and $C = 5$ class labels, the model parameters for the $j$-th component comprise 17 Bernoulli parameters $\{\mu_{jd}\}_{d=1}^{17}$ and a 5-dimensional categorical parameter $\{\nu_{jy}\}_{y=1}^{5}$, where $\sum_{y} \nu_{jy} = 1$, $\forall j$. Denoting the features as $x = [x_1\ x_2\ \ldots\ x_D]^{\top}$ and the class label as $y$, the joint distribution is

$$
P(x, y \mid \mu, \nu) = \sum_{j=1}^{M} P(\omega_j)\, P(x, y \mid \nu_j, \mu_j) = \sum_{j=1}^{M} P(\omega_j)\, \nu_{jy} \prod_{d=1}^{D} \mu_{jd}^{x_d} (1 - \mu_{jd})^{(1 - x_d)}, \tag{5.1}
$$
where ωj represents the j-th component of the mixture.
Then the log likelihood of the parameters given the data $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{N}$ is

$$
l(\mu, \nu \mid \mathcal{X}) = \sum_{i=1}^{N} \log P(x_i, y_i \mid \mu, \nu). \tag{5.2}
$$
Given any sample $x$, the likelihood $P(x, y)$ is calculated for each class $y = 1, 2, \ldots, 5$, and the sample is then assigned the label corresponding to the maximal likelihood.
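The joint distribution (5.1), the log likelihood (5.2) and the maximal-likelihood labelling rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the study: the function names (`joint_prob`, `log_likelihood`, `predict`) and the toy parameters are hypothetical, labels are 0-based for convenience, and every feature is assumed to be observed.

```python
import math

def joint_prob(x, y, weights, mu, nu):
    """P(x, y | mu, nu) as in Eq. (5.1) for one fully observed binary sample."""
    total = 0.0
    for w_j, mu_j, nu_j in zip(weights, mu, nu):
        bern = 1.0
        for x_d, m in zip(x, mu_j):
            # mu^x * (1 - mu)^(1 - x) for a binary feature x
            bern *= m if x_d == 1 else 1.0 - m
        total += w_j * nu_j[y] * bern
    return total

def log_likelihood(samples, weights, mu, nu):
    """Eq. (5.2): sum of log joint probabilities over (x, y) pairs."""
    return sum(math.log(joint_prob(x, y, weights, mu, nu)) for x, y in samples)

def predict(x, weights, mu, nu):
    """Return the label y that maximises the joint probability P(x, y)."""
    n_classes = len(nu[0])
    return max(range(n_classes), key=lambda y: joint_prob(x, y, weights, mu, nu))

# Toy parameters (hypothetical): M = 2 components, D = 3 features, C = 2 classes.
weights = [0.5, 0.5]                      # P(omega_j)
mu = [[0.9, 0.8, 0.1], [0.2, 0.1, 0.7]]   # mu[j][d]
nu = [[0.9, 0.1], [0.2, 0.8]]             # nu[j][y]

print(predict([1, 1, 0], weights, mu, nu))
print(predict([0, 0, 1], weights, mu, nu))
```

With these toy parameters the first sample resembles component 0, which favours label 0, and the second resembles component 1, which favours label 1.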
Solving the EM Algorithm for Mixture Models
Maximizing the log likelihood (5.2) directly over the parameters is usually intractable because of the logarithm of the summation. In practice, the model is optimized by the Expectation-Maximization (EM) algorithm.
To resolve the logarithm of the summation, binary indicator variables $\mathcal{Z} = \{z_i\}_{i=1}^{N}$ are introduced, defined such that $z_i = [z_{i1} \ldots z_{iM}]$ and $z_{ij} = 1$ iff $(x_i, y_i)$ is generated by the $j$-th density. Then the log likelihood can be written as

$$
l_c(\mu, \nu \mid \mathcal{X}, \mathcal{Z}) = \sum_{i=1}^{N} \sum_{j=1}^{M} z_{ij} \log\left[ P(x_i, y_i \mid z_i, \mu, \nu)\, P(z_i) \right], \tag{5.3}
$$

which does not involve a logarithm of a summation.
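To see why (5.3) is easier to handle, the component density from (5.1) can be substituted into it. The expansion below is a sketch that assumes $P(z_{ij} = 1) = P(\omega_j)$ and fully observed features; under those assumptions the logarithm acts on a product, so it separates into a sum of simple terms:

```latex
l_c(\mu, \nu \mid \mathcal{X}, \mathcal{Z})
  = \sum_{i=1}^{N} \sum_{j=1}^{M} z_{ij}
    \Big[ \log P(\omega_j) + \log \nu_{j y_i}
      + \sum_{d=1}^{D} \big( x_{id} \log \mu_{jd}
      + (1 - x_{id}) \log (1 - \mu_{jd}) \big) \Big]
```

Each parameter appears in its own additive term, which is what allows closed-form updates once the indicators are replaced by their expectations.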
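Let $Q(\mu, \nu \mid \mu^{(k)}, \nu^{(k)})$ denote the expectation of $l_c(\mu, \nu \mid \mathcal{X}, \mathcal{Z})$. The likelihood $l(\mu, \nu \mid \mathcal{X})$ can then be maximized by iterating the following two steps:

E-step:

$$
Q(\mu, \nu \mid \mu^{(k)}, \nu^{(k)}) = E\left[ l_c(\mu, \nu \mid \mathcal{X}, \mathcal{Z}) \mid \mathcal{X}, \mu^{(k)}, \nu^{(k)} \right]
$$

M-step:

$$
(\mu^{(k+1)}, \nu^{(k+1)}) = \arg\max_{\mu, \nu} Q(\mu, \nu \mid \mu^{(k)}, \nu^{(k)})
$$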
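When some data are missing, the observation $x_i$ is divided into an observed part and a missing part, $(x_i^{o}, x_i^{m})$, and the algorithm is rewritten as

E-step:

$$
Q(\mu, \nu \mid \mu^{(k)}, \nu^{(k)}) = E\left[ l_c(\mu, \nu \mid \mathcal{X}^{o}, \mathcal{X}^{m}, \mathcal{Z}) \mid \mathcal{X}^{o}, \mu^{(k)}, \nu^{(k)} \right]
$$

M-step:

$$
(\mu^{(k+1)}, \nu^{(k+1)}) = \arg\max_{\mu, \nu} Q(\mu, \nu \mid \mu^{(k)}, \nu^{(k)}).
$$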
At the E-step, the expectation of $z_{ij}$ is calculated over the observed part of $x_i$ as

$$
h_{ij} = \frac{\nu_{j y_i} \prod_{d \in D_i^{o}} \mu_{jd}^{x_{id}} (1 - \mu_{jd})^{(1 - x_{id})}}{\sum_{l=1}^{M} \nu_{l y_i} \prod_{d \in D_i^{o}} \mu_{ld}^{x_{id}} (1 - \mu_{ld})^{(1 - x_{id})}},
$$

where $D_i^{o}$ is the set of indices of the observed part of the $i$-th sample, $x_{id}$ is the $d$-th dimension of the $i$-th sample, and $\nu_{j y_i}$ is the probability that the label of a sample from the $j$-th component is $y_i$.
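The E-step formula above can be sketched as follows. This is an illustrative example rather than the original implementation: the function name `responsibilities`, the use of `None` as the missing-value marker, the 0-based labels and the toy parameters are all assumptions. The sketch follows the formula literally, so no mixing weights $P(\omega_j)$ appear in it.

```python
def responsibilities(x, y, mu, nu):
    """h_ij for one sample; entries of x equal to None are treated as missing."""
    scores = []
    for mu_j, nu_j in zip(mu, nu):
        s = nu_j[y]
        for x_d, m in zip(x, mu_j):
            if x_d is None:
                # Unobserved dimension: not in D_i^o, so it is left out of the product.
                continue
            s *= m if x_d == 1 else 1.0 - m
        scores.append(s)
    total = sum(scores)
    # Normalise over the M components so the responsibilities sum to one.
    return [s / total for s in scores]

# Toy parameters (hypothetical): M = 2 components, D = 3 features, C = 2 classes.
mu = [[0.9, 0.8, 0.1], [0.2, 0.1, 0.7]]   # mu[j][d]
nu = [[0.9, 0.1], [0.2, 0.8]]             # nu[j][y]

# Second feature is missing; only features 0 and 2 enter the product.
h = responsibilities([1, None, 0], 0, mu, nu)
print(h)
```

When every feature of a sample is missing, the product is empty and the responsibilities depend only on the label term $\nu_{j y_i}$.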
At the M-step, the parameters are re-estimated as

$$
\mu_j^{(k+1)} = \frac{\sum_{i=1}^{N} h_{ij}\, x_i}{\sum_{i=1}^{N} h_{ij}},
$$

where $h_{ij}$ is calculated in the E-step, and for the missing part $h_{ij} x_i^{m}$ is replaced with the expectation $E[z_{ij} x_i^{m} \mid x_i^{o}, \mu, \nu] = h_{ij} \mu_j^{m}$.
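The M-step update for the Bernoulli parameters, including the substitution used for missing entries, can be sketched as below. The function name `m_step_mu`, the `None` missing-value marker and the small example data are hypothetical, and the responsibilities `H` are assumed to have been computed already in the E-step.

```python
def m_step_mu(X, H, mu_old):
    """Re-estimate mu[j][d] from responsibilities H[i][j]; None marks a missing x_id."""
    n_comp, n_feat = len(mu_old), len(mu_old[0])
    mu_new = []
    for j in range(n_comp):
        denom = sum(h_i[j] for h_i in H)
        row = []
        for d in range(n_feat):
            num = 0.0
            for x_i, h_i in zip(X, H):
                # A missing entry contributes h_ij * mu_jd, i.e. its expectation
                # under component j with the current parameters.
                value = mu_old[j][d] if x_i[d] is None else x_i[d]
                num += h_i[j] * value
            row.append(num / denom)
        mu_new.append(row)
    return mu_new

# Toy data (hypothetical): N = 3 samples, D = 2 features, M = 2 components.
X = [[1, 0], [1, None], [0, 1]]
H = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
mu_old = [[0.5, 0.4], [0.5, 0.6]]

mu_new = m_step_mu(X, H, mu_old)
print(mu_new)
```

In this example the second sample has no value for the second feature, so its contribution to component $j$ is $h_{ij}$ times the current estimate `mu_old[j][1]` rather than an observed 0 or 1.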
These two steps are repeated until convergence, which yields the fitted model for this problem.
Given a test sample, the probability that this sample belongs to each class is obtained from (5.1) and it is assigned to the class corresponding to the maximal likelihood.