
2. STATE OF THE ART

3.6. Clustering

The critical value (r) is the threshold between identification and misidentification. For correlation spectral matching, it is the match value at or above which identification is deemed to have occurred and below which it is accepted that identification has not taken place. This value is derived from the internal and external database validation. The critical value thus defines the standard that a match result must meet or exceed to be considered acceptable (a pass). It is intended as a guide rather than a ‘hard and fast rule’ because experience with the data is still limited, i.e. the value will be determined as the identification of the external test samples is carried out.

A misidentification occurs when a correlation spectral match value greater than the critical r value has been obtained but the true identity of the test substance and the database match differ.
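The decision rule for correlation matching can be illustrated with a minimal sketch. It assumes the match value is a plain Pearson correlation between two spectra sampled at the same wavelengths; the source does not state how the software computes its match value, and the function names here are hypothetical.

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation between two equally sampled spectra (assumed metric)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def passes_correlation(match_value, critical_r):
    """A match is a pass when it is equal to or above the critical value."""
    return match_value >= critical_r

def is_misidentification(match_value, critical_r, true_identity, matched_identity):
    """A pass whose database match differs from the true identity."""
    return passes_correlation(match_value, critical_r) and true_identity != matched_identity
```

For example, a match value of 0.90 against a critical value of 0.850 is a pass; it counts as a misidentification only if the matched database entry is not the substance actually tested.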

For wavelength distance matching, the critical value (r) is the value below which identification is deemed to have occurred. Wavelength distance match values above this critical r value are therefore deemed non-identifications. A misidentification occurs when a wavelength distance match below the critical value has been obtained but the true identity of the test substance and the database match differ.

The first step was to determine if it was possible to distinguish between all the substances in the database. This was performed within the IQ^ software. For correlation spectral matching the 300+ compounds were included within the database.

When matching in distance mode, the algorithm uses the maximum calculated distance between the data points of interest, i.e. the database substance and the test samples. Samples must be both identified and qualified for the database to be validated as suitable for matching.
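A minimal sketch of a maximum-distance match follows. It assumes that each database substance is stored as a mean spectrum with a per-wavelength standard deviation and that the distance at each wavelength is expressed in standard deviations; the source does not spell out the software's exact calculation, so the function names and the data layout are hypothetical.

```python
def max_wavelength_distance(test, mean, sd):
    """Largest per-wavelength distance, in standard deviations (assumed definition)."""
    return max(abs(t - m) / s for t, m, s in zip(test, mean, sd))

def identified_by_distance(test, mean, sd, critical_distance):
    """Identification is deemed to occur when the distance is below the critical value."""
    return max_wavelength_distance(test, mean, sd) < critical_distance

# Invented three-point "spectra", for illustration only.
mean = [1.0, 2.0, 3.0]
sd = [0.5, 0.25, 0.5]
test = [1.5, 2.0, 5.0]

worst = max_wavelength_distance(test, mean, sd)
```

Because only the largest deviation is used, a single wavelength that is far from the database mean is enough to reject a match, however close the rest of the spectrum is.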

Problems were encountered while attempting to validate the database internally. Both identification by correlation and identification by distance produced numerous repeated errors across all wavelength ranges attempted. The closest to full validation was the correlation matching between 2000 and 2400 nm, which still produced two failures (4-epianhydrotetracycline and 4-epichlortetracycline hydrochloride). These two products are remarkably similar in structure and behaviour. They were not identified as difficult substances in the earlier test (Part 1) because they were not included in the database of substances: they were used as test substances to challenge the few products in the database and so were never compared against each other. Admittedly, this was overlooked, but the point here is that both products fail the internal validation because their spectra are so similar, which in turn follows from their structural similarity, and this is the correct outcome.

However, in the validation stage here, their structural similarity made it difficult to discriminate between the two and caused a failure when attempting to validate the database by either correlation or distance matching. The detection setting for the correlation internal validation was raised in steps from 0.850 to 0.900, 0.950, 0.975 and 1.000. The wavelength ranges attempted were 1100-2500 nm and 2000-2400 nm; however, distinct identifications were still not possible for the two failing products. The internal validation was also attempted, unsuccessfully, for wavelength distances (15.0, 10.0, 7.5, 5.0 and 2.5 standard deviations). Each validation attempt took between 4 and 6 hours to run, which was time consuming and frustrating. It is envisaged that this task would take less time with greater computer processing capability but would remain the most time-consuming part of the practical NIR process. It is also envisaged that the time taken would increase as the database grows, and the validation must be repeated each time a new sample is added to the database.

In the case of the tetracyclines, the only way of completing the full database validation with zero failures was to remove one of the products, 4-epianhydrotetracycline. The successful correlation validation threshold was 0.850. The database could be validated over several wavelength ranges separately: 1100-2500 nm, 2000-2400 nm and (later) 2200-2400 nm. The distance threshold value was found to be 10.00 standard deviations.
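The threshold sweep and the effect of removing one of two near-identical entries can be sketched as follows. This is a toy illustration and not the software's procedure: it assumes a Pearson correlation match, treats a substance as validated only when a replicate scan matches its own reference and no other, and uses invented five-point spectra with hypothetical names (`substance_a` and `substance_b` are deliberately near-identical).

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation between two equally sampled spectra (assumed metric)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def internal_failures(references, replicates, threshold):
    """Names whose replicate is not matched uniquely to its own reference."""
    failures = []
    for name, spectrum in replicates.items():
        hits = [ref_name for ref_name, ref in references.items()
                if correlation(spectrum, ref) >= threshold]
        if hits != [name]:
            failures.append(name)
    return failures

# Invented data, for illustration only.
references = {
    "substance_a": [0.10, 0.40, 0.80, 0.30, 0.20],
    "substance_b": [0.10, 0.41, 0.80, 0.30, 0.20],
    "substance_c": [0.90, 0.20, 0.10, 0.60, 0.70],
}
replicates = {
    "substance_a": [0.10, 0.40, 0.81, 0.30, 0.20],
    "substance_b": [0.10, 0.41, 0.81, 0.30, 0.20],
    "substance_c": [0.90, 0.21, 0.10, 0.60, 0.70],
}

# Sweep the correlation thresholds quoted in the text.
sweep = {t: internal_failures(references, replicates, t)
         for t in (0.850, 0.900, 0.950, 0.975, 1.000)}

# Remove one of the two near-identical entries and validate again.
reduced_refs = {k: v for k, v in references.items() if k != "substance_a"}
reduced_reps = {k: v for k, v in replicates.items() if k != "substance_a"}
after_removal = internal_failures(reduced_refs, reduced_reps, 0.850)
```

In this toy data no threshold separates the two near-identical entries: below 1.000 each replicate matches both references, and at 1.000 nothing matches at all. Removing one of the pair lets the remaining entries validate at 0.850. Every replicate is compared with every reference, so the number of comparisons grows roughly with the square of the database size.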

However, this demonstrated that neither identification method can be classified as a ‘one method fits all’ identification technique. Failing the validation at this stage highlights the fact that there are, or will be, ‘problem cases’, and exceptions can be built around these.

Thirty-four product test samples, each comprising different batches of drug substance, were used to externally validate the spectral database by correlation spectral matching and wavelength distance matching. The samples were run blind (i.e. the analyst did not know the identity) and included samples for which there were no spectra in the database (see Table 64 in the Appendix for the full list of samples used). The test samples compared against the database were obtained from different sources to those used to construct the database. The variations in particle size, particle shape, potency and age introduced by the use of external test samples provided a more stringent and ‘real life’ challenge to the database than alternative batches of the same samples used to construct the database.
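One way to score such a blind external challenge is sketched below for the correlation case. The outcome labels "missed" and "correctly rejected" are assumptions added for illustration (the text defines only identification and misidentification), and the sample data are invented; `substance_x` stands for a sample with no spectrum in the database.

```python
from collections import Counter

def score_sample(true_identity, best_match, match_value, critical_r, database):
    """Classify one blind sample after its identity has been revealed."""
    if match_value >= critical_r:
        return "identified" if best_match == true_identity else "misidentified"
    # Below the critical value: a correct outcome only if the substance is absent.
    return "missed" if true_identity in database else "correctly rejected"

def tally(samples, critical_r, database):
    """Count outcomes over (true_identity, best_match, match_value) tuples."""
    return Counter(score_sample(t, m, v, critical_r, database) for t, m, v in samples)

# Invented data, for illustration only.
database = {"substance_a", "substance_b"}
samples = [
    ("substance_a", "substance_a", 0.97),
    ("substance_b", "substance_a", 0.91),
    ("substance_x", "substance_b", 0.40),
    ("substance_b", "substance_b", 0.70),
    ("substance_x", "substance_a", 0.88),
]
results = tally(samples, 0.850, database)
```

Including samples that are absent from the database matters here: they are the only ones that can show whether the critical value rejects unknown material rather than forcing it onto the nearest entry.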

In document Desarrollo de una herramienta de asistencia para el análisis de pruebas psicométricas de una población grande utilizando técnicas de Big Data (pages 30-35)