3.4.1 Java library
OPSIN’s main mode of distribution is as a Java library typically including both the core and InChI modules. The API has been designed to offer convenience methods for the most commonly required capabilities in conjunction with more advanced configurability. The methods in the public API of NameToStructure are listed below:
Method                                                           Output
parseToCML(String name)                                          nu.xom.Element
parseToSmiles(String name)                                       String
parseChemicalName(String name)                                   OpsinResult
parseChemicalName(String name, NameToStructureConfig n2sConfig)  OpsinResult
The parseToCML and parseToSmiles methods are convenience methods that convert a chemical name directly to a CML document or a SMILES string respectively, using the program’s default options. The CML document is returned as a XOM Element object, allowing in-memory manipulation or trivial serialisation to XML.
Alternatively, the output may be an OpsinResult. This records whether name interpretation was successful, the error message that was returned (if applicable) and the name that was interpreted. An OpsinResult may be lazily serialised to either CML or SMILES using the class’s methods.
If greater configurability is desired, a NameToStructureConfig object can be provided that allows configuration of OPSIN’s options (Table 3-16).
Option                   Explanation                                          Default value
allowRadicals            Should names that formally describe radicals be      false
                         accepted, e.g. ethyl
detailedFailureAnalysis  If a chemical name is uninterpretable, should OPSIN  false
                         parse it from right to left to attempt to generate
                         a more informative error message
Table 3-16 OPSIN’s configurable options
The ParseRules object returned by getOpsinParser allows the parsing of words using OPSIN’s grammar. This functionality is employed extensively by the OPSIN Document Extractor (Section 3.4.4) but is not known to be employed elsewhere. Note that generally only a single word may be parsed at a time e.g. ‘ethyl ethanoate’ will not be fully parsable but ‘ethyl’ or ‘ethanoate’ are parsable.
To debug OPSIN’s behaviour, an end user may set the Log4J log level to either debug or trace, depending on the level of detail required.
Library functions for InChI generation reside in the NameToInchi class in the InChI module. Functions are available for the generation of an InChI with fixed-H layer or a StdInChI from an OpsinResult. Convenience methods are also available to go directly from a name to either form of InChI.
The library is available either from the project’s download page on BitBucket [132] or from the Maven central repository.
3.4.2 Command-line interface
When OPSIN is distributed as an executable jar file, running it yields a command-line interface. Flags are available to set all of OPSIN’s configurable options, the desired output format and the verbosity (Figure 3-148). Verbose output corresponds to a Log4J log level of debug. The same command-line interface is employed regardless of whether the InChI module is included; to avoid the command-line interface depending on the InChI module, reflection is used to check for the presence of the InChI functionality on the classpath. The command-line interface may be used to perform batch processing by piping in a file of chemical names and directing the output to an appropriate output file.
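The reflection check mentioned above can be illustrated with a minimal sketch. It only shows the general technique of probing the classpath for an optional class without a compile-time dependency; it is not OPSIN’s actual code. The class name OptionalModuleCheck is hypothetical, and the fully qualified name used for NameToInchi is an assumption.

```java
/** Sketch: detect an optional module by looking for one of its classes on the classpath. */
final class OptionalModuleCheck {

    // Assumed fully qualified name of the InChI module's entry class; adjust as needed.
    static final String INCHI_CLASS = "uk.ac.cam.ch.wwmm.opsin.NameToInchi";

    /** Returns true if the named class can be located, without compiling against it. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initialisers
            Class.forName(className, false, OptionalModuleCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class (or one of its dependencies) is missing: treat the module as absent
            return false;
        }
    }

    public static void main(String[] args) {
        boolean inchiAvailable = isOnClasspath(INCHI_CLASS);
        System.out.println(inchiAvailable
                ? "InChI output formats enabled"
                : "InChI module not found; InChI output formats disabled");
    }
}
```

Because the class is referred to only by name, the calling code compiles and runs whether or not the optional jar is present.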
Figure 3-148 Screenshot of OPSIN command line help dialog showing available flags
3.4.3 OPSIN web service
The OPSIN web service [133] provides access to OPSIN’s functionality to convert names to CML, SMILES and InChI via a convenient web interface. Additionally, the web interface can generate depictions using the Indigo toolkit [55]. The Indigo toolkit is also used to enrich the CML with generated 2D coordinates.
Requests to the web interface may be made either from a browser, by entering a chemical name at opsin.ch.cam.ac.uk, or programmatically, by sending requests to opsin.ch.cam.ac.uk/opsin. The output format may be selected using content negotiation or by adding a suitable file extension to the request (Table 3-17).
Request type                Internet media type         File extension
CML                         chemical/x-cml              .cml
CML without 2D coordinates  n/a*                        .no2d.cml
SMILES                      chemical/x-daylight-smiles  .smi
InChI                       chemical/x-inchi            .inchi
Depiction                   image/png                   .png
Table 3-17 Request types supported by the OPSIN web service. *chemical/x-no2d-cml is accepted but is not a recognised internet MIME type
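The two ways of selecting an output format can be sketched as follows. The host, base path, media types and file extensions are taken from the text and Table 3-17; the "http" scheme, the layout of the request path (base path, then name, then extension) and the class name OpsinRequestSketch are assumptions made for illustration. The sketch only builds the request and does not contact the service.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of the two ways of selecting an output format listed in Table 3-17. */
final class OpsinRequestSketch {

    // Host and base path are taken from the text. The "http" scheme and the
    // "<base path><name><extension>" layout are assumptions of this sketch.
    static final String HOST = "opsin.ch.cam.ac.uk";
    static final String BASE_PATH = "/opsin/";

    /** File extension -> internet media type, copied from Table 3-17. */
    static final Map<String, String> MEDIA_TYPES = new LinkedHashMap<>();
    static {
        MEDIA_TYPES.put(".cml", "chemical/x-cml");
        MEDIA_TYPES.put(".no2d.cml", "chemical/x-no2d-cml"); // accepted, but not a recognised type
        MEDIA_TYPES.put(".smi", "chemical/x-daylight-smiles");
        MEDIA_TYPES.put(".inchi", "chemical/x-inchi");
        MEDIA_TYPES.put(".png", "image/png");
    }

    /** Option 1: select the format by appending a file extension to the request. */
    static URI withExtension(String chemicalName, String extension) {
        requireKnown(extension);
        return toUri(BASE_PATH + chemicalName + extension);
    }

    /** Option 2: request the bare name; the format is chosen by the Accept header. */
    static URI withoutExtension(String chemicalName) {
        return toUri(BASE_PATH + chemicalName);
    }

    /** Value to send in the Accept header for content negotiation. */
    static String acceptHeaderFor(String extension) {
        requireKnown(extension);
        return MEDIA_TYPES.get(extension);
    }

    private static void requireKnown(String extension) {
        if (!MEDIA_TYPES.containsKey(extension)) {
            throw new IllegalArgumentException("Unsupported extension: " + extension);
        }
    }

    private static URI toUri(String path) {
        try {
            // The multi-argument constructor percent-encodes characters that are
            // illegal in a path, such as the space in 'ethyl ethanoate'.
            return new URI("http", HOST, path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withExtension("ethyl ethanoate", ".smi"));
        System.out.println(withoutExtension("ethyl ethanoate")
                + "  with Accept: " + acceptHeaderFor(".inchi"));
    }
}
```

Either URI could then be passed to any HTTP client; for the second option the client would additionally set the Accept header to the returned media type.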
The web service is employed by the Chemistry Add-in for Word [134], a joint development between the Unilever Centre and Microsoft, as a means of converting chemical names to chemical objects.
The web service’s logs were analysed over a one-week period in early December 2011, showing requests from 171 unique IP addresses. Usage patterns varied from single names through to automated requests for thousands of names. Analysis of failing web service requests revealed that the vast majority of failures were caused by unrecognised trivial names (e.g. drug names), spelling mistakes, non-English chemical names and non-names (e.g. SMILES strings, molecular formulae etc.). The few genuine failures have proven of some use in finding bugs and areas of unsupported nomenclature.
When a failure is encountered, the web service employs OPSIN’s reverse parsing to attempt to identify, in the error response, the exact part of a name that is uninterpretable. Users of the service have reported this to be useful in identifying and correcting errors in chemical names [135].
3.4.4 OPSIN Document Extractor
The OPSIN Document Extractor [136] attempts to find all sequences of words that are parsable by OPSIN. This is assumed to indicate that, with a high degree of confidence, the identified strings are chemical names. The program works as follows on a string of text:
Whitespace tokenisation to form an array of words. The character indices of these words in the original string are recorded.
OPSIN’s pre-processor is employed to generate an array of normalised words which will be operated on henceforth.
Identification of stop words e.g. ‘on’, ‘one’, ‘at’. These are English words that can also be the ending of chemical names (often German chemical names) and should be prevented from forming chemical names.
The words are parsed by OPSIN in pairs. Depending on whether or not OPSIN believes a word to be interpretable on its own, the program may add one or both words to a buffer of successfully parsed name fragments e.g. ‘ethyl benzene’ would be consumed in two cycles but ‘benzoic acid’ or ‘chloral hydrate’ would be consumed as one.
If a pair of words is partially interpretable and the point of failure does not occur at a word boundary, spaces are removed until either the length of the interpretable portion of the name no longer increases or the chemical name ends at a word boundary.
As OPSIN knows the role of chemical words and whether they are valid on their own, intelligent choices can be made as to whether space removal should be attempted. For example ‘benzene sulfonamide’ should be ‘benzenesulfonamide’ but ‘pyridine acetic acid’ should be interpreted as is, rather than treating the acetic acid as a conjunctive substituent of the pyridine ring.
Punctuation at the end of a chemical name, or a bracketed section immediately following a chemical name is ignored and indicates the chemical name is complete. A chemical name is also indicated as being complete if a subsequent word cannot be interpreted as being chemical or the end of the array of words is reached.
Identified chemical names are classified as “complete”, “part”, “family” or “polymer”. “part” names are names classified by OPSIN as substituents. “family” names are classed by OPSIN as functional terms or are names that end in an ‘s’ which could not be interpreted by OPSIN. “polymer” names start with the functional term ‘poly’ or ‘oligo’.
An unbalanced opening bracket at the start of a chemical name, or an unbalanced closing bracket at the end of a chemical name, is removed. Balanced brackets surrounding a chemical name are removed. A terminal ‘-’ or ‘,’ is removed e.g. ‘ethyl-’ is recognised as ‘ethyl’.
The output is a list of identified chemical names which can be queried for the normalised chemical name, the raw text, the chemical name classification, the start and end character indices within the original string and the start and end positions within the array of words.
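Two of the steps above that do not depend on OPSIN’s grammar, namely whitespace tokenisation with recorded character indices and the final clean-up of brackets and terminal punctuation, can be sketched in isolation. This is a minimal illustration under stated assumptions rather than the extractor’s actual implementation: the names ExtractorSketch, Token, tokenise and tidy are hypothetical, bracket types are not matched against each other, and applying the clean-up rules repeatedly until none fires is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of whitespace tokenisation with offsets and of the bracket/punctuation clean-up. */
final class ExtractorSketch {

    /** A whitespace-delimited word plus its character span in the original string. */
    static final class Token {
        final String text;
        final int start; // inclusive
        final int end;   // exclusive
        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    private static final Pattern WORD = Pattern.compile("\\S+");

    /** Splits on whitespace, recording where each word sits in the input. */
    static List<Token> tokenise(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            tokens.add(new Token(m.group(), m.start(), m.end()));
        }
        return tokens;
    }

    static boolean isOpen(char c) { return c == '(' || c == '[' || c == '{'; }
    static boolean isClose(char c) { return c == ')' || c == ']' || c == '}'; }

    /** Number of opening brackets minus number of closing brackets. */
    static int bracketBalance(String s) {
        int balance = 0;
        for (char c : s.toCharArray()) {
            if (isOpen(c)) balance++;
            else if (isClose(c)) balance--;
        }
        return balance;
    }

    /** True if the bracket opened at index 0 is closed only by the final character. */
    static boolean enclosedByOnePair(String s) {
        if (s.length() < 2 || !isOpen(s.charAt(0)) || !isClose(s.charAt(s.length() - 1))) {
            return false;
        }
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isOpen(c)) depth++;
            else if (isClose(c)) depth--;
            if (depth == 0 && i < s.length() - 1) return false; // closed too early
        }
        return depth == 0;
    }

    /** Applies the clean-up rules repeatedly until none of them fires (an assumption). */
    static String tidy(String name) {
        String s = name;
        while (!s.isEmpty()) {
            char first = s.charAt(0);
            char last = s.charAt(s.length() - 1);
            int balance = bracketBalance(s);
            if (last == '-' || last == ',') {
                s = s.substring(0, s.length() - 1);      // terminal '-' or ','
            } else if (isOpen(first) && balance > 0) {
                s = s.substring(1);                      // unbalanced opening bracket at start
            } else if (isClose(last) && balance < 0) {
                s = s.substring(0, s.length() - 1);      // unbalanced closing bracket at end
            } else if (enclosedByOnePair(s)) {
                s = s.substring(1, s.length() - 1);      // balanced surrounding brackets
            } else {
                break;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        for (Token t : tokenise("with (ethyl benzoate), then")) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ")");
        }
        System.out.println(tidy("ethyl-"));
        System.out.println(tidy("(2-chloroethyl)benzene"));
    }
}
```

Keeping the character span on each token is what allows the start and end indices of an identified name to be reported against the original string, even after the words themselves have been normalised or joined.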
As the program knows whether punctuation is valid as part of a chemical name, individual chemical names may still be extracted from lists of chemical names even in the presence of erroneous whitespace (Table 3-18).
Input: ‘indane, 1,2, 3,4- tetrahydroquinoline, 3, 4-dihydro-2H-1, 4-benzoxazine, 1,5-naphthyridine, 1, 8- naphthyridine’

Identified chemical name        Text value
indane                          indane
3,4-tetrahydroquinoline         3,4- tetrahydroquinoline
3,4-dihydro-2H-1,4-benzoxazine  3, 4-dihydro-2H-1, 4-benzoxazine
1,5-naphthyridine               1,5-naphthyridine
8-naphthyridine                 8- naphthyridine
Table 3-18 Output from OPSIN Document Extractor on a list of chemical names containing erroneous whitespace
The OPSIN Document Extractor is utilised as a tagger for use with ChemicalTagger (as described in Section 4.4.5.6) and as an aid in name type assignment (as described in Section 4.5.1.4). It should be emphasised that, whilst the approach taken by the OPSIN Document Extractor is rather brute-force in nature, it is still typically an order of magnitude faster than performing entity recognition with OSCAR4. Hence, using the OPSIN Document Extractor as a complement to OSCAR4, as is done in the work on reaction extraction described in Chapter 4 of this thesis, may be done with minimal effect on performance.