3. Biography of Deng Xiz
3.1. Historical Data on the Life of Deng Xiz
A two-dimensional structural representation of a chemical compound and its semantic nomenclature had been established before the end of the nineteenth century [32]. In the first half of the twentieth century, fragment-coding systems were developed to identify sets of sub-structural fragments present in a molecule. The development of computer systems and computational chemistry required more sophisticated, machine-readable representations of chemical compounds.
There are many ways to represent a chemical compound, including names, formulas, line-symbol notations, molecular representations, physical and chemical properties, and fingerprints. Names and indexes such as the CAS number are used to identify a query compound and enable fast retrieval of information (for example, chemical properties) from large databases. The CAS number is assigned by the Chemical Abstracts Service [14] to all publicly available chemicals. It does not relate chemical properties to structures: its numerical value is assigned in sequential, increasing order when a substance is added to the CAS REGISTRY database. It is a unique numerical identifier with the format XXXXXXX-XX-X. The first group may contain up to seven digits, the second group contains exactly two digits, and the last group is a single check digit (checksum). The check digit allows a quick test of whether a query compound identifier is correct.
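The checksum test described above can be sketched in a few lines of Python. The weighting rule used here (digits counted from the right, multiplied by 1, 2, 3, ... and summed modulo 10) is the commonly documented CAS scheme and is not stated in the text; the function names and the lower bound of two digits for the first group are assumptions made for illustration.

```python
import re


def cas_check_digit(body_digits: str) -> int:
    """Return the check digit for the digits of the first two groups.

    Assumed rule: the rightmost digit has weight 1, the next weight 2,
    and so on; the weighted sum is taken modulo 10.
    """
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_valid_cas(cas: str) -> bool:
    """Check the XXXXXXX-XX-X layout and the trailing check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    first, second, check = match.groups()
    return cas_check_digit(first + second) == int(check)


if __name__ == "__main__":
    print(is_valid_cas("7732-18-5"))  # water: True
    print(is_valid_cas("7732-18-4"))  # wrong check digit: False
```

A mistyped digit normally changes the weighted sum, so the identifier is rejected before any database lookup is attempted.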
The second group of chemical representations uses the molecular structure. Currently, 1-D, 2-D and 3-D molecular representations are known, and there is still strong interest in deriving new molecular representations [41].
• 1-D representation is a linear string notation of a chemical compound formula (see Figure 2.2). The most popular formats are:
– SMILES language (Simplified Molecular Input Line Entry Specification),
– WLN (Wiswesser Line Notation),
– InChI (IUPAC International Chemical Identifier),
– ROSDAL (Representation Of Structure Diagram Arranged Linearly).
Over the last few years SMILES and InChI have become the most widely used line notations. SMILES are string notations encoding the molecular structure. They are obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph [32]. SMILES are often not unique: a chemical compound can have several SMILES notations, because the traversal can start from different atoms. InChI identifiers describe chemical substances using information layers including atoms and their bond connectivity, tautomeric information, iso- [...]
Figure 2.2: Names and line notations for a tyrosine structure diagram [32].
In contrast to the widely used CAS registry numbers, SMILES and InChI are computed from the structural information and are readable by experts. They are also well suited for chemical compound searching and retrieval from large chemical databases.
• 2-D representation includes connection tables (see Figure 2.3). It is a graph representation G = (V, E) where the atoms of the molecule define the set of graph nodes V and the bonds define the set of edges E. The connection table consists of three parts. The first part, called the header block, contains the molecule name, the file origin, and the counts of atoms and bonds. The second part, called the atoms block, has one line per atom and specifies 2D coordinates, atom symbol, isotope, charge and stereo code. The last part, called the bonds block, has one line per bond (each bond shown once) and specifies the row numbers of the bonded atoms and codes for bond type and bond stereochemistry. The molecular graph representation is used for queries in similarity searching and especially in sub-structure searching.
• 3-D representation contains the graph representation extended by 3D coordinates, molecular surface or conformation information. This representation is [...]
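The link between the graph representation G = (V, E) and a linear notation can be illustrated with a small sketch. It stores a molecule as atoms and bonds and prints atom symbols in a depth-first traversal, which shows why different starting atoms give different strings for the same compound. This is a deliberately simplified illustration, not a SMILES implementation: it assumes an acyclic, hydrogen-suppressed graph with single bonds only, and ethanol is chosen here as a hypothetical example (it does not appear in the text).

```python
from collections import defaultdict

# Hydrogen-suppressed graph of ethanol (illustrative choice):
# the atoms form the node set V, the bonds form the edge set E.
atoms = {1: "C", 2: "C", 3: "O"}
bonds = [(1, 2), (2, 3)]  # single bonds only, each bond listed once


def adjacency(bond_list):
    """Build a neighbour list from the bond (edge) list."""
    neighbours = defaultdict(list)
    for i, j in bond_list:
        neighbours[i].append(j)
        neighbours[j].append(i)
    return neighbours


def linear_notation(atom_symbols, bond_list, start):
    """Print atom symbols in depth-first order from the chosen start atom.

    Branches are wrapped in parentheses. Rings, bond orders, charges and
    stereochemistry are not handled in this sketch.
    """
    neighbours = adjacency(bond_list)

    def visit(node, parent):
        children = [n for n in neighbours[node] if n != parent]
        text = atom_symbols[node]
        # Every child except the last one is written as a branch.
        for child in children[:-1]:
            text += "(" + visit(child, node) + ")"
        if children:
            text += visit(children[-1], node)
        return text

    return visit(start, None)


if __name__ == "__main__":
    for start_atom in atoms:
        print(start_atom, linear_notation(atoms, bonds, start_atom))
```

Starting from atoms 1, 2 and 3 yields `CCO`, `C(C)O` and `OCC` respectively: three different strings for one molecule, which is the non-uniqueness that canonicalisation schemes are designed to remove.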
Figure 2.3: A fragment of tyrosine - connection table representation [88].
Another representation format of a chemical compound is a fragment-based code (index) of its molecular structure. The presence or absence of a certain structural fragment is encoded in a binary vector called a fingerprint [32]. This representation is widely used in substructure searching. Many similarity metrics, such as the Hamming distance, the Dice coefficient and the Euclidean distance, can be used to compare two binary vectors. Various measures have been studied by the Sheffield research group in the context of chemical similarity, and the results are presented in [121]. The most common similarity measure between two molecules A and B is the Tanimoto coefficient, defined as follows:
\[
T_{AB} = \frac{c}{a + b - c} \tag{2.1}
\]
where a and b are the numbers of bits set in the fingerprints of molecules A and B, respectively, and c is the number of bits set in both. Compared with atom-by-atom searching (on a molecular representation), the advantage of fingerprints is a faster search over large databases. Unfortunately, the fragment code is not unique: several structures can have the same fingerprint representation. This is why the circular fingerprint has become very popular. It can be used to generate patterns of various diameters for a molecule, where the diameter represents the size of the fragment being encoded. By increasing the diameter, one can enrich the information about the molecules. However, this also makes it harder to balance the fingerprint size against the number of bit clashes. Nevertheless, a fingerprint is a very useful tool for filtering a large dataset to find frequently repeated structural patterns.
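As a minimal sketch of Equation (2.1), the Tanimoto coefficient can be computed directly from the positions of the bits that are set in each fingerprint. Storing a fingerprint as a Python set of bit positions, the function name, the example bit positions and the value returned for two empty fingerprints are all assumptions made for this illustration.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient c / (a + b - c) for two binary fingerprints.

    Each fingerprint is given as the set of bit positions that are set.
    """
    a = len(fp_a)          # bits set in molecule A
    b = len(fp_b)          # bits set in molecule B
    c = len(fp_a & fp_b)   # bits set in both molecules
    if a + b - c == 0:
        # Both fingerprints are empty; returning 0.0 is a convention
        # chosen here, not something defined in the text.
        return 0.0
    return c / (a + b - c)


if __name__ == "__main__":
    molecule_a = {0, 3, 5, 8}   # hypothetical bit positions
    molecule_b = {3, 5, 9}
    print(tanimoto(molecule_a, molecule_b))  # a=4, b=3, c=2
```

With a = 4, b = 3 and c = 2 the coefficient is 2 / (4 + 3 - 2) = 0.4; identical fingerprints give 1 and fingerprints with no common bits give 0.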
The last group of chemical representations is called descriptors. These are various physical and chemical properties of a chemical compound calculated from its molecular representation. There are four types of descriptors: topological, geometrical, electronic and hybrid. Topological descriptors are derived from connection tables and include information about the number of atoms, bonds and substructures. They also include topological indices, such as connectivity or kappa indices. Geometrical descriptors are calculated from the 3D molecular representation. They include information such as principal moments of inertia, molecular volume or cross-sectional areas. Electronic descriptors include LUMO and HOMO energies, bond orders or partial atomic charges. Combinations of the above descriptor types are called hybrid descriptors, and they are mostly used in the modelling of quantitative structure-activity relationships. The most comprehensive collection of molecular descriptors, with a detailed review, is presented by Todeschini et al. [112]. All descriptors are listed with their definitions, symbols and labels, formulas, some numerical examples, data and molecular graphs, while numerous figures and tables aid comprehension of the definitions.
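To make the idea of a topological descriptor concrete, the sketch below derives three values from a bond list such as the bonds block of a connection table: the atom count, the bond count and a connectivity index. The Randić index (the sum over all bonds of 1/sqrt(d_i d_j), where d is the number of neighbours of each bonded atom) is used as one example of a connectivity index; the text does not say which index is meant, and the function name and the n-butane example are assumptions.

```python
from math import sqrt


def topological_descriptors(num_atoms, bonds):
    """Compute simple topological descriptors from a bond list.

    Atoms are numbered from 1, as in the rows of a connection table,
    and the graph is assumed to be hydrogen-suppressed.
    """
    degree = [0] * (num_atoms + 1)  # index 0 is unused
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    # Randic connectivity index: one term per bond.
    connectivity = sum(1.0 / sqrt(degree[i] * degree[j]) for i, j in bonds)
    return {
        "atoms": num_atoms,
        "bonds": len(bonds),
        "connectivity_index": connectivity,
    }


# Carbon skeleton of n-butane: C1-C2-C3-C4 (illustrative example)
butane_bonds = [(1, 2), (2, 3), (3, 4)]
descriptors = topological_descriptors(4, butane_bonds)

if __name__ == "__main__":
    print(descriptors)
```

For n-butane the two terminal bonds each contribute 1/sqrt(2) and the central bond contributes 1/2, giving a connectivity index of about 1.914. Geometrical and electronic descriptors need 3D coordinates or quantum-chemical calculations and are outside the scope of this sketch.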