2.1 Marker content of YAC clones
2.1.1 The Reference Library DataBase
2.1.1.1 Origin of the data
The Reference Library System was used in two stages. First, SQL (Structured Query Language) queries for all clones that had been identified by probes mapped to the X chromosome retrieved 1,723 records. Most of these clones were identified in primary screens of library filters, and in 90% of the cases the results had not been confirmed by a secondary hybridisation to the clones themselves. These results were difficult to use directly. Since most probes were named using an internal laboratory nomenclature, it was not possible to relate the results to each other. The stated localisation of the probes on the chromosome was often relative to a cytogenetic band and not to neighbouring loci, which is not sufficient for placing clones on a map. It was therefore necessary to go further upstream and contact the scientists who had submitted their results to RLDB directly. A list was drawn up of 21 laboratories which had made use of RLS resources for building YAC contigs on the X chromosome. In the documentation accompanying the original RLS material sent out (filters and clones), investigators were informed of their responsibility to return all mapping data derived from these samples. Each was therefore individually contacted by letter, fax, telephone call or personal visit. In order to facilitate the process of returning substantial information, extensive submission forms were sent to each group. In most cases, however, it was easier simply to collect laboratory notes and maps. Furthermore, attendance at three successive international workshops on X chromosome mapping (St. Louis, USA, 1993; Heidelberg, Germany, 1994; Banff, Canada, 1995) allowed a large number of results to be collected and updated in a concise format. On these occasions, it became clear that many investigators had screened copies of the ICRF YAC library that had been distributed by our laboratory to other groups.
These results had bypassed the RLS system and therefore were not listed in the RLDB records studied in the first phase. Results from 28 groups were finally collected, which together represented 42 contigs. All were established in the context of positional cloning projects. A number of these projects overlapped and therefore some maps were constructed over identical regions.
snapshot of the current results on a particular mapping project. Information was generally condensed on one map representing the clones graphically in the relevant genomic region, with their attached genetic and physical markers. An accompanying table summarised the experimental results, by plotting the list of probes against the YAC clones and indicating a positive or a negative match. The main disadvantage however of actively collecting results is that data is presented in a very heterogeneous format, in contrast to results provided via submission forms. In order to make sense out of the information, to compare results and ultimately use them for our mapping project, it was therefore necessary to translate the heterogeneous set of data into one format. ACEDB was chosen at an early stage of the project to be a repository of this information. For undertakings of this scale and nature, ACEDB is a well-suited database system that provides extensive graphical features, is very easy to customise to a given set of genomic data, and has a simple way of importing and exporting information. The database that progressively emerged from compiling diverse X chromosome data in ACEDB (and later in ORACLE) was named the Integrated X chromosome Database (IXDB).
Maps provided by collaborators in the RLS were hand or computer drawn on paper, and it rapidly became obvious that manually converting them to ace format (the ASCII text format required to enter data in ACEDB) would be a tedious and error-prone process. A program called xcontigview was therefore written by Huw Griffith in our laboratory to partially automate this operation. Maps were first converted to TIFF images (Tagged Image File Format) using either a CCD camera normally used for ethidium bromide stained gels, or an X-ray film scanner (Amersham). Images were loaded into the program and displayed on the computer screen. After setting some parameters such as the origin, scale, source of data etc., the content of the map (clones and probes) was digitised using the computer mouse. This was done simply by clicking on the beginning and the end of each clone, probe or gene and typing in the name of each object. The program then automatically converted this information into ace format. In a few minutes a complicated map could therefore be parsed into IXDB and represented in the physical map display. Experimental results accompanying the maps, often represented as tables, were typed directly into the database.
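The digitising step described above amounts to a linear conversion from pixel positions to map coordinates, followed by writing one ace paragraph per object. The following Python sketch illustrates the idea only; it is not the xcontigview source. The names Calibration, digitise and to_ace, the tags Map, Left and Right, and the example clone name are all assumptions made for illustration, and a real ACEDB model may use different tags.

```python
from dataclasses import dataclass


@dataclass
class Calibration:
    """Parameters set before digitising (hypothetical names)."""
    origin_px: float    # horizontal pixel position of the map origin
    px_per_unit: float  # pixels per map unit (for example, per kb)


def to_map_units(x_px: float, cal: Calibration) -> float:
    """Convert a horizontal pixel position to map units."""
    return (x_px - cal.origin_px) / cal.px_per_unit


def digitise(name: str, cls: str, click1_px: float, click2_px: float,
             cal: Calibration) -> dict:
    """Turn two mouse clicks (in either order) into a map object."""
    start, end = sorted((to_map_units(click1_px, cal),
                         to_map_units(click2_px, cal)))
    return {"class": cls, "name": name,
            "start": round(start, 1), "end": round(end, 1)}


def to_ace(obj: dict, map_name: str) -> str:
    """Write one ace-style paragraph; the tag names are assumptions."""
    return (f'{obj["class"]} : "{obj["name"]}"\n'
            f'Map "{map_name}" Left {obj["start"]} Right {obj["end"]}\n')


cal = Calibration(origin_px=50, px_per_unit=2.0)
clone = digitise("yExample1", "YAC", 850, 250, cal)  # clicks given end-first
print(to_ace(clone, "Xq28"))
```

Sorting the two converted positions makes the result independent of which end of the clone was clicked first.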
2.2 Public domain data
Our laboratory is part of the Integrated Genome Database consortium, which in the first phase of the project attempted to compile in a single format (ACEDB) data from a variety of independent databases such as GDB, OMIM, RLDB, EMBL, etc. It was clear that once IGD released its first set of data, the X chromosome section could be directly imported into our ACEDB database. This would ensure that an exhaustive sampling of public domain information related to the X chromosome would be integrated with our experimental dataset. In the early stages of the X chromosome
project however, the IGD project was still in its early development and it was therefore necessary to start translating publicly available data in-house. The main focus was put on the Genome Database, since it provides a way of identifying the official D-number of numerous probes used by collaborators from the RLS. The Human Genome Mapping Project Resource Centre (HGMP-RC) situated at the time in Harrow (UK) was a mirror site of the GDB and provided a convenient way of accessing data from the ICRF. A simple user interface was available for registered users via a telnet session. The GDB was queried for all loci and genes localised on the X chromosome, and approximately 2500 entries were retrieved the first time. Three updates were performed in the course of the following 18 months and the number of entries reached approximately 3000 for the last update. Complete entries were downloaded to a local Unix station (Sun Sparc 2) via electronic mail, in batches of 100 kilobytes. The files were then concatenated and processed by an awk program to translate the information into ace files. These were then directly parsed into IXDB.
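The concatenate-and-translate step can be sketched as follows. The original work used an awk program that is not reproduced here; this is a Python stand-in written under assumptions. The input layout (blank-line separated records of "key: value" lines), the field names symbol, probe and location, the ace tags Probe and Cytogenetic_location, and the example identifiers are all hypothetical and do not reflect the actual GDB export format or the IXDB models.

```python
def parse_records(text: str) -> list[dict]:
    """Split text into records of 'key: value' lines separated by blank lines."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:               # a blank line closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:                    # ignore lines without a 'key: value' shape
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def record_to_ace(rec: dict) -> str:
    """Write one ace-style paragraph for a locus; the tag names are assumptions."""
    lines = [f'Locus : "{rec["symbol"]}"']
    if "probe" in rec:
        lines.append(f'Probe "{rec["probe"]}"')
    if "location" in rec:
        lines.append(f'Cytogenetic_location "{rec["location"]}"')
    return "\n".join(lines) + "\n"


def translate(batches: list[str]) -> str:
    """Concatenate the downloaded batches and translate every record."""
    records = parse_records("\n\n".join(batches))
    # ace paragraphs are separated by a blank line
    return "\n".join(record_to_ace(r) for r in records if "symbol" in r)


batches = [
    "symbol: DXS0001\nprobe: pExampleA\nlocation: Xp21\n",
    "symbol: DXS0002\nlocation: Xq28\n",
]
print(translate(batches))
```

Keying each paragraph on the locus symbol reflects the purpose stated above: the official D-number becomes the object name, and the laboratory probe name is attached to it.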
Approximately 24 months after the start of the X chromosome project, IGD released its first set of integrated data. Data was available via ftp from the Deutsches Krebsforschungszentrum (DKFZ) in Heidelberg and was sorted either according to the database of origin (GDB, EMBL, RLDB, etc.) or according to the relevant human chromosome. The X chromosome dataset was downloaded and regular updates were made every few weeks. The emerging YAC map under construction in our laboratory was already annotated with approximately 600 genetic and physical markers. Incorporating the IGD X chromosome section allowed each locus and gene to be fully described with up-to-date information. The first phase of the IGD project has now ended and consequently the release of data has been terminated as well.
A consensus marker map is established for each chromosome once a year based on information collected at single chromosome workshops, and is a reference for all investigators in the field. The YAC map constructed in our laboratory is also strongly correlated with the consensus maps. Xcontigview was used to parse into IXDB the consensus maps from the 1994 and 1995 workshops (Heidelberg and Banff respectively (Nelson et al., 1995; Willard et al., 1994)), providing a convenient way to compare the order of markers with other maps as well as a position for approximately 500 loci described by the IGD data. By combining experimental data from our group and 28 collaborators from the RLS with the exhaustive dataset from IGD, IXDB gradually became an electronic catalogue of a large fraction of the data generated so far on the X chromosome physical map by the community.
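Comparing the order of markers between two maps can be illustrated with a small Python sketch. It is not a description of how the comparison was carried out in IXDB, where maps were inspected in the graphical display. The function name discordant_pairs and the marker names are hypothetical, and the sketch assumes that both maps are listed in the same orientation (for example, from Xpter to Xqter).

```python
from itertools import combinations


def discordant_pairs(map_a: list[str], map_b: list[str]) -> list[tuple[str, str]]:
    """Return pairs of shared markers whose relative order differs between maps."""
    position_b = {marker: i for i, marker in enumerate(map_b)}
    # keep only markers present on both maps, in the order of map_a
    shared = [m for m in map_a if m in position_b]
    return [(x, y) for x, y in combinations(shared, 2)
            if position_b[x] > position_b[y]]


consensus = ["M1", "M2", "M3", "M4"]   # hypothetical consensus order
yac_map = ["M1", "M3", "M2", "M5"]     # hypothetical order on a YAC map
print(discordant_pairs(consensus, yac_map))
```

Markers found on only one of the two maps (M4 and M5 here) are ignored, so the check applies only to loci that both maps position.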