Computerized Identification of Bacteria

The identification of bacteria is still largely based on the determination of phenetic characters. These include morphological features, growth requirements, and physiological and biochemical activities. Other phenotypic methods include analysis of cell wall composition, cellular fatty acids, isoprenoid quinones, whole-cell proteins, and polyamines, as well as pyrolysis mass spectrometry, Fourier transform infrared spectroscopy, and UV resonance Raman spectroscopy. These have not been incorporated into routine computer-based identification systems, although computers are used during the analytic process. Recently, microbiologists have shown more interest in the use of genotypic methods to establish the taxonomic relationships between bacteria, including DNA base ratio (moles percent G+C), DNA–DNA hybridisation, rRNA homology studies, and DNA-based typing methods. Computers are employed in the collection and analysis of these data.

Polyphasic identification is the integration of these various techniques for the identification of unknown bacteria. Few, if any, computer-based polyphasic identification systems have been developed. Most computer-based identification systems use only a subset of the taxonomic information available to the bacteriologist for a particular group of bacteria. We will concentrate on identification systems where the bacteriologist can enter the results of tests carried out on an unknown, obtain a suggested identification and, where identification has not been achieved, obtain a list of additional tests that should be carried out to enable identification.
I. PRINCIPLES OF BACTERIAL IDENTIFICATION

The taxonomy of any group of organisms is based on three sequential stages: classification, nomenclature, and identification. The first two stages are the prime concern of professional taxonomists, but the end product of their studies should be an identification system that is of practical value to others. Therefore, an identification system is clearly dependent on the accuracy and data content of classification schemes and the predictive value of the name assigned to the defined taxa.

The ideal identification system should contain the minimum number of features required for a correct diagnosis, which is predictive of the other characters of the taxon identified. However, the minimum number of characters required is dependent on both the practical objectives of the exercise and the clarity of the taxa defined in classification. Thus many enterobacteria can be identified using relatively few physiological and biochemical tests, the numerous serotypes of Salmonella are recognized by their reactions to specific antisera, and the accurate identification of Streptomyces species requires determination of up to 50 diverse characters.

Workers at the Central Public Health Laboratory, United Kingdom, demonstrated the first practical computerized identification system in the 1970s using a system for enterobacteria. During the same period, the application of computers for numerical classification was developed. This concept was subsequently applied to many bacterial groups, and these studies provided data that were ideal for the development of computerized identification schemes. Many bacterial taxonomists were slow to realize this potential, but these data now form the basis of many probabilistic identification schemes. A wide and increasing range of computerized systems for the identification of bacteria is now reported in the scientific literature and allied to commercial kits. This reflects both the expansion of techniques used to determine characters for the classification of bacteria and the rapid developments in computer technology.
II. COMPUTER IDENTIFICATION SYSTEMS

The main approach to the identification of an unknown bacterium involves determination of its relevant characters and the matching of these with an appropriate database that defines known taxa. This database may be known as a probability matrix or identification matrix.
The ideal objective is to assign a name to the unknown that is not only correct but also predictive of some or all of its natural characters. Computerized identification schemes are more flexible than sequential systems (e.g., dichotomous keys). Computerized identification can be achieved in several ways: numerical codes and probabilistic identification are the most popular approaches. Expert systems and neural networks have been investigated, but their performance is no better than that of the probabilistic approach.

A. Numerical codes

These are usually based on +/− character reactions. They are applied to a relatively small set of characters that have been selected for their good diagnostic value and are applied to clearly defined taxa. Numerical codes require determination of a series of character states and the conversion of the binary results into a code number that is then looked up in the identification database. Such identification systems are particularly appropriate for the analysis of test results obtained when commercial identification kits are used. An example is the API 20E kit, which generates a unique seven-digit number from a battery of 21 tests (Table 52.1). The tests are divided into groups of three, and within each group a positive result scores 1, 2, or 4 according to the position of the test, while a negative result scores 0. The values in each group are summed to give one digit, and the resulting digits form a profile number that reflects the test results and can be looked up in the identification system. Organisms that generate profile numbers that are not in the identification system can be tested against appropriate computer-assisted probabilistic identification systems. Numerical codes have proved to be convenient and effective, particularly for well-studied groups such as the Enterobacteriaceae.

B. Probabilistic identification

Probabilistic schemes are designed to assess the likelihood of an unknown strain identifying to a known taxon.
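Before the probabilistic approach is described in detail, the numerical coding of section A can be illustrated with a minimal sketch. It assumes only what the text states: tests are taken in groups of three and positives are weighted 1, 2, and 4. The function name `profile_number` and the example results are hypothetical and are not taken from any kit documentation.

```python
def profile_number(results):
    """Convert +/- test results (True/False) into a profile number string.

    Tests are taken in groups of three; within a group, positives score
    1, 2 and 4 by position, and the group total gives one digit (0-7).
    """
    if len(results) % 3 != 0:
        raise ValueError("number of tests must be a multiple of three")
    weights = (1, 2, 4)
    digits = []
    for start in range(0, len(results), 3):
        group = results[start:start + 3]
        digits.append(sum(w for w, positive in zip(weights, group) if positive))
    return "".join(str(d) for d in digits)


# Hypothetical results for a battery of 21 tests (7 groups of 3).
example = [
    True, False, True,    # 1 + 4 = 5
    True, True, False,    # 1 + 2 = 3
    False, False, False,  # 0
    True, True, True,     # 1 + 2 + 4 = 7
    False, True, False,   # 2
    False, False, True,   # 4
    True, False, False,   # 1
]
print(profile_number(example))
```

Because each digit encodes exactly one combination of three results, the profile number can be decoded back into the original pattern, which is why it works as a lookup key.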
In theoretical terms, the taxa are treated as hyperspheres in an attribute space (a-space) in which the dimensions are the characters. The center of the hypersphere (taxon) is defined by the centroid (the most typical representative), and the critical radius encompasses all the members of each taxon. Ideally, each taxon will be distinct from any others if the identification matrix has been well constructed. To obtain an identification, the diagnostic characters for an unknown strain are determined and its position in the a-space calculated. If it falls within the hypersphere of a known taxon, it is identified. Thus, in essence, probabilistic identification systems allow for an acceptable number of “deviant” characters in both the known taxa and the unknown strains.

Most computer-assisted identification systems are based on Willcox’s implementation of Bayes’ theorem:

$$P(t_i \mid R) = \frac{P(R \mid t_i)}{\sum_{j} P(R \mid t_j)}$$

where $P(t_i \mid R)$ is the probability that an unknown isolate, giving a pattern of test results $R$, is a member of taxon (group of bacteria) $t_i$, and $P(R \mid t_i)$ is the probability that the unknown has a pattern $R$ given that it is a member of taxon $t_i$. The sum in the denominator runs over all taxa in the matrix. Bayes’ theorem incorporates prior probabilities, which are the expected prevalence of strains included in the identification matrix. For bacterial identification most authors give all taxa an equal chance of being isolated, and therefore the prior probabilities for all taxa are set to 1.0 and omitted from the equation.
The above equation therefore can be re-expressed as:

$$L_i^{*} = \frac{L_i}{\sum_{j} L_j}$$

where $L_i = P(R \mid t_i)$ is the likelihood of the observed results for taxon $t_i$, and the probabilities are now referred to as identification scores, or Willcox scores. The identification scores for each taxon are normalised values, and $L_i^{*}$ summed over all taxa equals one. Identification of an unknown isolate is achieved when $L_i^{*}$ for one taxon exceeds a specified threshold value.

An example is shown below with an identification matrix consisting of three taxa for which we have the probabilities for four tests (Table 52.2). An unknown has been isolated whose results for the first three tests are positive, negative, and positive, respectively. The likelihood that each of the taxa a, b, and c will give the pattern of results observed for the unknown is calculated by multiplying the probability of obtaining a positive result for test 1 by the probability of obtaining a negative result for test 2 by the probability of obtaining a positive result for test 3, for each taxon in turn (Table 52.3). The original identification matrix (Table 52.2) gives only the probabilities for positive results; to obtain the probability of a negative result, we must subtract the matrix entries for test 2 from 1. The identification scores are expressed as normalized likelihoods in Table 52.4. In this example the unknown is not identified because no single taxon reaches the identification threshold value. Taxa b and c are still both candidates for the identity of the unknown.

Threshold values of 0.999 are typically used, for example with the Enterobacteriaceae, but with other groups of bacteria, such as the streptomycetes, values as low as 0.95 have been used. In practical terms, a value of 0.999 means that the taxon with which the unknown identifies will have at least two test differences from all other taxa in the matrix.

Whatever type of identification system is used, there are four possible outcomes:

- The unknown is identified with the correct taxon.
- The unknown is misidentified, that is, incorrectly attributed to the wrong taxon.
- The unknown is not identified at all, and correctly so because the taxon to which it belongs is not present in the matrix.
- The unknown is not identified, but should have been identified with a taxon that is present in the matrix.

It is important that any system deals with these possibilities, although the last one is difficult to resolve. One problem with the identification score is that if an unknown is not represented in the matrix, but one strain within the matrix is closer to it (in a-space) than all others, the unknown may be identified as this strain.
This is where additional criteria should be used to assist the identification process. These include listing the differences in test results between the unknown and the strain it has been identified as, as well as the use of other numeric criteria such as taxonomic distance, the standard error of taxonomic distance, or maximum likelihoods. Taxonomic distance is the distance of an unknown from the centroid of any taxon with which it is being compared; a low score, ideally less than 1.5, indicates relatedness. The standard error of taxonomic distance assumes that the taxa are in hyperspherical normal clusters. An acceptable score is less than 2.0–3.0, and about half the members of a taxon will have negative scores, because they are closer to the centroid than average.

The maximum, or best, likelihood is the maximum probability for a taxon calculated using those tests carried out on the unknown. For each test, the calculation uses the larger of the probabilities of a negative and a positive result (Table 52.5). This allows for taxa with several entries of 0.50 in a matrix. Some authors calculate the likelihood/maximum likelihood ratio, termed the modal likelihood fraction (see Table 52.6), or its inverse, and use it to decide whether to accept the identification offered by a Willcox score that has exceeded the identification threshold. In the IDENT module of the MICRO-IS program, for example, the identification score is not given if the ratio best likelihood/likelihood is greater than 100.
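The Willcox score and the modal likelihood fraction described above can be sketched in a few lines. This is an illustrative sketch only: the matrix values below are hypothetical and are not those of Table 52.2, the function names are invented for this example, and the matrix is assumed to hold no entries of exactly 0 or 1 (which would make a likelihood collapse to zero).

```python
from math import prod

# Hypothetical identification matrix: probability of a POSITIVE result
# for four tests in three taxa (values invented for illustration).
MATRIX = {
    "a": [0.01, 0.99, 0.99, 0.50],
    "b": [0.99, 0.01, 0.90, 0.99],
    "c": [0.99, 0.20, 0.99, 0.01],
}


def likelihood(probs, results):
    """Likelihood of the observed results; None marks a test not carried out."""
    return prod(p if r else 1.0 - p
                for p, r in zip(probs, results) if r is not None)


def best_likelihood(probs, results):
    """Best likelihood: the larger of P(+) and P(-) for each test carried out."""
    return prod(max(p, 1.0 - p)
                for p, r in zip(probs, results) if r is not None)


def willcox_scores(matrix, results):
    """Normalised likelihoods; the scores sum to one over all taxa."""
    likes = {taxon: likelihood(p, results) for taxon, p in matrix.items()}
    total = sum(likes.values())
    return {taxon: value / total for taxon, value in likes.items()}


def identify(matrix, results, threshold=0.999):
    """Return the taxon whose score reaches the threshold, or None."""
    scores = willcox_scores(matrix, results)
    top = max(scores, key=scores.get)
    return top if scores[top] >= threshold else None


# Tests 1-3 are positive, negative, positive; test 4 was not carried out.
unknown = [True, False, True, None]
scores = willcox_scores(MATRIX, unknown)
for taxon, score in scores.items():
    fraction = likelihood(MATRIX[taxon], unknown) / best_likelihood(MATRIX[taxon], unknown)
    print(f"{taxon}: score={score:.4f} modal likelihood fraction={fraction:.4f}")
print("identified as:", identify(MATRIX, unknown))
```

With these invented values, taxa b and c share most of the normalised score and neither reaches 0.999, so no identification is returned. This mirrors the situation described for the worked example, where further tests would be needed to separate the two candidates.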
III. GENERATION OF IDENTIFICATION MATRICES

One of three approaches can be used to generate identification matrices.

Cluster analysis. A cluster analysis of fresh isolates and reference strains (taxa) is carried out. Phena (clusters of taxa) are selected from a dendrogram and their properties are summarised to create a starting or frequency matrix containing all characters used in the study. The identification matrix is developed by including the most useful characters and rejecting those that do not distinguish between phena. Some clusters produced by the cluster analysis are omitted from the starting matrix because they contain few members (one, two, or three isolates) or comprise a group that cannot be identified.

Grouping known strains. Known strains are characterized using a range of tests, and the percentage of strains in each taxon exhibiting each character is calculated. The identification of each strain is assumed to be accurate and no taxonomic analysis is carried out. The problem with this approach is that isolates that have traditionally been treated as strains of one species might be a grouping of two or more species. If characterisation is carried out on strains that have been repeatedly subcultured, these strains may show less metabolic activity than fresh isolates. This could result in a matrix that is biased against fresh isolates.

Data collected from the literature. This is the least reliable approach. Data are collected from a variety of publications and merged to create a single matrix. The authors must resolve any conflicting test results and assign probabilities for positive, negative, and variable results. If the characterization methods are not adequately described in the literature, some characters may be misinterpreted and the resulting matrix may contain erroneous probabilities.

Whatever method is used to create the probability matrix, it is important that the characterization of unknowns is performed using the same techniques.
For example, it would be inappropriate to create a matrix using miniaturized tests and use it for the identification of an unknown characterized using conventional tests.

A. Selection of characters

Whatever the size and scope of the frequency matrix, by no means all the characters used will have sufficient diagnostic value for use in an identification matrix. Therefore, the major task is to determine a minimal battery of reliable tests that will distinguish between the taxa. The ideal diagnostic character is one that is consistently positive or negative within one taxon and so differentiates it from most or all of the other selected taxa. This is seldom achieved with one character, but selected groups of characters may approach this ideal. How far this is achieved depends on the consistency of the taxa studied and the objectives of the identification, which in turn influence the principles and methods used to select characters and to identify unknown strains. Few characters can be regarded as entirely constant. Character variation may be real (e.g., strain variation) or arise from experimental error.
Many tests and observations are difficult to standardize completely within or between laboratories. Therefore, any identification system should ideally take account of these sources of variation.

The first stage is to check the quality of the initial frequency matrix before proceeding further. Two criteria of particular relevance are: (a) the homogeneity of the taxa; and (b) the degree of separation or minimal overlap between them. These can be assessed using appropriate statistics calculated by the OVERMAT and OUTLIER programs.

Once it can be assumed that the frequency matrix is sound, the next step is to select the minimum number of characters from it that are required for the distinction between all the taxa included. There is some controversy about the minimum number of tests needed to separate a range of taxa effectively. One guideline is that the number of tests should at least equal the number of taxa. This may apply to relatively small and tightly defined taxa, particularly for genera rather than species. However, for many taxa (e.g., Bacillus, Clostridium, Enterobacteriaceae, and Streptomyces), the large number of species would necessitate use of an excessive number of characters. Most of these problems can be solved if: (a) the aims of the identification exercise are clear; and (b) the selection of characters is approached objectively.

The ideal diagnostic character should be always positive for 50% of the taxa in the matrix and always negative for the other 50%. Characters that are either always positive or always negative are clearly of no diagnostic value, as are those that have a frequency of 50% within all or most taxa. Various separation indices have been devised that can rank characters in order of their diagnostic value. The CHARSEP program incorporates several of the indices and provides a useful means of selecting characters.
The use of one index, the variance separation potential (VSP), is illustrated in Table 52.7; this index is based on the variance within taxa multiplied by separation potential. Values greater than 25% indicate acceptable characters. Another program (DIACHAR) ranks characters according to their diagnostic potential. The diagnostic scores of each character for each group in a frequency matrix are ordered and the sum of scores for all selected characters in each group (Table 52.8) is also provided. The higher the score, the greater the diagnostic value of the selected characters. Sometimes, it is desirable to select a few characters that, although of low overall separation potential, are shown by DIACHAR to be diagnostic for a particular taxon. The program BEST uses similar methods to identify useful tests and can create an identification matrix from the frequency matrix.
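The steps described in this section, building a frequency matrix from characterized strains and then ranking characters by how well they separate taxa, can be sketched as follows. The ranking index used here is a deliberately simple one chosen for illustration: for each character it multiplies the number of taxa that are almost always positive by the number that are almost always negative. It is not the VSP calculated by CHARSEP nor the score reported by DIACHAR, and the 0.85/0.15 cut-offs, the function names, and the strain data are all assumptions made for this sketch.

```python
def frequency_matrix(strains):
    """Build {taxon: [fraction of strains positive for each test]}.

    `strains` is a list of (taxon, results) pairs, where results is a
    list of True/False values, one per test.
    """
    counts, totals = {}, {}
    for taxon, results in strains:
        row = counts.setdefault(taxon, [0] * len(results))
        for i, positive in enumerate(results):
            row[i] += int(positive)
        totals[taxon] = totals.get(taxon, 0) + 1
    return {taxon: [c / totals[taxon] for c in row]
            for taxon, row in counts.items()}


def separation_count(freqs, test, high=0.85, low=0.15):
    """Illustrative index: (taxa mostly positive) x (taxa mostly negative)."""
    positive = sum(1 for row in freqs.values() if row[test] >= high)
    negative = sum(1 for row in freqs.values() if row[test] <= low)
    return positive * negative


def rank_characters(freqs):
    """Return test indices ordered from most to least useful."""
    n_tests = len(next(iter(freqs.values())))
    return sorted(range(n_tests),
                  key=lambda i: separation_count(freqs, i),
                  reverse=True)


# Hypothetical data: four taxa, two strains each, three tests.
STRAINS = [
    ("A", [True, True, False]), ("A", [True, True, False]),
    ("B", [False, True, False]), ("B", [False, True, True]),
    ("C", [True, True, True]), ("C", [True, True, True]),
    ("D", [False, True, False]), ("D", [False, True, False]),
]

freqs = frequency_matrix(STRAINS)
ranking = rank_characters(freqs)
print(freqs)
print("tests ranked by separation:", ranking)
```

In this toy data set the first test splits the taxa evenly and ranks highest, while the second test is positive in every taxon and scores zero, in line with the observation that characters that are always positive or always negative have no diagnostic value.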
IV. EVALUATION OF IDENTIFICATION MATRICES

Once an identification matrix has been constructed, it is important that its diagnostic value is assessed before it is recommended for use by microbiologists who are not necessarily expert taxonomists. Matrices can be evaluated by both theoretical and practical means.

A. Theoretical evaluation

Evaluation of the matrix typically consists of determining the identification scores of the hypothetical median organism of each taxon in the matrix. This provides the best possible identification scores for each taxon included in the matrix. If any of these scores are unsatisfactory, practical identification of unknowns against such taxa will inevitably be unreliable. The MOSTTYP program does this using the Willcox probability, taxonomic distance, and the standard error of taxonomic distance as identification coefficients. The IDSC program performs similar calculations and, where test probabilities of 0.50 are encountered, calculates three scores, for positive, negative, and missing test results.

B. Practical evaluation

Practical assessment involves entering the diagnostic character states of known taxa to the matrix. This should involve the redetermination of the diagnostic characters of a random selection of taxon representatives that have been included in the construction of both the frequency and identification matrices. It provides another assessment of experimental error in the determination of character states and its impact on the identification system.
The selected representatives should then identify closely to their taxon when their identification coefficients are determined against the matrix. With a well-constructed matrix, bad identification scores are rare, but when they occur they may reflect the random choice of an atypical representative of a poorly defined taxon rather than experimental error.

The final practical evaluation of a matrix clearly involves assessment of its success in identifying unknown strains. Therefore, the appropriate characters for unknowns are determined, and their identification scores are calculated and assessed by the investigator. To date, most probabilistic identification systems have been tested against and applied to natural or “wild” isolates. However, there is an increasing use of genetic manipulation of such strains for scientific, medical, ecological, and industrial purposes. For a variety of reasons, not least patent laws, it is important to compare manipulated strains with each other and with their wild types. This is still a developing area in bacterial taxonomy, but probabilistic systems can be useful. For example, streptomycete strains that had been manipulated by various means, such as mutagens, plasmid transfer, and genetic recombination, were compared with their parent strains against an identification matrix. Most of the manipulated strains identified to the same species as their parents, indicating that most of the selected diagnostic characters were unchanged and that the identification matrix could accommodate some character state changes.

C. Computer software

Most of the programs mentioned above were developed by Prof. P.H.A. Sneath (Department of Microbiology, Leicester University, United Kingdom).
They are written in BASIC, and full details of the programs can be obtained from the following publications. Those by T.N. Bryant can be obtained from www.som.soton.ac.uk/staff/tnb/pib.htm

Bryant, T.N. (1987). Computer Applications in the Biosciences 3, 45–48.
Bryant, T.N. (1991). Computer Applications in the Biosciences 7, 189–193.
Sneath, P.H.A. (1979). MATIDEN program. Computers & Geosciences 5, 195–213.
Sneath, P.H.A. (1979). CHARSEP program. Computers & Geosciences 5, 349–357.
Sneath, P.H.A. (1980). DIACHAR program. Computers & Geosciences 6, 21–26.
Sneath, P.H.A. (1980). MOSTTYP program. Computers & Geosciences 6, 27–34.
Sneath, P.H.A. (1980). OVERMAT program. Computers & Geosciences 6, 267–278.
Sneath, P.H.A., and Langham, C.D. (1989). OUTLIER program. Computers & Geosciences 15, 939–964.
Sneath, P.H.A., and Sackin, M.J. (1979). IDEFORM program. Computers & Geosciences 5, 359–367.

D. Published identification matrices

Many identification matrices have been published (see Table 52.9). Most of these have been developed using the procedures described above. Success rates vary with the bacterial group under study. Probabilistic identification of gram-negative bacteria has been most effective. For example, 933 (98.2%) isolates of fermentative gram-negative bacteria and 621 (91.5%) isolates of nonfermenters were identified using a Willcox probability threshold of greater than 0.999. Of 243 vibrios isolated from freshwater, 71.6% were identified at a level greater than 0.999 and 79.4% at greater than 0.990. When such stringent coefficient levels are applied to gram-positive bacteria, the results are often less impressive. For example, when a probability of greater than 0.999 was applied to coryneform bacteria, only 50% of the unknowns were identified, and at a level of greater than 0.995 only 42% of streptomycete isolates were identified.
If less stringent coefficients are applied to take account of the heterogeneity of such groups, a higher rate of useful identifications can be achieved. Thus, 73% of 153 streptomycete isolates were identified using a Willcox probability of greater than 0.85.
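Returning to the theoretical evaluation of section IV.A, the scoring of each taxon's hypothetical median organism can be sketched as below. This is not the MOSTTYP or IDSC program: it uses only the Willcox score, the matrix values are invented, and treating a matrix entry of 0.50 or more as a positive result for the median organism is an assumption made for this example.

```python
from math import prod


def willcox_scores(matrix, results):
    """Normalised likelihoods of a True/False result pattern for each taxon."""
    likes = {taxon: prod(p if r else 1.0 - p for p, r in zip(probs, results))
             for taxon, probs in matrix.items()}
    total = sum(likes.values())
    return {taxon: value / total for taxon, value in likes.items()}


def median_organism(probs):
    """Most typical result pattern: positive where P(+) >= 0.5 (assumption)."""
    return [p >= 0.5 for p in probs]


def evaluate(matrix, threshold=0.999):
    """Score each taxon's median organism against the whole matrix."""
    report = {}
    for taxon, probs in matrix.items():
        score = willcox_scores(matrix, median_organism(probs))[taxon]
        report[taxon] = (score, score >= threshold)
    return report


# Hypothetical matrix: taxa x and z differ only weakly, in the fourth test.
MATRIX = {
    "x": [0.99, 0.99, 0.01, 0.01],
    "y": [0.01, 0.01, 0.99, 0.99],
    "z": [0.99, 0.99, 0.01, 0.20],
}

report = evaluate(MATRIX)
for taxon, (score, ok) in report.items():
    print(f"{taxon}: best possible score={score:.5f} passes={ok}")
```

Here taxon y is well separated and its median organism passes the threshold, whereas x and z share the same median pattern and neither can ever reach 0.999. A result like this would suggest adding characters that separate the two taxa, or lowering the threshold for that group as discussed above.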
V. BACTERIAL IDENTIFICATION SOFTWARE

Several probabilistic identification programs have been published. Some are available as interactive web pages, although whether these pages offer any advantage over software installed on a user’s machine is questionable. A list of resources is presented below. In many instances the software can be used with various identification matrices. Examples from one program, PIBWin (a Windows version of Bacterial Identifier), are shown in Figs 52.1–52.3.

BBACTID: Bryant, T. N., Capey, A. G., and Berkeley, R. C. W. (1985). Computer Applications in the Biosciences 1, 23–27.
BACTID: Jilly, B. J. (1988). International Journal of Biomedical Computing 22, 107–119.
CIBAC: Döring, B., Ehrhardt, S., Lücke, F., and Schillinger, U. (1988). Systematic and Applied Microbiology 11, 67–74.
Gideon: www.cyinfo.com
IDENTIFY: Jahnke, K. D. (1995). Journal of Microbiological Methods 21, 133–142.
MICRO-IS: Portyrata, D. A., and Krichevsky, M. I. (1992). Binary 4, 31–36. www.bioint.org/support/microis/microis.html
MATIDEN: Sneath, P. H. A. (1979). Computers & Geosciences 5, 195–213.
Identmpm: Maradona, M. P. (1994). Computer Applications in the Biosciences 10, 71–73.
no-name: Tortoli, E., Boddi, V., and Penati, V. (1992). Binary 4, 200–203.
The Identifier: Gibson, L. F., Clarke, C. J., and Khoury, J. T. (1992). Binary 4, 25–30.
PIBWin (Bacterial Identifier): www.som.soton.ac.uk/staff/tnb/pib.htm
Recognet: www.pasteur.fr/recherche/banques/recognet
VI. OTHER APPLICATIONS OF COMPUTERS TO BACTERIAL IDENTIFICATION

We have concentrated on the use of computers to construct identification systems as well as to access them. However, computer programs are increasingly used solely to access taxonomic data for the identification of unknown strains. Developments in computer technology can also provide a more direct link between the determination of test results and their evaluation. An example is the use of so-called “breathprints” for identification of gram-negative, aerobic bacteria. This relies on a redox dye to detect the increased respiration when a carbon source is oxidized. A range of substrates in a microtiter plate is inoculated with a strain; if a substrate is used, a pigment is formed by the redox dye, indicating a positive reaction. The pattern of these reactions on the plate provides a breathprint, which can then be compared with those of known taxa using a system consisting of a microplate reader and a computer.

Databases for both the classification and identification of bacteria have been extended and improved by the inclusion of diagnostic characters provided by chemical analysis of cell components. These include cell wall amino acids, membrane lipids, and proteins, which have been particularly useful for the definition and identification of higher taxa such as genera or families. For example, analysis of bacterial lipids involves gas chromatography, which results in a printout of a set of peaks that are defined chemically but are difficult to evaluate and compare quantitatively by eye. Various programs have been used to transform the data for principal component analysis and to provide similarity and overlap coefficients for comparison of unknown strains.

The use of polyacrylamide gel electrophoresis (PAGE) to analyze protein patterns of bacteria is well established in bacterial taxonomy. The results are obtained in the form of stained bands on the gels.
A variety of computer programs has been devised to facilitate their analysis. Typically, the stained gels are scanned by a computer-controlled densitometer. The software digitizes the continuous output of the densitometer scan, removes or corrects for background effects, and permits comparison with other stored traces. Thus, an unknown can be compared with known taxa using various similarity coefficients.

Another method of bacterial identification is pyrolysis mass spectrometry. This involves the thermal degradation of a small sample of cells in an inert atmosphere or vacuum, leading to production of volatile fragments. Under controlled conditions, these are characteristic of a taxon, and they are separated and analyzed in a mass spectrometer. Assessment of the traces obtained requires software for performing principal components, discriminant, and cluster analysis, to allow known taxa to be distinguished and unknowns identified.

The developments in nucleic acid techniques are having a marked and exciting impact on bacterial taxonomy, where they provide a genetic assessment of taxa, which can be used to supplement or revise the existing phenetic systems. Techniques such as DNA reassociation, DNA–rRNA hybridization, and DNA and RNA sequencing are increasingly used in bacterial classification, whereas nucleic acid probes and DNA fingerprints are of great potential in identification. Computation is used in the analysis and application of such data. Despite the relative novelty of these sources of taxonomic data, ultimately a convenient and accurate means of comparing data for unknown strains with those of established taxa is still required. Thus, when determining DNA fingerprints it is useful to have a permanent record of fragment size. Precise migration measurements on gels should not be spoiled by inaccurate assessments of fragment size. Fragments have a curvilinear relationship between the mobility of their bands on gels and their molecular sizes.
A number of programs have been devised to transfer and assess such measurements. Thus, computation has an established and developing role in all stages and aspects of bacterial identification.
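As an illustration of how fragment sizes might be estimated from band mobilities, the sketch below interpolates between size standards run on the same gel. The source does not specify a method. Interpolating the logarithm of fragment size linearly between neighbouring standards is an assumption made here as one simple way to follow a curvilinear mobility-size relationship, and the marker values and function name are hypothetical.

```python
from bisect import bisect_left
from math import log10


def estimate_size(mobility, standards):
    """Estimate fragment size from band mobility.

    `standards` is a list of (mobility, size) pairs for marker fragments
    of known size. log10(size) is interpolated linearly between the two
    standards that bracket the measured mobility (an assumed model).
    """
    points = sorted(standards)
    mobilities = [m for m, _ in points]
    if not mobilities[0] <= mobility <= mobilities[-1]:
        raise ValueError("mobility lies outside the range of the size standards")
    i = bisect_left(mobilities, mobility)
    if mobilities[i] == mobility:
        return float(points[i][1])
    (m0, s0), (m1, s1) = points[i - 1], points[i]
    fraction = (mobility - m0) / (m1 - m0)
    return 10 ** (log10(s0) + fraction * (log10(s1) - log10(s0)))


# Hypothetical markers: mobility in mm, size in base pairs.
STANDARDS = [(10.0, 10000), (20.0, 1000), (30.0, 100)]

print(round(estimate_size(15.0, STANDARDS)))
print(estimate_size(20.0, STANDARDS))
```

Refusing to extrapolate beyond the outermost markers is a deliberate choice in this sketch, since the relationship is least predictable at the extremes of the gel.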