Comprehensive and Updated Computer Programming!: Microbial Genomics

INTRODUCTION

Our ability to “see” the microbes that surround us first arose in the 17th century when Anton

van Leeuwenhoek created a microscope, which provided the first physical evidence of the

diversity and ubiquity of microbes in the world. Another giant leap occurred in the 19th

century when Koch first demonstrated that bacteria could be grown in pure culture,

beginning with the analysis of blood from cows infected with the anthrax agent. He

subsequently became most widely known for the postulates regarding microbial disease

causation that bear his name. For decades following, a combination of culture and

microscopy was the only tool available to see microbes. In the last decade of the 20th

century, however, consensus ribosomal PCR revealed that a much greater diversity of

bacteria exists beyond what could be cultured, and that cultured bacteria represented

only ~1% of all bacterial species. Today, sequencing technology has evolved to the point that

it is feasible to comprehensively define the collection of microbes present in humans (i.e.,

the human “microbiome”) or any other ecological niche in a fashion that is completely

independent of culture or microscopy.

A BRIEF HISTORY OF PATHOGEN DISCOVERY

Classic Methods

Classic methods of microbial discovery have relied heavily on the ability to readily cultivate

or passage the organism in question. Clinical material associated with a disease thought to

be of infectious origin is used to inoculate diverse growth media to cultivate the microbe(s)

present in the sample. In the case of suspected bacterial agents, selective or nonselective

growth media can be utilized, while primary or immortal cell lines are inoculated for

suspected viruses. In addition, the clinical specimens could also be used to infect animal

models. If a microbe could be cultivated, attempts to identify it would follow. For bacterial

identification, differential stains and growth conditions are used to categorize and ascertain

the genus or species present. Viral identification is similarly based on differential growth in

various cell types as well as serological reactions to a variety of specific antisera. The use of

microscopic techniques, including light and electron microscopy, is also extremely important

in the process of identification. These classic tools have been incredibly useful and resulted in

the discovery of many currently accepted human pathogens, for example, Bacillus anthracis,

Mycobacterium tuberculosis, Yellow fever virus, and Poliovirus. However, there are two

fundamental limitations to this approach: (i) these methods are dependent on the ability of

the microbe to grow in the substrate provided, and (ii) even if the microbe can be cultivated,

that fact alone will not necessarily lead to unambiguous identification of the unknown agent.

Molecular Approaches: Candidate Dependent

In the late 1980s, with the advent of PCR, scientists could now easily use molecular

approaches to detect microbes present in a given clinical sample if there was existing

sequence for the microbe(s) to be targeted. The recognition that selecting PCR primers

designed to highly conserved regions in a set of sequences (e.g., multiple bacteria or several

viruses from a common taxonomic group) could enable the detection of previously

unsequenced or unidentified microbes provided a novel approach for the identification of

microbes. These types of approaches are referred to interchangeably as broad-range or

consensus PCR methods. One of the broadest applications of this technique has been the

design of primers to the 16S rRNA gene, which enables the detection of nearly all members

of the bacterial domain. Alternatively, more specific primers can be selected that are

conserved within a given taxon (family, genus, or species) to identify a more targeted set of

microbes. The ability to design highly conserved primers is, of course, predicated on the

existence of sufficient sequence data to identify the appropriate conserved regions. There are

a large number of microbes that have been discovered by using PCR in conjunction with the

classical methods of microbial detection and discovery.
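
The idea of placing consensus primers in highly conserved regions can be illustrated with a minimal Python sketch. It assumes the input sequences have already been aligned to equal length (with "-" marking gaps), and it scans for gap-free windows in which every column is conserved. The function name `conserved_windows`, the window width, and the toy sequences are hypothetical and chosen only for illustration; real primer design also weighs melting temperature, degeneracy, and amplicon length, none of which is modeled here.

```python
from typing import List, Tuple


def conserved_windows(aligned: List[str], width: int,
                      min_identity: float = 1.0) -> List[Tuple[int, str]]:
    """Return (start, consensus) for every gap-free window of `width` columns
    in which each column is shared by at least `min_identity` of the sequences."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be pre-aligned to equal length")

    consensus = []   # most common character in each column
    conserved = []   # True if the column passes the identity threshold
    for col in range(length):
        column = [seq[col] for seq in aligned]
        # sorted() makes tie-breaking deterministic
        best = max(sorted(set(column)), key=column.count)
        frac = column.count(best) / len(column)
        consensus.append(best)
        conserved.append(best != "-" and frac >= min_identity)

    hits = []
    for start in range(length - width + 1):
        if all(conserved[start:start + width]):
            hits.append((start, "".join(consensus[start:start + width])))
    return hits


if __name__ == "__main__":
    # Toy alignment (hypothetical sequences, not real 16S rRNA data).
    toy = ["ACGTACGTAA", "ACGTTCGTAA", "ACGTACGAAA"]
    for start, site in conserved_windows(toy, width=4):
        print(start, site)
```

Lowering `min_identity` below 1.0 tolerates some variation per column, which loosely corresponds to accepting degenerate primer positions.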

Two pioneering papers by Relman and coworkers describe the first examples of using

consensus PCR primers to identify the causative agents of specific human diseases. Bacillary

angiomatosis was commonly thought to be of infectious origin, but for many years no

specific microbial agent could be identified. A putative agent could be visualized in tissue

sections following staining, but efforts to culture the organism had failed. Sequencing of

amplicons generated by PCR using 16S rRNA gene consensus primers demonstrated that a

previously uncharacterized rickettsia-like bacterium was present in tissue samples of patients

with bacillary angiomatosis (67). The bacterium was later identified as a member of the

genus Bartonella. A similar approach was subsequently applied to address the etiology of

Whipple’s disease. Whipple’s disease was first described in 1907 as a rare systemic disorder

that primarily caused malabsorption but could affect any part of the body. Consensus PCR

using primers targeting the bacterial 16S ribosomal gene resulted in the identification of an

uncharacterized actinomycete, which would later be classified as Tropheryma whipplei (68).

These two cases demonstrated the power of molecular pathogen discovery methods.

The use of conserved PCR primers has also been applied to the discovery of viruses.

However, since no universally conserved sequence akin to the 16S rRNA sequences in

bacteria is present in all viruses, consensus sequences must be identified and consensus

primers designed for each given viral taxon of interest. A seminal example of using

consensus PCR to identify a viral pathogen occurred during the emergence of hantavirus

pulmonary syndrome in 1993 (61). In the course of investigating an unusual outbreak of a

lethal pulmonary disease in otherwise healthy young adults in the southwestern United

States, extensive testing by classic microbiological methods ruled out most of the likely

candidates known to cause severe respiratory disease. Serological tests revealed that patient

sera were cross-reactive with known hantaviruses. From this lead, PCR primers were

designed to conserved regions of known hantavirus sequences, which were then used to

amplify nucleic acids extracted from tissue samples isolated from dying patients. Sequencing

of the amplicon generated by the primers resulted in the identification of a novel member of

this family, which was ultimately named Sin Nombre virus.

Since these seminal applications of consensus PCR for the identification of bacterial and viral

pathogens, there have been many instances of microbial identification using this strategy.

Consensus PCR, either alone or in conjunction with classic culture and antigen detection

methods, continues to be of great utility in the discovery of novel microbes, as illustrated by

the recent discoveries of a new phylogenetic group of rhinoviruses (47, 49, 51, 52, 59), a

spate of parechoviruses (4, 7, 8, 23, 40, 55, 78), the arenavirus Chapare virus (19), and

Bundibugyo ebolavirus (70). However, what is not documented in the literature is the

number of times that broad-range PCR strategies were applied but failed to identify an

agent. It should also be evident that in order for this approach to be useful and successful,

the list of potential candidates must be relatively short. In the successful examples

described above, the authors had a strong hypothesis regarding the nature of the microbe

(i.e., bacterium versus virus) present or which specific candidate viral taxon might be

present. However, in many situations, there may not be a leading candidate(s), thus limiting

the feasibility of using consensus PCR approaches, especially for viral identification. Thus,

despite these successes, there has been significant impetus for the development of pathogen

discovery strategies that are broad range and not candidate dependent.

Molecular Approaches: Candidate Independent

The discoveries of Hepatitis C virus (HCV) and Human herpesvirus 8 (HHV-8), also called

Kaposi’s sarcoma-associated herpesvirus (KSHV), represented two breakthroughs in the

application of candidate-independent molecular methods for pathogen discovery. In 1989,

the identification of HCV in patients with non-A, non-B (NANB) hepatitis relied upon a library

immunoscreening strategy (17). A randomly primed cDNA library was made from material

from infected animals and screened using patient serum from NANB hepatitis patients with

the goal of identifying cDNA clones that generated peptide sequences recognized by the

patient sera. From over a million clones that were screened, a single clone reacted

specifically with NANB hepatitis patient sera. From this initial cDNA clone fragment, the HCV

genome was eventually sequenced. Today, HCV is recognized as being responsible for the

vast majority of cases of NANB hepatitis. In 1994, human herpesvirus 8 was discovered in

the lesions of AIDS-associated Kaposi’s sarcoma (13). The identification of Kaposi’s sarcoma-associated

herpesvirus relied on representational difference analysis, a subtractive

hybridization-based method, to enrich for and then identify unique sequences present in

Kaposi’s sarcoma lesions but not in healthy tissue controls. While these two examples

demonstrate the potential of these methods, there have been few subsequent success stories

using either of these two methods, most likely due to technical challenges associated with

both of these strategies. Thus, there remained a clear need for further improved strategies

for pathogen discovery.

By the end of the 20th century, the classic culture-based methods for microbial discovery

had been augmented by multiple molecular approaches, such as consensus PCR, library

immunoscreening, and representational difference analysis. In parallel, targeted sequencing

of specific microbes was starting to become feasible, thus setting the stage for the

convergence of pathogen discovery efforts and microbial sequencing efforts.

MICROBIAL GENOMICS HISTORY

Microbe sequencing in the 20th century relied exclusively upon Sanger dideoxy sequencing,

the dominant sequencing strategy since its invention in 1977. From its initial incarnation

using slab gels as a readout, incremental advances in sequencing capacity evolved as the

readout transitioned to capillary electrophoresis, and then from single capillaries to

simultaneous analysis of 96 capillaries, which is still used today.

Formally, the era of microbial genomics began with the complete sequencing of

the Haemophilus influenzae genome in 1995. However, it was recognized almost two decades

earlier that an organism’s genomic sequence, as the ultimate marker of evolution, could

serve to classify and define the relatedness of both prokaryotic and eukaryotic organisms

(80). rRNA was a molecule with an appropriately broad distribution which mutated slowly

over time, permitting the detection of relatedness. With the advent of Sanger sequencing,

entire 16S rRNA genes could be sequenced, including the Escherichia coli 16S rRNA gene in

1978 (11). Sequencing at the time and in the ensuing decades was time-consuming and

expensive and was performed to obtain the minimum amount of data that was needed.

When the H. influenzae genome was sequenced, this bacterium became the first free-living

organism to have its genome sequenced in its entirety (31). This was a landmark

achievement, notable also because of the use of a “shotgun” strategy to assemble the

complete genome. “Shotgun” refers to the random fragmentation and cloning of DNA

fragments followed by computational assembly of the overlapping regions to generate a

complete genome sequence. Based on this proof of principle, genomes of larger microbes

and eukaryotic organisms were subsequently sequenced in this fashion. The following

year, Saccharomyces cerevisiae was the first eukaryotic organism to be fully sequenced (35),

and then in 1998, the first multicellular eukaryotic genome to be sequenced, that

of Caenorhabditis elegans, was published (12). Since then, the complete genomes of many

human and animal pathogens have been sequenced, including notable pathogens such

as Mycobacterium tuberculosis (2001), Yersinia pestis (2001), and Plasmodium

falciparum (2002). In 2004, the complete 1.2-Mb genome of mimivirus, the largest known

virus, was published (65).
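
The “shotgun” principle described above, random fragments joined computationally through their overlapping regions, can be sketched with a toy greedy assembler in Python. This is only an illustration under strong assumptions: reads are error-free, come from a single strand, and contain no repeats longer than the minimum overlap. The function names and the example fragments are hypothetical, and the code is not the assembler used for the H. influenzae genome.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (at least `min_len` long), or 0 if there is none."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments wholly contained in another fragment.
    contigs = [f for f in contigs
               if not any(f != g and f in g for g in contigs)]
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    # Three overlapping fragments of a made-up 14-base "genome".
    reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
    print(greedy_assemble(reads))
```

Greedy merging is the simplest possible strategy; production assemblers use overlap or de Bruijn graphs and must cope with sequencing errors and repeats.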

The human genome project was first proposed in 1990, and initial sequencing began in

1995. By 2001 two drafts of the human genome had been published (50, 75). During the

course of this massive project, many technological refinements in the efficiency of Sanger

sequencing itself, as well as novel tools for the downstream computational analysis, were

implemented. These developments could naturally be applied to sequencing of much smaller

microbial genomes and therefore contributed substantially to the rapid increase in the rate of

microbial sequencing.

Over the past 5 years, a number of new sequencing modalities that together have been

termed the next generation, or NextGen, of sequencing have been developed. The three

major platforms in current use are 454 (Roche Titanium), Solexa (Illumina), and SOLiD

(ABI). A key characteristic shared by all of these platforms is that they have

geometrically increased the raw sequence generation capacity and decreased the cost per

base pair 10- to 100-fold relative to Sanger sequencing. Although each of these new

platforms utilizes a fundamentally different sequencing modality, in all cases, clonal

amplification of the template DNA has been moved from bacteria (thereby eliminating the

need for plasmid cloning and propagation) to an in vitro setting. In the 454 technology,

approximately 1 million sequence reads averaging 400 bp, or about 400 Mb of total

sequence, are generated per run. For microbes such as bacteria, a single

sequencing run of a 454 instrument is therefore sufficient to generate more than 100-fold

coverage of the average 3-Mb bacterial genome. By comparison, the Illumina platform

currently produces ~20 Gb of sequence with read lengths up to 100 bp, while the SOLiD can

generate ~30 Gb with an average read length of 35 bp. For a more detailed description of

each of these platforms and capabilities, see reference 58.
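
The coverage figures quoted above follow from simple arithmetic: average fold coverage is the total number of sequenced bases divided by the genome size. The short Python sketch below reproduces that calculation using the per-run numbers given in the text; the function names are hypothetical, and the calculation ignores uneven coverage, read quality, and contaminating host sequence.

```python
def fold_coverage(n_reads: float, mean_read_len_bp: float,
                  genome_size_bp: float) -> float:
    """Average fold coverage = (reads x mean read length) / genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp


def fold_coverage_from_yield(total_bases: float, genome_size_bp: float) -> float:
    """Same quantity, starting from the total per-run yield in bases."""
    return total_bases / genome_size_bp


if __name__ == "__main__":
    genome = 3_000_000  # "average" 3-Mb bacterial genome, as in the text

    # 454: ~1 million reads averaging 400 bp (about 400 Mb per run)
    print(round(fold_coverage(1_000_000, 400, genome), 1))

    # Illumina (~20 Gb per run) and SOLiD (~30 Gb per run)
    print(round(fold_coverage_from_yield(20e9, genome)))
    print(round(fold_coverage_from_yield(30e9, genome)))
```

With these inputs, one 454 run corresponds to roughly 133-fold coverage of a 3-Mb genome, consistent with the "more than 100 times" figure stated above.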

With these increases in sequencing capacity, the sequencing of microbial genomes has

become routine. In fact, there are currently ambitious projects that have been conceived to

comprehensively sequence the microbial diversity of the human microbiome and the viral

diversity, “the virome,” present in humans (see chapter 13). These efforts will vastly expand

the world of sequenced microbes far beyond the current 5,900 species of bacteria, fungi,

parasites, and viruses that have been completely sequenced and whose sequences have

been deposited in GenBank (GenBank Genome Records

11.12.09; http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome).

THE MERGING OF PATHOGEN DISCOVERY AND

MICROBIAL GENOMICS

The onset of the 21st century has seen the convergence of the fields of pathogen discovery

and microbial genomics (Fig. 1). The goals of systematically defining the microbes present in

a given clinical sample dovetail with the goals of “pathogen discovery,” namely, to identify

one or more microbes present in a clinical sample that are responsible for a disease

phenotype. Two new molecular approaches for massively parallel analysis have recently

emerged that are capable of defining the spectrum of microbes present in clinical specimens:

microarray and sequencing-based detection. Both of these strategies have benefited greatly

from the increased focus on human and microbial sequencing. The vast database of nucleic

acid sequences provides the substrate from which consensus PCR primers (described above)

and probes for microarrays (described below) have been designed. Naturally, sequencing-based

approaches have evolved along with the sequencing capacity of new platforms. In the

first half of this decade, Sanger method-based sequencing dominated these efforts, and due

to the costs and limited throughput, experimental strategies were devised to minimize the

extent of sequencing necessary to identify microbial agents. As the decade progressed, and

more robust sequencing methods evolved, the need to enrich for microbial sequences

diminished and more brute force strategies have come to the forefront. For all the

sequencing-based methods, a critical component of the process is the bioinformatic analysis

of the sequences generated. Typically, computational pipelines have been established that

compare the resulting sequences against various public sequence databases to determine the

origin of each sequence read. The “discovery” of a novel microbial sequence results when a

sequence with only limited similarity to existing microbial sequences is encountered.
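
The logic of such a pipeline, comparing each read against reference sequences and flagging reads with only limited similarity as candidate novel sequences, can be sketched in Python as follows. This is a deliberately simplified stand-in: it scores similarity as the fraction of a read's k-mers found in a reference, whereas real pipelines typically rely on alignment tools such as BLAST against public databases. The function names, thresholds, and the tiny reference set are all hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read: str, references: Dict[str, str], k: int = 4,
                  known: float = 0.9,
                  novel: float = 0.3) -> Tuple[str, Optional[str], float]:
    """Return (label, best_reference, score).

    score = fraction of the read's k-mers present in the best reference.
    label is 'known' (score >= known), 'candidate_novel' (limited
    similarity, novel <= score < known), or 'unassigned' (score < novel).
    """
    read_kmers = kmers(read, k)
    best_name, best_score = None, 0.0
    for name, ref in references.items():
        if not read_kmers:
            break
        score = len(read_kmers & kmers(ref, k)) / len(read_kmers)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= known:
        label = "known"
    elif best_score >= novel:
        label = "candidate_novel"
    else:
        label = "unassigned"
    return label, best_name, best_score


if __name__ == "__main__":
    # Made-up reference; a real run would query a public sequence database.
    refs = {"virusA": "ATGCGTACGTTAGCCGATAG"}
    for read in ["GCGTACGTTAGC", "GCGTACCTTAGC", "TTTTTTTTTTTT"]:
        print(read, classify_read(read, refs))
```

In practice the "limited similarity" category is where follow-up work concentrates: such reads are extended, assembled, and confirmed by independent methods before a novel microbe is claimed.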

Sequencing-Based Approaches

The discovery of Human metapneumovirus (HMPV) in 2001 combined classic viral culture

with a molecular strategy termed random arbitrarily primed PCR (73). Efforts to culture

respiratory secretions from children suffering respiratory tract infections led to the

identification of a putative unidentified virus that could be passaged in tertiary monkey

kidney (tMK) cells, Vero cells, and, to a lesser extent, A549 cells. In order to identify the

virus present, random, arbitrary primers were used to generate PCR amplicons. Differentially

represented amplicons present in infected cells but absent from control cells were identified

by gel electrophoresis and selectively sequenced. Multiple fragments having limited sequence

identity to avian pneumoviruses were detected, indicating that a novel virus, now known as

HMPV, was present in the infected cells. Seroprevalence studies indicated that by the age of

5 to 10 years, most individuals were antibody positive, suggesting that this virus is a

common infection acquired in childhood. HMPV can cause severe respiratory infections,

including pneumonia and bronchiolitis, and is responsible for 5 to 10% of hospitalizations of

patients with respiratory tract infection (18). The presentation and case severity are very

similar to those of respiratory syncytial virus (18).

In the same year, a candidate-independent sequencing strategy for identification of novel

viruses termed DNase-SISPA was described (2). The experimental strategy relied upon

sequence-independent single primer amplification (SISPA), wherein an adaptor containing a

primer binding sequence is ligated to both ends of a cDNA fragment and a single primer is

then used for PCR. To enrich specifically for viral nucleic acids present in virions, the clinical

sample is first subjected to ultracentrifugation to pellet the virions and is then treated with

DNase to degrade any cellular nucleic acids that are not protected within the viral capsids.

Following this enrichment, the sample is then extracted for DNA or RNA and amplified using

SISPA. The enrichment steps in this protocol are necessary to increase the chances of

sequencing a virus-derived sequence, given the labor and costs of performing extensive

Sanger sequencing on the unenriched sample. In this proof-of-concept study, two novel

bovine parvoviruses were identified (2).

A variation of the DNase-SISPA strategy, called virus discovery cDNA-AFLP (amplified

restriction fragment length polymorphism), was utilized in 2004 to identify a novel virus from

the family Coronaviridae, human coronavirus NL63 (HCoV-NL63), from a child with

bronchiolitis (74). In this protocol, following the standard ultracentrifugation, DNase

treatment, nucleic acid purification, restriction digestion, linker ligation, and PCR

amplification, a second set of PCRs was performed to identify differentially expressed bands

present only in the putatively infected sample. There were 16 such bands that were cloned

and sequenced. Of these, 13 had limited sequence similarity to known coronaviruses. Once

completely sequenced, genome analysis demonstrated that HCoV-NL63 is most closely

related to HCoV-229E, a known respiratory pathogen, with 65% nucleotide identity.

Subsequent studies have associated HCoV-NL63 with croup, and infection rates approaching

80% by the age of 6 as defined by serology have been reported (22). Of note, a third novel

coronavirus, coronavirus HKU1, was identified using consensus PCR from a patient with

pneumonia (82).

Application of DNase-SISPA to examine plasma samples from patients with febrile illness

resulted in the identification of sequences with only limited identity to known members of the

family Parvoviridae and the Anellovirus genus in 2005 (41). The entire sequence was

obtained for the novel parvovirus, Parvovirus 4 (PARV4), and phylogenetic analysis revealed

that the greatest similarity was only 24 to 29% identity to open reading frame 1 (ORF1)

of Adeno-associated virus and Avian parvovirus. PARV4 has been detected in blood, bone

marrow, and lymphoid tissue from patients with either HCV or human immunodeficiency

virus/AIDS and in plasma of kidney transplant patients, and high frequencies of exposure

have been reported for hemophiliacs and injection drug users. Two novel anelloviruses, SA1

and SA2, were also identified in this study. Notably, these viruses were highly divergent from

known viruses and shared only 32 to 35% similarity to TT virus (TTV). TTVs and TTV-related

viruses have been ubiquitously found to infect humans, but there has been no direct causal

evidence to link these viruses to any specific disease. The finding of these three novel

viruses with very low amino acid similarity to known viruses highlights the importance of

sequencing over methods that require nucleic acid homology for detection, as these viruses

would likely have been missed by those methods.

A similar method was also used to identify several novel sequences with similarity to

anelloviruses in the blood of healthy donors (9). Blood samples were subjected to density

centrifugation followed by chloroform and DNase treatment before nucleic acid extraction.

The nucleic acids were amplified using a polymerase with strand displacement activity,

randomly sheared, ligated to linkers, and then PCR amplified before cloning and sequencing.

Using this technique, seven sequences with limited similarity to known members of the

genus Anellovirus were identified. Detailed analysis of two of these sequences demonstrated

that one shared 35% amino acid identity to SA2, which had been discovered only

months earlier, while the second sequence shared 63% amino acid identity to SEN virus,

another known anellovirus. As discussed previously, the role of anelloviruses in the causation

of disease has not been established.

The discovery of Human bocavirus (HBoV) in 2005 relied on yet another slight variation of

the DNase-SISPA strategy (3). Pooled respiratory secretions from multiple patients with

unexplained respiratory illness were ultracentrifuged to concentrate viral particles and DNase

treated. Following nucleic acid extraction, a random-primer-linker-based amplification

(similar to that used for the previously described DNA microarray studies) was used.

Amplicons of 600 to 1,500 bp were cloned and sequenced using high-throughput Sanger

sequencing (one 384-well plate). Sequences were identified with amino acid similarity to

known Parvoviridae family members, Bovine parvovirus and Canine minute virus. The original

sample was identified from the pool, and the complete genome was obtained. Phylogenetic

analysis demonstrated that this novel genome is a previously uncharacterized species of the

genus Bocavirus, HBoV. Subsequent studies of this virus have demonstrated that HBoV is

frequently detected in children with respiratory tract infection, children with asthma

exacerbation, and children with acute gastroenteritis. Seroepidemiology studies have

confirmed infection in a Japanese cohort, with 71.1% overall prevalence with exposure by

age 6 (26), while a second study in Sweden reported a lower rate, 33%, in a cohort of

children with acute wheezing (43).

Using the exact same method on pooled respiratory secretions, a novel member of the

family Polyomaviridae, KI polyomavirus (KIV), was identified in 2007 (1). A single sequence

read of 363 bp was identified that had limited similarity to the simian virus 40 (SV40) VP1

protein, and using primers to span the circular genome resulted in the completion of the

genome of 5,040 bp. KIV has approximately 36 to 48% amino acid identity to the BK virus,

JC virus, and SV40 T antigens, the most conserved proteins of the family. As described in the

initial publication and subsequent publications, KIV has been identified in both respiratory

secretions and feces. Although no clear disease association has been demonstrated to date,

seroprevalence rates ranging from 55 to 90% (60) have been described, indicating that KIV

infection is relatively ubiquitous.

The discovery of the WU polyomavirus (WUV) utilized a similar strategy of high-throughput

Sanger sequencing analysis of respiratory secretions, although in this instance, individual

samples rather than a pool were analyzed (34). Total nucleic acid of the respiratory

secretions of a child with pneumonia was randomly amplified, and 384 clones were

sequenced. The library contained six sequence reads that shared 35 to 50% amino acid

identity to JC virus and SV40. The genome of WUV followed the canonical organization of the

family Polyomaviridae and was 5,229 bp. Initial experiments as well as numerous subsequent

publications have identified WUV in respiratory secretions. It has also been found in feces,

blood (whole, plasma, and serum), cerebrospinal fluid, and lymphoid tissue. Using

seroprevalence as a measure, infection rates with WUV range from 69 to 98% (60), but no

disease association has been established to date.

Saffold virus, a novel member of the Cardiovirus genus, was discovered using DNase-SISPA

on a virus cultured from a patient stool sample (42). The following year, Saffold-like viruses

were found in patients with acute enteritis and in respiratory secretions from three countries

(24). In parallel, a series of related cardioviruses were identified using the ViroChip and

subsequent PCR screening (15). To date, there has been one study examining the

seroepidemiology of Saffold virus. Using a virus neutralization assay to Saffold virus 3, a

seropositivity rate of 75% was observed by 24 months, which increased to ~90% in older

children and adults (83).

The first application of high-throughput sequencing to identify viruses in diarrhea patients by

shotgun sequencing resulted in the identification of a novel species in the

family Astroviridae, Astrovirus MLB1 (28). The methodology used was essentially identical to

that used in the discovery of WUV except that a filtration step was implemented to minimize

the recovery and amplification of bacterial sequences. Sequencing of a stool sample from a

3-year-old child from Australia with acute diarrhea resulted in the identification of seven

sequences with 67% or less amino acid identity to known astroviruses. Phylogenetic analysis

of the complete genome demonstrated that it was a highly divergent astrovirus. Subsequent

studies have described astrovirus MLB1 in 4 out of 254 additional stool samples of children

with diarrhea (29). To date, no case control studies have been described.

A novel genus, Cosavirus, of the family Picornaviridae was first proposed in 2008 following

the identification of a handful of novel viruses with limited similarity to other picornaviruses.

The first isolate, human cosavirus A1 (HCoSV-A1), was identified from a stool sample of a

child with nonpolio acute flaccid paralysis from Pakistan (45). Analysis of the complete

genome of 7,634 bp established that this virus shared 33 to 49% amino acid identity to its

closest known relative, Seneca Valley virus (37), a member of the genus Cardiovirus.

Subsequent PCR screening of a cohort composed of 57 symptomatic patients and 9 healthy

contacts resulted in 34 positives that could be classified into four distinct genetic groups (A

to D) based on sequence analysis and phylogeny. A proposed genetic group E has also been

described following its identification from a child with acute diarrhea in Australia (38).

Two independent groups using similar Sanger sequencing-based screening of stool samples

described Human bocavirus 2 (HBoV2), a putative new species of the Parvoviridae family

(5, 44). In a similar cohort of patients with acute flaccid paralysis from which human

cosaviruses A to D were discovered, Kapoor et al. identified HBoV2 from two consecutive

stool samples of a child with acute flaccid paralysis. The entire genome was sequenced; it

shares 67 to 80% similarity to the corresponding HBoV proteins (44). In parallel, HBoV2 was

also identified in an Australian cohort of children with acute gastroenteritis. In this study,

analysis of cases and controls revealed a statistically significant association between infection

with HBoV2 and acute gastroenteritis (5). In subsequent PCR-based screens to define the

prevalence of HBoV2, yet another divergent bocavirus, HBoV3, was identified (5).

These discoveries demonstrate that sequence-independent amplification followed by limited

Sanger capillary sequencing (typically ≤384 clones) is a robust method for identification of

novel viruses present in clinical specimens. With the advent of next-generation sequencing

technology, samples could be sequenced in much greater depth, allowing for detection of

microbes present at lower titers as well as facilitating the generation of complete genomes of

novel microbes.

The discovery in 2008 of Merkel cell polyomavirus (MCPyV) was the first study describing

identification of a novel virus using a next-generation sequencing platform (Roche/454 FLX

platform) (27). In this instance, cDNA libraries were made from

Merkel cell carcinoma (MCC) tumors and then sequenced using the 454 FLX platform. From

~382,000 high-quality sequence reads generated, one fragment had detectable sequence

similarity to a known polyomavirus. Further analysis demonstrated that a highly divergent

polyomavirus genome, that of MCPyV, was present in the majority of the MCC tumors

examined. Subsequent studies have corroborated this finding, and mapping of integration

sites demonstrated that in several instances the virus was clonally integrated in the

respective tumors. Given the very low abundance of MCPyV mRNA sequences in these

samples, the detection of viral transcripts would not have been possible without the use of

the next-generation platform, which enabled the samples to be sequenced deeply in a cost-effective

fashion. If the same study had been attempted with Sanger sequencing to the same

depth, the effort would have been prohibitively expensive.
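
Why sequencing depth matters for low-abundance targets can be made concrete with a small probability calculation. If a fraction f of all sequenceable fragments in a library is viral and reads are sampled independently, the chance of seeing at least one viral read among N reads is 1 - (1 - f)^N. The Python sketch below applies this formula; treating the single viral fragment among ~382,000 reads as f = 1/382,000 is an illustrative assumption, not a figure reported by the study, and the function names are hypothetical.

```python
import math


def p_detect(fraction: float, n_reads: int) -> float:
    """Probability of sampling at least one target read among n_reads,
    assuming independent sampling and a fixed target fraction."""
    return 1.0 - (1.0 - fraction) ** n_reads


def reads_needed(fraction: float, target: float = 0.95) -> int:
    """Smallest read count giving at least `target` detection probability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - fraction))


if __name__ == "__main__":
    f = 1 / 382_000  # assumed viral fraction, for illustration only

    # A 384-clone Sanger plate versus a ~382,000-read 454 run.
    print(f"384 reads:     {p_detect(f, 384):.4f}")
    print(f"382,000 reads: {p_detect(f, 382_000):.4f}")
    print(f"reads for 95%: {reads_needed(f, 0.95):,}")
```

Under this assumption, a 384-clone plate would detect the virus only about one time in a thousand, whereas a run of ~382,000 reads succeeds with probability close to 1 - 1/e (about 0.63), which illustrates why detection at this abundance was impractical before deep sequencing became affordable.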

Next-generation sequencing played a pivotal role in defining the etiology of a mysterious

case cluster of five patients with undiagnosed hemorrhagic fever (10). RNAs from two

postmortem liver biopsy samples and one serum sample were randomly amplified and

sequenced with 454. Analysis of the approximately 300,000 sequences generated yielded

nine fragments with limited sequence similarity to viruses in the

genus Arenavirus. Phylogenetic analysis of the novel virus Lujo virus demonstrated that it

branched from the Old World arenavirus complex and had the greatest identity to Mobala

virus, Lassa fever virus, and Tamiami virus, with 67 to 74% amino acid identity in the

nucleoprotein. Further examination of the receptor-binding portion of G1 demonstrated that

Lujo virus is equally distant from the Old World and New World arenaviruses.
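
Percent amino acid identity figures such as those quoted above are derived from pairwise alignments. The following is a minimal sketch of one common way to compute the value, assuming the two sequences have already been aligned and that columns containing a gap are excluded from the denominator (other conventions exist). The function name and the toy sequences are hypothetical and are not taken from any of the viruses discussed.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where either sequence has a gap are skipped (an assumed
    convention; some tools divide by the full alignment length instead).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    compared = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == gap or res_b == gap:
            continue  # ignore gapped columns
        compared += 1
        if res_a == res_b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy, made-up peptide fragments used only to exercise the function.
identity = percent_identity("MKTAYIAK", "MKSAY-AK")
print(f"{identity:.1f}% identity over ungapped columns")
```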

The identification of Human klassevirus 1, a novel picornavirus, also utilized 454 sequencing.

This virus was most similar to members of the genus Kobuvirus and has been detected in

both human stool specimens and raw sewage (36, 39). Greninger et al. were able to

sequence the complete genome of 7,889 bp [excluding the poly(A) tail] from an infant with

gastroenteritis of unknown etiology. Subsequent screening of 751 stool samples identified a

second positive sample, which turned out to be from the twin sibling of the index case. Holtz

et al. identified a similar virus in an acute diarrheal sample collected in 1984 from a child in

Australia. Reverse transcriptase PCR (RT-PCR) screening for klassevirus 1 resulted in the

identification of two slightly divergent isolates, one from raw sewage collected in Barcelona,

Spain, and one from a pediatric patient with acute diarrhea (out of 340 pediatric stool

specimens tested). Given the low similarity of human klassevirus 1 to Aichi virus (34.8 to

43.3% amino acid identity in the P1, P2, and P3 coding regions), a novel

genus, Klassevirus, has been proposed.

In an unexplained outbreak of gastrointestinal illness, a novel astrovirus, Astrovirus

VA1, was identified by simultaneous Sanger and 454 mass sequencing efforts (30). The

complete genome of 6,586 bp was sequenced using a combination of Sanger shotgun

sequencing and targeted RT-PCR and rapid amplification of cDNA ends. In parallel, 454

sequencing alone generated a contig of 6,581 bp, demonstrating again the benefits of the

next-generation platforms. In the most conserved region, ORF1B, VA1 shared 61% amino

acid identity to mink astrovirus and 62% amino acid identity to ovine astrovirus. RT-PCR

screening of the six samples from the outbreak demonstrated that three samples were

unequivocally positive, with high copy numbers. While these initial results support a potential

role for VA1 in this outbreak, further studies are necessary to explicitly define the

relationship between VA1 and human diarrhea.

ASSESSING THE ROLE OF PATHOGENICITY

From the above examples, it has become increasingly clear that tremendous microbial

diversity is being uncovered and many more microbes remain to be discovered. The pace at

which new microbes (and viruses in particular) in clinical samples from humans are being

discovered is growing geometrically. The challenge that now faces the scientific community is

how best to define the relevance of the growing list of new microbes to human disease. This

has long been a challenge in the study of infectious diseases. In 1890, Robert Koch published

a set of postulates in an attempt to standardize the evidence needed to demonstrate a

causal role for a microbe in a disease. Koch’s postulates are well known to this day and,

despite being over 100 years old, still serve as guidelines for proof of causality. They are as

follows.

1. The parasite occurs in every case of the disease in question and under circumstances

which can account for the pathological changes and clinical course of the disease.

2. The parasite occurs in no other disease as a fortuitous and nonpathogenic parasite.

3. After being fully isolated from the body and repeatedly grown in pure culture, the parasite

can induce disease anew.

A major challenge in the fulfillment of Koch’s postulates, especially in a molecular era, is that

many microbes cannot be grown in pure culture. Another limitation is that microbes that

either have a carrier state or can cause subclinical infections, such as Neisseria

meningitidis and Mycobacterium tuberculosis, violate Koch’s postulates. Other scenarios that

limit the applicability of Koch’s postulates include cases in which coinfection with more than

one microbe causes disease, or situations in which the host genetic background contributes

to the disease state.

Over the years, various incarnations of Koch’s postulates have been formulated. Bradford Hill

(1965) and Alfred Evans (1976) proposed broader criteria for causation, including

epidemiological and immunological data. Most recently, a guide for disease causality that

accounts for molecular methods of microbial detection has been proposed by Fredericks and

Relman (33). These revisions of Koch’s postulates have remained focused on the traditional

concept that disease arises from the presence of a foreign microbe (and the biological

consequences of its presence). However, in the genomic era, the concept of a “pathogen”

and how it causes disease must be reimagined. With the increasing

recognition that humans (and animals) are hosts to large communities of bacteria and

viruses, more complex models of human disease, such as those resulting from imbalances or

alterations in the endogenous community of microbes, must be entertained. For example,

researchers have begun to elucidate the complex role of microbial communities in obesity by

analysis of the microbiomes of animal models of human disease, as well as humans

themselves.

In a genetic model of obesity, sequencing of 16S ribosomal DNA (rDNA) of the distal gut of

genetically obese (ob/ob) mice and their lean (ob/+) and wild-type (+/+) siblings

demonstrated that the microbial composition in the ob/ob mice differed in the relative

abundance of Bacteroidetes and Firmicutes (53). Specifically, ob/ob animals have a 50%

reduction in the number of Bacteroidetes and a proportional increase in Firmicutes compared

to lean (ob/+) mice. Similar results were obtained in human studies in which 12 obese

humans were assigned to either a fat- or carbohydrate-restricted diet, and their gut

microbial composition was monitored over one year by 16S rDNA sequencing (54).

Obese humans have fewer Bacteroidetes and more Firmicutes than lean controls.

Whether this imbalance is the cause of disease or whether the imbalance is a consequence of

the disease is currently unclear. Regardless, these observations demonstrate that a

“pathogenic state,” in this case obesity, can be associated with the makeup of a microbial

population rather than the presence or absence of a specific, singular, “causal microbe.”

Thus, in efforts to define the role of the newly identified microbes in human disease, we must

not limit ourselves to the traditional one-microbe, one-disease definition of a pathogen.
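
The community-level comparison described above rests on relative abundances computed from classified 16S reads. The sketch below shows one simple way to tabulate them, assuming each read has already been assigned to a phylum by some upstream classifier. The function names and the read counts are invented for illustration and are not data from the cited studies.

```python
from collections import Counter


def relative_abundance(phylum_calls):
    """Fraction of classified reads assigned to each phylum."""
    counts = Counter(phylum_calls)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads were classified")
    return {phylum: n / total for phylum, n in counts.items()}


def firmicutes_to_bacteroidetes(abundance):
    """Ratio of Firmicutes to Bacteroidetes relative abundance."""
    bacteroidetes = abundance.get("Bacteroidetes", 0.0)
    if bacteroidetes == 0:
        raise ValueError("no Bacteroidetes reads; ratio is undefined")
    return abundance.get("Firmicutes", 0.0) / bacteroidetes


# Made-up per-read phylum calls for two hypothetical samples.
lean_calls = ["Bacteroidetes"] * 40 + ["Firmicutes"] * 50 + ["Other"] * 10
obese_calls = ["Bacteroidetes"] * 20 + ["Firmicutes"] * 70 + ["Other"] * 10

for label, calls in (("lean", lean_calls), ("obese", obese_calls)):
    abundance = relative_abundance(calls)
    ratio = firmicutes_to_bacteroidetes(abundance)
    print(f"{label}: Bacteroidetes={abundance['Bacteroidetes']:.2f}, "
          f"Firmicutes={abundance['Firmicutes']:.2f}, F/B ratio={ratio:.2f}")
```

A summary of this kind describes the state of the whole community rather than the presence or absence of any single organism, which is the shift in perspective argued for above.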
