Microbial colonization of the human gastrointestinal tract plays an important role in establishing health and homeostasis. However, the time-dependent functional signatures of microbial and human proteins during early colonization of the gut have yet to be determined. Thus, we employed shotgun proteomics to simultaneously monitor microbial and human proteins in fecal samples from a healthy preterm infant during the first month of life. Microbial community complexity and functions increased over time, with compositional changes that were consistent with previous metagenomic and rRNA gene data indicating three distinct colonization phases. Overall microbial community functions were established relatively early in development and remained stable. Detected human proteins included those responsible for epithelial barrier function and antimicrobial activity. Some neutrophil-derived proteins increased in abundance early in the study
period, suggesting activation of the innate immune system. Likewise, abundances of cytoskeletal and mucin proteins increased later in the time course, suggestive of subsequent adjustment to the increased microbial load. This study provides the first snapshot of coordinated human and microbial protein expression in the infant gut during early development.
A search database was generated from the predicted protein sequences of dominant members reconstructed from metagenomic sequences collected on days 10, 16, 18, and 21 from matched samples. These included a Serratia species UC1SER, two closely related Citrobacter strains, UC1i and UC1ii, an Enterococcus species UC1ENC, and associated virus and plasmids UC1ENCp, UC1ENCv, and UC1CITp. Since samples from early time points were not represented in the metagenomic sequences, the following additional isolate sequences, selected based on 16S rRNA data, were also included in the database: Arcobacter butzleri RM4018, Acinetobacter junii SH205, Bacteroides fragilis NCTC 9343, Bifidobacterium adolescentis ATCC 15703, Bifidobacterium longum infantis ATCC 15697, Campylobacter concisus 13826, Clostridium sporogenes ATCC 15579, Enterobacter cancerogenus ATCC 35316, Escherichia coli K12 DH10B, Eubacterium rectale ATCC 33656, Fusobacterium sp. 1_1_41FAA, Klebsiella sp. 1_1_55, Lactococcus lactis subsp. lactis KF147, Lactobacillus reuteri 100-23, Leuconostoc mesenteroides cremoris ATCC 19254, Pseudomonas aeruginosa PAO1, Staphylococcus aureus 04-02981, Streptococcus sp. 2_1_36FAA, and Weissella paramesenteroides ATCC 33313 (acquired from JGI: http://www.hmpdacc-resources.org/cgi-bin/img_hmp/main.cgi in January 2011).
Since mass spectrometry-based proteomics identifies proteins by their corresponding peptide sequences, data analysis must account for the high level of protein redundancy within and between species. Otherwise, the total number of identified proteins is inflated, or biological conclusions are skewed by over-representing proteins with the same function. Therefore, we applied a bioinformatic clustering algorithm to the database in order to improve confidence in protein identification and quantification. Microbial proteins were clustered using more stringent criteria than human proteins in order to preserve species information and distinguish the functional contributions of different community members. Specifically, using the publicly available software USEARCH v.5.0 [128], microbial proteins were clustered into a protein group if they shared 100% amino acid identity, and human proteins were clustered into a protein group if they shared ≥90% amino acid similarity. The lower human threshold reflects the higher level of redundancy in the human genome (gene duplications, splice variants, multiple protein isoforms, and the resulting larger number of paralogous proteins) and was supported by plotting similarity thresholds ranging from 0.5 to 1 against the percent proteome reduction achieved by clustering. Consistent with this choice, only 0.5% of the protein groups in the clustered microbial metaproteome had more than one member, whereas 36% of the protein groups in the clustered human proteome had multiple members. Spectral counts were assigned, balanced, normalized, and adjusted according to methods previously described, yielding adjusted NSAF values [103, 137, 138]. In total, 4,413 microbial and 3,062 human protein groups were detected across the dataset. Protein groups ranged from singletons to groups containing multiple protein isoforms.
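As a rough illustration of the normalization step, the standard (unadjusted) normalized spectral abundance factor can be sketched as below. This covers plain NSAF only and does not reproduce the balancing and adjustment steps of the cited methods. The function name `nsaf`, the input layout, and the use of a single length per protein group (for example, that of a representative sequence) are assumptions made for illustration.

```python
from typing import Dict, Tuple


def nsaf(groups: Dict[str, Tuple[float, int]]) -> Dict[str, float]:
    """Return the plain NSAF value for each protein group.

    `groups` maps a (hypothetical) group name to a tuple of
    (spectral_count, length_in_residues). Which length represents a
    multi-member group is an assumption left to the caller.
    """
    # Spectral abundance factor: spectral count divided by protein length.
    saf = {name: count / length for name, (count, length) in groups.items()}
    total = sum(saf.values())
    if total == 0:
        # No spectra at all in this run: report zero abundance everywhere.
        return {name: 0.0 for name in saf}
    # Normalize so that the values within one run sum to 1.
    return {name: value / total for name, value in saf.items()}
```

Dividing by length before normalizing keeps long proteins, which yield more peptides, from appearing more abundant than short ones with the same spectral count.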
By measuring both microbial and human proteins simultaneously in each run, we observed increasing complexity of the microbial community and a decrease in the ratio of total human to microbial proteins with time (Figure 4.4). At the earliest time point (day 7), when the initial microbial communities were being established, human proteins comprised ~96% of all proteins identified. The low microbial load may be a consequence of antibiotic administration during the first week of life for this particular infant. Human proteins comprised ~72% of the identified protein dataset on day 13, and by day 15 the percentage of human proteins had decreased to ~30%, with a concomitant increase in the number of microbial proteins detected. The ratio of human to microbial proteins remained at this level for the remaining time points, with the exception of day 20, when an unexpected rise in human proteins was detected. Microbial proteins detected in this time course study are consistent with the metagenomic inference of three distinct colonization phases with vastly different species composition. Despite temporal changes in microbial community composition, the overall functions of the community increased in complexity with time, stabilized relatively early, and remained remarkably conserved thereafter. Thus, this study provides detailed information about the microbial and human proteins in fecal samples from a newborn premature infant during the first month of life, and reveals the complex but synergistic interplay of host adaptation to microbiome establishment.
4.3 Conclusions
In this project, we developed a potential solution to the protein inference problem: clustering protein databases by sequence similarity groups together proteins that we are unlikely to distinguish analytically, while also taking shared biological function into consideration. Whereas existing approaches group proteins based on the peptides detected within a run or experiment, our approach, Clustering Unique Sequences in Proteomes (CUSPs), provides a more stable grouping that changes only with the database, not with the observed data. In addition, by comparing entire protein sequences rather than partial sequences, we are more confident that proteins are grouped based on similar biological function (i.e., multiple domains and motifs). We suggest two measures for identifying an appropriate clustering threshold: the reduction in proteome size as a result of clustering and the number of distinguishable identifications from the clustered database. While lowering the threshold will merge more proteins into shared groups, it can also discard unique information that would help pinpoint which proteins are actually present in the sample. We considered these tradeoffs for a number of complex proteomes, including Mus musculus, Populus trichocarpa, Oryza sativa, and Zea mays. Overall, each proteome has different properties that call for different identity thresholds, so future studies will need to apply this methodology to find the most appropriate identity threshold for grouping the proteome of interest. Case studies of Populus trichocarpa and the infant gut microbiome demonstrated successful implementation of CUSPs, yielding insight into the identification and quantification of proteins that would otherwise have been excluded from analysis. Therefore, CUSPs not only removes ambiguity from protein reports, but also rescues protein identifications and strengthens confidence in the identifications and abundances measured in complex proteomic studies.
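The first of the two threshold-selection measures, proteome reduction as a function of the clustering threshold, can be sketched as follows. This is a toy illustration under stated assumptions, not the CUSPs implementation and not USEARCH. It uses a simple greedy centroid scheme, and `difflib.SequenceMatcher.ratio()` stands in for a true alignment-based identity. The function names and the four short sequences are made up for the example.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    # Stand-in similarity score in [0, 1]; NOT an alignment-based identity.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> List[List[str]]:
    """Greedy clustering: the longest sequences become representatives first."""
    clusters: List[Tuple[str, List[str]]] = []  # (representative seq, members)
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep, members in clusters:
            if similarity(rep, seqs[name]) >= threshold:
                members.append(name)  # join the first matching group
                break
        else:
            clusters.append((seqs[name], [name]))  # start a new group
    return [members for _, members in clusters]


def percent_reduction(seqs: Dict[str, str], threshold: float) -> float:
    """Percent decrease in database entries after clustering."""
    if not seqs:
        return 0.0
    n_groups = len(greedy_cluster(seqs, threshold))
    return 100.0 * (1 - n_groups / len(seqs))


# Made-up toy sequences: two identical, one near-identical, one unrelated.
toy = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQR",
    "p3": "MKTAYIAKQK",
    "p4": "GGGSSSPPPL",
}

for t in (1.0, 0.9, 0.5):
    print(t, percent_reduction(toy, t))
```

On this toy input the reduction is 25% at a threshold of 1.0 and 50% at 0.9 and 0.5. Plotting such values across thresholds for a real proteome gives the kind of curve used to choose an identity cutoff, and it should be weighed against the number of distinguishable identifications that remain.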