5 Phyloseq Objects Demo Datasets

In this section, we use publicly available datasets to demonstrate phyloseq objects and the analysis workflows built on them. Each dataset ships with an R package, so readers can load exactly the same objects and reproduce the examples. For every dataset we keep two forms: the phyloseq object itself (prefixed ps_) and a long-format data frame produced by psmelt() (prefixed df_).

5.1 The GlobalPatterns dataset

The GlobalPatterns dataset ships with the phyloseq package and comes from the study by Caporaso et al. (2011, PNAS), "Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample". Here’s a breakdown of its key attributes:

Source: GlobalPatterns is bundled with phyloseq and derives from Caporaso et al. (2011), which compared microbial communities across environmental and host-associated samples.
Composition: It contains 16S rRNA gene amplicon sequencing data summarized as OTU counts, with a taxonomic assignment for each OTU.
Scope: The 26 samples cover nine sample types, including soil, ocean, freshwater, sediment, human skin, tongue and feces, plus mock communities.
Format: It is a phyloseq object in R that combines an OTU table, sample metadata, a taxonomy table, and a phylogenetic tree.
Utility: It is commonly used to demonstrate community profiling, diversity analysis, and differential abundance testing across contrasting habitats.

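library(phyloseq) # for GlobalPatterns dataset
library(dplyr)    # for %>% and rename_all()
data("GlobalPatterns")

ps_GlobalPatterns <- GlobalPatterns

# Melt to long format: one row per OTU-sample pair.
# Note: sample_id stores the row names of the melted table;
# the sample identifier itself is in the `sample` column.
df_GlobalPatterns <- GlobalPatterns %>%
  phyloseq::psmelt() %>%
  tibble::rownames_to_column("sample_id") %>%
  rename_all(tolower)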

GlobalPatterns phyloseq-class

ps_GlobalPatterns
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]

Columns in GlobalPatterns dataset

colnames(df_GlobalPatterns)
 [1] "sample_id"                "otu"                     
 [3] "sample"                   "abundance"               
 [5] "x.sampleid"               "primer"                  
 [7] "final_barcode"            "barcode_truncated_plus_t"
 [9] "barcode_full_length"      "sampletype"              
[11] "description"              "kingdom"                 
[13] "phylum"                   "class"                   
[15] "order"                    "family"                  
[17] "genus"                    "species"                 
cat("\n")
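The dimensions printed above determine the size of the melted table: psmelt() returns one row per OTU-sample pair, so the long data frame should have 19216 × 26 rows. The following is a minimal sketch of a few sanity checks on the object, assuming only that phyloseq is installed; `expected_rows` is a name introduced here for illustration.

```r
library(phyloseq)
data("GlobalPatterns")

# psmelt() yields one row per OTU-sample pair (zero counts included),
# so the long table should have ntaxa * nsamples rows
expected_rows <- ntaxa(GlobalPatterns) * nsamples(GlobalPatterns)
expected_rows # 19216 * 26 = 499616

# Number of samples per sample type (SampleType is a sample metadata column)
table(sample_data(GlobalPatterns)$SampleType)

# Distribution of sequencing depth (total reads) per sample
summary(sample_sums(GlobalPatterns))
```

Comparing `expected_rows` with `nrow(df_GlobalPatterns)` is a quick way to confirm that no rows were lost during melting.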

5.2 The dietswap dataset

The dietswap dataset, available through the microbiome R package, comes from a two-week diet-exchange study between African Americans and rural Africans (O'Keefe et al., 2015, Nature Communications). Here’s an overview of its attributes:

Source: The dataset is derived from research examining how a short-term switch between a western and a traditional rural African diet affects the human gut microbiome.
Composition: Unlike the other datasets in this section, the abundances come from HITChip phylogenetic microarray profiling rather than amplicon sequencing. The 130 taxa are genus-like groups.
Scope: The 222 samples were taken from human participants at several timepoints during the intervention. The metadata records subject, sex, nationality, intervention group, timepoint, and BMI group.
Format: It is a phyloseq object with an OTU table, sample metadata, and a three-rank taxonomy table (phylum, family, genus). It has no phylogenetic tree.
Utility: It is well suited to demonstrating longitudinal and group comparisons, for example how gut community composition shifts within subjects after a dietary change.

library(microbiome) # for dietswap dataset
data("dietswap")

dietswap
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 130 taxa and 222 samples ]
sample_data() Sample Data:       [ 222 samples by 8 sample variables ]
tax_table()   Taxonomy Table:    [ 130 taxa by 3 taxonomic ranks ]
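The printed summary lists three components and no phy_tree() line. As a rough sketch, the same check can be made programmatically. The helper `has_tree()` is hypothetical (it is not part of phyloseq), and GlobalPatterns is used as a stand-in so that the example needs only the phyloseq package.

```r
library(phyloseq)
data("GlobalPatterns")

# Hypothetical helper: TRUE if the object carries a phylogenetic tree.
# errorIfNULL = FALSE makes phy_tree() return NULL instead of raising an error.
has_tree <- function(ps) {
  !is.null(phy_tree(ps, errorIfNULL = FALSE))
}

# Rebuild GlobalPatterns without its tree to mimic a tree-less object
ps_no_tree <- phyloseq(
  otu_table(GlobalPatterns),
  sample_data(GlobalPatterns),
  tax_table(GlobalPatterns)
)

has_tree(GlobalPatterns) # TRUE
has_tree(ps_no_tree)     # FALSE
```

Applied to dietswap, such a check would be expected to return FALSE, matching the summary above.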

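Note: dietswap has no phy_tree slot. To demonstrate functions that require a tree, we can generate a random tree over its taxa and add it as shown below. This tree is only a placeholder: it does not reflect real evolutionary relationships, so tree-based results (for example UniFrac distances) computed with it are not biologically meaningful.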

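library(microbiome) # for dietswap dataset
library(ape)        # for rtree()
library(dplyr)      # for %>%, select() and rename_all()
data("dietswap")
ps_raw_basic <- dietswap

# Random placeholder tree; the seed value is arbitrary and only makes the tree reproducible
set.seed(123)
ps_tree <- rtree(ntaxa(ps_raw_basic), rooted = TRUE, tip.label = taxa_names(ps_raw_basic))
ps_dietswap <- phyloseq::merge_phyloseq(ps_raw_basic, ps_tree)

df_dietswap <- ps_dietswap %>%
  phyloseq::psmelt() %>%
  tibble::rownames_to_column("sample_id") %>%
  # Drop column 9 (apparently the metadata column `sample`), which would
  # duplicate psmelt's `Sample` column once names are lowercased
  dplyr::select(-9) %>%
  rename_all(tolower)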

Dietswap phyloseq-class

ps_dietswap
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 130 taxa and 222 samples ]
sample_data() Sample Data:       [ 222 samples by 8 sample variables ]
tax_table()   Taxonomy Table:    [ 130 taxa by 3 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 130 tips and 129 internal nodes ]
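The tip and node counts above follow from the tree's shape: a rooted, fully bifurcating tree with n tips has n - 1 internal nodes, hence 130 tips and 129 internal nodes. A small sketch with ape illustrates this on a toy tree; the tip labels and the seed are made up for illustration.

```r
library(ape)

# Arbitrary seed so the random topology is reproducible
set.seed(2024)

# Hypothetical taxon names for a toy example
tip_labels <- paste0("taxon_", 1:5)
toy_tree <- rtree(n = length(tip_labels), rooted = TRUE, tip.label = tip_labels)

Ntip(toy_tree)      # 5 tips
Nnode(toy_tree)     # 4 internal nodes (n - 1)
is.rooted(toy_tree) # TRUE
```

The topology and branch lengths are drawn at random, which is why such a tree is suitable for exercising code paths but not for biological interpretation.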

Columns in Dietswap dataset

colnames(df_dietswap)
 [1] "sample_id"              "otu"                    "sample"                
 [4] "abundance"              "subject"                "sex"                   
 [7] "nationality"            "group"                  "timepoint"             
[10] "timepoint.within.group" "bmi_group"              "phylum"                
[13] "family"                 "genus"
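The column list shows seven metadata columns although the object has eight sample variables, because one column was dropped before lowercasing. The sketch below illustrates the likely reason on a toy data frame, under the assumption that the dropped column is a metadata variable named `sample`: lowercasing `Sample` and `sample` would produce two columns with the same name. The toy values are invented.

```r
# Toy stand-in for a melted table: psmelt's `Sample` plus a metadata `sample`
toy <- data.frame(
  OTU = "taxon_1",
  Sample = "Sample-1",
  Abundance = 10,
  sample = "Sample-1"
)

lowered <- tolower(names(toy))
lowered                # "otu" "sample" "abundance" "sample"
anyDuplicated(lowered) # 4: position of the first duplicated name

# Dropping the metadata column by name avoids the clash
toy_clean <- toy[, names(toy) != "sample", drop = FALSE]
names(toy_clean) <- tolower(names(toy_clean))
names(toy_clean)       # "otu" "sample" "abundance"
```

Selecting by name rather than by position is more robust if the column order ever changes.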

5.3 Caporaso dataset

The Caporaso dataset contains 16S rRNA gene data from the "Moving Pictures of the Human Microbiome" time-series study.

Source: The dataset is included in the microbiomeMarker R package and is derived from research led by Dr. J. Gregory Caporaso (Caporaso et al., 2011, Genome Biology).
Composition: It contains amplicon sequencing data for the 16S rRNA gene, a microbial marker gene, summarized as counts for 3,426 taxa with seven taxonomic ranks and a phylogenetic tree.
Scope: The 34 samples come from human body sites (gut, tongue, and left and right palms) sampled repeatedly over time. The metadata records sample type, sampling date, subject, reported antibiotic usage, and days since the start of the experiment.
Format: It is a phyloseq object with an OTU table, sample metadata, a taxonomy table, and a tree, which makes it directly usable for marker (differential abundance) analysis.
Utility: It is a convenient small dataset for comparing communities between body sites and for demonstrating marker identification methods.

library(microbiomeMarker) # for caporaso dataset
library(dplyr)            # for %>% and rename_all()
data("caporaso")

ps_caporaso <- caporaso
df_caporaso <- caporaso %>%
  phyloseq::psmelt() %>%
  tibble::rownames_to_column("sample_id") %>%
  rename_all(tolower)

Caporaso phyloseq-class

ps_caporaso
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3426 taxa and 34 samples ]
sample_data() Sample Data:       [ 34 samples by 8 sample variables ]
tax_table()   Taxonomy Table:    [ 3426 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 3426 tips and 3424 internal nodes ]

Columns in Caporaso dataset

colnames(df_caporaso)
 [1] "sample_id"                "otu"                     
 [3] "sample"                   "abundance"               
 [5] "sampletype"               "year"                    
 [7] "month"                    "day"                     
 [9] "subject"                  "reportedantibioticusage" 
[11] "dayssinceexperimentstart" "description"             
[13] "kingdom"                  "phylum"                  
[15] "class"                    "order"                   
[17] "family"                   "genus"                   
[19] "species"

5.4 Kostic_CRC dataset

The Kostic_CRC dataset profiles the microbiome associated with colorectal cancer (CRC).

Source: The dataset is included in the microbiomeMarker R package. It is derived from the study by Kostic et al. (2012, Genome Research), which investigated the microbiome in colorectal cancer patients.
Composition: It contains 16S rRNA gene amplicon sequencing data summarized as counts for 2,505 taxa with seven taxonomic ranks.
Scope: The 177 samples come from individuals diagnosed with colorectal cancer. The metadata is extensive (71 variables) and includes a diagnosis variable along with clinical and technical fields such as age, stage, and sequencing platform.
Format: It is a phyloseq object with an OTU table, sample metadata, and a taxonomy table, structured for marker analysis.
Utility: It is typically used to look for taxa that differ in abundance between diagnosis groups, that is, candidate microbial biomarkers of CRC.

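library(microbiomeMarker) # for kostic_crc dataset
library(ape)              # for rtree()
library(dplyr)            # for %>% and rename_all()
data("kostic_crc")

ps_raw_basic <- kostic_crc

# Attach a random placeholder tree (not a real phylogeny);
# the seed value is arbitrary and only makes the tree reproducible
set.seed(123)
ps_tree <- rtree(ntaxa(ps_raw_basic), rooted = TRUE, tip.label = taxa_names(ps_raw_basic))
ps_kostic_crc <- phyloseq::merge_phyloseq(ps_raw_basic, ps_tree)

df_kostic_crc <- ps_kostic_crc %>%
  phyloseq::psmelt() %>%
  tibble::rownames_to_column("sample_id") %>%
  rename_all(tolower)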

Kostic_crc phyloseq-class

ps_kostic_crc
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 2505 taxa and 177 samples ]
sample_data() Sample Data:       [ 177 samples by 71 sample variables ]
tax_table()   Taxonomy Table:    [ 2505 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 2505 tips and 2504 internal nodes ]

Columns in Kostic_crc dataset

colnames(df_kostic_crc)
 [1] "sample_id"                     "otu"                          
 [3] "sample"                        "abundance"                    
 [5] "x.sampleid"                    "barcodesequence"              
 [7] "linkerprimersequence"          "necrosis_percent"             
 [9] "target_subfragment"            "assigned_from_geo"            
[11] "experiment_center"             "title"                        
[13] "run_prefix"                    "age"                          
[15] "normal_equivalent_percent"     "fibroblast_and_vessel_percent"
[17] "depth"                         "treatment"                    
[19] "age_at_diagnosis"              "common_name"                  
[21] "host_common_name"              "body_site"                    
[23] "elevation"                     "reports_received"             
[25] "cea"                           "pcr_primers"                  
[27] "collection_date"               "altitude"                     
[29] "env_biome"                     "sex"                          
[31] "platform"                      "race"                         
[33] "bsp_diagnosis"                 "study_center"                 
[35] "country"                       "chemotherapy"                 
[37] "year_of_death"                 "ethnicity"                    
[39] "anonymized_name"               "taxon_id"                     
[41] "sample_center"                 "samp_size"                    
[43] "year_of_birth"                 "original_diagnosis"           
[45] "age_unit"                      "study_id"                     
[47] "experiment_design_description" "description_duplicate"        
[49] "diagnosis"                     "body_habitat"                 
[51] "sequencing_meth"               "run_date"                     
[53] "histologic_grade"              "longitude"                    
[55] "env_matter"                    "target_gene"                  
[57] "env_feature"                   "key_seq"                      
[59] "body_product"                  "tumor_percent"                
[61] "library_construction_protocol" "region"                       
[63] "run_center"                    "tumor_type"                   
[65] "bsp_notes"                     "radiation_therapy"            
[67] "inflammation_percent"          "host_subject_id"              
[69] "pc3"                           "latitude"                     
[71] "osh_diagnosis"                 "stage"                        
[73] "primary_disease"               "host_taxid"                   
[75] "description"                   "kingdom"                      
[77] "phylum"                        "class"                        
[79] "order"                         "family"                       
[81] "genus"                         "species"

5.5 Save objects for transformation and exploration

# Create the output directory if it does not exist yet
dir.create("data", showWarnings = FALSE)

save(df_GlobalPatterns,
     df_dietswap,
     df_caporaso,
     df_kostic_crc,
     file = "data/dataframe_objects.rda")

save(ps_GlobalPatterns,
     ps_dietswap,
     ps_caporaso,
     ps_kostic_crc,
     file = "data/phyloseq_objects.rda")

5.6 Confirm saved objects

load("data/phyloseq_objects.rda", verbose = TRUE)
Loading objects:
  ps_GlobalPatterns
  ps_dietswap
  ps_caporaso
  ps_kostic_crc
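Note that load() restores objects into the calling environment and overwrites any existing objects with the same names. A minimal sketch of a safer check, using a toy data frame and a temporary directory (both stand-ins, not the chapter's real files), loads the file into a separate environment and compares the result with the original.

```r
# Write to a temporary location so the example leaves no files behind
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Toy object standing in for one of the saved data frames
toy_df <- data.frame(sample_id = c("1", "2"), abundance = c(10L, 0L))
rda_path <- file.path(out_dir, "toy_objects.rda")
save(toy_df, file = rda_path)

# Load into a fresh environment instead of the global one
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)

loaded_names                          # names of the restored objects
identical(restored$toy_df, toy_df)    # TRUE if the round trip preserved the object
```

The same pattern applies to data/phyloseq_objects.rda: the character vector returned by load() lists the objects the file contained.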
