2 Downloading NCBI-SRA metadata

2.1 Demo Projects for metadata exploration

In this demonstration, we will delve into sample metadata obtained from various randomly selected public microbiome NCBI BioProjects. This includes information from:

  1. PRJNA477349: 16S rRNA from bushmeat samples collected from Tanzania Metagenome (Multispecies).
  2. PRJNA802976: Changes to Gut Microbiota following Systemic Antibiotic Administration in Infants (Multispecies).
  3. PRJNA322554: The Early Infant Gut Microbiome Varies In Association with a Maternal High-fat Diet (Multispecies).
  4. PRJNA937707: Microbiome associated with spotting disease in the purple sea urchin (Multispecies).
  5. PRJNA589182: 16S rDNA gene sequencing of the phyllosphere endophytic bacterial communities colonizing wild Populus trichocarpa Raw sequence reads(Multispecies).
  6. PRJEB13870: Dysbiosis of gut microbiota contributes to the pathogenesis of hypertension (Multispecies).
  7. PRJNA208226: Colonization patterns of soil microbial communities in the Atacama Desert (Multispecies).

2.2 Download Methods

Various methods are available for downloading sample metadata from the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA). Each method provides slightly different information, making it prudent to explore and select the one that best aligns with your specific needs and preferences.

2.2.1 Manually via SRA Run Selector

We can manually retrieve metadata from the SRA database via the SRA Run Selector.

  • Note that the SRA filename for metadata is automatically named SraRunTable.txt.
  • Users can change the default TXT extension to like CSV if preferred.
  • In our demo, we will use CSV to save the metadata file in the data/ folder.
Example screen shot of SRA Run Selector for metadata associated with the NCBI-SRA bioproject number PRJNA477349
Example screen shot of SRA Run Selector for metadata associated with the NCBI-SRA bioproject number PRJNA477349

2.2.2 Computationally via Entrez Direct

You can retrieve SRA metadata computationally using Entrez Direct, a set of tools provided by NCBI for accessing the Entrez databases. Here’s a step-by-step guide:

  1. Download Entrez Direct You can download and install Entrez Direct by following the instructions on the NCBI GitHub repository.

  2. Add Entrez Direct to PATH After downloading, add the edirect folder to your system’s PATH environment variable. This allows you to run Entrez Direct commands from any location in your terminal.

export PATH=$PATH:/path/to/edirect

Make sure to replace “/path/to/edirect” with the actual path to the edirect folder on your system.

  1. Download SRA Metadata Use the following Entrez Direct commands to download SRA metadata for specific BioProjects. Replace “PRJ…” with the BioProject accessions of interest.
#!/bin/bash

esearch -db sra -query 'PRJNA477349[bioproject]' | efetch -format runinfo >data/runinfo_PRJNA477349_metadata.csv;
esearch -db sra -query 'PRJNA802976[bioproject]' | efetch -format runinfo >data/runinfo_PRJNA802976_metadata.csv;
esearch -db sra -query 'PRJNA322554[bioproject]' | efetch -format runinfo >data/runinfo_PRJNA322554_metadata.csv;
esearch -db sra -query 'PRJNA937707[bioproject]' | efetch -format runinfo >data/runinfo_PRJNA937707_metadata.csv;
esearch -db sra -query 'PRJNA589182[bioproject]' | efetch -format runinfo >data/runinfo_PRJNA589182_metadata.csv;
esearch -db sra -query 'PRJEB13870[bioproject]' | efetch -format runinfo >data/runinfo_PRJEB13870_metadata.csv;
esearch -db sra -query 'PRJNA208226[bioproject]' | efetch -format runinfo >data/runinfo_PRJNA208226_metadata.csv;

2.2.3 Computationally using pysradb

  1. Create pysradb environment The pysradb tool can obtain metadata from SRA and ENA. Here we will create an independent environment and install pysradb. We can delete this env when no longer needed. To learn more click here.
  • First, we create a pysradb environment and install the pysradb tool.
  • Then we use pysradb to download the SRA metadata on CLI.
conda activate base
conda create -c bioconda -n pysradb PYTHON=3 pysradb
  1. Download using a bash script
#!/bin/bash
# Shell script: workflow/scripts/pysradb_sra_metadata.sh

pysradb metadata PRJNA477349 --detailed >data/PRJNA477349_pysradb.csv
pysradb metadata PRJNA802976 --detailed >data/PRJNA802976_pysradb.csv
pysradb metadata PRJNA322554 --detailed >data/PRJNA322554_pysradb.csv
pysradb metadata PRJNA937707 --detailed >data/PRJNA937707_pysradb.csv
pysradb metadata PRJNA589182 --detailed >data/PRJNA589182_pysradb.csv
pysradb metadata PRJEB13870 --detailed >data/PRJEB13870_pysradb.csv
pysradb metadata PRJNA208226 --detailed >data/PRJNA208226_pysradb.csv
  1. Download using a python script
# Python script: workflow/scripts/pysradb_sra_metadata.py

import os
import sys
import csv
import pandas as pd

from pysradb.sraweb import SRAweb

db = SRAweb()
df = db.sra_metadata('PRJNA477349', detailed=True)
df.to_csv('data/PRJNA477349_pysradb_metadata.csv', index=False)

db = SRAweb()
df = db.sra_metadata('PRJNA802976', detailed=True)
df.to_csv('data/PRJNA802976_pysradb_metadata.csv', index=False)

db = SRAweb()
df = db.sra_metadata('PRJNA322554', detailed=True)
df.to_csv('data/PRJNA322554_pysradb_metadata.csv', index=False)

db = SRAweb()
df = db.sra_metadata('PRJNA937707', detailed=True)
df.to_csv('data/PRJNA937707_pysradb_metadata.csv', index=False)

db = SRAweb()
df = db.sra_metadata('PRJNA589182', detailed=True)
df.to_csv('data/PRJNA589182_pysradb_metadata.csv', index=False)

db = SRAweb()
df = db.sra_metadata('PRJEB13870', detailed=True)
df.to_csv('data/PRJEB13870_pysradb_metadata.csv', index=False)

db = SRAweb()
df = db.sra_metadata('PRJNA208226', detailed=True)
df.to_csv('data/PRJNA208226_pysradb_metadata.csv', index=False)

I sometimes experience ConnectionError when using python method. Try a different method if that happens.

2.3 Filtering user-specified information

Using keywords to search any extensive database helps filter user-specified information, such as certain disease-related studies.

#!/bin/bash
# Install pysradb using mamba package manager
mamba install -c bioconda pysradb

# Search the SRA database for studies related to "Amplicon" and limit to 100 results
pysradb search --db sra -q Amplicon --max 100 > sra_amplicon_studies.csv

# Search the ENA database for studies related to "Amplicon" and limit to 100 results
pysradb search --db ena -q Amplicon --max 100 > ena_amplicon_studies.csv