Download data from the SRA, ENA, and GenBank

The aim of this tutorial is to demonstrate how you can download data from the Sequence Read Archive, the European Nucleotide Archive, and GenBank via the Linux command-line.

Prerequisites for this tutorial:

  • A Linux computer
  • 1GB storage space available on your machine*


1 Download data from the SRA / ENA

The Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA) are public repositories that store sequence data; at present, these databases mostly consist of short-read data generated by high-throughput sequencing. You can download raw data from both databases (usually in FASTQ format) using the general-purpose tool fastq-dl. For more extensive queries to each database, the NCBI (National Center for Biotechnology Information) and the EBI (European Bioinformatics Institute) provide their own tools: SRA Toolkit and enaBrowserTools.

We will download sequence data for Mycobacterium bovis, the bacterium that causes bovine tuberculosis, by passing accession numbers to the download tool. An accession number is an ID that is unique to a particular project (study), sequencing experiment or sequencing run, so it is what we use to request the sequencing data for a study or sample of interest.

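Accession ID for each level, using the examples from this tutorial:

  • Project: PRJNA804719
  • Experiment: SRX14525734
  • Run: SRR18391668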
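
The level of an accession can usually be read from its prefix. The sketch below is a small helper (the function name accession_level is made up for this illustration) that classifies an ID by prefix, assuming the common INSDC conventions in which the first letter S, E or D indicates the archive that issued the ID (NCBI, EBI or DDBJ). It is not part of fastq-dl and does not cover every accession type.

```shell
# Classify an SRA/ENA accession by its prefix (illustrative helper only).
accession_level() {
    case "$1" in
        PRJNA*|PRJEB*|PRJDB*) echo "project" ;;
        SRP*|ERP*|DRP*)       echo "study" ;;
        SRS*|ERS*|DRS*)       echo "sample" ;;
        SRX*|ERX*|DRX*)       echo "experiment" ;;
        SRR*|ERR*|DRR*)       echo "run" ;;
        *)                    echo "unknown" ;;
    esac
}

# Example usage:
#   accession_level SRR18391668
```

A check like this can be handy before starting a download, because a project-level ID may pull in far more data than a run-level ID.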



1.1 fastq-dl

Installation
One way to install fastq-dl is via conda from the Bioconda channel. The commands below create a conda environment called fastq-dl, install fastq-dl and all of its dependencies into it, and then activate it.

conda create -n fastq-dl -c conda-forge -c bioconda fastq-dl
conda activate fastq-dl

Note: installing via conda requires you to have installed miniconda3 or miniforge3

Tip: to deactivate a conda environment, type conda deactivate
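
Before downloading anything, it can help to confirm that the tool is actually available in the active environment. The following is a minimal sketch; require_tool is a hypothetical helper name, and it only checks that a command is on your PATH, not that it works.

```shell
# Report whether a command is available on the PATH.
# Returns 0 if found, 1 otherwise.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (is the conda environment active?)" >&2
        return 1
    fi
}

# Example usage:
#   require_tool fastq-dl
```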



Download reads
Download raw reads for a single sequencing run to the current working directory.

# Download from the SRA
fastq-dl --outdir . --cpus 4 SRR18391668 SRA

# Download from the ENA
fastq-dl --outdir . --cpus 4 SRR18391668 ENA

Argument    Description
--outdir    Path to the directory in which to store output files
--cpus      Number of processors to use

Note: for paired reads, the fastq-dl default is to download two separate gzipped FASTQ files
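
After a download finishes, it is worth checking that each gzipped FASTQ file is intact and that both files of a pair contain the same number of reads. The sketch below assumes standard four-line FASTQ records; the function name count_reads and the file names in the usage comment are assumptions made for illustration.

```shell
# Print the number of reads in a gzipped FASTQ file.
# Returns non-zero if the file fails the gzip integrity test.
count_reads() {
    # gzip -t exits non-zero when the archive is truncated or corrupt
    gzip -t "$1" || return 1
    # A standard FASTQ record spans four lines
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Example usage (file names are assumed, check your output directory):
#   count_reads SRR18391668_1.fastq.gz
#   count_reads SRR18391668_2.fastq.gz
```

For paired reads, the two counts should be identical; a mismatch usually points to an interrupted download.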

You can also specify a project or experiment accession ID. For example, supplying the project accession, PRJNA804719, tells the program to download all 136 sequencing runs to the current directory. Similarly, supplying the experiment accession, SRX14525734, tells the program to download all sequencing runs for that experiment. In the example used here, there is only one run in the experiment so only one sample of paired reads will be downloaded.

# Download all sequencing runs for Project PRJNA804719
fastq-dl --outdir . --cpus 4 PRJNA804719 SRA # SRA
fastq-dl --outdir . --cpus 4 PRJNA804719 ENA # ENA

# Download all sequencing runs for Experiment SRX14525734
fastq-dl --outdir . --cpus 4 SRX14525734 SRA # SRA
fastq-dl --outdir . --cpus 4 SRX14525734 ENA # ENA

*Note: to download all 136 sequencing runs you will need approximately 20GB storage space
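
Because a project-level download can be large, a quick free-space check before starting may save time. This is a minimal sketch using df; has_free_space is a hypothetical helper, and the 20 GB figure in the usage comment is simply the estimate quoted above.

```shell
# Succeed if directory $1 has at least $2 GB of free space.
has_free_space() {
    required_kb=$(( $2 * 1024 * 1024 ))
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # the fourth column is the available space
    available_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$available_kb" -ge "$required_kb" ]
}

# Example usage:
#   has_free_space . 20 && fastq-dl --outdir . --cpus 4 PRJNA804719 SRA
```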



2 Download data from GenBank

GenBank is a genetic sequence database hosted by the NCBI. It is designed to “provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information”. You can download data using the command-line tools provided by Entrez Direct (EDirect).

Installation
To install EDirect tools to your home directory, follow the commands below:

sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
# Type "y" and press the Return key to add the path to your .bashrc file
# Type "N" if you want to move the edirect folder and manually add a path to your .bashrc file
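
If you choose to add the path manually, the sketch below shows one way to append an export line to a startup file without duplicating it on repeated runs. The helper name add_edirect_to_path is hypothetical, and the location $HOME/edirect in the usage comment is an assumption about where the installer placed the folder; adjust it if you moved the folder elsewhere.

```shell
# Append an export line for the edirect directory ($1) to a shell
# startup file ($2), unless the exact line is already present.
add_edirect_to_path() {
    line="export PATH=\"$1:\$PATH\""
    # grep -F: fixed string, -x: match the whole line, -q: quiet
    if ! grep -Fxq "$line" "$2" 2>/dev/null; then
        printf '%s\n' "$line" >> "$2"
    fi
}

# Example usage (assumed install location):
#   add_edirect_to_path "$HOME/edirect" "$HOME/.bashrc"
```

Open a new terminal, or source the startup file, for the change to take effect.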

Examples
This tutorial presents examples for downloading the following data from GenBank:

  • Nucleotide DNA sequences (nuclear and organelle)
  • Protein sequences
  • Gene information
  • Genome assemblies



2.1 Nucleotide

Single 16S rRNA sequence
Download the AF276989 16S rRNA sequence for Salmonella enterica in FASTA format.

efetch -db nucleotide -id AF276989 -format fasta > AF276989_16S.fa

Multiple 16S rRNA sequences
Download multiple 16S rRNA sequences for Salmonella enterica in FASTA format.

efetch -db nucleotide -id AF276989,NR_074799,NR_041696,NR_074910,NR_119108 -format fasta > Senterica_16S.fa

Multiple 16S rRNA sequences — text file containing IDs
Same as above but instead submit a text file containing the IDs to download.

# Contents of a text file called nucleotide_ids.txt (one ID per line)
AF276989
NR_074799
NR_041696
NR_074910
NR_119108

# Download every sequence listed in nucleotide_ids.txt
efetch -db nucleotide -input nucleotide_ids.txt -format fasta > Senterica_16S.fa
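
When downloading from a list of IDs, a simple sanity check is to compare the number of IDs requested with the number of FASTA records received. The following is a sketch only; check_fasta_count is a made-up helper name, and it assumes one ID per non-empty line and one header line (starting with ">") per sequence.

```shell
# Compare the number of IDs in a list ($1) with the number of
# records in a FASTA file ($2). Succeeds only if they match.
check_fasta_count() {
    expected=$(grep -c . "$1")    # non-empty lines in the ID list
    found=$(grep -c '^>' "$2")    # FASTA header lines
    echo "expected=$expected found=$found"
    [ "$expected" -eq "$found" ]
}

# Example usage:
#   check_fasta_count nucleotide_ids.txt Senterica_16S.fa
```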

Mitogenome
Download the complete mitochondrion genome (MH747083) for Homarus gammarus in FASTA format.

efetch -db nucleotide -id MH747083 -format fasta > Hgammarus_mitogenome.fa

Plastome
Download the complete plastid genome (MH281627) for Lithothamnion sp. in FASTA format.

efetch -db nucleotide -id MH281627 -format fasta > LithothamnionSp_plastome.fa



2.2 Protein

Single protein sequence
Download the NP_776541 protein sequence for myogenic factor 5 (Bos taurus) in FASTA format.

efetch -db protein -id NP_776541 -format fasta > Btaurus_MYF5.fa

Multiple protein sequences
Download multiple protein sequences for humans in FASTA format.

efetch -db protein -id NP_001186746,AAB51177,AAB51171,AAF88103,AAF88103 -format fasta > Human_proteins.fa

Multiple protein sequences — text file containing IDs
Same as above but instead submit a text file containing the IDs to download.

# Contents of a text file called protein_ids.txt (one ID per line)
NP_001186746
AAB51177
AAB51171
AAF88103
AAF88103

# Download every sequence listed in protein_ids.txt
efetch -db protein -input protein_ids.txt -format fasta > Human_proteins.fa
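
The example list contains AAF88103 twice. If you would rather request each sequence only once, the ID file can be de-duplicated first. The sketch below uses a common awk idiom that keeps the first occurrence of each line and preserves the original order; unique_ids and the output file name are hypothetical.

```shell
# Print the IDs in file $1 with blank lines and repeated IDs removed,
# keeping the order of first appearance.
unique_ids() {
    awk 'NF && !seen[$0]++' "$1"
}

# Example usage:
#   unique_ids protein_ids.txt > protein_ids_unique.txt
#   efetch -db protein -input protein_ids_unique.txt -format fasta > Human_proteins.fa
```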



2.3 Gene

Single gene
Download information for the Mbis1 gene (492906) from mouse (Mus musculus) in tabular format.

# Method 1: use gene ID
efetch -db gene -id 492906 -format tabular > Mouse_Mbis1.tsv

# Method 2: use a descriptive query
esearch -db gene -query "Mus musculus Mbis1" | efetch -format tabular > Mouse_Mbis1.tsv

Multiple genes
Download information for multiple genes from zebrafish (Danio rerio) in tabular format.

efetch -db gene -id 30592,30513,30425,30590,30269 -format tabular > Zebrafish_genes.tsv

Multiple genes — text file containing IDs
Same as above but instead submit a text file containing the IDs to download.

# Contents of a text file called gene_ids.txt (one ID per line)
30592
30513
30425
30590
30269

# Download information for every gene listed in gene_ids.txt
efetch -db gene -input gene_ids.txt -format tabular > Zebrafish_genes.tsv



2.4 Assembly

Mycobacterium bovis
Download Mycobacterium bovis complete genome assembly (GCA_005156105.1) in gzipped FASTA format.

# FtpPath_GenBank points to the GenBank (GCA) copy of the assembly
wget $(esearch -db assembly -query "GCA_005156105.1" | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank | awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}')

Homarus americanus
Download Homarus americanus scaffold genome assembly (GCA_018991925.1) in gzipped FASTA format.

# FtpPath_GenBank points to the GenBank (GCA) copy of the assembly
wget $(esearch -db assembly -query "GCA_018991925.1" | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank | awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}')

Chondrus crispus
Download Chondrus crispus scaffold genome assembly (GCA_000350225.2) in gzipped FASTA format.

# FtpPath_GenBank points to the GenBank (GCA) copy of the assembly
wget $(esearch -db assembly -query "GCA_000350225.2" | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank | awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}')
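
The awk step in the commands above builds the file URL from the assembly directory: the last component of the directory path is also the prefix of the FASTA file name. The sketch below isolates that step so it can be inspected on its own. The function name assembly_fasta_url and the example path in the usage comment are hypothetical, and the sed substitution assumes the server offers the same path over HTTPS.

```shell
# Turn an assembly directory URL ($1) into the URL of its
# gzipped genomic FASTA file.
assembly_fasta_url() {
    printf '%s\n' "$1" |
        # Optional: request the file over HTTPS instead of FTP
        sed 's|^ftp://|https://|' |
        # $NF is the last path component, which prefixes the file name
        awk -F"/" '{ print $0 "/" $NF "_genomic.fna.gz" }'
}

# Example usage with a made-up directory path:
#   assembly_fasta_url "ftp://ftp.example.org/genomes/GCA_000000000.1_ExampleAsm"
```

Printing the URL before passing it to wget makes it easier to spot an empty or unexpected path, for example when an assembly record has no entry for the requested element.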



Further information for downloading data via EDirect can be found on the following pages:

https://www.ncbi.nlm.nih.gov/books/NBK179288/
https://github.com/NCBI-Hackathons/EDirectCookbook


3 R session info

This tutorial was created using R Markdown and Knitr.

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8 
[2] LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.39     rmarkdown_2.15 magrittr_2.0.3 gt_0.7.0      

loaded via a namespace (and not attached):
 [1] bslib_0.4.0      compiler_4.2.1   pillar_1.8.1     jquerylib_0.1.4 
 [5] tools_4.2.1      digest_0.6.29    gtable_0.3.0     jsonlite_1.8.0  
 [9] evaluate_0.16    lifecycle_1.0.1  tibble_3.1.8     pkgconfig_2.0.3 
[13] rlang_1.0.4      cli_3.3.0        DBI_1.1.3        rstudioapi_0.14 
[17] yaml_2.3.5       blogdown_1.11    xfun_0.32        fastmap_1.1.0   
[21] dplyr_1.0.9      stringr_1.4.1    generics_0.1.3   vctrs_0.4.1     
[25] sass_0.4.2       grid_4.2.1       tidyselect_1.1.2 glue_1.6.2      
[29] R6_2.5.1         fansi_1.0.3      bookdown_0.28    purrr_0.3.4     
[33] ggplot2_3.3.6    ellipsis_0.3.2   scales_1.2.1     htmltools_0.5.3 
[37] assertthat_0.2.1 colorspace_2.0-3 utf8_1.2.2       stringi_1.7.8   
[41] munsell_0.5.0    cachem_1.0.6    


Tom Jenkins
Bioinformatician & Software Developer