Download data from the SRA, ENA, and GenBank
The aim of this tutorial is to demonstrate how to download data from the Sequence Read Archive, the European Nucleotide Archive, and GenBank using the Linux command line.
Prerequisites for this tutorial:
- A Linux computer
- 1 GB of free storage space on your machine*
1 Download data from the SRA / ENA
The Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA) are public repositories that store sequence data; at present, they consist mostly of short-read data generated by high-throughput sequencing. You can download raw data from both databases (usually in FASTQ format) using the general-purpose tool fastq-dl. For more extensive queries to each database, the NCBI (National Center for Biotechnology Information) and the EBI (European Bioinformatics Institute) provide their own tools: SRA Toolkit and enaBrowserTools, respectively.
We will download sequence data for the bacterium that causes bovine tuberculosis (Mycobacterium bovis) by supplying accession numbers. An accession number is an ID that is unique to a particular project (study), sequencing experiment, or sequencing run, so it is what we use to download sequencing data for a study or sample of interest.
Accession ID for each level:
- Project: PRJNA804719
- Experiment: SRX14525734
- Run: SRR18391668
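The level of an accession can usually be read off its prefix. The sketch below assumes the common INSDC prefixes (PRJNA/PRJEB/PRJDB for projects, SRP/ERP/DRP for studies, SRX/ERX/DRX for experiments, and SRR/ERR/DRR for runs); `classify_accession` is a hypothetical helper written for this tutorial and is not part of fastq-dl.

```shell
# Classify an accession by its prefix (POSIX sh).
# classify_accession is an illustrative helper, not a fastq-dl command.
classify_accession() {
    case "$1" in
        PRJNA*|PRJEB*|PRJDB*) echo "project" ;;
        [SED]RP*)             echo "study" ;;
        [SED]RX*)             echo "experiment" ;;
        [SED]RR*)             echo "run" ;;
        *)                    echo "unknown" ;;
    esac
}

# Print each accession used in this tutorial next to its level
for acc in PRJNA804719 SRX14525734 SRR18391668; do
    printf '%s\t%s\n' "$acc" "$(classify_accession "$acc")"
done
```

For the three accessions above, this reports project, experiment, and run, respectively.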
1.1 fastq-dl
Installation
One way to install fastq-dl is via conda from the Bioconda channel. The commands below create a conda environment called fastq-dl, install fastq-dl and all of its dependencies into it, and then activate it.
conda create -n fastq-dl -c conda-forge -c bioconda fastq-dl
conda activate fastq-dl
Note: installing via conda requires Miniconda3 or Miniforge3 to be installed first
Tip: to deactivate a conda environment, type
conda deactivate
Download reads
Download raw reads for a single sequencing run to the current working directory.
# Download from the SRA
fastq-dl --outdir . --cpus 4 SRR18391668 SRA
# Download from the ENA
fastq-dl --outdir . --cpus 4 SRR18391668 ENA
| Argument | Description |
|---|---|
| --outdir | Path to the directory in which to store output files |
| --cpus | Number of processors to use |
Note: for paired reads, the fastq-dl default is to download two separate gzipped FASTQ files
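After a paired-end download, it can be worth confirming that both files hold the same number of reads. The sketch below relies on the fact that a FASTQ record spans four lines; `count_reads` and `check_pair` are hypothetical helpers, and the file names in the usage line are an assumption about how the two files are named.

```shell
# count_reads: a FASTQ record spans four lines, so the number of reads
# is the number of lines divided by four. (Illustrative helper.)
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# check_pair: succeed only if both mate files hold the same number of reads.
check_pair() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" = "$r2" ]; then
        echo "ok: $r1 read pairs"
    else
        echo "mismatch: $r1 vs $r2" >&2
        return 1
    fi
}
```

Assuming the files are named after the run accession, usage would look like `check_pair SRR18391668_1.fastq.gz SRR18391668_2.fastq.gz`.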
You can also specify a project or experiment accession ID. For example, supplying the project accession, PRJNA804719, tells the program to download all 136 sequencing runs to the current directory. Similarly, supplying the experiment accession, SRX14525734, tells the program to download all sequencing runs for that experiment. In the example used here, there is only one run in the experiment so only one sample of paired reads will be downloaded.
# Download all sequencing runs for Project PRJNA804719
fastq-dl --outdir . --cpus 4 PRJNA804719 SRA # SRA
fastq-dl --outdir . --cpus 4 PRJNA804719 ENA # ENA
# Download all sequencing runs for Experiment SRX14525734
fastq-dl --outdir . --cpus 4 SRX14525734 SRA # SRA
fastq-dl --outdir . --cpus 4 SRX14525734 ENA # ENA
*Note: to download all 136 sequencing runs, you will need approximately 20 GB of storage space
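If you only need some of the runs in a project, one option is to list the run accessions in a text file and loop over them. The sketch below reuses the fastq-dl invocation shown in this tutorial; `download_runs` and the `DRY_RUN` variable are hypothetical names introduced here, and the argument syntax may differ between fastq-dl releases, so check `fastq-dl --help` for your installed version.

```shell
# download_runs: read run accessions (one per line) from a file and call
# fastq-dl once per run, using the same invocation as earlier in this tutorial.
# DRY_RUN and download_runs are illustrative names, not fastq-dl features.
download_runs() {
    list=$1
    outdir=${2:-.}
    while IFS= read -r acc; do
        # Skip blank lines and comment lines
        case "$acc" in ''|'#'*) continue ;; esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only print the command that would be run
            echo "fastq-dl --outdir $outdir --cpus 4 $acc SRA"
        else
            fastq-dl --outdir "$outdir" --cpus 4 "$acc" SRA
        fi
    done < "$list"
}
```

Setting `DRY_RUN=1` before calling `download_runs runs.txt` prints the commands without downloading anything, and `df -h .` shows how much free space the current directory has before you start.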
2 Download data from GenBank
GenBank is a genetic sequence database hosted by the NCBI. It is designed to “provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information”. You can download data from it using the command-line tools provided by Entrez Direct (EDirect).
Installation
To install the EDirect tools in your home directory, run the command below:
sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
# Type "y" and press the Return key to add the path to your .bashrc file
# Type "N" if you want to move the edirect folder and manually add a path to your .bashrc file
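Before moving on, you may want to confirm that the tools used in the rest of this tutorial are on your PATH. The sketch below assumes the installer placed them in the default `$HOME/edirect` folder; `missing_tools` is a hypothetical helper, not an EDirect command.

```shell
# missing_tools: print every named command that cannot be found on PATH.
# (Illustrative helper, not part of EDirect.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: EDirect was installed to the default location in the home directory
export PATH="${HOME}/edirect:${PATH}"

missing=$(missing_tools esearch efetch esummary xtract)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing >&2
fi
```

If nothing is printed, all four commands were found. If you moved the edirect folder, adjust the `PATH` line accordingly.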
Examples
This tutorial presents examples for downloading the following data from GenBank:
- Nucleotide DNA sequences (nuclear and organelle)
- Protein sequences
- Gene information
- Genome assemblies
2.1 Nucleotide
Single 16S rRNA sequence
Download the AF276989 16S rRNA sequence for Salmonella enterica in FASTA format.
efetch -db nucleotide -id AF276989 -format fasta > AF276989_16S.fa
Multiple 16S rRNA sequences
Download multiple 16S rRNA sequences for Salmonella enterica in FASTA format.
efetch -db nucleotide -id AF276989,NR_074799,NR_041696,NR_074910,NR_119108 -format fasta > Senterica_16S.fa
Multiple 16S rRNA sequences — text file containing IDs
Same as above but instead submit a text file containing the IDs to download.
# Text file called nucleotide_ids.txt
AF276989
NR_074799
NR_041696
NR_074910
NR_119108
efetch -db nucleotide -input nucleotide_ids.txt -format fasta > Senterica_16S.fa
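A quick sanity check after a batch download is to compare the number of IDs requested with the number of FASTA records received. The sketch below counts non-empty lines in the ID file and header lines (those starting with ">") in the FASTA file; the function names are hypothetical.

```shell
# count_ids: number of non-empty lines in an ID list.
count_ids() {
    grep -c '[^[:space:]]' "$1"
}

# count_fasta_records: number of FASTA header lines (lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# check_download: succeed only if the FASTA file holds one record per ID.
check_download() {
    n_ids=$(count_ids "$1") || true
    n_seqs=$(count_fasta_records "$2") || true
    if [ "$n_ids" -eq "$n_seqs" ]; then
        echo "ok: $n_seqs sequences"
    else
        echo "expected $n_ids sequences, found $n_seqs" >&2
        return 1
    fi
}
```

With the files from this example, the call would be `check_download nucleotide_ids.txt Senterica_16S.fa`.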
Mitogenome
Download the complete mitochondrion genome (MH747083) for Homarus gammarus in FASTA format.
efetch -db nucleotide -id MH747083 -format fasta > Hgammarus_mitogenome.fa
Plastome
Download the complete plastid genome (MH281627) for Lithothamnion sp. in FASTA format.
efetch -db nucleotide -id MH281627 -format fasta > LithothamnionSp_plastome.fa
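For organelle genomes, it can be useful to check the length of what was downloaded against the length reported on the GenBank record. The sketch below sums the sequence lines of each record with awk; `fasta_lengths` is a hypothetical helper.

```shell
# fasta_lengths: print "<record ID><TAB><sequence length>" for every record
# in a FASTA file. The record ID is the first word of the header line.
fasta_lengths() {
    awk '
        /^>/ {
            if (id != "") print id "\t" len   # flush the previous record
            id = substr($1, 2)                # drop the leading ">"
            len = 0
            next
        }
        { len += length($0) }                 # add up sequence characters
        END { if (id != "") print id "\t" len }
    ' "$1"
}
```

For example, `fasta_lengths Hgammarus_mitogenome.fa` prints one line for the mitogenome record, giving its ID and length in bases.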
2.2 Protein
Single protein sequence
Download the NP_776541 protein sequence for myogenic factor 5 (Bos taurus) in FASTA format.
efetch -db protein -id NP_776541 -format fasta > Btaurus_MYF5.fa
Multiple protein sequences
Download multiple protein sequences for humans in FASTA format.
efetch -db protein -id NP_001186746,AAB51177,AAB51171,AAF88103 -format fasta > Human_proteins.fa
Multiple protein sequences — text file containing IDs
Same as above but instead submit a text file containing the IDs to download.
# Text file called protein_ids.txt
NP_001186746
AAB51177
AAB51171
AAF88103
efetch -db protein -input protein_ids.txt -format fasta > Human_proteins.fa
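ID lists assembled by hand can easily contain blank lines or repeated entries, which makes it harder to know how many records to expect. The sketch below removes both while keeping the original order; `dedupe_ids` is a hypothetical helper.

```shell
# dedupe_ids: drop blank lines and repeated IDs, keeping first occurrences
# in their original order. Only the first field of each line is used.
dedupe_ids() {
    awk 'NF && !seen[$1]++ { print $1 }' "$1"
}
```

One way to use it is `dedupe_ids protein_ids.txt > protein_ids.unique.txt`, then pass the new file to `efetch` with `-input`.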
2.3 Gene
Single gene
Download information for the Mbis1 gene (492906) from mouse (Mus musculus) in tabular format.
# Method 1: use gene ID
efetch -db gene -id 492906 -format tabular > Mouse_Mbis1.tsv
# Method 2: use a descriptive query
esearch -db gene -query "Mus musculus Mbis1" | efetch -format tabular > Mouse_Mbis1.tsv
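A descriptive query can match more than one record, so it is worth checking the number of hits before fetching. `esearch` writes a small XML summary that includes a `<Count>` element; EDirect's own route is `xtract -pattern ENTREZ_DIRECT -element Count`, and the sketch below does the same with sed so that the logic is visible. `hit_count` is a hypothetical helper.

```shell
# hit_count: read esearch output on stdin and print the value of the
# first <Count> element, i.e. the number of records matched by the query.
hit_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p' | head -n 1
}
```

Usage: `esearch -db gene -query "Mus musculus Mbis1" | hit_count`. If the number is larger than expected, narrow the query or fall back to the gene ID.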
Multiple genes
Download information for multiple genes from zebrafish (Danio rerio) in tabular format.
efetch -db gene -id 30592,30513,30425,30590,30269 -format tabular > Zebrafish_genes.tsv
Multiple genes — text file containing IDs
Same as above but instead submit a text file containing the IDs to download.
# Text file called gene_ids.txt
30592
30513
30425
30590
30269
efetch -db gene -input gene_ids.txt -format tabular > Zebrafish_genes.tsv
2.4 Assembly
Mycobacterium bovis
Download the Mycobacterium bovis complete genome assembly (GCA_005156105.1) in gzipped FASTA format.
wget $(esearch -db assembly -query "GCA_005156105.1" | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank | awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}')
Homarus americanus
Download the Homarus americanus scaffold genome assembly (GCA_018991925.1) in gzipped FASTA format.
wget $(esearch -db assembly -query "GCA_018991925.1" | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank | awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}')
Chondrus crispus
Download the Chondrus crispus scaffold genome assembly (GCA_000350225.2) in gzipped FASTA format.
wget $(esearch -db assembly -query "GCA_000350225.2" | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank | awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}')
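The awk step in these commands builds the file URL from the assembly's FTP directory: it splits the path on "/" and appends the last path component followed by `_genomic.fna.gz`. The sketch below isolates that step so you can see what it produces; `assembly_fasta_url` is a hypothetical helper, and the directory used here is a made-up placeholder, not a real assembly.

```shell
# assembly_fasta_url: given an assembly's FTP directory, print the URL of
# its genomic FASTA file, "<dir>/<last path component>_genomic.fna.gz".
assembly_fasta_url() {
    printf '%s\n' "$1" | awk -F"/" '{ print $0 "/" $NF "_genomic.fna.gz" }'
}

# Placeholder directory for illustration only (not a real assembly)
dir="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/000/000/GCA_000000000.1_EXAMPLE"
assembly_fasta_url "$dir"
```

Once a file has been downloaded, `gzip -t` followed by the file name checks that the archive is intact.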
Further information for downloading data via EDirect can be found on the following pages:
https://www.ncbi.nlm.nih.gov/books/NBK179288/
https://github.com/NCBI-Hackathons/EDirectCookbook
3 R session info
This tutorial was created using R Markdown and Knitr.
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.utf8
[2] LC_CTYPE=English_United Kingdom.utf8
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.39 rmarkdown_2.15 magrittr_2.0.3 gt_0.7.0
loaded via a namespace (and not attached):
[1] bslib_0.4.0 compiler_4.2.1 pillar_1.8.1 jquerylib_0.1.4
[5] tools_4.2.1 digest_0.6.29 gtable_0.3.0 jsonlite_1.8.0
[9] evaluate_0.16 lifecycle_1.0.1 tibble_3.1.8 pkgconfig_2.0.3
[13] rlang_1.0.4 cli_3.3.0 DBI_1.1.3 rstudioapi_0.14
[17] yaml_2.3.5 blogdown_1.11 xfun_0.32 fastmap_1.1.0
[21] dplyr_1.0.9 stringr_1.4.1 generics_0.1.3 vctrs_0.4.1
[25] sass_0.4.2 grid_4.2.1 tidyselect_1.1.2 glue_1.6.2
[29] R6_2.5.1 fansi_1.0.3 bookdown_0.28 purrr_0.3.4
[33] ggplot2_3.3.6 ellipsis_0.3.2 scales_1.2.1 htmltools_0.5.3
[37] assertthat_0.2.1 colorspace_2.0-3 utf8_1.2.2 stringi_1.7.8
[41] munsell_0.5.0 cachem_1.0.6