SRA Toolkit Installation and Configuration guide
The following guide will outline the download, installation, and configuration of the SRA Toolkit. Detailed information regarding the usage of individual tools in the SRA Toolkit can be found on the tool-specific documentation pages.
The NCBI SRA Toolkit enables reading ("dumping") of sequencing files from the SRA database and writing ("loading") files into the .sra format (Note that this is not required for submission). The Toolkit source code is provided in the form of the SRA SDK, and may be compiled with GCC. However, pre-built software executables are available for Linux, Windows, and Mac OS X, and we highly recommend using these pre-built executables whenever possible. If configuration of the toolkit is required, Java or Perl will need to be installed.
Download the Toolkit from the SRA website
- If you are using a web browser, the SRA Toolkit download page contains links to the most current version of the Toolkit for each of the supported platforms: http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software
- If you are instead working from a command-line interface, you may use FTP or wget to obtain the software from the following directory:
Unpack the Toolkit:
- For Linux, use tar:
tar -xzf sratoolkit.current-centos_linux64.tar.gz
- For Mac OS X, double-click on the .tar.gz file and the Archive Utility will unpack it. Alternatively, command-line tar will also work (see Linux example, above).
- For Windows, either use an archiving and compression utility (e.g., Winzip, 7-Zip, etc.), or simply double-click on the .zip file and drag the 'sratoolkit...' folder to the preferred install location.
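On Linux or Mac OS X, the unpacking step can also be directed at a specific install location. The following is a minimal sketch rather than an official procedure: unpack_toolkit is a hypothetical helper name, and the archive name and destination are whatever you pass in.

```shell
# Sketch: unpack a Toolkit tarball into a chosen install directory.
# unpack_toolkit is a hypothetical helper, not part of the SRA Toolkit.
unpack_toolkit() {
    archive=$1        # e.g. sratoolkit.current-centos_linux64.tar.gz
    dest=${2:-.}      # install location; defaults to the current directory
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # -x extract, -z gunzip, -f archive file, -C switch to dest before extracting
    tar -xzf "$archive" -C "$dest" || return 1
    # List the top-level entries now present in the install location
    ls "$dest"
}
```

For example, unpack_toolkit sratoolkit.current-centos_linux64.tar.gz "$HOME/tools" would place the 'sratoolkit...' folder under a "tools" directory in your home directory (the "tools" name is only an illustration).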
Note: For most users, the Toolkit programs (fastq-dump, sam-dump, etc.) will not be located in a directory listed in the PATH environment variable, so you may need to give the location of the Toolkit explicitly. The following examples show how 'fastq-dump' would be called in different circumstances:
- ~/[user_name]/sra-toolkit/fastq-dump (YES): The Toolkit "bin" directory has been placed in the user-specified directory "sra-toolkit".
- ./fastq-dump (YES): The Toolkit components are in the current working directory.
- fastq-dump (NO): If the Toolkit location is not specified in your $PATH variable, then the OS cannot locate the fastq-dump program, even if it is in the current directory. NOTE: Windows users should be able to enter only "fastq-dump.exe" after navigating to the Toolkit "bin" directory.
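To avoid typing the directory each time on Linux or Mac OS X, the Toolkit "bin" directory can be added to PATH for the current shell session. This is a minimal sketch for sh-compatible shells; add_toolkit_to_path and locate_fastq_dump are hypothetical helper names, and the directory you pass in is wherever you unpacked the Toolkit.

```shell
# Sketch: prepend a Toolkit "bin" directory to PATH for the current shell session.
# add_toolkit_to_path is a hypothetical helper, not part of the SRA Toolkit.
add_toolkit_to_path() {
    bin_dir=$1
    if [ ! -d "$bin_dir" ]; then
        echo "not a directory: $bin_dir" >&2
        return 1
    fi
    case ":$PATH:" in
        *":$bin_dir:"*) ;;                        # already listed; leave PATH unchanged
        *) PATH="$bin_dir:$PATH"; export PATH ;;  # otherwise put it first
    esac
}

# Report where the shell would find fastq-dump, if anywhere.
locate_fastq_dump() {
    command -v fastq-dump || { echo "fastq-dump is not in PATH" >&2; return 1; }
}
```

After calling add_toolkit_to_path with your "bin" directory, locate_fastq_dump should print the full path to the program, and plain "fastq-dump" can then be used from any directory in that session. The change lasts only for the current session unless the same line is added to your shell startup file.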
The Toolkit comes with a default configuration that will work for most users. You may elect to perform the following tests to confirm that your configuration is working correctly. The default location for the "download repository" is:
- Linux: /home/[user_name]/ncbi/public
- Mac OS X: /Users/[user_name]/ncbi/public
- Windows: C:\Users\[user_name]\ncbi\public
Note that if the tests fail, or if you wish to specify the download location for files sourced from NCBI, you should configure your Toolkit installation. During normal operation, the Toolkit may be required to download the following types of data to the default location:
- Reference sequences: Small (most less than 70 MB) sequences used to decompress aligned SRA data.
- SRA data files: If data are downloaded "on-the-fly" using the Toolkit, then partial and whole SRA datasets (most are several GB in size) can be located here. Note: Manually downloaded SRA data obtained using a web browser, wget, ascp, or FTP may be stored anywhere in the local file system.
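Because these downloads can grow large, it may help to check what the download repository currently holds. The sketch below is one way to do that on Linux or Mac OS X; inspect_repository is a hypothetical helper name, and it assumes that $HOME corresponds to the home directory shown in the default locations above.

```shell
# Sketch: report whether the download repository exists and how much it holds.
# inspect_repository is a hypothetical helper, not part of the SRA Toolkit.
# Assumes $HOME matches the home directory used in the default location.
inspect_repository() {
    repo=${1:-"$HOME/ncbi/public"}
    if [ ! -d "$repo" ]; then
        echo "no repository at $repo (nothing downloaded yet, or another location is configured)"
        return 1
    fi
    echo "repository: $repo"
    # Total size of the cached reference sequences and SRA data
    du -sh "$repo"
}
```

Called without arguments, the function checks the default location; pass a path if you configured a different download directory.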
For the test, we are using an arbitrary dataset, SRR390728 (RNA-Seq (polyA+) analysis of DLBCL cell line HS0798), from the National Cancer Institute’s Cancer Genome Characterization Initiative (CGCI) Project. It is a reasonably small SRA dataset that contains aligned (reference-compressed) data, allowing us to test multiple aspects of the toolkit simultaneously.
Open a terminal or command prompt and "cd" into the directory containing the Toolkit executables.
Linux and OS X users should execute the following command:
./fastq-dump -X 5 -Z SRR390728
Windows users should execute the following command:
fastq-dump.exe -X 5 -Z SRR390728
- If successful, the test should connect to NCBI, download a small amount of data from SRR390728 and the reference sequence needed to extract the data, and stream the first 5 spots of the file ("-X 5" option) to the screen ("-Z" option).
If the configuration is not valid, an error like the following will likely be displayed:
fastq-dump.2.x err: item not found while constructing within virtual database module - the path 'SRR390728' cannot be opened as database or table
- If you receive an error like the one above, please configure the toolkit (described in the next section). If you have already configured the toolkit but are still unable to complete the test successfully, please email firstname.lastname@example.org with a full description of steps taken and error messages received.
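The test can also be wrapped so that success or failure is reported explicitly. This is only a sketch for Linux and Mac OS X: check_toolkit is a hypothetical helper name, and it assumes that fastq-dump exits with a non-zero status when it prints an error like the one above.

```shell
# Sketch: run the test command and report whether it succeeded.
# check_toolkit is a hypothetical helper; its argument is the path to fastq-dump.
# Assumption: fastq-dump exits with a non-zero status on failure.
check_toolkit() {
    fastq_dump=${1:-./fastq-dump}
    # -X 5: stop after the first 5 spots; -Z: write to the screen (standard output)
    if output=$("$fastq_dump" -X 5 -Z SRR390728 2>&1); then
        printf '%s\n' "$output"
        echo "Test passed: the current configuration appears to work."
    else
        printf '%s\n' "$output" >&2
        echo "Test failed: configure the Toolkit as described in the next section." >&2
        return 1
    fi
}
```

Run check_toolkit from the directory containing the executables, or pass the full path to fastq-dump as the first argument.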
If you are using SRA Toolkit version 2.3 or higher, you should run the Java configuration tool, located within the /bin subdirectory of the Toolkit package.
Go to the "bin" subdirectory for the Toolkit and run the following command line:
java -jar sratoolkit.jar &
This tool will set up your download/cache area for downloaded files and references. You can accept the defaults as you walk through the configuration wizard, or change the location of the download directory.
A window will open and present the below screen. In this example, user "jane_user" has opened
the Java configuration manager and it has defaulted to a location in her home directory.
You may accept this default or change it using the "Browse" button. Click "Next" to proceed.
The configuration manager will then ask you to confirm the target directory.
Click "Yes" to accept or "No" to cancel or change the location.
The configuration manager will ask if you wish to grant access for the Toolkit to obtain data
and reference sequences directly from NCBI, as needed.
The default is "Enable Repository"; you must check the box to disable access (Note: This will require manual retrieval of data AND reference sequences; disabling access is NOT recommended).
After specifying the download directory and setting repository access, you may now click "EXIT"
to return to the command line and test the Toolkit configuration.
If you are attempting to open dbGaP Authorized Access data, you should instead click "OK" to import your .ngc file. Please proceed to the Protected Access Usage guide for additional information.
If you do not have Java installed on your computer, or if you are accessing a remote server that does not have X windows configured, you may elect to run the "configuration-assistant.perl" Perl script. Wherever possible, it is recommended that you use the Java configuration tool, as the Perl script is no longer being developed and is included with the Toolkit for compatibility purposes only. Note that Perl is included with most Linux installations and Mac OS X, but generally not with Windows.
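The choice between the two configuration tools can be sketched as a small shell check for Linux systems. This is an illustration only: choose_config_tool is a hypothetical helper name, and it assumes that a non-empty DISPLAY variable indicates a usable X windows session (that check does not apply to Windows or to the native Mac OS X desktop).

```shell
# Sketch: suggest which configuration tool to use, following the guidance above.
# choose_config_tool is a hypothetical helper, not part of the SRA Toolkit.
# Assumption: a non-empty DISPLAY means an X windows session is available.
choose_config_tool() {
    if command -v java >/dev/null 2>&1 && [ -n "${DISPLAY:-}" ]; then
        echo "sratoolkit.jar"                  # preferred: Java configuration tool
    elif command -v perl >/dev/null 2>&1; then
        echo "configuration-assistant.perl"    # fallback: Perl script
    else
        echo "neither Java (with a display) nor Perl is available" >&2
        return 1
    fi
}
```

The function only prints the name of the suggested tool; running it is still done from the Toolkit "bin" subdirectory as described in this guide.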
Initiate the configuration-assistant.perl script:
Your configuration is incomplete. Would you like to fix it? [Y/n]
Type "Y" or press ENTER.
Would you like to enable Remote Internet access to NCBI(recommented)? [Y/n]
Type "Y" or press ENTER. Note that disabling remote access will make extracting data from reference-compressed SRA data sets difficult and is not recommended.
"Please indicate where data should be stored." ... "Path to your Repository [ /home/USERNAME/ncbi/public ]:"
If the default path provided is okay, press ENTER to accept. Otherwise, you may type in a preferred path.
Directory " ... " does not exist. Would you like to create it? [Y/n]
Type "Y" or press ENTER if you are certain that the path is correct.
Would you like to enable caching of downloaded data? [Y/n]
Type "Y" or press ENTER. This will enable the Toolkit to cache remote data locally on your system. Turning caching off will allow the Toolkit to function, but only at network speeds when accessing remote data.
Toolkit download: http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software
Toolkit documentation: http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc
Protected Data Usage guide: http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=dbgap_use
Toolkit high-performance computing guide: Coming soon!