SRA objects that contain read placements on a reference genome, in addition to raw reads, require the reference sequences in order to be interpreted. Small reference sequences are packaged inside the SRA run object. Large ones (such as human chromosomes) are packaged as independent reference objects and pulled independently.
Getting from SRA
The SRA toolkit tools will fetch and cache reference objects from NCBI when required. If you prefetch an SRA run, all external reference objects it depends on are prefetched as well. If you plan to prefetch many runs that refer to the same assembly, fetch a few of them sequentially before starting parallel downloads, so that the common references are already cached before the mass dump starts. You can also download all reference objects manually from the SRA reference FTP site, but you then need to configure the SRA toolkit to find them.
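The "sequential first, then parallel" advice can be sketched as a small Python wrapper around the toolkit's `prefetch` command. This is only an illustration: the function names, the `warmup` and `workers` parameters, and their default values are assumptions made for this example, not part of the SRA toolkit.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_prefetch(accession):
    """Invoke the SRA toolkit `prefetch` command for one run accession."""
    subprocess.run(["prefetch", accession], check=True)


def prefetch_runs(accessions, warmup=2, workers=4, fetch=run_prefetch):
    """Fetch the first `warmup` runs one at a time, then the rest in parallel.

    `warmup` and `workers` are illustrative defaults, not recommended values.
    """
    accessions = list(accessions)
    # Sequential phase: gives shared reference objects a chance to be cached.
    for acc in accessions[:warmup]:
        fetch(acc)
    # Parallel phase: remaining runs should find the references already cached.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces evaluation so that any download error is raised here.
        list(pool.map(fetch, accessions[warmup:]))
```

Calling `prefetch_runs(run_list)` with a list of run accessions that share an assembly would download the first two runs one after another and the remainder four at a time. The `fetch` argument exists so that the download step can be swapped out, for example with a stub when testing.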
Submitting to SRA
In order to process BAM data, SRA needs to know the reference sequences used in the alignment. BAM files describe the references used through a reference name and an optional assembly name. The SRA archive can recognize the following combinations:
- INSDC accession.version (e.g. CM000663.1). No assembly name is needed in this case
- sequence name in known assemblies from the NCBI Assembly database
- names of the sequences in a FASTA file provided as part of the submission
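Before submitting, it can help to see which reference and assembly names a BAM file actually declares. In the SAM/BAM header, each `@SQ` line carries the reference name in the `SN` tag and, optionally, the assembly in the `AS` tag. The sketch below parses header text (for example, the output of `samtools view -H`) and flags names that look like accession.version. The regular expression is a simplified assumption made for this example, not the rule SRA applies, and `parse_sq_lines` is a hypothetical helper name.

```python
import re

# Assumed, simplified pattern: uppercase letters, digits, a dot, then a version.
# It matches names such as CM000663.1 but is not an official INSDC definition.
ACCESSION_VERSION = re.compile(r"^[A-Z]+\d+\.\d+$")


def parse_sq_lines(header_text):
    """Return (name, assembly, looks_like_accession) for each @SQ header line."""
    refs = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        name = fields.get("SN")      # reference sequence name
        assembly = fields.get("AS")  # optional assembly name
        looks_like_accession = bool(name and ACCESSION_VERSION.match(name))
        refs.append((name, assembly, looks_like_accession))
    return refs
```

Names flagged as accession-like correspond to the first combination in the list above. For the remaining names, check that they either belong to an assembly known to the NCBI Assembly database or appear in a FASTA file included with the submission.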