Trace Archive Frequently Asked Questions

  1. Should I submit to Trace Archive?
  2. How can I obtain a trace for one or just a few sequences?
  3. Can I view the chromatogram on your web site?
  4. What is a mate-pair?
  5. How can I download large data sets?
  6. What is the RCF format?
  7. Where can I find field requirements for submitting data?
  8. What are the available common fields in the submission file?
  9. Can submitted data be kept confidential in advance of publication?
  1. Should I submit to Trace Archive?

    No. Please consider SRA instead.

  2. How can I obtain a trace for one or just a few sequences?

    The actual trace file for a particular sequence can be obtained from the web site. Once you have retrieved the reads you are interested in, use the check boxes at the top to select the information you wish to save, then click the "Save" button.

  3. Can I view the chromatogram on your web site?

    Yes. On the results page, select "Trace" from the "Show" pull down menu.

  4. What is a mate-pair?

    Most of the sequence in the Trace Archive is derived from Whole Genome Shotgun (WGS) sequencing. WGS involves generating libraries of discrete size and sequencing both ends of the clones in the library. Sequences derived from different ends of the same clone are called mate-pairs. This information can be useful for inferring the distance between the two reads of a mate-pair if the average insert size of the library is known.

  5. How can I download large data sets?

    The number of records that can be obtained in a single request is limited; the limit is currently 40,000. To download more records, you need to place several requests. Although it is generally possible to download all the data you need with a browser, the best approach is to use our Perl script query_tracedb. After copying the script, don't forget to make it executable. Every record in the archive is assigned a unique identifier, the TI, so you first need to obtain all identifiers that match your query. Using these identifiers, you can then retrieve the actual data. Let's see how this works on a real example (please note that this page is static, and the numbers shown in the example may not reflect the current status of the archive):

    1. The first step is to count all available records:
      query_tracedb "query count species_code='AEDES AEGYPTI'"
      122116
    2. At 40,000 records per request, retrieving all 122,116 records takes at least 4 requests (page numbers start at 0), so let's obtain the identifiers. Please note that the identifiers are in network (BIG ENDIAN) format:
      query_tracedb "query page_size 40000 page_number 0 binary species_code='AEDES AEGYPTI'" > page1.bin
      query_tracedb "query page_size 40000 page_number 1 binary species_code='AEDES AEGYPTI'" > page2.bin
      ...
      query_tracedb "query page_size 40000 page_number 3 binary species_code='AEDES AEGYPTI'" > page4.bin
    3. You can now retrieve the data in the submission form (tarball):
      (echo -n "retrieve_tgz all 0b"; cat page1.bin) | query_tracedb > data1.tgz
      ...
      (echo -n "retrieve_tgz all 0b"; cat page4.bin) | query_tracedb > data4.tgz
      The above will retrieve all files from the archive: fasta, quality scores, chromatograms in scf format, mate_pairs, and ancillary files.
    4. Note: steps 2 and 3 can be done at the same time:
      (echo -n "retrieve_tgz all 0b"; query_tracedb "query page_size 40000 page_number 0 binary species_code='AEDES AEGYPTI'") | query_tracedb > data1.tgz
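
    The paging arithmetic and the per-page commands above can be sketched as a small helper. This is only an illustration: the function names pages_needed and print_download_commands are hypothetical, and the script merely prints the commands in the combined form of step 4 so they can be reviewed before anything is sent to the server.

```shell
#!/bin/sh
# Number of pages needed to cover COUNT records at PAGE_SIZE records per
# page (integer ceiling division).
pages_needed() {
    count=$1
    page_size=$2
    echo $(( (count + page_size - 1) / page_size ))
}

# Print (do not run) one combined query-and-retrieve command per page.
# Arguments: species code, record count, page size (default 40000).
# Page numbers start at 0; output files are numbered from 1.
print_download_commands() {
    species=$1
    count=$2
    page_size=${3:-40000}
    pages=$(pages_needed "$count" "$page_size")
    page=0
    while [ "$page" -lt "$pages" ]; do
        printf '(echo -n "retrieve_tgz all 0b"; query_tracedb "query page_size %s page_number %s binary species_code='\''%s'\''") | query_tracedb > data%s.tgz\n' \
            "$page_size" "$page" "$species" "$((page + 1))"
        page=$((page + 1))
    done
}

# Example with the record count from step 1.
print_download_commands 'AEDES AEGYPTI' 122116
```

    Because the page is static, the count passed in should come from a fresh "query count" request rather than from the number shown here.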

    For more information, run 'query_tracedb help' to list the available data formats and 'query_tracedb usage' to see usage examples.

    If you need to save only TI numbers for future reference, you might want to obtain them in text form:

    query_tracedb "query page_size 40000 page_number 0 text species_code='AEDES AEGYPTI'" > page1.txt
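
    If the identifiers were already saved in binary form, they can also be inspected locally. The sketch below decodes big-endian integers from such a file. The page only states that the identifiers are big-endian; the number of bytes per identifier is not given here, so the function takes it as a parameter, and the default of 8 bytes is purely an assumption to be checked against 'query_tracedb help'. The function name decode_ti is hypothetical.

```shell
#!/bin/sh
# Decode fixed-width big-endian unsigned integers from a binary file and
# print one decimal value per line.
# $1 = input file
# $2 = bytes per identifier (ASSUMED to be 8 by default; not stated by the
#      source, so verify before relying on the output)
decode_ti() {
    width=${2:-8}
    # od prints each byte as two hex digits; tr puts one byte on each line.
    od -An -v -tx1 "$1" | tr -s ' ' '\n' | {
        n=0
        hex=
        while read -r byte; do
            [ -n "$byte" ] || continue
            # Big-endian: most significant byte comes first, so appending
            # the hex digits in file order yields the value.
            hex="$hex$byte"
            n=$((n + 1))
            if [ "$n" -eq "$width" ]; then
                echo $((0x$hex))
                n=0
                hex=
            fi
        done
    }
}
```

    For example, decode_ti page1.bin > page1.txt would write one identifier per line, provided the assumed width is correct.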

  6. What is the RCF format?

    RCF stands for Relieved Compress Format and represents the data exactly as it resides on the server. To minimize disk space usage and computation time, it was decided after thorough testing that the originally supplied data would be reprocessed and recompressed on the fly during loading. All chromatograms are therefore kept in a proprietary format called RCF. RCF combines two simple algorithms, derivation and Huffman encoding, which yield significant compression while remaining simple and requiring little computation power.
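
    To illustrate why a derivation step helps before Huffman encoding, here is a minimal sketch that assumes "derivation" means storing the difference between successive sample values, which this page does not spell out. It is not the RCF implementation and does not read or write RCF files; the function names delta_encode and delta_decode are hypothetical.

```shell
#!/bin/sh
# Read one integer per line and print the first value followed by the
# difference between each value and its predecessor.
delta_encode() {
    awk 'NR == 1 { print $1; prev = $1; next }
         { print $1 - prev; prev = $1 }'
}

# Inverse of delta_encode: a running sum restores the original values.
delta_decode() {
    awk 'NR == 1 { acc = $1; print acc; next }
         { acc += $1; print acc }'
}

# A smooth signal turns into small, frequently repeated differences,
# which is the kind of input an entropy coder such as Huffman handles well.
printf '100\n102\n105\n105\n103\n' | delta_encode
```

    Piping the encoded values through delta_decode returns the original samples, so the step itself loses no information.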

    Downloading data in RCF format typically takes much less time because the files are smaller. The data can then be converted into SCF format locally. We strongly encourage you to do this, since it relieves pressure on the server and also saves you some waiting time. The converter can be obtained from the public ftp site: rcf2scf

  7. Where can I find field requirements for submitting data?

    Check the requirements in the Validation Table (Excel format) for specific combinations of STRATEGY and TRACE_TYPE_CODE.

  8. What are the available common fields in the submission file?

    See the list of common fields here

  9. Can submitted data be kept confidential in advance of publication?

    If you need this feature, please contact us before loading (trace@ncbi.nlm.nih.gov). As soon as data have been loaded, they become public.