| Run | Spots | Bases | Size | GC content | Published | Access Type |
This run has 2 reads per spot:
- Technical read: length 4; 100% of spots contain this read.
- Application read: average length 165 (σ = 92.8); 66% of spots contain this read.
| BioProject | Study | Title |
|------------|-------|-------|
| PRJNA312294 | SRP070425 | Use of the Fluidigm C1 platform for RNA sequencing of single mouse pancreatic islet cells |
This study provides an assessment of the Fluidigm C1 platform for RNA sequencing of single mouse pancreatic islet cells. The system combines microfluidic technology and nanoliter-scale reactions. We sequenced 622 cells allowing identification of 341 islet cells with high-quality gene expression profiles. The cells clustered into populations of alpha-cells (5%), beta-cells (92%), delta-cells (1%) and PP-cells (2%). We identified cell-type specific transcription factors and pathways primarily involved in nutrient sensing and oxidation and cell signaling. Unexpectedly, 281 cells had to be removed from the analysis due to low viability (23%), low sequencing quality (13%) or contamination resulting in the detection of more than one islet hormone (64%). Collectively, we provide a resource for identification of high-quality gene expression datasets to help expand insights into genes and pathways characterizing islet cell types. We reveal limitations in the C1 Fluidigm cell capture process resulting in contaminated cells with altered gene expression patterns. This calls for caution when interpreting single-cell transcriptomics data using the C1 Fluidigm system. Overall design: Single-cell RNA sequencing of mouse C57BL/6 pancreatic islet cells
SRA archive data
| Type | Size | Location | Name | Free Egress | Access Type |
- Egress and Access: what do they mean?
- "worldwide": the data can be downloaded from anywhere, for free
- Why is SRA data in the cloud?
In order to support large-scale (hyper-parallel) data analyses, SRA data is now available at GCP and AWS, with a few caveats:
- SRA data is copied to the cloud from NCBI. There may be a lag between availability from NCBI and from the cloud service providers (CSPs).
- To access public data, a user account with the cloud service provider is required. Your account will incur costs for cloud compute and/or for copying data (either archival data or the results of your compute) outside of the specified cloud service region.
- Distribution of protected data requires sign-in with an NIH account and requires the user to operate in the same region as the data.
SRA has also begun to provide access to the originally submitted source files, with some caveats:
- not all files have been validated by SRA
- not all files have been copied to cloud locations (recovering them from the NCBI tape system takes time).
- the volume of this type of data is much larger and it is not accessed as often, so we will keep most of it on tape or in "cold" cloud storage. As a result, the data may not be available instantly; restore requests will be served on a first-come, first-served basis, and the cost of a restore may be charged to your user account.
- Unidentified reads: 71.73%
- Identified reads: 28.27%
- How do I read the results?
Results show the distribution of reads mapping to specific taxonomy nodes as a percentage of total reads within the analyzed run. In cases where a read maps to more than one related taxonomy node, the read is reported as originating from the lowest shared taxonomic node: when a read maps to two species belonging to the same genus, it is assigned at the genus level. Sequence reads from a single organism will map to several taxonomy nodes spanning the organism's lineage, so the number of reads mapping to higher-level nodes will typically be greater than the number mapping to terminal nodes.
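The lowest-shared-node rule described above can be sketched in a few lines of Python. The taxonomy and organism names below are toy stand-ins for illustration, not NCBI taxids:

```python
# Toy taxonomy: child -> parent (names are illustrative, not real taxids).
PARENT = {
    "E. coli": "Escherichia",
    "E. albertii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}

def lineage(node):
    """Return the path from a node up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lowest_common_node(nodes):
    """Lowest taxonomy node shared by every matched node."""
    common = set(lineage(nodes[0]))
    for n in nodes[1:]:
        common &= set(lineage(n))
    # Walk the first lineage from the leaf upward; the first shared
    # node encountered is the lowest common ancestor.
    return next(n for n in lineage(nodes[0]) if n in common)

print(lowest_common_node(["E. coli", "E. albertii"]))  # Escherichia
```

A read matching two species of `Escherichia` is reported at the genus, while a read matching `E. coli` and `Salmonella` moves up to the shared family node.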
STAT results are proportional to the size of sequenced genomes. Given a mixed sample containing several organisms at equal copy number, proportionally more reads originate from the larger genomes. This means that the percentages reported by STAT will reflect genome size and must be considered against the genomic complexity of the sequenced sample.
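As a rough illustration of this genome-size effect, consider two organisms present at equal copy number; the genome sizes below are hypothetical round numbers, not real references:

```python
# Illustrative only: STAT reports raw read percentages; this sketch shows
# why equal copy numbers can still yield unequal percentages.
genomes = {"organism_a": 5_000_000, "organism_b": 1_000_000}  # bases, hypothetical

# At equal copy number, expected read counts are proportional to genome size.
total = sum(genomes.values())
expected_pct = {k: 100.0 * v / total for k, v in genomes.items()}
print(expected_pct)  # organism_a ~83%, organism_b ~17%
```

Even though both organisms are equally abundant, the larger genome contributes about five times as many reads, which is why the reported percentages must be weighed against genome size.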
- How is the taxonomy analysis done?
The NCBI SRA Taxonomy Analysis Tool (STAT) calculates the taxonomic distribution of reads from next generation sequencing runs. This analysis maps individual sequencing reads to a taxonomic hierarchy and reports the taxonomic composition of reads within a sequencing run.
STAT maps sequencing reads to a taxonomic hierarchy using a two-step strategy based on exact query read matches to precomputed k-mer dictionary databases. In the first pass, a small "coarse" reference dictionary database is used to identify organisms matching a read set. In the second pass, organism-specific slices from a "fine" reference dictionary database are used to compute the distribution of reads among the identified taxonomy classes (species and higher-order taxonomy nodes). When multiple taxonomy nodes are mapped for a single spot, the lowest unambiguous mapping is used.
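A minimal sketch of the two-pass lookup, assuming toy coarse/fine dictionaries and a short k-mer for readability (real STAT uses 32-mers and large precomputed databases):

```python
K = 4  # real STAT uses 32-mers; 4 keeps this toy example readable

coarse = {"ACGT": "org1", "TTGG": "org2"}           # sparse, whole-database
fine = {                                            # dense, per-organism slices
    "org1": {"ACGT": "species_x", "CGTA": "species_x"},
    "org2": {"TTGG": "species_y"},
}

def kmers(read):
    return {read[i:i + K] for i in range(len(read) - K + 1)}

def classify(reads):
    # Pass 1: the coarse dictionary picks candidate organisms for the read set.
    candidates = {coarse[k] for r in reads for k in kmers(r) if k in coarse}
    # Pass 2: only the matching fine slices are consulted for each read.
    counts = {}
    for r in reads:
        hits = {fine[c][k] for c in candidates for k in kmers(r) if k in fine[c]}
        for node in hits:
            counts[node] = counts.get(node, 0) + 1
    return counts

print(classify(["AACGTA", "TTTGGA"]))
```

The design point is that the expensive fine dictionaries are only loaded for organisms the cheap coarse pass has already flagged, keeping the working set small for a typical run.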
STAT k-mer dictionaries are built using an iterative minhash-based approach against reference genomic databases. For every fixed-length segment of incoming reference nucleotide sequence, the k-mer representing that segment is selected based on the minimum FNV-1 hash value. Several strategies were used to enhance the specificity and accuracy of STAT results. Low-complexity k-mers composed of >50% homopolymer or dinucleotide repeats (e.g. AAAAAA or ACACACACACA) were filtered from the dictionaries, and discrete k-mers belonging to multiple taxonomic references were "merged" at the lowest common taxonomic node shared between the references. Finally, the specificity of the representative k-mers was determined by searching against the source reference genomic database: when representative k-mers were found in multiple taxonomic reference nodes, they were merged at the lowest common taxonomic node as above.
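The selection step can be sketched as below. This assumes the "fvn1" in the text refers to the 32-bit FNV-1 hash, and the low-complexity filter is a crude approximation of the >50% repeat rule, not the production implementation:

```python
from collections import Counter

FNV_PRIME, FNV_OFFSET = 16777619, 2166136261

def fnv1(kmer):
    """32-bit FNV-1: multiply by the prime, then XOR in each byte."""
    h = FNV_OFFSET
    for b in kmer.encode():
        h = ((h * FNV_PRIME) & 0xFFFFFFFF) ^ b
    return h

def low_complexity(kmer):
    """Crude filter: flag k-mers where one base or one dinucleotide unit
    covers more than half of the sequence."""
    for size in (1, 2):
        units = [kmer[i:i + size] for i in range(0, len(kmer) - size + 1, size)]
        if Counter(units).most_common(1)[0][1] * size > len(kmer) / 2:
            return True
    return False

def representative_kmers(seq, k=32, segment=64):
    """For each fixed-length segment, keep the k-mer with the minimum
    FNV-1 hash, skipping low-complexity candidates."""
    reps = []
    for start in range(0, len(seq) - segment + 1, segment):
        window = seq[start:start + segment]
        cands = [window[i:i + k] for i in range(segment - k + 1)]
        cands = [c for c in cands if not low_complexity(c)]
        if cands:
            reps.append(min(cands, key=fnv1))
    return reps
```

A homopolymer such as `"A" * 32` is filtered out, while each surviving segment contributes exactly one representative 32-mer to the dictionary.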
Reference sequences were mapped to the taxonomy hierarchy using the NCBI taxonomy database. The database contained 48,180 taxonomy nodes in January 2017.
Segment sizes and K-mer selection
K-mer dictionaries were built by computationally slicing reference genomes into sequential segments and selecting 32-mers to represent each segment. The "coarse" k-mer dictionary uses variable segment lengths, proportional to genome size and ranging from 200-8000 nt. The "fine" k-mer dictionary uses a constant 64 nt segment length for all genomes (for a 32-mer index this gives a 32x reduction in space, under the assumption that every spot contains at least one error-free 64-mer).
- Can I get the software?
Yes, on GitHub:
git clone https://github.com/ncbi/ngs-tools.git --branch tax
In the cloned repository you can find helper *.sh scripts.
- How can I cite you?
No publication yet. We intend to post a preprint soon.