The goal of this analysis is to compare different “groups” of samples as defined in the submission form. Grouped samples are analyzed for basic quality control metrics, clustering, and marker genes. Each section includes a “Data Interpretation” tab that provides guidance for interpreting the individual metrics of HIVE scRNA-seq data.
Please see our support documentation for additional resources for data analysis: Honeycomb Biotechnologies’ Support Page. Please contact support@honeycomb.bio for any additional questions.
Analysis Tier: Tier 1
The following samples were included in this analysis:
ExternalSampleID | AnalysisDate | Dataset | AlignedRef | SampleID | Group | SampleType | GeneThreshold | TranscriptThreshold | CellInput | NumBC | nFastq |
---|---|---|---|---|---|---|---|---|---|---|---|
MySample3 | 20230426 | 20230426-TestData | 20210603_GRCh38.104 | 30k_S1 | 30k | PBMC | 300 | 600 | 30000 | 20000 | 1 |
MySample4 | 20230426 | 20230426-TestData | 20210603_GRCh38.104 | 30k_S2 | 30k | PBMC | 300 | 600 | 30000 | 20000 | 1 |
MySample1 | 20230426 | 20230426-TestData | 20210603_GRCh38.104 | 7.5k_S1 | 7.5k | PBMC | 300 | 600 | 7500 | 20000 | 1 |
MySample2 | 20230426 | 20230426-TestData | 20210603_GRCh38.104 | 7.5k_S2 | 7.5k | PBMC | 300 | 600 | 7500 | 20000 | 1 |
This section contains quality control metrics for all reads in the FASTQs for each sample.
Where indicated, values are calculated as a percent of the total number of reads. Raw values for each sample are found in the sample’s individual SampleQC.tsv file.
Total Reads: Total reads in the FASTQ files for each
sample.
% Filtered Reads: The % of reads that have passed
filtering prior to alignment for each sample.
% Mapped Reads: The % of reads that map to the
reference genome for each sample.
% Exon Reads: The % of reads that map to exons for each
sample.
% PolyAReads: The % of reads that were filtered out for
containing a polyA stretch.
% 5PFReads: The % of reads that were filtered out due
to having adaptor sequence present in the 5’ end of the read.
% 3PFReads: The % of reads that were filtered out due
to having adaptor sequence present in the 3’ end of the read.
% badBaseBC: The % of reads that were filtered out due
to having two or more bases in the cell barcode with poor Phred
scores.
Metric | MySample1 | MySample2 | MySample3 | MySample4 |
---|---|---|---|---|
Total Reads | 30121627 | 29422071 | 87502938 | 83688475 |
% Filtered Reads | 98.87030 | 98.67853 | 99.20077 | 99.34335 |
% Mapped Reads | 93.01524 | 93.42827 | 94.38337 | 94.35557 |
% Exon Reads | 38.23032 | 39.53936 | 39.97816 | 39.62604 |
% PolyA Reads | 0.3136218 | 0.3292528 | 0.2744091 | 0.1922941 |
% PF5Reads | 0.8006473 | 0.9738472 | 0.5083544 | 0.4484261 |
% PF3 Reads | 0.01543409 | 0.01837056 | 0.01646916 | 0.01592692 |
% BadBC Reads | 0 | 0 | 0 | 0 |
Quality control metrics for each sample:
FracPassFilter: The fraction of all reads that pass
filtering (Filtered Reads / Total Reads).
HQCellsTotalReads: The number of all reads for cell
barcodes that pass gene and transcript threshold filtering (High Quality
cells).
FracReadsHQCells: The fraction of all reads that map to
High Quality Cells.
Total Reads: The median total reads per cell.
Mapped Reads: The median mapped reads per cell.
Exon Reads: The median exon reads per cell.
Sequencing Saturation: 1 - (number of unique valid
(cell barcode, transcript) combinations / number of mapped reads).
Complexity: The number of exon reads / number of
transcripts per cell.
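The ratio definitions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the pipeline's own code: the function names are hypothetical, and the inputs to the saturation example are made-up round numbers. The FracReadsHQCells example uses the MySample3 values reported in the tables of this section.

```python
def fraction(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rejecting a non-positive denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def sequencing_saturation(unique_barcode_transcript_pairs: int, mapped_reads: int) -> float:
    """1 - (unique valid (cell barcode, transcript) combinations / mapped reads)."""
    return 1.0 - fraction(unique_barcode_transcript_pairs, mapped_reads)


def complexity(exon_reads: float, transcripts: float) -> float:
    """Exon reads divided by transcripts for one cell."""
    return fraction(exon_reads, transcripts)


# MySample3: Total Reads and HQCellsTotalReads as listed in the report tables.
total_reads = 87502938
hq_cells_total_reads = 45308013
frac_reads_hq_cells = fraction(hq_cells_total_reads, total_reads)
print(f"FracReadsHQCells (MySample3): {frac_reads_hq_cells:.7f}")

# Hypothetical numbers: 25 unique combinations observed in 100 mapped reads.
print(f"Sequencing saturation (toy example): {sequencing_saturation(25, 100):.2f}")
```

The recomputed FracReadsHQCells agrees with the tabulated value for MySample3 to the precision shown in the table.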
ExternalSampleID | SampleID | Group | FracPassFilter | HQCellsTotalReads | FracReadsHQCells | TotalReads | MappedReads | ExonReads | SeqSat | Complexity |
---|---|---|---|---|---|---|---|---|---|---|
MySample3 | 30k_S1 | 30k | 0.9920077 | 45308013 | 0.5177885 | 15215 | 14642 | 7780.0 | 0.8306365 | 3.133043 |
MySample4 | 30k_S2 | 30k | 0.9934335 | 43801407 | 0.5233864 | 14230 | 13702 | 7287.0 | 0.8404012 | 3.310704 |
MySample1 | 7.5k_S1 | 7.5k | 0.9887030 | 12629948 | 0.4192983 | 15929 | 15320 | 8702.5 | 0.8592230 | 3.752627 |
MySample2 | 7.5k_S2 | 7.5k | 0.9867853 | 13223613 | 0.4494454 | 15082 | 14569 | 7888.0 | 0.8487386 | 3.534989 |
In this section we provide guidance for interpretation of your
results. Keep in mind that non-standard reference genomes may have
different metrics for some of these parameters.
There might be slight discrepancies between values in the
3.1.1: Sequencing QC-BeeNet tab and the 3.1.2: Sequencing QC-Plots/Table
tabs. The data in Section 3.1.1 is unfiltered, while the data in
Sections 3.1.2 and 3.1.3 contain only the metrics for cells retained for
analysis after QC filtering.
High quality data can be generally defined by the following parameters:
Sequencing QC- BeeNet
% Filtered Reads: > 80%
% Mapped Reads: > 80%
% Exon Reads: ~ 30-60%
% PolyAReads: < 10%
% 5PFReads: < 10%
% 3PFReads: < 10%
% badBaseBC: < 10%
Sequencing QC- Plots
Samples are sequenced to saturation if the following parameters have been met:
Complexity: > 4
Median Total Reads Per Cell: > 20,000
Sequencing Saturation: > 75%
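As a rough sketch, the three saturation criteria can be checked per sample as shown here. The class and function names are hypothetical, and the sketch assumes that the SeqSat column is stored as a fraction (0 to 1), so the 75% cut-off becomes 0.75. The values are those reported for MySample3 in the table of this section.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SequencingMetrics:
    sample: str
    median_total_reads: float  # "TotalReads" column (median per cell)
    seq_saturation: float      # "SeqSat" column, assumed to be a fraction
    complexity: float          # "Complexity" column


def saturation_checks(m: SequencingMetrics) -> Dict[str, bool]:
    """Evaluate each guideline separately so a failing criterion is easy to spot."""
    return {
        "complexity > 4": m.complexity > 4,
        "median total reads per cell > 20,000": m.median_total_reads > 20000,
        "sequencing saturation > 75%": m.seq_saturation > 0.75,
    }


my_sample3 = SequencingMetrics(
    sample="MySample3",
    median_total_reads=15215,
    seq_saturation=0.8306365,
    complexity=3.133043,
)

checks = saturation_checks(my_sample3)
for criterion, passed in checks.items():
    print(f"{my_sample3.sample}: {criterion}: {'met' if passed else 'not met'}")
```

Reporting each criterion separately, rather than a single pass/fail value, makes it clear which guideline a sample misses.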
Make sure that your sequencing depth is similar between your samples. If the Median Exon Reads per cell by Group is vastly different between samples, it is likely that you will have to sequence the shallow libraries more deeply or downsample the deeper-sequenced libraries to make accurate comparisons.
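One simple way to choose downsampling fractions is to scale every sample to the shallowest one. The sketch below is an assumption-laden illustration, not part of the pipeline: it assumes that median exon reads per cell scale roughly linearly with the fraction of reads retained, and the function name is hypothetical. The input values are the ExonReads medians from the table in this section.

```python
from typing import Dict


def downsample_fractions(median_exon_reads: Dict[str, float]) -> Dict[str, float]:
    """Fraction of reads to keep per sample so that all match the shallowest sample.

    Assumes depth per cell is proportional to the fraction of reads retained.
    """
    if not median_exon_reads:
        raise ValueError("no samples given")
    target = min(median_exon_reads.values())
    return {sample: target / depth for sample, depth in median_exon_reads.items()}


# Median exon reads per cell, taken from the ExonReads column.
exon_reads = {
    "MySample1": 8702.5,
    "MySample2": 7888.0,
    "MySample3": 7780.0,
    "MySample4": 7287.0,
}

fractions = downsample_fractions(exon_reads)
for sample, frac in sorted(fractions.items()):
    print(f"{sample}: keep {frac:.3f} of reads")
```

The shallowest sample always receives a fraction of 1.0, meaning it is left untouched.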
We removed cells with metrics below the gene and transcript thresholds specified in the metadata table.
Cell Recovery quality metrics for each sample:
CellInput: The number of cells originally loaded into
the HIVE device.
NumBC: The user-specified number of cell barcodes from
cells recovered from the HIVE.
GeneThreshold: The user-specified cut-off for gene
recovery.
TranscriptThreshold: The user-specified cut-off for
transcript recovery.
nCells: The number of cells recovered per sample.
nGenes: The median number of genes recovered per
cell.
nCount_RNA: The median number of transcripts recovered
per cell.
percMito: The % of transcripts that map to
mitochondrial genes per cell.
ExternalSampleID | SampleID | Group | CellInput | NumBC | GeneThreshold | TranscriptThreshold | nCells | nGenes | nCount_RNA | percMito |
---|---|---|---|---|---|---|---|---|---|---|
MySample3 | 30k_S1 | 30k | 30000 | 20000 | 300 | 600 | 2339 | 1150 | 2507 | 6.111567 |
MySample4 | 30k_S2 | 30k | 30000 | 20000 | 300 | 600 | 2247 | 1066 | 2210 | 6.291149 |
MySample1 | 7.5k_S1 | 7.5k | 7500 | 20000 | 300 | 600 | 560 | 1068 | 2309 | 6.103375 |
MySample2 | 7.5k_S2 | 7.5k | 7500 | 20000 | 300 | 600 | 589 | 1077 | 2295 | 6.369427 |
In this section we provide guidance for interpretation of your results. Keep in mind that non-standard reference genomes may have different metrics for some of these parameters.
We advise you to consider the following parameters using our range of performance metrics:
Cells Recovered by Group: If the number of cells
recovered is equal to the number of barcodes input into BeeNet, it is
likely that the true number of cells recovered is higher. Please rerun
this analysis with an increased NumBC parameter.
Median Genes by Group: > 800. This number is usually
higher for cell lines, but will likely be lower for samples that contain
granulocytes (Neutrophils, Basophils, Eosinophils). Granulocytes on
average express fewer genes and transcripts independent of sample
quality.
Median Transcripts by Group: > 1200. This number is
usually higher for cell lines, but will likely be lower for samples that
contain granulocytes (Neutrophils, Basophils, Eosinophils). Granulocytes
on average express fewer genes and transcripts independent of sample
quality.
Percent Mito by Group: < 20%.
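The guidelines above could be applied to one row of the cell recovery table as in this minimal sketch. The function name and the wording of the flags are hypothetical. The cut-offs are the ones listed above, and the example row is MySample3 from the table in this section.

```python
from typing import List


def recovery_flags(n_cells: int, num_bc: int, n_genes: float,
                   n_count_rna: float, perc_mito: float) -> List[str]:
    """Return a list of guideline deviations for one sample (empty if none)."""
    flags = []
    if n_cells >= num_bc:
        flags.append("nCells equals NumBC: rerun with an increased NumBC")
    if n_genes <= 800:
        flags.append("median genes <= 800 (may be expected with granulocytes)")
    if n_count_rna <= 1200:
        flags.append("median transcripts <= 1200 (may be expected with granulocytes)")
    if perc_mito >= 20:
        flags.append("percent mito >= 20%")
    return flags


# MySample3: nCells, NumBC, nGenes, nCount_RNA, percMito from the table.
sample3_flags = recovery_flags(2339, 20000, 1150, 2507, 6.111567)
print(sample3_flags or "MySample3: all cell recovery guidelines met")
```

A flag is a prompt to look more closely rather than a verdict, since sample type and reference genome can shift these ranges.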
Below are the results of dimensionality reduction and clustering (scTransform, vars.to.regress = pct.mito (if available)).
By Cluster
Dimensionality reduction of SCTransform-normalized data results in cells that are clustered based on gene expression profiles.
A UMAP (Uniform Manifold Approximation and Projection) is a 2-dimensional plot that captures this clustering data. In a UMAP, the distance between clusters along the axes relates to how similar the clusters are to one another. This is different from a t-SNE plot, where axis distance is uninformative.
A high quality UMAP contains:
- Clusters that are distinct from one another.
- Clusters that are not sample specific (unless biologically
relevant).
A low quality cluster on a UMAP may:
- Be diffuse around the UMAP. These clusters may be defined by RBC
contamination, high mitochondrial gene expression, or abnormally low
genes/transcripts.
- Be in the middle of or bridge two larger, unrelated clusters. This
could be indicative of multiplets.
If your data contains many cells in low-quality clusters, we recommend removing them manually and re-clustering with those cells removed or increasing the gene/transcript threshold and rerunning your analysis.
Auto-annotation of your samples using HIVE data (see Honeycomb Biotechnologies’ Resources Page) is available for filtered blood, bone marrow or PBMCs only. If auto-annotation is not available, scType (https://github.com/IanevskiAleksandr/sc-type) was used for preliminary cell-type annotation of mouse and human sample types.
If scType was used to auto-annotate your sample, the table below shows the score for each tissue in the algorithm. Cells were auto-annotated using the tissue with the highest score UNLESS your Sample Type matches a known dataset.
This table shows how many cells are assigned to each cell type.
Cell type | # Cells |
---|---|
Neutrophil-2 | 71 |
Neutrophil-3 | 77 |
Monocyte | 2042 |
CD16+ Monocyte | 432 |
CD1c+ DC | 98 |
Plasmocytoid DC | 52 |
Naive CD4 T cells | 422 |
Memory CD4 T cells | 1711 |
Memory CD8 T cells | 131 |
Cytotoxic cells | 382 |
Naive B cells | 189 |
Memory B cells | 128 |
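The counts in this table can be converted to fractions of all annotated cells, which is often easier to compare across datasets. The snippet below is a small sketch using only the numbers shown above. It also checks that the cell-type counts add up to the total number of cells recovered across the four samples (the nCells column).

```python
# Cell-type counts copied from the table above.
cell_type_counts = {
    "Neutrophil-2": 71,
    "Neutrophil-3": 77,
    "Monocyte": 2042,
    "CD16+ Monocyte": 432,
    "CD1c+ DC": 98,
    "Plasmocytoid DC": 52,
    "Naive CD4 T cells": 422,
    "Memory CD4 T cells": 1711,
    "Memory CD8 T cells": 131,
    "Cytotoxic cells": 382,
    "Naive B cells": 189,
    "Memory B cells": 128,
}

total_annotated = sum(cell_type_counts.values())

# nCells per sample from the cell recovery table (MySample3, 4, 1, 2).
total_recovered = 2339 + 2247 + 560 + 589

# Fraction of all annotated cells assigned to each cell type.
cell_type_fractions = {
    name: count / total_annotated for name, count in cell_type_counts.items()
}

print(f"annotated cells: {total_annotated}, recovered cells: {total_recovered}")
for name, frac in sorted(cell_type_fractions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {frac:.1%}")
```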
Based on UMAP and clustering analysis, clusters can be assigned to distinct cell types according to their gene expression profiles.
Clusters and cell types are not always interchangeable! Often there is more than one cluster within a cell type, indicating heterogeneity within a cell type. Other times, there are clusters not defined by biologically meaningful genes that should be merged together.
We have performed a preliminary cell-type annotation based on your sample type and reference genome. If scType was used to detect your tissue, the dataset with the top score was used for auto-annotation.
Additional analysis is likely required to better understand the biology behind your experiment.
Here we examine the marker genes expressed in each cluster for all samples. Gene expression is presented as the log2 fold change (a log2 fold change of 0.58 is equivalent to a 1.5-fold change, or a 50% increase). The fold change is calculated by comparing the expression of each gene in the cluster named in the “cluster” column with the expression of the same gene in all cells of the other clusters.
The table below is the output of this analysis.
avg_log2FC: Average log2 fold change.
pct.1: Percent of cells expressing the indicated gene
in the cluster specified in the “cluster” column.
pct.2: Percent of all other cells in the other clusters
that express the indicated gene.
cluster: The specific cluster that is being
evaluated.
SCT_enriched: The SCTransformed expression value of the
indicated gene in the cluster specified in the “cluster” column.
SCT_otherCluster: The SCTransformed expression value of
the indicated gene in all other clusters.
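For readers less used to log2 units, the conversion between avg_log2FC and a linear fold change can be sketched as follows. The helper names are hypothetical. The 0.58 example is the one quoted in the text.

```python
import math


def log2fc_to_fold_change(log2fc: float) -> float:
    """Convert a log2 fold change to a linear fold change (2 ** log2fc)."""
    return 2.0 ** log2fc


def fold_change_to_log2fc(fold_change: float) -> float:
    """Convert a linear fold change (> 0) to a log2 fold change."""
    if fold_change <= 0:
        raise ValueError("fold change must be positive")
    return math.log2(fold_change)


for value in (0.58, 1.0, -1.0):
    print(f"avg_log2FC {value:+.2f} -> {log2fc_to_fold_change(value):.2f}-fold")

print(f"1.5-fold -> avg_log2FC {fold_change_to_log2fc(1.5):.2f}")
```

Negative values indicate lower expression in the cluster of interest: an avg_log2FC of -1 corresponds to half the expression seen in the other cells.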
Here we examine the marker genes expressed in each cell type for all samples. Gene expression is presented as the log2 fold change (a log2 fold change of 0.58 is equivalent to a 1.5-fold change, or a 50% increase). The fold change is calculated by comparing the expression of each gene in the cell type named in the “cell_type” column with the expression of the same gene in all cells of the other cell types.
The table below is the output of this analysis.
avg_log2FC: Average log2 fold change.
pct.1: Percent of cells expressing the indicated gene
in the cell type specified in the “cell_type” column.
pct.2: Percent of all other cells in the other cell types
that express the indicated gene.
cell_type: The specific cell type that is being
evaluated.
SCT_enriched: The SCTransformed expression value of the
indicated gene in the cell type specified in the “cell_type”
column.
SCT_otherCluster: The SCTransformed expression value of
the indicated gene in all other cell types.
Marker genes were defined using Seurat’s FindAllMarkers function. Each individual cluster is compared to all other clusters to define genes that are either more highly expressed (as defined by the log2 Fold Change), or are uniquely expressed (defined as the ratio of pct.1 to pct.2).
Expression levels of these genes (the SCT_enriched column) are provided so that you can determine whether a gene is expressed highly enough to be biologically meaningful.
In some cases, low quality cells can be defined by presence of only mitochondrial or ribosomal genes in this marker gene table for a particular cluster. Clusters that are only distinguished by mitochondrial or ribosomal genes should likely be removed (after taking a look at other quality metrics for these cells).
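A quick way to screen a marker gene table for such clusters is sketched below. It assumes human-style gene symbols, where mitochondrial genes start with "MT-" and ribosomal protein genes start with "RPS" or "RPL". The function names and the example marker lists are hypothetical and serve only as an illustration. Prefix matching is deliberately crude and can also catch unrelated genes (for example RPS6KA1, a kinase), so any flagged cluster should still be reviewed against the other quality metrics.

```python
from typing import Dict, List

# Assumed naming convention for mitochondrial and ribosomal protein genes.
MITO_RIBO_PREFIXES = ("MT-", "RPS", "RPL")


def is_mito_or_ribo(gene: str) -> bool:
    """True if the gene symbol starts with a mitochondrial/ribosomal prefix."""
    return gene.upper().startswith(MITO_RIBO_PREFIXES)


def clusters_with_only_mito_ribo_markers(markers: Dict[str, List[str]]) -> List[str]:
    """Return clusters whose marker genes are all mitochondrial or ribosomal."""
    return [
        cluster
        for cluster, genes in markers.items()
        if genes and all(is_mito_or_ribo(g) for g in genes)
    ]


# Hypothetical top markers per cluster, for illustration only.
example_markers = {
    "0": ["CD3E", "IL7R", "RPS12"],
    "5": ["MT-CO1", "MT-ND4", "RPL13"],
}

flagged = clusters_with_only_mito_ribo_markers(example_markers)
print(f"clusters to review: {flagged}")
```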
Here we examine the highest expressing genes associated with each cluster and their expression in the overall data. Feature Plots for individual clusters are found in the output files, but a snapshot is shown here:
Here we examine the top genes associated with each cell type and their expression in the overall data. Feature Plots for individual cell types are found in the output files, but a snapshot is shown here:
Based on your sample type, we can provide marker genes that define each known cell type. This Dot Plot is shown below:
Feature Plots map expression of your favorite genes onto the UMAP plot. Feature Plots can be generated for any gene, but this snapshot shows:
Feature Plot by cluster: The two most highly
expressed genes for an individual cluster. Unique genes were chosen to
be represented within this plot to remove ubiquitously expressed genes
and to determine whether heterogeneity exists between individual
clusters within cell types.
Feature Plot by cell type: The two most highly
expressed genes for an individual cell type. These are often canonical
pan-markers for biological cell types.
Dot Plots are a different method to show expression of individual genes. It is a visual representation of both expression level (the color intensity) and expression specificity and sensitivity (size of the dot). Based on cell type annotation, a Dot Plot with canonical marker genes for those cell types is also provided.
In some instances, the most expressed or most unique genes might not be the most biologically appropriate genes to display. Our plots are only a guide to drive further biological analysis.
This plot shows the proportion of each sample that makes up individual clusters.
This table shows the proportion of each cluster that makes up individual samples.
This plot shows the proportion of each group that makes up individual
clusters.
This table shows the proportion of each cluster that makes up each group.
Cluster proportions are often useful for determining sample-specific batch effects and sample quality. However, there are often biological reasons for differences in cell type proportions.
This plot shows the proportion of each sample that makes up each cell type.
This table shows the proportion of each cell type that makes up each sample.
This plot shows the proportion of each group that makes up each cell type.
This table shows the proportion of each group that makes up each cell type.
Cell type proportions are often useful for determining sample-specific batch effects and sample quality. However, there are often biological reasons for differences in cell type proportions.
This is a preliminary analysis performed using the Honeycomb External Informatics Pipeline.
Each section of this report contains a “Data Interpretation” Tab that provides guidelines and metrics for interpretation of your data. Please remember that these are only guidelines, and that depending on your sample type and reference genome your data may look different.
Your output folder, when unzipped, may contain the following files:
ClusterMetrics-DATASETNAME.txt: Sequencing quality
control metrics, cell, gene and transcript recovery per cluster.
ClusterMetricsByGroup.txt: Sequencing quality control
metrics, cell, gene and transcript recovery per cluster per group.
FullCluster_Enrichments_CellType.txt: Marker gene
analysis per cell type.
FullCluster_Enrichments.txt: Marker gene analysis per
cluster.
FullDataset-DATASETNAME.Rdata: R object containing the
Seurat Object and data for this analysis.
HCB-scRNAseq-DATASETNAME_files: This folder contains
all of the files found in this report, output as png images.
SampleMetrics-DATASETNAME.txt: This file contains all
of the metrics for each sample, including sequencing metrics, metadata,
cell/gene/transcript recovery, and additional metadata.
To ensure accurate single-cell data analysis, be sure that sequencing depth is similar between all samples. If the total number of reads per cell is significantly different between samples:
The Honeycomb BioX team recommends examining the data for appropriate cell, gene and transcript recovery. If data is of low quality (e.g. high percent mitochondria, avg <200 genes and <400 transcripts), consider the following next steps:
You are the biological experts for your data! Auto-annotation packages can often miss nuances in gene expression patterns. This is a preliminary data analysis workflow, but in-depth analysis of your samples is highly recommended!
Component | Version |
---|---|
AlignedRef | 20210603_GRCh38.104 |
BeeNet | 5f58c57_v1.1.3 |
SecondaryPipeline | v1.0 |
WDL | v1.0.0 |
Setting | Value |
---|---|
version | R version 4.2.2 (2022-10-31) |
os | Ubuntu 22.04.1 LTS |
system | x86_64, linux-gnu |
ui | X11 |
language | (EN) |
collate | en_US.UTF-8 |
ctype | en_US.UTF-8 |
tz | Etc/UTC |
date | 2023-04-26 |
pandoc | 2.19.2 @ /usr/local/bin/ (via rmarkdown) |
Package | Loaded version | Date |
---|---|---|
Biobase | 2.58.0 | 2022-11-01 |
BiocGenerics | 0.44.0 | 2022-11-01 |
celldex | 1.8.0 | 2022-11-03 |
dplyr | 1.0.10 | 2022-09-01 |
DT | 0.26 | 2022-10-19 |
forcats | 0.5.2 | 2022-08-19 |
GenomeInfoDb | 1.34.3 | 2022-11-10 |
GenomicRanges | 1.50.1 | 2022-11-06 |
ggplot2 | 3.4.0 | 2022-11-04 |
ggpubr | 0.5.0 | 2022-11-16 |
HGNChelper | 0.8.1 | 2019-10-24 |
IRanges | 2.32.0 | 2022-11-01 |
knitr | 1.40 | 2022-08-24 |
magick | 2.7.3 | 2021-08-18 |
MatrixGenerics | 1.10.0 | 2022-11-01 |
matrixStats | 0.62.0 | 2022-04-19 |
patchwork | 1.1.2 | 2022-08-19 |
plyr | 1.8.8 | 2022-11-11 |
purrr | 0.3.5 | 2022-10-06 |
RColorBrewer | 1.1-3 | 2022-04-03 |
readr | 2.1.3 | 2022-10-01 |
S4Vectors | 0.36.0 | 2022-11-01 |
scales | 1.2.1 | 2022-08-20 |
scuttle | 1.8.0 | 2022-11-01 |
Seurat | 4.2.1 | 2022-11-08 |
SeuratData | 0.2.2 | 2022-11-18 |
SeuratDisk | 0.0.0.9020 | 2022-11-18 |
SeuratObject | 4.1.3 | 2022-11-07 |
SingleCellExperiment | 1.20.0 | 2022-11-01 |
SingleR | 2.0.0 | 2022-11-01 |
stringr | 1.4.1 | 2022-08-20 |
SummarizedExperiment | 1.28.0 | 2022-11-01 |
tibble | 3.1.8 | 2022-07-22 |
tidyr | 1.2.1 | 2022-09-08 |
tidyverse | 1.3.2 | 2022-07-18 |
UCell | 2.2.0 | 2022-11-01 |