1 Introduction

The goal of this analysis is to compare the “groups” of samples defined in the submission form. Grouped samples are analyzed for basic quality control metrics, clustering, and marker genes. Each section includes a “Data Interpretation” tab, where we provide guidance for interpreting individual metrics of HIVE scRNA-seq data.

Please see our support documentation for additional resources for data analysis: Honeycomb Biotechnologies’ Support Page. Please contact support@honeycomb.bio for any additional questions.

Analysis Tier: Tier 1

2 Metadata

The following samples were included in this analysis:

Samples for analysis
ExternalSampleID	AnalysisDate	Dataset	AlignedRef	SampleID	Group	SampleType	GeneThreshold	TranscriptThreshold	CellInput	NumBC	nFastq
MySample3	20230426	20230426-TestData	20210603_GRCh38.104	30k_S1	30k	PBMC	300	600	30000	20000	1
MySample4	20230426	20230426-TestData	20210603_GRCh38.104	30k_S2	30k	PBMC	300	600	30000	20000	1
MySample1	20230426	20230426-TestData	20210603_GRCh38.104	7.5k_S1	7.5k	PBMC	300	600	7500	20000	1
MySample2	20230426	20230426-TestData	20210603_GRCh38.104	7.5k_S2	7.5k	PBMC	300	600	7500	20000	1

3 Data Processing

3.1 Sequencing QC

3.1.1 Sequencing QC-BeeNet

This section contains quality control metrics for all reads in the FASTQs for each sample.

Where designated, values are calculated as a percent of the total number of reads. Raw values for each sample are found in the sample’s individual SampleQC.tsv file.

Total Reads: Total reads in the FASTQ files for each sample.
% Filtered Reads: The % of reads that have passed filtering prior to alignment for each sample.
% Mapped Reads: The % of reads that map to the reference genome for each sample.
% Exon Reads: The % of reads that map to exons for each sample.
% PolyAReads: The % of reads that were filtered out for containing a polyA stretch (row “% PolyA Reads” in the table below).
% 5PFReads: The % of reads that were filtered out due to having adaptor sequence present in the 5’ end of the read (row “% PF5Reads” in the table below).
% 3PFReads: The % of reads that were filtered out due to having adaptor sequence present in the 3’ end of the read (row “% PF3 Reads” in the table below).
% badBaseBC: The % of reads that were filtered out due to having two or more bases in the cell barcode with poor Phred scores (row “% BadBC Reads” in the table below).

Combined BeeNet outputs (Calculated from SampleQC.tsv)
	MySample1	MySample2	MySample3	MySample4
Total Reads	30121627	29422071	87502938	83688475
% Filtered Reads	98.87030	98.67853	99.20077	99.34335
% Mapped Reads	93.01524	93.42827	94.38337	94.35557
% Exon Reads	38.23032	39.53936	39.97816	39.62604
% PolyA Reads	0.3136218	0.3292528	0.2744091	0.1922941
% PF5Reads	0.8006473	0.9738472	0.5083544	0.4484261
% PF3 Reads	0.01543409	0.01837056	0.01646916	0.01592692
% BadBC Reads	0	0	0	0

3.1.2 Sequencing QC-Plots

3.1.3 Sequencing QC-Table

Quality control metrics for each sample:

FracPassFilter: The fraction of all reads that pass filtering (Filtered Reads/ Total Reads).
HQCellsTotalReads: The total number of reads from cell barcodes that pass gene and transcript threshold filtering (High Quality cells).
FracReadsHQCells: The fraction of all reads that map to High Quality Cells.
Total Reads: The median total reads per cell.
Mapped Reads: The median mapped reads per cell.
Exon Reads: The median exon reads per cell.
Sequencing Saturation: 1 - (number of unique valid cell-barcode/transcript combinations / number of mapped reads).
Complexity: The number of exon reads / number of transcripts per cell.
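
The two formulas above can be written out as a minimal Python sketch. The function names and the example counts are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the BeeNet source code or from this dataset.

```python
def sequencing_saturation(unique_barcode_transcript_pairs, mapped_reads):
    """1 - (unique cell-barcode/transcript combinations / mapped reads)."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return 1.0 - unique_barcode_transcript_pairs / mapped_reads


def complexity(exon_reads, transcripts):
    """Exon reads divided by transcripts for one cell."""
    if transcripts <= 0:
        raise ValueError("transcripts must be positive")
    return exon_reads / transcripts


# Hypothetical counts (assumption): 2,000 unique combinations in 10,000 mapped reads,
# and a cell with 8,000 exon reads supporting 2,500 transcripts.
example_saturation = sequencing_saturation(2_000, 10_000)
example_complexity = complexity(8_000, 2_500)
print(round(example_saturation, 3), round(example_complexity, 3))
```

A saturation near 1 means that most mapped reads are repeat observations of transcripts that were already counted, so additional sequencing would recover few new transcripts.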

Sequencing QC
ExternalSampleID	SampleID	Group	FracPassFilter	HQCellsTotalReads	FracReadsHQCells	TotalReads	MappedReads	ExonReads	SeqSat	Complexity
MySample3	30k_S1	30k	0.9920077	45308013	0.5177885	15215	14642	7780.0	0.8306365	3.133043
MySample4	30k_S2	30k	0.9934335	43801407	0.5233864	14230	13702	7287.0	0.8404012	3.310704
MySample1	7.5k_S1	7.5k	0.9887030	12629948	0.4192983	15929	15320	8702.5	0.8592230	3.752627
MySample2	7.5k_S2	7.5k	0.9867853	13223613	0.4494454	15082	14569	7888.0	0.8487386	3.534989

3.1.4 Sequencing QC-Data Interpretation

In this section we provide guidance for interpretation of your results. Keep in mind that non-standard reference genomes may have different metrics for some of these parameters.
There might be slight discrepancies between values in the 3.1.1 Sequencing QC-BeeNet tab and the 3.1.2 Sequencing QC-Plots and 3.1.3 Sequencing QC-Table tabs. The data in Section 3.1.1 are unfiltered, while the data in Sections 3.1.2 and 3.1.3 contain only the metrics for cells retained for analysis after QC filtering.

High quality data can be generally defined by the following parameters:

Sequencing QC- BeeNet

% Filtered Reads: > 80%
% Mapped Reads: > 80%
% Exon Reads: ~ 30-60%
% PolyAReads: < 10%
% 5PFReads: < 10 %
% 3PFReads: < 10 %
% badBaseBC: < 10 %
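
As an illustration, the ranges above can be applied to one column of the Combined BeeNet outputs table. This is a minimal sketch, not part of the pipeline: the dictionary keys reuse the row labels from that table, the values are copied from the MySample1 column, and the helper name failed_beenet_metrics is hypothetical.

```python
# Ranges copied from the guidance above; each rule returns True when the value passes.
BEENET_RULES = {
    "% Filtered Reads": lambda v: v > 80,
    "% Mapped Reads": lambda v: v > 80,
    "% Exon Reads": lambda v: 30 <= v <= 60,
    "% PolyA Reads": lambda v: v < 10,
    "% PF5Reads": lambda v: v < 10,
    "% PF3 Reads": lambda v: v < 10,
    "% BadBC Reads": lambda v: v < 10,
}

# MySample1 column of the Combined BeeNet outputs table in Section 3.1.1.
MYSAMPLE1 = {
    "% Filtered Reads": 98.87030,
    "% Mapped Reads": 93.01524,
    "% Exon Reads": 38.23032,
    "% PolyA Reads": 0.3136218,
    "% PF5Reads": 0.8006473,
    "% PF3 Reads": 0.01543409,
    "% BadBC Reads": 0,
}


def failed_beenet_metrics(metrics):
    """Return the names of metrics that fall outside the suggested ranges."""
    return [name for name, rule in BEENET_RULES.items() if not rule(metrics[name])]


print(failed_beenet_metrics(MYSAMPLE1))
```

An empty list means that every metric for the sample falls inside the suggested range.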

Sequencing QC- Plots

Samples are sequenced to saturation if the following parameters have been met:

Complexity: > 4
Median Total Reads Per Cell: > 20,000
Sequencing Saturation: > 75%

Make sure that your sequencing depth is similar between your samples. If the Median Exon Reads per cell by Group is vastly different between samples, it is likely that you will have to sequence the shallow libraries more deeply or downsample the deeper-sequenced libraries to make accurate comparisons.
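
The saturation criteria and the depth comparison can be checked against the Sequencing QC table in Section 3.1.3. The sketch below is illustrative only: the values are copied from that table, the key names are hypothetical, and because this report gives no numeric cut-off for "vastly different", the depth ratio is reported without a pass/fail call.

```python
# Values copied from the Sequencing QC table (Complexity, TotalReads, SeqSat, ExonReads).
SEQUENCING_QC = {
    "MySample3": {"complexity": 3.133043, "median_total_reads": 15215,
                  "seq_saturation": 0.8306365, "median_exon_reads": 7780.0},
    "MySample4": {"complexity": 3.310704, "median_total_reads": 14230,
                  "seq_saturation": 0.8404012, "median_exon_reads": 7287.0},
    "MySample1": {"complexity": 3.752627, "median_total_reads": 15929,
                  "seq_saturation": 0.8592230, "median_exon_reads": 8702.5},
    "MySample2": {"complexity": 3.534989, "median_total_reads": 15082,
                  "seq_saturation": 0.8487386, "median_exon_reads": 7888.0},
}


def saturation_checks(metrics):
    """Evaluate the three saturation criteria for one sample."""
    return {
        "complexity > 4": metrics["complexity"] > 4,
        "median total reads per cell > 20,000": metrics["median_total_reads"] > 20_000,
        "sequencing saturation > 75%": metrics["seq_saturation"] > 0.75,
    }


def exon_depth_ratio(qc):
    """Ratio of the deepest to the shallowest median exon reads per cell."""
    depths = [m["median_exon_reads"] for m in qc.values()]
    return max(depths) / min(depths)


for sample, metrics in SEQUENCING_QC.items():
    print(sample, saturation_checks(metrics))
print("deepest / shallowest exon depth:", round(exon_depth_ratio(SEQUENCING_QC), 2))
```

For this dataset, every sample meets the sequencing saturation criterion but not the complexity or median total reads criteria.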

3.2 Cell Recovery

3.2.1 Cell Recovery-Plots

We removed cells with metrics below the gene and transcript thresholds specified in the metadata table.
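
The filtering rule can be sketched as follows. This is a minimal illustration rather than the pipeline's code: the per-cell records are hypothetical, the default thresholds are the GeneThreshold and TranscriptThreshold values from the metadata table, and retaining cells that sit exactly at a threshold is an assumption based on the wording "below".

```python
def filter_cells(cells, gene_threshold=300, transcript_threshold=600):
    """Keep cells at or above both thresholds; cells below either one are removed."""
    return [
        cell for cell in cells
        if cell["n_genes"] >= gene_threshold
        and cell["n_transcripts"] >= transcript_threshold
    ]


# Hypothetical per-cell counts (assumption), not taken from this dataset.
cells = [
    {"barcode": "cell_a", "n_genes": 1150, "n_transcripts": 2507},
    {"barcode": "cell_b", "n_genes": 250, "n_transcripts": 900},  # too few genes
    {"barcode": "cell_c", "n_genes": 450, "n_transcripts": 500},  # too few transcripts
    {"barcode": "cell_d", "n_genes": 300, "n_transcripts": 600},  # exactly at both thresholds
]

kept = [cell["barcode"] for cell in filter_cells(cells)]
print(kept)  # ['cell_a', 'cell_d']
```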

3.2.2 Cell Recovery-Table

Cell Recovery quality metrics for each sample:
CellInput: The number of cells originally loaded into the HIVE device.
NumBC: The user-specified number of cell barcodes from cells recovered from the HIVE.
GeneThreshold: The user-specified cut-off for gene recovery.
TranscriptThreshold: The user-specified cut-off for transcript recovery.
nCells: The number of cells recovered per sample.
nGenes: The median number of genes recovered per cell.
nCount_RNA: The median number of transcripts recovered per cell.
percMito: The % of transcripts that map to mitochondrial genes per cell.

Cell Recovery Metrics after filtering
ExternalSampleID	SampleID	Group	CellInput	NumBC	GeneThreshold	TranscriptThreshold	nCells	nGenes	nCount_RNA	percMito
MySample3	30k_S1	30k	30000	20000	300	600	2339	1150	2507	6.111567
MySample4	30k_S2	30k	30000	20000	300	600	2247	1066	2210	6.291149
MySample1	7.5k_S1	7.5k	7500	20000	300	600	560	1068	2309	6.103375
MySample2	7.5k_S2	7.5k	7500	20000	300	600	589	1077	2295	6.369427

3.2.3 Cell Recovery-Data Interpretation

In this section we provide guidance for interpretation of your results. Keep in mind that non-standard reference genomes may have different metrics for some of these parameters.

We advise you to consider the following parameters using our range of performance metrics:

Cells Recovered by Group: If the number of cells recovered is equal to the number of barcodes input into BeeNet, it is likely that the true number of cells recovered is higher. Please rerun this analysis with an increased NumBC parameter.
Median Genes by Group: > 800. This number is usually higher for cell lines, but will likely be lower for samples that contain granulocytes (Neutrophils, Basophils, Eosinophils). Granulocytes on average express fewer genes and transcripts independent of sample quality.
Median Transcripts by Group: > 1200. This number is usually higher for cell lines, but will likely be lower for samples that contain granulocytes (Neutrophils, Basophils, Eosinophils). Granulocytes on average express fewer genes and transcripts independent of sample quality.
Percent Mito by Group: < 20%.
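
These checks can be applied to the Cell Recovery Metrics table in Section 3.2.2. The sketch below is only an illustration: the rows are copied from that table, while the function name and the wording of the flags are hypothetical.

```python
# Rows copied from the Cell Recovery Metrics table.
CELL_RECOVERY = {
    "MySample3": {"NumBC": 20000, "nCells": 2339, "nGenes": 1150, "nCount_RNA": 2507, "percMito": 6.111567},
    "MySample4": {"NumBC": 20000, "nCells": 2247, "nGenes": 1066, "nCount_RNA": 2210, "percMito": 6.291149},
    "MySample1": {"NumBC": 20000, "nCells": 560, "nGenes": 1068, "nCount_RNA": 2309, "percMito": 6.103375},
    "MySample2": {"NumBC": 20000, "nCells": 589, "nGenes": 1077, "nCount_RNA": 2295, "percMito": 6.369427},
}


def recovery_flags(row):
    """Return a list of human-readable flags for one sample (empty if none apply)."""
    flags = []
    if row["nCells"] == row["NumBC"]:
        flags.append("nCells equals NumBC: rerun with a larger NumBC")
    if row["nGenes"] <= 800:
        flags.append("median genes not above 800")
    if row["nCount_RNA"] <= 1200:
        flags.append("median transcripts not above 1200")
    if row["percMito"] >= 20:
        flags.append("percent mito not below 20%")
    return flags


all_flags = {sample: recovery_flags(row) for sample, row in CELL_RECOVERY.items()}
print(all_flags)
```

None of the four samples in this dataset is flagged. A flag on median genes or transcripts is not necessarily a quality problem for granulocyte-rich samples, as noted above.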

4 Results

4.1 Dimensionality Reduction

4.1.1 Dimensionality Reduction-UMAP

Below are the results of dimensionality reduction and clustering (scTransform, vars.to.regress = pct.mito (if available)).

4.1.2 Dimensionality Reduction-Violin Plots

By Cluster

4.1.3 Dimensionality Reduction-Data Interpretation

Dimensionality Reduction using scTransform results in cells that are clustered based on gene expression profiles.

A UMAP (Uniform Manifold Approximation and Projection) is a 2-dimensional plot that captures this clustering data. In a UMAP, distance along the axes relates to how similar clusters are to one another. This is different from a t-SNE plot, where axis distance is uninformative.

A high quality UMAP contains:
- Clusters that are distinct from one another.
- Clusters that are not sample specific (unless biologically relevant).

A low quality cluster on a UMAP may:
- Be diffuse around the UMAP. These clusters may be defined by RBC contamination, high mitochondrial gene expression, or abnormally low genes/transcripts.
- Be in the middle of or bridge two larger, unrelated clusters. This could be indicative of multiplets.

If your data contains many cells in low-quality clusters, we recommend removing them manually and re-clustering with those cells removed or increasing the gene/transcript threshold and rerunning your analysis.

4.2 Cell Type Annotation

4.2.1 Cell Type Annotation-UMAP

Auto-annotation of your samples using HIVE data (see Honeycomb Biotechnologies’ Resources Page) is available for filtered blood, bone marrow or PBMCs only. If auto-annotation is not available, scType (https://github.com/IanevskiAleksandr/sc-type) was used for preliminary cell-type annotation of mouse and human sample types.

4.2.2 Cell Type Annotation-Table

If scType was used to auto-annotate your sample, the table below shows the score for each tissue in the algorithm. Cells were auto-annotated using the tissue with the highest score UNLESS your Sample Type matches a known dataset.

This table shows how many cells are assigned to each cell type.

HCB Cell Type ID
Cell type	# Cells
Neutrophil-2	71
Neutrophil-3	77
Monocyte	2042
CD16+ Monocyte	432
CD1c+ DC	98
Plasmacytoid DC	52
Naive CD4 T cells	422
Memory CD4 T cells	1711
Memory CD8 T cells	131
Cytotoxic cells	382
Naive B cells	189
Memory B cells	128
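
The counts in this table can be converted to fractions of all annotated cells. A minimal sketch, with the counts copied from the table above; the variable names are hypothetical. The counts sum to 5735, which equals the total nCells across the four samples in Section 3.2.2 (2339 + 2247 + 560 + 589).

```python
# Counts copied from the HCB Cell Type ID table above.
CELL_TYPE_COUNTS = {
    "Neutrophil-2": 71,
    "Neutrophil-3": 77,
    "Monocyte": 2042,
    "CD16+ Monocyte": 432,
    "CD1c+ DC": 98,
    "Plasmacytoid DC": 52,
    "Naive CD4 T cells": 422,
    "Memory CD4 T cells": 1711,
    "Memory CD8 T cells": 131,
    "Cytotoxic cells": 382,
    "Naive B cells": 189,
    "Memory B cells": 128,
}

total_cells = sum(CELL_TYPE_COUNTS.values())
fractions = {name: count / total_cells for name, count in CELL_TYPE_COUNTS.items()}

# Print cell types from most to least abundant.
for name, fraction in sorted(fractions.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}\t{fraction:.1%}")
```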

4.2.3 Cell Type Annotation-Data Interpretation

Based on UMAP and Clustering analysis, clusters can be assigned to distinct cell types according to their gene expression profile.

Clusters and cell types are not always interchangeable! Often there is more than one cluster within a cell type, indicating heterogeneity within a cell type. Other times, there are clusters not defined by biologically meaningful genes that should be merged together.

We have performed a preliminary cell-type annotation based on your sample type and reference genome. If scType was used to detect your tissue, the dataset with the top score was used for auto-annotation.

Additional analysis is likely required to better understand the biology behind your experiment.

4.3 Marker Gene Analysis

4.3.1 Marker Gene Analysis-Cluster

Here we examine the marker genes expressed in each cluster for all samples. Gene expression is presented as the log2 fold change (Log2 fold change of 0.58 is equivalent to 1.5 fold, or a 50% change). The fold change is calculated by comparing the expression of each gene in the “cluster” column versus the expression of the same gene in all other cells in the other clusters.

The table below is the output of this analysis.

avg_log2FC: Average log2 fold change.
pct.1: Percent of cells expressing the indicated gene in the cluster specified in the “cluster” column.
pct.2: Percent of all other cells in the other clusters that express the indicated gene.
cluster: The specific cluster that is being evaluated.
SCT_enriched: The SCTransformed expression value of the indicated gene in the cluster specified in the “cluster” column.
SCT_otherCluster: The SCTransformed expression value of the indicated gene in all other clusters.
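
The relationship between a log2 fold change and a linear fold change can be checked with a few lines of Python. This is a generic conversion, not code from the pipeline, and the function name is hypothetical.

```python
def fold_change(log2_fold_change):
    """Convert a log2 fold change to a linear fold change."""
    return 2.0 ** log2_fold_change


# A log2 fold change of 0.58 corresponds to roughly a 1.5-fold difference.
print(round(fold_change(0.58), 2))

# Positive and negative values are symmetric: +1 is a doubling, -1 is a halving.
print(fold_change(1.0), fold_change(-1.0))
```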

4.3.2 Marker Gene Analysis-Cell Type

Here we examine the marker genes expressed in each cell type for all samples. Gene expression is presented as the log2 fold change (Log2 fold change of 0.58 is equivalent to 1.5 fold, or a 50% change). The fold change is calculated by comparing the expression of each gene in the “cell_type” column versus the expression of the same gene in all other cells in the other cell types.

The table below is the output of this analysis.

avg_log2FC: Average log2 fold change.
pct.1: Percent of cells expressing the indicated gene in the cell type specified in the “cell_type” column.
pct.2: Percent of all other cells in the other cell types that express the indicated gene.
cell_type: The specific cell type that is being evaluated.
SCT_enriched: The SCTransformed expression value of the indicated gene in the cell type specified in the “cell_type” column.
SCT_otherCluster: The SCTransformed expression value of the indicated gene in all other cell types.

4.3.3 Marker Gene Analysis-Data Interpretation

Marker genes were defined using Seurat’s FindAllMarkers function. Each individual cluster is compared to all other clusters to define genes that are either more highly expressed (as defined by the log2 Fold Change), or are uniquely expressed (defined as the ratio of pct.1 to pct.2).

Expression levels of these genes (the SCT_enriched column) are provided for the user to determine whether the gene is expressed highly enough to be biologically meaningful.

In some cases, low quality cells can be defined by presence of only mitochondrial or ribosomal genes in this marker gene table for a particular cluster. Clusters that are only distinguished by mitochondrial or ribosomal genes should likely be removed (after taking a look at other quality metrics for these cells).
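
One way to screen a marker gene table for such clusters is sketched below. The marker rows are hypothetical, and the gene-name prefixes are an assumption based on human gene symbols ("MT-" for mitochondrial genes, "RPS" and "RPL" for ribosomal protein genes); other species or annotations may need different prefixes.

```python
from collections import defaultdict

# Assumed prefixes for human mitochondrial and ribosomal protein gene symbols.
LOW_INFO_PREFIXES = ("MT-", "RPS", "RPL")


def clusters_with_only_low_info_markers(markers):
    """Return clusters whose marker genes are all mitochondrial or ribosomal.

    `markers` is an iterable of (cluster, gene) pairs.
    """
    genes_by_cluster = defaultdict(list)
    for cluster, gene in markers:
        genes_by_cluster[cluster].append(gene)
    return sorted(
        cluster
        for cluster, genes in genes_by_cluster.items()
        if all(gene.upper().startswith(LOW_INFO_PREFIXES) for gene in genes)
    )


# Hypothetical marker rows (assumption), not taken from this dataset.
example_markers = [
    ("0", "CD14"), ("0", "LYZ"),
    ("1", "MT-CO1"), ("1", "RPS27"),
    ("2", "MT-ND1"), ("2", "CD3E"),
]

suspect_clusters = clusters_with_only_low_info_markers(example_markers)
print(suspect_clusters)  # ['1']
```

A flagged cluster is only a candidate for removal; the other quality metrics for its cells should still be reviewed first.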

4.4 Gene Expression Plots

4.4.1 Feature Plots by Cluster

Here we examine the highest expressing genes associated with each cluster and their expression in the overall data. Feature Plots for individual clusters are found in the output files, but a snapshot is shown here:

4.4.2 Feature Plots by Cell Type

Here we examine the top genes associated with each cell type and their expression in the overall data. Feature Plots for individual cell types are found in the output files, but a snapshot is shown here:

4.4.3 Dot Plot by Cell Type-Known Markers

Based on your sample type, we can provide marker genes that define each known cell type. This Dot Plot is shown below:

4.4.4 Gene Expression Plots- Data Interpretation

Feature Plots map expression of your favorite genes onto the UMAP plot. Feature Plots can be generated for any gene, but this snapshot shows:

Feature Plot by cluster: The two most highly expressed genes for an individual cluster. Unique genes were chosen to be represented within this plot to remove ubiquitously expressed genes and to determine whether heterogeneity exists between individual clusters within cell types.
Feature Plot by cell type: The two most highly expressed genes for an individual cell type. These are often canonical pan-markers for biological cell types.

Dot Plots are a different method to show expression of individual genes. It is a visual representation of both expression level (the color intensity) and expression specificity and sensitivity (size of the dot). Based on cell type annotation, a Dot Plot with canonical marker genes for those cell types is also provided.

In some instances, the most expressed or most unique genes might not be the most biologically appropriate genes to display. Our plots are only a guide to drive further biological analysis.

4.5 Cluster Proportions

4.5.1 By Sample

This plot shows the proportion of each sample that makes up individual clusters.

This table shows the proportion of each cluster that makes up individual samples.

4.5.2 By Group

This plot shows the proportion of each group that makes up individual clusters.

This table shows the proportion of each cluster that makes up each group.

4.5.3 Data Interpretation

Cluster proportions are often useful for determining sample-specific batch effects and sample quality. However, there are often biological reasons for differences in cluster proportions.
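
Per-sample cluster proportions can be computed from one (sample, cluster) label per cell. The sketch below uses hypothetical labels and a hypothetical function name; it is not the pipeline's implementation.

```python
from collections import Counter, defaultdict


def cluster_proportions_by_sample(cell_labels):
    """Return, for each sample, the fraction of its cells assigned to each cluster.

    `cell_labels` is an iterable of (sample, cluster) pairs, one per cell.
    """
    counts = defaultdict(Counter)
    for sample, cluster in cell_labels:
        counts[sample][cluster] += 1
    return {
        sample: {cluster: n / sum(by_cluster.values()) for cluster, n in by_cluster.items()}
        for sample, by_cluster in counts.items()
    }


# Hypothetical per-cell labels (assumption), not taken from this dataset.
example_labels = [
    ("S1", "0"), ("S1", "0"), ("S1", "1"),
    ("S2", "0"), ("S2", "1"), ("S2", "1"), ("S2", "1"),
]

proportions = cluster_proportions_by_sample(example_labels)
print(proportions)
```

A cluster that makes up a large share of one sample and is nearly absent from its replicates is a candidate batch effect, unless there is a biological reason for the difference.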

4.6 Cell Type Proportions

4.6.1 By Sample

This plot shows the proportion of each sample that makes up each cell type.

This table shows the proportion of each cell type that makes up each sample.

4.6.2 By Group

This plot shows the proportion of each group that makes up each cell type.

This table shows the proportion of each cell type that makes up each group.

4.6.3 Data Interpretation

Cell type proportions are often useful for determining sample-specific batch effects and sample quality. However, there are often biological reasons for differences in cell type proportions.

5 Conclusions and Next Steps

This is a preliminary analysis performed using the Honeycomb External Informatics Pipeline.

Each section of this report contains a “Data Interpretation” Tab that provides guidelines and metrics for interpretation of your data. Please remember that these are only guidelines, and that depending on your sample type and reference genome your data may look different.

5.1 Explanation of output files

Your output folder, when unzipped, may contain the following files:

ClusterMetrics-DATASETNAME.txt: Sequencing quality control metrics, cell, gene and transcript recovery per cluster.
ClusterMetricsByGroup.txt: Sequencing quality control metrics, cell, gene and transcript recovery per cluster per group.
FullCluster_Enrichments_CellType.txt: Marker gene analysis per cell type
FullCluster_Enrichments.txt: Marker gene analysis per cluster.
FullDataset-DATASETNAME.Rdata: R object containing the Seurat Object and data for this analysis.
HCB-scRNAseq-DATASETNAME_files: This folder contains all of the figures found in this report, output as PNG images.
SampleMetrics-DATASETNAME.txt: This file contains all of the metrics for each sample, including sequencing metrics, cell/gene/transcript recovery, and metadata.

5.2 How to interpret your data

To ensure accurate single-cell data analysis, be sure that sequencing depth is similar between all samples. If the total number of reads per cell is significantly different between samples:

Sequence the shallow libraries more deeply, or
Downsample the more deeply sequenced library so that the number of reads per cell is equivalent.
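
As a rough guide to how much downsampling would be needed, the fraction of reads to keep per sample can be estimated from the median total reads per cell in the Sequencing QC table (Section 3.1.3). This is a sketch under a simplifying assumption: that median reads per cell scales linearly with the number of reads kept, which is only approximately true. The function name is hypothetical.

```python
# Median total reads per cell, copied from the TotalReads column of the Sequencing QC table.
MEDIAN_TOTAL_READS_PER_CELL = {
    "MySample3": 15215,
    "MySample4": 14230,
    "MySample1": 15929,
    "MySample2": 15082,
}


def downsampling_fractions(median_reads_per_cell):
    """Estimate the fraction of reads to keep so each sample matches the shallowest one."""
    target = min(median_reads_per_cell.values())
    return {sample: target / depth for sample, depth in median_reads_per_cell.items()}


keep_fractions = downsampling_fractions(MEDIAN_TOTAL_READS_PER_CELL)
for sample, fraction in keep_fractions.items():
    print(f"{sample}\tkeep about {fraction:.0%} of reads")
```

The shallowest sample keeps all of its reads. After downsampling, the analysis should be rerun so that the reads-per-cell metrics can be compared again.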

The Honeycomb BioX team recommends examining the data for appropriate cell, gene and transcript recovery. If data is of low quality (e.g. high percent mitochondria, avg <200 genes and <400 transcripts), consider the following next steps:

Rerun this analysis using different gene and transcript thresholds.
Remove low quality clusters manually from the R object using the subset command, and recluster with the help of your local informatics core or using our Seurat Tutorial.

You are the biological experts for your data! Auto-annotation packages can often miss nuances in gene expression patterns. This is a preliminary data analysis workflow, but in-depth analysis of your samples is highly recommended!

6 Session Info

6.1 Workflow Session

BeeNetPLUS Session Info
	Version
AlignedRef	20210603_GRCh38.104
BeeNet	5f58c57_v1.1.3
SecondaryPipeline	v1.0
WDL	v1.0.0

6.2 R Session

R Session Info
Setting	Value
version	R version 4.2.2 (2022-10-31)
os	Ubuntu 22.04.1 LTS
system	x86_64, linux-gnu
ui	X11
language	(EN)
collate	en_US.UTF-8
ctype	en_US.UTF-8
tz	Etc/UTC
date	2023-04-26
pandoc	2.19.2 @ /usr/local/bin/ (via rmarkdown)

6.3 Package Info

R Package Info
Package	Loaded version	Date
Biobase	2.58.0	2022-11-01
BiocGenerics	0.44.0	2022-11-01
celldex	1.8.0	2022-11-03
dplyr	1.0.10	2022-09-01
DT	0.26	2022-10-19
forcats	0.5.2	2022-08-19
GenomeInfoDb	1.34.3	2022-11-10
GenomicRanges	1.50.1	2022-11-06
ggplot2	3.4.0	2022-11-04
ggpubr	0.5.0	2022-11-16
HGNChelper	0.8.1	2019-10-24
IRanges	2.32.0	2022-11-01
knitr	1.40	2022-08-24
magick	2.7.3	2021-08-18
MatrixGenerics	1.10.0	2022-11-01
matrixStats	0.62.0	2022-04-19
patchwork	1.1.2	2022-08-19
plyr	1.8.8	2022-11-11
purrr	0.3.5	2022-10-06
RColorBrewer	1.1-3	2022-04-03
readr	2.1.3	2022-10-01
S4Vectors	0.36.0	2022-11-01
scales	1.2.1	2022-08-20
scuttle	1.8.0	2022-11-01
Seurat	4.2.1	2022-11-08
SeuratData	0.2.2	2022-11-18
SeuratDisk	0.0.0.9020	2022-11-18
SeuratObject	4.1.3	2022-11-07
SingleCellExperiment	1.20.0	2022-11-01
SingleR	2.0.0	2022-11-01
stringr	1.4.1	2022-08-20
SummarizedExperiment	1.28.0	2022-11-01
tibble	3.1.8	2022-07-22
tidyr	1.2.1	2022-09-08
tidyverse	1.3.2	2022-07-18
UCell	2.2.0	2022-11-01

BeeNetPLUS scRNAseq Report v1.0

Document Prepared by Honeycomb BioX Team

April 26, 2023