Single-cell multiomics experiments generate a large amount of data. Choosing the optimal data analysis tool during each step of the data analysis process enables you to uncover unique biological insights from these data. Explore our resources to learn how to process and analyze data generated from experiments using TotalSeq™ antibodies.
Multiomics Analysis Software (MAS) is our free cloud-based program that allows you to quickly and easily explore CITE-seq data without extensive bioinformatics knowledge.
Overview of the Analysis Pipeline
Primary Data Processing
FASTQ to count matrices and web summary
The FASTQ files are then typically run through the Cell Ranger count pipeline or a similar tool, which aligns and filters the reads and counts the barcode and UMI sequences. This pipeline produces several file types, including BAM files, matrix files, and summary files. If the libraries contain multiple samples that were multiplexed using TotalSeq™ hashtags, the samples can be run through the Cell Ranger count pipeline or multi pipeline* after FASTQ conversion.
*Note: Cell Ranger multi does not currently support HTO + VDJ data. Please refer to the BioLegend notebook for more information.
The Cell Ranger pipeline produces a web_summary.html file that contains metrics and automated secondary analysis results that can be useful for assessing both library and sample quality.
Raw and filtered feature-barcode matrices – Unfiltered (raw) and filtered feature-barcode matrices are output in both the Market Exchange Format (MEX) and the Hierarchical Data Format (HDF5). The HDF5 matrices are the most commonly used and are typically the primary input for subsequent analysis pipelines such as MAS, R (Seurat), and Python software. The unfiltered matrix contains every barcode from the fixed list of known barcode sequences that has at least one associated read, which includes both background and cell-associated barcodes. The filtered matrix contains only cell-associated barcode sequences and is the primary input to the MAS analysis pipeline.
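The relationship between the raw and filtered matrices can be illustrated with a small sketch. This is only a toy example: the feature names, barcodes, counts, and the min_counts threshold are all hypothetical, and a simple total-count cutoff stands in for the more sophisticated cell-calling step that Cell Ranger actually performs. In this toy matrix, rows are features and columns are barcodes.

```python
import numpy as np

# Hypothetical raw feature-barcode matrix (rows = features, columns = barcodes).
features = ["CD3E", "MS4A1", "ADT_CD3", "ADT_CD19"]
barcodes = ["AAAC-1", "AAAG-1", "AACT-1", "AAGG-1", "ACCA-1"]
raw = np.array([
    [120,   0, 1,  95, 0],
    [  0, 210, 0,   2, 1],
    [300,   4, 2, 280, 0],
    [  3, 410, 0,   5, 1],
])

def filter_barcodes(matrix, barcode_names, min_counts):
    """Keep barcodes whose total count reaches min_counts.

    A plain threshold is an illustrative assumption, not the
    cell-calling algorithm used by Cell Ranger.
    """
    totals = matrix.sum(axis=0)          # total counts per barcode
    keep = totals >= min_counts          # boolean mask over barcodes
    kept_names = [b for b, k in zip(barcode_names, keep) if k]
    return matrix[:, keep], kept_names

filtered, cell_barcodes = filter_barcodes(raw, barcodes, min_counts=100)
print(cell_barcodes)    # ['AAAC-1', 'AAAG-1', 'AAGG-1']
print(filtered.shape)   # (4, 3)
```

The two low-count barcodes play the role of background here, so the filtered matrix keeps all features but only the three cell-associated barcodes.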
For more information on the file types that are not discussed here, visit the 10x Cell Ranger output review.
Typical CITE-seq data sets are high-dimensional and contain information for thousands of genes as well as ADT and/or HTO reads, which are represented as columns (features). They can also contain anywhere from 1,000 to more than 50,000 cells (rows/observations), depending on the experimental design. Dimensionality reduction techniques reduce the data complexity and aid in the visualization of this high-dimensional data. The most common dimensionality reduction methods used in CITE-seq are t-SNE and UMAP, both of which are non-linear methods that project data from the high-dimensional space into a two- (or more) dimensional space to enable visualization. Broadly speaking, these methods attempt to preserve the local neighborhoods observed in the high-dimensional space when projecting into lower dimensions; i.e., cells that are close to each other in the high-dimensional space are typically close to each other in the lower-dimensional space. As a result, clusters of cells are easier to visualize and to probe for co-expression patterns.
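The idea of projecting cells from many features down to two dimensions can be sketched with principal component analysis (PCA). Note that PCA is a linear method and is used here only because it can be written in a few lines of NumPy; it is not t-SNE or UMAP, although it is often run as a preprocessing step before them. The simulated populations and all names below are hypothetical.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project cells (rows) onto the top principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
# Two hypothetical cell populations: 50 cells each, 200 features.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 200))
group_b = rng.normal(loc=0.0, scale=1.0, size=(50, 200))
group_b[:, :20] += 5.0   # population B over-expresses the first 20 features

cells = np.vstack([group_a, group_b])
embedding = pca_project(cells, n_components=2)
print(embedding.shape)   # (100, 2)
```

In this toy data set the two populations separate along the first component, which mirrors how clusters of similar cells become visible in a two-dimensional plot.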
Normalization and identifying RNA with sufficient variance
Data normalization is a crucial part of the analysis of CITE-seq data sets, since it helps reduce the sequencing noise and bias that are inherent to this assay (e.g., gene length, GC content, and sequencing depth). Many methods are available, but most focus on correcting for differences in RNA abundance related to the size of the cells. Once normalized, the read counts more accurately reflect differences in the biology of the samples/cells rather than differences in cell volume.
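One widely used approach is library-size normalization followed by a log transform: each cell's counts are divided by that cell's total, multiplied by a scale factor, and passed through log(1 + x). The sketch below assumes a scale factor of 10,000 and toy counts; it is one illustrative method among many, not necessarily the one used by MAS.

```python
import numpy as np

def log_normalize(counts, scale_factor=10_000):
    """Library-size normalize each cell (row), then apply log1p."""
    totals = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for cells with no counts.
    totals = np.where(totals == 0, 1, totals)
    return np.log1p(counts / totals * scale_factor)

# Toy counts: 3 cells (rows) x 4 genes (columns).
# Cell 2 has the same expression proportions as cell 1 at twice the depth.
counts = np.array([
    [10, 0,  5,  5],
    [20, 0, 10, 10],
    [ 1, 8,  1,  0],
])

normalized = log_normalize(counts)
print(np.allclose(normalized[0], normalized[1]))   # True
```

After normalization the first two cells have identical profiles, so the twofold difference in total counts no longer masquerades as a biological difference.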
Clustering of cells aids in understanding the cellular heterogeneity within data sets. Clustering groups cells based on the “similarities” found within their gene expression and/or ADT expression profiles.
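As a minimal sketch of grouping cells by similarity, the example below runs k-means on simulated two-dimensional coordinates. K-means is chosen here only for brevity; single-cell toolkits commonly use graph-based community detection instead. The data, the number of clusters, and the farthest-point initialization are all assumptions made for illustration.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from every centroid chosen so far.
    centroids = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[d.argmax()])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

rng = np.random.default_rng(0)
# Two hypothetical populations of 30 cells in a 2-D embedding.
pop_a = rng.normal(loc=0.0, scale=0.5, size=(30, 2))
pop_b = rng.normal(loc=10.0, scale=0.5, size=(30, 2))
labels, centroids = kmeans(np.vstack([pop_a, pop_b]), k=2)
print(labels)
```

Cells from the same simulated population receive the same label, which is the behavior a clustering step is expected to show on well-separated groups.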