Data Preparation

Little data preparation is needed before running the demultiplexing or doublet detecting software.

Data Required

The demultiplexing and transcriptome-based doublet detecting software have different data input requirements:

Software Group	Single Cell Count Data Required	SNP Genotype Data Required
Demultiplexing	✔️	✔️
Doublet Detecting	✔️	✖️

You won’t need to pre-process the single cell count data unless you are using DoubletFinder or DoubletDecon, which need QC-filtered and normalized counts (for example with Seurat).

For the demultiplexing software, you should filter the SNP genotypes that you will use.

SNP Genotype Data

Note

The SNP genotype data can be for multiplexed donors in the pool OR it can be publicly available common SNP genotypes.

We provide instructions on how to access or prepare these data here.

Publicly Available SNP Genotype Data

We have provided common SNP vcf files that we generated on both GRCh37 and GRCh38 with a variety of filtering and ‘chr’ encoding options. These can be downloaded using the following links:

Minor Allele Frequency	Genome	Region Filtering	Chr Encoding	vcf File	md5sum File
1%	GRCh37	Genes	No ‘chr’	GRCh37_1000G_MAF0.01_GeneFiltered_NoChr.vcf	GRCh37_1000G_MAF0.01_GeneFiltered_NoChr.vcf.md5
1%	GRCh37	Genes	‘chr’ encoding	GRCh37_1000G_MAF0.01_GeneFiltered_ChrEncoding.vcf	GRCh37_1000G_MAF0.01_GeneFiltered_ChrEncoding.vcf.md5
1%	GRCh37	Exons	No ‘chr’	GRCh37_1000G_MAF0.01_ExonFiltered_NoChr.vcf	GRCh37_1000G_MAF0.01_ExonFiltered_NoChr.vcf.md5
1%	GRCh37	Exons	‘chr’ encoding	GRCh37_1000G_MAF0.01_ExonFiltered_ChrEncoding.vcf	GRCh37_1000G_MAF0.01_ExonFiltered_ChrEncoding.vcf.md5
1%	GRCh38	Genes	No ‘chr’	GRCh38_1000G_MAF0.01_GeneFiltered_NoChr.vcf	GRCh38_1000G_MAF0.01_GeneFiltered_NoChr.vcf.md5
1%	GRCh38	Genes	‘chr’ encoding	GRCh38_1000G_MAF0.01_GeneFiltered_ChrEncoding.vcf	GRCh38_1000G_MAF0.01_GeneFiltered_ChrEncoding.vcf.md5
1%	GRCh38	Exons	No ‘chr’	GRCh38_1000G_MAF0.01_ExonFiltered_NoChr.vcf	GRCh38_1000G_MAF0.01_ExonFiltered_NoChr.vcf.md5
1%	GRCh38	Exons	‘chr’ encoding	GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding.vcf	GRCh38_1000G_MAF0.01_ExonFiltered_ChrEncoding.vcf.md5

Of course, you can further filter these to a 5% minor allele frequency if you would prefer.
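When choosing between the ‘chr’ and no ‘chr’ files, it can help to check how chromosomes are named in a vcf you already have, since the chromosome naming generally needs to agree with your aligned data. The following is a minimal sketch, assuming an uncompressed vcf; the function name vcf_chr_encoding and the small demo file are hypothetical and not part of Demuxafy.

```shell
# Hypothetical helper (not part of Demuxafy): report whether the first
# variant record of an uncompressed vcf uses 'chr'-prefixed chromosome names.
vcf_chr_encoding() {
  # Skip header lines (starting with '#') and inspect column 1 of the first record.
  awk -F'\t' '!/^#/ { if ($1 ~ /^chr/) print "chr"; else print "nochr"; exit }' "$1"
}

# Demo on a tiny made-up vcf with a single record.
demo_vcf="$(mktemp)"
printf '##fileformat=VCFv4.2\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf 'chr1\t1000\t.\tA\tG\t.\tPASS\t.\n' >> "$demo_vcf"
vcf_chr_encoding "$demo_vcf"
```

Only the first record is inspected, so this is a quick check rather than a full validation of the file.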

You can also download SNP genotype data and process it yourself from 1000G (hg19 and hg38) or HRC (hg19 only).

For 1000G, follow the instructions at the above link to access the data per your preferences. You can find the required files in the following directories:

The hg19 data is available at /ftp/release/

The hg38 data is available at /ftp/release/20130502/supporting/GRCh38_positions/

Preparing your own SNP Genotype Data

It is best to filter the SNP genotypes for common SNPs (generally > 1% or > 5% minor allele frequency) that overlap either exons or genes. We typically suggest filtering for exons, since this usually leaves ~250k SNPs, which is sufficient for demultiplexing without using so many SNPs that the demultiplexing software slows down. However, some capture types, such as single nuclei RNA-seq, might be better suited to SNPs overlapping genes. For relative numbers of SNPs in the exons and introns, see the issue raised by @jamesnemesh. Here we provide an example of how to do this filtering. We built the required software into the singularity image so you can run these filtering steps with the image.

Note

We have found it best to impute reference SNP genotypes so there are more SNP locations available. If you are using reference SNP genotypes for the donors in your pool, please be sure to impute before filtering.

Filter for Common SNPs

First, filter the SNP genotypes for common SNPs - 5% minor allele frequency should work for most datasets but you can change this to another minor allele frequency if you would like. We assume that you have already filtered imputed SNPs for quality based on the imputation method used.

singularity exec Demuxafy.sif bcftools filter --include 'MAF>=0.05' -Oz --output $OUTDIR/common_maf0.05.vcf.gz $VCF

Where $OUTDIR is the output directory where you want to save the results and $VCF is the path to the SNP genotype vcf file.

Filter for SNPs overlapping Exons

Next, filter for the SNPs that overlap exons. The below command uses vcftools with a bed file (https://en.wikipedia.org/wiki/BED_(file_format)) that contains the locations of each exon (or gene if you prefer) across the genome.

Note

You can get an exon bed using the UCSC table browser (see instructions here), and we have also provided bed files for hg19 and hg38.

You can find a description of the bed file type at https://en.wikipedia.org/wiki/BED_(file_format). The only required columns are the first three, which contain the location of each exon (chromosome, start and end).
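As an illustration of that three-column layout, the sketch below writes a small bed file and checks that each line has a chromosome, an integer start and an integer end. The coordinates are made up (not real exons), and the function name check_bed is a hypothetical helper rather than part of Demuxafy.

```shell
# Hypothetical helper: verify that every data line of a bed file has at least
# three tab-separated columns, with integer start and end and start <= end.
check_bed() {
  awk -F'\t' '
    /^(track|browser|#)/ { next }   # skip optional header lines
    NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
      printf "line %d looks malformed: %s\n", NR, $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Made-up example: chromosome, start, end (tab-separated).
demo_bed="$(mktemp)"
printf '1\t1000\t2000\n1\t3000\t3500\n2\t500\t900\n' > "$demo_bed"
cat "$demo_bed"
check_bed "$demo_bed" && echo "bed file looks well-formed"
```

The chromosome names in the bed file (with or without ‘chr’) would also need to match those in the vcf for the overlap filter to retain any SNPs.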

singularity exec Demuxafy.sif vcftools \
  --gzvcf $OUTDIR/common_maf0.05.vcf.gz \
  --max-alleles 2 \
  --remove-indels \
  --bed $BED \
  --recode \
  --recode-INFO-all \
  --out $OUTDIR/common_maf0.05_exon_filtered

We typically expect ~250k SNPs to remain following this filtering step.
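To compare your own result against that expectation, you can count the variant records that remain. This is a minimal sketch; count_vcf_records is a hypothetical helper name, and the demo file is made up.

```shell
# Hypothetical helper: count variant (non-header) records in a vcf or vcf.gz.
count_vcf_records() {
  case "$1" in
    *.gz) gzip -cd "$1" | grep -vc '^#' ;;   # compressed input
    *)    grep -vc '^#' "$1" ;;              # plain-text input
  esac
}

# Demo on a tiny made-up vcf with two records.
demo_counts="$(mktemp)"
printf '##fileformat=VCFv4.2\n' > "$demo_counts"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_counts"
printf '1\t1000\t.\tA\tG\t.\tPASS\t.\n1\t2000\t.\tC\tT\t.\tPASS\t.\n' >> "$demo_counts"
count_vcf_records "$demo_counts"
```

With the command above, vcftools would normally write its recoded output to $OUTDIR/common_maf0.05_exon_filtered.recode.vcf, so that is the file to pass to the function (an assumption based on how --recode and --out usually name output files).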

Test Dataset

In addition, we have provided a test dataset that you can use. Find the data that can be downloaded below along with some information about the data. The data have been aligned to GRCh38. If you are running souporcell, you will need a fasta file, which you can download from Ensembl FTP.

Information

The test dataset includes 20,982 droplets of PBMCs captured from 13 multiplexed individuals.

10x Directories + Other Necessary Files

We have provided the complete dataset, which is large (~40Gb tar.gz directory). We have also provided a reduced version of the same dataset that is significantly smaller.

Warning

The reduced test dataset may not produce real-world results because of its small size. This is especially true for doublet detecting software, since the reads have been significantly downsampled to reduce the size.

You can download either the complete dataset or the reduced dataset with the following commands.

For the complete dataset, first download the dataset and the md5sum:

wget https://www.dropbox.com/s/3oujqq98y400rzz/TestData4PipelineFull.tar.gz
wget https://www.dropbox.com/s/5n7u723okkf5m3l/TestData4PipelineFull.tar.gz.md5

After downloading the tar.gz directory, it is best to make sure the md5sum of the TestData4PipelineFull.tar.gz file matches the md5sum in the TestData4PipelineFull.tar.gz.md5:

md5sum TestData4PipelineFull.tar.gz > downloaded_TestData4PipelineFull.tar.gz.md5
diff -s TestData4PipelineFull.tar.gz.md5 downloaded_TestData4PipelineFull.tar.gz.md5

That should return the following statement indicating that the two md5sums are identical:

Files TestData4PipelineFull.tar.gz.md5 and downloaded_TestData4PipelineFull.tar.gz.md5 are identical
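The same check can be sketched as a small reusable function that compares only the hash values, so it still works if the file name recorded in the .md5 file differs from the local name. This assumes the first whitespace-separated field of the provided .md5 file is the hash (the usual md5sum output format); verify_md5 is a hypothetical helper name.

```shell
# Hypothetical helper: compare the md5 hash of a file ($1) with the hash
# stored as the first field of a provided .md5 file ($2).
verify_md5() {
  expected="$(awk 'NR == 1 { print $1 }' "$2")"
  actual="$(md5sum "$1" | awk '{ print $1 }')"
  if [ "$expected" = "$actual" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1" >&2
    return 1
  fi
}

# Demo on a small throwaway file.
demo_dir="$(mktemp -d)"
printf 'example\n' > "$demo_dir/data.txt"
md5sum "$demo_dir/data.txt" > "$demo_dir/data.txt.md5"
verify_md5 "$demo_dir/data.txt" "$demo_dir/data.txt.md5"
```

For the test data this would be called as verify_md5 TestData4PipelineFull.tar.gz TestData4PipelineFull.tar.gz.md5, and the same function applies to the other downloads on this page.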

Finally, you can access the data by unzipping the file:

tar -xvf TestData4PipelineFull.tar.gz

This should unzip the TestData4PipelineFull directory where you will have the following file structure:

TestData4PipelineFull
├── donor_list.txt
├── samplesheet.txt
├── test_dataset
│   ├── outs
│   │   └── filtered_gene_bc_matrices
│   │       └── Homo_sapiens_GRCh38p10
│   │           ├── barcodes.tsv
│   │           ├── genes.tsv
│   │           └── matrix.mtx
│   ├── possorted_genome_bam.bam
│   └── possorted_genome_bam.bam.bai
└── test_dataset.vcf
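To confirm that the extraction produced the files shown above, a quick existence check can be sketched as follows. check_paths is a hypothetical helper name; the paths passed to it are taken from the file structure listed above.

```shell
# Hypothetical helper: report any expected paths that are missing.
# $1 = top-level directory; remaining arguments = paths relative to it.
check_paths() {
  dir="$1"
  shift
  missing=0
  for rel in "$@"; do
    if [ ! -e "$dir/$rel" ]; then
      echo "missing: $dir/$rel" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Only run the check if the dataset has been extracted in the current directory.
if [ -d TestData4PipelineFull ]; then
  check_paths TestData4PipelineFull \
    donor_list.txt \
    samplesheet.txt \
    test_dataset.vcf \
    test_dataset/possorted_genome_bam.bam \
    test_dataset/possorted_genome_bam.bam.bai \
    test_dataset/outs/filtered_gene_bc_matrices/Homo_sapiens_GRCh38p10/matrix.mtx \
    || echo "some expected files are missing"
fi
```

The same function can be used for the reduced dataset by passing its directory name and the paths from its own file structure.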

For the reduced dataset, first download the reduced dataset and the md5sum:

wget https://www.dropbox.com/s/m8u61jn4i1mcktp/TestData4PipelineSmall.tar.gz
wget https://www.dropbox.com/s/ykjg86q3xw39wqr/TestData4PipelineSmall.tar.gz.md5

After downloading the tar.gz directory, it is best to make sure the md5sum of the TestData4PipelineSmall.tar.gz file matches the md5sum in the TestData4PipelineSmall.tar.gz.md5:

md5sum TestData4PipelineSmall.tar.gz > downloaded_TestData4PipelineSmall.tar.gz.md5
diff -s TestData4PipelineSmall.tar.gz.md5 downloaded_TestData4PipelineSmall.tar.gz.md5

That should return the following statement indicating that the two md5sums are identical:

Files TestData4PipelineSmall.tar.gz.md5 and downloaded_TestData4PipelineSmall.tar.gz.md5 are identical

Finally, you can access the data by unzipping the file:

tar -xvf TestData4PipelineSmall.tar.gz

This should unzip the TestData4PipelineSmall directory where you will have the following file structure:

TestData4PipelineSmall
├── donor_list.txt
├── individuals_list_dir
│   └── test_dataset.txt
├── samplesheet.txt
├── test_dataset
│   └── outs
│       ├── filtered_gene_bc_matrices
│       │   └── Homo_sapiens_GRCh38p10
│       │       ├── barcodes.tsv
│       │       ├── genes.tsv
│       │       └── matrix.mtx
│       ├── pooled.sorted.bam
│       └── pooled.sorted.bam.bai
└── test_dataset.vcf

Seurat Object

We have also provided a QC-filtered, normalized Seurat object (needed for DoubletFinder and DoubletDecon).

Download the rds object and the md5sum:

wget https://www.dropbox.com/s/po4gy2j3eqohhjv/TestData_Seurat.rds
wget https://www.dropbox.com/s/rmix7tt9aw28n7i/TestData_Seurat.rds.md5

After downloading the rds object, it is best to make sure the md5sum of the TestData_Seurat.rds file matches the md5sum in the TestData_Seurat.rds.md5:

md5sum TestData_Seurat.rds > downloaded_TestData_Seurat.rds.md5
diff -s TestData_Seurat.rds.md5 downloaded_TestData_Seurat.rds.md5

That should return the following statement indicating that the two md5sums are identical:

Files TestData_Seurat.rds.md5 and downloaded_TestData_Seurat.rds.md5 are identical

The TestData_Seurat.rds can then be used directly as input for the DoubletFinder and DoubletDecon tutorials. You can also load the TestData_Seurat.rds into R to see the Seurat object by first opening R:

singularity exec Demuxafy.sif R

Then read it in with:

seurat <- readRDS("TestData_Seurat.rds")

Note

We have used this dataset for each of the tutorials. The example tables in the Results and Interpretation sections of each tutorial are the results from this dataset.