Data Preparation

Little data preparation is needed before running the demultiplexing or doublet detecting software.

Data Required

The demultiplexing and transcriptome-based doublet detecting software have different data input requirements:

Software Group       Single Cell Count Data Required    SNP Genotype Data Required
Doublet Detecting



You won’t need to pre-process the single cell count data unless you are using DoubletFinder or DoubletDecon, which need QC-filtered and normalized counts (for example, prepared with Seurat).

For the demultiplexing software, you should filter the SNP genotypes that you will use.

SNP Genotype Data


The SNP genotype data can be for multiplexed donors in the pool OR it can be publicly available common SNP genotypes.

We provide instructions on how to access or prepare these data here.

Publicly Available SNP Genotype Data

We have provided common SNP vcf files that we have generated on both GRCh37 and GRCh38 and with a variety of filtering and ‘chr’ encoding. These can be downloaded using the following links:







Chr Encoding      vcf File    md5sum File
No ‘chr’
‘chr’ encoding
No ‘chr’
‘chr’ encoding
No ‘chr’
‘chr’ encoding
No ‘chr’
‘chr’ encoding



Of course, you can further filter these to a 5% minor allele frequency if you would prefer.

You can also download SNP genotype data and process it yourself from 1000G (hg19 and hg38) or HRC (hg19 only).

For 1000G, follow the instructions at the above link to access the data according to your preferences. The required files are in the following directories:

  • The hg19 data is available at /ftp/release/

  • The hg38 data is available at /ftp/release/20130502/supporting/GRCh38_positions/

Preparing your own SNP Genotype Data

It is best to filter the SNP genotypes for common SNPs (generally > 1% or > 5% minor allele frequency) that overlap exons. Here we provide an example of how to do this filtering. The required software is built into the singularity image, so you can run these filtering steps with the image.


We have found it best to impute reference SNP genotypes so there are more SNP locations available. If you are using reference SNP genotypes for the donors in your pool, please be sure to impute before filtering.

Filter for Common SNPs

First, filter the SNP genotypes for common SNPs - 5% minor allele frequency should work for most datasets but you can change this to another minor allele frequency if you would like.

singularity exec Demuxafy.sif bcftools filter --include 'MAF>=0.05' -Oz --output $OUTDIR/common_maf0.05.vcf.gz $VCF

Where $OUTDIR is the output directory where you want to save the results and $VCF is the path to the SNP genotype vcf file.
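To confirm that the filter behaved as expected, it can help to compare the number of variant records before and after filtering. The following is a minimal sketch and not part of Demuxafy: count_vcf_records is a hypothetical helper name, and it assumes only that gzip and awk are available on your system.

```shell
# Hypothetical helper (not part of Demuxafy): count the variant records,
# i.e. the non-empty lines that do not start with '#', in a VCF file.
# Works on plain-text files and on gzip/bgzip-compressed files ending in .gz.
count_vcf_records() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | awk '!/^#/ && NF { n++ } END { print n + 0 }'
}
```

For example, running count_vcf_records $VCF and then count_vcf_records $OUTDIR/common_maf0.05.vcf.gz should report no more records in the filtered file than in the input.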

Filter for SNPs overlapping Exons

Next, filter for the SNPs that overlap exons.


You can get an exon bed file using the UCSC table browser (see instructions here). We have also provided bed files for hg19 and hg38.

singularity exec Demuxafy.sif vcftools \
  --gzvcf $OUTDIR/common_maf0.05.vcf.gz \
  --max-alleles 2 \
  --remove-indels \
  --bed $BED \
  --recode \
  --recode-INFO-all \
  --out $OUTDIR/common_maf0.05_exon_filtered
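With --recode and --out, vcftools writes its result to $OUTDIR/common_maf0.05_exon_filtered.recode.vcf. If that file contains no SNPs, one possible cause is that the bed file and the vcf file use different ‘chr’ encodings. The sketch below is one way to check this before filtering; chr_style is a hypothetical helper name and not part of Demuxafy, and it only inspects the first data line of each file.

```shell
# Hypothetical helper (not part of Demuxafy): print "chr" if the first data
# line of a VCF or BED file names its chromosome with a 'chr' prefix,
# otherwise print "nochr". Header lines ('#', 'track', 'browser') are skipped.
# Plain-text files and files ending in .gz are both accepted.
chr_style() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | awk '!seen && !/^#/ && !/^(track|browser)/ && NF {
    if ($1 ~ /^chr/) print "chr"; else print "nochr"
    seen = 1
  }'
}
```

If chr_style $BED and chr_style $OUTDIR/common_maf0.05.vcf.gz print different values, use a bed file whose encoding matches the vcf file.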

Test Dataset

In addition, we have provided a test dataset that you can use. The data that can be downloaded are listed below, along with some information about them. The data have been aligned to GRCh38. If you are running souporcell, you will need a fasta file, which you can download from UCSC FTP.


The test dataset includes 20,982 droplets of PBMCs captured from 13 multiplexed individuals.

10x Directories + Other Necessary Files

We have provided the complete dataset, which is large (~40Gb tar.gz directory). We have therefore also provided a reduced version of the same dataset in which the data have been significantly downsampled.


The reduced test dataset may not produce real-world results because of its small size. This applies especially to the doublet detecting software, since the reads have been significantly downsampled to reduce the size.

You can download the dataset with one of the following commands:

First, download the dataset and the md5sum:


After downloading the tar.gz directory, it is best to make sure the md5sum of the TestData4PipelineFull.tar.gz file matches the md5sum in the TestData4PipelineFull.tar.gz.md5:

md5sum TestData4PipelineFull.tar.gz > downloaded_TestData4PipelineFull.tar.gz.md5
diff -s TestData4PipelineFull.tar.gz.md5 downloaded_TestData4PipelineFull.tar.gz.md5

That should return the following statement indicating that the two md5sums are identical:

Files TestData4PipelineFull.tar.gz.md5 and downloaded_TestData4PipelineFull.tar.gz.md5 are identical
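The diff comparison above also compares the file name recorded in each md5 file, so it can report a difference if the file was renamed or the checksum was generated from another directory. As an alternative sketch, the helper below compares only the checksum values; verify_md5 is a hypothetical name and not part of Demuxafy, and it assumes the checksum is the first field on the first line of the md5 file.

```shell
# Hypothetical helper (not part of Demuxafy): compare the md5 checksum of a
# file ($1) with the checksum stored as the first field of an md5 file ($2).
# The file name recorded inside the md5 file is ignored.
verify_md5() {
  expected=$(awk 'NR == 1 { print $1 }' "$2")
  actual=$(md5sum "$1" | awk '{ print $1 }')
  if [ "$expected" = "$actual" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1" >&2
    return 1
  fi
}
```

For example, verify_md5 TestData4PipelineFull.tar.gz TestData4PipelineFull.tar.gz.md5 prints an OK line when the checksums agree and returns a non-zero status otherwise.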

Finally, you can access the data by extracting the archive:

tar -xvf TestData4PipelineFull.tar.gz

This should extract the TestData4PipelineFull directory, which has the following file structure:

├── donor_list.txt
├── samplesheet.txt
├── test_dataset
│   ├── outs
│   │   └── filtered_gene_bc_matrices
│   │       └── Homo_sapiens_GRCh38p10
│   │           ├── barcodes.tsv
│   │           ├── genes.tsv
│   │           └── matrix.mtx
│   ├── possorted_genome_bam.bam
│   └── possorted_genome_bam.bam.bai
└── test_dataset.vcf
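If you would like to confirm that the extraction completed, a short check along the following lines can list any files from the tree shown above that are missing. This is only a sketch: check_test_data is a hypothetical helper name and not part of Demuxafy, and the paths are copied from that tree.

```shell
# Hypothetical helper (not part of Demuxafy): print a "missing: <path>" line
# for every expected test-dataset file that does not exist under directory $1.
# Returns 0 when all files are present and 1 otherwise.
check_test_data() {
  counts=test_dataset/outs/filtered_gene_bc_matrices/Homo_sapiens_GRCh38p10
  missing=0
  for f in \
    donor_list.txt \
    samplesheet.txt \
    test_dataset.vcf \
    test_dataset/possorted_genome_bam.bam \
    test_dataset/possorted_genome_bam.bam.bai \
    "$counts/barcodes.tsv" \
    "$counts/genes.tsv" \
    "$counts/matrix.mtx"
  do
    if [ ! -e "$1/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_test_data TestData4PipelineFull prints nothing when every file is in place.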

Seurat Object

We have also provided a QC-filtered and normalized Seurat object (needed for DoubletFinder and DoubletDecon).

Download the rds object and the md5sum:


After downloading the rds object, it is best to make sure the md5sum of the TestData_Seurat.rds file matches the md5sum in the TestData_Seurat.rds.md5:

md5sum TestData_Seurat.rds > downloaded_TestData_Seurat.rds.md5
diff -s TestData_Seurat.rds.md5 downloaded_TestData_Seurat.rds.md5

That should return the following statement indicating that the two md5sums are identical:

Files TestData_Seurat.rds.md5 and downloaded_TestData_Seurat.rds.md5 are identical

The TestData_Seurat.rds can then be used directly as input for the DoubletFinder and DoubletDecon tutorials. You can also load the TestData_Seurat.rds into R to inspect the Seurat object by first opening R:

singularity exec Demuxafy.sif R

Then read it in with:

seurat <- readRDS("TestData_Seurat.rds")


We have used this dataset for each of the tutorials. The example tables in the Results and Interpretation sections of each tutorial are the results from this dataset.