Tutorial

In this tutorial, we’ll analyze the Test Dataset, which contains 13 multiplexed donors and ~20,982 droplets.

This tutorial takes you through the typical steps for demultiplexing data with Demuxafy. For doublet detection on non-multiplexed data the process is the same, but the recommended combination of software will differ.

Select appropriate software combination
Run each selected software
Combine the results and call a final assignment for each droplet

Selecting Software Combination

First, we’ll identify the software we should run using the Software Selection Tool. Start by entering the required information about the single-cell run into the Software Selection Tool. This sample is genetically multiplexed, contains 13 donors, and has reference SNP genotypes for each donor in the pool. When we enter that information and press “SUBMIT”, we can see the recommended method combination in the table below:

Note

If you want to compare different capture characteristics for experimental design, the recommendations and expected doublet numbers will be maintained in the table for easy comparison.

Per the recommendations we will run the following methods before combining them.

Demuxalot (refined)
Dropulation
DoubletDetection
ScDblFinder
Scds

Run Each Software

We will first run each software, then combine the results and call a final assignment for each droplet based on the annotations from each software. Although the methods are listed in sequence here, we run them in parallel since none of them depends on another.
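Because the methods are independent, one way to launch them together from a single shell is to start each command in the background and then wait for all of them. The following is a minimal sketch under that assumption, not part of Demuxafy: `run_in_parallel` is a hypothetical helper, and on a cluster you would more likely submit each command as its own scheduler job.

```shell
# Hypothetical helper (not part of Demuxafy): run each argument as its own
# background shell command, wait for all of them, and return the number of
# commands that failed (0 means every command succeeded).
run_in_parallel() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &
        pids="$pids $!"
    done
    failures=0
    for pid in $pids; do
        wait "$pid" || failures=$((failures + 1))
    done
    return "$failures"
}

# Example with placeholder commands; substitute the full command for each
# method once the variables in this tutorial have been set:
# run_in_parallel "sleep 2" "sleep 3"
```

Each command is passed as a single quoted string, so variables inside double quotes are expanded by the calling shell before the command starts.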

Data

This is the data that you will need to have prepared to run the selected software:

Required

Reference SNP genotypes for each individual ($VCF)
- Filtered for common SNPs (> 5% minor allele frequency) and SNPs overlapping genes
Barcode file ($BARCODES)
- A file listing the barcodes of the droplets that contain cells in the pool. One barcode per line, no header.
Bam file ($BAM)
- Aligned single cell reads
Output directory ($OUTDIR)
A text file with the individual ids ($INDS)
- File containing the individual ids (separated by line) as they appear in the vcf file
- For example, this is the individual file for our example dataset
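If you do not already have the individual id file, it can be derived from the VCF header so that the ids match exactly. The following is a minimal sketch assuming an uncompressed VCF; `vcf_samples` is a hypothetical helper, and the file names in the usage comment are placeholders.

```shell
# Hypothetical helper: print the sample (donor) ids from the #CHROM header
# line of an uncompressed VCF, one per line. Sample columns start at field 10.
vcf_samples() {
    awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1"
}

# Example usage (placeholder file names):
# vcf_samples test_dataset.vcf > donor_list.txt
```

If bcftools is installed, `bcftools query -l` on the same VCF should give the same list.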

Here’s an example of our folder tree setup that will be used for this tutorial:

/path/to/data/
└── TestData4PipelineFull
    ├── donor_list.txt
    ├── samplesheet.txt
    ├── test_dataset
    │   ├── outs
    │   │   └── filtered_gene_bc_matrices
    │   │       └── Homo_sapiens_GRCh38p10
    │   │           ├── barcodes.tsv
    │   │           ├── genes.tsv
    │   │           └── matrix.mtx
    │   ├── possorted_genome_bam.bam
    │   └── possorted_genome_bam.bam.bai
    └── test_dataset.vcf

We’ll start by setting up variables for the files and folders that we need to run each software. These files are provided as a test dataset, available in the Data Preparation Documentation. Please replace /path/to with the full path to your data directory.

## Paths to files ##
VCF=/path/to/TestData4PipelineFull/test_dataset.vcf
COUNTS=/path/to/TestData4PipelineFull/test_dataset/outs/filtered_gene_bc_matrices/Homo_sapiens_GRCh38p10/
BARCODES=$COUNTS/barcodes.tsv
BAM=/path/to/TestData4PipelineFull/test_dataset/possorted_genome_bam.bam
INDS=/path/to/TestData4PipelineFull/donor_list.txt
GTF=/path/to/genes.gtf ## We do not provide this file. It should be the gtf file that you used to align your data; otherwise, you can download an appropriate gtf file from https://www.gencodegenes.org/human/

## Output directories ##
OUTDIR=/path/to/output
DEMUXALOT_OUTDIR=$OUTDIR/demuxalot
DROPULATION_OUTDIR=$OUTDIR/dropulation
DOUBLETDETECTION_OUTDIR=$OUTDIR/DoubletDetection
SCDBLFINDER_OUTDIR=$OUTDIR/scDblFinder
SCDS_OUTDIR=$OUTDIR/scds
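Before launching the longer jobs, it can save time to confirm that every input path resolves and to create the output directories up front. This is a minimal sketch, not part of Demuxafy: `check_inputs` is a hypothetical helper, and creating the directories in advance is an assumption that is harmless whether or not each tool creates its own.

```shell
# Hypothetical helper: report every listed path that does not exist and
# return non-zero if at least one is missing.
check_inputs() {
    missing=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "Missing input: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage once the variables above are set:
# check_inputs "$VCF" "$COUNTS" "$BARCODES" "$BAM" "$INDS" &&
#     mkdir -p "$DEMUXALOT_OUTDIR" "$DROPULATION_OUTDIR" \
#              "$DOUBLETDETECTION_OUTDIR" "$SCDBLFINDER_OUTDIR" "$SCDS_OUTDIR"
```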

Demuxalot (refined)

We’ll run Demuxalot with variant refinement:

⏱️ Expected Resource Usage

~2.5h using a total of 81Gb memory when using 32 threads

singularity exec Demuxafy.sif python Demuxalot.py \
        -b $BARCODES \
        -a $BAM \
        -n $INDS \
        -v $VCF \
        -o $DEMUXALOT_OUTDIR \
        -r True

HELP! It says my file/directory doesn’t exist!

If you receive an error indicating that a file or directory doesn’t exist but you are sure that it does, this is likely an issue arising from Singularity. This is easy to fix. The issue and solution are explained in detail in the Notes About Singularity Images.
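A common cause is that the directory holding the file is not mounted inside the container. Singularity’s `--bind` option mounts additional host directories, so one possible workaround is to bind the parent directory of every input. The sketch below is an illustration only: `bind_dirs` is a hypothetical helper, and the Notes About Singularity Images remain the authoritative description of the fix.

```shell
# Hypothetical helper: print a comma-separated list of the unique parent
# directories of the given paths, in the form Singularity's --bind expects.
bind_dirs() {
    for path in "$@"; do
        dirname "$path"
    done | sort -u | paste -sd, -
}

# Example usage (assumes the variables defined earlier in this tutorial):
# BIND="$(bind_dirs "$VCF" "$BAM" "$INDS" "$BARCODES"),$OUTDIR"
# singularity exec --bind "$BIND" Demuxafy.sif bash demuxalot_summary.sh $DEMUXALOT_OUTDIR/assignments_refined.tsv.gz
```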

If Demuxalot is successful, you will have these new files in your $DEMUXALOT_OUTDIR:

/path/to/output/demuxalot
├── assignments_refined.tsv.gz
├── assignments.tsv.gz
├── likelihoods_refined.tsv.gz
├── likelihoods.tsv.gz
├── posterior_probabilities_refined.tsv.gz
└── posterior_probabilities.tsv.gz

Let’s check how many droplets were assigned as each donor and as doublets by Demuxalot with the demuxalot_summary.sh script:

singularity exec Demuxafy.sif bash demuxalot_summary.sh $DEMUXALOT_OUTDIR/assignments_refined.tsv.gz

which will return:

Classification	Assignment N
113_113	1344
349_350	1463
352_353	1619
39_39	1306
40_40	1082
41_41	1129
42_42	1437
43_43	1553
465_466	1091
596_597	1267
597_598	1523
632_633	872
633_634	961
660_661	1371
doublet	2964

The estimated number of doublets (2,964) is slightly lower than the predicted number of doublets (3,522).
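As a quick sanity check, the counts in the summary can be added up and compared with the number of barcodes; for the table above they sum to 20,982. The following is a minimal sketch that assumes the summary script prints a tab-separated table to standard output and that the output was saved to a file; `total_droplets` and `demuxalot_summary.tsv` are hypothetical names.

```shell
# Hypothetical helper: sum the count column (second field) of a
# tab-separated summary table, skipping the header line.
total_droplets() {
    awk -F '\t' 'NR > 1 { total += $2 } END { print total + 0 }' "$1"
}

# Example usage (assumes the summary was first redirected to a file):
# singularity exec Demuxafy.sif bash demuxalot_summary.sh $DEMUXALOT_OUTDIR/assignments_refined.tsv.gz > demuxalot_summary.tsv
# [ "$(total_droplets demuxalot_summary.tsv)" -eq "$(wc -l < "$BARCODES")" ] && echo "all barcodes accounted for"
```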

Dropulation

We’ll also run Dropulation:
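The Dropulation commands below read $DROPULATION_OUTDIR/possorted_genome_bam_dropulation_tag.bam, a gene-annotated copy of $BAM that has to exist before assignment starts. As far as we are aware, the Demuxafy Dropulation documentation produces this file with the Drop-seq tool TagReadWithGeneFunction and the $GTF file; treat the command in the comment below as an assumption to verify against that page. The guard function is a hypothetical helper, not part of Demuxafy.

```shell
# Hypothetical guard: succeed only if the tagged bam exists and is non-empty.
require_tagged_bam() {
    if [ ! -s "$1" ]; then
        echo "Tagged bam not found: $1" >&2
        echo "Create it first (see the Demuxafy Dropulation documentation)." >&2
        return 1
    fi
}

# Assumed annotation command (verify against the Dropulation documentation):
# singularity exec Demuxafy.sif TagReadWithGeneFunction \
#           --ANNOTATIONS_FILE $GTF \
#           --INPUT $BAM \
#           --OUTPUT $DROPULATION_OUTDIR/possorted_genome_bam_dropulation_tag.bam
#
# require_tagged_bam "$DROPULATION_OUTDIR/possorted_genome_bam_dropulation_tag.bam"
```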

Dropulation Assignment

First, Dropulation estimates the likelihood of each donor for each droplet.

⏱️ Expected Resource Usage

~4h using a total of 3Gb memory when using 12 threads

Please note that the \ at the end of each line is purely for readability to put a separate parameter argument on each line.

singularity exec Demuxafy.sif AssignCellsToSamples --CELL_BC_FILE $BARCODES \
          --INPUT_BAM $DROPULATION_OUTDIR/possorted_genome_bam_dropulation_tag.bam \
          --OUTPUT $DROPULATION_OUTDIR/assignments.tsv.gz \
          --VCF $VCF \
          --SAMPLE_FILE $INDS \
          --CELL_BARCODE_TAG 'CB' \
          --MOLECULAR_BARCODE_TAG 'UB' \
          --VCF_OUTPUT $DROPULATION_OUTDIR/out_vcf.vcf \
          --MAX_ERROR_RATE 0.05

If the assignment is successful, your $DROPULATION_OUTDIR will contain these files:

/path/to/output/dropulation
├── assignments.tsv.gz
├── out_vcf.vcf
├── out_vcf.vcf.idx
└── possorted_genome_bam_dropulation_tag.bam

Dropulation Doublet

Next, Dropulation estimates the likelihood that each droplet is a doublet.

⏱️ Expected Resource Usage

~1.5h using a total of 5Gb memory when using 16 threads

Next, we will identify the likelihoods of each droplet being a doublet.

Note

Please change the cell barcode and molecular barcode tags as necessary. For 10x experiments processed with cellranger, this should be ‘CB’ for the CELL_BARCODE_TAG and ‘UB’ for the MOLECULAR_BARCODE_TAG

Please note that the \ at the end of each line is purely for readability to put a separate parameter argument on each line.

singularity exec Demuxafy.sif DetectDoublets --CELL_BC_FILE $BARCODES \
          --INPUT_BAM $DROPULATION_OUTDIR/possorted_genome_bam_dropulation_tag.bam \
          --OUTPUT $DROPULATION_OUTDIR/likelihoods.tsv.gz \
          --VCF $VCF \
          --CELL_BARCODE_TAG 'CB' \
          --MOLECULAR_BARCODE_TAG 'UB' \
          --SINGLE_DONOR_LIKELIHOOD_FILE $DROPULATION_OUTDIR/assignments.tsv.gz \
          --SAMPLE_FILE $INDS \
          --MAX_ERROR_RATE 0.05

Dropulation Call

Finally, we will make final assignments for each droplet based on the doublet and assignment calls.

Please note that the \ at the end of each line is purely for readability to put a separate parameter argument on each line.

singularity exec Demuxafy.sif dropulation_call.R --assign $DROPULATION_OUTDIR/assignments.tsv.gz \
                           --doublet $DROPULATION_OUTDIR/likelihoods.tsv.gz \
                           --out $DROPULATION_OUTDIR/updated_assignments.tsv.gz

If the final call is successful, your $DROPULATION_OUTDIR will contain these files:

/path/to/output/dropulation
├── assignments.tsv.gz
├── likelihoods.tsv.gz
├── out_vcf.vcf
├── out_vcf.vcf.idx
├── possorted_genome_bam_dropulation_tag.bam
└── updated_assignments.tsv.gz

Dropulation Summary

We can check the distribution of cells that were assigned to each donor and annotated as doublets with the Dropulation_summary.sh script:

singularity exec Demuxafy.sif bash Dropulation_summary.sh $DROPULATION_OUTDIR/updated_assignments.tsv.gz

which will return:

Classification	Assignment N
113_113	1327
349_350	1440
352_353	1562
39_39	1255
40_40	1082
41_41	1122
42_42	1365
43_43	1546
465_466	1084
596_597	1258
597_598	1515
632_633	815
633_634	892
660_661	1364
doublet	3355

The estimated number of doublets (3,355) is very close to the predicted number of doublets (3,522).
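To see where the two demultiplexing methods differ, their summaries can be compared donor by donor. This is a minimal sketch assuming both summaries were saved as tab-separated files with a header line; `compare_summaries` and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: for every classification present in both tables, print
# classification, count in the first table, count in the second table, and
# the difference (second minus first), separated by tabs.
compare_summaries() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1 { next }                   # skip the header of each file
        NR == FNR { first[$1] = $2; next }  # remember counts from the first table
        $1 in first { print $1, first[$1], $2, $2 - first[$1] }
    ' "$1" "$2"
}

# Example usage (placeholder file names for the two saved summaries):
# compare_summaries demuxalot_summary.tsv dropulation_summary.tsv
```

With the two tables shown in this tutorial, the doublet row would read 2964, 3355 and a difference of 391.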

DoubletDetection

We will also run DoubletDetection:

singularity exec Demuxafy.sif DoubletDetection.py -m $COUNTS -o $DOUBLETDETECTION_OUTDIR

This will return the following files:

/path/to/output/DoubletDetection
├── convergence_test.pdf
├── DoubletDetection_doublets_singlets.tsv
├── DoubletDetection_summary.tsv
└── threshold_test.pdf

Looking at the DoubletDetection_summary.tsv file, the number of doublets (2,594) is lower than the predicted number of doublets (3,522).

DoubletDetection_DropletType	Droplet N
doublet	2594
singlet	18388

ScDblFinder

We will also run ScDblFinder.

⏱️ Expected Resource Usage

~1min using a total of 3Gb memory when using 2 threads for the full Test Dataset, which contains ~20,982 droplets from 13 multiplexed donors.

singularity exec Demuxafy.sif scDblFinder.R -o $SCDBLFINDER_OUTDIR -t $COUNTS

After running ScDblFinder, you will have two files in the $SCDBLFINDER_OUTDIR:

/path/to/output/scDblFinder
├── scDblFinder_doublets_singlets.tsv
└── scDblFinder_doublet_summary.tsv

Looking at the scDblFinder_doublet_summary.tsv file, the number of doublets (3,323) is slightly lower than the predicted number of doublets (3,522).

Classification	Droplet N
doublet	3323
singlet	17659

Scds

Finally, we will also run Scds.

⏱️ Expected Resource Usage

~7min using a total of 10Gb memory when using 2 threads for the full Test Dataset, which contains ~20,982 droplets from 13 multiplexed donors.

To run Scds with our wrapper script, simply execute the following in your shell:

singularity exec Demuxafy.sif scds.R -o $SCDS_OUTDIR -t $COUNTS

After running Scds, whether with the wrapper script or manually, you should have two files in the $SCDS_OUTDIR:

/path/to/output/scds
├── scds_doublets_singlets.tsv
└── scds_doublet_summary.tsv

Looking at the scds_doublet_summary.tsv file, the number of doublets (2,771) is lower than the predicted number of doublets (3,522).
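The same comparison against the predicted doublet number can be made for every method with one small function. This is a minimal sketch that assumes each summary file is tab-separated and labels its doublet row exactly as doublet; `doublet_vs_expected` is a hypothetical helper.

```shell
# Hypothetical helper: print the doublet count from a tab-separated summary
# table together with that count as a percentage of the expected number.
doublet_vs_expected() {
    awk -F '\t' -v expected="$2" '
        $1 == "doublet" { printf "%d\t%.1f%%\n", $2, 100 * $2 / expected }
    ' "$1"
}

# Example usage with the predicted number of doublets for this pool:
# doublet_vs_expected $SCDS_OUTDIR/scds_doublet_summary.tsv 3522
```

For the counts reported in this tutorial, the methods recover roughly 84% (Demuxalot), 95% (Dropulation), 74% (DoubletDetection), 94% (ScDblFinder) and 79% (Scds) of the predicted 3,522 doublets.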

Merging Results and Joint Software Calls

Now, we will combine the results from each software we’ve run (Demuxalot (refined), Dropulation, DoubletDetection, ScDblFinder, Scds) and annotate the droplet type and donor assignment of each droplet.

singularity exec Demuxafy.sif Combine_Results.R \
  -o $OUTDIR/combined_results.tsv \
  --demuxalot $DEMUXALOT_OUTDIR \
  --dropulation $DROPULATION_OUTDIR \
  --DoubletDetection $DOUBLETDETECTION_OUTDIR \
  --scDblFinder $SCDBLFINDER_OUTDIR \
  --scds $SCDS_OUTDIR \
  --method "MajoritySinglet"

Results and Interpretation

After running the Combine_Results.R script, you will have the following results. Here, we show the results for the above example, which also provides combined calls made with the “MajoritySinglet” method.

/path/to/output/combined
├── combined_results_assignment_summary.tsv
├── combined_results_demultiplexing_summary.tsv
├── combined_results_droplet_type_summary.tsv
├── combined_results_Singlets_upset_donor_assignment.pdf
├── combined_results_Singlets_upset_droplet_type.pdf
├── combined_results_Singlets_upset.pdf
├── combined_results_summary.tsv
├── combined_results.tsv
└── combined_results_w_combined_assignments.tsv

Here’s a deeper look at the contents of some of those files:

First, we can look at the combined calls in the upset plot (combined_results_Singlets_upset.pdf):

This is an upset figure of the droplets which are colored by their final individual or doublet classification.

A filled circle indicates that those droplets are classified as singlets by that software, while an empty circle indicates a doublet classification by that software.

There are also summary files that provide the number of droplets annotated as a singlet or doublet by the combined software (combined_results_droplet_type_summary.tsv):

Classification	Droplet N
doublet	2771
singlet	18211

and the number of droplets assigned to each donor, called as doublets, or left unassigned by the combined software (combined_results_assignment_summary.tsv):

Classification	Droplet N
113_113	1333
349_350	1443
352_353	1607
39_39	1289
40_40	1072
41_41	1123
42_42	1409
43_43	1542
465_466	1084
596_597	1249
597_598	1493
632_633	859
633_634	953
660_661	1354
doublet	2771
unassigned	401

The combined_results_w_combined_assignments.tsv file contains all the method calls plus the combined final calls together in a single file that can be used as cell metadata for downstream analyses.

Barcode	Demuxalot_Individual_Assignment	Demuxalot_DropletType	Dropulation_DropletType	Dropulation_Individual_Assignment	DoubletDetection_DropletType	scDblFinder_DropletType	scDblFinder_Score	scds_score	scds_DropletType	MajoritySinglet_DropletType	MajoritySinglet_Individual_Assignment
AAACCTGAGATAGCAT-1	41_41	singlet	singlet	41_41	singlet	singlet	0.000161838892381638	0.11384647224872	singlet	singlet	41_41
AAACCTGAGCAGCGTA-1	465_466	singlet	singlet	465_466	singlet	singlet	0.038923978805542	0.503487172824797	singlet	singlet	465_466
AAACCTGAGCGATGAC-1	113_113	singlet	singlet	113_113	singlet	singlet	0.000687798717990518	0.0122651890679041	singlet	singlet	113_113
AAACCTGAGCGTAGTG-1	349_350	singlet	singlet	349_350	singlet	singlet	6.88672153046355e-05	0.099564348390602	singlet	singlet	349_350
AAACCTGAGGAGTTTA-1	632_633	singlet	singlet	632_633	singlet	singlet	0.000810008263215423	0.0887153542233592	singlet	singlet	632_633
AAACCTGAGGCTCATT-1	39_39	singlet	singlet	39_39	singlet	singlet	0.0342786461114883	0.0521116636059276	singlet	singlet	39_39
AAACCTGAGGGCACTA-1	465_466	singlet	singlet	465_466	singlet	doublet	0.962486505508423	0.600842973151551	singlet	singlet	465_466
AAACCTGAGTAATCCC-1	660_661	singlet	singlet	660_661	singlet	singlet	0.00426467135548592	0.431225466194795	singlet	singlet	660_661
…	…	…	…	…	…	…	…	…	…	…	…
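Since this table carries the final calls for every barcode, any of its columns can be tabulated directly from the shell. This is a minimal sketch assuming a tab-separated file with the header row shown above; `count_by_column` and the `COMBINED` variable are hypothetical, so point `COMBINED` at wherever your combined file was written.

```shell
# Hypothetical helper: count how often each value occurs in a named column
# of a tab-separated file with a header row. Prints value<TAB>count, sorted.
count_by_column() {
    awk -F '\t' -v col="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { counts[$idx]++ }
        END { for (value in counts) print value "\t" counts[value] }
    ' "$1" | sort
}

# Example usage (COMBINED is a placeholder for the path to your combined file):
# count_by_column "$COMBINED" MajoritySinglet_DropletType
# count_by_column "$COMBINED" MajoritySinglet_Individual_Assignment
```

The counts should agree with the droplet type and assignment summary files listed earlier.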

You have completed the Tutorial! You can now apply the required methods to your dataset. If you have any questions, issues or recommendations, feel free to reach out by opening a GitHub issue.

Citation

If you used the Demuxafy platform for analysis, please reference our preprint as well as DoubletDetection.