Dropulation

Dropulation is a genotype demultiplexing software that requires reference genotypes to be available for each individual in the pool. Therefore, if you don’t have reference genotypes, you may want to demultiplex with one of the softwares that do not require reference genotype data (Freemuxlet, scSplit, Souporcell or Vireo)

Data

This is the data that you will need to have prepare to run Dropulation:

Required

  • Reference SNP genotypes for each individual ($VCF)

    • Filter for common SNPs (> 5% minor allele frequency) and SNPs overlapping genes

  • Barcode file ($BARCODES)

  • Bam file ($BAM)

    • Aligned single cell reads

    • This must contain gene annotations as produced by TagReadWithGeneFunction to have gene name (gn), gene strand (gs) and gene function (gf) - we have included code for doing this below, it will require a gtf file for annotation.

  • Output directory ($DROPULATION_OUTDIR)

  • GTF gene annotation file ($GTF)

    • If you already have a bam annotated with TagReadWithGeneFunction, then you won’t need the $GTF file

    • Ideally, this would be the gtf file used for alignment but you can also get gtf files from GENCODE

  • A text file with the individual ids ($INDS)

    • File containing the individual ids (separated by line) as they appear in the vcf file

    • For example, this is the individual file for our example dataset

Run Dropulation

First, let’s assign the variables that will be used to execute each step.

Example Variable Settings

Below is an example of the variables that we can set up to be used in the command below. These are files provided as a test dataset available in the Data Preparation Documentation Please replace paths with the full path to your data on your system.

VCF=/path/to/TestData4PipelineFull/test_dataset.vcf
BARCODES=/path/to/TestData4PipelineFull/test_dataset/outs/filtered_gene_bc_matrices/Homo_sapiens_GRCh38p10/barcodes.tsv
BAM=/path/to/test_dataset/possorted_genome_bam.bam
DROPULATION_OUTDIR=/path/to/output/dropulation
INDS=/path/to/TestData4PipelineFull/donor_list.txt
GTF=/path/to/genes.gtf

Bam Annotation

⏱️ Expected Resource Usage

~4h using a total of 3Gb memory when using 12 thread for the full Test Dataset which contains ~20,982 droplets of 13 multiplexed donors,

You will most likely need to annotate your bam using TagReadWithGeneFunction (unless you have already annotated your bam with this function). Please note that the \ at the end of each line is purely for readability to put a separate parameter argument on each line.

note

If you are submitting this job to an cluster to run, you may have to bind the $TMPDIR directory used by your cluster in the singularity command. This will be slightly different depending on your cluster configuration.

singularity exec Demuxafy.sif TagReadWithGeneFunction \
          --ANNOTATIONS_FILE $GTF \
          --INPUT $BAM \
          --OUTPUT $DROPULATION_OUTDIR/possorted_genome_bam_dropulation_tag.bam

If the bam annotation is successful, you will have these new files in your $DROPULATION_OUTDIR:

/path/to/output/dropulation
└── possorted_genome_bam_dropulation_tag.bam

Dropulation Assignment

⏱️ Expected Resource Usage

~1.5h using a total of 5Gb memory when using 16 thread for the full Test Dataset which contains ~20,982 droplets of 13 multiplexed donors,

First, we will identify the most likely singlet donor for each droplet.

Note

Please change the cell barcode and molecular barcode tags as necessary. For 10x experiments processed with cellranger, this should be ‘CB’ for the CELL_BARCODE_TAG and ‘UB’ for the MOLECULAR_BARCODE_TAG

note

If you are submitting this job to an cluster to run, you may have to bind the $TMPDIR directory used by your cluster in the singularity command. This will be slightly different depending on your cluster configuration.

Please note that the \ at the end of each line is purely for readability to put a separate parameter argument on each line.

singularity exec Demuxafy.sif AssignCellsToSamples --CELL_BC_FILE $BARCODES \
          --INPUT_BAM $DROPULATION_OUTDIR/possorted_genome_bam_dropulation_tag.bam \
          --OUTPUT $DROPULATION_OUTDIR/assignments.tsv.gz \
          --VCF $VCF \
          --SAMPLE_FILE $INDS \
          --CELL_BARCODE_TAG 'CB' \
          --MOLECULAR_BARCODE_TAG 'UB' \
          --VCF_OUTPUT $DROPULATION_OUTDIR/assignment.vcf \
          --MAX_ERROR_RATE 0.05

If the bam annotation is successful, you will have these new files in your $DROPULATION_OUTDIR:

/path/to/output/dropulation
├── assignments.tsv.gz
├── out_vcf.vcf
├── out_vcf.vcf.idx
└── possorted_genome_bam_dropulation_tag.bam

Dropulation Doublet

⏱️ Expected Resource Usage

~1.5h using a total of 5Gb memory when using 16 thread for the full Test Dataset which contains ~20,982 droplets of 13 multiplexed donors,

Next, we will identify the likelihoods of each droplet being a doublet.

Note

Please change the cell barcode and molecular barcode tags as necessary. For 10x experiments processed with cellranger, this should be ‘CB’ for the CELL_BARCODE_TAG and ‘UB’ for the MOLECULAR_BARCODE_TAG

note

If you are submitting this job to an cluster to run, you may have to bind the $TMPDIR directory used by your cluster in the singularity command. This will be slightly different depending on your cluster configuration.

Please note that the \ at the end of each line is purely for readability to put a separate parameter argument on each line.

singularity exec Demuxafy.sif DetectDoublets --CELL_BC_FILE $BARCODES \
          --INPUT_BAM $DROPULATION_OUTDIR/possorted_genome_bam_dropulation_tag.bam \
          --OUTPUT $DROPULATION_OUTDIR/likelihoods.tsv.gz \
          --VCF $VCF \
          --CELL_BARCODE_TAG 'CB' \
          --MOLECULAR_BARCODE_TAG 'UB' \
          --SINGLE_DONOR_LIKELIHOOD_FILE $DROPULATION_OUTDIR/assignments.tsv.gz \
          --SAMPLE_FILE $INDS \
          --MAX_ERROR_RATE 0.05

If the bam annotation is successful, you will have these new files in your $DROPULATION_OUTDIR:

/path/to/output/dropulation
├── assignments.tsv.gz
├── likelihoods.tsv.gz
├── out_vcf.vcf
├── out_vcf.vcf.idx
└── possorted_genome_bam_dropulation_tag.bam

Dropulation Call

Finally, we will make final assignments for each droplet based on the doublet and assignment calls.

Please note that the \ at the end of each line is purely for readability to put a separate parameter argument on each line.

singularity exec Demuxafy.sif dropulation_call.R --assign $DROPULATION_OUTDIR/assignments.tsv.gz \
                           --doublet $DROPULATION_OUTDIR/likelihoods.tsv.gz \
                           --out $DROPULATION_OUTDIR/updated_assignments.tsv.gz

If the bam annotation is successful, you will have these new files in your $DROPULATION_OUTDIR:

/path/to/output/dropulation
├── assignments.tsv.gz
├── likelihoods.tsv.gz
├── out_vcf.vcf
├── out_vcf.vcf.idx
├── possorted_genome_bam_dropulation_tag.bam
└── updated_assignments.tsv.gz

Dropulation Summary

We have provided a script that will summarize the number of droplets classified as doublets, ambiguous and assigned to each donor by Dropulation and write it to the $DROPULATION_OUTDIR. You can run this to get a fast and easy summary of your results by providing the path to your result file:

singularity exec Demuxafy.sif bash Dropulation_summary.sh $DROPULATION_OUTDIR/updated_assignments.tsv.gz

which will return:

Classification

Assignment N

113_113

1327

349_350

1440

352_353

1562

39_39

1255

40_40

1082

41_41

1122

42_42

1365

43_43

1546

465_466

1084

596_597

1258

597_598

1515

632_633

815

633_634

892

660_661

1364

doublet

3355

or you can write it straight to a file:

singularity exec Demuxafy.sif bash Dropulation_summary.sh $DROPULATION_OUTDIR/updated_assignments.tsv.gz > $DROPULATION_OUTDIR/dropulation_summary.tsv

Note

To check if these numbers are consistent with the expected doublet rate in your dataset, you can use our Doublet Estimation Calculator.

Dropulation Results and Interpretation

After running the Dropulation steps and summarizing the results, you will have a number of files from some of the intermediary steps. These are the files that most users will find the most informative:

  • updated_assignments.tsv.gz

    • The predicted annotations for each droplet and metrics:

      Barcode

      dropulation_Likelihood

      dropulation_Assignment

      dropulation_DropletType

      dropulation_Nsnps

      dropulation_Numis

      CATATGGCAGCTCGCA-1

      -44.523

      596_597

      singlet

      193

      381

      ACATACGGTCGAATCT-1

      -93.431

      632_633

      singlet

      296

      675

      GCATGCGAGATCACGG-1

      -45.708

      41_41

      singlet

      241

      536

      CCTTACGGTAGCTCCG-1

      -21.723

      41_41

      singlet

      135

      217

      TTTACTGCAATGAATG-1

      -26.521

      352_353

      singlet

      120

      206

Merging Results with Other Software Results

We have provided a script that will help merge and summarize the results from multiple softwares together. See Combine Results.

Citation

If you used the Demuxafy platform for analysis, please reference our preprint as well as Dropulation.