DoubletDetection

DoubletDetection is a transcriptome-based doublet detection software. It was one of the better-performing doublet detection methods that we identified in our paper (CITE), and it is also relatively fast to run. We provide a wrapper script so that DoubletDetection can be run easily from the command line, as well as example code for users who prefer to run it manually, depending on their data.

Data

This is the data that you will need to have prepared to run DoubletDetection:

Required

  • A counts matrix ($COUNTS)

    • DoubletDetection expects counts to be in the cellranger output format either as

      • h5 file (filtered_feature_bc_matrix.h5)

        or

      • matrix directory (directory containing barcodes.tsv, genes.tsv and matrix.mtx or barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz)

      • If you don’t have your data in either of these formats, you can run DoubletDetection manually in Python and load the data using a method of your choosing.
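As a rough illustration of the two accepted layouts, the sketch below classifies a path as an h5 file or a matrix directory by checking for the file names listed above. The function name `detect_counts_format` is hypothetical and not part of Demuxafy, and the wrapper script may perform its own checks differently.

```python
from pathlib import Path

# File sets named in the list above (uncompressed and gzipped layouts).
PLAIN_FILES = {"barcodes.tsv", "genes.tsv", "matrix.mtx"}
GZIPPED_FILES = {"barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz"}


def detect_counts_format(counts_path):
    """Return 'h5', 'mtx_dir' or 'unknown' for a counts path (hypothetical helper)."""
    path = Path(counts_path)
    if path.is_file() and path.suffix == ".h5":
        return "h5"
    if path.is_dir():
        present = {entry.name for entry in path.iterdir()}
        # A directory qualifies if it holds one complete set of matrix files.
        if PLAIN_FILES <= present or GZIPPED_FILES <= present:
            return "mtx_dir"
    return "unknown"
```

A result of 'unknown' would suggest loading the data manually in Python instead of using the wrapper.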

Optional

  • Output directory ($DOUBLETDETECTION_OUTDIR)

    • If you don’t provide a $DOUBLETDETECTION_OUTDIR, the results will be written to the current working directory.

  • Filtered barcode file

    • A list of barcodes that are a subset of the barcodes in your h5 or matrix.mtx files. This is useful if you have run other QC tools such as CellBender or DropletQC to remove empty droplets or droplets with damaged cells.

    • This file is expected to have no header.
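The filtered barcode file can be sanity-checked before a run. The sketch below, using hypothetical helper names, reads a headerless file with one barcode per line and reports any entries that are missing from the matrix barcodes. It assumes both files are plain or gzipped text and that the barcode is the first tab-separated field on each line.

```python
import gzip
from pathlib import Path


def read_barcodes(path):
    """Read one barcode per line from a plain or gzipped text file (no header expected)."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        # Keep the first tab-separated field and skip blank lines.
        return [line.split("\t")[0].strip() for line in handle if line.strip()]


def missing_from_matrix(filtered_path, matrix_barcodes_path):
    """Return filtered barcodes that are absent from the matrix barcodes."""
    matrix_barcodes = set(read_barcodes(matrix_barcodes_path))
    return [bc for bc in read_barcodes(filtered_path) if bc not in matrix_barcodes]
```

An empty list means the filtered file is a true subset. A stray header line such as `Barcode` would show up in the returned list.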

Run DoubletDetection

⏱️ Expected Resource Usage

~1h using a total of 15 GB of memory with 2 threads for the full Test Dataset, which contains ~20,982 droplets from 13 multiplexed donors.

You can run DoubletDetection either with the wrapper script we have provided or manually if you would prefer to alter more parameters. In addition, we provide an example for filtering the single-cell matrix to a subset of barcodes.

First, let’s assign the variables that will be used to execute each step.

Example Variable Settings

Below is an example of the variables that we can set up for use in the command below. These files are provided as a test dataset, available in the Data Preparation Documentation. Please replace the paths with the full paths to the data on your system.

DOUBLETDETECTION_OUTDIR=/path/to/output/DoubletDetection
COUNTS=/path/to/TestData4PipelineFull/test_dataset/outs/filtered_gene_bc_matrices/Homo_sapiens_GRCh38p10/

Then run the wrapper script:

singularity exec Demuxafy.sif DoubletDetection.py -m $COUNTS -o $DOUBLETDETECTION_OUTDIR
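If you launch the wrapper from Python rather than from a shell, the same command can be assembled as an argument list. This is only a sketch: `build_command` and `run_doubletdetection` are hypothetical helpers, and the flags they use (`-m`, `-o`, `-f`, `-i`) are the ones listed in the wrapper's help output.

```python
import subprocess


def build_command(counts, outdir=None, filtered_barcodes=None, n_iterations=None,
                  image="Demuxafy.sif"):
    """Assemble the wrapper call as a list of arguments (hypothetical helper)."""
    cmd = ["singularity", "exec", image, "DoubletDetection.py", "-m", str(counts)]
    if outdir is not None:
        cmd += ["-o", str(outdir)]
    if filtered_barcodes is not None:
        cmd += ["-f", str(filtered_barcodes)]
    if n_iterations is not None:
        cmd += ["-i", str(n_iterations)]
    return cmd


def run_doubletdetection(**kwargs):
    """Run the wrapper and raise if it exits with a non-zero status."""
    return subprocess.run(build_command(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.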

To see all the parameters that this wrapper script will accept, run:

singularity exec Demuxafy.sif DoubletDetection.py -h

usage: DoubletDetection.py [-h] -m COUNTS_MATRIX [-b BARCODES]
                          [-f FILTERED_BARCODES] [-o OUTDIR] [-p BOOST_RATE]
                          [-c N_COMPONENTS] [-g N_TOP_VAR_GENES] [-r REPLACE]
                          [-a CLUSTERING_ALGORITHM] [-k CLUSTERING_KWARGS]
                          [-i N_ITERATIONS] [-e PSEUDOCOUNT] [-n NORMALIZER]
                          [-d RANDOM_STATE] [-s STANDARD_SCALING] [-j N_JOBS]
                          [-t P_THRESH] [-v VOTER_THRESH]

wrapper for DoubletDetection for doublet detection from transcriptomic data.

optional arguments:
  -h, --help            show this help message and exit
  -m COUNTS_MATRIX, --counts_matrix COUNTS_MATRIX
                        cell ranger counts matrix directory containing matrix
                        files or full path to matrix.mtx. Can also also
                        provide the 10x h5.
  -b BARCODES, --barcodes BARCODES
                        File containing droplet barcodes. Use barcodes from
                        provided 10x dir by default.
  -f FILTERED_BARCODES, --filtered_barcodes FILTERED_BARCODES
                        File containing a filtered list of droplet barcodes.
                        This may be used if you want to use a filtered list of
                        barcodes for doublet detection (ie need to remove
                        droplets that are empty or high in ambient RNA).
  -o OUTDIR, --outdir OUTDIR
                        The output directory; default is current working
                        directory
  -p BOOST_RATE, --boost_rate BOOST_RATE
                        Proportion of cells used to generate synthetic
                        doublets; default is 0.25.
  -c N_COMPONENTS, --n_components N_COMPONENTS
                        Number of principal components to use; default is 30.
  -g N_TOP_VAR_GENES, --n_top_var_genes N_TOP_VAR_GENES
                        Number of top variable genes to use; default is 1000.
  -r REPLACE, --replace REPLACE
                        Whether to replace cells when generating synthetic
                        doublets; default is False.
  -a CLUSTERING_ALGORITHM, --clustering_algorithm CLUSTERING_ALGORITHM
                        Which clustering algorithm to use; default is
                        'phenograph'
  -k CLUSTERING_KWARGS, --clustering_kwargs CLUSTERING_KWARGS
                        Keyword arguments to pass to clustering algorithm;
                        default is None.
  -i N_ITERATIONS, --n_iterations N_ITERATIONS
                        Number of iterations to use; default is 50
  -e PSEUDOCOUNT, --pseudocount PSEUDOCOUNT
                        Pseudocount used to normalize counts; default is 0.1.
  -n NORMALIZER, --normalizer NORMALIZER
                        Method for raw counts normalization; default is None.
  -d RANDOM_STATE, --random_state RANDOM_STATE
                        Number to use to seed random state for PCA; default is
                        0.
  -s STANDARD_SCALING, --standard_scaling STANDARD_SCALING
                        Whether to use standard scaling of normalized count
                        matrix prior to PCA (True) or not (False); default is
                        True
  -j N_JOBS, --n_jobs N_JOBS
                        Number of jobs to to use; default is 1
  -t P_THRESH, --p_thresh P_THRESH
                        P-value threshold for doublet calling; default is
                        1e-16
  -v VOTER_THRESH, --voter_thresh VOTER_THRESH
                        Voter threshold for doublet calling; default is 0.5
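The last two options work together. As we understand the DoubletDetection classifier, each iteration yields a p-value per droplet, and a droplet is called a doublet when the fraction of iterations with a p-value at or below P_THRESH reaches VOTER_THRESH. The sketch below illustrates that rule in plain Python. The function name and the toy p-values are made up for illustration, and the library itself works on arrays of log p-values rather than on Python lists.

```python
def call_doublet(p_values, p_thresh=1e-16, voter_thresh=0.5):
    """Apply the voting rule to one droplet's per-iteration p-values (illustrative)."""
    if not p_values:
        raise ValueError("at least one iteration is required")
    # Count the iterations in which the droplet passes the p-value threshold.
    votes = sum(1 for p in p_values if p <= p_thresh)
    voting_average = votes / len(p_values)
    label = "doublet" if voting_average >= voter_thresh else "singlet"
    return label, voting_average


# Toy p-values for one droplet across four iterations (made up).
print(call_doublet([1e-20, 1e-18, 0.3, 1e-17]))
```

In this toy case three of four iterations pass the p-value threshold, so the voting average is 0.75 and the droplet is called a doublet at the default voter threshold of 0.5.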

DoubletDetection Results and Interpretation

After running DoubletDetection, you will have multiple files in the $DOUBLETDETECTION_OUTDIR:

/path/to/output/DoubletDetection
├── convergence_test.pdf
├── DoubletDetection_doublets_singlets.tsv
├── DoubletDetection_summary.tsv
└── threshold_test.pdf

We have found these to be the most helpful:

  • DoubletDetection_summary.tsv

    • A summary of the number of singlets and doublets predicted by DoubletDetection.

    DoubletDetection_DropletType    Droplet N
    doublet                         2594
    singlet                         18388

  • DoubletDetection_doublets_singlets.tsv

    • The per-barcode singlet and doublet classification from DoubletDetection.

      Barcode               DoubletDetection_DropletType
      AAACCTGAGATAGCAT-1    singlet
      AAACCTGAGCAGCGTA-1    singlet
      AAACCTGAGCGATGAC-1    singlet
      AAACCTGAGCGTAGTG-1    singlet
      AAACCTGAGGAGTTTA-1    singlet
      AAACCTGAGGCTCATT-1    singlet
      AAACCTGAGGGCACTA-1    singlet

  • convergence_test.pdf

    • The expectation is that the number of predicted doublets will converge after multiple iterations. If that is not the case, we suggest that you run DoubletDetection for more iterations using the -i option (try 150, or even 250 if that isn’t convincing).

    • Here are two figures: one of a sample that reached convergence after 50 iterations (left) and one that did not (right).

      Good convergence (left): https://user-images.githubusercontent.com/44268007/104434976-ccf8fa80-55db-11eb-9f30-00f71e4592d4.png
      Bad convergence (right): https://user-images.githubusercontent.com/44268007/95423527-f545dd00-098c-11eb-8a48-1ca6bb507151.png
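The summary file can be re-derived from the per-barcode file, which gives a quick consistency check after a run. The sketch below assumes that DoubletDetection_doublets_singlets.tsv is tab-separated with the column names shown above (Barcode and DoubletDetection_DropletType); `count_droplet_types` is a hypothetical helper.

```python
import csv
from collections import Counter


def count_droplet_types(tsv_path):
    """Count droplets per DoubletDetection_DropletType in the per-barcode file."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["DoubletDetection_DropletType"] for row in reader)
```

For the test dataset, the counts should agree with DoubletDetection_summary.tsv (2594 doublets and 18388 singlets).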

Merging Results with Other Software Results

We have provided a script that helps merge and summarize the results from multiple tools. See Combine Results.

Citation

If you used the Demuxafy platform for analysis, please reference our preprint as well as DoubletDetection.