Freemuxlet

Freemuxlet is a genotype-free demultiplexing software that does not require you to have SNP genotypes for the donors in your multiplexed capture. In fact, it can't natively integrate SNP genotypes into the demultiplexing. We have provided some scripts that will help identify clusters from given donors if you do have SNP genotypes but use Freemuxlet. However, it might be better to use a software that is designed to integrate SNP genotypes while assigning donor/cluster (Demuxlet, Souporcell or Vireo).

Data

This is the data that you will need to have prepared to run Freemuxlet:

Required

  • Common SNP genotypes vcf ($VCF)

    • While not exactly required, using common SNP genotype locations enhances accuracy

      • If you have reference SNP genotypes for individuals in your pool, you can use those

      • If you do not have reference SNP genotypes, they can be from any large population resource (e.g. 1000 Genomes or HRC)

    • Filter for common SNPs (> 5% minor allele frequency) and SNPs overlapping genes

  • Barcode file ($BARCODES)

  • Number of samples in pool ($N)

  • Bam file ($BAM)

    • Aligned single cell reads

  • Output directory ($FREEMUXLET_OUTDIR)

Optional

  • The SAM tag used in the Bam file to annotate the aligned single cell reads with their corresponding cell barcode ($CELL_TAG)

    • If not specified, Freemuxlet defaults to using CB.

  • The SAM tag used in the Bam file to annotate the aligned single cell reads with their corresponding unique molecular identifier (UMI) ($UMI_TAG)

    • If not specified, Freemuxlet defaults to using UB.

Run Freemuxlet

First, let’s assign the variables that will be used to execute each step.

Example Variable Settings

Below is an example of the variables that we can set up to be used in the commands below. These files are provided as a test dataset, available in the Data Preparation Documentation. Please replace the paths with the full paths to the data on your system.

VCF=/path/to/TestData4PipelineFull/test_dataset.vcf
BARCODES=/path/to/TestData4PipelineFull/test_dataset/outs/filtered_gene_bc_matrices/Homo_sapiens_GRCh38p10/barcodes.tsv
BAM=/path/to/test_dataset/possorted_genome_bam.bam
FREEMUXLET_OUTDIR=/path/to/output/freemuxlet
N=14
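
The commands in the following steps reference $CELL_TAG and $UMI_TAG, which the example settings above do not define. The following is a minimal sketch, assuming a POSIX-compatible shell, of one way to fall back to the documented defaults (CB and UB) and to confirm that the input files exist before starting a multi-hour run. The check_inputs helper is hypothetical and is not part of Demuxafy.

```shell
# Use the documented defaults when the optional tags are not set.
CELL_TAG="${CELL_TAG:-CB}"
UMI_TAG="${UMI_TAG:-UB}"

# Hypothetical helper: report every input file that is missing or empty.
# Returns 0 when all files are present and non-empty, 1 otherwise.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty input: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage (uncomment once the variables above point at real files):
# check_inputs "$VCF" "$BARCODES" "$BAM" && mkdir -p "$FREEMUXLET_OUTDIR"
```

Running the check first is a design choice rather than a requirement: it simply surfaces a mistyped path before the pileup step is launched.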

Popscle Pileup

⏱️ Expected Resource Usage

~3-4h using a total of 91 GB memory when using 5 threads for the full Test Dataset, which contains ~20,982 droplets of 13 multiplexed donors.

First, we need to identify the number of reads from each allele at each of the common SNP locations:

Please note that the \ at the end of each line is purely for readability to put a separate parameter argument on each line.

singularity exec Demuxafy.sif popscle_pileup.py \
--sam $BAM \
--vcf $VCF \
--group-list $BARCODES \
--tag-group $CELL_TAG \
--tag-UMI $UMI_TAG \
--out $FREEMUXLET_OUTDIR/pileup

If the pileup is successful, you will have these files in your $FREEMUXLET_OUTDIR:

/path/to/output/freemuxlet
├── pileup.cel.gz
├── pileup.plp.gz
├── pileup.umi.gz
└── pileup.var.gz

Additional details about outputs are available below in the Freemuxlet Results and Interpretation.
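
Because the pileup step can run for hours, a job that is killed partway may leave truncated files behind. The following is a minimal sketch, assuming the four output names listed above and a standard gzip installation, that tests each archive before the next step is started. The pileup_complete function name is hypothetical.

```shell
# Hypothetical helper: succeed only if all four pileup archives exist
# and pass a gzip integrity test.
pileup_complete() {
    prefix="$1"   # e.g. "$FREEMUXLET_OUTDIR/pileup"
    for suffix in cel plp umi var; do
        f="$prefix.$suffix.gz"
        # gzip -t fails on a missing, truncated or corrupt archive.
        if ! gzip -t "$f" 2>/dev/null; then
            echo "Incomplete pileup output: $f" >&2
            return 1
        fi
    done
}

# Usage (uncomment after the pileup has finished):
# pileup_complete "$FREEMUXLET_OUTDIR/pileup" && echo "pileup outputs look complete"
```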

Popscle Freemuxlet

⏱️ Expected Resource Usage

~9min using a total of 4 GB memory when using 1 thread for the full Test Dataset, which contains ~20,982 droplets of 13 multiplexed donors.

Once you have run popscle pileup, you can demultiplex your samples with Freemuxlet:

Please note that the \ at the end of each line is purely for readability to put a separate parameter argument on each line.

singularity exec Demuxafy.sif popscle freemuxlet \
        --plp $FREEMUXLET_OUTDIR/pileup \
        --out $FREEMUXLET_OUTDIR/freemuxlet \
        --tag-group $CELL_TAG \
        --tag-UMI $UMI_TAG \
        --group-list $BARCODES \
        --nsample $N

If freemuxlet is successful, you will have these new files in your $FREEMUXLET_OUTDIR:

/path/to/output/freemuxlet
├── freemuxlet.clust1.samples.gz
├── freemuxlet.clust1.vcf.gz
├── freemuxlet.lmix
├── pileup.cel.gz
├── pileup.plp.gz
├── pileup.umi.gz
└── pileup.var.gz

Additional details about outputs are available below in the Freemuxlet Results and Interpretation.

Freemuxlet Summary

We have provided a script that will summarize the number of droplets classified as doublets, ambiguous and assigned to each donor by Freemuxlet and write it to the $FREEMUXLET_OUTDIR. You can run this to get a fast and easy summary of your results by providing the result file of interest:

singularity exec Demuxafy.sif bash Freemuxlet_summary.sh $FREEMUXLET_OUTDIR/freemuxlet.clust1.samples.gz

which will return:

Classification    Assignment N
0                 1575
1                 1278
10                972
11                1477
12                1630
13                1446
2                 1101
3                 1150
4                 1356
5                 1540
6                 1110
7                 1313
8                 1383
9                 884
DBL               2767

or you can write it straight to a file:

singularity exec Demuxafy.sif bash Freemuxlet_summary.sh $FREEMUXLET_OUTDIR/freemuxlet.clust1.samples.gz > $FREEMUXLET_OUTDIR/freemuxlet_summary.tsv

Note

To check if these numbers are consistent with the expected doublet rate in your dataset, you can use our Doublet Estimation Calculator.
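
To compare against an expected doublet rate, you need the observed doublet fraction. The following is a minimal sketch that computes it from the summary file, assuming the file is tab-delimited with one header line followed by classification and count columns (the layout is an assumption based on the output shown above). The doublet_fraction function name is hypothetical.

```shell
# Hypothetical helper: print the fraction of droplets classified as DBL.
# $1: summary TSV with a header line, then "<classification><TAB><count>" rows.
doublet_fraction() {
    awk -F'\t' '
        NR > 1 { total += $2; if ($1 == "DBL") dbl += $2 }
        END    { if (total > 0) printf "%.4f\n", dbl / total }
    ' "$1"
}

# Usage (uncomment once the summary file has been written):
# doublet_fraction "$FREEMUXLET_OUTDIR/freemuxlet_summary.tsv"
```

For the example summary above, 2767 of the 20,982 droplets are doublets, which is about 13.2%.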

Correlating Cluster to Donor Reference SNP Genotypes (optional)

If you have reference SNP genotypes for some or all of the donors in your pool, you can identify which cluster is best correlated with each donor in your reference SNP genotypes. We have provided a script that does this and produces a heatmap of the correlations along with the predicted individual for each cluster. The script requires the reference SNP genotypes ($VCF), the cluster SNP genotypes ($FREEMUXLET_OUTDIR/freemuxlet.clust1.vcf.gz) and the output directory ($FREEMUXLET_OUTDIR). You can run this script with:

Note

In order to do this, your $VCF must be reference SNP genotypes for the individuals in the pool and cannot be a general vcf with common SNP genotype locations from 1000 Genomes or HRC.

Please note that the \ at the end of each line is purely for readability to put a separate parameter argument on each line.

singularity exec Demuxafy.sif Assign_Indiv_by_Geno.R \
        -r $VCF \
        -c $FREEMUXLET_OUTDIR/freemuxlet.clust1.vcf.gz \
        -o $FREEMUXLET_OUTDIR
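
Before running the command above, it can help to confirm that $VCF really contains per-donor genotype columns rather than only site-level information. The following is a minimal sketch, assuming an uncompressed VCF, that lists the sample IDs found after the nine fixed columns of the #CHROM header line. The vcf_samples function name is hypothetical.

```shell
# Hypothetical helper: print one sample ID per line from a plain-text VCF.
# Sample columns start at field 10 of the "#CHROM" header line.
vcf_samples() {
    awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1"
}

# Usage (uncomment once $VCF points at a reference genotype file):
# vcf_samples "$VCF"
# vcf_samples "$VCF" | wc -l   # number of donors with reference genotypes
```

If the function prints nothing, the file has no sample columns and cannot be used for this step. For compressed files, `bcftools query -l` gives the same list if bcftools is available on your system.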

To see the parameter help menu, type:

singularity exec Demuxafy.sif Assign_Indiv_by_Geno.R -h

Which will print:

usage: Assign_Indiv_by_Geno.R [-h] -r REFERENCE_VCF -c CLUSTER_VCF -o OUTDIR

optional arguments:
-h, --help            show this help message and exit
-r REFERENCE_VCF, --reference_vcf REFERENCE_VCF
                                                The output directory where results will be saved
-c CLUSTER_VCF, --cluster_vcf CLUSTER_VCF
                                                A QC, normalized seurat object with
                                                classifications/clusters as Idents().
-o OUTDIR, --outdir OUTDIR
                                                Number of genes to use in
                                                'Improved_Seurat_Pre_Process' function.

After correlating the reference SNP genotypes with the cluster SNP genotypes using either the script or manually, you should have three new files in your $FREEMUXLET_OUTDIR:

/path/to/output/freemuxlet
├── freemuxlet.clust1.samples.gz
├── freemuxlet.clust1.vcf.gz
├── freemuxlet.lmix
├── freemuxlet_summary.tsv
├── Genotype_ID_key.txt
├── pileup.cel.gz
├── pileup.plp.gz
├── pileup.umi.gz
├── pileup.var.gz
├── ref_clust_pearson_correlation.png
└── ref_clust_pearson_correlations.tsv

Freemuxlet Results and Interpretation

After running the Freemuxlet steps and summarizing the results, you will have a number of files from some of the intermediary steps. These are the files that most users will find the most informative:

  • freemuxlet.clust1.samples.gz

    • Metrics for each droplet including the singlet, doublet or ambiguous assignment (DROPLET.TYPE), final assignment (BEST.GUESS), log likelihood of the final assignment (BEST.LLK) and other QC metrics.

      INT_ID  BARCODE             NUM.SNPS  NUM.READS  DROPLET.TYPE  BEST.GUESS  BEST.LLK  NEXT.GUESS  NEXT.LLK  DIFF.LLK.BEST.NEXT  BEST.POSTERIOR  SNG.POSTERIOR  SNG.BEST.GUESS  SNG.BEST.LLK  SNG.NEXT.GUESS  SNG.NEXT.LLK  SNG.ONLY.POSTERIOR  DBL.BEST.GUESS  DBL.BEST.LLK  DIFF.LLK.SNG.DBL
      0       GTGAAGGTCCGCGTTT-1  600       1050       DBL           12,1        -1001.09  12,4        -1030.21  29.13               -0.00000        6.7e-16        12              -1037.90      1               -1135.80      1.00000             12,1            -1001.09      -36.81
      1       CGAGAAGTCCTCAACC-1  354       578        SNG           7,7         -560.30   13,7        -583.64   23.35               -0.00000        1              7               -560.30       13              -650.83       1.00000             13,7            -583.64       23.35
      2       CGCTTCATCGGTGTCG-1  1029      2847       DBL           9,3         -1651.22  9,6         -1777.52  126.31              0.00000         1.5e-65        9               -1802.35      3               -1838.25      1.00000             9,3             -1651.22      -151.13
      3       CAGCGACTCGTCGTTC-1  167       229        SNG           5,5         -261.97   6,5         -272.51   10.54               -0.00001        1              5               -261.97       6               -303.97       1.00000             6,5             -272.51       10.54
      4       CGTAGGCAGGCCGAAT-1  287       465        SNG           1,1         -451.79   4,1         -479.98   28.18               -0.00000        1              1               -451.79       10              -562.57       1.00000             4,1             -479.98       28.18
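
As a quick check on the samples file, you can tally the DROPLET.TYPE column. The following is a minimal sketch, assuming the file is tab-delimited with DROPLET.TYPE as the fifth column, as in the column order listed for this file. The tally_droplet_types function name is hypothetical.

```shell
# Hypothetical helper: count droplets per DROPLET.TYPE value.
# Reads the tab-delimited samples table (with header) on standard input.
tally_droplet_types() {
    awk -F'\t' '
        NR > 1 { n[$5]++ }                       # column 5 is DROPLET.TYPE
        END    { for (t in n) printf "%s\t%d\n", t, n[t] }
    ' | sort
}

# Usage (uncomment after Freemuxlet has finished):
# gzip -dc "$FREEMUXLET_OUTDIR/freemuxlet.clust1.samples.gz" | tally_droplet_types
```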

If you ran the Assign_Indiv_by_Geno.R script, you will also have the following files:

  • Genotype_ID_key.txt

    • Key matching each individual to its assigned cluster, with the Pearson correlation coefficient for that pair.

      Genotype_ID  Cluster_ID  Correlation
      113_113      CLUST4      0.7939599
      349_350      CLUST11     0.7954687
      352_353      CLUST12     0.7962697
      39_39        CLUST7      0.7927807
      40_40        CLUST6      0.7833879
      41_41        CLUST3      0.7877763
      42_42        CLUST13     0.7915233
      43_43        CLUST0      0.8008066
      465_466      CLUST2      0.7849719
      596_597      CLUST1      0.7883125
      597_598      CLUST5      0.7996224
      632_633      CLUST9      0.7904012
      633_634      CLUST10     0.7834359
      660_661      CLUST8      0.7914850

  • ref_clust_pearson_correlation.png

    • Figure of the Pearson correlation coefficients for each cluster-individual pair.

      _images/OneK1K_scRNA_Sample54_freemuxlet_pearson_correlation.png
  • ref_clust_pearson_correlations.tsv

    • All of the Pearson correlation coefficients between the clusters and the individuals.

      Cluster  113_113             349_350             352_353             39_39               40_40
      0        0.6710138155015287  0.6670772417845169  0.6662437546886375  0.659705934873083   0.661561196478371
      1        0.6768324504112175  0.6698041245221165  0.6753365794834155  0.6746102593436571  0.670220232713515
      2        0.680371000427      0.6756606413629137  0.6764869329887958  0.6742600575280224  0.6712474637813011
      3        0.678245260602395   0.6729013367875729  0.6773636626488672  0.6719793480269676  0.6672767277830997
      4        0.7939598604862043  0.6714745697877756  0.6713909926031749  0.673064058187681   0.6702690169292862
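
With both the samples file and Genotype_ID_key.txt available, each singlet can be labelled with the individual matched to its cluster. The following is a minimal sketch, assuming both files are tab-delimited with the header rows described in this section, that BARCODE, DROPLET.TYPE and SNG.BEST.GUESS are columns 2, 5 and 13 of the samples file, and that the numeric part of Cluster_ID (for example CLUST4) corresponds to the cluster number in SNG.BEST.GUESS. The assign_donors function name and the "doublet" and "unassigned" labels are hypothetical.

```shell
# Hypothetical helper: print "<barcode><TAB><droplet type><TAB><donor>" per droplet.
# $1: Genotype_ID_key.txt; $2: uncompressed copy of freemuxlet.clust1.samples.gz
assign_donors() {
    awk -F'\t' -v OFS='\t' '
        # First file: map the numeric cluster ("CLUST4" -> "4") to its Genotype_ID.
        NR == FNR { if (FNR > 1) { sub(/^CLUST/, "", $2); donor[$2] = $1 } next }
        # Second file: skip the header, then label each droplet.
        FNR == 1  { next }
        {
            label = "unassigned"
            if ($5 == "DBL") label = "doublet"
            else if ($5 == "SNG" && ($13 in donor)) label = donor[$13]
            print $2, $5, label
        }
    ' "$1" "$2"
}

# Usage (uncomment once both files exist):
# gzip -dc "$FREEMUXLET_OUTDIR/freemuxlet.clust1.samples.gz" > "$FREEMUXLET_OUTDIR/samples.tsv"
# assign_donors "$FREEMUXLET_OUTDIR/Genotype_ID_key.txt" "$FREEMUXLET_OUTDIR/samples.tsv"
```

Singlets whose cluster has no entry in the key, and any ambiguous droplets, are reported as "unassigned" rather than being dropped, so the number of output rows matches the number of droplets.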

Merging Results with Other Software Results

We have provided a script that will help merge and summarize the results from multiple software tools. See Combine Results.

Citation

If you used the Demuxafy platform for analysis, please reference our publication as well as Freemuxlet.