ScSplit

ScSplit is reference-free demultiplexing software. If you have reference SNP genotypes, it is better to use demultiplexing software that can take advantage of them (Demuxlet, Souporcell or Vireo).

Data

This is the data that you will need to have prepared to run ScSplit:

Required

  • Bam file ($BAM)

    • Aligned single cell reads

  • Genome reference fasta file ($FASTA)

  • Barcode file ($BARCODES)

  • Common SNP genotypes vcf ($VCF)

    • While not exactly required, using common SNP genotype locations enhances accuracy

      • If you have reference SNP genotypes for individuals in your pool, you can use those

      • If you do not have reference SNP genotypes, they can be from any large population resource (e.g. 1000 Genomes or HRC)

    • Filter for common SNPs (> 5% minor allele frequency) and SNPs overlapping genes

  • Number of samples in pool ($N)

  • Output directory ($SCSPLIT_OUTDIR)

Optional

  • The SAM tag used in the Bam file to annotate the aligned single cell reads with their corresponding cell barcode ($CELL_TAG)

    • If not specified, ScSplit defaults to using CB.

Run ScSplit

First, let’s assign the variables that will be used to execute each step.

Example Variable Settings

Below is an example of the variables that can be set up for the commands in this section. These files are provided as a test dataset, available in the Data Preparation Documentation. Please replace the paths with the full paths to the data on your system.

BARCODES=/path/to/TestData4PipelineFull/test_dataset/outs/filtered_gene_bc_matrices/Homo_sapiens_GRCh38p10/barcodes.tsv
BAM=/path/to/test_dataset/possorted_genome_bam.bam
SCSPLIT_OUTDIR=/path/to/output/scscplit
N=13
VCF=/path/to/TestData4PipelineFull/test_dataset.vcf ### optional
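
Because the later steps run for hours, it can help to confirm that the variables point at real files before starting. The sketch below is not part of Demuxafy: `check_scsplit_inputs` and `is_positive_int` are hypothetical helper names used for illustration only, and the checks are deliberately minimal (file exists and is non-empty, $N is a positive integer).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with Demuxafy) for a quick pre-flight check.

# Return non-zero if any of the given files is missing or empty.
check_scsplit_inputs() {
    local status=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty file: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Return zero only if the argument is an integer greater than 0.
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}
```

One possible use is `check_scsplit_inputs "$BAM" "$BARCODES" "$FASTA" "$VCF" && is_positive_int "$N" && mkdir -p "$SCSPLIT_OUTDIR"`. Note that $FASTA is needed by the SNV calling step, so it should be set alongside the variables above.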

Prepare Bam file

⏱️ Expected Resource Usage

~7h using a total of 6.5Gb memory when using 8 threads for the full Test Dataset, which contains ~20,982 droplets of 13 multiplexed donors.

First, let’s check to make sure that the bam file and vcf file use the same reference chromosome encoding (i.e. UCSC = hg38 = chr1, chr2, chr3… vs ENSEMBL/NCBI = GRCh38 = 1, 2, 3…).

singularity exec Demuxafy.sif compare_vcf_bam_genome.sh $BAM $VCF

If you receive an error, you will have to standardise the genome encoding so the files match before continuing.
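
If the encodings do differ, one way to standardise them is to rewrite the contig names in the vcf. The following is a minimal sketch under stated assumptions, not the method used by compare_vcf_bam_genome.sh: `vcf_chrom_style` and `vcf_add_chr_prefix` are hypothetical helpers, the vcf is assumed to be uncompressed, and special cases such as MT versus chrM are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting and converting contig naming in a plain-text VCF.

# Print "chr", "no-chr", "mixed" or "empty" depending on the CHROM column.
vcf_chrom_style() {
    awk '
        /^#/        { next }          # skip header lines
        $1 ~ /^chr/ { pre++; next }   # UCSC-style contig name
                    { bare++ }        # ENSEMBL/NCBI-style contig name
        END {
            if (pre && bare) print "mixed"
            else if (pre)    print "chr"
            else if (bare)   print "no-chr"
            else             print "empty"
        }
    ' "$1"
}

# Add a "chr" prefix to contig header lines and to the CHROM column.
# Only sensible when vcf_chrom_style reports "no-chr".
vcf_add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^##contig=<ID=/ { sub(/^##contig=<ID=/, "##contig=<ID=chr"); print; next }
         /^#/             { print; next }
                          { $1 = "chr" $1; print }' "$1"
}
```

For example, `vcf_chrom_style "$VCF"` shows which style the vcf uses, and `vcf_add_chr_prefix "$VCF" > converted.vcf` writes a prefixed copy (the output name here is only a placeholder).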

Next, you will need to prepare the bam file so that it only contains high-quality, primary mapped reads without any PCR duplicates.

singularity exec Demuxafy.sif samtools view -b -S -q 10 -F 3844 $BAM > $SCSPLIT_OUTDIR/filtered_bam.bam
singularity exec Demuxafy.sif samtools rmdup $SCSPLIT_OUTDIR/filtered_bam.bam $SCSPLIT_OUTDIR/filtered_bam_dedup.bam
singularity exec Demuxafy.sif samtools sort -o $SCSPLIT_OUTDIR/filtered_bam_dedup_sorted.bam $SCSPLIT_OUTDIR/filtered_bam_dedup.bam
singularity exec Demuxafy.sif samtools index $SCSPLIT_OUTDIR/filtered_bam_dedup_sorted.bam
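
The value passed to -F above is a bitmask: 3844 = 4 + 256 + 512 + 1024 + 2048, so reads carrying any of those SAM flag bits are excluded. As an illustration, the hypothetical helper below (`decode_sam_flags` is not a samtools or Demuxafy command) lists the bits contained in a mask, using the flag meanings from the SAM specification.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the SAM flag bits set in a combined mask.
decode_sam_flags() {
    local mask=$1 bit
    for bit in 1 2 4 8 16 32 64 128 256 512 1024 2048; do
        if [ $(( mask & bit )) -ne 0 ]; then
            case "$bit" in
                4)    echo "4 read unmapped" ;;
                256)  echo "256 not primary alignment" ;;
                512)  echo "512 read fails platform/vendor quality checks" ;;
                1024) echo "1024 PCR or optical duplicate" ;;
                2048) echo "2048 supplementary alignment" ;;
                *)    echo "$bit other" ;;   # bits not described in this sketch
            esac
        fi
    done
}
```

Running `decode_sam_flags 3844` prints the five exclusion categories, one per line, which matches the description of keeping only mapped, primary, quality-passing, non-duplicate alignments.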

After running these bam preparation steps, you will have the following files in your $SCSPLIT_OUTDIR:

/path/to/output/scscplit
├── filtered_bam.bam
├── filtered_bam_dedup.bam
├── filtered_bam_dedup_sorted.bam
└── filtered_bam_dedup_sorted.bam.bai

Call Sample SNVs

⏱️ Expected Resource Usage

~7h using a total of 6.5Gb memory when using 8 threads for the full Test Dataset, which contains ~20,982 droplets of 13 multiplexed donors.

Next, you will need to identify SNV genotypes in the pooled bam.

singularity exec Demuxafy.sif freebayes -f $FASTA -iXu -C 2 -q 1 $SCSPLIT_OUTDIR/filtered_bam_dedup_sorted.bam > $SCSPLIT_OUTDIR/freebayes_var.vcf
singularity exec Demuxafy.sif vcftools --vcf $SCSPLIT_OUTDIR/freebayes_var.vcf --minQ 30 --recode --recode-INFO-all --out $SCSPLIT_OUTDIR/freebayes_var_qual30

After running these SNV calling steps, you will have the following new files in your $SCSPLIT_OUTDIR:

/path/to/output/scscplit
├── filtered_bam.bam
├── filtered_bam_dedup.bam
├── filtered_bam_dedup_sorted.bam
├── filtered_bam_dedup_sorted.bam.bai
├── freebayes_var_qual30.log
├── freebayes_var_qual30.recode.vcf
└── freebayes_var.vcf
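
To see how many variants survived the quality filter, you can compare the record counts of the two vcf files. This is only a sketch: `count_vcf_records` and `report_filter_yield` are hypothetical helper names, and both files are assumed to be uncompressed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing VCF record counts before and after filtering.

# Count non-header, non-blank lines in a plain-text VCF.
count_vcf_records() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Print the record counts of two VCFs and the percentage retained in the second.
report_filter_yield() {
    local before after
    before=$(count_vcf_records "$1")
    after=$(count_vcf_records "$2")
    awk -v b="$before" -v a="$after" 'BEGIN {
        pct = (b > 0) ? 100 * a / b : 0
        printf "before=%d after=%d retained=%.1f%%\n", b, a, pct
    }'
}
```

For example: `report_filter_yield $SCSPLIT_OUTDIR/freebayes_var.vcf $SCSPLIT_OUTDIR/freebayes_var_qual30.recode.vcf`.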

Demultiplex with ScSplit

⏱️ Expected Resource Usage

~1h using a total of 32Gb memory when using 4 threads for the full Test Dataset, which contains ~20,982 droplets of 13 multiplexed donors.

The prepared SNV genotypes and bam file can then be used to demultiplex and call genotypes in each cluster.

singularity exec Demuxafy.sif scSplit count -c $VCF -v $SCSPLIT_OUTDIR/freebayes_var_qual30.recode.vcf -i $SCSPLIT_OUTDIR/filtered_bam_dedup_sorted.bam -b $BARCODES ${CELL_TAG:+-t $CELL_TAG} -r $SCSPLIT_OUTDIR/ref_filtered.csv -a $SCSPLIT_OUTDIR/alt_filtered.csv -o $SCSPLIT_OUTDIR
singularity exec Demuxafy.sif scSplit run -r $SCSPLIT_OUTDIR/ref_filtered.csv -a $SCSPLIT_OUTDIR/alt_filtered.csv -n $N -o $SCSPLIT_OUTDIR
singularity exec Demuxafy.sif scSplit genotype -r $SCSPLIT_OUTDIR/ref_filtered.csv -a $SCSPLIT_OUTDIR/alt_filtered.csv -p $SCSPLIT_OUTDIR/scSplit_P_s_c.csv -o $SCSPLIT_OUTDIR
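
The `${CELL_TAG:+-t $CELL_TAG}` expression in the count command is standard shell parameter expansion: it expands to `-t <tag>` only when $CELL_TAG is set and non-empty, and to nothing otherwise, so the optional variable can be left unset. The small demonstration below uses a hypothetical function, `show_tag_args`, that prints how many arguments the expansion produces.

```shell
#!/usr/bin/env bash
# Hypothetical demonstration of the ${VAR:+word} expansion used for the optional cell tag.
show_tag_args() {
    local CELL_TAG=${1-}
    # Same expansion as in the scSplit count command above.
    set -- ${CELL_TAG:+-t $CELL_TAG}
    printf '%d' "$#"                  # number of arguments produced
    if [ "$#" -gt 0 ]; then
        printf ' %s' "$@"             # the arguments themselves
    fi
    printf '\n'
}

show_tag_args CB    # prints: 2 -t CB
show_tag_args       # prints: 0
```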

After running these demultiplexing steps, you will have the following new results:

/path/to/output/scscplit
├── alt_filtered.csv
├── filtered_bam.bam
├── filtered_bam_dedup.bam
├── filtered_bam_dedup_sorted.bam
├── filtered_bam_dedup_sorted.bam.bai
├── freebayes_var_qual30.log
├── freebayes_var_qual30.recode.vcf
├── freebayes_var.vcf
├── ref_filtered.csv
├── scSplit_dist_matrix.csv
├── scSplit_dist_variants.txt
├── scSplit.log
├── scSplit_PA_matrix.csv
├── scSplit_P_s_c.csv
├── scSplit_result.csv
└── scSplit.vcf

Additional details about outputs are available below in the ScSplit Results and Interpretation.

ScSplit Summary

We have provided a script that will provide a summary of the number of droplets classified as doublets, ambiguous and assigned to each cluster by ScSplit. You can run this to get a fast and easy summary of your results. Just pass the ScSplit result file:

singularity exec Demuxafy.sif bash scSplit_summary.sh $SCSPLIT_OUTDIR/scSplit_result.csv

which will return the following summary:

Classification Assignment    N
DBL                          1055
SNG-0                        1116
SNG-10                       1654
SNG-11                       1207
SNG-12                       1564
SNG-13                       1428
SNG-14                       1640
SNG-2                        514
SNG-3                        1314
SNG-4                        1587
SNG-5                        1774
SNG-6                        1484
SNG-7                        1662
SNG-8                        1578
SNG-9                        1282

You can save the summary to a file by redirecting the output to the desired location:

singularity exec Demuxafy.sif bash scSplit_summary.sh $SCSPLIT_OUTDIR/scSplit_result.csv > $SCSPLIT_OUTDIR/scSplit_summary.tsv
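
If you only need the overall doublet rate, it can also be computed directly from the result file. The sketch below is not the provided summary script: `doublet_percent` is a hypothetical helper, and it assumes the result file is tab-separated with a header row and that doublet labels in the second column start with DBL. Adjust the -F separator if your file differs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of droplets labelled as doublets.
# Assumes a tab-separated file with a header row and labels in column 2.
doublet_percent() {
    awk -F '\t' '
        NR > 1 {
            total++
            if ($2 ~ /^DBL/) dbl++
        }
        END {
            if (total) {
                printf "%.2f\n", 100 * dbl / total
            } else {
                print "NA"            # no droplets found
            }
        }
    ' "$1"
}
```

For example: `doublet_percent $SCSPLIT_OUTDIR/scSplit_result.csv`.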

Note

To check if these numbers are consistent with the expected doublet rate in your dataset, you can use our Doublet Estimation Calculator.

Correlating Cluster to Donor Reference SNP Genotypes (optional)

If you have reference SNP genotypes for some or all of the donors in your pool, you can identify which cluster is best correlated with each donor in your reference SNP genotypes. We have provided a script that will do this and provide a heatmap correlation figure and the predicted individual that should be assigned for each cluster. To run it, provide the reference SNP genotypes ($VCF), the cluster SNP genotypes ($SCSPLIT_OUTDIR/scSplit.vcf) and the output directory ($SCSPLIT_OUTDIR), as in the command below.

Note

In order to do this, your $VCF must be reference SNP genotypes for the individuals in the pool and cannot be a general vcf with common SNP genotype locations from 1000 Genomes or HRC.

singularity exec Demuxafy.sif Assign_Indiv_by_Geno.R -r $VCF -c $SCSPLIT_OUTDIR/scSplit.vcf -o $SCSPLIT_OUTDIR

To see the parameter help menu, type:

singularity exec Demuxafy.sif Assign_Indiv_by_Geno.R -h

Which will print:

usage: Assign_Indiv_by_Geno.R [-h] -r REFERENCE_VCF -c CLUSTER_VCF -o OUTDIR

optional arguments:
-h, --help            show this help message and exit
-r REFERENCE_VCF, --reference_vcf REFERENCE_VCF
                                                The output directory where results will be saved
-c CLUSTER_VCF, --cluster_vcf CLUSTER_VCF
                                                A QC, normalized seurat object with
                                                classifications/clusters as Idents().
-o OUTDIR, --outdir OUTDIR
                                                Number of genes to use in
                                                'Improved_Seurat_Pre_Process' function.

ScSplit Results and Interpretation

After running the ScSplit steps and summarizing the results, you will have a number of files from some of the intermediary steps. These are the files that most users will find the most informative:

  • scSplit_result.csv

    • The droplet assignment results. The first column is the droplet barcode and the second column is the droplet type and cluster assignment separated by a dash. For example, SNG-9 indicates a singlet assigned to cluster 9.

      Barcode               Cluster
      AAACCTGTCCGAATGT-1    SNG-0
      AAACGGGAGTTGAGAT-1    SNG-0
      AAACGGGCATGTCTCC-1    SNG-0
      AAACGGGTCCACGAAT-1    SNG-0
      AAACGGGTCCAGTAGT-1    SNG-0
      AAACGGGTCGGCTTGG-1    SNG-0
      AAAGATGTCCGAACGC-1    SNG-0
      AAAGATGTCCGTCAAA-1    SNG-0
      AAAGTAGCATCACGTA-1    SNG-0

If you ran the Assign_Indiv_by_Geno.R script, you will also have the following files:

  • Genotype_ID_key.txt

    • Key that maps each individual to its assigned cluster, with the Pearson correlation coefficient for that pair.

      Genotype_ID    Cluster_ID    Correlation
      113_113        12            0.6448151
      349_350        14            0.6663323
      352_353        7             0.6596409
      39_39          6             0.6398297
      40_40          9             0.6191905
      41_41          3             0.6324396
      42_42          4             0.6560180
      43_43          5             0.6672336
      465_466        11            0.6297396
      596_597        13            0.6273717
      597_598        10            0.6627428
      632_633        1             0.5899685
      633_634        0             0.6157936
      660_661        8             0.6423770

  • ref_clust_pearson_correlation.png

    • Figure of the Pearson correlation coefficients for each cluster-individual pair.

      [Figure: _images/OneK1K_scRNA_Sample54_scSplit_pearson_correlation.png]

  • ref_clust_pearson_correlations.tsv

    • All of the Pearson correlation coefficients between the clusters and the individuals

      Cluster    113_113                349_350                352_353                39_39                  40_40
      0          0.18419103983986865    0.18328230320693129    0.19176272973032255    0.15376916805897994    0.19107524908934623
      1          0.19853015287744033    0.1981622074955004     0.19245840283478327    0.17855748333388533    0.19455433395443292
      2          0.17993959098414505    0.15477058833898663    0.26412833664924995    0.17360648445022142    0.16374615160876657
      3          0.2128616996153357     0.19325148148095284    0.21728991668088174    0.19346574998787222    0.17921651379211084
      4          0.17573820413419833    0.17629504087312717    0.16426156659465307    0.17427996983606964    0.18322785415879167
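
Once you have the cluster-to-individual key, you may want to label each droplet with its donor rather than its cluster number. The following is a minimal sketch of that join, not a Demuxafy script: `relabel_clusters` is a hypothetical helper, and it assumes both Genotype_ID_key.txt and the droplet assignment file are tab-separated with a header row and the column order shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a Donor column to the droplet assignments.
# $1 = key file (Genotype_ID, Cluster_ID, Correlation), $2 = assignments (Barcode, Cluster).
relabel_clusters() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR {                      # first file: build cluster -> donor map
            if (FNR > 1) donor[$2] = $1
            next
        }
        FNR == 1 { print $0, "Donor"; next }   # extend the header row
        {
            cluster = $2
            sub(/^[A-Z]+-/, "", cluster)       # "SNG-9" -> "9"
            if ($2 ~ /^SNG-/ && (cluster in donor)) print $0, donor[cluster]
            else if ($2 ~ /^DBL/)                   print $0, "doublet"
            else                                    print $0, "unassigned"
        }
    ' "$1" "$2"
}
```

For example: `relabel_clusters $SCSPLIT_OUTDIR/Genotype_ID_key.txt $SCSPLIT_OUTDIR/scSplit_result.csv`. Singlet clusters with no matching individual in the key are reported as unassigned.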

Merging Results with Other Software Results

We have provided a script that will help merge and summarize the results from multiple software tools. See Combine Results.

Citation

If you used the Demuxafy platform for analysis, please reference our publication as well as ScSplit.