Data Preparation¶
Little data preparation is needed before running the demultiplexing or doublet detecting software.
Data Required¶
The demultiplexing and transcriptome-based doublet detecting software have different data input requirements:
Software Group | Single Cell Count Data Required | SNP Genotype Data Required |
---|---|---|
Demultiplexing | ✔️ | ✔️ |
Doublet Detecting | ✔️ | ✖️ |
You won’t need to pre-process the single cell count data unless you are using DoubletFinder or DoubletDecon, which need QC-filtered and normalized counts (for example with Seurat).
For the demultiplexing software, you should filter the SNP genotypes that you will use.
SNP Genotype Data¶
Note
The SNP genotype data can be for multiplexed donors in the pool OR it can be publicly available common SNP genotypes.
We provide instructions on how to access or prepare these data here.
Publicly Available SNP Genotype Data¶
We provide common SNP vcf files generated on both GRCh37 and GRCh38, with a variety of region filtering and ‘chr’ encoding options. These can be downloaded using the following links:
Minor Allele Frequency | Genome | Region Filtering | Chr Encoding | vcf File | md5sum File |
---|---|---|---|---|---|
1% | GRCh37 | Genes | No ‘chr’ | | |
1% | GRCh37 | Genes | ‘chr’ encoding | | |
1% | GRCh37 | Exons | No ‘chr’ | | |
1% | GRCh37 | Exons | ‘chr’ encoding | | |
1% | GRCh38 | Genes | No ‘chr’ | | |
1% | GRCh38 | Genes | ‘chr’ encoding | | |
1% | GRCh38 | Exons | No ‘chr’ | | |
1% | GRCh38 | Exons | ‘chr’ encoding | | |
Of course, you can further filter these to a 5% minor allele frequency if you would prefer.
You can also download SNP genotype data and process it yourself from 1000G (hg19 and hg38) or HRC (hg19 only).
For 1000G, use the instructions at the above link to access the data per your preferences and you can find the required files at the following directories:
The hg19 data is available at
/ftp/release/
The hg38 data is available at
/ftp/release/20130502/supporting/GRCh38_positions/
Preparing your own SNP Genotype Data¶
It is best to filter the SNP genotypes for common SNPs (generally > 1% or > 5% minor allele frequency) that overlap exons. Here we provide an example of how to do this filtering. We built the required software into the Singularity image, so you can run these filtering steps with the image.
Note
We have found it best to impute reference SNP genotypes so there are more SNP locations available. If you are using reference SNP genotypes for the donors in your pool, please be sure to impute before filtering.
Filter for Common SNPs¶
First, filter the SNP genotypes for common SNPs - 5% minor allele frequency should work for most datasets but you can change this to another minor allele frequency if you would like.
singularity exec Demuxafy.sif bcftools filter --include 'MAF>=0.05' -Oz --output $OUTDIR/common_maf0.05.vcf.gz $VCF
Where $OUTDIR is the output directory where you want to save the results and $VCF is the path to the SNP genotype vcf file.
Filter for SNPs overlapping Exons¶
Next, filter for the SNPs that overlap exons.
Note
You can get an exon bed using the UCSC table browser (see instructions here) and we have also provided bed files for hg19 and hg38.
singularity exec Demuxafy.sif vcftools \
--gzvcf $OUTDIR/common_maf0.05.vcf.gz \
--max-alleles 2 \
--remove-indels \
--bed $BED \
--recode \
--recode-INFO-all \
--out $OUTDIR/common_maf0.05_exon_filtered
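In the command above, $BED is the path to the exon bed file. To confirm that each filtering step kept a sensible number of SNPs, you may want to count the records in each vcf. The following is a minimal sketch, not part of Demuxafy: `count_vcf_records` is a hypothetical helper name, and it assumes only standard `gzip` and `grep` are available and that the vcf has no blank lines.

```shell
# Hypothetical helper (not part of Demuxafy): count the variant records in a
# vcf by skipping header lines, which all start with '#'.
count_vcf_records() {
    case "$1" in
        # bgzip output is gzip-compatible, so gzip can decompress it
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | grep -vc '^#'
}
```

For example, `count_vcf_records "$OUTDIR/common_maf0.05_exon_filtered.recode.vcf"` prints the number of SNPs left after both filters (vcftools appends `.recode.vcf` to the `--out` prefix). Running it on $VCF and on $OUTDIR/common_maf0.05.vcf.gz as well shows how many SNPs each step removed.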
Test Dataset¶
In addition, we have provided a test dataset that you can use. The data available for download are listed below, along with some information about them. The data have been aligned to GRCh38. If you are running souporcell, you will need a fasta file, which you can download from the UCSC FTP.
Information
The test dataset includes 20,982 captured droplets of PBMCs from 13 multiplexed individuals.
10x Directories + Other Necessary Files¶
The complete dataset is large (~40Gb tar.gz directory), so we have also provided a version of the same dataset that has been significantly reduced in size.
Warning
The reduced test dataset may not produce real-world results due to its small size, especially for doublet detecting software, since the reads have been significantly downsampled.
You can download the dataset with one of the following commands:
First, download the complete dataset and the md5sum:
wget https://www.dropbox.com/s/3oujqq98y400rzz/TestData4PipelineFull.tar.gz
wget https://www.dropbox.com/s/5n7u723okkf5m3l/TestData4PipelineFull.tar.gz.md5
After downloading the tar.gz directory, it is best to make sure the md5sum of the TestData4PipelineFull.tar.gz file matches the md5sum in the TestData4PipelineFull.tar.gz.md5:
md5sum TestData4PipelineFull.tar.gz > downloaded_TestData4PipelineFull.tar.gz.md5
diff -s TestData4PipelineFull.tar.gz.md5 downloaded_TestData4PipelineFull.tar.gz.md5
That should return the following statement indicating that the two md5sums are identical:
Files TestData4PipelineFull.tar.gz.md5 and downloaded_TestData4PipelineFull.tar.gz.md5 are identical
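Note that `diff` compares the whole line, including the file name that `md5sum` writes after the hash. If the name recorded in the provided .md5 file differs from your local file name (an assumption, since we have not inspected the provided files), the files will be reported as different even when the hashes agree. A minimal sketch that compares only the hashes, using a hypothetical helper called `verify_md5`:

```shell
# Hypothetical helper (not part of Demuxafy): compare only the hash fields,
# ignoring whatever file name is recorded in the .md5 file.
verify_md5() {
    local file="$1" md5_file="$2"
    local expected actual
    # The hash is the first whitespace-separated field of the first line
    expected=$(awk '{print $1; exit}' "$md5_file")
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$expected" = "$actual" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}
```

For example, `verify_md5 TestData4PipelineFull.tar.gz TestData4PipelineFull.tar.gz.md5` prints an `OK:` line when the download is intact and returns a non-zero exit status otherwise.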
Finally, you can access the data by unzipping the file:
tar -xvf TestData4PipelineFull.tar.gz
This should unzip the TestData4PipelineFull directory where you will have the following file structure:
TestData4PipelineFull
├── donor_list.txt
├── samplesheet.txt
├── test_dataset
│   ├── outs
│   │   └── filtered_gene_bc_matrices
│   │       └── Homo_sapiens_GRCh38p10
│   │           ├── barcodes.tsv
│   │           ├── genes.tsv
│   │           └── matrix.mtx
│   ├── possorted_genome_bam.bam
│   └── possorted_genome_bam.bam.bai
└── test_dataset.vcf
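If you would like to confirm that the extraction completed, one option is to check that each file in the structure above exists. This is only a sketch: `check_full_test_data` is a hypothetical helper, and the paths are copied from the listing above.

```shell
# Hypothetical helper (not part of Demuxafy): report any file from the
# expected TestData4PipelineFull layout that is missing after extraction.
check_full_test_data() {
    local dir="${1:-TestData4PipelineFull}"
    local matrix_dir="test_dataset/outs/filtered_gene_bc_matrices/Homo_sapiens_GRCh38p10"
    local missing=0
    local f
    for f in \
        donor_list.txt \
        samplesheet.txt \
        test_dataset.vcf \
        test_dataset/possorted_genome_bam.bam \
        test_dataset/possorted_genome_bam.bam.bai \
        "$matrix_dir/barcodes.tsv" \
        "$matrix_dir/genes.tsv" \
        "$matrix_dir/matrix.mtx"
    do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_full_test_data` from the directory where you ran `tar` returns success when every listed file is present and prints the missing paths otherwise.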
First, download the reduced dataset and the md5sum:
wget https://www.dropbox.com/s/m8u61jn4i1mcktp/TestData4PipelineSmall.tar.gz
wget https://www.dropbox.com/s/ykjg86q3xw39wqr/TestData4PipelineSmall.tar.gz.md5
After downloading the tar.gz directory, it is best to make sure the md5sum of the TestData4PipelineSmall.tar.gz file matches the md5sum in the TestData4PipelineSmall.tar.gz.md5:
md5sum TestData4PipelineSmall.tar.gz > downloaded_TestData4PipelineSmall.tar.gz.md5
diff -s TestData4PipelineSmall.tar.gz.md5 downloaded_TestData4PipelineSmall.tar.gz.md5
That should return the following statement indicating that the two md5sums are identical:
Files TestData4PipelineSmall.tar.gz.md5 and downloaded_TestData4PipelineSmall.tar.gz.md5 are identical
Finally, you can access the data by unzipping the file:
tar -xvf TestData4PipelineSmall.tar.gz
This should unzip the TestData4PipelineSmall directory where you will have the following file structure:
TestData4PipelineSmall
├── donor_list.txt
├── individuals_list_dir
│   └── test_dataset.txt
├── samplesheet.txt
├── test_dataset
│   └── outs
│       ├── filtered_gene_bc_matrices
│       │   └── Homo_sapiens_GRCh38p10
│       │       ├── barcodes.tsv
│       │       ├── genes.tsv
│       │       └── matrix.mtx
│       ├── pooled.sorted.bam
│       └── pooled.sorted.bam.bai
└── test_dataset.vcf
Seurat Object¶
We have also provided a QC-filtered, normalized Seurat object (needed for DoubletFinder and DoubletDecon).
Download the rds object and the md5sum:
wget https://www.dropbox.com/s/po4gy2j3eqohhjv/TestData_Seurat.rds
wget https://www.dropbox.com/s/rmix7tt9aw28n7i/TestData_Seurat.rds.md5
After downloading the rds object, it is best to make sure the md5sum of the TestData_Seurat.rds file matches the md5sum in the TestData_Seurat.rds.md5:
md5sum TestData_Seurat.rds > downloaded_TestData_Seurat.rds.md5
diff -s TestData_Seurat.rds.md5 downloaded_TestData_Seurat.rds.md5
That should return the following statement indicating that the two md5sums are identical:
Files TestData_Seurat.rds.md5 and downloaded_TestData_Seurat.rds.md5 are identical
The TestData_Seurat.rds can then be used directly as input for the DoubletFinder and DoubletDecon tutorials.
You can also load the TestData_Seurat.rds into R to see the Seurat object by first opening R:
singularity exec Demuxafy.sif R
Then read it in with:
seurat <- readRDS("TestData_Seurat.rds")
Note
We have used this dataset for each of the tutorials. The example tables in the Results and Interpretation sections of each tutorial are the results from this dataset.