NB: Document under active development!
The ‘pbmc8k’ dataset contains around 8000 PBMCs (peripheral blood mononuclear cell) from a healthy individual. It is a publicly available dataset provided by 10X genomics: (https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc8k)
First, need the aligned bam files. The output of the 10X cellRanger pipeline will work with default parameters.
To obtain the data files for this analysis.
# Bam files and index. NB: large files.
wget http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_possorted_genome_bam.bam
wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_possorted_genome_bam_index.bam.bai
# Gzipped tar file with the precomputed tsne and clusters information
wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_analysis.tar.gz
tar -xzf pbmc8k_analysis.tar.gz
This command will produces the input required for polyApiper. It takes some time (hours) to run.
./polyApipe.py -i pbmc8k_possorted_genome_bam.bam -o pbmc8k -t 8 --depth_threshold 10
There is only one sample in this experiment, so specifing the bam file with -i. And -t means that multithreaded tools will use 8 threads.
Once finished, the following files are output:
pbmc8k_polyA_peaks.gff : gff file defining the polyA site regions, defined from peaks of polyA reads in the input. The sites are typically 250bp (–region_size) upstream of the polyA site (unless larger regions were merged). The misprime status (depending on –misprime_A_count and –misprime_in) flags sites near A-rich genomic regions. E.g.
1 polyAends polyAends 16442 16691 . - . peak="1_16442_r"; peakdepth="10"; misprime="False";
1 polyAends polyAends 184920 185169 . - . peak="1_184920_r"; peakdepth="179"; misprime="False";
1 polyAends polyAends 632454 632703 . + . peak="1_632703_f"; peakdepth="3468"; misprime="True";pbmc8k_counts.tab.gz : A gzipped tab-delimited text file of the number of reads aligned per cell in each polyA site peak. All reads are counted, whether or not they include a polyA. Format <peakname> <cell> <number of reads> .
10_44994717_f AACTGGTAGTACGCCC 1
10_44994717_f AACTGGTCAGTCCTTC 2
10_44994717_f AACTTTCCAGTCAGAG 1library(polyApiper)
# - Get appropriate ENSEMBL annotations and DNA sequence from AnnotationHub
# - Classify genomic regions into exon,intron,3'UTR,extension
polyApiper:::do_ensembl_organism(
out_path="human_ens94",
species="Homo sapiens",
version="94")
# - Load output from polyApipe.py
# - Produce HDF5Array SingleCellExperiment objects containing counts
# - Perform further analysis steps
polyApiper:::do_pipeline(
out_path ="pbmc8k_banquet",
counts_files ="pbmc8k_counts.tab.gz",
peak_info_file="pbmc8k_polyA_peaks.gff",
organism ="human_ens94",
)
# To [re]-load for analysis.
banq <- load_banquet("pbmc8k_banquet")
Now the pbmc8k_banquet directory has been setup - with APA information ready for downstream analysis.
If a given experiment has multiple samples in separate bam files, these can be put in a directory and processed together. Just proveide the directory to the -i paramter of polyApipe.py.
./polyApipe.py -i multi_bam_experiment_dir -o this_experiment_polyA -t 8
When multiple files are processed, the output will be a directory, with individual *_polyA.bam and *counts.tab.gz files for each sample. There will still be a single output_polyA_peaks.gff file, made considering all polyA reads from all samples pooled together. All samples are counted against this same set of regions.
These can then be loaded together into into R with polyApipeR.
When processing multiple files at the polyApipe.py step, individual samples are processed one at a time. The -t / –threads option is simply passed through to tools that support multithreading. This is probably sufficent for most experiments.
However, it might not be ideal for very large experiments (e.g. 100+ samples). In those cases, its possible to use snakemake to run parts of the pipeline in parallel on a cluster. Refer to snakemake docs for how to set this up, but there is a template to show the processing logic on github here. This will need to be adjusted to suit your local compute.
# Add snakemake and graphviz into your conda env.
conda activate polyApipe_snakemake
snakemake --jobs 30 --cores 8 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -c {cluster.c} -t {cluster.time}"
Running polyApipe.py --help shows many options - described below. Most are rarely needed.
Main options : Input and output. Essential. –output should not exist.
Main:
-i INPUT [INPUT ...], --input INPUT [INPUT ...]
A bam file, bam files, or single directory of bam
files. (default: None)
-o OUT_ROOT, --output OUT_ROOT
Name root to use for output files. (default: None)
PolyA peak options : For changing how the polyA sites are defined. Increase –depth_threshold for large datasets.
PolyA peak options:
Set peak-level configs
--depth_threshold DEPTH_THRESHOLD
Need at least this many reads in a peak to call an APA
site. (default: 10)
--region_size REGION_SIZE
Size upstream of APA site to consider polyA reads.
(default: 250)
PolyA read options : For changing what reads are considered polyA reads. Defaults should usually be sufficient.
PolyA read options:
Set read-level configs for defining polyA reads
--minMAPQ MINMAPQ Minimum MAPQ mapping quality to consider. (default:
10)
--minpolyA MINPOLYA Number of A to consider polyA. (default: 5)
--non_A_allowed NONA_ALLOWED
Number of non A bases permitted in polyA region (while
still having --minpolyA As) (default: 0)
--misprime_A_count MISPRIME_A_COUNT
Number of As seen in the last --misprime_in of aligned
reads to label potential mispriming (default: 8)
--misprime_in MISPRIME_IN
Look for --misprime_A_count As in this many
nucleotides at the end of a reaad alignment when
labelling potential misprime (default: 10)
Bam Tag options : Not needed for typical 10X pipeline output, but if something other than the CB and UR tags are used for describing barcodes, specify here.
Bam tags:
For specifiing umi, cell e.t.c
--cell_barcode_tag CORRECTED_CELL_BARCODE_TAG
Corrected (exact-match) cell barcode bam tag used in
bam input. (default: CB)
--umi_tag UMI_TAG Uncorrected UMI / molecular barcode bam tag used in
bam input. May contain mismatches. (default: UR)
Running options : The -t/–threads option should usually be used to tell multithreaded tasks how many threads they can use. The rest of these options are just for running via snakemake or debugging.
Running:
For changing how this script runs. Stop/start on polyA step e.t.c
-t THREADS, --threads THREADS
Num threads for multithreaded steps. (default: 1)
--no_peaks Stop after making polyA bams. Do not try to find peaks
in polyA files (implies --no_anno --no_count)
(default: False)
--no_count Stop after making merged polyA peaks gff file. Do not
try to count them, or annotate bams with them.
(default: False)
--polyA_bams Skip polyA filtering step, the bams specified with
'-i' are already filtered to polyA-containing reads
only. (default: False)
--peaks_gff PEAKS_GFF
If provided, use this gff file of peaks instead of
making one from polyA reads. Skips polyA filtering
steps, incompatable with --polyA_bams This is a gtf
format specifically as output by this script. See
example data. (default: None)
--keep_interim_files Don't delete the intermediate files (merged polyA,
annotated input bams).For debugging or piecemeal runs.
(default: False)
Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. 2013. “STAR: Ultrafast universal RNA-seq aligner.” Bioinformatics 29 (1): 15–21. https://doi.org/10.1093/bioinformatics/bts635.
Liao, Yang, Gordon K. Smyth, and Wei Shi. 2014. “FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics 30 (7): 923–30. https://doi.org/10.1093/bioinformatics/btt656.
Smith, Tom, Andreas Heger, and Ian Sudbery. 2017. “UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy.” Genome Research 27 (3): 491–99. https://doi.org/10.1101/gr.209601.116.