PyPI Docs Build Status

Home

Top News

  • [29/05/2022] We have released v2.2 that fully supports counting droplet-based data for both Skipping Exon events and other types of splcing events. See the brie-count manual
  • [29/05/2022] We have include small-sized test data sets (15MB) for both smart-seq2 and 10x Genomics. See data in brie-tutorials/tests repo

About BRIE

Welcome to the new BRIE (>=2.0 or BRIE2), Bayesian Regression for Isoform Estimate, a scalable Bayesian method to accurately identify splicing phenotypes in single-cell RNA-seq experiments and quantify isoform proportions and their uncertainty.

BRIE2 supports the analysis of splicing processes at two molecular levels, either between alternative splicing isoforms or between unspliced and spliced RNAs. In either case, it returns cell-by-event or cell-by-gene matrices of PSI value and its 95% confidence interval (quantification) and the statistics for detecting DAS and DMG on each event or gene:

  1. Differential alternative splicing (DAS): This task is to quantify the proportions of alternative splicing isoforms and to detect DAS between groups of cells or along with a continuous covariate, e.g., pseudotime. BRIE2 is designed for two-isoform splicing events with a focus on exon skipping, but in principle also applicable for mutual exclusion, intron-retaining, alternative poly-A site, 3’ splice site and 5’ splice site.
  2. Differential momentum genes (DMG): This task is to quantify the proportions of unspliced and spliced RNAs in each gene and each cell. Similar to DAS, the DMG is a principled selection of genes that capture heterogeneity in transcriptional kinetics between cell groups, e.g., cell types, or continuous cell covariates, hence may enhance the RNA velocity analyses by focusing on dynamics informed genes.

Note

Though we highly recommend using BRIE v2 for a coherent way for splicing phenotype selection, BRIE1 CLI (MCMC based & gene feature only) is still available but the CLIs are changed to brie1 and brie1-diff.

Questions and Issues

If you find any technical issues in the codes, we will appreciate your report. Please write them in the GitHub issues: https://github.com/huangyh09/brie/issues

If you have other specific questions on using BRIE, feel free to get in touch with us: yuanhua <at> hku.hk

References

Installation

Environment setting (optional)

For using BRIE2, we recommend creating a separated conda environment for easier and cleaner management of dependent packages, particularly TensorFlow and TensorFlow-probability, which are under active development.

The following command line in terminal (Linux or MacOS) will create a conda environment with name TFProb, probably in the path ~/.conda/envs/TFProb:

conda create -n TFProb python=3.7

Alternatively, you can create the environment in another path by replacing -n TFProb with -p ANY_PATH/TFProb to specify the path for conda environment. Then you can check your environment by conda env list and activate the environment by conda activate TFProb or the full path, before start installing brie and other packages.

Easy install

BRIE2 is available on PYPI. To install, type the following command line, and add -U for upgrading:

pip install -U brie

Alternatively, you can install from this GitHub repository for the latest (often development) version by the following command line

pip install -U git+https://github.com/huangyh09/brie

In either case, if you don’t have write permission for your current Python environment, add --user, but check the previous section above for creating your own conda environment.

GPU usage

One of the key benefits of using TensorFlow backend is its direct support of GPU for substantial speedups (generally >10x). Here is one way to set up GPU configurations with NVIDIA GPU on Ubuntu:

pip install -U tensorflow-gpu
conda install -c anaconda cudatoolkit

Make sure that you have compatible versions between tensorflow and NVIDIA CUDA. You can check TF’s test here. For more information on GPU configuration, please refer to the Tensorflow documentation or anaconda GPU.

Once successfully configured, you will see log info like the following when using brie-quant:

$ brie-quant
  I tensorflow/stream_executor/platform/default/dso_loader.cc:53]
  Successfully opened dynamic library libcudart.so.11.0
  Welcome to brie-quant in BRIE v2.0.5!

Note

At the moment, TensorFlow calls all available GPUs, which is not necessary. You can specify the card (e.g., card 3) by adding the below variable before your command line CUDA_VISIBLE_DEVICES=3 brie-quant -i my_count.h5ad

Test

In order to test the installation, you could type brie-quant. If successful, you will see the following output.

Welcome to brie-quant in BRIE v2.0.2!

use -h or --help for help on argument.

If you install BRIE successfully, but can’t run it and see the error below, then check whether its directory is added to PATH environment.

brie-quant: command not found

Usually, the directory is ~/.local/bin if you don’t use Anaconda. You could add the path into PATH environment variable, by writing the following line into .profile or .bashrc file.

export PATH="~/.local/bin:$PATH"

If you have any issues, please report them to the issue on brie issues.

Quick start

BRIE2 estimates the splicing proportion for two-component events across many cells. It is designed for analysing two molecular levels of splicing processes:

  1. Alternative splicing isoforms, e.g., exon included or excluded. BRIE2 will identify differential alternative splicing (DAS) events. See demo1 for detecting DAS between disease and control.
  2. Spliced vs unspliced RNAs for RNA velocity analysis. BRIE2 will identify differential momentum genes (DMG). See demo2 for detecting DMG between cell types.

For getting started quickly, there are two steps to go: counting and quantifying. The brie-quant (for quantifying and testing) is developed in a unified manner. Here we provide a roadmap of using BRIE2 for both molecular levels in different purposes (quantification or feature detection).

BRIE2 roadmap

Pre-step: read counting

First, you need to count the isoform-specific reads in each splicing event in each cell. Similarly for read/UMI counts for spliced and unspliced RNAs. BRIE2 only provides a utility function for read counts in alternative splicing. For counting spliced/unspliced reads, we provide some tips in the unspliced RNA counting section.

For alternative splicing, e.g., exon-skipping event, you can download the splicing annotations generated by us or make your own, e.g., with briekit.

Then you can use the brie-count to fetch the read count tensor, which will be stored in hdf5 format as AnnData. See more details on brie-count CLI, and you can use this example command line.

# for smart-seq
brie-count -a AS_events/SE.gold.gtf -S sam_and_cellID.tsv -o out_dir -p 15

# for droplet, e.g. 10x Genomics
brie-count -a AS_events/SE.gold.gtf -s possorted.bam -b barcodes.tsv.gz -o out_dir -p 15

Besides the SE event, other types of alternative splicing, e.g., intron retaining is also applicable with BRIE. Some pre-processing utilities will be available soon.

Main Step: quantify and detect

brie-quant is used for this step. If GPU is available, we highly recommend using GPU for ~10x speedup comparing CPU server. Thanks to good support from Tensorflow, the environment setting is straightforward, see our guide on GPU Usage.

This step supports various settings for either quantification of splicing or detection of splicing phenotypes. You can directly run step 1.2 if you only want to perform phenotype detection.

PSI quantification

You can quantify the isoform with cell or gene features or both or none. Usually, we recommend using aggregated imputation even if you don’t have any feature, namely mode2-quant in brie-quant CLI as follows (please add --interceptMode gene for aggregating cells as prior for each gene),

brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_aggr.h5ad --interceptMode gene
Splicing Phenotype detection

If you have cell-level features, e.g., disease condition or cell type or continuous variable, you can use it in cell features to detect variable splicing events or differential momentum genes as phenotypes for further analysis. This is mode2-diff in brie-quant CLI, so requires -c and --LRTindex.

brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_cell.h5ad \
    -c $DATA_DIR/cell_info.tsv --interceptMode gene --LRTindex=All

Example

Please see the example in brie-quant CLI mode 3, and MS data.

Downstream Step: analysis

The BRIE output AnnData files are compatible with Scanpy, hence you can easily use it for dimension reduction, clustering, and other visualization. A few examples for both alternative splicing and RNA velocity are also available in this documentation (see the navigation bar on the left).

Others for unspliced RNA counting

BRIE2 doesn’t provide a utility function for counting the spliced and unspliced RNAs, but thanks to the community efforts, there are a few tools already available for this purpose:

  1. velocyto.py: the earliest software for this purpose. Generally not computationally efficient, possible due to written in Python. For unknown reasons, the proportion of unspliced RNA is unrealistically high for 5’ scRNA-seq data based on 10x Genomics.
  2. dropEst: as implemented in C/C++, it is much more efficient. It also returns more reasonable proportions of unspliced RNAs for 5’ 10x Genomics data
  3. STAR-solo: new extension for the popular STAR. Benefits: efficient and one step for reads alignment and counting of unspliced RNA (Recommended option)

The first two options take inputs as aligned bam file(s), and STAR-solo itself is a widely used aligner and provides the count matrices directly. All these options align reads to genome and define reads as unspliced and spliced by the gene annotations in GTF/GFF3 format.

Alternatively, there are other options by aligning reads to annotated transcriptomes directly e.g., kallisto bustools. However, the agreement of the above counting tools is still not perfect according to a recent benchmarking paper (Soneson et al, Plos Comp Bio, 2021)

brie-count CLI

BRIE (>=2.0.0) provides two CLI directly available in your Python path: brie-count, brie-quant.

From v2.1, the brie-count CLI supports any type of splicing events, with primarily focusing on exon-skipping events:

  • exon skipping (SE): we provide a curated annotation for human and mouse; the output is in h5ad format, seamlessly for downstream analysis with brie-quant
  • other types: in generally, brie-count returns reads (or UMIs) counts for that are compatible to isoform groups (regardless of the number of exons it contains). However, the downstream analysis (e.g., differentail splicing) with brie-quant mainly supports two-isoform based events. Also, users may need to curate the splicing annotation themselves (we may provide some curation in future).

From v2.2, the brie-count CLI supports counting on both well-based platform (e.g., smart-seq2) where each cell has a separate bam (sam/cram) file and droplet-based platform (e.g., 10x Genomics) where all cells are pooled in one big bam file through cell barcodes.

Splicing annotations

As input, it requires a list of annotated splicing events. We have generated annotations for human and mouse. If you are using it, please align RNA-seq reads to the according version (or close version) of reference genome. Alternatively, you can use briekit package to generate.

Read counting

The brie-count CLI calculates a count tensor for the number of reads that are aligned to each splicing event and each cell, and stratifies four them into four different categories (using SE as an example):

  1. key 1: Uniquely aligned to isoform1, e.g., in exon1-exon2 junction in SE event
  2. key 2: Uniquely aligned to isoform2, e.g., in exon1-exon3 junction in SE event
  3. key 3: Ambiguously aligned to isoform1 and isoform2, e.g., within exon1
  4. key 0: Partially aligned in the region but not compatible with any of the two isoforms. We suggest ignoring these reads.

Note

You can decode isoform compatibility from the key index by using the decimal-bo-binary

For example, the key 5 (decimal) can be decoded to 101 (binary), which means compatible with both isoform 1 (the last) and 3 (the third from right).

Then you fetch the counts on a list of bam files by the command line like this:

# for smart-seq
brie-count -a AS_events/SE.gold.gtf -S sam_and_cellID.tsv -o out_dir -p 15

# for droplet, e.g. 10x Genomics
brie-count -a AS_events/SE.gold.gtf -s possorted.bam -b barcodes.tsv.gz -o out_dir -p 15

By default, you will have four output files in the out_dir: brie_count.h5ad, read_count.mtx.gz, cell_note.tsv.gz, and gene_note.tsv.gz. The brie_count.h5ad contains all information for downstream analysis, e.g., for brie-quant.

Options

There are more parameters for setting (brie-count -h always give the version you are using):

Usage: brie-count [options]

Options:
  -h, --help            show this help message and exit
  -a GFF_FILE, --gffFile=GFF_FILE
                        GTF/GFF3 file for gene and transcript annotation
  -o OUT_DIR, --out_dir=OUT_DIR
                        Full path of output directory [default: $samFile/brieCOUNT]

  SmartSeq-based input:
    -S SAMLIST_FILE, --samList=SAMLIST_FILE
                        A no-header tsv file listing sorted and indexed bam/sam/cram
                        files. Columns: file path, cell id (optional)

  Droplet-based input:
    -s SAM_FILE, --samFile=SAM_FILE
                        One indexed bam/sam/cram file
    -b BARCODES_FILE, --barcodes=BARCODES_FILE
                        A file containing cell barcodes without header
    --cellTAG=CELL_TAG  Tag for cell barocdes [default: CB]
    --UMItag=UMI_TAG    Tag for UMI barocdes [default: UR]

  Optional arguments:
    --verbose           Print out detailed log info
    -p NPROC, --nproc=NPROC
                        Number of subprocesses [default: 4]
    -t EVENT_TYPE, --eventType=EVENT_TYPE
                        Type of splicing event for check. SE: skipping-exon; Any: no-
                        checking [default: SE]

brie-quant CLI

The brie-quant CLI (in brie>=2.0.0) uses the newly developed variational inference methods scalable to large data sets, which works both in CPU or GPU with the TensorFlow Backend. For using BRIE1 (<=0.2.4) with an MCMC sampler, please refer to BRIE1.

This command allows to quantify the splicing proportion Psi and detect variable splicing events/genes along with cell-level features, e.g., cell type, disease condition, and development time. It can handle both alternative splicing isoforms or unspliced vs spliced ratios in a unified framework.

As a Bayesian method, the key philosophy of BRIE is to combine likelihood (data-driven) and prior (uninformative or informative). In BRIE2, a variety of prior settings are supported, as follows.

Note

The mode2 (quant and diff) with cell features is the recommended option in BRIE2, for both quantifying PSI and identifying differential momentum genes for RNA velocity analysis and differential splicing events on alternative splicing for either categorical or continuous covariates.

For unspliced vs spliced in RNA velocity analysis, please add argument --layers spliced,unspliced, as by default is set for alternative splicing.

Mode2-quant: Aggregated imputation

This mode requires argument --interceptMode gene, which means fitting an offset for each gene (or splicing event). The offset denotes the mean of the prior distribution through aggregation, hence learns a prior shared by all cells on each gene. The benefit for this mode is that dimension reduction can be performed, e.g., PCA and UMAP on splicing PSI matrix. As there are many splicing events with low total reads, they will have a high variance in the estimation, and is often suggested filtered out, which will cause missing values. Based on the cell aggregated imputation, most dimension reduction methods can be used, even it doesn’t support missing values.

Example command line for mode 2:

brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_aggr.h5ad --interceptMode gene

Mode2-diff: Variable splicing detection

This mode requires argument -c for cell features and --LRTindex for the index (zero-based) of cell features to perform likelihood ratio test. Again we suggest keeping the cell aggregation on each gene by --interceptMode gene.

Then this mode will learn a prior from the given cell level features and perform the second round fitting with leaving one feature out each time to calculate the ELBO gain, which can be used as a surrogate for Bayes factor.

Example command line for mode 3:

brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_cell.h5ad \
    -c $DATA_DIR/cell_info.tsv --interceptMode gene --LRTindex=All

Example

As an example in the MS data, we have cells labelled with two factors 1) multiple sclerosis & control - isEAE, 2) two mouse strains - isCD1. We want to identify alternative splicing events associated with the first factor multiple sclerosis, but also want to consider the potential confounder in mouse strain, we could set the design matrix in cell_info.tsv file as follow, along with parameters --interceptMode gene --LRTindex 0.

samID   isEAE   isCD1
SRR7102862      0       1
SRR7103631      1       0
SRR7104046      1       1
SRR7105069      0       0

Note

Be very careful on collinearity (i.e., redundancy) of your design matrix in cell_info.tsv and the constant intercept (if use --interceptMode gene), which is similar to DEG in edgeR or DESeq2.

By default, BRIE2 uses the --testBase=full mode to detect differential splicing by comparing all features + constant intercept versus leaving the testing feature(s) out. In this setting, if your features can’t have collinearity with intercept, e.g., you may need to remove one cell type out if it covers all your cells.

Alternatively, you can change to another strategy --testBase=null by comparing the testing feature(s) + intercept versus intercept only, like the example data on Dentate Gyrus.

Mode 1: Imputation with gene features

This mode is introduced in BRIE 1, where genomic sequences are leverage to learn a prior distribution of PSI. Then the predicted distribution of PSI from sequence features are combined with the likelihood obtained from the observed read counts. This framework provides a coherent way to combine the quantification from read counts and imputation from genomic features.

Initially, the inference in this mode is achieved by MCMC sample per cell separated, which is generally slow for over hundreds of cells. BRIE2’s variational framework keeps supporting the gene features and performs the inference for all cells in one go, though they are independent.

Example command line for Mode 1. We suggest use --interceptMode cell to learn an offset for each cell:

brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_gene.h5ad \
    -g $DATA_DIR/gene_feature.tsv --interceptMode cell

Note

For the sake of convenience, we now recommend using Mode2-quant below to perform imputation, which leverages the average PSI values in a cell population to function as an informative prior.

Mode 0: None imputation

In this mode, the prior is an uninformative logit-normal distribution with mean=0, and learned variance. Therefore, if a splicing event in a gene doesn’t have any read, it will return a posterior with Psi’s mean=0.5 and 95% confidence interval around 0.95 (most cases >0.9).

This setting is used if you have high covered data and you only want to calculate cells with sufficient reads for each interesting gene, e.g., by filtering out all genes with Psi_95CI > 0.3.

Otherwise, the 0.5 imputed genes will be confounded by the expression level, instead of the isoform proportion.

Example command line for mode 1:

brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_pure.h5ad --interceptMode None

All parameters

There are more parameters for setting (brie-quant -h always give the version you are using):

Usage: brie-quant [options]

Options:
  -h, --help            show this help message and exit
  -i IN_FILE, --inFile=IN_FILE
                        Input read count matrices in AnnData h5ad or brie npz format.
  -c CELL_FILE, --cellFile=CELL_FILE
                        File for cell features in tsv[.gz] with cell and feature ids.
  -g GENE_FILE, --geneFile=GENE_FILE
                        File for gene features in tsv[.gz] with gene and feature ids.
  -o OUT_FILE, --out_file=OUT_FILE
                        Full path of output file for annData in h5ad [default:
                        $inFile/brie_quant.h5ad]
  --LRTindex=LRT_INDEX  Index (0-based) of cell features to test with LRT: All, None
                        or comma separated integers [default: None]
  --testBase=TEST_BASE  Features in testing base model: full, null  [default: full]
  --interceptMode=INTERCEPT_MODE
                        Intercept mode: gene, cell or None [default: None]
  --layers=LAYERS       Comma separated layers two or three for estimating Psi
                        [default: isoform1,isoform2,ambiguous]

  Gene filtering:
    --minCount=MIN_COUNT
                        Minimum total counts for fitltering genes [default: 50]
    --minUniqCount=MIN_UNIQ_COUNT
                        Minimum unique counts for fitltering genes [default: 10]
    --minCell=MIN_CELL  Minimum number of cells with unique count for fitltering genes
                        [default: 30]
    --minMIF=MIN_MIF    Minimum minor isoform frequency in unique count [default:
                        0.001]

  VI Optimization:
    --MCsize=MC_SIZE    Sample size for Monte Carlo Expectation [default: 3]
    --minIter=MIN_ITER  Minimum number of iterations [default: 5000]
    --maxIter=MAX_ITER  Maximum number of iterations [default: 20000]
    --batchSize=BATCH_SIZE
                        Element size per batch: n_gene * total cell [default: 500000]
    --pseudoCount=PSEUDO_COUNT
                        Pseudo count to add on unique count matrices [default: 0.01]

Release notes

Release in future

Release v2.2.2 (04/11/2022)

  • support limit the CPU threads used by tensorflow in brie-quant (default 6)

Release v2.2.1 (14/07/2022)

  • fix a bug for –eventType=Any in smart-seq
  • change to setting for save to h5ad file for all events with only two-isoforms

Release v2.2.0 (29/05/2022)

  • Big new feature brie-count now supports cell barcodes based platforms, including 10x Genomics
  • provide small test files in the tests folder
  • Update brie-count manual

Release v2.1.0 (22/03/2022)

  • Enable brie-count to support other types of splicing events, not only SE
  • Fix a minor bug in exteme scenario with missing a certain isoform for all genes

Release v2.0.6 (23/09/2021)

  • Fix in format bug in brie-count
  • Update brie.pl.volcano for using ELBO_gain as default y-axis
  • Update manual according to the revised paper
  • Add CLI brie

Release v2.0.5 (04/11/2020)

  • Support saving detection table to tsv file
  • Add the exon start and stop positions in brie-count

Release v2.0.4 (26/09/2020)

  • Tune the learning rate with multiple values
  • For test model, the fitted parameters will be used as initials
  • Support base model with full features or null feature
  • For gene feature only, switch sigma into per cell base
  • Add noise term in simulator
  • A few minor bug fix
  • Implement a Inv-Gamma prior distribution for sigma (not in use)

Release v2.0.3 (26/08/2020)

  • Support to use minimum minor isoform frequency to fileter genes (default=0.001)
  • Add pseudo count (default=0.01) for none-empty element in both unique counts for more robust results
  • Reduce the sample size for Monte Carlo Expectation (10 to 3) for computational efficiency
  • Restructure the arguments in brie-quant
  • Initialise the example notebook on multiple sclerosis data

Release v2.0.2 (24/08/2020)

  • Fix minor bugs in brei-count and brie-quant cli for compatibility

Release v2.0.0 (23/08/2020)

  • Change the whole BRIE model from MCMC sampler (v1) to variational inference (v2)
  • Change the usage of each read to a summarised read counts for speedup
  • Support splicing quantification without any feature or cell features or gene features or both types of features.
  • Support detection of variable splicing event associated with cell features
  • Support acceleration with Graphic card (Nvidia GPU)
  • Compatible with Scanpy analysis suite with a variety of plotting functions
  • Restructure the whole package
  • BRIE earlier version is still avaible with brie1

Release v0.2.4 (21/10/2019)

  • fix a bug that fragment length longer than transcript length

Release v0.2.2 (15/01/2019)

  • move __version__ into a separate file, so import brie is not required before installation.
  • support cram file as input aligned reads

Release v0.2.0 (03/06/2018)

  • split the pre-processing functions in BRIE to another separate package BRIE-kit, as some functions in the pre-processing require Python2 environment. So, BRIE will keep two functions brie and brie-diff, while BRIE-kit will have briekit-event, briekit-event-filter, and briekit-factor. See BRIE-KIT: https://github.com/huangyh09/briekit/wiki

Release v0.1.5 (03/06/2018)

  • support gff in gz format
  • add an example with 130 cells
  • brie-diff supporting ranking genes

Release v0.1.4 (02/06/2018)

  • turn off pylab
  • fix a bug for function rev_seq, reported in issue #10
  • update documentation

Release v0.1.3 (16/04/2017)

  • brie-diff takes multiple cells, which could handle pair-wise comparisons for 100 cells in ~10min with 30 CPUs; and 1000 cells within a few hours.
  • Simulation wraps on Spanki are provided for simulating splicing events at different coverages or drop-out probability and drop-out rate for single cells: https://github.com/huangyh09/brie/tree/master/simulator

Release v0.1.2 (13/01/2017)

  • support Python 3.x now
  • do not depend on h5py anymore for hdf5 data storage.
  • brie-factor returns xxx.csv.gz rather than xxx.h5
  • brie returns sample.csv.gz rather than sample.h5
  • brie-diff takes sample.csv.gz rather than sample.h5

Release v0.1.1 (02/01/2017)

  • change licence to Apache License v2
  • update brie-event-filter

Release v0.1.0 (29/12/2016)

  • Initial release of BRIE

Multiple Sclerosis

This data set was generated by Falcão et al, 2018 containing 2,208 mouse cells probed by SMART-seq2 with half experimental autoimmune encephalomyelitis (EAE) cells (in mimicing multiple scleorsis) and half control cells.

Here, we will illustrate how BRIE2 can be used to effectively detect differential splicing events between EAE and control cells, and using splicing ratio to predict cell type and disease conditions.

We have pre-processed the data with these BRIE2 scripts with brie-count and brie-quant. You can download the pre-computed data, e.g., by using the following command line and unzip it into the ./data folder for running this notebook.

wget http://ufpr.dl.sourceforge.net/project/brie-rna/examples/msEAE/brie2_msEAE.zip
unzip -j brie2_msEAE.zip -d ./data

Load Packages

[1]:
import umap
[2]:
import os
import brie
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt

print(brie.__version__)
2021-09-23 21:29:03.815258: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-23 21:29:03.815359: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2.0.6
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3) sc.logging.print_versions() sc.settings.set_figure_params(dpi=60)
[3]:
# define the path you store the example data
# dat_dir = "./data"
dat_dir = "/storage/yhhuang/research/brie2/releaseDat/msEAE/"

BRIE2 option 1: differential splicing events

Detecting differential splicing event by regressing on the EAE state with considering the strain covariate.

The command line is available at run_brie2.sh. Here is an example of design matrix besides the intercept and isEAE as the only testing variable.

samID   isEAE   isCD1
SRR7102862      0       1
SRR7103631      1       0
SRR7104046      1       1
SRR7105069      0       0

Besides the big .h5ad file, you can use the detected splicing events directely from brie_quant_cell.brie_ident.tsv. However, if you want get the quantify Psi for downstream analysis, e.g., option2 below, you need load the adata.layers['Psi'], and adata.layers['Psi_95CI'].

[4]:
adata = sc.read_h5ad(dat_dir + "/brie_quant_cell.h5ad")
adata
[4]:
AnnData object with n_obs × n_vars = 2208 × 3780
    obs: 'samID', 'samCOUNT'
    var: 'GeneID', 'GeneName', 'TranLens', 'TranIDs', 'n_counts', 'n_counts_uniq', 'loss_gene'
    uns: 'Xc_ids', 'brie_losses', 'brie_param', 'brie_version'
    obsm: 'Xc'
    varm: 'ELBO_gain', 'cell_coeff', 'effLen', 'fdr', 'intercept', 'p_ambiguous', 'pval', 'sigma'
    layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'isoform1', 'isoform2', 'poorQual'
[5]:
adata.uns['brie_param']
[5]:
{'LRT_index': array([0]),
 'base_mode': 'full',
 'intercept_mode': 'gene',
 'layer_keys': array(['isoform1', 'isoform2', 'ambiguous'], dtype=object),
 'pseudo_count': 0.01}

Change gene index from Ensemebl id to gene name

[6]:
adata.var['old_index'] = adata.var.index
new_index = [adata.var.GeneName[i] + adata.var.GeneID[i][18:] for i in range(adata.shape[1])]
adata.var.index = new_index
[7]:
# if the index is bytes, change it to str

adata.var.index = [x.decode('utf-8') if type(x) == bytes else x for x in adata.var.index]
adata.obs.index = [x.decode('utf-8') if type(x) == bytes else x for x in adata.obs.index]

Add the cell annotations from input covariates

[8]:
adata.obs['MS'] = ['EAE' if x else 'Ctrl' for x in adata.obsm['Xc'][:, 0]]
adata.obs['isCD1'] = ['CD1_%d' %x for x in adata.obsm['Xc'][:, 1]]
Volcano plot

Here, BRIE2 detects the differential splicing events between EAE and control. Cell_coeff means the effect size on logit(Psi). Positive value means higher Psi in isEAE.

[9]:
## volcano plot for differential splicing events
brie.pl.volcano(adata, y='ELBO_gain', log_y=False, n_anno=16, score_red=7, adjust=True)
plt.xlabel('cell_coeff: effect size on logit(Psi)')
plt.title("MS differential splicing")
[9]:
Text(0.5, 1.0, 'MS differential splicing')
_images/brie2_msEAE_18_1.png
[10]:
DSEs = adata.var.index[adata.varm['ELBO_gain'][:, 0] >= 7]
len(DSEs), len(np.unique([x.split('.')[0] for x in DSEs])), adata.shape
[10]:
(129, 123, (2208, 3780))
[11]:
adata.uns['Xc_ids']
[11]:
array(['isEAE', 'isCD1'], dtype=object)
Visualize raw counts for DSEs
[12]:
rank_idx = np.argsort(adata.varm['ELBO_gain'][:, 0])[::-1]

fig = plt.figure(figsize=(15, 8))
brie.pl.counts(adata, genes=list(adata.var.index[rank_idx[:6]]),
               color='MS', add_val='ELBO_gain', ncol=3, alpha=0.7)
# fig.savefig(dat_dir + '../../figures/fig_s8_counts.png', dpi=300, bbox_inches='tight')
plt.show()
/home/yuanhua/MyGit/brie/brie/plot/base_plot.py:53: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later.
  plt.subplot(nrow, ncol, i + 1)
_images/brie2_msEAE_22_1.png
Changed ratio between ambiguous reads

Some differential splicing events having changed ratio on ambigous reads and isoform specific reads

[13]:
### Differential splicing shown by isoform1 and ambiguous reads

fig = plt.figure(figsize=(5, 4))
brie.pl.counts(adata, genes='Ddhd1',
               color='MS', add_val='ELBO_gain',
               layers=['isoform1', 'ambiguous'],
               nrow=1, alpha=0.7)
plt.show()
_images/brie2_msEAE_24_0.png
[ ]:

BRIE2 option 2: splicing quantification and usage

If you aim to use splicing isoform proportions as phenotypes for downstream analysis, e.g., to identify cell types, or study its disease contribution, we suggest not including the cell feature as covariate when quantifying Psi to avoid circular analysis.

Instead you could use gene level features, e.g., sequence features we introduced in BRIE1, or you can use an aggregated value from all cells as an informative prior. Similar as BRIE1, this prior is learned from the data and has a logit-normal distribution, so the information prior usually won’t be too strong.

In general, we recommend using the aggregation of cells as the prior thanks to its simplicity if it does not violate your biological hypothesis.

[14]:
adata_aggr = sc.read_h5ad(dat_dir + "/brie_quant_aggr.h5ad")

adata_aggr
[14]:
AnnData object with n_obs × n_vars = 2208 × 3780
    obs: 'samID', 'samCOUNT'
    var: 'GeneID', 'GeneName', 'TranLens', 'TranIDs', 'n_counts', 'n_counts_uniq'
    uns: 'brie_losses'
    obsm: 'Xc'
    varm: 'cell_coeff', 'effLen', 'intercept', 'p_ambiguous', 'sigma'
    layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'isoform1', 'isoform2', 'poorQual'
[15]:
## change gene index
print(np.mean(adata.var['old_index'] == adata_aggr.var.index))

adata_aggr.var.index = adata.var.index
1.0
[16]:
# if the index is bytes, change it to str

adata_aggr.var.index = [x.decode('utf-8') if type(x) == bytes else x
                        for x in adata_aggr.var.index]
adata_aggr.obs.index = [x.decode('utf-8') if type(x) == bytes else x
                        for x in adata_aggr.obs.index]
Add meta data and gene-level annotation

Cell annotations and UMAP from gene-level expression. You can add any additional annotation.

[17]:
print(np.mean(adata.obs.index == adata_aggr.obs.index))
adata_aggr.obs['MS'] = adata.obs['MS'].copy()
adata_aggr.obs['isCD1'] = adata.obs['isCD1'].copy()
1.0
[18]:
dat_umap = np.genfromtxt(dat_dir + '/cell_X_umap.tsv', dtype='str', delimiter='\t')

mm = brie.match(adata_aggr.obs.index, dat_umap[:, 0])
idx = mm[mm != None].astype(int)

adata_aggr = adata_aggr[mm != None, :]
adata_aggr.obsm['X_GEX_UMAP'] = dat_umap[idx, 1:3].astype(float)
adata_aggr.obs['cluster'] = dat_umap[idx, 3]
adata_aggr.obs['combine'] = [adata_aggr.obs['cluster'][i] + '-' + adata_aggr.obs['MS'][i]
                            for i in range(adata_aggr.shape[0])]
[19]:
adata_aggr
[19]:
AnnData object with n_obs × n_vars = 1882 × 3780
    obs: 'samID', 'samCOUNT', 'MS', 'isCD1', 'cluster', 'combine'
    var: 'GeneID', 'GeneName', 'TranLens', 'TranIDs', 'n_counts', 'n_counts_uniq'
    uns: 'brie_losses'
    obsm: 'Xc', 'X_GEX_UMAP'
    varm: 'cell_coeff', 'effLen', 'intercept', 'p_ambiguous', 'sigma'
    layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'isoform1', 'isoform2', 'poorQual'
Cell filtering
[20]:
plt.hist(np.log10(adata_aggr.X.sum(1)[:, 0] + 1), bins=30)
plt.xlabel("log10(total reads)")
plt.ylabel("Cell frequency")
plt.show()
_images/brie2_msEAE_37_0.png
[21]:
# min_reads = 15000
min_reads = 3000

adata_aggr = adata_aggr[adata_aggr.X.sum(axis=1) > min_reads, :]
adata = adata[adata_aggr.obs.index, :]

Visualizing splicing phenotypes in Gene expression UMAP

[22]:
sc.pl.scatter(adata_aggr, basis='GEX_UMAP', color=['cluster', 'MS'], size=30)
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/anndata/_core/anndata.py:1229: ImplicitModificationWarning: Initializing view as actual.
  "Initializing view as actual.", ImplicitModificationWarning
Trying to set attribute `.obs` of view, copying.
... storing 'MS' as categorical
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
Trying to set attribute `.obs` of view, copying.
... storing 'isCD1' as categorical
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
Trying to set attribute `.obs` of view, copying.
... storing 'cluster' as categorical
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  c.reorder_categories(natsorted(c.categories), inplace=True)
Trying to set attribute `.obs` of view, copying.
... storing 'combine' as categorical
_images/brie2_msEAE_40_1.png
[23]:
sc.pl.scatter(adata_aggr, basis='GEX_UMAP', layers='Psi',
              color=['App.AS2', 'Mbp.AS3', 'Emc10'],
              size=50, color_map='terrain_r') #'terrain_r'
_images/brie2_msEAE_41_0.png
[24]:
import seaborn as sns

sc.pl.violin(adata_aggr, ['App.AS2', 'Mbp.AS3', 'Emc10'],
             layer='Psi', groupby='combine', rotation=90,
             inner='quartile', palette=sns.color_palette("Paired"))
_images/brie2_msEAE_42_0.png

Dimension reduction on Psi

For downstream analysis, we only using splicing events that have been detected as differential splicing events. As mentioned earlier, the Psi is re-quantified by not using the cell-level information.

[25]:
# idx = list(adata.varm['fdr'][:, 0] < 0.05)
# adata_psi = adata_aggr[:, idx]

adata_psi = adata_aggr.copy()

adata_psi.layers['X'] = adata_psi.X.astype(np.float64)
adata_psi.X = adata_psi.layers['Psi']
adata_psi.shape
[25]:
(1845, 3780)
[26]:
sc.tl.pca(adata_psi, svd_solver='arpack')
[27]:
adata_psi.obs['Psi_PC1'] = adata_psi.obsm['X_pca'][:, 0]
adata_psi.obs['Psi_PC2'] = adata_psi.obsm['X_pca'][:, 1]
adata_psi.obs['Psi_PC3'] = adata_psi.obsm['X_pca'][:, 2]
[28]:
fig = plt.figure(figsize=(5, 3.7), dpi=80)
ax = plt.subplot(1, 1, 1)
sc.pl.pca(adata_psi, color='combine', size=30, show=False, ax=ax,
          palette=sns.color_palette("Paired"))
plt.xlabel("PC1: %.1f%%" %(adata_psi.uns['pca']['variance_ratio'][0] * 100))
plt.ylabel("PC2: %.1f%%" %(adata_psi.uns['pca']['variance_ratio'][1] * 100))
# plt.legend(loc=1)
plt.show()
_images/brie2_msEAE_48_0.png
[29]:
sc.pl.pca_variance_ratio(adata_psi)
_images/brie2_msEAE_49_0.png
[30]:
sc.pp.neighbors(adata_psi, n_neighbors=10, n_pcs=20, method='umap')
sc.tl.umap(adata_psi)
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/numba/np/ufunc/parallel.py:363: NumbaWarning: The TBB threading layer requires TBB version 2019.5 or later i.e., TBB_INTERFACE_VERSION >= 11005. Found TBB_INTERFACE_VERSION = 6103. The TBB threading layer is disabled.
  warnings.warn(problem)
OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[31]:
sc.pl.umap(adata_psi, color=['MS', 'combine'], size=30)
_images/brie2_msEAE_51_0.png
Visualise the Psi value on Psi-based UMAP
[32]:
sc.pl.umap(adata_psi, color=['App.AS2', 'Mbp.AS3', 'Emc10'], size=50, cmap='terrain_r')
_images/brie2_msEAE_53_0.png
Only for the cells with confident Psi estimate
[33]:
adata_tmp = adata_psi[adata_psi[:, 'App.AS2'].layers['Psi_95CI'] < 0.3, :]
adata_tmp.X = adata_tmp.layers['Psi']
sc.pl.umap(adata_tmp, color=['App.AS2', 'Mbp.AS3', 'Emc10'], size=50, cmap='terrain_r')
_images/brie2_msEAE_55_0.png
[34]:
sc.pl.scatter(adata_psi, basis='GEX_UMAP', color=['Psi_PC1', 'Psi_PC2', 'Psi_PC3'], size=50)
_images/brie2_msEAE_56_0.png
[ ]:

Prediction of cell type

One potential usage of the quantified Psi is to identify cell types. Instead of using them to cluster cells, here we show that by using the Psi (and their principle compoments), the annotated cell types from gene expression can be accurately predicted.

[35]:
np.random.seed = 1
[36]:
_val, _idx = np.unique(pd.factorize(adata_psi.obs['cluster'])[0], return_index=True)
print(_val, _idx)
print(adata_psi.obs['cluster'][_idx])
[0 1 2 3 4] [  0   8  54  92 890]
SRR7102862     OPC
SRR7102870     MOL
SRR7102925     COP
SRR7102967    VLMC
SRR7103970    MiGl
Name: cluster, dtype: category
Categories (5, object): ['COP', 'MOL', 'MiGl', 'OPC', 'VLMC']
[37]:
import scipy.stats as st
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from hilearn import ROC_plot, CrossValidation

LogisticRegression = linear_model.LogisticRegression()
RF_class = RandomForestClassifier(n_estimators=100, n_jobs=-1)

X1 = adata_psi.obsm['X_pca'][:, :20]
Y1 = pd.factorize(adata_psi.obs['cluster'])[0]

CV = CrossValidation(X1, Y1)
# Y1_pre, Y1_score = CV.cv_classification(model=RF_class, folds=10)
Y2_pre, Y2_score = CV.cv_classification(model=LogisticRegression, folds=10)
[38]:
fig = plt.figure(figsize=(6, 5), dpi=60)
# ROC_plot(Y1, Y1_score[:,1], label="Random Forest", base_line=False)
# ROC_plot(Y1, Y2_score[:,1], label="Logistic Regress")

for i in range(5):
    ROC_plot(Y1 == i, Y2_score[:,i],
             legend_label=adata_psi.obs['cluster'][_idx[i]],
             base_line=False)
ROC_plot(np.concatenate([Y1 == i for i in range(5)]),
         Y2_score.T.reshape(-1), legend_label='Overall')

plt.title("Psi based ROC curve: cell type classification")

# fig.savefig(dat_dir + '../../figures/fig_s7_classification.pdf', dpi=300, bbox_inches='tight')
plt.show()
_images/brie2_msEAE_63_0.png
Prediction of MS state

Besides using Psi to predict the cell types, a more interesting task is to predict if a cell is in disease state (i.e., EAE state). Though by only using the Psi, it achieves moderate performance but clearly improves the prediction together with the gene expression.

[39]:
dat_GEX_pc = np.genfromtxt(dat_dir + 'top_20PC.tsv', dtype='str', delimiter='\t')
[40]:
idx = brie.match(adata_psi.obs.index, dat_GEX_pc[:, 0])
GEX_pc = dat_GEX_pc[idx, 1:].astype(float)
[41]:
adata_psi
[41]:
AnnData object with n_obs × n_vars = 1845 × 3780
    obs: 'samID', 'samCOUNT', 'MS', 'isCD1', 'cluster', 'combine', 'Psi_PC1', 'Psi_PC2', 'Psi_PC3'
    var: 'GeneID', 'GeneName', 'TranLens', 'TranIDs', 'n_counts', 'n_counts_uniq'
    uns: 'brie_losses', 'cluster_colors', 'MS_colors', 'pca', 'combine_colors', 'neighbors', 'umap'
    obsm: 'Xc', 'X_GEX_UMAP', 'X_pca', 'X_umap'
    varm: 'cell_coeff', 'effLen', 'intercept', 'p_ambiguous', 'sigma', 'PCs'
    layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'isoform1', 'isoform2', 'poorQual', 'X'
    obsp: 'distances', 'connectivities'
[42]:
np.random.RandomState(0)
[42]:
RandomState(MT19937) at 0x7F09201C87C0
[43]:
import scipy.stats as st
from sklearn import linear_model
from hilearn import ROC_plot, CrossValidation

# np.random.seed(0)

X1 = adata_psi.obsm['X_pca'][:, :20]
X2 = GEX_pc
X3 = np.append(GEX_pc, adata_psi.obsm['X_pca'][:, :20], axis=1)
Y1 = pd.factorize(adata_psi.obs['MS'])[0]

X1 = X1[adata_psi.obs['cluster'] == 'MOL']
X2 = X2[adata_psi.obs['cluster'] == 'MOL']
X3 = X3[adata_psi.obs['cluster'] == 'MOL']
Y1 = Y1[adata_psi.obs['cluster'] == 'MOL']

LogisticRegression = linear_model.LogisticRegression(max_iter=5000)
CV = CrossValidation(X1, Y1)
Y1_pre, Y1_score = CV.cv_classification(model=LogisticRegression, folds=10)

LogisticRegression = linear_model.LogisticRegression(max_iter=5000)
CV = CrossValidation(X2, Y1)
Y2_pre, Y2_score = CV.cv_classification(model=LogisticRegression, folds=10)

LogisticRegression = linear_model.LogisticRegression(max_iter=5000)
CV = CrossValidation(X3, Y1)
Y3_pre, Y3_score = CV.cv_classification(model=LogisticRegression, folds=10)
[44]:
fig = plt.figure(figsize=(6, 5), dpi=60)
ROC_plot(Y1, Y1_score[:,1], label="Psi 20 PCs", base_line=False)
ROC_plot(Y1, Y2_score[:,1], label="GEX 20 PCs", base_line=False)
ROC_plot(Y1, Y3_score[:,1], label="Psi & GEX")

plt.title("ROC curve: EAE states classification on MOL")

# fig.savefig(dat_dir + '../../figures/fig_s9_EAE_predict.pdf', dpi=300, bbox_inches='tight')
plt.show()
Warning: label argument is replaced by legend_label and will be moved in future
Warning: label argument is replaced by legend_label and will be moved in future
Warning: label argument is replaced by legend_label and will be moved in future
_images/brie2_msEAE_71_1.png
[ ]:

Techinical diagnosis
[45]:
## There are multiple batches in running jobs, so you will steps;
## the second part shows the last bit of the final batch

brie.pl.loss(adata.uns['brie_losses'])
_images/brie2_msEAE_74_0.png
[ ]:

scNT-seq

scNT-seq (Qiu, Hu, et al 2020) is a recently proposed technique to metabolically label newly synthesied RNAs in single cells. By using the labelling information, the inferred cellular transitions by Dynamo are hihgly consistent with the stimulation time (see our reproduced analysis with scripts from the authors).

This relative-longe-period transition is generally difficult to be obtained from RNA velocity by only using total RNAs. Here, we will illustrate that the differential momentum genes could help correct the projected trajectory, thanks to using the stimulation time as a testing (i.e., supervised) covariate.

As BRIE2 takes ~30 minutes with GPU to detect the DMGs, we provide the pre-computed data with these BRIE2 scripts. You can run this notebook by downloading the data, e.g., using the following command line and unzip it into the ./data folder:

wget http://ufpr.dl.sourceforge.net/project/brie-rna/examples/scNTseq/brie2_scNTseq.zip
unzip -j brie2_scNTseq.zip -d ./data

Load packages

[1]:
import brie
import numpy as np
import scanpy as sc
import scvelo as scv
import matplotlib.pyplot as plt
scv.logging.print_version()
2021-09-23 21:25:04.656553: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-23 21:25:04.656584: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Running scvelo 0.2.4 (python 3.7.6) on 2021-09-23 21:25.
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.
[2]:
scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True  # set max width size for presenter view
scv.set_figure_params('scvelo')  # for beautified visualization
[3]:
# define the path you store the example data
# dat_dir = "./data"
dat_dir = '/storage/yhhuang/research/brie2/releaseDat/scNTseq/'

scVelo dynamical model with total RNAs

This may take a few minutes to run, so we provide the pre-computed the data as in the above zip file for the setting with top 2,000 genes. The commented Python scripts are below. You can get the neuron_splicing_totalRNA.h5ad from here generated by the adapted script. If you want to change to more hihgly variable genes, you can change n_top_genes to other value, e.g., 8000.

adata = scv.read(dat_dir + "/neuron_splicing_totalRNA.h5ad")

scv.pp.filter_and_normalize(adata, min_shared_counts=30, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

scv.tl.recover_dynamics(adata, var_names='all')
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
scv.tl.latent_time(adata)
adata.write(dat_dir + "/scvelo_neuron_totalRNA_dynamical_2K.h5ad")
Cellular transitions on default selected velocity genes
[4]:
adata = scv.read(dat_dir + "/scvelo_neuron_totalRNA_dynamical_2K.h5ad")

print(adata.shape, np.sum(adata.var['velocity_genes']))
adata
(3066, 1999) 132
[4]:
AnnData object with n_obs × n_vars = 3066 × 1999
    obs: 'cellname', 'time', 'early', 'late', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'n_counts', 'velocity_self_transition', 'root_cells', 'end_points', 'velocity_pseudotime', 'latent_time'
    var: 'gene_short_name', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'fit_alpha', 'fit_beta', 'fit_gamma', 'fit_t_', 'fit_scaling', 'fit_std_u', 'fit_std_s', 'fit_likelihood', 'fit_u0', 'fit_s0', 'fit_pval_steady', 'fit_steady_u', 'fit_steady_s', 'fit_variance', 'fit_alignment_scaling', 'fit_r2', 'velocity_genes'
    uns: 'neighbors', 'pca', 'recover_dynamics', 'velocity_graph', 'velocity_graph_neg', 'velocity_params'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs', 'loss'
    layers: 'Ms', 'Mu', 'fit_t', 'fit_tau', 'fit_tau_', 'spliced', 'unspliced', 'velocity', 'velocity_u'
    obsp: 'connectivities', 'distances'
[5]:
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['time'],
                                 ax=None, show=True, legend_fontsize=10, dpi=80,
                                 title='scVelo dynamical model: 2K variable genes')
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
_images/brie2_scNTseq_12_1.png
For 8K top variable genes

You can skip this sub section if not using the 8K top genes

adata2 = scv.read(dat_dir + "/scvelo_neuron_totalRNA_dynamical_8K.h5ad")
print(adata2.shape, np.sum(adata2.var['velocity_genes']))

scv.pl.velocity_embedding_stream(adata2, basis='umap', color=['time'],
                                 ax=None, show=True, legend_fontsize=10,
                                 title='scVelo dynamical model: 8K variable genes')

BRIE2 for differential momentum genes (DMGs)

Besides the large h5ad file, BRIE2 also saves the DMGs in a .tsv file brie_neuron_splicing_time.brie_ident.tsv for quick access.

[6]:
adata_brie = scv.read(dat_dir + "/brie_neuron_splicing_time.h5ad")

adata_brie
[6]:
AnnData object with n_obs × n_vars = 3066 × 7849
    obs: 'cellname', 'time', 'early', 'late'
    var: 'gene_short_name', 'n_counts', 'n_counts_uniq', 'loss_gene'
    uns: 'Xc_ids', 'brie_losses', 'brie_param', 'brie_version'
    obsm: 'X_umap', 'Xc'
    varm: 'ELBO_gain', 'cell_coeff', 'fdr', 'intercept', 'pval', 'sigma'
    layers: 'Psi', 'Psi_95CI', 'Z_std', 'spliced', 'unspliced'
[7]:
fig = plt.figure(figsize=(6, 4), dpi=60)
plt.figure(figsize=(6, 4), dpi=100)
brie.pl.volcano(adata_brie, y='ELBO_gain', log_y=False, n_anno=16,
        score_red=10, adjust=True)
plt.title('Differential momentum genes with time')
plt.xlabel('Coefficient on time')
# plt.savefig(dat_dir + '../../figures/scNT_volcano_elbo.png', dpi=200)
plt.show()
<Figure size 360x240 with 0 Axes>
_images/brie2_scNTseq_17_1.png
RNA velocity on DMGs
[8]:
idx = (np.min(adata_brie.varm['ELBO_gain'], axis=1) > 5)
gene_use = adata_brie.var.index[idx]
print(sum(idx), sum(brie.match(gene_use, adata.var.index) != None))
n_genes = sum(brie.match(gene_use, adata.var.index) != None)

scv.tl.velocity_graph(adata, gene_subset=gene_use)
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['time'],
                                 legend_fontsize=10,
                                 ax=None, show=True, dpi=80,
                                 title='scVelo with %d DMGs' %(n_genes))
421 201
computing velocity graph (using 1/80 cores)

    finished (0:00:02) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
_images/brie2_scNTseq_19_3.png
Change cutoffs
[9]:
fig = plt.figure(figsize=(11, 4), dpi=60)

ax1 = plt.subplot(1, 2, 1)
idx1 = (np.min(adata_brie.varm['ELBO_gain'], axis=1) > 7)
gene_use1 = adata_brie.var.index[idx1]
print(sum(idx1), sum(brie.match(gene_use1, adata.var.index) != None))

scv.tl.velocity_graph(adata, gene_subset=gene_use1)
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['time'],
                                 ax=ax1, show=False, legend_fontsize=10,
                                 title='BRIE2: %d DMGs at ELBO_gain>7'
                                 %(sum(brie.match(gene_use1, adata.var.index) != None)))
ax1.text(-0.15, 0.95, 'a', transform=ax1.transAxes, size=20, weight='bold')


ax2 = plt.subplot(1, 2, 2)
idx2 = (np.min(adata_brie.varm['ELBO_gain'], axis=1) > 3)
gene_use2 = adata_brie.var.index[idx2]
print(sum(idx2), sum(brie.match(gene_use2, adata.var.index) != None))

scv.tl.velocity_graph(adata, gene_subset=gene_use2)
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['time'],
                                 ax=ax2, show=False, legend_fontsize=10,
                                 title='BRIE2: %d DMGs at ELBO_gain>3'
                                 %(sum(brie.match(gene_use2, adata.var.index) != None)))
ax2.text(-0.15, 0.95, 'b', transform=ax2.transAxes, size=20, weight='bold')

# plt.tight_layout()
# plt.savefig(dat_dir + '../../figures/scNT_scVelo_brie_ELBO.png', dpi=300)
# plt.show()
239 122
computing velocity graph (using 1/80 cores)

    finished (0:00:02) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
815 360
computing velocity graph (using 1/80 cores)

    finished (0:00:03) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
[9]:
Text(-0.15, 0.95, 'b')
_images/brie2_scNTseq_22_6.png
Visualize gene count
[10]:
idx = (np.min(adata_brie.varm['ELBO_gain'], axis=1) > 5)
gene_use = adata_brie.var.index[idx]

mm = brie.match(gene_use, adata.var.index) != None
gene_use[mm]
[10]:
Index(['Ncor1', 'App', 'Meis2', 'Cdk2ap1', 'Usp24', 'Ahi1', 'Tcf4', 'Nup98',
       'Zswim6', 'Fam155a',
       ...
       'Epha3', 'Pcca', 'Ptk2', 'Zbtb20', 'Nell2', 'Zbtb11', 'Pccb', 'Frem2',
       'Celf4', 'Zfp704'],
      dtype='object', length=201)
[11]:
## sorted by ELBO gain
idx_sort = np.argsort(adata_brie[:, gene_use[mm]].varm['ELBO_gain'][:,0])[::-1]
gene_use[mm][idx_sort]
[11]:
Index(['Ext1', 'Pccb', 'Ube2ql1', 'Ntrk2', 'Mbnl1', 'Fosb', 'Pfkp', 'Orc5',
       'Srd5a1', 'Chka',
       ...
       'Osgep', 'Nrn1', 'Ddx3y', 'Napb', 'Zdhhc17', 'Snap25', 'Ccnl1', 'Cobl',
       'Etv5', 'Lats2'],
      dtype='object', length=201)
[12]:
## Only negative coefficient
gene_sorted = gene_use[mm][idx_sort]
gene_sorted[adata_brie[:, gene_sorted].varm['cell_coeff'][:, 0] < 0]
[12]:
Index(['Ext1', 'Pccb', 'Ube2ql1', 'Ntrk2', 'Mbnl1', 'Orc5', 'Srd5a1', 'Akap9',
       'Ank2', 'Dlgap1',
       ...
       'Ptpre', 'Klhl4', 'Kcnn2', 'Zfp292', 'Osgep', 'Napb', 'Zdhhc17',
       'Snap25', 'Cobl', 'Etv5'],
      dtype='object', length=129)
[13]:
## Only positive coefficient
gene_sorted = gene_use[mm][idx_sort]
gene_sorted[adata_brie[:, gene_sorted].varm['cell_coeff'][:, 0] > 0]
[13]:
Index(['Fosb', 'Pfkp', 'Chka', 'Homer1', 'Erf', 'Nup98', 'Ifrd1', 'Ncapg2',
       'Rasgef1b', 'Rsrp1', 'Tiparp', 'Crem', 'Zdbf2', 'Arl5b', 'Usp37',
       'Fam107b', 'Elmsan1', 'Cwc25', 'Nedd9', 'Cystm1', 'Atp9a', 'Fosl2',
       'Stil', 'Arih1', 'Ldlr', 'Ak4', 'Zfp248', 'Elovl5', 'Pde4d', 'Nr4a2',
       'Nufip2', 'Mpped1', 'Srsf5', 'Fgfr2', 'Gtf2i', 'Gpr19', 'Cpeb3',
       'Fbxo33', 'Srrm2', 'Dcun1d3', 'Foxo3', 'Zswim6', 'Ccdc138', 'Zbtb11',
       'Cramp1l', 'Ndel1', 'Aff4', 'Arf4', 'Slc2a3', 'Stk40', 'Baiap2',
       'Slc20a1', 'Cabp1', 'Gla', 'Gltscr2', 'Ptprk', 'Rasgrf1', 'Cpeb2',
       'Snapc1', 'Pkia', 'Sik1', 'Nr4a1', 'Ppm1d', 'Cdk2ap1', 'Impact', 'Plaa',
       'Hpf1', 'Taf1', 'Nrn1', 'Ddx3y', 'Ccnl1', 'Lats2'],
      dtype='object')
[14]:
adata_brie.obs['time_cat'] = adata_brie.obs['time'].astype('category')

gene_top_neg = ['Ext1', 'Pccb', 'Ube2ql1']
gene_top_pos = ['Fosb', 'Pfkp', 'Chka']

fig = plt.figure(figsize=(16, 8), dpi=40)
brie.pl.counts(adata_brie, genes=gene_top_pos + gene_top_neg,
               layers=['spliced', 'unspliced'],
               color='time_cat', add_val='ELBO_gain',
               ncol=3, alpha=0.7, legend='brief', noise_scale=0.1)

# plt.savefig(dat_dir + '../../figures/scNT_DMG_counts.png', dpi=150)
# plt.show()
_images/brie2_scNTseq_28_0.png
[15]:
adata_brie[:, gene_top_pos + gene_top_neg].varm['cell_coeff']
[15]:
ArrayView([[ 0.0196298 ],
           [ 0.01087456],
           [ 0.01594826],
           [-0.03045544],
           [-0.0186206 ],
           [-0.01236648]], dtype=float32)
[16]:
scv.pl.velocity(adata, var_names=['Ext1', 'Pfkp'], colorbar=True, ncols=1)
_images/brie2_scNTseq_30_0.png
[ ]:

Dentate Gyrus

This processed Dentate Gyrus data set is downloaded from scVelo package, which is a very nice tool for RNA velocity quantification. Here, we will illustrate that the differential momentum genes could reveal biological informed trajectory, thanks to its supervised manner on informative gene detection.

You can run this notebook with our pre-computed data with these BRIE2 command lines, by downloading it e.g., using the following command line and unzip it into the ./data folder:

wget http://ufpr.dl.sourceforge.net/project/brie-rna/examples/dentateGyrus/brie2_dentateGyrus.zip
unzip -j brie2_dentateGyrus.zip -d ./data

Load packages

[1]:
import brie
import numpy as np
import scanpy as sc
import scvelo as scv
import matplotlib.pyplot as plt

scv.logging.print_version()
print('brie version: %s' %brie.__version__)
2021-09-23 21:13:31.376208: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-23 21:13:31.376238: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Running scvelo 0.2.4 (python 3.7.6) on 2021-09-23 21:13.
brie version: 2.0.6
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.
[2]:
scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True  # set max width size for presenter view
scv.set_figure_params('scvelo')  # for beautified visualization
[3]:
# define the path you store the example data
# dat_dir = "./data"
dat_dir = "/home/yuanhua/research/brie2/releaseDat/dentateGyrus/"

scVelo’s default results

Fitting scVelo

Scvelo’s stochastic and dynamical models may take a few minutes to run, so we pre-run it by the following codes, and the fitted data can be found in the downloaded zip file.

Stochastic model

adata = scv.datasets.dentategyrus()

scv.pp.filter_and_normalize(adata, min_shared_counts=30, n_top_genes=3000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

scv.tl.velocity(adata)
scv.tl.velocity_graph(adata)
adata.write(dat_dir + "/dentategyrus_scvelo_stoc_3K.h5ad")
[4]:
adata = scv.read(dat_dir + "/dentategyrus_scvelo_stoc_3K.h5ad")
[5]:
scv.utils.show_proportions(adata)
adata
Abundance of ['spliced', 'unspliced']: [0.9 0.1]
[5]:
AnnData object with n_obs × n_vars = 2930 × 2894
    obs: 'clusters', 'age(days)', 'clusters_enlarged', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition'
    var: 'velocity_gamma', 'velocity_r2', 'velocity_genes'
    uns: 'clusters_colors', 'neighbors', 'pca', 'velocity_graph', 'velocity_graph_neg', 'velocity_params'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'Ms', 'Mu', 'ambiguous', 'spliced', 'unspliced', 'variance_velocity', 'velocity'
    obsp: 'connectivities', 'distances'
# if the index is bytes, change it to str adata.var.index = [x.decode('utf-8') if type(x) == bytes else x for x in adata.var.index] adata.obs.index = [x.decode('utf-8') if type(x) == bytes else x for x in adata.obs.index]
[6]:
print(np.sum(adata.var['velocity_genes']))
634
[7]:
scv.tl.velocity_graph(adata, gene_subset=None)
scv.pl.velocity_embedding_stream(adata, basis='umap',
                                 color=['clusters', 'age(days)'],
                                 legend_fontsize=5, dpi=60)
computing velocity graph (using 1/80 cores)

    finished (0:00:06) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
_images/brie2_dentateGyrus_13_3.png
[8]:
## Plotting with higher figure resolution

# scv.pl.velocity_embedding_stream(adata, basis='umap', color=['clusters'],
#                                  legend_fontsize=5, dpi=150, title='')

# scv.pl.velocity_embedding(adata, basis='umap', arrow_length=2, arrow_size=2, dpi=300, title='')
[ ]:

BRIE2’s differential momentum genes (DMGs)

Here, we aim to use BRIE2 to detect the differential momentum genes between cell types and illustrate its performance on projecting RNA velocity to cellular transitions. The command line and design matrix are: run_brie2.sh and dentategyrus_cluster_OL.tsv.

One main differentiation is between OPC and OL. So, we took out these 103 cells and run BRIE2 to detect the DMGs by testing the covariate is_OL.

[9]:
adata_brie_OL = scv.read(dat_dir + "/brie_dentategyrus_cluster_subOL.h5ad")

adata_OL = adata[adata_brie_OL.obs.index, :]

Vocalno plot for DMGs between OPC and OL. cell_ceoff means effect size on logit(Psi), Psi means proportion of spliced RNAs. Positive effect size means OL has higher proportion of spliced RNAs.

[10]:
plt.figure(figsize=(6, 4.5), dpi=60)

brie.pl.volcano(adata_brie_OL, y='ELBO_gain', log_y=False, n_anno=16,
                score_red=5, adjust=True)
plt.title('DMGs between OPC and OL')

plt.xlabel('cell_coeff: effect size on logit(Psi)')
plt.title('DMGs between OPC and OL')
# plt.savefig(dat_dir + '../../brie2/figures/fig_s11_vocalno_DMG_OPCvsOL.pdf', dpi=150)
[10]:
Text(0.5, 1.0, 'DMGs between OPC and OL')
_images/brie2_dentateGyrus_22_1.png
[11]:
top_genes_OL = adata_brie_OL.var.index[np.argsort(list(np.max(adata_brie_OL.varm['ELBO_gain'], axis=1)))[::-1][:100]]
top_genes_OL[:4]
[11]:
Index(['Elavl3', 'Vmp1', 'Sema6d', 'Tra2a'], dtype='object', name='index')
[12]:
print(adata.var['velocity_r2'][top_genes_OL[:4]])

scv.pl.velocity(adata, var_names=top_genes_OL[:2], colorbar=True, ncols=1)
index
Elavl3    0.310708
Vmp1     -0.736003
Sema6d   -0.358877
Tra2a    -1.474734
Name: velocity_r2, dtype: float64
_images/brie2_dentateGyrus_24_1.png
[13]:
idx = adata_brie_OL.varm['ELBO_gain'][:, 0] > 2
gene_use = adata_brie_OL.var.index[idx]

print(len(gene_use), sum(brie.match(gene_use, adata_OL.var.index) != None))

scv.tl.velocity_graph(adata_OL, gene_subset=gene_use)
scv.pl.velocity_embedding_stream(adata_OL, basis='umap',
                                 color=['clusters', 'age(days)'],
                                 legend_fontsize=10, dpi=50, size=1000)
48 48
computing velocity graph (using 1/80 cores)
Trying to set attribute `.uns` of view, copying.

    finished (0:00:00) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
_images/brie2_dentateGyrus_25_4.png
[14]:
fig = plt.figure(figsize=(11, 8), dpi=60)
## cutoff 0.001
ax1 = plt.subplot(2, 2, 1)

print(np.sum(adata_OL.var['velocity_genes']))
idx1 = adata_OL.var['velocity_genes']
scv.tl.velocity_graph(adata_OL, gene_subset=None)
scv.pl.velocity_embedding_stream(adata_OL, basis='umap', color=['clusters'],
                                 ax=ax1, show=False, legend_fontsize=10, size=1000,
                                 title='scVelo default %d genes' %(sum(idx1)))
ax1.text(-0.15, 0.95, 'a', transform=ax1.transAxes, size=20, weight='bold')

## cutoff 0.05
ax1 = plt.subplot(2, 2, 2)
idx1 = adata_brie_OL.varm['ELBO_gain'][:, 0] > 3
gene_use1 = adata_brie_OL.var.index[idx1]
print(sum(idx1), sum(brie.match(gene_use1, adata_OL.var.index) != None))

scv.tl.velocity_graph(adata_OL, gene_subset=gene_use1)
scv.pl.velocity_embedding_stream(adata_OL, basis='umap', color=['clusters'],
                                 ax=ax1, show=False, legend_fontsize=10, size=1000,
                                 title='%d DMGs at ELBO_gain>3' %(sum(idx1)))
ax1.text(-0.15, 0.95, 'b', transform=ax1.transAxes, size=20, weight='bold')

## cutoff 0.05
ax1 = plt.subplot(2, 2, 3)

idx1 = adata_brie_OL.varm['ELBO_gain'][:, 0] > 2
gene_use1 = adata_brie_OL.var.index[idx1]
print(sum(idx1), sum(brie.match(gene_use1, adata_OL.var.index) != None))

scv.tl.velocity_graph(adata_OL, gene_subset=gene_use1)
scv.pl.velocity_embedding_stream(adata_OL, basis='umap', color=['clusters'],
                                 ax=ax1, show=False, legend_fontsize=10, size=1000,
                                 title='%d DMGs at ELBO_gain>2' %(sum(idx1)))
ax1.text(-0.15, 0.95, 'c', transform=ax1.transAxes, size=20, weight='bold')

## cutoff 0.05
ax1 = plt.subplot(2, 2, 4)
idx1 = adata_brie_OL.varm['ELBO_gain'][:, 0] > 1.5
gene_use1 = adata_brie_OL.var.index[idx1]
print(sum(idx1), sum(brie.match(gene_use1, adata_OL.var.index) != None))

scv.tl.velocity_graph(adata_OL, gene_subset=gene_use1)
scv.pl.velocity_embedding_stream(adata_OL, basis='umap', color=['clusters'],
                                 ax=ax1, show=False, legend_fontsize=10, size=1000,
                                 title='%d DMGs at ELBO_gain>1.5' %(sum(idx1)))
ax1.text(-0.15, 0.95, 'd', transform=ax1.transAxes, size=20, weight='bold')


plt.tight_layout()
# plt.savefig(dat_dir + '../../brie2/figures/fig_s12_OPLvsOL_direction.png', dpi=300)
plt.show()
634
computing velocity graph (using 1/80 cores)

    finished (0:00:00) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
20 20
computing velocity graph (using 1/80 cores)

    finished (0:00:00) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
48 48
computing velocity graph (using 1/80 cores)

    finished (0:00:00) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
83 83
computing velocity graph (using 1/80 cores)

    finished (0:00:00) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
_images/brie2_dentateGyrus_26_9.png
[ ]:

We then detect the DMGs for each cell type vs the rest and use them to explore their impact on cellular transitions.

[15]:
adata_brie = scv.read(dat_dir + "/brie_dentategyrus_cluster.h5ad")

cdr = np.array((adata_brie.X > 0).mean(0))[0, :]

adata_brie
[15]:
AnnData object with n_obs × n_vars = 2930 × 2879
    obs: 'clusters', 'age(days)', 'clusters_enlarged', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size'
    var: 'n_counts', 'n_counts_uniq', 'loss_gene'
    uns: 'Xc_ids', 'brie_losses', 'brie_param', 'brie_version', 'clusters_colors'
    obsm: 'X_umap', 'Xc'
    varm: 'ELBO_gain', 'cell_coeff', 'fdr', 'intercept', 'pval', 'sigma'
    layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'spliced', 'unspliced'
[16]:
top_genes_brie = adata_brie.var.index[np.argsort(list(np.max(adata_brie.varm['ELBO_gain'], axis=1)))[::-1][:100]]
top_genes_brie[:4]
[16]:
Index(['Celf2', 'Vmp1', 'Myl6', 'Snap25'], dtype='object', name='index')
[17]:
print(adata.var['velocity_r2'][top_genes_brie[:4]])
scv.pl.velocity(adata, var_names=top_genes_brie[:2], colorbar=True, ncols=1)
index
Celf2     0.318048
Vmp1     -0.736003
Myl6      0.338808
Snap25   -0.970810
Name: velocity_r2, dtype: float64
_images/brie2_dentateGyrus_31_1.png
[18]:
import seaborn as sns

scVelo_genes = ['Tmsb10', 'Camk2a', 'Ppp3ca', 'Igfbpl1']
adata_brie.varm['max_EGain'] = np.round(np.max(adata_brie.varm['ELBO_gain'], axis=1, keepdims=True), 2)

fig = plt.figure(figsize=(20, 8), dpi=40)
brie.pl.counts(adata_brie, genes=np.append(top_genes_brie[:4], scVelo_genes),
               layers=['spliced', 'unspliced'],
               color='clusters', add_val='max_EGain',
               nrow=2, alpha=0.7, legend=False,
               palette=sns.color_palette(adata_brie.uns['clusters_colors']),
               hue_order=np.unique(adata_brie.obs['clusters']), noise_scale=0.1)

# plt.savefig(dat_dir + '../../brie2/figures/fig_s10_DMG_counts.png', dpi=200)
plt.show()
_images/brie2_dentateGyrus_32_0.png
[19]:
idx = (np.max(adata_brie.varm['ELBO_gain'], axis=1) > 5) * (cdr > 0.15)
gene_use = adata_brie.var.index[idx]

print(len(gene_use), sum(brie.match(gene_use, adata.var.index) != None))

scv.tl.velocity_graph(adata, gene_subset=gene_use)
scv.pl.velocity_embedding_stream(adata, basis='umap',
                                 color=['clusters', 'age(days)'],
                                 legend_fontsize=5, dpi=60)
297 297
computing velocity graph (using 1/80 cores)

    finished (0:00:03) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
_images/brie2_dentateGyrus_33_3.png
[20]:
## Plotting with higher figure resolution

# scv.pl.velocity_embedding_stream(adata, basis='umap', color=['clusters'],
#                                  legend_fontsize=5, dpi=150, title='')

# scv.pl.velocity_embedding(adata, basis='umap', arrow_length=2, arrow_size=2, dpi=300, title='')
[ ]:

scVelo’s dynamical model

Dynamical model

adata = scv.datasets.dentategyrus()

scv.pp.filter_and_normalize(adata, min_shared_counts=30, n_top_genes=3000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

scv.tl.recover_dynamics(adata, var_names='all')
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
scv.tl.latent_time(adata)
adata.write(dat_dir + "/dentategyrus_scvelo_dyna_3K.h5ad")
[21]:
adata_dyna = scv.read(dat_dir + "/dentategyrus_scvelo_dyna_3K.h5ad")
[22]:
print(np.sum(adata_dyna.var['velocity_genes']))
1066
[23]:
# scv.tl.velocity_graph(adata_dyna, gene_subset=None)
scv.pl.velocity_embedding_stream(adata_dyna, basis='umap',
                                 color=['clusters', 'age(days)'],
                                 legend_fontsize=5, dpi=60)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
_images/brie2_dentateGyrus_40_1.png
[ ]:

Prior distribtuion

[3]:
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
[21]:
import matplotlib.pyplot as plt

Logit-Normal

See more on logit-normal distribution Wikipedia page.

In practicce, we use \(\mu=0, \sigma=3.0\) as prior

[30]:
def _logit(x):
    return tf.math.log(x / (1 - x))

_logit(np.array([0.6]))
[30]:
<tf.Tensor: id=7912, shape=(1,), dtype=float64, numpy=array([0.40546511])>
[40]:
_means = np.array([0.05, 0.3, 0.5, 0.7, 0.95], dtype=np.float32)
_vars = np.array([0.1, 0.5, 1.0, 1.5, 2.0, 3.0], dtype=np.float32)

xx = np.arange(0.001, 0.999, 0.001).astype(np.float32)

fig = plt.figure(figsize=(10, 6))
for j in range(len(_vars)):
    plt.subplot(2, 3, j + 1)
    for i in range(len(_means)):
        _model = tfd.Normal(_logit(_means[i:i+1]), _vars[j:j+1])
        _pdf = _model.prob(_logit(xx)) * 1 / (xx * (1 - xx))
        plt.plot(xx, _pdf, label="mean=%.2f" %(_means[i]))
    plt.title("std=%.2f" %(_vars[j]))
    if j == 5:
        plt.legend(loc="best")

plt.tight_layout()
plt.show()
_images/Prior_distribution_BRIE2_5_0.png
[73]:
_model = tfd.Normal([0], [3])
plt.hist(np.array(tf.sigmoid(_model.sample(1000))).reshape(-1), bins=100)
plt.show()
_images/Prior_distribution_BRIE2_6_0.png

Gamma distribution

See more on Gamma distribution Wikipedia page In practice, we use \(\alpha=10, \beta=3\) as prior

[65]:
_alpha = np.array([2, 3, 5, 10, 15], dtype=np.float32)
_beta = np.array([0.5, 1, 2, 3, 5, 8], dtype=np.float32)

xx = np.arange(0.01, 10, 0.1).astype(np.float32)

fig = plt.figure(figsize=(10, 6))
for j in range(len(_vars)):
    plt.subplot(2, 3, j + 1)
    for i in range(len(_means)):
        _model = tfd.Gamma(_alpha[i:i+1], _beta[j:j+1])
        _pdf = _model.prob(xx)
        plt.plot(xx, _pdf, label="alpha=%.2f" %(_alpha[i]))
    plt.title("beta=%.2f" %(_beta[j]))
    if j == 5:
        plt.legend(loc="best")

plt.tight_layout()
plt.show()
_images/Prior_distribution_BRIE2_9_0.png
[70]:
_model = tfd.Gamma([10], [3])
plt.hist(_model.sample(1000).numpy().reshape(-1), bins=100)
plt.show()
_images/Prior_distribution_BRIE2_10_0.png
[ ]: