Home¶
Top News¶
- [29/05/2022] We have released v2.2 that fully supports counting droplet-based data for both Skipping Exon events and other types of splcing events. See the brie-count manual
- [29/05/2022] We have include small-sized test data sets (15MB) for both smart-seq2 and 10x Genomics. See data in brie-tutorials/tests repo
About BRIE¶
Welcome to the new BRIE (>=2.0 or BRIE2), Bayesian Regression for Isoform Estimate, a scalable Bayesian method to accurately identify splicing phenotypes in single-cell RNA-seq experiments and quantify isoform proportions and their uncertainty.
BRIE2 supports the analysis of splicing processes at two molecular levels, either between alternative splicing isoforms or between unspliced and spliced RNAs. In either case, it returns cell-by-event or cell-by-gene matrices of PSI value and its 95% confidence interval (quantification) and the statistics for detecting DAS and DMG on each event or gene:
- Differential alternative splicing (DAS): This task is to quantify the proportions of alternative splicing isoforms and to detect DAS between groups of cells or along with a continuous covariate, e.g., pseudotime. BRIE2 is designed for two-isoform splicing events with a focus on exon skipping, but in principle also applicable for mutual exclusion, intron-retaining, alternative poly-A site, 3’ splice site and 5’ splice site.
- Differential momentum genes (DMG): This task is to quantify the proportions of unspliced and spliced RNAs in each gene and each cell. Similar to DAS, the DMG is a principled selection of genes that capture heterogeneity in transcriptional kinetics between cell groups, e.g., cell types, or continuous cell covariates, hence may enhance the RNA velocity analyses by focusing on dynamics informed genes.
Note
Though we highly recommend using BRIE v2 for a coherent way for splicing phenotype selection, BRIE1 CLI (MCMC based & gene feature only) is still available but the CLIs are changed to brie1 and brie1-diff.
Questions and Issues¶
If you find any technical issues in the codes, we will appreciate your report. Please write them in the GitHub issues: https://github.com/huangyh09/brie/issues
If you have other specific questions on using BRIE, feel free to get in touch with us: yuanhua <at> hku.hk
Quick Resources¶
- Code on GitHub: https://github.com/huangyh09/brie
- Preprocessed splicing events annotations (GFT/GFF3): http://sourceforge.net/projects/brie-rna/files/annotation/
- Example datasets: https://github.com/huangyh09/brie-tutorials
- All releases: https://pypi.org/project/brie/#history
- Issue reports: https://github.com/huangyh09/brie/issues
References¶
- Yuanhua Huang and Guido Sanguinetti. BRIE2: computational identification of splicing phenotypes from single-cell transcriptomic experiments. Genome Biology, 2021; 22(1):251.
- Yuanhua Huang and Guido Sanguinetti. BRIE: transcriptome-wide splicing quantification in single cells. Genome Biology, 2017; 18(1):123.
Installation¶
Environment setting (optional)¶
For using BRIE2, we recommend creating a separated conda environment for easier and cleaner management of dependent packages, particularly TensorFlow and TensorFlow-probability, which are under active development.
The following command line in terminal (Linux or MacOS) will create a conda
environment with name TFProb
, probably in the path ~/.conda/envs/TFProb
:
conda create -n TFProb python=3.7
Alternatively, you can create the environment in another path by replacing
-n TFProb
with -p ANY_PATH/TFProb
to specify the path for conda
environment. Then you can check your environment by conda env list
and
activate the environment by conda activate TFProb
or the full path, before
start installing brie and other packages.
Easy install¶
BRIE2 is available on PYPI. To install, type the following command
line, and add -U
for upgrading:
pip install -U brie
Alternatively, you can install from this GitHub repository for the latest (often development) version by the following command line
pip install -U git+https://github.com/huangyh09/brie
In either case, if you don’t have write permission for your current Python
environment, add --user
, but check the previous section above for creating
your own conda environment.
GPU usage¶
One of the key benefits of using TensorFlow backend is its direct support of GPU for substantial speedups (generally >10x). Here is one way to set up GPU configurations with NVIDIA GPU on Ubuntu:
pip install -U tensorflow-gpu
conda install -c anaconda cudatoolkit
Make sure that you have compatible versions between tensorflow and NVIDIA CUDA. You can check TF’s test here. For more information on GPU configuration, please refer to the Tensorflow documentation or anaconda GPU.
Once successfully configured, you will see log info like the following when
using brie-quant
:
$ brie-quant
I tensorflow/stream_executor/platform/default/dso_loader.cc:53]
Successfully opened dynamic library libcudart.so.11.0
Welcome to brie-quant in BRIE v2.0.5!
Note
At the moment, TensorFlow calls all available GPUs, which is not necessary.
You can specify the card (e.g., card 3) by adding the below variable before
your command line CUDA_VISIBLE_DEVICES=3 brie-quant -i my_count.h5ad
Test¶
In order to test the installation, you could type brie-quant
. If successful,
you will see the following output.
Welcome to brie-quant in BRIE v2.0.2!
use -h or --help for help on argument.
If you install BRIE successfully, but can’t run it and see the error below,
then check whether its directory is added to PATH
environment.
brie-quant: command not found
Usually, the directory is ~/.local/bin
if you don’t use Anaconda. You could add
the path into PATH
environment variable, by writing the following line into
.profile
or .bashrc
file.
export PATH="~/.local/bin:$PATH"
If you have any issues, please report them to the issue on brie issues.
Quick start¶
BRIE2 estimates the splicing proportion for two-component events across many cells. It is designed for analysing two molecular levels of splicing processes:
- Alternative splicing isoforms, e.g., exon included or excluded. BRIE2 will identify differential alternative splicing (DAS) events. See demo1 for detecting DAS between disease and control.
- Spliced vs unspliced RNAs for RNA velocity analysis. BRIE2 will identify differential momentum genes (DMG). See demo2 for detecting DMG between cell types.
For getting started quickly, there are two steps to go: counting and
quantifying.
The brie-quant
(for quantifying and testing) is developed in a unified
manner. Here we provide a roadmap of using BRIE2 for both molecular levels
in different purposes (quantification or feature detection).

Pre-step: read counting¶
First, you need to count the isoform-specific reads in each splicing event in each cell. Similarly for read/UMI counts for spliced and unspliced RNAs. BRIE2 only provides a utility function for read counts in alternative splicing. For counting spliced/unspliced reads, we provide some tips in the unspliced RNA counting section.
For alternative splicing, e.g., exon-skipping event, you can download the splicing annotations generated by us or make your own, e.g., with briekit.
Then you can use the brie-count
to fetch the read count tensor, which will
be stored in hdf5 format as AnnData. See more details on brie-count
CLI, and you can use this example command line.
# for smart-seq
brie-count -a AS_events/SE.gold.gtf -S sam_and_cellID.tsv -o out_dir -p 15
# for droplet, e.g. 10x Genomics
brie-count -a AS_events/SE.gold.gtf -s possorted.bam -b barcodes.tsv.gz -o out_dir -p 15
Besides the SE event, other types of alternative splicing, e.g., intron retaining is also applicable with BRIE. Some pre-processing utilities will be available soon.
Main Step: quantify and detect¶
brie-quant
is used for this step. If GPU is available, we highly
recommend using GPU for ~10x speedup comparing CPU server. Thanks to good
support from Tensorflow, the environment setting is straightforward, see our
guide on GPU Usage.
This step supports various settings for either quantification of splicing or detection of splicing phenotypes. You can directly run step 1.2 if you only want to perform phenotype detection.
PSI quantification¶
You can quantify the isoform with cell or gene features or both or none. Usually,
we recommend using aggregated imputation even if you don’t have any feature,
namely mode2-quant
in brie-quant CLI as follows (please add
--interceptMode gene
for aggregating cells as prior for each gene),
brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_aggr.h5ad --interceptMode gene
Splicing Phenotype detection¶
If you have cell-level features, e.g., disease condition or cell type or
continuous variable, you can use it in cell features to detect variable splicing
events or differential momentum genes as phenotypes for further analysis. This
is mode2-diff
in brie-quant CLI, so requires -c
and
--LRTindex
.
brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_cell.h5ad \
-c $DATA_DIR/cell_info.tsv --interceptMode gene --LRTindex=All
Example
Please see the example in brie-quant CLI mode 3, and MS data.
Downstream Step: analysis¶
The BRIE output AnnData files are compatible with Scanpy, hence you can easily use it for dimension reduction, clustering, and other visualization. A few examples for both alternative splicing and RNA velocity are also available in this documentation (see the navigation bar on the left).
Others for unspliced RNA counting¶
BRIE2 doesn’t provide a utility function for counting the spliced and unspliced RNAs, but thanks to the community efforts, there are a few tools already available for this purpose:
- velocyto.py: the earliest software for this purpose. Generally not computationally efficient, possible due to written in Python. For unknown reasons, the proportion of unspliced RNA is unrealistically high for 5’ scRNA-seq data based on 10x Genomics.
- dropEst: as implemented in C/C++, it is much more efficient. It also returns more reasonable proportions of unspliced RNAs for 5’ 10x Genomics data
- STAR-solo: new extension for the popular STAR. Benefits: efficient and one step for reads alignment and counting of unspliced RNA (Recommended option)
The first two options take inputs as aligned bam file(s), and STAR-solo itself is a widely used aligner and provides the count matrices directly. All these options align reads to genome and define reads as unspliced and spliced by the gene annotations in GTF/GFF3 format.
Alternatively, there are other options by aligning reads to annotated transcriptomes directly e.g., kallisto bustools. However, the agreement of the above counting tools is still not perfect according to a recent benchmarking paper (Soneson et al, Plos Comp Bio, 2021)
brie-count CLI¶
BRIE (>=2.0.0) provides two CLI directly available in your Python path:
brie-count
, brie-quant
.
From v2.1, the brie-count
CLI supports any type of splicing events,
with primarily focusing on exon-skipping events:
- exon skipping (SE): we provide a curated annotation for human and mouse; the
output is in h5ad format, seamlessly for downstream analysis with
brie-quant
- other types: in generally,
brie-count
returns reads (or UMIs) counts for that are compatible to isoform groups (regardless of the number of exons it contains). However, the downstream analysis (e.g., differentail splicing) withbrie-quant
mainly supports two-isoform based events. Also, users may need to curate the splicing annotation themselves (we may provide some curation in future).
From v2.2, the brie-count
CLI supports counting on both well-based
platform (e.g., smart-seq2) where each cell has a separate bam (sam/cram) file
and droplet-based platform (e.g., 10x Genomics) where all cells are pooled in
one big bam file through cell barcodes.
Splicing annotations¶
As input, it requires a list of annotated splicing events. We have generated annotations for human and mouse. If you are using it, please align RNA-seq reads to the according version (or close version) of reference genome. Alternatively, you can use briekit package to generate.
Read counting¶
The brie-count
CLI calculates a count tensor for the number of reads that
are aligned to each splicing event and each cell, and stratifies four them into
four different categories (using SE as an example):
- key 1: Uniquely aligned to isoform1, e.g., in exon1-exon2 junction in SE event
- key 2: Uniquely aligned to isoform2, e.g., in exon1-exon3 junction in SE event
- key 3: Ambiguously aligned to isoform1 and isoform2, e.g., within exon1
- key 0: Partially aligned in the region but not compatible with any of the two isoforms. We suggest ignoring these reads.
Note
You can decode isoform compatibility from the key index by using the decimal-bo-binary
For example, the key 5 (decimal) can be decoded to 101 (binary), which means compatible with both isoform 1 (the last) and 3 (the third from right).
Then you fetch the counts on a list of bam files by the command line like this:
# for smart-seq
brie-count -a AS_events/SE.gold.gtf -S sam_and_cellID.tsv -o out_dir -p 15
# for droplet, e.g. 10x Genomics
brie-count -a AS_events/SE.gold.gtf -s possorted.bam -b barcodes.tsv.gz -o out_dir -p 15
By default, you will have four output files in the out_dir: brie_count.h5ad
,
read_count.mtx.gz
, cell_note.tsv.gz
, and gene_note.tsv.gz
. The
brie_count.h5ad
contains all information for downstream analysis, e.g., for
brie-quant.
Options¶
There are more parameters for setting (brie-count -h
always give the version
you are using):
Usage: brie-count [options]
Options:
-h, --help show this help message and exit
-a GFF_FILE, --gffFile=GFF_FILE
GTF/GFF3 file for gene and transcript annotation
-o OUT_DIR, --out_dir=OUT_DIR
Full path of output directory [default: $samFile/brieCOUNT]
SmartSeq-based input:
-S SAMLIST_FILE, --samList=SAMLIST_FILE
A no-header tsv file listing sorted and indexed bam/sam/cram
files. Columns: file path, cell id (optional)
Droplet-based input:
-s SAM_FILE, --samFile=SAM_FILE
One indexed bam/sam/cram file
-b BARCODES_FILE, --barcodes=BARCODES_FILE
A file containing cell barcodes without header
--cellTAG=CELL_TAG Tag for cell barocdes [default: CB]
--UMItag=UMI_TAG Tag for UMI barocdes [default: UR]
Optional arguments:
--verbose Print out detailed log info
-p NPROC, --nproc=NPROC
Number of subprocesses [default: 4]
-t EVENT_TYPE, --eventType=EVENT_TYPE
Type of splicing event for check. SE: skipping-exon; Any: no-
checking [default: SE]
brie-quant CLI¶
The brie-quant
CLI (in brie>=2.0.0) uses the newly developed variational
inference methods scalable to large data sets, which works both in CPU or
GPU with the TensorFlow Backend.
For using BRIE1 (<=0.2.4) with an MCMC sampler,
please refer to BRIE1.
This command allows to quantify the splicing proportion Psi and detect variable splicing events/genes along with cell-level features, e.g., cell type, disease condition, and development time. It can handle both alternative splicing isoforms or unspliced vs spliced ratios in a unified framework.
As a Bayesian method, the key philosophy of BRIE is to combine likelihood (data-driven) and prior (uninformative or informative). In BRIE2, a variety of prior settings are supported, as follows.
Note
The mode2 (quant and diff) with cell features is the recommended option in BRIE2, for both quantifying PSI and identifying differential momentum genes for RNA velocity analysis and differential splicing events on alternative splicing for either categorical or continuous covariates.
For unspliced vs spliced in RNA velocity analysis, please add argument
--layers spliced,unspliced
, as by default is set for alternative splicing.
Mode2-quant: Aggregated imputation¶
This mode requires argument --interceptMode gene
, which means fitting an
offset for each gene (or splicing event). The offset denotes the mean of the
prior distribution through aggregation, hence learns a prior
shared by all cells on each gene. The benefit for this mode is that dimension
reduction can be performed, e.g., PCA and UMAP on splicing PSI matrix.
As there are many
splicing events with low total reads, they will have a high variance in the
estimation, and is often suggested filtered out, which will cause missing values.
Based on the cell aggregated imputation, most dimension reduction methods can be
used, even it doesn’t support missing values.
Example command line for mode 2:
brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_aggr.h5ad --interceptMode gene
Mode2-diff: Variable splicing detection¶
This mode requires argument -c
for cell features and --LRTindex
for the
index (zero-based) of cell features to perform likelihood ratio test. Again we
suggest keeping the cell aggregation on each gene by --interceptMode gene
.
Then this mode will learn a prior from the given cell level features and perform the second round fitting with leaving one feature out each time to calculate the ELBO gain, which can be used as a surrogate for Bayes factor.
Example command line for mode 3:
brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_cell.h5ad \
-c $DATA_DIR/cell_info.tsv --interceptMode gene --LRTindex=All
Example
As an example in the
MS data,
we have cells labelled with
two factors 1) multiple sclerosis & control - isEAE, 2) two mouse strains
- isCD1. We want to identify alternative splicing events associated with the
first factor multiple sclerosis, but also want to consider the potential
confounder in mouse strain, we could set the design matrix in cell_info.tsv
file as follow, along with parameters --interceptMode gene --LRTindex 0
.
samID isEAE isCD1
SRR7102862 0 1
SRR7103631 1 0
SRR7104046 1 1
SRR7105069 0 0
Note
Be very careful on collinearity (i.e., redundancy) of your design matrix in
cell_info.tsv
and the constant intercept (if use
--interceptMode gene
), which is similar to DEG in edgeR or DESeq2.
By default, BRIE2 uses the --testBase=full
mode to detect differential
splicing by comparing all features + constant intercept versus leaving the
testing feature(s) out. In this setting, if your features can’t have
collinearity with intercept, e.g., you may need to remove one cell type out
if it covers all your cells.
Alternatively, you can change to another strategy --testBase=null
by
comparing the testing feature(s) + intercept versus intercept only,
like the example data on
Dentate Gyrus.
Mode 1: Imputation with gene features¶
This mode is introduced in BRIE 1, where genomic sequences are leverage to learn a prior distribution of PSI. Then the predicted distribution of PSI from sequence features are combined with the likelihood obtained from the observed read counts. This framework provides a coherent way to combine the quantification from read counts and imputation from genomic features.
Initially, the inference in this mode is achieved by MCMC sample per cell separated, which is generally slow for over hundreds of cells. BRIE2’s variational framework keeps supporting the gene features and performs the inference for all cells in one go, though they are independent.
Example command line for Mode 1. We suggest use --interceptMode cell
to
learn an offset for each cell:
brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_gene.h5ad \
-g $DATA_DIR/gene_feature.tsv --interceptMode cell
Note
For the sake of convenience, we now recommend using Mode2-quant below to perform imputation, which leverages the average PSI values in a cell population to function as an informative prior.
Mode 0: None imputation¶
In this mode, the prior is an uninformative logit-normal distribution with mean=0, and learned variance. Therefore, if a splicing event in a gene doesn’t have any read, it will return a posterior with Psi’s mean=0.5 and 95% confidence interval around 0.95 (most cases >0.9).
This setting is used if you have high covered data and you only want to calculate cells with sufficient reads for each interesting gene, e.g., by filtering out all genes with Psi_95CI > 0.3.
Otherwise, the 0.5 imputed genes will be confounded by the expression level, instead of the isoform proportion.
Example command line for mode 1:
brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_pure.h5ad --interceptMode None
All parameters¶
There are more parameters for setting (brie-quant -h
always give the version
you are using):
Usage: brie-quant [options]
Options:
-h, --help show this help message and exit
-i IN_FILE, --inFile=IN_FILE
Input read count matrices in AnnData h5ad or brie npz format.
-c CELL_FILE, --cellFile=CELL_FILE
File for cell features in tsv[.gz] with cell and feature ids.
-g GENE_FILE, --geneFile=GENE_FILE
File for gene features in tsv[.gz] with gene and feature ids.
-o OUT_FILE, --out_file=OUT_FILE
Full path of output file for annData in h5ad [default:
$inFile/brie_quant.h5ad]
--LRTindex=LRT_INDEX Index (0-based) of cell features to test with LRT: All, None
or comma separated integers [default: None]
--testBase=TEST_BASE Features in testing base model: full, null [default: full]
--interceptMode=INTERCEPT_MODE
Intercept mode: gene, cell or None [default: None]
--layers=LAYERS Comma separated layers two or three for estimating Psi
[default: isoform1,isoform2,ambiguous]
Gene filtering:
--minCount=MIN_COUNT
Minimum total counts for fitltering genes [default: 50]
--minUniqCount=MIN_UNIQ_COUNT
Minimum unique counts for fitltering genes [default: 10]
--minCell=MIN_CELL Minimum number of cells with unique count for fitltering genes
[default: 30]
--minMIF=MIN_MIF Minimum minor isoform frequency in unique count [default:
0.001]
VI Optimization:
--MCsize=MC_SIZE Sample size for Monte Carlo Expectation [default: 3]
--minIter=MIN_ITER Minimum number of iterations [default: 5000]
--maxIter=MAX_ITER Maximum number of iterations [default: 20000]
--batchSize=BATCH_SIZE
Element size per batch: n_gene * total cell [default: 500000]
--pseudoCount=PSEUDO_COUNT
Pseudo count to add on unique count matrices [default: 0.01]
Release notes¶
Release in future¶
Release v2.2.2 (04/11/2022)¶
- support limit the CPU threads used by tensorflow in brie-quant (default 6)
Release v2.2.1 (14/07/2022)¶
- fix a bug for –eventType=Any in smart-seq
- change to setting for save to h5ad file for all events with only two-isoforms
Release v2.2.0 (29/05/2022)¶
- Big new feature brie-count now supports cell barcodes based platforms, including 10x Genomics
- provide small test files in the tests folder
- Update brie-count manual
Release v2.1.0 (22/03/2022)¶
- Enable brie-count to support other types of splicing events, not only SE
- Fix a minor bug in exteme scenario with missing a certain isoform for all genes
Release v2.0.6 (23/09/2021)¶
- Fix in format bug in brie-count
- Update brie.pl.volcano for using ELBO_gain as default y-axis
- Update manual according to the revised paper
- Add CLI
brie
Release v2.0.5 (04/11/2020)¶
- Support saving detection table to tsv file
- Add the exon start and stop positions in brie-count
Release v2.0.4 (26/09/2020)¶
- Tune the learning rate with multiple values
- For test model, the fitted parameters will be used as initials
- Support base model with full features or null feature
- For gene feature only, switch sigma into per cell base
- Add noise term in simulator
- A few minor bug fix
- Implement a Inv-Gamma prior distribution for sigma (not in use)
Release v2.0.3 (26/08/2020)¶
- Support to use minimum minor isoform frequency to fileter genes (default=0.001)
- Add pseudo count (default=0.01) for none-empty element in both unique counts for more robust results
- Reduce the sample size for Monte Carlo Expectation (10 to 3) for computational efficiency
- Restructure the arguments in brie-quant
- Initialise the example notebook on multiple sclerosis data
Release v2.0.2 (24/08/2020)¶
- Fix minor bugs in brei-count and brie-quant cli for compatibility
Release v2.0.0 (23/08/2020)¶
- Change the whole BRIE model from MCMC sampler (v1) to variational inference (v2)
- Change the usage of each read to a summarised read counts for speedup
- Support splicing quantification without any feature or cell features or gene features or both types of features.
- Support detection of variable splicing event associated with cell features
- Support acceleration with Graphic card (Nvidia GPU)
- Compatible with Scanpy analysis suite with a variety of plotting functions
- Restructure the whole package
- BRIE earlier version is still avaible with brie1
Release v0.2.4 (21/10/2019)¶
- fix a bug that fragment length longer than transcript length
Release v0.2.2 (15/01/2019)¶
- move __version__ into a separate file, so import brie is not required before installation.
- support cram file as input aligned reads
Release v0.2.0 (03/06/2018)¶
- split the pre-processing functions in BRIE to another separate package BRIE-kit, as some functions in the pre-processing require Python2 environment. So, BRIE will keep two functions brie and brie-diff, while BRIE-kit will have briekit-event, briekit-event-filter, and briekit-factor. See BRIE-KIT: https://github.com/huangyh09/briekit/wiki
Release v0.1.5 (03/06/2018)¶
- support gff in gz format
- add an example with 130 cells
- brie-diff supporting ranking genes
Release v0.1.4 (02/06/2018)¶
- turn off pylab
- fix a bug for function rev_seq, reported in issue #10
- update documentation
Release v0.1.3 (16/04/2017)¶
brie-diff
takes multiple cells, which could handle pair-wise comparisons for 100 cells in ~10min with 30 CPUs; and 1000 cells within a few hours.- Simulation wraps on Spanki are provided for simulating splicing events at different coverages or drop-out probability and drop-out rate for single cells: https://github.com/huangyh09/brie/tree/master/simulator
Release v0.1.2 (13/01/2017)¶
- support Python 3.x now
- do not depend on h5py anymore for hdf5 data storage.
brie-factor
returns xxx.csv.gz rather than xxx.h5brie
returns sample.csv.gz rather than sample.h5brie-diff
takes sample.csv.gz rather than sample.h5
Release v0.1.1 (02/01/2017)¶
- change licence to Apache License v2
- update
brie-event-filter
Release v0.1.0 (29/12/2016)¶
- Initial release of BRIE
Multiple Sclerosis¶
This data set was generated by Falcão et al, 2018 containing 2,208 mouse cells probed by SMART-seq2 with half experimental autoimmune encephalomyelitis (EAE) cells (in mimicing multiple scleorsis) and half control cells.
Here, we will illustrate how BRIE2 can be used to effectively detect differential splicing events between EAE and control cells, and using splicing ratio to predict cell type and disease conditions.
We have pre-processed the data with these BRIE2 scripts with brie-count
and brie-quant
. You can download the pre-computed data, e.g., by using the following command line and unzip it into the ./data
folder for running this notebook.
wget http://ufpr.dl.sourceforge.net/project/brie-rna/examples/msEAE/brie2_msEAE.zip
unzip -j brie2_msEAE.zip -d ./data
Load Packages¶
[1]:
import umap
[2]:
import os
import brie
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
print(brie.__version__)
2021-09-23 21:29:03.815258: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-23 21:29:03.815359: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2.0.6
[3]:
# define the path you store the example data
# dat_dir = "./data"
dat_dir = "/storage/yhhuang/research/brie2/releaseDat/msEAE/"
BRIE2 option 1: differential splicing events¶
Detecting differential splicing event by regressing on the EAE state with considering the strain covariate.
The command line is available at run_brie2.sh. Here is an example of design matrix besides the intercept and isEAE
as the only testing variable.
samID isEAE isCD1
SRR7102862 0 1
SRR7103631 1 0
SRR7104046 1 1
SRR7105069 0 0
Besides the big .h5ad file, you can use the detected splicing events directely from brie_quant_cell.brie_ident.tsv. However, if you want get the quantify Psi
for downstream analysis, e.g., option2 below, you need load the adata.layers['Psi']
, and adata.layers['Psi_95CI']
.
[4]:
adata = sc.read_h5ad(dat_dir + "/brie_quant_cell.h5ad")
adata
[4]:
AnnData object with n_obs × n_vars = 2208 × 3780
obs: 'samID', 'samCOUNT'
var: 'GeneID', 'GeneName', 'TranLens', 'TranIDs', 'n_counts', 'n_counts_uniq', 'loss_gene'
uns: 'Xc_ids', 'brie_losses', 'brie_param', 'brie_version'
obsm: 'Xc'
varm: 'ELBO_gain', 'cell_coeff', 'effLen', 'fdr', 'intercept', 'p_ambiguous', 'pval', 'sigma'
layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'isoform1', 'isoform2', 'poorQual'
[5]:
adata.uns['brie_param']
[5]:
{'LRT_index': array([0]),
'base_mode': 'full',
'intercept_mode': 'gene',
'layer_keys': array(['isoform1', 'isoform2', 'ambiguous'], dtype=object),
'pseudo_count': 0.01}
Change gene index from Ensemebl id to gene name
[6]:
adata.var['old_index'] = adata.var.index
new_index = [adata.var.GeneName[i] + adata.var.GeneID[i][18:] for i in range(adata.shape[1])]
adata.var.index = new_index
[7]:
# if the index is bytes, change it to str
adata.var.index = [x.decode('utf-8') if type(x) == bytes else x for x in adata.var.index]
adata.obs.index = [x.decode('utf-8') if type(x) == bytes else x for x in adata.obs.index]
Add the cell annotations from input covariates
[8]:
adata.obs['MS'] = ['EAE' if x else 'Ctrl' for x in adata.obsm['Xc'][:, 0]]
adata.obs['isCD1'] = ['CD1_%d' %x for x in adata.obsm['Xc'][:, 1]]
Volcano plot¶
Here, BRIE2 detects the differential splicing events between EAE and control. Cell_coeff means the effect size on logit(Psi). Positive value means higher Psi in isEAE.
[9]:
## volcano plot for differential splicing events
brie.pl.volcano(adata, y='ELBO_gain', log_y=False, n_anno=16, score_red=7, adjust=True)
plt.xlabel('cell_coeff: effect size on logit(Psi)')
plt.title("MS differential splicing")
[9]:
Text(0.5, 1.0, 'MS differential splicing')

[10]:
DSEs = adata.var.index[adata.varm['ELBO_gain'][:, 0] >= 7]
len(DSEs), len(np.unique([x.split('.')[0] for x in DSEs])), adata.shape
[10]:
(129, 123, (2208, 3780))
[11]:
adata.uns['Xc_ids']
[11]:
array(['isEAE', 'isCD1'], dtype=object)
Visualize raw counts for DSEs¶
[12]:
rank_idx = np.argsort(adata.varm['ELBO_gain'][:, 0])[::-1]
fig = plt.figure(figsize=(15, 8))
brie.pl.counts(adata, genes=list(adata.var.index[rank_idx[:6]]),
color='MS', add_val='ELBO_gain', ncol=3, alpha=0.7)
# fig.savefig(dat_dir + '../../figures/fig_s8_counts.png', dpi=300, bbox_inches='tight')
plt.show()
/home/yuanhua/MyGit/brie/brie/plot/base_plot.py:53: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later.
plt.subplot(nrow, ncol, i + 1)

Changed ratio between ambiguous reads¶
Some differential splicing events having changed ratio on ambigous reads and isoform specific reads
[13]:
### Differential splicing shown by isoform1 and ambiguous reads
fig = plt.figure(figsize=(5, 4))
brie.pl.counts(adata, genes='Ddhd1',
color='MS', add_val='ELBO_gain',
layers=['isoform1', 'ambiguous'],
nrow=1, alpha=0.7)
plt.show()

[ ]:
BRIE2 option 2: splicing quantification and usage¶
If you aim to use splicing isoform proportions as phenotypes for downstream analysis, e.g., to identify cell types, or study its disease contribution, we suggest not including the cell feature as covariate when quantifying Psi
to avoid circular analysis.
Instead you could use gene level features, e.g., sequence features we introduced in BRIE1, or you can use an aggregated value from all cells as an informative prior. Similar as BRIE1, this prior is learned from the data and has a logit-normal distribution, so the information prior usually won’t be too strong.
In general, we recommend using the aggregation of cells as the prior thanks to its simplicity if it does not violate your biological hypothesis.
[14]:
adata_aggr = sc.read_h5ad(dat_dir + "/brie_quant_aggr.h5ad")
adata_aggr
[14]:
AnnData object with n_obs × n_vars = 2208 × 3780
obs: 'samID', 'samCOUNT'
var: 'GeneID', 'GeneName', 'TranLens', 'TranIDs', 'n_counts', 'n_counts_uniq'
uns: 'brie_losses'
obsm: 'Xc'
varm: 'cell_coeff', 'effLen', 'intercept', 'p_ambiguous', 'sigma'
layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'isoform1', 'isoform2', 'poorQual'
[15]:
## change gene index
print(np.mean(adata.var['old_index'] == adata_aggr.var.index))
adata_aggr.var.index = adata.var.index
1.0
[16]:
# if the index is bytes, change it to str
adata_aggr.var.index = [x.decode('utf-8') if type(x) == bytes else x
for x in adata_aggr.var.index]
adata_aggr.obs.index = [x.decode('utf-8') if type(x) == bytes else x
for x in adata_aggr.obs.index]
Add meta data and gene-level annotation¶
Cell annotations and UMAP from gene-level expression. You can add any additional annotation.
[17]:
print(np.mean(adata.obs.index == adata_aggr.obs.index))
adata_aggr.obs['MS'] = adata.obs['MS'].copy()
adata_aggr.obs['isCD1'] = adata.obs['isCD1'].copy()
1.0
[18]:
dat_umap = np.genfromtxt(dat_dir + '/cell_X_umap.tsv', dtype='str', delimiter='\t')
mm = brie.match(adata_aggr.obs.index, dat_umap[:, 0])
idx = mm[mm != None].astype(int)
adata_aggr = adata_aggr[mm != None, :]
adata_aggr.obsm['X_GEX_UMAP'] = dat_umap[idx, 1:3].astype(float)
adata_aggr.obs['cluster'] = dat_umap[idx, 3]
adata_aggr.obs['combine'] = [adata_aggr.obs['cluster'][i] + '-' + adata_aggr.obs['MS'][i]
for i in range(adata_aggr.shape[0])]
[19]:
adata_aggr
[19]:
AnnData object with n_obs × n_vars = 1882 × 3780
obs: 'samID', 'samCOUNT', 'MS', 'isCD1', 'cluster', 'combine'
var: 'GeneID', 'GeneName', 'TranLens', 'TranIDs', 'n_counts', 'n_counts_uniq'
uns: 'brie_losses'
obsm: 'Xc', 'X_GEX_UMAP'
varm: 'cell_coeff', 'effLen', 'intercept', 'p_ambiguous', 'sigma'
layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'isoform1', 'isoform2', 'poorQual'
Cell filtering¶
[20]:
plt.hist(np.log10(adata_aggr.X.sum(1)[:, 0] + 1), bins=30)
plt.xlabel("log10(total reads)")
plt.ylabel("Cell frequency")
plt.show()

[21]:
# min_reads = 15000
min_reads = 3000
adata_aggr = adata_aggr[adata_aggr.X.sum(axis=1) > min_reads, :]
adata = adata[adata_aggr.obs.index, :]
Visualizing splicing phenotypes in Gene expression UMAP¶
[22]:
sc.pl.scatter(adata_aggr, basis='GEX_UMAP', color=['cluster', 'MS'], size=30)
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
c.reorder_categories(natsorted(c.categories), inplace=True)
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/anndata/_core/anndata.py:1229: ImplicitModificationWarning: Initializing view as actual.
"Initializing view as actual.", ImplicitModificationWarning
Trying to set attribute `.obs` of view, copying.
... storing 'MS' as categorical
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
c.reorder_categories(natsorted(c.categories), inplace=True)
Trying to set attribute `.obs` of view, copying.
... storing 'isCD1' as categorical
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
c.reorder_categories(natsorted(c.categories), inplace=True)
Trying to set attribute `.obs` of view, copying.
... storing 'cluster' as categorical
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
c.reorder_categories(natsorted(c.categories), inplace=True)
Trying to set attribute `.obs` of view, copying.
... storing 'combine' as categorical

[23]:
sc.pl.scatter(adata_aggr, basis='GEX_UMAP', layers='Psi',
color=['App.AS2', 'Mbp.AS3', 'Emc10'],
size=50, color_map='terrain_r') #'terrain_r'

[24]:
import seaborn as sns
sc.pl.violin(adata_aggr, ['App.AS2', 'Mbp.AS3', 'Emc10'],
layer='Psi', groupby='combine', rotation=90,
inner='quartile', palette=sns.color_palette("Paired"))

Dimension reduction on Psi¶
For downstream analysis, we only using splicing events that have been detected as differential splicing events. As mentioned earlier, the Psi
is re-quantified by not using the cell-level information.
[25]:
# idx = list(adata.varm['fdr'][:, 0] < 0.05)
# adata_psi = adata_aggr[:, idx]
adata_psi = adata_aggr.copy()
adata_psi.layers['X'] = adata_psi.X.astype(np.float64)
adata_psi.X = adata_psi.layers['Psi']
adata_psi.shape
[25]:
(1845, 3780)
[26]:
sc.tl.pca(adata_psi, svd_solver='arpack')
[27]:
adata_psi.obs['Psi_PC1'] = adata_psi.obsm['X_pca'][:, 0]
adata_psi.obs['Psi_PC2'] = adata_psi.obsm['X_pca'][:, 1]
adata_psi.obs['Psi_PC3'] = adata_psi.obsm['X_pca'][:, 2]
[28]:
fig = plt.figure(figsize=(5, 3.7), dpi=80)
ax = plt.subplot(1, 1, 1)
sc.pl.pca(adata_psi, color='combine', size=30, show=False, ax=ax,
palette=sns.color_palette("Paired"))
plt.xlabel("PC1: %.1f%%" %(adata_psi.uns['pca']['variance_ratio'][0] * 100))
plt.ylabel("PC2: %.1f%%" %(adata_psi.uns['pca']['variance_ratio'][1] * 100))
# plt.legend(loc=1)
plt.show()

[29]:
sc.pl.pca_variance_ratio(adata_psi)

[30]:
sc.pp.neighbors(adata_psi, n_neighbors=10, n_pcs=20, method='umap')
sc.tl.umap(adata_psi)
/home/yuanhua/.conda/envs/TFProb/lib/python3.7/site-packages/numba/np/ufunc/parallel.py:363: NumbaWarning: The TBB threading layer requires TBB version 2019.5 or later i.e., TBB_INTERFACE_VERSION >= 11005. Found TBB_INTERFACE_VERSION = 6103. The TBB threading layer is disabled.
warnings.warn(problem)
OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[31]:
sc.pl.umap(adata_psi, color=['MS', 'combine'], size=30)

Visualise the Psi value on Psi-based UMAP¶
[32]:
sc.pl.umap(adata_psi, color=['App.AS2', 'Mbp.AS3', 'Emc10'], size=50, cmap='terrain_r')

Only for the cells with confident Psi estimate¶
[33]:
adata_tmp = adata_psi[adata_psi[:, 'App.AS2'].layers['Psi_95CI'] < 0.3, :]
adata_tmp.X = adata_tmp.layers['Psi']
sc.pl.umap(adata_tmp, color=['App.AS2', 'Mbp.AS3', 'Emc10'], size=50, cmap='terrain_r')

[34]:
sc.pl.scatter(adata_psi, basis='GEX_UMAP', color=['Psi_PC1', 'Psi_PC2', 'Psi_PC3'], size=50)

[ ]:
One potential usage of the quantified Psi
is to identify cell types. Instead of using them to cluster cells, here we show that by using the Psi
(and their principle compoments), the annotated cell types from gene expression can be accurately predicted.
[35]:
np.random.seed = 1
[36]:
_val, _idx = np.unique(pd.factorize(adata_psi.obs['cluster'])[0], return_index=True)
print(_val, _idx)
print(adata_psi.obs['cluster'][_idx])
[0 1 2 3 4] [ 0 8 54 92 890]
SRR7102862 OPC
SRR7102870 MOL
SRR7102925 COP
SRR7102967 VLMC
SRR7103970 MiGl
Name: cluster, dtype: category
Categories (5, object): ['COP', 'MOL', 'MiGl', 'OPC', 'VLMC']
[37]:
import scipy.stats as st
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from hilearn import ROC_plot, CrossValidation
LogisticRegression = linear_model.LogisticRegression()
RF_class = RandomForestClassifier(n_estimators=100, n_jobs=-1)
X1 = adata_psi.obsm['X_pca'][:, :20]
Y1 = pd.factorize(adata_psi.obs['cluster'])[0]
CV = CrossValidation(X1, Y1)
# Y1_pre, Y1_score = CV.cv_classification(model=RF_class, folds=10)
Y2_pre, Y2_score = CV.cv_classification(model=LogisticRegression, folds=10)
[38]:
fig = plt.figure(figsize=(6, 5), dpi=60)
# ROC_plot(Y1, Y1_score[:,1], label="Random Forest", base_line=False)
# ROC_plot(Y1, Y2_score[:,1], label="Logistic Regress")
for i in range(5):
ROC_plot(Y1 == i, Y2_score[:,i],
legend_label=adata_psi.obs['cluster'][_idx[i]],
base_line=False)
ROC_plot(np.concatenate([Y1 == i for i in range(5)]),
Y2_score.T.reshape(-1), legend_label='Overall')
plt.title("Psi based ROC curve: cell type classification")
# fig.savefig(dat_dir + '../../figures/fig_s7_classification.pdf', dpi=300, bbox_inches='tight')
plt.show()

Besides using Psi to predict the cell types, a more interesting task is to predict if a cell is in disease state (i.e., EAE state). Though by only using the Psi
, it achieves moderate performance but clearly improves the prediction together with the gene expression.
[39]:
dat_GEX_pc = np.genfromtxt(dat_dir + 'top_20PC.tsv', dtype='str', delimiter='\t')
[40]:
idx = brie.match(adata_psi.obs.index, dat_GEX_pc[:, 0])
GEX_pc = dat_GEX_pc[idx, 1:].astype(float)
[41]:
adata_psi
[41]:
AnnData object with n_obs × n_vars = 1845 × 3780
obs: 'samID', 'samCOUNT', 'MS', 'isCD1', 'cluster', 'combine', 'Psi_PC1', 'Psi_PC2', 'Psi_PC3'
var: 'GeneID', 'GeneName', 'TranLens', 'TranIDs', 'n_counts', 'n_counts_uniq'
uns: 'brie_losses', 'cluster_colors', 'MS_colors', 'pca', 'combine_colors', 'neighbors', 'umap'
obsm: 'Xc', 'X_GEX_UMAP', 'X_pca', 'X_umap'
varm: 'cell_coeff', 'effLen', 'intercept', 'p_ambiguous', 'sigma', 'PCs'
layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'isoform1', 'isoform2', 'poorQual', 'X'
obsp: 'distances', 'connectivities'
[42]:
np.random.RandomState(0)
[42]:
RandomState(MT19937) at 0x7F09201C87C0
[43]:
import scipy.stats as st
from sklearn import linear_model
from hilearn import ROC_plot, CrossValidation
# np.random.seed(0)
X1 = adata_psi.obsm['X_pca'][:, :20]
X2 = GEX_pc
X3 = np.append(GEX_pc, adata_psi.obsm['X_pca'][:, :20], axis=1)
Y1 = pd.factorize(adata_psi.obs['MS'])[0]
X1 = X1[adata_psi.obs['cluster'] == 'MOL']
X2 = X2[adata_psi.obs['cluster'] == 'MOL']
X3 = X3[adata_psi.obs['cluster'] == 'MOL']
Y1 = Y1[adata_psi.obs['cluster'] == 'MOL']
LogisticRegression = linear_model.LogisticRegression(max_iter=5000)
CV = CrossValidation(X1, Y1)
Y1_pre, Y1_score = CV.cv_classification(model=LogisticRegression, folds=10)
LogisticRegression = linear_model.LogisticRegression(max_iter=5000)
CV = CrossValidation(X2, Y1)
Y2_pre, Y2_score = CV.cv_classification(model=LogisticRegression, folds=10)
LogisticRegression = linear_model.LogisticRegression(max_iter=5000)
CV = CrossValidation(X3, Y1)
Y3_pre, Y3_score = CV.cv_classification(model=LogisticRegression, folds=10)
[44]:
fig = plt.figure(figsize=(6, 5), dpi=60)
ROC_plot(Y1, Y1_score[:,1], label="Psi 20 PCs", base_line=False)
ROC_plot(Y1, Y2_score[:,1], label="GEX 20 PCs", base_line=False)
ROC_plot(Y1, Y3_score[:,1], label="Psi & GEX")
plt.title("ROC curve: EAE states classification on MOL")
# fig.savefig(dat_dir + '../../figures/fig_s9_EAE_predict.pdf', dpi=300, bbox_inches='tight')
plt.show()
Warning: label argument is replaced by legend_label and will be moved in future
Warning: label argument is replaced by legend_label and will be moved in future
Warning: label argument is replaced by legend_label and will be moved in future

[ ]:
[45]:
## There are multiple batches in running jobs, so you will steps;
## the second part shows the last bit of the final batch
brie.pl.loss(adata.uns['brie_losses'])

[ ]:
scNT-seq¶
scNT-seq (Qiu, Hu, et al 2020) is a recently proposed technique to metabolically label newly synthesied RNAs in single cells. By using the labelling information, the inferred cellular transitions by Dynamo are hihgly consistent with the stimulation time (see our reproduced analysis with scripts from the authors).
This relative-longe-period transition is generally difficult to be obtained from RNA velocity by only using total RNAs. Here, we will illustrate that the differential momentum genes could help correct the projected trajectory, thanks to using the stimulation time as a testing (i.e., supervised) covariate.
As BRIE2 takes ~30 minutes with GPU to detect the DMGs, we provide the pre-computed data with these BRIE2 scripts. You can run this notebook by downloading the data, e.g., using the following command line and unzip it into the ./data
folder:
wget http://ufpr.dl.sourceforge.net/project/brie-rna/examples/scNTseq/brie2_scNTseq.zip
unzip -j brie2_scNTseq.zip -d ./data
Load packages¶
[1]:
import brie
import numpy as np
import scanpy as sc
import scvelo as scv
import matplotlib.pyplot as plt
scv.logging.print_version()
2021-09-23 21:25:04.656553: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-23 21:25:04.656584: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Running scvelo 0.2.4 (python 3.7.6) on 2021-09-23 21:25.
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.
[2]:
scv.settings.verbosity = 3 # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True # set max width size for presenter view
scv.set_figure_params('scvelo') # for beautified visualization
[3]:
# define the path you store the example data
# dat_dir = "./data"
dat_dir = '/storage/yhhuang/research/brie2/releaseDat/scNTseq/'
scVelo dynamical model with total RNAs¶
This may take a few minutes to run, so we provide the pre-computed the data as in the above zip file for the setting with top 2,000 genes. The commented Python scripts are below. You can get the neuron_splicing_totalRNA.h5ad
from here generated by the adapted script. If you want to change to more hihgly variable genes, you can change n_top_genes
to other value, e.g., 8000.
adata = scv.read(dat_dir + "/neuron_splicing_totalRNA.h5ad")
scv.pp.filter_and_normalize(adata, min_shared_counts=30, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
scv.tl.recover_dynamics(adata, var_names='all')
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
scv.tl.latent_time(adata)
adata.write(dat_dir + "/scvelo_neuron_totalRNA_dynamical_2K.h5ad")
Cellular transitions on default selected velocity genes¶
[4]:
adata = scv.read(dat_dir + "/scvelo_neuron_totalRNA_dynamical_2K.h5ad")
print(adata.shape, np.sum(adata.var['velocity_genes']))
adata
(3066, 1999) 132
[4]:
AnnData object with n_obs × n_vars = 3066 × 1999
obs: 'cellname', 'time', 'early', 'late', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'n_counts', 'velocity_self_transition', 'root_cells', 'end_points', 'velocity_pseudotime', 'latent_time'
var: 'gene_short_name', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'fit_alpha', 'fit_beta', 'fit_gamma', 'fit_t_', 'fit_scaling', 'fit_std_u', 'fit_std_s', 'fit_likelihood', 'fit_u0', 'fit_s0', 'fit_pval_steady', 'fit_steady_u', 'fit_steady_s', 'fit_variance', 'fit_alignment_scaling', 'fit_r2', 'velocity_genes'
uns: 'neighbors', 'pca', 'recover_dynamics', 'velocity_graph', 'velocity_graph_neg', 'velocity_params'
obsm: 'X_pca', 'X_umap'
varm: 'PCs', 'loss'
layers: 'Ms', 'Mu', 'fit_t', 'fit_tau', 'fit_tau_', 'spliced', 'unspliced', 'velocity', 'velocity_u'
obsp: 'connectivities', 'distances'
[5]:
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['time'],
ax=None, show=True, legend_fontsize=10, dpi=80,
title='scVelo dynamical model: 2K variable genes')
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
For 8K top variable genes¶
You can skip this sub section if not using the 8K top genes
adata2 = scv.read(dat_dir + "/scvelo_neuron_totalRNA_dynamical_8K.h5ad")
print(adata2.shape, np.sum(adata2.var['velocity_genes']))
scv.pl.velocity_embedding_stream(adata2, basis='umap', color=['time'],
ax=None, show=True, legend_fontsize=10,
title='scVelo dynamical model: 8K variable genes')
BRIE2 for differential momentum genes (DMGs)¶
Besides the large h5ad file, BRIE2 also saves the DMGs in a .tsv file brie_neuron_splicing_time.brie_ident.tsv for quick access.
[6]:
adata_brie = scv.read(dat_dir + "/brie_neuron_splicing_time.h5ad")
adata_brie
[6]:
AnnData object with n_obs × n_vars = 3066 × 7849
obs: 'cellname', 'time', 'early', 'late'
var: 'gene_short_name', 'n_counts', 'n_counts_uniq', 'loss_gene'
uns: 'Xc_ids', 'brie_losses', 'brie_param', 'brie_version'
obsm: 'X_umap', 'Xc'
varm: 'ELBO_gain', 'cell_coeff', 'fdr', 'intercept', 'pval', 'sigma'
layers: 'Psi', 'Psi_95CI', 'Z_std', 'spliced', 'unspliced'
[7]:
fig = plt.figure(figsize=(6, 4), dpi=60)
plt.figure(figsize=(6, 4), dpi=100)
brie.pl.volcano(adata_brie, y='ELBO_gain', log_y=False, n_anno=16,
score_red=10, adjust=True)
plt.title('Differential momentum genes with time')
plt.xlabel('Coefficient on time')
# plt.savefig(dat_dir + '../../figures/scNT_volcano_elbo.png', dpi=200)
plt.show()
<Figure size 360x240 with 0 Axes>
RNA velocity on DMGs¶
[8]:
idx = (np.min(adata_brie.varm['ELBO_gain'], axis=1) > 5)
gene_use = adata_brie.var.index[idx]
print(sum(idx), sum(brie.match(gene_use, adata.var.index) != None))
n_genes = sum(brie.match(gene_use, adata.var.index) != None)
scv.tl.velocity_graph(adata, gene_subset=gene_use)
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['time'],
legend_fontsize=10,
ax=None, show=True, dpi=80,
title='scVelo with %d DMGs' %(n_genes))
421 201
computing velocity graph (using 1/80 cores)
finished (0:00:02) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
Change cutoffs¶
[9]:
fig = plt.figure(figsize=(11, 4), dpi=60)
ax1 = plt.subplot(1, 2, 1)
idx1 = (np.min(adata_brie.varm['ELBO_gain'], axis=1) > 7)
gene_use1 = adata_brie.var.index[idx1]
print(sum(idx1), sum(brie.match(gene_use1, adata.var.index) != None))
scv.tl.velocity_graph(adata, gene_subset=gene_use1)
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['time'],
ax=ax1, show=False, legend_fontsize=10,
title='BRIE2: %d DMGs at ELBO_gain>7'
%(sum(brie.match(gene_use1, adata.var.index) != None)))
ax1.text(-0.15, 0.95, 'a', transform=ax1.transAxes, size=20, weight='bold')
ax2 = plt.subplot(1, 2, 2)
idx2 = (np.min(adata_brie.varm['ELBO_gain'], axis=1) > 3)
gene_use2 = adata_brie.var.index[idx2]
print(sum(idx2), sum(brie.match(gene_use2, adata.var.index) != None))
scv.tl.velocity_graph(adata, gene_subset=gene_use2)
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['time'],
ax=ax2, show=False, legend_fontsize=10,
title='BRIE2: %d DMGs at ELBO_gain>3'
%(sum(brie.match(gene_use2, adata.var.index) != None)))
ax2.text(-0.15, 0.95, 'b', transform=ax2.transAxes, size=20, weight='bold')
# plt.tight_layout()
# plt.savefig(dat_dir + '../../figures/scNT_scVelo_brie_ELBO.png', dpi=300)
# plt.show()
239 122
computing velocity graph (using 1/80 cores)
finished (0:00:02) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
815 360
computing velocity graph (using 1/80 cores)
finished (0:00:03) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
[9]:
Text(-0.15, 0.95, 'b')
Visualize gene count¶
[10]:
idx = (np.min(adata_brie.varm['ELBO_gain'], axis=1) > 5)
gene_use = adata_brie.var.index[idx]
mm = brie.match(gene_use, adata.var.index) != None
gene_use[mm]
[10]:
Index(['Ncor1', 'App', 'Meis2', 'Cdk2ap1', 'Usp24', 'Ahi1', 'Tcf4', 'Nup98',
'Zswim6', 'Fam155a',
...
'Epha3', 'Pcca', 'Ptk2', 'Zbtb20', 'Nell2', 'Zbtb11', 'Pccb', 'Frem2',
'Celf4', 'Zfp704'],
dtype='object', length=201)
[11]:
## sorted by ELBO gain
idx_sort = np.argsort(adata_brie[:, gene_use[mm]].varm['ELBO_gain'][:,0])[::-1]
gene_use[mm][idx_sort]
[11]:
Index(['Ext1', 'Pccb', 'Ube2ql1', 'Ntrk2', 'Mbnl1', 'Fosb', 'Pfkp', 'Orc5',
'Srd5a1', 'Chka',
...
'Osgep', 'Nrn1', 'Ddx3y', 'Napb', 'Zdhhc17', 'Snap25', 'Ccnl1', 'Cobl',
'Etv5', 'Lats2'],
dtype='object', length=201)
[12]:
## Only negative coefficient
gene_sorted = gene_use[mm][idx_sort]
gene_sorted[adata_brie[:, gene_sorted].varm['cell_coeff'][:, 0] < 0]
[12]:
Index(['Ext1', 'Pccb', 'Ube2ql1', 'Ntrk2', 'Mbnl1', 'Orc5', 'Srd5a1', 'Akap9',
'Ank2', 'Dlgap1',
...
'Ptpre', 'Klhl4', 'Kcnn2', 'Zfp292', 'Osgep', 'Napb', 'Zdhhc17',
'Snap25', 'Cobl', 'Etv5'],
dtype='object', length=129)
[13]:
## Only positive coefficient
gene_sorted = gene_use[mm][idx_sort]
gene_sorted[adata_brie[:, gene_sorted].varm['cell_coeff'][:, 0] > 0]
[13]:
Index(['Fosb', 'Pfkp', 'Chka', 'Homer1', 'Erf', 'Nup98', 'Ifrd1', 'Ncapg2',
'Rasgef1b', 'Rsrp1', 'Tiparp', 'Crem', 'Zdbf2', 'Arl5b', 'Usp37',
'Fam107b', 'Elmsan1', 'Cwc25', 'Nedd9', 'Cystm1', 'Atp9a', 'Fosl2',
'Stil', 'Arih1', 'Ldlr', 'Ak4', 'Zfp248', 'Elovl5', 'Pde4d', 'Nr4a2',
'Nufip2', 'Mpped1', 'Srsf5', 'Fgfr2', 'Gtf2i', 'Gpr19', 'Cpeb3',
'Fbxo33', 'Srrm2', 'Dcun1d3', 'Foxo3', 'Zswim6', 'Ccdc138', 'Zbtb11',
'Cramp1l', 'Ndel1', 'Aff4', 'Arf4', 'Slc2a3', 'Stk40', 'Baiap2',
'Slc20a1', 'Cabp1', 'Gla', 'Gltscr2', 'Ptprk', 'Rasgrf1', 'Cpeb2',
'Snapc1', 'Pkia', 'Sik1', 'Nr4a1', 'Ppm1d', 'Cdk2ap1', 'Impact', 'Plaa',
'Hpf1', 'Taf1', 'Nrn1', 'Ddx3y', 'Ccnl1', 'Lats2'],
dtype='object')
[14]:
adata_brie.obs['time_cat'] = adata_brie.obs['time'].astype('category')
gene_top_neg = ['Ext1', 'Pccb', 'Ube2ql1']
gene_top_pos = ['Fosb', 'Pfkp', 'Chka']
fig = plt.figure(figsize=(16, 8), dpi=40)
brie.pl.counts(adata_brie, genes=gene_top_pos + gene_top_neg,
layers=['spliced', 'unspliced'],
color='time_cat', add_val='ELBO_gain',
ncol=3, alpha=0.7, legend='brief', noise_scale=0.1)
# plt.savefig(dat_dir + '../../figures/scNT_DMG_counts.png', dpi=150)
# plt.show()
[15]:
adata_brie[:, gene_top_pos + gene_top_neg].varm['cell_coeff']
[15]:
ArrayView([[ 0.0196298 ],
[ 0.01087456],
[ 0.01594826],
[-0.03045544],
[-0.0186206 ],
[-0.01236648]], dtype=float32)
[16]:
scv.pl.velocity(adata, var_names=['Ext1', 'Pfkp'], colorbar=True, ncols=1)
[ ]:
Dentate Gyrus¶
This processed Dentate Gyrus data set is downloaded from scVelo package, which is a very nice tool for RNA velocity quantification. Here, we will illustrate that the differential momentum genes could reveal biological informed trajectory, thanks to its supervised manner on informative gene detection.
You can run this notebook with our pre-computed data with these BRIE2 command lines, by downloading it e.g., using the following command line and unzip it into the ./data
folder:
wget http://ufpr.dl.sourceforge.net/project/brie-rna/examples/dentateGyrus/brie2_dentateGyrus.zip
unzip -j brie2_dentateGyrus.zip -d ./data
Load packages¶
[1]:
import brie
import numpy as np
import scanpy as sc
import scvelo as scv
import matplotlib.pyplot as plt
scv.logging.print_version()
print('brie version: %s' %brie.__version__)
2021-09-23 21:13:31.376208: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-23 21:13:31.376238: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Running scvelo 0.2.4 (python 3.7.6) on 2021-09-23 21:13.
brie version: 2.0.6
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
ERROR: XMLRPC request failed [code: -32500]
RuntimeError: PyPI's XMLRPC API is currently disabled due to unmanageable load and will be deprecated in the near future. See https://status.python.org/ for more information.
[2]:
scv.settings.verbosity = 3 # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True # set max width size for presenter view
scv.set_figure_params('scvelo') # for beautified visualization
[3]:
# define the path you store the example data
# dat_dir = "./data"
dat_dir = "/home/yuanhua/research/brie2/releaseDat/dentateGyrus/"
scVelo’s default results¶
Fitting scVelo¶
Scvelo’s stochastic and dynamical models may take a few minutes to run, so we pre-run it by the following codes, and the fitted data can be found in the downloaded zip file.
Stochastic model
adata = scv.datasets.dentategyrus()
scv.pp.filter_and_normalize(adata, min_shared_counts=30, n_top_genes=3000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
scv.tl.velocity(adata)
scv.tl.velocity_graph(adata)
adata.write(dat_dir + "/dentategyrus_scvelo_stoc_3K.h5ad")
[4]:
adata = scv.read(dat_dir + "/dentategyrus_scvelo_stoc_3K.h5ad")
[5]:
scv.utils.show_proportions(adata)
adata
Abundance of ['spliced', 'unspliced']: [0.9 0.1]
[5]:
AnnData object with n_obs × n_vars = 2930 × 2894
obs: 'clusters', 'age(days)', 'clusters_enlarged', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition'
var: 'velocity_gamma', 'velocity_r2', 'velocity_genes'
uns: 'clusters_colors', 'neighbors', 'pca', 'velocity_graph', 'velocity_graph_neg', 'velocity_params'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'Ms', 'Mu', 'ambiguous', 'spliced', 'unspliced', 'variance_velocity', 'velocity'
obsp: 'connectivities', 'distances'
[6]:
print(np.sum(adata.var['velocity_genes']))
634
[7]:
scv.tl.velocity_graph(adata, gene_subset=None)
scv.pl.velocity_embedding_stream(adata, basis='umap',
color=['clusters', 'age(days)'],
legend_fontsize=5, dpi=60)
computing velocity graph (using 1/80 cores)
finished (0:00:06) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
[8]:
## Plotting with higher figure resolution
# scv.pl.velocity_embedding_stream(adata, basis='umap', color=['clusters'],
# legend_fontsize=5, dpi=150, title='')
# scv.pl.velocity_embedding(adata, basis='umap', arrow_length=2, arrow_size=2, dpi=300, title='')
[ ]:
BRIE2’s differential momentum genes (DMGs)¶
Here, we aim to use BRIE2 to detect the differential momentum genes between cell types and illustrate its performance on projecting RNA velocity to cellular transitions. The command line and design matrix are: run_brie2.sh and dentategyrus_cluster_OL.tsv.
One main differentiation is between OPC and OL. So, we took out these 103 cells and run BRIE2 to detect the DMGs by testing the covariate is_OL
.
[9]:
adata_brie_OL = scv.read(dat_dir + "/brie_dentategyrus_cluster_subOL.h5ad")
adata_OL = adata[adata_brie_OL.obs.index, :]
Vocalno plot for DMGs between OPC and OL. cell_ceoff means effect size on logit(Psi), Psi means proportion of spliced RNAs. Positive effect size means OL has higher proportion of spliced RNAs.
[10]:
plt.figure(figsize=(6, 4.5), dpi=60)
brie.pl.volcano(adata_brie_OL, y='ELBO_gain', log_y=False, n_anno=16,
score_red=5, adjust=True)
plt.title('DMGs between OPC and OL')
plt.xlabel('cell_coeff: effect size on logit(Psi)')
plt.title('DMGs between OPC and OL')
# plt.savefig(dat_dir + '../../brie2/figures/fig_s11_vocalno_DMG_OPCvsOL.pdf', dpi=150)
[10]:
Text(0.5, 1.0, 'DMGs between OPC and OL')
[11]:
top_genes_OL = adata_brie_OL.var.index[np.argsort(list(np.max(adata_brie_OL.varm['ELBO_gain'], axis=1)))[::-1][:100]]
top_genes_OL[:4]
[11]:
Index(['Elavl3', 'Vmp1', 'Sema6d', 'Tra2a'], dtype='object', name='index')
[12]:
print(adata.var['velocity_r2'][top_genes_OL[:4]])
scv.pl.velocity(adata, var_names=top_genes_OL[:2], colorbar=True, ncols=1)
index
Elavl3 0.310708
Vmp1 -0.736003
Sema6d -0.358877
Tra2a -1.474734
Name: velocity_r2, dtype: float64
[13]:
idx = adata_brie_OL.varm['ELBO_gain'][:, 0] > 2
gene_use = adata_brie_OL.var.index[idx]
print(len(gene_use), sum(brie.match(gene_use, adata_OL.var.index) != None))
scv.tl.velocity_graph(adata_OL, gene_subset=gene_use)
scv.pl.velocity_embedding_stream(adata_OL, basis='umap',
color=['clusters', 'age(days)'],
legend_fontsize=10, dpi=50, size=1000)
48 48
computing velocity graph (using 1/80 cores)
Trying to set attribute `.uns` of view, copying.
finished (0:00:00) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
[14]:
fig = plt.figure(figsize=(11, 8), dpi=60)
## cutoff 0.001
ax1 = plt.subplot(2, 2, 1)
print(np.sum(adata_OL.var['velocity_genes']))
idx1 = adata_OL.var['velocity_genes']
scv.tl.velocity_graph(adata_OL, gene_subset=None)
scv.pl.velocity_embedding_stream(adata_OL, basis='umap', color=['clusters'],
ax=ax1, show=False, legend_fontsize=10, size=1000,
title='scVelo default %d genes' %(sum(idx1)))
ax1.text(-0.15, 0.95, 'a', transform=ax1.transAxes, size=20, weight='bold')
## cutoff 0.05
ax1 = plt.subplot(2, 2, 2)
idx1 = adata_brie_OL.varm['ELBO_gain'][:, 0] > 3
gene_use1 = adata_brie_OL.var.index[idx1]
print(sum(idx1), sum(brie.match(gene_use1, adata_OL.var.index) != None))
scv.tl.velocity_graph(adata_OL, gene_subset=gene_use1)
scv.pl.velocity_embedding_stream(adata_OL, basis='umap', color=['clusters'],
ax=ax1, show=False, legend_fontsize=10, size=1000,
title='%d DMGs at ELBO_gain>3' %(sum(idx1)))
ax1.text(-0.15, 0.95, 'b', transform=ax1.transAxes, size=20, weight='bold')
## cutoff 0.05
ax1 = plt.subplot(2, 2, 3)
idx1 = adata_brie_OL.varm['ELBO_gain'][:, 0] > 2
gene_use1 = adata_brie_OL.var.index[idx1]
print(sum(idx1), sum(brie.match(gene_use1, adata_OL.var.index) != None))
scv.tl.velocity_graph(adata_OL, gene_subset=gene_use1)
scv.pl.velocity_embedding_stream(adata_OL, basis='umap', color=['clusters'],
ax=ax1, show=False, legend_fontsize=10, size=1000,
title='%d DMGs at ELBO_gain>2' %(sum(idx1)))
ax1.text(-0.15, 0.95, 'c', transform=ax1.transAxes, size=20, weight='bold')
## cutoff 0.05
ax1 = plt.subplot(2, 2, 4)
idx1 = adata_brie_OL.varm['ELBO_gain'][:, 0] > 1.5
gene_use1 = adata_brie_OL.var.index[idx1]
print(sum(idx1), sum(brie.match(gene_use1, adata_OL.var.index) != None))
scv.tl.velocity_graph(adata_OL, gene_subset=gene_use1)
scv.pl.velocity_embedding_stream(adata_OL, basis='umap', color=['clusters'],
ax=ax1, show=False, legend_fontsize=10, size=1000,
title='%d DMGs at ELBO_gain>1.5' %(sum(idx1)))
ax1.text(-0.15, 0.95, 'd', transform=ax1.transAxes, size=20, weight='bold')
plt.tight_layout()
# plt.savefig(dat_dir + '../../brie2/figures/fig_s12_OPLvsOL_direction.png', dpi=300)
plt.show()
634
computing velocity graph (using 1/80 cores)
finished (0:00:00) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
20 20
computing velocity graph (using 1/80 cores)
finished (0:00:00) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
48 48
computing velocity graph (using 1/80 cores)
finished (0:00:00) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
83 83
computing velocity graph (using 1/80 cores)
finished (0:00:00) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
[ ]:
We then detect the DMGs for each cell type vs the rest and use them to explore their impact on cellular transitions.
[15]:
adata_brie = scv.read(dat_dir + "/brie_dentategyrus_cluster.h5ad")
cdr = np.array((adata_brie.X > 0).mean(0))[0, :]
adata_brie
[15]:
AnnData object with n_obs × n_vars = 2930 × 2879
obs: 'clusters', 'age(days)', 'clusters_enlarged', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size'
var: 'n_counts', 'n_counts_uniq', 'loss_gene'
uns: 'Xc_ids', 'brie_losses', 'brie_param', 'brie_version', 'clusters_colors'
obsm: 'X_umap', 'Xc'
varm: 'ELBO_gain', 'cell_coeff', 'fdr', 'intercept', 'pval', 'sigma'
layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'spliced', 'unspliced'
[16]:
top_genes_brie = adata_brie.var.index[np.argsort(list(np.max(adata_brie.varm['ELBO_gain'], axis=1)))[::-1][:100]]
top_genes_brie[:4]
[16]:
Index(['Celf2', 'Vmp1', 'Myl6', 'Snap25'], dtype='object', name='index')
[17]:
print(adata.var['velocity_r2'][top_genes_brie[:4]])
scv.pl.velocity(adata, var_names=top_genes_brie[:2], colorbar=True, ncols=1)
index
Celf2 0.318048
Vmp1 -0.736003
Myl6 0.338808
Snap25 -0.970810
Name: velocity_r2, dtype: float64
[18]:
import seaborn as sns
scVelo_genes = ['Tmsb10', 'Camk2a', 'Ppp3ca', 'Igfbpl1']
adata_brie.varm['max_EGain'] = np.round(np.max(adata_brie.varm['ELBO_gain'], axis=1, keepdims=True), 2)
fig = plt.figure(figsize=(20, 8), dpi=40)
brie.pl.counts(adata_brie, genes=np.append(top_genes_brie[:4], scVelo_genes),
layers=['spliced', 'unspliced'],
color='clusters', add_val='max_EGain',
nrow=2, alpha=0.7, legend=False,
palette=sns.color_palette(adata_brie.uns['clusters_colors']),
hue_order=np.unique(adata_brie.obs['clusters']), noise_scale=0.1)
# plt.savefig(dat_dir + '../../brie2/figures/fig_s10_DMG_counts.png', dpi=200)
plt.show()
[19]:
idx = (np.max(adata_brie.varm['ELBO_gain'], axis=1) > 5) * (cdr > 0.15)
gene_use = adata_brie.var.index[idx]
print(len(gene_use), sum(brie.match(gene_use, adata.var.index) != None))
scv.tl.velocity_graph(adata, gene_subset=gene_use)
scv.pl.velocity_embedding_stream(adata, basis='umap',
color=['clusters', 'age(days)'],
legend_fontsize=5, dpi=60)
297 297
computing velocity graph (using 1/80 cores)
finished (0:00:03) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
[20]:
## Plotting with higher figure resolution
# scv.pl.velocity_embedding_stream(adata, basis='umap', color=['clusters'],
# legend_fontsize=5, dpi=150, title='')
# scv.pl.velocity_embedding(adata, basis='umap', arrow_length=2, arrow_size=2, dpi=300, title='')
[ ]:
scVelo’s dynamical model¶
Dynamical model
adata = scv.datasets.dentategyrus()
scv.pp.filter_and_normalize(adata, min_shared_counts=30, n_top_genes=3000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
scv.tl.recover_dynamics(adata, var_names='all')
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
scv.tl.latent_time(adata)
adata.write(dat_dir + "/dentategyrus_scvelo_dyna_3K.h5ad")
[21]:
adata_dyna = scv.read(dat_dir + "/dentategyrus_scvelo_dyna_3K.h5ad")
[22]:
print(np.sum(adata_dyna.var['velocity_genes']))
1066
[23]:
# scv.tl.velocity_graph(adata_dyna, gene_subset=None)
scv.pl.velocity_embedding_stream(adata_dyna, basis='umap',
color=['clusters', 'age(days)'],
legend_fontsize=5, dpi=60)
computing velocity embedding
finished (0:00:00) --> added
'velocity_umap', embedded velocity vectors (adata.obsm)
[ ]:
Prior distribtuion¶
[3]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
[21]:
import matplotlib.pyplot as plt
Logit-Normal¶
See more on logit-normal distribution Wikipedia page.
In practicce, we use \(\mu=0, \sigma=3.0\) as prior
[30]:
def _logit(x):
return tf.math.log(x / (1 - x))
_logit(np.array([0.6]))
[30]:
<tf.Tensor: id=7912, shape=(1,), dtype=float64, numpy=array([0.40546511])>
[40]:
_means = np.array([0.05, 0.3, 0.5, 0.7, 0.95], dtype=np.float32)
_vars = np.array([0.1, 0.5, 1.0, 1.5, 2.0, 3.0], dtype=np.float32)
xx = np.arange(0.001, 0.999, 0.001).astype(np.float32)
fig = plt.figure(figsize=(10, 6))
for j in range(len(_vars)):
plt.subplot(2, 3, j + 1)
for i in range(len(_means)):
_model = tfd.Normal(_logit(_means[i:i+1]), _vars[j:j+1])
_pdf = _model.prob(_logit(xx)) * 1 / (xx * (1 - xx))
plt.plot(xx, _pdf, label="mean=%.2f" %(_means[i]))
plt.title("std=%.2f" %(_vars[j]))
if j == 5:
plt.legend(loc="best")
plt.tight_layout()
plt.show()

[73]:
_model = tfd.Normal([0], [3])
plt.hist(np.array(tf.sigmoid(_model.sample(1000))).reshape(-1), bins=100)
plt.show()

Gamma distribution¶
See more on Gamma distribution Wikipedia page In practice, we use \(\alpha=10, \beta=3\) as prior
[65]:
_alpha = np.array([2, 3, 5, 10, 15], dtype=np.float32)
_beta = np.array([0.5, 1, 2, 3, 5, 8], dtype=np.float32)
xx = np.arange(0.01, 10, 0.1).astype(np.float32)
fig = plt.figure(figsize=(10, 6))
for j in range(len(_vars)):
plt.subplot(2, 3, j + 1)
for i in range(len(_means)):
_model = tfd.Gamma(_alpha[i:i+1], _beta[j:j+1])
_pdf = _model.prob(xx)
plt.plot(xx, _pdf, label="alpha=%.2f" %(_alpha[i]))
plt.title("beta=%.2f" %(_beta[j]))
if j == 5:
plt.legend(loc="best")
plt.tight_layout()
plt.show()

[70]:
_model = tfd.Gamma([10], [3])
plt.hist(_model.sample(1000).numpy().reshape(-1), bins=100)
plt.show()

[ ]: