brie-quant CLI

The brie-quant CLI (in brie>=2.0.0) uses the newly developed variational inference methods scalable to large data sets, which works both in CPU or GPU with the TensorFlow Backend. For using BRIE1 (<=0.2.4) with an MCMC sampler, please refer to BRIE1.

This command allows to quantify the splicing proportion Psi and detect variable splicing events/genes along with cell-level features, e.g., cell type, disease condition, and development time. It can handle both alternative splicing isoforms or unspliced vs spliced ratios in a unified framework.

As a Bayesian method, the key philosophy of BRIE is to combine likelihood (data-driven) and prior (uninformative or informative). In BRIE2, a variety of prior settings are supported, as follows.

Note

The mode2 (quant and diff) with cell features is the recommended option in BRIE2, for both quantifying PSI and identifying differential momentum genes for RNA velocity analysis and differential splicing events on alternative splicing for either categorical or continuous covariates.

For unspliced vs spliced in RNA velocity analysis, please add argument --layers spliced,unspliced, as by default is set for alternative splicing.

Mode2-quant: Aggregated imputation

This mode requires argument --interceptMode gene, which means fitting an offset for each gene (or splicing event). The offset denotes the mean of the prior distribution through aggregation, hence learns a prior shared by all cells on each gene. The benefit for this mode is that dimension reduction can be performed, e.g., PCA and UMAP on splicing PSI matrix. As there are many splicing events with low total reads, they will have a high variance in the estimation, and is often suggested filtered out, which will cause missing values. Based on the cell aggregated imputation, most dimension reduction methods can be used, even it doesn’t support missing values.

Example command line for mode 2:

brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_aggr.h5ad --interceptMode gene

Mode2-diff: Variable splicing detection

This mode requires argument -c for cell features and --LRTindex for the index (zero-based) of cell features to perform likelihood ratio test. Again we suggest keeping the cell aggregation on each gene by --interceptMode gene.

Then this mode will learn a prior from the given cell level features and perform the second round fitting with leaving one feature out each time to calculate the ELBO gain, which can be used as a surrogate for Bayes factor.

Example command line for mode 3:

brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_cell.h5ad \
    -c $DATA_DIR/cell_info.tsv --interceptMode gene --LRTindex=All

Example

As an example in the MS data, we have cells labelled with two factors 1) multiple sclerosis & control - isEAE, 2) two mouse strains - isCD1. We want to identify alternative splicing events associated with the first factor multiple sclerosis, but also want to consider the potential confounder in mouse strain, we could set the design matrix in cell_info.tsv file as follow, along with parameters --interceptMode gene --LRTindex 0.

samID   isEAE   isCD1
SRR7102862      0       1
SRR7103631      1       0
SRR7104046      1       1
SRR7105069      0       0

Note

Be very careful on collinearity (i.e., redundancy) of your design matrix in cell_info.tsv and the constant intercept (if use --interceptMode gene), which is similar to DEG in edgeR or DESeq2.

By default, BRIE2 uses the --testBase=full mode to detect differential splicing by comparing all features + constant intercept versus leaving the testing feature(s) out. In this setting, if your features can’t have collinearity with intercept, e.g., you may need to remove one cell type out if it covers all your cells.

Alternatively, you can change to another strategy --testBase=null by comparing the testing feature(s) + intercept versus intercept only, like the example data on Dentate Gyrus.

Mode 1: Imputation with gene features

This mode is introduced in BRIE 1, where genomic sequences are leverage to learn a prior distribution of PSI. Then the predicted distribution of PSI from sequence features are combined with the likelihood obtained from the observed read counts. This framework provides a coherent way to combine the quantification from read counts and imputation from genomic features.

Initially, the inference in this mode is achieved by MCMC sample per cell separated, which is generally slow for over hundreds of cells. BRIE2’s variational framework keeps supporting the gene features and performs the inference for all cells in one go, though they are independent.

Example command line for Mode 1. We suggest use --interceptMode cell to learn an offset for each cell:

brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_gene.h5ad \
    -g $DATA_DIR/gene_feature.tsv --interceptMode cell

Note

For the sake of convenience, we now recommend using Mode2-quant below to perform imputation, which leverages the average PSI values in a cell population to function as an informative prior.

Mode 0: None imputation

In this mode, the prior is an uninformative logit-normal distribution with mean=0, and learned variance. Therefore, if a splicing event in a gene doesn’t have any read, it will return a posterior with Psi’s mean=0.5 and 95% confidence interval around 0.95 (most cases >0.9).

This setting is used if you have high covered data and you only want to calculate cells with sufficient reads for each interesting gene, e.g., by filtering out all genes with Psi_95CI > 0.3.

Otherwise, the 0.5 imputed genes will be confounded by the expression level, instead of the isoform proportion.

Example command line for mode 1:

brie-quant -i out_dir/brie_count.h5ad -o out_dir/brie_quant_pure.h5ad --interceptMode None

All parameters

There are more parameters for setting (brie-quant -h always give the version you are using):

Usage: brie-quant [options]

Options:
  -h, --help            show this help message and exit
  -i IN_FILE, --inFile=IN_FILE
                        Input read count matrices in AnnData h5ad or brie npz format.
  -c CELL_FILE, --cellFile=CELL_FILE
                        File for cell features in tsv[.gz] with cell and feature ids.
  -g GENE_FILE, --geneFile=GENE_FILE
                        File for gene features in tsv[.gz] with gene and feature ids.
  -o OUT_FILE, --out_file=OUT_FILE
                        Full path of output file for annData in h5ad [default:
                        $inFile/brie_quant.h5ad]
  --LRTindex=LRT_INDEX  Index (0-based) of cell features to test with LRT: All, None
                        or comma separated integers [default: None]
  --testBase=TEST_BASE  Features in testing base model: full, null  [default: full]
  --interceptMode=INTERCEPT_MODE
                        Intercept mode: gene, cell or None [default: None]
  --layers=LAYERS       Comma separated layers two or three for estimating Psi
                        [default: isoform1,isoform2,ambiguous]

  Gene filtering:
    --minCount=MIN_COUNT
                        Minimum total counts for fitltering genes [default: 50]
    --minUniqCount=MIN_UNIQ_COUNT
                        Minimum unique counts for fitltering genes [default: 10]
    --minCell=MIN_CELL  Minimum number of cells with unique count for fitltering genes
                        [default: 30]
    --minMIF=MIN_MIF    Minimum minor isoform frequency in unique count [default:
                        0.001]

  VI Optimization:
    --MCsize=MC_SIZE    Sample size for Monte Carlo Expectation [default: 3]
    --minIter=MIN_ITER  Minimum number of iterations [default: 5000]
    --maxIter=MAX_ITER  Maximum number of iterations [default: 20000]
    --batchSize=BATCH_SIZE
                        Element size per batch: n_gene * total cell [default: 500000]
    --pseudoCount=PSEUDO_COUNT
                        Pseudo count to add on unique count matrices [default: 0.01]