-
Notifications
You must be signed in to change notification settings - Fork 39
Tutorial
Peer can be either used as standalone tool or using one of it's interfaces (currently R and Python). Below you can find a brief introduction to either one of these ways to get started with PEER.
As a minimum, PEER requires an expression matrix in csv
(comma-separated) format, specified with the -f
option. The matrix is
assumed to have N rows and G columns, where N is the number of
samples, and G is the number of genes. The basic command to apply peer
to such a matrix given in expression.csv, learning K=10 hidden
confounders is
> peertool -f expression.csv -n 10
PEER can read comma-separated (.csv) or tab separated (.tab) files, and
assumes single space separation if the extension is not .csv or .tab. If
the expression data file has a header row, you have to give
--has_header
switch on the command line. There is a sample expression file
in the directory examples/data.
The output is written to directory peer_out by default, creating csv files for the residuals after accounting for the factors (residuals.csv, NxG matrix), the inferred factors (X.csv, NxK), the weights of each factor for every gene (W.csv, GxK), and the inverse variance of the weights (Alpha.csv, Kx1).
You can change the output directory with the -o
option, e.g.
> peertool -f expression.csv -n 5 -o peer_n-5
If you are not interested in the posterior estimates of all the variables, you can suppress their output with switches. For example, to only output the residuals, you can use
> peertool -f expression.csv --no_a_out --no_w_out --no_x_out
The --no_res_out
switch suppresses the output of residuals.
If there are measured experimental variables that may contribute to
variability in the data, they can be included in the inference, and
specified with the -c
flag.
> peertool -f expression.csv -c covs.csv
The covariates file should be in csv or tab format, and have N rows and C columns, where N matches the number of samples in the expression file.
If desired, PEER can include a covariate for the mean (a vector of ones).
Addition of the mean covariate can be controlled via the
--add_mean
flag.
As default, PEER iterates through updates of every variable 100 times. To set it to say, 1000, use
> peertool -f expression.csv -i 100
PEER finishes if the increase in lower bound on the model evidence ceases to change, or the variance of the residuals has stabilised. The limiting values (tolerances) can be specified as
> peertool -f expression.csv --bound_tol 0.1 --var_tol 0.00000001
In general you can keep the bound tolerance fairly high, but should keep the variation tolerance quite low compared to the variance of the expression matrix. If unsure, use the default values.
Finally, the prior parameters on the noise and weight precision distributions can also be changed. As these are both gamma distributed, you can specify the a and b parameters of both:
> peertool -f expression.csv --e_pa 1.0 --e_pb 0.01 --a_pa 10.0 --a_pb 100
Applying PEER in R amounts to creating the model and setting its parameters, followed by inference. The model object will then hold the posterior distributions of the parameters.
First, load the peer library:
> library(peer)
If there are any errors, please see the installation page or FAQ. You will need data to analyse - we will use the data from the examples directory:
> expr = read.csv('examples/data/expression.csv', header=FALSE)
> dim(expr)
[1] 200 100
The data matrix is assumed to have N rows and G columns, where N is the number of samples, and G is the number of genes.
Now we can create the model object,
> model = PEER()
set the observed data,
> PEER_setPhenoMean(model,as.matrix(expr))
NULL
> dim(PEER_getPhenoMean(model))
[1] 200 100
(NULL response means no error here), say we want to infer K=10 hidden confounders,
> PEER_setNk(model,10)
NULL
> PEER_getNk(model)
[1] 10
and perform the inference.
> PEER_update(model)
iteration 0/1000
iteration 1/1000
iteration 2/1000
iteration 3/1000
iteration 4/1000
iteration 5/1000
iteration 6/1000
iteration 7/1000
iteration 8/1000
Converged (var(residuals)) after 8 iterations
NULL
The result is the model object with posterior distributions of the variables.
You can get the posterior mean of the inferred confounders (NxK matrix), their weights (GxK matrix), precision (inverse variance) of the weights (Kx1 matrix), and the residual dataset (NxG matrix):
> factors = PEER_getX(model)
> dim(factors)
[1] 200 10
> weights = PEER_getW(model)
> dim(weights)
[1] 100 10
> precision = PEER_getAlpha(model)
> dim(precision)
[1] 10 1
> residuals = PEER_getResiduals(model)
> dim(residuals)
[1] 200 100
> plot(precision)
PEER can automatically include an additional factor (covariate) to account for the mean expression. For most use cases, including the mean effect is likely to be a good choice. To active mean factors, use
> PEER_setAdd_mean(model, TRUE)
NULL
If running PEER using this setting, there will be 11 instead of 10 factors.
If there are measured experimental variables that may contribute to variability in the data, they can be included in the inference. The C observed covariates are assumed to be in a NxC matrix:
> covs = read.csv('examples/data/covs.csv',header=FALSE)
> PEER_setCovariates(model, as.matrix(covs))
NULL
This sets the first C factors to be fixed to the observed covariates, and extends the hidden factor matrix X to have additional C columns. Remember to cast the covariates as a matrix!
As default, PEER iterates through updates of every variable 1000 times. To set it to say, 100, use
> PEER_setNmax_iterations(model, 100)
NULL
PEER finishes if the increase in lower bound on the model evidence ceases to change, or the variance of the residuals has stabilised. The limiting values (tolerances) can be specified as
> PEER_setTolerance(model, 1)
NULL
> PEER_setVarTolerance(model, 0.1)
NULL
In general you can keep the bound tolerance fairly high, but should keep the variation tolerance quite low compared to the variance of the expression matrix. If unsure, use the default values (bound=0.001, variance=0.00001).
Finally, the prior parameters on the noise and weight precision distributions can also be changed. As these are both gamma distributed, you can specify the a and b parameters of both:
> PEER_setPriorAlpha(model,0.001,0.1)
NULL
> PEER_setPriorEps(model,0.1,10.)
NULL
PEER uses uninformative priors on weight precision and noise precision by default (Alpha a = 0.001, Alpha b = 0.1, Eps a = 0.1, Eps b = 10)
Applying PEER in Python amounts to creating the model and setting its parameters, followed by inference. The model object will then hold the posterior distributions of the parameters.
First, load the peer library:
>>> import peer
If there are any errors, please see the installation page or FAQ. You will need data to analyse - we will use the data from the examples directory:
>>> import scipy as SP
>>> expr = SP.loadtxt('examples/data/expression.csv', delimiter=',')
>>> expr.shape
(200,100)
The data matrix is assumed to have N rows and G columns, where N is the number of samples, and G is the number of genes.
Now we can create the model object,
>>> model = peer.PEER()
set the observed data,
>>> model.setPhenoMean(expr)
>>> model.getPhenoMean().shape
(200, 100)
say we want to infer K=10 hidden confounders,
>>> model.setNk(10)
>>> model.getNk()
10
and perform the inference.
>>> model.update()
iteration 0/1000
iteration 1/1000
iteration 2/1000
iteration 3/1000
iteration 4/1000
iteration 5/1000
iteration 6/1000
iteration 7/1000
iteration 8/1000
Converged (var(residuals)) after 8 iterations
The result is the model object with posterior distributions of the variables.
You can get the posterior mean of the inferred confounders (NxK matrix), their weights (GxK matrix), precision (inverse variance) of the weights (Kx1 matrix), and the residual dataset (NxG matrix):
>>> factors = model.getX()
>>> factors.shape
(200, 10)
>>> weights = model.getW()
>>> weights.shape
(100, 10)
>>> precision = model.getAlpha()
>>> precision.shape
(10, 1)
>>> residuals = model.getResiduals()
>>> residuals.shape
(200, 100)
If you have pylab installed, you can for example
>>> import pylab as PL
>>> PL.plot(precision)
>>> PL.show()
PEER can automatically include an additional factor (covariate) to account for the mean expression. For most use cases, including the mean effect is likely to be a good choice. To active mean factors, use
>>> model.setAdd_mean(True)
If running PEER using this setting, there will be 11 instead of 10 factors.
If there are measured experimental variables that may contribute to variability in the data, they can be included in the inference. The C observed covariates are assumed to be in a NxC matrix:
>>> model.setCovariates(covs)
This sets the first C factors to be fixed to the observed covariates, and extends the hidden factor matrix X to have additional C columns.
As default, PEER iterates through updates of every variable 1000 times. To set it to say, 100, use
>>> model.setNmax_iterations(100)
PEER finishes if the increase in lower bound on the model evidence ceases to change, or the variance of the residuals has stabilised. The limiting values (tolerances) can be specified as
>>> model.setTolerance(1)
>>> model.setVarTolerance(0.1)
In general you can keep the bound tolerance fairly high, but should keep the variation tolerance quite low compared to the variance of the expression matrix. If unsure, use the default values (bound=0.001, variance=0.00001).
Finally, the prior parameters on the noise and weight precision distributions can also be changed. As these are both gamma distributed, you can specify the a and b parameters of both:
>>> model.setPriorAlpha(0.001,0.1)
>>> model.setPriorEps(0.1,10.)
PEER uses uninformative priors on weight precision and noise precision by default (Alpha a = 0.001, Alpha b = 0.1, Eps a = 0.1, Eps b = 10)