Time zero normalization with the Multivariate Gaussian distribution
08 May 2022

Timecourses are a powerful experimental design for evaluating the impact of a perturbation. These perturbations are usually chemicals, such as drugs, because chemicals can be introduced quickly and with high temporal precision. With some technologies, such as the estradiol-driven promoters that I used in the induction dynamics expression atlas (IDEA), it is also possible to rapidly perturb a single gene, further increasing specificity and broadening applicability. By rapidly perturbing individuals, they can be synchronized based on the time when dosing began. We often call the point when dosing begins “time zero,” while all subsequent measurements correspond to the time post-perturbation. (Since time zero corresponds to a point when the perturbation has been applied but has not yet impacted the system, this measurement is usually taken just before adding the perturbation.)
One of the benefits of collecting a time zero measurement is that it allows us to remove, or account for, effects that are shared among all time points. In many cases this may just amount to analyzing fold changes of post-perturbation measurements with respect to their time zero observation, rather than the original measurements themselves. This can be useful if there is considerable variation among timecourses irrespective of the perturbation, such as if we were studying humans or mice. Similarly, undesirable variation due to day-to-day differences in instruments, sample stability, or any of the many other factors which could produce batch effects, can sometimes be addressed by measuring each timecourse together and working with fold changes. In either case, correcting for individual effects using pre-perturbation measurements will increase our power to detect perturbations’ effects.
Aside from correcting for unwanted variation, the kinetics of timecourses are a rich source of information which can be either a blessing or a curse. With temporal information, ephemeral responses can be observed. We can see both which features are changing and when they are changing. And, the ordering of events can point us towards causality. In practice, each of these goals can be difficult, or impossible, to achieve, leaving us with a nagging feeling that we’re leaving information on the table. There are many competing options for identifying differences in timecourses, few ways of summarizing dynamics in an intuitive way, and causal inference is often out of reach. In this post, and others to follow, I’ll pick apart a few of these limitations, discussing developments that were applied to the IDEA but will likely be useful for others thinking about biological time-series analysis (or other time series if you are so inclined!). Here, I evaluate a few established methods for identifying features which vary across time and then introduce an alternative approach based on the Multivariate Gaussian distribution and Mahalanobis distance, which increases power and does not require any assumptions about responses’ kinetics.
Our timecourse experiment
To evaluate methods for detecting temporal dynamics it’s helpful to use a dataset with clear-cut examples of timecourses with and without signal. With such a dataset in hand, we can easily detect signals that we are missing (false negatives), noise that we think is real (false positives), and evaluate overall recall (what fraction of signals are we detecting). We rarely have such positive and negative examples in real timecourses, so instead we can simulate timecourses with and without signal. Going forward I will also use genes as shorthand for whatever features we might be working with, since I’ve primarily worked with these methods in the context of gene expression data.
Environment Setup
First, I’m going to set up the R environment by loading some bread-and-butter packages and setting the global options for future and ggplot2.
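The original setup chunk isn’t shown; a minimal sketch of what it might look like (the specific package choices and future plan are assumptions):

```r
# Bread-and-butter packages for data wrangling and plotting
library(dplyr)
library(tidyr)
library(ggplot2)

# future: parallelize per-gene model fits across local R sessions
library(future)
plan(multisession)

# Global ggplot2 option: a consistent minimal theme for all figures
theme_set(theme_bw())
```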
Simulate timecourses containing signal
First, we can generate the subset of our timecourses which contain signal. These timecourses should follow a broad range of biologically feasible patterns.
To construct such timecourses, we can use the phenomenological timecourse model of Chechik & Koller, which represents a timecourse as a pair of sigmoidal responses, called an impulse. We’ll also use a simpler, single-sigmoid version of the C & K model.
To simulate data from these models, we can use the simulate_timecourses() function from the impulse R package, available on GitHub.
This function will draw a set of parameters for sigmoids and impulses from appropriate distributions to define a simulated timecourse. We’ll then add independent normally distributed noise to each observation. (For most genomic data types, measurements are log-normal, so we can think of these abundance units as already having been log-transformed.)
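In case the package isn’t at hand, here is a hedged sketch of the single-sigmoid flavor of this generative process. The parameterization and parameter values are illustrative, not the actual impulse implementation:

```r
# Single sigmoid rising from v_init to v_final with inflection at t_rise;
# a simplified stand-in for impulse::simulate_timecourses()
sigmoid_response <- function(t, v_init, v_final, t_rise, rate) {
  v_init + (v_final - v_init) / (1 + exp(-rate * (t - t_rise)))
}

set.seed(1234)
timepts <- c(0, 5, 10, 20, 30, 40, 60, 90)
noise_sd <- 0.5  # assumed noise level; consistent with covariances shown later

# Noiseless fit plus i.i.d. Gaussian measurement noise
sim_fit <- sigmoid_response(timepts, v_init = 0, v_final = 2.25,
                            t_rise = 25, rate = 0.1)
abundance <- sim_fit + rnorm(length(timepts), sd = noise_sd)
```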
| tc_id | time | sim_fit | abundance | signal |
|-------|------|-----------|-----------|-----------------|
| 2 | 0 | 0.0976223 | 0.0957832 | contains signal |
| 2 | 5 | 0.2018300 | 0.5742951 | contains signal |
| 2 | 10 | 0.3967604 | 0.2166940 | contains signal |
| 2 | 20 | 1.1307509 | 0.9666770 | contains signal |
| 2 | 30 | 1.8594480 | 0.9893428 | contains signal |
| 2 | 40 | 2.1534230 | 1.8150251 | contains signal |
| 2 | 60 | 2.2445179 | 2.7889178 | contains signal |
| 2 | 90 | 2.2489452 | 2.5141336 | contains signal |
Simulate timecourses which are just noise
Timecourses which are just noise are easy to generate: we can use independent draws from a normal distribution (with the same standard deviation that we used to add noise to the signals).
With timecourses with and without signals in hand, we can combine the two sets together while tracking their origin.
Additionally, since we are interested in time-dependent changes with respect to time zero, we can transform abundances into fold changes by subtracting the initial value of a timecourse from every measurement. We’ll work with both the native abundance scale and time-zero-normalized fold changes going forward.
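A sketch of this combine-and-normalize step on toy data (the table and column names are assumptions standing in for the simulated datasets):

```r
library(dplyr)

set.seed(42)
timepts <- c(0, 5, 10, 20, 30, 40, 60, 90)

# Toy stand-in for the no-signal timecourses, in long format
noise_timecourses <- expand.grid(time = timepts, tc_id = 1:3) %>%
  mutate(abundance = rnorm(n(), sd = 0.5), signal = "no signal")

# Time-zero normalization: subtract each timecourse's initial abundance
fold_changes <- noise_timecourses %>%
  group_by(tc_id) %>%
  mutate(fold_change = abundance - abundance[time == 0]) %>%
  ungroup()
```

By construction, every timecourse’s fold change at time zero is exactly zero, which is why the regression formulas later drop their intercepts.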
Example timecourses
Models to try
At this point, we want to fit a few flavors of time series models to each gene in order to determine how reliably each model can discriminate signal-containing and no-signal timecourses.
To make it easy to iterate over features, I like to use the nest function from tidyr to store all the data for a feature in a single row. Here, expression data will be stored as a list of gene-level tables.
Having nested one gene per row, we can apply multiple regression models to each gene, and we’ll do this treating both the fold change and the original expression level as responses to evaluate the effect of time zero normalization. The regression models that we’ll try are:
- linear effect of time on expression. Given our sigmoid and impulse generative process, we shouldn’t expect a linear relationship to work great, but it can serve as a nice baseline.
- cubic relationship between time and expression. This will fit a linear term, a quadratic ($t^2$) term, and a cubic ($t^3$) term to allow for more complicated dynamics, such as genes that go up and then down again. One feature of cubic regression, and other polynomial regression models such as quadratic regression, is that they are zero-centric. What I mean by this is that additional terms in a polynomial regression model, such as moving from a cubic to a quartic model, allow for extra flexibility around zero. This can be helpful, but if changes are occurring at late timepoints, we may need a high-degree polynomial to capture this change, and the cost will be a prediction which overfits to the noise in earlier timepoints.
- predicting expression with a spline over timepoints using generalized additive models (GAMs). These models are similar to the cubic models but provide support evenly across time. This will allow them to detect late changes without requiring many degrees of freedom. GAMs are powerful models, but they can run into problems when changes occur rapidly, especially if we have relatively few timepoints.
With these models our main goal is to determine whether each timecourse contains dynamic signal, rather than testing for the significance of individual parameters. Approaching the problem this way will also allow us to compare these different approaches even though they fit different numbers of parameters. To summarize each model’s prediction of the role of time, ANOVA can be used to determine how much variation is explained by time relative to the leftover noise. Because cubic regression and GAMs fit more parameters, they must do a better job of explaining the temporal dynamics to justify their extra degrees of freedom.
There are many other approaches that we could try; a few of these are worth mentioning:
- mgcv is an alternative implementation of GAMs to the gam package used below. It is able to decide how flexible a spline should be fit to each gene using cross-validation. For our synthetic dataset, mgcv actually fails for a number of features owing to the complexity of our dynamics relative to the small number of timepoints.
- If we had replicates of timepoints then we could fit a model which treats each timepoint as a categorical variable. This would allow us to detect dynamics without assuming that we expect certain patterns. The downside is that this would require us to collect twice as much data, or perhaps cut back on timepoints to provide repeated measures of the timepoints that we do have. In general, I think it’s better to have more unique timepoints represented, even if we don’t have repeated measures, since this provides more even representation of measurements over the time period we care about.
- Since time points are not evenly spaced, we could have tried transforming time when fitting the above models. While the timepoints are “exponentially” sampled, taking log(time) would send time zero to $-\infty$, so a better transformation would be using the square root of time as the independent variable.
A couple of notes:
- Since the time zero fold change must be zero by definition, I applied this as a constraint (this is the “+ 0” in the formulas below).
- future was used to parallelize over genes; its settings were set up in the “environment setup” section.
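A sketch of the three fits on one gene’s fold changes, using base R plus splines. The `+ 0` term drops the intercept, pinning the fit to zero at time zero; I use `splines::ns()` as a stand-in for the gam package smoother, and the data here are illustrative:

```r
library(splines)

# One gene's fold changes (toy values resembling a sigmoidal response)
one_gene <- data.frame(
  time = c(0, 5, 10, 20, 30, 40, 60, 90),
  fold_change = c(0, 0.48, 0.12, 0.87, 0.89, 1.72, 2.69, 2.42)
)

fits <- list(
  linear = lm(fold_change ~ time + 0, data = one_gene),
  cubic  = lm(fold_change ~ poly(time, 3, raw = TRUE) + 0, data = one_gene),
  spline = lm(fold_change ~ ns(time, df = 4) + 0, data = one_gene)
)

# One p-value per model: how much variation does time explain
# relative to the residual noise?
p_values <- sapply(fits, function(fit) anova(fit)[["Pr(>F)"]][1])
```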
Each model × gene combination can be summarized by a single p-value. We expect the signal-containing timecourses to have relatively low p-values, while the no-signal timecourses’ p-values should be uniformly distributed between 0 and 1.
To correct for multiple tests we can use the Storey q-value approach to control the false discovery rate (FDR). To do this we will estimate q-values separately for each model and use a q-value of 0.1 as a cutoff for significance. At this level we expect that 1/10 of genes with a q-value of less than 0.1 will be from the no-signal group.
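Storey q-values come from the Bioconductor qvalue package (`qvalue::qvalue(p_values)$qvalues`). As a base-R sketch of the same idea, Benjamini-Hochberg adjusted p-values are the special case where the null proportion $\pi_0$ is fixed at 1 (the p-values below are simulated purely for illustration):

```r
set.seed(1)
# Simulated p-values: 900 null (uniform) plus 100 signal-like (skewed to 0)
p_values <- c(runif(900), rbeta(100, 0.1, 1))

# BH-adjusted p-values; equivalent to Storey q-values with pi0 = 1
q_values <- p.adjust(p_values, method = "BH")

# Declare discoveries at a 10% FDR
discoveries <- q_values < 0.1
```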
Visually, there are a large number of no-signal timecourses with small p-values in the fold-change data. This suggests that something pathological is going on.
We can also summarize models based on the FDR that was actually realized, given that we were shooting for 0.1, and based on the total recall of signal-containing timecourses at the 0.1 cutoff.
| model | response | false negative | false positive | true negative | true positive | fdr | recall |
|--------|-------------|------|------|------|------|-----------|--------|
| cubic | abundance | 1912 | 10 | 7990 | 88 | 0.1020408 | 0.0440 |
| gam | abundance | 1118 | 128 | 7872 | 882 | 0.1267327 | 0.4410 |
| linear | abundance | 1532 | 55 | 7945 | 468 | 0.1051625 | 0.2340 |
| cubic | fold-change | 280 | 2928 | 5072 | 1720 | 0.6299484 | 0.8600 |
| gam | fold-change | 276 | 4151 | 3849 | 1724 | 0.7065532 | 0.8620 |
| linear | fold-change | 377 | 3463 | 4537 | 1623 | 0.6808887 | 0.8115 |
In this summary we can see that working with abundances does accurately control the FDR, but recall is low for linear and cubic regression and moderate for GAMs. Working with fold changes, in contrast, fails to control the FDR. While we intended for 1/10 of our discoveries to be null, around 60% actually are! So while recall is pretty good, most of the discovered kinetic responses will be garbage.
To figure out what is going wrong, we can plot examples of false positives (model thinks there is signal when there isn’t) and false negatives (model doesn’t think there is signal when there really is).
From this we can say a few things:

When working with abundances we need to include an intercept term so the average value of a feature can be separated from its change over time. Doing this, however, can mask some early responses, since the intercept becomes the point of reference rather than the value at time zero.

Changes which show up primarily in one or two timepoints might be missed, since the polynomial and GAM models used above can’t contort themselves to fit these dynamics. This would be appropriate if these points were just noise, but in many cases these are large changes beyond what we would expect as noise in our generative process.

Working with fold changes enforces the value at time zero as the appropriate reference. This makes conceptual sense for a perturbation timecourse since at time zero (and before) the system is in a reference state, and all subsequent timepoints capture the dynamics of interest. However, working directly with fold changes creates a problem: we are no longer controlling the FDR! In this simulation, if we aimed for discoveries at a 10% FDR then we would in fact realize a 60% FDR. This is a big problem, especially since outside of a simulation we don’t know which timecourses contain real signal and which are spurious (if we did, then why would we do the experiment…), so we would think there were strong signals in our dataset when it may in fact be entirely noise.
If we wanted to get around these issues, then we could still probably make a regression model work, but it would require adding more samples and cost to the analysis. The two main paths we could take are:

- replicates of each timecourse: if we had multiple biological replicates at each timepoint, then rather than treating time as a numerical variable, we could treat it as a categorical variable. In this case we could fit an ANOVA model which would assess whether the variation between timepoints is greater than the variation within timepoints. This would be a powerful way of detecting differences across time which is agnostic to the types of changes occurring.

- denser sampling: if we had more measurements near timepoints where rapid dynamics were occurring, then it would be easier to distinguish smooth, rapid responses from single outlier observations. This would still require us to fit a model which can appropriately capture such dynamics, but with more observations we could either fit a more flexible model (with more degrees of freedom) or use the same simple models but with more power to detect significant changes.
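The replicate-based option above can be sketched as a one-way ANOVA with time as a factor (the replicate counts and data here are hypothetical; this example is pure noise, so the p-value carries no signal):

```r
set.seed(7)
# Three hypothetical replicates at each of five timepoints
replicated <- data.frame(
  time = factor(rep(c(0, 5, 10, 20, 30), each = 3)),
  abundance = rnorm(15, sd = 0.5)
)

# Is between-timepoint variation larger than within-timepoint variation?
fit <- aov(abundance ~ time, data = replicated)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
```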
In most cases, I think adopting one of these options is probably the smart way to go. However, there are reasons why collecting more samples in a given experiment may not be feasible, such as when MANY similar experiments are being performed, as in the IDEA, or when there are constraints on how frequently samples can be collected.
In such cases, I think we can find a path forward by stepping away from regression and thinking about likelihoodbased methods which capture the nature of fold changes.
Timecourse fold change likelihood
Using likelihood methods, we start with a statistical model for how our observations were generated. We can then sample or optimize the parameters of this model in order to find the most likely value (i.e., the frequentist MLE) or generate a distribution of parameters incorporating both likelihood and parameter plausibility (i.e., the Bayesian approach).
Before we posit an appropriate likelihood for fold changes, let’s figure out why the regression approaches using fold changes were so anti-conservative. The big problem here was that many timecourses which were just noise looked like they actually contained signal. So, let’s work with just the “no signal” timecourses.
To do this, we can look at how the value at time zero influences fold-change estimates of later timepoints.
From this plot, we see that a large positive value at time zero (due to noise) results in later timepoints appearing as consistently negative fold changes. Conversely, if the time zero value is negative then later timepoints appear consistently positive.
While we simulated abundances as independent Normal draws, which would possess a spherical covariance structure, normalizing to time zero has induced a correlation between observations, and this dependence will need to be accounted for in order to have a useful null hypothesis.
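We can verify this induced dependence directly by simulation (the noise level and number of draws are illustrative):

```r
set.seed(123)
sigma <- 0.5        # assumed measurement noise (sd)
n_tc  <- 50000      # number of simulated null timecourses

# i.i.d. noise at time zero and at three later timepoints
x0 <- rnorm(n_tc, sd = sigma)
x  <- matrix(rnorm(n_tc * 3, sd = sigma), ncol = 3)

# Fold changes: every later timepoint shares the same time-zero draw
fc <- x - x0

# The diagonal of cov(fc) is ~ 2 * sigma^2 = 0.5,
# while the off-diagonals are ~ sigma^2 = 0.25
cov(fc)
```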
Based on the value of time zero, which all subsequent time points are normalized with respect to, these later timepoints are biased such that they are all higher or lower than otherwise expected. In order to test for timecourselevel signal based on the aggregate signal of all observations, we need to account for the dependence of these observations. Luckily, the form of this dependence is quite straightforward.
We can see this dependence using the sample covariance matrix of our null timecourses.
|    | 5 | 10 | 20 | 30 | 40 | 60 | 90 |
|----|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 5  | 0.5141863 | 0.2514794 | 0.2583391 | 0.2533660 | 0.2557825 | 0.2539230 | 0.2559496 |
| 10 | 0.2514794 | 0.5076230 | 0.2605876 | 0.2536349 | 0.2504740 | 0.2512202 | 0.2540698 |
| 20 | 0.2583391 | 0.2605876 | 0.5131240 | 0.2505837 | 0.2562055 | 0.2559705 | 0.2547911 |
| 30 | 0.2533660 | 0.2536349 | 0.2505837 | 0.4973020 | 0.2534935 | 0.2517957 | 0.2526950 |
| 40 | 0.2557825 | 0.2504740 | 0.2562055 | 0.2534935 | 0.4982398 | 0.2498487 | 0.2531005 |
| 60 | 0.2539230 | 0.2512202 | 0.2559705 | 0.2517957 | 0.2498487 | 0.4986771 | 0.2508796 |
| 90 | 0.2559496 | 0.2540698 | 0.2547911 | 0.2526950 | 0.2531005 | 0.2508796 | 0.5066272 |
Observation variances are approximately $2\text{Var}(x_t) \approx 0.5$ because subtracting one Normal from another sums their variances:

\[\mathcal{N}(\mu_{A}, \sigma^{2}_{A}) - \mathcal{N}(\mu_{B}, \sigma^{2}_{B}) = \mathcal{N}(\mu_{A} - \mu_{B}, \sigma^{2}_{A} + \sigma^{2}_{B})\]

Observation covariances are approximately $\text{Var}(x_t) \approx 0.25$ because the shared normalization to time zero adds the variance of the time-zero measurement as a covariance to the later timepoints.
Normalization of Normal (or logNormal) observations to a common reference produces a Multivariate Gaussian distribution.
\[\mathbf{f}_{i} \sim \mathcal{MN}\left(\mu = \mathbf{0}, \Sigma = \begin{bmatrix} 2\sigma_{\epsilon}^2 & \sigma_{\epsilon}^2 & \dots & \sigma_{\epsilon}^2 \\ \sigma_{\epsilon}^2 & 2\sigma_{\epsilon}^2 & \dots & \sigma_{\epsilon}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{\epsilon}^2 & \sigma_{\epsilon}^2 & \dots & 2\sigma_{\epsilon}^2 \end{bmatrix}\right)\]

If we expect null fold-change measurements to follow this distribution, then we can sample from it to explore whether it works as a fold-change generative process, and then assess whether draws are likely to have come from this distribution as a null hypothesis. For this purpose, we’ll use the Mahalanobis distance, a multivariate generalization of the Wald test that assesses how many standard deviations an observation is from the mean of the distribution. This requires an estimate of the covariance matrix, an assumption that will be discussed below. We expect these statistics to be $\chi^{2}$ distributed with degrees of freedom equal to the number of post-time-zero timepoints.
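A sketch of this check with base R plus MASS (the noise variance and the number of draws are assumptions matching the simulation above):

```r
library(MASS)

sigma2 <- 0.25   # assumed noise variance (sd = 0.5)
n_t <- 7         # post-time-zero timepoints: 5, 10, 20, 30, 40, 60, 90

# Null covariance: 2 * sigma2 on the diagonal, sigma2 everywhere else
Sigma <- sigma2 * (diag(n_t) + matrix(1, n_t, n_t))

set.seed(2022)
null_fc <- mvrnorm(n = 5000, mu = rep(0, n_t), Sigma = Sigma)

# Squared Mahalanobis distances are chi-squared with n_t degrees of
# freedom under the null, giving one p-value per timecourse
d2 <- mahalanobis(null_fc, center = rep(0, n_t), cov = Sigma)
p_values <- pchisq(d2, df = n_t, lower.tail = FALSE)
```

Note that `mahalanobis()` returns the squared distance, which is what the $\chi^{2}$ comparison requires.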
Having taken draws from the Multivariate Gaussian distribution and then used the Mahalanobis distance to calculate p-values, the $\text{Unif}(0,1)$ distribution of these p-values confirms that the Mahalanobis distance is appropriate.
We can now test whether fold changes are really Multivariate Gaussian distributed by inspecting the distribution of Mahalanobis distance p-values from the no-signal timecourses.
The p-values for the no-signal fold change timecourses are indeed $\text{Unif}(0,1)$ distributed, as we hoped.
Now, we can calculate the Mahalanobis distances and their corresponding p-values for the signal-containing timecourses as well. Signal in these timecourses will both increase the overall variance in expression for a feature and produce similar deviations at nearby timepoints. These factors will make it harder for the Multivariate Gaussian noise model to explain the signal-containing expression vector, resulting in a high Mahalanobis distance and a small p-value.
Based on the p-value distributions, most of the signal-containing timecourses have small p-values, suggesting increased recall. We can verify this as before by summarizing results based on the realized FDR and the overall recall of signal-containing timecourses at this FDR cutoff.
| false negative | false positive | true negative | true positive | fdr | recall |
|------|------|------|------|-----------|--------|
| 201 | 194 | 7806 | 1799 | 0.0973407 | 0.8995 |
The Multivariate Gaussian test did a great job of appropriately identifying signal-containing timecourses. The high power of this test arises from using estimates of the noise level of observations, and our resulting ability to step away from assumptions regarding the types of time-dependent signals we expect.
The test uses not just a pattern of expression but also information about the magnitude of noise associated with each observation. This information is available in many genomics contexts: observation-level estimates of noise are available, for example, (1) via the mean-variance relationships of RNA-seq data or (2) via the consistency of peptides in proteomics data. Having these estimates lets us identify large fold changes which are unlikely to occur by chance, even if all post-time-zero timepoints are similar. Similarly, complex rapid dynamics will look like a poorly fit regression model with high residual error. But, if we know the magnitude of residual error that we expect, then signals can be identified by just looking for an excess of variation in the timecourse (accounting for the bias introduced by time zero normalization). Using noise estimates may seem like cheating since this information was not directly used by the other tests. If we were to use an estimate of the noise level in those tests, it would be to carry out a weighted least-squares regression. But, since all observations here have the same level of noise added, weighted regressions would be equivalent to the unweighted regressions used above.
Using the Mahalanobis distance, we can look for any departure from the null noise model to define signal. This lets us step away from models which look for particular types of signals, such as the linear, cubic, or smooth relationships sought by the regression models we applied. Some of these models can be quite flexible, but when the underlying data do not follow these relationships, the models will often fail. Here, we simulated signals as biologically feasible sigmoidal or impulse responses, so none of the regression models applied could capture every instance of a simulated signal-containing timecourse. If we were to use a regression model, then we would be best off fitting a nonlinear least squares model following the sigmoidal or impulse form. Doing this is actually nontrivial if we want to avoid pathological fits, but these issues have been addressed in the impulse R package. I’ll discuss this problem in a future post.