<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.shackett.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.shackett.org/" rel="alternate" type="text/html" /><updated>2026-01-31T03:43:29+00:00</updated><id>https://www.shackett.org/feed.xml</id><title type="html">Sean Hackett</title><subtitle>As a Director of Data Science, I lay the groundwork for organizational change by connecting data collection, organization, and synthesis. But, I still love getting into the weeds on technical problems where I can continue developing my analytics and programming skills.</subtitle><author><name>Sean Hackett</name></author><entry><title type="html">Distinguishing Activation from Inhibition with Relation-Aware Graph Neural Networks</title><link href="https://www.shackett.org/relation_prediction/" rel="alternate" type="text/html" title="Distinguishing Activation from Inhibition with Relation-Aware Graph Neural Networks" /><published>2026-01-06T00:00:00+00:00</published><updated>2026-01-06T00:00:00+00:00</updated><id>https://www.shackett.org/relation_prediction</id><content type="html" xml:base="https://www.shackett.org/relation_prediction/"><![CDATA[<p>In my <a href="https://www.shackett.org/napistu_torch">last post</a>, I discussed
self-supervised edge prediction as a way of embedding genes using a
gene-regulatory network.</p>

<p>This approach allows genes, metabolites, drugs, and other vertices to be
connected based on shared network topology. However, to date I’ve only
discussed edge prediction using a dot-product head, where a
vertex-pair’s edge support is a direct readout of their similarity in
embedding space (𝐚 · 𝐛). While surprisingly powerful, this head has
limitations when vertices are heterogeneous or interact in qualitatively
different ways — particularly when we want to distinguish between
activation and inhibition.</p>
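<p>As a concrete illustration of this limitation, here is a minimal sketch (plain Python, not the Napistu-Torch implementation) of a dot-product edge score. Because the inner product is symmetric in its arguments, the head assigns the same score to A → B as to B → A, leaving no room for signed, relation-specific semantics:</p>

```python
def dot_product_score(a: list[float], b: list[float]) -> float:
    """Edge score as the inner product of two vertex embeddings."""
    return sum(x * y for x, y in zip(a, b))

src = [0.5, -1.0, 2.0]
dst = [1.0, 0.25, -0.5]

# The score is symmetric: A -> B and B -> A get the same value, so a
# single dot product cannot distinguish "A activates B" from
# "A inhibits B", or even the direction of regulation.
assert dot_product_score(src, dst) == dot_product_score(dst, src)
```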

<p>Here, I explore more expressive approaches for learning mappings between
A → B by evaluating both general edge prediction heads (like MLPs) and
“relation-aware” heads that can learn distinct mappings for different
edge types. The post will cover:</p>

<ul>
  <li>Data model and training changes enabling relation-specific
predictions</li>
  <li>Geometric analysis revealing how relation-aware heads encode
regulatory semantics</li>
  <li>PerturbSeq validation demonstrating successful prediction of signed
regulatory interactions</li>
  <li>Pre-trained models available on HuggingFace</li>
</ul>

<!--more-->

<p>Edge prediction is a powerful approach for predicting regulatory
relationships between molecular species, but not all regulatory
relations are equivalent. They vary both in how molecules interact
(physically, functionally, mechanistically) and in the consequences of
these interactions (activation, inhibition, ambiguous effects, or no
effect). While the edge encoder partially captures this information to
weight message passing, the ultimate prediction is a single continuous
score representing edge likelihood — without distinguishing the type
of interaction.</p>

<p>Ideally, I want models that can predict not just whether an interaction
occurs, but how it occurs. This led me to relation-aware approaches.
Relations are commonly discussed in the context of knowledge graphs,
where qualitatively different vertex types are connected by different
relationship types. For example, embedding the Open Targets knowledge
graph organizes genes and phenotypes in a common manifold while also
connecting drugs, chemical probes, and other entity types. Learning
relation-aware edges defines specific transformations that map between
distinct regions of the embedding space.</p>
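<p>To make this concrete, here is a toy TransE-style relation score (illustrative vectors, not trained embeddings): each relation is a learned translation, and an edge (a, r, b) is plausible when a + r lands near b.</p>

```python
import math

def transe_distance(a, r, b):
    """Euclidean distance ||a + r - b||; smaller means more plausible."""
    return math.sqrt(sum((ai + ri - bi) ** 2 for ai, ri, bi in zip(a, r, b)))

# toy 2-D embeddings: the relation vector maps the "gene" region of the
# space onto the "phenotype" region
gene = [1.0, 0.0]
phenotype = [1.0, 2.0]
rel_associated_with = [0.0, 2.0]
rel_unrelated = [-3.0, 0.0]

# the plausible relation places gene + r exactly on the phenotype
assert transe_distance(gene, rel_associated_with, phenotype) == 0.0
assert transe_distance(gene, rel_unrelated, phenotype) > 0.0
```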

<p>However, I will focus on learning relation types within a largely
homogeneous vertex set — primarily genes and metabolites that can
serve multiple regulatory roles. This presents a greater challenge for
relation-aware methods, as vertices cannot be cleanly separated by type,
and the same molecule may act as both activator and inhibitor in
different contexts.</p>

<h2 id="workflow-updates-supporting-relation-prediction">Workflow updates supporting relation prediction</h2>

<p>Building robust relation-aware models required improvements to both the
Napistu-Torch training framework and the underlying data model.</p>

<p><strong>General framework improvements:</strong></p>

<ul>
  <li><strong>Hugging Face integration</strong> for reproducible datasets and sharing
pre-trained models</li>
  <li><strong>Training enhancements</strong> including Weights &amp; Biases sweep support
and resumable training</li>
  <li><strong>Transfer learning capabilities</strong> for loading pre-trained encoders
and fine-tuning models</li>
</ul>

<p><strong>Relation-specific data model changes:</strong></p>

<ul>
  <li><strong>Reaction vertex removal</strong> to enable direct edge prediction between
molecular species</li>
  <li><strong>Relation-type labels</strong> derived from source and target Systems
Biology Ontology (SBO) role annotations</li>
</ul>

<h3 id="restructuring-the-napistugraph---no-more-reaction-vertices">Restructuring the <code class="language-plaintext highlighter-rouge">NapistuGraph</code> - no more reaction vertices</h3>

<p>Earlier versions of Napistu included both molecular species (proteins,
metabolites, drugs, etc.) and reaction vertices. For complex regulatory
mechanisms like enzymatic reactions, this provided a clear functional
description anchoring pairwise molecular interactions. This was also
useful for network visualization, since reactions from common sources
— particularly the many narrowly-scoped Reactome pathways — provide
ideal labels for network neighborhoods.</p>

<p>However, including reactions has a major downside — individual edges
lose their meaning. For example, an enzyme transforming A → B would be
encoded as two separate edges (A → R and R → B). For many purposes this
is fine, but for edge prediction it adds more noise than signal. For the
model to learn what an A → R → B reaction represents, it would need to
encode B’s embedding within the R embedding. This is both
computationally difficult and conceptually unnecessary, so for GNNs I’m
moving to direct A → B connections.</p>
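<p>A minimal sketch of this restructuring, assuming a toy edge list (this is not the actual <code class="language-plaintext highlighter-rouge">NapistuGraph</code> code): edges into each reaction vertex are joined with edges out of it to produce direct species-to-species connections.</p>

```python
# hypothetical edge list containing one reaction vertex, "R1"
edges = [
    ("enzyme_E", "R1"),     # E catalyzes R1
    ("substrate_A", "R1"),  # A is consumed by R1
    ("R1", "product_B"),    # R1 produces B
]
reaction_vertices = {"R1"}

# join each reaction's incoming sources with its outgoing targets
direct_edges = []
for rxn in reaction_vertices:
    sources = [s for s, t in edges if t == rxn]
    targets = [t for s, t in edges if s == rxn]
    direct_edges += [(s, t) for s in sources for t in targets]

# direct_edges now connects molecular species without the intermediate R1
```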

<p>Predicting direct connections introduces a wrinkle: I had previously
enforced that no more than one edge could connect an A-B pair. Moving to
a reaction-less graph, I relaxed this constraint so multiple edges can
now connect the same vertex pair. This allows activating, inhibitory,
and interaction edges to simultaneously exist — relationships that
would have previously been distinguished by their intermediate reaction
vertices.</p>
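<p>A hypothetical example of what this relaxation permits (illustrative edge records, not the actual data model): the same vertex pair can now carry both a regulatory edge and a physical-interaction edge.</p>

```python
# two parallel edges connecting the same vertex pair, distinguished
# only by their relation types
edges = [
    {"source": "TP53", "target": "MDM2", "relation_type": "stimulator -> modified"},
    {"source": "TP53", "target": "MDM2", "relation_type": "interactor -> interactor"},
]

# both records connect one (source, target) pair
pairs = {(e["source"], e["target"]) for e in edges}
assert len(edges) == 2 and len(pairs) == 1
```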

<h3 id="adding-relation_type-to-napistudata">Adding <code class="language-plaintext highlighter-rouge">relation_type</code> to <code class="language-plaintext highlighter-rouge">NapistuData</code></h3>

<p>When constructing the network graph, I encode each mechanism as a series
of pairwise interactions, with each participant assigned a role from the
<em>SBO</em> controlled vocabulary. <em>SBO</em> terms — like interactor,
stimulator, inhibitor, modifier, and modified — capture the distinct
ways molecules participate in regulatory mechanisms. To create
<code class="language-plaintext highlighter-rouge">relation_type</code> labels for edges, I constructed composite labels by
combining each edge’s source and target SBO terms, such as “catalyst →
reactant,” “interactor → interactor,” and “stimulator → modified.”</p>
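<p>A minimal sketch of this label construction (illustrative roles only; the real labels come from the SBO annotations on each edge):</p>

```python
def relation_type_label(source_sbo_role: str, target_sbo_role: str) -> str:
    """Combine the two endpoint SBO roles into one edge-level label."""
    return f"{source_sbo_role} -> {target_sbo_role}"

# hypothetical per-edge (source role, target role) pairs
edge_roles = [
    ("catalyst", "reactant"),
    ("interactor", "interactor"),
    ("stimulator", "modified"),
]
labels = [relation_type_label(s, t) for s, t in edge_roles]
```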

<p><code class="language-plaintext highlighter-rouge">NapistuData</code> (a subclass of PyG’s <code class="language-plaintext highlighter-rouge">Data</code>) supports relations through
two optional attributes: <code class="language-plaintext highlighter-rouge">relation_type</code> and its associated
<code class="language-plaintext highlighter-rouge">relation_manager</code> (for tracking label metadata). Creating these
relation types is elegantly handled as an extension of the existing
“edge strata” functionality, which organizes edges based on vertex
and/or species type to create hard negative samples.</p>

<h2 id="fitting-relation-unaware-models">Fitting relation-(un)aware models</h2>

<p>During standard edge prediction, we score a possible edge based on the
source and target vertices’ embeddings. For relation prediction, we
additionally provide heads with a <code class="language-plaintext highlighter-rouge">relation_type</code> integer to distinguish
different types of relations. To evaluate different head architectures,
I trained a range of relation-aware heads alongside simpler
relation-unaware baselines.</p>

<p>To enable fair and efficient comparison across heads, I first trained a
128-dimensional <code class="language-plaintext highlighter-rouge">GraphConv</code> message passing encoder with a
32-dimensional edge encoder and a simple dot-product head. I deployed
this pre-trained model to <a href="https://huggingface.co/seanhacks/edge_prediction_dotprod_128e">Hugging
Face</a>,
then initialized each head of interest with the pre-trained encoder
weights.</p>

<p>To leverage this pretraining, I made three key changes to the training
regime:</p>

<ul>
  <li>Lowered the learning rate substantially from 0.003 (original
dot-product head) to 0.0005 (transfer learning experiments)</li>
  <li>Used the one-cycle scheduler to gradually ramp up the learning rate</li>
  <li>Initialized expressive heads with <code class="language-plaintext highlighter-rouge">init_as_identity</code> settings (when
appropriate) so they started from a similar state as the pre-trained
dot-product head</li>
</ul>

<p>To address the imbalanced distribution of relation types in the training
data, I applied relation-weighting to each head’s loss function (binary
cross-entropy for most heads and a margin-based loss for TransE and
RotatE). Each relation type’s loss contribution is weighted by
1/√(relation-type count), down-weighting abundant relation types (like
“interactor → interactor”) while emphasizing rare but biologically
important ones (like “inhibitor → modified”). This ensures that the
models learn to predict all relation types effectively, rather than
primarily optimizing for the most common edges.</p>
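<p>The weighting scheme can be sketched in a few lines (toy counts; the real counts come from the training edge list):</p>

```python
import math

# illustrative relation-type counts
relation_counts = {
    "interactor -> interactor": 10_000,  # abundant
    "inhibitor -> modified": 100,        # rare but biologically important
}

# weight each relation type by 1 / sqrt(count)
weights = {r: 1.0 / math.sqrt(n) for r, n in relation_counts.items()}

# the 100x rarer relation receives a 10x larger per-edge weight
ratio = weights["inhibitor -> modified"] / weights["interactor -> interactor"]
assert math.isclose(ratio, 10.0)
```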

<p>I fitted all models using model-specific configs and the Napistu-Torch
CLI.</p>

<h3 id="reproducing-this-analysis">Reproducing this analysis</h3>

<p>This analysis is fully reproducible — all code, data, and model
configurations are provided so you can run the complete workflow on your
own machine.</p>

<p><strong>Environment setup:</strong></p>

<ol>
  <li>
    <p>Install <a href="https://docs.astral.sh/uv/#highlights">uv</a> (or use <code class="language-plaintext highlighter-rouge">pip</code> if
preferred).</p>
  </li>
  <li>
    <p>Set up a Python environment:</p>
  </li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv venv <span class="nt">--python</span> 3.11
<span class="nb">source</span> .venv/bin/activate
<span class="c"># Core dependencies</span>
uv pip <span class="nb">install </span><span class="nv">torch</span><span class="o">==</span>2.8.0
uv pip <span class="nb">install </span>torch-scatter torch-sparse <span class="nt">-f</span> https://data.pyg.org/whl/torch-2.8.0+cpu.html
uv pip <span class="nb">install </span><span class="nv">napistu</span><span class="o">==</span>0.8.5
<span class="c"># pin wandb to 0.22.x for compatibility</span>
uv pip <span class="nb">install </span><span class="nv">wandb</span><span class="o">==</span>0.22.3 
uv pip <span class="nb">install</span> <span class="s2">"napistu-torch[pyg,lightning,analysis]==0.3.6"</span>
<span class="c"># For rendering the notebook</span>
uv pip <span class="nb">install </span>ipykernel nbformat nbclient
python <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span><span class="o">=</span>blog-staging
</code></pre></div></div>

<ol start="3">
  <li>
    <p>Download the
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/relation_prediction.qmd"><code class="language-plaintext highlighter-rouge">relation_prediction.qmd</code></a>
notebook (or copy relevant code blocks).</p>
  </li>
  <li>
    <p>Choose your path:</p>

    <ul>
      <li><strong>Using pre-trained models</strong> (recommended): The notebook will
load the models from Hugging Face on-the-fly.</li>
      <li><strong>Training from scratch</strong>: Download the <a href="https://github.com/shackett/shackett/blob/main/assets/data/relation_prediction_configs.zip">model configs and
training shell
script</a>
to train models yourself.</li>
    </ul>
  </li>
  <li>
    <p>Configure <code class="language-plaintext highlighter-rouge">WORKING_DIR</code> in the following code block to point to your
working directory.</p>
  </li>
</ol>

<h3 id="configuration-and-imports">Configuration and imports</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># standard library imports
</span><span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">combinations</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">textwrap</span>

<span class="c1"># 3rd party imports
</span><span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="nb">abs</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">napistu.ingestion.perturbseq</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">assign_predicted_direction</span><span class="p">,</span>
    <span class="n">load_harmonizome_perturbseq_datasets</span><span class="p">,</span>
    <span class="n">_get_distinct_harmonizome_perturbseq_interactions</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">napistu.ingestion.constants</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">SIGNED_PERTURBATION_TYPES</span><span class="p">,</span>
    <span class="n">STRONG_ORDERED_SIGNED_PERTURBSEQ_DIRECTIONS</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">import</span> <span class="nn">napistu.utils</span> <span class="k">as</span> <span class="n">napistu_utils</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="c1"># import a couple of functions used by just the posted version of the blog
# pip install git+https://github.com/shackett/shackett-utils.git
</span><span class="kn">from</span> <span class="nn">shackett_utils.utils</span> <span class="kn">import</span> <span class="n">pd_utils</span>
<span class="kn">from</span> <span class="nn">shackett_utils.blog.html_utils</span> <span class="kn">import</span> <span class="n">display_tabulator</span>

<span class="c1"># napistu-torch imports
</span><span class="kn">from</span> <span class="nn">napistu_torch.evaluation.manager</span> <span class="kn">import</span> <span class="n">RemoteEvaluationManager</span>
<span class="kn">from</span> <span class="nn">napistu_torch.visualization.basic_metrics</span> <span class="kn">import</span> <span class="n">plot_auc_only</span><span class="p">,</span> <span class="n">_extract_metric</span>
<span class="kn">from</span> <span class="nn">napistu_torch.visualization.advanced_metrics</span> <span class="kn">import</span> <span class="n">plot_combined_grouped_barplot</span>
<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.relation_prediction</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">calculate_relation_type_confusion_and_correlation</span><span class="p">,</span>
    <span class="n">compare_relation_type_predictions_to_perturbseq_truth</span><span class="p">,</span>
    <span class="n">get_perturbseq_edgelist_tensor</span><span class="p">,</span>
    <span class="n">summarize_relation_type_aucs</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">napistu_torch.models.constants</span> <span class="kn">import</span> <span class="n">HEAD_DESCRIPTIONS</span>
<span class="kn">from</span> <span class="nn">napistu_torch.utils.tensor_utils</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">compute_correlation_matrix</span><span class="p">,</span>
    <span class="n">compute_effective_dimensionality</span><span class="p">,</span>
    <span class="n">compute_spearman_correlation_torch</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">napistu_torch.visualization.heatmaps</span> <span class="kn">import</span> <span class="n">plot_heatmap</span>

<span class="n">WORKING_DIR</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">"~/Desktop/relation_prediction_experiments"</span><span class="p">))</span>
<span class="n">PATH_TO_NAPISTU_STORE</span> <span class="o">=</span> <span class="n">WORKING_DIR</span> <span class="o">/</span> <span class="s">".store"</span>

<span class="n">MODEL_DISPLAY_ORDER</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"dot_product"</span><span class="p">,</span>
    <span class="s">"mlp"</span><span class="p">,</span>
    <span class="s">"attention"</span><span class="p">,</span>
    <span class="s">"distmult"</span><span class="p">,</span>
    <span class="s">"rotate"</span><span class="p">,</span>
    <span class="s">"transe"</span><span class="p">,</span>
    <span class="s">"relation_attention"</span><span class="p">,</span>
    <span class="s">"relation_gated_mlp"</span><span class="p">,</span>
    <span class="s">"relation_attention_mlp"</span><span class="p">,</span>
<span class="p">]</span>

<span class="n">MODEL_HF_REPOSITORIES</span> <span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"dot_product"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_dotprod_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"mlp"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_mlp_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"attention"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_attention_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"distmult"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_distmult_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"rotate"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_rotate_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"transe"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_transe_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"relation_attention"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_relationattention_128e"</span><span class="p">,</span> <span class="s">"20251229-2"</span><span class="p">),</span>
    <span class="s">"relation_gated_mlp"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_relationgatedmlp_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"relation_attention_mlp"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_relationattnmlp_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
<span class="p">}</span>

<span class="n">RELATION_AWARE_FOCUSED_HEADS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"distmult"</span><span class="p">,</span>
    <span class="s">"transe"</span><span class="p">,</span>
    <span class="s">"relation_gated_mlp"</span><span class="p">,</span>
    <span class="s">"relation_attention_mlp"</span>
<span class="p">]</span>

<span class="n">PERTURBSEQ_RELATION_TYPES</span> <span class="o">=</span> <span class="p">[</span><span class="s">"inhibitor -&gt; modified"</span><span class="p">,</span> <span class="s">"stimulator -&gt; modified"</span><span class="p">]</span>

<span class="c1"># local caches
</span><span class="n">LOCAL_HARMONIZOME_DATA_DIR</span> <span class="o">=</span> <span class="s">"/tmp/harmonizome_data"</span>
<span class="n">CROSS_RELATION_PREDICTION_CACHE</span> <span class="o">=</span> <span class="s">"/tmp/cross_relation_prediction_matrices.pkl"</span>
</code></pre></div></div>

<h2 id="comparing-relation-unaware-models">Comparing relation-(un)aware models</h2>

<p>To compare the trained models, I will load their checkpoints and
evaluation metrics using Napistu-Torch’s <code class="language-plaintext highlighter-rouge">RemoteEvaluationManager</code>,
which provides a unified interface for accessing model weights, training
configs, and Weights &amp; Biases summaries directly from Hugging Face. (If
you are working with local models, you can instead use the similar
<code class="language-plaintext highlighter-rouge">LocalEvaluationManager</code>, which directly interacts with Weights &amp; Biases
and local models and data.)</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">eval_managers</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">model_name</span><span class="p">,</span> <span class="n">model_info</span> <span class="ow">in</span> <span class="n">MODEL_HF_REPOSITORIES</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
    <span class="n">model_repo</span><span class="p">,</span> <span class="n">model_version</span> <span class="o">=</span> <span class="n">model_info</span>
    <span class="n">eval_managers</span><span class="p">[</span><span class="n">model_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">RemoteEvaluationManager</span><span class="p">.</span><span class="n">from_huggingface</span><span class="p">(</span>
        <span class="n">model_repo</span><span class="p">,</span>
        <span class="n">data_store_dir</span> <span class="o">=</span> <span class="n">PATH_TO_NAPISTU_STORE</span><span class="p">,</span>
        <span class="n">revision</span> <span class="o">=</span> <span class="n">model_version</span><span class="p">,</span>
    <span class="p">)</span>

<span class="c1"># for local evaluation, instead do this:
# from napistu_torch.evaluation.manager import LocalEvaluationManager
# EXPERIMENT
# eval_managers = dict()
# for experiment in MODEL_DISPLAY_ORDER:
#     experiment_path = &lt;&lt;PATH_TO_EXPERIMENT_DIR&gt;&gt;
#     eval_managers[experiment] = LocalEvaluationManager(experiment_path)
</span>
<span class="c1"># Load pre-calculated WandB summaries directly from HuggingFace
</span><span class="n">run_summaries</span> <span class="o">=</span> <span class="p">{</span><span class="n">exp</span><span class="p">:</span> <span class="n">manager</span><span class="p">.</span><span class="n">get_run_summary</span><span class="p">()</span> <span class="k">for</span> <span class="n">exp</span><span class="p">,</span> <span class="n">manager</span> <span class="ow">in</span> <span class="n">eval_managers</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>

<span class="c1"># Load all of the trained models
</span><span class="n">models</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span> <span class="p">:</span> <span class="n">v</span><span class="p">.</span><span class="n">load_model_from_checkpoint</span><span class="p">()</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">eval_managers</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>

<span class="c1"># Count trainable parameters in each head
</span><span class="n">n_head_params</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span> <span class="p">:</span> <span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">v</span><span class="p">.</span><span class="n">task</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">models</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>

<span class="c1"># connect to the NapistuDataStore and load the NapistuData instance which all models were trained on
# all of the experiments use the same data, so we can pick an arbitrary one
</span>
<span class="n">napistu_data_store</span> <span class="o">=</span> <span class="n">eval_managers</span><span class="p">[</span><span class="s">"distmult"</span><span class="p">].</span><span class="n">napistu_data_store</span>
<span class="n">napistu_data</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_napistu_data</span><span class="p">(</span><span class="s">"relation_prediction"</span><span class="p">)</span>
<span class="n">relation_types</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">napistu_data</span><span class="p">.</span><span class="n">relation_manager</span><span class="p">.</span><span class="n">label_names</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>
<span class="n">species_identifiers</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_pandas_df</span><span class="p">(</span><span class="s">"species_identifiers"</span><span class="p">)</span>
<span class="n">name_to_sid_map</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_pandas_df</span><span class="p">(</span><span class="s">"name_to_sid_map"</span><span class="p">).</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">name_to_sid_map</span><span class="p">[</span><span class="s">"integer_id"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">name_to_sid_map</span><span class="p">))</span>

<span class="k">if</span> <span class="ow">not</span> <span class="nb">all</span><span class="p">(</span><span class="n">name_to_sid_map</span><span class="p">[</span><span class="s">"name"</span><span class="p">]</span> <span class="o">==</span> <span class="n">napistu_data</span><span class="p">.</span><span class="n">ng_vertex_names</span><span class="p">):</span>
    <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"name_to_sid_map does not match napistu_data.ng_vertex_names"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="model-architecture-overview">Model architecture overview</h3>

<p>I evaluated nine head architectures spanning three
categories:</p>

<ul>
  <li><strong>Edge prediction (relation-unaware)</strong>: Simple heads that predict
edge existence without distinguishing relation types. These serve as
baselines to assess whether relation-aware methods provide
meaningful improvements.</li>
  <li><strong>Knowledge graph embedding</strong>: Methods originally developed for
heterogeneous knowledge graphs (like TransE and DistMult) that learn
relation-specific transformations.</li>
  <li><strong>Relation prediction (expressive heads)</strong>: Custom architectures
that combine relation-aware gating or attention mechanisms with MLPs
to learn flexible, relation-specific transformations.</li>
</ul>
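<p>For intuition on the knowledge graph embedding category, here is a toy DistMult-style score (illustrative vectors, not trained weights): each relation acts as a diagonal matrix, so the same vertex pair receives different scores under different relations.</p>

```python
def distmult_score(a, r, b):
    """DistMult score: sum_i a_i * r_i * b_i for relation vector r."""
    return sum(ai * ri * bi for ai, ri, bi in zip(a, r, b))

# toy 2-D embeddings for one vertex pair and two relations
a = [1.0, -1.0]
b = [1.0, 1.0]
rel_activates = [1.0, -1.0]  # weights the second axis negatively
rel_inhibits = [1.0, 1.0]

# the same vertex pair is ranked differently under each relation
assert distmult_score(a, rel_activates, b) != distmult_score(a, rel_inhibits, b)
```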

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a summary table
</span><span class="n">model_summaries</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([</span><span class="n">HEAD_DESCRIPTIONS</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">MODEL_DISPLAY_ORDER</span><span class="p">])</span>
<span class="n">model_summaries</span><span class="p">[</span><span class="s">"category"</span><span class="p">]</span> <span class="o">=</span> <span class="n">model_summaries</span><span class="p">[</span><span class="s">"category"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"_"</span><span class="p">,</span> <span class="s">" "</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="n">capitalize</span><span class="p">()</span>
<span class="n">model_summaries</span><span class="p">[</span><span class="s">"N parameters"</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">n_head_params</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">MODEL_DISPLAY_ORDER</span><span class="p">]</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">model_summaries</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">"N parameters"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span>
    <span class="n">caption</span><span class="o">=</span><span class="s">"Summary of all tested prediction heads"</span><span class="p">,</span>
    <span class="n">wrap_columns</span><span class="o">=</span><span class="p">[</span><span class="s">"label"</span><span class="p">,</span> <span class="s">"category"</span><span class="p">,</span> <span class="s">"description"</span><span class="p">],</span>
    <span class="n">column_widths</span><span class="o">=</span><span class="p">{</span><span class="s">"description"</span> <span class="p">:</span> <span class="s">"50%"</span><span class="p">,</span> <span class="s">"N parameters"</span> <span class="p">:</span> <span class="s">"15%"</span><span class="p">},</span>
    <span class="n">include_index</span> <span class="o">=</span> <span class="bp">False</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Summary of all tested prediction heads
</figcaption>

<div class="data-table" style="" data-table="[{&quot;label&quot;: &quot;Dot product&quot;, &quot;category&quot;: &quot;Edge prediction&quot;, &quot;description&quot;: &quot;Simple dot product of source and target embeddings&quot;, &quot;N parameters&quot;: 0}, {&quot;label&quot;: &quot;RotatE&quot;, &quot;category&quot;: &quot;Knowledge graph embedding&quot;, &quot;description&quot;: &quot;RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space&quot;, &quot;N parameters&quot;: 704}, {&quot;label&quot;: &quot;DistMult&quot;, &quot;category&quot;: &quot;Knowledge graph embedding&quot;, &quot;description&quot;: &quot;DistMult: Embedding Entities and Relations for Learning and Inference in Knowledge Bases&quot;, &quot;N parameters&quot;: 1408}, {&quot;label&quot;: &quot;TransE&quot;, &quot;category&quot;: &quot;Knowledge graph embedding&quot;, &quot;description&quot;: &quot;TransE: Translating Embeddings for Modeling Multi-relational Data&quot;, &quot;N parameters&quot;: 1408}, {&quot;label&quot;: &quot;Attention&quot;, &quot;category&quot;: &quot;Edge prediction&quot;, &quot;description&quot;: &quot;Attention head that projects nodes to query/key spaces and computes scaled dot-product attention. 
Learns separate transformations for source (query) and target (key) embeddings.&quot;, &quot;N parameters&quot;: 32768}, {&quot;label&quot;: &quot;MLP&quot;, &quot;category&quot;: &quot;Edge prediction&quot;, &quot;description&quot;: &quot;Multi-layer perceptron head that concatenates source and target embeddings, then applies 2-layer MLP with ReLU and dropout.&quot;, &quot;N parameters&quot;: 33025}, {&quot;label&quot;: &quot;Relation-gated MLP head&quot;, &quot;category&quot;: &quot;Relation prediction&quot;, &quot;description&quot;: &quot;Edge features are processed through MLP, then modulated by relation-specific gates (element-wise multiplication), and passed through output MLP.&quot;, &quot;N parameters&quot;: 58561}, {&quot;label&quot;: &quot;Relation-aware attention head&quot;, &quot;category&quot;: &quot;Relation prediction&quot;, &quot;description&quot;: &quot;Relation-aware multi-head attention head. Uses relation embeddings as queries to attend to edge features (concatenated source/target).&quot;, &quot;N parameters&quot;: 75073}, {&quot;label&quot;: &quot;Relation-aware attention head with MLP&quot;, &quot;category&quot;: &quot;Relation prediction&quot;, &quot;description&quot;: &quot;Processes edge features through MLP, then uses relation embeddings as queries in multi-head attention to select relevant features. 
Includes residual connection and output MLP for final prediction.&quot;, &quot;N parameters&quot;: 91585}]" data-columns="[{&quot;title&quot;: &quot;label&quot;, &quot;field&quot;: &quot;label&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true}, {&quot;title&quot;: &quot;category&quot;, &quot;field&quot;: &quot;category&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true}, {&quot;title&quot;: &quot;description&quot;, &quot;field&quot;: &quot;description&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}, {&quot;title&quot;: &quot;N parameters&quot;, &quot;field&quot;: &quot;N parameters&quot;, &quot;width&quot;: &quot;15%&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<h3 id="comparing-models-with-standard-and-relation-weighted-auc">Comparing models with standard and relation-weighted AUC</h3>

<p>To evaluate model performance, I use two complementary metrics:</p>

<p><strong>Standard AUC</strong> treats all edges equally, measuring how well models
discriminate real edges from random negatives across the entire graph.</p>

<p><strong>Relation-weighted AUC</strong> accounts for the imbalanced distribution of
relation types by:</p>

<ol>
  <li>Calculating AUC separately for each relation type (comparing real
edges to negative samples of the same relation type)</li>
  <li>Taking a weighted average of these per-relation AUCs, where weights
are proportional to √N (the square root of each relation type’s
frequency)</li>
</ol>

<p>This relation-weighted metric is particularly important given the highly
imbalanced distribution of relation types. Some (like “interactor →
interactor”) are far more common than others (like “inhibitor →
modified”). The √N weighting balances standard AUC (which would weight
by N) and equal weighting (which would weight by 1), ensuring rare
relation types influence the metric without dominating it. I used
relation-weighted AUC for early stopping during training, prioritizing
models that perform well across all relation types rather than just the
most abundant ones.</p>
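<p>The two-step metric above can be sketched directly; a minimal
implementation using scikit-learn's <code>roc_auc_score</code> (the
function name and signature are illustrative, not the post's actual
helpers):</p>

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def relation_weighted_auc(y_true, y_score, relation_ids):
    """Per-relation AUCs combined with sqrt(N) weights (illustrative sketch)."""
    aucs, weights = [], []
    for rel in np.unique(relation_ids):
        mask = relation_ids == rel
        # AUC is undefined when a relation has only one class present
        if len(np.unique(y_true[mask])) < 2:
            continue
        aucs.append(roc_auc_score(y_true[mask], y_score[mask]))
        weights.append(np.sqrt(mask.sum()))  # sqrt(N) weighting
    return float(np.average(aucs, weights=weights))
```

<p>With equal-sized relation groups this reduces to a plain average of
per-relation AUCs; as group sizes diverge, common relations pull harder
on the metric, but only in proportion to √N rather than N.</p>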

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># extract and reorder based on test relation-weighted AUC
</span><span class="n">test_aucs</span> <span class="o">=</span> <span class="n">_extract_metric</span><span class="p">(</span><span class="n">run_summaries</span><span class="p">,</span> <span class="s">"test_relation_weighted_auc"</span><span class="p">)</span>
<span class="n">performance_order</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">test_aucs</span><span class="p">,</span> <span class="n">run_summaries</span><span class="p">.</span><span class="n">keys</span><span class="p">()))]</span>
<span class="n">performance_ordered_summaries</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">run_summaries</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">performance_order</span><span class="p">}</span>
<span class="n">ordered_labels</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">textwrap</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">HEAD_DESCRIPTIONS</span><span class="p">[</span><span class="n">x</span><span class="p">][</span><span class="s">"label"</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>  <span class="c1"># Adjust width as needed
</span>    <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">performance_order</span>
<span class="p">]</span>

<span class="c1"># Create figure with two subplots
</span><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>

<span class="c1"># Plot regular AUC on first axis
</span><span class="n">plot_auc_only</span><span class="p">(</span><span class="n">performance_ordered_summaries</span><span class="p">,</span> <span class="n">ordered_labels</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">"Standard AUC"</span><span class="p">)</span>

<span class="c1"># Plot relation-weighted AUC on second axis
</span><span class="n">plot_auc_only</span><span class="p">(</span>
    <span class="n">performance_ordered_summaries</span><span class="p">,</span>
    <span class="n">ordered_labels</span><span class="p">,</span>
    <span class="n">test_auc_attribute</span><span class="o">=</span><span class="s">"test_relation_weighted_auc"</span><span class="p">,</span>
    <span class="n">val_auc_attribute</span><span class="o">=</span><span class="s">"val_relation_weighted_auc"</span><span class="p">,</span>
    <span class="n">title</span><span class="o">=</span><span class="s">"Relation-Weighted AUC"</span><span class="p">,</span>
    <span class="n">ax</span><span class="o">=</span><span class="n">ax2</span>
<span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2026-01-06-relation_prediction/plot_auc_comparison-output-1.png" alt="" /></p>

<h3 id="performance-across-relation-types">Performance across relation types</h3>

<p>While the relation-weighted AUC provides an overall performance metric,
examining AUC for individual relation types reveals how well each model
handles specific regulatory mechanisms. Counterintuitively, “interactor
→ interactor” edges are among the hardest to predict, possibly because
the high density of interaction edges creates competing demands on
vertex positioning in the embedding space. In contrast, directed
regulatory relation types like “stimulator → reactant” and “catalyst →
reactant” achieve higher AUCs across most models.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">relation_type_aucs</span> <span class="o">=</span> <span class="n">summarize_relation_type_aucs</span><span class="p">(</span><span class="n">run_summaries</span><span class="p">,</span> <span class="n">relation_types</span><span class="p">)</span>
<span class="n">ordered_experiments</span> <span class="o">=</span> <span class="n">relation_type_aucs</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"experiment"</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">"test_auc"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
<span class="n">ordered_relation_types</span> <span class="o">=</span> <span class="n">relation_type_aucs</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"relation_type"</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">"test_auc"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plot_combined_grouped_barplot</span><span class="p">(</span>
    <span class="n">relation_type_aucs</span><span class="p">,</span>
    <span class="n">category_order</span><span class="o">=</span><span class="n">ordered_relation_types</span><span class="p">,</span>
    <span class="n">attribute_order</span><span class="o">=</span><span class="n">ordered_experiments</span><span class="p">,</span>
    <span class="n">value_vars</span> <span class="o">=</span> <span class="p">[</span><span class="s">"test_auc"</span><span class="p">],</span>
    <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2026-01-06-relation_prediction/plot_relation_type_level_aucs-output-1.png" alt="" /></p>

<h3 id="performance-takeaways">Performance takeaways</h3>

<p>Comparing models’ relation-weighted AUC and relation-level AUCs reveals
several patterns:</p>

<ul>
  <li><strong>Expressive relation-aware MLPs achieve top performance.</strong> The
relation-gated MLP and relation-attention MLP heads achieve nearly
equivalent performance (~0.87 relation-weighted AUC), representing
the ceiling for this architecture and training regime. Both combine
relation-specific modulation with multi-layer MLPs to learn
flexible, relation-specific transformations.</li>
  <li><strong>DistMult achieves remarkable parameter efficiency.</strong> Despite using
only ~1,400 parameters (roughly 1/50th of the top relation-aware MLP
heads), DistMult
achieves 0.865 relation-weighted AUC, trailing the top models by
less than 0.01 AUC points. DistMult learns relation-specific scalar
weights for each embedding dimension—the scoring function
$\text{score}(h, r, t) = \sum_i h_i \cdot r_i \cdot t_i$ means each
relation type re-weights the embedding space to emphasize dimensions
where related vertices show correlated (positive weights) or
anti-correlated (negative weights) patterns.</li>
  <li><strong>MLPs enable effective vertex attention.</strong> Both lightweight
attention heads (attention and relation-attention) substantially
underperform (0.833-0.844 AUC), barely exceeding dot-product
performance. The key architectural difference in top-performing
attention-based heads is the MLP that processes concatenated
source-target embeddings before attention, creating learned edge
feature representations that attention can then modulate. Raw
attention over node embeddings alone lacks sufficient expressivity.</li>
  <li><strong>Knowledge graph embedding methods show variable performance.</strong>
RotatE underperforms relative to even the simple dot-product
baseline, likely because treating the 128-dimensional embedding as
64 complex dimensions reduces the vertex representation’s
expressivity. TransE performs moderately better but still lags
behind custom relation-aware heads. The margin loss used by both
methods may be poorly suited for edge prediction—it enforces
pairwise rankings between individual positive-negative pairs rather
than learning distributional differences between positive and
negative edge populations (as BCE loss does). In contrast, DistMult
uses BCE loss, and this likely contributes to its strong
performance.</li>
  <li><strong>Relation-unaware models show relation-type variation.</strong> Even the
simple dot-product and MLP heads show varying performance across
relation types. This likely reflects competing demands on vertex
positioning—vertices involved in many dense interactions (like
“interactor → interactor” edges) face more constraints in the
embedding space than vertices in sparser relation types. Retraining
the dot-product head with relation-weighted BCE (versus standard
BCE) substantially shifts the learned embeddings, demonstrating that
relation-type frequency affects vertex positioning even when
relation type isn’t used as a model input.</li>
</ul>
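<p>The DistMult scoring function referenced above is compact enough to
write out directly. A minimal batched PyTorch sketch (tensor names are
illustrative, not the trained models' attributes):</p>

```python
import torch

def distmult_score(h, t, relation_emb, rel_ids):
    """score(h, r, t) = sum_i h_i * r_i * t_i, batched.

    h, t: (batch, dim) source/target embeddings
    relation_emb: (n_relations, dim) learned per-relation scalars
    rel_ids: (batch,) relation-type index for each edge
    """
    r = relation_emb[rel_ids]       # (batch, dim) gather per-edge relation weights
    return (h * r * t).sum(dim=-1)  # (batch,) re-weighted dot product
```

<p>A negative weight in dimension <em>i</em> rewards anti-correlated
source/target coordinates in that dimension, matching the
interpretation above. One design consequence visible here: the score is
symmetric in <em>h</em> and <em>t</em>, so DistMult by itself cannot
encode edge direction.</p>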

<h2 id="evaluating-relation-aware-models">Evaluating relation-aware models</h2>

<p>To explore whether relation-aware models are making meaningful signed
regulatory predictions, I will evaluate them using three analyses:</p>

<ol>
  <li>Using interpretable relation-aware knowledge graph embedding heads,
I will examine the geometric representation of activation and
inhibition in the learned transformations.</li>
  <li>Exploring top-performing heads, I will examine the strength of
regulatory predictions — are scores similar regardless of the
putative relation type, or do models confidently predict
relation-type-specific interactions?</li>
  <li>Leveraging PerturbSeq data, I will assess whether models’
top-scoring relation-type predictions (activation vs. inhibition)
align with experimentally observed transcriptional responses to
genetic perturbations.</li>
</ol>

<h3 id="what-is-the-geometry-of-activation-and-inhibition">What is the geometry of activation and inhibition?</h3>

<p>To understand how relation-aware heads encode regulatory semantics, I
will examine the learned relation embeddings from three knowledge graph
embedding methods: RotatE (rotation angles), TransE (translation
vectors), and DistMult (dimensional scaling weights). Each method learns
these transformations as model weights, allowing direct inspection of
relation types’ geometric encoding.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">RotatE_head</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="s">"rotate"</span><span class="p">].</span><span class="n">task</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">head</span>
<span class="n">RotatE_phases</span> <span class="o">=</span> <span class="n">RotatE_head</span><span class="p">.</span><span class="n">relation_emb</span><span class="p">.</span><span class="n">weight</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">cpu</span><span class="p">()</span>

<span class="n">TransE_head</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="s">"transe"</span><span class="p">].</span><span class="n">task</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">head</span>
<span class="n">TransE_vectors</span> <span class="o">=</span> <span class="n">TransE_head</span><span class="p">.</span><span class="n">relation_emb</span><span class="p">.</span><span class="n">weight</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">cpu</span><span class="p">()</span>

<span class="n">DistMult_head</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="s">"distmult"</span><span class="p">].</span><span class="n">task</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">head</span>
<span class="n">DistMult_scalars</span> <span class="o">=</span> <span class="n">DistMult_head</span><span class="p">.</span><span class="n">relation_emb</span><span class="p">.</span><span class="n">weight</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">cpu</span><span class="p">()</span>
<span class="n">DistMult_deviations</span> <span class="o">=</span> <span class="n">DistMult_scalars</span> <span class="o">-</span> <span class="mf">1.0</span>  <span class="c1"># Deviation from identity
</span>
<span class="c1"># Get relation indices
</span><span class="n">regulatory_relations</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"activation"</span><span class="p">:</span> <span class="n">relation_types</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="s">"stimulator -&gt; modified"</span><span class="p">),</span>
    <span class="s">"inhibition"</span><span class="p">:</span> <span class="n">relation_types</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="s">"inhibitor -&gt; modified"</span><span class="p">),</span>
    <span class="s">"interaction"</span><span class="p">:</span> <span class="n">relation_types</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="s">"interactor -&gt; interactor"</span><span class="p">),</span>
<span class="p">}</span>

<span class="c1"># Compute relation-type similarity matrices for each method
</span><span class="n">similarity_matrices</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">method_name</span><span class="p">,</span> <span class="n">embeddings</span> <span class="ow">in</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"RotatE"</span><span class="p">,</span> <span class="n">RotatE_phases</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"TransE"</span><span class="p">,</span> <span class="n">TransE_vectors</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"DistMult"</span><span class="p">,</span> <span class="n">DistMult_deviations</span><span class="p">)</span>
<span class="p">]:</span>
    <span class="n">embeddings_norm</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">sim_matrix</span> <span class="o">=</span> <span class="n">embeddings_norm</span> <span class="o">@</span> <span class="n">embeddings_norm</span><span class="p">.</span><span class="n">T</span>
    <span class="n">similarity_matrices</span><span class="p">[</span><span class="n">method_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">sim_matrix</span>
</code></pre></div></div>

<p>Rather than examining all relation types globally, I will focus on three
biologically meaningful questions.</p>

<h4 id="do-stimulators-and-inhibitors-have-opposing-transformations">Do stimulators and inhibitors have opposing transformations?</h4>

<p>If activation and inhibition are fundamentally opposite processes, their
learned embeddings should anti-correlate:</p>

\[r_{\text{stimulator → modified}} \approx -r_{\text{inhibitor → modified}}\]

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Calculate median similarity for context
</span><span class="n">median_similarities</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">n_relations</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">relation_types</span><span class="p">)</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">triu_indices</span><span class="p">(</span><span class="n">n_relations</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="k">for</span> <span class="n">method_name</span> <span class="ow">in</span> <span class="p">[</span><span class="s">"RotatE"</span><span class="p">,</span> <span class="s">"TransE"</span><span class="p">,</span> <span class="s">"DistMult"</span><span class="p">]:</span>
    <span class="n">sim_flat</span> <span class="o">=</span> <span class="n">similarity_matrices</span><span class="p">[</span><span class="n">method_name</span><span class="p">][</span><span class="n">mask</span><span class="p">]</span>
    <span class="n">median_similarities</span><span class="p">[</span><span class="n">method_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">sim_flat</span><span class="p">.</span><span class="n">median</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>

<span class="n">stim_vs_inhib</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
    <span class="s">"RotatE"</span><span class="p">:</span> <span class="p">[</span>
        <span class="n">similarity_matrices</span><span class="p">[</span><span class="s">"RotatE"</span><span class="p">][</span><span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"activation"</span><span class="p">],</span> <span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"inhibition"</span><span class="p">]].</span><span class="n">item</span><span class="p">(),</span>
        <span class="n">median_similarities</span><span class="p">[</span><span class="s">"RotatE"</span><span class="p">]</span>
    <span class="p">],</span>
    <span class="s">"TransE"</span><span class="p">:</span> <span class="p">[</span>
        <span class="n">similarity_matrices</span><span class="p">[</span><span class="s">"TransE"</span><span class="p">][</span><span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"activation"</span><span class="p">],</span> <span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"inhibition"</span><span class="p">]].</span><span class="n">item</span><span class="p">(),</span>
        <span class="n">median_similarities</span><span class="p">[</span><span class="s">"TransE"</span><span class="p">]</span>
    <span class="p">],</span>
    <span class="s">"DistMult"</span><span class="p">:</span> <span class="p">[</span>
        <span class="n">similarity_matrices</span><span class="p">[</span><span class="s">"DistMult"</span><span class="p">][</span><span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"activation"</span><span class="p">],</span> <span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"inhibition"</span><span class="p">]].</span><span class="n">item</span><span class="p">(),</span>
        <span class="n">median_similarities</span><span class="p">[</span><span class="s">"DistMult"</span><span class="p">]</span>
    <span class="p">]</span>
<span class="p">},</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Spearman ρ: activation vs. inhibition"</span><span class="p">,</span> <span class="s">"Spearman ρ: median"</span><span class="p">])</span>
<span class="n">stim_vs_inhib</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"metric"</span>

<span class="n">pd_utils</span><span class="p">.</span><span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">stim_vs_inhib</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">stim_vs_inhib</span><span class="p">,</span>
    <span class="n">caption</span><span class="o">=</span><span class="s">"Reaction-type correlation summaries"</span><span class="p">,</span>
    <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span>
    <span class="n">include_index</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Relation-type correlation summaries
</figcaption>

<div class="data-table" style="" data-table="[{&quot;metric&quot;: &quot;Spearman \u03c1: activation vs. inhibition&quot;, &quot;RotatE&quot;: &quot;0.174&quot;, &quot;TransE&quot;: &quot;0.050&quot;, &quot;DistMult&quot;: &quot;0.637&quot;}, {&quot;metric&quot;: &quot;Spearman \u03c1: median&quot;, &quot;RotatE&quot;: &quot;-0.011&quot;, &quot;TransE&quot;: &quot;0.045&quot;, &quot;DistMult&quot;: &quot;0.209&quot;}]" data-columns="[{&quot;title&quot;: &quot;metric&quot;, &quot;field&quot;: &quot;metric&quot;}, {&quot;title&quot;: &quot;RotatE&quot;, &quot;field&quot;: &quot;RotatE&quot;}, {&quot;title&quot;: &quot;TransE&quot;, &quot;field&quot;: &quot;TransE&quot;}, {&quot;title&quot;: &quot;DistMult&quot;, &quot;field&quot;: &quot;DistMult&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p><strong>Activation and inhibition are not geometric opposites.</strong> All three
methods show weak or positive correlation rather than the expected
anti-correlation. The positive correlations suggest that “being a
regulator” is more important than the direction of regulation
(activation versus inhibition) in structuring these transformations —
both relation types emphasize similar regulatory dimensions rather than
encoding opposing effects.</p>

<h4 id="how-are-undirected-relation-types-encoded">How are undirected relation-types encoded?</h4>

<p>Protein-protein interactions are undirected—they exist in the training
data as both A → B and B → A edges with the same “interactor →
interactor” relation type. One might expect this bidirectional structure
to push the relation embedding toward identity (zero rotation, zero
translation, unit scaling), which would naturally satisfy:</p>

\[\text{score}(A, r_{\text{interaction}}, B) \approx \text{score}(B, r_{\text{interaction}}, A)\]

<p>To test whether interactor edges learn identity-like transformations, I
will extract the learned relation embeddings from each model and compare
each relation type’s deviation from identity to the median deviation
across all relation types.</p>
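<p>Before looking at the trained weights, it helps to confirm the
intuition that an identity-like transformation makes the score
symmetric. A small TransE-style sketch with the score
−‖h + r − t‖ (vectors are illustrative, not the trained models):</p>

```python
import torch

def transe_score(h, r, t):
    # TransE: score(h, r, t) = -||h + r - t||
    return -torch.norm(h + r - t, p=2, dim=-1)

A = torch.tensor([1.0, 2.0])
B = torch.tensor([0.5, -1.0])

identity_r = torch.zeros(2)           # zero translation = identity transformation
shifted_r = torch.tensor([1.0, 0.0])  # a non-trivial translation

# identity relation: score is symmetric in the endpoints
sym = torch.isclose(transe_score(A, identity_r, B),
                    transe_score(B, identity_r, A))
# generic relation: symmetry breaks
asym = torch.isclose(transe_score(A, shifted_r, B),
                     transe_score(B, shifted_r, A))
```

<p>With the zero translation, the score reduces to −‖A − B‖, which is
unchanged when A and B swap roles; any nonzero translation generally
scores the two directions differently. The same argument applies to
zero rotations in RotatE and unit scalings in DistMult.</p>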

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">interactor_data</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">method_name</span><span class="p">,</span> <span class="n">embeddings</span> <span class="ow">in</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"RotatE"</span><span class="p">,</span> <span class="n">RotatE_phases</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"TransE"</span><span class="p">,</span> <span class="n">TransE_vectors</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"DistMult"</span><span class="p">,</span> <span class="n">DistMult_deviations</span><span class="p">)</span>
<span class="p">]:</span>
    <span class="n">all_magnitudes</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">.</span><span class="nb">abs</span><span class="p">().</span><span class="nb">sum</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">median_magnitude</span> <span class="o">=</span> <span class="n">all_magnitudes</span><span class="p">.</span><span class="n">median</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>
    <span class="n">interactor_magnitude</span> <span class="o">=</span> <span class="n">all_magnitudes</span><span class="p">[</span><span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"interaction"</span><span class="p">]].</span><span class="n">item</span><span class="p">()</span>
    <span class="n">interactor_data</span><span class="p">[</span><span class="n">method_name</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">interactor_magnitude</span> <span class="o">/</span> <span class="n">median_magnitude</span><span class="p">]</span>

<span class="n">interactor_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
    <span class="n">interactor_data</span><span class="p">,</span> 
    <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"interaction transformation norm</span><span class="se">\n</span><span class="s">÷</span><span class="se">\n</span><span class="s">median transformation norm"</span><span class="p">]</span>
<span class="p">).</span><span class="nb">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">interactor_df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"metric"</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">interactor_df</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Interaction transformation magnitudes"</span><span class="p">,</span>
    <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span>
    <span class="n">wrap_columns</span> <span class="o">=</span> <span class="p">{</span><span class="s">"metric"</span> <span class="p">:</span> <span class="s">"30%"</span><span class="p">},</span>
    <span class="n">include_index</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Interaction transformation magnitudes
</figcaption>

<div class="data-table" style="" data-table="[{&quot;metric&quot;: &quot;interaction transformation norm\n\u00f7\nmedian transformation norm&quot;, &quot;RotatE&quot;: 1.11, &quot;TransE&quot;: 1.0, &quot;DistMult&quot;: 1.24}]" data-columns="[{&quot;title&quot;: &quot;metric&quot;, &quot;field&quot;: &quot;metric&quot;}, {&quot;title&quot;: &quot;RotatE&quot;, &quot;field&quot;: &quot;RotatE&quot;}, {&quot;title&quot;: &quot;TransE&quot;, &quot;field&quot;: &quot;TransE&quot;}, {&quot;title&quot;: &quot;DistMult&quot;, &quot;field&quot;: &quot;DistMult&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p><strong>Interactor edges are not near identity.</strong> All three methods learn
typical or above-median transformations for protein-protein
interactions. While this seems to violate the symmetry requirement for
undirected edges, the loss functions don’t actually enforce equal scores
for A → B and B → A pairs. Instead, they optimize discrimination between
real edges and negative samples. For TransE, edges are scored as
$\text{score}(h, r, t) = -\|h + r - t\|$, but the margin-based loss
compares these scores to negatives — meaning non-zero transformations
can provide discriminative power even without maintaining symmetry.</p>
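<p>As a minimal NumPy sketch (illustrative, not the post’s actual
training code), a zero translation makes TransE scores symmetric, while
the margin loss never compares the two directions of an edge directly:</p>

```python
import numpy as np

def transe_score(h, r, t):
    # TransE scores a triple as the negative distance -||h + r - t||;
    # higher (closer to zero) means more plausible.
    return -np.linalg.norm(h + r - t)

rng = np.random.default_rng(0)
h, t = rng.normal(size=8), rng.normal(size=8)

# A zero (identity) translation scores A -> B and B -> A identically ...
r_zero = np.zeros(8)
assert np.isclose(transe_score(h, r_zero, t), transe_score(t, r_zero, h))

# ... while a non-zero translation generally breaks that symmetry.
r = rng.normal(size=8)
symmetry_gap = abs(transe_score(h, r, t) - transe_score(t, r, h))

# The margin-ranking loss never compares A -> B with B -> A; it only asks
# that a true edge out-score a corrupted negative by a margin, so a
# non-zero (asymmetric) translation is acceptable as long as it separates
# positives from negatives.
t_negative = rng.normal(size=8)
margin = 1.0
loss = max(0.0, margin - transe_score(h, r, t) + transe_score(h, r, t_negative))
```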

<h4 id="do-the-three-methods-agree-on-regulatory-semantics">Do the three methods agree on regulatory semantics?</h4>

<p>If all three methods learn similar patterns for how relation types
relate to each other, it would suggest they’re converging on shared
regulatory semantics.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compare similarity matrices pairwise
</span><span class="n">n_relations</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">relation_types</span><span class="p">)</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">triu_indices</span><span class="p">(</span><span class="n">n_relations</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">agreement_data</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">method1</span><span class="p">,</span> <span class="n">method2</span> <span class="ow">in</span> <span class="n">combinations</span><span class="p">(</span><span class="n">similarity_matrices</span><span class="p">.</span><span class="n">keys</span><span class="p">(),</span> <span class="mi">2</span><span class="p">):</span>
    <span class="n">sim1_flat</span> <span class="o">=</span> <span class="n">similarity_matrices</span><span class="p">[</span><span class="n">method1</span><span class="p">][</span><span class="n">mask</span><span class="p">]</span>
    <span class="n">sim2_flat</span> <span class="o">=</span> <span class="n">similarity_matrices</span><span class="p">[</span><span class="n">method2</span><span class="p">][</span><span class="n">mask</span><span class="p">]</span>
    <span class="n">rho</span> <span class="o">=</span> <span class="n">compute_spearman_correlation_torch</span><span class="p">(</span><span class="n">sim1_flat</span><span class="p">,</span> <span class="n">sim2_flat</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="s">'cpu'</span><span class="p">)</span>
    <span class="n">agreement_data</span><span class="p">[</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">method1</span><span class="si">}</span><span class="s"> vs </span><span class="si">{</span><span class="n">method2</span><span class="si">}</span><span class="s">"</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">rho</span><span class="p">]</span>

<span class="n">agreement_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">agreement_data</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Spearman ρ"</span><span class="p">])</span>
<span class="n">agreement_df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"metric"</span>

<span class="n">pd_utils</span><span class="p">.</span><span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">agreement_df</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">agreement_df</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Model-to-model comparison of relation-type correlations"</span><span class="p">,</span>
    <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span>
    <span class="n">include_index</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Model-to-model comparison of relation-type correlations
</figcaption>

<div class="data-table" style="" data-table="[{&quot;metric&quot;: &quot;Spearman \u03c1&quot;, &quot;RotatE vs TransE&quot;: &quot;-0.089&quot;, &quot;RotatE vs DistMult&quot;: &quot;0.003&quot;, &quot;TransE vs DistMult&quot;: &quot;0.020&quot;}]" data-columns="[{&quot;title&quot;: &quot;metric&quot;, &quot;field&quot;: &quot;metric&quot;}, {&quot;title&quot;: &quot;RotatE vs TransE&quot;, &quot;field&quot;: &quot;RotatE vs TransE&quot;}, {&quot;title&quot;: &quot;RotatE vs DistMult&quot;, &quot;field&quot;: &quot;RotatE vs DistMult&quot;}, {&quot;title&quot;: &quot;TransE vs DistMult&quot;, &quot;field&quot;: &quot;TransE vs DistMult&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p><strong>The three methods learn different geometric patterns.</strong> The weak
inter-method correlations show that RotatE, TransE, and DistMult don’t
converge on a shared representation of abstract regulatory
relationships; instead, they learn method-specific solutions to the
edge-discrimination task.</p>

<h4 id="geometry-summary">Geometry summary</h4>

<p>The geometric analysis reveals that knowledge graph embedding methods
don’t naturally encode biological intuitions about regulatory
relationships:</p>

<ul>
  <li><strong>Activation and inhibition are not geometric opposites</strong>, showing
weak or positive correlations rather than anti-correlation</li>
  <li><strong>Undirected edges don’t require identity transformations</strong>, with
interaction edges learning typical or above-median transformation
magnitudes despite being bidirectional in the training data</li>
  <li><strong>Methods don’t converge on shared geometric patterns</strong>, suggesting
they learn different solutions rather than discovering universal
principles of abstract biological regulation</li>
</ul>

<p>The training graph is dense and shaped by experimental ascertainment
biases — different relation types are more readily detected for
different subsets of vertices. Knowledge graph embedding heads struggle
to cleanly separate activators from inhibitors through vertex
positioning alone, instead learning transformations that discriminate
real edges from negatives without capturing clear regulatory semantics.
DistMult’s strong performance despite this limitation is impressive.</p>

<h3 id="are-models-predicting-relation-specific-interactions">Are models predicting relation-specific interactions?</h3>

<p>If heads are learning relation-type-specific transformations, edges
should score highly for some relation types and poorly for others. If
heads rely on vertex embeddings alone, edge scores should remain similar
regardless of the assigned relation type.</p>

<p>To evaluate the specificity of relation-type-based scoring, I will score
each test set edge under every possible relation type, then calculate
the Spearman correlation between relation types’ edge score
distributions. High correlations indicate a model assigns similar scores
regardless of relation type, while low correlations suggest
relation-specific predictions.</p>
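<p>The confusion/correlation helper used below isn’t shown; as a rough
sketch of the underlying idea, with randomly generated DistMult-style
embeddings standing in for trained ones, the cross-relation score
correlations could be computed as:</p>

```python
import numpy as np

def spearman_rho(x, y):
    # Spearman correlation via rank transform (assumes no ties).
    rx = x.argsort().argsort().astype(float)
    ry = y.argsort().argsort().astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

rng = np.random.default_rng(1)
n_edges, n_relations, dim = 500, 4, 16
src = rng.normal(size=(n_edges, dim))      # stand-in source-vertex embeddings
dst = rng.normal(size=(n_edges, dim))      # stand-in target-vertex embeddings
rel = rng.normal(size=(n_relations, dim))  # stand-in relation vectors

# DistMult-style score for every edge under every relation type:
# score(h, r, t) = sum(h * r * t), giving one score column per relation
scores = np.stack(
    [(src * rel[k] * dst).sum(axis=1) for k in range(n_relations)], axis=1
)

# Pairwise Spearman rho between relation types' score distributions;
# high values mean edges are scored similarly under either relation type
corr = np.ones((n_relations, n_relations))
for i in range(n_relations):
    for j in range(i + 1, n_relations):
        corr[i, j] = corr[j, i] = spearman_rho(scores[:, i], scores[:, j])
```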

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">CROSS_RELATION_PREDICTION_CACHE</span><span class="p">):</span>
    <span class="n">cross_relation_prediction_matrices</span> <span class="o">=</span> <span class="p">{</span><span class="s">"confusion"</span><span class="p">:</span> <span class="p">{},</span> <span class="s">"correlation"</span><span class="p">:</span> <span class="p">{}}</span>
    <span class="k">for</span> <span class="n">experiment</span> <span class="ow">in</span> <span class="n">RELATION_AWARE_FOCUSED_HEADS</span><span class="p">:</span>
        <span class="n">model</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="n">experiment</span><span class="p">]</span>
        <span class="n">cross_relation_prediction_matrices</span><span class="p">[</span><span class="s">"confusion"</span><span class="p">][</span><span class="n">experiment</span><span class="p">],</span> <span class="n">cross_relation_prediction_matrices</span><span class="p">[</span><span class="s">"correlation"</span><span class="p">][</span><span class="n">experiment</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
            <span class="n">calculate_relation_type_confusion_and_correlation</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">napistu_data</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="s">"true"</span><span class="p">)</span>
        <span class="p">)</span>

    <span class="n">napistu_utils</span><span class="p">.</span><span class="n">save_pickle</span><span class="p">(</span><span class="n">CROSS_RELATION_PREDICTION_CACHE</span><span class="p">,</span> <span class="n">cross_relation_prediction_matrices</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">cross_relation_prediction_matrices</span> <span class="o">=</span> <span class="n">napistu_utils</span><span class="p">.</span><span class="n">load_pickle</span><span class="p">(</span><span class="n">CROSS_RELATION_PREDICTION_CACHE</span><span class="p">)</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">16</span><span class="p">))</span>
<span class="n">axes</span> <span class="o">=</span> <span class="n">axes</span><span class="p">.</span><span class="n">flatten</span><span class="p">()</span>

<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">head</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">RELATION_AWARE_FOCUSED_HEADS</span><span class="p">):</span>
    <span class="n">correlation_matrix</span> <span class="o">=</span> <span class="n">cross_relation_prediction_matrices</span><span class="p">[</span><span class="s">"correlation"</span><span class="p">][</span><span class="n">head</span><span class="p">]</span>
    <span class="n">title</span> <span class="o">=</span> <span class="n">textwrap</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">HEAD_DESCRIPTIONS</span><span class="p">[</span><span class="n">head</span><span class="p">][</span><span class="s">"label"</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
    <span class="n">plot_heatmap</span><span class="p">(</span>
        <span class="n">correlation_matrix</span><span class="p">,</span>
        <span class="n">row_labels</span><span class="o">=</span><span class="n">relation_types</span><span class="p">,</span>
        <span class="n">title</span><span class="o">=</span><span class="n">title</span><span class="p">,</span>
        <span class="n">cmap</span><span class="o">=</span><span class="s">'magma'</span><span class="p">,</span>
        <span class="n">cbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="n">fmt</span><span class="o">=</span><span class="s">'.2f'</span><span class="p">,</span>
        <span class="n">vmax</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">vmin</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
        <span class="n">cbar_label</span><span class="o">=</span><span class="s">'Spearman ρ'</span><span class="p">,</span>
        <span class="n">mask_upper_triangle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">square</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">cluster</span><span class="o">=</span><span class="s">'both'</span><span class="p">,</span>
        <span class="n">cluster_method</span><span class="o">=</span><span class="s">'average'</span><span class="p">,</span>
        <span class="n">cluster_metric</span><span class="o">=</span><span class="s">'euclidean'</span><span class="p">,</span>
        <span class="n">title_size</span><span class="o">=</span><span class="mi">22</span><span class="p">,</span>
        <span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span>
    <span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2026-01-06-relation_prediction/calculate_relation_type_score_correlations-output-1.png" alt="" /></p>

<p>From these cross-relation-type score correlations, several patterns
emerge:</p>

<ul>
  <li><strong>Top-performing models show strong relation-type specificity.</strong>
DistMult, relation-gated MLP, and relation-attention MLP heads all
generate predictions that are highly dependent on relation type,
with lower cross-relation correlations indicating distinct scoring
patterns for different edge types. This relation-type specificity
appears essential for achieving top performance (&gt;0.86
relation-weighted AUC).</li>
  <li><strong>High-performing heads share common structural patterns.</strong> The top
three models’ relation-type score correlation structures are similar
(ρ ≈ 0.75 between DistMult and either MLP, and ρ = 0.96 between the
two MLPs), suggesting they are learning similar patterns. This is
particularly apparent for “modifier → modified” edges, which are
distinguished from other relation types. This may reflect a
source-specific curation quirk that provides a strong training
signal: “modifier → modified” edges arise primarily from a single
source (Omnipath) when an interaction is annotated as both activation
and inhibition, creating a distinctive pattern that these models
learn to recognize.</li>
  <li><strong>TransE shows limited relation-type differentiation.</strong> TransE
assigns similar scores to edges regardless of relation type, as
evidenced by uniformly high cross-relation correlations. This
indicates that it relies more heavily on source- and target-vertex
embedding similarity than on its learned relation-specific
translations, capturing relation-agnostic discriminative signal
rather than true regulatory semantics.</li>
</ul>

<h3 id="validating-signed-predictions-with-perturbseq">Validating signed predictions with PerturbSeq</h3>

<p>To evaluate whether relation-aware heads can predict not just
interaction existence but regulatory direction (activation
vs. inhibition), I need datasets where molecular species are
systematically perturbed and their directed impacts on other species are
measured. While many such experiments exist, most are either already
incorporated into the graph through resources like STRING and IntAct, or
haven’t been aggregated at sufficient scale for validation.</p>

<p>PerturbSeq experiments — where genes are perturbed using CRISPR and
transcriptome-wide impacts are measured — provide an ideal validation
source. Large-scale datasets like Replogle et al. (2022, Cell) perturb
many genes, while smaller studies investigate targeted hypotheses (e.g.,
human mutation knock-ins). However, the original Replogle dataset
reports only Anderson-Darling q-values, not signed fold-changes. I
therefore turned to Harmonizome, an ongoing effort from the Ma’ayan lab
at Mount Sinai to generate and compare diverse gene-centric profiles
(Diamant et al., 2025, Nucleic Acids Research). Harmonizome provides
signed PerturbSeq fold-changes from both the Replogle datasets and
PerturbAtlas in consistent data formats.</p>

<p>To evaluate model predictions against PerturbSeq data, I:</p>

<ol>
  <li><strong>Mapped PerturbSeq data to graph vertices.</strong> Loaded species
identifiers (systematic identifier to molecular species ID mappings)
and compartmentalized species ID maps to translate PerturbSeq
systematic identifiers into graph vertex IDs.</li>
  <li><strong>Processed Harmonizome PerturbSeq interactions:</strong>
    <ul>
      <li>Mapped source and target genes to vertex IDs using the adapters
from step 1</li>
      <li>Selected strong perturbations by comparing Harmonizome values to
their thresholds</li>
      <li>Inferred regulatory direction from perturbation type and
fold-change direction:
        <ul>
          <li>Overexpression + upregulation → activation</li>
          <li>Overexpression + downregulation → inhibition</li>
          <li>Knockout/knockdown + upregulation → inhibition
(de-repression)</li>
          <li>Knockout/knockdown + downregulation → activation</li>
          <li>Other perturbation types (e.g., knock-ins) were excluded</li>
        </ul>
      </li>
    </ul>
  </li>
  <li><strong>Generated relation-type predictions.</strong> For each PerturbSeq edge
(represented as source-target vertex indices), I scored both
activating (“stimulator → modified”) and repressive (“inhibitor →
modified”) relation types using each model. I assigned the
higher-scoring relation type as the predicted regulatory direction.</li>
  <li><strong>Compared predictions to PerturbSeq ground truth.</strong> I constructed
2×2 contingency tables comparing predicted relation types
(activation vs. inhibition) to observed PerturbSeq directions,
calculated significance using χ² tests, and quantified the agreement
between predicted and observed regulatory directions.</li>
</ol>
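<p>The direction-inference rules from step 2 can be sketched as a small
helper (hypothetical; the post’s actual <code>assign_predicted_direction</code>
operates on a DataFrame and isn’t shown):</p>

```python
def infer_direction(perturbation_type, log_fold_change):
    # Map (perturbation type, observed fold-change sign) to an implied
    # regulatory direction; return None for ambiguous perturbation types.
    upregulated = log_fold_change > 0
    if perturbation_type == "overexpression":
        return "activation" if upregulated else "inhibition"
    if perturbation_type in ("knockout", "knockdown"):
        # a target that rises when its regulator is removed was repressed
        return "inhibition" if upregulated else "activation"
    return None  # e.g., knock-ins and point mutations are excluded

assert infer_direction("overexpression", 1.5) == "activation"
assert infer_direction("knockdown", 2.0) == "inhibition"  # de-repression
assert infer_direction("knock-in", 1.0) is None
```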

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># map relation_types to relation_type indices so we can just score select relation_types
</span><span class="n">relation_type_indices</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span> <span class="p">:</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">relation_types</span><span class="p">)}</span>
<span class="n">focused_relation_type_indices</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span> <span class="p">:</span> <span class="n">relation_type_indices</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">PERTURBSEQ_RELATION_TYPES</span><span class="p">}</span>

<span class="c1"># load PerturbSeq results from Harmonizome, map to vertices, and filter to strong inhibition and activation
</span><span class="n">distinct_harmonizome_perturbseq_interactions</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">load_harmonizome_perturbseq_datasets</span><span class="p">(</span><span class="n">LOCAL_HARMONIZOME_DATA_DIR</span><span class="p">,</span> <span class="n">species_identifiers</span><span class="p">)</span>
    <span class="c1"># rollup to 1 entry per dataset, source, target, and perturbation type
</span>    <span class="p">.</span><span class="n">pipe</span><span class="p">(</span><span class="n">_get_distinct_harmonizome_perturbseq_interactions</span><span class="p">)</span>
    <span class="c1"># filter to only perturbation types where predicted direction is clear (e.g., ignore knock-ins and mutations)
</span>    <span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"perturbation_type in @SIGNED_PERTURBATION_TYPES"</span><span class="p">)</span>
    <span class="c1"># assign predicted direction (strong inhibition, strong activation)
</span>    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">perturbseq_prediction</span><span class="o">=</span><span class="k">lambda</span> <span class="n">df</span><span class="p">:</span> <span class="n">assign_predicted_direction</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
    <span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"perturbseq_prediction in @STRONG_ORDERED_SIGNED_PERTURBSEQ_DIRECTIONS"</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># pull out all unique from-to species id pairs
</span><span class="n">distinct_perturbseq_pairs</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">distinct_harmonizome_perturbseq_interactions</span><span class="p">[[</span><span class="s">"perturbed_species_id"</span><span class="p">,</span> <span class="s">"target_species_id"</span><span class="p">]]</span>
    <span class="p">.</span><span class="n">drop_duplicates</span><span class="p">()</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Map from species_id to integer_ids and convert to tensor
</span><span class="n">perturbseq_edgelist_tensor</span> <span class="o">=</span> <span class="n">get_perturbseq_edgelist_tensor</span><span class="p">(</span>
    <span class="n">distinct_perturbseq_pairs</span><span class="p">,</span>
    <span class="n">name_to_sid_map</span>
<span class="p">)</span>

<span class="c1"># create predictions for the models of interest
</span><span class="n">predicted_relation_type_vs_perturbseq_truth_predictions</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">predicted_relation_type_vs_perturbseq_truth_pvalues</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">experiment</span> <span class="ow">in</span> <span class="n">RELATION_AWARE_FOCUSED_HEADS</span><span class="p">:</span>
    
    <span class="n">model</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="n">experiment</span><span class="p">]</span>

    <span class="n">summaries</span> <span class="o">=</span> <span class="n">compare_relation_type_predictions_to_perturbseq_truth</span><span class="p">(</span>
        <span class="n">model</span><span class="p">,</span>
        <span class="n">focused_relation_type_indices</span><span class="p">,</span>
        <span class="n">perturbseq_edgelist_tensor</span><span class="p">,</span>
        <span class="n">napistu_data</span><span class="p">,</span>
        <span class="n">distinct_perturbseq_pairs</span><span class="p">,</span>
        <span class="n">distinct_harmonizome_perturbseq_interactions</span><span class="p">,</span>
        <span class="n">STRONG_ORDERED_SIGNED_PERTURBSEQ_DIRECTIONS</span><span class="p">,</span>
        <span class="n">PERTURBSEQ_RELATION_TYPES</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="n">predicted_relation_type_vs_perturbseq_truth_predictions</span><span class="p">[</span><span class="n">experiment</span><span class="p">],</span> <span class="n">predicted_relation_type_vs_perturbseq_truth_pvalues</span><span class="p">[</span><span class="n">experiment</span><span class="p">]</span> <span class="o">=</span> <span class="n">summaries</span>

<span class="c1"># create the heatmaps
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="n">axes</span> <span class="o">=</span> <span class="n">axes</span><span class="p">.</span><span class="n">flatten</span><span class="p">()</span>

<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">head</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">RELATION_AWARE_FOCUSED_HEADS</span><span class="p">):</span>
    
    <span class="n">dat</span> <span class="o">=</span> <span class="n">predicted_relation_type_vs_perturbseq_truth_predictions</span><span class="p">[</span><span class="n">head</span><span class="p">]</span>
    <span class="n">title</span> <span class="o">=</span> <span class="n">textwrap</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">HEAD_DESCRIPTIONS</span><span class="p">[</span><span class="n">head</span><span class="p">][</span><span class="s">"label"</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">25</span><span class="p">)</span> <span class="o">+</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="sa">f</span><span class="s">"log10p = </span><span class="si">{</span><span class="n">predicted_relation_type_vs_perturbseq_truth_pvalues</span><span class="p">[</span><span class="n">head</span><span class="p">].</span><span class="nb">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="si">}</span><span class="s">"</span>
    <span class="n">row_labels</span> <span class="o">=</span> <span class="p">[</span><span class="n">textwrap</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">PERTURBSEQ_RELATION_TYPES</span><span class="p">]</span>
    <span class="n">col_labels</span> <span class="o">=</span> <span class="p">[</span><span class="n">textwrap</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">STRONG_ORDERED_SIGNED_PERTURBSEQ_DIRECTIONS</span><span class="p">]</span>
    <span class="n">x_label</span> <span class="o">=</span> <span class="s">"PerturbSeq Prediction"</span> <span class="k">if</span> <span class="n">idx</span> <span class="o">&gt;=</span> <span class="mi">2</span> <span class="k">else</span> <span class="bp">None</span>
    <span class="n">y_label</span> <span class="o">=</span> <span class="s">"Top Scoring Relation-Type"</span> <span class="k">if</span> <span class="n">idx</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="bp">None</span>
    
    <span class="n">plot_heatmap</span><span class="p">(</span>
        <span class="n">dat</span><span class="p">,</span>
        <span class="n">row_labels</span><span class="o">=</span><span class="n">row_labels</span><span class="p">,</span>
        <span class="n">column_labels</span><span class="o">=</span><span class="n">col_labels</span><span class="p">,</span>
        <span class="n">title</span><span class="o">=</span><span class="n">title</span><span class="p">,</span>
        <span class="n">xlabel</span><span class="o">=</span><span class="n">x_label</span><span class="p">,</span>
        <span class="n">ylabel</span><span class="o">=</span><span class="n">y_label</span><span class="p">,</span>
        <span class="n">cmap</span><span class="o">=</span><span class="s">'magma'</span><span class="p">,</span>
        <span class="n">fmt</span><span class="o">=</span><span class="s">'.2f'</span><span class="p">,</span>
        <span class="n">vmin</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
        <span class="n">vmax</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span>
        <span class="n">cbar_label</span><span class="o">=</span><span class="s">'Proportion'</span><span class="p">,</span>
        <span class="n">square</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">cbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="n">title_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
        <span class="n">label_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
        <span class="n">axis_title_size</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span>
        <span class="n">annot_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
        <span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span>
    <span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2026-01-06-relation_prediction/compare_perturbseq_truth_to_relation_type_predictions-output-1.png" alt="" /></p>

<p>Several patterns emerge from these contingency tables:</p>

<ul>
  <li><strong>Relation-type scores lack cross-calibration.</strong> Average scores can
vary substantially between relation types within a model. This is
most apparent for TransE, where inhibitory edges score
systematically higher than activating edges, resulting in
predominantly inhibitory predictions regardless of ground truth.
Since the loss function compares real edges to negative samples
within the same relation-type stratum, models have no incentive to
calibrate scores across relation types.</li>
  <li><strong>TransE predictions are independent of PerturbSeq ground truth.</strong>
TransE’s top-scoring relation types are only weakly associated with
observed regulatory direction (log10p = -2.98). There is actually a
slight enrichment in the off-diagonal quadrants (top-right and
bottom-left), where predictions disagree with CRISPR results.</li>
  <li><strong>Top-performing relation-aware heads achieve strong agreement with
ground truth.</strong> All three of the top-performing models (DistMult,
the relation-gated MLP and relation-aware attention heads) show
striking enrichment along the diagonal (top-left and bottom-right
quadrants), indicating correct prediction of regulatory direction.
The statistical significance is overwhelming (log10p &lt; -300), with
meaningful visual enrichment patterns where predicted
activation/inhibition aligns with observed PerturbSeq responses.</li>
</ul>
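<p>As a sanity check on significance figures like these, the association between
top-scoring relation type and observed regulatory direction can be tested
directly on a contingency table. The sketch below uses a hypothetical 2×2
table; the counts are illustrative only, not the actual data:</p>

```python
# Association test for a (hypothetical) contingency table of
# predicted relation type vs. observed PerturbSeq direction.
import numpy as np
from scipy.stats import chi2_contingency

# rows: top-scoring relation type (activation, inhibition)
# cols: observed PerturbSeq direction (up, down)
table = np.array([[800, 200],
                  [150, 850]])

chi2, p, dof, expected = chi2_contingency(table)
log10p = np.log10(p) if p > 0 else -np.inf  # guard against underflow to 0
print(f"chi2 = {chi2:.1f}, dof = {dof}, log10p = {log10p:.1f}")
```

<p>For strongly associated tables the p-value rapidly falls below what a double
can represent, which is one reason extreme results are often reported as bounds
such as log10p &lt; -300.</p>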

<p>The agreement between top-scoring relation types and CRISPR ground truth
is particularly impressive given several important caveats:</p>

<ul>
  <li><strong>Regulatory ground truth is inherently muddy.</strong> Harmonizome’s
predicted regulatory calls show limited alignment with the
Anderson-Darling q-values reported in the original Replogle
supplement, highlighting the fundamental difficulty of establishing
definitive in vivo regulatory ground truth from experimental data.</li>
  <li><strong>PerturbSeq captures both direct and indirect effects.</strong> CRISPRi
perturbations measure transcriptome-wide changes that include both
immediate regulatory targets and downstream cascade effects. While
I’d expect models to predict direct interactions more accurately,
the training data itself contains a mixture of direct and indirect
interactions, and relation types may provide a means for expressing
this distinction.</li>
  <li><strong>Signed regulatory edges are rare in the training data.</strong>
Activation (“stimulator → modified”) and inhibition (“inhibitor →
modified”) edges comprise a small fraction of the graph (~3% of
edges) compared to undirected physical interactions. The fact that
models can accurately distinguish these regulatory directions
despite their relative scarcity demonstrates that these concepts are
effectively encoded into the vertex and relation-type embeddings.</li>
</ul>
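<p>The scarcity of signed edges is easy to quantify as a tally of relation-type
frequencies. This toy sketch uses hypothetical relation labels and counts, not
the actual Octopus edge table:</p>

```python
# Relation-type imbalance in a toy edge table: the label strings and
# counts below are illustrative stand-ins for the real graph's edges.
import pandas as pd

edges = pd.DataFrame({
    "relation": (["interacts_with"] * 97
                 + ["stimulator -> modified"] * 2
                 + ["inhibitor -> modified"] * 1)
})
shares = edges["relation"].value_counts(normalize=True)
print(shares)  # signed regulatory edges are ~3% of this toy graph
```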

<h2 id="published-models">Published models</h2>

<p>All of the
<a href="https://huggingface.co/datasets/seanhacks/relation_prediction">data</a>
and models used in this analysis are available on Hugging Face.</p>

<p>The best-performing relation-aware models may be of particular interest
to others. These are <strong>128-dim GraphConv encoders trained for
relation-stratified edge prediction</strong> with the following heads:</p>

<ul>
  <li><a href="https://huggingface.co/seanhacks/relation_prediction_distmult_128e">DistMult
head</a></li>
  <li><a href="https://huggingface.co/seanhacks/relation_gated_mlp">Relation-gated MLP
head</a></li>
  <li><a href="https://huggingface.co/seanhacks/relation_attention_mlp">Relation-attention MLP
head</a></li>
</ul>

<h2 id="summary">Summary</h2>

<p>Relation-aware graph neural networks offer a promising path forward for
predicting signed regulatory interactions — a major blind spot in
current virtual cell modeling efforts. While large-scale single-cell
RNA-seq atlases have enabled unprecedented molecular profiling,
translating these observations into predictive models of cellular
regulation requires distinguishing activation from inhibition, not just
identifying that interactions exist.</p>

<p>The results here validate both the expressiveness of the expansive Napistu
graphs and the power of mining them with graph neural networks:</p>

<ul>
  <li><strong>Appropriate architectures matter more than parameter count.</strong>
Top-performing heads (DistMult, relation-gated MLP,
relation-attention MLP) all achieve strong relation-type
specificity, but through different mechanisms. DistMult accomplishes
this with minimal parameters (~1,400) through dimensional
weighting, while MLP-based heads use 60-90K parameters for gating or
attention mechanisms. Critically, raw attention heads substantially
underperform despite having 20× more parameters than DistMult,
demonstrating that architectural choices trump raw model size.</li>
  <li><strong>Learned relation embeddings prioritize discrimination over
semantic meaning.</strong> Activation and inhibition — biologically
opposing processes — produce similar rather than anti-correlated
geometric transformations. Undirected edges are not encoded with
symmetric transformations that would score A→B and B→A equally. The
geometric patterns learned by these methods reflect statistical
structure useful for edge discrimination, rather than interpretable
regulatory semantics.</li>
  <li><strong>PerturbSeq validation demonstrates biological grounding.</strong>
Top-performing models show impressive agreement with CRISPR
perturbation ground truth, correctly distinguishing activation from
inhibition with overwhelming statistical significance (log10p &lt;
-300). This validation against orthogonal experimental data confirms
the models have learned biologically meaningful representations of
regulation.</li>
</ul>
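<p>The DistMult parameter count quoted above is easy to verify: the head's only
learned parameters are one vector per relation type. A minimal sketch, assuming
128-dim embeddings and 11 relation types (the relation count is an illustrative
figure consistent with ~1,400 parameters):</p>

```python
# Minimal DistMult head: score(a, r, b) = sum_i a_i * r_i * b_i.
# The relation count (11) is an assumed figure for illustration.
import torch
import torch.nn as nn

class DistMultHead(nn.Module):
    def __init__(self, num_relations: int, dim: int):
        super().__init__()
        self.rel = nn.Embedding(num_relations, dim)  # the head's only parameters

    def forward(self, src: torch.Tensor, dst: torch.Tensor, rel_idx: torch.Tensor) -> torch.Tensor:
        # reweight each embedding dimension per relation, then dot product
        return (src * self.rel(rel_idx) * dst).sum(dim=-1)

head = DistMultHead(num_relations=11, dim=128)
n_params = sum(p.numel() for p in head.parameters())
print(n_params)  # 1408

scores = head(torch.randn(4, 128), torch.randn(4, 128), torch.zeros(4, dtype=torch.long))
```

<p>Note that because each relation merely rescales embedding dimensions, the
DistMult score is symmetric in its two vertex arguments: A→B and B→A receive
identical scores for the same relation.</p>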

<p>This work opens several opportunities for refinement. Adapting loss
functions to calibrate scores across relation types would enable more
interpretable cross-relation comparisons and better support for
predicting novel interaction types. Continued architectural innovation
in encoder–decoder designs — particularly in how relation information
gates or modulates vertex representations — could further improve the
semantic encoding of regulatory concepts. These advances would
strengthen the foundation for computational models capable of predicting
not just molecular associations, but also their functional consequences
in cellular systems.</p>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="ML" /><category term="python" /><category term="GNNs" /><category term="PyTorch" /><summary type="html"><![CDATA[In my last post, I discussed self-supervised edge prediction as a way of embedding genes using a gene-regulatory network. This approach allows genes, metabolites, drugs and other vertices to be connected based on shared network topology. However, to date I’ve only discussed edge prediction using a dot-product head, where a vertex-pair’s edge support is a direct readout of their similarity in embedding space (𝐚 · 𝐛). While surprisingly powerful, this head has limitations when vertices are heterogeneous or interact in qualitatively different ways — particularly when we want to distinguish between activation and inhibition. Here, I explore more expressive approaches for learning mappings between A → B by evaluating both general edge prediction heads (like MLPs) and “relation-aware” heads that can learn distinct mappings for different edge types. 
The post will cover: data model and training changes enabling relation-specific predictions; geometric analysis revealing how relation-aware heads encode regulatory semantics; PerturbSeq validation demonstrating successful prediction of signed regulatory interactions; and pre-trained models available on HuggingFace.]]></summary></entry><entry><title type="html">Napistu meets PyTorch Geometric - Predicting Regulatory Interactions with Graph Neural Networks</title><link href="https://www.shackett.org/napistu_torch/" rel="alternate" type="text/html" title="Napistu meets PyTorch Geometric - Predicting Regulatory Interactions with Graph Neural Networks" /><published>2025-11-19T00:00:00+00:00</published><updated>2025-11-19T00:00:00+00:00</updated><id>https://www.shackett.org/napistu_torch</id><content type="html" xml:base="https://www.shackett.org/napistu_torch/"><![CDATA[<p>Biological applications of graph neural networks (GNNs) typically work
with either small curated networks (100s-1,000s of nodes) or
aggressively filtered subsets of large databases like STRING. The
Octopus graph — which I introduced in my <a href="https://www.shackett.org/octopus_network/">previous
post</a> — occupies a
different space entirely. By integrating eight complementary pathway
databases, it creates a genome-scale network with ~50K proteins,
metabolites, and complexes spanning ~10M edges, all while preserving
rich metadata about edge provenance, confidence scores, and mechanistic
detail that filtered approaches discard.</p>

<p>This puts the Octopus in uncharted territory: <strong>large enough to capture
genome-scale complexity, yet structured enough to preserve the
biological interpretability that makes network analysis valuable</strong>. GNNs
scale well beyond genome-scale requirements (100M+ nodes in social
networks), but remain unexplored for comprehensive biological networks
that integrate regulatory, metabolic, and interaction data. Bridging
this gap requires infrastructure that handles both the biological
complexity of multi-source networks and the engineering complexity of
training GNNs at scale.</p>

<p>In this post, <strong>I’ll introduce
<a href="https://github.com/napistu/napistu-torch">Napistu-Torch</a> — the
infrastructure that finally makes this space navigable</strong>. Available from
<a href="https://pypi.org/project/napistu-torch/">PyPI</a> and indexed by the
<a href="https://www.shackett.org/napistu_mcp/">Napistu MCP server</a>,
Napistu-Torch provides a modular, reproducible framework for training
GNNs on comprehensive biological networks. I’ll demonstrate that it’s
feasible to train graph convolutional networks on the complete Octopus
network using just a laptop (albeit with 2 days of training time for the
full suite of models). But the real contribution is the ecosystem: the
data structures, pipelines, and evaluation strategies that unlock far
more sophisticated analyses.</p>

<!--more-->

<p>Specifically, I’ll walk through the key components of the Napistu-Torch
ecosystem:</p>

<ul>
  <li>
    <p><strong>Data engineering</strong>: Converting <code class="language-plaintext highlighter-rouge">NapistuGraph</code> objects into PyTorch
Geometric <code class="language-plaintext highlighter-rouge">Data</code> objects while preserving the Octopus network’s rich
vertex and edge metadata. The <code class="language-plaintext highlighter-rouge">NapistuDataStore</code> manages caching and
lazy loading of derived artifacts, eliminating the overhead of
rebuilding datasets — you can immediately start training or
evaluating models.</p>
  </li>
  <li>
    <p><strong>Model components</strong>: Breaking down the anatomy of a GNN into its
core building blocks — encoders that learn vertex representations
via message passing, heads that make predictions from embeddings,
and optional edge encoders that weight edges based on metadata. I’ll
compare several architectures (GCN, GraphSAGE, GraphConv) with and
without edge encoding.</p>
  </li>
  <li>
    <p><strong>Training infrastructure</strong>: Leveraging PyTorch Lightning to
orchestrate model training with minimal boilerplate. Configuration
files define entire experiments, making it easy to reproduce results
or modify architectures without touching code. The CLI supports the
full train-test workflow with automatic experiment tracking via
Weights &amp; Biases.</p>
  </li>
  <li>
    <p><strong>Self-supervised learning</strong>: Training without ground truth labels
by framing the task as edge prediction. The key challenge is forcing
the model to learn real biological patterns through careful negative
sampling — ensuring that negative examples aren’t trivially
distinguishable from true edges while remaining computationally
tractable at the scale of millions of edges.</p>
  </li>
  <li>
    <p><strong>Model interpretation</strong>: Evaluating what the models learn through
three lenses: (1) vertex embeddings that capture molecular
similarity and pathway membership, (2) learned edge weights that
reveal what makes a high-confidence interaction, and (3) edge
prediction patterns that assess whether the model is learning
biological structure versus discovering topological constraints.</p>
  </li>
</ul>
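<p>The negative-sampling step described above can be sketched as rejection
sampling: draw random vertex pairs and discard any that are real edges. This is
a minimal illustration of the idea; the actual pipeline additionally stratifies
negatives (e.g. by relation type) so they are not trivially distinguishable:</p>

```python
# Rejection-based negative sampling: random vertex pairs that are not
# existing edges, returned in the same [2, N] layout as edge_index.
import torch

def sample_negative_edges(edge_index: torch.Tensor, num_nodes: int, num_samples: int) -> torch.Tensor:
    existing = set(map(tuple, edge_index.t().tolist()))
    negatives = []
    while len(negatives) < num_samples:
        src = torch.randint(num_nodes, (1,)).item()
        dst = torch.randint(num_nodes, (1,)).item()
        # reject self-loops and true edges
        if src != dst and (src, dst) not in existing:
            negatives.append((src, dst))
    return torch.tensor(negatives).t()

edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
neg = sample_negative_edges(edge_index, num_nodes=5, num_samples=4)
print(neg.shape)  # torch.Size([2, 4])
```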

<h2 id="from-napistu-to-pytorch-geometric">From Napistu to PyTorch Geometric</h2>

<p>Graph neural networks learn representations of nodes and edges by
iteratively aggregating information from local neighborhoods — a
process called message passing. Unlike traditional neural networks that
operate on fixed-size inputs, GNNs can handle graphs of arbitrary size
and structure, making them well-suited for biological networks where
connectivity patterns encode meaningful relationships. Through multiple
rounds of message passing, GNNs capture increasingly complex structural
patterns, from immediate neighbors to broader network motifs.</p>
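<p>A single round of message passing can be sketched in plain PyTorch:
aggregate each vertex's incoming neighbor features (here by mean), then apply a
learned transform. This illustrates the general mechanism, not any specific
Napistu-Torch layer:</p>

```python
# One round of message passing: mean-aggregate neighbor features per
# target vertex, then transform with a weight matrix and nonlinearity.
import torch

x = torch.randn(4, 8)                     # 4 vertices, 8 features each
edge_index = torch.tensor([[0, 1, 2, 3],  # source vertices
                           [1, 2, 3, 0]]) # target vertices
W = torch.randn(8, 8)                     # learned transform (random here)

agg = torch.zeros_like(x)
agg.index_add_(0, edge_index[1], x[edge_index[0]])  # sum incoming messages
deg = torch.bincount(edge_index[1], minlength=x.size(0)).clamp(min=1).unsqueeze(1)
h = torch.relu((agg / deg) @ W)           # mean-aggregate, transform, activate
print(h.shape)  # torch.Size([4, 8])
```

<p>Stacking several such rounds is what lets a GNN see beyond immediate
neighbors to broader network motifs.</p>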

<p>Graph neural networks in Python typically use <a href="https://pytorch-geometric.readthedocs.io/">PyTorch Geometric
(PyG)</a>, a library that
extends PyTorch with data structures and operations optimized for
graph-structured data. PyG represents graphs using the <code class="language-plaintext highlighter-rouge">Data</code> class,
which stores node features, edge connectivity, and optional edge
attributes as PyTorch tensors — the fundamental format needed for
GPU-accelerated training.</p>

<p>Napistu networks, however, live in a different ecosystem. A
<code class="language-plaintext highlighter-rouge">NapistuGraph</code> (subclass of <code class="language-plaintext highlighter-rouge">igraph.Graph</code>) stores biological networks
with rich vertex and edge metadata — species types, reaction mechanisms,
database provenance, confidence scores. Training GNNs on these networks
requires bridging these two worlds: preserving Napistu’s biological
metadata while converting graphs into PyG’s tensor-based format.</p>

<p>This is where Napistu-Torch comes in. Following the same design
philosophy as Napistu-Py — extending established frameworks rather
than reinventing them — Napistu-Torch builds on PyG with biology-aware
data structures and methods. The goal is to lean on well-established
frameworks like PyG, PyTorch Lightning, and Weights &amp; Biases so the
codebase can focus on domain-specific challenges: encoding biological
signals, integrating diverse metadata, and evaluating models with
biologically meaningful metrics.</p>

<h3 id="napistugraph--napistudata">NapistuGraph → NapistuData</h3>

<p>A key data structure in Napistu-Torch is <code class="language-plaintext highlighter-rouge">NapistuData</code>, which extends
PyG’s <code class="language-plaintext highlighter-rouge">Data</code> class to handle biological network metadata. At its core,
it contains the same PyTorch tensor components that any PyG model
expects:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">x</code>: vertex attributes [# vertices × # of vertex features]</li>
  <li><code class="language-plaintext highlighter-rouge">edge_index</code>: graph connectivity [2 × # of edges]</li>
  <li><code class="language-plaintext highlighter-rouge">edge_attr</code> (optional): edge attributes [# edges × # of edge
features]</li>
  <li><code class="language-plaintext highlighter-rouge">edge_weight</code> (optional): edge weights [# of edges × 1]</li>
  <li><code class="language-plaintext highlighter-rouge">y</code> (optional): node labels for supervised tasks [# of vertices ×
1]</li>
</ul>
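<p>A toy example of tensors with these shapes (3 vertices, 4 edges, 2 vertex
features, 1 edge feature; all values are arbitrary):</p>

```python
# Toy tensors matching the layout listed above.
import torch

num_nodes, num_edges = 3, 4
x = torch.randn(num_nodes, 2)                 # [# vertices x # vertex features]
edge_index = torch.tensor([[0, 1, 2, 0],
                           [1, 2, 0, 2]])     # [2 x # edges]
edge_attr = torch.randn(num_edges, 1)         # [# edges x # edge features]
edge_weight = torch.rand(num_edges)           # one weight per edge
y = torch.zeros(num_nodes, dtype=torch.long)  # node labels for supervised tasks
print(x.shape, edge_index.shape)
```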

<p>But <code class="language-plaintext highlighter-rouge">NapistuData</code> also tracks Napistu-specific metadata — feature
encoders, vertex and edge masks for train/val/test splits, and mappings
back to the original <code class="language-plaintext highlighter-rouge">NapistuGraph</code> identifiers.</p>

<h3 id="creating-a-napistudata">Creating a NapistuData</h3>

<p>Constructing a <code class="language-plaintext highlighter-rouge">NapistuData</code> instance involves three conceptual steps:</p>

<ol>
  <li><strong>Load the network</strong>: Start with a <code class="language-plaintext highlighter-rouge">NapistuGraph</code> and its associated
<code class="language-plaintext highlighter-rouge">SBML_dfs</code> database — here, the 8-source Octopus consensus network
downloaded from Google Cloud Storage</li>
  <li><strong>Augment with attributes</strong>: Add relevant vertex and edge metadata
as described in the <a href="https://www.shackett.org/octopus_network/#decorating-the--graph-with-species-and-reaction-data">Octopus network
post</a></li>
  <li><strong>Encode as tensors</strong>: Convert attributes to <code class="language-plaintext highlighter-rouge">torch.Tensor</code>s using
sklearn-based encoders with automatic type detection
(binary→passthrough, categorical→one-hot,
continuous→standardization) and train/val/test splitting</li>
</ol>

<p>In practice, you rarely construct <code class="language-plaintext highlighter-rouge">NapistuData</code> objects manually.
Instead, the <code class="language-plaintext highlighter-rouge">NapistuDataStore</code> handles this process automatically —
loading raw data, applying transformations, caching results, and
managing related artifacts. This is what enables immediate model
training without rebuild overhead. I’ll demonstrate the store-based
workflow after covering environment setup.</p>

<h2 id="following-along">Following along</h2>

<p>This analysis is fully reproducible — all code, data, and model
configurations are provided so you can run the complete workflow on your
own machine. This section covers environment setup and file locations.</p>

<h3 id="environment-setup">Environment setup</h3>

<p>To reproduce this notebook:</p>

<ol>
  <li>
    <p>Install <a href="https://docs.astral.sh/uv/#highlights">uv</a> (or use <code class="language-plaintext highlighter-rouge">pip</code> if
preferred).</p>
  </li>
  <li>
    <p>Set up a Python environment:</p>
  </li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv venv <span class="nt">--python</span> 3.11
<span class="nb">source</span> .venv/bin/activate
<span class="c"># Core dependencies</span>
uv pip <span class="nb">install </span><span class="nv">torch</span><span class="o">==</span>2.8.0
uv pip <span class="nb">install </span>torch-scatter torch-sparse <span class="nt">-f</span> https://data.pyg.org/whl/torch-2.8.0+cpu.html
uv pip <span class="nb">install </span><span class="nv">napistu</span><span class="o">==</span>0.7.5
uv pip <span class="nb">install</span> <span class="s2">"napistu-torch[pyg,lightning]==0.2.6"</span>
<span class="c"># if you'd like to render the notebook, you'll need to install these additional dependencies</span>
uv pip <span class="nb">install </span>seaborn ipykernel nbformat nbclient umap-learn
python <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span><span class="o">=</span>blog-staging
</code></pre></div></div>

<ol>
  <li>
    <p>Download the
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/napistu_nets.qmd"><code class="language-plaintext highlighter-rouge">napistu_nets.qmd</code></a>
notebook (or copy and paste the relevant code blocks).</p>
  </li>
  <li>
    <p>Choose your path:</p>

    <ul>
      <li><strong>4a. Using pre-trained models</strong> (recommended): Download
<a href="https://github.com/shackett/shackett/blob/main/assets/data/napistu_nets_models.zip">pre-trained models and
configs</a>
(~50MB) and extract to your experiments directory.</li>
      <li><strong>4b. Training from scratch</strong>: Download <a href="https://github.com/shackett/shackett/blob/main/assets/data/napistu_nets_configs.zip">model configs
only</a>
to train models yourself (requires ~2 days on an M4 Max MacBook
Pro, 8-12 hours per model).</li>
    </ul>
  </li>
  <li>
    <p>Configure <code class="language-plaintext highlighter-rouge">EXPERIMENTS_DIR</code> and other paths in the <code class="language-plaintext highlighter-rouge">env_setup</code> code
block to point to your local directories.</p>
  </li>
</ol>

<h3 id="configuration-and-imports">Configuration and imports</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports
</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">matplotlib.colors</span> <span class="kn">import</span> <span class="n">LogNorm</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>

<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.edge_prediction</span> <span class="kn">import</span> <span class="n">summarize_edge_predictions_by_strata</span><span class="p">,</span> <span class="n">plot_edge_predictions_by_strata</span>
<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.edge_weights</span> <span class="kn">import</span> <span class="n">compute_edge_feature_sensitivity</span><span class="p">,</span> <span class="n">format_edge_feature_sensitivity</span><span class="p">,</span> <span class="n">plot_edge_feature_sensitivity</span>
<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.evaluation_manager</span> <span class="kn">import</span> <span class="n">EvaluationManager</span>
<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.model_comparison</span> <span class="kn">import</span> <span class="n">compare_embeddings</span>
<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.pathways</span> <span class="kn">import</span> <span class="n">calculate_pathway_similarities</span>
<span class="kn">from</span> <span class="nn">napistu_torch.lightning.tasks</span> <span class="kn">import</span> <span class="n">get_edge_encoder</span>
<span class="kn">from</span> <span class="nn">napistu_torch.lightning.workflows</span> <span class="kn">import</span> <span class="n">predict</span>
<span class="kn">from</span> <span class="nn">napistu_torch.load.gcs</span> <span class="kn">import</span> <span class="n">gcs_model_to_store</span>
<span class="kn">from</span> <span class="nn">napistu_torch.utils.torch_utils</span> <span class="kn">import</span> <span class="n">select_device</span>
<span class="kn">from</span> <span class="nn">napistu_torch.visualization.basic_metrics</span> <span class="kn">import</span> <span class="n">plot_model_comparison</span>
<span class="kn">from</span> <span class="nn">napistu_torch.visualization.embeddings</span> <span class="kn">import</span> <span class="n">layout_umap</span><span class="p">,</span> <span class="n">plot_coordinates_with_masks</span>

<span class="n">logging</span><span class="p">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s">"napistu_torch"</span><span class="p">)</span>

<span class="c1"># globals
</span>
<span class="n">OVERWRITE</span> <span class="o">=</span> <span class="bp">False</span>

<span class="n">EXPERIMENTS_DIR</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">"~/Desktop/EXPERIMENTS/20251106_edge_prediction"</span><span class="p">))</span>
<span class="n">NAPISTU_DATA_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">EXPERIMENTS_DIR</span><span class="p">,</span> <span class="s">".napistu_data"</span><span class="p">)</span>
<span class="n">STORE_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">EXPERIMENTS_DIR</span><span class="p">,</span> <span class="s">".store"</span><span class="p">)</span>
<span class="n">CACHE_DIR</span> <span class="o">=</span> <span class="n">EXPERIMENTS_DIR</span>

<span class="n">EXPERIMENTS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="c1"># leave this as a list so it defines plot order
</span>    <span class="s">"20251106_gcn_baseline"</span><span class="p">,</span>
    <span class="s">"20251106_gcn_edge_encoding"</span><span class="p">,</span>
    <span class="s">"20251106_sage_baseline"</span><span class="p">,</span>
    <span class="s">"20251106_graphconv_baseline"</span><span class="p">,</span>
    <span class="s">"20251106_graphconv_edge_encoding"</span>
<span class="p">]</span>

<span class="n">EXPERIMENT_LABELS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"20251106_gcn_baseline"</span> <span class="p">:</span> <span class="s">"GCN"</span><span class="p">,</span>
    <span class="s">"20251106_gcn_edge_encoding"</span> <span class="p">:</span> <span class="s">"GCN + Edge Encoding"</span><span class="p">,</span>
    <span class="s">"20251106_sage_baseline"</span> <span class="p">:</span> <span class="s">"SAGE"</span><span class="p">,</span>
    <span class="s">"20251106_graphconv_baseline"</span> <span class="p">:</span> <span class="s">"GraphConv"</span><span class="p">,</span>
    <span class="s">"20251106_graphconv_edge_encoding"</span> <span class="p">:</span> <span class="s">"GraphConv + Edge Encoding"</span>
<span class="p">}</span>

<span class="n">ordered_labels</span> <span class="o">=</span> <span class="p">[</span><span class="n">EXPERIMENT_LABELS</span><span class="p">[</span><span class="n">exp</span><span class="p">]</span> <span class="k">for</span> <span class="n">exp</span> <span class="ow">in</span> <span class="n">EXPERIMENTS</span><span class="p">]</span>

<span class="n">TOP_MODEL_NAME</span> <span class="o">=</span> <span class="s">"GraphConv + Edge Encoding"</span>
<span class="n">EMBEDDING_COMPARISONS_PATH</span> <span class="o">=</span> <span class="n">CACHE_DIR</span> <span class="o">/</span> <span class="s">"embedding_comparisons.tsv"</span>

<span class="c1"># validation
</span><span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isdir</span><span class="p">(</span><span class="n">EXPERIMENTS_DIR</span><span class="p">):</span>
    <span class="k">raise</span> <span class="nb">FileNotFoundError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Experiments directory not found: </span><span class="si">{</span><span class="n">EXPERIMENTS_DIR</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isdir</span><span class="p">(</span><span class="n">CACHE_DIR</span><span class="p">):</span>
    <span class="k">raise</span> <span class="nb">FileNotFoundError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Cache directory not found: </span><span class="si">{</span><span class="n">CACHE_DIR</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="managing-artifacts-with-napistudatastore">Managing artifacts with NapistuDataStore</h2>

<p>Training and evaluating GNN models requires more than just the graph
structure — we need encoded features, train/val/test splits, pathway
metadata for evaluation, and edge stratification data for negative
sampling. Building these artifacts from scratch involves loading the
full <code class="language-plaintext highlighter-rouge">SBML_dfs</code> database (several minutes) and running various
preprocessing steps. Doing this repeatedly during development would be
painfully slow.</p>

<p>The <code class="language-plaintext highlighter-rouge">NapistuDataStore</code> solves this by managing a registry of cached
artifacts. Once built, artifacts load in seconds rather than minutes.
Named artifact definitions in the <code class="language-plaintext highlighter-rouge">napistu_torch.load.artifacts</code> module
support common workflows and integrate seamlessly with config-driven
training.</p>
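<p>The build-or-load pattern behind the store looks roughly like this. It is a
generic illustration only: <code class="language-plaintext highlighter-rouge">ensure_artifact</code>
here is a hypothetical stand-in, not the actual
<code class="language-plaintext highlighter-rouge">NapistuDataStore</code> API:</p>

```python
# Generic build-or-load caching: expensive builders run once, and later
# calls load from disk. `ensure_artifact` is a hypothetical stand-in.
import pickle
import tempfile
from pathlib import Path

def ensure_artifact(store_dir: Path, name: str, builder):
    path = store_dir / f"{name}.pkl"
    if path.exists():                       # cache hit: fast load
        return pickle.loads(path.read_bytes())
    artifact = builder()                    # cache miss: run the slow build once
    path.write_bytes(pickle.dumps(artifact))
    return artifact

store = Path(tempfile.mkdtemp())
splits = ensure_artifact(store, "splits", lambda: {"train": 0.8, "val": 0.1, "test": 0.1})
# second call hits the cache; the new builder is never invoked
cached = ensure_artifact(store, "splits", lambda: {"train": 0.0})
print(cached["train"])  # 0.8
```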

<p>Importantly, the store provides a clean abstraction layer over
Napistu-Py. All the logic for loading SBML databases, decorating graphs
with metadata, and extracting biological annotations is baked into
Napistu-Torch objects. Users can work entirely with <code class="language-plaintext highlighter-rouge">NapistuData</code>,
encoders, and dataloaders without ever touching Napistu-Py code —
though if you’re curious about the underlying biological data model,
<a href="https://www.shackett.org/octopus_network/">Napistu-Py is pretty cool</a>.</p>

<h3 id="initializing-the-store">Initializing the store</h3>

<p>A store can be initialized directly from one of the bundled Napistu
networks on Google Cloud Storage using <code class="language-plaintext highlighter-rouge">gcs_model_to_store</code>. This
creates and manages two local directories:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">napistu_data_dir</code>: Raw data including the <code class="language-plaintext highlighter-rouge">NapistuGraph</code> and
<code class="language-plaintext highlighter-rouge">SBML_dfs</code></li>
  <li><code class="language-plaintext highlighter-rouge">store_dir</code>: Cached artifacts and the registry file tracking what’s
been built</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">napistu_data_store</span> <span class="o">=</span> <span class="n">gcs_model_to_store</span><span class="p">(</span>
    <span class="n">napistu_data_dir</span> <span class="o">=</span> <span class="n">NAPISTU_DATA_DIR</span><span class="p">,</span>
    <span class="n">store_dir</span> <span class="o">=</span> <span class="n">STORE_DIR</span><span class="p">,</span>
    <span class="n">asset_name</span> <span class="o">=</span> <span class="s">"human_consensus"</span><span class="p">,</span>
    <span class="c1"># pin to a stable version of the dataset for reproducibility
</span>    <span class="n">asset_version</span> <span class="o">=</span> <span class="s">"20250923"</span> 
<span class="p">)</span>
</code></pre></div></div>

<h3 id="building-and-caching-artifacts">Building and caching artifacts</h3>

<p>The <code class="language-plaintext highlighter-rouge">ensure_artifacts</code> method checks whether requested artifacts exist
and builds any that are missing. For this analysis, we need four
artifacts:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">napistu_data_store</span><span class="p">.</span><span class="n">ensure_artifacts</span><span class="p">([</span>
    <span class="s">"edge_prediction"</span><span class="p">,</span>
    <span class="s">"comprehensive_pathway_memberships"</span><span class="p">,</span>
    <span class="s">"edge_strata_by_node_type"</span><span class="p">,</span>
    <span class="s">"edge_strata_by_node_species_type"</span>
<span class="p">])</span>
</code></pre></div></div>

<p>These artifacts are:</p>

<ul>
  <li><strong>edge_prediction</strong>: A <code class="language-plaintext highlighter-rouge">NapistuData</code> instance with train/val/test
edge masks, used for self-supervised learning</li>
  <li><strong>comprehensive_pathway_memberships</strong>: Detailed pathway associations
for all vertices (including fine-grained Reactome pathways), used
for evaluating whether embeddings capture biological organization</li>
  <li><strong>edge_strata_by_node_type</strong>: Edge categories based on source/target
node types (species→species, species→reaction, etc.), used for
stratified negative sampling</li>
  <li><strong>edge_strata_by_node_species_type</strong>: Finer-grained edge categories
including species types (protein, metabolite, RNA), used for
assessing prediction biases</li>
</ul>

<p>Once built, these artifacts load almost instantly:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">napistu_data</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_napistu_data</span><span class="p">(</span><span class="s">"edge_prediction"</span><span class="p">)</span>
<span class="n">comprehensive_pathway_memberships</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_vertex_tensor</span><span class="p">(</span><span class="s">"comprehensive_pathway_memberships"</span><span class="p">)</span>
<span class="n">edge_strata_by_node_type</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_pandas_df</span><span class="p">(</span><span class="s">"edge_strata_by_node_type"</span><span class="p">)</span>
<span class="n">edge_strata_by_node_species_type</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_pandas_df</span><span class="p">(</span><span class="s">"edge_strata_by_node_species_type"</span><span class="p">)</span>
</code></pre></div></div>

<pre><code class="language-warning">    INFO:napistu_torch.napistu_data_store:Loading NapistuData from /Users/sean/Desktop/EXPERIMENTS/20251106_edge_prediction/.store/napistu_data/edge_prediction.pt
    INFO:napistu_torch.napistu_data_store:Loading VertexTensor from /Users/sean/Desktop/EXPERIMENTS/20251106_edge_prediction/.store/vertex_tensors/comprehensive_pathway_memberships.pt
    INFO:napistu_torch.napistu_data_store:Loading pandas DataFrame from /Users/sean/Desktop/EXPERIMENTS/20251106_edge_prediction/.store/pandas_dfs/edge_strata_by_node_type.parquet
    INFO:napistu_torch.napistu_data_store:Loading pandas DataFrame from /Users/sean/Desktop/EXPERIMENTS/20251106_edge_prediction/.store/pandas_dfs/edge_strata_by_node_species_type.parquet
</code></pre>

<p>The store abstraction means that downstream code (training scripts,
evaluation notebooks) can simply request artifacts by name without
worrying about paths, versions, or rebuild logic.</p>
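<p>The build-once / load-fast pattern behind the store is compact enough to sketch. The class below is a hypothetical, stripped-down illustration of the registry idea — the class name, file layout, and methods are my own, not the actual <code class="language-plaintext highlighter-rouge">NapistuDataStore</code> API:</p>

```python
import json
from pathlib import Path


class ArtifactRegistry:
    """Build-once / load-fast caching sketch: a JSON registry records
    which artifacts exist, and builders only run for missing entries.
    Names and file layout are illustrative, not the Napistu-Torch API."""

    def __init__(self, store_dir):
        self.store_dir = Path(store_dir)
        self.store_dir.mkdir(parents=True, exist_ok=True)
        self.registry_path = self.store_dir / "registry.json"
        self.registry = (
            json.loads(self.registry_path.read_text())
            if self.registry_path.exists()
            else {}
        )

    def ensure(self, name, builder):
        """Run the (slow) builder only if the artifact is not cached."""
        if name not in self.registry:
            path = self.store_dir / f"{name}.json"
            path.write_text(json.dumps(builder()))  # expensive build step
            self.registry[name] = str(path)
            self.registry_path.write_text(json.dumps(self.registry))
        return Path(self.registry[name])
```

Subsequent calls to <code class="language-plaintext highlighter-rouge">ensure</code> with the same name skip the builder entirely, which is what turns multi-minute rebuilds into near-instant loads.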

<h2 id="anatomy-of-a-gnn">Anatomy of a GNN</h2>

<p>Training a GNN requires coordinating several components: the model
architecture itself, the task definition, data management, and training
infrastructure. This section breaks down each component — starting
with a conceptual overview of what it does, then showing how it’s
implemented in Napistu-Torch. We’ll begin with the high-level system
architecture, then work through the task definition, model components
(encoder, head, edge encoder), and finally the training infrastructure
that orchestrates everything.</p>

<h3 id="system-architecture">System architecture</h3>

<p><img src="https://www.shackett.org/figure/napistu_torch/system_architecture.png" alt="System architecture mermaid diagram" style="width: 70%;" /></p>

<p>Training deep learning models involves coordinating several standard
components: the <strong>Task</strong> defines what we’re learning (loss function,
metrics), the <strong>Model</strong> implements the neural network architecture, the
<strong>DataModule</strong> handles data loading and batching, and the <strong>Trainer</strong>
orchestrates the optimization loop. Each component is configured
independently, providing modularity and clear separation of concerns.</p>

<p>The following subsections examine how these components work in the
context of biological network analysis, covering the task definition
(edge prediction), model architecture (GNN encoders and heads), and
training infrastructure. Later, in the Workflow Management section, I’ll
show how Napistu-Torch packages these component-level configs into a
single <code class="language-plaintext highlighter-rouge">ExperimentConfig</code> for experiment reproducibility.</p>

<h3 id="task-what-are-we-trying-to-predict">Task: what are we trying to predict?</h3>

<p>At the highest level, a GNN task defines the learning objective — what
predictions we want the model to make and how we’ll evaluate its
performance. Different tasks require different architectural components
and training strategies. Common GNN tasks include node classification
(predicting properties of individual nodes), graph classification
(predicting properties of entire graphs), and edge prediction
(predicting whether edges should exist between nodes).</p>

<p><strong>Edge prediction in Napistu-Torch.</strong> For this analysis, we’re using the
edge prediction task (also called link prediction). The goal is to
predict whether an edge should exist between two nodes in a biological
network. This is particularly valuable for discovering potential
protein-protein interactions, metabolic relationships, or regulatory
connections that may be missing from current databases. Crucially, edge
prediction is self-supervised — it doesn’t require vertex or edge
labels, which are often difficult to obtain or overly contrived for
biological networks.</p>

<p>The training process works by teaching the model to discriminate
between:</p>

<ul>
  <li><strong>Positive edges</strong>: Real edges that exist in the training set</li>
  <li><strong>Negative edges</strong>: Node pairs sampled as non-edges (chosen to
maintain biological plausibility)</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">EdgePredictionTask</code> class in Napistu-Torch orchestrates this
process, handling negative sampling, computing the loss function (binary
cross-entropy), and evaluating performance using metrics like AUC and
average precision. The task operates in a transductive setting — the
model generates node embeddings using only training edges for message
passing; validation and test edges are excluded from neighborhood
aggregation but used as supervision to evaluate the decoder’s edge
predictions.</p>
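<p>The core of this objective is compact. The function below is a minimal sketch of edge scoring with a dot-product head and binary cross-entropy — the function name and tensor layout are assumptions for illustration, not the <code class="language-plaintext highlighter-rouge">EdgePredictionTask</code> API:</p>

```python
import torch
import torch.nn.functional as F


def edge_prediction_loss(embeddings, pos_edges, neg_edges):
    """Binary cross-entropy over positive and negative edge scores.

    embeddings: (num_nodes, hidden_dim) output of the encoder
    pos_edges:  (2, num_pos) real training edges
    neg_edges:  (2, num_neg) sampled non-edges
    """
    def score(edges):
        # dot-product head: similarity of source and target embeddings
        return (embeddings[edges[0]] * embeddings[edges[1]]).sum(dim=-1)

    logits = torch.cat([score(pos_edges), score(neg_edges)])
    labels = torch.cat([
        torch.ones(pos_edges.size(1)),   # real edges -> 1
        torch.zeros(neg_edges.size(1)),  # non-edges  -> 0
    ])
    return F.binary_cross_entropy_with_logits(logits, labels)
```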

<div class="content-section ai-aside">
  <div class="section-content">
    <p>Naïve negative sampling (randomly
pairing vertices) produces trivially distinguishable negatives. Early
models without stratification quickly reached &gt;0.95 AUC by exploiting
two artifacts: (1) sampling impossible edge types like reaction→reaction
that never occur in real edges, and (2) sampling random pairs that
ignore the highly variable degree distribution of biological networks,
making hub nodes easy to memorize.</p>

<p>The <code class="language-plaintext highlighter-rouge">NegativeSampler</code> addresses this by tracking edge attributes —
such as combinations of from- and to-node types and degree distributions
within each stratum. It generates negative samples that match both the
observed edge strata and vertex in- and out-degree distributions,
forcing the model to learn biological patterns rather than graph
artifacts.</p>

<p>For the models trained here, I use coarse-grained node-type strata
(species vs. reaction), sampling negatives to match the observed
proportions of species→species, species→reaction, and reaction→species.
Finer-grained stratification — matching entity types like protein
vs. complex vs. metabolite — could further reduce imbalances, but at a
cost: more strata mean sampling separately from dozens of pools during
batch construction, limiting vectorization efficiency.</p>

  </div>
</div>
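<p>One simple way to satisfy both constraints at once is to resample within strata: permuting the source endpoints of each stratum's real edges preserves the stratum proportions and the in/out-degree distributions exactly. A hypothetical sketch of this idea (a real implementation, like the <code class="language-plaintext highlighter-rouge">NegativeSampler</code>, would also need to reject collisions with true edges):</p>

```python
import numpy as np
import pandas as pd


def sample_stratified_negatives(edges, rng):
    """Generate one negative per positive edge. Within each stratum,
    shuffling the source column re-pairs observed endpoints, so the
    negatives match the stratum mix and the degree distributions of the
    positives exactly. Illustrative only: collisions with real edges
    are not filtered, and this is not the Napistu-Torch API."""
    negatives = []
    for stratum, group in edges.groupby("stratum"):
        negatives.append(pd.DataFrame({
            "from": rng.permutation(group["from"].to_numpy()),
            "to": group["to"].to_numpy(),
            "stratum": stratum,
        }))
    return pd.concat(negatives, ignore_index=True)
```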

<p>With the task defined, let’s examine the three core model components:
the encoder that learns node embeddings, the head that produces
predictions, and the optional edge encoder that can weight message
passing.</p>

<h3 id="encoder-learning-vertex-representations">Encoder: learning vertex representations</h3>

<p>The encoder is the core of a GNN — it transforms raw node features
into learned embeddings by aggregating information from each node’s
local neighborhood. Through multiple layers of message passing, encoders
capture increasingly complex patterns of connectivity and feature
similarity. The encoder’s architecture determines how information flows
through the graph and what kinds of structural patterns the model can
learn.</p>

<p><strong>Message passing encoders in Napistu-Torch.</strong> Napistu-Torch leverages
PyG’s library of encoder architectures, wrapping them through the
<code class="language-plaintext highlighter-rouge">MessagePassingEncoder</code> class for easy configuration:</p>

<ul>
  <li><strong>GraphSAGE (SAGE)</strong>
    <ul>
      <li>Samples and aggregates features from neighbors using various
aggregation functions (mean, max, add)</li>
      <li>Efficient and scalable, well-suited for large biological
networks</li>
      <li>Does not support edge weighting — treats all edges uniformly</li>
    </ul>
  </li>
  <li><strong>GraphConv</strong>
    <ul>
      <li>Similar to SAGE with a simplified message passing scheme</li>
      <li>Supports optional edge weighting</li>
    </ul>
  </li>
  <li><strong>Graph Convolutional Networks (GCN)</strong>
    <ul>
      <li>Uses symmetric normalization that accounts for node degrees</li>
      <li>Supports edge weighting</li>
    </ul>
  </li>
</ul>

<p>All encoders follow a multi-layer architecture, progressively refining
node embeddings through repeated neighborhood aggregation. The framework
provides a unified interface, allowing you to swap architectures through
configuration files without changing code.</p>
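<p>To make the message passing loop concrete, here is a minimal SAGE-flavored encoder in plain PyTorch: each layer concatenates a node's own features with the mean of its incoming neighbors' features, then projects the result. This is an illustrative sketch, not the <code class="language-plaintext highlighter-rouge">MessagePassingEncoder</code> wrapper, which delegates to PyG's optimized layers:</p>

```python
import torch
import torch.nn as nn


class MeanAggregationEncoder(nn.Module):
    """SAGE-style encoder sketch: repeated rounds of neighborhood
    mean-aggregation followed by a linear projection."""

    def __init__(self, in_dim, hidden_dim, num_layers=2):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * num_layers
        self.layers = nn.ModuleList(
            nn.Linear(2 * dims[i], dims[i + 1]) for i in range(num_layers)
        )

    def forward(self, x, edge_index):
        src, dst = edge_index  # messages flow src -> dst
        for i, layer in enumerate(self.layers):
            # sum incoming neighbor features, then divide by in-degree
            agg = torch.zeros_like(x).index_add_(0, dst, x[src])
            deg = torch.zeros(x.size(0)).index_add_(
                0, dst, torch.ones(dst.numel())
            ).clamp(min=1).unsqueeze(-1)
            x = layer(torch.cat([x, agg / deg], dim=-1))
            if i < len(self.layers) - 1:
                x = torch.relu(x)
        return x
```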

<h3 id="head-making-predictions-from-embeddings">Head: making predictions from embeddings</h3>

<p>Once the encoder has produced node embeddings, the head (or decoder)
transforms these embeddings into task-specific predictions. The head’s
role is to adapt the general-purpose embeddings produced by the encoder
to the specific prediction task — edge prediction, node
classification, graph classification, etc.</p>

<p><strong>Heads in Napistu-Torch.</strong> For edge prediction, the most commonly used
head is the <strong>dot product</strong>, which computes the inner product of source
and target node embeddings — assuming that nodes with similar
embeddings should be connected. This is the simplest and most efficient
option, serving as a strong baseline in most GNN edge prediction work.
Napistu-Torch also implements more expressive alternatives (MLP,
bilinear) that can learn non-linear relationships between node pairs,
though these come with increased computational cost.</p>

<p>For this analysis, all models use the dot product head due to its
efficiency on large biological networks and strong empirical
performance. Napistu-Torch also provides heads for other tasks like node
classification, all accessible through configuration files.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>The dot product head is symmetric
— it treats vertex A→B identically to B→A — making it well-suited
for undirected interactions but poorly suited for regulation, where the
roles of regulator and target fundamentally differ. Asymmetric heads —
for example, heads with separate source and target projections, or
translation- and rotation-based scores — could address this by treating
A→B and B→A as distinct predictions, enabling the model to
differentiate regulators from their targets.</p>

<p>However, even asymmetric heads may struggle to take advantage of
Napistu’s diverse edge types: protein-protein interactions, activation,
inhibition, catalysis, and more. An asymmetric model could implicitly
learn these cues, but the meaning of a high prediction score would
remain ambiguous: is it activation, inhibition, binding?</p>

<p>Relation-aware heads like RotatE offer a more promising path. Originally
developed for knowledge graph completion, these architectures explicitly
model edge types as distinct relations, learning separate transformation
rules for each. This approach enables the model to capture the
principles of regulation directly, discerning activators from inhibitors
and regulators from binding partners. Rather than merely predicting
edges, these models yield typed edges — specific, testable hypotheses
about the nature of regulatory relationships.</p>

  </div>
</div>
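<p>To see why rotation-based scoring helps, consider a RotatE-style score in which each relation is a rotation in the complex plane: rotating the source before comparing it to the target means A→B and B→A generally receive different scores, and each edge type gets its own learned rotation. A minimal sketch, where the tensor shapes and function name are my assumptions:</p>

```python
import torch


def rotate_score(head, rel_phase, tail):
    """RotatE-style plausibility score (sketch). `head`/`tail` are
    (n, d, 2) real tensors viewed as (n, d) complex embeddings;
    `rel_phase` is a (d,) learned rotation angle per relation (e.g. one
    per edge type such as activation or inhibition). Higher = more
    plausible; a perfect match after rotation scores 0."""
    h = torch.view_as_complex(head)
    t = torch.view_as_complex(tail)
    # unit-modulus complex numbers: multiplying rotates each dimension
    r = torch.polar(torch.ones_like(rel_phase), rel_phase)
    return -(h * r - t).abs().sum(dim=-1)
```

Because the rotation is applied to the source only, swapping source and target changes the score, which is exactly the asymmetry regulation requires.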

<h3 id="edge-encoder-weighting-message-passing">Edge encoder: weighting message passing</h3>

<p>While many GNN implementations either ignore edge attributes entirely or
only accept a single edge weight, biological networks often have rich
edge metadata. The challenge is that most message passing architectures
(like SAGE) don’t support edge attributes at all, while others (like GCN
and GraphConv) only accept a single scalar weight per edge.</p>

<p><strong>Learned edge weighting in Napistu-Torch.</strong> Napistu-Torch addresses
this through the <code class="language-plaintext highlighter-rouge">EdgeEncoder</code> class, which compresses multi-dimensional
edge attributes into a single learned weight that modulates message
passing strength. The edge encoder is a lightweight MLP that takes edge
features as input and outputs a scalar weight in [0, 1] via sigmoid
activation. These learned weights control how strongly each edge
contributes during neighborhood aggregation.</p>

<p>This approach filters noisy edges and amplifies reliable ones, focusing
message passing on the most informative connections — crucial in
biological networks where edge quality varies widely. The encoder trains
end-to-end with the rest of the model, learning which edge attributes
are most predictive for the task while remaining compatible with
standard GNN architectures that expect scalar edge weights.</p>
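<p>The idea fits in a few lines. Below is a hedged sketch of such an encoder — a small MLP mapping edge attributes to a sigmoid-squashed scalar — rather than the exact <code class="language-plaintext highlighter-rouge">EdgeEncoder</code> implementation (class name and hidden size are illustrative):</p>

```python
import torch
import torch.nn as nn


class EdgeWeightEncoder(nn.Module):
    """Compress multi-dimensional edge attributes into one scalar
    weight in [0, 1] that modulates message passing strength."""

    def __init__(self, edge_dim, hidden_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(edge_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, edge_attr):
        # sigmoid squashes to [0, 1]: near-zero weights silence noisy
        # edges, near-one weights pass messages through at full strength
        return torch.sigmoid(self.mlp(edge_attr)).squeeze(-1)
```

The resulting vector of per-edge weights can be handed directly to any architecture (GCN, GraphConv) that accepts a scalar <code class="language-plaintext highlighter-rouge">edge_weight</code>.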

<h3 id="model-overview">Model overview</h3>

<p><img src="https://www.shackett.org/figure/napistu_torch/model_overview.png" alt="Overview of model structure as a mermaid diagram" style="width: 100%;" /></p>

<p>The diagram shows how model components connect during a forward pass.
Node features and edge connectivity flow through the message passing
encoder to produce node embeddings. The head then transforms these
embeddings into task-specific predictions — edge scores for edge
prediction, class probabilities for node classification.</p>

<p>The optional edge encoder (shown in dashed lines) learns to weight edges
based on edge attributes, modulating how strongly each edge contributes
during message passing. This is particularly useful when edge
reliability varies across data sources, as in the Octopus network.</p>

<p>These model components (encoder, head, edge encoder) define the core
prediction logic, but training a GNN requires additional infrastructure
to manage data loading, optimization, and evaluation. Napistu-Torch uses
PyTorch Lightning to orchestrate this training workflow.</p>

<h3 id="training-infrastructure-pytorch-lightning">Training infrastructure: PyTorch Lightning</h3>

<p>While the core GNN components (encoder, head, task) are pure PyTorch,
actually training a model requires substantial boilerplate: optimizer
setup, learning rate scheduling, checkpoint saving, logging metrics,
handling different hardware accelerators, and coordinating
training/validation loops.</p>

<p><strong>Lightning integration in Napistu-Torch.</strong> Napistu-Torch uses PyTorch
Lightning to handle this training infrastructure automatically.
Lightning separates scientific code (model architecture, loss functions)
from engineering code (training loops, GPU management), making
experiments more reproducible and less error-prone.</p>

<p>The key Lightning components are:</p>

<ul>
  <li>
    <p><strong>LightningModule</strong>: Wraps the core task (encoder + head) and
defines training/validation steps, metrics computation, and
optimizer configuration. Napistu-Torch provides task-specific
adapters like <code class="language-plaintext highlighter-rouge">EdgePredictionLightning</code> that bridge pure PyTorch
implementations with Lightning’s training infrastructure.</p>
  </li>
  <li>
    <p><strong>Trainer</strong>: Orchestrates the training loop, handles checkpointing,
manages device placement (CPU/GPU/MPS), integrates with experiment
tracking tools like Weights &amp; Biases, and implements callbacks like
early stopping.</p>
  </li>
</ul>

<p>With this architecture, you can concentrate on defining model and task
logic, while Lightning takes care of all training mechanics.</p>

<h3 id="data-management-batching-strategies">Data management: batching strategies</h3>

<p>The <code class="language-plaintext highlighter-rouge">NapistuDataModule</code> is Lightning’s interface for data loading. It
can be initialized directly from an <code class="language-plaintext highlighter-rouge">ExperimentConfig</code>, automatically
handling artifact loading from the <code class="language-plaintext highlighter-rouge">NapistuDataStore</code>, data validation,
and dataloader creation.</p>

<p><strong>Full-batch vs. mini-batch training.</strong> Napistu-Torch provides two
DataModule implementations with fundamentally different training
strategies:</p>

<ul>
  <li>
    <p><strong>FullGraphDataModule</strong>: Returns the complete graph in each batch,
processing all training edges simultaneously for a single gradient
update per epoch. With only one update per epoch, the model can
converge prematurely before exploring the optimization landscape
effectively.</p>
  </li>
  <li>
    <p><strong>EdgeBatchDataModule</strong>: Splits training edges into mini-batches
while still using all training edges for message passing. Each batch
computes loss and gradients on a subset of training edges but uses
the full graph structure for neighborhood aggregation. This enables
multiple gradient updates per epoch by subdividing the supervision
signal — effectively trading fewer epochs for more updates per
epoch, allowing more thorough optimization.</p>
  </li>
</ul>

<p>For the models in this post, I used <code class="language-plaintext highlighter-rouge">EdgeBatchDataModule</code> with 20
batches per epoch, meaning the model updates its weights 20 times per
epoch rather than once.</p>
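<p>The mechanics can be sketched as a generator over shuffled supervision edges — the full <code class="language-plaintext highlighter-rouge">edge_index</code> still feeds message passing; only the loss is restricted to each chunk. Names here are illustrative, not the <code class="language-plaintext highlighter-rouge">EdgeBatchDataModule</code> API:</p>

```python
import torch


def edge_batches(edge_index, num_batches, generator=None):
    """Split supervision edges into mini-batches. Message passing still
    uses the full training graph; only the loss (and gradient) is
    computed per batch, yielding `num_batches` updates per epoch."""
    num_edges = edge_index.size(1)
    perm = torch.randperm(num_edges, generator=generator)
    for chunk in perm.chunk(num_batches):
        yield edge_index[:, chunk]
```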

<h2 id="workflow-management">Workflow management</h2>

<p>For this post, I’m comparing five models across different encoder
architectures and edge encoding strategies:</p>

<ul>
  <li>GraphConv (+/- edge encoding)</li>
  <li>GCN (+/- edge encoding)</li>
  <li>SAGE (edge encoding not supported)</li>
</ul>

<p>These models use identical hyperparameters (200 epochs, same batch
configuration, same hidden dimensions) for a fair comparison. They are
deliberately unoptimized, with no hyperparameter tuning and simple
dot-product heads, as the focus is on feasibility and infrastructure
rather than peak performance.</p>

<p><strong>Training configuration in Napistu-Torch.</strong> The <code class="language-plaintext highlighter-rouge">ExperimentConfig</code>
composes lower-level Pydantic configs (<code class="language-plaintext highlighter-rouge">DataConfig</code>, <code class="language-plaintext highlighter-rouge">ModelConfig</code>,
<code class="language-plaintext highlighter-rouge">TaskConfig</code>, <code class="language-plaintext highlighter-rouge">TrainerConfig</code>, <code class="language-plaintext highlighter-rouge">WandBConfig</code>) with validation and
sensible defaults. You define an experiment in a minimal YAML file and
inherit the remaining defaults automatically.</p>

<p>The Napistu-Torch CLI supports training and testing directly from the
command line:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>napistu-torch train graphconv_baseline.yaml <span class="nt">--out-dir</span> 20251106_graphconv_baseline
napistu-torch <span class="nb">test </span>20251106_graphconv_baseline
</code></pre></div></div>

<p>Training/validation/test metrics log to Weights &amp; Biases for easy
comparison across experiments. Each run saves a <code class="language-plaintext highlighter-rouge">RunManifest</code> containing
the Weights &amp; Biases run ID and complete <code class="language-plaintext highlighter-rouge">ExperimentConfig</code> (with all
defaults expanded), making experiments fully reproducible.</p>

<p>The configs and training script for these five models are available
<a href="https://github.com/shackett/shackett/blob/main/assets/data/napistu_nets_configs.zip">here</a>.
On an M4 Max MacBook Pro with 48GB of RAM, training the full suite takes
~2 days (~8-12 hours per model).</p>

<div class="content-section ai-aside">
  <div class="section-content">
    <p>PyTorch’s Metal Performance Shaders
(MPS) backend enables GPU acceleration on Apple Silicon, though support
is less mature than CUDA. For these experiments, MPS performed well on
simpler models, but when training models with edge encoders near my
machine’s memory limits, I encountered sporadic tensor corruption. I
trained those models on CPU instead — a reasonable fallback since the
irregular memory access patterns of message passing (variable numbers of
messages per node) meant the performance gap between CPU and GPU was
modest for these network sizes.</p>

  </div>
</div>

<p>Now let’s load these trained models and compare their performance.</p>

<h2 id="model-comparison">Model comparison</h2>

<p><strong>Model evaluation in Napistu-Torch.</strong> The <code class="language-plaintext highlighter-rouge">EvaluationManager</code> class
loads a run’s <code class="language-plaintext highlighter-rouge">RunManifest</code> and provides methods for accessing
checkpoints, the <code class="language-plaintext highlighter-rouge">NapistuDataStore</code>, and Weights &amp; Biases metrics —
eliminating the need for manual path management or API queries. Here,
I’ll load all five trained models and compare their performance metrics
and learned representations.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">eval_managers</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">EXPERIMENT_LABELS</span><span class="p">[</span><span class="n">out_dir</span><span class="p">]:</span> <span class="n">EvaluationManager</span><span class="p">(</span><span class="n">EXPERIMENTS_DIR</span> <span class="o">/</span> <span class="n">out_dir</span><span class="p">)</span> <span class="k">for</span> <span class="n">out_dir</span> <span class="ow">in</span> <span class="n">EXPERIMENTS</span>
<span class="p">}</span>

<span class="c1"># Extract model summaries directly from Weights &amp; Biases using their API
</span><span class="n">run_summaries</span> <span class="o">=</span> <span class="p">{</span><span class="n">exp</span><span class="p">:</span> <span class="n">manager</span><span class="p">.</span><span class="n">get_run_summary</span><span class="p">()</span> <span class="k">for</span> <span class="n">exp</span><span class="p">,</span> <span class="n">manager</span> <span class="ow">in</span> <span class="n">eval_managers</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>

<span class="c1"># visualize model comparison
</span><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plot_model_comparison</span><span class="p">(</span><span class="n">run_summaries</span><span class="p">,</span> <span class="n">ordered_labels</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/basic_model_comparison-output-1.png" alt="" /></p>

<p>The training loss shown is the final epoch’s binary cross-entropy,
aggregated across all mini-batches, computed on equal numbers of real
edges (70% of the network) and negative samples. Validation AUC measures
how well the model ranks held-out real edges (15% of the network,
excluded from message passing) above an equal number of negative
samples. This metric is evaluated after each epoch for checkpoint
selection and early stopping. Test AUC evaluates the same ranking task
on the final 15% of edges.</p>
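<p>This ranking interpretation of AUC has a direct computational form: it is the probability that a randomly drawn positive edge outscores a randomly drawn negative, with ties counted half. A small sketch (the brute-force pairwise version for clarity; real evaluation would use a standard metrics library):</p>

```python
import numpy as np


def ranking_auc(pos_scores, neg_scores):
    """AUC as P(random positive > random negative), ties counted half.
    O(n*m) pairwise comparison, for clarity rather than efficiency."""
    diff = np.asarray(pos_scores)[:, None] - np.asarray(neg_scores)[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```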

<p>Performance differences across models are modest but consistent. Encoder
architecture matters: SAGE &gt; GraphConv &gt;&gt; GCN. Edge encoding provides
a clear improvement across architectures that support it. While these
models haven’t been optimized — no hyperparameter tuning, simple dot
product heads — the consistent trends across architectures validate the
training infrastructure and provide a baseline for future work.</p>

<p>Next, I’ll compare the learned representations across models.
Specifically: Do different encoder architectures produce similar vertex
embeddings? Do models with edge encoders learn comparable edge weights?
These comparisons reveal whether the biological signal is robust to
architectural choices or whether different models capture fundamentally
different patterns.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_model_manager</span> <span class="o">=</span> <span class="n">eval_managers</span><span class="p">[</span><span class="n">TOP_MODEL_NAME</span><span class="p">]</span>
<span class="n">napistu_data</span> <span class="o">=</span> <span class="n">top_model_manager</span><span class="p">.</span><span class="n">load_napistu_data</span><span class="p">()</span>
<span class="n">napistu_data_store</span> <span class="o">=</span> <span class="n">top_model_manager</span><span class="p">.</span><span class="n">get_store</span><span class="p">()</span>
<span class="n">napistu_graph</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_napistu_graph</span><span class="p">()</span>

<span class="c1"># pull out the node types and create a mask to distinguish species and reactions
</span><span class="n">node_types</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">get_vertex_series</span><span class="p">(</span><span class="s">"node_type"</span><span class="p">)</span>
<span class="n">is_species_mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">node_types</span> <span class="o">==</span> <span class="s">"species"</span><span class="p">).</span><span class="n">values</span>

<span class="c1">## Extract model embeddings
</span><span class="n">edge_encodings</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">species_embeddings</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">exp</span><span class="p">,</span> <span class="n">evaluation_manager</span> <span class="ow">in</span> <span class="n">eval_managers</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
    <span class="c1"># load the model and data
</span>    <span class="n">model</span> <span class="o">=</span> <span class="n">evaluation_manager</span><span class="p">.</span><span class="n">load_model_from_checkpoint</span><span class="p">()</span>
    <span class="n">napistu_data</span> <span class="o">=</span> <span class="n">evaluation_manager</span><span class="p">.</span><span class="n">load_napistu_data</span><span class="p">()</span>
    
    <span class="c1"># pull out learned edge weights (if an edge encoder is present)
</span>    <span class="k">if</span> <span class="n">model</span><span class="p">.</span><span class="n">task</span><span class="p">.</span><span class="n">encoder</span><span class="p">.</span><span class="n">edge_weighting_type</span> <span class="o">==</span> <span class="s">"learned_encoder"</span><span class="p">:</span>
        <span class="n">edge_encodings</span><span class="p">[</span><span class="n">exp</span><span class="p">]</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">get_learned_edge_weights</span><span class="p">(</span><span class="n">napistu_data</span><span class="p">)</span>
    
    <span class="c1"># pull out the vertex embeddings
</span>    <span class="n">embeddings</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">get_embeddings</span><span class="p">(</span><span class="n">napistu_data</span><span class="p">)</span>
    <span class="n">species_embeddings</span><span class="p">[</span><span class="n">exp</span><span class="p">]</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">is_species_mask</span><span class="p">]</span>

    <span class="c1"># cleanup
</span>    <span class="n">evaluation_manager</span><span class="p">.</span><span class="n">experiment_dict</span> <span class="o">=</span> <span class="bp">None</span>
</code></pre></div></div>

<h3 id="comparing-vertex-embeddings">Comparing vertex embeddings</h3>

<p>To compare vertex embeddings across models, I compute species-species
cosine similarities within each model’s embedding space (# species ×
hidden dimension), then calculate the Spearman correlation between these
similarity matrices across model pairs. This approach works regardless
of embedding dimension and avoids the need for explicit alignment (e.g.,
Procrustes rotation).</p>
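<p>The similarity-then-correlation procedure can be sketched as follows. This is a hedged stand-in for the <code class="language-plaintext highlighter-rouge">compare_embeddings</code> helper used later in this post (the real one runs on a torch device); <code class="language-plaintext highlighter-rouge">pairwise_cosine</code> and <code class="language-plaintext highlighter-rouge">compare_embedding_dict</code> are hypothetical names:</p>

```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy.stats import spearmanr

def pairwise_cosine(emb: np.ndarray) -> np.ndarray:
    """All-pairs cosine similarity for a (# species x hidden dim) matrix."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T

def compare_embedding_dict(embeddings: dict) -> pd.DataFrame:
    """Spearman rho between flattened similarity matrices, per model pair."""
    rows = []
    for m1, m2 in combinations(sorted(embeddings), 2):
        s1 = pairwise_cosine(embeddings[m1])
        s2 = pairwise_cosine(embeddings[m2])
        iu = np.triu_indices_from(s1, k=1)  # off-diagonal entries only
        rho, _ = spearmanr(s1[iu], s2[iu])
        rows.append({"model1": m1, "model2": m2, "spearman_rho": rho})
    return pd.DataFrame(rows)
```

<p>Because each model is reduced to a species × species similarity matrix before comparison, models with different hidden dimensions remain directly comparable.</p>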

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">create_correlation_heatmap</span><span class="p">(</span>
    <span class="n">embedding_comparisons</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">model_order</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    
    <span class="c1"># Get all unique models
</span>    <span class="n">all_models</span> <span class="o">=</span> <span class="p">(</span>
        <span class="nb">set</span><span class="p">(</span><span class="n">embedding_comparisons</span><span class="p">[</span><span class="s">'model1'</span><span class="p">].</span><span class="n">unique</span><span class="p">())</span> <span class="o">|</span> \
        <span class="nb">set</span><span class="p">(</span><span class="n">embedding_comparisons</span><span class="p">[</span><span class="s">'model2'</span><span class="p">].</span><span class="n">unique</span><span class="p">())</span>
    <span class="p">)</span>
    
    <span class="c1"># Use provided order or default to sorted
</span>    <span class="k">if</span> <span class="n">model_order</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">models</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">all_models</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Validate that all models in data are in the provided order
</span>        <span class="n">missing_models</span> <span class="o">=</span> <span class="n">all_models</span> <span class="o">-</span> <span class="nb">set</span><span class="p">(</span><span class="n">model_order</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">missing_models</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Models in data but not in model_order: </span><span class="si">{</span><span class="n">missing_models</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="c1"># Use only models that exist in the data, in the specified order
</span>        <span class="n">models</span> <span class="o">=</span> <span class="p">[</span><span class="n">m</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">model_order</span> <span class="k">if</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">all_models</span><span class="p">]</span>
    
    <span class="c1"># Initialize matrix with 1s on diagonal
</span>    <span class="n">corr_matrix</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
        <span class="n">np</span><span class="p">.</span><span class="n">eye</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">models</span><span class="p">)),</span>
        <span class="n">index</span><span class="o">=</span><span class="n">models</span><span class="p">,</span> 
        <span class="n">columns</span><span class="o">=</span><span class="n">models</span>
    <span class="p">)</span>
    
    <span class="c1"># Fill in the correlations (both upper and lower triangles)
</span>    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">embedding_comparisons</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span>
        <span class="n">corr_matrix</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'model1'</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s">'model2'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'spearman_rho'</span><span class="p">]</span>
        <span class="n">corr_matrix</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'model2'</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s">'model1'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'spearman_rho'</span><span class="p">]</span>
    
    <span class="k">return</span> <span class="n">corr_matrix</span>

<span class="c1"># compute the embedding comparisons (cached since this takes a few minutes to run)
</span>
<span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">EMBEDDING_COMPARISONS_PATH</span><span class="p">)</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">OVERWRITE</span><span class="p">:</span>
    <span class="n">embedding_comparisons</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">EMBEDDING_COMPARISONS_PATH</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">device</span> <span class="o">=</span> <span class="n">select_device</span><span class="p">(</span><span class="n">mps_valid</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
    <span class="n">embedding_comparisons</span> <span class="o">=</span> <span class="n">compare_embeddings</span><span class="p">(</span><span class="n">species_embeddings</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
    <span class="n">embedding_comparisons</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">EMBEDDING_COMPARISONS_PATH</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="c1"># visualize the embedding comparisons
</span>
<span class="n">corr_matrix</span> <span class="o">=</span> <span class="n">create_correlation_heatmap</span><span class="p">(</span><span class="n">embedding_comparisons</span><span class="p">,</span> <span class="n">model_order</span><span class="o">=</span><span class="n">ordered_labels</span><span class="p">)</span>

<span class="c1"># Display as heatmap
</span><span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">triu</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">corr_matrix</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">bool</span><span class="p">),</span> <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span>
    <span class="n">corr_matrix</span><span class="p">,</span>
    <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">fmt</span><span class="o">=</span><span class="s">'.3f'</span><span class="p">,</span>
    <span class="n">cmap</span><span class="o">=</span><span class="s">'RdYlBu_r'</span><span class="p">,</span>
    <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">,</span>
    <span class="n">vmin</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">vmax</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">square</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">'label'</span><span class="p">:</span> <span class="s">'Spearman ρ'</span><span class="p">},</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span>
    <span class="s">'Vertex embedding similarity across models'</span><span class="p">,</span>
    <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
    <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">,</span>
    <span class="n">pad</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
    <span class="n">loc</span><span class="o">=</span><span class="s">'left'</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/compare_embeddings-output-1.png" alt="" /></p>

<p>All models produce highly correlated embeddings (ρ &gt; 0.6 across all
pairs), indicating they’ve captured similar biological structure despite
architectural differences. However, encoder choice does matter: GCN
embeddings correlate less strongly with GraphConv/SAGE (ρ ≈ 0.6-0.7)
than GraphConv and SAGE correlate with each other (ρ ≈ 0.9). This
suggests that while all models learn similar biological signals, GCN’s
symmetric normalization produces somewhat different vertex
representations than the mean aggregation used by GraphConv and SAGE.</p>

<h3 id="comparing-learned-edge-weights">Comparing learned edge weights</h3>

<p>As I’ve <a href="https://www.shackett.org/octopus_network/#decorating-the--graph-with-species-and-reaction-data">previously
discussed</a>,
confidence in regulatory interactions varies greatly across data
sources, and edge weights should capture this uncertainty for downstream
network analysis. However, determining appropriate edge weights is
challenging when multiple data source attributes each capture different
aspects of reliability.</p>

<p>The edge encoder provides a way to learn what makes a high-confidence
edge empirically. While I’ll examine the learned edge features in detail
later, here I’ll assess whether edge weights are consistent across model
architectures by directly comparing the ~10M learned weights from GCN +
edge encoding and GraphConv + edge encoding.</p>

<p>Since visualizing 10M points requires aggregation, I’ll use a hexbin
plot (bivariate histogram) with logit-transformed weights to map the
sigmoid outputs back to ℝ:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">safe_logit</span><span class="p">(</span><span class="n">p</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">eps</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1e-7</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">eps</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">eps</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">p</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">p</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">plot_edge_encoding_hexbin</span><span class="p">(</span>
    <span class="n">tensor1</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">tensor2</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">label1</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"Model 1"</span><span class="p">,</span>
    <span class="n">label2</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"Model 2"</span><span class="p">,</span>
    <span class="n">transform_to_logit</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="n">gridsize</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">50</span><span class="p">,</span>
    <span class="n">cmap</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">'viridis'</span><span class="p">,</span>
    <span class="n">figsize</span><span class="p">:</span> <span class="nb">tuple</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">8</span><span class="p">),</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">:</span>
    <span class="c1"># Apply logit transformation if requested
</span>    <span class="k">if</span> <span class="n">transform_to_logit</span><span class="p">:</span>
        <span class="n">tensor1</span> <span class="o">=</span> <span class="n">safe_logit</span><span class="p">(</span><span class="n">tensor1</span><span class="p">)</span>
        <span class="n">tensor2</span> <span class="o">=</span> <span class="n">safe_logit</span><span class="p">(</span><span class="n">tensor2</span><span class="p">)</span>
    
    <span class="c1"># Convert to numpy
</span>    <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">is_tensor</span><span class="p">(</span><span class="n">tensor1</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">tensor1</span><span class="p">.</span><span class="n">cpu</span><span class="p">().</span><span class="n">numpy</span><span class="p">()</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">tensor1</span><span class="p">)</span>
    
    <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">is_tensor</span><span class="p">(</span><span class="n">tensor2</span><span class="p">):</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">tensor2</span><span class="p">.</span><span class="n">cpu</span><span class="p">().</span><span class="n">numpy</span><span class="p">()</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">tensor2</span><span class="p">)</span>
    
    <span class="c1"># Create figure
</span>    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="n">figsize</span><span class="p">)</span>
    
    <span class="c1"># Create hexbin plot with log-scaled color
</span>    <span class="n">hexbin</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">hexbin</span><span class="p">(</span>
        <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span>
        <span class="n">gridsize</span><span class="o">=</span><span class="n">gridsize</span><span class="p">,</span>
        <span class="n">cmap</span><span class="o">=</span><span class="n">cmap</span><span class="p">,</span>
        <span class="n">mincnt</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">norm</span><span class="o">=</span><span class="n">LogNorm</span><span class="p">()</span>
    <span class="p">)</span>
    
    <span class="c1"># Add colorbar
</span>    <span class="n">cb</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">hexbin</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
    <span class="n">cb</span><span class="p">.</span><span class="n">set_label</span><span class="p">(</span><span class="s">'Count (log scale)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    
    <span class="c1"># Labels and title
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="n">label1</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">13</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="n">label2</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">13</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span>
        <span class="s">'Learned edge weight similarity across models'</span><span class="p">,</span>
        <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
        <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">,</span>
        <span class="n">pad</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
        <span class="n">loc</span><span class="o">=</span><span class="s">'left'</span>
    <span class="p">)</span>
    
    <span class="c1"># Add diagonal reference line (y=x)
</span>    <span class="n">min_val</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="nb">min</span><span class="p">(),</span> <span class="n">y</span><span class="p">.</span><span class="nb">min</span><span class="p">())</span>
    <span class="n">max_val</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="nb">max</span><span class="p">(),</span> <span class="n">y</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
        <span class="p">[</span><span class="n">min_val</span><span class="p">,</span> <span class="n">max_val</span><span class="p">],</span>
        <span class="p">[</span><span class="n">min_val</span><span class="p">,</span> <span class="n">max_val</span><span class="p">],</span>
        <span class="s">'r--'</span><span class="p">,</span>
        <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
        <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
        <span class="n">label</span><span class="o">=</span><span class="s">'y=x'</span>
    <span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'upper left'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span>
    
    <span class="c1"># Equal aspect ratio
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_aspect</span><span class="p">(</span><span class="s">'equal'</span><span class="p">,</span> <span class="n">adjustable</span><span class="o">=</span><span class="s">'box'</span><span class="p">)</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plot_edge_encoding_hexbin</span><span class="p">(</span>
    <span class="n">edge_encodings</span><span class="p">[</span><span class="s">"GCN + Edge Encoding"</span><span class="p">],</span>
    <span class="n">edge_encodings</span><span class="p">[</span><span class="s">"GraphConv + Edge Encoding"</span><span class="p">],</span>
    <span class="n">label1</span><span class="o">=</span><span class="s">"GCN + Edge Encoding"</span><span class="p">,</span>
    <span class="n">label2</span><span class="o">=</span><span class="s">"GraphConv + Edge Encoding"</span><span class="p">,</span>
    <span class="n">transform_to_logit</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">gridsize</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
    <span class="n">cmap</span><span class="o">=</span><span class="s">'viridis'</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/plot_edge_encoding_comparisons-output-1.png" alt="" /></p>

<p>The edge encoder paired with GraphConv uses a wider dynamic range (logit
-5 to 7, corresponding to sigmoid weights 0.007-0.999) compared to GCN +
edge encoder (logit -3 to 0, corresponding to 0.05-0.5). This means
GraphConv’s edge encoder more thoroughly distinguishes between low-,
medium-, and high-confidence edges. Both model-encoder combinations
agree on which edges to effectively ignore (lower-left quadrant, sigmoid
weights &lt; 0.05), but differ in how strongly they upweight reliable
edges.</p>
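<p>As a quick sanity check on the ranges quoted above, the logit-to-sigmoid mapping can be verified directly:</p>

```python
import math

def sigmoid(z: float) -> float:
    # inverse of the logit transform applied by safe_logit
    return 1.0 / (1.0 + math.exp(-z))

# GraphConv + edge encoder range: logits of roughly -5 to 7
print(round(sigmoid(-5), 3), round(sigmoid(7), 3))   # ~0.007 and ~0.999
# GCN + edge encoder range: logits of roughly -3 to 0
print(round(sigmoid(-3), 2), round(sigmoid(0), 2))   # ~0.05 and 0.5
```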

<p>Having established these cross-model comparisons, I’ll now examine the
top-performing model (GraphConv + edge encoding) in detail to understand
what biological patterns it has captured and where its limitations lie.</p>

<h2 id="evaluating-the-top-model">Evaluating the top model</h2>

<p>Having compared models on performance and learned representations, I’ll
now examine what the best-performing model (GraphConv + edge encoding)
has actually learned. GNN-based edge prediction offers three potential
contributions beyond predicting missing edges:</p>

<ol>
  <li>
    <p><strong>Vertex embeddings capture molecular similarity</strong> - Embeddings
group similar vertices across entity types, enabling community
detection and direct similarity queries. Community detection could
identify functional modules — sets of proteins, metabolites, and
reactions that cluster together in the embedding space. Similarity
queries could assess how closely any two entities resemble each
other, even across different entity types.</p>
  </li>
  <li>
    <p><strong>Learned edge weights for network analysis</strong> - The edge encoder
learns weights that reflect edge reliability, potentially replacing
hand-crafted heuristics in downstream analyses like network layouts,
shortest paths, propagation algorithms, and shallow embedding
methods. For multi-source networks like the Octopus, this is
particularly valuable: rather than manually deciding how to weight
STRING coexpression scores versus IntAct citation counts, the model
learns what combinations of edge attributes indicate reliability.</p>
  </li>
  <li>
    <p><strong>Edge predictions for hypothesis generation</strong> - While
self-supervised training is the primary motivation for edge
prediction, the predictions themselves may identify plausible
regulatory connections absent from current databases. This becomes
more promising with expressive heads that can model directional
regulation rather than the symmetric similarity assumed by the dot
product.</p>
  </li>
</ol>
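<p>For instance, the similarity queries mentioned in (1) reduce to cosine similarity in embedding space. A minimal sketch, using hypothetical helper names rather than anything in Napistu:</p>

```python
import numpy as np

def cosine_similarity(embeddings: np.ndarray, i: int, j: int) -> float:
    """Cosine similarity between vertices i and j."""
    a, b = embeddings[i], embeddings[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_neighbors(embeddings: np.ndarray, i: int, k: int = 5) -> np.ndarray:
    """Indices of the k vertices most similar to vertex i (excluding itself)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[i]
    sims[i] = -np.inf  # exclude self-similarity
    return np.argsort(sims)[::-1][:k]
```

<p>Because every entity type lives in the same embedding space, the same query works whether <code class="language-plaintext highlighter-rouge">i</code> is a protein and <code class="language-plaintext highlighter-rouge">j</code> is a metabolite, or both are reactions.</p>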

<p>I’ll explore each of these potential contributions using the GraphConv +
edge encoding model.</p>

<h3 id="molecular-similarity">Molecular similarity</h3>

<h4 id="embedding-structure">Embedding structure</h4>

<p>To assess the structure of the vertex embeddings, I’ll use UMAP to
project the 128-dimensional embeddings into 2D, then overlay vertex
attributes to explore what determines similarity in the embedding space.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">embeddings</span> <span class="o">=</span> <span class="n">species_embeddings</span><span class="p">[</span><span class="n">TOP_MODEL_NAME</span><span class="p">]</span>
<span class="n">umap_layout_species</span> <span class="o">=</span> <span class="n">layout_umap</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">n_neighbors</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>

<span class="n">mask</span> <span class="o">=</span> <span class="p">[</span><span class="nb">bool</span><span class="p">(</span><span class="n">re</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="s">"__species_type"</span><span class="p">,</span> <span class="n">x</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">napistu_data</span><span class="p">.</span><span class="n">get_vertex_feature_names</span><span class="p">()]</span>
<span class="n">indices</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">mask</span><span class="p">)</span> <span class="k">if</span> <span class="n">m</span><span class="p">]</span>
<span class="n">masks</span> <span class="o">=</span> <span class="n">napistu_data</span><span class="p">.</span><span class="n">x</span><span class="p">[:,</span> <span class="n">indices</span><span class="p">]</span>

<span class="c1"># only look at the vertices with embedding values
</span><span class="n">masks</span> <span class="o">=</span> <span class="n">masks</span><span class="p">[</span><span class="n">is_species_mask</span><span class="p">]</span>
<span class="n">mask_names</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">napistu_data</span><span class="p">.</span><span class="n">get_vertex_feature_names</span><span class="p">(),</span> <span class="n">mask</span><span class="p">)</span> <span class="k">if</span> <span class="n">m</span><span class="p">]</span>

<span class="c1"># drop empty masks
</span><span class="n">empty_masks</span> <span class="o">=</span> <span class="n">masks</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span>
<span class="n">masks</span> <span class="o">=</span> <span class="n">masks</span><span class="p">[:,</span> <span class="o">~</span><span class="n">empty_masks</span><span class="p">]</span>
<span class="n">mask_names</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">mask_names</span><span class="p">,</span> <span class="o">~</span><span class="n">empty_masks</span><span class="p">)</span> <span class="k">if</span> <span class="n">m</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plot_coordinates_with_masks</span><span class="p">(</span>
    <span class="n">coordinates</span><span class="o">=</span><span class="n">umap_layout_species</span><span class="p">,</span>
    <span class="n">masks</span><span class="o">=</span><span class="n">masks</span><span class="p">,</span>
    <span class="n">mask_names</span><span class="o">=</span><span class="n">mask_names</span><span class="p">,</span>
    <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">15</span><span class="p">),</span>
    <span class="n">ncols</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
    <span class="n">cmap_bg</span><span class="o">=</span><span class="s">'lightblue'</span><span class="p">,</span>
    <span class="n">cmap_fg</span><span class="o">=</span><span class="s">'darkred'</span><span class="p">,</span>
    <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
    <span class="n">s</span><span class="o">=</span><span class="mi">5</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/vertex_embedding_umap_plot-output-1.png" alt="" /></p>

<p>The UMAP visualization shows clear clustering by entity type: proteins
cluster with proteins, and metabolites with metabolites. This isn’t
surprising given the network’s strong homophily — entities of the same
type preferentially connect. STRING alone contributes &gt;80% of the edges
in the network, and STRING edges are exclusively protein-protein
interactions.</p>
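<p>As a quick illustration, homophily can be quantified as the fraction of
edges whose endpoints share a type. Here is a minimal sketch on toy data
(the function and arrays are illustrative, not part of the Napistu API):</p>

```python
import numpy as np

def edge_homophily(edge_index, node_types):
    """Fraction of edges whose two endpoints share a node type."""
    src, dst = edge_index
    return float(np.mean(node_types[src] == node_types[dst]))

# Toy graph: nodes 0-2 are proteins ("p"), node 3 is a metabolite ("m")
node_types = np.array(["p", "p", "p", "m"])
edge_index = np.array([[0, 1, 2, 0],
                       [1, 2, 0, 3]])
print(edge_homophily(edge_index, node_types))  # 3 of 4 edges are protein-protein -> 0.75
```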

<p>However, entity types don’t completely segregate. Proteins and
metabolites intermix at cluster boundaries, and the embedding shows
finer-grained structure within each entity type. This suggests the GNN
captures more than just entity type — it’s learning the biological
organization within these categories. To reveal what additional
information the embedding encodes, I will analyze how pathway membership
and data source annotations are reflected in the representations.</p>

<h4 id="pathway-similarity">Pathway similarity</h4>

<p>To assess pathway organization in the embeddings, I’ll use the
comprehensive pathway membership artifact created earlier. This binary
tensor encodes both coarse-grained data sources (8 sources) and
fine-grained pathway annotations (2800+ Reactome pathways) for each
vertex.</p>

<p>For each source and pathway, I’ll calculate the average cosine
similarity between all vertex pairs that belong to that category. The
fine-grained Reactome pathways are then aggregated to assess whether
individual pathways produce tighter clusters than Reactome as a whole.</p>
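<p>In the post this is handled by <code class="language-plaintext highlighter-rouge">calculate_pathway_similarities</code>;
the sketch below shows the core computation with illustrative names. For unit
vectors, the sum of all distinct pairwise dot products equals
‖Σᵢuᵢ‖² − n, which avoids materializing the O(n²) similarity matrix:</p>

```python
import numpy as np

def mean_pairwise_cosine(embeddings, mask):
    """Average cosine similarity over all distinct vertex pairs in a category.

    Uses the identity sum_{i != j} u_i . u_j = ||sum_i u_i||^2 - n for
    unit vectors, avoiding the O(n^2) pairwise similarity matrix.
    """
    x = embeddings[mask]
    n = x.shape[0]
    if n < 2:
        return np.nan
    u = x / np.linalg.norm(x, axis=1, keepdims=True)  # row-normalize to unit vectors
    total = np.sum(u.sum(axis=0) ** 2)                # ||sum_i u_i||^2
    return (total - n) / (n * (n - 1))

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
mask = np.zeros(100, dtype=bool)
mask[:30] = True  # a hypothetical 30-member pathway
print(mean_pairwise_cosine(emb, mask))
```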

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># not strictly needed, but we can check that artifacts align to the canonical vertex and edge feature names from the NapistuData instance
</span><span class="n">pathway_assignments</span> <span class="o">=</span> <span class="n">comprehensive_pathway_memberships</span><span class="p">.</span><span class="n">align_to_napistu_data</span><span class="p">(</span><span class="n">napistu_data</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">data</span>

<span class="n">pathway_similarities</span> <span class="o">=</span> <span class="n">calculate_pathway_similarities</span><span class="p">(</span>
    <span class="n">embedding_matrix</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">,</span>
    <span class="n">pathway_assignments</span> <span class="o">=</span> <span class="n">pathway_assignments</span><span class="p">[</span><span class="n">is_species_mask</span><span class="p">],</span>
    <span class="n">pathway_names</span> <span class="o">=</span> <span class="n">comprehensive_pathway_memberships</span><span class="p">.</span><span class="n">feature_names</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># rename categories for clarity
</span><span class="n">pathway_similarities</span><span class="p">[</span><span class="s">"Reactome (by pathway)"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pathway_similarities</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="s">"other"</span><span class="p">)</span>
<span class="n">pathway_similarities</span><span class="p">[</span><span class="s">"Reactome (overall)"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pathway_similarities</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="s">"Reactome"</span><span class="p">)</span>
<span class="k">del</span> <span class="n">pathway_similarities</span><span class="p">[</span><span class="s">"overall"</span><span class="p">]</span>

<span class="c1"># Sort by value
</span><span class="n">sorted_items</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">pathway_similarities</span><span class="p">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">categories</span> <span class="o">=</span> <span class="p">[</span><span class="n">item</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sorted_items</span><span class="p">]</span>
<span class="n">values</span> <span class="o">=</span> <span class="p">[</span><span class="n">item</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sorted_items</span><span class="p">]</span>

<span class="c1"># Create figure and axis
</span><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>

<span class="c1"># Create barplot
</span><span class="n">bars</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">barh</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'steelblue'</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>

<span class="c1"># Customize the plot
</span><span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Within-category cosine similarity'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Data source'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span>
    <span class="s">'Within-category cosine similarity by data source'</span><span class="p">,</span> 
    <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span>
    <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">,</span>
    <span class="n">pad</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
    <span class="n">loc</span><span class="o">=</span><span class="s">'left'</span>
<span class="p">)</span>

<span class="c1"># Add value labels on the bars
</span><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">cat</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="n">values</span><span class="p">)):</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="n">val</span> <span class="o">+</span> <span class="mf">0.01</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">val</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">'</span><span class="p">,</span> 
            <span class="n">va</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">9</span><span class="p">)</span>

<span class="c1"># Add grid for easier reading
</span><span class="n">ax</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'x'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_axisbelow</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/pathway_similarity-output-1.png" alt="" /></p>

<p>The embedding structure reflects the network’s edge composition. STRING
contributes &gt;80% of edges — all protein-protein interactions — which
means the training objective is dominated by getting protein
relationships right. This pushes the model to spread proteins across the
embedding space to capture their diverse interaction patterns, resulting
in low within-source similarity for protein-rich databases: STRING
(0.061), IntAct (0.046), OmniPath (0.044).</p>

<p>In contrast, specialized sources with fewer, lower-degree entities get
pushed into tighter regions of the embedding space. Reactome (0.550
overall, 0.586 by pathway) and Recon3D (0.531) show much higher
within-source similarity. These sources contribute distinctive entity
types — complexes and proteoforms for Reactome, and detailed metabolic
species for Recon3D. The model learns to distinguish these entities from
standard proteins. However, because they contribute few edges to the
training signal, the model clusters them together rather than resolving
fine-grained structure within them.</p>

<p>This explains why Reactome pathways show only modest additional cohesion
(0.586) compared to Reactome overall (0.550): the model learns “this is
a Reactome entity” but doesn’t strongly differentiate between specific
Reactome pathways.</p>

<h3 id="learned-edge-weights">Learned edge weights</h3>

<p>Next, I’ll explore what makes a high-confidence edge using sensitivity
analysis on the edge encoder.</p>

<p>The edge encoder maps 69 edge attributes to a single weight in (0, 1).
To evaluate each attribute’s importance, I calculate its average
gradient with respect to the learned edge weight across 1M randomly
sampled edges. This sensitivity score reveals which features most
strongly influence the model’s confidence in an edge.</p>
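<p>The post's <code class="language-plaintext highlighter-rouge">compute_edge_feature_sensitivity</code>
performs this calculation; the sketch below shows the underlying autograd
pattern with a toy stand-in for the trained edge encoder:</p>

```python
import torch
import torch.nn as nn

def edge_feature_sensitivity(encoder, edge_attr, n_samples=1000):
    """Mean gradient of the scalar edge weight w.r.t. each edge attribute."""
    idx = torch.randperm(edge_attr.shape[0])[:n_samples]
    x = edge_attr[idx].clone().requires_grad_(True)
    weights = encoder(x)      # (n_samples, 1) learned weights in (0, 1)
    weights.sum().backward()  # populates x.grad with per-edge gradients
    return x.grad.mean(dim=0)  # average sensitivity per attribute

# Toy encoder standing in for the trained edge encoder (69 attributes -> 1 weight)
encoder = nn.Sequential(nn.Linear(69, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
edge_attr = torch.randn(5000, 69)
print(edge_feature_sensitivity(encoder, edge_attr).shape)  # torch.Size([69])
```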

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">device</span> <span class="o">=</span> <span class="n">select_device</span><span class="p">(</span><span class="n">mps_valid</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>

<span class="n">top_model</span> <span class="o">=</span> <span class="n">top_model_manager</span><span class="p">.</span><span class="n">load_model_from_checkpoint</span><span class="p">(</span><span class="n">top_model_manager</span><span class="p">.</span><span class="n">best_checkpoint_path</span><span class="p">)</span>
<span class="n">edge_encoder</span> <span class="o">=</span> <span class="n">get_edge_encoder</span><span class="p">(</span><span class="n">top_model</span><span class="p">)</span>
<span class="n">feature_sensitivities</span> <span class="o">=</span> <span class="n">compute_edge_feature_sensitivity</span><span class="p">(</span><span class="n">edge_encoder</span><span class="p">,</span> <span class="n">napistu_data</span><span class="p">.</span><span class="n">edge_attr</span><span class="p">,</span> <span class="mi">1000000</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>

<span class="n">formatted_feature_sensitivities</span> <span class="o">=</span> <span class="n">format_edge_feature_sensitivity</span><span class="p">(</span><span class="n">feature_sensitivities</span><span class="p">,</span> <span class="n">napistu_data</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plot_edge_feature_sensitivity</span><span class="p">(</span><span class="n">formatted_feature_sensitivities</span><span class="p">,</span> <span class="n">top_n</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">8</span><span class="p">),</span> <span class="n">truncate_names</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/edge_feature_sensitivity_plot-output-1.png" alt="" /></p>

<p>The model shows clear preferences: literature-derived evidence (OmniPath
primary sources, STRING text mining) increases edge confidence, while
indirect functional evidence (STRING coexpression transfer, experimental
transfer) decreases it. This suggests the model may be learning to
construct mechanistic regulatory relationships — combinations of
physical interactions and functional associations — rather than simply
upweighting physical interactions or functional signals independently.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>Napistu aims to capture
mechanistic relationships at genome-wide scale: an edge from A→B
indicates that A is sufficient to modify B (at least in some contexts),
enabling paths through the network to represent regulatory cascades.
However, our understanding of regulation is highly incomplete.
Gold-standard mechanistic resources like Reactome and Recon3D are
accurate but sparse — low false positive rates but high false negative
rates. To complement them, Napistu integrates broader resources: STRING
(primarily functional associations like coexpression) and
IntAct/OmniPath (physical interactions like binding and
phosphorylation). These provide a dense web of <em>plausible</em> regulatory
connections, but conflate mechanistic regulation with its functional
byproducts.</p>

<p>The Octopus network thus integrates databases with fundamentally
different evidence types. Training on edge prediction across this
integrated network may push the model to learn which combinations of
features distinguish true mechanistic regulation — relationships that
are both physically direct and functionally consequential — from mere
functional associations.</p>

<p>The observation that learned edge weights prioritize literature-derived
evidence over coexpression is encouraging. It suggests the model may be
learning mechanism-grounded causality rather than being misled by
correlation.</p>

  </div>
</div>

<p>Interpreting individual features is complicated by the edge encoder’s
nonlinear combinations. For example, OmniPath primary sources show
strong positive sensitivity while total OmniPath sources show negative
sensitivity, suggesting the model values concentrated evidence from
specific high-quality sources over diffuse evidence from many sources.</p>

<p>Notably, none of the hand-crafted confidence scores from the original
databases — STRING combined score, IntAct MI score, Reactome FI score
— appear among the most sensitive features. This suggests the edge
encoder is learning data quality signals that differ from
expert-designed heuristics, reinforcing the value of end-to-end training
for edge weighting.</p>

<p>Finally, I’ll explore the types of edges being predicted by the
GraphConv GNN.</p>

<h3 id="edge-predictions">Edge predictions</h3>

<p>As previously discussed, when negative samples differ too greatly from
real edges, the model can exploit vertex attributes that indicate
implausible connections, rather than capturing the underlying biological
network structure. To address this shortcut, negative samples were
generated using the observed node_type strata (i.e., sampling an equal
number of species→species, species→reaction, and reaction→species edges
as in the real set of edges). Yet, certain vertex features could
trivially separate real and negative edges; for instance, regulatory
RNAs never interact with metabolites in the dataset.</p>
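<p>The stratified sampling scheme can be sketched as follows; the function and
variable names here are illustrative, not the Napistu-Torch implementation:</p>

```python
import numpy as np

def stratified_negative_samples(edge_strata, nodes_by_stratum, rng):
    """Sample negative edges matching the observed (src_type, dst_type) strata.

    edge_strata: list of (src_type, dst_type) labels, one per real edge.
    nodes_by_stratum: dict mapping node type -> array of candidate node indices.
    """
    strata, counts = np.unique(edge_strata, axis=0, return_counts=True)
    negatives = []
    for (src_t, dst_t), n in zip(strata, counts):
        # draw as many negatives in this stratum as there are real edges
        src = rng.choice(nodes_by_stratum[src_t], size=n)
        dst = rng.choice(nodes_by_stratum[dst_t], size=n)
        negatives.append(np.stack([src, dst]))
    return np.concatenate(negatives, axis=1)  # shape (2, n_real_edges)

rng = np.random.default_rng(0)
nodes = {"species": np.arange(0, 50), "reaction": np.arange(50, 80)}
real_strata = [("species", "reaction")] * 70 + [("reaction", "species")] * 30
neg = stratified_negative_samples(real_strata, nodes, rng)
print(neg.shape)  # (2, 100)
```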

<p>To evaluate whether the model exploits potential misalignment between
real edges and negative samples, I’ll compare predicted edge
probabilities for each edge class to the probabilities expected from
their relative frequencies in real and negative edges.</p>
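<p>To make the comparison concrete, here's a minimal sketch of the
observed/expected calculation on toy data (in the post,
<code class="language-plaintext highlighter-rouge">summarize_edge_predictions_by_strata</code>
does the real work):</p>

```python
import numpy as np
import pandas as pd

def log2_observed_expected(df):
    """Per-stratum log2(observed/expected): a stratum's share among real
    edges (label 1) versus its share among negative samples (label 0)."""
    real = df[df["label"] == 1]["stratum"].value_counts(normalize=True)
    neg = df[df["label"] == 0]["stratum"].value_counts(normalize=True)
    return np.log2(real / neg)

# Toy data where real and negative strata match exactly
df = pd.DataFrame({
    "stratum": (["protein->protein"] * 80 + ["complex->protein"] * 20) * 2,
    "label": [1] * 100 + [0] * 100,
})
print(log2_observed_expected(df))  # both strata at log2(1) = 0
```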

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">edge_predictions</span> <span class="o">=</span> <span class="n">predict</span><span class="p">(</span>
    <span class="n">top_model_manager</span><span class="p">.</span><span class="n">get_experiment_dict</span><span class="p">(),</span>
    <span class="n">checkpoint</span><span class="o">=</span><span class="n">top_model_manager</span><span class="p">.</span><span class="n">best_checkpoint_path</span>
<span class="p">)</span>

<span class="n">species_strata_recovery</span> <span class="o">=</span> <span class="n">summarize_edge_predictions_by_strata</span><span class="p">(</span><span class="n">edge_predictions</span><span class="p">,</span> <span class="n">edge_strata_by_node_species_type</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Filter to categories with &gt;= 100 edges
</span><span class="n">species_strata_recovery_filtered</span> <span class="o">=</span> <span class="n">species_strata_recovery</span><span class="p">[</span><span class="n">species_strata_recovery</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">100</span><span class="p">]</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plot_edge_predictions_by_strata</span><span class="p">(</span><span class="n">species_strata_recovery_filtered</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/edge_prediction_analysis_scatterplot-output-1.png" alt="" /></p>

<p>The plot reveals two distinct patterns. For common edge types —
particularly protein→protein interactions (bright yellow, log₂ O/E ≈ 0)
— real edges and negative samples occur at similar frequencies, yet
the model’s predictions span a wide range (0.3-1.0). This indicates the
model is learning patterns of vertex similarity that go beyond trivial
features like entity type. The model differentiates among protein pairs
based on their learned embeddings, not just their shared protein
identity.</p>

<p>For rare edge types involving specialized entities (complexes,
metabolites from Recon3D), the pattern changes. These categories often
show extreme enrichment or depletion (|log₂ O/E| &gt; 2), yet the model
assigns them consistently high prediction probabilities regardless of
whether they’re enriched or depleted. This likely reflects the tight
embedding clusters observed earlier for Reactome and Recon3D entities
— specialized molecular species cluster strongly in the embedding
space, leading the dot product head to predict high edge probabilities
between them even when such edges are rare in the training data.</p>
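<p>A toy numeric check illustrates why tight clusters inflate dot-product
scores (a sketch, not the model's actual head implementation):</p>

```python
import numpy as np

def edge_probability(a, b):
    """Dot-product head: sigmoid of the embedding inner product."""
    return 1 / (1 + np.exp(-np.dot(a, b)))

# Two vertices from a tight cluster (nearly identical embeddings)...
center = np.ones(8)
print(edge_probability(center + 0.01, center - 0.01))  # near 1: high predicted probability

# ...versus two dissimilar vertices pointing in opposite directions
print(edge_probability(np.ones(8), -np.ones(8)))  # near 0
```

<p>Because specialized Recon3D and Reactome entities embed close together,
their pairwise dot products are large and the head predicts edges between
them regardless of how rare such edges are in training.</p>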

<p>This analysis suggests the model has learned meaningful biological
structure within major edge types while potentially overgeneralizing for
rare, specialized entities. The wide prediction spread for
protein→protein edges is encouraging for future work: with more
expressive heads and validation datasets, these learned similarity
patterns could identify novel regulatory connections within
well-represented entity types.</p>

<h2 id="summary">Summary</h2>

<p>This post introduced Napistu-Torch and demonstrated that training GNNs
on genome-scale biological networks is feasible with standard hardware
— the complete suite of models trained in ~2 days on a laptop. More
importantly, I’ve established the foundational infrastructure needed to
explore this space systematically:</p>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">NapistuDataStore</code> handles conversion from biological networks
to PyTorch Geometric format with caching that eliminates rebuild
overhead.</li>
  <li>Modular encoders, heads, and edge encoders enable architectural
exploration through configuration files rather than code changes.</li>
  <li>PyTorch Lightning integration and CLI-driven workflows make
experiments reproducible and trackable via Weights &amp; Biases.</li>
  <li>Edge prediction provides self-supervised training without
ground-truth labels while stratified negative sampling maintains
computational tractability.</li>
  <li>Vertex embeddings, learned edge weights, and prediction patterns
reveal what biological structure the models capture.</li>
</ul>

<p>The most impactful findings are that different GNN architectures
converge on similar biological representations — suggesting the signal
is robust — and that this signal is not trivially recovered from vertex
attributes such as data source or molecule type.</p>

<p>This work creates a low-activation-energy foundation for exploring how
GNNs can tap the potential of comprehensive biological networks like the
Octopus. The hard infrastructure work is done — what remains is the
interesting part: using these tools to build accurate genome-scale
representations and discover novel biology.</p>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="ML" /><category term="python" /><category term="SWE" /><category term="GNNs" /><category term="PyTorch" /><summary type="html"><![CDATA[Biological applications of graph neural networks (GNNs) typically work with either small curated networks (100s-1,000s of nodes) or aggressively filtered subsets of large databases like STRING. The Octopus graph — which I introduced in my previous post — occupies a different space entirely. By integrating eight complementary pathway databases, it creates a genome-scale network with ~50K proteins, metabolites, and complexes spanning ~10M edges, all while preserving rich metadata about edge provenance, confidence scores, and mechanistic detail that filtered approaches discard. This puts the Octopus in uncharted territory: large enough to capture genome-scale complexity, yet structured enough to preserve the biological interpretability that makes network analysis valuable. GNNs scale well beyond genome-scale requirements (100M+ nodes in social networks), but remain unexplored for comprehensive biological networks that integrate regulatory, metabolic, and interaction data. Bridging this gap requires infrastructure that handles both the biological complexity of multi-source networks and the engineering complexity of training GNNs at scale. In this post, I’ll introduce Napistu-Torch — the infrastructure that finally makes this space navigable. Available from PyPI and indexed by the Napistu MCP server, Napistu-Torch provides a modular, reproducible framework for training GNNs on comprehensive biological networks. I’ll demonstrate that it’s feasible to train graph convolutional networks on the complete Octopus network using just a laptop (albeit with 2 days of training time for the full suite of models). 
But the real contribution is the ecosystem: the data structures, pipelines, and evaluation strategies that unlock far more sophisticated analyses.]]></summary></entry><entry><title type="html">Napistu’s Octopus: An 8-source human consensus pathway model</title><link href="https://www.shackett.org/octopus_network/" rel="alternate" type="text/html" title="Napistu’s Octopus: An 8-source human consensus pathway model" /><published>2025-10-07T00:00:00+00:00</published><updated>2025-10-07T00:00:00+00:00</updated><id>https://www.shackett.org/octopus_network</id><content type="html" xml:base="https://www.shackett.org/octopus_network/"><![CDATA[<p>Introducing the Octopus: Napistu’s eight-source Human Consensus Pathway
Model that unites the breadth of protein-protein interaction networks
with the depth of regulatory databases and metabolic models. The result
is a genome-scale directed graph that is both densely connected and
mechanistically precise. In this post, I will:</p>

<ul>
  <li>Provide an overview of the Octopus model and its construction</li>
  <li>Show side-by-side summaries of individual data sources highlighting
their complementarity</li>
  <li>Demonstrate that the model successfully merges results, creating a
dense network covering the complete cellular repertoire of genes,
metabolites, drugs, and complexes</li>
  <li>Illustrate how source-level information can be carried forward to
the Octopus’s graphical network to augment its vertex and edge
features</li>
</ul>

<!--more-->

<p><img src="https://www.shackett.org/figure/octopus_network/octopus_network.png" alt="Octopus network" style="width: 70%;" /></p>

<p>The model is distributed as a set of related Napistu assets bundled
together. The core components are two major Napistu data structures:</p>

<ul>
  <li><a href="https://github.com/napistu/napistu/wiki/SBML-DFs"><code class="language-plaintext highlighter-rouge">SBML_dfs</code></a>: An
in-memory relational database organizing molecular species (genes,
metabolites, complexes, drugs) and their relationships (reactions,
interactions). I’ll provide a thorough review of this format below.</li>
  <li><a href="https://github.com/napistu/napistu/wiki/Napistu-Graphs"><code class="language-plaintext highlighter-rouge">NapistuGraph</code></a>:
A directed graph representation of the same network, translating
molecular species and reactions into a network optimized for
downstream analysis.</li>
</ul>

<h2 id="building-the-">Building the 🐙</h2>

<p>I built the Octopus using Napistu’s CLI, which processes individual
pathway sources, merges them into consensus models, and translates them
into genome-scale molecular networks. The build process runs as a cached
<a href="https://github.com/napistu/napistu/blob/main/dev/create_human_consensus.qmd">Quarto
notebook</a>
— sufficient for present needs, but a dedicated workflow manager like
NextFlow or Airflow would be better suited for broader adoption within
the research community.</p>

<p>The Octopus build process follows seven sequential steps:</p>

<ol>
  <li><strong>Ingest</strong> data source-specific content and format as <code class="language-plaintext highlighter-rouge">SBML_dfs</code>
objects</li>
  <li><strong>Standardize</strong> compartmentalization — the Octopus uses an
uncompartmentalized approach for simplicity</li>
  <li><strong>Merge</strong> <code class="language-plaintext highlighter-rouge">SBML_dfs</code> objects into a single consensus model</li>
  <li><strong>Filter</strong> cofactors to prevent molecules like water from appearing
as hub regulators</li>
  <li><strong>Convert</strong> the <code class="language-plaintext highlighter-rouge">SBML_dfs</code> into a <code class="language-plaintext highlighter-rouge">NapistuGraph</code> network
representation</li>
  <li><strong>Generate</strong> derived summaries including precomputed molecular
distances</li>
  <li><strong>Package</strong> all components into a single artifact and deploy to
Google Cloud Storage</li>
</ol>

<p><img src="https://www.shackett.org/figure/octopus_network/octopus_build_process.png" alt="Graphical layout of the build process for the Octopus model" style="width: 100%;" /></p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>The uncompartmentalized approach
sacrifices one of Napistu’s most compelling features: modeling spatial
organization and transport mechanisms. These are fundamental to
physiology and pathophysiology — such as proton transport for ATP
synthesis or protein aggregation in neurodegeneration. Napistu could
uniquely extend quantitative metabolic modeling principles to
genome-scale networks of cellular physiology, even treating cell types
or tissues as compartments to model local-global process interactions in
Systems Physiology.</p>

<p>I’m excited about these directions, but building a compartmentalized
model is a major effort — it demands strong biological use cases and
high-quality data sources. The biggest challenge is defining the right
level of compartmental granularity and aligning data sources to that
resolution for effective integration. My current uncompartmentalized
approach sidesteps this complexity, though Human Proteome Atlas
integration could provide a path forward when the right biological
question arises.</p>

  </div>
</div>

<h2 id="follow-along">Follow Along!</h2>

<h3 id="environment-setup">Environment setup</h3>

<p>To follow along with the code in this post, you’ll need a Python
environment with the <code class="language-plaintext highlighter-rouge">napistu</code> package installed. Here’s a simple setup
using <code class="language-plaintext highlighter-rouge">venv</code>:</p>

<ol>
  <li>
    <p>Install <a href="https://docs.astral.sh/uv/#highlights">uv</a> (or use <code class="language-plaintext highlighter-rouge">pip</code> if
preferred)</p>
  </li>
  <li>
    <p>Setup a Python environment:</p>
  </li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv venv <span class="nt">--python</span> 3.11
<span class="nb">source</span> .venv/bin/activate

<span class="c"># Core dependencies</span>
uv pip <span class="nb">install</span> <span class="s2">"napistu==0.7.1"</span>
<span class="c"># if you'd like to render the notebook, you'll need to install these additional dependencies</span>
uv pip <span class="nb">install </span>seaborn ipykernel nbformat nbclient
python <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span><span class="o">=</span>blog-staging
</code></pre></div></div>

<ol start="3">
  <li>
    <p>Download the
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/octopus_network.qmd"><code class="language-plaintext highlighter-rouge">octopus_network.qmd</code></a>
notebook (or copy and paste the relevant code blocks)</p>
  </li>
  <li>
    <p>Configure <code class="language-plaintext highlighter-rouge">DATA_DIR</code> in the setup code to a path where you’re
comfortable saving the consensus <code class="language-plaintext highlighter-rouge">SBML_dfs</code> model</p>
  </li>
</ol>

<h3 id="configuring-the-python-notebook">Configuring the Python notebook</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">math</span> <span class="kn">import</span> <span class="n">pi</span>

<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>

<span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">utils</span> <span class="k">as</span> <span class="n">napistu_utils</span>
<span class="kn">from</span> <span class="nn">napistu.gcs</span> <span class="kn">import</span> <span class="n">downloads</span>
<span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">sbml_dfs_utils</span>
<span class="kn">from</span> <span class="nn">napistu.network</span> <span class="kn">import</span> <span class="n">ng_utils</span>
<span class="kn">from</span> <span class="nn">napistu.sbml_dfs_core</span> <span class="kn">import</span> <span class="n">SBML_dfs</span>
<span class="kn">from</span> <span class="nn">napistu.network.ng_core</span> <span class="kn">import</span> <span class="n">NapistuGraph</span>
<span class="kn">from</span> <span class="nn">napistu.ontologies.constants</span> <span class="kn">import</span> <span class="n">SPECIES_TYPE_PLURAL</span>

<span class="kn">from</span> <span class="nn">shackett_utils.blog.html_utils</span> <span class="kn">import</span> <span class="n">display_tabulator</span>

<span class="c1"># globals
</span><span class="n">DATA_DIR</span> <span class="o">=</span> <span class="s">"/tmp/napistu_data"</span>
<span class="n">ASSET</span> <span class="o">=</span> <span class="s">"human_consensus"</span>
<span class="n">VERSION_TAG</span> <span class="o">=</span> <span class="s">"20250923"</span>
<span class="n">INPUT_SBML_DFS_SUMMARIES_URL</span> <span class="o">=</span> <span class="s">"https://raw.githubusercontent.com/shackett/shackett/main/assets/data/octopus_input_sbml_dfs_summaries.json"</span>

<span class="c1"># utils
</span><span class="k">def</span> <span class="nf">cooccurrence_to_conditional_prob</span><span class="p">(</span><span class="n">cooccur_df</span><span class="p">):</span>
    <span class="n">set_sizes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">cooccur_df</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>
    <span class="n">intersection</span> <span class="o">=</span> <span class="n">cooccur_df</span><span class="p">.</span><span class="n">values</span>
    <span class="n">conditional_prob</span> <span class="o">=</span> <span class="n">intersection</span> <span class="o">/</span> <span class="n">set_sizes</span>  <span class="c1"># P(A|B) = |A ∩ B| / |B|
</span>    <span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">conditional_prob</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">cooccur_df</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cooccur_df</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">simple_pd_heatmap</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">plot_title</span><span class="p">,</span> <span class="n">colorbar_label</span><span class="o">=</span><span class="s">"Counts"</span><span class="p">,</span> <span class="n">fmt</span><span class="o">=</span><span class="s">"d"</span><span class="p">):</span>
    <span class="c1"># Set up the figure size and style
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">rcParams</span><span class="p">.</span><span class="n">update</span><span class="p">({</span><span class="s">'font.size'</span><span class="p">:</span> <span class="mi">15</span><span class="p">})</span>  <span class="c1"># Base font size
</span>    
    <span class="c1"># Create clustermap with proper sizing
</span>    <span class="n">g</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">clustermap</span><span class="p">(</span>
        <span class="n">df</span><span class="p">,</span> 
        <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>  <span class="c1"># Show values in cells
</span>        <span class="n">cmap</span><span class="o">=</span><span class="s">'Blues'</span><span class="p">,</span> 
        <span class="n">fmt</span><span class="o">=</span><span class="n">fmt</span><span class="p">,</span>
        <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">'label'</span><span class="p">:</span> <span class="n">colorbar_label</span><span class="p">},</span>
        <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span>  <span class="c1"># Larger figure
</span>        <span class="n">annot_kws</span><span class="o">=</span><span class="p">{</span><span class="s">'size'</span><span class="p">:</span> <span class="mi">12</span><span class="p">},</span>  <span class="c1"># Annotation font size
</span>        <span class="n">cbar_pos</span><span class="o">=</span><span class="p">(</span><span class="mf">0.02</span><span class="p">,</span> <span class="mf">0.83</span><span class="p">,</span> <span class="mf">0.03</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">),</span>  <span class="c1"># Colorbar position (left, bottom, width, height)
</span>    <span class="p">)</span>
    
    <span class="c1"># Increase font sizes for axis labels
</span>    <span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">get_xlabel</span><span class="p">(),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    <span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">get_ylabel</span><span class="p">(),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    
    <span class="c1"># Rotate and adjust tick labels for better readability
</span>    <span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'x'</span><span class="p">,</span> <span class="n">labelsize</span><span class="o">=</span><span class="mi">11</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">)</span>
    <span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelsize</span><span class="o">=</span><span class="mi">11</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    
    <span class="c1"># Add title with proper positioning (left-aligned)
</span>    <span class="n">g</span><span class="p">.</span><span class="n">fig</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span><span class="n">plot_title</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="mf">0.98</span><span class="p">,</span> <span class="n">horizontalalignment</span><span class="o">=</span><span class="s">'left'</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="mf">0.05</span><span class="p">)</span>
    
    <span class="c1"># Adjust layout to prevent clipping
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    
    <span class="c1"># Return the clustermap object for further customization if needed
</span>    <span class="k">return</span> <span class="n">g</span>

<span class="k">def</span> <span class="nf">create_pathway_radar_plot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">),</span> <span class="n">title</span><span class="o">=</span><span class="s">'Pathway Analysis Radar Plot'</span><span class="p">):</span>
    
    <span class="c1"># Get categories (columns) and pathways (rows)
</span>    <span class="n">categories</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
    <span class="n">pathways</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
    
    <span class="c1"># Number of variables
</span>    <span class="n">num_vars</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">categories</span><span class="p">)</span>
    
    <span class="c1"># Compute angle for each axis
</span>    <span class="n">angles</span> <span class="o">=</span> <span class="p">[</span><span class="n">n</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="n">num_vars</span><span class="p">)</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">pi</span> <span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_vars</span><span class="p">)]</span>
    <span class="n">angles</span> <span class="o">+=</span> <span class="n">angles</span><span class="p">[:</span><span class="mi">1</span><span class="p">]</span>  <span class="c1"># Complete the circle
</span>    
    <span class="c1"># Initialize the plot
</span>    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="n">figsize</span><span class="p">,</span> <span class="n">subplot_kw</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span><span class="n">projection</span><span class="o">=</span><span class="s">'polar'</span><span class="p">))</span>
    
    <span class="c1"># Color palette for different pathways
</span>    <span class="n">colors</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">cm</span><span class="p">.</span><span class="n">tab10</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">pathways</span><span class="p">)))</span>
    
    <span class="c1"># Plot each pathway
</span>    <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">pathway</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">pathways</span><span class="p">):</span>
        <span class="n">values</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pathway</span><span class="p">].</span><span class="n">values</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
        
        <span class="c1"># Log10 transform (add 1 to avoid log(0))
</span>        <span class="n">log_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">values</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
        
        <span class="c1"># Complete the circle
</span>        <span class="n">plot_values</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">log_values</span><span class="p">)</span> <span class="o">+</span> <span class="p">[</span><span class="n">log_values</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
        
        <span class="c1"># Plot
</span>        <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">angles</span><span class="p">,</span> <span class="n">plot_values</span><span class="p">,</span> <span class="s">'o-'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> 
                <span class="n">label</span><span class="o">=</span><span class="n">pathway</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">angles</span><span class="p">,</span> <span class="n">plot_values</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span>
    
    <span class="c1"># Fix axis to go in the right order and start at 12 o'clock
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_theta_offset</span><span class="p">(</span><span class="n">pi</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_theta_direction</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    
    <span class="c1"># Set category labels - use built-in matplotlib positioning
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">angles</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span>
    
    <span class="c1"># Adjust label padding to move them outside the plot
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'x'</span><span class="p">,</span> <span class="n">pad</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
    
    <span class="c1"># Set y-axis labels to show original values at powers of 10
</span>    <span class="c1"># Determine the max value to set appropriate range
</span>    <span class="n">max_log_value</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">values</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
    
    <span class="c1"># Create ticks at powers of 10: 10, 100, 1000, 10000, etc.
</span>    <span class="n">max_power</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">ceil</span><span class="p">(</span><span class="n">max_log_value</span><span class="p">))</span>
    <span class="n">ytick_values</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10</span><span class="o">**</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">max_power</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span>
    <span class="n">ytick_positions</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">v</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">ytick_values</span><span class="p">]</span>
    
    <span class="n">ax</span><span class="p">.</span><span class="n">set_yticks</span><span class="p">(</span><span class="n">ytick_positions</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">([</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">v</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">'</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">ytick_values</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="mi">9</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="mi">10</span><span class="o">**</span><span class="n">max_power</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
    
    <span class="c1"># Add grid
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="bp">True</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
    
    <span class="c1"># Add legend
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'upper right'</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.3</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    
    <span class="c1"># Add title
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">pad</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span>
</code></pre></div></div>
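As a quick sanity check of the conditional-probability helper, here is a minimal usage example on a toy co-occurrence matrix (the function is restated verbatim so the snippet runs on its own; the toy counts are made up for illustration):

```python
import numpy as np
import pandas as pd

def cooccurrence_to_conditional_prob(cooccur_df):
    # Same logic as the helper above: P(A|B) = |A ∩ B| / |B|.
    # Dividing by the diagonal broadcasts along rows, so entry [i, j]
    # is |A_i ∩ A_j| / |A_j|, i.e. conditioned on the column set.
    set_sizes = np.diag(cooccur_df.values)
    conditional_prob = cooccur_df.values / set_sizes
    return pd.DataFrame(conditional_prob, index=cooccur_df.index, columns=cooccur_df.columns)

# Toy co-occurrence matrix: diagonal entries are set sizes,
# off-diagonal entries are intersection sizes
cooccur = pd.DataFrame([[10, 4], [4, 20]], index=["A", "B"], columns=["A", "B"])

cond = cooccurrence_to_conditional_prob(cooccur)
# P(A|B) = 4/20 = 0.2, P(B|A) = 4/10 = 0.4, and each diagonal is 1
```

Note that the result is asymmetric: conditioning on the larger set yields the smaller probability, which is why the heatmaps below are read column-wise.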

<h2 id="data-sources">Data sources</h2>

<p>The Octopus’s integration success stems from Napistu’s flexible
<a href="https://github.com/napistu/napistu/wiki/SBML-DFs"><code class="language-plaintext highlighter-rouge">SBML_dfs</code></a> data
structure, which standardizes diverse pathway sources while preserving
their unique molecular and mechanistic contributions.</p>

<h3 id="overview-of-the-sbml_dfs-pathway-representation">Overview of the <code class="language-plaintext highlighter-rouge">SBML_dfs</code> pathway representation</h3>

<p>The core <code class="language-plaintext highlighter-rouge">SBML_dfs</code> data representation involves five tables linked by
primary key-foreign key relationships:</p>

<ul>
  <li><strong>Compartments</strong>: Define distinct cellular locations (e.g., cytosol,
nucleoplasm). Uncompartmentalized models contain only “cellular
component” by convention.</li>
  <li><strong>Species</strong>: Catalog distinct molecular entities including proteins,
metabolites, complexes, and drugs.</li>
  <li><strong>Compartmentalized Species</strong>: Map each species to its specific
compartmental locations.</li>
  <li><strong>Reactions</strong>: Represent distinct biochemical events including
metabolic reactions, complex formation, and physical/functional
interactions.</li>
  <li><strong>Reaction Species</strong>: Define each compartmentalized species’ role in
specific reactions (substrate, catalyst, inhibitor, etc.).</li>
</ul>

<p>Additional optional tables store quantitative annotations beyond the
core schema:</p>

<ul>
  <li><strong>Species Data</strong>: Contains tables with molecular species-specific
quantitative attributes.</li>
  <li><strong>Reactions Data</strong>: Contains tables with reaction-specific
quantitative attributes.</li>
</ul>

<p><img src="https://www.shackett.org/figure/octopus_network/sbml_dfs_schema.png" alt="The SBML_dfs database schema" style="width: 100%;" /></p>
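<p>To make the relational structure concrete, here is a minimal sketch of the five core tables as linked pandas DataFrames. The IDs, column names, and the glucose-phosphorylation example are hypothetical toy data for illustration, not Napistu’s actual constructor API:</p>

```python
import pandas as pd

# Five toy tables linked by primary-key/foreign-key relationships
# (hypothetical IDs and columns, not Napistu's real schema)
compartments = pd.DataFrame(
    {"c_name": ["cellular component"]}, index=pd.Index(["c0"], name="c_id")
)
species = pd.DataFrame(
    {"s_name": ["glucose", "hexokinase"]}, index=pd.Index(["s0", "s1"], name="s_id")
)
# compartmentalized species map each species to a compartment
comp_species = pd.DataFrame(
    {"s_id": ["s0", "s1"], "c_id": ["c0", "c0"]},
    index=pd.Index(["sc0", "sc1"], name="sc_id"),
)
reactions = pd.DataFrame(
    {"r_name": ["glucose phosphorylation"]}, index=pd.Index(["r0"], name="r_id")
)
# reaction species assign each compartmentalized species a role in a reaction
reaction_species = pd.DataFrame(
    {"r_id": ["r0", "r0"], "sc_id": ["sc0", "sc1"], "role": ["substrate", "catalyst"]},
    index=pd.Index(["rsc0", "rsc1"], name="rsc_id"),
)

# Following the foreign keys walks from a reaction to its participants
participants = (
    reaction_species
    .merge(comp_species, left_on="sc_id", right_index=True)
    .merge(species, left_on="s_id", right_index=True)
)
print(participants[["r_id", "s_name", "role"]])
```

<p>The same join pattern generalizes: any reaction-level question reduces to merging <code class="language-plaintext highlighter-rouge">reaction_species</code> outward through the foreign keys.</p>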

<h3 id="source-descriptions">Source descriptions</h3>

<p>Each data source is formatted as a separate SBML_dfs object
encapsulating its molecular species, their interactions, and any
quantitative data.</p>

<ul>
  <li><strong>Reactome</strong> is the gold-standard human pathway database, employing
rigorous expert curation with multi-tier review by over 820
scientists to produce reaction-centric models of cellular processes.</li>
  <li><strong>Recon3D</strong> is a comprehensive human metabolic model that enables
quantitative flux balance analysis and phenotype prediction.</li>
  <li><strong>STRING</strong> is a comprehensive protein interaction database that
integrates evidence from seven distinct channels with probabilistic
scoring to capture functional associations rather than directional
causality. Its strength is broad multi-organism coverage with
confidence scores calibrated to known pathway relationships.</li>
  <li><strong>IntAct</strong> is a manually curated database of experimentally verified
molecular interactions with unprecedented annotation depth, making
it the gold standard for high-confidence molecular interaction data.</li>
  <li><strong>Reactome-FI</strong> transforms Reactome’s detailed biochemical reactions
into simplified functional interaction networks using machine
learning.</li>
  <li><strong>OmniPath</strong> is a comprehensive integration database that combines
data from over 100 resources into unified directed signaling
networks with sophisticated consensus-building mechanisms. It
specializes in literature-curated activity flow interactions with
effect signs (activation/inhibition).</li>
  <li><strong>TRRUST</strong> uses sentence-based text mining to identify transcription
factor-target regulatory relationships from Medline abstracts.</li>
  <li><strong>Dogma</strong> is a Napistu-specific resource that contributes gene
annotations to help merge species across different primary
ontologies without adding reactions to the consensus model.</li>
</ul>

<h3 id="source-comparisons">Source comparisons</h3>

<p>To understand how these sources complement each other, I’ll provide four
side-by-side analyses examining the scale and characteristics of each
database:</p>

<ul>
  <li><strong>Scale</strong>: How many species and reactions does each source contain?</li>
  <li><strong>Molecular diversity</strong>: What types of entities exist (proteins,
metabolites, complexes, drugs)?</li>
  <li><strong>Interaction mechanisms</strong>: How do molecules connect (undirected
associations, directed regulation, metabolic transformations)?</li>
  <li><strong>Quantitative data</strong>: What additional measurements do sources
provide (confidence scores, expression levels, binding affinities)?</li>
</ul>

<p>These comparisons use summary statistics extracted from each source’s
SBML_dfs during the Octopus build process. The summaries were saved to a
public GitHub repository for reproducibility and transparency. For
reference, the source summaries were generated using this non-runnable
code block:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">consensus</span>
<span class="kn">from</span> <span class="nn">napistu.sbml_dfs_core</span> <span class="kn">import</span> <span class="n">SBML_dfs</span>
<span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">utils</span>

<span class="n">sbml_dfs_uris</span> <span class="o">=</span> <span class="p">[</span>
    <span class="c1"># mechanisms
</span>    <span class="s">"napistu_data/human_consensus/cache/reactome/reactome.pkl"</span><span class="p">,</span>
    <span class="s">"napistu_data/human_consensus/cache/bigg.pkl"</span><span class="p">,</span>
    <span class="c1"># consensus interactions
</span>    <span class="s">"napistu_data/human_consensus/cache/hpa_filtered_string.pkl"</span><span class="p">,</span>
    <span class="c1"># PPIs
</span>    <span class="s">"napistu_data/human_consensus/cache/intact.pkl"</span><span class="p">,</span>
    <span class="c1"># regulatory mechanisms
</span>    <span class="s">"napistu_data/human_consensus/cache/omnipath.pkl"</span><span class="p">,</span>
    <span class="s">"napistu_data/human_consensus/cache/reactome_fi.pkl"</span><span class="p">,</span>
    <span class="s">"napistu_data/human_consensus/cache/trrust.pkl"</span><span class="p">,</span>
    <span class="c1"># gene annotations
</span>    <span class="s">"napistu_data/human_consensus/cache/dogma_sbml_dfs.pkl"</span><span class="p">,</span>
<span class="p">]</span>

<span class="n">sbml_dfs_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">SBML_dfs</span><span class="p">.</span><span class="n">from_pickle</span><span class="p">(</span><span class="n">uri</span><span class="p">)</span> <span class="k">for</span> <span class="n">uri</span> <span class="ow">in</span> <span class="n">sbml_dfs_uris</span><span class="p">]</span>

<span class="c1"># reorganize as a list and table containing model-level metadata from the individual SBML_dfs
</span><span class="n">sbml_dfs_dict</span><span class="p">,</span> <span class="n">pw_index</span> <span class="o">=</span> <span class="n">consensus</span><span class="p">.</span><span class="n">prepare_consensus_model</span><span class="p">(</span><span class="n">sbml_dfs_list</span><span class="p">)</span>
<span class="n">sbml_dfs_dict_summaries</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">.</span><span class="n">get_summary</span><span class="p">()</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">sbml_dfs_dict</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>
<span class="n">utils</span><span class="p">.</span><span class="n">save_json</span><span class="p">(</span><span class="s">"&lt;&lt;MY_LOCAL_PATH&gt;&gt;/input_sbml_dfs_summaries.json"</span><span class="p">,</span> <span class="n">sbml_dfs_dict_summaries</span><span class="p">)</span>
</code></pre></div></div>

<p>I can load these pre-computed summaries directly from GitHub:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sbml_dfs_dict_summaries</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">INPUT_SBML_DFS_SUMMARIES_URL</span><span class="p">).</span><span class="n">json</span><span class="p">()</span>
</code></pre></div></div>

<p>These summaries enable direct side-by-side comparison of each source’s
unique characteristics and contributions to the consensus model.</p>

<h4 id="scale">Scale</h4>

<p>I’ll count entities in each source to assess their relative sizes.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># count each source's entities by type
</span><span class="n">entity_type_counts</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="s">"n_entity_types"</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">sbml_dfs_dict_summaries</span><span class="p">.</span><span class="n">items</span><span class="p">()})</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span><span class="n">entity_type_counts</span><span class="p">,</span> <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span> <span class="n">caption</span> <span class="o">=</span> <span class="s">"Entity counts per source"</span><span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Entity counts per source
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: &quot;Reactome&quot;, &quot;compartmentalized_species&quot;: 23905, &quot;compartments&quot;: 135, &quot;reaction_species&quot;: 63339, &quot;reactions&quot;: 15532, &quot;species&quot;: 22284}, {&quot;index&quot;: &quot;Recon3D&quot;, &quot;compartmentalized_species&quot;: 8083, &quot;compartments&quot;: 9, &quot;reaction_species&quot;: 54912, &quot;reactions&quot;: 10600, &quot;species&quot;: 4476}, {&quot;index&quot;: &quot;STRING&quot;, &quot;compartmentalized_species&quot;: 19384, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 8326852, &quot;reactions&quot;: 4163426, &quot;species&quot;: 19384}, {&quot;index&quot;: &quot;IntAct&quot;, &quot;compartmentalized_species&quot;: 22467, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 1002874, &quot;reactions&quot;: 501437, &quot;species&quot;: 22467}, {&quot;index&quot;: &quot;OmniPath&quot;, &quot;compartmentalized_species&quot;: 19509, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 952998, &quot;reactions&quot;: 476499, &quot;species&quot;: 19509}, {&quot;index&quot;: &quot;Reactome-FI&quot;, &quot;compartmentalized_species&quot;: 13733, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 942060, &quot;reactions&quot;: 471030, &quot;species&quot;: 13733}, {&quot;index&quot;: &quot;TRRUST&quot;, &quot;compartmentalized_species&quot;: 2862, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 16854, &quot;reactions&quot;: 8427, &quot;species&quot;: 2862}, {&quot;index&quot;: &quot;Dogma&quot;, &quot;compartmentalized_species&quot;: 19362, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 2, &quot;reactions&quot;: 1, &quot;species&quot;: 19362}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;compartmentalized_species&quot;, &quot;field&quot;: &quot;compartmentalized_species&quot;}, {&quot;title&quot;: 
&quot;compartments&quot;, &quot;field&quot;: &quot;compartments&quot;}, {&quot;title&quot;: &quot;reaction_species&quot;, &quot;field&quot;: &quot;reaction_species&quot;}, {&quot;title&quot;: &quot;reactions&quot;, &quot;field&quot;: &quot;reactions&quot;}, {&quot;title&quot;: &quot;species&quot;, &quot;field&quot;: &quot;species&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>Sources contain similar numbers of molecular species but reaction counts
vary dramatically — from <em>TRRUST</em>’s ~8K reactions to <em>STRING</em>’s 4.2M.
<em>Dogma</em> contains only one placeholder reaction since it contributes gene
annotations rather than interactions, helping merge species across
different ontologies (<em>Ensembl</em>, <em>UniProt</em>, <em>Entrez</em>).</p>

<h4 id="molecular-diversity">Molecular diversity</h4>

<p>Each source specializes in different molecular entity types based on
their ontological annotations.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">species_type_counts</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="s">"n_species_per_type"</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">sbml_dfs_dict_summaries</span><span class="p">.</span><span class="n">items</span><span class="p">()})</span>
    <span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'Int64'</span><span class="p">)</span>
    <span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span> <span class="o">=</span> <span class="n">SPECIES_TYPE_PLURAL</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
  <span class="n">species_type_counts</span><span class="p">,</span>
  <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span>
  <span class="n">caption</span> <span class="o">=</span> <span class="s">"Counts of molecular species types in each source"</span>
<span class="p">)</span>

<span class="n">RADAR_ORDER</span> <span class="o">=</span> <span class="p">[</span><span class="s">"proteins"</span><span class="p">,</span> <span class="s">"metabolites"</span><span class="p">,</span> <span class="s">"drugs"</span><span class="p">,</span> <span class="s">"unknowns"</span><span class="p">,</span> <span class="s">"other"</span><span class="p">,</span> <span class="s">"regulatory RNAs"</span><span class="p">,</span> <span class="s">"complexes"</span><span class="p">]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">create_pathway_radar_plot</span><span class="p">(</span>
    <span class="n">species_type_counts</span><span class="p">[</span><span class="n">RADAR_ORDER</span><span class="p">],</span>
    <span class="n">title</span> <span class="o">=</span> <span class="s">"Species types by pathway source"</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Counts of molecular species types in each source
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: &quot;Reactome&quot;, &quot;complexes&quot;: 14818, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 1591, &quot;other&quot;: 748, &quot;proteins&quot;: 5123, &quot;regulatory RNAs&quot;: 4, &quot;unknowns&quot;: 0}, {&quot;index&quot;: &quot;Recon3D&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 128, &quot;metabolites&quot;: 2664, &quot;other&quot;: 0, &quot;proteins&quot;: 1665, &quot;regulatory RNAs&quot;: 0, &quot;unknowns&quot;: 19}, {&quot;index&quot;: &quot;STRING&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 0, &quot;other&quot;: 0, &quot;proteins&quot;: 19384, &quot;regulatory RNAs&quot;: 0, &quot;unknowns&quot;: 0}, {&quot;index&quot;: &quot;IntAct&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 191, &quot;other&quot;: 177, &quot;proteins&quot;: 22051, &quot;regulatory RNAs&quot;: 48, &quot;unknowns&quot;: 0}, {&quot;index&quot;: &quot;OmniPath&quot;, &quot;complexes&quot;: 169, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 958, &quot;other&quot;: 523, &quot;proteins&quot;: 16294, &quot;regulatory RNAs&quot;: 929, &quot;unknowns&quot;: 636}, {&quot;index&quot;: &quot;Reactome-FI&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 0, &quot;other&quot;: 0, &quot;proteins&quot;: 13636, &quot;regulatory RNAs&quot;: 0, &quot;unknowns&quot;: 97}, {&quot;index&quot;: &quot;TRRUST&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 0, &quot;other&quot;: 0, &quot;proteins&quot;: 2809, &quot;regulatory RNAs&quot;: 0, &quot;unknowns&quot;: 53}, {&quot;index&quot;: &quot;Dogma&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 0, &quot;other&quot;: 0, &quot;proteins&quot;: 19362, &quot;regulatory RNAs&quot;: 0, &quot;unknowns&quot;: 0}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: 
&quot;complexes&quot;, &quot;field&quot;: &quot;complexes&quot;}, {&quot;title&quot;: &quot;drugs&quot;, &quot;field&quot;: &quot;drugs&quot;}, {&quot;title&quot;: &quot;metabolites&quot;, &quot;field&quot;: &quot;metabolites&quot;}, {&quot;title&quot;: &quot;other&quot;, &quot;field&quot;: &quot;other&quot;}, {&quot;title&quot;: &quot;proteins&quot;, &quot;field&quot;: &quot;proteins&quot;}, {&quot;title&quot;: &quot;regulatory RNAs&quot;, &quot;field&quot;: &quot;regulatory RNAs&quot;}, {&quot;title&quot;: &quot;unknowns&quot;, &quot;field&quot;: &quot;unknowns&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p><img src="/figure/source/2025-10-07-octopus_network/n_species_per_type-output-3.png" alt="" /></p>

<p>Clear specialization patterns emerge: gene-centric sources (<em>STRING</em>,
<em>Dogma</em>, <em>Reactome-FI</em>, <em>TRRUST</em>), metabolite-focused databases
(<em>Recon3D</em>), and comprehensive resources covering diverse molecular species
(<em>Reactome</em>, <em>IntAct</em>, <em>OmniPath</em>).</p>

<h4 id="interaction-mechanisms">Interaction mechanisms</h4>

<p>I’ll examine interaction types using Systems Biology Ontology (SBO)
terms that define molecular roles:</p>

<ul>
  <li><strong>Interactor</strong>: Undirected associations</li>
  <li><strong>Stimulator/Inhibitor/Modifier</strong>: Regulators of expression or
activity</li>
  <li><strong>Modified</strong>: Targets of regulation</li>
  <li><strong>Catalyst</strong>: Enzymes and transporters</li>
  <li><strong>Reactant/Product</strong>: Consumed or produced molecules</li>
</ul>
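<p>To make these roles concrete: signed roles like stimulator and inhibitor are what ultimately let edge-prediction models distinguish activation from inhibition. A minimal sketch of collapsing roles into signed edge weights (the mapping and names below are illustrative, not Napistu's internal convention):</p>

```python
# Hypothetical mapping from SBO participant role to a regulatory sign;
# illustrative only -- not Napistu's internal convention.
SBO_ROLE_SIGNS = {
    "stimulator": 1,   # activates expression or activity
    "inhibitor": -1,   # represses expression or activity
    "interactor": 0,   # undirected association, no sign
    "modifier": 0,     # regulator with unknown direction
    "catalyst": 0,     # required for the reaction, but unsigned
}

def edge_sign(sbo_role: str) -> int:
    """Return a signed weight for a role (0 for unsigned or unknown roles)."""
    return SBO_ROLE_SIGNS.get(sbo_role, 0)

print(edge_sign("inhibitor"))  # -1
```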

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sbo_term_counts</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="s">"sbo_name_counts"</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">sbml_dfs_dict_summaries</span><span class="p">.</span><span class="n">items</span><span class="p">()})</span>
    <span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'Int64'</span><span class="p">)</span>
    <span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">sbo_term_counts</span><span class="p">,</span>
    <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
    <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Counts of reaction participant roles in each source"</span>
<span class="p">)</span>

<span class="n">RADAR_ORDER</span> <span class="o">=</span> <span class="p">[</span><span class="s">"catalyst"</span><span class="p">,</span> <span class="s">"reactant"</span><span class="p">,</span> <span class="s">"product"</span><span class="p">,</span> <span class="s">"stimulator"</span><span class="p">,</span> <span class="s">"inhibitor"</span><span class="p">,</span> <span class="s">"modifier"</span><span class="p">,</span> <span class="s">"modified"</span><span class="p">,</span> <span class="s">"interactor"</span><span class="p">]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">create_pathway_radar_plot</span><span class="p">(</span>
    <span class="n">sbo_term_counts</span><span class="p">[</span><span class="n">RADAR_ORDER</span><span class="p">],</span>
    <span class="n">title</span> <span class="o">=</span> <span class="s">"SBO terms by pathway source"</span><span class="p">,</span>
    <span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Counts of reaction participant roles in each source
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;index&quot;: &quot;Reactome&quot;, &quot;catalyst&quot;: 6613, &quot;inhibitor&quot;: 1028, &quot;interactor&quot;: 0, &quot;modified&quot;: 0, &quot;modifier&quot;: 0, &quot;product&quot;: 23691, &quot;reactant&quot;: 29972, &quot;stimulator&quot;: 2035}, {&quot;index&quot;: &quot;Recon3D&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 0, &quot;interactor&quot;: 0, &quot;modified&quot;: 0, &quot;modifier&quot;: 0, &quot;product&quot;: 19913, &quot;reactant&quot;: 20512, &quot;stimulator&quot;: 14487}, {&quot;index&quot;: &quot;STRING&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 0, &quot;interactor&quot;: 8326852, &quot;modified&quot;: 0, &quot;modifier&quot;: 0, &quot;product&quot;: 0, &quot;reactant&quot;: 0, &quot;stimulator&quot;: 0}, {&quot;index&quot;: &quot;IntAct&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 0, &quot;interactor&quot;: 1002874, &quot;modified&quot;: 0, &quot;modifier&quot;: 0, &quot;product&quot;: 0, &quot;reactant&quot;: 0, &quot;stimulator&quot;: 0}, {&quot;index&quot;: &quot;OmniPath&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 43725, &quot;interactor&quot;: 124014, &quot;modified&quot;: 410328, &quot;modifier&quot;: 86700, &quot;product&quot;: 4164, &quot;reactant&quot;: 4164, &quot;stimulator&quot;: 279903}, {&quot;index&quot;: &quot;Reactome-FI&quot;, &quot;catalyst&quot;: 37936, &quot;inhibitor&quot;: 5991, &quot;interactor&quot;: 721680, &quot;modified&quot;: 110190, &quot;modifier&quot;: 0, &quot;product&quot;: 0, &quot;reactant&quot;: 0, &quot;stimulator&quot;: 66263}, {&quot;index&quot;: &quot;TRRUST&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 1715, &quot;interactor&quot;: 0, &quot;modified&quot;: 8427, &quot;modifier&quot;: 3775, &quot;product&quot;: 0, &quot;reactant&quot;: 0, &quot;stimulator&quot;: 2937}, {&quot;index&quot;: &quot;Dogma&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 0, 
&quot;interactor&quot;: 0, &quot;modified&quot;: 1, &quot;modifier&quot;: 1, &quot;product&quot;: 0, &quot;reactant&quot;: 0, &quot;stimulator&quot;: 0}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;catalyst&quot;, &quot;field&quot;: &quot;catalyst&quot;}, {&quot;title&quot;: &quot;inhibitor&quot;, &quot;field&quot;: &quot;inhibitor&quot;}, {&quot;title&quot;: &quot;interactor&quot;, &quot;field&quot;: &quot;interactor&quot;}, {&quot;title&quot;: &quot;modified&quot;, &quot;field&quot;: &quot;modified&quot;}, {&quot;title&quot;: &quot;modifier&quot;, &quot;field&quot;: &quot;modifier&quot;}, {&quot;title&quot;: &quot;product&quot;, &quot;field&quot;: &quot;product&quot;}, {&quot;title&quot;: &quot;reactant&quot;, &quot;field&quot;: &quot;reactant&quot;}, {&quot;title&quot;: &quot;stimulator&quot;, &quot;field&quot;: &quot;stimulator&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p><img src="/figure/source/2025-10-07-octopus_network/sbo_term_counts-output-3.png" alt="" /></p>

<p>Broad sources like <em>STRING</em> favor generic “interactor” classifications,
while specialized databases like <em>Recon3D</em> and <em>Reactome</em> capture
specific mechanistic detail more faithfully.</p>

<h4 id="quantitative-data">Quantitative data</h4>

<p>Beyond structural information, sources provide additional annotations
and metadata for both molecular species and their interactions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data_summaries</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="s">"data_summary"</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">sbml_dfs_dict_summaries</span><span class="p">.</span><span class="n">items</span><span class="p">()</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="s">"data_summary"</span><span class="p">][</span><span class="s">"reactions"</span><span class="p">])</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">}</span>

<span class="n">data_summaries_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">data_summaries</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
    <span class="k">for</span> <span class="n">entity_type</span><span class="p">,</span> <span class="n">entity_data</span> <span class="ow">in</span> <span class="n">v</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">entity_data</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">table_name</span><span class="p">,</span> <span class="n">table_data</span> <span class="ow">in</span> <span class="n">entity_data</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
                <span class="n">table_summary</span> <span class="o">=</span><span class="p">{</span>
                    <span class="s">"table_name"</span> <span class="p">:</span> <span class="n">table_name</span><span class="p">,</span>
                    <span class="s">"entity_type"</span> <span class="p">:</span> <span class="n">entity_type</span><span class="p">,</span>
                    <span class="s">"n_rows"</span> <span class="p">:</span> <span class="n">table_data</span><span class="p">[</span><span class="s">"n_rows"</span><span class="p">],</span>
                    <span class="s">"columns"</span> <span class="p">:</span> <span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">table_data</span><span class="p">[</span><span class="s">"columns"</span><span class="p">]),</span>
                <span class="p">}</span>
                <span class="n">data_summaries_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">table_summary</span><span class="p">)</span>
<span class="n">data_summaries_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data_summaries_list</span><span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">data_summaries_df</span><span class="p">,</span>
    <span class="n">wrap_columns</span> <span class="o">=</span> <span class="s">"columns"</span><span class="p">,</span>
    <span class="n">column_widths</span> <span class="o">=</span> <span class="p">{</span><span class="s">"columns"</span> <span class="p">:</span> <span class="s">"65%"</span><span class="p">},</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Additional species- and/or reactions-data in each source"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Additional species- and/or reactions-data in each source
</figcaption>

<div class="data-table" style="" data-table="[{&quot;table_name&quot;: &quot;STRING&quot;, &quot;entity_type&quot;: &quot;reactions&quot;, &quot;n_rows&quot;: 4163426, &quot;columns&quot;: &quot;neighborhood, neighborhood_transferred, fusion, cooccurence, homology, coexpression, coexpression_transferred, experiments, experiments_transferred, database, database_transferred, textmining, textmining_transferred, combined_score&quot;}, {&quot;table_name&quot;: &quot;IntAct&quot;, &quot;entity_type&quot;: &quot;reactions&quot;, &quot;n_rows&quot;: 501437, &quot;columns&quot;: &quot;publication_score, interaction_method_score, interaction_type_score, miscore, n_publications, interaction_method_biochemical, interaction_method_biophysical, interaction_method_imaging technique, interaction_method_post transcriptional interference, interaction_method_protein complementation assay, interaction_method_unknown, interaction_type_association, interaction_type_colocalization, interaction_type_direct interaction, interaction_type_physical association&quot;}, {&quot;table_name&quot;: &quot;OmniPath&quot;, &quot;entity_type&quot;: &quot;reactions&quot;, &quot;n_rows&quot;: 476499, &quot;columns&quot;: &quot;is_directed, is_stimulation, is_inhibition, consensus_direction, consensus_stimulation, consensus_inhibition, n_primary_sources, n_references, n_sources&quot;}, {&quot;table_name&quot;: &quot;OmniPath&quot;, &quot;entity_type&quot;: &quot;species&quot;, &quot;n_rows&quot;: 19509, &quot;columns&quot;: &quot;species_type&quot;}, {&quot;table_name&quot;: &quot;Reactome-FI&quot;, &quot;entity_type&quot;: &quot;reactions&quot;, &quot;n_rows&quot;: 471030, &quot;columns&quot;: &quot;fi_score&quot;}]" data-columns="[{&quot;title&quot;: &quot;table_name&quot;, &quot;field&quot;: &quot;table_name&quot;}, {&quot;title&quot;: &quot;entity_type&quot;, &quot;field&quot;: &quot;entity_type&quot;}, {&quot;title&quot;: &quot;n_rows&quot;, &quot;field&quot;: &quot;n_rows&quot;}, {&quot;title&quot;: 
&quot;columns&quot;, &quot;field&quot;: &quot;columns&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;65%&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>This additional data falls into two key categories: confidence scoring
systems (<em>STRING</em> interaction scores, <em>IntAct</em> experimental evidence)
and mechanistic granularity (<em>OmniPath</em> activation/inhibition
breakdowns, <em>IntAct</em> interaction types). Both provide crucial context
for assessing interaction reliability and biological mechanisms.</p>
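<p>In practice, these scores make it straightforward to prune low-confidence edges before analysis. A minimal sketch using a toy stand-in for the <em>STRING</em> reactions data (only the <code class="language-plaintext highlighter-rouge">combined_score</code> column is used here; 700 is STRING's conventional high-confidence cutoff, though the right threshold depends on the analysis):</p>

```python
import pandas as pd

# toy stand-in for STRING's reactions_data; the real table has many more columns
string_reactions_data = pd.DataFrame({
    "r_id": ["R1", "R2", "R3"],
    "combined_score": [950, 400, 720],
})

# keep only high-confidence interactions (STRING scores run from 0 to 1000)
high_conf = string_reactions_data.query("combined_score >= 700")
print(high_conf["r_id"].tolist())  # ['R1', 'R3']
```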

<h3 id="source-compatibility">Source compatibility</h3>

<p>Many data sources used by Napistu, like <em>STRING</em> and <em>OmniPath</em>, already
aim to integrate multiple upstream data sources into a consistent
consensus. Napistu builds on these resources to merge what would
otherwise be incompatible data sources into a single, well-mixed model.
Without proper integration, sources would separate like oil and water
— each forming disconnected subnetworks with minimal overlap. Instead,
we need to gel them together by establishing a unified molecular
vocabulary that enables seamless integration of source-specific
interactions.</p>

<p><strong>Napistu accomplishes this integration through:</strong></p>

<ul>
  <li><strong>Data standardization</strong>: Systematic identifiers and <em>SBO</em> ontology
terms create a common vocabulary for describing molecules and their
interactions across diverse sources</li>
  <li><strong>Algorithmic merging</strong>: A consensus procedure that identifies
equivalent entities and reconciles overlapping information into a
single integrated model</li>
</ul>

<h2 id="merging-sbml_dfs-objects-into-a-consensus-sbml_dfs">Merging <code class="language-plaintext highlighter-rouge">SBML_dfs</code> objects into a consensus <code class="language-plaintext highlighter-rouge">SBML_dfs</code></h2>

<p>Merging multiple <code class="language-plaintext highlighter-rouge">SBML_dfs</code> objects into a consensus model requires
resolving entities by determining which compartments, species, and
reactions are shared across sources. This process works through tables
in logical order (compartments &amp; species $\rightarrow$ compartmentalized
species $\rightarrow$ reactions &amp; reaction species), aggregating each
table across all models to construct:</p>

<ul>
  <li><strong>Consensus tables</strong>: New unified tables with standardized structure</li>
  <li><strong>Key mapping tables</strong>: Lookup tables linking old source-specific
primary keys to new consensus keys for updating foreign key
relationships</li>
</ul>
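<p>The mechanics of the key mapping tables can be sketched with a toy example (the field names here are hypothetical simplifications of the real schema):</p>

```python
import pandas as pd

# hypothetical key mapping produced while merging the species table:
# two source-specific species collapse onto one consensus species
key_map = pd.DataFrame({
    "source_s_id": ["string_S1", "reactome_S9"],
    "consensus_s_id": ["S0001", "S0001"],
})

# a downstream table still holding the old, source-specific foreign keys
reaction_species = pd.DataFrame({
    "rsc_id": ["RSC1", "RSC2"],
    "s_id": ["string_S1", "reactome_S9"],
})

# relabel the foreign keys with their consensus identifiers
lookup = key_map.set_index("source_s_id")["consensus_s_id"]
reaction_species["s_id"] = reaction_species["s_id"].map(lookup)
print(reaction_species["s_id"].tolist())  # ['S0001', 'S0001']
```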

<p>Two variables are critical for successful merging:</p>

<ul>
  <li><strong>Identifiers</strong>: Determine what <em>can</em> be merged by organizing
systematic identifiers as curated lists, each defined by ontology,
identifier, and bioqualifier (e.g., “BQB_IS” or “BQB_HAS_PART”)</li>
  <li><strong>Sources</strong>: Track what <em>has</em> been merged by associating each
consensus entity with all contributing data sources</li>
</ul>

<p>The <a href="https://github.com/napistu/napistu/wiki/Consensus">consensus
algorithm</a> proceeds
through four steps:</p>

<ol>
  <li><strong>Resolve foundational entities</strong>: Use greedy network-based matching
to identify species and compartments through shared identifiers,
connecting entities that share BQB-coded systematic identifiers</li>
  <li><strong>Define compartmentalized species</strong>: Map resolved species to their
appropriate compartmental locations</li>
  <li><strong>Merge reactions</strong>: Update reaction species annotations and
identify redundant reactions based on participants and mechanisms</li>
  <li><strong>Harmonize data tables</strong>: Update species_data and reactions_data
with consensus primary keys, aggregating results to ensure one row
per consensus entity</li>
</ol>
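<p>Step 1 can be illustrated with a toy version of the identifier-based matching: link any two species that share an (ontology, identifier, bioqualifier) tuple, then treat each connected group as one consensus species. This union-find sketch is my own simplification; Napistu's actual implementation is richer:</p>

```python
from itertools import combinations

# toy species keyed by hypothetical source-specific IDs, each with a set of
# (ontology, identifier, bioqualifier) tuples
species_identifiers = {
    "string_TP53":   {("uniprot", "P04637", "BQB_IS")},
    "reactome_TP53": {("uniprot", "P04637", "BQB_IS"),
                      ("ensembl_gene", "ENSG00000141510", "BQB_IS")},
    "trrust_MYC":    {("uniprot", "P01106", "BQB_IS")},
}

# union-find over species: merge any pair sharing at least one identifier
parent = {s: s for s in species_identifiers}

def find(s):
    while parent[s] != s:
        parent[s] = parent[parent[s]]  # path compression
        s = parent[s]
    return s

for a, b in combinations(species_identifiers, 2):
    if species_identifiers[a] & species_identifiers[b]:
        parent[find(a)] = find(b)

# connected groups become consensus species
groups = {}
for s in species_identifiers:
    groups.setdefault(find(s), []).append(s)
print(sorted(len(g) for g in groups.values()))  # [1, 2]
```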

<div class="content-section bio-section">
  <div class="section-content">
    <p>The consensus algorithm is robust
but exposes incompatibilities between models when sources use different
ontologies or resolution levels. If compartments are defined at
different granularities or species use incompatible identifier systems,
sources merge poorly — essentially speaking different languages.
Rather than combining, as intended, they produce networks with multiple
disconnected subgraphs that negate the benefits of consensus modeling.
Many incompatibilities can be identified during the preprocessing stage
through schema validation and syntactic checks. However, additional
conflicts often emerge only during post-consensus validation, which
evaluates whether molecular species—and, where applicable,
reactions—have been accurately and semantically merged across
heterogeneous sources.</p>

  </div>
</div>

<h2 id="loading-the-">Loading the 🐙</h2>

<p>With the consensus algorithm framework established, let’s examine the
actual eight-source Octopus model to see how well these theoretical
merging principles work in practice. To do this, I’ll download the
pre-built model from Google Cloud Storage and provide some quick
summaries of its core properties.</p>

<p>The Octopus network is available through GCS and gets updated
periodically as sources are added and Napistu data structures evolve. To
ensure reproducibility for this post and others like <a href="https://www.shackett.org/napistu_network_propagation/">Network Biology
with Napistu, Part 2: Translating Statistical Associations into
Biological
Mechanisms</a>,
tagged versions are preserved for reliable future access. Here, I’ll
load a tagged version compatible with Napistu 0.7.1.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>This represents the latest human
consensus model as of October 2025, but the model continues advancing
(hopefully toward a 10-source 🦑 model soon!). To access the most
current version, simply install the latest Napistu release and remove
the version tag from <code class="language-plaintext highlighter-rouge">gcs.downloads.load_public_napistu_asset</code>.</p>

  </div>
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ~3 min load
# download and cache the Octopus sbml_dfs and the other assets it's bundled with
</span><span class="n">sbml_dfs_path</span> <span class="o">=</span> <span class="n">downloads</span><span class="p">.</span><span class="n">load_public_napistu_asset</span><span class="p">(</span>
    <span class="n">asset</span> <span class="o">=</span> <span class="n">ASSET</span><span class="p">,</span>
    <span class="n">subasset</span> <span class="o">=</span> <span class="s">"sbml_dfs"</span><span class="p">,</span>
    <span class="n">data_dir</span> <span class="o">=</span> <span class="n">DATA_DIR</span><span class="p">,</span>
    <span class="c1"># download the tagged version for reproducibility and Python env compatibility
</span>    <span class="n">version</span> <span class="o">=</span> <span class="n">VERSION_TAG</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">sbml_dfs</span> <span class="o">=</span> <span class="n">SBML_dfs</span><span class="p">.</span><span class="n">from_pickle</span><span class="p">(</span><span class="n">sbml_dfs_path</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="-summary">🐙 summary</h2>

<p>With the core <code class="language-plaintext highlighter-rouge">SBML_dfs</code> object loaded, I’ll examine its high-level
properties using the <code class="language-plaintext highlighter-rouge">get_summary</code> method.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">summary_stats</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">get_summary</span><span class="p">()</span>
<span class="n">summary_table</span> <span class="o">=</span> <span class="n">sbml_dfs_utils</span><span class="p">.</span><span class="n">format_sbml_dfs_summary</span><span class="p">(</span><span class="n">summary_stats</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">summary_table</span><span class="p">,</span>
    <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
    <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Consensus SBML_dfs summaries"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Consensus SBML_dfs summaries
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;Metric&quot;: &quot;Species&quot;, &quot;Value&quot;: &quot;43,814&quot;}, {&quot;Metric&quot;: &quot;- Proteins&quot;, &quot;Value&quot;: &quot;20,980 (47.9%)&quot;}, {&quot;Metric&quot;: &quot;- Complexes&quot;, &quot;Value&quot;: &quot;14,971 (34.2%)&quot;}, {&quot;Metric&quot;: &quot;- Metabolites&quot;, &quot;Value&quot;: &quot;4,797 (10.9%)&quot;}, {&quot;Metric&quot;: &quot;- Other&quot;, &quot;Value&quot;: &quot;1,156 (2.6%)&quot;}, {&quot;Metric&quot;: &quot;- Regulatory RNAs&quot;, &quot;Value&quot;: &quot;981 (2.2%)&quot;}, {&quot;Metric&quot;: &quot;- Unknowns&quot;, &quot;Value&quot;: &quot;805 (1.8%)&quot;}, {&quot;Metric&quot;: &quot;- Drugs&quot;, &quot;Value&quot;: &quot;124 (0.3%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Compartments&quot;, &quot;Value&quot;: &quot;1&quot;}, {&quot;Metric&quot;: &quot;- cellular_component&quot;, &quot;Value&quot;: &quot;43,814 (100.0%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Compartmentalized Species&quot;, &quot;Value&quot;: &quot;43,814&quot;}, {&quot;Metric&quot;: &quot;Reactions&quot;, &quot;Value&quot;: &quot;4,806,439&quot;}, {&quot;Metric&quot;: &quot;Reaction Species&quot;, &quot;Value&quot;: &quot;9,642,389&quot;}]" data-columns="[{&quot;title&quot;: &quot;Metric&quot;, &quot;field&quot;: &quot;Metric&quot;}, {&quot;title&quot;: &quot;Value&quot;, &quot;field&quot;: &quot;Value&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>The consensus model contains genes/proteins, metabolites, complexes,
drugs, and regulatory RNAs within a single compartment — cellular
component (the root term of GO’s <em>cellular component</em> category). The
model encompasses approximately 4.8M reactions spanning undirected
interactions, directed regulation, and complex multi-participant
regulatory mechanisms.</p>

<p>While the earlier source comparisons demonstrated each database’s
potential contributions, the key question remains: did the sources
actually merge into a single well-mixed model? Successful integration
requires extensive molecular species sharing across sources and
meaningful reaction overlap. Rather than separate, highly connected
subnetworks with minimal inter-source connections, we want a unified
network where sources are genuinely integrated.</p>

<p>The model’s <code class="language-plaintext highlighter-rouge">Source</code> objects provide the answer — they track which
data sources contributed to each species, compartment, and reaction,
enabling direct assessment of integration success.</p>

<h3 id="shared-molecular-vocabulary">Shared molecular vocabulary</h3>

<p>To assess integration success, I’ll examine which data sources
contributed to each molecular species through contingency tables of
species-source occurrences. Values reflect how many of a source’s
molecular species merged into each consensus species — typically 0
(not present) or 1 (exact match), though occasionally higher when
multiple protein annotations roll up to a single gene.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">species_source_occurrence</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">get_source_occurrence</span><span class="p">(</span><span class="s">"species"</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">species_source_occurrence</span><span class="p">.</span><span class="n">head</span><span class="p">(),</span>
    <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Example molecular species and the sources they were originally found in"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Example molecular species and the sources they were originally found in
</figcaption>

<div class="data-table" style="" data-table="[{&quot;s_id&quot;: &quot;S00000000&quot;, &quot;Dogma&quot;: 1, &quot;IntAct&quot;: 2, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 1, &quot;Recon3D&quot;: 0, &quot;STRING&quot;: 1, &quot;TRRUST&quot;: 0}, {&quot;s_id&quot;: &quot;S00000001&quot;, &quot;Dogma&quot;: 1, &quot;IntAct&quot;: 3, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 3, &quot;Reactome-FI&quot;: 1, &quot;Recon3D&quot;: 0, &quot;STRING&quot;: 1, &quot;TRRUST&quot;: 1}, {&quot;s_id&quot;: &quot;S00000002&quot;, &quot;Dogma&quot;: 1, &quot;IntAct&quot;: 0, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 0, &quot;Recon3D&quot;: 0, &quot;STRING&quot;: 1, &quot;TRRUST&quot;: 0}, {&quot;s_id&quot;: &quot;S00000003&quot;, &quot;Dogma&quot;: 1, &quot;IntAct&quot;: 1, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 0, &quot;Recon3D&quot;: 0, &quot;STRING&quot;: 1, &quot;TRRUST&quot;: 1}, {&quot;s_id&quot;: &quot;S00000004&quot;, &quot;Dogma&quot;: 1, &quot;IntAct&quot;: 2, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 1, &quot;Reactome-FI&quot;: 1, &quot;Recon3D&quot;: 0, &quot;STRING&quot;: 1, &quot;TRRUST&quot;: 1}]" data-columns="[{&quot;title&quot;: &quot;s_id&quot;, &quot;field&quot;: &quot;s_id&quot;}, {&quot;title&quot;: &quot;Dogma&quot;, &quot;field&quot;: &quot;Dogma&quot;}, {&quot;title&quot;: &quot;IntAct&quot;, &quot;field&quot;: &quot;IntAct&quot;}, {&quot;title&quot;: &quot;OmniPath&quot;, &quot;field&quot;: &quot;OmniPath&quot;}, {&quot;title&quot;: &quot;Reactome&quot;, &quot;field&quot;: &quot;Reactome&quot;}, {&quot;title&quot;: &quot;Reactome-FI&quot;, &quot;field&quot;: &quot;Reactome-FI&quot;}, {&quot;title&quot;: &quot;Recon3D&quot;, &quot;field&quot;: &quot;Recon3D&quot;}, {&quot;title&quot;: &quot;STRING&quot;, &quot;field&quot;: &quot;STRING&quot;}, {&quot;title&quot;: &quot;TRRUST&quot;, &quot;field&quot;: &quot;TRRUST&quot;}]" 
data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>I can visualize species sharing patterns by converting the
species-by-source occurrence matrix ($X$) into a source-by-source
cooccurrence matrix ($C$) using:</p>

\[C = B^T B\]

<p>Where $B = \mathbf{1}(X \neq 0)$ is the binary matrix obtained by
converting non-zero entries of $X$ to 1.</p>
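<p>As a concrete sketch of this conversion: starting from a toy
species-by-source occurrence table, binarizing and multiplying recovers the
source-by-source counts (Napistu’s
<code class="language-plaintext highlighter-rouge">get_source_cooccurrence</code> computes this internally; the toy
data below is made up):</p>

```python
import pandas as pd

# toy occurrence matrix: rows are molecular species, columns are sources
X = pd.DataFrame(
    {"STRING": [1, 1, 0], "Reactome": [0, 3, 1], "TRRUST": [1, 0, 0]},
    index=["S1", "S2", "S3"],
)

# binarize: what matters is whether a species occurs in a source at all
B = (X != 0).astype(int)

# with species as rows, source-by-source cooccurrence counts come from B' B;
# the diagonal holds each source's total species count
C = B.T @ B
print(C.loc["STRING", "Reactome"])  # 1: only S2 appears in both sources
print(C.loc["STRING", "STRING"])    # 2: STRING contributes S1 and S2
```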

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">species_source_cooccurrence</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">get_source_cooccurrence</span><span class="p">(</span><span class="s">"species"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'Database'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'Database'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">simple_pd_heatmap</span><span class="p">(</span><span class="n">species_source_cooccurrence</span><span class="p">,</span> <span class="s">"Species Source Co-Occurrence"</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-10-07-octopus_network/species_source_cooccurrences-output-1.png" alt="" /></p>

<p>The heatmap reveals that gene-centric, dense sources (<em>STRING</em>, <em>Dogma</em>,
<em>Reactome-FI</em>) cluster together with similar molecular coverage, while
<em>Reactome</em>, <em>Recon3D</em>, and <em>TRRUST</em> remain more isolated. This reflects
<em>TRRUST</em>’s smaller size and the molecular specialization of <em>Reactome</em>
and <em>Recon3D</em> compared to comprehensive protein databases.</p>

<p>To quantify this specialization, I’ll identify species unique to
individual sources.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">private_species</span> <span class="o">=</span> <span class="n">species_source_occurrence</span><span class="p">.</span><span class="n">loc</span><span class="p">[(</span><span class="n">species_source_occurrence</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">).</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span>

<span class="n">private_species_source_counts</span> <span class="o">=</span> <span class="p">(</span>
    <span class="p">(</span><span class="n">private_species</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="nb">sum</span><span class="p">()</span>
    <span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="s">"Private species"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">to_frame</span><span class="p">()</span>
    <span class="p">.</span><span class="n">T</span>
<span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">private_species_source_counts</span><span class="p">,</span>
    <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
    <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
    <span class="n">include_index</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Private molecular species from each source"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Private molecular species from each source
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;Reactome&quot;: 16637, &quot;OmniPath&quot;: 2924, &quot;IntAct&quot;: 2418, &quot;Recon3D&quot;: 2297, &quot;STRING&quot;: 343, &quot;Reactome-FI&quot;: 168, &quot;TRRUST&quot;: 53, &quot;Dogma&quot;: 0}]" data-columns="[{&quot;title&quot;: &quot;Reactome&quot;, &quot;field&quot;: &quot;Reactome&quot;}, {&quot;title&quot;: &quot;OmniPath&quot;, &quot;field&quot;: &quot;OmniPath&quot;}, {&quot;title&quot;: &quot;IntAct&quot;, &quot;field&quot;: &quot;IntAct&quot;}, {&quot;title&quot;: &quot;Recon3D&quot;, &quot;field&quot;: &quot;Recon3D&quot;}, {&quot;title&quot;: &quot;STRING&quot;, &quot;field&quot;: &quot;STRING&quot;}, {&quot;title&quot;: &quot;Reactome-FI&quot;, &quot;field&quot;: &quot;Reactome-FI&quot;}, {&quot;title&quot;: &quot;TRRUST&quot;, &quot;field&quot;: &quot;TRRUST&quot;}, {&quot;title&quot;: &quot;Dogma&quot;, &quot;field&quot;: &quot;Dogma&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>Several sources contribute substantial numbers of private species, each
for logical reasons:</p>

<ul>
  <li><strong>Reactome</strong>: Detailed complex mechanisms with fine-grained complex
definitions</li>
  <li><strong>OmniPath</strong>: Extensive drug collections (<em>PubChem</em>) and microRNAs
(<em>MirBase</em>)</li>
  <li><strong>IntAct</strong>: Small molecules and microRNAs alongside core protein
interactions</li>
  <li><strong>Recon3D</strong>: Extensive coverage of metabolites and lipids</li>
</ul>

<p>The Octopus successfully integrates molecular species, with proteins
shared across multiple sources while specialized molecular types arise
from domain-specific resources.</p>

<h3 id="reaction-overlap-reveals-data-source-specialization">Reaction overlap reveals data source specialization</h3>

<p>To understand what individual sources contribute, I’ll analyze reaction
source occurrences and cooccurrences using a similar approach to the
species analysis above.</p>

<p>To interpret this analysis, readers should understand two important
points:</p>

<ul>
  <li><strong>Strict merging criteria</strong>: Reactions merge only with identical
participants and <em>SBO</em> terms. A reaction between genes <em>A</em> and <em>B</em>
won’t merge if one source labels them inhibitor $\rightarrow$
modified while another uses modifier $\rightarrow$ modified,
explaining the low overlap we’ll observe.</li>
  <li><strong>Analysis scope</strong>: This analysis excludes interactor-interactor
interactions because the existing tooling is designed to surface
information relevant for graph construction, where these
interactions become direct edges between molecular species with no
reaction vertices added.</li>
</ul>
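<p>The first point can be made concrete with a toy
<code class="language-plaintext highlighter-rouge">reaction_species</code>-style table (illustrative only, not Napistu’s
actual merge implementation): because the merge key includes each
participant’s <em>SBO</em> role, an <em>SBO</em> disagreement keeps otherwise
identical reactions separate.</p>

```python
import pandas as pd

# two sources report a relationship between genes A and B,
# but they disagree on A's SBO role
reaction_species = pd.DataFrame(
    {
        "r_id": ["rxn_a", "rxn_a", "rxn_b", "rxn_b"],
        "species": ["A", "B", "A", "B"],
        "sbo_term": ["inhibitor", "modified", "modifier", "modified"],
    }
)

# a reaction's merge key is its full set of (participant, SBO role) pairs
merge_keys = reaction_species.groupby("r_id")[["species", "sbo_term"]].apply(
    lambda g: frozenset(zip(g["species"], g["sbo_term"]))
)

# the SBO mismatch on A keeps the two reactions distinct in the consensus
print(merge_keys.nunique())  # 2 consensus reactions rather than 1
```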

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reactions_source_occurrence</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">get_source_occurrence</span><span class="p">(</span><span class="s">"reactions"</span><span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">reactions_source_occurrence</span><span class="p">.</span><span class="n">head</span><span class="p">(),</span>
    <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
    <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Example reactions and the sources they were originally found in"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Example reactions and the sources they were originally found in
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;r_id&quot;: &quot;R00000000&quot;, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 0, &quot;Recon3D&quot;: 0, &quot;TRRUST&quot;: 0}, {&quot;r_id&quot;: &quot;R00000214&quot;, &quot;OmniPath&quot;: 0, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 2, &quot;Recon3D&quot;: 0, &quot;TRRUST&quot;: 0}, {&quot;r_id&quot;: &quot;R00000215&quot;, &quot;OmniPath&quot;: 0, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 1, &quot;Recon3D&quot;: 0, &quot;TRRUST&quot;: 0}, {&quot;r_id&quot;: &quot;R00000216&quot;, &quot;OmniPath&quot;: 0, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 1, &quot;Recon3D&quot;: 0, &quot;TRRUST&quot;: 0}, {&quot;r_id&quot;: &quot;R00000217&quot;, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 0, &quot;Recon3D&quot;: 0, &quot;TRRUST&quot;: 0}]" data-columns="[{&quot;title&quot;: &quot;r_id&quot;, &quot;field&quot;: &quot;r_id&quot;}, {&quot;title&quot;: &quot;OmniPath&quot;, &quot;field&quot;: &quot;OmniPath&quot;}, {&quot;title&quot;: &quot;Reactome&quot;, &quot;field&quot;: &quot;Reactome&quot;}, {&quot;title&quot;: &quot;Reactome-FI&quot;, &quot;field&quot;: &quot;Reactome-FI&quot;}, {&quot;title&quot;: &quot;Recon3D&quot;, &quot;field&quot;: &quot;Recon3D&quot;}, {&quot;title&quot;: &quot;TRRUST&quot;, &quot;field&quot;: &quot;TRRUST&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>The reaction occurrence data is notably sparse, and the
order-of-magnitude differences in reaction counts between sources
complicate direct cooccurrence visualization.</p>

<p>To better assess source dependencies, I’ll calculate conditional
probabilities $\Pr(A|B)$ from the cooccurrence matrix, giving the
probability that a reaction from source $B$ also appears in source
$A$.</p>
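<p>The conversion itself is a one-liner: divide each row of the
cooccurrence count matrix by its diagonal entry, so entry $(B, A)$ becomes
$C_{BA} / C_{BB}$. Here is a minimal sketch of such a helper, with made-up
counts that echo the <em>TRRUST</em>/<em>OmniPath</em> overlap (the
<code class="language-plaintext highlighter-rouge">cooccurrence_to_conditional_prob</code> used below may differ in
its details):</p>

```python
import numpy as np
import pandas as pd

def cooccurrence_to_conditional_prob(cooccurrence: pd.DataFrame) -> pd.DataFrame:
    """Row-normalize cooccurrence counts by each source's diagonal total.

    Entry (B, A) becomes C[B, A] / C[B, B]: the probability that an entity
    from row source B also appears in column source A. (Sketch only.)
    """
    diag = pd.Series(np.diag(cooccurrence), index=cooccurrence.index)
    return cooccurrence.div(diag, axis=0)

# toy counts: 4 TRRUST reactions, 20 OmniPath reactions, 2 shared
C = pd.DataFrame(
    [[4, 2], [2, 20]],
    index=["TRRUST", "OmniPath"],
    columns=["TRRUST", "OmniPath"],
)
P = cooccurrence_to_conditional_prob(C)
print(P.loc["TRRUST", "OmniPath"])  # 0.5: half of TRRUST's reactions are in OmniPath
print(P.loc["OmniPath", "TRRUST"])  # 0.1: but only a tenth of OmniPath's are in TRRUST
```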

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reactions_source_cooccurrence</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">get_source_cooccurrence</span><span class="p">(</span><span class="s">"reactions"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'Database'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'Database'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">reactions_source_conditional_prob</span> <span class="o">=</span> <span class="n">cooccurrence_to_conditional_prob</span><span class="p">(</span><span class="n">reactions_source_cooccurrence</span><span class="p">)</span>

<span class="n">simple_pd_heatmap</span><span class="p">(</span><span class="n">reactions_source_conditional_prob</span><span class="p">,</span> <span class="s">"Conditional probability of reaction found in column source</span><span class="se">\n</span><span class="s"> given that it occurs in the row source"</span><span class="p">,</span> <span class="n">fmt</span><span class="o">=</span><span class="s">".3f"</span><span class="p">,</span> <span class="n">colorbar_label</span><span class="o">=</span><span class="s">"Probability"</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-10-07-octopus_network/reactions_source_cooccurrences-output-1.png" alt="" /></p>

<p>The conditional probability analysis reveals distinct patterns:
<em>Reactome</em> and <em>Recon3D</em> reactions remain largely unique, while
meaningful overlap exists between <em>Reactome-FI</em>, <em>OmniPath</em>, and
<em>TRRUST</em>. The strongest overlap occurs between <em>TRRUST</em> and <em>OmniPath</em>
(50% of <em>TRRUST</em> interactions also appear in <em>OmniPath</em>) — an expected
result since <em>TRRUST</em> is one of the resources incorporated into
<em>OmniPath</em>.</p>

<p>These patterns demonstrate successful species integration alongside
preserved source-specific reaction diversity, with each database
contributing substantial unique mechanistic content to the consensus
model.</p>

<h2 id="decorating-the--graph-with-species-and-reaction-data">Decorating the 🐙 graph with species and reaction data</h2>

<p><img src="https://www.shackett.org/figure/octopus_network/shell_stealing_octopus.jpeg" alt="Photo of a shell-stealing octopus" style="width: 100%;" /></p>

<p>While <code class="language-plaintext highlighter-rouge">SBML_dfs</code> comprehensively organizes pathway data, network
analyses like <a href="https://www.shackett.org/napistu_network_propagation/">personalized
PageRank</a> require
graph representations. <code class="language-plaintext highlighter-rouge">NapistuGraph</code>s convert this tabular data into
networks where compartmentalized species and reactions become vertices
connected by information flow edges. Built on <code class="language-plaintext highlighter-rouge">igraph</code>’s foundation,
they combine versatile graph operations with biological annotations,
data provenance, and specialized network biology methods.</p>

<p>A <code class="language-plaintext highlighter-rouge">NapistuGraph</code> was bundled with the <code class="language-plaintext highlighter-rouge">SBML_dfs</code> downloaded above,
enabling direct loading and analysis:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">napistu_graph_path</span> <span class="o">=</span> <span class="n">downloads</span><span class="p">.</span><span class="n">load_public_napistu_asset</span><span class="p">(</span>
    <span class="n">asset</span> <span class="o">=</span> <span class="n">ASSET</span><span class="p">,</span>
    <span class="n">subasset</span> <span class="o">=</span> <span class="s">"napistu_graph"</span><span class="p">,</span>
    <span class="n">data_dir</span> <span class="o">=</span> <span class="n">DATA_DIR</span><span class="p">,</span>
    <span class="n">version</span> <span class="o">=</span> <span class="n">VERSION_TAG</span>
<span class="p">)</span>

<span class="n">napistu_graph</span> <span class="o">=</span> <span class="n">NapistuGraph</span><span class="p">.</span><span class="n">from_pickle</span><span class="p">(</span><span class="n">napistu_graph_path</span><span class="p">)</span>

<span class="n">summary_stats</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">get_summary</span><span class="p">()</span>
<span class="n">summary_table</span> <span class="o">=</span> <span class="n">ng_utils</span><span class="p">.</span><span class="n">format_napistu_graph_summary</span><span class="p">(</span><span class="n">summary_stats</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">summary_table</span><span class="p">,</span>
    <span class="n">wrap_columns</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="n">column_widths</span> <span class="o">=</span> <span class="p">{</span><span class="s">"Value"</span> <span class="p">:</span> <span class="s">"80%"</span><span class="p">},</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Summaries of the NapistuGraph network"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Summaries of the NapistuGraph network
</figcaption>

<div class="data-table" style="" data-table="[{&quot;Metric&quot;: &quot;Vertices&quot;, &quot;Value&quot;: &quot;446,619&quot;}, {&quot;Metric&quot;: &quot;- Reaction&quot;, &quot;Value&quot;: &quot;402,805 (90.2%)&quot;}, {&quot;Metric&quot;: &quot;- Species&quot;, &quot;Value&quot;: &quot;43,814 (9.8%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Species Types&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;- Protein&quot;, &quot;Value&quot;: &quot;20,980 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Complex&quot;, &quot;Value&quot;: &quot;14,971 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Metabolite&quot;, &quot;Value&quot;: &quot;4,797 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Other&quot;, &quot;Value&quot;: &quot;1,156 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Regulatory Rna&quot;, &quot;Value&quot;: &quot;981 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Unknown&quot;, &quot;Value&quot;: &quot;805 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Drug&quot;, &quot;Value&quot;: &quot;124 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Edges&quot;, &quot;Value&quot;: &quot;9,566,151&quot;}, {&quot;Metric&quot;: &quot;- interactor&quot;, &quot;Value&quot;: &quot;8,721,262 (91.2%)&quot;}, {&quot;Metric&quot;: &quot;- modified&quot;, &quot;Value&quot;: &quot;380,929 (4.0%)&quot;}, {&quot;Metric&quot;: &quot;- stimulator&quot;, &quot;Value&quot;: &quot;223,521 (2.3%)&quot;}, {&quot;Metric&quot;: &quot;- modifier&quot;, &quot;Value&quot;: &quot;88,036 (0.9%)&quot;}, {&quot;Metric&quot;: &quot;- inhibitor&quot;, &quot;Value&quot;: &quot;44,687 (0.5%)&quot;}, {&quot;Metric&quot;: &quot;- catalyst&quot;, &quot;Value&quot;: &quot;43,093 (0.5%)&quot;}, {&quot;Metric&quot;: &quot;- reactant&quot;, &quot;Value&quot;: &quot;34,972 (0.4%)&quot;}, {&quot;Metric&quot;: &quot;- product&quot;, &quot;Value&quot;: &quot;29,651 (0.3%)&quot;}, 
{&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Vertex Attributes&quot;, &quot;Value&quot;: &quot;name, node_name, node_type, species_type, s_id, c_id&quot;}, {&quot;Metric&quot;: &quot;Edge Attributes&quot;, &quot;Value&quot;: &quot;from, to, r_id, sbo_term, stoichiometry, species_type, r_isreversible, direction, string_wt, weight, upstream_weight, source_wt&quot;}]" data-columns="[{&quot;title&quot;: &quot;Metric&quot;, &quot;field&quot;: &quot;Metric&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true}, {&quot;title&quot;: &quot;Value&quot;, &quot;field&quot;: &quot;Value&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;80%&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>Many vertex and edge attributes mirror those from the <code class="language-plaintext highlighter-rouge">SBML_dfs</code>
summaries.</p>

<p>The critical quantitative attribute is <em>edge weight</em>, representing each
interaction’s plausibility and strength. Edge weights drive most graph
algorithms — from shortest path calculations and network layouts to
propagation methods. However, capturing this complexity in a single
attribute becomes increasingly challenging as more sources contribute
quantitative information relevant to regulatory plausibility.</p>
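<p>A toy illustration of why the weighting scheme matters: on the same
three-vertex topology, lowering the weight of a single edge flips which
path is shortest. This pure-Python Dijkstra sketch is for intuition only;
in practice these queries run through
<code class="language-plaintext highlighter-rouge">igraph</code>.</p>

```python
import heapq

def dijkstra(edges, source, target):
    """Weighted shortest-path distance over an undirected edge dict."""
    adjacency = {}
    for (u, v), w in edges.items():
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))
    best, heap = {}, [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node in best:
            continue  # already settled with a shorter distance
        best[node] = dist
        for neighbor, weight in adjacency.get(node, []):
            if neighbor not in best:
                heapq.heappush(heap, (dist + weight, neighbor))
    return best.get(target, float("inf"))

# an implausible direct A -> C edge (high weight) loses to the detour
edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 5.0}
print(dijkstra(edges, "A", "C"))  # 2.0: the A -> B -> C path wins

# a trustworthy direct interaction (low weight) wins instead
edges[("A", "C")] = 0.5
print(dijkstra(edges, "A", "C"))  # 0.5: the direct edge wins
```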

<div class="content-section ai-aside">
  <div class="section-content">
    <p>Earlier versions of the Napistu human
consensus model used simple heuristics for edge weighting: assign
favorable (low) weights to sparse mechanistic sources like <em>Reactome</em>
while quantitatively weighting <em>STRING</em> based on its confidence scores.
This approach worked when <em>STRING</em> dominated the quantitative landscape,
but the Octopus model’s addition of moderately dense sources —
<em>OmniPath</em>, <em>IntAct</em>, and <em>Reactome-FI</em> — each with their own
confidence metrics complicates this strategy. Rather than continuing to
stack ad hoc weighting schemes, the growing diversity of quantitative
evidence calls for more principled approaches.</p>

<p>I’m increasingly interested in learning edge trustworthiness empirically
through predictive performance rather than manual calibration. While
this is challenging for biological applications like regulatory network
prediction due to limited ground truth data, the network itself offers
opportunities for self-supervised learning. However, realizing this
potential requires a rich feature space beyond basic topological
properties — which brings us to the wealth of quantitative information
that can be integrated into the NapistuGraph representation.</p>

  </div>
</div>

<p>The default NapistuGraph contains a limited array of vertex and edge
attributes, but there’s actually a wealth of quantitative information
highlighted throughout this post that can be integrated directly into
the Octopus network. The power of Napistu’s design becomes apparent when
we start layering in this additional context. I’ll demonstrate by
augmenting the graph with two particularly valuable information types:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">add_sbml_dfs_summaries</code>: Generates source and ontology occurrence
data for all vertices, revealing which databases contributed to each
node and what biological categories they represent</li>
  <li><code class="language-plaintext highlighter-rouge">add_all_entity_data</code>: Transfers comprehensive quantitative
measurements from the <code class="language-plaintext highlighter-rouge">reactions_data</code> and <code class="language-plaintext highlighter-rouge">species_data</code> tables directly
onto their corresponding edges and vertices</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># augment the graph
# add ontology and source data to vertices
</span><span class="n">napistu_graph</span><span class="p">.</span><span class="n">add_sbml_dfs_summaries</span><span class="p">(</span><span class="n">sbml_dfs</span><span class="p">,</span> <span class="n">stratify_by_bqb</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>

<span class="c1"># add reactions_data to edges
</span><span class="n">napistu_graph</span><span class="p">.</span><span class="n">add_all_entity_data</span><span class="p">(</span><span class="n">sbml_dfs</span><span class="p">,</span> <span class="s">"reactions"</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">napistu_graph</span><span class="p">.</span><span class="n">add_all_entity_data</span><span class="p">(</span><span class="n">sbml_dfs</span><span class="p">,</span> <span class="s">"species"</span><span class="p">,</span> <span class="n">mode</span> <span class="o">=</span> <span class="s">"extend"</span><span class="p">)</span>

<span class="n">summary_stats</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">get_summary</span><span class="p">()</span>
<span class="n">summary_table</span> <span class="o">=</span> <span class="n">ng_utils</span><span class="p">.</span><span class="n">format_napistu_graph_summary</span><span class="p">(</span><span class="n">summary_stats</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">summary_table</span><span class="p">,</span>
    <span class="n">wrap_columns</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="n">column_widths</span> <span class="o">=</span> <span class="p">{</span><span class="s">"Value"</span> <span class="p">:</span> <span class="s">"80%"</span><span class="p">},</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Post-augmentation summaries of the NapistuGraph network"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Post-augmentation summaries of the NapistuGraph network
</figcaption>

<div class="data-table" style="" data-table="[{&quot;Metric&quot;: &quot;Vertices&quot;, &quot;Value&quot;: &quot;446,619&quot;}, {&quot;Metric&quot;: &quot;- Reaction&quot;, &quot;Value&quot;: &quot;402,805 (90.2%)&quot;}, {&quot;Metric&quot;: &quot;- Species&quot;, &quot;Value&quot;: &quot;43,814 (9.8%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Species Types&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;- Protein&quot;, &quot;Value&quot;: &quot;20,980 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Complex&quot;, &quot;Value&quot;: &quot;14,971 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Metabolite&quot;, &quot;Value&quot;: &quot;4,797 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Other&quot;, &quot;Value&quot;: &quot;1,156 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Regulatory Rna&quot;, &quot;Value&quot;: &quot;981 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Unknown&quot;, &quot;Value&quot;: &quot;805 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Drug&quot;, &quot;Value&quot;: &quot;124 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Edges&quot;, &quot;Value&quot;: &quot;9,566,151&quot;}, {&quot;Metric&quot;: &quot;- interactor&quot;, &quot;Value&quot;: &quot;8,721,262 (91.2%)&quot;}, {&quot;Metric&quot;: &quot;- modified&quot;, &quot;Value&quot;: &quot;380,929 (4.0%)&quot;}, {&quot;Metric&quot;: &quot;- stimulator&quot;, &quot;Value&quot;: &quot;223,521 (2.3%)&quot;}, {&quot;Metric&quot;: &quot;- modifier&quot;, &quot;Value&quot;: &quot;88,036 (0.9%)&quot;}, {&quot;Metric&quot;: &quot;- inhibitor&quot;, &quot;Value&quot;: &quot;44,687 (0.5%)&quot;}, {&quot;Metric&quot;: &quot;- catalyst&quot;, &quot;Value&quot;: &quot;43,093 (0.5%)&quot;}, {&quot;Metric&quot;: &quot;- reactant&quot;, &quot;Value&quot;: &quot;34,972 (0.4%)&quot;}, {&quot;Metric&quot;: &quot;- product&quot;, &quot;Value&quot;: &quot;29,651 (0.3%)&quot;}, 
{&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Vertex Attributes&quot;, &quot;Value&quot;: &quot;name, node_name, node_type, species_type, s_id, c_id, Dogma, chemspider, go, metanetx.reaction, chebi, intact, mdpi, kegg.glycan, signor, doi, envipath, url, bigg.reaction, pubchem, kegg.compound, sabiork, hmdb, ensembl_protein, metanetx.chemical, uniprot, refseq, seed.compound, reactome, hprd, lipidmaps, ols, omim, sgc, mirbase, Recon3D, slm, Reactome-FI, ec-code, ncbi_books, kegg.drug, reactome.reaction, refseq_synonym, rnacentral, other, biocyc, pubmed, ebi_refseq, bigg.metabolite, ncbi_entrez_gene, reactome.compound, iuphar.ligand, NCI_Thesaurus, smiles, dx_doi, pubchem.compound, TRRUST, IntAct, doid, inchi_key, refseq_name, ccds, seed.reaction, ncbigi, OmniPath, STRING, Reactome, biorxiv, phosphosite, rhea, ensembl_gene, corum, ensembl_transcript, matrixdb_biomolecule, kegg.reaction, OmniPath_species_type&quot;}, {&quot;Metric&quot;: &quot;Edge Attributes&quot;, &quot;Value&quot;: &quot;from, to, r_id, sbo_term, stoichiometry, species_type, r_isreversible, direction, string_wt, weight, upstream_weight, source_wt, STRING_neighborhood_transferred, IntAct_n_publications, OmniPath_n_references, OmniPath_n_primary_sources, STRING_database_transferred, IntAct_interaction_method_imaging technique, STRING_experiments, OmniPath_consensus_inhibition, IntAct_publication_score, OmniPath_is_directed, IntAct_interaction_method_unknown, OmniPath_consensus_stimulation, STRING_textmining_transferred, OmniPath_is_stimulation, OmniPath_consensus_direction, IntAct_interaction_type_physical association, STRING_neighborhood, Reactome-FI_fi_score, IntAct_interaction_type_colocalization, IntAct_interaction_method_post transcriptional interference, STRING_experiments_transferred, IntAct_miscore, OmniPath_is_inhibition, STRING_combined_score, STRING_fusion, IntAct_interaction_method_biochemical, STRING_coexpression, STRING_cooccurence, 
IntAct_interaction_type_association, STRING_textmining, OmniPath_n_sources, IntAct_interaction_method_protein complementation assay, IntAct_interaction_type_score, STRING_homology, IntAct_interaction_method_score, IntAct_interaction_method_biophysical, IntAct_interaction_type_direct interaction, STRING_coexpression_transferred, STRING_database&quot;}]" data-columns="[{&quot;title&quot;: &quot;Metric&quot;, &quot;field&quot;: &quot;Metric&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true}, {&quot;title&quot;: &quot;Value&quot;, &quot;field&quot;: &quot;Value&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;80%&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>Now, we’ve gone from a relatively spartan set of vertex and edge
attributes to a comprehensive graph of human cellular physiology
enriched with detailed biological annotations that describe what each
vertex and edge represents. This is a robust foundation for training
expressive network-based methods like graph neural networks.</p>

<h2 id="summary">Summary</h2>

<p>The Octopus model integrates eight diverse pathway databases into a
unified, genome-scale network of human cellular physiology. Through
systematic merging of complementary data sources, the model establishes
a shared molecular vocabulary while preserving each source’s specialized
contributions:</p>

<ul>
  <li><strong>Molecular species integration</strong>: Proteins and other species are
effectively shared across sources, creating a common parts list that
specialized databases can extend with domain-specific molecules
(metabolites from Recon3D, complexes from Reactome).</li>
  <li><strong>Reaction specialization</strong>: Sources show modest but meaningful
overlap in reactions, with each database contributing unique
mechanisms that reflect its individual curation focus.</li>
</ul>

<p>The resulting NapistuGraph provides a framework for layering extensive
biological information onto network structures. Source provenance,
confidence scores, ontological classifications, and mechanistic
annotations can be systematically integrated as vertex and edge
attributes, enabling sophisticated analyses from network propagation to
machine learning approaches.</p>
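<p>To sketch what this attribute layering looks like in practice — a toy example with made-up attribute names, not the actual NapistuGraph API — per-edge metadata such as source provenance and confidence scores can ride along with the topology and be filtered before analysis:</p>

```python
# Toy illustration of attribute-layered edges (hypothetical names,
# NOT the NapistuGraph data structures).
edges = [
    {"from": "TP53", "to": "MDM2", "source": "Reactome", "confidence": 0.92},
    {"from": "TP53", "to": "ATM", "source": "STRING", "confidence": 0.41},
    {"from": "MDM2", "to": "UBB", "source": "IntAct", "confidence": 0.77},
]

def filter_edges(edges, min_confidence):
    """Keep only edges whose confidence score meets a threshold."""
    return [e for e in edges if e["confidence"] >= min_confidence]

high_conf = filter_edges(edges, 0.7)
print([(e["from"], e["to"]) for e in high_conf])
```

The same pattern extends naturally to filtering by source database or SBO term before running network propagation or training a model.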

<p>The Octopus model is now ready for use. I’m excited to build on this
foundation and to see how the community engages with this new resource.</p>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="genomics" /><category term="python" /><category term="networks" /><summary type="html"><![CDATA[Introducing the Octopus: Napistu’s eight-source Human Consensus Pathway Model that unites the breadth of protein-protein interaction networks with the depth of regulatory databases and metabolic models. The result is a genome-scale directed graph that is both densely connected and mechanistically precise. In this post, I will: Provide an overview of the Octopus model and its construction Show side-by-side summaries of individual data sources highlighting their complementarity Demonstrate that the model successfully merges results, creating a dense network covering the complete cellular repertoire of genes, metabolites, drugs, and complexes Illustrate how source-level information can be carried forward to the Octopus’s graphical network to augment its vertex and edge features]]></summary></entry><entry><title type="html">Building AI-Friendly Scientific Software: A Model Context Protocol Journey</title><link href="https://www.shackett.org/napistu_mcp/" rel="alternate" type="text/html" title="Building AI-Friendly Scientific Software: A Model Context Protocol Journey" /><published>2025-09-04T00:00:00+00:00</published><updated>2025-09-04T00:00:00+00:00</updated><id>https://www.shackett.org/napistu_mcp</id><content type="html" xml:base="https://www.shackett.org/napistu_mcp/"><![CDATA[<p>In this post, I walk through building a remote Model Context Protocol
(<em>MCP</em>) server that enhances AI agents’ ability to navigate and
contribute meaningfully to the complex
<a href="https://github.com/napistu/napistu">Napistu</a> scientific codebase.</p>

<p>This tool empowers new users, advanced contributors, and AI agents alike
to quickly access relevant project knowledge.</p>

<p>Before <em>MCP</em>, I fed Claude a mix of README files, wikis, and raw
code, hoping for useful answers. Tools like Cursor struggled with the
tangled structure, sparking the idea for the Napistu <em>MCP</em> server.</p>

<p>I’ll cover:</p>

<ul>
  <li>Why I built the Napistu <em>MCP</em> server and the problems it solves</li>
  <li>How I deployed it using GitHub Actions and Google Cloud Run</li>
  <li>Case studies showing how AI agents perform with — and without —
<em>MCP</em> context</li>
</ul>

<!--more-->

<h2 id="the-ai-development-paradox">The AI development paradox</h2>

<p>We’re at an interesting inflection point in software development. AI
both <strong>accelerates</strong> and <strong>hinders</strong> the creation of high-quality code.</p>

<h3 id="-acceleration">✅ Acceleration</h3>

<p>AI speeds up development by:</p>

<ul>
  <li>Handling repetitive tasks</li>
  <li>Lowering the barrier to entry</li>
  <li>Simplifying debugging</li>
</ul>

<p>Sometimes I hand an agent a stack of failing <code class="language-plaintext highlighter-rouge">pytest</code> errors and say,
“You handle this.” And it does. It’s pure magic.</p>

<h3 id="-friction">❌ Friction</h3>

<p>But AI also introduces chaos:</p>

<ul>
  <li>Repeats patterns instead of reusing code</li>
  <li>Misses domain-specific idioms</li>
  <li>Adds unnecessary abstractions</li>
  <li>Produces brittle, poorly-structured code</li>
</ul>

<p>This goes beyond simple messiness — it can rapidly escalate into a
technical debt time bomb.</p>

<h3 id="-key-insight">🎯 Key Insight</h3>

<p>AI isn’t inherently good or bad; its performance is <strong>task-dependent</strong>.
Most AI failures can be traced back to missing context: the model simply
wasn’t given the information it needed to succeed. If an AI agent
understands your domain, your design patterns, and your project
structure, it can generate excellent code. Without that context, it’s
flying blind — and that’s where most of the frustration comes from.</p>

<p>Many of us are seeking the balance point where AI maximizes
productivity today, while also working to raise that ceiling by
improving its performance on the tasks that matter most.</p>

<h2 id="information-is-everything">Information is everything</h2>

<p>Context is the central challenge in any domain-specific codebase — be
it a financial trading system, a game engine, or a scientific library.
Until AI agents understand the context of the codebase, they will
struggle to follow existing patterns and conventions, and they will
misuse domain-specific approaches.</p>

<p>Off-the-shelf approaches to AI integration are improving, either by
seamlessly integrating with external services (like Claude talking with
GitHub, Google Docs, etc.), or by directly interfacing with the codebase
itself (as tools like Cursor and Copilot do). But, as you’ll see, while
these options help, they still leave something to be desired.</p>

<h3 id="case-study-the-value-of-context">Case study: the value of context</h3>

<p>To highlight the value of context, I’ll use a real-world example. Say I
ask an agent to help me with the following question:</p>

<blockquote>
  <p>How do I create a consensus network from multiple pathway databases in
Napistu? Please create a single artifact with your initial thoughts.</p>
</blockquote>

<p>Let’s explore a few scenarios to see how this plays out.</p>

<h4 id="scenario-1-no-context">Scenario 1: no context</h4>

<p><em>Without any context</em>, Claude has no idea what I’m talking about and
starts Googling:</p>

<blockquote>
  <p>I’m not familiar with Napistu as a specific software tool or platform
for pathway database analysis. … Since I cannot locate specific
documentation for Napistu, I’ll create an artifact with general
guidance on creating consensus networks from multiple pathway
databases</p>
</blockquote>

<p>Claude suggests some <a href="https://claude.ai/public/artifacts/adc42e81-b7c9-4fba-bd1c-b59c5294de05">helpful
ideas</a>
— but says nothing about Napistu.</p>

<h4 id="scenario-2-some-context">Scenario 2: some context</h4>

<p><em>With relevant code context</em>, provided by pointing Claude to
relevant <code class="language-plaintext highlighter-rouge">.py</code> files
from GitHub, it responds:</p>

<blockquote>
  <p>Looking at the Napistu codebase, I can see this is a systems biology
toolkit for working with pathway models. Let me create a comprehensive
guide on how to create a consensus network from multiple pathway
databases.</p>
</blockquote>

<p>The <a href="https://claude.ai/public/artifacts/5d098345-d881-4122-b37d-91832dcaa72f">resulting
artifact</a>
highlights key classes and functions, organizing them into a clear,
orderly progression. Nonetheless, the response feels disjointed, as if
it pulled snippets from many sources without adequately synthesizing
them. Moreover, producing this response required me to provide specific
<code class="language-plaintext highlighter-rouge">.py</code> files to Claude because only ~15% of the <code class="language-plaintext highlighter-rouge">napistu-py</code> codebase
could fit into Claude’s context window.</p>

<h4 id="scenario-3-expert-knowledge">Scenario 3: expert knowledge</h4>

<p>If you asked an expert (me, 🙃) this question, the guidance would draw
from multiple information sources:</p>

<blockquote>
  <p>Start with the
<a href="https://github.com/napistu/napistu/blob/main/tutorials/merging_models_into_a_consensus.ipynb">merging_models_into_a_consensus</a>
tutorial — it provides a step-by-step walkthrough of this exact
workflow. Building consensus involves calling
<code class="language-plaintext highlighter-rouge">consensus.construct_consensus_model()</code> with multiple <code class="language-plaintext highlighter-rouge">SBML_dfs</code>
objects and a pathway index, which organizes the objects’ metadata.
This is currently being reworked in <a href="https://github.com/napistu/napistu-py/issues/169">Issue
169</a> to remove the
pathway index requirement. Finally, review the <a href="https://github.com/napistu/napistu/wiki/Consensus">consensus Napistu wiki
page</a> to gain a
high-level understanding of the key algorithms.</p>
</blockquote>

<p>This response demonstrates true understanding — it connects theory
(the algorithm), practice (the tutorial), current development (the
GitHub issue), and conceptual framework (the wiki) into actionable
guidance.</p>

<h4 id="providing-ai-agents-with-expert-knowledge">Providing AI agents with expert knowledge</h4>

<p>For an AI agent to match the expert response, we would need to do more
than just expand its context window; we would need to address two key
challenges:</p>

<ol>
  <li><strong>information fragmentation</strong> - Relevant information is scattered
across multiple sources such as code repositories, wikis, issue
trackers, tutorials, README files, and more. This dispersion makes
providing relevant information to agents a cumbersome and often
manual process.</li>
  <li><strong>signal vs. noise</strong> - Critical context can easily be obscured by
large volumes of irrelevant or low-priority information, making it
challenging for AI agents to identify what truly matters.</li>
</ol>

<h3 id="solution-preview-what-if-an-ai-could-retrieve-domain-specific-information-on-demand">Solution preview: What if an AI could retrieve domain-specific information on demand?</h3>

<p>Before diving into how to provide agent-friendly information, let me
first show you the results of that effort in Napistu.</p>

<p>First, I’ll install Napistu with <em>MCP</em> dependencies enabled.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="s1">'napistu[mcp]'</span>
</code></pre></div></div>

<p>Then, I can configure the remote documentation server’s URL and port
with <code class="language-plaintext highlighter-rouge">production_client_config</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">HTML</span><span class="p">,</span> <span class="n">display</span>
<span class="kn">from</span> <span class="nn">napistu.mcp.config</span> <span class="kn">import</span> <span class="n">production_client_config</span>
<span class="n">config</span> <span class="o">=</span> <span class="n">production_client_config</span><span class="p">()</span>

<span class="n">display</span><span class="p">(</span><span class="n">HTML</span><span class="p">(</span><span class="sa">f</span><span class="s">"""
    &lt;div&gt;
    &lt;b&gt;Client config:&lt;/b&gt;&lt;br&gt;
    &lt;b&gt;Host:&lt;/b&gt; </span><span class="si">{</span><span class="n">config</span><span class="p">.</span><span class="n">host</span><span class="si">}</span><span class="s">&lt;br&gt;
    &lt;b&gt;Port:&lt;/b&gt; </span><span class="si">{</span><span class="n">config</span><span class="p">.</span><span class="n">port</span><span class="si">}</span><span class="s">&lt;br&gt;&lt;br&gt;
    &lt;/div&gt;
    """</span><span class="p">))</span>
</code></pre></div></div>

<div>
<b>Client config:</b><br />
<b>Host:</b> napistu-mcp-server-844820030839.us-west1.run.app<br />
<b>Port:</b> 443<br /><br />
</div>

<p>Since this is a remote server, I can now start interacting with it
directly. I’ll pose the consensus modeling question again and then
reformat the AI-friendly JSON output as human-readable tables.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">html</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">napistu.mcp.client</span> <span class="kn">import</span> <span class="n">search_component</span>
<span class="kn">from</span> <span class="nn">shackett_utils.utils</span> <span class="kn">import</span> <span class="n">pd_utils</span>
<span class="kn">from</span> <span class="nn">shackett_utils.blog.html_utils</span> <span class="kn">import</span> <span class="n">display_tabulator</span>

<span class="k">def</span> <span class="nf">sanitize_content</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
        <span class="k">return</span> <span class="s">""</span>
    <span class="c1"># Remove/replace problematic characters
</span>    <span class="n">text</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'[^\w\s\-.,!?():]'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>  <span class="c1"># Keep only basic chars
</span>    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'\s+'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>  <span class="c1"># Normalize whitespace
</span>    <span class="k">return</span> <span class="n">html</span><span class="p">.</span><span class="n">escape</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>  <span class="c1"># HTML escape
</span>
<span class="n">QUERY</span> <span class="o">=</span> <span class="s">"How do I create a consensus network from multiple pathway databases in Napistu?"</span>
<span class="n">COMPONENTS</span> <span class="o">=</span> <span class="p">[</span><span class="s">"codebase"</span><span class="p">,</span> <span class="s">"documentation"</span><span class="p">,</span> <span class="s">"tutorials"</span><span class="p">]</span>

<span class="c1"># Returns actual Napistu function signatures, docs, and usage examples
</span>
<span class="n">combined_results</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">component</span> <span class="ow">in</span> <span class="n">COMPONENTS</span><span class="p">:</span>
    <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">search_component</span><span class="p">(</span>
        <span class="n">component</span><span class="p">,</span>
        <span class="n">QUERY</span><span class="p">,</span>
        <span class="n">config</span><span class="o">=</span><span class="n">config</span>
    <span class="p">)</span>
    <span class="n">results_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">results</span><span class="p">[</span><span class="s">"results"</span><span class="p">]).</span><span class="n">assign</span><span class="p">(</span><span class="n">component</span><span class="o">=</span><span class="n">component</span><span class="p">)</span>
    
    <span class="n">combined_results</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">results_df</span><span class="p">)</span>

<span class="n">combined_results_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">combined_results</span><span class="p">).</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">"similarity_score"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)[[</span><span class="s">"component"</span><span class="p">,</span>  <span class="s">"similarity_score"</span><span class="p">,</span> <span class="s">"source"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">]]</span>

<span class="n">display_combined_results_df</span> <span class="o">=</span> <span class="n">combined_results_df</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">pd_utils</span><span class="p">.</span><span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">display_combined_results_df</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">pd_utils</span><span class="p">.</span><span class="n">format_character_columns</span><span class="p">(</span><span class="n">display_combined_results_df</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">display_combined_results_df</span><span class="p">[</span><span class="s">'content'</span><span class="p">]</span> <span class="o">=</span> <span class="n">display_combined_results_df</span><span class="p">[</span><span class="s">'content'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">sanitize_content</span><span class="p">)</span>
<span class="n">display_combined_results_df</span><span class="p">[</span><span class="s">'source'</span><span class="p">]</span> <span class="o">=</span> <span class="n">display_combined_results_df</span><span class="p">[</span><span class="s">'source'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">sanitize_content</span><span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">display_combined_results_df</span><span class="p">,</span>
    <span class="n">caption</span><span class="o">=</span><span class="s">"Top search results by cosine similarity"</span><span class="p">,</span>
    <span class="n">wrap_columns</span><span class="o">=</span><span class="p">[</span><span class="s">"source"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">],</span>
    <span class="n">column_widths</span><span class="o">=</span><span class="p">{</span><span class="s">"source"</span> <span class="p">:</span> <span class="s">"25%"</span><span class="p">,</span> <span class="s">"content"</span> <span class="p">:</span> <span class="s">"50%"</span><span class="p">},</span>
    <span class="n">include_index</span> <span class="o">=</span> <span class="bp">False</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Top search results by cosine similarity
</figcaption>

<div class="data-table" style="" data-table="[{&quot;component&quot;: &quot;documentation&quot;, &quot;similarity_score&quot;: &quot;0.714&quot;, &quot;source&quot;: &quot;readme: napistu (part 5)&quot;, &quot;content&quot;: &quot; Tutorials These tutorials are intended as stand-alone demonstrations of Napistu s core functionality. Most exampl...&quot;}, {&quot;component&quot;: &quot;documentation&quot;, &quot;similarity_score&quot;: &quot;0.685&quot;, &quot;source&quot;: &quot;readme: napistu (part 1)&quot;, &quot;content&quot;: &quot; Napistu The Napistu project is an approach for creating and working with genome-scale mechanistic networks. Pathwa...&quot;}, {&quot;component&quot;: &quot;codebase&quot;, &quot;similarity_score&quot;: &quot;0.623&quot;, &quot;source&quot;: &quot;functions: napistu.consensus.prepare_consensus_model&quot;, &quot;content&quot;: &quot;napistu.consensus.napistu.cons ensus.prepare_consensus_model(sbml_dfs_list:list SBML_dfs ) tuple dict str,SBML_dfs ,PW...&quot;}, {&quot;component&quot;: &quot;codebase&quot;, &quot;similarity_score&quot;: &quot;0.619&quot;, &quot;source&quot;: &quot;functions: napistu.consensus.construct_consensus_model&quot;, &quot;content&quot;: &quot;napistu.consensus.napistu.cons ensus.construct_consensus_model(sbml_dfs_dict:dict str,SBML_dfs ,pw_index:PWIndex,model...&quot;}, {&quot;component&quot;: &quot;tutorials&quot;, &quot;similarity_score&quot;: &quot;0.608&quot;, &quot;source&quot;: &quot;tutorials: merging_models_into_a_consensus (part 1)&quot;, &quot;content&quot;: &quot;--- title: Tutorial - Merging Networks into a Consensus author: Shackett date: May 9th 2025 --- This notebook ...&quot;}, {&quot;component&quot;: &quot;tutorials&quot;, &quot;similarity_score&quot;: &quot;0.591&quot;, &quot;source&quot;: &quot;tutorials: creating_a_napistu_graph (part 2)&quot;, &quot;content&quot;: &quot; Load an sbml_dfs pathway representation A sbml_dfs , further described in the understanding_sbml_dfs.qmd vig...&quot;}, 
{&quot;component&quot;: &quot;documentation&quot;, &quot;similarity_score&quot;: &quot;0.580&quot;, &quot;source&quot;: &quot;readme: napistu (part 2)&quot;, &quot;content&quot;: &quot;- Represent a range of publicly-available data sources using a common data structure, sbml_dfs , which is meant to f...&quot;}, {&quot;component&quot;: &quot;tutorials&quot;, &quot;similarity_score&quot;: &quot;0.574&quot;, &quot;source&quot;: &quot;tutorials: working_with_genome_scale_networks (part 1)&quot;, &quot;content&quot;: &quot;--- title: Tutorial - Working with Genome-Scale Networks author: Shackett date: May 9th 2025 --- --- pytho...&quot;}, {&quot;component&quot;: &quot;documentation&quot;, &quot;similarity_score&quot;: &quot;0.574&quot;, &quot;source&quot;: &quot;wiki: Data-Sources (part 4)&quot;, &quot;content&quot;: &quot; Formats used in Napistu: Reactome s results are shared in multiple formats with the major data sources being path...&quot;}, {&quot;component&quot;: &quot;documentation&quot;, &quot;similarity_score&quot;: &quot;0.569&quot;, &quot;source&quot;: &quot;wiki: Exploring-Molecular-Relationships-as-Networks (part 1)&quot;, &quot;content&quot;: &quot;Napistu s molecular graphs let us answer biological questions using classical approaches in network analysis. 
This is...&quot;}, {&quot;component&quot;: &quot;codebase&quot;, &quot;similarity_score&quot;: &quot;0.561&quot;, &quot;source&quot;: &quot;functions: napistu.network.net_create.create_napistu_graph&quot;, &quot;content&quot;: &quot;napistu.network.net_create.nap istu.network.net_create.create_napistu_graph(sbml_dfs:SBML_dfs,directed:bool True,wirin...&quot;}, {&quot;component&quot;: &quot;tutorials&quot;, &quot;similarity_score&quot;: &quot;0.552&quot;, &quot;source&quot;: &quot;tutorials: downloading_pathway_data (part 23)&quot;, &quot;content&quot;: &quot;INFO:napistu.consensus:Merging reactions identifiers INFO:napistu.consensus:Merging reactions sources INFO:napistu.co...&quot;}, {&quot;component&quot;: &quot;codebase&quot;, &quot;similarity_score&quot;: &quot;0.545&quot;, &quot;source&quot;: &quot;functions: napistu.consensus._build_consensus_identifiers&quot;, &quot;content&quot;: &quot;napistu.consensus.napistu.cons ensus._build_consensus_identifiers(sbml_df:DataFrame,table_schema:dict,defining_biologi...&quot;}, {&quot;component&quot;: &quot;tutorials&quot;, &quot;similarity_score&quot;: &quot;0.544&quot;, &quot;source&quot;: &quot;tutorials: merging_models_into_a_consensus (part 5)&quot;, &quot;content&quot;: &quot;INFO:napistu.consensus:Creatin g source table INFO:napistu.consensus:Aggregating old sources INFO:napistu.consensus:Re...&quot;}, {&quot;component&quot;: &quot;codebase&quot;, &quot;similarity_score&quot;: &quot;0.537&quot;, &quot;source&quot;: &quot;functions: napistu.network.net_create.process_napistu_graph&quot;, &quot;content&quot;: &quot;napistu.network.net_create.nap istu.network.net_create.process_napistu_graph(sbml_dfs:SBML_dfs,directed:bool True,wiri...&quot;}]" data-columns="[{&quot;title&quot;: &quot;component&quot;, &quot;field&quot;: &quot;component&quot;}, {&quot;title&quot;: &quot;similarity_score&quot;, &quot;field&quot;: &quot;similarity_score&quot;}, {&quot;title&quot;: &quot;source&quot;, &quot;field&quot;: 
&quot;source&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;25%&quot;}, {&quot;title&quot;: &quot;content&quot;, &quot;field&quot;: &quot;content&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>Because the top result is Markdown, I’ll display it as a blockquote.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">Markdown</span>

<span class="k">def</span> <span class="nf">quote_markdown</span><span class="p">(</span><span class="n">markdown_content</span><span class="p">):</span>
    
    <span class="n">suppressed_headings</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">"^#+ (.*)$"</span><span class="p">,</span> <span class="sa">r</span><span class="s">"**\1**"</span><span class="p">,</span> <span class="n">markdown_content</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="n">re</span><span class="p">.</span><span class="n">MULTILINE</span><span class="p">)</span>
    <span class="n">blockquoted</span> <span class="o">=</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s">"&gt; </span><span class="si">{</span><span class="n">line</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">suppressed_headings</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">))</span>

    <span class="k">return</span> <span class="n">blockquoted</span>

<span class="c1"># Add blockquote formatting
</span><span class="n">markdown_content</span> <span class="o">=</span> <span class="n">combined_results_df</span><span class="p">[</span><span class="s">"content"</span><span class="p">].</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">display</span><span class="p">(</span><span class="n">Markdown</span><span class="p">(</span><span class="n">quote_markdown</span><span class="p">(</span><span class="n">markdown_content</span><span class="p">)))</span>
</code></pre></div></div>

<blockquote>
  <p><strong>Tutorials</strong></p>

  <p>These tutorials are intended as stand-alone demonstrations of
Napistu’s core functionality. Most examples will focus on small
pathways so that results can easily be reproduced by users.</p>

  <ul>
    <li>Downloading pathway data</li>
    <li>Understanding the <code class="language-plaintext highlighter-rouge">sbml_dfs</code> format</li>
    <li>Merging networks with the <code class="language-plaintext highlighter-rouge">consensus</code> module</li>
    <li>Using the CPR Command Line Interface (CLI)</li>
    <li>Formatting <code class="language-plaintext highlighter-rouge">sbml_dfs</code> as <code class="language-plaintext highlighter-rouge">napistu_graph</code> networks</li>
    <li>Suggesting mechanisms with network approaches</li>
    <li>Adding molecule- and reaction-level information to graphs</li>
    <li>R-based network visualization</li>
  </ul>
</blockquote>

<p>Much of the information that the expert provided is returned in this
initial query. However, the goal is not to deliver <strong>all</strong> relevant
information at once because this would inevitably include a significant
amount of irrelevant data. Rather, we can use tools like
<code class="language-plaintext highlighter-rouge">search_component</code> to give agents agency, putting information at the
tips of their virtual fingers. This allows agents to nimbly explore a
problem, drawing on relevant resources on demand. As a result, rather
than generating generic or hallucinated responses, agents can uncover
actual patterns, locate pertinent tutorials, and gain a deeper
understanding of our domain-specific approaches.</p>

<h3 id="enter-the-model-context-protocol-mcp">Enter the Model Context Protocol (<em>MCP</em>)</h3>

<p><em>MCP</em> provides a standardized way for AI models to access external
information sources. Think of <em>MCP</em> as giving AI agents a research
assistant who knows your project inside and out — someone who can
instantly locate relevant documentation, code examples, and
implementation patterns specific to your domain.</p>

<p>AI agents can interact with <em>MCP</em> servers through two primary
mechanisms: <strong>tools</strong> and <strong>resources</strong>. Think of <strong>resources</strong> as
reference materials agents can read (like a library catalog), and
<strong>tools</strong> as actions they can execute (like asking a librarian to
retrieve specific materials).</p>

<p>Let’s look at how this works in practice with the Napistu <em>MCP</em> server:</p>

<p><strong>Tools</strong> enable agents to perform actions and searches:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">search_documentation()</code> - Find relevant project docs and issues</li>
  <li><code class="language-plaintext highlighter-rouge">search_codebase()</code> - Discover functions, classes, and methods</li>
  <li><code class="language-plaintext highlighter-rouge">search_tutorials()</code> - Locate implementation examples</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">search_component</code> function used in the solution preview above
indirectly uses a tool by calling the lower-level function
<code class="language-plaintext highlighter-rouge">call_server_tools</code>. <code class="language-plaintext highlighter-rouge">call_server_tools</code> in turn calls the actual
<em>FastMCP</em> Client method <code class="language-plaintext highlighter-rouge">call_tool</code>. This method accepts a tool name and
arguments, returning structured results.</p>
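<p>The layering can be sketched with stand-ins for the client machinery
(illustrative only; the real signatures live in
<code class="language-plaintext highlighter-rouge">napistu.mcp.client</code> and speak <em>MCP</em> over HTTP):</p>

```python
import asyncio

# Stand-in for the FastMCP Client; the real one speaks MCP over HTTP.
class StubMCPClient:
    async def call_tool(self, name: str, arguments: dict) -> dict:
        # A live server would dispatch to the registered tool here.
        return {"tool": name, "results": [f"match for {arguments['query']!r}"]}

async def call_server_tools(client, tool_name: str, arguments: dict) -> dict:
    """Thin wrapper over the client's call_tool method."""
    return await client.call_tool(tool_name, arguments)

async def search_component(client, component: str, query: str) -> dict:
    """Convenience layer: map a component name onto its search tool."""
    return await call_server_tools(client, f"search_{component}", {"query": query})

result = asyncio.run(search_component(StubMCPClient(), "documentation", "consensus"))
print(result["tool"])  # search_documentation
```

<p>The point of the indirection is that each layer stays testable on its
own: the convenience function only decides <em>which</em> tool to call, while
the wrapper and client handle transport.</p>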

<p><strong>Resources</strong> in Napistu <em>MCP</em> provide read-only access to structured
information:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">napistu://health</code> - Server status and component health</li>
  <li><code class="language-plaintext highlighter-rouge">napistu://documentation/summary</code> - Overview of available
documentation</li>
  <li><code class="language-plaintext highlighter-rouge">napistu://tutorials/index</code> - Available tutorial content</li>
</ul>

<p>To call a resource endpoint like <code class="language-plaintext highlighter-rouge">napistu://documentation/summary</code> in
Python, we can use the <code class="language-plaintext highlighter-rouge">read_server_resource</code> function, which calls the
<em>FastMCP</em> Client method <code class="language-plaintext highlighter-rouge">read_resource</code>. This method takes a resource
URI and returns the contents of the resource.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">napistu.mcp.client</span> <span class="kn">import</span> <span class="n">read_server_resource</span>

<span class="n">content</span> <span class="o">=</span> <span class="k">await</span> <span class="n">read_server_resource</span><span class="p">(</span><span class="s">"napistu://documentation/summary"</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</code></pre></div></div>

<pre><code class="language-output">    {
      "readme_files": [
        "napistu",
        "napistu-py",
        "napistu-r",
        "napistu/tutorials"
      ],
      "issues": [
        "napistu",
        "napistu-py",
        "napistu-r"
      ],
      "prs": [
        "napistu",
        "napistu-py",
        "napistu-r"
      ],
      "wiki_pages": [
        "Environment-Setup",
        "Data-Sources",
        "Napistu-Graphs",
        "Model-Context-Protocol-(MCP)-server",
        "SBML-DFs",
        "SBML",
        "Dev-Zone",
        "Exploring-Molecular-Relationships-as-Networks",
        "Precomputed-distances",
        "GitHub-Actions-napistu‐py",
        "Consensus",
        "History"
      ],
      "packagedown_sections": []
    }
</code></pre>

<p>This architecture solves our earlier problem; instead of manually
curating context, AI agents can dynamically discover and retrieve
exactly the information they need.</p>

<h1 id="anatomy-of-the-napistu-mcp-server">Anatomy of the Napistu <em>MCP</em> server</h1>

<p>Before diving into the technical implementation, it’s worth
understanding why I built this system. The Napistu <em>MCP</em> server serves
three key purposes:</p>

<ol>
  <li>Dramatically lowers the barrier to entry for new users who struggle
with the “cold start” problem</li>
  <li>Democratizes domain expertise to encourage broader community
contributions</li>
  <li>Gives core developers’ AI agents comprehensive project knowledge to
efficiently extend the codebase</li>
</ol>

<p>These objectives directly address the information fragmentation and
context limitations we identified earlier. With this motivation in mind,
I’ll provide an overview of the server’s architecture.</p>

<h2 id="fastmcp-foundation">FastMCP foundation</h2>

<p>The Model Context Protocol provides a standard way for AI models to
access external information.
<a href="https://github.com/jlowin/fastmcp"><em>FastMCP</em></a> provides a Flask-like
Python implementation of the <em>MCP</em> protocol.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fastmcp</span> <span class="kn">import</span> <span class="n">FastMCP</span>

<span class="n">mcp</span> <span class="o">=</span> <span class="n">FastMCP</span><span class="p">(</span><span class="s">"napistu-server"</span><span class="p">)</span>

<span class="o">@</span><span class="n">mcp</span><span class="p">.</span><span class="n">resource</span><span class="p">(</span><span class="s">"napistu://health"</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">health_check</span><span class="p">():</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"status"</span><span class="p">:</span> <span class="s">"healthy"</span><span class="p">,</span> <span class="s">"components"</span><span class="p">:</span> <span class="p">[...]}</span>

<span class="o">@</span><span class="n">mcp</span><span class="p">.</span><span class="n">tool</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">search_documentation</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"results"</span><span class="p">:</span> <span class="p">[...]}</span>
</code></pre></div></div>

<p><em>FastMCP</em> manages the protocol details, while we focus on exposing
Napistu’s knowledge.</p>

<p>The <code class="language-plaintext highlighter-rouge">server.py</code> module orchestrates the entire lifecycle through a
simple three-step process:</p>

<ol>
  <li>Create a <em>FastMCP</em> server instance with the validated host, port,
and server name from a standard, or manually defined, configuration.</li>
  <li>Based on the selected profile (execution, docs, or full),
<strong>register</strong> the enabled components with the server; each component
adds its own resources and tools to the endpoint registry.</li>
  <li>Asynchronously <strong>initialize</strong> all registered components, loading
their data sources and setting up semantic search indexing in
parallel.</li>
</ol>

<p>Once this process completes, the server starts listening for incoming
<em>MCP</em> requests.</p>
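<p>As a schematic of those three steps (component and profile names here
are illustrative stubs, not the actual
<code class="language-plaintext highlighter-rouge">server.py</code> code):</p>

```python
import asyncio

# Hypothetical profile registry mirroring the execution/docs/full profiles.
PROFILES = {
    "docs": ["documentation", "codebase", "tutorials"],
    "full": ["documentation", "codebase", "tutorials", "execution"],
}

class StubComponent:
    def __init__(self, name: str):
        self.name = name
        self.ready = False

    def register(self, server: dict) -> None:
        # Each component adds its own endpoints to the registry.
        server.setdefault("endpoints", []).append(f"search_{self.name}")

    async def initialize(self) -> bool:
        await asyncio.sleep(0)  # stands in for loading data + building indexes
        self.ready = True
        return True

async def create_server(profile: str) -> dict:
    server = {"name": "napistu-server"}                          # 1. create
    components = [StubComponent(n) for n in PROFILES[profile]]
    for c in components:
        c.register(server)                                       # 2. register
    await asyncio.gather(*(c.initialize() for c in components))  # 3. initialize in parallel
    server["healthy"] = all(c.ready for c in components)
    return server

server = asyncio.run(create_server("docs"))
print(server["endpoints"])  # ['search_documentation', 'search_codebase', 'search_tutorials']
```

<p>Registering before initializing means the endpoint registry is fixed
up front, while the slow work (fetching data, building search indexes)
happens concurrently.</p>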

<h2 id="components">Components</h2>

<p>Napistu employs a component-based architecture that ensures separation
of concerns — each component manages its own data sources and search
logic. This design supports graceful degradation; for instance, a failed
GitHub API call won’t disrupt tutorial searches. It also enables
flexible deployment, allowing the activation of only the necessary
components. This modularity lets me create servers tailored to specific
use cases—for example, a local server capable of executing Napistu
code or a remote server focused solely on documentation.</p>

<p>The current components are:</p>

<ul>
  <li><strong>Documentation</strong>: READMEs, wiki pages, GitHub issues and PRs</li>
  <li><strong>Codebase</strong>: API documentation and function signatures sourced from
Read The Docs</li>
  <li><strong>Tutorials</strong>: Jupyter notebooks converted into searchable Markdown</li>
  <li><strong>Execution</strong> (<em>in development</em>): Interaction with a live Python
environment</li>
  <li><strong>Health</strong>: Server monitoring and diagnostics for system components</li>
</ul>

<p>Each component follows a consistent pattern: load data, register
endpoints, and handle search.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DocumentationComponent</span><span class="p">(</span><span class="n">MCPComponent</span><span class="p">):</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">semantic_search</span><span class="p">:</span> <span class="n">SemanticSearch</span> <span class="o">=</span> <span class="bp">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Load READMEs, wiki pages, GitHub issues"""</span>
        <span class="c1"># Load external data and populate component state
</span>        <span class="k">return</span> <span class="n">success</span>
    
    <span class="k">def</span> <span class="nf">register</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mcp</span><span class="p">:</span> <span class="n">FastMCP</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Register resources and tools with MCP server"""</span>
        <span class="o">@</span><span class="n">mcp</span><span class="p">.</span><span class="n">tool</span><span class="p">()</span>
        <span class="k">async</span> <span class="k">def</span> <span class="nf">search_documentation</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">state</span><span class="p">.</span><span class="n">semantic_search</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="s">"documentation"</span><span class="p">)</span>
</code></pre></div></div>

<div class="content-section ai-aside">
  <div class="section-content">
    <p>It’s important to write detailed
AI-first docstrings for <em>MCP</em> resources and tools. This information is
available to most agents before they interact with the server’s
endpoints, so it’s helpful to clarify <strong>when</strong> and <strong>when NOT</strong> to use
the method. While all-caps and bold sections may seem a bit obnoxious to
human readers, they do effectively draw an agent’s attention.</p>

<p>For example, here is part of the docstring for the <code class="language-plaintext highlighter-rouge">search_codebase</code>
tool:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>**USE THIS WHEN:**
- Looking for specific Napistu functions, classes, or modules
- Finding API documentation for Napistu features

**DO NOT USE FOR:**
- General programming concepts not specific to Napistu
- Documentation for other libraries or frameworks
</code></pre></div></div>
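<p>As a sketch, such a docstring sits directly on the tool function,
where the <em>MCP</em> client surfaces it to agents before any call is made
(the tool body here is a stub; in the real server the function would
carry the <code class="language-plaintext highlighter-rouge">@mcp.tool()</code> decorator):</p>

```python
def search_codebase(query: str) -> dict:
    """Search Napistu's API documentation and function signatures.

    **USE THIS WHEN:**
    - Looking for specific Napistu functions, classes, or modules

    **DO NOT USE FOR:**
    - General programming concepts not specific to Napistu
    """
    return {"results": []}  # stub body; only the docstring style matters here

# Agents read __doc__ before deciding whether to invoke the tool.
print("USE THIS WHEN" in search_codebase.__doc__)  # True
```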


  </div>
</div>

<h2 id="smart-search-semantic--vector-embeddings">Smart search: semantic + vector embeddings</h2>

<p>We support two search methods: exact keyword search (e.g.,
“create_consensus”) and semantic search (e.g., “How do I merge pathway
data?”). Semantic search is powered by a <code class="language-plaintext highlighter-rouge">SemanticSearch</code> object used
across components.</p>

<p>The pipeline includes:</p>

<ul>
  <li><strong>Content Processing</strong>: Load content and chunk long documents at
natural boundaries</li>
  <li><strong>Embedding Generation</strong>: Convert chunks to 384-dimensional vectors
using all-MiniLM-L6-v2 sentence transformer, selected for its ease
of implementation and effectiveness with general text</li>
  <li><strong>Vector Storage</strong>: Store embeddings in ChromaDB, along with
metadata to support fast similarity search</li>
  <li><strong>Query Processing</strong>: Embed user queries and find nearest neighbors
using cosine similarity</li>
</ul>
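<p>At its core, the query step is a nearest-neighbor lookup by cosine
similarity. A toy sketch with made-up three-dimensional vectors (real
chunks are embedded into 384 dimensions by the sentence transformer and
stored in ChromaDB):</p>

```python
import math

# Cosine similarity between two equal-length vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented embeddings for three content chunks.
chunks = {
    "merging networks with consensus": [0.9, 0.1, 0.2],
    "installing napistu": [0.1, 0.8, 0.3],
    "graph visualization in R": [0.2, 0.3, 0.9],
}

# Invented embedding of the query "How do I merge pathway data?"
query_vec = [0.85, 0.15, 0.25]
ranked = sorted(chunks, key=lambda doc: cosine(query_vec, chunks[doc]), reverse=True)
print(ranked[0])  # merging networks with consensus
```

<p>This is why semantic search handles synonyms: “merge pathway data”
lands near “consensus” in embedding space even though the keyword never
appears in the query.</p>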

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SemanticSearch</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">persist_directory</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./chroma_db"</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">client</span> <span class="o">=</span> <span class="n">chromadb</span><span class="p">.</span><span class="n">PersistentClient</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="n">persist_directory</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">embedding_function</span> <span class="o">=</span> <span class="n">SentenceTransformerEmbeddingFunction</span><span class="p">(</span>
            <span class="n">model_name</span><span class="o">=</span><span class="s">"all-MiniLM-L6-v2"</span>
        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">search</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">collection_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="c1"># Convert query to vector, find similar content by cosine similarity
</span>        <span class="k">return</span> <span class="n">similarity_results_with_scores</span>
</code></pre></div></div>

<div class="content-section ai-aside">
  <div class="section-content">
    <p>An early version of the server used
keyword-based search to comb through all of the cached information.
Switching to vector-based search massively improved result quality, but
it required me to implement several new features. To approach this
problem, I worked with Claude to research different approaches,
balancing projected performance against ease of implementation and
maintainability. Since Claude was performing well, I
chose to stick with it for implementing the semantic search
functionality instead of switching to Cursor. This worked well because I
could provide the entire <code class="language-plaintext highlighter-rouge">napistu.mcp</code> subpackage as context, and the
codebase was already well-structured. The system introduced unnecessary
complexity in a few areas — such as managing separate component-level
ChromaDB databases rather than a unified centralized database — but
overall, the implementation proceeded efficiently, and I had the
functionality up and running within a few hours.</p>

<p>While I maintain strict oversight of agents contributing to the
scientific portions of the Napistu codebase, I allow greater autonomy
for agents working on the development of the <code class="language-plaintext highlighter-rouge">napistu.mcp</code> subpackage.
To do this, I focus more on code review to validate the AI’s assumptions
(e.g., “do we really need to assign global variables?”), and to
suggest refactoring (e.g., “would creating a <code class="language-plaintext highlighter-rouge">ServerProfile</code> class
simplify component configuration?”). After a session of implementing
features in Claude, I’ve leveraged it to update the <a href="https://github.com/napistu/napistu/wiki/Model-Context-Protocol-(MCP)-server">Napistu MCP server
wiki</a>
with some additional guidance (e.g., “shorten 4-fold, remove this
section”). Maintaining this high-level documentation, accessible via
<em>MCP</em>, effectively helps agents “save their place” for future
development sessions.</p>

  </div>
</div>

<h2 id="the-agent-experience">The agent experience</h2>

<p>With the server architecture in place, let’s explore how this translates
to the actual user experience for both humans and AI agents. The <em>MCP</em>
protocol uses structured JSON messages over HTTP, with one key
advantage: both humans and AI agents interact through the same unified
interface.</p>

<p><strong>Human developers</strong> engage directly using the Napistu client utilities
or the <em>MCP</em> command line interface.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">napistu.mcp.client</span> <span class="kn">import</span> <span class="n">search_component</span>
<span class="kn">from</span> <span class="nn">napistu.mcp.config</span> <span class="kn">import</span> <span class="n">production_client_config</span>

<span class="n">config</span> <span class="o">=</span> <span class="n">production_client_config</span><span class="p">()</span>
<span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">search_component</span><span class="p">(</span><span class="s">"documentation"</span><span class="p">,</span> <span class="s">"how to install Napistu"</span><span class="p">,</span> <span class="n">config</span><span class="o">=</span><span class="n">config</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>AI agents</strong> (such as Claude, Cursor, or any <em>MCP</em>-compatible tool)
send equivalent requests via their respective <em>MCP</em> client
implementations.</p>

<p>For example, when an agent asks “How do I create consensus networks in
Napistu?”, it automatically:</p>

<ol>
  <li>Calls the <code class="language-plaintext highlighter-rouge">search_tutorials</code> tool with the query</li>
  <li>Receives structured results with similarity scores and content
snippets</li>
  <li>May follow up with <code class="language-plaintext highlighter-rouge">search_documentation</code> or <code class="language-plaintext highlighter-rouge">search_codebase</code> for
additional context</li>
  <li>Uses all this information to provide comprehensive, accurate
guidance.</li>
</ol>
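<p>The retrieval loop above can be sketched with stubbed tool results
(no server involved; the scores and snippets are invented):</p>

```python
# Stubbed search tools returning ranked snippets, as the MCP tools would.
def search_tutorials(query: str) -> list:
    return [{"source": "tutorials", "score": 0.91, "snippet": "consensus tutorial ..."}]

def search_documentation(query: str) -> list:
    return [{"source": "documentation", "score": 0.84, "snippet": "consensus wiki ..."}]

def gather_context(query: str, threshold: float = 0.5) -> list:
    hits = search_tutorials(query)        # 1. primary tool call
    hits += search_documentation(query)   # 2. follow-up for additional context
    hits = [h for h in hits if h["score"] >= threshold]
    hits.sort(key=lambda h: h["score"], reverse=True)
    # 3. the agent would now synthesize an answer grounded in these snippets
    return hits

top = gather_context("How do I create consensus networks in Napistu?")[0]
print(top["source"])  # tutorials
```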

<p>The key insight is that agents receive the same rich, structured
responses as human developers but can instantly process and integrate
information across multiple sources. This transforms a simple Q&amp;A
interaction into an expert-level consultation.</p>

<h2 id="from-local-to-global-deployment-story">From local to global: deployment story</h2>

<h3 id="local-development">Local development</h3>

<p>It’s easy to set up a local <em>MCP</em> server that digests relevant documents
and interacts with local agents.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install Napistu with MCP dependencies</span>
pip <span class="nb">install</span> <span class="s1">'napistu[mcp]'</span>

<span class="c"># Start full development server (all components)</span>
python <span class="nt">-m</span> napistu.mcp server full

<span class="c"># Health check shows component loading</span>
python <span class="nt">-m</span> napistu.mcp health <span class="nt">--local</span>
</code></pre></div></div>

<pre><code class="language-output">🏥 Napistu MCP Server Health Check
Server URL: http://127.0.0.1:8765/mcp

Server Status: healthy

Components:
  ✅ documentation: healthy
  ✅ codebase: healthy
  ✅ tutorials: healthy
  ✅ semantic_search: healthy
</code></pre>

<p>This local approach works well for individual developers, but creates
barriers for broader adoption. It requires installing Napistu,
maintaining a background process, and keeping it running — imposing a
significant burden on users who simply want to explore the project or
collaborate.</p>

<h3 id="the-always-up-solution">The always-up solution</h3>

<p>Instead, I aimed to create an always-available service that I and
others can access easily, without any local setup. This meant deploying
the server to the cloud with automatic updates triggered by changes in
the codebase, integrated seamlessly into my <a href="https://github.com/napistu/napistu/wiki/GitHub-Actions-napistu%E2%80%90py">GitHub Actions CI/CD
workflows</a>.</p>

<p>Every tagged release triggers deployment to Google Cloud Run.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Deploy workflow - simplified view</span>
<span class="na">on</span><span class="pi">:</span>
  <span class="na">workflow_run</span><span class="pi">:</span>
    <span class="na">workflows</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">Release"</span><span class="pi">]</span>  <span class="c1"># Auto-deploy after successful release</span>
    <span class="na">types</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">completed</span><span class="pi">]</span>
  <span class="na">schedule</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">cron</span><span class="pi">:</span> <span class="s1">'</span><span class="s">0</span><span class="nv"> </span><span class="s">10</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*'</span>  <span class="c1"># Daily content refresh at 2 AM PST (10 AM UTC)</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">deploy</span><span class="pi">:</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Deploy to Cloud Run</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">gcloud run deploy napistu-mcp-server \</span>
            <span class="s">--image="us-west1-docker.pkg.dev/.../napistu-mcp-server:latest" \</span>
            <span class="s">--cpu=1 --memory=2Gi \</span>
            <span class="s">--set-env-vars="MCP_PROFILE=docs"</span>
</code></pre></div></div>

<p>The production setup runs the “docs” profile (documentation + codebase +
tutorials; the execution component is excluded since it is meant to run
in a user’s local environment) with 1 CPU and 2Gi memory, costing less
than $1 per day. Content is refreshed both upon each new release of
<code class="language-plaintext highlighter-rouge">napistu-py</code> and
nightly to ensure the latest documentation changes are captured.</p>

<p>This deployment strategy creates a powerful feedback loop between
development and documentation.</p>

<h2 id="the-payoff">The payoff</h2>

<p>Now any AI tool can access the Napistu knowledge base instantly at
https://napistu-mcp-server-844820030839.us-west1.run.app. Users don’t
need to install software, run local processes, or handle maintenance;
they can simply configure their AI tools to connect to the shared
knowledge base. The service automatically updates with the latest
documentation and code changes, while Google Cloud Run handles scaling,
health checks, and automatic restarts to ensure high availability.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span><span class="w"> </span><span class="err">Claude</span><span class="w"> </span><span class="err">Desktop</span><span class="w"> </span><span class="err">/</span><span class="w"> </span><span class="err">Cursor</span><span class="w"> </span><span class="err">configuration</span><span class="w">
</span><span class="err">//</span><span class="w"> </span><span class="err">Add</span><span class="w"> </span><span class="err">this</span><span class="w"> </span><span class="err">to</span><span class="w"> </span><span class="err">your</span><span class="w"> </span><span class="err">MCP</span><span class="w"> </span><span class="err">settings</span><span class="w"> </span><span class="err">to</span><span class="w"> </span><span class="err">access</span><span class="w"> </span><span class="err">Napistu</span><span class="w"> </span><span class="err">knowledge</span><span class="w">
</span><span class="p">{</span><span class="w">
  </span><span class="nl">"mcpServers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"napistu"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npx"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"mcp-remote"</span><span class="p">,</span><span class="w"> </span><span class="s2">"https://napistu-mcp-server-844820030839.us-west1.run.app/mcp/"</span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The result is clear: Napistu’s entire knowledge base becomes instantly
searchable by AI agents worldwide, dramatically lowering the barrier to
contribution and collaboration.</p>

<h3 id="️-security-and-privacy">🛡️ Security and privacy</h3>

<p>The remote Napistu <em>MCP</em> server is intended for narrowly-scoped
information retrieval, so security and privacy issues should be minimal:</p>

<ul>
  <li><strong>Scoped</strong> to Napistu-specific resources only</li>
  <li><strong>No user data</strong> stored or processed. Standard cloud platform
logging may occur because this runs on Google Cloud Run</li>
  <li><strong>Auditable</strong>: all resources and tools are openly available in the
public <a href="https://github.com/napistu/napistu-py/tree/main/src/napistu/mcp">GitHub
repository</a></li>
</ul>

<h1 id="case-studies-ai-agents-in-action">Case studies: AI agents in action</h1>

<p>As previously noted, the server’s core goals are to make the codebase
more accessible to new users and collaborators, while also enhancing the
capabilities of AI agents used by core developers. In this section, I’ll
present two case studies that A/B test the impact of the <em>MCP</em> server on
both onboarding and development experiences. In both cases, even when
<em>MCP</em> was not enabled, I provided substantial contextual information
through standard channels — Claude accessed files via GitHub, and
Cursor had access to the full codebase. This makes <em>MCP</em>’s impact less
binary and instead highlights its marginal contribution in realistic,
everyday scenarios.</p>

<h2 id="case-study-1-learning-with-claude">Case Study 1: Learning with Claude</h2>

<p>To illustrate <em>MCP</em>’s value for training, I provided Claude the same
prompt both with and without <em>MCP</em> enabled for comparison:</p>

<blockquote>
  <p>I’m new to Napistu. Can you provide a high-level overview of the
structure, creation, and usage of SBML_dfs? Please think deeply and
incorporate your response into a Markdown file.</p>
</blockquote>

<h3 id="without-mcp">Without <em>MCP</em></h3>

<p><img src="https://www.shackett.org/figure/napistu_mcp/claude_no_mcp_clipped.gif" alt="GIF showing a Claude session without using the MCP server" style="width: 100%;" /></p>

<p>The resulting
<a href="https://claude.ai/public/artifacts/83a1e517-cf6f-42fa-8693-02fa710e6854">artifact</a>
is a mixed bag:</p>

<ul>
  <li>✅ Provides a good overview of the core and optional tables and
their relationships.</li>
  <li>✅ Public methods are grouped logically, with some light-weight
explanations.</li>
  <li>✅ Most code appears syntactically correct.</li>
  <li>❌ Logistically, I had to manually add select files from GitHub,
which requires prior knowledge of the codebase — something a new
user would likely lack.</li>
  <li>❌ The “Creating SBML_dfs Objects” section mentions a subset of the
approaches and includes the consensus logic, which feels out of
place.</li>
  <li>❌ Advanced usage and “integration with the Napistu ecosystem”
consists of random functionality inferred from the CLI.</li>
</ul>

<h3 id="with-mcp">With <em>MCP</em></h3>

<p><img src="https://www.shackett.org/figure/napistu_mcp/claude_w_mcp_clipped.gif" alt="GIF showing a Claude session while using the MCP server" style="width: 100%;" /></p>

<p>Armed with the <em>MCP</em>, the
<a href="https://claude.ai/public/artifacts/492972cc-5157-4dcd-8840-8f058c3dfc1b">artifact</a>
is well rounded, but far from perfect.</p>

<ul>
  <li>✅ Good overview of the core and optional tables and their
relationships.</li>
  <li>✅ Good high-level overview of how to create SBML_dfs and its public
methods.</li>
  <li>✅ Advanced usage, best practices, and general use cases are
solid.</li>
  <li>❌ <code class="language-plaintext highlighter-rouge">model_source = source.Source("MyDatabase", "v1.0")</code>. This line
captures the gist of the <code class="language-plaintext highlighter-rouge">model_source</code> object, but it won’t
actually run. Since this was a new addition to the codebase, this is
probably a case where the documentation has lagged behind the code.
This serves as a helpful reminder that you’ll only get relevant
information when your sources are up-to-date.</li>
</ul>

<p>I definitely prefer the artifact generated with <em>MCP</em>, even
though both artifacts came from a single initial prompt. Without
<em>MCP</em>, a user’s follow-up questions would quickly devolve into
hallucinations. With access to the Napistu <em>MCP</em>, Claude can
continue to guide the user through the complex codebase while
maintaining a high-level perspective.</p>

<p><strong>The Result</strong>: From “intimidating research codebase” to “approachable,
guided experience”</p>

<h2 id="case-study-2-building-with-cursor">Case Study 2: Building with Cursor</h2>

<p>There are several areas where access to the <em>MCP</em> server would
significantly benefit Cursor:</p>

<ul>
  <li>🚀 When NOT working on the actual Napistu codebase, Cursor could
still look up classes and functions.</li>
  <li>🚀 For general usage questions and training prompts, as demonstrated
in Case Study 1, having access to related content and leveraging
semantic search to handle synonyms would be especially valuable.
However, this is not a major Cursor use case, and users would likely
receive better responses from a general-purpose LLM like Claude in
such scenarios.</li>
</ul>

<p>And there are situations where access to the Napistu <em>MCP</em> would be
entirely unnecessary:</p>

<ul>
  <li>🤷 When working directly on the Napistu codebase, Cursor can
efficiently look up function signatures and search for functionality
using its native methods.</li>
</ul>

<p>Rather than setting up a strawman by denying Cursor access to the
codebase, I wanted to explore scenarios when <em>MCP</em> could help Cursor in
more nuanced situations. In exploring this question, I asked Cursor
directly — and I found its response quite insightful.</p>

<blockquote>
  <p>MCP tools in Cursor are about bridging the gap between “what the code
can do” and “how it’s meant to be used”. They’re not replacing code
navigation; rather, they’re adding the intent and context that lives
in documentation and tutorials.</p>
</blockquote>

<p>Intent and context become particularly relevant when applying a
framework like Napistu, rather than directly extending it. In this
sense, the Napistu <em>MCP</em> is particularly valuable when using Napistu to
explore scientific questions within a notebook environment. Given that
Cursor recently added support for Jupyter notebooks (albeit still in an
early and somewhat rough state), this represents a particularly
compelling use case. To keep the task straightforward, I asked Cursor
to extend one of the Napistu tutorials, since tutorials should be a mix
of code and explanatory prose, just like a good biological analysis:

<blockquote>
  <p>Can you help me extend the <code class="language-plaintext highlighter-rouge">understanding_sbml_dfs.ipynb</code> tutorial to
flesh out the “from_edgelist” workflow and to include any recent
updates to the core data structure? THINK DEEPLY AND (DO NOT USE THE
NAPISTU MCP / USE THE NAPISTU MCP AS NEEDED). Since you’ll have
trouble directly editing the ipynb, please suggest what I should
incorporate in a separate Markdown file. Edit for readability and to
prioritize high value content. Limit the total content to less than 30
new sentences.</p>
</blockquote>

<h3 id="without-mcp-1">Without <em>MCP</em></h3>

<p><img src="https://www.shackett.org/figure/napistu_mcp/cursor_no_mcp_clipped.gif" alt="GIF showing a Cursor session without using the MCP server" style="width: 100%;" /></p>

<p>The
<a href="https://www.shackett.org/post_support_20250903/#cursor-without-mcp">artifact</a>
has its high and low points:</p>

<ul>
  <li>✅ The summary of the edgelist format with running code is quite
good.</li>
  <li>❌ It has no idea what I meant by “recent updates” and just provides
code snippets for random public functions.</li>
</ul>

<h3 id="with-mcp-1">With <em>MCP</em></h3>

<p><img src="https://www.shackett.org/figure/napistu_mcp/cursor_w_mcp_clipped.gif" alt="GIF showing a Cursor session while using the MCP server" style="width: 100%;" /></p>

<p>Third-party integrations with Cursor are in a far rawer state than in
the major models, so you really have to twist Cursor’s arm to use them. In
practice, the experience is underwhelming — Cursor tends to rely
solely on the <code class="language-plaintext highlighter-rouge">search_codebase</code> tool, which surfaces information it
already has access to. As a result, the actual
<a href="https://www.shackett.org/post_support_20250903/#cursor-with-mcp">output</a> is
fairly poor:</p>

<ul>
  <li>❌ Only pseudocode describing the edgelist format</li>
  <li>❌ Misinterprets “recent updates,” instead listing all public
methods not already covered in the tutorial</li>
</ul>

<p>Between these two scenarios, I would choose the output generated without
<em>MCP</em> access. Much of this comes down to how Cursor used the <em>MCP</em>
server. It tends to follow a one-track mindset — fixated on “code,
code, code” — even when equipped with tools that could broaden its
scope. The key takeaway is that the agentic coding space still has
significant room to mature, particularly in fields like computational
biology, where effective work demands both technical execution and deep
domain insight.</p>

<h1 id="agents-for-science">Agents for science</h1>

<p>In this post, I’ve shared how I’ve improved my AI-based code development
experience by creating a remote Model Context Protocol (<em>MCP</em>) server
for my scientific codebase,
<a href="https://github.com/napistu/napistu">Napistu</a>. The server delivers
on-demand, contextually relevant information to AI agents, enabling them
to efficiently surface heterogeneous content and synthesize it into
actionable guidance. Deployed to Google Cloud Run via GitHub Actions,
the server can easily be used by both new and advanced users, at little
cost to me.</p>

<p>To clarify the benefits of on-demand context, I provided case studies
comparing agent behavior with and without access to the <em>MCP</em> server.
These examples highlight the server’s impact on both onboarding
efficiency and the overall development experience.</p>

<h2 id="whats-next">What’s next</h2>

<p><strong>More content</strong></p>

<ul>
  <li><strong>External library documentation</strong>: Because the codebase
content is prepared by directly scraping its <a href="https://napistu.readthedocs.io/en/latest/">Read the
Docs</a> site, it would be
straightforward to ingest documentation for non-Napistu libraries
like <em>igraph</em>.</li>
  <li><strong>More Napistu docs</strong>: Include the Napistu CLI, READMEs, and the
<em>napistu.r</em> pkgdown site.</li>
  <li><strong>Supporting multiple Napistu versions</strong>: If the core data
is prepared as part of Napistu’s CI/CD workflow and saved to GCS, the
server can download and cache a local version matching the user’s
request.</li>
</ul>

<p><strong>More power</strong></p>

<ul>
  <li><strong>Cross component semantic search</strong>: This would allow agents to
search across multiple components, providing a more comprehensive
understanding of the codebase.</li>
  <li><strong>Execution components running Napistu functions</strong>: The execution
components would enable agents to register Python objects and apply
transformations using Napistu functions. While still experimental,
they would allow agents to execute multiple steps in a
live Python environment (like looking up two genes and finding the
shortest path between them, entirely within the execution context).</li>
</ul>
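<p>The shortest-path lookup can be sketched in a few lines of plain
Python. The gene names and edge list below are made up purely for
illustration; the real execution components would operate on a live
<code class="language-plaintext highlighter-rouge">NapistuGraph</code> rather than a hard-coded edge list:</p>

```python
from collections import deque

# Toy directed edge list standing in for a regulatory graph.
# Gene names and edges are hypothetical, for illustration only.
EDGES = [
    ("TP53", "MDM2"),
    ("MDM2", "TP53"),
    ("TP53", "CDKN1A"),
    ("CDKN1A", "CCNE1"),
    ("CCNE1", "CDK2"),
]

def shortest_path(edges, source, target):
    """Breadth-first search returning one shortest path, or None."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(EDGES, "TP53", "CDK2"))
# ['TP53', 'CDKN1A', 'CCNE1', 'CDK2']
```

<p>Within the execution context, an agent could chain a step like this
with identifier lookups, keeping intermediate objects alive between
tool calls.</p>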

<p><strong>More science</strong></p>

<ul>
  <li>Planning features and updating documentation with Claude</li>
  <li>Efficiently implementing features and squashing bugs in Cursor</li>
</ul>

<h2 id="-getting-started">🔧 Getting Started</h2>

<p>Want to explore or contribute?</p>

<ul>
  <li>Configure Claude Desktop or your favorite LLM with the <em>MCP</em> server.</li>
  <li>Ask questions about Napistu’s internals and architecture.</li>
  <li>Start contributing to open issues with AI-assisted development.</li>
  <li>Join our community discussions to collaborate, share ideas, and help
shape the project.</li>
</ul>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="AI" /><category term="python" /><category term="SWE" /><summary type="html"><![CDATA[In this post, I walk through building a remote Model Context Protocol (MCP) server that enhances AI agents’ ability to navigate and contribute meaningfully to the complex Napistu scientific codebase. This tool empowers new users, advanced contributors, and AI agents alike to quickly access relevant project knowledge. Before MCP, I fed Claude a mix of README files, wikis, and raw code hoping for useful answers. Tools like Cursor struggled with the tangled structure, sparking the idea for the Napistu MCP server. I’ll cover: Why I built the Napistu MCP server and the problems it solves How I deployed it using GitHub Actions and Google Cloud Run Case studies showing how AI agents perform with — and without — MCP context]]></summary></entry><entry><title type="html">Network Biology with Napistu, Part 2: Translating Statistical Associations into Biological Mechanisms</title><link href="https://www.shackett.org/napistu_network_propagation/" rel="alternate" type="text/html" title="Network Biology with Napistu, Part 2: Translating Statistical Associations into Biological Mechanisms" /><published>2025-08-27T00:00:00+00:00</published><updated>2025-08-27T00:00:00+00:00</updated><id>https://www.shackett.org/napistu_network_propagation</id><content type="html" xml:base="https://www.shackett.org/napistu_network_propagation/"><![CDATA[<p>This is part two of a two-part series on
<strong><a href="https://github.com/napistu/napistu">Napistu</a></strong> — a new framework
for building genome-scale molecular networks and integrating them with
high-dimensional data. Using a methylmalonic acidemia (MMA) multimodal
dataset as a case study, I’ll demonstrate how to distill
disease-relevant signals into mechanistic insights through network-based
analysis.</p>

<h3 id="from-statistical-associations-to-biological-mechanisms">From statistical associations to biological mechanisms</h3>

<p>Modern genomics excels at identifying disease-associated genes and
proteins through statistical analysis. Methods like Gene Set Enrichment
Analysis (<em>GSEA</em>) group these genes into functional categories, offering
useful biological context. However, we aim to go beyond simply
identifying which genes and gene sets change. Our goal is to understand
why these genes change together, uncovering the mechanistic depth
typically seen in Figure 1 of a <em>Cell</em> paper. To achieve this, we must
identify key molecular components, summarize their interactions, and
characterize the dynamic cascades that drive emergent biological
behavior.</p>

<p>In this post, I’ll demonstrate how to gain this insight by mapping
statistical disease signatures onto genome-scale biological networks.
Then, using personalized PageRank, I’ll trace signals from dysregulated
genes back to their shared regulatory origins. This transforms lists of
differentially expressed genes into interconnected modules that reveal
upstream mechanisms driving coordinated molecular changes.</p>

<!--more-->
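<p>To make the propagation step concrete, here is a minimal numpy sketch
of personalized PageRank via power iteration. This is a generic
illustration on a made-up four-node chain, not Napistu’s implementation:</p>

```python
import numpy as np

def personalized_pagerank(A, restart, damping=0.85, tol=1e-10, max_iter=1000):
    """Personalized PageRank by power iteration.

    A[i, j] holds the weight of the edge j -> i; `restart` holds the
    seed weights (e.g. per-gene disease scores) that bias the walk.
    """
    out_degree = A.sum(axis=0)
    # column-normalize so each node's outgoing probability sums to 1
    P = A / np.where(out_degree == 0, 1, out_degree)
    r = restart / restart.sum()
    scores = r.copy()
    for _ in range(max_iter):
        updated = damping * (P @ scores) + (1 - damping) * r
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores

# made-up 4-node chain 0 -> 1 -> 2 -> 3, with all seed mass on node 0
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[v, u] = 1.0
seed = np.array([1.0, 0.0, 0.0, 0.0])
scores = personalized_pagerank(A, seed)
print(scores)  # mass decays with distance from the seed node
```

<p>The restart vector is what makes the walk “personalized”: rather than
teleporting uniformly, the walker returns to the seeded (disease-weighted)
nodes, so high-scoring vertices are those reachable from the signal.</p>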

<h3 id="napistus-implementation">Napistu’s implementation</h3>

<p>Napistu makes network biology practical through three core capabilities:</p>

<ol>
  <li>
    <p>Pathway representation with <code class="language-plaintext highlighter-rouge">SBML_dfs</code></p>

    <p>Napistu uses a custom format,
<a href="https://github.com/napistu/napistu/wiki/SBML-DFs"><code class="language-plaintext highlighter-rouge">SBML_dfs</code></a>, to
faithfully capture regulatory mechanisms. It tracks genes,
metabolites, protein complexes, and drugs as molecular species,
connecting them through regulatory interactions and biochemical
transformations.</p>
  </li>
  <li>
    <p>Translation to <code class="language-plaintext highlighter-rouge">NapistuGraph</code>s</p>

    <p>It provides tools to convert these pathway representations into
<a href="https://github.com/napistu/napistu/wiki/Napistu-Graphs"><code class="language-plaintext highlighter-rouge">NapistuGraph</code></a>,
where vertices represent molecular species and edges represent
direct regulatory relationships.</p>
  </li>
  <li>
    <p>Biological query capabilities</p>

    <p>Napistu enables users to ask general-purpose questions of these
networks, such as:</p>

    <ul>
      <li><em>What is the relationship between two genes?</em></li>
      <li><em>What are the direct and indirect regulators of a molecular
target?</em></li>
      <li><em>And — as we’ll explore in this post — what shared
mechanisms unite a set of disease-associated genes?</em></li>
    </ul>
  </li>
</ol>

<p>Throughout this post, I’ll use two types of asides to add context
without interrupting the main flow:</p>

<ul>
  <li>🟩 Green boxes offer biological background and systems biology
“inside baseball.”</li>
  <li>🟦 Blue boxes share reflections on building scientific software in
the age of AI.</li>
</ul>

<div class="content-section bio-section">
  <div class="section-content">
    <p><strong>For biologists:</strong> Discover a
versatile open-source framework designed to tackle one of the key
last-mile problems in working with high-dimensional genomics data. I’ll
show you how network propagation recovers both the genetic drivers of
MMA and its major metabolic dysfunction from functional changes in the
transcriptome and proteome. These regulatory patterns suggest that
investigating specific pathways involved in metabolic sensing — such as
ROS and sirtuins — could offer promising insights into MMA
pathophysiology.</p>

  </div>
</div>

<div class="content-section ai-aside">
  <div class="section-content">
    <p><strong>For computational folks:</strong> In this
post, I’ll walk you through how Napistu seamlessly integrates network
models with high-dimensional data using practical workflows — from
multimodal identifier mapping to personalized PageRank with empirical
nulls. Plus, I’ll share firsthand insights on leveraging AI to develop
complex scientific software — tackling challenges that often lie
beyond the reach of large language models (LLMs).</p>

  </div>
</div>

<h3 id="series-overview">Series overview</h3>

<p><strong><a href="https://shackett.org/multiomic_profiles/">Part 1: Creating Multimodal Disease
Profiles</a></strong> established the
foundation for this post by systematically extracting disease-relevant
molecular signatures from the Forny et al. methylmalonic acidemia
dataset. Through careful batch effect correction and both supervised and
unsupervised analyses, I uncovered coordinated gene and protein
expression programs linked to key disease phenotypes. The result? Clean,
quantitative profiles — ready for network-level mechanistic
exploration. Most profiles were generated using Generalized Additive
Models (GAMs), each combining regression summaries (effect size,
statistic, p/q-value) with phenotypes — such as case vs. control, MMA
urine levels reflecting metabolic burden, or <code class="language-plaintext highlighter-rouge">OHCblPlus</code> as a proxy for
enzyme activity.</p>

<p>In this post, I’ll decode these statistical signals by mapping them onto
genome-scale biological networks with Napistu. The goal is to trace
disease signals from dysregulated genes and proteins upstream to their
common regulatory drivers. I’ll begin by mapping statistical results
onto genes within the pathway model, then transfer these signals to
nodes in a regulatory graph. Finally, using personalized PageRank with
empirical null models, I’ll identify subgraphs enriched for disease
signals — revealing the upstream regulatory mechanisms driving MMA
pathophysiology.</p>
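<p>The empirical-null idea can be sketched in a few lines: permute the
seed weights across nodes, rerun the propagation, and ask how often a
permuted seed scores a node at least as highly as the observed seed did.
Everything below (the propagation matrix, weights, and permutation
scheme) is illustrative; the actual analysis permutes only over measured
transcripts and proteins:</p>

```python
import numpy as np

def empirical_pvalues(observed, score_fn, weights, n_perm=1000, seed=0):
    """Per-node empirical p-values from permuted seed weights."""
    rng = np.random.default_rng(seed)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        exceed += score_fn(rng.permutation(weights)) >= observed
    # add-one correction keeps p-values strictly positive
    return (exceed + 1) / (n_perm + 1)

# toy one-step propagation on a made-up 4-node chain: each node mixes
# its own seed weight with its upstream neighbor's
M = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
])
weights = np.array([3.0, 0.1, 0.2, 0.1])  # node 0 carries the signal
observed = M @ weights
pvals = empirical_pvalues(observed, lambda w: M @ w, weights)
print(pvals)  # nodes nearest the seeded signal get the smallest p-values
```

<p>Comparing against permuted rather than parametric nulls accounts for
the network’s degree structure: hub-adjacent nodes accumulate score under
any seeding, so only enrichment beyond that baseline is flagged.</p>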

<p><img src="https://www.shackett.org/figure/napistu_ppr/napistu_blog_post.png" alt="Overview of the Network Biology with Napistu blog series where multimodal data is formatted as molecular profiles, overlaid on genome-scale graphs to find common regulators" style="width: 100%;" /></p>

<h2 id="integrating-genome-scale-networks-and-genome-wide-data">Integrating genome-scale networks and genome-wide data</h2>

<h3 id="environment-setup">Environment setup</h3>

<p>To reproduce this analysis:</p>

<ol>
  <li>
    <p>Follow the <a href="https://shackett.org/multiomic_profiles/#environment-setup">setup
instructions</a>
and run the first notebook in the series
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/creating_multimodal_profiles.qmd"><code class="language-plaintext highlighter-rouge">creating_multimodal_profiles.qmd</code></a>.
This will set up both the Python environment and the input data
required for this analysis.</p>
  </li>
  <li>
    <p>Download the
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/napistu_network_propagation.qmd"><code class="language-plaintext highlighter-rouge">napistu_network_propagation.qmd</code></a>
notebook</p>
  </li>
  <li>
    <p>Modify the following code block in your copy of the notebook to set
appropriate paths:</p>

    <p>a. <code class="language-plaintext highlighter-rouge">CACHE_DIR</code> should match the value used in
<code class="language-plaintext highlighter-rouge">creating_multimodal_profiles.qmd</code><br />
b. <code class="language-plaintext highlighter-rouge">INPUT_DATA_DIR</code> should be a suitable location for saving the
network representations (~4 GB in size)</p>
  </li>
  <li>
    <p>Run the notebook and render an HTML output by executing:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>quarto render napistu_network_propagation.qmd
</code></pre></div>    </div>
  </li>
</ol>

<p>First, I’ll load the necessary Python modules, configure file paths, set
global parameters, and define utility functions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">types</span> <span class="kn">import</span> <span class="n">SimpleNamespace</span>

<span class="kn">import</span> <span class="nn">mudata</span> <span class="k">as</span> <span class="n">md</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">display</span><span class="p">,</span> <span class="n">HTML</span>

<span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">utils</span> <span class="k">as</span> <span class="n">napistu_utils</span>
<span class="kn">from</span> <span class="nn">napistu.sbml_dfs_core</span> <span class="kn">import</span> <span class="n">SBML_dfs</span>
<span class="kn">from</span> <span class="nn">napistu.network.ng_core</span> <span class="kn">import</span> <span class="n">NapistuGraph</span>
<span class="kn">from</span> <span class="nn">napistu.source</span> <span class="kn">import</span> <span class="n">unnest_sources</span>
<span class="kn">from</span> <span class="nn">napistu.gcs</span> <span class="kn">import</span> <span class="n">downloads</span>
<span class="kn">from</span> <span class="nn">napistu.matching</span> <span class="kn">import</span> <span class="n">mount</span>
<span class="kn">from</span> <span class="nn">napistu.network</span> <span class="kn">import</span> <span class="n">net_propagation</span>
<span class="kn">from</span> <span class="nn">napistu.network</span> <span class="kn">import</span> <span class="n">data_handling</span>
<span class="kn">from</span> <span class="nn">napistu.network</span> <span class="kn">import</span> <span class="n">ng_utils</span>
<span class="kn">from</span> <span class="nn">napistu.scverse.loading</span> <span class="kn">import</span> <span class="n">prepare_anndata_results_df</span>
<span class="kn">from</span> <span class="nn">napistu.scverse.loading</span> <span class="kn">import</span> <span class="n">prepare_mudata_results_df</span>
<span class="kn">from</span> <span class="nn">napistu.constants</span> <span class="kn">import</span> <span class="n">ONTOLOGIES</span><span class="p">,</span> <span class="n">MINI_SBO_TO_NAME</span>

<span class="kn">from</span> <span class="nn">shackett_utils.statistics</span> <span class="kn">import</span> <span class="n">hypothesis_testing</span>
<span class="kn">from</span> <span class="nn">shackett_utils.statistics</span> <span class="kn">import</span> <span class="n">multi_model_fitting</span>
<span class="kn">from</span> <span class="nn">shackett_utils.blog.html_utils</span> <span class="kn">import</span> <span class="n">display_tabulator</span>
<span class="kn">from</span> <span class="nn">shackett_utils.utils.pd_utils</span> <span class="kn">import</span> <span class="n">format_numeric_columns</span>

<span class="c1"># setup logging
</span><span class="kn">import</span> <span class="nn">logging</span>
<span class="n">logging</span><span class="p">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>
<span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s">'matplotlib.font_manager'</span><span class="p">).</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="p">.</span><span class="n">WARNING</span><span class="p">)</span>
<span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s">'matplotlib.pyplot'</span><span class="p">).</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="p">.</span><span class="n">WARNING</span><span class="p">)</span> 

<span class="c1"># File paths and data organization
# All input data should be placed in the SUPPLEMENTAL_DATA_DIR
# Cached results and models will be stored in CACHE_DIR
</span>
<span class="c1"># paths
</span><span class="n">PROJECT_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">"~/napistu_mma_posts"</span><span class="p">)</span>
<span class="n">INPUT_DATA_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="s">"input"</span><span class="p">)</span>
<span class="n">CACHE_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="s">"cache"</span><span class="p">)</span>

<span class="c1"># inputs
# model to download from GCS and store in NAPISTU_DATA_DIR
</span><span class="n">NAPISTU_ASSET</span> <span class="o">=</span> <span class="s">"human_consensus"</span>
<span class="n">NAPISTU_ASSET_VERSION</span> <span class="o">=</span> <span class="s">"20250901"</span>
<span class="c1"># H5Mu file containing the optimal model from MOFA+ and regression summaries
</span><span class="n">OPTIMAL_MODEL_H5MU_OUTFILE</span> <span class="o">=</span> <span class="s">"mofa_optimal_model.h5mu"</span>

<span class="c1"># intermediate files
</span><span class="n">PPR_NULL_CACHE_OUTFILE</span> <span class="o">=</span> <span class="s">"ppr_null_cache.tsv"</span>

<span class="c1"># outputs
</span><span class="n">PPR_RESULTS_OUTFILE</span> <span class="o">=</span> <span class="s">"ppr_results.tsv"</span>
<span class="n">SBML_DFS_W_DATA_OUTFILE</span> <span class="o">=</span> <span class="s">"sbml_dfs_w_data.pkl"</span>
<span class="n">NAPISTU_GRAPH_W_DATA_OUTFILE</span> <span class="o">=</span> <span class="s">"napistu_graph_w_data.pkl"</span>

<span class="c1"># Paths to input/output files
</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">CACHE_DIR</span><span class="p">,</span> <span class="n">OPTIMAL_MODEL_H5MU_OUTFILE</span><span class="p">)</span>
<span class="n">PPR_NULL_TMP_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">CACHE_DIR</span><span class="p">,</span> <span class="n">PPR_NULL_CACHE_OUTFILE</span><span class="p">)</span>
<span class="n">PPR_RESULTS_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="n">PPR_RESULTS_OUTFILE</span><span class="p">)</span>
<span class="n">SBML_DFS_W_DATA_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="n">SBML_DFS_W_DATA_OUTFILE</span><span class="p">)</span>
<span class="n">NAPISTU_GRAPH_W_DATA_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="n">NAPISTU_GRAPH_W_DATA_OUTFILE</span><span class="p">)</span>

<span class="c1"># dataset metadata
</span><span class="n">FORNY_MODALITIES</span> <span class="o">=</span> <span class="n">SimpleNamespace</span><span class="p">(</span>
    <span class="n">TRANSCRIPTOMICS</span> <span class="o">=</span> <span class="s">"transcriptomics"</span><span class="p">,</span>
    <span class="n">PROTEOMICS</span> <span class="o">=</span> <span class="s">"proteomics"</span>
<span class="p">)</span>

<span class="n">MODALITIES</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">FORNY_MODALITIES</span><span class="p">.</span><span class="n">__dict__</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>

<span class="c1"># Napistu controlled vocabulary
</span><span class="n">FORNY_ONTOLOGIES</span> <span class="o">=</span> <span class="n">SimpleNamespace</span><span class="p">(</span>
    <span class="n">ENSEMBL_GENE</span> <span class="o">=</span> <span class="n">ONTOLOGIES</span><span class="p">.</span><span class="n">ENSEMBL_GENE</span><span class="p">,</span>
    <span class="n">UNIPROT</span> <span class="o">=</span> <span class="n">ONTOLOGIES</span><span class="p">.</span><span class="n">UNIPROT</span>
<span class="p">)</span>

<span class="n">FORNY_DEFS</span> <span class="o">=</span> <span class="n">SimpleNamespace</span><span class="p">(</span>
    <span class="c1"># varm table names set in part 1
</span>    <span class="n">LFS</span> <span class="o">=</span> <span class="s">"LFs"</span><span class="p">,</span>
    <span class="c1"># table names used to add data sources to `sbml_dfs`
</span>    <span class="n">MOFA_LFS</span> <span class="o">=</span> <span class="s">"mofa_lfs"</span><span class="p">,</span>
    <span class="n">VAR_LEVEL_RESULTS</span> <span class="o">=</span> <span class="s">"var_level_results"</span><span class="p">,</span>
    <span class="c1"># template for is_X variables will be used to restrict vertex permutation
</span>    <span class="c1"># to measured proteins/transcripts
</span>    <span class="n">INDICATOR_STR</span> <span class="o">=</span> <span class="s">'is_{modality}'</span><span class="p">,</span>
    <span class="n">MODALITY_VAR_LEVEL_RESULTS_STR</span> <span class="o">=</span> <span class="s">"{modality}_var_level_results"</span> 
<span class="p">)</span>

<span class="n">MUDATA_ONTOLOGIES</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1"># these dicts indicate the ontology that we want to match against for each modality
</span>    <span class="c1"># and indicate that this ontology's identifiers are present in the .var table's index
</span>    <span class="n">FORNY_MODALITIES</span><span class="p">.</span><span class="n">TRANSCRIPTOMICS</span> <span class="p">:</span>
        <span class="p">{</span>
            <span class="s">"ontologies"</span> <span class="p">:</span> <span class="p">[</span><span class="n">FORNY_ONTOLOGIES</span><span class="p">.</span><span class="n">ENSEMBL_GENE</span><span class="p">],</span>
            <span class="s">"index_which_ontology"</span> <span class="p">:</span> <span class="n">FORNY_ONTOLOGIES</span><span class="p">.</span><span class="n">ENSEMBL_GENE</span>
        <span class="p">},</span>
    <span class="n">FORNY_MODALITIES</span><span class="p">.</span><span class="n">PROTEOMICS</span> <span class="p">:</span>
        <span class="p">{</span>
            <span class="s">"ontologies"</span> <span class="p">:</span> <span class="p">[</span><span class="n">FORNY_ONTOLOGIES</span><span class="p">.</span><span class="n">UNIPROT</span><span class="p">],</span>
            <span class="s">"index_which_ontology"</span> <span class="p">:</span> <span class="n">FORNY_ONTOLOGIES</span><span class="p">.</span><span class="n">UNIPROT</span>
        <span class="p">}</span>
<span class="p">}</span>

<span class="c1"># attributes to use for network propagation
</span><span class="n">LFS_OF_INTEREST</span> <span class="o">=</span> <span class="p">[</span><span class="s">"LF1"</span><span class="p">,</span> <span class="s">"LF2"</span><span class="p">,</span> <span class="s">"LF3"</span><span class="p">,</span> <span class="s">"LF4"</span><span class="p">,</span> <span class="s">"LF5"</span><span class="p">]</span>

<span class="c1"># regression terms to add from var table
</span><span class="n">PPR_LINEAR_PHENOTYPES</span> <span class="o">=</span> <span class="p">{</span><span class="s">"MMA_urine"</span><span class="p">,</span> <span class="s">"OHCblPlus"</span><span class="p">,</span> <span class="s">"case"</span><span class="p">,</span> <span class="s">"responsive_to_acute_treatment"</span><span class="p">}</span>
<span class="n">PPR_SMOOTH_PHENOTYPES</span> <span class="o">=</span> <span class="p">{</span><span class="s">"date_freezing"</span><span class="p">,</span> <span class="s">"proteomics_runorder"</span><span class="p">}</span>

<span class="n">VAR_VARS</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">phenotype</span> <span class="ow">in</span> <span class="n">PPR_LINEAR_PHENOTYPES</span><span class="p">:</span>
    <span class="n">VAR_VARS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"est_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">VAR_VARS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"stat_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="c1"># using -log10p since normal p- and q-values will underflow
</span>    <span class="n">VAR_VARS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"log10p_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">VAR_VARS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"q_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">phenotype</span> <span class="ow">in</span> <span class="n">PPR_SMOOTH_PHENOTYPES</span><span class="p">:</span>
    <span class="n">VAR_VARS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"q_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="n">STAT_PREFIXES</span> <span class="o">=</span> <span class="p">[</span><span class="s">"est"</span><span class="p">,</span> <span class="s">"log10p"</span><span class="p">,</span> <span class="s">"q"</span><span class="p">,</span> <span class="s">"stat"</span><span class="p">]</span>
<span class="n">VAR_METADATA</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([</span>
  <span class="o">*</span><span class="p">[{</span> <span class="s">"phenotype"</span> <span class="p">:</span> <span class="s">"latent factors"</span><span class="p">,</span> <span class="s">"summary"</span> <span class="p">:</span> <span class="n">x</span><span class="p">,</span> <span class="s">"variable"</span> <span class="p">:</span> <span class="n">x</span><span class="p">}</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">LFS_OF_INTEREST</span><span class="p">],</span>
  <span class="o">*</span><span class="p">[{</span> <span class="s">"phenotype"</span> <span class="p">:</span> <span class="n">x</span><span class="p">,</span> <span class="s">"summary"</span> <span class="p">:</span> <span class="n">y</span><span class="p">,</span> <span class="s">"variable"</span> <span class="p">:</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">y</span><span class="si">}</span><span class="s">_</span><span class="si">{</span><span class="n">x</span><span class="si">}</span><span class="s">"</span><span class="p">}</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="p">(</span><span class="n">PPR_LINEAR_PHENOTYPES</span> <span class="o">|</span> <span class="n">PPR_SMOOTH_PHENOTYPES</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">STAT_PREFIXES</span> <span class="k">if</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">y</span><span class="si">}</span><span class="s">_</span><span class="si">{</span><span class="n">x</span><span class="si">}</span><span class="s">"</span> <span class="ow">in</span> <span class="n">VAR_VARS</span><span class="p">]</span>
<span class="p">])</span>
<span class="n">VAR_METADATA</span><span class="p">[</span><span class="s">"summary"</span><span class="p">]</span> <span class="o">=</span> <span class="n">VAR_METADATA</span><span class="p">[</span><span class="s">"summary"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'_'</span><span class="p">,</span> <span class="s">' '</span><span class="p">)</span>

<span class="c1"># defining variables to add as vertex attributes and how to transform them so
# they are appropriate for personalized pagerank reset probability
</span><span class="n">ATTRIBUTES_TO_GRAPH_SPEC</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s">"attribute_names"</span><span class="p">:</span> <span class="s">"LF"</span><span class="p">,</span>
        <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">MOFA_LFS</span><span class="p">,</span>
        <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"square"</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"attribute_names"</span><span class="p">:</span> <span class="s">"^est_"</span><span class="p">,</span>
        <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">VAR_LEVEL_RESULTS</span><span class="p">,</span>
        <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"square"</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"attribute_names"</span><span class="p">:</span> <span class="s">"^stat_"</span><span class="p">,</span>
        <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">VAR_LEVEL_RESULTS</span><span class="p">,</span>
        <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"abs"</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"attribute_names"</span><span class="p">:</span> <span class="s">"^log10p_"</span><span class="p">,</span>
        <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">VAR_LEVEL_RESULTS</span><span class="p">,</span>
        <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"negate"</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"attribute_names"</span><span class="p">:</span> <span class="s">"^q_"</span><span class="p">,</span>
        <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">VAR_LEVEL_RESULTS</span><span class="p">,</span>
        <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"underflow_guarded_nlog10"</span>
    <span class="p">},</span>
<span class="p">]</span>

<span class="n">ATTRIBUTES_TO_GRAPH_SPEC</span> <span class="o">=</span> <span class="n">ATTRIBUTES_TO_GRAPH_SPEC</span> <span class="o">+</span> <span class="p">[{</span>
    <span class="s">"attribute_names"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">INDICATOR_STR</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">modality</span> <span class="o">=</span> <span class="n">m</span><span class="p">),</span>
    <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">MODALITY_VAR_LEVEL_RESULTS_STR</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">modality</span> <span class="o">=</span> <span class="n">m</span><span class="p">),</span>
    <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"identity"</span>
<span class="p">}</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">MODALITIES</span><span class="p">]</span>

<span class="c1"># masks pairing each modality with its indicator vertex attribute, used during vertex permutation
</span><span class="n">REGEXES_TO_MASKS</span> <span class="o">=</span> <span class="p">{</span> <span class="n">x</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">INDICATOR_STR</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">modality</span> <span class="o">=</span> <span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">MODALITIES</span> <span class="p">}</span>

<span class="c1"># utility functions
</span>
<span class="k">def</span> <span class="nf">underflow_guarded_nlog10</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="mf">1e-12</span><span class="p">:</span>
        <span class="k">return</span> <span class="mf">12.0</span> <span class="c1"># underflow guard: -log10(1e-12)
</span>    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>

<span class="n">CUSTOM_TRANSFORMATIONS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1"># take the absolute value
</span>    <span class="s">"abs"</span> <span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">abs</span><span class="p">(</span><span class="n">x</span><span class="p">),</span>
    <span class="s">"negate"</span> <span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="o">-</span><span class="n">x</span><span class="p">,</span>
    <span class="c1"># -log10[pvalue]
</span>    <span class="s">"underflow_guarded_nlog10"</span> <span class="p">:</span> <span class="n">underflow_guarded_nlog10</span><span class="p">,</span>
    <span class="s">"square"</span> <span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span>
<span class="p">}</span>

<span class="k">def</span> <span class="nf">floor_pvalue_by_resolution</span><span class="p">(</span><span class="n">p_value</span><span class="p">,</span> <span class="n">n_samples</span><span class="p">):</span>
    <span class="s">"""
    Floor p-values by resolution: maps p in [0, 1] onto
    [1/(n_samples + 1), 1] so that downstream -log10 values stay finite.
    """</span>
    
    <span class="k">return</span> <span class="p">(</span><span class="n">p_value</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">/</span> <span class="n">n_samples</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">n_samples</span> <span class="o">/</span> <span class="p">(</span><span class="n">n_samples</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">create_stacked_barplot_seaborn</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
    <span class="s">"""
    Alternative version using seaborn styling
    """</span>
    <span class="c1"># Set seaborn style
</span>    <span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span><span class="s">"whitegrid"</span><span class="p">)</span>
    
    <span class="c1"># Group by variable and sum counts across modalities
</span>    <span class="n">total_counts</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'variable'</span><span class="p">)[</span><span class="s">'count'</span><span class="p">].</span><span class="nb">sum</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    
    <span class="c1"># Create pivot table
</span>    <span class="n">pivot_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">'variable'</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">'modality'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'count'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">pivot_df</span> <span class="o">=</span> <span class="n">pivot_df</span><span class="p">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">total_counts</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
    
    <span class="c1"># Create the plot
</span>    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
    
    <span class="c1"># Use seaborn color palette
</span>    <span class="n">colors</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"husl"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">pivot_df</span><span class="p">.</span><span class="n">columns</span><span class="p">))</span>
    
    <span class="c1"># Plot stacked bars
</span>    <span class="n">pivot_df</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'bar'</span><span class="p">,</span> <span class="n">stacked</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
    
    <span class="c1"># Customize
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Stacked Barplot by Attributes and Modality'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Attributes (ordered by total count)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Count'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">'Modality'</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.05</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">loc</span><span class="o">=</span><span class="s">'upper left'</span><span class="p">)</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'right'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span>

<span class="k">def</span> <span class="nf">plot_ppr_enrichment_histograms</span><span class="p">(</span><span class="n">fdr_controlled_results</span><span class="p">):</span>
    <span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="n">fdr_controlled_results</span><span class="p">[</span><span class="o">~</span><span class="n">fdr_controlled_results</span><span class="p">[</span><span class="s">"is_enriched"</span><span class="p">]][</span><span class="s">"p_value"</span><span class="p">].</span><span class="n">hist</span><span class="p">(</span>
        <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Depleted (False)"</span><span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"P-value"</span><span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Count"</span><span class="p">)</span>

    <span class="n">fdr_controlled_results</span><span class="p">[</span><span class="n">fdr_controlled_results</span><span class="p">[</span><span class="s">"is_enriched"</span><span class="p">]][</span><span class="s">"p_value"</span><span class="p">].</span><span class="n">hist</span><span class="p">(</span>
        <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Enriched (True)"</span><span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"P-value"</span><span class="p">)</span>

    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">reorder_by_rank_sum</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
    <span class="s">"""Reorder rows by sum of ranks (lower sum = better overall rank)"""</span>
    <span class="n">df_num</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'.'</span><span class="p">,</span> <span class="n">pd</span><span class="p">.</span><span class="n">NA</span><span class="p">).</span><span class="nb">apply</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">to_numeric</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">'coerce'</span><span class="p">)</span>
    <span class="n">max_val</span> <span class="o">=</span> <span class="n">df_num</span><span class="p">.</span><span class="nb">max</span><span class="p">().</span><span class="nb">max</span><span class="p">()</span>
    <span class="n">df_filled</span> <span class="o">=</span> <span class="n">df_num</span><span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">max_val</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">rank_sums</span> <span class="o">=</span> <span class="n">df_filled</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">rank_sums</span><span class="p">.</span><span class="n">sort_values</span><span class="p">().</span><span class="n">index</span><span class="p">]</span>

<span class="c1"># constants affecting behavior
</span><span class="n">N_NULL_SAMPLES</span> <span class="o">=</span> <span class="mi">500</span>
<span class="n">OVERWRITE</span> <span class="o">=</span> <span class="bp">False</span>
</code></pre></div></div>
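<p>Since personalized PageRank reset probabilities must be non-negative, each transformation above maps a signed or log-scaled summary onto a non-negative scale. A minimal sketch of this idea (using a hypothetical <code>to_reset_weights</code> helper, not the Napistu API):</p>

```python
import numpy as np

# Hypothetical helper (not the Napistu API): map raw attribute values onto
# non-negative weights suitable for a personalized PageRank reset distribution.
TRANSFORMATIONS = {
    "square": lambda x: x ** 2,  # signed estimates -> magnitudes
    "abs": abs,                  # signed test statistics -> magnitudes
    # clamp q-values before the log so q == 0 cannot produce inf
    "underflow_guarded_nlog10": lambda x: -np.log10(max(x, 1e-12)),
}

def to_reset_weights(values, transformation):
    """Transform values, then renormalize so they sum to 1."""
    transformed = np.array([TRANSFORMATIONS[transformation](v) for v in values])
    return transformed / transformed.sum()

# smaller q-values receive larger reset probability
weights = to_reset_weights([0.04, 1e-20, 0.5], "underflow_guarded_nlog10")
```

<p>Renormalizing to a proper probability distribution is one common convention; the normalization actually used by the pipeline may differ.</p>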

<h3 id="loading-mma-molecular-profiles">Loading MMA molecular profiles</h3>

<p>Next, I’ll load the results generated in the <a href="https://shackett.org/multiomic_profiles/">previous
post</a>. These are stored in a
<code class="language-plaintext highlighter-rouge">MuData</code> object saved as an <code class="language-plaintext highlighter-rouge">.h5mu</code> file. (You’ll see the contents of
this object later when I discuss adding attributes to the graph.) The
only modification I’ll make is adding indicator variables —
<code class="language-plaintext highlighter-rouge">is_transcriptomics</code> and <code class="language-plaintext highlighter-rouge">is_proteomics</code> — to each modality to easily
track measured transcripts and proteins in downstream analyses.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># let's load the Forny results so we can try adding a few different types of tables to the sbml_dfs
</span><span class="n">mdata</span> <span class="o">=</span> <span class="n">md</span><span class="p">.</span><span class="n">read_h5mu</span><span class="p">(</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span><span class="p">)</span>

<span class="c1"># create an indicator marking which modalities are present in the mdata;
# it will propagate to vertices in the graph, where it is useful for generating
# a mask when constructing vertices' null distributions
</span><span class="n">ADATA_LEVEL_VARS</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="n">MODALITIES</span><span class="p">:</span>
    <span class="n">indicator_var</span> <span class="o">=</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">INDICATOR_STR</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">modality</span><span class="o">=</span><span class="n">modality</span><span class="p">)</span>
    <span class="c1"># add to var table
</span>    <span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">].</span><span class="n">var</span><span class="p">[</span><span class="n">indicator_var</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="c1"># indicate that this should be added to the sbml_dfs later
</span>    <span class="n">ADATA_LEVEL_VARS</span><span class="p">[</span><span class="n">modality</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">indicator_var</span><span class="p">]</span>
</code></pre></div></div>
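<p>Downstream, these indicators act as masks: when building a vertex’s null distribution, attribute values are permuted only among vertices that were actually measured in that modality. A minimal sketch of mask-restricted permutation (hypothetical helper, not the Napistu implementation):</p>

```python
import numpy as np

def permute_within_mask(values, mask, rng):
    """Shuffle values only among positions where mask is True; leave the rest untouched."""
    values = np.asarray(values, dtype=float)
    out = values.copy()
    idx = np.flatnonzero(mask)
    out[idx] = rng.permutation(values[idx])
    return out

# e.g., only the first three vertices carry a proteomics measurement
values = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
mask = np.array([True, True, True, False, False])
permuted = permute_within_mask(values, mask, np.random.default_rng(0))
```

<p>Restricting the shuffle this way keeps unmeasured vertices out of the null, so a vertex is only ever compared against peers that could plausibly have carried its value.</p>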

<h3 id="loading-napistu-data">Loading Napistu data</h3>

<p>To simplify access, I’ve uploaded a lightweight test pathway (a merged
set of three metabolic pathways) and the full human consensus pathway to
Google Cloud Storage (GCS). These pathway representations center on
two key objects:</p>

<ul>
  <li><a href="https://github.com/napistu/napistu/wiki/SBML-DFs"><code class="language-plaintext highlighter-rouge">SBML_dfs</code></a>: An
in-memory relational database organizing molecular species (genes,
metabolites, complexes, drugs) and their relationships (reactions,
interactions).</li>
  <li><a href="https://github.com/napistu/napistu/wiki/Napistu-Graphs"><code class="language-plaintext highlighter-rouge">NapistuGraph</code></a>:
A directed graph representation of the same network, translating
molecular species and reactions into a graph structure for
downstream analysis.</li>
</ul>
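<p>To make the relational-to-graph translation concrete, here’s a toy sketch (illustrative only — the real <code>SBML_dfs</code> schema and <code>NapistuGraph</code> construction are far richer): substrates and regulators point into a reaction vertex, while products point out of it, yielding a directed bipartite species–reaction graph.</p>

```python
import pandas as pd

# Toy reaction-species table (hypothetical; much simpler than SBML_dfs)
reaction_species = pd.DataFrame({
    "r_id": ["hexokinase"] * 3,
    "s_id": ["glc", "g6p", "hk"],
    "role": ["substrate", "product", "catalyst"],
})

# Directed bipartite edges: substrates/catalysts -> reaction -> products
edges = []
for _, row in reaction_species.iterrows():
    if row["role"] == "product":
        edges.append((row["r_id"], row["s_id"]))
    else:
        edges.append((row["s_id"], row["r_id"]))
# the edge list can now seed any directed-graph library
```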

<p>The human consensus <code class="language-plaintext highlighter-rouge">SBML_dfs</code> and <code class="language-plaintext highlighter-rouge">NapistuGraph</code> I will use combine
these sources:</p>

<ul>
  <li><em>Reactome</em>: human-centric gold-standard pathways of cellular
physiology and signaling</li>
  <li><em>BiGG</em>: the Recon3D genome-scale metabolic model</li>
  <li><em>TRRUST</em>: curated transcription factor–target interactions</li>
  <li><em>STRING</em>: undirected physical and functional interactions</li>
  <li><em>Dogma</em>: a model of cognate relationships between genes,
transcripts, and proteins with their systematic identifiers</li>
</ul>

<p>I built this consensus using the Napistu CLI (see the <a href="https://github.com/napistu/napistu/blob/main/dev/create_human_consensus.qmd">build
pipeline</a>),
which supports constructing and refining genome-scale pathway models for
most model organisms. Below is an overview of how the human consensus
was assembled:</p>

<p><img src="https://www.shackett.org/figure/napistu_ppr/Napistu_build_process.png" alt="Napistu human consensus pathway model build pipeline showing
integration of multiple biological
databases" /></p>

<p>Next, I’ll download and load the human consensus <code class="language-plaintext highlighter-rouge">SBML_dfs</code> from my
public GCS bucket (please avoid frequent downloads 🙂), along with the
corresponding <code class="language-plaintext highlighter-rouge">NapistuGraph</code> and a lookup table of systematic
identifiers, by running:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># this will download the sbml_dfs, napistu_graph, and species_identifiers from a public GCS bucket
# or if they already exist in the INPUT_DATA_DIR, it will just set the path to the existing asset
</span><span class="n">sbml_dfs_path</span> <span class="o">=</span> <span class="n">downloads</span><span class="p">.</span><span class="n">load_public_napistu_asset</span><span class="p">(</span>
    <span class="n">asset</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET</span><span class="p">,</span>
    <span class="n">data_dir</span> <span class="o">=</span> <span class="n">INPUT_DATA_DIR</span><span class="p">,</span>
    <span class="n">subasset</span> <span class="o">=</span> <span class="s">"sbml_dfs"</span><span class="p">,</span>
    <span class="n">version</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET_VERSION</span>
<span class="p">)</span>

<span class="n">napistu_graph_path</span> <span class="o">=</span> <span class="n">downloads</span><span class="p">.</span><span class="n">load_public_napistu_asset</span><span class="p">(</span>
    <span class="n">asset</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET</span><span class="p">,</span>
    <span class="n">data_dir</span> <span class="o">=</span> <span class="n">INPUT_DATA_DIR</span><span class="p">,</span>
    <span class="n">subasset</span> <span class="o">=</span> <span class="s">"napistu_graph"</span><span class="p">,</span>
    <span class="n">version</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET_VERSION</span>
<span class="p">)</span>

<span class="n">species_identifiers_path</span> <span class="o">=</span> <span class="n">downloads</span><span class="p">.</span><span class="n">load_public_napistu_asset</span><span class="p">(</span>
    <span class="n">asset</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET</span><span class="p">,</span>
    <span class="n">data_dir</span> <span class="o">=</span> <span class="n">INPUT_DATA_DIR</span><span class="p">,</span>
    <span class="n">subasset</span> <span class="o">=</span> <span class="s">"species_identifiers"</span><span class="p">,</span>
    <span class="n">version</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET_VERSION</span>
<span class="p">)</span>

<span class="c1"># ~2 min load
</span><span class="n">sbml_dfs</span> <span class="o">=</span> <span class="n">SBML_dfs</span><span class="p">.</span><span class="n">from_pickle</span><span class="p">(</span><span class="n">sbml_dfs_path</span><span class="p">)</span>

<span class="n">napistu_graph</span> <span class="o">=</span> <span class="n">NapistuGraph</span><span class="p">.</span><span class="n">from_pickle</span><span class="p">(</span><span class="n">napistu_graph_path</span><span class="p">)</span>

<span class="n">species_identifiers</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">species_identifiers_path</span><span class="p">,</span> <span class="n">delimiter</span> <span class="o">=</span> <span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>

<span class="n">ng_utils</span><span class="p">.</span><span class="n">validate_assets</span><span class="p">(</span>
    <span class="n">sbml_dfs</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">,</span>
    <span class="n">napistu_graph</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">,</span>
    <span class="n">identifiers_df</span> <span class="o">=</span> <span class="n">species_identifiers</span>
<span class="p">)</span>
</code></pre></div></div>

<p>With the core Napistu objects loaded, I’ll briefly summarize their
contents — counting molecular species from each data source and
outlining the types of regulatory relationships captured by graph edges.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># generate some simple model summaries
</span><span class="n">species_sources</span> <span class="o">=</span> <span class="n">unnest_sources</span><span class="p">(</span><span class="n">sbml_dfs</span><span class="p">.</span><span class="n">species</span><span class="p">)</span>

<span class="n">species_counts_by_source</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">species_sources</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">species_sources</span><span class="p">[</span><span class="s">"pathway_id"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"napistu_data"</span><span class="p">)]</span>
    <span class="p">.</span><span class="n">value_counts</span><span class="p">(</span><span class="s">"pathway_id"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span>
        <span class="n">pathway_id</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'pathway_id'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span>
        <span class="k">lambda</span> <span class="n">path</span><span class="p">:</span> <span class="n">Path</span><span class="p">(</span><span class="n">path</span><span class="p">).</span><span class="n">stem</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'uncompartmentalized_'</span><span class="p">,</span> <span class="s">''</span><span class="p">).</span><span class="n">replace</span><span class="p">(</span><span class="s">'hpa_filtered_'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="p">)</span>
    <span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">"count"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"pathway_id"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">total</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">species</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span><span class="n">species_counts_by_source</span><span class="p">,</span> <span class="n">caption</span><span class="o">=</span><span class="s">"Counts of molecular species from each source"</span><span class="p">)</span>

<span class="n">participant_counts</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">get_edge_dataframe</span><span class="p">().</span><span class="n">value_counts</span><span class="p">(</span><span class="s">"sbo_term"</span><span class="p">).</span><span class="n">rename</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">MINI_SBO_TO_NAME</span><span class="p">).</span><span class="n">to_frame</span><span class="p">().</span><span class="n">T</span>

<span class="n">display_tabulator</span><span class="p">(</span><span class="n">participant_counts</span><span class="p">,</span> <span class="n">caption</span><span class="o">=</span><span class="s">"Counts of reaction species by role"</span><span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Counts of molecular species from each source
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: &quot;count&quot;, &quot;reactome&quot;: 23046, &quot;string&quot;: 19385, &quot;dogma_sbml_dfs&quot;: 19362, &quot;bigg&quot;: 4476, &quot;trrust&quot;: 2862, &quot;total&quot;: 38776}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;reactome&quot;, &quot;field&quot;: &quot;reactome&quot;}, {&quot;title&quot;: &quot;string&quot;, &quot;field&quot;: &quot;string&quot;}, {&quot;title&quot;: &quot;dogma_sbml_dfs&quot;, &quot;field&quot;: &quot;dogma_sbml_dfs&quot;}, {&quot;title&quot;: &quot;bigg&quot;, &quot;field&quot;: &quot;bigg&quot;}, {&quot;title&quot;: &quot;trrust&quot;, &quot;field&quot;: &quot;trrust&quot;}, {&quot;title&quot;: &quot;total&quot;, &quot;field&quot;: &quot;total&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Counts of reaction species by role
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: &quot;count&quot;, &quot;interactor&quot;: 7801948, &quot;product&quot;: 34070, &quot;reactant&quot;: 31086, &quot;stimulator&quot;: 13326, &quot;catalyst&quot;: 6691, &quot;modifier&quot;: 3722, &quot;inhibitor&quot;: 2914}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;interactor&quot;, &quot;field&quot;: &quot;interactor&quot;}, {&quot;title&quot;: &quot;product&quot;, &quot;field&quot;: &quot;product&quot;}, {&quot;title&quot;: &quot;reactant&quot;, &quot;field&quot;: &quot;reactant&quot;}, {&quot;title&quot;: &quot;stimulator&quot;, &quot;field&quot;: &quot;stimulator&quot;}, {&quot;title&quot;: &quot;catalyst&quot;, &quot;field&quot;: &quot;catalyst&quot;}, {&quot;title&quot;: &quot;modifier&quot;, &quot;field&quot;: &quot;modifier&quot;}, {&quot;title&quot;: &quot;inhibitor&quot;, &quot;field&quot;: &quot;inhibitor&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>These model statistics highlight the scale and scope of the network. The
consensus model integrates over 38,000 molecular species, including
~19,000 proteins (from gene-centric sources like <em>STRING</em> and <em>Dogma</em>)
and ~19,000 metabolites and complexes (primarily from <em>Reactome</em> and
the <em>BiGG</em> <em>Recon3D</em> model). The graph contains nearly 4 million
molecular interactions, the majority from <em>STRING</em>’s physical and
functional associations. Notably, around 92,000 edges carry deeper
mechanistic annotations, such as transcription factor → target or enzyme
→ substrate.</p>

<p>From a bird’s-eye view:</p>

<p><img src="https://www.shackett.org/figure/napistu_ppr/Consensus_graph.png" alt="Genome-scale network diagram for the human consensus model" width="700" /></p>

<p>This genome-scale view shows the overall network structure, but I can
zoom into any region to examine molecular interactions at high
resolution. For example, I can explore the molecular neighborhood of
<strong>MMUT</strong> (labeled as “MUT” in the network) to identify its upstream
regulators and downstream targets. This local view reveals how <em>MMUT</em>
connects to both regulatory genes (such as <em>AKT</em> and <em>IGF1</em>) and
metabolites (like its enzymatic product methylmalonyl-CoA, shown as
L-MM-CoA), illustrating the integration of gene regulatory and metabolic
networks.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>Vertex names may differ from the
nomenclature used by individual data sources; however, all merges are
based on reliable database identifiers, ensuring accurate molecular
relationships. Annotations are organized in <code class="language-plaintext highlighter-rouge">Identifiers</code> objects, which
track the identifiers across multiple ontologies related to a given
entity. These annotations also incorporate <a href="https://sbml.org/"><code class="language-plaintext highlighter-rouge">SBML</code></a>’s
biological qualifiers, which define relationships such as <code class="language-plaintext highlighter-rouge">BQB_IS</code>
(identity), <code class="language-plaintext highlighter-rouge">BQB_HAS_PART</code> (component of a complex), and
<code class="language-plaintext highlighter-rouge">BQB_IS_DESCRIBED_BY</code> (reference to supporting literature).</p>

<p>Napistu leverages these annotations both to merge data sources when
building the consensus model and to seamlessly integrate
high-dimensional datasets with its pathway representations.</p>

  </div>
</div>
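<p>To make the annotation model concrete, here is a toy sketch of what one
entity’s identifiers look like in flattened form — each row ties the entity to
one ontology term with a biological qualifier. All values (including the
PubMed placeholder) are illustrative, not Napistu’s actual schema.</p>

```python
import pandas as pd

# Toy illustration of a flattened Identifiers entry: each row links an
# entity to one ontology term with an SBML biological qualifier.
# All values are illustrative; the pubmed ID is a placeholder.
identifier_rows = pd.DataFrame({
    "ontology": ["ensembl_gene", "uniprot", "pubmed"],
    "identifier": ["ENSG00000146085", "P22033", "12345678"],
    "bqb": ["BQB_IS", "BQB_IS", "BQB_IS_DESCRIBED_BY"],
})

# identity terms (usable for merging) vs. supporting literature
identity_terms = identifier_rows.loc[identifier_rows["bqb"] == "BQB_IS"]
```

Filtering on the qualifier is what makes merges reliable: only `BQB_IS`-style
terms assert identity, so literature references never drive a merge.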

<p><img src="https://www.shackett.org/figure/napistu_ppr/ENSG00000146085_close_up_neighborhood.png" alt="Network visualization of the local molecular neighborhood of MMUT" width="700" /></p>


<div class="content-section bio-section">
  <div class="section-content">
    <p>I created this visualization —
and the subgraph figure you’ll see later — using results from this
post. If you’re interested in generating visualizations like this, check
out
<a href="https://github.com/napistu/napistu-scrapyard/blob/main/applications/forny_2023/network_vis.qmd"><code class="language-plaintext highlighter-rouge">network_vis.qmd</code></a>.
Though Napistu is primarily a Python framework, its companion R package,
<a href="https://github.com/napistu/napistu-r"><strong>napistu.r</strong></a>, is purpose-built
for visualizing Napistu networks. It leverages <em>ggraph</em> for
grammar-of-graphics-based visualizations and uses <em>reticulate</em> to bridge
R and Python, enabling direct access to Napistu’s data structures and
functions.</p>

  </div>
</div>

<h2 id="adding-data-to-networks">Adding data to networks</h2>

<p>To use an ’omics dataset in Napistu, I will:</p>

<ol>
  <li><em>Mount the dataset onto the pathway <code class="language-plaintext highlighter-rouge">SBML_dfs</code></em>, which involves:
    <ol type="a">
      <li>Matching systematic identifiers between the dataset and the pathway to link ’omics features with Napistu species.</li>
      <li>Resolving many-to-one mappings (i.e., when multiple features map to the same molecular species).</li>
      <li>Constructing a table indexed by unique species IDs, with dataset variables as columns.</li>
      <li>Adding this table to the <code class="language-plaintext highlighter-rouge">species_data</code> attribute of the <code class="language-plaintext highlighter-rouge">SBML_dfs</code>. Multiple tables and/or datasets can be stored in <code class="language-plaintext highlighter-rouge">species_data</code>.</li>
    </ol>
  </li>
  <li><em>Pass variables to graph vertices</em>:
    <ol type="a">
      <li>Use <code class="language-plaintext highlighter-rouge">net_create._add_graph_species_attribute</code> to pass variables from one or more <em>species_data</em> tables to a <code class="language-plaintext highlighter-rouge">NapistuGraph</code>’s vertices.</li>
      <li>Optionally, transform variables at this stage (e.g., to make them non-negative for personalized PageRank).</li>
    </ol>
  </li>
  <li><em>Use these vertex attributes for downstream analyses</em>, such as
setting the <code class="language-plaintext highlighter-rouge">reset_proportional_to</code> parameter in personalized
PageRank.</li>
</ol>
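<p>Steps 1b and 1c above can be pictured in plain pandas: when several
features match the same species, their values are resolved (here by simple
averaging — an illustrative rule, not Napistu’s actual resolution logic) and
the result is indexed by unique species IDs.</p>

```python
import pandas as pd

# Steps 1b-1c in miniature: two features match the same species, so their
# values must be resolved before the table can be indexed by species ID.
# All IDs and the averaging rule are illustrative.
matches = pd.DataFrame({
    "s_id": ["S00001", "S00001", "S00002"],
    "feature_id": ["ENSG_A1", "ENSG_A2", "ENSG_B"],
    "score": [0.8, 0.4, 1.2],
})

species_table = matches.groupby("s_id")[["score"]].mean()

# species_table now has one row per distinct species ID and could be
# stored under a key in a species_data-style dict (step 1d)
species_data = {"example_scores": species_table}
```

The resulting table has exactly one row per species, which is the invariant
the <code class="language-plaintext highlighter-rouge">species_data</code> tables maintain.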

<h3 id="data-rightarrow-sbml_dfs">Data $\Rightarrow$ <code class="language-plaintext highlighter-rouge">SBML_dfs</code></h3>

<p>To identify which results to explore further with network methods, I’ll
first review the <code class="language-plaintext highlighter-rouge">MuData</code> object from the previous analysis. This
summary highlights both <code class="language-plaintext highlighter-rouge">MuData</code>-level attributes — such as
Multi-Omics Factor Analysis (<em>MOFA</em>) results — and modality-level
<code class="language-plaintext highlighter-rouge">AnnData</code> attributes, including:</p>

<ul>
  <li><em>obs</em>: sample-level metadata</li>
  <li><em>var</em>: feature-level metadata</li>
  <li><em>X</em> and <em>layers</em>: measurements</li>
  <li><em>obsm</em>, <em>varm</em>: tensors defined over samples or features</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mdata</span>
</code></pre></div></div>

<pre>MuData object with n_obs × n_vars = 221 × 13922
  uns:  &#x27;mofa&#x27;
  obsm: &#x27;X_mofa&#x27;
  varm: &#x27;LFs&#x27;
  2 modalities
    transcriptomics:    221 x 9134
      obs:  &#x27;case&#x27;, &#x27;gender&#x27;, &#x27;consanguinity&#x27;, &#x27;mut_category&#x27;, &#x27;wgs_zygosity&#x27;, &#x27;acidosis&#x27;, &#x27;metabolic_acidosis&#x27;, &#x27;metabolic_ketoacidosis&#x27;, &#x27;ketosis&#x27;, &#x27;hyperammonemia&#x27;, &#x27;abnormal_muscle_tone&#x27;, &#x27;musc_hypotonia&#x27;, &#x27;musc_hypertonia&#x27;, &#x27;fct_respiratory_abnormality&#x27;, &#x27;dyspnea&#x27;, &#x27;tachypnea&#x27;, &#x27;reduced_consciousness&#x27;, &#x27;lethargy&#x27;, &#x27;coma&#x27;, &#x27;seizures&#x27;, &#x27;general_tonic_clonic_seizure&#x27;, &#x27;any_GI_problem&#x27;, &#x27;failure_to_thrive&#x27;, &#x27;any_delay&#x27;, &#x27;behavioral_abnormality&#x27;, &#x27;concurrent_infection&#x27;, &#x27;urine_ketones&#x27;, &#x27;dialysis&#x27;, &#x27;peritoneal_dialysis&#x27;, &#x27;insulin&#x27;, &#x27;diet&#x27;, &#x27;carnitine&#x27;, &#x27;cobalamin&#x27;, &#x27;bicarb&#x27;, &#x27;glucose_IV&#x27;, &#x27;cobalamin_responsive&#x27;, &#x27;antibiotic_treatment&#x27;, &#x27;protein_restriction&#x27;, &#x27;tube_feeding_day&#x27;, &#x27;tube_feeding_night&#x27;, &#x27;tube_feeding_overall&#x27;, &#x27;language_delay&#x27;, &#x27;any_neurological_abnormalities_chronic&#x27;, &#x27;impaired_kidney_fct&#x27;, &#x27;hemat_abnormality&#x27;, &#x27;anemia&#x27;, &#x27;neutropenia&#x27;, &#x27;skin_abnormalities&#x27;, &#x27;hearing_impairment&#x27;, &#x27;osteoporosis&#x27;, &#x27;failure_to_thrive_chronic&#x27;, &#x27;global_dev_delay_chr&#x27;, &#x27;hypotonia_chr&#x27;, &#x27;basal_ganglia_abnormality_chr&#x27;, &#x27;failure_to_thrive_or_tube_feeding&#x27;, &#x27;irritability&#x27;, &#x27;hyperventilation&#x27;, &#x27;hypothermia&#x27;, &#x27;somnolence&#x27;, &#x27;vomiting&#x27;, &#x27;dehydration&#x27;, &#x27;feeding_problem&#x27;, &#x27;responsive_to_acute_treatment&#x27;, &#x27;n_passage&#x27;, &#x27;date_collection&#x27;, &#x27;date_freezing&#x27;, &#x27;onset_age&#x27;, &#x27;OHCblMinus&#x27;, &#x27;OHCblPlus&#x27;, 
&#x27;ratio&#x27;, &#x27;SimultOHCblMinus&#x27;, &#x27;SimultOHCblPlus&#x27;, &#x27;AdoCblMinus&#x27;, &#x27;AdoCblPlus&#x27;, &#x27;SimultAdoCblMinus&#x27;, &#x27;SimultAdoCblPlus&#x27;, &#x27;prot_mut_level&#x27;, &#x27;rnaseq_mut_level&#x27;, &#x27;MMA_urine&#x27;, &#x27;ammonia_umolL&#x27;, &#x27;pH&#x27;, &#x27;base_excess&#x27;, &#x27;MMA_urine_after_treat&#x27;, &#x27;carnitine_dose&#x27;, &#x27;natural_protein_amount&#x27;, &#x27;total_protein_amount&#x27;, &#x27;weight_centile_quant&#x27;, &#x27;length_centile_quant&#x27;, &#x27;head_circumfernce_quant&#x27;, &#x27;proteomics_runorder&#x27;
      var:  &#x27;est_MMA_urine&#x27;, &#x27;est_OHCblPlus&#x27;, &#x27;est_case&#x27;, &#x27;est_responsive_to_acute_treatment&#x27;, &#x27;p_MMA_urine&#x27;, &#x27;p_OHCblPlus&#x27;, &#x27;p_case&#x27;, &#x27;p_date_freezing&#x27;, &#x27;p_proteomics_runorder&#x27;, &#x27;p_responsive_to_acute_treatment&#x27;, &#x27;log10p_MMA_urine&#x27;, &#x27;log10p_OHCblPlus&#x27;, &#x27;log10p_case&#x27;, &#x27;log10p_responsive_to_acute_treatment&#x27;, &#x27;q_MMA_urine&#x27;, &#x27;q_OHCblPlus&#x27;, &#x27;q_case&#x27;, &#x27;q_date_freezing&#x27;, &#x27;q_proteomics_runorder&#x27;, &#x27;q_responsive_to_acute_treatment&#x27;, &#x27;stat_MMA_urine&#x27;, &#x27;stat_OHCblPlus&#x27;, &#x27;stat_case&#x27;, &#x27;stat_responsive_to_acute_treatment&#x27;, &#x27;stderr_MMA_urine&#x27;, &#x27;stderr_OHCblPlus&#x27;, &#x27;stderr_case&#x27;, &#x27;stderr_responsive_to_acute_treatment&#x27;, &#x27;is_transcriptomics&#x27;
      obsm: &#x27;X_pca&#x27;
      varm: &#x27;PCs&#x27;
      layers:   &#x27;log2_centered&#x27;
    proteomics: 221 x 4788
      obs:  &#x27;case&#x27;, &#x27;gender&#x27;, &#x27;consanguinity&#x27;, &#x27;mut_category&#x27;, &#x27;wgs_zygosity&#x27;, &#x27;acidosis&#x27;, &#x27;metabolic_acidosis&#x27;, &#x27;metabolic_ketoacidosis&#x27;, &#x27;ketosis&#x27;, &#x27;hyperammonemia&#x27;, &#x27;abnormal_muscle_tone&#x27;, &#x27;musc_hypotonia&#x27;, &#x27;musc_hypertonia&#x27;, &#x27;fct_respiratory_abnormality&#x27;, &#x27;dyspnea&#x27;, &#x27;tachypnea&#x27;, &#x27;reduced_consciousness&#x27;, &#x27;lethargy&#x27;, &#x27;coma&#x27;, &#x27;seizures&#x27;, &#x27;general_tonic_clonic_seizure&#x27;, &#x27;any_GI_problem&#x27;, &#x27;failure_to_thrive&#x27;, &#x27;any_delay&#x27;, &#x27;behavioral_abnormality&#x27;, &#x27;concurrent_infection&#x27;, &#x27;urine_ketones&#x27;, &#x27;dialysis&#x27;, &#x27;peritoneal_dialysis&#x27;, &#x27;insulin&#x27;, &#x27;diet&#x27;, &#x27;carnitine&#x27;, &#x27;cobalamin&#x27;, &#x27;bicarb&#x27;, &#x27;glucose_IV&#x27;, &#x27;cobalamin_responsive&#x27;, &#x27;antibiotic_treatment&#x27;, &#x27;protein_restriction&#x27;, &#x27;tube_feeding_day&#x27;, &#x27;tube_feeding_night&#x27;, &#x27;tube_feeding_overall&#x27;, &#x27;language_delay&#x27;, &#x27;any_neurological_abnormalities_chronic&#x27;, &#x27;impaired_kidney_fct&#x27;, &#x27;hemat_abnormality&#x27;, &#x27;anemia&#x27;, &#x27;neutropenia&#x27;, &#x27;skin_abnormalities&#x27;, &#x27;hearing_impairment&#x27;, &#x27;osteoporosis&#x27;, &#x27;failure_to_thrive_chronic&#x27;, &#x27;global_dev_delay_chr&#x27;, &#x27;hypotonia_chr&#x27;, &#x27;basal_ganglia_abnormality_chr&#x27;, &#x27;failure_to_thrive_or_tube_feeding&#x27;, &#x27;irritability&#x27;, &#x27;hyperventilation&#x27;, &#x27;hypothermia&#x27;, &#x27;somnolence&#x27;, &#x27;vomiting&#x27;, &#x27;dehydration&#x27;, &#x27;feeding_problem&#x27;, &#x27;responsive_to_acute_treatment&#x27;, &#x27;n_passage&#x27;, &#x27;date_collection&#x27;, &#x27;date_freezing&#x27;, &#x27;onset_age&#x27;, &#x27;OHCblMinus&#x27;, &#x27;OHCblPlus&#x27;, 
&#x27;ratio&#x27;, &#x27;SimultOHCblMinus&#x27;, &#x27;SimultOHCblPlus&#x27;, &#x27;AdoCblMinus&#x27;, &#x27;AdoCblPlus&#x27;, &#x27;SimultAdoCblMinus&#x27;, &#x27;SimultAdoCblPlus&#x27;, &#x27;prot_mut_level&#x27;, &#x27;rnaseq_mut_level&#x27;, &#x27;MMA_urine&#x27;, &#x27;ammonia_umolL&#x27;, &#x27;pH&#x27;, &#x27;base_excess&#x27;, &#x27;MMA_urine_after_treat&#x27;, &#x27;carnitine_dose&#x27;, &#x27;natural_protein_amount&#x27;, &#x27;total_protein_amount&#x27;, &#x27;weight_centile_quant&#x27;, &#x27;length_centile_quant&#x27;, &#x27;head_circumfernce_quant&#x27;, &#x27;proteomics_runorder&#x27;
      var:  &#x27;PG.ProteinDescriptions&#x27;, &#x27;PG.ProteinNames&#x27;, &#x27;PG.Qvalue&#x27;, &#x27;est_MMA_urine&#x27;, &#x27;est_OHCblPlus&#x27;, &#x27;est_case&#x27;, &#x27;est_responsive_to_acute_treatment&#x27;, &#x27;p_MMA_urine&#x27;, &#x27;p_OHCblPlus&#x27;, &#x27;p_case&#x27;, &#x27;p_date_freezing&#x27;, &#x27;p_proteomics_runorder&#x27;, &#x27;p_responsive_to_acute_treatment&#x27;, &#x27;log10p_MMA_urine&#x27;, &#x27;log10p_OHCblPlus&#x27;, &#x27;log10p_case&#x27;, &#x27;log10p_responsive_to_acute_treatment&#x27;, &#x27;q_MMA_urine&#x27;, &#x27;q_OHCblPlus&#x27;, &#x27;q_case&#x27;, &#x27;q_date_freezing&#x27;, &#x27;q_proteomics_runorder&#x27;, &#x27;q_responsive_to_acute_treatment&#x27;, &#x27;stat_MMA_urine&#x27;, &#x27;stat_OHCblPlus&#x27;, &#x27;stat_case&#x27;, &#x27;stat_responsive_to_acute_treatment&#x27;, &#x27;stderr_MMA_urine&#x27;, &#x27;stderr_OHCblPlus&#x27;, &#x27;stderr_case&#x27;, &#x27;stderr_responsive_to_acute_treatment&#x27;, &#x27;is_proteomics&#x27;
      obsm: &#x27;X_pca&#x27;
      varm: &#x27;PCs&#x27;
      layers:   &#x27;log2_centered&#x27;</pre>

<p>Feature-level attributes (e.g., <code class="language-plaintext highlighter-rouge">MuData</code> or <code class="language-plaintext highlighter-rouge">AnnData</code>’s <em>var</em>, <em>varm</em>,
<em>X</em>, or <em>layers</em>) can be seamlessly added to a <code class="language-plaintext highlighter-rouge">NapistuGraph</code> using
high-level workflows that handle identifier mapping, disambiguation, and
complex membership automatically.</p>

<p>Napistu supports three data input types:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">mudata.MuData</code> objects contain multiple <code class="language-plaintext highlighter-rouge">AnnData</code> objects where
<code class="language-plaintext highlighter-rouge">var</code> and <code class="language-plaintext highlighter-rouge">varm</code> attributes can be defined across multiple datasets.
Results can be stored in separate tables, as separate attributes
within the same table, or merged into a single attribute (e.g.,
combining transcript- and protein-level summaries).</li>
  <li><code class="language-plaintext highlighter-rouge">anndata.AnnData</code> objects contribute systematic identifiers from
their <em>var</em> table, while feature-level summaries can come from the <em>var</em>,
<em>varm</em>, <em>X</em>, or <em>layers</em> tables.</li>
  <li><code class="language-plaintext highlighter-rouge">pd.DataFrame</code> objects which include one or more columns with
systematic identifiers.</li>
</ul>
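<p>For the simplest of the three, a <code class="language-plaintext highlighter-rouge">pd.DataFrame</code> input just needs
a systematic-identifier column alongside the result columns. A minimal
sketch (identifiers and column names are made up):</p>

```python
import pandas as pd

# Minimal pd.DataFrame input: one systematic-identifier column plus
# result columns. Identifiers and column names are illustrative.
results_df = pd.DataFrame({
    "uniprot": ["P22033", "Q13426"],
    "log2_fc": [1.4, -0.7],
    "q_value": [0.001, 0.2],
})

# the identifier column should be non-null so every row can be matched
assert results_df["uniprot"].notna().all()
```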

<p>Here, I’ll start with a detailed example using results from a <code class="language-plaintext highlighter-rouge">MuData</code>
object.</p>

<h4 id="adding-latent-factors">Adding latent factors</h4>

<p>In the previous post, I applied Multi-Omics Factor Analysis (<em>MOFA</em>) to
decompose the dataset into 30 covarying latent factors. The factor
loadings are a 13,922 × 30 matrix: <code class="language-plaintext highlighter-rouge">mdata.varm["LFs"]</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">prepare_mudata_results_df</code> function prepares this tensor for
Napistu by:</p>

<ul>
  <li>Extracting modality-specific systematic identifiers from <em>var</em></li>
  <li>Combining them with the corresponding factor loadings</li>
  <li>Returning a dictionary that supports various strategies for merging
across modalities</li>
</ul>
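<p>Before running the real call, the shape of its output can be pictured
with plain pandas — a sketch only, with made-up names and values: per
modality, <em>var</em>-derived identifiers sit beside the rows of a
<em>varm</em>-style loading matrix, and the modalities are collected in a dict.</p>

```python
import numpy as np
import pandas as pd

# Sketch of a per-modality result: var-derived identifiers joined to the
# rows of a varm-style loading matrix, gathered in a dict keyed by
# modality. All names and values are illustrative.
var_ids = pd.Series(["ENSG_A", "ENSG_B", "ENSG_C"], name="ensembl_gene")
loadings = np.array([[0.1, -0.2], [0.0, 0.3], [-0.4, 0.05]])  # features x factors

modality_df = pd.concat(
    [var_ids, pd.DataFrame(loadings, columns=["LF1", "LF2"])],
    axis=1,
)
lfs_by_modality = {"transcriptomics": modality_df}
```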

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mofa_lfs</span> <span class="o">=</span> <span class="n">prepare_mudata_results_df</span><span class="p">(</span>
    <span class="n">mdata</span><span class="p">,</span>
    <span class="n">mudata_ontologies</span><span class="o">=</span><span class="n">MUDATA_ONTOLOGIES</span><span class="p">,</span>
    <span class="n">table_type</span><span class="o">=</span><span class="s">"varm"</span><span class="p">,</span>
    <span class="n">table_name</span><span class="o">=</span><span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">LFS</span><span class="p">,</span> <span class="c1"># this would be autodetected
</span>    <span class="n">results_attrs</span><span class="o">=</span><span class="n">LFS_OF_INTEREST</span><span class="p">,</span>
    <span class="n">table_colnames</span><span class="o">=</span><span class="p">[</span><span class="sa">f</span><span class="s">"LF</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">mdata</span><span class="p">.</span><span class="n">varm</span><span class="p">[</span><span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">LFS</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Below are five randomly sampled rows of the <em>MOFA</em> latent factors for
each modality:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="n">MODALITIES</span><span class="p">:</span>
  
    <span class="n">systematic_id_column</span> <span class="o">=</span> <span class="n">MUDATA_ONTOLOGIES</span><span class="p">[</span><span class="n">modality</span><span class="p">][</span><span class="s">"index_which_ontology"</span><span class="p">]</span>
  
    <span class="n">mofa_lfs_examples</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">mofa_lfs</span><span class="p">[</span><span class="n">modality</span><span class="p">]</span>
        <span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span> <span class="o">=</span> <span class="n">systematic_id_column</span><span class="p">)</span>
        <span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
        <span class="p">.</span><span class="n">copy</span><span class="p">()</span>
    <span class="p">)</span>
    <span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">mofa_lfs_examples</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
  
    <span class="n">display_tabulator</span><span class="p">(</span>
        <span class="n">mofa_lfs_examples</span><span class="p">,</span>
        <span class="n">caption</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"Extracted latent factors for </span><span class="si">{</span><span class="n">modality</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
        <span class="n">column_widths</span><span class="o">=</span><span class="p">{</span><span class="n">systematic_id_column</span> <span class="p">:</span> <span class="s">"25%"</span><span class="p">}</span>
    <span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Extracted latent factors for transcriptomics
</figcaption>

<div class="data-table" style="" data-table="[{&quot;ensembl_gene&quot;: &quot;ENSG00000164543&quot;, &quot;LF1&quot;: &quot;-0.095&quot;, &quot;LF2&quot;: &quot;-0.077&quot;, &quot;LF3&quot;: &quot;-0.000&quot;, &quot;LF4&quot;: &quot;0.035&quot;, &quot;LF5&quot;: &quot;0.070&quot;}, {&quot;ensembl_gene&quot;: &quot;ENSG00000228049&quot;, &quot;LF1&quot;: &quot;-0.003&quot;, &quot;LF2&quot;: &quot;-0.064&quot;, &quot;LF3&quot;: &quot;0.000&quot;, &quot;LF4&quot;: &quot;-0.100&quot;, &quot;LF5&quot;: &quot;-0.001&quot;}, {&quot;ensembl_gene&quot;: &quot;ENSG00000147548&quot;, &quot;LF1&quot;: &quot;0.004&quot;, &quot;LF2&quot;: &quot;0.019&quot;, &quot;LF3&quot;: &quot;0.000&quot;, &quot;LF4&quot;: &quot;0.082&quot;, &quot;LF5&quot;: &quot;-0.047&quot;}, {&quot;ensembl_gene&quot;: &quot;ENSG00000164008&quot;, &quot;LF1&quot;: &quot;-0.000&quot;, &quot;LF2&quot;: &quot;-0.027&quot;, &quot;LF3&quot;: &quot;-0.000&quot;, &quot;LF4&quot;: &quot;-0.069&quot;, &quot;LF5&quot;: &quot;0.034&quot;}, {&quot;ensembl_gene&quot;: &quot;ENSG00000120948&quot;, &quot;LF1&quot;: &quot;-0.033&quot;, &quot;LF2&quot;: &quot;0.034&quot;, &quot;LF3&quot;: &quot;-0.000&quot;, &quot;LF4&quot;: &quot;0.012&quot;, &quot;LF5&quot;: &quot;0.002&quot;}]" data-columns="[{&quot;title&quot;: &quot;ensembl_gene&quot;, &quot;field&quot;: &quot;ensembl_gene&quot;, &quot;width&quot;: &quot;25%&quot;}, {&quot;title&quot;: &quot;LF1&quot;, &quot;field&quot;: &quot;LF1&quot;}, {&quot;title&quot;: &quot;LF2&quot;, &quot;field&quot;: &quot;LF2&quot;}, {&quot;title&quot;: &quot;LF3&quot;, &quot;field&quot;: &quot;LF3&quot;}, {&quot;title&quot;: &quot;LF4&quot;, &quot;field&quot;: &quot;LF4&quot;}, {&quot;title&quot;: &quot;LF5&quot;, &quot;field&quot;: &quot;LF5&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Extracted latent factors for proteomics
</figcaption>

<div class="data-table" style="" data-table="[{&quot;uniprot&quot;: &quot;Q13426&quot;, &quot;LF1&quot;: &quot;-0.007&quot;, &quot;LF2&quot;: &quot;-0.000&quot;, &quot;LF3&quot;: &quot;-0.146&quot;, &quot;LF4&quot;: &quot;-0.004&quot;, &quot;LF5&quot;: &quot;-0.004&quot;}, {&quot;uniprot&quot;: &quot;Q5JU69;Q8N2E6&quot;, &quot;LF1&quot;: &quot;-0.001&quot;, &quot;LF2&quot;: &quot;0.041&quot;, &quot;LF3&quot;: &quot;0.096&quot;, &quot;LF4&quot;: &quot;-0.000&quot;, &quot;LF5&quot;: &quot;-0.005&quot;}, {&quot;uniprot&quot;: &quot;Q7Z7H5&quot;, &quot;LF1&quot;: &quot;0.033&quot;, &quot;LF2&quot;: &quot;-0.037&quot;, &quot;LF3&quot;: &quot;0.196&quot;, &quot;LF4&quot;: &quot;-0.006&quot;, &quot;LF5&quot;: &quot;-0.062&quot;}, {&quot;uniprot&quot;: &quot;Q16774&quot;, &quot;LF1&quot;: &quot;0.026&quot;, &quot;LF2&quot;: &quot;0.008&quot;, &quot;LF3&quot;: &quot;0.281&quot;, &quot;LF4&quot;: &quot;0.002&quot;, &quot;LF5&quot;: &quot;0.005&quot;}, {&quot;uniprot&quot;: &quot;O14548&quot;, &quot;LF1&quot;: &quot;-0.033&quot;, &quot;LF2&quot;: &quot;0.136&quot;, &quot;LF3&quot;: &quot;0.004&quot;, &quot;LF4&quot;: &quot;0.001&quot;, &quot;LF5&quot;: &quot;-0.007&quot;}]" data-columns="[{&quot;title&quot;: &quot;uniprot&quot;, &quot;field&quot;: &quot;uniprot&quot;, &quot;width&quot;: &quot;25%&quot;}, {&quot;title&quot;: &quot;LF1&quot;, &quot;field&quot;: &quot;LF1&quot;}, {&quot;title&quot;: &quot;LF2&quot;, &quot;field&quot;: &quot;LF2&quot;}, {&quot;title&quot;: &quot;LF3&quot;, &quot;field&quot;: &quot;LF3&quot;}, {&quot;title&quot;: &quot;LF4&quot;, &quot;field&quot;: &quot;LF4&quot;}, {&quot;title&quot;: &quot;LF5&quot;, &quot;field&quot;: &quot;LF5&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>To add these results to an <code class="language-plaintext highlighter-rouge">SBML_dfs</code> object, I’ll:</p>

<ul>
  <li>Create a <code class="language-plaintext highlighter-rouge">pandas.DataFrame</code> with at most one row per distinct
molecular species in the model</li>
  <li>Store this DataFrame as a key-value pair in the <em>species_data</em>
dictionary attribute of the <code class="language-plaintext highlighter-rouge">SBML_dfs</code></li>
</ul>

<p>Molecular species are linked to various ontologies (e.g., <em>Ensembl</em>,
<em>UniProt</em>, <em>Entrez</em>). Napistu can distinguish genes, transcripts, and
proteins as distinct molecular species (“dogmatic mode”). However, the
loaded model merges these into a single species, ignoring such
distinctions.</p>

<p>Merging ’omics data into the pathway representation involves:</p>

<ul>
  <li>Matching based on identifiers</li>
  <li>Resolving collisions (e.g., when transcripts and proteins map to the
same species, or multiple proteins map to one molecular species)</li>
</ul>

<p>These challenges are handled by <code class="language-plaintext highlighter-rouge">mount.bind_dict_of_wide_results</code>. In
this example, I “stagger” results to keep latent factors separate based
on whether they correspond to transcripts or proteins.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mount</span><span class="p">.</span><span class="n">bind_dict_of_wide_results</span><span class="p">(</span>
    <span class="n">sbml_dfs</span><span class="p">,</span>
    <span class="n">mofa_lfs</span><span class="p">,</span>
    <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">MOFA_LFS</span><span class="p">,</span>
    <span class="n">strategy</span> <span class="o">=</span> <span class="s">"stagger"</span><span class="p">,</span>
    <span class="n">species_identifiers</span> <span class="o">=</span> <span class="n">species_identifiers</span><span class="p">,</span>
    <span class="c1"># ontologies were already renamed to the controlled vocabulary in prepare_mudata_results_df()
</span>    <span class="n">ontologies</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="c1"># ignored because species_identifiers is provided
</span>    <span class="n">dogmatic</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
    <span class="c1"># for clarity; default is True
</span>    <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="n">verbose</span> <span class="o">=</span> <span class="bp">False</span>
<span class="p">)</span>
</code></pre></div></div>

<p>The outcome is a single <em>species_data</em> table integrating latent factors
from both modalities, mapped onto the model’s molecular species.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">example_species_data</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">species_data</span><span class="p">[</span><span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">MOFA_LFS</span><span class="p">].</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">).</span><span class="n">copy</span><span class="p">()</span>
<span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">example_species_data</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span><span class="n">example_species_data</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span> <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">)</span>
</code></pre></div></div>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;s_id&quot;: &quot;S00000001&quot;, &quot;LF1_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF2_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF3_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF4_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF5_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF1_proteomics&quot;: &quot;-0.010&quot;, &quot;LF2_proteomics&quot;: &quot;0.063&quot;, &quot;LF3_proteomics&quot;: &quot;-0.031&quot;, &quot;LF4_proteomics&quot;: &quot;-0.001&quot;, &quot;LF5_proteomics&quot;: &quot;0.023&quot;, &quot;feature_id&quot;: &quot;11618&quot;}, {&quot;s_id&quot;: &quot;S00000009&quot;, &quot;LF1_transcriptomics&quot;: &quot;-0.044&quot;, &quot;LF2_transcriptomics&quot;: &quot;-0.008&quot;, &quot;LF3_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF4_transcriptomics&quot;: &quot;-0.037&quot;, &quot;LF5_transcriptomics&quot;: &quot;0.017&quot;, &quot;LF1_proteomics&quot;: &quot;0.025&quot;, &quot;LF2_proteomics&quot;: &quot;-0.007&quot;, &quot;LF3_proteomics&quot;: &quot;0.033&quot;, &quot;LF4_proteomics&quot;: &quot;0.001&quot;, &quot;LF5_proteomics&quot;: &quot;-0.001&quot;, &quot;feature_id&quot;: &quot;11437,1660&quot;}, {&quot;s_id&quot;: &quot;S00000011&quot;, &quot;LF1_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF2_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF3_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF4_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF5_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF1_proteomics&quot;: &quot;0.056&quot;, &quot;LF2_proteomics&quot;: &quot;-0.150&quot;, &quot;LF3_proteomics&quot;: &quot;0.003&quot;, &quot;LF4_proteomics&quot;: &quot;-0.031&quot;, &quot;LF5_proteomics&quot;: &quot;-0.020&quot;, &quot;feature_id&quot;: &quot;9864&quot;}, {&quot;s_id&quot;: &quot;S00000012&quot;, &quot;LF1_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF2_transcriptomics&quot;: &quot;0.000&quot;, 
&quot;LF3_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF4_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF5_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF1_proteomics&quot;: &quot;-0.017&quot;, &quot;LF2_proteomics&quot;: &quot;-0.072&quot;, &quot;LF3_proteomics&quot;: &quot;0.027&quot;, &quot;LF4_proteomics&quot;: &quot;-0.016&quot;, &quot;LF5_proteomics&quot;: &quot;0.001&quot;, &quot;feature_id&quot;: &quot;9967&quot;}, {&quot;s_id&quot;: &quot;S00000013&quot;, &quot;LF1_transcriptomics&quot;: &quot;-0.119&quot;, &quot;LF2_transcriptomics&quot;: &quot;-0.062&quot;, &quot;LF3_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF4_transcriptomics&quot;: &quot;-0.124&quot;, &quot;LF5_transcriptomics&quot;: &quot;0.007&quot;, &quot;LF1_proteomics&quot;: &quot;-0.004&quot;, &quot;LF2_proteomics&quot;: &quot;-0.018&quot;, &quot;LF3_proteomics&quot;: &quot;0.007&quot;, &quot;LF4_proteomics&quot;: &quot;-0.004&quot;, &quot;LF5_proteomics&quot;: &quot;0.000&quot;, &quot;feature_id&quot;: &quot;50,9967&quot;}]" data-columns="[{&quot;title&quot;: &quot;s_id&quot;, &quot;field&quot;: &quot;s_id&quot;}, {&quot;title&quot;: &quot;LF1_transcriptomics&quot;, &quot;field&quot;: &quot;LF1_transcriptomics&quot;}, {&quot;title&quot;: &quot;LF2_transcriptomics&quot;, &quot;field&quot;: &quot;LF2_transcriptomics&quot;}, {&quot;title&quot;: &quot;LF3_transcriptomics&quot;, &quot;field&quot;: &quot;LF3_transcriptomics&quot;}, {&quot;title&quot;: &quot;LF4_transcriptomics&quot;, &quot;field&quot;: &quot;LF4_transcriptomics&quot;}, {&quot;title&quot;: &quot;LF5_transcriptomics&quot;, &quot;field&quot;: &quot;LF5_transcriptomics&quot;}, {&quot;title&quot;: &quot;LF1_proteomics&quot;, &quot;field&quot;: &quot;LF1_proteomics&quot;}, {&quot;title&quot;: &quot;LF2_proteomics&quot;, &quot;field&quot;: &quot;LF2_proteomics&quot;}, {&quot;title&quot;: &quot;LF3_proteomics&quot;, &quot;field&quot;: &quot;LF3_proteomics&quot;}, {&quot;title&quot;: &quot;LF4_proteomics&quot;, 
&quot;field&quot;: &quot;LF4_proteomics&quot;}, {&quot;title&quot;: &quot;LF5_proteomics&quot;, &quot;field&quot;: &quot;LF5_proteomics&quot;}, {&quot;title&quot;: &quot;feature_id&quot;, &quot;field&quot;: &quot;feature_id&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div class="content-section ai-aside">
  <div class="section-content">
    <p>The <code class="language-plaintext highlighter-rouge">scverse</code> and <code class="language-plaintext highlighter-rouge">matching</code>
subpackages are recent additions to Napistu, designed to make the
framework more user-friendly. The goal is to reduce technical barriers
for researchers by providing streamlined workflows for common formats
like <code class="language-plaintext highlighter-rouge">MuData</code> and <code class="language-plaintext highlighter-rouge">AnnData</code> objects. (Thanks to Vito Zanotelli for
encouraging this direction!)</p>

<p>This module was developed using AI-assisted coding, which revealed some
interesting insights into the strengths and weaknesses of different AI
tools for scientific software development. Language models like Claude
were helpful for understanding biological data structures and offering
conceptual guidance, but struggled when it came to extending an existing
codebase — often suggesting overly complex or impractical solutions.
Code-focused AI tools like Cursor proved more effective for the actual
implementation work.</p>

<p>The development process followed an iterative approach: first building
prototypes to understand the functionality requirements, then drafting
comprehensive tests, followed by implementing individual functions with
continuous testing, and finally polishing the code with proper
documentation and type annotations. This AI-assisted workflow
significantly accelerated development while maintaining code quality, a
pattern that’s becoming increasingly valuable for scientific software
projects.</p>

  </div>
</div>

<h4 id="adding-statistical-summaries-and-modality-masks">Adding statistical summaries and modality masks</h4>

<p>Having shown how to bind <em>MOFA</em> factor loadings to the pathway, I’ll now
add differential expression results.</p>

<p>The disease phenotypes of interest are:</p>

<ul>
  <li><em>OHCblPlus</em>: enzymatic activity readout</li>
  <li><em>MMA_urine</em>: metabolic burden indicator</li>
  <li><em>case</em>: disease status</li>
  <li><em>responsive_to_acute_treatment</em>: effectiveness of acute vitamin
supplementation</li>
</ul>

<p>For each phenotype, I’ll add the following statistical summaries:</p>

<ul>
  <li><em>estimate</em>: the regression effect size</li>
  <li><em>statistic</em>: the regression t-statistic</li>
  <li><em>log10p</em>: the $-\log_{10}(\text{p-value})$ (calculated this way to
avoid numerical underflow)</li>
  <li><em>q-value</em>: the Benjamini-Hochberg FDR-adjusted p-value</li>
</ul>
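<p>The q-value column can be reproduced in a few lines of numpy using
the standard Benjamini-Hochberg step-up procedure. (For very small
p-values, the <em>log10p</em> column should come from a log-space
survival function rather than taking
<code class="language-plaintext highlighter-rouge">np.log10</code> of an
already-underflowed p-value.)</p>

```python
import numpy as np

def bh_qvalues(p):
    """Benjamini-Hochberg FDR-adjusted p-values (step-up procedure)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    # scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, sweeping from the largest p-value down
    q = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(q, 0.0, 1.0)
    return out

qvals = bh_qvalues([0.01, 0.02, 0.03, 0.04])
```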

<p>I’ll also include the q-values for covariates used in the regressions
— namely, the nonlinear associations of <em>freezing_date</em> and
<em>proteomics_run_order</em>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># now we can add .var attributes from the mdata
</span><span class="n">diffex_results</span> <span class="o">=</span> <span class="n">prepare_mudata_results_df</span><span class="p">(</span>
    <span class="n">mdata</span><span class="p">,</span>
    <span class="n">mudata_ontologies</span><span class="o">=</span><span class="n">MUDATA_ONTOLOGIES</span><span class="p">,</span>
    <span class="n">table_type</span><span class="o">=</span><span class="s">"var"</span><span class="p">,</span>
    <span class="n">results_attrs</span><span class="o">=</span><span class="n">VAR_VARS</span><span class="p">,</span>
    <span class="n">level</span> <span class="o">=</span> <span class="s">"adata"</span>
<span class="p">)</span>

<span class="n">mount</span><span class="p">.</span><span class="n">bind_dict_of_wide_results</span><span class="p">(</span>
    <span class="n">sbml_dfs</span><span class="p">,</span>
    <span class="n">diffex_results</span><span class="p">,</span>
    <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">VAR_LEVEL_RESULTS</span><span class="p">,</span>
    <span class="n">strategy</span> <span class="o">=</span> <span class="s">"stagger"</span><span class="p">,</span>
    <span class="n">species_identifiers</span> <span class="o">=</span> <span class="n">species_identifiers</span><span class="p">,</span>
    <span class="n">verbose</span> <span class="o">=</span> <span class="bp">False</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Finally, I’ll add modality-level indicator variables. For loose results
in a <code class="language-plaintext highlighter-rouge">pandas.DataFrame</code>, I can use <code class="language-plaintext highlighter-rouge">bind_wide_results</code> to add them
directly to the <code class="language-plaintext highlighter-rouge">SBML_dfs</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="n">MODALITIES</span><span class="p">:</span>
    <span class="n">anndata_results_df</span> <span class="o">=</span> <span class="n">prepare_anndata_results_df</span><span class="p">(</span>
        <span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">],</span>
        <span class="n">table_type</span><span class="o">=</span><span class="s">"var"</span><span class="p">,</span>
        <span class="n">index_which_ontology</span> <span class="o">=</span> <span class="n">MUDATA_ONTOLOGIES</span><span class="p">[</span><span class="n">modality</span><span class="p">][</span><span class="s">"index_which_ontology"</span><span class="p">],</span>
        <span class="n">results_attrs</span><span class="o">=</span><span class="n">ADATA_LEVEL_VARS</span><span class="p">[</span><span class="n">modality</span><span class="p">]</span>
    <span class="p">)</span>

    <span class="n">mount</span><span class="p">.</span><span class="n">bind_wide_results</span><span class="p">(</span>
        <span class="n">sbml_dfs</span><span class="p">,</span>
        <span class="n">anndata_results_df</span><span class="p">,</span>
        <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">MODALITY_VAR_LEVEL_RESULTS_STR</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">modality</span> <span class="o">=</span> <span class="n">modality</span><span class="p">),</span>
        <span class="n">species_identifiers</span> <span class="o">=</span> <span class="n">species_identifiers</span><span class="p">,</span>
        <span class="n">ontologies</span> <span class="o">=</span> <span class="n">MUDATA_ONTOLOGIES</span><span class="p">[</span><span class="n">modality</span><span class="p">][</span><span class="s">"ontologies"</span><span class="p">]</span>
    <span class="p">)</span>
</code></pre></div></div>

<h3 id="sbml_dfs-rightarrow-napistugraph"><code class="language-plaintext highlighter-rouge">SBML_dfs</code> $\Rightarrow$ <code class="language-plaintext highlighter-rouge">NapistuGraph</code></h3>

<p>Now, I can pass selected attributes from the <em>species_data</em> tables to
the <code class="language-plaintext highlighter-rouge">NapistuGraph</code> object. While the low-level, flexible
<code class="language-plaintext highlighter-rouge">set_graph_attrs</code> method (which also supports setting edge attributes)
is available, I’ll use the more user-friendly function instead:
<code class="language-plaintext highlighter-rouge">data_handling.add_results_table_to_graph()</code>.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>When working with an <code class="language-plaintext highlighter-rouge">SBML_dfs</code>
loaded from GCS, users can typically rely on the pre-generated
<code class="language-plaintext highlighter-rouge">NapistuGraph</code> bundled with it. These graphs are:</p>

<ul>
  <li>Directed (though reversible reactions, such as <em>STRING</em>
interactions, appear as paired forward and reverse edges)</li>
  <li>Wired based on a regulatory hierarchy: vertices are arranged in
tiers — regulators → catalysts → substrates → reactions → products</li>
  <li>Sensibly weighted: edges reflect meaningful interaction weights
where applicable</li>
</ul>

<p>The only modification I’ll make for this analysis is reversing the
graph’s edges, so that signals can flow from effects (e.g., dysregulated
genes) upstream to their potential causes (e.g., transcriptional or
enzymatic regulators).</p>

  </div>
</div>

<p>Here’s a view of the graph showing a random selection of vertices and
edges:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vertex_indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">napistu_graph</span><span class="p">.</span><span class="n">vs</span><span class="p">),</span> <span class="mi">10</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="n">vertices_df</span> <span class="o">=</span><span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
        <span class="n">i</span> <span class="p">:</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">vs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">attributes</span><span class="p">()</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">vertex_indices</span>
    <span class="p">})</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">sc_Source</span> <span class="o">=</span> <span class="s">"."</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"name"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="s">"."</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">edge_indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">napistu_graph</span><span class="p">.</span><span class="n">es</span><span class="p">),</span> <span class="mi">10</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="n">edges_df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
        <span class="n">i</span> <span class="p">:</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">es</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">attributes</span><span class="p">()</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">edge_indices</span>
    <span class="p">})</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">sc_Source</span> <span class="o">=</span> <span class="s">"."</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">"from"</span><span class="p">,</span> <span class="s">"to"</span><span class="p">])</span>
<span class="p">)</span>
<span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">edges_df</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span><span class="n">vertices_df</span><span class="p">,</span> <span class="n">caption</span><span class="o">=</span><span class="s">"Vertices"</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span> <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span><span class="n">edges_df</span><span class="p">,</span> <span class="n">caption</span><span class="o">=</span><span class="s">"Edges"</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span> <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Vertices
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;name&quot;: &quot;SC00000539&quot;, &quot;node_name&quot;: &quot;GlcNAc-GlcA-GlcNAc&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;R00389169&quot;, &quot;node_name&quot;: &quot;RELA modifier of HMOX1&quot;, &quot;node_type&quot;: &quot;reaction&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;.&quot;}, {&quot;name&quot;: &quot;SC00027413&quot;, &quot;node_name&quot;: &quot;solute carrier family 25 member 47 [cellular_component]&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;SC00026835&quot;, &quot;node_name&quot;: &quot;GDNF family receptor alpha like [cellular_component]&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;SC00022715&quot;, &quot;node_name&quot;: &quot;dipeptidyl peptidase 8 [cellular_component]&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;SC00021226&quot;, &quot;node_name&quot;: &quot;olfactory receptor family 1 subfamily C member 1 [cellular_component]&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;SC00022708&quot;, &quot;node_name&quot;: &quot;CA4&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;SC00014948&quot;, &quot;node_name&quot;: &quot;SLC9B2 dimer&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, 
&quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;R00065224&quot;, &quot;node_name&quot;: &quot;CEBPB modifier of F8&quot;, &quot;node_type&quot;: &quot;reaction&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;.&quot;}, {&quot;name&quot;: &quot;SC00018803&quot;, &quot;node_name&quot;: &quot;THOC5&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}]" data-columns="[{&quot;title&quot;: &quot;name&quot;, &quot;field&quot;: &quot;name&quot;}, {&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;}, {&quot;title&quot;: &quot;node_type&quot;, &quot;field&quot;: &quot;node_type&quot;}, {&quot;title&quot;: &quot;sc_Source&quot;, &quot;field&quot;: &quot;sc_Source&quot;}, {&quot;title&quot;: &quot;species_type&quot;, &quot;field&quot;: &quot;species_type&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Edges
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;index&quot;: &quot;SC00030954 / SC00035419&quot;, &quot;r_id&quot;: &quot;R03757688&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 6.329113924050633, &quot;weight&quot;: 6.329113924050633, &quot;upstream_weight&quot;: 6.329113924050633, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00032212 / SC00033302&quot;, &quot;r_id&quot;: &quot;R03840042&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 6.410256410256411, &quot;weight&quot;: 6.410256410256411, &quot;upstream_weight&quot;: 6.410256410256411, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00031764 / SC00033418&quot;, &quot;r_id&quot;: &quot;R03813825&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 3.058103975535168, &quot;weight&quot;: 3.058103975535168, &quot;upstream_weight&quot;: 3.058103975535168, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00034806 / SC00029555&quot;, &quot;r_id&quot;: &quot;R03631600&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;reverse&quot;, &quot;string_wt&quot;: 3.4602076124567476, &quot;weight&quot;: 3.4602076124567476, 
&quot;upstream_weight&quot;: 3.4602076124567476, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00006965 / SC00031210&quot;, &quot;r_id&quot;: &quot;R01149106&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 3.745318352059925, &quot;weight&quot;: 3.745318352059925, &quot;upstream_weight&quot;: 3.745318352059925, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00032461 / SC00034934&quot;, &quot;r_id&quot;: &quot;R03853793&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 3.4843205574912894, &quot;weight&quot;: 3.4843205574912894, &quot;upstream_weight&quot;: 3.4843205574912894, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00011294 / SC00033161&quot;, &quot;r_id&quot;: &quot;R01452767&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 5.208333333333333, &quot;weight&quot;: 5.208333333333333, &quot;upstream_weight&quot;: 5.208333333333333, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00028387 / SC00027845&quot;, &quot;r_id&quot;: &quot;R03432647&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;reverse&quot;, &quot;string_wt&quot;: 1.081081081081081, &quot;weight&quot;: 1.081081081081081, 
&quot;upstream_weight&quot;: 1.081081081081081, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00030930 / SC00031882&quot;, &quot;r_id&quot;: &quot;R03755607&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 2.8490028490028494, &quot;weight&quot;: 2.8490028490028494, &quot;upstream_weight&quot;: 2.8490028490028494, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00019035 / SC00023658&quot;, &quot;r_id&quot;: &quot;R02024262&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 1.0204081632653061, &quot;weight&quot;: 1.0204081632653061, &quot;upstream_weight&quot;: 1.0204081632653061, &quot;source_wt&quot;: 10}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;r_id&quot;, &quot;field&quot;: &quot;r_id&quot;}, {&quot;title&quot;: &quot;sbo_term&quot;, &quot;field&quot;: &quot;sbo_term&quot;}, {&quot;title&quot;: &quot;stoichiometry&quot;, &quot;field&quot;: &quot;stoichiometry&quot;}, {&quot;title&quot;: &quot;sc_Source&quot;, &quot;field&quot;: &quot;sc_Source&quot;}, {&quot;title&quot;: &quot;species_type&quot;, &quot;field&quot;: &quot;species_type&quot;}, {&quot;title&quot;: &quot;r_isreversible&quot;, &quot;field&quot;: &quot;r_isreversible&quot;}, {&quot;title&quot;: &quot;direction&quot;, &quot;field&quot;: &quot;direction&quot;}, {&quot;title&quot;: &quot;string_wt&quot;, &quot;field&quot;: &quot;string_wt&quot;}, {&quot;title&quot;: &quot;weight&quot;, &quot;field&quot;: &quot;weight&quot;}, {&quot;title&quot;: &quot;upstream_weight&quot;, 
&quot;field&quot;: &quot;upstream_weight&quot;}, {&quot;title&quot;: &quot;source_wt&quot;, &quot;field&quot;: &quot;source_wt&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div class="content-section bio-section">
  <div class="section-content">
    <p>Most <code class="language-plaintext highlighter-rouge">NapistuGraph</code> objects —
including this one — contain both species and reaction vertices:</p>

<ul>
  <li><strong>Species vertices</strong> represent molecular entities (genes, proteins,
metabolites, etc.)</li>
  <li><strong>Reaction vertices</strong> represent biochemical or regulatory reactions</li>
</ul>

<p>This bipartite structure is borrowed from metabolic modeling where
metabolites are connected via reaction nodes that define their
interconversion. But, the <code class="language-plaintext highlighter-rouge">NapistuGraph</code> used here is not a strict
bipartite network since species nodes can be connected to other species
nodes.</p>

  </div>
</div>

<h4 id="creating-an-appropriate-graph-with-data-attributes">Creating an appropriate graph with data attributes</h4>

<p>To prepare the graph for network propagation, I’ll:</p>

<ol>
  <li>Reverse all edges to allow signal flow from observed effects to
their upstream causes.</li>
  <li>Add species attributes from <em>species_data</em> to the vertices using
<code class="language-plaintext highlighter-rouge">add_results_table_to_graph()</code>. This entails specifying which
attributes to use and how to transform them to make them suitable
for personalized PageRank (non-negative, where larger values
represent stronger signals).</li>
</ol>
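<p>A custom transformation here is just a vectorized function mapping
raw scores to non-negative weights. The entries below are hypothetical
examples of what a <code class="language-plaintext highlighter-rouge">CUSTOM_TRANSFORMATIONS</code>-style mapping might
contain; the real keys and signatures are Napistu's own:</p>

```python
import numpy as np

# hypothetical transformation entries: each maps raw species scores to
# non-negative values where larger means a stronger signal, as the
# personalized PageRank reset distribution requires
CUSTOM_TRANSFORMATIONS = {
    # magnitude of a signed effect size or factor loading
    "abs": np.abs,
    # -log10(p)-style scores: clip stray negatives to zero
    "clip_nonnegative": lambda x: np.clip(x, 0.0, None),
}

weights = CUSTOM_TRANSFORMATIONS["abs"](np.array([-2.0, 0.5, 0.0]))
```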

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># reverse all edges in place
</span><span class="n">napistu_graph</span><span class="p">.</span><span class="n">reverse_edges</span><span class="p">()</span>
<span class="k">assert</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">is_reversed</span>

<span class="k">for</span> <span class="n">attr_to_graph</span> <span class="ow">in</span> <span class="n">ATTRIBUTES_TO_GRAPH_SPEC</span><span class="p">:</span>
    
    <span class="n">data_handling</span><span class="p">.</span><span class="n">add_results_table_to_graph</span><span class="p">(</span>
        <span class="n">napistu_graph</span><span class="p">,</span>
        <span class="n">sbml_dfs</span><span class="p">,</span>
        <span class="n">attribute_names</span> <span class="o">=</span> <span class="n">attr_to_graph</span><span class="p">[</span><span class="s">"attribute_names"</span><span class="p">],</span>
        <span class="n">table_name</span> <span class="o">=</span> <span class="n">attr_to_graph</span><span class="p">[</span><span class="s">"table_name"</span><span class="p">],</span>
        <span class="n">transformation</span> <span class="o">=</span> <span class="n">attr_to_graph</span><span class="p">[</span><span class="s">"transformation"</span><span class="p">],</span>
        <span class="n">custom_transformations</span> <span class="o">=</span> <span class="n">CUSTOM_TRANSFORMATIONS</span>
    <span class="p">)</span>
</code></pre></div></div>

<h1 id="network-propagation-with-personalized-pagerank-ppr">Network propagation with Personalized PageRank (PPR)</h1>

<p>To link gene-level changes to common regulators, I’ll apply personalized
PageRank (PPR) to each vertex attribute in the NapistuGraph.
Conceptually, PageRank begins with a signal on a random vertex that, at
each step, either:</p>

<ul>
  <li>Moves to a connected child node with probability $\alpha$</li>
  <li>Resets to a random vertex with probability $1-\alpha$</li>
</ul>

<p>In PPR, the reset step is biased — rather than resetting to any vertex
uniformly, it follows a user-defined probability distribution, often
weighted by input signals like gene dysregulation scores. Repeating this
random walk causes the signal to accumulate at hub vertices — nodes
central to the input signal.</p>
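<p>As a minimal sketch (not the Napistu or igraph implementation), PPR can be
computed by iterating the update $x \leftarrow \alpha P^\top x + (1-\alpha) r$,
where $P$ is the row-stochastic transition matrix and $r$ the reset
distribution:</p>

```python
import numpy as np

def personalized_pagerank(A, reset, damping=0.85, n_iter=200):
    """Power iteration for PPR on a small dense adjacency matrix.

    A[i, j] = 1 encodes an edge i -> j; reset holds non-negative
    per-vertex reset weights (normalized internally).
    """
    out_deg = A.sum(axis=1, keepdims=True)
    # row-stochastic transition matrix; dangling rows stay all-zero,
    # so some mass leaks and the result sums to slightly less than 1
    P = np.divide(A, out_deg, out=np.zeros_like(A, dtype=float), where=out_deg > 0)
    r = np.asarray(reset, dtype=float)
    r = r / r.sum()
    x = r.copy()
    for _ in range(n_iter):
        # continue the walk with probability damping, otherwise reset to r
        x = damping * (P.T @ x) + (1 - damping) * r
    return x

# chain 0 -> 1 -> 2 with all reset mass on vertex 0:
# signal decays geometrically downstream of the seeded vertex
A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
scores = personalized_pagerank(A, [1.0, 0.0, 0.0])
```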

<p>The actual PageRank algorithm finds the stationary distribution of this
process using some slick linear algebra: power iteration combined with
sparse matrix storage and operations. This
excellent <a href="https://www.r-bloggers.com/2014/04/from-random-walks-to-personalized-pagerank/">blog
post</a>
by Stefan Weigert offers a clear and intuitive overview of PPR. For
example, here is Stefan’s visualization of a random walk following the
PPR process which really nails the intuition for me:</p>

<p><img src="https://i2.wp.com/1.bp.blogspot.com/-5DGkqiLF87U/Uzqm0Vah16I/AAAAAAAABIE/aPgVRreUvts/s1600/g4.gif" alt="Personalized pagerank
animation" /></p>

<h3 id="applying-ppr">Applying PPR</h3>

<p>I’ve prepared the conditions for PPR by defining attributes as reset
probability distributions (after L1 normalization). To ensure
non-negativity and highlight signal strength, I applied ad hoc
transformations — like converting $\log_{10}(\text{p-values})$ to
$-\log_{10}(\text{p-values})$ and squaring effect sizes. While these
choices are reasonable, ideally such transformations would be learned
rather than preset.</p>
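<p>With hypothetical attribute values, the transformation and L1-normalization
steps might look like this sketch (in the real pipeline these are handled via
the <code class="language-plaintext highlighter-rouge">transformation</code> argument of
<code class="language-plaintext highlighter-rouge">add_results_table_to_graph()</code>):</p>

```python
import numpy as np
import pandas as pd

# hypothetical per-vertex signals (not the real species_data table)
attrs = pd.DataFrame({
    "log10_pvalue": np.log10([0.001, 0.5, 0.04]),
    "effect_size": [-2.0, 0.1, 1.5],
})

# ad hoc transformations: non-negative, larger = stronger signal
transformed = pd.DataFrame({
    "neg_log10_pvalue": -attrs["log10_pvalue"],
    "squared_effect": attrs["effect_size"] ** 2,
})

# L1-normalize each column so it can serve as a PPR reset distribution
reset_distributions = transformed / transformed.sum(axis=0)
```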

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">annotated_vertices</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">get_vertex_dataframe</span><span class="p">()</span>

<span class="c1"># find valid attributes: numeric with more than one distinct value
</span><span class="n">invalid_attributes</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">annotated_vertices</span><span class="p">.</span><span class="n">columns</span> <span class="k">if</span> <span class="n">annotated_vertices</span><span class="p">[</span><span class="n">x</span><span class="p">].</span><span class="n">dtype</span> <span class="ow">not</span> <span class="ow">in</span> <span class="p">[</span><span class="s">"float64"</span><span class="p">,</span> <span class="s">"int64"</span><span class="p">]</span> <span class="ow">or</span> <span class="n">annotated_vertices</span><span class="p">[</span><span class="n">x</span><span class="p">].</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">valid_attributes</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">annotated_vertices</span><span class="p">.</span><span class="n">columns</span> <span class="k">if</span> <span class="n">x</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">invalid_attributes</span><span class="p">]</span>

<span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Invalid attributes: </span><span class="si">{</span><span class="n">invalid_attributes</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Valid attributes: </span><span class="si">{</span><span class="n">valid_attributes</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># create masks for each modality
</span><span class="k">assert</span> <span class="nb">all</span><span class="p">(</span><span class="n">x</span> <span class="ow">in</span> <span class="n">valid_attributes</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">REGEXES_TO_MASKS</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>
<span class="n">valid_attributes</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">valid_attributes</span><span class="p">)</span> <span class="o">-</span> <span class="nb">set</span><span class="p">(</span><span class="n">REGEXES_TO_MASKS</span><span class="p">.</span><span class="n">values</span><span class="p">()))</span>

<span class="n">ppr_results</span> <span class="o">=</span> <span class="n">net_propagation</span><span class="p">.</span><span class="n">net_propagate_attributes</span><span class="p">(</span>
    <span class="n">napistu_graph</span><span class="p">,</span>
    <span class="n">attributes</span> <span class="o">=</span> <span class="n">valid_attributes</span><span class="p">,</span>
    <span class="n">propagation_method</span> <span class="o">=</span> <span class="s">"personalized_pagerank"</span><span class="p">,</span>
    <span class="n">additional_propagation_args</span> <span class="o">=</span> <span class="p">{</span> <span class="s">"damping"</span><span class="p">:</span> <span class="mf">0.85</span> <span class="p">}</span>
<span class="p">)</span>
</code></pre></div></div>

<h4 id="controlling-for-pprs-biases">Controlling for PPR’s biases</h4>

<p>While PPR reveals convergence points of biological signals in a network,
it is inherently biased toward highly connected hub nodes — an effect
evident when examining vertices with the highest median PPR values
across attributes.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_5_by_median_ppr</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">ppr_results</span><span class="p">.</span><span class="n">median</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)[</span><span class="mi">0</span><span class="p">:</span><span class="mi">5</span><span class="p">]</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="s">"median PPR"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">to_frame</span><span class="p">()</span>
    <span class="p">.</span><span class="n">merge</span><span class="p">(</span>
        <span class="n">annotated_vertices</span><span class="p">[[</span><span class="s">"name"</span><span class="p">,</span> <span class="s">"node_name"</span><span class="p">]],</span>
        <span class="n">left_index</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
        <span class="n">right_on</span> <span class="o">=</span> <span class="s">"name"</span>
    <span class="p">)</span>
<span class="p">)</span>
<span class="n">top_5_by_median_ppr</span><span class="p">[</span><span class="s">"degree"</span><span class="p">]</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">degree</span><span class="p">(</span><span class="n">top_5_by_median_ppr</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">top_5_by_median_ppr</span><span class="p">,</span> <span class="s">"{:.2e}"</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span><span class="n">top_5_by_median_ppr</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span> <span class="o">=</span> <span class="bp">True</span><span class="p">),</span> <span class="n">width</span><span class="o">=</span><span class="s">"500px"</span><span class="p">)</span>
</code></pre></div></div>

<div class="data-table" style="width: 500px; display: inline-block;" data-table="[{&quot;median PPR&quot;: &quot;7.35e-04&quot;, &quot;name&quot;: &quot;SC00003547&quot;, &quot;node_name&quot;: &quot;ACTB&quot;, &quot;degree&quot;: &quot;1.37e+04&quot;}, {&quot;median PPR&quot;: &quot;5.49e-04&quot;, &quot;name&quot;: &quot;SC00025851&quot;, &quot;node_name&quot;: &quot;GAPDH&quot;, &quot;degree&quot;: &quot;1.04e+04&quot;}, {&quot;median PPR&quot;: &quot;4.86e-04&quot;, &quot;name&quot;: &quot;SC00004948&quot;, &quot;node_name&quot;: &quot;TP53 gene&quot;, &quot;degree&quot;: &quot;9.41e+03&quot;}, {&quot;median PPR&quot;: &quot;4.61e-04&quot;, &quot;name&quot;: &quot;SC00003685&quot;, &quot;node_name&quot;: &quot;LRRK2&quot;, &quot;degree&quot;: &quot;9.40e+03&quot;}, {&quot;median PPR&quot;: &quot;4.09e-04&quot;, &quot;name&quot;: &quot;SC00004540&quot;, &quot;node_name&quot;: &quot;INS gene&quot;, &quot;degree&quot;: &quot;7.92e+03&quot;}]" data-columns="[{&quot;title&quot;: &quot;median PPR&quot;, &quot;field&quot;: &quot;median PPR&quot;}, {&quot;title&quot;: &quot;name&quot;, &quot;field&quot;: &quot;name&quot;}, {&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;}, {&quot;title&quot;: &quot;degree&quot;, &quot;field&quot;: &quot;degree&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>To correct for this, we must address two sources of bias:</p>

<ol>
  <li><strong>Topological Bias</strong> - Dense network regions with high in-degree
nodes tend to attract signals regardless of biological relevance —
a common issue in network analysis. Napistu offers several null
distributions to address this, the most relevant being:
    <ul>
      <li><em>vertex_permutation</em>: A non-parametric method that shuffles
signals across vertices.</li>
      <li><em>parametric_null</em>: A parametric method that models nulls using a
distribution derived from the observed signal. For example,
binary data may be modeled with a Bernoulli null distribution.</li>
    </ul>
  </li>
  <li><strong>Ascertainment Bias</strong> - This bias occurs because experiments
measure only a subset of the network. For instance, metabolomics
data focus on central carbon metabolism simply because it was
measured. To address this, I limited null permutations to resample
only from vertices with measured values (e.g., transcriptomics or
proteomics nodes identified by modality masks).</li>
</ol>

<h4 id="building-null-distributions">Building null distributions</h4>

<p>To robustly assess signal enrichment, I generated 500 null PPR
distributions by shuffling reset probabilities among measured vertices
(using modality masks). Each vertex’s observed PPR value was compared to
its null distribution, allowing its empirical quantile to serve as a
non-parametric p-value.</p>

<p>Because biological signals can both enrich and deplete regions, I
separately characterized signals in the right (<em>enrichment</em>) and left
(<em>depletion</em>) tails by:</p>

<ul>
  <li>Computing two-tailed p-values where either strong enrichment or
depletion would result in a small p-value</li>
  <li>Assigning features as enriched or depleted based on their quantile
(greater than or less than 0.5)</li>
  <li>Applying FDR correction separately to enriched and depleted sets</li>
</ul>
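<p>For a single vertex, this procedure reduces to a few lines (toy numbers; the
actual <code class="language-plaintext highlighter-rouge">quantile_to_pvalue</code> and
<code class="language-plaintext highlighter-rouge">floor_pvalue_by_resolution</code>
helpers are used in the code below):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# 500 permutation-null PPR scores for one vertex, plus its observed score
null_ppr = rng.normal(loc=1e-4, scale=2e-5, size=500)
observed_ppr = 1.8e-4

# empirical quantile: fraction of null samples below the observed value
quantile = (null_ppr < observed_ppr).mean()

# two-tailed p-value: extremes in either tail become small
p_value = 2 * min(quantile, 1 - quantile)
is_enriched = quantile > 0.5

# floor zero p-values at the resolution of the null distribution
p_value = max(p_value, 1 / 500)
```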

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">attr_masks</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">attr</span> <span class="ow">in</span> <span class="n">valid_attributes</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">regex</span><span class="p">,</span> <span class="n">mask</span> <span class="ow">in</span> <span class="n">REGEXES_TO_MASKS</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
        <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">regex</span><span class="p">,</span> <span class="n">attr</span><span class="p">):</span>
            <span class="n">attr_masks</span><span class="p">[</span><span class="n">attr</span><span class="p">]</span> <span class="o">=</span> <span class="n">mask</span>
            <span class="k">break</span>

    <span class="k">if</span> <span class="n">attr</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">attr_masks</span><span class="p">:</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Could not find a modality-specific mask for </span><span class="si">{</span><span class="n">attr</span><span class="si">}</span><span class="s">; using </span><span class="si">{</span><span class="n">attr</span><span class="si">}</span><span class="s"> as its own mask"</span><span class="p">)</span>
        <span class="c1"># default behavior is to use the attribute as its own mask, but adding it anyway to be explicit
</span>        <span class="n">attr_masks</span><span class="p">[</span><span class="n">attr</span><span class="p">]</span> <span class="o">=</span> <span class="n">attr</span>
        
<span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">PPR_NULL_TMP_PATH</span><span class="p">)</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">OVERWRITE</span><span class="p">:</span>
    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading PPR nulls from cache at </span><span class="si">{</span><span class="n">PPR_NULL_TMP_PATH</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">ppr_with_nulls</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">PPR_NULL_TMP_PATH</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Calibrating PPR enrichments by permuting vertex attributes among masked vertices"</span><span class="p">)</span>
    <span class="n">ppr_with_nulls</span> <span class="o">=</span> <span class="n">net_propagation</span><span class="p">.</span><span class="n">network_propagation_with_null</span><span class="p">(</span>
        <span class="n">napistu_graph</span><span class="p">,</span>
        <span class="n">attributes</span> <span class="o">=</span> <span class="n">valid_attributes</span><span class="p">,</span>
        <span class="n">mask</span> <span class="o">=</span> <span class="n">attr_masks</span><span class="p">,</span>
        <span class="n">propagation_method</span> <span class="o">=</span> <span class="s">"personalized_pagerank"</span><span class="p">,</span>
        <span class="n">additional_propagation_args</span> <span class="o">=</span> <span class="p">{</span> <span class="s">"damping"</span><span class="p">:</span> <span class="mf">0.85</span> <span class="p">},</span>
        <span class="n">null_strategy</span> <span class="o">=</span> <span class="s">"vertex_permutation"</span><span class="p">,</span>
        <span class="n">n_samples</span> <span class="o">=</span> <span class="n">N_NULL_SAMPLES</span><span class="p">,</span>
        <span class="n">verbose</span> <span class="o">=</span> <span class="bp">True</span>    
    <span class="p">)</span>   

    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Saving PPR nulls to cache at </span><span class="si">{</span><span class="n">PPR_NULL_TMP_PATH</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">ppr_with_nulls</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">PPR_NULL_TMP_PATH</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="comparing-subgraph-enrichments-and-depletions">Comparing subgraph enrichments and depletions</h3>

<p>Next, I will calculate p-values and q-values stratified by attribute and
by enrichment versus depletion.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># name index to vertex_id
</span><span class="n">tall_ppr_enrichments</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">ppr_with_nulls</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"index"</span><span class="p">:</span> <span class="s">"vertex_name"</span><span class="p">})</span>
    <span class="p">.</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s">"vertex_name"</span><span class="p">],</span> <span class="n">var_name</span><span class="o">=</span><span class="s">"attribute"</span><span class="p">,</span> <span class="n">value_name</span><span class="o">=</span><span class="s">"ppr_null_quantile"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">p_value</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">hypothesis_testing</span><span class="p">.</span><span class="n">quantile_to_pvalue</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">"ppr_null_quantile"</span><span class="p">],</span> <span class="s">"two-tailed"</span><span class="p">))</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">is_enriched</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">"ppr_null_quantile"</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mf">0.5</span><span class="p">)</span>
    <span class="c1"># correct for 0 p-values by flooring based on the # of null samples
</span>    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">p_value</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">floor_pvalue_by_resolution</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">"p_value"</span><span class="p">],</span> <span class="n">N_NULL_SAMPLES</span><span class="p">))</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">nulls_gt_observed</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">((</span><span class="mi">1</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="s">"ppr_null_quantile"</span><span class="p">])</span><span class="o">*</span><span class="n">N_NULL_SAMPLES</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">))</span>
<span class="p">)</span>

<span class="c1"># combine observed and null summaries
</span><span class="n">tall_ppr_results</span> <span class="o">=</span>  <span class="p">(</span>
    <span class="n">ppr_results</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"index"</span><span class="p">:</span> <span class="s">"vertex_name"</span><span class="p">})</span>
    <span class="p">.</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s">"vertex_name"</span><span class="p">],</span> <span class="n">var_name</span><span class="o">=</span><span class="s">"attribute"</span><span class="p">,</span> <span class="n">value_name</span><span class="o">=</span><span class="s">"ppr_score"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">tall_ppr_enrichments</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s">"vertex_name"</span><span class="p">,</span> <span class="s">"attribute"</span><span class="p">])</span>
    <span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">"p_value"</span><span class="p">])</span>
<span class="p">)</span>

<span class="n">fdr_controlled_results</span> <span class="o">=</span> <span class="n">multi_model_fitting</span><span class="p">.</span><span class="n">control_fdr</span><span class="p">(</span>
    <span class="n">tall_ppr_results</span><span class="p">,</span>
    <span class="n">grouping_vars</span> <span class="o">=</span> <span class="p">[</span><span class="s">"attribute"</span><span class="p">,</span> <span class="s">"is_enriched"</span><span class="p">],</span>
    <span class="n">require_groups</span> <span class="o">=</span> <span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Now, we can compare the distributions of enrichment and depletion
p-values aggregating over all attributes.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_ppr_enrichment_histograms</span><span class="p">(</span><span class="n">fdr_controlled_results</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-27-napistu_network_propagation/plot_ppr_enrichment_histograms-output-1.png" alt="" /></p>

<p>From these p-value histograms I can see that there are vertices that
are enriched and others that are depleted for my signals, with the
depletions being particularly pronounced. This asymmetry is why I
stratified enrichments and depletions when calculating FDR.</p>
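<p>A minimal stratified Benjamini-Hochberg sketch (my own toy implementation,
not <code class="language-plaintext highlighter-rouge">multi_model_fitting.control_fdr</code>)
shows the mechanics:</p>

```python
import numpy as np
import pandas as pd

def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values for a 1-D collection of p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotone non-decreasing q-values in p-value order
    q = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(q, 0, 1)
    return out

results = pd.DataFrame({
    "p_value": [0.001, 0.02, 0.8, 0.003, 0.04, 0.9],
    "is_enriched": [True, True, True, False, False, False],
})
# stratify the correction so the dominant tail does not mask the other
results["q_value"] = results.groupby("is_enriched")["p_value"].transform(bh_qvalues)
```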

<h3 id="identifying-enriched-subgraphs">Identifying enriched subgraphs</h3>

<p>To explore the subnetworks enriched for each attribute, I will filter
the data to include only vertices significantly enriched for a given
attribute (q &lt; 0.1). I will then count the number of enriched vertices
for each attribute:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_enriched_vertices</span> <span class="o">=</span> <span class="n">fdr_controlled_results</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"is_enriched == True"</span><span class="p">).</span><span class="n">query</span><span class="p">(</span><span class="s">"q_value &lt; 0.1"</span><span class="p">).</span><span class="n">value_counts</span><span class="p">(</span><span class="s">"attribute"</span><span class="p">)</span>

<span class="c1"># add back zeros
</span><span class="n">missing_attributes</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">fdr_controlled_results</span><span class="p">[</span><span class="s">"attribute"</span><span class="p">].</span><span class="n">unique</span><span class="p">())</span> <span class="o">-</span> <span class="nb">set</span><span class="p">(</span><span class="n">n_enriched_vertices</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="n">missing_attributes</span>

<span class="n">n_enriched_vertices</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span>
        <span class="n">n_enriched_vertices</span><span class="p">,</span>
        <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">({</span><span class="n">attr</span><span class="p">:</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">attr</span> <span class="ow">in</span> <span class="n">missing_attributes</span><span class="p">}).</span><span class="n">rename</span><span class="p">(</span><span class="s">"count"</span><span class="p">)</span>
    <span class="p">])</span>
<span class="p">)</span>

<span class="c1"># reformat the attributes to include modality and measure
</span><span class="n">attr_metadata</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">n_enriched_vertices</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">():</span>
    <span class="k">for</span> <span class="n">mod</span> <span class="ow">in</span> <span class="n">MODALITIES</span><span class="p">:</span>
        <span class="c1"># match str
</span>        <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span> <span class="n">key</span><span class="p">):</span>
            <span class="n">attr_metadata</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
                <span class="s">"modality"</span> <span class="p">:</span> <span class="n">mod</span><span class="p">,</span>
                <span class="s">"variable"</span> <span class="p">:</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">f</span><span class="s">"_</span><span class="si">{</span><span class="n">mod</span><span class="si">}</span><span class="s">$"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">key</span><span class="p">)</span>
            <span class="p">}</span>
            <span class="k">break</span>

    <span class="k">if</span> <span class="n">key</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">attr_metadata</span><span class="p">:</span>
        <span class="n">attr_metadata</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"modality"</span> <span class="p">:</span> <span class="s">"unknown"</span><span class="p">,</span>
            <span class="s">"variable"</span> <span class="p">:</span> <span class="n">key</span>
        <span class="p">}</span>

<span class="n">attr_metadata_df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">attr_metadata</span><span class="p">).</span><span class="n">T</span>
    <span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">VAR_METADATA</span><span class="p">,</span> <span class="n">on</span> <span class="o">=</span> <span class="s">"variable"</span><span class="p">,</span> <span class="n">how</span> <span class="o">=</span> <span class="s">"left"</span><span class="p">)</span>
<span class="p">).</span><span class="n">assign</span><span class="p">(</span><span class="n">attribute</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">row</span><span class="p">[</span><span class="s">'variable'</span><span class="p">]</span><span class="si">}</span><span class="s">_</span><span class="si">{</span><span class="n">row</span><span class="p">[</span><span class="s">'modality'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)).</span><span class="n">set_index</span><span class="p">(</span><span class="s">"attribute"</span><span class="p">)</span>

<span class="n">attr_signif_counts</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span>
    <span class="p">[</span>
        <span class="n">n_enriched_vertices</span><span class="p">,</span>
        <span class="n">attr_metadata_df</span>
    <span class="p">],</span>
    <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span>
<span class="p">)</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">create_stacked_barplot_seaborn</span><span class="p">(</span><span class="n">attr_signif_counts</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-27-napistu_network_propagation/discovery_counts-output-1.png" alt="" /></p>

<p>This plot shows that setting vertex reset probabilities to
$-\log_{10}(\text{q-values})$ produces the largest subgraph of upstream
enriched vertices. A similar result was observed with a hard
thresholding approach, where vertices with q &gt; 0.1 were assigned a
reset probability of 0. Notably, although q-values are a monotonic
transformation of p-values, propagation with
$-\log_{10}(\text{q-values})$ diverges substantially from propagation
with $-\log_{10}(\text{p-values})$.</p>
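<p>To make the two transformations concrete, here is a minimal sketch (the helper name is my own, not part of the pipeline above) of deriving a reset vector from q-values, either continuously via $-\log_{10}$ or via hard thresholding:</p>

```python
import numpy as np

def reset_probabilities(q_values, method="log10", threshold=0.1):
    """Hypothetical helper: turn per-vertex q-values into a normalized
    personalized PageRank reset vector.

    method="log10":     continuous weights proportional to -log10(q)
    method="threshold": uniform weight for q <= threshold, zero otherwise
    """
    q = np.clip(np.asarray(q_values, dtype=float), 1e-300, 1.0)
    if method == "log10":
        weights = -np.log10(q)
    elif method == "threshold":
        weights = (q <= threshold).astype(float)
    else:
        raise ValueError(f"unknown method: {method}")
    if weights.sum() == 0:
        raise ValueError("no vertex received a nonzero reset probability")
    # reset vectors must sum to 1
    return weights / weights.sum()

continuous = reset_probabilities([0.001, 0.05, 0.5])
hard = reset_probabilities([0.001, 0.05, 0.5], method="threshold")
```

<p>Both transformations preserve the ranking of vertices, but the continuous version spreads reset mass in proportion to evidence while thresholding concentrates it uniformly on the significant set, which is one reason the two can propagate quite differently.</p>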

<p>So, which metric is more trustworthy? I lean toward p-values: if
everything becomes significant, the result is as uninformative as if
nothing were. The strong enrichment for proteomics run order in the
transcriptomics data, despite minimal nominal significance, is a red
flag that q-values may be inappropriate in this context.</p>
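<p>To see why q-values can behave this way, consider a minimal Benjamini–Hochberg sketch (my own illustration, not the post’s code): distinct small p-values often collapse onto identical q-values, flattening exactly the signal that $-\log_{10}(\text{q-values})$ is meant to propagate.</p>

```python
import numpy as np

def bh_qvalues(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # q_i = p_i * n / rank_i, then enforce monotonicity from the top down
    scaled = p[order] * n / np.arange(1, n + 1)
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

# three distinct p-values collapse onto a single q-value
q = bh_qvalues([0.001, 0.002, 0.003, 0.9])
```

<p>Here <code>q</code> is <code>[0.004, 0.004, 0.004, 0.9]</code>: the first three p-values become indistinguishable after adjustment, which produces the ties and saturation discussed above.</p>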

<div class="content-section ai-aside">
  <div class="section-content">
    <p>Choosing appropriate transformations
for network propagation underscores a fundamental challenge in
biological network analysis: the lack of clear benchmarks. Unlike many
machine learning domains, we often lack ground truth, making it
difficult to select hyperparameters or validate methodological choices
systematically.</p>

<p>Supervised machine learning offers a path forward by reframing ambiguous
biological questions as well-defined prediction tasks, yielding two key
advantages. First, machine learning methods can potentially capture
functional relationships more accurately than algorithms like
personalized PageRank — provided the training data and prediction
tasks reflect meaningful biological objectives. The increasing
availability of Perturb-seq data, for example, enables direct prediction
of regulatory relationships from expression signatures. Second, even
when using traditional network methods, supervised tasks can inform
hyperparameter tuning and network design in contexts where ground truth
is scarce.</p>

<p>For example, if a network representation accurately recovers masked
protein–protein interactions, it suggests that the underlying data
integration and edge-weighting strategies capture meaningful biology.
These insights can then refine traditional network analyses, such as the
personalized PageRank approach used here. This creates a virtuous cycle
in which machine learning tasks guide the construction of better
networks, which in turn enhance the effectiveness of both ML and non-ML
methods for biological discovery.</p>

  </div>
</div>

<h2 id="interpreting-network-enrichments">Interpreting network enrichments</h2>

<p>To summarize the strongest network-based enrichments, I took the <strong>union
of the top five enriched vertices</strong> per attribute and counted how many of
the 500 null permutations exceeded each vertex’s observed enrichment
signal (so 0 marks a signal stronger than every permutation). (This
is akin to comparing the rank of a top signal for one attribute against
its ranks across others. However, because the coarse-grained empirical
nulls create many tied ranks, with abrupt jumps from 1 to hundreds
or thousands, the raw ranks are difficult to interpret.)</p>
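<p>The per-vertex statistic reported below can be sketched as follows (synthetic numbers; in the pipeline the null comes from 500 permuted reset vectors and the column is named <code>nulls_gt_observed</code>):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: an observed PPR score for one vertex and its
# empirical null distribution from 500 permutations.
null_scores = rng.normal(loc=0.01, scale=0.002, size=500)
observed = 0.02

# How many null permutations strictly exceed the observed score
# (0 means the observed signal matched or beat every permutation).
nulls_gt_observed = int((null_scores > observed).sum())

# Corresponding empirical p-value; the +1 keeps it strictly positive.
empirical_p = (nulls_gt_observed + 1) / (null_scores.size + 1)
```

<p>Small counts are directly comparable across attributes in a way that raw ranks, with their abrupt tie-driven jumps, are not.</p>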

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ppr_signif_w_metadata</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">fdr_controlled_results</span>
    <span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">attr_metadata_df</span><span class="p">,</span> <span class="n">on</span> <span class="o">=</span> <span class="s">"attribute"</span><span class="p">,</span> <span class="n">how</span> <span class="o">=</span> <span class="s">"left"</span><span class="p">)</span>
    <span class="c1"># remove q-value based ranking
</span>    <span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'summary != "q"'</span><span class="p">)</span>
    <span class="c1"># sort each attribute in ascending order by q-value and provide the rank, handling ties appropriately
</span>    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">rank</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">"attribute"</span><span class="p">,</span> <span class="s">"is_enriched"</span><span class="p">])[</span><span class="s">"q_value"</span><span class="p">].</span><span class="n">rank</span><span class="p">(</span><span class="n">method</span><span class="o">=</span><span class="s">"min"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    <span class="c1"># replace rank with . if q-value is above 0.1
</span>    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">rank</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">"rank"</span><span class="p">].</span><span class="n">where</span><span class="p">((</span><span class="n">x</span><span class="p">[</span><span class="s">"q_value"</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mf">0.1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">"is_enriched"</span><span class="p">]</span> <span class="o">==</span> <span class="bp">True</span><span class="p">),</span> <span class="s">"."</span><span class="p">))</span>
    <span class="c1"># similar approach but work with quantiles relative to null
</span>    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">display_nulls_gt_observed</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">"nulls_gt_observed"</span><span class="p">].</span><span class="n">where</span><span class="p">((</span><span class="n">x</span><span class="p">[</span><span class="s">"q_value"</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mf">0.1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">"is_enriched"</span><span class="p">]</span> <span class="o">==</span> <span class="bp">True</span><span class="p">),</span> <span class="s">"."</span><span class="p">))</span>
<span class="p">)</span>

<span class="c1"># loop through
</span><span class="n">phenotypes_of_interest</span> <span class="o">=</span> <span class="n">PPR_LINEAR_PHENOTYPES</span>
<span class="k">for</span> <span class="n">phenotype</span> <span class="ow">in</span> <span class="n">phenotypes_of_interest</span><span class="p">:</span>

    <span class="n">phenotype_ppr_results</span> <span class="o">=</span> <span class="n">ppr_signif_w_metadata</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">ppr_signif_w_metadata</span><span class="p">[</span><span class="s">"phenotype"</span><span class="p">]</span> <span class="o">==</span> <span class="n">phenotype</span><span class="p">]</span>

    <span class="c1"># find top N vertices for each attribute 
</span>    <span class="n">top_phenotype_vertices</span> <span class="o">=</span> <span class="n">phenotype_ppr_results</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"q_value &lt; 0.1 &amp; is_enriched == True"</span><span class="p">).</span><span class="n">sort_values</span><span class="p">([</span><span class="s">"q_value"</span><span class="p">,</span> <span class="s">"ppr_score"</span><span class="p">],</span> <span class="n">ascending</span> <span class="o">=</span> <span class="p">[</span><span class="bp">True</span><span class="p">,</span> <span class="bp">False</span><span class="p">]).</span><span class="n">groupby</span><span class="p">(</span><span class="s">"attribute"</span><span class="p">).</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">)[</span><span class="s">"vertex_name"</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>

    <span class="n">top_phenotype_stats</span> <span class="o">=</span> <span class="n">phenotype_pivot</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">phenotype_ppr_results</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"vertex_name in @top_phenotype_vertices"</span><span class="p">).</span><span class="n">merge</span><span class="p">(</span>
            <span class="n">annotated_vertices</span><span class="p">[[</span><span class="s">"name"</span><span class="p">,</span> <span class="s">"node_name"</span><span class="p">,</span> <span class="s">"node_type"</span><span class="p">]],</span>
            <span class="n">left_on</span> <span class="o">=</span> <span class="s">"vertex_name"</span><span class="p">,</span>
            <span class="n">right_on</span> <span class="o">=</span> <span class="s">"name"</span>
        <span class="p">)</span>
        <span class="c1">#.assign(rank=lambda x: x["rank"].apply(lambda val: str(int(float(val))) if val != "." else val))
</span>        <span class="c1">#.pivot_table(index = ["modality", "summary"], columns = "node_name", values = "rank", aggfunc="first")
</span>        <span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span> <span class="o">=</span> <span class="p">[</span><span class="s">"modality"</span><span class="p">,</span> <span class="s">"summary"</span><span class="p">],</span> <span class="n">columns</span> <span class="o">=</span> <span class="s">"node_name"</span><span class="p">,</span> <span class="n">values</span> <span class="o">=</span> <span class="s">"display_nulls_gt_observed"</span><span class="p">,</span> <span class="n">aggfunc</span><span class="o">=</span><span class="s">"first"</span><span class="p">)</span>
        <span class="p">.</span><span class="n">T</span>
    <span class="p">)</span>

    <span class="c1"># Apply to your table
</span>    <span class="n">reordered_table</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">reorder_by_rank_sum</span><span class="p">(</span><span class="n">top_phenotype_stats</span><span class="p">)</span>
        <span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="s">"."</span><span class="p">)</span>
    <span class="p">)</span>

    <span class="n">display_tabulator</span><span class="p">(</span>
        <span class="n">reordered_table</span><span class="p">,</span>
        <span class="n">caption</span><span class="o">=</span><span class="sa">f</span><span class="s">"Association ranks for </span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
        <span class="n">wrap_columns</span><span class="o">=</span><span class="p">[</span><span class="s">"node_name"</span><span class="p">],</span>
        <span class="n">column_widths</span><span class="o">=</span><span class="p">{</span><span class="s">"node_name"</span> <span class="p">:</span> <span class="s">"50%"</span><span class="p">}</span>
    <span class="p">)</span>
    
    <span class="n">display</span><span class="p">(</span><span class="n">HTML</span><span class="p">(</span><span class="s">'&lt;div style="margin-bottom: 30px;"&gt;&lt;/div&gt;'</span><span class="p">))</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Association ranks for MMA_urine
</figcaption>

<div class="data-table" style="" data-table="[{&quot;node_name&quot;: &quot;IL6 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;IGF1&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;Cyclin D&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 1, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;NADH&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;NAD+&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;LGALS3 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;L-MM-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, 
&quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;SUCC-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;2xMMAA:2xMUT:AdoCbl&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;CDKN2A gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;Hydrogen peroxide&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 2, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;Ac-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 4, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;IL8 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: 
&quot;.&quot;}, {&quot;node_name&quot;: &quot;C-X-C motif chemokine ligand 6 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;BPTF&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;SPARC related modular calcium binding 2 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;retinoic acid receptor responder 2 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}]" data-columns="[{&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}, {&quot;title&quot;: &quot;proteomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;proteomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;proteomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;proteomics_stat&quot;}]}, {&quot;title&quot;: 
&quot;transcriptomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;transcriptomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;transcriptomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;transcriptomics_stat&quot;}]}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div style="margin-bottom: 30px;"></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Association ranks for case
</figcaption>

<div class="data-table" style="" data-table="[{&quot;node_name&quot;: &quot;cellular communication network factor 5 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;EDIL3 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;EGF gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;TNF&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;NADH&quot;, &quot;proteomics_est&quot;: 0, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;INS gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;NAD+&quot;, &quot;proteomics_est&quot;: 0, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, 
&quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;MITF-M dimer:EDIL3 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;thrombospondin type 1 domain containing 4 [cellular_component]&quot;, &quot;proteomics_est&quot;: 0, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;complement factor D [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 1}, {&quot;node_name&quot;: &quot;BDNF gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 1, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;Ub&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;L-Glu&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, 
&quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;Cx43:ZO-1:c-src gap junction&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;HAPLN1&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 2, &quot;transcriptomics_stat&quot;: 3}, {&quot;node_name&quot;: &quot;CDKN1A gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 1, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;Ac-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;sirtuin 5 [cellular_component]&quot;, &quot;proteomics_est&quot;: 0, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;EGFR gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 2}, {&quot;node_name&quot;: &quot;SIRT5:Zn2+&quot;, &quot;proteomics_est&quot;: 0, 
&quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;2OG&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}]" data-columns="[{&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}, {&quot;title&quot;: &quot;proteomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;proteomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;proteomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;proteomics_stat&quot;}]}, {&quot;title&quot;: &quot;transcriptomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;transcriptomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;transcriptomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;transcriptomics_stat&quot;}]}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div style="margin-bottom: 30px;"></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Association ranks for OHCblPlus
</figcaption>

<div class="data-table" style="" data-table="[{&quot;node_name&quot;: &quot;INS gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;inhibin subunit beta A [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;TNF&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;EGF gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;regulator of calcineurin 2 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;IL6 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 1, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;BDNF gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 
&quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 2, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;cellular communication network factor 5 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 2, &quot;transcriptomics_stat&quot;: 2}, {&quot;node_name&quot;: &quot;IL1B gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 1, &quot;transcriptomics_log10p&quot;: 4, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;p-T2609,S2612,T2638,T2647-PRKDC:XRCC5:XRCC6:p-S645-DCLRE1C:DNA DSB ends&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;L-MM-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;NADH&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;9606.ENSP00000450353 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, 
&quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;2xMMAA:2xMUT:AdoCbl&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;p-T2609,S2612,T2638,T2647-PRKDC:XRCC5:XRCC6:p-S516,S645-DCLRE1C:DNA DSB ends&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;Ac-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;NAD+&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;EDIL3 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;2OG&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, 
&quot;transcriptomics_stat&quot;: &quot;.&quot;}]" data-columns="[{&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}, {&quot;title&quot;: &quot;proteomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;proteomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;proteomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;proteomics_stat&quot;}]}, {&quot;title&quot;: &quot;transcriptomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;transcriptomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;transcriptomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;transcriptomics_stat&quot;}]}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div style="margin-bottom: 30px;"></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Association ranks for responsive_to_acute_treatment
</figcaption>

<div class="data-table" style="" data-table="[{&quot;node_name&quot;: &quot;CDKN2A gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 3, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;Protonated Carbamino DeoxyHbA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;BPTF&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;CDKN1A gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;OxyHbA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;Hemoglobin A is protonated and carbamated causing release of oxygen&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;NOTCH1 gene&quot;, 
&quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;L-Glu&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;NADH&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;DNMT1 mRNA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 1}, {&quot;node_name&quot;: &quot;Hydrogen peroxide&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;INO80&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 1}, {&quot;node_name&quot;: &quot;Ac-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 2, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, 
&quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;IL8 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;MMP1&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;VWF&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;CXCL1&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;BCL6 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;NAD+&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, 
{&quot;node_name&quot;: &quot;secreted frizzled related protein 4 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}]" data-columns="[{&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}, {&quot;title&quot;: &quot;proteomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;proteomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;proteomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;proteomics_stat&quot;}]}, {&quot;title&quot;: &quot;transcriptomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;transcriptomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;transcriptomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;transcriptomics_stat&quot;}]}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div style="margin-bottom: 30px;"></div>

<p>These results are fascinating:</p>

<ul>
  <li><strong>MUT</strong>, the major genetic cause of MMA, frequently appears.</li>
  <li>Both MUT’s substrate, <strong>methylmalonyl CoA</strong> (L-MM-CoA), and its
product, <strong>succinyl CoA</strong> (SUCC-CoA), are represented along with a
number of other metabolites discussed in the original study, such as
glutamine. Recovering metabolite associations is particularly
interesting because there was no metabolomics data on these cell
lines.</li>
  <li><strong>2-oxoglutarate</strong> (2OG), aka alpha-ketoglutarate, appears in
several subgraphs. It is the natural counterpart to
dimethyl-oxoglutarate, which Forny et al. demonstrated can rescue
the MMA-associated metabolic defect.</li>
</ul>

<div class="content-section bio-section">
  <div class="section-content">
    <p>Transcriptomics associations
frequently highlight growth-related genes (e.g., cyclins, <em>IGF1</em>,
EGF/R), suggesting that cell line growth rate may be confounded with
disease severity. This underscores the inherent biological variability
in these datasets. Ideally, doubling time should be included as a
covariate in future analyses.</p>

  </div>
</div>

<h3 id="visualizing-induced-subgraphs">Visualizing induced subgraphs</h3>

<p>To more comprehensively visualize these enriched vertices, I generated
induced subgraphs retaining both enriched vertices and their molecular
interactions. Below is an example focused on MMA urine proteomics:</p>

<p><img src="https://www.shackett.org/figure/napistu_ppr/log10_p_MMA_urine_proteomics_component_1.png" alt="Induced subgraph of network enriched for PPR signal upstream of MMA_urine proteins" style="width: 100%;" /></p>
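<p>Mechanically, inducing a subgraph just means keeping the enriched vertices and any edges that connect two of them. A minimal sketch in plain Python (the edge list and enrichment set here are illustrative toys, not drawn from the real genome-scale network):</p>

```python
# Toy edge list standing in for the genome-scale network
edges = [
    ("MUT", "L-MM-CoA"), ("MUT", "SUCC-CoA"),
    ("L-MM-CoA", "2OG"), ("SUCC-CoA", "2OG"),
    ("2OG", "L-Glu"), ("L-Glu", "H2O"),
]

# vertices that passed the permutation-based enrichment test
enriched = {"MUT", "L-MM-CoA", "SUCC-CoA", "2OG", "L-Glu"}

# induced subgraph: keep only edges whose endpoints are both enriched
induced_edges = [(u, v) for u, v in edges if u in enriched and v in enriched]
print(induced_edges)
```

<p>Vertices outside the enriched set (here, H2O) disappear, along with any edges touching them.</p>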

<p>This plot connects the enzymatic causes of MMA — deficiencies in
propionate metabolism — to related metabolic effects, such as
altered glutamine and glutamate levels, previously observed in MMA
patients. Beyond these reassuring associations, the analyses also
implicate additional regulatory pathways, including <strong>sirtuins and ROS
signaling</strong>. To investigate any of these regulators, the graph can be
traversed to generate causal, mechanistic hypotheses linking upstream
regulators to downstream molecular effects.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>We know far more about how
proteins shape metabolism than how metabolism influences proteins and
gene expression. Against this backdrop, it’s notable that sirtuins and
ROS signaling emerged — two of the more well-characterized pathways
involved in metabolic sensing. Sirtuins function as metabolic sensors by
using NAD+ as a cofactor for their histone deacetylase activity,
directly linking cellular energy status to chromatin remodeling and the
transcriptional regulation of metabolic genes. Reactive oxygen species
act as second messengers, modulating transcription factor activity
and epigenetic modifications to translate metabolic stress and
mitochondrial dysfunction into adaptive gene expression programs.</p>

<p>Nonetheless, the hub-like roles of NADH and other cofactors make them
challenging to interpret mechanistically. Take water as an example:
though not a cofactor, water participates in an enormous number of
reactions as a substrate or product. However, its flux through any
individual reaction is negligible compared to its large pool size, so
it’s rarely considered a regulated entity or regulator. To address this issue, water
is removed from all reactions during the <code class="language-plaintext highlighter-rouge">drop_cofactors</code> step of the
pathway build process. While this violates mass balance, it preserves
the regulatory intent of the <code class="language-plaintext highlighter-rouge">SBML_dfs</code> and <code class="language-plaintext highlighter-rouge">NapistuGraph</code>.</p>

<p>Handling NADH is more nuanced. In the <code class="language-plaintext highlighter-rouge">drop_cofactors</code> step, NAD+ and
NADH are removed from reactions only when both are present and NADH acts
as the substrate. The rationale is that many reactions consume small
amounts of energy (NADH → NAD+) without significantly affecting cellular
energetics, whereas energy production (NAD+ → NADH) is typically more
physiologically meaningful.</p>
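<p>These rules can be sketched as a small filter. This is a simplified
illustration of the logic described above, not Napistu’s actual
implementation; the names <code class="language-plaintext highlighter-rouge">ALWAYS_DROP</code>, <code class="language-plaintext highlighter-rouge">CONDITIONAL</code>, and <code class="language-plaintext highlighter-rouge">filter_cofactors</code> are hypothetical.</p>

```python
# Simplified illustration of the cofactor-filtering rules; names are
# hypothetical, not Napistu's actual API.
ALWAYS_DROP = {"H2O"}              # negligible flux relative to pool size
CONDITIONAL = [("NADH", "NAD+")]   # drop only when the reduced form is consumed

def filter_cofactors(substrates, products):
    """Return (substrates, products) with cofactor species removed."""
    subs = [s for s in substrates if s not in ALWAYS_DROP]
    prods = [p for p in products if p not in ALWAYS_DROP]
    for reduced, oxidized in CONDITIONAL:
        # drop the pair only when both are present and NADH is a substrate,
        # i.e., the reaction consumes rather than produces reducing power
        if reduced in subs and oxidized in prods:
            subs = [s for s in subs if s != reduced]
            prods = [p for p in prods if p != oxidized]
    return subs, prods

# energy-consuming reaction (NADH -> NAD+): the pair is removed
print(filter_cofactors(["NADH", "pyruvate"], ["NAD+", "lactate"]))
# energy-producing reaction (NAD+ -> NADH): the pair is retained
print(filter_cofactors(["NAD+", "malate"], ["NADH", "OAA"]))
```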

  </div>
</div>

<h2 id="summary-and-next-steps">Summary and next steps</h2>

<p>This analysis demonstrates how personalized PageRank on genome-scale
networks can transform statistical associations into mechanistic
biological insights. This approach:</p>

<ul>
  <li>✅ Recovered known regulators (e.g., <em>MMUT</em>)</li>
  <li>✅ Validated and expanded findings on glutamine/glutamate metabolism</li>
  <li>✅ Linked statistical results to coherent subgraphs of molecular
interactions</li>
  <li>✅ Identified new regulatory hypotheses (e.g., sirtuins, ROS
signaling)</li>
</ul>

<p>By tracing disease signals upstream, we uncover <strong>coordinated regulatory
modules</strong> driving MMA pathophysiology — insights that would be
difficult to glean from gene lists alone.</p>

<h3 id="what-napistu-provides">What Napistu provides</h3>

<p><strong>Napistu is a comprehensive genome-scale network biology framework
designed to bridge the gap between pathway databases and practical
analysis</strong>. It integrates diverse biological data sources — such as
<em>Reactome</em>, <em>BiGG</em>, <em>TRRUST</em>, and <em>STRING</em> — into unified network
representations that capture both metabolic and gene-centered regulatory
mechanisms. It addresses many of the complex data engineering challenges
that emerge when working with pathway data, including identifier
mapping, information consolidation, and translating biological pathways
into analysis-ready graph structures.</p>

<p>This allows researchers to focus on <strong>biological discovery</strong> rather than
data munging.</p>

<h3 id="methodology-takeaways">Methodology takeaways</h3>

<p>This post outlines a workflow for integrating multimodal genomics data
with genome-scale biological networks by:</p>

<ul>
  <li>Extracting feature-level data from <code class="language-plaintext highlighter-rouge">AnnData</code> or <code class="language-plaintext highlighter-rouge">MuData</code> objects
from the <code class="language-plaintext highlighter-rouge">scverse</code> project and adding them to a Napistu pathway
representation</li>
  <li>Transforming pathway-associated data into vertex attributes on a
genome-scale molecular network</li>
  <li>Aggregating signals using network propagation to identify subgraphs
enriched for biological signals</li>
</ul>
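<p>Conceptually, attaching feature-level statistics to a network amounts
to a left join of measurement tables onto the vertex table. A toy sketch
with pandas (all gene names and values here are hypothetical):</p>

```python
import pandas as pd

# Hypothetical feature-level statistics, e.g., from differential expression
feature_stats = pd.DataFrame({
    "gene": ["MMUT", "SUCLA2", "TCN2"],
    "log10p": [6.2, 3.1, 1.4],
})

# Hypothetical vertex table of a genome-scale network
vertices = pd.DataFrame({"vertex": ["MMUT", "SUCLA2", "ACSF3"]})

# left join: unmeasured vertices (ACSF3) get NaN, which downstream
# propagation can treat as zero signal; unmatched features (TCN2) drop out
annotated = vertices.merge(
    feature_stats, left_on="vertex", right_on="gene", how="left"
)
print(annotated[["vertex", "log10p"]])
```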

<p>A critical challenge addressed here is that network propagation methods,
like personalized PageRank, naturally concentrate signal in hub nodes.
This bias arises from two sources:</p>

<ul>
  <li><strong>Topological bias</strong> (due to network structure), and</li>
  <li><strong>Ascertainment bias</strong> (due to limited experimental coverage).</li>
</ul>

<p>To overcome this, I created modality-specific vertex permutation null
distributions, allowing me to distinguish genuine biological enrichments
from connectivity artifacts.</p>
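<p>A minimal sketch of this permutation scheme, using a toy graph and a
hand-rolled power-iteration PPR. The graph, seed placement, and damping
factor are illustrative; the real analysis runs on the genome-scale
network and restricts each null to the vertices measured by that
modality:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy adjacency matrix: vertex 0 is a hub connected to everyone else
A = np.array([
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
], dtype=float)
P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

def personalized_pagerank(seed_weights, damping=0.85, n_iter=100):
    """PPR by power iteration with restarts to the seed distribution."""
    s = seed_weights / seed_weights.sum()
    r = s.copy()
    for _ in range(n_iter):
        r = damping * (P.T @ r) + (1 - damping) * s
    return r

# observed scores: seed the walk at vertices 1 and 2 (the "hits")
seeds = np.array([0.0, 1.0, 1.0, 0.0, 0.0])
observed = personalized_pagerank(seeds)

# null distribution: permute which vertices carry seed weight, keeping
# the number of seeds fixed but randomizing their placement
null = np.stack([
    personalized_pagerank(rng.permutation(seeds)) for _ in range(1000)
])
empirical_p = (null >= observed).mean(axis=0)
print(np.round(empirical_p, 3))
```

<p>Because the hub accumulates PPR mass under the null as well, its observed score must beat the permuted scores to look enriched — which is exactly how the topological bias gets neutralized.</p>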

<p>Napistu’s disease-agnostic design supports any dataset with systematic
molecular identifiers. Its modular architecture allows researchers to
swap pathway sources, topologies, or propagation algorithms, making it
highly adaptable across diverse biological applications.</p>

<h3 id="try-it-yourself">Try it yourself!</h3>

<p>The <strong>Napistu framework</strong> and all associated analysis workflows are
<strong>open source</strong> and ready to use. The <strong>human consensus network</strong>
featured here is available for direct download from my public
repository, and the well-documented code serves as a robust template for
applying these methods to your own data.</p>

<p>Whether you’re investigating rare diseases, cancer, or complex traits,
Napistu provides a systematic pipeline that transforms statistical
associations into mechanistic insights.</p>

<p>🔗 Get started:
<a href="https://github.com/napistu/napistu">github.com/napistu/napistu</a></p>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="genomics" /><category term="python" /><category term="networks" /><summary type="html"><![CDATA[This is part two of a two-part series on Napistu — a new framework for building genome-scale molecular networks and integrating them with high-dimensional data. Using a methylmalonic acidemia (MMA) multimodal dataset as a case study, I’ll demonstrate how to distill disease-relevant signals into mechanistic insights through network-based analysis. From statistical associations to biological mechanisms Modern genomics excels at identifying disease-associated genes and proteins through statistical analysis. Methods like Gene Set Enrichment Analysis (GSEA) group these genes into functional categories, offering useful biological context. However, we aim to go beyond simply identifying which genes and gene sets change. Our goal is to understand why these genes change together, uncovering the mechanistic depth typically seen in Figure 1 of a Cell paper. To achieve this, we must identify key molecular components, summarize their interactions, and characterize the dynamic cascades that drive emergent biological behavior. In this post, I’ll demonstrate how to gain this insight by mapping statistical disease signatures onto genome-scale biological networks. Then, using personalized PageRank, I’ll trace signals from dysregulated genes back to their shared regulatory origins. 
This transforms lists of differentially expressed genes into interconnected modules that reveal upstream mechanisms driving coordinated molecular changes.]]></summary></entry><entry><title type="html">Network Biology with Napistu, Part 1: Creating Multimodal Disease Profiles</title><link href="https://www.shackett.org/multiomic_profiles/" rel="alternate" type="text/html" title="Network Biology with Napistu, Part 1: Creating Multimodal Disease Profiles" /><published>2025-08-19T00:00:00+00:00</published><updated>2025-08-19T00:00:00+00:00</updated><id>https://www.shackett.org/multiomic_profiles</id><content type="html" xml:base="https://www.shackett.org/multiomic_profiles/"><![CDATA[<p>This is part one of a two-part post highlighting
<strong><a href="https://github.com/napistu/napistu">Napistu</a></strong> — a new framework
for building genome-scale networks of molecular biology and
biochemistry. In this post, I’ll tackle a fundamental challenge in
computational biology: how to extract meaningful disease signatures from
complex multimodal datasets.</p>

<p>Using methylmalonic acidemia (MMA) as my test case, I’ll demonstrate how
to systematically extract disease signatures from multimodal data. My
approach combines three complementary analytical strategies: exploratory
data analysis to assess data structure and quality, differential
expression analysis to identify disease-associated features, and factor
analysis to uncover coordinated gene expression programs across data
types. The end goal is to distill thousands of molecular measurements
into a handful of interpretable disease signatures — each capturing a
distinct aspect of disease biology that can be mapped to regulatory
networks.</p>

<p>Throughout this post, I’ll use two types of asides to provide additional
context without disrupting the main analytical flow. Green boxes contain
biological details, while blue boxes reflect on the computational
workflow and AI-assisted development process.</p>

<!--more-->

<div class="content-section bio-section">
  <div class="section-content">
    <p><strong>For biologists</strong>: I identify a
previously unreported batch effect related to sample freezing dates that
impacts both data modalities. By accounting for this and other technical
covariates, I demonstrate improved power to detect disease-relevant
associations in the proteomics dataset. I also show that urine MMA
levels exhibit stronger statistical associations with molecular features
than traditional enzyme activity measures in this dataset. In part two,
I will link these statistical associations to upstream regulators using
a genome-scale mechanistic network, enabling me to define a regulon
underlying the molecular pathophysiology of MMA.</p>

  </div>
</div>

<div class="content-section ai-aside">
  <div class="section-content">
    <p><strong>For computational folks</strong>: While
developing this analysis and the underlying software, I intentionally
explored AI-assisted coding across the full development lifecycle —
from building statistical pipelines in Python to contributing to
Napistu, a complex scientific programming framework for network biology.
I’ll share specific examples where AI excelled (e.g., rapid prototyping,
API exploration) and where it failed in potentially dangerous ways
(e.g., confusing p-values with q-values, breaking regression formulas).
I’ll also offer practical strategies for maintaining code quality while
working at AI speed.</p>

  </div>
</div>

<h2 id="what-is-methylmalonic-acidemia">What is Methylmalonic acidemia?</h2>

<p>Methylmalonic acidemia (MMA) — not to be confused with mixed martial
arts (which I used to blog about at <a href="https://www.fightprior.com/">Fight
Prior</a>) — is an inborn error of
metabolism characterized by the buildup of methylmalonic acid.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>MMA is genetically heterogeneous,
caused by defects in approximately 20 genes involved in propionate
metabolism. Classical isolated MMA is primarily due to autosomal
recessive mutations in the enzyme methylmalonyl-CoA mutase (<em>MMUT</em>),
which converts methylmalonyl-CoA to succinyl-CoA as part of the
propionate catabolic pathway. This pathway processes metabolites derived
from odd-chain fatty acids, cholesterol, and certain amino acids. When
disrupted, methylmalonic acid accumulates, leading to metabolic acidosis
and multi-organ dysfunction, including neurological abnormalities,
kidney failure, and growth impairment.</p>

<p>Despite being linked to a well-characterized metabolic pathway, MMA
presents significant clinical and research challenges. Patients exhibit
substantial variability in disease severity, treatment responsiveness,
and clinical outcomes — even among those with identical genetic
mutations. While traditional biochemical assays provide valuable
diagnostic information, they offer only a partial view of disease
pathophysiology. This variability suggests that understanding MMA
requires moving beyond simple “broken enzyme” models toward
systems-level approaches capable of capturing the complex downstream
effects of metabolic disruption.</p>

  </div>
</div>

<h3 id="the-forny-multi-omics-approach">The Forny multi-omics approach</h3>

<p>To address the complexity of MMA, <a href="https://www.nature.com/articles/s42255-022-00720-8">Forny et al.,
2023</a> conducted one
of the largest multi-omics studies of a rare metabolic disorder. They
profiled 210 patient-derived fibroblast lines and 20 controls using
transcriptomics and proteomics to better understand disease etiology and
identify potential therapeutic interventions.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>The study achieved a molecular
diagnosis in 177 out of 210 cases (84%), though 33 patients remained
without a clear genetic explanation. While 148 patients had <em>MMUT</em>
mutations, others carried defects in <em>ACSF3</em>, <em>TCN2</em>, <em>SUCLA2</em>, and
additional genes — revealing broader genetic heterogeneity than
previously recognized. Beyond diagnostic insights, the study uncovered
unexpected disruption of TCA cycle anaplerosis (metabolic replenishment
pathways), particularly involving glutamine metabolism. It also
identified physical interactions between <em>MMUT</em> and other metabolic
enzymes, suggesting coordinated regulation, and demonstrated that
dimethyl-oxoglutarate treatment can restore metabolic function in
cellular models.</p>

<p>The findings suggest that MMA involves more than just a “broken enzyme.”
<em>MMUT</em> deficiency triggers a systematic rewiring of cellular metabolism,
especially in how cells replenish TCA cycle intermediates. This
anaplerotic shift includes increased reliance on glutamine metabolism
and appears to involve direct protein-protein interactions between
<em>MMUT</em> and glutamine-processing enzymes. These results indicate that
<em>MMUT</em> may function as part of a larger metabolic regulatory complex —
an unexpected insight with potential therapeutic implications.</p>

  </div>
</div>

<p>While the Forny study made significant advances, it also highlighted a
fundamental puzzle. MMA exhibits many features of a classical autosomal
recessive metabolic disorder: patients have well-defined enzymatic
defects, and some respond to metabolic interventions, such as vitamin
supplementation. Yet, the presence of treatment-resistant cases, highly
variable disease severity, and patients without clear genetic
explanations suggest that MMA pathophysiology extends beyond mere
metabolic dysfunction.</p>

<p>This paradox suggests that MMA may be influenced by broader regulatory
networks that either shape cellular metabolism or respond dynamically to
its disruption. If this is the case, understanding MMA requires moving
beyond metabolic modeling and adopting approaches that simultaneously
capture both metabolic and genic regulatory mechanisms. To address this
challenge, I aim to create molecular disease signatures that can be
mapped onto biological networks linking metabolic pathways to their
upstream regulatory controls. Rather than viewing MMA solely as a
metabolic disorder, this systems-level approach treats gene-centric and
metabolism-centric regulation as interconnected, enabling me to trace
disease signals from downstream molecular effects to potential upstream
drivers.</p>

<h2 id="strategy-for-creating-disease-profiles">Strategy for creating disease profiles</h2>

<p>To create interpretable disease signatures from the Forny dataset, I’ll
follow a systematic approach:</p>

<ol>
  <li><strong>Preprocessing</strong>: Address missing phenotype data and technical
covariates</li>
  <li><strong>Exploratory analysis</strong>: Identify batch effects and major sources
of variation</li>
  <li><strong>Supervised regression</strong>: Find individual disease-associated
features</li>
  <li><strong>Unsupervised factor analysis</strong>: Discover coordinated multi-omic
programs</li>
</ol>
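<p>The missing-data step can be illustrated with a hand-rolled
K-nearest-neighbors imputer (the actual analysis uses scikit-learn’s
<code class="language-plaintext highlighter-rouge">KNNImputer</code>, imported below; this toy numpy version exists only to show the idea):</p>

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in each row with the column mean across the k nearest
    rows (Euclidean distance over mutually observed columns)."""
    X = np.asarray(X, dtype=float)
    imputed = X.copy()
    for i in range(X.shape[0]):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        # distance from row i to every other row, using shared columns
        dists = []
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                d = np.sqrt(((X[i, shared] - X[j, shared]) ** 2).mean())
                dists.append((d, j))
        neighbors = [j for _, j in sorted(dists)[:k]]
        for col in np.where(missing)[0]:
            vals = [X[j, col] for j in neighbors if not np.isnan(X[j, col])]
            if vals:
                imputed[i, col] = np.mean(vals)
    return imputed

# toy phenotype matrix: sample 1 is missing its second measurement
pheno = np.array([
    [1.0, 10.0],
    [1.1, np.nan],
    [0.9, 11.0],
    [5.0, 50.0],
])
imputed = knn_impute(pheno, k=2)
print(imputed)
```

<p>Sample 1’s missing value is filled from its two most similar samples (0 and 2), not from the outlier sample 3 — the same locality principle the real imputer relies on.</p>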

<h3 id="ai-as-a-development-collaborator">AI as a development collaborator</h3>

<p>While I typically perform exploratory data analysis and statistical
analysis in R, I implemented this analysis in Python. This provided an
opportunity to explore AI-assisted development — not as a replacement
for careful software engineering, but as a collaborator for writing and
testing code more efficiently. Throughout the analysis, I’ll highlight
specific examples of where this approach succeeded and where it fell
short.</p>

<h1 id="creating-molecular-profiles-of-mma">Creating molecular profiles of MMA</h1>

<h2 id="environment-setup">Environment setup</h2>

<p>If you’d like to reproduce this analysis, follow these steps:</p>

<ol>
  <li>
    <p>Install <a href="https://docs.astral.sh/uv/#highlights">uv</a> (or just use
<code class="language-plaintext highlighter-rouge">pip</code> install)</p>
  </li>
  <li>
    <p>Setup a Python environment:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv venv <span class="nt">--python</span> 3.11
<span class="nb">source</span> .venv/bin/activate

<span class="c"># Core dependencies</span>
uv pip <span class="nb">install</span> <span class="s2">"napistu[scverse]==0.5.5"</span>
<span class="c"># Personal utilities package with genomics analysis functions</span>
uv pip <span class="nb">install</span> <span class="s2">"git+https://github.com/shackett/shackett-utils.git@v0.1.2[all]"</span> 
<span class="c"># Additional dependencies</span>
uv pip <span class="nb">install </span>openpyxl scikit-learn mofapy2 ipykernel nbformat nbclient
python <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span><span class="o">=</span>forny-2023
</code></pre></div>    </div>
  </li>
  <li>
    <p>Download <code class="language-plaintext highlighter-rouge">Source Data Fig.1</code> and <code class="language-plaintext highlighter-rouge">Source Data Fig.2</code> from <a href="https://www.nature.com/articles/s42255-022-00720-8#Sec39">Forny et
al., 2023</a></p>
  </li>
  <li>
    <p>Download the
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/creating_multimodal_profiles.qmd"><code class="language-plaintext highlighter-rouge">creating_multimodal_profiles.qmd</code></a>
notebook</p>
  </li>
  <li>
    <p>Modify the following code block in your copy of the notebook to set
appropriate paths:</p>

    <p>a. <code class="language-plaintext highlighter-rouge">SUPPLEMENTAL_DATA_DIR</code> should point to the directory containing
    the download from step 3 (“42255_2022_720_MOESM3_ESM.xlsx” and
    “42255_2022_720_MOESM4_ESM.xlsx”)</p>

    <p>b. <code class="language-plaintext highlighter-rouge">CACHE_DIR</code> should point to a location where intermediate
    results and outputs can be saved</p>
  </li>
  <li>
    <p>Run the notebook and render an html output (or just open the
notebook in your browser):</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>quarto render creating_multimodal_profiles.qmd
</code></pre></div>    </div>
  </li>
</ol>

<p>First, I’ll load the necessary Python modules, configure paths, set
global parameters, and define utility functions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">textwrap</span>

<span class="kn">import</span> <span class="nn">anndata</span> <span class="k">as</span> <span class="n">ad</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">mudata</span> <span class="k">as</span> <span class="n">md</span>
<span class="kn">import</span> <span class="nn">muon</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">scanpy</span> <span class="k">as</span> <span class="n">sc</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">sklearn.impute</span> <span class="kn">import</span> <span class="n">KNNImputer</span>

<span class="c1"># this analysis is largely upstream of Napistu but we can use some of its utils
</span><span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">utils</span> <span class="k">as</span> <span class="n">napistu_utils</span>

<span class="c1"># import local modules
</span><span class="kn">from</span> <span class="nn">shackett_utils.applications</span> <span class="kn">import</span> <span class="n">forny_imputation</span>
<span class="kn">import</span> <span class="nn">shackett_utils.genomics.adata_processing</span> <span class="k">as</span> <span class="n">processing</span>
<span class="kn">from</span> <span class="nn">shackett_utils.genomics</span> <span class="kn">import</span> <span class="n">adata_regression</span>
<span class="kn">from</span> <span class="nn">shackett_utils.genomics</span> <span class="kn">import</span> <span class="n">mdata_eda</span>
<span class="kn">from</span> <span class="nn">shackett_utils.genomics</span> <span class="kn">import</span> <span class="n">mdata_factor_analysis</span>
<span class="kn">from</span> <span class="nn">shackett_utils.statistics</span> <span class="kn">import</span> <span class="n">stats_viz</span>
<span class="kn">from</span> <span class="nn">shackett_utils.statistics</span> <span class="kn">import</span> <span class="n">transform</span>
<span class="kn">from</span> <span class="nn">shackett_utils.blog.html_utils</span> <span class="kn">import</span> <span class="n">display_tabulator</span>
<span class="kn">from</span> <span class="nn">shackett_utils.utils.pd_utils</span> <span class="kn">import</span> <span class="n">format_numeric_columns</span>

<span class="c1"># File paths and data organization
# All input data should be placed in the SUPPLEMENTAL_DATA_DIR
# Cached results and models will be stored in CACHE_DIR
</span>
<span class="c1"># paths
</span><span class="n">PROJECT_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">"~/napistu_mma_posts"</span><span class="p">)</span>
<span class="n">SUPPLEMENTAL_DATA_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="s">"input"</span><span class="p">)</span>
<span class="n">CACHE_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="s">"cache"</span><span class="p">)</span>

<span class="c1"># Define the path to save hyperparameter scan results
</span><span class="n">MOFA_PARAM_SCAN_MODELS_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">CACHE_DIR</span><span class="p">,</span> <span class="s">"mofa_param_scan_h5mu"</span><span class="p">)</span>
<span class="c1"># Final results 
</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">CACHE_DIR</span><span class="p">,</span> <span class="s">"mofa_optimal_model.h5mu"</span><span class="p">)</span>

<span class="c1"># formats
</span><span class="n">SUPPLEMENTAL_DATA_FILES</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"transcriptomics"</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"file"</span> <span class="p">:</span> <span class="s">"42255_2022_720_MOESM3_ESM.xlsx"</span><span class="p">,</span>
        <span class="s">"sheet"</span> <span class="p">:</span> <span class="s">"Source Data transcriptomics"</span>
    <span class="p">},</span>
    <span class="s">"proteomics"</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"file"</span> <span class="p">:</span> <span class="s">"42255_2022_720_MOESM3_ESM.xlsx"</span><span class="p">,</span>
        <span class="s">"sheet"</span> <span class="p">:</span> <span class="s">"Source Data proteomics"</span>
    <span class="p">},</span>
    <span class="s">"phenotypes"</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"file"</span> <span class="p">:</span> <span class="s">"42255_2022_720_MOESM4_ESM.xlsx"</span><span class="p">,</span>
        <span class="s">"sheet"</span> <span class="p">:</span> <span class="mi">0</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1"># other globals
# path to proteomics filenames for extracting run order
</span><span class="n">PROTEOMICS_FILE_NAMES_URL</span> <span class="o">=</span> <span class="s">"https://github.com/user-attachments/files/20616164/202502_PHRT-5_MMA_Sample-annotation.txt"</span>
<span class="c1"># filter genes with fewer than this # of counts summed over samples
</span><span class="n">READ_CUTOFF</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mi">400</span><span class="p">)</span>
<span class="c1"># filter phenotypes with more than this # of missing values
</span><span class="n">MAX_MISSING_PHENOTYPE</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mi">180</span><span class="p">)</span>
<span class="c1"># measure to use for analysis
</span><span class="n">ANALYSIS_LAYER</span> <span class="o">=</span> <span class="s">"log2_centered"</span>
<span class="c1"># cutoff for qvalues (FDR-adjusted p-values)
</span><span class="n">FDR_CUTOFF</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="c1"># Overwrite the results if they already exist   
</span><span class="n">OVERWRITE</span> <span class="o">=</span> <span class="bp">False</span>
<span class="c1"># Define the optimal number of factors
</span><span class="n">OPTIMAL_FACTOR</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mi">30</span><span class="p">)</span>
<span class="c1"># Define the range of factors to test for MOFA
</span><span class="n">FACTOR_RANGE</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">51</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># naming scheme for feature- and factor-level regression results
</span><span class="n">FEATURE_REGRESSION_STR</span> <span class="o">=</span> <span class="s">"feature_regression_{}"</span>
<span class="n">FACTOR_REGRESSION_STR</span> <span class="o">=</span> <span class="s">"mofa_regression_{}"</span>

<span class="c1"># Statistical model specifications
# Used for both feature- and factor-level regression
# We use different covariates for each data modality based on identified batch effects:
# - Transcriptomics: only date_freezing affects expression
# - Proteomics: both proteomics_runorder and date_freezing are significant
# The 's()' notation indicates spline smoothing for non-linear relationships
</span><span class="n">REGRESSION_FORMULAS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"transcriptomics"</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"case"</span> <span class="p">:</span> <span class="s">"~ case + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"OHCblPlus"</span> <span class="p">:</span> <span class="s">"~ OHCblPlus + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"MMA_urine"</span> <span class="p">:</span> <span class="s">"~ MMA_urine + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"responsive_to_acute_treatment"</span> <span class="p">:</span> <span class="s">"~ responsive_to_acute_treatment + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"proteomics_runorder"</span> <span class="p">:</span> <span class="s">"~ s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span> <span class="c1"># we don't expect this to be significant
</span>        <span class="s">"date_freezing"</span> <span class="p">:</span> <span class="s">"~ s(proteomics_runorder) + s(date_freezing)"</span>
    <span class="p">},</span>
    <span class="s">"proteomics"</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"case"</span> <span class="p">:</span> <span class="s">"~ case + s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"OHCblPlus"</span> <span class="p">:</span> <span class="s">"~ OHCblPlus + s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"MMA_urine"</span> <span class="p">:</span> <span class="s">"~ MMA_urine + s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"responsive_to_acute_treatment"</span> <span class="p">:</span> <span class="s">"~ responsive_to_acute_treatment + s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"proteomics_runorder"</span> <span class="p">:</span> <span class="s">"~ s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"date_freezing"</span> <span class="p">:</span> <span class="s">"~ s(proteomics_runorder) + s(date_freezing)"</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1"># utility functions
</span>
<span class="k">def</span> <span class="nf">create_stacked_barplot_from_regression</span><span class="p">(</span><span class="n">regression_results_df</span><span class="p">,</span> <span class="n">q_threshold</span><span class="o">=</span><span class="n">FDR_CUTOFF</span><span class="p">):</span>
    <span class="s">"""
    Create a stacked barplot showing significant results by modality and model_name
    
    Args:
        regression_results_df: DataFrame with columns including 'q_value', 'modality', 'model_name'
        q_threshold: significance threshold for q_value (default: FDR_CUTOFF)
    """</span>
    <span class="c1"># Get significant results using your counting logic
</span>    <span class="n">significant_counts</span> <span class="o">=</span> <span class="n">regression_results_df</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="sa">f</span><span class="s">"q_value &lt; </span><span class="si">{</span><span class="n">q_threshold</span><span class="si">}</span><span class="s">"</span><span class="p">).</span><span class="n">value_counts</span><span class="p">([</span><span class="s">"modality"</span><span class="p">,</span> <span class="s">"model_name"</span><span class="p">])</span>
    
    <span class="c1"># Convert to DataFrame for easier manipulation
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">significant_counts</span><span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'modality'</span><span class="p">,</span> <span class="s">'model_name'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">]</span>
    
    <span class="c1"># Get all unique model_names from the original dataframe to ensure they all appear
</span>    <span class="n">all_model_names</span> <span class="o">=</span> <span class="n">regression_results_df</span><span class="p">[</span><span class="s">'model_name'</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
    <span class="n">all_modalities</span> <span class="o">=</span> <span class="n">regression_results_df</span><span class="p">[</span><span class="s">'modality'</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
    
    <span class="c1"># Create a complete DataFrame with all combinations, filling missing with 0
</span>    <span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">product</span>
    <span class="n">all_combinations</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
        <span class="nb">list</span><span class="p">(</span><span class="n">product</span><span class="p">(</span><span class="n">all_modalities</span><span class="p">,</span> <span class="n">all_model_names</span><span class="p">)),</span> 
        <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'modality'</span><span class="p">,</span> <span class="s">'model_name'</span><span class="p">]</span>
    <span class="p">)</span>
    <span class="n">all_combinations</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
    
    <span class="c1"># Merge with actual counts, keeping all combinations
</span>    <span class="n">df_complete</span> <span class="o">=</span> <span class="n">all_combinations</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span>
        <span class="n">df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s">'modality'</span><span class="p">,</span> <span class="s">'model_name'</span><span class="p">],</span> 
        <span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">,</span> <span class="n">suffixes</span><span class="o">=</span><span class="p">(</span><span class="s">'_empty'</span><span class="p">,</span> <span class="s">'_actual'</span><span class="p">)</span>
    <span class="p">)</span>
    
    <span class="c1"># Use actual counts where available, otherwise use 0
</span>    <span class="n">df_complete</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_complete</span><span class="p">[</span><span class="s">'count_actual'</span><span class="p">].</span><span class="n">fillna</span><span class="p">(</span><span class="n">df_complete</span><span class="p">[</span><span class="s">'count_empty'</span><span class="p">])</span>
    <span class="n">df_complete</span> <span class="o">=</span> <span class="n">df_complete</span><span class="p">[[</span><span class="s">'modality'</span><span class="p">,</span> <span class="s">'model_name'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">]]</span>
    
    <span class="c1"># Set seaborn style
</span>    <span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span><span class="s">"whitegrid"</span><span class="p">)</span>
    
    <span class="c1"># Group by model_name and sum counts across modalities to get total counts for ordering
</span>    <span class="n">total_counts</span> <span class="o">=</span> <span class="n">df_complete</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'model_name'</span><span class="p">)[</span><span class="s">'count'</span><span class="p">].</span><span class="nb">sum</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    
    <span class="c1"># Create pivot table with model_name as index and modality as columns
</span>    <span class="n">pivot_df</span> <span class="o">=</span> <span class="n">df_complete</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">'model_name'</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">'modality'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'count'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    
    <span class="c1"># Reorder by total counts (descending)
</span>    <span class="n">pivot_df</span> <span class="o">=</span> <span class="n">pivot_df</span><span class="p">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">total_counts</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
    
    <span class="c1"># Create the plot
</span>    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">14</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
    
    <span class="c1"># Use seaborn color palette
</span>    <span class="n">colors</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"husl"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">pivot_df</span><span class="p">.</span><span class="n">columns</span><span class="p">))</span>
    
    <span class="c1"># Plot stacked bars
</span>    <span class="n">pivot_df</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'bar'</span><span class="p">,</span> <span class="n">stacked</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
    
    <span class="c1"># Customize the plot
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s">'Significant Results by Model Name and Modality (q &lt; </span><span class="si">{</span><span class="n">q_threshold</span><span class="si">}</span><span class="s">)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Model Name (ordered by total significant count)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Count of Significant Results'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">'Modality'</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.05</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">loc</span><span class="o">=</span><span class="s">'upper left'</span><span class="p">)</span>
    
    <span class="c1"># Add total count labels at the top of each bar
</span>    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">model_name</span><span class="p">,</span> <span class="n">total_count</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">total_counts</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span>
        <span class="k">if</span> <span class="n">total_count</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>  <span class="c1"># Only show label if there are any significant results
</span>            <span class="n">ax</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">total_count</span> <span class="o">+</span> <span class="nb">max</span><span class="p">(</span><span class="n">total_counts</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.01</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">total_count</span><span class="p">)),</span> 
            <span class="n">ha</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s">'bottom'</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    
    <span class="c1"># Rotate x-axis labels for better readability
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'right'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span><span class="p">,</span> <span class="n">pivot_df</span>

<span class="k">def</span> <span class="nf">runorder_from_filename</span><span class="p">(</span><span class="n">fn</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">"""Retrieve runorder form filename

    Args:
        fn (str): Filename ending in runorder

    Returns:
        int: the run order parsed from the filename
    Example:
        &gt;&gt;&gt; runorder_from_filename("asdf_2")
        2
    """</span>
    <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="n">fn</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"_"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>

<span class="c1"># load data
</span><span class="n">supplemental_data_path</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">x</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"path"</span> <span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">SUPPLEMENTAL_DATA_DIR</span><span class="p">,</span> <span class="n">y</span><span class="p">[</span><span class="s">"file"</span><span class="p">]),</span>
        <span class="s">"sheet"</span> <span class="p">:</span> <span class="n">y</span><span class="p">[</span><span class="s">"sheet"</span><span class="p">]</span>
    <span class="p">}</span>
        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">SUPPLEMENTAL_DATA_FILES</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
<span class="p">}</span>

<span class="k">assert</span> <span class="nb">all</span><span class="p">([</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">"path"</span><span class="p">])</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">supplemental_data_path</span><span class="p">.</span><span class="n">values</span><span class="p">()])</span>
</code></pre></div></div>

<p>Next, I’ll load the transcriptomics and proteomics datasets (stored in
separate sheets of the same Excel file), along with the phenotypic data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">supplemental_data</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">x</span> <span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="s">"path"</span><span class="p">],</span> <span class="n">sheet_name</span> <span class="o">=</span> <span class="n">y</span><span class="p">[</span><span class="s">"sheet"</span><span class="p">])</span> <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">supplemental_data_path</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
<span class="p">}</span>

<span class="c1"># formatting
</span><span class="n">supplemental_data</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">]</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">({</span><span class="s">"Unnamed: 0"</span> <span class="p">:</span> <span class="s">"ensembl_gene"</span><span class="p">},</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"ensembl_gene"</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">]</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">({</span><span class="s">"PG.ProteinAccessions"</span> <span class="p">:</span> <span class="s">"uniprot"</span><span class="p">},</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"uniprot"</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">]</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">({</span><span class="s">"Unnamed: 0"</span> <span class="p">:</span> <span class="s">"patient_id"</span><span class="p">},</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"patient_id"</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="data-overview">Data overview</h2>

<p>The Forny et al. dataset contains three key components:</p>

<ul>
  <li><strong>Transcriptomics</strong>: RNA-seq data from patient-derived fibroblasts
(210 MMA patients + 20 controls)</li>
  <li><strong>Proteomics</strong>: Mass spectrometry data from the same cell lines</li>
  <li><strong>Phenotypes</strong>: Clinical measurements including enzymatic assays and
biochemical markers. The main disease phenotypes I will focus on
are:
    <ul>
      <li><em>case</em>: A binary variable indicating whether individuals exhibit
clinical MMA or are healthy controls</li>
      <li><em>responsive_to_acute_treatment</em>: A binary variable indicating
whether patients show clinical improvement following acute
vitamin supplementation interventions</li>
      <li><em>OHCblPlus</em>: A quantitative measure of methylmalonyl-CoA mutase
enzyme activity, where higher values indicate better enzymatic
function</li>
      <li><em>MMA_urine</em>: A quantitative measurement of methylmalonic acid
levels in urine, where elevated levels indicate metabolic
dysfunction and greater disease severity</li>
    </ul>
  </li>
</ul>

<p>Key analytical challenges include severe class imbalance (only 20
controls versus 210 cases), missing phenotype data for many clinical
measurements, potential technical variation from sample processing, and
disease heterogeneity, characterized by varying genetic subtypes and
levels of severity.</p>

<h3 id="formatting-phenotypes-and-technical-covariates">Formatting phenotypes and technical covariates</h3>

<p>Clinical data present unique challenges that require careful
preprocessing. Measurements are often right-skewed and contain missing
values because only a subset of tests is typically ordered for each
patient. Since I’m interested in quantitative biomarkers, having
complete data across patients increases statistical power. While
imputation can address missingness, it is beneficial to first transform
the variables to more closely approximate a multivariate Gaussian
distribution. This transformation also enhances regression modeling by
linearizing relationships between dependent and independent variables.</p>
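<p>As a minimal, self-contained sketch of this transform-then-impute idea
(the toy <code class="language-plaintext highlighter-rouge">biomarker</code> column and the plain log transform are
illustrative assumptions; the analysis below uses helpers from
<code class="language-plaintext highlighter-rouge">shackett_utils</code> rather than this code):</p>

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)

# toy right-skewed clinical measurement with missing values
# (hypothetical data standing in for a real phenotype column)
values = rng.lognormal(mean=1.0, sigma=0.8, size=200)
values[rng.choice(200, size=30, replace=False)] = np.nan
df = pd.DataFrame({"biomarker": values})

observed = df["biomarker"].dropna()

# compare normality before and after a log transform via Shapiro-Wilk
p_raw = stats.shapiro(observed).pvalue
p_log = stats.shapiro(np.log(observed)).pvalue
print(f"Shapiro p-value raw: {p_raw:.2e}, log-transformed: {p_log:.2e}")

# transform first, then impute in the roughly-Gaussian space
log_df = np.log(df)
imputed = KNNImputer(n_neighbors=5).fit_transform(log_df)
```

<p>The point of ordering the steps this way is that KNN imputation uses
distances between samples, and those distances are far more meaningful
once the heavy right tail has been compressed.</p>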

<p>A critical aspect of this dataset is accounting for technical variation.
The proteomics data were collected across different instrument runs, and
I can extract the run order from the original PRIDE repository
filenames. This run order represents a major source of variation that
must be controlled in my statistical models.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># load run order
</span><span class="n">proteomics_file_names</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">PROTEOMICS_FILE_NAMES_URL</span><span class="p">,</span> <span class="n">sep</span> <span class="o">=</span> <span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
<span class="n">proteomics_file_names</span><span class="p">[</span><span class="s">"proteomics_runorder"</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">runorder_from_filename</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span> <span class="k">for</span> <span class="n">fn</span> <span class="ow">in</span> <span class="n">proteomics_file_names</span><span class="p">[</span><span class="s">"File Name"</span><span class="p">]]</span>
<span class="n">proteomics_file_names</span><span class="p">[</span><span class="s">"patient_id"</span><span class="p">]</span> <span class="o">=</span> <span class="n">proteomics_file_names</span><span class="p">[</span><span class="s">"Run Label"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"_"</span><span class="p">,</span> <span class="s">""</span><span class="p">)</span>
<span class="c1"># some samples have multiple files in the PRIDE proteomics repository but each sample is just a single observation in the
# final dataset. Here we'll take the max run order for each patient. These likely reflect reruns due to technical failures
# so the max run order sample is usually the one with the best technical quality.
</span><span class="n">proteomics_file_names</span> <span class="o">=</span> <span class="n">proteomics_file_names</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"patient_id"</span><span class="p">)[</span><span class="s">"proteomics_runorder"</span><span class="p">].</span><span class="nb">max</span><span class="p">()</span>

<span class="c1"># select continuous phenotypes without too many missing values
</span><span class="n">continuous_df</span> <span class="o">=</span> <span class="n">forny_imputation</span><span class="p">.</span><span class="n">_select_continuous_measures</span><span class="p">(</span><span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">],</span> <span class="n">max_missing</span> <span class="o">=</span> <span class="n">MAX_MISSING_PHENOTYPE</span><span class="p">)</span>

<span class="c1"># identify transformations which improve variable normality
</span><span class="n">normalizing_transforms</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">continuous_df</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="n">normalizing_transforms</span><span class="p">[</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="n">transform</span><span class="p">.</span><span class="n">best_normalizing_transform</span><span class="p">(</span><span class="n">continuous_df</span><span class="p">[</span><span class="n">col</span><span class="p">])[</span><span class="s">"best"</span><span class="p">]</span>
    
<span class="n">func_transform_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">col</span><span class="p">:</span> <span class="n">transform</span><span class="p">.</span><span class="n">transform_func_map</span><span class="p">[</span><span class="n">trans</span><span class="p">]</span> <span class="k">for</span> <span class="n">col</span><span class="p">,</span> <span class="n">trans</span> <span class="ow">in</span> <span class="n">normalizing_transforms</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>

<span class="c1"># Apply transformation
</span><span class="n">transformed_df</span> <span class="o">=</span> <span class="n">forny_imputation</span><span class="p">.</span><span class="n">transform_columns</span><span class="p">(</span><span class="n">continuous_df</span><span class="p">,</span> <span class="n">func_transform_dict</span><span class="p">)</span>

<span class="c1"># Create the imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">KNNImputer</span><span class="p">(</span><span class="n">n_neighbors</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="s">"uniform"</span><span class="p">)</span>

<span class="c1"># Fit and transform the data
</span><span class="n">imputed_array</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">transformed_df</span><span class="p">)</span>
<span class="n">imputed_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">imputed_array</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">transformed_df</span><span class="p">.</span><span class="n">columns</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">transformed_df</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>

<span class="n">binary_phenotypes</span> <span class="o">=</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">][[</span><span class="n">col</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">].</span><span class="n">columns</span> <span class="k">if</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">][</span><span class="n">col</span><span class="p">].</span><span class="n">dropna</span><span class="p">().</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="mi">2</span><span class="p">]]</span>

<span class="c1"># create a table of phenotypes
</span><span class="n">phenotypes_df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span>
        <span class="n">binary_phenotypes</span><span class="p">,</span>
        <span class="n">imputed_df</span>
    <span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="c1"># add proteomics run order since this is a batch effect in the proteomics data
</span>    <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">proteomics_file_names</span><span class="p">,</span> <span class="n">how</span> <span class="o">=</span> <span class="s">"left"</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">forny_imputation</span><span class="p">.</span><span class="n">plot_clustered_correlation_heatmap</span><span class="p">(</span><span class="n">transformed_df</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/phenotypes-output-1.png" alt="" /></p>

<p>The correlation heatmap reveals important structure within the clinical
data. Block-diagonal patterns show related assays clustering together:
AdoCbl assays correlate strongly with one another, OHCbl assays form a
distinct cluster, and both are anti-correlated with MMA excretion into
urine. This correlation structure supports my use of a K-Nearest
Neighbors imputation strategy.</p>
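<p>A small synthetic check of why this matters: when two assays track the same latent signal, as the AdoCbl and OHCbl blocks do, KNN imputation recovers masked values far better than a column mean would. Everything below is simulated for illustration:</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
n = 300
# two correlated "assays" driven by one latent signal
latent = rng.normal(size=n)
df = pd.DataFrame({
    "assay_a": latent + 0.2 * rng.normal(size=n),
    "assay_b": latent + 0.2 * rng.normal(size=n),
})

# mask 20% of assay_a, then compare imputations against the truth
mask = rng.random(n) < 0.2
truth = df.loc[mask, "assay_a"].to_numpy()
df_missing = df.copy()
df_missing.loc[mask, "assay_a"] = np.nan

knn = KNNImputer(n_neighbors=5).fit_transform(df_missing)
knn_err = np.mean((knn[mask, 0] - truth) ** 2)
mean_err = np.mean((df_missing["assay_a"].mean() - truth) ** 2)
```

<p>Because the imputer finds neighbors via the intact <code class="language-plaintext highlighter-rouge">assay_b</code> column, <code class="language-plaintext highlighter-rouge">knn_err</code> comes out a fraction of <code class="language-plaintext highlighter-rouge">mean_err</code>.</p>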

<div class="content-section ai-aside">
  <div class="section-content">
    <p>To create the transformation module
for identifying optimal normalizing transformations, I began by
providing high-level problem guidance to Claude. Initially, it produced
overly complicated code focused on finding transformations that would
yield Kolmogorov–Smirnov (KS) test p-values greater than 0.05 for
assessing normality. While it suggested useful transformations like
Box-Cox and Yeo-Johnson, it also recommended quantile normalization —
which, although effective at making data Gaussian, defeats the purpose
of preserving the original data’s distribution. The criterion of
accepting only transformations with p &gt; 0.05 was problematic, since KS
tests are almost always significant when comparing real-world data to
parametric distributions. With further guidance and iterative refinement
in Cursor, the final module performed well. Claude was especially
helpful in handling <code class="language-plaintext highlighter-rouge">matplotlib</code>’s tricky syntax, enabling me to quickly
create complex, specialized plots for visualizing missing value patterns
and comparing transformations.</p>

  </div>
</div>

<h3 id="organizing-results-with-scverse">Organizing results with <code class="language-plaintext highlighter-rouge">scverse</code></h3>

<p>To make my analysis more reproducible and compatible with the broader
genomics ecosystem, I’ll organize my data using the <code class="language-plaintext highlighter-rouge">scverse</code> framework.
This ecosystem, built around <code class="language-plaintext highlighter-rouge">AnnData</code> and <code class="language-plaintext highlighter-rouge">MuData</code> objects, provides
standardized containers for genomics data that integrate observation
metadata, feature annotations, and multiple data layers.</p>

<p>Using AnnData/MuData offers several advantages:</p>

<ul>
  <li><strong>Standardization</strong>: A common format across the Python genomics
ecosystem</li>
  <li><strong>Integration</strong>: Easy to combine with existing tools like <code class="language-plaintext highlighter-rouge">scanpy</code>,
<code class="language-plaintext highlighter-rouge">muon</code>, and <code class="language-plaintext highlighter-rouge">scvi-tools</code></li>
  <li><strong>Metadata management</strong>: Keeps sample annotations and results
together</li>
  <li><strong>Multimodal support</strong>: Native integration of any combination of
data modalities</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">transcr_adata</span> <span class="o">=</span> <span class="n">ad</span><span class="p">.</span><span class="n">AnnData</span><span class="p">(</span>
    <span class="n">X</span> <span class="o">=</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">].</span><span class="n">T</span><span class="p">,</span>
    <span class="c1"># some samples are missing
</span>    <span class="n">obs</span> <span class="o">=</span> <span class="n">phenotypes_df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">phenotypes_df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">supplemental_data</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">].</span><span class="n">columns</span><span class="p">)],</span>
<span class="p">)</span>

<span class="n">protein_metadata_vars</span> <span class="o">=</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">columns</span><span class="p">[</span><span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">columns</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"PG"</span><span class="p">)]</span>

<span class="n">proteomics_adata</span> <span class="o">=</span> <span class="n">ad</span><span class="p">.</span><span class="n">AnnData</span><span class="p">(</span>
    <span class="c1"># drop protein metadata vars and transpose
</span>    <span class="n">X</span> <span class="o">=</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">drop</span><span class="p">(</span><span class="n">protein_metadata_vars</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">).</span><span class="n">T</span><span class="p">,</span>
    <span class="c1"># some samples are missing
</span>    <span class="n">obs</span> <span class="o">=</span> <span class="n">phenotypes_df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">phenotypes_df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">columns</span><span class="p">)],</span>
    <span class="n">var</span> <span class="o">=</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">][</span><span class="n">protein_metadata_vars</span><span class="p">],</span>
<span class="p">)</span>

<span class="n">mdata</span> <span class="o">=</span> <span class="n">md</span><span class="p">.</span><span class="n">MuData</span><span class="p">({</span><span class="s">"transcriptomics"</span><span class="p">:</span> <span class="n">transcr_adata</span><span class="p">,</span> <span class="s">"proteomics"</span><span class="p">:</span> <span class="n">proteomics_adata</span><span class="p">})</span>
<span class="n">mdata</span>
</code></pre></div></div>

<pre>MuData object with n_obs × n_vars = 230 × 19537
  2 modalities
    transcriptomics:    221 x 14749
      obs:  &#x27;case&#x27;, &#x27;gender&#x27;, &#x27;consanguinity&#x27;, &#x27;mut_category&#x27;, &#x27;wgs_zygosity&#x27;, &#x27;acidosis&#x27;, &#x27;metabolic_acidosis&#x27;, &#x27;metabolic_ketoacidosis&#x27;, &#x27;ketosis&#x27;, &#x27;hyperammonemia&#x27;, &#x27;abnormal_muscle_tone&#x27;, &#x27;musc_hypotonia&#x27;, &#x27;musc_hypertonia&#x27;, &#x27;fct_respiratory_abnormality&#x27;, &#x27;dyspnea&#x27;, &#x27;tachypnea&#x27;, &#x27;reduced_consciousness&#x27;, &#x27;lethargy&#x27;, &#x27;coma&#x27;, &#x27;seizures&#x27;, &#x27;general_tonic_clonic_seizure&#x27;, &#x27;any_GI_problem&#x27;, &#x27;failure_to_thrive&#x27;, &#x27;any_delay&#x27;, &#x27;behavioral_abnormality&#x27;, &#x27;concurrent_infection&#x27;, &#x27;urine_ketones&#x27;, &#x27;dialysis&#x27;, &#x27;peritoneal_dialysis&#x27;, &#x27;insulin&#x27;, &#x27;diet&#x27;, &#x27;carnitine&#x27;, &#x27;cobalamin&#x27;, &#x27;bicarb&#x27;, &#x27;glucose_IV&#x27;, &#x27;cobalamin_responsive&#x27;, &#x27;antibiotic_treatment&#x27;, &#x27;protein_restriction&#x27;, &#x27;tube_feeding_day&#x27;, &#x27;tube_feeding_night&#x27;, &#x27;tube_feeding_overall&#x27;, &#x27;language_delay&#x27;, &#x27;any_neurological_abnormalities_chronic&#x27;, &#x27;impaired_kidney_fct&#x27;, &#x27;hemat_abnormality&#x27;, &#x27;anemia&#x27;, &#x27;neutropenia&#x27;, &#x27;skin_abnormalities&#x27;, &#x27;hearing_impairment&#x27;, &#x27;osteoporosis&#x27;, &#x27;failure_to_thrive_chronic&#x27;, &#x27;global_dev_delay_chr&#x27;, &#x27;hypotonia_chr&#x27;, &#x27;basal_ganglia_abnormality_chr&#x27;, &#x27;failure_to_thrive_or_tube_feeding&#x27;, &#x27;irritability&#x27;, &#x27;hyperventilation&#x27;, &#x27;hypothermia&#x27;, &#x27;somnolence&#x27;, &#x27;vomiting&#x27;, &#x27;dehydration&#x27;, &#x27;feeding_problem&#x27;, &#x27;responsive_to_acute_treatment&#x27;, &#x27;n_passage&#x27;, &#x27;date_collection&#x27;, &#x27;date_freezing&#x27;, &#x27;onset_age&#x27;, &#x27;OHCblMinus&#x27;, &#x27;OHCblPlus&#x27;, 
&#x27;ratio&#x27;, &#x27;SimultOHCblMinus&#x27;, &#x27;SimultOHCblPlus&#x27;, &#x27;AdoCblMinus&#x27;, &#x27;AdoCblPlus&#x27;, &#x27;SimultAdoCblMinus&#x27;, &#x27;SimultAdoCblPlus&#x27;, &#x27;prot_mut_level&#x27;, &#x27;rnaseq_mut_level&#x27;, &#x27;MMA_urine&#x27;, &#x27;ammonia_umolL&#x27;, &#x27;pH&#x27;, &#x27;base_excess&#x27;, &#x27;MMA_urine_after_treat&#x27;, &#x27;carnitine_dose&#x27;, &#x27;natural_protein_amount&#x27;, &#x27;total_protein_amount&#x27;, &#x27;weight_centile_quant&#x27;, &#x27;length_centile_quant&#x27;, &#x27;head_circumfernce_quant&#x27;, &#x27;proteomics_runorder&#x27;
    proteomics: 230 x 4788
      obs:  &#x27;case&#x27;, &#x27;gender&#x27;, &#x27;consanguinity&#x27;, &#x27;mut_category&#x27;, &#x27;wgs_zygosity&#x27;, &#x27;acidosis&#x27;, &#x27;metabolic_acidosis&#x27;, &#x27;metabolic_ketoacidosis&#x27;, &#x27;ketosis&#x27;, &#x27;hyperammonemia&#x27;, &#x27;abnormal_muscle_tone&#x27;, &#x27;musc_hypotonia&#x27;, &#x27;musc_hypertonia&#x27;, &#x27;fct_respiratory_abnormality&#x27;, &#x27;dyspnea&#x27;, &#x27;tachypnea&#x27;, &#x27;reduced_consciousness&#x27;, &#x27;lethargy&#x27;, &#x27;coma&#x27;, &#x27;seizures&#x27;, &#x27;general_tonic_clonic_seizure&#x27;, &#x27;any_GI_problem&#x27;, &#x27;failure_to_thrive&#x27;, &#x27;any_delay&#x27;, &#x27;behavioral_abnormality&#x27;, &#x27;concurrent_infection&#x27;, &#x27;urine_ketones&#x27;, &#x27;dialysis&#x27;, &#x27;peritoneal_dialysis&#x27;, &#x27;insulin&#x27;, &#x27;diet&#x27;, &#x27;carnitine&#x27;, &#x27;cobalamin&#x27;, &#x27;bicarb&#x27;, &#x27;glucose_IV&#x27;, &#x27;cobalamin_responsive&#x27;, &#x27;antibiotic_treatment&#x27;, &#x27;protein_restriction&#x27;, &#x27;tube_feeding_day&#x27;, &#x27;tube_feeding_night&#x27;, &#x27;tube_feeding_overall&#x27;, &#x27;language_delay&#x27;, &#x27;any_neurological_abnormalities_chronic&#x27;, &#x27;impaired_kidney_fct&#x27;, &#x27;hemat_abnormality&#x27;, &#x27;anemia&#x27;, &#x27;neutropenia&#x27;, &#x27;skin_abnormalities&#x27;, &#x27;hearing_impairment&#x27;, &#x27;osteoporosis&#x27;, &#x27;failure_to_thrive_chronic&#x27;, &#x27;global_dev_delay_chr&#x27;, &#x27;hypotonia_chr&#x27;, &#x27;basal_ganglia_abnormality_chr&#x27;, &#x27;failure_to_thrive_or_tube_feeding&#x27;, &#x27;irritability&#x27;, &#x27;hyperventilation&#x27;, &#x27;hypothermia&#x27;, &#x27;somnolence&#x27;, &#x27;vomiting&#x27;, &#x27;dehydration&#x27;, &#x27;feeding_problem&#x27;, &#x27;responsive_to_acute_treatment&#x27;, &#x27;n_passage&#x27;, &#x27;date_collection&#x27;, &#x27;date_freezing&#x27;, &#x27;onset_age&#x27;, &#x27;OHCblMinus&#x27;, &#x27;OHCblPlus&#x27;, 
&#x27;ratio&#x27;, &#x27;SimultOHCblMinus&#x27;, &#x27;SimultOHCblPlus&#x27;, &#x27;AdoCblMinus&#x27;, &#x27;AdoCblPlus&#x27;, &#x27;SimultAdoCblMinus&#x27;, &#x27;SimultAdoCblPlus&#x27;, &#x27;prot_mut_level&#x27;, &#x27;rnaseq_mut_level&#x27;, &#x27;MMA_urine&#x27;, &#x27;ammonia_umolL&#x27;, &#x27;pH&#x27;, &#x27;base_excess&#x27;, &#x27;MMA_urine_after_treat&#x27;, &#x27;carnitine_dose&#x27;, &#x27;natural_protein_amount&#x27;, &#x27;total_protein_amount&#x27;, &#x27;weight_centile_quant&#x27;, &#x27;length_centile_quant&#x27;, &#x27;head_circumfernce_quant&#x27;, &#x27;proteomics_runorder&#x27;
      var:  &#x27;PG.ProteinDescriptions&#x27;, &#x27;PG.ProteinNames&#x27;, &#x27;PG.Qvalue&#x27;</pre>

<h3 id="data-normalization">Data normalization</h3>

<p>Following the methodology from Forny et al., I will apply a three-step
normalization process designed to make the data more suitable for
statistical analysis:</p>

<ol>
  <li><strong>Filtering poorly measured features</strong>: Remove genes with very few
reads across samples</li>
  <li><strong>Log-transformation</strong>: Stabilize variance and make distributions
more Gaussian (necessary because I’m working with processed data
rather than original counts)</li>
  <li><strong>Row and column centering</strong>: Remove systematic biases like library
size effects while preserving biological signal (following the Forny
methodology)</li>
</ol>
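<p>The centering step can be sketched in plain <code class="language-plaintext highlighter-rouge">numpy</code>; the package helper <code class="language-plaintext highlighter-rouge">center_rows_and_columns_mudata</code> used below operates on a MuData layer, but the arithmetic is just this:</p>

```python
import numpy as np

def center_rows_and_columns(x):
    """Double centering: subtract column means (per-feature offsets),
    then row means (per-sample effects such as library size).

    One pass of each leaves both row and column means at zero, since
    after column centering the grand mean is already zero.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)  # column (feature) centering
    x = x - x.mean(axis=1, keepdims=True)  # row (sample) centering
    return x

rng = np.random.default_rng(0)
mat = rng.normal(size=(6, 4)) + np.linspace(0, 3, 4)  # per-feature offsets
centered = center_rows_and_columns(mat)
```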

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># filter to drop features with low counts
</span><span class="n">processing</span><span class="p">.</span><span class="n">filter_features_by_counts</span><span class="p">(</span><span class="n">mdata</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">],</span> <span class="n">min_counts</span> <span class="o">=</span> <span class="n">READ_CUTOFF</span><span class="p">)</span>

<span class="c1"># add a pseudocount before logging
</span><span class="n">processing</span><span class="p">.</span><span class="n">log2_transform</span><span class="p">(</span><span class="n">mdata</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">],</span> <span class="n">pseudocount</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1"># proteomics has a minimum value of 1 so no pseudocounts are needed before logging
</span><span class="n">processing</span><span class="p">.</span><span class="n">log2_transform</span><span class="p">(</span><span class="n">mdata</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">],</span> <span class="n">pseudocount</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>

<span class="c1"># row and column center as per Forny paper
</span><span class="n">processing</span><span class="p">.</span><span class="n">center_rows_and_columns_mudata</span><span class="p">(</span>
    <span class="n">mdata</span><span class="p">,</span>
    <span class="n">layer</span> <span class="o">=</span> <span class="s">"log2"</span><span class="p">,</span>
    <span class="n">new_layer_name</span> <span class="o">=</span> <span class="s">"log2_centered"</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="exploratory-data-analysis">Exploratory data analysis</h2>

<p>Before testing specific hypotheses, I would like to better understand
the major sources of variation shaping this dataset. Principal Component
Analysis (PCA) is ideal for this because it identifies the main patterns
of variation without relying on assumptions about what should be
important.</p>

<p>This analysis will help me assess data quality, detect technical batch
effects, and select covariates necessary for reliably characterizing
subtle biological signals.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Process each modality
</span><span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'transcriptomics'</span><span class="p">,</span> <span class="s">'proteomics'</span><span class="p">]:</span>
    <span class="c1"># add PCs
</span>    <span class="n">sc</span><span class="p">.</span><span class="n">pp</span><span class="p">.</span><span class="n">pca</span><span class="p">(</span><span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">],</span> <span class="n">layer</span> <span class="o">=</span> <span class="n">ANALYSIS_LAYER</span><span class="p">)</span>
    
<span class="n">mdata_eda</span><span class="p">.</span><span class="n">plot_mudata_pca_variance</span><span class="p">(</span><span class="n">mdata</span><span class="p">,</span> <span class="n">n_pcs</span> <span class="o">=</span> <span class="mi">50</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/scree_plot-output-1.png" alt="" /></p>

<p>The scree plots show that variance is distributed across many components
instead of being concentrated in just a few. The gradual decline —
with no sharp “elbow” — indicates the presence of multiple sources of
variation rather than a small number of dominant factors. This pattern
is typical of patient-derived samples, which exhibit both genetic and
environmental heterogeneity.</p>

<p>To identify the major observed sources of variation — biological or
technical — I will calculate correlations between principal components
and sample attributes.</p>
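<p>The screen behind <code class="language-plaintext highlighter-rouge">analyze_pc_metadata_correlation_mudata</code> can be sketched as follows, using hypothetical toy data: Pearson-correlate each PC with each numeric sample attribute (binary phenotypes coded 0/1), yielding a PCs-by-attributes table to scan for batch effects:</p>

```python
import numpy as np
import pandas as pd

def pc_metadata_correlations(pcs, metadata):
    """Correlate each principal component with each numeric attribute.

    Returns a (n_pcs x n_attributes) table of Pearson correlations;
    large absolute values flag candidate batch effects or phenotypes.
    """
    pcs = pd.DataFrame(
        pcs, index=metadata.index,
        columns=[f"PC{i + 1}" for i in range(pcs.shape[1])],
    )
    numeric = metadata.select_dtypes("number")
    return pd.concat([pcs, numeric], axis=1).corr().loc[pcs.columns, numeric.columns]

# toy example: PC1 tracks run order, PC2 is noise
rng = np.random.default_rng(2)
meta = pd.DataFrame({"runorder": np.arange(100), "case": rng.integers(0, 2, 100)})
pcs = np.column_stack([meta["runorder"] + rng.normal(0, 5, 100), rng.normal(size=100)])
corrs = pc_metadata_correlations(pcs, meta)
```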

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># put these at the top regardless of significance
</span><span class="n">PRIORITIZED_PHENOTYPES</span> <span class="o">=</span> <span class="p">[</span><span class="s">"date_freezing"</span><span class="p">,</span> <span class="s">"proteomics_runorder"</span><span class="p">,</span> <span class="s">"case"</span><span class="p">,</span> <span class="s">"responsive_to_acute_treatment"</span><span class="p">,</span> <span class="s">"OHCblPlus"</span><span class="p">,</span> <span class="s">"MMA_urine"</span><span class="p">]</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">mdata_eda</span><span class="p">.</span><span class="n">analyze_pc_metadata_correlation_mudata</span><span class="p">(</span>
    <span class="n">mdata</span><span class="p">,</span>
    <span class="n">n_pcs</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
    <span class="n">prioritized_vars</span><span class="o">=</span><span class="n">PRIORITIZED_PHENOTYPES</span><span class="p">,</span>  <span class="c1"># This will always show 'case' variable at the top
</span>    <span class="n">pca_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">'svd_solver'</span><span class="p">:</span> <span class="s">'arpack'</span><span class="p">},</span>
    <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pc_corrplot-output-1.png" alt="" /></p>

<p>The PCA analysis reveals several key patterns. Batch effects dominate
the primary sources of variation — <em>date_freezing</em> and
<em>proteomics_runorder</em> show stronger associations with the leading PCs
than disease variables, highlighting the importance of including them as
covariates. Disease signals are subtle; the weak correlations between
disease markers and PCs suggest that, like its variable clinical
presentation, MMA is also molecularly heterogeneous. Finally,
<em>proteomics_runorder</em> strongly affects proteomics data but not
transcriptomics.</p>

<p>This analysis informs my regression model specification: I will use
different covariates for each data modality based on the identified
batch effects.</p>

<h2 id="supervised-analysis">Supervised analysis</h2>

<p>I can now systematically identify molecular features associated with
disease phenotypes, including case–control status, responsiveness to
acute treatment, <em>OHCblPlus</em> (enzyme activity), and urine MMA levels. To
do this, I will use Generalized Additive Models (GAMs) to account for
potentially nonlinear effects of <em>date_freezing</em> and
<em>proteomics_runorder</em> on feature abundance.</p>

<p>My modeling strategy involves feature-wise testing, where each gene or
protein is independently analyzed through regression models that capture
specific biological effects while adjusting for covariates. For
transcripts, I control for <em>date_freezing</em>; for proteins, I control for
both <em>date_freezing</em> and <em>proteomics_runorder</em> (I will later validate
that including both is appropriate). Finally, I apply false discovery
rate (FDR) control to account for multiple testing across all features
(see my <a href="https://www.shackett.org/lfdr_shrinkage/">lFDR shrinkage post</a>
for a detailed review).</p>
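<p>The feature-wise loop can be sketched with ordinary least squares plus Benjamini-Hochberg FDR control; the real analysis fits GAMs with nonlinear covariate terms, so treat the OLS stand-in and the simulated data below as illustrative assumptions:</p>

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)
n = 120
obs = pd.DataFrame({
    "case": rng.integers(0, 2, n),
    "runorder": np.arange(n),  # stand-in batch covariate
})

# toy expression matrix: feature 'g0' carries a true case effect
X = pd.DataFrame(rng.normal(size=(n, 20)), columns=[f"g{i}" for i in range(20)])
X["g0"] += 1.5 * obs["case"]

# feature-wise models: abundance ~ effect of interest + batch covariate
pvals = []
for gene in X.columns:
    fit = smf.ols("y ~ case + runorder", data=obs.assign(y=X[gene].to_numpy())).fit()
    pvals.append(fit.pvalues["case"])

# FDR control across all tested features
rejected, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
hits = [g for g, r in zip(X.columns, rejected) if r]
```

<p>With a strong planted effect, <code class="language-plaintext highlighter-rouge">g0</code> survives FDR correction while the null features mostly do not.</p>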

<div class="content-section bio-section">
  <div class="section-content">
    <p>Forny et al. approached this
problem by regressing feature-level abundances on <em>OHCblPlus</em> while
accounting for genetic-relatedness of individuals using random effects.
This is especially important in rare disease settings where multiple
cases and controls often come from the same family. Since genotypes are
protected medical information under Switzerland’s Federal Act on Data
Protection (FADP), this information wasn’t available for my re-analysis.</p>

  </div>
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Process each modality
</span><span class="n">regression_results</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="n">mdata</span><span class="p">.</span><span class="n">mod_names</span><span class="p">:</span>
    
    <span class="n">modality_results</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="c1"># Apply regression per feature
</span>    <span class="k">for</span> <span class="n">formula_name</span><span class="p">,</span> <span class="n">formula</span> <span class="ow">in</span> <span class="n">REGRESSION_FORMULAS</span><span class="p">[</span><span class="n">modality</span><span class="p">].</span><span class="n">items</span><span class="p">():</span>

        <span class="n">summaries</span> <span class="o">=</span> <span class="n">adata_regression</span><span class="p">.</span><span class="n">adata_model_fitting</span><span class="p">(</span>
            <span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">],</span>
            <span class="n">formula</span><span class="p">,</span>
            <span class="n">n_jobs</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
            <span class="n">layer</span> <span class="o">=</span> <span class="n">ANALYSIS_LAYER</span><span class="p">,</span>
            <span class="n">model_name</span> <span class="o">=</span> <span class="n">formula_name</span><span class="p">,</span>
            <span class="n">progress_bar</span> <span class="o">=</span> <span class="bp">False</span>
        <span class="p">)</span>

        <span class="c1"># Remove intercepts and covariates
</span>        <span class="n">mask</span> <span class="o">=</span> <span class="p">[</span><span class="n">name</span> <span class="o">==</span> <span class="n">term</span> <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">term</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">summaries</span><span class="p">[</span><span class="s">"model_name"</span><span class="p">],</span> <span class="n">summaries</span><span class="p">[</span><span class="s">"term"</span><span class="p">])]</span>
        <span class="n">summaries</span> <span class="o">=</span> <span class="n">summaries</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">mask</span><span class="p">]</span>

        <span class="n">modality_results</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">summaries</span><span class="p">)</span>
    
    <span class="c1"># combine results and add modality first
</span>    <span class="n">modality_results</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">modality_results</span><span class="p">).</span><span class="n">assign</span><span class="p">(</span><span class="n">modality</span> <span class="o">=</span> <span class="n">modality</span><span class="p">)[[</span><span class="s">'modality'</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="n">col</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">modality_results</span><span class="p">).</span><span class="n">columns</span> <span class="k">if</span> <span class="n">col</span> <span class="o">!=</span> <span class="s">'modality'</span><span class="p">]]</span>
    
    <span class="c1"># add results to adata's modality-level var table
</span>    <span class="n">adata_regression</span><span class="p">.</span><span class="n">add_regression_results_to_anndata</span><span class="p">(</span>
        <span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">],</span>
        <span class="n">modality_results</span><span class="p">,</span>
        <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="p">)</span>
    
    <span class="n">regression_results</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">modality_results</span><span class="p">)</span>

<span class="n">regression_results_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">regression_results</span><span class="p">)</span>

<span class="n">create_stacked_barplot_from_regression</span><span class="p">(</span><span class="n">regression_results_df</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/regression-output-1.png" alt="" /></p>

<p>The batch effect validation supports my modeling strategy: when fitting
a model with both <em>proteomics_runorder</em> and <em>date_freezing</em> to both
modalities, I observe significant associations for both covariates in
the proteomics data, while runorder shows no effect in the
transcriptomics data, confirming that runorder is indeed a
proteomics-specific technical artifact. Although the overall number of
nominally significant associations appears modest for most biological
effects of interest, the p-value histograms below suggest that this
reflects limited statistical power rather than an absence of biological
signal. This highlights an important opportunity for network-based
methods, which can recover weak but mechanistically coherent
associations by pooling signals across connected molecular components
rather than analyzing features in isolation.</p>

<div class="content-section ai-aside">
  <div class="section-content">
    <p>To break ground on this problem, I
asked Claude to draft a differential expression workflow that accepts an
<code class="language-plaintext highlighter-rouge">AnnData</code> object and a regression formula as input, and outputs
<a href="https://cran.r-project.org/web/packages/broom/vignettes/broom.html">broom</a>-like
tidy summaries for each feature-by-term combination (term, effect size,
t-statistic, p-value). Claude quickly produced a working example with
some nice features, such as parallelization. However, the actual
implementation amounted to spaghetti code. Under the hood, the functions
ignored the provided formulas and instead reformulated them as a set of
simple linear regressions. This approach mishandled covariates, and the
large gap between my intent—as someone with a fair bit of statistical
knowledge—and the actual implementation was concerning.</p>

<p>There was no easy fix, and it didn’t make sense to build a proper
statistical framework from a collection of loosely connected .py files
lacking tests. So, I sidestepped the issue and implemented the workflow
in my <a href="https://github.com/shackett/shackett-utils">shackett-utils</a>
package as feature-level OLS and GAM modules, a multi-model fitting
module, and a lightweight wrapper that specifically supports <code class="language-plaintext highlighter-rouge">AnnData</code>
as input.</p>
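<p>To sketch the shape of this workflow (this is an illustrative reimplementation, not the actual <code class="language-plaintext highlighter-rouge">shackett-utils</code> API; the function name <code class="language-plaintext highlighter-rouge">tidy_ols_per_feature</code> is hypothetical): fit one OLS model per feature using the user-supplied formula, and stack the per-term coefficient summaries into a single broom-style tidy table.</p>

```python
import pandas as pd
import statsmodels.formula.api as smf


def tidy_ols_per_feature(X: pd.DataFrame, obs: pd.DataFrame, formula_rhs: str) -> pd.DataFrame:
    """Fit one OLS model per feature (column of X) against sample-level
    covariates in `obs`, returning tidy term-level summaries."""
    results = []
    for feature in X.columns:
        data = obs.copy()
        data["y"] = X[feature].values
        # the user-supplied right-hand side is passed through unchanged,
        # so covariates are handled jointly rather than one at a time
        fit = smf.ols(f"y ~ {formula_rhs}", data=data).fit()
        results.append(pd.DataFrame({
            "feature": feature,
            "term": fit.params.index,
            "estimate": fit.params.values,
            "statistic": fit.tvalues.values,
            "p_value": fit.pvalues.values,
        }))
    return pd.concat(results, ignore_index=True)
```

<p>The key point is that the formula reaches the fitting routine intact, which is exactly what the original AI-drafted implementation failed to do.</p>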

<p>To flesh out these modules, I collaborated extensively with Claude —
particularly on writing tests. Claude was especially helpful in
implementing tricky features, such as calculating p-values in log-space
to avoid underflow to zero.</p>

<p>This vignette nicely encapsulates my experience doing science with AI.
It’s excellent for breaking ground on a problem and for tasks that are
either easily verifiable (like visualization) or routine (like
exploratory data analysis). For such one-off tasks, AI is probably
sufficient. But for problems where the implementation strategy is
unclear, it’s important to approach the work like traditional software
development: implement specific features, write tests consistently, and
avoid feature bloat. In this context, AI can be a powerful asset —
implementing features and tests in real time — while allowing us
humans to focus on providing code review.</p>

  </div>
</div>

<h3 id="interpreting-p-value-histograms">Interpreting p-value histograms</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># plot p-value histograms
</span><span class="n">TERMS_FOR_PVALUE_HISTOGRAM</span> <span class="o">=</span> <span class="p">[</span><span class="s">"case"</span><span class="p">,</span> <span class="s">"responsive_to_acute_treatment"</span><span class="p">,</span> <span class="s">"MMA_urine"</span><span class="p">,</span> <span class="s">"OHCblPlus"</span><span class="p">,</span> <span class="s">"date_freezing"</span><span class="p">]</span>
<span class="n">stats_viz</span><span class="p">.</span><span class="n">plot_pvalue_histograms</span><span class="p">(</span>
    <span class="n">regression_results_df</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"term in @TERMS_FOR_PVALUE_HISTOGRAM"</span><span class="p">),</span>
    <span class="n">term_column</span> <span class="o">=</span> <span class="s">"term"</span><span class="p">,</span>
    <span class="n">fdr_cutoff</span> <span class="o">=</span> <span class="n">FDR_CUTOFF</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pvalue_histograms-output-1.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pvalue_histograms-output-2.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pvalue_histograms-output-3.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pvalue_histograms-output-4.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pvalue_histograms-output-5.png" alt="" /></p>

<p>P-value distributions provide powerful diagnostics for my regression
models. You can think of a p-value histogram as a mixture of two
distributions: a Uniform(0,1) distribution representing true null
hypotheses, and a distribution skewed toward zero representing true
positives. Visually examining this mixture helps estimate the proportion
of real signals and detect pathological features, such as p-value
clumping or enrichment near 1.</p>
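<p>This mixture view is easy to make concrete with a quick simulation. The Storey-style estimate of the null proportion below is a standard heuristic for such mixtures, shown here for illustration rather than as the method used in this post:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# simulate a p-value mixture: 80% true nulls (Uniform(0,1)),
# 20% true positives (skewed toward zero)
null_p = rng.uniform(0, 1, size=8000)
alt_p = rng.beta(0.2, 5.0, size=2000)  # concentrated near 0
pvals = np.concatenate([null_p, alt_p])

# Storey-style pi0 estimate: above a threshold lambda, nearly all
# p-values come from true nulls, so the density there estimates
# the null proportion
lam = 0.5
pi0 = np.mean(pvals > lam) / (1 - lam)
```

<p>In a histogram of <code class="language-plaintext highlighter-rouge">pvals</code>, the flat region above <code class="language-plaintext highlighter-rouge">lam</code> sets the height of the null component, and the excess near zero is the signal.</p>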

<p>My results show varying levels of biological signal across different
phenotypes: <em>MMA_urine</em> and <em>OHCblPlus</em> display clear evidence of
biological associations, whereas <em>case</em> status shows little signal after
covariate adjustment.</p>

<h3 id="validating-results-focus-on-mmut">Validating results: focus on <em>MMUT</em></h3>

<p>Before moving forward, it’s important to verify that my models are
capturing true biological signals by spot-checking some gold-standard
associations. Let’s examine the top associations and confirm that they
make biological sense, focusing on <em>MMUT</em> (UniProt: P22033), the primary
gene involved in MMA pathogenesis.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PHENOTYPE_EXAMPLE</span> <span class="o">=</span> <span class="p">[</span><span class="s">"MMA_urine"</span><span class="p">,</span> <span class="s">"OHCblPlus"</span><span class="p">]</span>

<span class="k">for</span> <span class="n">phenotype</span> <span class="ow">in</span> <span class="n">PHENOTYPE_EXAMPLE</span><span class="p">:</span>
    <span class="n">example_stat_summaries</span> <span class="o">=</span> <span class="n">mdata</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">var</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="n">mdata</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">var</span><span class="p">.</span><span class="n">columns</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="n">phenotype</span><span class="p">)].</span><span class="n">sort_values</span><span class="p">(</span><span class="sa">f</span><span class="s">"log10p_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">).</span><span class="n">head</span><span class="p">()</span>
    <span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">example_stat_summaries</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
    
    <span class="n">display_tabulator</span><span class="p">(</span>
        <span class="n">example_stat_summaries</span><span class="p">,</span>
        <span class="n">caption</span><span class="o">=</span><span class="sa">f</span><span class="s">"Top associations with </span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
        <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
        <span class="n">width</span> <span class="o">=</span> <span class="s">"auto"</span>
    <span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Top associations with MMA_urine
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;uniprot&quot;: &quot;P22033&quot;, &quot;est_MMA_urine&quot;: &quot;-0.207&quot;, &quot;p_MMA_urine&quot;: &quot;0.000&quot;, &quot;log10p_MMA_urine&quot;: &quot;-8.239&quot;, &quot;q_MMA_urine&quot;: &quot;0.000&quot;, &quot;stat_MMA_urine&quot;: &quot;-6.077&quot;, &quot;stderr_MMA_urine&quot;: &quot;0.034&quot;}, {&quot;uniprot&quot;: &quot;O43617&quot;, &quot;est_MMA_urine&quot;: &quot;0.072&quot;, &quot;p_MMA_urine&quot;: &quot;0.000&quot;, &quot;log10p_MMA_urine&quot;: &quot;-6.295&quot;, &quot;q_MMA_urine&quot;: &quot;0.001&quot;, &quot;stat_MMA_urine&quot;: &quot;5.187&quot;, &quot;stderr_MMA_urine&quot;: &quot;0.014&quot;}, {&quot;uniprot&quot;: &quot;O75976&quot;, &quot;est_MMA_urine&quot;: &quot;-0.135&quot;, &quot;p_MMA_urine&quot;: &quot;0.000&quot;, &quot;log10p_MMA_urine&quot;: &quot;-5.239&quot;, &quot;q_MMA_urine&quot;: &quot;0.009&quot;, &quot;stat_MMA_urine&quot;: &quot;-4.655&quot;, &quot;stderr_MMA_urine&quot;: &quot;0.029&quot;}, {&quot;uniprot&quot;: &quot;Q9UJW0&quot;, &quot;est_MMA_urine&quot;: &quot;0.086&quot;, &quot;p_MMA_urine&quot;: &quot;0.000&quot;, &quot;log10p_MMA_urine&quot;: &quot;-5.061&quot;, &quot;q_MMA_urine&quot;: &quot;0.010&quot;, &quot;stat_MMA_urine&quot;: &quot;4.561&quot;, &quot;stderr_MMA_urine&quot;: &quot;0.019&quot;}, {&quot;uniprot&quot;: &quot;O43237&quot;, &quot;est_MMA_urine&quot;: &quot;0.057&quot;, &quot;p_MMA_urine&quot;: &quot;0.000&quot;, &quot;log10p_MMA_urine&quot;: &quot;-4.999&quot;, &quot;q_MMA_urine&quot;: &quot;0.010&quot;, &quot;stat_MMA_urine&quot;: &quot;4.528&quot;, &quot;stderr_MMA_urine&quot;: &quot;0.013&quot;}]" data-columns="[{&quot;title&quot;: &quot;uniprot&quot;, &quot;field&quot;: &quot;uniprot&quot;}, {&quot;title&quot;: &quot;est_MMA_urine&quot;, &quot;field&quot;: &quot;est_MMA_urine&quot;}, {&quot;title&quot;: &quot;p_MMA_urine&quot;, &quot;field&quot;: &quot;p_MMA_urine&quot;}, 
{&quot;title&quot;: &quot;log10p_MMA_urine&quot;, &quot;field&quot;: &quot;log10p_MMA_urine&quot;}, {&quot;title&quot;: &quot;q_MMA_urine&quot;, &quot;field&quot;: &quot;q_MMA_urine&quot;}, {&quot;title&quot;: &quot;stat_MMA_urine&quot;, &quot;field&quot;: &quot;stat_MMA_urine&quot;}, {&quot;title&quot;: &quot;stderr_MMA_urine&quot;, &quot;field&quot;: &quot;stderr_MMA_urine&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Top associations with OHCblPlus
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;uniprot&quot;: &quot;P61106&quot;, &quot;est_OHCblPlus&quot;: &quot;0.070&quot;, &quot;p_OHCblPlus&quot;: &quot;0.000&quot;, &quot;log10p_OHCblPlus&quot;: &quot;-5.308&quot;, &quot;q_OHCblPlus&quot;: &quot;0.014&quot;, &quot;stat_OHCblPlus&quot;: &quot;4.691&quot;, &quot;stderr_OHCblPlus&quot;: &quot;0.015&quot;}, {&quot;uniprot&quot;: &quot;P22033&quot;, &quot;est_OHCblPlus&quot;: &quot;0.283&quot;, &quot;p_OHCblPlus&quot;: &quot;0.000&quot;, &quot;log10p_OHCblPlus&quot;: &quot;-5.221&quot;, &quot;q_OHCblPlus&quot;: &quot;0.014&quot;, &quot;stat_OHCblPlus&quot;: &quot;4.645&quot;, &quot;stderr_OHCblPlus&quot;: &quot;0.061&quot;}, {&quot;uniprot&quot;: &quot;Q13428&quot;, &quot;est_OHCblPlus&quot;: &quot;0.123&quot;, &quot;p_OHCblPlus&quot;: &quot;0.000&quot;, &quot;log10p_OHCblPlus&quot;: &quot;-4.033&quot;, &quot;q_OHCblPlus&quot;: &quot;0.090&quot;, &quot;stat_OHCblPlus&quot;: &quot;3.987&quot;, &quot;stderr_OHCblPlus&quot;: &quot;0.031&quot;}, {&quot;uniprot&quot;: &quot;P32119&quot;, &quot;est_OHCblPlus&quot;: &quot;-0.099&quot;, &quot;p_OHCblPlus&quot;: &quot;0.000&quot;, &quot;log10p_OHCblPlus&quot;: &quot;-3.939&quot;, &quot;q_OHCblPlus&quot;: &quot;0.090&quot;, &quot;stat_OHCblPlus&quot;: &quot;-3.931&quot;, &quot;stderr_OHCblPlus&quot;: &quot;0.025&quot;}, {&quot;uniprot&quot;: &quot;Q7L0Y3&quot;, &quot;est_OHCblPlus&quot;: &quot;0.224&quot;, &quot;p_OHCblPlus&quot;: &quot;0.000&quot;, &quot;log10p_OHCblPlus&quot;: &quot;-3.849&quot;, &quot;q_OHCblPlus&quot;: &quot;0.090&quot;, &quot;stat_OHCblPlus&quot;: &quot;3.877&quot;, &quot;stderr_OHCblPlus&quot;: &quot;0.058&quot;}]" data-columns="[{&quot;title&quot;: &quot;uniprot&quot;, &quot;field&quot;: &quot;uniprot&quot;}, {&quot;title&quot;: &quot;est_OHCblPlus&quot;, &quot;field&quot;: &quot;est_OHCblPlus&quot;}, {&quot;title&quot;: &quot;p_OHCblPlus&quot;, &quot;field&quot;: &quot;p_OHCblPlus&quot;}, {&quot;title&quot;: 
&quot;log10p_OHCblPlus&quot;, &quot;field&quot;: &quot;log10p_OHCblPlus&quot;}, {&quot;title&quot;: &quot;q_OHCblPlus&quot;, &quot;field&quot;: &quot;q_OHCblPlus&quot;}, {&quot;title&quot;: &quot;stat_OHCblPlus&quot;, &quot;field&quot;: &quot;stat_OHCblPlus&quot;}, {&quot;title&quot;: &quot;stderr_OHCblPlus&quot;, &quot;field&quot;: &quot;stderr_OHCblPlus&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GENE_ASSOCIATIONS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"transcriptomics"</span> <span class="p">:</span> <span class="s">"ENSG00000146085"</span><span class="p">,</span>
    <span class="s">"proteomics"</span> <span class="p">:</span> <span class="s">"P22033"</span>
<span class="p">}</span>

<span class="k">for</span> <span class="n">modality</span><span class="p">,</span> <span class="n">identifier</span> <span class="ow">in</span> <span class="n">GENE_ASSOCIATIONS</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>

    <span class="n">gene_summaries</span> <span class="o">=</span> <span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">].</span><span class="n">var</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">].</span><span class="n">var_names</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="n">identifier</span><span class="p">)]</span>
    <span class="n">df_transposed</span> <span class="o">=</span> <span class="n">gene_summaries</span><span class="p">.</span><span class="n">T</span><span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">df_transposed</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'column'</span><span class="p">,</span> <span class="s">'value'</span><span class="p">]</span>  <span class="c1"># Rename columns
</span>    <span class="k">if</span> <span class="n">modality</span> <span class="o">==</span> <span class="s">"proteomics"</span><span class="p">:</span>
        <span class="c1"># remove entries starting with PG.
</span>        <span class="n">df_transposed</span> <span class="o">=</span> <span class="n">df_transposed</span><span class="p">[</span><span class="o">~</span><span class="n">df_transposed</span><span class="p">[</span><span class="s">"column"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"PG."</span><span class="p">)]</span>

    <span class="c1"># Extract the prefix and term from the column names
</span>    <span class="n">df_transposed</span><span class="p">[</span><span class="s">'prefix'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_transposed</span><span class="p">[</span><span class="s">'column'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="sa">r</span><span class="s">'^([^_]+)_'</span><span class="p">)</span>
    <span class="n">df_transposed</span><span class="p">[</span><span class="s">'term'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_transposed</span><span class="p">[</span><span class="s">'column'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="sa">r</span><span class="s">'^[^_]+_(.+)$'</span><span class="p">)</span>

    <span class="c1"># Pivot to get prefixes as columns and terms as rows
</span>    <span class="n">df_pivoted</span> <span class="o">=</span> <span class="n">df_transposed</span><span class="p">.</span><span class="n">pivot</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">'term'</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">'prefix'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'value'</span><span class="p">)</span>
    <span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">df_pivoted</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
    <span class="n">df_pivoted</span> <span class="o">=</span> <span class="n">df_pivoted</span><span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="s">"."</span><span class="p">)</span>

    <span class="n">display_tabulator</span><span class="p">(</span>
        <span class="n">df_pivoted</span><span class="p">,</span>
        <span class="n">caption</span><span class="o">=</span><span class="sa">f</span><span class="s">"Top associations with </span><span class="si">{</span><span class="n">modality</span><span class="si">}</span><span class="s"> </span><span class="si">{</span><span class="n">identifier</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
        <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
        <span class="n">width</span> <span class="o">=</span> <span class="s">"auto"</span>
    <span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Top associations with transcriptomics ENSG00000146085
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;term&quot;: &quot;MMA_urine&quot;, &quot;est&quot;: &quot;-0.038&quot;, &quot;log10p&quot;: &quot;-0.378&quot;, &quot;p&quot;: &quot;0.418&quot;, &quot;q&quot;: &quot;0.873&quot;, &quot;stat&quot;: &quot;-0.811&quot;, &quot;stderr&quot;: &quot;0.046&quot;}, {&quot;term&quot;: &quot;OHCblPlus&quot;, &quot;est&quot;: &quot;0.383&quot;, &quot;log10p&quot;: &quot;-6.451&quot;, &quot;p&quot;: &quot;0.000&quot;, &quot;q&quot;: &quot;0.003&quot;, &quot;stat&quot;: &quot;5.261&quot;, &quot;stderr&quot;: &quot;0.073&quot;}, {&quot;term&quot;: &quot;case&quot;, &quot;est&quot;: &quot;-0.606&quot;, &quot;log10p&quot;: &quot;-1.622&quot;, &quot;p&quot;: &quot;0.024&quot;, &quot;q&quot;: &quot;0.627&quot;, &quot;stat&quot;: &quot;-2.275&quot;, &quot;stderr&quot;: &quot;0.266&quot;}, {&quot;term&quot;: &quot;date_freezing&quot;, &quot;est&quot;: &quot;nan&quot;, &quot;log10p&quot;: &quot;nan&quot;, &quot;p&quot;: &quot;0.922&quot;, &quot;q&quot;: &quot;0.927&quot;, &quot;stat&quot;: &quot;nan&quot;, &quot;stderr&quot;: &quot;nan&quot;}, {&quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;est&quot;: &quot;nan&quot;, &quot;log10p&quot;: &quot;nan&quot;, &quot;p&quot;: &quot;0.820&quot;, &quot;q&quot;: &quot;0.973&quot;, &quot;stat&quot;: &quot;nan&quot;, &quot;stderr&quot;: &quot;nan&quot;}, {&quot;term&quot;: &quot;responsive_to_acute_treatment&quot;, &quot;est&quot;: &quot;-0.210&quot;, &quot;log10p&quot;: &quot;-0.666&quot;, &quot;p&quot;: &quot;0.216&quot;, &quot;q&quot;: &quot;0.756&quot;, &quot;stat&quot;: &quot;-1.242&quot;, &quot;stderr&quot;: &quot;0.169&quot;}]" data-columns="[{&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;log10p&quot;}, {&quot;title&quot;: &quot;p&quot;, &quot;field&quot;: &quot;p&quot;}, 
{&quot;title&quot;: &quot;q&quot;, &quot;field&quot;: &quot;q&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;stat&quot;}, {&quot;title&quot;: &quot;stderr&quot;, &quot;field&quot;: &quot;stderr&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Top associations with proteomics P22033
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;term&quot;: &quot;MMA_urine&quot;, &quot;est&quot;: -0.2065000018988299, &quot;log10p&quot;: -8.238681857784085, &quot;p&quot;: 5.77189129824518e-09, &quot;q&quot;: 2.763581553599792e-05, &quot;stat&quot;: -6.076596915972019, &quot;stderr&quot;: 0.033982836899392715}, {&quot;term&quot;: &quot;OHCblPlus&quot;, &quot;est&quot;: 0.2825622666318544, &quot;log10p&quot;: -5.220589590447866, &quot;p&quot;: 6.0174211693464486e-06, &quot;q&quot;: 0.014405706279415398, &quot;stat&quot;: 4.645306202223956, &quot;stderr&quot;: 0.06082747925133046}, {&quot;term&quot;: &quot;case&quot;, &quot;est&quot;: -0.18656270778825215, &quot;log10p&quot;: -0.4229257203603832, &quot;p&quot;: 0.37763677458004863, &quot;q&quot;: 0.8947809783310032, &quot;stat&quot;: -0.8841481557356449, &quot;stderr&quot;: 0.21100842271511033}, {&quot;term&quot;: &quot;date_freezing&quot;, &quot;est&quot;: &quot;.&quot;, &quot;log10p&quot;: &quot;.&quot;, &quot;p&quot;: 0.5597588049122902, &quot;q&quot;: 0.6666978004776233, &quot;stat&quot;: &quot;.&quot;, &quot;stderr&quot;: &quot;.&quot;}, {&quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;est&quot;: &quot;.&quot;, &quot;log10p&quot;: &quot;.&quot;, &quot;p&quot;: 0.01337735789196437, &quot;q&quot;: 0.022497642987961152, &quot;stat&quot;: &quot;.&quot;, &quot;stderr&quot;: &quot;.&quot;}, {&quot;term&quot;: &quot;responsive_to_acute_treatment&quot;, &quot;est&quot;: -0.3485970720865991, &quot;log10p&quot;: -2.067490616237422, &quot;p&quot;: 0.008560702085537941, &quot;q&quot;: 0.5123580198194458, &quot;stat&quot;: -2.6543371299990586, &quot;stderr&quot;: 0.13133112148671286}]" data-columns="[{&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;log10p&quot;}, {&quot;title&quot;: &quot;p&quot;, 
&quot;field&quot;: &quot;p&quot;}, {&quot;title&quot;: &quot;q&quot;, &quot;field&quot;: &quot;q&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;stat&quot;}, {&quot;title&quot;: &quot;stderr&quot;, &quot;field&quot;: &quot;stderr&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>The regression analysis successfully identified expected biological
relationships, providing confidence in my approach. As anticipated,
<em>MMUT</em> showed strong associations with both the <em>MMA_urine</em> and
<em>OHCblPlus</em> biomarkers. Notably, <em>MMUT</em> was associated with
<em>MMA_urine</em> only at the protein level, but with <em>OHCblPlus</em> at both
the transcript and protein levels, suggesting different underlying
biological mechanisms.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>The association patterns reveal
distinct regulatory mechanisms. <em>OHCblPlus</em> correlates with both <em>MMUT</em>
protein and transcript levels, consistent with genetic defects impacting
transcription—likely through mutations affecting transcription or
causing nonsense-mediated decay—which in turn result in depleted
protein levels and reduced enzymatic activity. In contrast, the
association of <em>MMA_urine</em> with <em>MMUT</em> protein levels—but not with its
transcripts—suggests that translational and/or post-translational
regulation of <em>MMUT</em> plays a major role in influencing its metabolic
impact.</p>

<p>This disconnect between transcript levels and metabolic outcomes could
be key to understanding idiopathic MMA cases. The selective correlation
with protein levels indicates that <em>MMUT</em> function depends not only on
gene expression but also on post-transcriptional regulatory networks
that modulate protein stability, localization, or activity.
Understanding these mechanisms could help uncover the etiology
underlying cases without clear genetic explanations.</p>

<p>This makes urine MMA particularly valuable as an integrated readout of
disease pathophysiology, rather than merely a simple marker of enzyme
deficiency.</p>

  </div>
</div>

<h2 id="unsupervised-analysis">Unsupervised analysis</h2>

<p>Unsupervised analyses help to identify patterns in the data with minimal
assumptions about what should be important. Factor analysis is
particularly intuitive for genomics data because it describes cellular
states through gene expression programs—coordinated changes in related
genes that reflect underlying biological processes.</p>

<p>Multi-Omics Factor Analysis (<code class="language-plaintext highlighter-rouge">MOFA</code>) extends this concept to multimodal
data by discovering the principal sources of variation across different
data types. Rather than concatenating features and treating transcripts
and proteins equivalently, <code class="language-plaintext highlighter-rouge">MOFA</code> disentangles axes of heterogeneity
shared across multiple modalities from those specific to individual data
types. This will allow me to determine whether certain biological
programs create coordinated responses across transcriptomics and
proteomics or manifest differently in each modality.</p>
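<p>A toy example makes the shared-versus-specific distinction concrete. Below, sample-level factor usages <code class="language-plaintext highlighter-rouge">Z</code> drive two simulated modalities through modality-specific loadings <code class="language-plaintext highlighter-rouge">W</code>: one factor loads on both modalities, while the other two are modality-specific. A MOFA-style per-factor, per-modality variance-explained calculation then recovers that structure (this is a hand-rolled illustration, not the <code class="language-plaintext highlighter-rouge">mofapy2</code> implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3

# factor 0 is shared; factor 1 is RNA-specific; factor 2 is protein-specific
Z = rng.normal(size=(n, k))  # sample-by-factor usages
W = {
    "transcriptomics": np.diag([1.0, 1.0, 0.0]) @ rng.normal(size=(k, 50)),
    "proteomics":      np.diag([1.0, 0.0, 1.0]) @ rng.normal(size=(k, 40)),
}
Y = {m: Z @ w + 0.1 * rng.normal(size=(n, w.shape[1])) for m, w in W.items()}


def r2_per_factor(Y_m, Z, W_m):
    """Variance explained by each factor in one modality: drop factor j's
    reconstruction and measure how much residual variance it accounted for."""
    total = np.sum(Y_m ** 2)
    return [1 - np.sum((Y_m - np.outer(Z[:, j], W_m[j])) ** 2) / total
            for j in range(Z.shape[1])]
```

<p>Here the shared factor explains variance in both modalities, while each modality-specific factor explains essentially none in the other modality, which is the decomposition MOFA reports in its variance-explained plots.</p>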

<p>Each factor represents a biological or technical program with two key
components: loadings (which features participate in the program) and
usages (how much each sample expresses the program). <code class="language-plaintext highlighter-rouge">MOFA</code> requires
selecting the optimal number of factors — too few factors miss
important biological programs, while too many introduce noise and
overfitting. I will systematically test different numbers of factors to
select an optimal value that balances variance explained and factor
interpretability.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># update the MuData object so `nvars` and `nobs` are appropriate
</span><span class="n">intersected_mdata</span> <span class="o">=</span> <span class="n">md</span><span class="p">.</span><span class="n">MuData</span><span class="p">({</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">mdata</span><span class="p">.</span><span class="n">mod</span><span class="p">.</span><span class="n">items</span><span class="p">()})</span>

<span class="n">muon</span><span class="p">.</span><span class="n">pp</span><span class="p">.</span><span class="n">intersect_obs</span><span class="p">(</span><span class="n">intersected_mdata</span><span class="p">)</span>  <span class="c1"># This modifies mdata in-place
</span>
<span class="c1"># Run the factor scan
</span><span class="n">results_dict</span> <span class="o">=</span> <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">run_mofa_factor_scan</span><span class="p">(</span>
    <span class="n">intersected_mdata</span><span class="p">,</span> 
    <span class="n">factor_range</span><span class="o">=</span><span class="n">FACTOR_RANGE</span><span class="p">,</span>
    <span class="n">use_layer</span><span class="o">=</span><span class="n">ANALYSIS_LAYER</span><span class="p">,</span>  <span class="c1"># Adjust to your normalized layer
</span>    <span class="n">models_dir</span><span class="o">=</span><span class="n">MOFA_PARAM_SCAN_MODELS_PATH</span><span class="p">,</span>
    <span class="n">overwrite</span><span class="o">=</span><span class="n">OVERWRITE</span>
<span class="p">)</span>

<span class="c1"># Extract variance metrics from all models
</span><span class="n">metrics</span> <span class="o">=</span> <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">calculate_variance_metrics</span><span class="p">(</span><span class="n">results_dict</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">visualize_factor_scan_results</span><span class="p">(</span><span class="n">metrics</span><span class="p">,</span> <span class="n">user_factors</span><span class="o">=</span><span class="n">OPTIMAL_FACTOR</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<pre><code class="language-output">    Optimal number of factors based on different criteria:
      Elbow method: 10 factors
      Threshold method: 48 factors
      Balanced method: 24 factors
      User-specified: 30 factors
</code></pre>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/factor_scan_results-output-2.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/factor_scan_results-output-3.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/factor_scan_results-output-4.png" alt="" /></p>

<p>The factor scan results help me identify the optimal model complexity. I
look for the point where adding more factors yields diminishing returns
in variance explained while avoiding overfitting that would reduce
factor interpretability.</p>
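The elbow criterion can be sketched generically. The function below is an illustration only (not the actual `mdata_factor_analysis` implementation): it picks the scan point farthest from the chord joining the endpoints of the variance-explained curve, a common heuristic for diminishing-returns curves.

```python
import numpy as np

def elbow_k(var_explained):
    """Return the 1-based elbow index of a diminishing-returns curve.

    The elbow is taken as the point farthest from the straight line
    (chord) joining the first and last points of the curve.
    """
    x = np.arange(len(var_explained), dtype=float)
    y = np.asarray(var_explained, dtype=float)
    # unit vector along the chord from first to last point
    chord = np.array([x[-1] - x[0], y[-1] - y[0]])
    chord /= np.linalg.norm(chord)
    # perpendicular distance of each point from the chord (2D cross product)
    rel = np.stack([x - x[0], y - y[0]], axis=1)
    dist = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    return int(np.argmax(dist)) + 1

# e.g., total variance explained for K = 1..5 factors
elbow_k([0.0, 0.5, 0.72, 0.78, 0.8])  # elbow at K = 3
```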

<div class="content-section ai-aside">
  <div class="section-content">
    <p>Claude did well organizing the
hyperparameter scan and implementing various approaches for selecting
the optimal number of factors (K). While I wasn’t particularly impressed
with the specific model selection criteria it chose, this didn’t matter
much in practice. Factor analyses like MOFA exhibit diminishing returns,
where later factors tend to have smaller, sparser loadings and capture
less meaningful variation. This means the overall results are relatively
robust to the exact choice of K, as long as it falls within a reasonable
range.</p>

  </div>
</div>

<p>With 30 factors selected as optimal, I can fit the final MOFA model and
examine the distributions of the factors:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span><span class="p">)</span> <span class="ow">or</span> <span class="n">OVERWRITE</span><span class="p">:</span>

    <span class="n">optimal_model</span> <span class="o">=</span> <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">create_minimal_mudata</span><span class="p">(</span>
        <span class="n">intersected_mdata</span><span class="p">,</span>
        <span class="n">include_layers</span><span class="o">=</span><span class="p">[</span><span class="n">ANALYSIS_LAYER</span><span class="p">],</span>
        <span class="n">include_obsm</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">include_varm</span><span class="o">=</span><span class="bp">True</span>
    <span class="p">)</span>

    <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">_mofa</span><span class="p">(</span>
        <span class="n">optimal_model</span><span class="p">,</span>
        <span class="n">n_factors</span><span class="o">=</span><span class="n">OPTIMAL_FACTOR</span><span class="p">,</span>
        <span class="n">use_obs</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
        <span class="n">use_var</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
        <span class="n">use_layer</span><span class="o">=</span><span class="n">ANALYSIS_LAYER</span><span class="p">,</span>
        <span class="n">convergence_mode</span><span class="o">=</span><span class="s">"medium"</span><span class="p">,</span>
        <span class="n">verbose</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="n">save_metadata</span><span class="o">=</span><span class="bp">True</span>
    <span class="p">)</span>

    <span class="n">md</span><span class="p">.</span><span class="n">write_h5mu</span><span class="p">(</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span><span class="p">,</span> <span class="n">optimal_model</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">optimal_model</span> <span class="o">=</span> <span class="n">md</span><span class="p">.</span><span class="n">read_h5mu</span><span class="p">(</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span><span class="p">)</span>
    
<span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">plot_mofa_factor_histograms</span><span class="p">(</span><span class="n">optimal_model</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/optimal_mofa_summary-output-1.png" alt="" /></p>

<p>With 30 factors extracted, I aim to identify those that capture disease
biology versus technical variation or normal biological heterogeneity.
To do this, I’ll apply the same regression approach used for individual
features, testing whether factor usages (how much each sample expresses
each factor) correlate with disease phenotypes such as MMA_urine levels
and treatment responsiveness. Covariates are less of a concern here
because different biological and technical effects are generally
captured in separate components — one of the key strengths of factor
analysis is its ability to account for both observed and latent
confounding variables.</p>

<p>Factors that show significant associations across multiple disease
measures, explain substantial variance in disease severity, and contain
biologically coherent gene/protein sets will be my primary targets for
network analysis—these represent coordinated disease programs that I
can trace back to their regulatory origins.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Apply regression per factor
</span><span class="n">factor_regressions</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">formula_name</span><span class="p">,</span> <span class="n">formula</span> <span class="ow">in</span> <span class="n">REGRESSION_FORMULAS</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">items</span><span class="p">():</span>

    <span class="c1"># Run regression analysis
</span>    <span class="n">regression_results</span> <span class="o">=</span> <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">regress_factors_with_formula</span><span class="p">(</span>
        <span class="n">optimal_model</span><span class="p">,</span>
        <span class="n">formula</span><span class="o">=</span><span class="n">formula</span><span class="p">,</span>
        <span class="n">factors</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>  <span class="c1"># Use all factors
</span>        <span class="n">progress_bar</span><span class="o">=</span><span class="bp">False</span>
    <span class="p">).</span><span class="n">assign</span><span class="p">(</span><span class="n">formula_name</span> <span class="o">=</span> <span class="n">formula_name</span><span class="p">)</span>

    <span class="n">factor_regressions</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">regression_results</span><span class="p">)</span>
    
    <span class="c1"># Generate summary table
</span>    <span class="n">summary_table</span> <span class="o">=</span> <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">summarize_factor_regression</span><span class="p">(</span>
        <span class="n">regression_results</span><span class="p">,</span>
        <span class="n">alpha</span><span class="o">=</span><span class="n">FDR_CUTOFF</span><span class="p">,</span>
        <span class="n">group_by_factor</span><span class="o">=</span><span class="bp">False</span>
    <span class="p">).</span><span class="n">query</span><span class="p">(</span><span class="sa">f</span><span class="s">"term == '</span><span class="si">{</span><span class="n">formula_name</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">summary_table</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>

        <span class="n">display_tabulator</span><span class="p">(</span>
            <span class="n">summary_table</span><span class="p">,</span>
            <span class="n">caption</span><span class="o">=</span><span class="sa">f</span><span class="s">"Factors associated with </span><span class="si">{</span><span class="n">formula_name</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
            <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span>
        <span class="p">)</span>

<span class="n">factor_regressions_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">factor_regressions</span><span class="p">)</span>
<span class="n">optimal_model</span><span class="p">.</span><span class="n">uns</span><span class="p">[</span><span class="s">"factor_regressions"</span><span class="p">]</span> <span class="o">=</span> <span class="n">factor_regressions_df</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Factors associated with case
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: 36, &quot;factor_name&quot;: &quot;Factor_10&quot;, &quot;term&quot;: &quot;case&quot;, &quot;estimate&quot;: &quot;0.207&quot;, &quot;standard_error&quot;: &quot;0.052&quot;, &quot;statistic&quot;: 3.991545478953449, &quot;p_value&quot;: &quot;9.22e-05&quot;, &quot;q_value&quot;: &quot;2.77e-03&quot;}, {&quot;index&quot;: 32, &quot;factor_name&quot;: &quot;Factor_9&quot;, &quot;term&quot;: &quot;case&quot;, &quot;estimate&quot;: &quot;-0.650&quot;, &quot;standard_error&quot;: &quot;0.236&quot;, &quot;statistic&quot;: -2.7563012075787734, &quot;p_value&quot;: &quot;6.39e-03&quot;, &quot;q_value&quot;: &quot;9.58e-02&quot;}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;factor_name&quot;, &quot;field&quot;: &quot;factor_name&quot;}, {&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;estimate&quot;, &quot;field&quot;: &quot;estimate&quot;}, {&quot;title&quot;: &quot;standard_error&quot;, &quot;field&quot;: &quot;standard_error&quot;}, {&quot;title&quot;: &quot;statistic&quot;, &quot;field&quot;: &quot;statistic&quot;}, {&quot;title&quot;: &quot;p_value&quot;, &quot;field&quot;: &quot;p_value&quot;}, {&quot;title&quot;: &quot;q_value&quot;, &quot;field&quot;: &quot;q_value&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Factors associated with MMA_urine
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: 28, &quot;factor_name&quot;: &quot;Factor_8&quot;, &quot;term&quot;: &quot;MMA_urine&quot;, &quot;estimate&quot;: &quot;-0.150&quot;, &quot;standard_error&quot;: &quot;0.045&quot;, &quot;statistic&quot;: -3.342278260663408, &quot;p_value&quot;: &quot;9.92e-04&quot;, &quot;q_value&quot;: &quot;2.98e-02&quot;}, {&quot;index&quot;: 8, &quot;factor_name&quot;: &quot;Factor_3&quot;, &quot;term&quot;: &quot;MMA_urine&quot;, &quot;estimate&quot;: &quot;0.040&quot;, &quot;standard_error&quot;: &quot;0.014&quot;, &quot;statistic&quot;: 2.814093785284151, &quot;p_value&quot;: &quot;5.38e-03&quot;, &quot;q_value&quot;: &quot;5.94e-02&quot;}, {&quot;index&quot;: 36, &quot;factor_name&quot;: &quot;Factor_10&quot;, &quot;term&quot;: &quot;MMA_urine&quot;, &quot;estimate&quot;: &quot;0.025&quot;, &quot;standard_error&quot;: &quot;0.009&quot;, &quot;statistic&quot;: 2.7808623950226816, &quot;p_value&quot;: &quot;5.94e-03&quot;, &quot;q_value&quot;: &quot;5.94e-02&quot;}, {&quot;index&quot;: 60, &quot;factor_name&quot;: &quot;Factor_16&quot;, &quot;term&quot;: &quot;MMA_urine&quot;, &quot;estimate&quot;: &quot;0.038&quot;, &quot;standard_error&quot;: &quot;0.015&quot;, &quot;statistic&quot;: 2.6002535780948643, &quot;p_value&quot;: &quot;1.00e-02&quot;, &quot;q_value&quot;: &quot;7.51e-02&quot;}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;factor_name&quot;, &quot;field&quot;: &quot;factor_name&quot;}, {&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;estimate&quot;, &quot;field&quot;: &quot;estimate&quot;}, {&quot;title&quot;: &quot;standard_error&quot;, &quot;field&quot;: &quot;standard_error&quot;}, {&quot;title&quot;: &quot;statistic&quot;, &quot;field&quot;: &quot;statistic&quot;}, {&quot;title&quot;: &quot;p_value&quot;, &quot;field&quot;: &quot;p_value&quot;}, {&quot;title&quot;: 
&quot;q_value&quot;, &quot;field&quot;: &quot;q_value&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Factors associated with proteomics_runorder
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: 6, &quot;factor_name&quot;: &quot;Factor_3&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;1.11e-16&quot;, &quot;q_value&quot;: &quot;1.67e-15&quot;}, {&quot;index&quot;: 27, &quot;factor_name&quot;: &quot;Factor_10&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;1.11e-16&quot;, &quot;q_value&quot;: &quot;1.67e-15&quot;}, {&quot;index&quot;: 39, &quot;factor_name&quot;: &quot;Factor_14&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;3.37e-07&quot;, &quot;q_value&quot;: &quot;3.37e-06&quot;}, {&quot;index&quot;: 21, &quot;factor_name&quot;: &quot;Factor_8&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;5.79e-04&quot;, &quot;q_value&quot;: &quot;4.34e-03&quot;}, {&quot;index&quot;: 69, &quot;factor_name&quot;: &quot;Factor_24&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;1.51e-02&quot;, &quot;q_value&quot;: &quot;7.87e-02&quot;}, {&quot;index&quot;: 12, &quot;factor_name&quot;: &quot;Factor_5&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;1.57e-02&quot;, &quot;q_value&quot;: &quot;7.87e-02&quot;}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;factor_name&quot;, &quot;field&quot;: &quot;factor_name&quot;}, {&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;p_value&quot;, &quot;field&quot;: &quot;p_value&quot;}, {&quot;title&quot;: &quot;q_value&quot;, &quot;field&quot;: &quot;q_value&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Factors associated with date_freezing
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: 1, &quot;factor_name&quot;: &quot;Factor_1&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;6.29e-13&quot;, &quot;q_value&quot;: &quot;1.89e-11&quot;}, {&quot;index&quot;: 88, &quot;factor_name&quot;: &quot;Factor_30&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;2.52e-05&quot;, &quot;q_value&quot;: &quot;3.41e-04&quot;}, {&quot;index&quot;: 22, &quot;factor_name&quot;: &quot;Factor_8&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;3.41e-05&quot;, &quot;q_value&quot;: &quot;3.41e-04&quot;}, {&quot;index&quot;: 49, &quot;factor_name&quot;: &quot;Factor_17&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;1.13e-04&quot;, &quot;q_value&quot;: &quot;8.48e-04&quot;}, {&quot;index&quot;: 55, &quot;factor_name&quot;: &quot;Factor_19&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;2.69e-04&quot;, &quot;q_value&quot;: &quot;1.62e-03&quot;}, {&quot;index&quot;: 31, &quot;factor_name&quot;: &quot;Factor_11&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;4.07e-04&quot;, &quot;q_value&quot;: &quot;2.03e-03&quot;}, {&quot;index&quot;: 46, &quot;factor_name&quot;: &quot;Factor_16&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;8.81e-04&quot;, &quot;q_value&quot;: &quot;3.77e-03&quot;}, {&quot;index&quot;: 37, &quot;factor_name&quot;: &quot;Factor_13&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;6.41e-03&quot;, &quot;q_value&quot;: &quot;2.40e-02&quot;}, {&quot;index&quot;: 64, &quot;factor_name&quot;: &quot;Factor_22&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;9.81e-03&quot;, &quot;q_value&quot;: &quot;3.27e-02&quot;}, {&quot;index&quot;: 28, &quot;factor_name&quot;: &quot;Factor_10&quot;, &quot;term&quot;: &quot;date_freezing&quot;, 
&quot;p_value&quot;: &quot;1.22e-02&quot;, &quot;q_value&quot;: &quot;3.51e-02&quot;}, {&quot;index&quot;: 52, &quot;factor_name&quot;: &quot;Factor_18&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;1.36e-02&quot;, &quot;q_value&quot;: &quot;3.51e-02&quot;}, {&quot;index&quot;: 43, &quot;factor_name&quot;: &quot;Factor_15&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;1.40e-02&quot;, &quot;q_value&quot;: &quot;3.51e-02&quot;}, {&quot;index&quot;: 25, &quot;factor_name&quot;: &quot;Factor_9&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;1.79e-02&quot;, &quot;q_value&quot;: &quot;4.13e-02&quot;}, {&quot;index&quot;: 13, &quot;factor_name&quot;: &quot;Factor_5&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;2.03e-02&quot;, &quot;q_value&quot;: &quot;4.35e-02&quot;}, {&quot;index&quot;: 67, &quot;factor_name&quot;: &quot;Factor_23&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;2.17e-02&quot;, &quot;q_value&quot;: &quot;4.35e-02&quot;}, {&quot;index&quot;: 58, &quot;factor_name&quot;: &quot;Factor_20&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;3.15e-02&quot;, &quot;q_value&quot;: &quot;5.90e-02&quot;}, {&quot;index&quot;: 10, &quot;factor_name&quot;: &quot;Factor_4&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;4.45e-02&quot;, &quot;q_value&quot;: &quot;7.85e-02&quot;}, {&quot;index&quot;: 85, &quot;factor_name&quot;: &quot;Factor_29&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;5.24e-02&quot;, &quot;q_value&quot;: &quot;8.73e-02&quot;}, {&quot;index&quot;: 79, &quot;factor_name&quot;: &quot;Factor_27&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;5.64e-02&quot;, &quot;q_value&quot;: &quot;8.91e-02&quot;}, {&quot;index&quot;: 40, &quot;factor_name&quot;: &quot;Factor_14&quot;, 
&quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;6.13e-02&quot;, &quot;q_value&quot;: &quot;9.08e-02&quot;}, {&quot;index&quot;: 61, &quot;factor_name&quot;: &quot;Factor_21&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;6.36e-02&quot;, &quot;q_value&quot;: &quot;9.08e-02&quot;}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;factor_name&quot;, &quot;field&quot;: &quot;factor_name&quot;}, {&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;p_value&quot;, &quot;field&quot;: &quot;p_value&quot;}, {&quot;title&quot;: &quot;q_value&quot;, &quot;field&quot;: &quot;q_value&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>While I generally like this approach and have seen it successfully
applied to single-cell RNA-seq data using
<a href="https://github.com/dylkot/cNMF">cNMF</a>, it didn’t work particularly well
for this dataset. Most disease phenotypes show only weak correlations
with factor usages, similar to the feature-wise significance tests. The
clearest exception is urine MMA levels, which are strongly associated
with latent factor 3 (<em>LF3</em>). However, <em>LF3</em> is bimodal and also
correlated with proteomics run order and sample freezing date, so buyer
beware. Curiously, case status was also associated with a couple of
factors despite little feature-level signal.</p>

<p>The prominence of non-disease biology in my factor analysis highlights
both a key strength and a limitation of the method: it is agnostic to
the biology of interest and instead captures the dominant sources of
variation in the data. In human disease datasets, coherent disease
signatures are often distributed across many small factors rather than
concentrated in a few large ones, making them harder to detect amid
technical variation and biological heterogeneity.</p>

<h2 id="summary-and-next-steps">Summary and next steps</h2>

<p>This analysis demonstrates a systematic approach to extracting
disease-relevant molecular profiles from multimodal genomics data. Using
the Forny et al. MMA dataset, I have shown how careful data processing,
covariate adjustment, and the use of both supervised and unsupervised
methods can reveal subtle but biologically meaningful disease signals.</p>

<p>Methodological insights:</p>

<ul>
  <li>
    <p><strong>scverse integration</strong>: Demonstrated how organizing multimodal data
using <code class="language-plaintext highlighter-rouge">AnnData</code>/<code class="language-plaintext highlighter-rouge">MuData</code> containers facilitates reproducible
analysis while maintaining seamless integration with the broader
genomics ecosystem.</p>
  </li>
  <li>
    <p><strong>Modality-specific covariate modeling</strong>: PCA revealed that batch
effects (proteomics run order, sample freezing dates) impacted
transcriptomics and proteomics data differently, enabling tailored
regression models for each modality that improved my statistical
power for detecting biological signals.</p>
  </li>
  <li>
    <p><strong>AI-assisted development workflow</strong>: Large language models (LLMs)
proved effective for rapid prototyping and handling complex
visualization syntax (e.g., <code class="language-plaintext highlighter-rouge">matplotlib</code>), but encountered serious
issues with statistical implementation — initially producing
spaghetti code that silently converted multiple regressions into
simple regressions. The solution was to treat AI as a development
collaborator, following software engineering best practices by
implementing specific features with tests in a structured Python
package rather than relying on AI for end-to-end development.</p>
  </li>
</ul>

<p>Biological insights:</p>

<ul>
  <li>
    <p><strong>MMUT regulatory patterns</strong>: Different association patterns between
transcript and protein levels suggest that translational or
post-translational control mechanisms shape metabolic impact,
potentially explaining idiopathic MMA cases.</p>
  </li>
  <li>
    <p><strong>Biomarker performance</strong>: Urine MMA showed stronger statistical
associations with molecular features than traditional enzyme
activity measures, suggesting it captures integrated disease
pathophysiology beyond simple enzyme deficiency.</p>
  </li>
  <li>
    <p><strong>Disease heterogeneity</strong>: Weak PCA associations with disease status
mirror the clinical heterogeneity observed in MMA patients,
confirming that molecular signatures reflect the complex and
variable nature of disease presentation.</p>
  </li>
</ul>

<h3 id="transitioning-to-network-analysis">Transitioning to network analysis</h3>

<p>While my statistical approach successfully identified disease-associated
molecular programs, interpreting their biological significance requires
additional context. These molecular signatures represent starting points
for deeper mechanistic investigation. In part two, I will map these
statistical associations onto genome-scale biological networks using
<code class="language-plaintext highlighter-rouge">Napistu</code>. This approach will trace disease signals from downstream
molecular effects to potential upstream regulatory causes, converting
statistical associations into testable biological hypotheses.</p>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="genomics" /><category term="python" /><summary type="html"><![CDATA[This is part one of a two-part post highlighting Napistu — a new framework for building genome-scale networks of molecular biology and biochemistry. In this post, I’ll tackle a fundamental challenge in computational biology: how to extract meaningful disease signatures from complex multimodal datasets. Using methylmalonic acidemia (MMA) as my test case, I’ll demonstrate how to systematically extract disease signatures from multimodal data. My approach combines three complementary analytical strategies: exploratory data analysis to assess data structure and quality, differential expression analysis to identify disease-associated features, and factor analysis to uncover coordinated gene expression programs across data types. The end goal is to distill thousands of molecular measurements into a handful of interpretable disease signatures — each capturing a distinct aspect of disease biology that can be mapped to regulatory networks. Throughout this post, I’ll use two types of asides to provide additional context without disrupting the main analytical flow. Green boxes contain biological details, while blue boxes reflect on the computational workflow and AI-assisted development process.]]></summary></entry><entry><title type="html">Flattening the Gompertz Distribution</title><link href="https://www.shackett.org/gompertz/" rel="alternate" type="text/html" title="Flattening the Gompertz Distribution" /><published>2025-02-02T00:00:00+00:00</published><updated>2025-02-02T00:00:00+00:00</updated><id>https://www.shackett.org/gompertz</id><content type="html" xml:base="https://www.shackett.org/gompertz/"><![CDATA[<p>In this post I’ll explore the Gompertz law of mortality which describes individuals’ accelerating risk of death with age.</p>

<p>The Gompertz equation describes the per-year hazard (i.e., the likelihood that an individual alive at time $t$ dies before $t+1$) as the product of an age-independent parameter $\alpha$ and an age-dependent component which increases exponentially with time, scaled by another parameter $\beta$ ($e^{\beta \cdot t}$).</p>

<p>The equation is thus:</p>

\[\large h(t) = \alpha \cdot e^{\beta \cdot t}\]

<p>The Gompertz equation is often studied by taking its natural log, resulting in a linear relationship between log(hazard) and age.</p>

\[\large \ln(h(t)) = \ln(\alpha) + \beta \cdot t\]
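A minimal numeric sketch of both forms, using illustrative parameter values (the $\alpha$ and $\beta$ below are made up for demonstration, not fitted to any life table):

```python
import math

def gompertz_hazard(t, alpha, beta):
    """Per-year hazard under the Gompertz law: h(t) = alpha * exp(beta * t)."""
    return alpha * math.exp(beta * t)

alpha, beta = 1e-4, 0.085  # illustrative values only

# On the log scale, hazard is linear in age: ln(h) = ln(alpha) + beta * t,
# so each added year multiplies the hazard by a constant factor exp(beta).
ratio = gompertz_hazard(61, alpha, beta) / gompertz_hazard(60, alpha, beta)
# ratio == exp(beta), i.e. roughly a 9% increase in risk per year of age
```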

<p>Formulating and estimating the parameters of demographic hazard models like the Gompertz’s equation is an active area of research, and there is a lot of information out there catering to both the academic and lay audiences. Still, when reviewing this literature, <strong>I did not see a clear summary of how decreases in $\beta$ (the chief aim of longevity research) would lead to lifespan extension.</strong></p>

<!--more-->

<p>For me, this is one of the most interesting properties of the Gompertz model, and this potential has been touted by my colleagues <a href="https://calicolabs.com/people/j-graham-ruby">Graham Ruby</a> and <a href="https://calicolabs.com/people/eugene-melamud">Eugene Melamud</a> at Calico for some time.</p>

<p>Here, I’ll focus on a quantitative treatment of two questions related to the potential of lifespan extension through slowing aging:</p>

<ol>
  <li>How was the lifespan extension seen across the $20^{th}$ century borne out in changes in baseline mortality ($\alpha$)? <em>The 1.9-fold lifespan extension across the century was borne out through a 26-fold decrease in baseline mortality ($\alpha$).</em></li>
  <li>What would life expectancy be if there were a comparable decrease in $\beta$? <em>A 26-fold decrease in $\beta$ would extend human life expectancy to around 1,000 years.</em></li>
</ol>
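These two scenarios can be sketched numerically. Under the Gompertz law the survival function is $S(t) = e^{-\frac{\alpha}{\beta}(e^{\beta t} - 1)}$, and life expectancy is the area under $S(t)$. The parameter values below are illustrative only (chosen to give a roughly modern life expectancy), not the fitted values discussed later:

```python
import math

def gompertz_life_expectancy(alpha, beta, dt=0.05):
    """Approximate e0 = integral of S(t) dt, where under the Gompertz law
    S(t) = exp(-(alpha / beta) * (exp(beta * t) - 1))."""
    total, t, s = 0.0, 0.0, 1.0
    while s > 1e-12:  # stop once essentially no one is left alive
        s = math.exp(-(alpha / beta) * (math.exp(beta * t) - 1.0))
        total += s * dt
        t += dt
    return total

alpha, beta = 1e-4, 0.085  # illustrative modern-ish parameters
base = gompertz_life_expectancy(alpha, beta)
less_alpha = gompertz_life_expectancy(alpha / 26, beta)  # 26-fold lower alpha
less_beta = gompertz_life_expectancy(alpha, beta / 26)   # 26-fold lower beta
# Lowering alpha shifts the hazard curve down (decades of extension);
# lowering beta flattens it (extension on the order of centuries).
```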

<p>I am NOT an expert on demographic models, so through ignorance (but also conveniently for brevity!) I’ll be leaving out lots of relevant information.</p>

<p><em>Additionally, while I am a current Calico employee, I want to emphasize that this and any other posts on my personal blog are my personal thoughts and do not necessarily reflect the opinions of my employer.</em></p>

<h1 id="background">Background</h1>

<h2 id="life-tables">Life Tables</h2>

<p>Demographic models of mortality and life expectancy are built from “life tables.” Life tables contain both the number of individuals of a given age that are living in the selected year and the number of deaths that occurred at these ages during that year. This is sufficient to calculate the probability of an individual of a given age dying that year (i.e., the hazard).</p>

<p>Life tables come in two varieties: “period” life tables describe the mortality of a population within a narrow window of time, while “cohort” life tables describe the mortality of a cohort of individuals born within a narrow window of time. In this post I’ll be focusing on period life tables, which can be obtained from many sources; for example, here is one from the <a href="https://www.ssa.gov/oact/STATS/table4c6.html">social security administration</a>.</p>

<p><img src="https://www.shackett.org/figure/gompertz/lifetable.png" alt="Example Life Table" class="align-center" /></p>
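The hazard calculation itself is simple division. Here is a sketch using hypothetical counts (these numbers are made up, not taken from the SSA table above):

```python
# Hypothetical life-table fragment: age -> (individuals alive, deaths that year)
life_table = {
    70: (80_000, 2_000),
    71: (78_000, 2_150),
    72: (75_850, 2_300),
}

def hazard(age):
    """Probability that an individual of `age` dies before reaching age + 1."""
    alive, deaths = life_table[age]
    return deaths / alive

hazard(70)  # 0.025: a 70-year-old has a 2.5% chance of dying this year
```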

<h2 id="the-origin-of-life-tables">The Origin of Life Tables</h2>

<p>As mentioned above, creating life tables requires (1) the number of individuals at each age, and (2) the number of deaths at each age. The easiest way to accomplish (1) is through a census, while (2) requires solid record keeping. These conditions were rarely in place at the same time, so early life tables were derived from birth and death records rather than direct summaries of demographics.</p>

<p>In 1693 the British astronomer Edmond Halley studied the birth and death records of the city of Breslau. Breslau happened to be in a steady state where the number of births approximately equaled the number of deaths. He used this information to create the first life table describing the distribution of residents’ ages. Halley then demonstrated how to price annuities, leading to the birth of <a href="https://en.wikipedia.org/wiki/Actuarial_science">Actuarial Science</a>.</p>

<p>Around 1750, the Swiss mathematician Leonhard Euler re-derived the work of Halley while exploring the exponential growth of populations. His formulation of exponential growth allowed life tables to be generated for non-stationary populations.</p>

<p>With proper censuses it is now easier to plug measured values directly into a life table, but the historical lack of this information produced elegant math for describing population dynamics from indirect measurements. These methods continue to be fundamental for studying both evolutionary and ecological dynamics.</p>

<h2 id="gompertzs-law">Gompertz’s Law</h2>

<p>Nearly 200 years ago, Benjamin Gompertz, a British actuary, described the mortality of populations in a life table (i.e., the survival curve) as a symmetrical sigmoidal function. Because the function is sigmoidal, absolute mortality peaks at the average lifespan; past this point, although relative mortality continues to increase, the shrinking number of individuals remaining results in fewer total deaths.</p>

<p>Later, the modern Gompertz law of mortality was derived from the more general Gompertz equation. Unlike the sigmoidal formulation, the Gompertz law focuses on relative mortality, so risk continues to increase exponentially despite the winnowing of the aging population.</p>

\[\large h(t) = \alpha \cdot e^{\beta \cdot t}\]
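<p>Putting numbers on this: with an exponential hazard, relative risk doubles every $\ln(2)/\beta$ years. Below is a quick numeric sketch (in Python rather than the R used elsewhere in this post), using the illustrative $\alpha$ and $\beta$ values that appear later; the parameter values are assumptions for demonstration, not fitted estimates.</p>

```python
import math

# Gompertz hazard: h(t) = alpha * exp(beta * t)
# Illustrative parameters (assumed; roughly the values used later in the post)
alpha = 0.000064
beta = 0.0861

def gompertz_hazard(t):
    return alpha * math.exp(beta * t)

# With an exponential hazard, relative risk doubles every ln(2) / beta years
doubling_time = math.log(2) / beta
print(round(doubling_time, 1))  # ~8 years

# Sanity check: the hazard roughly doubles over one doubling time
print(gompertz_hazard(48) / gompertz_hazard(40))
```

Note that the doubling time depends only on $\beta$; changing $\alpha$ shifts the whole hazard curve up or down without changing how fast risk compounds.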

<p>While I will primarily focus on this formulation of the Gompertz equation, its form has been broadly amended and challenged to better predict mortality in the very young and very old.</p>

<p>The Gompertz equation implies a vanishingly small risk for the very young, which is untrue even for modern humans due to early-childhood mortality (<a href="https://en.wikipedia.org/wiki/Gompertz%E2%80%93Makeham_law_of_mortality">Gompertz-Makeham Law Wiki</a>); more generally, it is challenged by high rates of extrinsic mortality (e.g., from disease or predation). To better account for baseline risk, the Gompertz equation is often extended to the Gompertz–Makeham law of mortality. This formulation adds the Makeham term ($\lambda$), though it is generally appreciated that the Gompertz term outweighs the Makeham term when extrinsic mortality is low.</p>

\[\large h(t) = \alpha \cdot e^{\beta \cdot t} + \lambda\]
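<p>To see how the two terms trade off, note that the Makeham term dominates at young ages, while the Gompertz term overtakes it at $t^* = \ln(\lambda/\alpha)/\beta$. A small sketch, where the $\lambda$ value (and the other parameters) are purely hypothetical:</p>

```python
import math

# Gompertz-Makeham hazard: h(t) = alpha * exp(beta * t) + lam
# All parameter values here are hypothetical, for illustration only
alpha = 0.000064
beta = 0.0861
lam = 0.0005  # age-independent (extrinsic) mortality

def gm_hazard(t):
    return alpha * math.exp(beta * t) + lam

# The Gompertz term overtakes the Makeham term at t* = ln(lam / alpha) / beta
crossover = math.log(lam / alpha) / beta
print(round(crossover, 1))

# Below the crossover, baseline risk dominates the hazard...
print(lam > alpha * math.exp(beta * 10))   # True
# ...above it, the exponential aging term does
print(lam < alpha * math.exp(beta * 40))   # True
```

With these made-up numbers, baseline risk dominates until roughly the mid-20s, after which the exponential aging term takes over, which is why the Makeham term matters most when extrinsic mortality is high or ages are young.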

<p>For the very old, the Gompertz equation predicts an eventual hazard of one (mathematically, the hazard would continue past this point, which is a bit of a red flag), implying guaranteed death by a defined age. This point has been hotly contested because it bears on whether there is a fundamental upper limit on human lifespan. Coming up with alternative models to describe the hazard of the very old is surprisingly hard because there are so few individuals to fit the models to. This work generally focuses on “extreme value distributions,” which describe the distribution of maximum/minimum values from a set of observations.</p>

<h2 id="historical-changes-in-life-expectancy">Historical Changes in Life Expectancy</h2>

<p>To explore how changes in life expectancy map onto the parameters of the Gompertz equation, we’ll start by obtaining some demographic data on how life expectancy changed across the $20^{th}$ century.</p>

<p>To do this, I’ll use a table of life expectancy in the USA ranging from 1900-1998 which I stumbled into on Andrew Noymer’s website (Associate Professor, Public Health at UC–Irvine): <a href="https://u.demog.berkeley.edu/~andrew/1918/figure2.html">USA Life Expectancy 20th century site</a></p>

<p>I could have copied this data to a local file but instead I decided to knock some of the rust off of my webscraping skills and directly read the table using <strong>rvest</strong>. If you are interested in another example of rvest webscraping I talked about how I scraped &gt;200K MMA webpages a while back in this post: <a href="http://www.fightprior.com/2016/04/29/scrapingMMA/">FightPrior</a>.</p>

<p>My approach to reading the table didn’t work perfectly. CSS selectors, which define the portion of the HTML to extract, can be a little finicky, and in this case the fields I selected pulled in some whitespace above and below the table. Because of this, rather than extracting a table with a command like “rvest::html_table”, I had to serialize the table as a long character vector. After doing this I reformatted it into a matrix and then applied a few more clunky operations to set the first row as the variable names and convert the matrix to a tidy tibble.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># load packages and create a default plotting theme</span><span class="w">
</span><span class="n">suppressPackageStartupMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">))</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">patchwork</span><span class="p">)</span><span class="w"> </span><span class="c1"># for combining plots</span><span class="w">

</span><span class="n">theme_bw_mod</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">theme_bw</span><span class="p">(</span><span class="n">base_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">)</span><span class="w">

</span><span class="n">pretty_kable</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">tab</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">tab</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">kableExtra</span><span class="o">::</span><span class="n">kable_styling</span><span class="p">(</span><span class="s2">"striped"</span><span class="p">,</span><span class="w"> </span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"left"</span><span class="p">,</span><span class="w"> </span><span class="n">full_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># read html</span><span class="w">
</span><span class="n">US_LIFE_EXPECTANCY_URL</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"https://u.demog.berkeley.edu/~andrew/1918/figure2.html"</span><span class="w">
</span><span class="n">us_life_expectancy_html</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rvest</span><span class="o">::</span><span class="n">read_html</span><span class="p">(</span><span class="n">US_LIFE_EXPECTANCY_URL</span><span class="p">)</span><span class="w">

</span><span class="n">us_life_expectancy_matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">us_life_expectancy_html</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># setup with Selector Gadget as a CSS selector</span><span class="w">
  </span><span class="n">rvest</span><span class="o">::</span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"tr~ tr+ tr p"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">rvest</span><span class="o">::</span><span class="n">html_text2</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">matrix</span><span class="p">(</span><span class="w">
    </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">byrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="p">{</span><span class="n">.</span><span class="p">[</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">.</span><span class="p">)),]}</span><span class="w">

</span><span class="c1"># turn first row into column names</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">us_life_expectancy_matrix</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">us_life_expectancy_matrix</span><span class="p">[</span><span class="m">1</span><span class="p">,]</span><span class="w">
</span><span class="n">us_life_expectancy_matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">us_life_expectancy_matrix</span><span class="p">[</span><span class="m">-1</span><span class="p">,]</span><span class="w">

</span><span class="n">us_life_expectancy_tbl</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="o">::</span><span class="n">as_tibble</span><span class="p">(</span><span class="n">us_life_expectancy_matrix</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate_all</span><span class="p">(</span><span class="n">as.numeric</span><span class="p">)</span><span class="w">

</span><span class="n">pretty_kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">us_life_expectancy_tbl</span><span class="p">))</span></code></pre></figure>

<table class="table table-striped" style="width: auto !important; ">
 <thead>
  <tr>
   <th style="text-align:right;"> Year </th>
   <th style="text-align:right;"> M </th>
   <th style="text-align:right;"> F </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1900 </td>
   <td style="text-align:right;"> 46.3 </td>
   <td style="text-align:right;"> 48.3 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1901 </td>
   <td style="text-align:right;"> 47.6 </td>
   <td style="text-align:right;"> 50.6 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1902 </td>
   <td style="text-align:right;"> 49.8 </td>
   <td style="text-align:right;"> 53.4 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1903 </td>
   <td style="text-align:right;"> 49.1 </td>
   <td style="text-align:right;"> 52.0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1904 </td>
   <td style="text-align:right;"> 46.2 </td>
   <td style="text-align:right;"> 49.1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1905 </td>
   <td style="text-align:right;"> 47.3 </td>
   <td style="text-align:right;"> 50.2 </td>
  </tr>
</tbody>
</table>

<p>With summaries of life expectancy for each year we can now flag some years of interest which will be useful for plotting. I identified the start of each decade as well as 1918, when the Spanish Flu killed some 50 million people worldwide.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">select_lifespans</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">us_life_expectancy_tbl</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">avg_lifespan</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">M</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">`F`</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="w">
    </span><span class="n">case_when</span><span class="p">(</span><span class="w">
      </span><span class="n">Year</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1900</span><span class="p">,</span><span class="w"> </span><span class="m">1998</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
      </span><span class="n">avg_lifespan</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">avg_lifespan</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
      </span><span class="n">Year</span><span class="w"> </span><span class="o">%%</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
      </span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">FALSE</span><span class="w">
      </span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">

</span><span class="n">lifespan_extension_20th</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">select_lifespans</span><span class="o">$</span><span class="n">avg_lifespan</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">select_lifespans</span><span class="o">$</span><span class="n">avg_lifespan</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

</span><span class="c1">#lifespan_extension_20th</span><span class="w">
</span><span class="c1"># 1.9</span><span class="w">

</span><span class="n">pretty_kable</span><span class="p">(</span><span class="n">select_lifespans</span><span class="p">)</span></code></pre></figure>

<table class="table table-striped" style="width: auto !important; ">
 <thead>
  <tr>
   <th style="text-align:right;"> Year </th>
   <th style="text-align:right;"> M </th>
   <th style="text-align:right;"> F </th>
   <th style="text-align:right;"> avg_lifespan </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1900 </td>
   <td style="text-align:right;"> 46.3 </td>
   <td style="text-align:right;"> 48.3 </td>
   <td style="text-align:right;"> 47.30 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1910 </td>
   <td style="text-align:right;"> 48.4 </td>
   <td style="text-align:right;"> 51.8 </td>
   <td style="text-align:right;"> 50.10 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1918 </td>
   <td style="text-align:right;"> 36.6 </td>
   <td style="text-align:right;"> 42.2 </td>
   <td style="text-align:right;"> 39.40 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1920 </td>
   <td style="text-align:right;"> 53.6 </td>
   <td style="text-align:right;"> 54.6 </td>
   <td style="text-align:right;"> 54.10 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1930 </td>
   <td style="text-align:right;"> 58.1 </td>
   <td style="text-align:right;"> 61.6 </td>
   <td style="text-align:right;"> 59.85 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1940 </td>
   <td style="text-align:right;"> 60.8 </td>
   <td style="text-align:right;"> 65.2 </td>
   <td style="text-align:right;"> 63.00 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1950 </td>
   <td style="text-align:right;"> 65.6 </td>
   <td style="text-align:right;"> 71.1 </td>
   <td style="text-align:right;"> 68.35 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1960 </td>
   <td style="text-align:right;"> 66.6 </td>
   <td style="text-align:right;"> 73.1 </td>
   <td style="text-align:right;"> 69.85 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1970 </td>
   <td style="text-align:right;"> 67.1 </td>
   <td style="text-align:right;"> 74.7 </td>
   <td style="text-align:right;"> 70.90 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1980 </td>
   <td style="text-align:right;"> 70.0 </td>
   <td style="text-align:right;"> 77.4 </td>
   <td style="text-align:right;"> 73.70 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1990 </td>
   <td style="text-align:right;"> 71.8 </td>
   <td style="text-align:right;"> 78.8 </td>
   <td style="text-align:right;"> 75.30 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1998 </td>
   <td style="text-align:right;"> 73.8 </td>
   <td style="text-align:right;"> 79.5 </td>
   <td style="text-align:right;"> 76.65 </td>
  </tr>
</tbody>
</table>

<p>Finally, we can visualize the changes in life expectancy across the $20^{th}$ century. From a low point in 1918, life expectancy rose 1.9-fold.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">us_life_expectancy_tbl</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">tidyr</span><span class="o">::</span><span class="n">gather</span><span class="p">(</span><span class="n">sex</span><span class="p">,</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">Year</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">sex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">sex</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"M"</span><span class="p">,</span><span class="w"> </span><span class="s2">"F"</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sex</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_path</span><span class="p">(</span><span class="n">linewidth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Life expectancy"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="s2">"Year"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="w">
    </span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"F"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pink"</span><span class="p">,</span><span class="w"> </span><span class="s2">"M"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dodgerblue"</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"From 1918, US life expectancy rose by **90%**"</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="w">
    </span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"none"</span><span class="p">,</span><span class="w">
    </span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ggtext</span><span class="o">::</span><span class="n">element_markdown</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">17</span><span class="p">,</span><span class="w"> </span><span class="n">lineheight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.2</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2025-02-02-gompertz/life_expectancy-1.png" alt="plot of chunk life_expectancy" /></p>

<h2 id="unpacking-the-gompertz-equation">Unpacking the Gompertz Equation</h2>

<p>As mentioned above, the form of the Gompertz equation is:</p>

\[\large \ln(h(t)) = \ln(\alpha) + \beta \cdot t \\
\large h(t) = \alpha \cdot e^{\beta \cdot t}\]
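<p>One practical consequence of the log-linear form is that $\ln(\alpha)$ and $\beta$ can be recovered by fitting a straight line to log-hazard versus age. A minimal sketch in Python, with assumed parameters and noise-free data so the recovery is exact:</p>

```python
import math

# Assumed Gompertz parameters for the demonstration
alpha, beta = 0.000064, 0.0861

# ln h(t) = ln(alpha) + beta * t, so log-hazard is linear in age
ages = list(range(30, 91))
log_h = [math.log(alpha) + beta * t for t in ages]

# Ordinary least squares slope and intercept
n = len(ages)
mean_t = sum(ages) / n
mean_y = sum(log_h) / n
slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(ages, log_h)) / \
        sum((t - mean_t) ** 2 for t in ages)
intercept = mean_y - slope * mean_t

print(slope)                 # recovers beta
print(math.exp(intercept))   # recovers alpha
```

With real mortality data the points scatter around the line (especially at young and very old ages, as discussed above), but the slope of the middle-age portion of the log-hazard curve is still the standard way to estimate $\beta$.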

<p>For given values of $\alpha$ and $\beta$, we could use the Gompertz equation to predict an individual’s hazard at age $t$.</p>

<p>Surviving to an age $t$ entails surviving every preceding year as well. Thus, the survival function can be described as the probability of surviving until a point $t$:</p>

\[\large S(t) = \prod_{x=1}^{t}\left(1-h(x)\right)\]

<p>In continuous time, this product is equivalent to the closed form:</p>

\[\large S(t) = e^{\frac{\alpha}{\beta}(1-e^{\beta t})}\]
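<p>It’s worth checking numerically that the discrete year-by-year product and the closed form agree. A quick sketch (in Python, with the same assumed parameters as elsewhere in this post); the small gap comes from discretizing a continuous-time hazard into yearly steps:</p>

```python
import math

# Assumed Gompertz parameters, matching the ones used later in the post
alpha, beta = 0.000064, 0.0861

def hazard(t):
    return alpha * math.exp(beta * t)

def survival_closed(t):
    # S(t) = exp((alpha / beta) * (1 - exp(beta * t)))
    return math.exp((alpha / beta) * (1 - math.exp(beta * t)))

def survival_product(t):
    # Discrete product of yearly survival probabilities: prod(1 - h(x))
    s = 1.0
    for x in range(1, t + 1):
        s *= 1 - hazard(x)
    return s

for t in (40, 60, 80):
    print(t, round(survival_product(t), 3), round(survival_closed(t), 3))
```

The two agree closely through middle age and diverge modestly at the oldest ages, where the yearly hazards are no longer small enough for the continuous approximation to be tight.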

<p>Also, since <a href="https://www.britannica.com/science/life-expectancy">life expectancy</a> is defined “assuming that the age-specific death rates for the year in question will apply throughout the lifetime of individuals born in that year” it should simply be the integral under the survival curve (or the sum since we are discretizing to year-by-year changes).</p>

<p>Based on this formulation, I wrote functions that map $\alpha$, $\beta$ and $t$ onto hazard and survival, and created plots summarizing a Gompertz model approximately fit to the modern US population.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gompertz_hazard</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">alpha</span><span class="o">*</span><span class="nf">exp</span><span class="p">(</span><span class="n">beta</span><span class="o">*</span><span class="n">times</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">gompertz_survival</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nf">exp</span><span class="p">((</span><span class="n">alpha</span><span class="o">/</span><span class="n">beta</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="m">1</span><span class="o">-</span><span class="nf">exp</span><span class="p">(</span><span class="n">beta</span><span class="o">*</span><span class="n">times</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># doubling of hazard every 8 years</span><span class="w">
</span><span class="n">BETA_CURRENT</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.0861</span><span class="w">
</span><span class="n">ALPHA_CURRENT</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.000064</span><span class="w">
</span><span class="n">AGES</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">105</span><span class="p">)</span><span class="w">

</span><span class="c1"># </span><span class="w">
</span><span class="n">gompertz_df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="o">::</span><span class="n">tibble</span><span class="p">(</span><span class="w">
  </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">AGES</span><span class="p">,</span><span class="w">
  </span><span class="n">hazard</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gompertz_hazard</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">BETA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">AGES</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">log_hazard</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">hazard</span><span class="p">),</span><span class="w">
    </span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gompertz_survival</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">BETA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">AGES</span><span class="p">)</span><span class="w">
    </span><span class="p">)</span><span class="w">

</span><span class="c1"># the integral of survival is life expectancy</span><span class="w">
</span><span class="n">life_expectancy</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">gompertz_df</span><span class="o">$</span><span class="n">survival</span><span class="p">)</span><span class="w">

</span><span class="n">hazard_grob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">gompertz_df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hazard</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span><span class="w">

</span><span class="n">survival_grob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">gompertz_df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">survival</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_polygon</span><span class="p">(</span><span class="w">
    </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">bind_rows</span><span class="p">(</span><span class="w">
      </span><span class="n">gompertz_df</span><span class="p">,</span><span class="w">
      </span><span class="n">tibble</span><span class="o">::</span><span class="n">tibble</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
      </span><span class="n">tibble</span><span class="o">::</span><span class="n">tibble</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
      </span><span class="p">),</span><span class="w">
    </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_vline</span><span class="p">(</span><span class="n">xintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="n">linewidth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">annotate</span><span class="p">(</span><span class="s2">"text"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.25</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"life expectancy"</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">log_hazard_grob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">gompertz_df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">log_hazard</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"log(hazard)"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span><span class="w">
  
</span><span class="n">log_hazard_grob</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">hazard_grob</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">survival_grob</span></code></pre></figure>

<p><img src="/figure/source/2025-02-02-gompertz/gompertz_basics-1.png" alt="plot of chunk gompertz_basics" /></p>

<p>To do this, I fixed $\beta$ at 0.0861, which implies a doubling of risk every 8 years. <strong>Since it is widely believed that historical lifespan extension has come primarily through modifying $\alpha$ rather than $\beta$, we can fit Gompertz models to achieve a desired life expectancy</strong>. For a life expectancy of around 76.6 (the life expectancy in 1998, which is also startlingly close to the current US life expectancy of 77.2), $\alpha$ would be $\sim0.000064$.</p>
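<p>As a sanity check on this calibration (a sketch that restates the standard Gompertz survival function, $S(t) = e^{-\frac{\alpha}{\beta}(e^{\beta t} - 1)}$, which the <code class="language-plaintext highlighter-rouge">gompertz_survival</code> helper used throughout this post implements), we can recover the implied life expectancy by summing the survival curve over integer ages:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Gompertz survival function: S(t) = exp(-(alpha / beta) * (exp(beta * t) - 1))
gompertz_survival = function(alpha, beta, age) {
  exp(-(alpha / beta) * (exp(beta * age) - 1))
}

# life expectancy ~ area under the survival curve, here summed over integer ages
sum(gompertz_survival(0.000064, 0.0861, seq(0, 150)))
# close to the ~76.6-year target</code></pre></figure>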

<p>To explore how $\alpha$ has changed across the $20^{th}$ century, we can map different levels of baseline risk onto historical changes in life expectancy.</p>

<h2 id="20th-century-lifespan-extension-was-through-a-26x-drop-in-baseline-hazard">20th century lifespan extension was through a <strong>26x</strong> drop in baseline hazard</h2>

<p>To map values of $\alpha$ onto the “select_lifespans” defined above, I estimated the Gompertz survival function for a range of $\alpha$ values starting at a modern value of $\sim0.000064$ and going up to a 26-fold increase, covering historical values (initially I used a wider range and then dialed it in). Tidyr’s crossing is really helpful for the all-by-all combinations of the $\alpha$ parameters and ages being evaluated. After calculating the survival functions, I grouped by $\alpha$ and integrated over ages to infer the life expectancy.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># define fold-change of max/min alpha or beta</span><span class="w">
</span><span class="n">ALPHA_BETA_FC</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">26</span><span class="w">

</span><span class="c1"># create a uniform sequence in log-space (the log transform is just so values will be evenly spaced when plotting lifespan ~ log(param))</span><span class="w">
</span><span class="n">alpha_possibilities</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="p">),</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="o">*</span><span class="n">ALPHA_BETA_FC</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span><span class="w">

</span><span class="n">gompertz_curves</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tidyr</span><span class="o">::</span><span class="n">crossing</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">alpha_possibilities</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">AGES</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gompertz_survival</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">BETA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="p">))</span><span class="w">

</span><span class="n">life_expectancy_by_alpha</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gompertz_curves</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">life_expectancy</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">survival</span><span class="p">),</span><span class="w"> </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">alpha</span><span class="p">)</span><span class="w">

</span><span class="c1"># what is the alpha corresponding to the selected years life expectancies</span><span class="w">
</span><span class="n">select_lifespans_w_alpha</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">select_lifespans</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">tidyr</span><span class="o">::</span><span class="n">crossing</span><span class="p">(</span><span class="n">life_expectancy_by_alpha</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">year_lifespan_vs_alpha_prediction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">avg_lifespan</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">Year</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">year_lifespan_vs_alpha_prediction</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">year_lifespan_vs_alpha_prediction</span><span class="p">))</span><span class="w">

</span><span class="n">stopifnot</span><span class="p">(</span><span class="nf">all</span><span class="p">(</span><span class="n">select_lifespans_w_alpha</span><span class="o">$</span><span class="n">year_lifespan_vs_alpha_prediction</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">

</span><span class="n">life_expectancy_by_alpha</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w">  </span><span class="n">log10</span><span class="p">(</span><span class="n">alpha</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggrepel</span><span class="o">::</span><span class="n">geom_text_repel</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">select_lifespans_w_alpha</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">log10</span><span class="p">(</span><span class="n">alpha</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">),</span><span class="w"> </span><span class="n">force</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nudge_x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.06</span><span class="p">,</span><span class="w"> </span><span class="n">nudge_y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="nf">expression</span><span class="p">(</span><span class="n">log</span><span class="p">[</span><span class="m">10</span><span class="p">]</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">alpha</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Life expectancy"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"20&lt;sup&gt;th&lt;/sup&gt; century lifespan extension was primarily &lt;br&gt;through a **26x** drop in baseline hazard"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="w">
    </span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ggtext</span><span class="o">::</span><span class="n">element_markdown</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">13</span><span class="p">,</span><span class="w"> </span><span class="n">lineheight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.2</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2025-02-02-gompertz/alpha_var-1.png" alt="plot of chunk alpha_var" /></p>

<p>The near-doubling of life expectancy across the $20^{th}$ century was chiefly driven by the massive medical advances of antibiotics and vaccines, as well as improvements in nutrition and hygiene. The deceleration of lifespan extension over the last thirty years reflects the challenge of removing risks one-by-one, like a game of whack-a-mole. Going forward, little is projected to change: the Social Security Administration projects that life expectancy in 2100 will only reach $\sim85$ years.</p>

<p><img src="https://www.shackett.org/figure/gompertz/ssa_projected_survival.png" alt="SSA projected survival" class="align-center" /></p>

<p>These are, and will be, hard-fought gains, but radically extending lifespan will require tackling the underlying risk factor behind the diseases of aging … <strong>aging itself</strong>.</p>

<h2 id="a-26x-drop-in-age-dependent-risk-would-increase-human-life-expectancy-by-125x">A <strong>26x</strong> drop in age-dependent risk would increase human life expectancy by <strong>12.5x</strong></h2>

<p>The 26-fold drop in age-independent hazard across the 20th century is MASSIVE, and I was curious what a comparable drop in $\beta$ going forward would look like. To explore this, I fixed $\alpha$ at a modern value (0.000064) and explored a range of $\beta$ values from the current value (0.0861, where risk doubles every 8 years) down to 0.0033, where risk would double only every ~208 years.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">AGES_EXTENDED</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">100000</span><span class="p">)</span><span class="w">
</span><span class="n">beta_possibilities</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">BETA_CURRENT</span><span class="p">),</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">BETA_CURRENT</span><span class="o">/</span><span class="n">ALPHA_BETA_FC</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span><span class="w">

</span><span class="n">gompertz_curves</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tidyr</span><span class="o">::</span><span class="n">crossing</span><span class="p">(</span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">beta_possibilities</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">AGES_EXTENDED</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">hazard</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gompertz_hazard</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="p">),</span><span class="w">
    </span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gompertz_survival</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">

</span><span class="n">life_expectancy_by_beta</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gompertz_curves</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">life_expectancy</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">survival</span><span class="p">),</span><span class="w"> </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">beta</span><span class="p">)</span><span class="w">

</span><span class="n">life_expectancy_range_ratio</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">life_expectancy_by_beta</span><span class="o">$</span><span class="n">life_expectancy</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">life_expectancy_by_beta</span><span class="o">$</span><span class="n">life_expectancy</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">life_expectancy_by_beta</span><span class="p">)]</span><span class="w">

</span><span class="n">life_expectancy_by_beta</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">log10</span><span class="p">(</span><span class="n">beta</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="nf">expression</span><span class="p">(</span><span class="n">log</span><span class="p">[</span><span class="m">10</span><span class="p">]</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">beta</span><span class="p">),</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">),</span><span class="w"> </span><span class="n">expand</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.02</span><span class="p">,</span><span class="m">0.02</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Life expectancy"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"A **26x** drop in age-dependent risk would&lt;br&gt;increase human lifespan by **12.5x**"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="w">
    </span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ggtext</span><span class="o">::</span><span class="n">element_markdown</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">14</span><span class="p">,</span><span class="w"> </span><span class="n">lineheight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.2</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2025-02-02-gompertz/beta_var-1.png" alt="plot of chunk beta_var" /></p>

<p>I don’t think it will be as easy to modify $\beta$ as $\alpha$, but even a two-fold drop in $\beta$ would increase life expectancy to 138 years. Precipitously dropping $\beta$ to achieve a life expectancy of 1,000 starts to sound like science fiction, but it is an interesting thought experiment. And why stop there! We can push this further and explore what a world with a $\beta$ of zero would look like. This concept is reflected in an interesting game by <a href="https://polstats.com/#!/life">Polstats</a> where you can simulate 100 individuals’ lifespans in a world where people only die of unnatural causes. The average lifespan in this cohort was ~10,000 years, and the longest-lived individual died at the ripe old age of 57,912 in a car accident.</p>

<p><img src="https://www.shackett.org/figure/gompertz/polstats.png" alt="Lifespans based on unnatural deaths" class="align-center" /></p>

<p>Playing this game a few times I was struck by how much the average lifespan of the cohort can shift based on a few long-lived stragglers who continue to dodge bullets (and cars).</p>

<p><strong>In this world, how old would we expect the oldest person to be?</strong></p>

<p>If hazard is constant over time then lifespans would follow a Geometric distribution, where the mean equals $1/\alpha$ = 15,625. If we instead thought of lifespans as continuous, which would be appropriate, we could equivalently model them with an exponential distribution (the continuous analogue, equivalent to exponential decay).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tibble</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">100000</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dgeom</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">ALPHA_CURRENT</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">survival</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span></code></pre></figure>

<p><img src="/figure/source/2025-02-02-gompertz/hazard_constant-1.png" alt="plot of chunk hazard_constant" /></p>

<p>The maximum lifespan under this model would be equivalent to the maximum of $N$ geometric draws. This distribution is described on <a href="https://math.stackexchange.com/questions/26167/expectation-of-the-maximum-of-i-i-d-geometric-random-variables">StackExchange</a> and it’s quite involved (there’s no <code class="language-plaintext highlighter-rouge">cgeom</code> function), so I’ll cop out and just take the maximum of a set of random draws.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1234</span><span class="p">)</span><span class="w">
</span><span class="n">current_earth_pop_w_geometric_lifespan</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rgeom</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7.9e9</span><span class="p">,</span><span class="w"> </span><span class="n">ALPHA_CURRENT</span><span class="p">)</span><span class="w">
</span><span class="n">max_lifespan</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">current_earth_pop_w_geometric_lifespan</span><span class="p">)</span></code></pre></figure>

<p>From this simulation, the oldest person on earth would be 345,353 years old.</p>
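<p>As a rough analytic cross-check (my own aside, not part of the simulation above): if lifespans are exponential with rate $\alpha$, the expected maximum of $N$ i.i.d. draws is approximately $(\ln N + \gamma)/\alpha$, where $\gamma$ is the Euler-Mascheroni constant:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># expected maximum of N i.i.d. exponential lifespans with rate alpha:
# E[max] ~ (log(N) + 0.5772) / alpha
(log(7.9e9) + 0.5772) / 0.000064
# ~365,000 years, the same ballpark as the simulated maximum</code></pre></figure>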

<h2 id="conclusion">Conclusion</h2>

<p>In this post, we’ve explored the Gompertz equation and how it can be used to model historical lifespan extension (decreasing $\alpha$) and the potential that slowing aging has for radical lifespan extension (decreasing $\beta$).</p>

<p>I’ve found that this framing is particularly helpful when discussing my professional work with non-scientists. I tried this out first in a presentation at a local meetup (<strong>Future Tech Immersive: AI x Synthetic Biology Meetup</strong>) and simplified the narrative last November when I spoke at the 2024 D One Growing Together Conference in Zürich. Both talks went quite well; aging is deeply personal to all of us. It’s something we see every day in ourselves, our family, and our friends. We accept it as a given, but it may not be, and that “what if?” continues to inspire me.</p>]]></content><author><name>Sean Hackett</name></author><category term="aging" /><category term="epidemiology" /><summary type="html"><![CDATA[A review of the False Discovery Rate and its use for shrinkage estimation]]></summary></entry><entry><title type="html">False Discovery Rate (FDR) Overview and lFDR-Based Shrinkage</title><link href="https://www.shackett.org/lfdr_shrinkage/" rel="alternate" type="text/html" title="False Discovery Rate (FDR) Overview and lFDR-Based Shrinkage" /><published>2022-06-11T00:00:00+00:00</published><updated>2022-06-11T00:00:00+00:00</updated><id>https://www.shackett.org/lfdr_shrinkage</id><content type="html" xml:base="https://www.shackett.org/lfdr_shrinkage/"><![CDATA[<p>Coming from a quantitative genetics background, correcting for multiple comparisons meant controlling the family-wise error rate (FWER) using a procedure like Bonferroni correction. This all changed when I took John Storey’s “Advanced Statistics for Biology” class in grad school. John is an expert in statistical interpretation of high-dimensional data and literally wrote the book, well paper, on false-discovery rate (FDR) as an author of <a href="https://www.pnas.org/content/100/16/9440">Storey &amp; Tibshirani 2003</a>. 
His description of the FDR has grounded my interpretation of hundreds of genomic datasets, and I’ve continued to pay this knowledge forward with dozens of whiteboard-style descriptions of the FDR for colleagues. As an interviewer and paper reviewer, I still regularly see accomplished individuals and groups for whom “FDR control” is a clear blind spot. In this post I’ll lay out how I whiteboard the FDR problem, and then highlight a specialized application of the FDR for “denoising” genomic datasets.</p>

<!--more-->

<h1 id="multiple-hypothesis-testing">Multiple Hypothesis Testing</h1>

<p>Statistical tests are designed so that if the null hypothesis is true, observed statistics will follow a defined null distribution; hence an observed statistic can be compared to the quantiles of the null distribution to calculate a p-value. In quantitative biology we frequently test hundreds to millions of hypotheses in parallel. The p-value for a single test can be interpreted roughly as: p &lt; 0.05, yay! [only slightly sarcastically]. But when we have many tests, some will possess small p-values by chance (10,000 null tests would yield ~500 p &lt; 0.05 findings by chance). Controlling for multiple hypotheses acknowledges this challenge, with the FDR and FWER providing alternative perspectives for winnowing spurious associations down to a set of high-confidence “discoveries”.</p>
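<p>A quick simulation (my own illustration, not part of the original whiteboard pitch) makes this concrete: under the null, p-values are uniform, so about 5% fall below 0.05 by chance alone:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(42)
null_pvalues = runif(10000)   # 10,000 tests where the null is true
sum(null_pvalues &lt; 0.05)      # roughly 500 spuriously "significant" results</code></pre></figure>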

<ul>
  <li>The <strong>FWER</strong> is the probability of making one or more false discoveries. FWER control is common in genetic association studies, providing the interpretation that, across all reported loci, there is an $\alpha$ chance that one or more are spurious. Bonferroni correction controls the FWER by accepting tests whose p-values are less than $\frac{\alpha}{N}$, where $\alpha$ is the type I error rate and $N$ is the number of hypotheses.</li>
  <li>The <strong>FDR</strong> involves selecting a set of observations (the positive tests) that constrains the expected proportion of false positives ($\mathbb{E}$FP) among the selections: i.e., $\mathbb{E}\left[\frac{\text{FP}}{\text{FP} + \text{TP}}\right] \leq \alpha$. The FDR is less conservative than the FWER and is useful whenever we want to interpret the trends in a dataset even if individual findings may be spurious. This thought process fits nicely into genomics, where differential expression analysis is frequently coupled to reductionist approaches like Gene Set Enrichment Analysis (GSEA).</li>
</ul>

<p>The FWER and FDR control different properties, so it’s not entirely fair to compare them, yet I often see people controlling the FWER and then interpreting results as if they were controlling the FDR (and vice versa), so it’s important to note their practical differences. Whenever we have multiple hypotheses, the FWER will be more conservative (underestimating changes) than the FDR, and this difference in power widens with the number of hypotheses being tested. If we carried out more tests, the p-value threshold for detecting discoveries under FWER control would drop (perhaps resulting in a drop in discoveries even as we measure more features!), while this threshold should be roughly constant if we are controlling the FDR, resulting in a proportional increase in the number of significant hits as we detect more features. This distinction is frequently misunderstood - folks will say things like “I’m not sure these changes would survive the FDR.” The shadow of the FWER results in a perception that collecting more information will prevent us from detecting “high confidence” hits. In reality, a well-selected FDR procedure can help to squeeze the most power out of a dataset.</p>

<p>There are multiple ways to control the FDR, and I think the <a href="https://www.pnas.org/content/100/16/9440">Storey &amp; Tibshirani</a> “q-value” framework is particularly appealing because of its Bayesian elegance and statistical power. When the assumptions underlying the q-value approach break down (basically, when my p-values don’t look nice like the ones below), I fall back on the Benjamini-Hochberg (BH) approach for controlling the FDR. BH preceded Storey’s q-value but is a special case of it (with $\hat{\pi}_{0}$ set to one).</p>
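<p>The BH procedure is short enough to write out by hand, which makes that connection explicit: BH’s step-up adjustment is what the q-value machinery reduces to when $\hat{\pi}_{0}$ is pinned at one. This is a hand-rolled sketch, checked against base R’s <code>p.adjust</code>:</p>

```r
# the BH step-up adjustment: sort p-values, scale the i-th smallest by N/i,
# then enforce monotonicity with a suffix minimum; Storey's q-value
# effectively replaces the leading N with pi0_hat * N
manual_bh <- function(p) {
  n <- length(p)
  o <- order(p)
  adj <- pmin(1, rev(cummin(rev(n / seq_len(n) * p[o]))))
  adj[order(o)]  # return to the original order
}

set.seed(1)
p <- runif(20)^2
all.equal(manual_bh(p), p.adjust(p, method = "BH"))
```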

<h1 id="controlling-the-false-discovery-rate-fdr-with-q-values">Controlling the False Discovery Rate (FDR) with Q-Values</h1>

<p>Using the approach of <a href="https://www.pnas.org/content/100/16/9440">Storey &amp; Tibshirani</a>, we can think about a p-value histogram as a mixture of two distributions:</p>

<ul>
  <li>Negatives - features with no signal that follow our null distribution and whose p-values will in turn be distributed as $\sim\text{Unif}(0,1)$</li>
  <li>Positives - features containing some signal which will consequently have elevated test statistics and tend towards having small p-values.</li>
</ul>

<p>To see this visually, we can generate a mini-simulation containing a mixture of negatives and positives.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">

</span><span class="c1"># ggplot default theme</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bottom"</span><span class="p">))</span><span class="w">

</span><span class="c1"># define simulation parameters</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1234</span><span class="p">)</span><span class="w">
</span><span class="n">n_sims</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">100000</span><span class="w">
</span><span class="n">pi0</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.5</span><span class="w">
</span><span class="n">beta</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1.5</span><span class="w">

</span><span class="n">simple_pvalue_mixture</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">truth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="s2">"Positive"</span><span class="p">,</span><span class="w"> </span><span class="n">n_sims</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="o">-</span><span class="n">pi0</span><span class="p">)),</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="s2">"Negative"</span><span class="p">,</span><span class="w"> </span><span class="n">n_sims</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">pi0</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># positives are centered around beta; negatives around 0</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">truth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">truth</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Positive"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Negative"</span><span class="p">)),</span><span class="w">
         </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">truth</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Positive"</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
         </span><span class="c1"># observations sampled from a normal distribution centered on 0</span><span class="w">
         </span><span class="c1"># or beta with an SD of 1 (the default)</span><span class="w">
         </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mu</span><span class="p">),</span><span class="w">
         </span><span class="c1"># carryout a 1-tailed wald test about 0 </span><span class="w">
         </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pnorm</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span><span class="w">

</span><span class="n">observation_grob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">simple_pvalue_mixture</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">truth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_density</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"Observations with and without signal"</span><span class="p">)</span><span class="w">

</span><span class="n">pvalues_grob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">simple_pvalue_mixture</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">truth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="o">=</span><span class="m">0.01</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"P-values of observations with and without signal"</span><span class="p">)</span><span class="w">

</span><span class="n">gridExtra</span><span class="o">::</span><span class="n">grid.arrange</span><span class="p">(</span><span class="n">observation_grob</span><span class="p">,</span><span class="w"> </span><span class="n">pvalues_grob</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-06-11-lfdr_shrinkage/pvalue_hist_sim-1.png" alt="plot of chunk pvalue_hist_sim" /></p>

<p>While there is a mixture of positive and negative observations, their values cannot be clearly separated (that would be too easy!); rather, noise works against some positives, and some negative observations take on extreme values by chance. This is paralleled by the p-values of positive and negative observations: true positive p-values tend to be small but may also be large, while true negative p-values are uniformly distributed from 0 to 1 and are as likely to be small as large.</p>

<p>To control the FDR at a level $\alpha$, the Storey procedure first estimates the fraction of null hypotheses (0.5 in our simulation): $\hat{\pi}_{0}$.</p>

<p>This is done by looking at large p-values (near 1). Because large p-values will rarely come from signal-containing positives, their density reflects mostly nulls, and there will be fewer large p-values than we would expect if every test were null. For example, there are 5106 p-values &gt; 0.9 in our example, which is close to 5000, the value we would expect from $N\pi_{0}*0.1$ (10<sup>5</sup> * $\pi_{0}$ * 0.1). (I’ll use the true value of $\pi_{0}$ (0.5) as a stand-in for the estimate $\hat{\pi}_{0}$ so the numbers are a little clearer.)</p>
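<p>The simplest version of this estimate takes the fraction of p-values above some threshold $\lambda$ and rescales it by the width of that interval (the <code>qvalue</code> package smooths this over many values of $\lambda$; this one-$\lambda$ version is only a sketch, with illustrative simulated p-values):</p>

```r
# pi0_hat = fraction of p-values above lambda, rescaled by the interval width;
# with lambda = 0.9 this is exactly the "count p-values > 0.9" logic above
estimate_pi0 <- function(p, lambda = 0.5) {
  mean(p > lambda) / (1 - lambda)
}

set.seed(1)
p <- c(runif(5000), rbeta(5000, 0.5, 10))  # pi0 = 0.5 by construction
estimate_pi0(p)  # close to 0.5
```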

<p>Just as we expected 5000 null p-values on the interval [0.9, 1], we would expect 5000 null p-values on the interval [0, 0.1]. But there are actually 34415 p-values &lt; 0.1, because positives tend to have small p-values. If we chose 0.1 as a possible cutoff, then we would expect 5000 false positives, while the observed number of p-values &lt; 0.1 equals the denominator of the FDR ($\text{FP} + \text{TP}$). The ratio of these two values, 0.145, would be the expected FDR at a p-value cutoff of 0.1. Now, we usually don’t want to choose a cutoff and then live with whatever FDR we get, but rather control the FDR at a level $\alpha$ by treating the p-value cutoff as a tunable parameter.</p>
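<p>This back-of-the-envelope FDR estimate is just a ratio of expected false positives to observed discoveries. A sketch, again with illustrative simulated p-values rather than the post’s:</p>

```r
# expected FDR at a p-value cutoff: (null p-values expected below the cutoff)
# divided by (all p-values observed below the cutoff, i.e. FP + TP)
estimate_fdr_at_cutoff <- function(p, cutoff, pi0) {
  expected_fp <- pi0 * length(p) * cutoff
  observed    <- sum(p < cutoff)
  expected_fp / observed
}

set.seed(1)
p <- c(runif(5000), rbeta(5000, 0.5, 10))  # half nulls, half signal
estimate_fdr_at_cutoff(p, cutoff = 0.1, pi0 = 0.5)
```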

<p>To apply q-value based FDR control we can use the <code>qvalue</code> package:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># install q-value from bioconductor if needed</span><span class="w">
</span><span class="c1"># remotes::install_bioc("qvalue")</span><span class="w">

</span><span class="n">library</span><span class="p">(</span><span class="n">qvalue</span><span class="p">)</span><span class="w">
</span><span class="n">qvalue_estimates</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">qvalue</span><span class="p">(</span><span class="n">simple_pvalue_mixture</span><span class="o">$</span><span class="n">p</span><span class="p">)</span></code></pre></figure>

<p>The q-value object contains an estimate of $\pi_{0}$ of 0.496, which is close to the true value of 0.5. It also contains a vector of q-values, lFDRs, and other goodies.</p>

<p>The q-values are the quantity that we’re usually interested in; if we take all of the q-values less than a target cutoff of, say, 0.05, that should give us a set of “discoveries” realizing a 5% FDR.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">simple_qvalues</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simple_pvalue_mixture</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">q</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">qvalue_estimates</span><span class="o">$</span><span class="n">qvalues</span><span class="p">)</span><span class="w">

</span><span class="n">fdr_pvalue_cutoff</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">simple_qvalues</span><span class="o">$</span><span class="n">p</span><span class="p">[</span><span class="n">simple_qvalues</span><span class="o">$</span><span class="n">q</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">])</span><span class="w">

</span><span class="n">simple_qvalues</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simple_qvalues</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">hypothesis_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">p</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">fdr_pvalue_cutoff</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"TP"</span><span class="p">,</span><span class="w">
                                     </span><span class="n">p</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">fdr_pvalue_cutoff</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"FP"</span><span class="p">,</span><span class="w">
                                     </span><span class="n">p</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">fdr_pvalue_cutoff</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"FN"</span><span class="p">,</span><span class="w">
                                     </span><span class="n">p</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">fdr_pvalue_cutoff</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"TN"</span><span class="p">))</span><span class="w">

</span><span class="n">hypothesis_type_counts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simple_qvalues</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">hypothesis_type</span><span class="p">)</span><span class="w">

</span><span class="n">TP</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">hypothesis_type_counts</span><span class="o">$</span><span class="n">n</span><span class="p">[</span><span class="n">hypothesis_type_counts</span><span class="o">$</span><span class="n">hypothesis_type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"TP"</span><span class="p">]</span><span class="w">
</span><span class="n">FP</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">hypothesis_type_counts</span><span class="o">$</span><span class="n">n</span><span class="p">[</span><span class="n">hypothesis_type_counts</span><span class="o">$</span><span class="n">hypothesis_type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"FP"</span><span class="p">]</span><span class="w">
</span><span class="n">FDR</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FP</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">TP</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">FP</span><span class="p">)</span><span class="w">

</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">hypothesis_type_counts</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">kableExtra</span><span class="o">::</span><span class="n">kable_styling</span><span class="p">(</span><span class="n">full_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre></figure>

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> hypothesis_type </th>
   <th style="text-align:right;"> n </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> FN </td>
   <td style="text-align:right;"> 38910 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FP </td>
   <td style="text-align:right;"> 616 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> TN </td>
   <td style="text-align:right;"> 49384 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> TP </td>
   <td style="text-align:right;"> 11090 </td>
  </tr>
</tbody>
</table>

<p>In this case, thanks to our simulation, we know whether individual discoveries are true or false positives. As a result, we can determine that the realized FDR is 0.053, close to our target of 0.05.</p>

<p>In most cases we would take our discoveries and work with them further, confident that, as a population, only ~5% of them are bogus. But in some cases we care about how likely an individual observation is to be a false positive. Here, we can look at the local density of p-values near an observation of interest to estimate a local version of the FDR, the local FDR (lFDR).</p>
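<p>Formally, under the two-group mixture model above, the lFDR is the posterior probability that an observation with p-value $p$ is null. Writing $f$ for the overall p-value density and $f_{0}$ for the null density (which is 1, since null p-values are uniform), this is:</p>

\[\text{lFDR}(p) = \Pr(\text{null} \mid p) = \frac{\pi_{0} f_{0}(p)}{f(p)} = \frac{\pi_{0}}{f(p)}\]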

<p>We took advantage of this property during my collaboration with Google Brain aimed at improving the accuracy of peptide matches to proteomics spectra using labels from traditional informatics (<a href="https://arxiv.org/abs/1808.06576">arXiv</a>). In this study we weighted peptides’ labels by their lFDR in a cross-entropy loss to more strongly penalize failed predictions on high-confidence labels.</p>

<h1 id="lfdr-based-shrinkage">lFDR-based shrinkage</h1>

<p>Because the lFDR reflects the relative odds that an observation is null, it is a useful measure for shrinkage or thresholding that aims to remove noise and better approximate the true value. To do this we can weight an observation by 1-lFDR. One interpretation is that we are using the lFDR to hedge our bets between the positive and negative mixture components: weighting the null hypothesis value of $\mu = 0$ with confidence lFDR, and the alternative ($\mu \neq 0$) value of x with confidence 1-lFDR:</p>

\[x_{\text{shrinkage}} = \text{lFDR}\cdot0 + (1-\text{lFDR})\cdot x\]

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">true_values</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tribble</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">truth</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w">
                       </span><span class="s2">"Positive"</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w">
                       </span><span class="s2">"Negative"</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">

</span><span class="n">shrinkage_estimates</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simple_qvalues</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">lfdr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">qvalue_estimates</span><span class="o">$</span><span class="n">lfdr</span><span class="p">,</span><span class="w">
         </span><span class="n">xs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="o">*</span><span class="p">(</span><span class="m">1</span><span class="o">-</span><span class="n">lfdr</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">truth</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">xs</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">gather</span><span class="p">(</span><span class="n">processing</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">truth</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">processing</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">processing</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"x"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"original value"</span><span class="p">,</span><span class="w">
                                </span><span class="n">processing</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"xs"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"shrinkage estimate"</span><span class="p">))</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">shrinkage_estimates</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">processing</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_density</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_vline</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">true_values</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">xintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mu</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"chartreuse"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_grid</span><span class="p">(</span><span class="n">truth</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_y"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"lFDR-based shrinkage improves agreement between observations and the true mean"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-06-11-lfdr_shrinkage/lFDR_shrinkage_ex-1.png" alt="plot of chunk lFDR_shrinkage_ex" /></p>

<p>Using lFDR-based shrinkage, values which are just noise were aggressively shrunk toward their true mean of 0, such that very little of their variation remains. Positives were shrunk using the same methodology, but extreme values were largely retained near their measured value. We can verify that there is an overall decrease in uncertainty about the true mean, reflecting the removal of noise.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">shrinkage_estimates</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">true_values</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"truth"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">resid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mu</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">processing</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">RMSE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">mean</span><span class="p">(</span><span class="n">resid</span><span class="o">^</span><span class="m">2</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">kableExtra</span><span class="o">::</span><span class="n">kable_styling</span><span class="p">(</span><span class="n">full_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre></figure>

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> processing </th>
   <th style="text-align:right;"> RMSE </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> original value </td>
   <td style="text-align:right;"> 0.9994577 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> shrinkage estimate </td>
   <td style="text-align:right;"> 0.8103141 </td>
  </tr>
</tbody>
</table>

<h1 id="future-work">Future Work</h1>

<p>In a future post I’ll describe how lFDR-based shrinkage is particularly useful for signal processing of time-resolved perturbation data. In this case, early direct changes are rare, while late indirect changes are quite common. This intuition can be folded into how we estimate the lFDR by estimating a $\hat{\pi}_{0}$ which decreases monotonically with time using the <a href="https://academic.oup.com/biostatistics/article/22/1/68/5499195">functional FDR</a>.</p>]]></content><author><name>Sean Hackett</name></author><category term="statistics" /><summary type="html"><![CDATA[A review of the False Discovery Rate and its use for shrinkage estimation]]></summary></entry><entry><title type="html">Romic: Data Structures and EDA for Genomics</title><link href="https://www.shackett.org/romic/" rel="alternate" type="text/html" title="Romic: Data Structures and EDA for Genomics" /><published>2022-05-08T00:00:00+00:00</published><updated>2022-05-08T00:00:00+00:00</updated><id>https://www.shackett.org/romic</id><content type="html" xml:base="https://www.shackett.org/romic/"><![CDATA[<p>Romic is an R package, which I developed, that is now available on <a href="https://cran.r-project.org/web/packages/romic/index.html">CRAN</a>. There is already a nice README for romic on <a href="https://github.com/calico/romic">GitHub</a> and a <a href="https://calico.github.io/romic/articles/romic.html">pkgdown site</a>, so here I will add some context regarding the problems this package addresses.</p>

<p>The first problem we’ll consider is that genomics data analysis involves a lot of shuffling between various forms of wide and tall data, incrementally tacking on attributes as needed. Romic aims to simplify this process by providing a set of flexible data structures that accommodate a range of measurements and metadata and can be readily interconverted based on the needs of an analysis.</p>

<p>The second challenge we’ll contend with is decreasing the time it takes to generate a plot, so that the mechanics of plotting rarely interrupt the thought process of data interpretation. Building upon romic’s data structures, the meaning of each variable (feature-, sample-, or measurement-level) is encoded in a schema, so variables can be appropriately surfaced to filter or reorder a dataset and to add ggplot2 aesthetics. Interactivity is facilitated using Shiny apps composed from romic-centric Shiny modules.</p>

<p>Both of these solutions increase the speed, clarity, and succinctness of analysis. I’ve developed and will continue to refine this package to save myself (and hopefully others!) time.</p>

<!--more-->

<p>While romic is discussed in the parlance of genomics, romic’s data structures are useful for any moderately sized feature-level data, and its interactive visualizations can be used for any data with dense continuous measurements. Because of its generality, romic serves as a useful underlying data structure that can be combined with application-specific schemas and methods to create powerful, succinct workflows. One such application, which I’ll discuss in a future post, is the <a href="https://github.com/calico/claman">claman</a> R package which builds upon romic to create an opinionated workflow for mass spectrometry data analysis.</p>

<h1 id="data-structures-for-genomics">Data Structures for Genomics</h1>

<h2 id="conventional-formatting">Conventional Formatting</h2>

<p>Datasets in genomics are often generated and shared in wide formats (one row per gene, one column per sample), often with extra rows and columns added for feature and sample metadata. At first blush this is a good format because it supports both folks who want to work with a matrix-level dataset and individuals who are interested in specific genes.</p>

<p>That said, manipulating and visualizing such data requires integrating metadata with measurements. For example, when correcting for batch effects we often want to incorporate sample-level information, such as the date samples were collected. Combining numeric measurements with categorical and numeric metadata is awkward in matrices. One could do this with attributes, but generally we would just maintain separate tables for samples and features, since each variable in a table can have its own class. A benefit of this approach is that working with matrices can be very fast, while the major downsides are having to maintain multiple similar versions of a dataset, and needing to be careful about maintaining the alignment of measurements, features, and samples.</p>
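<p>The matrix-plus-side-tables pattern, and the alignment hazard it creates, can be sketched in a few lines of base R (a toy example, not from the paper):</p>

```r
# toy wide matrix: one row per gene, one column per sample
expr <- matrix(
  rnorm(6), nrow = 3,
  dimnames = list(
    c("YAL001C", "YAL002W", "YAL003W"),
    c("s1", "s2")
  )
)

# separate metadata tables: one row per feature / per sample
feature_meta <- data.frame(
  gene = rownames(expr),
  process = c("transcription", "transport", "translation")
)
sample_meta <- data.frame(
  sample = colnames(expr),
  batch = c("A", "B")
)

# the alignment we must maintain by hand: any reordering or subsetting of
# the matrix silently invalidates the side tables unless we check
stopifnot(
  identical(rownames(expr), feature_meta$gene),
  identical(colnames(expr), sample_meta$sample)
)
```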

<h2 id="romics-tabular-representations">Romic’s Tabular Representations</h2>

<p>An alternative to manipulating matrices is to work fully with tabular data. This mode of operation is very similar to working with SQL, allowing us to maintain a complex, yet organized dataset. Using tabular “tidy” data also allows us to tap into the expansive suite of tools in the tidyverse. Working with features, samples, and measurements tables allows us to modify each table separately, while the three tables can be combined (using primary key - foreign key relationships) if we need to add sample- or feature-level context to measurements.</p>
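<p>The three-table layout and its key-based joins can be illustrated in base R (a hypothetical toy dataset, using <code>merge()</code> in place of dplyr joins):</p>

```r
# features, samples, and measurements linked by primary/foreign keys
features <- data.frame(
  gene = c("YAL001C", "YAL002W"),
  process = c("transcription", "transport")
)
samples <- data.frame(
  sample = c("s1", "s2"),
  batch = c("2022-01-01", "2022-01-08")
)
measurements <- data.frame(
  gene = rep(c("YAL001C", "YAL002W"), each = 2),
  sample = rep(c("s1", "s2"), times = 2),
  abundance = c(1.2, 0.8, -0.5, 0.1)
)

# add sample-level context (e.g., batch for batch-effect correction)
# by joining measurements to the samples table on its foreign key,
# then pull in feature-level context the same way
annotated <- merge(measurements, samples, by = "sample")
annotated <- merge(annotated, features, by = "gene")
```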

<p>Romic provides two data structures, the triple_omic and tidy_omic classes, for representing these two scenarios. These formats can be used interchangeably in romic’s functions by treating them as a T*omics (tomic) meta-class. Most exported functions from romic take a tomic object, which means they can convert to whatever format makes the most sense for a function under the hood and then return a triple_omic or tidy_omic object depending on the input type.</p>

<p>Using a schema, tables can be combined and then broken apart again without constant guidance, and validators quickly flag data manipulation errors (such as non-unique primary keys, or measurements of the same sample with different sample attributes).</p>
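<p>As a flavor of what such a validator catches, here is a hypothetical sketch (not romic’s actual implementation) of a check for non-unique primary keys:</p>

```r
# hypothetical validator: flag duplicated values in a primary key column
check_primary_key <- function(df, pk) {
  dupes <- unique(df[[pk]][duplicated(df[[pk]])])
  if (length(dupes) > 0) {
    stop(sprintf(
      "non-unique primary key '%s': %s",
      pk, paste(dupes, collapse = ", ")
    ))
  }
  invisible(TRUE)
}

# a features table with an accidentally duplicated gene
features <- data.frame(gene = c("YAL001C", "YAL002W", "YAL001C"))
result <- tryCatch(
  check_primary_key(features, "gene"),
  error = function(e) conditionMessage(e)
)
```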

<p>By taking care of many of the joins and reshaping operations that we would otherwise have to do ourselves, romic helps to simplify analyses while avoiding common data manipulation errors. It directly supports dplyr and some ggplot operations, while data can also be easily pulled out of the romic format (and then added back if desired) based on users’ needs.</p>

<h1 id="exploratory-data-analysis-for-genomics">Exploratory Data Analysis for Genomics</h1>

<p>To demonstrate how easily romic can be used for formatting and exploratory data analysis, we can reanalyze an existing dataset.</p>

<p>Following a tradition set by Dave Robinson of teaching statistical analysis of genomics data using yeast microarrays (<a href="http://varianceexplained.org/r/tidy-genomics">link</a>), I generally teach statistical genomics with the <a href="https://www.molbiolcell.org/doi/full/10.1091/mbc.e07-08-0779">Brauer et al. 2008</a> dataset and this study formed the basis of romic’s <a href="https://calico.github.io/romic/articles/romic.html">vignette</a> and examples. To expand this theme, here we can look at another old-school yeast expression dataset. This one has over 5,500 citations!</p>

<p>In <a href="https://www.molbiolcell.org/doi/full/10.1091/mbc.11.12.4241">Gasch et al. 2000</a> the authors explored how yeast expression depends on a range of stressors. Gasch2K revealed that regardless of the nature of a stressor, yeast tend to respond to the threat with a relatively stereotypical gene expression response termed the “environmental stress response” (the ESR).</p>

<p>David Botstein (the senior author of both the Brauer and Gasch papers) describes the logic behind the ESR with a Star Trek-themed analogy. The idea is that when the Starship Enterprise is cruising along, most power goes to the engine. But, when the Enterprise is under attack (whether from Klingons, Romulans or asteroids) power needs to be redirected to the shields to combat the threat. Cells follow this “shields up, shields down” logic, growing fast when conditions are good and hunkering down when they are not. An interesting corollary of this behavior is that when facing one stress (such as desiccation), cells will simultaneously become more resistant to other stressors (such as heat shock).</p>

<p><img src="https://www.shackett.org/figure/romic/shields_up_down.png" alt="Shields Up, Shields Down" class="align-center" /></p>

<p>While humans have more complicated stress sensing pathways than yeast, the mammalian equivalent of the ESR, termed the integrated stress response (ISR), still serves an important role in sensing and responding to diverse stresses. Modulating this pathway is being actively explored as an anti-aging/disease therapy by <a href="https://www.calicolabs.com/publication/the-small-molecule-isrib-rescues-the-stability-and-activity-of-vanishing-white-matter-disease-eif2b-mutant-complexes">Calico</a>, <a href="https://www.alzforum.org/therapeutics/dnl343">Denali</a> and <a href="https://altoslabs.com/">Altos</a>.</p>

<h2 id="data-loading">Data Loading</h2>

<p>In what can only be described as par for the course in bioinformatics, while writing this post the Stanford site that was hosting Gasch2K was down, requiring me to obtain the dataset using the Wayback Machine. Having moved the dataset to my site (hosted on GitHub Pages), we can read it directly from a URL.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># environment setup</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">suppressPackageStartupMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">))</span><span class="w">
</span><span class="c1"># install from CRAN with install.packages("romic")</span><span class="w">
</span><span class="c1"># right now it's probably better to install the dev version from GitHub</span><span class="w">
</span><span class="c1"># with remotes::install_github("calico/romic")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">romic</span><span class="p">)</span><span class="w">

</span><span class="n">gasch_2000</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">read_tsv</span><span class="p">(</span><span class="w">
  </span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"https://www.shackett.org/files/gasch2000.txt"</span><span class="p">,</span><span class="w">
  </span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">cols</span><span class="p">()</span><span class="w"> </span><span class="c1"># to accept default column types</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="n">gasch_matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gasch_2000</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">UID</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">NAME</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">GWEIGHT</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">gasch_matrix</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gasch_2000</span><span class="o">$</span><span class="n">UID</span><span class="w">

</span><span class="c1"># output</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">gasch_matrix</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="p">{</span><span class="nf">c</span><span class="p">(</span><span class="s2">"rows"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="s2">"columns"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.</span><span class="p">[</span><span class="m">2</span><span class="p">])}</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">t</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">kableExtra</span><span class="o">::</span><span class="n">kable_styling</span><span class="p">(</span><span class="n">full_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre></figure>

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> rows </th>
   <th style="text-align:right;"> columns </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 6152 </td>
   <td style="text-align:right;"> 173 </td>
  </tr>
</tbody>
</table>

<h2 id="process-metadata">Process metadata</h2>

<p>To interpret any of the patterns in this dataset, we’ll need some metadata describing both the measured genes and samples.</p>

<h3 id="genes">Genes</h3>

<p>Genes are frequently summarized using Gene Ontology (GO) terms that capture their sub-cellular localization (CC), molecular function (MF) or biological process (BP). These are typically one-to-many relationships where a given gene will belong to multiple GO terms in each of the three ontologies. The GO slim ontologies used here are a curated subset of GO terms which map each gene to a single BP, MF and CC term. These ontologies are convenient for quick data slicing and inspection, but we would be better off with the full ontologies for systematic approaches like Gene Set Enrichment Analysis (GSEA).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">goslim_mappings</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">read_tsv</span><span class="p">(</span><span class="w">
    </span><span class="s2">"https://downloads.yeastgenome.org/curation/literature/go_slim_mapping.tab"</span><span class="p">,</span><span class="w">
    </span><span class="n">col_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"ORF"</span><span class="p">,</span><span class="w"> </span><span class="s2">"common"</span><span class="p">,</span><span class="w"> </span><span class="s2">"SGD"</span><span class="p">,</span><span class="w"> </span><span class="s2">"category"</span><span class="p">,</span><span class="w"> </span><span class="s2">"geneset"</span><span class="p">,</span><span class="w"> </span><span class="s2">"GO"</span><span class="p">,</span><span class="w"> </span><span class="s2">"class"</span><span class="p">),</span><span class="w">
    </span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">cols</span><span class="p">()</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">GO</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">ORF</span><span class="p">,</span><span class="w"> </span><span class="n">category</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">tidyr</span><span class="o">::</span><span class="n">spread</span><span class="p">(</span><span class="n">category</span><span class="p">,</span><span class="w"> </span><span class="n">geneset</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="w">
    </span><span class="n">ORF</span><span class="p">,</span><span class="w"> </span><span class="n">common</span><span class="p">,</span><span class="w"> </span><span class="n">SGD</span><span class="p">,</span><span class="w"> </span><span class="n">class</span><span class="p">,</span><span class="w">
    </span><span class="n">cellular_compartment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">C</span><span class="p">,</span><span class="w">
    </span><span class="n">molecular_function</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w">
    </span><span class="n">biological_process</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">P</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w">

</span><span class="n">feature_metadata</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gasch_2000</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">UID</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">goslim_mappings</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"UID"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ORF"</span><span class="p">))</span><span class="w">

</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">feature_metadata</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">))</span></code></pre></figure>

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> UID </th>
   <th style="text-align:left;"> common </th>
   <th style="text-align:left;"> SGD </th>
   <th style="text-align:left;"> class </th>
   <th style="text-align:left;"> cellular_compartment </th>
   <th style="text-align:left;"> molecular_function </th>
   <th style="text-align:left;"> biological_process </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> YAL001C </td>
   <td style="text-align:left;"> TFC3 </td>
   <td style="text-align:left;"> S000000001 </td>
   <td style="text-align:left;"> ORF|Verified </td>
   <td style="text-align:left;"> mitochondrion </td>
   <td style="text-align:left;"> DNA binding </td>
   <td style="text-align:left;"> biosynthetic process </td>
  </tr>
  <tr>
   <td style="text-align:left;"> YAL002W </td>
   <td style="text-align:left;"> VPS8 </td>
   <td style="text-align:left;"> S000000002 </td>
   <td style="text-align:left;"> ORF|Verified </td>
   <td style="text-align:left;"> CORVET complex </td>
   <td style="text-align:left;"> enzyme binding </td>
   <td style="text-align:left;"> endosomal transport </td>
  </tr>
  <tr>
   <td style="text-align:left;"> YAL003W </td>
   <td style="text-align:left;"> EFB1 </td>
   <td style="text-align:left;"> S000000003 </td>
   <td style="text-align:left;"> ORF|Verified </td>
   <td style="text-align:left;"> eukaryotic translation elongation factor 1 complex </td>
   <td style="text-align:left;"> enzyme regulator activity </td>
   <td style="text-align:left;"> biosynthetic process </td>
  </tr>
  <tr>
   <td style="text-align:left;"> YAL004W </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> S000002136 </td>
   <td style="text-align:left;"> ORF|Dubious </td>
   <td style="text-align:left;"> cellular component </td>
   <td style="text-align:left;"> molecular function </td>
   <td style="text-align:left;"> biological process </td>
  </tr>
  <tr>
   <td style="text-align:left;"> YAL005C </td>
   <td style="text-align:left;"> SSA1 </td>
   <td style="text-align:left;"> S000000004 </td>
   <td style="text-align:left;"> ORF|Verified </td>
   <td style="text-align:left;"> cell wall </td>
   <td style="text-align:left;"> ATP hydrolysis activity </td>
   <td style="text-align:left;"> biosynthetic process </td>
  </tr>
</tbody>
</table>

<h3 id="samples">Samples</h3>

<p>Working with a fresh dataset invariably involves some munging to get data and metadata into a usable format. In the case of the Gasch2K dataset, organizing samples was the most painful part of this process. Gasch2K’s samples are identified with short, irregularly formatted names, so it takes a bit of work to organize them. We could address this problem with a manually curated spreadsheet (I generally use tibble::tribble() for small tables and Google Sheets for larger ones). Luckily, the samples here are still organized enough that we can programmatically summarize them. Samples are defined in two ways: first, by the type of stressor (e.g., heat, starvation, …) and second, by the severity of the stressor. Within each stressor, samples are typically arranged in order of increasing stress. With this setup, we can capture each stressor using regular expressions, which also lets us absorb inconsistencies in the labels (such as “diauxic” versus “Diauxic”).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">

</span><span class="n">experiment_labels</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="o">::</span><span class="n">tibble</span><span class="p">(</span><span class="n">sample</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">gasch_matrix</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">experiment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"hs\\-1"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Heat Shock (A) (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"hs\\-2"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Heat Shock (B) (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^37C to 25C"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Cold Shock (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^heat shock"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Heat Shock (severity)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^29C to 33C"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"29C to 33C (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^29C \\+1M sorbitol to 33C \\+ 1M sorbitol"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"29C + Sorbitol to 33C + Sorbitol (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^29C \\+1M sorbitol to 33C \\+ \\*NO sorbitol"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"29C + Sorbitol to 33C (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^constant 0.32 mM H2O2"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Hydrogen peroxide (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^1 ?mM Menadione"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Menadione (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^2.5mM DTT"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"DTT (A) (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^dtt"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"DTT (B) (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"diamide"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Diamide (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^1M sorbitol"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Sorbitol (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^Hypo-osmotic shock"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Hypo-Osmotic Shock (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^aa starv"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Amino Acid Starvation (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^Nitrogen Depletion"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Nitrogen Depletion (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^[Dd]iauxic [Ss]hift"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Diauxic Shift (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"ypd-2"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"YPD (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"ypd-1"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"YPD stationary phase (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"overexpression"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"TF Overexpression"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"car-1"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Carbon Sources (A)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"car-2"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Carbon Sources (B)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"ct-1"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Temperature Gradient"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"ct-2"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Temperature Gradient, Steady State"</span><span class="w">
    </span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">experiment</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">experiment_order</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">experiment_order</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">experiment</span><span class="p">),</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">experiment_order</span><span class="p">),</span><span class="w">
    </span><span class="n">experiment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">experiment</span><span class="p">),</span><span class="w"> </span><span class="s2">"Other"</span><span class="p">,</span><span class="w"> </span><span class="n">experiment</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">

</span><span class="n">experiment_labels</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">dplyr</span><span class="o">::</span><span class="n">sample_n</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">kableExtra</span><span class="o">::</span><span class="n">kable_styling</span><span class="p">(</span><span class="n">full_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre></figure>

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> sample </th>
   <th style="text-align:left;"> experiment </th>
   <th style="text-align:right;"> experiment_order </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 37C to 25C shock - 90 min </td>
   <td style="text-align:left;"> Cold Shock (duration) </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 29C +1M sorbitol to 33C + *NO sorbitol - 30 minutes </td>
   <td style="text-align:left;"> 29C + Sorbitol to 33C (duration) </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> heat shock 25 to 37, 20 minutes </td>
   <td style="text-align:left;"> Heat Shock (severity) </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> YPD stationary phase 5 d ypd-1 </td>
   <td style="text-align:left;"> YPD stationary phase (duration) </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> constant 0.32 mM H2O2 (80 min) redo </td>
   <td style="text-align:left;"> Hydrogen peroxide (duration) </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
</tbody>
</table>

<h2 id="formatting-for-romic">Formatting for romic</h2>

<p>Romic organizes genomic datasets as sets of measurement-, sample-, and feature-level variables. We’ve essentially created three tables capturing each of these aspects of our dataset already. Romic can bundle these together using a feature primary key shared between the features and measurements table (here, “UID”), and a sample primary key shared between the samples and measurements table (here, “sample”).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># tidy gasch measurements</span><span class="w">
</span><span class="n">tall_gasch</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gasch_2000</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">NAME</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">GWEIGHT</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">tidyr</span><span class="o">::</span><span class="n">gather</span><span class="p">(</span><span class="s2">"sample"</span><span class="p">,</span><span class="w"> </span><span class="s2">"expression"</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">UID</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">expression</span><span class="p">))</span><span class="w">
  
</span><span class="n">triple_omic</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">create_triple_omic</span><span class="p">(</span><span class="w">
  </span><span class="n">measurement_df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tall_gasch</span><span class="p">,</span><span class="w">
  </span><span class="n">feature_df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">feature_metadata</span><span class="p">,</span><span class="w">
  </span><span class="n">sample_df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">experiment_labels</span><span class="p">,</span><span class="w">
  </span><span class="n">feature_pk</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"UID"</span><span class="p">,</span><span class="w">
  </span><span class="n">sample_pk</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sample"</span><span class="w">
</span><span class="p">)</span></code></pre></figure>

<h1 id="plotting-at-the-tips-of-your-fingers">Plotting At the Tips of Your Fingers</h1>

<p>When it’s inefficient to explore a dataset, analyses will either be cursory or take longer than they should. While bespoke plots that explore specific aspects of a dataset are difficult to automate, the early stages of exploratory data analysis (EDA) should be. During EDA we hope to identify the major sources of variation in a dataset. Ideally this variation will reflect planned factors in our experimental design, but unexpected sources of variability also frequently need to be identified so they can be accounted for during modeling.</p>

<p>To support this early exploration, romic provides several specialized and general-purpose interactive Shiny apps built from composable Shiny modules. We’ll use two of these apps to demonstrate a general workflow where we’ll</p>

<ol>
  <li>Interactively visualize our dataset in Shiny</li>
  <li>Share the Shiny app using shinyapps.io (or RStudio Connect)</li>
  <li>Create a static visualization summarizing our findings.</li>
</ol>

<h2 id="principal-components-analysis">Principal Components Analysis</h2>

<p>To explore the major factors driving variation in a dataset it is a good idea to look at a low-dimensional representation of samples. Principal components analysis addresses this problem by sequentially capturing and then removing the most prominent one-dimensional pattern in the data. As an example, the Brauer 2008 experiment explored gene expression as yeast grew at different rates in different environments. When applying Singular Value Decomposition (SVD) (PCA is a special case of SVD), the most prominent pattern in samples (one principal component (PC), occasionally called an eigengene; a vector over samples) closely mirrored the growth rate, while the corresponding pattern across genes reflected how their expression changes with growth rate (one loading; a vector over genes) <a href="https://pubmed.ncbi.nlm.nih.gov/17959824/#&amp;gid=article-figures&amp;pid=figure-3-uid-2">see Brauer 2008 - Figure 3</a>. Having captured this pattern, it can be removed from the data, allowing the second most prominent pattern to be estimated, which can then be removed to estimate the third, and so forth. In PCA/SVD, each pattern is constructed to maximize the variation in the dataset that it explains, and this fraction of variance explained is often important for interpreting PCs. Romic currently does not expose this information (though it probably should).</p>
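While romic doesn’t surface the variance explained, it falls directly out of the decomposition. Here is a minimal base-R sketch on toy data (this is for intuition, not romic’s internal code):

```r
# Toy data: 20 genes x 10 samples; center each gene before decomposing
set.seed(42)
expr <- matrix(rnorm(200), nrow = 20)
centered <- expr - rowMeans(expr)

decomp <- svd(centered)

# each PC's share of the total variance comes from the squared singular values
var_explained <- decomp$d^2 / sum(decomp$d^2)

# sample-level coordinates on each PC (the "eigengene" patterns over samples)
pcs <- diag(decomp$d) %*% t(decomp$v)
```

The first few entries of `var_explained` tell you how seriously to take a PC1 x PC2 scatter plot: if they sum to a small fraction, most of the structure lives in later components.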

<p>If we have a simple design comparing a gene knockout to a wild type (functioning gene), we would hope that mutants look more similar to one another than to wild-type individuals, and vice versa. The differences between the mutant and wild type would manifest as a set of correlated expression changes that would largely be captured by the leading principal components. To visualize how sets of samples are separated in principal component space, it is common to create a scatter plot of PC1 x PC2 and then label each sample by the elements of the experimental design that are driving their separation in expression space.</p>

<p>One limitation of PCA is that subtle patterns in the data (later principal components) are ignored entirely when we only look at the leading principal components. Alternatives to PCA, such as t-SNE and UMAP, are increasingly popular for visualizing sample similarity because they can simultaneously capture all expression patterns driving sample similarity in a two-dimensional summary. As a result, samples may be placed apart even if they have similar values of PC1 and PC2. The downside of this more holistic view of sample similarity is that distances are difficult to interpret. Samples that are very close to each other are likely quite similar, while samples at a moderate distance could either be similar or totally different - see <a href="https://www.biorxiv.org/content/10.1101/2021.08.25.457696v1">The Specious Art of Single-Cell Genomics</a>.</p>

<h3 id="shiny-app">Shiny app</h3>

<p>SVD and PCA are fundamentally linear algebra techniques and therefore do not work if our dataset has missing values. (Optimization-based variants do exist but they are not implemented in romic.) If we filtered out all genes missing measurements in at least one sample of the Gasch2K dataset, we would drop more than 80% of our features. To avoid this outcome it is common to perform some form of missing value imputation on genomics datasets. Imputation should be avoided if possible and otherwise thoughtfully applied using a technique appropriate for the data modality you are working with. For microarrays, the standard approach is K-nearest neighbors (KNN) imputation. In KNN imputation, the K most similar neighbors of a gene with missing values are found using its non-missing measurements, and each missing value is imputed using the neighbors’ average expression.</p>
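For intuition, the gene-by-gene version of KNN imputation can be hand-rolled in a few lines. This is only a sketch - `knn_impute_gene()` is a hypothetical helper, not a romic function, and a production analysis would use romic’s `impute_missing_values()` or a dedicated imputation package:

```r
# Impute one gene's missing values from its K most similar genes.
# `expr` is a genes x samples matrix; `gene` is a row index.
knn_impute_gene <- function(expr, gene, k = 10) {
  target <- expr[gene, ]
  others <- expr[-gene, , drop = FALSE]
  # distance to every other gene over the samples both were measured in
  dists <- apply(others, 1, function(other) {
    shared <- !is.na(target) & !is.na(other)
    sqrt(mean((target[shared] - other[shared])^2))
  })
  # average the K nearest genes to fill in each missing sample
  neighbors <- others[order(dists)[seq_len(k)], , drop = FALSE]
  target[is.na(target)] <- colMeans(neighbors, na.rm = TRUE)[is.na(target)]
  target
}
```

The real Troyanskaya-style implementations add refinements (weighting neighbors by distance, handling genes with too few shared measurements), but the core averaging step is the one shown here.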

<p>Romic makes some decisions about how to proceed when an operation could not otherwise be applied to a dataset, but imputation must be performed explicitly; otherwise, romic would toss out all genes with missing values when estimating the PCs.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">imputed_triple</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">triple_omic</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># overwrite existing expression so that we don't have the</span><span class="w">
  </span><span class="c1"># raw expression changes which contains lots of missing values</span><span class="w">
  </span><span class="n">impute_missing_values</span><span class="p">(</span><span class="n">impute_var_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"expression"</span><span class="p">)</span></code></pre></figure>

<p>With our imputed dataset we can now easily create a local Shiny app where we can overlay different sample attributes on PCs 1-5.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">app_pcs</span><span class="p">(</span><span class="n">imputed_triple</span><span class="p">)</span></code></pre></figure>

<p>Running Shiny apps requires a live R session working under the hood, so it’s often quite challenging for other users (particularly non-technical ones) to set up the dependencies required to run an app. Luckily, RStudio has created a couple of nice frameworks where Shiny apps can be deployed to a remote server running R. This allows users to simply navigate to a URL to access results. My employer, Calico, uses the enterprise product RStudio Connect to host internal apps on our own Google Cloud Platform server. Here, I’ll demonstrate deployment to a similar service hosted by RStudio, shinyapps.io. Here is the end product: <a href="https://seanhacks.shinyapps.io/romic-PCs/">romic PCs</a>. (This app isn’t behaving very well on the free tier of shinyapps.io but it works fine locally and on Connect ¯\_(ツ)_/¯).</p>

<p>When deploying content to Connect or shinyapps.io, R has to understand how to run your app on the remote server. To do this it will either attempt to automatically identify package versions and where to obtain them (CRAN, Bioconductor, GitHub, RStudio Package Manager) or read these versions out of a file. I generally use renv for non-trivial deployments since it can manage Python environments as well. Beyond this, it’s nice to have all of the files we want to deploy to the server in a single directory. In most cases I store data on Google Cloud Storage or Cloud SQL to make it easy to access results from a remote server.</p>

<p>To deploy this app, I put the following code in an “app.R” file in a directory containing a .Rds file of “imputed_triple”. Then I ran the Shiny app with shiny::runApp() and hit the publish button in the top right of the RStudio pop-up. shinyapps.io is one of the options, and the deployment proceeded without any hiccups.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">shiny</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">romic</span><span class="p">)</span><span class="w">

</span><span class="n">tidy_omics</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"gasch2K.Rds"</span><span class="p">)</span><span class="w">
</span><span class="n">app_pcs</span><span class="p">(</span><span class="n">tidy_omics</span><span class="p">)</span></code></pre></figure>

<h3 id="static-pc-plot">Static PC Plot</h3>

<p>Having interactively explored the relationships between the PCs and our experimental design we may want to summarize our results using a static figure. Since romic’s apps call ggplot2-based plotting functions it is easy to recreate dynamically-generated plots. Of course, we could also just save our plot from the Shiny app’s interface.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">samples_with_PCs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">imputed_triple</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">add_pca_loadings</span><span class="p">(</span><span class="n">npcs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># if you aren't used to the {} syntax, it doesn't use the object you</span><span class="w">
  </span><span class="c1"># piped in as the first argument. The object is still accessible with "."</span><span class="w">
  </span><span class="p">{</span><span class="n">.</span><span class="o">$</span><span class="n">samples</span><span class="p">}</span><span class="w">

</span><span class="n">plot_bivariate</span><span class="p">(</span><span class="w">
  </span><span class="n">tomic_table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">samples_with_PCs</span><span class="p">,</span><span class="w">
  </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"PC1"</span><span class="p">,</span><span class="w">
  </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"PC2"</span><span class="p">,</span><span class="w">
  </span><span class="n">color_var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"experiment"</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="w">
    </span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Most stressors modulate a common set of genes"</span><span class="p">,</span><span class="w">  
    </span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gasch2K expression principal components labelled by experiment"</span><span class="w">
    </span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">guides</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">guide_legend</span><span class="p">(</span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bottom"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-romic/gasch_pcs-1.png" alt="plot of chunk gasch_pcs" /></p>

<p>Based on this analysis we can see that most experiments cluster together aside from the “YPD” timecourses. These represent starvation conditions where the yeast clearly react with added measures beyond the ESR. Overlaying “experiment_order” and lassoing points to see them in a table using the interactive app, we can also see that samples are roughly ordered from less severe to more severe within the non-YPD conditions.</p>

<h2 id="heatmaps">Heatmaps</h2>

<p>While PCA allows us to summarize latent features of our dataset, it is also helpful to view observation-level results in some format. This often involves plotting individual features, but for genomics data it is also common to visualize the complete dataset using a heatmap. Heatmaps are essentially a visualization of a matrix of expression values (such as expression mean-centered by gene) with genes rearranged so that covarying genes are near one another. Samples may also be organized by similarity but frequently are organized by the experimental design. To order features and/or samples, hierarchical clustering is applied to create a tree linking all genes through successive merges of similarly behaving clusters. The main parameters in hierarchical clustering are the distance measure, which defines how dissimilar each pair of genes is, and the agglomeration method, which affects whether many small clusters or a few large clusters are created. It is important to choose a distance measure appropriate for your problem (here, Euclidean distance), while I generally don’t focus on the clustering method (Ward.D2 is the default in romic). Both options are exposed whenever hierarchical clustering is performed in romic.</p>
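The row-ordering step behind a heatmap can be sketched in a few lines of base R (toy data here; romic performs the equivalent internally):

```r
# Toy data: 30 genes x 10 samples
set.seed(42)
expr <- matrix(rnorm(300), nrow = 30)

# pairwise gene dissimilarities, then a tree of successive merges
gene_dists <- dist(expr, method = "euclidean")
gene_tree <- hclust(gene_dists, method = "ward.D2")

# reorder rows so that similarly-behaving genes sit next to each other
ordered_expr <- expr[gene_tree$order, ]
```

Swapping `method = "euclidean"` for a correlation-based distance, or `"ward.D2"` for `"average"` or `"complete"`, is how the two parameters discussed above change the resulting heatmap.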

<h3 id="shiny-app-1">Shiny app</h3>

<p>Heatmaps can be quite slow to render, so to demo this functionality we’ll work with a subset of the Gasch2K dataset. To do this we’ll filter the samples table to the experiments exploring the relationship between heat and gene expression.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># filter to a few experiments for the demo</span><span class="w">
</span><span class="n">heatshock_triple_omic</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">imputed_triple</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter_tomic</span><span class="p">(</span><span class="w">
    </span><span class="n">filter_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"category"</span><span class="p">,</span><span class="w">
    </span><span class="n">filter_table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"samples"</span><span class="p">,</span><span class="w">
    </span><span class="n">filter_variable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"experiment"</span><span class="p">,</span><span class="w">
    </span><span class="n">filter_value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w">
      </span><span class="s2">"Heat Shock (A) (duration)"</span><span class="p">,</span><span class="w">
      </span><span class="s2">"Heat Shock (B) (duration)"</span><span class="p">,</span><span class="w">
      </span><span class="s2">"Heat Shock (severity)"</span><span class="p">,</span><span class="w">
      </span><span class="s2">"Temperature Gradient"</span><span class="w">
      </span><span class="p">))</span></code></pre></figure>

<p>Following the same deployment approach used above we can easily create a minimal Shiny app for helping us browse and explore heatmaps based on this dataset: <a href="https://seanhacks.shinyapps.io/romic-heatmap/">Shiny romic heatmap</a>. Since the app is ggplot2-based it is quite easy to add facets to organize samples.</p>

<h3 id="static-heatmaps">Static Heatmaps</h3>

<p>Once we find a nice view of our heatmap we can reproduce the results with a static visualization.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_heatmap</span><span class="p">(</span><span class="w">
  </span><span class="n">tomic</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">heatshock_triple_omic</span><span class="p">,</span><span class="w">
  </span><span class="n">cluster_dim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"rows"</span><span class="p">,</span><span class="w">
  </span><span class="n">change_threshold</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_grid</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">experiment</span><span class="p">,</span><span class="w"> </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_x"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-romic/static_heatmap-1.png" alt="plot of chunk static_heatmap" /></p>

<p>From this plot we can see when the yeast are most stressed by heat: they turn genes up or down in a graded fashion in response to both progressive and severe heat. Interestingly, the heat shock experiment stresses the yeast only transiently - by 80 minutes the stress has passed and the yeast have adapted to live with the elevated temperature. Yeast are tough.</p>

<h1 id="wrapping-up">Wrapping Up</h1>

<p>Romic is built around a core data structure (the T*Omic) that efficiently tracks data and metadata as a dataset is filtered, mutated and reorganized during analysis. To enable these manipulations, romic distinguishes feature-, sample- and measurement-level variables using a schema. An added benefit of this approach is that variables can be automatically mapped to feasible aesthetics during plotting. It wouldn’t make much sense to color by a measurement in a sample-level plot, nor to color a heatmap by a categorical variable. This property can be exploited with dynamic visualizations which map variables to feasible ggplot2 aesthetics.</p>]]></content><author><name>Sean Hackett</name></author><category term="R" /><category term="analysis" /><category term="software" /><summary type="html"><![CDATA[An R package for exploratory data analysis of high-dimensional datasets]]></summary></entry><entry><title type="html">Time zero normalization with the Multivariate Gaussian distribution</title><link href="https://www.shackett.org/time_zero_normalization/" rel="alternate" type="text/html" title="Time zero normalization with the Multivariate Gaussian distribution" /><published>2022-05-08T00:00:00+00:00</published><updated>2022-05-08T00:00:00+00:00</updated><id>https://www.shackett.org/time_zero_normalization</id><content type="html" xml:base="https://www.shackett.org/time_zero_normalization/"><![CDATA[<p>Timecourses are a powerful experimental design for evaluating the impact of a perturbation. These perturbations are usually chemical because chemicals, such as drugs, can be introduced quickly and with high temporal precision. However, with some technologies, such as the estradiol-driven promoters that I used in the induction dynamics expression atlas (<a href="https://idea.research.calicolabs.com">IDEA</a>), it is possible to rapidly perturb a single gene, further increasing specificity and broadening applicability.
By rapidly perturbing individuals, we can synchronize them based on the time when dosing began. We often call this point “time zero”, while all subsequent measurements correspond to the time post-perturbation. (Since time zero corresponds to the point when a perturbation is applied but has not yet impacted the system, this measurement is usually taken just before adding the perturbation.)</p>

<p>One of the benefits of collecting a time zero measurement is that it allows us to remove, or account for, effects that are shared among all time points. In many cases this may just amount to analyzing fold-changes of post-perturbation measurements with respect to their time zero observation, rather than the original measurements themselves. This can be useful if there is considerable variation among timecourses irrespective of the perturbation, such as if we were studying humans or mice. Similarly, undesirable variation due to day-to-day differences in instruments, sample stability, or any of the many other factors that could produce batch effects can sometimes be addressed by measuring each timecourse together and working with fold-changes. In either case, correcting for individual effects using pre-perturbation measurements will increase our power to detect perturbations’ effects.</p>
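In code, this normalization is just a grouped fold-change calculation. A sketch with dplyr on toy data (the tibble and its column names are illustrative, not from the IDEA):

```r
library(dplyr)

# toy data: two genes measured at three timepoints, time 0 pre-perturbation
timecourse_data <- tibble::tibble(
  gene = rep(c("g1", "g2"), each = 3),
  time = rep(c(0, 30, 60), times = 2),
  abundance = c(10, 20, 40, 8, 8, 2)
)

# express each measurement as a log2 fold-change versus its own time zero
time_zero_normalized <- timecourse_data %>%
  group_by(gene) %>%
  mutate(log2_fc = log2(abundance / abundance[time == 0])) %>%
  ungroup()
```

By construction every timecourse starts at a log2 fold-change of zero, so individual- or batch-level offsets shared across a timecourse cancel out.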

<p>Aside from correcting for unwanted variation, the kinetics of timecourses are a rich source of information which can be either a blessing or a curse. With temporal information, ephemeral responses can be observed. We can see both which features are changing and when they are changing. And, the ordering of events can point us towards causality. In practice, each of these goals can be difficult or impossible to achieve, leaving us with a nagging feeling that we’re leaving information on the table. There are many competing options for identifying differences in timecourses, few ways of summarizing dynamics intuitively, and causal inference is often out of reach. In this post, and others to follow, I’ll pick apart a few of these limitations, discussing developments that were applied to the IDEA but will likely be useful for others thinking about biological timeseries analysis (or other timeseries if you are so inclined!). Here, I evaluate a few established methods for identifying features which vary across time and then introduce an alternative approach, based on the Multivariate Gaussian distribution and Mahalanobis distance, which increases power and does not require any assumptions about responses’ kinetics.</p>

<!--more-->
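As a preview of that alternative approach, base R’s stats::mahalanobis() scores a gene’s vector of fold-changes against a covariance matrix describing the noise across timepoints; under the null of no signal, that squared distance follows a chi-square distribution. The covariance below is a stand-in (independent timepoints with a fixed sd), not an estimate from real data:

```r
set.seed(42)
n_timepts <- 8

# stand-in null covariance: independent timepoints, sd = 0.5 each
null_cov <- diag(0.5^2, n_timepts)

# one gene's fold-changes across timepoints, simulated under the null
obs <- rnorm(n_timepts, mean = 0, sd = 0.5)

# squared Mahalanobis distance and its chi-square p-value under the null
d2 <- mahalanobis(obs, center = rep(0, n_timepts), cov = null_cov)
p_value <- pchisq(d2, df = n_timepts, lower.tail = FALSE)
```

Because the score aggregates evidence across all timepoints at once, it needs no parametric model of the response shape - the point developed in the rest of the post.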

<h2 id="our-timecourse-experiment">Our timecourse experiment</h2>

<p>To evaluate methods for detecting temporal dynamics it’s helpful to use a dataset where there are clear-cut examples of timecourses with and without signal. With such a dataset in hand, we can easily detect signals that we are missing (false negatives), noise that we think is real (false positives), and evaluate overall recall (what fraction of signals we are detecting). We rarely have such positive and negative examples in real timecourses, so instead we can simulate timecourses with and without signal. Going forward I will also use genes as short-hand for whatever features we might be working with, since I’ve primarily worked with these methods in the context of gene expression data.</p>

<h3 id="environment-setup">Environment Setup</h3>

<p>First, I’m going to setup the R environment by loading some bread-and-butter packages and setting the global options for future and ggplot2.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># general use packages</span><span class="w">
</span><span class="n">suppressPackageStartupMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">))</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">

</span><span class="c1"># R package for simulating dynamics</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">impulse</span><span class="p">)</span><span class="w">

</span><span class="c1"># global options</span><span class="w">
</span><span class="c1"># setup parallelization</span><span class="w">
</span><span class="n">plan</span><span class="p">(</span><span class="s2">"multisession"</span><span class="p">,</span><span class="w"> </span><span class="n">workers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w">

</span><span class="c1"># ggplot default theme</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span></code></pre></figure>

<h3 id="simulate-timecourses-containing-signal">Simulate timecourses containing signal</h3>

<p>First, we can generate the subset of our timecourses which contain signal. These timecourses should follow a broad range of biologically-feasible patterns.</p>

<p>To construct such timecourses, we can use the phenomenological timecourse model of Chechik &amp; Koller, which represents timecourses as a pair of sigmoidal responses, called an impulse. We’ll also use a simpler, single-sigmoid version of the C &amp; K model.</p>

<p>To simulate data from these models, we can use the simulate_timecourses() function from the impulse R package, available on <a href="https://github.com/calico/impulse">GitHub</a>.</p>

<p>This function will draw a set of parameters for sigmoidal and impulse from appropriate distributions to define a simulated timecourse. We’ll then add independent normally distributed noise to each observation. (For most genomic data types, measurements are log-normal so we could think of these abundance units as already having been log-transformed).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timepts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="m">40</span><span class="p">,</span><span class="w"> </span><span class="m">60</span><span class="p">,</span><span class="w"> </span><span class="m">90</span><span class="p">)</span><span class="w"> </span><span class="c1"># time points measured</span><span class="w">
</span><span class="n">measurement_sd</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.5</span><span class="w"> </span><span class="c1"># standard deviation of Gaussian noise added to each observation</span><span class="w">
</span><span class="n">total_measurements</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">10000</span><span class="w"> </span><span class="c1"># total number of genes</span><span class="w">
</span><span class="n">signal_frac</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.2</span><span class="w"> </span><span class="c1"># what fraction of genes contain real signal</span><span class="w">

</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1234</span><span class="p">)</span><span class="w">

</span><span class="c1"># simulate timecourses containing signal </span><span class="w">

</span><span class="n">alt_timecourses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">impulse</span><span class="o">::</span><span class="n">simulate_timecourses</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">total_measurements</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">signal_frac</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
                                                 </span><span class="n">timepts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timepts</span><span class="p">,</span><span class="w">
                                                 </span><span class="n">prior_pars</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">v_sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w">
                                                                </span><span class="n">rate_shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
                                                                </span><span class="n">rate_scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.25</span><span class="p">,</span><span class="w">
                                                                </span><span class="n">time_shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
                                                                </span><span class="n">time_scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">),</span><span class="w">
                                                 </span><span class="n">measurement_sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">measurement_sd</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">unnest_legacy</span><span class="p">(</span><span class="n">measurements</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">true_model</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># drop timecourses where no true value's magnitude is greater than 1 (these</span><span class="w">
  </span><span class="c1"># aren't really signal-containing)</span><span class="w">
  </span><span class="c1"># and timecourses where the initial value isn't ~zero</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="nf">any</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">sim_fit</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
         </span><span class="nf">abs</span><span class="p">(</span><span class="n">sim_fit</span><span class="p">[</span><span class="n">time</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">])</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w">

</span><span class="c1"># only retain the target number of signal containing timecourses</span><span class="w">
</span><span class="n">alt_timecourses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">alt_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">semi_join</span><span class="p">(</span><span class="w">
    </span><span class="n">alt_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">distinct</span><span class="p">(</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">sample_n</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">total_measurements</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">signal_frac</span><span class="p">)),</span><span class="w">
    </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tc_id"</span><span class="p">)</span><span class="w">

</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">alt_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">timepts</span><span class="p">)))</span></code></pre></figure>

<table>
  <thead>
    <tr>
      <th style="text-align: right">tc_id</th>
      <th style="text-align: right">time</th>
      <th style="text-align: right">sim_fit</th>
      <th style="text-align: right">abundance</th>
      <th style="text-align: left">signal</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">-0.0976223</td>
      <td style="text-align: right">0.0957832</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">-0.2018300</td>
      <td style="text-align: right">-0.5742951</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">10</td>
      <td style="text-align: right">-0.3967604</td>
      <td style="text-align: right">-0.2166940</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">20</td>
      <td style="text-align: right">-1.1307509</td>
      <td style="text-align: right">-0.9666770</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">30</td>
      <td style="text-align: right">-1.8594480</td>
      <td style="text-align: right">-0.9893428</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">40</td>
      <td style="text-align: right">-2.1534230</td>
      <td style="text-align: right">-1.8150251</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">60</td>
      <td style="text-align: right">-2.2445179</td>
      <td style="text-align: right">-2.7889178</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">90</td>
      <td style="text-align: right">-2.2489452</td>
      <td style="text-align: right">-2.5141336</td>
      <td style="text-align: left">contains signal</td>
    </tr>
  </tbody>
</table>

<h3 id="simulate-timecourses-which-are-just-noise">Simulate timecourses which are just noise</h3>

<p>Timecourses which are just noise are easy to generate: we can simply draw independent values from a normal distribution (with the same standard deviation that we used to add noise to the signals).</p>

<p>With timecourses with and without signals in hand, we can combine the two sets together while tracking their origin.</p>

<p>Additionally, since we are interested in time-dependent changes relative to time zero, we can transform abundances into fold changes by subtracting the initial value of each timecourse from every measurement. We’ll work with both the native abundance scale and the time-zero-normalized fold changes going forward.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">null_timecourses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">crossing</span><span class="p">(</span><span class="n">tc_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">alt_timecourses</span><span class="o">$</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
                                         </span><span class="nf">max</span><span class="p">(</span><span class="n">alt_timecourses</span><span class="o">$</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">total_measurements</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="o">-</span><span class="n">signal_frac</span><span class="p">)),</span><span class="w">
                             </span><span class="n">time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timepts</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"no signal"</span><span class="p">,</span><span class="w">
         </span><span class="n">sim_fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
         </span><span class="n">abundance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">measurement_sd</span><span class="p">))</span><span class="w">

</span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">bind_rows</span><span class="p">(</span><span class="n">alt_timecourses</span><span class="p">,</span><span class="w"> </span><span class="n">null_timecourses</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"contains signal"</span><span class="p">,</span><span class="w"> </span><span class="s2">"no signal"</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">abundance</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">abundance</span><span class="p">[</span><span class="n">time</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">])</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span></code></pre></figure>

<h3 id="example-timecourses">Example timecourses</h3>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">example_tcs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">signal</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">sample_n</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">()))</span><span class="w">

</span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">example_tcs</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"signal"</span><span class="p">,</span><span class="w"> </span><span class="s2">"tc_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">label</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_path</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_fit</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">abundance</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_y"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Abundance"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_color_brewer</span><span class="p">(</span><span class="s2">"Example Timecourse"</span><span class="p">,</span><span class="w"> </span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set2"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"Simulated timecourses with and without signal"</span><span class="p">,</span><span class="w"> </span><span class="s2">"line: true values, points: observed values"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bottom"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/timecourse_examples-1.png" alt="plot of chunk timecourse_examples" /></p>

<h2 id="models-to-try">Models to try</h2>

<p>At this point, we want to fit a few flavors of time series models to each gene in order to determine how reliably each model can discriminate signal-containing from no-signal timecourses.</p>

<p>To make it easy to iterate over features, I like to use the nest() function from tidyr to store all the data for a feature in a single row. Here, expression data will be stored as a list of gene-level tables.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">nested_timecourses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">nest</span><span class="p">(</span><span class="n">timecourse_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">tc_id</span><span class="p">))</span><span class="w"> 

</span><span class="n">nested_timecourses</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 10,000 × 3
##    tc_id signal          timecourse_data 
##    &lt;int&gt; &lt;fct&gt;           &lt;list&gt;          
##  1     2 contains signal &lt;tibble [8 × 4]&gt;
##  2     9 contains signal &lt;tibble [8 × 4]&gt;
##  3    10 contains signal &lt;tibble [8 × 4]&gt;
##  4    12 contains signal &lt;tibble [8 × 4]&gt;
##  5    13 contains signal &lt;tibble [8 × 4]&gt;
##  6    14 contains signal &lt;tibble [8 × 4]&gt;
##  7    15 contains signal &lt;tibble [8 × 4]&gt;
##  8    16 contains signal &lt;tibble [8 × 4]&gt;
##  9    22 contains signal &lt;tibble [8 × 4]&gt;
## 10    23 contains signal &lt;tibble [8 × 4]&gt;
## # … with 9,990 more rows</code></pre></figure>

<p>Having nested one gene per row, we can apply multiple regression models to each gene; we’ll do this treating both the fold change and the original expression level as responses to evaluate the effect of time zero normalization. The regression models that we’ll try are:</p>

<ul>
  <li>linear effect of time on expression. Given our sigmoid and impulse generative process, we shouldn’t expect a linear relationship to work well, but it can serve as a nice baseline.</li>
  <li>cubic relationship between time and expression. This will fit a linear ($t$), a quadratic ($t^2$), and a cubic ($t^3$) term to allow for more complicated dynamics, such as genes that go up and then down again. One feature of cubic regression, and of polynomial regression models more generally, is that they are zero-centric. What I mean by this is that each additional term in a polynomial regression model, such as moving from a cubic to a quartic model, adds extra flexibility around zero. This can be helpful, but if changes occur at late timepoints, we may need a high-degree polynomial to capture them, and the cost will be a prediction which overfits the noise at earlier timepoints.</li>
  <li>predicting expression with a spline over timepoints using generalized additive models (GAMs). These models are similar to the cubic models but provide support evenly across time. This will allow them to detect late changes without requiring many degrees of freedom. GAMs are powerful models, but they can run into problems when changes occur rapidly, especially if we have relatively few timepoints.</li>
</ul>
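
<p>To illustrate what I mean by zero-centric, here’s a toy example (with hypothetical values, separate from the simulation above): a timecourse that only moves at the last two timepoints is poorly served by a cubic fit, whose flexibility is concentrated near time zero.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Toy example (hypothetical values): a step change at late timepoints
# fit with a cubic polynomial
late_change &lt;- tibble(
  time = c(0, 5, 10, 20, 30, 40, 60, 90),
  abundance = c(0, 0, 0, 0, 0, 0, 2, 2)
)

cubic_fit &lt;- lm(abundance ~ poly(time, degree = 3, raw = TRUE), data = late_change)

# the fitted values wiggle at early timepoints to accommodate the late jump
round(fitted(cubic_fit), 2)</code></pre></figure>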

<p>With these models, our main goal is to determine whether each timecourse contains dynamic signal, rather than to test the significance of individual parameters. Approaching the problem this way will also allow us to compare the different approaches even though they fit different numbers of parameters. To summarize each model’s account of the role of time, ANOVA can be used to determine how much variation is explained by time relative to the residual noise. Because cubic regression and GAMs fit more parameters, they must do a better job of explaining the temporal dynamics to justify their extra degrees of freedom.</p>
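
<p>As a minimal sketch of this null-versus-alternative comparison for a single timecourse, using the linear model as an example (tc_id 2 is just an illustrative id; the version applied to every gene appears further down):</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Sketch: null-versus-alternative ANOVA for one timecourse
one_tc &lt;- simulated_timecourses %&gt;% filter(tc_id == 2)

null_fit &lt;- lm(abundance ~ 1, data = one_tc)   # intercept-only null
alt_fit &lt;- lm(abundance ~ time, data = one_tc) # linear effect of time

# F-test: does time explain enough variance to justify its degree of freedom?
broom::tidy(anova(null_fit, alt_fit))</code></pre></figure>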

<p>There are many other approaches that we could try; a few of them are worth mentioning:</p>

<ul>
  <li>mgcv is an alternative implementation of GAMs to the gam package used below. It can decide how flexible a spline should be fit to each gene using cross-validation. For our synthetic dataset, mgcv actually fails for a number of features, owing to the complexity of our dynamics relative to the small number of timepoints.</li>
  <li>If we had replicates of timepoints, then we could fit a model which treats each timepoint as a categorical variable. This would allow us to detect dynamics without assuming particular patterns. The downside is that this would require us to collect twice as much data, or perhaps cut back on unique timepoints to provide repeated measures of the timepoints that we do have. In general, I think it’s better to have more unique timepoints represented, even without repeated measures, since this provides more even coverage of measurements over the time period we care about.</li>
  <li>Since time points are not evenly spaced, we could have tried transforming time when fitting the above models. While the timepoints are “exponentially” sampled, taking log(time) would send time zero to -Inf, so a better transformation would be to use the square root of time as the independent variable.</li>
</ul>
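
<p>As a sketch, the square-root transformation would only change the right-hand side of the model formulas, e.g. for the cubic model (again using tc_id 2 as an illustrative timecourse):</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Sketch: cubic regression on sqrt(time) rather than time; unlike log(time),
# sqrt(0) is finite, so the time zero observation is retained
sqrt_cubic_fit &lt;- lm(
  abundance ~ poly(sqrt(time), degree = 3, raw = TRUE),
  data = simulated_timecourses %&gt;% filter(tc_id == 2)
)</code></pre></figure>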

<p>A couple of notes:</p>

<ul>
  <li>Since the time zero fold change must be zero by definition, I applied this as a constraint (this is the “+ 0” in the formulas below).</li>
  <li><em>future</em> was used to parallelize over genes; its settings were set up in the “environment setup” section.</li>
</ul>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">broom</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">furrr</span><span class="p">)</span><span class="w">
</span><span class="n">suppressPackageStartupMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">gam</span><span class="p">))</span><span class="w">

</span><span class="n">fit_regression</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="w"> </span><span class="p">(</span><span class="n">one_tc</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">,</span><span class="w"> </span><span class="n">model_formula</span><span class="p">,</span><span class="w"> </span><span class="n">null_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">all.vars</span><span class="p">(</span><span class="n">model_formula</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"fold_change"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">one_tc</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">one_tc</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">filter</span><span class="p">(</span><span class="n">time</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  
  </span><span class="n">alt_fit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">model_fxn</span><span class="p">,</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">one_tc</span><span class="p">,</span><span class="w"> </span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model_formula</span><span class="p">))</span><span class="w">
  
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">model_fxn</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">null_fit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">model_fxn</span><span class="p">,</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">one_tc</span><span class="p">,</span><span class="w"> </span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">null_formula</span><span class="p">))</span><span class="w">
    </span><span class="n">model_anova</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">anova</span><span class="p">(</span><span class="n">null_fit</span><span class="p">,</span><span class="w"> </span><span class="n">alt_fit</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">model_anova</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">alt_fit</span><span class="w">
  </span><span class="p">}</span><span class="w">
  
  </span><span class="n">model_anova</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">broom</span><span class="o">::</span><span class="n">tidy</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">statistic</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">standard_models</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">nested_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">linear_abundance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">,</span><span class="w">
                                       </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">abundance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">time</span><span class="p">),</span><span class="w">
                                       </span><span class="n">null_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">abundance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">)),</span><span class="w">
         </span><span class="n">linear_foldchange</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">,</span><span class="w">
                                        </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
                                        </span><span class="n">null_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="p">)),</span><span class="w">
         </span><span class="n">cubic_abundance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">,</span><span class="w">
                                      </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">abundance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">degree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">raw</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)),</span><span class="w">
                                      </span><span class="n">null_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">abundance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">)),</span><span class="w">
         </span><span class="n">cubic_foldchange</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">,</span><span class="w">
                                       </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">degree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">raw</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
                                       </span><span class="n">null_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="p">)),</span><span class="w">
         </span><span class="n">gam_abundance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">,</span><span class="w">
                                    </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">abundance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">time</span><span class="p">))),</span><span class="w">
         </span><span class="n">gam_foldchange</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">,</span><span class="w">
                                     </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">time</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">)))</span></code></pre></figure>

<p>Each model × gene combination can be summarized by a single p-value. We expect the signal-containing timecourses to have relatively low p-values, while the no-signal timecourses’ p-values should be uniformly distributed between 0 and 1.</p>
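<p>That expectation is worth a quick sanity check. Here is a minimal standalone sketch (not part of the simulation above) fitting regressions to pure noise:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(42)

# fit a line to pure noise and keep the slope's p-value
null_pvalue &lt;- function(n_timepoints = 10) {
  noise &lt;- data.frame(time = seq_len(n_timepoints),
                      abundance = rnorm(n_timepoints))
  summary(lm(abundance ~ time, data = noise))$coefficients["time", "Pr(&gt;|t|)"]
}

null_pvalues &lt;- purrr::map_dbl(1:1000, ~ null_pvalue())
hist(null_pvalues, breaks = 25) # approximately flat between 0 and 1</code></pre></figure>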

<p>To correct for multiple tests, we can use the Storey q-value approach to control the false discovery rate (FDR). We will estimate q-values separately for each model and call results with q-values below 0.1 significant. At this cutoff we expect roughly 1/10 of the discoveries to come from the no-signal group.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fdr_control</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">pvalues</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">qvalue</span><span class="o">::</span><span class="n">qvalue</span><span class="p">(</span><span class="n">pvalues</span><span class="p">)</span><span class="o">$</span><span class="n">qvalues</span><span class="w"> 
</span><span class="p">}</span><span class="w">

</span><span class="n">all_model_fits</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">standard_models</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">timecourse_data</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">gather</span><span class="p">(</span><span class="n">model_type</span><span class="p">,</span><span class="w"> </span><span class="n">model_data</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">signal</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">unnest</span><span class="p">(</span><span class="n">model_data</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">model_type</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">qvalue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdr_control</span><span class="p">(</span><span class="n">p.value</span><span class="p">),</span><span class="w">
         </span><span class="n">discovery</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">qvalue</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="s2">"positive"</span><span class="p">,</span><span class="w"> </span><span class="s2">"negative"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">separate</span><span class="p">(</span><span class="n">model_type</span><span class="p">,</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"model"</span><span class="p">,</span><span class="w"> </span><span class="s2">"response"</span><span class="p">))</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">all_model_fits</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p.value</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">signal</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_grid</span><span class="p">(</span><span class="n">model</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">response</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">25</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/fdr_control-1.png" alt="plot of chunk fdr_control" /></p>

<p>Visually, a large number of no-signal timecourses have small p-values in the fold-change data. This suggests that something pathological is going on.</p>

<p>We can also summarize each model by the FDR it actually realized (we were aiming for 0.1), and by its recall of signal-containing timecourses at that cutoff.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">all_model_fits</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="n">discovery</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">correct</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"true negative"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"false positive"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"false negative"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"true positive"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="n">correct</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">correct</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">response</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">fdr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`false positive`</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">`false positive`</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">`true positive`</span><span class="p">),</span><span class="w">
         </span><span class="n">recall</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`true positive`</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">`false negative`</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">`true positive`</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span></code></pre></figure>

<table>
  <thead>
    <tr>
      <th style="text-align: left">model</th>
      <th style="text-align: left">response</th>
      <th style="text-align: right">false negative</th>
      <th style="text-align: right">false positive</th>
      <th style="text-align: right">true negative</th>
      <th style="text-align: right">true positive</th>
      <th style="text-align: right">fdr</th>
      <th style="text-align: right">recall</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">cubic</td>
      <td style="text-align: left">abundance</td>
      <td style="text-align: right">1912</td>
      <td style="text-align: right">10</td>
      <td style="text-align: right">7990</td>
      <td style="text-align: right">88</td>
      <td style="text-align: right">0.1020408</td>
      <td style="text-align: right">0.0440</td>
    </tr>
    <tr>
      <td style="text-align: left">gam</td>
      <td style="text-align: left">abundance</td>
      <td style="text-align: right">1118</td>
      <td style="text-align: right">128</td>
      <td style="text-align: right">7872</td>
      <td style="text-align: right">882</td>
      <td style="text-align: right">0.1267327</td>
      <td style="text-align: right">0.4410</td>
    </tr>
    <tr>
      <td style="text-align: left">linear</td>
      <td style="text-align: left">abundance</td>
      <td style="text-align: right">1532</td>
      <td style="text-align: right">55</td>
      <td style="text-align: right">7945</td>
      <td style="text-align: right">468</td>
      <td style="text-align: right">0.1051625</td>
      <td style="text-align: right">0.2340</td>
    </tr>
    <tr>
      <td style="text-align: left">cubic</td>
      <td style="text-align: left">foldchange</td>
      <td style="text-align: right">280</td>
      <td style="text-align: right">2928</td>
      <td style="text-align: right">5072</td>
      <td style="text-align: right">1720</td>
      <td style="text-align: right">0.6299484</td>
      <td style="text-align: right">0.8600</td>
    </tr>
    <tr>
      <td style="text-align: left">gam</td>
      <td style="text-align: left">foldchange</td>
      <td style="text-align: right">276</td>
      <td style="text-align: right">4151</td>
      <td style="text-align: right">3849</td>
      <td style="text-align: right">1724</td>
      <td style="text-align: right">0.7065532</td>
      <td style="text-align: right">0.8620</td>
    </tr>
    <tr>
      <td style="text-align: left">linear</td>
      <td style="text-align: left">foldchange</td>
      <td style="text-align: right">377</td>
      <td style="text-align: right">3463</td>
      <td style="text-align: right">4537</td>
      <td style="text-align: right">1623</td>
      <td style="text-align: right">0.6808887</td>
      <td style="text-align: right">0.8115</td>
    </tr>
  </tbody>
</table>

<p>In this summary we can see that working with abundances does accurately control the FDR, but recall is low for linear and cubic regression and only moderate for GAMs. Working with fold changes, in contrast, fails to control the FDR: while we intended for 1/10 of our discoveries to be null, around 60% actually are! Recall is fairly high, but most of the reported kinetic responses will be garbage.</p>
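<p>The fdr and recall columns are simple ratios of the confusion-matrix counts. For example, for the cubic fold-change model:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># counts from the cubic / foldchange row of the table above
false_positive &lt;- 2928
true_positive  &lt;- 1720
false_negative &lt;- 280

false_positive / (false_positive + true_positive) # realized FDR: ~0.63
true_positive / (true_positive + false_negative)  # recall: 0.86</code></pre></figure>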

<p>To figure out what is going wrong, we can plot examples of false positives (the model thinks there is signal when there isn’t) and false negatives (the model doesn’t detect signal that is really there).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">extreme_false_negatives</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">all_model_fits</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">response</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"abundance"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cubic"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">qvalue</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">()))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
   </span><span class="n">mutate</span><span class="p">(</span><span class="n">facet_label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="o">::</span><span class="n">glue</span><span class="p">(</span><span class="s2">"{response} {model} model false negatives"</span><span class="p">))</span><span class="w">

</span><span class="n">extreme_false_positives</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">all_model_fits</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">response</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"foldchange"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cubic"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">sample_n</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">()))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">facet_label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="o">::</span><span class="n">glue</span><span class="p">(</span><span class="s2">"{response} {model} model false positives"</span><span class="p">))</span><span class="w">

</span><span class="n">select_misclassifications</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">bind_rows</span><span class="p">(</span><span class="n">extreme_false_negatives</span><span class="p">,</span><span class="w"> 
                                       </span><span class="n">extreme_false_positives</span><span class="p">)</span><span class="w">

</span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">select_misclassifications</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tc_id"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">label</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_path</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_fit</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">abundance</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_wrap</span><span class="p">(</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">facet_label</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_y"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Abundance"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_color_brewer</span><span class="p">(</span><span class="s2">"Example Timecourse"</span><span class="p">,</span><span class="w"> </span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set2"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"Timecourses missed by GAM"</span><span class="p">,</span><span class="w"> </span><span class="s2">"line: true values, points: observed values"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bottom"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/unnamed-chunk-2-1.png" alt="plot of chunk unnamed-chunk-2" /></p>

<p>From this we can say a few things:</p>

<ul>
  <li>
    <p>When working with abundances we need to include an intercept term so the average value of a feature can be separated from its change over time. Doing this, however, can mask some early responses, since the fitted intercept becomes the point of reference rather than the value at time zero.</p>
  </li>
  <li>
    <p>Changes that show up primarily in one or two timepoints may be missed, since the polynomial and GAM models used above cannot contort themselves to fit these dynamics. This would be appropriate if those points were just noise, but in many cases they are large changes beyond what we would expect from the noise in our generative process.</p>
  </li>
  <li>
    <p>Working with fold changes enforces the value at time zero as the reference. This makes conceptual sense for a perturbation timecourse: at time zero (and before), the system is in a reference state, and all subsequent timepoints capture the dynamics of interest. However, working directly with fold changes creates a problem: we are no longer controlling the FDR. In this simulation, aiming for discoveries at a 10% FDR would in fact realize a 60% FDR. This is a big problem. Outside of a simulation we don’t know which timecourses contain real signal and which are spurious (if we did, why run the experiment?), so we would think there were strong signals in our dataset when it may in fact be entirely noise.</p>
  </li>
</ul>

<p>If we wanted to get around these issues, then we could still probably make a regression model work, but it would require adding more samples and cost to the analysis. The two main paths we could take are:</p>

<ul>
  <li>
    <p><em>replicates of each timecourse</em> - if we had multiple biological replicates at each timepoint, then rather than treating time as a numerical variable, we could treat it as a categorical variable. In this case we could fit an ANOVA model assessing whether the variation between timepoints is greater than the variation within timepoints. This would be a powerful way of detecting differences across time that is agnostic to the type of change occurring.</p>
  </li>
  <li>
    <p><em>denser sampling</em> - if we had more measurements near timepoints where rapid dynamics were occurring, then it would be easier to distinguish smooth, rapid responses from single outlier observations. This would still require fitting a model that can appropriately capture such dynamics, but with more observations we could either fit a more flexible model (with more degrees of freedom) or use the same simple models with more power to detect significant changes.</p>
  </li>
</ul>
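
<p>As a sketch of the first option: with hypothetical replicate data (the <code>replicated_tc</code> data frame below is invented for illustration), we could treat time as categorical and fit a one-way ANOVA in base R:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(42)
# hypothetical data: four biological replicates at each of four timepoints
replicated_tc &lt;- data.frame(
  time = rep(c(0, 5, 10, 20), each = 4),
  abundance = rnorm(16, mean = rep(c(0, 1, 1.5, 0.5), each = 4), sd = 0.5)
)

# with time as a categorical variable, ANOVA compares between-timepoint
# variation to within-timepoint variation
anova_fit &lt;- aov(abundance ~ factor(time), data = replicated_tc)
summary(anova_fit)</code></pre></figure>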

<p>In most cases, I think adopting one of these options is the smart way to go. However, there are situations where collecting more samples in a given experiment is not feasible, such as when MANY similar experiments are being performed, as in IDEA, or when there are constraints on how frequently samples can be collected.</p>

<p>In such cases, I think we can find a path forward by stepping away from regression and thinking about likelihood-based methods which capture the nature of fold changes.</p>

<h2 id="timecourse-fold-change-likelihood">Timecourse fold change likelihood</h2>

<p>Using likelihood methods, we start with a statistical model for how our observations were generated. We can then optimize the model’s parameters to find the values that maximize the likelihood (i.e., the frequentist MLE) or sample a distribution of parameters incorporating both the likelihood and parameter plausibility (i.e., the Bayesian approach).</p>
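
<p>As a toy example (unrelated to the timecourse data), the frequentist MLE of a Normal’s mean and standard deviation can be found by numerically minimizing the negative log-likelihood:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1234)
obs &lt;- rnorm(500, mean = 2, sd = 0.5)

# negative log-likelihood of iid Normal observations; the sd is
# parameterized on the log scale so that it stays positive
neg_log_lik &lt;- function(par) {
  -sum(dnorm(obs, mean = par[1], sd = exp(par[2]), log = TRUE))
}

mle &lt;- optim(c(0, 0), neg_log_lik)
mle$par[1]       # MLE of the mean, close to mean(obs)
exp(mle$par[2])  # MLE of the sd, close to sd(obs)</code></pre></figure>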

<p>Before we posit an appropriate likelihood for fold changes, let’s figure out why the regression approaches using fold changes were so anticonservative. The big problem was that many timecourses that were just noise looked like they actually contained signal. So, let’s work with just the “no signal” timecourses.</p>

<p>To do this, we can look at how the value at time zero influences fold-change estimates of later timepoints.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timecourse_spread</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">tzero_value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">abundance</span><span class="p">[</span><span class="n">time</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">])</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">tzero_value</span><span class="p">,</span><span class="w"> </span><span class="n">fold_change</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">fold_change</span><span class="p">)</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">timecourse_spread</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`5`</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`10`</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tzero_value</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_color_gradient2</span><span class="p">(</span><span class="s1">'Time zero value'</span><span class="p">,</span><span class="w"> </span><span class="n">low</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"GREEN"</span><span class="p">,</span><span class="w"> </span><span class="n">high</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"RED"</span><span class="p">,</span><span class="w"> </span><span class="n">mid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"BLACK"</span><span class="p">,</span><span class="w"> </span><span class="n">midpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-2</span><span class="o">:</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">limits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="s2">"Fold change at 5 minutes"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Fold change at 10 minutes"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_minimal</span><span class="p">()</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/corr_vs_tzero-1.png" alt="plot of chunk corr_vs_tzero" /></p>

<p>From this plot, we can see that a large positive value at time zero (due to noise) results in later timepoints appearing as consistently negative fold changes. Conversely, if the time-zero value is negative, then later timepoints appear consistently positive.</p>

<p>While we simulated abundances as independent Normal draws, which would possess a spherical covariance structure, normalizing to time zero has induced a correlation between observations, and this dependence must be accounted for to form a useful null hypothesis.</p>

<p>Because all subsequent timepoints are normalized with respect to the value at time zero, these later timepoints are biased: they are all higher or lower than otherwise expected. To test for timecourse-level signal based on the aggregate signal of all observations, we need to account for this dependence among observations. Luckily, its form is quite straightforward.</p>

<p>We can see this dependence using the sample covariance matrix of our null timecourses.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">cov</span><span class="p">(</span><span class="n">timecourse_spread</span><span class="p">[,</span><span class="m">3</span><span class="o">:</span><span class="n">ncol</span><span class="p">(</span><span class="n">timecourse_spread</span><span class="p">)])</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span></code></pre></figure>

<table>
  <thead>
    <tr>
      <th style="text-align: left"> </th>
      <th style="text-align: right">5</th>
      <th style="text-align: right">10</th>
      <th style="text-align: right">20</th>
      <th style="text-align: right">30</th>
      <th style="text-align: right">40</th>
      <th style="text-align: right">60</th>
      <th style="text-align: right">90</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">5</td>
      <td style="text-align: right">0.5141863</td>
      <td style="text-align: right">0.2514794</td>
      <td style="text-align: right">0.2583391</td>
      <td style="text-align: right">0.2533660</td>
      <td style="text-align: right">0.2557825</td>
      <td style="text-align: right">0.2539230</td>
      <td style="text-align: right">0.2559496</td>
    </tr>
    <tr>
      <td style="text-align: left">10</td>
      <td style="text-align: right">0.2514794</td>
      <td style="text-align: right">0.5076230</td>
      <td style="text-align: right">0.2605876</td>
      <td style="text-align: right">0.2536349</td>
      <td style="text-align: right">0.2504740</td>
      <td style="text-align: right">0.2512202</td>
      <td style="text-align: right">0.2540698</td>
    </tr>
    <tr>
      <td style="text-align: left">20</td>
      <td style="text-align: right">0.2583391</td>
      <td style="text-align: right">0.2605876</td>
      <td style="text-align: right">0.5131240</td>
      <td style="text-align: right">0.2505837</td>
      <td style="text-align: right">0.2562055</td>
      <td style="text-align: right">0.2559705</td>
      <td style="text-align: right">0.2547911</td>
    </tr>
    <tr>
      <td style="text-align: left">30</td>
      <td style="text-align: right">0.2533660</td>
      <td style="text-align: right">0.2536349</td>
      <td style="text-align: right">0.2505837</td>
      <td style="text-align: right">0.4973020</td>
      <td style="text-align: right">0.2534935</td>
      <td style="text-align: right">0.2517957</td>
      <td style="text-align: right">0.2526950</td>
    </tr>
    <tr>
      <td style="text-align: left">40</td>
      <td style="text-align: right">0.2557825</td>
      <td style="text-align: right">0.2504740</td>
      <td style="text-align: right">0.2562055</td>
      <td style="text-align: right">0.2534935</td>
      <td style="text-align: right">0.4982398</td>
      <td style="text-align: right">0.2498487</td>
      <td style="text-align: right">0.2531005</td>
    </tr>
    <tr>
      <td style="text-align: left">60</td>
      <td style="text-align: right">0.2539230</td>
      <td style="text-align: right">0.2512202</td>
      <td style="text-align: right">0.2559705</td>
      <td style="text-align: right">0.2517957</td>
      <td style="text-align: right">0.2498487</td>
      <td style="text-align: right">0.4986771</td>
      <td style="text-align: right">0.2508796</td>
    </tr>
    <tr>
      <td style="text-align: left">90</td>
      <td style="text-align: right">0.2559496</td>
      <td style="text-align: right">0.2540698</td>
      <td style="text-align: right">0.2547911</td>
      <td style="text-align: right">0.2526950</td>
      <td style="text-align: right">0.2531005</td>
      <td style="text-align: right">0.2508796</td>
      <td style="text-align: right">0.5066272</td>
    </tr>
  </tbody>
</table>

<p>Observation variances are approximately $2\text{Var}(x_t)$ ($2 \times 0.25 = 0.5$) because:</p>

\[\mathcal{N}(\mu_{A}, \sigma^{2}_{A}) - \mathcal{N}(\mu_{B}, \sigma^{2}_{B}) = \mathcal{N}(\mu_{A} - \mu_{B}, \sigma^{2}_{A} + \sigma^{2}_{B})\]

<p>Observation covariances are approximately $\text{Var}(x_t)$ ($\approx 0.25$) because the shared normalization to time zero adds the variance of the time-zero measurement as a covariance between all later timepoints.</p>

<p>Normalization of Normal (or log-Normal) observations to a common reference produces a Multivariate Gaussian distribution.</p>

\[\mathbf{f}_{i} \sim \mathcal{MN}\left(\mu = \mathbf{0}, \Sigma = 
\begin{bmatrix}
    2\sigma_{\epsilon}^2 &amp; \sigma_{\epsilon}^2 &amp; \dots  &amp; \sigma_{\epsilon}^2 \\
    \sigma_{\epsilon}^2 &amp; 2\sigma_{\epsilon}^2 &amp; \dots  &amp; \sigma_{\epsilon}^2 \\
    \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
    \sigma_{\epsilon}^2 &amp; \sigma_{\epsilon}^2 &amp; \dots  &amp; 2\sigma_{\epsilon}^2
\end{bmatrix}\right)\]
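
<p>We can sanity-check this covariance structure by simulation. This sketch assumes the same generative process as above, with Normal noise of standard deviation $\sigma_{\epsilon} = 0.5$ (so $\sigma_{\epsilon}^{2} = 0.25$):</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(123)
sigma_eps &lt;- 0.5
n_tc &lt;- 10000
n_times &lt;- 8  # time zero plus seven later timepoints

# independent Normal noise for each timecourse, then normalize to time zero
x &lt;- matrix(rnorm(n_tc * n_times, sd = sigma_eps), nrow = n_tc)
fold_changes &lt;- x[, -1] - x[, 1]

emp_cov &lt;- cov(fold_changes)
# diagonals approach 2 * sigma_eps^2 = 0.5; off-diagonals approach
# sigma_eps^2 = 0.25, matching the matrix above
round(emp_cov[1:3, 1:3], 2)</code></pre></figure>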

<p>If we expect null fold-change measurements to follow this distribution, then we can sample from it to explore whether it works as a generative process for fold changes, and then assess whether these draws are likely to have come from the distribution under the null hypothesis. For this purpose, we’ll use the Mahalanobis distance, a multivariate generalization of the Wald test that assesses how many standard deviations an observation lies from the mean of the distribution. This requires an estimate of the covariance matrix, an assumption that will be discussed below. Under the null, these statistics should be $\chi^{2}$ distributed with degrees of freedom equal to the number of fold changes (one fewer than the number of timepoints).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timecourse_covariance</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">measurement_sd</span><span class="o">^</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">timepts</span><span class="p">)</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">timepts</span><span class="p">)</span><span class="m">-1</span><span class="p">)</span><span class="w">
</span><span class="n">diag</span><span class="p">(</span><span class="n">timecourse_covariance</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">measurement_sd</span><span class="o">^</span><span class="m">2</span><span class="w">

</span><span class="n">n_fold_changes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">timecourse_covariance</span><span class="p">)</span><span class="w">

</span><span class="c1"># simulate draws from multivariate normal</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">mvtnorm</span><span class="p">)</span><span class="w">
</span><span class="n">r_multivariate_normal</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rmvt</span><span class="p">(</span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timecourse_covariance</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">

</span><span class="n">r_multivariate_mahalanobis_dist</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mahalanobis</span><span class="p">(</span><span class="n">r_multivariate_normal</span><span class="p">,</span><span class="w">
                                               </span><span class="n">center</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">),</span><span class="w">
                                               </span><span class="n">cov</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timecourse_covariance</span><span class="p">,</span><span class="w">
                                               </span><span class="n">inverted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="c1"># test multivariate normality</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="n">pchisq</span><span class="p">(</span><span class="n">r_multivariate_mahalanobis_dist</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"p-values for MN fold-change generative process"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/null_mahalanobis-1.png" alt="plot of chunk null_mahalanobis" /></p>

<p>Having taken draws from the Multivariate Gaussian distribution and used the Mahalanobis distance to calculate p-values, we see that these p-values are $\text{Unif}(0,1)$ distributed, confirming that the Mahalanobis distance is an appropriate test statistic.</p>
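
<p>Beyond eyeballing the histogram, uniformity can be checked formally with a Kolmogorov–Smirnov test. This self-contained sketch redraws from the same null (again assuming $\sigma_{\epsilon}^{2} = 0.25$ and seven fold changes) using only base R:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(42)
sigma2 &lt;- 0.25
Sigma &lt;- matrix(sigma2, nrow = 7, ncol = 7)
diag(Sigma) &lt;- 2 * sigma2

# multivariate Normal draws via the Cholesky factor of Sigma
z &lt;- matrix(rnorm(5000 * 7), ncol = 7)
draws &lt;- z %*% chol(Sigma)

d2 &lt;- mahalanobis(draws, center = rep(0, 7), cov = Sigma)
null_pvalues &lt;- pchisq(d2, df = 7, lower.tail = FALSE)
ks.test(null_pvalues, "punif")  # a large p-value is consistent with Unif(0,1)</code></pre></figure>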

<p>We can now test whether fold-changes are really Multivariate Gaussian distributed by inspecting the distribution of Mahalanobis distance p-values from the no-signal timecourses.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># test timecourse samples for multivariate normality</span><span class="w">
</span><span class="n">time_course_mahalanobis_dist</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mahalanobis</span><span class="p">(</span><span class="n">timecourse_spread</span><span class="p">[,</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">)],</span><span class="w"> </span><span class="n">center</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">),</span><span class="w"> </span><span class="n">cov</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timecourse_covariance</span><span class="p">)</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="n">pchisq</span><span class="p">(</span><span class="n">time_course_mahalanobis_dist</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"p-values for no-signal timecourses using MN fold-change model"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/unnamed-chunk-4-1.png" alt="plot of chunk unnamed-chunk-4" /></p>

<p>The p-values for the no-signal fold change timecourses are indeed $\text{Unif}(0,1)$ distributed as we hoped.</p>

<p>Now, we can calculate the Mahalanobis distances and their corresponding p-values for the signal-containing timecourses as well. Signal in these timecourses will both increase the overall variance in a feature’s expression and induce similar deviations at nearby timepoints. These factors will make it harder for the Multivariate Gaussian noise model to explain a signal-containing expression vector, resulting in a large Mahalanobis distance and a small p-value.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timecourse_mvn</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">tzero_value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">abundance</span><span class="p">[</span><span class="n">time</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">])</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">tzero_value</span><span class="p">,</span><span class="w"> </span><span class="n">fold_change</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">fold_change</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">mahalanobis_dist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mahalanobis</span><span class="p">(</span><span class="n">.</span><span class="p">[,</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">)],</span><span class="w">
                                        </span><span class="n">center</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">),</span><span class="w">
                                        </span><span class="n">cov</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timecourse_covariance</span><span class="p">),</span><span class="w">
         </span><span class="n">pvalue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pchisq</span><span class="p">(</span><span class="n">mahalanobis_dist</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w">
         </span><span class="n">qvalue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdr_control</span><span class="p">(</span><span class="n">pvalue</span><span class="p">),</span><span class="w">
         </span><span class="n">discovery</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">qvalue</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="s2">"positive"</span><span class="p">,</span><span class="w"> </span><span class="s2">"negative"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
              </span><span class="n">distinct</span><span class="p">(</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="n">signal</span><span class="p">),</span><span class="w">
            </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tc_id"</span><span class="p">)</span><span class="w">

</span><span class="n">timecourse_mvn</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pvalue</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">signal</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/signal_mahalanobis-1.png" alt="plot of chunk signal_mahalanobis" /></p>

<p>Based on the p-value distributions, most of the signal-containing timecourses have small p-values, suggesting improved recall. We can verify this, as before, by summarizing results in terms of the realized FDR and the overall recall of signal-containing timecourses at this FDR cutoff.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timecourse_mvn</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">discovery</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">correct</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"true negative"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"false positive"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"false negative"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"true positive"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">correct</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">correct</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">fdr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`false positive`</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">`false positive`</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">`true positive`</span><span class="p">),</span><span class="w">
         </span><span class="n">recall</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`true positive`</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">`false negative`</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">`true positive`</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span></code></pre></figure>

<table>
  <thead>
    <tr>
      <th style="text-align: right">false negative</th>
      <th style="text-align: right">false positive</th>
      <th style="text-align: right">true negative</th>
      <th style="text-align: right">true positive</th>
      <th style="text-align: right">fdr</th>
      <th style="text-align: right">recall</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">201</td>
      <td style="text-align: right">194</td>
      <td style="text-align: right">7806</td>
      <td style="text-align: right">1799</td>
      <td style="text-align: right">0.0973407</td>
      <td style="text-align: right">0.8995</td>
    </tr>
  </tbody>
</table>

<p>The Multivariate Gaussian test did a great job of identifying signal-containing timecourses. The high power of this test comes from using estimates of the noise level of individual observations, which lets us step away from assumptions regarding the types of time-dependent signals we expect.</p>
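<p>As a minimal sketch of the idea (the noise level and timecourse here are made-up values, not the simulation used above): time zero normalization induces a known covariance among the normalized values, and under a pure-noise null the squared Mahalanobis distance follows a chi-squared distribution.</p>

```r
# Minimal sketch (hypothetical values): test one time-zero-normalized
# timecourse against a pure-noise null via Mahalanobis distance.
set.seed(42)
n_tp   <- 9      # timepoints after time zero (assumed)
sigma2 <- 0.25   # assumed known per-observation noise variance

# Simulate a flat (no-signal) profile and normalize to time zero
raw        <- rnorm(n_tp + 1, mean = 5, sd = sqrt(sigma2))
timecourse <- raw[-1] - raw[1]

# Time zero normalization correlates the timepoints:
# Var(x_t - x_0) = 2*sigma2; Cov(x_s - x_0, x_t - x_0) = sigma2
null_cov <- matrix(sigma2, n_tp, n_tp) + diag(sigma2, n_tp)

# Under the null, the squared Mahalanobis distance is chi-squared
# distributed with n_tp degrees of freedom
D2      <- mahalanobis(timecourse, center = rep(0, n_tp), cov = null_cov)
p_value <- pchisq(D2, df = n_tp, lower.tail = FALSE)
```

<p>In a real screen, one such p-value per timecourse would then feed into multiple-testing correction before calling discoveries.</p>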

<p>The test uses not just a pattern of expression but also information about the magnitude of noise associated with each observation. This information is available in many genomics contexts: observation-level estimates of noise can come from (1) the mean-variance relationships of <a href="https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf">RNAseq data</a> or (2) the consistency of peptides in <a href="https://pubs.acs.org/doi/abs/10.1021%2Facs.jproteome.7b00699">proteomics data</a>. Having these estimates lets us identify large fold-changes that are unlikely to occur by chance, even if all post-time-zero timepoints are similar. Similarly, complex rapid dynamics will look like a poorly fit regression model with high residual error; but if we know the magnitude of residual error to expect, then signals can be identified simply by looking for an excess of variation in the timecourse (accounting for the bias introduced by time zero normalization). Using noise estimates may seem like cheating, since this information was not directly used by the other tests. If we were to use an estimate of the noise level in those tests, it would be to carry out a weighted least squares regression. But, since all observations here have the same level of added noise, weighted regressions would be equivalent to the unweighted regressions used above.</p>

<p>Using Mahalanobis distance, we can look for any departures from the null noise model to define signal. This lets us step away from models that look for particular types of signals, such as the linear, cubic, or smooth relationships sought in the regression models we applied. Some of these models can be quite flexible, but when the underlying data do not follow these relationships, they will often fail. Here, we simulated signals as biologically feasible sigmoidal or impulse responses, so none of the regression models applied could capture every instance of a simulated signal-containing timecourse. If we were to use a regression model in this case, we would be best off fitting a non-linear least squares model following the sigmoidal or impulse form. Doing this is actually non-trivial if we want to avoid pathological fits, but these issues have been addressed in the <a href="https://github.com/calico/impulse">impulse</a> R package. I’ll discuss this problem in a future post.</p>]]></content><author><name>Sean Hackett</name></author><category term="statistics" /><category term="idea" /><category term="dynamics" /><summary type="html"><![CDATA[A powerful approach for identifying signals in genomic time series]]></summary></entry></feed>