<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.shackett.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.shackett.org/" rel="alternate" type="text/html" /><updated>2026-01-31T03:43:29+00:00</updated><id>https://www.shackett.org/feed.xml</id><title type="html">Sean Hackett</title><subtitle>As a Director of Data Science, I lay the groundwork for organizational change by connecting data collection, organization, and synthesis. But, I still love getting into the weeds on technical problems where I can continue developing my analytics and programming skills.</subtitle><author><name>Sean Hackett</name></author><entry><title type="html">Distinguishing Activation from Inhibition with Relation-Aware Graph Neural Networks</title><link href="https://www.shackett.org/relation_prediction/" rel="alternate" type="text/html" title="Distinguishing Activation from Inhibition with Relation-Aware Graph Neural Networks" /><published>2026-01-06T00:00:00+00:00</published><updated>2026-01-06T00:00:00+00:00</updated><id>https://www.shackett.org/relation_prediction</id><content type="html" xml:base="https://www.shackett.org/relation_prediction/"><![CDATA[<p>In my <a href="https://www.shackett.org/napistu_torch">last post</a>, I discussed
self-supervised edge prediction as a way of embedding genes using a
gene-regulatory network.</p>

<p>This approach allows genes, metabolites, drugs, and other vertices to be
connected based on shared network topology. However, to date I’ve only
discussed edge prediction using a dot-product head, where a
vertex-pair’s edge support is a direct readout of their similarity in
embedding space (𝐚 · 𝐛). While surprisingly powerful, this head has
limitations when vertices are heterogeneous or interact in qualitatively
different ways — particularly when we want to distinguish between
activation and inhibition.</p>
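<p>As a concrete illustration of this limitation, here is a minimal sketch (plain Python, not the Napistu-Torch implementation) of a dot-product edge score. Because the inner product is symmetric in its arguments, the head assigns the same score to A → B as to B → A, leaving no room for signed, relation-specific semantics:</p>

```python
def dot_product_score(a: list[float], b: list[float]) -> float:
    """Edge score as the inner product of two vertex embeddings."""
    return sum(x * y for x, y in zip(a, b))

src = [0.5, -1.0, 2.0]
dst = [1.0, 0.25, -0.5]

# The score is symmetric: A -> B and B -> A get the same value, so a
# single dot product cannot distinguish "A activates B" from
# "A inhibits B", or even the direction of regulation.
assert dot_product_score(src, dst) == dot_product_score(dst, src)
```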

<p>Here, I explore more expressive approaches for learning mappings between
A → B by evaluating both general edge prediction heads (like MLPs) and
“relation-aware” heads that can learn distinct mappings for different
edge types. The post will cover:</p>

<ul>
  <li>Data model and training changes enabling relation-specific
predictions</li>
  <li>Geometric analysis revealing how relation-aware heads encode
regulatory semantics</li>
  <li>PerturbSeq validation demonstrating successful prediction of signed
regulatory interactions</li>
  <li>Pre-trained models available on HuggingFace</li>
</ul>

<!--more-->

<p>Edge prediction is a powerful approach for predicting regulatory
relationships between molecular species, but not all regulatory
relations are equivalent. They vary both in how molecules interact
(physically, functionally, mechanistically) and in the consequences of
these interactions (activation, inhibition, ambiguous effects, or no
effect). While the edge encoder partially captures this information to
weight message passing, the ultimate prediction is a single continuous
score representing edge likelihood — without distinguishing the type
of interaction.</p>

<p>Ideally, I want models that can predict not just whether an interaction
occurs, but how it occurs. This led me to relation-aware approaches.
Relations are commonly discussed in the context of knowledge graphs,
where qualitatively different vertex types are connected by different
relationship types. For example, embedding the Open Targets knowledge
graph organizes genes and phenotypes in a common manifold while also
connecting drugs, chemical probes, and other entity types. Learning
relation-aware edges defines specific transformations that map between
distinct regions of the embedding space.</p>
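<p>To make this concrete, here is a toy TransE-style relation score (illustrative vectors, not trained embeddings): each relation is a learned translation, and an edge (a, r, b) is plausible when a + r lands near b.</p>

```python
import math

def transe_distance(a, r, b):
    """Euclidean distance ||a + r - b||; smaller means more plausible."""
    return math.sqrt(sum((ai + ri - bi) ** 2 for ai, ri, bi in zip(a, r, b)))

# toy 2-D embeddings: the relation vector maps the "gene" region of the
# space onto the "phenotype" region
gene = [1.0, 0.0]
phenotype = [1.0, 2.0]
rel_associated_with = [0.0, 2.0]
rel_unrelated = [-3.0, 0.0]

# the plausible relation places gene + r exactly on the phenotype
assert transe_distance(gene, rel_associated_with, phenotype) == 0.0
assert transe_distance(gene, rel_unrelated, phenotype) > 0.0
```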

<p>However, I will focus on learning relation types within a largely
homogeneous vertex set — primarily genes and metabolites that can
serve multiple regulatory roles. This presents a greater challenge for
relation-aware methods, as vertices cannot be cleanly separated by type,
and the same molecule may act as both activator and inhibitor in
different contexts.</p>

<h2 id="workflow-updates-supporting-relation-prediction">Workflow updates supporting relation prediction</h2>

<p>Building robust relation-aware models required improvements to both the
Napistu-Torch training framework and the underlying data model.</p>

<p><strong>General framework improvements:</strong></p>

<ul>
  <li><strong>Hugging Face integration</strong> for reproducible datasets and sharing
pre-trained models</li>
  <li><strong>Training enhancements</strong> including Weights &amp; Biases sweep support
and resumable training</li>
  <li><strong>Transfer learning capabilities</strong> for loading pre-trained encoders
and fine-tuning models</li>
</ul>

<p><strong>Relation-specific data model changes:</strong></p>

<ul>
  <li><strong>Reaction vertex removal</strong> to enable direct edge prediction between
molecular species</li>
  <li><strong>Relation-type labels</strong> derived from source and target Systems
Biology Ontology (SBO) role annotations</li>
</ul>

<h3 id="restructuring-the-napistugraph---no-more-reaction-vertices">Restructuring the <code class="language-plaintext highlighter-rouge">NapistuGraph</code> - no more reaction vertices</h3>

<p>Earlier versions of Napistu included both molecular species (proteins,
metabolites, drugs, etc.) and reaction vertices. For complex regulatory
mechanisms like enzymatic reactions, this provided a clear functional
description anchoring pairwise molecular interactions. This was also
useful for network visualization, since reactions from common sources
— particularly the many narrowly-scoped Reactome pathways — provide
ideal labels for network neighborhoods.</p>

<p>However, including reactions has a major downside — individual edges
lose their meaning. For example, an enzyme transforming A → B would be
encoded as two separate edges (A → R and R → B). For many purposes this
is fine, but for edge prediction it adds more noise than signal. For the
model to learn what an A → R → B reaction represents, it would need to
encode B’s embedding within the R embedding. This is both
computationally difficult and conceptually unnecessary, so for GNNs I’m
moving to direct A → B connections.</p>
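<p>A minimal sketch of this restructuring, assuming a toy edge list (this is not the actual <code class="language-plaintext highlighter-rouge">NapistuGraph</code> code): edges into each reaction vertex are joined with edges out of it to produce direct species-to-species connections.</p>

```python
# hypothetical edge list containing one reaction vertex, "R1"
edges = [
    ("enzyme_E", "R1"),     # E catalyzes R1
    ("substrate_A", "R1"),  # A is consumed by R1
    ("R1", "product_B"),    # R1 produces B
]
reaction_vertices = {"R1"}

# join each reaction's incoming sources with its outgoing targets
direct_edges = []
for rxn in reaction_vertices:
    sources = [s for s, t in edges if t == rxn]
    targets = [t for s, t in edges if s == rxn]
    direct_edges += [(s, t) for s in sources for t in targets]

# direct_edges now connects molecular species without the intermediate R1
```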

<p>Predicting direct connections introduces a wrinkle: I had previously
enforced that no more than one edge could connect an A-B pair. Moving to
a reaction-less graph, I relaxed this constraint so multiple edges can
now connect the same vertex pair. This allows activating, inhibitory,
and interaction edges to simultaneously exist — relationships that
would have previously been distinguished by their intermediate reaction
vertices.</p>
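<p>A hypothetical example of what this relaxation permits (illustrative edge records, not the actual data model): the same vertex pair can now carry both a regulatory edge and a physical-interaction edge.</p>

```python
# two parallel edges connecting the same vertex pair, distinguished
# only by their relation types
edges = [
    {"source": "TP53", "target": "MDM2", "relation_type": "stimulator -> modified"},
    {"source": "TP53", "target": "MDM2", "relation_type": "interactor -> interactor"},
]

# both records connect one (source, target) pair
pairs = {(e["source"], e["target"]) for e in edges}
assert len(edges) == 2 and len(pairs) == 1
```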

<h3 id="adding-relation_type-to-napistudata">Adding <code class="language-plaintext highlighter-rouge">relation_type</code> to <code class="language-plaintext highlighter-rouge">NapistuData</code></h3>

<p>When constructing the network graph, I encode each mechanism as a series
of pairwise interactions, with each participant assigned a role from the
<em>SBO</em> controlled vocabulary. <em>SBO</em> terms — like interactor,
stimulator, inhibitor, modifier, and modified — capture the distinct
ways molecules participate in regulatory mechanisms. To create
<code class="language-plaintext highlighter-rouge">relation_type</code> labels for edges, I constructed composite labels by
combining each edge’s source and target SBO terms, such as “catalyst →
reactant,” “interactor → interactor,” and “stimulator → modified.”</p>
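<p>A minimal sketch of this label construction (illustrative roles only; the real labels come from the SBO annotations on each edge):</p>

```python
def relation_type_label(source_sbo_role: str, target_sbo_role: str) -> str:
    """Combine the two endpoint SBO roles into one edge-level label."""
    return f"{source_sbo_role} -> {target_sbo_role}"

# hypothetical per-edge (source role, target role) pairs
edge_roles = [
    ("catalyst", "reactant"),
    ("interactor", "interactor"),
    ("stimulator", "modified"),
]
labels = [relation_type_label(s, t) for s, t in edge_roles]
```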

<p><code class="language-plaintext highlighter-rouge">NapistuData</code> (a subclass of PyG’s <code class="language-plaintext highlighter-rouge">Data</code>) supports relations through
two optional attributes: <code class="language-plaintext highlighter-rouge">relation_type</code> and its associated
<code class="language-plaintext highlighter-rouge">relation_manager</code> (for tracking label metadata). Creating these
relation types is elegantly handled as an extension of the existing
“edge strata” functionality, which organizes edges based on vertex
and/or species type to create hard negative samples.</p>

<h2 id="fitting-relation-unaware-models">Fitting relation-(un)aware models</h2>

<p>During standard edge prediction, we score a possible edge based on the
source and target vertices’ embeddings. For relation prediction, we
additionally provide heads with a <code class="language-plaintext highlighter-rouge">relation_type</code> integer to distinguish
different types of relations. To evaluate different head architectures,
I trained a range of relation-aware heads alongside simpler
relation-unaware baselines.</p>

<p>To enable fair and efficient comparison across heads, I first trained a
128-dimensional <code class="language-plaintext highlighter-rouge">GraphConv</code> message passing encoder with a
32-dimensional edge encoder and a simple dot-product head. I deployed
this pre-trained model to <a href="https://huggingface.co/seanhacks/edge_prediction_dotprod_128e">Hugging
Face</a>,
then initialized each head of interest with the pre-trained encoder
weights.</p>

<p>To leverage this pretraining, I made three key changes to the training
regime:</p>

<ul>
  <li>Lowered the learning rate substantially from 0.003 (original
dot-product head) to 0.0005 (transfer learning experiments)</li>
  <li>Used the one-cycle scheduler to gradually ramp up the learning rate</li>
  <li>Initialized expressive heads with <code class="language-plaintext highlighter-rouge">init_as_identity</code> settings (when
appropriate) so they started from a similar state as the pre-trained
dot-product head</li>
</ul>

<p>To address the imbalanced distribution of relation types in the training
data, I applied relation-weighting to each head’s loss function (binary
cross-entropy for most heads and a margin-based loss for TransE and
RotatE). Each relation type’s loss contribution is weighted by
1/√(relation-type count), down-weighting abundant relation types (like
“interactor → interactor”) while emphasizing rare but biologically
important ones (like “inhibitor → modified”). This ensures that the
models learn to predict all relation types effectively, rather than
primarily optimizing for the most common edges.</p>
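<p>The weighting scheme can be sketched in a few lines (toy counts; the real counts come from the training edge list):</p>

```python
import math

# illustrative relation-type counts
relation_counts = {
    "interactor -> interactor": 10_000,  # abundant
    "inhibitor -> modified": 100,        # rare but biologically important
}

# weight each relation type by 1 / sqrt(count)
weights = {r: 1.0 / math.sqrt(n) for r, n in relation_counts.items()}

# the 100x rarer relation receives a 10x larger per-edge weight
ratio = weights["inhibitor -> modified"] / weights["interactor -> interactor"]
assert math.isclose(ratio, 10.0)
```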

<p>I fitted all models using model-specific configs and the Napistu-Torch
CLI.</p>

<h3 id="reproducing-this-analysis">Reproducing this analysis</h3>

<p>This analysis is fully reproducible — all code, data, and model
configurations are provided so you can run the complete workflow on your
own machine.</p>

<p><strong>Environment setup:</strong></p>

<ol>
  <li>
    <p>Install <a href="https://docs.astral.sh/uv/#highlights">uv</a> (or use <code class="language-plaintext highlighter-rouge">pip</code> if
preferred).</p>
  </li>
  <li>
    <p>Set up a Python environment:</p>
  </li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv venv <span class="nt">--python</span> 3.11
<span class="nb">source</span> .venv/bin/activate
<span class="c"># Core dependencies</span>
uv pip <span class="nb">install </span><span class="nv">torch</span><span class="o">==</span>2.8.0
uv pip <span class="nb">install </span>torch-scatter torch-sparse <span class="nt">-f</span> https://data.pyg.org/whl/torch-2.8.0+cpu.html
uv pip <span class="nb">install </span><span class="nv">napistu</span><span class="o">==</span>0.8.5
<span class="c"># pin wandb to 0.22.x for compatibility</span>
uv pip <span class="nb">install </span><span class="nv">wandb</span><span class="o">==</span>0.22.3 
uv pip <span class="nb">install</span> <span class="s2">"napistu-torch[pyg,lightning,analysis]==0.3.6"</span>
<span class="c"># For rendering the notebook</span>
uv pip <span class="nb">install </span>ipykernel nbformat nbclient
python <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span><span class="o">=</span>blog-staging
</code></pre></div></div>

<ol start="3">
  <li>
    <p>Download the
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/relation_prediction.qmd"><code class="language-plaintext highlighter-rouge">relation_prediction.qmd</code></a>
notebook (or copy relevant code blocks).</p>
  </li>
  <li>
    <p>Choose your path:</p>

    <ul>
      <li><strong>Using pre-trained models</strong> (recommended): The notebook will
load the models from Hugging Face on-the-fly.</li>
      <li><strong>Training from scratch</strong>: Download the <a href="https://github.com/shackett/shackett/blob/main/assets/data/relation_prediction_configs.zip">model configs and
training shell
script</a>
to train models yourself.</li>
    </ul>
  </li>
  <li>
    <p>Configure <code class="language-plaintext highlighter-rouge">WORKING_DIR</code> in the following code block to point to your
working directory.</p>
  </li>
</ol>

<h3 id="configuration-and-imports">Configuration and imports</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># standard library imports
</span><span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">combinations</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">textwrap</span>

<span class="c1"># 3rd party imports
</span><span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="nb">abs</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">napistu.ingestion.perturbseq</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">assign_predicted_direction</span><span class="p">,</span>
    <span class="n">load_harmonizome_perturbseq_datasets</span><span class="p">,</span>
    <span class="n">_get_distinct_harmonizome_perturbseq_interactions</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">napistu.ingestion.constants</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">SIGNED_PERTURBATION_TYPES</span><span class="p">,</span>
    <span class="n">STRONG_ORDERED_SIGNED_PERTURBSEQ_DIRECTIONS</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">import</span> <span class="nn">napistu.utils</span> <span class="k">as</span> <span class="n">napistu_utils</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="c1"># import a couple of functions used by just the posted version of the blog
# pip install git+https://github.com/shackett/shackett-utils.git
</span><span class="kn">from</span> <span class="nn">shackett_utils.utils</span> <span class="kn">import</span> <span class="n">pd_utils</span>
<span class="kn">from</span> <span class="nn">shackett_utils.blog.html_utils</span> <span class="kn">import</span> <span class="n">display_tabulator</span>

<span class="c1"># napistu-torch imports
</span><span class="kn">from</span> <span class="nn">napistu_torch.evaluation.manager</span> <span class="kn">import</span> <span class="n">RemoteEvaluationManager</span>
<span class="kn">from</span> <span class="nn">napistu_torch.visualization.basic_metrics</span> <span class="kn">import</span> <span class="n">plot_auc_only</span><span class="p">,</span> <span class="n">_extract_metric</span>
<span class="kn">from</span> <span class="nn">napistu_torch.visualization.advanced_metrics</span> <span class="kn">import</span> <span class="n">plot_combined_grouped_barplot</span>
<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.relation_prediction</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">calculate_relation_type_confusion_and_correlation</span><span class="p">,</span>
    <span class="n">compare_relation_type_predictions_to_perturbseq_truth</span><span class="p">,</span>
    <span class="n">get_perturbseq_edgelist_tensor</span><span class="p">,</span>
    <span class="n">summarize_relation_type_aucs</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">napistu_torch.models.constants</span> <span class="kn">import</span> <span class="n">HEAD_DESCRIPTIONS</span>
<span class="kn">from</span> <span class="nn">napistu_torch.utils.tensor_utils</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">compute_correlation_matrix</span><span class="p">,</span>
    <span class="n">compute_effective_dimensionality</span><span class="p">,</span>
    <span class="n">compute_spearman_correlation_torch</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">napistu_torch.visualization.heatmaps</span> <span class="kn">import</span> <span class="n">plot_heatmap</span>

<span class="n">WORKING_DIR</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">"~/Desktop/relation_prediction_experiments"</span><span class="p">))</span>
<span class="n">PATH_TO_NAPISTU_STORE</span> <span class="o">=</span> <span class="n">WORKING_DIR</span> <span class="o">/</span> <span class="s">".store"</span>

<span class="n">MODEL_DISPLAY_ORDER</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"dot_product"</span><span class="p">,</span>
    <span class="s">"mlp"</span><span class="p">,</span>
    <span class="s">"attention"</span><span class="p">,</span>
    <span class="s">"distmult"</span><span class="p">,</span>
    <span class="s">"rotate"</span><span class="p">,</span>
    <span class="s">"transe"</span><span class="p">,</span>
    <span class="s">"relation_attention"</span><span class="p">,</span>
    <span class="s">"relation_gated_mlp"</span><span class="p">,</span>
    <span class="s">"relation_attention_mlp"</span><span class="p">,</span>
<span class="p">]</span>

<span class="n">MODEL_HF_REPOSITORIES</span> <span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"dot_product"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_dotprod_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"mlp"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_mlp_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"attention"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_attention_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"distmult"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_distmult_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"rotate"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_rotate_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"transe"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_transe_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"relation_attention"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_relationattention_128e"</span><span class="p">,</span> <span class="s">"20251229-2"</span><span class="p">),</span>
    <span class="s">"relation_gated_mlp"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_relationgatedmlp_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
    <span class="s">"relation_attention_mlp"</span> <span class="p">:</span> <span class="p">(</span><span class="s">"seanhacks/relation_prediction_relationattnmlp_128e"</span><span class="p">,</span> <span class="s">"20251229"</span><span class="p">),</span>
<span class="p">}</span>

<span class="n">RELATION_AWARE_FOCUSED_HEADS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">"distmult"</span><span class="p">,</span>
    <span class="s">"transe"</span><span class="p">,</span>
    <span class="s">"relation_gated_mlp"</span><span class="p">,</span>
    <span class="s">"relation_attention_mlp"</span>
<span class="p">]</span>

<span class="n">PERTURBSEQ_RELATION_TYPES</span> <span class="o">=</span> <span class="p">[</span><span class="s">"inhibitor -&gt; modified"</span><span class="p">,</span> <span class="s">"stimulator -&gt; modified"</span><span class="p">]</span>

<span class="c1"># local caches
</span><span class="n">LOCAL_HARMONIZOME_DATA_DIR</span> <span class="o">=</span> <span class="s">"/tmp/harmonizome_data"</span>
<span class="n">CROSS_RELATION_PREDICTION_CACHE</span> <span class="o">=</span> <span class="s">"/tmp/cross_relation_prediction_matrices.pkl"</span>
</code></pre></div></div>

<h2 id="comparing-relation-unaware-models">Comparing relation-(un)aware models</h2>

<p>To compare the trained models, I will load their checkpoints and
evaluation metrics using Napistu-Torch’s <code class="language-plaintext highlighter-rouge">RemoteEvaluationManager</code>,
which provides a unified interface for accessing model weights, training
configs, and Weights &amp; Biases summaries directly from Hugging Face. (If
you are working with local models, you can instead use the similar
<code class="language-plaintext highlighter-rouge">LocalEvaluationManager</code>, which directly interacts with Weights &amp; Biases
and local models and data.)</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">eval_managers</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">model_name</span><span class="p">,</span> <span class="n">model_info</span> <span class="ow">in</span> <span class="n">MODEL_HF_REPOSITORIES</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
    <span class="n">model_repo</span><span class="p">,</span> <span class="n">model_version</span> <span class="o">=</span> <span class="n">model_info</span>
    <span class="n">eval_managers</span><span class="p">[</span><span class="n">model_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">RemoteEvaluationManager</span><span class="p">.</span><span class="n">from_huggingface</span><span class="p">(</span>
        <span class="n">model_repo</span><span class="p">,</span>
        <span class="n">data_store_dir</span> <span class="o">=</span> <span class="n">PATH_TO_NAPISTU_STORE</span><span class="p">,</span>
        <span class="n">revision</span> <span class="o">=</span> <span class="n">model_version</span><span class="p">,</span>
    <span class="p">)</span>

<span class="c1"># for local evaluation, instead do this:
# from napistu_torch.evaluation.manager import LocalEvaluationManager
# EXPERIMENT
# eval_managers = dict()
# for experiment in MODEL_DISPLAY_ORDER:
#     experiment_path = &lt;&lt;PATH_TO_EXPERIMENT_DIR&gt;&gt;
#     eval_managers[experiment] = LocalEvaluationManager(experiment_path)
</span>
<span class="c1"># Load pre-calculated WandB summaries directly from HuggingFace
</span><span class="n">run_summaries</span> <span class="o">=</span> <span class="p">{</span><span class="n">exp</span><span class="p">:</span> <span class="n">manager</span><span class="p">.</span><span class="n">get_run_summary</span><span class="p">()</span> <span class="k">for</span> <span class="n">exp</span><span class="p">,</span> <span class="n">manager</span> <span class="ow">in</span> <span class="n">eval_managers</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>

<span class="c1"># Load all of the trained models
</span><span class="n">models</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span> <span class="p">:</span> <span class="n">v</span><span class="p">.</span><span class="n">load_model_from_checkpoint</span><span class="p">()</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">eval_managers</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>

<span class="c1"># Count trainable parameters in each head
</span><span class="n">n_head_params</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span> <span class="p">:</span> <span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">v</span><span class="p">.</span><span class="n">task</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">models</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>

<span class="c1"># connect to the NapistuDataStore and load the NapistuData instance which all models were trained on
# all of the experiments use the same data, so we can pick an arbitrary one
</span>
<span class="n">napistu_data_store</span> <span class="o">=</span> <span class="n">eval_managers</span><span class="p">[</span><span class="s">"distmult"</span><span class="p">].</span><span class="n">napistu_data_store</span>
<span class="n">napistu_data</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_napistu_data</span><span class="p">(</span><span class="s">"relation_prediction"</span><span class="p">)</span>
<span class="n">relation_types</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">napistu_data</span><span class="p">.</span><span class="n">relation_manager</span><span class="p">.</span><span class="n">label_names</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>
<span class="n">species_identifiers</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_pandas_df</span><span class="p">(</span><span class="s">"species_identifiers"</span><span class="p">)</span>
<span class="n">name_to_sid_map</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_pandas_df</span><span class="p">(</span><span class="s">"name_to_sid_map"</span><span class="p">).</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">name_to_sid_map</span><span class="p">[</span><span class="s">"integer_id"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">name_to_sid_map</span><span class="p">))</span>

<span class="k">if</span> <span class="ow">not</span> <span class="nb">all</span><span class="p">(</span><span class="n">name_to_sid_map</span><span class="p">[</span><span class="s">"name"</span><span class="p">]</span> <span class="o">==</span> <span class="n">napistu_data</span><span class="p">.</span><span class="n">ng_vertex_names</span><span class="p">):</span>
    <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"name_to_sid_map does not match napistu_data.ng_vertex_names"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="model-architecture-overview">Model architecture overview</h3>

<p>I evaluated nine head architectures spanning three
categories:</p>

<ul>
  <li><strong>Edge prediction (relation-unaware)</strong>: Simple heads that predict
edge existence without distinguishing relation types. These serve as
baselines to assess whether relation-aware methods provide
meaningful improvements.</li>
  <li><strong>Knowledge graph embedding</strong>: Methods originally developed for
heterogeneous knowledge graphs (like TransE and DistMult) that learn
relation-specific transformations.</li>
  <li><strong>Relation prediction (expressive heads)</strong>: Custom architectures
that combine relation-aware gating or attention mechanisms with MLPs
to learn flexible, relation-specific transformations.</li>
</ul>
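<p>For intuition on the knowledge graph embedding category, here is a toy DistMult-style score (illustrative vectors, not trained weights): each relation acts as a diagonal matrix, so the same vertex pair receives different scores under different relations.</p>

```python
def distmult_score(a, r, b):
    """DistMult score: sum_i a_i * r_i * b_i for relation vector r."""
    return sum(ai * ri * bi for ai, ri, bi in zip(a, r, b))

# toy 2-D embeddings for one vertex pair and two relations
a = [1.0, -1.0]
b = [1.0, 1.0]
rel_activates = [1.0, -1.0]  # weights the second axis negatively
rel_inhibits = [1.0, 1.0]

# the same vertex pair is ranked differently under each relation
assert distmult_score(a, rel_activates, b) != distmult_score(a, rel_inhibits, b)
```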

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a summary table
</span><span class="n">model_summaries</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([</span><span class="n">HEAD_DESCRIPTIONS</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">MODEL_DISPLAY_ORDER</span><span class="p">])</span>
<span class="n">model_summaries</span><span class="p">[</span><span class="s">"category"</span><span class="p">]</span> <span class="o">=</span> <span class="n">model_summaries</span><span class="p">[</span><span class="s">"category"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"_"</span><span class="p">,</span> <span class="s">" "</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="n">capitalize</span><span class="p">()</span>
<span class="n">model_summaries</span><span class="p">[</span><span class="s">"N parameters"</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">n_head_params</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">MODEL_DISPLAY_ORDER</span><span class="p">]</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">model_summaries</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">"N parameters"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span>
    <span class="n">caption</span><span class="o">=</span><span class="s">"Summary of all tested prediction heads"</span><span class="p">,</span>
    <span class="n">wrap_columns</span><span class="o">=</span><span class="p">[</span><span class="s">"label"</span><span class="p">,</span> <span class="s">"category"</span><span class="p">,</span> <span class="s">"description"</span><span class="p">],</span>
    <span class="n">column_widths</span><span class="o">=</span><span class="p">{</span><span class="s">"description"</span> <span class="p">:</span> <span class="s">"50%"</span><span class="p">,</span> <span class="s">"N parameters"</span> <span class="p">:</span> <span class="s">"15%"</span><span class="p">},</span>
    <span class="n">include_index</span> <span class="o">=</span> <span class="bp">False</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Summary of all tested prediction heads
</figcaption>

<div class="data-table" style="" data-table="[{&quot;label&quot;: &quot;Dot product&quot;, &quot;category&quot;: &quot;Edge prediction&quot;, &quot;description&quot;: &quot;Simple dot product of source and target embeddings&quot;, &quot;N parameters&quot;: 0}, {&quot;label&quot;: &quot;RotatE&quot;, &quot;category&quot;: &quot;Knowledge graph embedding&quot;, &quot;description&quot;: &quot;RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space&quot;, &quot;N parameters&quot;: 704}, {&quot;label&quot;: &quot;DistMult&quot;, &quot;category&quot;: &quot;Knowledge graph embedding&quot;, &quot;description&quot;: &quot;DistMult: Embedding Entities and Relations for Learning and Inference in Knowledge Bases&quot;, &quot;N parameters&quot;: 1408}, {&quot;label&quot;: &quot;TransE&quot;, &quot;category&quot;: &quot;Knowledge graph embedding&quot;, &quot;description&quot;: &quot;TransE: Translating Embeddings for Modeling Multi-relational Data&quot;, &quot;N parameters&quot;: 1408}, {&quot;label&quot;: &quot;Attention&quot;, &quot;category&quot;: &quot;Edge prediction&quot;, &quot;description&quot;: &quot;Attention head that projects nodes to query/key spaces and computes scaled dot-product attention. 
Learns separate transformations for source (query) and target (key) embeddings.&quot;, &quot;N parameters&quot;: 32768}, {&quot;label&quot;: &quot;MLP&quot;, &quot;category&quot;: &quot;Edge prediction&quot;, &quot;description&quot;: &quot;Multi-layer perceptron head that concatenates source and target embeddings, then applies 2-layer MLP with ReLU and dropout.&quot;, &quot;N parameters&quot;: 33025}, {&quot;label&quot;: &quot;Relation-gated MLP head&quot;, &quot;category&quot;: &quot;Relation prediction&quot;, &quot;description&quot;: &quot;Edge features are processed through MLP, then modulated by relation-specific gates (element-wise multiplication), and passed through output MLP.&quot;, &quot;N parameters&quot;: 58561}, {&quot;label&quot;: &quot;Relation-aware attention head&quot;, &quot;category&quot;: &quot;Relation prediction&quot;, &quot;description&quot;: &quot;Relation-aware multi-head attention head. Uses relation embeddings as queries to attend to edge features (concatenated source/target).&quot;, &quot;N parameters&quot;: 75073}, {&quot;label&quot;: &quot;Relation-aware attention head with MLP&quot;, &quot;category&quot;: &quot;Relation prediction&quot;, &quot;description&quot;: &quot;Processes edge features through MLP, then uses relation embeddings as queries in multi-head attention to select relevant features. 
Includes residual connection and output MLP for final prediction.&quot;, &quot;N parameters&quot;: 91585}]" data-columns="[{&quot;title&quot;: &quot;label&quot;, &quot;field&quot;: &quot;label&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true}, {&quot;title&quot;: &quot;category&quot;, &quot;field&quot;: &quot;category&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true}, {&quot;title&quot;: &quot;description&quot;, &quot;field&quot;: &quot;description&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}, {&quot;title&quot;: &quot;N parameters&quot;, &quot;field&quot;: &quot;N parameters&quot;, &quot;width&quot;: &quot;15%&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<h3 id="comparing-models-with-standard-and-relation-weighted-auc">Comparing models with standard and relation-weighted AUC</h3>

<p>To evaluate model performance, I use two complementary metrics:</p>

<p><strong>Standard AUC</strong> treats all edges equally, measuring how well models
discriminate real edges from random negatives across the entire graph.</p>

<p><strong>Relation-weighted AUC</strong> accounts for the imbalanced distribution of
relation types by:</p>

<ol>
  <li>Calculating AUC separately for each relation type (comparing real
edges to negative samples of the same relation type)</li>
  <li>Taking a weighted average of these per-relation AUCs, where weights
are proportional to √N (the square root of each relation type’s
frequency)</li>
</ol>

<p>This relation-weighted metric is particularly important given the highly
imbalanced distribution of relation types. Some (like “interactor →
interactor”) are far more common than others (like “inhibitor →
modified”). The √N weighting balances standard AUC (which would weight
by N) and equal weighting (which would weight by 1), ensuring rare
relation types influence the metric without dominating it. I used
relation-weighted AUC for early stopping during training, prioritizing
models that perform well across all relation types rather than just the
most abundant ones.</p>
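<p>The two-step metric above can be sketched directly; a minimal
implementation using scikit-learn's <code>roc_auc_score</code> (the
function name and signature are illustrative, not the post's actual
helpers):</p>

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def relation_weighted_auc(y_true, y_score, relation_ids):
    """Per-relation AUCs combined with sqrt(N) weights (illustrative sketch)."""
    aucs, weights = [], []
    for rel in np.unique(relation_ids):
        mask = relation_ids == rel
        # AUC is undefined when a relation has only one class present
        if len(np.unique(y_true[mask])) < 2:
            continue
        aucs.append(roc_auc_score(y_true[mask], y_score[mask]))
        weights.append(np.sqrt(mask.sum()))  # sqrt(N) weighting
    return float(np.average(aucs, weights=weights))
```

<p>With equal-sized relation groups this reduces to a plain average of
per-relation AUCs; as group sizes diverge, common relations pull harder
on the metric, but only in proportion to √N rather than N.</p>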

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># extract and reorder based on test relation-weighted AUC
</span><span class="n">test_aucs</span> <span class="o">=</span> <span class="n">_extract_metric</span><span class="p">(</span><span class="n">run_summaries</span><span class="p">,</span> <span class="s">"test_relation_weighted_auc"</span><span class="p">)</span>
<span class="n">performance_order</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">test_aucs</span><span class="p">,</span> <span class="n">run_summaries</span><span class="p">.</span><span class="n">keys</span><span class="p">()))]</span>
<span class="n">performance_ordered_summaries</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">run_summaries</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">performance_order</span><span class="p">}</span>
<span class="n">ordered_labels</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">textwrap</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">HEAD_DESCRIPTIONS</span><span class="p">[</span><span class="n">x</span><span class="p">][</span><span class="s">"label"</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>  <span class="c1"># Adjust width as needed
</span>    <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">performance_order</span>
<span class="p">]</span>

<span class="c1"># Create figure with two subplots
</span><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>

<span class="c1"># Plot regular AUC on first axis
</span><span class="n">plot_auc_only</span><span class="p">(</span><span class="n">performance_ordered_summaries</span><span class="p">,</span> <span class="n">ordered_labels</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax1</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s">"Standard AUC"</span><span class="p">)</span>

<span class="c1"># Plot relation-weighted AUC on second axis
</span><span class="n">plot_auc_only</span><span class="p">(</span>
    <span class="n">performance_ordered_summaries</span><span class="p">,</span>
    <span class="n">ordered_labels</span><span class="p">,</span>
    <span class="n">test_auc_attribute</span><span class="o">=</span><span class="s">"test_relation_weighted_auc"</span><span class="p">,</span>
    <span class="n">val_auc_attribute</span><span class="o">=</span><span class="s">"val_relation_weighted_auc"</span><span class="p">,</span>
    <span class="n">title</span><span class="o">=</span><span class="s">"Relation-Weighted AUC"</span><span class="p">,</span>
    <span class="n">ax</span><span class="o">=</span><span class="n">ax2</span>
<span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2026-01-06-relation_prediction/plot_auc_comparison-output-1.png" alt="" /></p>

<h3 id="performance-across-relation-types">Performance across relation types</h3>

<p>While the relation-weighted AUC provides an overall performance metric,
examining AUC for individual relation types reveals how well each model
handles specific regulatory mechanisms. Counterintuitively, “interactor
→ interactor” edges are among the hardest to predict, possibly because
the high density of interaction edges creates competing demands on
vertex positioning in the embedding space. In contrast, directed
regulatory relation types like “stimulator → reactant” and “catalyst →
reactant” achieve higher AUCs across most models.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">relation_type_aucs</span> <span class="o">=</span> <span class="n">summarize_relation_type_aucs</span><span class="p">(</span><span class="n">run_summaries</span><span class="p">,</span> <span class="n">relation_types</span><span class="p">)</span>
<span class="n">ordered_experiments</span> <span class="o">=</span> <span class="n">relation_type_aucs</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"experiment"</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">"test_auc"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
<span class="n">ordered_relation_types</span> <span class="o">=</span> <span class="n">relation_type_aucs</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"relation_type"</span><span class="p">).</span><span class="nb">sum</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">"test_auc"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plot_combined_grouped_barplot</span><span class="p">(</span>
    <span class="n">relation_type_aucs</span><span class="p">,</span>
    <span class="n">category_order</span><span class="o">=</span><span class="n">ordered_relation_types</span><span class="p">,</span>
    <span class="n">attribute_order</span><span class="o">=</span><span class="n">ordered_experiments</span><span class="p">,</span>
    <span class="n">value_vars</span> <span class="o">=</span> <span class="p">[</span><span class="s">"test_auc"</span><span class="p">],</span>
    <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2026-01-06-relation_prediction/plot_relation_type_level_aucs-output-1.png" alt="" /></p>

<h3 id="performance-takeaways">Performance takeaways</h3>

<p>Comparing models’ relation-weighted AUC and relation-level AUCs reveals
several patterns:</p>

<ul>
  <li><strong>Expressive relation-aware MLPs achieve top performance.</strong> The
relation-gated MLP and relation-attention MLP heads achieve nearly
equivalent performance (~0.87 relation-weighted AUC), representing
the ceiling for this architecture and training regime. Both combine
relation-specific modulation with multi-layer MLPs to learn
flexible, relation-specific transformations.</li>
  <li><strong>DistMult achieves remarkable parameter efficiency.</strong> Despite using
only ~1,400 parameters (roughly 1/50th of the top relation-aware MLP
heads), DistMult
achieves 0.865 relation-weighted AUC, trailing the top models by
less than 0.01 AUC points. DistMult learns relation-specific scalar
weights for each embedding dimension—the scoring function
$\text{score}(h, r, t) = \sum_i h_i \cdot r_i \cdot t_i$ means each
relation type re-weights the embedding space to emphasize dimensions
where related vertices show correlated (positive weights) or
anti-correlated (negative weights) patterns.</li>
  <li><strong>MLPs enable effective vertex attention.</strong> Both lightweight
attention heads (attention and relation-attention) substantially
underperform (0.833-0.844 AUC), barely exceeding dot-product
performance. The key architectural difference in top-performing
attention-based heads is the MLP that processes concatenated
source-target embeddings before attention, creating learned edge
feature representations that attention can then modulate. Raw
attention over node embeddings alone lacks sufficient expressivity.</li>
  <li><strong>Knowledge graph embedding methods show variable performance.</strong>
RotatE underperforms relative to even the simple dot-product
baseline, likely because treating the 128-dimensional embedding as
64 complex dimensions reduces the vertex representation’s
expressivity. TransE performs moderately better but still lags
behind custom relation-aware heads. The margin loss used by both
methods may be poorly suited for edge prediction—it enforces
pairwise rankings between individual positive-negative pairs rather
than learning distributional differences between positive and
negative edge populations (as BCE loss does). In contrast, DistMult
uses BCE loss, and this likely contributes to its strong
performance.</li>
  <li><strong>Relation-unaware models show relation-type variation.</strong> Even the
simple dot-product and MLP heads show varying performance across
relation types. This likely reflects competing demands on vertex
positioning—vertices involved in many dense interactions (like
“interactor → interactor” edges) face more constraints in the
embedding space than vertices in sparser relation types. Retraining
the dot-product head with relation-weighted BCE (versus standard
BCE) substantially shifts the learned embeddings, demonstrating that
relation-type frequency affects vertex positioning even when
relation type isn’t used as a model input.</li>
</ul>
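<p>The DistMult scoring function referenced above is compact enough to
write out directly. A minimal batched PyTorch sketch (tensor names are
illustrative, not the trained models' attributes):</p>

```python
import torch

def distmult_score(h, t, relation_emb, rel_ids):
    """score(h, r, t) = sum_i h_i * r_i * t_i, batched.

    h, t: (batch, dim) source/target embeddings
    relation_emb: (n_relations, dim) learned per-relation scalars
    rel_ids: (batch,) relation-type index for each edge
    """
    r = relation_emb[rel_ids]       # (batch, dim) gather per-edge relation weights
    return (h * r * t).sum(dim=-1)  # (batch,) re-weighted dot product
```

<p>A negative weight in dimension <em>i</em> rewards anti-correlated
source/target coordinates in that dimension, matching the
interpretation above. One design consequence visible here: the score is
symmetric in <em>h</em> and <em>t</em>, so DistMult by itself cannot
encode edge direction.</p>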

<h2 id="evaluating-relation-aware-models">Evaluating relation-aware models</h2>

<p>To explore whether relation-aware models are making meaningful signed
regulatory predictions, I will evaluate them using three analyses:</p>

<ol>
  <li>Using interpretable relation-aware knowledge graph embedding heads,
I will examine the geometric representation of activation and
inhibition in the learned transformations.</li>
  <li>Exploring top-performing heads, I will examine the strength of
regulatory predictions — are scores similar regardless of the
putative relation type, or do models confidently predict
relation-type-specific interactions?</li>
  <li>Leveraging PerturbSeq data, I will assess whether models’
top-scoring relation-type predictions (activation vs. inhibition)
align with experimentally observed transcriptional responses to
genetic perturbations.</li>
</ol>

<h3 id="what-is-the-geometry-of-activation-and-inhibition">What is the geometry of activation and inhibition?</h3>

<p>To understand how relation-aware heads encode regulatory semantics, I
will examine the learned relation embeddings from three knowledge graph
embedding methods: RotatE (rotation angles), TransE (translation
vectors), and DistMult (dimensional scaling weights). Each method learns
these transformations as model weights, allowing direct inspection of
relation types’ geometric encoding.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">RotatE_head</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="s">"rotate"</span><span class="p">].</span><span class="n">task</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">head</span>
<span class="n">RotatE_phases</span> <span class="o">=</span> <span class="n">RotatE_head</span><span class="p">.</span><span class="n">relation_emb</span><span class="p">.</span><span class="n">weight</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">cpu</span><span class="p">()</span>

<span class="n">TransE_head</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="s">"transe"</span><span class="p">].</span><span class="n">task</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">head</span>
<span class="n">TransE_vectors</span> <span class="o">=</span> <span class="n">TransE_head</span><span class="p">.</span><span class="n">relation_emb</span><span class="p">.</span><span class="n">weight</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">cpu</span><span class="p">()</span>

<span class="n">DistMult_head</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="s">"distmult"</span><span class="p">].</span><span class="n">task</span><span class="p">.</span><span class="n">head</span><span class="p">.</span><span class="n">head</span>
<span class="n">DistMult_scalars</span> <span class="o">=</span> <span class="n">DistMult_head</span><span class="p">.</span><span class="n">relation_emb</span><span class="p">.</span><span class="n">weight</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">cpu</span><span class="p">()</span>
<span class="n">DistMult_deviations</span> <span class="o">=</span> <span class="n">DistMult_scalars</span> <span class="o">-</span> <span class="mf">1.0</span>  <span class="c1"># Deviation from identity
</span>
<span class="c1"># Get relation indices
</span><span class="n">regulatory_relations</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"activation"</span><span class="p">:</span> <span class="n">relation_types</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="s">"stimulator -&gt; modified"</span><span class="p">),</span>
    <span class="s">"inhibition"</span><span class="p">:</span> <span class="n">relation_types</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="s">"inhibitor -&gt; modified"</span><span class="p">),</span>
    <span class="s">"interaction"</span><span class="p">:</span> <span class="n">relation_types</span><span class="p">.</span><span class="n">index</span><span class="p">(</span><span class="s">"interactor -&gt; interactor"</span><span class="p">),</span>
<span class="p">}</span>

<span class="c1"># Compute relation-type similarity matrices for each method
</span><span class="n">similarity_matrices</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">method_name</span><span class="p">,</span> <span class="n">embeddings</span> <span class="ow">in</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"RotatE"</span><span class="p">,</span> <span class="n">RotatE_phases</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"TransE"</span><span class="p">,</span> <span class="n">TransE_vectors</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"DistMult"</span><span class="p">,</span> <span class="n">DistMult_deviations</span><span class="p">)</span>
<span class="p">]:</span>
    <span class="n">embeddings_norm</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">normalize</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">sim_matrix</span> <span class="o">=</span> <span class="n">embeddings_norm</span> <span class="o">@</span> <span class="n">embeddings_norm</span><span class="p">.</span><span class="n">T</span>
    <span class="n">similarity_matrices</span><span class="p">[</span><span class="n">method_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">sim_matrix</span>
</code></pre></div></div>

<p>Rather than examining all relation types globally, I will focus on three
biologically meaningful questions.</p>

<h4 id="do-stimulators-and-inhibitors-have-opposing-transformations">Do stimulators and inhibitors have opposing transformations?</h4>

<p>If activation and inhibition are fundamentally opposite processes, their
learned embeddings should anti-correlate:</p>

\[r_{\text{stimulator → modified}} \approx -r_{\text{inhibitor → modified}}\]

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Calculate median similarity for context
</span><span class="n">median_similarities</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">n_relations</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">relation_types</span><span class="p">)</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">triu_indices</span><span class="p">(</span><span class="n">n_relations</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="k">for</span> <span class="n">method_name</span> <span class="ow">in</span> <span class="p">[</span><span class="s">"RotatE"</span><span class="p">,</span> <span class="s">"TransE"</span><span class="p">,</span> <span class="s">"DistMult"</span><span class="p">]:</span>
    <span class="n">sim_flat</span> <span class="o">=</span> <span class="n">similarity_matrices</span><span class="p">[</span><span class="n">method_name</span><span class="p">][</span><span class="n">mask</span><span class="p">]</span>
    <span class="n">median_similarities</span><span class="p">[</span><span class="n">method_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">sim_flat</span><span class="p">.</span><span class="n">median</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>

<span class="n">stim_vs_inhib</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
    <span class="s">"RotatE"</span><span class="p">:</span> <span class="p">[</span>
        <span class="n">similarity_matrices</span><span class="p">[</span><span class="s">"RotatE"</span><span class="p">][</span><span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"activation"</span><span class="p">],</span> <span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"inhibition"</span><span class="p">]].</span><span class="n">item</span><span class="p">(),</span>
        <span class="n">median_similarities</span><span class="p">[</span><span class="s">"RotatE"</span><span class="p">]</span>
    <span class="p">],</span>
    <span class="s">"TransE"</span><span class="p">:</span> <span class="p">[</span>
        <span class="n">similarity_matrices</span><span class="p">[</span><span class="s">"TransE"</span><span class="p">][</span><span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"activation"</span><span class="p">],</span> <span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"inhibition"</span><span class="p">]].</span><span class="n">item</span><span class="p">(),</span>
        <span class="n">median_similarities</span><span class="p">[</span><span class="s">"TransE"</span><span class="p">]</span>
    <span class="p">],</span>
    <span class="s">"DistMult"</span><span class="p">:</span> <span class="p">[</span>
        <span class="n">similarity_matrices</span><span class="p">[</span><span class="s">"DistMult"</span><span class="p">][</span><span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"activation"</span><span class="p">],</span> <span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"inhibition"</span><span class="p">]].</span><span class="n">item</span><span class="p">(),</span>
        <span class="n">median_similarities</span><span class="p">[</span><span class="s">"DistMult"</span><span class="p">]</span>
    <span class="p">]</span>
<span class="p">},</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Spearman ρ: activation vs. inhibition"</span><span class="p">,</span> <span class="s">"Spearman ρ: median"</span><span class="p">])</span>
<span class="n">stim_vs_inhib</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"metric"</span>

<span class="n">pd_utils</span><span class="p">.</span><span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">stim_vs_inhib</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">stim_vs_inhib</span><span class="p">,</span>
    <span class="n">caption</span><span class="o">=</span><span class="s">"Reaction-type correlation summaries"</span><span class="p">,</span>
    <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span>
    <span class="n">include_index</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Relation-type correlation summaries
</figcaption>

<div class="data-table" style="" data-table="[{&quot;metric&quot;: &quot;Spearman \u03c1: activation vs. inhibition&quot;, &quot;RotatE&quot;: &quot;0.174&quot;, &quot;TransE&quot;: &quot;0.050&quot;, &quot;DistMult&quot;: &quot;0.637&quot;}, {&quot;metric&quot;: &quot;Spearman \u03c1: median&quot;, &quot;RotatE&quot;: &quot;-0.011&quot;, &quot;TransE&quot;: &quot;0.045&quot;, &quot;DistMult&quot;: &quot;0.209&quot;}]" data-columns="[{&quot;title&quot;: &quot;metric&quot;, &quot;field&quot;: &quot;metric&quot;}, {&quot;title&quot;: &quot;RotatE&quot;, &quot;field&quot;: &quot;RotatE&quot;}, {&quot;title&quot;: &quot;TransE&quot;, &quot;field&quot;: &quot;TransE&quot;}, {&quot;title&quot;: &quot;DistMult&quot;, &quot;field&quot;: &quot;DistMult&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p><strong>Activation and inhibition are not geometric opposites.</strong> All three
methods show weak or positive correlation rather than the expected
anti-correlation. The positive correlations suggest that “being a
regulator” is more important than the direction of regulation
(activation versus inhibition) in structuring these transformations —
both relation types emphasize similar regulatory dimensions rather than
encoding opposing effects.</p>

<h4 id="how-are-undirected-relation-types-encoded">How are undirected relation-types encoded?</h4>

<p>Protein-protein interactions are undirected—they exist in the training
data as both A → B and B → A edges with the same “interactor →
interactor” relation type. One might expect this bidirectional structure
to push the relation embedding toward identity (zero rotation, zero
translation, unit scaling), which would naturally satisfy:</p>

\[\text{score}(A, r_{\text{interaction}}, B) \approx \text{score}(B, r_{\text{interaction}}, A)\]

<p>To test whether interactor edges learn identity-like transformations, I
will extract the learned relation embeddings from each model and compare
each relation type’s deviation from identity to the median deviation
across all relation types.</p>
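<p>Before looking at the trained weights, it helps to confirm the
intuition that an identity-like transformation makes the score
symmetric. A small TransE-style sketch with the score
−‖h + r − t‖ (vectors are illustrative, not the trained models):</p>

```python
import torch

def transe_score(h, r, t):
    # TransE: score(h, r, t) = -||h + r - t||
    return -torch.norm(h + r - t, p=2, dim=-1)

A = torch.tensor([1.0, 2.0])
B = torch.tensor([0.5, -1.0])

identity_r = torch.zeros(2)           # zero translation = identity transformation
shifted_r = torch.tensor([1.0, 0.0])  # a non-trivial translation

# identity relation: score is symmetric in the endpoints
sym = torch.isclose(transe_score(A, identity_r, B),
                    transe_score(B, identity_r, A))
# generic relation: symmetry breaks
asym = torch.isclose(transe_score(A, shifted_r, B),
                     transe_score(B, shifted_r, A))
```

<p>With the zero translation, the score reduces to −‖A − B‖, which is
unchanged when A and B swap roles; any nonzero translation generally
scores the two directions differently. The same argument applies to
zero rotations in RotatE and unit scalings in DistMult.</p>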

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">interactor_data</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">method_name</span><span class="p">,</span> <span class="n">embeddings</span> <span class="ow">in</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"RotatE"</span><span class="p">,</span> <span class="n">RotatE_phases</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"TransE"</span><span class="p">,</span> <span class="n">TransE_vectors</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"DistMult"</span><span class="p">,</span> <span class="n">DistMult_deviations</span><span class="p">)</span>
<span class="p">]:</span>
    <span class="n">all_magnitudes</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">.</span><span class="nb">abs</span><span class="p">().</span><span class="nb">sum</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">median_magnitude</span> <span class="o">=</span> <span class="n">all_magnitudes</span><span class="p">.</span><span class="n">median</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>
    <span class="n">interactor_magnitude</span> <span class="o">=</span> <span class="n">all_magnitudes</span><span class="p">[</span><span class="n">regulatory_relations</span><span class="p">[</span><span class="s">"interaction"</span><span class="p">]].</span><span class="n">item</span><span class="p">()</span>
    <span class="n">interactor_data</span><span class="p">[</span><span class="n">method_name</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">interactor_magnitude</span> <span class="o">/</span> <span class="n">median_magnitude</span><span class="p">]</span>

<span class="n">interactor_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
    <span class="n">interactor_data</span><span class="p">,</span> 
    <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"interaction transformation norm</span><span class="se">\n</span><span class="s">÷</span><span class="se">\n</span><span class="s">median transformation norm"</span><span class="p">]</span>
<span class="p">).</span><span class="nb">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">interactor_df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"metric"</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">interactor_df</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Interaction transformation magnitudes"</span><span class="p">,</span>
    <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span>
    <span class="n">wrap_columns</span> <span class="o">=</span> <span class="p">{</span><span class="s">"metric"</span> <span class="p">:</span> <span class="s">"30%"</span><span class="p">},</span>
    <span class="n">include_index</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Interaction transformation magnitudes
</figcaption>

<div class="data-table" style="" data-table="[{&quot;metric&quot;: &quot;interaction transformation norm\n\u00f7\nmedian transformation norm&quot;, &quot;RotatE&quot;: 1.11, &quot;TransE&quot;: 1.0, &quot;DistMult&quot;: 1.24}]" data-columns="[{&quot;title&quot;: &quot;metric&quot;, &quot;field&quot;: &quot;metric&quot;}, {&quot;title&quot;: &quot;RotatE&quot;, &quot;field&quot;: &quot;RotatE&quot;}, {&quot;title&quot;: &quot;TransE&quot;, &quot;field&quot;: &quot;TransE&quot;}, {&quot;title&quot;: &quot;DistMult&quot;, &quot;field&quot;: &quot;DistMult&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p><strong>Interactor edges are not near identity.</strong> All three methods learn
typical or above-median transformations for protein-protein
interactions. While this seems to violate the symmetry requirement for
undirected edges, the loss functions don’t actually enforce equal scores
for A → B and B → A pairs. Instead, they optimize discrimination between
real edges and negative samples. For TransE, edges are scored as
$\text{score}(h, r, t) = -\|h + r - t\|$, but the margin-based loss
compares these scores to negatives — meaning non-zero transformations
can provide discriminative power even without maintaining symmetry.</p>
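<p>As a minimal NumPy sketch (illustrative, not the post’s actual
training code), a zero translation makes TransE scores symmetric, while
the margin loss never compares the two directions of an edge directly:</p>

```python
import numpy as np

def transe_score(h, r, t):
    # TransE scores a triple as the negative distance -||h + r - t||;
    # higher (closer to zero) means more plausible.
    return -np.linalg.norm(h + r - t)

rng = np.random.default_rng(0)
h, t = rng.normal(size=8), rng.normal(size=8)

# A zero (identity) translation scores A -> B and B -> A identically ...
r_zero = np.zeros(8)
assert np.isclose(transe_score(h, r_zero, t), transe_score(t, r_zero, h))

# ... while a non-zero translation generally breaks that symmetry.
r = rng.normal(size=8)
symmetry_gap = abs(transe_score(h, r, t) - transe_score(t, r, h))

# The margin-ranking loss never compares A -> B with B -> A; it only asks
# that a true edge out-score a corrupted negative by a margin, so a
# non-zero (asymmetric) translation is acceptable as long as it separates
# positives from negatives.
t_negative = rng.normal(size=8)
margin = 1.0
loss = max(0.0, margin - transe_score(h, r, t) + transe_score(h, r, t_negative))
```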

<h4 id="do-the-three-methods-agree-on-regulatory-semantics">Do the three methods agree on regulatory semantics?</h4>

<p>If all three methods learn similar patterns for how relation types
relate to each other, it would suggest they’re converging on shared
regulatory semantics.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compare similarity matrices pairwise
</span><span class="n">n_relations</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">relation_types</span><span class="p">)</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">triu_indices</span><span class="p">(</span><span class="n">n_relations</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">agreement_data</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">method1</span><span class="p">,</span> <span class="n">method2</span> <span class="ow">in</span> <span class="n">combinations</span><span class="p">(</span><span class="n">similarity_matrices</span><span class="p">.</span><span class="n">keys</span><span class="p">(),</span> <span class="mi">2</span><span class="p">):</span>
    <span class="n">sim1_flat</span> <span class="o">=</span> <span class="n">similarity_matrices</span><span class="p">[</span><span class="n">method1</span><span class="p">][</span><span class="n">mask</span><span class="p">]</span>
    <span class="n">sim2_flat</span> <span class="o">=</span> <span class="n">similarity_matrices</span><span class="p">[</span><span class="n">method2</span><span class="p">][</span><span class="n">mask</span><span class="p">]</span>
    <span class="n">rho</span> <span class="o">=</span> <span class="n">compute_spearman_correlation_torch</span><span class="p">(</span><span class="n">sim1_flat</span><span class="p">,</span> <span class="n">sim2_flat</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="s">'cpu'</span><span class="p">)</span>
    <span class="n">agreement_data</span><span class="p">[</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">method1</span><span class="si">}</span><span class="s"> vs </span><span class="si">{</span><span class="n">method2</span><span class="si">}</span><span class="s">"</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">rho</span><span class="p">]</span>

<span class="n">agreement_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">agreement_data</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"Spearman ρ"</span><span class="p">])</span>
<span class="n">agreement_df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"metric"</span>

<span class="n">pd_utils</span><span class="p">.</span><span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">agreement_df</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">agreement_df</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Model-to-model comparison of relation-type correlations"</span><span class="p">,</span>
    <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span>
    <span class="n">include_index</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Model-to-model comparison of relation-type correlations
</figcaption>

<div class="data-table" style="" data-table="[{&quot;metric&quot;: &quot;Spearman \u03c1&quot;, &quot;RotatE vs TransE&quot;: &quot;-0.089&quot;, &quot;RotatE vs DistMult&quot;: &quot;0.003&quot;, &quot;TransE vs DistMult&quot;: &quot;0.020&quot;}]" data-columns="[{&quot;title&quot;: &quot;metric&quot;, &quot;field&quot;: &quot;metric&quot;}, {&quot;title&quot;: &quot;RotatE vs TransE&quot;, &quot;field&quot;: &quot;RotatE vs TransE&quot;}, {&quot;title&quot;: &quot;RotatE vs DistMult&quot;, &quot;field&quot;: &quot;RotatE vs DistMult&quot;}, {&quot;title&quot;: &quot;TransE vs DistMult&quot;, &quot;field&quot;: &quot;TransE vs DistMult&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p><strong>The three methods learn different geometric patterns.</strong> The weak
inter-method correlations show that RotatE, TransE, and DistMult don’t
converge on a shared representation of abstract regulatory
relationships; instead, they learn method-specific solutions to the
edge-discrimination task.</p>

<h4 id="geometry-summary">Geometry summary</h4>

<p>The geometric analysis reveals that knowledge graph embedding methods
don’t naturally encode biological intuitions about regulatory
relationships:</p>

<ul>
  <li><strong>Activation and inhibition are not geometric opposites</strong>, showing
weak or positive correlations rather than anti-correlation</li>
  <li><strong>Undirected edges don’t require identity transformations</strong>, with
interaction edges learning typical or above-median transformation
magnitudes despite being bidirectional in the training data</li>
  <li><strong>Methods don’t converge on shared geometric patterns</strong>, suggesting
they learn different solutions rather than discovering universal
principles of abstract biological regulation</li>
</ul>

<p>The training graph is dense and shaped by experimental ascertainment
biases — different relation types are more readily detected for
different subsets of vertices. Knowledge graph embedding heads struggle
to cleanly separate activators from inhibitors through vertex
positioning alone, instead learning transformations that discriminate
real edges from negatives without capturing clear regulatory semantics.
DistMult’s strong performance despite this limitation is impressive.</p>

<h3 id="are-models-predicting-relation-specific-interactions">Are models predicting relation-specific interactions?</h3>

<p>If heads are learning relation-type-specific transformations, edges
should score highly for some relation types and poorly for others. If
heads rely on vertex embeddings alone, edge scores should remain similar
regardless of the assigned relation type.</p>

<p>To evaluate the specificity of relation-type-based scoring, I will score
each test set edge under every possible relation type, then calculate
the Spearman correlation between relation types’ edge score
distributions. High correlations indicate a model assigns similar scores
regardless of relation type, while low correlations suggest
relation-specific predictions.</p>
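<p>The confusion/correlation helper used below isn’t shown; as a rough
sketch of the underlying idea, with randomly generated DistMult-style
embeddings standing in for trained ones, the cross-relation score
correlations could be computed as:</p>

```python
import numpy as np

def spearman_rho(x, y):
    # Spearman correlation via rank transform (assumes no ties).
    rx = x.argsort().argsort().astype(float)
    ry = y.argsort().argsort().astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

rng = np.random.default_rng(1)
n_edges, n_relations, dim = 500, 4, 16
src = rng.normal(size=(n_edges, dim))      # stand-in source-vertex embeddings
dst = rng.normal(size=(n_edges, dim))      # stand-in target-vertex embeddings
rel = rng.normal(size=(n_relations, dim))  # stand-in relation vectors

# DistMult-style score for every edge under every relation type:
# score(h, r, t) = sum(h * r * t), giving one score column per relation
scores = np.stack(
    [(src * rel[k] * dst).sum(axis=1) for k in range(n_relations)], axis=1
)

# Pairwise Spearman rho between relation types' score distributions;
# high values mean edges are scored similarly under either relation type
corr = np.ones((n_relations, n_relations))
for i in range(n_relations):
    for j in range(i + 1, n_relations):
        corr[i, j] = corr[j, i] = spearman_rho(scores[:, i], scores[:, j])
```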

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">CROSS_RELATION_PREDICTION_CACHE</span><span class="p">):</span>
    <span class="n">cross_relation_prediction_matrices</span> <span class="o">=</span> <span class="p">{</span><span class="s">"confusion"</span><span class="p">:</span> <span class="p">{},</span> <span class="s">"correlation"</span><span class="p">:</span> <span class="p">{}}</span>
    <span class="k">for</span> <span class="n">experiment</span> <span class="ow">in</span> <span class="n">RELATION_AWARE_FOCUSED_HEADS</span><span class="p">:</span>
        <span class="n">model</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="n">experiment</span><span class="p">]</span>
        <span class="n">cross_relation_prediction_matrices</span><span class="p">[</span><span class="s">"confusion"</span><span class="p">][</span><span class="n">experiment</span><span class="p">],</span> <span class="n">cross_relation_prediction_matrices</span><span class="p">[</span><span class="s">"correlation"</span><span class="p">][</span><span class="n">experiment</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
            <span class="n">calculate_relation_type_confusion_and_correlation</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">napistu_data</span><span class="p">,</span> <span class="n">normalize</span><span class="o">=</span><span class="s">"true"</span><span class="p">)</span>
        <span class="p">)</span>

    <span class="n">napistu_utils</span><span class="p">.</span><span class="n">save_pickle</span><span class="p">(</span><span class="n">CROSS_RELATION_PREDICTION_CACHE</span><span class="p">,</span> <span class="n">cross_relation_prediction_matrices</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">cross_relation_prediction_matrices</span> <span class="o">=</span> <span class="n">napistu_utils</span><span class="p">.</span><span class="n">load_pickle</span><span class="p">(</span><span class="n">CROSS_RELATION_PREDICTION_CACHE</span><span class="p">)</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">16</span><span class="p">))</span>
<span class="n">axes</span> <span class="o">=</span> <span class="n">axes</span><span class="p">.</span><span class="n">flatten</span><span class="p">()</span>

<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">head</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">RELATION_AWARE_FOCUSED_HEADS</span><span class="p">):</span>
    <span class="n">correlation_matrix</span> <span class="o">=</span> <span class="n">cross_relation_prediction_matrices</span><span class="p">[</span><span class="s">"correlation"</span><span class="p">][</span><span class="n">head</span><span class="p">]</span>
    <span class="n">title</span> <span class="o">=</span> <span class="n">textwrap</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">HEAD_DESCRIPTIONS</span><span class="p">[</span><span class="n">head</span><span class="p">][</span><span class="s">"label"</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
    <span class="n">plot_heatmap</span><span class="p">(</span>
        <span class="n">correlation_matrix</span><span class="p">,</span>
        <span class="n">row_labels</span><span class="o">=</span><span class="n">relation_types</span><span class="p">,</span>
        <span class="n">title</span><span class="o">=</span><span class="n">title</span><span class="p">,</span>
        <span class="n">cmap</span><span class="o">=</span><span class="s">'magma'</span><span class="p">,</span>
        <span class="n">cbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="n">fmt</span><span class="o">=</span><span class="s">'.2f'</span><span class="p">,</span>
        <span class="n">vmax</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">vmin</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
        <span class="n">cbar_label</span><span class="o">=</span><span class="s">'Spearman ρ'</span><span class="p">,</span>
        <span class="n">mask_upper_triangle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">square</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">cluster</span><span class="o">=</span><span class="s">'both'</span><span class="p">,</span>
        <span class="n">cluster_method</span><span class="o">=</span><span class="s">'average'</span><span class="p">,</span>
        <span class="n">cluster_metric</span><span class="o">=</span><span class="s">'euclidean'</span><span class="p">,</span>
        <span class="n">title_size</span><span class="o">=</span><span class="mi">22</span><span class="p">,</span>
        <span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span>
    <span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2026-01-06-relation_prediction/calculate_relation_type_score_correlations-output-1.png" alt="" /></p>

<p>From these cross-relation-type score correlations, several patterns
emerge:</p>

<ul>
  <li><strong>Top-performing models show strong relation-type specificity.</strong>
DistMult, relation-gated MLP, and relation-attention MLP heads all
generate predictions that are highly dependent on relation type,
with lower cross-relation correlations indicating distinct scoring
patterns for different edge types. This relation-type specificity
appears essential for achieving top performance (&gt;0.86
relation-weighted AUC).</li>
  <li><strong>High-performing heads share common structural patterns.</strong> The top
three models’ relation-type score correlation structures are similar
(ρ ≈ 0.75 between DistMult and either MLP, and ρ = 0.96 between the
two MLPs), suggesting they are learning similar patterns. This is
particularly apparent for “modifier → modified” edges, which are
distinguished from other relation types. This may reflect a
source-specific curation quirk that provides a strong training
signal: “modifier → modified” edges arise primarily from a single
source (Omnipath) when an interaction is annotated as both activation
and inhibition, creating a distinctive pattern that these models
learn to recognize.</li>
  <li><strong>TransE shows limited relation-type differentiation.</strong> TransE
assigns similar scores to edges regardless of relation type, as
evidenced by uniformly high cross-relation correlations. This
indicates that it relies more heavily on source- and target-vertex
embedding similarity than on its learned relation-specific
translations, capturing relation-agnostic discriminative signal
rather than true regulatory semantics.</li>
</ul>

<h3 id="validating-signed-predictions-with-perturbseq">Validating signed predictions with PerturbSeq</h3>

<p>To evaluate whether relation-aware heads can predict not just
interaction existence but regulatory direction (activation
vs. inhibition), I need datasets where molecular species are
systematically perturbed and their directed impacts on other species are
measured. While many such experiments exist, most are either already
incorporated into the graph through resources like STRING and IntAct, or
haven’t been aggregated at sufficient scale for validation.</p>

<p>PerturbSeq experiments — where genes are perturbed using CRISPR and
transcriptome-wide impacts are measured — provide an ideal validation
source. Large-scale datasets like Replogle et al. (2022, Cell) perturb
many genes, while smaller studies investigate targeted hypotheses (e.g.,
human mutation knock-ins). However, the original Replogle dataset
reports only Anderson-Darling q-values, not signed fold-changes. I
therefore turned to Harmonizome, an ongoing effort from the Ma’ayan lab
at Mount Sinai to generate and compare diverse gene-centric profiles
(Diamant et al., 2025, Nucleic Acids Research). Harmonizome provides
signed PerturbSeq fold-changes from both the Replogle datasets and
PerturbAtlas in consistent data formats.</p>

<p>To evaluate model predictions against PerturbSeq data, I:</p>

<ol>
  <li><strong>Mapped PerturbSeq data to graph vertices.</strong> Loaded species
identifiers (systematic identifier to molecular species ID mappings)
and compartmentalized species ID maps to translate PerturbSeq
systematic identifiers into graph vertex IDs.</li>
  <li><strong>Processed Harmonizome PerturbSeq interactions:</strong>
    <ul>
      <li>Mapped source and target genes to vertex IDs using the adapters
from step 1</li>
      <li>Selected strong perturbations by comparing Harmonizome values to
their thresholds</li>
      <li>Inferred regulatory direction from perturbation type and
fold-change direction:
        <ul>
          <li>Overexpression + upregulation → activation</li>
          <li>Overexpression + downregulation → inhibition</li>
          <li>Knockout/knockdown + upregulation → inhibition
(de-repression)</li>
          <li>Knockout/knockdown + downregulation → activation</li>
          <li>Other perturbation types (e.g., knock-ins) were excluded</li>
        </ul>
      </li>
    </ul>
  </li>
  <li><strong>Generated relation-type predictions.</strong> For each PerturbSeq edge
(represented as source-target vertex indices), I scored both
activating (“stimulator → modified”) and repressive (“inhibitor →
modified”) relation types using each model. I assigned the
higher-scoring relation type as the predicted regulatory direction.</li>
  <li><strong>Compared predictions to PerturbSeq ground truth.</strong> I constructed
2×2 contingency tables comparing predicted relation types
(activation vs. inhibition) to observed PerturbSeq directions,
calculated significance using χ² tests, and quantified the agreement
between predicted and observed regulatory directions.</li>
</ol>
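<p>The direction-inference rules from step 2 can be sketched as a small
helper (hypothetical; the post’s actual <code>assign_predicted_direction</code>
operates on a DataFrame and isn’t shown):</p>

```python
def infer_direction(perturbation_type, log_fold_change):
    # Map (perturbation type, observed fold-change sign) to an implied
    # regulatory direction; return None for ambiguous perturbation types.
    upregulated = log_fold_change > 0
    if perturbation_type == "overexpression":
        return "activation" if upregulated else "inhibition"
    if perturbation_type in ("knockout", "knockdown"):
        # a target that rises when its regulator is removed was repressed
        return "inhibition" if upregulated else "activation"
    return None  # e.g., knock-ins and point mutations are excluded

assert infer_direction("overexpression", 1.5) == "activation"
assert infer_direction("knockdown", 2.0) == "inhibition"  # de-repression
assert infer_direction("knock-in", 1.0) is None
```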

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># map relation_types to relation_type indices so we can just score select relation_types
</span><span class="n">relation_type_indices</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span> <span class="p">:</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">relation_types</span><span class="p">)}</span>
<span class="n">focused_relation_type_indices</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span> <span class="p">:</span> <span class="n">relation_type_indices</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">PERTURBSEQ_RELATION_TYPES</span><span class="p">}</span>

<span class="c1"># load PerturbSeq results from Harmonizome, map to vertices, and filter to strong inhibition and activation
</span><span class="n">distinct_harmonizome_perturbseq_interactions</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">load_harmonizome_perturbseq_datasets</span><span class="p">(</span><span class="n">LOCAL_HARMONIZOME_DATA_DIR</span><span class="p">,</span> <span class="n">species_identifiers</span><span class="p">)</span>
    <span class="c1"># rollup to 1 entry per dataset, source, target, and perturbation type
</span>    <span class="p">.</span><span class="n">pipe</span><span class="p">(</span><span class="n">_get_distinct_harmonizome_perturbseq_interactions</span><span class="p">)</span>
    <span class="c1"># filter to only perturbation types where predicted direction is clear (e.g., ignore knock-ins and mutations)
</span>    <span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"perturbation_type in @SIGNED_PERTURBATION_TYPES"</span><span class="p">)</span>
    <span class="c1"># assign predicted direction (strong inhibition, strong activation)
</span>    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">perturbseq_prediction</span><span class="o">=</span><span class="k">lambda</span> <span class="n">df</span><span class="p">:</span> <span class="n">assign_predicted_direction</span><span class="p">(</span><span class="n">df</span><span class="p">))</span>
    <span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"perturbseq_prediction in @STRONG_ORDERED_SIGNED_PERTURBSEQ_DIRECTIONS"</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># pull out all unique from-to species id pairs
</span><span class="n">distinct_perturbseq_pairs</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">distinct_harmonizome_perturbseq_interactions</span><span class="p">[[</span><span class="s">"perturbed_species_id"</span><span class="p">,</span> <span class="s">"target_species_id"</span><span class="p">]]</span>
    <span class="p">.</span><span class="n">drop_duplicates</span><span class="p">()</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Map from species_id to integer_ids and convert to tensor
</span><span class="n">perturbseq_edgelist_tensor</span> <span class="o">=</span> <span class="n">get_perturbseq_edgelist_tensor</span><span class="p">(</span>
    <span class="n">distinct_perturbseq_pairs</span><span class="p">,</span>
    <span class="n">name_to_sid_map</span>
<span class="p">)</span>

<span class="c1"># create predictions for the models of interest
</span><span class="n">predicted_relation_type_vs_perturbseq_truth_predictions</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">predicted_relation_type_vs_perturbseq_truth_pvalues</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">experiment</span> <span class="ow">in</span> <span class="n">RELATION_AWARE_FOCUSED_HEADS</span><span class="p">:</span>
    
    <span class="n">model</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="n">experiment</span><span class="p">]</span>

    <span class="n">summaries</span> <span class="o">=</span> <span class="n">compare_relation_type_predictions_to_perturbseq_truth</span><span class="p">(</span>
        <span class="n">model</span><span class="p">,</span>
        <span class="n">focused_relation_type_indices</span><span class="p">,</span>
        <span class="n">perturbseq_edgelist_tensor</span><span class="p">,</span>
        <span class="n">napistu_data</span><span class="p">,</span>
        <span class="n">distinct_perturbseq_pairs</span><span class="p">,</span>
        <span class="n">distinct_harmonizome_perturbseq_interactions</span><span class="p">,</span>
        <span class="n">STRONG_ORDERED_SIGNED_PERTURBSEQ_DIRECTIONS</span><span class="p">,</span>
        <span class="n">PERTURBSEQ_RELATION_TYPES</span><span class="p">,</span>
    <span class="p">)</span>

    <span class="n">predicted_relation_type_vs_perturbseq_truth_predictions</span><span class="p">[</span><span class="n">experiment</span><span class="p">],</span> <span class="n">predicted_relation_type_vs_perturbseq_truth_pvalues</span><span class="p">[</span><span class="n">experiment</span><span class="p">]</span> <span class="o">=</span> <span class="n">summaries</span>

<span class="c1"># create the heatmaps
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="n">axes</span> <span class="o">=</span> <span class="n">axes</span><span class="p">.</span><span class="n">flatten</span><span class="p">()</span>

<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">head</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">RELATION_AWARE_FOCUSED_HEADS</span><span class="p">):</span>
    
    <span class="n">dat</span> <span class="o">=</span> <span class="n">predicted_relation_type_vs_perturbseq_truth_predictions</span><span class="p">[</span><span class="n">head</span><span class="p">]</span>
    <span class="n">title</span> <span class="o">=</span> <span class="n">textwrap</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">HEAD_DESCRIPTIONS</span><span class="p">[</span><span class="n">head</span><span class="p">][</span><span class="s">"label"</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">25</span><span class="p">)</span> <span class="o">+</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="sa">f</span><span class="s">"log10p = </span><span class="si">{</span><span class="n">predicted_relation_type_vs_perturbseq_truth_pvalues</span><span class="p">[</span><span class="n">head</span><span class="p">].</span><span class="nb">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="si">}</span><span class="s">"</span>
    <span class="n">row_labels</span> <span class="o">=</span> <span class="p">[</span><span class="n">textwrap</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">PERTURBSEQ_RELATION_TYPES</span><span class="p">]</span>
    <span class="n">col_labels</span> <span class="o">=</span> <span class="p">[</span><span class="n">textwrap</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">STRONG_ORDERED_SIGNED_PERTURBSEQ_DIRECTIONS</span><span class="p">]</span>
    <span class="n">x_label</span> <span class="o">=</span> <span class="s">"PerturbSeq Prediction"</span> <span class="k">if</span> <span class="n">idx</span> <span class="o">&gt;=</span> <span class="mi">2</span> <span class="k">else</span> <span class="bp">None</span>
    <span class="n">y_label</span> <span class="o">=</span> <span class="s">"Top Scoring Relation-Type"</span> <span class="k">if</span> <span class="n">idx</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="bp">None</span>
    
    <span class="n">plot_heatmap</span><span class="p">(</span>
        <span class="n">dat</span><span class="p">,</span>
        <span class="n">row_labels</span><span class="o">=</span><span class="n">row_labels</span><span class="p">,</span>
        <span class="n">column_labels</span><span class="o">=</span><span class="n">col_labels</span><span class="p">,</span>
        <span class="n">title</span><span class="o">=</span><span class="n">title</span><span class="p">,</span>
        <span class="n">xlabel</span><span class="o">=</span><span class="n">x_label</span><span class="p">,</span>
        <span class="n">ylabel</span><span class="o">=</span><span class="n">y_label</span><span class="p">,</span>
        <span class="n">cmap</span><span class="o">=</span><span class="s">'magma'</span><span class="p">,</span>
        <span class="n">fmt</span><span class="o">=</span><span class="s">'.2f'</span><span class="p">,</span>
        <span class="n">vmin</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
        <span class="n">vmax</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span>
        <span class="n">cbar_label</span><span class="o">=</span><span class="s">'Proportion'</span><span class="p">,</span>
        <span class="n">square</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">cbar</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="n">title_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
        <span class="n">label_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
        <span class="n">axis_title_size</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span>
        <span class="n">annot_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
        <span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span>
    <span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2026-01-06-relation_prediction/compare_perturbseq_truth_to_relation_type_predictions-output-1.png" alt="" /></p>

<p>Several patterns emerge from these contingency tables:</p>

<ul>
  <li><strong>Relation-type scores lack cross-calibration.</strong> Average scores can
vary substantially between relation types within a model. This is
most apparent for TransE, where inhibitory edges score
systematically higher than activating edges, resulting in
predominantly inhibitory predictions regardless of ground truth.
Since the loss function compares real edges to negative samples
within the same relation-type stratum, models have no incentive to
calibrate scores across relation types.</li>
  <li><strong>TransE predictions are independent of PerturbSeq ground truth.</strong>
TransE’s top-scoring relation types are only weakly associated with
observed regulatory direction (log10p = -2.98). There is actually a
slight enrichment in the off-diagonal quadrants (top-right and
bottom-left), where predictions disagree with CRISPR results.</li>
  <li><strong>Top-performing relation-aware heads achieve strong agreement with
ground truth.</strong> All three of the top-performing models (DistMult,
the relation-gated MLP and relation-aware attention heads) show
striking enrichment along the diagonal (top-left and bottom-right
quadrants), indicating correct prediction of regulatory direction.
The statistical significance is overwhelming (log10p &lt; -300), with
meaningful visual enrichment patterns where predicted
activation/inhibition aligns with observed PerturbSeq responses.</li>
</ul>
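<p>As a sanity check on significance figures like these, the association between
top-scoring relation type and observed regulatory direction can be tested
directly on a contingency table. The sketch below uses a hypothetical 2×2
table; the counts are illustrative only, not the actual data:</p>

```python
# Association test for a (hypothetical) contingency table of
# predicted relation type vs. observed PerturbSeq direction.
import numpy as np
from scipy.stats import chi2_contingency

# rows: top-scoring relation type (activation, inhibition)
# cols: observed PerturbSeq direction (up, down)
table = np.array([[800, 200],
                  [150, 850]])

chi2, p, dof, expected = chi2_contingency(table)
log10p = np.log10(p) if p > 0 else -np.inf  # guard against underflow to 0
print(f"chi2 = {chi2:.1f}, dof = {dof}, log10p = {log10p:.1f}")
```

<p>For strongly associated tables the p-value rapidly falls below what a double
can represent, which is one reason extreme results are often reported as bounds
such as log10p &lt; -300.</p>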

<p>The agreement between top-scoring relation types and CRISPR ground truth
is particularly impressive given several important caveats:</p>

<ul>
  <li><strong>Regulatory ground truth is inherently muddy.</strong> Harmonizome’s
predicted regulatory calls show limited alignment with the
Anderson-Darling q-values reported in the original Replogle
supplement, highlighting the fundamental difficulty of establishing
definitive in vivo regulatory ground truth from experimental data.</li>
  <li><strong>PerturbSeq captures both direct and indirect effects.</strong> CRISPRi
perturbations measure transcriptome-wide changes that include both
immediate regulatory targets and downstream cascade effects. While
I’d expect models to predict direct interactions more accurately,
the training data itself contains a mixture of direct and indirect
interactions, and relation types may provide a means for expressing
this distinction.</li>
  <li><strong>Signed regulatory edges are rare in the training data.</strong>
Activation (“stimulator → modified”) and inhibition (“inhibitor →
modified”) edges comprise a small fraction of the graph (~3% of
edges) compared to undirected physical interactions. The fact that
models can accurately distinguish these regulatory directions
despite their relative scarcity demonstrates that these concepts are
effectively encoded into the vertex and relation-type embeddings.</li>
</ul>
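<p>The scarcity of signed edges is easy to quantify as a tally of relation-type
frequencies. This toy sketch uses hypothetical relation labels and counts, not
the actual Octopus edge table:</p>

```python
# Relation-type imbalance in a toy edge table: the label strings and
# counts below are illustrative stand-ins for the real graph's edges.
import pandas as pd

edges = pd.DataFrame({
    "relation": (["interacts_with"] * 97
                 + ["stimulator -> modified"] * 2
                 + ["inhibitor -> modified"] * 1)
})
shares = edges["relation"].value_counts(normalize=True)
print(shares)  # signed regulatory edges are ~3% of this toy graph
```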

<h2 id="published-models">Published models</h2>

<p>All of the
<a href="https://huggingface.co/datasets/seanhacks/relation_prediction">data</a>
and models used in this analysis are available on Hugging Face.</p>

<p>The best-performing relation-aware models may be of particular interest
to others. These are <strong>128-dim GraphConv encoders trained for
relation-stratified edge prediction</strong> with the following heads:</p>

<ul>
  <li><a href="https://huggingface.co/seanhacks/relation_prediction_distmult_128e">DistMult
head</a></li>
  <li><a href="https://huggingface.co/seanhacks/relation_gated_mlp">Relation-gated MLP
head</a></li>
  <li><a href="https://huggingface.co/seanhacks/relation_attention_mlp">Relation-attention MLP
head</a></li>
</ul>

<h2 id="summary">Summary</h2>

<p>Relation-aware graph neural networks offer a promising path forward for
predicting signed regulatory interactions — a major blind spot in
current virtual cell modeling efforts. While large-scale single-cell
RNA-seq atlases have enabled unprecedented molecular profiling,
translating these observations into predictive models of cellular
regulation requires distinguishing activation from inhibition, not just
identifying that interactions exist.</p>

<p>The results here validate both the expressiveness of the expansive Napistu
graphs and the power of mining them with graph neural networks:</p>

<ul>
  <li><strong>Appropriate architectures matter more than parameter count.</strong>
Top-performing heads (DistMult, relation-gated MLP,
relation-attention MLP) all achieve strong relation-type
specificity, but through different mechanisms. DistMult accomplishes
this with minimal parameters (~1,400) through dimensional
weighting, while MLP-based heads use 60-90K parameters for gating or
attention mechanisms. Critically, raw attention heads substantially
underperform despite having 20× more parameters than DistMult,
demonstrating that architectural choices trump raw model size.</li>
  <li><strong>Learned relation embeddings prioritize discrimination over
semantic meaning.</strong> Activation and inhibition — biologically
opposing processes — produce similar rather than anti-correlated
geometric transformations. Undirected edges are not encoded with
symmetric transformations that would score A→B and B→A equally. The
geometric patterns learned by these methods reflect statistical
structure useful for edge discrimination, rather than interpretable
regulatory semantics.</li>
  <li><strong>PerturbSeq validation demonstrates biological grounding.</strong>
Top-performing models show impressive agreement with CRISPR
perturbation ground truth, correctly distinguishing activation from
inhibition with overwhelming statistical significance (log10p &lt;
-300). This validation against orthogonal experimental data confirms
the models have learned biologically meaningful representations of
regulation.</li>
</ul>
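<p>The DistMult parameter count quoted above is easy to verify: the head's only
learned parameters are one vector per relation type. A minimal sketch, assuming
128-dim embeddings and 11 relation types (the relation count is an illustrative
figure consistent with ~1,400 parameters):</p>

```python
# Minimal DistMult head: score(a, r, b) = sum_i a_i * r_i * b_i.
# The relation count (11) is an assumed figure for illustration.
import torch
import torch.nn as nn

class DistMultHead(nn.Module):
    def __init__(self, num_relations: int, dim: int):
        super().__init__()
        self.rel = nn.Embedding(num_relations, dim)  # the head's only parameters

    def forward(self, src: torch.Tensor, dst: torch.Tensor, rel_idx: torch.Tensor) -> torch.Tensor:
        # reweight each embedding dimension per relation, then dot product
        return (src * self.rel(rel_idx) * dst).sum(dim=-1)

head = DistMultHead(num_relations=11, dim=128)
n_params = sum(p.numel() for p in head.parameters())
print(n_params)  # 1408

scores = head(torch.randn(4, 128), torch.randn(4, 128), torch.zeros(4, dtype=torch.long))
```

<p>Note that because each relation merely rescales embedding dimensions, the
DistMult score is symmetric in its two vertex arguments: A→B and B→A receive
identical scores for the same relation.</p>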

<p>This work opens several opportunities for refinement. Adapting loss
functions to calibrate scores across relation types would enable more
interpretable cross-relation comparisons and better support for
predicting novel interaction types. Continued architectural innovation
in encoder–decoder designs — particularly in how relation information
gates or modulates vertex representations — could further improve the
semantic encoding of regulatory concepts. These advances would
strengthen the foundation for computational models capable of predicting
not just molecular associations, but also their functional consequences
in cellular systems.</p>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="ML" /><category term="python" /><category term="GNNs" /><category term="PyTorch" /><summary type="html"><![CDATA[In my last post, I discussed self-supervised edge prediction as a way of embedding genes using a gene-regulatory network. This approach allows genes, metabolites, drugs and other vertices to be connected based on shared network topology. However, to date I’ve only discussed edge prediction using a dot-product head, where a vertex-pair’s edge support is a direct readout of their similarity in embedding space (𝐚 · 𝐛). While surprisingly powerful, this head has limitations when vertices are heterogeneous or interact in qualitatively different ways — particularly when we want to distinguish between activation and inhibition. Here, I explore more expressive approaches for learning mappings between A → B by evaluating both general edge prediction heads (like MLPs) and “relation-aware” heads that can learn distinct mappings for different edge types. 
The post will cover: data model and training changes enabling relation-specific predictions; geometric analysis revealing how relation-aware heads encode regulatory semantics; PerturbSeq validation demonstrating successful prediction of signed regulatory interactions; and pre-trained models available on HuggingFace.]]></summary></entry><entry><title type="html">Napistu meets PyTorch Geometric - Predicting Regulatory Interactions with Graph Neural Networks</title><link href="https://www.shackett.org/napistu_torch/" rel="alternate" type="text/html" title="Napistu meets PyTorch Geometric - Predicting Regulatory Interactions with Graph Neural Networks" /><published>2025-11-19T00:00:00+00:00</published><updated>2025-11-19T00:00:00+00:00</updated><id>https://www.shackett.org/napistu_torch</id><content type="html" xml:base="https://www.shackett.org/napistu_torch/"><![CDATA[<p>Biological applications of graph neural networks (GNNs) typically work
with either small curated networks (100s-1,000s of nodes) or
aggressively filtered subsets of large databases like STRING. The
Octopus graph — which I introduced in my <a href="https://www.shackett.org/octopus_network/">previous
post</a> — occupies a
different space entirely. By integrating eight complementary pathway
databases, it creates a genome-scale network with ~50K proteins,
metabolites, and complexes spanning ~10M edges, all while preserving
rich metadata about edge provenance, confidence scores, and mechanistic
detail that filtered approaches discard.</p>

<p>This puts the Octopus in uncharted territory: <strong>large enough to capture
genome-scale complexity, yet structured enough to preserve the
biological interpretability that makes network analysis valuable</strong>. GNNs
scale well beyond genome-scale requirements (100M+ nodes in social
networks), but remain unexplored for comprehensive biological networks
that integrate regulatory, metabolic, and interaction data. Bridging
this gap requires infrastructure that handles both the biological
complexity of multi-source networks and the engineering complexity of
training GNNs at scale.</p>

<p>In this post, <strong>I’ll introduce
<a href="https://github.com/napistu/napistu-torch">Napistu-Torch</a> — the
infrastructure that finally makes this space navigable</strong>. Available from
<a href="https://pypi.org/project/napistu-torch/">PyPI</a> and indexed by the
<a href="https://www.shackett.org/napistu_mcp/">Napistu MCP server</a>,
Napistu-Torch provides a modular, reproducible framework for training
GNNs on comprehensive biological networks. I’ll demonstrate that it’s
feasible to train graph convolutional networks on the complete Octopus
network using just a laptop (albeit with 2 days of training time for the
full suite of models). But the real contribution is the ecosystem: the
data structures, pipelines, and evaluation strategies that unlock far
more sophisticated analyses.</p>

<!--more-->

<p>Specifically, I’ll walk through the key components of the Napistu-Torch
ecosystem:</p>

<ul>
  <li>
    <p><strong>Data engineering</strong>: Converting <code class="language-plaintext highlighter-rouge">NapistuGraph</code> objects into PyTorch
Geometric <code class="language-plaintext highlighter-rouge">Data</code> objects while preserving the Octopus network’s rich
vertex and edge metadata. The <code class="language-plaintext highlighter-rouge">NapistuDataStore</code> manages caching and
lazy loading of derived artifacts, eliminating the overhead of
rebuilding datasets — you can immediately start training or
evaluating models.</p>
  </li>
  <li>
    <p><strong>Model components</strong>: Breaking down the anatomy of a GNN into its
core building blocks — encoders that learn vertex representations
via message passing, heads that make predictions from embeddings,
and optional edge encoders that weight edges based on metadata. I’ll
compare several architectures (GCN, GraphSAGE, GraphConv) with and
without edge encoding.</p>
  </li>
  <li>
    <p><strong>Training infrastructure</strong>: Leveraging PyTorch Lightning to
orchestrate model training with minimal boilerplate. Configuration
files define entire experiments, making it easy to reproduce results
or modify architectures without touching code. The CLI supports the
full train-test workflow with automatic experiment tracking via
Weights &amp; Biases.</p>
  </li>
  <li>
    <p><strong>Self-supervised learning</strong>: Training without ground truth labels
by framing the task as edge prediction. The key challenge is forcing
the model to learn real biological patterns through careful negative
sampling — ensuring that negative examples aren’t trivially
distinguishable from true edges while remaining computationally
tractable at the scale of millions of edges.</p>
  </li>
  <li>
    <p><strong>Model interpretation</strong>: Evaluating what the models learn through
three lenses: (1) vertex embeddings that capture molecular
similarity and pathway membership, (2) learned edge weights that
reveal what makes a high-confidence interaction, and (3) edge
prediction patterns that assess whether the model is learning
biological structure versus discovering topological constraints.</p>
  </li>
</ul>
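<p>The negative-sampling step described above can be sketched as rejection
sampling: draw random vertex pairs and discard any that are real edges. This is
a minimal illustration of the idea; the actual pipeline additionally stratifies
negatives (e.g. by relation type) so they are not trivially distinguishable:</p>

```python
# Rejection-based negative sampling: random vertex pairs that are not
# existing edges, returned in the same [2, N] layout as edge_index.
import torch

def sample_negative_edges(edge_index: torch.Tensor, num_nodes: int, num_samples: int) -> torch.Tensor:
    existing = set(map(tuple, edge_index.t().tolist()))
    negatives = []
    while len(negatives) < num_samples:
        src = torch.randint(num_nodes, (1,)).item()
        dst = torch.randint(num_nodes, (1,)).item()
        # reject self-loops and true edges
        if src != dst and (src, dst) not in existing:
            negatives.append((src, dst))
    return torch.tensor(negatives).t()

edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
neg = sample_negative_edges(edge_index, num_nodes=5, num_samples=4)
print(neg.shape)  # torch.Size([2, 4])
```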

<h2 id="from-napistu-to-pytorch-geometric">From Napistu to PyTorch Geometric</h2>

<p>Graph neural networks learn representations of nodes and edges by
iteratively aggregating information from local neighborhoods — a
process called message passing. Unlike traditional neural networks that
operate on fixed-size inputs, GNNs can handle graphs of arbitrary size
and structure, making them well-suited for biological networks where
connectivity patterns encode meaningful relationships. Through multiple
rounds of message passing, GNNs capture increasingly complex structural
patterns, from immediate neighbors to broader network motifs.</p>
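<p>A single round of message passing can be sketched in plain PyTorch:
aggregate each vertex's incoming neighbor features (here by mean), then apply a
learned transform. This illustrates the general mechanism, not any specific
Napistu-Torch layer:</p>

```python
# One round of message passing: mean-aggregate neighbor features per
# target vertex, then transform with a weight matrix and nonlinearity.
import torch

x = torch.randn(4, 8)                     # 4 vertices, 8 features each
edge_index = torch.tensor([[0, 1, 2, 3],  # source vertices
                           [1, 2, 3, 0]]) # target vertices
W = torch.randn(8, 8)                     # learned transform (random here)

agg = torch.zeros_like(x)
agg.index_add_(0, edge_index[1], x[edge_index[0]])  # sum incoming messages
deg = torch.bincount(edge_index[1], minlength=x.size(0)).clamp(min=1).unsqueeze(1)
h = torch.relu((agg / deg) @ W)           # mean-aggregate, transform, activate
print(h.shape)  # torch.Size([4, 8])
```

<p>Stacking several such rounds is what lets a GNN see beyond immediate
neighbors to broader network motifs.</p>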

<p>Graph neural networks in Python typically use <a href="https://pytorch-geometric.readthedocs.io/">PyTorch Geometric
(PyG)</a>, a library that
extends PyTorch with data structures and operations optimized for
graph-structured data. PyG represents graphs using the <code class="language-plaintext highlighter-rouge">Data</code> class,
which stores node features, edge connectivity, and optional edge
attributes as PyTorch tensors — the fundamental format needed for
GPU-accelerated training.</p>

<p>Napistu networks, however, live in a different ecosystem. A
<code class="language-plaintext highlighter-rouge">NapistuGraph</code> (subclass of <code class="language-plaintext highlighter-rouge">igraph.Graph</code>) stores biological networks
with rich vertex and edge metadata — species types, reaction mechanisms,
database provenance, confidence scores. Training GNNs on these networks
requires bridging these two worlds: preserving Napistu’s biological
metadata while converting graphs into PyG’s tensor-based format.</p>

<p>This is where Napistu-Torch comes in. Following the same design
philosophy as Napistu-Py — extending established frameworks rather
than reinventing them — Napistu-Torch builds on PyG with biology-aware
data structures and methods. The goal is to lean on well-established
frameworks like PyG, PyTorch Lightning, and Weights &amp; Biases so the
codebase can focus on domain-specific challenges: encoding biological
signals, integrating diverse metadata, and evaluating models with
biologically meaningful metrics.</p>

<h3 id="napistugraph--napistudata">NapistuGraph → NapistuData</h3>

<p>A key data structure in Napistu-Torch is <code class="language-plaintext highlighter-rouge">NapistuData</code>, which extends
PyG’s <code class="language-plaintext highlighter-rouge">Data</code> class to handle biological network metadata. At its core,
it contains the same PyTorch tensor components that any PyG model
expects:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">x</code>: vertex attributes [# vertices × # of vertex features]</li>
  <li><code class="language-plaintext highlighter-rouge">edge_index</code>: graph connectivity [2 × # of edges]</li>
  <li><code class="language-plaintext highlighter-rouge">edge_attr</code> (optional): edge attributes [# edges × # of edge
features]</li>
  <li><code class="language-plaintext highlighter-rouge">edge_weight</code> (optional): edge weights [# of edges × 1]</li>
  <li><code class="language-plaintext highlighter-rouge">y</code> (optional): node labels for supervised tasks [# of vertices ×
1]</li>
</ul>
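<p>A toy example of tensors with these shapes (3 vertices, 4 edges, 2 vertex
features, 1 edge feature; all values are arbitrary):</p>

```python
# Toy tensors matching the layout listed above.
import torch

num_nodes, num_edges = 3, 4
x = torch.randn(num_nodes, 2)                 # [# vertices x # vertex features]
edge_index = torch.tensor([[0, 1, 2, 0],
                           [1, 2, 0, 2]])     # [2 x # edges]
edge_attr = torch.randn(num_edges, 1)         # [# edges x # edge features]
edge_weight = torch.rand(num_edges)           # one weight per edge
y = torch.zeros(num_nodes, dtype=torch.long)  # node labels for supervised tasks
print(x.shape, edge_index.shape)
```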

<p>But <code class="language-plaintext highlighter-rouge">NapistuData</code> also tracks Napistu-specific metadata — feature
encoders, vertex and edge masks for train/val/test splits, and mappings
back to the original <code class="language-plaintext highlighter-rouge">NapistuGraph</code> identifiers.</p>

<h3 id="creating-a-napistudata">Creating a NapistuData</h3>

<p>Constructing a <code class="language-plaintext highlighter-rouge">NapistuData</code> instance involves three conceptual steps:</p>

<ol>
  <li><strong>Load the network</strong>: Start with a <code class="language-plaintext highlighter-rouge">NapistuGraph</code> and its associated
<code class="language-plaintext highlighter-rouge">SBML_dfs</code> database — here, the 8-source Octopus consensus network
downloaded from Google Cloud Storage</li>
  <li><strong>Augment with attributes</strong>: Add relevant vertex and edge metadata
as described in the <a href="https://www.shackett.org/octopus_network/#decorating-the--graph-with-species-and-reaction-data">Octopus network
post</a></li>
  <li><strong>Encode as tensors</strong>: Convert attributes to <code class="language-plaintext highlighter-rouge">torch.Tensor</code>s using
sklearn-based encoders with automatic type detection
(binary→passthrough, categorical→one-hot,
continuous→standardization) and train/val/test splitting</li>
</ol>

<p>In practice, you rarely construct <code class="language-plaintext highlighter-rouge">NapistuData</code> objects manually.
Instead, the <code class="language-plaintext highlighter-rouge">NapistuDataStore</code> handles this process automatically —
loading raw data, applying transformations, caching results, and
managing related artifacts. This is what enables immediate model
training without rebuild overhead. I’ll demonstrate the store-based
workflow after covering environment setup.</p>

<h2 id="following-along">Following along</h2>

<p>This analysis is fully reproducible — all code, data, and model
configurations are provided so you can run the complete workflow on your
own machine. This section covers environment setup and file locations.</p>

<h3 id="environment-setup">Environment setup</h3>

<p>To reproduce this notebook:</p>

<ol>
  <li>
    <p>Install <a href="https://docs.astral.sh/uv/#highlights">uv</a> (or use <code class="language-plaintext highlighter-rouge">pip</code> if
preferred).</p>
  </li>
  <li>
    <p>Set up a Python environment:</p>
  </li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv venv <span class="nt">--python</span> 3.11
<span class="nb">source</span> .venv/bin/activate
<span class="c"># Core dependencies</span>
uv pip <span class="nb">install </span><span class="nv">torch</span><span class="o">==</span>2.8.0
uv pip <span class="nb">install </span>torch-scatter torch-sparse <span class="nt">-f</span> https://data.pyg.org/whl/torch-2.8.0+cpu.html
uv pip <span class="nb">install </span><span class="nv">napistu</span><span class="o">==</span>0.7.5
uv pip <span class="nb">install</span> <span class="s2">"napistu-torch[pyg,lightning]==0.2.6"</span>
<span class="c"># if you'd like to render the notebook, you'll need to install these additional dependencies</span>
uv pip <span class="nb">install </span>seaborn ipykernel nbformat nbclient umap-learn
python <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span><span class="o">=</span>blog-staging
</code></pre></div></div>

<ol>
  <li>
    <p>Download the
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/napistu_nets.qmd"><code class="language-plaintext highlighter-rouge">napistu_nets.qmd</code></a>
notebook (or copy and paste the relevant code blocks).</p>
  </li>
  <li>
    <p>Choose your path:</p>

    <ul>
      <li><strong>4a. Using pre-trained models</strong> (recommended): Download
<a href="https://github.com/shackett/shackett/blob/main/assets/data/napistu_nets_models.zip">pre-trained models and
configs</a>
(~50MB) and extract to your experiments directory.</li>
      <li><strong>4b. Training from scratch</strong>: Download <a href="https://github.com/shackett/shackett/blob/main/assets/data/napistu_nets_configs.zip">model configs
only</a>
to train models yourself (requires ~2 days on an M4 Max MacBook
Pro, 8-12 hours per model).</li>
    </ul>
  </li>
  <li>
    <p>Configure <code class="language-plaintext highlighter-rouge">EXPERIMENTS_DIR</code> and other paths in the <code class="language-plaintext highlighter-rouge">env_setup</code> code
block to point to your local directories.</p>
  </li>
</ol>

<h3 id="configuration-and-imports">Configuration and imports</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># imports
</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">matplotlib.colors</span> <span class="kn">import</span> <span class="n">LogNorm</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>

<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.edge_prediction</span> <span class="kn">import</span> <span class="n">summarize_edge_predictions_by_strata</span><span class="p">,</span> <span class="n">plot_edge_predictions_by_strata</span>
<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.edge_weights</span> <span class="kn">import</span> <span class="n">compute_edge_feature_sensitivity</span><span class="p">,</span> <span class="n">format_edge_feature_sensitivity</span><span class="p">,</span> <span class="n">plot_edge_feature_sensitivity</span>
<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.evaluation_manager</span> <span class="kn">import</span> <span class="n">EvaluationManager</span>
<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.model_comparison</span> <span class="kn">import</span> <span class="n">compare_embeddings</span>
<span class="kn">from</span> <span class="nn">napistu_torch.evaluation.pathways</span> <span class="kn">import</span> <span class="n">calculate_pathway_similarities</span>
<span class="kn">from</span> <span class="nn">napistu_torch.lightning.tasks</span> <span class="kn">import</span> <span class="n">get_edge_encoder</span>
<span class="kn">from</span> <span class="nn">napistu_torch.lightning.workflows</span> <span class="kn">import</span> <span class="n">predict</span>
<span class="kn">from</span> <span class="nn">napistu_torch.load.gcs</span> <span class="kn">import</span> <span class="n">gcs_model_to_store</span>
<span class="kn">from</span> <span class="nn">napistu_torch.utils.torch_utils</span> <span class="kn">import</span> <span class="n">select_device</span>
<span class="kn">from</span> <span class="nn">napistu_torch.visualization.basic_metrics</span> <span class="kn">import</span> <span class="n">plot_model_comparison</span>
<span class="kn">from</span> <span class="nn">napistu_torch.visualization.embeddings</span> <span class="kn">import</span> <span class="n">layout_umap</span><span class="p">,</span> <span class="n">plot_coordinates_with_masks</span>

<span class="n">logging</span><span class="p">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s">"napistu_torch"</span><span class="p">)</span>

<span class="c1"># globals
</span>
<span class="n">OVERWRITE</span> <span class="o">=</span> <span class="bp">False</span>

<span class="n">EXPERIMENTS_DIR</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">"~/Desktop/EXPERIMENTS/20251106_edge_prediction"</span><span class="p">))</span>
<span class="n">NAPISTU_DATA_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">EXPERIMENTS_DIR</span><span class="p">,</span> <span class="s">".napistu_data"</span><span class="p">)</span>
<span class="n">STORE_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">EXPERIMENTS_DIR</span><span class="p">,</span> <span class="s">".store"</span><span class="p">)</span>
<span class="n">CACHE_DIR</span> <span class="o">=</span> <span class="n">EXPERIMENTS_DIR</span>

<span class="n">EXPERIMENTS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="c1"># leave this as a list so it defines plot order
</span>    <span class="s">"20251106_gcn_baseline"</span><span class="p">,</span>
    <span class="s">"20251106_gcn_edge_encoding"</span><span class="p">,</span>
    <span class="s">"20251106_sage_baseline"</span><span class="p">,</span>
    <span class="s">"20251106_graphconv_baseline"</span><span class="p">,</span>
    <span class="s">"20251106_graphconv_edge_encoding"</span>
<span class="p">]</span>

<span class="n">EXPERIMENT_LABELS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"20251106_gcn_baseline"</span> <span class="p">:</span> <span class="s">"GCN"</span><span class="p">,</span>
    <span class="s">"20251106_gcn_edge_encoding"</span> <span class="p">:</span> <span class="s">"GCN + Edge Encoding"</span><span class="p">,</span>
    <span class="s">"20251106_sage_baseline"</span> <span class="p">:</span> <span class="s">"SAGE"</span><span class="p">,</span>
    <span class="s">"20251106_graphconv_baseline"</span> <span class="p">:</span> <span class="s">"GraphConv"</span><span class="p">,</span>
    <span class="s">"20251106_graphconv_edge_encoding"</span> <span class="p">:</span> <span class="s">"GraphConv + Edge Encoding"</span>
<span class="p">}</span>

<span class="n">ordered_labels</span> <span class="o">=</span> <span class="p">[</span><span class="n">EXPERIMENT_LABELS</span><span class="p">[</span><span class="n">exp</span><span class="p">]</span> <span class="k">for</span> <span class="n">exp</span> <span class="ow">in</span> <span class="n">EXPERIMENTS</span><span class="p">]</span>

<span class="n">TOP_MODEL_NAME</span> <span class="o">=</span> <span class="s">"GraphConv + Edge Encoding"</span>
<span class="n">EMBEDDING_COMPARISONS_PATH</span> <span class="o">=</span> <span class="n">CACHE_DIR</span> <span class="o">/</span> <span class="s">"embedding_comparisons.tsv"</span>

<span class="c1"># validation
</span><span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isdir</span><span class="p">(</span><span class="n">EXPERIMENTS_DIR</span><span class="p">):</span>
    <span class="k">raise</span> <span class="nb">FileNotFoundError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Experiments directory not found: </span><span class="si">{</span><span class="n">EXPERIMENTS_DIR</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isdir</span><span class="p">(</span><span class="n">CACHE_DIR</span><span class="p">):</span>
    <span class="k">raise</span> <span class="nb">FileNotFoundError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Cache directory not found: </span><span class="si">{</span><span class="n">CACHE_DIR</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="managing-artifacts-with-napistudatastore">Managing artifacts with NapistuDataStore</h2>

<p>Training and evaluating GNN models requires more than just the graph
structure — we need encoded features, train/val/test splits, pathway
metadata for evaluation, and edge stratification data for negative
sampling. Building these artifacts from scratch involves loading the
full <code class="language-plaintext highlighter-rouge">SBML_dfs</code> database (several minutes) and running various
preprocessing steps. Doing this repeatedly during development would be
painfully slow.</p>

<p>The <code class="language-plaintext highlighter-rouge">NapistuDataStore</code> solves this by managing a registry of cached
artifacts. Once built, artifacts load in seconds rather than minutes.
Named artifact definitions in the <code class="language-plaintext highlighter-rouge">napistu_torch.load.artifacts</code> module
support common workflows and integrate seamlessly with config-driven
training.</p>
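<p>The build-or-load pattern behind the store looks roughly like this. It is a
generic illustration only: <code class="language-plaintext highlighter-rouge">ensure_artifact</code>
here is a hypothetical stand-in, not the actual
<code class="language-plaintext highlighter-rouge">NapistuDataStore</code> API:</p>

```python
# Generic build-or-load caching: expensive builders run once, and later
# calls load from disk. `ensure_artifact` is a hypothetical stand-in.
import pickle
import tempfile
from pathlib import Path

def ensure_artifact(store_dir: Path, name: str, builder):
    path = store_dir / f"{name}.pkl"
    if path.exists():                       # cache hit: fast load
        return pickle.loads(path.read_bytes())
    artifact = builder()                    # cache miss: run the slow build once
    path.write_bytes(pickle.dumps(artifact))
    return artifact

store = Path(tempfile.mkdtemp())
splits = ensure_artifact(store, "splits", lambda: {"train": 0.8, "val": 0.1, "test": 0.1})
# second call hits the cache; the new builder is never invoked
cached = ensure_artifact(store, "splits", lambda: {"train": 0.0})
print(cached["train"])  # 0.8
```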

<p>Importantly, the store provides a clean abstraction layer over
Napistu-Py. All the logic for loading SBML databases, decorating graphs
with metadata, and extracting biological annotations is baked into
Napistu-Torch objects. Users can work entirely with <code class="language-plaintext highlighter-rouge">NapistuData</code>,
encoders, and dataloaders without ever touching Napistu-Py code —
though if you’re curious about the underlying biological data model,
<a href="https://www.shackett.org/octopus_network/">Napistu-Py is pretty cool</a>.</p>

<h3 id="initializing-the-store">Initializing the store</h3>

<p>A store can be initialized directly from one of the bundled Napistu
networks on Google Cloud Storage using <code class="language-plaintext highlighter-rouge">gcs_model_to_store</code>. This
creates and manages two local directories:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">napistu_data_dir</code>: Raw data including the <code class="language-plaintext highlighter-rouge">NapistuGraph</code> and
<code class="language-plaintext highlighter-rouge">SBML_dfs</code></li>
  <li><code class="language-plaintext highlighter-rouge">store_dir</code>: Cached artifacts and the registry file tracking what’s
been built</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">napistu_data_store</span> <span class="o">=</span> <span class="n">gcs_model_to_store</span><span class="p">(</span>
    <span class="n">napistu_data_dir</span> <span class="o">=</span> <span class="n">NAPISTU_DATA_DIR</span><span class="p">,</span>
    <span class="n">store_dir</span> <span class="o">=</span> <span class="n">STORE_DIR</span><span class="p">,</span>
    <span class="n">asset_name</span> <span class="o">=</span> <span class="s">"human_consensus"</span><span class="p">,</span>
    <span class="c1"># pin to a stable version of the dataset for reproducibility
</span>    <span class="n">asset_version</span> <span class="o">=</span> <span class="s">"20250923"</span> 
<span class="p">)</span>
</code></pre></div></div>

<h3 id="building-and-caching-artifacts">Building and caching artifacts</h3>

<p>The <code class="language-plaintext highlighter-rouge">ensure_artifacts</code> method checks whether requested artifacts exist
and builds any that are missing. For this analysis, we need four
artifacts:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">napistu_data_store</span><span class="p">.</span><span class="n">ensure_artifacts</span><span class="p">([</span>
    <span class="s">"edge_prediction"</span><span class="p">,</span>
    <span class="s">"comprehensive_pathway_memberships"</span><span class="p">,</span>
    <span class="s">"edge_strata_by_node_type"</span><span class="p">,</span>
    <span class="s">"edge_strata_by_node_species_type"</span>
<span class="p">])</span>
</code></pre></div></div>

<p>These artifacts are:</p>

<ul>
  <li><strong>edge_prediction</strong>: A <code class="language-plaintext highlighter-rouge">NapistuData</code> instance with train/val/test
edge masks, used for self-supervised learning</li>
  <li><strong>comprehensive_pathway_memberships</strong>: Detailed pathway associations
for all vertices (including fine-grained Reactome pathways), used
for evaluating whether embeddings capture biological organization</li>
  <li><strong>edge_strata_by_node_type</strong>: Edge categories based on source/target
node types (species→species, species→reaction, etc.), used for
stratified negative sampling</li>
  <li><strong>edge_strata_by_node_species_type</strong>: Finer-grained edge categories
including species types (protein, metabolite, RNA), used for
assessing prediction biases</li>
</ul>

<p>Once built, these artifacts load almost instantly:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">napistu_data</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_napistu_data</span><span class="p">(</span><span class="s">"edge_prediction"</span><span class="p">)</span>
<span class="n">comprehensive_pathway_memberships</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_vertex_tensor</span><span class="p">(</span><span class="s">"comprehensive_pathway_memberships"</span><span class="p">)</span>
<span class="n">edge_strata_by_node_type</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_pandas_df</span><span class="p">(</span><span class="s">"edge_strata_by_node_type"</span><span class="p">)</span>
<span class="n">edge_strata_by_node_species_type</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_pandas_df</span><span class="p">(</span><span class="s">"edge_strata_by_node_species_type"</span><span class="p">)</span>
</code></pre></div></div>

<pre><code class="language-warning">    INFO:napistu_torch.napistu_data_store:Loading NapistuData from /Users/sean/Desktop/EXPERIMENTS/20251106_edge_prediction/.store/napistu_data/edge_prediction.pt
    INFO:napistu_torch.napistu_data_store:Loading VertexTensor from /Users/sean/Desktop/EXPERIMENTS/20251106_edge_prediction/.store/vertex_tensors/comprehensive_pathway_memberships.pt
    INFO:napistu_torch.napistu_data_store:Loading pandas DataFrame from /Users/sean/Desktop/EXPERIMENTS/20251106_edge_prediction/.store/pandas_dfs/edge_strata_by_node_type.parquet
    INFO:napistu_torch.napistu_data_store:Loading pandas DataFrame from /Users/sean/Desktop/EXPERIMENTS/20251106_edge_prediction/.store/pandas_dfs/edge_strata_by_node_species_type.parquet
</code></pre>

<p>The store abstraction means that downstream code (training scripts,
evaluation notebooks) can simply request artifacts by name without
worrying about paths, versions, or rebuild logic.</p>
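<p>The build-once / load-fast pattern behind the store is compact enough to sketch. The class below is a hypothetical, stripped-down illustration of the registry idea — the class name, file layout, and methods are my own, not the actual <code class="language-plaintext highlighter-rouge">NapistuDataStore</code> API:</p>

```python
import json
from pathlib import Path


class ArtifactRegistry:
    """Build-once / load-fast caching sketch: a JSON registry records
    which artifacts exist, and builders only run for missing entries.
    Names and file layout are illustrative, not the Napistu-Torch API."""

    def __init__(self, store_dir):
        self.store_dir = Path(store_dir)
        self.store_dir.mkdir(parents=True, exist_ok=True)
        self.registry_path = self.store_dir / "registry.json"
        self.registry = (
            json.loads(self.registry_path.read_text())
            if self.registry_path.exists()
            else {}
        )

    def ensure(self, name, builder):
        """Run the (slow) builder only if the artifact is not cached."""
        if name not in self.registry:
            path = self.store_dir / f"{name}.json"
            path.write_text(json.dumps(builder()))  # expensive build step
            self.registry[name] = str(path)
            self.registry_path.write_text(json.dumps(self.registry))
        return Path(self.registry[name])
```

Subsequent calls to <code class="language-plaintext highlighter-rouge">ensure</code> with the same name skip the builder entirely, which is what turns multi-minute rebuilds into near-instant loads.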

<h2 id="anatomy-of-a-gnn">Anatomy of a GNN</h2>

<p>Training a GNN requires coordinating several components: the model
architecture itself, the task definition, data management, and training
infrastructure. This section breaks down each component — starting
with a conceptual overview of what it does, then showing how it’s
implemented in Napistu-Torch. We’ll begin with the high-level system
architecture, then work through the task definition, model components
(encoder, head, edge encoder), and finally the training infrastructure
that orchestrates everything.</p>

<h3 id="system-architecture">System architecture</h3>

<p><img src="https://www.shackett.org/figure/napistu_torch/system_architecture.png" alt="System architecture mermaid diagram" style="width: 70%;" /></p>

<p>Training deep learning models involves coordinating several standard
components: the <strong>Task</strong> defines what we’re learning (loss function,
metrics), the <strong>Model</strong> implements the neural network architecture, the
<strong>DataModule</strong> handles data loading and batching, and the <strong>Trainer</strong>
orchestrates the optimization loop. Each component is configured
independently, providing modularity and clear separation of concerns.</p>

<p>The following subsections examine how these components work in the
context of biological network analysis, covering the task definition
(edge prediction), model architecture (GNN encoders and heads), and
training infrastructure. Later, in the Workflow Management section, I’ll
show how Napistu-Torch packages these component-level configs into a
single <code class="language-plaintext highlighter-rouge">ExperimentConfig</code> for experiment reproducibility.</p>

<h3 id="task-what-are-we-trying-to-predict">Task: what are we trying to predict?</h3>

<p>At the highest level, a GNN task defines the learning objective — what
predictions we want the model to make and how we’ll evaluate its
performance. Different tasks require different architectural components
and training strategies. Common GNN tasks include node classification
(predicting properties of individual nodes), graph classification
(predicting properties of entire graphs), and edge prediction
(predicting whether edges should exist between nodes).</p>

<p><strong>Edge prediction in Napistu-Torch.</strong> For this analysis, we’re using the
edge prediction task (also called link prediction). The goal is to
predict whether an edge should exist between two nodes in a biological
network. This is particularly valuable for discovering potential
protein-protein interactions, metabolic relationships, or regulatory
connections that may be missing from current databases. Crucially, edge
prediction is self-supervised — it doesn’t require vertex or edge
labels, which are often difficult to obtain or overly contrived for
biological networks.</p>

<p>The training process works by teaching the model to discriminate
between:</p>

<ul>
  <li><strong>Positive edges</strong>: Real edges that exist in the training set</li>
  <li><strong>Negative edges</strong>: Node pairs sampled as non-edges (chosen to
maintain biological plausibility)</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">EdgePredictionTask</code> class in Napistu-Torch orchestrates this
process, handling negative sampling, computing the loss function (binary
cross-entropy), and evaluating performance using metrics like AUC and
average precision. The task operates in a transductive setting — the
model generates node embeddings using only training edges for message
passing; validation and test edges are excluded from neighborhood
aggregation but used as supervision to evaluate the decoder’s edge
predictions.</p>
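<p>The core of this objective is compact. The function below is a minimal sketch of edge scoring with a dot-product head and binary cross-entropy — the function name and tensor layout are assumptions for illustration, not the <code class="language-plaintext highlighter-rouge">EdgePredictionTask</code> API:</p>

```python
import torch
import torch.nn.functional as F


def edge_prediction_loss(embeddings, pos_edges, neg_edges):
    """Binary cross-entropy over positive and negative edge scores.

    embeddings: (num_nodes, hidden_dim) output of the encoder
    pos_edges:  (2, num_pos) real training edges
    neg_edges:  (2, num_neg) sampled non-edges
    """
    def score(edges):
        # dot-product head: similarity of source and target embeddings
        return (embeddings[edges[0]] * embeddings[edges[1]]).sum(dim=-1)

    logits = torch.cat([score(pos_edges), score(neg_edges)])
    labels = torch.cat([
        torch.ones(pos_edges.size(1)),   # real edges -> 1
        torch.zeros(neg_edges.size(1)),  # non-edges  -> 0
    ])
    return F.binary_cross_entropy_with_logits(logits, labels)
```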

<div class="content-section ai-aside">
  <div class="section-content">
    <p>Naïve negative sampling (randomly
pairing vertices) produces trivially distinguishable negatives. Early
models without stratification quickly reached &gt;0.95 AUC by exploiting
two artifacts: (1) sampling impossible edge types like reaction→reaction
that never occur in real edges, and (2) sampling random pairs that
ignore the highly variable degree distribution of biological networks,
making hub nodes easy to memorize.</p>

<p>The <code class="language-plaintext highlighter-rouge">NegativeSampler</code> addresses this by tracking edge attributes —
such as combinations of from- and to-node types and degree distributions
within each stratum. It generates negative samples that match both the
observed edge strata and vertex in- and out-degree distributions,
forcing the model to learn biological patterns rather than graph
artifacts.</p>

<p>For the models trained here, I use coarse-grained node-type strata
(species vs. reaction), sampling negatives to match the observed
proportions of species→species, species→reaction, and reaction→species.
Finer-grained stratification — matching entity types like protein
vs. complex vs. metabolite — could further reduce imbalances, but at a
cost: more strata mean sampling separately from dozens of pools during
batch construction, limiting vectorization efficiency.</p>

  </div>
</div>
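<p>One simple way to satisfy both constraints at once is to resample within strata: permuting the source endpoints of each stratum's real edges preserves the stratum proportions and the in/out-degree distributions exactly. A hypothetical sketch of this idea (a real implementation, like the <code class="language-plaintext highlighter-rouge">NegativeSampler</code>, would also need to reject collisions with true edges):</p>

```python
import numpy as np
import pandas as pd


def sample_stratified_negatives(edges, rng):
    """Generate one negative per positive edge. Within each stratum,
    shuffling the source column re-pairs observed endpoints, so the
    negatives match the stratum mix and the degree distributions of the
    positives exactly. Illustrative only: collisions with real edges
    are not filtered, and this is not the Napistu-Torch API."""
    negatives = []
    for stratum, group in edges.groupby("stratum"):
        negatives.append(pd.DataFrame({
            "from": rng.permutation(group["from"].to_numpy()),
            "to": group["to"].to_numpy(),
            "stratum": stratum,
        }))
    return pd.concat(negatives, ignore_index=True)
```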

<p>With the task defined, let’s examine the three core model components:
the encoder that learns node embeddings, the head that produces
predictions, and the optional edge encoder that can weight message
passing.</p>

<h3 id="encoder-learning-vertex-representations">Encoder: learning vertex representations</h3>

<p>The encoder is the core of a GNN — it transforms raw node features
into learned embeddings by aggregating information from each node’s
local neighborhood. Through multiple layers of message passing, encoders
capture increasingly complex patterns of connectivity and feature
similarity. The encoder’s architecture determines how information flows
through the graph and what kinds of structural patterns the model can
learn.</p>

<p><strong>Message passing encoders in Napistu-Torch.</strong> Napistu-Torch leverages
PyG’s library of encoder architectures, wrapping them through the
<code class="language-plaintext highlighter-rouge">MessagePassingEncoder</code> class for easy configuration:</p>

<ul>
  <li><strong>GraphSAGE (SAGE)</strong>
    <ul>
      <li>Samples and aggregates features from neighbors using various
aggregation functions (mean, max, add)</li>
      <li>Efficient and scalable, well-suited for large biological
networks</li>
      <li>Does not support edge weighting — treats all edges uniformly</li>
    </ul>
  </li>
  <li><strong>GraphConv</strong>
    <ul>
      <li>Similar to SAGE with a simplified message passing scheme</li>
      <li>Supports optional edge weighting</li>
    </ul>
  </li>
  <li><strong>Graph Convolutional Networks (GCN)</strong>
    <ul>
      <li>Uses symmetric normalization that accounts for node degrees</li>
      <li>Supports edge weighting</li>
    </ul>
  </li>
</ul>

<p>All encoders follow a multi-layer architecture, progressively refining
node embeddings through repeated neighborhood aggregation. The framework
provides a unified interface, allowing you to swap architectures through
configuration files without changing code.</p>
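<p>To make the message passing loop concrete, here is a minimal SAGE-flavored encoder in plain PyTorch: each layer concatenates a node's own features with the mean of its incoming neighbors' features, then projects the result. This is an illustrative sketch, not the <code class="language-plaintext highlighter-rouge">MessagePassingEncoder</code> wrapper, which delegates to PyG's optimized layers:</p>

```python
import torch
import torch.nn as nn


class MeanAggregationEncoder(nn.Module):
    """SAGE-style encoder sketch: repeated rounds of neighborhood
    mean-aggregation followed by a linear projection."""

    def __init__(self, in_dim, hidden_dim, num_layers=2):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * num_layers
        self.layers = nn.ModuleList(
            nn.Linear(2 * dims[i], dims[i + 1]) for i in range(num_layers)
        )

    def forward(self, x, edge_index):
        src, dst = edge_index  # messages flow src -> dst
        for i, layer in enumerate(self.layers):
            # sum incoming neighbor features, then divide by in-degree
            agg = torch.zeros_like(x).index_add_(0, dst, x[src])
            deg = torch.zeros(x.size(0)).index_add_(
                0, dst, torch.ones(dst.numel())
            ).clamp(min=1).unsqueeze(-1)
            x = layer(torch.cat([x, agg / deg], dim=-1))
            if i < len(self.layers) - 1:
                x = torch.relu(x)
        return x
```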

<h3 id="head-making-predictions-from-embeddings">Head: making predictions from embeddings</h3>

<p>Once the encoder has produced node embeddings, the head (or decoder)
transforms these embeddings into task-specific predictions. The head’s
role is to adapt the general-purpose embeddings produced by the encoder
to the specific prediction task — edge prediction, node
classification, graph classification, etc.</p>

<p><strong>Heads in Napistu-Torch.</strong> For edge prediction, the most commonly used
head is the <strong>dot product</strong>, which computes the inner product of source
and target node embeddings — assuming that nodes with similar
embeddings should be connected. This is the simplest and most efficient
option, serving as a strong baseline in most GNN edge prediction work.
Napistu-Torch also implements more expressive alternatives (MLP,
bilinear) that can learn non-linear relationships between node pairs,
though these come with increased computational cost.</p>

<p>For this analysis, all models use the dot product head due to its
efficiency on large biological networks and strong empirical
performance. Napistu-Torch also provides heads for other tasks like node
classification, all accessible through configuration files.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>The dot product head is symmetric
— it treats vertex A→B identically to B→A — making it well-suited
for undirected interactions but poorly suited for regulation, where the
roles of regulator and target fundamentally differ. Asymmetric heads —
for example, heads with separate source and target projections, or
translation- and rotation-based scores — could address this by treating
A→B and B→A as distinct predictions, enabling the model to
differentiate regulators from their targets.</p>

<p>However, even asymmetric heads may struggle to take advantage of
Napistu’s diverse edge types: protein-protein interactions, activation,
inhibition, catalysis, and more. An asymmetric model could implicitly
learn these cues, but the meaning of a high prediction score would
remain ambiguous: is it activation, inhibition, binding?</p>

<p>Relation-aware heads like RotatE offer a more promising path. Originally
developed for knowledge graph completion, these architectures explicitly
model edge types as distinct relations, learning separate transformation
rules for each. This approach enables the model to capture the
principles of regulation directly, discerning activators from inhibitors
and regulators from binding partners. Rather than merely predicting
edges, these models yield typed edges — specific, testable hypotheses
about the nature of regulatory relationships.</p>

  </div>
</div>
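<p>To see why rotation-based scoring helps, consider a RotatE-style score in which each relation is a rotation in the complex plane: rotating the source before comparing it to the target means A→B and B→A generally receive different scores, and each edge type gets its own learned rotation. A minimal sketch, where the tensor shapes and function name are my assumptions:</p>

```python
import torch


def rotate_score(head, rel_phase, tail):
    """RotatE-style plausibility score (sketch). `head`/`tail` are
    (n, d, 2) real tensors viewed as (n, d) complex embeddings;
    `rel_phase` is a (d,) learned rotation angle per relation (e.g. one
    per edge type such as activation or inhibition). Higher = more
    plausible; a perfect match after rotation scores 0."""
    h = torch.view_as_complex(head)
    t = torch.view_as_complex(tail)
    # unit-modulus complex numbers: multiplying rotates each dimension
    r = torch.polar(torch.ones_like(rel_phase), rel_phase)
    return -(h * r - t).abs().sum(dim=-1)
```

Because the rotation is applied to the source only, swapping source and target changes the score, which is exactly the asymmetry regulation requires.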

<h3 id="edge-encoder-weighting-message-passing">Edge encoder: weighting message passing</h3>

<p>While many GNN implementations either ignore edge attributes entirely or
only accept a single edge weight, biological networks often have rich
edge metadata. The challenge is that most message passing architectures
(like SAGE) don’t support edge attributes at all, while others (like GCN
and GraphConv) only accept a single scalar weight per edge.</p>

<p><strong>Learned edge weighting in Napistu-Torch.</strong> Napistu-Torch addresses
this through the <code class="language-plaintext highlighter-rouge">EdgeEncoder</code> class, which compresses multi-dimensional
edge attributes into a single learned weight that modulates message
passing strength. The edge encoder is a lightweight MLP that takes edge
features as input and outputs a scalar weight in [0, 1] via sigmoid
activation. These learned weights control how strongly each edge
contributes during neighborhood aggregation.</p>

<p>This approach filters noisy edges and amplifies reliable ones, focusing
message passing on the most informative connections — crucial in
biological networks where edge quality varies widely. The encoder trains
end-to-end with the rest of the model, learning which edge attributes
are most predictive for the task while remaining compatible with
standard GNN architectures that expect scalar edge weights.</p>
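<p>The idea fits in a few lines. Below is a hedged sketch of such an encoder — a small MLP mapping edge attributes to a sigmoid-squashed scalar — rather than the exact <code class="language-plaintext highlighter-rouge">EdgeEncoder</code> implementation (class name and hidden size are illustrative):</p>

```python
import torch
import torch.nn as nn


class EdgeWeightEncoder(nn.Module):
    """Compress multi-dimensional edge attributes into one scalar
    weight in [0, 1] that modulates message passing strength."""

    def __init__(self, edge_dim, hidden_dim=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(edge_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, edge_attr):
        # sigmoid squashes to [0, 1]: near-zero weights silence noisy
        # edges, near-one weights pass messages through at full strength
        return torch.sigmoid(self.mlp(edge_attr)).squeeze(-1)
```

The resulting vector of per-edge weights can be handed directly to any architecture (GCN, GraphConv) that accepts a scalar <code class="language-plaintext highlighter-rouge">edge_weight</code>.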

<h3 id="model-overview">Model overview</h3>

<p><img src="https://www.shackett.org/figure/napistu_torch/model_overview.png" alt="Overview of model structure as a mermaid diagram" style="width: 100%;" /></p>

<p>The diagram shows how model components connect during a forward pass.
Node features and edge connectivity flow through the message passing
encoder to produce node embeddings. The head then transforms these
embeddings into task-specific predictions — edge scores for edge
prediction, class probabilities for node classification.</p>

<p>The optional edge encoder (shown in dashed lines) learns to weight edges
based on edge attributes, modulating how strongly each edge contributes
during message passing. This is particularly useful when edge
reliability varies across data sources, as in the Octopus network.</p>

<p>These model components (encoder, head, edge encoder) define the core
prediction logic, but training a GNN requires additional infrastructure
to manage data loading, optimization, and evaluation. Napistu-Torch uses
PyTorch Lightning to orchestrate this training workflow.</p>

<h3 id="training-infrastructure-pytorch-lightning">Training infrastructure: PyTorch Lightning</h3>

<p>While the core GNN components (encoder, head, task) are pure PyTorch,
actually training a model requires substantial boilerplate: optimizer
setup, learning rate scheduling, checkpoint saving, logging metrics,
handling different hardware accelerators, and coordinating
training/validation loops.</p>

<p><strong>Lightning integration in Napistu-Torch.</strong> Napistu-Torch uses PyTorch
Lightning to handle this training infrastructure automatically.
Lightning separates scientific code (model architecture, loss functions)
from engineering code (training loops, GPU management), making
experiments more reproducible and less error-prone.</p>

<p>The key Lightning components are:</p>

<ul>
  <li>
    <p><strong>LightningModule</strong>: Wraps the core task (encoder + head) and
defines training/validation steps, metrics computation, and
optimizer configuration. Napistu-Torch provides task-specific
adapters like <code class="language-plaintext highlighter-rouge">EdgePredictionLightning</code> that bridge pure PyTorch
implementations with Lightning’s training infrastructure.</p>
  </li>
  <li>
    <p><strong>Trainer</strong>: Orchestrates the training loop, handles checkpointing,
manages device placement (CPU/GPU/MPS), integrates with experiment
tracking tools like Weights &amp; Biases, and implements callbacks like
early stopping.</p>
  </li>
</ul>

<p>With this architecture, you can concentrate on defining model and task
logic, while Lightning takes care of all training mechanics.</p>

<h3 id="data-management-batching-strategies">Data management: batching strategies</h3>

<p>The <code class="language-plaintext highlighter-rouge">NapistuDataModule</code> is Lightning’s interface for data loading. It
can be initialized directly from an <code class="language-plaintext highlighter-rouge">ExperimentConfig</code>, automatically
handling artifact loading from the <code class="language-plaintext highlighter-rouge">NapistuDataStore</code>, data validation,
and dataloader creation.</p>

<p><strong>Full-batch vs. mini-batch training.</strong> Napistu-Torch provides two
DataModule implementations with fundamentally different training
strategies:</p>

<ul>
  <li>
    <p><strong>FullGraphDataModule</strong>: Returns the complete graph in each batch,
processing all training edges simultaneously for a single gradient
update per epoch. With only one update per epoch, the model can
converge prematurely before exploring the optimization landscape
effectively.</p>
  </li>
  <li>
    <p><strong>EdgeBatchDataModule</strong>: Splits training edges into mini-batches
while still using all training edges for message passing. Each batch
computes loss and gradients on a subset of training edges but uses
the full graph structure for neighborhood aggregation. This enables
multiple gradient updates per epoch by subdividing the supervision
signal — effectively trading fewer epochs for more updates per
epoch, allowing more thorough optimization.</p>
  </li>
</ul>

<p>For the models in this post, I used <code class="language-plaintext highlighter-rouge">EdgeBatchDataModule</code> with 20
batches per epoch, meaning the model updates its weights 20 times per
epoch rather than once.</p>
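<p>The mechanics can be sketched as a generator over shuffled supervision edges — the full <code class="language-plaintext highlighter-rouge">edge_index</code> still feeds message passing; only the loss is restricted to each chunk. Names here are illustrative, not the <code class="language-plaintext highlighter-rouge">EdgeBatchDataModule</code> API:</p>

```python
import torch


def edge_batches(edge_index, num_batches, generator=None):
    """Split supervision edges into mini-batches. Message passing still
    uses the full training graph; only the loss (and gradient) is
    computed per batch, yielding `num_batches` updates per epoch."""
    num_edges = edge_index.size(1)
    perm = torch.randperm(num_edges, generator=generator)
    for chunk in perm.chunk(num_batches):
        yield edge_index[:, chunk]
```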

<h2 id="workflow-management">Workflow management</h2>

<p>For this post, I’m comparing five models across different encoder
architectures and edge encoding strategies:</p>

<ul>
  <li>GraphConv (+/- edge encoding)</li>
  <li>GCN (+/- edge encoding)</li>
  <li>SAGE (edge encoding not supported)</li>
</ul>

<p>These models use identical hyperparameters (200 epochs, same batch
configuration, same hidden dimensions) for a fair comparison. They are
deliberately unoptimized, with no hyperparameter tuning and simple
dot-product heads, as the focus is on feasibility and infrastructure
rather than peak performance.</p>

<p><strong>Training configuration in Napistu-Torch.</strong> The <code class="language-plaintext highlighter-rouge">ExperimentConfig</code>
composes lower-level Pydantic configs (<code class="language-plaintext highlighter-rouge">DataConfig</code>, <code class="language-plaintext highlighter-rouge">ModelConfig</code>,
<code class="language-plaintext highlighter-rouge">TaskConfig</code>, <code class="language-plaintext highlighter-rouge">TrainerConfig</code>, <code class="language-plaintext highlighter-rouge">WandBConfig</code>) with validation and
sensible defaults. You define an experiment in a minimal YAML file and
inherit the remaining defaults automatically.</p>

<p>The Napistu-Torch CLI supports training and testing directly from the
command line:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>napistu-torch train graphconv_baseline.yaml <span class="nt">--out-dir</span> 20251106_graphconv_baseline
napistu-torch <span class="nb">test </span>20251106_graphconv_baseline
</code></pre></div></div>

<p>Training/validation/test metrics log to Weights &amp; Biases for easy
comparison across experiments. Each run saves a <code class="language-plaintext highlighter-rouge">RunManifest</code> containing
the Weights &amp; Biases run ID and complete <code class="language-plaintext highlighter-rouge">ExperimentConfig</code> (with all
defaults expanded), making experiments fully reproducible.</p>

<p>The configs and training script for these five models are available
<a href="https://github.com/shackett/shackett/blob/main/assets/data/napistu_nets_configs.zip">here</a>.
On an M4 Max MacBook Pro with 48GB of RAM, training the full suite takes
~2 days (~8-12 hours per model).</p>

<div class="content-section ai-aside">
  <div class="section-content">
    <p>PyTorch’s Metal Performance Shaders
(MPS) backend enables GPU acceleration on Apple Silicon, though support
is less mature than CUDA. For these experiments, MPS performed well on
simpler models, but when training models with edge encoders near my
machine’s memory limits, I encountered sporadic tensor corruption. I
trained those models on CPU instead — a reasonable fallback since the
irregular memory access patterns of message passing (variable numbers of
messages per node) meant the performance gap between CPU and GPU was
modest for these network sizes.</p>

  </div>
</div>

<p>Now let’s load these trained models and compare their performance.</p>

<h2 id="model-comparison">Model comparison</h2>

<p><strong>Model evaluation in Napistu-Torch.</strong> The <code class="language-plaintext highlighter-rouge">EvaluationManager</code> class
loads a run’s <code class="language-plaintext highlighter-rouge">RunManifest</code> and provides methods for accessing
checkpoints, the <code class="language-plaintext highlighter-rouge">NapistuDataStore</code>, and Weights &amp; Biases metrics —
eliminating the need for manual path management or API queries. Here,
I’ll load all five trained models and compare their performance metrics
and learned representations.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">eval_managers</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">EXPERIMENT_LABELS</span><span class="p">[</span><span class="n">out_dir</span><span class="p">]:</span> <span class="n">EvaluationManager</span><span class="p">(</span><span class="n">EXPERIMENTS_DIR</span> <span class="o">/</span> <span class="n">out_dir</span><span class="p">)</span> <span class="k">for</span> <span class="n">out_dir</span> <span class="ow">in</span> <span class="n">EXPERIMENTS</span>
<span class="p">}</span>

<span class="c1"># Extract model summaries directly from Weights &amp; Biases using their API
</span><span class="n">run_summaries</span> <span class="o">=</span> <span class="p">{</span><span class="n">exp</span><span class="p">:</span> <span class="n">manager</span><span class="p">.</span><span class="n">get_run_summary</span><span class="p">()</span> <span class="k">for</span> <span class="n">exp</span><span class="p">,</span> <span class="n">manager</span> <span class="ow">in</span> <span class="n">eval_managers</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>

<span class="c1"># visualize model comparison
</span><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">ax1</span><span class="p">,</span> <span class="n">ax2</span><span class="p">)</span> <span class="o">=</span> <span class="n">plot_model_comparison</span><span class="p">(</span><span class="n">run_summaries</span><span class="p">,</span> <span class="n">ordered_labels</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/basic_model_comparison-output-1.png" alt="" /></p>

<p>The training loss shown is the final epoch’s binary cross-entropy,
aggregated across all mini-batches, computed on equal numbers of real
edges (70% of the network) and negative samples. Validation AUC measures
how well the model ranks held-out real edges (15% of the network,
excluded from message passing) above an equal number of negative
samples. This metric is evaluated after each epoch for checkpoint
selection and early stopping. Test AUC evaluates the same ranking task
on the final 15% of edges.</p>
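<p>This ranking interpretation of AUC has a direct computational form: it is the probability that a randomly drawn positive edge outscores a randomly drawn negative, with ties counted half. A small sketch (the brute-force pairwise version for clarity; real evaluation would use a standard metrics library):</p>

```python
import numpy as np


def ranking_auc(pos_scores, neg_scores):
    """AUC as P(random positive > random negative), ties counted half.
    O(n*m) pairwise comparison, for clarity rather than efficiency."""
    diff = np.asarray(pos_scores)[:, None] - np.asarray(neg_scores)[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```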

<p>Performance differences across models are modest but consistent. Encoder
architecture matters: SAGE &gt; GraphConv &gt;&gt; GCN. Edge encoding provides
a clear improvement across architectures that support it. While these
models haven’t been optimized — no hyperparameter tuning, simple dot
product heads — the consistent trends across architectures validate the
training infrastructure and provide a baseline for future work.</p>

<p>Next, I’ll compare the learned representations across models.
Specifically: Do different encoder architectures produce similar vertex
embeddings? Do models with edge encoders learn comparable edge weights?
These comparisons reveal whether the biological signal is robust to
architectural choices or whether different models capture fundamentally
different patterns.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_model_manager</span> <span class="o">=</span> <span class="n">eval_managers</span><span class="p">[</span><span class="n">TOP_MODEL_NAME</span><span class="p">]</span>
<span class="n">napistu_data</span> <span class="o">=</span> <span class="n">top_model_manager</span><span class="p">.</span><span class="n">load_napistu_data</span><span class="p">()</span>
<span class="n">napistu_data_store</span> <span class="o">=</span> <span class="n">top_model_manager</span><span class="p">.</span><span class="n">get_store</span><span class="p">()</span>
<span class="n">napistu_graph</span> <span class="o">=</span> <span class="n">napistu_data_store</span><span class="p">.</span><span class="n">load_napistu_graph</span><span class="p">()</span>

<span class="c1"># pull out the node types and create a mask to distinguish species and reactions
</span><span class="n">node_types</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">get_vertex_series</span><span class="p">(</span><span class="s">"node_type"</span><span class="p">)</span>
<span class="n">is_species_mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">node_types</span> <span class="o">==</span> <span class="s">"species"</span><span class="p">).</span><span class="n">values</span>

<span class="c1">## Extract model embeddings
</span><span class="n">edge_encodings</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">species_embeddings</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">exp</span><span class="p">,</span> <span class="n">evaluation_manager</span> <span class="ow">in</span> <span class="n">eval_managers</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
    <span class="c1"># load the model and data
</span>    <span class="n">model</span> <span class="o">=</span> <span class="n">evaluation_manager</span><span class="p">.</span><span class="n">load_model_from_checkpoint</span><span class="p">()</span>
    <span class="n">napistu_data</span> <span class="o">=</span> <span class="n">evaluation_manager</span><span class="p">.</span><span class="n">load_napistu_data</span><span class="p">()</span>
    
    <span class="c1"># pull out learned edge weights (if an edge encoder is present)
</span>    <span class="k">if</span> <span class="n">model</span><span class="p">.</span><span class="n">task</span><span class="p">.</span><span class="n">encoder</span><span class="p">.</span><span class="n">edge_weighting_type</span> <span class="o">==</span> <span class="s">"learned_encoder"</span><span class="p">:</span>
        <span class="n">edge_encodings</span><span class="p">[</span><span class="n">exp</span><span class="p">]</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">get_learned_edge_weights</span><span class="p">(</span><span class="n">napistu_data</span><span class="p">)</span>
    
    <span class="c1"># pull out the vertex embeddings
</span>    <span class="n">embeddings</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">get_embeddings</span><span class="p">(</span><span class="n">napistu_data</span><span class="p">)</span>
    <span class="n">species_embeddings</span><span class="p">[</span><span class="n">exp</span><span class="p">]</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">[</span><span class="n">is_species_mask</span><span class="p">]</span>

    <span class="c1"># cleanup
</span>    <span class="n">evaluation_manager</span><span class="p">.</span><span class="n">experiment_dict</span> <span class="o">=</span> <span class="bp">None</span>
</code></pre></div></div>

<h3 id="comparing-vertex-embeddings">Comparing vertex embeddings</h3>

<p>To compare vertex embeddings across models, I compute species-species
cosine similarities within each model’s embedding space (# species ×
hidden dimension), then calculate the Spearman correlation between these
similarity matrices across model pairs. This approach works regardless
of embedding dimension and avoids the need for explicit alignment (e.g.,
Procrustes rotation).</p>
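<p>The similarity-then-correlation procedure can be sketched as follows. This is a hedged stand-in for the <code class="language-plaintext highlighter-rouge">compare_embeddings</code> helper used later in this post (the real one runs on a torch device); <code class="language-plaintext highlighter-rouge">pairwise_cosine</code> and <code class="language-plaintext highlighter-rouge">compare_embedding_dict</code> are hypothetical names:</p>

```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy.stats import spearmanr

def pairwise_cosine(emb: np.ndarray) -> np.ndarray:
    """All-pairs cosine similarity for a (# species x hidden dim) matrix."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T

def compare_embedding_dict(embeddings: dict) -> pd.DataFrame:
    """Spearman rho between flattened similarity matrices, per model pair."""
    rows = []
    for m1, m2 in combinations(sorted(embeddings), 2):
        s1 = pairwise_cosine(embeddings[m1])
        s2 = pairwise_cosine(embeddings[m2])
        iu = np.triu_indices_from(s1, k=1)  # off-diagonal entries only
        rho, _ = spearmanr(s1[iu], s2[iu])
        rows.append({"model1": m1, "model2": m2, "spearman_rho": rho})
    return pd.DataFrame(rows)
```

<p>Because each model is reduced to a species × species similarity matrix before comparison, models with different hidden dimensions remain directly comparable.</p>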

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">create_correlation_heatmap</span><span class="p">(</span>
    <span class="n">embedding_comparisons</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
    <span class="n">model_order</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">:</span>
    
    <span class="c1"># Get all unique models
</span>    <span class="n">all_models</span> <span class="o">=</span> <span class="p">(</span>
        <span class="nb">set</span><span class="p">(</span><span class="n">embedding_comparisons</span><span class="p">[</span><span class="s">'model1'</span><span class="p">].</span><span class="n">unique</span><span class="p">())</span> <span class="o">|</span> \
        <span class="nb">set</span><span class="p">(</span><span class="n">embedding_comparisons</span><span class="p">[</span><span class="s">'model2'</span><span class="p">].</span><span class="n">unique</span><span class="p">())</span>
    <span class="p">)</span>
    
    <span class="c1"># Use provided order or default to sorted
</span>    <span class="k">if</span> <span class="n">model_order</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">models</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">all_models</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Validate that all models in data are in the provided order
</span>        <span class="n">missing_models</span> <span class="o">=</span> <span class="n">all_models</span> <span class="o">-</span> <span class="nb">set</span><span class="p">(</span><span class="n">model_order</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">missing_models</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Models in data but not in model_order: </span><span class="si">{</span><span class="n">missing_models</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="c1"># Use only models that exist in the data, in the specified order
</span>        <span class="n">models</span> <span class="o">=</span> <span class="p">[</span><span class="n">m</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">model_order</span> <span class="k">if</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">all_models</span><span class="p">]</span>
    
    <span class="c1"># Initialize matrix with 1s on diagonal
</span>    <span class="n">corr_matrix</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
        <span class="n">np</span><span class="p">.</span><span class="n">eye</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">models</span><span class="p">)),</span>
        <span class="n">index</span><span class="o">=</span><span class="n">models</span><span class="p">,</span> 
        <span class="n">columns</span><span class="o">=</span><span class="n">models</span>
    <span class="p">)</span>
    
    <span class="c1"># Fill in the correlations (both upper and lower triangles)
</span>    <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">embedding_comparisons</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span>
        <span class="n">corr_matrix</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'model1'</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s">'model2'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'spearman_rho'</span><span class="p">]</span>
        <span class="n">corr_matrix</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">row</span><span class="p">[</span><span class="s">'model2'</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s">'model1'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'spearman_rho'</span><span class="p">]</span>
    
    <span class="k">return</span> <span class="n">corr_matrix</span>

<span class="c1"># compute the embedding comparisons (cached since this takes a few minutes to run)
</span>
<span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">EMBEDDING_COMPARISONS_PATH</span><span class="p">)</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">OVERWRITE</span><span class="p">:</span>
    <span class="n">embedding_comparisons</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">EMBEDDING_COMPARISONS_PATH</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">device</span> <span class="o">=</span> <span class="n">select_device</span><span class="p">(</span><span class="n">mps_valid</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
    <span class="n">embedding_comparisons</span> <span class="o">=</span> <span class="n">compare_embeddings</span><span class="p">(</span><span class="n">species_embeddings</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
    <span class="n">embedding_comparisons</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">EMBEDDING_COMPARISONS_PATH</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

<span class="c1"># visualize the embedding comparisons
</span>
<span class="n">corr_matrix</span> <span class="o">=</span> <span class="n">create_correlation_heatmap</span><span class="p">(</span><span class="n">embedding_comparisons</span><span class="p">,</span> <span class="n">model_order</span><span class="o">=</span><span class="n">ordered_labels</span><span class="p">)</span>

<span class="c1"># Display as heatmap
</span><span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">triu</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">corr_matrix</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">bool</span><span class="p">),</span> <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
<span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span>
    <span class="n">corr_matrix</span><span class="p">,</span>
    <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">fmt</span><span class="o">=</span><span class="s">'.3f'</span><span class="p">,</span>
    <span class="n">cmap</span><span class="o">=</span><span class="s">'RdYlBu_r'</span><span class="p">,</span>
    <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">,</span>
    <span class="n">vmin</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">vmax</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">square</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">'label'</span><span class="p">:</span> <span class="s">'Spearman ρ'</span><span class="p">},</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span>
    <span class="s">'Vertex embedding similarity across models'</span><span class="p">,</span>
    <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
    <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">,</span>
    <span class="n">pad</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
    <span class="n">loc</span><span class="o">=</span><span class="s">'left'</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/compare_embeddings-output-1.png" alt="" /></p>

<p>All models produce highly correlated embeddings (ρ &gt; 0.6 across all
pairs), indicating they’ve captured similar biological structure despite
architectural differences. However, encoder choice does matter: GCN
embeddings correlate less strongly with GraphConv/SAGE (ρ ≈ 0.6-0.7)
than GraphConv and SAGE correlate with each other (ρ ≈ 0.9). This
suggests that while all models learn similar biological signals, GCN’s
symmetric normalization produces somewhat different vertex
representations than the mean aggregation used by GraphConv and SAGE.</p>

<h3 id="comparing-learned-edge-weights">Comparing learned edge weights</h3>

<p>As I’ve <a href="https://www.shackett.org/octopus_network/#decorating-the--graph-with-species-and-reaction-data">previously
discussed</a>,
confidence in regulatory interactions varies greatly across data
sources, and edge weights should capture this uncertainty for downstream
network analysis. However, determining appropriate edge weights is
challenging when multiple data source attributes each capture different
aspects of reliability.</p>

<p>The edge encoder provides a way to learn what makes a high-confidence
edge empirically. While I’ll examine the learned edge features in detail
later, here I’ll assess whether edge weights are consistent across model
architectures by directly comparing the ~10M learned weights from GCN +
edge encoding and GraphConv + edge encoding.</p>

<p>Since visualizing 10M points requires aggregation, I’ll use a hexbin
plot (bivariate histogram) with logit-transformed weights to map the
sigmoid outputs back to ℝ:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">safe_logit</span><span class="p">(</span><span class="n">p</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">eps</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1e-7</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">eps</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">eps</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">p</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">p</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">plot_edge_encoding_hexbin</span><span class="p">(</span>
    <span class="n">tensor1</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">tensor2</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>
    <span class="n">label1</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"Model 1"</span><span class="p">,</span>
    <span class="n">label2</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"Model 2"</span><span class="p">,</span>
    <span class="n">transform_to_logit</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="n">gridsize</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">50</span><span class="p">,</span>
    <span class="n">cmap</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">'viridis'</span><span class="p">,</span>
    <span class="n">figsize</span><span class="p">:</span> <span class="nb">tuple</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">8</span><span class="p">),</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">:</span>
    <span class="c1"># Apply logit transformation if requested
</span>    <span class="k">if</span> <span class="n">transform_to_logit</span><span class="p">:</span>
        <span class="n">tensor1</span> <span class="o">=</span> <span class="n">safe_logit</span><span class="p">(</span><span class="n">tensor1</span><span class="p">)</span>
        <span class="n">tensor2</span> <span class="o">=</span> <span class="n">safe_logit</span><span class="p">(</span><span class="n">tensor2</span><span class="p">)</span>
    
    <span class="c1"># Convert to numpy
</span>    <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">is_tensor</span><span class="p">(</span><span class="n">tensor1</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">tensor1</span><span class="p">.</span><span class="n">cpu</span><span class="p">().</span><span class="n">numpy</span><span class="p">()</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">tensor1</span><span class="p">)</span>
    
    <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">is_tensor</span><span class="p">(</span><span class="n">tensor2</span><span class="p">):</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">tensor2</span><span class="p">.</span><span class="n">cpu</span><span class="p">().</span><span class="n">numpy</span><span class="p">()</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">tensor2</span><span class="p">)</span>
    
    <span class="c1"># Create figure
</span>    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="n">figsize</span><span class="p">)</span>
    
    <span class="c1"># Create hexbin plot with log-scaled color
</span>    <span class="n">hexbin</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">hexbin</span><span class="p">(</span>
        <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span>
        <span class="n">gridsize</span><span class="o">=</span><span class="n">gridsize</span><span class="p">,</span>
        <span class="n">cmap</span><span class="o">=</span><span class="n">cmap</span><span class="p">,</span>
        <span class="n">mincnt</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">norm</span><span class="o">=</span><span class="n">LogNorm</span><span class="p">()</span>
    <span class="p">)</span>
    
    <span class="c1"># Add colorbar
</span>    <span class="n">cb</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">hexbin</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">)</span>
    <span class="n">cb</span><span class="p">.</span><span class="n">set_label</span><span class="p">(</span><span class="s">'Count (log scale)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    
    <span class="c1"># Labels and title
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="n">label1</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">13</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="n">label2</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">13</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span>
        <span class="s">'Learned edge weight similarity across models'</span><span class="p">,</span>
        <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
        <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">,</span>
        <span class="n">pad</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
        <span class="n">loc</span><span class="o">=</span><span class="s">'left'</span>
    <span class="p">)</span>
    
    <span class="c1"># Add diagonal reference line (y=x)
</span>    <span class="n">min_val</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="nb">min</span><span class="p">(),</span> <span class="n">y</span><span class="p">.</span><span class="nb">min</span><span class="p">())</span>
    <span class="n">max_val</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="nb">max</span><span class="p">(),</span> <span class="n">y</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
        <span class="p">[</span><span class="n">min_val</span><span class="p">,</span> <span class="n">max_val</span><span class="p">],</span>
        <span class="p">[</span><span class="n">min_val</span><span class="p">,</span> <span class="n">max_val</span><span class="p">],</span>
        <span class="s">'r--'</span><span class="p">,</span>
        <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
        <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
        <span class="n">label</span><span class="o">=</span><span class="s">'y=x'</span>
    <span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'upper left'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span>
    
    <span class="c1"># Equal aspect ratio
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_aspect</span><span class="p">(</span><span class="s">'equal'</span><span class="p">,</span> <span class="n">adjustable</span><span class="o">=</span><span class="s">'box'</span><span class="p">)</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plot_edge_encoding_hexbin</span><span class="p">(</span>
    <span class="n">edge_encodings</span><span class="p">[</span><span class="s">"GCN + Edge Encoding"</span><span class="p">],</span>
    <span class="n">edge_encodings</span><span class="p">[</span><span class="s">"GraphConv + Edge Encoding"</span><span class="p">],</span>
    <span class="n">label1</span><span class="o">=</span><span class="s">"GCN + Edge Encoding"</span><span class="p">,</span>
    <span class="n">label2</span><span class="o">=</span><span class="s">"GraphConv + Edge Encoding"</span><span class="p">,</span>
    <span class="n">transform_to_logit</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">gridsize</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
    <span class="n">cmap</span><span class="o">=</span><span class="s">'viridis'</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/plot_edge_encoding_comparisons-output-1.png" alt="" /></p>

<p>The edge encoder paired with GraphConv uses a wider dynamic range (logit
-5 to 7, corresponding to sigmoid weights 0.007-0.999) compared to GCN +
edge encoder (logit -3 to 0, corresponding to 0.05-0.5). This means
GraphConv’s edge encoder more thoroughly distinguishes between low-,
medium-, and high-confidence edges. Both model-encoder combinations
agree on which edges to effectively ignore (lower-left quadrant, sigmoid
weights &lt; 0.05), but differ in how strongly they upweight reliable
edges.</p>
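<p>As a quick sanity check on the ranges quoted above, the logit-to-sigmoid mapping can be verified directly:</p>

```python
import math

def sigmoid(z: float) -> float:
    # inverse of the logit transform applied by safe_logit
    return 1.0 / (1.0 + math.exp(-z))

# GraphConv + edge encoder range: logits of roughly -5 to 7
print(round(sigmoid(-5), 3), round(sigmoid(7), 3))   # ~0.007 and ~0.999
# GCN + edge encoder range: logits of roughly -3 to 0
print(round(sigmoid(-3), 2), round(sigmoid(0), 2))   # ~0.05 and 0.5
```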

<p>Having established these cross-model comparisons, I’ll now examine the
top-performing model (GraphConv + edge encoding) in detail to understand
what biological patterns it has captured and where its limitations lie.</p>

<h2 id="evaluating-the-top-model">Evaluating the top model</h2>

<p>Having compared models on performance and learned representations, I’ll
now examine what the best-performing model (GraphConv + edge encoding)
has actually learned. GNN-based edge prediction offers three potential
contributions beyond predicting missing edges:</p>

<ol>
  <li>
    <p><strong>Vertex embeddings capture molecular similarity</strong> - Embeddings
group similar vertices across entity types, enabling community
detection and direct similarity queries. Community detection could
identify functional modules — sets of proteins, metabolites, and
reactions that cluster together in the embedding space. Similarity
queries could assess how closely any two entities resemble each
other, even across different entity types.</p>
  </li>
  <li>
    <p><strong>Learned edge weights for network analysis</strong> - The edge encoder
learns weights that reflect edge reliability, potentially replacing
hand-crafted heuristics in downstream analyses like network layouts,
shortest paths, propagation algorithms, and shallow embedding
methods. For multi-source networks like the Octopus, this is
particularly valuable: rather than manually deciding how to weight
STRING coexpression scores versus IntAct citation counts, the model
learns what combinations of edge attributes indicate reliability.</p>
  </li>
  <li>
    <p><strong>Edge predictions for hypothesis generation</strong> - While
self-supervised training is the primary motivation for edge
prediction, the predictions themselves may identify plausible
regulatory connections absent from current databases. This becomes
more promising with expressive heads that can model directional
regulation rather than the symmetric similarity assumed by the dot
product.</p>
  </li>
</ol>
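<p>For instance, the similarity queries mentioned in (1) reduce to cosine similarity in embedding space. A minimal sketch, using hypothetical helper names rather than anything in Napistu:</p>

```python
import numpy as np

def cosine_similarity(embeddings: np.ndarray, i: int, j: int) -> float:
    """Cosine similarity between vertices i and j."""
    a, b = embeddings[i], embeddings[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_neighbors(embeddings: np.ndarray, i: int, k: int = 5) -> np.ndarray:
    """Indices of the k vertices most similar to vertex i (excluding itself)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[i]
    sims[i] = -np.inf  # exclude self-similarity
    return np.argsort(sims)[::-1][:k]
```

<p>Because every entity type lives in the same embedding space, the same query works whether <code class="language-plaintext highlighter-rouge">i</code> is a protein and <code class="language-plaintext highlighter-rouge">j</code> is a metabolite, or both are reactions.</p>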

<p>I’ll explore each of these potential contributions using the GraphConv +
edge encoding model.</p>

<h3 id="molecular-similarity">Molecular similarity</h3>

<h4 id="embedding-structure">Embedding structure</h4>

<p>To assess the structure of the vertex embeddings, I’ll use UMAP to
project the 128-dimensional embeddings into 2D, then overlay vertex
attributes to explore what determines similarity in the embedding space.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">embeddings</span> <span class="o">=</span> <span class="n">species_embeddings</span><span class="p">[</span><span class="n">TOP_MODEL_NAME</span><span class="p">]</span>
<span class="n">umap_layout_species</span> <span class="o">=</span> <span class="n">layout_umap</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">n_neighbors</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>

<span class="n">mask</span> <span class="o">=</span> <span class="p">[</span><span class="nb">bool</span><span class="p">(</span><span class="n">re</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="s">"__species_type"</span><span class="p">,</span> <span class="n">x</span><span class="p">))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">napistu_data</span><span class="p">.</span><span class="n">get_vertex_feature_names</span><span class="p">()]</span>
<span class="n">indices</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">mask</span><span class="p">)</span> <span class="k">if</span> <span class="n">m</span><span class="p">]</span>
<span class="n">masks</span> <span class="o">=</span> <span class="n">napistu_data</span><span class="p">.</span><span class="n">x</span><span class="p">[:,</span> <span class="n">indices</span><span class="p">]</span>

<span class="c1"># only look at the vertices with embedding values
</span><span class="n">masks</span> <span class="o">=</span> <span class="n">masks</span><span class="p">[</span><span class="n">is_species_mask</span><span class="p">]</span>
<span class="n">mask_names</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">napistu_data</span><span class="p">.</span><span class="n">get_vertex_feature_names</span><span class="p">(),</span> <span class="n">mask</span><span class="p">)</span> <span class="k">if</span> <span class="n">m</span><span class="p">]</span>

<span class="c1"># drop empty masks
</span><span class="n">empty_masks</span> <span class="o">=</span> <span class="n">masks</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span>
<span class="n">masks</span> <span class="o">=</span> <span class="n">masks</span><span class="p">[:,</span> <span class="o">~</span><span class="n">empty_masks</span><span class="p">]</span>
<span class="n">mask_names</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">mask_names</span><span class="p">,</span> <span class="o">~</span><span class="n">empty_masks</span><span class="p">)</span> <span class="k">if</span> <span class="n">m</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plot_coordinates_with_masks</span><span class="p">(</span>
    <span class="n">coordinates</span><span class="o">=</span><span class="n">umap_layout_species</span><span class="p">,</span>
    <span class="n">masks</span><span class="o">=</span><span class="n">masks</span><span class="p">,</span>
    <span class="n">mask_names</span><span class="o">=</span><span class="n">mask_names</span><span class="p">,</span>
    <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">15</span><span class="p">),</span>
    <span class="n">ncols</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
    <span class="n">cmap_bg</span><span class="o">=</span><span class="s">'lightblue'</span><span class="p">,</span>
    <span class="n">cmap_fg</span><span class="o">=</span><span class="s">'darkred'</span><span class="p">,</span>
    <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
    <span class="n">s</span><span class="o">=</span><span class="mi">5</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/vertex_embedding_umap_plot-output-1.png" alt="" /></p>

<p>The UMAP visualization shows clear clustering by entity type: proteins
cluster with proteins, and metabolites with metabolites. This isn’t
surprising given the network’s strong homophily — entities of the same
type preferentially connect. STRING alone contributes &gt;80% of the edges
in the network, and STRING edges are exclusively protein-protein
interactions.</p>
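<p>As a quick illustration, homophily can be quantified as the fraction of
edges whose endpoints share a type. Here is a minimal sketch on toy data
(the function and arrays are illustrative, not part of the Napistu API):</p>

```python
import numpy as np

def edge_homophily(edge_index, node_types):
    """Fraction of edges whose two endpoints share a node type."""
    src, dst = edge_index
    return float(np.mean(node_types[src] == node_types[dst]))

# Toy graph: nodes 0-2 are proteins ("p"), node 3 is a metabolite ("m")
node_types = np.array(["p", "p", "p", "m"])
edge_index = np.array([[0, 1, 2, 0],
                       [1, 2, 0, 3]])
print(edge_homophily(edge_index, node_types))  # 3 of 4 edges are protein-protein -> 0.75
```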

<p>However, entity types don’t completely segregate. Proteins and
metabolites intermix at cluster boundaries, and the embedding shows
finer-grained structure within each entity type. This suggests the GNN
captures more than just entity type — it’s learning the biological
organization within these categories. To reveal what additional
information the embedding encodes, I will analyze how pathway membership
and data source annotations are reflected in the representations.</p>

<h4 id="pathway-similarity">Pathway similarity</h4>

<p>To assess pathway organization in the embeddings, I’ll use the
comprehensive pathway membership artifact created earlier. This binary
tensor encodes both coarse-grained data sources (8 sources) and
fine-grained pathway annotations (2800+ Reactome pathways) for each
vertex.</p>

<p>For each source and pathway, I’ll calculate the average cosine
similarity between all vertex pairs that belong to that category. The
fine-grained Reactome pathways are then aggregated to assess whether
individual pathways produce tighter clusters than Reactome as a whole.</p>
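<p>In the post this is handled by <code class="language-plaintext highlighter-rouge">calculate_pathway_similarities</code>;
the sketch below shows the core computation with illustrative names. For unit
vectors, the sum of all distinct pairwise dot products equals
‖Σᵢuᵢ‖² − n, which avoids materializing the O(n²) similarity matrix:</p>

```python
import numpy as np

def mean_pairwise_cosine(embeddings, mask):
    """Average cosine similarity over all distinct vertex pairs in a category.

    Uses the identity sum_{i != j} u_i . u_j = ||sum_i u_i||^2 - n for
    unit vectors, avoiding the O(n^2) pairwise similarity matrix.
    """
    x = embeddings[mask]
    n = x.shape[0]
    if n < 2:
        return np.nan
    u = x / np.linalg.norm(x, axis=1, keepdims=True)  # row-normalize to unit vectors
    total = np.sum(u.sum(axis=0) ** 2)                # ||sum_i u_i||^2
    return (total - n) / (n * (n - 1))

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
mask = np.zeros(100, dtype=bool)
mask[:30] = True  # a hypothetical 30-member pathway
print(mean_pairwise_cosine(emb, mask))
```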

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># not strictly needed, but we can check that artifacts align to the canonical vertex and edge feature names from the NapistuData instance
</span><span class="n">pathway_assignments</span> <span class="o">=</span> <span class="n">comprehensive_pathway_memberships</span><span class="p">.</span><span class="n">align_to_napistu_data</span><span class="p">(</span><span class="n">napistu_data</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">data</span>

<span class="n">pathway_similarities</span> <span class="o">=</span> <span class="n">calculate_pathway_similarities</span><span class="p">(</span>
    <span class="n">embedding_matrix</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">,</span>
    <span class="n">pathway_assignments</span> <span class="o">=</span> <span class="n">pathway_assignments</span><span class="p">[</span><span class="n">is_species_mask</span><span class="p">],</span>
    <span class="n">pathway_names</span> <span class="o">=</span> <span class="n">comprehensive_pathway_memberships</span><span class="p">.</span><span class="n">feature_names</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># rename categories for clarity
</span><span class="n">pathway_similarities</span><span class="p">[</span><span class="s">"Reactome (by pathway)"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pathway_similarities</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="s">"other"</span><span class="p">)</span>
<span class="n">pathway_similarities</span><span class="p">[</span><span class="s">"Reactome (overall)"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pathway_similarities</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="s">"Reactome"</span><span class="p">)</span>
<span class="k">del</span> <span class="n">pathway_similarities</span><span class="p">[</span><span class="s">"overall"</span><span class="p">]</span>

<span class="c1"># Sort by value
</span><span class="n">sorted_items</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">pathway_similarities</span><span class="p">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">categories</span> <span class="o">=</span> <span class="p">[</span><span class="n">item</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sorted_items</span><span class="p">]</span>
<span class="n">values</span> <span class="o">=</span> <span class="p">[</span><span class="n">item</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sorted_items</span><span class="p">]</span>

<span class="c1"># Create figure and axis
</span><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>

<span class="c1"># Create barplot
</span><span class="n">bars</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">barh</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'steelblue'</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">'black'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>

<span class="c1"># Customize the plot
</span><span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Within-category cosine similarity'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Data source'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span>
    <span class="s">'Within-category cosine similarity by data source'</span><span class="p">,</span> 
    <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span>
    <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">,</span>
    <span class="n">pad</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
    <span class="n">loc</span><span class="o">=</span><span class="s">'left'</span>
<span class="p">)</span>

<span class="c1"># Add value labels on the bars
</span><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">cat</span><span class="p">,</span> <span class="n">val</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="n">values</span><span class="p">)):</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="n">val</span> <span class="o">+</span> <span class="mf">0.01</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">val</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">'</span><span class="p">,</span> 
            <span class="n">va</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">9</span><span class="p">)</span>

<span class="c1"># Add grid for easier reading
</span><span class="n">ax</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'x'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_axisbelow</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/pathway_similarity-output-1.png" alt="" /></p>

<p>The embedding structure reflects the network’s edge composition. STRING
contributes &gt;80% of edges — all protein-protein interactions — which
means the training objective is dominated by getting protein
relationships right. This pushes the model to spread proteins across the
embedding space to capture their diverse interaction patterns, resulting
in low within-source similarity for protein-rich databases: STRING
(0.061), IntAct (0.046), OmniPath (0.044).</p>

<p>In contrast, specialized sources with fewer, lower-degree entities get
pushed into tighter regions of the embedding space. Reactome (0.550
overall, 0.586 by pathway) and Recon3D (0.531) show much higher
within-source similarity. These sources contribute distinctive entity
types — complexes and proteoforms for Reactome, and detailed metabolic
species for Recon3D. The model learns to distinguish these entities from
standard proteins. However, because they contribute few edges to the
training signal, the model clusters them together rather than resolving
fine-grained structure within them.</p>

<p>This explains why Reactome pathways show only modest additional cohesion
(0.586) compared to Reactome overall (0.550): the model learns “this is
a Reactome entity” but doesn’t strongly differentiate between specific
Reactome pathways.</p>

<h3 id="learned-edge-weights">Learned edge weights</h3>

<p>Next, I’ll explore what makes a high-confidence edge using sensitivity
analysis on the edge encoder.</p>

<p>The edge encoder maps 69 edge attributes to a single weight in (0, 1).
To evaluate each attribute’s importance, I calculate its average
gradient with respect to the learned edge weight across 1M randomly
sampled edges. This sensitivity score reveals which features most
strongly influence the model’s confidence in an edge.</p>
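<p>The post's <code class="language-plaintext highlighter-rouge">compute_edge_feature_sensitivity</code>
performs this calculation; the sketch below shows the underlying autograd
pattern with a toy stand-in for the trained edge encoder:</p>

```python
import torch
import torch.nn as nn

def edge_feature_sensitivity(encoder, edge_attr, n_samples=1000):
    """Mean gradient of the scalar edge weight w.r.t. each edge attribute."""
    idx = torch.randperm(edge_attr.shape[0])[:n_samples]
    x = edge_attr[idx].clone().requires_grad_(True)
    weights = encoder(x)      # (n_samples, 1) learned weights in (0, 1)
    weights.sum().backward()  # populates x.grad with per-edge gradients
    return x.grad.mean(dim=0)  # average sensitivity per attribute

# Toy encoder standing in for the trained edge encoder (69 attributes -> 1 weight)
encoder = nn.Sequential(nn.Linear(69, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
edge_attr = torch.randn(5000, 69)
print(edge_feature_sensitivity(encoder, edge_attr).shape)  # torch.Size([69])
```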

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">device</span> <span class="o">=</span> <span class="n">select_device</span><span class="p">(</span><span class="n">mps_valid</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>

<span class="n">top_model</span> <span class="o">=</span> <span class="n">top_model_manager</span><span class="p">.</span><span class="n">load_model_from_checkpoint</span><span class="p">(</span><span class="n">top_model_manager</span><span class="p">.</span><span class="n">best_checkpoint_path</span><span class="p">)</span>
<span class="n">edge_encoder</span> <span class="o">=</span> <span class="n">get_edge_encoder</span><span class="p">(</span><span class="n">top_model</span><span class="p">)</span>
<span class="n">feature_sensitivities</span> <span class="o">=</span> <span class="n">compute_edge_feature_sensitivity</span><span class="p">(</span><span class="n">edge_encoder</span><span class="p">,</span> <span class="n">napistu_data</span><span class="p">.</span><span class="n">edge_attr</span><span class="p">,</span> <span class="mi">1000000</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>

<span class="n">formatted_feature_sensitivities</span> <span class="o">=</span> <span class="n">format_edge_feature_sensitivity</span><span class="p">(</span><span class="n">feature_sensitivities</span><span class="p">,</span> <span class="n">napistu_data</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plot_edge_feature_sensitivity</span><span class="p">(</span><span class="n">formatted_feature_sensitivities</span><span class="p">,</span> <span class="n">top_n</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">8</span><span class="p">),</span> <span class="n">truncate_names</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/edge_feature_sensitivity_plot-output-1.png" alt="" /></p>

<p>The model shows clear preferences: literature-derived evidence (OmniPath
primary sources, STRING text mining) increases edge confidence, while
indirect functional evidence (STRING coexpression transfer, experimental
transfer) decreases it. This suggests the model may be learning to
construct mechanistic regulatory relationships — combinations of
physical interactions and functional associations — rather than simply
upweighting physical interactions or functional signals independently.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>Napistu aims to capture
mechanistic relationships at genome-wide scale: an edge from A→B
indicates that A is sufficient to modify B (at least in some contexts),
enabling paths through the network to represent regulatory cascades.
However, our understanding of regulation is highly incomplete.
Gold-standard mechanistic resources like Reactome and Recon3D are
accurate but sparse — low false positive rates but high false negative
rates. To complement them, Napistu integrates broader resources: STRING
(primarily functional associations like coexpression) and
IntAct/OmniPath (physical interactions like binding and
phosphorylation). These provide a dense web of <em>plausible</em> regulatory
connections, but conflate mechanistic regulation with its functional
byproducts.</p>

<p>The Octopus network thus integrates databases with fundamentally
different evidence types. Training on edge prediction across this
integrated network may push the model to learn which combinations of
features distinguish true mechanistic regulation — relationships that
are both physically direct and functionally consequential — from mere
functional associations.</p>

<p>The observation that learned edge weights prioritize literature-derived
evidence over coexpression is encouraging. It suggests the model may be
learning mechanism-grounded causality rather than being misled by
correlation.</p>

  </div>
</div>

<p>Interpreting individual features is complicated by the edge encoder’s
nonlinear combinations. For example, OmniPath primary sources show
strong positive sensitivity while total OmniPath sources show negative
sensitivity, suggesting the model values concentrated evidence from
specific high-quality sources over diffuse evidence from many sources.</p>

<p>Notably, none of the hand-crafted confidence scores from the original
databases — STRING combined score, IntAct MI score, Reactome FI score
— appear among the most sensitive features. This suggests the edge
encoder is learning data quality signals that differ from
expert-designed heuristics, reinforcing the value of end-to-end training
for edge weighting.</p>

<p>Finally, I’ll explore the types of edges being predicted by the
GraphConv GNN.</p>

<h3 id="edge-predictions">Edge predictions</h3>

<p>As previously discussed, when negative samples differ too greatly from
real edges, the model can exploit vertex attributes that indicate
implausible connections, rather than capturing the underlying biological
network structure. To address this shortcut, negative samples were
generated using the observed node_type strata (i.e., sampling an equal
number of species→species, species→reaction, and reaction→species edges
as in the real set of edges). Yet, certain vertex features could
trivially separate real and negative edges; for instance, regulatory
RNAs never interact with metabolites in the dataset.</p>
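<p>The stratified sampling scheme can be sketched as follows; the function and
variable names here are illustrative, not the Napistu-Torch implementation:</p>

```python
import numpy as np

def stratified_negative_samples(edge_strata, nodes_by_stratum, rng):
    """Sample negative edges matching the observed (src_type, dst_type) strata.

    edge_strata: list of (src_type, dst_type) labels, one per real edge.
    nodes_by_stratum: dict mapping node type -> array of candidate node indices.
    """
    strata, counts = np.unique(edge_strata, axis=0, return_counts=True)
    negatives = []
    for (src_t, dst_t), n in zip(strata, counts):
        # draw as many negatives in this stratum as there are real edges
        src = rng.choice(nodes_by_stratum[src_t], size=n)
        dst = rng.choice(nodes_by_stratum[dst_t], size=n)
        negatives.append(np.stack([src, dst]))
    return np.concatenate(negatives, axis=1)  # shape (2, n_real_edges)

rng = np.random.default_rng(0)
nodes = {"species": np.arange(0, 50), "reaction": np.arange(50, 80)}
real_strata = [("species", "reaction")] * 70 + [("reaction", "species")] * 30
neg = stratified_negative_samples(real_strata, nodes, rng)
print(neg.shape)  # (2, 100)
```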

<p>To evaluate whether the model exploits potential misalignment between
real edges and negative samples, I’ll compare predicted edge
probabilities for each edge class to the probabilities expected from
their relative frequencies in real and negative edges.</p>
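<p>To make the comparison concrete, here's a minimal sketch of the
observed/expected calculation on toy data (in the post,
<code class="language-plaintext highlighter-rouge">summarize_edge_predictions_by_strata</code>
does the real work):</p>

```python
import numpy as np
import pandas as pd

def log2_observed_expected(df):
    """Per-stratum log2(observed/expected): a stratum's share among real
    edges (label 1) versus its share among negative samples (label 0)."""
    real = df[df["label"] == 1]["stratum"].value_counts(normalize=True)
    neg = df[df["label"] == 0]["stratum"].value_counts(normalize=True)
    return np.log2(real / neg)

# Toy data where real and negative strata match exactly
df = pd.DataFrame({
    "stratum": (["protein->protein"] * 80 + ["complex->protein"] * 20) * 2,
    "label": [1] * 100 + [0] * 100,
})
print(log2_observed_expected(df))  # both strata at log2(1) = 0
```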

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">edge_predictions</span> <span class="o">=</span> <span class="n">predict</span><span class="p">(</span>
    <span class="n">top_model_manager</span><span class="p">.</span><span class="n">get_experiment_dict</span><span class="p">(),</span>
    <span class="n">checkpoint</span><span class="o">=</span><span class="n">top_model_manager</span><span class="p">.</span><span class="n">best_checkpoint_path</span>
<span class="p">)</span>

<span class="n">species_strata_recovery</span> <span class="o">=</span> <span class="n">summarize_edge_predictions_by_strata</span><span class="p">(</span><span class="n">edge_predictions</span><span class="p">,</span> <span class="n">edge_strata_by_node_species_type</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Filter to categories with &gt;= 100 edges
</span><span class="n">species_strata_recovery_filtered</span> <span class="o">=</span> <span class="n">species_strata_recovery</span><span class="p">[</span><span class="n">species_strata_recovery</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">100</span><span class="p">]</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plot_edge_predictions_by_strata</span><span class="p">(</span><span class="n">species_strata_recovery_filtered</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-11-19-napistu_torch/edge_prediction_analysis_scatterplot-output-1.png" alt="" /></p>

<p>The plot reveals two distinct patterns. For common edge types —
particularly protein→protein interactions (bright yellow, log₂ O/E ≈ 0)
— real edges and negative samples occur at similar frequencies, yet
the model’s predictions span a wide range (0.3-1.0). This indicates the
model is learning patterns of vertex similarity that go beyond trivial
features like entity type. The model differentiates among protein pairs
based on their learned embeddings, not just their shared protein
identity.</p>

<p>For rare edge types involving specialized entities (complexes,
metabolites from Recon3D), the pattern changes. These categories often
show extreme enrichment or depletion (|log₂ O/E| &gt; 2), yet the model
assigns them consistently high prediction probabilities regardless of
whether they’re enriched or depleted. This likely reflects the tight
embedding clusters observed earlier for Reactome and Recon3D entities
— specialized molecular species cluster strongly in the embedding
space, leading the dot product head to predict high edge probabilities
between them even when such edges are rare in the training data.</p>
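<p>A toy numeric check illustrates why tight clusters inflate dot-product
scores (a sketch, not the model's actual head implementation):</p>

```python
import numpy as np

def edge_probability(a, b):
    """Dot-product head: sigmoid of the embedding inner product."""
    return 1 / (1 + np.exp(-np.dot(a, b)))

# Two vertices from a tight cluster (nearly identical embeddings)...
center = np.ones(8)
print(edge_probability(center + 0.01, center - 0.01))  # near 1: high predicted probability

# ...versus two dissimilar vertices pointing in opposite directions
print(edge_probability(np.ones(8), -np.ones(8)))  # near 0
```

<p>Because specialized Recon3D and Reactome entities embed close together,
their pairwise dot products are large and the head predicts edges between
them regardless of how rare such edges are in training.</p>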

<p>This analysis suggests the model has learned meaningful biological
structure within major edge types while potentially overgeneralizing for
rare, specialized entities. The wide prediction spread for
protein→protein edges is encouraging for future work: with more
expressive heads and validation datasets, these learned similarity
patterns could identify novel regulatory connections within
well-represented entity types.</p>

<h2 id="summary">Summary</h2>

<p>This post introduced Napistu-Torch and demonstrated that training GNNs
on genome-scale biological networks is feasible with standard hardware
— the complete suite of models trained in ~2 days on a laptop. More
importantly, I’ve established the foundational infrastructure needed to
explore this space systematically:</p>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">NapistuDataStore</code> handles conversion from biological networks
to PyTorch Geometric format with caching that eliminates rebuild
overhead.</li>
  <li>Modular encoders, heads, and edge encoders enable architectural
exploration through configuration files rather than code changes.</li>
  <li>PyTorch Lightning integration and CLI-driven workflows make
experiments reproducible and trackable via Weights &amp; Biases.</li>
  <li>Edge prediction provides self-supervised training without
ground-truth labels while stratified negative sampling maintains
computational tractability.</li>
  <li>Vertex embeddings, learned edge weights, and prediction patterns
reveal what biological structure the models capture.</li>
</ul>

<p>The most impactful findings are that different GNN architectures
converge on similar biological representations — suggesting the signal
is robust — and that this signal is not trivially recovered from vertex
attributes such as data source or molecule type.</p>

<p>This work creates a low-activation-energy foundation for exploring how
GNNs can tap the potential of comprehensive biological networks like the
Octopus. The hard infrastructure work is done — what remains is the
interesting part: using these tools to build accurate genome-scale
representations and discover novel biology.</p>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="ML" /><category term="python" /><category term="SWE" /><category term="GNNs" /><category term="PyTorch" /><summary type="html"><![CDATA[Biological applications of graph neural networks (GNNs) typically work with either small curated networks (100s-1,000s of nodes) or aggressively filtered subsets of large databases like STRING. The Octopus graph — which I introduced in my previous post — occupies a different space entirely. By integrating eight complementary pathway databases, it creates a genome-scale network with ~50K proteins, metabolites, and complexes spanning ~10M edges, all while preserving rich metadata about edge provenance, confidence scores, and mechanistic detail that filtered approaches discard. This puts the Octopus in uncharted territory: large enough to capture genome-scale complexity, yet structured enough to preserve the biological interpretability that makes network analysis valuable. GNNs scale well beyond genome-scale requirements (100M+ nodes in social networks), but remain unexplored for comprehensive biological networks that integrate regulatory, metabolic, and interaction data. Bridging this gap requires infrastructure that handles both the biological complexity of multi-source networks and the engineering complexity of training GNNs at scale. In this post, I’ll introduce Napistu-Torch — the infrastructure that finally makes this space navigable. Available from PyPI and indexed by the Napistu MCP server, Napistu-Torch provides a modular, reproducible framework for training GNNs on comprehensive biological networks. I’ll demonstrate that it’s feasible to train graph convolutional networks on the complete Octopus network using just a laptop (albeit with 2 days of training time for the full suite of models). 
But the real contribution is the ecosystem: the data structures, pipelines, and evaluation strategies that unlock far more sophisticated analyses.]]></summary></entry><entry><title type="html">Napistu’s Octopus: An 8-source human consensus pathway model</title><link href="https://www.shackett.org/octopus_network/" rel="alternate" type="text/html" title="Napistu’s Octopus: An 8-source human consensus pathway model" /><published>2025-10-07T00:00:00+00:00</published><updated>2025-10-07T00:00:00+00:00</updated><id>https://www.shackett.org/octopus_network</id><content type="html" xml:base="https://www.shackett.org/octopus_network/"><![CDATA[<p>Introducing the Octopus: Napistu’s eight-source Human Consensus Pathway
Model that unites the breadth of protein-protein interaction networks
with the depth of regulatory databases and metabolic models. The result
is a genome-scale directed graph that is both densely connected and
mechanistically precise. In this post, I will:</p>

<ul>
  <li>Provide an overview of the Octopus model and its construction</li>
  <li>Show side-by-side summaries of individual data sources highlighting
their complementarity</li>
  <li>Demonstrate that the model successfully merges results, creating a
dense network covering the complete cellular repertoire of genes,
metabolites, drugs, and complexes</li>
  <li>Illustrate how source-level information can be carried forward to
the Octopus’s graphical network to augment its vertex and edge
features</li>
</ul>

<!--more-->

<p><img src="https://www.shackett.org/figure/octopus_network/octopus_network.png" alt="Octopus network" style="width: 70%;" /></p>

<p>The model is distributed as a set of related Napistu assets bundled
together. The core components are two major Napistu data structures:</p>

<ul>
  <li><a href="https://github.com/napistu/napistu/wiki/SBML-DFs"><code class="language-plaintext highlighter-rouge">SBML_dfs</code></a>: An
in-memory relational database organizing molecular species (genes,
metabolites, complexes, drugs) and their relationships (reactions,
interactions). I’ll provide a thorough review of this format below.</li>
  <li><a href="https://github.com/napistu/napistu/wiki/Napistu-Graphs"><code class="language-plaintext highlighter-rouge">NapistuGraph</code></a>:
A directed graph representation of the same network, translating
molecular species and reactions into a network optimized for
downstream analysis.</li>
</ul>

<h2 id="building-the-">Building the 🐙</h2>

<p>I built the Octopus using Napistu’s CLI, which processes individual
pathway sources, merges them into consensus models, and translates them
into genome-scale molecular networks. The build process runs as a cached
<a href="https://github.com/napistu/napistu/blob/main/dev/create_human_consensus.qmd">Quarto
notebook</a>
— sufficient for present needs, but a dedicated workflow manager like
NextFlow or Airflow would be better suited for broader adoption within
the research community.</p>

<p>The Octopus build process follows seven sequential steps:</p>

<ol>
  <li><strong>Ingest</strong> data source-specific content and format as <code class="language-plaintext highlighter-rouge">SBML_dfs</code>
objects</li>
  <li><strong>Standardize</strong> compartmentalization — the Octopus uses an
uncompartmentalized approach for simplicity</li>
  <li><strong>Merge</strong> <code class="language-plaintext highlighter-rouge">SBML_dfs</code> objects into a single consensus model</li>
  <li><strong>Filter</strong> cofactors to prevent molecules like water from appearing
as hub regulators</li>
  <li><strong>Convert</strong> the <code class="language-plaintext highlighter-rouge">SBML_dfs</code> into a <code class="language-plaintext highlighter-rouge">NapistuGraph</code> network
representation</li>
  <li><strong>Generate</strong> derived summaries including precomputed molecular
distances</li>
  <li><strong>Package</strong> all components into a single artifact and deploy to
Google Cloud Storage</li>
</ol>

<p><img src="https://www.shackett.org/figure/octopus_network/octopus_build_process.png" alt="Graphical layout of the build process for the Octopus model" style="width: 100%;" /></p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>The uncompartmentalized approach
sacrifices one of Napistu’s most compelling features: modeling spatial
organization and transport mechanisms. These are fundamental to
physiology and pathophysiology — such as proton transport for ATP
synthesis or protein aggregation in neurodegeneration. Napistu could
uniquely extend quantitative metabolic modeling principles to
genome-scale networks of cellular physiology, even treating cell types
or tissues as compartments to model local-global process interactions in
Systems Physiology.</p>

<p>I’m excited about these directions, but building a compartmentalized
model is a major effort — it demands strong biological use cases and
high-quality data sources. The biggest challenge is defining the right
level of compartmental granularity and aligning data sources to that
resolution for effective integration. My current uncompartmentalized
approach sidesteps this complexity, though Human Proteome Atlas
integration could provide a path forward when the right biological
question arises.</p>

  </div>
</div>

<h2 id="follow-along">Follow Along!</h2>

<h3 id="environment-setup">Environment setup</h3>

<p>To follow along with the code in this post, you’ll need a Python
environment with the <code class="language-plaintext highlighter-rouge">napistu</code> package installed. Here’s a simple setup
using <code class="language-plaintext highlighter-rouge">venv</code>:</p>

<ol>
  <li>
    <p>Install <a href="https://docs.astral.sh/uv/#highlights">uv</a> (or use <code class="language-plaintext highlighter-rouge">pip</code> if
preferred)</p>
  </li>
  <li>
    <p>Setup a Python environment:</p>
  </li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv venv <span class="nt">--python</span> 3.11
<span class="nb">source</span> .venv/bin/activate

<span class="c"># Core dependencies</span>
uv pip <span class="nb">install</span> <span class="s2">"napistu==0.7.1"</span>
<span class="c"># if you'd like to render the notebook, you'll need to install these additional dependencies</span>
uv pip <span class="nb">install </span>seaborn ipykernel nbformat nbclient
python <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span><span class="o">=</span>blog-staging
</code></pre></div></div>

<ol start="3">
  <li>
    <p>Download the
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/octopus_network.qmd"><code class="language-plaintext highlighter-rouge">octopus_network.qmd</code></a>
notebook (or copy and paste the relevant code blocks)</p>
  </li>
  <li>
    <p>Configure <code class="language-plaintext highlighter-rouge">DATA_DIR</code> in the setup code to a path where you’re
comfortable saving the consensus <code class="language-plaintext highlighter-rouge">SBML_dfs</code> model</p>
  </li>
</ol>

<h3 id="configuring-the-python-notebook">Configuring the Python notebook</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">math</span> <span class="kn">import</span> <span class="n">pi</span>

<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>

<span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">utils</span> <span class="k">as</span> <span class="n">napistu_utils</span>
<span class="kn">from</span> <span class="nn">napistu.gcs</span> <span class="kn">import</span> <span class="n">downloads</span>
<span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">sbml_dfs_utils</span>
<span class="kn">from</span> <span class="nn">napistu.network</span> <span class="kn">import</span> <span class="n">ng_utils</span>
<span class="kn">from</span> <span class="nn">napistu.sbml_dfs_core</span> <span class="kn">import</span> <span class="n">SBML_dfs</span>
<span class="kn">from</span> <span class="nn">napistu.network.ng_core</span> <span class="kn">import</span> <span class="n">NapistuGraph</span>
<span class="kn">from</span> <span class="nn">napistu.ontologies.constants</span> <span class="kn">import</span> <span class="n">SPECIES_TYPE_PLURAL</span>

<span class="kn">from</span> <span class="nn">shackett_utils.blog.html_utils</span> <span class="kn">import</span> <span class="n">display_tabulator</span>

<span class="c1"># globals
</span><span class="n">DATA_DIR</span> <span class="o">=</span> <span class="s">"/tmp/napistu_data"</span>
<span class="n">ASSET</span> <span class="o">=</span> <span class="s">"human_consensus"</span>
<span class="n">VERSION_TAG</span> <span class="o">=</span> <span class="s">"20250923"</span>
<span class="n">INPUT_SBML_DFS_SUMMARIES_URL</span> <span class="o">=</span> <span class="s">"https://raw.githubusercontent.com/shackett/shackett/main/assets/data/octopus_input_sbml_dfs_summaries.json"</span>

<span class="c1"># utils
</span><span class="k">def</span> <span class="nf">cooccurrence_to_conditional_prob</span><span class="p">(</span><span class="n">cooccur_df</span><span class="p">):</span>
    <span class="n">set_sizes</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">cooccur_df</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>
    <span class="n">intersection</span> <span class="o">=</span> <span class="n">cooccur_df</span><span class="p">.</span><span class="n">values</span>
    <span class="n">conditional_prob</span> <span class="o">=</span> <span class="n">intersection</span> <span class="o">/</span> <span class="n">set_sizes</span>  <span class="c1"># P(A|B) = |A ∩ B| / |B|
</span>    <span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">conditional_prob</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">cooccur_df</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cooccur_df</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">simple_pd_heatmap</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">plot_title</span><span class="p">,</span> <span class="n">colorbar_label</span><span class="o">=</span><span class="s">"Counts"</span><span class="p">,</span> <span class="n">fmt</span><span class="o">=</span><span class="s">"d"</span><span class="p">):</span>
    <span class="c1"># Set up the figure size and style
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">rcParams</span><span class="p">.</span><span class="n">update</span><span class="p">({</span><span class="s">'font.size'</span><span class="p">:</span> <span class="mi">15</span><span class="p">})</span>  <span class="c1"># Base font size
</span>    
    <span class="c1"># Create clustermap with proper sizing
</span>    <span class="n">g</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">clustermap</span><span class="p">(</span>
        <span class="n">df</span><span class="p">,</span> 
        <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>  <span class="c1"># Show values in cells
</span>        <span class="n">cmap</span><span class="o">=</span><span class="s">'Blues'</span><span class="p">,</span> 
        <span class="n">fmt</span><span class="o">=</span><span class="n">fmt</span><span class="p">,</span>
        <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">'label'</span><span class="p">:</span> <span class="n">colorbar_label</span><span class="p">},</span>
        <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span>  <span class="c1"># Larger figure
</span>        <span class="n">annot_kws</span><span class="o">=</span><span class="p">{</span><span class="s">'size'</span><span class="p">:</span> <span class="mi">12</span><span class="p">},</span>  <span class="c1"># Annotation font size
</span>        <span class="n">cbar_pos</span><span class="o">=</span><span class="p">(</span><span class="mf">0.02</span><span class="p">,</span> <span class="mf">0.83</span><span class="p">,</span> <span class="mf">0.03</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">),</span>  <span class="c1"># Colorbar position (left, bottom, width, height)
</span>    <span class="p">)</span>
    
    <span class="c1"># Increase font sizes for axis labels
</span>    <span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">get_xlabel</span><span class="p">(),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    <span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">get_ylabel</span><span class="p">(),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    
    <span class="c1"># Rotate and adjust tick labels for better readability
</span>    <span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'x'</span><span class="p">,</span> <span class="n">labelsize</span><span class="o">=</span><span class="mi">11</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">)</span>
    <span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelsize</span><span class="o">=</span><span class="mi">11</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    
    <span class="c1"># Add title with proper positioning (left-aligned)
</span>    <span class="n">g</span><span class="p">.</span><span class="n">fig</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span><span class="n">plot_title</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="mf">0.98</span><span class="p">,</span> <span class="n">horizontalalignment</span><span class="o">=</span><span class="s">'left'</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="mf">0.05</span><span class="p">)</span>
    
    <span class="c1"># Adjust layout to prevent clipping
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    
    <span class="c1"># Return the clustermap object for further customization if needed
</span>    <span class="k">return</span> <span class="n">g</span>

<span class="k">def</span> <span class="nf">create_pathway_radar_plot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">),</span> <span class="n">title</span><span class="o">=</span><span class="s">'Pathway Analysis Radar Plot'</span><span class="p">):</span>
    
    <span class="c1"># Get categories (columns) and pathways (rows)
</span>    <span class="n">categories</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
    <span class="n">pathways</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
    
    <span class="c1"># Number of variables
</span>    <span class="n">num_vars</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">categories</span><span class="p">)</span>
    
    <span class="c1"># Compute angle for each axis
</span>    <span class="n">angles</span> <span class="o">=</span> <span class="p">[</span><span class="n">n</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="n">num_vars</span><span class="p">)</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">pi</span> <span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_vars</span><span class="p">)]</span>
    <span class="n">angles</span> <span class="o">+=</span> <span class="n">angles</span><span class="p">[:</span><span class="mi">1</span><span class="p">]</span>  <span class="c1"># Complete the circle
</span>    
    <span class="c1"># Initialize the plot
</span>    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="n">figsize</span><span class="p">,</span> <span class="n">subplot_kw</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span><span class="n">projection</span><span class="o">=</span><span class="s">'polar'</span><span class="p">))</span>
    
    <span class="c1"># Color palette for different pathways
</span>    <span class="n">colors</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">cm</span><span class="p">.</span><span class="n">tab10</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">pathways</span><span class="p">)))</span>
    
    <span class="c1"># Plot each pathway
</span>    <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">pathway</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">pathways</span><span class="p">):</span>
        <span class="n">values</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">pathway</span><span class="p">].</span><span class="n">values</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
        
        <span class="c1"># Log10 transform (add 1 to avoid log(0))
</span>        <span class="n">log_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">values</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
        
        <span class="c1"># Complete the circle
</span>        <span class="n">plot_values</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">log_values</span><span class="p">)</span> <span class="o">+</span> <span class="p">[</span><span class="n">log_values</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
        
        <span class="c1"># Plot
</span>        <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">angles</span><span class="p">,</span> <span class="n">plot_values</span><span class="p">,</span> <span class="s">'o-'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> 
                <span class="n">label</span><span class="o">=</span><span class="n">pathway</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">angles</span><span class="p">,</span> <span class="n">plot_values</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.15</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span>
    
    <span class="c1"># Fix axis to go in the right order and start at 12 o'clock
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_theta_offset</span><span class="p">(</span><span class="n">pi</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_theta_direction</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    
    <span class="c1"># Set category labels - use built-in matplotlib positioning
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">angles</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">categories</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">11</span><span class="p">)</span>
    
    <span class="c1"># Adjust label padding to move them outside the plot
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'x'</span><span class="p">,</span> <span class="n">pad</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
    
    <span class="c1"># Set y-axis labels to show original values at powers of 10
</span>    <span class="c1"># Determine the max value to set appropriate range
</span>    <span class="n">max_log_value</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">values</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
    
    <span class="c1"># Create ticks at powers of 10: 10, 100, 1000, 10000, etc.
</span>    <span class="n">max_power</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">ceil</span><span class="p">(</span><span class="n">max_log_value</span><span class="p">))</span>
    <span class="n">ytick_values</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10</span><span class="o">**</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">max_power</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span>
    <span class="n">ytick_positions</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">v</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">ytick_values</span><span class="p">]</span>
    
    <span class="n">ax</span><span class="p">.</span><span class="n">set_yticks</span><span class="p">(</span><span class="n">ytick_positions</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">([</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">v</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">'</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">ytick_values</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="mi">9</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="mi">10</span><span class="o">**</span><span class="n">max_power</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
    
    <span class="c1"># Add grid
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="bp">True</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
    
    <span class="c1"># Add legend
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'upper right'</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.3</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    
    <span class="c1"># Add title
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">pad</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span>
</code></pre></div></div>
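As a quick sanity check of the conditional-probability helper, here is a minimal usage example on a toy co-occurrence matrix (the function is restated verbatim so the snippet runs on its own; the toy counts are made up for illustration):

```python
import numpy as np
import pandas as pd

def cooccurrence_to_conditional_prob(cooccur_df):
    # Same logic as the helper above: P(A|B) = |A ∩ B| / |B|.
    # Dividing by the diagonal broadcasts along rows, so entry [i, j]
    # is |A_i ∩ A_j| / |A_j|, i.e. conditioned on the column set.
    set_sizes = np.diag(cooccur_df.values)
    conditional_prob = cooccur_df.values / set_sizes
    return pd.DataFrame(conditional_prob, index=cooccur_df.index, columns=cooccur_df.columns)

# Toy co-occurrence matrix: diagonal entries are set sizes,
# off-diagonal entries are intersection sizes
cooccur = pd.DataFrame([[10, 4], [4, 20]], index=["A", "B"], columns=["A", "B"])

cond = cooccurrence_to_conditional_prob(cooccur)
# P(A|B) = 4/20 = 0.2, P(B|A) = 4/10 = 0.4, and each diagonal is 1
```

Note that the result is asymmetric: conditioning on the larger set yields the smaller probability, which is why the heatmaps below are read column-wise.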

<h2 id="data-sources">Data sources</h2>

<p>The Octopus’s integration success stems from Napistu’s flexible
<a href="https://github.com/napistu/napistu/wiki/SBML-DFs"><code class="language-plaintext highlighter-rouge">SBML_dfs</code></a> data
structure, which standardizes diverse pathway sources while preserving
their unique molecular and mechanistic contributions.</p>

<h3 id="overview-of-the-sbml_dfs-pathway-representation">Overview of the <code class="language-plaintext highlighter-rouge">SBML_dfs</code> pathway representation</h3>

<p>The core <code class="language-plaintext highlighter-rouge">SBML_dfs</code> data representation involves five tables linked by
primary key-foreign key relationships:</p>

<ul>
  <li><strong>Compartments</strong>: Define distinct cellular locations (e.g., cytosol,
nucleoplasm). Uncompartmentalized models contain only “cellular
component” by convention.</li>
  <li><strong>Species</strong>: Catalog distinct molecular entities including proteins,
metabolites, complexes, and drugs.</li>
  <li><strong>Compartmentalized Species</strong>: Map each species to its specific
compartmental locations.</li>
  <li><strong>Reactions</strong>: Represent distinct biochemical events including
metabolic reactions, complex formation, and physical/functional
interactions.</li>
  <li><strong>Reaction Species</strong>: Define each compartmentalized species’ role in
specific reactions (substrate, catalyst, inhibitor, etc.).</li>
</ul>

<p>Additional optional tables store quantitative annotations beyond the
core schema:</p>

<ul>
  <li><strong>Species Data</strong>: Contains tables with molecular species-specific
quantitative attributes.</li>
  <li><strong>Reactions Data</strong>: Contains tables with reaction-specific
quantitative attributes.</li>
</ul>

<p><img src="https://www.shackett.org/figure/octopus_network/sbml_dfs_schema.png" alt="The SBML_dfs database schema" style="width: 100%;" /></p>
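<p>To make the relational structure concrete, here is a minimal sketch of the five core tables as linked pandas DataFrames. The IDs, column names, and the glucose-phosphorylation example are hypothetical toy data for illustration, not Napistu’s actual constructor API:</p>

```python
import pandas as pd

# Five toy tables linked by primary-key/foreign-key relationships
# (hypothetical IDs and columns, not Napistu's real schema)
compartments = pd.DataFrame(
    {"c_name": ["cellular component"]}, index=pd.Index(["c0"], name="c_id")
)
species = pd.DataFrame(
    {"s_name": ["glucose", "hexokinase"]}, index=pd.Index(["s0", "s1"], name="s_id")
)
# compartmentalized species map each species to a compartment
comp_species = pd.DataFrame(
    {"s_id": ["s0", "s1"], "c_id": ["c0", "c0"]},
    index=pd.Index(["sc0", "sc1"], name="sc_id"),
)
reactions = pd.DataFrame(
    {"r_name": ["glucose phosphorylation"]}, index=pd.Index(["r0"], name="r_id")
)
# reaction species assign each compartmentalized species a role in a reaction
reaction_species = pd.DataFrame(
    {"r_id": ["r0", "r0"], "sc_id": ["sc0", "sc1"], "role": ["substrate", "catalyst"]},
    index=pd.Index(["rsc0", "rsc1"], name="rsc_id"),
)

# Following the foreign keys walks from a reaction to its participants
participants = (
    reaction_species
    .merge(comp_species, left_on="sc_id", right_index=True)
    .merge(species, left_on="s_id", right_index=True)
)
print(participants[["r_id", "s_name", "role"]])
```

<p>The same join pattern generalizes: any reaction-level question reduces to merging <code class="language-plaintext highlighter-rouge">reaction_species</code> outward through the foreign keys.</p>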

<h3 id="source-descriptions">Source descriptions</h3>

<p>Each data source is formatted as a separate SBML_dfs object
encapsulating its molecular species, their interactions, and any
quantitative data.</p>

<ul>
  <li><strong>Reactome</strong> is the gold-standard human pathway database, employing
rigorous expert curation with multi-tier review by over 820
scientists to produce reaction-centric models of cellular processes.</li>
  <li><strong>Recon3D</strong> is a comprehensive human metabolic model that enables
quantitative flux balance analysis and phenotype prediction.</li>
  <li><strong>STRING</strong> is a comprehensive protein interaction database that
integrates evidence from seven distinct channels with probabilistic
scoring to capture functional associations rather than directional
causality. Its strength is broad multi-organism coverage with
confidence scores calibrated to known pathway relationships.</li>
  <li><strong>IntAct</strong> is a manually curated database of experimentally verified
molecular interactions with unprecedented annotation depth, making
it the gold standard for high-confidence molecular interaction data.</li>
  <li><strong>Reactome-FI</strong> transforms Reactome’s detailed biochemical reactions
into simplified functional interaction networks using machine
learning.</li>
  <li><strong>OmniPath</strong> is a comprehensive integration database that combines
data from over 100 resources into unified directed signaling
networks with sophisticated consensus-building mechanisms. It
specializes in literature-curated activity flow interactions with
effect signs (activation/inhibition).</li>
  <li><strong>TRRUST</strong> uses sentence-based text mining to identify transcription
factor-target regulatory relationships from Medline abstracts.</li>
  <li><strong>Dogma</strong> is a Napistu-specific resource that contributes gene
annotations to help merge species across different primary
ontologies without adding reactions to the consensus model.</li>
</ul>

<h3 id="source-comparisons">Source comparisons</h3>

<p>To understand how these sources complement each other, I’ll provide four
side-by-side analyses examining the scale and characteristics of each
database:</p>

<ul>
  <li><strong>Scale</strong>: How many species and reactions does each source contain?</li>
  <li><strong>Molecular diversity</strong>: What types of entities exist (proteins,
metabolites, complexes, drugs)?</li>
  <li><strong>Interaction mechanisms</strong>: How do molecules connect (undirected
associations, directed regulation, metabolic transformations)?</li>
  <li><strong>Quantitative data</strong>: What additional measurements do sources
provide (confidence scores, expression levels, binding affinities)?</li>
</ul>

<p>These comparisons use summary statistics extracted from each source’s
SBML_dfs during the Octopus build process. The summaries were saved to a
public GitHub repository for reproducibility and transparency. For
reference, the source summaries were generated using this non-runnable
code block:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">consensus</span>
<span class="kn">from</span> <span class="nn">napistu.sbml_dfs_core</span> <span class="kn">import</span> <span class="n">SBML_dfs</span>
<span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">utils</span>

<span class="n">sbml_dfs_uris</span> <span class="o">=</span> <span class="p">[</span>
    <span class="c1"># mechanisms
</span>    <span class="s">"napistu_data/human_consensus/cache/reactome/reactome.pkl"</span><span class="p">,</span>
    <span class="s">"napistu_data/human_consensus/cache/bigg.pkl"</span><span class="p">,</span>
    <span class="c1"># consensus interactions
</span>    <span class="s">"napistu_data/human_consensus/cache/hpa_filtered_string.pkl"</span><span class="p">,</span>
    <span class="c1"># PPIs
</span>    <span class="s">"napistu_data/human_consensus/cache/intact.pkl"</span><span class="p">,</span>
    <span class="c1"># regulatory mechanisms
</span>    <span class="s">"napistu_data/human_consensus/cache/omnipath.pkl"</span><span class="p">,</span>
    <span class="s">"napistu_data/human_consensus/cache/reactome_fi.pkl"</span><span class="p">,</span>
    <span class="s">"napistu_data/human_consensus/cache/trrust.pkl"</span><span class="p">,</span>
    <span class="c1"># gene annotations
</span>    <span class="s">"napistu_data/human_consensus/cache/dogma_sbml_dfs.pkl"</span><span class="p">,</span>
<span class="p">]</span>

<span class="n">sbml_dfs_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">SBML_dfs</span><span class="p">.</span><span class="n">from_pickle</span><span class="p">(</span><span class="n">uri</span><span class="p">)</span> <span class="k">for</span> <span class="n">uri</span> <span class="ow">in</span> <span class="n">sbml_dfs_uris</span><span class="p">]</span>

<span class="c1"># reorganize as a list and table containing model-level metadata from the individual SBML_dfs
</span><span class="n">sbml_dfs_dict</span><span class="p">,</span> <span class="n">pw_index</span> <span class="o">=</span> <span class="n">consensus</span><span class="p">.</span><span class="n">prepare_consensus_model</span><span class="p">(</span><span class="n">sbml_dfs_list</span><span class="p">)</span>
<span class="n">sbml_dfs_dict_summaries</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">.</span><span class="n">get_summary</span><span class="p">()</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">sbml_dfs_dict</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>
<span class="n">utils</span><span class="p">.</span><span class="n">save_json</span><span class="p">(</span><span class="s">"&lt;&lt;MY_LOCAL_PATH&gt;&gt;/input_sbml_dfs_summaries.json"</span><span class="p">,</span> <span class="n">sbml_dfs_dict_summaries</span><span class="p">)</span>
</code></pre></div></div>

<p>I can load these pre-computed summaries directly from GitHub:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sbml_dfs_dict_summaries</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">INPUT_SBML_DFS_SUMMARIES_URL</span><span class="p">).</span><span class="n">json</span><span class="p">()</span>
</code></pre></div></div>

<p>These summaries enable direct side-by-side comparison of each source’s
unique characteristics and contributions to the consensus model.</p>

<h4 id="scale">Scale</h4>

<p>I’ll count entities in each source to assess their relative sizes.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># count each source's entities by type
</span><span class="n">entity_type_counts</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="s">"n_entity_types"</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">sbml_dfs_dict_summaries</span><span class="p">.</span><span class="n">items</span><span class="p">()})</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span><span class="n">entity_type_counts</span><span class="p">,</span> <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span> <span class="n">caption</span> <span class="o">=</span> <span class="s">"Entity counts per source"</span><span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Entity counts per source
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: &quot;Reactome&quot;, &quot;compartmentalized_species&quot;: 23905, &quot;compartments&quot;: 135, &quot;reaction_species&quot;: 63339, &quot;reactions&quot;: 15532, &quot;species&quot;: 22284}, {&quot;index&quot;: &quot;Recon3D&quot;, &quot;compartmentalized_species&quot;: 8083, &quot;compartments&quot;: 9, &quot;reaction_species&quot;: 54912, &quot;reactions&quot;: 10600, &quot;species&quot;: 4476}, {&quot;index&quot;: &quot;STRING&quot;, &quot;compartmentalized_species&quot;: 19384, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 8326852, &quot;reactions&quot;: 4163426, &quot;species&quot;: 19384}, {&quot;index&quot;: &quot;IntAct&quot;, &quot;compartmentalized_species&quot;: 22467, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 1002874, &quot;reactions&quot;: 501437, &quot;species&quot;: 22467}, {&quot;index&quot;: &quot;OmniPath&quot;, &quot;compartmentalized_species&quot;: 19509, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 952998, &quot;reactions&quot;: 476499, &quot;species&quot;: 19509}, {&quot;index&quot;: &quot;Reactome-FI&quot;, &quot;compartmentalized_species&quot;: 13733, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 942060, &quot;reactions&quot;: 471030, &quot;species&quot;: 13733}, {&quot;index&quot;: &quot;TRRUST&quot;, &quot;compartmentalized_species&quot;: 2862, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 16854, &quot;reactions&quot;: 8427, &quot;species&quot;: 2862}, {&quot;index&quot;: &quot;Dogma&quot;, &quot;compartmentalized_species&quot;: 19362, &quot;compartments&quot;: 1, &quot;reaction_species&quot;: 2, &quot;reactions&quot;: 1, &quot;species&quot;: 19362}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;compartmentalized_species&quot;, &quot;field&quot;: &quot;compartmentalized_species&quot;}, {&quot;title&quot;: 
&quot;compartments&quot;, &quot;field&quot;: &quot;compartments&quot;}, {&quot;title&quot;: &quot;reaction_species&quot;, &quot;field&quot;: &quot;reaction_species&quot;}, {&quot;title&quot;: &quot;reactions&quot;, &quot;field&quot;: &quot;reactions&quot;}, {&quot;title&quot;: &quot;species&quot;, &quot;field&quot;: &quot;species&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>Sources contain similar numbers of molecular species but reaction counts
vary dramatically — from <em>TRRUST</em>’s ~8K reactions to <em>STRING</em>’s 4.2M.
<em>Dogma</em> contains only one placeholder reaction since it contributes gene
annotations rather than interactions, helping merge species across
different ontologies (<em>Ensembl</em>, <em>UniProt</em>, <em>Entrez</em>).</p>

<h4 id="molecular-diversity">Molecular diversity</h4>

<p>Each source specializes in different molecular entity types based on
their ontological annotations.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">species_type_counts</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="s">"n_species_per_type"</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">sbml_dfs_dict_summaries</span><span class="p">.</span><span class="n">items</span><span class="p">()})</span>
    <span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'Int64'</span><span class="p">)</span>
    <span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span> <span class="o">=</span> <span class="n">SPECIES_TYPE_PLURAL</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
  <span class="n">species_type_counts</span><span class="p">,</span>
  <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span>
  <span class="n">caption</span> <span class="o">=</span> <span class="s">"Counts of molecular species types in each source"</span>
<span class="p">)</span>

<span class="n">RADAR_ORDER</span> <span class="o">=</span> <span class="p">[</span><span class="s">"proteins"</span><span class="p">,</span> <span class="s">"metabolites"</span><span class="p">,</span> <span class="s">"drugs"</span><span class="p">,</span> <span class="s">"unknowns"</span><span class="p">,</span> <span class="s">"other"</span><span class="p">,</span> <span class="s">"regulatory RNAs"</span><span class="p">,</span> <span class="s">"complexes"</span><span class="p">]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">create_pathway_radar_plot</span><span class="p">(</span>
    <span class="n">species_type_counts</span><span class="p">[</span><span class="n">RADAR_ORDER</span><span class="p">],</span>
    <span class="n">title</span> <span class="o">=</span> <span class="s">"Species types by pathway source"</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Counts of molecular species types in each source
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: &quot;Reactome&quot;, &quot;complexes&quot;: 14818, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 1591, &quot;other&quot;: 748, &quot;proteins&quot;: 5123, &quot;regulatory RNAs&quot;: 4, &quot;unknowns&quot;: 0}, {&quot;index&quot;: &quot;Recon3D&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 128, &quot;metabolites&quot;: 2664, &quot;other&quot;: 0, &quot;proteins&quot;: 1665, &quot;regulatory RNAs&quot;: 0, &quot;unknowns&quot;: 19}, {&quot;index&quot;: &quot;STRING&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 0, &quot;other&quot;: 0, &quot;proteins&quot;: 19384, &quot;regulatory RNAs&quot;: 0, &quot;unknowns&quot;: 0}, {&quot;index&quot;: &quot;IntAct&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 191, &quot;other&quot;: 177, &quot;proteins&quot;: 22051, &quot;regulatory RNAs&quot;: 48, &quot;unknowns&quot;: 0}, {&quot;index&quot;: &quot;OmniPath&quot;, &quot;complexes&quot;: 169, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 958, &quot;other&quot;: 523, &quot;proteins&quot;: 16294, &quot;regulatory RNAs&quot;: 929, &quot;unknowns&quot;: 636}, {&quot;index&quot;: &quot;Reactome-FI&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 0, &quot;other&quot;: 0, &quot;proteins&quot;: 13636, &quot;regulatory RNAs&quot;: 0, &quot;unknowns&quot;: 97}, {&quot;index&quot;: &quot;TRRUST&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 0, &quot;other&quot;: 0, &quot;proteins&quot;: 2809, &quot;regulatory RNAs&quot;: 0, &quot;unknowns&quot;: 53}, {&quot;index&quot;: &quot;Dogma&quot;, &quot;complexes&quot;: 0, &quot;drugs&quot;: 0, &quot;metabolites&quot;: 0, &quot;other&quot;: 0, &quot;proteins&quot;: 19362, &quot;regulatory RNAs&quot;: 0, &quot;unknowns&quot;: 0}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: 
&quot;complexes&quot;, &quot;field&quot;: &quot;complexes&quot;}, {&quot;title&quot;: &quot;drugs&quot;, &quot;field&quot;: &quot;drugs&quot;}, {&quot;title&quot;: &quot;metabolites&quot;, &quot;field&quot;: &quot;metabolites&quot;}, {&quot;title&quot;: &quot;other&quot;, &quot;field&quot;: &quot;other&quot;}, {&quot;title&quot;: &quot;proteins&quot;, &quot;field&quot;: &quot;proteins&quot;}, {&quot;title&quot;: &quot;regulatory RNAs&quot;, &quot;field&quot;: &quot;regulatory RNAs&quot;}, {&quot;title&quot;: &quot;unknowns&quot;, &quot;field&quot;: &quot;unknowns&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p><img src="/figure/source/2025-10-07-octopus_network/n_species_per_type-output-3.png" alt="" /></p>

<p>Clear specialization patterns emerge: gene-centric sources (<em>STRING</em>,
<em>Dogma</em>, <em>Reactome-FI</em>, <em>TRRUST</em>), metabolite-focused databases
(<em>Recon3D</em>), and comprehensive resources covering diverse molecular species
(<em>Reactome</em>, <em>IntAct</em>, <em>OmniPath</em>).</p>

<h4 id="interaction-mechanisms">Interaction mechanisms</h4>

<p>I’ll examine interaction types using Systems Biology Ontology (SBO)
terms that define molecular roles:</p>

<ul>
  <li><strong>Interactor</strong>: Undirected associations</li>
  <li><strong>Stimulator/Inhibitor/Modifier</strong>: Regulators of expression or
activity</li>
  <li><strong>Modified</strong>: Targets of regulation</li>
  <li><strong>Catalyst</strong>: Enzymes and transporters</li>
  <li><strong>Reactant/Product</strong>: Consumed or produced molecules</li>
</ul>
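<p>To make these roles concrete: signed roles like stimulator and inhibitor are what ultimately let edge-prediction models distinguish activation from inhibition. A minimal sketch of collapsing roles into signed edge weights (the mapping and names below are illustrative, not Napistu's internal convention):</p>

```python
# Hypothetical mapping from SBO participant role to a regulatory sign;
# illustrative only -- not Napistu's internal convention.
SBO_ROLE_SIGNS = {
    "stimulator": 1,   # activates expression or activity
    "inhibitor": -1,   # represses expression or activity
    "interactor": 0,   # undirected association, no sign
    "modifier": 0,     # regulator with unknown direction
    "catalyst": 0,     # required for the reaction, but unsigned
}

def edge_sign(sbo_role: str) -> int:
    """Return a signed weight for a role (0 for unsigned or unknown roles)."""
    return SBO_ROLE_SIGNS.get(sbo_role, 0)

print(edge_sign("inhibitor"))  # -1
```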

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sbo_term_counts</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="s">"sbo_name_counts"</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">sbml_dfs_dict_summaries</span><span class="p">.</span><span class="n">items</span><span class="p">()})</span>
    <span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'Int64'</span><span class="p">)</span>
    <span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">sbo_term_counts</span><span class="p">,</span>
    <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
    <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Counts of reaction participant roles in each source"</span>
<span class="p">)</span>

<span class="n">RADAR_ORDER</span> <span class="o">=</span> <span class="p">[</span><span class="s">"catalyst"</span><span class="p">,</span> <span class="s">"reactant"</span><span class="p">,</span> <span class="s">"product"</span><span class="p">,</span> <span class="s">"stimulator"</span><span class="p">,</span> <span class="s">"inhibitor"</span><span class="p">,</span> <span class="s">"modifier"</span><span class="p">,</span> <span class="s">"modified"</span><span class="p">,</span> <span class="s">"interactor"</span><span class="p">]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">create_pathway_radar_plot</span><span class="p">(</span>
    <span class="n">sbo_term_counts</span><span class="p">[</span><span class="n">RADAR_ORDER</span><span class="p">],</span>
    <span class="n">title</span> <span class="o">=</span> <span class="s">"SBO terms by pathway source"</span><span class="p">,</span>
    <span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Counts of reaction participant roles in each source
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;index&quot;: &quot;Reactome&quot;, &quot;catalyst&quot;: 6613, &quot;inhibitor&quot;: 1028, &quot;interactor&quot;: 0, &quot;modified&quot;: 0, &quot;modifier&quot;: 0, &quot;product&quot;: 23691, &quot;reactant&quot;: 29972, &quot;stimulator&quot;: 2035}, {&quot;index&quot;: &quot;Recon3D&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 0, &quot;interactor&quot;: 0, &quot;modified&quot;: 0, &quot;modifier&quot;: 0, &quot;product&quot;: 19913, &quot;reactant&quot;: 20512, &quot;stimulator&quot;: 14487}, {&quot;index&quot;: &quot;STRING&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 0, &quot;interactor&quot;: 8326852, &quot;modified&quot;: 0, &quot;modifier&quot;: 0, &quot;product&quot;: 0, &quot;reactant&quot;: 0, &quot;stimulator&quot;: 0}, {&quot;index&quot;: &quot;IntAct&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 0, &quot;interactor&quot;: 1002874, &quot;modified&quot;: 0, &quot;modifier&quot;: 0, &quot;product&quot;: 0, &quot;reactant&quot;: 0, &quot;stimulator&quot;: 0}, {&quot;index&quot;: &quot;OmniPath&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 43725, &quot;interactor&quot;: 124014, &quot;modified&quot;: 410328, &quot;modifier&quot;: 86700, &quot;product&quot;: 4164, &quot;reactant&quot;: 4164, &quot;stimulator&quot;: 279903}, {&quot;index&quot;: &quot;Reactome-FI&quot;, &quot;catalyst&quot;: 37936, &quot;inhibitor&quot;: 5991, &quot;interactor&quot;: 721680, &quot;modified&quot;: 110190, &quot;modifier&quot;: 0, &quot;product&quot;: 0, &quot;reactant&quot;: 0, &quot;stimulator&quot;: 66263}, {&quot;index&quot;: &quot;TRRUST&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 1715, &quot;interactor&quot;: 0, &quot;modified&quot;: 8427, &quot;modifier&quot;: 3775, &quot;product&quot;: 0, &quot;reactant&quot;: 0, &quot;stimulator&quot;: 2937}, {&quot;index&quot;: &quot;Dogma&quot;, &quot;catalyst&quot;: 0, &quot;inhibitor&quot;: 0, 
&quot;interactor&quot;: 0, &quot;modified&quot;: 1, &quot;modifier&quot;: 1, &quot;product&quot;: 0, &quot;reactant&quot;: 0, &quot;stimulator&quot;: 0}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;catalyst&quot;, &quot;field&quot;: &quot;catalyst&quot;}, {&quot;title&quot;: &quot;inhibitor&quot;, &quot;field&quot;: &quot;inhibitor&quot;}, {&quot;title&quot;: &quot;interactor&quot;, &quot;field&quot;: &quot;interactor&quot;}, {&quot;title&quot;: &quot;modified&quot;, &quot;field&quot;: &quot;modified&quot;}, {&quot;title&quot;: &quot;modifier&quot;, &quot;field&quot;: &quot;modifier&quot;}, {&quot;title&quot;: &quot;product&quot;, &quot;field&quot;: &quot;product&quot;}, {&quot;title&quot;: &quot;reactant&quot;, &quot;field&quot;: &quot;reactant&quot;}, {&quot;title&quot;: &quot;stimulator&quot;, &quot;field&quot;: &quot;stimulator&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p><img src="/figure/source/2025-10-07-octopus_network/sbo_term_counts-output-3.png" alt="" /></p>

<p>Broad sources like <em>STRING</em> favor generic “interactor” classifications,
while specialized databases like <em>Recon3D</em> and <em>Reactome</em> capture
specific mechanistic detail more faithfully.</p>

<h4 id="quantitative-data">Quantitative data</h4>

<p>Beyond structural information, sources provide additional annotations
and metadata for both molecular species and their interactions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data_summaries</span> <span class="o">=</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">[</span><span class="s">"data_summary"</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">sbml_dfs_dict_summaries</span><span class="p">.</span><span class="n">items</span><span class="p">()</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">[</span><span class="s">"data_summary"</span><span class="p">][</span><span class="s">"reactions"</span><span class="p">])</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">}</span>

<span class="n">data_summaries_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">data_summaries</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
    <span class="k">for</span> <span class="n">entity_type</span><span class="p">,</span> <span class="n">entity_data</span> <span class="ow">in</span> <span class="n">v</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">entity_data</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">table_name</span><span class="p">,</span> <span class="n">table_data</span> <span class="ow">in</span> <span class="n">entity_data</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
                <span class="n">table_summary</span> <span class="o">=</span><span class="p">{</span>
                    <span class="s">"table_name"</span> <span class="p">:</span> <span class="n">table_name</span><span class="p">,</span>
                    <span class="s">"entity_type"</span> <span class="p">:</span> <span class="n">entity_type</span><span class="p">,</span>
                    <span class="s">"n_rows"</span> <span class="p">:</span> <span class="n">table_data</span><span class="p">[</span><span class="s">"n_rows"</span><span class="p">],</span>
                    <span class="s">"columns"</span> <span class="p">:</span> <span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">table_data</span><span class="p">[</span><span class="s">"columns"</span><span class="p">]),</span>
                <span class="p">}</span>
                <span class="n">data_summaries_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">table_summary</span><span class="p">)</span>
<span class="n">data_summaries_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data_summaries_list</span><span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">data_summaries_df</span><span class="p">,</span>
    <span class="n">wrap_columns</span> <span class="o">=</span> <span class="s">"columns"</span><span class="p">,</span>
    <span class="n">column_widths</span> <span class="o">=</span> <span class="p">{</span><span class="s">"columns"</span> <span class="p">:</span> <span class="s">"65%"</span><span class="p">},</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Additional species- and/or reactions-data in each source"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Additional species- and/or reactions-data in each source
</figcaption>

<div class="data-table" style="" data-table="[{&quot;table_name&quot;: &quot;STRING&quot;, &quot;entity_type&quot;: &quot;reactions&quot;, &quot;n_rows&quot;: 4163426, &quot;columns&quot;: &quot;neighborhood, neighborhood_transferred, fusion, cooccurence, homology, coexpression, coexpression_transferred, experiments, experiments_transferred, database, database_transferred, textmining, textmining_transferred, combined_score&quot;}, {&quot;table_name&quot;: &quot;IntAct&quot;, &quot;entity_type&quot;: &quot;reactions&quot;, &quot;n_rows&quot;: 501437, &quot;columns&quot;: &quot;publication_score, interaction_method_score, interaction_type_score, miscore, n_publications, interaction_method_biochemical, interaction_method_biophysical, interaction_method_imaging technique, interaction_method_post transcriptional interference, interaction_method_protein complementation assay, interaction_method_unknown, interaction_type_association, interaction_type_colocalization, interaction_type_direct interaction, interaction_type_physical association&quot;}, {&quot;table_name&quot;: &quot;OmniPath&quot;, &quot;entity_type&quot;: &quot;reactions&quot;, &quot;n_rows&quot;: 476499, &quot;columns&quot;: &quot;is_directed, is_stimulation, is_inhibition, consensus_direction, consensus_stimulation, consensus_inhibition, n_primary_sources, n_references, n_sources&quot;}, {&quot;table_name&quot;: &quot;OmniPath&quot;, &quot;entity_type&quot;: &quot;species&quot;, &quot;n_rows&quot;: 19509, &quot;columns&quot;: &quot;species_type&quot;}, {&quot;table_name&quot;: &quot;Reactome-FI&quot;, &quot;entity_type&quot;: &quot;reactions&quot;, &quot;n_rows&quot;: 471030, &quot;columns&quot;: &quot;fi_score&quot;}]" data-columns="[{&quot;title&quot;: &quot;table_name&quot;, &quot;field&quot;: &quot;table_name&quot;}, {&quot;title&quot;: &quot;entity_type&quot;, &quot;field&quot;: &quot;entity_type&quot;}, {&quot;title&quot;: &quot;n_rows&quot;, &quot;field&quot;: &quot;n_rows&quot;}, {&quot;title&quot;: 
&quot;columns&quot;, &quot;field&quot;: &quot;columns&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;65%&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>This additional data falls into two key categories: confidence scoring
systems (<em>STRING</em> interaction scores, <em>IntAct</em> experimental evidence)
and mechanistic granularity (<em>OmniPath</em> activation/inhibition
breakdowns, <em>IntAct</em> interaction types). Both provide crucial context
for assessing interaction reliability and biological mechanisms.</p>
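<p>In practice, these scores make it straightforward to prune low-confidence edges before analysis. A minimal sketch using a toy stand-in for the <em>STRING</em> reactions data (only the <code class="language-plaintext highlighter-rouge">combined_score</code> column is used here; 700 is STRING's conventional high-confidence cutoff, though the right threshold depends on the analysis):</p>

```python
import pandas as pd

# toy stand-in for STRING's reactions_data; the real table has many more columns
string_reactions_data = pd.DataFrame({
    "r_id": ["R1", "R2", "R3"],
    "combined_score": [950, 400, 720],
})

# keep only high-confidence interactions (STRING scores run from 0 to 1000)
high_conf = string_reactions_data.query("combined_score >= 700")
print(high_conf["r_id"].tolist())  # ['R1', 'R3']
```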

<h3 id="source-compatibility">Source compatibility</h3>

<p>Many data sources used by Napistu, like <em>STRING</em> and <em>OmniPath</em>, already
aim to integrate multiple upstream data sources into a consistent
consensus. Napistu builds on these resources to merge what would
otherwise be incompatible data sources into a single, well-mixed model.
Without proper integration, sources would separate like oil and water
— each forming disconnected subnetworks with minimal overlap. Instead,
we need to gel them together by establishing a unified molecular
vocabulary that enables seamless integration of source-specific
interactions.</p>

<p><strong>Napistu accomplishes this integration through:</strong></p>

<ul>
  <li><strong>Data standardization</strong>: Systematic identifiers and <em>SBO</em> ontology
terms create a common vocabulary for describing molecules and their
interactions across diverse sources</li>
  <li><strong>Algorithmic merging</strong>: A consensus procedure that identifies
equivalent entities and reconciles overlapping information into a
single integrated model</li>
</ul>

<h2 id="merging-sbml_dfs-objects-into-a-consensus-sbml_dfs">Merging <code class="language-plaintext highlighter-rouge">SBML_dfs</code> objects into a consensus <code class="language-plaintext highlighter-rouge">SBML_dfs</code></h2>

<p>Merging multiple <code class="language-plaintext highlighter-rouge">SBML_dfs</code> objects into a consensus model requires
resolving entities by determining which compartments, species, and
reactions are shared across sources. This process works through tables
in logical order (compartments &amp; species $\rightarrow$ compartmentalized
species $\rightarrow$ reactions &amp; reaction species), aggregating each
table across all models to construct:</p>

<ul>
  <li><strong>Consensus tables</strong>: New unified tables with standardized structure</li>
  <li><strong>Key mapping tables</strong>: Lookup tables linking old source-specific
primary keys to new consensus keys for updating foreign key
relationships</li>
</ul>
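<p>The mechanics of the key mapping tables can be sketched with a toy example (the field names here are hypothetical simplifications of the real schema):</p>

```python
import pandas as pd

# hypothetical key mapping produced while merging the species table:
# two source-specific species collapse onto one consensus species
key_map = pd.DataFrame({
    "source_s_id": ["string_S1", "reactome_S9"],
    "consensus_s_id": ["S0001", "S0001"],
})

# a downstream table still holding the old, source-specific foreign keys
reaction_species = pd.DataFrame({
    "rsc_id": ["RSC1", "RSC2"],
    "s_id": ["string_S1", "reactome_S9"],
})

# relabel the foreign keys with their consensus identifiers
lookup = key_map.set_index("source_s_id")["consensus_s_id"]
reaction_species["s_id"] = reaction_species["s_id"].map(lookup)
print(reaction_species["s_id"].tolist())  # ['S0001', 'S0001']
```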

<p>Two variables are critical for successful merging:</p>

<ul>
  <li><strong>Identifiers</strong>: Determine what <em>can</em> be merged by organizing
systematic identifiers as curated lists, each defined by ontology,
identifier, and bioqualifier (e.g., “BQB_IS” or “BQB_HAS_PART”)</li>
  <li><strong>Sources</strong>: Track what <em>has</em> been merged by associating each
consensus entity with all contributing data sources</li>
</ul>

<p>The <a href="https://github.com/napistu/napistu/wiki/Consensus">consensus
algorithm</a> proceeds
through four steps:</p>

<ol>
  <li><strong>Resolve foundational entities</strong>: Use greedy network-based matching
to identify species and compartments through shared identifiers,
connecting entities that share BQB-coded systematic identifiers</li>
  <li><strong>Define compartmentalized species</strong>: Map resolved species to their
appropriate compartmental locations</li>
  <li><strong>Merge reactions</strong>: Update reaction species annotations and
identify redundant reactions based on participants and mechanisms</li>
  <li><strong>Harmonize data tables</strong>: Update species_data and reactions_data
with consensus primary keys, aggregating results to ensure one row
per consensus entity</li>
</ol>
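<p>Step 1 can be illustrated with a toy version of the identifier-based matching: link any two species that share an (ontology, identifier, bioqualifier) tuple, then treat each connected group as one consensus species. This union-find sketch is my own simplification; Napistu's actual implementation is richer:</p>

```python
from itertools import combinations

# toy species keyed by hypothetical source-specific IDs, each with a set of
# (ontology, identifier, bioqualifier) tuples
species_identifiers = {
    "string_TP53":   {("uniprot", "P04637", "BQB_IS")},
    "reactome_TP53": {("uniprot", "P04637", "BQB_IS"),
                      ("ensembl_gene", "ENSG00000141510", "BQB_IS")},
    "trrust_MYC":    {("uniprot", "P01106", "BQB_IS")},
}

# union-find over species: merge any pair sharing at least one identifier
parent = {s: s for s in species_identifiers}

def find(s):
    while parent[s] != s:
        parent[s] = parent[parent[s]]  # path compression
        s = parent[s]
    return s

for a, b in combinations(species_identifiers, 2):
    if species_identifiers[a] & species_identifiers[b]:
        parent[find(a)] = find(b)

# connected groups become consensus species
groups = {}
for s in species_identifiers:
    groups.setdefault(find(s), []).append(s)
print(sorted(len(g) for g in groups.values()))  # [1, 2]
```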

<div class="content-section bio-section">
  <div class="section-content">
    <p>The consensus algorithm is robust
but exposes incompatibilities between models when sources use different
ontologies or resolution levels. If compartments are defined at
different granularities or species use incompatible identifier systems,
sources merge poorly — essentially speaking different languages.
Rather than combining, as intended, they produce networks with multiple
disconnected subgraphs that negate the benefits of consensus modeling.
Many incompatibilities can be identified during the preprocessing stage
through schema validation and syntactic checks. However, additional
conflicts often emerge only during post-consensus validation, which
evaluates whether molecular species—and, where applicable,
reactions—have been accurately and semantically merged across
heterogeneous sources.</p>

  </div>
</div>

<h2 id="loading-the-">Loading the 🐙</h2>

<p>With the consensus algorithm framework established, let’s examine the
actual eight-source Octopus model to see how well these theoretical
merging principles work in practice. To do this, I’ll download the
pre-built model from Google Cloud Storage and provide some quick
summaries of its core properties.</p>

<p>The Octopus network is available through GCS and gets updated
periodically as sources are added and Napistu data structures evolve. To
ensure reproducibility for this post and others like <a href="https://www.shackett.org/napistu_network_propagation/">Network Biology
with Napistu, Part 2: Translating Statistical Associations into
Biological
Mechanisms</a>,
tagged versions are preserved for reliable future access. Here, I’ll
load a tagged version compatible with Napistu 0.7.1.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>This represents the latest human
consensus model as of October 2025, but the model continues advancing
(hopefully toward a 10-source 🦑 model soon!). To access the most
current version, simply install the latest Napistu release and remove
the version tag from <code class="language-plaintext highlighter-rouge">gcs.downloads.load_public_napistu_asset</code>.</p>

  </div>
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ~3 min load
# download and cache the Octopus sbml_dfs and the other assets it's bundled with
</span><span class="n">sbml_dfs_path</span> <span class="o">=</span> <span class="n">downloads</span><span class="p">.</span><span class="n">load_public_napistu_asset</span><span class="p">(</span>
    <span class="n">asset</span> <span class="o">=</span> <span class="n">ASSET</span><span class="p">,</span>
    <span class="n">subasset</span> <span class="o">=</span> <span class="s">"sbml_dfs"</span><span class="p">,</span>
    <span class="n">data_dir</span> <span class="o">=</span> <span class="n">DATA_DIR</span><span class="p">,</span>
    <span class="c1"># download the tagged version for reproducibility and Python env compatibility
</span>    <span class="n">version</span> <span class="o">=</span> <span class="n">VERSION_TAG</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">sbml_dfs</span> <span class="o">=</span> <span class="n">SBML_dfs</span><span class="p">.</span><span class="n">from_pickle</span><span class="p">(</span><span class="n">sbml_dfs_path</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="-summary">🐙 summary</h2>

<p>With the core <code class="language-plaintext highlighter-rouge">SBML_dfs</code> object loaded, I’ll examine its high-level
properties using the <code class="language-plaintext highlighter-rouge">get_summary</code> method.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">summary_stats</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">get_summary</span><span class="p">()</span>
<span class="n">summary_table</span> <span class="o">=</span> <span class="n">sbml_dfs_utils</span><span class="p">.</span><span class="n">format_sbml_dfs_summary</span><span class="p">(</span><span class="n">summary_stats</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">summary_table</span><span class="p">,</span>
    <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
    <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Consensus SBML_dfs summaries"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Consensus SBML_dfs summaries
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;Metric&quot;: &quot;Species&quot;, &quot;Value&quot;: &quot;43,814&quot;}, {&quot;Metric&quot;: &quot;- Proteins&quot;, &quot;Value&quot;: &quot;20,980 (47.9%)&quot;}, {&quot;Metric&quot;: &quot;- Complexes&quot;, &quot;Value&quot;: &quot;14,971 (34.2%)&quot;}, {&quot;Metric&quot;: &quot;- Metabolites&quot;, &quot;Value&quot;: &quot;4,797 (10.9%)&quot;}, {&quot;Metric&quot;: &quot;- Other&quot;, &quot;Value&quot;: &quot;1,156 (2.6%)&quot;}, {&quot;Metric&quot;: &quot;- Regulatory RNAs&quot;, &quot;Value&quot;: &quot;981 (2.2%)&quot;}, {&quot;Metric&quot;: &quot;- Unknowns&quot;, &quot;Value&quot;: &quot;805 (1.8%)&quot;}, {&quot;Metric&quot;: &quot;- Drugs&quot;, &quot;Value&quot;: &quot;124 (0.3%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Compartments&quot;, &quot;Value&quot;: &quot;1&quot;}, {&quot;Metric&quot;: &quot;- cellular_component&quot;, &quot;Value&quot;: &quot;43,814 (100.0%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Compartmentalized Species&quot;, &quot;Value&quot;: &quot;43,814&quot;}, {&quot;Metric&quot;: &quot;Reactions&quot;, &quot;Value&quot;: &quot;4,806,439&quot;}, {&quot;Metric&quot;: &quot;Reaction Species&quot;, &quot;Value&quot;: &quot;9,642,389&quot;}]" data-columns="[{&quot;title&quot;: &quot;Metric&quot;, &quot;field&quot;: &quot;Metric&quot;}, {&quot;title&quot;: &quot;Value&quot;, &quot;field&quot;: &quot;Value&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>The consensus model contains genes/proteins, metabolites, complexes,
drugs, and regulatory RNAs within a single compartment — cellular
component (the root term of GO’s <em>cellular component</em> category). The
model encompasses approximately 4.8M reactions spanning undirected
interactions, directed regulation, and complex multi-participant
regulatory mechanisms.</p>

<p>While the earlier source comparisons demonstrated each database’s
potential contributions, the key question remains: did the sources
actually merge into a single well-mixed model? Successful integration
requires extensive molecular species sharing across sources and
meaningful reaction overlap. Rather than separate, highly connected
subnetworks with minimal inter-source connections, we want a unified
network where sources are genuinely integrated.</p>

<p>The model’s <code class="language-plaintext highlighter-rouge">Source</code> objects provide the answer — they track which
data sources contributed to each species, compartment, and reaction,
enabling direct assessment of integration success.</p>

<h3 id="shared-molecular-vocabulary">Shared molecular vocabulary</h3>

<p>To assess integration success, I’ll examine which data sources
contributed to each molecular species through contingency tables of
species-source occurrences. Values reflect how many of a source’s
molecular species merged into each consensus species — typically 0
(not present) or 1 (exact match), though occasionally higher when
multiple protein annotations roll up to a single gene.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">species_source_occurrence</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">get_source_occurrence</span><span class="p">(</span><span class="s">"species"</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">species_source_occurrence</span><span class="p">.</span><span class="n">head</span><span class="p">(),</span>
    <span class="n">layout</span> <span class="o">=</span> <span class="s">"fitDataTable"</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Example molecular species and the sources they were originally found in"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Example molecular species and the sources they were originally found in
</figcaption>

<div class="data-table" style="" data-table="[{&quot;s_id&quot;: &quot;S00000000&quot;, &quot;Dogma&quot;: 1, &quot;IntAct&quot;: 2, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 1, &quot;Recon3D&quot;: 0, &quot;STRING&quot;: 1, &quot;TRRUST&quot;: 0}, {&quot;s_id&quot;: &quot;S00000001&quot;, &quot;Dogma&quot;: 1, &quot;IntAct&quot;: 3, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 3, &quot;Reactome-FI&quot;: 1, &quot;Recon3D&quot;: 0, &quot;STRING&quot;: 1, &quot;TRRUST&quot;: 1}, {&quot;s_id&quot;: &quot;S00000002&quot;, &quot;Dogma&quot;: 1, &quot;IntAct&quot;: 0, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 0, &quot;Recon3D&quot;: 0, &quot;STRING&quot;: 1, &quot;TRRUST&quot;: 0}, {&quot;s_id&quot;: &quot;S00000003&quot;, &quot;Dogma&quot;: 1, &quot;IntAct&quot;: 1, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 0, &quot;Recon3D&quot;: 0, &quot;STRING&quot;: 1, &quot;TRRUST&quot;: 1}, {&quot;s_id&quot;: &quot;S00000004&quot;, &quot;Dogma&quot;: 1, &quot;IntAct&quot;: 2, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 1, &quot;Reactome-FI&quot;: 1, &quot;Recon3D&quot;: 0, &quot;STRING&quot;: 1, &quot;TRRUST&quot;: 1}]" data-columns="[{&quot;title&quot;: &quot;s_id&quot;, &quot;field&quot;: &quot;s_id&quot;}, {&quot;title&quot;: &quot;Dogma&quot;, &quot;field&quot;: &quot;Dogma&quot;}, {&quot;title&quot;: &quot;IntAct&quot;, &quot;field&quot;: &quot;IntAct&quot;}, {&quot;title&quot;: &quot;OmniPath&quot;, &quot;field&quot;: &quot;OmniPath&quot;}, {&quot;title&quot;: &quot;Reactome&quot;, &quot;field&quot;: &quot;Reactome&quot;}, {&quot;title&quot;: &quot;Reactome-FI&quot;, &quot;field&quot;: &quot;Reactome-FI&quot;}, {&quot;title&quot;: &quot;Recon3D&quot;, &quot;field&quot;: &quot;Recon3D&quot;}, {&quot;title&quot;: &quot;STRING&quot;, &quot;field&quot;: &quot;STRING&quot;}, {&quot;title&quot;: &quot;TRRUST&quot;, &quot;field&quot;: &quot;TRRUST&quot;}]" 
data-options="{&quot;layout&quot;: &quot;fitDataTable&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>I can visualize species sharing patterns by converting the
species-by-source occurrence matrix ($X$) into a source-by-source
cooccurrence matrix ($C$) using:</p>

\[C = B^T B\]

<p>Where $B = \mathbf{1}(X \neq 0)$ is the binary matrix obtained by
converting non-zero entries of $X$ to 1.</p>
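<p>As a concrete sketch of this conversion: starting from a toy
species-by-source occurrence table, binarizing and multiplying recovers the
source-by-source counts (Napistu’s
<code class="language-plaintext highlighter-rouge">get_source_cooccurrence</code> computes this internally; the toy
data below is made up):</p>

```python
import pandas as pd

# toy occurrence matrix: rows are molecular species, columns are sources
X = pd.DataFrame(
    {"STRING": [1, 1, 0], "Reactome": [0, 3, 1], "TRRUST": [1, 0, 0]},
    index=["S1", "S2", "S3"],
)

# binarize: what matters is whether a species occurs in a source at all
B = (X != 0).astype(int)

# with species as rows, source-by-source cooccurrence counts come from B' B;
# the diagonal holds each source's total species count
C = B.T @ B
print(C.loc["STRING", "Reactome"])  # 1: only S2 appears in both sources
print(C.loc["STRING", "STRING"])    # 2: STRING contributes S1 and S2
```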

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">species_source_cooccurrence</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">get_source_cooccurrence</span><span class="p">(</span><span class="s">"species"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'Database'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'Database'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">simple_pd_heatmap</span><span class="p">(</span><span class="n">species_source_cooccurrence</span><span class="p">,</span> <span class="s">"Species Source Co-Occurrence"</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-10-07-octopus_network/species_source_cooccurrences-output-1.png" alt="" /></p>

<p>The heatmap reveals that gene-centric, dense sources (<em>STRING</em>, <em>Dogma</em>,
<em>Reactome-FI</em>) cluster together with similar molecular coverage, while
<em>Reactome</em>, <em>Recon3D</em>, and <em>TRRUST</em> remain more isolated. This reflects
<em>TRRUST</em>’s smaller size and the molecular specialization of <em>Reactome</em>
and <em>Recon3D</em> compared to comprehensive protein databases.</p>

<p>To quantify this specialization, I’ll identify species unique to
individual sources.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">private_species</span> <span class="o">=</span> <span class="n">species_source_occurrence</span><span class="p">.</span><span class="n">loc</span><span class="p">[(</span><span class="n">species_source_occurrence</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">).</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span>

<span class="n">private_species_source_counts</span> <span class="o">=</span> <span class="p">(</span>
    <span class="p">(</span><span class="n">private_species</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="nb">sum</span><span class="p">()</span>
    <span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="s">"Private species"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">to_frame</span><span class="p">()</span>
    <span class="p">.</span><span class="n">T</span>
<span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">private_species_source_counts</span><span class="p">,</span>
    <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
    <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
    <span class="n">include_index</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Private molecular species from each source"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Private molecular species from each source
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;Reactome&quot;: 16637, &quot;OmniPath&quot;: 2924, &quot;IntAct&quot;: 2418, &quot;Recon3D&quot;: 2297, &quot;STRING&quot;: 343, &quot;Reactome-FI&quot;: 168, &quot;TRRUST&quot;: 53, &quot;Dogma&quot;: 0}]" data-columns="[{&quot;title&quot;: &quot;Reactome&quot;, &quot;field&quot;: &quot;Reactome&quot;}, {&quot;title&quot;: &quot;OmniPath&quot;, &quot;field&quot;: &quot;OmniPath&quot;}, {&quot;title&quot;: &quot;IntAct&quot;, &quot;field&quot;: &quot;IntAct&quot;}, {&quot;title&quot;: &quot;Recon3D&quot;, &quot;field&quot;: &quot;Recon3D&quot;}, {&quot;title&quot;: &quot;STRING&quot;, &quot;field&quot;: &quot;STRING&quot;}, {&quot;title&quot;: &quot;Reactome-FI&quot;, &quot;field&quot;: &quot;Reactome-FI&quot;}, {&quot;title&quot;: &quot;TRRUST&quot;, &quot;field&quot;: &quot;TRRUST&quot;}, {&quot;title&quot;: &quot;Dogma&quot;, &quot;field&quot;: &quot;Dogma&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>Several sources contribute substantial numbers of private species, each
for logical reasons:</p>

<ul>
  <li><strong>Reactome</strong>: Detailed complex mechanisms with fine-grained complex
definitions</li>
  <li><strong>OmniPath</strong>: Extensive drug collections (<em>PubChem</em>) and microRNAs
(<em>MirBase</em>)</li>
  <li><strong>IntAct</strong>: Small molecules and microRNAs alongside core protein
interactions</li>
  <li><strong>Recon3D</strong>: Extensive coverage of metabolites and lipids</li>
</ul>

<p>The Octopus successfully integrates molecular species, with proteins
shared across multiple sources while specialized molecular types arise
from domain-specific resources.</p>

<h3 id="reaction-overlap-reveals-data-source-specialization">Reaction overlap reveals data source specialization</h3>

<p>To understand what individual sources contribute, I’ll analyze reaction
source occurrences and cooccurrences using a similar approach to the
species analysis above.</p>

<p>To interpret this analysis, readers should understand two important
points:</p>

<ul>
  <li><strong>Strict merging criteria</strong>: Reactions merge only with identical
participants and <em>SBO</em> terms. A reaction between genes <em>A</em> and <em>B</em>
won’t merge if one source labels them inhibitor $\rightarrow$
modified while another uses modifier $\rightarrow$ modified,
explaining the low overlap we’ll observe.</li>
  <li><strong>Analysis scope</strong>: This analysis excludes interactor-interactor
interactions because the existing tooling is designed to surface
information relevant for graph construction, where these
interactions become direct edges between molecular species with no
reaction vertices added.</li>
</ul>
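<p>The first point can be made concrete with a toy
<code class="language-plaintext highlighter-rouge">reaction_species</code>-style table (illustrative only, not Napistu’s
actual merge implementation): because the merge key includes each
participant’s <em>SBO</em> role, an <em>SBO</em> disagreement keeps otherwise
identical reactions separate.</p>

```python
import pandas as pd

# two sources report a relationship between genes A and B,
# but they disagree on A's SBO role
reaction_species = pd.DataFrame(
    {
        "r_id": ["rxn_a", "rxn_a", "rxn_b", "rxn_b"],
        "species": ["A", "B", "A", "B"],
        "sbo_term": ["inhibitor", "modified", "modifier", "modified"],
    }
)

# a reaction's merge key is its full set of (participant, SBO role) pairs
merge_keys = reaction_species.groupby("r_id")[["species", "sbo_term"]].apply(
    lambda g: frozenset(zip(g["species"], g["sbo_term"]))
)

# the SBO mismatch on A keeps the two reactions distinct in the consensus
print(merge_keys.nunique())  # 2 consensus reactions rather than 1
```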

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reactions_source_occurrence</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">get_source_occurrence</span><span class="p">(</span><span class="s">"reactions"</span><span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">reactions_source_occurrence</span><span class="p">.</span><span class="n">head</span><span class="p">(),</span>
    <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
    <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Example reactions and the sources they were originally found in"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Example reactions and the sources they were originally found in
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;r_id&quot;: &quot;R00000000&quot;, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 0, &quot;Recon3D&quot;: 0, &quot;TRRUST&quot;: 0}, {&quot;r_id&quot;: &quot;R00000214&quot;, &quot;OmniPath&quot;: 0, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 2, &quot;Recon3D&quot;: 0, &quot;TRRUST&quot;: 0}, {&quot;r_id&quot;: &quot;R00000215&quot;, &quot;OmniPath&quot;: 0, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 1, &quot;Recon3D&quot;: 0, &quot;TRRUST&quot;: 0}, {&quot;r_id&quot;: &quot;R00000216&quot;, &quot;OmniPath&quot;: 0, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 1, &quot;Recon3D&quot;: 0, &quot;TRRUST&quot;: 0}, {&quot;r_id&quot;: &quot;R00000217&quot;, &quot;OmniPath&quot;: 1, &quot;Reactome&quot;: 0, &quot;Reactome-FI&quot;: 0, &quot;Recon3D&quot;: 0, &quot;TRRUST&quot;: 0}]" data-columns="[{&quot;title&quot;: &quot;r_id&quot;, &quot;field&quot;: &quot;r_id&quot;}, {&quot;title&quot;: &quot;OmniPath&quot;, &quot;field&quot;: &quot;OmniPath&quot;}, {&quot;title&quot;: &quot;Reactome&quot;, &quot;field&quot;: &quot;Reactome&quot;}, {&quot;title&quot;: &quot;Reactome-FI&quot;, &quot;field&quot;: &quot;Reactome-FI&quot;}, {&quot;title&quot;: &quot;Recon3D&quot;, &quot;field&quot;: &quot;Recon3D&quot;}, {&quot;title&quot;: &quot;TRRUST&quot;, &quot;field&quot;: &quot;TRRUST&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>The reaction occurrence data is notably sparse, and the
order-of-magnitude differences in reaction counts between sources
complicate direct cooccurrence visualization.</p>

<p>To better assess source dependencies, I’ll calculate conditional
probabilities $\Pr(A|B)$ from the cooccurrence matrix, giving the
probability that a reaction from source $B$ also appears in source
$A$.</p>
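<p>The conversion itself is a one-liner: divide each row of the
cooccurrence count matrix by its diagonal entry, so entry $(B, A)$ becomes
$C_{BA} / C_{BB}$. Here is a minimal sketch of such a helper, with made-up
counts that echo the <em>TRRUST</em>/<em>OmniPath</em> overlap (the
<code class="language-plaintext highlighter-rouge">cooccurrence_to_conditional_prob</code> used below may differ in
its details):</p>

```python
import numpy as np
import pandas as pd

def cooccurrence_to_conditional_prob(cooccurrence: pd.DataFrame) -> pd.DataFrame:
    """Row-normalize cooccurrence counts by each source's diagonal total.

    Entry (B, A) becomes C[B, A] / C[B, B]: the probability that an entity
    from row source B also appears in column source A. (Sketch only.)
    """
    diag = pd.Series(np.diag(cooccurrence), index=cooccurrence.index)
    return cooccurrence.div(diag, axis=0)

# toy counts: 4 TRRUST reactions, 20 OmniPath reactions, 2 shared
C = pd.DataFrame(
    [[4, 2], [2, 20]],
    index=["TRRUST", "OmniPath"],
    columns=["TRRUST", "OmniPath"],
)
P = cooccurrence_to_conditional_prob(C)
print(P.loc["TRRUST", "OmniPath"])  # 0.5: half of TRRUST's reactions are in OmniPath
print(P.loc["OmniPath", "TRRUST"])  # 0.1: but only a tenth of OmniPath's are in TRRUST
```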

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reactions_source_cooccurrence</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">get_source_cooccurrence</span><span class="p">(</span><span class="s">"reactions"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'Database'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">.</span><span class="n">rename_axis</span><span class="p">(</span><span class="s">'Database'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">reactions_source_conditional_prob</span> <span class="o">=</span> <span class="n">cooccurrence_to_conditional_prob</span><span class="p">(</span><span class="n">reactions_source_cooccurrence</span><span class="p">)</span>

<span class="n">simple_pd_heatmap</span><span class="p">(</span><span class="n">reactions_source_conditional_prob</span><span class="p">,</span> <span class="s">"Conditional probability of reaction found in column source</span><span class="se">\n</span><span class="s"> given that it occurs in the row source"</span><span class="p">,</span> <span class="n">fmt</span><span class="o">=</span><span class="s">".3f"</span><span class="p">,</span> <span class="n">colorbar_label</span><span class="o">=</span><span class="s">"Probability"</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-10-07-octopus_network/reactions_source_cooccurrences-output-1.png" alt="" /></p>

<p>The conditional probability analysis reveals distinct patterns:
<em>Reactome</em> and <em>Recon3D</em> reactions remain largely unique, while
meaningful overlap exists between <em>Reactome-FI</em>, <em>OmniPath</em>, and
<em>TRRUST</em>. The strongest overlap occurs between <em>TRRUST</em> and <em>OmniPath</em>
(50% of <em>TRRUST</em> interactions also appear in <em>OmniPath</em>) — an expected
result since <em>TRRUST</em> is one of the resources incorporated into
<em>OmniPath</em>.</p>

<p>These patterns demonstrate successful species integration alongside
preserved source-specific reaction diversity, with each database
contributing substantial unique mechanistic content to the consensus
model.</p>

<h2 id="decorating-the--graph-with-species-and-reaction-data">Decorating the 🐙 graph with species and reaction data</h2>

<p><img src="https://www.shackett.org/figure/octopus_network/shell_stealing_octopus.jpeg" alt="Photo of a shell-stealing octopus" style="width: 100%;" /></p>

<p>While <code class="language-plaintext highlighter-rouge">SBML_dfs</code> comprehensively organizes pathway data, network
analyses like <a href="https://www.shackett.org/napistu_network_propagation/">personalized
PageRank</a> require
graph representations. <code class="language-plaintext highlighter-rouge">NapistuGraph</code>s convert this tabular data into
networks where compartmentalized species and reactions become vertices
connected by information flow edges. Built on <code class="language-plaintext highlighter-rouge">igraph</code>’s foundation,
they combine versatile graph operations with biological annotations,
data provenance, and specialized network biology methods.</p>

<p>A <code class="language-plaintext highlighter-rouge">NapistuGraph</code> was bundled with the <code class="language-plaintext highlighter-rouge">SBML_dfs</code> downloaded above,
enabling direct loading and analysis:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">napistu_graph_path</span> <span class="o">=</span> <span class="n">downloads</span><span class="p">.</span><span class="n">load_public_napistu_asset</span><span class="p">(</span>
    <span class="n">asset</span> <span class="o">=</span> <span class="n">ASSET</span><span class="p">,</span>
    <span class="n">subasset</span> <span class="o">=</span> <span class="s">"napistu_graph"</span><span class="p">,</span>
    <span class="n">data_dir</span> <span class="o">=</span> <span class="n">DATA_DIR</span><span class="p">,</span>
    <span class="n">version</span> <span class="o">=</span> <span class="n">VERSION_TAG</span>
<span class="p">)</span>

<span class="n">napistu_graph</span> <span class="o">=</span> <span class="n">NapistuGraph</span><span class="p">.</span><span class="n">from_pickle</span><span class="p">(</span><span class="n">napistu_graph_path</span><span class="p">)</span>

<span class="n">summary_stats</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">get_summary</span><span class="p">()</span>
<span class="n">summary_table</span> <span class="o">=</span> <span class="n">ng_utils</span><span class="p">.</span><span class="n">format_napistu_graph_summary</span><span class="p">(</span><span class="n">summary_stats</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">summary_table</span><span class="p">,</span>
    <span class="n">wrap_columns</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="n">column_widths</span> <span class="o">=</span> <span class="p">{</span><span class="s">"Value"</span> <span class="p">:</span> <span class="s">"80%"</span><span class="p">},</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Summaries of the NapistuGraph network"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Summaries of the NapistuGraph network
</figcaption>

<div class="data-table" style="" data-table="[{&quot;Metric&quot;: &quot;Vertices&quot;, &quot;Value&quot;: &quot;446,619&quot;}, {&quot;Metric&quot;: &quot;- Reaction&quot;, &quot;Value&quot;: &quot;402,805 (90.2%)&quot;}, {&quot;Metric&quot;: &quot;- Species&quot;, &quot;Value&quot;: &quot;43,814 (9.8%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Species Types&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;- Protein&quot;, &quot;Value&quot;: &quot;20,980 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Complex&quot;, &quot;Value&quot;: &quot;14,971 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Metabolite&quot;, &quot;Value&quot;: &quot;4,797 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Other&quot;, &quot;Value&quot;: &quot;1,156 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Regulatory Rna&quot;, &quot;Value&quot;: &quot;981 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Unknown&quot;, &quot;Value&quot;: &quot;805 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Drug&quot;, &quot;Value&quot;: &quot;124 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Edges&quot;, &quot;Value&quot;: &quot;9,566,151&quot;}, {&quot;Metric&quot;: &quot;- interactor&quot;, &quot;Value&quot;: &quot;8,721,262 (91.2%)&quot;}, {&quot;Metric&quot;: &quot;- modified&quot;, &quot;Value&quot;: &quot;380,929 (4.0%)&quot;}, {&quot;Metric&quot;: &quot;- stimulator&quot;, &quot;Value&quot;: &quot;223,521 (2.3%)&quot;}, {&quot;Metric&quot;: &quot;- modifier&quot;, &quot;Value&quot;: &quot;88,036 (0.9%)&quot;}, {&quot;Metric&quot;: &quot;- inhibitor&quot;, &quot;Value&quot;: &quot;44,687 (0.5%)&quot;}, {&quot;Metric&quot;: &quot;- catalyst&quot;, &quot;Value&quot;: &quot;43,093 (0.5%)&quot;}, {&quot;Metric&quot;: &quot;- reactant&quot;, &quot;Value&quot;: &quot;34,972 (0.4%)&quot;}, {&quot;Metric&quot;: &quot;- product&quot;, &quot;Value&quot;: &quot;29,651 (0.3%)&quot;}, 
{&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Vertex Attributes&quot;, &quot;Value&quot;: &quot;name, node_name, node_type, species_type, s_id, c_id&quot;}, {&quot;Metric&quot;: &quot;Edge Attributes&quot;, &quot;Value&quot;: &quot;from, to, r_id, sbo_term, stoichiometry, species_type, r_isreversible, direction, string_wt, weight, upstream_weight, source_wt&quot;}]" data-columns="[{&quot;title&quot;: &quot;Metric&quot;, &quot;field&quot;: &quot;Metric&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true}, {&quot;title&quot;: &quot;Value&quot;, &quot;field&quot;: &quot;Value&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;80%&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>Many vertex and edge attributes mirror those from the <code class="language-plaintext highlighter-rouge">SBML_dfs</code>
summaries.</p>

<p>The critical quantitative attribute is <em>edge weight</em>, representing each
interaction’s plausibility and strength. Edge weights drive most graph
algorithms — from shortest path calculations and network layouts to
propagation methods. However, capturing this complexity in a single
attribute becomes increasingly challenging as more sources contribute
quantitative information relevant to regulatory plausibility.</p>
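<p>A toy illustration of why the weighting scheme matters: on the same
three-vertex topology, lowering the weight of a single edge flips which
path is shortest. This pure-Python Dijkstra sketch is for intuition only;
in practice these queries run through
<code class="language-plaintext highlighter-rouge">igraph</code>.</p>

```python
import heapq

def dijkstra(edges, source, target):
    """Weighted shortest-path distance over an undirected edge dict."""
    adjacency = {}
    for (u, v), w in edges.items():
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))
    best, heap = {}, [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node in best:
            continue  # already settled with a shorter distance
        best[node] = dist
        for neighbor, weight in adjacency.get(node, []):
            if neighbor not in best:
                heapq.heappush(heap, (dist + weight, neighbor))
    return best.get(target, float("inf"))

# an implausible direct A -> C edge (high weight) loses to the detour
edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 5.0}
print(dijkstra(edges, "A", "C"))  # 2.0: the A -> B -> C path wins

# a trustworthy direct interaction (low weight) wins instead
edges[("A", "C")] = 0.5
print(dijkstra(edges, "A", "C"))  # 0.5: the direct edge wins
```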

<div class="content-section ai-aside">
  <div class="section-content">
    <p>Earlier versions of the Napistu human
consensus model used simple heuristics for edge weighting: assign
favorable (low) weights to sparse mechanistic sources like <em>Reactome</em>
while quantitatively weighting <em>STRING</em> based on its confidence scores.
This approach worked when <em>STRING</em> dominated the quantitative landscape,
but the Octopus model’s addition of moderately dense sources —
<em>OmniPath</em>, <em>IntAct</em>, and <em>Reactome-FI</em> — each with their own
confidence metrics complicates this strategy. Rather than continuing to
stack ad hoc weighting schemes, the growing diversity of quantitative
evidence calls for more principled approaches.</p>

<p>I’m increasingly interested in learning edge trustworthiness empirically
through predictive performance rather than manual calibration. While
this is challenging for biological applications like regulatory network
prediction due to limited ground truth data, the network itself offers
opportunities for self-supervised learning. However, realizing this
potential requires a rich feature space beyond basic topological
properties — which brings us to the wealth of quantitative information
that can be integrated into the NapistuGraph representation.</p>

  </div>
</div>

<p>The default NapistuGraph contains a limited array of vertex and edge
attributes, but there’s actually a wealth of quantitative information
highlighted throughout this post that can be integrated directly into
the Octopus network. The power of Napistu’s design becomes apparent when
we start layering in this additional context. I’ll demonstrate by
augmenting the graph with two particularly valuable information types:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">add_sbml_dfs_summaries</code>: Generates source and ontology occurrence
data for all vertices, revealing which databases contributed to each
node and what biological categories they represent</li>
  <li><code class="language-plaintext highlighter-rouge">add_all_entity_data</code>: Transfers comprehensive quantitative
measurements from the <code class="language-plaintext highlighter-rouge">reactions_data</code> and <code class="language-plaintext highlighter-rouge">species_data</code> tables directly
onto their corresponding edges and vertices</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># augment the graph
# add ontology and source data to vertices
</span><span class="n">napistu_graph</span><span class="p">.</span><span class="n">add_sbml_dfs_summaries</span><span class="p">(</span><span class="n">sbml_dfs</span><span class="p">,</span> <span class="n">stratify_by_bqb</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>

<span class="c1"># add reactions_data to edges
</span><span class="n">napistu_graph</span><span class="p">.</span><span class="n">add_all_entity_data</span><span class="p">(</span><span class="n">sbml_dfs</span><span class="p">,</span> <span class="s">"reactions"</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">napistu_graph</span><span class="p">.</span><span class="n">add_all_entity_data</span><span class="p">(</span><span class="n">sbml_dfs</span><span class="p">,</span> <span class="s">"species"</span><span class="p">,</span> <span class="n">mode</span> <span class="o">=</span> <span class="s">"extend"</span><span class="p">)</span>

<span class="n">summary_stats</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">get_summary</span><span class="p">()</span>
<span class="n">summary_table</span> <span class="o">=</span> <span class="n">ng_utils</span><span class="p">.</span><span class="n">format_napistu_graph_summary</span><span class="p">(</span><span class="n">summary_stats</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">summary_table</span><span class="p">,</span>
    <span class="n">wrap_columns</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="n">column_widths</span> <span class="o">=</span> <span class="p">{</span><span class="s">"Value"</span> <span class="p">:</span> <span class="s">"80%"</span><span class="p">},</span>
    <span class="n">caption</span> <span class="o">=</span> <span class="s">"Post-augmentation summaries of the NapistuGraph network"</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Post-augmentation summaries of the NapistuGraph network
</figcaption>

<div class="data-table" style="" data-table="[{&quot;Metric&quot;: &quot;Vertices&quot;, &quot;Value&quot;: &quot;446,619&quot;}, {&quot;Metric&quot;: &quot;- Reaction&quot;, &quot;Value&quot;: &quot;402,805 (90.2%)&quot;}, {&quot;Metric&quot;: &quot;- Species&quot;, &quot;Value&quot;: &quot;43,814 (9.8%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Species Types&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;- Protein&quot;, &quot;Value&quot;: &quot;20,980 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Complex&quot;, &quot;Value&quot;: &quot;14,971 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Metabolite&quot;, &quot;Value&quot;: &quot;4,797 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Other&quot;, &quot;Value&quot;: &quot;1,156 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Regulatory Rna&quot;, &quot;Value&quot;: &quot;981 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Unknown&quot;, &quot;Value&quot;: &quot;805 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;- Drug&quot;, &quot;Value&quot;: &quot;124 (0.0%)&quot;}, {&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Edges&quot;, &quot;Value&quot;: &quot;9,566,151&quot;}, {&quot;Metric&quot;: &quot;- interactor&quot;, &quot;Value&quot;: &quot;8,721,262 (91.2%)&quot;}, {&quot;Metric&quot;: &quot;- modified&quot;, &quot;Value&quot;: &quot;380,929 (4.0%)&quot;}, {&quot;Metric&quot;: &quot;- stimulator&quot;, &quot;Value&quot;: &quot;223,521 (2.3%)&quot;}, {&quot;Metric&quot;: &quot;- modifier&quot;, &quot;Value&quot;: &quot;88,036 (0.9%)&quot;}, {&quot;Metric&quot;: &quot;- inhibitor&quot;, &quot;Value&quot;: &quot;44,687 (0.5%)&quot;}, {&quot;Metric&quot;: &quot;- catalyst&quot;, &quot;Value&quot;: &quot;43,093 (0.5%)&quot;}, {&quot;Metric&quot;: &quot;- reactant&quot;, &quot;Value&quot;: &quot;34,972 (0.4%)&quot;}, {&quot;Metric&quot;: &quot;- product&quot;, &quot;Value&quot;: &quot;29,651 (0.3%)&quot;}, 
{&quot;Metric&quot;: &quot;&quot;, &quot;Value&quot;: &quot;&quot;}, {&quot;Metric&quot;: &quot;Vertex Attributes&quot;, &quot;Value&quot;: &quot;name, node_name, node_type, species_type, s_id, c_id, Dogma, chemspider, go, metanetx.reaction, chebi, intact, mdpi, kegg.glycan, signor, doi, envipath, url, bigg.reaction, pubchem, kegg.compound, sabiork, hmdb, ensembl_protein, metanetx.chemical, uniprot, refseq, seed.compound, reactome, hprd, lipidmaps, ols, omim, sgc, mirbase, Recon3D, slm, Reactome-FI, ec-code, ncbi_books, kegg.drug, reactome.reaction, refseq_synonym, rnacentral, other, biocyc, pubmed, ebi_refseq, bigg.metabolite, ncbi_entrez_gene, reactome.compound, iuphar.ligand, NCI_Thesaurus, smiles, dx_doi, pubchem.compound, TRRUST, IntAct, doid, inchi_key, refseq_name, ccds, seed.reaction, ncbigi, OmniPath, STRING, Reactome, biorxiv, phosphosite, rhea, ensembl_gene, corum, ensembl_transcript, matrixdb_biomolecule, kegg.reaction, OmniPath_species_type&quot;}, {&quot;Metric&quot;: &quot;Edge Attributes&quot;, &quot;Value&quot;: &quot;from, to, r_id, sbo_term, stoichiometry, species_type, r_isreversible, direction, string_wt, weight, upstream_weight, source_wt, STRING_neighborhood_transferred, IntAct_n_publications, OmniPath_n_references, OmniPath_n_primary_sources, STRING_database_transferred, IntAct_interaction_method_imaging technique, STRING_experiments, OmniPath_consensus_inhibition, IntAct_publication_score, OmniPath_is_directed, IntAct_interaction_method_unknown, OmniPath_consensus_stimulation, STRING_textmining_transferred, OmniPath_is_stimulation, OmniPath_consensus_direction, IntAct_interaction_type_physical association, STRING_neighborhood, Reactome-FI_fi_score, IntAct_interaction_type_colocalization, IntAct_interaction_method_post transcriptional interference, STRING_experiments_transferred, IntAct_miscore, OmniPath_is_inhibition, STRING_combined_score, STRING_fusion, IntAct_interaction_method_biochemical, STRING_coexpression, STRING_cooccurence, 
IntAct_interaction_type_association, STRING_textmining, OmniPath_n_sources, IntAct_interaction_method_protein complementation assay, IntAct_interaction_type_score, STRING_homology, IntAct_interaction_method_score, IntAct_interaction_method_biophysical, IntAct_interaction_type_direct interaction, STRING_coexpression_transferred, STRING_database&quot;}]" data-columns="[{&quot;title&quot;: &quot;Metric&quot;, &quot;field&quot;: &quot;Metric&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true}, {&quot;title&quot;: &quot;Value&quot;, &quot;field&quot;: &quot;Value&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;80%&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>Now, we’ve gone from a relatively spartan set of vertex and edge
attributes to a comprehensive graph of human cellular physiology
enriched with detailed biological annotations that describe what each
vertex and edge represents. This is a robust foundation for training
expressive network-based methods like graph neural networks.</p>

<h2 id="summary">Summary</h2>

<p>The Octopus model integrates eight diverse pathway databases into a
unified, genome-scale network of human cellular physiology. Through
systematic merging of complementary data sources, the model establishes
a shared molecular vocabulary while preserving each source’s specialized
contributions:</p>

<ul>
  <li><strong>Molecular species integration</strong>: Proteins and other species are
effectively shared across sources, creating a common parts list that
specialized databases can extend with domain-specific molecules
(metabolites from Recon3D, complexes from Reactome).</li>
  <li><strong>Reaction specialization</strong>: Sources show modest but meaningful
overlap in reactions, with each database contributing unique
mechanisms that reflect its individual curation focus.</li>
</ul>

<p>The resulting NapistuGraph provides a framework for layering extensive
biological information onto network structures. Source provenance,
confidence scores, ontological classifications, and mechanistic
annotations can be systematically integrated as vertex and edge
attributes, enabling sophisticated analyses from network propagation to
machine learning approaches.</p>
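<p>To sketch what this attribute layering looks like in practice — a toy example with made-up attribute names, not the actual NapistuGraph API — per-edge metadata such as source provenance and confidence scores can ride along with the topology and be filtered before analysis:</p>

```python
# Toy illustration of attribute-layered edges (hypothetical names,
# NOT the NapistuGraph data structures).
edges = [
    {"from": "TP53", "to": "MDM2", "source": "Reactome", "confidence": 0.92},
    {"from": "TP53", "to": "ATM", "source": "STRING", "confidence": 0.41},
    {"from": "MDM2", "to": "UBB", "source": "IntAct", "confidence": 0.77},
]

def filter_edges(edges, min_confidence):
    """Keep only edges whose confidence score meets a threshold."""
    return [e for e in edges if e["confidence"] >= min_confidence]

high_conf = filter_edges(edges, 0.7)
print([(e["from"], e["to"]) for e in high_conf])
```

The same pattern extends naturally to filtering by source database or SBO term before running network propagation or training a model.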

<p>The Octopus model is now ready for use. I’m excited to build on this
foundation and to see how the community engages with this new resource.</p>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="genomics" /><category term="python" /><category term="networks" /><summary type="html"><![CDATA[Introducing the Octopus: Napistu’s eight-source Human Consensus Pathway Model that unites the breadth of protein-protein interaction networks with the depth of regulatory databases and metabolic models. The result is a genome-scale directed graph that is both densely connected and mechanistically precise. In this post, I will: Provide an overview of the Octopus model and its construction Show side-by-side summaries of individual data sources highlighting their complementarity Demonstrate that the model successfully merges results, creating a dense network covering the complete cellular repertoire of genes, metabolites, drugs, and complexes Illustrate how source-level information can be carried forward to the Octopus’s graphical network to augment its vertex and edge features]]></summary></entry><entry><title type="html">Building AI-Friendly Scientific Software: A Model Context Protocol Journey</title><link href="https://www.shackett.org/napistu_mcp/" rel="alternate" type="text/html" title="Building AI-Friendly Scientific Software: A Model Context Protocol Journey" /><published>2025-09-04T00:00:00+00:00</published><updated>2025-09-04T00:00:00+00:00</updated><id>https://www.shackett.org/napistu_mcp</id><content type="html" xml:base="https://www.shackett.org/napistu_mcp/"><![CDATA[<p>In this post, I walk through building a remote Model Context Protocol
(<em>MCP</em>) server that enhances AI agents’ ability to navigate and
contribute meaningfully to the complex
<a href="https://github.com/napistu/napistu">Napistu</a> scientific codebase.</p>

<p>This tool empowers new users, advanced contributors, and AI agents alike
to quickly access relevant project knowledge.</p>

<p>Before <em>MCP</em>, I fed Claude a mix of README files, wikis, and raw
code, hoping for useful answers. Tools like Cursor struggled with the
tangled structure, sparking the idea for the Napistu <em>MCP</em> server.</p>

<p>I’ll cover:</p>

<ul>
  <li>Why I built the Napistu <em>MCP</em> server and the problems it solves</li>
  <li>How I deployed it using GitHub Actions and Google Cloud Run</li>
  <li>Case studies showing how AI agents perform with — and without —
<em>MCP</em> context</li>
</ul>

<!--more-->

<h2 id="the-ai-development-paradox">The AI development paradox</h2>

<p>We’re at an interesting inflection point in software development. AI
both <strong>accelerates</strong> and <strong>hinders</strong> the creation of high-quality code.</p>

<h3 id="-acceleration">✅ Acceleration</h3>

<p>AI speeds up development by:</p>

<ul>
  <li>Handling repetitive tasks</li>
  <li>Lowering the barrier to entry</li>
  <li>Simplifying debugging</li>
</ul>

<p>Sometimes I hand an agent a stack of failing <code class="language-plaintext highlighter-rouge">pytest</code> errors and say,
“You handle this.” And it does. It’s pure magic.</p>

<h3 id="-friction">❌ Friction</h3>

<p>But AI also introduces chaos:</p>

<ul>
  <li>Repeats patterns instead of reusing code</li>
  <li>Misses domain-specific idioms</li>
  <li>Adds unnecessary abstractions</li>
  <li>Produces brittle, poorly-structured code</li>
</ul>

<p>This goes beyond simple messiness — it can rapidly escalate into a
technical debt time bomb.</p>

<h3 id="-key-insight">🎯 Key Insight</h3>

<p>AI isn’t inherently good or bad; its performance is <strong>task-dependent</strong>.
Most AI failures can be traced back to missing context: the model simply
wasn’t given the information it needed to succeed. If an AI agent
understands your domain, your design patterns, and your project
structure, it can generate excellent code. Without that context, it’s
flying blind — and that’s where most of the frustration comes from.</p>

<p>Many of us are seeking the balance point where AI maximizes
productivity today, while also working to raise that ceiling by
improving its performance on the tasks that matter most.</p>

<h2 id="information-is-everything">Information is everything</h2>

<p>Context is the central challenge in any domain-specific codebase — be
it a financial trading system, a game engine, or a scientific library.
Until AI agents understand the context of the codebase, they will
struggle to follow existing patterns and conventions, and they will
misuse domain-specific approaches.</p>

<p>Off-the-shelf approaches to AI integration are improving, either by
seamlessly integrating with external services (like Claude talking with
GitHub, Google Docs, etc.), or by directly interfacing with the codebase
itself (as tools like Cursor and Copilot do). But, as you’ll see, while
these options help, they still leave something to be desired.</p>

<h3 id="case-study-the-value-of-context">Case study: the value of context</h3>

<p>To highlight the value of context, I’ll use a real-world example. Say I
ask an agent to help me with the following question:</p>

<blockquote>
  <p>How do I create a consensus network from multiple pathway databases in
Napistu? Please create a single artifact with your initial thoughts.</p>
</blockquote>

<p>Let’s explore a few scenarios to see how this plays out.</p>

<h4 id="scenario-1-no-context">Scenario 1: no context</h4>

<p><em>Without any context</em>, Claude has no idea what I’m talking about and
starts Googling:</p>

<blockquote>
  <p>I’m not familiar with Napistu as a specific software tool or platform
for pathway database analysis. … Since I cannot locate specific
documentation for Napistu, I’ll create an artifact with general
guidance on creating consensus networks from multiple pathway
databases</p>
</blockquote>

<p>Claude suggests some <a href="https://claude.ai/public/artifacts/adc42e81-b7c9-4fba-bd1c-b59c5294de05">helpful
ideas</a>
— but says nothing about Napistu.</p>

<h4 id="scenario-2-some-context">Scenario 2: some context</h4>

<p><em>With relevant code context</em>, provided by pointing Claude to
relevant <code class="language-plaintext highlighter-rouge">.py</code> files
from GitHub, it responds:</p>

<blockquote>
  <p>Looking at the Napistu codebase, I can see this is a systems biology
toolkit for working with pathway models. Let me create a comprehensive
guide on how to create a consensus network from multiple pathway
databases.</p>
</blockquote>

<p>The <a href="https://claude.ai/public/artifacts/5d098345-d881-4122-b37d-91832dcaa72f">resulting
artifact</a>
highlights key classes and functions, organizing them into a clear,
orderly progression. Nonetheless, the response feels disjointed, as if
it pulled snippets from many sources without adequately synthesizing
them. Moreover, producing this response required me to provide specific
<code class="language-plaintext highlighter-rouge">.py</code> files to Claude because only ~15% of the <code class="language-plaintext highlighter-rouge">napistu-py</code> codebase
could fit into Claude’s context window.</p>

<h4 id="scenario-3-expert-knowledge">Scenario 3: expert knowledge</h4>

<p>If you asked an expert (me, 🙃) this question, the guidance would draw
from multiple information sources:</p>

<blockquote>
  <p>Start with the
<a href="https://github.com/napistu/napistu/blob/main/tutorials/merging_models_into_a_consensus.ipynb">merging_models_into_a_consensus</a>
tutorial — it provides a step-by-step walkthrough of this exact
workflow. Building consensus involves calling
<code class="language-plaintext highlighter-rouge">consensus.construct_consensus_model()</code> with multiple <code class="language-plaintext highlighter-rouge">SBML_dfs</code>
objects and a pathway index, which organizes the objects’ metadata.
This is currently being reworked in <a href="https://github.com/napistu/napistu-py/issues/169">Issue
169</a> to remove the
pathway index requirement. Finally, review the <a href="https://github.com/napistu/napistu/wiki/Consensus">consensus Napistu wiki
page</a> to gain a
high-level understanding of the key algorithms.</p>
</blockquote>

<p>This response demonstrates true understanding — it connects theory
(the algorithm), practice (the tutorial), current development (the
GitHub issue), and conceptual framework (the wiki) into actionable
guidance.</p>

<h4 id="providing-ai-agents-with-expert-knowledge">Providing AI agents with expert knowledge</h4>

<p>For an AI agent to match the expert response, we would need to do more
than just expand its context window; we would need to address two key
challenges:</p>

<ol>
  <li><strong>information fragmentation</strong> - Relevant information is scattered
across multiple sources such as code repositories, wikis, issue
trackers, tutorials, README files, and more. This dispersion makes
providing relevant information to agents a cumbersome and often
manual process.</li>
  <li><strong>signal vs. noise</strong> - Critical context can easily be obscured by
large volumes of irrelevant or low-priority information, making it
challenging for AI agents to identify what truly matters.</li>
</ol>

<h3 id="solution-preview-what-if-an-ai-could-retrieve-domain-specific-information-on-demand">Solution preview: What if an AI could retrieve domain-specific information on demand?</h3>

<p>Before diving into how to provide agent-friendly information, let me
first show you the results of that effort in Napistu.</p>

<p>First, I’ll install Napistu with <em>MCP</em> dependencies enabled.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="s1">'napistu[mcp]'</span>
</code></pre></div></div>

<p>Then, I can configure the remote documentation server’s URL and port
with <code class="language-plaintext highlighter-rouge">production_client_config</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">HTML</span><span class="p">,</span> <span class="n">display</span>
<span class="kn">from</span> <span class="nn">napistu.mcp.config</span> <span class="kn">import</span> <span class="n">production_client_config</span>
<span class="n">config</span> <span class="o">=</span> <span class="n">production_client_config</span><span class="p">()</span>

<span class="n">display</span><span class="p">(</span><span class="n">HTML</span><span class="p">(</span><span class="sa">f</span><span class="s">"""
    &lt;div&gt;
    &lt;b&gt;Client config:&lt;/b&gt;&lt;br&gt;
    &lt;b&gt;Host:&lt;/b&gt; </span><span class="si">{</span><span class="n">config</span><span class="p">.</span><span class="n">host</span><span class="si">}</span><span class="s">&lt;br&gt;
    &lt;b&gt;Port:&lt;/b&gt; </span><span class="si">{</span><span class="n">config</span><span class="p">.</span><span class="n">port</span><span class="si">}</span><span class="s">&lt;br&gt;&lt;br&gt;
    &lt;/div&gt;
    """</span><span class="p">))</span>
</code></pre></div></div>

<div>
<b>Client config:</b><br />
<b>Host:</b> napistu-mcp-server-844820030839.us-west1.run.app<br />
<b>Port:</b> 443<br /><br />
</div>

<p>Since this is a remote server, I can now start interacting with it
directly. I’ll pose the consensus modeling question again and then
reformat the AI-friendly JSON output as human-readable tables.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">html</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">napistu.mcp.client</span> <span class="kn">import</span> <span class="n">search_component</span>
<span class="kn">from</span> <span class="nn">shackett_utils.utils</span> <span class="kn">import</span> <span class="n">pd_utils</span>
<span class="kn">from</span> <span class="nn">shackett_utils.blog.html_utils</span> <span class="kn">import</span> <span class="n">display_tabulator</span>

<span class="k">def</span> <span class="nf">sanitize_content</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">pd</span><span class="p">.</span><span class="n">isna</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
        <span class="k">return</span> <span class="s">""</span>
    <span class="c1"># Remove/replace problematic characters
</span>    <span class="n">text</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'[^\w\s\-.,!?():]'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>  <span class="c1"># Keep only basic chars
</span>    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'\s+'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>  <span class="c1"># Normalize whitespace
</span>    <span class="k">return</span> <span class="n">html</span><span class="p">.</span><span class="n">escape</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>  <span class="c1"># HTML escape
</span>
<span class="n">QUERY</span> <span class="o">=</span> <span class="s">"How do I create a consensus network from multiple pathway databases in Napistu?"</span>
<span class="n">COMPONENTS</span> <span class="o">=</span> <span class="p">[</span><span class="s">"codebase"</span><span class="p">,</span> <span class="s">"documentation"</span><span class="p">,</span> <span class="s">"tutorials"</span><span class="p">]</span>

<span class="c1"># Returns actual Napistu function signatures, docs, and usage examples
</span>
<span class="n">combined_results</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">component</span> <span class="ow">in</span> <span class="n">COMPONENTS</span><span class="p">:</span>
    <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">search_component</span><span class="p">(</span>
        <span class="n">component</span><span class="p">,</span>
        <span class="n">QUERY</span><span class="p">,</span>
        <span class="n">config</span><span class="o">=</span><span class="n">config</span>
    <span class="p">)</span>
    <span class="n">results_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">results</span><span class="p">[</span><span class="s">"results"</span><span class="p">]).</span><span class="n">assign</span><span class="p">(</span><span class="n">component</span><span class="o">=</span><span class="n">component</span><span class="p">)</span>
    
    <span class="n">combined_results</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">results_df</span><span class="p">)</span>

<span class="n">combined_results_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">combined_results</span><span class="p">).</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s">"similarity_score"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)[[</span><span class="s">"component"</span><span class="p">,</span>  <span class="s">"similarity_score"</span><span class="p">,</span> <span class="s">"source"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">]]</span>

<span class="n">display_combined_results_df</span> <span class="o">=</span> <span class="n">combined_results_df</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">pd_utils</span><span class="p">.</span><span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">display_combined_results_df</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">pd_utils</span><span class="p">.</span><span class="n">format_character_columns</span><span class="p">(</span><span class="n">display_combined_results_df</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">display_combined_results_df</span><span class="p">[</span><span class="s">'content'</span><span class="p">]</span> <span class="o">=</span> <span class="n">display_combined_results_df</span><span class="p">[</span><span class="s">'content'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">sanitize_content</span><span class="p">)</span>
<span class="n">display_combined_results_df</span><span class="p">[</span><span class="s">'source'</span><span class="p">]</span> <span class="o">=</span> <span class="n">display_combined_results_df</span><span class="p">[</span><span class="s">'source'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">sanitize_content</span><span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span>
    <span class="n">display_combined_results_df</span><span class="p">,</span>
    <span class="n">caption</span><span class="o">=</span><span class="s">"Top search results by cosine similarity"</span><span class="p">,</span>
    <span class="n">wrap_columns</span><span class="o">=</span><span class="p">[</span><span class="s">"source"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">],</span>
    <span class="n">column_widths</span><span class="o">=</span><span class="p">{</span><span class="s">"source"</span> <span class="p">:</span> <span class="s">"25%"</span><span class="p">,</span> <span class="s">"content"</span> <span class="p">:</span> <span class="s">"50%"</span><span class="p">},</span>
    <span class="n">include_index</span> <span class="o">=</span> <span class="bp">False</span>
<span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Top search results by cosine similarity
</figcaption>

<div class="data-table" style="" data-table="[{&quot;component&quot;: &quot;documentation&quot;, &quot;similarity_score&quot;: &quot;0.714&quot;, &quot;source&quot;: &quot;readme: napistu (part 5)&quot;, &quot;content&quot;: &quot; Tutorials These tutorials are intended as stand-alone demonstrations of Napistu s core functionality. Most exampl...&quot;}, {&quot;component&quot;: &quot;documentation&quot;, &quot;similarity_score&quot;: &quot;0.685&quot;, &quot;source&quot;: &quot;readme: napistu (part 1)&quot;, &quot;content&quot;: &quot; Napistu The Napistu project is an approach for creating and working with genome-scale mechanistic networks. Pathwa...&quot;}, {&quot;component&quot;: &quot;codebase&quot;, &quot;similarity_score&quot;: &quot;0.623&quot;, &quot;source&quot;: &quot;functions: napistu.consensus.prepare_consensus_model&quot;, &quot;content&quot;: &quot;napistu.consensus.napistu.cons ensus.prepare_consensus_model(sbml_dfs_list:list SBML_dfs ) tuple dict str,SBML_dfs ,PW...&quot;}, {&quot;component&quot;: &quot;codebase&quot;, &quot;similarity_score&quot;: &quot;0.619&quot;, &quot;source&quot;: &quot;functions: napistu.consensus.construct_consensus_model&quot;, &quot;content&quot;: &quot;napistu.consensus.napistu.cons ensus.construct_consensus_model(sbml_dfs_dict:dict str,SBML_dfs ,pw_index:PWIndex,model...&quot;}, {&quot;component&quot;: &quot;tutorials&quot;, &quot;similarity_score&quot;: &quot;0.608&quot;, &quot;source&quot;: &quot;tutorials: merging_models_into_a_consensus (part 1)&quot;, &quot;content&quot;: &quot;--- title: Tutorial - Merging Networks into a Consensus author: Shackett date: May 9th 2025 --- This notebook ...&quot;}, {&quot;component&quot;: &quot;tutorials&quot;, &quot;similarity_score&quot;: &quot;0.591&quot;, &quot;source&quot;: &quot;tutorials: creating_a_napistu_graph (part 2)&quot;, &quot;content&quot;: &quot; Load an sbml_dfs pathway representation A sbml_dfs , further described in the understanding_sbml_dfs.qmd vig...&quot;}, 
{&quot;component&quot;: &quot;documentation&quot;, &quot;similarity_score&quot;: &quot;0.580&quot;, &quot;source&quot;: &quot;readme: napistu (part 2)&quot;, &quot;content&quot;: &quot;- Represent a range of publicly-available data sources using a common data structure, sbml_dfs , which is meant to f...&quot;}, {&quot;component&quot;: &quot;tutorials&quot;, &quot;similarity_score&quot;: &quot;0.574&quot;, &quot;source&quot;: &quot;tutorials: working_with_genome_scale_networks (part 1)&quot;, &quot;content&quot;: &quot;--- title: Tutorial - Working with Genome-Scale Networks author: Shackett date: May 9th 2025 --- --- pytho...&quot;}, {&quot;component&quot;: &quot;documentation&quot;, &quot;similarity_score&quot;: &quot;0.574&quot;, &quot;source&quot;: &quot;wiki: Data-Sources (part 4)&quot;, &quot;content&quot;: &quot; Formats used in Napistu: Reactome s results are shared in multiple formats with the major data sources being path...&quot;}, {&quot;component&quot;: &quot;documentation&quot;, &quot;similarity_score&quot;: &quot;0.569&quot;, &quot;source&quot;: &quot;wiki: Exploring-Molecular-Relationships-as-Networks (part 1)&quot;, &quot;content&quot;: &quot;Napistu s molecular graphs let us answer biological questions using classical approaches in network analysis. 
This is...&quot;}, {&quot;component&quot;: &quot;codebase&quot;, &quot;similarity_score&quot;: &quot;0.561&quot;, &quot;source&quot;: &quot;functions: napistu.network.net_create.create_napistu_graph&quot;, &quot;content&quot;: &quot;napistu.network.net_create.nap istu.network.net_create.create_napistu_graph(sbml_dfs:SBML_dfs,directed:bool True,wirin...&quot;}, {&quot;component&quot;: &quot;tutorials&quot;, &quot;similarity_score&quot;: &quot;0.552&quot;, &quot;source&quot;: &quot;tutorials: downloading_pathway_data (part 23)&quot;, &quot;content&quot;: &quot;INFO:napistu.consensus:Merging reactions identifiers INFO:napistu.consensus:Merging reactions sources INFO:napistu.co...&quot;}, {&quot;component&quot;: &quot;codebase&quot;, &quot;similarity_score&quot;: &quot;0.545&quot;, &quot;source&quot;: &quot;functions: napistu.consensus._build_consensus_identifiers&quot;, &quot;content&quot;: &quot;napistu.consensus.napistu.cons ensus._build_consensus_identifiers(sbml_df:DataFrame,table_schema:dict,defining_biologi...&quot;}, {&quot;component&quot;: &quot;tutorials&quot;, &quot;similarity_score&quot;: &quot;0.544&quot;, &quot;source&quot;: &quot;tutorials: merging_models_into_a_consensus (part 5)&quot;, &quot;content&quot;: &quot;INFO:napistu.consensus:Creatin g source table INFO:napistu.consensus:Aggregating old sources INFO:napistu.consensus:Re...&quot;}, {&quot;component&quot;: &quot;codebase&quot;, &quot;similarity_score&quot;: &quot;0.537&quot;, &quot;source&quot;: &quot;functions: napistu.network.net_create.process_napistu_graph&quot;, &quot;content&quot;: &quot;napistu.network.net_create.nap istu.network.net_create.process_napistu_graph(sbml_dfs:SBML_dfs,directed:bool True,wiri...&quot;}]" data-columns="[{&quot;title&quot;: &quot;component&quot;, &quot;field&quot;: &quot;component&quot;}, {&quot;title&quot;: &quot;similarity_score&quot;, &quot;field&quot;: &quot;similarity_score&quot;}, {&quot;title&quot;: &quot;source&quot;, &quot;field&quot;: 
&quot;source&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;25%&quot;}, {&quot;title&quot;: &quot;content&quot;, &quot;field&quot;: &quot;content&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>Because the top result is Markdown, I’ll display it as a blockquote.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">Markdown</span>

<span class="k">def</span> <span class="nf">quote_markdown</span><span class="p">(</span><span class="n">markdown_content</span><span class="p">):</span>
    
    <span class="n">suppressed_headings</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">"^#+ (.*)$"</span><span class="p">,</span> <span class="sa">r</span><span class="s">"**\1**"</span><span class="p">,</span> <span class="n">markdown_content</span><span class="p">,</span> <span class="n">flags</span><span class="o">=</span><span class="n">re</span><span class="p">.</span><span class="n">MULTILINE</span><span class="p">)</span>
    <span class="n">blockquoted</span> <span class="o">=</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s">"&gt; </span><span class="si">{</span><span class="n">line</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">suppressed_headings</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">))</span>

    <span class="k">return</span> <span class="n">blockquoted</span>

<span class="c1"># Add blockquote formatting
</span><span class="n">markdown_content</span> <span class="o">=</span> <span class="n">combined_results_df</span><span class="p">[</span><span class="s">"content"</span><span class="p">].</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">display</span><span class="p">(</span><span class="n">Markdown</span><span class="p">(</span><span class="n">quote_markdown</span><span class="p">(</span><span class="n">markdown_content</span><span class="p">)))</span>
</code></pre></div></div>

<blockquote>
  <p><strong>Tutorials</strong></p>

  <p>These tutorials are intended as stand-alone demonstrations of
Napistu’s core functionality. Most examples will focus on small
pathways so that results can easily be reproduced by users.</p>

  <ul>
    <li>Downloading pathway data</li>
    <li>Understanding the <code class="language-plaintext highlighter-rouge">sbml_dfs</code> format</li>
    <li>Merging networks with the <code class="language-plaintext highlighter-rouge">consensus</code> module</li>
    <li>Using the CPR Command Line Interface (CLI)</li>
    <li>Formatting <code class="language-plaintext highlighter-rouge">sbml_dfs</code> as <code class="language-plaintext highlighter-rouge">napistu_graph</code> networks</li>
    <li>Suggesting mechanisms with network approaches</li>
    <li>Adding molecule- and reaction-level information to graphs</li>
    <li>R-based network visualization</li>
  </ul>
</blockquote>

<p>Much of the information that the expert provided is returned in this
initial query. However, the goal is not to deliver <strong>all</strong> relevant
information at once because this would inevitably include a significant
amount of irrelevant data. Rather, we can use tools like
<code class="language-plaintext highlighter-rouge">search_component</code> to give agents agency, putting information at the
tips of their virtual fingers. This allows agents to nimbly explore a
problem, drawing on relevant resources on demand. As a result, rather
than generating generic or hallucinated responses, agents can uncover
actual patterns, locate pertinent tutorials, and gain a deeper
understanding of our domain-specific approaches.</p>

<h3 id="enter-the-model-context-protocol-mcp">Enter the Model Context Protocol (<em>MCP</em>)</h3>

<p><em>MCP</em> provides a standardized way for AI models to access external
information sources. Think of <em>MCP</em> as giving AI agents a research
assistant who knows your project inside and out — someone who can
instantly locate relevant documentation, code examples, and
implementation patterns specific to your domain.</p>

<p>AI agents can interact with <em>MCP</em> servers through two primary
mechanisms: <strong>tools</strong> and <strong>resources</strong>. Think of <strong>resources</strong> as
reference materials agents can read (like a library catalog), and
<strong>tools</strong> as actions they can execute (like asking a librarian to
retrieve specific materials).</p>

<p>Let’s look at how this works in practice with the Napistu <em>MCP</em> server:</p>

<p><strong>Tools</strong> enable agents to perform actions and searches:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">search_documentation()</code> - Find relevant project docs and issues</li>
  <li><code class="language-plaintext highlighter-rouge">search_codebase()</code> - Discover functions, classes, and methods</li>
  <li><code class="language-plaintext highlighter-rouge">search_tutorials()</code> - Locate implementation examples</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">search_component</code> function used in the solution preview above
indirectly uses a tool by calling the lower-level function
<code class="language-plaintext highlighter-rouge">call_server_tools</code>. <code class="language-plaintext highlighter-rouge">call_server_tools</code> in turn calls the actual
<em>FastMCP</em> Client method <code class="language-plaintext highlighter-rouge">call_tool</code>. This method accepts a tool name and
arguments, returning structured results.</p>
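<p>The layering can be sketched with stand-ins for the client machinery
(illustrative only; the real signatures live in
<code class="language-plaintext highlighter-rouge">napistu.mcp.client</code> and speak <em>MCP</em> over HTTP):</p>

```python
import asyncio

# Stand-in for the FastMCP Client; the real one speaks MCP over HTTP.
class StubMCPClient:
    async def call_tool(self, name: str, arguments: dict) -> dict:
        # A live server would dispatch to the registered tool here.
        return {"tool": name, "results": [f"match for {arguments['query']!r}"]}

async def call_server_tools(client, tool_name: str, arguments: dict) -> dict:
    """Thin wrapper over the client's call_tool method."""
    return await client.call_tool(tool_name, arguments)

async def search_component(client, component: str, query: str) -> dict:
    """Convenience layer: map a component name onto its search tool."""
    return await call_server_tools(client, f"search_{component}", {"query": query})

result = asyncio.run(search_component(StubMCPClient(), "documentation", "consensus"))
print(result["tool"])  # search_documentation
```

<p>The point of the indirection is that each layer stays testable on its
own: the convenience function only decides <em>which</em> tool to call, while
the wrapper and client handle transport.</p>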

<p><strong>Resources</strong> in Napistu <em>MCP</em> provide read-only access to structured
information:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">napistu://health</code> - Server status and component health</li>
  <li><code class="language-plaintext highlighter-rouge">napistu://documentation/summary</code> - Overview of available
documentation</li>
  <li><code class="language-plaintext highlighter-rouge">napistu://tutorials/index</code> - Available tutorial content</li>
</ul>

<p>To call a resource endpoint like <code class="language-plaintext highlighter-rouge">napistu://documentation/summary</code> in
Python, we can use the <code class="language-plaintext highlighter-rouge">read_server_resource</code> function, which calls the
<em>FastMCP</em> Client method <code class="language-plaintext highlighter-rouge">read_resource</code>. This method takes a resource
URI and returns the contents of the resource.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">napistu.mcp.client</span> <span class="kn">import</span> <span class="n">read_server_resource</span>

<span class="n">content</span> <span class="o">=</span> <span class="k">await</span> <span class="n">read_server_resource</span><span class="p">(</span><span class="s">"napistu://documentation/summary"</span><span class="p">,</span> <span class="n">config</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">content</span><span class="p">)</span>
</code></pre></div></div>

<pre><code class="language-output">    {
      "readme_files": [
        "napistu",
        "napistu-py",
        "napistu-r",
        "napistu/tutorials"
      ],
      "issues": [
        "napistu",
        "napistu-py",
        "napistu-r"
      ],
      "prs": [
        "napistu",
        "napistu-py",
        "napistu-r"
      ],
      "wiki_pages": [
        "Environment-Setup",
        "Data-Sources",
        "Napistu-Graphs",
        "Model-Context-Protocol-(MCP)-server",
        "SBML-DFs",
        "SBML",
        "Dev-Zone",
        "Exploring-Molecular-Relationships-as-Networks",
        "Precomputed-distances",
        "GitHub-Actions-napistu‐py",
        "Consensus",
        "History"
      ],
      "packagedown_sections": []
    }
</code></pre>

<p>This architecture solves our earlier problem; instead of manually
curating context, AI agents can dynamically discover and retrieve
exactly the information they need.</p>

<h1 id="anatomy-of-the-napistu-mcp-server">Anatomy of the Napistu <em>MCP</em> server</h1>

<p>Before diving into the technical implementation, it’s worth
understanding why I built this system. The Napistu <em>MCP</em> server serves
three key purposes:</p>

<ol>
  <li>Dramatically lowers the barrier to entry for new users who struggle
with the “cold start” problem</li>
  <li>Democratizes domain expertise to encourage broader community
contributions</li>
  <li>Gives core developers’ AI agents comprehensive project knowledge to
efficiently extend the codebase</li>
</ol>

<p>These objectives directly address the information fragmentation and
context limitations we identified earlier. With this motivation in mind,
I’ll provide an overview of the server’s architecture.</p>

<h2 id="fastmcp-foundation">FastMCP foundation</h2>

<p>The Model Context Protocol provides a standard way for AI models to
access external information.
<a href="https://github.com/jlowin/fastmcp"><em>FastMCP</em></a> provides a Flask-like
Python implementation of the <em>MCP</em> protocol.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fastmcp</span> <span class="kn">import</span> <span class="n">FastMCP</span>

<span class="n">mcp</span> <span class="o">=</span> <span class="n">FastMCP</span><span class="p">(</span><span class="s">"napistu-server"</span><span class="p">)</span>

<span class="o">@</span><span class="n">mcp</span><span class="p">.</span><span class="n">resource</span><span class="p">(</span><span class="s">"napistu://health"</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">health_check</span><span class="p">():</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"status"</span><span class="p">:</span> <span class="s">"healthy"</span><span class="p">,</span> <span class="s">"components"</span><span class="p">:</span> <span class="p">[...]}</span>

<span class="o">@</span><span class="n">mcp</span><span class="p">.</span><span class="n">tool</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">search_documentation</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"results"</span><span class="p">:</span> <span class="p">[...]}</span>
</code></pre></div></div>

<p><em>FastMCP</em> manages the protocol details, while we focus on exposing
Napistu’s knowledge.</p>

<p>The <code class="language-plaintext highlighter-rouge">server.py</code> module orchestrates the entire lifecycle through a
simple three-step process:</p>

<ol>
  <li>Create a <em>FastMCP</em> server instance with the validated host, port,
and server name from a standard, or manually defined, configuration.</li>
  <li>Based on the selected profile (execution, docs, or full),
<strong>register</strong> the enabled components with the server; each component
adds its own resources and tools to the endpoint registry.</li>
  <li>Asynchronously <strong>initialize</strong> all registered components, loading
their data sources and setting up semantic search indexing in
parallel.</li>
</ol>

<p>Once this process completes, the server starts listening for incoming
<em>MCP</em> requests.</p>
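<p>As a schematic of those three steps (component and profile names here
are illustrative stubs, not the actual
<code class="language-plaintext highlighter-rouge">server.py</code> code):</p>

```python
import asyncio

# Hypothetical profile registry mirroring the execution/docs/full profiles.
PROFILES = {
    "docs": ["documentation", "codebase", "tutorials"],
    "full": ["documentation", "codebase", "tutorials", "execution"],
}

class StubComponent:
    def __init__(self, name: str):
        self.name = name
        self.ready = False

    def register(self, server: dict) -> None:
        # Each component adds its own endpoints to the registry.
        server.setdefault("endpoints", []).append(f"search_{self.name}")

    async def initialize(self) -> bool:
        await asyncio.sleep(0)  # stands in for loading data + building indexes
        self.ready = True
        return True

async def create_server(profile: str) -> dict:
    server = {"name": "napistu-server"}                          # 1. create
    components = [StubComponent(n) for n in PROFILES[profile]]
    for c in components:
        c.register(server)                                       # 2. register
    await asyncio.gather(*(c.initialize() for c in components))  # 3. initialize in parallel
    server["healthy"] = all(c.ready for c in components)
    return server

server = asyncio.run(create_server("docs"))
print(server["endpoints"])  # ['search_documentation', 'search_codebase', 'search_tutorials']
```

<p>Registering before initializing means the endpoint registry is fixed
up front, while the slow work (fetching data, building search indexes)
happens concurrently.</p>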

<h2 id="components">Components</h2>

<p>Napistu employs a component-based architecture that ensures separation
of concerns — each component manages its own data sources and search
logic. This design supports graceful degradation; for instance, a failed
GitHub API call won’t disrupt tutorial searches. It also enables
flexible deployment, allowing the activation of only the necessary
components. This modularity lets me create servers tailored to specific
use cases—for example, a local server capable of executing Napistu
code or a remote server focused solely on documentation.</p>

<p>The current components are:</p>

<ul>
  <li><strong>Documentation</strong>: READMEs, wiki pages, GitHub issues and PRs</li>
  <li><strong>Codebase</strong>: API documentation and function signatures sourced from
Read The Docs</li>
  <li><strong>Tutorials</strong>: Jupyter notebooks converted into searchable Markdown</li>
  <li><strong>Execution</strong> (<em>in development</em>): Interaction with a live Python
environment</li>
  <li><strong>Health</strong>: Server monitoring and diagnostics for system components</li>
</ul>

<p>Each component follows a consistent pattern: load data, register
endpoints, and handle search.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DocumentationComponent</span><span class="p">(</span><span class="n">MCPComponent</span><span class="p">):</span>
    <span class="k">async</span> <span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">semantic_search</span><span class="p">:</span> <span class="n">SemanticSearch</span> <span class="o">=</span> <span class="bp">None</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Load READMEs, wiki pages, GitHub issues"""</span>
        <span class="c1"># Load external data and populate component state
</span>        <span class="k">return</span> <span class="n">success</span>
    
    <span class="k">def</span> <span class="nf">register</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mcp</span><span class="p">:</span> <span class="n">FastMCP</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""Register resources and tools with MCP server"""</span>
        <span class="o">@</span><span class="n">mcp</span><span class="p">.</span><span class="n">tool</span><span class="p">()</span>
        <span class="k">async</span> <span class="k">def</span> <span class="nf">search_documentation</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">state</span><span class="p">.</span><span class="n">semantic_search</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="s">"documentation"</span><span class="p">)</span>
</code></pre></div></div>

<div class="content-section ai-aside">
  <div class="section-content">
    <p>It’s important to write detailed
AI-first docstrings for <em>MCP</em> resources and tools. This information is
available to most agents before they interact with the server’s
endpoints, so it’s helpful to clarify <strong>when</strong> and <strong>when NOT</strong> to use
the method. While all-caps and bold sections may seem a bit obnoxious to
human readers, they do effectively draw an agent’s attention.</p>

<p>For example, here is part of the docstring for the <code class="language-plaintext highlighter-rouge">search_codebase</code>
tool:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>**USE THIS WHEN:**
- Looking for specific Napistu functions, classes, or modules
- Finding API documentation for Napistu features

**DO NOT USE FOR:**
- General programming concepts not specific to Napistu
- Documentation for other libraries or frameworks
</code></pre></div></div>
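<p>As a sketch, such a docstring sits directly on the tool function,
where the <em>MCP</em> client surfaces it to agents before any call is made
(the tool body here is a stub; in the real server the function would
carry the <code class="language-plaintext highlighter-rouge">@mcp.tool()</code> decorator):</p>

```python
def search_codebase(query: str) -> dict:
    """Search Napistu's API documentation and function signatures.

    **USE THIS WHEN:**
    - Looking for specific Napistu functions, classes, or modules

    **DO NOT USE FOR:**
    - General programming concepts not specific to Napistu
    """
    return {"results": []}  # stub body; only the docstring style matters here

# Agents read __doc__ before deciding whether to invoke the tool.
print("USE THIS WHEN" in search_codebase.__doc__)  # True
```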


  </div>
</div>

<h2 id="smart-search-semantic--vector-embeddings">Smart search: semantic + vector embeddings</h2>

<p>We support two search methods: exact keyword search (e.g.,
“create_consensus”) and semantic search (e.g., “How do I merge pathway
data?”). Semantic search is powered by a <code class="language-plaintext highlighter-rouge">SemanticSearch</code> object used
across components.</p>

<p>The pipeline includes:</p>

<ul>
  <li><strong>Content Processing</strong>: Load content and chunk long documents at
natural boundaries</li>
  <li><strong>Embedding Generation</strong>: Convert chunks to 384-dimensional vectors
using all-MiniLM-L6-v2 sentence transformer, selected for its ease
of implementation and effectiveness with general text</li>
  <li><strong>Vector Storage</strong>: Store embeddings in ChromaDB, along with
metadata to support fast similarity search</li>
  <li><strong>Query Processing</strong>: Embed user queries and find nearest neighbors
using cosine similarity</li>
</ul>
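<p>At its core, the query step is a nearest-neighbor lookup by cosine
similarity. A toy sketch with made-up three-dimensional vectors (real
chunks are embedded into 384 dimensions by the sentence transformer and
stored in ChromaDB):</p>

```python
import math

# Cosine similarity between two equal-length vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented embeddings for three content chunks.
chunks = {
    "merging networks with consensus": [0.9, 0.1, 0.2],
    "installing napistu": [0.1, 0.8, 0.3],
    "graph visualization in R": [0.2, 0.3, 0.9],
}

# Invented embedding of the query "How do I merge pathway data?"
query_vec = [0.85, 0.15, 0.25]
ranked = sorted(chunks, key=lambda doc: cosine(query_vec, chunks[doc]), reverse=True)
print(ranked[0])  # merging networks with consensus
```

<p>This is why semantic search handles synonyms: “merge pathway data”
lands near “consensus” in embedding space even though the keyword never
appears in the query.</p>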

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SemanticSearch</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">persist_directory</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./chroma_db"</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">client</span> <span class="o">=</span> <span class="n">chromadb</span><span class="p">.</span><span class="n">PersistentClient</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="n">persist_directory</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">embedding_function</span> <span class="o">=</span> <span class="n">SentenceTransformerEmbeddingFunction</span><span class="p">(</span>
            <span class="n">model_name</span><span class="o">=</span><span class="s">"all-MiniLM-L6-v2"</span>
        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">search</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">collection_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="c1"># Convert query to vector, find similar content by cosine similarity
</span>        <span class="k">return</span> <span class="n">similarity_results_with_scores</span>
</code></pre></div></div>

<div class="content-section ai-aside">
  <div class="section-content">
    <p>An early version of the server used
keyword-based search to comb through all of the cached information.
Switching to vector-based search massively improved result quality, but
it required me to implement several new features. To approach this
problem, I worked with Claude to research different approaches,
balancing projected performance against ease of implementation and
maintainability. Since Claude was performing well, I
chose to stick with it for implementing the semantic search
functionality instead of switching to Cursor. This worked well because I
could provide the entire <code class="language-plaintext highlighter-rouge">napistu.mcp</code> subpackage as context, and the
codebase was already well-structured. The system introduced unnecessary
complexity in a few areas — such as managing separate component-level
ChromaDB databases rather than a unified centralized database — but
overall, the implementation proceeded efficiently, and I had the
functionality up and running within a few hours.</p>

<p>While I maintain strict oversight of agents contributing to the
scientific portions of the Napistu codebase, I allow greater autonomy
for agents working on the development of the <code class="language-plaintext highlighter-rouge">napistu.mcp</code> subpackage.
To do this, I focus more on code review to validate the AI’s assumptions
(e.g., “do we really need to assign global variables?”), and to
suggest refactoring (e.g., “would creating a <code class="language-plaintext highlighter-rouge">ServerProfile</code> class
simplify component configuration?”). After a session of implementing
features in Claude, I’ve leveraged it to update the <a href="https://github.com/napistu/napistu/wiki/Model-Context-Protocol-(MCP)-server">Napistu MCP server
wiki</a>
with some additional guidance (e.g., “shorten 4-fold, remove this
section”). Maintaining this high-level documentation, accessible via
<em>MCP</em>, effectively helps agents “save their place” for future
development sessions.</p>

  </div>
</div>

<h2 id="the-agent-experience">The agent experience</h2>

<p>With the server architecture in place, let’s explore how this translates
to the actual user experience for both humans and AI agents. The <em>MCP</em>
protocol uses structured JSON messages over HTTP, with one key
advantage: both humans and AI agents interact through the same unified
interface.</p>

<p><strong>Human developers</strong> engage directly using the Napistu client utilities
or the <em>MCP</em> command line interface.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">napistu.mcp.client</span> <span class="kn">import</span> <span class="n">search_component</span>
<span class="kn">from</span> <span class="nn">napistu.mcp.config</span> <span class="kn">import</span> <span class="n">production_client_config</span>

<span class="n">config</span> <span class="o">=</span> <span class="n">production_client_config</span><span class="p">()</span>
<span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">search_component</span><span class="p">(</span><span class="s">"documentation"</span><span class="p">,</span> <span class="s">"how to install Napistu"</span><span class="p">,</span> <span class="n">config</span><span class="o">=</span><span class="n">config</span><span class="p">)</span>
</code></pre></div></div>

<p><strong>AI agents</strong> (such as Claude, Cursor, or any <em>MCP</em>-compatible tool)
send equivalent requests via their respective <em>MCP</em> client
implementations.</p>

<p>For example, when an agent asks “How do I create consensus networks in
Napistu?”, it automatically:</p>

<ol>
  <li>Calls the <code class="language-plaintext highlighter-rouge">search_tutorials</code> tool with the query</li>
  <li>Receives structured results with similarity scores and content
snippets</li>
  <li>May follow up with <code class="language-plaintext highlighter-rouge">search_documentation</code> or <code class="language-plaintext highlighter-rouge">search_codebase</code> for
additional context</li>
  <li>Uses all this information to provide comprehensive, accurate
guidance.</li>
</ol>
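<p>The retrieval loop above can be sketched with stubbed tool results
(no server involved; the scores and snippets are invented):</p>

```python
# Stubbed search tools returning ranked snippets, as the MCP tools would.
def search_tutorials(query: str) -> list:
    return [{"source": "tutorials", "score": 0.91, "snippet": "consensus tutorial ..."}]

def search_documentation(query: str) -> list:
    return [{"source": "documentation", "score": 0.84, "snippet": "consensus wiki ..."}]

def gather_context(query: str, threshold: float = 0.5) -> list:
    hits = search_tutorials(query)        # 1. primary tool call
    hits += search_documentation(query)   # 2. follow-up for additional context
    hits = [h for h in hits if h["score"] >= threshold]
    hits.sort(key=lambda h: h["score"], reverse=True)
    # 3. the agent would now synthesize an answer grounded in these snippets
    return hits

top = gather_context("How do I create consensus networks in Napistu?")[0]
print(top["source"])  # tutorials
```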

<p>The key insight is that agents receive the same rich, structured
responses as human developers but can instantly process and integrate
information across multiple sources. This transforms a simple Q&amp;A
interaction into an expert-level consultation.</p>

<h2 id="from-local-to-global-deployment-story">From local to global: deployment story</h2>

<h3 id="local-development">Local development</h3>

<p>It’s easy to set up a local <em>MCP</em> server that digests relevant documents
and interacts with local agents.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install Napistu with MCP dependencies</span>
pip <span class="nb">install</span> <span class="s1">'napistu[mcp]'</span>

<span class="c"># Start full development server (all components)</span>
python <span class="nt">-m</span> napistu.mcp server full

<span class="c"># Health check shows component loading</span>
python <span class="nt">-m</span> napistu.mcp health <span class="nt">--local</span>
</code></pre></div></div>

<pre><code class="language-output">🏥 Napistu MCP Server Health Check
Server URL: http://127.0.0.1:8765/mcp

Server Status: healthy

Components:
  ✅ documentation: healthy
  ✅ codebase: healthy
  ✅ tutorials: healthy
  ✅ semantic_search: healthy
</code></pre>

<p>This local approach works well for individual developers, but creates
barriers for broader adoption. It requires installing Napistu,
maintaining a background process, and keeping it running — imposing a
significant burden on users who simply want to explore the project or
collaborate.</p>

<h3 id="the-always-up-solution">The always-up solution</h3>

<p>Instead, I aimed to create an always-available service that I and
others can access easily, without any local setup. This meant deploying
the server to the cloud with automatic updates triggered by changes in
the codebase, integrated seamlessly into my <a href="https://github.com/napistu/napistu/wiki/GitHub-Actions-napistu%E2%80%90py">GitHub Actions CI/CD
workflows</a>.</p>

<p>Every tagged release triggers deployment to Google Cloud Run.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Deploy workflow - simplified view</span>
<span class="na">on</span><span class="pi">:</span>
  <span class="na">workflow_run</span><span class="pi">:</span>
    <span class="na">workflows</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">Release"</span><span class="pi">]</span>  <span class="c1"># Auto-deploy after successful release</span>
    <span class="na">types</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">completed</span><span class="pi">]</span>
  <span class="na">schedule</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">cron</span><span class="pi">:</span> <span class="s1">'</span><span class="s">0</span><span class="nv"> </span><span class="s">10</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*'</span>  <span class="c1"># Daily content refresh at 2 AM PST (10 AM UTC)</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">deploy</span><span class="pi">:</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Deploy to Cloud Run</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">gcloud run deploy napistu-mcp-server \</span>
            <span class="s">--image="us-west1-docker.pkg.dev/.../napistu-mcp-server:latest" \</span>
            <span class="s">--cpu=1 --memory=2Gi \</span>
            <span class="s">--set-env-vars="MCP_PROFILE=docs"</span>
</code></pre></div></div>

<p>The production setup runs the “docs” profile (documentation + codebase +
tutorials; the execution component is excluded since it is meant to run
in a user’s local environment) with 1 CPU and 2Gi memory, costing less
than $1 per day. Content is refreshed both upon each new release of
<code class="language-plaintext highlighter-rouge">napistu-py</code> and
nightly to ensure the latest documentation changes are captured.</p>

<p>This deployment strategy creates a powerful feedback loop between
development and documentation.</p>

<h2 id="the-payoff">The payoff</h2>

<p>Now any AI tool can access the Napistu knowledge base instantly at
https://napistu-mcp-server-844820030839.us-west1.run.app. Users don’t
need to install software, run local processes, or handle maintenance;
they can simply configure their AI tools to connect to the shared
knowledge base. The service automatically updates with the latest
documentation and code changes, while Google Cloud Run handles scaling,
health checks, and automatic restarts to ensure high availability.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span><span class="w"> </span><span class="err">Claude</span><span class="w"> </span><span class="err">Desktop</span><span class="w"> </span><span class="err">/</span><span class="w"> </span><span class="err">Cursor</span><span class="w"> </span><span class="err">configuration</span><span class="w">
</span><span class="err">//</span><span class="w"> </span><span class="err">Add</span><span class="w"> </span><span class="err">this</span><span class="w"> </span><span class="err">to</span><span class="w"> </span><span class="err">your</span><span class="w"> </span><span class="err">MCP</span><span class="w"> </span><span class="err">settings</span><span class="w"> </span><span class="err">to</span><span class="w"> </span><span class="err">access</span><span class="w"> </span><span class="err">Napistu</span><span class="w"> </span><span class="err">knowledge</span><span class="w">
</span><span class="p">{</span><span class="w">
  </span><span class="nl">"mcpServers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"napistu"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npx"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"mcp-remote"</span><span class="p">,</span><span class="w"> </span><span class="s2">"https://napistu-mcp-server-844820030839.us-west1.run.app/mcp/"</span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The result is clear: Napistu’s entire knowledge base becomes instantly
searchable by AI agents worldwide, dramatically lowering the barrier to
contribution and collaboration.</p>

<h3 id="️-security-and-privacy">🛡️ Security and privacy</h3>

<p>The remote Napistu <em>MCP</em> server is intended for narrowly-scoped
information retrieval, so security and privacy issues should be minimal:</p>

<ul>
  <li><strong>Scoped</strong> to Napistu-specific resources only</li>
  <li><strong>No user data</strong> stored or processed. Standard cloud platform
logging may occur because this runs on Google Cloud Run</li>
  <li><strong>Auditable</strong>: all resources and tools are openly available in the
public <a href="https://github.com/napistu/napistu-py/tree/main/src/napistu/mcp">GitHub
repository</a></li>
</ul>

<h1 id="case-studies-ai-agents-in-action">Case studies: AI agents in action</h1>

<p>As previously noted, the server’s core goals are to make the codebase
more accessible to new users and collaborators, while also enhancing the
capabilities of AI agents used by core developers. In this section, I’ll
present two case studies that A/B test the impact of the <em>MCP</em> server on
both onboarding and development experiences. In both cases, even when
<em>MCP</em> was not enabled, I provided substantial contextual information
through standard channels — Claude accessed files via GitHub, and
Cursor had access to the full codebase. This makes <em>MCP</em>’s impact less
binary and instead highlights its marginal contribution in realistic,
everyday scenarios.</p>

<h2 id="case-study-1-learning-with-claude">Case Study 1: Learning with Claude</h2>

<p>To illustrate <em>MCP</em>’s value for training, I provided Claude the same
prompt both with and without <em>MCP</em> enabled for comparison:</p>

<blockquote>
  <p>I’m new to Napistu. Can you provide a high-level overview of the
structure, creation, and usage of SBML_dfs? Please think deeply and
incorporate your response into a Markdown file.</p>
</blockquote>

<h3 id="without-mcp">Without <em>MCP</em></h3>

<p><img src="https://www.shackett.org/figure/napistu_mcp/claude_no_mcp_clipped.gif" alt="GIF showing a Claude session without using the MCP server" style="width: 100%;" /></p>

<p>The resulting
<a href="https://claude.ai/public/artifacts/83a1e517-cf6f-42fa-8693-02fa710e6854">artifact</a>
is a mixed bag:</p>

<ul>
  <li>✅ Provides a good overview of the core and optional tables and
their relationships.</li>
  <li>✅ Public methods are grouped logically, with some light-weight
explanations.</li>
  <li>✅ Most code appears syntactically correct.</li>
  <li>❌ Logistically, I had to manually add select files from GitHub,
which requires prior knowledge of the codebase — something a new
user would likely lack.</li>
  <li>❌ The “Creating SBML_dfs Objects” section mentions a subset of the
approaches and includes the consensus logic, which feels out of
place.</li>
  <li>❌ Advanced usage and “integration with the Napistu ecosystem”
consists of random functionality inferred from the CLI.</li>
</ul>

<h3 id="with-mcp">With <em>MCP</em></h3>

<p><img src="https://www.shackett.org/figure/napistu_mcp/claude_w_mcp_clipped.gif" alt="GIF showing a Claude session while using the MCP server" style="width: 100%;" /></p>

<p>Armed with the <em>MCP</em>, the
<a href="https://claude.ai/public/artifacts/492972cc-5157-4dcd-8840-8f058c3dfc1b">artifact</a>
is well rounded, but far from perfect.</p>

<ul>
  <li>✅ Good overview of the core and optional tables and their
relationships.</li>
  <li>✅ Good high-level overview of how to create SBML_dfs and its public
methods.</li>
  <li>✅ Advanced usage, best practices, and general use cases are
solid.</li>
  <li>❌ <code class="language-plaintext highlighter-rouge">model_source = source.Source("MyDatabase", "v1.0")</code>. This line
captures the gist of the <code class="language-plaintext highlighter-rouge">model_source</code> object, but it won’t
actually run. Since this was a new addition to the codebase, this is
probably a case where the documentation has lagged behind the code.
This serves as a helpful reminder that you’ll only get relevant
information when your sources are up-to-date.</li>
</ul>

<p>I definitely prefer the artifact generated with <em>MCP</em>, even
though both artifacts came from a single initial prompt. Without
<em>MCP</em>, a user’s follow-up questions would quickly devolve into
hallucinations. With access to the Napistu <em>MCP</em>, Claude can
continue to guide the user through the complex codebase while
maintaining a high-level perspective.</p>

<p><strong>The Result</strong>: From “intimidating research codebase” to “approachable,
guided experience”</p>

<h2 id="case-study-2-building-with-cursor">Case Study 2: Building with Cursor</h2>

<p>There are several areas where access to the <em>MCP</em> server would
significantly benefit Cursor:</p>

<ul>
  <li>🚀 When NOT working on the actual Napistu codebase, Cursor could
still look up classes and functions.</li>
  <li>🚀 For general usage questions and training prompts, as demonstrated
in Case Study 1, having access to related content and leveraging
semantic search to handle synonyms would be especially valuable.
However, this is not a major Cursor use case, and users would likely
receive better responses from a general-purpose LLM like Claude in
such scenarios.</li>
</ul>

<p>And there are situations where access to the Napistu <em>MCP</em> would be
entirely unnecessary:</p>

<ul>
  <li>🤷 When working directly on the Napistu codebase, Cursor can
efficiently look up function signatures and search for functionality
using its native methods.</li>
</ul>

<p>Rather than setting up a strawman by denying Cursor access to the
codebase, I wanted to explore scenarios when <em>MCP</em> could help Cursor in
more nuanced situations. In exploring this question, I asked Cursor
directly — and I found its response quite insightful.</p>

<blockquote>
  <p>MCP tools in Cursor are about bridging the gap between “what the code
can do” and “how it’s meant to be used”. They’re not replacing code
navigation; rather, they’re adding the intent and context that lives
in documentation and tutorials.</p>
</blockquote>

<p>Intent and context become particularly relevant when applying a
framework like Napistu, rather than directly extending it. In this
sense, the Napistu <em>MCP</em> is particularly valuable when using Napistu to
explore scientific questions within a notebook environment. Given that
Cursor recently added support for Jupyter notebooks (albeit still in an
early and somewhat rough state), this represents a particularly
compelling use case. To keep the task straightforward, I asked Cursor
to extend one of the Napistu tutorials, since tutorials should be a mix
of code and explanatory prose, just like a good biological analysis:

<blockquote>
  <p>Can you help me extend the <code class="language-plaintext highlighter-rouge">understanding_sbml_dfs.ipynb</code> tutorial to
flesh out the “from_edgelist” workflow and to include any recent
updates to the core data structure? THINK DEEPLY AND (DO NOT USE THE
NAPISTU MCP / USE THE NAPISTU MCP AS NEEDED). Since you’ll have
trouble directly editing the ipynb, please suggest what I should
incorporate in a separate Markdown file. Edit for readability and to
prioritize high value content. Limit the total content to less than 30
new sentences.</p>
</blockquote>

<h3 id="without-mcp-1">Without <em>MCP</em></h3>

<p><img src="https://www.shackett.org/figure/napistu_mcp/cursor_no_mcp_clipped.gif" alt="GIF showing a Cursor session without using the MCP server" style="width: 100%;" /></p>

<p>The
<a href="https://www.shackett.org/post_support_20250903/#cursor-without-mcp">artifact</a>
has its high and low points:</p>

<ul>
  <li>✅ The summary of the edgelist format with running code is quite
good.</li>
  <li>❌ It has no idea what I meant by “recent updates” and just provides
code snippets for random public functions.</li>
</ul>

<h3 id="with-mcp-1">With <em>MCP</em></h3>

<p><img src="https://www.shackett.org/figure/napistu_mcp/cursor_w_mcp_clipped.gif" alt="GIF showing a Cursor session while using the MCP server" style="width: 100%;" /></p>

<p>Third-party integrations with Cursor are in a far rawer state than in
the major models, so you really have to twist Cursor’s arm to use them. In
practice, the experience is underwhelming — Cursor tends to rely
solely on the <code class="language-plaintext highlighter-rouge">search_codebase</code> tool, which surfaces information it
already has access to. As a result, the actual
<a href="https://www.shackett.org/post_support_20250903/#cursor-with-mcp">output</a> is
fairly poor:</p>

<ul>
  <li>❌ Only pseudocode describing the edgelist format</li>
  <li>❌ Misinterprets “recent updates,” instead listing all public
methods not already covered in the tutorial</li>
</ul>

<p>Between these two scenarios, I would choose the output generated without
<em>MCP</em> access. Much of this comes down to how Cursor used the <em>MCP</em>
server. It tends to follow a one-track mindset — fixated on “code,
code, code” — even when equipped with tools that could broaden its
scope. The key takeaway is that the agentic coding space still has
significant room to mature, particularly in fields like computational
biology, where effective work demands both technical execution and deep
domain insight.</p>

<h1 id="agents-for-science">Agents for science</h1>

<p>In this post, I’ve shared how I’ve improved my AI-based code development
experience by creating a remote Model Context Protocol (<em>MCP</em>) server
for my scientific codebase,
<a href="https://github.com/napistu/napistu">Napistu</a>. The server delivers
on-demand, contextually relevant information to AI agents, enabling them
to efficiently surface heterogeneous content and synthesize it into
actionable guidance. Deployed to Google Cloud Run via GitHub Actions,
the server can easily be used by both new and advanced users, at little
cost to me.</p>

<p>To clarify the benefits of on-demand context, I provided case studies
comparing agent behavior with and without access to the <em>MCP</em> server.
These examples highlight the server’s impact on both onboarding
efficiency and the overall development experience.</p>

<h2 id="whats-next">What’s next</h2>

<p><strong>More content</strong></p>

<ul>
  <li><strong>External library documentation</strong>: Because the codebase
content is prepared by directly scraping its <a href="https://napistu.readthedocs.io/en/latest/">Read the
Docs</a> site, it would be
straightforward to ingest documentation for non-Napistu libraries
like <em>igraph</em>.</li>
  <li><strong>More Napistu docs</strong>: Include the Napistu CLI, READMEs, and the
<em>napistu.r</em> pkgdown site.</li>
  <li><strong>Supporting multiple Napistu versions</strong>: If the core data
is prepared as part of Napistu’s CI/CD workflow and saved to GCS, the
server can download and cache a local version matching the user’s
request.</li>
</ul>

<p><strong>More power</strong></p>

<ul>
  <li><strong>Cross component semantic search</strong>: This would allow agents to
search across multiple components, providing a more comprehensive
understanding of the codebase.</li>
  <li><strong>Execution components running Napistu functions</strong>: The execution
components would enable agents to register Python objects and apply
transformations using Napistu functions. While still experimental,
they would allow agents to execute multiple steps in a
live Python environment (like looking up two genes and finding the
shortest path between them, entirely within the execution context).</li>
</ul>
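<p>The shortest-path lookup can be sketched in a few lines of plain
Python. The gene names and edge list below are made up purely for
illustration; the real execution components would operate on a live
<code class="language-plaintext highlighter-rouge">NapistuGraph</code> rather than a hard-coded edge list:</p>

```python
from collections import deque

# Toy directed edge list standing in for a regulatory graph.
# Gene names and edges are hypothetical, for illustration only.
EDGES = [
    ("TP53", "MDM2"),
    ("MDM2", "TP53"),
    ("TP53", "CDKN1A"),
    ("CDKN1A", "CCNE1"),
    ("CCNE1", "CDK2"),
]

def shortest_path(edges, source, target):
    """Breadth-first search returning one shortest path, or None."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(EDGES, "TP53", "CDK2"))
# ['TP53', 'CDKN1A', 'CCNE1', 'CDK2']
```

<p>Within the execution context, an agent could chain a step like this
with identifier lookups, keeping intermediate objects alive between
tool calls.</p>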

<p><strong>More science</strong></p>

<ul>
  <li>Planning features and updating documentation with Claude</li>
  <li>Efficiently implementing features and squashing bugs in Cursor</li>
</ul>

<h2 id="-getting-started">🔧 Getting Started</h2>

<p>Want to explore or contribute?</p>

<ul>
  <li>Configure Claude Desktop or your favorite LLM with the <em>MCP</em> server.</li>
  <li>Ask questions about Napistu’s internals and architecture.</li>
  <li>Start contributing to open issues with AI-assisted development.</li>
  <li>Join our community discussions to collaborate, share ideas, and help
shape the project.</li>
</ul>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="AI" /><category term="python" /><category term="SWE" /><summary type="html"><![CDATA[In this post, I walk through building a remote Model Context Protocol (MCP) server that enhances AI agents’ ability to navigate and contribute meaningfully to the complex Napistu scientific codebase. This tool empowers new users, advanced contributors, and AI agents alike to quickly access relevant project knowledge. Before MCP, I fed Claude a mix of README files, wikis, and raw code hoping for useful answers. Tools like Cursor struggled with the tangled structure, sparking the idea for the Napistu MCP server. I’ll cover: Why I built the Napistu MCP server and the problems it solves How I deployed it using GitHub Actions and Google Cloud Run Case studies showing how AI agents perform with — and without — MCP context]]></summary></entry><entry><title type="html">Network Biology with Napistu, Part 2: Translating Statistical Associations into Biological Mechanisms</title><link href="https://www.shackett.org/napistu_network_propagation/" rel="alternate" type="text/html" title="Network Biology with Napistu, Part 2: Translating Statistical Associations into Biological Mechanisms" /><published>2025-08-27T00:00:00+00:00</published><updated>2025-08-27T00:00:00+00:00</updated><id>https://www.shackett.org/napistu_network_propagation</id><content type="html" xml:base="https://www.shackett.org/napistu_network_propagation/"><![CDATA[<p>This is part two of a two-part series on
<strong><a href="https://github.com/napistu/napistu">Napistu</a></strong> — a new framework
for building genome-scale molecular networks and integrating them with
high-dimensional data. Using a methylmalonic acidemia (MMA) multimodal
dataset as a case study, I’ll demonstrate how to distill
disease-relevant signals into mechanistic insights through network-based
analysis.</p>

<h3 id="from-statistical-associations-to-biological-mechanisms">From statistical associations to biological mechanisms</h3>

<p>Modern genomics excels at identifying disease-associated genes and
proteins through statistical analysis. Methods like Gene Set Enrichment
Analysis (<em>GSEA</em>) group these genes into functional categories, offering
useful biological context. However, we aim to go beyond simply
identifying which genes and gene sets change. Our goal is to understand
why these genes change together, uncovering the mechanistic depth
typically seen in Figure 1 of a <em>Cell</em> paper. To achieve this, we must
identify key molecular components, summarize their interactions, and
characterize the dynamic cascades that drive emergent biological
behavior.</p>

<p>In this post, I’ll demonstrate how to gain this insight by mapping
statistical disease signatures onto genome-scale biological networks.
Then, using personalized PageRank, I’ll trace signals from dysregulated
genes back to their shared regulatory origins. This transforms lists of
differentially expressed genes into interconnected modules that reveal
upstream mechanisms driving coordinated molecular changes.</p>

<!--more-->
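<p>To make the propagation step concrete, here is a minimal numpy sketch
of personalized PageRank via power iteration. This is a generic
illustration on a made-up four-node chain, not Napistu’s implementation:</p>

```python
import numpy as np

def personalized_pagerank(A, restart, damping=0.85, tol=1e-10, max_iter=1000):
    """Personalized PageRank by power iteration.

    A[i, j] holds the weight of the edge j -> i; `restart` holds the
    seed weights (e.g. per-gene disease scores) that bias the walk.
    """
    out_degree = A.sum(axis=0)
    # column-normalize so each node's outgoing probability sums to 1
    P = A / np.where(out_degree == 0, 1, out_degree)
    r = restart / restart.sum()
    scores = r.copy()
    for _ in range(max_iter):
        updated = damping * (P @ scores) + (1 - damping) * r
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores

# made-up 4-node chain 0 -> 1 -> 2 -> 3, with all seed mass on node 0
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[v, u] = 1.0
seed = np.array([1.0, 0.0, 0.0, 0.0])
scores = personalized_pagerank(A, seed)
print(scores)  # mass decays with distance from the seed node
```

<p>The restart vector is what makes the walk “personalized”: rather than
teleporting uniformly, the walker returns to the seeded (disease-weighted)
nodes, so high-scoring vertices are those reachable from the signal.</p>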

<h3 id="napistus-implementation">Napistu’s implementation</h3>

<p>Napistu makes network biology practical through three core capabilities:</p>

<ol>
  <li>
    <p>Pathway representation with <code class="language-plaintext highlighter-rouge">SBML_dfs</code></p>

    <p>Napistu uses a custom format,
<a href="https://github.com/napistu/napistu/wiki/SBML-DFs"><code class="language-plaintext highlighter-rouge">SBML_dfs</code></a>, to
faithfully capture regulatory mechanisms. It tracks genes,
metabolites, protein complexes, and drugs as molecular species,
connecting them through regulatory interactions and biochemical
transformations.</p>
  </li>
  <li>
    <p>Translation to <code class="language-plaintext highlighter-rouge">NapistuGraph</code>s</p>

    <p>It provides tools to convert these pathway representations into
<a href="https://github.com/napistu/napistu/wiki/Napistu-Graphs"><code class="language-plaintext highlighter-rouge">NapistuGraph</code></a>,
where vertices represent molecular species and edges represent
direct regulatory relationships.</p>
  </li>
  <li>
    <p>Biological query capabilities</p>

    <p>Napistu enables users to ask general-purpose questions of these
networks, such as:</p>

    <ul>
      <li><em>What is the relationship between two genes?</em></li>
      <li><em>What are the direct and indirect regulators of a molecular
target?</em></li>
      <li><em>And — as we’ll explore in this post — what shared
mechanisms unite a set of disease-associated genes?</em></li>
    </ul>
  </li>
</ol>

<p>Throughout this post, I’ll use two types of asides to add context
without interrupting the main flow:</p>

<ul>
  <li>🟩 Green boxes offer biological background and systems biology
“inside baseball.”</li>
  <li>🟦 Blue boxes share reflections on building scientific software in
the age of AI.</li>
</ul>

<div class="content-section bio-section">
  <div class="section-content">
    <p><strong>For biologists:</strong> Discover a
versatile open-source framework designed to tackle one of the key
last-mile problems in working with high-dimensional genomics data. I’ll
show you how network propagation recovers both the genetic drivers of
MMA and its major metabolic dysfunction from functional changes in the
transcriptome and proteome. These regulatory patterns suggest that
investigating specific pathways involved in metabolic sensing — such as
ROS and sirtuins — could offer promising insights into MMA
pathophysiology.</p>

  </div>
</div>

<div class="content-section ai-aside">
  <div class="section-content">
    <p><strong>For computational folks:</strong> In this
post, I’ll walk you through how Napistu seamlessly integrates network
models with high-dimensional data using practical workflows — from
multimodal identifier mapping to personalized PageRank with empirical
nulls. Plus, I’ll share firsthand insights on leveraging AI to develop
complex scientific software — tackling challenges that often lie
beyond the reach of large language models (LLMs).</p>

  </div>
</div>

<h3 id="series-overview">Series overview</h3>

<p><strong><a href="https://shackett.org/multiomic_profiles/">Part 1: Creating Multimodal Disease
Profiles</a></strong> established the
foundation for this post by systematically extracting disease-relevant
molecular signatures from the Forny et al. methylmalonic acidemia
dataset. Through careful batch effect correction and both supervised and
unsupervised analyses, I uncovered coordinated gene and protein
expression programs linked to key disease phenotypes. The result? Clean,
quantitative profiles — ready for network-level mechanistic
exploration. Most profiles were generated using Generalized Additive
Models (GAMs), each combining regression summaries (effect size,
statistic, p/q-value) with phenotypes — such as case vs. control, MMA
urine levels reflecting metabolic burden, or <code class="language-plaintext highlighter-rouge">OHCblPlus</code> as a proxy for
enzyme activity.</p>

<p>In this post, I’ll decode these statistical signals by mapping them onto
genome-scale biological networks with Napistu. The goal is to trace
disease signals from dysregulated genes and proteins upstream to their
common regulatory drivers. I’ll begin by mapping statistical results
onto genes within the pathway model, then transfer these signals to
nodes in a regulatory graph. Finally, using personalized PageRank with
empirical null models, I’ll identify subgraphs enriched for disease
signals — revealing the upstream regulatory mechanisms driving MMA
pathophysiology.</p>
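<p>The empirical-null idea can be sketched in a few lines: permute the
seed weights across nodes, rerun the propagation, and ask how often a
permuted seed scores a node at least as highly as the observed seed did.
Everything below (the propagation matrix, weights, and permutation
scheme) is illustrative; the actual analysis permutes only over measured
transcripts and proteins:</p>

```python
import numpy as np

def empirical_pvalues(observed, score_fn, weights, n_perm=1000, seed=0):
    """Per-node empirical p-values from permuted seed weights."""
    rng = np.random.default_rng(seed)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        exceed += score_fn(rng.permutation(weights)) >= observed
    # add-one correction keeps p-values strictly positive
    return (exceed + 1) / (n_perm + 1)

# toy one-step propagation on a made-up 4-node chain: each node mixes
# its own seed weight with its upstream neighbor's
M = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
])
weights = np.array([3.0, 0.1, 0.2, 0.1])  # node 0 carries the signal
observed = M @ weights
pvals = empirical_pvalues(observed, lambda w: M @ w, weights)
print(pvals)  # nodes nearest the seeded signal get the smallest p-values
```

<p>Comparing against permuted rather than parametric nulls accounts for
the network’s degree structure: hub-adjacent nodes accumulate score under
any seeding, so only enrichment beyond that baseline is flagged.</p>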

<p><img src="https://www.shackett.org/figure/napistu_ppr/napistu_blog_post.png" alt="Overview of the Network Biology with Napistu blog series where multimodal data is formatted as molecular profiles, overlaid on genome-scale graphs to find common regulators" style="width: 100%;" /></p>

<h2 id="integrating-genome-scale-networks-and-genome-wide-data">Integrating genome-scale networks and genome-wide data</h2>

<h3 id="environment-setup">Environment setup</h3>

<p>To reproduce this analysis:</p>

<ol>
  <li>
    <p>Follow the <a href="https://shackett.org/multiomic_profiles/#environment-setup">setup
instructions</a>
and run the first notebook in the series
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/creating_multimodal_profiles.qmd"><code class="language-plaintext highlighter-rouge">creating_multimodal_profiles.qmd</code></a>.
This will set up both the Python environment and the input data
required for this analysis.</p>
  </li>
  <li>
    <p>Download the
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/napistu_network_propagation.qmd"><code class="language-plaintext highlighter-rouge">napistu_network_propagation.qmd</code></a>
notebook</p>
  </li>
  <li>
    <p>Modify the following code block in your copy of the notebook to set
appropriate paths:</p>

    <p>a. <code class="language-plaintext highlighter-rouge">CACHE_DIR</code> should match the value used in
<code class="language-plaintext highlighter-rouge">creating_multimodal_profiles.qmd</code><br />
b. <code class="language-plaintext highlighter-rouge">INPUT_DATA_DIR</code> should be a suitable location for saving the
network representations (~4 GB in size)</p>
  </li>
  <li>
    <p>Run the notebook and render an HTML output by executing:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>quarto render napistu_network_propagation.qmd
</code></pre></div>    </div>
  </li>
</ol>

<p>First, I’ll load the necessary Python modules, configure file paths, set
global parameters, and define utility functions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">types</span> <span class="kn">import</span> <span class="n">SimpleNamespace</span>

<span class="kn">import</span> <span class="nn">mudata</span> <span class="k">as</span> <span class="n">md</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">display</span><span class="p">,</span> <span class="n">HTML</span>

<span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">utils</span> <span class="k">as</span> <span class="n">napistu_utils</span>
<span class="kn">from</span> <span class="nn">napistu.sbml_dfs_core</span> <span class="kn">import</span> <span class="n">SBML_dfs</span>
<span class="kn">from</span> <span class="nn">napistu.network.ng_core</span> <span class="kn">import</span> <span class="n">NapistuGraph</span>
<span class="kn">from</span> <span class="nn">napistu.source</span> <span class="kn">import</span> <span class="n">unnest_sources</span>
<span class="kn">from</span> <span class="nn">napistu.gcs</span> <span class="kn">import</span> <span class="n">downloads</span>
<span class="kn">from</span> <span class="nn">napistu.matching</span> <span class="kn">import</span> <span class="n">mount</span>
<span class="kn">from</span> <span class="nn">napistu.network</span> <span class="kn">import</span> <span class="n">net_propagation</span>
<span class="kn">from</span> <span class="nn">napistu.network</span> <span class="kn">import</span> <span class="n">data_handling</span>
<span class="kn">from</span> <span class="nn">napistu.network</span> <span class="kn">import</span> <span class="n">ng_utils</span>
<span class="kn">from</span> <span class="nn">napistu.scverse.loading</span> <span class="kn">import</span> <span class="n">prepare_anndata_results_df</span>
<span class="kn">from</span> <span class="nn">napistu.scverse.loading</span> <span class="kn">import</span> <span class="n">prepare_mudata_results_df</span>
<span class="kn">from</span> <span class="nn">napistu.constants</span> <span class="kn">import</span> <span class="n">ONTOLOGIES</span><span class="p">,</span> <span class="n">MINI_SBO_TO_NAME</span>

<span class="kn">from</span> <span class="nn">shackett_utils.statistics</span> <span class="kn">import</span> <span class="n">hypothesis_testing</span>
<span class="kn">from</span> <span class="nn">shackett_utils.statistics</span> <span class="kn">import</span> <span class="n">multi_model_fitting</span>
<span class="kn">from</span> <span class="nn">shackett_utils.blog.html_utils</span> <span class="kn">import</span> <span class="n">display_tabulator</span>
<span class="kn">from</span> <span class="nn">shackett_utils.utils.pd_utils</span> <span class="kn">import</span> <span class="n">format_numeric_columns</span>

<span class="c1"># setup logging
</span><span class="kn">import</span> <span class="nn">logging</span>
<span class="n">logging</span><span class="p">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>
<span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s">'matplotlib.font_manager'</span><span class="p">).</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="p">.</span><span class="n">WARNING</span><span class="p">)</span>
<span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s">'matplotlib.pyplot'</span><span class="p">).</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="p">.</span><span class="n">WARNING</span><span class="p">)</span> 

<span class="c1"># File paths and data organization
# All input data should be placed in the SUPPLEMENTAL_DATA_DIR
# Cached results and models will be stored in CACHE_DIR
</span>
<span class="c1"># paths
</span><span class="n">PROJECT_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">"~/napistu_mma_posts"</span><span class="p">)</span>
<span class="n">INPUT_DATA_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="s">"input"</span><span class="p">)</span>
<span class="n">CACHE_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="s">"cache"</span><span class="p">)</span>

<span class="c1"># inputs
# model to download from GCS and store in NAPISTU_DATA_DIR
</span><span class="n">NAPISTU_ASSET</span> <span class="o">=</span> <span class="s">"human_consensus"</span>
<span class="n">NAPISTU_ASSET_VERSION</span> <span class="o">=</span> <span class="s">"20250901"</span>
<span class="c1"># H5Mu file containing the optimal model from MOFA+ and regression summaries
</span><span class="n">OPTIMAL_MODEL_H5MU_OUTFILE</span> <span class="o">=</span> <span class="s">"mofa_optimal_model.h5mu"</span>

<span class="c1"># intermediate files
</span><span class="n">PPR_NULL_CACHE_OUTFILE</span> <span class="o">=</span> <span class="s">"ppr_null_cache.tsv"</span>

<span class="c1"># outputs
</span><span class="n">PPR_RESULTS_OUTFILE</span> <span class="o">=</span> <span class="s">"ppr_results.tsv"</span>
<span class="n">SBML_DFS_W_DATA_OUTFILE</span> <span class="o">=</span> <span class="s">"sbml_dfs_w_data.pkl"</span>
<span class="n">NAPISTU_GRAPH_W_DATA_OUTFILE</span> <span class="o">=</span> <span class="s">"napistu_graph_w_data.pkl"</span>

<span class="c1"># Paths to input/output files
</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">CACHE_DIR</span><span class="p">,</span> <span class="n">OPTIMAL_MODEL_H5MU_OUTFILE</span><span class="p">)</span>
<span class="n">PPR_NULL_TMP_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">CACHE_DIR</span><span class="p">,</span> <span class="n">PPR_NULL_CACHE_OUTFILE</span><span class="p">)</span>
<span class="n">PPR_RESULTS_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="n">PPR_RESULTS_OUTFILE</span><span class="p">)</span>
<span class="n">SBML_DFS_W_DATA_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="n">SBML_DFS_W_DATA_OUTFILE</span><span class="p">)</span>
<span class="n">NAPISTU_GRAPH_W_DATA_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="n">NAPISTU_GRAPH_W_DATA_OUTFILE</span><span class="p">)</span>

<span class="c1"># dataset metadata
</span><span class="n">FORNY_MODALITIES</span> <span class="o">=</span> <span class="n">SimpleNamespace</span><span class="p">(</span>
    <span class="n">TRANSCRIPTOMICS</span> <span class="o">=</span> <span class="s">"transcriptomics"</span><span class="p">,</span>
    <span class="n">PROTEOMICS</span> <span class="o">=</span> <span class="s">"proteomics"</span>
<span class="p">)</span>

<span class="n">MODALITIES</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">FORNY_MODALITIES</span><span class="p">.</span><span class="n">__dict__</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>

<span class="c1"># Napistu controlled vocabulary
</span><span class="n">FORNY_ONTOLOGIES</span> <span class="o">=</span> <span class="n">SimpleNamespace</span><span class="p">(</span>
    <span class="n">ENSEMBL_GENE</span> <span class="o">=</span> <span class="n">ONTOLOGIES</span><span class="p">.</span><span class="n">ENSEMBL_GENE</span><span class="p">,</span>
    <span class="n">UNIPROT</span> <span class="o">=</span> <span class="n">ONTOLOGIES</span><span class="p">.</span><span class="n">UNIPROT</span>
<span class="p">)</span>

<span class="n">FORNY_DEFS</span> <span class="o">=</span> <span class="n">SimpleNamespace</span><span class="p">(</span>
    <span class="c1"># varm table names set in part 1
</span>    <span class="n">LFS</span> <span class="o">=</span> <span class="s">"LFs"</span><span class="p">,</span>
    <span class="c1"># table names used to add data sources to `sbml_dfs`
</span>    <span class="n">MOFA_LFS</span> <span class="o">=</span> <span class="s">"mofa_lfs"</span><span class="p">,</span>
    <span class="n">VAR_LEVEL_RESULTS</span> <span class="o">=</span> <span class="s">"var_level_results"</span><span class="p">,</span>
    <span class="c1"># template for is_X variables will be used to restrict vertex permutation
</span>    <span class="c1"># to measured proteins/transcripts
</span>    <span class="n">INDICATOR_STR</span> <span class="o">=</span> <span class="s">'is_{modality}'</span><span class="p">,</span>
    <span class="n">MODALITY_VAR_LEVEL_RESULTS_STR</span> <span class="o">=</span> <span class="s">"{modality}_var_level_results"</span> 
<span class="p">)</span>

<span class="n">MUDATA_ONTOLOGIES</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1"># these dicts indicate the ontology that we want to match against for each modality
</span>    <span class="c1"># and indicate that this ontology's identifiers are present in the .var table's index
</span>    <span class="n">FORNY_MODALITIES</span><span class="p">.</span><span class="n">TRANSCRIPTOMICS</span> <span class="p">:</span>
        <span class="p">{</span>
            <span class="s">"ontologies"</span> <span class="p">:</span> <span class="p">[</span><span class="n">FORNY_ONTOLOGIES</span><span class="p">.</span><span class="n">ENSEMBL_GENE</span><span class="p">],</span>
            <span class="s">"index_which_ontology"</span> <span class="p">:</span> <span class="n">FORNY_ONTOLOGIES</span><span class="p">.</span><span class="n">ENSEMBL_GENE</span>
        <span class="p">},</span>
    <span class="n">FORNY_MODALITIES</span><span class="p">.</span><span class="n">PROTEOMICS</span> <span class="p">:</span>
        <span class="p">{</span>
            <span class="s">"ontologies"</span> <span class="p">:</span> <span class="p">[</span><span class="n">FORNY_ONTOLOGIES</span><span class="p">.</span><span class="n">UNIPROT</span><span class="p">],</span>
            <span class="s">"index_which_ontology"</span> <span class="p">:</span> <span class="n">FORNY_ONTOLOGIES</span><span class="p">.</span><span class="n">UNIPROT</span>
        <span class="p">}</span>
<span class="p">}</span>

<span class="c1"># attributes to use for network propagation
</span><span class="n">LFS_OF_INTEREST</span> <span class="o">=</span> <span class="p">[</span><span class="s">"LF1"</span><span class="p">,</span> <span class="s">"LF2"</span><span class="p">,</span> <span class="s">"LF3"</span><span class="p">,</span> <span class="s">"LF4"</span><span class="p">,</span> <span class="s">"LF5"</span><span class="p">]</span>

<span class="c1"># regression terms to add from var table
</span><span class="n">PPR_LINEAR_PHENOTYPES</span> <span class="o">=</span> <span class="p">{</span><span class="s">"MMA_urine"</span><span class="p">,</span> <span class="s">"OHCblPlus"</span><span class="p">,</span> <span class="s">"case"</span><span class="p">,</span> <span class="s">"responsive_to_acute_treatment"</span><span class="p">}</span>
<span class="n">PPR_SMOOTH_PHENOTYPES</span> <span class="o">=</span> <span class="p">{</span><span class="s">"date_freezing"</span><span class="p">,</span> <span class="s">"proteomics_runorder"</span><span class="p">}</span>

<span class="n">VAR_VARS</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">phenotype</span> <span class="ow">in</span> <span class="n">PPR_LINEAR_PHENOTYPES</span><span class="p">:</span>
    <span class="n">VAR_VARS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"est_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">VAR_VARS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"stat_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="c1"># using -log10p since normal p- and q-values will underflow
</span>    <span class="n">VAR_VARS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"log10p_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">VAR_VARS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"q_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">phenotype</span> <span class="ow">in</span> <span class="n">PPR_SMOOTH_PHENOTYPES</span><span class="p">:</span>
    <span class="n">VAR_VARS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"q_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="n">STAT_PREFIXES</span> <span class="o">=</span> <span class="p">[</span><span class="s">"est"</span><span class="p">,</span> <span class="s">"log10p"</span><span class="p">,</span> <span class="s">"q"</span><span class="p">,</span> <span class="s">"stat"</span><span class="p">]</span>
<span class="n">VAR_METADATA</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([</span>
  <span class="o">*</span><span class="p">[{</span> <span class="s">"phenotype"</span> <span class="p">:</span> <span class="s">"latent factors"</span><span class="p">,</span> <span class="s">"summary"</span> <span class="p">:</span> <span class="n">x</span><span class="p">,</span> <span class="s">"variable"</span> <span class="p">:</span> <span class="n">x</span><span class="p">}</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">LFS_OF_INTEREST</span><span class="p">],</span>
  <span class="o">*</span><span class="p">[{</span> <span class="s">"phenotype"</span> <span class="p">:</span> <span class="n">x</span><span class="p">,</span> <span class="s">"summary"</span> <span class="p">:</span> <span class="n">y</span><span class="p">,</span> <span class="s">"variable"</span> <span class="p">:</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">y</span><span class="si">}</span><span class="s">_</span><span class="si">{</span><span class="n">x</span><span class="si">}</span><span class="s">"</span><span class="p">}</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="p">(</span><span class="n">PPR_LINEAR_PHENOTYPES</span> <span class="o">|</span> <span class="n">PPR_SMOOTH_PHENOTYPES</span><span class="p">)</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">STAT_PREFIXES</span> <span class="k">if</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">y</span><span class="si">}</span><span class="s">_</span><span class="si">{</span><span class="n">x</span><span class="si">}</span><span class="s">"</span> <span class="ow">in</span> <span class="n">VAR_VARS</span><span class="p">]</span>
<span class="p">])</span>
<span class="n">VAR_METADATA</span><span class="p">[</span><span class="s">"summary"</span><span class="p">]</span> <span class="o">=</span> <span class="n">VAR_METADATA</span><span class="p">[</span><span class="s">"summary"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'_'</span><span class="p">,</span> <span class="s">' '</span><span class="p">)</span>

<span class="c1"># defining variables to add as vertex attributes and how to transform them so
# they are appropriate for personalized pagerank reset probability
</span><span class="n">ATTRIBUTES_TO_GRAPH_SPEC</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="s">"attribute_names"</span><span class="p">:</span> <span class="s">"LF"</span><span class="p">,</span>
        <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">MOFA_LFS</span><span class="p">,</span>
        <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"square"</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"attribute_names"</span><span class="p">:</span> <span class="s">"^est_"</span><span class="p">,</span>
        <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">VAR_LEVEL_RESULTS</span><span class="p">,</span>
        <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"square"</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"attribute_names"</span><span class="p">:</span> <span class="s">"^stat_"</span><span class="p">,</span>
        <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">VAR_LEVEL_RESULTS</span><span class="p">,</span>
        <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"abs"</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"attribute_names"</span><span class="p">:</span> <span class="s">"^log10p_"</span><span class="p">,</span>
        <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">VAR_LEVEL_RESULTS</span><span class="p">,</span>
        <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"negate"</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="s">"attribute_names"</span><span class="p">:</span> <span class="s">"^q_"</span><span class="p">,</span>
        <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">VAR_LEVEL_RESULTS</span><span class="p">,</span>
        <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"underflow_guarded_nlog10"</span>
    <span class="p">},</span>
<span class="p">]</span>

<span class="n">ATTRIBUTES_TO_GRAPH_SPEC</span> <span class="o">=</span> <span class="n">ATTRIBUTES_TO_GRAPH_SPEC</span> <span class="o">+</span> <span class="p">[{</span>
    <span class="s">"attribute_names"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">INDICATOR_STR</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">modality</span> <span class="o">=</span> <span class="n">m</span><span class="p">),</span>
    <span class="s">"table_name"</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">MODALITY_VAR_LEVEL_RESULTS_STR</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">modality</span> <span class="o">=</span> <span class="n">m</span><span class="p">),</span>
    <span class="s">"transformation"</span><span class="p">:</span> <span class="s">"identity"</span>
<span class="p">}</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">MODALITIES</span><span class="p">]</span>

<span class="c1"># masks pairing each modality with its indicator vertex attribute, used during vertex permutation
</span><span class="n">REGEXES_TO_MASKS</span> <span class="o">=</span> <span class="p">{</span> <span class="n">x</span><span class="p">:</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">INDICATOR_STR</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">modality</span> <span class="o">=</span> <span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">MODALITIES</span> <span class="p">}</span>

<span class="c1"># utility functions
</span>
<span class="k">def</span> <span class="nf">underflow_guarded_nlog10</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="mf">1e-12</span><span class="p">:</span>
        <span class="k">return</span> <span class="mf">12.0</span> <span class="c1"># underflow guard: -log10(1e-12)
</span>    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="n">log10</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>

<span class="n">CUSTOM_TRANSFORMATIONS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="c1"># take the absolute value
</span>    <span class="s">"abs"</span> <span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">abs</span><span class="p">(</span><span class="n">x</span><span class="p">),</span>
    <span class="s">"negate"</span> <span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="o">-</span><span class="n">x</span><span class="p">,</span>
    <span class="c1"># -log10[pvalue]
</span>    <span class="s">"underflow_guarded_nlog10"</span> <span class="p">:</span> <span class="n">underflow_guarded_nlog10</span><span class="p">,</span>
    <span class="s">"square"</span> <span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span>
<span class="p">}</span>

<span class="k">def</span> <span class="nf">floor_pvalue_by_resolution</span><span class="p">(</span><span class="n">p_value</span><span class="p">,</span> <span class="n">n_samples</span><span class="p">):</span>
    <span class="s">"""
    Floor p-values by resolution: maps p in [0, 1] onto
    [1/(n_samples + 1), 1] so that downstream -log10 values stay finite.
    """</span>
    
    <span class="k">return</span> <span class="p">(</span><span class="n">p_value</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">/</span> <span class="n">n_samples</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">n_samples</span> <span class="o">/</span> <span class="p">(</span><span class="n">n_samples</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">create_stacked_barplot_seaborn</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
    <span class="s">"""
    Alternative version using seaborn styling
    """</span>
    <span class="c1"># Set seaborn style
</span>    <span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span><span class="s">"whitegrid"</span><span class="p">)</span>
    
    <span class="c1"># Group by variable and sum counts across modalities
</span>    <span class="n">total_counts</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'variable'</span><span class="p">)[</span><span class="s">'count'</span><span class="p">].</span><span class="nb">sum</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    
    <span class="c1"># Create pivot table
</span>    <span class="n">pivot_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">'variable'</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">'modality'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'count'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">pivot_df</span> <span class="o">=</span> <span class="n">pivot_df</span><span class="p">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">total_counts</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
    
    <span class="c1"># Create the plot
</span>    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
    
    <span class="c1"># Use seaborn color palette
</span>    <span class="n">colors</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"husl"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">pivot_df</span><span class="p">.</span><span class="n">columns</span><span class="p">))</span>
    
    <span class="c1"># Plot stacked bars
</span>    <span class="n">pivot_df</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'bar'</span><span class="p">,</span> <span class="n">stacked</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
    
    <span class="c1"># Customize
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Stacked Barplot by Attributes and Modality'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Attributes (ordered by total count)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Count'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">'Modality'</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.05</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">loc</span><span class="o">=</span><span class="s">'upper left'</span><span class="p">)</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'right'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span>

<span class="k">def</span> <span class="nf">plot_ppr_enrichment_histograms</span><span class="p">(</span><span class="n">fdr_controlled_results</span><span class="p">):</span>
    <span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="n">fdr_controlled_results</span><span class="p">[</span><span class="o">~</span><span class="n">fdr_controlled_results</span><span class="p">[</span><span class="s">"is_enriched"</span><span class="p">]][</span><span class="s">"p_value"</span><span class="p">].</span><span class="n">hist</span><span class="p">(</span>
        <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Depleted (False)"</span><span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"P-value"</span><span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Count"</span><span class="p">)</span>

    <span class="n">fdr_controlled_results</span><span class="p">[</span><span class="n">fdr_controlled_results</span><span class="p">[</span><span class="s">"is_enriched"</span><span class="p">]][</span><span class="s">"p_value"</span><span class="p">].</span><span class="n">hist</span><span class="p">(</span>
        <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Enriched (True)"</span><span class="p">)</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"P-value"</span><span class="p">)</span>

    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">reorder_by_rank_sum</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
    <span class="s">"""Reorder rows by sum of ranks (lower sum = better overall rank)"""</span>
    <span class="n">df_num</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'.'</span><span class="p">,</span> <span class="n">pd</span><span class="p">.</span><span class="n">NA</span><span class="p">).</span><span class="nb">apply</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">to_numeric</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">'coerce'</span><span class="p">)</span>
    <span class="n">max_val</span> <span class="o">=</span> <span class="n">df_num</span><span class="p">.</span><span class="nb">max</span><span class="p">().</span><span class="nb">max</span><span class="p">()</span>
    <span class="n">df_filled</span> <span class="o">=</span> <span class="n">df_num</span><span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">max_val</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">rank_sums</span> <span class="o">=</span> <span class="n">df_filled</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">rank_sums</span><span class="p">.</span><span class="n">sort_values</span><span class="p">().</span><span class="n">index</span><span class="p">]</span>

<span class="c1"># constants affecting behavior
</span><span class="n">N_NULL_SAMPLES</span> <span class="o">=</span> <span class="mi">500</span>
<span class="n">OVERWRITE</span> <span class="o">=</span> <span class="bp">False</span>
</code></pre></div></div>
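<p>Since personalized PageRank reset probabilities must be non-negative, each transformation above maps a signed or log-scaled summary onto a non-negative scale. A minimal sketch of this idea (using a hypothetical <code>to_reset_weights</code> helper, not the Napistu API):</p>

```python
import numpy as np

# Hypothetical helper (not the Napistu API): map raw attribute values onto
# non-negative weights suitable for a personalized PageRank reset distribution.
TRANSFORMATIONS = {
    "square": lambda x: x ** 2,  # signed estimates -> magnitudes
    "abs": abs,                  # signed test statistics -> magnitudes
    # clamp q-values before the log so q == 0 cannot produce inf
    "underflow_guarded_nlog10": lambda x: -np.log10(max(x, 1e-12)),
}

def to_reset_weights(values, transformation):
    """Transform values, then renormalize so they sum to 1."""
    transformed = np.array([TRANSFORMATIONS[transformation](v) for v in values])
    return transformed / transformed.sum()

# smaller q-values receive larger reset probability
weights = to_reset_weights([0.04, 1e-20, 0.5], "underflow_guarded_nlog10")
```

<p>Renormalizing to a proper probability distribution is one common convention; the normalization actually used by the pipeline may differ.</p>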

<h3 id="loading-mma-molecular-profiles">Loading MMA molecular profiles</h3>

<p>Next, I’ll load the results generated in the <a href="https://shackett.org/multiomic_profiles/">previous
post</a>. These are stored in a
<code class="language-plaintext highlighter-rouge">MuData</code> object saved as an <code class="language-plaintext highlighter-rouge">.h5mu</code> file. (You’ll see the contents of
this object later when I discuss adding attributes to the graph.) The
only modification I’ll make is adding indicator variables —
<code class="language-plaintext highlighter-rouge">is_transcriptomics</code> and <code class="language-plaintext highlighter-rouge">is_proteomics</code> — to each modality to easily
track measured transcripts and proteins in downstream analyses.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># let's load the Forny results so we can try adding a few different types of tables to the sbml_dfs
</span><span class="n">mdata</span> <span class="o">=</span> <span class="n">md</span><span class="p">.</span><span class="n">read_h5mu</span><span class="p">(</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span><span class="p">)</span>

<span class="c1"># create an indicator marking which modalities are present in the mdata;
# it will propagate to vertices in the graph, where it is useful for generating
# a mask when constructing vertices' null distributions
</span><span class="n">ADATA_LEVEL_VARS</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="n">MODALITIES</span><span class="p">:</span>
    <span class="n">indicator_var</span> <span class="o">=</span> <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">INDICATOR_STR</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">modality</span><span class="o">=</span><span class="n">modality</span><span class="p">)</span>
    <span class="c1"># add to var table
</span>    <span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">].</span><span class="n">var</span><span class="p">[</span><span class="n">indicator_var</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="c1"># indicate that this should be added to the sbml_dfs later
</span>    <span class="n">ADATA_LEVEL_VARS</span><span class="p">[</span><span class="n">modality</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">indicator_var</span><span class="p">]</span>
</code></pre></div></div>
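<p>Downstream, these indicators act as masks: when building a vertex’s null distribution, attribute values are permuted only among vertices that were actually measured in that modality. A minimal sketch of mask-restricted permutation (hypothetical helper, not the Napistu implementation):</p>

```python
import numpy as np

def permute_within_mask(values, mask, rng):
    """Shuffle values only among positions where mask is True; leave the rest untouched."""
    values = np.asarray(values, dtype=float)
    out = values.copy()
    idx = np.flatnonzero(mask)
    out[idx] = rng.permutation(values[idx])
    return out

# e.g., only the first three vertices carry a proteomics measurement
values = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
mask = np.array([True, True, True, False, False])
permuted = permute_within_mask(values, mask, np.random.default_rng(0))
```

<p>Restricting the shuffle this way keeps unmeasured vertices out of the null, so a vertex is only ever compared against peers that could plausibly have carried its value.</p>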

<h3 id="loading-napistu-data">Loading Napistu data</h3>

<p>To simplify access, I’ve uploaded a lightweight test pathway (a merged
set of three metabolic pathways) and the full human consensus pathway to
Google Cloud Storage (GCS). These pathway representations center on
two key objects:</p>

<ul>
  <li><a href="https://github.com/napistu/napistu/wiki/SBML-DFs"><code class="language-plaintext highlighter-rouge">SBML_dfs</code></a>: An
in-memory relational database organizing molecular species (genes,
metabolites, complexes, drugs) and their relationships (reactions,
interactions).</li>
  <li><a href="https://github.com/napistu/napistu/wiki/Napistu-Graphs"><code class="language-plaintext highlighter-rouge">NapistuGraph</code></a>:
A directed graph representation of the same network, translating
molecular species and reactions into a graph structure for
downstream analysis.</li>
</ul>
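<p>To make the relational-to-graph translation concrete, here’s a toy sketch (illustrative only — the real <code>SBML_dfs</code> schema and <code>NapistuGraph</code> construction are far richer): substrates and regulators point into a reaction vertex, while products point out of it, yielding a directed bipartite species–reaction graph.</p>

```python
import pandas as pd

# Toy reaction-species table (hypothetical; much simpler than SBML_dfs)
reaction_species = pd.DataFrame({
    "r_id": ["hexokinase"] * 3,
    "s_id": ["glc", "g6p", "hk"],
    "role": ["substrate", "product", "catalyst"],
})

# Directed bipartite edges: substrates/catalysts -> reaction -> products
edges = []
for _, row in reaction_species.iterrows():
    if row["role"] == "product":
        edges.append((row["r_id"], row["s_id"]))
    else:
        edges.append((row["s_id"], row["r_id"]))
# the edge list can now seed any directed-graph library
```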

<p>The human consensus <code class="language-plaintext highlighter-rouge">SBML_dfs</code> and <code class="language-plaintext highlighter-rouge">NapistuGraph</code> I will use combine
these sources:</p>

<ul>
  <li><em>Reactome</em>: human-centric gold-standard pathways of cellular
physiology and signaling</li>
  <li><em>BiGG</em>: the Recon3D genome-scale metabolic model</li>
  <li><em>TRRUST</em>: curated transcription factor–target interactions</li>
  <li><em>STRING</em>: undirected physical and functional interactions</li>
  <li><em>Dogma</em>: a model of cognate relationships between genes,
transcripts, and proteins with their systematic identifiers</li>
</ul>

<p>I built this consensus using the Napistu CLI (see the <a href="https://github.com/napistu/napistu/blob/main/dev/create_human_consensus.qmd">build
pipeline</a>),
which supports constructing and refining genome-scale pathway models for
most model organisms. Below is an overview of how the human consensus
was assembled:</p>

<p><img src="https://www.shackett.org/figure/napistu_ppr/Napistu_build_process.png" alt="Napistu human consensus pathway model build pipeline showing
integration of multiple biological
databases" /></p>

<p>Next, I’ll download and load the human consensus <code class="language-plaintext highlighter-rouge">SBML_dfs</code> from my
public GCS bucket (please avoid frequent downloads 🙂), along with the
corresponding <code class="language-plaintext highlighter-rouge">NapistuGraph</code> and a lookup table of systematic
identifiers, by running:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># this will download the sbml_dfs, napistu_graph, and species_identifiers from a public GCS bucket
# or if they already exist in the INPUT_DATA_DIR, it will just set the path to the existing asset
</span><span class="n">sbml_dfs_path</span> <span class="o">=</span> <span class="n">downloads</span><span class="p">.</span><span class="n">load_public_napistu_asset</span><span class="p">(</span>
    <span class="n">asset</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET</span><span class="p">,</span>
    <span class="n">data_dir</span> <span class="o">=</span> <span class="n">INPUT_DATA_DIR</span><span class="p">,</span>
    <span class="n">subasset</span> <span class="o">=</span> <span class="s">"sbml_dfs"</span><span class="p">,</span>
    <span class="n">version</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET_VERSION</span>
<span class="p">)</span>

<span class="n">napistu_graph_path</span> <span class="o">=</span> <span class="n">downloads</span><span class="p">.</span><span class="n">load_public_napistu_asset</span><span class="p">(</span>
    <span class="n">asset</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET</span><span class="p">,</span>
    <span class="n">data_dir</span> <span class="o">=</span> <span class="n">INPUT_DATA_DIR</span><span class="p">,</span>
    <span class="n">subasset</span> <span class="o">=</span> <span class="s">"napistu_graph"</span><span class="p">,</span>
    <span class="n">version</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET_VERSION</span>
<span class="p">)</span>

<span class="n">species_identifiers_path</span> <span class="o">=</span> <span class="n">downloads</span><span class="p">.</span><span class="n">load_public_napistu_asset</span><span class="p">(</span>
    <span class="n">asset</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET</span><span class="p">,</span>
    <span class="n">data_dir</span> <span class="o">=</span> <span class="n">INPUT_DATA_DIR</span><span class="p">,</span>
    <span class="n">subasset</span> <span class="o">=</span> <span class="s">"species_identifiers"</span><span class="p">,</span>
    <span class="n">version</span> <span class="o">=</span> <span class="n">NAPISTU_ASSET_VERSION</span>
<span class="p">)</span>

<span class="c1"># ~2 min load
</span><span class="n">sbml_dfs</span> <span class="o">=</span> <span class="n">SBML_dfs</span><span class="p">.</span><span class="n">from_pickle</span><span class="p">(</span><span class="n">sbml_dfs_path</span><span class="p">)</span>

<span class="n">napistu_graph</span> <span class="o">=</span> <span class="n">NapistuGraph</span><span class="p">.</span><span class="n">from_pickle</span><span class="p">(</span><span class="n">napistu_graph_path</span><span class="p">)</span>

<span class="n">species_identifiers</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">species_identifiers_path</span><span class="p">,</span> <span class="n">delimiter</span> <span class="o">=</span> <span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>

<span class="n">ng_utils</span><span class="p">.</span><span class="n">validate_assets</span><span class="p">(</span>
    <span class="n">sbml_dfs</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">,</span>
    <span class="n">napistu_graph</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">,</span>
    <span class="n">identifiers_df</span> <span class="o">=</span> <span class="n">species_identifiers</span>
<span class="p">)</span>
</code></pre></div></div>

<p>With the core Napistu objects loaded, I’ll briefly summarize their
contents — counting molecular species from each data source and
outlining the types of regulatory relationships captured by graph edges.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># generate some simple model summaries
</span><span class="n">species_sources</span> <span class="o">=</span> <span class="n">unnest_sources</span><span class="p">(</span><span class="n">sbml_dfs</span><span class="p">.</span><span class="n">species</span><span class="p">)</span>

<span class="n">species_counts_by_source</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">species_sources</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">species_sources</span><span class="p">[</span><span class="s">"pathway_id"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"napistu_data"</span><span class="p">)]</span>
    <span class="p">.</span><span class="n">value_counts</span><span class="p">(</span><span class="s">"pathway_id"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span>
        <span class="n">pathway_id</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'pathway_id'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span>
        <span class="k">lambda</span> <span class="n">path</span><span class="p">:</span> <span class="n">Path</span><span class="p">(</span><span class="n">path</span><span class="p">).</span><span class="n">stem</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'uncompartmentalized_'</span><span class="p">,</span> <span class="s">''</span><span class="p">).</span><span class="n">replace</span><span class="p">(</span><span class="s">'hpa_filtered_'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="p">)</span>
    <span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">"count"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"pathway_id"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">total</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">species</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span><span class="n">species_counts_by_source</span><span class="p">,</span> <span class="n">caption</span><span class="o">=</span><span class="s">"Counts of molecular species from each source"</span><span class="p">)</span>

<span class="n">participant_counts</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">get_edge_dataframe</span><span class="p">().</span><span class="n">value_counts</span><span class="p">(</span><span class="s">"sbo_term"</span><span class="p">).</span><span class="n">rename</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">MINI_SBO_TO_NAME</span><span class="p">).</span><span class="n">to_frame</span><span class="p">().</span><span class="n">T</span>

<span class="n">display_tabulator</span><span class="p">(</span><span class="n">participant_counts</span><span class="p">,</span> <span class="n">caption</span><span class="o">=</span><span class="s">"Counts of reaction species by role"</span><span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Counts of molecular species from each source
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: &quot;count&quot;, &quot;reactome&quot;: 23046, &quot;string&quot;: 19385, &quot;dogma_sbml_dfs&quot;: 19362, &quot;bigg&quot;: 4476, &quot;trrust&quot;: 2862, &quot;total&quot;: 38776}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;reactome&quot;, &quot;field&quot;: &quot;reactome&quot;}, {&quot;title&quot;: &quot;string&quot;, &quot;field&quot;: &quot;string&quot;}, {&quot;title&quot;: &quot;dogma_sbml_dfs&quot;, &quot;field&quot;: &quot;dogma_sbml_dfs&quot;}, {&quot;title&quot;: &quot;bigg&quot;, &quot;field&quot;: &quot;bigg&quot;}, {&quot;title&quot;: &quot;trrust&quot;, &quot;field&quot;: &quot;trrust&quot;}, {&quot;title&quot;: &quot;total&quot;, &quot;field&quot;: &quot;total&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Counts of reaction species by role
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: &quot;count&quot;, &quot;interactor&quot;: 7801948, &quot;product&quot;: 34070, &quot;reactant&quot;: 31086, &quot;stimulator&quot;: 13326, &quot;catalyst&quot;: 6691, &quot;modifier&quot;: 3722, &quot;inhibitor&quot;: 2914}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;interactor&quot;, &quot;field&quot;: &quot;interactor&quot;}, {&quot;title&quot;: &quot;product&quot;, &quot;field&quot;: &quot;product&quot;}, {&quot;title&quot;: &quot;reactant&quot;, &quot;field&quot;: &quot;reactant&quot;}, {&quot;title&quot;: &quot;stimulator&quot;, &quot;field&quot;: &quot;stimulator&quot;}, {&quot;title&quot;: &quot;catalyst&quot;, &quot;field&quot;: &quot;catalyst&quot;}, {&quot;title&quot;: &quot;modifier&quot;, &quot;field&quot;: &quot;modifier&quot;}, {&quot;title&quot;: &quot;inhibitor&quot;, &quot;field&quot;: &quot;inhibitor&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>These model statistics highlight the scale and scope of the network. The
consensus model integrates over 38,000 molecular species, including
~19,000 proteins (from gene-centric sources like <em>STRING</em> and <em>Dogma</em>)
and ~19,000 metabolites and complexes (primarily from <em>Reactome</em> and
the <em>BiGG</em> <em>Recon3D</em> model). The graph contains nearly 4 million
molecular interactions, the majority from <em>STRING</em>’s physical and
functional associations. Notably, around 92,000 edges carry deeper
mechanistic annotations, such as transcription factor → target or enzyme
→ substrate.</p>

<p>From a bird’s-eye view:</p>

<p><img src="https://www.shackett.org/figure/napistu_ppr/Consensus_graph.png" alt="Genome-scale network diagram for the human consensus model" width="700" /></p>

<p>This genome-scale view shows the overall network structure, but I can
zoom into any region to examine molecular interactions at high
resolution. For example, I can explore the molecular neighborhood of
<strong>MMUT</strong> (labeled as “MUT” in the network) to identify its upstream
regulators and downstream targets. This local view reveals how <em>MMUT</em>
connects to both regulatory genes (such as <em>AKT</em> and <em>IGF1</em>) and
metabolites (like its enzymatic product methylmalonyl-CoA, shown as
L-MM-CoA), illustrating the integration of gene regulatory and metabolic
networks.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>Vertex names may differ from the
nomenclature used by individual data sources; however, all merges are
based on reliable database identifiers, ensuring accurate molecular
relationships. Annotations are organized in <code class="language-plaintext highlighter-rouge">Identifiers</code> objects, which
track the identifiers across multiple ontologies related to a given
entity. These annotations also incorporate <a href="https://sbml.org/"><code class="language-plaintext highlighter-rouge">SBML</code></a>’s
biological qualifiers, which define relationships such as <code class="language-plaintext highlighter-rouge">BQB_IS</code>
(identity), <code class="language-plaintext highlighter-rouge">BQB_HAS_PART</code> (component of a complex), and
<code class="language-plaintext highlighter-rouge">BQB_IS_DESCRIBED_BY</code> (reference to supporting literature).</p>

<p>Napistu leverages these annotations both to merge data sources when
building the consensus model and to seamlessly integrate
high-dimensional datasets with its pathway representations.</p>

  </div>
</div>
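<p>To make the annotation model concrete, here is a toy sketch of what one
entity’s identifiers look like in flattened form — each row ties the entity to
one ontology term with a biological qualifier. All values (including the
PubMed placeholder) are illustrative, not Napistu’s actual schema.</p>

```python
import pandas as pd

# Toy illustration of a flattened Identifiers entry: each row links an
# entity to one ontology term with an SBML biological qualifier.
# All values are illustrative; the pubmed ID is a placeholder.
identifier_rows = pd.DataFrame({
    "ontology": ["ensembl_gene", "uniprot", "pubmed"],
    "identifier": ["ENSG00000146085", "P22033", "12345678"],
    "bqb": ["BQB_IS", "BQB_IS", "BQB_IS_DESCRIBED_BY"],
})

# identity terms (usable for merging) vs. supporting literature
identity_terms = identifier_rows.loc[identifier_rows["bqb"] == "BQB_IS"]
```

Filtering on the qualifier is what makes merges reliable: only `BQB_IS`-style
terms assert identity, so literature references never drive a merge.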

<p><img src="https://www.shackett.org/figure/napistu_ppr/ENSG00000146085_close_up_neighborhood.png" alt="Network visualization of the local molecular neighborhood of MMUT" width="700" /></p>


<div class="content-section bio-section">
  <div class="section-content">
    <p>I created this visualization —
and the subgraph figure you’ll see later — using results from this
post. If you’re interested in generating visualizations like this, check
out
<a href="https://github.com/napistu/napistu-scrapyard/blob/main/applications/forny_2023/network_vis.qmd"><code class="language-plaintext highlighter-rouge">network_vis.qmd</code></a>.
Though Napistu is primarily a Python framework, its companion R package,
<a href="https://github.com/napistu/napistu-r"><strong>napistu.r</strong></a>, is purpose-built
for visualizing Napistu networks. It leverages <em>ggraph</em> for
grammar-of-graphics-based visualizations and uses <em>reticulate</em> to bridge
R and Python, enabling direct access to Napistu’s data structures and
functions.</p>

  </div>
</div>

<h2 id="adding-data-to-networks">Adding data to networks</h2>

<p>To use an ’omics dataset in Napistu, I will:</p>

<ol>
  <li><em>Mount the dataset onto the pathway <code class="language-plaintext highlighter-rouge">SBML_dfs</code></em>, which involves:
    <ol type="a">
      <li>Matching systematic identifiers between the dataset and the pathway to link ’omics features with Napistu species.</li>
      <li>Resolving many-to-one mappings (i.e., when multiple features map to the same molecular species).</li>
      <li>Constructing a table indexed by unique species IDs, with dataset variables as columns.</li>
      <li>Adding this table to the <code class="language-plaintext highlighter-rouge">species_data</code> attribute of the <code class="language-plaintext highlighter-rouge">SBML_dfs</code>. Multiple tables and/or datasets can be stored in <code class="language-plaintext highlighter-rouge">species_data</code>.</li>
    </ol>
  </li>
  <li><em>Pass variables to graph vertices</em>:
    <ol type="a">
      <li>Use <code class="language-plaintext highlighter-rouge">net_create._add_graph_species_attribute</code> to pass variables from one or more <em>species_data</em> tables to a <code class="language-plaintext highlighter-rouge">NapistuGraph</code>’s vertices.</li>
      <li>Optionally, transform variables at this stage (e.g., to make them non-negative for personalized PageRank).</li>
    </ol>
  </li>
  <li><em>Use these vertex attributes for downstream analyses</em>, such as
setting the <code class="language-plaintext highlighter-rouge">reset_proportional_to</code> parameter in personalized
PageRank.</li>
</ol>
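<p>Steps 1b and 1c above can be pictured in plain pandas: when several
features match the same species, their values are resolved (here by simple
averaging — an illustrative rule, not Napistu’s actual resolution logic) and
the result is indexed by unique species IDs.</p>

```python
import pandas as pd

# Steps 1b-1c in miniature: two features match the same species, so their
# values must be resolved before the table can be indexed by species ID.
# All IDs and the averaging rule are illustrative.
matches = pd.DataFrame({
    "s_id": ["S00001", "S00001", "S00002"],
    "feature_id": ["ENSG_A1", "ENSG_A2", "ENSG_B"],
    "score": [0.8, 0.4, 1.2],
})

species_table = matches.groupby("s_id")[["score"]].mean()

# species_table now has one row per distinct species ID and could be
# stored under a key in a species_data-style dict (step 1d)
species_data = {"example_scores": species_table}
```

The resulting table has exactly one row per species, which is the invariant
the <code class="language-plaintext highlighter-rouge">species_data</code> tables maintain.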

<h3 id="data-rightarrow-sbml_dfs">Data $\Rightarrow$ <code class="language-plaintext highlighter-rouge">SBML_dfs</code></h3>

<p>To identify which results to explore further with network methods, I’ll
first review the <code class="language-plaintext highlighter-rouge">MuData</code> object from the previous analysis. This
summary highlights both <code class="language-plaintext highlighter-rouge">MuData</code>-level attributes — such as
Multi-Omics Factor Analysis (<em>MOFA</em>) results — and modality-level
<code class="language-plaintext highlighter-rouge">AnnData</code> attributes, including:</p>

<ul>
  <li><em>obs</em>: sample-level metadata</li>
  <li><em>var</em>: feature-level metadata</li>
  <li><em>X</em> and <em>layers</em>: measurements</li>
  <li><em>obsm</em>, <em>varm</em>: tensors defined over samples or features</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mdata</span>
</code></pre></div></div>

<pre>MuData object with n_obs × n_vars = 221 × 13922
  uns:  &#x27;mofa&#x27;
  obsm: &#x27;X_mofa&#x27;
  varm: &#x27;LFs&#x27;
  2 modalities
    transcriptomics:    221 x 9134
      obs:  &#x27;case&#x27;, &#x27;gender&#x27;, &#x27;consanguinity&#x27;, &#x27;mut_category&#x27;, &#x27;wgs_zygosity&#x27;, &#x27;acidosis&#x27;, &#x27;metabolic_acidosis&#x27;, &#x27;metabolic_ketoacidosis&#x27;, &#x27;ketosis&#x27;, &#x27;hyperammonemia&#x27;, &#x27;abnormal_muscle_tone&#x27;, &#x27;musc_hypotonia&#x27;, &#x27;musc_hypertonia&#x27;, &#x27;fct_respiratory_abnormality&#x27;, &#x27;dyspnea&#x27;, &#x27;tachypnea&#x27;, &#x27;reduced_consciousness&#x27;, &#x27;lethargy&#x27;, &#x27;coma&#x27;, &#x27;seizures&#x27;, &#x27;general_tonic_clonic_seizure&#x27;, &#x27;any_GI_problem&#x27;, &#x27;failure_to_thrive&#x27;, &#x27;any_delay&#x27;, &#x27;behavioral_abnormality&#x27;, &#x27;concurrent_infection&#x27;, &#x27;urine_ketones&#x27;, &#x27;dialysis&#x27;, &#x27;peritoneal_dialysis&#x27;, &#x27;insulin&#x27;, &#x27;diet&#x27;, &#x27;carnitine&#x27;, &#x27;cobalamin&#x27;, &#x27;bicarb&#x27;, &#x27;glucose_IV&#x27;, &#x27;cobalamin_responsive&#x27;, &#x27;antibiotic_treatment&#x27;, &#x27;protein_restriction&#x27;, &#x27;tube_feeding_day&#x27;, &#x27;tube_feeding_night&#x27;, &#x27;tube_feeding_overall&#x27;, &#x27;language_delay&#x27;, &#x27;any_neurological_abnormalities_chronic&#x27;, &#x27;impaired_kidney_fct&#x27;, &#x27;hemat_abnormality&#x27;, &#x27;anemia&#x27;, &#x27;neutropenia&#x27;, &#x27;skin_abnormalities&#x27;, &#x27;hearing_impairment&#x27;, &#x27;osteoporosis&#x27;, &#x27;failure_to_thrive_chronic&#x27;, &#x27;global_dev_delay_chr&#x27;, &#x27;hypotonia_chr&#x27;, &#x27;basal_ganglia_abnormality_chr&#x27;, &#x27;failure_to_thrive_or_tube_feeding&#x27;, &#x27;irritability&#x27;, &#x27;hyperventilation&#x27;, &#x27;hypothermia&#x27;, &#x27;somnolence&#x27;, &#x27;vomiting&#x27;, &#x27;dehydration&#x27;, &#x27;feeding_problem&#x27;, &#x27;responsive_to_acute_treatment&#x27;, &#x27;n_passage&#x27;, &#x27;date_collection&#x27;, &#x27;date_freezing&#x27;, &#x27;onset_age&#x27;, &#x27;OHCblMinus&#x27;, &#x27;OHCblPlus&#x27;, 
&#x27;ratio&#x27;, &#x27;SimultOHCblMinus&#x27;, &#x27;SimultOHCblPlus&#x27;, &#x27;AdoCblMinus&#x27;, &#x27;AdoCblPlus&#x27;, &#x27;SimultAdoCblMinus&#x27;, &#x27;SimultAdoCblPlus&#x27;, &#x27;prot_mut_level&#x27;, &#x27;rnaseq_mut_level&#x27;, &#x27;MMA_urine&#x27;, &#x27;ammonia_umolL&#x27;, &#x27;pH&#x27;, &#x27;base_excess&#x27;, &#x27;MMA_urine_after_treat&#x27;, &#x27;carnitine_dose&#x27;, &#x27;natural_protein_amount&#x27;, &#x27;total_protein_amount&#x27;, &#x27;weight_centile_quant&#x27;, &#x27;length_centile_quant&#x27;, &#x27;head_circumfernce_quant&#x27;, &#x27;proteomics_runorder&#x27;
      var:  &#x27;est_MMA_urine&#x27;, &#x27;est_OHCblPlus&#x27;, &#x27;est_case&#x27;, &#x27;est_responsive_to_acute_treatment&#x27;, &#x27;p_MMA_urine&#x27;, &#x27;p_OHCblPlus&#x27;, &#x27;p_case&#x27;, &#x27;p_date_freezing&#x27;, &#x27;p_proteomics_runorder&#x27;, &#x27;p_responsive_to_acute_treatment&#x27;, &#x27;log10p_MMA_urine&#x27;, &#x27;log10p_OHCblPlus&#x27;, &#x27;log10p_case&#x27;, &#x27;log10p_responsive_to_acute_treatment&#x27;, &#x27;q_MMA_urine&#x27;, &#x27;q_OHCblPlus&#x27;, &#x27;q_case&#x27;, &#x27;q_date_freezing&#x27;, &#x27;q_proteomics_runorder&#x27;, &#x27;q_responsive_to_acute_treatment&#x27;, &#x27;stat_MMA_urine&#x27;, &#x27;stat_OHCblPlus&#x27;, &#x27;stat_case&#x27;, &#x27;stat_responsive_to_acute_treatment&#x27;, &#x27;stderr_MMA_urine&#x27;, &#x27;stderr_OHCblPlus&#x27;, &#x27;stderr_case&#x27;, &#x27;stderr_responsive_to_acute_treatment&#x27;, &#x27;is_transcriptomics&#x27;
      obsm: &#x27;X_pca&#x27;
      varm: &#x27;PCs&#x27;
      layers:   &#x27;log2_centered&#x27;
    proteomics: 221 x 4788
      obs:  &#x27;case&#x27;, &#x27;gender&#x27;, &#x27;consanguinity&#x27;, &#x27;mut_category&#x27;, &#x27;wgs_zygosity&#x27;, &#x27;acidosis&#x27;, &#x27;metabolic_acidosis&#x27;, &#x27;metabolic_ketoacidosis&#x27;, &#x27;ketosis&#x27;, &#x27;hyperammonemia&#x27;, &#x27;abnormal_muscle_tone&#x27;, &#x27;musc_hypotonia&#x27;, &#x27;musc_hypertonia&#x27;, &#x27;fct_respiratory_abnormality&#x27;, &#x27;dyspnea&#x27;, &#x27;tachypnea&#x27;, &#x27;reduced_consciousness&#x27;, &#x27;lethargy&#x27;, &#x27;coma&#x27;, &#x27;seizures&#x27;, &#x27;general_tonic_clonic_seizure&#x27;, &#x27;any_GI_problem&#x27;, &#x27;failure_to_thrive&#x27;, &#x27;any_delay&#x27;, &#x27;behavioral_abnormality&#x27;, &#x27;concurrent_infection&#x27;, &#x27;urine_ketones&#x27;, &#x27;dialysis&#x27;, &#x27;peritoneal_dialysis&#x27;, &#x27;insulin&#x27;, &#x27;diet&#x27;, &#x27;carnitine&#x27;, &#x27;cobalamin&#x27;, &#x27;bicarb&#x27;, &#x27;glucose_IV&#x27;, &#x27;cobalamin_responsive&#x27;, &#x27;antibiotic_treatment&#x27;, &#x27;protein_restriction&#x27;, &#x27;tube_feeding_day&#x27;, &#x27;tube_feeding_night&#x27;, &#x27;tube_feeding_overall&#x27;, &#x27;language_delay&#x27;, &#x27;any_neurological_abnormalities_chronic&#x27;, &#x27;impaired_kidney_fct&#x27;, &#x27;hemat_abnormality&#x27;, &#x27;anemia&#x27;, &#x27;neutropenia&#x27;, &#x27;skin_abnormalities&#x27;, &#x27;hearing_impairment&#x27;, &#x27;osteoporosis&#x27;, &#x27;failure_to_thrive_chronic&#x27;, &#x27;global_dev_delay_chr&#x27;, &#x27;hypotonia_chr&#x27;, &#x27;basal_ganglia_abnormality_chr&#x27;, &#x27;failure_to_thrive_or_tube_feeding&#x27;, &#x27;irritability&#x27;, &#x27;hyperventilation&#x27;, &#x27;hypothermia&#x27;, &#x27;somnolence&#x27;, &#x27;vomiting&#x27;, &#x27;dehydration&#x27;, &#x27;feeding_problem&#x27;, &#x27;responsive_to_acute_treatment&#x27;, &#x27;n_passage&#x27;, &#x27;date_collection&#x27;, &#x27;date_freezing&#x27;, &#x27;onset_age&#x27;, &#x27;OHCblMinus&#x27;, &#x27;OHCblPlus&#x27;, 
&#x27;ratio&#x27;, &#x27;SimultOHCblMinus&#x27;, &#x27;SimultOHCblPlus&#x27;, &#x27;AdoCblMinus&#x27;, &#x27;AdoCblPlus&#x27;, &#x27;SimultAdoCblMinus&#x27;, &#x27;SimultAdoCblPlus&#x27;, &#x27;prot_mut_level&#x27;, &#x27;rnaseq_mut_level&#x27;, &#x27;MMA_urine&#x27;, &#x27;ammonia_umolL&#x27;, &#x27;pH&#x27;, &#x27;base_excess&#x27;, &#x27;MMA_urine_after_treat&#x27;, &#x27;carnitine_dose&#x27;, &#x27;natural_protein_amount&#x27;, &#x27;total_protein_amount&#x27;, &#x27;weight_centile_quant&#x27;, &#x27;length_centile_quant&#x27;, &#x27;head_circumfernce_quant&#x27;, &#x27;proteomics_runorder&#x27;
      var:  &#x27;PG.ProteinDescriptions&#x27;, &#x27;PG.ProteinNames&#x27;, &#x27;PG.Qvalue&#x27;, &#x27;est_MMA_urine&#x27;, &#x27;est_OHCblPlus&#x27;, &#x27;est_case&#x27;, &#x27;est_responsive_to_acute_treatment&#x27;, &#x27;p_MMA_urine&#x27;, &#x27;p_OHCblPlus&#x27;, &#x27;p_case&#x27;, &#x27;p_date_freezing&#x27;, &#x27;p_proteomics_runorder&#x27;, &#x27;p_responsive_to_acute_treatment&#x27;, &#x27;log10p_MMA_urine&#x27;, &#x27;log10p_OHCblPlus&#x27;, &#x27;log10p_case&#x27;, &#x27;log10p_responsive_to_acute_treatment&#x27;, &#x27;q_MMA_urine&#x27;, &#x27;q_OHCblPlus&#x27;, &#x27;q_case&#x27;, &#x27;q_date_freezing&#x27;, &#x27;q_proteomics_runorder&#x27;, &#x27;q_responsive_to_acute_treatment&#x27;, &#x27;stat_MMA_urine&#x27;, &#x27;stat_OHCblPlus&#x27;, &#x27;stat_case&#x27;, &#x27;stat_responsive_to_acute_treatment&#x27;, &#x27;stderr_MMA_urine&#x27;, &#x27;stderr_OHCblPlus&#x27;, &#x27;stderr_case&#x27;, &#x27;stderr_responsive_to_acute_treatment&#x27;, &#x27;is_proteomics&#x27;
      obsm: &#x27;X_pca&#x27;
      varm: &#x27;PCs&#x27;
      layers:   &#x27;log2_centered&#x27;</pre>

<p>Feature-level attributes (e.g., <code class="language-plaintext highlighter-rouge">MuData</code> or <code class="language-plaintext highlighter-rouge">AnnData</code>’s <em>var</em>, <em>varm</em>,
<em>X</em>, or <em>layers</em>) can be seamlessly added to a <code class="language-plaintext highlighter-rouge">NapistuGraph</code> using
high-level workflows that handle identifier mapping, disambiguation, and
complex membership automatically.</p>

<p>Napistu supports three data input types:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">mudata.MuData</code> objects contain multiple <code class="language-plaintext highlighter-rouge">AnnData</code> objects where
<code class="language-plaintext highlighter-rouge">var</code> and <code class="language-plaintext highlighter-rouge">varm</code> attributes can be defined across multiple datasets.
Results can be stored in separate tables, as separate attributes
within the same table, or merged into a single attribute (e.g.,
combining transcript- and protein-level summaries).</li>
  <li><code class="language-plaintext highlighter-rouge">anndata.AnnData</code> objects contribute systematic identifiers from
their <em>var</em> table, while feature-level summaries can come from the <em>var</em>,
<em>varm</em>, <em>X</em>, or <em>layers</em> tables.</li>
  <li><code class="language-plaintext highlighter-rouge">pd.DataFrame</code> objects which include one or more columns with
systematic identifiers.</li>
</ul>
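<p>For the simplest of the three, a <code class="language-plaintext highlighter-rouge">pd.DataFrame</code> input just needs
a systematic-identifier column alongside the result columns. A minimal
sketch (identifiers and column names are made up):</p>

```python
import pandas as pd

# Minimal pd.DataFrame input: one systematic-identifier column plus
# result columns. Identifiers and column names are illustrative.
results_df = pd.DataFrame({
    "uniprot": ["P22033", "Q13426"],
    "log2_fc": [1.4, -0.7],
    "q_value": [0.001, 0.2],
})

# the identifier column should be non-null so every row can be matched
assert results_df["uniprot"].notna().all()
```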

<p>Here, I’ll start with a detailed example using results from a <code class="language-plaintext highlighter-rouge">MuData</code>
object.</p>

<h4 id="adding-latent-factors">Adding latent factors</h4>

<p>In the previous post, I applied Multi-Omics Factor Analysis (<em>MOFA</em>) to
decompose the dataset into 30 covarying latent factors. The factor
loadings are a 13,922 × 30 matrix: <code class="language-plaintext highlighter-rouge">mdata.varm["LFs"]</code>.</p>

<p>The <code class="language-plaintext highlighter-rouge">prepare_mudata_results_df</code> function prepares this tensor for
Napistu by:</p>

<ul>
  <li>Extracting modality-specific systematic identifiers from <em>var</em></li>
  <li>Combining them with the corresponding factor loadings</li>
  <li>Returning a dictionary that supports various strategies for merging
across modalities</li>
</ul>
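<p>Before running the real call, the shape of its output can be pictured
with plain pandas — a sketch only, with made-up names and values: per
modality, <em>var</em>-derived identifiers sit beside the rows of a
<em>varm</em>-style loading matrix, and the modalities are collected in a dict.</p>

```python
import numpy as np
import pandas as pd

# Sketch of a per-modality result: var-derived identifiers joined to the
# rows of a varm-style loading matrix, gathered in a dict keyed by
# modality. All names and values are illustrative.
var_ids = pd.Series(["ENSG_A", "ENSG_B", "ENSG_C"], name="ensembl_gene")
loadings = np.array([[0.1, -0.2], [0.0, 0.3], [-0.4, 0.05]])  # features x factors

modality_df = pd.concat(
    [var_ids, pd.DataFrame(loadings, columns=["LF1", "LF2"])],
    axis=1,
)
lfs_by_modality = {"transcriptomics": modality_df}
```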

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mofa_lfs</span> <span class="o">=</span> <span class="n">prepare_mudata_results_df</span><span class="p">(</span>
    <span class="n">mdata</span><span class="p">,</span>
    <span class="n">mudata_ontologies</span><span class="o">=</span><span class="n">MUDATA_ONTOLOGIES</span><span class="p">,</span>
    <span class="n">table_type</span><span class="o">=</span><span class="s">"varm"</span><span class="p">,</span>
    <span class="n">table_name</span><span class="o">=</span><span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">LFS</span><span class="p">,</span> <span class="c1"># this would be autodetected
</span>    <span class="n">results_attrs</span><span class="o">=</span><span class="n">LFS_OF_INTEREST</span><span class="p">,</span>
    <span class="n">table_colnames</span><span class="o">=</span><span class="p">[</span><span class="sa">f</span><span class="s">"LF</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">mdata</span><span class="p">.</span><span class="n">varm</span><span class="p">[</span><span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">LFS</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Below are five randomly sampled rows of the <em>MOFA</em> latent factors for
each modality:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="n">MODALITIES</span><span class="p">:</span>
  
    <span class="n">systematic_id_column</span> <span class="o">=</span> <span class="n">MUDATA_ONTOLOGIES</span><span class="p">[</span><span class="n">modality</span><span class="p">][</span><span class="s">"index_which_ontology"</span><span class="p">]</span>
  
    <span class="n">mofa_lfs_examples</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">mofa_lfs</span><span class="p">[</span><span class="n">modality</span><span class="p">]</span>
        <span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span> <span class="o">=</span> <span class="n">systematic_id_column</span><span class="p">)</span>
        <span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
        <span class="p">.</span><span class="n">copy</span><span class="p">()</span>
    <span class="p">)</span>
    <span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">mofa_lfs_examples</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
  
    <span class="n">display_tabulator</span><span class="p">(</span>
        <span class="n">mofa_lfs_examples</span><span class="p">,</span>
        <span class="n">caption</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"Extracted latent factors for </span><span class="si">{</span><span class="n">modality</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
        <span class="n">column_widths</span><span class="o">=</span><span class="p">{</span><span class="n">systematic_id_column</span> <span class="p">:</span> <span class="s">"25%"</span><span class="p">}</span>
    <span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Extracted latent factors for transcriptomics
</figcaption>

<div class="data-table" style="" data-table="[{&quot;ensembl_gene&quot;: &quot;ENSG00000164543&quot;, &quot;LF1&quot;: &quot;-0.095&quot;, &quot;LF2&quot;: &quot;-0.077&quot;, &quot;LF3&quot;: &quot;-0.000&quot;, &quot;LF4&quot;: &quot;0.035&quot;, &quot;LF5&quot;: &quot;0.070&quot;}, {&quot;ensembl_gene&quot;: &quot;ENSG00000228049&quot;, &quot;LF1&quot;: &quot;-0.003&quot;, &quot;LF2&quot;: &quot;-0.064&quot;, &quot;LF3&quot;: &quot;0.000&quot;, &quot;LF4&quot;: &quot;-0.100&quot;, &quot;LF5&quot;: &quot;-0.001&quot;}, {&quot;ensembl_gene&quot;: &quot;ENSG00000147548&quot;, &quot;LF1&quot;: &quot;0.004&quot;, &quot;LF2&quot;: &quot;0.019&quot;, &quot;LF3&quot;: &quot;0.000&quot;, &quot;LF4&quot;: &quot;0.082&quot;, &quot;LF5&quot;: &quot;-0.047&quot;}, {&quot;ensembl_gene&quot;: &quot;ENSG00000164008&quot;, &quot;LF1&quot;: &quot;-0.000&quot;, &quot;LF2&quot;: &quot;-0.027&quot;, &quot;LF3&quot;: &quot;-0.000&quot;, &quot;LF4&quot;: &quot;-0.069&quot;, &quot;LF5&quot;: &quot;0.034&quot;}, {&quot;ensembl_gene&quot;: &quot;ENSG00000120948&quot;, &quot;LF1&quot;: &quot;-0.033&quot;, &quot;LF2&quot;: &quot;0.034&quot;, &quot;LF3&quot;: &quot;-0.000&quot;, &quot;LF4&quot;: &quot;0.012&quot;, &quot;LF5&quot;: &quot;0.002&quot;}]" data-columns="[{&quot;title&quot;: &quot;ensembl_gene&quot;, &quot;field&quot;: &quot;ensembl_gene&quot;, &quot;width&quot;: &quot;25%&quot;}, {&quot;title&quot;: &quot;LF1&quot;, &quot;field&quot;: &quot;LF1&quot;}, {&quot;title&quot;: &quot;LF2&quot;, &quot;field&quot;: &quot;LF2&quot;}, {&quot;title&quot;: &quot;LF3&quot;, &quot;field&quot;: &quot;LF3&quot;}, {&quot;title&quot;: &quot;LF4&quot;, &quot;field&quot;: &quot;LF4&quot;}, {&quot;title&quot;: &quot;LF5&quot;, &quot;field&quot;: &quot;LF5&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Extracted latent factors for proteomics
</figcaption>

<div class="data-table" style="" data-table="[{&quot;uniprot&quot;: &quot;Q13426&quot;, &quot;LF1&quot;: &quot;-0.007&quot;, &quot;LF2&quot;: &quot;-0.000&quot;, &quot;LF3&quot;: &quot;-0.146&quot;, &quot;LF4&quot;: &quot;-0.004&quot;, &quot;LF5&quot;: &quot;-0.004&quot;}, {&quot;uniprot&quot;: &quot;Q5JU69;Q8N2E6&quot;, &quot;LF1&quot;: &quot;-0.001&quot;, &quot;LF2&quot;: &quot;0.041&quot;, &quot;LF3&quot;: &quot;0.096&quot;, &quot;LF4&quot;: &quot;-0.000&quot;, &quot;LF5&quot;: &quot;-0.005&quot;}, {&quot;uniprot&quot;: &quot;Q7Z7H5&quot;, &quot;LF1&quot;: &quot;0.033&quot;, &quot;LF2&quot;: &quot;-0.037&quot;, &quot;LF3&quot;: &quot;0.196&quot;, &quot;LF4&quot;: &quot;-0.006&quot;, &quot;LF5&quot;: &quot;-0.062&quot;}, {&quot;uniprot&quot;: &quot;Q16774&quot;, &quot;LF1&quot;: &quot;0.026&quot;, &quot;LF2&quot;: &quot;0.008&quot;, &quot;LF3&quot;: &quot;0.281&quot;, &quot;LF4&quot;: &quot;0.002&quot;, &quot;LF5&quot;: &quot;0.005&quot;}, {&quot;uniprot&quot;: &quot;O14548&quot;, &quot;LF1&quot;: &quot;-0.033&quot;, &quot;LF2&quot;: &quot;0.136&quot;, &quot;LF3&quot;: &quot;0.004&quot;, &quot;LF4&quot;: &quot;0.001&quot;, &quot;LF5&quot;: &quot;-0.007&quot;}]" data-columns="[{&quot;title&quot;: &quot;uniprot&quot;, &quot;field&quot;: &quot;uniprot&quot;, &quot;width&quot;: &quot;25%&quot;}, {&quot;title&quot;: &quot;LF1&quot;, &quot;field&quot;: &quot;LF1&quot;}, {&quot;title&quot;: &quot;LF2&quot;, &quot;field&quot;: &quot;LF2&quot;}, {&quot;title&quot;: &quot;LF3&quot;, &quot;field&quot;: &quot;LF3&quot;}, {&quot;title&quot;: &quot;LF4&quot;, &quot;field&quot;: &quot;LF4&quot;}, {&quot;title&quot;: &quot;LF5&quot;, &quot;field&quot;: &quot;LF5&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>To add these results to an <code class="language-plaintext highlighter-rouge">SBML_dfs</code> object, I’ll:</p>

<ul>
  <li>Create a <code class="language-plaintext highlighter-rouge">pandas.DataFrame</code> with at most one row per distinct
molecular species in the model</li>
  <li>Store this DataFrame as a key-value pair in the <em>species_data</em>
dictionary attribute of the <code class="language-plaintext highlighter-rouge">SBML_dfs</code></li>
</ul>

<p>Molecular species are linked to various ontologies (e.g., <em>Ensembl</em>,
<em>UniProt</em>, <em>Entrez</em>). Napistu can distinguish genes, transcripts, and
proteins as distinct molecular species (“dogmatic mode”). However, the
loaded model merges these into a single species, ignoring such
distinctions.</p>

<p>Merging ’omics data into the pathway representation involves:</p>

<ul>
  <li>Matching based on identifiers</li>
  <li>Resolving collisions (e.g., when transcripts and proteins map to the
same species, or multiple proteins map to one molecular species)</li>
</ul>

<p>These challenges are handled by <code class="language-plaintext highlighter-rouge">mount.bind_dict_of_wide_results</code>. In
this example, I “stagger” results to keep latent factors separate based
on whether they correspond to transcripts or proteins.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mount</span><span class="p">.</span><span class="n">bind_dict_of_wide_results</span><span class="p">(</span>
    <span class="n">sbml_dfs</span><span class="p">,</span>
    <span class="n">mofa_lfs</span><span class="p">,</span>
    <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">MOFA_LFS</span><span class="p">,</span>
    <span class="n">strategy</span> <span class="o">=</span> <span class="s">"stagger"</span><span class="p">,</span>
    <span class="n">species_identifiers</span> <span class="o">=</span> <span class="n">species_identifiers</span><span class="p">,</span>
    <span class="c1"># ontologies were already renamed to the controlled vocabulary in prepare_mudata_results_df()
</span>    <span class="n">ontologies</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="c1"># ignored because species_identifiers is provided
</span>    <span class="n">dogmatic</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
    <span class="c1"># for clarity; default is True
</span>    <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
    <span class="n">verbose</span> <span class="o">=</span> <span class="bp">False</span>
<span class="p">)</span>
</code></pre></div></div>

<p>The outcome is a single <em>species_data</em> table integrating latent factors
from both modalities, mapped onto the model’s molecular species.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">example_species_data</span> <span class="o">=</span> <span class="n">sbml_dfs</span><span class="p">.</span><span class="n">species_data</span><span class="p">[</span><span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">MOFA_LFS</span><span class="p">].</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">).</span><span class="n">copy</span><span class="p">()</span>
<span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">example_species_data</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span><span class="n">example_species_data</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span> <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">)</span>
</code></pre></div></div>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;s_id&quot;: &quot;S00000001&quot;, &quot;LF1_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF2_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF3_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF4_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF5_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF1_proteomics&quot;: &quot;-0.010&quot;, &quot;LF2_proteomics&quot;: &quot;0.063&quot;, &quot;LF3_proteomics&quot;: &quot;-0.031&quot;, &quot;LF4_proteomics&quot;: &quot;-0.001&quot;, &quot;LF5_proteomics&quot;: &quot;0.023&quot;, &quot;feature_id&quot;: &quot;11618&quot;}, {&quot;s_id&quot;: &quot;S00000009&quot;, &quot;LF1_transcriptomics&quot;: &quot;-0.044&quot;, &quot;LF2_transcriptomics&quot;: &quot;-0.008&quot;, &quot;LF3_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF4_transcriptomics&quot;: &quot;-0.037&quot;, &quot;LF5_transcriptomics&quot;: &quot;0.017&quot;, &quot;LF1_proteomics&quot;: &quot;0.025&quot;, &quot;LF2_proteomics&quot;: &quot;-0.007&quot;, &quot;LF3_proteomics&quot;: &quot;0.033&quot;, &quot;LF4_proteomics&quot;: &quot;0.001&quot;, &quot;LF5_proteomics&quot;: &quot;-0.001&quot;, &quot;feature_id&quot;: &quot;11437,1660&quot;}, {&quot;s_id&quot;: &quot;S00000011&quot;, &quot;LF1_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF2_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF3_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF4_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF5_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF1_proteomics&quot;: &quot;0.056&quot;, &quot;LF2_proteomics&quot;: &quot;-0.150&quot;, &quot;LF3_proteomics&quot;: &quot;0.003&quot;, &quot;LF4_proteomics&quot;: &quot;-0.031&quot;, &quot;LF5_proteomics&quot;: &quot;-0.020&quot;, &quot;feature_id&quot;: &quot;9864&quot;}, {&quot;s_id&quot;: &quot;S00000012&quot;, &quot;LF1_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF2_transcriptomics&quot;: &quot;0.000&quot;, 
&quot;LF3_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF4_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF5_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF1_proteomics&quot;: &quot;-0.017&quot;, &quot;LF2_proteomics&quot;: &quot;-0.072&quot;, &quot;LF3_proteomics&quot;: &quot;0.027&quot;, &quot;LF4_proteomics&quot;: &quot;-0.016&quot;, &quot;LF5_proteomics&quot;: &quot;0.001&quot;, &quot;feature_id&quot;: &quot;9967&quot;}, {&quot;s_id&quot;: &quot;S00000013&quot;, &quot;LF1_transcriptomics&quot;: &quot;-0.119&quot;, &quot;LF2_transcriptomics&quot;: &quot;-0.062&quot;, &quot;LF3_transcriptomics&quot;: &quot;0.000&quot;, &quot;LF4_transcriptomics&quot;: &quot;-0.124&quot;, &quot;LF5_transcriptomics&quot;: &quot;0.007&quot;, &quot;LF1_proteomics&quot;: &quot;-0.004&quot;, &quot;LF2_proteomics&quot;: &quot;-0.018&quot;, &quot;LF3_proteomics&quot;: &quot;0.007&quot;, &quot;LF4_proteomics&quot;: &quot;-0.004&quot;, &quot;LF5_proteomics&quot;: &quot;0.000&quot;, &quot;feature_id&quot;: &quot;50,9967&quot;}]" data-columns="[{&quot;title&quot;: &quot;s_id&quot;, &quot;field&quot;: &quot;s_id&quot;}, {&quot;title&quot;: &quot;LF1_transcriptomics&quot;, &quot;field&quot;: &quot;LF1_transcriptomics&quot;}, {&quot;title&quot;: &quot;LF2_transcriptomics&quot;, &quot;field&quot;: &quot;LF2_transcriptomics&quot;}, {&quot;title&quot;: &quot;LF3_transcriptomics&quot;, &quot;field&quot;: &quot;LF3_transcriptomics&quot;}, {&quot;title&quot;: &quot;LF4_transcriptomics&quot;, &quot;field&quot;: &quot;LF4_transcriptomics&quot;}, {&quot;title&quot;: &quot;LF5_transcriptomics&quot;, &quot;field&quot;: &quot;LF5_transcriptomics&quot;}, {&quot;title&quot;: &quot;LF1_proteomics&quot;, &quot;field&quot;: &quot;LF1_proteomics&quot;}, {&quot;title&quot;: &quot;LF2_proteomics&quot;, &quot;field&quot;: &quot;LF2_proteomics&quot;}, {&quot;title&quot;: &quot;LF3_proteomics&quot;, &quot;field&quot;: &quot;LF3_proteomics&quot;}, {&quot;title&quot;: &quot;LF4_proteomics&quot;, 
&quot;field&quot;: &quot;LF4_proteomics&quot;}, {&quot;title&quot;: &quot;LF5_proteomics&quot;, &quot;field&quot;: &quot;LF5_proteomics&quot;}, {&quot;title&quot;: &quot;feature_id&quot;, &quot;field&quot;: &quot;feature_id&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div class="content-section ai-aside">
  <div class="section-content">
    <p>The <code class="language-plaintext highlighter-rouge">scverse</code> and <code class="language-plaintext highlighter-rouge">matching</code>
subpackages are recent additions to Napistu, designed to make the
framework more user-friendly. The goal is to reduce technical barriers
for researchers by providing streamlined workflows for common formats
like <code class="language-plaintext highlighter-rouge">MuData</code> and <code class="language-plaintext highlighter-rouge">AnnData</code> objects. (Thanks to Vito Zanotelli for
encouraging this direction!)</p>

<p>This module was developed using AI-assisted coding, which revealed some
interesting insights into the strengths and weaknesses of different AI
tools for scientific software development. Language models like Claude
were helpful for understanding biological data structures and offering
conceptual guidance, but struggled when it came to extending an existing
codebase — often suggesting overly complex or impractical solutions.
Code-focused AI tools like Cursor proved more effective for the actual
implementation work.</p>

<p>The development process followed an iterative approach: first building
prototypes to understand the functionality requirements, then drafting
comprehensive tests, followed by implementing individual functions with
continuous testing, and finally polishing the code with proper
documentation and type annotations. This AI-assisted workflow
significantly accelerated development while maintaining code quality, a
pattern that’s becoming increasingly valuable for scientific software
projects.</p>

  </div>
</div>

<h4 id="adding-statistical-summaries-and-modality-masks">Adding statistical summaries and modality masks</h4>

<p>Having shown how to bind <em>MOFA</em> factor loadings to the pathway, I’ll now
add differential expression results.</p>

<p>The disease phenotypes of interest are:</p>

<ul>
  <li><em>OHCblPlus</em>: enzymatic activity readout</li>
  <li><em>MMA_urine</em>: metabolic burden indicator</li>
  <li><em>case</em>: disease status</li>
  <li><em>responsive_to_acute_treatment</em>: effectiveness of acute vitamin
supplementation</li>
</ul>

<p>For each phenotype, I’ll add the following statistical summaries:</p>

<ul>
  <li><em>estimate</em>: the regression effect size</li>
  <li><em>statistic</em>: the regression t-statistic</li>
  <li><em>log10p</em>: the $-\log_{10}(\text{p-value})$ (calculated this way to
avoid numerical underflow)</li>
  <li><em>q-value</em>: the Benjamini-Hochberg FDR-adjusted p-value</li>
</ul>
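<p>The q-value column can be reproduced in a few lines of numpy using
the standard Benjamini-Hochberg step-up procedure. (For very small
p-values, the <em>log10p</em> column should come from a log-space
survival function rather than taking
<code class="language-plaintext highlighter-rouge">np.log10</code> of an
already-underflowed p-value.)</p>

```python
import numpy as np

def bh_qvalues(p):
    """Benjamini-Hochberg FDR-adjusted p-values (step-up procedure)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    # scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, sweeping from the largest p-value down
    q = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(q, 0.0, 1.0)
    return out

qvals = bh_qvalues([0.01, 0.02, 0.03, 0.04])
```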

<p>I’ll also include the q-values for covariates used in the regressions
— namely, the nonlinear associations of <em>freezing_date</em> and
<em>proteomics_run_order</em>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># now we can add .var attributes from the mdata
</span><span class="n">diffex_results</span> <span class="o">=</span> <span class="n">prepare_mudata_results_df</span><span class="p">(</span>
    <span class="n">mdata</span><span class="p">,</span>
    <span class="n">mudata_ontologies</span><span class="o">=</span><span class="n">MUDATA_ONTOLOGIES</span><span class="p">,</span>
    <span class="n">table_type</span><span class="o">=</span><span class="s">"var"</span><span class="p">,</span>
    <span class="n">results_attrs</span><span class="o">=</span><span class="n">VAR_VARS</span><span class="p">,</span>
    <span class="n">level</span> <span class="o">=</span> <span class="s">"adata"</span>
<span class="p">)</span>

<span class="n">mount</span><span class="p">.</span><span class="n">bind_dict_of_wide_results</span><span class="p">(</span>
    <span class="n">sbml_dfs</span><span class="p">,</span>
    <span class="n">diffex_results</span><span class="p">,</span>
    <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">VAR_LEVEL_RESULTS</span><span class="p">,</span>
    <span class="n">strategy</span> <span class="o">=</span> <span class="s">"stagger"</span><span class="p">,</span>
    <span class="n">species_identifiers</span> <span class="o">=</span> <span class="n">species_identifiers</span><span class="p">,</span>
    <span class="n">verbose</span> <span class="o">=</span> <span class="bp">False</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Finally, I’ll add modality-level indicator variables. For loose results
in a <code class="language-plaintext highlighter-rouge">pandas.DataFrame</code>, I can use <code class="language-plaintext highlighter-rouge">bind_wide_results</code> to add them
directly to the <code class="language-plaintext highlighter-rouge">SBML_dfs</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="n">MODALITIES</span><span class="p">:</span>
    <span class="n">anndata_results_df</span> <span class="o">=</span> <span class="n">prepare_anndata_results_df</span><span class="p">(</span>
        <span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">],</span>
        <span class="n">table_type</span><span class="o">=</span><span class="s">"var"</span><span class="p">,</span>
        <span class="n">index_which_ontology</span> <span class="o">=</span> <span class="n">MUDATA_ONTOLOGIES</span><span class="p">[</span><span class="n">modality</span><span class="p">][</span><span class="s">"index_which_ontology"</span><span class="p">],</span>
        <span class="n">results_attrs</span><span class="o">=</span><span class="n">ADATA_LEVEL_VARS</span><span class="p">[</span><span class="n">modality</span><span class="p">]</span>
    <span class="p">)</span>

    <span class="n">mount</span><span class="p">.</span><span class="n">bind_wide_results</span><span class="p">(</span>
        <span class="n">sbml_dfs</span><span class="p">,</span>
        <span class="n">anndata_results_df</span><span class="p">,</span>
        <span class="n">FORNY_DEFS</span><span class="p">.</span><span class="n">MODALITY_VAR_LEVEL_RESULTS_STR</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">modality</span> <span class="o">=</span> <span class="n">modality</span><span class="p">),</span>
        <span class="n">species_identifiers</span> <span class="o">=</span> <span class="n">species_identifiers</span><span class="p">,</span>
        <span class="n">ontologies</span> <span class="o">=</span> <span class="n">MUDATA_ONTOLOGIES</span><span class="p">[</span><span class="n">modality</span><span class="p">][</span><span class="s">"ontologies"</span><span class="p">]</span>
    <span class="p">)</span>
</code></pre></div></div>

<h3 id="sbml_dfs-rightarrow-napistugraph"><code class="language-plaintext highlighter-rouge">SBML_dfs</code> $\Rightarrow$ <code class="language-plaintext highlighter-rouge">NapistuGraph</code></h3>

<p>Now, I can pass selected attributes from the <em>species_data</em> tables to
the <code class="language-plaintext highlighter-rouge">NapistuGraph</code> object. While the low-level, flexible
<code class="language-plaintext highlighter-rouge">set_graph_attrs</code> method (which also supports setting edge attributes)
is available, I’ll use the more user-friendly function instead:
<code class="language-plaintext highlighter-rouge">data_handling.add_results_table_to_graph()</code>.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>When working with an <code class="language-plaintext highlighter-rouge">SBML_dfs</code>
loaded from GCS, users can typically rely on the pre-generated
<code class="language-plaintext highlighter-rouge">NapistuGraph</code> bundled with it. These graphs are:</p>

<ul>
  <li>Directed (though reversible reactions, such as <em>STRING</em>
interactions, appear as paired forward and reverse edges)</li>
  <li>Wired based on a regulatory hierarchy: vertices are arranged in
tiers — regulators → catalysts → substrates → reactions → products</li>
  <li>Sensibly weighted: edges reflect meaningful interaction weights
where applicable</li>
</ul>

<p>The only modification I’ll make for this analysis is reversing the
graph’s edges, so that signals can flow from effects (e.g., dysregulated
genes) upstream to their potential causes (e.g., transcriptional or
enzymatic regulators).</p>

  </div>
</div>

<p>Here’s a view of the graph showing a random selection of vertices and
edges:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vertex_indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">napistu_graph</span><span class="p">.</span><span class="n">vs</span><span class="p">),</span> <span class="mi">10</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="n">vertices_df</span> <span class="o">=</span><span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
        <span class="n">i</span> <span class="p">:</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">vs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">attributes</span><span class="p">()</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">vertex_indices</span>
    <span class="p">})</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">sc_Source</span> <span class="o">=</span> <span class="s">"."</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"name"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="s">"."</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">edge_indices</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">napistu_graph</span><span class="p">.</span><span class="n">es</span><span class="p">),</span> <span class="mi">10</span><span class="p">,</span> <span class="n">replace</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="n">edges_df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
        <span class="n">i</span> <span class="p">:</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">es</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">attributes</span><span class="p">()</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">edge_indices</span>
    <span class="p">})</span>
    <span class="p">.</span><span class="n">T</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">sc_Source</span> <span class="o">=</span> <span class="s">"."</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">"from"</span><span class="p">,</span> <span class="s">"to"</span><span class="p">])</span>
<span class="p">)</span>
<span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">edges_df</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span><span class="n">vertices_df</span><span class="p">,</span> <span class="n">caption</span><span class="o">=</span><span class="s">"Vertices"</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span> <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">)</span>
<span class="n">display_tabulator</span><span class="p">(</span><span class="n">edges_df</span><span class="p">,</span> <span class="n">caption</span><span class="o">=</span><span class="s">"Edges"</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span> <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Vertices
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;name&quot;: &quot;SC00000539&quot;, &quot;node_name&quot;: &quot;GlcNAc-GlcA-GlcNAc&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;R00389169&quot;, &quot;node_name&quot;: &quot;RELA modifier of HMOX1&quot;, &quot;node_type&quot;: &quot;reaction&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;.&quot;}, {&quot;name&quot;: &quot;SC00027413&quot;, &quot;node_name&quot;: &quot;solute carrier family 25 member 47 [cellular_component]&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;SC00026835&quot;, &quot;node_name&quot;: &quot;GDNF family receptor alpha like [cellular_component]&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;SC00022715&quot;, &quot;node_name&quot;: &quot;dipeptidyl peptidase 8 [cellular_component]&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;SC00021226&quot;, &quot;node_name&quot;: &quot;olfactory receptor family 1 subfamily C member 1 [cellular_component]&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;SC00022708&quot;, &quot;node_name&quot;: &quot;CA4&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;SC00014948&quot;, &quot;node_name&quot;: &quot;SLC9B2 dimer&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, 
&quot;species_type&quot;: &quot;protein&quot;}, {&quot;name&quot;: &quot;R00065224&quot;, &quot;node_name&quot;: &quot;CEBPB modifier of F8&quot;, &quot;node_type&quot;: &quot;reaction&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;.&quot;}, {&quot;name&quot;: &quot;SC00018803&quot;, &quot;node_name&quot;: &quot;THOC5&quot;, &quot;node_type&quot;: &quot;species&quot;, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;}]" data-columns="[{&quot;title&quot;: &quot;name&quot;, &quot;field&quot;: &quot;name&quot;}, {&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;}, {&quot;title&quot;: &quot;node_type&quot;, &quot;field&quot;: &quot;node_type&quot;}, {&quot;title&quot;: &quot;sc_Source&quot;, &quot;field&quot;: &quot;sc_Source&quot;}, {&quot;title&quot;: &quot;species_type&quot;, &quot;field&quot;: &quot;species_type&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Edges
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;index&quot;: &quot;SC00030954 / SC00035419&quot;, &quot;r_id&quot;: &quot;R03757688&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 6.329113924050633, &quot;weight&quot;: 6.329113924050633, &quot;upstream_weight&quot;: 6.329113924050633, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00032212 / SC00033302&quot;, &quot;r_id&quot;: &quot;R03840042&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 6.410256410256411, &quot;weight&quot;: 6.410256410256411, &quot;upstream_weight&quot;: 6.410256410256411, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00031764 / SC00033418&quot;, &quot;r_id&quot;: &quot;R03813825&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 3.058103975535168, &quot;weight&quot;: 3.058103975535168, &quot;upstream_weight&quot;: 3.058103975535168, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00034806 / SC00029555&quot;, &quot;r_id&quot;: &quot;R03631600&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;reverse&quot;, &quot;string_wt&quot;: 3.4602076124567476, &quot;weight&quot;: 3.4602076124567476, 
&quot;upstream_weight&quot;: 3.4602076124567476, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00006965 / SC00031210&quot;, &quot;r_id&quot;: &quot;R01149106&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 3.745318352059925, &quot;weight&quot;: 3.745318352059925, &quot;upstream_weight&quot;: 3.745318352059925, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00032461 / SC00034934&quot;, &quot;r_id&quot;: &quot;R03853793&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 3.4843205574912894, &quot;weight&quot;: 3.4843205574912894, &quot;upstream_weight&quot;: 3.4843205574912894, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00011294 / SC00033161&quot;, &quot;r_id&quot;: &quot;R01452767&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 5.208333333333333, &quot;weight&quot;: 5.208333333333333, &quot;upstream_weight&quot;: 5.208333333333333, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00028387 / SC00027845&quot;, &quot;r_id&quot;: &quot;R03432647&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;reverse&quot;, &quot;string_wt&quot;: 1.081081081081081, &quot;weight&quot;: 1.081081081081081, 
&quot;upstream_weight&quot;: 1.081081081081081, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00030930 / SC00031882&quot;, &quot;r_id&quot;: &quot;R03755607&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 2.8490028490028494, &quot;weight&quot;: 2.8490028490028494, &quot;upstream_weight&quot;: 2.8490028490028494, &quot;source_wt&quot;: 10}, {&quot;index&quot;: &quot;SC00019035 / SC00023658&quot;, &quot;r_id&quot;: &quot;R02024262&quot;, &quot;sbo_term&quot;: &quot;SBO:0000336&quot;, &quot;stoichiometry&quot;: 0.0, &quot;sc_Source&quot;: &quot;.&quot;, &quot;species_type&quot;: &quot;protein&quot;, &quot;r_isreversible&quot;: true, &quot;direction&quot;: &quot;forward&quot;, &quot;string_wt&quot;: 1.0204081632653061, &quot;weight&quot;: 1.0204081632653061, &quot;upstream_weight&quot;: 1.0204081632653061, &quot;source_wt&quot;: 10}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;r_id&quot;, &quot;field&quot;: &quot;r_id&quot;}, {&quot;title&quot;: &quot;sbo_term&quot;, &quot;field&quot;: &quot;sbo_term&quot;}, {&quot;title&quot;: &quot;stoichiometry&quot;, &quot;field&quot;: &quot;stoichiometry&quot;}, {&quot;title&quot;: &quot;sc_Source&quot;, &quot;field&quot;: &quot;sc_Source&quot;}, {&quot;title&quot;: &quot;species_type&quot;, &quot;field&quot;: &quot;species_type&quot;}, {&quot;title&quot;: &quot;r_isreversible&quot;, &quot;field&quot;: &quot;r_isreversible&quot;}, {&quot;title&quot;: &quot;direction&quot;, &quot;field&quot;: &quot;direction&quot;}, {&quot;title&quot;: &quot;string_wt&quot;, &quot;field&quot;: &quot;string_wt&quot;}, {&quot;title&quot;: &quot;weight&quot;, &quot;field&quot;: &quot;weight&quot;}, {&quot;title&quot;: &quot;upstream_weight&quot;, 
&quot;field&quot;: &quot;upstream_weight&quot;}, {&quot;title&quot;: &quot;source_wt&quot;, &quot;field&quot;: &quot;source_wt&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div class="content-section bio-section">
  <div class="section-content">
    <p>Most <code class="language-plaintext highlighter-rouge">NapistuGraph</code> objects —
including this one — contain both species and reaction vertices:</p>

<ul>
  <li><strong>Species vertices</strong> represent molecular entities (genes, proteins,
metabolites, etc.)</li>
  <li><strong>Reaction vertices</strong> represent biochemical or regulatory reactions</li>
</ul>

<p>This bipartite structure is borrowed from metabolic modeling where
metabolites are connected via reaction nodes that define their
interconversion. But, the <code class="language-plaintext highlighter-rouge">NapistuGraph</code> used here is not a strict
bipartite network since species nodes can be connected to other species
nodes.</p>

  </div>
</div>

<h4 id="creating-an-appropriate-graph-with-data-attributes">Creating an appropriate graph with data attributes</h4>

<p>To prepare the graph for network propagation, I’ll:</p>

<ol>
  <li>Reverse all edges to allow signal flow from observed effects to
their upstream causes.</li>
  <li>Add species attributes from <em>species_data</em> to the vertices using
<code class="language-plaintext highlighter-rouge">add_results_table_to_graph()</code>. This entails specifying which
attributes to use and how to transform them to make them suitable
for personalized PageRank (non-negative, where larger values
represent stronger signals).</li>
</ol>
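<p>A custom transformation here is just a vectorized function mapping
raw scores to non-negative weights. The entries below are hypothetical
examples of what a <code class="language-plaintext highlighter-rouge">CUSTOM_TRANSFORMATIONS</code>-style mapping might
contain; the real keys and signatures are Napistu's own:</p>

```python
import numpy as np

# hypothetical transformation entries: each maps raw species scores to
# non-negative values where larger means a stronger signal, as the
# personalized PageRank reset distribution requires
CUSTOM_TRANSFORMATIONS = {
    # magnitude of a signed effect size or factor loading
    "abs": np.abs,
    # -log10(p)-style scores: clip stray negatives to zero
    "clip_nonnegative": lambda x: np.clip(x, 0.0, None),
}

weights = CUSTOM_TRANSFORMATIONS["abs"](np.array([-2.0, 0.5, 0.0]))
```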

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># reverse all edges in place
</span><span class="n">napistu_graph</span><span class="p">.</span><span class="n">reverse_edges</span><span class="p">()</span>
<span class="k">assert</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">is_reversed</span>

<span class="k">for</span> <span class="n">attr_to_graph</span> <span class="ow">in</span> <span class="n">ATTRIBUTES_TO_GRAPH_SPEC</span><span class="p">:</span>
    
    <span class="n">data_handling</span><span class="p">.</span><span class="n">add_results_table_to_graph</span><span class="p">(</span>
        <span class="n">napistu_graph</span><span class="p">,</span>
        <span class="n">sbml_dfs</span><span class="p">,</span>
        <span class="n">attribute_names</span> <span class="o">=</span> <span class="n">attr_to_graph</span><span class="p">[</span><span class="s">"attribute_names"</span><span class="p">],</span>
        <span class="n">table_name</span> <span class="o">=</span> <span class="n">attr_to_graph</span><span class="p">[</span><span class="s">"table_name"</span><span class="p">],</span>
        <span class="n">transformation</span> <span class="o">=</span> <span class="n">attr_to_graph</span><span class="p">[</span><span class="s">"transformation"</span><span class="p">],</span>
        <span class="n">custom_transformations</span> <span class="o">=</span> <span class="n">CUSTOM_TRANSFORMATIONS</span>
    <span class="p">)</span>
</code></pre></div></div>

<h1 id="network-propagation-with-personalized-pagerank-ppr">Network propagation with Personalized PageRank (PPR)</h1>

<p>To link gene-level changes to common regulators, I’ll apply personalized
PageRank (PPR) to each vertex attribute in the NapistuGraph.
Conceptually, PageRank begins with a signal on a random vertex that, at
each step, either:</p>

<ul>
  <li>Moves to a connected child node with probability $\alpha$</li>
  <li>Resets to a random vertex with probability $1-\alpha$</li>
</ul>

<p>In PPR, the reset step is biased — rather than resetting to any vertex
uniformly, it follows a user-defined probability distribution, often
weighted by input signals like gene dysregulation scores. Repeating this
random walk causes the signal to accumulate at hub vertices — nodes
central to the input signal.</p>
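<p>As a minimal sketch (not the Napistu or igraph implementation), PPR can be
computed by iterating the update $x \leftarrow \alpha P^\top x + (1-\alpha) r$,
where $P$ is the row-stochastic transition matrix and $r$ the reset
distribution:</p>

```python
import numpy as np

def personalized_pagerank(A, reset, damping=0.85, n_iter=200):
    """Power iteration for PPR on a small dense adjacency matrix.

    A[i, j] = 1 encodes an edge i -> j; reset holds non-negative
    per-vertex reset weights (normalized internally).
    """
    out_deg = A.sum(axis=1, keepdims=True)
    # row-stochastic transition matrix; dangling rows stay all-zero,
    # so some mass leaks and the result sums to slightly less than 1
    P = np.divide(A, out_deg, out=np.zeros_like(A, dtype=float), where=out_deg > 0)
    r = np.asarray(reset, dtype=float)
    r = r / r.sum()
    x = r.copy()
    for _ in range(n_iter):
        # continue the walk with probability damping, otherwise reset to r
        x = damping * (P.T @ x) + (1 - damping) * r
    return x

# chain 0 -> 1 -> 2 with all reset mass on vertex 0:
# signal decays geometrically downstream of the seeded vertex
A = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
scores = personalized_pagerank(A, [1.0, 0.0, 0.0])
```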

<p>The actual PageRank algorithm finds the stationary distribution of this
process using some slick linear algebra: power iteration combined with
sparse matrix storage and operations. This
excellent <a href="https://www.r-bloggers.com/2014/04/from-random-walks-to-personalized-pagerank/">blog
post</a>
by Stefan Weigert offers a clear and intuitive overview of PPR. For
example, here is Stefan’s visualization of a random walk following the
PPR process which really nails the intuition for me:</p>

<p><img src="https://i2.wp.com/1.bp.blogspot.com/-5DGkqiLF87U/Uzqm0Vah16I/AAAAAAAABIE/aPgVRreUvts/s1600/g4.gif" alt="Personalized pagerank
animation" /></p>

<h3 id="applying-ppr">Applying PPR</h3>

<p>I’ve prepared the conditions for PPR by defining attributes as reset
probability distributions (after L1 normalization). To ensure
non-negativity and highlight signal strength, I applied ad hoc
transformations — like converting $\log_{10}(\text{p-values})$ to
$-\log_{10}(\text{p-values})$ and squaring effect sizes. While these
choices are reasonable, ideally such transformations would be learned
rather than preset.</p>
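<p>With hypothetical attribute values, the transformation and L1-normalization
steps might look like this sketch (in the real pipeline these are handled via
the <code class="language-plaintext highlighter-rouge">transformation</code> argument of
<code class="language-plaintext highlighter-rouge">add_results_table_to_graph()</code>):</p>

```python
import numpy as np
import pandas as pd

# hypothetical per-vertex signals (not the real species_data table)
attrs = pd.DataFrame({
    "log10_pvalue": np.log10([0.001, 0.5, 0.04]),
    "effect_size": [-2.0, 0.1, 1.5],
})

# ad hoc transformations: non-negative, larger = stronger signal
transformed = pd.DataFrame({
    "neg_log10_pvalue": -attrs["log10_pvalue"],
    "squared_effect": attrs["effect_size"] ** 2,
})

# L1-normalize each column so it can serve as a PPR reset distribution
reset_distributions = transformed / transformed.sum(axis=0)
```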

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">annotated_vertices</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">get_vertex_dataframe</span><span class="p">()</span>

<span class="c1"># find valid attributes: numeric with more than one distinct value
</span><span class="n">invalid_attributes</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">annotated_vertices</span><span class="p">.</span><span class="n">columns</span> <span class="k">if</span> <span class="n">annotated_vertices</span><span class="p">[</span><span class="n">x</span><span class="p">].</span><span class="n">dtype</span> <span class="ow">not</span> <span class="ow">in</span> <span class="p">[</span><span class="s">"float64"</span><span class="p">,</span> <span class="s">"int64"</span><span class="p">]</span> <span class="ow">or</span> <span class="n">annotated_vertices</span><span class="p">[</span><span class="n">x</span><span class="p">].</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">valid_attributes</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">annotated_vertices</span><span class="p">.</span><span class="n">columns</span> <span class="k">if</span> <span class="n">x</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">invalid_attributes</span><span class="p">]</span>

<span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Invalid attributes: </span><span class="si">{</span><span class="n">invalid_attributes</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Valid attributes: </span><span class="si">{</span><span class="n">valid_attributes</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># create masks for each modality
</span><span class="k">assert</span> <span class="nb">all</span><span class="p">(</span><span class="n">x</span> <span class="ow">in</span> <span class="n">valid_attributes</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">REGEXES_TO_MASKS</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>
<span class="n">valid_attributes</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">valid_attributes</span><span class="p">)</span> <span class="o">-</span> <span class="nb">set</span><span class="p">(</span><span class="n">REGEXES_TO_MASKS</span><span class="p">.</span><span class="n">values</span><span class="p">()))</span>

<span class="n">ppr_results</span> <span class="o">=</span> <span class="n">net_propagation</span><span class="p">.</span><span class="n">net_propagate_attributes</span><span class="p">(</span>
    <span class="n">napistu_graph</span><span class="p">,</span>
    <span class="n">attributes</span> <span class="o">=</span> <span class="n">valid_attributes</span><span class="p">,</span>
    <span class="n">propagation_method</span> <span class="o">=</span> <span class="s">"personalized_pagerank"</span><span class="p">,</span>
    <span class="n">additional_propagation_args</span> <span class="o">=</span> <span class="p">{</span> <span class="s">"damping"</span><span class="p">:</span> <span class="mf">0.85</span> <span class="p">}</span>
<span class="p">)</span>
</code></pre></div></div>

<h4 id="controlling-for-pprs-biases">Controlling for PPR’s biases</h4>

<p>While PPR reveals convergence points of biological signals in a network,
it is inherently biased toward highly connected hub nodes — an effect
evident when examining vertices with the highest median PPR values
across attributes.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_5_by_median_ppr</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">ppr_results</span><span class="p">.</span><span class="n">median</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)[</span><span class="mi">0</span><span class="p">:</span><span class="mi">5</span><span class="p">]</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="s">"median PPR"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">to_frame</span><span class="p">()</span>
    <span class="p">.</span><span class="n">merge</span><span class="p">(</span>
        <span class="n">annotated_vertices</span><span class="p">[[</span><span class="s">"name"</span><span class="p">,</span> <span class="s">"node_name"</span><span class="p">]],</span>
        <span class="n">left_index</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
        <span class="n">right_on</span> <span class="o">=</span> <span class="s">"name"</span>
    <span class="p">)</span>
<span class="p">)</span>
<span class="n">top_5_by_median_ppr</span><span class="p">[</span><span class="s">"degree"</span><span class="p">]</span> <span class="o">=</span> <span class="n">napistu_graph</span><span class="p">.</span><span class="n">degree</span><span class="p">(</span><span class="n">top_5_by_median_ppr</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">top_5_by_median_ppr</span><span class="p">,</span> <span class="s">"{:.2e}"</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>

<span class="n">display_tabulator</span><span class="p">(</span><span class="n">top_5_by_median_ppr</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span> <span class="o">=</span> <span class="bp">True</span><span class="p">),</span> <span class="n">width</span><span class="o">=</span><span class="s">"500px"</span><span class="p">)</span>
</code></pre></div></div>

<div class="data-table" style="width: 500px; display: inline-block;" data-table="[{&quot;median PPR&quot;: &quot;7.35e-04&quot;, &quot;name&quot;: &quot;SC00003547&quot;, &quot;node_name&quot;: &quot;ACTB&quot;, &quot;degree&quot;: &quot;1.37e+04&quot;}, {&quot;median PPR&quot;: &quot;5.49e-04&quot;, &quot;name&quot;: &quot;SC00025851&quot;, &quot;node_name&quot;: &quot;GAPDH&quot;, &quot;degree&quot;: &quot;1.04e+04&quot;}, {&quot;median PPR&quot;: &quot;4.86e-04&quot;, &quot;name&quot;: &quot;SC00004948&quot;, &quot;node_name&quot;: &quot;TP53 gene&quot;, &quot;degree&quot;: &quot;9.41e+03&quot;}, {&quot;median PPR&quot;: &quot;4.61e-04&quot;, &quot;name&quot;: &quot;SC00003685&quot;, &quot;node_name&quot;: &quot;LRRK2&quot;, &quot;degree&quot;: &quot;9.40e+03&quot;}, {&quot;median PPR&quot;: &quot;4.09e-04&quot;, &quot;name&quot;: &quot;SC00004540&quot;, &quot;node_name&quot;: &quot;INS gene&quot;, &quot;degree&quot;: &quot;7.92e+03&quot;}]" data-columns="[{&quot;title&quot;: &quot;median PPR&quot;, &quot;field&quot;: &quot;median PPR&quot;}, {&quot;title&quot;: &quot;name&quot;, &quot;field&quot;: &quot;name&quot;}, {&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;}, {&quot;title&quot;: &quot;degree&quot;, &quot;field&quot;: &quot;degree&quot;}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>To correct for this, we must address two sources of bias:</p>

<ol>
  <li><strong>Topological Bias</strong> - Dense network regions with high in-degree
nodes tend to attract signals regardless of biological relevance —
a common issue in network analysis. Napistu offers several null
distributions to address this, the most relevant being:
    <ul>
      <li><em>vertex_permutation</em>: A non-parametric method that shuffles
signals across vertices.</li>
      <li><em>parametric_null</em>: A parametric method that models nulls using a
distribution derived from the observed signal. For example,
binary data may be modeled with a Bernoulli null distribution.</li>
    </ul>
  </li>
  <li><strong>Ascertainment Bias</strong> - This bias occurs because experiments
measure only a subset of the network. For instance, metabolomics
data focus on central carbon metabolism simply because it was
measured. To address this, I limited null permutations to resample
only from vertices with measured values (e.g., transcriptomics or
proteomics nodes identified by modality masks).</li>
</ol>

<h4 id="building-null-distributions">Building null distributions</h4>

<p>To robustly assess signal enrichment, I generated 500 null PPR
distributions by shuffling reset probabilities among measured vertices
(using modality masks). Each vertex’s observed PPR value was compared to
its null distribution, allowing its empirical quantile to serve as a
non-parametric p-value.</p>

<p>Because biological signals can both enrich and deplete regions, I
separately characterized signals in the right (<em>enrichment</em>) and left
(<em>depletion</em>) tails by:</p>

<ul>
  <li>Computing two-tailed p-values where either strong enrichment or
depletion would result in a small p-value</li>
  <li>Assigning features as enriched or depleted based on their quantile
(greater than or less than 0.5)</li>
  <li>Applying FDR correction separately to enriched and depleted sets</li>
</ul>
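<p>For a single vertex, this procedure reduces to a few lines (toy numbers; the
actual <code class="language-plaintext highlighter-rouge">quantile_to_pvalue</code> and
<code class="language-plaintext highlighter-rouge">floor_pvalue_by_resolution</code>
helpers are used in the code below):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# 500 permutation-null PPR scores for one vertex, plus its observed score
null_ppr = rng.normal(loc=1e-4, scale=2e-5, size=500)
observed_ppr = 1.8e-4

# empirical quantile: fraction of null samples below the observed value
quantile = (null_ppr < observed_ppr).mean()

# two-tailed p-value: extremes in either tail become small
p_value = 2 * min(quantile, 1 - quantile)
is_enriched = quantile > 0.5

# floor zero p-values at the resolution of the null distribution
p_value = max(p_value, 1 / 500)
```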

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">attr_masks</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">attr</span> <span class="ow">in</span> <span class="n">valid_attributes</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">regex</span><span class="p">,</span> <span class="n">mask</span> <span class="ow">in</span> <span class="n">REGEXES_TO_MASKS</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
        <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">regex</span><span class="p">,</span> <span class="n">attr</span><span class="p">):</span>
            <span class="n">attr_masks</span><span class="p">[</span><span class="n">attr</span><span class="p">]</span> <span class="o">=</span> <span class="n">mask</span>
            <span class="k">break</span>

    <span class="k">if</span> <span class="n">attr</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">attr_masks</span><span class="p">:</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Could not find a modality-specific mask for </span><span class="si">{</span><span class="n">attr</span><span class="si">}</span><span class="s">; using </span><span class="si">{</span><span class="n">attr</span><span class="si">}</span><span class="s"> as its own mask"</span><span class="p">)</span>
        <span class="c1"># default behavior is to use the attribute as its own mask, but adding it anyway to be explicit
</span>        <span class="n">attr_masks</span><span class="p">[</span><span class="n">attr</span><span class="p">]</span> <span class="o">=</span> <span class="n">attr</span>
        
<span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">PPR_NULL_TMP_PATH</span><span class="p">)</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">OVERWRITE</span><span class="p">:</span>
    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading PPR nulls from cache at </span><span class="si">{</span><span class="n">PPR_NULL_TMP_PATH</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">ppr_with_nulls</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">PPR_NULL_TMP_PATH</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Calibrating PPR enrichments by permuting vertex attributes among masked vertices"</span><span class="p">)</span>
    <span class="n">ppr_with_nulls</span> <span class="o">=</span> <span class="n">net_propagation</span><span class="p">.</span><span class="n">network_propagation_with_null</span><span class="p">(</span>
        <span class="n">napistu_graph</span><span class="p">,</span>
        <span class="n">attributes</span> <span class="o">=</span> <span class="n">valid_attributes</span><span class="p">,</span>
        <span class="n">mask</span> <span class="o">=</span> <span class="n">attr_masks</span><span class="p">,</span>
        <span class="n">propagation_method</span> <span class="o">=</span> <span class="s">"personalized_pagerank"</span><span class="p">,</span>
        <span class="n">additional_propagation_args</span> <span class="o">=</span> <span class="p">{</span> <span class="s">"damping"</span><span class="p">:</span> <span class="mf">0.85</span> <span class="p">},</span>
        <span class="n">null_strategy</span> <span class="o">=</span> <span class="s">"vertex_permutation"</span><span class="p">,</span>
        <span class="n">n_samples</span> <span class="o">=</span> <span class="n">N_NULL_SAMPLES</span><span class="p">,</span>
        <span class="n">verbose</span> <span class="o">=</span> <span class="bp">True</span>    
    <span class="p">)</span>   

    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Saving PPR nulls to cache at </span><span class="si">{</span><span class="n">PPR_NULL_TMP_PATH</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">ppr_with_nulls</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">PPR_NULL_TMP_PATH</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="comparing-subgraph-enrichments-and-depletions">Comparing subgraph enrichments and depletions</h3>

<p>Next, I will calculate p-values and q-values stratified by attribute and
by enrichment versus depletion.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># name index to vertex_id
</span><span class="n">tall_ppr_enrichments</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">ppr_with_nulls</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"index"</span><span class="p">:</span> <span class="s">"vertex_name"</span><span class="p">})</span>
    <span class="p">.</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s">"vertex_name"</span><span class="p">],</span> <span class="n">var_name</span><span class="o">=</span><span class="s">"attribute"</span><span class="p">,</span> <span class="n">value_name</span><span class="o">=</span><span class="s">"ppr_null_quantile"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">p_value</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">hypothesis_testing</span><span class="p">.</span><span class="n">quantile_to_pvalue</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">"ppr_null_quantile"</span><span class="p">],</span> <span class="s">"two-tailed"</span><span class="p">))</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">is_enriched</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">"ppr_null_quantile"</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mf">0.5</span><span class="p">)</span>
    <span class="c1"># correct for 0 p-values by flooring based on the # of null samples
</span>    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">p_value</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">floor_pvalue_by_resolution</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">"p_value"</span><span class="p">],</span> <span class="n">N_NULL_SAMPLES</span><span class="p">))</span>
    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">nulls_gt_observed</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">((</span><span class="mi">1</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="s">"ppr_null_quantile"</span><span class="p">])</span><span class="o">*</span><span class="n">N_NULL_SAMPLES</span><span class="p">).</span><span class="n">fillna</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">))</span>
<span class="p">)</span>

<span class="c1"># combine observed and null summaries
</span><span class="n">tall_ppr_results</span> <span class="o">=</span>  <span class="p">(</span>
    <span class="n">ppr_results</span>
    <span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"index"</span><span class="p">:</span> <span class="s">"vertex_name"</span><span class="p">})</span>
    <span class="p">.</span><span class="n">melt</span><span class="p">(</span><span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s">"vertex_name"</span><span class="p">],</span> <span class="n">var_name</span><span class="o">=</span><span class="s">"attribute"</span><span class="p">,</span> <span class="n">value_name</span><span class="o">=</span><span class="s">"ppr_score"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">tall_ppr_enrichments</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s">"vertex_name"</span><span class="p">,</span> <span class="s">"attribute"</span><span class="p">])</span>
    <span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">"p_value"</span><span class="p">])</span>
<span class="p">)</span>

<span class="n">fdr_controlled_results</span> <span class="o">=</span> <span class="n">multi_model_fitting</span><span class="p">.</span><span class="n">control_fdr</span><span class="p">(</span>
    <span class="n">tall_ppr_results</span><span class="p">,</span>
    <span class="n">grouping_vars</span> <span class="o">=</span> <span class="p">[</span><span class="s">"attribute"</span><span class="p">,</span> <span class="s">"is_enriched"</span><span class="p">],</span>
    <span class="n">require_groups</span> <span class="o">=</span> <span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Now, we can compare the distributions of enrichment and depletion
p-values aggregating over all attributes.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot_ppr_enrichment_histograms</span><span class="p">(</span><span class="n">fdr_controlled_results</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-27-napistu_network_propagation/plot_ppr_enrichment_histograms-output-1.png" alt="" /></p>

<p>From these p-value histograms I can see that there are vertices that
are enriched and others that are depleted for my signals, with the
depletions being particularly pronounced. This asymmetry is why I
stratified enrichments and depletions when calculating FDR.</p>
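<p>A minimal stratified Benjamini-Hochberg sketch (my own toy implementation,
not <code class="language-plaintext highlighter-rouge">multi_model_fitting.control_fdr</code>)
shows the mechanics:</p>

```python
import numpy as np
import pandas as pd

def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values for a 1-D collection of p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotone non-decreasing q-values in p-value order
    q = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(q, 0, 1)
    return out

results = pd.DataFrame({
    "p_value": [0.001, 0.02, 0.8, 0.003, 0.04, 0.9],
    "is_enriched": [True, True, True, False, False, False],
})
# stratify the correction so the dominant tail does not mask the other
results["q_value"] = results.groupby("is_enriched")["p_value"].transform(bh_qvalues)
```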

<h3 id="identifying-enriched-subgraphs">Identifying enriched subgraphs</h3>

<p>To explore the subnetworks enriched for each attribute, I will filter
the data to include only vertices significantly enriched for a given
attribute (q &lt; 0.1). I will then count the number of enriched vertices
for each attribute:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_enriched_vertices</span> <span class="o">=</span> <span class="n">fdr_controlled_results</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"is_enriched == True"</span><span class="p">).</span><span class="n">query</span><span class="p">(</span><span class="s">"q_value &lt; 0.1"</span><span class="p">).</span><span class="n">value_counts</span><span class="p">(</span><span class="s">"attribute"</span><span class="p">)</span>

<span class="c1"># add back zeros
</span><span class="n">missing_attributes</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">fdr_controlled_results</span><span class="p">[</span><span class="s">"attribute"</span><span class="p">].</span><span class="n">unique</span><span class="p">())</span> <span class="o">-</span> <span class="nb">set</span><span class="p">(</span><span class="n">n_enriched_vertices</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="n">missing_attributes</span>

<span class="n">n_enriched_vertices</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span>
        <span class="n">n_enriched_vertices</span><span class="p">,</span>
        <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">({</span><span class="n">attr</span><span class="p">:</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">attr</span> <span class="ow">in</span> <span class="n">missing_attributes</span><span class="p">}).</span><span class="n">rename</span><span class="p">(</span><span class="s">"count"</span><span class="p">)</span>
    <span class="p">])</span>
<span class="p">)</span>

<span class="c1"># reformat the attributes to include modality and measure
</span><span class="n">attr_metadata</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">n_enriched_vertices</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">():</span>
    <span class="k">for</span> <span class="n">mod</span> <span class="ow">in</span> <span class="n">MODALITIES</span><span class="p">:</span>
        <span class="c1"># match str
</span>        <span class="k">if</span> <span class="n">re</span><span class="p">.</span><span class="n">search</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span> <span class="n">key</span><span class="p">):</span>
            <span class="n">attr_metadata</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
                <span class="s">"modality"</span> <span class="p">:</span> <span class="n">mod</span><span class="p">,</span>
                <span class="s">"variable"</span> <span class="p">:</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">f</span><span class="s">"_</span><span class="si">{</span><span class="n">mod</span><span class="si">}</span><span class="s">$"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">key</span><span class="p">)</span>
            <span class="p">}</span>
            <span class="k">break</span>

    <span class="k">if</span> <span class="n">key</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">attr_metadata</span><span class="p">:</span>
        <span class="n">attr_metadata</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"modality"</span> <span class="p">:</span> <span class="s">"unknown"</span><span class="p">,</span>
            <span class="s">"variable"</span> <span class="p">:</span> <span class="n">key</span>
        <span class="p">}</span>

<span class="n">attr_metadata_df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">attr_metadata</span><span class="p">).</span><span class="n">T</span>
    <span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">VAR_METADATA</span><span class="p">,</span> <span class="n">on</span> <span class="o">=</span> <span class="s">"variable"</span><span class="p">,</span> <span class="n">how</span> <span class="o">=</span> <span class="s">"left"</span><span class="p">)</span>
<span class="p">).</span><span class="n">assign</span><span class="p">(</span><span class="n">attribute</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">row</span><span class="p">[</span><span class="s">'variable'</span><span class="p">]</span><span class="si">}</span><span class="s">_</span><span class="si">{</span><span class="n">row</span><span class="p">[</span><span class="s">'modality'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)).</span><span class="n">set_index</span><span class="p">(</span><span class="s">"attribute"</span><span class="p">)</span>

<span class="n">attr_signif_counts</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span>
    <span class="p">[</span>
        <span class="n">n_enriched_vertices</span><span class="p">,</span>
        <span class="n">attr_metadata_df</span>
    <span class="p">],</span>
    <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span>
<span class="p">)</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">create_stacked_barplot_seaborn</span><span class="p">(</span><span class="n">attr_signif_counts</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-27-napistu_network_propagation/discovery_counts-output-1.png" alt="" /></p>

<p>This plot shows that setting vertex reset probabilities to
$-\log_{10}(\text{q-values})$ produces the largest subgraph of upstream
enriched vertices. A similar result was observed with a hard
thresholding approach, where vertices with q &gt; 0.1 were assigned a
reset probability of 0. Notably, although q-values are a monotonic
transformation of p-values, propagation with
$-\log_{10}(\text{q-values})$ diverges substantially from propagation
with $-\log_{10}(\text{p-values})$.</p>
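<p>To make the two transformations concrete, here is a minimal sketch (the helper name is my own, not part of the pipeline above) of deriving a reset vector from q-values, either continuously via $-\log_{10}$ or via hard thresholding:</p>

```python
import numpy as np

def reset_probabilities(q_values, method="log10", threshold=0.1):
    """Hypothetical helper: turn per-vertex q-values into a normalized
    personalized PageRank reset vector.

    method="log10":     continuous weights proportional to -log10(q)
    method="threshold": uniform weight for q <= threshold, zero otherwise
    """
    q = np.clip(np.asarray(q_values, dtype=float), 1e-300, 1.0)
    if method == "log10":
        weights = -np.log10(q)
    elif method == "threshold":
        weights = (q <= threshold).astype(float)
    else:
        raise ValueError(f"unknown method: {method}")
    if weights.sum() == 0:
        raise ValueError("no vertex received a nonzero reset probability")
    # reset vectors must sum to 1
    return weights / weights.sum()

continuous = reset_probabilities([0.001, 0.05, 0.5])
hard = reset_probabilities([0.001, 0.05, 0.5], method="threshold")
```

<p>Both transformations preserve the ranking of vertices, but the continuous version spreads reset mass in proportion to evidence while thresholding concentrates it uniformly on the significant set, which is one reason the two can propagate quite differently.</p>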

<p>So, which metric is more trustworthy? I lean toward p-values: if
everything becomes significant, the result is as uninformative as if
nothing were. The strong enrichment for proteomics run order in the
transcriptomics data, despite minimal nominal significance, is a red
flag that q-values may be inappropriate in this context.</p>
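<p>To see why q-values can behave this way, consider a minimal Benjamini–Hochberg sketch (my own illustration, not the post’s code): distinct small p-values often collapse onto identical q-values, flattening exactly the signal that $-\log_{10}(\text{q-values})$ is meant to propagate.</p>

```python
import numpy as np

def bh_qvalues(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # q_i = p_i * n / rank_i, then enforce monotonicity from the top down
    scaled = p[order] * n / np.arange(1, n + 1)
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

# three distinct p-values collapse onto a single q-value
q = bh_qvalues([0.001, 0.002, 0.003, 0.9])
```

<p>Here <code>q</code> is <code>[0.004, 0.004, 0.004, 0.9]</code>: the first three p-values become indistinguishable after adjustment, which produces the ties and saturation discussed above.</p>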

<div class="content-section ai-aside">
  <div class="section-content">
    <p>Choosing appropriate transformations
for network propagation underscores a fundamental challenge in
biological network analysis: the lack of clear benchmarks. Unlike many
machine learning domains, we often lack ground truth, making it
difficult to select hyperparameters or validate methodological choices
systematically.</p>

<p>Supervised machine learning offers a path forward by reframing ambiguous
biological questions as well-defined prediction tasks, yielding two key
advantages. First, machine learning methods can potentially capture
functional relationships more accurately than algorithms like
personalized PageRank — provided the training data and prediction
tasks reflect meaningful biological objectives. The increasing
availability of Perturb-seq data, for example, enables direct prediction
of regulatory relationships from expression signatures. Second, even
when using traditional network methods, supervised tasks can inform
hyperparameter tuning and network design in contexts where ground truth
is scarce.</p>

<p>For example, if a network representation accurately recovers masked
protein–protein interactions, it suggests that the underlying data
integration and edge-weighting strategies capture meaningful biology.
These insights can then refine traditional network analyses, such as the
personalized PageRank approach used here. This creates a virtuous cycle
in which machine learning tasks guide the construction of better
networks, which in turn enhance the effectiveness of both ML and non-ML
methods for biological discovery.</p>

  </div>
</div>

<h2 id="interpreting-network-enrichments">Interpreting network enrichments</h2>

<p>To summarize the strongest network-based enrichments, I took the <strong>union
of the top five enriched vertices</strong> per attribute and counted how many of
the 500 null permutations exceeded each vertex’s observed enrichment
signal (so 0 marks a signal stronger than every permutation). (This
is akin to comparing the rank of a top signal for one attribute against
its ranks across others. However, because the coarse-grained empirical
nulls create many tied ranks, with abrupt jumps from 1 to hundreds
or thousands, the raw ranks are difficult to interpret.)</p>
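<p>The per-vertex statistic reported below can be sketched as follows (synthetic numbers; in the pipeline the null comes from 500 permuted reset vectors and the column is named <code>nulls_gt_observed</code>):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: an observed PPR score for one vertex and its
# empirical null distribution from 500 permutations.
null_scores = rng.normal(loc=0.01, scale=0.002, size=500)
observed = 0.02

# How many null permutations strictly exceed the observed score
# (0 means the observed signal matched or beat every permutation).
nulls_gt_observed = int((null_scores > observed).sum())

# Corresponding empirical p-value; the +1 keeps it strictly positive.
empirical_p = (nulls_gt_observed + 1) / (null_scores.size + 1)
```

<p>Small counts are directly comparable across attributes in a way that raw ranks, with their abrupt tie-driven jumps, are not.</p>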

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ppr_signif_w_metadata</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">fdr_controlled_results</span>
    <span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">attr_metadata_df</span><span class="p">,</span> <span class="n">on</span> <span class="o">=</span> <span class="s">"attribute"</span><span class="p">,</span> <span class="n">how</span> <span class="o">=</span> <span class="s">"left"</span><span class="p">)</span>
    <span class="c1"># remove q-value based ranking
</span>    <span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">'summary != "q"'</span><span class="p">)</span>
    <span class="c1"># sort each attribute in ascending order by q-value and provide the rank, handling ties appropriately
</span>    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">rank</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">"attribute"</span><span class="p">,</span> <span class="s">"is_enriched"</span><span class="p">])[</span><span class="s">"q_value"</span><span class="p">].</span><span class="n">rank</span><span class="p">(</span><span class="n">method</span><span class="o">=</span><span class="s">"min"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    <span class="c1"># replace rank with . if q-value is above 0.1
</span>    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">rank</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">"rank"</span><span class="p">].</span><span class="n">where</span><span class="p">((</span><span class="n">x</span><span class="p">[</span><span class="s">"q_value"</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mf">0.1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">"is_enriched"</span><span class="p">]</span> <span class="o">==</span> <span class="bp">True</span><span class="p">),</span> <span class="s">"."</span><span class="p">))</span>
    <span class="c1"># similar approach but work with quantiles relative to null
</span>    <span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">display_nulls_gt_observed</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">"nulls_gt_observed"</span><span class="p">].</span><span class="n">where</span><span class="p">((</span><span class="n">x</span><span class="p">[</span><span class="s">"q_value"</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="mf">0.1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">"is_enriched"</span><span class="p">]</span> <span class="o">==</span> <span class="bp">True</span><span class="p">),</span> <span class="s">"."</span><span class="p">))</span>
<span class="p">)</span>

<span class="c1"># loop through
</span><span class="n">phenotypes_of_interest</span> <span class="o">=</span> <span class="n">PPR_LINEAR_PHENOTYPES</span>
<span class="k">for</span> <span class="n">phenotype</span> <span class="ow">in</span> <span class="n">phenotypes_of_interest</span><span class="p">:</span>

    <span class="n">phenotype_ppr_results</span> <span class="o">=</span> <span class="n">ppr_signif_w_metadata</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">ppr_signif_w_metadata</span><span class="p">[</span><span class="s">"phenotype"</span><span class="p">]</span> <span class="o">==</span> <span class="n">phenotype</span><span class="p">]</span>

    <span class="c1"># find top N vertices for each attribute 
</span>    <span class="n">top_phenotype_vertices</span> <span class="o">=</span> <span class="n">phenotype_ppr_results</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"q_value &lt; 0.1 &amp; is_enriched == True"</span><span class="p">).</span><span class="n">sort_values</span><span class="p">([</span><span class="s">"q_value"</span><span class="p">,</span> <span class="s">"ppr_score"</span><span class="p">],</span> <span class="n">ascending</span> <span class="o">=</span> <span class="p">[</span><span class="bp">True</span><span class="p">,</span> <span class="bp">False</span><span class="p">]).</span><span class="n">groupby</span><span class="p">(</span><span class="s">"attribute"</span><span class="p">).</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">)[</span><span class="s">"vertex_name"</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>

    <span class="n">top_phenotype_stats</span> <span class="o">=</span> <span class="n">phenotype_pivot</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">phenotype_ppr_results</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"vertex_name in @top_phenotype_vertices"</span><span class="p">).</span><span class="n">merge</span><span class="p">(</span>
            <span class="n">annotated_vertices</span><span class="p">[[</span><span class="s">"name"</span><span class="p">,</span> <span class="s">"node_name"</span><span class="p">,</span> <span class="s">"node_type"</span><span class="p">]],</span>
            <span class="n">left_on</span> <span class="o">=</span> <span class="s">"vertex_name"</span><span class="p">,</span>
            <span class="n">right_on</span> <span class="o">=</span> <span class="s">"name"</span>
        <span class="p">)</span>
        <span class="c1">#.assign(rank=lambda x: x["rank"].apply(lambda val: str(int(float(val))) if val != "." else val))
</span>        <span class="c1">#.pivot_table(index = ["modality", "summary"], columns = "node_name", values = "rank", aggfunc="first")
</span>        <span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span> <span class="o">=</span> <span class="p">[</span><span class="s">"modality"</span><span class="p">,</span> <span class="s">"summary"</span><span class="p">],</span> <span class="n">columns</span> <span class="o">=</span> <span class="s">"node_name"</span><span class="p">,</span> <span class="n">values</span> <span class="o">=</span> <span class="s">"display_nulls_gt_observed"</span><span class="p">,</span> <span class="n">aggfunc</span><span class="o">=</span><span class="s">"first"</span><span class="p">)</span>
        <span class="p">.</span><span class="n">T</span>
    <span class="p">)</span>

    <span class="c1"># Apply to your table
</span>    <span class="n">reordered_table</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">reorder_by_rank_sum</span><span class="p">(</span><span class="n">top_phenotype_stats</span><span class="p">)</span>
        <span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="s">"."</span><span class="p">)</span>
    <span class="p">)</span>

    <span class="n">display_tabulator</span><span class="p">(</span>
        <span class="n">reordered_table</span><span class="p">,</span>
        <span class="n">caption</span><span class="o">=</span><span class="sa">f</span><span class="s">"Association ranks for </span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
        <span class="n">wrap_columns</span><span class="o">=</span><span class="p">[</span><span class="s">"node_name"</span><span class="p">],</span>
        <span class="n">column_widths</span><span class="o">=</span><span class="p">{</span><span class="s">"node_name"</span> <span class="p">:</span> <span class="s">"50%"</span><span class="p">}</span>
    <span class="p">)</span>
    
    <span class="n">display</span><span class="p">(</span><span class="n">HTML</span><span class="p">(</span><span class="s">'&lt;div style="margin-bottom: 30px;"&gt;&lt;/div&gt;'</span><span class="p">))</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Association ranks for MMA_urine
</figcaption>

<div class="data-table" style="" data-table="[{&quot;node_name&quot;: &quot;IL6 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;IGF1&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;Cyclin D&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 1, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;NADH&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;NAD+&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;LGALS3 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;L-MM-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, 
&quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;SUCC-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;2xMMAA:2xMUT:AdoCbl&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;CDKN2A gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;Hydrogen peroxide&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 2, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;Ac-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 4, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;IL8 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: 
&quot;.&quot;}, {&quot;node_name&quot;: &quot;C-X-C motif chemokine ligand 6 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;BPTF&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;SPARC related modular calcium binding 2 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;retinoic acid receptor responder 2 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}]" data-columns="[{&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}, {&quot;title&quot;: &quot;proteomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;proteomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;proteomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;proteomics_stat&quot;}]}, {&quot;title&quot;: 
&quot;transcriptomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;transcriptomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;transcriptomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;transcriptomics_stat&quot;}]}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div style="margin-bottom: 30px;"></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Association ranks for case
</figcaption>

<div class="data-table" style="" data-table="[{&quot;node_name&quot;: &quot;cellular communication network factor 5 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;EDIL3 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;EGF gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;TNF&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;NADH&quot;, &quot;proteomics_est&quot;: 0, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;INS gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;NAD+&quot;, &quot;proteomics_est&quot;: 0, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, 
&quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;MITF-M dimer:EDIL3 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;thrombospondin type 1 domain containing 4 [cellular_component]&quot;, &quot;proteomics_est&quot;: 0, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;complement factor D [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 1}, {&quot;node_name&quot;: &quot;BDNF gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 1, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;Ub&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;L-Glu&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, 
&quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;Cx43:ZO-1:c-src gap junction&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;HAPLN1&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 2, &quot;transcriptomics_stat&quot;: 3}, {&quot;node_name&quot;: &quot;CDKN1A gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 1, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;Ac-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;sirtuin 5 [cellular_component]&quot;, &quot;proteomics_est&quot;: 0, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;EGFR gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 2}, {&quot;node_name&quot;: &quot;SIRT5:Zn2+&quot;, &quot;proteomics_est&quot;: 0, 
&quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;2OG&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}]" data-columns="[{&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}, {&quot;title&quot;: &quot;proteomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;proteomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;proteomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;proteomics_stat&quot;}]}, {&quot;title&quot;: &quot;transcriptomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;transcriptomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;transcriptomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;transcriptomics_stat&quot;}]}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div style="margin-bottom: 30px;"></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Association ranks for OHCblPlus
</figcaption>

<div class="data-table" style="" data-table="[{&quot;node_name&quot;: &quot;INS gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;inhibin subunit beta A [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;TNF&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;EGF gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;regulator of calcineurin 2 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;IL6 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 1, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;BDNF gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 
&quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 2, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;cellular communication network factor 5 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: 2, &quot;transcriptomics_stat&quot;: 2}, {&quot;node_name&quot;: &quot;IL1B gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 1, &quot;transcriptomics_log10p&quot;: 4, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;p-T2609,S2612,T2638,T2647-PRKDC:XRCC5:XRCC6:p-S645-DCLRE1C:DNA DSB ends&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;L-MM-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;NADH&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;9606.ENSP00000450353 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, 
&quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;2xMMAA:2xMUT:AdoCbl&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;p-T2609,S2612,T2638,T2647-PRKDC:XRCC5:XRCC6:p-S516,S645-DCLRE1C:DNA DSB ends&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;Ac-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;NAD+&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;EDIL3 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;2OG&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, 
&quot;transcriptomics_stat&quot;: &quot;.&quot;}]" data-columns="[{&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}, {&quot;title&quot;: &quot;proteomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;proteomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;proteomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;proteomics_stat&quot;}]}, {&quot;title&quot;: &quot;transcriptomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;transcriptomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;transcriptomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;transcriptomics_stat&quot;}]}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div style="margin-bottom: 30px;"></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Association ranks for responsive_to_acute_treatment
</figcaption>

<div class="data-table" style="" data-table="[{&quot;node_name&quot;: &quot;CDKN2A gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 3, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;Protonated Carbamino DeoxyHbA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;BPTF&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;CDKN1A gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;OxyHbA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;Hemoglobin A is protonated and carbamated causing release of oxygen&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;NOTCH1 gene&quot;, 
&quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;L-Glu&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;NADH&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 0, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;DNMT1 mRNA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 1}, {&quot;node_name&quot;: &quot;Hydrogen peroxide&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 1, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;INO80&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: 0, &quot;transcriptomics_stat&quot;: 1}, {&quot;node_name&quot;: &quot;Ac-CoA&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: 2, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, 
&quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;IL8 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;MMP1&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;VWF&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;CXCL1&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, {&quot;node_name&quot;: &quot;BCL6 gene&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: 0}, {&quot;node_name&quot;: &quot;NAD+&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: 0, &quot;transcriptomics_est&quot;: &quot;.&quot;, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}, 
{&quot;node_name&quot;: &quot;secreted frizzled related protein 4 [cellular_component]&quot;, &quot;proteomics_est&quot;: &quot;.&quot;, &quot;proteomics_log10p&quot;: &quot;.&quot;, &quot;proteomics_stat&quot;: &quot;.&quot;, &quot;transcriptomics_est&quot;: 0, &quot;transcriptomics_log10p&quot;: &quot;.&quot;, &quot;transcriptomics_stat&quot;: &quot;.&quot;}]" data-columns="[{&quot;title&quot;: &quot;node_name&quot;, &quot;field&quot;: &quot;node_name&quot;, &quot;formatter&quot;: &quot;textarea&quot;, &quot;variableHeight&quot;: true, &quot;width&quot;: &quot;50%&quot;}, {&quot;title&quot;: &quot;proteomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;proteomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;proteomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;proteomics_stat&quot;}]}, {&quot;title&quot;: &quot;transcriptomics&quot;, &quot;columns&quot;: [{&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;transcriptomics_est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;transcriptomics_log10p&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;transcriptomics_stat&quot;}]}]" data-options="{&quot;layout&quot;: &quot;fitColumns&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div style="margin-bottom: 30px;"></div>

<p>These results are fascinating:</p>

<ul>
  <li><strong>MUT</strong>, the major genetic cause of MMA, frequently appears.</li>
  <li>Both MUT’s substrate, <strong>methylmalonyl CoA</strong> (L-MM-CoA), and its
product, <strong>succinyl CoA</strong> (SUCC-CoA), are represented along with a
number of other metabolites discussed in the original study, such as
glutamine. Recovering metabolite associations is particularly
interesting because there was no metabolomics data on these cell
lines.</li>
  <li><strong>2-oxoglutarate</strong> (2OG), aka alpha-ketoglutarate, appears in
several subgraphs. It is the natural counterpart to
dimethyl-oxoglutarate, which Forny et al. demonstrated can rescue
the MMA-associated metabolic defect.</li>
</ul>

<div class="content-section bio-section">
  <div class="section-content">
    <p>Transcriptomics associations
frequently highlight growth-related genes (e.g., cyclins, <em>IGF1</em>,
EGF/R), suggesting that cell line growth rate may be confounded with
disease severity. This underscores the inherent biological variability
in these datasets. Ideally, doubling time should be included as a
covariate in future analyses.</p>

  </div>
</div>

<h3 id="visualizing-induced-subgraphs">Visualizing induced subgraphs</h3>

<p>To more comprehensively visualize these enriched vertices, I generated
induced subgraphs retaining both enriched vertices and their molecular
interactions. Below is an example focused on MMA urine proteomics:</p>

<p><img src="https://www.shackett.org/figure/napistu_ppr/log10_p_MMA_urine_proteomics_component_1.png" alt="Induced subgraph of network enriched for PPR signal upstream of MMA_urine proteins" style="width: 100%;" /></p>
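<p>Mechanically, inducing a subgraph just means keeping the enriched vertices and any edges that connect two of them. A minimal sketch in plain Python (the edge list and enrichment set here are illustrative toys, not drawn from the real genome-scale network):</p>

```python
# Toy edge list standing in for the genome-scale network
edges = [
    ("MUT", "L-MM-CoA"), ("MUT", "SUCC-CoA"),
    ("L-MM-CoA", "2OG"), ("SUCC-CoA", "2OG"),
    ("2OG", "L-Glu"), ("L-Glu", "H2O"),
]

# vertices that passed the permutation-based enrichment test
enriched = {"MUT", "L-MM-CoA", "SUCC-CoA", "2OG", "L-Glu"}

# induced subgraph: keep only edges whose endpoints are both enriched
induced_edges = [(u, v) for u, v in edges if u in enriched and v in enriched]
print(induced_edges)
```

<p>Vertices outside the enriched set (here, H2O) disappear, along with any edges touching them.</p>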

<p>This plot connects the enzymatic causes of MMA — deficiencies in
propionate metabolism — to related metabolic effects, such as
altered glutamine and glutamate levels, previously observed in MMA
patients. Beyond these reassuring associations, the analyses also
implicate additional regulatory pathways, including <strong>sirtuins and ROS
signaling</strong>. To investigate any of these regulators, the graph can be
traversed to generate causal, mechanistic hypotheses linking upstream
regulators to downstream molecular effects.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>We know far more about how
proteins shape metabolism than how metabolism influences proteins and
gene expression. Against this backdrop, it’s notable that sirtuins and
ROS signaling emerged — two of the more well-characterized pathways
involved in metabolic sensing. Sirtuins function as metabolic sensors by
using NAD+ as a cofactor for their histone deacetylase activity,
directly linking cellular energy status to chromatin remodeling and the
transcriptional regulation of metabolic genes. Reactive oxygen species
act as second messengers, modulating transcription factor activity
and epigenetic modifications to translate metabolic stress and
mitochondrial dysfunction into adaptive gene expression programs.</p>

<p>Nonetheless, the hub-like roles of NADH and other cofactors make them
challenging to interpret mechanistically. Take water as an example:
though not a cofactor, water participates in an enormous number of
reactions as a substrate or product. However, its flux through any
individual reaction is negligible compared to its large pool size, so
it’s rarely considered a regulated entity or regulator. To address this issue, water
is removed from all reactions during the <code class="language-plaintext highlighter-rouge">drop_cofactors</code> step of the
pathway build process. While this violates mass balance, it preserves
the regulatory intent of the <code class="language-plaintext highlighter-rouge">SBML_dfs</code> and <code class="language-plaintext highlighter-rouge">NapistuGraph</code>.</p>

<p>Handling NADH is more nuanced. In the <code class="language-plaintext highlighter-rouge">drop_cofactors</code> step, NAD+ and
NADH are removed from reactions only when both are present and NADH acts
as the substrate. The rationale is that many reactions consume small
amounts of energy (NADH → NAD+) without significantly affecting cellular
energetics, whereas energy production (NAD+ → NADH) is typically more
physiologically meaningful.</p>
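<p>These rules can be sketched as a small filter. This is a simplified
illustration of the logic described above, not Napistu’s actual
implementation; the names <code class="language-plaintext highlighter-rouge">ALWAYS_DROP</code>, <code class="language-plaintext highlighter-rouge">CONDITIONAL</code>, and <code class="language-plaintext highlighter-rouge">filter_cofactors</code> are hypothetical.</p>

```python
# Simplified illustration of the cofactor-filtering rules; names are
# hypothetical, not Napistu's actual API.
ALWAYS_DROP = {"H2O"}              # negligible flux relative to pool size
CONDITIONAL = [("NADH", "NAD+")]   # drop only when the reduced form is consumed

def filter_cofactors(substrates, products):
    """Return (substrates, products) with cofactor species removed."""
    subs = [s for s in substrates if s not in ALWAYS_DROP]
    prods = [p for p in products if p not in ALWAYS_DROP]
    for reduced, oxidized in CONDITIONAL:
        # drop the pair only when both are present and NADH is a substrate,
        # i.e., the reaction consumes rather than produces reducing power
        if reduced in subs and oxidized in prods:
            subs = [s for s in subs if s != reduced]
            prods = [p for p in prods if p != oxidized]
    return subs, prods

# energy-consuming reaction (NADH -> NAD+): the pair is removed
print(filter_cofactors(["NADH", "pyruvate"], ["NAD+", "lactate"]))
# energy-producing reaction (NAD+ -> NADH): the pair is retained
print(filter_cofactors(["NAD+", "malate"], ["NADH", "OAA"]))
```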

  </div>
</div>

<h2 id="summary-and-next-steps">Summary and next steps</h2>

<p>This analysis demonstrates how personalized PageRank on genome-scale
networks can transform statistical associations into mechanistic
biological insights. This approach:</p>

<ul>
  <li>✅ Recovered known regulators (e.g., <em>MMUT</em>)</li>
  <li>✅ Validated and expanded findings on glutamine/glutamate metabolism</li>
  <li>✅ Linked statistical results to coherent subgraphs of molecular
interactions</li>
  <li>✅ Identified new regulatory hypotheses (e.g., sirtuins, ROS
signaling)</li>
</ul>

<p>By tracing disease signals upstream, we uncover <strong>coordinated regulatory
modules</strong> driving MMA pathophysiology — insights that would be
difficult to glean from gene lists alone.</p>

<h3 id="what-napistu-provides">What Napistu provides</h3>

<p><strong>Napistu is a comprehensive genome-scale network biology framework
designed to bridge the gap between pathway databases and practical
analysis</strong>. It integrates diverse biological data sources — such as
<em>Reactome</em>, <em>BiGG</em>, <em>TRRUST</em>, and <em>STRING</em> — into unified network
representations that capture both metabolic and gene-centered regulatory
mechanisms. It addresses many of the complex data engineering challenges
that emerge when working with pathway data, including identifier
mapping, information consolidation, and translating biological pathways
into analysis-ready graph structures.</p>

<p>This allows researchers to focus on <strong>biological discovery</strong> rather than
data munging.</p>

<h3 id="methodology-takeaways">Methodology takeaways</h3>

<p>This post outlines a workflow for integrating multimodal genomics data
with genome-scale biological networks by:</p>

<ul>
  <li>Extracting feature-level data from <code class="language-plaintext highlighter-rouge">AnnData</code> or <code class="language-plaintext highlighter-rouge">MuData</code> objects
from the <code class="language-plaintext highlighter-rouge">scverse</code> project and adding them to a Napistu pathway
representation</li>
  <li>Transforming pathway-associated data into vertex attributes on a
genome-scale molecular network</li>
  <li>Aggregating signals using network propagation to identify subgraphs
enriched for biological signals</li>
</ul>
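<p>Conceptually, attaching feature-level statistics to a network amounts
to a left join of measurement tables onto the vertex table. A toy sketch
with pandas (all gene names and values here are hypothetical):</p>

```python
import pandas as pd

# Hypothetical feature-level statistics, e.g., from differential expression
feature_stats = pd.DataFrame({
    "gene": ["MMUT", "SUCLA2", "TCN2"],
    "log10p": [6.2, 3.1, 1.4],
})

# Hypothetical vertex table of a genome-scale network
vertices = pd.DataFrame({"vertex": ["MMUT", "SUCLA2", "ACSF3"]})

# left join: unmeasured vertices (ACSF3) get NaN, which downstream
# propagation can treat as zero signal; unmatched features (TCN2) drop out
annotated = vertices.merge(
    feature_stats, left_on="vertex", right_on="gene", how="left"
)
print(annotated[["vertex", "log10p"]])
```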

<p>A critical challenge addressed here is that network propagation methods,
like personalized PageRank, naturally concentrate signal in hub nodes.
This bias arises from two sources:</p>

<ul>
  <li><strong>Topological bias</strong> (due to network structure), and</li>
  <li><strong>Ascertainment bias</strong> (due to limited experimental coverage).</li>
</ul>

<p>To overcome this, I created modality-specific vertex permutation null
distributions, allowing me to distinguish genuine biological enrichments
from connectivity artifacts.</p>
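<p>A minimal sketch of this permutation scheme, using a toy graph and a
hand-rolled power-iteration PPR. The graph, seed placement, and damping
factor are illustrative; the real analysis runs on the genome-scale
network and restricts each null to the vertices measured by that
modality:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy adjacency matrix: vertex 0 is a hub connected to everyone else
A = np.array([
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
], dtype=float)
P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

def personalized_pagerank(seed_weights, damping=0.85, n_iter=100):
    """PPR by power iteration with restarts to the seed distribution."""
    s = seed_weights / seed_weights.sum()
    r = s.copy()
    for _ in range(n_iter):
        r = damping * (P.T @ r) + (1 - damping) * s
    return r

# observed scores: seed the walk at vertices 1 and 2 (the "hits")
seeds = np.array([0.0, 1.0, 1.0, 0.0, 0.0])
observed = personalized_pagerank(seeds)

# null distribution: permute which vertices carry seed weight, keeping
# the number of seeds fixed but randomizing their placement
null = np.stack([
    personalized_pagerank(rng.permutation(seeds)) for _ in range(1000)
])
empirical_p = (null >= observed).mean(axis=0)
print(np.round(empirical_p, 3))
```

<p>Because the hub accumulates PPR mass under the null as well, its observed score must beat the permuted scores to look enriched — which is exactly how the topological bias gets neutralized.</p>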

<p>Napistu’s disease-agnostic design supports any dataset with systematic
molecular identifiers. Its modular architecture allows researchers to
swap pathway sources, topologies, or propagation algorithms, making it
highly adaptable across diverse biological applications.</p>

<h3 id="try-it-yourself">Try it yourself!</h3>

<p>The <strong>Napistu framework</strong> and all associated analysis workflows are
<strong>open source</strong> and ready to use. The <strong>human consensus network</strong>
featured here is available for direct download from my public
repository, and the well-documented code serves as a robust template for
applying these methods to your own data.</p>

<p>Whether you’re investigating rare diseases, cancer, or complex traits,
Napistu provides a systematic pipeline that transforms statistical
associations into mechanistic insights.</p>

<p>🔗 Get started:
<a href="https://github.com/napistu/napistu">github.com/napistu/napistu</a></p>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="genomics" /><category term="python" /><category term="networks" /><summary type="html"><![CDATA[This is part two of a two-part series on Napistu — a new framework for building genome-scale molecular networks and integrating them with high-dimensional data. Using a methylmalonic acidemia (MMA) multimodal dataset as a case study, I’ll demonstrate how to distill disease-relevant signals into mechanistic insights through network-based analysis. From statistical associations to biological mechanisms Modern genomics excels at identifying disease-associated genes and proteins through statistical analysis. Methods like Gene Set Enrichment Analysis (GSEA) group these genes into functional categories, offering useful biological context. However, we aim to go beyond simply identifying which genes and gene sets change. Our goal is to understand why these genes change together, uncovering the mechanistic depth typically seen in Figure 1 of a Cell paper. To achieve this, we must identify key molecular components, summarize their interactions, and characterize the dynamic cascades that drive emergent biological behavior. In this post, I’ll demonstrate how to gain this insight by mapping statistical disease signatures onto genome-scale biological networks. Then, using personalized PageRank, I’ll trace signals from dysregulated genes back to their shared regulatory origins. 
This transforms lists of differentially expressed genes into interconnected modules that reveal upstream mechanisms driving coordinated molecular changes.]]></summary></entry><entry><title type="html">Network Biology with Napistu, Part 1: Creating Multimodal Disease Profiles</title><link href="https://www.shackett.org/multiomic_profiles/" rel="alternate" type="text/html" title="Network Biology with Napistu, Part 1: Creating Multimodal Disease Profiles" /><published>2025-08-19T00:00:00+00:00</published><updated>2025-08-19T00:00:00+00:00</updated><id>https://www.shackett.org/multiomic_profiles</id><content type="html" xml:base="https://www.shackett.org/multiomic_profiles/"><![CDATA[<p>This is part one of a two-part post highlighting
<strong><a href="https://github.com/napistu/napistu">Napistu</a></strong> — a new framework
for building genome-scale networks of molecular biology and
biochemistry. In this post, I’ll tackle a fundamental challenge in
computational biology: how to extract meaningful disease signatures from
complex multimodal datasets.</p>

<p>Using methylmalonic acidemia (MMA) as my test case, I’ll demonstrate how
to systematically extract disease signatures from multimodal data. My
approach combines three complementary analytical strategies: exploratory
data analysis to assess data structure and quality, differential
expression analysis to identify disease-associated features, and factor
analysis to uncover coordinated gene expression programs across data
types. The end goal is to distill thousands of molecular measurements
into a handful of interpretable disease signatures — each capturing a
distinct aspect of disease biology that can be mapped to regulatory
networks.</p>

<p>Throughout this post, I’ll use two types of asides to provide additional
context without disrupting the main analytical flow. Green boxes contain
biological details, while blue boxes reflect on the computational
workflow and AI-assisted development process.</p>

<!--more-->

<div class="content-section bio-section">
  <div class="section-content">
    <p><strong>For biologists</strong>: I identify a
previously unreported batch effect related to sample freezing dates that
impacts both data modalities. By accounting for this and other technical
covariates, I demonstrate improved power to detect disease-relevant
associations in the proteomics dataset. I also show that urine MMA
levels exhibit stronger statistical associations with molecular features
than traditional enzyme activity measures in this dataset. In part two,
I will link these statistical associations to upstream regulators using
a genome-scale mechanistic network, enabling me to define a regulon
underlying the molecular pathophysiology of MMA.</p>

  </div>
</div>

<div class="content-section ai-aside">
  <div class="section-content">
    <p><strong>For computational folks</strong>: While
developing this analysis and the underlying software, I intentionally
explored AI-assisted coding across the full development lifecycle —
from building statistical pipelines in Python to contributing to
Napistu, a complex scientific programming framework for network biology.
I’ll share specific examples where AI excelled (e.g., rapid prototyping,
API exploration) and where it failed in potentially dangerous ways
(e.g., confusing p-values with q-values, breaking regression formulas).
I’ll also offer practical strategies for maintaining code quality while
working at AI speed.</p>

  </div>
</div>

<h2 id="what-is-methylmalonic-acidemia">What is Methylmalonic acidemia?</h2>

<p>Methylmalonic acidemia (MMA) — not to be confused with mixed martial
arts (which I used to blog about at <a href="https://www.fightprior.com/">Fight
Prior</a>) — is an inborn error of
metabolism characterized by the buildup of methylmalonic acid.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>MMA is genetically heterogeneous,
caused by defects in approximately 20 genes involved in propionate
metabolism. Classical isolated MMA is primarily due to autosomal
recessive mutations in the enzyme methylmalonyl-CoA mutase (<em>MMUT</em>),
which converts methylmalonyl-CoA to succinyl-CoA as part of the
propionate catabolic pathway. This pathway processes metabolites derived
from odd-chain fatty acids, cholesterol, and certain amino acids. When
disrupted, methylmalonic acid accumulates, leading to metabolic acidosis
and multi-organ dysfunction, including neurological abnormalities,
kidney failure, and growth impairment.</p>

<p>Despite being linked to a well-characterized metabolic pathway, MMA
presents significant clinical and research challenges. Patients exhibit
substantial variability in disease severity, treatment responsiveness,
and clinical outcomes — even among those with identical genetic
mutations. While traditional biochemical assays provide valuable
diagnostic information, they offer only a partial view of disease
pathophysiology. This variability suggests that understanding MMA
requires moving beyond simple “broken enzyme” models toward
systems-level approaches capable of capturing the complex downstream
effects of metabolic disruption.</p>

  </div>
</div>

<h3 id="the-forny-multi-omics-approach">The Forny multi-omics approach</h3>

<p>To address the complexity of MMA, <a href="https://www.nature.com/articles/s42255-022-00720-8">Forny et al.,
2023</a> conducted one
of the largest multi-omics studies of a rare metabolic disorder. They
profiled 210 patient-derived fibroblast lines and 20 controls using
transcriptomics and proteomics to better understand disease etiology and
identify potential therapeutic interventions.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>The study achieved a molecular
diagnosis in 177 out of 210 cases (84%), though 33 patients remained
without a clear genetic explanation. While 148 patients had <em>MMUT</em>
mutations, others carried defects in <em>ACSF3</em>, <em>TCN2</em>, <em>SUCLA2</em>, and
additional genes — revealing broader genetic heterogeneity than
previously recognized. Beyond diagnostic insights, the study uncovered
unexpected disruption of TCA cycle anaplerosis (metabolic replenishment
pathways), particularly involving glutamine metabolism. It also
identified physical interactions between <em>MMUT</em> and other metabolic
enzymes, suggesting coordinated regulation, and demonstrated that
dimethyl-oxoglutarate treatment can restore metabolic function in
cellular models.</p>

<p>The findings suggest that MMA involves more than just a “broken enzyme.”
<em>MMUT</em> deficiency triggers a systematic rewiring of cellular metabolism,
especially in how cells replenish TCA cycle intermediates. This
anaplerotic shift includes increased reliance on glutamine metabolism
and appears to involve direct protein-protein interactions between
<em>MMUT</em> and glutamine-processing enzymes. These results indicate that
<em>MMUT</em> may function as part of a larger metabolic regulatory complex —
an unexpected insight with potential therapeutic implications.</p>

  </div>
</div>

<p>While the Forny study made significant advances, it also highlighted a
fundamental puzzle. MMA exhibits many features of a classical autosomal
recessive metabolic disorder: patients have well-defined enzymatic
defects, and some respond to metabolic interventions, such as vitamin
supplementation. Yet, the presence of treatment-resistant cases, highly
variable disease severity, and patients without clear genetic
explanations suggest that MMA pathophysiology extends beyond mere
metabolic dysfunction.</p>

<p>This paradox suggests that MMA may be influenced by broader regulatory
networks that either shape cellular metabolism or respond dynamically to
its disruption. If this is the case, understanding MMA requires moving
beyond metabolic modeling and adopting approaches that simultaneously
capture both metabolic and genic regulatory mechanisms. To address this
challenge, I aim to create molecular disease signatures that can be
mapped onto biological networks linking metabolic pathways to their
upstream regulatory controls. Rather than viewing MMA solely as a
metabolic disorder, this systems-level approach treats gene-centric and
metabolism-centric regulation as interconnected, enabling me to trace
disease signals from downstream molecular effects to potential upstream
drivers.</p>

<h2 id="strategy-for-creating-disease-profiles">Strategy for creating disease profiles</h2>

<p>To create interpretable disease signatures from the Forny dataset, I’ll
follow a systematic approach:</p>

<ol>
  <li><strong>Preprocessing</strong>: Address missing phenotype data and technical
covariates</li>
  <li><strong>Exploratory analysis</strong>: Identify batch effects and major sources
of variation</li>
  <li><strong>Supervised regression</strong>: Find individual disease-associated
features</li>
  <li><strong>Unsupervised factor analysis</strong>: Discover coordinated multi-omic
programs</li>
</ol>
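<p>The missing-data step can be illustrated with a hand-rolled
K-nearest-neighbors imputer (the actual analysis uses scikit-learn’s
<code class="language-plaintext highlighter-rouge">KNNImputer</code>, imported below; this toy numpy version exists only to show the idea):</p>

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in each row with the column mean across the k nearest
    rows (Euclidean distance over mutually observed columns)."""
    X = np.asarray(X, dtype=float)
    imputed = X.copy()
    for i in range(X.shape[0]):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        # distance from row i to every other row, using shared columns
        dists = []
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                d = np.sqrt(((X[i, shared] - X[j, shared]) ** 2).mean())
                dists.append((d, j))
        neighbors = [j for _, j in sorted(dists)[:k]]
        for col in np.where(missing)[0]:
            vals = [X[j, col] for j in neighbors if not np.isnan(X[j, col])]
            if vals:
                imputed[i, col] = np.mean(vals)
    return imputed

# toy phenotype matrix: sample 1 is missing its second measurement
pheno = np.array([
    [1.0, 10.0],
    [1.1, np.nan],
    [0.9, 11.0],
    [5.0, 50.0],
])
imputed = knn_impute(pheno, k=2)
print(imputed)
```

<p>Sample 1’s missing value is filled from its two most similar samples (0 and 2), not from the outlier sample 3 — the same locality principle the real imputer relies on.</p>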

<h3 id="ai-as-a-development-collaborator">AI as a development collaborator</h3>

<p>While I typically perform exploratory data analysis and statistical
analysis in R, I implemented this analysis in Python. This provided an
opportunity to explore AI-assisted development — not as a replacement
for careful software engineering, but as a collaborator for writing and
testing code more efficiently. Throughout the analysis, I’ll highlight
specific examples of where this approach succeeded and where it fell
short.</p>

<h1 id="creating-molecular-profiles-of-mma">Creating molecular profiles of MMA</h1>

<h2 id="environment-setup">Environment setup</h2>

<p>If you’d like to reproduce this analysis, follow these steps:</p>

<ol>
  <li>
    <p>Install <a href="https://docs.astral.sh/uv/#highlights">uv</a> (or just use
<code class="language-plaintext highlighter-rouge">pip</code> install)</p>
  </li>
  <li>
    <p>Setup a Python environment:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>uv venv <span class="nt">--python</span> 3.11
<span class="nb">source</span> .venv/bin/activate

<span class="c"># Core dependencies</span>
uv pip <span class="nb">install</span> <span class="s2">"napistu[scverse]==0.5.5"</span>
<span class="c"># Personal utilities package with genomics analysis functions</span>
uv pip <span class="nb">install</span> <span class="s2">"git+https://github.com/shackett/shackett-utils.git@v0.1.2[all]"</span> 
<span class="c"># Additional dependencies</span>
uv pip <span class="nb">install </span>openpyxl scikit-learn mofapy2 ipykernel nbformat nbclient
python <span class="nt">-m</span> ipykernel <span class="nb">install</span> <span class="nt">--user</span> <span class="nt">--name</span><span class="o">=</span>forny-2023
</code></pre></div>    </div>
  </li>
  <li>
    <p>Download <code class="language-plaintext highlighter-rouge">Source Data Fig.1</code> and <code class="language-plaintext highlighter-rouge">Source Data Fig.2</code> from <a href="https://www.nature.com/articles/s42255-022-00720-8#Sec39">Forny et
al., 2023</a></p>
  </li>
  <li>
    <p>Download the
<a href="https://github.com/shackett/shackett/blob/master/posts/posted/creating_multimodal_profiles.qmd"><code class="language-plaintext highlighter-rouge">creating_multimodal_profiles.qmd</code></a>
notebook</p>
  </li>
  <li>
    <p>Modify the following code block in your copy of the notebook to set
appropriate paths:</p>

    <p>a. <code class="language-plaintext highlighter-rouge">SUPPLEMENTAL_DATA_DIR</code> should point to the directory containing
    the download from step 3 (“42255_2022_720_MOESM3_ESM.xlsx” and
    “42255_2022_720_MOESM4_ESM.xlsx”)</p>

    <p>b. <code class="language-plaintext highlighter-rouge">CACHE_DIR</code> should point to a location where intermediate
    results and outputs can be saved</p>
  </li>
  <li>
    <p>Run the notebook and render an html output (or just open the
notebook in your browser):</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>quarto render creating_multimodal_profiles.qmd
</code></pre></div>    </div>
  </li>
</ol>

<p>First, I’ll load the necessary Python modules, configure paths, set
global parameters, and define utility functions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">textwrap</span>

<span class="kn">import</span> <span class="nn">anndata</span> <span class="k">as</span> <span class="n">ad</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">mudata</span> <span class="k">as</span> <span class="n">md</span>
<span class="kn">import</span> <span class="nn">muon</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">scanpy</span> <span class="k">as</span> <span class="n">sc</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">sklearn.impute</span> <span class="kn">import</span> <span class="n">KNNImputer</span>

<span class="c1"># this analysis is largely upstream of Napistu but we can use some of its utils
</span><span class="kn">from</span> <span class="nn">napistu</span> <span class="kn">import</span> <span class="n">utils</span> <span class="k">as</span> <span class="n">napistu_utils</span>

<span class="c1"># import local modules
</span><span class="kn">from</span> <span class="nn">shackett_utils.applications</span> <span class="kn">import</span> <span class="n">forny_imputation</span>
<span class="kn">import</span> <span class="nn">shackett_utils.genomics.adata_processing</span> <span class="k">as</span> <span class="n">processing</span>
<span class="kn">from</span> <span class="nn">shackett_utils.genomics</span> <span class="kn">import</span> <span class="n">adata_regression</span>
<span class="kn">from</span> <span class="nn">shackett_utils.genomics</span> <span class="kn">import</span> <span class="n">mdata_eda</span>
<span class="kn">from</span> <span class="nn">shackett_utils.genomics</span> <span class="kn">import</span> <span class="n">mdata_factor_analysis</span>
<span class="kn">from</span> <span class="nn">shackett_utils.statistics</span> <span class="kn">import</span> <span class="n">stats_viz</span>
<span class="kn">from</span> <span class="nn">shackett_utils.statistics</span> <span class="kn">import</span> <span class="n">transform</span>
<span class="kn">from</span> <span class="nn">shackett_utils.blog.html_utils</span> <span class="kn">import</span> <span class="n">display_tabulator</span>
<span class="kn">from</span> <span class="nn">shackett_utils.utils.pd_utils</span> <span class="kn">import</span> <span class="n">format_numeric_columns</span>

<span class="c1"># File paths and data organization
# All input data should be placed in the SUPPLEMENTAL_DATA_DIR
# Cached results and models will be stored in CACHE_DIR
</span>
<span class="c1"># paths
</span><span class="n">PROJECT_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s">"~/napistu_mma_posts"</span><span class="p">)</span>
<span class="n">SUPPLEMENTAL_DATA_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="s">"input"</span><span class="p">)</span>
<span class="n">CACHE_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">PROJECT_DIR</span><span class="p">,</span> <span class="s">"cache"</span><span class="p">)</span>

<span class="c1"># Define the path to save hyperparameter scan results
</span><span class="n">MOFA_PARAM_SCAN_MODELS_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">CACHE_DIR</span><span class="p">,</span> <span class="s">"mofa_param_scan_h5mu"</span><span class="p">)</span>
<span class="c1"># Final results 
</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">CACHE_DIR</span><span class="p">,</span> <span class="s">"mofa_optimal_model.h5mu"</span><span class="p">)</span>

<span class="c1"># formats
</span><span class="n">SUPPLEMENTAL_DATA_FILES</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"transcriptomics"</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"file"</span> <span class="p">:</span> <span class="s">"42255_2022_720_MOESM3_ESM.xlsx"</span><span class="p">,</span>
        <span class="s">"sheet"</span> <span class="p">:</span> <span class="s">"Source Data transcriptomics"</span>
    <span class="p">},</span>
    <span class="s">"proteomics"</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"file"</span> <span class="p">:</span> <span class="s">"42255_2022_720_MOESM3_ESM.xlsx"</span><span class="p">,</span>
        <span class="s">"sheet"</span> <span class="p">:</span> <span class="s">"Source Data proteomics"</span>
    <span class="p">},</span>
    <span class="s">"phenotypes"</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"file"</span> <span class="p">:</span> <span class="s">"42255_2022_720_MOESM4_ESM.xlsx"</span><span class="p">,</span>
        <span class="s">"sheet"</span> <span class="p">:</span> <span class="mi">0</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1"># other globals
# path to proteomics filenames for extracting run order
</span><span class="n">PROTEOMICS_FILE_NAMES_URL</span> <span class="o">=</span> <span class="s">"https://github.com/user-attachments/files/20616164/202502_PHRT-5_MMA_Sample-annotation.txt"</span>
<span class="c1"># filter genes with fewer than this # of counts summed over samples
</span><span class="n">READ_CUTOFF</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mi">400</span><span class="p">)</span>
<span class="c1"># filter phenotypes with more than this # of missing values
</span><span class="n">MAX_MISSING_PHENOTYPE</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mi">180</span><span class="p">)</span>
<span class="c1"># measure to use for analysis
</span><span class="n">ANALYSIS_LAYER</span> <span class="o">=</span> <span class="s">"log2_centered"</span>
<span class="c1"># cutoff for qvalues (FDR-adjusted p-values)
</span><span class="n">FDR_CUTOFF</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="c1"># Overwrite the results if they already exist   
</span><span class="n">OVERWRITE</span> <span class="o">=</span> <span class="bp">False</span>
<span class="c1"># Define the optimal number of factors
</span><span class="n">OPTIMAL_FACTOR</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mi">30</span><span class="p">)</span>
<span class="c1"># Define the range of factors to test for MOFA
</span><span class="n">FACTOR_RANGE</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">51</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># naming scheme for feature- and factor-level regression results
</span><span class="n">FEATURE_REGRESSION_STR</span> <span class="o">=</span> <span class="s">"feature_regression_{}"</span>
<span class="n">FACTOR_REGRESSION_STR</span> <span class="o">=</span> <span class="s">"mofa_regression_{}"</span>

<span class="c1"># Statistical model specifications
# Used for both feature- and factor-level regression
# We use different covariates for each data modality based on identified batch effects:
# - Transcriptomics: only date_freezing affects expression
# - Proteomics: both proteomics_runorder and date_freezing are significant
# The 's()' notation indicates spline smoothing for non-linear relationships
</span><span class="n">REGRESSION_FORMULAS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"transcriptomics"</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"case"</span> <span class="p">:</span> <span class="s">"~ case + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"OHCblPlus"</span> <span class="p">:</span> <span class="s">"~ OHCblPlus + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"MMA_urine"</span> <span class="p">:</span> <span class="s">"~ MMA_urine + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"responsive_to_acute_treatment"</span> <span class="p">:</span> <span class="s">"~ responsive_to_acute_treatment + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"proteomics_runorder"</span> <span class="p">:</span> <span class="s">"~ s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span> <span class="c1"># we don't expect this to be significant
</span>        <span class="s">"date_freezing"</span> <span class="p">:</span> <span class="s">"~ s(proteomics_runorder) + s(date_freezing)"</span>
    <span class="p">},</span>
    <span class="s">"proteomics"</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"case"</span> <span class="p">:</span> <span class="s">"~ case + s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"OHCblPlus"</span> <span class="p">:</span> <span class="s">"~ OHCblPlus + s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"MMA_urine"</span> <span class="p">:</span> <span class="s">"~ MMA_urine + s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"responsive_to_acute_treatment"</span> <span class="p">:</span> <span class="s">"~ responsive_to_acute_treatment + s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"proteomics_runorder"</span> <span class="p">:</span> <span class="s">"~ s(proteomics_runorder) + s(date_freezing)"</span><span class="p">,</span>
        <span class="s">"date_freezing"</span> <span class="p">:</span> <span class="s">"~ s(proteomics_runorder) + s(date_freezing)"</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1"># utility functions
</span>
<span class="k">def</span> <span class="nf">create_stacked_barplot_from_regression</span><span class="p">(</span><span class="n">regression_results_df</span><span class="p">,</span> <span class="n">q_threshold</span><span class="o">=</span><span class="n">FDR_CUTOFF</span><span class="p">):</span>
    <span class="s">"""
    Create a stacked barplot showing significant results by modality and model_name
    
    Args:
        regression_results_df: DataFrame with columns including 'q_value', 'modality', 'model_name'
        q_threshold: significance threshold for q_value (default: FDR_CUTOFF)
    """</span>
    <span class="c1"># Get significant results using your counting logic
</span>    <span class="n">significant_counts</span> <span class="o">=</span> <span class="n">regression_results_df</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="sa">f</span><span class="s">"q_value &lt; </span><span class="si">{</span><span class="n">q_threshold</span><span class="si">}</span><span class="s">"</span><span class="p">).</span><span class="n">value_counts</span><span class="p">([</span><span class="s">"modality"</span><span class="p">,</span> <span class="s">"model_name"</span><span class="p">])</span>
    
    <span class="c1"># Convert to DataFrame for easier manipulation
</span>    <span class="n">df</span> <span class="o">=</span> <span class="n">significant_counts</span><span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'modality'</span><span class="p">,</span> <span class="s">'model_name'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">]</span>
    
    <span class="c1"># Get all unique model_names from the original dataframe to ensure they all appear
</span>    <span class="n">all_model_names</span> <span class="o">=</span> <span class="n">regression_results_df</span><span class="p">[</span><span class="s">'model_name'</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
    <span class="n">all_modalities</span> <span class="o">=</span> <span class="n">regression_results_df</span><span class="p">[</span><span class="s">'modality'</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
    
    <span class="c1"># Create a complete DataFrame with all combinations, filling missing with 0
</span>    <span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">product</span>
    <span class="n">all_combinations</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
        <span class="nb">list</span><span class="p">(</span><span class="n">product</span><span class="p">(</span><span class="n">all_modalities</span><span class="p">,</span> <span class="n">all_model_names</span><span class="p">)),</span> 
        <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'modality'</span><span class="p">,</span> <span class="s">'model_name'</span><span class="p">]</span>
    <span class="p">)</span>
    <span class="n">all_combinations</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
    
    <span class="c1"># Merge with actual counts, keeping all combinations
</span>    <span class="n">df_complete</span> <span class="o">=</span> <span class="n">all_combinations</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span>
        <span class="n">df</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s">'modality'</span><span class="p">,</span> <span class="s">'model_name'</span><span class="p">],</span> 
        <span class="n">how</span><span class="o">=</span><span class="s">'left'</span><span class="p">,</span> <span class="n">suffixes</span><span class="o">=</span><span class="p">(</span><span class="s">'_empty'</span><span class="p">,</span> <span class="s">'_actual'</span><span class="p">)</span>
    <span class="p">)</span>
    
    <span class="c1"># Use actual counts where available, otherwise use 0
</span>    <span class="n">df_complete</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_complete</span><span class="p">[</span><span class="s">'count_actual'</span><span class="p">].</span><span class="n">fillna</span><span class="p">(</span><span class="n">df_complete</span><span class="p">[</span><span class="s">'count_empty'</span><span class="p">])</span>
    <span class="n">df_complete</span> <span class="o">=</span> <span class="n">df_complete</span><span class="p">[[</span><span class="s">'modality'</span><span class="p">,</span> <span class="s">'model_name'</span><span class="p">,</span> <span class="s">'count'</span><span class="p">]]</span>
    
    <span class="c1"># Set seaborn style
</span>    <span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span><span class="s">"whitegrid"</span><span class="p">)</span>
    
    <span class="c1"># Group by model_name and sum counts across modalities to get total counts for ordering
</span>    <span class="n">total_counts</span> <span class="o">=</span> <span class="n">df_complete</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'model_name'</span><span class="p">)[</span><span class="s">'count'</span><span class="p">].</span><span class="nb">sum</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    
    <span class="c1"># Create pivot table with model_name as index and modality as columns
</span>    <span class="n">pivot_df</span> <span class="o">=</span> <span class="n">df_complete</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">'model_name'</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">'modality'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'count'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    
    <span class="c1"># Reorder by total counts (descending)
</span>    <span class="n">pivot_df</span> <span class="o">=</span> <span class="n">pivot_df</span><span class="p">.</span><span class="n">reindex</span><span class="p">(</span><span class="n">total_counts</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
    
    <span class="c1"># Create the plot
</span>    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">14</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
    
    <span class="c1"># Use seaborn color palette
</span>    <span class="n">colors</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"husl"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">pivot_df</span><span class="p">.</span><span class="n">columns</span><span class="p">))</span>
    
    <span class="c1"># Plot stacked bars
</span>    <span class="n">pivot_df</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'bar'</span><span class="p">,</span> <span class="n">stacked</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
    
    <span class="c1"># Customize the plot
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="sa">f</span><span class="s">'Significant Results by Model Name and Modality (q &lt; </span><span class="si">{</span><span class="n">q_threshold</span><span class="si">}</span><span class="s">)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Model Name (ordered by total significant count)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Count of Significant Results'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">'Modality'</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.05</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">loc</span><span class="o">=</span><span class="s">'upper left'</span><span class="p">)</span>
    
    <span class="c1"># Add total count labels at the top of each bar
</span>    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">model_name</span><span class="p">,</span> <span class="n">total_count</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">total_counts</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span>
        <span class="k">if</span> <span class="n">total_count</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>  <span class="c1"># Only show label if there are any significant results
</span>            <span class="n">ax</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">total_count</span> <span class="o">+</span> <span class="nb">max</span><span class="p">(</span><span class="n">total_counts</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.01</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">total_count</span><span class="p">)),</span> 
            <span class="n">ha</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s">'bottom'</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
    
    <span class="c1"># Rotate x-axis labels for better readability
</span>    <span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'right'</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
    
    <span class="k">return</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span><span class="p">,</span> <span class="n">pivot_df</span>

<span class="k">def</span> <span class="nf">runorder_from_filename</span><span class="p">(</span><span class="n">fn</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="s">"""Retrieve runorder form filename

    Args:
        fn (str): Filename ending in runorder

    Returns:
        int: the run order parsed from the filename
    Example:
        &gt;&gt;&gt; runorder_from_filename("asdf_2")
        2
    """</span>
    <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="n">fn</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"_"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>

<span class="c1"># load data
</span><span class="n">supplemental_data_path</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">x</span> <span class="p">:</span> <span class="p">{</span>
        <span class="s">"path"</span> <span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">SUPPLEMENTAL_DATA_DIR</span><span class="p">,</span> <span class="n">y</span><span class="p">[</span><span class="s">"file"</span><span class="p">]),</span>
        <span class="s">"sheet"</span> <span class="p">:</span> <span class="n">y</span><span class="p">[</span><span class="s">"sheet"</span><span class="p">]</span>
    <span class="p">}</span>
        <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">SUPPLEMENTAL_DATA_FILES</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
<span class="p">}</span>

<span class="k">assert</span> <span class="nb">all</span><span class="p">([</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s">"path"</span><span class="p">])</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">supplemental_data_path</span><span class="p">.</span><span class="n">values</span><span class="p">()])</span>
</code></pre></div></div>

<p>Next, I’ll load the transcriptomics and proteomics datasets (stored in
separate sheets of the same Excel file), along with the phenotypic data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">supplemental_data</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">x</span> <span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="s">"path"</span><span class="p">],</span> <span class="n">sheet_name</span> <span class="o">=</span> <span class="n">y</span><span class="p">[</span><span class="s">"sheet"</span><span class="p">])</span> <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">supplemental_data_path</span><span class="p">.</span><span class="n">items</span><span class="p">()</span>
<span class="p">}</span>

<span class="c1"># formatting
</span><span class="n">supplemental_data</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">]</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">({</span><span class="s">"Unnamed: 0"</span> <span class="p">:</span> <span class="s">"ensembl_gene"</span><span class="p">},</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"ensembl_gene"</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">]</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">({</span><span class="s">"PG.ProteinAccessions"</span> <span class="p">:</span> <span class="s">"uniprot"</span><span class="p">},</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"uniprot"</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">]</span>
    <span class="p">.</span><span class="n">rename</span><span class="p">({</span><span class="s">"Unnamed: 0"</span> <span class="p">:</span> <span class="s">"patient_id"</span><span class="p">},</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"patient_id"</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="data-overview">Data overview</h2>

<p>The Forny et al. dataset contains three key components:</p>

<ul>
  <li><strong>Transcriptomics</strong>: RNA-seq data from patient-derived fibroblasts
(210 MMA patients + 20 controls)</li>
  <li><strong>Proteomics</strong>: Mass spectrometry data from the same cell lines</li>
  <li><strong>Phenotypes</strong>: Clinical measurements including enzymatic assays and
biochemical markers. The main disease phenotypes I will focus on
are:
    <ul>
      <li><em>case</em>: A binary variable indicating whether individuals exhibit
clinical MMA or are healthy controls</li>
      <li><em>responsive_to_acute_treatment</em>: A binary variable indicating
whether patients show clinical improvement following acute
vitamin supplementation interventions</li>
      <li><em>OHCblPlus</em>: A quantitative measure of methylmalonyl-CoA mutase
enzyme activity, where higher values indicate better enzymatic
function</li>
      <li><em>MMA_urine</em>: A quantitative measurement of methylmalonic acid
levels in urine, where elevated levels indicate metabolic
dysfunction and greater disease severity</li>
    </ul>
  </li>
</ul>

<p>Key analytical challenges include severe class imbalance (only 20
controls versus 210 cases), missing phenotype data for many clinical
measurements, potential technical variation from sample processing, and
disease heterogeneity, characterized by varying genetic subtypes and
levels of severity.</p>

<h3 id="formatting-phenotypes-and-technical-covariates">Formatting phenotypes and technical covariates</h3>

<p>Clinical data present unique challenges that require careful
preprocessing. Measurements are often right-skewed and contain missing
values because only a subset of tests is typically ordered for each
patient. Since I’m interested in quantitative biomarkers, having
complete data across patients increases statistical power. While
imputation can address missingness, it is beneficial to first transform
the variables to more closely approximate a multivariate Gaussian
distribution. This transformation also enhances regression modeling by
linearizing relationships between dependent and independent variables.</p>
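<p>As a minimal, self-contained sketch of this transform-then-impute idea
(the toy <code class="language-plaintext highlighter-rouge">biomarker</code> column and the plain log transform are
illustrative assumptions; the analysis below uses helpers from
<code class="language-plaintext highlighter-rouge">shackett_utils</code> rather than this code):</p>

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)

# toy right-skewed clinical measurement with missing values
# (hypothetical data standing in for a real phenotype column)
values = rng.lognormal(mean=1.0, sigma=0.8, size=200)
values[rng.choice(200, size=30, replace=False)] = np.nan
df = pd.DataFrame({"biomarker": values})

observed = df["biomarker"].dropna()

# compare normality before and after a log transform via Shapiro-Wilk
p_raw = stats.shapiro(observed).pvalue
p_log = stats.shapiro(np.log(observed)).pvalue
print(f"Shapiro p-value raw: {p_raw:.2e}, log-transformed: {p_log:.2e}")

# transform first, then impute in the roughly-Gaussian space
log_df = np.log(df)
imputed = KNNImputer(n_neighbors=5).fit_transform(log_df)
```

<p>The point of ordering the steps this way is that KNN imputation uses
distances between samples, and those distances are far more meaningful
once the heavy right tail has been compressed.</p>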

<p>A critical aspect of this dataset is accounting for technical variation.
The proteomics data were collected across different instrument runs, and
I can extract the run order from the original PRIDE repository
filenames. This run order represents a major source of variation that
must be controlled in my statistical models.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># load run order
</span><span class="n">proteomics_file_names</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">PROTEOMICS_FILE_NAMES_URL</span><span class="p">,</span> <span class="n">sep</span> <span class="o">=</span> <span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
<span class="n">proteomics_file_names</span><span class="p">[</span><span class="s">"proteomics_runorder"</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">runorder_from_filename</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span> <span class="k">for</span> <span class="n">fn</span> <span class="ow">in</span> <span class="n">proteomics_file_names</span><span class="p">[</span><span class="s">"File Name"</span><span class="p">]]</span>
<span class="n">proteomics_file_names</span><span class="p">[</span><span class="s">"patient_id"</span><span class="p">]</span> <span class="o">=</span> <span class="n">proteomics_file_names</span><span class="p">[</span><span class="s">"Run Label"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"_"</span><span class="p">,</span> <span class="s">""</span><span class="p">)</span>
<span class="c1"># some samples have multiple files in the PRIDE proteomics repository but each sample is just a single observation in the
# final dataset. Here we'll take the max run order for each patient. These likely reflect reruns due to technical failures
# so the max run order sample is usually the one with the best technical quality.
</span><span class="n">proteomics_file_names</span> <span class="o">=</span> <span class="n">proteomics_file_names</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"patient_id"</span><span class="p">)[</span><span class="s">"proteomics_runorder"</span><span class="p">].</span><span class="nb">max</span><span class="p">()</span>

<span class="c1"># select continuous phenotypes without too many missing values
</span><span class="n">continuous_df</span> <span class="o">=</span> <span class="n">forny_imputation</span><span class="p">.</span><span class="n">_select_continuous_measures</span><span class="p">(</span><span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">],</span> <span class="n">max_missing</span> <span class="o">=</span> <span class="n">MAX_MISSING_PHENOTYPE</span><span class="p">)</span>

<span class="c1"># identify transformations which improve variable normality
</span><span class="n">normalizing_transforms</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">continuous_df</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="n">normalizing_transforms</span><span class="p">[</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="n">transform</span><span class="p">.</span><span class="n">best_normalizing_transform</span><span class="p">(</span><span class="n">continuous_df</span><span class="p">[</span><span class="n">col</span><span class="p">])[</span><span class="s">"best"</span><span class="p">]</span>
    
<span class="n">func_transform_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">col</span><span class="p">:</span> <span class="n">transform</span><span class="p">.</span><span class="n">transform_func_map</span><span class="p">[</span><span class="n">trans</span><span class="p">]</span> <span class="k">for</span> <span class="n">col</span><span class="p">,</span> <span class="n">trans</span> <span class="ow">in</span> <span class="n">normalizing_transforms</span><span class="p">.</span><span class="n">items</span><span class="p">()}</span>

<span class="c1"># Apply transformation
</span><span class="n">transformed_df</span> <span class="o">=</span> <span class="n">forny_imputation</span><span class="p">.</span><span class="n">transform_columns</span><span class="p">(</span><span class="n">continuous_df</span><span class="p">,</span> <span class="n">func_transform_dict</span><span class="p">)</span>

<span class="c1"># Create the imputer
</span><span class="n">imputer</span> <span class="o">=</span> <span class="n">KNNImputer</span><span class="p">(</span><span class="n">n_neighbors</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">weights</span><span class="o">=</span><span class="s">"uniform"</span><span class="p">)</span>

<span class="c1"># Fit and transform the data
</span><span class="n">imputed_array</span> <span class="o">=</span> <span class="n">imputer</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">transformed_df</span><span class="p">)</span>
<span class="n">imputed_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">imputed_array</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">transformed_df</span><span class="p">.</span><span class="n">columns</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">transformed_df</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>

<span class="n">binary_phenotypes</span> <span class="o">=</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">][[</span><span class="n">col</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">].</span><span class="n">columns</span> <span class="k">if</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"phenotypes"</span><span class="p">][</span><span class="n">col</span><span class="p">].</span><span class="n">dropna</span><span class="p">().</span><span class="n">nunique</span><span class="p">()</span> <span class="o">==</span> <span class="mi">2</span><span class="p">]]</span>

<span class="c1"># create a table of phenotypes
</span><span class="n">phenotypes_df</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span>
        <span class="n">binary_phenotypes</span><span class="p">,</span>
        <span class="n">imputed_df</span>
    <span class="p">],</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="c1"># add proteomics run order since this is a batch effect in the proteomics data
</span>    <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">proteomics_file_names</span><span class="p">,</span> <span class="n">how</span> <span class="o">=</span> <span class="s">"left"</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">forny_imputation</span><span class="p">.</span><span class="n">plot_clustered_correlation_heatmap</span><span class="p">(</span><span class="n">transformed_df</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/phenotypes-output-1.png" alt="" /></p>

<p>The correlation heatmap reveals important structure within the clinical
data. Block-diagonal patterns show related assays clustering together:
AdoCbl assays correlate strongly with one another, OHCbl assays form a
distinct cluster, and both are anti-correlated with MMA excretion into
urine. This correlation structure supports my use of a K-Nearest
Neighbors imputation strategy.</p>
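<p>A small synthetic check of why this matters: when two assays track the same latent signal, as the AdoCbl and OHCbl blocks do, KNN imputation recovers masked values far better than a column mean would. Everything below is simulated for illustration:</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
n = 300
# two correlated "assays" driven by one latent signal
latent = rng.normal(size=n)
df = pd.DataFrame({
    "assay_a": latent + 0.2 * rng.normal(size=n),
    "assay_b": latent + 0.2 * rng.normal(size=n),
})

# mask 20% of assay_a, then compare imputations against the truth
mask = rng.random(n) < 0.2
truth = df.loc[mask, "assay_a"].to_numpy()
df_missing = df.copy()
df_missing.loc[mask, "assay_a"] = np.nan

knn = KNNImputer(n_neighbors=5).fit_transform(df_missing)
knn_err = np.mean((knn[mask, 0] - truth) ** 2)
mean_err = np.mean((df_missing["assay_a"].mean() - truth) ** 2)
```

<p>Because the imputer finds neighbors via the intact <code class="language-plaintext highlighter-rouge">assay_b</code> column, <code class="language-plaintext highlighter-rouge">knn_err</code> comes out a fraction of <code class="language-plaintext highlighter-rouge">mean_err</code>.</p>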

<div class="content-section ai-aside">
  <div class="section-content">
    <p>To create the transformation module
for identifying optimal normalizing transformations, I began by
providing high-level problem guidance to Claude. Initially, it produced
overly complicated code focused on finding transformations that would
yield Kolmogorov–Smirnov (KS) test p-values greater than 0.05 for
assessing normality. While it suggested useful transformations like
Box-Cox and Yeo-Johnson, it also recommended quantile normalization —
which, although effective at making data Gaussian, defeats the purpose
of preserving the original data’s distribution. The criterion of
accepting only transformations with p &gt; 0.05 was problematic, since KS
tests are almost always significant when comparing real-world data to
parametric distributions. With further guidance and iterative refinement
in Cursor, the final module performed well. Claude was especially
helpful in handling <code class="language-plaintext highlighter-rouge">matplotlib</code>’s tricky syntax, enabling me to quickly
create complex, specialized plots for visualizing missing value patterns
and comparing transformations.</p>

  </div>
</div>

<h3 id="organizing-results-with-scverse">Organizing results with <code class="language-plaintext highlighter-rouge">scverse</code></h3>

<p>To make my analysis more reproducible and compatible with the broader
genomics ecosystem, I’ll organize my data using the <code class="language-plaintext highlighter-rouge">scverse</code> framework.
This ecosystem, built around <code class="language-plaintext highlighter-rouge">AnnData</code> and <code class="language-plaintext highlighter-rouge">MuData</code> objects, provides
standardized containers for genomics data that integrate observation
metadata, feature annotations, and multiple data layers.</p>

<p>Using AnnData/MuData offers several advantages:</p>

<ul>
  <li><strong>Standardization</strong>: A common format across the Python genomics
ecosystem</li>
  <li><strong>Integration</strong>: Easy to combine with existing tools like <code class="language-plaintext highlighter-rouge">scanpy</code>,
<code class="language-plaintext highlighter-rouge">muon</code>, and <code class="language-plaintext highlighter-rouge">scvi-tools</code></li>
  <li><strong>Metadata management</strong>: Keeps sample annotations and results
together</li>
  <li><strong>Multimodal support</strong>: Native integration of any combination of
data modalities</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">transcr_adata</span> <span class="o">=</span> <span class="n">ad</span><span class="p">.</span><span class="n">AnnData</span><span class="p">(</span>
    <span class="n">X</span> <span class="o">=</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">].</span><span class="n">T</span><span class="p">,</span>
    <span class="c1"># some samples are missing
</span>    <span class="n">obs</span> <span class="o">=</span> <span class="n">phenotypes_df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">phenotypes_df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">supplemental_data</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">].</span><span class="n">columns</span><span class="p">)],</span>
<span class="p">)</span>

<span class="n">protein_metadata_vars</span> <span class="o">=</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">columns</span><span class="p">[</span><span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">columns</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"PG"</span><span class="p">)]</span>

<span class="n">proteomics_adata</span> <span class="o">=</span> <span class="n">ad</span><span class="p">.</span><span class="n">AnnData</span><span class="p">(</span>
    <span class="c1"># drop protein metadata vars and transpose
</span>    <span class="n">X</span> <span class="o">=</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">drop</span><span class="p">(</span><span class="n">protein_metadata_vars</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">).</span><span class="n">T</span><span class="p">,</span>
    <span class="c1"># some samples are missing
</span>    <span class="n">obs</span> <span class="o">=</span> <span class="n">phenotypes_df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">phenotypes_df</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">columns</span><span class="p">)],</span>
    <span class="n">var</span> <span class="o">=</span> <span class="n">supplemental_data</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">][</span><span class="n">protein_metadata_vars</span><span class="p">],</span>
<span class="p">)</span>

<span class="n">mdata</span> <span class="o">=</span> <span class="n">md</span><span class="p">.</span><span class="n">MuData</span><span class="p">({</span><span class="s">"transcriptomics"</span><span class="p">:</span> <span class="n">transcr_adata</span><span class="p">,</span> <span class="s">"proteomics"</span><span class="p">:</span> <span class="n">proteomics_adata</span><span class="p">})</span>
<span class="n">mdata</span>
</code></pre></div></div>

<pre>MuData object with n_obs × n_vars = 230 × 19537
  2 modalities
    transcriptomics:    221 x 14749
      obs:  &#x27;case&#x27;, &#x27;gender&#x27;, &#x27;consanguinity&#x27;, &#x27;mut_category&#x27;, &#x27;wgs_zygosity&#x27;, &#x27;acidosis&#x27;, &#x27;metabolic_acidosis&#x27;, &#x27;metabolic_ketoacidosis&#x27;, &#x27;ketosis&#x27;, &#x27;hyperammonemia&#x27;, &#x27;abnormal_muscle_tone&#x27;, &#x27;musc_hypotonia&#x27;, &#x27;musc_hypertonia&#x27;, &#x27;fct_respiratory_abnormality&#x27;, &#x27;dyspnea&#x27;, &#x27;tachypnea&#x27;, &#x27;reduced_consciousness&#x27;, &#x27;lethargy&#x27;, &#x27;coma&#x27;, &#x27;seizures&#x27;, &#x27;general_tonic_clonic_seizure&#x27;, &#x27;any_GI_problem&#x27;, &#x27;failure_to_thrive&#x27;, &#x27;any_delay&#x27;, &#x27;behavioral_abnormality&#x27;, &#x27;concurrent_infection&#x27;, &#x27;urine_ketones&#x27;, &#x27;dialysis&#x27;, &#x27;peritoneal_dialysis&#x27;, &#x27;insulin&#x27;, &#x27;diet&#x27;, &#x27;carnitine&#x27;, &#x27;cobalamin&#x27;, &#x27;bicarb&#x27;, &#x27;glucose_IV&#x27;, &#x27;cobalamin_responsive&#x27;, &#x27;antibiotic_treatment&#x27;, &#x27;protein_restriction&#x27;, &#x27;tube_feeding_day&#x27;, &#x27;tube_feeding_night&#x27;, &#x27;tube_feeding_overall&#x27;, &#x27;language_delay&#x27;, &#x27;any_neurological_abnormalities_chronic&#x27;, &#x27;impaired_kidney_fct&#x27;, &#x27;hemat_abnormality&#x27;, &#x27;anemia&#x27;, &#x27;neutropenia&#x27;, &#x27;skin_abnormalities&#x27;, &#x27;hearing_impairment&#x27;, &#x27;osteoporosis&#x27;, &#x27;failure_to_thrive_chronic&#x27;, &#x27;global_dev_delay_chr&#x27;, &#x27;hypotonia_chr&#x27;, &#x27;basal_ganglia_abnormality_chr&#x27;, &#x27;failure_to_thrive_or_tube_feeding&#x27;, &#x27;irritability&#x27;, &#x27;hyperventilation&#x27;, &#x27;hypothermia&#x27;, &#x27;somnolence&#x27;, &#x27;vomiting&#x27;, &#x27;dehydration&#x27;, &#x27;feeding_problem&#x27;, &#x27;responsive_to_acute_treatment&#x27;, &#x27;n_passage&#x27;, &#x27;date_collection&#x27;, &#x27;date_freezing&#x27;, &#x27;onset_age&#x27;, &#x27;OHCblMinus&#x27;, &#x27;OHCblPlus&#x27;, 
&#x27;ratio&#x27;, &#x27;SimultOHCblMinus&#x27;, &#x27;SimultOHCblPlus&#x27;, &#x27;AdoCblMinus&#x27;, &#x27;AdoCblPlus&#x27;, &#x27;SimultAdoCblMinus&#x27;, &#x27;SimultAdoCblPlus&#x27;, &#x27;prot_mut_level&#x27;, &#x27;rnaseq_mut_level&#x27;, &#x27;MMA_urine&#x27;, &#x27;ammonia_umolL&#x27;, &#x27;pH&#x27;, &#x27;base_excess&#x27;, &#x27;MMA_urine_after_treat&#x27;, &#x27;carnitine_dose&#x27;, &#x27;natural_protein_amount&#x27;, &#x27;total_protein_amount&#x27;, &#x27;weight_centile_quant&#x27;, &#x27;length_centile_quant&#x27;, &#x27;head_circumfernce_quant&#x27;, &#x27;proteomics_runorder&#x27;
    proteomics: 230 x 4788
      obs:  &#x27;case&#x27;, &#x27;gender&#x27;, &#x27;consanguinity&#x27;, &#x27;mut_category&#x27;, &#x27;wgs_zygosity&#x27;, &#x27;acidosis&#x27;, &#x27;metabolic_acidosis&#x27;, &#x27;metabolic_ketoacidosis&#x27;, &#x27;ketosis&#x27;, &#x27;hyperammonemia&#x27;, &#x27;abnormal_muscle_tone&#x27;, &#x27;musc_hypotonia&#x27;, &#x27;musc_hypertonia&#x27;, &#x27;fct_respiratory_abnormality&#x27;, &#x27;dyspnea&#x27;, &#x27;tachypnea&#x27;, &#x27;reduced_consciousness&#x27;, &#x27;lethargy&#x27;, &#x27;coma&#x27;, &#x27;seizures&#x27;, &#x27;general_tonic_clonic_seizure&#x27;, &#x27;any_GI_problem&#x27;, &#x27;failure_to_thrive&#x27;, &#x27;any_delay&#x27;, &#x27;behavioral_abnormality&#x27;, &#x27;concurrent_infection&#x27;, &#x27;urine_ketones&#x27;, &#x27;dialysis&#x27;, &#x27;peritoneal_dialysis&#x27;, &#x27;insulin&#x27;, &#x27;diet&#x27;, &#x27;carnitine&#x27;, &#x27;cobalamin&#x27;, &#x27;bicarb&#x27;, &#x27;glucose_IV&#x27;, &#x27;cobalamin_responsive&#x27;, &#x27;antibiotic_treatment&#x27;, &#x27;protein_restriction&#x27;, &#x27;tube_feeding_day&#x27;, &#x27;tube_feeding_night&#x27;, &#x27;tube_feeding_overall&#x27;, &#x27;language_delay&#x27;, &#x27;any_neurological_abnormalities_chronic&#x27;, &#x27;impaired_kidney_fct&#x27;, &#x27;hemat_abnormality&#x27;, &#x27;anemia&#x27;, &#x27;neutropenia&#x27;, &#x27;skin_abnormalities&#x27;, &#x27;hearing_impairment&#x27;, &#x27;osteoporosis&#x27;, &#x27;failure_to_thrive_chronic&#x27;, &#x27;global_dev_delay_chr&#x27;, &#x27;hypotonia_chr&#x27;, &#x27;basal_ganglia_abnormality_chr&#x27;, &#x27;failure_to_thrive_or_tube_feeding&#x27;, &#x27;irritability&#x27;, &#x27;hyperventilation&#x27;, &#x27;hypothermia&#x27;, &#x27;somnolence&#x27;, &#x27;vomiting&#x27;, &#x27;dehydration&#x27;, &#x27;feeding_problem&#x27;, &#x27;responsive_to_acute_treatment&#x27;, &#x27;n_passage&#x27;, &#x27;date_collection&#x27;, &#x27;date_freezing&#x27;, &#x27;onset_age&#x27;, &#x27;OHCblMinus&#x27;, &#x27;OHCblPlus&#x27;, 
&#x27;ratio&#x27;, &#x27;SimultOHCblMinus&#x27;, &#x27;SimultOHCblPlus&#x27;, &#x27;AdoCblMinus&#x27;, &#x27;AdoCblPlus&#x27;, &#x27;SimultAdoCblMinus&#x27;, &#x27;SimultAdoCblPlus&#x27;, &#x27;prot_mut_level&#x27;, &#x27;rnaseq_mut_level&#x27;, &#x27;MMA_urine&#x27;, &#x27;ammonia_umolL&#x27;, &#x27;pH&#x27;, &#x27;base_excess&#x27;, &#x27;MMA_urine_after_treat&#x27;, &#x27;carnitine_dose&#x27;, &#x27;natural_protein_amount&#x27;, &#x27;total_protein_amount&#x27;, &#x27;weight_centile_quant&#x27;, &#x27;length_centile_quant&#x27;, &#x27;head_circumfernce_quant&#x27;, &#x27;proteomics_runorder&#x27;
      var:  &#x27;PG.ProteinDescriptions&#x27;, &#x27;PG.ProteinNames&#x27;, &#x27;PG.Qvalue&#x27;</pre>

<h3 id="data-normalization">Data normalization</h3>

<p>Following the methodology from Forny et al., I will apply a three-step
normalization process designed to make the data more suitable for
statistical analysis:</p>

<ol>
  <li><strong>Filtering poorly measured features</strong>: Remove genes with very few
reads across samples</li>
  <li><strong>Log-transformation</strong>: Stabilize variance and make distributions
more Gaussian (necessary because I’m working with processed data
rather than original counts)</li>
  <li><strong>Row and column centering</strong>: Remove systematic biases like library
size effects while preserving biological signal (following the Forny
methodology)</li>
</ol>
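<p>The centering step can be sketched in plain <code class="language-plaintext highlighter-rouge">numpy</code>; the package helper <code class="language-plaintext highlighter-rouge">center_rows_and_columns_mudata</code> used below operates on a MuData layer, but the arithmetic is just this:</p>

```python
import numpy as np

def center_rows_and_columns(x):
    """Double centering: subtract column means (per-feature offsets),
    then row means (per-sample effects such as library size).

    One pass of each leaves both row and column means at zero, since
    after column centering the grand mean is already zero.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)  # column (feature) centering
    x = x - x.mean(axis=1, keepdims=True)  # row (sample) centering
    return x

rng = np.random.default_rng(0)
mat = rng.normal(size=(6, 4)) + np.linspace(0, 3, 4)  # per-feature offsets
centered = center_rows_and_columns(mat)
```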

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># filter to drop features with low counts
</span><span class="n">processing</span><span class="p">.</span><span class="n">filter_features_by_counts</span><span class="p">(</span><span class="n">mdata</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">],</span> <span class="n">min_counts</span> <span class="o">=</span> <span class="n">READ_CUTOFF</span><span class="p">)</span>

<span class="c1"># add a pseudocount before logging
</span><span class="n">processing</span><span class="p">.</span><span class="n">log2_transform</span><span class="p">(</span><span class="n">mdata</span><span class="p">[</span><span class="s">"transcriptomics"</span><span class="p">],</span> <span class="n">pseudocount</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1"># proteomics has a minimum value of 1 so no pseudocounts are needed before logging
</span><span class="n">processing</span><span class="p">.</span><span class="n">log2_transform</span><span class="p">(</span><span class="n">mdata</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">],</span> <span class="n">pseudocount</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span>

<span class="c1"># row and column center as per Forny paper
</span><span class="n">processing</span><span class="p">.</span><span class="n">center_rows_and_columns_mudata</span><span class="p">(</span>
    <span class="n">mdata</span><span class="p">,</span>
    <span class="n">layer</span> <span class="o">=</span> <span class="s">"log2"</span><span class="p">,</span>
    <span class="n">new_layer_name</span> <span class="o">=</span> <span class="s">"log2_centered"</span>
<span class="p">)</span>
</code></pre></div></div>

<h2 id="exploratory-data-analysis">Exploratory data analysis</h2>

<p>Before testing specific hypotheses, I would like to better understand
the major sources of variation shaping this dataset. Principal Component
Analysis (PCA) is ideal for this because it identifies the main patterns
of variation without relying on assumptions about what should be
important.</p>

<p>This analysis will help me assess data quality, detect technical batch
effects, and select covariates necessary for reliably characterizing
subtle biological signals.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Process each modality
</span><span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'transcriptomics'</span><span class="p">,</span> <span class="s">'proteomics'</span><span class="p">]:</span>
    <span class="c1"># add PCs
</span>    <span class="n">sc</span><span class="p">.</span><span class="n">pp</span><span class="p">.</span><span class="n">pca</span><span class="p">(</span><span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">],</span> <span class="n">layer</span> <span class="o">=</span> <span class="n">ANALYSIS_LAYER</span><span class="p">)</span>
    
<span class="n">mdata_eda</span><span class="p">.</span><span class="n">plot_mudata_pca_variance</span><span class="p">(</span><span class="n">mdata</span><span class="p">,</span> <span class="n">n_pcs</span> <span class="o">=</span> <span class="mi">50</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/scree_plot-output-1.png" alt="" /></p>

<p>The scree plots show that variance is distributed across many components
instead of being concentrated in just a few. The gradual decline —
with no sharp “elbow” — indicates the presence of multiple sources of
variation rather than a small number of dominant factors. This pattern
is typical of patient-derived samples, which exhibit both genetic and
environmental heterogeneity.</p>

<p>To identify the major observed sources of variation — biological or
technical — I will calculate correlations between principal components
and sample attributes.</p>
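<p>The screen behind <code class="language-plaintext highlighter-rouge">analyze_pc_metadata_correlation_mudata</code> can be sketched as follows, using hypothetical toy data: Pearson-correlate each PC with each numeric sample attribute (binary phenotypes coded 0/1), yielding a PCs-by-attributes table to scan for batch effects:</p>

```python
import numpy as np
import pandas as pd

def pc_metadata_correlations(pcs, metadata):
    """Correlate each principal component with each numeric attribute.

    Returns a (n_pcs x n_attributes) table of Pearson correlations;
    large absolute values flag candidate batch effects or phenotypes.
    """
    pcs = pd.DataFrame(
        pcs, index=metadata.index,
        columns=[f"PC{i + 1}" for i in range(pcs.shape[1])],
    )
    numeric = metadata.select_dtypes("number")
    return pd.concat([pcs, numeric], axis=1).corr().loc[pcs.columns, numeric.columns]

# toy example: PC1 tracks run order, PC2 is noise
rng = np.random.default_rng(2)
meta = pd.DataFrame({"runorder": np.arange(100), "case": rng.integers(0, 2, 100)})
pcs = np.column_stack([meta["runorder"] + rng.normal(0, 5, 100), rng.normal(size=100)])
corrs = pc_metadata_correlations(pcs, meta)
```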

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># put these at the top regardless of significance
</span><span class="n">PRIORITIZED_PHENOTYPES</span> <span class="o">=</span> <span class="p">[</span><span class="s">"date_freezing"</span><span class="p">,</span> <span class="s">"proteomics_runorder"</span><span class="p">,</span> <span class="s">"case"</span><span class="p">,</span> <span class="s">"responsive_to_acute_treatment"</span><span class="p">,</span> <span class="s">"OHCblPlus"</span><span class="p">,</span> <span class="s">"MMA_urine"</span><span class="p">]</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">mdata_eda</span><span class="p">.</span><span class="n">analyze_pc_metadata_correlation_mudata</span><span class="p">(</span>
    <span class="n">mdata</span><span class="p">,</span>
    <span class="n">n_pcs</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
    <span class="n">prioritized_vars</span><span class="o">=</span><span class="n">PRIORITIZED_PHENOTYPES</span><span class="p">,</span>  <span class="c1"># This will always show 'case' variable at the top
</span>    <span class="n">pca_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">'svd_solver'</span><span class="p">:</span> <span class="s">'arpack'</span><span class="p">},</span>
    <span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span>
<span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pc_corrplot-output-1.png" alt="" /></p>

<p>The PCA analysis reveals several key patterns. Batch effects dominate
the primary sources of variation — <em>date_freezing</em> and
<em>proteomics_runorder</em> show stronger associations with the leading PCs
than disease variables, highlighting the importance of including them as
covariates. Disease signals are subtle; the weak correlations between
disease markers and PCs suggest that, like its variable clinical
presentation, MMA is also molecularly heterogeneous. Finally,
<em>proteomics_runorder</em> strongly affects proteomics data but not
transcriptomics.</p>

<p>This analysis informs my regression model specification: I will use
different covariates for each data modality based on the identified
batch effects.</p>

<h2 id="supervised-analysis">Supervised analysis</h2>

<p>I can now systematically identify molecular features associated with
disease phenotypes, including case–control status, responsiveness to
acute treatment, <em>OHCblPlus</em> (enzyme activity), and urine MMA levels. To
do this, I will use Generalized Additive Models (GAMs) to account for
potentially nonlinear effects of <em>date_freezing</em> and
<em>proteomics_runorder</em> on feature abundance.</p>

<p>My modeling strategy involves feature-wise testing, where each gene or
protein is independently analyzed through regression models that capture
specific biological effects while adjusting for covariates. For
transcripts, I control for <em>date_freezing</em>; for proteins, I control for
both <em>date_freezing</em> and <em>proteomics_runorder</em> (I will later validate
that including both is appropriate). Finally, I apply false discovery
rate (FDR) control to account for multiple testing across all features
(see my <a href="https://www.shackett.org/lfdr_shrinkage/">lFDR shrinkage post</a>
for a detailed review).</p>
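<p>The feature-wise loop can be sketched with ordinary least squares plus Benjamini-Hochberg FDR control; the real analysis fits GAMs with nonlinear covariate terms, so treat the OLS stand-in and the simulated data below as illustrative assumptions:</p>

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)
n = 120
obs = pd.DataFrame({
    "case": rng.integers(0, 2, n),
    "runorder": np.arange(n),  # stand-in batch covariate
})

# toy expression matrix: feature 'g0' carries a true case effect
X = pd.DataFrame(rng.normal(size=(n, 20)), columns=[f"g{i}" for i in range(20)])
X["g0"] += 1.5 * obs["case"]

# feature-wise models: abundance ~ effect of interest + batch covariate
pvals = []
for gene in X.columns:
    fit = smf.ols("y ~ case + runorder", data=obs.assign(y=X[gene].to_numpy())).fit()
    pvals.append(fit.pvalues["case"])

# FDR control across all tested features
rejected, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
hits = [g for g, r in zip(X.columns, rejected) if r]
```

<p>With a strong planted effect, <code class="language-plaintext highlighter-rouge">g0</code> survives FDR correction while the null features mostly do not.</p>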

<div class="content-section bio-section">
  <div class="section-content">
    <p>Forny et al. approached this
problem by regressing feature-level abundances on <em>OHCblPlus</em> while
accounting for genetic-relatedness of individuals using random effects.
This is especially important in rare disease settings where multiple
cases and controls often come from the same family. Since genotypes are
protected medical information under Switzerland’s Federal Act on Data
Protection (FADP), this information wasn’t available for my re-analysis.</p>

  </div>
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Process each modality
</span><span class="n">regression_results</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">modality</span> <span class="ow">in</span> <span class="n">mdata</span><span class="p">.</span><span class="n">mod_names</span><span class="p">:</span>
    
    <span class="n">modality_results</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="c1"># Apply regression per feature
</span>    <span class="k">for</span> <span class="n">formula_name</span><span class="p">,</span> <span class="n">formula</span> <span class="ow">in</span> <span class="n">REGRESSION_FORMULAS</span><span class="p">[</span><span class="n">modality</span><span class="p">].</span><span class="n">items</span><span class="p">():</span>

        <span class="n">summaries</span> <span class="o">=</span> <span class="n">adata_regression</span><span class="p">.</span><span class="n">adata_model_fitting</span><span class="p">(</span>
            <span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">],</span>
            <span class="n">formula</span><span class="p">,</span>
            <span class="n">n_jobs</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
            <span class="n">layer</span> <span class="o">=</span> <span class="n">ANALYSIS_LAYER</span><span class="p">,</span>
            <span class="n">model_name</span> <span class="o">=</span> <span class="n">formula_name</span><span class="p">,</span>
            <span class="n">progress_bar</span> <span class="o">=</span> <span class="bp">False</span>
        <span class="p">)</span>

        <span class="c1"># Remove intercepts and covariates
</span>        <span class="n">mask</span> <span class="o">=</span> <span class="p">[</span><span class="n">name</span> <span class="o">==</span> <span class="n">term</span> <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">term</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">summaries</span><span class="p">[</span><span class="s">"model_name"</span><span class="p">],</span> <span class="n">summaries</span><span class="p">[</span><span class="s">"term"</span><span class="p">])]</span>
        <span class="n">summaries</span> <span class="o">=</span> <span class="n">summaries</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">mask</span><span class="p">]</span>

        <span class="n">modality_results</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">summaries</span><span class="p">)</span>
    
    <span class="c1"># combine results and add modality first
</span>    <span class="n">modality_results</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">modality_results</span><span class="p">).</span><span class="n">assign</span><span class="p">(</span><span class="n">modality</span> <span class="o">=</span> <span class="n">modality</span><span class="p">)[[</span><span class="s">'modality'</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="n">col</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">modality_results</span><span class="p">).</span><span class="n">columns</span> <span class="k">if</span> <span class="n">col</span> <span class="o">!=</span> <span class="s">'modality'</span><span class="p">]]</span>
    
    <span class="c1"># add results to adata's modality-level var table
</span>    <span class="n">adata_regression</span><span class="p">.</span><span class="n">add_regression_results_to_anndata</span><span class="p">(</span>
        <span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">],</span>
        <span class="n">modality_results</span><span class="p">,</span>
        <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="p">)</span>
    
    <span class="n">regression_results</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">modality_results</span><span class="p">)</span>

<span class="n">regression_results_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">regression_results</span><span class="p">)</span>

<span class="n">create_stacked_barplot_from_regression</span><span class="p">(</span><span class="n">regression_results_df</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/regression-output-1.png" alt="" /></p>

<p>The batch effect validation supports my modeling strategy: when fitting
a model with both <em>proteomics_runorder</em> and <em>date_freezing</em> to both
modalities, I observe significant associations for both covariates in
the proteomics data, while runorder shows no effect in the
transcriptomics data, confirming that runorder is indeed a
proteomics-specific technical artifact. Although the overall number of
nominally significant associations appears modest for most biological
effects of interest, the p-value histograms below suggest that this
reflects limited statistical power rather than an absence of biological
signal. This highlights an important opportunity for network-based
methods, which can recover weak but mechanistically coherent
associations by pooling signals across connected molecular components
rather than analyzing features in isolation.</p>

<div class="content-section ai-aside">
  <div class="section-content">
    <p>To break ground on this problem, I
asked Claude to draft a differential expression workflow that accepts an
<code class="language-plaintext highlighter-rouge">AnnData</code> object and a regression formula as input, and outputs
<a href="https://cran.r-project.org/web/packages/broom/vignettes/broom.html">broom</a>-like
tidy summaries for each feature-by-term combination (term, effect size,
t-statistic, p-value). Claude quickly produced a working example with
some nice features, such as parallelization. However, the actual
implementation amounted to spaghetti code. Under the hood, the functions
ignored the provided formulas and instead reformulated them as a set of
simple linear regressions. This approach mishandled covariates, and the
large gap between my intent—as someone with a fair bit of statistical
knowledge—and the actual implementation was concerning.</p>

<p>There was no easy fix, and it didn’t make sense to build a proper
statistical framework from a collection of loosely connected .py files
lacking tests. So, I sidestepped the issue and implemented the workflow
in my <a href="https://github.com/shackett/shackett-utils">shackett-utils</a>
package as feature-level OLS and GAM modules, a multi-model fitting
module, and a lightweight wrapper that specifically supports <code class="language-plaintext highlighter-rouge">AnnData</code>
as input.</p>
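<p>To sketch the shape of this workflow (this is an illustrative reimplementation, not the actual <code class="language-plaintext highlighter-rouge">shackett-utils</code> API; the function name <code class="language-plaintext highlighter-rouge">tidy_ols_per_feature</code> is hypothetical): fit one OLS model per feature using the user-supplied formula, and stack the per-term coefficient summaries into a single broom-style tidy table.</p>

```python
import pandas as pd
import statsmodels.formula.api as smf


def tidy_ols_per_feature(X: pd.DataFrame, obs: pd.DataFrame, formula_rhs: str) -> pd.DataFrame:
    """Fit one OLS model per feature (column of X) against sample-level
    covariates in `obs`, returning tidy term-level summaries."""
    results = []
    for feature in X.columns:
        data = obs.copy()
        data["y"] = X[feature].values
        # the user-supplied right-hand side is passed through unchanged,
        # so covariates are handled jointly rather than one at a time
        fit = smf.ols(f"y ~ {formula_rhs}", data=data).fit()
        results.append(pd.DataFrame({
            "feature": feature,
            "term": fit.params.index,
            "estimate": fit.params.values,
            "statistic": fit.tvalues.values,
            "p_value": fit.pvalues.values,
        }))
    return pd.concat(results, ignore_index=True)
```

<p>The key point is that the formula reaches the fitting routine intact, which is exactly what the original AI-drafted implementation failed to do.</p>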

<p>To flesh out these modules, I collaborated extensively with Claude —
particularly on writing tests. Claude was especially helpful in
implementing tricky features, such as calculating p-values in log-space
to avoid underflow to zero.</p>

<p>This vignette nicely encapsulates my experience doing science with AI.
It’s excellent for breaking ground on a problem and for tasks that are
either easily verifiable (like visualization) or routine (like
exploratory data analysis). For such one-off tasks, AI is probably
sufficient. But for problems where the implementation strategy is
unclear, it’s important to approach the work like traditional software
development: implement specific features, write tests consistently, and
avoid feature bloat. In this context, AI can be a powerful asset —
implementing features and tests in real time — while allowing us
humans to focus on providing code review.</p>

  </div>
</div>

<h3 id="interpreting-p-value-histograms">Interpreting p-value histograms</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># plot p-value histograms
</span><span class="n">TERMS_FOR_PVALUE_HISTOGRAM</span> <span class="o">=</span> <span class="p">[</span><span class="s">"case"</span><span class="p">,</span> <span class="s">"responsive_to_acute_treatment"</span><span class="p">,</span> <span class="s">"MMA_urine"</span><span class="p">,</span> <span class="s">"OHCblPlus"</span><span class="p">,</span> <span class="s">"date_freezing"</span><span class="p">]</span>
<span class="n">stats_viz</span><span class="p">.</span><span class="n">plot_pvalue_histograms</span><span class="p">(</span>
    <span class="n">regression_results_df</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="s">"term in @TERMS_FOR_PVALUE_HISTOGRAM"</span><span class="p">),</span>
    <span class="n">term_column</span> <span class="o">=</span> <span class="s">"term"</span><span class="p">,</span>
    <span class="n">fdr_cutoff</span> <span class="o">=</span> <span class="n">FDR_CUTOFF</span>
<span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pvalue_histograms-output-1.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pvalue_histograms-output-2.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pvalue_histograms-output-3.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pvalue_histograms-output-4.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/pvalue_histograms-output-5.png" alt="" /></p>

<p>P-value distributions provide powerful diagnostics for my regression
models. You can think of a p-value histogram as a mixture of two
distributions: a Uniform(0,1) distribution representing true null
hypotheses, and a distribution skewed toward zero representing true
positives. Visually examining this mixture helps estimate the proportion
of real signals and detect pathological features, such as p-value
clumping or enrichment near 1.</p>
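<p>This mixture view is easy to make concrete with a quick simulation. The Storey-style estimate of the null proportion below is a standard heuristic for such mixtures, shown here for illustration rather than as the method used in this post:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# simulate a p-value mixture: 80% true nulls (Uniform(0,1)),
# 20% true positives (skewed toward zero)
null_p = rng.uniform(0, 1, size=8000)
alt_p = rng.beta(0.2, 5.0, size=2000)  # concentrated near 0
pvals = np.concatenate([null_p, alt_p])

# Storey-style pi0 estimate: above a threshold lambda, nearly all
# p-values come from true nulls, so the density there estimates
# the null proportion
lam = 0.5
pi0 = np.mean(pvals > lam) / (1 - lam)
```

<p>In a histogram of <code class="language-plaintext highlighter-rouge">pvals</code>, the flat region above <code class="language-plaintext highlighter-rouge">lam</code> sets the height of the null component, and the excess near zero is the signal.</p>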

<p>My results show varying levels of biological signal across different
phenotypes: <em>MMA_urine</em> and <em>OHCblPlus</em> display clear evidence of
biological associations, whereas <em>case</em> status shows little signal after
covariate adjustment.</p>

<h3 id="validating-results-focus-on-mmut">Validating results: focus on <em>MMUT</em></h3>

<p>Before moving forward, it’s important to verify that my models are
capturing true biological signals by spot-checking some gold-standard
associations. Let’s examine the top associations and confirm that they
make biological sense, focusing on <em>MMUT</em> (UniProt: P22033), the primary
gene involved in MMA pathogenesis.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">PHENOTYPE_EXAMPLE</span> <span class="o">=</span> <span class="p">[</span><span class="s">"MMA_urine"</span><span class="p">,</span> <span class="s">"OHCblPlus"</span><span class="p">]</span>

<span class="k">for</span> <span class="n">phenotype</span> <span class="ow">in</span> <span class="n">PHENOTYPE_EXAMPLE</span><span class="p">:</span>
    <span class="n">example_stat_summaries</span> <span class="o">=</span> <span class="n">mdata</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">var</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="n">mdata</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">var</span><span class="p">.</span><span class="n">columns</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="n">phenotype</span><span class="p">)].</span><span class="n">sort_values</span><span class="p">(</span><span class="sa">f</span><span class="s">"log10p_</span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">).</span><span class="n">head</span><span class="p">()</span>
    <span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">example_stat_summaries</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
    
    <span class="n">display_tabulator</span><span class="p">(</span>
        <span class="n">example_stat_summaries</span><span class="p">,</span>
        <span class="n">caption</span><span class="o">=</span><span class="sa">f</span><span class="s">"Top associations with </span><span class="si">{</span><span class="n">phenotype</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
        <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
        <span class="n">width</span> <span class="o">=</span> <span class="s">"auto"</span>
    <span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Top associations with MMA_urine
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;uniprot&quot;: &quot;P22033&quot;, &quot;est_MMA_urine&quot;: &quot;-0.207&quot;, &quot;p_MMA_urine&quot;: &quot;0.000&quot;, &quot;log10p_MMA_urine&quot;: &quot;-8.239&quot;, &quot;q_MMA_urine&quot;: &quot;0.000&quot;, &quot;stat_MMA_urine&quot;: &quot;-6.077&quot;, &quot;stderr_MMA_urine&quot;: &quot;0.034&quot;}, {&quot;uniprot&quot;: &quot;O43617&quot;, &quot;est_MMA_urine&quot;: &quot;0.072&quot;, &quot;p_MMA_urine&quot;: &quot;0.000&quot;, &quot;log10p_MMA_urine&quot;: &quot;-6.295&quot;, &quot;q_MMA_urine&quot;: &quot;0.001&quot;, &quot;stat_MMA_urine&quot;: &quot;5.187&quot;, &quot;stderr_MMA_urine&quot;: &quot;0.014&quot;}, {&quot;uniprot&quot;: &quot;O75976&quot;, &quot;est_MMA_urine&quot;: &quot;-0.135&quot;, &quot;p_MMA_urine&quot;: &quot;0.000&quot;, &quot;log10p_MMA_urine&quot;: &quot;-5.239&quot;, &quot;q_MMA_urine&quot;: &quot;0.009&quot;, &quot;stat_MMA_urine&quot;: &quot;-4.655&quot;, &quot;stderr_MMA_urine&quot;: &quot;0.029&quot;}, {&quot;uniprot&quot;: &quot;Q9UJW0&quot;, &quot;est_MMA_urine&quot;: &quot;0.086&quot;, &quot;p_MMA_urine&quot;: &quot;0.000&quot;, &quot;log10p_MMA_urine&quot;: &quot;-5.061&quot;, &quot;q_MMA_urine&quot;: &quot;0.010&quot;, &quot;stat_MMA_urine&quot;: &quot;4.561&quot;, &quot;stderr_MMA_urine&quot;: &quot;0.019&quot;}, {&quot;uniprot&quot;: &quot;O43237&quot;, &quot;est_MMA_urine&quot;: &quot;0.057&quot;, &quot;p_MMA_urine&quot;: &quot;0.000&quot;, &quot;log10p_MMA_urine&quot;: &quot;-4.999&quot;, &quot;q_MMA_urine&quot;: &quot;0.010&quot;, &quot;stat_MMA_urine&quot;: &quot;4.528&quot;, &quot;stderr_MMA_urine&quot;: &quot;0.013&quot;}]" data-columns="[{&quot;title&quot;: &quot;uniprot&quot;, &quot;field&quot;: &quot;uniprot&quot;}, {&quot;title&quot;: &quot;est_MMA_urine&quot;, &quot;field&quot;: &quot;est_MMA_urine&quot;}, {&quot;title&quot;: &quot;p_MMA_urine&quot;, &quot;field&quot;: &quot;p_MMA_urine&quot;}, 
{&quot;title&quot;: &quot;log10p_MMA_urine&quot;, &quot;field&quot;: &quot;log10p_MMA_urine&quot;}, {&quot;title&quot;: &quot;q_MMA_urine&quot;, &quot;field&quot;: &quot;q_MMA_urine&quot;}, {&quot;title&quot;: &quot;stat_MMA_urine&quot;, &quot;field&quot;: &quot;stat_MMA_urine&quot;}, {&quot;title&quot;: &quot;stderr_MMA_urine&quot;, &quot;field&quot;: &quot;stderr_MMA_urine&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Top associations with OHCblPlus
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;uniprot&quot;: &quot;P61106&quot;, &quot;est_OHCblPlus&quot;: &quot;0.070&quot;, &quot;p_OHCblPlus&quot;: &quot;0.000&quot;, &quot;log10p_OHCblPlus&quot;: &quot;-5.308&quot;, &quot;q_OHCblPlus&quot;: &quot;0.014&quot;, &quot;stat_OHCblPlus&quot;: &quot;4.691&quot;, &quot;stderr_OHCblPlus&quot;: &quot;0.015&quot;}, {&quot;uniprot&quot;: &quot;P22033&quot;, &quot;est_OHCblPlus&quot;: &quot;0.283&quot;, &quot;p_OHCblPlus&quot;: &quot;0.000&quot;, &quot;log10p_OHCblPlus&quot;: &quot;-5.221&quot;, &quot;q_OHCblPlus&quot;: &quot;0.014&quot;, &quot;stat_OHCblPlus&quot;: &quot;4.645&quot;, &quot;stderr_OHCblPlus&quot;: &quot;0.061&quot;}, {&quot;uniprot&quot;: &quot;Q13428&quot;, &quot;est_OHCblPlus&quot;: &quot;0.123&quot;, &quot;p_OHCblPlus&quot;: &quot;0.000&quot;, &quot;log10p_OHCblPlus&quot;: &quot;-4.033&quot;, &quot;q_OHCblPlus&quot;: &quot;0.090&quot;, &quot;stat_OHCblPlus&quot;: &quot;3.987&quot;, &quot;stderr_OHCblPlus&quot;: &quot;0.031&quot;}, {&quot;uniprot&quot;: &quot;P32119&quot;, &quot;est_OHCblPlus&quot;: &quot;-0.099&quot;, &quot;p_OHCblPlus&quot;: &quot;0.000&quot;, &quot;log10p_OHCblPlus&quot;: &quot;-3.939&quot;, &quot;q_OHCblPlus&quot;: &quot;0.090&quot;, &quot;stat_OHCblPlus&quot;: &quot;-3.931&quot;, &quot;stderr_OHCblPlus&quot;: &quot;0.025&quot;}, {&quot;uniprot&quot;: &quot;Q7L0Y3&quot;, &quot;est_OHCblPlus&quot;: &quot;0.224&quot;, &quot;p_OHCblPlus&quot;: &quot;0.000&quot;, &quot;log10p_OHCblPlus&quot;: &quot;-3.849&quot;, &quot;q_OHCblPlus&quot;: &quot;0.090&quot;, &quot;stat_OHCblPlus&quot;: &quot;3.877&quot;, &quot;stderr_OHCblPlus&quot;: &quot;0.058&quot;}]" data-columns="[{&quot;title&quot;: &quot;uniprot&quot;, &quot;field&quot;: &quot;uniprot&quot;}, {&quot;title&quot;: &quot;est_OHCblPlus&quot;, &quot;field&quot;: &quot;est_OHCblPlus&quot;}, {&quot;title&quot;: &quot;p_OHCblPlus&quot;, &quot;field&quot;: &quot;p_OHCblPlus&quot;}, {&quot;title&quot;: 
&quot;log10p_OHCblPlus&quot;, &quot;field&quot;: &quot;log10p_OHCblPlus&quot;}, {&quot;title&quot;: &quot;q_OHCblPlus&quot;, &quot;field&quot;: &quot;q_OHCblPlus&quot;}, {&quot;title&quot;: &quot;stat_OHCblPlus&quot;, &quot;field&quot;: &quot;stat_OHCblPlus&quot;}, {&quot;title&quot;: &quot;stderr_OHCblPlus&quot;, &quot;field&quot;: &quot;stderr_OHCblPlus&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">GENE_ASSOCIATIONS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"transcriptomics"</span> <span class="p">:</span> <span class="s">"ENSG00000146085"</span><span class="p">,</span>
    <span class="s">"proteomics"</span> <span class="p">:</span> <span class="s">"P22033"</span>
<span class="p">}</span>

<span class="k">for</span> <span class="n">modality</span><span class="p">,</span> <span class="n">identifier</span> <span class="ow">in</span> <span class="n">GENE_ASSOCIATIONS</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>

    <span class="n">gene_summaries</span> <span class="o">=</span> <span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">].</span><span class="n">var</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">mdata</span><span class="p">[</span><span class="n">modality</span><span class="p">].</span><span class="n">var_names</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="n">identifier</span><span class="p">)]</span>
    <span class="n">df_transposed</span> <span class="o">=</span> <span class="n">gene_summaries</span><span class="p">.</span><span class="n">T</span><span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">df_transposed</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'column'</span><span class="p">,</span> <span class="s">'value'</span><span class="p">]</span>  <span class="c1"># Rename columns
</span>    <span class="k">if</span> <span class="n">modality</span> <span class="o">==</span> <span class="s">"proteomics"</span><span class="p">:</span>
        <span class="c1"># remove entries starting with PG.
</span>        <span class="n">df_transposed</span> <span class="o">=</span> <span class="n">df_transposed</span><span class="p">[</span><span class="o">~</span><span class="n">df_transposed</span><span class="p">[</span><span class="s">"column"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"PG."</span><span class="p">)]</span>

    <span class="c1"># Extract the prefix and term from the column names
</span>    <span class="n">df_transposed</span><span class="p">[</span><span class="s">'prefix'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_transposed</span><span class="p">[</span><span class="s">'column'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="sa">r</span><span class="s">'^([^_]+)_'</span><span class="p">)</span>
    <span class="n">df_transposed</span><span class="p">[</span><span class="s">'term'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_transposed</span><span class="p">[</span><span class="s">'column'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">extract</span><span class="p">(</span><span class="sa">r</span><span class="s">'^[^_]+_(.+)$'</span><span class="p">)</span>

    <span class="c1"># Pivot to get prefixes as columns and terms as rows
</span>    <span class="n">df_pivoted</span> <span class="o">=</span> <span class="n">df_transposed</span><span class="p">.</span><span class="n">pivot</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">'term'</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">'prefix'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'value'</span><span class="p">)</span>
    <span class="n">format_numeric_columns</span><span class="p">(</span><span class="n">df_pivoted</span><span class="p">,</span> <span class="n">inplace</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
    <span class="n">df_pivoted</span> <span class="o">=</span> <span class="n">df_pivoted</span><span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="s">"."</span><span class="p">)</span>

    <span class="n">display_tabulator</span><span class="p">(</span>
        <span class="n">df_pivoted</span><span class="p">,</span>
        <span class="n">caption</span><span class="o">=</span><span class="sa">f</span><span class="s">"Top associations with </span><span class="si">{</span><span class="n">modality</span><span class="si">}</span><span class="s"> </span><span class="si">{</span><span class="n">identifier</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
        <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span><span class="p">,</span>
        <span class="n">width</span> <span class="o">=</span> <span class="s">"auto"</span>
    <span class="p">)</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Top associations with transcriptomics ENSG00000146085
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;term&quot;: &quot;MMA_urine&quot;, &quot;est&quot;: &quot;-0.038&quot;, &quot;log10p&quot;: &quot;-0.378&quot;, &quot;p&quot;: &quot;0.418&quot;, &quot;q&quot;: &quot;0.873&quot;, &quot;stat&quot;: &quot;-0.811&quot;, &quot;stderr&quot;: &quot;0.046&quot;}, {&quot;term&quot;: &quot;OHCblPlus&quot;, &quot;est&quot;: &quot;0.383&quot;, &quot;log10p&quot;: &quot;-6.451&quot;, &quot;p&quot;: &quot;0.000&quot;, &quot;q&quot;: &quot;0.003&quot;, &quot;stat&quot;: &quot;5.261&quot;, &quot;stderr&quot;: &quot;0.073&quot;}, {&quot;term&quot;: &quot;case&quot;, &quot;est&quot;: &quot;-0.606&quot;, &quot;log10p&quot;: &quot;-1.622&quot;, &quot;p&quot;: &quot;0.024&quot;, &quot;q&quot;: &quot;0.627&quot;, &quot;stat&quot;: &quot;-2.275&quot;, &quot;stderr&quot;: &quot;0.266&quot;}, {&quot;term&quot;: &quot;date_freezing&quot;, &quot;est&quot;: &quot;nan&quot;, &quot;log10p&quot;: &quot;nan&quot;, &quot;p&quot;: &quot;0.922&quot;, &quot;q&quot;: &quot;0.927&quot;, &quot;stat&quot;: &quot;nan&quot;, &quot;stderr&quot;: &quot;nan&quot;}, {&quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;est&quot;: &quot;nan&quot;, &quot;log10p&quot;: &quot;nan&quot;, &quot;p&quot;: &quot;0.820&quot;, &quot;q&quot;: &quot;0.973&quot;, &quot;stat&quot;: &quot;nan&quot;, &quot;stderr&quot;: &quot;nan&quot;}, {&quot;term&quot;: &quot;responsive_to_acute_treatment&quot;, &quot;est&quot;: &quot;-0.210&quot;, &quot;log10p&quot;: &quot;-0.666&quot;, &quot;p&quot;: &quot;0.216&quot;, &quot;q&quot;: &quot;0.756&quot;, &quot;stat&quot;: &quot;-1.242&quot;, &quot;stderr&quot;: &quot;0.169&quot;}]" data-columns="[{&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;log10p&quot;}, {&quot;title&quot;: &quot;p&quot;, &quot;field&quot;: &quot;p&quot;}, 
{&quot;title&quot;: &quot;q&quot;, &quot;field&quot;: &quot;q&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;stat&quot;}, {&quot;title&quot;: &quot;stderr&quot;, &quot;field&quot;: &quot;stderr&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Top associations with proteomics P22033
</figcaption>

<div class="data-table" style="width: auto; display: inline-block;" data-table="[{&quot;term&quot;: &quot;MMA_urine&quot;, &quot;est&quot;: -0.2065000018988299, &quot;log10p&quot;: -8.238681857784085, &quot;p&quot;: 5.77189129824518e-09, &quot;q&quot;: 2.763581553599792e-05, &quot;stat&quot;: -6.076596915972019, &quot;stderr&quot;: 0.033982836899392715}, {&quot;term&quot;: &quot;OHCblPlus&quot;, &quot;est&quot;: 0.2825622666318544, &quot;log10p&quot;: -5.220589590447866, &quot;p&quot;: 6.0174211693464486e-06, &quot;q&quot;: 0.014405706279415398, &quot;stat&quot;: 4.645306202223956, &quot;stderr&quot;: 0.06082747925133046}, {&quot;term&quot;: &quot;case&quot;, &quot;est&quot;: -0.18656270778825215, &quot;log10p&quot;: -0.4229257203603832, &quot;p&quot;: 0.37763677458004863, &quot;q&quot;: 0.8947809783310032, &quot;stat&quot;: -0.8841481557356449, &quot;stderr&quot;: 0.21100842271511033}, {&quot;term&quot;: &quot;date_freezing&quot;, &quot;est&quot;: &quot;.&quot;, &quot;log10p&quot;: &quot;.&quot;, &quot;p&quot;: 0.5597588049122902, &quot;q&quot;: 0.6666978004776233, &quot;stat&quot;: &quot;.&quot;, &quot;stderr&quot;: &quot;.&quot;}, {&quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;est&quot;: &quot;.&quot;, &quot;log10p&quot;: &quot;.&quot;, &quot;p&quot;: 0.01337735789196437, &quot;q&quot;: 0.022497642987961152, &quot;stat&quot;: &quot;.&quot;, &quot;stderr&quot;: &quot;.&quot;}, {&quot;term&quot;: &quot;responsive_to_acute_treatment&quot;, &quot;est&quot;: -0.3485970720865991, &quot;log10p&quot;: -2.067490616237422, &quot;p&quot;: 0.008560702085537941, &quot;q&quot;: 0.5123580198194458, &quot;stat&quot;: -2.6543371299990586, &quot;stderr&quot;: 0.13133112148671286}]" data-columns="[{&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;est&quot;, &quot;field&quot;: &quot;est&quot;}, {&quot;title&quot;: &quot;log10p&quot;, &quot;field&quot;: &quot;log10p&quot;}, {&quot;title&quot;: &quot;p&quot;, 
&quot;field&quot;: &quot;p&quot;}, {&quot;title&quot;: &quot;q&quot;, &quot;field&quot;: &quot;q&quot;}, {&quot;title&quot;: &quot;stat&quot;, &quot;field&quot;: &quot;stat&quot;}, {&quot;title&quot;: &quot;stderr&quot;, &quot;field&quot;: &quot;stderr&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>The regression analysis successfully identified expected biological
relationships, providing confidence in my approach. As anticipated,
<em>MMUT</em> showed strong associations with both the <em>MMA_urine</em> and
<em>OHCblPlus</em> biomarkers. Notably, <em>MMUT</em> was associated with
<em>MMA_urine</em> only at the protein level, but with <em>OHCblPlus</em> at both
the transcript and protein levels, suggesting different underlying
biological mechanisms.</p>

<div class="content-section bio-section">
  <div class="section-content">
    <p>The association patterns reveal
distinct regulatory mechanisms. <em>OHCblPlus</em> correlates with both <em>MMUT</em>
protein and transcript levels, consistent with genetic defects impacting
transcription—likely through mutations affecting transcription or
causing nonsense-mediated decay—which in turn result in depleted
protein levels and reduced enzymatic activity. In contrast, the
association of <em>MMA_urine</em> with <em>MMUT</em> protein levels—but not with its
transcripts—suggests that translational and/or post-translational
regulation of <em>MMUT</em> plays a major role in influencing its metabolic
impact.</p>

<p>This disconnect between transcript levels and metabolic outcomes could
be key to understanding idiopathic MMA cases. The selective correlation
with protein levels indicates that <em>MMUT</em> function depends not only on
gene expression but also on post-transcriptional regulatory networks
that modulate protein stability, localization, or activity.
Understanding these mechanisms could help uncover the etiology
underlying cases without clear genetic explanations.</p>

<p>This makes urine MMA particularly valuable as an integrated readout of
disease pathophysiology, rather than merely a simple marker of enzyme
deficiency.</p>

  </div>
</div>

<h2 id="unsupervised-analysis">Unsupervised analysis</h2>

<p>Unsupervised analyses help to identify patterns in the data with minimal
assumptions about what should be important. Factor analysis is
particularly intuitive for genomics data because it describes cellular
states through gene expression programs—coordinated changes in related
genes that reflect underlying biological processes.</p>

<p>Multi-Omics Factor Analysis (<code class="language-plaintext highlighter-rouge">MOFA</code>) extends this concept to multimodal
data by discovering the principal sources of variation across different
data types. Rather than concatenating features and treating transcripts
and proteins equivalently, <code class="language-plaintext highlighter-rouge">MOFA</code> disentangles axes of heterogeneity
shared across multiple modalities from those specific to individual data
types. This will allow me to determine whether certain biological
programs create coordinated responses across transcriptomics and
proteomics or manifest differently in each modality.</p>
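<p>A toy example makes the shared-versus-specific distinction concrete. Below, sample-level factor usages <code class="language-plaintext highlighter-rouge">Z</code> drive two simulated modalities through modality-specific loadings <code class="language-plaintext highlighter-rouge">W</code>: one factor loads on both modalities, while the other two are modality-specific. A MOFA-style per-factor, per-modality variance-explained calculation then recovers that structure (this is a hand-rolled illustration, not the <code class="language-plaintext highlighter-rouge">mofapy2</code> implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3

# factor 0 is shared; factor 1 is RNA-specific; factor 2 is protein-specific
Z = rng.normal(size=(n, k))  # sample-by-factor usages
W = {
    "transcriptomics": np.diag([1.0, 1.0, 0.0]) @ rng.normal(size=(k, 50)),
    "proteomics":      np.diag([1.0, 0.0, 1.0]) @ rng.normal(size=(k, 40)),
}
Y = {m: Z @ w + 0.1 * rng.normal(size=(n, w.shape[1])) for m, w in W.items()}


def r2_per_factor(Y_m, Z, W_m):
    """Variance explained by each factor in one modality: drop factor j's
    reconstruction and measure how much residual variance it accounted for."""
    total = np.sum(Y_m ** 2)
    return [1 - np.sum((Y_m - np.outer(Z[:, j], W_m[j])) ** 2) / total
            for j in range(Z.shape[1])]
```

<p>Here the shared factor explains variance in both modalities, while each modality-specific factor explains essentially none in the other modality, which is the decomposition MOFA reports in its variance-explained plots.</p>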

<p>Each factor represents a biological or technical program with two key
components: loadings (which features participate in the program) and
usages (how much each sample expresses the program). <code class="language-plaintext highlighter-rouge">MOFA</code> requires
selecting the optimal number of factors — too few factors miss
important biological programs, while too many introduce noise and
overfitting. I will systematically test different numbers of factors to
select an optimal value that balances variance explained and factor
interpretability.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># update the MuData object so `nvars` and `nobs` are appropriate
</span><span class="n">intersected_mdata</span> <span class="o">=</span> <span class="n">md</span><span class="p">.</span><span class="n">MuData</span><span class="p">({</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">mdata</span><span class="p">.</span><span class="n">mod</span><span class="p">.</span><span class="n">items</span><span class="p">()})</span>

<span class="n">muon</span><span class="p">.</span><span class="n">pp</span><span class="p">.</span><span class="n">intersect_obs</span><span class="p">(</span><span class="n">intersected_mdata</span><span class="p">)</span>  <span class="c1"># This modifies mdata in-place
</span>
<span class="c1"># Run the factor scan
</span><span class="n">results_dict</span> <span class="o">=</span> <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">run_mofa_factor_scan</span><span class="p">(</span>
    <span class="n">intersected_mdata</span><span class="p">,</span> 
    <span class="n">factor_range</span><span class="o">=</span><span class="n">FACTOR_RANGE</span><span class="p">,</span>
    <span class="n">use_layer</span><span class="o">=</span><span class="n">ANALYSIS_LAYER</span><span class="p">,</span>  <span class="c1"># Adjust to your normalized layer
</span>    <span class="n">models_dir</span><span class="o">=</span><span class="n">MOFA_PARAM_SCAN_MODELS_PATH</span><span class="p">,</span>
    <span class="n">overwrite</span><span class="o">=</span><span class="n">OVERWRITE</span>
<span class="p">)</span>

<span class="c1"># Extract variance metrics from all models
</span><span class="n">metrics</span> <span class="o">=</span> <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">calculate_variance_metrics</span><span class="p">(</span><span class="n">results_dict</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">visualize_factor_scan_results</span><span class="p">(</span><span class="n">metrics</span><span class="p">,</span> <span class="n">user_factors</span><span class="o">=</span><span class="n">OPTIMAL_FACTOR</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<pre><code class="language-output">    Optimal number of factors based on different criteria:
      Elbow method: 10 factors
      Threshold method: 48 factors
      Balanced method: 24 factors
      User-specified: 30 factors
</code></pre>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/factor_scan_results-output-2.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/factor_scan_results-output-3.png" alt="" /></p>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/factor_scan_results-output-4.png" alt="" /></p>

<p>The factor scan results help me identify the optimal model complexity. I
look for the point where adding more factors yields diminishing returns
in variance explained while avoiding overfitting that would reduce
factor interpretability.</p>
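The elbow criterion can be sketched generically. The function below is an illustration only (not the actual `mdata_factor_analysis` implementation): it picks the scan point farthest from the chord joining the endpoints of the variance-explained curve, a common heuristic for diminishing-returns curves.

```python
import numpy as np

def elbow_k(var_explained):
    """Return the 1-based elbow index of a diminishing-returns curve.

    The elbow is taken as the point farthest from the straight line
    (chord) joining the first and last points of the curve.
    """
    x = np.arange(len(var_explained), dtype=float)
    y = np.asarray(var_explained, dtype=float)
    # unit vector along the chord from first to last point
    chord = np.array([x[-1] - x[0], y[-1] - y[0]])
    chord /= np.linalg.norm(chord)
    # perpendicular distance of each point from the chord (2D cross product)
    rel = np.stack([x - x[0], y - y[0]], axis=1)
    dist = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    return int(np.argmax(dist)) + 1

# e.g., total variance explained for K = 1..5 factors
elbow_k([0.0, 0.5, 0.72, 0.78, 0.8])  # elbow at K = 3
```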

<div class="content-section ai-aside">
  <div class="section-content">
    <p>Claude did well organizing the
hyperparameter scan and implementing various approaches for selecting
the optimal number of factors (K). While I wasn’t particularly impressed
with the specific model selection criteria it chose, this didn’t matter
much in practice. Factor analyses like MOFA exhibit diminishing returns,
where later factors tend to have smaller, sparser loadings and capture
less meaningful variation. This means the overall results are relatively
robust to the exact choice of K, as long as it falls within a reasonable
range.</p>

  </div>
</div>

<p>With 30 factors selected as optimal, I can fit the final MOFA model and
examine the distributions of the factors:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span><span class="p">)</span> <span class="ow">or</span> <span class="n">OVERWRITE</span><span class="p">:</span>

    <span class="n">optimal_model</span> <span class="o">=</span> <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">create_minimal_mudata</span><span class="p">(</span>
        <span class="n">intersected_mdata</span><span class="p">,</span>
        <span class="n">include_layers</span><span class="o">=</span><span class="p">[</span><span class="n">ANALYSIS_LAYER</span><span class="p">],</span>
        <span class="n">include_obsm</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">include_varm</span><span class="o">=</span><span class="bp">True</span>
    <span class="p">)</span>

    <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">_mofa</span><span class="p">(</span>
        <span class="n">optimal_model</span><span class="p">,</span>
        <span class="n">n_factors</span><span class="o">=</span><span class="n">OPTIMAL_FACTOR</span><span class="p">,</span>
        <span class="n">use_obs</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
        <span class="n">use_var</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
        <span class="n">use_layer</span><span class="o">=</span><span class="n">ANALYSIS_LAYER</span><span class="p">,</span>
        <span class="n">convergence_mode</span><span class="o">=</span><span class="s">"medium"</span><span class="p">,</span>
        <span class="n">verbose</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="n">save_metadata</span><span class="o">=</span><span class="bp">True</span>
    <span class="p">)</span>

    <span class="n">md</span><span class="p">.</span><span class="n">write_h5mu</span><span class="p">(</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span><span class="p">,</span> <span class="n">optimal_model</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
    <span class="n">optimal_model</span> <span class="o">=</span> <span class="n">md</span><span class="p">.</span><span class="n">read_h5mu</span><span class="p">(</span><span class="n">OPTIMAL_MODEL_H5MU_PATH</span><span class="p">)</span>
    
<span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">plot_mofa_factor_histograms</span><span class="p">(</span><span class="n">optimal_model</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/figure/source/2025-08-19-multiomic_profiles/optimal_mofa_summary-output-1.png" alt="" /></p>

<p>With 30 factors extracted, I aim to identify those that capture disease
biology versus technical variation or normal biological heterogeneity.
To do this, I’ll apply the same regression approach used for individual
features, testing whether factor usages (how much each sample expresses
each factor) correlate with disease phenotypes such as MMA_urine levels
and treatment responsiveness. Covariates are less of a concern here
because different biological and technical effects are generally
captured in separate components — one of the key strengths of factor
analysis is its ability to account for both observed and latent
confounding variables.</p>

<p>Factors that show significant associations across multiple disease
measures, explain substantial variance in disease severity, and contain
biologically coherent gene/protein sets will be my primary targets for
network analysis—these represent coordinated disease programs that I
can trace back to their regulatory origins.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Apply regression per factor
</span><span class="n">factor_regressions</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">formula_name</span><span class="p">,</span> <span class="n">formula</span> <span class="ow">in</span> <span class="n">REGRESSION_FORMULAS</span><span class="p">[</span><span class="s">"proteomics"</span><span class="p">].</span><span class="n">items</span><span class="p">():</span>

    <span class="c1"># Run regression analysis
</span>    <span class="n">regression_results</span> <span class="o">=</span> <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">regress_factors_with_formula</span><span class="p">(</span>
        <span class="n">optimal_model</span><span class="p">,</span>
        <span class="n">formula</span><span class="o">=</span><span class="n">formula</span><span class="p">,</span>
        <span class="n">factors</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>  <span class="c1"># Use all factors
</span>        <span class="n">progress_bar</span><span class="o">=</span><span class="bp">False</span>
    <span class="p">).</span><span class="n">assign</span><span class="p">(</span><span class="n">formula_name</span> <span class="o">=</span> <span class="n">formula_name</span><span class="p">)</span>

    <span class="n">factor_regressions</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">regression_results</span><span class="p">)</span>
    
    <span class="c1"># Generate summary table
</span>    <span class="n">summary_table</span> <span class="o">=</span> <span class="n">mdata_factor_analysis</span><span class="p">.</span><span class="n">summarize_factor_regression</span><span class="p">(</span>
        <span class="n">regression_results</span><span class="p">,</span>
        <span class="n">alpha</span><span class="o">=</span><span class="n">FDR_CUTOFF</span><span class="p">,</span>
        <span class="n">group_by_factor</span><span class="o">=</span><span class="bp">False</span>
    <span class="p">).</span><span class="n">query</span><span class="p">(</span><span class="sa">f</span><span class="s">"term == '</span><span class="si">{</span><span class="n">formula_name</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">summary_table</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>

        <span class="n">display_tabulator</span><span class="p">(</span>
            <span class="n">summary_table</span><span class="p">,</span>
            <span class="n">caption</span><span class="o">=</span><span class="sa">f</span><span class="s">"Factors associated with </span><span class="si">{</span><span class="n">formula_name</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
            <span class="n">layout</span><span class="o">=</span><span class="s">"fitDataStretch"</span>
        <span class="p">)</span>

<span class="n">factor_regressions_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">factor_regressions</span><span class="p">)</span>
<span class="n">optimal_model</span><span class="p">.</span><span class="n">uns</span><span class="p">[</span><span class="s">"factor_regressions"</span><span class="p">]</span> <span class="o">=</span> <span class="n">factor_regressions_df</span>
</code></pre></div></div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Factors associated with case
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: 36, &quot;factor_name&quot;: &quot;Factor_10&quot;, &quot;term&quot;: &quot;case&quot;, &quot;estimate&quot;: &quot;0.207&quot;, &quot;standard_error&quot;: &quot;0.052&quot;, &quot;statistic&quot;: 3.991545478953449, &quot;p_value&quot;: &quot;9.22e-05&quot;, &quot;q_value&quot;: &quot;2.77e-03&quot;}, {&quot;index&quot;: 32, &quot;factor_name&quot;: &quot;Factor_9&quot;, &quot;term&quot;: &quot;case&quot;, &quot;estimate&quot;: &quot;-0.650&quot;, &quot;standard_error&quot;: &quot;0.236&quot;, &quot;statistic&quot;: -2.7563012075787734, &quot;p_value&quot;: &quot;6.39e-03&quot;, &quot;q_value&quot;: &quot;9.58e-02&quot;}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;factor_name&quot;, &quot;field&quot;: &quot;factor_name&quot;}, {&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;estimate&quot;, &quot;field&quot;: &quot;estimate&quot;}, {&quot;title&quot;: &quot;standard_error&quot;, &quot;field&quot;: &quot;standard_error&quot;}, {&quot;title&quot;: &quot;statistic&quot;, &quot;field&quot;: &quot;statistic&quot;}, {&quot;title&quot;: &quot;p_value&quot;, &quot;field&quot;: &quot;p_value&quot;}, {&quot;title&quot;: &quot;q_value&quot;, &quot;field&quot;: &quot;q_value&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Factors associated with MMA_urine
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: 28, &quot;factor_name&quot;: &quot;Factor_8&quot;, &quot;term&quot;: &quot;MMA_urine&quot;, &quot;estimate&quot;: &quot;-0.150&quot;, &quot;standard_error&quot;: &quot;0.045&quot;, &quot;statistic&quot;: -3.342278260663408, &quot;p_value&quot;: &quot;9.92e-04&quot;, &quot;q_value&quot;: &quot;2.98e-02&quot;}, {&quot;index&quot;: 8, &quot;factor_name&quot;: &quot;Factor_3&quot;, &quot;term&quot;: &quot;MMA_urine&quot;, &quot;estimate&quot;: &quot;0.040&quot;, &quot;standard_error&quot;: &quot;0.014&quot;, &quot;statistic&quot;: 2.814093785284151, &quot;p_value&quot;: &quot;5.38e-03&quot;, &quot;q_value&quot;: &quot;5.94e-02&quot;}, {&quot;index&quot;: 36, &quot;factor_name&quot;: &quot;Factor_10&quot;, &quot;term&quot;: &quot;MMA_urine&quot;, &quot;estimate&quot;: &quot;0.025&quot;, &quot;standard_error&quot;: &quot;0.009&quot;, &quot;statistic&quot;: 2.7808623950226816, &quot;p_value&quot;: &quot;5.94e-03&quot;, &quot;q_value&quot;: &quot;5.94e-02&quot;}, {&quot;index&quot;: 60, &quot;factor_name&quot;: &quot;Factor_16&quot;, &quot;term&quot;: &quot;MMA_urine&quot;, &quot;estimate&quot;: &quot;0.038&quot;, &quot;standard_error&quot;: &quot;0.015&quot;, &quot;statistic&quot;: 2.6002535780948643, &quot;p_value&quot;: &quot;1.00e-02&quot;, &quot;q_value&quot;: &quot;7.51e-02&quot;}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;factor_name&quot;, &quot;field&quot;: &quot;factor_name&quot;}, {&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;estimate&quot;, &quot;field&quot;: &quot;estimate&quot;}, {&quot;title&quot;: &quot;standard_error&quot;, &quot;field&quot;: &quot;standard_error&quot;}, {&quot;title&quot;: &quot;statistic&quot;, &quot;field&quot;: &quot;statistic&quot;}, {&quot;title&quot;: &quot;p_value&quot;, &quot;field&quot;: &quot;p_value&quot;}, {&quot;title&quot;: 
&quot;q_value&quot;, &quot;field&quot;: &quot;q_value&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Factors associated with proteomics_runorder
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: 6, &quot;factor_name&quot;: &quot;Factor_3&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;1.11e-16&quot;, &quot;q_value&quot;: &quot;1.67e-15&quot;}, {&quot;index&quot;: 27, &quot;factor_name&quot;: &quot;Factor_10&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;1.11e-16&quot;, &quot;q_value&quot;: &quot;1.67e-15&quot;}, {&quot;index&quot;: 39, &quot;factor_name&quot;: &quot;Factor_14&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;3.37e-07&quot;, &quot;q_value&quot;: &quot;3.37e-06&quot;}, {&quot;index&quot;: 21, &quot;factor_name&quot;: &quot;Factor_8&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;5.79e-04&quot;, &quot;q_value&quot;: &quot;4.34e-03&quot;}, {&quot;index&quot;: 69, &quot;factor_name&quot;: &quot;Factor_24&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;1.51e-02&quot;, &quot;q_value&quot;: &quot;7.87e-02&quot;}, {&quot;index&quot;: 12, &quot;factor_name&quot;: &quot;Factor_5&quot;, &quot;term&quot;: &quot;proteomics_runorder&quot;, &quot;p_value&quot;: &quot;1.57e-02&quot;, &quot;q_value&quot;: &quot;7.87e-02&quot;}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;factor_name&quot;, &quot;field&quot;: &quot;factor_name&quot;}, {&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;p_value&quot;, &quot;field&quot;: &quot;p_value&quot;}, {&quot;title&quot;: &quot;q_value&quot;, &quot;field&quot;: &quot;q_value&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<figcaption style="font-weight:bold; margin-bottom:0.5em">
    Factors associated with date_freezing
</figcaption>

<div class="data-table" style="" data-table="[{&quot;index&quot;: 1, &quot;factor_name&quot;: &quot;Factor_1&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;6.29e-13&quot;, &quot;q_value&quot;: &quot;1.89e-11&quot;}, {&quot;index&quot;: 88, &quot;factor_name&quot;: &quot;Factor_30&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;2.52e-05&quot;, &quot;q_value&quot;: &quot;3.41e-04&quot;}, {&quot;index&quot;: 22, &quot;factor_name&quot;: &quot;Factor_8&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;3.41e-05&quot;, &quot;q_value&quot;: &quot;3.41e-04&quot;}, {&quot;index&quot;: 49, &quot;factor_name&quot;: &quot;Factor_17&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;1.13e-04&quot;, &quot;q_value&quot;: &quot;8.48e-04&quot;}, {&quot;index&quot;: 55, &quot;factor_name&quot;: &quot;Factor_19&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;2.69e-04&quot;, &quot;q_value&quot;: &quot;1.62e-03&quot;}, {&quot;index&quot;: 31, &quot;factor_name&quot;: &quot;Factor_11&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;4.07e-04&quot;, &quot;q_value&quot;: &quot;2.03e-03&quot;}, {&quot;index&quot;: 46, &quot;factor_name&quot;: &quot;Factor_16&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;8.81e-04&quot;, &quot;q_value&quot;: &quot;3.77e-03&quot;}, {&quot;index&quot;: 37, &quot;factor_name&quot;: &quot;Factor_13&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;6.41e-03&quot;, &quot;q_value&quot;: &quot;2.40e-02&quot;}, {&quot;index&quot;: 64, &quot;factor_name&quot;: &quot;Factor_22&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;9.81e-03&quot;, &quot;q_value&quot;: &quot;3.27e-02&quot;}, {&quot;index&quot;: 28, &quot;factor_name&quot;: &quot;Factor_10&quot;, &quot;term&quot;: &quot;date_freezing&quot;, 
&quot;p_value&quot;: &quot;1.22e-02&quot;, &quot;q_value&quot;: &quot;3.51e-02&quot;}, {&quot;index&quot;: 52, &quot;factor_name&quot;: &quot;Factor_18&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;1.36e-02&quot;, &quot;q_value&quot;: &quot;3.51e-02&quot;}, {&quot;index&quot;: 43, &quot;factor_name&quot;: &quot;Factor_15&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;1.40e-02&quot;, &quot;q_value&quot;: &quot;3.51e-02&quot;}, {&quot;index&quot;: 25, &quot;factor_name&quot;: &quot;Factor_9&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;1.79e-02&quot;, &quot;q_value&quot;: &quot;4.13e-02&quot;}, {&quot;index&quot;: 13, &quot;factor_name&quot;: &quot;Factor_5&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;2.03e-02&quot;, &quot;q_value&quot;: &quot;4.35e-02&quot;}, {&quot;index&quot;: 67, &quot;factor_name&quot;: &quot;Factor_23&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;2.17e-02&quot;, &quot;q_value&quot;: &quot;4.35e-02&quot;}, {&quot;index&quot;: 58, &quot;factor_name&quot;: &quot;Factor_20&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;3.15e-02&quot;, &quot;q_value&quot;: &quot;5.90e-02&quot;}, {&quot;index&quot;: 10, &quot;factor_name&quot;: &quot;Factor_4&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;4.45e-02&quot;, &quot;q_value&quot;: &quot;7.85e-02&quot;}, {&quot;index&quot;: 85, &quot;factor_name&quot;: &quot;Factor_29&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;5.24e-02&quot;, &quot;q_value&quot;: &quot;8.73e-02&quot;}, {&quot;index&quot;: 79, &quot;factor_name&quot;: &quot;Factor_27&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;5.64e-02&quot;, &quot;q_value&quot;: &quot;8.91e-02&quot;}, {&quot;index&quot;: 40, &quot;factor_name&quot;: &quot;Factor_14&quot;, 
&quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;6.13e-02&quot;, &quot;q_value&quot;: &quot;9.08e-02&quot;}, {&quot;index&quot;: 61, &quot;factor_name&quot;: &quot;Factor_21&quot;, &quot;term&quot;: &quot;date_freezing&quot;, &quot;p_value&quot;: &quot;6.36e-02&quot;, &quot;q_value&quot;: &quot;9.08e-02&quot;}]" data-columns="[{&quot;title&quot;: &quot;index&quot;, &quot;field&quot;: &quot;index&quot;}, {&quot;title&quot;: &quot;factor_name&quot;, &quot;field&quot;: &quot;factor_name&quot;}, {&quot;title&quot;: &quot;term&quot;, &quot;field&quot;: &quot;term&quot;}, {&quot;title&quot;: &quot;p_value&quot;, &quot;field&quot;: &quot;p_value&quot;}, {&quot;title&quot;: &quot;q_value&quot;, &quot;field&quot;: &quot;q_value&quot;}]" data-options="{&quot;layout&quot;: &quot;fitDataStretch&quot;, &quot;responsiveLayout&quot;: &quot;collapse&quot;}">
</div>

<p>While I generally like this approach and have seen it successfully
applied to single-cell RNA-seq data using
<a href="https://github.com/dylkot/cNMF">cNMF</a>, it didn’t work particularly well
for this dataset. Most disease phenotypes show only weak correlations
with factor usages, similar to the feature-wise significance tests. The
clearest exception is urine MMA levels, which are strongly associated
with latent factor 3 (<em>LF3</em>). However, <em>LF3</em> is bimodal and also
correlated with proteomics run order and sample freezing date, so buyer
beware. Curiously, case status was also associated with a couple of
factors despite little feature-level signal.</p>

<p>The prominence of non-disease biology in my factor analysis highlights
both a key strength and a limitation of the method: it is agnostic to
the biology of interest and instead captures the dominant sources of
variation in the data. In human disease datasets, coherent disease
signatures are often distributed across many small factors rather than
concentrated in a few large ones, making them harder to detect amid
technical variation and biological heterogeneity.</p>

<h2 id="summary-and-next-steps">Summary and next steps</h2>

<p>This analysis demonstrates a systematic approach to extracting
disease-relevant molecular profiles from multimodal genomics data. Using
the Forny et al. MMA dataset, I have shown how careful data processing,
covariate adjustment, and the use of both supervised and unsupervised
methods can reveal subtle but biologically meaningful disease signals.</p>

<p>Methodological insights:</p>

<ul>
  <li>
    <p><strong>scverse integration</strong>: Demonstrated how organizing multimodal data
using <code class="language-plaintext highlighter-rouge">AnnData</code>/<code class="language-plaintext highlighter-rouge">MuData</code> containers facilitates reproducible
analysis while maintaining seamless integration with the broader
genomics ecosystem.</p>
  </li>
  <li>
    <p><strong>Modality-specific covariate modeling</strong>: PCA revealed that batch
effects (proteomics run order, sample freezing dates) impacted
transcriptomics and proteomics data differently, enabling tailored
regression models for each modality that improved my statistical
power for detecting biological signals.</p>
  </li>
  <li>
    <p><strong>AI-assisted development workflow</strong>: Large language models (LLMs)
proved effective for rapid prototyping and handling complex
visualization syntax (e.g., <code class="language-plaintext highlighter-rouge">matplotlib</code>), but encountered serious
issues with statistical implementation — initially producing
spaghetti code that silently converted multiple regressions into
simple regressions. The solution was to treat AI as a development
collaborator, following software engineering best practices by
implementing specific features with tests in a structured Python
package rather than relying on AI for end-to-end development.</p>
  </li>
</ul>

<p>Biological insights:</p>

<ul>
  <li>
    <p><strong>MMUT regulatory patterns</strong>: Different association patterns between
transcript and protein levels suggest that translational or
post-translational control mechanisms shape metabolic impact,
potentially explaining idiopathic MMA cases.</p>
  </li>
  <li>
    <p><strong>Biomarker performance</strong>: Urine MMA showed stronger statistical
associations with molecular features than traditional enzyme
activity measures, suggesting it captures integrated disease
pathophysiology beyond simple enzyme deficiency.</p>
  </li>
  <li>
    <p><strong>Disease heterogeneity</strong>: Weak PCA associations with disease status
mirror the clinical heterogeneity observed in MMA patients,
confirming that molecular signatures reflect the complex and
variable nature of disease presentation.</p>
  </li>
</ul>

<h3 id="transitioning-to-network-analysis">Transitioning to network analysis</h3>

<p>While my statistical approach successfully identified disease-associated
molecular programs, interpreting their biological significance requires
additional context. These molecular signatures represent starting points
for deeper mechanistic investigation. In part two, I will map these
statistical associations onto genome-scale biological networks using
<code class="language-plaintext highlighter-rouge">Napistu</code>. This approach will trace disease signals from downstream
molecular effects to potential upstream regulatory causes, converting
statistical associations into testable biological hypotheses.</p>]]></content><author><name>Sean Hackett</name></author><category term="napistu" /><category term="genomics" /><category term="python" /><summary type="html"><![CDATA[This is part one of a two-part post highlighting Napistu — a new framework for building genome-scale networks of molecular biology and biochemistry. In this post, I’ll tackle a fundamental challenge in computational biology: how to extract meaningful disease signatures from complex multimodal datasets. Using methylmalonic acidemia (MMA) as my test case, I’ll demonstrate how to systematically extract disease signatures from multimodal data. My approach combines three complementary analytical strategies: exploratory data analysis to assess data structure and quality, differential expression analysis to identify disease-associated features, and factor analysis to uncover coordinated gene expression programs across data types. The end goal is to distill thousands of molecular measurements into a handful of interpretable disease signatures — each capturing a distinct aspect of disease biology that can be mapped to regulatory networks. Throughout this post, I’ll use two types of asides to provide additional context without disrupting the main analytical flow. Green boxes contain biological details, while blue boxes reflect on the computational workflow and AI-assisted development process.]]></summary></entry><entry><title type="html">Flattening the Gompertz Distribution</title><link href="https://www.shackett.org/gompertz/" rel="alternate" type="text/html" title="Flattening the Gompertz Distribution" /><published>2025-02-02T00:00:00+00:00</published><updated>2025-02-02T00:00:00+00:00</updated><id>https://www.shackett.org/gompertz</id><content type="html" xml:base="https://www.shackett.org/gompertz/"><![CDATA[<p>In this post I’ll explore the Gompertz law of mortality which describes individuals’ accelerating risk of death with age.</p>

<p>The Gompertz equation describes the per-year hazard (i.e., the likelihood that an individual alive at time $t$ dies before $t+1$) as the product of an age-independent parameter $\alpha$ and an age-dependent component which increases exponentially with time, scaled by another parameter $\beta$ ($e^{\beta \cdot t}$).</p>

<p>The equation is thus:</p>

\[\large h(t) = \alpha \cdot e^{\beta \cdot t}\]

<p>The Gompertz equation is often studied by taking its natural log, resulting in a linear relationship between log(hazard) and age.</p>

\[\large \ln(h(t)) = \ln(\alpha) + \beta \cdot t\]
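A minimal numeric sketch of both forms, using illustrative parameter values (the $\alpha$ and $\beta$ below are made up for demonstration, not fitted to any life table):

```python
import math

def gompertz_hazard(t, alpha, beta):
    """Per-year hazard under the Gompertz law: h(t) = alpha * exp(beta * t)."""
    return alpha * math.exp(beta * t)

alpha, beta = 1e-4, 0.085  # illustrative values only

# On the log scale, hazard is linear in age: ln(h) = ln(alpha) + beta * t,
# so each added year multiplies the hazard by a constant factor exp(beta).
ratio = gompertz_hazard(61, alpha, beta) / gompertz_hazard(60, alpha, beta)
# ratio == exp(beta), i.e. roughly a 9% increase in risk per year of age
```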

<p>Formulating and estimating the parameters of demographic hazard models like the Gompertz’s equation is an active area of research, and there is a lot of information out there catering to both the academic and lay audiences. Still, when reviewing this literature, <strong>I did not see a clear summary of how decreases in $\beta$ (the chief aim of longevity research) would lead to lifespan extension.</strong></p>

<!--more-->

<p>For me, this is one of the most interesting properties of the Gompertz model, and this potential has been touted by my colleagues <a href="https://calicolabs.com/people/j-graham-ruby">Graham Ruby</a> and <a href="https://calicolabs.com/people/eugene-melamud">Eugene Melamud</a> at Calico for some time.</p>

<p>Here, I’ll focus on a quantitative treatment of two questions related to the potential of lifespan extension through slowing aging:</p>

<ol>
  <li>How was the lifespan extension seen across the $20^{th}$ century borne out in changes in baseline mortality ($\alpha$)? <em>The 1.9-fold lifespan extension across the century was borne out through a 26-fold decrease in baseline mortality ($\alpha$).</em></li>
  <li>What would life expectancy be if there were a comparable decrease in $\beta$? <em>A 26-fold decrease in $\beta$ would extend human life expectancy to around 1,000 years.</em></li>
</ol>
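These two scenarios can be sketched numerically. Under the Gompertz law the survival function is $S(t) = e^{-\frac{\alpha}{\beta}(e^{\beta t} - 1)}$, and life expectancy is the area under $S(t)$. The parameter values below are illustrative only (chosen to give a roughly modern life expectancy), not the fitted values discussed later:

```python
import math

def gompertz_life_expectancy(alpha, beta, dt=0.05):
    """Approximate e0 = integral of S(t) dt, where under the Gompertz law
    S(t) = exp(-(alpha / beta) * (exp(beta * t) - 1))."""
    total, t, s = 0.0, 0.0, 1.0
    while s > 1e-12:  # stop once essentially no one is left alive
        s = math.exp(-(alpha / beta) * (math.exp(beta * t) - 1.0))
        total += s * dt
        t += dt
    return total

alpha, beta = 1e-4, 0.085  # illustrative modern-ish parameters
base = gompertz_life_expectancy(alpha, beta)
less_alpha = gompertz_life_expectancy(alpha / 26, beta)  # 26-fold lower alpha
less_beta = gompertz_life_expectancy(alpha, beta / 26)   # 26-fold lower beta
# Lowering alpha shifts the hazard curve down (decades of extension);
# lowering beta flattens it (extension on the order of centuries).
```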

<p>I am NOT an expert on demographic models, so through ignorance (but also conveniently for brevity!) I’ll be leaving out lots of relevant information.</p>

<p><em>Additionally, while I am a current Calico employee, I want to emphasize that this and any other posts on my personal blog are my personal thoughts and do not necessarily reflect the opinions of my employer.</em></p>

<h1 id="background">Background</h1>

<h2 id="life-tables">Life Tables</h2>

<p>Demographic models of mortality and life expectancy are built from “life tables.” Life tables contain both the number of individuals of a given age that are living in the selected year and the number of deaths that occurred at these ages during that year. This is sufficient to calculate the probability of an individual of a given age dying that year (i.e., the hazard).</p>

<p>Life tables come in two varieties: “period” life tables describe the mortality of a population within a narrow window of time, while “cohort” life tables describe the mortality of a cohort of individuals born within a narrow window of time. In this post I’ll be focusing on period life tables, which can be obtained from many sources; for example, here is one from the <a href="https://www.ssa.gov/oact/STATS/table4c6.html">social security administration</a>.</p>

<p><img src="https://www.shackett.org/figure/gompertz/lifetable.png" alt="Example Life Table" class="align-center" /></p>
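The hazard calculation itself is simple division. Here is a sketch using hypothetical counts (these numbers are made up, not taken from the SSA table above):

```python
# Hypothetical life-table fragment: age -> (individuals alive, deaths that year)
life_table = {
    70: (80_000, 2_000),
    71: (78_000, 2_150),
    72: (75_850, 2_300),
}

def hazard(age):
    """Probability that an individual of `age` dies before reaching age + 1."""
    alive, deaths = life_table[age]
    return deaths / alive

hazard(70)  # 0.025: a 70-year-old has a 2.5% chance of dying this year
```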

<h2 id="the-origin-of-life-tables">The Origin of Life Tables</h2>

<p>As mentioned above, creating life tables requires (1) the number of individuals at each age, and (2) the number of deaths at each age. The easiest way to accomplish (1) is through a census, while (2) requires solid record keeping. These conditions were rarely in place at the same time, so early life tables were derived from birth and death records rather than direct summaries of demographics.</p>

<p>In 1693 the British astronomer Edmond Halley studied the birth and death records of the city of Breslau. Breslau happened to be in a steady state where the number of births approximately equaled the number of deaths. He used this information to create the first life table describing the distribution of residents’ ages. Halley then demonstrated how to price annuities, leading to the birth of <a href="https://en.wikipedia.org/wiki/Actuarial_science">Actuarial Science</a>.</p>

<p>Around 1750, the Swiss mathematician Leonhard Euler re-derived the work of Halley while exploring the exponential growth of populations. His formulation of exponential growth allowed life tables to be generated for non-stationary populations.</p>

<p>With proper censuses it is now easier to plug measured values directly into a life table, but the historical lack of this information produced elegant math for describing population dynamics from indirect measurements. These methods continue to be fundamental for studying both evolutionary and ecological dynamics.</p>

<h2 id="gompertzs-law">Gompertz’s Law</h2>

<p>Nearly 200 years ago, Benjamin Gompertz, a British actuary, described the mortality of populations in a life table (i.e., the survival curve) as a symmetrical sigmoidal function. Because the function is sigmoidal, absolute mortality peaks at the average lifespan; past this point, although relative mortality continues to increase, the shrinking number of individuals remaining results in fewer total deaths.</p>

<p>Later, the modern Gompertz law of mortality was derived from the more general Gompertz equation. Unlike the sigmoidal formulation, the Gompertz law focuses on relative mortality, so risk continues to increase exponentially despite the winnowing of the aging population.</p>

\[\large h(t) = \alpha \cdot e^{\beta \cdot t}\]
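<p>Putting numbers on this: with an exponential hazard, relative risk doubles every $\ln(2)/\beta$ years. Below is a quick numeric sketch (in Python rather than the R used elsewhere in this post), using the illustrative $\alpha$ and $\beta$ values that appear later; the parameter values are assumptions for demonstration, not fitted estimates.</p>

```python
import math

# Gompertz hazard: h(t) = alpha * exp(beta * t)
# Illustrative parameters (assumed; roughly the values used later in the post)
alpha = 0.000064
beta = 0.0861

def gompertz_hazard(t):
    return alpha * math.exp(beta * t)

# With an exponential hazard, relative risk doubles every ln(2) / beta years
doubling_time = math.log(2) / beta
print(round(doubling_time, 1))  # ~8 years

# Sanity check: the hazard roughly doubles over one doubling time
print(gompertz_hazard(48) / gompertz_hazard(40))
```

Note that the doubling time depends only on $\beta$; changing $\alpha$ shifts the whole hazard curve up or down without changing how fast risk compounds.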

<p>While I will primarily focus on this formulation of the Gompertz equation, its form has been broadly amended and challenged to better predict mortality in the very young and very old.</p>

<p>The Gompertz equation implies a vanishingly small risk for the very young, which is untrue even for modern humans due to early-childhood mortality (<a href="https://en.wikipedia.org/wiki/Gompertz%E2%80%93Makeham_law_of_mortality">Gompertz-Makeham Law Wiki</a>); more generally, it is challenged by high rates of extrinsic mortality (e.g., from disease or predation). To better account for baseline risk, the Gompertz equation is often extended to the Gompertz–Makeham law of mortality. This formulation adds the Makeham term ($\lambda$), though it is generally appreciated that the Gompertz term outweighs the Makeham term when extrinsic mortality is low.</p>

\[\large h(t) = \alpha \cdot e^{\beta \cdot t} + \lambda\]
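<p>To see how the two terms trade off, note that the Makeham term dominates at young ages, while the Gompertz term overtakes it at $t^* = \ln(\lambda/\alpha)/\beta$. A small sketch, where the $\lambda$ value (and the other parameters) are purely hypothetical:</p>

```python
import math

# Gompertz-Makeham hazard: h(t) = alpha * exp(beta * t) + lam
# All parameter values here are hypothetical, for illustration only
alpha = 0.000064
beta = 0.0861
lam = 0.0005  # age-independent (extrinsic) mortality

def gm_hazard(t):
    return alpha * math.exp(beta * t) + lam

# The Gompertz term overtakes the Makeham term at t* = ln(lam / alpha) / beta
crossover = math.log(lam / alpha) / beta
print(round(crossover, 1))

# Below the crossover, baseline risk dominates the hazard...
print(lam > alpha * math.exp(beta * 10))   # True
# ...above it, the exponential aging term does
print(lam < alpha * math.exp(beta * 40))   # True
```

With these made-up numbers, baseline risk dominates until roughly the mid-20s, after which the exponential aging term takes over, which is why the Makeham term matters most when extrinsic mortality is high or ages are young.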

<p>For the very old, the Gompertz equation predicts an eventual hazard of one (mathematically, the hazard would continue past this point, which is a bit of a red flag), implying guaranteed death by a defined age. This point has been hotly contested because it bears on whether there is a fundamental upper limit on human lifespan. Coming up with alternative models to describe the hazard of the very old is surprisingly hard because there are so few individuals to fit the models to. This work generally focuses on “extreme value distributions,” which describe the distribution of maximum/minimum values from a set of observations.</p>

<h2 id="historical-changes-in-life-expectancy">Historical Changes in Life Expectancy</h2>

<p>To explore how changes in life expectancy map onto the parameters of the Gompertz equation, we’ll start by obtaining some demographic data on how life expectancy changed across the $20^{th}$ century.</p>

<p>To do this, I’ll use a table of life expectancy in the USA ranging from 1900-1998 which I stumbled into on Andrew Noymer’s website (Associate Professor, Public Health at UC–Irvine): <a href="https://u.demog.berkeley.edu/~andrew/1918/figure2.html">USA Life Expectancy 20th century site</a></p>

<p>I could have copied this data to a local file but instead I decided to knock some of the rust off of my webscraping skills and directly read the table using <strong>rvest</strong>. If you are interested in another example of rvest webscraping I talked about how I scraped &gt;200K MMA webpages a while back in this post: <a href="http://www.fightprior.com/2016/04/29/scrapingMMA/">FightPrior</a>.</p>

<p>My approach to reading the table didn’t work perfectly. CSS selectors, which define the portion of the HTML to extract, can be a little finicky, and in this case the fields I selected pulled in some whitespace above and below the table. Because of this, rather than extracting a table with a command like “rvest::html_table”, I had to serialize the table as a long character vector. After doing this I reformatted it into a matrix and then applied a few more clunky operations to set the first row as the variable names and convert the matrix to a tidy tibble.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># load packages and create a default plotting theme</span><span class="w">
</span><span class="n">suppressPackageStartupMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">))</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">patchwork</span><span class="p">)</span><span class="w"> </span><span class="c1"># for combining plots</span><span class="w">

</span><span class="n">theme_bw_mod</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">theme_bw</span><span class="p">(</span><span class="n">base_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">)</span><span class="w">

</span><span class="n">pretty_kable</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">tab</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">tab</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">kableExtra</span><span class="o">::</span><span class="n">kable_styling</span><span class="p">(</span><span class="s2">"striped"</span><span class="p">,</span><span class="w"> </span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"left"</span><span class="p">,</span><span class="w"> </span><span class="n">full_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># read html</span><span class="w">
</span><span class="n">US_LIFE_EXPECTANCY_URL</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"https://u.demog.berkeley.edu/~andrew/1918/figure2.html"</span><span class="w">
</span><span class="n">us_life_expectancy_html</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rvest</span><span class="o">::</span><span class="n">read_html</span><span class="p">(</span><span class="n">US_LIFE_EXPECTANCY_URL</span><span class="p">)</span><span class="w">

</span><span class="n">us_life_expectancy_matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">us_life_expectancy_html</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># setup with Selector Gadget as a CSS selector</span><span class="w">
  </span><span class="n">rvest</span><span class="o">::</span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"tr~ tr+ tr p"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> 
  </span><span class="n">rvest</span><span class="o">::</span><span class="n">html_text2</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">matrix</span><span class="p">(</span><span class="w">
    </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">byrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="p">{</span><span class="n">.</span><span class="p">[</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">.</span><span class="p">)),]}</span><span class="w">

</span><span class="c1"># turn first row into column names</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">us_life_expectancy_matrix</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">us_life_expectancy_matrix</span><span class="p">[</span><span class="m">1</span><span class="p">,]</span><span class="w">
</span><span class="n">us_life_expectancy_matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">us_life_expectancy_matrix</span><span class="p">[</span><span class="m">-1</span><span class="p">,]</span><span class="w">

</span><span class="n">us_life_expectancy_tbl</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="o">::</span><span class="n">as_tibble</span><span class="p">(</span><span class="n">us_life_expectancy_matrix</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate_all</span><span class="p">(</span><span class="n">as.numeric</span><span class="p">)</span><span class="w">

</span><span class="n">pretty_kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">us_life_expectancy_tbl</span><span class="p">))</span></code></pre></figure>

<table class="table table-striped" style="width: auto !important; ">
 <thead>
  <tr>
   <th style="text-align:right;"> Year </th>
   <th style="text-align:right;"> M </th>
   <th style="text-align:right;"> F </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1900 </td>
   <td style="text-align:right;"> 46.3 </td>
   <td style="text-align:right;"> 48.3 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1901 </td>
   <td style="text-align:right;"> 47.6 </td>
   <td style="text-align:right;"> 50.6 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1902 </td>
   <td style="text-align:right;"> 49.8 </td>
   <td style="text-align:right;"> 53.4 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1903 </td>
   <td style="text-align:right;"> 49.1 </td>
   <td style="text-align:right;"> 52.0 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1904 </td>
   <td style="text-align:right;"> 46.2 </td>
   <td style="text-align:right;"> 49.1 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1905 </td>
   <td style="text-align:right;"> 47.3 </td>
   <td style="text-align:right;"> 50.2 </td>
  </tr>
</tbody>
</table>

<p>With summaries of life expectancy for each year we can now flag some years of interest which will be useful for plotting. I identified the start of each decade as well as 1918, when the Spanish Flu killed some 50 million people worldwide.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">select_lifespans</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">us_life_expectancy_tbl</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">avg_lifespan</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">M</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">`F`</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="w">
    </span><span class="n">case_when</span><span class="p">(</span><span class="w">
      </span><span class="n">Year</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1900</span><span class="p">,</span><span class="w"> </span><span class="m">1998</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
      </span><span class="n">avg_lifespan</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">avg_lifespan</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
      </span><span class="n">Year</span><span class="w"> </span><span class="o">%%</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
      </span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="kc">FALSE</span><span class="w">
      </span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">

</span><span class="n">lifespan_extension_20th</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">select_lifespans</span><span class="o">$</span><span class="n">avg_lifespan</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">select_lifespans</span><span class="o">$</span><span class="n">avg_lifespan</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

</span><span class="c1">#lifespan_extension_20th</span><span class="w">
</span><span class="c1"># 1.9</span><span class="w">

</span><span class="n">pretty_kable</span><span class="p">(</span><span class="n">select_lifespans</span><span class="p">)</span></code></pre></figure>

<table class="table table-striped" style="width: auto !important; ">
 <thead>
  <tr>
   <th style="text-align:right;"> Year </th>
   <th style="text-align:right;"> M </th>
   <th style="text-align:right;"> F </th>
   <th style="text-align:right;"> avg_lifespan </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1900 </td>
   <td style="text-align:right;"> 46.3 </td>
   <td style="text-align:right;"> 48.3 </td>
   <td style="text-align:right;"> 47.30 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1910 </td>
   <td style="text-align:right;"> 48.4 </td>
   <td style="text-align:right;"> 51.8 </td>
   <td style="text-align:right;"> 50.10 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1918 </td>
   <td style="text-align:right;"> 36.6 </td>
   <td style="text-align:right;"> 42.2 </td>
   <td style="text-align:right;"> 39.40 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1920 </td>
   <td style="text-align:right;"> 53.6 </td>
   <td style="text-align:right;"> 54.6 </td>
   <td style="text-align:right;"> 54.10 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1930 </td>
   <td style="text-align:right;"> 58.1 </td>
   <td style="text-align:right;"> 61.6 </td>
   <td style="text-align:right;"> 59.85 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1940 </td>
   <td style="text-align:right;"> 60.8 </td>
   <td style="text-align:right;"> 65.2 </td>
   <td style="text-align:right;"> 63.00 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1950 </td>
   <td style="text-align:right;"> 65.6 </td>
   <td style="text-align:right;"> 71.1 </td>
   <td style="text-align:right;"> 68.35 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1960 </td>
   <td style="text-align:right;"> 66.6 </td>
   <td style="text-align:right;"> 73.1 </td>
   <td style="text-align:right;"> 69.85 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1970 </td>
   <td style="text-align:right;"> 67.1 </td>
   <td style="text-align:right;"> 74.7 </td>
   <td style="text-align:right;"> 70.90 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1980 </td>
   <td style="text-align:right;"> 70.0 </td>
   <td style="text-align:right;"> 77.4 </td>
   <td style="text-align:right;"> 73.70 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1990 </td>
   <td style="text-align:right;"> 71.8 </td>
   <td style="text-align:right;"> 78.8 </td>
   <td style="text-align:right;"> 75.30 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1998 </td>
   <td style="text-align:right;"> 73.8 </td>
   <td style="text-align:right;"> 79.5 </td>
   <td style="text-align:right;"> 76.65 </td>
  </tr>
</tbody>
</table>

<p>Finally, we can visualize the changes in life expectancy across the $20^{th}$ century. From a low point in 1918, life expectancy rose 1.9-fold.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">us_life_expectancy_tbl</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">tidyr</span><span class="o">::</span><span class="n">gather</span><span class="p">(</span><span class="n">sex</span><span class="p">,</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">Year</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">sex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">sex</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"M"</span><span class="p">,</span><span class="w"> </span><span class="s2">"F"</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sex</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_path</span><span class="p">(</span><span class="n">linewidth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Life expectancy"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="s2">"Year"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_color_manual</span><span class="p">(</span><span class="w">
    </span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"F"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pink"</span><span class="p">,</span><span class="w"> </span><span class="s2">"M"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dodgerblue"</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"From 1918, US life expectancy rose by **90%**"</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="w">
    </span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"none"</span><span class="p">,</span><span class="w">
    </span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ggtext</span><span class="o">::</span><span class="n">element_markdown</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">17</span><span class="p">,</span><span class="w"> </span><span class="n">lineheight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.2</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2025-02-02-gompertz/life_expectancy-1.png" alt="plot of chunk life_expectancy" /></p>

<h2 id="unpacking-the-gompertz-equation">Unpacking the Gompertz Equation</h2>

<p>As mentioned above, the form of the Gompertz equation is:</p>

\[\large \ln(h(t)) = \ln(\alpha) + \beta \cdot t \\
\large h(t) = \alpha \cdot e^{\beta \cdot t}\]
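<p>One practical consequence of the log-linear form is that $\ln(\alpha)$ and $\beta$ can be recovered by fitting a straight line to log-hazard versus age. A minimal sketch in Python, with assumed parameters and noise-free data so the recovery is exact:</p>

```python
import math

# Assumed Gompertz parameters for the demonstration
alpha, beta = 0.000064, 0.0861

# ln h(t) = ln(alpha) + beta * t, so log-hazard is linear in age
ages = list(range(30, 91))
log_h = [math.log(alpha) + beta * t for t in ages]

# Ordinary least squares slope and intercept
n = len(ages)
mean_t = sum(ages) / n
mean_y = sum(log_h) / n
slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(ages, log_h)) / \
        sum((t - mean_t) ** 2 for t in ages)
intercept = mean_y - slope * mean_t

print(slope)                 # recovers beta
print(math.exp(intercept))   # recovers alpha
```

With real mortality data the points scatter around the line (especially at young and very old ages, as discussed above), but the slope of the middle-age portion of the log-hazard curve is still the standard way to estimate $\beta$.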

<p>For given values of $\alpha$ and $\beta$, we could use the Gompertz equation to predict an individual’s hazard at age $t$.</p>

<p>Surviving to an age $t$ entails surviving every preceding year as well. Thus, the survival function can be described as the probability of surviving until a point $t$:</p>

\[\large S(t) = \prod_{x=1}^{t}\left(1-h(x)\right)\]

<p>In continuous time, this product is equivalent to the closed form:</p>

\[\large S(t) = e^{\frac{\alpha}{\beta}(1-e^{\beta t})}\]
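<p>It’s worth checking numerically that the discrete year-by-year product and the closed form agree. A quick sketch (in Python, with the same assumed parameters as elsewhere in this post); the small gap comes from discretizing a continuous-time hazard into yearly steps:</p>

```python
import math

# Assumed Gompertz parameters, matching the ones used later in the post
alpha, beta = 0.000064, 0.0861

def hazard(t):
    return alpha * math.exp(beta * t)

def survival_closed(t):
    # S(t) = exp((alpha / beta) * (1 - exp(beta * t)))
    return math.exp((alpha / beta) * (1 - math.exp(beta * t)))

def survival_product(t):
    # Discrete product of yearly survival probabilities: prod(1 - h(x))
    s = 1.0
    for x in range(1, t + 1):
        s *= 1 - hazard(x)
    return s

for t in (40, 60, 80):
    print(t, round(survival_product(t), 3), round(survival_closed(t), 3))
```

The two agree closely through middle age and diverge modestly at the oldest ages, where the yearly hazards are no longer small enough for the continuous approximation to be tight.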

<p>Also, since <a href="https://www.britannica.com/science/life-expectancy">life expectancy</a> is defined “assuming that the age-specific death rates for the year in question will apply throughout the lifetime of individuals born in that year” it should simply be the integral under the survival curve (or the sum since we are discretizing to year-by-year changes).</p>

<p>Based on this formulation, I wrote functions that map $\alpha$, $\beta$ and $t$ onto hazard and survival, and created plots summarizing a Gompertz model approximately fit to the modern US population.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gompertz_hazard</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">alpha</span><span class="o">*</span><span class="nf">exp</span><span class="p">(</span><span class="n">beta</span><span class="o">*</span><span class="n">times</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">gompertz_survival</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nf">exp</span><span class="p">((</span><span class="n">alpha</span><span class="o">/</span><span class="n">beta</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="m">1</span><span class="o">-</span><span class="nf">exp</span><span class="p">(</span><span class="n">beta</span><span class="o">*</span><span class="n">times</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># doubling of hazard every 8 years</span><span class="w">
</span><span class="n">BETA_CURRENT</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.0861</span><span class="w">
</span><span class="n">ALPHA_CURRENT</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.000064</span><span class="w">
</span><span class="n">AGES</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">105</span><span class="p">)</span><span class="w">

</span><span class="c1"># </span><span class="w">
</span><span class="n">gompertz_df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="o">::</span><span class="n">tibble</span><span class="p">(</span><span class="w">
  </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">AGES</span><span class="p">,</span><span class="w">
  </span><span class="n">hazard</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gompertz_hazard</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">BETA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">AGES</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">log_hazard</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">hazard</span><span class="p">),</span><span class="w">
    </span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gompertz_survival</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">BETA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">AGES</span><span class="p">)</span><span class="w">
    </span><span class="p">)</span><span class="w">

</span><span class="c1"># the integral of survival is life expectancy</span><span class="w">
</span><span class="n">life_expectancy</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">gompertz_df</span><span class="o">$</span><span class="n">survival</span><span class="p">)</span><span class="w">

</span><span class="n">hazard_grob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">gompertz_df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hazard</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span><span class="w">

</span><span class="n">survival_grob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">gompertz_df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">survival</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_polygon</span><span class="p">(</span><span class="w">
    </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">bind_rows</span><span class="p">(</span><span class="w">
      </span><span class="n">gompertz_df</span><span class="p">,</span><span class="w">
      </span><span class="n">tibble</span><span class="o">::</span><span class="n">tibble</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
      </span><span class="n">tibble</span><span class="o">::</span><span class="n">tibble</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
      </span><span class="p">),</span><span class="w">
    </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_vline</span><span class="p">(</span><span class="n">xintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="n">linewidth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">annotate</span><span class="p">(</span><span class="s2">"text"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.25</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"life expectancy"</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">log_hazard_grob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">gompertz_df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">log_hazard</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"log(hazard)"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span><span class="w">
  
</span><span class="n">log_hazard_grob</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">hazard_grob</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">survival_grob</span></code></pre></figure>

<p><img src="/figure/source/2025-02-02-gompertz/gompertz_basics-1.png" alt="plot of chunk gompertz_basics" /></p>

<p>To do this, I fixed $\beta$ at 0.0861, which implies a doubling of risk every 8 years. <strong>Since it is widely believed that historical lifespan extension has come primarily through modifying $\alpha$ rather than $\beta$, we can fit Gompertz models to achieve a desired life expectancy</strong>. For a life expectancy of around 76.6 (the life expectancy in 1998, which is also startlingly close to the current US life expectancy of 77.2), $\alpha$ would be $\sim0.000064$.</p>
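<p>As a sanity check on this calibration (a sketch that restates the standard Gompertz survival function, $S(t) = e^{-\frac{\alpha}{\beta}(e^{\beta t} - 1)}$, which the <code class="language-plaintext highlighter-rouge">gompertz_survival</code> helper used throughout this post implements), we can recover the implied life expectancy by summing the survival curve over integer ages:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Gompertz survival function: S(t) = exp(-(alpha / beta) * (exp(beta * t) - 1))
gompertz_survival = function(alpha, beta, age) {
  exp(-(alpha / beta) * (exp(beta * age) - 1))
}

# life expectancy ~ area under the survival curve, here summed over integer ages
sum(gompertz_survival(0.000064, 0.0861, seq(0, 150)))
# close to the ~76.6-year target</code></pre></figure>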

<p>To explore how $\alpha$ has changed across the $20^{th}$ century, we can map different levels of baseline risk onto historical changes in life expectancy.</p>

<h2 id="20th-century-lifespan-extension-was-through-a-26x-drop-in-baseline-hazard">20th century lifespan extension was through a <strong>26x</strong> drop in baseline hazard</h2>

<p>To map values of $\alpha$ onto the “select_lifespans” defined above, I estimated the Gompertz survival function for a range of $\alpha$ values starting at a modern value of $\sim0.000064$ and going up to a 26-fold increase, covering historical values (initially I used a wider range and then dialed it in). Tidyr’s crossing is really helpful for the all-by-all combinations of the $\alpha$ parameters and ages being evaluated. After calculating the survival functions, I grouped by $\alpha$ and integrated over ages to infer the life expectancy.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># define fold-change of max/min alpha or beta</span><span class="w">
</span><span class="n">ALPHA_BETA_FC</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">26</span><span class="w">

</span><span class="c1"># create a uniform sequence in log-space (the log transform is just so values will be evenly spaced when plotting lifespan ~ log(param))</span><span class="w">
</span><span class="n">alpha_possibilities</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="p">),</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="o">*</span><span class="n">ALPHA_BETA_FC</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span><span class="w">

</span><span class="n">gompertz_curves</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tidyr</span><span class="o">::</span><span class="n">crossing</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">alpha_possibilities</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">AGES</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gompertz_survival</span><span class="p">(</span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">BETA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="p">))</span><span class="w">

</span><span class="n">life_expectancy_by_alpha</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gompertz_curves</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">life_expectancy</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">survival</span><span class="p">),</span><span class="w"> </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">alpha</span><span class="p">)</span><span class="w">

</span><span class="c1"># what is the alpha corresponding to the selected years life expectancies</span><span class="w">
</span><span class="n">select_lifespans_w_alpha</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">select_lifespans</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">tidyr</span><span class="o">::</span><span class="n">crossing</span><span class="p">(</span><span class="n">life_expectancy_by_alpha</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">year_lifespan_vs_alpha_prediction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">avg_lifespan</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">Year</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">year_lifespan_vs_alpha_prediction</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">year_lifespan_vs_alpha_prediction</span><span class="p">))</span><span class="w">

</span><span class="n">stopifnot</span><span class="p">(</span><span class="nf">all</span><span class="p">(</span><span class="n">select_lifespans_w_alpha</span><span class="o">$</span><span class="n">year_lifespan_vs_alpha_prediction</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">

</span><span class="n">life_expectancy_by_alpha</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w">  </span><span class="n">log10</span><span class="p">(</span><span class="n">alpha</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggrepel</span><span class="o">::</span><span class="n">geom_text_repel</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">select_lifespans_w_alpha</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">log10</span><span class="p">(</span><span class="n">alpha</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">),</span><span class="w"> </span><span class="n">force</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nudge_x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.06</span><span class="p">,</span><span class="w"> </span><span class="n">nudge_y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="nf">expression</span><span class="p">(</span><span class="n">log</span><span class="p">[</span><span class="m">10</span><span class="p">]</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">alpha</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Life expectancy"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"20&lt;sup&gt;th&lt;/sup&gt; century lifespan extension was primarily &lt;br&gt;through a **26x** drop in baseline hazard"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="w">
    </span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ggtext</span><span class="o">::</span><span class="n">element_markdown</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">13</span><span class="p">,</span><span class="w"> </span><span class="n">lineheight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.2</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2025-02-02-gompertz/alpha_var-1.png" alt="plot of chunk alpha_var" /></p>

<p>The near-doubling of life expectancy across the $20^{th}$ century was chiefly driven by the massive medical advances of antibiotics and vaccines, as well as improvements in nutrition and hygiene. The deceleration of lifespan extension over the last thirty years reflects the challenge of removing risks one-by-one, like a game of whack-a-mole. Going forward, little is projected to change: the Social Security Administration projects that life expectancy in 2100 will only reach $\sim85$ years.</p>

<p><img src="https://www.shackett.org/figure/gompertz/ssa_projected_survival.png" alt="SSA projected survival" class="align-center" /></p>

<p>These are, and will be, hard-fought gains, but radically extending lifespan will require tackling the underlying risk factor behind the diseases of aging … <strong>aging itself</strong>.</p>

<h2 id="a-26x-drop-in-age-dependent-risk-would-increase-human-life-expectancy-by-125x">A <strong>26x</strong> drop in age-dependent risk would increase human life expectancy by <strong>12.5x</strong></h2>

<p>The 26-fold drop in age-independent hazard across the 20th century is MASSIVE, and I was curious what a comparable drop in $\beta$ going forward would look like. To explore this, I fixed $\alpha$ at a modern value (0.000064) and explored a range of $\beta$ values from the current value (0.0861, where risk doubles every 8 years) down to 0.0033, where risk would double only every ~208 years.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">AGES_EXTENDED</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">100000</span><span class="p">)</span><span class="w">
</span><span class="n">beta_possibilities</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">BETA_CURRENT</span><span class="p">),</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">BETA_CURRENT</span><span class="o">/</span><span class="n">ALPHA_BETA_FC</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span><span class="w">

</span><span class="n">gompertz_curves</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tidyr</span><span class="o">::</span><span class="n">crossing</span><span class="p">(</span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">beta_possibilities</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">AGES_EXTENDED</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">hazard</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gompertz_hazard</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="p">),</span><span class="w">
    </span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gompertz_survival</span><span class="p">(</span><span class="n">ALPHA_CURRENT</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">

</span><span class="n">life_expectancy_by_beta</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gompertz_curves</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">life_expectancy</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">survival</span><span class="p">),</span><span class="w"> </span><span class="n">.by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">beta</span><span class="p">)</span><span class="w">

</span><span class="n">life_expectancy_range_ratio</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">life_expectancy_by_beta</span><span class="o">$</span><span class="n">life_expectancy</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">life_expectancy_by_beta</span><span class="o">$</span><span class="n">life_expectancy</span><span class="p">[</span><span class="n">nrow</span><span class="p">(</span><span class="n">life_expectancy_by_beta</span><span class="p">)]</span><span class="w">

</span><span class="n">life_expectancy_by_beta</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">log10</span><span class="p">(</span><span class="n">beta</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">life_expectancy</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="nf">expression</span><span class="p">(</span><span class="n">log</span><span class="p">[</span><span class="m">10</span><span class="p">]</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">beta</span><span class="p">),</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">),</span><span class="w"> </span><span class="n">expand</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.02</span><span class="p">,</span><span class="m">0.02</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Life expectancy"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"A **26x** drop in age-dependent risk would&lt;br&gt;increase human lifespan by **12.5x**"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="w">
    </span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ggtext</span><span class="o">::</span><span class="n">element_markdown</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">14</span><span class="p">,</span><span class="w"> </span><span class="n">lineheight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.2</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2025-02-02-gompertz/beta_var-1.png" alt="plot of chunk beta_var" /></p>

<p>I don’t think it will be as easy to modify $\beta$ as $\alpha$, but even a two-fold drop in $\beta$ would increase life expectancy to 138 years. Precipitously dropping $\beta$ to achieve a life expectancy of 1,000 starts to sound like science fiction, but it is an interesting thought experiment. And why stop there! We can push this further and explore what a world with a $\beta$ of zero would look like. This concept is reflected in an interesting game by <a href="https://polstats.com/#!/life">Polstats</a> where you can simulate 100 individuals’ lifespans in a world where people only die of unnatural causes. The average lifespan in this cohort was ~10,000 years, and the longest-lived individual died at the ripe old age of 57,912 in a car accident.</p>

<p><img src="https://www.shackett.org/figure/gompertz/polstats.png" alt="Lifespans based on unnatural deaths" class="align-center" /></p>

<p>Playing this game a few times I was struck by how much the average lifespan of the cohort can shift based on a few long-lived stragglers who continue to dodge bullets (and cars).</p>

<p><strong>In this world, how old would we expect the oldest person to be?</strong></p>

<p>If hazard is constant over time then lifespans would follow a Geometric distribution, where the mean equals $1/\alpha$ = 15,625. If we instead thought of lifespans as continuous, which would be appropriate, we could equivalently model them with an exponential distribution (the continuous analogue, equivalent to exponential decay).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tibble</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">100000</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">survival</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dgeom</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">ALPHA_CURRENT</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">survival</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_bw_mod</span></code></pre></figure>

<p><img src="/figure/source/2025-02-02-gompertz/hazard_constant-1.png" alt="plot of chunk hazard_constant" /></p>

<p>The maximum lifespan under this model would be equivalent to the maximum of $N$ geometric draws. This distribution is described on <a href="https://math.stackexchange.com/questions/26167/expectation-of-the-maximum-of-i-i-d-geometric-random-variables">StackExchange</a> and it’s quite involved (there’s no <code class="language-plaintext highlighter-rouge">cgeom</code> function), so I’ll cop out and just take the maximum of a set of random draws.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1234</span><span class="p">)</span><span class="w">
</span><span class="n">current_earth_pop_w_geometric_lifespan</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rgeom</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7.9e9</span><span class="p">,</span><span class="w"> </span><span class="n">ALPHA_CURRENT</span><span class="p">)</span><span class="w">
</span><span class="n">max_lifespan</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">current_earth_pop_w_geometric_lifespan</span><span class="p">)</span></code></pre></figure>

<p>From this simulation, the oldest person on earth would be 345,353 years old.</p>
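<p>As a rough analytic cross-check (my own aside, not part of the simulation above): if lifespans are exponential with rate $\alpha$, the expected maximum of $N$ i.i.d. draws is approximately $(\ln N + \gamma)/\alpha$, where $\gamma$ is the Euler-Mascheroni constant:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># expected maximum of N i.i.d. exponential lifespans with rate alpha:
# E[max] ~ (log(N) + 0.5772) / alpha
(log(7.9e9) + 0.5772) / 0.000064
# ~365,000 years, the same ballpark as the simulated maximum</code></pre></figure>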

<h2 id="conclusion">Conclusion</h2>

<p>In this post, we’ve explored the Gompertz equation and how it can be used to model historical lifespan extension (decreasing $\alpha$) and the potential that slowing aging has for radical lifespan extension (decreasing $\beta$).</p>

<p>I’ve found that this framing is particularly helpful when discussing my professional work with non-scientists. I tried this out first in a presentation at a local meetup (<strong>Future Tech Immersive: AI x Synthetic Biology Meetup</strong>) and simplified the narrative last November when I spoke at the 2024 D One Growing Together Conference in Zürich. Both talks went quite well; aging is deeply personal to all of us. It’s something we see every day in ourselves, our family, and our friends. We accept it as a given, but it may not be, and that “what if?” continues to inspire me.</p>]]></content><author><name>Sean Hackett</name></author><category term="aging" /><category term="epidemiology" /><summary type="html"><![CDATA[A review of the False Discovery Rate and its use for shrinkage estimation]]></summary></entry><entry><title type="html">False Discovery Rate (FDR) Overview and lFDR-Based Shrinkage</title><link href="https://www.shackett.org/lfdr_shrinkage/" rel="alternate" type="text/html" title="False Discovery Rate (FDR) Overview and lFDR-Based Shrinkage" /><published>2022-06-11T00:00:00+00:00</published><updated>2022-06-11T00:00:00+00:00</updated><id>https://www.shackett.org/lfdr_shrinkage</id><content type="html" xml:base="https://www.shackett.org/lfdr_shrinkage/"><![CDATA[<p>Coming from a quantitative genetics background, correcting for multiple comparisons meant controlling the family-wise error rate (FWER) using a procedure like Bonferroni correction. This all changed when I took John Storey’s “Advanced Statistics for Biology” class in grad school. John is an expert in statistical interpretation of high-dimensional data and literally wrote the book, well paper, on false-discovery rate (FDR) as an author of <a href="https://www.pnas.org/content/100/16/9440">Storey &amp; Tibshirani 2003</a>. 
His description of the FDR has grounded my interpretation of hundreds of genomic datasets, and I’ve continued to pay this knowledge forward with dozens of whiteboard-style descriptions of the FDR for colleagues. As an interviewer and paper reviewer, I still regularly see accomplished individuals and groups for whom “FDR control” is a clear blind spot. In this post I’ll lay out how I whiteboard the FDR problem, and then highlight a specialized application of the FDR for “denoising” genomic datasets.</p>

<!--more-->

<h1 id="multiple-hypothesis-testing">Multiple Hypothesis Testing</h1>

<p>Statistical tests are designed so that if the null hypothesis is true, observed statistics will follow a defined null distribution; hence an observed statistic can be compared to the quantiles of the null distribution to calculate a p-value. In quantitative biology we frequently test hundreds to millions of hypotheses in parallel. The p-value for a single test can be interpreted roughly as: p &lt; 0.05, yay! [only slightly sarcastically]. But when we have many tests, some will possess small p-values by chance (10,000 null tests would yield ~500 p &lt; 0.05 findings by chance). Controlling for multiple hypotheses acknowledges this challenge, with the FDR and FWER providing alternative perspectives for winnowing spurious associations down to a set of high-confidence “discoveries”.</p>
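<p>A quick simulation (my own illustration, not part of the original whiteboard pitch) makes this concrete: under the null, p-values are uniform, so about 5% fall below 0.05 by chance alone:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(42)
null_pvalues = runif(10000)   # 10,000 tests where the null is true
sum(null_pvalues &lt; 0.05)      # roughly 500 spuriously "significant" results</code></pre></figure>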

<ul>
  <li>The <strong>FWER</strong> is the probability of making one or more false discoveries. FWER control is common in genetic association studies, providing the interpretation that, across all reported loci, there is an $\alpha$ chance that one or more are spurious. Bonferroni correction controls the FWER by accepting tests whose p-values are less than $\frac{\alpha}{N}$, where $\alpha$ is the type I error rate and $N$ is the number of hypotheses.</li>
  <li>The <strong>FDR</strong> involves selecting a set of observations (the positive tests) that constrains the expected proportion of false positives ($\mathbb{E}$FP) among the selections: i.e., $\mathbb{E}\left[\frac{\text{FP}}{\text{FP} + \text{TP}}\right] \leq \alpha$. The FDR is less conservative than the FWER and is useful whenever we want to interpret the trends in a dataset even if individual findings may be spurious. This thought process fits nicely into genomics, where differential expression analysis is frequently coupled to reductionist approaches like Gene Set Enrichment Analysis (GSEA).</li>
</ul>

<p>The FWER and FDR control different properties, so it’s not entirely fair to compare them, yet I often see people controlling the FWER and then interpreting results as if they were controlling the FDR (and vice versa), so it’s important to note their practical differences. Whenever we have multiple hypotheses, the FWER will be more conservative (underestimating changes) than the FDR, and this difference in power widens with the number of hypotheses being tested. If we carried out more tests, the p-value threshold for detecting discoveries under FWER control would drop (perhaps resulting in a drop in discoveries even as we measure more features!), while this threshold should be roughly constant if we are controlling the FDR, resulting in a proportional increase in the number of significant hits as we detect more features. This distinction is frequently misunderstood - folks will say things like “I’m not sure these changes would survive the FDR.” The shadow of the FWER results in a perception that collecting more information will prevent us from detecting “high confidence” hits. In reality, a well-selected FDR procedure can help to squeeze the most power out of a dataset.</p>

<p>There are multiple ways to control the FDR, and I think the <a href="https://www.pnas.org/content/100/16/9440">Storey &amp; Tibshirani</a> “q-value” framework is particularly appealing because of its Bayesian elegance and statistical power. When the assumptions underlying the q-value approach break down (basically, when my p-values don’t look nice like the ones below), I fall back on the Benjamini-Hochberg (BH) approach for controlling the FDR. BH preceded Storey’s q-value but is a special case of it (with $\hat{\pi}_{0}$ set to one).</p>
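<p>The BH procedure is short enough to write out by hand, which makes that connection explicit: BH’s step-up adjustment is what the q-value machinery reduces to when $\hat{\pi}_{0}$ is pinned at one. This is a hand-rolled sketch, checked against base R’s <code>p.adjust</code>:</p>

```r
# the BH step-up adjustment: sort p-values, scale the i-th smallest by N/i,
# then enforce monotonicity with a suffix minimum; Storey's q-value
# effectively replaces the leading N with pi0_hat * N
manual_bh <- function(p) {
  n <- length(p)
  o <- order(p)
  adj <- pmin(1, rev(cummin(rev(n / seq_len(n) * p[o]))))
  adj[order(o)]  # return to the original order
}

set.seed(1)
p <- runif(20)^2
all.equal(manual_bh(p), p.adjust(p, method = "BH"))
```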

<h1 id="controlling-the-false-discovery-rate-fdr-with-q-values">Controlling the False Discovery Rate (FDR) with Q-Values</h1>

<p>Using the approach of <a href="https://www.pnas.org/content/100/16/9440">Storey &amp; Tibshirani</a>, we can think about a p-value histogram as a mixture of two distributions:</p>

<ul>
  <li>Negatives - features with no signal that follow our null distribution and whose p-values will in turn be distributed as $\sim\text{Unif}(0,1)$</li>
  <li>Positives - features containing some signal which will consequently have elevated test statistics and tend towards having small p-values.</li>
</ul>

<p>To see this visually, we can generate a mini-simulation containing a mixture of negatives and positives.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">

</span><span class="c1"># ggplot default theme</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bottom"</span><span class="p">))</span><span class="w">

</span><span class="c1"># define simulation parameters</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1234</span><span class="p">)</span><span class="w">
</span><span class="n">n_sims</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">100000</span><span class="w">
</span><span class="n">pi0</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.5</span><span class="w">
</span><span class="n">beta</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1.5</span><span class="w">

</span><span class="n">simple_pvalue_mixture</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">truth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="s2">"Positive"</span><span class="p">,</span><span class="w"> </span><span class="n">n_sims</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="o">-</span><span class="n">pi0</span><span class="p">)),</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="s2">"Negative"</span><span class="p">,</span><span class="w"> </span><span class="n">n_sims</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">pi0</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># positives are centered around beta; negatives around 0</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">truth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">truth</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Positive"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Negative"</span><span class="p">)),</span><span class="w">
         </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">truth</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Positive"</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
         </span><span class="c1"># observations sampled from a normal distribution centered on 0</span><span class="w">
         </span><span class="c1"># or beta with an SD of 1 (the default)</span><span class="w">
         </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mu</span><span class="p">),</span><span class="w">
         </span><span class="c1"># carryout a 1-tailed wald test about 0 </span><span class="w">
         </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pnorm</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span><span class="w">

</span><span class="n">observation_grob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">simple_pvalue_mixture</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">truth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_density</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"Observations with and without signal"</span><span class="p">)</span><span class="w">

</span><span class="n">pvalues_grob</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">simple_pvalue_mixture</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">truth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="o">=</span><span class="m">0.01</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"P-values of observations with and without signal"</span><span class="p">)</span><span class="w">

</span><span class="n">gridExtra</span><span class="o">::</span><span class="n">grid.arrange</span><span class="p">(</span><span class="n">observation_grob</span><span class="p">,</span><span class="w"> </span><span class="n">pvalues_grob</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-06-11-lfdr_shrinkage/pvalue_hist_sim-1.png" alt="plot of chunk pvalue_hist_sim" /></p>

<p>While there is a mixture of positive and negative observations, their values cannot be clearly separated (that would be too easy!); rather, noise works against some positives, and some negative observations take on extreme values by chance. This is paralleled by the p-values of positive and negative observations: true positive p-values tend to be small but may also be large, while true negative p-values are uniformly distributed from 0 to 1 and are as likely to be small as large.</p>

<p>To control the FDR at a level $\alpha$, the Storey procedure first estimates the fraction of null hypotheses (0.5 in our simulation): $\hat{\pi}_{0}$.</p>

<p>This is done by looking at large p-values (near 1). Because large p-values will rarely come from signal-containing positives, their density reflects mostly nulls, and there will be fewer large p-values than we would expect if every test were null. For example, there are 5106 p-values &gt; 0.9 in our example, which is close to 5000, the value we would expect from $N\pi_{0}*0.1$ (10<sup>5</sup> * $\pi_{0}$ * 0.1). (I’ll use the true value of $\pi_{0}$ (0.5) as a stand-in for the estimate $\hat{\pi}_{0}$ so the numbers are a little clearer.)</p>
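<p>The simplest version of this estimate takes the fraction of p-values above some threshold $\lambda$ and rescales it by the width of that interval (the <code>qvalue</code> package smooths this over many values of $\lambda$; this one-$\lambda$ version is only a sketch, with illustrative simulated p-values):</p>

```r
# pi0_hat = fraction of p-values above lambda, rescaled by the interval width;
# with lambda = 0.9 this is exactly the "count p-values > 0.9" logic above
estimate_pi0 <- function(p, lambda = 0.5) {
  mean(p > lambda) / (1 - lambda)
}

set.seed(1)
p <- c(runif(5000), rbeta(5000, 0.5, 10))  # pi0 = 0.5 by construction
estimate_pi0(p)  # close to 0.5
```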

<p>Just as we expected 5000 null p-values on the interval [0.9, 1], we would expect 5000 null p-values on the interval [0, 0.1]. But there are actually 34415 p-values &lt; 0.1, because positives tend to have small p-values. If we chose 0.1 as a possible cutoff, then we would expect 5000 false positives, while the observed number of p-values &lt; 0.1 equals the denominator of the FDR ($\text{FP} + \text{TP}$). The ratio of these two values, 0.145, would be the expected FDR at a p-value cutoff of 0.1. Now, we usually don’t want to choose a cutoff and then live with whatever FDR we get, but rather control the FDR at a level $\alpha$ by treating the p-value cutoff as a tunable parameter.</p>
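<p>This back-of-the-envelope FDR estimate is just a ratio of expected false positives to observed discoveries. A sketch, again with illustrative simulated p-values rather than the post’s:</p>

```r
# expected FDR at a p-value cutoff: (null p-values expected below the cutoff)
# divided by (all p-values observed below the cutoff, i.e. FP + TP)
estimate_fdr_at_cutoff <- function(p, cutoff, pi0) {
  expected_fp <- pi0 * length(p) * cutoff
  observed    <- sum(p < cutoff)
  expected_fp / observed
}

set.seed(1)
p <- c(runif(5000), rbeta(5000, 0.5, 10))  # half nulls, half signal
estimate_fdr_at_cutoff(p, cutoff = 0.1, pi0 = 0.5)
```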

<p>To apply q-value based FDR control we can use the <code>qvalue</code> package:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># install q-value from bioconductor if needed</span><span class="w">
</span><span class="c1"># remotes::install_bioc("qvalue")</span><span class="w">

</span><span class="n">library</span><span class="p">(</span><span class="n">qvalue</span><span class="p">)</span><span class="w">
</span><span class="n">qvalue_estimates</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">qvalue</span><span class="p">(</span><span class="n">simple_pvalue_mixture</span><span class="o">$</span><span class="n">p</span><span class="p">)</span></code></pre></figure>

<p>The q-value object contains an estimate of $\pi_{0}$ of 0.496, which is close to the true value of 0.5. It also contains a vector of q-values, lFDRs, and other goodies.</p>

<p>The q-values are the quantity that we’re usually interested in; if we take all of the q-values less than a target cutoff of, say, 0.05, that should give us a set of “discoveries” realizing a 5% FDR.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">simple_qvalues</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simple_pvalue_mixture</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">q</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">qvalue_estimates</span><span class="o">$</span><span class="n">qvalues</span><span class="p">)</span><span class="w">

</span><span class="n">fdr_pvalue_cutoff</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">simple_qvalues</span><span class="o">$</span><span class="n">p</span><span class="p">[</span><span class="n">simple_qvalues</span><span class="o">$</span><span class="n">q</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.05</span><span class="p">])</span><span class="w">

</span><span class="n">simple_qvalues</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simple_qvalues</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">hypothesis_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">p</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">fdr_pvalue_cutoff</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"TP"</span><span class="p">,</span><span class="w">
                                     </span><span class="n">p</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">fdr_pvalue_cutoff</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"FP"</span><span class="p">,</span><span class="w">
                                     </span><span class="n">p</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">fdr_pvalue_cutoff</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"FN"</span><span class="p">,</span><span class="w">
                                     </span><span class="n">p</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">fdr_pvalue_cutoff</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"Negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"TN"</span><span class="p">))</span><span class="w">

</span><span class="n">hypothesis_type_counts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simple_qvalues</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">hypothesis_type</span><span class="p">)</span><span class="w">

</span><span class="n">TP</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">hypothesis_type_counts</span><span class="o">$</span><span class="n">n</span><span class="p">[</span><span class="n">hypothesis_type_counts</span><span class="o">$</span><span class="n">hypothesis_type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"TP"</span><span class="p">]</span><span class="w">
</span><span class="n">FP</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">hypothesis_type_counts</span><span class="o">$</span><span class="n">n</span><span class="p">[</span><span class="n">hypothesis_type_counts</span><span class="o">$</span><span class="n">hypothesis_type</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"FP"</span><span class="p">]</span><span class="w">
</span><span class="n">FDR</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">FP</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">TP</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">FP</span><span class="p">)</span><span class="w">

</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">hypothesis_type_counts</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">kableExtra</span><span class="o">::</span><span class="n">kable_styling</span><span class="p">(</span><span class="n">full_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre></figure>

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> hypothesis_type </th>
   <th style="text-align:right;"> n </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> FN </td>
   <td style="text-align:right;"> 38910 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> FP </td>
   <td style="text-align:right;"> 616 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> TN </td>
   <td style="text-align:right;"> 49384 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> TP </td>
   <td style="text-align:right;"> 11090 </td>
  </tr>
</tbody>
</table>

<p>In this case, thanks to our simulation, we know whether individual discoveries are true or false positives. As a result, we can determine that the realized FDR is 0.053, close to our target of 0.05.</p>

<p>In most cases we would take our discoveries and work with them further, confident that, as a population, only ~5% of them are bogus. But in some cases we care about how likely an individual observation is to be a false positive. Here, we can look at the local density of p-values near an observation of interest to estimate a local version of the FDR, the local FDR (lFDR).</p>
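<p>Formally, under the two-group mixture model above, the lFDR is the posterior probability that an observation with p-value $p$ is null. Writing $f$ for the overall p-value density and $f_{0}$ for the null density (which is 1, since null p-values are uniform), this is:</p>

\[\text{lFDR}(p) = \Pr(\text{null} \mid p) = \frac{\pi_{0} f_{0}(p)}{f(p)} = \frac{\pi_{0}}{f(p)}\]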

<p>We took advantage of this property during my collaboration with Google Brain aimed at improving the accuracy of peptide matches to proteomics spectra using labels from traditional informatics (<a href="https://arxiv.org/abs/1808.06576">arXiv</a>). In this study we weighted peptides’ labels by their lFDR in a cross-entropy loss to more strongly penalize failed predictions on high-confidence labels.</p>

<h1 id="lfdr-based-shrinkage">lFDR-based shrinkage</h1>

<p>Because the lFDR reflects the relative odds that an observation is null, it is a useful measure for shrinkage or thresholding that aims to remove noise and better approximate the true value. To do this we can weight an observation by 1-lFDR. One interpretation is that we are using the lFDR to hedge our bets between the positive and negative mixture components: weighting the null hypothesis value of $\mu = 0$ with confidence lFDR, and the alternative ($\mu \neq 0$) value of x with confidence 1-lFDR:</p>

\[x_{\text{shrinkage}} = \text{lFDR}\cdot0 + (1-\text{lFDR})\cdot x\]

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">true_values</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tribble</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">truth</span><span class="p">,</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w">
                       </span><span class="s2">"Positive"</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w">
                       </span><span class="s2">"Negative"</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">

</span><span class="n">shrinkage_estimates</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simple_qvalues</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">lfdr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">qvalue_estimates</span><span class="o">$</span><span class="n">lfdr</span><span class="p">,</span><span class="w">
         </span><span class="n">xs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="o">*</span><span class="p">(</span><span class="m">1</span><span class="o">-</span><span class="n">lfdr</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">truth</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">xs</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">gather</span><span class="p">(</span><span class="n">processing</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">truth</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">processing</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">processing</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"x"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"original value"</span><span class="p">,</span><span class="w">
                                </span><span class="n">processing</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"xs"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"shrinkage estimate"</span><span class="p">))</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">shrinkage_estimates</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">processing</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_density</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_vline</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">true_values</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">xintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mu</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"chartreuse"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_grid</span><span class="p">(</span><span class="n">truth</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_y"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"lFDR-based shrinkage improves agreement between observations and the true mean"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-06-11-lfdr_shrinkage/lFDR_shrinkage_ex-1.png" alt="plot of chunk lFDR_shrinkage_ex" /></p>

<p>Using lFDR-based shrinkage, values which are just noise were aggressively shrunk toward their true mean of 0, such that very little of their variation remains. Positives were shrunk using the same methodology, but extreme values were largely retained near their measured value. We can verify that there is an overall decrease in uncertainty about the true mean, reflecting the removal of noise.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">shrinkage_estimates</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">true_values</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"truth"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">resid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mu</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">processing</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">summarize</span><span class="p">(</span><span class="n">RMSE</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">mean</span><span class="p">(</span><span class="n">resid</span><span class="o">^</span><span class="m">2</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">kableExtra</span><span class="o">::</span><span class="n">kable_styling</span><span class="p">(</span><span class="n">full_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre></figure>

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> processing </th>
   <th style="text-align:right;"> RMSE </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> original value </td>
   <td style="text-align:right;"> 0.9994577 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> shrinkage estimate </td>
   <td style="text-align:right;"> 0.8103141 </td>
  </tr>
</tbody>
</table>

<h1 id="future-work">Future Work</h1>

<p>In a future post I’ll describe how lFDR-based shrinkage is particularly useful for signal processing of time-resolved perturbation data. In this case, early direct changes are rare, while late indirect changes are quite common. This intuition can be folded into how we estimate the lFDR by estimating a $\hat{\pi}_{0}$ which decreases monotonically with time using the <a href="https://academic.oup.com/biostatistics/article/22/1/68/5499195">functional FDR</a>.</p>]]></content><author><name>Sean Hackett</name></author><category term="statistics" /><summary type="html"><![CDATA[A review of the False Discovery Rate and its use for shrinkage estimation]]></summary></entry><entry><title type="html">Romic: Data Structures and EDA for Genomics</title><link href="https://www.shackett.org/romic/" rel="alternate" type="text/html" title="Romic: Data Structures and EDA for Genomics" /><published>2022-05-08T00:00:00+00:00</published><updated>2022-05-08T00:00:00+00:00</updated><id>https://www.shackett.org/romic</id><content type="html" xml:base="https://www.shackett.org/romic/"><![CDATA[<p>Romic is an R package, which I developed, that is now available on <a href="https://cran.r-project.org/web/packages/romic/index.html">CRAN</a>. There is already a nice README for romic on <a href="https://github.com/calico/romic">GitHub</a> and a <a href="https://calico.github.io/romic/articles/romic.html">pkgdown site</a>, so here I will add some context regarding the problems this package addresses.</p>

<p>The first problem we’ll consider is that genomics data analysis involves a lot of shuffling between various forms of wide and tall data, incrementally tacking on attributes as needed. Romic aims to simplify this process by providing a set of flexible data structures that accommodate a range of measurements and metadata and can be readily interconverted based on the needs of an analysis.</p>

<p>The second challenge we’ll contend with is decreasing the time it takes to generate a plot, so that the mechanics of plotting rarely interrupt the thought process of data interpretation. Building upon romic’s data structures, the meaning of each variable (feature-, sample-, or measurement-level) is encoded in a schema, so variables can be appropriately surfaced to filter or reorder a dataset and to add ggplot2 aesthetics. Interactivity is facilitated using Shiny apps composed from romic-centric Shiny modules.</p>

<p>Both of these solutions increase the speed, clarity, and succinctness of analysis. I’ve developed and will continue to refine this package to save myself (and hopefully others!) time.</p>

<!--more-->

<p>While romic is discussed in the parlance of genomics, romic’s data structures are useful for any moderately sized feature-level data, and its interactive visualizations can be used for any data with dense continuous measurements. Because of its generality, romic serves as a useful underlying data structure that can be combined with application-specific schemas and methods to create powerful, succinct workflows. One such application, which I’ll discuss in a future post, is the <a href="https://github.com/calico/claman">claman</a> R package which builds upon romic to create an opinionated workflow for mass spectrometry data analysis.</p>

<h1 id="data-structures-for-genomics">Data Structures for Genomics</h1>

<h2 id="conventional-formatting">Conventional Formatting</h2>

<p>Datasets in genomics are often generated and shared in wide formats (one row per gene, one column per sample), often with extra rows and columns added for feature and sample metadata. At first blush this is a good format because it supports both folks who want to work with a matrix-level dataset and individuals who are interested in specific genes.</p>

<p>That said, manipulating and visualizing such data requires integrating metadata with measurements. For example, when correcting for batch effects we often want to incorporate sample-level information, such as the date samples were collected. Combining numeric measurements with categorical and numeric metadata is awkward in matrices. One could do this with attributes, but generally we would just maintain separate tables for samples and features, since each variable in a table can have its own class. A benefit of this approach is that working with matrices can be very fast, while the major downsides are having to maintain multiple similar versions of a dataset, and needing to be careful about maintaining the alignment of measurements, features, and samples.</p>
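<p>The matrix-plus-side-tables pattern, and the alignment hazard it creates, can be sketched in a few lines of base R (a toy example, not from the paper):</p>

```r
# toy wide matrix: one row per gene, one column per sample
expr <- matrix(
  rnorm(6), nrow = 3,
  dimnames = list(
    c("YAL001C", "YAL002W", "YAL003W"),
    c("s1", "s2")
  )
)

# separate metadata tables: one row per feature / per sample
feature_meta <- data.frame(
  gene = rownames(expr),
  process = c("transcription", "transport", "translation")
)
sample_meta <- data.frame(
  sample = colnames(expr),
  batch = c("A", "B")
)

# the alignment we must maintain by hand: any reordering or subsetting of
# the matrix silently invalidates the side tables unless we check
stopifnot(
  identical(rownames(expr), feature_meta$gene),
  identical(colnames(expr), sample_meta$sample)
)
```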

<h2 id="romics-tabular-representations">Romic’s Tabular Representations</h2>

<p>An alternative to manipulating matrices is to work fully with tabular data. This mode of operation is very similar to working with SQL, allowing us to maintain a complex, yet organized dataset. Using tabular “tidy” data also allows us to tap into the expansive suite of tools in the tidyverse. Working with features, samples, and measurements tables allows us to modify each table separately, while the three tables can be combined (using primary key - foreign key relationships) if we need to add sample- or feature-level context to measurements.</p>
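<p>The three-table layout and its key-based joins can be illustrated in base R (a hypothetical toy dataset, using <code>merge()</code> in place of dplyr joins):</p>

```r
# features, samples, and measurements linked by primary/foreign keys
features <- data.frame(
  gene = c("YAL001C", "YAL002W"),
  process = c("transcription", "transport")
)
samples <- data.frame(
  sample = c("s1", "s2"),
  batch = c("2022-01-01", "2022-01-08")
)
measurements <- data.frame(
  gene = rep(c("YAL001C", "YAL002W"), each = 2),
  sample = rep(c("s1", "s2"), times = 2),
  abundance = c(1.2, 0.8, -0.5, 0.1)
)

# add sample-level context (e.g., batch for batch-effect correction)
# by joining measurements to the samples table on its foreign key,
# then pull in feature-level context the same way
annotated <- merge(measurements, samples, by = "sample")
annotated <- merge(annotated, features, by = "gene")
```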

<p>Romic provides two data structures, the triple_omic and tidy_omic classes, for representing these two scenarios. These formats can be used interchangeably in romic’s functions by treating them as a T*omics (tomic) meta-class. Most exported functions from romic take a tomic object, which means they can convert to whatever format makes the most sense for a function under the hood and then return a triple_omic or tidy_omic object depending on the input type.</p>

<p>Using a schema, tables can be combined and then broken apart again without constant guidance, and validators quickly flag data manipulation errors (such as non-unique primary keys, or measurements of the same sample with different sample attributes).</p>
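<p>As a flavor of what such a validator catches, here is a hypothetical sketch (not romic’s actual implementation) of a check for non-unique primary keys:</p>

```r
# hypothetical validator: flag duplicated values in a primary key column
check_primary_key <- function(df, pk) {
  dupes <- unique(df[[pk]][duplicated(df[[pk]])])
  if (length(dupes) > 0) {
    stop(sprintf(
      "non-unique primary key '%s': %s",
      pk, paste(dupes, collapse = ", ")
    ))
  }
  invisible(TRUE)
}

# a features table with an accidentally duplicated gene
features <- data.frame(gene = c("YAL001C", "YAL002W", "YAL001C"))
result <- tryCatch(
  check_primary_key(features, "gene"),
  error = function(e) conditionMessage(e)
)
```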

<p>By taking care of many of the joins and reshaping operations that we would otherwise have to do ourselves, romic helps to simplify analyses while avoiding common data manipulation errors. It directly supports dplyr and some ggplot operations, while data can also be easily pulled out of the romic format (and then added back if desired) based on users’ needs.</p>

<h1 id="exploratory-data-analysis-for-genomics">Exploratory Data Analysis for Genomics</h1>

<p>To demonstrate how easily romic can be used for formatting and exploratory data analysis, we can reanalyze an existing dataset.</p>

<p>Following a tradition set by Dave Robinson of teaching statistical analysis of genomics data using yeast microarrays (<a href="http://varianceexplained.org/r/tidy-genomics">link</a>), I generally teach statistical genomics with the <a href="https://www.molbiolcell.org/doi/full/10.1091/mbc.e07-08-0779">Brauer et al. 2008</a> dataset and this study formed the basis of romic’s <a href="https://calico.github.io/romic/articles/romic.html">vignette</a> and examples. To expand this theme, here we can look at another old-school yeast expression dataset. This one has over 5,500 citations!</p>

<p>In <a href="https://www.molbiolcell.org/doi/full/10.1091/mbc.11.12.4241">Gasch et al. 2000</a> the authors explored how yeast expression depends on a range of stressors. Gasch2K revealed that regardless of the nature of a stressor, yeast tend to respond to the threat with a relatively stereotypical gene expression response termed the “environmental stress response” (the ESR).</p>

<p>David Botstein (the senior author of both the Brauer and Gasch papers) describes the logic behind the ESR with a Star Trek-themed analogy. The idea is that when the Starship Enterprise is cruising along, most power goes to the engine. But, when the Enterprise is under attack (whether from Klingons, Romulans or asteroids) power needs to be redirected to the shields to combat the threat. Cells follow this “shields up, shields down” logic, growing fast when conditions are good and hunkering down when they are not. An interesting corollary of this behavior is that when facing one stress (such as desiccation), cells will simultaneously become more resistant to other stressors (such as heat shock).</p>

<p><img src="https://www.shackett.org/figure/romic/shields_up_down.png" alt="Shields Up, Shields Down" class="align-center" /></p>

<p>While humans have more complicated stress sensing pathways than yeast, the mammalian equivalent of the ESR, termed the integrated stress response (ISR), still serves an important role in sensing and responding to diverse stresses. Modulating this pathway is being actively explored as an anti-aging/disease therapy by <a href="https://www.calicolabs.com/publication/the-small-molecule-isrib-rescues-the-stability-and-activity-of-vanishing-white-matter-disease-eif2b-mutant-complexes">Calico</a>, <a href="https://www.alzforum.org/therapeutics/dnl343">Denali</a> and <a href="https://altoslabs.com/">Altos</a>.</p>

<h2 id="data-loading">Data Loading</h2>

<p>In what can only be described as par for the course in bioinformatics, while writing this post the Stanford site that was hosting Gasch2K was down, requiring me to obtain the dataset using the Wayback Machine. Having moved the dataset to my site (hosted on GitHub Pages), we can read it directly from a URL.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># environment setup</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">suppressPackageStartupMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">))</span><span class="w">
</span><span class="c1"># install from CRAN with install.packages("romic")</span><span class="w">
</span><span class="c1"># right now it's probably better to install the dev version from GitHub</span><span class="w">
</span><span class="c1"># with remotes::install_github("calico/romic")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">romic</span><span class="p">)</span><span class="w">

</span><span class="n">gasch_2000</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">read_tsv</span><span class="p">(</span><span class="w">
  </span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"https://www.shackett.org/files/gasch2000.txt"</span><span class="p">,</span><span class="w">
  </span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">cols</span><span class="p">()</span><span class="w"> </span><span class="c1"># to accept default column types</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="n">gasch_matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gasch_2000</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">UID</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">NAME</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">GWEIGHT</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">as.matrix</span><span class="p">()</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">gasch_matrix</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gasch_2000</span><span class="o">$</span><span class="n">UID</span><span class="w">

</span><span class="c1"># output</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">gasch_matrix</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="p">{</span><span class="nf">c</span><span class="p">(</span><span class="s2">"rows"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="s2">"columns"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">.</span><span class="p">[</span><span class="m">2</span><span class="p">])}</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">t</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">kableExtra</span><span class="o">::</span><span class="n">kable_styling</span><span class="p">(</span><span class="n">full_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre></figure>

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> rows </th>
   <th style="text-align:right;"> columns </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 6152 </td>
   <td style="text-align:right;"> 173 </td>
  </tr>
</tbody>
</table>

<h2 id="process-metadata">Process metadata</h2>

<p>To interpret any of the patterns in this dataset, we’ll need some metadata describing both the measured genes and samples.</p>

<h3 id="genes">Genes</h3>

<p>Genes are frequently summarized using Gene Ontology (GO) terms that capture their sub-cellular localization (CC), molecular function (MF) or biological process (BP). These are typically one-to-many relationships where a given gene will belong to multiple GO terms in each of the three ontologies. The GO slim ontologies used here are a curated subset of GO terms which map each gene to a single BP, MF and CC term. These ontologies are convenient for quick data slicing and inspection, but we would be better off with the full ontologies for systematic approaches like Gene Set Enrichment Analysis (GSEA).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">goslim_mappings</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">read_tsv</span><span class="p">(</span><span class="w">
    </span><span class="s2">"https://downloads.yeastgenome.org/curation/literature/go_slim_mapping.tab"</span><span class="p">,</span><span class="w">
    </span><span class="n">col_names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"ORF"</span><span class="p">,</span><span class="w"> </span><span class="s2">"common"</span><span class="p">,</span><span class="w"> </span><span class="s2">"SGD"</span><span class="p">,</span><span class="w"> </span><span class="s2">"category"</span><span class="p">,</span><span class="w"> </span><span class="s2">"geneset"</span><span class="p">,</span><span class="w"> </span><span class="s2">"GO"</span><span class="p">,</span><span class="w"> </span><span class="s2">"class"</span><span class="p">),</span><span class="w">
    </span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">readr</span><span class="o">::</span><span class="n">cols</span><span class="p">()</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">GO</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">ORF</span><span class="p">,</span><span class="w"> </span><span class="n">category</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">tidyr</span><span class="o">::</span><span class="n">spread</span><span class="p">(</span><span class="n">category</span><span class="p">,</span><span class="w"> </span><span class="n">geneset</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="w">
    </span><span class="n">ORF</span><span class="p">,</span><span class="w"> </span><span class="n">common</span><span class="p">,</span><span class="w"> </span><span class="n">SGD</span><span class="p">,</span><span class="w"> </span><span class="n">class</span><span class="p">,</span><span class="w">
    </span><span class="n">cellular_compartment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">C</span><span class="p">,</span><span class="w">
    </span><span class="n">molecular_function</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w">
    </span><span class="n">biological_process</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">P</span><span class="w">
  </span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w">

</span><span class="n">feature_metadata</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gasch_2000</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">UID</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">goslim_mappings</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"UID"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ORF"</span><span class="p">))</span><span class="w">

</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">feature_metadata</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">dplyr</span><span class="o">::</span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">))</span></code></pre></figure>

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> UID </th>
   <th style="text-align:left;"> common </th>
   <th style="text-align:left;"> SGD </th>
   <th style="text-align:left;"> class </th>
   <th style="text-align:left;"> cellular_compartment </th>
   <th style="text-align:left;"> molecular_function </th>
   <th style="text-align:left;"> biological_process </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> YAL001C </td>
   <td style="text-align:left;"> TFC3 </td>
   <td style="text-align:left;"> S000000001 </td>
   <td style="text-align:left;"> ORF|Verified </td>
   <td style="text-align:left;"> mitochondrion </td>
   <td style="text-align:left;"> DNA binding </td>
   <td style="text-align:left;"> biosynthetic process </td>
  </tr>
  <tr>
   <td style="text-align:left;"> YAL002W </td>
   <td style="text-align:left;"> VPS8 </td>
   <td style="text-align:left;"> S000000002 </td>
   <td style="text-align:left;"> ORF|Verified </td>
   <td style="text-align:left;"> CORVET complex </td>
   <td style="text-align:left;"> enzyme binding </td>
   <td style="text-align:left;"> endosomal transport </td>
  </tr>
  <tr>
   <td style="text-align:left;"> YAL003W </td>
   <td style="text-align:left;"> EFB1 </td>
   <td style="text-align:left;"> S000000003 </td>
   <td style="text-align:left;"> ORF|Verified </td>
   <td style="text-align:left;"> eukaryotic translation elongation factor 1 complex </td>
   <td style="text-align:left;"> enzyme regulator activity </td>
   <td style="text-align:left;"> biosynthetic process </td>
  </tr>
  <tr>
   <td style="text-align:left;"> YAL004W </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> S000002136 </td>
   <td style="text-align:left;"> ORF|Dubious </td>
   <td style="text-align:left;"> cellular component </td>
   <td style="text-align:left;"> molecular function </td>
   <td style="text-align:left;"> biological process </td>
  </tr>
  <tr>
   <td style="text-align:left;"> YAL005C </td>
   <td style="text-align:left;"> SSA1 </td>
   <td style="text-align:left;"> S000000004 </td>
   <td style="text-align:left;"> ORF|Verified </td>
   <td style="text-align:left;"> cell wall </td>
   <td style="text-align:left;"> ATP hydrolysis activity </td>
   <td style="text-align:left;"> biosynthetic process </td>
  </tr>
</tbody>
</table>

<h3 id="samples">Samples</h3>

<p>Working with a fresh dataset invariably involves some munging to get data and metadata into a usable format. In the case of the Gasch2K dataset, organizing samples was the most painful part of this process. Gasch2K’s samples are identified with short, irregularly formatted names, so it takes a bit of work to organize them. We could address this problem with a manually curated spreadsheet (I generally use tibble::tribble() for small tables and Google Sheets for larger ones). Luckily, the samples here are still organized enough that we can programmatically summarize them. Samples are defined in two ways: first, by the type of stressor (e.g., heat, starvation, …) and second, by the severity of the stressor. Within each stressor, samples are typically arranged in order of increasing stress. With this setup, we can capture each stressor using regular expressions, which also lets us absorb inconsistencies in the labels (such as “diauxic” versus “Diauxic”).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">stringr</span><span class="p">)</span><span class="w">

</span><span class="n">experiment_labels</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="o">::</span><span class="n">tibble</span><span class="p">(</span><span class="n">sample</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">gasch_matrix</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">experiment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"hs\\-1"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Heat Shock (A) (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"hs\\-2"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Heat Shock (B) (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^37C to 25C"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Cold Shock (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^heat shock"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Heat Shock (severity)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^29C to 33C"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"29C to 33C (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^29C \\+1M sorbitol to 33C \\+ 1M sorbitol"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"29C + Sorbitol to 33C + Sorbitol (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^29C \\+1M sorbitol to 33C \\+ \\*NO sorbitol"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"29C + Sorbitol to 33C (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^constant 0.32 mM H2O2"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Hydrogen peroxide (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^1 ?mM Menadione"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Menadione (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^2.5mM DTT"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"DTT (A) (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^dtt"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"DTT (B) (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"diamide"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Diamide (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^1M sorbitol"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Sorbitol (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^Hypo-osmotic shock"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Hypo-Osmotic Shock (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^aa starv"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Amino Acid Starvation (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^Nitrogen Depletion"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Nitrogen Depletion (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"^[Dd]iauxic [Ss]hift"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Diauxic Shift (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"ypd-2"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"YPD (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"ypd-1"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"YPD stationary phase (duration)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"overexpression"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"TF Overexpression"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"car-1"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Carbon Sources (A)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"car-2"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Carbon Sources (B)"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"ct-1"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Temperature Gradient"</span><span class="p">,</span><span class="w">
    </span><span class="n">str_detect</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="s2">"ct-2"</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"Temperature Gradient, Steady State"</span><span class="w">
    </span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">experiment</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">experiment_order</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">())</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="w">
    </span><span class="n">experiment_order</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">experiment</span><span class="p">),</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">experiment_order</span><span class="p">),</span><span class="w">
    </span><span class="n">experiment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">is.na</span><span class="p">(</span><span class="n">experiment</span><span class="p">),</span><span class="w"> </span><span class="s2">"Other"</span><span class="p">,</span><span class="w"> </span><span class="n">experiment</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">

</span><span class="n">experiment_labels</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">dplyr</span><span class="o">::</span><span class="n">sample_n</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">kableExtra</span><span class="o">::</span><span class="n">kable_styling</span><span class="p">(</span><span class="n">full_width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre></figure>

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> sample </th>
   <th style="text-align:left;"> experiment </th>
   <th style="text-align:right;"> experiment_order </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 37C to 25C shock - 90 min </td>
   <td style="text-align:left;"> Cold Shock (duration) </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 29C +1M sorbitol to 33C + *NO sorbitol - 30 minutes </td>
   <td style="text-align:left;"> 29C + Sorbitol to 33C (duration) </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> heat shock 25 to 37, 20 minutes </td>
   <td style="text-align:left;"> Heat Shock (severity) </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> YPD stationary phase 5 d ypd-1 </td>
   <td style="text-align:left;"> YPD stationary phase (duration) </td>
   <td style="text-align:right;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> constant 0.32 mM H2O2 (80 min) redo </td>
   <td style="text-align:left;"> Hydrogen peroxide (duration) </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
</tbody>
</table>

<h2 id="formatting-for-romic">Formatting for romic</h2>

<p>Romic organizes genomic datasets as sets of measurement-, sample-, and feature-level variables. We’ve essentially created three tables capturing each of these aspects of our dataset already. Romic can bundle these together using a feature primary key shared between the features and measurements table (here, “UID”), and a sample primary key shared between the samples and measurements table (here, “sample”).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># tidy gasch measurements</span><span class="w">
</span><span class="n">tall_gasch</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gasch_2000</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">NAME</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">GWEIGHT</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">tidyr</span><span class="o">::</span><span class="n">gather</span><span class="p">(</span><span class="s2">"sample"</span><span class="p">,</span><span class="w"> </span><span class="s2">"expression"</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">UID</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">dplyr</span><span class="o">::</span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">expression</span><span class="p">))</span><span class="w">
  
</span><span class="n">triple_omic</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">create_triple_omic</span><span class="p">(</span><span class="w">
  </span><span class="n">measurement_df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tall_gasch</span><span class="p">,</span><span class="w">
  </span><span class="n">feature_df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">feature_metadata</span><span class="p">,</span><span class="w">
  </span><span class="n">sample_df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">experiment_labels</span><span class="p">,</span><span class="w">
  </span><span class="n">feature_pk</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"UID"</span><span class="p">,</span><span class="w">
  </span><span class="n">sample_pk</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"sample"</span><span class="w">
</span><span class="p">)</span></code></pre></figure>

<h1 id="plotting-at-the-tips-of-your-fingers">Plotting At the Tips of Your Fingers</h1>

<p>When it’s inefficient to explore a dataset, analyses will either be cursory or take longer than they should. While bespoke plots that explore specific aspects of a dataset are difficult to automate, the early stages of exploratory data analysis (EDA) should be. During EDA we hope to identify the major sources of variation in a dataset. Ideally this variation will reflect planned factors in our experimental design, but unexpected sources of variability also frequently need to be identified so they can be accounted for during modeling.</p>

<p>To support this early exploration, romic provides several specialized and general-purpose interactive Shiny apps built from composable Shiny modules. We’ll use two of these apps to demonstrate a general workflow where we’ll</p>

<ol>
  <li>Interactively visualize our dataset in Shiny</li>
  <li>Share the Shiny app using shinyapps.io (or RStudio Connect)</li>
  <li>Create a static visualization summarizing our findings.</li>
</ol>

<h2 id="principal-components-analysis">Principal Components Analysis</h2>

<p>To explore the major factors driving variation in a dataset it is a good idea to look at a low-dimensional representation of samples. Principal components analysis addresses this problem by sequentially capturing and then removing the most prominent one-dimensional pattern in the data. As an example, the Brauer 2008 experiment explored gene expression as yeast grew at different rates in different environments. When applying Singular Value Decomposition (SVD) (PCA is a special case of SVD), the most prominent pattern in samples (one principal component (PC), occasionally called an eigengene; a vector over samples) closely mirrored the growth rate, while the corresponding pattern across genes reflected how their expression changes with growth rate (one loading; a vector over genes) <a href="https://pubmed.ncbi.nlm.nih.gov/17959824/#&amp;gid=article-figures&amp;pid=figure-3-uid-2">see Brauer 2008 - Figure 3</a>. Having captured this pattern, it can be removed from the data, allowing the second most prominent pattern to be estimated, which can then be removed to estimate the third, and so forth. In PCA/SVD, each pattern is constructed to maximize the variation in the dataset that it explains, and this fraction of variance explained is often important for interpreting PCs. Romic currently does not expose this information (though it probably should).</p>
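While romic doesn’t surface the variance explained, it falls directly out of the decomposition. Here is a minimal base-R sketch on toy data (this is for intuition, not romic’s internal code):

```r
# Toy data: 20 genes x 10 samples; center each gene before decomposing
set.seed(42)
expr <- matrix(rnorm(200), nrow = 20)
centered <- expr - rowMeans(expr)

decomp <- svd(centered)

# each PC's share of the total variance comes from the squared singular values
var_explained <- decomp$d^2 / sum(decomp$d^2)

# sample-level coordinates on each PC (the "eigengene" patterns over samples)
pcs <- diag(decomp$d) %*% t(decomp$v)
```

The first few entries of `var_explained` tell you how seriously to take a PC1 x PC2 scatter plot: if they sum to a small fraction, most of the structure lives in later components.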

<p>If we have a simple design comparing a gene knockout to a wild type (functioning gene), we would hope that mutants look more similar to one another than to wild-type individuals, and vice versa. The differences between the mutant and wild type would manifest as a set of correlated expression changes that would largely be captured by the leading principal components. To visualize how sets of samples are separated in principal component space, it is common to create a scatter plot of PC1 x PC2 and then label each sample by the elements of the experimental design that are driving their separation in expression space.</p>

<p>One limitation of PCA is that subtle patterns in the data (later principal components) are ignored entirely when we only look at the leading principal components. Alternatives to PCA, such as t-SNE and UMAP, are increasingly popular for visualizing sample similarity because they can simultaneously capture all expression patterns driving sample similarity in a two-dimensional summary. As a result, samples may be placed apart even if they have similar values of PC1 and PC2. The downside of this more holistic view of sample similarity is that distances are difficult to interpret. Samples that are very close to each other are likely quite similar, while samples at a moderate distance could either be similar or totally different - see <a href="https://www.biorxiv.org/content/10.1101/2021.08.25.457696v1">The Specious Art of Single-Cell Genomics</a>.</p>

<h3 id="shiny-app">Shiny app</h3>

<p>SVD and PCA are fundamentally linear algebra techniques and therefore do not work if our dataset has missing values. (Optimization-based variants do exist but they are not implemented in romic.) If we filtered out all genes missing measurements in at least one sample of the Gasch2K dataset, we would drop more than 80% of our features. To avoid this outcome it is common to perform some form of missing value imputation on genomics datasets. Imputation should be avoided if possible and otherwise thoughtfully applied using a technique appropriate for the data modality you are working with. For microarrays, the standard approach is K-nearest neighbors (KNN) imputation. In KNN imputation, the K most similar neighbors of a gene with missing values are found using its non-missing measurements, and each missing value is imputed using the neighbors’ average expression.</p>
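For intuition, the gene-by-gene version of KNN imputation can be hand-rolled in a few lines. This is only a sketch - `knn_impute_gene()` is a hypothetical helper, not a romic function, and a production analysis would use romic’s `impute_missing_values()` or a dedicated imputation package:

```r
# Impute one gene's missing values from its K most similar genes.
# `expr` is a genes x samples matrix; `gene` is a row index.
knn_impute_gene <- function(expr, gene, k = 10) {
  target <- expr[gene, ]
  others <- expr[-gene, , drop = FALSE]
  # distance to every other gene over the samples both were measured in
  dists <- apply(others, 1, function(other) {
    shared <- !is.na(target) & !is.na(other)
    sqrt(mean((target[shared] - other[shared])^2))
  })
  # average the K nearest genes to fill in each missing sample
  neighbors <- others[order(dists)[seq_len(k)], , drop = FALSE]
  target[is.na(target)] <- colMeans(neighbors, na.rm = TRUE)[is.na(target)]
  target
}
```

The real Troyanskaya-style implementations add refinements (weighting neighbors by distance, handling genes with too few shared measurements), but the core averaging step is the one shown here.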

<p>Romic makes some decisions about how to proceed when an operation could not otherwise be applied to a dataset, but imputation must be performed explicitly; otherwise, romic would toss out all genes with missing values when estimating the PCs.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">imputed_triple</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">triple_omic</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># overwrite existing expression so that we don't have the</span><span class="w">
  </span><span class="c1"># raw expression changes which contains lots of missing values</span><span class="w">
  </span><span class="n">impute_missing_values</span><span class="p">(</span><span class="n">impute_var_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"expression"</span><span class="p">)</span></code></pre></figure>

<p>With our imputed dataset we can now easily create a local Shiny app where we can overlay different sample attributes on PCs 1-5.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">app_pcs</span><span class="p">(</span><span class="n">imputed_triple</span><span class="p">)</span></code></pre></figure>

<p>Running Shiny apps requires a live R session working under the hood, so it’s often quite challenging for other users (particularly non-technical ones) to set up the dependencies required to run an app. Luckily, RStudio has created a couple of nice frameworks where Shiny apps can be deployed to a remote server running R. This allows users to simply navigate to a URL to access results. My employer, Calico, uses the enterprise product RStudio Connect to host internal apps on our own Google Cloud Platform server. Here, I’ll demonstrate deployment to a similar service hosted by RStudio, shinyapps.io. Here is the end product: <a href="https://seanhacks.shinyapps.io/romic-PCs/">romic PCs</a>. (This app isn’t behaving very well on the free tier of shinyapps.io but it works fine locally and on Connect ¯\_(ツ)_/¯).</p>

<p>When deploying content to Connect or shinyapps.io, R has to understand how to run your app on the remote server. To do this it will either attempt to automatically identify package versions and where to obtain them (CRAN, Bioconductor, GitHub, RStudio Package Manager) or read these versions out of a file. I generally use renv for non-trivial deployments since it can manage Python environments as well. Beyond this, it’s nice to have all of the files we want to deploy to the server in a single directory. In most cases I store data on Google Cloud Storage or Cloud SQL to make it easy to access results from a remote server.</p>

<p>To deploy this app, I put the following code in an “app.R” file in a directory containing a .Rds file of “imputed_triple”. Then I ran the Shiny app with shiny::runApp() and hit the publish button in the top right of the RStudio pop-up. shinyapps.io is one of the options, and the deployment proceeded without any hiccups.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">shiny</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">romic</span><span class="p">)</span><span class="w">

</span><span class="n">tidy_omics</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s2">"gasch2K.Rds"</span><span class="p">)</span><span class="w">
</span><span class="n">app_pcs</span><span class="p">(</span><span class="n">tidy_omics</span><span class="p">)</span></code></pre></figure>

<h3 id="static-pc-plot">Static PC Plot</h3>

<p>Having interactively explored the relationships between the PCs and our experimental design we may want to summarize our results using a static figure. Since romic’s apps call ggplot2-based plotting functions it is easy to recreate dynamically-generated plots. Of course, we could also just save our plot from the Shiny app’s interface.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">samples_with_PCs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">imputed_triple</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">add_pca_loadings</span><span class="p">(</span><span class="n">npcs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># if you aren't used to the {} syntax, it doesn't use the object you</span><span class="w">
  </span><span class="c1"># piped in as the first argument. The object is still accessible with "."</span><span class="w">
  </span><span class="p">{</span><span class="n">.</span><span class="o">$</span><span class="n">samples</span><span class="p">}</span><span class="w">

</span><span class="n">plot_bivariate</span><span class="p">(</span><span class="w">
  </span><span class="n">tomic_table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">samples_with_PCs</span><span class="p">,</span><span class="w">
  </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"PC1"</span><span class="p">,</span><span class="w">
  </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"PC2"</span><span class="p">,</span><span class="w">
  </span><span class="n">color_var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"experiment"</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="w">
    </span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Most stressors modulate a common set of genes"</span><span class="p">,</span><span class="w">  
    </span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gasch2K expression principal components labelled by experiment"</span><span class="w">
    </span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">guides</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">guide_legend</span><span class="p">(</span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bottom"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-romic/gasch_pcs-1.png" alt="plot of chunk gasch_pcs" /></p>

<p>Based on this analysis we can see that most experiments cluster together aside from the “YPD” timecourses. These represent starvation conditions where the yeast clearly react with added measures beyond the ESR. Overlaying “experiment_order” and lassoing points to see them in a table using the interactive app, we can also see that samples are roughly ordered from less severe to more severe within the non-YPD conditions.</p>

<h2 id="heatmaps">Heatmaps</h2>

<p>While PCA allows us to summarize latent features of our dataset, it is also helpful to view observation-level results in some format. This often involves plotting individual features, but for genomics data it is also common to visualize the complete dataset using a heatmap. Heatmaps are essentially a visualization of a matrix of expression values (such as expression mean-centered by gene) with genes rearranged so that covarying genes are near one another. Samples may also be organized by similarity but frequently are organized by the experimental design. To order features and/or samples, hierarchical clustering is applied to create a tree linking all genes through successive merges of similarly behaving clusters. The main parameters in hierarchical clustering are the distance measure, which defines how dissimilar each pair of genes is, and the agglomeration method, which affects whether many small clusters or a few large clusters are created. It is important to choose a distance measure appropriate for your problem (here, Euclidean distance), while I generally don’t focus on the clustering method (Ward.D2 is the default in romic). Both options are exposed whenever hierarchical clustering is performed in romic.</p>
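The row-ordering step behind a heatmap can be sketched in a few lines of base R (toy data here; romic performs the equivalent internally):

```r
# Toy data: 30 genes x 10 samples
set.seed(42)
expr <- matrix(rnorm(300), nrow = 30)

# pairwise gene dissimilarities, then a tree of successive merges
gene_dists <- dist(expr, method = "euclidean")
gene_tree <- hclust(gene_dists, method = "ward.D2")

# reorder rows so that similarly-behaving genes sit next to each other
ordered_expr <- expr[gene_tree$order, ]
```

Swapping `method = "euclidean"` for a correlation-based distance, or `"ward.D2"` for `"average"` or `"complete"`, is how the two parameters discussed above change the resulting heatmap.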

<h3 id="shiny-app-1">Shiny app</h3>

<p>Heatmaps can be quite slow to render, so to demo this functionality we’ll work with a subset of the Gasch2K dataset. To do this we’ll filter the samples table to the experiments exploring the relationship between heat and gene expression.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># filter to a few experiments for the demo</span><span class="w">
</span><span class="n">heatshock_triple_omic</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">imputed_triple</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter_tomic</span><span class="p">(</span><span class="w">
    </span><span class="n">filter_type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"category"</span><span class="p">,</span><span class="w">
    </span><span class="n">filter_table</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"samples"</span><span class="p">,</span><span class="w">
    </span><span class="n">filter_variable</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"experiment"</span><span class="p">,</span><span class="w">
    </span><span class="n">filter_value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w">
      </span><span class="s2">"Heat Shock (A) (duration)"</span><span class="p">,</span><span class="w">
      </span><span class="s2">"Heat Shock (B) (duration)"</span><span class="p">,</span><span class="w">
      </span><span class="s2">"Heat Shock (severity)"</span><span class="p">,</span><span class="w">
      </span><span class="s2">"Temperature Gradient"</span><span class="w">
      </span><span class="p">))</span></code></pre></figure>

<p>Following the same deployment approach used above we can easily create a minimal Shiny app for helping us browse and explore heatmaps based on this dataset: <a href="https://seanhacks.shinyapps.io/romic-heatmap/">Shiny romic heatmap</a>. Since the app is ggplot2-based it is quite easy to add facets to organize samples.</p>

<h3 id="static-heatmaps">Static Heatmaps</h3>

<p>Once we find a nice view of our heatmap we can reproduce the results with a static visualization.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_heatmap</span><span class="p">(</span><span class="w">
  </span><span class="n">tomic</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">heatshock_triple_omic</span><span class="p">,</span><span class="w">
  </span><span class="n">cluster_dim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"rows"</span><span class="p">,</span><span class="w">
  </span><span class="n">change_threshold</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_grid</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">experiment</span><span class="p">,</span><span class="w"> </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_x"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-romic/static_heatmap-1.png" alt="plot of chunk static_heatmap" /></p>

<p>From this plot we can see when the yeast are most stressed by heat: they turn genes up or down in a graded fashion in response to both progressive and severe heat. Interestingly, the heat shock experiment stresses the yeast only transiently - by 80 minutes the stress has passed and the yeast have adapted to live with the elevated temperature. Yeast are tough.</p>

<h1 id="wrapping-up">Wrapping Up</h1>

<p>Romic is built around a core data structure (the T*Omic) that efficiently tracks data and metadata as a dataset is filtered, mutated and reorganized during analysis. To enable these manipulations, romic distinguishes feature-, sample- and measurement-level variables using a schema. An added benefit of this approach is that variables can be automatically mapped to feasible aesthetics during plotting. It wouldn’t make much sense to color by a measurement in a sample-level plot, nor to color a heatmap by a categorical variable. This property can be exploited with dynamic visualizations which map variables to feasible ggplot2 aesthetics.</p>]]></content><author><name>Sean Hackett</name></author><category term="R" /><category term="analysis" /><category term="software" /><summary type="html"><![CDATA[An R package for exploratory data analysis of high-dimensional datasets]]></summary></entry><entry><title type="html">Time zero normalization with the Multivariate Gaussian distribution</title><link href="https://www.shackett.org/time_zero_normalization/" rel="alternate" type="text/html" title="Time zero normalization with the Multivariate Gaussian distribution" /><published>2022-05-08T00:00:00+00:00</published><updated>2022-05-08T00:00:00+00:00</updated><id>https://www.shackett.org/time_zero_normalization</id><content type="html" xml:base="https://www.shackett.org/time_zero_normalization/"><![CDATA[<p>Timecourses are a powerful experimental design for evaluating the impact of a perturbation. These perturbations are usually chemical because chemicals, such as drugs, can be introduced quickly and with high temporal precision. However, with some technologies, such as the estradiol-driven promoters that I used in the induction dynamics expression atlas (<a href="https://idea.research.calicolabs.com">IDEA</a>), it is possible to rapidly perturb a single gene, further increasing specificity and broadening applicability.
By rapidly perturbing individuals, we can synchronize them based on the time when dosing began. We often call this point “time zero”, while all subsequent measurements correspond to the time post-perturbation. (Since time zero corresponds to the point when a perturbation is applied but has not yet impacted the system, this measurement is usually taken just before adding the perturbation.)</p>

<p>One of the benefits of collecting a time zero measurement is that it allows us to remove, or account for, effects that are shared among all time points. In many cases this may just amount to analyzing fold-changes of post-perturbation measurements with respect to their time zero observation, rather than the original measurements themselves. This can be useful if there is considerable variation among timecourses irrespective of the perturbation, such as if we were studying humans or mice. Similarly, undesirable variation due to day-to-day differences in instruments, sample stability, or any of the many other factors that could produce batch effects can sometimes be addressed by measuring each timecourse together and working with fold-changes. In either case, correcting for individual effects using pre-perturbation measurements will increase our power to detect perturbations’ effects.</p>
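In code, this normalization is just a grouped fold-change calculation. A sketch with dplyr on toy data (the tibble and its column names are illustrative, not from the IDEA):

```r
library(dplyr)

# toy data: two genes measured at three timepoints, time 0 pre-perturbation
timecourse_data <- tibble::tibble(
  gene = rep(c("g1", "g2"), each = 3),
  time = rep(c(0, 30, 60), times = 2),
  abundance = c(10, 20, 40, 8, 8, 2)
)

# express each measurement as a log2 fold-change versus its own time zero
time_zero_normalized <- timecourse_data %>%
  group_by(gene) %>%
  mutate(log2_fc = log2(abundance / abundance[time == 0])) %>%
  ungroup()
```

By construction every timecourse starts at a log2 fold-change of zero, so individual- or batch-level offsets shared across a timecourse cancel out.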

<p>Aside from correcting for unwanted variation, the kinetics of timecourses are a rich source of information which can be either a blessing or a curse. With temporal information, ephemeral responses can be observed. We can see both which features are changing and when they are changing. And, the ordering of events can point us towards causality. In practice, each of these goals can be difficult or impossible to achieve, leaving us with a nagging feeling that we’re leaving information on the table. There are many competing options for identifying differences in timecourses, few ways of summarizing dynamics intuitively, and causal inference is often out of reach. In this post, and others to follow, I’ll pick apart a few of these limitations, discussing developments that were applied to the IDEA but will likely be useful for others thinking about biological timeseries analysis (or other timeseries if you are so inclined!). Here, I evaluate a few established methods for identifying features which vary across time and then introduce an alternative approach, based on the Multivariate Gaussian distribution and Mahalanobis distance, which increases power and does not require any assumptions about responses’ kinetics.</p>

<!--more-->
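As a preview of that alternative approach, base R’s stats::mahalanobis() scores a gene’s vector of fold-changes against a covariance matrix describing the noise across timepoints; under the null of no signal, that squared distance follows a chi-square distribution. The covariance below is a stand-in (independent timepoints with a fixed sd), not an estimate from real data:

```r
set.seed(42)
n_timepts <- 8

# stand-in null covariance: independent timepoints, sd = 0.5 each
null_cov <- diag(0.5^2, n_timepts)

# one gene's fold-changes across timepoints, simulated under the null
obs <- rnorm(n_timepts, mean = 0, sd = 0.5)

# squared Mahalanobis distance and its chi-square p-value under the null
d2 <- mahalanobis(obs, center = rep(0, n_timepts), cov = null_cov)
p_value <- pchisq(d2, df = n_timepts, lower.tail = FALSE)
```

Because the score aggregates evidence across all timepoints at once, it needs no parametric model of the response shape - the point developed in the rest of the post.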

<h2 id="our-timecourse-experiment">Our timecourse experiment</h2>

<p>To evaluate methods for detecting temporal dynamics it’s helpful to use a dataset where there are clear-cut examples of timecourses with and without signal. With such a dataset in hand, we can easily detect signals that we are missing (false negatives), noise that we think is real (false positives), and evaluate overall recall (what fraction of signals we are detecting). We rarely have such positive and negative examples in real timecourses, so instead we can simulate timecourses with and without signal. Going forward I will also use genes as short-hand for whatever features we might be working with, since I’ve primarily worked with these methods in the context of gene expression data.</p>

<h3 id="environment-setup">Environment Setup</h3>

<p>First, I’m going to setup the R environment by loading some bread-and-butter packages and setting the global options for future and ggplot2.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># general use packages</span><span class="w">
</span><span class="n">suppressPackageStartupMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">))</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">future</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">

</span><span class="c1"># R package for simulating dynamics</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">impulse</span><span class="p">)</span><span class="w">

</span><span class="c1"># global options</span><span class="w">
</span><span class="c1"># setup parallelization</span><span class="w">
</span><span class="n">plan</span><span class="p">(</span><span class="s2">"multisession"</span><span class="p">,</span><span class="w"> </span><span class="n">workers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w">

</span><span class="c1"># ggplot default theme</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span></code></pre></figure>

<h3 id="simulate-timecourses-containing-signal">Simulate timecourses containing signal</h3>

<p>First, we can generate the subset of our timecourses which contain signal. These timecourses should follow a broad range of biologically-feasible patterns.</p>

<p>To construct such timecourses, we can use the phenomenological timecourse model of Chechik &amp; Koller, which represents timecourses as a pair of sigmoidal responses, called an impulse. We’ll also use a simpler, single-sigmoid version of the C &amp; K model.</p>

<p>To simulate data from these models, we can use the simulate_timecourses() function from the impulse R package, available on <a href="https://github.com/calico/impulse">GitHub</a>.</p>

<p>This function will draw a set of parameters for sigmoidal and impulse from appropriate distributions to define a simulated timecourse. We’ll then add independent normally distributed noise to each observation. (For most genomic data types, measurements are log-normal so we could think of these abundance units as already having been log-transformed).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timepts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="m">40</span><span class="p">,</span><span class="w"> </span><span class="m">60</span><span class="p">,</span><span class="w"> </span><span class="m">90</span><span class="p">)</span><span class="w"> </span><span class="c1"># time points measured</span><span class="w">
</span><span class="n">measurement_sd</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.5</span><span class="w"> </span><span class="c1"># standard deviation of Gaussian noise added to each observation</span><span class="w">
</span><span class="n">total_measurements</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">10000</span><span class="w"> </span><span class="c1"># total number of genes</span><span class="w">
</span><span class="n">signal_frac</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.2</span><span class="w"> </span><span class="c1"># what fraction of genes contain real signal</span><span class="w">

</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1234</span><span class="p">)</span><span class="w">

</span><span class="c1"># simulate timecourses containing signal </span><span class="w">

</span><span class="n">alt_timecourses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">impulse</span><span class="o">::</span><span class="n">simulate_timecourses</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">total_measurements</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">signal_frac</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
                                                 </span><span class="n">timepts</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timepts</span><span class="p">,</span><span class="w">
                                                 </span><span class="n">prior_pars</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">v_sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w">
                                                                </span><span class="n">rate_shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
                                                                </span><span class="n">rate_scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.25</span><span class="p">,</span><span class="w">
                                                                </span><span class="n">time_shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
                                                                </span><span class="n">time_scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">),</span><span class="w">
                                                 </span><span class="n">measurement_sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">measurement_sd</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">unnest_legacy</span><span class="p">(</span><span class="n">measurements</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">true_model</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="c1"># drop timecourses where no true value's magnitude is greater than 1 (these</span><span class="w">
  </span><span class="c1"># aren't really signal-containing)</span><span class="w">
  </span><span class="c1"># and timecourses where the initial value isn't ~zero</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="nf">any</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">sim_fit</span><span class="p">)</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
         </span><span class="nf">abs</span><span class="p">(</span><span class="n">sim_fit</span><span class="p">[</span><span class="n">time</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">])</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w">

</span><span class="c1"># only retain the target number of signal containing timecourses</span><span class="w">
</span><span class="n">alt_timecourses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">alt_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">semi_join</span><span class="p">(</span><span class="w">
    </span><span class="n">alt_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">distinct</span><span class="p">(</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">sample_n</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="n">total_measurements</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">signal_frac</span><span class="p">)),</span><span class="w">
    </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tc_id"</span><span class="p">)</span><span class="w">

</span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">alt_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">timepts</span><span class="p">)))</span></code></pre></figure>

<table>
  <thead>
    <tr>
      <th style="text-align: right">tc_id</th>
      <th style="text-align: right">time</th>
      <th style="text-align: right">sim_fit</th>
      <th style="text-align: right">abundance</th>
      <th style="text-align: left">signal</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">-0.0976223</td>
      <td style="text-align: right">0.0957832</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">-0.2018300</td>
      <td style="text-align: right">-0.5742951</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">10</td>
      <td style="text-align: right">-0.3967604</td>
      <td style="text-align: right">-0.2166940</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">20</td>
      <td style="text-align: right">-1.1307509</td>
      <td style="text-align: right">-0.9666770</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">30</td>
      <td style="text-align: right">-1.8594480</td>
      <td style="text-align: right">-0.9893428</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">40</td>
      <td style="text-align: right">-2.1534230</td>
      <td style="text-align: right">-1.8150251</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">60</td>
      <td style="text-align: right">-2.2445179</td>
      <td style="text-align: right">-2.7889178</td>
      <td style="text-align: left">contains signal</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">90</td>
      <td style="text-align: right">-2.2489452</td>
      <td style="text-align: right">-2.5141336</td>
      <td style="text-align: left">contains signal</td>
    </tr>
  </tbody>
</table>

<h3 id="simulate-timecourses-which-are-just-noise">Simulate timecourses which are just noise</h3>

<p>Timecourses which are just noise are easy to generate: we can simply draw independent values from a normal distribution (with the same standard deviation that we used to add noise to the signals).</p>

<p>With timecourses with and without signals in hand, we can combine the two sets together while tracking their origin.</p>

<p>Additionally, since we are interested in time-dependent changes relative to time zero, we can transform abundances into fold changes by subtracting the initial value of each timecourse from every measurement. We’ll work with both the native abundance scale and the time-zero-normalized fold changes going forward.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">null_timecourses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">crossing</span><span class="p">(</span><span class="n">tc_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">alt_timecourses</span><span class="o">$</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
                                         </span><span class="nf">max</span><span class="p">(</span><span class="n">alt_timecourses</span><span class="o">$</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">total_measurements</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="o">-</span><span class="n">signal_frac</span><span class="p">)),</span><span class="w">
                             </span><span class="n">time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timepts</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"no signal"</span><span class="p">,</span><span class="w">
         </span><span class="n">sim_fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
         </span><span class="n">abundance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">(),</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">measurement_sd</span><span class="p">))</span><span class="w">

</span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">bind_rows</span><span class="p">(</span><span class="n">alt_timecourses</span><span class="p">,</span><span class="w"> </span><span class="n">null_timecourses</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"contains signal"</span><span class="p">,</span><span class="w"> </span><span class="s2">"no signal"</span><span class="p">)))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">abundance</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">abundance</span><span class="p">[</span><span class="n">time</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">])</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span></code></pre></figure>

<h3 id="example-timecourses">Example timecourses</h3>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">example_tcs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">distinct</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">signal</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">sample_n</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">()))</span><span class="w">

</span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">example_tcs</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"signal"</span><span class="p">,</span><span class="w"> </span><span class="s2">"tc_id"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">label</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_path</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_fit</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">abundance</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_y"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Abundance"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_color_brewer</span><span class="p">(</span><span class="s2">"Example Timecourse"</span><span class="p">,</span><span class="w"> </span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set2"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"Simulated timecourses with and without signal"</span><span class="p">,</span><span class="w"> </span><span class="s2">"line: true values, points: observed values"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bottom"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/timecourse_examples-1.png" alt="plot of chunk timecourse_examples" /></p>

<h2 id="models-to-try">Models to try</h2>

<p>At this point, we want to fit a few flavors of time series models to each gene in order to determine how reliably each model can discriminate signal-containing from no-signal timecourses.</p>

<p>To make it easy to iterate over features, I like to use the nest() function from tidyr to store all the data for a feature in a single row. Here, expression data will be stored as a list of gene-level tables.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">nested_timecourses</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">nest</span><span class="p">(</span><span class="n">timecourse_data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">tc_id</span><span class="p">))</span><span class="w"> 

</span><span class="n">nested_timecourses</span></code></pre></figure>

<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 10,000 × 3
##    tc_id signal          timecourse_data 
##    &lt;int&gt; &lt;fct&gt;           &lt;list&gt;          
##  1     2 contains signal &lt;tibble [8 × 4]&gt;
##  2     9 contains signal &lt;tibble [8 × 4]&gt;
##  3    10 contains signal &lt;tibble [8 × 4]&gt;
##  4    12 contains signal &lt;tibble [8 × 4]&gt;
##  5    13 contains signal &lt;tibble [8 × 4]&gt;
##  6    14 contains signal &lt;tibble [8 × 4]&gt;
##  7    15 contains signal &lt;tibble [8 × 4]&gt;
##  8    16 contains signal &lt;tibble [8 × 4]&gt;
##  9    22 contains signal &lt;tibble [8 × 4]&gt;
## 10    23 contains signal &lt;tibble [8 × 4]&gt;
## # … with 9,990 more rows</code></pre></figure>

<p>Having nested one gene per row, we can apply multiple regression models to each gene; we’ll do this treating both the fold change and the original expression level as responses to evaluate the effect of time zero normalization. The regression models that we’ll try are:</p>

<ul>
  <li>linear effect of time on expression. Given our sigmoid and impulse generative process, we shouldn’t expect a linear relationship to work well, but it can serve as a nice baseline.</li>
  <li>cubic relationship between time and expression. This will fit a linear ($t$), a quadratic ($t^2$), and a cubic ($t^3$) term to allow for more complicated dynamics, such as genes that go up and then down again. One feature of cubic regression, and of polynomial regression models more generally, is that they are zero-centric. What I mean by this is that each additional term in a polynomial regression model, such as moving from a cubic to a quartic model, adds extra flexibility around zero. This can be helpful, but if changes occur at late timepoints, we may need a high-degree polynomial to capture them, and the cost will be a prediction which overfits the noise at earlier timepoints.</li>
  <li>predicting expression with a spline over timepoints using generalized additive models (GAMs). These models are similar to the cubic models but provide support evenly across time. This will allow them to detect late changes without requiring many degrees of freedom. GAMs are powerful models, but they can run into problems when changes occur rapidly, especially if we have relatively few timepoints.</li>
</ul>
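
<p>To illustrate what I mean by zero-centric, here’s a toy example (with hypothetical values, separate from the simulation above): a timecourse that only moves at the last two timepoints is poorly served by a cubic fit, whose flexibility is concentrated near time zero.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Toy example (hypothetical values): a step change at late timepoints
# fit with a cubic polynomial
late_change &lt;- tibble(
  time = c(0, 5, 10, 20, 30, 40, 60, 90),
  abundance = c(0, 0, 0, 0, 0, 0, 2, 2)
)

cubic_fit &lt;- lm(abundance ~ poly(time, degree = 3, raw = TRUE), data = late_change)

# the fitted values wiggle at early timepoints to accommodate the late jump
round(fitted(cubic_fit), 2)</code></pre></figure>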

<p>With these models, our main goal is to determine whether each timecourse contains dynamic signal, rather than to test the significance of individual parameters. Approaching the problem this way will also allow us to compare the different approaches even though they fit different numbers of parameters. To summarize each model’s account of the role of time, ANOVA can be used to determine how much variation is explained by time relative to the residual noise. Because cubic regression and GAMs fit more parameters, they must do a better job of explaining the temporal dynamics to justify their extra degrees of freedom.</p>
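
<p>As a minimal sketch of this null-versus-alternative comparison for a single timecourse, using the linear model as an example (tc_id 2 is just an illustrative id; the version applied to every gene appears further down):</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Sketch: null-versus-alternative ANOVA for one timecourse
one_tc &lt;- simulated_timecourses %&gt;% filter(tc_id == 2)

null_fit &lt;- lm(abundance ~ 1, data = one_tc)   # intercept-only null
alt_fit &lt;- lm(abundance ~ time, data = one_tc) # linear effect of time

# F-test: does time explain enough variance to justify its degree of freedom?
broom::tidy(anova(null_fit, alt_fit))</code></pre></figure>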

<p>There are many other approaches that we could try; a few of them are worth mentioning:</p>

<ul>
  <li>mgcv is an alternative implementation of GAMs to the gam package used below. It can decide how flexible a spline should be fit to each gene using cross-validation. For our synthetic dataset, mgcv actually fails for a number of features, owing to the complexity of our dynamics relative to the small number of timepoints.</li>
  <li>If we had replicates of timepoints, then we could fit a model which treats each timepoint as a categorical variable. This would allow us to detect dynamics without assuming particular patterns. The downside is that this would require us to collect twice as much data, or perhaps cut back on unique timepoints to provide repeated measures of the timepoints that we do have. In general, I think it’s better to have more unique timepoints represented, even without repeated measures, since this provides more even coverage of measurements over the time period we care about.</li>
  <li>Since time points are not evenly spaced, we could have tried transforming time when fitting the above models. While the timepoints are “exponentially” sampled, taking log(time) would send time zero to -Inf, so a better transformation would be to use the square root of time as the independent variable.</li>
</ul>
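
<p>As a sketch, the square-root transformation would only change the right-hand side of the model formulas, e.g. for the cubic model (again using tc_id 2 as an illustrative timecourse):</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># Sketch: cubic regression on sqrt(time) rather than time; unlike log(time),
# sqrt(0) is finite, so the time zero observation is retained
sqrt_cubic_fit &lt;- lm(
  abundance ~ poly(sqrt(time), degree = 3, raw = TRUE),
  data = simulated_timecourses %&gt;% filter(tc_id == 2)
)</code></pre></figure>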

<p>A couple of notes:</p>

<ul>
  <li>Since the time zero fold change must be zero by definition, I applied this as a constraint (this is the “+ 0” in the formulas below).</li>
  <li><em>future</em> was used to parallelize over genes; its settings were set up in the “environment setup” section.</li>
</ul>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">broom</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">furrr</span><span class="p">)</span><span class="w">
</span><span class="n">suppressPackageStartupMessages</span><span class="p">(</span><span class="n">library</span><span class="p">(</span><span class="n">gam</span><span class="p">))</span><span class="w">

</span><span class="n">fit_regression</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="w"> </span><span class="p">(</span><span class="n">one_tc</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">,</span><span class="w"> </span><span class="n">model_formula</span><span class="p">,</span><span class="w"> </span><span class="n">null_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">

  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">all.vars</span><span class="p">(</span><span class="n">model_formula</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"fold_change"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">one_tc</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">one_tc</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
      </span><span class="n">filter</span><span class="p">(</span><span class="n">time</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  
  </span><span class="n">alt_fit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">model_fxn</span><span class="p">,</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">one_tc</span><span class="p">,</span><span class="w"> </span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model_formula</span><span class="p">))</span><span class="w">
  
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">model_fxn</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">null_fit</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="n">model_fxn</span><span class="p">,</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">one_tc</span><span class="p">,</span><span class="w"> </span><span class="n">formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">null_formula</span><span class="p">))</span><span class="w">
    </span><span class="n">model_anova</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">anova</span><span class="p">(</span><span class="n">null_fit</span><span class="p">,</span><span class="w"> </span><span class="n">alt_fit</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">model_anova</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">alt_fit</span><span class="w">
  </span><span class="p">}</span><span class="w">
  
  </span><span class="n">model_anova</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">broom</span><span class="o">::</span><span class="n">tidy</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
    </span><span class="n">filter</span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">statistic</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">standard_models</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">nested_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">linear_abundance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">,</span><span class="w">
                                       </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">abundance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">time</span><span class="p">),</span><span class="w">
                                       </span><span class="n">null_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">abundance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">)),</span><span class="w">
         </span><span class="n">linear_foldchange</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">,</span><span class="w">
                                        </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">time</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
                                        </span><span class="n">null_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="p">)),</span><span class="w">
         </span><span class="n">cubic_abundance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">,</span><span class="w">
                                      </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">abundance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">degree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">raw</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)),</span><span class="w">
                                      </span><span class="n">null_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">abundance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">)),</span><span class="w">
         </span><span class="n">cubic_foldchange</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lm"</span><span class="p">,</span><span class="w">
                                       </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">degree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">raw</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
                                       </span><span class="n">null_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">0</span><span class="p">)),</span><span class="w">
         </span><span class="n">gam_abundance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">,</span><span class="w">
                                    </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">abundance</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">time</span><span class="p">))),</span><span class="w">
         </span><span class="n">gam_foldchange</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">future_map</span><span class="p">(</span><span class="n">timecourse_data</span><span class="p">,</span><span class="w"> </span><span class="n">fit_regression</span><span class="p">,</span><span class="w"> </span><span class="n">model_fxn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">,</span><span class="w">
                                     </span><span class="n">model_formula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.formula</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">time</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">)))</span></code></pre></figure>

<p>Each model × gene combination can be summarized by a single p-value. We expect the signal-containing timecourses to have relatively low p-values, while the no-signal timecourses’ p-values should be uniformly distributed between 0 and 1.</p>
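<p>That expectation is worth a quick sanity check. Here is a minimal standalone sketch (not part of the simulation above) fitting regressions to pure noise:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(42)

# fit a line to pure noise and keep the slope's p-value
null_pvalue &lt;- function(n_timepoints = 10) {
  noise &lt;- data.frame(time = seq_len(n_timepoints),
                      abundance = rnorm(n_timepoints))
  summary(lm(abundance ~ time, data = noise))$coefficients["time", "Pr(&gt;|t|)"]
}

null_pvalues &lt;- purrr::map_dbl(1:1000, ~ null_pvalue())
hist(null_pvalues, breaks = 25) # approximately flat between 0 and 1</code></pre></figure>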

<p>To correct for multiple tests, we can use the Storey q-value approach to control the false discovery rate (FDR). We will estimate q-values separately for each model and call results with q-values below 0.1 significant. At this cutoff we expect roughly 1/10 of the discoveries to come from the no-signal group.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fdr_control</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">pvalues</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">qvalue</span><span class="o">::</span><span class="n">qvalue</span><span class="p">(</span><span class="n">pvalues</span><span class="p">)</span><span class="o">$</span><span class="n">qvalues</span><span class="w"> 
</span><span class="p">}</span><span class="w">

</span><span class="n">all_model_fits</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">standard_models</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">timecourse_data</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">gather</span><span class="p">(</span><span class="n">model_type</span><span class="p">,</span><span class="w"> </span><span class="n">model_data</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">signal</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">unnest</span><span class="p">(</span><span class="n">model_data</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">model_type</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">qvalue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdr_control</span><span class="p">(</span><span class="n">p.value</span><span class="p">),</span><span class="w">
         </span><span class="n">discovery</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">qvalue</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="s2">"positive"</span><span class="p">,</span><span class="w"> </span><span class="s2">"negative"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">separate</span><span class="p">(</span><span class="n">model_type</span><span class="p">,</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"model"</span><span class="p">,</span><span class="w"> </span><span class="s2">"response"</span><span class="p">))</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">all_model_fits</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p.value</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">signal</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_grid</span><span class="p">(</span><span class="n">model</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">response</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">25</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/fdr_control-1.png" alt="plot of chunk fdr_control" /></p>

<p>Visually, a large number of no-signal timecourses have small p-values in the fold-change data. This suggests that something pathological is going on.</p>

<p>We can also summarize each model by the FDR it actually realized (we were aiming for 0.1), and by its recall of signal-containing timecourses at that cutoff.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">all_model_fits</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="n">discovery</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">correct</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"true negative"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"false positive"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"false negative"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"true positive"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="n">correct</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">correct</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">response</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">fdr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`false positive`</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">`false positive`</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">`true positive`</span><span class="p">),</span><span class="w">
         </span><span class="n">recall</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`true positive`</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">`false negative`</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">`true positive`</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span></code></pre></figure>

<table>
  <thead>
    <tr>
      <th style="text-align: left">model</th>
      <th style="text-align: left">response</th>
      <th style="text-align: right">false negative</th>
      <th style="text-align: right">false positive</th>
      <th style="text-align: right">true negative</th>
      <th style="text-align: right">true positive</th>
      <th style="text-align: right">fdr</th>
      <th style="text-align: right">recall</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">cubic</td>
      <td style="text-align: left">abundance</td>
      <td style="text-align: right">1912</td>
      <td style="text-align: right">10</td>
      <td style="text-align: right">7990</td>
      <td style="text-align: right">88</td>
      <td style="text-align: right">0.1020408</td>
      <td style="text-align: right">0.0440</td>
    </tr>
    <tr>
      <td style="text-align: left">gam</td>
      <td style="text-align: left">abundance</td>
      <td style="text-align: right">1118</td>
      <td style="text-align: right">128</td>
      <td style="text-align: right">7872</td>
      <td style="text-align: right">882</td>
      <td style="text-align: right">0.1267327</td>
      <td style="text-align: right">0.4410</td>
    </tr>
    <tr>
      <td style="text-align: left">linear</td>
      <td style="text-align: left">abundance</td>
      <td style="text-align: right">1532</td>
      <td style="text-align: right">55</td>
      <td style="text-align: right">7945</td>
      <td style="text-align: right">468</td>
      <td style="text-align: right">0.1051625</td>
      <td style="text-align: right">0.2340</td>
    </tr>
    <tr>
      <td style="text-align: left">cubic</td>
      <td style="text-align: left">foldchange</td>
      <td style="text-align: right">280</td>
      <td style="text-align: right">2928</td>
      <td style="text-align: right">5072</td>
      <td style="text-align: right">1720</td>
      <td style="text-align: right">0.6299484</td>
      <td style="text-align: right">0.8600</td>
    </tr>
    <tr>
      <td style="text-align: left">gam</td>
      <td style="text-align: left">foldchange</td>
      <td style="text-align: right">276</td>
      <td style="text-align: right">4151</td>
      <td style="text-align: right">3849</td>
      <td style="text-align: right">1724</td>
      <td style="text-align: right">0.7065532</td>
      <td style="text-align: right">0.8620</td>
    </tr>
    <tr>
      <td style="text-align: left">linear</td>
      <td style="text-align: left">foldchange</td>
      <td style="text-align: right">377</td>
      <td style="text-align: right">3463</td>
      <td style="text-align: right">4537</td>
      <td style="text-align: right">1623</td>
      <td style="text-align: right">0.6808887</td>
      <td style="text-align: right">0.8115</td>
    </tr>
  </tbody>
</table>

<p>In this summary we can see that working with abundances does accurately control the FDR, but recall is low for linear and cubic regression and only moderate for GAMs. Working with fold changes, in contrast, fails to control the FDR: while we intended for 1/10 of our discoveries to be null, around 60% actually are! Recall is fairly high, but most of the reported kinetic responses will be garbage.</p>
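<p>The fdr and recall columns are simple ratios of the confusion-matrix counts. For example, for the cubic fold-change model:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"># counts from the cubic / foldchange row of the table above
false_positive &lt;- 2928
true_positive  &lt;- 1720
false_negative &lt;- 280

false_positive / (false_positive + true_positive) # realized FDR: ~0.63
true_positive / (true_positive + false_negative)  # recall: 0.86</code></pre></figure>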

<p>To figure out what is going wrong, we can plot examples of false positives (the model thinks there is signal when there isn’t) and false negatives (the model doesn’t detect signal that is really there).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">extreme_false_negatives</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">all_model_fits</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">response</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"abundance"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cubic"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">arrange</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="n">qvalue</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">slice</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">()))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
   </span><span class="n">mutate</span><span class="p">(</span><span class="n">facet_label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="o">::</span><span class="n">glue</span><span class="p">(</span><span class="s2">"{response} {model} model false negatives"</span><span class="p">))</span><span class="w">

</span><span class="n">extreme_false_positives</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">all_model_fits</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">response</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"foldchange"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cubic"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gam"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">sample_n</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.character</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">()))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">facet_label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">glue</span><span class="o">::</span><span class="n">glue</span><span class="p">(</span><span class="s2">"{response} {model} model false positives"</span><span class="p">))</span><span class="w">

</span><span class="n">select_misclassifications</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">bind_rows</span><span class="p">(</span><span class="n">extreme_false_negatives</span><span class="p">,</span><span class="w"> 
                                       </span><span class="n">extreme_false_positives</span><span class="p">)</span><span class="w">

</span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">inner_join</span><span class="p">(</span><span class="n">select_misclassifications</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tc_id"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">label</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_path</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sim_fit</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">abundance</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_wrap</span><span class="p">(</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">facet_label</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_y"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Abundance"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_color_brewer</span><span class="p">(</span><span class="s2">"Example Timecourse"</span><span class="p">,</span><span class="w"> </span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set2"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ggtitle</span><span class="p">(</span><span class="s2">"Timecourses missed by GAM"</span><span class="p">,</span><span class="w"> </span><span class="s2">"line: true values, points: observed values"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bottom"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/unnamed-chunk-2-1.png" alt="plot of chunk unnamed-chunk-2" /></p>

<p>From this we can say a few things:</p>

<ul>
  <li>
    <p>When working with abundances we need to include an intercept term so the average value of a feature can be separated from its change over time. Doing this, however, can mask some early responses, since the fitted intercept becomes the point of reference rather than the value at time zero.</p>
  </li>
  <li>
    <p>Changes that show up primarily in one or two timepoints may be missed, since the polynomial and GAM models used above cannot contort themselves to fit these dynamics. This would be appropriate if those points were just noise, but in many cases they are large changes beyond what we would expect from the noise in our generative process.</p>
  </li>
  <li>
    <p>Working with fold changes enforces the value at time zero as the reference. This makes conceptual sense for a perturbation timecourse: at time zero (and before), the system is in a reference state, and all subsequent timepoints capture the dynamics of interest. However, working directly with fold changes creates a problem: we are no longer controlling the FDR. In this simulation, aiming for discoveries at a 10% FDR would in fact realize a 60% FDR. This is a big problem. Outside of a simulation we don’t know which timecourses contain real signal and which are spurious (if we did, why run the experiment?), so we would think there were strong signals in our dataset when it may in fact be entirely noise.</p>
  </li>
</ul>

<p>If we wanted to get around these issues, then we could still probably make a regression model work, but it would require adding more samples and cost to the analysis. The two main paths we could take are:</p>

<ul>
  <li>
    <p><em>replicates of each timecourse</em> - if we had multiple biological replicates at each timepoint, then rather than treating time as a numerical variable, we could treat it as a categorical variable. In this case we could fit an ANOVA model assessing whether the variation between timepoints is greater than the variation within timepoints. This would be a powerful way of detecting differences across time that is agnostic to the type of change occurring.</p>
  </li>
  <li>
    <p><em>denser sampling</em> - if we had more measurements near timepoints where rapid dynamics were occurring, then it would be easier to distinguish smooth, rapid responses from single outlier observations. This would still require fitting a model that can appropriately capture such dynamics, but with more observations we could either fit a more flexible model (with more degrees of freedom) or use the same simple models with more power to detect significant changes.</p>
  </li>
</ul>
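
<p>As a sketch of the first option: with hypothetical replicate data (the <code>replicated_tc</code> data frame below is invented for illustration), we could treat time as categorical and fit a one-way ANOVA in base R:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(42)
# hypothetical data: four biological replicates at each of four timepoints
replicated_tc &lt;- data.frame(
  time = rep(c(0, 5, 10, 20), each = 4),
  abundance = rnorm(16, mean = rep(c(0, 1, 1.5, 0.5), each = 4), sd = 0.5)
)

# with time as a categorical variable, ANOVA compares between-timepoint
# variation to within-timepoint variation
anova_fit &lt;- aov(abundance ~ factor(time), data = replicated_tc)
summary(anova_fit)</code></pre></figure>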

<p>In most cases, I think adopting one of these options is the smart way to go. However, there are situations where collecting more samples in a given experiment is not feasible, such as when MANY similar experiments are being performed, as in IDEA, or when there are constraints on how frequently samples can be collected.</p>

<p>In such cases, I think we can find a path forward by stepping away from regression and thinking about likelihood-based methods which capture the nature of fold changes.</p>

<h2 id="timecourse-fold-change-likelihood">Timecourse fold change likelihood</h2>

<p>Using likelihood methods, we start with a statistical model for how our observations were generated. We can then optimize the model’s parameters to find the values that maximize the likelihood (i.e., the frequentist MLE) or sample a distribution of parameters incorporating both the likelihood and parameter plausibility (i.e., the Bayesian approach).</p>
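
<p>As a toy example (unrelated to the timecourse data), the frequentist MLE of a Normal’s mean and standard deviation can be found by numerically minimizing the negative log-likelihood:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1234)
obs &lt;- rnorm(500, mean = 2, sd = 0.5)

# negative log-likelihood of iid Normal observations; the sd is
# parameterized on the log scale so that it stays positive
neg_log_lik &lt;- function(par) {
  -sum(dnorm(obs, mean = par[1], sd = exp(par[2]), log = TRUE))
}

mle &lt;- optim(c(0, 0), neg_log_lik)
mle$par[1]       # MLE of the mean, close to mean(obs)
exp(mle$par[2])  # MLE of the sd, close to sd(obs)</code></pre></figure>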

<p>Before we posit an appropriate likelihood for fold changes, let’s figure out why the regression approaches using fold changes were so anticonservative. The big problem was that many timecourses that were just noise looked like they actually contained signal. So, let’s work with just the “no signal” timecourses.</p>

<p>To do this, we can look at how the value at time zero influences fold-change estimates of later timepoints.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timecourse_spread</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">tzero_value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">abundance</span><span class="p">[</span><span class="n">time</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">])</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">tzero_value</span><span class="p">,</span><span class="w"> </span><span class="n">fold_change</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">fold_change</span><span class="p">)</span><span class="w">

</span><span class="n">ggplot</span><span class="p">(</span><span class="n">timecourse_spread</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`5`</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`10`</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tzero_value</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_color_gradient2</span><span class="p">(</span><span class="s1">'Time zero value'</span><span class="p">,</span><span class="w"> </span><span class="n">low</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"GREEN"</span><span class="p">,</span><span class="w"> </span><span class="n">high</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"RED"</span><span class="p">,</span><span class="w"> </span><span class="n">mid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"BLACK"</span><span class="p">,</span><span class="w"> </span><span class="n">midpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-2</span><span class="o">:</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">limits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="s2">"Fold change at 5 minutes"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="s2">"Fold change at 10 minutes"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_minimal</span><span class="p">()</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/corr_vs_tzero-1.png" alt="plot of chunk corr_vs_tzero" /></p>

<p>From this plot, we can see that a large positive value at time zero (due to noise) results in later timepoints appearing as consistently negative fold changes. Conversely, if the time-zero value is negative, then later timepoints appear consistently positive.</p>

<p>While we simulated abundances as independent Normal draws, which would possess a spherical covariance structure, normalizing to time zero has induced a correlation between observations, and this dependence must be accounted for to form a useful null hypothesis.</p>

<p>Because all subsequent timepoints are normalized with respect to the value at time zero, these later timepoints are biased: they are all higher or lower than otherwise expected. To test for timecourse-level signal based on the aggregate signal of all observations, we need to account for this dependence among observations. Luckily, its form is quite straightforward.</p>

<p>We can see this dependence using the sample covariance matrix of our null timecourses.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">cov</span><span class="p">(</span><span class="n">timecourse_spread</span><span class="p">[,</span><span class="m">3</span><span class="o">:</span><span class="n">ncol</span><span class="p">(</span><span class="n">timecourse_spread</span><span class="p">)])</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span></code></pre></figure>

<table>
  <thead>
    <tr>
      <th style="text-align: left"> </th>
      <th style="text-align: right">5</th>
      <th style="text-align: right">10</th>
      <th style="text-align: right">20</th>
      <th style="text-align: right">30</th>
      <th style="text-align: right">40</th>
      <th style="text-align: right">60</th>
      <th style="text-align: right">90</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">5</td>
      <td style="text-align: right">0.5141863</td>
      <td style="text-align: right">0.2514794</td>
      <td style="text-align: right">0.2583391</td>
      <td style="text-align: right">0.2533660</td>
      <td style="text-align: right">0.2557825</td>
      <td style="text-align: right">0.2539230</td>
      <td style="text-align: right">0.2559496</td>
    </tr>
    <tr>
      <td style="text-align: left">10</td>
      <td style="text-align: right">0.2514794</td>
      <td style="text-align: right">0.5076230</td>
      <td style="text-align: right">0.2605876</td>
      <td style="text-align: right">0.2536349</td>
      <td style="text-align: right">0.2504740</td>
      <td style="text-align: right">0.2512202</td>
      <td style="text-align: right">0.2540698</td>
    </tr>
    <tr>
      <td style="text-align: left">20</td>
      <td style="text-align: right">0.2583391</td>
      <td style="text-align: right">0.2605876</td>
      <td style="text-align: right">0.5131240</td>
      <td style="text-align: right">0.2505837</td>
      <td style="text-align: right">0.2562055</td>
      <td style="text-align: right">0.2559705</td>
      <td style="text-align: right">0.2547911</td>
    </tr>
    <tr>
      <td style="text-align: left">30</td>
      <td style="text-align: right">0.2533660</td>
      <td style="text-align: right">0.2536349</td>
      <td style="text-align: right">0.2505837</td>
      <td style="text-align: right">0.4973020</td>
      <td style="text-align: right">0.2534935</td>
      <td style="text-align: right">0.2517957</td>
      <td style="text-align: right">0.2526950</td>
    </tr>
    <tr>
      <td style="text-align: left">40</td>
      <td style="text-align: right">0.2557825</td>
      <td style="text-align: right">0.2504740</td>
      <td style="text-align: right">0.2562055</td>
      <td style="text-align: right">0.2534935</td>
      <td style="text-align: right">0.4982398</td>
      <td style="text-align: right">0.2498487</td>
      <td style="text-align: right">0.2531005</td>
    </tr>
    <tr>
      <td style="text-align: left">60</td>
      <td style="text-align: right">0.2539230</td>
      <td style="text-align: right">0.2512202</td>
      <td style="text-align: right">0.2559705</td>
      <td style="text-align: right">0.2517957</td>
      <td style="text-align: right">0.2498487</td>
      <td style="text-align: right">0.4986771</td>
      <td style="text-align: right">0.2508796</td>
    </tr>
    <tr>
      <td style="text-align: left">90</td>
      <td style="text-align: right">0.2559496</td>
      <td style="text-align: right">0.2540698</td>
      <td style="text-align: right">0.2547911</td>
      <td style="text-align: right">0.2526950</td>
      <td style="text-align: right">0.2531005</td>
      <td style="text-align: right">0.2508796</td>
      <td style="text-align: right">0.5066272</td>
    </tr>
  </tbody>
</table>

<p>Observation variances are approximately $2\text{Var}(x_t)$ ($2 \times 0.25 = 0.5$) because:</p>

\[\mathcal{N}(\mu_{A}, \sigma^{2}_{A}) - \mathcal{N}(\mu_{B}, \sigma^{2}_{B}) = \mathcal{N}(\mu_{A} - \mu_{B}, \sigma^{2}_{A} + \sigma^{2}_{B})\]

<p>Observation covariances are approximately $\text{Var}(x_t)$ ($\approx 0.25$) because the shared normalization to time zero adds the variance of the time-zero measurement as a covariance between all later timepoints.</p>

<p>Normalization of Normal (or log-Normal) observations to a common reference produces a Multivariate Gaussian distribution.</p>

\[\mathbf{f}_{i} \sim \mathcal{MN}\left(\mu = \mathbf{0}, \Sigma = 
\begin{bmatrix}
    2\sigma_{\epsilon}^2 &amp; \sigma_{\epsilon}^2 &amp; \dots  &amp; \sigma_{\epsilon}^2 \\
    \sigma_{\epsilon}^2 &amp; 2\sigma_{\epsilon}^2 &amp; \dots  &amp; \sigma_{\epsilon}^2 \\
    \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
    \sigma_{\epsilon}^2 &amp; \sigma_{\epsilon}^2 &amp; \dots  &amp; 2\sigma_{\epsilon}^2
\end{bmatrix}\right)\]
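
<p>We can sanity-check this covariance structure by simulation. This sketch assumes the same generative process as above, with Normal noise of standard deviation $\sigma_{\epsilon} = 0.5$ (so $\sigma_{\epsilon}^{2} = 0.25$):</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(123)
sigma_eps &lt;- 0.5
n_tc &lt;- 10000
n_times &lt;- 8  # time zero plus seven later timepoints

# independent Normal noise for each timecourse, then normalize to time zero
x &lt;- matrix(rnorm(n_tc * n_times, sd = sigma_eps), nrow = n_tc)
fold_changes &lt;- x[, -1] - x[, 1]

emp_cov &lt;- cov(fold_changes)
# diagonals approach 2 * sigma_eps^2 = 0.5; off-diagonals approach
# sigma_eps^2 = 0.25, matching the matrix above
round(emp_cov[1:3, 1:3], 2)</code></pre></figure>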

<p>If we expect null fold-change measurements to follow this distribution, then we can sample from it to explore whether it works as a generative process for fold changes, and then assess whether these draws are likely to have come from the distribution under the null hypothesis. For this purpose, we’ll use the Mahalanobis distance, a multivariate generalization of the Wald test that assesses how many standard deviations an observation lies from the mean of the distribution. This requires an estimate of the covariance matrix, an assumption that will be discussed below. Under the null, these statistics should be $\chi^{2}$ distributed with degrees of freedom equal to the number of fold changes (one fewer than the number of timepoints).</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timecourse_covariance</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">measurement_sd</span><span class="o">^</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">timepts</span><span class="p">)</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">timepts</span><span class="p">)</span><span class="m">-1</span><span class="p">)</span><span class="w">
</span><span class="n">diag</span><span class="p">(</span><span class="n">timecourse_covariance</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">measurement_sd</span><span class="o">^</span><span class="m">2</span><span class="w">

</span><span class="n">n_fold_changes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">timecourse_covariance</span><span class="p">)</span><span class="w">

</span><span class="c1"># simulate draws from multivariate normal</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">mvtnorm</span><span class="p">)</span><span class="w">
</span><span class="n">r_multivariate_normal</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rmvt</span><span class="p">(</span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timecourse_covariance</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">

</span><span class="n">r_multivariate_mahalanobis_dist</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mahalanobis</span><span class="p">(</span><span class="n">r_multivariate_normal</span><span class="p">,</span><span class="w">
                                               </span><span class="n">center</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">),</span><span class="w">
                                               </span><span class="n">cov</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timecourse_covariance</span><span class="p">,</span><span class="w">
                                               </span><span class="n">inverted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">

</span><span class="c1"># test multivariate normality</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="n">pchisq</span><span class="p">(</span><span class="n">r_multivariate_mahalanobis_dist</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"p-values for MN fold-change generative process"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/null_mahalanobis-1.png" alt="plot of chunk null_mahalanobis" /></p>

<p>Having taken draws from the Multivariate Gaussian distribution and used the Mahalanobis distance to calculate p-values, we see that these p-values are $\text{Unif}(0,1)$ distributed, confirming that the Mahalanobis distance is an appropriate test statistic.</p>
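
<p>Beyond eyeballing the histogram, uniformity can be checked formally with a Kolmogorov–Smirnov test. This self-contained sketch redraws from the same null (again assuming $\sigma_{\epsilon}^{2} = 0.25$ and seven fold changes) using only base R:</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(42)
sigma2 &lt;- 0.25
Sigma &lt;- matrix(sigma2, nrow = 7, ncol = 7)
diag(Sigma) &lt;- 2 * sigma2

# multivariate Normal draws via the Cholesky factor of Sigma
z &lt;- matrix(rnorm(5000 * 7), ncol = 7)
draws &lt;- z %*% chol(Sigma)

d2 &lt;- mahalanobis(draws, center = rep(0, 7), cov = Sigma)
null_pvalues &lt;- pchisq(d2, df = 7, lower.tail = FALSE)
ks.test(null_pvalues, "punif")  # a large p-value is consistent with Unif(0,1)</code></pre></figure>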

<p>We can now test whether fold-changes are really Multivariate Gaussian distributed by inspecting the distribution of Mahalanobis distance p-values from the no-signal timecourses.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># test timecourse samples for multivariate normality</span><span class="w">
</span><span class="n">time_course_mahalanobis_dist</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mahalanobis</span><span class="p">(</span><span class="n">timecourse_spread</span><span class="p">[,</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">)],</span><span class="w"> </span><span class="n">center</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">),</span><span class="w"> </span><span class="n">cov</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timecourse_covariance</span><span class="p">)</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="n">pchisq</span><span class="p">(</span><span class="n">time_course_mahalanobis_dist</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"p-values for no-signal timecourses using MN fold-change model"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/unnamed-chunk-4-1.png" alt="plot of chunk unnamed-chunk-4" /></p>

<p>The p-values for the no-signal fold change timecourses are indeed $\text{Unif}(0,1)$ distributed as we hoped.</p>

<p>Now, we can calculate the Mahalanobis distances and their corresponding p-values for the signal-containing timecourses as well. Signal in these timecourses will both increase the overall variance in a feature’s expression and induce similar deviations at nearby timepoints. These factors will make it harder for the Multivariate Gaussian noise model to explain a signal-containing expression vector, resulting in a large Mahalanobis distance and a small p-value.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timecourse_mvn</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">group_by</span><span class="p">(</span><span class="n">tc_id</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">tzero_value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">abundance</span><span class="p">[</span><span class="n">time</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">])</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ungroup</span><span class="p">()</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">filter</span><span class="p">(</span><span class="n">fold_change</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">tzero_value</span><span class="p">,</span><span class="w"> </span><span class="n">fold_change</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">fold_change</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">mahalanobis_dist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mahalanobis</span><span class="p">(</span><span class="n">.</span><span class="p">[,</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">)],</span><span class="w">
                                        </span><span class="n">center</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">),</span><span class="w">
                                        </span><span class="n">cov</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timecourse_covariance</span><span class="p">),</span><span class="w">
         </span><span class="n">pvalue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pchisq</span><span class="p">(</span><span class="n">mahalanobis_dist</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_fold_changes</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w">
         </span><span class="n">qvalue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdr_control</span><span class="p">(</span><span class="n">pvalue</span><span class="p">),</span><span class="w">
         </span><span class="n">discovery</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">qvalue</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="s2">"positive"</span><span class="p">,</span><span class="w"> </span><span class="s2">"negative"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">left_join</span><span class="p">(</span><span class="n">simulated_timecourses</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
              </span><span class="n">distinct</span><span class="p">(</span><span class="n">tc_id</span><span class="p">,</span><span class="w"> </span><span class="n">signal</span><span class="p">),</span><span class="w">
            </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tc_id"</span><span class="p">)</span><span class="w">

</span><span class="n">timecourse_mvn</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pvalue</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">signal</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_histogram</span><span class="p">(</span><span class="n">bins</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">scale_fill_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">)</span></code></pre></figure>

<p><img src="/figure/source/2022-05-08-time_zero_normalization/signal_mahalanobis-1.png" alt="plot of chunk signal_mahalanobis" /></p>

<p>Based on the p-value distributions, most of the signal-containing timecourses have small p-values, suggesting improved recall. We can verify this, as before, by summarizing results in terms of the realized FDR and the overall recall of signal-containing timecourses at this FDR cutoff.</p>

<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timecourse_mvn</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">count</span><span class="p">(</span><span class="n">signal</span><span class="p">,</span><span class="w"> </span><span class="n">discovery</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">correct</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"true negative"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"no signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"false positive"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"negative"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"false negative"</span><span class="p">,</span><span class="w">
                             </span><span class="n">signal</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"contains signal"</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">discovery</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"positive"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"true positive"</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">select</span><span class="p">(</span><span class="n">correct</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">spread</span><span class="p">(</span><span class="n">correct</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">fdr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`false positive`</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">`false positive`</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">`true positive`</span><span class="p">),</span><span class="w">
         </span><span class="n">recall</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">`true positive`</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">`false negative`</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">`true positive`</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">()</span></code></pre></figure>

<table>
  <thead>
    <tr>
      <th style="text-align: right">false negative</th>
      <th style="text-align: right">false positive</th>
      <th style="text-align: right">true negative</th>
      <th style="text-align: right">true positive</th>
      <th style="text-align: right">fdr</th>
      <th style="text-align: right">recall</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">201</td>
      <td style="text-align: right">194</td>
      <td style="text-align: right">7806</td>
      <td style="text-align: right">1799</td>
      <td style="text-align: right">0.0973407</td>
      <td style="text-align: right">0.8995</td>
    </tr>
  </tbody>
</table>

<p>The Multivariate Gaussian test did a great job of identifying signal-containing timecourses. The high power of this test comes from using estimates of the noise level of individual observations, which lets us step away from assumptions regarding the types of time-dependent signals we expect.</p>
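<p>As a minimal sketch of the idea (the noise level and timecourse here are made-up values, not the simulation used above): time zero normalization induces a known covariance among the normalized values, and under a pure-noise null the squared Mahalanobis distance follows a chi-squared distribution.</p>

```r
# Minimal sketch (hypothetical values): test one time-zero-normalized
# timecourse against a pure-noise null via Mahalanobis distance.
set.seed(42)
n_tp   <- 9      # timepoints after time zero (assumed)
sigma2 <- 0.25   # assumed known per-observation noise variance

# Simulate a flat (no-signal) profile and normalize to time zero
raw        <- rnorm(n_tp + 1, mean = 5, sd = sqrt(sigma2))
timecourse <- raw[-1] - raw[1]

# Time zero normalization correlates the timepoints:
# Var(x_t - x_0) = 2*sigma2; Cov(x_s - x_0, x_t - x_0) = sigma2
null_cov <- matrix(sigma2, n_tp, n_tp) + diag(sigma2, n_tp)

# Under the null, the squared Mahalanobis distance is chi-squared
# distributed with n_tp degrees of freedom
D2      <- mahalanobis(timecourse, center = rep(0, n_tp), cov = null_cov)
p_value <- pchisq(D2, df = n_tp, lower.tail = FALSE)
```

<p>In a real screen, one such p-value per timecourse would then feed into multiple-testing correction before calling discoveries.</p>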

<p>The test uses not just a pattern of expression but also information about the magnitude of noise associated with each observation. This information is available in many genomics contexts: observation-level estimates of noise can come from (1) the mean-variance relationships of <a href="https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf">RNAseq data</a> or (2) the consistency of peptides in <a href="https://pubs.acs.org/doi/abs/10.1021%2Facs.jproteome.7b00699">proteomics data</a>. Having these estimates lets us identify large fold-changes that are unlikely to occur by chance, even if all post-time-zero timepoints are similar. Similarly, complex rapid dynamics will look like a poorly fit regression model with high residual error; but if we know the magnitude of residual error to expect, then signals can be identified simply by looking for an excess of variation in the timecourse (accounting for the bias introduced by time zero normalization). Using noise estimates may seem like cheating, since this information was not directly used by the other tests. If we were to use an estimate of the noise level in those tests, it would be to carry out a weighted least squares regression. But, since all observations here have the same level of added noise, weighted regressions would be equivalent to the unweighted regressions used above.</p>

<p>Using Mahalanobis distance, we can look for any departures from the null noise model to define signal. This lets us step away from models that look for particular types of signals, such as the linear, cubic, or smooth relationships sought in the regression models we applied. Some of these models can be quite flexible, but when the underlying data do not follow these relationships, they will often fail. Here, we simulated signals as biologically feasible sigmoidal or impulse responses, so none of the regression models applied could capture every instance of a simulated signal-containing timecourse. If we were to use a regression model in this case, we would be best off fitting a non-linear least squares model following the sigmoidal or impulse form. Doing this is actually non-trivial if we want to avoid pathological fits, but these issues have been addressed in the <a href="https://github.com/calico/impulse">impulse</a> R package. I’ll discuss this problem in a future post.</p>]]></content><author><name>Sean Hackett</name></author><category term="statistics" /><category term="idea" /><category term="dynamics" /><summary type="html"><![CDATA[A powerful approach for identifying signals in genomic time series]]></summary></entry></feed>