Notes on building Modus: an agentic framework for pharmacometrics

May 12, 2026

By now you've used some kind of agent: an LLM in a loop with tools. The concept is simple. The loop, called a harness, is what gives LLMs digital arms. State-of-the-art harnesses like Claude Code and Codex are some of the most capable software I've used. There are still limitations, though, and a kind of jaggedness in performance: topics well covered in training land at PhD level, while out-of-sample work falls off a cliff. That picture gets more nuanced once you look closely at knowledge work.

Even a genius needs time to onboard at a new job. The idea has been described as a context graph (AI's trillion-dollar opportunity): a network of knowledge that an agent can traverse to condense months or years of onboarding and company-specific guidelines into a single ramp-up. That's great and all. But what does this actually look like?

It turns out that for pharmacometrics, and many other fields, you don't need a fancy graph database. Neo4j is great if you have the relationships to justify it. Most of us don't, and the graph structure we do end up with fits in a json file just fine. And the development process is organic.

Where Modus came from

Let's say you're doing exploratory data analysis on a Phase I dataset with your agent or copilot. A single ascending dose study, four cohorts, a couple of dozen subjects, some BLQ data, the usual. You ask for the EDA. The agent produces something plausible. Concentration profiles on a linear scale (you always log scale these). BLQ values imputed as zero (you treat them as missing at this stage). Dose proportionality assessed against the wrong reference dose. You correct it. Next session, you correct the same three things in the first ten minutes. These are the rules of the road for this kind of analysis, and right now they live only in your head.

Then the agent flags subject 7 as a potential outlier. But subject 7 was already excluded from the dataset during construction because of a known protocol deviation, which the previous task's output noted clearly. The agent doing EDA never sees what the agent doing data_construction decided. You paste in the construction output for additional context and the flag goes away. That kind of handoff from one task's findings to the next has to be carried by the system as state, not pasted in by you.

Run the same task three times and the outputs disagree. Three different outlier criteria, three different summary statistics, three different response models. The agent is defaulting to arbitrary conventions you never specified. The reason you never wrote them down is that some of your conventions don't fit cleanly into a one-line rule. How you handle model selection is a small decision tree that depends on the shape of the data, hard to compress into a single prompt or rule. So instead you point the agent at eda_template.qmd from a previous study and drop a one-liner in examples/index.md describing when to reach for it. Examples and a simple index let the system carry rules that are too detailed or branchy for inline prompting.
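
Concretely, the index can stay tiny. Here's a sketch of what an examples/index.md entry might look like; the description is invented for illustration:

eda_template.qmd — exploratory analysis for a Phase I single ascending
dose study. Reach for this when the dataset has multiple dose cohorts
and BLQ records; covers log-scale profiles, dose proportionality, and
the outlier workflow.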

The reports are now consistent in what they decide but they ramble in how they explain it. You sketch the structure your team uses for reporting and the agent rewrites accordingly. The output now has a contract.

Before you sign off, you want to QC the results. You give the agent a checklist of things to verify: all dose cohorts covered, BLQ handling stated explicitly, every flagged outlier with a recorded rationale. It catches that subject 14 was flagged but the rationale was never written into outlier_log.md. That kind of mechanical verification is easy to bake into the task. Then, for final scientific judgment, you want a second pass from a fresh agent with no investment in defending the work.

I went through some version of this over and over until I codified it. The session becomes a task entry in a single json file, a library. The rules go into a rules field. The handoff becomes depends_on, with each task writing a concise summary into a shared progress.md that the next agent reads on startup. The more complex rules live as examples retrieved from a simple index. The output structure becomes produces. Self-checking is driven by the verify field, with a second task for independent review. The agent commits automatically when a task finishes, so the git log doubles as a human-facing audit trail.
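
To make the handoff concrete, here's the kind of entry a task might append to progress.md; the content is illustrative, echoing the subject 7 story above:

## data_construction — passed
Built analysis_dataset.csv (single ascending dose, 4 cohorts).
Excluded subject 7 for a known protocol deviation; rationale
recorded in construction_log.md. No duplicate ID/time combinations.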

Modus grows organically, one friction point at a time.

What's in it

[Diagram of the Modus framework]

Modus is a framework built around a task library: a json file. The fields capture rules and structure that are obvious to a seasoned pharmacometrician but befuddle even state-of-the-art foundation models. The json was inspired by this fantastic Anthropic post on long-running harnesses, and I've added only the bare minimum to make the system work.

Three principles run through it:

  1. Don't write anything an LLM can easily infer on its own.
  2. Project state is managed by files.
  3. Knowledge compounds within the task library across projects and analysts.

A small library with three tasks looks like this:

[
  {
    "task_id": "data_construction",
    "description": "Build the analysis-ready dataset from source PK and dosing files",
    "rules": [
      "NONMEM-ready format with EVID, MDV, AMT, RATE columns",
      "Time variable in hours from first dose",
      "Flag and document any record exclusions in the construction log"
    ],
    "depends_on": [],
    "produces": ["analysis_dataset.csv", "construction_log.md"],
    "verify": [
      "row count matches source after exclusions are reconciled",
      "no duplicate ID/time combinations"
    ],
    "passes": false
  },
  {
    "task_id": "eda",
    "description": "Initial exploratory analysis of the PK dataset",
    "rules": [
      "Concentration plots on log scale",
      "BLQ as NA at this stage; M-method handling deferred to modeling",
      "Dose proportionality with power model and 90% CI"
    ],
    "depends_on": ["data_construction"],
    "produces": ["eda_report.md", "profile_plots/", "outlier_log.md"],
    "verify": [
      "all dose cohorts covered, BLQ handling stated explicitly",
      "every flagged outlier has a recorded rationale in outlier_log.md"
    ],
    "passes": false
  },
  {
    "task_id": "eda_review",
    "description": "Independent scientific review of the EDA output",
    "rules": [
      "Read the eda outputs without re-running the analysis",
      "Check that each conclusion is supported by the figures and tables",
      "Flag pharmacometric judgment calls that warrant discussion"
    ],
    "depends_on": ["eda"],
    "produces": ["eda_review.md"],
    "verify": [
      "every claim in eda_report.md is traced to a specific figure or table",
      "review records disagreements with rationale, not a blanket sign-off"
    ],
    "passes": false
  }
]

On disk:

your-project/
├── data/               # your data
├── tasks.json          # the task library
├── prompt.md           # core agent prompt
├── run.sh              # outer loop
├── examples/
│   ├── index.md
│   └── eda_template.qmd
└── workspace/
    ├── progress.md     # shared across tasks
    ├── data_construction/
    ├── eda/
    └── eda_review/

Each fresh agent runs through this short routine (prompt.md):

On each invocation:

1. Read tasks.json. Pick the first task in file order where
   passes is false and every depends_on task has passes: true
   (selection sketched after this list).
2. Read workspace/progress.md and upstream outputs in workspace/{dep}/.
3. Load examples from examples/index.md if your task names one.
4. Execute the task, following the rules exactly, writing outputs
   to workspace/{task_id}/ with the names in produces.
5. Check every criterion in the verify field. If any fails, the task fails.
6. Append a concise summary to workspace/progress.md.
7. Set passes: true in tasks.json and git commit.
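
Step 1 is the only piece with real logic. For the curious, here's a minimal sketch of that selection in jq, assuming the tasks.json schema above:

# first task in file order where passes is false and every dependency passes
next_task=$(jq -r '. as $tasks
  | [ .[]
      | select(.passes | not)
      | select(all(.depends_on[]; . as $dep
          | any($tasks[]; .task_id == $dep and .passes))) ]
  | .[0].task_id // empty' tasks.json)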

The outer loop is even simpler (run.sh):

loop up to MAX_ITERATIONS:
    if every task in tasks.json has passes: true: exit success
    spawn a fresh agent with prompt.md

exit failure

That "spawn a fresh agent" line is just invoking your chosen agent harness with prompt.md. Each invocation is just the CLI agent's non-interactive mode: claude -p for Claude Code, codex exec for Codex. Both take a prompt, run once, exit (I also recommend piping the output to a log for tracing). Both drop in without changing anything else, and the script is just a wrapper that picks the next task and hands its prompt to whichever agent you've installed.

This is similar to having an agent fire off a subagent, except the prompt and I/O contract live in the task entry rather than being implicit in the call. Each task also runs in a clean, scoped context window, which matters because agent performance erodes as context grows.

How this maps to what you already use

A lot of agentic patterns are converging on the same primitives, just packaged differently across tools.

| Concept | Where you've seen it | In Modus |
| --- | --- | --- |
| Always-on prompt | CLAUDE.md, AGENTS.md | Core agent prompt |
| Domain knowledge | Skills, resources, instructions | rules field and examples/ folder |
| Specialized agent | Custom subagents | A task with its own rules and I/O contract |
| I/O contract | Implicit in the prompt | produces field, made explicit |
| State management | Memory features, chat history | Shared progress.md and git audit trail |
| Verification | Eval scripts, code review | verify field, run by the agent and a fresh one |

Skills, subagents, and instructions are essentially interchangeable. Modus gives you a single composable frame for assembling the primitives you already use: it sits as a layer over whichever harness you're running rather than replacing it, and the agent and tooling in the middle stay the same.

Why the editable layer matters

The crucial thing is this: all an expert pharmacometrician needs to look at and edit is the task library. That's the surface where the work lives, and it grows organically as analyses run.

But the deeper reason for putting the task library front and center is consensus. When how the work is done lives in one editable file, expert alignment stops being locked in silos and becomes a diff. Reviewers can challenge a rule, propose a different one, and the change is visible to everyone the next time the task runs. The library is what compounds across runs, projects, and people, because it's the only artifact that holds their judgment in a form they can read, debate, and edit together.

Most systems that "remember" start blank for the next project. CLAUDE.md, AGENTS.md, and Cursor rules all give each repo its own notebook to fill in from scratch. That's useful, but it isn't how expertise actually accumulates. The task library is meant to be the institution. Individual projects draw from it and feed back into it. A correction earned on one study is sitting in the task definition when the next one starts.

Where this is going, and how to join

The formal evaluation is in a paper with co-authors that's under review. The collective work isn't, and that's what I'd like help with: task libraries worth sharing, benchmarks worth running, patterns that hold up across organizations.

The ISoP AI/ML SIG agentic subgroup I co-chair is where that's happening. If any of that sounds like work you want to do, come find us.

What's your Modus?


Views expressed are my own and do not represent my employer.