Causal Abstraction for Mechanistic Interpretability

A framework for mechanistic interpretability — reverse-engineering the algorithms language models use internally using causal abstraction.

You write a high-level causal model describing how you think an LM solves a task, then run experiments to test whether the LM's internal components actually implement that algorithm.

Quick Start

Causalab is built to be driven by a coding agent (e.g. Claude Code). The fast path:

  1. Clone and install:
    git clone https://github.com/goodfire-ai/causalab.git
    cd causalab
    uv sync
  2. Open the directory in your coding agent of choice.
  3. Describe what you want to do — e.g. "walk me through the codebase", "set up a new task from this spec", "run the weekdays pipeline". The agent routes the request through the matching skill (/getting-started, /setup-task, /run-experiment, …).

Prefer to look around first? Run the end-to-end weekdays pipeline (Llama-3.1-8B, ≥24 GB VRAM):

./scripts/run_exp.sh weekdays_8b_pipeline          # inline
./scripts/run_exp.sh --slurm weekdays_8b_pipeline  # sbatch

Or open demos/weekdays_geometry.ipynb for the same pipeline rendered as a notebook.

Working with a coding agent

The full workflow is skill-driven. Each skill is a focused entry point — invoke it by name (/<skill>), or describe your goal and let the agent route to it.

Investigating a task — /setup-task

Use /setup-task to explore, understand, or create a task. Whether you want to inspect an existing task's causal model, browse its counterfactuals, or build a new task from scratch, this is the entry point.

/setup-task                     # interactive
/setup-task path/to/spec.md     # from a spec file
/setup-task path/to/paper.pdf   # from a paper PDF

Running experiments — /plan-experiment then /run-experiment

/plan-experiment crystallizes a research objective into RESEARCH_OBJECTIVE.md + PLAN.md (analysis DAG, sweep strategy, expected artifacts). /run-experiment then materializes the runner config(s) and executes the pipeline. /interpret-experiment is auto-invoked at the end and writes a single result/REPORT.md grounded in the plan.

/plan-experiment
/run-experiment

All available skills

Command What it does
/research-session Bootstrap a session directory at the start of a research workflow
/development-session Load engineering context at the start of codebase work
/getting-started Onboarding walkthrough
/setup-task Create, explore, or investigate a task
/plan-experiment Crystallize a research objective into RESEARCH_OBJECTIVE.md + PLAN.md
/run-experiment Materialize runner configs from the plan and execute
/interpret-experiment Auto-invoked after /run-experiment — writes result/REPORT.md
/replicate-paper Reproduce results from a research paper
/document-issues Document failures, confusions, and workarounds

Core Concepts

Causal Models

A causal model is your hypothesis about how the LM solves a task. It consists of:

  • Variables: concepts that might be represented in the network (e.g., "subject name", "indirect object")
  • Values: possible assignments to each variable
  • Parent–Child Relationships: directed dependencies
  • Mechanisms: functions that compute a variable's value from its parents' values
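
A minimal sketch of these pieces in plain Python (illustrative only, not the causalab API; demos/causal_model_demo.ipynb walks through the real interface). The task is a toy weekday-arithmetic problem:

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Illustrative sketch, NOT the causalab interface: each variable has
# values, parents, and (if it has parents) a mechanism.
variables = {
    "start_day": {"values": DAYS, "parents": []},
    "offset": {"values": list(range(1, 7)), "parents": []},
    "answer_day": {
        "values": DAYS,
        "parents": ["start_day", "offset"],
        # Mechanism: compute the child's value from its parents' values.
        "mechanism": lambda start, off: DAYS[(DAYS.index(start) + off) % 7],
    },
}

setting = {"start_day": "Tue", "offset": 3}
answer = variables["answer_day"]["mechanism"](setting["start_day"], setting["offset"])
assert answer == "Fri"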

Causal Abstraction

Mechanistic interpretability aims to reverse-engineer the algorithm a network implements. Causal abstraction makes this precise: both the algorithm and the network are causal models, and "implementation" is the abstraction relation between them. The algorithm is the high-level model, the network is the low-level model, and when the high-level mechanisms are accurate simplifications of the low-level mechanisms, the algorithm is a causal abstraction of the network.
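
A toy illustration of that relation, again in plain Python rather than causalab code: abstraction holds when mapping a low-level state up and applying the high-level mechanism gives the same answer as running the low-level mechanisms and then mapping up.

# Toy illustration (plain Python, not causalab code). Low-level model:
# two intermediate values h1, h2 feed the output. High-level model: a
# single variable S = h1 + h2 determines the output directly.
def low_level_mechanisms(x):
    h1, h2 = x + 3, 2 * x           # low-level intermediate variables
    return (h1, h2), h1 + h2        # low-level state and its output

def tau(state):
    h1, h2 = state
    return h1 + h2                  # abstraction map: low-level state -> S

def high_level_mechanism(s):
    return s                        # high-level output mechanism

# The two routes to the output agree, so the high-level model is an
# accurate simplification of the low-level one on these inputs.
for x in range(10):
    state, y = low_level_mechanisms(x)
    assert high_level_mechanism(tau(state)) == y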

Interchange Interventions

Interchange interventions test whether a high-level variable aligns with specific features in the LM. The intervention replaces activations from one input with activations from a counterfactual input, isolating one causal pathway at a time.
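
Schematically, the mechanics look like the raw PyTorch sketch below. causalab drives interventions through pyvene; the module path, layer, and token position here are illustrative assumptions for a Llama-style Hugging Face model, not causalab code.

import torch

@torch.no_grad()
def interchange(model, base_ids, counterfactual_ids, layer, pos):
    cached = {}

    def save(module, args, output):
        # Cache the counterfactual run's activation at token position `pos`.
        cached["h"] = output[0][:, pos].clone()

    def patch(module, args, output):
        # Overwrite the base run's activation at `pos` with the cached one.
        output[0][:, pos] = cached["h"]
        return output

    block = model.model.layers[layer]      # module path is an assumption
    handle = block.register_forward_hook(save)
    model(counterfactual_ids)
    handle.remove()

    handle = block.register_forward_hook(patch)
    logits = model(base_ids).logits
    handle.remove()
    return logits                          # compare against unpatched logits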

Method-level techniques for constructing the feature space being intervened on — DAS, DBM, PCA, Boundless DAS, SAE — live in causalab/methods/ and are selected as options inside analyses (e.g. subspace.method: das, locate.method: interchange).

Available Analyses

The runner is built around eight named analyses. Each answers a specific research question and may consume artifacts from earlier analyses. Chain them in a single run by listing multiple - /analysis/<name> entries in a runner config's defaults: block.

Analysis Research question Depends on
baseline Can the model solve the task? Are counterfactual generators well-formed?
locate Which (layer, token_position) encodes each causal variable? baseline
subspace What k-dimensional subspace captures the variable's representation? locate
activation_manifold What is the geometric structure of activations as the variable varies? subspace
output_manifold What is the geometry of output distributions on the probability simplex? baseline
path_steering Does the subspace/manifold faithfully preserve causal structure? subspace, activation_manifold
pullback What activation trajectories realize prescribed belief-space paths? activation_manifold, output_manifold
attention_pattern Which attention heads attend to which token types?

Each analysis is configured by a Hydra YAML at causalab/configs/analysis/<name>.yaml and invoked through a runner config under causalab/configs/runners/<group>/<name>.yaml.
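
For example, a runner config that chains the first three analyses might contain a defaults block like this (the file name is hypothetical):

# causalab/configs/runners/<group>/my_pipeline.yaml  (hypothetical name)
defaults:
  - /analysis/baseline
  - /analysis/locate
  - /analysis/subspace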

Repository Structure

The codebase follows a strict layering. See ARCHITECTURE.md for the full breakdown, layering invariants, and config conventions.

causalab/
├── causal/        # Causal model primitives
├── tasks/         # Task definitions (causal_models.py, counterfactuals.py, …)
├── neural/        # Pyvene API surface — pipeline.py, units.py, LM_units.py,
│                  #   featurizer.py, activations/
├── methods/       # Reusable interpretability tools — DAS, DBM, PCA, SAE,
│                  #   manifold builders, scoring metrics
├── io/            # Single source of truth for disk I/O + shared plot primitives
├── analyses/      # Research-question wrappers (baseline/, locate/, subspace/, …)
├── runner/        # Hydra dispatcher — run_exp.py
└── configs/       # Hydra configs — analysis/, model/, task/, runners/
demos/             # Onboarding notebooks + the weekdays_geometry pipeline notebook
artifacts/         # Run outputs, keyed by task / model / analysis (gitignored)

Dependency flow: tasks/ and causal/ are independent. neural/ depends on neither. io/ depends only on neural/, tasks/, causal/. methods/ depends on neural/, causal/, io/. analyses/ depends on all four. runner/ is a thin shell over analyses/.

Task packages (causalab/tasks/<name>/)

Each task is a self-contained Python package consumed by the analyses through a fixed interface:

File Purpose
causal_models.py Causal model: variables, values, mechanisms
counterfactuals.py Generates counterfactual pairs for each variable
token_positions.py Maps variable names to token positions in the input
config.py Constants: variable value lists, max tokens, task name
templates.py Input text templates with placeholders

Tasks and analyses are fully separated — define a new task and every analysis works automatically.
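
For a flavor of what these files contain, here is a schematic fragment in plain Python. It is not the exact causalab interface; copy an existing package under causalab/tasks/ as the real starting point.

# Schematic only, not the exact causalab interface.
# templates.py -- input text with placeholders for causal variables
TEMPLATE = "If today is {start_day}, what day is it in {offset} days?"

# counterfactuals.py -- a base/counterfactual pair differing in one variable
base = {"start_day": "Tuesday", "offset": "3"}
counter = {"start_day": "Friday", "offset": "3"}   # only start_day differs

base_prompt = TEMPLATE.format(**base)
counter_prompt = TEMPLATE.format(**counter)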

Where results land

artifacts/{task}/{model}/{analysis}/
├── *.json / metadata.json     # Results + resolved Hydra config snapshot
├── *.safetensors / *.pt       # Tensors (activations, weights, distributions)
├── *.png / *.pdf              # Plots
└── *.html                     # Interactive visualizations

The path encodes the run, so cross-model and cross-analysis comparisons are direct file-system operations. Re-running the same runner config rewrites the directory; copy artifacts aside if you want to keep an old run.
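
For instance, comparing a metric across models reduces to a glob over the artifact tree. A sketch, assuming a weekdays task and an "accuracy" field inside metadata.json (the field name is an assumption):

import json
from pathlib import Path

# Hypothetical cross-model comparison over the artifacts/ layout; the
# "accuracy" key inside metadata.json is an assumption.
for meta_path in sorted(Path("artifacts/weekdays").glob("*/baseline/metadata.json")):
    model = meta_path.parent.parent.name
    meta = json.loads(meta_path.read_text())
    print(model, meta.get("accuracy"))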

Getting Started

Installation

git clone https://github.com/goodfire-ai/causalab.git
cd causalab
uv sync

For development:

uv run pre-commit install  # set up git hooks

Recommended starting points

  • End-to-end pipeline (recommended): demos/weekdays_geometry.ipynb chains baseline → subspace → activation_manifold → output_manifold → path_steering → pullback on Llama-3.1-8B. The same pipeline runs from the CLI as ./scripts/run_exp.sh weekdays_8b_pipeline. Minimum hardware: 1 GPU with ≥24 GB VRAM.
  • Causal model primer: demos/causal_model_demo.ipynb walks through defining a causal model and counterfactual dataset.

Tab completion for run_exp.sh

Tab-complete runner config names when invoking ./scripts/run_exp.sh:

# bash
source scripts/completion.bash
# zsh
source scripts/completion.zsh

To enable permanently from the repo root:

# bash
echo "source $(pwd)/scripts/completion.bash" >> ~/.bashrc
# zsh
echo "source $(pwd)/scripts/completion.zsh" >> ~/.zshrc

Submitting a slurm job

run_exp.sh is the single entry point for both inline and slurm runs. Pass --slurm to dispatch as sbatch; --gres=gpu:N is resolved from the model config's slurm.gpus and --time from the runner's slurm.time (default in causalab/configs/base.yaml). CLI flags --gpus, --time, --qos override.

./scripts/run_exp.sh --slurm weekdays_8b_pipeline
./scripts/run_exp.sh --slurm --qos=opportunistic --time=08:00:00 weekdays_8b_pipeline

Running tests

uv run pytest -m "not slow and not gpu"  # quick
uv run pytest                            # full
