meganeura

Neural network training and inference in Rust, on any GPU. Vulkan, Metal, DX12 — Linux, Windows, macOS, iOS, Android. No CUDA, no Python in the loop, routinely faster than PyTorch on AMD and Apple GPUs.

Status: actively developed, APIs still in motion, but already running real workloads (SmolLM2, SmolVLA, ResNet-50, Whisper-tiny, Stable Diffusion U-Net). Issues and pull requests welcome.

Define a graph, call build_session, train. Meganeura handles autodiff, kernel fusion via e-graph equality saturation, Naga IR shader codegen, and GPU dispatch automatically.

use meganeura::{Graph, Trainer, TrainConfig, build_session};

let mut g = Graph::new();
let x = g.input("x", &[32, 784]);
let labels = g.input("labels", &[32, 10]);

let w1 = g.parameter("w1", &[784, 128]);
let h = g.relu(g.matmul(x, w1));
let w2 = g.parameter("w2", &[128, 10]);
let logits = g.matmul(h, w2);

let loss = g.cross_entropy_loss(logits, labels);
g.set_outputs(vec![loss]);

// autodiff + e-graph optimize + compile + GPU init
let session = build_session(&g);
let mut trainer = Trainer::new(session, TrainConfig::default());
trainer.train(&mut data, /* epochs = */ 10); // data loader: see examples/mnist.rs

A two-layer MLP, trained end to end on the GPU, in one screen.

Why Meganeura

Fast. Inference against PyTorch on matching model configs, from the cross-framework benchmark at Inferena:

Platform	Workload	Meganeura	PyTorch
Apple M3	Stable Diffusion U-Net	10 ms	534 ms (MPS)	53× faster
Apple M3	SmolVLA	37 ms	172 ms (MPS)	4.6× faster
Apple M3	SmolLM2-135M	62 ms	247 ms (MPS)	4× faster
Radeon 890M	SmolLM2-135M	34 ms	67 ms (ROCm)	2× faster
Radeon 890M	SmolVLA	15 ms	25 ms (ROCm)	1.7× faster
RTX 5080	SmolVLA	3 ms	3 ms (CUDA 13.0)	parity
RTX 5080	SmolLM2-135M	7 ms	4 ms (CUDA 13.0)	within 1.75×

Training shows the same shape on non-NVIDIA: on Radeon 890M, Meganeura trains SmolLM2-135M in 87 ms/step vs PyTorch ROCm’s 123 ms, and SmolVLA in 35 ms vs 41 ms. On NVIDIA, PyTorch CUDA still leads on training workloads.

Meganeura also runs on GPUs PyTorch doesn’t target, including Radeon 780M and Intel integrated graphics (RPL-U).

The wedge isn’t every workload: on ResNet-50 and Whisper-tiny, Meganeura currently trails. Full cross-framework tables — including the losses — at inferena.tech. Reproduce locally with ./run.sh -m <Model>.

Portable. GPU access via blade-graphics — one backend, five platforms. Works on anything with Vulkan (including Mesa’s Lavapipe for headless CI), Metal on Apple silicon, and DX12 on Windows. No CUDA, no ROCm, no vendor lock-in at any layer of the stack.

Lean. A handful of kernel archetypes — pointwise, reduction, matmul, attention — compose into specialized GPU shaders at compile time. An e-graph equality-saturation pass discovers fusions (e.g. x * sigmoid(x) → Silu, Silu(gate) * up → SwiGLU) with a cost-model-driven extractor, rather than relying on hand-written fused kernels for every pattern. The codebase is small enough to read end to end.

How it compares

	GPU backends	Training	Approach
Meganeura	blade-graphics (Vulkan, Metal, DX12)	yes	graph IR + e-graph fusion + Naga codegen
Candle	CUDA, Metal, CPU	limited	eager tensors, hand-written kernels
Burn	CUDA, wgpu, NDArray, LibTorch	yes	modular multi-backend
tch-rs	CUDA, CPU (via libtorch)	yes	PyTorch FFI bindings
Luminal	CUDA, Metal	inference-only	e-graph IR

Meganeura’s wedge: training, on non-NVIDIA hardware, without writing the kernels by hand.

Install

cargo add meganeura

Worked examples live in examples/:

mnist.rs — MNIST training end to end.
smollm2.rs — LLM inference with HuggingFace weights.

Pretrained models can be loaded from ONNX or NNEF via meganeura::load_onnx(...) / meganeura::load_nnef(...). Both lower through Meganeura’s IR, so e-graph fusions apply to imported graphs as well as hand-built ones.

System requirements

Meganeura runs best where cooperative matrix operations are hardware-accelerated:

Vulkan — VK_KHR_cooperative_matrix (NVIDIA Volta+, AMD RDNA3+, Intel Arc).
Metal — simdgroup matrix (Apple M1+).

Falls back to scalar matmul on older hardware. Headless Lavapipe works for CI.

Profiling

MEGANEURA_TRACE=trace.pftrace cargo run --example mnist

Open the trace in Perfetto:

Contributing

Early project, small API surface, small community — a good time to show up. Open an issue before starting anything sizeable so we can align on direction.

Name		Name	Last commit message	Last commit date
Latest commit History 271 Commits
.github/workflows		.github/workflows
bench		bench
docs		docs
etc		etc
examples		examples
scripts		scripts
src		src
tests		tests
tools		tools
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

meganeura

Why Meganeura

How it compares

Install

System requirements

Profiling

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

meganeura

Why Meganeura

How it compares

Install

System requirements

Profiling

Contributing

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages