███████╗███████╗██████╗ ██████╗ ██╗ ██╗ ███╗ ███╗
╚══███╔╝██╔════╝██╔══██╗██╔═══██╗██║ ██║ ████╗ ████║
███╔╝ █████╗ ██████╔╝██║ ██║██║ ██║ ██╔████╔██║
███╔╝ ██╔══╝ ██╔══██╗██║ ██║██║ ██║ ██║╚██╔╝██║
███████╗███████╗██║ ██║╚██████╔╝███████╗███████╗██║ ╚═╝ ██║
╚══════╝╚══════╝╚═╝ ╚═╝ ╚═════╝ ╚══════╝╚══════╝╚═╝ ╚═╝
ZeroLLM is a personal LLM backend control plane designed to scale GPU inference to zero when idle.
Current status: phases 1-4 are implemented (orchestration, routing, API key auth, and cluster state). The runtime now starts llama-server on GPU instances, although some internal names still say vLLM.
Inference is exposed through a Lambda Function URL for true streamed responses.
Not yet implemented:
- Google OAuth / JWT validation. Programmatic API keys with the `zllm-` prefix are implemented.
- Web UI.
- Automatic model seeding during deploy.
Included in this repository:
- AWS SAM template and Lambda handlers (`template.yaml`, `control_plane/backends/aws/handlers.py`)
- Cloud-agnostic core logic (`control_plane/core/`)
- AWS and mock backends (`control_plane/backends/aws/`, `control_plane/backends/mock/`)
- Unit and E2E tests (`tests/unit/`, `tests/e2e/`)
Prerequisites:
- Python 3.12+
- `uv`
- Docker (required for LocalStack E2E tests)
- Optional: AWS SAM CLI (for build/deploy)
make setup-dev
make test-unit
make test-e2e
E2E tests require Docker. LocalStack tests are skipped if Docker/Testcontainers is unavailable.
make validate
make build
AWS_REGION=ap-southeast-2 make deploy
`make deploy` automatically:
- builds/uses a GPU AMI from the Image Builder pipeline
- discovers `GpuSubnetId` values and the VPC for those subnets
- creates the GPU security group from the SAM stack
- runs `sam build` and `sam deploy` with parameter overrides
Optional deploy environment variables:
- `STACK_NAME` (default `zerollm`)
- `ENVIRONMENT` (default `dev`)
- `DEPLOY_DEFAULTS_FILE` (default `.zerollm/deploy-<region>-<stack>.env`)
- `AMI_BUILD_MODE=auto` (default): use latest pipeline AMI, build if missing
- `AMI_BUILD_MODE=latest`: require latest pipeline AMI
- `AMI_BUILD_MODE=build`: always build a new AMI first
- `GPU_AMI_ID`, `GPU_SUBNET_ID` to override auto-discovery
- `ALLOWED_EMAILS`, `GOOGLE_CLIENT_ID` for auth configuration
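The three `AMI_BUILD_MODE` values can be read as a small decision rule. The sketch below is illustrative only: `resolve_gpu_ami` and its parameters are hypothetical names, not the deploy script's actual code, and it leaves out the `GPU_AMI_ID` override.

```python
from typing import Callable, Optional


def resolve_gpu_ami(
    mode: str,
    latest_ami: Optional[str],
    build_ami: Callable[[], str],
) -> str:
    """Pick a GPU AMI ID according to AMI_BUILD_MODE (hypothetical helper).

    latest_ami is the newest AMI produced by the pipeline, or None if the
    pipeline has not produced one yet. build_ami starts a build and returns
    the new AMI ID.
    """
    if mode == "build":
        # Always build a fresh AMI, even if one already exists.
        return build_ami()
    if mode == "latest":
        # Require an existing pipeline AMI; never trigger a build.
        if latest_ami is None:
            raise RuntimeError("AMI_BUILD_MODE=latest but the pipeline has no AMI")
        return latest_ami
    if mode == "auto":
        # Prefer the latest pipeline AMI and build only when none exists.
        return latest_ami if latest_ami is not None else build_ami()
    raise ValueError(f"unknown AMI_BUILD_MODE: {mode!r}")
```

Passing the build step in as a callable keeps the rule testable without touching AWS; a real script would presumably wire it to the Image Builder pipeline.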
Network defaults behavior:
- First deploy auto-discovers subnet IDs and writes them to `DEPLOY_DEFAULTS_FILE`.
- Later deploys reuse those pinned values by default for consistency.
- Delete the file (or set an explicit `GPU_SUBNET_ID`) to re-select.
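The pinning behavior amounts to a three-step precedence: explicit override, then pinned file, then discovery. The following is a minimal sketch under assumptions: the `KEY=VALUE` file layout, the `GPU_SUBNET_ID` key inside the file, and the helper names are all hypothetical, and the subnet value is treated as one opaque string.

```python
from pathlib import Path
from typing import Callable, Mapping


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    if not path.exists():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve_subnet_id(
    env: Mapping[str, str],
    defaults_file: Path,
    discover: Callable[[], str],
) -> str:
    """Resolve the GPU subnet: explicit env, then pinned file, then discovery."""
    # 1. An explicit GPU_SUBNET_ID always wins.
    explicit = env.get("GPU_SUBNET_ID")
    if explicit:
        return explicit
    # 2. Reuse the value pinned by an earlier deploy, if any.
    pinned = read_env_file(defaults_file).get("GPU_SUBNET_ID")
    if pinned:
        return pinned
    # 3. First deploy (or file deleted): discover, then pin for later runs.
    subnet_id = discover()
    defaults_file.parent.mkdir(parents=True, exist_ok=True)
    with defaults_file.open("a") as handle:
        handle.write(f"GPU_SUBNET_ID={subnet_id}\n")
    return subnet_id
```

Deleting the defaults file sends the next call back to step 3, which matches the re-select behavior described above.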
After deploy, seed model configs and create at least one API key:
AWS_REGION=ap-southeast-2 make seed-models
AWS_REGION=ap-southeast-2 make create-api-key EMAIL=you@example.com
`make seed-models` can also upload configured GGUF files to the deployment S3 bucket when run with the script's `--upload --bucket <bucket>` options.
AWS_REGION=us-east-1 \
  make ami-build
Optional environment variables:
- `BASE_AMI_ID` (if omitted, script uses regional defaults when available)
- `BUILDER_SUBNET_ID` (if omitted, script auto-selects a subnet)
- `BUILDER_SECURITY_GROUP_ID` (if omitted, script auto-selects a security group in the subnet VPC)
- `BUILDER_INSTANCE_TYPE` (default `t3.small`, used only for AMI build instances)
- `IMAGE_VERSION` (default `1.0.2`; bump when recipe changes)
- `AMI_PIPELINE_STACK` (default `zerollm-ami-pipeline`)
- `AMI_PIPELINE_ENV` (default `dev`)
- `PIPELINE_STATUS` (default `DISABLED`)
Useful subcommands:
AWS_REGION=us-east-1 make ami-build-deploy # deploy/update pipeline stack only
AWS_REGION=us-east-1 make ami-build-start # start a new image build
AWS_REGION=us-east-1 make ami-build-latest # print latest AMI ID from pipeline
Regional defaults (community-maintained, PRs welcome):
| Region | Base GPU AMI (`BASE_AMI_ID`) | Notes |
|---|---|---|
| `ap-southeast-2` | `ami-021000ae4658b3c28` | Seed default; validate periodically |
| `us-west-2` | `ami-0a08f4510bfe41148` | Seed default; validate periodically |
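As a rough illustration of how `BASE_AMI_ID` and the regional defaults interact, the lookup could look like the sketch below. The function name and error message are hypothetical; only the two AMI IDs come from the table.

```python
from typing import Optional

# Regional defaults copied from the table above.
REGIONAL_BASE_AMIS = {
    "ap-southeast-2": "ami-021000ae4658b3c28",
    "us-west-2": "ami-0a08f4510bfe41148",
}


def resolve_base_ami(region: str, base_ami_id: Optional[str] = None) -> str:
    """Return the explicit BASE_AMI_ID if given, else the regional default."""
    if base_ami_id:
        return base_ami_id
    try:
        return REGIONAL_BASE_AMIS[region]
    except KeyError:
        # No default is listed for this region, so the caller must supply one.
        raise ValueError(f"no default base AMI for {region}; set BASE_AMI_ID") from None
```

Under this reading, a region missing from the table needs an explicit `BASE_AMI_ID`.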
- `make setup` - install runtime deps
- `make setup-dev` - install runtime + dev deps
- `make sync-requirements` - regenerate root `requirements.txt` from root `pyproject.toml`
- `make ami-build` - deploy Image Builder stack and build a GPU AMI
- `make ami-build-deploy` - deploy/update Image Builder stack only
- `make ami-build-start` - start a new Image Builder pipeline execution
- `make ami-build-latest` - print latest AMI ID built by pipeline
- `make test` - run default test target (`test-unit`)
- `make test-unit` - run unit tests
- `make test-e2e` - run E2E tests
- `make validate` - validate SAM template
- `make build` - SAM build
- `make deploy` - one-command auto deploy (AMI + network param auto-resolution + SAM deploy)
- `make seed-models` - seed default model configuration into DynamoDB
- `make create-api-key EMAIL=you@example.com` - create a programmatic API key
- `make status` - print instance records from DynamoDB
- `make logs` - show EC2 state, health, and instance journal logs via SSM
Dependency note:
- Root `pyproject.toml` is the source of truth.
- `make build` runs `make sync-requirements` first so SAM packaging stays in sync.
- All API Gateway routes are protected by the Lambda authorizer. Use `Authorization: Bearer <zllm-key>` with keys created by `make create-api-key`.
- First inference for a cold model returns `503` with `Retry-After`; the router triggers async scale-up and clients should retry.
- Use the `StreamingApiUrl` stack output for inference clients. It validates the same `Authorization: Bearer <zllm-key>` API keys and supports `POST /v1/responses`, `POST /v1/chat/completions`, and `GET /v1/models`.
- Prefer OpenAI's Responses API (`POST /v1/responses`) for new clients. Chat completions remain available for compatibility; legacy completions are not exposed.
- GPU instances must expose port `8000`; the generated security group currently opens that port publicly.
- The default model seed data points at GGUF model files for `llama-server`. Ensure the files exist in the AMI or upload them to the model bucket and seed `s3_key`.
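Because a cold model answers `503` with `Retry-After`, clients need a retry loop. Below is a minimal standard-library sketch, assuming `Retry-After` is given in seconds. The function names, the 5-second fallback, and the attempt limit are illustrative choices, not part of ZeroLLM.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def retry_after_seconds(header_value: Optional[str], default: float = 5.0) -> float:
    """Interpret a Retry-After header given in seconds; fall back otherwise."""
    if header_value is None:
        return default
    try:
        return max(0.0, float(header_value))
    except ValueError:
        # The HTTP-date form of Retry-After is not handled in this sketch.
        return default


def post_with_cold_start_retry(
    url: str,
    api_key: str,
    payload: dict,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
    urlopen: Callable = urllib.request.urlopen,
) -> bytes:
    """POST JSON with a Bearer key, retrying while the backend returns 503."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    for attempt in range(1, max_attempts + 1):
        request = urllib.request.Request(url, data=body, headers=headers, method="POST")
        try:
            with urlopen(request) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Only a 503 signals a cold start; anything else is a real error.
            if err.code != 503 or attempt == max_attempts:
                raise
            sleep(retry_after_seconds(err.headers.get("Retry-After")))
    raise RuntimeError("max_attempts must be at least 1")
```

A call might pass `<StreamingApiUrl>/v1/chat/completions` as `url` and a body such as `{"model": "Qwen/Qwen3.5-4B", "messages": [...]}`. The sketch reads the whole response at once and does not consume a streamed response incrementally.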
ZeroLLM works as a pi backend via ~/.pi/agent/models.json. Add a zerollm provider:
{
  "providers": {
    "zerollm": {
      "baseUrl": "https://<your-streaming-url>.lambda-url.<region>.on.aws/v1",
      "api": "openai-completions",
      "apiKey": "<your-zllm-key>",
      "models": [
        { "id": "Qwen/Qwen3.5-4B", "contextWindow": 131072, "reasoning": true, "compat": { "thinkingFormat": "deepseek" } },
        { "id": "Qwen/Qwen3.6-27B", "contextWindow": 262144, "reasoning": true, "compat": { "thinkingFormat": "deepseek" } }
      ]
    }
  }
}
Set as default in `~/.pi/agent/settings.json`:
{
  "defaultProvider": "zerollm",
  "defaultModel": "Qwen/Qwen3.6-27B",
  "defaultThinkingLevel": "medium"
}
Key points:
- `api: "openai-completions"`: llama.cpp's server speaks the OpenAI Chat Completions API. Pi's `openai-completions` handler parses DeepSeek-style `<thinking>` blocks from the response stream.
- `reasoning: true`: tells pi the model supports extended thinking. Without this, pi won't send reasoning params, and thinking level cycling (Shift+Tab) will show "Current model does not support thinking".
- `defaultThinkingLevel`: set to `off` by default in pi; change to `medium` or `high` to enable thinking on these models.
- llama-server flag: models use `--reasoning-format deepseek` in `vllm_args` (see `models.json`) so the server outputs `<thinking>` tags that pi's `openai-completions` parser maps to thinking blocks.
- `control_plane/core/` - cloud-agnostic domain logic
- `control_plane/backends/aws/` - AWS implementations + Lambda handlers
- `control_plane/backends/mock/` - in-memory/mock implementations for testing
- `tests/unit/` - unit tests with mock backends
- `tests/e2e/` - E2E tests (mock vLLM + optional LocalStack)
- `scripts/create_api_key.py` - API key creation helper