examples(ltx): add LTX-2.3 text-to-video generation example#456

Draft
oboulant wants to merge 148 commits into master from oboulant/ltx-cleanup

Conversation

Contributor

@oboulant oboulant commented Apr 10, 2026

LTX-2 Video+Audio Generation

Zig port of the LTX-2.3 22B two-stage text-to-{video+audio} generation pipeline.

Reference Python implementation: https://github.com/Lightricks/LTX-2

What this does

A single inference binary runs the full GPU pipeline end-to-end:

  1. text embedding post-processing (connector blocks),
  2. position / mask / clean-latent construction from geometry,
  3. noise generation,
  4. Stage 1 denoising (30 steps),
  5. spatial upsampling,
  6. Stage 2 distilled denoising (3 steps),
  7. video VAE decode,
  8. audio VAE decode,
  9. vocoder + bandwidth extension

It produces an MP4 file with synchronized audio directly from Gemma hidden states. GPU buffers flow between phases with no intermediate disk I/O.

Only the Gemma 3 forward pass remains in Python via the export_pipeline.py script, which also serves as the reference implementation.

Optional image conditioning is supported via --image (VAE-encodes a reference image to guide the first frame).
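The nine phases above can be summarized as a strictly linear orchestration. Below is a minimal Python sketch with illustrative phase names (the actual orchestrator is `inference.zig`, and each phase hands GPU buffers directly to the next):

```python
# Illustrative sketch of the phase ordering only; names are invented here,
# not taken from inference.zig.
STAGE1_STEPS = 30  # stage 1 denoising steps (from the description above)
STAGE2_STEPS = 3   # stage 2 distilled denoising steps

def run_pipeline():
    trace = []
    trace.append("connector")                  # 1. text embedding post-processing
    trace.append("geometry")                   # 2. positions / masks / clean latents
    trace.append("noise")                      # 3. noise generation
    trace += ["stage1_step"] * STAGE1_STEPS    # 4. stage 1 denoising
    trace.append("upsample")                   # 5. 2x spatial upsampling
    trace += ["stage2_step"] * STAGE2_STEPS    # 6. stage 2 distilled denoising
    trace.append("video_vae_decode")           # 7. video VAE decode
    trace.append("audio_vae_decode")           # 8. audio VAE decode
    trace.append("vocoder_bwe")                # 9. vocoder + bandwidth extension
    return trace
```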

File overview

| File | Role |
| --- | --- |
| `inference.zig` | Pipeline orchestrator: CLI, phase runners, MP4 mux |
| `model.zig` | Core transformer (48 AV blocks), attention, STG guidance, noise init, sigma schedules |
| `text_embeddings.zig` | Text embedding post-processing: FeatureExtractorV2 + Embeddings1DConnector (8 transformer blocks, SPLIT RoPE) |
| `conv_ops.zig` | Shared convolution / norm primitives (Conv3d, Conv2d, GroupNorm) |
| `upsampler.zig` | 2× spatial latent upscaler + patchify/unpatchify |
| `video_vae.zig` / `audio_vae.zig` | Video and audio VAE decoders |
| `video_vae_encoder.zig` | Video VAE encoder (for image conditioning) |
| `vocoder.zig` | BigVGAN vocoder + bandwidth extension (16 kHz → 48 kHz stereo) |
| `image_loading.zig` | JPEG/PNG loading, resize, center crop, bf16 normalize |
| `export_pipeline.py` | Python reference pipeline + embedding/metadata export |
| `README.md` | Top-level README |
| `ARCHITECTURE.md` | A more technical description of the implementation |

oboulant added 30 commits March 11, 2026 10:50
…sors are the same; will be useful to compare inputs and outputs
…2 audio_to_video_attn / video_to_audio_attn. Some final re-runs to be done to double-check
…to-end block. Several steps are described in block0_reverse_engineering_map.md; Step M1 is done.
…g the right inputs like the production python path
…umerical drift due to XLA op fusions. Distribution of error is consistent with drift
oboulant and others added 30 commits April 16, 2026 12:05
… or double free. As of now, forces synchronous execution calls.
…ed early for positions, masks and clean latents
Use Flash Attention wherever possible. As of now it is wired for:
* self attention (audio and video)
* cross attention (audio to video and video to audio)

Not yet implemented for text cross attention (investigation needed because of the use of masks)
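The dispatch rule at this point in the history can be sketched as follows; this is a minimal, assumed illustration of the logic described in the commit message (names are invented, and a later commit in this PR routes text cross-attention through FA3 as well):

```python
# Hypothetical sketch of the kernel-selection rule, not the actual Zig code:
# Flash Attention is used for the mask-free paths; text cross-attention still
# requires a mask and so falls back to a masked SDPA path.
def pick_attention_kernel(kind: str) -> str:
    flash_wired = {"self_video", "self_audio", "cross_av", "cross_va"}
    if kind in flash_wired:
        return "flash_attention"
    return "masked_sdpa"  # e.g. text cross-attention (mask handling pending)
```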
… H100 (#515)

Route video and audio text cross-attention (`attn2`/`audio_attn2`)
through FlashAttention3 when available, matching the existing FA3 path
used by self-attention and AV cross-attention.

* Pass `attn_meta`/`attn_params` instead of mask for text cross-attn
calls
* Remove unused `mask` field from `ForwardOpts` and
`v_text_ctx_mask`/`a_text_ctx_mask` from `SharedInputs`
* Simplify `forwardImpl` dispatch (no dead mask guard)

Stage 1 execution: 81.2s → 75.6s (-6.9%), Stage 2 execution: 11.6s →
11.1s (-4.3%), total: 231.6s → 206.3s (-10.9%), total compile: 96.6s →
78.0s (-19.3%).
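The quoted percentage deltas follow from the raw timings; a quick illustrative recomputation (values copied from the commit message above):

```python
# Relative change in percent, rounded to one decimal place, matching the
# convention used in the commit message.
def pct(before: float, after: float) -> float:
    return round((after - before) / before * 100, 1)

timings = [
    ("stage 1 execute", 81.2, 75.6),
    ("stage 2 execute", 11.6, 11.1),
    ("total",          231.6, 206.3),
    ("total compile",   96.6, 78.0),
]
for label, before, after in timings:
    print(f"{label}: {pct(before, after)}%")
```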
Remove f32 upcast around GELU in FeedForward: keep activation in native
bf16.

Eliminates two unnecessary dtype conversions per FF call (92 blocks × 2
FF calls per block × 30 steps for stage 1).

~0.7s execution saving.
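The arithmetic behind that saving, spelled out (counts taken from the commit message; the two casts per call are the bf16→f32 upcast and the f32→bf16 downcast being removed):

```python
# Back-of-envelope count of dtype conversions eliminated in stage 1.
blocks, ff_calls_per_block, stage1_steps = 92, 2, 30
ff_calls = blocks * ff_calls_per_block * stage1_steps  # FF calls per stage 1 run
casts_removed = ff_calls * 2                           # 2 casts per FF call
print(ff_calls, casts_removed)
```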
Changes:

* Switch transformer block calls to async execution in Stage 1 and Stage
2 denoising loops
* Edit the timing summary table so that the old `other` column is
split into {Compile, Load, Execute}
* Gate all GPU profiling behind `--profile` CLI flag
* Add `analyze_trace.py`: analyzes XLA trace JSON files with GPU kernel
categorization, de-overlap correction, and `--fix` flag to emit
corrected traces (so even overlapping GPU events appear on Perfetto)

Impact: ~1.9s saved per generation (GPU utilization 92.5% → 99.2% for
Stage 1, and 98.4% → 99.7% for Stage 2)
Improve inline doc comments in `inference.zig` and `model.zig` 

Rewrite `README.md` and `ARCHITECTURE.md`