examples(ltx): add LTX-2.3 text-to-video generation example#456

Draft
oboulant wants to merge 148 commits into master from oboulant/ltx-cleanup

Conversation

Contributor

@oboulant oboulant commented Apr 10, 2026

LTX-2 Video+Audio Generation

Zig port of the LTX-2.3 22B two-stage text-to-{video+audio} generation pipeline.

Reference Python implementation: https://github.com/Lightricks/LTX-2

What this does

A single inference binary runs the full GPU pipeline end-to-end:

  1. text embedding post-processing (connector blocks),
  2. position / mask / clean-latent construction from geometry,
  3. noise generation,
  4. Stage 1 denoising (30 steps),
  5. spatial upsampling,
  6. Stage 2 distilled denoising (3 steps),
  7. video VAE decode,
  8. audio VAE decode,
  9. vocoder + bandwidth extension

It produces an MP4 file with synchronized audio directly from Gemma hidden states. GPU buffers flow between phases with no intermediate disk I/O.

Only the Gemma 3 forward pass remains in Python via the export_pipeline.py script, which also serves as the reference implementation.

Optional image conditioning is supported via --image (VAE-encodes a reference image to guide the first frame).
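The nine phases above can be summarized as a strictly linear orchestration. Below is a minimal Python sketch with illustrative phase names (the actual orchestrator is `inference.zig`, and each phase hands GPU buffers directly to the next):

```python
# Illustrative sketch of the phase ordering only; names are invented here,
# not taken from inference.zig.
STAGE1_STEPS = 30  # stage 1 denoising steps (from the description above)
STAGE2_STEPS = 3   # stage 2 distilled denoising steps

def run_pipeline():
    trace = []
    trace.append("connector")                  # 1. text embedding post-processing
    trace.append("geometry")                   # 2. positions / masks / clean latents
    trace.append("noise")                      # 3. noise generation
    trace += ["stage1_step"] * STAGE1_STEPS    # 4. stage 1 denoising
    trace.append("upsample")                   # 5. 2x spatial upsampling
    trace += ["stage2_step"] * STAGE2_STEPS    # 6. stage 2 distilled denoising
    trace.append("video_vae_decode")           # 7. video VAE decode
    trace.append("audio_vae_decode")           # 8. audio VAE decode
    trace.append("vocoder_bwe")                # 9. vocoder + bandwidth extension
    return trace
```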

File overview

| File | Role |
| --- | --- |
| `inference.zig` | Pipeline orchestrator: CLI, phase runners, MP4 mux |
| `model.zig` | Core transformer (48 AV blocks), attention, STG guidance, noise init, sigma schedules |
| `text_embeddings.zig` | Text embedding post-processing: FeatureExtractorV2 + Embeddings1DConnector (8 transformer blocks, SPLIT RoPE) |
| `conv_ops.zig` | Shared convolution / norm primitives (Conv3d, Conv2d, GroupNorm) |
| `upsampler.zig` | 2× spatial latent upscaler + patchify/unpatchify |
| `video_vae.zig` / `audio_vae.zig` | Video and audio VAE decoders |
| `video_vae_encoder.zig` | Video VAE encoder (for image conditioning) |
| `vocoder.zig` | BigVGAN vocoder + bandwidth extension (16 kHz → 48 kHz stereo) |
| `image_loading.zig` | JPEG/PNG loading, resize, center crop, bf16 normalize |
| `export_pipeline.py` | Python reference pipeline + embedding/metadata export |
| `README.md` | Top-level README |
| `ARCHITECTURE.md` | A more technical description of the implementation |

oboulant added 30 commits March 11, 2026 10:50
…sors are the same; will be useful to compare inputs and outputs
…2 audio_to_video_attn / video_to_audio_attn. Some final re-runs to be done to double-check
…to-end block. Several steps are described in block0_reverse_engineering_map.md; Step M1 is done.
…g the right inputs like the production python path
…umerical drift due to XLA op fusions. Distribution of error is consistent with drift
oboulant and others added 30 commits April 16, 2026 12:05
… or double free. As of now, forces synchronous execution calls.
…ed early for positions, masks and clean latents
Use Flash Attention wherever possible. As of now it is wired for:
* self attention (audio and video)
* cross attention (audio to video and video to audio)

Not yet implemented for text cross attention (investigation needed because of the use of masks)
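The dispatch rule at this point in the history can be sketched as follows; this is a minimal, assumed illustration of the logic described in the commit message (names are invented, and a later commit in this PR routes text cross-attention through FA3 as well):

```python
# Hypothetical sketch of the kernel-selection rule, not the actual Zig code:
# Flash Attention is used for the mask-free paths; text cross-attention still
# requires a mask and so falls back to a masked SDPA path.
def pick_attention_kernel(kind: str) -> str:
    flash_wired = {"self_video", "self_audio", "cross_av", "cross_va"}
    if kind in flash_wired:
        return "flash_attention"
    return "masked_sdpa"  # e.g. text cross-attention (mask handling pending)
```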
… H100 (#515)

Route video and audio text cross-attention (`attn2`/`audio_attn2`)
through FlashAttention3 when available, matching the existing FA3 path
used by self-attention and AV cross-attention.

* Pass `attn_meta`/`attn_params` instead of mask for text cross-attn
calls
* Remove unused `mask` field from `ForwardOpts` and
`v_text_ctx_mask`/`a_text_ctx_mask` from `SharedInputs`
* Simplify `forwardImpl` dispatch (no dead mask guard)

Stage 1 execution: 81.2s → 75.6s (-6.9%), Stage 2 execution: 11.6s →
11.1s (-4.3%), total: 231.6s → 206.3s (-10.9%), total compile: 96.6s →
78.0s (-19.3%).
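The quoted percentage deltas follow from the raw timings; a quick illustrative recomputation (values copied from the commit message above):

```python
# Relative change in percent, rounded to one decimal place, matching the
# convention used in the commit message.
def pct(before: float, after: float) -> float:
    return round((after - before) / before * 100, 1)

timings = [
    ("stage 1 execute", 81.2, 75.6),
    ("stage 2 execute", 11.6, 11.1),
    ("total",          231.6, 206.3),
    ("total compile",   96.6, 78.0),
]
for label, before, after in timings:
    print(f"{label}: {pct(before, after)}%")
```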
Remove f32 upcast around GELU in FeedForward: keep activation in native
bf16.

Eliminates two unnecessary dtype conversions per FF call (92 blocks × 2
FF calls per block × 30 steps for stage 1).

~0.7s execution saving.
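The arithmetic behind that saving, spelled out (counts taken from the commit message; the two casts per call are the bf16→f32 upcast and the f32→bf16 downcast being removed):

```python
# Back-of-envelope count of dtype conversions eliminated in stage 1.
blocks, ff_calls_per_block, stage1_steps = 92, 2, 30
ff_calls = blocks * ff_calls_per_block * stage1_steps  # FF calls per stage 1 run
casts_removed = ff_calls * 2                           # 2 casts per FF call
print(ff_calls, casts_removed)
```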
Changes:

* Switch transformer block calls to async execution in Stage 1 and Stage
2 denoising loops
* Edit the timing summary table so that the old `other` column is
split into {Compile, Load, Execute}
* Gate all GPU profiling behind `--profile` CLI flag
* Add `analyze_trace.py`: analyzes XLA trace JSON files with GPU kernel
categorization, de-overlap correction, and `--fix` flag to emit
corrected traces (so even overlapping GPU events appear on Perfetto)

Impact: ~1.9s saved per generation (GPU utilization 92.5% → 99.2% for
Stage 1, and 98.4% → 99.7% for Stage 2)
Improve inline doc comments in `inference.zig` and `model.zig` 

Rewrite `README.md` and `ARCHITECTURE.md`