examples(ltx): add LTX-2.3 text-to-video generation example #456
Draft
Conversation
…sors are the same, will be useful to compare inputs and outputs
…rs for stage 1 and 2
…2 audio_to_video_attn, video_to_audio_attn. Some final re-runs to be done to double-check
…ideo_to_audio_attn (no-LoRA)
…to-end block. Several steps are described in block0_reverse_engineering_map.md; Step M1 is done.
…icAVTransformerBlock
…g the right inputs, like the production Python path
…umerical drift due to XLA op fusions. The error distribution is consistent with such a drift.
… or double free. As of now, forces synchronous exe calls.
…ed early for positions, masks and clean latents
…ve registry and store
…a_noise_buf, v_latent_buf and a_latent_buf
…nd early deinit of preprocessing buffers
Use Flash Attention whenever possible. As of now it is wired for:
* self-attention (audio and video)
* cross-attention (audio to video and video to audio)

Not yet implemented for text cross-attention (investigation needed because of the use of masks).
… H100 (#515)

Route video and audio text cross-attention (`attn2`/`audio_attn2`) through FlashAttention3 when available, matching the existing FA3 path used by self-attention and AV cross-attention.
* Pass `attn_meta`/`attn_params` instead of a mask for text cross-attn calls
* Remove unused `mask` field from `ForwardOpts` and `v_text_ctx_mask`/`a_text_ctx_mask` from `SharedInputs`
* Simplify `forwardImpl` dispatch (no dead mask guard)

Stage 1 execution: 81.2s → 75.6s (-6.9%). Stage 2 execution: 11.6s → 11.1s (-4.3%). Total: 231.6s → 206.3s (-10.9%). Total compile: 96.6s → 78.0s (-19.3%).
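For intuition, a minimal PyTorch sketch (illustrative only, not the Zig/ZML code) of why the text-padding mask can be replaced by sequence-length metadata: masking out padded keys is numerically equivalent to attending over only the valid K/V prefix, which is what a fused variable-length attention kernel does internally when given lengths instead of a mask.

```python
import torch
import torch.nn.functional as F

# Toy shapes: 1 batch, 4 heads, 16 video tokens attending to 8 text tokens,
# of which only the first 5 are valid (the rest is padding).
q = torch.randn(1, 4, 16, 64)
k = torch.randn(1, 4, 8, 64)
v = torch.randn(1, 4, 8, 64)
valid = 5

# Mask-based path: a boolean mask hides the padded text tokens.
mask = torch.zeros(1, 1, 16, 8, dtype=torch.bool)
mask[..., :valid] = True
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Length-based path: attend only over the valid K/V prefix, no mask tensor at all.
out_sliced = F.scaled_dot_product_attention(q, k[:, :, :valid], v[:, :, :valid])

assert torch.allclose(out_masked, out_sliced, atol=1e-5)
```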
# Conflicts:
#   zml/tensor.zig
Remove f32 upcast around GELU in FeedForward: keep activation in native bf16. Eliminates two unnecessary dtype conversions per FF call (92 blocks × 2 FF calls per block × 30 steps for stage 1). ~0.7s execution saving.
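The change, sketched in PyTorch for illustration (the actual FeedForward is Zig/ZML; shapes here are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 1024, 8192, dtype=torch.bfloat16)

# Old pattern: upcast to f32 around the activation, then cast back,
# i.e. two extra dtype conversions per FeedForward call.
y_old = F.gelu(x.float()).to(torch.bfloat16)

# New pattern: apply GELU directly on the bf16 activation.
y_new = F.gelu(x)

# The two differ only by bf16 rounding inside the activation.
print((y_old.float() - y_new.float()).abs().max())
```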
Changes:
* Switch transformer block calls to async execution in the Stage 1 and Stage 2 denoising loops
* Rework the timing summary table so the old `other` column is split into {Compile, Load, Execute}
* Gate all GPU profiling behind a `--profile` CLI flag
* Add `analyze_trace.py`: analyzes XLA trace JSON files with GPU kernel categorization, de-overlap correction, and a `--fix` flag to emit corrected traces (so even overlapping GPU events appear in Perfetto); see the sketch below

Impact: ~1.9s saved per generation (GPU utilization 92.5% → 99.2% for Stage 1, 98.4% → 99.7% for Stage 2)
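A minimal sketch of the de-overlap correction idea (illustrative, not the actual `analyze_trace.py` code): kernel intervals from concurrent streams are merged before summing, so overlapping GPU time is not double-counted.

```python
def gpu_busy_us(events):
    """Merge overlapping (start_us, end_us) kernel intervals and return total busy time."""
    merged = []
    for start, end in sorted(events):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current merged interval
        else:
            merged.append([start, end])              # start a new interval
    return sum(end - start for start, end in merged)

# Two kernels overlapping across streams: naive sum is 30us, true busy time is 25us.
print(gpu_busy_us([(0, 20), (15, 25)]))  # -> 25
```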
Improve inline doc comments in `inference.zig` and `model.zig`
Rewrite `README.md` and `ARCHITECTURE.md`
LTX-2 Video+Audio Generation
Zig port of the LTX-2.3 22B two-stage text-to-{video+audio} generation pipeline.
Reference Python implementation: https://github.com/Lightricks/LTX-2
What this does
A single `inference` binary runs the full GPU pipeline end-to-end. It produces an MP4 file with synchronized audio directly from Gemma hidden states, and GPU buffers flow between phases with no intermediate disk I/O.
Only the Gemma 3 forward pass remains in Python, via the `export_pipeline.py` script, which also serves as the reference implementation.

Optional image conditioning is supported via `--image` (VAE-encodes a reference image to guide the first frame).
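A conceptual sketch of what first-frame conditioning typically looks like, assuming the encoded reference image is pinned as a clean first latent frame; the function and the exact mechanism are assumptions for illustration, not the code in `video_vae_encoder.zig`/`model.zig`:

```python
import torch

def condition_on_image(noisy_latents, image_latent):
    # noisy_latents: [C, T, H, W] video latents entering the denoising loop
    # image_latent:  [C, 1, H, W] VAE encoding of the --image reference frame
    latents = noisy_latents.clone()
    latents[:, :1] = image_latent                      # pin the first latent frame
    clean = torch.zeros(latents.shape[1], dtype=torch.bool)
    clean[0] = True                                    # mark it as clean (not re-noised)
    return latents, clean

lat, keep = condition_on_image(torch.randn(16, 8, 32, 32), torch.randn(16, 1, 32, 32))
```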
File overview
* `inference.zig`
* `model.zig`
* `text_embeddings.zig`
* `conv_ops.zig`
* `upsampler.zig`
* `video_vae.zig` / `audio_vae.zig`
* `video_vae_encoder.zig`
* `vocoder.zig`
* `image_loading.zig`
* `export_pipeline.py`
* `README.md`
* `ARCHITECTURE.md`