Organizations

@PaddlePaddle

FlashInfer: Kernel Library for LLM Serving

Python 5,575 959 Updated May 8, 2026

A throughput-oriented high-performance serving framework for LLMs

Jupyter Notebook 957 49 Updated Mar 29, 2026

Dynamic Memory Management for Serving LLMs without PagedAttention

C 483 41 Updated May 30, 2025

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

Python 5,205 375 Updated Apr 20, 2026

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python 13,578 2,358 Updated May 8, 2026

Minimalist ML framework for Rust

Rust 20,204 1,558 Updated May 7, 2026

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Python 4,046 325 Updated May 8, 2026

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 79,361 16,544 Updated May 8, 2026

Inference code for Llama models

Python 59,399 9,815 Updated Jan 26, 2025

SCQL (Secure Collaborative Query Language) is a system that allows multiple distrusting parties to run joint analysis without revealing their private data.

Go 180 73 Updated Mar 18, 2026

Running large language models on a single GPU for throughput-oriented scenarios.

Python 9,366 590 Updated Oct 28, 2024

Simple samples for TensorRT programming

Python 1,658 350 Updated May 5, 2026

Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.

Jupyter Notebook 1,585 99 Updated Jan 28, 2026

🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.

Python 33,575 6,970 Updated May 8, 2026

Synthesizer for optimal collective communication algorithms

Python 123 28 Updated Apr 8, 2024

Repo for external large-scale work

Python 6,554 721 Updated Apr 27, 2024

Transformer related optimization, including BERT, GPT

C++ 6,413 935 Updated Mar 27, 2024
Python 2,964 344 Updated Apr 21, 2026

Development repository for the Triton language and compiler

MLIR 19,123 2,836 Updated May 8, 2026

Microsoft Collective Communication Library

C++ 389 33 Updated Sep 20, 2023

Large-scale model inference.

Python 628 85 Updated Sep 12, 2023

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.

C++ 9,392 1,016 Updated Dec 4, 2025

A baseline repository of Auto-Parallelism in Training Neural Networks

Python 146 20 Updated Jun 25, 2022

XGo is a programming language that reads like plain English. But it's also incredibly powerful — it lets you leverage assets from C/C++, Go, Python, and JavaScript/TypeScript, creating a unified so…

Go 9,413 560 Updated May 8, 2026

Kubernetes-native Deep Learning Framework

Python 745 116 Updated Jan 26, 2024

Training and serving large-scale neural networks with auto parallelization.

Python 3,187 362 Updated Dec 9, 2023

Flexible and powerful tensor operations for readable and reliable code (for pytorch, jax, TF and others)

Python 9,479 398 Updated Apr 19, 2026

PyTorch Implementation of OpenAI GPT-2

Python 358 68 Updated Jul 4, 2024

A PyTorch implementation of the Transformer model in "Attention is All You Need".

Python 9,715 2,090 Updated Apr 16, 2024