LLM Engines
Engine | Description | Main features | Supported hardware | Speed | Drawbacks |
---|---|---|---|---|---|
PyTorch Transformers | A widely used library for training and inference with Transformer models. | Centered on the Hugging Face ecosystem. | General purpose (CPU/GPU) | Medium to fast, depending on model size | Slower than dedicated inference engines. |
vLLM | A fast library for LLM inference and serving, optimized for high throughput. | Continuous batching, efficient memory management (PagedAttention), optimized CUDA kernels. | NVIDIA GPUs (CUDA) | Very fast, optimized for high throughput | Limited to specific hardware configurations (CUDA). |
llama.cpp | A lightweight engine for running LLaMA-family models on various hardware, including Apple Silicon. | Simple model conversion, quantization support, runs on almost any suitable machine, active community support. | CPU, GPU, Apple Silicon | Fast, especially with quantized models | May lack some advanced features found in larger libraries. |
SGLang | A high-performance inference runtime designed for complex LLM programs. | RadixAttention for accelerated execution, automatic KV cache reuse, continuous batching and tensor parallelism. | NVIDIA GPUs (CUDA) | Very fast, optimized performance | Its complexity may mean a steeper learning curve. |
MLX | An efficient runtime optimized for running LLMs on Apple Silicon. | Optimized for Mac users, supports MLX-format models, focuses on efficient resource use. | Apple Silicon | Fast, tailored for Apple hardware | Limited to the Apple ecosystem; less flexible. |
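To make vLLM's headline feature concrete, here is a minimal conceptual sketch of the idea behind PagedAttention (this is illustrative pure Python, not vLLM's actual implementation): each sequence's KV cache lives in fixed-size blocks mapped through a per-sequence block table, so memory is allocated on demand instead of being reserved for the maximum sequence length. The class and the block size are assumptions for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockManager:
    """Toy model of paged KV-cache allocation (not vLLM's real code)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))     # pool of physical blocks
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical block ids

    def append_token(self, seq_id: int, position: int) -> None:
        """Allocate a new physical block only when the sequence crosses
        a block boundary, mimicking on-demand paging."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # first token of a new block
            table.append(self.free_blocks.pop())

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

mgr = BlockManager(num_blocks=8)
for pos in range(20):              # generate a 20-token sequence...
    mgr.append_token(seq_id=0, position=pos)
print(len(mgr.block_tables[0]))    # ...which occupies ceil(20/16) = 2 blocks
```

The point of the design is that a short sequence never ties up memory sized for the longest possible one, which is what enables vLLM's high batch throughput.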
Model Formats
Format | Supported engines | Notes |
---|---|---|
.pt / .bin | Transformers | Traditional PyTorch checkpoint formats. |
safetensors | vLLM, Transformers, SGLang | Safetensors is a file format for safely and efficiently storing and loading model weights and tensors. Launched by Hugging Face, it is designed to replace the traditional PyTorch *.pt / *.bin formats and address the latent security issues and performance bottlenecks in those formats. |
GGUF v2 | llama.cpp | |
GPTQ | vLLM, Transformers, SGLang | |
AWQ | vLLM, Transformers, SGLang | |
MLX | MLX | |
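The file-suffix rows of the table can be expressed as a small lookup helper. This is an illustrative sketch, not any library's API; note that GPTQ and AWQ are quantization schemes rather than distinct file suffixes (their weights are typically shipped inside safetensors files), so only the suffix-based formats are mapped here.

```python
from pathlib import Path

# Mapping mirrors the suffix-based rows of the table above.
FORMAT_ENGINES = {
    ".pt": ["Transformers"],
    ".bin": ["Transformers"],
    ".safetensors": ["vLLM", "Transformers", "SGLang"],
    ".gguf": ["llama.cpp"],
}

def engines_for(filename: str) -> list[str]:
    """Return the engines that can load this file's format, per the table."""
    return FORMAT_ENGINES.get(Path(filename).suffix.lower(), [])

print(engines_for("model-q4_k_m.gguf"))   # ['llama.cpp']
print(engines_for("model.safetensors"))   # ['vLLM', 'Transformers', 'SGLang']
```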