LLM Engines
Engine | Description | Main features | Supported hardware | Speed | Drawbacks |
---|---|---|---|---|---|
PyTorch Transformers | A widely used library for training and inference with Transformer models. | Centered on the Hugging Face ecosystem. | General purpose (CPU/GPU) | Medium to fast, depending on model size | Slower than dedicated inference engines. |
vLLM | A fast library for LLM inference and serving, optimized for high throughput. | Continuous batching, efficient memory management (PagedAttention), optimized CUDA kernels. | NVIDIA GPUs (CUDA) | Very fast, optimized for high throughput | Limited to specific hardware configurations (CUDA). |
llama.cpp | A lightweight engine for running LLaMA-family models on various hardware, including Apple Silicon. | Simple model conversion, quantization support, runs on almost any suitable machine, active community support. | CPU, GPU, Apple Silicon | Fast, especially with quantized models | May lack some advanced features found in larger libraries. |
SGLang | A high-performance inference runtime designed for complex LLM programs. | RadixAttention for accelerated execution, automatic KV cache reuse, continuous batching and tensor parallelism. | NVIDIA GPUs (CUDA) | Very fast, optimized performance | Its complexity may mean a steeper learning curve. |
MLX | An efficient runtime optimized for running LLMs on Apple Silicon. | Optimized for Mac users, supports MLX-format models, focuses on efficient resource use. | Apple Silicon | Fast, tailored for Apple hardware | Limited to the Apple ecosystem; less flexible. |
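To make vLLM's headline feature concrete, here is a minimal conceptual sketch of the idea behind PagedAttention (this is illustrative pure Python, not vLLM's actual implementation): each sequence's KV cache lives in fixed-size blocks mapped through a per-sequence block table, so memory is allocated on demand instead of being reserved for the maximum sequence length. The class and the block size are assumptions for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockManager:
    """Toy model of paged KV-cache allocation (not vLLM's real code)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))     # pool of physical blocks
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical block ids

    def append_token(self, seq_id: int, position: int) -> None:
        """Allocate a new physical block only when the sequence crosses
        a block boundary, mimicking on-demand paging."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # first token of a new block
            table.append(self.free_blocks.pop())

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

mgr = BlockManager(num_blocks=8)
for pos in range(20):              # generate a 20-token sequence...
    mgr.append_token(seq_id=0, position=pos)
print(len(mgr.block_tables[0]))    # ...which occupies ceil(20/16) = 2 blocks
```

The point of the design is that a short sequence never ties up memory sized for the longest possible one, which is what enables vLLM's high batch throughput.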
Model Formats
Format | Supported engines | Notes |
---|---|---|
.pt / .bin | Transformers | Traditional PyTorch checkpoint formats. |
safetensors | vLLM, Transformers, SGLang | Safetensors is a file format for safely and efficiently storing and loading model weights and tensors. Launched by Hugging Face, it is designed to replace the traditional PyTorch *.pt / *.bin formats and address the latent security issues and performance bottlenecks in those formats. |
GGUF v2 | llama.cpp | |
GPTQ | vLLM, Transformers, SGLang | |
AWQ | vLLM, Transformers, SGLang | |
MLX | MLX | |
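The file-suffix rows of the table can be expressed as a small lookup helper. This is an illustrative sketch, not any library's API; note that GPTQ and AWQ are quantization schemes rather than distinct file suffixes (their weights are typically shipped inside safetensors files), so only the suffix-based formats are mapped here.

```python
from pathlib import Path

# Mapping mirrors the suffix-based rows of the table above.
FORMAT_ENGINES = {
    ".pt": ["Transformers"],
    ".bin": ["Transformers"],
    ".safetensors": ["vLLM", "Transformers", "SGLang"],
    ".gguf": ["llama.cpp"],
}

def engines_for(filename: str) -> list[str]:
    """Return the engines that can load this file's format, per the table."""
    return FORMAT_ENGINES.get(Path(filename).suffix.lower(), [])

print(engines_for("model-q4_k_m.gguf"))   # ['llama.cpp']
print(engines_for("model.safetensors"))   # ['vLLM', 'Transformers', 'SGLang']
```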