Beginners Model Providers
Ollama is easy to use but slow in delivering new features. It supports a lot of models:
Other Model Providers for beginners:
Ollama is easy to use but slow in delivering new features. It supports a lot of models:
Other Model Providers for beginners:
Below are some preferred AI Model standards:
GGUF
This is now the standard used by a lot of AI applications.
References:
ggml/docs/gguf.md at master · ggml-org/ggml · GitHub
Q4_K_M or above
The 4 after the Q indicates the number of bits - the higher the better the quality but uses more resources.
References:
https://medium.com/@ingridwickstevens/quantization-of-llms-with-llama-cpp-9bbf59deda35
BigTranslate is from Institute of Automation of the Chinese Academy of Sciences (CASIA).
Code:
References:
Llama 3 supports multiple languages:
As of 2024-07-20 it has 3 different sizes: 8 Billion (available), 70 Billion (available), 400 Billion (almost there!) parameters.
References:
Despite its Open AI name, the ChatGPT it developed is not open sourced, but open sourced Large Language Models are being developed quickly by others:
As of 2024-08-12 the default LLM model is Llama 3.1
Data Cut-off Month: 2023-12
Promising for non-commercial applications.
Interesting how they took it down after short time online:
Can be used on less powerful hardware:
Promising for commercial applications.
20 Billion parameters can be a bit heavy but the gains may be worth it over the older and leaner FLAN-T5 it is based on.
Ollama has limited support for some OpenAI APIs:
But if that is not absolutely necessary, that just stick to Ollama's own API:
There is less support in Ollama than vLLM for running the SAME model across multiple GPUs, some related parameters are:
OLLAMA_NUM_PARALLEL
Maximum number of parallel requests
OLLAMA_SCHED_SPREAD
Always schedule model across all GPUs
There are plans to improve on the situation:
As of 2025-05-10, Ollama still CANNOT run llama3.2-vision model in parallel:
| model | parallel | parameters | size | update |
|---|---|---|---|---|
| llama3.2-vision:11b-instruct-q4_K_M | no | 11b | 7.8 GB | 6 months |
| minicpm-v:8b-2.6-q4_K_M | yes | 8b | 5.7 GB | 5 months |
| gemma3:12b-it-q4_K_M | yes | 12b | 8.1 GB | 6 weeks |
| granite3.2-vision:2b-q4_K_M | yes | 8b | 4.9 GB | 2 months |
| moondream:1.8b-v2-q4_K_M | yes | 1.8b | 829 MB | 12 months |
| llava-phi3:3.8b-mini-q4_0 | yes | 3.8b | 2.9 GB | 12 months |
| llava-llama3:8b-v1.1-q4_0 | yes | 8b | 5.5 GB | 12 months |
| llava:13b-v1.6-vicuna-q4_K_M | yes | 13b | 8.5 GB | 15 months |
| bakllava:7b-v1-q4_K_M | yes | 7b | 5.0 GB | 17 months |
All VLLM behavie differently, below is a VERY rough guide:
| Model | Max Image Resolution | Max Input Tokens | Structured Output |
|---|---|---|---|
| llama 3.2 | 1120 x 1120 | 128K | json |
| minicpm-v 2 | 1344 x 1344 | 4K | text |
| gemma 3 | 896 x 896 | 128K | json |
| llava | 672x672, 336x1344, 1344x336 | ||
| mistral small 3.1 | 14x14 batch (max. 1540x1540) | ||
| granite 3.2 vision | 1152x1152 |