Ollama

Beginners Model Providers

Ollama is easy to use but slow in delivering new features. It supports a lot of models:

Other Model Providers for beginners:

Model Interfaces

Mix and Match AI

Below are some preferred AI Model standards:

1. Model File Format

2. Parameters

  • 3B or above
    For general purpose models having at least 3B parameters is necessary for acceptable performance with 2024-10 technologies. Specialised models can have substantially less parameters.

3. Quantization

Translation

Collections

Models

BigTranslate

BigTranslate is from Institute of Automation of the Chinese Academy of Sciences (CASIA).

Code:

References:

Llama 3

Llama 3 supports multiple languages:

  • English
  • Spanish
  • French
  • German
  • Italian
  • Portuguese
  • Dutch
  • Russian
  • Chinese
  • Japanese
  • Korean

As of 2024-07-20 it has 3 different sizes: 8 Billion (available), 70 Billion (available), 400 Billion (almost there!) parameters.

References:

Open Source Models

Despite its Open AI name, the ChatGPT it developed is not open sourced, but open sourced Large Language Models are being developed quickly by others:

https://medium.com/geekculture/list-of-open-sourced-fine-tuned-large-language-models-llm-8d95a2e0dc76

1. Llama

As of 2024-08-12 the default LLM model is Llama 3.1

Data Cut-off Month: 2023-12

1. Stanford Alpaca

Promising for non-commercial applications.

Interesting how they took it down after short time online:

Can be used on less powerful hardware:

2. FLAN UL2

Promising for commercial applications.

20 Billion parameters can be a bit heavy but the gains may be worth it over the older and leaner FLAN-T5 it is based on.

Retrieval Augmented Generation

Without Embedding

https://cobusgreyling.medium.com/prompt-rag-vector-embedding-free-retrieval-augmented-generation-c37446b43cdd

Structured Outputs

OpenAI APIs

Ollama has limited support for some OpenAI APIs:

But if that is not absolutely necessary, that just stick to Ollama's own API:

Multiple GPUs

There is less support in Ollama than vLLM for running the SAME model across multiple GPUs, some related parameters are:

  1. OLLAMA_NUM_PARALLEL
    Maximum number of parallel requests

  2. OLLAMA_SCHED_SPREAD
    Always schedule model across all GPUs

There are plans to improve on the situation:

  1. https://github.com/ollama/ollama/pull/6729

Vision Models

As of 2025-05-10, Ollama still CANNOT run llama3.2-vision model in parallel:

model parallel parameters size update
llama3.2-vision:11b-instruct-q4_K_M no 11b 7.8 GB 6 months
minicpm-v:8b-2.6-q4_K_M yes 8b 5.7 GB 5 months
gemma3:12b-it-q4_K_M yes 12b 8.1 GB 6 weeks
granite3.2-vision:2b-q4_K_M yes 8b 4.9 GB 2 months
moondream:1.8b-v2-q4_K_M yes 1.8b 829 MB 12 months
llava-phi3:3.8b-mini-q4_0 yes 3.8b 2.9 GB 12 months
llava-llama3:8b-v1.1-q4_0 yes 8b 5.5 GB 12 months
llava:13b-v1.6-vicuna-q4_K_M yes 13b 8.5 GB 15 months
bakllava:7b-v1-q4_K_M yes 7b 5.0 GB 17 months

Resolution

All VLLM behavie differently, below is a VERY rough guide:

Model Max Image Resolution Max Input Tokens Structured Output
llama 3.2 1120 x 1120 128K json
minicpm-v 2 1344 x 1344 4K text
gemma 3 896 x 896 128K json
llava 672x672, 336x1344, 1344x336
mistral small 3.1 14x14 batch (max. 1540x1540)
granite 3.2 vision 1152x1152

Ollama Proxy

Ollama Cluster