Ollama

Default Model Provider

More and more high-quality open models are becoming available to Compute Owners:

These two lightweight models are available as the baseline by default (see the pull sketch after the list):

  • Llama 3.1 8B Instruct Q4_K_M - 4.9 GB
  • Qwen 2.5 7B Instruct Q4_K_M - 4.7 GB
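
For reference, pulling the two baseline models with the ollama Python client could look like the sketch below. This is a minimal sketch: the exact registry tags are assumptions and should be verified against the Ollama library.

  # Assumes the `ollama` Python client (pip install ollama) and that the
  # baseline models are published under these tags (assumed, please verify).
  import ollama

  for tag in ("llama3.1:8b-instruct-q4_K_M", "qwen2.5:7b-instruct-q4_K_M"):
      ollama.pull(tag)  # downloads the model if not already cached locally
      print(f"pulled {tag}")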

Notes:

  1. Models with a higher number of parameters are recommended if your GPU has more VRAM available.
  2. Use language-tuned models,
    e.g. a Chinese-tuned version of the Llama 3.1 8B above.

Alternative Model Providers

There are other Model Providers:

Ollama is the default as of 2024-08-24.

Model Interfaces

Must support Retrieval Augmented Generation (RAG), since traditional LLMs are difficult for non-technical Compute Owners to customise (e.g. by fine-tuning GPT-3.5).

Open-WebUI is the default as of 2024-12-10.

Mix and Match AI

Below are some preferred AI Model standards:

1. Model File Format

2. Parameters

  • 3B or above
    For general-purpose models, at least 3B parameters are necessary for acceptable performance with 2024-10 technologies. Specialised models can have substantially fewer parameters.

3. Quantization

Translation

Collections

Models

BigTranslate

BigTranslate is from the Institute of Automation of the Chinese Academy of Sciences (CASIA).

Code:

References:

Llama 3

Llama 3 supports multiple languages:

  • English
  • Spanish
  • French
  • German
  • Italian
  • Portuguese
  • Dutch
  • Russian
  • Chinese
  • Japanese
  • Korean

As of 2024-07-20 it comes in 3 different sizes: 8 billion (available), 70 billion (available), and 400 billion (almost there!) parameters.

References:

Open Source Models

Despite OpenAI's name, the ChatGPT it developed is not open source, but open-source Large Language Models are being developed quickly by others:

1. Llama

As of 2024-08-12 the default LLM is Llama 3.1.

Data Cut-off Month: 2023-12

2. Stanford Alpaca

Promising for non-commercial applications.

Interesting how they took it down after a short time online:

Can be used on less powerful hardware:

3. FLAN UL2

Promising for commercial applications.

20 billion parameters can be a bit heavy, but the gains may be worth it over the older and leaner FLAN-T5 it is based on.

Retrieval Augmented Generation

Without Embedding
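
"Without embedding" can be read as stuffing retrieved text directly into the prompt instead of going through a vector store. A minimal sketch of that reading, assuming the ollama Python client, an assumed model tag, and a hypothetical in-memory corpus:

  import ollama

  # Hypothetical corpus; in practice these would be your own documents.
  docs = {
      "gpu.txt": "OLLAMA_SCHED_SPREAD=1 schedules a model across all GPUs.",
      "vision.txt": "llava accepts 672x672, 336x1344 or 1344x336 images.",
  }

  def retrieve(question, k=2):
      # Rank documents by naive keyword overlap with the question.
      words = set(question.lower().split())
      ranked = sorted(docs.values(),
                      key=lambda text: -len(words & set(text.lower().split())))
      return ranked[:k]

  question = "How do I spread one model across all GPUs?"
  context = "\n".join(retrieve(question))
  reply = ollama.chat(
      model="llama3.1:8b-instruct-q4_K_M",  # assumed local tag
      messages=[{"role": "user",
                 "content": f"Answer from this context only:\n{context}\n\nQ: {question}"}],
  )
  print(reply["message"]["content"])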

Structured Outputs

OpenAI APIs

Ollama has limited support for some OpenAI APIs:
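
For example, the OpenAI Python client can be pointed at Ollama's OpenAI-compatible endpoint. A minimal sketch, assuming Ollama is serving on its default port 11434 and the model tag below is installed locally:

  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:11434/v1",
      api_key="ollama",  # the client requires a key; Ollama ignores it
  )
  resp = client.chat.completions.create(
      model="llama3.1:8b-instruct-q4_K_M",  # assumed local tag
      messages=[{"role": "user", "content": "Say hello in one word."}],
  )
  print(resp.choices[0].message.content)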

But if that is not absolutely necessary, just stick to Ollama's own API:
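
A minimal sketch of Ollama's own API, here also requesting a structured (JSON) reply per the Structured Outputs heading above; this assumes a recent Ollama whose format parameter accepts a JSON schema, plus an assumed model tag:

  import json
  import ollama

  schema = {
      "type": "object",
      "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
      "required": ["city", "country"],
  }
  resp = ollama.chat(
      model="llama3.1:8b-instruct-q4_K_M",  # assumed local tag
      messages=[{"role": "user", "content": "Name any capital city as JSON."}],
      format=schema,  # constrain the reply to match the schema
  )
  print(json.loads(resp["message"]["content"]))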

Multiple GPUs

There is less support in Ollama than in vLLM for running the SAME model across multiple GPUs; some related parameters are listed below, followed by a launch sketch:

  1. OLLAMA_NUM_PARALLEL
    Maximum number of parallel requests

  2. OLLAMA_SCHED_SPREAD
    Always schedule model across all GPUs
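
One way to apply these, sketched with Python's standard library (the values are illustrative, not recommendations):

  import os
  import subprocess

  env = dict(
      os.environ,
      OLLAMA_NUM_PARALLEL="4",   # maximum number of parallel requests
      OLLAMA_SCHED_SPREAD="1",   # always schedule the model across all GPUs
  )
  subprocess.run(["ollama", "serve"], env=env)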

There are plans to improve on the situation:

  1. Feature: Add Support for Distributed Inferencing by ecyht2 · Pull Request #6729 · ollama/ollama · GitHub

Vision Models

As of 2025-05-10, Ollama still CANNOT run the llama3.2-vision model in parallel; see the table and the usage sketch below:

model                                parallel  parameters  size    last updated
llama3.2-vision:11b-instruct-q4_K_M  no        11b         7.8 GB  6 months
minicpm-v:8b-2.6-q4_K_M              yes       8b          5.7 GB  5 months
gemma3:12b-it-q4_K_M                 yes       12b         8.1 GB  6 weeks
granite3.2-vision:2b-q4_K_M          yes       2b          4.9 GB  2 months
moondream:1.8b-v2-q4_K_M             yes       1.8b        829 MB  12 months
llava-phi3:3.8b-mini-q4_0            yes       3.8b        2.9 GB  12 months
llava-llama3:8b-v1.1-q4_0            yes       8b          5.5 GB  12 months
llava:13b-v1.6-vicuna-q4_K_M         yes       13b         8.5 GB  15 months
bakllava:7b-v1-q4_K_M                yes       7b          5.0 GB  17 months
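
Querying one of the parallel-capable models from the table could look like this sketch, assuming the ollama Python client and a local image file (the path is hypothetical):

  import ollama

  resp = ollama.chat(
      model="minicpm-v:8b-2.6-q4_K_M",  # tag taken from the table above
      messages=[{"role": "user",
                 "content": "Describe this image in one sentence.",
                 "images": ["./example.jpg"]}],  # hypothetical local path
  )
  print(resp["message"]["content"])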

Resolution

All vision LLMs (VLMs) behave differently; below is a VERY rough guide, with a resize sketch after the table:

Model               Max Image Resolution            Max Input Tokens  Structured Output
llama 3.2           1120x1120                       128K              json
minicpm-v 2         1344x1344                       4K                text
gemma 3             896x896                         128K              json
llava               672x672, 336x1344, 1344x336
mistral small 3.1   14x14 patches (max. 1540x1540)
granite 3.2 vision  1152x1152
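
Vision models typically downscale or tile oversized images anyway, so it can help to resize them client-side first. A minimal sketch with Pillow, using the gemma 3 limit from the table (the image path is hypothetical):

  from PIL import Image

  MAX_SIDE = 896  # rough gemma 3 limit per the table above

  img = Image.open("./example.jpg")
  img.thumbnail((MAX_SIDE, MAX_SIDE))  # shrinks in place, keeps aspect ratio
  img.save("./example_small.jpg")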