If you are buying new AI hardware to share with other Compute Owners, the following models are recommended as of 2025-06-01 because they deliver the most performance for the least power consumption:
| RAM Size | Hardware Model |
| --- | --- |
| 16 GB | Nvidia RTX 5060 Ti |
| 64 GB | Apple M4 Pro |
Reference low-power AI hardware:
| AI Hardware | RAM Size | RAM Width (bits) | RAM Bandwidth | Maths Cores | Neural Cores |
| --- | --- | --- | --- | --- | --- |
| RTX 4060 Ti | 16 GB | 128 | 288 GB/s | 4352 CUDA | 136 Tensor |
| RTX 5060 Ti | 16 GB | 128 | 448 GB/s | 4608 CUDA | 144 Tensor |
| RX 7900 XTX | 24 GB | 384 | 960 GB/s | 6144 Stream | 192 AI |
| M4 Pro | 64 GB | 256 | 273 GB/s | 2048 ALU | 16 Neural |
| M4 Max | 128 GB | 512 | 546 GB/s | 5120 ALU | 16 Neural |
| DGX Spark | 128 GB | 256 | 273 GB/s | 6144 CUDA | 192 Tensor |
Note: each Apple M4-series GPU core contains 128 ALUs (arithmetic logic units), so the ALU counts above equal the GPU core count multiplied by 128.
Combining the VRAM of multiple GPUs is often much cheaper than purchasing a single GPU with an equivalently large amount of VRAM.
Data Transfers
The amount of data transmitted between GPUs for inference depends on where the model is split and the size of the intermediate outputs.
Factors determining the data transfer size
- Model split point: The data transferred is the output of the last layer processed on one GPU, which must be sent to the next GPU for processing. The more points at which the model is split, the more of these transfers occur.
- Activation size: The size of the intermediate tensors, such as the hidden states, determines the volume of data. It can be estimated as batch_size * seq_len * hidden_size, where hidden_size is the number of features in the hidden state (see the sketch after this list).
- Batch size: A larger batch size increases the number of parallel inferences, which in turn increases the amount of data that must be passed between GPUs.
- Data type: The precision of the data (e.g., FP16, FP32, or 8-bit integers) affects the final data size. For instance, FP32 produces twice the data volume of FP16.
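As a back-of-the-envelope sketch (plain Python; the hidden_size, seq_len, and dtype values are illustrative assumptions, not figures from any particular model), the volume crossing one split point can be estimated directly from these factors:

```python
# Rough estimate of the activation data crossing one pipeline split point.
# All model dimensions below are illustrative assumptions.

def transfer_bytes_per_step(batch_size: int, seq_len: int,
                            hidden_size: int, dtype_bytes: int) -> int:
    """Bytes of hidden-state activations sent across one split point
    in a single forward pass (prefill: seq_len = prompt length;
    decode: seq_len = 1)."""
    return batch_size * seq_len * hidden_size * dtype_bytes

# Example: a hypothetical 8192-wide model running in FP16 (2 bytes/element).
prefill = transfer_bytes_per_step(batch_size=1, seq_len=4096,
                                  hidden_size=8192, dtype_bytes=2)
decode = transfer_bytes_per_step(batch_size=1, seq_len=1,
                                 hidden_size=8192, dtype_bytes=2)
print(f"prefill: {prefill / 2**20:.1f} MiB per forward pass")    # ~64 MiB
print(f"decode:  {decode / 2**10:.1f} KiB per generated token")  # ~16 KiB
```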
Putting these factors together, the transfer at each split point is roughly the size of the layer's output activations (e.g., the hidden states) multiplied by the batch size and the bytes per element of the data type; depending on batch size and sequence length, this ranges from a few kilobytes per generated token up to gigabytes for a long prefill pass. The flow between two GPUs looks like this (a code sketch follows the steps):
1. A batch of data is sent to GPU 1, which processes a subset of the layers.
2. The output (activations) from the last layer on GPU 1 is then transferred to GPU 2.
3. GPU 2 then processes its subset of layers with the data it received.
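A minimal sketch of this flow, assuming PyTorch and two CUDA devices (the layer count, layer type, and hidden size are illustrative only, not a real model):

```python
# Pipeline-style split across two GPUs: GPU 0 runs the first half of the
# layers, GPU 1 the second half, and only the hidden-state activations
# are copied between devices.
import torch
import torch.nn as nn

hidden_size, n_layers = 4096, 8                      # illustrative values
layers = [nn.Linear(hidden_size, hidden_size) for _ in range(n_layers)]

first_half = nn.Sequential(*layers[: n_layers // 2]).to("cuda:0")
second_half = nn.Sequential(*layers[n_layers // 2:]).to("cuda:1")

x = torch.randn(1, 128, hidden_size, device="cuda:0")  # batch of activations
with torch.no_grad():
    h = first_half(x)      # computed entirely in GPU 0's VRAM
    h = h.to("cuda:1")     # only this tensor crosses the interconnect
    out = second_half(h)   # GPU 1 finishes the remaining layers
```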
This data transfer introduces overhead that can significantly slow down inference, especially when it crosses a PCIe bus, whose bandwidth is far lower than the GPU's VRAM bandwidth.
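A rough comparison of the time to move one activation tensor over PCIe versus reading it from local VRAM illustrates the gap; the PCIe figure is a nominal assumption (PCIe 4.0 x16, about 32 GB/s) and the VRAM figure is taken from the RTX 5060 Ti row in the table above:

```python
# Back-of-the-envelope latency comparison for one activation transfer.
activation_bytes = 1 * 4096 * 8192 * 2   # batch 1, 4096 tokens, 8192 hidden, FP16
pcie_bw = 32e9                           # bytes/s, PCIe 4.0 x16 (assumed nominal)
vram_bw = 448e9                          # bytes/s, RTX 5060 Ti (from the table)

print(f"PCIe transfer: {activation_bytes / pcie_bw * 1e3:.2f} ms")  # ~2.1 ms
print(f"VRAM read:     {activation_bytes / vram_bw * 1e3:.2f} ms")  # ~0.15 ms
# The ~14x gap is why every additional split point adds noticeable latency.
```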