AI Hardware

If you are buying new AI hardware to share with other Compute Owners, the following hardware is recommended as of 2025-06-01 for delivering the most performance per watt:

RAM Size   Hardware Model
16 GB      Nvidia RTX 5060 Ti
64 GB      Apple M4 Pro

Reference low-power AI hardware:

AI Hardware    RAM Size   RAM Width   RAM Bandwidth   Maths Cores   Neural Cores
RTX 4060 Ti    16 GB      128-bit     288 GB/s        4352 CUDA     136 Tensor
RTX 5060 Ti    16 GB      128-bit     448 GB/s        4608 CUDA     144 Tensor
RX 7900 XTX    24 GB      384-bit     960 GB/s        6144 Stream   192 AI
M4 Pro         64 GB      256-bit     273 GB/s        2048 ALU      16 Neural
M4 Max         128 GB     512-bit     546 GB/s        5120 ALU      16 Neural
DGX Spark      128 GB     256-bit     273 GB/s        6144 CUDA     192 Tensor

Note that each Apple M4 GPU core contains 128 ALUs (arithmetic logic units), so the 2048 ALUs listed for the M4 Pro correspond to a 16-core GPU and the 5120 ALUs for the M4 Max to a 40-core GPU.

AI Cluster

You are welcome to build clusters using any hardware; the options below are simply the ones we have experience with.

Existing Frameworks

  • Nvidia CUDA
  • AMD ROCm
  • Apple MLX

VRAM Size

GPUs with 6 GB of VRAM or more are recommended; having at least 16 GB is preferred.

VRAM usage estimation for inference (a rough estimator sketch follows this list):

  1. 32-bit parameters use 4GB of VRAM per billion
  2. 16-bit parameters use 2GB of VRAM per billion
  3. 4-bit parameters use 0.5GB of VRAM per billion
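
A minimal sketch of this rule of thumb, assuming a simple bytes-per-parameter lookup; the model sizes in the usage lines are just examples, and quantised formats carry extra overhead for scales and the KV cache, so treat the result as a lower bound:

```python
# Rough VRAM estimate for inference weights, based on bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int4": 0.5}

def estimate_vram_gb(params_billions: float, precision: str) -> float:
    """Approximate VRAM needed just to hold the model weights, in GB."""
    return params_billions * BYTES_PER_PARAM[precision]

print(estimate_vram_gb(8, "fp16"))   # 8B-parameter model at 16-bit -> ~16 GB
print(estimate_vram_gb(70, "int4"))  # 70B-parameter model at 4-bit -> ~35 GB
```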

Setup Guide

Multiple GPUs

Combining the VRAM of multiple GPUs is often much cheaper than purchasing a single GPU with a large amount of VRAM.

Data Transfers

The amount of data transmitted between GPUs for inference depends on where the model is split and the size of the intermediate outputs.

Factors determining the data transfer size (a sketch combining them follows this list):

  • Model split point: The data transferred at each split point is the output of the last layer processed on one GPU, which must be sent to the next GPU for processing. The more places the model is split, the more transfers occur.
  • Activation size: The size of the intermediate tensors, such as the hidden states, determines the volume of data. This can be estimated as batch_size * seq_len * hidden_size, where hidden_size is the number of features in the hidden state.
  • Batch size: A larger batch size increases the number of parallel inferences, which in turn increases the amount of data that needs to be passed between GPUs.
  • Data type: The precision of the data (e.g., FP16, FP32, or 8-bit integers) affects the final data size. For instance, using FP32 will result in twice the data size of FP16.
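
Putting those factors together, here is a minimal sketch of the estimate; the helper name and the model dimensions (hidden_size 8192, FP16 activations) are illustrative assumptions, not measurements:

```python
# Estimated data crossing the split point per forward pass:
# batch_size * seq_len * hidden_size * bytes per element.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}

def split_transfer_bytes(batch_size: int, seq_len: int, hidden_size: int, dtype: str) -> int:
    return batch_size * seq_len * hidden_size * BYTES_PER_ELEMENT[dtype]

# Prefilling a long prompt vs. decoding a single token, FP16 hidden states:
print(split_transfer_bytes(8, 4096, 8192, "fp16") / 1e6, "MB")  # ~537 MB per pass
print(split_transfer_bytes(8, 1, 8192, "fp16") / 1e3, "KB")     # ~131 KB per token
```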

Per forward pass, the transfer is roughly the size of the split layer's output activations (e.g., the hidden states) multiplied by the batch size and the bytes per element of the data type, which can range from a few kilobytes when decoding a single token to gigabytes when prefilling a large batch (a minimal sketch follows these steps):

  1. A batch of data is sent to GPU 1, which processes a subset of the layers.
  2. The output (activations) from the last layer on GPU 1 is then transferred to GPU 2.
  3. GPU 2 then processes its subset of layers with the data it received.
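
A minimal PyTorch sketch of this pattern, assuming two CUDA devices are available; the two-stage model and its layer sizes are made up purely to show where the activations cross between GPUs:

```python
# Toy pipeline split across two GPUs: stage 1 on cuda:0, stage 2 on cuda:1.
# Requires PyTorch and two CUDA devices; the layers stand in for a real model's blocks.
import torch
import torch.nn as nn

hidden_size = 4096
stage_1 = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU()).to("cuda:0")
stage_2 = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU()).to("cuda:1")

batch = torch.randn(8, 128, hidden_size, device="cuda:0")  # batch_size=8, seq_len=128

with torch.no_grad():
    activations = stage_1(batch)            # step 1: GPU 1 runs its subset of layers
    activations = activations.to("cuda:1")  # step 2: hidden states cross to GPU 2
    output = stage_2(activations)           # step 3: GPU 2 runs the remaining layers
```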

This data transfer introduces overhead that can significantly slow down inference, especially over a PCIe bus whose bandwidth is far lower than the GPU's VRAM bandwidth.
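
As a rough back-of-the-envelope comparison, assuming a nominal PCIe 4.0 x16 link at about 32 GB/s and the RTX 5060 Ti's 448 GB/s VRAM bandwidth from the table above (real-world throughput will be lower on both):

```python
# Time to move one batch of FP16 activations over PCIe vs. reading it from VRAM.
activation_bytes = 8 * 128 * 4096 * 2  # batch 8, seq_len 128, hidden_size 4096, FP16
pcie_bw = 32e9                         # PCIe 4.0 x16, nominal bytes/s
vram_bw = 448e9                        # RTX 5060 Ti VRAM bandwidth, bytes/s

print(f"activation size: {activation_bytes / 1e6:.1f} MB")
print(f"PCIe transfer:   {activation_bytes / pcie_bw * 1e3:.3f} ms")
print(f"VRAM read:       {activation_bytes / vram_bw * 1e3:.3f} ms")
```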