If you are buying new AI hardware to share with other Compute Owners, the following models are recommended as of 2025-06-01 because they deliver the most performance for the least power consumption:
| RAM Size | Hardware Model |
| --- | --- |
| 16 GB | Nvidia RTX 5060 Ti |
| 64 GB | Apple M4 Pro |
Reference low-power AI hardware:
| AI Hardware | RAM Size | RAM Width (bits) | RAM Bandwidth | Maths Cores | Neural Cores |
| --- | --- | --- | --- | --- | --- |
| RTX 4060 Ti | 16 GB | 128 | 288 GB/s | 4352 CUDA | 136 Tensor |
| RTX 5060 Ti | 16 GB | 128 | 448 GB/s | 4608 CUDA | 144 Tensor |
| RX 7900 XTX | 24 GB | 384 | 960 GB/s | 6144 Stream | 192 AI |
| M4 Pro | 64 GB | 256 | 273 GB/s | 2048 ALU | 16 Neural |
| M4 Max | 128 GB | 512 | 546 GB/s | 5120 ALU | 16 Neural |
| DGX Spark | 128 GB | 256 | 273 GB/s | 6144 CUDA | 192 Tensor |
Note: each Apple M4-series GPU core contains 128 ALUs (arithmetic logic units), so the ALU counts above equal the GPU core count multiplied by 128.
Combining the VRAM of multiple GPUs is often much cheaper than purchasing a single GPU with an equivalently large amount of VRAM.
Data Transfers
The amount of data transmitted between GPUs for inference depends on where the model is split and the size of the intermediate outputs.
Factors determining the data transfer size
- Model split point: The data transferred is the output of the last layer processed on one GPU, which must be sent to the next GPU for processing. The more points at which the model is split, the more of these transfers occur.
- Activation size: The size of the intermediate tensors, such as the hidden states, determines the volume of data. It can be estimated as batch_size * seq_len * hidden_size, where hidden_size is the number of features in the hidden state (see the sketch after this list).
- Batch size: A larger batch size increases the number of parallel inferences, which in turn increases the amount of data that must be passed between GPUs.
- Data type: The precision of the data (e.g., FP16, FP32, or 8-bit integers) affects the final data size. For instance, FP32 produces twice the data volume of FP16.
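As a back-of-the-envelope sketch (plain Python; the hidden_size, seq_len, and dtype values are illustrative assumptions, not figures from any particular model), the volume crossing one split point can be estimated directly from these factors:

```python
# Rough estimate of the activation data crossing one pipeline split point.
# All model dimensions below are illustrative assumptions.

def transfer_bytes_per_step(batch_size: int, seq_len: int,
                            hidden_size: int, dtype_bytes: int) -> int:
    """Bytes of hidden-state activations sent across one split point
    in a single forward pass (prefill: seq_len = prompt length;
    decode: seq_len = 1)."""
    return batch_size * seq_len * hidden_size * dtype_bytes

# Example: a hypothetical 8192-wide model running in FP16 (2 bytes/element).
prefill = transfer_bytes_per_step(batch_size=1, seq_len=4096,
                                  hidden_size=8192, dtype_bytes=2)
decode = transfer_bytes_per_step(batch_size=1, seq_len=1,
                                 hidden_size=8192, dtype_bytes=2)
print(f"prefill: {prefill / 2**20:.1f} MiB per forward pass")    # ~64 MiB
print(f"decode:  {decode / 2**10:.1f} KiB per generated token")  # ~16 KiB
```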
Putting these factors together, the transfer at each split point is roughly the size of the layer's output activations (e.g., the hidden states) multiplied by the batch size and the bytes per element of the data type; depending on batch size and sequence length, this ranges from a few kilobytes per generated token up to gigabytes for a long prefill pass. The flow between two GPUs looks like this (a code sketch follows the steps):
1. A batch of data is sent to GPU 1, which processes a subset of the layers.
2. The output (activations) from the last layer on GPU 1 is then transferred to GPU 2.
3. GPU 2 then processes its subset of layers with the data it received.
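A minimal sketch of this flow, assuming PyTorch and two CUDA devices (the layer count, layer type, and hidden size are illustrative only, not a real model):

```python
# Pipeline-style split across two GPUs: GPU 0 runs the first half of the
# layers, GPU 1 the second half, and only the hidden-state activations
# are copied between devices.
import torch
import torch.nn as nn

hidden_size, n_layers = 4096, 8                      # illustrative values
layers = [nn.Linear(hidden_size, hidden_size) for _ in range(n_layers)]

first_half = nn.Sequential(*layers[: n_layers // 2]).to("cuda:0")
second_half = nn.Sequential(*layers[n_layers // 2:]).to("cuda:1")

x = torch.randn(1, 128, hidden_size, device="cuda:0")  # batch of activations
with torch.no_grad():
    h = first_half(x)      # computed entirely in GPU 0's VRAM
    h = h.to("cuda:1")     # only this tensor crosses the interconnect
    out = second_half(h)   # GPU 1 finishes the remaining layers
```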
This data transfer introduces overhead that can significantly slow down inference, especially when it crosses a PCIe bus, whose bandwidth is far lower than the GPU's VRAM bandwidth.
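A rough comparison of the time to move one activation tensor over PCIe versus reading it from local VRAM illustrates the gap; the PCIe figure is a nominal assumption (PCIe 4.0 x16, about 32 GB/s) and the VRAM figure is taken from the RTX 5060 Ti row in the table above:

```python
# Back-of-the-envelope latency comparison for one activation transfer.
activation_bytes = 1 * 4096 * 8192 * 2   # batch 1, 4096 tokens, 8192 hidden, FP16
pcie_bw = 32e9                           # bytes/s, PCIe 4.0 x16 (assumed nominal)
vram_bw = 448e9                          # bytes/s, RTX 5060 Ti (from the table)

print(f"PCIe transfer: {activation_bytes / pcie_bw * 1e3:.2f} ms")  # ~2.1 ms
print(f"VRAM read:     {activation_bytes / vram_bw * 1e3:.2f} ms")  # ~0.15 ms
# The ~14x gap is why every additional split point adds noticeable latency.
```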