AI workstation [GPU it#4]

Last summer I decided to build an AI Workstation for myself with two goals:
1) Let me play not only with models but also learn about the various hardware configurations that are usually hidden from you when you use the cloud.
2) Owned hardware is a one-time cost; I don’t need another subscription.

I haven’t owned a desktop machine in the last 20 years. I’ve only used laptops (especially MacBooks), so the new standards, connector types, and sockets were all new to me. Thankfully, there are plenty of blog posts out there to guide you to a reasonable config. But let’s concentrate on the GPUs and save the rest of the system for later.

it#1

The setup started with dual RTX 2080 Ti 11 GB cards connected with NVLink. It was reasonable to start with, but you were very limited by the memory. NVLink looked neat, and you could measure the data-transfer difference with the NCCL tests, but I haven’t seen any impact on training or inference performance.
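If you want to see that transfer difference yourself without the full NCCL test suite, a small timing loop is enough. This is a minimal sketch assuming PyTorch with CUDA and both cards visible; the numbers will differ depending on whether the copy goes over NVLink or the PCIe bus.

```python
# Minimal sketch: time a GPU0 -> GPU1 copy to compare transfer bandwidth
# with and without NVLink. Assumes PyTorch with CUDA and two visible GPUs.
import time
import torch

assert torch.cuda.device_count() >= 2, "need at least two GPUs"

size_mb = 1024
x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device="cuda:0")

_ = x.to("cuda:1")                 # warm-up copy, excluded from timing
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

t0 = time.perf_counter()
_ = x.to("cuda:1")
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - t0

print(f"{size_mb} MB in {elapsed * 1000:.2f} ms "
      f"-> {size_mb / 1024 / elapsed:.2f} GB/s")
```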

RTX 2080Ti

  • TU102 Turing
  • 12 nm
  • Base Clock 1350 MHz
  • Boost Clock 1545 MHz
  • Memory Size 11 GB
  • Memory Type GDDR6
  • Memory Bus 352 bit
  • Bandwidth 616.0 GB/s
  • 4352 CUDA cores

it#2

Because of the memory needs, I looked to add something richer in RAM. The couple-of-generations-older Tesla P40 with 24 GB of memory was an easy choice considering the price was around $200/unit. There was a caveat though: cooling. Server-grade GPUs rely on the super-loud, high-pressure fans of the server chassis. I purchased a 3D-printed adapter and some aftermarket high-pressure fans, but the noise was unbearable (remember, we’re talking about a Workstation that has to co-exist with me in the same room). There were so many options and designs, and it was a pain to order them, test them, and send back the ones that did not meet my expectations. Finally, I found the least noisy one. It had a flat design, so it allowed me to stack the cards tightly. This was also the time when I invested in a 3D printer :).
Here’s the 3D model of the winning cooling adapter for the Tesla P40.

I had 2x Teslas, so the PCIe slots were tightly packed, but the lower 2080’s fans struggled to pull enough air through the narrow gap between the cards, and the cards thermal-throttled. This hit the primary 2080 especially, since it also drove the displays. (And another not-so-scientific problem: I had also started gaming again, and the 2080 couldn’t keep up with my needs.)
To “unchoke” the primary card, I moved one of the P40s to the top of the case. It helped, a little.
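If you want to confirm the cards are actually thermal-throttling rather than just running warm, a small polling loop over NVML shows per-card temperature and SM clock. A rough sketch using the pynvml bindings (pip install nvidia-ml-py), not a polished monitoring tool:

```python
# Minimal sketch: poll temperature and SM clock per GPU to spot thermal
# throttling (clocks dropping while the temperature stays pinned).
import time
import pynvml

pynvml.nvmlInit()
try:
    while True:
        for i in range(pynvml.nvmlDeviceGetCount()):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            clock = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
            print(f"GPU{i}: {temp} C, SM clock {clock} MHz")
        print("---")
        time.sleep(2)
finally:
    pynvml.nvmlShutdown()
```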

The config was not ideal, but it was neat to see 4 GPUs listed in nvidia-smi’s output. It also provided reasonable inference speed with LMStudio offloading the models to the GPUs.
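A quick sanity check from Python gives the same enumeration nvidia-smi does; a minimal sketch assuming PyTorch is installed:

```python
# Minimal sketch: list every CUDA device visible to PyTorch with its memory,
# i.e. the same cards nvidia-smi shows, before pointing LMStudio (or anything
# else) at them.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} {props.name}: "
          f"{props.total_memory / 1024**3:.1f} GB, "
          f"{props.multi_processor_count} SMs")
```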

Anyone looking to use a Tesla or any other older NVidia AI accelerator should be aware that the latest drivers no longer support these older generations on Windows. On Linux, you’ll be fine with version 525.125.06; that’s what I’m using on Debian 12.
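To double-check which driver actually loaded after pinning it, a one-liner over NVML works; a sketch using the pynvml bindings:

```python
# Minimal sketch: print the loaded driver version, e.g. to confirm the box
# is still on the 525.x branch that supports the Pascal-era Teslas.
import pynvml

pynvml.nvmlInit()
print("driver:", pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()
```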

Tesla P40

  • GP102 Pascal
  • 16 nm
  • Base Clock 1303 MHz
  • Boost Clock 1531 MHz
  • Memory Size 24 GB
  • Memory Type GDDR5
  • Memory Bus 384 bit
  • Bandwidth 347.1 GB/s
  • 3840 CUDA cores

it#3

The huge step-up was replacing the dual 2080 Tis with a single 4080. The card has a massive form factor: three slots tall (covering the 2nd PCIe slot from the top) and wider than any other card I’ve ever seen. It also brings a massive increase in CUDA cores, so local inference has been significantly faster. And it looks amazing too.

The downside: it’s not a 4090. It’s about 6,000 short on CUDA cores and 8 GB short on memory. I initially chose the 4080 over the 4090 because of the price and the concerns around the melting-power-connector news. Since the purchase, NVidia has announced the 4080 Super and decreased the prices overall (if anything is available outside the gray market).
Note: even if on paper the 4080 Super should perform better than the original 4080, benefiting from the slight clock increase and the extra 512 CUDA cores, its power consumption is limited to 280 W compared to the 320 W of the original 4080. Some tests online showed that the Super might throttle earlier and cannot use all of its power. I haven’t tested this myself, though.
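If you want to check what your own card is actually limited to rather than trusting spec sheets, NVML exposes the enforced power limit; a sketch via the pynvml bindings (values come back in milliwatts):

```python
# Minimal sketch: print the enforced power limit per GPU (NVML reports mW).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000
    print(f"GPU{i}: enforced power limit {limit_w:.0f} W")
pynvml.nvmlShutdown()
```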

The downside of the Teslas: there’s a huge gap between their compute performance and bandwidth and the 4080’s, which has slowed down my training efforts, as the 4080 has mainly just waited for the P40s to complete their batch.
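That waiting is inherent to data-parallel training: the gradient all-reduce at the end of every step synchronizes the ranks, so each step runs at the pace of the slowest card. To quantify how far apart the cards are, you can time the same forward/backward pass on each GPU separately; a rough sketch with a made-up toy model (the layer sizes and batch size are arbitrary, just for illustration):

```python
# Minimal sketch: time one forward+backward of the same toy model on every
# visible GPU. In data-parallel training the step time tracks the slowest card.
import time
import torch
import torch.nn as nn

def avg_step_ms(device: str, iters: int = 20) -> float:
    model = nn.Sequential(
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 1024),
    ).to(device)
    x = torch.randn(64, 4096, device=device)

    model(x).sum().backward()          # warm-up iteration, not timed
    torch.cuda.synchronize(device)

    t0 = time.perf_counter()
    for _ in range(iters):
        model.zero_grad(set_to_none=True)
        model(x).sum().backward()
    torch.cuda.synchronize(device)
    return (time.perf_counter() - t0) / iters * 1000

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    print(f"cuda:{i} ({name}): {avg_step_ms(f'cuda:{i}'):.1f} ms/step")
```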

RTX 4080

  • AD103 Ada Lovelace
  • 5 nm
  • Base Clock 2205 MHz
  • Boost Clock 2505 MHz
  • Memory Size 16 GB
  • Memory Type GDDR6X
  • Memory Bus 256 bit
  • Bandwidth 716.8 GB/s
  • 9728 CUDA cores

it#4 – in progress

So the P40s should go. The 4080 is pretty fast, but with one or two more comparable GPUs I could speed up my experiments and toy projects. More 4080s? Only one more would fit, and I’d still be stuck with 16 GB per card. Replace everything with 2× 4090s? $$$, just too much, plus the lack of availability. A 3090 would be reasonable: compute is comparable with the 4080, and it offers 24 GB.

Finally, I found the exact model: the Gigabyte RTX 3090 Turbo.

The huge benefit of this card is that Gigabyte produced it in a two-slot form factor with a blower-fan design, basically optimizing a consumer-grade product for tightly packed server use. The blower intake has a lower profile, which I assume is intended to solve the “choking” issue I faced with the 2080 Tis in a tightly packed configuration. Additionally, I’ll win back driver support under Windows!
Unfortunately, the card was discontinued early (here’s the article about it), but a few are still available on eBay. I’ve managed to get one for under $1k and am now waiting for the delivery. If it turns out well, the plan is to add another and get back to a 3-GPU setup.

RTX 3090

  • GA102 Ampere
  • 8 nm
  • Base Clock 1395 MHz
  • Boost Clock 1695 MHz
  • Memory Size 24 GB
  • Memory Type GDDR6X
  • Memory Bus 384 bit
  • Bandwidth 936.2 GB/s
  • 10496 CUDA cores