- Summary:
- The current house server is being used to run various tasks, but its age and limitations are becoming apparent. It has been running continuously for almost twelve years, with an SSD that still has decades of life remaining. However, the mainboard and CPU are outdated, and a replacement is being considered.
Tuesday 17 March 2026: 21:28.
My current house server is very old: I wrote up an entry on it here on the 10th April 2014, so it is almost exactly twelve years old. It has been powered on for almost all of that time, and its current SSD (not its original) reports 98,000 power-on hours. That SSD, a 128 GB Samsung 830, has written about 105 TB in its life, and its SMART data reckons decades of life remain for it – that SSD model was massively overengineered, and more than 2 PB of write endurance would be expected for that specific drive, so we are only about 5% through its lifetime. The mainboard is the (very popular at the time) Supermicro X10SL7-F, the CPU is a quad-core Intel Xeon E3-1230 v3 – a Haswell part, the last really good new CPU architecture from Intel – and it is fitted with 32 GB of ECC RAM, the maximum possible.
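As a sanity check on those SMART figures, the wear arithmetic is simple enough to sketch. This is illustrative only – the numbers are copied from the paragraph above, not read from any SMART tool:

```python
# Rough check of the SSD wear figures quoted above.
tb_written = 105      # TB written so far, per SMART
endurance_tb = 2000   # ~2 PB expected endurance for this drive model
hours_on = 98_000     # power-on hours reported by SMART

used_fraction = tb_written / endurance_tb                  # ~5% used
write_rate_tb_per_year = tb_written / (hours_on / 8760)    # ~9.4 TB/year
years_remaining = (endurance_tb - tb_written) / write_rate_tb_per_year

print(f"{used_fraction:.1%} of endurance used")
print(f"~{years_remaining:.0f} years of writes remaining at this rate")
```

At the historical write rate, "decades more of life" is if anything an understatement: the endurance budget outlasts the century.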
As much as that server was expensive at the time of purchase, nobody can now say it wasn’t value for money: it has been utterly reliable and trouble free in the past twelve years, and fast enough for what I’ve wanted it for up until LLMs appeared. It also isn’t bad for the idle power consumption, recent Linux kernels have it down to 41 watts or so even with the constantly spinning ZFS array which is currently two 26 Tb drives.
But it is getting a bit long in the tooth, and I intend to upgrade it sometime after we move into the new house. My main use case for its replacement is a ‘Star Trek’-like house computer which is always listening and able to respond at any time, on any topic. For that, you’re going to need a frontier-approaching MoE model of at least 200 billion parameters, which ideally means 256 GB of RAM with enough bandwidth and enough compute. Additionally, the MoE model needs to be specifically designed not to suck on consumer-grade CPUs, and while there aren’t many of those, there are some. One I have therefore been watching closely is Step 3.5 Flash, which has among the least-bad performance for a ~200B model running on CPUs only.
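Why a MoE specifically? A rough rule of thumb: generation on this class of hardware is memory-bound, so tokens/sec is capped by bandwidth divided by the bytes of *active* parameters read per token, while RAM has to hold *all* of them. A back-of-envelope sketch, assuming a hypothetical 200B-total / 11B-active model at roughly Q4 (the active-parameter count and bytes-per-parameter figures are my assumptions, not published specs):

```python
# Back-of-envelope MoE sizing: a hypothetical ~200B-total model.
total_params = 200e9
active_params = 11e9    # assumption: parameters touched per generated token
bytes_per_param = 0.55  # roughly Q4 quantisation including overhead

# All weights must fit in RAM...
ram_needed_gb = total_params * bytes_per_param / 1e9  # ~110 GB

# ...but each token only has to stream the active experts through memory.
def gen_tps_upper_bound(bandwidth_gb_s):
    """Memory-bound estimate: one full read of active weights per token."""
    return bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)

print(f"~{ram_needed_gb:.0f} GB RAM to hold the weights")
print(f"~{gen_tps_upper_bound(215):.0f} tok/s ceiling at 215 GB/s")
```

A dense 200B model would need to stream all ~110 GB per token and would crawl; the sparse activation is what makes CPU-class bandwidth survivable at all.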
Right now, the hardware list for house servers with a CPU powerful enough to run LLMs is very short: exactly four options in 2026, and following almost the same table format from the last entry:
| Item | Parse toks/sec per euro | Price (EUR) | Launch year | RAM GB | Bandwidth GB/sec | Idle power watts | Full power watts | FP16 TFLOPS | llama2 7b parse | llama2 7b gen | Other notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mac Studio M3 Ultra | 0.160 | 9274 | 2025 | 256 | 800 | 9 | 200 | 57 | 1488 | 64 | Must store hard drives externally, connected via Thunderbolt |
| Mac Studio M4 Max | 0.211 | 4224 | 2025 | 128 | 546 | 10 | 200 | 28 | 892 | 54 | Surprisingly worse parse performance than the AMD: not enough GPU cores |
| AMD AI Max+ 395 | 0.416 | 3102 | 2025 | 128 | 215 | 12 | 180 | 59 | 1289 | 54 | Standard Mini-ITX! Due to low bandwidth only suits MoE models; a replacement with max 192 GB RAM is expected in 2027 |
| nVidia DGX Spark | 0.622 | 4919 | 2025 | 128 | 273 | 25 | 200 | 127 | 3062 | 57 | Unsure about kernel support longevity |
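The value column in that table is simply the Llama 2 7B parse speed divided by the price; recomputing it from the other columns:

```python
# Recomputing 'parse toks/sec per euro': llama2 7b parse speed / price.
# Figures copied from the table above.
hardware = {
    "Mac Studio M3 Ultra": (1488, 9274),
    "Mac Studio M4 Max":   (892, 4224),
    "AMD AI Max+ 395":     (1289, 3102),
    "nVidia DGX Spark":    (3062, 4919),
}
value = {name: parse / price for name, (parse, price) in hardware.items()}
best = max(value, key=value.get)
print(best, round(value[best], 3))  # the Spark wins on parse value
```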
These are not the best hardware for running LLMs, except in two areas:

- **RAM capacity for your euro.** You can fit the entire model into RAM, which means even the relatively low compute of a CPU relative to a GPU can get you there. Buying this much VRAM costs over ten grand right now, whereas none of the above is that expensive, and they come with a free general-purpose computer, power supply and case.
- **Idle power consumption.** No GPU capable of running LLMs idles below sixteen watts, and usually it is more, so after you add in the idle power consumption of the rest of the server, that adds up. In my future house almost all the electricity will be free of cost from the solar panels, but emitting heat does mess with the thermal balance of the house and could contribute towards overheating in summer. In comparison, all the machines above bar the nVidia idle below twelve watts – which includes their main SSD boot drive.
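To put idle draw into perspective, here is the year-round energy (and continuous heat) each idle wattage represents – a trivial calculation, but it is what drives the thermal point above. The 41 W figure is the current server's; the others come from the table:

```python
# A watt of idle draw is a watt of heat dumped into the house, 24/7.
def annual_kwh(watts):
    """Energy consumed per year at a constant draw (8760 hours/year)."""
    return watts * 8760 / 1000

for name, idle_w in [("current server", 41),
                     ("AMD AI Max+ 395", 12),
                     ("nVidia DGX Spark", 25)]:
    print(f"{name}: {annual_kwh(idle_w):.0f} kWh/year, {idle_w} W of heat")
```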
I managed to find performance benchmarks for some of the hardware above for Step 3.5 Flash and Qwen3 Coder Next:
| | AMD AI Max+ 395 | Apple M3 Ultra | nVidia DGX Spark |
|---|---|---|---|
| Compute units | 16 CPU + 40 GPU | 24 CPU + 80 GPU | 20 CPU + 384 GPU |
| Parse Step 3.5 Flash Q4_K (toks/sec) | 131 | 377 | 530 |
| Gen Step 3.5 Flash Q4_K (toks/sec) | 23 | 33 | 20 |
| Parse Qwen3 Coder Next Q8_0 (toks/sec) | 275 | 1624 | 2162 |
| Gen Qwen3 Coder Next Q8_0 (toks/sec) | 25 | 45 | 37 |
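What those benchmark numbers mean in practice is perceived latency: seconds to parse the prompt, plus seconds to generate the reply. A quick sketch using the Step 3.5 Flash Q4_K figures above, with an assumed 4,000-token prompt and 300-token reply (my illustrative workload, not a benchmark):

```python
# Seconds from hitting enter to a complete reply, ignoring overheads.
def response_time(prompt_tokens, reply_tokens, parse_tps, gen_tps):
    return prompt_tokens / parse_tps + reply_tokens / gen_tps

# Step 3.5 Flash Q4_K parse/gen speeds from the table above.
for name, parse, gen in [("AMD AI Max+ 395", 131, 23),
                         ("Apple M3 Ultra", 377, 33),
                         ("nVidia DGX Spark", 530, 20)]:
    print(f"{name}: {response_time(4000, 300, parse, gen):.0f} s")
```

Note how parse speed dominates for long prompts: the Spark generates more slowly than the M3 Ultra yet lands in a similar place overall, while the Strix Halo's weak parse speed roughly doubles the wait.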
The AMD Strix Halo was originally designed for gaming laptops; AMD only repurposed it into an AI solution quite late in the product cycle. Had they known, they would have given it twice the RAM, RAM bandwidth and GPU cores, and they would have swept the market even charging twice the price.
The reason why is good old-fashioned PC compatibility: none of the above apart from the AMD solution comes in a standard PC motherboard taking standard PC connections and peripherals. If I want to keep my tower case, which has lots of very convenient hard drive bays all of which use SATA/SAS, I am severely limited by anything other than a 100% PC-compatible form factor.
However, as is obvious above, the Strix Halo is underwhelming compared to the other two for 80-billion, never mind 200-billion, parameter models. Even for Qwen3 Coder Next the parse speed is problematic, because Strix Halo’s GPU is a fairly ancient Radeon design underneath, lacking the much-improved FP8 opcodes for token parsing found in newer Radeons.
It is currently expected that the next major successor to the Strix Halo, codenamed ‘Medusa Halo’, will be released in 2027. It should have twice the memory bandwidth, 50% more RAM and 20% more GPU cores, and those cores will be the latest Radeon architecture, so parse speed should take a mighty leap upwards: at least 2x over Strix Halo, and way more again for the Q8 quantisation.
Assuming Apple don’t release a price slashed M5 Ultra – which they might, if they feel this is a market share they can easily grab – and that Intel will remain asleep at the wheel, I guess I’ll be aiming to upgrade the house server to the AMD Medusa Halo architecture in 2027 to 2028.
Here’s hoping that there will be a house for me to put it into by then!