- Summary:
- The current house server is being used to run various tasks, but its age and limitations are becoming apparent. It has been running continuously for almost twelve years, with an SSD that still has decades of life remaining. However, the mainboard and CPU are outdated, and a replacement is being considered.
Tuesday 17 March 2026: 21:28.
My current house server is very old: I wrote up an entry on it here on the 10th April 2014, so it is almost exactly twelve years old. It has been powered on for almost all of that time, and its current SSD (not its original) reports 98,000 power-on hours. That SSD, a 128 GB Samsung 830, has written about 105 TB in its life, and its SMART data reckons decades of life remain for it – that SSD model was massively overengineered, and more than 2 PB of write endurance would be expected for that specific drive, so we are only about 5% through its lifetime. The mainboard is the (very popular at the time) Supermicro X10SL7-F, the CPU is a quad-core Intel Xeon E3-1230 v3 – a Haswell part, the last really good new CPU architecture from Intel – and it is fitted with 32 GB of ECC RAM, the maximum possible.
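As a sanity check on those SMART figures, the wear arithmetic is simple enough to sketch. This is illustrative only – the numbers are copied from the paragraph above, not read from any SMART tool:

```python
# Rough check of the SSD wear figures quoted above.
tb_written = 105      # TB written so far, per SMART
endurance_tb = 2000   # ~2 PB expected endurance for this drive model
hours_on = 98_000     # power-on hours reported by SMART

used_fraction = tb_written / endurance_tb                  # ~5% used
write_rate_tb_per_year = tb_written / (hours_on / 8760)    # ~9.4 TB/year
years_remaining = (endurance_tb - tb_written) / write_rate_tb_per_year

print(f"{used_fraction:.1%} of endurance used")
print(f"~{years_remaining:.0f} years of writes remaining at this rate")
```

At the historical write rate, "decades more of life" is if anything an understatement: the endurance budget outlasts the century.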
As much as that server was expensive at the time of purchase, nobody can now say it wasn’t value for money: it has been utterly reliable and trouble free in the past twelve years, and fast enough for what I’ve wanted it for up until LLMs appeared. It also isn’t bad for the idle power consumption, recent Linux kernels have it down to 41 watts or so even with the constantly spinning ZFS array which is currently two 26 Tb drives.
But it is getting a bit long in the tooth, and I intend to upgrade it sometime after we move into the new house. My main use case for its replacement is a ‘Star Trek’-like house computer which is always listening and able to respond at any time, on any topic. For that, you’re going to need a frontier-approaching MoE model of at least 200 billion parameters, which ideally means 256 GB of RAM with enough bandwidth and enough compute. Additionally, the MoE model needs to be specifically designed not to suck on consumer-grade CPUs, and while there aren’t many of those, there are some. One I have therefore been watching closely is Step 3.5 Flash, which has among the least-bad performance for a ~200B model running on CPUs only.
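Why a MoE specifically? A rough rule of thumb: generation on this class of hardware is memory-bound, so tokens/sec is capped by bandwidth divided by the bytes of *active* parameters read per token, while RAM has to hold *all* of them. A back-of-envelope sketch, assuming a hypothetical 200B-total / 11B-active model at roughly Q4 (the active-parameter count and bytes-per-parameter figures are my assumptions, not published specs):

```python
# Back-of-envelope MoE sizing: a hypothetical ~200B-total model.
total_params = 200e9
active_params = 11e9    # assumption: parameters touched per generated token
bytes_per_param = 0.55  # roughly Q4 quantisation including overhead

# All weights must fit in RAM...
ram_needed_gb = total_params * bytes_per_param / 1e9  # ~110 GB

# ...but each token only has to stream the active experts through memory.
def gen_tps_upper_bound(bandwidth_gb_s):
    """Memory-bound estimate: one full read of active weights per token."""
    return bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)

print(f"~{ram_needed_gb:.0f} GB RAM to hold the weights")
print(f"~{gen_tps_upper_bound(215):.0f} tok/s ceiling at 215 GB/s")
```

A dense 200B model would need to stream all ~110 GB per token and would crawl; the sparse activation is what makes CPU-class bandwidth survivable at all.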
Right now, the hardware list for house servers with a CPU powerful enough to run LLMs is very short: exactly four options in 2026, and following almost the same table format from the last entry:
| Item | Parse toks/sec per euro | Price (EUR) | Launch year | RAM GB | Bandwidth GB/sec | Idle power watts | Full power watts | FP16 TFLOPS | llama2 7b parse | llama2 7b gen | Other notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mac Studio M3 Ultra | 0.160 | 9274 | 2025 | 256 | 800 | 9 | 200 | 57 | 1488 | 64 | Must store hard drives externally, connected via Thunderbolt |
| Mac Studio M4 Max | 0.211 | 4224 | 2025 | 128 | 546 | 10 | 200 | 28 | 892 | 54 | Surprisingly worse parse performance than the AMD: not enough GPU cores |
| AMD AI Max+ 395 | 0.416 | 3102 | 2025 | 128 | 215 | 12 | 180 | 59 | 1289 | 54 | Standard Mini-ITX! Due to low bandwidth only suits MoE models; a replacement with max 192 GB RAM is expected in 2027 |
| nVidia DGX Spark | 0.622 | 4919 | 2025 | 128 | 273 | 25 | 200 | 127 | 3062 | 57 | Unsure about kernel support longevity |
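The value column in that table is simply the Llama 2 7B parse speed divided by the price; recomputing it from the other columns:

```python
# Recomputing 'parse toks/sec per euro': llama2 7b parse speed / price.
# Figures copied from the table above.
hardware = {
    "Mac Studio M3 Ultra": (1488, 9274),
    "Mac Studio M4 Max":   (892, 4224),
    "AMD AI Max+ 395":     (1289, 3102),
    "nVidia DGX Spark":    (3062, 4919),
}
value = {name: parse / price for name, (parse, price) in hardware.items()}
best = max(value, key=value.get)
print(best, round(value[best], 3))  # the Spark wins on parse value
```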
These are not the best hardware for running LLMs, except in two areas:

- **RAM capacity for your euro.** You can fit the entire model into RAM, which means even the relatively low compute of a CPU relative to a GPU can get you there. Buying this much VRAM costs over ten grand right now, whereas none of the above is that expensive, and they come with a free general-purpose computer, power supply and case.
- **Idle power consumption.** No GPU capable of running LLMs idles below sixteen watts, and usually it is more, so after you add in the idle power consumption of the rest of the server, that adds up. In my future house almost all the electricity will be free of cost from the solar panels, but emitting heat does mess with the thermal balance of the house and could contribute towards overheating in summer. In comparison, all the machines above bar the nVidia idle below twelve watts – which includes their main SSD boot drive.
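To put idle draw into perspective, here is the year-round energy (and continuous heat) each idle wattage represents – a trivial calculation, but it is what drives the thermal point above. The 41 W figure is the current server's; the others come from the table:

```python
# A watt of idle draw is a watt of heat dumped into the house, 24/7.
def annual_kwh(watts):
    """Energy consumed per year at a constant draw (8760 hours/year)."""
    return watts * 8760 / 1000

for name, idle_w in [("current server", 41),
                     ("AMD AI Max+ 395", 12),
                     ("nVidia DGX Spark", 25)]:
    print(f"{name}: {annual_kwh(idle_w):.0f} kWh/year, {idle_w} W of heat")
```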
I managed to find performance benchmarks for some of the hardware above for Step 3.5 Flash and Qwen3 Coder Next:
| | AMD AI Max+ 395 | Apple M3 Ultra | nVidia DGX Spark |
|---|---|---|---|
| Compute units | 16 CPU + 40 GPU | 24 CPU + 80 GPU | 20 CPU + 384 GPU |
| Parse Step 3.5 Flash Q4_K (toks/sec) | 131 | 377 | 530 |
| Gen Step 3.5 Flash Q4_K (toks/sec) | 23 | 33 | 20 |
| Parse Qwen3 Coder Next Q8_0 (toks/sec) | 275 | 1624 | 2162 |
| Gen Qwen3 Coder Next Q8_0 (toks/sec) | 25 | 45 | 37 |
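What those benchmark numbers mean in practice is perceived latency: seconds to parse the prompt, plus seconds to generate the reply. A quick sketch using the Step 3.5 Flash Q4_K figures above, with an assumed 4,000-token prompt and 300-token reply (my illustrative workload, not a benchmark):

```python
# Seconds from hitting enter to a complete reply, ignoring overheads.
def response_time(prompt_tokens, reply_tokens, parse_tps, gen_tps):
    return prompt_tokens / parse_tps + reply_tokens / gen_tps

# Step 3.5 Flash Q4_K parse/gen speeds from the table above.
for name, parse, gen in [("AMD AI Max+ 395", 131, 23),
                         ("Apple M3 Ultra", 377, 33),
                         ("nVidia DGX Spark", 530, 20)]:
    print(f"{name}: {response_time(4000, 300, parse, gen):.0f} s")
```

Note how parse speed dominates for long prompts: the Spark generates more slowly than the M3 Ultra yet lands in a similar place overall, while the Strix Halo's weak parse speed roughly doubles the wait.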
The AMD Strix Halo was originally designed for gaming laptops; AMD only repurposed it into an AI solution quite late in the product cycle. Had they known, they would have given it twice the RAM, RAM bandwidth and GPU cores, and they would have swept the market even charging twice the price.
The reason why is good old-fashioned PC compatibility: none of the above apart from the AMD solution comes in a standard PC motherboard taking standard PC connections and peripherals. If I want to keep my tower case, which has lots of very convenient hard drive bays all of which use SATA/SAS, I am severely limited by anything other than a 100% PC-compatible form factor.
However, as is obvious above, the Strix Halo is underwhelming compared to the other two for 80-billion, never mind 200-billion, parameter models. Even for Qwen3 Coder Next the parse speed is problematic, because Strix Halo’s GPU is a fairly ancient Radeon design underneath, lacking the much-improved FP8 opcodes for token parsing found in newer Radeons.
It is currently expected that the next major successor to the Strix Halo, codenamed ‘Medusa Halo’, will be released in 2027. It should have twice the memory bandwidth, 50% more RAM and 20% more GPU cores, and those cores will be the latest Radeon architecture, so parse speed should take a mighty leap upwards: at least 2x over Strix Halo, and way more again for the Q8 quantisation.
Assuming Apple don’t release a price slashed M5 Ultra – which they might, if they feel this is a market share they can easily grab – and that Intel will remain asleep at the wheel, I guess I’ll be aiming to upgrade the house server to the AMD Medusa Halo architecture in 2027 to 2028.
Here’s hoping that there will be a house for me to put it into by then!