Niall’s virtual diary archives – Thursday 12 March 2026


Word count: 4599. Estimated reading time: 22 minutes.
Summary:
The author’s personal history with LLMs is a long and winding road, marked by experimentation and exploration of various models and tools. They recall playing with llama3.1 8b in Autumn 2024, being impressed by its ability to call tools and search the internet, and recognizing its potential to aid productivity. The author’s use of LLMs has evolved over time, from generating summaries for their website to more complex tasks like image editing and coding assistance. They have also experimented with various models, including Qwen, Gemma3, and Claude, and have been impressed by the rapid progress in AI capabilities.
Thursday 12 March 2026:
17:40.
I just finished making my rented house's internet go much faster – it took me several hours of work this morning – and then this afternoon I was in a WG14 standards meeting, which has only just wrapped up. The internet is indeed now much faster! But that will be discussed in detail in a later virtual diary entry, because this one is about AI coding assistants, and I apologise in advance for the wall of text about to appear.

Obviously lots of programmers like myself have been laid off these past two years, ostensibly replaced by AI which will happily churn out code of a quality similar to perhaps the bottom fifth of programmers. But as management is well known to be absolutely terrible at figuring out who is a good or bad programmer, they've mainly been performing rounds of blind headcount decimation as usual. I've been without income now since last June – part of that is due to changes in US tax treatment of foreign workers, but probably more of it is due to widespread headcount reduction using AI as the excuse. To date, employers have only invested in AI to the extent of substituting X human devs for Y dollars of subscription fees paid to OpenAI or Anthropic – they haven't gone for any deeper integration than that. There are good reasons for that: AI gets quite a lot better every six months, and with such shifting foundations there is no point investing in deep structural change to rebase your business on this new tooling until AI improvements slow to a few percent per year.

As an example of exactly that fast progress, last month Alibaba released its newest set of Qwen models, all of which can be downloaded and run locally – unlike most recent western AI models nowadays. That release was expected to very substantially improve capabilities over their previous models. I, along with lots of others, had been eagerly awaiting it because the Qwen Mixture of Experts (MoE) models are the only feasible way to get large models running on hardware an individual could reasonably afford. As I have zero wish to invest my time training on AI which can be rug pulled from me later, an ironclad requirement for me personally (and I suspect for lots of others like me) is that my time is only worth investing in AI models I could have 100% personal control over. So whatever such models can – or cannot – do is the sweet spot at which I shall aim my practice and training.

I don’t mind the moving target as those models keep leaping forwards – it’s the price of being on the leading edge. And I don’t mind if a future employer pays for some super smart AI to assist me for some piece of work they want me to do. However, for my own personal work, I will be absolutely refusing to get locked in to super smart AI I’ll never be able to fully run on hardware I 100% control.

My personal history with LLMs

The first time I played with a locally run LLM was, I think, about Autumn 2024, some four months after llama3.1 8b had been released. I was relatively late to that game; to be honest, until then I had mostly dismissed LLMs as little more than improved chat bots. I remember being especially impressed by its ability to call tools you had personally taught it about, and to search the internet when forming an answer to your question or instruction. Plus, it ran well on my Macbook, and even not terribly on my ancient Haswell house server which is well over a decade old. Unlike the pure chat bots which preceded it, which were mere curiosities, I felt at the time that this new generation of chat bot had genuine potential to aid my productivity. But the tooling, and indeed the LLM models, weren't there yet – though I am still using llama3.1 8b to generate summaries for the diary entries on this website to this day.

In July 2025 I wrote up how I converted ancient computer parts I had lying around, plus an ancient datacenter AI accelerator board I had bought from Aliexpress, into an AI video inferencing solution for the site. That used a decade old nVidia Tesla P4 with 8 Gb of VRAM, which was and still remains one of the best bang-per-watt AI accelerators you can get. I came away very impressed with its capabilities, and indeed I expect to reinstall it into the site later this month once enough sunshine falls from the sky to power it.

In October 2025 I upgraded this website's generator scripts to invoke llama3.1 8b to summarise each entry, and last January I evaluated the then recently released Qwen models for image editing, and whether the 30b MoE Qwen model could replace llama3.1 8b (it could not, at least not within the limited 18 Gb RAM on my Macbook). And of course last month's entry was all about getting Gemma3 4b to describe and categorise all 25,000 photos in our collection.

Around this time last year, while I was still working at Category Labs, Anthropic's Claude coding assistant AI was beginning to get mentioned, Anthropic having released 'Claude code', their command line agentic AI programming assistant tool, in February 2025. I think they bought a subscription for anybody who wanted one around April, just before they told me they'd be ending my contract early. So I only very briefly played with it during my final month working there in May, and given it cost US$20 per month and I was now unemployed, I wasn't hugely keen to spend more money subscribing, especially as I was 99.9% certain that six months later I wouldn't need to. One thing I did notice during my playtime was that I was already running into the daily usage limits of the US$20 per month plan after maybe an hour of use. Obviously they want you to pay them a LOT more money, which I suppose is fair – the US$20 per month plan is just their taste tester plan.

Qwen3 Coder Next (Q3CN)

It’s rare that I predict the future so accurately! Last month Alibaba released Qwen3 Coder Next, an 80b parameter MoE model specifically tuned to help you work with code. I waited a few weeks for https://github.com/ggml-org/llama.cpp to catch up with optimising support for this latest LLM, and then I gave it a proper tyre kicking last week and this week. I have come away once again impressed!

Qwen3 Coder Next is about as capable as Claude Sonnet 3.5, so about where Claude was at in Summer 2024 – which in practical terms is exactly where today’s US$20 per month Claude subscription is at, because on that plan any model newer than 3.5 runs out of usage limits so quickly it’s useless. I therefore have exactly what I predicted: the same capability of AI assistant as what costs US$20 per month from Anthropic, except runnable on my own local hardware free of charge – if you have sufficient hardware.

My development workstation is a little old now: I last upgraded it while still working for MayStreet in 2022, and I had intended to upgrade it in the summer of last year, but without income that no longer made sense. It is an AMD Threadripper 5975WX based machine with thirty-two Zen 3 cores and eight channels of memory, so it should have ~180 Gb/sec of memory bandwidth, but only ~4 TFLOPs of FP16 compute. That is far too little to run LLMs well, as I found during the image analysis diary entry where even the small 4b Gemma3 model took a minute per image. But what it does have is 128 Gb of RAM and a PCIe 4.0 interface, so you can theoretically run ~100 Gb footprint LLMs, if you can offload the compute to hardware with far more TFLOPs and memory bandwidth.

To run an 80b MoE model well – which is what Qwen3 Coder Next (Q3CN) is – you need a GPU with enough VRAM and compute power to run the dense layers quickly. Those dense layers then select which experts will be used, and those experts are usually run on the CPU using all available CPU cores. So long as the experts don’t touch much memory and are computationally lightweight, you absolutely can run an 80b model like Q3CN well on local hardware.
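
As a concrete sketch of that split, recent llama.cpp builds can pin the MoE expert weights to the CPU while everything else goes to the GPU. The GGUF filename and flag values below are illustrative assumptions rather than a tested configuration:

```shell
# Hypothetical llama.cpp invocation (filename and sizes are assumptions):
# --n-gpu-layers 99 offloads every layer it can to the GPU, then
# --cpu-moe overrides that for the MoE expert tensors, which stay in
# system RAM and run across all CPU cores.
llama-server \
  --model Qwen3-Coder-Next-80B-Q4_K.gguf \
  --n-gpu-layers 99 \
  --cpu-moe \
  --ctx-size 65536 \
  --threads 32
```

If some experts do fit in VRAM, `--n-cpu-moe N` is the finer-grained variant, keeping only the first N layers’ experts on the CPU.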

Running Q3CN locally

As you are surely inferring by now, much hangs on what a ‘GPU with enough VRAM and compute power to run the dense layers quickly’ is, and more importantly, how much it might cost. I currently have these two GPUs in my workstation:

| GPU | Parse toks/sec/euro | Price (EUR) | Launch year | RAM Gb | Bandwidth Gb/sec | Full power watts | FP16 TFLOPS | llama2 7b parse | llama2 7b gen | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| AMD RX 6600 XT | 2.87 | 200 | 2021 | 8 | 256 | 160 | 16.1 | 574 | 54 | assigned to linux |
| AMD RX 6700 XT | 3.28 | 320 | 2021 | 12 | 384 | 230 | 23.8 | 1051 | 84 | assigned to windows |

These were purchased principally with gaming-on-a-budget in mind – I had wanted to play the Mass Effect Legendary Edition trilogy in 4k with the updated graphics and bug fixes which was released in 2021 (and I didn’t get round to it until Autumn 2024). Hence the GPU allocated to Windows was a bit beefier (also, I bought it a year after the first one, and what you could get for €500 had improved by then).

The above table shows their Euro price today on eBay, and the llama2 7b performance numbers come from this list of llama.cpp benchmarks, which are for the Vulkan backend. As you can see, despite the RX 6700 XT being only a bit faster than the RX 6600 XT for games (about 12%), it’s 50-100% faster for running an LLM which entirely fits inside VRAM. Had I known there would be such a performance differential, I’d have used the 6700 XT to run Gemma3 in the last diary entry and saved myself days of processing time. Oh well!

Unfortunately, running Q3CN with the Q4_K quantisation on the RX 6700 XT is not good:

  • Parse is 60 toks/sec.
  • Generation is 14.5 toks/sec.

Particularly the parse speed is the problem here: in any LLM, you need to feed the whole context through parse on every turn of interaction. The context gets real big quickly because it has to include all the source code for everything relevant to what you’re working on, plus all the accumulated steps so far. Modern models are able to cache, and not reprocess, the context from previous calls, so large contexts aren’t the problem per se – rather it’s whenever the model receives a lot of new content it hasn’t seen before, e.g. you just fed it the contents of a new source file. You could actually live just fine with slow generation speeds; it’s the parse speed of new content that is the problem.

To explain: if you examine a few hours of me doing work, you’ll find about 99.2% of tokens used are input tokens (parsing context), and just 0.8% are output tokens (emitting changes). Therefore, for speedy turnarounds, you don’t really care about token generation speed much at all. Of the input tokens, about 4% will be novel tokens, and the other 96% will be cached due to having been seen before. Therefore the ratio is:

  • 3.9% novel input tokens
  • 95.3% cached input tokens
  • 0.8% output tokens
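
A quick sanity check that those three buckets hang together with the 99.2% input figure above (a sketch; the percentages are the measured mix, nothing else is assumed):

```python
# The three buckets must account for every token in a session.
NOVEL_INPUT, CACHED_INPUT, OUTPUT = 0.039, 0.953, 0.008
assert abs(NOVEL_INPUT + CACHED_INPUT + OUTPUT - 1.0) < 1e-9

input_share = NOVEL_INPUT + CACHED_INPUT          # share of all tokens that are input
novel_share_of_input = NOVEL_INPUT / input_share  # share of input that is novel

print(f"input tokens: {input_share:.1%} of all tokens")         # → 99.2%
print(f"novel input:  {novel_share_of_input:.1%} of the input") # → ~4%
```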

This is of course an average over many interactions, and so long as novel input tokens are few and cached input tokens are many, running Q3CN locally on the developer workstation is just fine. However, when it comes to large new input content, that 60 toks/sec parse speed becomes a problem: particularly at the beginning of each task, expect minutes for it to parse the context for the first time. After that is parsed, it trundles along at a fair clip and is nicely interactive with me, until it next reads a new file and then it needs more minutes. All of which is fair enough: it’s got 12 Gb of VRAM running a ~40 Gb model, so it’s falling back onto main, slow, RAM a lot.
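
To put that 60 toks/sec parse rate in perspective, here is the first-turn wait for various amounts of unseen context (the context sizes are illustrative assumptions, not measurements):

```python
# First-turn latency: every unseen context token must go through parse.
PARSE_TOKS_PER_SEC = 60  # measured Q4_K Q3CN parse rate on the RX 6700 XT

for context_tokens in (8_000, 32_000, 100_000):
    minutes = context_tokens / PARSE_TOKS_PER_SEC / 60
    print(f"{context_tokens:>7,} novel tokens -> {minutes:4.1f} minutes to first response")
```

A single large source file can easily be tens of thousands of tokens, hence the ‘expect minutes’ above.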

So what’s the best LLM running hardware bang for the buck in March 2026?

The budget LLM executing hardware market in March 2026

I assembled a list of all GPUs and data centre AI accelerator boards with 16 Gb or more VRAM currently available new or second hand to Ireland costing no more than €1,000 inc VAT. For the purposes of comparison, I threw in my existing AMD GPUs and the nVidia Tesla P4 I bought last year for the site video inferencing – these are the only 8 Gb VRAM boards below:

| GPU | Parse toks/sec/euro | Price (EUR) | Launch year | RAM Gb | Bandwidth Gb/sec | Full power watts | FP16 TFLOPS | llama2 7b parse | llama2 7b gen | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| Intel Arc Pro B50 | 0.55 | 350 | 2025 | 16 | 224 | 70 | 21.3 | 194 | 40 | |
| Intel Arc Pro B60 | 0.87 | 600 | 2025 | 24 | 456 | 200 | 24.5 | 522 | 69 | |
| nVidia P40 | 1.13 | 430 | 2016 | 24 | 345 | — | 10 | 488 | 59 | needs additional fan, high idle power consumption, low compute perf |
| nVidia Tesla P4 | 2.09 | 127 | 2016 | 8 | 192 | 75 | 5.7 | 266 | 28 | needs additional fan |
| AMD V340L | 2.40 | 100 | 2018 | 2x 8 | 410 | 300 | 21 | 240 | 48 | old, ensure vulkan shaders work |
| AMD RX 6600 XT | 2.87 | 200 | 2021 | 8 | 256 | 160 | 16.1 | 574 | 54 | assigned to linux |
| AMD RX 6700 XT | 3.28 | 320 | 2021 | 12 | 384 | 230 | 23.8 | 1051 | 84 | assigned to windows |
| AMD RX 7700 XT | 3.49 | 430 | 2023 | 16 | 624 | 263 | 26 | 1500 | 70 | |
| AMD RX 6900 XT | 4.41 | 400 | 2020 | 16 | 512 | 300 | 37 | 1762 | 106 | |
| Intel Arc A770 | 4.44 | 280 | 2022 | 16 | 560 | 225 | 34.4 | 1242 | 55 | |
| AMD RX 7900 XTX | 4.74 | 750 | 2022 | 24 | 960 | 355 | 46.7 | 3552 | 167 | |
| AMD Radeon VII | 5.07 | 180 | 2019 | 16 | 1024 | 300 | 26.9 | 912 | 106 | needs additional fan, old, ensure vulkan shaders work |
| AMD RX 9070 | 5.14 | 615 | 2025 | 16 | 640 | 220 | 72.3 | 3164 | 120 | |
| AMD Mi50 | 7.47 | 150 | 2018 | 16 | 1024 | 300 | 26.5 | 1120 | 108 | needs additional fan, old, ensure vulkan shaders work |
| nVidia RTX 5070 Ti | 8.86 | 950 | 2025 | 16 | 896 | 300 | 43.9 | 8420 | 182 | |
| nVidia RTX 5060 Ti | 9.54 | 440 | 2025 | 16 | 448 | 180 | 23.7 | 4196 | 94 | |
| nVidia RTX 3090 | 11.12 | 500 | 2021 | 24 | 936 | 350 | 29.3 | 5560 | 162 | |

Those prices, especially for the nVidia cards even old ones, are grim.

Despite how very depressing this table is, it’s actually much improved over this time last year when I last updated my spreadsheet for AI accelerators. Back then the only games in town were the expensive nVidia GPUs above, and the Intel GPUs which suck at parsing. AMD GPUs one year ago just weren’t viable because AMD ROCm only supported the very newest GPUs, all of which cost over a grand at the time if you wanted 16 Gb of VRAM (and even today, only the RX 9070 comes in under a grand).

Twelve months later, as I noted last diary entry, AMD ROCm now ‘just works’ even on technically unsupported GPUs from the previous generation like mine. It no longer crashes and blows up like the dumpster fire it was even six months ago. However, the llama2 7b benchmarks listed above aren’t from the ROCm backend for llama.cpp – they’re from the Vulkan backend, because that’s now usually faster than the ROCm backend if you’re using trunk llama.cpp. The Vulkan backend was started last summer and has made enormous strides in just the last three months, such that it’s now almost always the fastest backend on AMD GPUs, and it’s as fast as CUDA for token generation on nVidia GPUs. Parsing performance is still a third to a half slower than CUDA on nVidia GPUs, but that gap is closing quickly.

The reason the Vulkan backend is so game changing is that GPUs have supported Vulkan shaders (which are for high performance games) for over a decade, which in turn means all the ancient AMD datacenter AI accelerator boards suddenly come into play: they can all run Vulkan shaders no problem, even if they’ll never run ROCm. That expands the table above with some promising new options compared to twelve months ago. It also proves that AMD GPUs never actually sucked at LLMs as much as people thought until recently – the actual problem was lousy software support, not that the hardware wasn’t capable. This much improved story for running LLMs locally is 100% the result of recent runtime software improvements, and I’m very glad for the increased menu of choice.

The table above is ordered by parse speed per euro, so the bottom of the table is where the standout bang-for-your-buck boards are listed. Unsurprisingly all of those are nVidia GPUs: nothing else can parse tokens as well for your euro. But given that all of those are expensive even used on eBay, the next standout board is the 2018 era AMD Mi50 datacenter board, which is up there with the nVidia RTX 5xxx GPUs in terms of bang for the buck. Unlike those boards, the Mi50 can be sourced from China, delivered, for €150 inc VAT. So naturally I’ve ordered one and I’m looking forward to its delivery. It has a similar token parse speed to my existing 6700 XT, however it has a third more VRAM and that VRAM is nearly three times faster. I would expect maybe a +20% performance improvement at running Q3CN. I guess I’ll find out.

To get radically faster performance such that there is no waiting at all, one would need at least 48 Gb of VRAM I think, seeing as the model is ~40 Gb. That probably means two cards with 24 Gb VRAM each, and to keep under a €1,000 budget:

  1. €860 2x nVidia P40: probably not that much faster than my 6700 XT at parsing.
  2. €1000 2x nVidia RTX 3090: many times faster.

… of which clearly the RTX 3090 is by far the better option, plus it can be used for gaming. Still, that’s a cool €1,000. That’s a lot of money.

I am mindful that after this AI investment bubble bursts, there is going to be a flood of used AI accelerator hardware on the market which will depress prices. So now is a lousy time to buy, especially in my current financially straitened circumstances. Which then makes one ask: how much would it cost to rent the hardware instead?

Renting Q3CN

The idea of an ‘LLM marketplace’ is of course an obvious one, and as far as I am aware the first of these, and still the biggest, is OpenRouter, who got started in 2023. What they do is provide an OpenAI compatible REST API endpoint which proxies a marketplace of LLM providers for a +5.5% fee over whatever the underlying provider charges. You can set rules for which providers to choose, when, and in what order you’d prefer – note that ‘cheapest first’ is NOT the default after account creation. You also don’t necessarily want the absolute cheapest, as I found they hang frequently; with some trial and error you’ll figure out what to ban and what to allow.

You can of course open an account directly with the providers on OpenRouter and save yourself the +5.5% fee, and there are further providers who don’t list on OpenRouter. However one enormous advantage of OpenRouter is automatic failover, because when a provider gets overloaded – and in my experience, they do during peak times – OpenRouter reroutes you to the next cheapest provider with zero outage experienced by you. Maybe down the road when these providers get much better uptime this will change, but for now I think I’ll be happy to pay OpenRouter their fee for not suddenly being put on pause mid-flow.
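
Those provider rules live in the request body itself rather than in any SDK feature, so a sketch of what such a request might look like – the model slug and provider names are illustrative assumptions, and the exact schema should be checked against OpenRouter’s provider routing docs:

```python
import json

# Hypothetical OpenRouter chat request with provider routing rules:
# prefer the cheapest provider, ban ones which hang, keep failover on.
payload = {
    "model": "qwen/qwen3-coder-next",  # assumed model slug
    "messages": [{"role": "user", "content": "Refactor this function ..."}],
    "provider": {
        "sort": "price",           # cheapest first - NOT the default
        "ignore": ["chutes"],      # providers you have banned
        "allow_fallbacks": True,   # reroute to the next provider on overload
    },
}

# POST this to https://openrouter.ai/api/v1/chat/completions with an
# "Authorization: Bearer <key>" header; omitted here to keep the sketch offline.
print(json.dumps(payload, indent=2))
```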

To rent Q3CN today, these are the four cheapest providers I could find online:

| Model | Supplier | Max context | Input US$ per 1M tokens | Cached input US$ per 1M tokens | Output US$ per 1M tokens | Estimated US$ price per day (10M tokens) |
|---|---|---|---|---|---|---|
| qwen3-coder-next | ~~OpenRouter (chutes)~~ | 262k | 0.12 | 0.06 | 0.75 | 0.6786 |
| qwen3-coder-next | OpenRouter (parasail) | 262k | 0.15 | N/A | 0.80 | 1.552 |
| qwen3-coder-next | OpenRouter (ionstream) | 262k | 0.15 | N/A | 0.80 | 1.552 |
| qwen3-coder-next | NanoGPT | 262k | 0.15 | N/A | 1.50 | 1.608 |

The chutes provider entry is struck out because they’re the provider which kept hanging the session or corrupting the context. They’re basically useless, so I list them for information only. I’ve had good experiences with Parasail and Ionstream: each has dropped out on occasion, but OpenRouter routed to the other so my work was uninterrupted. NanoGPT is a standalone provider not listed on OpenRouter; there are other standalone providers a LOT more expensive for Q3CN rental than those listed here – so much more expensive they’re not really worth listing. In any case, about US$1.50 per day is the estimate assuming a use of ten million tokens – which, given that I easily chewed through seven million tokens in five hours, may well be an underestimate.
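
The ‘Estimated US$ price per day’ column falls straight out of the token mix measured earlier (roughly 3.9% novel input, 95.3% cached input, 0.8% output of an assumed ten million tokens per day); a sketch reproducing it:

```python
# Reproduce the "Estimated US$ price per day" column from per-1M-token prices.
TOKENS = 10_000_000
NOVEL, CACHED, OUTPUT = 0.039, 0.953, 0.008

def cost_per_day(input_price, cached_price, output_price):
    """Prices are US$ per 1M tokens; cached_price of None means no cache discount."""
    if cached_price is None:
        cached_price = input_price
    return (TOKENS * NOVEL * input_price
            + TOKENS * CACHED * cached_price
            + TOKENS * OUTPUT * output_price) / 1_000_000

print(f"chutes:   ${cost_per_day(0.12, 0.06, 0.75):.4f}")  # matches 0.6786
print(f"parasail: ${cost_per_day(0.15, None, 0.80):.4f}")  # matches 1.552
print(f"NanoGPT:  ${cost_per_day(0.15, None, 1.50):.4f}")  # matches 1.608
```

Note how little the output price matters at this mix: NanoGPT nearly doubles the output rate yet only adds six cents per day.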

OpenRouter can supply detailed logs on request, so from those I calculated that whatever hardware Parasail is running has this performance:

  • Parse (uncached): 17,450 toks/sec
  • Parse (cached): 31,204 toks/sec
  • Generation (uncached input): 56 toks/sec
  • Generation (cached input): 81 toks/sec

… which smells to me like nVidia A100 cards, which I suppose makes sense as they’re older and therefore cost depreciated. In any case, they are more than plenty fast enough: the agentic coding AI snaps along faster than I can read its log of actions – half the speed would also be more than plenty. I should remember that for later when buying GPU hardware.

Speaking of expensive … would you like to know how much the ‘big boy’ AI agentic coding services cost for comparison?

| Model | Supplier | Max context | Input US$ per 1M tokens | Cached input US$ per 1M tokens | Output US$ per 1M tokens | Estimated US$ price per day (10M tokens) | Multiple of rented Q3CN cost above |
|---|---|---|---|---|---|---|---|
| qwen3-coder-plus | OpenRouter (alibaba) | 1000k (though gets noticeably forgetful after 300k) | 1.17 | 0.13 | 5.85 | 2.1632 | 1.4x |
| gemini3.1-pro | Google | 200k | 2.00 | 0.20 | 12.00 | 3.646 | 2.3x |
| gpt-5.3-codex | OpenAI | 400k | 1.75 | 0.25 | 14.00 | 4.185 | 2.7x |
| claude-sonnet-4.6 | Anthropic | 200k | 3.00 | 0.30 | 15.00 | 5.229 | 3.4x |
| claude-opus-4.6 | Anthropic | 200k | 5.00 | 0.50 | 25.00 | 8.715 | 5.6x |

The cheapest frontier coding model is Alibaba’s Qwen via OpenRouter (where it is heavily discounted for some reason), followed by Google’s Gemini3.1 Pro and OpenAI’s GPT5.3 Codex, with a slightly larger price gap to Anthropic Claude Sonnet and Opus. Qwen Coder Plus is 1.4x the cost of rented Q3CN, which is a useful data point; Claude’s most capable model is 5.6x the cost for my usage patterns, which if I’m honest, is less than I had expected.

Few devs pay for frontier models by the token; most have monthly subscriptions instead. I seem to consume 100 - 120 requests per hour, so that’s 500 - 600 requests and seven million tokens per five hours. That certainly needs the highest possible US$200 per month subscription: it buys you 800 requests per five hours, but there is also a weekly usage limit of 15 - 30 hours for their Opus model. If you want more, Anthropic want you to pay by per-token billing instead. To be honest, at an estimated US$8.72 per day for my usage pattern, paying by the token for their highest end model over an average 22 day working month would be US$191.84 per month – cheaper than their US$200 monthly subscription, and with no usage limits, which is another useful data point. I read a lot online of people complaining about the usage limits built into their US$200 per month Claude subscriptions, yet for my AI use patterns per-token billing would always be cheaper than the subscription. I guess a lot of people have Claude write a lot more output than I would have it do?
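
The subscription-versus-metered comparison above, as arithmetic (US$8.72 per day is the Opus estimate from the table; 22 working days per month is my assumption):

```python
# Is Anthropic's US$200/month subscription worth it for this usage pattern?
OPUS_COST_PER_DAY = 8.72      # estimated per-token Opus spend per working day
WORKING_DAYS_PER_MONTH = 22   # assumed working days in an average month
SUBSCRIPTION_PER_MONTH = 200.00

metered = OPUS_COST_PER_DAY * WORKING_DAYS_PER_MONTH
print(f"per-token billing: ${metered:.2f}/month, no usage limits")
print(f"subscription:      ${SUBSCRIPTION_PER_MONTH:.2f}/month, with usage limits")
```

For this token mix the metered route wins on both price and limits; heavy output-generating use would tip it the other way.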

I’ll come back to that in the next section, as I’m digressing – this is a Q3CN focused diary entry. Point is that renting Q3CN would probably cost US$34 per month, which is around US$400 per year, and that’s if you’re using it full time. Use it only sporadically like I do as I’m unemployed, and spending €1,000 on your own hardware to run it looks like lunacy (unless that same hardware can also play the upcoming Grand Theft Auto VI at max graphics settings, in which case it becomes more of a ‘bird in the hand is worth two in the bush’ type of cost-benefit analysis).

Conclusions

I think I’m pretty much decided: I shall use OpenRouter for my Q3CN implementation until the AI investment bubble pops and I can affordably pick up ideally a new powerful GPU also able to run GTA VI well, or else fire-saled legacy AI hardware on eBay for cheap. But the well-under-five-hundred-euro sort of cheap – there’s no rational point dropping more than €500 on new hardware given the rental costs, as I’m better off renting until used component prices get under €500.

OpenRouter makes it super simple to flip over to Claude Opus for analysis, architecture and plan writing, and then flip back to Q3CN for implementation. I’m not opposed to paying tens of euro cents for analysis and planning if it reveals things I would likely have missed – especially as it’ll write all that out into documentation for me, which I can then manually review and strip out the wrong parts. I simply view that as good engineering: I welcome all good quality feedback, from any source.

I think this reveals what kind of AI-using coder I will be; there appear to be two main categories:

  1. Devs who don’t like writing code much, so get AI to write as much of the code as possible, so they can focus on solving problems ASAP. The AI will therefore output lots of tokens, as it writes all the code.
  2. Devs who feel the whole point of coding is to emit high quality code, and AI isn’t good at that especially starting from a blank sheet, so they’ll always write the bulk of the initial implementation by hand, and then only use AI when appropriate to adjust and refine that codebase. In this category, the AI will mostly read tokens, and output very few as it never edits more than a few lines of existing code at a time.

The first category tends to use an AI focused IDE like Cursor which is a fork of vscode, whilst the second category tends to use AI extensions like Roo Code installed into vscode – and to be specific about the difference here, Roo Code only appears when you open its tab. Otherwise it’s as if it’s not installed, which is exactly what you want when you’re doing work you don’t want the AI to do. Whereas, in Cursor, by choosing that IDE you’re basically saying ‘there is no work I don’t want AI to do’. In other words, it’s outside-in vs inside-out.

I am probably in category two for most of my open source work which is on reference implementation libraries. These set the standard for everybody else, and they have to be very carefully written and designed. So I like the AI to help, but I’m always going to be writing most of the code by hand most of the time.

However I’m not opposed to category one for some tasks: there are a number of Python scripts I’ve written to implement some part of a processing pipeline where I would be more than happy if the AI did as much of the work as possible, as I just want a solution ASAP and I don’t especially care how we get there. For example, if I needed some Android app to solve some itch or something, chances are very high I’d just vibe code that and call it a day.

I guess this is pretty much what Linus Torvalds said about this stuff: ‘use AI to write the code you don’t care about’. That’s pretty much where I’ve arrived at too, though I do find its analysis of what I’ve written quite insightful sometimes, as it sees with eyes which are not my own.

Anyway, that’s my analysis of agentic AI coding assistants written up! I do apologise for the wall of text, but I did also want to be comprehensive as I’ll almost certainly refer back to this in the future, so I wanted all my scattered notes built over many months to be condensed into a single, albeit very long and dense, diary entry which will turn up in search in the future should I need to refer to it.

Next post will almost certainly be about making the rented house internet go faster, but that’ll be at the earliest next week. Be happy everybody!

#AI #LLM #agentic #qwen





Contact the webmaster: Niall Douglas @ webmaster2<at symbol>nedprod.com (Last updated: 2026-03-12 17:40:11 +0000 UTC)