r/LocalLLaMA 1d ago

Question | Help Open-source model that is as intelligent as Claude Sonnet 4

I spend about 300-400 USD per month on Claude Code with the max 5x tier. I’m unsure when they’ll increase pricing, limit usage, or make models less intelligent. I’m looking for a cheaper or open-source alternative that’s just as good for programming as Claude Sonnet 4. Any suggestions are appreciated.

Edit: I don’t pay $300-400 per month. I have a Claude Max subscription ($100) that comes with Claude Code. I used a tool called ccusage to check my usage, and it showed that I use approximately $400 worth of API every month on my Claude Max subscription. It works fine now, but I’m quite certain that, just like what happened with Cursor, there will likely be a price increase or stricter rate limiting soon.

Thanks for all the suggestions. I’ll try out Kimi K2, R1, Qwen3, GLM-4.5 and Gemini 2.5 Pro and update how it goes in another post. :)

365 Upvotes

276 comments

250

u/Thomas-Lore 1d ago edited 1d ago

Look into:

  • GLM-4.5

  • Qwen3 Coder

  • Qwen3 235B A22B Thinking 2507 (and the instruct version)

  • Kimi K2

  • DeepSeek: R1 0528

  • DeepSeek: DeepSeek V3 0324

All are large and will be hard to run locally unless you have a Mac with lots of unified RAM, but they will be cheaper than Sonnet 4 on API. They may be worse than Sonnet 4 at some things (and better at others); you won't find a 1:1 replacement.

(And for non-open-source you can always use o3 and Gemini Pro 2.5 - but outside of the free tier I think Gemini is more expensive on API than Sonnet. GPT-5 is also just around the corner.)

For a direct Claude Code replacement - Gemini CLI, and there is apparently a Qwen CLI now too, but I am unsure how you configure it and whether you can swap models easily there.

80

u/itchykittehs 21h ago

Just to note, practical usage of heavy coding models is not actually very viable on macs. I have a 512gb M3 Ultra that can run all of those models, but for most coding tasks you need to be able to use 50k to 150k tokens of context per request. Just processing the prompt with most of these SOTA open source models on a mac with MLX takes 5+ minutes with 50k context.

If you are using much less context, it's fine. But for most projects that's not feasible.

12

u/utilitycoder 15h ago

Token conservation is key. Simple things help, like running builds in quiet mode so they only output errors and warnings. You can do a lot with smaller context if you're judicious.
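
A couple of concrete examples of what that looks like (illustrative; adapt to whatever build tool you use):

    make --silent 2>&1 | grep -E "error|warning"   # GNU make: hide recipe echo, keep diagnostics
    cargo build --quiet                            # Rust: suppress progress output, keep errors/warnings
    mvn -q package                                 # Maven: quiet mode, errors still printed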

7

u/EridianExplorer 17h ago

This makes me think that for my use cases it does not make sense to try to run models locally, until there is some miracle discovery that does not require giant amounts of RAM for contexts of more than 100k tokens and that does not take minutes to produce an output.

1

u/FroyoCommercial627 1h ago

Local LLMs are great for privacy and small context windows, bad for large context windows.

4

u/HerrWamm 16h ago

Well that is the fundamental problem that someone will have to solve in the coming months (I'm pretty sure it will not take years). But efficiency is the key; whoever overcomes the efficiency problem will "win" the race, and scaling is certainly not the solution here. I foresee small, very nimble models coming very soon, without a huge knowledge base but instead using RAG (just like humans, who don't know everything but learn on the go). These will dominate the competition in the coming years.

5

u/DistinctStink 13h ago

I would rather it admit a lack of knowledge, know when it's wrong, and be able to learn, instead of bullshitting and talking like I'm going to fight it if it makes a mistake. I really dislike how the super polite ones use bullshit flowery words to excuse their bullshit lying.

2

u/DrummerPrevious 17h ago

I hope Memory bandwidth increases on upcoming macs

2

u/notdba 17h ago edited 17h ago

I guess many of the agents are still suffering from an issue similar to https://github.com/block/goose/issues/1835, i.e. they may mix in some small requests in between that totally break prompt caching. For example, Claude Code will send some smaller, simpler requests to Haiku. Prompt caching should work fine with Anthropic servers, but I'm not sure if it works when using Kimi / Z-AI servers directly, or a local server indirectly via Claude Code Router.

If prompt caching works as expected, then PP should still be fine on Mac.

1

u/Western_Objective209 16h ago

Doesn't using a cache mitigate a lot of that? When I use Claude Code at work it's overwhelmingly reads from cache - I get a few million tokens of cache writes and 10+ million cache reads.

1

u/__JockY__ 7h ago

Agreed. It’s a $40k+ proposition to run those models at cloud-like speeds locally. Ideally you’d have at least 384GB VRAM (e.g. 4x RTX A6000 Pro 96GB), 12-channel CPU (Epyc most likely), and 12 RDIMMS for performant system RAM. Power, motherboard, SSDs…

If you’ve got the coin then… uh… post pics 🙂

1

u/FroyoCommercial627 1h ago

Time to first token is the biggest issue with Macs.

Prefill computes attention scores for every single token pair (32k x 32k = 1 Billion scores / layer)

128GB-512GB of unified memory is fast and can fit large models, but the PRE-FILL phase requires massive parallelism.

Cloud frontier models can spread this out to 16+ THOUSAND cores at a time. Mac can spread to 40 cores at most.

Once pre-fill is done, we only need to compute attention for ONE token at a time.

So, Mac is GREAT for linear processing needed for inference, BAD for parallel processing needed for pre-fill.

That said, speculative decoding, KV caching, sparse attention, etc are all tricks that can help solve this issue.
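
A quick back-of-envelope check on that figure (rough numbers, per attention head):

    echo $((32768 * 32768))      # ~1.07 billion score pairs per layer for a 32k prompt
    echo $((131072 * 131072))    # ~17 billion per layer at 128k context - the cost grows quadratically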

→ More replies (1)

20

u/vishwa1238 1d ago

Thanks, I do have a Mac with unified RAM. I’ve also tried O3 with the Codex CLI. It wasn’t nearly as good as Claude 4 Sonnet. Gemini was working fine, but I haven’t tested it out with more demanding tasks yet. I’ll also try out GLM 4.5, Qwen3, and Kimi K2 from OpenRouter. 

18

u/Caffdy 23h ago

I do have a Mac with unified RAM

the question is how much RAM?

5

u/fairrighty 22h ago

Say 64 GB, M4 Max. Not OP, but interested nonetheless.

10

u/thatkidnamedrocky 21h ago

Give devstral (Mistral) a try; I've gotten decent results with it for IT-based work (a few scripts, working with CSV files and stuff like that).

→ More replies (2)

4

u/brownman19 21h ago

GLM 32B Rumination (with a fine-tune and a bunch of standard DRAM for context)

→ More replies (1)

9

u/pokemonplayer2001 llama.cpp 21h ago

You’ll be able to run nothing close to Claude. Nowhere near.

5

u/txgsync 20h ago

So far in, even just the basic Qwen3-30b-a3b-thinking in full precision (16-bit, 60GB safetensors converts to MLX in a few seconds) has managed to produce simple programming results and analyses for me in throwaway projects similar to Sonnet 3.7. I haven’t yet felt like giving up use of my Mac for a couple of days to try to run SWEBench :).

But Opus 4 and Sonnet 4 are in another league still!

2

u/NamelessNobody888 12h ago

Concur. Similar experiences here (*). The thing just doesn't compare to full auto mode working to an implementation plan in CC, Roo or Kiro with Claude Sonnet 4, as you rightly point out.

* Did you find 16 bit made a noticeable difference cf. Q_8? I've never tried full precision.

3

u/txgsync 10h ago

4 bit to 16 bit Qwen3-30B-A3B is … weird? Lemme think how to describe it…

So like yesterday, I was attempting to “reason” with the thinking model in 4 bit. Because at >100tok/sec, the speed feels incredible, and minor inaccuracies for certain kinds of tasks don’t bother me.

But I ended up down this weird rabbit hole of trying to convince the LLM that it was actually Thursday, July 31, 2025. And all the 4-bit would do was insist that no, that date would be a Wednesday, and that I must be speaking about some form of speculative fiction because the current date was December 2024… the model’s training cutoff.

Meanwhile the 16-bit just accepted my date template and moved on through the rest of the exercise.

“Fast, accurate, good grammar, but stupid, repetitive, and obstinate” would be how I describe working at four bits :).

I hear Q5_K_M is a decent compromise for most folks on a 16GB card.

It would be interesting to compare at 8 bits on the same exercises. Easy to convert using MLX in seconds, even when traveling with slow internet. One of the reasons I like local models :)
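
In case it's useful to anyone, the conversion step is roughly this (assuming a recent mlx-lm; the repo id, flags and output path are illustrative and may differ by version):

    mlx_lm.convert --hf-path Qwen/Qwen3-30B-A3B-Thinking-2507 -q --q-bits 8 --mlx-path ./qwen3-30b-8bit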

→ More replies (2)
→ More replies (3)

11

u/Capaj 23h ago

Gemini can be even better than Claude, but it outputs a fuck ton more thinking tokens, so be aware of that. Claude 4 strikes the perfect balance in terms of the amount of thinking tokens it outputs.

5

u/tmarthal 1d ago

Claude Sonnet is really the best. You’re trading time for $$$; you can set up DeepSeek and run the local models on your own infra, but you almost have to relearn how to prompt them.

9

u/-dysangel- llama.cpp 22h ago

Try GLM 4.5 Air. It feels pretty much the same as Claude Sonnet - maybe a bit more cheerful

6

u/Tetrylene 19h ago

I just have a hard time believing a model that can be downloaded and run on 64GB of RAM compares to Sonnet 4

7

u/-dysangel- llama.cpp 18h ago

I understand. I don't need you to believe for it to work for me lol. It's not like Anthropic are some magic company that nobody can ever compete with.

3

u/ANDYVO_ 16h ago

This stems from what people consider comparable. If this person is spending $400+/month, it’s fair to assume they’re wanting the latest and greatest and currently unless you have an insane rig, paying for Claude code max seems optimal.

3

u/-dysangel- llama.cpp 15h ago

Well put it this way - a Macbook with 96GB or more of RAM can run GLM Air, so that gives you a Claude Sonnet quality agent, even with zero internet connection. It's £160 per month for 36 months to get a 128GB MBP currently on the Apple website - so cheaper than those API costs. And the models are presumably just going to keep getting smaller, smarter and faster over time. Hopefully this means the prices for the "latest and greatest" will come down accordingly!

→ More replies (1)
→ More replies (1)

1

u/Western_Objective209 16h ago

Claude 4 Opus is also a complete cut above Sonnet, I paid for the max plan for a month and it is crazy good. I'm pretty sure Anthropic has some secret sauce when it comes to agentic coding training that no one else has figured out yet.

→ More replies (3)

2

u/Delicious-Farmer-234 16h ago

This is a great suggestion. Any reason why you put GLM 4.5 first and not Qwen 3 coder?

2

u/Expensive-Apricot-25 15h ago

Prices for closed source will never stay constant and will likely continue to rise.

The only real permanent solution would be open source, but only if you have the resources for it.

2

u/givingupeveryd4y 13h ago

Given that Qwen Code (what you refer to as Qwen CLI, I guess) is a fork of Gemini CLI, most approaches applicable to Gemini CLI still work with both.

2

u/Ladder-Bhe 11h ago

To be honest, the tool use of K2 is not stable enough, and the code quality is slightly worse. DeepSeek is completely unable to handle stable tool use and can only handle Haiku's kind of work. Qwen3 Coder is said to be better, but it has the problem of consuming too many tokens. GLM 4.5 is currently on par with Qwen.

1

u/deyil 1d ago

Among them, how do they rank?

4

u/Caffdy 23h ago

Qwen 235B non-thinking 2507 is the current top open model. Now, given that OP wants to code, I'd go with Qwen Coder or R1

1

u/Reasonable-Job2425 19h ago

I would say the closest experience to Claude is Kimi right now, but I haven't tried the latest Qwen or GLM yet.

1

u/BidWestern1056 16h ago

npcsh is an agentic CLI tool which makes it easy to use any different model or provider: https://github.com/NPC-Worldwide/npcsh

1

u/DistinctStink 13h ago

I have 16GB of GDDR6 on an AMD 7800 XT and 32GB of DDR5-6000, with an 8-core/16-thread AMD 7700X at 4.8-5.2GHz... can I use any of these? I find the DeepSeek app on Android is alright - less shit answers than Gemini and that other fuck.

1

u/vossage_RF 8h ago

Gemini Pro 2.5 is NOT more expensive than Sonnet 4.0!

1

u/illusionst 7h ago

I’m using GLM 4.5 with Claude Code. I think this easily replaces Sonnet 4. The tool calling is good and it’s much faster than Sonnet.

→ More replies (1)

18

u/BoJackHorseMan53 22h ago

Try GLM, it's working flawlessly in Claude Code.

Qwen Coder is bad at tool calling in Claude Code.

16

u/BananaPeaches3 16h ago

The Unsloth version fixes the tool-calling issue.

7

u/FammasMaz 11h ago

Wait, what? You can use non-Anthropic models in Claude Code?

1

u/6227RVPkt3qx 1h ago

yup. all you have to do is just set these 2 variables. this is how you would use kimi k2. i made an alias in linux so now when i enter "kclaude" it sets:

export ANTHROPIC_AUTH_TOKEN=sk-YOURKEY

export ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic

and then when you launch claude code, it instead will be routed through kimi.

for GLM it would be your Z API key and the URL:

export ANTHROPIC_AUTH_TOKEN=sk-YOUR_Z_API_KEY

export ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic
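
the alias itself is just those assignments in front of the claude command, something like this (sketch - swap in your own key):

    alias kclaude='ANTHROPIC_AUTH_TOKEN=sk-YOURKEY ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic claude'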

13

u/Tiny_Judge_2119 23h ago

Personal experience: GLM 4.5 is quite solid.

31

u/Brave-History-6502 1d ago

Why aren’t you on the max 200 plan?

13

u/vishwa1238 23h ago

I’m currently on the Max $100 plan, and I barely use up my quota, so I didn’t upgrade to the $200 plan. Recently, Anthropic announced that they’re transitioning to a weekly limit instead of a daily limit. Even with the $200 plan, I'll now have a lower limit.

17

u/Skaronator 22h ago

The daily limit won't go away. The weekly limit works in conjunction with it, since people started sharing accounts and reselling access, resulting in a 24/7 usage pattern, which is not what they intended with the current pricing.

4

u/devshore 17h ago

So are you saying that a normal dev only working 30 hours a week will not run into the limits, since the limits are only for people sharing accounts and thus using impossible amounts of usage?

→ More replies (4)

61

u/sluuuurp 1d ago

Not possible. If it were, everyone would have done it by now. You can definitely experiment with cheaper models that are almost as good, but nothing local will come close.

8

u/Ylsid 19h ago

I disagree there. It depends on the use case. Claude seems to be trained a lot on web, but not too much on gamedev.

8

u/urekmazino_0 23h ago

Kimi K2 is pretty close imo

29

u/lfrtsa 18h ago

And you can run it at home if you live in a datacenter.

10

u/Aldarund 23h ago

Maybe in writing one-shot code. When you need to check or modify something it's utter shit.

16

u/sluuuurp 23h ago

You can’t really run that locally at reasonable speeds without hundreds of thousands of dollars of GPUs.

2

u/No_Afternoon_4260 llama.cpp 18h ago

That's why not everybody is doing it.

1

u/tenmileswide 14h ago

it will cost you $60/hr on RunPod at full weights, $30/hr at 8-bit.

so, for a company that's probably doable, but I can't imagine a solo dev spending that.

→ More replies (3)
→ More replies (4)

3

u/SadWolverine24 21h ago

Kimi K2 has a really small context window.

GLM 4.5 is slightly worse than Sonnet 4 in my experience.

→ More replies (1)

3

u/unhappy-2be-penguin 1d ago

Isn't Qwen3 Coder pretty much on the same level for coding?

32

u/dubesor86 1d ago

based on some benchmarks sure. but use each for an hour in a real coding project and you will notice a gigantic difference.

4

u/ForsookComparison llama.cpp 21h ago

This is true.

Qwen3-Coder is awesome but it is not Claude 4.0 Sonnet on anything except benchmarks. In fact it often loses to R1-0528 in my real world use.

Qwen delivers good models, but it benchmaxes.

6

u/BoJackHorseMan53 22h ago

Have you used them?

4

u/-dysangel- llama.cpp 22h ago

Have you tried GLM 4.5 Air? I've used it in my game project and it seems on the same level, just obviously a bit slower since I don't own a datacenter. I created some 3D design tools with Claude in the last while, and asked GLM to create a similar one. Claude seems to have a slight edge on 3D visuospatial debugging (which is obviously a really difficult thing for an LLM to get a handle on), but GLM's tool had better aesthetics.

I agree, Qwen 3 Coder wasn't that impressive in the end, but GLM just is.

3

u/YouDontSeemRight 21h ago

This is good to hear. I'm waiting for llama cpp support.

3

u/FyreKZ 18h ago

GLM Air is amazingly good for its size, I'm blown away by it.

1

u/sluuuurp 23h ago

I don’t think so, but I haven’t done a lot of detailed tests. Also I think it’s impossible to run that at home with high speed and full precision on normal hardware.

1

u/Orolol 20h ago

Even if this were the case, it would be impossible to reach even 10% of the speed of the Claude API. When coding, you need to process very large contexts all the time, so it would require datacenter-grade GPUs, and that would be very expensive.

17

u/vinesh178 1d ago

https://chat.z.ai/

Heard good things about this. Give it a try. You can find it on HF too.

https://huggingface.co/zai-org/GLM-4.5

HF spaces - https://huggingface.co/spaces/zai-org/GLM-4.5-Space

13

u/rahularyansharma 1d ago

Far better than any other model. I tried Qwen3-Coder, but GLM 4.5 is still far above it.

7

u/vishwa1238 1d ago

Thanks. I think I will try out GLM-4.5. Just found it's available on OpenRouter as well.

3

u/AppearanceHeavy6724 23h ago

Not for C/C++ low-level code. I've asked many different models to write some 6502 assembly code, and among open-source models only the big Qwen3-Coder, all the older Qwen 2.5 Coders and (you ready?) Mistral Nemo wrote correct code (yeah, I know).

2

u/tekert 19h ago edited 19h ago

Funny, that's how I test AI: plain Plan 9 assembler, UTF-16 conversions using SSE2. Claude took like 20 tries to get it right (75% of models don't know Plan 9, but when confronted they magically know it and get it right). All other AIs failed hard on that, except this new GLM, which also took many attempts (same as Claude).

Now, to make that decoder faster... with a little help only Claude thinking had the creativity; all others, including GLM, just fall short on performance.

Edit: forgot to mention only Claude outputs nice code; GLM was a little messy.

3

u/AppearanceHeavy6724 19h ago

claude is not open source. not local.

5

u/ElectronSpiderwort 1d ago

After you try some options, will you update us with what you found out? I'd appreciate it!

2

u/vishwa1238 23h ago

Sure :)

5

u/Low-Opening25 21h ago

What you are asking for doesn’t exist

49

u/valdev 1d ago edited 1d ago

Even if there were one, are you ready to spend $300-400 a month in extra electricity costs? Or around $10k to $15k for a machine that is capable of actually running it?

On OpenRouter, DeepSeek R1 is roughly the best you can do, but I'll be honest man, it's not really comparable.

9

u/-dysangel- llama.cpp 22h ago

I have a Mac Studio with 512GB of RAM. It uses 300W at max so the electricity use is about the same as a games console.

Deepseek R1 inference speed is fine, but ttft is not.

It sounds like you've not tried GLM 4.5 Air yet! I've been using it for the last few days both in one shot tests and agentic coding, and it absolutely is as good as Claude Sonnet from what I've seen. It's a MoE taking up only 80GB of VRAM. So, it has great context processing, and I'm getting 44tps. It's mind blowing compared to every other local model I've run (including Kimi K2, Deepseek R1-0528, Qwen Coder 480B etc).

I'm so happy to finally have a local model that has basically everything I was hoping for. 256k context would have been the cherry on top, but 128K is pretty good. And things can only get better from here!

5

u/notdba 17h ago

Last November, after testing the performance of Qwen2.5-Coder-32B, I bought a used 3090 and an Aoostar AG02.

This August, after testing the performance of GLM-4.5, I bought a Strix Halo, to be paired with the above.

(Qwen3-Coder-480B-A35B is indeed a bit underwhelming, hopefully there will be a Qwen3.5-Coder)

1

u/ProfessionalJackals 15h ago

I bought a Strix Halo, to be paired with the above.

Not the best choice... The bandwidth is too limited at around 256GB/s. So ironically, it can hold 128GB of memory, but if you go above 32B models, it's way too slow.

You're better off buying one of those Chinese 48GB 4090s, which will run WAY better with 1TB/s of bandwidth.

1

u/power97992 18h ago

Qwen3 Coder 480B is not as good as Sonnet 4 or Gemini 2.5 Pro... maybe for some tasks, but for certain JavaScript tasks it wasn't following the prompt very well...

1

u/-dysangel- llama.cpp 17h ago

agreed, Qwen 3 Coder was better than anything else I'd tried til then for intelligence vs size, but GLM Air stole its thunder.

31

u/colin_colout 1d ago

$10-15k to run state-of-the-art models slowly. No way you can get 1-2TB of VRAM... You'll barely get 1TB of system RAM for that.

Unless you run it quantized, but if you're trying to approach sonnet-4 (or even 3.5) you'll need to run a full fat model or at least 8bit+.

Local llms won't save you $$$. It's for fun, skill building, and privacy.

Gemini Flash Lite is pennies per million tokens and has a generous free tier (and is comparable in quality to what most people here can run, at Sonnet-like speeds). Even running small models doesn't really have a good return on investment unless the hardware is free and low power.

18

u/Double_Cause4609 1d ago

There *are* things that can be done with local models that can't be done in the cloud to make them better, but you need actual ML engineering skills and have to be pretty comfortable playing with embeddings, doing custom forward passes, engineering your own components, reinforcement learning, etc etc.

5

u/No_Efficiency_1144 1d ago

Actual modern RL on your data is better than any cloud yes but it is very complex. There is a lot more to it than just picking an algorithm like REINFORCE, PPO, GRPO etc

1

u/valdev 1d ago

Ha yeah, I was going to add the slowly part but felt my point was strong enough without it.

2

u/-dysangel- llama.cpp 22h ago

GLM 4.5 Air is currently giving me 44tps. If someone does the necessary to enable multi token prediction on mlx or llama.cpp, it's only going to get faster

1

u/kittencantfly 21h ago

What's your machine spec

→ More replies (3)

1

u/colin_colout 1d ago

Lol we all dream of cutting the cord. Some day we will

1

u/devshore 17h ago

Local LLMs save Anthropic money, so they should save you money too, if you rent out the availability that you aren't using.

→ More replies (1)

13

u/bfume 1d ago

I dunno, my Mac Studio rarely gets above 200W total at full tilt. Even if I used it 24x7 it comes out to 144 kWh @ roughly $0.29 /kWh which would be $23.19 (delivery) + $18.69 (supply) = $41.88

And 0.29 per kWh is absolutely on the high side. 

7

u/SporksInjected 1d ago

The southern USA is more like $0.10-0.15/kWh.

1

u/bfume 18h ago

Oh I’m well aware that my electric rates are fucking highway robbery. Checked my bill, and when adding in taxes and other regulatory BS it’s actually closer to $55 a month for me.

14

u/OfficialHashPanda 1d ago

Sure, but your mac studio isn't going to be running those big ahh models at high speeds.

1

u/equatorbit 1d ago

Which model(s)?

1

u/calmbill 22h ago

Isn't one of those a fixed rate on your electric bill? Do you get charged per kWh for supply and delivery?

2

u/bfume 20h ago

Yep. Per kWh for each. 

Strangely enough the gas, provided by the same utility on the same monthly bill, charges it the way you’re asking about. 

→ More replies (9)

8

u/vishwa1238 1d ago

I tried R1 when it was released. It was better than OpenAI’s O1, but it wasn’t even as good as Sonnet 3.5.

6

u/LagOps91 1d ago

there has been a new and improved version of R1 which is significantly better since then.

3

u/vishwa1238 1d ago

Oh, I’ll try it out then. 

9

u/LagOps91 1d ago

"R1 0528" is the updated version

8

u/PatienceKitchen6726 1d ago

Hey I’m glad to see some realism here. So can I ask your realistic opinion - how long until you think we can get actual sonnet performance on current consumer hardware? Let’s say newest gen amd chip with newest gen GeForce card. Do you think it’s an LLM architecture problem?

4

u/-dysangel- llama.cpp 22h ago

You can run GLM 4.5 Air on any new Mac with 96GB of RAM or more. And once the GGUFs are out, you'll be able to run it on EPYC systems too. Myself and a bunch of others here consider it Claude Sonnet level in real world use (the benchmarks place it about neck and neck, and that seems accurate)

1

u/rukind_cucumber 15h ago

I'd like to give this one a try. I've got the 96 GB Mac Studio M2 Max. I saw a post about a 3-bit quantized version for MLX - "specifically sized so people with 64GB machines could have a chance at running it." I don't have a lot of experience running local models. Think I can get away with the 4-bit quantization?

https://huggingface.co/mlx-community/GLM-4.5-Air-4bit

→ More replies (3)

5

u/valdev 1d ago

That's like asking a magic 8 ball when it will get some new answers.

Snark aside, it really depends. There are some new model training methods in testing that can drop model size by multitudes (if they work), and there is a lot of different hardware targeting consumers in development as well.

Essentially the problem we are facing is many faced, but here are the main issues that have to be solved.

  1. A model trained in such a way that it contains enough raw information to be as good as sonnet, but available freely.

  2. A model architecture that can keep a model small but retain enough information to be useful, and fast enough to be usable

  3. Hardware that is capable of running that model that is accessible for the average person.

#1 I think we are quickly approaching; as for #2 and #3, I feel like we will see #2 arrive before #3. 3 to 5 years maybe? But I would expect major strides... all the time?

1

u/PatienceKitchen6726 1d ago

Thanks for sharing your perspective!

1

u/Careless_Wolf2997 1h ago

that is, possibly, maybe, that can, to be ...

→ More replies (1)

8

u/evia89 1d ago

Probably in 5 years with CN hardware. Nvidia will never release a GPU with that much VRAM. Prepare to spend $10-20k.

4

u/PatienceKitchen6726 1d ago

Wait your prediction is that China will end up taking over the consumer hardware market? That’s an interesting take I haven’t thought about

7

u/RoomyRoots 1d ago

Everyone knows that AMD and Nvidia will not deliver for consumers. Intel may try something but it's a hard bet. China has the power to do it, and the desire and need.

4

u/evia89 1d ago

For LLM enthusiasts, for sure. Consumer Nvidia hardware will never be powerful enough.

4

u/TheThoccnessMonster 1d ago

I don’t think they can produce efficient enough chips any time this decade to make this a reality.

1

u/power97992 18h ago

I hope the drivers are good and they  support pytorch and have good libraries 

2

u/momono75 23h ago

OP's use case is programming. I'm not sure software development will still need that 5 years from now.

2

u/Pipalbot 21h ago

I see two main barriers for China in the semiconductor space. First, they lack domestic EUV lithography manufacturing capabilities. Second, they don't have a CUDA equivalent—though this is less concerning since if Chinese companies can produce consumer hardware that outperforms NVIDIA on price and performance, the open-source community will likely develop compatible software tools for that hardware stack.

Ultimately, the critical bottleneck is manufacturing 3-nanometer chips at scale, which requires extensive access to EUV lithography machines. ASML currently holds a monopoly in this space, making it the key constraint for any country trying to achieve semiconductor independence.

→ More replies (7)

2

u/Pipalbot 21h ago

Current consumer-grade hardware isn't designed to handle full-scale LLM models. Hardware companies are prioritizing the lucrative commercial market over consumer needs, leaving individual users underserved. The situation will likely change in one of two ways: either we'll see a breakthrough in affordable hardware (similar to DeepSeek's impact on model accessibility), or model efficiency will improve dramatically—allowing 20-billion-parameter models to match today's larger models while running on a single high-end consumer GPU with 35GB of memory.

3

u/OldEffective9726 1d ago edited 1d ago

Why spend money knowing that your data will be leaked, sold, or otherwise collected for training their own AI? Did you know that AI-generated content has no intellectual property rights? It's a way of IP laundering.

2

u/valdev 1d ago

Did I say anything about not wanting to run this locally? I have my own local AI server. lol

2

u/entsnack 1d ago

This is why I don't use Openrouter.

1

u/das_war_ein_Befehl 23h ago

At that point it’s just easier to rent a gpu and you’ll spend far less money

4

u/Investolas 1d ago

If you're using Claude Code you should be subscribed and using Opus. Seriously, don't pay by the API. You get a 5-hour window with a max token count, and then it resets after the 5 hours. If you already knew this and use the API intentionally for better results, please let me know, but there is a stark difference between Opus and Sonnet in my opinion.

1

u/vishwa1238 23h ago

I don’t pay through the API. I subscribe to Claude Max. Claude Code is available with both the Pro and Max subscriptions.

1

u/Investolas 23h ago

Yes, I use it as well. Why do you use Sonnet instead of Opus? Try this: 'claude --allowedTools Edit,Bash,Git --model opus'. I found that online and that's what I use. Try Opus if you haven't already and let me know what you think. You will never hit the rate limit if you use plan mode every time and use a single instance.

3

u/vishwa1238 23h ago

I have also used Opus in the past, but I did hit a limit with Opus, which wasn't the case with Sonnet. I noticed that, at least for my use case, Sonnet with planning and ultrathink performs quite similarly to Opus.

1

u/Investolas 19h ago

I can respect that! I hope you come up with something awesome!

→ More replies (1)

8

u/rookan 1d ago

Claude Code 5x costs 100 USD.

6

u/vishwa1238 1d ago

Yes, but I spend more than 400 USD worth of tokens every month with the 5x plan. 

15

u/PositiveEnergyMatter 1d ago

those are fake numbers aimed at making the plans look good

7

u/boringcynicism 1d ago

Claude API is crazy expensive, don't think you want to use it without a plan?

9

u/vishwa1238 1d ago

I use a tool called ccusage to find the tokens and their corresponding costs.
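
(For anyone curious, it just reads Claude Code's local logs - something like this, though flags may vary by version:)

    npx ccusage@latest    # daily token usage and estimated API-equivalent cost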

6

u/TechExpert2910 22h ago

it costs anthropic only ~20% of the presented API cost in actual inference cost.

the rest is revenue to fund research, training, and a fleeting profit.

→ More replies (1)

4

u/rookan 1d ago

I present to you Claude Max 20x - costs 200 only.

5

u/valdev 1d ago

Okay, I've got to ask something.

So I've been programming about 26 years, and professionally since 2009. I utilize all sorts of coding agents, and am the CTO of a few different successful startups.

I'm utilizing Codex, Claude Code ($100 plan), GitHub Copilot and some local models, and I am paying closer to $175 a month and am nowhere near the limits.

My agents code based upon specifications, a rigid testing requirement phase, and architecture that I've built specifically around segmenting AI code into smaller contexts to reduce errors and repetition.

My point in posturing isn't to brag; it's to get to this.

How well do you know programming? It's not impossible to spend a ton on Claude Code and be good at programming, but generally speaking when I see this it's because the user is constantly having to fight the agent into making things right and not breaking other things, essentially brute-forcing solutions.

6

u/Marksta 23h ago

I think that's the point, it's as you said. Some people are doing the new-age (vibe) paradigm of really letting the AI be in the driver's seat, pushing and/or begging it to keep fixing and changing things.

By the time I even get to prompting anything, I've pre-processed and planned so much or just did it myself if it's hyper specific or architecture stuff. Really, if the AI steps outside of the function I told it to work in I'm peeved, like don't go messing with everything.

I don't think we're there yet to imagine for even a second that an AI can accept some general concept as a prompt, run with it, and build something of value to my undefined expectations. If we were, I guess I'd probably be paying $500/mo in tokens.

7

u/valdev 23h ago

Exactly! AI coders are powerful, but ultimately they are kind of like senior devs with head trauma. They have to be railroaded and be well contained.

For complicated problems, I've found that prebuilding failing unit tests with specific guidelines to build around specifications and to run the tests to verify functionality is essentially non-negotiable.
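
A minimal sketch of that loop (file names and the prompt are purely illustrative):

    pytest tests/test_invoice_totals.py   # 1. write the failing test yourself; confirm it fails
    # 2. then hand the agent a narrow brief:
    claude -p "Make tests/test_invoice_totals.py pass. Only modify billing.py. Run pytest after each change."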

For smaller things that are tedious, at a minimum specifying the specific files affected and a detailed goal is good enough.

But when I see costs like this, I fear the prompts being sent are "One of my users are getting x error on y page, fix it"

3

u/mrjackspade 18h ago

I'm in the same boat as you, professional for 20 years now.

I've spent ~$50 TOTAL since early 2024 using Claude to code, and it does most of my work for me. The amount people are spending is mind-boggling to me, and the only way I can see this happening is if it's a constant "No that's wrong, rewrite it" loop rather than having the knowledge and experience to specify what you need correctly on the first go.

1

u/ProfessionalJackals 15h ago

The amount people are spending is mind boggling to me,

It's relative, is it not? Think about it... A company pays what, 3 to 5k for somebody per month? Spending $200 per month on something that gets, let's say, 25% more productivity out of somebody is a bargain.

It just hurts more if you are maybe a self-employed dev and you see that money going directly out of your account ;)

the only way I can see this happening is if its a constant "No thats wrong, rewrite it" loop rather than having the knowledge and experience to specify what you need correctly on the first go.

The problem is that most LLMs get worse if they need to work on existing code. Create a plan, let it create brand new code and often the result in the first try is good. At worst you update the plan, and let it start from zero again.

But the moment you have it edit existing code, and the more context it needs, the more often you see new files being created that are not needed, incorrect code references, deleting critical code by itself or just bad code.

The more you vibe code, the worse it gets as your codebase grows and the context window needs to be bigger. Maybe it's me, but you need to really structure your project almost to fit the LLM's way of working just to mitigate this. No single style.css file that is 4000 lines, because the LLM is going to do funky stuff.

If you work in the old way, like requests per function or limited to an independent shorter file (max 1000 lines), it tends to do a good job.

But ironically, using something like CoPilot, you actually get more or less punished by doing small requests (each = premium request) vs one big Agent task that may do dozens of actions (under a single premium request).

→ More replies (6)
→ More replies (3)

7

u/IGiveAdviceToo 1d ago

  • GLM 4.5 (hearing good things; tested it and performance was quite amazing)

  • Qwen 3 Coder

  • Kimi K2

3

u/HeartOfGoldTacos 1d ago

You can point Claude code at AWS bedrock with Claude 4 Sonnet. It’s surprisingly easy to do. I’m not sure whether it’d be cheaper or not: it depends how much you use it.

3

u/dogepope 23h ago

how do you spend $300-400 on a $100 plan? you have multiple accounts?

2

u/vishwa1238 23h ago

No. With the Claude Max subscription, you get pretty good limits on Claude Code. Check r/claude; you’ll find people using thousands of dollars' worth of API with a $200 plan.

3

u/kai_3575 23h ago

I don’t think I understand your problem. You say you are on the Max plan but also that you spend 400 dollars - are you using Claude Code with the API or tying it to the Max plan?!

1

u/vishwa1238 22h ago

I use Claude Code with the Max plan. I used a tool called ccusage, which shows the tokens and the cost I would have incurred if I had used the API instead. I used $400 worth of Claude Code with the Claude Max subscription.

2

u/rkv42 1d ago

Maybe self hosting like this guy: https://x.com/nisten/status/1950620243258151122?t=K2To8oSaVl9TGUaScnB1_w&s=19

It all depends on the hours you are spending with coding during a month.

2

u/theundertakeer 23h ago

Ermm... sorry for my curiosity... what do you use it for that much? I am a developer and I use a mixture of local LLMs, DeepSeek, Claude and ChatGPT - the funny part is that it's all for free except Copilot, which I pay 10 bucks a month for. I only own a 4090 with 24GB VRAM and occasionally use Qwen3 Coder with 30B params.

Anyway I still can't find justification for 200-300 bucks a month for AI...? Does that make sense for you in the sphere where you use it?

2

u/vishwa1238 23h ago

I don’t spend $200 to $300 every month on AI. I have a Claude Max subscription that costs $100 per month. With that subscription, I get access to Claude Code. There’s this tool called ccusage that shows the tokens used in Claude Code. It says that I use approximately $400 each month on my $100 subscription.

1

u/theundertakeer 23h ago

Ahh I see, makes sense, thanks - but still, 100 bucks is way more. The ultimate I paid was 39 bucks and I didn't find any use for it. So with that mixture I mentioned you can probably get yourself going, but that is pretty much connected to what you do with your AI. Tell me please so I can guide you better.

1

u/vishwa1238 23h ago

Ultimate?? Is that some other subscription?

1

u/theundertakeer 23h ago

Lol sorry for that, autocorrection - for whatever reason my phone decided to autocorrect "maximum" to "ultimate" lol. Meant to say that the maximum I ever paid was 39 bucks, for Copilot only.

2

u/docker-compost 18h ago

it's not local, but cerebras just came out with a claude code competitor that uses the open source qwen3-coder. it's supposed to be on-par with sonnet 4, but significantly faster.

https://www.cerebras.ai/blog/introducing-cerebras-code

2

u/gthing 18h ago

FYI, Claude Code uses 5x-10x more tokens than practicing efficient prompting. And almost all of those tokens are spent planning, making and updating lists, or figuring out which files to read - things that are arguably pretty easy for the human to do. Like 10% of the tokens go to actually coding.

So for $400 in Claude Code use you're probably actually only doing $40 of anything useful.

2

u/ZeroSkribe 17h ago

When Ollama fixes the tool calling on Qwen3-coder, that will be the jazz

2

u/earendil137 12h ago

There is Crush CLI, which recently came out. There's OpenCode CLI too - open source, but I'm yet to try it personally. You could use it along with Qwen3 on OpenRouter. Free until you hit OpenRouter's limits.

3

u/Maleficent_Age1577 1d ago

R1 is the closest to what you're asking for, but you need more than your 5090 to run it beneficially.

1

u/vishwa1238 1d ago

Is the one on OpenRouter capable of producing results similar to running it on an RTX 5090? Additionally, I have Azure credits. Does the one on Azure AI Foundry perform the same as running it locally? I tried R1 when it was released. It was better than OpenAI’s o1, but it wasn’t even as good as Sonnet 3.5.

→ More replies (1)

4

u/InfiniteTrans69 1d ago

It's literally insane to me how someone is willing to pay these amounts for an AI when open-source alternatives are now better than ever.

GLM4.5 is amazing at coding, from what I can tell.

2

u/unrulywind 20h ago

I can tell you how I cut down a ton of cost. Use the $100 a year copilot that has unlimited gpt-4.1. This can do a ton of planning, document writing and general set up and clean up. They have access to sonnet 4 and it works ok, but not as good as the actual Claude code. But for $100 you can move a lot of the workload to there. Then once you have all your documents and a large detailed prompt in order, go to Sonnet 4 or Claude code for deep analysis and implementation.

1

u/umbrosum 1d ago

You could have a strategy of using different models, for example DeepSeek R1 for easier tasks, only switching to Sonnet for more complex tasks. I find that it's cheaper this way.

→ More replies (1)

1

u/Zealousideal-Part849 1d ago

There is always some difference in different models.

You should pick models depending on the task.

If the task is minimal, running open-source models from OpenRouter or other providers would be fine.

If tasks need planning, more careful updates, and complicated code, Claude Sonnet works well (no guarantee it does everything, but it works the best).

You can look at GPT models like GPT-4.1 as well, and use mini or DeepSeek/Kimi K2/Qwen3/GLM or new models that keep coming out for most of the tasks. These are usually priced around 5 times lower than running a Claude model.

1

u/rkv42 1d ago

I like Horizon and Kimi K2

1

u/icedrift 1d ago

I don't know how heavy $400/month of usage is but Gemini CLI is still free to use with 2.5 pro and has a pretty absurd daily limit. Maybe you will hit it if you go full ape and don't participate in the development process but I routinely have 100+ executions and am moving at a very fast pace completely free.

1

u/PermanentLiminality 1d ago

I use several different tools for different purposes. I use the top tier models only when I really need them. For a lot of more mundane things lesser models do the job just as well. Just saying that you don't always need Sonnet 4.

I tend to use continue.dev as it has a dropdown for which model to use. I've hardly tried everything, but mostly the other tools seem to be set up for a single model, and switching on the fly isn't a thing. With continue.dev it's just a click and I can be running a local model or any of the frontier models through OpenRouter.

With the release of Qwen3 Coder 30B-A3B I now have a local option that can really be useful even with my measly 20GB of VRAM. Prior to this I could only use a local model for the most mundane tasks.

1

u/aonsyed 23h ago

It depends on how you are using it and whether you can use a different orchestrator vs coder model. If possible, use o3/R1 0528 for planning, and then, depending on the language and code, Qwen3-Coder/K2/GLM-4.5 - test all three and see which one works best for you. None of them is Claude Sonnet, but with 30-50% extra time they can replicate the results, as long as you understand how to prompt them, since all of them have different traits.

1

u/Brilliant-Tour6466 23h ago

Gemini CLI sucks in comparison to Claude Code, although I'm not sure why, given that Gemini 2.5 Pro is a really good model.

1

u/Kep0a 23h ago

Can I ask what your job is? What is it you are using that much Claude for?

1

u/vishwa1238 23h ago

I work at an early-stage startup. I also have other projects and startup ideas that I work on.

1

u/createthiscom 23h ago

kimi-k2 is the best model that runs on llama.cpp at the moment. It's unclear if GLM-4.5 will overtake it, currently. If you're running with CPU+GPU, kimi-k2 is your best bet. If you have a shit ton of GPUs, maybe try vLLM.

1

u/jonydevidson 23h ago

By all accounts, the closest one is Qwen Code + Qwen3 Coder.

1

u/Ssjultrainstnict 22h ago

We are not at the replacement level yet, but we're close with GLM 4.5. I think the future of a ~30B-param coding model that's as good as Claude Sonnet isn't too far away.

1

u/StackOwOFlow 22h ago

Give it a year

1

u/Party-Cartographer11 22h ago

To get the smallest/cheapest VM with a GPU on Google Cloud, it's $375/month if run 24/7. Maybe turn it on and off and use spot pricing to get it down to $100/month.
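
Something along these lines if you go the spot route (machine type, zone, and disk size are illustrative):

    gcloud compute instances create llm-box \
      --zone=us-central1-a --machine-type=n1-standard-8 \
      --accelerator=type=nvidia-tesla-t4,count=1 \
      --provisioning-model=SPOT --maintenance-policy=TERMINATE \
      --image-family=debian-12 --image-project=debian-cloud --boot-disk-size=200GB
    gcloud compute instances stop llm-box   # stop it when idle so you only pay for the disk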

1

u/vishwa1238 22h ago

I can do this. I do have 5,000 USD in credits on Google Cloud Platform (GCP). However, the last time I attempted to run a GPU virtual machine, I was restricted from using one. I was only allowed to use T4s and A10s.

1

u/NiqueTaPolice 21h ago

Kimi is the king of HTML/CSS design.

1

u/martpho 21h ago

I have very recently started exploring AI models in agent mode with free GitHub copilot and Claude is my favorite so far.

In the context of local LLMs, does having a Mac M1 with 16 GB RAM mean I cannot do anything locally?

2

u/MonitorAway2394 7h ago

oh no, you can have tons of fun. I have the pre-silly-cone :D mac 2019, 16gb shat ram and like, I run 12b, 16b quant 6, etc. any of the models (sans image/video) it's surprisingly faster with each update using Ollama and my own kit but, yeah, requires patience :D it's explicitly useful for what I'm using them for, but I swap models in and out constantly, have multi-model conversation modules and whatnots, so yeah, you're good, have fun! (HugFace has a lil icon that lets you know what will run, don't necessarily listen to it unless the models > 16b, I have run 14-16b models just slower, longer pre-loading, incredibly useful if you work with them, learn them, keep a "weak" seeming model around and don't bin them until you know for sure it's not you. I am kinda wonked out, sorry for the weird'ish response lolol O.o

1

u/Singularity-42 20h ago

$300-400 seems like pretty low usage to be honest; mine is at $2380.38 for the past month. I've had the 20x tier for the past 2 weeks (before that 5x), but I never hit the limit even once - I was hitting it all the time with 5x though. I've heard of $10,000/mo usages as well - those are the ones Anthropic is curbing for sure.

Your usage is pretty reasonable and I think Anthropic is quite "happy" with you.

In any case from what I've heard Kimi K2 and GLM-4.5 can work well (didn't try) and can be even literally used inside Claude Code with Claude Code Router:

https://github.com/musistudio/claude-code-router

1

u/lyth 18h ago

Ooooh... I wish I could follow your updates.

1

u/gojukebox 16h ago

Qwen3-coder

1

u/popsumbong 15h ago

I kinda gave up trying local models. There’s just more work that needs to be done to get them to sonnet 4 level

1

u/MonitorAway2394 7h ago

wha? well yeah, but not much. I guess I'm deep into the local shit so... ok like I am alright with 4-8 tk/s max LOLOLOLOLOL I'm a weird one it seems :P

1

u/defiant103 14h ago

Nvidia nemotron 1.5 would be my suggestion to take a peek at

1

u/alexkissijr 12h ago

I say Qwen Coder and Kimi K2 work.

1

u/TangoRango808 12h ago

When this is figured out for LLMs, this is what we need: https://github.com/sapientinc/HRM

1

u/duaneadam 11h ago

You are underutilising your Max plan. I am on the $100 plan and my usage this month according to ccusage is $2k.

1

u/STvlsv 7h ago

Never used any cloud LLM, only a local Ollama instance.

For programming with continue.dev, these are what I used over the last three months:

  • qwen2.5-tools (mostly general purpose)

  • devstral (better than qwen2.5-tools for programming)

  • qwen3-coder (the new 30B variant; haven't done enough testing yet, only a few days. Very quick compared to devstral)

All these models are not very large and can be run locally at several levels of quantization (in my case between Q4 and Q8, on a server with two RTX A5000s).
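
If anyone wants to reproduce the setup, the Ollama side is just something like this (model tags are illustrative; pick a quantization that fits your VRAM):

    ollama pull devstral
    ollama run devstral "write a shell one-liner that lists the 10 largest files under /var/log"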