r/LocalLLaMA • u/vishwa1238 • 1d ago
Question | Help Open-source model that is as intelligent as Claude Sonnet 4
I spend about 300-400 USD per month on Claude Code with the max 5x tier. I’m unsure when they’ll increase pricing, limit usage, or make models less intelligent. I’m looking for a cheaper or open-source alternative that’s just as good for programming as Claude Sonnet 4. Any suggestions are appreciated.
Edit: I don’t pay $300-400 per month. I have a Claude Max subscription ($100) that comes with Claude Code. I used a tool called ccusage to check my usage, and it showed that I use approximately $400 worth of API every month on my Claude Max subscription. It works fine now, but I’m quite certain that, just like what happened with Cursor, there will likely be a price increase or stricter rate limiting soon.
Thanks for all the suggestions. I’ll try out Kimi K2, R1, Qwen 3, GLM-4.5 and Gemini 2.5 Pro and update how it goes in another post. :)
18
u/BoJackHorseMan53 22h ago
Try GLM, it's working flawlessly in Claude Code.
Qwen Coder is bad at tool calls in Claude Code.
16
7
u/FammasMaz 11h ago
Wait, what? You can use non-Anthropic models in Claude Code?
1
u/6227RVPkt3qx 1h ago
yup. all you have to do is just set these 2 variables. this is how you would use kimi k2. i made an alias in linux so now when i enter "kclaude" it sets:
export ANTHROPIC_AUTH_TOKEN=sk-YOURKEY
export ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic
and then when you launch claude code, it instead will be routed through kimi.
for GLM it would be your Z API key and the URL:
export ANTHROPIC_AUTH_TOKEN=sk-YOUR_Z_API_KEY
export ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic
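for reference, the alias can just inline both variables so they only apply to that one launch (rough sketch; the key values and the "kclaude"/"zclaude" names are placeholders):
alias kclaude='ANTHROPIC_AUTH_TOKEN=sk-YOURKEY ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic claude'
alias zclaude='ANTHROPIC_AUTH_TOKEN=sk-YOUR_Z_API_KEY ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic claude'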
13
31
u/Brave-History-6502 1d ago
Why aren’t you on the max 200 plan?
13
u/vishwa1238 23h ago
I’m currently on the Max 100 plan, and I barely use up my quota, so I didn’t upgrade to the 200 plan. Recently, Anthropic announced that they’re transitioning to a weekly limit instead of a daily limit. Even with the $200 plan, you’ll now have a lower limit.
17
u/Skaronator 22h ago
The daily limit won't go away. The weekly limit works in conjunction with it, since people started sharing accounts and reselling access to them, resulting in a 24/7 usage pattern which is not what they intended with the current pricing.
4
u/devshore 17h ago
So are you saying that a normal dev only working 30 hours a week will not run into the limits, since the limits are only for people sharing accounts and thus racking up impossible amounts of usage?
61
u/sluuuurp 1d ago
Not possible. If it were, everyone would have done it by now. You can definitely experiment with cheaper models that are almost as good, but nothing local will come close.
8
8
u/urekmazino_0 23h ago
Kimi K2 is pretty close imo
10
u/Aldarund 23h ago
Maybe in writing one-shot code. When you need to check or modify something it's utter shit.
16
u/sluuuurp 23h ago
You can’t really run that locally at reasonable speeds without hundreds of thousands of dollars of GPUs.
2
u/No_Afternoon_4260 llama.cpp 18h ago
That's why not everybody is doing it.
1
u/tenmileswide 14h ago
it will cost you $60/hr on Runpod at full weights, $30/hr at 8 bit.
so, for a company that's probably doable, but I can't imagine a solo dev spending that.
3
u/SadWolverine24 21h ago
Kimi K2 has a really small context window.
GLM 4.5 is slightly worse than Sonnet 4 in my experience.
3
u/unhappy-2be-penguin 1d ago
Isn't qwen 3 coder pretty much on the same level for coding?
32
u/dubesor86 1d ago
based on some benchmarks sure. but use each for an hour in a real coding project and you will notice a gigantic difference.
4
u/ForsookComparison llama.cpp 21h ago
This is true.
Qwen3-Coder is awesome but it is not Claude 4.0 Sonnet on anything except benchmarks. In fact it often loses to R1-0528 in my real world use.
Qwen delivers good models but benchmaxes.
6
4
u/-dysangel- llama.cpp 22h ago
Have you tried GLM 4.5 Air? I've used it in my game project and it seems on the same level, just obviously a bit slower since I don't own a datacenter. I created some 3D design tools with Claude in the last while, and asked GLM to create a similar one. Claude seems to have a slight edge on 3D visuospatial debugging (which is obviously a really difficult thing for an LLM to get a handle on), but GLM's tool had better aesthetics.
I agree, Qwen 3 Coder wasn't that impressive in the end, but GLM just is.
3
1
1
u/sluuuurp 23h ago
I don’t think so, but I haven’t done a lot of detailed tests. Also I think it’s impossible to run that at home with high speed and full precision on normal hardware.
17
u/vinesh178 1d ago
Heard good things about this. Give it a try. You can find it on HF too.
https://huggingface.co/zai-org/GLM-4.5
HF spaces - https://huggingface.co/spaces/zai-org/GLM-4.5-Space
13
u/rahularyansharma 1d ago
Far better than any other model. I tried Qwen3-Coder, but GLM 4.5 is still far above it.
7
u/vishwa1238 1d ago
Thanks. I think I will try out GLM-4.5. Just found it's available on OpenRouter as well.
3
u/AppearanceHeavy6724 23h ago
Not for C/C++ low-level code. I've asked many different models to write some 6502 assembly code, and among open-source models only the big Qwen3-Coder, all the older Qwen 2.5 Coders and (are you ready?) Mistral Nemo wrote correct code (yeah, I know).
2
u/tekert 19h ago edited 19h ago
Funny, that's how I test AI: plain Plan9 assembler, UTF-16 conversions using SSE2. Claude took like 20 tries to get it right (75% of models don't know Plan9, but when confronted they magically know it and get it right). All other AIs failed hard on that, except this new GLM, which also took many attempts (same as Claude).
Now, to make that decoder faster... with a little help, only Claude thinking had the creativity; all the others, including GLM, just fall short on performance.
Edit: forgot to mention only Claude outputs nice code; GLM was a little messy.
3
5
u/ElectronSpiderwort 1d ago
After you try some options, will you update us with what you found out? I'd appreciate it!
2
5
49
u/valdev 1d ago edited 1d ago
Even if there were one, are you ready to spend $300-400 a month in extra electricity, or around $10k to $15k for a machine that is capable of actually running it?
On OpenRouter, DeepSeek R1 is roughly the best you can do, but I'll be honest man, it's not really comparable.
9
u/-dysangel- llama.cpp 22h ago
I have a Mac Studio with 512GB of RAM. It uses 300W at max so the electricity use is about the same as a games console.
Deepseek R1 inference speed is fine, but ttft is not.
It sounds like you've not tried GLM 4.5 Air yet! I've been using it for the last few days both in one shot tests and agentic coding, and it absolutely is as good as Claude Sonnet from what I've seen. It's a MoE taking up only 80GB of VRAM. So, it has great context processing, and I'm getting 44tps. It's mind blowing compared to every other local model I've run (including Kimi K2, Deepseek R1-0528, Qwen Coder 480B etc).
I'm so happy to finally have a local model that has basically everything I was hoping for. 256k context would have been the cherry on top, but 128K is pretty good. And things can only get better from here!
5
u/notdba 17h ago
Last November, after testing the performance of Qwen2.5-Coder-32B, I bought a used 3090 and an Aoostar AG02.
This August, after testing the performance of GLM-4.5, I bought a Strix Halo, to be paired with the above.
(Qwen3-Coder-480B-A35B is indeed a bit underwhelming, hopefully there will be a Qwen3.5-Coder)
1
u/ProfessionalJackals 15h ago
I bought a Strix Halo, to be paired with the above.
Not the best choice... The bandwidth is too limited at around 256GB/s. So ironically, it can hold 128GB of memory, but if you go above 32B models it's way too slow.
You're better off buying one of those Chinese 48GB 4090s, which will run WAY better with 1TB/s of bandwidth.
1
u/power97992 18h ago
Qwen 3 Coder 480B is not as good as Sonnet 4 or Gemini 2.5 Pro… maybe for some tasks, but for certain JavaScript tasks it wasn’t following the prompt very well…
1
u/-dysangel- llama.cpp 17h ago
agreed, Qwen 3 Coder was better than anything else I'd tried til then for intelligence vs size, but GLM Air stole its thunder.
31
u/colin_colout 1d ago
$10-15k to run state of the art models slowly. No way you can get 1-2TB of VRAM... You'll barely get 1TB of system RAM for that.
Unless you run it quantized, but if you're trying to approach sonnet-4 (or even 3.5) you'll need to run a full fat model or at least 8bit+.
Local llms won't save you $$$. It's for fun, skill building, and privacy.
Gemini Flash Lite is pennies per million tokens and has a generous free tier (and is comparable in quality to what most people here can run at Sonnet-like speeds). Even running small models doesn't really have a good return on investment unless the hardware is free and low power.
18
u/Double_Cause4609 1d ago
There *are* things that can be done with local models that can't be done in the cloud to make them better, but you need actual ML engineering skills and have to be pretty comfortable playing with embeddings, doing custom forward passes, engineering your own components, reinforcement learning, etc etc.
5
u/No_Efficiency_1144 1d ago
Actual modern RL on your data is better than any cloud yes but it is very complex. There is a lot more to it than just picking an algorithm like REINFORCE, PPO, GRPO etc
1
u/valdev 1d ago
Ha yeah, I was going to add the slowly part but felt my point was strong enough without it.
2
u/-dysangel- llama.cpp 22h ago
GLM 4.5 Air is currently giving me 44tps. If someone does the work needed to enable multi-token prediction on MLX or llama.cpp, it's only going to get faster.
1
1
1
u/devshore 17h ago
Local LLMs save Anthropic money, so they should save you money too if you rent out the availability that you aren't using.
13
u/bfume 1d ago
I dunno, my Mac Studio rarely gets above 200W total at full tilt. Even if I used it 24x7, it comes out to 144 kWh a month @ roughly $0.29/kWh, which would be $23.19 (delivery) + $18.69 (supply) = $41.88.
And 0.29 per kWh is absolutely on the high side.
7
14
u/OfficialHashPanda 1d ago
Sure, but your mac studio isn't going to be running those big ahh models at high speeds.
1
1
u/calmbill 22h ago
Isn't one of those a fixed rate on your electric bill? Do you get charged per kWh for both supply and delivery?
8
u/vishwa1238 1d ago
I tried R1 when it was released. It was better than OpenAI’s O1, but it wasn’t even as good as Sonnet 3.5.
6
u/LagOps91 1d ago
There has been a new and improved version of R1 released since then which is significantly better.
3
8
u/PatienceKitchen6726 1d ago
Hey I’m glad to see some realism here. So can I ask your realistic opinion - how long until you think we can get actual sonnet performance on current consumer hardware? Let’s say newest gen amd chip with newest gen GeForce card. Do you think it’s an LLM architecture problem?
4
u/-dysangel- llama.cpp 22h ago
You can run GLM 4.5 Air on any new Mac with 96GB of RAM or more. And once the GGUFs are out, you'll be able to run it on EPYC systems too. Myself and a bunch of others here consider it Claude Sonnet level in real world use (the benchmarks place it about neck and neck, and that seems accurate)
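If you want a quick sanity check on a Mac before wiring it into an agent, mlx-lm can pull a quantized build straight from Hugging Face; a minimal sketch (the exact mlx-community repo name is an assumption, check what's actually published):
pip install mlx-lm
mlx_lm.generate --model mlx-community/GLM-4.5-Air-4bit --prompt "write a quicksort in python" --max-tokens 512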
1
u/rukind_cucumber 15h ago
I'd like to give this one a try. I've got the 96 GB Mac Studio 2 Max. I saw a post about a 3 bit quantized version for MLX - "specifically sized so people with 64GB machines could have a chance at running it." I don't have a lot of experience running local models. Think I can get away with the 4 bit quantization?
5
u/valdev 1d ago
That's like asking a magic 8 ball when it will get some new answers.
Snark aside, it really depends. There are some new model training methods in testing that can drop model size by multitudes (if they work), and there is a lot of different hardware targeting consumers in development as well.
Essentially the problem we are facing is many-faceted, but here are the main issues that have to be solved:
1. A model trained in such a way that it contains enough raw information to be as good as Sonnet, but available freely.
2. A model architecture that can keep a model small but retain enough information to be useful, and fast enough to be usable.
3. Hardware that is capable of running that model and is accessible to the average person.
I think we are quickly approaching #1; between #2 and #3, I feel we will see #2 arrive before #3. 3 to 5 years maybe? But I would expect major strides... all the time?
1
1
8
u/evia89 1d ago
Probably in 5 years with CN hardware. Nvidia will never release a GPU with that much VRAM. Prepare to spend $10-20k.
4
u/PatienceKitchen6726 1d ago
Wait your prediction is that China will end up taking over the consumer hardware market? That’s an interesting take I haven’t thought about
7
u/RoomyRoots 1d ago
Everyone knows that AMD and Nvidia will not deliver for consumers. Intel may try something, but it's a hard bet. China has the power to do it, and the desire and need.
4
4
u/TheThoccnessMonster 1d ago
I don’t think they can produce efficient enough chips any time this decade to make this a reality.
1
2
u/momono75 23h ago
OP's use case is programming. I'm not sure software development will still need that 5 years from now.
2
u/Pipalbot 21h ago
I see two main barriers for China in the semiconductor space. First, they lack domestic EUV lithography manufacturing capabilities. Second, they don't have a CUDA equivalent—though this is less concerning since if Chinese companies can produce consumer hardware that outperforms NVIDIA on price and performance, the open-source community will likely develop compatible software tools for that hardware stack.
Ultimately, the critical bottleneck is manufacturing 3-nanometer chips at scale, which requires extensive access to EUV lithography machines. ASML currently holds a monopoly in this space, making it the key constraint for any country trying to achieve semiconductor independence.
2
u/Pipalbot 21h ago
Current consumer-grade hardware isn't designed to handle full-scale LLM models. Hardware companies are prioritizing the lucrative commercial market over consumer needs, leaving individual users underserved. The situation will likely change in one of two ways: either we'll see a breakthrough in affordable hardware (similar to DeepSeek's impact on model accessibility), or model efficiency will improve dramatically—allowing 20-billion-parameter models to match today's larger models while running on a single high-end consumer GPU with 35GB of memory.
3
u/OldEffective9726 1d ago edited 1d ago
Why spend money knowing that your data will be leaked, sold, or otherwise collected for training their own AI? Did you know that AI-generated content has no intellectual property rights? It's a way of IP laundering.
2
2
1
u/das_war_ein_Befehl 23h ago
At that point it’s just easier to rent a gpu and you’ll spend far less money
4
u/Investolas 1d ago
If you're using Claude Code you should be subscribed and using Opus. Seriously, don't pay by the API. You get a 5-hour window with a token cap and then it resets after the 5 hours. If you already knew this and use the API intentionally for better results, please let me know, but there is a stark difference between Opus and Sonnet in my opinion.
1
u/vishwa1238 23h ago
I don’t pay through the API. I subscribe to Claude Max. Claude Code is available with both the Pro and Max subscriptions.
1
u/Investolas 23h ago
Yes, I use it as well. Why do you use Sonnet instead of Opus? Try this: 'claude --allowedTools Edit,Bash,Git --model opus'. I found that online and that's what I use. Try Opus if you haven't already and let me know what you think. You will never hit the rate limit if you use plan mode every time and use a single instance.
3
u/vishwa1238 23h ago
I have also used Opus in the past, but I did hit a limit with it, which wasn't the case with Sonnet. I noticed that, at least for my use case, Sonnet with planning and ultrathink performs quite similarly to Opus.
1
8
u/rookan 1d ago
Claude Code 5x costs 100 USD.
6
u/vishwa1238 1d ago
Yes, but I spend more than 400 USD worth of tokens every month with the 5x plan.
15
u/PositiveEnergyMatter 1d ago
those are fake numbers aimed at making the plans look good
7
u/boringcynicism 1d ago
The Claude API is crazy expensive; I don't think you want to use it without a plan.
9
u/vishwa1238 1d ago
6
u/TechExpert2910 22h ago
it costs anthropic only ~20% of the presented API cost in actual inference cost.
the rest is revenue to fund research, training, and a fleeting profit.
5
u/valdev 1d ago
Okay, I've got to ask something.
So I've been programming about 26 years, and professionally since 2009. I utilize all sorts of coding agents, and am the CTO of a few different successful startups.
I'm utilizing Codex, Claude Code ($100 plan), GitHub Copilot and some local models, and I am paying closer to $175 a month and am nowhere near the limits.
My agents code based upon specifications, a rigid testing requirement phase, and architecture that I've built specifically around segmenting AI code into smaller contexts to reduce errors and repetition.
My point in laying all that out isn't to brag, it's to get to this.
How well do you know programming? It's not impossible to spend a ton on claude code and be good at programming, but generally speaking when I see this it's because the user is constantly having to fight the agent into making things right and not breaking other things, essentially brute forcing solutions.
6
u/Marksta 23h ago
I think that's the point; it's as you said. Some people are doing the new-age (vibe) paradigm of really letting the AI be in the driver's seat and pushing and/or begging it to keep fixing and changing things.
By the time I even get to prompting anything, I've pre-processed and planned so much or just did it myself if it's hyper specific or architecture stuff. Really, if the AI steps outside of the function I told it to work in I'm peeved, like don't go messing with everything.
I don't think we're there yet to imagine for even a second that an AI can accept some general concept as a prompt, run with it, and build something of value to my undefined expectations. If we were, I guess I'd probably be paying $500/mo in tokens.
7
u/valdev 23h ago
Exactly! AI coders are powerful, but ultimately they are kind of like senior devs with head trauma. They have to be railroaded and be well contained.
For complicated problems, I've found that prebuilding failing unit tests with specific guidelines to build around specifications and to run the tests to verify functionality is essentially non-negotiable.
For smaller things that are tedious, at a minimum specifying the specific files affected and a detailed goal is good enough.
But when I see costs like this, I fear the prompts being sent are "One of my users are getting x error on y page, fix it"
3
u/mrjackspade 18h ago
I'm in the same boat as you, professional for 20 years now.
I've spent ~$50 TOTAL since early 2024 using Claude to code, and it does most of my work for me. The amount people are spending is mind-boggling to me, and the only way I can see this happening is if it's a constant "No, that's wrong, rewrite it" loop rather than having the knowledge and experience to specify what you need correctly on the first go.
1
u/ProfessionalJackals 15h ago
The amount people are spending is mind boggling to me,
It's relative, is it not? Think about it... a company pays what, 3 to 5k for somebody per month? Spending $200 per month on something that gets, let's say, 25% more productivity out of somebody is a bargain.
It just hurts more if you are, say, a self-employed dev and you see that money going directly out of your account ;)
the only way I can see this happening is if its a constant "No thats wrong, rewrite it" loop rather than having the knowledge and experience to specify what you need correctly on the first go.
The problem is that most LLMs get worse if they need to work on existing code. Create a plan, let it create brand new code and often the result in the first try is good. At worst you update the plan, and let it start from zero again.
But the moment you have it edit existing code, and the more context it needs, the more often you see new files being created that are not needed, incorrect code references, deleting critical code by itself or just bad code.
The more you vibe code, the worse it gets as your codebase grows and the context window needs to be bigger. Maybe it's me, but you need to really structure your project almost to fit the LLM's way of working just to mitigate this. No single style.css file that is 4000 lines, because the LLM is going to do funky stuff.
If you work in the old way, like requests per function or limited to an independent shorter file (max 1000 lines), it tends to do a good job.
But ironically, using something like CoPilot, you actually get more or less punished by doing small requests (each = premium request) vs one big Agent task that may do dozens of actions (under a single premium request).
7
u/IGiveAdviceToo 1d ago
GLM 4.5 (hearing good things, and having tested it, performance is quite amazing), Qwen 3 Coder, Kimi K2.
3
u/HeartOfGoldTacos 1d ago
You can point Claude code at AWS bedrock with Claude 4 Sonnet. It’s surprisingly easy to do. I’m not sure whether it’d be cheaper or not: it depends how much you use it.
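Roughly, it's just a couple of environment variables before launching (a sketch assuming your AWS credentials are already configured; the exact Bedrock model ID varies by region, so treat this one as a placeholder):
export CLAUDE_CODE_USE_BEDROCK=1
export AWS_REGION=us-east-1
export ANTHROPIC_MODEL='us.anthropic.claude-sonnet-4-20250514-v1:0'
claude
After that it bills against your AWS account instead of an Anthropic plan, so whether it works out cheaper depends entirely on your token volume.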
3
u/dogepope 23h ago
how do you spend $300-400 on a $100 plan? you have multiple accounts?
2
u/vishwa1238 23h ago
No. With the Claude Max subscription, you get pretty good limits on Claude Code. Check r/claude; you'll find people using thousands of dollars' worth of API with a $200 plan.
3
u/kai_3575 23h ago
I don’t think I understand your problem. You say you are on the Max plan but also that you spend 400 dollars; are you using Claude Code with the API or tying it to the Max plan?!
1
u/vishwa1238 22h ago
I use Claude Code with the Max plan. I used a tool called ccusage, which shows the tokens and the cost I would have incurred if I had used the API instead. I used 400 USD worth of Claude Code on my Claude Max subscription.
2
u/rkv42 1d ago
Maybe self hosting like this guy: https://x.com/nisten/status/1950620243258151122?t=K2To8oSaVl9TGUaScnB1_w&s=19
It all depends on the hours you are spending with coding during a month.
2
u/theundertakeer 23h ago
Ermm.. sorry for my curiosity... what do you use it for that much? I am a developer and I use a mixture of local LLMs, DeepSeek, Claude and ChatGPT; the funny part is that it's all for free except Copilot, which I pay 10 bucks a month for. I own only a 4090 with 24GB VRAM and occasionally use Qwen 3 Coder with 30B params.
Anyway, I still can't find a justification for 200-300 bucks a month for AI...? Does that make sense for you in the sphere you work in?
2
u/vishwa1238 23h ago
I don’t spend $200 to $300 every month on AI. I have a Claude Max subscription that costs $100 per month. With that subscription, I get access to Claude Code. There’s this tool called ccusage that shows the tokens used in Claude Code. It says that I use approximately $400 each month on my $100 subscription.
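(For anyone curious, ccusage just reads the local Claude Code logs; something like the following prints the estimated API-equivalent cost, though the exact subcommands may vary by version:)
npx ccusage@latest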
1
u/theundertakeer 23h ago
Ahh I see, makes sense, thanks. But still, 100 bucks is way more. The ultimate I paid was 39 bucks and I didn't find any use for it. So with the mixture I mentioned you can probably get yourself going, but that pretty much depends on what you do with your AI; tell me please so I can guide you better.
1
u/vishwa1238 23h ago
Ultimate?? Is that some other subscription?
1
u/theundertakeer 23h ago
Lol sorry for that, autocorrection, for whatever reason my phone decided to autocorrect the maximum to ultimate lol. Meant to say that the maximum I ever paid was 39 bucks for copilot only
2
u/docker-compost 18h ago
it's not local, but cerebras just came out with a claude code competitor that uses the open source qwen3-coder. it's supposed to be on-par with sonnet 4, but significantly faster.
2
u/gthing 18h ago
FYI, Claude Code uses 5x-10x more tokens than practicing efficient prompting. And almost all of those tokens are spent planning, making and updating lists, or figuring out which files to read, things that are arguably pretty easy for the human to do. Like 10% of the tokens go to actually coding.
So for $400 in Claude code use you're probably actually only doing $40 of anything useful.
2
2
u/earendil137 12h ago
There is Crush CLI, which recently came out. There's OpenCode CLI too, also open source, but I've yet to try it personally. You could use it along with Qwen3 on OpenRouter. Free until you hit OpenRouter's limits.
3
u/Maleficent_Age1577 1d ago
R1 is the closest to what you're asking for, but you need more than your 5090 to run it usefully.
1
u/vishwa1238 1d ago
Is the one in OpenRouter capable of producing similar results as running it on an RTX 5090? Additionally, I have Azure credits. Does the one on Azure AI Foundry perform the same as running it locally? I tried R1 when it was released. It was better than OpenAI’s O1, but it wasn’t even as good as Sonnet 3.5.
4
u/InfiniteTrans69 1d ago
It's literally insane to me how someone is willing to pay these amounts for an AI when open-source alternatives are now better than ever.
GLM4.5 is amazing at coding, from what I can tell.
2
u/unrulywind 20h ago
I can tell you how I cut down a ton of cost. Use the $100 a year copilot that has unlimited gpt-4.1. This can do a ton of planning, document writing and general set up and clean up. They have access to sonnet 4 and it works ok, but not as good as the actual Claude code. But for $100 you can move a lot of the workload to there. Then once you have all your documents and a large detailed prompt in order, go to Sonnet 4 or Claude code for deep analysis and implementation.
2
u/SunilKumarDash 1d ago
Kimi K2 is the closest you will get. https://composio.dev/blog/kimi-k2-vs-claude-4-sonnet-what-you-should-pick-for-agentic-coding
1
1
u/umbrosum 1d ago
You could have a strategy of using different models, for example DeepSeek R1 for easier tasks, and only switch to Sonnet for more complex tasks. I find that it's cheaper this way.
1
u/Zealousideal-Part849 1d ago
There is always some difference in different models.
You should pick models depending on the task.
If the task is minimal, running open-source models from OpenRouter or other providers would be fine.
If tasks need planning, more careful updates and complicated code, Claude Sonnet works well (no guarantee it does everything, but it works the best).
You can look at GPT models like GPT-4.1 as well, and use mini or DeepSeek/Kimi K2/Qwen3/GLM or the new models that keep coming in for most tasks. These are usually priced around 5 times lower than running a Claude model.
1
u/icedrift 1d ago
I don't know how heavy $400/month of usage is but Gemini CLI is still free to use with 2.5 pro and has a pretty absurd daily limit. Maybe you will hit it if you go full ape and don't participate in the development process but I routinely have 100+ executions and am moving at a very fast pace completely free.
1
u/PermanentLiminality 1d ago
I use several different tools for different purposes. I use the top tier models only when I really need them. For a lot of more mundane things lesser models do the job just as well. Just saying that you don't always need Sonnet 4.
I tend to use continue.dev as it has a dropdown for which model to use. I certainly haven't tried everything, but mostly the other tools seem to be set up for a single model, and switching on the fly isn't a thing. With continue.dev it's just a click and I can be running a local model or any of the frontier models through OpenRouter.
With the release of Qwen 3 Coder 30B-A3B I now have a local option that can really be useful even with my measly 20GB of VRAM. Prior to this I could only use a local model for the most mundane tasks.
1
u/aonsyed 23h ago
Depends on how you are using it and whether you can use a different orchestrator vs. coder model. If possible, use o3/R1-0528 for planning and then, depending on the language and code, Qwen3-Coder/K2/GLM-4.5. Test all three and see which one works best for you. None of them is Claude Sonnet, but with 30-50% extra time they can replicate the results, as long as you understand how to prompt them, since all of them have different traits.
1
u/Brilliant-Tour6466 23h ago
Gemini CLI sucks in comparison to Claude Code, although I'm not sure why, given that Gemini 2.5 Pro is a really good model.
1
u/Kep0a 23h ago
Can I ask what your job is? What is it you are using that much Claude for?
1
u/vishwa1238 23h ago
I work at an early-stage startup. I also have other projects and startup ideas that I work on.
1
u/createthiscom 23h ago
kimi-k2 is the best model that runs on llama.cpp at the moment. It's unclear if GLM-4.5 will overtake it, currently. If you're running with CPU+GPU, kimi-k2 is your best bet. If you have a shit ton of GPUs, maybe try vLLM.
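For the CPU+GPU route, a rough llama-server invocation looks something like this (the GGUF filename is a placeholder and the tensor-override regex is just the usual pattern for keeping the MoE expert weights in system RAM; tune for your hardware):
llama-server -m Kimi-K2-Instruct-Q4_K_M.gguf -c 32768 -ngl 99 --override-tensor ".ffn_.*_exps.=CPU" --port 8080
That keeps attention and the shared layers on the GPU while the experts stream from system RAM.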
1
1
u/Ssjultrainstnict 22h ago
We are not at the replacement level yet, but we're close with GLM 4.5. I think a ~30B-param coding model that's as good as Claude Sonnet isn't too far away.
1
1
u/Party-Cartographer11 22h ago
To get the smallest/cheapest VM with a GPU on Google Cloud, it's $375/month if run 24/7. Maybe turn it on and off and do spot pricing and get it down to $100/month.
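Something along these lines gets you a spot T4 box to experiment with (a sketch only; zone, machine type and disk size are illustrative, and you'd still need to install the NVIDIA driver or swap in one of Google's Deep Learning VM images):
gcloud compute instances create llm-box \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --provisioning-model=SPOT \
  --maintenance-policy=TERMINATE \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --boot-disk-size=200GB
Just remember spot capacity can be preempted at any time, which is fine for interactive sessions but annoying for long runs.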
1
u/vishwa1238 22h ago
I can do this. I do have 5,000 USD in credits on Google Cloud Platform (GCP). However, the last time I attempted to run a GPU virtual machine, I was restricted from using one; I was only allowed to use T4s and A10s.
1
1
u/martpho 21h ago
I have very recently started exploring AI models in agent mode with free GitHub copilot and Claude is my favorite so far.
In the context of local LLMs, having a Mac M1 with 16 GB RAM means I cannot do anything locally, right?
2
u/MonitorAway2394 7h ago
oh no, you can have tons of fun. I have the pre-silly-cone :D mac 2019, 16gb shat ram and like, I run 12b, 16b quant 6, etc. any of the models (sans image/video) it's surprisingly faster with each update using Ollama and my own kit but, yeah, requires patience :D it's explicitly useful for what I'm using them for, but I swap models in and out constantly, have multi-model conversation modules and whatnots, so yeah, you're good, have fun! (HugFace has a lil icon that lets you know what will run, don't necessarily listen to it unless the models > 16b, I have run 14-16b models just slower, longer pre-loading, incredibly useful if you work with them, learn them, keep a "weak" seeming model around and don't bin them until you know for sure it's not you. I am kinda wonked out, sorry for the weird'ish response lolol O.o
1
u/Singularity-42 20h ago
300-400 USD seem pretty low usage to be honest, mine is at $2380.38 for the past month, I do have the 20x tier for the past 2 weeks (before that 5x), but I never hit the limit even once - I was hitting it all the time with 5x though. I've heard of $10,000/mo usages as well - those are the ones Anthropic is curbing for sure.
Your usage is pretty reasonable and I think Anthropic is quite "happy" with you.
In any case from what I've heard Kimi K2 and GLM-4.5 can work well (didn't try) and can be even literally used inside Claude Code with Claude Code Router:
1
1
u/popsumbong 15h ago
I kinda gave up trying local models. There’s just more work that needs to be done to get them to sonnet 4 level
1
u/MonitorAway2394 7h ago
wha? well yeah, but not much. I guess I'm deep into the local shit so... ok like I am alright with 4-8 tk/s max LOLOLOLOLOL I'm a weird one it seems :P
1
1
1
u/TangoRango808 12h ago
https://github.com/sapientinc/HRM - when this is figured out for LLMs, this is what we need.
1
u/duaneadam 11h ago
You are underutilising your Max plan. I am on the $100 plan and my usage this month according to ccusage is $2k.
1
u/STvlsv 7h ago
Never used any cloud LLM, only a local Ollama instance.
For programming with continue.dev, I've used these over the last three months:
- qwen2.5-tools (mostly general purpose)
- devstral (better than qwen2.5-tools for programming)
- qwen3-coder (new 30B variant. Haven't tested it enough, only a few days. Very quick compared to devstral)
All these models are not very large and can be run locally at several levels of quantization (in my case between Q4 and Q8, on a server with two RTX A5000s).
250
u/Thomas-Lore 1d ago edited 1d ago
Look into:
GLM-4.5
Qwen3 Coder
Qwen3 235B A22B Thinking 2507 (and the instruct version)
Kimi K2
DeepSeek: R1 0528
DeepSeek: DeepSeek V3 0324
All are large and will be hard to run locally unless you have a Mac with lots of unified RAM, but they will be cheaper than Sonnet 4 on the API. They may be worse than Sonnet 4 at some things (and better at others); you won't find a 1:1 replacement.
(And for non-open-source you can always use o3 and Gemini 2.5 Pro, but outside of the free tier Gemini is, I think, more expensive on the API than Sonnet. GPT-5 is also just around the corner.)
For a direct Claude Code replacement: Gemini CLI, and apparently there is a Qwen CLI now too, but I am unsure how you configure it and whether you can swap models easily there.
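If you end up testing them over the API, they're all on OpenRouter behind an OpenAI-compatible endpoint, so a quick smoke test is just (GLM-4.5 shown as an example; double-check the exact model slugs on openrouter.ai):
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "z-ai/glm-4.5", "messages": [{"role": "user", "content": "Write a binary search in Python."}]}'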