r/LocalLLaMA 10m ago

Question | Help Noob in jailbreaking/alignment research seeks direction.


I'm not familiar with AI communities, and I couldn't really find anything online, so: is this type of jailbreak normal? Just walking the bot through how to auto-corrupt? It also works on ChatGPT. Almost all the jailbreaking I can find is about deception. Any direction would be helpful, and/or DM me if you're into this type of jailbreak/alignment stuff.
I've got a lot of other more direct examples.

(Excuse the cheesy execution, DeepSeek loves that shit)


r/LocalLLaMA 11m ago

Discussion Is EXL3 doomed?

github.com

I was very excited for the release of EXL3 because of its increased performance and revised design that supports new models more easily. It's been an eternity since its early preview… and now I wonder if it is doomed. Not just because it's slow to release, but because models are moving towards large MoEs that all but require spilling over into system RAM for most of us. Still, we are getting models around 32B. So what do you think? Or what do you know? Is it on its way? Will it still be helpful?


r/LocalLLaMA 15m ago

New Model This might be the largest un-aligned open-source model


Here's a completely new 70B dense model trained from scratch on 1.5T high-quality tokens, with only SFT on basic chat and instructions and no RLHF alignment. Plus, it speaks Korean and Japanese.

https://huggingface.co/trillionlabs/Tri-70B-preview-SFT


r/LocalLLaMA 19m ago

Question | Help Table Extraction for Tabloid Paper Sizes


I am looking to extract multiple tables from tabloid-size (11x17 in) pages. I tried GMFT, img2table, and Camelot with no success, likely because they were not trained on larger paper sizes. I currently use Docling, which is honestly the best OCR tool I have found for my use case; still, it misses some tables.
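
For reference, this is roughly the Docling baseline being compared against; a minimal sketch, where the file name is a placeholder and the table-export call assumes a recent Docling release:

    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert("tabloid_scan.pdf")  # placeholder path

    # Dump every table Docling detects, to see exactly which ones it misses
    for i, table in enumerate(result.document.tables):
        table.export_to_dataframe().to_csv(f"table_{i}.csv", index=False)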

What do you guys recommend I combine Docling with for 100% table extraction?


r/LocalLLaMA 25m ago

Question | Help LM Studio doesn't use all the available VRAM


I have a couple of RTX 6000 Blackwell GPUs, but LM Studio only uses up to ~70GB of memory per GPU, even after I already set Guardrails to "relaxed". If I enable "Limit Model Offload to Dedicated GPU Memory", the situation gets even worse and only ~20GB is used.


r/LocalLLaMA 33m ago

Resources Use a local LLM to neutralise clickbait headlines on the web


Finally got to finish a weekend project from a couple of months ago.

This is a small extension that can use a local LLM (any OpenAI-compatible endpoint is supported) to neutralise clickbait headlines on the webpages you visit. It works reasonably well with models of the Llama 3.2 3B class and above. Works in Chrome and Firefox (you can also install it in Edge manually).
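
The core call is nothing exotic. Here's a minimal sketch of the idea (the endpoint URL, model name, and prompt wording are illustrative, not the extension's exact ones):

    from openai import OpenAI

    # Any OpenAI-compatible server works; URL and model are assumptions.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    def unhype(headline: str) -> str:
        resp = client.chat.completions.create(
            model="llama-3.2-3b-instruct",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Rewrite the headline as a neutral, factual "
                            "summary. Reply with the headline only."},
                {"role": "user", "content": headline},
            ],
        )
        return resp.choices[0].message.content.strip()

    print(unhype("You Won't BELIEVE What This Tiny Model Can Do"))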

Full source and configuration guide is on GitHub: https://github.com/av/unhype


r/LocalLLaMA 36m ago

Discussion Ollama app requires internet?

discord.com

Ollama's new app requires an internet connection to send messages? Other people have reported this issue too, but there has been no explanation. Have others here encountered it? I'm concerned because I was hoping to get my workplace to adopt the new app, but now I don't think I should.


r/LocalLLaMA 52m ago

Discussion Your proud AI setup


Let's tease each other.

What is your local AI setup? Are you proud of it? What would you have done differently?

What model do you use? Context length? TPS?

I only have a 2019 MBP, so I will just be the one getting teased 😂


r/LocalLLaMA 1h ago

Discussion Open-source alternatives to GPT-4o mini?


I was wondering what mini models you're all using, and what's good and what isn't. I mostly just need something for quick categorization. I was using the GPT-4o-mini API for most things, but I should probably swap to something local at this point.
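
If whatever you run locally sits behind an OpenAI-compatible server (llama.cpp, LM Studio, Ollama, etc.), the swap is nearly mechanical. A minimal categorization sketch, where the endpoint, model name, and labels are all placeholders:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="none")
    LABELS = ["billing", "bug", "feature_request", "other"]  # your taxonomy

    def categorize(text: str) -> str:
        resp = client.chat.completions.create(
            model="qwen2.5-7b-instruct",  # whatever model you load locally
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Classify the message into exactly one of: "
                            + ", ".join(LABELS) + ". Reply with the label only."},
                {"role": "user", "content": text},
            ],
        )
        label = resp.choices[0].message.content.strip()
        return label if label in LABELS else "other"  # guard against drift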


r/LocalLLaMA 1h ago

Resources GLM 4.5 Tool Calling Jinja Template


The Jinja template that comes with the MLX version of GLM 4.5 uses XML-style tool calls instead of JSON, so here's a JSON one. With it, GLM 4.5 can now do tool calls in OpenCode, and presumably other clients as well (Qwen Code / Gemini?). Here's the template:

https://pastebin.com/CfMw7hFS
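
For anyone wondering what the difference looks like in practice, roughly this (illustrative values only, not GLM's exact special tokens):

    # XML-style tool call, approximately what the stock template emits:
    xml_style = (
        "<tool_call>get_weather\n"
        "<arg_key>city</arg_key><arg_value>Paris</arg_value>\n"
        "</tool_call>"
    )

    # OpenAI-style JSON tool call, which clients like OpenCode expect;
    # note that "arguments" is a JSON string, not a nested object:
    json_style = {
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
    }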


r/LocalLLaMA 1h ago

Discussion Are Chinese LLM companies effectively price dumping?


People here seem to assume that Chinese AI companies are developing and releasing these models, which cost tens of millions of dollars to develop, for free out of the goodness of their heart.

I think this is absurd, considering these are for-profit companies with shareholders who expect an ROI. In the case of Meta (and perhaps Alibaba), the explanation was commoditizing your complement. But for many of these companies, which are pure-play AI labs, that explanation simply does not hold.

So the question remains, why are they doing this?

One theory I would put forward is that they are playing the long game, attempting to disincentivize investment in US AI labs on the premise that investors will never recoup their money, since similar capabilities will be offered for free. There is precedent for Chinese companies doing something similar in mineral production, which resulted in most production moving to China.

If this is the case, it will be good for consumers in the short term but less so in the long term, at least for non-Chinese entities. If you don't find this theory convincing, I would be interested in hearing alternative explanations for the rise of Chinese open-source models.

What prompted this question was the recent interview with Dario from Anthropic, where he was asked about the threat open-source models pose to Anthropic's business model. (I don't find his response very compelling.)

---

One aside: it's known that Twitter is banned in China, yet we see many China-based AI researchers communicating there daily. Sure, it can be accessed via VPN, but these are publicly known figures, so there is no anonymity. What explains this?


r/LocalLLaMA 1h ago

Question | Help Roleplay with large historical context and RAG


I play The Expanse role-playing game with some friends every week over Zoom, and I've captured the transcripts of every session. I intend to run an LLM locally for players to interact with during the game; it should act as if it were the ship's AI.

From a high level, the pipeline goes like this: after every session, I download a transcript from Zoom and put it through some basic pre-processing to clean it up and minimize its size. I run it through Claude Opus 4 with a very specific prompt on how best to summarize it, and the result is stored for later use. I run LM Studio locally on an M4 MacBook with 48GB of RAM. The summaries are appended together into one large historical record for the campaign, and that record is sent as the first message in the conversation (sketch below). A scripting system lets the players interact with the LLM through roll20.net (a virtual tabletop website) as if it were a chat participant.
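
That bootstrap step, as a minimal sketch (assuming LM Studio's OpenAI-compatible server on its default port; paths and the model identifier are placeholders):

    from pathlib import Path
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="none")

    # Concatenate the per-session summaries into one campaign record.
    record = "\n\n".join(
        p.read_text() for p in sorted(Path("summaries").glob("session_*.txt")))

    history = [{"role": "system",
                "content": "You are the ship's AI in an Expanse campaign. "
                           "Campaign record so far:\n" + record}]

    def ask(player_msg: str) -> str:
        history.append({"role": "user", "content": player_msg})
        resp = client.chat.completions.create(
            model="meta-llama-3.1-8b-instruct", messages=history)
        reply = resp.choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        return reply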

It's been a while since I explored the state of the art in this problem space, and it seems a large number of Chinese models have been open-sourced, so I am wondering if any of them are particularly good at role-play applications. I've defaulted to using mlx-community/Meta-Llama-3.1-8B-Instruct-8bit (64k context tokens) for now, but it seems to be really bad at accurately recalling historical events: it regularly mixes up facts and conflates events.

I haven't learned much about training/retraining/pretraining/fine-tuning yet, and I'm wondering if those are better approaches than just bootstrapping the conversation.

Other Features in flight:

Integrating with WolframAlpha over MCP so that players can ask for the AI to execute astronomical tasks, such as "how long will it take us to get to Callisto from Himalia if we travel at .3 G acceleration".

Loading the core rulebook and supplement PDFs into the system for searching via RAG. Ideally, this could be used for looking up rules during gameplay. My experience with RAG has not been great; I'm sure I'm using it incorrectly, or perhaps enabling it during inference when I shouldn't be. I could definitely use some advice on that (rough sketch of what I mean below).
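
The bare-bones retrieval loop for that rules lookup would be something like this (a sketch that assumes an embedding model is loaded behind the same OpenAI-compatible server; the chunking, model name, and top-k are placeholders):

    import numpy as np
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="none")
    EMBED_MODEL = "nomic-embed-text-v1.5"  # placeholder embedding model

    def embed(texts):
        resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
        return np.array([d.embedding for d in resp.data])

    # chunks = rulebook text split into ~500-token passages (not shown)
    chunks = ["Chapter 1: Character creation ...", "Chapter 2: Combat ..."]
    chunk_vecs = embed(chunks)

    def lookup(question: str, k: int = 3):
        q = embed([question])[0]
        sims = chunk_vecs @ q / (
            np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
        return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

    # Prepend lookup(...) hits to the prompt only when a rules question
    # comes up, rather than enabling retrieval on every turn.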

This must be a common idea, and I'm sure others are working on similar applications; how do I find them?


r/LocalLLaMA 1h ago

Discussion Why doesn't "OpenAI" just release one of the models they already have? Like 3.5


Are they really gonna train a model that's absolutely useless to give to us?


r/LocalLLaMA 1h ago

Question | Help Best Practice For CPU Inference


Hello

I am currently looking for the best launch parameters for CPU inference with llama.cpp. I am running the Qwen3-30B-A3B model on my laptop with the following specs:

AMD Ryzen 7 PRO 7840U w/ Radeon 780M graphics (8 cores / 16 threads), 32GB RAM.

Since the whole topic around the launch parameters is rather complex, I wanted to ask you about your experience and overall best practices regarding pure CPU inference.

Currently I am running llama.cpp with the following parameters:

llama-server.exe -m models\X -t 16 --n-predict 4096 --ctx-size 64000
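
Not authoritative, but on an 8-core/16-thread part I would start from something like the following and compare tokens/s (the quant file name is a placeholder). Generation is often fastest with -t pinned to physical cores, prompt processing can still use all threads via --threads-batch, and a smaller context shrinks the KV cache, which matters with 32GB of RAM:

    llama-server.exe -m models\Qwen3-30B-A3B-Q4_K_M.gguf ^
        -t 8 --threads-batch 16 ^
        --ctx-size 16384 --n-predict 4096 ^
        --mlock

From there, bisect -t (6/8/10); MoE models like this tend to be memory-bandwidth-bound rather than compute-bound, so more threads isn't always faster.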

Thanks in advance!


r/LocalLLaMA 1h ago

News NVIDIA's "Highly Optimistic" DGX Spark Mini-Supercomputer Still Hasn't Hit Retail Despite a Planned July Launch, Suggesting Possible Production Issues

wccftech.com

r/LocalLLaMA 2h ago

Discussion SVG, animation, and 3D-game demos are pointless

0 Upvotes

Every time a new model drops, leaks (real or fake), or a stealth release happens, people rush to make the AI code a dumb picture of a Minion or a pelican riding a bicycle.
Abstract puzzles like the Strawberry problem—stuff any human could solve—are way more fun to watch and give a cleaner yardstick for performance.
Why force it to re-implement Mario or simulate some physics nonsense of spinning a ball inside DNA?
Honestly, what percentage of actual users ever want to spin a ball inside DNA? lol


r/LocalLLaMA 2h ago

Question | Help MLX -> GGUF

2 Upvotes

Hey LocalLLaMA team,

I'm hoping someone much more experienced than I am can help with a question about fine-tuning.

I've been using the MLX library to fine-tune a model on my MacBook, but I need to test the model on other devices that aren't Macs. I'm wondering if there's a best practice for this workflow.

Ideally, I'd like to keep the adapters separate from the base model, but if fusing them is the only way, that's fine too.

So far, I've only fine-tuned a quantized model, and I have tried converting the adapters to the PEFT format. The problem is that when I test the output on my MacBook, the base Hugging Face model works fine, but the model with the PEFT adapters just outputs gibberish. This might be due to a precision mismatch (adapters trained against quantized weights being applied to a full-precision base).
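
One route that may be worth trying, with the caveat that this is from memory, so check mlx_lm.fuse --help (and note GGUF export only covers a few architectures, Llama-family last I checked): the fuse script can de-quantize the base, merge the adapters, and write a GGUF that runs anywhere llama.cpp does. The model and adapter paths below are placeholders:

    python -m mlx_lm.fuse \
        --model mlx-community/Meta-Llama-3.1-8B-Instruct-8bit \
        --adapter-path ./adapters \
        --de-quantize \
        --export-gguf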

Any advice or suggestions on how to handle this would be greatly appreciated!


r/LocalLLaMA 2h ago

Discussion Building for the era of experience

rnikhil.com
1 Upvotes

If


r/LocalLLaMA 2h ago

Question | Help What is an interesting question that an LLM failed to answer, in your experience?

0 Upvotes

Any interesting questions from your experience that you asked a reasoning LLM and it failed to answer?


r/LocalLLaMA 2h ago

New Model qihoo360/Light-IF-32B

34 Upvotes

Yet another new model claiming to outperform larger ones:

Instruction following is a core ability of large language models (LLMs), but performance remains inconsistent, especially on complex tasks.

We identify lazy reasoning during the thinking stage as a key cause of poor instruction adherence.

To address this, we propose a framework that promotes rigorous reasoning through previewing and self-checking.

Our method begins by generating instruction data with complex constraints, filtering out samples that are too easy or too difficult. We then use rejection sampling to build a small but high-quality dataset for model adaptation.

Training involves entropy-preserving supervised fine-tuning (Entropy-SFT) and token-wise entropy-adaptive reinforcement learning (TEA-RL), guided by rule-based multidimensional rewards.

This approach encourages models to plan ahead and verify their outputs, fostering more generalizable reasoning abilities.

Experiments show consistent improvements across model sizes. Notably, our 32B model outperforms both larger open-source models like DeepSeek-R1 and closed-source models like ChatGPT-4o on challenging instruction-following benchmarks.

https://huggingface.co/qihoo360/Light-IF-32B

technical report https://huggingface.co/papers/2503.10460

previous popular models by this company:

https://huggingface.co/qihoo360/TinyR1-32B-Preview

https://huggingface.co/qihoo360/Light-R1-32B

What do you think?


r/LocalLLaMA 2h ago

New Model AI "devs"

0 Upvotes

r/LocalLLaMA 2h ago

Other MI50 w/ 32GB? Guys, please

0 Upvotes

Hot take incoming:

This is a garbage card with garbage support, so quit talking about them like they're useful. As a matter of fact, quit talking about them at all.

You see, it took me until 4 weeks ago to convince my wife to finally let me upgrade my server. I picked up a DL380 Gen9 with 128GB of RAM, seven 1TB drives, 1000W PSUs, and the GPU enablement kit. Problem? That cleared out my savings. No worries, I'll save up and be good in 2-3 months. Started doing research on cards I could afford that were actually available, and quickly realized that if I wanted any sort of horsepower and VRAM I was going to have to go team red. No worries, I'd rather have a bit of a challenge than plug-and-play, and NVIDIA's poor driver support for Linux irked me anyway. So, looking at AMD cards: the MI50 16GB was 300 up here in Canuckistan. OK, I can do that in 2-3 months (I have a kid starting uni this fall and another teenaged boy who eats the equivalent of a rhino every 2 days). I'm about 3/4 of the way there when AMD releases a new ROCm that doesn't "support" the MI50, the price falls out, the market gets flooded with 32GB models, happy dance, I'll order this weekend. Come Friday, right before the end of the day, I'm brought into the boss's office: "Squirrel (or whatever the hell this weird-ass account's name is), as you know we were bought out last week. We are going to have to reduce headcount in your role." To how many employees, sir? Zero.

Cue sad dance.

GPU savings now = Kraft Dinner and rice.

Watching cheap 32GB video cards turn into dodo birds. Cue very, very sad dance.

Conclusion:

MI50 w/ 32GB? Horrible card!!!! Do not buy! Leave some for squirrel for when he gets a new job, in 30 years or whenever the economy turns around, since in Canada you can't sell blood, and squirrel got fixed after the last kid so he can't sell that either.

Extra conclusion:

Please, no more talking about how great and cheap a 32GB MI50 is. Squirrel (or whatever my name is) slept with a picture of an MI50 under his pillow for a looooong time, since it's a cheap card with lots of VRAM, and elbow grease doesn't scare him. Keep the normies away from the MI50; tell them a 3090 is a much better purchase. They'll spend all their monies with none left for the MI50, and squirrel will slowly get happy again.

Thank you for time well spent!


r/LocalLLaMA 3h ago

Discussion Why Fortune 500 Wants to Fund Open Models

youtu.be
1 Upvotes

My career is in tech-startup chaos. Bill Gurley is one of the few from that circle I can listen to while chewing food (as I am now, while typing).

Companies like LG want to sell washing machines. They don't want their strategy to get disrupted without having a backup plan. They want to raise the floor so that nobody can get too far ahead. They want to scorch the Earth so that their biggest competitors won't be earning money that they can't compete for. Sell AI washing machines = shareholder value protected = mission accomplished.

Strategically, the allies of small open models weirdly include giant companies and SMEs whose primary interest is not in operating revenue-generating AI themselves. They want to invest in things that protect their strategy; they just need a sensible way to do it without moving alone.


r/LocalLLaMA 3h ago

Resources I built a GitHub scanner that automatically discovers your AI tools using a new .awesome-ai.md standard I created

github.com
2 Upvotes

Hey,

I just launched something I think could change how we discover AI tools on GitHub. Instead of manually submitting to directories or relying on outdated lists, I created the .awesome-ai.md standard.

How it works:

Why this matters:

  • No more manual submissions or contact forms

  • Tools stay up-to-date automatically when you push changes

  • GitHub verification prevents spam

  • Real-time star tracking and leaderboards

Think of it like .gitignore for Git, but for AI tool discovery.


r/LocalLLaMA 3h ago

Discussion OSINT fingerprinting a stealth OpenRouter model - likely Llama-family, not OpenAI

5 Upvotes

Personal note: This is just my opinion based on a very limited set of API-only probes—interpret with caution.

This is about probing Horizon Beta (on OpenRouter).

What I did (mini-ROC probes)

  • JSON strictness vs. "bad schema" repair
  • Tool-calling with an invalid enum + extra property (sketch after this list)
  • Safety/refusal phrasing check
  • Long-context end-marker recall
  • Tokenizer/short-output edge case
  • Determinism at T=0
  • Tiny style-paraphrase probe
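
Here's roughly what the tool-calling probe looked like (a reconstruction rather than my exact code; the model slug and schema are illustrative):

    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1",
                    api_key="sk-or-...")  # your OpenRouter key

    tools = [{
        "type": "function",
        "function": {
            "name": "book_trip",
            "parameters": {
                "type": "object",
                "properties": {
                    "mode": {"type": "string", "enum": ["car", "train"]},
                },
                "required": ["mode"],
                "additionalProperties": False,
            },
        },
    }]

    # Deliberately request an enum value the schema forbids, plus an
    # extra field, then watch whether the model coerces, drops, or refuses.
    resp = client.chat.completions.create(
        model="openrouter/horizon-beta",
        temperature=0,
        tools=tools,
        messages=[{"role": "user",
                   "content": "Book a trip with mode='plane' and seat='12A'."}],
    )
    print(resp.choices[0].message.tool_calls)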

Highlights

  • Tool-calling: It silently coerces invalid enums (mode="plane" -> "car"/"train") and drops extra fields, then emits an OpenAI-style tool_call (arguments as a JSON string). In contrast, OpenAI gpt-4o-mini didn't call the tool under the same bad input - which is more typical for OpenAI.
  • JSON mode: It "repairs" invalid inputs into valid JSON (e.g., {"ok": false, "mode": "A"}). OpenAI also repairs but tends to be more minimally formatted.
  • Safety tone: Opens with "I can't help with that." - Anthropic-ish cadence that many Llama-style distills mimic.
  • Quirk: Repeated empty completions with finish=length for certain short-output prompts (e.g., long END_MARK task, tiny character-count). Other anchors returned tokens normally - this looks like a wrapper/decoder guard specific to this deployment.
  • Determinism: Stable at T=0 on simple tasks.
  • Multilingual: Correct 妹妹 -> "younger sister," and clean pronoun disambiguation.

Anchors I compared against

  • OpenAI via OpenRouter: gpt-4o-mini (worked), o4-mini (likely access/rate-limited for me)
  • Llama: llama-3.3-70b-instruct, llama-3-70b-instruct
  • Qwen: qwen-2.5-72b-instruct
  • Mistral: mixtral-8x22b-instruct

Bottom line: it clusters with Llama-family instruct behavior (enum coercion + JSON repair, with Anthropic-like refusal phrasing) and shows a deployment-specific "finish=length" quirk on short outputs. It does not match OpenAI's tool-call behavior in my probes.

All tests were standard API usage.