In one recent run I had local LLMs going for more than 28 hours, chewing through 437 markdown files. The model was qwen3.5:122b, served via Ollama across two Mac Studios on a private LAN. The task was unglamorous: read a file, return a JSON object with a two-sentence summary, the named persons mentioned, the topic tags, and the key ideas the file argues. Repeat. Aggregate across files. Write per-entry scaffolds.
If I tried this with Claude Code I would run out of tokens in a heartbeat, and even with cheap Chinese cloud providers I would end up skipping this kind of content-repository optimisation, because it gets expensive fast.
I have since started replacing that pipeline with DS4 running DeepSeek V4 Flash Q4 on the same hardware. Per-call latency dropped from about a hundred and fifty seconds to about twelve. Same task. Same prompt shape. Different engine, double the size in RAM. Quite an improvement in performance, and the Studio even seems to run cooler.
This post is the optimisation note. What I want from the local fleet is a quality LLM with variable cost approaching zero, for batch tasks — wiki indexing, references audit, transcription cleanup, bulk classification. The kind of work where the bottleneck is latency and reliability, not cleverness. Two engines, two postures, one tradeoff worth naming.
The use case
A wiki index over a personal writing and research repository. About four hundred markdown files: chapters, drafts, source material, memory notes (hell, even my full PhD dissertation is in there!). For each file I want a small structured artefact: a summary the index can show, named entities to populate person/topic/key-idea facets, and cross-references between them.
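A minimal sketch of how per-file artefacts could be folded into facets and cross-references. The field names (`persons`, `tags`, `key_ideas`), the sample data, and the helpers `build_facets` and `cross_references` are illustrative assumptions, not the schema used in the run described here.

```python
from collections import defaultdict

# Hypothetical per-file artefacts, as an extraction step might return them.
artefacts = {
    "chapters/01.md": {
        "persons": ["Ada Lovelace"],
        "tags": ["computing"],
        "key_ideas": ["notation shapes thought"],
    },
    "drafts/notes.md": {
        "persons": ["Ada Lovelace", "Alan Turing"],
        "tags": ["computing", "history"],
        "key_ideas": [],
    },
}

FACET_FIELDS = ("persons", "tags", "key_ideas")  # assumed field names


def build_facets(artefacts):
    """Invert {file: artefact} into {field: {value: sorted list of files}}."""
    facets = {field: defaultdict(set) for field in FACET_FIELDS}
    for path, artefact in artefacts.items():
        for field in FACET_FIELDS:
            for value in artefact.get(field, []):
                facets[field][value].add(path)
    return {
        field: {value: sorted(paths) for value, paths in values.items()}
        for field, values in facets.items()
    }


def cross_references(path, artefacts, facets):
    """Files that share at least one facet value with `path`."""
    related = set()
    for field in FACET_FIELDS:
        for value in artefacts[path].get(field, []):
            related.update(facets[field][value])
    related.discard(path)  # a file is not its own cross-reference
    return sorted(related)


facets = build_facets(artefacts)
print(facets["persons"]["Ada Lovelace"])
print(cross_references("chapters/01.md", artefacts, facets))
```

Keeping the facets as plain dictionaries of sorted lists means the result serialises straight to JSON for the index to consume.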
This is bulk extraction. Reusable system prompt; variable user content; structured JSON output. The model does not need to be clever — it needs to be reliable, predictable, and available enough to grind through hundreds of calls without breaking.
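The shape of such a call can be sketched in a few lines: a fixed system prompt, the file as user content, and a strict check on what comes back. The prompt wording, the field names and the helpers `build_messages` and `parse_artefact` are assumptions made for illustration.

```python
import json

# Hypothetical system prompt; the point is that it never changes between calls.
SYSTEM_PROMPT = (
    "Return a JSON object with keys: summary (two sentences), "
    "persons, tags, key_ideas."
)

# Assumed output contract: key name -> expected Python type.
REQUIRED = {"summary": str, "persons": list, "tags": list, "key_ideas": list}


def build_messages(file_text):
    """Fixed system prompt plus variable user content."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": file_text},
    ]


def parse_artefact(raw):
    """Return the parsed dict, or None if the reply breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # empty or malformed reply
    if not isinstance(data, dict):
        return None
    for key, expected_type in REQUIRED.items():
        if not isinstance(data.get(key), expected_type):
            return None
    return data
```

Returning `None` rather than raising keeps the batch loop simple: a failed file is queued for a retry instead of stopping the run.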
Calling an LLM in this register is a different sport from chatting with one. The output is the input to the next pipeline step. If twenty per cent of calls return empty, that is twenty per cent of the wiki missing. If quality silently degrades once the context doubles, your largest files are exactly the ones that have to be chunked. The engineering surface that matters is latency, predictability, and graceful failure.
Karpathy’s wiki pattern
In early April 2026 Andrej Karpathy posted a tweet that went viral: “a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating [knowledge].” He sketched a system where raw sources — articles, papers, repos, datasets, even images — drop into a raw/ directory, and an LLM agent incrementally compiles them into a structured wiki: interlinked markdown files with summaries, backlinks, concept pages. Two days later he followed up with an idea file — a GitHub gist that describes the pattern conceptually, no code attached, on the theory that in the agent era the idea is more useful to share than the implementation.
That tweet describes the project I am building almost word for word. The repository I am indexing is years of writing — chapters, drafts, source transcriptions, memory notes, the lot. The bootstrap I ran is the incremental compile step Karpathy names. The operations he sketches downstream — ingest, query, lint — are what comes next for me. He did not invent the pattern, but he made it legible: named it, gave it shape, made it public at the moment the local-hardware tooling caught up. By summer it will be everywhere.
Versatility vs specialisation
Two postures sit at the ends of the local-LLM spectrum.
Ollama is the versatile end. It serves any model in its library — Qwen, Llama, Mistral, Gemma, Phi, anything you can grab as a GGUF. One install, GUI and CLI, one HTTP API, and you can swap models per task. If you change models every week, Ollama is the right tool: the friction of trying a new one is a single ollama pull command.
DS4 is the specialised end, the new kid in town. It serves one model — DeepSeek V4 Flash quantized at q2 or q4 — and is dedicated to making that one model run as well as a Mac Studio will allow. Disk-resident KV cache, MoE-aware kernels, MTP speculative decoding, an OpenAI-compatible HTTP API on top. It cannot serve any other model. That is not a bug; that is the design and intent.
The tradeoff is direct. If you commit to one model and use it heavily — same prompt prefix, hundreds of calls a day, prefill cost amortised over thousands of inferences — specialisation pays for itself fast. If you flit between models, the specialised engine is sunk cost. The right answer depends on how much you actually use it.
Ollama + qwen3.5:122b, the versatile incumbent
qwen3.5:122b is a strong Mixture-of-Experts model, 125 billion total parameters, plenty of capability. Served via Ollama it is a one-line install and a permanent presence on the LAN. It has been my workshop’s “Tier 1” for months: anything mechanical and voluminous that I do not want to pay an API for goes there.
For the wiki bootstrap, two observations are worth recording.
The context cliff. I capped every call at 12k tokens of input with a hard ceiling of 16k via --num-ctx 16000. Past 16k, quality degrades visibly — truncated tails, missed named entities, dropped numerical qualifiers. The 30k chapter, the 70k transcription, the 50k worldview file all had to be chunked, summarised per chunk, and re-aggregated. Each chunk is another Ollama call.
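A rough sketch of the chunking step, assuming about four characters per token (a crude heuristic, not a real tokenizer) and paragraph boundaries marked by blank lines. `chunk_text` and `ollama_chat_payload` are hypothetical helpers; the payload follows the request body of Ollama's `/api/chat` endpoint, where the context window is set through `options.num_ctx`.

```python
def chunk_text(text, max_tokens=12_000, chars_per_token=4):
    """Split on blank lines into chunks whose estimated size fits max_tokens.

    A single paragraph larger than the budget is kept whole as its own chunk.
    """
    budget = max_tokens * chars_per_token
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        cost = len(para) + 2  # +2 for the separator re-added on join
        if current and size + cost > budget:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def ollama_chat_payload(model, system_prompt, chunk, num_ctx=16_000):
    """Request body for Ollama's /api/chat with an explicit context window."""
    return {
        "model": model,
        "stream": False,
        "options": {"num_ctx": num_ctx},
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": chunk},
        ],
    }
```

Splitting on paragraph boundaries keeps each chunk readable on its own, at the price of uneven chunk sizes.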
The empty-output failure mode. Through the run, eighteen to twenty per cent of calls returned len=0 — an empty completion. Not malformed JSON, not refused content, just nothing. Across files of all sizes; not predicted by size, type, or content. Retrying the same call usually produces the same result. My mitigation was a retry pass on gpt-5-mini via OpenAI — about twenty cents per hundred files. Cheap, but not free, and not the point.
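The mitigation can be sketched as a small wrapper. `primary` and `fallback` are hypothetical callables standing in for the local model and the cloud model; nothing here is tied to a specific client library.

```python
def complete_with_fallback(prompt, primary, fallback, retries=1):
    """Try the local model; if the completion is empty, hand off to the fallback.

    `retries` is kept low on purpose: as noted above, repeating the same
    call on the same model usually returns the same empty result.
    """
    for _ in range(1 + retries):
        text = primary(prompt)
        if text and text.strip():
            return text, "primary"
    return fallback(prompt), "fallback"


# Usage with stub models, just to show the control flow.
def always_empty(prompt):
    return ""


def cloud_stub(prompt):
    return '{"summary": "stub"}'


result, source = complete_with_fallback("file body", always_empty, cloud_stub)
print(source)
```

Returning the source alongside the text makes it easy to count how many files ended up on the paid path.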
DS4 + DeepSeek V4 Flash, the specialised stack
DS4 is an alpha inference engine built by antirez, dedicated to DeepSeek V4 Flash — a 284B total / 13B active MoE model. The model spec claims up to a one-million-token context; I run the server at --ctx 200000 and have tested files up to 70k tokens. It exposes an OpenAI-compatible HTTP API on port 8000 and keeps an on-disk SHA1 KV cache that persists across server restarts.
Three of those properties matter for batch extraction:
The KV cache reuses prompt prefixes. When the system prompt is identical across every call — and in a structured-extraction pipeline it always is — the prefill cost is paid once, then cached on disk. The second call onward sees a dramatically smaller per-call cost. Over a long run this compounds into the biggest single saving.
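To make the prefix-reuse idea concrete, here is a toy sketch. It is not how DS4 stores its cache: `PrefixCache` and `prefix_key` are hypothetical names, and the cached "state" is a placeholder string standing in for real KV tensors. It only illustrates that a cache keyed on a hash of the prefix pays the expensive step once per distinct prefix, and that a single changed byte yields a different key.

```python
import hashlib


def prefix_key(system_prompt):
    """SHA1 of the exact prompt bytes; any change yields a different key."""
    return hashlib.sha1(system_prompt.encode("utf-8")).hexdigest()


class PrefixCache:
    """Toy in-memory stand-in for a prefix cache."""

    def __init__(self):
        self.store = {}
        self.prefills = 0  # how many times the expensive step ran

    def get_state(self, system_prompt):
        key = prefix_key(system_prompt)
        if key not in self.store:
            self.prefills += 1  # cache miss: pay the prefill once
            self.store[key] = f"state-for-{key[:8]}"
        return self.store[key]


cache = PrefixCache()
for _ in range(100):
    cache.get_state("You are an extractor.")
print(cache.prefills)
```

The practical consequence for a pipeline is to keep the system prompt byte-identical across calls: no timestamps, no per-file hints, no trailing whitespace drift.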
200k context (the model claims more, I have only tested this much). No more chunking the 70k files. The whole thing goes in, the whole thing comes out summarised. Less aggregation logic, less risk of losing a section.
Noticeably better at structured JSON output in my early smoke tests. Zero empty-output failures in the first runs.
Hardware throughput on the M3 Ultra: ~38 tokens per second prefill, 32 generation — consistent with antirez’s documented numbers. Per-call latency in the live enrichment pipeline is 6.3 to 25.9 s, mean 11.6 s for a 2k-input, 500-token-output call (last 200 calls of a 1,241-call run). The same call shape on qwen3.5:122b ran at 90–290 s, mean 150–200 s. An order of magnitude on the wall-clock that mattered most.
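A quick back-of-envelope on what those means imply for a whole run. The Qwen figures are a projection: they assume the 150–200 s mean would hold over the same 1,241 calls, run one after another, which is not something measured above.

```python
CALLS = 1_241


def hours(calls, seconds_per_call):
    """Total wall-clock hours for sequential calls at a given mean latency."""
    return calls * seconds_per_call / 3600


ds4_hours = hours(CALLS, 11.6)
qwen_low = hours(CALLS, 150)
qwen_high = hours(CALLS, 200)

print(f"DS4:  {ds4_hours:.1f} h")                   # DS4:  4.0 h
print(f"Qwen: {qwen_low:.1f} to {qwen_high:.1f} h")  # Qwen: 51.7 to 68.9 h
```

Roughly an afternoon against two to three days of wall-clock for the same batch.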
The reliability picture is sharper than the latency picture, and it is the one that decides whether the pipeline is usable at all. Zero silent failures in 1,241 calls, versus one in five on Qwen. The system prompt tells the model “if the context is too thin, output INSUFFICIENT_CONTEXT and nothing else”, and DeepSeek V4 Flash took that exit in 194 of the 1,241 calls (15.6%). The scaffold stays in place; nothing is invented. Qwen, when it failed, failed silently.
| Metric | DS4 + DeepSeek V4 Flash | Ollama + Qwen3.5:122b |
|---|---|---|
| Per-call latency, mean | 11.6 s | 150–200 s |
| Per-call latency, range | 6.3 – 25.9 s | ~90 – 290 s |
| Silent empty-output failures | 0 / 1,241 | ~18–20 % |
| Honest refusals (scaffold kept) | 194 / 1,241 (15.6 %) | n/a |
| Context budget that holds quality | 200k tokens (tested) | ~16k effective |
| Cross-call cache reuse | on-disk SHA1, persistent | none |
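The two failure rows in the table can be told apart mechanically. A minimal sketch, assuming the reply arrives as a plain string and that the sentinel is matched exactly after trimming whitespace; `classify` and `tally` are hypothetical helpers.

```python
from collections import Counter

SENTINEL = "INSUFFICIENT_CONTEXT"


def classify(reply):
    """Label a raw completion as 'empty', 'insufficient' or 'ok'."""
    text = (reply or "").strip()
    if not text:
        return "empty"         # silent failure: nothing came back
    if text == SENTINEL:
        return "insufficient"  # honest refusal: keep the scaffold as-is
    return "ok"


def tally(replies):
    """Count outcomes over a batch of raw completions."""
    return Counter(classify(reply) for reply in replies)


counts = tally(["", '{"summary": "s"}', "INSUFFICIENT_CONTEXT\n", None])
print(counts)

# The refusal rate reported in the table, recomputed from its raw counts.
refusal_rate = round(194 / 1241 * 100, 1)
print(refusal_rate)
```

Only the `empty` bucket needs a retry path; `insufficient` is a valid answer and should be recorded, not retried.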
One unscientific observation worth recording: the Mac Studio runs cooler under DS4 at near-max GPU than under Ollama+Qwen3.5:122b at comparable load. No benchmark, just the case temperature.
Not locked to DS4. The KV-cache-on-disk trick is not proprietary. llama.cpp can hold prefix cache in memory today and has persistent disk cache on the roadmap; Ollama sits on top of it and will inherit. The pattern — specialised engine, prompt-prefix reuse — is portable, and the optimisation will spread to whichever model you happen to be serving.
Catalan generation
I am not comparing the generation quality of these two engines in this post. That is a different question, and on long enough texts of consequence I would not trust either of them to do the writing for me. For Catalan in particular, my standing rule is that anything generated by a local model below the ~250B effective-parameter mark loses the idiom — the rhythm, the vocabulary that does not translate. Qwen3.5:122b fails that test. DeepSeek V4 Flash, with thirteen billion active params at inference, sits in the same effective-size class and I expect it to fail the same test until a real translation batch tells me otherwise.
Catalan-target translation in my workshop goes to Claude or to a frontier OpenAI model. The local fleet covers ES/EN bulk only. The cost saving lives on the bulk-mechanical side; voice and idiom are not yours to save on.
Demonetisation, in the Diamandis sense
In Bold (2015), Peter Diamandis and Steven Kotler describe the six Ds that follow one another once a technology is digitised: after digitisation it goes deceptive, disruptive, demonetised, dematerialised, democratised. Demonetisation is the one that matters here. A year ago, indexing four hundred personal markdown files with an LLM was not economically obvious: at frontier-API rates the bill would have been real, and the rate limits would have stretched it across weeks. Today, on hardware I already own, the same job is a single batch run at zero variable spend. The marginal call is free.
When marginal cost approaches zero, the use cases that emerge are the ones that were not worth the bill before. Wiki indexing of a personal corpus is one. Transcription cleanup at scale, multilingual drafts for things you will rewrite anyway, semantic search across hundreds of papers, references audit, ontology extraction — each was “interesting but not interesting enough” at frontier-API prices, and each is a side project now.
That is the optimisation worth caring about. Not “Ollama vs DS4” — that decision is small, and the answer shifts as llama.cpp catches up. The optimisation is finding the workflows that only make sense once the per-call cost is gone.
Quick technical reference
If you want to try the DS4 path on an Apple Silicon machine with enough memory (256–512 GB for the Q4 build, 96–192 GB for the Q2):
    # 1. Clone + build (~30 seconds)
    mkdir -p ~/Code && cd ~/Code
    git clone https://github.com/antirez/ds4.git
    cd ds4 && make

    # 2. Download the model (Q4: 153 GB, Q2: 81 GB), antirez's published GGUFs
    ./download_model.sh q4   # or `q2` for the smaller build

    # 3. Run the server
    mkdir -p ~/.ds4-kv
    ./ds4-server \
      --ctx 200000 \
      --kv-disk-dir ~/.ds4-kv \
      --kv-disk-space-mb 32768

    # 4. Hit it like any OpenAI endpoint
    curl -s http://192.168.1.35:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "deepseek-v4-flash",
           "messages": [{"role":"user","content":"Hola"}]}'

By default the server should bind to 0.0.0.0:8000, though the version I built bound to 127.0.0.1:8000 and required an explicit reconfigure to reach across the LAN. The --kv-disk-dir is where the SHA1 KV cache lives; keep it on internal NVMe, not external storage. First call to a cold server takes ten to sixty seconds to load the weights; subsequent calls reuse them.

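For calling the server from a pipeline rather than from curl, a standard-library sketch follows. It assumes the endpoint accepts and returns the usual OpenAI chat-completions JSON (the reply text under `choices[0].message.content`); `BASE_URL`, `build_request` and `chat` are hypothetical names, and the address should be whatever host the server is actually bound to.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: adjust to your server's address


def build_request(messages, model="deepseek-v4-flash", base_url=BASE_URL):
    """Build a POST request for an OpenAI-compatible chat-completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, timeout=120):
    """Send the request and return the assistant's text.

    The generous timeout leaves room for the cold-start weight load.
    """
    request = build_request(messages)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Example call (needs a running server):
# print(chat([{"role": "user", "content": "Hola"}]))
```

Separating `build_request` from `chat` keeps the payload testable without a server on the other end.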