Everyone is arguing about which model is best. Claude, GPT, Gemini. The leaderboard changes every quarter and so does the conversation.
The Claude Code leak made something visible that we should have noticed earlier. What made Anthropic’s coding product feel powerful was not the weights. It was the orchestration layer sitting on top: query routing, context enrichment through search, artifact management, output validation, planning loops. When you could see the code, product magic looked like engineering.
A few weeks ago at a leadership offsite, I half-joked that Laozi taught us how to use AI 2,000 years ago. I still think that was partly right.
为无为,则无不治. Act through non-action, and all things will be well governed. The best AI systems do not brute-force intelligence out of the model. They shape the conditions so good outcomes emerge naturally. Get the context, tools, retrieval, and boundaries right, and much more becomes possible. Get them wrong, and even a strong model underperforms.
The real power in AI is not the model. It is the harness around it.
Not because the models do not matter. But because the harness shapes what the model actually sees, what it can do, and how its output gets validated. The model generates. The harness decides.
What the Harness Actually Does
When you ask a coding agent to build something, the model does not just receive your prompt and start writing. The harness intercepts your request. It routes it. It enriches the context with relevant code from your repository, pulls in documentation from the web, structures the task into sub‑problems, validates intermediate outputs against constraints, and maintains state across multiple planning cycles.
All of that happens before the model does its part. And all of it determines the quality of the result far more than the difference between a GPT‑5.4 and an Opus 4.6 prompt.
What we often call a model gap is really a thinking gap, hidden by a better harness.
At the technical level, the harness performs work that most users never see: context assembly, search grounding, task decomposition, memory management, tool routing, output validation, execution boundaries. Each of these multiplies what the model can accomplish from the same raw weights.
That is the feature.
The Prompt That Did Not Travel
I started with a simple assumption. ChatGPT Deep Research worked well for travel research. A single one‑paragraph prompt was enough. Good results, every time.
I built a travel research workflow in OpenClaw, an open‑source AI harness running on a VPS. It orchestrates multiple models, routes tasks, integrates web search, delegates to sub‑agents, maintains memory, loads skills. My starting setup was simple: Mimo v2 Pro and Gemini search. A one‑paragraph prompt.
The output was weak.
The obvious interpretation: the model was not as good. But the problem was not capability. It was that the prompt relied on ChatGPT’s harness doing invisible work around it. The Deep Research harness fills in context, searches, structures, validates. The OpenClaw setup was different. It did not do the same invisible scaffolding, and the prompt did not know it was missing.
Only after reworking the instruction stack specifically for the OpenClaw setup did the output improve dramatically. Not a new model. A different set of instructions that accounted for what this harness actually does and does not do automatically.
Then I fed that improved prompt back into ChatGPT. The output got much better there too.
That changed the lesson.
The prompt that works in one harness is not automatically good in another. Instructions are not portable the way people assume. Their effectiveness depends on the harness they operate inside. A large part of the value was not the model. It was the harness helping to produce the right task specification in the first place.Over time, I have found myself gradually shifting toward Qwen 3.6 and Mimo v2 (both variants) to bring costs down. The harness I have designed has narrowed the gap between my OpenClaw setup and the Claude and ChatGPT apps for general writing and deep research. The models are not all equal. But inside a well‑built system, the difference is smaller than most people expect.
Why the Same Model Feels Different
Two products using the same model feel different because they are different harnesses. One wraps the model in better context assembly. Another gives it richer search. A third has tighter validation loops. A fourth decomposes tasks more carefully and remembers what you tried last week.
The user experiences this as “the model is better here.” Sometimes it is. But more often, the difference is how much work the harness does before and after the model speaks.
For users, this means the question to ask is not which model does a product use. It is what is the product doing around the model. Context or no context. Search or no search. Memory or amnesia. Tool use or just chat. Planning loop or single shot. Those are the real differentiators.
That distinction matters because it shifts how you evaluate any AI tool. You are not buying a model. You are buying a harness. Always.
Model Commoditization Has Arrived
Models are converging. The distance between the top tier is shrinking. Open‑weight options are competitive with closed models on most benchmarks. The API cost curves look like commodity curves.
The barbell is forming. Commodity models on the bottom. The orchestration layer on the top. Everything in between gets squeezed out.
Model commoditization has arrived. The non‑commodity layer is the harness. The system that shapes what the model sees, what it can touch, how it validates its own work, and how it chains its own reasoning together.
For anyone thinking about AI strategy, this flips the vendor question. The point is not which company has the best model. The point is which harness gives you the most control, the most transparency, and the most portability. Not because you will swap models every week. But because you should be able to.
That is the shift.
Real Capital Changes the Feedback Loop
The harness thesis stops being theoretical the moment real money is at stake.
I am also running portfolio management research through these same systems. Real capital at risk sharpens the feedback loop. You learn faster when the output is not just interesting, but consequential. A research workflow that looks impressive in a demo is one thing. A research workflow that helps you make better decisions under uncertainty is something else.
The question I am currently investigating: why does my ChatGPT research desk seem to help construct portfolios with strong Sortino profiles, and what exactly is still missing in my OpenClaw setup?
My current guess is that the answer will not be “the model is better.” It will be something more structural in the harness. Better research flow. Better context carryover. Better synthesis. Better ranking of uncertainty. Better decision support at the point where ambiguity is still high.
That is the kind of gap that matters. Not benchmark scores. Not token throughput. Whether the system helps you act well when the answer is not yet clear.
The Risk: Borrowed Cognition
Strong harnesses change the user. When a system is good enough, you stop doing the thinking it used to do for you. First‑principles atrophy. Borrowed cognition gets mistaken for your own. You cannot tell the difference between what you figured out and what the harness figured out for you.
This is not hypothetical. It already happens with search engines and calculators, just at lower stakes. AI harnesses go much further because they make decisions, not just retrieve.
What kind of harness is shaping how we think, and what happens to us if we cannot leave it?Strategically, this creates a dangerous dependency: commoditized models plus a black‑box harness equals lock‑in. You are not choosing the best tool. You are negotiating from weakness. You cannot leave because you no longer understand what the harness was doing for you, and you cannot rebuild it from the outside.
Economic dependency follows. When the underlying commodity is cheap and interchangeable, the supplier with the best harness holds all the pricing power. Not because their intelligence is unique. Because they shaped the conditions so that leaving feels impossible.
That is the trap.
Shape the Conditions
有之以为利,无之以为用. We benefit from what is there, but we make use of what is not there. A vessel is useful because of the empty space it leaves open. A good harness is the same. It leaves room for the right things to happen without forcing them.
Translated into plain terms: do not ask what the model can do. Ask what conditions you have built around it. Context, search, memory, validation, permissions, boundaries. Get the conditions right and the model does the rest. Get them wrong and no model is good enough.
What to Do
Stop asking which model is best. Ask what your harness does around the model.
Know what invisible work your current harness is doing for you. If you cannot name it, you cannot replace it.
Test your instructions across harnesses. If they only work in one, you do not have a prompt problem. You have a portability problem.
Build or use harnesses you can inspect, modify, and walk away from. Open‑source is not a moral choice. It is a leverage choice.
The real risk is not that models will think badly for us. It is that they will think well enough, inside systems we cannot see, that we stop thinking deeply and lose the leverage to leave.
That loss is the only one that matters.
This article was produced with inputs from Qwen 3.6 Plus, Mimo v2 Flash, Mimo v2 Pro, GPT-5.4-mini, GPT-5.4, and Sonnet 4.6, all running through an open-source harness. Which, if you have been paying attention, is the point.