Domain Experts Dominate AI

Last week a colleague asked me to review an AI-generated architecture for running FIX engines in Kubernetes. He’d used Claude Code to generate the entire implementation: thousands of lines of gateway code and tens of thousands of lines of tests. The output was fluent and well-structured, but the gateway accepted only pure FIX standard messages. No support for custom tags or dialects.

He didn’t know it was wrong because he’d never operated FIX sessions in production.

There’s a seductive narrative circulating in boardrooms right now: give AI to cheaper, less experienced staff and they’ll produce expert-level output. Replace your expensive domain specialists with generalists who are good with AI — fluent in prompt engineering, comfortable with Claude, Copilot, Perplexity, and whatever ships next week. The research says the opposite. It’s not even close. Let me show you why.

Two Prompts, Two Architectures

Here’s the prompt my colleague used:

Design a FIX 4.2 acceptor gateway for Kubernetes that routes client sessions to engine pods with high availability

The AI produced a clean architecture: a stateless gateway, a Kubernetes Service with TCP load balancing, and round-robin across a Deployment, all validating against the pure FIX 4.2 standard. The first counterparty message carrying a custom tag would trigger a session-level Reject. The automated QA suite, built from the same prompt, validated only standard tags, so the Reject was never caught in testing. Without a domain expert reviewing the output, this ships to production and becomes an incident the first morning a counterparty connects.
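To make the failure mode concrete, here is a minimal Python sketch of a standards-only validator next to a dialect-aware one. It is not the generated gateway: the tag set is a tiny illustrative subset of FIX 4.2, tag 5001 is a made-up custom tag in the range commonly used for user-defined fields, and the function names are hypothetical. BodyLength and CheckSum are placeholders and are not verified.

```python
SOH = "\x01"  # FIX field delimiter

# Tiny illustrative subset of standard FIX 4.2 tags (a real data dictionary is far larger).
STANDARD_TAGS = {8, 9, 35, 49, 56, 34, 52, 11, 21, 55, 54, 38, 40, 60, 10}


def parse_fix(raw: str) -> list[tuple[int, str]]:
    """Split a raw FIX message into (tag, value) pairs."""
    fields = []
    for part in raw.strip(SOH).split(SOH):
        tag, _, value = part.partition("=")
        fields.append((int(tag), value))
    return fields


def validate(raw: str, dialect_tags: frozenset[int] = frozenset()) -> str:
    """Return "ACCEPT", or a reject note naming the first tag outside the dictionary."""
    allowed = STANDARD_TAGS | dialect_tags
    for tag, _ in parse_fix(raw):
        if tag not in allowed:
            return f"REJECT: tag {tag} not in dictionary"
    return "ACCEPT"


# A NewOrderSingle (35=D) carrying one hypothetical counterparty-specific tag, 5001.
order = SOH.join([
    "8=FIX.4.2", "9=0", "35=D", "49=CLIENT1", "56=BROKER", "34=2",
    "52=20240102-14:30:00", "11=ORD1", "21=1", "55=IBM", "54=1", "38=100",
    "40=1", "60=20240102-14:30:00", "5001=ALGO_VWAP", "10=000",
]) + SOH

strict_result = validate(order)                      # standards-only dictionary
dialect_result = validate(order, frozenset({5001}))  # counterparty dialect registered
print(strict_result)
print(dialect_result)
```

A test suite generated from the same standards-only assumption would only ever send messages like this one without tag 5001, which is why the reject path never shows up in testing.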

A domain expert’s full prompt runs nearly 3,000 words — specifying Kubernetes operator CRDs with routing rules on FIX header fields, StatefulSet pod ordinals, per-tag redaction policies, TCP proxy technology selection, TLS offload acceleration phases through FPGA/SmartNIC, and a complete OpenTelemetry observability design. Same AI. Completely different architecture.
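One piece of that expert prompt, routing on FIX header fields to a specific StatefulSet pod, can be sketched in a few lines of Python. FIX sessions track sequence numbers per session, so a given SenderCompID/TargetCompID pair has to keep landing on the engine that holds its state; round-robin across interchangeable pods does not guarantee that. Everything named here (the route table, `fix-engine`, `fix-engine-hl`, the `trading` namespace) is an assumption for illustration, not the actual prompt or design. The only Kubernetes fact relied on is the stable per-pod DNS name a StatefulSet gets through its governing headless Service.

```python
# Hypothetical routing table: (SenderCompID, TargetCompID) -> StatefulSet pod ordinal.
ROUTES = {
    ("CLIENT1", "BROKER"): 0,
    ("CLIENT2", "BROKER"): 1,
}


def engine_host(sender_comp_id: str, target_comp_id: str,
                statefulset: str = "fix-engine",
                service: str = "fix-engine-hl",
                namespace: str = "trading") -> str:
    """Resolve a FIX session to the stable DNS name of the pod that owns it."""
    ordinal = ROUTES.get((sender_comp_id, target_comp_id))
    if ordinal is None:
        # Unknown sessions are refused rather than spread across pods.
        raise LookupError(f"no route for session {sender_comp_id}->{target_comp_id}")
    # StatefulSet pods are addressable as <statefulset>-<ordinal>.<service>.<namespace>.svc.cluster.local
    return f"{statefulset}-{ordinal}.{service}.{namespace}.svc.cluster.local"


host = engine_host("CLIENT1", "BROKER")
print(host)
```

In an operator-based design the table would presumably come from a custom resource rather than a module-level dict; the dict just keeps the sketch self-contained.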

W. Edwards Deming is widely credited with the line “You don’t know what you don’t know.” Donald Rumsfeld later made the framework famous: known knowns, known unknowns, unknown unknowns. When I prompt an AI about FIX gateway architecture, I operate in known knowns and known unknowns. I know what I need. I recognize when the output drifts. My colleague operated blind in the unknown unknowns. He couldn’t evaluate what he couldn’t recognize.

This is the Dunning-Kruger effect applied to AI. Dunning and Kruger’s 1999 research showed that the skills needed to produce correct answers are the same skills needed to recognize correct answers. My colleague didn’t write a bad prompt. He wrote the best prompt his knowledge allowed — and had no way to know the gap existed.

An Aalto University study published in Computers in Human Behavior confirmed this with AI specifically: across 698 participants, AI improved performance by 3 points but inflated self-assessment by 4 points. Everyone became overconfident. Most submitted a single prompt and accepted results without verification. As the researchers noted: “AI literacy might be very technical, and it’s not really helping people actually interact fruitfully with AI systems.”

The Jagged Frontier

The most rigorous evidence comes from Harvard Business School and Boston Consulting Group. Their landmark study, Navigating the Jagged Technological Frontier, put 758 BCG consultants through realistic tasks with and without GPT-4 access.

The headline results look like an equalizer story. Consultants with AI completed 12.2% more tasks, finished them 25.1% faster, and produced results rated more than 40% higher in quality. Inside the frontier, where AI was genuinely capable, below-average consultants improved by 43%. AI lifts the floor.

But outside the frontier — where AI was confidently wrong — consultants using AI performed 19 percentage points worse than those working without it. The AI actively degraded performance. The worst performers were those who blindly adopted AI output without interrogation. My colleague’s FIX gateway was an outside-the-frontier task — standards-only code that passed AI-generated tests and would have rejected every real counterparty. He didn’t know that. The AI certainly didn’t tell him.

The researchers identified two patterns among effective AI users: “Centaurs” who strategically delegate specific subtasks to AI, and “Cyborgs” who continuously integrate AI into their workflow. Both patterns require knowing which tasks the AI can handle — knowledge that comes from domain expertise, not AI fluency.

The Data Is Not Ambiguous

A follow-up study from Harvard Business School gave 78 workers at IG Group — web analysts, marketing specialists, and technology specialists — identical AI-assisted content tasks. AI slashed conceptualization time by roughly 63% and writing time by 75%. Everyone got faster. But technology specialists — the group furthest from the domain — scored 13% below web analysts on quality despite identical AI access. Conceptualization scores were nearly identical (4.05 vs 4.12 out of 5.0) — AI helped everyone ideate equally well. The gap appeared in execution, where domain knowledge determined whether you could turn a concept into something correct. The researchers call this “knowledge distance.” The farther you are from the domain, the less AI can compensate. Adjacent expertise transfers. Distant expertise doesn’t.

A METR study gave 16 experienced open-source developers access to Cursor Pro with Claude 3.5/3.7 Sonnet across 246 tasks. They predicted AI would save 24% of their time. It increased completion time by 19%. They still believed it had helped them.

The seniority gap is just as telling. Fastly surveyed 791 developers: 32% of seniors ship mostly AI-generated code versus 13% of juniors — a 2.5x gap. Jellyfish data shows seniors write code 22% faster with Copilot while juniors gain 4%. Addy Osmani, engineering leader at Google, calls this the 70% problem: non-experts hit 70% fast, then spiral on the remaining 30%.

Financial Services: Where the Stakes Are Measured in Basis Points

The prompt gap isn’t academic in financial services. Consider anti-money laundering. A non-expert prompts:

Flag suspicious transactions in our wire transfer data

The AI returns generic threshold rules — transactions over $10,000, round-number amounts, high-frequency patterns. Textbook compliance. Catches nothing sophisticated.
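For a sense of what those generic rules amount to, here is a minimal Python sketch. The field names, thresholds, and sample transfers are all hypothetical; the point is only that each rule looks at one transfer (or one sender's daily count) in isolation.

```python
from collections import Counter


def naive_flags(transfers, amount_limit=10_000, max_per_day=5):
    """Flag transfers using textbook rules: large, round, or high-frequency."""
    per_sender_day = Counter((t["sender"], t["date"]) for t in transfers)
    flagged = []
    for t in transfers:
        reasons = []
        if t["amount"] > amount_limit:
            reasons.append("over_limit")
        if t["amount"] % 1_000 == 0:
            reasons.append("round_amount")
        if per_sender_day[(t["sender"], t["date"])] > max_per_day:
            reasons.append("high_frequency")
        if reasons:
            flagged.append((t["id"], reasons))
    return flagged


# Made-up data: T2 and T3 sit just under the limit with non-round amounts.
transfers = [
    {"id": "T1", "sender": "ACME", "date": "2024-01-02", "amount": 25_000},
    {"id": "T2", "sender": "ACME", "date": "2024-01-02", "amount": 9_850},
    {"id": "T3", "sender": "SHELLCO", "date": "2024-01-03", "amount": 9_437},
]

naive_result = naive_flags(transfers)
print(naive_result)
```

Only the large, round transfer is flagged. Anything structured to stay under the limit with an irregular amount passes untouched.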

A compliance officer with fifteen years of experience prompts:

Analyze wire transfers for layering patterns through correspondent banking chains. Flag nested relationships where the originating institution is in a FATF grey-list jurisdiction, the intermediary clears through a US bank, and the beneficiary account was opened within 90 days. Cross-reference against our SAR filing history for repeat counterparties.

Same AI. That prompt encodes knowledge of correspondent banking relationships, jurisdictional risk ratings, and the specific regulatory expectations of FinCEN, the FCA, and MAS. The expert’s prompt is the detection logic. Without that knowledge, the tool returns noise that looks like signal — the most dangerous kind of output in regulated industries. McKinsey’s research on AI in banking identifies regulatory compliance, financial crime, credit risk, and climate risk as first-wave applications. Every one of these domains requires deep institutional knowledge that no prompt can substitute.
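Read as detection logic, the compliance officer's prompt translates almost line for line into a rule. The Python sketch below is one possible reading under loud assumptions: the record fields are hypothetical, `GREY_LIST` holds a placeholder code rather than any real FATF list, `PRIOR_SAR_COUNTERPARTIES` stands in for a lookup against SAR filing history, and "opened within 90 days" is interpreted as within 90 days before the transfer date.

```python
from datetime import date

GREY_LIST = {"XX"}                       # placeholder jurisdiction codes, not a real list
PRIOR_SAR_COUNTERPARTIES = {"BENE-77"}   # stand-in for a SAR filing history lookup


def layering_flags(transfers, max_account_age_days=90):
    """Flag grey-list originator -> US intermediary -> recently opened beneficiary."""
    flagged = []
    for t in transfers:
        account_age = (t["date"] - t["beneficiary_opened"]).days
        if (t["originator_country"] in GREY_LIST
                and t["intermediary_country"] == "US"
                and 0 <= account_age <= max_account_age_days):
            flagged.append({
                "id": t["id"],
                "repeat_counterparty": t["beneficiary"] in PRIOR_SAR_COUNTERPARTIES,
            })
    return flagged


# Made-up transfers, all under $10,000 with non-round amounts.
wires = [
    {"id": "W1", "amount": 9_437, "date": date(2024, 3, 1),
     "originator_country": "XX", "intermediary_country": "US",
     "beneficiary": "BENE-77", "beneficiary_opened": date(2024, 1, 15)},
    {"id": "W2", "amount": 9_120, "date": date(2024, 3, 1),
     "originator_country": "XX", "intermediary_country": "US",
     "beneficiary": "BENE-12", "beneficiary_opened": date(2022, 5, 1)},
    {"id": "W3", "amount": 8_760, "date": date(2024, 3, 2),
     "originator_country": "DE", "intermediary_country": "US",
     "beneficiary": "BENE-40", "beneficiary_opened": date(2024, 2, 20)},
]

expert_result = layering_flags(wires)
print(expert_result)
```

Every condition in the rule comes from the prompt, and none of them is about the amount. Someone who could not have written the prompt would also struggle to tell whether this rule is right.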

The hallucination rates make this concrete: research shows up to 41% of finance-related AI queries contain hallucinations, spiking to 60-80% in specialized subdomains. AI fabricates regulatory references, cites non-existent Bank Secrecy Act clauses, and hallucinates market data. A Stanford HAI study found that even purpose-built legal AI tools like Lexis+ AI and Westlaw AI hallucinate 17-34% of the time. General-purpose chatbots hallucinate on 58% to 82% of legal queries. A lawyer was sanctioned in federal court for submitting ChatGPT-fabricated case citations — a practicing attorney who simply didn’t verify.

The firms getting this right aren’t hiring AI specialists. JPMorgan deployed its AI toolkit to 200,000+ employees — domain experts, not a centralized AI team. Visa applied AI to fraud detection with domain-expert oversight: 85% reduction in false positives. Bank of America holds nearly 1,100 AI patents and pending applications. The U.S. Department of the Treasury launched the AI Innovation Series with FSOC, emphasizing “embedding AI into core workflows” — not replacing the people who run them.

If your AI strategy starts with hiring AI experts who don’t understand your business, you’re solving the wrong problem. The regulatory, reputational, and financial risks of AI-generated output that no one on your team can verify are not theoretical — they’re measurable.

The Difference Isn’t the Tool

Every AI tool on the market — ChatGPT, Claude, Copilot, Perplexity, and every “AI computer” that promises to do anything and everything — shares the same fundamental limitation: it can generate plausible output in any domain, but it cannot distinguish correct output from confidently wrong output. That’s the user’s job. And that job requires domain expertise.

Humanity’s Last Exam, published in Nature, tested AI against domain experts across 2,500 questions in 100+ disciplines. At the time of publication, the top AI model scored 37.5%. Domain experts scored approximately 90% within their specialties. The models are impressive generalists. They are not domain experts.

Wharton professor Prasanna Tambe’s research in Management Science found that firms capture more value from AI when it’s distributed across domain experts rather than concentrated in IT departments. Ethan Mollick frames management itself as an AI superpower: his students succeed because they spent years learning to scope problems and recognize when output is wrong.

PromptLayer told TechCrunch it saw 13x revenue growth “as teams discover they need domain experts, not just engineers, to build AI.” Their thesis: “You can’t build healthcare AI without doctors, legal AI without lawyers, or therapy AI without therapists.”

The mechanical aspects of prompting are increasingly irrelevant. A review of 1,500+ prompt engineering papers found that AI systems create better prompts in ten minutes than human experts do after twenty hours. The machines optimize themselves. What they can’t optimize is the knowledge of whether their output is right.

Kartik Hosanagar at Wharton warns that AI is actively deskilling its users: “The effort we avoid is often the expertise we lose.” Organizations that skip domain expertise in favor of AI-first workflows aren’t just accepting lower quality — they’re destroying their capacity to evaluate quality at all.

What This Means

This isn’t even unique to AI. If you hire a contractor and nobody on your team can evaluate their work, you have the same problem — you just find out slower. AI makes it cheaper to be wrong at scale. Every tool on the market generates output that someone must evaluate. If that someone can’t tell right from wrong in the domain, the tool’s sophistication is irrelevant.

After twenty years architecting distributed systems in financial services, I’ve seen technology cycles promise to flatten expertise hierarchies — from WS-* to WCF to microservices to cloud-native. Every time, the engineers who understood the domain adopted the new tools first and pulled further ahead. AI is no different.

The best AI users in your organization are already there. They’re not in the innovation lab. They’re the ones who know why your trading engine loads reference data from flat files at startup instead of hitting a database — because an external dependency that goes down at market open is a material business risk, not an architecture inconvenience. They’re the ones who built your CI/CD pipeline to run integration tests in isolation before deploying to an environment where a bad release destabilizes systems that real users depend on. Give them the tools. Get out of the way.

Author
George Tsiokos
