
The Market Is Adversarial. Your Prompt Isn't.


Tags: AI, trading, finance, LLM, quantitative-finance



I've been deep in this question for weeks. Can an LLM actually trade? Not "help you trade" or "make you feel like you're trading smarter"—actually generate real returns as a bot. Sentiment-based, technical, whatever. The $9–11 billion AI trading platform market really wants the answer to be yes. The evidence says something different.

Three questions, three different answers. Can it do sentiment? Sort of, with a closing window. Can it run as a pure technical bot on price data? No. Are LLMs better than PhDs at this? Wrong question.


1. Why This Feels So Convincing

You can code. GPT is powerful. Market data is everywhere. Each piece is real. The problem is the conjunction. Combining real capabilities doesn't automatically create edge in a market that's designed to take money from people who think they have edge.

The numbers are really something. India's SEBI found 91% of individual futures and options traders lost money in FY25. Losses surged 41% year-over-year to about $12.5 billion. Participation doubled in the same period. ESMA reports 74–89% of retail CFD traders in Europe lose money. The CFTC says two out of three forex customers lose. Academic research puts day-trader loss rates at 97% within a year.

People aren't pouring into AI trading because the math got better. They're pouring in because the tools feel smarter. That's actually more dangerous.


2. Sentiment Analysis: Real Signal, Expiring Edge

I'll give credit where it's due. LLMs do something genuinely useful when processing unstructured text and turning it into directional signals. Earnings transcripts, SEC filings, news. There's real signal there.

The Lopez-Lira and Tang study is the best evidence. GPT-4 showed real predictive content on news-driven price reactions with Sharpe ratios around 3.8 before costs—a Sharpe ratio measures return per unit of risk; higher is better. But at just 25 basis points per trade (a quarter of a percent in transaction costs), cumulative returns collapsed from roughly 400% down to about 50%. "Incredible edge" becoming "barely worth running" just from normal trading friction.
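To make that friction concrete, here's a back-of-envelope sketch of how a fixed per-trade cost compounds across hundreds of trades. The gross edge and trade count are my own illustrative assumptions, not the study's actual trade log:

```python
# Back-of-envelope sketch: compound a per-trade return net of a fixed
# per-trade cost. Gross edge and trade count are illustrative assumptions.
def cumulative_return(gross_per_trade, cost_bp, n_trades):
    """Cumulative % return after compounding (gross - cost) n_trades times."""
    net = gross_per_trade - cost_bp / 10_000   # basis points -> fraction
    return ((1 + net) ** n_trades - 1) * 100

# A 35 bp average gross edge compounded over 450 trades:
print(round(cumulative_return(0.0035, 0, 450)))    # ~382% with zero costs
print(round(cumulative_return(0.0035, 25, 450)))   # ~57% at 25 bp per trade
```

Because returns compound geometrically, a cost that eats most of the per-trade edge doesn't shave the cumulative return proportionally; it collapses it.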

The bigger problem is that the researchers themselves found strategy returns decline as more people adopt LLMs for this. Every quant shop has an NLP team now. The earnings transcript signal from 2020 when nobody was doing this systematically is a completely different thing than the same signal in 2025 when everyone is.

Where it works: earnings call tone analysis, SEC filing language changes over days to weeks, Fed communication parsing, and less-covered small/mid caps where there's genuine information asymmetry. Where it doesn't: large-cap news (too many eyeballs, already priced), social media sentiment (adversarial and manipulable), and crypto. A 2025 MDPI study tested five frontier LLMs across 12 crypto assets and found they just can't handle erratic assets. Stronger models sometimes did worse than weaker ones. That's not learning. That's noise.

So sentiment gets a conditional pass. Real signal, narrow conditions, thin edge, actively decaying. If you're building on this, you've probably got 12–24 months before it's arbitraged away. Build it knowing it expires.

Signal decay - edge eroding as adoption grows


3. The Technical Bot Question: No

This is what I really wanted to validate. A pure technical bot. Price, volume, order book, indicators like RSI (relative strength index—a momentum measure), ATR (average true range—a volatility measure), moving averages. No text, no sentiment. Just numbers.

The answer is no. And the reasons aren't going to get fixed with the next model release.

LLMs are language models. Their entire architecture is optimized for semantic relationships between words. When you feed OHLCV data—open, high, low, close, volume, the standard price candle format—into one, you're using a tool built for language to do time-series analysis. Dedicated time-series models like LSTMs or Temporal Fusion Transformers or even basic linear regression are just better suited to the job. Price series have autocorrelation structures and volatility clustering and regime-dependent behavior. Those need fundamentally different inductive biases than what a language model provides.

The academic evidence is consistent. The most rigorous walk-forward study I found tested 34 independent periods across 100 U.S. equities with realistic transaction costs. Result: 0.55% annualized returns, 0.33 Sharpe ratio, 41% win rate, p-value of 0.34. Not statistically significant. The authors called it "honest reporting rather than p-hacking." That's a polite way of saying: it doesn't work.
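A 0.33 Sharpe ratio and a p-value of 0.34 go hand in hand. For roughly i.i.d. returns, the t-statistic of the mean return is approximately the annualized Sharpe times the square root of the years of data. A quick check (the 8.5-year horizon here is my assumption for illustration, not the study's stated span):

```python
import math

# Rough significance check: for roughly i.i.d. returns, the t-statistic of
# the mean return is about (annualized Sharpe) x sqrt(years of data).
def sharpe_t_stat(sharpe_annual, years):
    return sharpe_annual * math.sqrt(years)

def two_sided_p(t):
    """Two-sided p-value under a normal approximation."""
    return 2 * (1 - 0.5 * (1 + math.erf(t / math.sqrt(2))))

t = sharpe_t_stat(0.33, 8.5)      # years assumed for illustration
print(round(t, 2))                # ~0.96: far below the ~2.0 bar
print(round(two_sided_p(t), 2))   # ~0.34: consistent with zero edge
```

A t-statistic under 1.0 means the measured returns are indistinguishable from a strategy with no edge at all.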

It keeps going. Deep learning models tested on all 30 DJIA stocks converged to about 50% out-of-sample accuracy. Coin flip territory. Thirteen separate Random Forest models all underperformed buy-and-hold SPY. Google's Gemini 1.5 Flash couldn't beat naive benchmarks and got worse with longer time horizons. Five frontier LLMs on 12 crypto assets couldn't handle erratic price action. Arian, Norouzi, and Seco found that up to 78% of strategies show negative out-of-sample Sharpe ratios despite looking great in backtests—meaning they looked profitable on historical data and lost money on data they hadn't seen.

Then there's the latency problem. AI trading hype loves to mention HFT (high-frequency trading—strategies that execute thousands of trades per second), scalping, arbitrage. These need sub-millisecond execution. LLM API calls take hundreds of milliseconds at best. An LLM can't compete in any of those. The strategies where an LLM could theoretically operate—swing trading, daily signals—are exactly where adversarial dynamics and alpha decay hit hardest.

The deepest problem: technical analysis operates on public price data. Public price data is the most efficient information surface in finance. It's the aggregate of every participant's information and action. There isn't hidden signal sitting there waiting for a better prompt to find it.

Renaissance Technologies' Medallion Fund is the best trading system ever built. 90 PhDs, proprietary data, four decades of refinement. They're right 50.75% of the time. That's the ceiling. An LLM prompted with candlestick charts isn't going to find what 40 years of the world's best quantitative researchers missed.
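Some illustrative arithmetic (my numbers, not Renaissance's actual trade structure) shows what a 50.75% hit rate is worth, and why it only pays at enormous trade counts:

```python
# What a 50.75% hit rate buys on even-odds $1 bets (illustrative only).
p = 0.5075
edge_per_trade = 2 * p - 1              # expected profit per $1 risked
print(round(edge_per_trade, 4))         # 0.015 -> 1.5 cents per dollar

# Per-trade outcome std on an even $1 bet is ~$1, so the cumulative edge
# only clears two standard deviations of noise after N trades where
# edge * N > 2 * sqrt(N):
n_trades = (2 / edge_per_trade) ** 2
print(round(n_trades))                  # ~17,778 trades just to reach 2 sigma
```

A 1.5-cent edge per dollar is a fortune at Medallion's trade volume and execution quality, and nothing at retail scale.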

LLMs don't make this impossible task harder. They make it feel more possible. That's the real danger.

Coin flip convergence - technical signal resolving to statistical noise


4. The PhD Question

People keep framing this as LLMs vs. PhDs. It's not a competition. It's two completely different kinds of intelligence applied to the same problem.

LLMs are genuinely faster at certain things. Reading 10,000 earnings transcripts in an hour instead of weeks. Generating hundreds of strategy hypotheses. Writing backtesting code without complaining. Available 24/7, no ego, significantly cheaper than a quant salary.

But PhDs are better at the thing that actually matters for durable trading edge. They ask questions that haven't been asked before. They reason about situations that aren't in the training data. Novel market regimes are, by definition, out-of-distribution—meaning no LLM has ever seen them. LLMs are interpolation engines. They recombine patterns from training. The PhD's value is exactly in the out-of-distribution case, the market regime nobody has modeled before.

Chen et al. found that LLM forecasts overextrapolate recent performance and this tendency is really hard to fix with prompt engineering. The LLM sounds equally confident whether it's right or wrong. A PhD notices when the model's assumptions are being violated. The LLM just keeps trading the dead pattern.

Medallion averaged 66% gross returns (39% net of fees) annually since 1988 and generated over $100 billion in total trading profits. That's 90 PhDs plus proprietary data plus stellar execution infrastructure plus four decades of compounding institutional knowledge plus capacity discipline. The LLM's role in that architecture is accelerating the PhD researchers. Not replacing them.

And the SEBI data puts a fine point on it: 97% of institutional profits come from algorithmic trading. An LLM doesn't change which side of that trade retail sits on.


5. What's Actually Worth Building

The evidence points to a pretty specific place where LLMs belong in a trading workflow. Not as the pilot. As the co-pilot.

Research synthesis is the obvious one. Summarizing papers, comparing indicators, generating hypotheses to test. That's the 10x developer effect applied to quant research and it's real. Code generation too—faster iteration on backtest infrastructure, data pipelines, monitoring systems. You're building the tools faster, not finding alpha faster.

Anomaly detection is underrated. Knowing when data feeds break, when slippage patterns shift, when model behavior diverges from expectations. That's operational value, not predictive value, and it's genuinely useful.
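A minimal sketch of what that looks like in practice: a z-score alarm on fill slippage. The thresholds and sample data here are made up for illustration.

```python
import statistics

# Operational monitoring sketch: flag fills whose slippage sits far outside
# recent history. No prediction involved -- purely a health check.
def slippage_alert(history_bp, latest_bp, z_threshold=3.0):
    """True when the latest fill's slippage (in basis points) is more than
    z_threshold standard deviations from the recent mean."""
    mean = statistics.fmean(history_bp)
    std = statistics.stdev(history_bp)
    if std == 0:
        return latest_bp != mean
    return abs(latest_bp - mean) / std > z_threshold

baseline = [1.8, 2.1, 2.0, 1.9, 2.2, 2.0, 1.7, 2.1]  # typical ~2 bp fills
print(slippage_alert(baseline, 9.0))   # True: something changed upstream
print(slippage_alert(baseline, 2.3))   # False: within normal variation
```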

Trade journaling: enforcing pre-commitment to hypotheses, pulling patterns from your own behavior, reducing emotional decision-making. Behavioral discipline infrastructure is where most retail traders actually leak money, and that's a real problem LLMs can help with.

What you shouldn't build: an LLM that takes price/volume data and spits out trade signals. The evidence is uniformly against it. Same goes for LLM-powered HFT bots, standalone technical signal generators, and "continuous learning" systems that sound good but face all the same structural problems.



6. Follow the Subscription Revenue

If LLMs reliably generated alpha, the rational move would be to raise institutional capital and trade it. Not sell $79/month subscriptions. The subscription model is the revealed preference of people who know the strategy doesn't scale or doesn't work. You don't sell your edge. You trade it.

The whole AI trading platform market is built on subscriptions and courses. 3Commas at $39–79/month, Cryptohopper up to $107.50/month, Trade Ideas at $254/month for its "Holly AI" system. The product is the subscription. Not the alpha.

The scam ecosystem is thriving alongside it. The CFTC's advisory "AI Won't Turn Trading Bots into Money Machines" cited Cornelius Steynberg stealing $1.7 billion in bitcoin from 23,000 people with a fake AI bot promising 10% monthly returns. The SEC's AI washing enforcement has escalated to criminal charges and 20-year sentences.

The hype and the fraud both depend on the same thing: that you can't tell the difference until after you've paid.


7. Where This Lands

Sentiment analysis: real but fragile. There's genuine signal in processing text that markets haven't fully priced, especially in less-efficient corners. The edge is thin and shrinking as adoption grows. Build it knowing the clock is ticking.

Pure technical bot: the evidence says no. Not "not yet." The failure is architectural, structural, and practical. LLMs don't have a comparative advantage in extracting signal from public price data. Nobody really does, and the best system ever built is right 50.75% of the time after decades of work by 90 PhDs.

LLMs vs. PhDs: wrong frame. LLMs accelerate research. PhDs generate the insights that research is accelerating toward. One without the other is incomplete.

Markets don't care how good your prompt is. They care about position sizing, transaction costs, behavioral discipline, and whether your counterparty has a faster fiber-optic cable. The money in this space will be made by people building infrastructure—monitoring, risk rails, research tools. Not auto-trading bots.


Context → Decision → Outcome → Metric

  • Context: A fintech startup came to me in early 2025 to assess their LLM-based trading bot before a seed raise. They'd built 8 months of backtests showing 340% returns over 3 years and were pitching institutional investors. They wanted validation before the roadshow.
  • Decision: Ran a full strategy audit: reviewed the backtesting methodology, walk-forward performance, transaction cost assumptions, and the model's signal source. Found survivorship bias in the asset universe, zero-transaction-cost assumptions, and look-ahead bias—the model was inadvertently using future price data to generate historical signals.
  • Outcome: With realistic costs and corrected methodology, the "340% strategy" reduced to 6% annualized with a negative Sharpe ratio. They did not raise on those numbers. They pivoted to building research tooling for quant funds instead.
  • Metric: A $2.8M seed round that would have been raised on unintentionally fraudulent backtests never happened. The pivot to research tooling landed their first paying customer inside 90 days.

Anecdote: The Backtest That Looked Like a Business

In March 2025, a founder walked me through what he described as three years of proof that his LLM trading bot worked.

The numbers looked extraordinary. The bot had been trained on S&P 500 price data combined with GPT-4 sentiment analysis on financial news, and the backtest showed consistent outperformance with surprisingly low drawdowns. He had charts. He had Sharpe ratios. He had month-by-month results going back to January 2022 that showed the system navigating a brutal bear market and a chaotic recovery.

He wanted me to validate the approach before he started raising money.

I asked two questions. First: what were your transaction costs? He'd modeled the bid-ask spread at zero for the large-caps. Second: did the sentiment data exist in real-time on those historical dates, or was it processed retroactively? He looked at me for a long moment. The news sentiment had been run through GPT-4 in 2024 and matched to 2022 price data, but the model had been trained on all of it together.

That's look-ahead bias. The model "knew" things in 2022 that weren't knowable until 2024. The backtest was measuring the past performance of a system that couldn't have existed in 2022.

I told him to rerun it with clean data, realistic spreads, and a true walk-forward test—where the training period ends before the test period begins. He came back six weeks later. The 340% strategy was gone. What remained was noise.
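A true walk-forward split is mechanically simple; the discipline is in never letting training data overlap the test window. A generic sketch (window sizes are arbitrary, and this is not the audit's actual harness):

```python
# Minimal walk-forward splitter: training always ends strictly before
# testing begins, which is the mechanical fix for look-ahead bias.
def walk_forward_splits(n_samples, train_size, test_size):
    start = 0
    while start + train_size + test_size <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += test_size   # roll forward, never re-test seen data

# 10 samples, train on 4, test on the next 2, then roll the window.
for train, test in walk_forward_splits(10, train_size=4, test_size=2):
    assert max(train) < min(test)   # no future data leaks into training
    print(train, test)
```

The assertion is the whole point: if any training index is later than any test index, the backtest is measuring a system that couldn't have existed at the time.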

He spent those same six weeks talking to two quant funds about what research tools they actually wanted, and one of them became his first paying customer.

The backtest looked like a business. The pivot was the actual business.


Mini Checklist: Before You Build an AI Trading System

  • [ ] Are you planning to trade the strategy or sell access to it? If the answer is sell, ask yourself why you wouldn't just trade it.
  • [ ] Did you model realistic transaction costs? Even 10–25 basis points per trade destroys most LLM-generated edge in backtests.
  • [ ] Is your backtest walk-forward tested, or just in-sample? In-sample performance tells you nothing about live performance.
  • [ ] Did any of your "historical" sentiment data get processed retroactively by a modern LLM? That's look-ahead bias by another name.
  • [ ] Is the LLM seeing price/volume data directly as its primary signal source? If yes, you're fighting the architecture—dedicated time-series models are better suited.
  • [ ] Is your signal rooted in text processing, not price prediction? That's where the legitimate edge exists, and it's narrow.
  • [ ] Have you accounted for signal decay? An edge that exists today because LLM adoption is low may be gone in 18 months as every quant shop builds the same thing.
  • [ ] Are you solving for research speed, anomaly detection, or behavioral discipline? Those are the defensible LLM use cases.
  • [ ] Does your strategy beat buy-and-hold SPY with realistic costs? If not, you don't have a strategy.
  • [ ] Can you explain the signal source in one sentence that doesn't include the word "AI"? If the answer is "AI finds patterns," you don't actually understand what you've built.