Polymarket Bot Backtesting: Data, Pitfalls, Honesty

Most Polymarket bot backtests are wrong before the first trade gets simulated. The data sources to trust, the survivorship and look-ahead traps to dodge, and how to tell a real edge from a curve fit.

Last reviewed · Maria Ostrowski, Poly Syncer

Most Polymarket bot backtests are flattering, and the flattery is structural. The historical data you can easily download is biased by the markets that resolved, the timestamps that survived, and the fills that look possible only because you already know the future. A backtest that ignores those biases produces an equity curve that looks like an edge and is, in practice, a curve fit. This guide walks through the data sources to use, the four classes of bias that destroy most retail backtests, a walk-forward template that survives contact with reality, and a short checklist for telling a real edge from arithmetic that happens to point upward.

Why most Polymarket backtests are wrong before they start

I have rebuilt the same retail backtest five different ways for five different readers, and every single one of them came in with a Sharpe ratio above three and went out with a Sharpe ratio between zero and one. The pattern is consistent enough that I no longer treat it as accidental. Retail backtests on prediction markets are biased upward by default, and the bias is not a small correction. It is the difference between a strategy that looks like it triples capital in a year and a strategy that quietly loses money.

The reason is that prediction markets carry every classical backtesting hazard at once, and they add a few that equity markets do not. Survivorship bias is severe because data scrapers only see resolved markets. Look-ahead bias is severe because resolution outcomes are public knowledge by the time the data is downloaded. Fill bias is severe because prediction-market books are thin and a backtest that assumes mid-price fills assumes liquidity that did not exist. And the meta-bias of "I tried twelve parameter sets and reported the best one" is worse than usual because the cohort of researchers is small and informal, so almost nobody publishes the eleven failures.

I am writing this post from the position of someone who has watched competent traders ship a strategy live on the back of a backtest that, on closer inspection, was telling them nothing. The point of the next eight sections is not to discourage backtesting. It is to give the reader the tools to do a backtest whose result they can actually act on.

Where to get clean historical data

The data layer is where most of the integrity of a backtest is won or lost. Polymarket exposes its order book and trade history through a public CLOB API, and there are several independent archivers that snapshot it. The relevant categories of data are different in quality and in the kinds of bias each one introduces.

The Polymarket CLOB API directly

The official API gives you current order-book state, recent trades, and resolved-market history. It is authoritative for what it returns and it has the property that you can verify any number against the on-chain settlement record on Polygon. The cost is that the historical depth for some endpoints is limited and the rate limits are tight enough that scraping a full year of book snapshots takes weeks of background work. For most backtests, the CLOB API is the source of truth, but it is not a one-shot download.

Third-party archivers

Several research groups and individual quants run continuous archivers that snapshot the book at fixed intervals and store the data in S3 or in a public Postgres. The quality varies. The good ones snapshot at sub-second cadence, log every trade with the original venue timestamp, and publish a manifest of any gaps in the record. The bad ones snapshot once a minute, lose the original timestamp and substitute an ingestion timestamp, and quietly drop any market that closed while the archiver was down. Ask any archiver three questions before you trust it: what is your snapshot cadence, do you keep the original venue timestamp, and what is your gap policy.
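The gap question can also be checked empirically rather than taken on trust. The following is a minimal sketch, assuming snapshot timestamps are available as plain numbers in seconds; the function name `find_gaps` and the `tolerance` multiplier are illustrative choices, not part of any archiver's interface.

```python
def find_gaps(timestamps, expected_cadence_s, tolerance=2.0):
    """Return (previous_ts, next_ts) pairs where consecutive snapshots
    are further apart than tolerance * expected cadence.

    timestamps: snapshot times in seconds (assumed venue timestamps).
    expected_cadence_s: the cadence the archiver claims, in seconds.
    tolerance: hypothetical slack factor before a spacing counts as a gap.
    """
    ordered = sorted(timestamps)
    limit = expected_cadence_s * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Toy example: a one-second cadence with one outage between t=2 and t=10.
gaps = find_gaps([0, 1, 2, 10, 11], expected_cadence_s=1.0)
```

If the list this returns is longer than the gap manifest the archiver publishes, the manifest is incomplete and the archive deserves less trust.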

On-chain trade history

Every fill on Polymarket settles on Polygon and the settlement events are immutable. You can reconstruct trade history from the chain itself using a node or a service like Alchemy. The on-chain record is gap-free and tamper-evident, but it is at the wrong level of abstraction for a backtest: you see the cleared trades, not the orders that almost filled, not the book that was on offer, not the cancels that pulled liquidity. On-chain data is essential for verifying P&L on a live strategy and insufficient for simulating one.

The right composition is to take book snapshots from a reliable archiver, validate the resulting trade sequence against the on-chain record, and use the CLOB API to spot-check specific markets where the data looks strange. Any single source on its own will mislead you.
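The validation step can be sketched as a reconciliation between the two trade lists. This is a sketch under stated assumptions: the `Trade` record, its `tx_hash` join key, and the one-fill-per-transaction simplification are hypothetical, and real archive or indexer records will need their own mapping into this shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trade:
    tx_hash: str   # hypothetical join key present in both sources
    size: float    # filled size in shares
    price: float   # fill price in dollars per share


def reconcile(archive, onchain, tol=1e-9):
    """Compare archived trades with on-chain fills, keyed by tx_hash.

    Assumes one fill per transaction hash, which is a simplification.
    """
    a = {t.tx_hash: t for t in archive}
    c = {t.tx_hash: t for t in onchain}
    return {
        # On chain but absent from the archive: the archive has a gap.
        "missing_from_archive": sorted(c.keys() - a.keys()),
        # In the archive but not on chain: the archive invented or mislabeled it.
        "not_on_chain": sorted(a.keys() - c.keys()),
        # Present in both but disagreeing on size or price.
        "mismatched": sorted(
            h for h in a.keys() & c.keys()
            if abs(a[h].size - c[h].size) > tol
            or abs(a[h].price - c[h].price) > tol
        ),
    }
```

Any non-empty bucket is a reason to spot-check the affected markets against the CLOB API before the data goes anywhere near a simulator.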

Survivorship bias on resolved markets

The most damaging bias in retail Polymarket backtesting is survivorship. The mechanism is mundane. When you query historical data, you get the markets that resolved. Markets that were created, attracted some liquidity, and then got delisted because of a category-defining ambiguity, a sponsor pullout, or low volume are silently absent. So are markets that resolved as N/A because the underlying event did not happen in the contracted window. The dataset that arrives in your hands is the surviving population, not the population that existed when a trader would have actually placed orders.

This matters because the deleted markets are not a random sample. They are disproportionately the ones with ambiguous resolution criteria, thin liquidity, and uncertain outcomes. A backtest run on the surviving set systematically overstates how clean the resolution process is and understates how often a trader would have ended up holding a position in a market that never paid out. On a copy-trade strategy that follows wallets into smaller markets, my own measurements put the survivorship overstatement at between 8 and 15 percent of gross return, depending on the time window and how aggressively the source wallet enters fresh markets.

The fix is unglamorous. You need a snapshot of the market universe as it existed at each historical date, not as it exists today. Some archivers preserve this; most do not. If your data does not include delisted markets, the right adjustment is to discount your backtest return by a survivorship factor calibrated against a sample you have audited by hand. A flat 10 percent discount is too crude for a serious decision and too easy to ignore for a casual one, and unfortunately the casual case is the more common one.
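As a rough illustration of the bookkeeping, the sketch below measures how much of the historical universe is missing from a dataset and applies a multiplicative discount to gross return. The function names are hypothetical, and treating the survivorship factor as a fraction of gross return is one reading of the adjustment, not the only defensible one.

```python
def universe_gap(dataset_ids, manifest_ids):
    """Fraction of markets in the historical manifest that the dataset lacks."""
    manifest = set(manifest_ids)
    if not manifest:
        raise ValueError("manifest is empty; nothing to compare against")
    missing = manifest - set(dataset_ids)
    return len(missing) / len(manifest)


def discounted_return(gross_return, survivorship_factor):
    """Scale gross return down by a hand-calibrated survivorship factor.

    survivorship_factor is the audited overstatement as a fraction of
    gross return, for example 0.10 for a 10 percent overstatement.
    """
    if not 0.0 <= survivorship_factor < 1.0:
        raise ValueError("survivorship_factor must be in [0, 1)")
    return gross_return * (1.0 - survivorship_factor)


# Toy universe: 20 markets existed, the downloaded dataset has 18 of them.
manifest = [f"market-{i}" for i in range(20)]
dataset = manifest[:18]
gap = universe_gap(dataset, manifest)
adjusted = discounted_return(0.20, survivorship_factor=0.10)
```

The gap measurement tells you whether a discount is needed at all; the size of the discount still has to come from the audited sample.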

Look-ahead bias and timestamp hygiene

Look-ahead bias is the hazard of using information in the simulation that the strategy would not have had at the moment of the decision. On prediction markets it sneaks in three ways, and all of them are easy to commit without noticing.

The first is the resolution outcome. The downloaded dataset knows which side of every market won. If any part of the backtest logic accidentally references the outcome before the simulated decision time, the result will be spectacular and meaningless. The defence is to load the resolution data into a separate table that the strategy code cannot read, and to assert at the top of the simulation that the outcome column is unreachable until the simulated clock has passed the market close.

The second is timestamp substitution. Many archivers replace the venue timestamp with the ingestion timestamp because the venue timestamp can be missing or malformed in older records. The ingestion timestamp is later than the venue timestamp, sometimes by several hundred milliseconds and occasionally by seconds. A backtest that uses ingestion timestamps as if they were venue timestamps will let the strategy react to events earlier than was physically possible. The amount of fake edge this produces is small on slow strategies and devastating on fast ones.

The third is feature leakage. Any feature engineered from rolling aggregates needs to be computed using only the data that would have been available at the decision time. A rolling mean that includes the bar of the current decision will use information from the future. A backtest that derives volatility from a full-day window will leak the afternoon into the morning. These are the silliest of mistakes and the hardest to catch because the code looks right.
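The rolling-mean case is easy to show side by side. The sketch below assumes prices sit in a pandas Series indexed by venue timestamp; the helper names `leaky_mean` and `trailing_mean` and the toy prices are illustrative only.

```python
import pandas as pd


def leaky_mean(prices: pd.Series, window: int) -> pd.Series:
    # Includes the bar at the decision time, so the feature sees its own bar.
    return prices.rolling(window).mean()


def trailing_mean(prices: pd.Series, window: int) -> pd.Series:
    # shift(1) drops the current bar: row t uses bars t-window .. t-1 only.
    return prices.shift(1).rolling(window).mean()


prices = pd.Series(
    [0.40, 0.42, 0.41, 0.45, 0.50],
    index=pd.date_range("2024-01-01 09:00", periods=5, freq="min"),
)
features = pd.DataFrame({
    "price": prices,
    "leaky": leaky_mean(prices, 3),
    "safe": trailing_mean(prices, 3),
})
```

The two columns differ by a single `shift(1)`, which is why this class of bug survives code review: both versions look like a three-bar rolling mean.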

The discipline I use is to write the strategy as a function that takes a single argument, the timestamp of the decision, and to have the data layer expose only data with a venue timestamp strictly less than that argument. Anything else gets a runtime error. It is annoying. It also catches every look-ahead bug I have written, and I have written several.
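A minimal sketch of that discipline, assuming records arrive as `(venue_ts, payload)` pairs with comparable timestamps. The class names `PointInTimeStore`, `SealedOutcomes`, and `LookAheadError` are hypothetical and stand in for whatever data layer the backtest actually uses.

```python
import bisect


class LookAheadError(RuntimeError):
    """Raised when the strategy asks for data it could not have had."""


class PointInTimeStore:
    def __init__(self, records):
        # records: iterable of (venue_ts, payload) pairs
        self._records = sorted(records, key=lambda r: r[0])
        self._ts = [r[0] for r in self._records]

    def visible(self, decision_ts):
        # bisect_left gives the first index with ts >= decision_ts, so the
        # slice holds only records with venue_ts strictly less than it.
        cut = bisect.bisect_left(self._ts, decision_ts)
        return [payload for _, payload in self._records[:cut]]


class SealedOutcomes:
    def __init__(self, outcomes, close_ts):
        self._outcomes = dict(outcomes)   # market_id -> resolved side
        self._close = dict(close_ts)      # market_id -> market close time

    def get(self, market_id, sim_clock):
        # The outcome is unreachable until the simulated clock passes close.
        if sim_clock <= self._close[market_id]:
            raise LookAheadError(f"outcome of {market_id} read before close")
        return self._outcomes[market_id]
```

The strategy function receives only the decision timestamp and these two objects, so the only way to see the future is to trip the error.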

Fill assumptions: the gap between paper and live

The third class of bias is the gap between the fills your simulator assumes and the fills your live bot would actually have received. Prediction-market books are thin. The top of book on a mid-tier market is often a few hundred dollars deep. A backtest that assumes you took your full order size at the mid-price is assuming liquidity that, in practice, would have moved before you got there.

The realistic alternatives are all worse for your reported edge. You can assume you take only the resting size at the best price and then walk up the book for the rest, which is closer to reality and produces noticeably worse fills than mid. You can model a probability of getting filled at the inside quote and a slippage distribution for the rest, which is even more realistic and pulls the backtest further down. You can run the simulator as a queue-position model, where every order joins the back of the book and only fills when the queue clears, which is the most realistic and the most painful to your equity curve.

The single biggest gap I see in retail backtests is the assumption of free fills. A bot that wants to take 200 dollars of a 50 dollar resting quote will, in production, get 50 dollars at the inside and 150 dollars at a worse price or it will not get filled at all. Modelling that gap honestly costs you between 20 and 60 basis points of edge per round trip on the strategies I have measured, which is enough to flip a marginal backtest from positive to negative.
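The walk-up case can be sketched in a few lines. This is a simplified model, assuming the ask side is given as `(price, resting_dollars)` levels and that nothing in the book moves while the order is being filled; the function name `walk_book` and the book values are illustrative.

```python
def walk_book(asks, order_dollars):
    """Fill a buy order against resting ask levels, best price first.

    asks: list of (price, resting_dollars) levels.
    Returns (spent_dollars, average_price, unfilled_dollars).
    average_price is None when nothing fills.
    """
    remaining = order_dollars
    spent = 0.0
    shares = 0.0
    for price, resting in sorted(asks):      # lowest ask first
        if remaining <= 0:
            break
        take = min(remaining, resting)       # cannot take more than is resting
        shares += take / price
        spent += take
        remaining -= take
    average_price = spent / shares if shares else None
    return spent, average_price, remaining


# Toy book: 50 dollars at the inside, thinner and worse levels behind it.
book = [(0.50, 50.0), (0.52, 100.0), (0.55, 30.0)]
spent, avg_price, unfilled = walk_book(book, order_dollars=200.0)
```

In this toy book the order takes 50 dollars at the inside, walks through two worse levels, and still leaves part of the intended size unfilled. A mid-price simulator would have reported the full size filled at a better price than any of these levels.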

Out-of-sample testing and walk-forward

The single most useful discipline in backtesting is to refuse to look at the holdout data while you are tuning. The principle is simple and the temptation to violate it is constant. The walk-forward variant of the discipline is what most serious quants actually do: train on a window, test on the next window, slide the windows forward, and report the concatenated out-of-sample equity. The picture that comes out the other end is almost always worse than the in-sample picture, and the size of the gap is the most honest single number you can produce about a strategy.

In-sample curve fit versus out-of-sample reality on the same strategy

[Figure: two equity-curve panels on matching axes, equity from 1.0 to 1.6 over time t. In-sample (training): a steeply rising curve ending well above the start; Sharpe 3.2, max DD 4%. Out-of-sample (holdout): a noisy curve that drifts sideways and finishes near where it started; Sharpe 0.4, max DD 11%.]
The same strategy, the same code, two disjoint windows. The left panel is the period the parameters were tuned on. The right panel is the period that was never seen during tuning. The gap is the cost of optimism.

The walk-forward loop below is the smallest honest backtest harness I will use. It does not contain a strategy. It contains the bookkeeping that keeps the strategy from cheating.

import pandas as pd

def walk_forward(data, train_days, test_days, fit_fn, eval_fn):
    # data is a time-indexed frame strictly ordered by venue timestamp
    # fit_fn(train) returns parameters; eval_fn(test, params) returns a
    # pandas Series or DataFrame of out-of-sample results for that window
    results = []
    start = data.index.min()
    end   = data.index.max()
    cursor = start + pd.Timedelta(days=train_days)
    while cursor + pd.Timedelta(days=test_days) <= end:
        # train window: [cursor - train_days, cursor), strictly before the test window
        train = data[(data.index >= cursor - pd.Timedelta(days=train_days))
                     & (data.index < cursor)]
        # test window: [cursor, cursor + test_days)
        test  = data[(data.index >= cursor)
                     & (data.index < cursor + pd.Timedelta(days=test_days))]
        params = fit_fn(train)              # tune on train only
        oos    = eval_fn(test, params)      # never refit on test
        results.append(oos)
        cursor += pd.Timedelta(days=test_days)
    return pd.concat(results)               # report the OOS concatenation

The two ingredients that make this honest are the strict inequality on the train window (no data from the test period leaks back) and the rule that the parameters fit on the train window are the ones used on the test window without revision. The temptation to peek at the test result and refit is the temptation walk-forward exists to defeat. If you find yourself patching the parameters between iterations, the harness has stopped working as a harness and become a slow gradient-descent optimiser on your holdout.

The pitfalls catalogue

This table is the one I send to readers who ask me to look over a backtest. It is the list of things I check first, in roughly the order I check them.

| Pitfall | What goes wrong | How to detect | How to correct |
| --- | --- | --- | --- |
| Survivorship bias | Only resolved markets are in the dataset, so the strategy is evaluated on a sample that excludes the markets it would have lost money in | Compare market count in your dataset against the venue manifest for the same period; a gap above 5 percent is a flag | Source a delisted-markets archive or apply a calibrated survivorship discount before reporting return |
| Look-ahead via outcome | The simulation accidentally reads the resolution column before the simulated close, producing perfect predictions | Run the strategy against a shuffled outcome column; if return falls toward zero, the strategy was honest. If it stays high, it was reading the future | Isolate outcome data in a sealed table; assert that the strategy code cannot access it until the simulated clock crosses close |
| Timestamp substitution | Ingestion timestamps replace venue timestamps, letting the bot act on data earlier than physically possible | Compare a sample of records to the on-chain settlement times; consistent forward drift means substitution is happening | Prefer archivers that keep venue timestamps; otherwise add a calibrated latency buffer to every decision |
| Mid-price fill assumption | The simulator fills full size at mid, ignoring book depth and queue position; live fills will be worse | Compare simulated fill prices against a sample of actual fills from a small live deployment of the same logic | Walk the simulated order up the book against resting depth; or model a queue-position fill probability per order |
| Parameter overfitting | Dozens of parameter sets are tried; the best in-sample result is reported as the strategy result | Check whether the in-sample Sharpe is sensitive to small parameter perturbations; if it is, you have curve-fit | Use walk-forward; report the out-of-sample concatenation, not the best in-sample run |
| Fee and gas omission | Polymarket has zero maker fees but taker costs and Polygon gas still apply on entry and exit; the backtest assumes zero | Compute the all-in cost of one round trip at your average size; subtract from per-trade return and re-evaluate | Model gas at the median Polygon price for the period plus a taker cost where appropriate |
| Cherry-picked window | The backtest period happens to include an unusually friendly regime, like a long election cycle, and the result will not generalise | Re-run the same backtest on a disjoint window of comparable length; if the result is much worse, the original was regime-specific | Report at least two independent windows; describe the regime each one covers |
| Single-seed luck | A strategy with stochastic entry rules is reported on a single random seed that happened to be lucky | Run the strategy across at least 50 seeds; report the median and the 10th and 90th percentile, not the best run | Treat the seed distribution as the result; a strategy whose 10th-percentile seed loses money is fragile |

The first three of these are dataset hygiene and they are the cheapest to fix if you catch them. The middle three are simulation hygiene and they take more code. The last two are research hygiene and they take the most discipline because they require throwing out flattering results that you have already grown attached to.
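The single-seed check in the last row is the easiest of these to automate. A minimal sketch, assuming the backtest can be wrapped in a function that takes a seed and returns one total return; `seed_distribution` and `toy_strategy` are hypothetical names, and the toy strategy is random noise standing in for a real backtest.

```python
import random
import statistics


def seed_distribution(run_strategy, seeds):
    """Summarise a stochastic strategy across seeds instead of one run.

    run_strategy: callable taking a seed and returning a total return.
    Returns the 10th percentile, median, and 90th percentile of returns.
    """
    returns = [run_strategy(seed) for seed in seeds]
    deciles = statistics.quantiles(returns, n=10)   # nine cut points
    return {
        "p10": deciles[0],
        "median": statistics.median(returns),
        "p90": deciles[8],
    }


def toy_strategy(seed):
    # Placeholder for a real backtest: 100 noisy per-trade returns.
    rng = random.Random(seed)
    return sum(rng.gauss(0.001, 0.02) for _ in range(100))


summary = seed_distribution(toy_strategy, seeds=range(50))
```

Reading the result from the bottom up is the useful habit: look at `p10` first, and only then at the median.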

How to know your edge is real

The honest answer to "is this backtest result real" is rarely a single number. It is a set of corroborations that all have to hold up. The set I use is short and concrete.

First, the out-of-sample equity curve has to be meaningfully positive. Not just positive. The in-sample to out-of-sample degradation should be no worse than a factor of two, and ideally less. If the in-sample Sharpe is 3 and the out-of-sample is 0.4, the in-sample number was the noise. If they are 1.4 and 0.9, the strategy is probably tracking a real pattern.
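The degradation check is simple arithmetic, sketched below. The annualisation over 365 periods and the zero risk-free rate are assumptions made for illustration, and the function names are hypothetical.

```python
import math
import statistics


def sharpe(returns, periods_per_year=365):
    """Annualised Sharpe ratio of per-period returns, risk-free rate of zero.

    periods_per_year=365 assumes daily returns on a venue that trades
    every day; change it to match the bar size actually used.
    """
    spread = statistics.stdev(returns)
    if spread == 0:
        raise ValueError("returns have zero variance; Sharpe is undefined")
    return statistics.mean(returns) / spread * math.sqrt(periods_per_year)


def degradation(in_sample_sharpe, oos_sharpe):
    """Ratio of in-sample to out-of-sample Sharpe; infinite if OOS is not positive."""
    if oos_sharpe <= 0:
        return math.inf
    return in_sample_sharpe / oos_sharpe


noise_case = degradation(3.0, 0.4)      # the first pair of numbers in the text
plausible_case = degradation(1.4, 0.9)  # the second pair
```

The first pair degrades by far more than a factor of two and the second stays under it, which matches the reading given above.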

Second, the strategy should survive a shuffled-outcome test. If you randomise the resolution column and rerun, the return should collapse toward zero. If it does not, the strategy is leaking outcome information somewhere and the live return will be very different from the simulation.
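A toy version of the test makes the mechanism concrete. Everything in this sketch is synthetic: the market records, the `signal` field, and the two decision functions are invented for illustration, with one strategy that reads the outcome by mistake and one that uses only a pre-decision signal.

```python
import random


def make_markets(n, seed=1):
    # Synthetic markets priced at 0.50 with a signal that agrees with the
    # eventual outcome about 70 percent of the time.
    rng = random.Random(seed)
    markets = []
    for _ in range(n):
        outcome = rng.choice(["YES", "NO"])
        agrees = rng.random() < 0.7
        direction = 1 if outcome == "YES" else -1
        signal = direction if agrees else -direction
        markets.append({"yes_price": 0.5, "signal": signal, "outcome": outcome})
    return markets


def shuffled(markets, seed=0):
    # Permute the outcome column across markets, leaving features untouched.
    rng = random.Random(seed)
    outcomes = [m["outcome"] for m in markets]
    rng.shuffle(outcomes)
    return [{**m, "outcome": o} for m, o in zip(markets, outcomes)]


def mean_pnl(markets, decide):
    # Buy one share of the chosen side at the quote; payout is 1 if it wins.
    total = 0.0
    for m in markets:
        side = decide(m)
        price = m["yes_price"] if side == "YES" else 1.0 - m["yes_price"]
        total += (1.0 if side == m["outcome"] else 0.0) - price
    return total / len(markets)


def honest(m):
    return "YES" if m["signal"] > 0 else "NO"


def leaky(m):
    return m["outcome"]          # the bug: reads the resolution column


markets = make_markets(2000)
control = shuffled(markets)
honest_real, honest_shuffled = mean_pnl(markets, honest), mean_pnl(control, honest)
leaky_shuffled = mean_pnl(control, leaky)
```

The honest strategy loses its return once the outcomes are permuted, because its signal no longer lines up with anything. The leaky one keeps winning on the shuffled column, which is the signature of a strategy that is reading the answer rather than predicting it.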

Third, the strategy should survive realistic fills. If switching from mid-price fills to walk-up fills drops your return by more than your reported edge, you do not have an edge; you have a measurement of how much the simulator overstated liquidity. The relationship between traders who provide liquidity as makers and traders who pay to take it is covered indirectly in the ROI math post, and the underlying microstructure assumption is what either makes or breaks a backtest.

Fourth, the strategy should run live, at small size, for at least a month, and the live P&L should sit within a credible band around the backtest. Not on top of it; backtests are never exactly right. But within the same order of magnitude with the same sign. If the backtest said plus 8 percent a month and the live strategy is at minus 1 percent, the backtest was wrong, and the most common cause is one of the biases above.

Fifth, the strategy should be explainable. If you cannot point to a structural fact about the market that creates the edge - a recurring overreaction, a fee structure, a latency arbitrage, a wallet-quality signal that the broader market underweights - you should treat the result as a curve fit even if the numbers look clean. Edges that nobody can explain tend to be edges that nobody can repeat. The wallet-quality angle in particular is the one retail traders most often think they are capturing but actually measure poorly; the methodology piece on wallet scoring goes through the right and wrong ways to evaluate a source-wallet edge, and the broader strategy taxonomy lives in the strategies post.

The classical reference for the discipline as a whole, beyond prediction markets, is the literature on backtesting bias and walk-forward analysis. Investopedia's overview of backtesting is a reasonable starting point if the framework is new, and the equity-markets quantitative finance literature is full of cautionary case studies that translate cleanly into prediction-market backtesting.

A backtest is a hypothesis about how a strategy would have behaved. The most useful backtest is the one whose result you trust enough to fund, and the most dangerous backtest is the one whose result you trust because it pleases you.

Frequently asked questions

What is the single most damaging bias in Polymarket bot backtesting?

Survivorship bias, usually. The dataset you can easily download contains only the markets that resolved cleanly; the markets that delisted, voided, or stayed too thin to ever fill are quietly absent. A backtest on the surviving set overstates returns by between 8 and 15 percent in the cases I have measured, which is enough to invert the sign of a marginal strategy.

How long should the out-of-sample window be?

Long enough to contain at least one regime change. On Polymarket that usually means a multi-month window that spans both an event-driven period and a quiet period. A single-month holdout on a strategy designed for election season will tell you nothing about how the strategy behaves outside of election season.

Can I trust a backtest with a great Sharpe ratio and no walk-forward?

No. A Sharpe above two on in-sample data alone is almost always the product of survivorship, look-ahead, or overfitting. Until you have seen the strategy survive a walk-forward concatenation, you have a hypothesis, not a result.

Do I need to model Polygon gas in a Polymarket backtest?

Yes, if your strategy turns over often. Gas is small per trade but accumulates fast on a strategy that enters and exits many times a day. For a strategy with one round trip per market per day, gas is a rounding error. For a strategy with twenty, gas can eat the entire edge.

How big a difference do realistic fill assumptions make?

Between 20 and 60 basis points per round trip on the strategies I have measured, which is enough to flip a marginal backtest from positive to negative. The single biggest gap I see in retail backtests is the assumption that the bot fills the full intended size at the mid price; in practice the bot would fill a small amount at the inside and walk the rest up the book or get nothing.

About the author

Maria Ostrowski is a quantitative analyst on the Poly Syncer data team. She works on on-chain wallet performance modelling, cost-of-execution analytics, and the survival statistics of retail trading cohorts. Her background is in equity execution research; she joined Poly Syncer to apply the same measurement discipline to prediction-market venues. She writes about ROI math, drawdown shape, backtest validation, and the gap between published bot performance and population-level reality.