Polymarket bot errors come in two shapes, and only one of them shows up in your alerts. The loud class is the one operators worry about first: signed-message rejection, nonce drift, RPC timeout, websocket disconnect. The silent class is what actually drains accounts: a decimal precision bug that loses a cent per share for six weeks, a partial-fill state machine that quietly accumulates one-sided inventory, a stale-price action that never quite triggers a stop. This field guide walks through both, along with the structured logging pattern that catches them, the error catalogue most teams end up rediscovering by hand, and a debugging playbook that turns a 3 a.m. page into a 15-minute diagnosis. The audience is engineers who already have a bot in production and want to spend less of every week on incident response.
The two shapes of bot error: loud versus silent
Every Polymarket bot eventually develops two distinct failure populations. Loud errors raise exceptions, return HTTP 4xx or 5xx, and stop a code path mid-flight. They are easy to catch because the runtime tells you something is wrong. Silent errors return success at every layer of the stack and only show up in a P&L reconciliation two weeks later. They are difficult to catch because nothing in the system disagrees with itself.
The loud class is dominated by four families: signing and nonce errors at the wallet layer, RPC and websocket errors at the network layer, validation errors at the CLOB layer, and rate-limit errors at the gateway. These are uncomfortable in production but tractable: each one has a stable log signature and a known fix. A well-instrumented bot will catalogue them within a month of running and most operators will stop seeing new variants after that.
The silent class is harder. The defining feature is that the bot believes it is working. Orders submit, fills come back, balances update, the dashboard is green. What the operator does not see is that the fill price was rounded to the wrong tick, that the taker fee was counted once instead of twice, that a partial fill left the bot one-sided for forty minutes, or that a stale websocket cache caused the bot to act on a price that no longer existed on the venue. Each silent error individually is small. Together they are the difference between a bot that returns capital and a bot that bleeds it.
The framing matters because the debugging tools for the two classes are different. Loud errors are caught by exception tracking and log aggregation. Silent errors are caught by invariant checks, daily reconciliation against the venue, and a habit of looking at fill-by-fill economics rather than aggregate P&L. A team that only invests in the first class will keep their bot up but slowly lose money inside it. The rest of this guide treats both classes as first-class concerns. For the broader architecture context this debugging work fits inside, see the companion Polymarket bot architecture write-up.
What good logs look like
Before any error catalogue, the bot needs structured logs that make errors searchable. The single most useful pattern is a correlation identifier that propagates from the websocket event through detection, signing, submission, and confirmation, alongside stable identifiers for the market and for any order the event produces. A log line without these three fields is almost useless during an incident; a log line with them turns a multi-hour grep into a one-query lookup.
The snippet below shows the minimum useful shape in Python with the standard logging module and a JSON formatter. Equivalent patterns exist for Node with pino or Go with zerolog; the field names are what matter, not the library.
```python
# observability.py: structured logging for a Polymarket bot
import contextvars
import json
import logging
import time
import uuid

# A ContextVar keeps the correlation id isolated per task or thread.
correlation_id = contextvars.ContextVar("correlation_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "ts": time.time(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
            "market_id": getattr(record, "market_id", None),
            "order_id": getattr(record, "order_id", None),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


log = logging.getLogger("bot")
h = logging.StreamHandler()
h.setFormatter(JsonFormatter())
log.addHandler(h)
log.setLevel(logging.INFO)


def on_book_update(event):
    # one correlation id per inbound event, propagates everywhere
    correlation_id.set(uuid.uuid4().hex[:12])
    log.info("book_update_received", extra={"market_id": event.market_id})
    # detect() and submit() are the bot's own strategy and submission functions
    signal = detect(event)
    if signal:
        order_id = submit(signal)
        log.info("order_submitted",
                 extra={"market_id": event.market_id, "order_id": order_id})
```
Three properties make this pattern earn its keep. First, every record from a single inbound websocket frame shares a correlation id, so an operator can reconstruct the full causal chain from one identifier. Second, the market and order identifiers are first-class fields rather than embedded in the message string, so log aggregators can index and group on them. Third, exceptions are serialised into the same JSON record rather than printed as multi-line tracebacks that break log shippers.
For shipping these records to a backend, the open-source choices are sensible defaults: an OTLP exporter sending to a self-hosted collector, then onward to Loki, Tempo, or whichever store the team already runs. Anyone unfamiliar with the conventions should read the OpenTelemetry logs documentation before designing the schema; the field names there have become a de facto standard and re-using them saves work later. Operators on managed stacks can substitute Datadog, New Relic, or Honeycomb without changing the bot code.
The error catalogue
Most teams rediscover the same error catalogue by hand over the first six months of running a bot. The table below collapses that work. It covers the eight error classes that account for the overwhelming majority of incidents on a Polymarket bot, with the typical cause, the observable symptom, the log signature to grep for, and the fix that usually works.
| Error class | Typical cause | Observable symptom | Log signature | Fix |
|---|---|---|---|---|
| Signed-message rejection | EIP-712 domain or type mismatch with what the CLOB expects | Submit returns 400 with “invalid signature” | code=INVALID_SIGNATURE | Re-derive the typed-data hash from the live CLOB schema; pin a version |
| Nonce drift | Bot and node disagree on next nonce after a failed broadcast | Transaction stuck pending; subsequent sends fail | nonce too low or nonce too high | Re-sync nonce from chain on every batch; never cache aggressively |
| RPC timeout | Primary RPC overloaded or geo-routed badly | Random multi-second hangs; sporadic 504s | timeout=true endpoint=primary | Two independent RPC providers with sub-2-second failover |
| Partial fill mishandling | Coordinator treats partial as terminal and never closes the residual | One-sided inventory drift across days | fill_qty < order_qty status=DONE | State machine that explicitly models PARTIAL and reposts residual |
| Stale price action | Websocket reconnected silently; cache is older than the bot believes | Orders placed at prices that no longer exist | cache_age_ms>1500 acted=true | Timestamp every cache entry; reject cache older than 1500 ms |
| Websocket disconnect | Idle ping not sent; load balancer kills idle connection | Bot silently stops receiving updates; no errors raised | ws_idle_ms>30000 | Application-level ping every 15 s; alarm if no inbound for 60 s |
| Decimal precision bug | Float arithmetic on prices that should be integer-tick | Cumulative rounding loss of ~0.5 to 2 cents per share | price_round_diff!=0 | Use integer tick units end-to-end; convert only at display |
| Fee miscount | Taker fee applied once on a two-leg trade instead of per leg | Reported edge consistently higher than realised edge | edge_reported - edge_realised > 0 | Compute fees per leg; reconcile daily against venue statement |
The catalogue is opinionated about ordering and runs roughly from loud to silent. Signing, nonce, and RPC errors show up in alerts within minutes, and so does a websocket disconnect once the inbound-silence alarm is in place; partial-fill, stale-price, precision, and fee errors are silent and only show up in reconciliation or P&L drift. The log signatures are the literal strings to put in a saved query in whichever backend the team uses. The fixes are the versions that have held up in production; cheaper fixes exist for several of these, and they break under load.
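The websocket disconnect row's fix (an application-level ping every 15 s and an alarm after 60 s without inbound traffic) could be tracked with a small watchdog along these lines. This is a minimal sketch: the class and method names are hypothetical, the clock is injected so the logic can be tested without sleeping, and sending the ping and raising the alarm are left to the calling code.

```python
import time


class InboundWatchdog:
    """Tracks when a ping is due and when inbound silence should alarm."""

    def __init__(self, ping_every_s=15.0, alarm_after_s=60.0, clock=time.monotonic):
        self._clock = clock
        self.ping_every_s = ping_every_s
        self.alarm_after_s = alarm_after_s
        now = clock()
        self._last_ping = now
        self._last_inbound = now

    def on_inbound(self):
        """Call for every inbound message from the websocket."""
        self._last_inbound = self._clock()

    def ping_due(self):
        """True when the caller should send a ping, then call ping_sent()."""
        return self._clock() - self._last_ping >= self.ping_every_s

    def ping_sent(self):
        self._last_ping = self._clock()

    def silent_too_long(self):
        """True when no inbound message has arrived for alarm_after_s."""
        return self._clock() - self._last_inbound >= self.alarm_after_s


# Demonstration with a fake clock (seconds).
now = [0.0]
dog = InboundWatchdog(clock=lambda: now[0])
now[0] = 16.0
print(dog.ping_due())         # True: 16 s since the last ping
dog.ping_sent()
now[0] = 61.0
print(dog.silent_too_long())  # True: 61 s with no inbound message
```

Whether pong replies should count as inbound traffic is a policy choice; counting only data messages makes the alarm stricter.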
Severity versus frequency: where to spend debugging time
Not every error class deserves equal engineering investment. The figure below plots the eight classes from the catalogue on two axes: how often a class occurs in a typical week of running, and how much capital a single instance can cost if left unhandled. The points cluster into three regions and each region has a different correct response.
Figure: Polymarket bot error classes, frequency versus capital at risk.
The practical advice from this view: a team that is overwhelmed by RPC timeout alerts should not be adding more retry logic. They should be silencing the alert (replacing it with a daily rollup), failing over to a second provider, and reallocating the engineering hour to the upper-right quadrant. The pager fatigue that comes from over-alerting on the noisy-but-cheap class is itself a contributor to silent classes going unnoticed.
Nonce, signing, and RPC errors
The wallet and network layer produce the loudest errors and have the most stable fixes. They are worth getting right early because every other class assumes this layer behaves.
Signed-message rejection. Polymarket's CLOB uses EIP-712 typed data for orders. The bot computes a hash, signs it, and submits the signature alongside the order; the gateway re-derives the hash from the order fields and verifies the signature against it. A mismatch in any single field (the domain separator, a type name, the field order, an integer width) produces a 400 with a generic invalid-signature message. The diagnostic is always the same: print both hashes, bot side and gateway side (the gateway exposes its expected hash in the error body on most endpoints), then compare the typed-data payloads field by field to find the one that differs. The fix is always to re-derive the typed-data schema from whatever the gateway is currently advertising and pin that version in the bot.
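Finding the field that differs is easier with a structural diff of the two typed-data payloads than with the hashes alone. The sketch below assumes both payloads are available as plain dicts; the domain name and the `Order` fields shown are illustrative placeholders, not Polymarket's actual schema.

```python
# typed_data_diff.py: locate the field that differs between two EIP-712 payloads.
# The payload contents are illustrative placeholders, not a real venue schema.

def flatten(node, prefix=""):
    """Flatten nested dicts and lists into {"path.to.field": value} pairs."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {prefix: node}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def diff_typed_data(bot_side, expected):
    """Return {path: (bot_value, expected_value)} for every differing field."""
    a, b = flatten(bot_side), flatten(expected)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(a.keys() | b.keys())
        if a.get(path) != b.get(path)
    }


bot_side = {
    "domain": {"name": "Example Exchange", "version": "1", "chainId": 137},
    "types": {"Order": [{"name": "salt", "type": "uint256"},
                        {"name": "side", "type": "uint256"}]},
}
expected = {
    "domain": {"name": "Example Exchange", "version": "1", "chainId": 137},
    "types": {"Order": [{"name": "salt", "type": "uint256"},
                        {"name": "side", "type": "uint8"}]},
}

for path, (mine, theirs) in diff_typed_data(bot_side, expected).items():
    print(f"{path}: bot={mine!r} expected={theirs!r}")
# prints: types.Order.1.type: bot='uint256' expected='uint8'
```

Because lists are indexed by position, a reordered field shows up as a difference at each affected index, which is the behaviour wanted here: field order is part of the EIP-712 type hash.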
Nonce drift. When a transaction is broadcast and either fails locally or gets dropped, the bot's cached next-nonce can diverge from the node's view. Subsequent transactions fail with either “nonce too low” (bot is behind) or “nonce too high” (bot is ahead). The defensive pattern is to never cache the nonce across more than a single batch. Before each batch, query the node for the current pending nonce; for the batch itself, allocate sequentially from that value. The cost is one extra RPC call per batch; the benefit is that nonce drift becomes self-correcting within a single round.
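The per-batch re-sync described above can be sketched in a few lines. The node query is passed in as a callable so the example stays self-contained; with web3.py that callable would typically wrap `w3.eth.get_transaction_count(address, "pending")`, which is an assumption about the client library in use. `allocate_nonces` and `FakeNode` are hypothetical names.

```python
from typing import Callable, List


def allocate_nonces(fetch_pending_nonce: Callable[[], int], batch_size: int) -> List[int]:
    """Re-sync from the node, then allocate sequentially for this batch only."""
    start = fetch_pending_nonce()  # the one extra RPC call per batch
    return [start + i for i in range(batch_size)]


class FakeNode:
    """Stand-in for a real node, used only to demonstrate the drift scenario."""

    def __init__(self, pending: int):
        self.pending = pending

    def pending_nonce(self) -> int:
        return self.pending


node = FakeNode(41)
first = allocate_nonces(node.pending_nonce, 3)
print(first)   # [41, 42, 43]

# Suppose only the first transaction landed and the other two were dropped.
node.pending = 42
second = allocate_nonces(node.pending_nonce, 2)
print(second)  # [42, 43]: the next batch starts from the node's view, not a stale cache
```

Nothing is carried between calls, so a dropped broadcast costs at most one batch before the allocator is back in step with the node.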
RPC timeout. Polygon RPC providers are individually reliable around 99.5 percent of the time, which means the bot will see multi-second hangs several times per day from any single provider. The fix is not to add more retry logic against the same endpoint; it is to run two providers and fail over within 2 seconds. Most bots that suffer from RPC instability are running a single provider and treating timeouts as a transient condition to retry past. Two independent providers, with the failover logic isolated in a small wrapper module, is the cheapest and most durable fix. The wider context for these network-layer failures, including the role of private relays under load, is in the arbitrage bot build guide, where latency tail behaviour is the binding constraint.
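A failover wrapper of the kind described can be very small. In this sketch a provider is any callable that takes a method name, a parameter list, and a timeout, and raises `TimeoutError` or `ConnectionError` on failure; that contract, and the names `FailoverRpc`, `primary`, and `secondary`, are assumptions for illustration rather than any particular client library's API.

```python
from typing import Any, Callable, Sequence

# A provider is any callable (method, params, timeout_s) -> result that raises
# TimeoutError or ConnectionError on failure. This contract is an assumption.
Provider = Callable[[str, list, float], Any]


class FailoverRpc:
    """Try each provider in order, giving each at most timeout_s seconds."""

    def __init__(self, providers: Sequence[Provider], timeout_s: float = 2.0):
        if len(providers) < 2:
            raise ValueError("need at least two independent providers")
        self.providers = list(providers)
        self.timeout_s = timeout_s

    def call(self, method: str, params: list) -> Any:
        last_error = None
        for provider in self.providers:
            try:
                return provider(method, params, self.timeout_s)
            except (TimeoutError, ConnectionError) as exc:
                last_error = exc  # a real bot would log the failing endpoint here
        raise RuntimeError("all RPC providers failed") from last_error


# Demonstration with fake providers.
calls = []


def primary(method, params, timeout_s):
    calls.append("primary")
    raise TimeoutError("primary exceeded timeout_s")


def secondary(method, params, timeout_s):
    calls.append("secondary")
    return "0x10"


rpc = FailoverRpc([primary, secondary])
print(rpc.call("eth_blockNumber", []))  # 0x10, served by the secondary
```

The wrapper deliberately does not retry the same endpoint; each provider is responsible for honouring the timeout it is handed.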
Order-book and matching errors
Above the network layer sits a class of errors that come from the bot disagreeing with the order book or the matching engine about reality. These are often loud (the gateway returns a validation error) but the underlying cause is usually a stale or incorrect bot-side belief.
Tick-size violations. Each market on Polymarket has a minimum price tick. Submitting an order at a price that does not align with the tick returns a validation error. The bot should snap prices to the tick before signing, never after; rounding to a non-tick price and only catching the error at the gateway wastes a round-trip and obscures the cause.
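Snapping before signing can be done with the standard `decimal` module. The rounding direction below (buys round down, sells round up, so the bot never pays more or accepts less than the strategy intended) is an assumption about policy, and `snap_to_tick` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR


def snap_to_tick(price: str, tick: str, side: str) -> Decimal:
    """Snap a price to the market tick; buys round down, sells round up."""
    p, t = Decimal(price), Decimal(tick)
    rounding = ROUND_FLOOR if side == "BUY" else ROUND_CEILING
    ticks = (p / t).to_integral_value(rounding=rounding)
    return ticks * t


print(snap_to_tick("0.537", "0.01", "BUY"))    # 0.53
print(snap_to_tick("0.537", "0.01", "SELL"))   # 0.54
print(snap_to_tick("0.537", "0.001", "BUY"))   # 0.537
```

Prices are passed as strings so that no float ever enters the calculation.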
Self-cross prevention. If the bot has a resting order on one side of a market and the strategy logic produces a marketable order on the other side, the gateway will reject it as a self-cross. The cheap fix is to maintain a small in-memory map of the bot's own resting orders by market and refuse to submit a marketable order that would cross them. The more durable fix is to cancel the resting order before submitting the crossing one; the right choice depends on whether the resting order is part of an active maker strategy or a forgotten residual.
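The cheap fix, an in-memory map of the bot's own resting orders, might look like the sketch below. Prices are held in integer ticks, and the class and method names are hypothetical; the map is only as good as the bot's discipline in updating it on every acknowledgement, fill, and cancel.

```python
from collections import defaultdict


class RestingOrders:
    """In-memory view of the bot's own resting orders, keyed by market."""

    def __init__(self):
        self._bids = defaultdict(dict)  # market_id -> {order_id: price_ticks}
        self._asks = defaultdict(dict)

    def add(self, market_id, order_id, side, price_ticks):
        book = self._bids if side == "BUY" else self._asks
        book[market_id][order_id] = price_ticks

    def remove(self, market_id, order_id):
        self._bids[market_id].pop(order_id, None)
        self._asks[market_id].pop(order_id, None)

    def would_self_cross(self, market_id, side, price_ticks):
        """True if an order at this price would match one of our own orders."""
        if side == "BUY":
            return any(price_ticks >= ask for ask in self._asks[market_id].values())
        return any(price_ticks <= bid for bid in self._bids[market_id].values())


orders = RestingOrders()
orders.add("market-1", "order-a", "SELL", 5500)
print(orders.would_self_cross("market-1", "BUY", 5600))  # True: crosses our own ask
print(orders.would_self_cross("market-1", "BUY", 5400))  # False: rests below it
```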
Partial fill mishandling. The most dangerous order-book error class. A submission for 1000 shares fills 600 and the order moves to a terminal state. A bot that treats DONE as “the trade completed” without checking fill_qty against order_qty silently accumulates a 400-share unfilled gap on one side of a two-leg trade. Across a day this can stack to thousands of shares of one-sided inventory. The state machine has to model PARTIAL as a distinct, non-terminal state; the residual either reposts at the next tick or is unwound at market, with the policy logged at the time of decision.
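The key move is to derive the bot's state from quantities rather than from the venue's status string. A minimal sketch, assuming the venue reports a terminal status of `DONE` as in the catalogue's log signature; the enum values and the treatment of a zero-fill `DONE` as cancelled are illustrative choices.

```python
from enum import Enum


class OrderState(Enum):
    OPEN = "OPEN"
    PARTIAL = "PARTIAL"      # some quantity filled, residual still owed
    FILLED = "FILLED"
    CANCELLED = "CANCELLED"


def classify(order_qty: int, fill_qty: int, venue_status: str) -> OrderState:
    """Derive state from quantities first; the venue status is secondary."""
    if fill_qty >= order_qty:
        return OrderState.FILLED
    if fill_qty > 0:
        return OrderState.PARTIAL  # even when the venue reports DONE
    return OrderState.CANCELLED if venue_status == "DONE" else OrderState.OPEN


def residual(order_qty: int, fill_qty: int) -> int:
    """Quantity that still has to be reposted or unwound."""
    return max(order_qty - fill_qty, 0)


state = classify(1000, 600, "DONE")
print(state.value, residual(1000, 600))  # PARTIAL 400
```

With this shape, the coordinator cannot mark a two-leg trade complete while `residual` is non-zero on either leg.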
Silent P&L drains
The silent class is what separates a bot that looks profitable from a bot that is profitable. There are three drains that account for most of the gap.
Decimal precision. Polymarket prices are expressed as decimals between 0 and 1 with a tick of 0.01 or finer depending on the market. A bot that does intermediate arithmetic in IEEE 754 floats will accumulate rounding error: 0.1 + 0.2 does not equal 0.3 in float, while 0.34 + 0.66 happens to round to exactly 1.0, so a sum-to-one invariant check that uses raw equality will sometimes fire and sometimes not. The fix is to do all internal arithmetic in integer ticks (convert to units of 1/100 or 1/10000 at ingress, rounding to the nearest integer rather than truncating, and divide only at display) and to use tolerance bands rather than equality whenever floats are unavoidable. The cost of getting this wrong is on the order of half a cent per share, which against a 1-cent arbitrage edge is half the strategy.
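A short sketch of integer-tick arithmetic, assuming prices arrive as strings and a resolution of 1/10000; `TICKS_PER_UNIT`, `to_ticks`, and `sums_to_one` are hypothetical names.

```python
import math
from decimal import Decimal

TICKS_PER_UNIT = 10_000  # assumed resolution: one tick = 0.0001


def to_ticks(price: str) -> int:
    """Parse a decimal price string into integer ticks, refusing lossy input."""
    ticks = Decimal(price) * TICKS_PER_UNIT
    if ticks != ticks.to_integral_value():
        raise ValueError(f"{price} is finer than 1/{TICKS_PER_UNIT}")
    return int(ticks)


def sums_to_one(prices_ticks) -> bool:
    """Exact sum-to-one invariant; safe because everything is an integer."""
    return sum(prices_ticks) == TICKS_PER_UNIT


print(0.1 + 0.2 == 0.3)                                      # False
print(to_ticks("0.1") + to_ticks("0.2") == to_ticks("0.3"))  # True
print(sums_to_one([to_ticks("0.34"), to_ticks("0.66")]))     # True
# Where floats are unavoidable, compare with a tolerance band instead:
print(math.isclose(0.1 + 0.2, 0.3, abs_tol=1e-9))            # True
```

Parsing from the string form avoids ever constructing a float; converting an existing float by multiplying and truncating reintroduces the error the integer representation is meant to remove.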
Fee miscounting. The Polymarket CLOB takes a taker fee on each leg of a trade. Bots that compute pre-trade edge as “gross spread minus one fee” instead of “gross spread minus two fees” will report edges that are larger than the realised edge. The bug is easy to miss because it is consistent: the bot is always wrong by the same amount, so the dashboard looks steady. The catch is a daily reconciliation that compares the bot's claimed P&L against the venue's statement; any persistent gap of the same sign is a structural calculation error, not noise. The security implications of this kind of accounting drift, especially when it interacts with wallet permissions, are covered in the Polymarket bot security write-up.
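Both halves of this, the per-leg fee calculation and the same-sign reconciliation check, fit in a few lines. The numbers below are hypothetical (a 2-cent gross spread and a half-cent fee per leg, in ticks of 1/10000), and the five-day window is an illustrative threshold rather than a recommendation from the venue.

```python
def net_edge_ticks(gross_spread_ticks: int, leg_fees_ticks) -> int:
    """Net edge after subtracting the fee on every leg of the trade."""
    return gross_spread_ticks - sum(leg_fees_ticks)


def persistent_gap(reported, realised, min_days: int = 5) -> bool:
    """True if reported minus realised P&L kept the same sign for min_days."""
    gaps = [r - v for r, v in zip(reported, realised)]
    recent = gaps[-min_days:]
    if len(recent) < min_days:
        return False
    return all(g > 0 for g in recent) or all(g < 0 for g in recent)


gross = 200      # hypothetical 2-cent gross spread
fee = 50         # hypothetical half-cent taker fee per leg

correct = net_edge_ticks(gross, [fee, fee])
buggy = net_edge_ticks(gross, [fee])
print(correct, buggy, buggy - correct)  # 100 150 50

# Five days where the bot's claimed P&L exceeds the venue statement each day.
print(persistent_gap([150] * 5, [100] * 5))  # True: structural, not noise
```

In this example the one-fee version overstates the edge by exactly one leg's fee on every trade, which is why the daily gap has a constant sign.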
Stale price action. The websocket reconnects silently after a network blip. The bot's cached top-of-book is now several seconds old. The detector signals on a stale price that the venue no longer offers. The submitted order either fills at a worse price (silent slippage) or is rejected with a no-such-price error (loud, but the operator now has to figure out why the cache was wrong). The fix is to timestamp every cache entry at the moment it is written and to reject any cache entry older than 1500 milliseconds at the detection step. Anything that wants to act on a stale entry should explicitly refresh from REST first; the latency cost is one round-trip and the alternative is unbounded silent loss.
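A timestamped cache that enforces the 1500 ms limit at read time can be sketched as follows. The class and exception names are hypothetical; the clock is injected for testing and defaults to `time.monotonic`, which is not affected by wall-clock adjustments. The timestamp is the local write time, so it measures how long the bot has held the entry, not how old the venue's data was on arrival.

```python
import time

MAX_AGE_MS = 1500


class StaleEntry(Exception):
    """Raised when a cache entry is too old to act on."""


class BookCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # market_id -> (written_at_seconds, top_of_book)

    def put(self, market_id, top_of_book):
        """Store an entry, timestamped at the moment it is written."""
        self._entries[market_id] = (self._clock(), top_of_book)

    def get_fresh(self, market_id, max_age_ms=MAX_AGE_MS):
        """Return the entry, or raise StaleEntry if it is older than max_age_ms."""
        written_at, top_of_book = self._entries[market_id]
        age_ms = (self._clock() - written_at) * 1000
        if age_ms > max_age_ms:
            raise StaleEntry(f"market_id={market_id} cache_age_ms={age_ms:.0f}")
        return top_of_book


# Demonstration with a fake clock (seconds).
now = [100.0]
cache = BookCache(clock=lambda: now[0])
cache.put("market-1", {"bid": 5400, "ask": 5500})
now[0] = 101.0
print(cache.get_fresh("market-1"))  # entry is 1000 ms old, still usable
now[0] = 102.0
try:
    cache.get_fresh("market-1")
except StaleEntry as exc:
    print("refresh from REST before acting:", exc)
```

Raising instead of returning `None` forces the caller to decide explicitly between refreshing and skipping the signal.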
A debugging playbook
When an incident fires, the goal is to move from alert to root cause in under fifteen minutes. A playbook that the operator can follow without thinking under pressure is worth more than any single diagnostic tool.
- Capture the correlation id from the alert. Every alert should carry one; if it does not, fix the alert template before doing anything else.
- Pull all log records with that correlation id from the last hour. This is the full causal chain for one inbound event and is usually under 50 lines.
- Read the chain forward, not backward. The first log line that disagrees with what the operator expects is almost always the proximate cause. Reading from the failure backward wastes time because the same exception can have ten upstream causes.
- Cross-check against the venue. Pull the order state directly from the Polymarket API for any order_id mentioned. If the venue disagrees with the bot, the bug is in the bot's local state, not the network.
- Check the silent-class invariants. Even when an incident looks loud, run the daily reconciliation script for the affected market. A surprising number of loud incidents are the visible edge of a silent drift that has been running for days.
- Log the diagnosis, not just the fix. After resolution, write the alert, the correlation id, the proximate cause, and the fix into a single shared document. The same incident class will recur and the future operator (often the same person two months later) needs to find the previous diagnosis in one search.
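When the log backend is unavailable or the records are already on disk, pulling the chain for one correlation id and reading it forward can be done with a few lines of Python. This sketch assumes one JSON object per line with the `ts`, `correlation_id`, and `message` fields from the formatter shown earlier; `causal_chain` is a hypothetical helper name.

```python
import json


def causal_chain(lines, correlation_id):
    """Return all records for one correlation id, oldest first."""
    records = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as stray tracebacks
        if isinstance(record, dict) and record.get("correlation_id") == correlation_id:
            records.append(record)
    return sorted(records, key=lambda r: r["ts"])


lines = [
    '{"ts": 3.0, "correlation_id": "abc", "message": "order_submitted"}',
    '{"ts": 1.0, "correlation_id": "abc", "message": "book_update_received"}',
    '{"ts": 2.0, "correlation_id": "zzz", "message": "book_update_received"}',
    "not json",
]

chain = causal_chain(lines, "abc")
for record in chain:
    print(record["ts"], record["message"])
# 1.0 book_update_received
# 3.0 order_submitted
```

Sorting by timestamp matters because log shippers do not guarantee arrival order, and the playbook depends on reading the chain forward.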
The playbook will not catch everything. Novel bugs will still take hours. What it does catch is the long tail of repeat incidents that consume most of the operational cost of running a bot. A team that runs this playbook honestly will, within three months, find that more than 80 percent of incidents resolve in the first three steps and that the remainder are worth deep investigation. The ratio is the measure of whether the logging and the catalogue are doing their job; if it stays inverted, either the structured-log fields or the alert thresholds are wrong. Either way, the diagnostic work is in the observability stack, not in the bot code itself.