Calibration and Skill of the Kalshi Prediction Markets

Russ Clay
Sep 18, 2022

Kalshi is the first federally regulated exchange for event contracts. Beginning in 2021, Kalshi introduced an event contract asset class that allows investors to make trades based on their beliefs about the likely outcome of an event. In this project, I show how the Brier Skill Score (BSS) can be used to quantify the ability of prediction markets to correctly forecast the outcome of a target event, and how this metric changes over the course of an open market. To be clear at the outset, the primary focus of this project is the ability of the markets to predict future events. If you came looking for financial insight, you will be thoroughly disappointed. I am a social psychologist by training, and I have spent the past several years building predictive models in my role as a senior data scientist. Prediction markets combine these two aspects of my professional career brilliantly, and I have been interested in them for quite some time.

I’ll start with a bit of background. In more traditional financial markets such as stocks and mutual funds, investors buy shares of a company or portfolio in the hope that its value will increase over time, so that their initial investment will be worth more whenever they elect to sell. In general, there is no set timeline for this future date; investors attempt to buy and sell assets “when the time is right”. Conversely, in prediction markets, investors buy positions on the outcome of a future event (i.e., will the event happen or not) in the hope that their initial investment will appreciate because they correctly predicted the event outcome. The price of an individual asset in a prediction market varies as a function of the market’s belief that the event will occur. For example, one Kalshi prediction market asked: Will the S&P 500 be between 3900 and 3949.99 at the end of September 06, 2022? Investors can buy “yes” or “no” positions on market outcomes for anywhere between $0.01 and $0.99. Markets pay out $1.00 to positions that correctly predicted their outcome. So, when the S&P 500 closed at 3908.19 on September 6, 2022, all “yes” positions paid out at $1.00. The profit on any individual “yes” position was the difference between $1.00 and the purchase price, and this varied across the lifetime of the market:

As we can see, investors buying a little over 3 days before the close of the market were paying around $0.40 for “yes” positions. The price dipped to around $0.30 in the run-up to September 6, and then rose throughout the final day of trading, closing at $0.99. This makes intuitive sense. Investors can watch the value of the S&P 500 Index, and as the close of the prediction market approaches, there is less time for the value to change considerably and more confidence in what the ultimate market outcome will be.
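To make the payout arithmetic above concrete, here is a minimal Python sketch. The function name `yes_profit` is my own, chosen for illustration, and the sketch assumes a single contract held to resolution with no exchange fees.

```python
def yes_profit(price: float, resolved_yes: bool) -> float:
    """Profit in dollars on one 'yes' contract bought at `price`.

    Assumptions (not from Kalshi documentation): one contract,
    held until the market resolves, and no trading fees.
    """
    if not 0.01 <= price <= 0.99:
        raise ValueError("price must be between $0.01 and $0.99")
    # A correct position pays out $1.00; an incorrect one pays nothing.
    payout = 1.00 if resolved_yes else 0.00
    return round(payout - price, 2)


# Prices quoted in the S&P 500 example, which resolved to 'yes'.
for price in (0.40, 0.30, 0.99):
    print(f"bought at ${price:.2f} -> profit ${yes_profit(price, True):.2f}")
```

The later a "yes" position was bought in this market, the smaller the remaining profit, because most of the uncertainty had already been priced in.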

Given this background, I wanted to understand how skilled the markets have been at predicting outcomes in general, at any time in the history of the market. Because the price of a market has to vary between $0.01 and $0.99, it is essentially equivalent to a probability output from a predictive model. Therefore, I was able to evaluate how closely this market price ‘probability’ mapped to the observed outcomes across the history of the markets.

One of my preferred ways of quantifying the ability of any model to make predictions about some binary outcome is to use the Brier Skill Score (BSS). The BSS combines two important aspects of predictive ability, calibration and confidence, into a single, easy-to-interpret value (see my primer on the Brier Skill Score). There are a few key things to know about the BSS: 1) higher scores are better, with 1 representing the highest possible score, 2) a BSS of 0 means that the model is not doing any better than just predicting the base event rate, and 3) a negative BSS means that the model is actually doing worse than you would by just predicting the base event rate (negative scores are not good).
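For readers who want to see the calculation, here is a minimal sketch of the BSS in Python. It assumes the standard definition, BSS = 1 - BS_model / BS_reference, where the Brier score is the mean squared difference between the forecast probability and the 0/1 outcome, and where the reference forecast always predicts the base event rate. The function names are illustrative and do not come from any particular library.

```python
from typing import Optional, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    base_rate: Optional[float] = None,
) -> float:
    """BSS relative to a reference forecast that always predicts the base rate.

    If `base_rate` is omitted, the observed 'yes' rate in `outcomes` is used
    (an assumption of this sketch).
    """
    if base_rate is None:
        base_rate = sum(outcomes) / len(outcomes)
    bs_model = brier_score(probs, outcomes)
    bs_reference = brier_score([base_rate] * len(outcomes), outcomes)
    if bs_reference == 0:
        raise ValueError("reference Brier score is zero; BSS is undefined")
    return 1.0 - bs_model / bs_reference


outcomes = [1, 0, 0, 0]
print(brier_skill_score([1.0, 0.0, 0.0, 0.0], outcomes))      # perfect forecasts
print(brier_skill_score([0.25, 0.25, 0.25, 0.25], outcomes))  # base-rate forecasts
print(brier_skill_score([0.0, 1.0, 1.0, 1.0], outcomes))      # confidently wrong
```

The three calls correspond to the three cases in the list above: a score of 1, a score of 0, and a negative score.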

I downloaded all of the finalized markets from Kalshi, which, at the time of this writing, numbered 8,476 in total. The base event rate (the proportion of all finalized markets that resolved to ‘yes’) was 28.7%. Throughout the remainder of the analysis, I present market prices as predictors of a ‘yes’ outcome in the market, but it is important to note that investors can purchase ‘no’ contracts as well, and the analysis for those would be complementary, since every ‘yes’ contract must also have a buyer for the reciprocal ‘no’ option (e.g., if somebody is buying a ‘yes’ contract for $0.60, there also needs to be somebody willing to buy a ‘no’ contract for $0.40 in order for the overall purchase to go through).

Kalshi lists markets with varying time horizons: daily markets for the high temperature in New York City, weekly markets for the closing price of the S&P 500, monthly markets for the president’s approval rating, as well as some others that have longer time horizons (e.g., A NASA manned mission will successfully land on the moon before December 31, 2024). As such, I looked at the predictive ability of the markets at three levels: in the six weeks leading up to market close, in the seven days leading up to market close, and in the final 24 hours before market close. Here are the trends of the BSS across those three views:

Brier Skill Score in the Six Weeks Prior to Market Close

The predictive skill of the market is fairly constant when we back out and look at it over a period of weeks. The BSS starts and ends near its overall low of 0.367 and peaks at 0.513 at 3 weeks before close. It is important to be aware that in this view, I’m taking all of the price measurements for a whole week and aggregating them into a single BSS value. I do this because the trading volume is relatively low this far out, and I needed enough measurements for the skill score to be a fairly stable measure. In other words, each point in this plot conveys the predictive skill of the market price averaged over an entire week.
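The aggregation described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact code behind the plot: the tuple layout `(hours_before_close, price, outcome)`, the function name `bss_by_bucket`, and the choice to score every bucket against one fixed base rate are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Observation = Tuple[float, float, int]  # (hours_before_close, price, outcome)


def bss_by_bucket(
    observations: Iterable[Observation],
    bucket_hours: float,
    base_rate: float,
) -> Dict[int, float]:
    """Pool price measurements into time buckets and compute one BSS per bucket.

    Bucket 0 is the last `bucket_hours` before close, bucket 1 the period
    before that, and so on.
    """
    grouped: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for hours, price, outcome in observations:
        grouped[int(hours // bucket_hours)].append((price, outcome))

    scores: Dict[int, float] = {}
    for bucket, pairs in sorted(grouped.items()):
        n = len(pairs)
        # Brier score of the market price within this bucket.
        bs_market = sum((p - o) ** 2 for p, o in pairs) / n
        # Brier score of always forecasting the base rate within this bucket.
        bs_reference = sum((base_rate - o) ** 2 for _, o in pairs) / n
        if bs_reference > 0:
            scores[bucket] = 1.0 - bs_market / bs_reference
    return scores


# Toy data: two measurements in the final day, two in the day before.
toy = [(1.0, 0.9, 1), (2.0, 0.1, 0), (30.0, 0.5, 1), (40.0, 0.5, 0)]
print(bss_by_bucket(toy, bucket_hours=24, base_rate=0.5))
```

With this layout, `bucket_hours=168` would give weekly buckets, `24` daily buckets, and `1` hourly buckets, matching the three views in this post.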

Brier Skill Score in the week before market close

In the days leading up to the close of the market, we see a slight upward slope in the BSS trend, but the skill of the market price is pretty constant over the course of the week. The BSS starts at a low of 0.262 and finishes at a relative high of 0.415, though it comes very close to both of these values at 1 day (BSS = 0.273) and 2 days (BSS = 0.414) prior to close as well. In this plot, the BSS represents the predictive skill of the market averaged across an entire day of trading.

Brier Skill Score the day of market close

On the final day of trading, there is a distinct upward slope in the skill of the market price to predict the market outcome, as we would expect given that investors have pretty much maximized the amount of information they can collect about the market at this point, and there is very little time for the market to change from wherever it happens to be. On the final day, the BSS starts at a low of 0.250 and finishes at a high of 0.624. These BSS values represent the predictive skill of the market averaged over an hour of trading.

Just to belabor the point a bit, what the BSS is telling us at any point across the timescale of the markets is how well calibrated and confident the market price is as a predictor of the eventual market outcome. To get a better sense of what we mean by “calibrated and confident”, it helps to look at snapshots of the calibration at specific points in time. I’ll start with a snapshot of market calibration one week prior to close:

Market calibration one week before market close

Each red dot in this plot is the observed proportion of markets that eventually resolved to “yes” among all markets that traded at a particular price. So, for example, the red dot near the bottom of the chart next to the market price of $0.50 indicates that less than 5% of the markets that were trading at $0.50 one week prior to close resulted in ‘yes’ (however, before assuming that this is good investment information, note that almost 90% of the markets trading at $0.52 prior to close resulted in ‘yes’). Why so much volatility over such a small price range? It’s really just because there isn’t a whole lot of trading going on this far out from market close. There are only a few instances of markets trading at each price, especially at mid-range prices (prices closer to $0.50), so whatever eventually happens to these small clusters of markets will heavily influence where the red dot ends up.

Also note the dotted diagonal black line. This represents where all the points would be if the market price were perfectly calibrated (i.e., if every market price reflected the true probability of the market resolving to ‘yes’). The blue shaded area around this diagonal line represents the variability that we would expect to see around the line, given the number of trades that occurred at each price point, if the market price were indeed perfectly calibrated. So if a red dot overlaps the blue shaded area, it is plausible that the market is actually perfectly calibrated at that point and that the deviation is just due to random luck and small sample sizes.

Finally, the blue vertical dotted line is the base event rate. Out of all of the markets that have ever finalized, 28.7% resolved to ‘yes’. The BSS takes this into account and gives more credit to market prices that are well calibrated away from this point.
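The ingredients of a calibration plot like this one can be sketched in a few lines of Python. This is a minimal illustration, not the code that produced the figure: the function name `calibration_table` is hypothetical, and the expected band uses a normal approximation to the binomial (price plus or minus 1.96 standard errors), which is my assumption here and may differ from how the shaded area in the plot was computed.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def calibration_table(
    prices: Sequence[float], outcomes: Sequence[int], z: float = 1.96
) -> List[Dict[str, float]]:
    """For each traded price: observed 'yes' rate plus the band expected
    under perfect calibration (normal approximation, an assumption)."""
    counts: Dict[float, List[int]] = defaultdict(lambda: [0, 0])  # price -> [n, n_yes]
    for price, outcome in zip(prices, outcomes):
        key = round(price, 2)  # prices are quoted to the cent
        counts[key][0] += 1
        counts[key][1] += outcome

    rows = []
    for price in sorted(counts):
        n, n_yes = counts[price]
        observed = n_yes / n  # the "red dot"
        # Standard error of a proportion if the true 'yes' rate equals the price.
        half_width = z * math.sqrt(price * (1 - price) / n)
        low = max(0.0, price - half_width)
        high = min(1.0, price + half_width)
        rows.append({
            "price": price,
            "n": n,
            "observed": observed,
            "low": low,
            "high": high,
            "within_band": low <= observed <= high,
        })
    return rows


# Toy data: 4 measurements at $0.50 and 10 at $0.90.
prices = [0.50] * 4 + [0.90] * 10
outcomes = [1, 0, 1, 0] + [1] * 9 + [0]
for row in calibration_table(prices, outcomes):
    print(row)
```

Note how the band narrows as `n` grows, which is why a price point with only a handful of measurements can sit far from the diagonal without contradicting perfect calibration.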

Here is the same plot for the markets one day prior to close:

Market calibration one day before market close

There was more trading volume at this time point, so the expected variability around the line of perfect calibration is smaller, and we do see the observed proportion of ‘yes’ clustering a bit closer around this line. However, the skill score actually drops (perhaps not by a meaningful amount). This is likely because the red dots that deviate from the diagonal line now represent many more measurements, and so they influence the overall score to a greater extent.

Finally, here is the same plot in the last hour of trading:

Market calibration in the final hour before market close

The skill score is considerably higher and you can see that the red dots have clustered much more tightly around the line of perfect calibration. At this point, market prices have become a pretty good indicator of the true probability of the market resolving to ‘yes’ (as we would expect, since this plot represents the point in time where the market has the least amount of time to change and investors have the most information that they can have about the market prior to close).

There are, however, some important systematic deviations from calibration that are worth pointing out. It seems that low market prices in the last hour of trading (prices below $0.50) are overly optimistic. If we assume that the market price reflects the market’s best guess at the probability of the outcome, investments below $0.50 traded at an average of $0.05 higher than the eventual proportion that resolved to ‘yes’. Conversely, investments above $0.50 seem to be overly pessimistic as they traded at an average of $0.04 below the eventual proportion that resolved to ‘yes’.
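One simple way to quantify this kind of over- and under-pricing is to average the gap between price and outcome separately on each side of $0.50. The sketch below is an assumption about how such a number could be computed, weighting every price measurement equally; the figures quoted above may have been averaged differently (for example, per price point). The function name `mean_price_gap` is hypothetical.

```python
from typing import Dict, Optional, Sequence


def mean_price_gap(
    prices: Sequence[float], outcomes: Sequence[int], split: float = 0.50
) -> Dict[str, Optional[float]]:
    """Average of (price - outcome) below and above `split`.

    A positive gap means prices ran higher than the eventual 'yes' rate
    (optimistic); a negative gap means they ran lower (pessimistic).
    """
    pairs = list(zip(prices, outcomes))
    below = [p - o for p, o in pairs if p < split]
    above = [p - o for p, o in pairs if p > split]
    return {
        "below": sum(below) / len(below) if below else None,
        "above": sum(above) / len(above) if above else None,
    }


# Toy data: low prices that never pay out, high prices that always do.
print(mean_price_gap([0.10, 0.10, 0.90, 0.90], [0, 0, 1, 1]))
```

In the toy data, the low-priced markets are overpriced by about $0.10 and the high-priced markets are underpriced by about $0.10, the same direction of bias described above.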

There are several well-studied social psychological phenomena that could explain this miscalibration in the last hour of trading. One in particular that seems relevant is the anchoring and adjustment cognitive heuristic. This is when an individual’s judgment about some numeric quantity is influenced by a prior reference point. According to this heuristic, investors in the last hour of trading may observe the existing market price and be influenced enough by its present value that they don’t adjust away from it as far as they would if they were simply predicting the probability of the outcome outright. They get somewhat ‘anchored’ to the listed price as is. A host of other possible explanations for the optimism/pessimism in the last hour of trading exist as well, but for the purpose of this analysis, we’ll just acknowledge that it exists in our calibration plot.

So, for me, there are a few big takeaways from this analysis:

Market price is a valuable tool for the prediction of a future event. This is my key source of fascination with prediction markets in general. What prediction markets are doing is essentially crowd-sourcing opinions in order to predict the probability of some future event, and this analysis shows objectively that there is value in the market as a predictive tool. Across all three trend plots, the BSS line was positive, and a positive BSS indicates that the model (in this case, the market price) is doing a better job of predicting the outcome than simply knowing the event base rate. Another way of saying this is that prediction markets are extracting insight from the investment crowd and incorporating this insight into the market price. Almost certainly, the prediction markets are able to do this because investors are backing their beliefs with real money. It is one thing for an “expert” to give an opinion about what is going to happen with the economy, or politics, or a manned mission to the moon, but outside of some reputational impact, there isn’t much consequence when experts end up being wrong, and they can generally formulate a plausible reason in hindsight as to why their projections didn’t turn out the way they had initially forecast. However, when lots of individuals stand to lose money if their predictions are wrong, it helps reduce sources of bias in each individual prediction, and when lots of individual predictions made in this manner are averaged out, we actually get a pretty useful predictive tool.
Calibration means one thing from a social science (prediction) standpoint and an entirely different thing from a financial/investment standpoint. A perfectly calibrated prediction market would be an extremely valuable tool for social science because it would provide precise probabilities of future events. If perfectly calibrated prediction markets were to exist, they could replace expert opinion and provide an extremely valuable source of information for policy decisions and future planning in general. However, a perfectly calibrated market would not provide any opportunity for investment returns. In other words, if the market price always represented the true probability of the event outcome, then over the long run, all investments would be expected to break even. My analysis of the history of the Kalshi prediction markets to date indicates that this is not the case. The positive BSS value over the course of the markets’ history indicates that the market price is providing valuable predictive information about events. However, markets are not perfectly calibrated, and therefore, the markets hold investment opportunity as well.
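The break-even claim in the second takeaway follows from simple expected-value arithmetic, sketched below. The function name `expected_profit` is hypothetical, and the sketch assumes a single contract, a "no" contract that costs $1.00 minus the "yes" price, and no fees or bid/ask spread.

```python
def expected_profit(price: float, true_prob: float, side: str = "yes") -> float:
    """Expected profit per contract when 'yes' trades at `price` and the
    event's true probability is `true_prob`.

    Assumptions of this sketch: a 'no' contract costs 1 - price,
    a winning contract pays $1.00, and there are no fees.
    """
    if side == "yes":
        cost, win_prob = price, true_prob
    elif side == "no":
        cost, win_prob = 1.0 - price, 1.0 - true_prob
    else:
        raise ValueError("side must be 'yes' or 'no'")
    # Win (1 - cost) with probability win_prob, lose cost otherwise.
    return win_prob * (1.0 - cost) - (1.0 - win_prob) * cost


# Perfectly calibrated: price equals the true probability, so both sides break even.
print(expected_profit(0.60, 0.60, "yes"), expected_profit(0.60, 0.60, "no"))

# Miscalibrated: 'yes' trades $0.05 above the true probability.
print(expected_profit(0.30, 0.25, "yes"), expected_profit(0.30, 0.25, "no"))
```

The expression simplifies to the true probability minus the price for a "yes" position (and the reverse for "no"), so any systematic gap between price and outcome rate, like the one seen in the final hour of trading, is exactly where expected returns would come from.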

Hopefully you found my analysis of the Kalshi prediction markets insightful. If you did, please share the post with others. If you would like to collaborate on a project involving evaluation of the skill of a predictive model, or any other type of data-driven analysis project, please get in touch: info@cwdatasolutions.com.

Thank you for reading!

Approximate Project Time: 60 Hours