Forecasting Uncertainty via Simulation
In my post Forecasting the Impact of Abortion Law Changes on State Foster Care Systems, I mentioned that one way to incorporate uncertainty in a forecast is through simulation. In this post, I explore this technique in greater detail in order to demystify what I’m doing and hopefully make the technique more accessible to others.
One thing I needed to do in that project was build a baseline forecast of the need for foster care resources out into the future. I had data on the proportion of all children in each state who were in the foster care system from 2014-2019, which gave me a range of values that were reasonable to use in the forecast and a sense of how much this proportion was likely to change from year to year. Once again, using Texas as the example:

You can see that the proportion of children in foster care was pretty stable over this time period, ranging from a low of 0.361% in 2015 to a high of 0.385% in 2018 (think of this as 361 to 385 out of every 100,000 children in the state).
From the above, I can also see that from year to year, the change in proportion was very small – between -0.017% and +0.012%. This is important because when I progress from year to year in the simulation, I don’t want to just grab any proportion of the overall state child population that falls into the “reasonable” range – I want to pick a proportion that is reasonably different from the proportion that came the year before.
I had good data about what a reasonable change in the foster care proportion would look like from year to year, but there is no one right value, and there are a whole lot of future possibilities depending on how things play out. This is why I relied on simulation. *Note: even though I conducted this project in 2022, I had to simulate all of the data after 2019 because this was the last year that all the data I needed (total state child population and total state foster care population) was publicly available.
Here is how I carried out the simulation:
I started with the last known foster care proportion from 2019 (in Texas, this was 0.368%), and added a randomly selected value (again, for Texas, between -0.017% and +0.012%). This gave me a simulated foster care proportion for 2020, which I then multiplied by the child population from 2020 (8,721,313) to get the forecast of the actual number of children in foster care in 2020. I repeated this for 2021, selecting a change in the foster care proportion at random within the same range, adding it to the proportion I got for 2020 (the change could be positive or negative), and multiplying the result by the child population forecast for 2021, and so on all the way to 2040. This gave me a simulated count of the number of children in foster care every year from 2020-2040.
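To make those mechanics concrete, here is a minimal sketch of a single simulated path in Python. The 2019 starting proportion, the range of year-to-year changes, and the 2020 child population come from the figures above; the uniform sampling distribution and the child population trajectory after 2020 are assumptions I’ve made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Known Texas values from the post (percentages written as proportions)
prop = 0.00368                            # 2019 foster care proportion (0.368%)
change_lo, change_hi = -0.00017, 0.00012  # observed range of annual changes

# Child population forecast for 2020-2040. Only the 2020 value (8,721,313)
# comes from the post; the rest of the trajectory is a placeholder.
years = np.arange(2020, 2041)
child_pop = np.linspace(8_721_313, 8_900_000, len(years))

# Walk the proportion forward one year at a time
path = []
for pop in child_pop:
    prop += rng.uniform(change_lo, change_hi)  # random year-over-year change
    path.append(prop * pop)                    # simulated foster care count

for year, count in zip(years, path):
    print(year, round(count))
```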

But this is just one possible forecast out of an infinite number of reasonable forecasts we could make over this time period; so instead of just one, I built 10,000 of these simulated paths. Thankfully, due to modern computing, the whole process only took a handful of seconds for each state, so I was able to do all 50 states in a little over a minute.
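If you are curious how this scales so well, the trick is to generate all of the simulations at once rather than looping over them one at a time. Here is a rough sketch of how the 10,000 paths might be produced in a single vectorized step, using the same assumed inputs as the sketch above:

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims, n_years = 10_000, 21                      # 2020 through 2040
start_prop = 0.00368                              # Texas 2019 proportion
change_lo, change_hi = -0.00017, 0.00012          # observed annual changes
child_pop = np.linspace(8_721_313, 8_900_000, n_years)  # assumed trajectory

# Draw every year-over-year change at once (one row per simulated path),
# then accumulate the changes onto the 2019 starting proportion
changes = rng.uniform(change_lo, change_hi, size=(n_sims, n_years))
props = start_prop + changes.cumsum(axis=1)

# Broadcast the population forecast across all 10,000 proportion paths
counts = props * child_pop                        # shape: (10000, 21)
```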
When selecting values at random, there are a lot of possible paths through the data, but I constrained the paths so that the change in the foster care proportion from year to year stays within what we determined was a pretty reasonable range. It is possible that I could, just by sheer luck, select the maximum increase (or the maximum decrease) every year, but most paths through the data will bounce back and forth a bit like the plot above. The result is that, over the course of lots of simulations, the majority of them will end up pretty close to what is ultimately the most likely forecast from year to year. We can take a closer look at the first few years of the forecast to illustrate what I mean. This shows all 10,000 simulated forecasts for every year from 2020-2025:

Knowing every single path through the data isn’t very informative, so I plotted the median and built an 80% confidence interval around the points from each year:
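For readers who want to reproduce this, the summary statistics fall straight out of the simulated array. I’m assuming here that the 80% interval is the band between the 10th and 90th percentiles of the simulated values in each year (continuing from the `counts` array in the sketch above):

```python
# Summarize the 10,000 paths year by year: the median, plus an 80% interval
# taken as the 10th-to-90th-percentile band of the simulated values
median = np.median(counts, axis=0)
lo, hi = np.percentile(counts, [10, 90], axis=0)
```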

If we focus just on the simulated points from the year 2025, we can see how much more frequently points near the median value (red dot) showed up in the simulations compared to extreme values near the edges. This gives a sense of their relative likelihoods. Values really close to the median showed up in thousands of simulations, whereas the really high and really low forecast values showed up in only a few:
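A quick way to see this concentration without a plot is to bin the simulated 2025 values and look at the bin counts (again continuing from `counts` above):

```python
# Distribution of the simulated 2025 values
# (2025 is index 5 in the 2020-2040 year grid)
vals_2025 = counts[:, 5]
freq, _ = np.histogram(vals_2025, bins=20)
print(freq)  # middle bins hold thousands of paths; the edge bins only a few
```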

Carrying this whole process out to 2040 resulted in the baseline forecast of the foster care population in Texas (the forecast of foster care if abortion laws hadn’t changed).

I built the updated forecasts (due to the change in abortion law) the same way. The only difference was that the forecasted population of children in the state was higher following the change in abortion laws, so multiplying this bigger number by the same proportion of children in foster care resulted in larger forecasts of the foster care population in an absolute sense.
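In code, this amounts to reusing the same simulated proportion paths and swapping in the larger population forecast. The 1% uplift below is purely a placeholder; the real uplift came from the birth forecasts described in the earlier post:

```python
# Same proportion paths, larger child population forecast (assumed +1% uplift)
child_pop_updated = child_pop * 1.01
counts_updated = props * child_pop_updated  # updated foster care forecasts
```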
And that's the whole process. I realize this has been a technical post, but hopefully it helps make this simulation technique clearer, and you can see how we can use our knowledge of what constitutes “reasonable” variation in the data to build a valuable forecast that still contains the right amount of uncertainty.
Once again, if you have a use for this type of analysis in one of your projects, we would love to partner with you. Please contact us at info@cwdatasolutions.com to get started!
Thank you for reading!
Russ Clay, PhD
Founder and Principal Data Scientist