Probability Framework of the 2008 Crisis

14 min readMar 30, 2023

Note: This work is mostly from HarvardX: Data Science online courses. I’m aiming to be intuitive with Monte Carlo simulations. You can find the Kaggle notebook and find the R codes on my profile link

Central Limit Theorem

It’s helpful to remember the central limit theorem before working on this subject. The theorem says whatever distribution it’s coming from, the sample mean is normally distributed. Even if it’s a uniform distribution or exponential, any distribution that has a mean value, mean of the sample from this distribution is normally distributed. Here are some examples to exactly see this. There is a rule of thumb that the sample size should be at least 30. I can’t dig into this issue and I just selected a sample size very large.

# First we will see the histogram of a uniform distribution:
df.uniform <- data.frame(values = runif(100000))
df.uniform %>%
  ggplot(aes(values)) +
  geom_histogram(aes(y = ..density..), 
                 binwidth = 0.07, 
                 color = "grey",
                 fill = "#336699") +
  ggtitle("Histogram of a random uniform distribution")

# Sampling from a uniform distribution and taking averages:
CLT_unif <- replicate(1000, {
  x <- runif(30) # 30 random variables from a uniform distribution between 0 and 1
  mean(x) # take the average and append it to the object 'CLT_unif'
})

as.data.frame(CLT_unif) %>% 
  ggplot(aes(CLT_unif)) + 
  geom_histogram(aes(y = ..density..), color = "grey", fill = "#336699") +
  geom_density(size = 1) +
  ggtitle("Averages of samples from a uniform distribution")

# This is the histogram of a random exponential distribution:
df.exp <- data.frame(values = rexp(100000))
df.exp %>% 
  ggplot(aes(values)) +
  geom_histogram(aes(y = ..density..), 
                 binwidth = 0.3, 
                 color = "grey",
                 fill = "#ff9933") +
  ggtitle("Histogram of a random exponential distribution")

# Here is the sampling and making histogram of the averages:
CLT_exp <- replicate(1000, {
  x <- rexp(30) # 30 random variables from an exponential distribution
  mean(x) # take the average and append it to the object 'CLT_exp'
})

as.data.frame(CLT_exp) %>%
  ggplot(aes(CLT_exp)) + 
  geom_histogram(aes(y = ..density..), color = "grey", fill = "#ff9933") +
  geom_density(size = 1) +
  ggtitle("Averages of samples from a exponential distribution")

# Finally, example of a poisson distribution: 
df.pois <- data.frame(values = rpois(100000, lambda = 4))
df.pois %>% # histogram of a random poisson distribution
  ggplot(aes(values)) +
  geom_histogram(aes(y = ..density..), 
                 binwidth = 1, 
                 color = "grey",
                 fill = "#009999") +
  ggtitle("Histogram of a random poisson distribution")

CLT_pois <- replicate(1000, {
  x <- rpois(30, lambda = 4) # 30 random variables from a poisson distribution
  mean(x) # take the average and append it to the object 'CLT_pois'
})

as.data.frame(CLT_pois) %>%
  ggplot(aes(CLT_pois)) + 
  geom_histogram(aes(y = ..density..), color = "grey", fill = "#009999") +
  geom_density(size = 1) +
  ggtitle("Averages of samples from a poisson distribution")

As we see, they are all going towards normal distribution. This made a great deal of implementation on science and statistics. We don’t need to care about the distribution of the target population, we can do statistical tests on the sample as if it’s normally distributed. The central limit theorem (CLT) and the expected value are directly relevant to the banking industry. It’s actually the basis of how credits work.

How banks make money

Let’s forget about all technical details and think about the basis of making money from money. A bank (creditor) gives a loan to someone who needs that money today (borrower). The creditor gives the money (creates the money from thin air) and expects the money in the future with a certain interest. For simplicity, we will omit the inflation and the interest rates will be all real interest. The bank gives the loan, say $200,000 with a 2% interest rate, the borrower will pay back $204,000, and that $4,000 is profit for the bank. But there is default risk here. If everything is okay, the bank makes a profit and the borrower is happy. If the borrower defaults, the bank loses not the potential profit but the whole principal ($200,000 in our example). Furthermore, the bank does not give credit to a single person but to many. Some of them will default and some of them will pay back. Let’s say the default risk is 2%. With every hundred borrowers, we ‘expect’ that 2 of them will default and not pay back. What does it mean that we ‘expect’ that 2 will default? Is it certainly 2 or is it around 2? If it’s around 2 will default, how close it will be to 2? Let’s make a simulation and see it ourselves. (For simplicity, let’s just don’t think about where this 2% comes from. Creditors derive this ratio from various sources and it’s not in the scope of this notebook. In this example the 2% is arbitrary but not so far from the real world)

Understanding credits with Monte Carlo simulations

Say, from the view of a bank:

n = 1000 (borrowers),
p = 0.02 (default risk or probability that a borrower defaults)
loss per closure = -200000
Replication = 10000 times

 First, we will only focus on default risk and losing amount, not profiting. 
set.seed(2022)
n <- 1000 # thousand borrowers
p <- 0.02 # default risk or probability that a borrower defaults
loss_per_foreclosure <- -200000
B <- 10000 # for replicate 
losses <- replicate(B, {
  defaults <- sample(c(0,1), n, prob=c(1-p, p), replace = TRUE) 
  sum(defaults * loss_per_foreclosure)
})

cat("Average losses in millions;", mean(losses) / 10^6)
as.data.frame(losses) %>%
  ggplot(aes(losses/10^6)) +
  geom_histogram(aes(y = ..density..), fill = "#669999", color = "grey", binwidth = 0.4) +
  ggtitle("Distribution of losses in this monte carlo simulation") +
  xlab("Losses in millions") +
  ylab("Density")


# output - Average losses in millions; -3.99592

Briefly, this is a 10000 replication of binomial trials and averaging over the losses (expected value). Here is the outcome:

This is the central limit theorem in front of our very eyes. When the sample is big enough, the sample mean is distributed normally. It seems intuitive now, most of the time the loss will be around 200,000*0.02*n = $4M. But there is also the luck factor and the outcome might be more or less than 4 million loss, of course with a lower probability. Using the theorem formulas:

The expected value of the sum of n of a random variable: n × (ap + b(1–p))

Standard error of the sum of n of a random variable: √n × |b−a| √(p(1−p))

avg_loss <- n*(p*loss_per_foreclosure + (1-p)*0) /10^6
se_loss <- sqrt(n)*abs(loss_per_foreclosure)*sqrt(p*(1-p))
cat("Expected value of the sums (in millions):", avg_loss, "\n", 
   "Standard error of the sum of n:", se_loss)


# output - Expected value of the sums (in millions): -4 
#  Standard error of the sum of n: 885437.7

Therefore, the expected value of the sums (in millions) is ‘-4’ and the standard error of the sum of n is ‘885437.7’

The banking system relies on this simple theorem and the banks rightfully believe that the default results will distribute normally like that histogram. The banks need to find the right interest rates for their accepted risk of profiting or losing. Now let’s take into account interest rates and profits. (For simplicity, I am omitting the operational costs and focusing only on capital and default risk) First, we will find out the interest rate that makes the expected value zero.

The notation in this formula is: (l: loss per closure, p: default probability, x: profit, 1− p: not default)

lp + x(1 — p) = 0 and we solve this equation for x

x = — lp/(1 — p)

x = - loss_per_foreclosure*p/(1-p)
cat("The amount of profit for expected value being zero:", x, "\n", 
   "The interest rate for $200,000 loan should be:", x/200000)


# output - The amount of profit for expected value being zero: 4081.633 
#   The interest rate for $200,000 loan should be: 0.02040816

The amount of profit for the expected value being zero: 4081.633

The interest rate for a $200,000 loan should be: 0.02040

When the expected value (or the average of outcome) is 0, the chance the bank losing money is 0.5. This is easy to see because the expected value is 0 with some ‘swing’ for positive or negative values, with this swing, I mean standard error. It’s either losing or profiting with the same chance.

se_loan <- sqrt(n)*abs(loss_per_foreclosure-x)*sqrt(p*(1-p))
pnorm(0, 0, se_loan) # Probabilty of an outcome being 0 or less than 0 in a normal distribution with mean 0 and standard error 'se_loan'

# output - 0.5

Suppose the bank wants to take a risk of losing money at 0.01 probability. We will stick to n=1000 borrowers and 2% default risk, thus we need to change the interest rate. If the expected value becomes greater than 0, the risk of losing money for the bank decreases. Imagine the bell curve going to the right on the cartesian coordinates with the same other parameters.

We want the sum (S) to have a probability of 0.01 to be less than 0: Pr(S<0)=0.01

We know that S is approximately normal. Now we will make a ‘trick’ that is very common in statistics. We are going to add and subtract the same amount on both sides and our inequality will still stand true. That will permit us to use x as an unknown and thus we can solve it. On both sides, we will subtract the expected value of S and divide it by the standard error of S.

Now, remember that the term on the left is the Z score formula. We will put Z on the left. This process is also called ‘standardizing’. On the right, we can unbind the notations to formulas, so we will get:

We know that Z is a standard normal variable with the expected value of 0 and standard error of 1, the right side, call it little z must be the quantity of 0.01 quantile. qnorm(0.01) gives us the value of z and it means that: Pr(Z≤z)=0.01 Thus,

After that, it’s just tedious algebra and we can solve for x as this:

l <- loss_per_foreclosure
z <- qnorm(0.01)
x <- -l*( n*p - z*sqrt(n*p*(1-p)))/ ( n*(1-p) + z*sqrt(n*p*(1-p)))

cat("The interest rate is:", x/200000, "\n", 
   "Expected value of the profit per loan:", l*p + x*(1-p), "\n", 
   "Expected value of the profit for n loans (millions):", n*(l*p + x*(1-p))/10^6)

# output - The interest rate is: 0.03124591 
#   Expected value of the profit per loan: 2124.198 
#   Expected value of the profit for n loans (millions): 2.124198

Therefore,

The interest rate is: 0.03124591

The expected value of the profit per loan: 2124.198

The expected value of the profit for n loans (millions): 2.124198

Let’s make a Monte Carlo simulation and compare the outcomes.

# Monte Carlo simulation for 1% probability of losing money
B <- 10000
set.seed(2021)
profit <- replicate(B, {
    draws <- sample(c(x, l), n, prob = c(1-p, p), replace = TRUE) 
    sum(draws)
})

cat("Expected value of the profit over n loans (in millions):", mean(profit)/10^6, "\n", 
   "Probability of losing money", mean(profit<0))

# output - Expected value of the profit over n loans (in millions): 2.115205 
#   Probability of losing money 0.0133

The expected value of the profit over n loans (in millions): 2.115205 and

Probability of losing money: 0.0133

We see that the outcome is pretty close to our theoretical approach. So we are on a good track.

Now see the profits on a Monte Carlo simulation:

# Here is the histogram of monte carlo simulation for profits
data.frame(profit) %>% 
  ggplot(aes(profit/10^6)) +
  geom_histogram(aes(y = ..density..), binwidth = 0.4, color = "grey", fill = "#660066") +
  geom_vline(xintercept = 0, color = "red") +
  xlab("Profit (in millions)") +
  ylab("Density") +
  ggtitle("Distribution of profits in millions")

If this works, why not do it more?

Within the context of the crisis, we know that bankers started to take more and more risks. We also know that in hindsight, the mortgages rated with As even AAs (which are the highest marks for their low risks) were false. It was so profitable and bankers wanted more and more, with the omission of other indicators.

The first thing that comes to mind is that if you profit so much from something, why not do it more? This means giving loans to more people, in our framework increasing the n. Take into account that finding that n=1000 borrowers with a moderate default risk of 2% weren’t easy. If you want to increase the n number of borrowers quickly, you need to renounce some of that moderate default risk. From a bank’s perspective, if you want to hold still your risk of losing money, say 1%, you need to increase your interest rates. After our theoretical calculations and Monte Carlo simulations, It’s easy to understand the balance between all this. But if you increase your interest rates, borrowers would go to another bank. So if you want to increase your n borrowers and hold still your interest rates, you must accept more risk of losing money, which means accepting borrowers with higher default risk. You say the 1% risk was quite low, what harm could happen with a 2% or 3% risk? You rely on the law of large numbers, if you increase n to a very large extent, even a 4% or 5% risk of losing money is not that scary. We even believe that the outcomes of a scientific study with a p-value of 0.05 as true. If I were offered a gamble with the probability of losing 0.05, I would bet on it with my whole savings. As long as you keep the n large, the outcome will be very predictable. This is most probably how the bankers might have thought of it. Let’s make some simulations and see the numbers ourselves.

Firstly, we will find out the n value with the same 1% risk of losing money.

We define mu and sigma as the expected value and standard error of one loan.

We already defined that z = 0.01, so we can see that:

To find the n for our desired or less than desired 0.01 value:

As long as this inequality holds true, we have a guarantee of not losing money 99% of the time.

# Expected value with a higher default rate and interest rate
p <- .04
l <- -200000
r <- 0.05
x <- r*200000
loss_per_foreclosure*p + x*(1-p) # The positive expected value of a single loan

# output - 1600

# Calculating number of loans for desired probability of losing money
z <- qnorm(0.01)
l <- loss_per_foreclosure
n <- ceiling((z^2*(x-l)^2*p*(1-p))/(l*p + x*(1-p))^2)
cat("Number of loans required:", n, "\n", 
   "Expected profit over n loans (in millions):", n*(loss_per_foreclosure*p + x * (1-p)) / 10^6)

# output - Number of loans required: 3580 
#   Expected profit over n loans (in millions): 5.728

It looks great! Even though the default risk of borrowers doubled to .04, with n = 3580 and interest rate r = 0.5, we guarantee to make huge profits .99 of the time. This is what the theorem tells us. Let’s see it in a Monte Carlo simulation again.

# Monte Carlo simulation with known default probability 
set.seed(2022)
B <- 10000
p <- 0.04
x <- 0.05 * 200000
profit <- replicate(B, {
    draws <- sample(c(x, l), n, 
                        prob=c(1-p, p), replace = TRUE) 
    sum(draws)
})
cat("The mean profit (in millions):", mean(profit)/10^6)


# output - The mean profit (in millions): 5.753368

# Histogram of profit with higher risk and higher borrowers
data.frame(profit) %>% 
  ggplot(aes(profit/10^6)) +
  geom_histogram(aes(y = ..density..), fill = "#3B7548", color = "grey") +
  geom_vline(xintercept = 0, color = "red") +
  xlab("Profit (in millions)") +
  ylab("Density") +
  ggtitle("Histogram of profit with higher n and higher risk")

Bad news: Draws are dependent, risks are unclear and volatile

This seems like a no-brainer. Huge profits await on a fairly basic mathematical theorem. Any bank or mortgage firm can make a profit with high risk, as long as holding large n. However, for the central limit theorem to stand, the draws (loans) must be independent. But is it realistic that the default of a borrower is independent of another’s default whatsoever? Even in a micro event, not one borrower but many would be affected and their default risk would change. This means that draws for the sample are not independent. If a global event occurs, not one borrower but all of them would be affected by it. These draws are no longer independent. Let’s set up a Monte Carlo simulation and see this effect ourselves.

Assume that default risk is not certain. We are guessing it would be between 0.02 ≤ p ≤ 0.06. In our model, the default risk will go up and down between these rates with 0.5 probability. It will happen to everybody at once and not just one person. This means that in a replication of the simulation, everybody will have a high risk and in another replication, they might all have a lower risk. The risk changes slightly. Let’s see how this affects the results.

# Monte carlo simulation of unkown and volatile default risk
set.seed(2008)
p <- 0.04
x <- 0.05*200000
profit <- replicate(B, {
    new_p <- 0.04 + sample(seq(-0.02, 0.02, length = 100), 1)
    draws <- sample(c(x, l), n, 
                        prob=c(1-new_p, new_p), replace = TRUE)
    sum(draws)
})
cat("Expected profit (in millions):", mean(profit)/10^6, "\n",
   "Probability of losing money:", mean(profit < 0), "\n",
   "Probability of losing over $1 million:", mean(profit < -1000000))

# output - 
# Expected profit (in millions): 5.78701 
#  Probability of losing money: 0.3109 
#  Probability of losing over $1 million: 0.276

Note that the expected profit is still huge, more than 5 million but the probability of losing money rocketed to .31, and what is scarier is the probability of losing over a million is .27. Why is this so? If we look at the distribution of the profit, we see that it’s not normally distributed at all. Because of draws for the sample were not independent, we can’t talk about the central limit theorem anymore.

data.frame(profit) %>% 
  ggplot(aes(profit/10^6)) +
  geom_histogram(aes(y = ..density..), binwidth = 2.5, color = "grey", fill = "darkred") +
  xlab("Profits (in millions)") +
  ylab("Density") +
  ggtitle("Not normal distribution of profits") +
  geom_vline(xintercept = 0, size = 1)

This simulation assumed default risks were not so great. Even though, when draws aren’t independent and there is volatility with risks, we can’t talk about the central limit theorem anymore. Even if the expected profit is huge, the risk of losing money is also quite concerning. In addition to these, in the 2008 crisis, when you think about subprime mortgages and AA ratings being false, the crisis seems not a surprise but it was meant to be.