Thursday, November 5, 2009

The Real World Birthday Problem

Originally posted on my private blog September 19th, 2008.

The birthday problem is a well-known mathematical illustration. It asks how many people must be in one room before there is at least a 50% chance that two of them share a birthday. The answer is 23, which is lower than most people's intuition suggests. That result, however, rests on math that assumes birthdays occur uniformly throughout the year and that neglects leap days. I was curious what effect these simplifications might have, so I went looking for some birth statistics. The closest I found was a list of U.S. births by day for 1978.
Series 1 (dark blue) is the raw data, which as you can see has a weekly cycle with much lower birth rates on weekends (especially Sunday). I needed to remove this effect, because in the long run every date falls on every day of the week. Not having other years to work with, I filtered the data, averaging each day with the births for the three days before and the three days after, so that every weekday is included in each average. This is shown as series 2 (pink). Series 3 (yellow) is what completely uniform birth rates would look like and is included for comparison. You can see that birth rates are visibly higher in the second half of the year. Unfortunately, 1978 wasn't a leap year, so for February 29th I averaged the values of February 28th and March 1st and then weighted the result to be one quarter as likely (not shown).
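As a rough sketch of the filtering described above (not the original spreadsheet or code, which the post does not show), the seven-day average and the leap-day weighting might look like this in Python. The function names are made up for illustration, the input series is a synthetic placeholder rather than the 1978 data, and wrapping the averaging window around the ends of the year is an assumption, since the post does not say how the first and last three days were handled.

```python
def smooth_weekly(births):
    """Centered 7-day moving average: each day averaged with the 3 days
    before and the 3 days after. The window wraps around the ends of the
    year, which is an assumption (the post does not describe the edges)."""
    n = len(births)
    return [sum(births[(i + k) % n] for k in range(-3, 4)) / 7 for i in range(n)]


def add_leap_day(smoothed):
    """Insert February 29th into a 365-day series as the average of
    February 28th and March 1st, weighted to be one quarter as likely."""
    feb28, mar1 = smoothed[58], smoothed[59]  # 0-based indices in a non-leap year
    leap_day = 0.25 * (feb28 + mar1) / 2
    return smoothed[:59] + [leap_day] + smoothed[59:]


def normalize(weights):
    """Scale weights so they sum to 1, giving a probability for each date."""
    total = sum(weights)
    return [w / total for w in weights]


# Synthetic placeholder, NOT the 1978 data: a flat series with a dip on two
# days out of every seven, standing in for the weekend effect.
placeholder_births = [10000 - (1500 if d % 7 in (5, 6) else 0) for d in range(365)]

probs = normalize(add_leap_day(smooth_weekly(placeholder_births)))
```

Because every window of seven consecutive days contains each weekday exactly once, the weekly dip disappears from the smoothed series, leaving only the slower seasonal variation (none, in this placeholder).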
Next I had to use this data to compute the probability distribution. I went with a simulation approach, which means I simply performed the experiment many times and kept track of how many people it took to get a matching pair of birthdays. Not sure how many trials were needed for an accurate result, I did successively larger runs, starting with ten thousand trials and ending with a billion, each run an order of magnitude larger than the last. You can see that the results converge, with the last two curves lying right on top of each other.
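The simulation itself can be sketched in a few lines of Python. This is an illustrative reconstruction under assumptions, not the program used for the runs above: the names `people_until_match` and `simulate_cdf` are hypothetical, and the example feeds in a uniform 365-day distribution because the real per-day probabilities are not reproduced in this post.

```python
import random
from itertools import accumulate


def people_until_match(cum_weights, rng):
    """Add people one at a time, each with a birthday drawn from the given
    distribution, and return how many people it took to get a matching pair."""
    seen = set()
    days = range(len(cum_weights))
    while True:
        day = rng.choices(days, cum_weights=cum_weights)[0]
        if day in seen:
            return len(seen) + 1  # the newcomer completes the pair
        seen.add(day)


def simulate_cdf(probs, trials, seed=0):
    """Estimate cdf[n] = probability that a group of n people contains at
    least one shared birthday, by repeating the experiment `trials` times."""
    rng = random.Random(seed)
    cum_weights = list(accumulate(probs))
    # With D possible days a match is guaranteed by person D + 1.
    counts = [0] * (len(probs) + 2)
    for _ in range(trials):
        counts[people_until_match(cum_weights, rng)] += 1
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / trials)
    return cdf


# Smallest run size mentioned above, on a uniform placeholder distribution.
cdf = simulate_cdf([1 / 365] * 365, trials=10_000)
print(f"Estimated chance of a match among 23 people: {cdf[23]:.3f}")
```

A pure-Python loop like this is fine for the smaller runs, but a billion trials would probably call for a compiled or vectorized implementation.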
Finally, here are the cumulative probability distributions of both the standard mathematical model (series 1) and my simulated one using real data (series 2). You can see that the two are basically identical; in fact, the maximum difference between them is less than two tenths of one percent. This means that, for this data at least, the effects of non-uniform birthday occurrence average out in practice. If the data had been more dramatically clustered this might not have been the case, but now I know for sure. In case you're wondering, a couple of professors have written papers on this very topic, but I wanted to do it for myself as a mental exercise.
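For reference, the standard model's curve needs no simulation: with 365 equally likely birthdays, the chance that n people all have different birthdays is the product (365/365)(364/365)...((365 - n + 1)/365), and the chance of a match is one minus that. The Python sketch below computes that curve and a maximum-difference measure like the one quoted above; the function names are hypothetical, and the simulated curve to compare against would have to be supplied separately.

```python
def uniform_cdf(days=365, max_people=100):
    """cdf[n] = probability that n people include at least one shared
    birthday, assuming `days` equally likely birthdays and no leap day."""
    cdf = [0.0]
    p_no_match = 1.0
    for n in range(1, max_people + 1):
        # Person n must avoid the n - 1 birthdays already taken.
        p_no_match *= (days - (n - 1)) / days
        cdf.append(1.0 - p_no_match)
    return cdf


def max_abs_difference(curve_a, curve_b):
    """Largest absolute gap between two curves over their common range."""
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))


exact = uniform_cdf()
# Smallest group size with at least a 50% chance of a shared birthday.
threshold = next(n for n, p in enumerate(exact) if p >= 0.5)
print(threshold, round(exact[threshold], 4))
```

Passing this curve and a simulated one to `max_abs_difference` gives a single number summarizing how far the two models disagree.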
