Written by Colin+ in probability.

For all the grief I give @reflectivemaths on Wrong, But Useful, he does occasionally ask an interesting question.

In episode 45, he wondered how many packs of Lego cards one would need to acquire, on average, to complete the set of 140?1

Suppose, instead of 140 cards, there was a total of three to collect.

You start (obviously) with no cards. On average, you need to pick up a single card to make sure you have one card in your collection. By on average, I mean 'every time'. The expected time (in cards) to go from 0 cards to 1 card is 1.

If you have one card, how long do you expect to wait until you get to two? Now we're looking at a *geometric distribution* with a parameter of $\frac{2}{3}$ - every time you get a new card, there are two chances in three of it being the one you don't already have. If $X \sim Geo\left(p\right)$, then $E(X) = \frac{1}{p}$ - and in particular, it will take us on average 1.5 cards to go from one card to two.

And for the last card? Now it's another geometric distribution; this time, only one in three cards helps us, so we're drawing from $Geo\left(\frac{1}{3}\right)$, with an expected value of 3.

Adding these up gives $1 + \frac{3}{2} + 3 = 5.5$ - although it's not really the *number* we care about.

Breaking it down, this is three geometric distributions, with parameters $\frac{3}{3}$, $\frac{2}{3}$ and $\frac{1}{3}$, giving expected times of $\frac{3}{3}$, $\frac{3}{2}$ and $\frac{3}{1}$ - or $3 \left(1+\frac{1}{2}+\frac{1}{3}\right)$.

A few moments' thought shows that if you have $n$ cards to collect, you have $n$ geometric distributions, with parameters $\frac{n}{n}$, $\frac{n-1}{n}$, ... $\frac{1}{n}$ - and so your total expected time is $n\left(\frac 1n + \frac1{n-1} + \frac{1}{n-2}+ ... + \frac{1}{2} + 1\right)$.

That sum in the bracket? That's known as the $n$th *harmonic number*, $H_n$. In the Sainsbury's case, your expected value turns out to be $140 H_{140} \approx 773$ cards - or close to 200 packets.

In fact, the packets are made of a single shiny card (of which there are 36 to collect) and three regular cards (numbering 104).

To collect 36 silver cards (forgetting anything else) would take $36 H_{36} \approx 150$ cards. Collecting 104 regular cards would require $104 H_{104} \approx 544$ cards. Altogether, having this fairly mild guarantee about the structure means you'd expect to need 694 cards, or 173 packs, to finish your collection.

Because $\int_1^n \frac{1}{x} \d x = \ln(n)$, it's reasonable to expect $\sum_1^n \frac{1}{n}$ to be related to $\ln(n)$ as well. It *is* - but there's a mysterious discrepancy. In fact, $\sum_1^n \frac{1}{n} \approx \ln(n) + \gamma$, where $\gamma$ is the *Euler-Mascheroni constant*, about 0.577.

In our original 140-card problem without silver cards, this would give $140 (\ln(140) + \gamma)$; $\ln(140) = \ln(7)+\ln(20) \approx 5$, and we end up with $140 \times 5.5 \approx 770$. That's not at all bad!

This puzzle is known as The Coupon Collector's Problem - and I love how it falls apart so logically!

- For the sake of simplicity, let's assume that the 140 cards are equally likely to show up in any given pack of four, and that there are infinitely many of each. [↩]