We have an experiment where we select n times from a set of m arbitrary but distinct symbols. The probability of selecting a given symbol is constant and independent on each selection. [MATH]p_i[/MATH] is the probability of selecting the ith symbol, i = 1, ... m.
We have [MATH]\text i \in \mathbb Z \text { and } 1 \le i \le m \implies 0 < p_i \text { and } \sum_{i=1}^m p_i = 1.[/MATH]
[MATH]\therefore p_{min} = \text {min}(p_1,\ ... \ p_m) \implies 0 < p_{min} \le \dfrac{1}{m},[/MATH]
We are interested in the probability of at least one event, where an event is defined to be k > 0 successive occurrences of any one of the m symbols. Obviously, if m = 1, that probability is 1. Equally obviously, if k = 1, the probability is 1. Furthermore, it is also obvious that if k > n, the probability is 0. Thus, for the problem to be non-trivial, we must have
[MATH]2 \le k \le n \text { and } m \ge 2.[/MATH]
[MATH]\therefore i \in \mathbb Z \text { and } 1 \le i \le m \implies 0 < p_i < 1 \text { and } \sum_{i=1}^m p_i = 1.[/MATH]
If we have at least one event, there must be a first event. That could start at the first selection, the second selection, and so so on all the way up to the n+1-k selection. (If n = 10 and k = 2, we cannot start a sequence of length 2 in position 10; the latest it can start is position 9.)
Let [MATH]q_j[/MATH] be the probability that the first event starts at selection j > 0,
and let [MATH]c_j[/MATH] be the probability that the first event starts on or before selection j > 0.
Define [MATH]c_0 = 0. [/MATH]
Obviously, [MATH] j > 0 \implies c_j = q_j + c_{j-1}.[/MATH]
Let r = the probability that a given selection is the start of a sequence of k occurrences of one of the m symbols.
[MATH]\therefore r = \sum_{i=1}^m p_i^k \implies 0 < r < 1 \ \because \ k \ge 2.[/MATH]
The probability that the FIRST sequence of k occurrences of one of the m symbols occurs at position j is a conditional probability equal to the probability that the first such sequence did not occur earlier times the probability that such a sequence starts at this position. In other words,
[MATH]q_j = (1 - c_{j-1}) * r.[/MATH]
And the probability that we want is [MATH]c_{n+1-k}.[/MATH]
This is enough to set up a VERY simple spreadsheet to solve problems where m, k, n and the probabilities associated with each symbol are known. Moreover, messing around with such a spreadsheet seems to confirm my intuition that if n >> k, the probability of at least one sequence of at least length k of at least one symbol approaches 1. For example, with m = 4, k = 4, and probabilities for individual symbols of 0.5, 0.25, 0.15, and 0.1, the probability of at least one sequence of at least length k of at least one symbol > 99% if n > 66. It exceeds 50% if n > 10.
My difficulty has been to prove that as k/n approaches 0, the probability approaches 1, let alone say anything about the rate of convergence. I get a pattern that is similar to (but obviously different from) Pascal's triangle, but I have not been able to generalize it.
[MATH]q_1 = (1 - c_0) * r = (1 - 0)r = r.[/MATH]
[MATH]c_1 = q_1 + c_0 = r + 0 = r.[/MATH]
[MATH]q_2 = (1 - c_1) * r = (1 - r)r = r - r^2.[/MATH]
[MATH]c_2 = q_2 + c_1 = r - r^2 + r = 2r - r^2.[/MATH]
[MATH]q_3 = (1 - c_2) * r = r - r(2r - r^2) + r^3= r - 2r^2 + r^3.[/MATH]
[MATH]c_3 = q_3 + c_2 = r - 2r^2 + r^3 + 2r - r^2 = 3r - 3r^2 + r^3.[/MATH]
[MATH]q_4 = (1 - c_3) * r = r - r(3r - 3r^2 + r^3) =\\
r - 3r^2 + 3r^3 - r^4.[/MATH][MATH]c_4 = q_4 + c_3=\\
r - 3r^2 + 3r^3 - r^4 + 3r - 3r^2 + r^3 = 4r - 6r^2 + 4r^3 - r^4.[/MATH][MATH]q_5 = (1 - c_4) * r = r - r(4r - 6r^2 + 4r^3 - r^4) =\\
r - 4r^2 + 6r^3 - 4r^4 + r^5.[/MATH][MATH]c_5 = q_5 + c_4=\\
r - 4r^2 + 6r^3 - 4r^4 + r^5 + 4r - 6r^2 + 4r^3 - r^4 = \\
5r - 10r^2 +10r^3 - 5r^4 + r^5.[/MATH][MATH]q_6 = (1 - c_5) * r = r - r(5r - 10r^2 +10r^3 - 5r^4 + r^5) =\\
r - 5r^2 + 10r^3 - 10r^4 + 5r^5 - r^6.[/MATH][MATH]c_6 = q_6 + c_5= r - 5r^2 + 10r^3 - 10r^4 + 5r^5 - r^6 \\
+ 5r - 10r^2 +10r^3 - 5r^4 + r^5 = \\
6r - 15r^2 + 20r^3 - 15r^4 + 6r^5 - r^6.[/MATH]Does anyone see a way to generalize that?