How to count # of patterns in a sequence

rainie24 · Jan 3, 2020

Definition: Sequence 'ab' has patterns 'a', 'b', 'ab' but not 'ba', or by codes
def has_pattern(pat, lst):
pat_i = 0
for el in lst:
if el == pat[pat_i]:
pat_i += 1
if pat_i == len(pat):
return True
continue
return False

What I want is given a pattern and a sequence, the probability of a permutation of the sequence has the pattern. This function does the job
def prob(pat, lst):
all_perm = list(itertools.permutations(lst, len(lst)))
return Fraction(sum((has_pattern(pat, p) for p in all_perm)), len(all_perm))

but it's slow and I want a general formula so I can get, for example, prob(‘abc’, ‘aabbcc’) = 47/90, quickly.
What I can see is
1. Letters in the sequence but not in the pattern doesn't matter. prob('ab', 'abc') = prob('ab', 'ab')
2. The order of pattern doesn't matter. prob('ab', 'ab') = prob('ba', 'ab')
3, then?
Thanks.
Z K Y

Mr. Bland · Jan 3, 2020

Welcome to Free Math Help! I intend to take a look at this later on tonight when the daily grind has worn down.

I'm no Python wiz, but it looks as though a pattern matches a sequence if the sequence contains all of the elements of the pattern, in order, though not necessarily consecutively. Is that correct?

The documentation for iterable.permutations states that it returns all permutations of the elements in the iterable, without duplicating them. For example, the set ABC has permutations ABC, ACB, BAC, BCA, CAB and CBA.

Since the set "aabbcc" has six elements, it has [MATH]6! = 720[/MATH] permutations. In your example, you state that the result is [MATH]\frac{47}{90}[/MATH]. Is this to say that, of the 720 permutations in the example, 376 match the pattern?

Dr.Peterson · Jan 3, 2020

rainie24 said:
What I want is given a pattern and a sequence, the probability of a permutation of the sequence has the pattern.

Are you asking for a better program to go through all of the (perhaps impossibly many) permutations of a string, or for a mathematical formula that will determine this probability without listing at all?

If you want a formula, we'll have to first define more fully exactly what it is that you want. I take it to be this: the probability that a random permutation of a given string "lst" of N elements (which may contain repeated elements) contains the given string "pat" as a substring (that is, consecutively somewhere in the string, in the given order). Presumably repetition is also allowed in "pat".

To put it another way, since you will be permuting "lst", we could take the input as a multiset (that is, a set whose elements have specified multiplicities, e.g. your aabbcc would be 2 a's, 2 b's, and 2 c's), and we are asking how many of the permutations of that multiset contain "pat" as a substring, which we would then divide by the total number of permutations.

Is that right, or have I missed something?

rainie24 · Jan 3, 2020

Mr. Bland said:
Welcome to Free Math Help! I intend to take a look at this later on tonight when the daily grind has worn down.

I'm no Python wiz, but it looks as though a pattern matches a sequence if the sequence contains all of the elements of the pattern, in order, though not necessarily consecutively. Is that correct?

The documentation for iterable.permutations states that it returns all permutations of the elements in the iterable, without duplicating them. For example, the set ABC has permutations ABC, ACB, BAC, BCA, CAB and CBA.

Since the set "aabbcc" has six elements, it has [MATH]6! = 720[/MATH] permutations. In your example, you state that the result is [MATH]\frac{47}{90}[/MATH]. Is this to say that, of the 720 permutations in the example, 376 match the pattern?

All you said is right. 367 of the permutations have pattern abc.

rainie24 · Jan 3, 2020

Dr.Peterson said:
Are you asking for a better program to go through all of the (perhaps impossibly many) permutations of a string, or for a mathematical formula that will determine this probability without listing at all?

If you want a formula, we'll have to first define more fully exactly what it is that you want. I take it to be this: the probability that a random permutation of a given string "lst" of N elements (which may contain repeated elements) contains the given string "pat" as a substring (that is, consecutively somewhere in the string, in the given order). Presumably repetition is also allowed in "pat".

To put it another way, since you will be permuting "lst", we could take the input as a multiset (that is, a set whose elements have specified multiplicities, e.g. your aabbcc would be 2 a's, 2 b's, and 2 c's), and we are asking how many of the permutations of that multiset contain "pat" as a substring, which we would then divide by the total number of permutations.

Is that right, or have I missed something?

I'm looking for a faster implementation. And as you pointed out, the # of permutations could be huge so a fast implementation is prob. based on (at least partially) a formula. As the prob doesn't depend on the order of pattern nor the sequence, so my vision for the final format of the function is it takes two Counters.
prob({a: 1, b: 1, c:1}, {a:2, b:2, c:2}) = 367/720

Dr.Peterson · Jan 3, 2020

rainie24 said:
What I can see is
1. Letters in the sequence but not in the pattern doesn't matter. prob('ab', 'abc') = prob('ab', 'ab')
2. The order of pattern doesn't matter. prob('ab', 'ab') = prob('ba', 'ab')

I'm still just trying to clarify, before I work on the problem itself, to make sure I'm going to solve the right problem. Here, you have made some claims that need proof, and that seem contrary to my understanding of what you have said.

prob('ab', 'abc') = 1/3: {abc, acb, bac, bca, cab, cba}
prob('ab', 'ab') = 1/2: {ab, ba}, so this seems to be wrong.
prob('ab', 'ab') = 1/2: {ab, ba}
prob('ba', 'ab') = 1/2: {ab, ba}, so this may be right, but is it true when there are repeating elements or extra elements?

Dr.Peterson · Jan 3, 2020

Mr. Bland said:
Since the set "aabbcc" has six elements, it has [MATH]6! = 720[/MATH] permutations. In your example, you state that the result is [MATH]\frac{47}{90}[/MATH]. Is this to say that, of the 720 permutations in the example, 376 match the pattern?

No, there are [MATH]\frac{6!}{2!2!2!} = 90[/MATH] permutations of this multiset. And it seems to me that 23 of them contain the string "abc" (6 for each position of the abc, but subtracting one that is counted twice), making a probability of 23/90. I could be wrong, though. And I seem to be interpreting something differently.

rainie24 · Jan 3, 2020

Dr.Peterson said:
I'm still just trying to clarify, before I work on the problem itself, to make sure I'm going to solve the right problem. Here, you have made some claims that need proof, and that seem contrary to my understanding of what you have said.

prob('ab', 'abc') = 1/3: {abc, acb, bac, bca, cab, cba}
prob('ab', 'ab') = 1/2: {ab, ba}, so this seems to be wrong.

prob('ab', 'ab') = 1/2: {ab, ba}
prob('ba', 'ab') = 1/2: {ab, ba}, so this may be right, but is it true when there are repeating elements or extra elements?

Letters in the pattern don't need to appear in the sequence consecutively, so ab is a pattern of acb too. You can run my codes to check the definition.
has_pattern('ab','acb') will return True.

rainie24 · Jan 3, 2020

Dr.Peterson said:
No, there are [MATH]\frac{6!}{2!2!2!} = 90[/MATH] permutations of this multiset. And it seems to me that 23 of them contain the string "abc" (6 for each position of the abc, but subtracting one that is counted twice), making a probability of 23/90. I could be wrong, though. And I seem to be interpreting something differently.

720 is when all letters are distinguishable. You're right technically it's 90, but I get what Bland was trying to do. Maybe it's easier to approach the problem by making things distinguishable.
The definition might not be clear from my description so when in doubt, pls refer to the codes. prob('abc', 'aabbcc') will give you 47/90.

Dr.Peterson · Jan 3, 2020

rainie24 said:
Letters in the pattern don't need to appear in the sequence consecutively, so ab is a pattern of acb too. You can run my codes to check the definition.
has_pattern('ab','acb') will return True.

To me, "pattern matching" implies consecutive, and your original example was insufficient. And in my clarification question, I intentionally used words like "consecutive" and "string" to give you a chance to say explicitly whether you intended that or not.

But I did eventually run through your code to see what it actually does, and realized what your intent was.

Unfortunately, what you want seems likely to be a lot harder than if it were an actual pattern within a string. I've played with it a bit, but nothing has jumped out at me yet.

Cubist · Jan 3, 2020

For people who don't know python, I think the algorithm matches like this (here are the first 10 of 47 matches)

aabbcc
aabcbc
aabccb
aacbbc
aacbcb

aaccbb <- this does not match

ababcc
abacbc
abaccb
abbacc
abbcac
...

Mr. Bland · Jan 6, 2020

I didn't get a chance to look at this over the weekend. Hopefully I can give it some attention soon.

Dr.Peterson said:
No, there are [MATH]\frac{6!}{2!2!2!} = 90[/MATH] permutations of this multiset.

The problem as stated is in the general sense, and the "aabbcc" case was just given as an example. It's important to remember that when working with algorithms, you generally aren't looking to optimize for specific cases unless they behave differently (piecewise functions, error checking, etc).

rainie24 · Mar 3, 2020

I got some progress and just need the last leap. I notice that having a pattern, for exmaple 'abc', can be equivalently expressed in regex as
"^[abc]*a[bc]*b[ac]*c[ab]*$". Every letter can be followed by a group of letters other than it, plus a group of all letters at the beginning. The problem turns to how to assign letters to allowed groups. This function will calculate the count of pattern 'abc'.

Python:

def abc_ct(seq):
  a, b, c = 'a', 'b', 'c'
  if not isinstance(seq, Counter):
    seq = Counter(seq)
  seq = Counter({k: v for k, v in seq.items() if k in ('a', 'b', 'c'})
  n_perm = math.factorial(sum(seq.values()))
  ct = 0  # ^[abc]*a[bc]*b[ac]*c[ab]*$
  for a1 in range(seq[a]):  # a in ^[abc]
    for b1 in range(seq[b]):  # b in ^[abc]
      for c1 in range(seq[c]):  # c in ^[abc]
        for b2 in range(seq[b] - b1):  # b in [bc]
          for c2 in range(seq[c] - c1):  # c in [bc]
            for a2 in range(seq[a] - a1):  # a in [ac]
              abc_ct = math.factorial(a1 + b1 + c1) // math.factorial(a1) // math.factorial(b1) // math.factorial(c1)
              bc_ct = binomial(b2 + c2, b2)
              ac_ct = binomial(a2 + seq[c] - c1 - c2 - 1, a2)
              ab_ct = binomial(seq[a] - a1 - a2 + seq[b] - b1 - b2 - 2, seq[a] - a1 - a2 - 1)
              p = abc_ct * bc_ct * ac_ct * ab_ct
              ct += p
  ct *= math.factorial(seq[a]) * math.factorial(seq[b]) * math.factorial(seq[c])
  return Fraction(ct, n_perm)

More generally, this function works for any case.

Python:

def prob_general(pat, seq):
  if not isinstance(pat, Counter):
    pat = Counter(pat)
  if not isinstance(seq, Counter):
    seq = Counter(seq)
  seq = Counter({k: v for k, v in seq.items() if k in pat})
  pat_len = sum(pat.values())
  ct = 0
  elements = pat.keys()
  partitions = itertools.product(*(partition_n(seq[e] - pat[e], pat_len - pat[e] + 1) for e in elements))
  for part in partitions:
    ct += ct_for_one_partition(elements, part, pat)
  return Fraction(ct, grp_ct(seq.values()))

# Adapted from https://stackoverflow.com/a/18503391.
def partition_n(n, k):
  """n is the integer to partition, k is the length of partitions."""
  if k == 1:
    if n >= 0:
      yield (n,)
  elif k > 1:
    for i in range(n + 1):
      for result in partition_n(n - i, k - 1):
        yield (i,) + result
        
def grp_ct(ele_cts):
  return math.factorial(sum(ele_cts)) // reduce(mul, map(math.factorial, ele_cts))
def ct_for_one_partition(elements, part, pat):
  part = {e: iter(part[i]) for i, e in enumerate(elements)}
  ct = 1
  for ele in pat:
    other_ele = zip(*(part[e] for e in pat if e != ele))
    for _ in range(pat[ele]):
      ct *= grp_ct(next(other_ele))
  ct *= grp_ct(next(zip(*part.values())))
  return ct

I can't further reduce the nested summation to a formula. A fast program is good enough for me, but maybe someone interested can start from here.

How to count # of patterns in a sequence

rainie24

New member

Mr. Bland

Junior Member

Dr.Peterson

Elite Member

rainie24

New member

rainie24

New member

Dr.Peterson

Elite Member

Dr.Peterson

Elite Member

rainie24

New member

rainie24

New member

Dr.Peterson

Elite Member

Cubist

Senior Member

Mr. Bland

Junior Member

rainie24

New member