Counting Overlapping Elements in Python
Wed 16 April 2014
Say you have a list of lists, like:
matrix = [[1, 2, 3], [4, 5, 6], [1, 2, 6]]
You want to know how many times pairs of numbers occur in the same list, or overlap. In the example above, you'd have:
- 1 & 2: paired twice
- 1 & 3: paired once
- 2 & 3: paired once
- 4 & 5: paired once
- 4 & 6: paired once
- 5 & 6: paired once
- 1 & 6: paired once
- 2 & 6: paired once
You'd like some code that tells you that there's one pairing that occur twice and seven pairings that occur once. Stack Overflow will tell you how to do it:
from collections import Counter
from itertools import chain, combinations
Counter(chain.from_iterable(combinations(row, 2) for row in matrix))
That works, but it's completely confusing. I'm going to break down what it's doing, beginning
with combinations(row, 2)
.
Producing combinations
combinations()
produces an iterator containing the pairs
in an input iterable (e.g. a list).
The code below uses combinations
to generate all the pairs in the matrix's first
row. Each tuple represents one pairing; for example, (1, 2)
indicates that
1 and 2 were in the same row one time.
>>> x = combinations(matrix[0], 2)
>>> print x
<itertools.combinations object at 0x10d1719f0>
>>> print list(x)
[(1, 2), (1, 3), (2, 3)]
To find the combinations in all of the rows in the matrix, we apply combinations()
to each row through a generator comprehension. A generator comprehension has the
same syntax as a list comprehension but produces a generator instead of a list.
>>> combos = (combinations(row, 2) for row in matrix)
>>> print combos
<generator object <genexpr> at 0x10d168f00>
>>> print [list(x) for x in combos]
[[(1, 2), (1, 3), (2, 3)], [(4, 5), (4, 6), (5, 6)], [(1, 2), (1, 6), (2, 6)]]
Chaining the combinations
Our end goal is to count how many there are of each tuple; we should end up with one
that occurs twice ((1, 2)
) and seven that occur once. By inspecting the output above,
we can see that this is the case, but Python can't because the tuples aren't all
in the same combinations object. Enter chain.from_iterable()
, a function that will combine each of the combinations
objects in
combos
into one giant iterator:
>>> chained_combos = chain.from_iterable(combos)
>>> list(chained_combos)
[(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (1, 2), (1, 6), (2, 6)]
Counting the pairs in the chain
Now we can create a Counter
object to tally the frequency of each pairing.
A Counter
is a type of dictionary used for creating tallies: the keys are the items
you're counting, and the values are the tallies.
>>> counts = Counter(chained_combos)
>>> print counts
Counter({(1, 2): 2, (1, 3): 1, (4, 6): 1, (4, 5): 1, (5, 6): 1, (2, 6): 1, (2, 3): 1, (1, 6): 1})
>>> print counts.values()
[2, 1, 1, 1, 1, 1, 1, 1]
Tallying the overall frequencies
Finally, to count the overall frequencies, we use a second counter:
>>> overall_counts = Counter(counts.values())
>>> print overall_counts
Counter({1: 7, 2: 1})
Ta da! 7 pairs together once, 1 pair together twice.
This can be easily extended from
pairs to trios, quartets, and larger groups by replacing combinations(row, 2)
with
combinations(row, <your-size>)
.
Finally, an important note: this will only work if the values inside each tuple
are always listed
in the same order. (1, 2)
is not the same as (2, 1)
.