docs/index.rst
cows: Collections for wildcard strings
======================================
.. toctree::
:hidden:
data_structures
**cows** (**co**\ llections for **w**\ ildcard **s**\ trings) is a Python
library that provides efficient collection implementations where equality
checking allows for wildcards in both the search string and the strings already
in the collection.
Motivation
----------
cows was developed for a common problem in bioinformatics: given a set of DNA
sequences with the alphabet ``A``, ``T``, ``C``, ``G``, along with a wildcard
``N`` (indicating that the base is unknown), find the unique sequences and
perform some operation on them. Examples of the operation are: counting how
many times each unique sequence occurs and generate a consensus sequence for
each unique sequence.
For a simple example, for counting unique sequences consider the following
input and desired output:
.. code-block:: none
input output
----- ------
ATNG ATNG 2 # Comprised of ATNG and ATCN
ATCN ANNT 1
ANNT GTTC 1
GTTC
Notice this task requires comparing strings with wilcards not just in one
string, but in both. For example, matching ``ATCN`` to ``ATNG`` requires that
the third and fourth characters both be considered wildcards.
Naively one could pairwise compare the sequences, ignore the positions where
either contains an ``N``, and check if all other positions match. However,
this quickly becomes intractable as it scales with the square of the number of
sequences.
cows uses a modified implementation of atrie (:class:`cows.trie`) to reduce
this complexity to scale linearly with the number of sequences.
Provided Data Structures
------------------------
Below are examples for the data structures included with cows. Please see the
documentation in :ref:`data-structures` for detailed
API information.
``cows.List``
^^^^^^^^^^^^^^
A :class:`cows.list` is a simple list implementation where insertion functions
similarly to the builtin ``list`` data structure, but accessor methods take
into account ambiguity. For example:
.. code-block:: python
l = cows.List(['ABCD', 'ABC*', '****', 'DEFG'])
print(l.index('D***'))
The print statement outputs ``2`` since the first match for ``D***`` is at
position 2 (with a value of ``****``).
``cows.Set``
^^^^^^^^^^^^^
A :class:`cows.set` stores unique strings similar to the builtin ``set`` data
structure. Instead of using hashes for equality checks, the underlying
:class:`cows.trie` is used to check if the pattern being inserted matches any
existing member of the set, taking into account wildcards in both. For
example:
.. code-block:: python
import cows
s = cows.Set(wildcard='*')
s.add('ABCD')
s.add('*EFG')
s.add('T')
s.add('ABC*') # Matches ABCD, so not added
s.add('HEF*') # Matches *EFG, so not added
print(s)
Produces:
.. code-block:: none
cows.Set(['*EFG', 'ABCD', 'T'])
``cows.Dict``
^^^^^^^^^^^^^^
cows dictionaries are similar to the builtin ``dict`` type insofar as they are
key/value stores. They have a few key differences, however.
First, when setting a value, if there is an existing (potentially ambiguous)
match already in the dictionary, you can set an ``updater`` function to update
the existing value rather than simply overwrite it. Further, when inserting a
key/value pair, multiple existing keys may match the new key due to ambiguity.
Specifying a ``selector`` function at instantiation lets you define to which of
the matches the ``updater`` should be applied.
See :class:`cows.dictionary` for more detailed information.
.. code-block:: python
import cows
def increment(match, old_value, new_value):
return old_value + new_value
my_dict = cows.Dict(updater=increment)
my_dict['ABC'] = 1
my_dict['DEF'] = 2
my_dict['AB*'] = 10
for k, v in sorted(my_dict.items()):
print('{} --> {}'.format(k, v))
Produces:
.. code-block:: none
ABC --> 11
DEF --> 2
cows.Trie
^^^^^^^^^^
.. note::
Generally the :class:`cows.trie` data structure shouldn't be used
directly. Consider using one of its abstractions.
All other cows data structures are based on the :class:`cows.trie` class. It
allows for ambiguous queries taking into account wildcards both in the query
string and elements in the trie.
An example of it's use:
.. code-block:: python
import cows
t = cows.Trie()
t['ABCD'] = 1
t['DE*G'] = 5
print('Matches for ABC* {}'.format(list(t.get_matches("ABC*"))))
print('Matches for D*FG {}'.format(list(t.get_matches("D*FG"))))
Outputs:
.. code-block:: none
Matches for ABC* [('ABCD', cows.Trie(D, 1))]
Matches for D*FG [('DE*G', cows.Trie(G, 5))]
Performance
-----------
cows is performant, requiring *O(n)* time for insertions and lookups with
an input size of *n* strings. The naive approach which is currently quite
common involves pairwise comparing the sequences in a collection resulting in
*O(n*\ :sup:`2`\ *)*, quickly becoming intractable.