Figure 1: Computation of indices by the k-merCount algorithm. The figure illustrates the computation of successive indices of every 3-mer and its reverse complement of the sequence ACCAGTC. (a) The radix-3 representation of the sequence and its reverse compliment and the indices of each successive 3-mer and its reverse compliment of the sequence. (b) Computation of successive indices. (c) Graphical illustration of the computation of the index of a 3-mer and its reverse compliment in constant time given the index and reverse compliment index of its predecessor.
Figure 2: The k-mer tree. The figure illustrates the components and terminology associated with a k-mer tree.
Figure 3: A full 4-mer-tree. A full 4-mer-tree for counting the frequencies of distinct 4-mers in a sequence comprising of characters, and . As the tree is a complete 4-mer tree, it has only complete branches. Every branch represents a distinct 4-mer. Every leaf node points to an external branch node that is the external branch node of the sub branch representing the suffix of the 4-mer along the branch to the leaf node. In analogy to the k-mer-Count algorithm described in Section , moving from a leaf node to the external branch node represents a shift to the left by one position of the radix b digit of length k (in this case, a binary digit of length 4). Proceeding down to a leaf node is analogous to adding a new digit as a lowest order digit of the k-digit radix-b number (4-digit binary number).
Figure 4: Example: step by step creation of a k-mer tree. The figure demonstrates the creation of a k-mer tree for counting the frequencies of different 3-mers in sequence AATATTATAA. For simplicity we assume Σ = {A; T }. In the full 2-ary tree that is generated, we assume that, the left cell of a branch node represents A and the right cell represents T (in practice, these will be pointers with only implicit reference to the nucleotide). Every time a leaf node is added or visited, the counter for its respective k-mer and that of its reverse compliment are incremented by one. In the initialization step in (a), the branch AAT is added followed by the sub-branch of its suffix AT. A pointer is established from the leaf node of AAT to the bottom node (external branch node) of the sub-branch AT. The branch ATT, which is the reverse compliment AAT, along with the sub-branch of its suffix TT are added next. A pointer is established between the leaf nodes of the complimentary branches AAT and ATT. The prefix pointer (the pointer pointing to the external branch node to which the next leaf node will be added, or, from which the next leaf node will be visited) is moved to the external branch node of sub-branch AT. In (b), the leaf node A is added from the external branch node of sub-branch AT establishing the branch ATA. Similar to (a), its suffix TA and reverse compliment TAT are added. The prefix pointer is moved to the sub-branch TA. In (c) leaf node T is visited as it is already in sub-branch TA, (incrementing the frequency counter of 3-mer TAT and its reverse compliment ATA), and the prefix pointer is moved to the sub-branch AT. In (d), similar to (c), the leaf node T is visited and the prefix pointer is moved to the sub-branch TT. In (e), leaf node A is added establishing the 3-mer, TTA, and its reverse compliment TAA. The prefix pointer is moved to the sub-branch TA. Like above, in (f), leaf node T is visited. In (g), leaf node A is visited and in (h) leaf node A is visited.
Figure 5: Memory utilization of the P/GK-MER-COUNT algorithms. Memory utilization of the (A) GK-MER-COUNT algorithm for different values of k and (B) PGK-MER-COUNT algorithm, with = 2, (k0; k) = (22; 120); (16; 120) and (16; 50), on the human, mouse and 50 archaea genome sequences listed in Supplemental Table 01. (C) Comparative histogram of 50-mer and 10-mer frequency counts.
Figure 6: Performance of the GK-MER-COUNT algorithm. Performance of the GK-MER-COUNT algorithm on the 50 archaea genome sequences listed in Supplemental Table 01.
Figure 7: Performance of the PGK-MER-COUNT algorithm. Performance of the PGK-MER-COUNT algorithm, with = 2, on the 50 archaea genome sequences listed in Supplemental Table 01.