Question: using heap helps boost performance for string rearrange

Question

using heap helps boost performance for string rearrange

Answers 1
Added at 2017-01-05 07:01
Tags
Question

I'm working on the problem below and have posted my code. My question is, my current implementation is using a sort in Python - sorted(sorted_freq, reverse=True). I referred some other implementation which is using a max heap (http://www.geeksforgeeks.org/rearrange-a-string-so-that-all-same-characters-become-at-least-d-distance-away/). I think they have the same time complexity which is O(n*log n) (if I calculate wrong, please feel free to correct). So I'm wondering if there is any benefit in terms of performance of using heap other than sort?

My code is written in Python 2.7.

Problem:

Given a string and a positive integer d. Some characters may be repeated in the given string. Rearrange characters of the given string such that the same characters become d distance away from each other. Note that there can be many possible rearrangements, the output should be one of the possible rearrangements. If no such arrangement is possible, that should also be reported. Expected time complexity is O(n) where n is length of input string.

Examples:

Input:  "abb", d = 2
Output: "bab"

Input:  "aacbbc", d = 3
Output: "abcabc"

Input: "geeksforgeeks", d = 3
Output: egkegkesfesor

Input:  "aaa",  d = 2
Output: Cannot be rearranged

Source code:

from collections import defaultdict
def rearrange(source, distance):
    freq = defaultdict(int)
    for c in source:
        freq[c] += 1
    sorted_freq = []
    for k,v in freq.items():
        sorted_freq.append((v,k))
    sorted_freq = sorted(sorted_freq, reverse=True)
    result = [0] * len(source)
    for i, (v,k) in enumerate(sorted_freq):
        index = i
        while index < len(result) and result[index] != 0:
            index += 1
        if index == len(result):
            return None
        count = v
        while count > 0:
            result[index] = k
            index += distance
            count -= 1
            if index >= len(source) and count > 0:
                return None
    return result

if __name__ == "__main__":
    print rearrange('abb', 2)
    print rearrange('aacbbc', 3)
    print rearrange('geeksforgeeks', 3)
    print rearrange('aaa', 2)
Answers to

using heap helps boost performance for string rearrange

nr: #1 dodano: 2017-01-05 08:01

The time complexity of the best algorithm presented at the link is O(n + m log m) where m is the number of unique characters in the input string. As it was mentioned, since m always less the total number of letters in the alphabet (which is a fixed constant), if m is small compared to n the time complexity can be considered O(n). There's no difference in time complexity when O(m log m) sorting algorithm is used to sort the frequencies instead of a heap.

Note that your implementation has time complexity O(nm) since you initialize index with i in every loop. Here's an alternative implementation using Counter instead of defaultdict which has the issue fixed and short comparison of the performance in degenerate case:

from collections import Counter

def rearrange2(s, dist):
    start = 0
    result = [None] * len(s)
    for char, count in Counter(s).most_common():
        while result[start]:
            start += 1
        end = start + dist * (count - 1) + 1
        if end > len(s):
            return None
        for i in xrange(start, end, dist):
            result[i] = char

    return ''.join(result)


def rearrange3(s, dist):
    start = 0
    result = [None] * len(s)
    for char, count in sorted(Counter(s).items(), key=lambda x: x[1], reverse=True):
        while result[start]:
            start += 1
        end = start + dist * (count - 1) + 1
        if end > len(s):
            return None
        for i in xrange(start, end, dist):
            result[i] = char

    return ''.join(result)

if __name__ == '__main__':
    import timeit
    print timeit.timeit("rearrange(src,2)", setup="from __main__ import rearrange; src='a'*10000 + 'b'*10000 + 'cdefghijk'", number=100)
    print timeit.timeit("rearrange2(src,2)", setup="from __main__ import rearrange2; src='a'*10000 + 'b'*10000 + 'cdefghijk'", number=100)
    print timeit.timeit("rearrange3(src,2)", setup="from __main__ import rearrange3; src='a'*10000 + 'b'*10000 + 'cdefghijk'", number=100)

Output:

3.23630073078
0.756645293244
0.753287190129

Update: most_common uses heapq.nlargest under the hood which equals to heapsort in case n is the length of given iterable. As can be seen from the results above there's no real difference. Results of course depends on the size of the data and number of unique characters.

Source Show
◀ Wstecz