Answers
to Given a string of a million numbers, return all repeating 3 digit numbers

nr: #1 dodano: 20171130 20:11
Constant time isn't possible. All 1 million digits need to be looked at at least once, so that is a time complexity of O(n), where n = 1 million in this case.
For a simple O(n) solution, create an array of size 1000 that represents the number of occurrences of each possible 3 digit number. Advance 1 digit at a time, first index == 0, last index == 999997, and increment array[3 digit number] to create a histogram (count of occurrences for each possible 3 digit number). Then output the content of the array with counts > 1.

nr: #2 dodano: 20171130 20:11
The simple O(n) solution would be to count each 3digit number:
for nr in range(1000):
cnt = text.count('%03d' % nr)
if cnt > 1:
print '%03d is found %d times' % (nr, cnt)
This would search through all 1 million digits 1000 times.
Traversing the digits only once:
counts = [0] * 1000
for idx in range(len(text)2):
counts[int(text[idx:idx+3])] += 1
for nr, cnt in enumerate(counts):
if cnt > 1:
print '%03d is found %d times' % (nr, cnt)
Timing shows that iterating only once over the index is twice as fast as using count .

nr: #3 dodano: 20171130 21:11
As per my understanding, you cannot have the solution in a constant time. It will take at least one pass over the million digit number (assuming its a string). You can have a 3digit rolling iteration over the digits of the million length number and increase the value of hash key by 1 if it already exists or create a new hash key (initialized by value 1) if it doesn't exists already in the dictionary.
The code will look something like this:
def calc_repeating_digits(number):
hash = {}
for i in range(len(str(number))2):
current_three_digits = number[i:i+3]
if current_three_digits in hash.keys():
hash[current_three_digits] += 1
else:
hash[current_three_digits] = 1
return hash
You can filter down to the keys which have item value greater than 1.

nr: #4 dodano: 20171130 21:11
I would solve the problem as follows:
def find_numbers(str_num):
final_dict = {}
buffer = {}
for idx in range(len(str_num)  3):
num = int(str_num[idx:idx + 3])
if num not in buffer:
buffer[num] = 0
buffer[num] += 1
if buffer[num] > 1:
final_dict[num] = buffer[num]
return final_dict
Applied to your example string, this yields:
>>> find_numbers("123412345123456")
{345: 2, 234: 3, 123: 3}
This solution runs in O(n) for n being the length of the provided string, and is, I guess, the best you can get.

nr: #5 dodano: 20171130 21:11
Here is a NumPy implementation of the "consensus" O(n) algorithm: walk through all triplets and bin as you go. The binning is done by upon encountering say "385", adding one to bin[3, 8, 5] which is an O(1) operation. Bins are arranged in a 10x10x10 cube. As the binning is fully vectorized there is no loop in the code.
def setup_data(n):
import random
digits = "0123456789"
return dict(text = ''.join(random.choice(digits) for i in range(n)))
def f_np(text):
# Get the data into NumPy
import numpy as np
a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8)  ord('0')
# Rolling triplets
a3 = np.lib.stride_tricks.as_strided(a, (3, a.size2), 2*a.strides)
bins = np.zeros((10, 10, 10), dtype=int)
# Next line performs O(n) binning
np.add.at(bins, tuple(a3), 1)
# Filtering is left as an exercise
return bins.ravel()
def f_py(text):
counts = [0] * 1000
for idx in range(len(text)2):
counts[int(text[idx:idx+3])] += 1
return counts
import numpy as np
import types
from timeit import timeit
for n in (10, 1000, 1000000):
data = setup_data(n)
ref = f_np(**data)
print(f'n = {n}')
for name, func in list(globals().items()):
if not name.startswith('f_') or not isinstance(func, types.FunctionType):
continue
try:
assert np.all(ref == func(**data))
print("{:16s}{:16.8f} ms".format(name[2:], timeit(
'f(**data)', globals={'f':func, 'data':data}, number=10)*100))
except:
print("{:16s} apparently crashed".format(name[2:]))
Unsurprisingly, NumPy is a bit faster than @Daniel's pure Python solution on large data sets. Sample output:
# n = 10
# np 0.03481400 ms
# py 0.00669330 ms
# n = 1000
# np 0.11215360 ms
# py 0.34836530 ms
# n = 1000000
# np 82.46765980 ms
# py 360.51235450 ms

nr: #6 dodano: 20171201 01:12
Here's my answer:
from timeit import timeit
from collections import Counter
import types
import random
def setup_data(n):
digits = "0123456789"
return dict(text = ''.join(random.choice(digits) for i in range(n)))
def f_counter(text):
c = Counter()
for i in range(len(text)2):
ss = text[i:i+3]
c.update([ss])
return (i for i in c.items() if i[1] > 1)
def f_dict(text):
d = {}
for i in range(len(text)2):
ss = text[i:i+3]
if ss not in d:
d[ss] = 0
d[ss] += 1
return ((i, d[i]) for i in d if d[i] > 1)
def f_array(text):
a = [[[0 for _ in range(10)] for _ in range(10)] for _ in range(10)]
for n in range(len(text)2):
i, j, k = (int(ss) for ss in text[n:n+3])
a[i][j][k] += 1
for i, b in enumerate(a):
for j, c in enumerate(b):
for k, d in enumerate(c):
if d > 1: yield (f'{i}{j}{k}', d)
for n in (1E1, 1E3, 1E6):
n = int(n)
data = setup_data(n)
print(f'n = {n}')
results = {}
for name, func in list(globals().items()):
if not name.startswith('f_') or not isinstance(func, types.FunctionType):
continue
print("{:16s}{:16.8f} ms".format(name[2:], timeit(
'results[name] = f(**data)', globals={'f':func, 'data':data, 'results':results, 'name':name}, number=10)*100))
for r in results:
print('{:10}: {}'.format(r, sorted(list(results[r]))[:5]))
The array lookup method is very fast (even faster than @paulpanzer's numpy method!). Of course, it cheats since it isn't technicailly finished after it completes, because it's returning a generator. It also doesn't have to check every iteration if the value already exists, which is likely to help a lot.
n = 10
counter 0.10595780 ms
dict 0.01070654 ms
array 0.00135370 ms
f_counter : []
f_dict : []
f_array : []
n = 1000
counter 2.89462101 ms
dict 0.40434612 ms
array 0.00073838 ms
f_counter : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_dict : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_array : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
n = 1000000
counter 2849.00500992 ms
dict 438.44007806 ms
array 0.00135370 ms
f_counter : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_dict : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_array : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]

nr: #7 dodano: 20171201 02:12
You got off lightly, you probably don't want to be working for a hedge fund where the quants don't understand basic algorithms :)
There is no way to process an arbitrarilysized data structure in O(1) if, as in this case, you need to visit every element at least once. The best you can hope for is O(n) in this case, where n is the length of the string.
Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that's not usually how people use complexity analysis.
It appears to me you could have impressed them in a number of ways.
First, by informing them that it's not possible to do it in O(1) , unless you use the "suspect" reasoning given above.
Second, by showing your elite skills by providing Pythonic code such as:
inpStr = '123412345123456'
# O(1) array creation.
freq = [0] * 1000
# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr)  2)]:
freq[val] += 1
# O(1) output of relevant array values.
print ([(num, freq[num]) for num in range(1000) if freq[num] > 1])
This outputs:
[(123, 3), (234, 3), (345, 2)]
though you could, of course, modify the output format to anything you desire.
And, finally, by telling them there's almost certainly no problem with an O(n) solution, since the code above delivers results for a onemilliondigit string in well under half a second. It seems to scale quite linearly as well, since a 10,000,000character string takes 3.5 seconds and a 100,000,000character one takes 36 seconds.
And, if they need better than that, there are ways to parallelise this sort of stuff that can greatly speed it up.
Not within a single Python interpreter of course, due to the GIL, but you could split the string into something like (overlap indicated by vv is required to allow proper processing of the boundary areas):
vv
123412 vv
123451
5123456
You can farm these out to separate workers and combine the results afterwards.
The splitting of input and combining of output are likely to swamp any saving with small strings (and possibly even milliondigit strings) but, for much larger data sets, it may well make a difference. My usual mantra of "measure, don't guess" applies here, of course.
This mantra also applies to other possibilities, such as bypassing Python altogether and using a different language which may be faster.
For example, the following C code, running on the same hardware as the earlier Python code, handles a hundred million digits in 0.6 seconds, roughly the same amount of time as the Python code processed a one million. In other words, much faster:
#include <stdio.h>
#include <string.h>
int main(void) {
static char inpStr[100000000+1];
static int freq[1000];
// Set up test data.
memset(inpStr, '1', sizeof(inpStr));
inpStr[sizeof(inpStr)1] = '\0';
// Need at least three digits to do anything useful.
if (strlen(inpStr) <= 2) return 0;
// Get initial feed from first two digits, process others.
int val = (inpStr[0]  '0') * 10 + inpStr[1]  '0';
char *inpPtr = &(inpStr[2]);
while (*inpPtr != '\0') {
// Remove hundreds, add next digit as units, adjust table.
val = (val % 100) * 10 + *inpPtr++  '0';
freq[val]++;
}
// Output (relevant part of) table.
for (int i = 0; i < 1000; ++i)
if (freq[i] > 1)
printf("%3d > %d\n", i, freq[i]);
return 0;
}

nr: #8 dodano: 20171201 06:12
As mentioned in another answer, you cannot do this algorithm in constant time, because you must look at at least n digits. Linear time is the fastest you can get.
However, the algorithm can be done in O(1) space. You only need to store the counts of each 3 digit number, so you need an array of 1000 entries. You can then stream the number in.
My guess is that either the interviewer misspoke when they gave you the solution, or you misheard "constant time" when they said "constant space."

nr: #9 dodano: 20171201 11:12
A million is small for the answer I give below. Expecting only that you have to be able to run the solution in the interview, without a pause, then The following works in less than two seconds and gives the required result:
from collections import Counter
def triple_counter(s):
c = Counter(s[n3: n] for n in range(3, len(s)))
for tri, n in c.most_common():
if n > 1:
print('%s  %i times.' % (tri, n))
else:
break
if __name__ == '__main__':
import random
s = ''.join(random.choice('0123456789') for _ in range(1_000_000))
triple_counter(s)
Hopefully the interviewer would be looking for use of the standard libraries collections.Counter class.
Parallel execution version
I wrote a blog post on this with more explanation.

nr: #10 dodano: 20171205 15:12
inputStr = '123456123138276237284287434628736482376487234682734682736487263482736487236482634'
count = {}
for i in range(len(inputStr)  2):
subNum = int(inputStr[i:i+3])
if subNum not in count:
count[subNum] = 1
else:
count[subNum] += 1
print count

