Google Machine Learning Engineer Python Interview

preview_player
Показать описание
Preparing for an interview as a Machine Learning Engineer can be challenging. Watch how Aly, a Senior Research Data Scientist at Facebook, breezed through a Python coding interview question from Google that requires a dictionary representing a histogram of the dataset.

Full video on Interview Query question page:

Quick Links:
00:00 Question
00:25 Clarifying Questions
01:35 Solution Walkthrough

More from Jay:

#DataScientists #DataScience #InterviewPrep #MockInterview
Рекомендации по теме
Комментарии
Автор

Nicely done!

Just to note that none of the tests cover cases where a bin would have zero values.

At a quick glance, it could be remedied by initializing a `count` variable to zero (instead of`histogram[bucket_str] = 0`) before the last for-loop.
Once you break out of the loop, you can check if the count is non-zero before adding it to the histogram.

Snorrason
Автор

The number of bins should be (max -min)/x

yemanbrhane
Автор

The solution is wrong... and it's quite easy to test how :
dataset = [7, 8, 9]
x = 2
returned : {7-9 : 3}

here's a pretty much 'coverall' solution using no libraries :

def automatic_histogram(dataset, x):
histogram = {}

if len(dataset)==0:
return histogram

if len(set(dataset))>x:
return histogram

min_number = dataset[0]
max_number = dataset[0]

for i in dataset:
if i < min_number:
min_number = i
if i > max_number:
max_number = i

if min_number==max_number:
return

width = (max_number-min_number)//x

edges = [min_number, *[min_number + (i+1)*(width+1) for i in range(x-1)]]
bins = if edges[i]!=edges[i+1]-1 else str(edges[i]) for i in range(x-1)]

if edges[-1]==max_number:
bins.append(str(max_number))
else:


for b in bins:
histogram[b] = 0

for i in range(x-1):
for d in dataset:
if edges[i]<=d<edges[i+1]:
histogram[bins[i]] += 1

histogram[bins[-1]] =

return histogram

tokpot
Автор

Started with not using sort to decrease complexity from Onlogn then added a for loop in while loop for On².

Btw very nice solution, wasn't able to solve myself.

piyushagarwal
Автор

instead of using loops for calculating min and max we could easily use the min(dataset) and max(dataset) and use Counter to count the number of elements in the dataset.
here is my solution:
def histogram(data, x):
res={}
min_value=min(data)
max_value= max(data)

bins= math.ceil((max_value)/x)

counts= Counter(data)
if max_value==min_value:
return
i=min_value
while i <=max_value:
if (i+bins-1)<=max_value:
rng= str(i)+'-'+str(i+bins-1)
elif i==max_value:
rng=str(i)
else:
rng= str(i)+"-"+ str(max_value)
for j in range(i, i+bins):
if not rng in res:
res[rng]=counts[j] if counts[j] else 0
else:
res[rng]+=counts[j]
i+=bins

return res

print(histogram(data, x))

nedaebrahimi
Автор

My solution (using a map for easy value counting, without importing Counter). I believe it is fairly robust, though the question has some ambiguity. In particular around how to handle inexact division (though we can infer from the example that we want to drop extra range from the final bucket most likely. Additionally, the following solution will allow for the 'wrong' number of buckets in cases where we have buckets with value 0. That behavior is unclear from the directions.

def automatic_histogram(dataset: List[int],
x: int) -> dict:
"""
@param dataset(List[int]) the numbers to populate the histogram
@param x(int): the number of bins for the histogram

@return dict: dictionary containing keys that are string ranges spread uniformly
over the dataset (except possibly a smaller final range in the case of odd #
of input values (as a set))
"""

# dataset = sorted(dataset) # make it quick to extract min, max, and iterate
# # using O(nlogn) complexity once
if x < 1:
raise ValueError("Must have at least 1 bin")
if not dataset:
return {}
# O(2n)
min_val = min(dataset)
max_val = max(dataset)

val_range = max_val + 1 - min_val # e.g. (5+1-1) = 5

bin_size = ceil(val_range / x)
print(bin_size)
bin_range = x * bin_size
extra_vals = bin_range - val_range # e.g. 5-3 = 2

histogram = {}
value_map = {} #maps integers to their corresponding key in the histogram
upper_val = None
lower_val = min_val
while upper_val != max_val:
upper_val = lower_val + bin_size - 1
if upper_val > max_val: # bin range may be larger than value range
upper_val = max_val
if lower_val == upper_val:
key = f"{lower_val}"
value_map[lower_val] = key
histogram[key] = 0 # single int value range
else:
key = f"{lower_val}-{upper_val}"
for val in range(lower_val, upper_val+1, 1):
value_map[val] = key
histogram[key] = 0
lower_val += bin_size

for val in dataset:
key = value_map[val]
histogram[key] += 1 # add count for this val to corresponding bin

# removal of zero values from histogram
final_histogram = {}
for key, val in histogram.items():
if val == 0:
continue
final_histogram[key] = val

return final_histogram

DanielMichaelFrees
Автор

I liked both question and solution! But I don't think it would pass test case with a dataset containing only negative values, and the question doesn't state there can't be negatives, does it?

jp_magal
Автор

The solution provided is not fully correct. I understand the stress under timed interview. Here's my solution given ample time. Thank you, both the interviewer and interviewee for the mock interview. It's quite helpful to me.

from collections import defaultdict
import math


def automatic_histogram(dataset, x):
min_num = dataset[0]
max_num = dataset[0]
freq_count = defaultdict(int)
for d in dataset:
if d > max_num:
max_num = d
if d < min_num:
min_num = d
freq_count[d] += 1

bin_size = math.ceil((max_num - min_num) / x)

histogram = defaultdict(int)
for n, f in freq_count.items():
bin = (n-min_num) // bin_size
if n == max_num and (max_num-min_num)%bin_size == 0:
bin_str = str(max_num)
else:
bin_str = str(min_num + bin*bin_size) + '-' + str(min_num + bin*bin_size + bin_size-1)
histogram[bin_str] += f
return dict(histogram)

### Tests:

x = 4
dataset = [-4, -2, 0, 1, 2, 2, 5]
res = automatic_histogram(dataset, x)
print(res)
# Answer: {'-4--2': 2, '-1-1': 2, '2-4': 2, '5': 1}

x = 5
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9]
res = automatic_histogram(dataset, x)
print(res)
# Answer: {'1-2': 2, '3-4': 2, '5-6': 2, '7-8': 2, '9': 1}

x = 3
dataset = [1, 2, 4, 4, 5, 6, 6, 8, 9]
res = automatic_histogram(dataset, x)
print(res)
# Answer: {'1-3': 2, '4-6': 5, '7-9': 2}

teslarocks
Автор

Sir where we can get such type of questions?
I hardly fined it on internet
And now days many companies started asking in there 1st round of coding such type of questions

amanmiglani
Автор

Salary comparison between Machine Learning Enginner or Software Engineer

saurabhshrivastava