Google L4 Final Round Interview US

Round 1

Question: You are given a list of sentences. Your task is to build a model that learns word sequences from this data and predicts the next word given an input word. The prediction should be based on word patterns observed in the training data.

Example
Training Data:

[ ["I", "am", "Sam"], ["Sam", "I", "am"], ["I", "like", "green", "eggs", "and", "ham"] ]

Query Example:
Input: "I"
Output: "am"

The prediction should be based on what word most commonly follows "I" in the training data.
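As a point of reference before the candidate's solution, here is a minimal sketch of the simplest approach that matches the query example exactly: for each word, count which word follows it most often (a bigram frequency table). The names BigramPredictor, train, and predict are illustrative, not from the original post.

from collections import defaultdict, Counter

class BigramPredictor:
    """Sketch: predict the next word from the single preceding word."""

    def __init__(self):
        # word -> Counter of words observed immediately after it
        self.next_counts = defaultdict(Counter)

    def train(self, sentences):
        for sentence in sentences:
            for prev, nxt in zip(sentence, sentence[1:]):
                self.next_counts[prev][nxt] += 1

    def predict(self, word):
        # Return the most frequent follower, or None if the word was never seen
        if word not in self.next_counts:
            return None
        return self.next_counts[word].most_common(1)[0][0]

data = [["I", "am", "Sam"], ["Sam", "I", "am"], ["I", "like", "green", "eggs", "and", "ham"]]
model = BigramPredictor()
model.train(data)
print(model.predict("I"))  # "am" (follows "I" twice, vs. "like" once)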

Candidate's Approach

I use a trigram-based language model: given an input pair of words, it predicts the most probable next word (extending the single-word query in the prompt to a two-word context). The approach consists of the following steps:

  • Extract Trigrams: From the given training data (a list of tokenized sentences), I extract trigrams, which are sequences of three consecutive words. Each trigram consists of (word1, word2 → word3), where word3 is the word that follows the pair (word1, word2).

  • Build a Frequency Dictionary: I store the counts of word occurrences in a dictionary where each (word1, word2) pair maps to a dictionary of possible next words (word3) and their respective frequencies.

  • Convert Counts to Probabilities: I normalize the counts into probabilities by dividing each count by the total number of occurrences of the (word1, word2) pair. The most probable next word is the same as the most frequent one, but probabilities make the distribution over next words explicit and easier to compare across contexts.

  • Predict the Next Word: Given an input (word1, word2), I look up the probability distribution in the trigram model. I return the word with the highest probability as the predicted next word. If the given word pair is not found in the model, I return None.

from collections import defaultdict

class PredictWords:
    def __init__(self):
        self.trigram_model = defaultdict(lambda: defaultdict(int))

    def train_trigram_model(self, data):
        """Trains a trigram model based on given training data."""
        for sentence in data:
            for i in range(len(sentence) - 2):
                w1, w2, w3 = sentence[i], sentence[i + 1], sentence[i + 2]
                self.trigram_model[(w1, w2)][w3] += 1
        # Convert counts to probabilities
        for (w1, w2), counter in self.trigram_model.items():
            total_count = sum(counter.values())
            self.trigram_model[(w1, w2)] = {w3: count / total_count for w3, count in counter.items()}

    def predict_next_word(self, w1, w2):
        """Predicts the most probable next word based on the trained trigram model."""
        if (w1, w2) in self.trigram_model:
            return max(self.trigram_model[(w1, w2)], key=self.trigram_model[(w1, w2)].get)
        return None

# Example usage:
data = [
    ["I", "am", "Sam"],
    ["Sam", "I", "am"],
    ["I", "like", "green", "eggs", "and", "ham"],
]

model = PredictWords()
model.train_trigram_model(data)
print(model.predict_next_word("I", "am"))  # Output: "Sam"
Interviewer's Feedback

No feedback provided.

Follow-up Questions:

  1. What are the cons of using a trigram model?
    Me: Since a trigram model conditions on two preceding words, it requires a large dataset to have sufficient coverage; many valid word pairs will never appear in training. Storing trigram counts also requires a large dictionary, especially for natural language, where the vocabulary is extensive.
    (A common mitigation is to back off to a bigram model when a trigram context is unseen; see the sketch after this list.)

  2. Difference between a bigram and a trigram model:
    Me: A bigram model predicts the next word from the single preceding word (it counts pairs of consecutive words), while a trigram model predicts the next word from the two preceding words (it counts triples of consecutive words).

  3. What are the time and space complexity?
    Me: Training is O(N), where N is the total number of tokens across all sentences, since each trigram is counted once and the normalization pass touches each count once. Prediction is a dictionary lookup plus a max over the candidate next words for that pair, so O(k) for k candidates. Space is O(T), where T is the number of distinct trigrams stored.

  4. Since we were out of time, the interviewer just explained how he would have improved the time complexity.
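To illustrate the sparsity con raised in question 1, here is a hedged sketch (not discussed in the round) of backing off to bigram counts when a (word1, word2) pair was never seen in training. The class name BackoffPredictor and its methods are illustrative additions, not part of the original solution.

from collections import defaultdict, Counter

class BackoffPredictor:
    """Sketch: trigram model that backs off to bigram counts for unseen contexts."""

    def __init__(self):
        self.trigram_counts = defaultdict(Counter)  # (w1, w2) -> Counter of next words
        self.bigram_counts = defaultdict(Counter)   # w -> Counter of next words

    def train(self, sentences):
        for s in sentences:
            for a, b in zip(s, s[1:]):
                self.bigram_counts[a][b] += 1
            for a, b, c in zip(s, s[1:], s[2:]):
                self.trigram_counts[(a, b)][c] += 1

    def predict(self, w1, w2):
        # Prefer the trigram context; fall back to the bigram context on w2 alone.
        if (w1, w2) in self.trigram_counts:
            return self.trigram_counts[(w1, w2)].most_common(1)[0][0]
        if w2 in self.bigram_counts:
            return self.bigram_counts[w2].most_common(1)[0][0]
        return None

data = [["I", "am", "Sam"], ["Sam", "I", "am"], ["I", "like", "green", "eggs", "and", "ham"]]
m = BackoffPredictor()
m.train(data)
print(m.predict("I", "am"))      # "Sam" -- trigram context seen in training
print(m.predict("ham", "I"))     # "am"  -- unseen pair, backs off to the bigram on "I"

The trade-off is the one mentioned in question 2: the bigram fallback uses less context, so its predictions are coarser, but it answers queries the trigram table cannot.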