Solving the “Top K Frequent Words” Coding Interview Problem

Problem Statement

Given a non-empty list of words, return the k most frequent words. The output should be sorted by frequency from highest to lowest. If two words have the same frequency, then the word with the lower alphabetical order should come first.

Example

For the list of words ["i", "like", "java", "i", "like", "coding"] and k = 2, the answer is ["i", "like"]: "i" and "like" each appear twice, and since they tie in frequency, "i" comes first because it is alphabetically smaller.

Brute Force Solution

A straightforward approach is to first count the frequency of each word using a hashmap. Then, collect the unique words (the keys of the hashmap) and sort them by frequency from highest to lowest, breaking ties lexicographically. Finally, take the first k words.

import java.util.*;

public class TopKFrequentWords {
    public List<String> topKFrequent(String[] words, int k) {
        // Create a map to count the frequency of each word
        Map<String, Integer> count = new HashMap<>();
        for (String word : words) {
            count.put(word, count.getOrDefault(word, 0) + 1);
        }

        // Convert the key set into a list
        List<String> candidates = new ArrayList<>(count.keySet());
        
        // Sort by frequency (highest first), breaking ties lexicographically
        candidates.sort((w1, w2) -> count.get(w1).equals(count.get(w2)) ?
                w1.compareTo(w2) : Integer.compare(count.get(w2), count.get(w1)));

        // Return the top k words
        return candidates.subList(0, k);
    }
}
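
As a quick sanity check, here is a minimal, hypothetical driver for the class above; the demo class name and the expected output in the comment are illustrative assumptions, not part of the original solution.

public class TopKFrequentWordsDemo {
    public static void main(String[] args) {
        String[] words = {"i", "like", "java", "i", "like", "coding"};
        // Brute force version defined above
        System.out.println(new TopKFrequentWords().topKFrequent(words, 2));  // expected: [i, like]
    }
}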

This solution relies on explicit sorting, which can be expensive when the list contains many unique words.

Time and Space Complexity

  1. Time Complexity:
    • Counting word frequencies: O(n), where n is the total number of words.
    • Sorting the candidates: O(m log m), where m is the number of unique words. In the worst case m = n, but in general m ≤ n. Each comparison performs a couple of hashmap lookups and, on a frequency tie, a string comparison; these are treated as constant-time operations here.
    • Therefore, the overall time complexity is O(n + m log m), which is O(n log n) in the worst case.
  2. Space Complexity:
    • We use space for the count hashmap, which stores m unique words: O(m).
    • We also use space for the candidates list: O(m).
    • Hence, the space complexity is O(m + m) = O(m). In the worst case, m = n, so it is O(n).

Optimal Solution

A more efficient approach is to use a priority queue (min heap) that never grows beyond k elements, so at any point it holds only the top k frequent words seen so far.

import java.util.*;

public class TopKFrequentWords {
    public List<String> topKFrequent(String[] words, int k) {
        // Create a map to count the frequency of each word
        Map<String, Integer> count = new HashMap<>();
        for (String word : words) {
            count.put(word, count.getOrDefault(word, 0) + 1);
        }

        // Min heap: the least frequent word is on top; on a frequency tie,
        // the lexicographically larger word is on top so it is evicted first
        PriorityQueue<String> heap = new PriorityQueue<>(
                (w1, w2) -> count.get(w1).equals(count.get(w2)) ?
                w2.compareTo(w1) : Integer.compare(count.get(w1), count.get(w2)));

        // Add all keys to the heap
        for (String word : count.keySet()) {
            heap.offer(word);
            if (heap.size() > k) heap.poll();  // Evict the weakest candidate
        }

        // Build the result
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);  // Polled from least to most frequent, so reverse
        return result;
    }
}

In the above code, we perform the following steps:

  1. Count the Words: First, create a hashmap that counts the occurrences of each word.
  2. Heap for Sorting: A priority queue (min heap) is used to rank the words. Its custom comparator puts the least frequent word on top and, on a frequency tie, the lexicographically larger word on top, so the root is always the weakest candidate.
  3. Keep only Top K: As we insert into the heap, whenever its size exceeds k we immediately poll (remove) the weakest candidate.
  4. Build the Result: Polling the min heap yields the remaining k words from least to most frequent, so we collect them into a list and reverse it for the final result (see the usage sketch below).

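To see both the frequency ordering and the tie-breaking rule in action, here is a minimal, hypothetical test harness for the heap-based class above; the demo class name, the second input array, and the expected outputs in the comments are illustrative assumptions, not part of the original solution.

public class TopKFrequentWordsHeapDemo {
    public static void main(String[] args) {
        TopKFrequentWords solver = new TopKFrequentWords();  // heap-based version above

        String[] words = {"i", "like", "java", "i", "like", "coding"};
        System.out.println(solver.topKFrequent(words, 2));   // expected: [i, like]

        // Tie-breaking: "apple" and "banana" both appear twice, so the
        // alphabetically smaller "apple" should come first
        String[] tied = {"banana", "apple", "banana", "apple", "cherry"};
        System.out.println(solver.topKFrequent(tied, 2));    // expected: [apple, banana]
    }
}
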
Time and Space Complexity

  1. Time Complexity:
    • Counting word frequencies: O(n).
    • Inserting into the priority queue: Each insertion is O(log k), since the queue never holds more than k + 1 elements. We do this m times (once for each unique word), so this step is O(m log k).
    • Building the result list: O(k log k), since we poll k times and each poll is O(log k); reversing the list adds another O(k).
    • Therefore, the overall time complexity is O(n + m log k + k log k). Since k ≤ m ≤ n, this simplifies to O(n log k).
  2. Space Complexity:
    • We use space for the count hashmap: O(m).
    • We use space for the priority queue which holds at most k elements: O(k).
    • The result list that will have k elements: O(k).
    • Hence, the overall space complexity is O(m + k + k) = O(m + k). In the worst case where every word is unique, m = n, so this becomes O(n + k), which is O(n) since k ≤ n.
