Word Histogram

Standard

 

Given a document and a word, return the number of occurrences of that word in the document. What if this function is called many times?

Here we have to define what a word means. Is ‘Cat’ equivalent to ‘cat’ or ‘Cats’? Do we care about punctuation? e.g. do we consider ‘end’ equivalent to ‘end.’? Do we care about equivalency of abbreviated words, like ‘do not’ and ‘don’t’?

For this problem, I will define a word to be a string of characters separated by spaces. In addition, I will strip any punctuation that occurs at the end of the mentioned string and will assume everything is lowercase.

Let us consider the case that this function will not be frequently called. Since it will be frequently called, this suggest that we favor optimizing memory as we will only we using it a few times. Here are the high-level algorithm steps that will solve this problem:

  1. Scan document to next occurrence of a word until the end of the document.
  2. increment a counter
  3. go to step 1.
  4. return counter

This approach will take us O(d) worst-time where d is the size of the document. If we are not careful in implementing this algorithm, we could end up with O(w*d) worst-time where w is the length of the word and d is the length of the document. Here is the solution

# C++/JAVA way in ruby syntax.
def word_histo(doc, word)
  doc_pointer = 0
  counter = 0
  while doc_pointer < doc.length
    word_pointer = 0
    while doc[doc_pointer] == word[word_pointer] && doc_pointer < doc.length && word_pointer < word.length
      doc_pointer += 1
      word_pointer += 1
    end

    # in order to increment the counter, we need the following to be true:
    # 1. word_pointer == word.length (we found a all characters in doc)
    # 2. doc_pointer == doc.length OR doc[doc_pointer + 1] is not a an a-z character  (to avoid partial matching)
    if word_pointer == word.length
      if doc_pointer == doc.length || doc[doc_pointer + 1] !=~ /[[:alpha:]]/
        counter += 1
      end
    end
    doc_pointer += 1
  end
  counter
end

You can also implement the above solution with the help of the String#scan function in ruby. However, it returns an array so it has to keep track of all occurrences and that could be memory intensive. Here is the solution the Rubyist way

def word_histo(doc, word)
  doc.scan(/\b#{word}\b/).length
end

Now, in order to implement the solution that allows us to call this function, we need to give up memory for speed. We store every word in a hash that maps to its count. I will not bother optimizing the solution as it will look as ugly as the previous solution, here is a more ruby way to solve the problem:

def doc_histo(doc)
  histogram = Hash.new { 0 }
  doc.split(' ').each do |d|
    histogram[d] += 1
  end
  histogram
end

def word_histo(doc_histogram, word)
  doc_histogram[word]
end

We call doc_histo once. Then we use the hash generated to query for multiple words. This ensures a worst-time of O(d) for preprocessing, then O(1) for each query (ignoring the fact the we have to read the input of size w). But we keep O(d) memory.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s