Word Histogram

Introduction

In this homework you will practice

Problem Descrtiption

You’re a curious linguist with computer hacking skills and you want to see if Zipf’s Law holds for texts contained in files lying around on your disk.

Solution Description

Write a Python module in word_hist.py that includes the following functions. You should copy these function declarations and docstrings verbatim to ensure that we can successfully autograde your submission.

from typing import *

def normalize_text(text: str) -> str:
    """Return a copy of text in lowercase with punctuation removed.

    Usage Examples:
    >>> normalize_text("Numchuk skills, bow hunting skills, computer hacking skills...")
    'numchuk skills bow hunting skills computer hacking skills'
    """

def mk_word2count(text: str) -> Dict[str, int]:
    """Return a dictionary mapping words in text to their count in text. 
    Be sure to remove newlines in the text!

    Usage Examples: (Note technique for testing dict equality.)

    >>> mk_word2count('the butcher the baker the candlestick maker') == {'butcher': 1, 'baker': 1, 'candlestick': 1, 'the': 3, 'maker': 1}
    True
    """

def dict2tuples(word_dict: Dict[str, int],
                key: Callable[[Tuple], Any]=None) -> List[Tuple[str, int]]:
    """Convert a str:int dictionary to a sorted (in descending order) list of 
    (str, int) tuples, optionally with a key function for sorting the tuples

    Usage Examples:
    >>> dict2tuples({'a': 2, 'b': 5, 'c': 1}, key=lambda t: t[1])
    [('b', 5), ('a', 2), ('c', 1)]
    """

def normalize_counts(tuples: Sequence[Tuple[str, int]],
                     max_value: int=100) -> Sequence[Tuple[str, int]]:
    """Return: sequence of tuples with same first elements as input tuples
    but whose second elements are normalized to the range 0 to
    max_value.

    Usage Examples:
    >>> wctups = [('a', 200), ('the', 180), ('an', 160), ('shenannigans', 50)]
    >>> normalize_counts(wctups, 100)
    [('a', 100), ('the', 90), ('an', 80), ('shenannigans', 25)]
    """

def word_hist(bar_list: Sequence[Tuple[str, int]]) -> List[str]:
    """Create a text-based bar chart from bar_list. Return: List[str] with one
    line per tuple in bar_list. Each string in the returned list has a
    right-aligned label from the first element of the corresponding tuple
    in bar_list, a | character, then a number of Xs equal to the
    second element from the tuple.

    Usage Examples:
    >>> from pprint import pprint
    >>> pprint(word_hist([('a', 10),('the', 9),('an', 8),('shenannigans', 2)]))
    ['           a | XXXXXXXXXX',
     '         the | XXXXXXXXX',
     '          an | XXXXXXXX',
     'shenannigans | XX']
    """

main

Structure your main method as we have been taught:

def main(args):
    # code intended to be executed when run as a script

if __name__=="__main__":
   import sys
   main(sys.argv)

The user may supply one to three command line arguments to your Python script. The first argument to the python interpreter, sys.argv[0], is the name of your script, i.e., args[0] = "word_hist.py", so there will be 1 to 4 arguments in sys.argv. sys.argv, a list of strings, should be passed as-is to the main() function to minimize confusion.

Here’s a snippet of code that checks for the existence of a file:

import os.path
os.path.exists("file_name.txt") # returns True if file_name.txt exists

Once you have a valid file name, read the file contents into a string. Here’s a snippet of code that opens a file for reading as text and reads the file contents into a str variable:

infile = open(file_name, 'r') # opens file_name as readable file object infile
text = infile.read()          # dumps text data from infile into text variable
infile.close()                # closes infile

Once you read the contents of the file into a str, use the functions you created above to:

Here’s a sample program run, using the file i-have-a-dream.txt:

$ python hw2.py i-have-a-dream.txt 80 80
           the | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
            of | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
            to | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
           and | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
             a | XXXXXXXXXXXXXXXXXXXXXXXXXXXX
            be | XXXXXXXXXXXXXXXXXXXXXXXXX
            we | XXXXXXXXXXXXXXXXXXXXXXX
          will | XXXXXXXXXXXXXXXXXXXX
          that | XXXXXXXXXXXXXXXXXX
            is | XXXXXXXXXXXXXXXXX
            in | XXXXXXXXXXXXXXXXX
          this | XXXXXXXXXXXXXXX
       freedom | XXXXXXXXXXXXXXX
            as | XXXXXXXXXXXXXXX
          from | XXXXXXXXXXXXX
          have | XXXXXXXXXXXXX
           our | XXXXXXXXXXXXX
          with | XXXXXXXXXXXX
             i | XXXXXXXXXXX
         negro | XXXXXXXXXX
           not | XXXXXXXXXX
           one | XXXXXXXXXX
           let | XXXXXXXXXX
           day | XXXXXXXXX
          ring | XXXXXXXXX
         dream | XXXXXXXX
          come | XXXXXXX
        nation | XXXXXXX
         every | XXXXXXX
           for | XXXXXX
            go | XXXXXX
          back | XXXXXX
         today | XXXXXX
           are | XXXXXX
          must | XXXXXX
     satisfied | XXXXXX
            by | XXXXXX
           you | XXXXXX
         their | XXXXXX
       justice | XXXXXX
          able | XXXXXX
          when | XXXXX
           all | XXXXX
            it | XXXX
        cannot | XXXX
           men | XXXX
         white | XXXX
          long | XXXX
           now | XXXX
           but | XXXX
         there | XXXX
      together | XXXX
          time | XXX
         which | XXX
            on | XXX
         faith | XXX
      children | XXX
       america | XXX
            my | XXX
          free | XXX
         check | XXX
           has | XXX
         shall | XXX
         great | XXX
           new | XXX
         years | XXX
           who | XXX
            an | XXX
          into | XXX
            so | XXX
         black | XXX
          hope | XXX
       hundred | XXX
   mississippi | XXX
            up | XXX
            us | XXX
          down | XXX
         until | XXX
      mountain | XXX
         later | XXX

Note that the full output would be very long and words with low rank would have no Xs.