In this homework you will practice
You’re a curious linguist with computer hacking skills and you want to see if Zipf’s Law holds for texts contained in files lying around on your disk.
Write a Python module in word_hist.py
that includes the following functions. You should copy these function declarations and docstrings verbatim to ensure that we can successfully autograde your submission.
from typing import *
def normalize_text(text: str) -> str:
"""Return a copy of text in lowercase with punctuation removed.
Usage Examples:
>>> normalize_text("Numchuk skills, bow hunting skills, computer hacking skills...")
'numchuk skills bow hunting skills computer hacking skills'
"""
def mk_word2count(text: str) -> Dict[str, int]:
"""Return a dictionary mapping words in text to their count in text.
Be sure to remove newlines in the text!
Usage Examples: (Note technique for testing dict equality.)
>>> mk_word2count('the butcher the baker the candlestick maker') == {'butcher': 1, 'baker': 1, 'candlestick': 1, 'the': 3, 'maker': 1}
True
"""
def dict2tuples(word_dict: Dict[str, int],
key: Callable[[Tuple], Any]=None) -> List[Tuple[str, int]]:
"""Convert a str:int dictionary to a sorted (in descending order) list of
(str, int) tuples, optionally with a key function for sorting the tuples
Usage Examples:
>>> dict2tuples({'a': 2, 'b': 5, 'c': 1}, key=lambda t: t[1])
[('b', 5), ('a', 2), ('c', 1)]
"""
def normalize_counts(tuples: Sequence[Tuple[str, int]],
max_value: int=100) -> Sequence[Tuple[str, int]]:
"""Return: sequence of tuples with same first elements as input tuples
but whose second elements are normalized to the range 0 to
max_value.
Usage Examples:
>>> wctups = [('a', 200), ('the', 180), ('an', 160), ('shenannigans', 50)]
>>> normalize_counts(wctups, 100)
[('a', 100), ('the', 90), ('an', 80), ('shenannigans', 25)]
"""
def word_hist(bar_list: Sequence[Tuple[str, int]]) -> List[str]:
"""Create a text-based bar chart from bar_list. Return: List[str] with one
line per tuple in bar_list. Each string in the returned list has a
right-aligned label from the first element of the corresponding tuple
in bar_list, a | character, then a number of Xs equal to the
second element from the tuple.
Usage Examples:
>>> from pprint import pprint
>>> pprint(word_hist([('a', 10),('the', 9),('an', 8),('shenannigans', 2)]))
[' a | XXXXXXXXXX',
' the | XXXXXXXXX',
' an | XXXXXXXX',
'shenannigans | XX']
"""
main
Structure your main method as we have been taught:
def main(args):
# code intended to be executed when run as a script
if __name__=="__main__":
import sys
main(sys.argv)
The user may supply one to three command line arguments to your Python script. The first argument to the python
interpreter, sys.argv[0]
, is the name of your script, i.e., args[0] = "word_hist.py"
, so there will be 1 to 4 arguments in sys.argv
. sys.argv
, a list of strings, should be passed as-is to the main()
function to minimize confusion.
args[0] = "word_hist.py"
).Here’s a snippet of code that checks for the existence of a file:
import os.path
os.path.exists("file_name.txt") # returns True if file_name.txt exists
Once you have a valid file name, read the file contents into a string. Here’s a snippet of code that opens a file for reading as text and reads the file contents into a str
variable:
infile = open(file_name, 'r') # opens file_name as readable file object infile
text = infile.read() # dumps text data from infile into text variable
infile.close() # closes infile
Once you read the contents of the file into a str
, use the functions you created above to:
word_hist
function, andnum_lines
of the histogram, where num_lines
is the number of lines of the histogram to display.Here’s a sample program run, using the file i-have-a-dream.txt:
$ python hw2.py i-have-a-dream.txt 80 80
the | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
of | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
to | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
and | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
a | XXXXXXXXXXXXXXXXXXXXXXXXXXXX
be | XXXXXXXXXXXXXXXXXXXXXXXXX
we | XXXXXXXXXXXXXXXXXXXXXXX
will | XXXXXXXXXXXXXXXXXXXX
that | XXXXXXXXXXXXXXXXXX
is | XXXXXXXXXXXXXXXXX
in | XXXXXXXXXXXXXXXXX
this | XXXXXXXXXXXXXXX
freedom | XXXXXXXXXXXXXXX
as | XXXXXXXXXXXXXXX
from | XXXXXXXXXXXXX
have | XXXXXXXXXXXXX
our | XXXXXXXXXXXXX
with | XXXXXXXXXXXX
i | XXXXXXXXXXX
negro | XXXXXXXXXX
not | XXXXXXXXXX
one | XXXXXXXXXX
let | XXXXXXXXXX
day | XXXXXXXXX
ring | XXXXXXXXX
dream | XXXXXXXX
come | XXXXXXX
nation | XXXXXXX
every | XXXXXXX
for | XXXXXX
go | XXXXXX
back | XXXXXX
today | XXXXXX
are | XXXXXX
must | XXXXXX
satisfied | XXXXXX
by | XXXXXX
you | XXXXXX
their | XXXXXX
justice | XXXXXX
able | XXXXXX
when | XXXXX
all | XXXXX
it | XXXX
cannot | XXXX
men | XXXX
white | XXXX
long | XXXX
now | XXXX
but | XXXX
there | XXXX
together | XXXX
time | XXX
which | XXX
on | XXX
faith | XXX
children | XXX
america | XXX
my | XXX
free | XXX
check | XXX
has | XXX
shall | XXX
great | XXX
new | XXX
years | XXX
who | XXX
an | XXX
into | XXX
so | XXX
black | XXX
hope | XXX
hundred | XXX
mississippi | XXX
up | XXX
us | XXX
down | XXX
until | XXX
mountain | XXX
later | XXX
Note that the full output would be very long and words with low rank would have no X
s.