Frequency of words in the text

The following Python script calculates the frequency of words in a text (a word being a contiguous sequence of letters, ignoring punctuation) and prints a table of results.

It works correctly. The question is: can the same thing be done more simply (for example, in fewer lines of code) in Python, Bash, PHP, or Perl, or is this already the best way?

import sys
import string

file = open(sys.argv[1], "r")
text = file.read()
file.close()

table = string.maketrans("", "")
words = text.lower().split(None)

frequencies = {}
for word in words:
    trimmed = word.translate(table, string.punctuation)
    frequencies[trimmed] = frequencies.get(trimmed, 0) + 1

keys = sorted(frequencies.keys())
for word in keys:
    print "%-32s %d" % (word, frequencies[word])
Author: insolor, 2011-03-31

8 answers

In your example, a string like "aa,bb,cc" is counted as "aabbcc 1", whereas it should give: aa 1, bb 1, cc 1.
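
The difference is easy to check in Python (a minimal sketch in Python 3 syntax; the sample string and variable names are illustrative):

```python
import re
import string

text = "aa,bb,cc"

# The question's approach: split on whitespace, then strip punctuation.
table = str.maketrans("", "", string.punctuation)
split_words = [w.translate(table) for w in text.lower().split()]

# Extracting runs of word characters directly instead:
regex_words = re.findall(r"\w+", text.lower())

print(split_words)  # ['aabbcc']
print(regex_words)  # ['aa', 'bb', 'cc']
```

Since there is no whitespace in "aa,bb,cc", splitting first leaves a single token, and stripping punctuation then glues the letters together.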

So, my Perl version is:

#!/usr/bin/perl
use strict;

my %result;
while (<>) {
    $result{ lc $_ }++ for /(\w+)/g;
}

printf "%-32s %d\n", $_, $result{$_} for sort keys %result;

You could certainly squeeze it into one line, but it would be unreadable.

 6
Author: zloyrusskiy, 2011-04-01 05:33:32

Python, regexp

import re
import sys
import operator

file = open(sys.argv[1], "r")
text = file.read().decode("utf8")
file.close()

words = re.findall(r"(\w+)", text, re.UNICODE)

stats = {}
for word in words:
    stats[word] = stats.get(word, 0) + 1

stats_list = sorted(stats.iteritems(), key = operator.itemgetter(1))
for word, count in stats_list:
    print "%-32s %d" % (word, count)
 4
Author: rnd_d, 2011-04-01 00:06:14

My option:

#!/usr/bin/perl
use strict;

my %frec;

sub calc{
    $frec{ lc $1 }++ while( $_[0] =~ /\b(\w+)\b/g );
}

my $fileName = shift or die( "Usage: $0 filenameWithText" );
open FF, '<', $fileName or die( "Cannot open $fileName: $!" );
calc( $_ ) for( <FF> );
foreach( sort{ $frec{$b} <=> $frec{$a} } keys %frec ){
    printf( "%-32s %d\n", $_, $frec{ $_ } );
}
close FF;
 3
Author: Alex Kapustin, 2011-03-31 14:46:35

Bash/awk script, for the collection:

#!/bin/bash

if [ -z "$1" ]
then
  echo "Usage: `basename $0` filename"
  exit 1
fi

for x in $(sed -r 's/\W+/ /g' "$1");
do
  echo $x
done | awk '{print tolower($0)}' | sort | awk '
{
  if (!word) {
    word = $1
    num = 0
  } else if (word == $1) {
    num++
  } else {
    print word, num+1
    word = $1
    num = 0
  }
}
END {
  if (word) print word, num+1
}'
 3
Author: Ilya Pirogov, 2011-05-19 16:58:38

Counting the frequency of words with Unicode support (.casefold(), \w+):

  • the text is read from the files given on the command line (decoded using the locale encoding), or from standard input if none are given
  • words are displayed in descending order of popularity
  • the encoding for the output may differ from the encoding for the input
  • outputs line-by-line, with the formatting specified in the question (the width for the word is 32 characters)
#!/usr/bin/env python3
import fileinput
import re
from collections import Counter

words = (word for line in fileinput.input()
         for word in re.findall(r'\w+', line.casefold()))
for word, count in Counter(words).most_common():
    print("%-32s %d" % (word, count))

Here is a version close to the behavior of the Python 2 code from the question:

  • reads bytes from the file, lowercases them, and splits on whitespace (no Unicode support)
  • removes ASCII punctuation from each word (which may leave an empty string)
  • counts the resulting words and sorts them lexicographically as bytes
  • outputs line-by-line, with the formatting specified in the question (the width for the word is 32 bytes)
#!/usr/bin/env python3
import os
import string
import sys
from collections import Counter
from pathlib import Path

words = Path(sys.argv[1]).read_bytes().lower().split()
chars_to_trim = string.punctuation.encode()
trimmed = (word.translate(None, chars_to_trim) for word in words)
for word, count in sorted(Counter(trimmed).items()):
    sys.stdout.buffer.write(b"%-32s %d%s" % (word, count, os.linesep.encode()))

As a rule, the two versions produce different output.

 3
Author: jfs, 2017-09-14 17:16:58

My humble option:

import sys, string

text = sorted(open(sys.argv[1], 'r').read().translate(string.maketrans('', ''), string.punctuation).split())

for i in range(len(text)):
    if i + 1 < len(text) and text[i + 1] == text[i]: continue
    print '%s %d' % (text[i], text.count(text[i]))

For Python 3:

import sys, string

text = sorted(open(sys.argv[1]).read().lower().translate(''.maketrans('', '', string.punctuation)).split())
for i in range(len(text)):
    if i + 1 < len(text) and text[i + 1] == text[i]: continue
    print('{0:>20} {1:<}'.format(text[i], text.count(text[i])))
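
The quadratic text.count() calls can be avoided while keeping the same sorted-list idea by grouping adjacent duplicates with itertools.groupby (a sketch; the function name and sample string are illustrative):

```python
import string
from itertools import groupby

def word_frequencies(text):
    # Lowercase, strip punctuation, sort, then count each run of equal words.
    words = sorted(text.lower()
                   .translate(str.maketrans('', '', string.punctuation)).split())
    return [(word, sum(1 for _ in group)) for word, group in groupby(words)]

for word, count in word_frequencies("Aa bb aa, cc bb aa"):
    print('{0:>20} {1:<}'.format(word, count))
```

Since the list is sorted, each word forms one contiguous run, so a single pass counts everything.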
 2
Author: BugHunter, 2011-05-10 16:10:59

You can use the nltk module, which is designed to work with natural texts:

In [54]: from nltk import word_tokenize, FreqDist

In [55]: data = open(r'c:/temp/TWAIN.LOG').read()

In [56]: fdist = FreqDist(word.lower() for word in word_tokenize(data) if word.isalpha())

10 most common words:

In [57]: fdist.most_common(10)
Out[57]:
[('message', 10),
 ('ctwtrace', 4),
 ('ctwunk', 3),
 ('dsm', 3),
 ('dsmentrydiagexit', 3),
 ('rc', 3),
 ('cc', 3),
 ('thunker', 2),
 ('scannerredirection', 2),
 ('to', 2)]

The entire dictionary:

In [58]: dict(fdist)
Out[58]:
{'message': 10,
 'ctwunk': 3,
 'reset': 1,
 'log': 1,
 'starting': 1,
 'thunker': 2,
 'why': 1,
 'ca': 1,
 'we': 1,
 'find': 1,
 'the': 1,
 'window': 1,
 'dsm': 3,
 'dsmentrydiagexit': 3,
 'rc': 3,
 'cc': 3,
 'ctwtrace': 4,
 'scannerredirection': 2,
 'to': 2,
 'null': 2,
 'control': 2,
 'identity': 2,
 'getfirst': 1,
 'getnext': 1}
 2
Author: MaxU, 2018-05-25 14:21:11

A PHP version of counting word frequencies in the text:

$text = preg_replace("/[\s\.\,\!]+/", " ", $text);
$text = explode(" ", $text);
foreach (array_count_values($text) as $key => $value) {
    echo '<br>'.$key.'-->'.$value;
}
 0
Author: AHXAOC, 2011-04-29 14:08:17