Frequency of words in the text
The following Python 2 script counts the frequency of words in a text (contiguous runs of letters, with punctuation stripped) and prints a table of the results.
It works correctly. The question is: can the same thing be done more simply (for example, in fewer lines of code) in Python, Bash, PHP, or Perl, or is this the best way?
import sys
import string

file = open(sys.argv[1], "r")
text = file.read()
file.close()

table = string.maketrans("", "")
words = text.lower().split(None)
frequencies = {}
for word in words:
    trimmed = word.translate(table, string.punctuation)
    frequencies[trimmed] = frequencies.get(trimmed, 0) + 1
keys = sorted(frequencies.keys())
for word in keys:
    print "%-32s %d" % (word, frequencies[word])
In your example, a string like "aa,bb,cc" is counted as "aabbcc 1", when it should be: aa 1, bb 1, cc 1.
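To see the claim concretely, here is a small sketch (Python 3 syntax for `maketrans`, otherwise mirroring the question's split-then-strip approach; the sample string is made up):

```python
import string

text = "aa,bb,cc"
# The original script splits on whitespace first...
words = text.lower().split()    # ["aa,bb,cc"] -- one token, no spaces to split on
# ...and only then strips punctuation from each token:
table = str.maketrans("", "", string.punctuation)
print([w.translate(table) for w in words])  # prints ['aabbcc'], not ['aa', 'bb', 'cc']
```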
So, my Perl version is:
#!/usr/bin/perl
use strict;

my %result;
while (<>) {
    $result{ lc $_ }++ for /(\w+)/g;
}
printf "%-32s %d\n", $_, $result{$_} for sort keys %result;
You could certainly squeeze this into one line, but it would be unreadable.
Python, regexp
import re
import sys
import operator

file = open(sys.argv[1], "r")
text = file.read().decode("utf8")
file.close()

words = re.findall(r"(\w+)", text, re.UNICODE)
stats = {}
for word in words:
    stats[word] = stats.get(word, 0) + 1
stats_list = sorted(stats.iteritems(), key=operator.itemgetter(1))
for word, count in stats_list:
    print "%-32s %d" % (word, count)
My option:
#!/usr/bin/perl
use strict;

my %frec;
sub calc {
    $frec{ $1 }++ while ( $_[0] =~ /\b(\S+)\b/g );
}

my $fileName = shift or die( "Usage: $0 filenameWithText" );
open FF, $fileName or die( "Cannot open $fileName: $!" );
calc( $_ ) for ( <FF> );
close FF;

foreach ( sort { $frec{$b} <=> $frec{$a} } keys %frec ) {
    printf( "%-32s %d\n", $_, $frec{ $_ } );
}
Bash/awk script, for the collection:
#!/bin/bash
if [ -z "$1" ]
then
    echo "Usage: `basename $0` filename"
    exit 1
fi
for x in $(sed -rn 's/\W+/ /gp' "$1");
do
    echo "$x"
done | awk '{print tolower($0)}' | sort | awk '
{
    if (!word) {
        word = $1
        num = 0
    } else if (word == $1) {
        num++
    } else {
        print word, num + 1
        word = $1
        num = 0
    }
}
END {
    if (word) print word, num + 1
}'
Counting word frequencies with Unicode support (.casefold(), \w+):
- the text, decoded using the locale's encoding, is read from the files given on the command line, or from standard input (if none are given)
- words are printed in descending order of popularity
- the output encoding may differ from the input encoding
- output is line-by-line, with the formatting from the question (the word field is 32 characters wide)
#!/usr/bin/env python3
import fileinput
import re
from collections import Counter

words = (word for line in fileinput.input()
         for word in re.findall(r'\w+', line.casefold()))
for word, count in Counter(words).most_common():
    print("%-32s %d" % (word, count))
Here is a version close to the behavior of the Python 2 code from the question:
- reads bytes from a file, lowercases them, and splits them into words on ASCII whitespace (no Unicode support)
- strips ASCII punctuation from each word (which may leave an empty string)
- counts the resulting words and sorts them lexicographically as bytes
- output is line-by-line, with the formatting from the question (the word field is 32 bytes wide)
#!/usr/bin/env python3
import os
import string
import sys
from collections import Counter
from pathlib import Path

words = Path(sys.argv[1]).read_bytes().lower().split()
chars_to_trim = string.punctuation.encode()
trimmed = (word.translate(None, chars_to_trim) for word in words)
for word, count in sorted(Counter(trimmed).items()):
    sys.stdout.buffer.write(b"%-32s %d%s" % (word, count, os.linesep.encode()))
As a rule, the two examples produce different output.
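For instance, the two scripts treat apostrophes differently; a small sketch (the sample string is made up):

```python
import re
import string

text = "Don't panic. Don't PANIC!"

# First script: \w+ over casefolded text splits at the apostrophe
unicode_words = re.findall(r'\w+', text.casefold())
print(unicode_words)   # ['don', 't', 'panic', 'don', 't', 'panic']

# Second script: split on whitespace, then strip ASCII punctuation
table = str.maketrans('', '', string.punctuation)
ascii_words = [w.translate(table) for w in text.lower().split()]
print(ascii_words)     # ['dont', 'panic', 'dont', 'panic']
```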
My humble option:
import sys, string

text = sorted(open(sys.argv[1], 'r').read().lower().translate(string.maketrans('', ''), string.punctuation).split())
for i in range(len(text)):
    # skip duplicates; without the bounds check the last word is never printed
    if i + 1 < len(text) and text[i + 1] == text[i]: continue
    print '%s %d' % (text[i], text.count(text[i]))
For python3:
import sys, string

text = sorted(open(sys.argv[1]).read().lower().translate(''.maketrans('', '', string.punctuation)).split())
for i in range(len(text)):
    # skip duplicates; without the bounds check the last word is never printed
    if i + 1 < len(text) and text[i + 1] == text[i]: continue
    print('{0:>20} {1:<}'.format(text[i], text.count(text[i])))
You can use the nltk module, which is designed for working with natural-language text:
In [54]: from nltk import word_tokenize, FreqDist
In [55]: data = open(r'c:/temp/TWAIN.LOG').read()
In [56]: fdist = FreqDist(word.lower() for word in word_tokenize(data) if word.isalpha())
The 10 most common words:
In [57]: fdist.most_common(10)
Out[57]:
[('message', 10),
('ctwtrace', 4),
('ctwunk', 3),
('dsm', 3),
('dsmentrydiagexit', 3),
('rc', 3),
('cc', 3),
('thunker', 2),
('scannerredirection', 2),
('to', 2)]
The entire dictionary:
In [58]: dict(fdist)
Out[58]:
{'message': 10,
'ctwunk': 3,
'reset': 1,
'log': 1,
'starting': 1,
'thunker': 2,
'why': 1,
'ca': 1,
'we': 1,
'find': 1,
'the': 1,
'window': 1,
'dsm': 3,
'dsmentrydiagexit': 3,
'rc': 3,
'cc': 3,
'ctwtrace': 4,
'scannerredirection': 2,
'to': 2,
'null': 2,
'control': 2,
'identity': 2,
'getfirst': 1,
'getnext': 1}
A PHP version of counting word frequencies in a text:
$text = preg_replace("/[\s\.\,\!]+/", " ", $text);
$text = explode(" ", $text);
foreach (array_count_values($text) as $key => $value)
{
    echo '<br>'.$key.'-->'.$value;
}