Frequency of words in the text

The following Python script calculates the frequency of words in a text (a word being a contiguous sequence of letters, ignoring punctuation) and prints a table of results.

It works correctly. The question is: can the same thing be done more simply (for example, in fewer lines of code) in Python, Bash, PHP, or Perl, or is this already the best way?

import sys
import string

file = open(sys.argv[1], "r")
text = file.read()
file.close()

table = string.maketrans("", "")
words = text.lower().split(None)

frequencies = {}
for word in words:
    trimmed = word.translate(table, string.punctuation)
    frequencies[trimmed] = frequencies.get(trimmed, 0) + 1

keys = sorted(frequencies.keys())
for word in keys:
    print "%-32s %d" % (word, frequencies[word])
Author: insolor, 2011-03-31

8 answers

In your example, a string like "aa,bb,cc" is counted as "aabbcc 1", whereas it should give: aa 1, bb 1, cc 1.
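
The difference is easy to check in Python (a minimal sketch in Python 3 syntax; the sample string and variable names are illustrative):

```python
import re
import string

text = "aa,bb,cc"

# The question's approach: split on whitespace, then strip punctuation.
table = str.maketrans("", "", string.punctuation)
split_words = [w.translate(table) for w in text.lower().split()]

# Extracting runs of word characters directly instead:
regex_words = re.findall(r"\w+", text.lower())

print(split_words)  # ['aabbcc']
print(regex_words)  # ['aa', 'bb', 'cc']
```

Since there is no whitespace in "aa,bb,cc", splitting first leaves a single token, and stripping punctuation then glues the letters together.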

So, my Perl version is:

#!/usr/bin/perl
use strict;

my %result;
while (<>) {
    $result{ lc $_ }++ for /(\w+)/g;
}

printf "%-32s %d\n", $_, $result{$_} for sort keys %result;

You could certainly squeeze it into one line, but it would be unreadable.

 6
Author: zloyrusskiy, 2011-04-01 05:33:32

Python, regexp

import re
import sys
import operator

file = open(sys.argv[1], "r")
text = file.read().decode("utf8")
file.close()

words = re.findall(r"(\w+)", text, re.UNICODE)

stats = {}
for word in words:
    stats[word] = stats.get(word, 0) + 1

stats_list = sorted(stats.iteritems(), key = operator.itemgetter(1))
for word, count in stats_list:
    print "%-32s %d" % (word, count)
 4
Author: rnd_d, 2011-04-01 00:06:14

My option:

#!/usr/bin/perl
use strict;

my %frec;

sub calc{
    $frec{ lc $1 }++ while( $_[0] =~ /\b(\w+)\b/g );
}

my $fileName = shift or die( "Usage: $0 filenameWithText" );
open FF, '<', $fileName or die( "Cannot open $fileName: $!" );
calc( $_ ) for( <FF> );
foreach( sort{ $frec{$b} <=> $frec{$a} } keys %frec ){
    printf( "%-32s %d\n", $_, $frec{ $_ } );
}
close FF;
 3
Author: Alex Kapustin, 2011-03-31 14:46:35

Bash/awk script, for the collection:

#!/bin/bash

if [ -z "$1" ]
then
  echo "Usage: `basename $0` filename"
  exit 1
fi

for x in $(sed -r 's/\W+/ /g' "$1");
do
  echo $x
done | awk '{print tolower($0)}' | sort | awk '
{
  if (!word) {
    word = $1
    num = 0
  } else if (word == $1) {
    num++
  } else {
    print word, num+1
    word = $1
    num = 0
  }
}
END {
  if (word) print word, num+1
}'
 3
Author: Ilya Pirogov, 2011-05-19 16:58:38

Counting the frequency of words with Unicode support (.casefold(), \w+):

  • the text is read from the files given on the command line (decoded using the locale encoding), or from standard input if none are given
  • words are displayed in descending order of popularity
  • the encoding for the output may differ from the encoding for the input
  • outputs line-by-line, with the formatting specified in the question (the width for the word is 32 characters)
#!/usr/bin/env python3
import fileinput
import re
from collections import Counter

words = (word for line in fileinput.input()
         for word in re.findall(r'\w+', line.casefold()))
for word, count in Counter(words).most_common():
    print("%-32s %d" % (word, count))

Here is a version close to the behavior of the Python 2 code from the question:

  • reads bytes from the file, lowercases them, and splits on whitespace (no Unicode support)
  • removes ASCII punctuation from each word (which may leave an empty string)
  • counts the resulting words and sorts them lexicographically as bytes
  • outputs line-by-line, with the formatting specified in the question (the width for the word is 32 bytes)
#!/usr/bin/env python3
import os
import string
import sys
from collections import Counter
from pathlib import Path

words = Path(sys.argv[1]).read_bytes().lower().split()
chars_to_trim = string.punctuation.encode()
trimmed = (word.translate(None, chars_to_trim) for word in words)
for word, count in sorted(Counter(trimmed).items()):
    sys.stdout.buffer.write(b"%-32s %d%s" % (word, count, os.linesep.encode()))

As a rule, the two versions produce different output.

 3
Author: jfs, 2017-09-14 17:16:58

My humble option:

import sys, string

text = sorted(open(sys.argv[1], 'r').read().translate(string.maketrans('', ''), string.punctuation).split())

for i in range(len(text)):
    if i + 1 < len(text) and text[i + 1] == text[i]: continue
    print '%s %d' % (text[i], text.count(text[i]))

For Python 3:

import sys, string

text = sorted(open(sys.argv[1]).read().lower().translate(''.maketrans('', '', string.punctuation)).split())
for i in range(len(text)):
    if i + 1 < len(text) and text[i + 1] == text[i]: continue
    print('{0:>20} {1:<}'.format(text[i], text.count(text[i])))
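
The quadratic text.count() calls can be avoided while keeping the same sorted-list idea by grouping adjacent duplicates with itertools.groupby (a sketch; the function name and sample string are illustrative):

```python
import string
from itertools import groupby

def word_frequencies(text):
    # Lowercase, strip punctuation, sort, then count each run of equal words.
    words = sorted(text.lower()
                   .translate(str.maketrans('', '', string.punctuation)).split())
    return [(word, sum(1 for _ in group)) for word, group in groupby(words)]

for word, count in word_frequencies("Aa bb aa, cc bb aa"):
    print('{0:>20} {1:<}'.format(word, count))
```

Since the list is sorted, each word forms one contiguous run, so a single pass counts everything.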
 2
Author: BugHunter, 2011-05-10 16:10:59

You can use the nltk module, which is designed to work with natural texts:

In [54]: from nltk import word_tokenize, FreqDist

In [55]: data = open(r'c:/temp/TWAIN.LOG').read()

In [56]: fdist = FreqDist(word.lower() for word in word_tokenize(data) if word.isalpha())

10 most common words:

In [57]: fdist.most_common(10)
Out[57]:
[('message', 10),
 ('ctwtrace', 4),
 ('ctwunk', 3),
 ('dsm', 3),
 ('dsmentrydiagexit', 3),
 ('rc', 3),
 ('cc', 3),
 ('thunker', 2),
 ('scannerredirection', 2),
 ('to', 2)]

The entire dictionary:

In [58]: dict(fdist)
Out[58]:
{'message': 10,
 'ctwunk': 3,
 'reset': 1,
 'log': 1,
 'starting': 1,
 'thunker': 2,
 'why': 1,
 'ca': 1,
 'we': 1,
 'find': 1,
 'the': 1,
 'window': 1,
 'dsm': 3,
 'dsmentrydiagexit': 3,
 'rc': 3,
 'cc': 3,
 'ctwtrace': 4,
 'scannerredirection': 2,
 'to': 2,
 'null': 2,
 'control': 2,
 'identity': 2,
 'getfirst': 1,
 'getnext': 1}
 2
Author: MaxU, 2018-05-25 14:21:11

A PHP version of counting word frequencies in the text:

$text = preg_replace("/[\s\.\,\!]+/", " ", $text);
$text = explode(" ", $text);
foreach (array_count_values($text) as $key => $value) {
    echo '<br>'.$key.'-->'.$value;
}
 0
Author: AHXAOC, 2011-04-29 14:08:17