Extracting words from a long text and creating statistics from them. What's wrong?

Question

We have the book "Pride and Prejudice" by Jane Austen, from Project Gutenberg:

http://www.gutenberg.org/ebooks/1342

The goal is to extract all words from the text and compute statistics such as: the frequency of each word, the total number of characters in the text, the average word length, the average sentence length, and a "top 10" of the longest words.

Looking at the text, I noticed that many words come with punctuation attached, like:

"the,"

so these characters need to be removed first.

Here is what I tried:

# -*- coding: UTF-8 -*-
from string import punctuation
from collections import Counter
with open("1342-0.txt",encoding='utf8') as f:
    texto = f.read()

words = texto.split()
n_words =[]
for word in words:
    for p in punctuation:
        if p in word:
            n_word = word.replace(p,"")
            n_words.append(n_word)
        n_words.append(word)

"1342-0.txt" is the book in question. The code above tries to remove the unwanted characters, but it doesn't work. What's wrong? Any better ideas?

3

python python-3.x texto

Author: Laurinda Souza, 2020-03-31

2 answers

To remove punctuation characters from a string in Python, a single line is enough:

from string import punctuation

texto = '''It is a truth universally acknowledged, that a single man in
      possession of a good fortune, must be in want of a wife.

      However little known the feelings or views of such a man may be
      on his first entering a neighbourhood, this truth is so well
      fixed in the minds of the surrounding families, that he is
      considered the rightful property of some one or other of their
      daughters.      
'''

# Remove the punctuation characters
print(texto.translate(str.maketrans('', '', punctuation)))

Resulting in:

  It is a truth universally acknowledged that a single man in
  possession of a good fortune must be in want of a wife

  However little known the feelings or views of such a man may be
  on his first entering a neighbourhood this truth is so well
  fixed in the minds of the surrounding families that he is
  considered the rightful property of some one or other of their
  daughters

Example 1 on Repl.it: https://repl.it/repls/DarkcyanPointlessChord

Example 2: https://repl.it/repls/ResponsibleVariableTheories

The logic is as follows: the method str.translate() returns a copy of the string in which each character has been mapped through the translation table built by str.maketrans().
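As a minimal sketch of how that table works (the sample string is made up for illustration): in its three-argument form, str.maketrans() maps every character of the third argument to None, and str.translate() drops those characters.

```python
from string import punctuation

# Three-argument form: the two empty strings define no replacements;
# every character in the third string is mapped to None, i.e. deleted.
table = str.maketrans('', '', punctuation)

# The table is a dict keyed by code point.
print(table[ord(',')])  # None

sample = 'the, "end"; (really)!'  # made-up sample
cleaned = sample.translate(table)
print(cleaned)  # the end really
```

Building the table once and reusing it for every line avoids recreating it on each call.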

2

Author: Augusto Vasques, 2020-03-31 14:47:12

score 2 · Accepted Answer

A first idea is to split not only on spaces, but on any character that is not part of a word:

from collections import Counter
import re

r = re.compile(r'\W+')
c = Counter()
with open("1342-0.txt", encoding='utf8') as f:
    for linha in f:
        for word in r.split(linha):
            if word:  # split() leaves empty strings at line boundaries; skip them
                c.update([word])

print(c)

The shorthand \W means "anything that is not a letter, a digit, or the character _". Since the text has "words" like _she_, these are counted as different words from she. Digits also count, so numbers (like 1) are treated as "words" and counted too.

As I find each word, I update the Counter with its update method: if the key does not exist yet, it is created with the value 1, and if it already exists, 1 is added to its value. In the end we have the total count for each word.
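A small sketch of that behaviour, using a made-up list of words. Note that update expects an iterable of items, which is why the word is wrapped in a list:

```python
from collections import Counter

c = Counter()
for word in ["she", "he", "she"]:
    c.update([word])  # a one-element list: adds 1 to this word's count

print(c["she"], c["he"])  # 2 1

# Passing the bare string would iterate over it and count its characters.
letters = Counter()
letters.update("she")
print(letters)  # Counter({'s': 1, 'h': 1, 'e': 1})

# A missing key reads as 0 instead of raising KeyError.
print(c["they"])  # 0
```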

Another detail is that read() loads the entire contents of the file into memory at once, which can be a problem depending on the file size. The code above reads one line at a time instead. I'm assuming that no word starts on one line and ends on the next, although in that case read and split would not treat it as a single word either.

If you don't want to include _ as part of a word, just change the regex to:

r = re.compile(r'[\W_]+')
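To see the difference between the two patterns, here is a short comparison on a made-up line in the style of the book:

```python
import re

line = "and _she_ said so\n"  # made-up line

print(re.split(r'\W+', line))     # ['and', '_she_', 'said', 'so', '']
print(re.split(r'[\W_]+', line))  # ['and', 'she', 'said', 'so', '']
```

The empty string at the end comes from the newline that closes the line: split() returns whatever is left after the last separator, even when nothing is left. It is worth filtering out if it should not be counted as a word.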

The problem is that there are also hyphenated words, such as over-scrupulous. The code above treats them as two different words ("over" and "scrupulous"). If you want them to count as one word, the code has to change a little:

from collections import Counter
import re

r = re.compile(r'\b\w+(?:-\w+)*\b')
c = Counter()
with open("1342-0.txt", encoding='utf8') as f:
    for linha in f:
        for word in r.findall(linha):
            c.update([word])

print(c)

Now I use \w+ (one or more characters that form a word), followed by a group containing a hyphen and \w+, and this whole group can repeat zero or more times. This way I also pick up words with one or more hyphens.
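A quick check of the pattern on a made-up line, compared with plain \w+:

```python
import re

line = "she is over-scrupulous, a well-to-do mother-in-law\n"  # made-up line

plain = re.findall(r'\w+', line)
hyphenated = re.findall(r'\b\w+(?:-\w+)*\b', line)

print(plain)
# ['she', 'is', 'over', 'scrupulous', 'a', 'well', 'to', 'do', 'mother', 'in', 'law']
print(hyphenated)
# ['she', 'is', 'over-scrupulous', 'a', 'well-to-do', 'mother-in-law']
```

Because the group is non-capturing, (?:...), findall returns the whole match rather than just the last hyphenated part.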

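If you do not want to include _ as part of a word, use the pattern below. The \b anchors are dropped here because \b treats _ as a word character, so with them a word wrapped in underscores, like _she_, would not be matched at all:

r = re.compile(r'[^\W_]+(?:-[^\W_]+)*')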
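Once the words are extracted, the statistics listed in the question can be derived from them. The following is only a rough sketch under stated assumptions: the names word_re, text and top10 are hypothetical, a short made-up text stands in for the file, the pattern is written without \b (which treats _ as a word character), and a sentence is naively assumed to end at ".", "!" or "?".

```python
from collections import Counter
import re

# Word pattern without "_" (and without \b, which treats "_" as a word character).
word_re = re.compile(r'[^\W_]+(?:-[^\W_]+)*')

# Made-up stand-in for the contents of the book file.
text = "You are over-scrupulous, surely. He will be very glad to see you."

words = word_re.findall(text)
freq = Counter(words)            # frequency of each word
total_chars = len(text)          # every character, spaces included
avg_word_len = sum(map(len, words)) / len(words)

# Naive assumption: a sentence ends at ".", "!" or "?".
sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
avg_sentence_len = len(words) / len(sentences)  # in words per sentence

# Longest distinct words first; ties are broken alphabetically.
top10 = sorted(freq, key=lambda w: (-len(w), w))[:10]

print(freq.most_common(3))
print(total_chars, avg_word_len, avg_sentence_len)
print(top10)
```

On the real book, abbreviations such as "Mr." would break this naive sentence split, and "You" and "you" are counted separately unless the text is lowercased first.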

It is worth remembering that string.punctuation only contains the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. Any punctuation character outside that set will not be removed from the text.

An example is the character “ (present in the text), which is not the same as ". They are different quotes: the first is LEFT DOUBLE QUOTATION MARK (U+201C) and the second is QUOTATION MARK (U+0022). If you use punctuation, only the second will be removed.
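The difference can be checked with the unicodedata module. The second half of the sketch shows one possible workaround, which is an assumption on my part and not part of the answer above: dropping every character whose Unicode general category starts with "P". The sample sentence uses escapes for the curly quotes.

```python
import unicodedata
from string import punctuation

for ch in ('"', '\u201c'):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch in punctuation)
# U+0022 QUOTATION MARK True
# U+201C LEFT DOUBLE QUOTATION MARK False

text = '\u201cMy dear Mr. Bennet,\u201d said his lady'

# string.punctuation removes "." and "," but leaves the curly quotes.
ascii_only = text.translate(str.maketrans('', '', punctuation))
print(ascii_only)  # “My dear Mr Bennet” said his lady

# Possible workaround (assumption): drop every character in a "P*" category.
# Note that symbols such as "$" or "+" are in "S*" categories and would stay.
no_punct = ''.join(
    ch for ch in text if not unicodedata.category(ch).startswith('P')
)
print(no_punct)  # My dear Mr Bennet said his lady
```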