Python 3: how to open and read data from office documents (doc, odt, etc.)

Question

How can I open and read data from office files such as odt, doc, docx, and rtf in Python 3? At least odt.

I know that odt and docx files are essentially archives, so in theory you can unpack them and look at the content.xml file (if I'm not mistaken), but maybe there are more modern or convenient ways.
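The unpacking idea above can be sketched with the standard library alone. This is a minimal sketch, assuming the file is a well-formed ODF text document whose body is stored in content.xml; the helper name odt_paragraphs and the choice to collect only text:p and text:h elements are illustrative assumptions, not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace that ODF uses for text elements (paragraphs, headings).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_paragraphs(path):
    """Return the text of each paragraph/heading in an .odt file (hypothetical helper)."""
    # An .odt file is a zip archive; the document body is in content.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    paragraphs = []
    for element in root.iter():
        if element.tag in (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"):
            # itertext() also picks up text nested in spans, links, etc.
            paragraphs.append("".join(element.itertext()))
    return paragraphs
```

Whitespace elements such as text:s and text:tab are not expanded in this sketch, so runs of spaces and tabs may be lost. A docx file could be handled the same way, but its body is in word/document.xml and uses different namespaces.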

Everything I found is about creating ods spreadsheets.

I found the uno and pyoo modules, but everywhere they are described only in terms of creating spreadsheets; I could not find how to get data out of office documents.

The task is to go through all the files in a directory and its subdirectories, find or analyze what is needed, and write the result to a separate file.

Right now this is partially implemented in bash; I want to rewrite everything in Python 3.
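The traversal described above can be sketched as follows. This is only a sketch under assumptions: the names analyze_tree and extract_text, the tab-separated report format, and the word count used as the "analysis" are all hypothetical placeholders, since the question does not say what exactly has to be found in each file.

```python
from pathlib import Path


def analyze_tree(root, report_path, extract_text, suffixes=(".odt",)):
    """Walk root recursively, run extract_text on each matching file,
    and write one report line per file. Returns the number of files read."""
    root = Path(root)
    count = 0
    with open(report_path, "w", encoding="utf-8") as report:
        # rglob("*") visits the directory and all of its subdirectories.
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix.lower() not in suffixes:
                continue
            try:
                text = extract_text(path)
            except Exception as error:  # keep going if one file is unreadable
                report.write(f"{path}\tERROR\t{error}\n")
                continue
            # Placeholder analysis: count whitespace-separated words.
            report.write(f"{path}\t{len(text.split())} words\n")
            count += 1
    return count
```

Here extract_text is any callable that takes a path and returns the document text as a string, so the reading logic for each format can be swapped in without touching the traversal.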

Please tell me how to do this, or show me where to look.

4

parser python-3.x open-office

Author: Трезвый, 2016-09-06

2 answers

Well, here are a couple of libraries offhand:

Of course, working through the OpenOffice services is the more proper way, the way of the samurai, but for that you need at least a "headless" OpenOffice, and it may not be installed. In addition, I suspect the OpenOffice services will disappoint in terms of performance when processing a large number of files, although you do get the full functionality.

By the way, keep in mind that when using OpenOffice you will have to follow the documentation for the Java API and adapt it to Python.

2

Author: tutankhamun, 2016-09-06 20:20:21

score 0 · Accepted Answer

I will write this up as an answer so that nobody has to dig through the comments. I hope tutankhamun does not mind; if you do, add it to your own answer and I will delete mine.

So, the problem was solved with the ezodf module (there is not much documentation for it). When installing, be careful if you have both Python 2 and Python 3 on the system; for Python 3 I installed it with python3 setup.py install.

A small code example, for clarity:

import re
import ezodf

odt = ezodf.opendoc('/home/user/python/text.odt')
words = []
# Loop over everything found in the body of the file
for element in odt.body:
  if element.text is None:
    # Some elements carry no text at all
    print('no')
  else:
    # Lowercase the text and collect every word from it
    words.extend(re.findall(r"[\w']+", element.text.lower()))

I will explain why I used element.text instead of element.plaintext(): it was to catch several elements whose text is None (apparently some kind of service data, I did not look into it). plaintext() simply adds empty elements to the list, and at the time it seemed to me that using text would be faster, but I may rethink that in the morning.

As for words.extend(re.findall(r"[\w']+", element.text.lower())): it appends the found words to the existing list, thereby extending it. The text is lowercased first, then the regular expression selects every word in it, so each word from the document becomes an item in the list, and that's it.
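The word-collecting step can be tried on plain strings, without any document. A minimal sketch, where the names collect_words and WORD_RE are illustrative and the input strings stand in for the text of document elements (None plays the role of an element without text):

```python
import re

# Same pattern as in the answer: runs of word characters and apostrophes.
WORD_RE = re.compile(r"[\w']+")


def collect_words(chunks):
    """Lowercase each non-empty chunk and gather all of its words into one list."""
    words = []
    for chunk in chunks:
        if chunk is None:
            # Skip elements that carry no text.
            continue
        # extend() adds each found word as a separate list item.
        words.extend(WORD_RE.findall(chunk.lower()))
    return words


print(collect_words(["Hello, world!", None, "It's a test."]))
# ['hello', 'world', "it's", 'a', 'test']
```

Using extend rather than append keeps the result a flat list of words instead of a list of per-element lists.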

This is only a fragment, so it may not look very polished and there is a lot that could be added, but at least it is now clear how to read documents.

Thanks to tutankhamun for the hints.