How do I determine if a file is binary or text?

Given an arbitrary file, I need to write a program (C / C++) that determines whether it is text or binary. Is there an existing algorithm for this?

What criteria are appropriate to use?

Author: tutankhamun, 2012-07-05

3 answers

A text file is just a kind of binary file with a particular data format. To know that format, you need to know how the file was written. Once it has been written, the file is nothing more than an anonymous sequence of bytes to the program that reads it. Nobody has come up with an unambiguous standard. There are attempts, such as file extensions, to tell the reading program how the data was written, but these conventions differ between operating systems.

You can try the same approach that is used to detect a document's encoding: analyze the content.

  1. For 8-bit encodings it is simple: look for non-printable character codes.
  2. For UTF-8 and other multi-byte encodings the task is somewhat harder.
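For the UTF-8 case, one possible content check is to test whether the bytes form well-formed UTF-8 at all, since random binary data rarely does. The sketch below is only an illustration under that assumption; the function name is_valid_utf8 is made up for this example and does not come from any library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (name chosen for this example): returns true if
// `data` is well-formed UTF-8. Rejects stray continuation bytes, truncated
// sequences, overlong forms, surrogates and code points above U+10FFFF.
bool is_valid_utf8(const std::string& data) {
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(data[i]);
        std::size_t extra = 0;     // continuation bytes expected after b0
        unsigned long cp = 0;      // decoded code point
        unsigned long min_cp = 0;  // smallest code point valid for this length

        if (b0 < 0x80)                { extra = 0; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (extra > n - i - 1) return false;  // sequence cut off at the end

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(data[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
        i += extra + 1;
    }
    return true;
}
```

Note that pure ASCII also passes this check, so it is best combined with a search for control characters. If only the beginning of a file is examined, a multi-byte sequence may be cut at the end of the buffer and be reported as invalid.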
 7
Author: carapuz, 2012-07-05 22:03:49

Thank you to all who responded.

On Linux there is the file command, which solves this problem. Is there an equivalent for Windows? I tried downloading the source code and compiling it, but some header files are always missing.

Fine Free File Command.

 1
Author: Kalash, 2012-07-10 10:33:25

Of course, it cannot be determined unambiguously, but a text file will certainly not contain characters such as #0 (a byte with the value zero). Other non-printable characters are trickier: #13 and #10 are the carriage return and line feed characters, and they appear in ordinary text.
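A minimal sketch of that check, assuming an 8-bit encoding and that looking at the first few kilobytes of the file is enough. The names read_prefix and has_suspicious_bytes, the 8192-byte limit and the exact set of tolerated control characters are all assumptions made for this example.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: read at most `limit` bytes from the start of a file.
// Returns an empty string if the file cannot be opened.
std::string read_prefix(const std::string& path, std::size_t limit = 8192) {
    std::ifstream in(path, std::ios::binary);
    std::string buf(limit, '\0');
    in.read(&buf[0], static_cast<std::streamsize>(limit));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// Hypothetical helper: true if the buffer contains a zero byte or a control
// character that plain text normally does not use. Tab, line feed, form feed
// and carriage return are tolerated (an assumption, adjust as needed).
bool has_suspicious_bytes(const std::string& data) {
    for (unsigned char c : data) {
        if (c == 0) return true;  // #0 is the strongest hint of binary data
        const bool allowed = (c == '\t' || c == '\n' || c == '\f' || c == '\r');
        if ((c < 0x20 && !allowed) || c == 0x7F) return true;
    }
    return false;
}
```

A call such as has_suspicious_bytes(read_prefix("some_file")) would then flag a likely binary file. Text stored as UTF-16 or UTF-32 contains zero bytes and would be misclassified by this check.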

I think an algorithm like this would be optimal:

  1. Look for the null character.
  2. Look for certain non-printable characters (but not all of them!).
  3. Count characters such as spaces and line endings (carriage return + line feed); in text their share differs from that of the other characters.
  4. Heavy artillery: look at how often the different byte values occur. In fairly large binary files the distribution is roughly uniform, while in text files some letters occur more often than others.
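Steps 3 and 4 can be sketched as two small measurements over a buffer of bytes. Shannon entropy is used here as one way to express how even the byte distribution is; that choice, and the names whitespace_share and byte_entropy, are assumptions of this example rather than part of the answer. No thresholds are given, since they would have to be tuned on real files.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <string>

// Hypothetical helper: fraction of bytes that are spaces, tabs, CR or LF.
// Returns 0.0 for an empty buffer.
double whitespace_share(const std::string& data) {
    if (data.empty()) return 0.0;
    std::size_t ws = 0;
    for (unsigned char c : data) {
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n') ++ws;
    }
    return static_cast<double>(ws) / static_cast<double>(data.size());
}

// Hypothetical helper: Shannon entropy of the byte histogram, in bits per
// byte. 0.0 means a single repeated value, 8.0 a perfectly even distribution.
double byte_entropy(const std::string& data) {
    if (data.empty()) return 0.0;
    std::array<std::size_t, 256> counts{};  // zero-initialized histogram
    for (unsigned char c : data) ++counts[c];

    const double n = static_cast<double>(data.size());
    double h = 0.0;
    for (std::size_t k : counts) {
        if (k == 0) continue;  // unused byte values contribute nothing
        const double p = static_cast<double>(k) / n;
        h -= p * std::log2(p);
    }
    return h;
}
```

An entropy close to 8 bits per byte suggests compressed or encrypted data, while natural-language text usually scores clearly lower. Many binary formats (for example, ones with long runs of zeros) also score low, so this measure is a hint and not a proof.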

If Unicode is not involved, the results of applying this should be very good (~95%).

 1
Author: Алексей Лобанов, 2012-07-10 10:34:16