git rekt — mgba.git (a27cb6c040a0515d004456c20a0842ec2bcbf652): src/third-party/zlib/doc/txtvsbin.txt

src/third-party/zlib/doc/txtvsbin.txt (view raw)
  1A Fast Method for Identifying Plain Text Files
  2==============================================
  3
  4
  5Introduction
  6------------
  7
  8Given a file coming from an unknown source, it is sometimes desirable
  9to find out whether the format of that file is plain text.  Although
 10this may appear like a simple task, a fully accurate detection of the
 11file type requires heavy-duty semantic analysis on the file contents.
 12It is, however, possible to obtain satisfactory results by employing
 13various heuristics.
 14
 15Previous versions of PKZip and other zip-compatible compression tools
 16were using a crude detection scheme: if more than 80% (4/5) of the bytes
 17found in a certain buffer are within the range [7..127], the file is
 18labeled as plain text, otherwise it is labeled as binary.  A prominent
 19limitation of this scheme is the restriction to Latin-based alphabets.
 20Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
 21the bytes within the range [128..255], and texts using these alphabets
 22are most often misidentified by this scheme; in other words, the rate
 23of false negatives is sometimes too high, which means that the recall
 24is low.  Another weakness of this scheme is a reduced precision, due to
 25the false positives that may occur when binary files containing large
 26amounts of textual characters are misidentified as plain text.
 27
 28In this article we propose a new, simple detection scheme that features
 29a much increased precision and a near-100% recall.  This scheme is
 30designed to work on ASCII, Unicode and other ASCII-derived alphabets,
 31and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
 32and variable-sized encodings (ISO-2022, UTF-8, etc.).  Wider encodings
 33(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.
 34
 35
 36The Algorithm
 37-------------
 38
 39The algorithm works by dividing the set of bytecodes [0..255] into three
 40categories:
 41- The white list of textual bytecodes:
 42  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
 43- The gray list of tolerated bytecodes:
 44  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
 45- The black list of undesired, non-textual bytecodes:
 46  0 (NUL) to 6, 14 to 31.
 47
 48If a file contains at least one byte that belongs to the white list and
 49no byte that belongs to the black list, then the file is categorized as
 50plain text; otherwise, it is categorized as binary.  (The boundary case,
 51when the file is empty, automatically falls into the latter category.)
 52
 53
 54Rationale
 55---------
 56
 57The idea behind this algorithm relies on two observations.
 58
 59The first observation is that, although the full range of 7-bit codes
 60[0..127] is properly specified by the ASCII standard, most control
 61characters in the range [0..31] are not used in practice.  The only
 62widely-used, almost universally-portable control codes are 9 (TAB),
 6310 (LF) and 13 (CR).  There are a few more control codes that are
 64recognized on a reduced range of platforms and text viewers/editors:
 657 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
 66codes are rarely (if ever) used alone, without being accompanied by
 67some printable text.  Even the newer, portable text formats such as
 68XML avoid using control characters outside the list mentioned here.
 69
 70The second observation is that most of the binary files tend to contain
 71control characters, especially 0 (NUL).  Even though the older text
 72detection schemes observe the presence of non-ASCII codes from the range
 73[128..255], the precision rarely has to suffer if this upper range is
 74labeled as textual, because the files that are genuinely binary tend to
 75contain both control characters and codes from the upper range.  On the
 76other hand, the upper range needs to be labeled as textual, because it
 77is used by virtually all ASCII extensions.  In particular, this range is
 78used for encoding non-Latin scripts.
 79
 80Since there is no counting involved, other than simply observing the
 81presence or the absence of some byte values, the algorithm produces
 82consistent results, regardless what alphabet encoding is being used.
 83(If counting were involved, it could be possible to obtain different
 84results on a text encoded, say, using ISO-8859-16 versus UTF-8.)
 85
 86There is an extra category of plain text files that are "polluted" with
 87one or more black-listed codes, either by mistake or by peculiar design
 88considerations.  In such cases, a scheme that tolerates a small fraction
 89of black-listed codes would provide an increased recall (i.e. more true
 90positives).  This, however, incurs a reduced precision overall, since
 91false positives are more likely to appear in binary files that contain
 92large chunks of textual data.  Furthermore, "polluted" plain text should
 93be regarded as binary by general-purpose text detection schemes, because
 94general-purpose text processing algorithms might not be applicable.
 95Under this premise, it is safe to say that our detection method provides
 96a near-100% recall.
 97
 98Experiments have been run on many files coming from various platforms
 99and applications.  We tried plain text files, system logs, source code,
100formatted office documents, compiled object code, etc.  The results
101confirm the optimistic assumptions about the capabilities of this
102algorithm.
103
104
105--
106Cosmin Truta
107Last updated: 2006-May-28
all repos — mgba @ a27cb6c040a0515d004456c20a0842ec2bcbf652

mGBA Game Boy Advance Emulator