
Text encoding


A text encoding is a method of representing a piece of text as a sequence of codes (from a character encoding) for the purpose of computer storage or electronic communication of that text. While character encodings like ASCII represent the individual characters of a language, a text encoding has to represent much larger units such as articles and books. It must represent not only the characters they contain but also the structure and organization of the text, and perhaps information about the text or its appearance. Common examples are HTML and RTF, which represent texts in natural languages, and XML, which can represent many kinds of text not necessarily intended to be human-readable (the contents of a database, for example).

In general there are two basic forms of text encoding that are widely used. One is to use a markup language, which adds markers to the text itself. Markup has the advantage of being easy to represent, but has the disadvantage of being hard to view without an "aware" reader application. For instance, if an HTML document is opened in a text editor, it is largely readable, but the text is cluttered with codes (even more so in the case of a table), and there are character references for special characters which may make parts unreadable, at least to those unfamiliar with the format. Another method is to use "pointers" into the text, which is left in its original format. This has the advantage of allowing the content to be easily readable in any editor, although the "styling" is lost. On the downside, editing such a document in a non-aware application typically leaves the pointers pointing to the wrong data. Today the majority of text encoding systems appear to use markup, although whether by choice or simply because "everyone else does" is open to question.
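The difference between the two forms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample sentence, the `em` label, the `(start, end, label)` tuple layout and the helper name `apply_annotations` are all hypothetical and do not come from any particular encoding system.

```python
# Hypothetical sample: the same emphasis expressed in both forms.
text = "Text encoding is hard."

# Form 1: inline markup. The markers live inside the text itself.
marked_up = "Text encoding is <em>hard</em>."

# Form 2: "pointers". The text is left untouched and a separate list
# records (start, end, label) character offsets into it.
annotations = [(17, 21, "em")]


def apply_annotations(text, annotations):
    """Render pointer-style annotations as inline tags.

    Assumes the spans do not overlap; this is a sketch, not a general
    solution for nested or crossing annotations.
    """
    pieces = []
    pos = 0
    for start, end, label in sorted(annotations):
        pieces.append(text[pos:start])
        pieces.append(f"<{label}>{text[start:end]}</{label}>")
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)


rendered = apply_annotations(text, annotations)

# Editing the text in a "non-aware" application shifts every character,
# so the stored offsets no longer select the intended word.
edited = "Plain " + text
stale = edited[17:21]
```

Here `rendered` matches the inline-markup version, while `stale` no longer contains the word "hard", which is the pointer-drift problem described above.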

Though character encodings like ASCII and Unicode are not, strictly speaking, text encodings in their own right, they may serve as very simple text encodings if one wishes only to preserve the plain textual content of a document and not necessarily its formatting. By far the most common text encoding now in use is what might informally be called "Plain ASCII", which involves simply encoding a text as a stream of ASCII characters. The specifics of how this is done vary greatly: for example, the end of a text line might be encoded as ASCII code 10 ("line feed" or "new line"), as is common practice on Unix machines, or as ASCII code 13 ("carriage return"), as was common on older Apple machines, or as both (the sequence <13, 10> is used to end lines on DOS-based machines and many others, while the rather rare sequence <10, 13> was used by some Acorn machines). Some texts also use this line-end sequence inside paragraphs (with a blank line between paragraphs) while some do not. Also, various texts in this form interpret code 9 ("tab") and other control characters differently. None of these methods specifies how to identify text structure such as headings and tables, or special text forms such as italics. Text in this format is basically readable by any computer, though some work might be needed to accommodate local variations, and all information besides the actual words of the text will be lost.
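The "work needed to accommodate local variations" often amounts to normalizing line ends. The following is a minimal sketch, assuming the input is a byte string of Plain ASCII text; the function name `normalize_line_endings` is hypothetical. It treats each of the four sequences mentioned above as one line end and rewrites it as a single code 10. A file that mixes conventions is inherently ambiguous, so the result for such input is only one possible reading.

```python
import re

# Two-byte sequences are listed first so that <13, 10> and <10, 13>
# are each consumed as ONE line end rather than two.
_LINE_END = re.compile(rb"\r\n|\n\r|\r|\n")


def normalize_line_endings(data: bytes) -> bytes:
    """Rewrite every recognized line-end sequence as a single LF (code 10)."""
    return _LINE_END.sub(b"\n", data)


unix_text = b"one\ntwo\n"        # code 10
apple_text = b"one\rtwo\r"       # code 13
dos_text = b"one\r\ntwo\r\n"     # sequence <13, 10>
acorn_text = b"one\n\rtwo\n\r"   # sequence <10, 13>

normalized = [
    normalize_line_endings(t)
    for t in (unix_text, apple_text, dos_text, acorn_text)
]
```

After this step all four samples hold the same bytes, so later processing only has to deal with one convention. Tabs and other control characters are deliberately left alone, since their interpretation varies from text to text.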




