In general, two basic forms of text encoding are widely used. One is to use a markup language, which adds markers to the text itself. Markup has the advantage of being easy to represent, but the disadvantage of being hard to view without an "aware" reader application. For instance, if an HTML document is opened in a text editor, it is largely readable, but the text is cluttered with codes (even more so in the case of a table), and character references for special characters may make parts unreadable, at least to those unfamiliar with the format. The other method is to use "pointers" into the text, which is left in its original format. This has the advantage of keeping the content easily readable in any editor, although the "styling" is lost. On the downside, editing such a document in a non-aware application typically leaves the pointers pointing to the wrong data. Today the majority of text encoding systems appear to use markup, although whether by choice or simply because "everyone else does" is open to question.
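The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(start, end, style)` tuple layout and the `styled_spans` helper are hypothetical names chosen for this example, not part of any particular encoding system.

```python
# Inline markup: the markers live inside the text itself.
marked_up = "The <i>quick</i> brown fox"

# Pointer-based encoding (hypothetical layout): the text stays plain,
# and a separate list records (start, end, style) character offsets.
plain = "The quick brown fox"
pointers = [(4, 9, "italic")]


def styled_spans(text, pointers):
    """Return the substring that each pointer currently selects."""
    return [(text[start:end], style) for start, end, style in pointers]


print(styled_spans(plain, pointers))   # [('quick', 'italic')]

# Editing the text in a "non-aware" editor shifts the characters
# but not the offsets, so the pointer now selects the wrong data.
edited = "Look! " + plain
print(styled_spans(edited, pointers))  # [('! The', 'italic')]
```

The plain string stays readable in any editor, but the second call shows why an unaware edit breaks the pointers: the offsets still say 4 to 9, while the word they were meant to mark has moved.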
Though character encodings like ASCII and Unicode are not, strictly speaking, text encodings in their own right, they may serve as very simple text encodings if one wishes only to preserve the English content of a document and not necessarily its formatting. By far the most common text encoding now in use is what might informally be called "Plain ASCII", which simply encodes a text as a stream of ASCII characters. The specifics vary greatly: for example, the end of a text line might be encoded as ASCII code 10 ("line feed" or "new line"), as is common practice on Unix machines; as ASCII code 13 ("carriage return"), as is common on Apple machines; or as both (the sequence <13, 10> ends lines on DOS-based machines and many others, while the rather rare sequence <10, 13> was used by some Acorn machines). Some texts also use this line-end sequence inside paragraphs (with a blank line between paragraphs), while others do not. Various texts in this form also interpret code 9 ("tab") and other control characters differently. None of these conventions specifies how to identify text structure such as headings and tables, or special text forms such as italics. Text in this format is readable by essentially any computer, though some work might be needed to accommodate local variations, and all information besides the actual words of the text is lost.
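As an example of the work needed to accommodate local variations, the line-end conventions listed above can be folded into a single one. The following is a minimal sketch, not a definitive implementation: the function name `normalize_line_ends` is hypothetical, and the choice of code 10 as the target is an assumption made for this example. Mixed-convention files are inherently ambiguous, so the left-to-right matching here is a heuristic.

```python
import re

# Try the two-byte sequences first so that <13, 10> and <10, 13>
# each count as one line end rather than two.
_LINE_END = re.compile(rb"\r\n|\n\r|\r|\n")


def normalize_line_ends(data: bytes) -> bytes:
    """Rewrite every recognised line-end sequence as a single LF (code 10)."""
    return _LINE_END.sub(b"\n", data)


# One line end of each kind: <13, 10>, 13, 10, and <10, 13>.
sample = b"a\r\nb\rc\nd\n\re"
print(normalize_line_ends(sample))  # b'a\nb\nc\nd\ne'
```

Working on bytes rather than decoded strings keeps the sketch independent of the character encoding, which is reasonable here because codes 10 and 13 are single bytes in ASCII.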