



Wubi, short for Wubizixing (五笔字型, pinyin: wu3 bi3 zi4 xing2), is an input method for writing Simplified Chinese text on a computer.

The Wubi method is based on the structure of characters rather than their pronunciation. This makes it possible to input unfamiliar characters, and it means the method is not closely tied to any particular Chinese dialect. It is also extremely efficient: every character that you would want to write can be written with at most 5 keystrokes (up to 4 letters, plus a final space or digit to confirm the character). In practice, most characters can be written with fewer. There are reports of experienced typists reaching 160 wpm with Wubi. What this means in the context of Chinese is not entirely clear, as words are an ill-defined unit in such a largely isolating language, but Wubi is certainly extremely fast in the hands of an experienced typist. The main reason is that, unlike with traditional phonetic input methods, one does not have to spend time selecting the desired character from a list of homophones: virtually all characters have a unique representation.

In this article, we will use the following convention: character will always mean Chinese character, whereas letter, key and keystroke will always refer to the keys on your keyboard.

1 How it works

Essentially, a character is broken down into components, which usually (but not always) are the same as radicals. Each component is typed with a single key, and the components are typed in the order in which they would be written by hand. To ensure that extremely complex characters do not require an inordinate number of keystrokes, any character containing more than 4 components is entered by typing the first 3 components written, followed by the last. In this way, every character can be entered with at most 4 keystrokes.
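The truncation rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here: the function name `component_keys` and the placeholder component labels are assumptions, and no real component-to-key assignments are used.

```python
def component_keys(components):
    """Return the components that actually get typed for one character.

    `components` lists the character's components in handwriting order.
    Up to 4 components are typed as-is; with more than 4, only the
    first 3 and the very last one are typed.
    """
    components = list(components)
    if len(components) <= 4:
        return components
    return components[:3] + [components[-1]]


# Placeholder labels, not real Wubi components.
print(component_keys(["c1", "c2", "c3", "c4", "c5", "c6"]))
```

A character with six components is therefore typed with the same number of keys as one with four.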

Wubi distributes its characters very evenly, and as such the vast majority of characters are uniquely defined by the 4 keystrokes discussed above. One then types a space to move the character from the input buffer onto the screen. If the 4-letter representation of the character is not unique, one types a digit to select the relevant character (for example, if two characters have the same representation, typing 1 selects the first and 2 the second). In most implementations, a space can always be typed, and in an ambiguous setting it simply means 1. Intelligent software will try to make sure that the character in the default position is the one you want.
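As a rough sketch of this commit step, the function below takes the letters in the input buffer and the confirming key. The table contents, the codes and the name `commit` are all hypothetical and chosen only for illustration; they are not real Wubi codes.

```python
# Hypothetical code table: each code maps to its candidate list,
# with the default candidate first. These are not real Wubi codes.
TABLE = {
    "abcd": ["X"],         # unique representation
    "efgh": ["Y1", "Y2"],  # two characters share one representation
}


def commit(code, key, table=TABLE):
    """Return the character selected by `key`, or None if nothing matches.

    A space selects the first candidate; a digit n selects the n-th one.
    """
    candidates = table.get(code, [])
    if key == " ":
        index = 0
    elif key.isdigit():
        index = int(key) - 1
    else:
        return None
    if 0 <= index < len(candidates):
        return candidates[index]
    return None


print(commit("abcd", " "))  # unique code, confirmed with a space
print(commit("efgh", "2"))  # ambiguous code, second candidate chosen
```

Keeping the most likely character first in each candidate list is what lets a space work as the default in the ambiguous case.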

Many characters have more than one representation. Sometimes this is for ease of use, in case there is more than one obvious way to break down a character. More often, though, it is because certain characters have a short representation of fewer than 4 letters as well as a "full" representation.

For characters with fewer than 4 components that do not have a short form representation, one types each component and then "fills up" the representation (that is, types enough extra keystrokes to bring the representation to 4 keystrokes) by manually typing the strokes of the last component, in the order they would be written. If there are too many strokes, type as many as you can, but put the last stroke last (this mirrors the component rule for characters with more than 4 components outlined above).
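The fill-up rule can be sketched as follows, taking the description above at face value. The function name `fill_up` is an assumption, and the letters standing in for component keys and stroke keys are placeholders rather than real key assignments.

```python
def fill_up(component_keys, stroke_keys, length=4):
    """Pad a short representation with the strokes of the last component.

    `component_keys` holds one key per component (fewer than `length`).
    `stroke_keys` holds one key per stroke of the last component, in
    writing order. If the strokes do not all fit, the leading strokes
    are kept and the final slot is reserved for the last stroke.
    """
    slots = length - len(component_keys)
    if slots <= 0:
        return component_keys[:length]
    if len(stroke_keys) <= slots:
        return component_keys + stroke_keys
    return component_keys + stroke_keys[:slots - 1] + stroke_keys[-1]


# Two components ("a", "b") and a last component with strokes x, y, z:
# only two slots remain, so the first stroke and the last stroke are used.
print(fill_up("ab", "xyz"))
```

Strings are used here so that each character stands for one keystroke; lists of keys would work the same way with list concatenation.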

This sounds very complex, but it is actually pretty easy to learn. If you don't understand any of these methods, the examples below might help. Essentially, though, once you understand the algorithm, you can (with a little practice) type pretty much any character you see, even if you haven't typed it before. If you type often, muscle memory will make sure you don't have to think about how the characters are actually constructed, just as the vast majority of English typists don't think very much about the spelling of words when they write.

2 Implementation specific details

Many implementations employ further, multiple-word optimizations. Usually, a commonly used digraph (two-character word) in which both characters have short form two-keystroke representations can be combined into a single, four-keystroke representation which generates two characters rather than one. There are also a few 3-character shortcuts, and even one rather longer, politically motivated one. Some examples of these are provided in the examples section below.
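One possible reading of the two-character optimization is sketched below. It assumes that the combined representation is simply the two short forms typed one after the other, which the text above does not state explicitly; the names `SHORT_FORMS` and `phrase_code` and all codes are hypothetical.

```python
# Hypothetical two-keystroke short forms; "P" and "Q" are placeholders
# for Chinese characters, and the codes are not real Wubi codes.
SHORT_FORMS = {"P": "ab", "Q": "cd", "R": "e"}


def phrase_code(first, second, short_forms=SHORT_FORMS):
    """Return a four-keystroke code for a two-character word, or None.

    Assumption: the combined code is the concatenation of the two
    characters' two-keystroke short forms.
    """
    a = short_forms.get(first)
    b = short_forms.get(second)
    if a is None or b is None or len(a) != 2 or len(b) != 2:
        return None
    return a + b


print(phrase_code("P", "Q"))
```

An implementation along these lines would add the resulting code to its lookup table with the whole two-character word as the candidate.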

Another common feature is the use of the 'z' key as a wildcard. The Wubi method was actually designed with this feature in mind; this is why no components are assigned to the z key. Basically, you can type a z when you aren't sure what the next component should be, and the input method will help you complete the character. If you knew, for example, that the character ought to start with "kt", but were at a loss as to what the next component should be, typing "ktz" would produce a list of all characters starting with "kt". In practice, though, many input method engines use a generic tabular lookup for all table-based input systems, including Wubi. This means that they simply keep a large table in memory that associates characters with their respective representations, and the input method becomes a plain table lookup. In such an implementation the z key breaks the paradigm, so the feature is missing from a lot of generalized software (although the Wubi input method commonly found in Chinese Windows implements it). For the same reason, the multiple-character optimization described in the previous paragraph is also relatively rare.
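A minimal sketch of the "ktz" behaviour on top of a plain lookup table might look like this. It treats everything before the first z as a prefix, which matches the example above but may not match how any particular implementation handles z; the table contents and the name `z_lookup` are hypothetical.

```python
# Hypothetical table: "P", "Q" and "R" stand in for characters and the
# codes are invented for the example.
TABLE = {
    "ktab": ["P"],
    "ktc": ["Q"],
    "kxab": ["R"],
}


def z_lookup(typed, table=TABLE):
    """List every character whose code starts with the keys typed before 'z'.

    Relies on no real code containing 'z', since no component is
    assigned to that key.
    """
    prefix = typed.split("z", 1)[0]
    matches = []
    for code in sorted(table):
        if code.startswith(prefix):
            matches.extend(table[code])
    return matches


print(z_lookup("ktz"))
```

The scan over all codes is what separates this from an exact table lookup, which is one way to see why generic table-driven engines tend to leave the feature out.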

Some input methods, such as xcin (found on many UNIX-like systems), provide generic wildcard functionality which can be used in all table-based input systems, including pinyin and virtually anything else. Xcin uses '*' for auto-complete and '?' for just one letter, following the conventions pioneered in UNIX file globbing. Other implementations probably have their own conventions.
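Glob-style wildcards of this kind are easy to try out with Python's standard `fnmatch` module. The sketch below is not xcin's code and says nothing about how xcin implements the feature; the table and the name `glob_lookup` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical table shared by any table-based input system; the codes
# and the placeholder characters are invented for the example.
TABLE = {
    "ktab": ["P"],
    "ktc": ["Q"],
    "kxab": ["R"],
}


def glob_lookup(pattern, table=TABLE):
    """List characters whose code matches a glob pattern.

    '*' matches any run of keys (auto-complete) and '?' matches
    exactly one key, as in UNIX file globbing.
    """
    matches = []
    for code in sorted(table):
        if fnmatchcase(code, pattern):
            matches.extend(table[code])
    return matches


print(glob_lookup("kt*"))   # any completion of "kt"
print(glob_lookup("kt?"))   # "kt" plus exactly one more key
print(glob_lookup("k?ab"))  # one unknown key in second position
```

Because the matching works on the codes alone, the same function serves any table, whether it holds Wubi codes or pinyin.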




