General Chinese Encoding Information
Computers don't speak any language; they only know numbers. For
computers to work with human languages such as Chinese and
English, mappings between numbers and letters or characters
are defined as standards that various computers and programs
understand. These agreed-upon ways of encoding Chinese are called
character sets or code sets. GB (short for "Guojia Biaozhun" or
"National Standard") is the standard used in the People's Republic of
China and Singapore and it has a set of about 7,000 simplified Chinese
characters. Big5 is used in Taiwan and Hong Kong and has about 13,000
traditional Chinese characters. Unicode is an emerging standard that
attempts to encode all the major languages, including Chinese.
Unicode includes all the characters from GB and Big5 and more. A
character set is different from a font that supports that character
set. You may have a document written using GB, but to view it you
need a font that includes all the GB characters. Viewing a GB encoded
document as if it were in Big5 will produce garbage on the screen.
Viewing a Chinese document in a program that assumes the text is
English will also produce unintelligible output full of accented
letters and symbols.
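The effect of reading bytes under the wrong character set can be sketched in a few lines of Python. This is only an illustration: the sample text is chosen arbitrarily, Python's built-in "gb2312" and "big5" codecs stand in for GB and Big5, and Latin-1 stands in for a program that assumes English text.

```python
# Illustrative sketch: the same bytes read under different character sets.
text = "中文"  # sample text, chosen only for illustration

# Each character set maps the same characters to different numbers (bytes).
gb_bytes = text.encode("gb2312")
big5_bytes = text.encode("big5")

print("Unicode code points:", [hex(ord(ch)) for ch in text])
print("GB bytes:  ", gb_bytes.hex())
print("Big5 bytes:", big5_bytes.hex())

# Reading GB bytes as GB recovers the original text.
decoded_ok = gb_bytes.decode("gb2312")

# Reading GB bytes as Big5 gives unrelated characters; any byte pair that
# is not valid Big5 is replaced with U+FFFD instead of raising an error.
as_big5 = gb_bytes.decode("big5", errors="replace")

# Reading GB bytes as Latin-1 (assumed here as the "English" encoding)
# gives one accented letter or symbol per byte.
as_latin1 = gb_bytes.decode("latin-1")

print("As GB:     ", decoded_ok)
print("As Big5:   ", as_big5)
print("As Latin-1:", as_latin1)
```

Note that nothing in the bytes themselves says which character set they belong to, which is why the reader has to be told, or has to guess.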
The characters in Unicode are a superset of the characters in GB
and Big5 so it is easy to convert directly from GB or Big5 into
Unicode. However, while there is some overlap between GB and Big5,
there are also many simplified characters in GB that are not in Big5,
and many traditional characters in Big5 that are not in GB.
Consequently, conversion between GB and Big5 is not trivial. Going
from GB to Big5 is the harder direction, since many simplified
characters map to multiple traditional equivalents in Big5. Going
from Big5 to GB is easier, since the conversion from traditional to
simplified is much less ambiguous.
Charset Conversion
Charset Detection