Microsoft Glossary Combined Release

August 3rd, 2006

Microsoft has released a new combined version of its terminology translations. From the website:

To provide users with more up-to-date terminology, Microsoft has replaced the glossary content that was previously posted to the Microsoft ftp site with a more concise document that is easier to use. We have consolidated and moved the data from the ftp site to the Microsoft download center in an effort to significantly increase reliability and accessibility for users.

This new CSV file contains over 9,000 English terms plus the translations of the terms for up to 45 different languages. Microsoft provides the Microsoft terminology data to allow our customers, ISVs, and partners to have a more consistent user experience across the products they are using and developing.

These languages include Chinese (Hong Kong S.A.R.), Chinese (People’s Republic of China), and Chinese (Taiwan). Programmers looking to use terminology already familiar to their users will find this useful.
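As a rough illustration of how a programmer might use such a file, here is a minimal Python sketch that loads a terminology CSV and looks up the translation of an English term. The column names ("English", "Chinese (PRC)", "Chinese (Taiwan)") and the sample rows are assumptions made up for this example; the actual layout of the Microsoft file may differ, so check the header row before relying on it.

```python
import csv
import io

# Illustrative sample only: the real file's column names and rows may differ.
SAMPLE_CSV = """English,Chinese (PRC),Chinese (Taiwan)
File,文件,檔案
Folder,文件夹,資料夾
"""

def load_glossary(stream, source_col, target_col):
    """Map each source term (lowercased) to its translation in target_col."""
    glossary = {}
    for row in csv.DictReader(stream):
        term = (row.get(source_col) or "").strip()
        translation = (row.get(target_col) or "").strip()
        # Skip rows that lack a translation for the requested language.
        if term and translation:
            glossary[term.lower()] = translation
    return glossary

glossary = load_glossary(io.StringIO(SAMPLE_CSV), "English", "Chinese (Taiwan)")
print(glossary.get("folder"))
```

To read a downloaded file instead of the inline sample, pass an open file object, for example `open(path, newline="", encoding="utf-8")`, where the path and the encoding are again assumptions to verify against the real file.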

Third International Chinese Language Processing Bakeoff

March 23rd, 2006

From the announcement:

This is the official announcement for the Third International Chinese Language Processing Bakeoff, sponsored by the Special Interest Group for Chinese Language Processing (SIGHAN) of the Association for Computational Linguistics. The bakeoff will occur over the late spring of 2006 and the results will be presented at the 5th SIGHAN Workshop, to be held at ACL-COLING 2006 in Sydney, Australia, July 22-23, 2006.

The first bakeoff, held in 2003 and presented at the 2nd SIGHAN Workshop at ACL 2003 in Sapporo, has become the pre-eminent measure for Chinese word segmentation evaluation and has been cited in numerous papers. The second bakeoff, held in 2005 and presented at the 4th SIGHAN Workshop at IJCNLP-05 on Jeju Island, Korea, demonstrated further progress in this task. In a change from the first two evaluations, the third bakeoff will augment the classic Word Segmentation task with a new Named Entity Recognition task.

For more information visit the Bake-off website.

“What does a Chinese Keyboard Look Like?”

February 21st, 2006

Slate has a readable article covering the basics of typing Chinese on a QWERTY-style keyboard.

SentBase Chinese/English Usage Examples

January 16th, 2006

I’ve recently learned of a great tool for learning Chinese: SentBase. It has a Google-like interface where you can type in a few words of Chinese or English and find sentences that contain those words. In addition, each sentence is paired with its corresponding translation. This can be particularly useful for learning Chinese, since you can see how idiomatic language is translated. The search interface allows users to restrict searches to British or American English, or simplified character Chinese. Parallel sentence examples are also useful in developing machine translation systems.

Gong: Internet Voice Communication Tool v.4 released

January 6th, 2006

Gong, a real-time Internet-based voice communication tool, has recently released version 4. Notably, Gong includes special support for Chinese and Japanese, including support for pinyin. It can also be used to create Chinese audio lessons.

More on Gong from the website:

Gong is a tool that supports Internet-based text and audio communication. It allows groups of people such as students and teachers to participate in discussion groups using their computers. Participants can leave text and voice messages on voice boards. They can listen to and reply to other text and voice messages left by other people. A group of people can join a real-time text/voice chat which can be recorded on voice boards. In addition, there are some powerful features such as support for multiple languages, styled text editing, voice editing, voice speed up/slow down, selective word/phrase playback and support for multilingual interface.

New Segmentation Data

November 18th, 2005

Following the 2005 Chinese Word Segmentation Bake-off, the training, testing, and gold-standard data sets have been released. These data sets, available for research purposes, provide a rich resource for developing and testing new segmentation methods. The corpora were supplied by CKIP, Academia Sinica, Taiwan; City University of Hong Kong, Hong Kong SAR; Peking University, China; and Microsoft Research, China.
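Segmenter output is commonly compared against gold-standard data using word-level precision, recall, and F-measure. The following Python sketch shows one simple way to compute those numbers for a single sentence. It is not the bakeoff's own scoring script, and the function names and the example sentence are made up for illustration.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans

def score(gold_words, predicted_words):
    """Return (precision, recall, f1) for one segmented sentence."""
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    # A predicted word is correct only if both of its boundaries match the gold.
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical example: the segmenter merged the last two gold words.
gold = ["我", "喜欢", "北京", "大学"]
predicted = ["我", "喜欢", "北京大学"]
precision, recall, f1 = score(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```

Comparing character offsets rather than word strings means that a word counts as correct only when it appears at the same position in both segmentations, which matters when the same word occurs more than once in a sentence.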

Typical China Internet User

November 17th, 2005

A recent survey by Guo Liang of the Chinese Academy of Social Sciences sheds light on internet usage in China. Among the interesting findings: the typical internet user is “young, male, richer and more highly educated”, relatively few people buy products on the internet, and more users prefer instant messaging to e-mail.

Olympic visitors to get Chinese-speaking phone

October 31st, 2005

An article in Techworld details some of China's plans to help visitors during the upcoming 2008 Olympics. Among the aids will be a phone with a built-in phrase translator that can also read the Chinese phrases aloud.

Proposed Ideographic Variation Database

October 7th, 2005

Following the approval of the Ideographic Variation Database for Unicode, a new draft is now available describing the operation of the database.

Chinese Character Component Resources

July 13th, 2005

Recently I’ve been looking around for resources that describe the components that make up Chinese characters. Here are links to the most useful ones:

http://mousai.as.wakwak.ne.jp/projects/chise/ids/index.html.ja.iso-2022-jp
http://rt.openfoundry.org/Foundry/Project/Wiki/60/index.html
http://www.sinica.edu.tw/~cdp/zip/hanzi/hzmanual.zip
http://glyph.iso10646hk.net/doc/normal_char.txt