How to detect if a string contains Chinese characters

Question

Wednesday, February 8, 2012 1:24 AM

Hi, I have a textbox in which the user enters text. I need to make sure the text entered consists of Chinese characters (words). So, how do I detect whether the string contains Chinese words? Thanks

All replies (6)

Wednesday, February 8, 2012 9:56 AM ✅Answered

@ SkySky

yes and no

if they can switch keyboards any time they want, they might switch at any time, even in the middle of a Chinese sentence ...

Note:  I get lots of spam from China ... often there is a mix of English and Chinese words ... I've no idea what the spammers are trying to tell me.

always look for the simplest solution ... why not have different sections of your website?:

    (a) China:             Chinese
    (b) North America:  Chinese
    (c) North America:  English

then, when a person goes to (a), (b), or (c), your landing page should be a "keyboard" test page ...

for each landing page, you would give the person a "keyboard check" textbox.

you would tell your end user to type ONLY in English or Chinese.

you would give them a specific English or Chinese phrase for them to type.

if what they type does not exactly match your test phrase, keep asking them to retype it until they get it correct.

once they get it correct, take them to your page(s) that correspond to (a), (b), or (c), above.

g.


Wednesday, February 8, 2012 1:57 AM

Refer to this site: http://stackoverflow.com/questions/2262091/how-do-i-validate-for-language-asp-net

It covers validating English-only words. You may modify it for Chinese words.

All the best !!


Wednesday, February 8, 2012 4:27 AM

hi

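using System.Globalization;

string word = "是集室內";

char fo = word[0];  // first character of the input

UnicodeCategory cat = char.GetUnicodeCategory(fo);
if (cat == UnicodeCategory.OtherLetter)
{
    // possibly a Chinese char: OtherLetter (Lo) also covers kana, Hangul, Arabic, Hebrew, etc.
}
else
{
    // not a Chinese char (e.g. a Latin letter, digit, or punctuation)
}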
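
Since OtherLetter is broader than Chinese, the check above could be narrowed to specific Unicode blocks. A minimal sketch, assuming "Chinese character" means a code point in the CJK Unified Ideographs block or its Extension A (the class and method names are made up for illustration, and characters outside the BMP are not handled):

```csharp
using System;

public static class HanCheck
{
    // Assumption: only these two BMP blocks count as "Chinese characters".
    public static bool IsHanIdeograph(char c)
    {
        return (c >= '\u4E00' && c <= '\u9FFF')   // CJK Unified Ideographs
            || (c >= '\u3400' && c <= '\u4DBF');  // CJK Unified Ideographs Extension A
    }
}
```

Note that this only says the character is a Han ideograph. The same characters are used in Japanese and Korean text, so it does not prove the text is Chinese.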

thank u

http://msmvps.com/blogs/jon_skeet/


Wednesday, February 8, 2012 5:17 AM

@ SkySky

are you Chinese? or, at least, do you read and/or write Chinese?

if your answer to both of the above questions is no, you'll have a lot of difficulty with your challenge.

Let me explain.

validating for character set alone is insufficient, example:

ASCII ... four languages below ... but all are ASCII characters
   house  ... English
   maison ... French
   Haus    ... German
   casa     ... Italian

worse still ... 
   hand ... English
   Hand ... German
   main ... French ............ but English also has the word "main"

**QUESTION:** are you aware of Unicode?  http://unicode.org/

see http://www.unicode.org/standard/WhatIsUnicode.html.

also http://unicode.org/faq/han_cjk.html "Chinese and Japanese"

 Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?

A: It's basically impossible and largely meaningless. It's the equivalent of asking if "a" is an English letter or a French one. There are some characters where one can guess based on the source information in the Unihan Database that it's traditional Chinese, simplified Chinese, Japanese, Korean, or Vietnamese, but there are too many exceptions to make this really reliable.

Chinese characters are used in Chinese, Japanese, Korean and other languages AFAIK.

One possible, but not a reliable, solution:   Request.UserLanguages is a string array; if you do a very detailed study, you will see that browsers will return codes like zh-cn which could be a clue ... however, the user might still be using non-CJK characters.
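
A rough sketch of that Request.UserLanguages clue, assuming entries look like "zh-cn" or "en-us;q=0.5" (the helper name ListsChinese is hypothetical, and as noted above the result is only a hint):

```csharp
using System;

public static class LanguageHint
{
    // Returns true if any entry is the "zh" language tag or starts with "zh-".
    public static bool ListsChinese(string[] userLanguages)
    {
        if (userLanguages == null) return false;
        foreach (string entry in userLanguages)
        {
            if (entry == null) continue;
            // Drop any quality value such as ";q=0.5" and surrounding spaces.
            string tag = entry.Split(';')[0].Trim();
            if (tag.Equals("zh", StringComparison.OrdinalIgnoreCase)
                || tag.StartsWith("zh-", StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }
}
```

In a page this might be called as LanguageHint.ListsChinese(Request.UserLanguages). It says nothing about which keyboard the user is actually typing with.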

BOTTOM LINE:  unless you have a lot of time, your task is very difficult ... you would need to analyze much more deeply than at the character level ... you'd also need to look up combinations of characters in a Chinese dictionary and assign a probability to the reliability of your analysis.

MORE  INFORMATION

remember:  English is the lingua franca of the modern world; so even Chinese text might contain English words.

q.v.:  http://www.i18nguy.com/temp/rtl.html "Questions & Answers: Which languages are written right-to-left (RTL)?"

Ideographic languages (e.g. Japanese, Korean, Chinese) are more flexible in their writing direction. They are generally written left-to-right, or vertically top-to-bottom (with the vertical lines proceeding from right to left). However, they are occasionally written right to left. Chinese newspapers sometimes combine all of these writing directions on the same page.

g.


Wednesday, February 8, 2012 9:29 AM

@gerrylowry,

Thank you, thank you very much for your detailed explanation. It has been a joy to read it and to understand the difficulty of implementing the detection.

For my case, I am programming for Windows Phone, which offers a very flexible selection of keyboards based on your language. So a Chinese user in China will have the Chinese keyboard by default. An English user in North America will have the English keyboard by default, but can select the Chinese keyboard. So sometimes the user forgets to switch keyboards before typing. Since there is no way for me to detect which language keyboard is turned on, I try to check the first word that the user types in.

Say the user types this in a textbox:  Im 美国人.  My requirement is that all words must be in Chinese, so I want to check the first word. If it is not Chinese, I will show a message saying that everything must be entered in Chinese. At that point, they will realize which language keyboard they are using.

Can this be done?

Thanks.
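
For the narrow requirement described here (every typed character must be a Chinese character), one possible sketch is a regular expression that finds the first character falling outside the assumed ranges. The class and method names are invented for illustration, and "Chinese" is assumed to mean the CJK Unified Ideographs block plus Extension A, with whitespace allowed:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChineseInputCheck
{
    // Matches any character that is neither whitespace nor in the
    // assumed Han ranges (U+3400-U+4DBF, U+4E00-U+9FFF).
    private static readonly Regex NonHan =
        new Regex(@"[^\s\u3400-\u4DBF\u4E00-\u9FFF]");

    // Returns the index of the first offending character, or -1 if none.
    public static int IndexOfFirstNonHan(string text)
    {
        if (string.IsNullOrEmpty(text)) return -1;
        Match m = NonHan.Match(text);
        return m.Success ? m.Index : -1;
    }
}
```

With the input "Im 美国人" this reports index 0, so the message can be shown as soon as the first Latin letter is typed. Punctuation (including full-width Chinese punctuation) and digits are also flagged under these assumptions, and an empty textbox returns -1, so both cases may need their own handling.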


Wednesday, February 8, 2012 5:03 PM

Coincidentally, Raymond Chen discussed this very topic recently:

http://blogs.msdn.com/b/oldnewthing/archive/2012/01/11/10255330.aspx