Member
Total posts: 21
Joined: Oct 25, 2011

Good day!
I am interested in how to add other languages, such as Russian or Spanish, to ChatScript.
What is the best way to do it?
Thank you for any advice.

Posted: Oct 28, 2011 | [ # 1 ]
Moderator
Total posts: 2372
Joined: Jan 12, 2010

I’m not sure.
I’ve avoided other alphabets, having not sorted out how to ensure multibyte characters are read correctly. The system is not set up for a Unicode representation, but multibyte should be feasible. If one knew that the input always arrived as proper multibyte, in theory one could just remove the code in ReadALine in textutilities.cpp involving utfbad.
One would also, of course, need a different set of dictionary entries, and would have to shut down POSTag, which is currently built for English-language POS-tagging and parsing. Spelling correction would probably continue to work correctly. And I presume, but don’t know (not knowing those languages), that the code that converts proper names into single words and multi-word text numbers into single numbers would continue working.
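What "getting multibyte properly in the input" might mean in practice can be sketched as a structural UTF-8 check on each input line. This is only an illustration: `IsValidUtf8` is a hypothetical helper, not a ChatScript function, and it makes no claim about what the utfbad code in ReadALine actually does.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ChatScript): returns true if `s` is
// structurally valid UTF-8. It rejects stray continuation bytes, truncated
// sequences, overlong encodings, surrogates, and values above U+10FFFF.
bool IsValidUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;      // continuation bytes expected after the lead byte
        unsigned long cp = 0;       // code point being decoded
        unsigned long min_cp = 0;   // smallest code point legal for this length
        if (lead < 0x80) {
            cp = lead;              // plain ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min_cp = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min_cp = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min_cp = 0x10000;
        } else {
            return false;           // stray continuation byte or invalid lead byte
        }
        if (extra > n - i - 1) return false;  // sequence runs past the end of the line
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate range
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        i += extra + 1;
    }
    return true;
}
```

A reader could call such a check on each line and fall back to some other handling only when it fails, rather than rejecting every byte above 0x7F.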

Posted: Nov 1, 2011 | [ # 2 ]
Senior member
Total posts: 971
Joined: Aug 14, 2006

I’d absolutely recommend working with UTF-8. From a European perspective, with 30 countries each having its own extensions to the standard Latin alphabet, every non-UTF-8 encoding is a nightmare.
We could obviously discuss the future of Europe, but non-English-speaking countries are also getting increasingly popular. The BRIC countries, Brazil, Russia, India (which has dozens of different languages!) and China, all use different alphabets.
If you manage to make ChatScript UTF-8 proof, it can be used all over the world, not only for English.

Posted: Nov 7, 2011 | [ # 3 ]
Moderator
Total posts: 2372
Joined: Jan 12, 2010

The next ChatScript update (1.27) will support UTF-8.

Posted: Nov 7, 2011 | [ # 4 ]
Administrator
Total posts: 3111
Joined: Jun 14, 2010

Way cool, Bruce! Thanks.

Posted: Nov 7, 2011 | [ # 5 ]
Senior member
Total posts: 473
Joined: Aug 28, 2010

Not sure if this will help, but I believe ICU (link below) is one of the most comprehensive libraries for handling Unicode and internationalisation. I’ve been using it for converting arbitrary text in an unknown encoding into UTF-8.
http://site.icu-project.org/

Posted: Nov 7, 2011 | [ # 6 ]
Member
Total posts: 20
Joined: Oct 28, 2011

It seems UTF-8 without a BOM works correctly right now. We just removed the code in ReadALine (textutilities.cpp) involving utfbad.

Posted: Nov 7, 2011 | [ # 7 ]
Moderator
Total posts: 2372
Joined: Jan 12, 2010

Thank you, Andrew. And yes, ChatScript works without BOMs. UTF-8 worked before, but I had trouble testing it, so I suppressed it. I have improved the code and re-enabled it for the 1.27 release.

Posted: Nov 24, 2011 | [ # 8 ]
Member
Total posts: 20
Joined: Oct 28, 2011

My current version is 1.27, on Linux.
I have a rule that includes two-byte characters:
“ u: ( test ) ÄÖÜ ”
But this rule isn’t working; the output is generated by another rule. Of course, when I change “ÄÖÜ” to single-byte characters (e.g. “smth”), the rule works.

Posted: Nov 24, 2011 | [ # 9 ]
Moderator
Total posts: 2372
Joined: Jan 12, 2010

This description makes no sense to me… The rule matches based on the pattern (test), and it shouldn’t matter what the output side is. Could you email me a sample topic file showing the behavior, at gowilcox at gmail.com, so that I can see the full context?

Posted: Nov 29, 2011 | [ # 10 ]
Member
Total posts: 20
Joined: Oct 28, 2011

Sorry, it was my own stupid mistake. It’s working correctly with meaningful phrases (not just a set of 2-byte characters).

Posted: Dec 16, 2011 | [ # 11 ]
Experienced member
Total posts: 94
Joined: Dec 8, 2011

Hi,
I had a short UTF-8-related conversation in the thread “problems with ChatScript-tutorial”, beginning on Dec. 8, 2011.
Greetings
Andreas

Posted: Dec 21, 2011 | [ # 12 ]
Moderator
Total posts: 2372
Joined: Jan 12, 2010

So… been working on UTF-8. What a mess!
I have modified ChatScript to read UTF-8 files and ignore the BOM at the start. I have modified the script compiler to generate files marked with the BOM at the start, and fixed a bunch of code that wasn’t ready for multibyte characters.
At this point, I’d be done, except for ONE LITTLE PROBLEM: taking a string of characters, some of which may be UTF-8, and getting the Visual Studio C++ console window to display them correctly. I tried setting a code page. I tried converting the string to wide characters. But I haven’t been able to get the console output to display them. The server would be fine, because it would send back UTF-8 characters and the browser or receiver would be responsible for displaying them.
Any ideas?
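For readers following along, the BOM handling described above can be sketched roughly as follows. The UTF-8 BOM is the three bytes EF BB BF; `StripUtf8Bom` and `WithUtf8Bom` are hypothetical names used for illustration and are not the functions in the ChatScript source.

```cpp
#include <string>

// The UTF-8 byte order mark: bytes EF BB BF.
static const char kUtf8Bom[] = "\xEF\xBB\xBF";

// Hypothetical reader-side helper: drop a leading BOM if one is present.
std::string StripUtf8Bom(const std::string& text) {
    // compare() only looks at the first min(3, size) bytes, so short
    // inputs simply fail to match and are returned unchanged.
    if (text.compare(0, 3, kUtf8Bom) == 0) return text.substr(3);
    return text;
}

// Hypothetical writer-side helper: mark output with a BOM exactly once.
std::string WithUtf8Bom(const std::string& text) {
    if (text.compare(0, 3, kUtf8Bom) == 0) return text;
    return std::string(kUtf8Bom) + text;
}
```

Stripping on read and adding on write keeps the BOM out of the text that pattern matching sees, while files still carry the marker for editors that expect it.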

Posted: Dec 21, 2011 | [ # 13 ]
Senior member
Total posts: 697
Joined: Aug 5, 2010

A quick search on UTF-8 and the Windows console gave:
http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how

Posted: Dec 21, 2011 | [ # 14 ]
Moderator
Total posts: 2372
Joined: Jan 12, 2010

Joy is not mine. I went to the Windows command prompt window and told it to type out my simple source file, which has an umlaut character as part of a topic. It printed wrong, of course. Then I tried chcp 1250 and chcp 65001 before a type command; that didn’t help. It still prints out wrong.

Posted: Dec 21, 2011 | [ # 15 ]
Administrator
Total posts: 3111
Joined: Jun 14, 2010

Is it possible, Bruce, that the font used by Windows for the command window doesn’t support UTF-8? I’ve done some testing with the command window on my Win 7 machine, and the default “raster font” doesn’t print all of the UTF-8 characters properly. I found that using the font Lucida Console worked for me, when using the command “copy D:\utf8.txt con”, which displays the contents of the file to the screen. Maybe this will prove useful?
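The programmatic counterpart of `chcp 65001` can be sketched as below, using the Win32 call `SetConsoleOutputCP` with `CP_UTF8`. This is a hedged sketch, not a confirmed fix: as reported in this thread, changing the code page alone may not be enough, and the console font probably also has to be a TrueType one such as Lucida Console. `EnableUtf8ConsoleOutput` is a hypothetical name; on non-Windows builds it does nothing.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: ask the Windows console to interpret the program's
// output bytes as UTF-8. Returns true on success. On other platforms the
// terminal is assumed to handle UTF-8 already, so this is a no-op.
bool EnableUtf8ConsoleOutput() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;  // CP_UTF8 is code page 65001
#else
    return true;
#endif
}

// Writes raw UTF-8 bytes to stdout; whether they render correctly still
// depends on the console font having the needed glyphs.
void PrintUtf8Line(const char* utf8_text) {
    std::fputs(utf8_text, stdout);
    std::fputc('\n', stdout);
}
```

A caller would invoke `EnableUtf8ConsoleOutput()` once at startup and then pass UTF-8 strings such as `"\xC3\x84\xC3\x96\xC3\x9C"` (ÄÖÜ) to `PrintUtf8Line`.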