AI Zone: chatbots.org

What about UTF-8 support?

Posted: Dec 30, 2011

[ # 31 ]

Jan Bogaerts

Senior member

Total posts: 697

Joined: Aug 5, 2010

Here’s an interesting quote I found on Stack Overflow (http://stackoverflow.com/questions/3329827/c-ifstream-utf8-first-characters) about the BOM:

When you save a file as UTF-16, each value is two bytes. Different computers use different byte orders. Some put the most significant byte first, some put the least significant byte first. Unicode reserves a special codepoint (U+FEFF) called a byte-order mark (BOM). When a program writes a file in UTF-16, it puts this special codepoint at the beginning of the file. When another program reads a UTF-16 file, it knows there should be a BOM there. By comparing the actual bytes to the expected BOM, it can tell if the reader uses the same byte order as the writer, or if all the bytes have to be swapped.
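The byte-order check described above can be sketched in a few lines of C++. This is only an illustration, not code from the quoted answer: the names `Utf16Order` and `detect_utf16_order` are made up here, and the function looks at nothing but the first two bytes.

```cpp
#include <string>

// Byte order signalled by a UTF-16 BOM (U+FEFF) at the start of a buffer.
enum class Utf16Order { Unknown, LittleEndian, BigEndian };

// Hypothetical helper: inspects only the first two bytes of `bytes`.
Utf16Order detect_utf16_order(const std::string& bytes) {
    if (bytes.size() < 2) return Utf16Order::Unknown;
    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    if (b0 == 0xFF && b1 == 0xFE) return Utf16Order::LittleEndian; // U+FEFF, low byte first
    if (b0 == 0xFE && b1 == 0xFF) return Utf16Order::BigEndian;    // U+FEFF, high byte first
    return Utf16Order::Unknown;                                    // no BOM found
}
```

A reader whose own byte order differs from the detected one would then swap the two bytes of every 16-bit unit.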

When you save a UTF-8 file, there’s no ambiguity in byte order. But some programs, especially ones written for Windows still add a BOM, encoded as UTF-8. When you encode the BOM codepoint as UTF-8, you get three bytes, 0xEF 0xBB 0xBF, which is the three extra characters you’re seeing.

The argument in favor of doing this is that it marks the files as truly UTF-8, as opposed to some other native encoding. For example, lots of text files on western Windows are in codepage 1252. Tagging the file with the UTF-8-encoded BOM makes it easier to tell the difference.

The argument against doing this is that lots of programs expect ASCII or UTF-8 regardless, and don’t know how to handle the extra three bytes.

If I were writing a program that reads UTF-8, I would check for exactly these three bytes at the beginning. If they’re there, skip them.

Perhaps you can put the BOM in front of your string manually, so that the system knows you are dealing with a UTF-8 file.
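For what it’s worth, here is a minimal sketch of both directions: skipping the three BOM bytes when reading, and putting them in front of a string when writing. The function names `strip_utf8_bom` and `add_utf8_bom` are made up for this example, and it assumes the text is already held in a `std::string` of raw bytes.

```cpp
#include <string>

// Hypothetical helper: returns `text` without a leading UTF-8 BOM (EF BB BF), if present.
std::string strip_utf8_bom(const std::string& text) {
    if (text.size() >= 3 &&
        static_cast<unsigned char>(text[0]) == 0xEF &&
        static_cast<unsigned char>(text[1]) == 0xBB &&
        static_cast<unsigned char>(text[2]) == 0xBF) {
        return text.substr(3); // drop exactly the three BOM bytes
    }
    return text; // no BOM: leave the bytes untouched
}

// Hypothetical helper: puts the UTF-8 encoded BOM in front of `text`.
std::string add_utf8_bom(const std::string& text) {
    return "\xEF\xBB\xBF" + text;
}
```

Whether adding the BOM helps depends on the program that reads the result; as the quote says, some readers do not expect the extra bytes.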

Posted: Dec 30, 2011

[ # 32 ]

Bruce Wilcox

Moderator

Total posts: 2372

Joined: Jan 12, 2010

My internal strings are 8-bit. A full string is 8-bit UTF.
I can convert such to anything later for output.
But JUST doing simple things with wide chars, string constants, etc., a simple main program printing to the Visual C console fails to work correctly. I tried various things and can’t make that demo work. I can see my own strings, converted to wide, displayed PERFECTLY in the debug window.

Posted: Dec 30, 2011

[ # 33 ]

Jan Bogaerts

Senior member

Total posts: 697

Joined: Aug 5, 2010

So something like this: http://linuxprograms.wordpress.com/2008/03/07/c-printing-data-types-using-printf-short-wchar_t-long-double/
doesn’t work for the wchars?

My internal strings are 8-bit. A full string is 8-bit UTF.
I can convert such to anything later for output.

Yep, C will be perfectly able to handle this. I don’t know if you often want to take the first x chars from a string, but that’s where the major difference is: with internal storage in UTF-8, you need to inspect every char and possibly consume more than 1 byte per character, but it is possible.
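To make that concrete, here is a rough sketch of taking the first x characters from a UTF-8 string by looking at each lead byte. The names `utf8_sequence_length` and `utf8_prefix` are invented for this example. It assumes the input is valid UTF-8, does not check continuation bytes, and counts code points, so a base letter plus a combining accent counts as two.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Falls back to 1 for continuation or invalid bytes so the caller always advances.
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;
}

// Returns the first `count` code points of `text` (or all of it if shorter).
std::string utf8_prefix(const std::string& text, std::size_t count) {
    std::size_t pos = 0;
    while (count > 0 && pos < text.size()) {
        pos += utf8_sequence_length(static_cast<unsigned char>(text[pos]));
        --count;
    }
    if (pos > text.size()) pos = text.size(); // final sequence was cut short
    return text.substr(0, pos);
}
```

For example, asking for the first 5 characters of “endgültig” returns 6 bytes, because the “ü” is stored as the two bytes 0xC3 0xBC.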

Posted: Dec 30, 2011

[ # 34 ]

Bruce Wilcox

Moderator

Total posts: 2372

Joined: Jan 12, 2010

Nope. Of course, the example there DIDN’T show Unicode characters.
It’s not that I don’t have valid characters. It’s that SOMEHOW there is a mismatch with the font accepted by the console.
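One thing that may be worth trying, offered as an untested sketch rather than a known fix for this case: the Windows console interprets output bytes using its own output code page, which by default is usually not UTF-8. The Win32 call `SetConsoleOutputCP(CP_UTF8)` asks it to treat the bytes as UTF-8. The helper names `use_utf8_console_output` and `print_sample` are made up here. The console font must also contain the glyphs (raster fonts often do not), and behavior may differ between Windows versions.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: asks the Windows console to interpret program output as UTF-8.
// On other platforms it does nothing and reports success.
bool use_utf8_console_output() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0; // nonzero return means success
#else
    return true;
#endif
}

// Writes "endgültig" as raw UTF-8 bytes, so the source file encoding does not matter.
void print_sample() {
    std::fputs("endg\xC3\xBCltig\n", stdout);
}
```

Calling `use_utf8_console_output()` once at startup and then `print_sample()` shows quickly whether the code page or the font is the limiting factor.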

Posted: Dec 30, 2011

[ # 35 ]

Andreas Drescher

Experienced member

Total posts: 94

Joined: Dec 8, 2011

Hi all,

I think Dave is on the right track.
When I use a console to show the output of (for example) ProgramD,
it also displays the word “endgültig” (final) as “endg3ltig”.
But if I run it in Firefox via localhost:2001 in the foreground
(with the console in the background),
everything works fine: “endgültig”.

Good luck

Andreas

Posted: Dec 31, 2011

[ # 36 ]

Dave Morton

Administrator

Total posts: 3111

Joined: Jun 14, 2010

Andreas, if I might ask, how are you accessing the ChatScript server through Firefox? I would like to test some theories, but I’m having difficulty doing so through my browser. I’ve tried everything except creating a custom PHP script to do so, and I really didn’t want to have to do so if I can avoid it.

Posted: Dec 31, 2011

[ # 37 ]

Andreas Drescher

Experienced member

Total posts: 94

Joined: Dec 8, 2011

Sorry Dave,

I just wanted to report my experience as a user of a UTF-8 language:
even AIML engines like ProgramD, which have supported UTF-8 for a long time,
are not able to show UTF-8 characters on a console,
but they can show them in a browser (via localhost:2001).

I tried it with ChatScript (localhost:1024), too, but I didn’t succeed.

Perhaps an earlier discussion is helpful:
http://www.chatbots.org/ai_zone/viewthread/397/P60/

All the best

Andreas

Posted: Dec 31, 2011

[ # 38 ]

Jan Bogaerts

Senior member

Total posts: 697

Joined: Aug 5, 2010

Dave Morton - Dec 31, 2011:
Andreas, if I might ask, how are you accessing the ChatScript server through Firefox? I would like to test some theories, but I’m having difficulty doing so through my browser. I’ve tried everything except creating a custom PHP script to do so, and I really didn’t want to have to do so if I can avoid it.

I still think you are on to something, though. Yesterday evening I came across a post (I lost the link) that said apps built with VS have difficulties displaying non-ASCII chars (the UTF-specific ones) on the console. Some other compilers apparently don’t have this problem.

Posted: Dec 31, 2011

[ # 39 ]

Andreas Drescher

Experienced member

Total posts: 94

Joined: Dec 8, 2011

Hi Jan, hi all,

As a non-developer, I simply keep my fingers crossed
that the problem can be solved
and that “UTF-8 nations” can use all these wonderful benefits of ChatScript, too.

Happy new year :)

Andreas
