What about UTF-8 support?
 
 
  [ # 31 ]

Here’s an interesting quote I found on Stack Overflow (http://stackoverflow.com/questions/3329827/c-ifstream-utf8-first-characters) about the BOM:

When you save a file as UTF-16, each value is two bytes. Different computers use different byte orders. Some put the most significant byte first, some put the least significant byte first. Unicode reserves a special codepoint (U+FEFF) called a byte-order mark (BOM). When a program writes a file in UTF-16, it puts this special codepoint at the beginning of the file. When another program reads a UTF-16 file, it knows there should be a BOM there. By comparing the actual bytes to the expected BOM, it can tell if the reader uses the same byte order as the writer, or if all the bytes have to be swapped.
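To illustrate the byte-order check described above, here is a minimal C++ sketch. The names `Utf16Order` and `detect_utf16_order` are made up for this example, and it assumes the raw file bytes are already in a `std::string`. It only looks at the first two bytes and does not try to tell UTF-16 apart from UTF-32.

```cpp
#include <string>

// Possible results of inspecting the first two bytes of a UTF-16 stream.
enum class Utf16Order { LittleEndian, BigEndian, Unknown };

// Hypothetical helper: looks for the BOM (U+FEFF) at the start of the data.
inline Utf16Order detect_utf16_order(const std::string& bytes) {
    if (bytes.size() < 2) return Utf16Order::Unknown;
    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    if (b0 == 0xFE && b1 == 0xFF) return Utf16Order::BigEndian;    // most significant byte first
    if (b0 == 0xFF && b1 == 0xFE) return Utf16Order::LittleEndian; // least significant byte first
    return Utf16Order::Unknown;                                    // no BOM found
}
```

If the detected order differs from the byte order of the machine doing the reading, every 16-bit unit has to be byte-swapped before use.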

When you save a UTF-8 file, there’s no ambiguity in byte order. But some programs, especially ones written for Windows still add a BOM, encoded as UTF-8. When you encode the BOM codepoint as UTF-8, you get three bytes, 0xEF 0xBB 0xBF, which is the three extra characters you’re seeing.
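The three bytes mentioned above can be checked with a small encoder. This is only a sketch: `encode_utf8` is a hypothetical helper written for this example, not a function from any library discussed in the thread.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
inline std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 1 byte: plain ASCII
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encode_utf8(0xFEFF)` yields the bytes 0xEF 0xBB 0xBF, and `encode_utf8(0xFC)` (the letter ü) yields 0xC3 0xBC.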

The argument in favor of doing this is that it marks the files as truly UTF-8, as opposed to some other native encoding. For example, lots of text files on western Windows are in codepage 1252. Tagging the file with the UTF-8-encoded BOM makes it easier to tell the difference.

The argument against doing this is that lots of programs expect ASCII or UTF-8 regardless, and don’t know how to handle the extra three bytes.

If I were writing a program that reads UTF-8, I would check for exactly these three bytes at the beginning. If they’re there, skip them.
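A minimal C++ sketch of that check, assuming the file contents have already been read into a `std::string` (the name `strip_utf8_bom` is made up for this example):

```cpp
#include <string>

// Hypothetical helper: returns the text without a leading UTF-8 BOM
// (bytes EF BB BF). Text without a BOM is returned unchanged.
inline std::string strip_utf8_bom(const std::string& text) {
    static const char kBom[] = "\xEF\xBB\xBF";
    // compare() clips the range to the string length, so short inputs are safe.
    if (text.compare(0, 3, kBom) == 0) return text.substr(3);
    return text;
}
```

The check only looks at the very beginning of the data, so the same three bytes appearing later in the text are left alone.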

Perhaps you can put the BOM in front of your string manually, so that the system knows you are dealing with a UTF-8 file.

 

 

 
  [ # 32 ]

My internal strings are 8-bit. A full string is 8-bit UTF.
I can convert such to anything later for output.
But JUST doing simple things with wide chars, string constants, etc., in a simple main program that prints to the Visual C console fails to work correctly. I tried various things and can’t make that demo work. I can see my own strings, converted to wide, display PERFECTLY in the debug window.

 

 
  [ # 33 ]

So something like this: http://linuxprograms.wordpress.com/2008/03/07/c-printing-data-types-using-printf-short-wchar_t-long-double/
doesn’t work for the wchars?

My internal strings are 8-bit. A full string is 8-bit UTF.
I can convert such to anything later for output.

Yep, C is perfectly able to handle this. I don’t know how often you want to take the first x characters from a string, but that’s where the major difference is: with UTF-8 as internal storage, you need to inspect every byte and possibly consume more than 1 byte per character, but it is possible.
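Taking a number of characters from a UTF-8 string could look roughly like this in C++. It is only a sketch: `utf8_sequence_length` and `utf8_prefix` are hypothetical names, the input is assumed to be well-formed UTF-8, and "character" here means a single code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence starting with `lead`.
// Returns 1 for ASCII and for unexpected bytes, so the caller always advances.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // stray continuation or invalid byte
}

// Hypothetical helper: returns the first `count` code points of a UTF-8 string.
// Continuation bytes are not validated.
inline std::string utf8_prefix(const std::string& text, std::size_t count) {
    std::size_t pos = 0;
    while (count > 0 && pos < text.size()) {
        pos += utf8_sequence_length(static_cast<unsigned char>(text[pos]));
        --count;
    }
    if (pos > text.size()) pos = text.size();  // truncated final sequence
    return text.substr(0, pos);
}
```

For the UTF-8 bytes of “endgültig”, `utf8_prefix(text, 5)` returns the six bytes of “endgü”, whereas a plain `text.substr(0, 5)` would cut the ü in half.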

 

 
  [ # 34 ]

Nope. Of course, the example there DIDN’T show Unicode characters.
It’s not that I don’t have valid characters. It’s that SOMEHOW there is a mismatch with the font accepted by the console.

 

 
  [ # 35 ]

Hi all,

I think Dave is on the right track.
When I use a console to view the output of (for example) ProgramD,
it also shows the word “endgültig” (final) as “endg3ltig”.
But if I run it in Firefox via localhost:2001 in the foreground
(with the console in the background),
everything works fine: “endgültig”.

Good luck

Andreas

 

 
  [ # 36 ]
Andreas Drescher - Dec 30, 2011:

Hi all,

I think Dave is on the right track.
When I use a console to view the output of (for example) ProgramD,
it also shows the word “endgültig” (final) as “endg3ltig”.
But if I run it in Firefox via localhost:2001 in the foreground
(with the console in the background),
everything works fine: “endgültig”.

Good luck

Andreas

Andreas, if I might ask, how are you accessing the ChatScript server through Firefox? I would like to test some theories, but I’m having difficulty doing so through my browser. I’ve tried everything except creating a custom PHP script, and I’d really rather not do that if I can avoid it. :)

 

 
  [ # 37 ]

Sorry Dave,

I just wanted to report the experience of a user of a UTF-8 language:
even AIML engines like ProgramD, which have supported UTF-8 for a long time,
are not able to show UTF-8 characters on a console,
but they can (via localhost:2001) in a browser.

I tried it with ChatScript (localhost:1024), too, but I didn’t succeed.

Perhaps an earlier discussion is helpful:
http://www.chatbots.org/ai_zone/viewthread/397/P60/

All the best

Andreas

 

 
  [ # 38 ]
Dave Morton - Dec 31, 2011:
Andreas, if I might ask, how are you accessing the ChatScript server through Firefox? I would like to test some theories, but I’m having difficulty doing so through my browser. I’ve tried everything except creating a custom PHP script, and I’d really rather not do that if I can avoid it. :)

I still think you are on to something, though, because yesterday evening I came across a post (I lost the link) that said apps built with Visual Studio have difficulty displaying non-ASCII characters (the UTF-specific ones) on the console. Some other compilers apparently don’t have this problem.

 

 
  [ # 39 ]

Hi Jan, hi all,

As a non-developer, I simply keep my fingers crossed
that the problem can be solved
and that “UTF-8 nations” can use all these wonderful benefits of ChatScript, too.

Happy new year :)

Andreas

 
