
How to implement a ChatScript bot that talks Spanish
 
 
  [ # 31 ]

Your keyboard does not transmit utf8 characters. The web does. Documents can (hence :source works).

 

 
  [ # 32 ]

I see, thanks Bruce, it worked fine.
So now I’m about to read the documentation on making ChatScript work with PostgreSQL.

But first, could you please tell me if there is any way to do some kind of spell check in Spanish?
I ask again because this feature is very important, since it is used in every single volley.
Thanks for telling me how to make the engine recognize the Spanish characters áéíóúñ, but isn’t there any chance I could do a Spanish spell check, provided that I completely avoid using the characters áéíóúñ? At least some kind of Spanish pseudo spell checker? Isn’t there any workaround?
Because if a Spanish spell check is not possible,
I would have to pre-establish every Spanish typo of every pattern’s words, am I right?

Thanks in advance, Bruce. Nice engine.

 

 
  [ # 33 ]

OK. I have enabled the spell checker to allow utf8 words to be spell checked. This will be in the next release, probably next weekend. Then…

you will need to create a pseudo Spanish dictionary of the words you want recognized. This will consist of concept declarations of words and their part of speech, e.g.

concept: ~spanishnouns NOUN ( mañana )

When I did that and then inputted:
mañanb

it corrected it to mañana.

 

 
  [ # 34 ]

Awesome, thanks Bruce, I’ll wait for the release; meanwhile I’ll check how to set up PostgreSQL. Thanks again.

 

 
  [ # 35 ]

Hi Bruce,
It’s been a long time. I learned Python and PostgreSQL (made some basic triggers and functions), and made the ChatScript connection via LAN, but I didn’t have the time to script the Spanish ChatScript chatbot. Now I finally have the time to do so. Please, Bruce, could you tell me if that Spanish spell checker you made has changed? How could I create that Spanish pseudo dictionary mentioned in the previous comment? What should I aim for first in order to complete this task?

Thanks for your support, Bruce.

 

 
  [ # 36 ]

I never made a Spanish spell checker; I don’t know Spanish. The spell checker in CS handles utf8 words, but it still requires a dictionary of words and, really, code that can conjugate words as well.

A Spanish dictionary of words…  CS normally has a dictionary of base words and uses code to conjugate them (and thus know their interrelationship). One could conceivably put all conjugated forms of words in a dictionary; it still wouldn’t know their conjugation relationships, but the spell checker would work. I don’t know whether doing that is sensible or not.
There are two ways to make a dictionary. One is to declare concept sets of words, with part-of-speech markers, like
concept: ~my_nouns NOUN NOUN_SINGULAR ( boy animal )
The other is to create files like the text files in DICT.  The DICT files encode information much more compactly than the concept form, but are harder to author.

 

 
  [ # 37 ]

Hi Bruce, thanks for your reply.
I see, I should first create a dictionary of words. I’m wondering…
if DICT files encode information much more compactly, does that mean a chatbot that uses a Spanish DICT runs faster than one that uses Spanish declared concepts? If that is true, what should I learn (any programming language?) in order to be able to author a Spanish DICT? I would like to do a colloquial, simple Spanish chatbot (like chatting in instant messengers). Now I have full-time availability; how long could it take?

Thanks in advance, Bruce.

 

 
  [ # 38 ]

The advantage of DICT is memory, not speed. All words in your concept set will automatically create a dictionary entry and have bits marked on them, like singular noun. The dictionary words are used by spell checking.  The concept set itself is unused, except as a means of getting words and bits into the dictionary. As such, the facts used to represent the concept set are wasted memory.

“Authoring a Spanish dict” can be done using a simple text editor.  But if you are going to author a lot of words, then some scripting language that reduces your typing is valuable.  If you open a DICT text file, you will see a word, a bunch of named part-of-speech concepts, some xxx=yyy things, and a bunch of lines devoted to the gloss (the text meaning of the word). You have no use for the gloss, so the meanings=4 field (which tells how many lines of gloss follow) will always be meanings=0 (or be omitted entirely).

For spell checking, if you don’t want English at all, just Spanish, then one can remove the DICT folder entirely. One might start with a script file with concepts like the ~my_nouns one I demonstrated. Fill it up with maybe 40 words as nouns, another 40 as verbs.  Then write input which consists of a slightly misspelled Spanish noun from your list and a slightly misspelled Spanish verb from your list, and see if :prepare corrects them.
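
A concrete sketch of such a starter file (the concept names and the Spanish words here are my own illustration, not something shipped with CS; the POS markers mirror the ~my_nouns example above):

# hypothetical pseudo-dictionary concepts for Spanish spell checking
concept: ~spanish_nouns NOUN NOUN_SINGULAR ( perro casa gato )
concept: ~spanish_verbs VERB ( hablar comer vivir )

After a :build, typing something like :prepare ablar at the console should show whether spell checking maps it back to hablar.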

 

 
  [ # 39 ]

If a chatbot could learn new words as it chatted and “read” the internet, then how could it create its own dictionary files?
Let’s say each sentence is parsed and each word is compared to “known” words, which are facts in memory imported from knownwords.txt.  After talking and “researching”, it exports the facts back to knownwords.txt.  It could also use the facts to write dictionary-formatted files directly.  It would have to look up or ask someone about the noun-versus-verb designation of words.

If knownwords.txt is imported into @1, then how could a new fact be added to the @1 fact set so it can be exported again along with the other words, and also so it is present in the @1 fact set in memory?

$$NewWordFact = ^createfact(wordid:1 is_word cat)

Also, how does ChatScript determine the canonical form of words?  I do not see different forms relating back to a designated canonical word in the dictionary files.  If every form of a word is in the dictionary, how can it be coded so ChatScript can convert the Spanish words into canonical forms using a Spanish dictionary or other files (using simple lookups or default tags, not necessarily rules)?

thx.

 

 
  [ # 40 ]

For English, the CS dictionary contains ONLY canonical forms. All conjugations are detected in CS engine code, which takes the original input words and, if they cannot be found in the dictionary, uses conjugation rules to find the root form of the word that is in the dictionary.

“Learning new words as you chat” raises the issue of whether the learning is localized to the user or not. If not localized, then the server needs to restart to acquire the new words. If localized, then only that user gets the knowledge. Localized words can be added into concept sets by creating appropriate facts.  Of course you can have a script that always reads and writes words via export, but when the number of facts there gets large, that will be a significant slowdown in response speed per volley.

Remember that dictionary-formatted files are only read at startup of the server, so you’d have to force the server to restart to get the new words globally visible.
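
For the localized route, the “appropriate facts” could be created in script like this (the word and the set name are my own illustration):

# add a word to a concept set for this user only, via a member fact
$$newword = ^createfact(gato member ~spanish_nouns)

Since the fact is stored with that user’s data, only that user’s later volleys would see gato inside ~spanish_nouns.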

 

 
  [ # 41 ]

So a Spanish linguist chatbot could chat/research and acquire “local facts” that it can export.  It could also be programmed, or call a program, to translate the facts into the required dictionary file format.  Then the server could be restarted with the new dictionary files, and everyone and all chatbots on the server would have access to the new dictionary values.

Should dict.bin be deleted and the :build 0 command issued?

I’m sure that the rules for conjugation are different in English than in Spanish, so what if the canonical.txt file were updated with translations?  Is this the same as substitutions?  Would a pattern u: ( _*1 ) still yield _0 and ‘_0, where _0 is the canonical form and ‘_0 is the original word, if a word is converted using the canonical.txt file?

How many entries can be in the canonical.txt file?

Also, is there a way to add a new fact to an already existing fact set @1?  Can I say @1 = @1 + $$NewFactID?

Currently I re-query the user facts in memory with 3 fact types and append them to the knownwords.txt file using:

outputmacro: ^ExportKnownWords()
# last word id assigned so far
@7 = ^query( direct_sv MaxWordID has_value ? )
^export( knownwords.txt @7 )
# the known words themselves
@7 = ^query( direct_v ? is_word ? )
^export( knownwords.txt @7 append )
# per-word rank counts
@7 = ^query( direct_v ? has_word_rank ? )
^export( knownwords.txt @7 append )

I am tracking the number of times a chatbot encounters a word and incrementing the rank of the word each time.  The rank is stored in knownwords.txt along with the last wordid number used.

 

 
  [ # 42 ]

The ^export command writes “facts” in a particular format. You can use the ^log command to write files in whatever format you want. Or you could write facts via ^export and call a local program to convert them to the dictionary file format.

Dict.bin should definitely be deleted. :build 0 has no meaning if you have rewritten the DICT/*.txt files.

The substitutions files (there are several) replace an incoming word with a different word.  The canonical file lists what word should be used as the canonical form of a given word when the engine is not able to figure that out using the conjugation code (and for Spanish it certainly wouldn’t know how).  There is no limit per se on the number of entries in the canonical file.
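
For Spanish, entries in the canonical file would then be simple word pairs, original form first, canonical form second (these Spanish pairs are my own illustration of the idea, not shipped data):

hablo hablar
hablas hablar
comí comer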

 

 

 
  [ # 43 ]

Hi Bruce, thanks for your support, and sorry for the long delay; I was finishing some miscellaneous stuff before getting fully into this.
I googled for dict files and found different txt examples: some of them only list the words without any gloss, others list the words and the glosses, and many other variations. I also looked at the txt files inside the CS/DICT/ENGLISH folder; they have words, glosses, and some sort of class info, right? So, what format should I use for a simple but maximally functional Spanish dict? Should I just list each Spanish word on a new line, or should I try to follow the CS/DICT/ENGLISH folder template?

Thanks for your support, Bruce. Thanks in advance.

 

 
  [ # 44 ]

The gloss is not significant.  As for the functionality supported by having a dictionary, what are you aiming toward?

1. Spell checking - CS spell checking can try to match incoming words with existing dictionary ones, but it also conjugates them using explicit code for that, which would not exist for Spanish.

2. POS-tagging and parsing - CS uses code for that, and it wouldn’t exist for Spanish.

3. Ontology - words are linked in an “is” relationship to the words above them in the hierarchy, and CS uses that for marking expanded concepts: so “collie” is a “dog” is an “animal”. Although one can write explicit concepts to hold whatever the dictionary holds.

One can use a single line per word, like this one:
Geraldine ( NOUN_HUMAN NOUN_FIRSTNAME NOUN_SHE NOUN NOUN_PROPER_SINGULAR KINDERGARTEN )

where the things in parens are property/system-flag bits from dictionarysystem.h.  But what you achieve by having these bits can in effect be created by merely defining concept sets involving the words.
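
Adapted to Spanish, such lines might look like this (my own guess at reasonable bit choices; the valid names are in dictionarysystem.h):

mañana ( NOUN NOUN_SINGULAR )
hablar ( VERB )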

 

 
  [ # 45 ]

Hi Bruce, thx for the reply
The spell checking should be the most important part, but then for the verbs I would like CS to do the conjugations, could that be possible?
I guess Pos-tagging and parsing and ontolgy would be more advance things and not possible to replicate in spanish, though it would be nice if CS could that to in spanish. Please correct me if I’m wrong.

I would like to do a “coloquial” simple spanish chatbot with with something more than basic words, but that could somehow understand the meaning of the sentences. Please Bruce tell me, which of the those points you listed is more important in order to achieve that task? I would like to do the spell checking plus another one of them. Which one should I aim for? How long should it take me? if Im full time in this?

Thanks Advanced Bruce.

 
