|
Posted: Oct 28, 2015 |
[ # 46 ]
|
|
Moderator
Total posts: 2372
Joined: Jan 12, 2010
|
There is no chance that conjugation of verbs can be done by the CS engine, as that is code and it is not written for Spanish. However, you can declare canon: in your scripts, so you could define the canonical value of all verb conjugations and noun inflections manually (tedious), assuming the Spanish plural of a noun is not a simple added s. Replicating the ontology in Spanish is doable (merely writing concepts would do that as well). POS tagging is possible if you declare concepts like:
concept: ~nounlist NOUN NOUN_SINGULAR ( hacienda ...)
concept: ~nounlistplural NOUN NOUN_PLURAL (haciendas ....)
But chatbots don’t usually need POS tagging and parsing. They do want canonicals for ease of rule writing. If you write the nounlist concept as above, that will do spelling correction. You probably want to remove ALL the English dictionary files.
How long it takes to write the concept lists depends on how big your vocabulary is.
|
|
|
|
|
Posted: Oct 29, 2015 |
[ # 47 ]
|
|
Senior member
Total posts: 141
Joined: Apr 24, 2011
|
Hi, Eduardo!
I am seeing your thread about Spanish. Hopefully you can make ChatScript work in English, but never in an inflection-heavy language like Spanish. I told you that in private, remember?
The problem lies in two factors, which I solved in a PhD-like thesis after many years of sweat and work.
Spellchecking in Spanish is a complete mess, complicated and messy; even MS Word doesn’t cope well with Spanish. You can download my thesis for free from my university academic website and read the reports; it is about 230 pages long!
Neither ChatScript nor RiveScript will cope with those errors; they are cleverly made for English. I contacted Bruce many times and he also integrated WordNet into it, good job! But the Spanish WordNet is not inflected, so you need a morphological analyzer attached, with more than 5000 rules, to reduce words to their uninflected forms. You will also fail if you don’t correct spelling, because Spanish, as you well know, has lots of written diacritic accents, especially on verbs, and is highly irregular. So you are in deep sh#t to repair it; it took many years of work to arrive at a decent solution!
You might try a Snowball (Porter) stemmer, but those didn’t work well for me either, as Spanish is highly irregular. I tried them out before immersing myself in a deep thesis study, believe me!
I wish you luck but, as far as I can tell, there is no shortcut for this, at least at this time!
Cheers!
|
|
|
|
|
Posted: Oct 29, 2015 |
[ # 48 ]
|
|
Senior member
Total posts: 141
Joined: Apr 24, 2011
|
Bruce Wilcox - Oct 28, 2015: There is no chance that conjugation of verbs can be done by the CS engine, as that is code and it is not written for Spanish. However, you can declare canon: in your scripts, so you could define the canonical value of all verb conjugations and noun inflections manually (tedious), assuming the Spanish plural of a noun is not a simple added s. Replicating the ontology in Spanish is doable (merely writing concepts would do that as well). POS tagging is possible if you declare concepts like:
concept: ~nounlist NOUN NOUN_SINGULAR ( hacienda ...)
concept: ~nounlistplural NOUN NOUN_PLURAL (haciendas ....)
But chatbots don’t usually need POS tagging and parsing. They do want canonicals for ease of rule writing. If you write the nounlist concept as above, that will do spelling correction. You probably want to remove ALL the English dictionary files.
How long it takes to write the concept lists depends on how big your vocabulary is.
Hi Bruce, I can answer all of these questions for you, to enlighten everyone.
A Spanish vocabulary has about 110k canonical forms and 300+ prefixes, many of them nestable, with about 160 inflected forms for verbs and over 50 for nouns and adjectives. Verbs also have many complex structures, so you cannot simply detect a verb: you need to parse out the auxiliary verbs (similar to English phrasal verbs) and compound tenses, and the variation in inflection reaches more than 14 complex tense variations. This can only be analyzed by means of a chunker capable of detecting this complex syntax, which I built into my system.
Luckily, adverbs alone do not admit suffix inflection, but derivational operations are very common: verbs convert into nouns, adjectives and even adverbs, almost all adjectives turn easily into adverbs, and so on.
To solve this, I built an inflection+derivation engine that does all the analysis in my chatbot structure and obtains the canonical form. This engine has more than 5600 suffix rules as well as 300 common prefixes, almost all of them nestable.
The result? There are over 30 M simple-inflected words, but many have nested inflections; some people have counted over 3000 million different Spanish words.
To do good parsing and POS tagging you also need the corpus frequency of the inflected word forms, but this remains a big mystery for Spanish, because no huge lemmatized corpus is available, due to the lack of good, free lemmatization engines (analyzers).
The problem is that the frequency of most lemmatized words is zero or one. In a 450-million-word corpus that I got access to, there are only 1M different words with known frequency, and after roughly the first 80k words the frequency is 1, which easily makes any statistical disambiguation system fail due to sparseness. In this corpus even common verbs never have their full set of inflections represented. So this is also a latent problem for proper Spanish POS tagging!
FreeLing has a basic one, but it lacks lots of vocabulary and is difficult to extend. The built-in 600k-word dictionary has lots of errors (I tested it) and lacks lots of commonly used Latin American words. So it was useless to me.
Hope this sheds some light on the problems Spanish poses for chatbots.
This also explains why there are no known Spanish contests for agents: most of them work really badly and, as it is complicated to test or try them out, they hide behind a wall of lies, saying they do inflection analysis when they don’t. They only mess around with some simple suffix stripping, which gives reasonable results, and no one cares for more!
Google Translate also suffers from the same inflection-related problems; it hardly ever translates complex or inflected phrases correctly from Spanish to English or the other way.
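To make the "simple suffix stripping" mentioned above concrete, here is a minimal Python sketch. It is not the poster's engine and not anything built into ChatScript: the rule table `SUFFIX_RULES` and the tiny `LEXICON` are hypothetical samples chosen for illustration. It shows why the approach gives reasonable results on regular verbs and silently fails on irregular ones.

```python
# Toy suffix-stripping lemmatizer for Spanish verbs (illustrative only).
# SUFFIX_RULES and LEXICON are tiny hypothetical samples, not a real rule set.
SUFFIX_RULES = [
    # (inflected ending, infinitive ending), longest endings first
    ("aríamos", "ar"),
    ("ábamos", "ar"),
    ("aremos", "ar"),
    ("amos", "ar"),
    ("aron", "ar"),
    ("aba", "ar"),
    ("as", "ar"),
    ("an", "ar"),
    ("é", "ar"),
    ("ó", "ar"),
    ("o", "ar"),
    ("a", "ar"),
]

LEXICON = {"hablar", "enviar", "entrar", "ir", "ser"}


def strip_suffix(word):
    """Return a lemma if some rule yields a known infinitive, else None."""
    word = word.lower()
    for ending, replacement in SUFFIX_RULES:
        if word.endswith(ending):
            candidate = word[: -len(ending)] + replacement
            if candidate in LEXICON:
                return candidate
    return None


if __name__ == "__main__":
    for form in ["hablamos", "hablé", "entraron", "fui", "voy"]:
        print(form, "->", strip_suffix(form))
```

Regular forms such as "hablamos" resolve to "hablar", while "fui" and "voy" (forms of "ir") match no rule at all, which is the gap a full morphological analyzer has to close.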
|
|
|
|
|
Posted: Oct 29, 2015 |
[ # 49 ]
|
|
Senior member
Total posts: 141
Joined: Apr 24, 2011
|
Bruce Wilcox - Oct 28, 2015: There is no chance that conjugation of verbs can be done by the CS engine, as that is code and it is not written for Spanish. However, you can declare canon: in your scripts, so you could define the canonical value of all verb conjugations and noun inflections manually (tedious), assuming the Spanish plural of a noun is not a simple added s. Replicating the ontology in Spanish is doable (merely writing concepts would do that as well). POS tagging is possible if you declare concepts like:
concept: ~nounlist NOUN NOUN_SINGULAR ( hacienda ...)
concept: ~nounlistplural NOUN NOUN_PLURAL (haciendas ....)
But chatbots don’t usually need POS tagging and parsing. They do want canonicals for ease of rule writing. If you write the noun list concept as above, that will do spelling correction. You probably want to remove ALL the English dictionary files.
How long it takes to write the concept lists depends on how big your vocabulary is.
Hi Bruce, I can answer all of these questions for you, to enlighten everyone.
A Spanish vocabulary has about 110k canonical forms and 300+ prefixes, many of them nestable, with about 160 inflected forms for verbs and over 50 for nouns and adjectives. Verbs also have many complex structures, so you cannot simply detect a verb: you need to parse out the auxiliary verbs (similar to English phrasal verbs) and compound tenses, and the variation in inflection reaches more than 14 complex tense variations. This can only be analyzed by means of a chunker capable of detecting this complex syntax, which I built into my system.
Luckily, adverbs alone do not admit suffix inflection, but derivational operations are very common: verbs convert into nouns, adjectives and even adverbs, almost all adjectives turn easily into adverbs, and so on.
To solve this, I built an inflection+derivation engine that does all the analysis in my chatbot structure and obtains the canonical form. This engine has more than 5600 suffix rules as well as 300 common prefixes, almost all of them nestable.
The result? There are over 300 million simple-inflected words, but many have nested inflections; some people have counted over 3000 million different Spanish words.
To do good parsing and POS tagging, and to allow a decent ontology walk, you need the corpus frequency of almost all the inflected word forms, but this remains a big mystery for Spanish, because no huge lemmatized corpus is available, due to the lack of good, free lemmatization engines (analyzers). The only solution is to infer them, which is not a good one!
The problem is that the frequency of most inflected words in corpora is zero or one. For example, in a 450-million-word corpus that I got access to, there are only 1M different words with known frequency, and after roughly the first 80k words the frequency is 1, which easily makes any statistical disambiguation system fail due to sparseness. In this corpus even common verbs never have their full set of inflections represented. So this is also a latent problem for proper Spanish POS tagging!
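The sparseness argument above can be checked on any tokenized corpus by counting how many distinct word forms occur exactly once. The following Python sketch is only an illustration of that measurement; the function name `sparsity_report` and the toy sentence are made up here, and it says nothing about the 450-million-word corpus itself.

```python
from collections import Counter


def sparsity_report(tokens):
    """Count distinct word forms and how many of them occur exactly once."""
    counts = Counter(token.lower() for token in tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": sum(counts.values()),
        "distinct_forms": len(counts),
        "singletons": singletons,
        "singleton_share": singletons / len(counts) if counts else 0.0,
    }


if __name__ == "__main__":
    sample = "yo hablo y tú hablas pero ellos no hablan y yo no hablaba".split()
    print(sparsity_report(sample))
```

Even in this toy sentence every inflected form of "hablar" appears only once, so a purely frequency-based disambiguator would have nothing to learn from.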
The other big problem is that most inflected words in Spanish use written accents (diacritic marks), and they are commonly misspelled in most corpora. For example, in the 450M-word corpus from Marquez et al. (Chilean Spanish), almost 90% of the 1M supposedly unique words were misspelled! (This was because he used a bad lemmatizer from Conexxor, who are Norwegian, not native Spanish speakers, and sold a bad tool to most universities all over the world.)
The Catalan UPC FreeLing has a basic dictionary, but it lacks lots of vocabulary and is difficult to extend. The built-in 600k inflected-word dictionary (fewer than 6000 verbs) has lots of inflection errors (I tested it) and lacks most of the commonly used Latin American words. So it was useless to me. Also, the license is restrictive: you cannot use it commercially without sharing all the derived work for free.
Hope this sheds some light on the problems Spanish poses for chatbots.
This also explains why there are no known Spanish contests for agents: most of them work really badly and, as it is complicated to test or try them out, they hide behind a thick wall of lies, saying they “do inflection analysis” when they really don’t. They only mess around with some simple behind-the-scenes suffix stripping, which yields reasonable results, and no one cares for more!
Google Translate also suffers from the same inflection-related problems; it hardly ever translates complex or inflected phrases correctly from Spanish to English or the other way.
|
|
|
|
|
Posted: Oct 30, 2015 |
[ # 50 ]
|
|
Senior member
Total posts: 218
Joined: Jun 20, 2012
|
OK. So I tried these steps:
Created a new set of folders for Chatscript 5.72
Deleted the dictionary files in the “DICT\ENGLISH” subfolder
Created one dictionary file h.txt with one entry:
hablar ( meanings=1 glosses=1 VERB VERB_INFINITIVE posdefault:VERB_INFINITIVE)
hablar~1vz to talk
Replaced the file “LIVEDATA\SYSTEM\canonical.txt” with a new file of the same name with these entries:
hablar hablar
hablo hablar
hablas hablar
habla hablar
hablamos hablar
habláis hablar
hablais hablar
hablan hablar
hablé hablar
hablaste hablar
habló hablar
hablasteis hablar
hablaron hablar
hablaba hablar
hablabas hablar
hablábamos hablar
hablabais hablar
hablaban hablar
hablaría hablar
hablarías hablar
hablaríamos hablar
hablaríais hablar
hablarían hablar
hablaré hablar
hablarás hablar
hablará hablar
hablaremos hablar
hablaréis hablar
hablarán hablar
Created a topic file in my chatbot folder “RAWDATA\HARRY\”
called “Spanish.top” with the following concept and topic:
concept: ~verblist VERB VERB_PRESENT ( hablar hablo hablas habla hablamos habláis hablais hablan hablé ... )
topic: ~spanish ( spanish hablar )
u:( _hablar ) ^keep() ^repeat()
It worked. canonical: _0 original: '_0
u:( _*1 ) ^keep() ^repeat()
Fell through. canonical: _0 original: '_0
I get the following output:
Alaric: hablo espanol
Harry: It worked. canonical: hablar original: hablo
Alaric: hablé espanol
Harry: Fell through. canonical: unknown-word original: hablé
It seems to work for regular characters but not for special UTF-8 characters. Using the :prepare command, it seems to flag the word as an adjective and not as a verb.
I know the latest version of CS was modified to handle UTF-8 characters. Was there something I needed to do to enable CS to use the canonical.txt file with special characters?
Answer: Yes, make sure to resave the file in UTF-8 format from Notepad.
(Then stop the server, close your browser, restart the CS server, reopen the browser and launch your web client chatbot, since special characters do not show up in the Windows console client.)
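For anyone who prefers a script to Notepad, the resave step can be sketched in Python as below. This is an assumption-laden illustration: the helper name `resave_as_utf8` is made up, and the default source encoding `cp1252` is only a guess at what a Windows editor would have written.

```python
from pathlib import Path


def resave_as_utf8(path, source_encoding="cp1252"):
    """Rewrite a text file as UTF-8. source_encoding is an assumption."""
    file_path = Path(path)
    raw = file_path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, leave the file untouched
    except UnicodeDecodeError:
        text = raw.decode(source_encoding)
    # Write bytes directly so line endings are preserved exactly.
    file_path.write_bytes(text.encode("utf-8"))
    return True
```

Calling `resave_as_utf8("canonical.txt")` returns True when the file had to be converted and False when it was already valid UTF-8.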
New Output:
Alaric: hablé espanol
Harry: It worked. canonical: hablar original: hablé
Alaric: habláis espanol
Harry: It worked. canonical: hablar original: habláis
So it is possible to map the 29 verb conjugations for the word hablar to just one canonical “hablar” that can be used in all of your topics! And you can still reference the original word used in the sentence if needed!
Alaric: hadlo espanol
Harry: It worked. canonical: hablar original: hablo
It seems like spell checking works most of the time as well.
With all of the conjugations it seems that a Spanish chatbot would benefit from the canonical pattern matching in Chatscript even more than an English chatbot. Everyone have fun creating a Spanish dictionary.
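Typing the canonical.txt entries by hand is the tedious part, so the regular cases can be generated. The Python sketch below is a hedged illustration, not part of ChatScript: it covers only regular -ar verbs in the five indicative tenses used in the list above, and the function names are invented here. Irregular and spelling-changing verbs would still need hand-written entries.

```python
# Endings for regular -ar verbs that attach to the stem (indicative only).
STEM_ENDINGS = {
    "present": ["o", "as", "a", "amos", "áis", "an"],
    "preterite": ["é", "aste", "ó", "amos", "asteis", "aron"],
    "imperfect": ["aba", "abas", "aba", "ábamos", "abais", "aban"],
}
# Future and conditional endings attach to the whole infinitive.
INFINITIVE_ENDINGS = {
    "future": ["é", "ás", "á", "emos", "éis", "án"],
    "conditional": ["ía", "ías", "ía", "íamos", "íais", "ían"],
}


def conjugate_regular_ar(infinitive):
    """List the distinct forms of a regular -ar verb, infinitive first."""
    if not infinitive.endswith("ar"):
        raise ValueError("only regular -ar verbs are handled")
    stem = infinitive[:-2]
    forms = [infinitive]
    for endings in STEM_ENDINGS.values():
        forms += [stem + ending for ending in endings]
    for endings in INFINITIVE_ENDINGS.values():
        forms += [infinitive + ending for ending in endings]
    return list(dict.fromkeys(forms))  # drop repeats, keep order


def canonical_lines(infinitive):
    """Format 'form canonical' pairs, one per line, as in the list above."""
    return [f"{form} {infinitive}" for form in conjugate_regular_ar(infinitive)]


if __name__ == "__main__":
    print("\n".join(canonical_lines("hablar")))
```

The output would need to be saved as UTF-8, for the reason described above. Unaccented variants such as "hablais" are not generated and would have to be added separately.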
|
|
|
|
|
Posted: Oct 30, 2015 |
[ # 51 ]
|
|
Senior member
Total posts: 218
Joined: Jun 20, 2012
|
NOTE: Also, I originally created a concept with just the VERB tag but Chatscript seemed to require VERB_PRESENT or one of the other more specific verb tags to correctly tag the words as verbs.
This did not work:
concept: ~verblist VERB ( hablar hablo hablas habla hablamos habláis hablais hablan hablé ... )
Therefore, the list I included here should properly be broken down into the more specific verb tenses (for testing, all verb conjugations were incorrectly flagged as VERB_PRESENT).
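Splitting the list by tense can also be scripted. The Python sketch below emits one concept declaration per tense in the same shape as the concept lines in this thread. The concept names are hypothetical, and the VERB_PAST flag is an assumption made by analogy with VERB_PRESENT; check both against your ChatScript version before relying on them.

```python
def concept_line(name, flags, words):
    """Format one concept declaration in the style used in this thread."""
    return f"concept: ~{name} {' '.join(flags)} ( {' '.join(words)} )"


FORMS_BY_TENSE = {
    "present": ["hablo", "hablas", "habla", "hablamos", "habláis", "hablan"],
    "past": ["hablé", "hablaste", "habló", "hablasteis", "hablaron"],
}

# Hypothetical concept names; the VERB_PAST flag is an assumption.
TENSE_TO_CONCEPT = {
    "present": ("verbpresentlist", ["VERB", "VERB_PRESENT"]),
    "past": ("verbpastlist", ["VERB", "VERB_PAST"]),
}


def build_concepts(forms_by_tense):
    """Return one concept declaration per tense, in insertion order."""
    lines = []
    for tense, words in forms_by_tense.items():
        name, flags = TENSE_TO_CONCEPT[tense]
        lines.append(concept_line(name, flags, words))
    return lines


if __name__ == "__main__":
    print("\n".join(build_concepts(FORMS_BY_TENSE)))
```

Note that "hablamos" is both a present and a preterite form, so it is listed only under the present here; a form like that cannot be assigned to a single tense by a word list alone.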
|
|
|
|
|
Posted: Nov 6, 2015 |
[ # 52 ]
|
|
Senior member
Total posts: 179
Joined: Feb 11, 2015
|
Thanks Bruce, thanks Alaric
So it seems creating a new DICT file is the way to work around it. Thanks for the explanation and examples.
But I have a big question. In an earlier post (page 3) Bruce said: “You have no use for the gloss, so meanings=4 which tells how many lines of gloss will always be meanings=0 (or omitted entirely).” But I see that you, Alaric, have put “meanings=1”. Why is that?
Bruce, could you please tell me if you have something to add about what Alaric posted?
Thanks to both. I spent the previous 4 days reading hundreds of survey-like responses from patients, gathering patterns for my Spanish chatbot. Now I’m rereading the CS manual; I hope I can get the grip of it with your help. Thanks again.
Andres, yes indeed, your research is astonishing. You have a huge advantage and a long head start in Spanish chatbots, congrats. I remember the time I told you I could not share user chats because of user rights and confidentiality matters. Thanks for sharing your concepts and knowledge; I hope you can guide my little endeavour too.
|
|
|
|
|
Posted: Nov 7, 2015 |
[ # 53 ]
|
|
Moderator
Total posts: 2372
Joined: Jan 12, 2010
|
Alaric may have put meanings=1, but CS itself has no use for meanings, except to deliver them to the script for presentation to a user if you use a ^define call. So if you don’t have a use for the definition (i.e. you don’t need to tell the user what the word means), you don’t need a meaning entry.
And for the parts of an entry you do use (the flags like VERB and VERB_PRESENT or whatever), you can supply those instead via a concept set. So while Alaric’s concept: ~verblist VERB (....) did not work for him, I’m not sure what it means that it didn’t work. It should have enabled him to use ~verb in patterns. And had he declared separate concepts for
~verbpresentlist VERB VERB_PRESENT (...) and ~verbpastlist VERB VERB_PAST (...) etc., then that would have replicated what can be done via the dictionary, although putting it in a dictionary is more memory-efficient because it does not have to create facts (additional memory) to add the word to the concept list.
One can always create things so that CS gives both the original and the canonical forms. You can define in script what the canonical form of any word is. This is potentially not successful if a word is both a noun and a verb AND the canonical forms differ (generally not true in English). Since there is no Spanish parser/POS tagger built into CS, it would not be able to determine whether a word was being used as a noun or a verb.
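The noun-versus-verb conflict described above can be spotted before the entries ever reach CS. The Python sketch below is an illustration only (the function name and sample pairs are made up here): it scans form/canonical pairs and reports any form that is mapped to more than one canonical, such as Spanish "vino", which is both the noun "wine" and a past form of "venir".

```python
from collections import defaultdict


def find_ambiguous_forms(entries):
    """Return {form: sorted canonicals} for forms with more than one canonical."""
    canonicals_by_form = defaultdict(set)
    for form, canonical in entries:
        canonicals_by_form[form.lower()].add(canonical.lower())
    return {
        form: sorted(canonicals)
        for form, canonicals in canonicals_by_form.items()
        if len(canonicals) > 1
    }


if __name__ == "__main__":
    entries = [
        ("vino", "vino"),    # noun: wine
        ("vino", "venir"),   # verb: he/she came
        ("hablo", "hablar"),
        ("casa", "casa"),    # noun: house
        ("casa", "casar"),   # verb: he/she marries
    ]
    print(find_ambiguous_forms(entries))
```

Forms flagged this way are the ones where a single canonical mapping cannot be right in every sentence, so they deserve a manual decision.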
|
|
|
|
|
Posted: Nov 7, 2015 |
[ # 54 ]
|
|
Senior member
Total posts: 218
Joined: Jun 20, 2012
|
I included the definition for hablar in the dictionary so that it could later be used by the ^define command and the chatbot could use it in responses.
Besides entering forms in canonical.txt, what other method in ChatScript allows for the assignment of a canonical form to a word so that both the canonical form and the original text can be captured in _0 and '_0 variables?
I did not understand that POS tagging would not work in a limited manner. When I included verb_present the :Prepare command showed the Spanish word hablar correctly tagged as verb_present and when I omitted it the POS tagger did not recognize it as a verb. So we should forget POS tagging and focus only on the concepts.
Where do the concepts ~verb_bits, ~animate_verbs, ~enable, ~goodness come from in the following :prepare statement? (I deleted the dictionary files. I added the concepts for pronouns for ~pronoun_subject_s1 and ~subjectpronounlist. It seems to still spell check espanol->enable somehow.)
>:prepare yo hablo espanol
TokenControl: DO_SUBSTITUTE_SYSTEM DO_NUMBER_MERGE DO_PROPERNAME_MERGE DO_DATE_MERGE DO_SPELLCHECK DO_INTERJECTION_SPLITTING DO_PARSE
Original User Input: yo hablo espanol
Tokenized into: yo hablo espanol
Spelling changed into: yo hablo enable
Actual used input: yo hablo enable
Xref: 1:yo 2:hablo 3:enable
Fragments: 1:yo 2:hablo 3:enable
Tagged POS 3 words: yo (MAINSUBJECT Pronoun_subject) hablo/hablar (MAINVERB Verb_present) enable (VERB2 Verb_infinitive)
MainSentence: Subj: yo Verb: hablo PRESENT
Concepts:
1: yo raw= +~pronoun(1) +~pronoun_subject(1) +~pronoun_bits(1) +~mainsubject(1) +yo(1) +~pronoun_subject_s1(1)
. +~subjectpronounlist(1) //
1: yo canonical= //
2: hablo raw= +~verb_present(2) +~verb_bits(2) +~verb(2) +~mainverb(2) +hablo(2) +~verblist(2) //
2: hablar canonical= +hablar(2) +T~spanish (2) //
3: enable raw= +~verb_infinitive(3) +~verb_bits(3) +~verb(3) +~verb2(3) +~sentenceend(3) +enable(3)
. +~causal_to_infinitive_verbs(3) +~misc_parsedata(3) +~enable(3) +~alter_functionality_verbs(3) +~affect_object_verbs(3)
. +~animate_verbs(3) +~verbs(3) +~active_verbs(3) +~goodness(3) //
3: enable canonical= //
+~repeatinput1(1) +~repeatinput2(1)
sequences=
After parse TokenFlags: SPELLCHECK PRESENT USERINPUT
|
|
|
|
|
Posted: Nov 7, 2015 |
[ # 55 ]
|
|
Moderator
Total posts: 2372
Joined: Jan 12, 2010
|
Aside from editing canonical.txt, you can also declare a canonical form in your script using canon:
Given POS tagging data on words, either from the dictionary or from concept declarations, CS will use its internal POS tagging system. Where a word has only one possible POS type (e.g. verb), the system should work. If you have a word with 2 or more types, the rules for deciding which is the correct POS will be “English” rules and probably completely useless for Spanish.
POS tagging concepts like ~verb_bits, ~verb_present, etc. come from the engine doing POS tagging.
~goodness comes from the standard ontology in the ChatScript folder, in affect.top.
~animate_verbs comes from the standard ontology, in verbs.top, as probably does ~enable.
|
|
|
|
|
Posted: Nov 7, 2015 |
[ # 56 ]
|
|
Senior member
Total posts: 218
Joined: Jun 20, 2012
|
Thanks.
I created pronoun concepts. I created the following to test the ability to catch all forms of the question: Do I/You/He/We/You All/They speak <language>?
All subject pronouns are grouped into the concept ~Subject_Pronouns. All forms of to speak are mapped to canonical “hablar”. Several languages are grouped into the concept ~Languages.
Additionally, all second person singular subject pronouns: tú vos usted are mapped to canonical tú; the same was done for other groups of pronouns. So the concept ~Pronoun_Subject_S2 can be used in patterns or just the canonical tú.
English spellings of languages were mapped to canonical Spanish spellings.
Substitutions were added to replace Spanish words spelled without accents to the correct spelling using Spanish accented characters.
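The accent substitutions can be derived from the accented vocabulary instead of being typed one by one. The Python sketch below is a hedged illustration using only the standard `unicodedata` module; the function names are invented here, and how the resulting pairs are fed into ChatScript is deliberately left out. It skips any unaccented spelling that could stand for more than one word.

```python
import unicodedata
from collections import defaultdict


def strip_accents(word):
    """Remove combining marks, so 'español' becomes 'espanol'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_substitutions(vocabulary):
    """Map unaccented spellings to the accented word when unambiguous."""
    candidates = defaultdict(set)
    for word in vocabulary:
        candidates[strip_accents(word)].add(word)
    return {
        plain: next(iter(words))
        for plain, words in candidates.items()
        if len(words) == 1 and plain not in words
    }


if __name__ == "__main__":
    vocabulary = ["español", "inglés", "francés", "hablo", "habló", "el", "él"]
    print(build_substitutions(vocabulary))
```

Pairs like "hablo"/"habló" and "el"/"él" are left out on purpose: both spellings are real words, so replacing one with the other would change the meaning.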
SAMPLE OUTPUT:
>usted hablas espanol?
Sí, hablo Español.
>ella habla english?
Sí, habla Inglés.
I noticed if I do not include a PRONOUN_SUBJECT flag in my ~Subject_Pronouns that the canonical mapping for ella to él in the canonical.txt file will not happen. Is this canonical mapping dependent on flagging a word with a hard-coded POS flag? Is it possible to add a new POS flag PRONOUN_INDIRECT_OBJECT without modifying the code?
SAMPLE TOPIC:
concept: ~Pronouns ( ~subject_pronouns ~object_pronouns ~indirect_object_pronouns )
concept: ~Subject_Pronouns PRONOUN_SUBJECT ( ~Pronoun_Subject_S1 ~Pronoun_Subject_S2 ~Pronoun_Subject_S3 ~Pronoun_Subject_P1 ~Pronoun_Subject_P2 ~Pronoun_Subject_P3 )
concept: ~Pronoun_Subject_S1 ( yo )
concept: ~Pronoun_Subject_S2 ( tú vos usted )
concept: ~Pronoun_Subject_S3 ( él ella ello )
concept: ~Pronoun_Subject_P1 ( nosotros nosotras )
concept: ~Pronoun_Subject_P2 ( vosotros vosotras ustedes )
concept: ~Pronoun_Subject_P3 ( ellos ellas )
...
concept: ~Languages ( English Inglés Spanish Español French Francés German Alemán Italian Italiano )
topic: ~spanish ( hablar ~Languages )
?: ( _~Subject_Pronouns hablar _~Languages ) ^keep() ^repeat() ^refine()
a: ( yo ) ^keep() ^repeat() Sí, hablas '_1.
a: ( tú ) ^keep() ^repeat() Sí, hablo '_1.
a: ( él ) ^keep() ^repeat() Sí, habla '_1.
a: ( nosotros ) ^keep() ^repeat() Sí, hablamos '_1.
a: ( vosotros ) ^keep() ^repeat() Sí, habláis '_1.
a: ( ellos ) ^keep() ^repeat() Sí, hablan '_1.
|
|
|
|
|
Posted: Nov 8, 2015 |
[ # 57 ]
|
|
Moderator
Total posts: 2372
Joined: Jan 12, 2010
|
Canonical data is not tied to part-of-speech data.
If you are running with the regular CS dictionary in addition to your Spanish material, then Ella is a girl’s name. Does :prepare you ella me give you Ella as a name or ella as your word? When I declare the canonical of batec to be el (since batec is not a word) and declare batec in a concept set, concept: ~ll (batec), with no POS info, it correctly gives me el as the canonical.
|
|
|
|
|
Posted: Nov 10, 2015 |
[ # 58 ]
|
|
Senior member
Total posts: 179
Joined: Feb 11, 2015
|
Hi Bruce, hi Alaric
Alaric, you said that you erased all the contents of the canonical.txt file and replaced them with all the conjugations of ‘hablar’. Does this mean that I would have to type into canonical.txt all the conjugations of all the verbs that I have in my Spanish DICT files (h.txt: hablar, hacer; e.txt: enviar, entrar, etc.)? Please correct me if I’m wrong.
Sorry, Bruce, I did not understand your answer when Alaric asked:
“Besides entering forms in canonical.txt, what other method in chatscript allows for the assignment of a canonical form to a word so that both the canonical form and the original text can be captured in _0 and '_0 variables?”
I would like to make a chatbot that speaks only Spanish. What is the most proper way to get the canonical and original text?
Thanks in advance.
|
|
|
|
|
Posted: Nov 12, 2015 |
[ # 59 ]
|
|
Moderator
Total posts: 2372
Joined: Jan 12, 2010
|
There are two ways to have CS recognize canonical words. Canonical.txt is one way. The other is to use canon: in some script file (it is a build command like topic:). It’s extra typing, since you have to put canon: for each word. And in your case, since you are not augmenting canonical.txt (you don’t want the English), it makes more sense to simply use your own complete copy of canonical.txt in Spanish.
|
|
|
|
|
Posted: Nov 12, 2015 |
[ # 60 ]
|
|
Senior member
Total posts: 179
Joined: Feb 11, 2015
|
Thanks Bruce for your reply,
So that means I have to type into canonical.txt all the conjugations of all the verbs that I have in my Spanish DICT files (h.txt: hablar, hacer; e.txt: enviar, entrar, etc.), right? That would make it a big text file, much bigger than the DICT. Please correct me if I’m wrong.
If that is true, then how would I differentiate between “te enviare” (FUTURE_TENSE of the verb send) and “ya te envie” (PAST_TENSE of the verb send) in my topic?
Thanks in advance, Bruce.
Regards
|
|
|
|