
Questions about ^input(), ^original(), ^tokenize(), %originalinput and punctuation.

I have a few questions about how punctuation is handled when your scripts put strings into the input queue.
I am trying to inject a string into the input queue and then compare it to a variable which contains the same string. This does not work: the string gets spaces inserted before all of the commas, which makes it impossible to compare %originalsentence to the variable and determine that they are the same.

In the following code, the output from the :trace appears in comments below the line(s) that generate the trace info.

My questions:
1) Is there a way to prevent the space from being added before the commas?
2) Why does ^tokenize(WORD %originalsentence) produce 11 facts (which I expected), while ^tokenize(WORD $$_newText) produces only one fact?
3) Why does “ice cream” never show up as “ice_cream”? The documentation for ^original() specifically uses ice cream as an example, and this code does not seem to follow the doc in that regard.

topic: ~test-original keep repeat ("test original")

u: STEP-ONE (one)
# run with trace: simple fact input ~test-original

^log(\n your original input was %originalinput \n)
# your original input was test original one

$$_newText = "I do not eat cake, ice cream, or donuts"
^input($$_newText)
# Original User Input: `` I do not eat cake , ice cream , or donuts  `
    # Tokenized into: I  do  not  eat  cake  ,  ice  cream  ,  or  donuts
    # Actual used input: I do not eat cake , ice cream , or donuts

^log(\n next input done \n)
# next input done

^log(\n originalinput is %originalinput \n)
# originalinput is test original one

^log(\n originalsentence is %originalsentence \n )
# originalsentence is I do not eat cake , ice cream , or donuts

@18 = ^tokenize(WORD %originalsentence)
^log(\n tokenized original sentence length = ^length(@18) \n)
create ( I ^tokenize ^tokenize x1000010 ) Created 215754
create ( do ^tokenize ^tokenize x1000010 ) Created 215755
create ( not ^tokenize ^tokenize x1000010 ) Created 215756
create ( eat ^tokenize ^tokenize x1000010 ) Created 215757
create ( cake ^tokenize ^tokenize x1000010 ) Created 215758
create ( , ^tokenize ^tokenize x1000010 ) Created 215759
create ( ice ^tokenize ^tokenize x1000010 ) Created 215760
create ( cream ^tokenize ^tokenize x1000010 ) Created 215761
create ( , ^tokenize ^tokenize x1000010 ) Created 215762
create ( or ^tokenize ^tokenize x1000010 ) Created 215763
create ( donuts ^tokenize ^tokenize x1000010 ) Created 215764
tokenized original sentence length = 11

@15 = ^tokenize(WORD $$_newText)
^log(\n tokenized newText length = ^length(@15) \n)
create ( "I do not eat cake, ice cream, or donuts" ^tokenize ^tokenize x1000010 ) Created 215765
tokenized newText length = 1

@19 = ^uniquefacts(@18 @15)
^log(\n created unique facts length = ^length(@19) \n)
created unique facts length = 11
Result: NOPROBLEM Topic: ~test-original

if ( ^length(@19) == 0 ) {
    \n they match !! \n
}
# this is the output:
# not sure what this means

  [ # 1 ]

The input queue is internal; it is not “user input”, so you cannot expect %originalinput to reflect it as you might expect. And in fact, such input is separated for processing, as you have discovered. By the way, there is no meaning in adding the underscore to your variable $$_newtext. The variable prefixes are $, $$, and $_. Your original variable consists of a single thing: the quoted string, with quotes maintained. If you had done $$_newtext = ^“I do not eat cake, ice cream, or donuts” then the variable would not contain quotes and would decompose into more tokens.

Spell correction has changed over time; the example in the manual is now wrong. Here is the revised documentation:

u: (my _life) ^original(_0)

For the input “my lif”, spell correction will change the input to “life”, which matches here, but ^original will return “lif”.


  [ # 2 ]

Having given you the literal answers, here is a bit more depth. ChatScript does things to make it easier to detect meaning. You passed a quoted string into the input. That would not likely be particularly useful as a single token (what would be the point?), so the quoted string is broken into its pieces so that you can see inside it.


  [ # 3 ]

You ask: is there a way to prevent the space from appearing? Since actual processing of meaning would want the space there, the question becomes: what are you trying to do? Why do you want the space not to be there?


  [ # 4 ]


Thanks for all of the answers. As to the literal answers...
1) I thought that the variables were:
    a) $xxxx perm global
    b) $_xxxx perm local
    c) $$xxxx volley global
    d) $$_xxxx volley local ... I guess this is the same as $_xxxx
    e) _n match variable
    f) %systemVariable (see the system variables manual)
    g) @1-20 fact sets

2) %originalinput and %originalsentence are good. They give us a way to always get the user’s reply, even when we are playing with ^input() and ^analyze().

3) Why doesn’t ice cream get tokenized into ice_cream like the manual says?

4) The bit about the strings still confuses me. It seems that we have 4 types of strings and I can’t tell how CS views them.
    a) $var = plain text with no quotes
    b) $var = “text in double quotes”
    c) $var = ‘text in single quotes’
    d) $var = ^“dynamic text in quotes that resolves the value of $another_variable”

As to what I am trying to do….

I am adding decision tree functionality to my bot. The decision tree data has been imported from another application, and it is stored in a database as nodes, options and branches. There are over 900 trees. Most nodes have several options which branch to another node; the terminal nodes obviously do not have options. The option texts were written by the business years ago, and they are not interested in converting them to CS patterns (which I use for my normal options framework). Our client app sends the full text of the selected option back to the bot when the user clicks its button. So I am trying to match the full text of the selected option against a list of expected options that are stored in the user context as a fact set. I have been using a simple string compare via ^findText(). This worked great until we hit the punctuation and the inserted spaces. Now my strategy may be:

1)  tokenize the %originalsentence and each option’s text and see if they are equal using ^uniqueFacts()

2)  pass the ordinal number for the option (e.g. 1,2,3,4) from the client and then look up the corresponding fact in the user facts.

3) ^tokenize() each option and then use ^uniqueFacts() to discover unique words in each option that can then be used to form a CS pattern.

4) do a combination of all of the above
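Outside CS, strategy 1 boils down to comparing token sequences rather than raw strings, so the spacing around commas stops mattering. A minimal Python sketch of that idea (the word-or-single-punctuation tokenizer rule is my assumption, not CS’s actual tokenizer):

```python
import re

def tokenize(text):
    # Split into word tokens and single punctuation tokens, so that
    # "cake," and "cake ," both yield ["cake", ","].
    return re.findall(r"\w+|[^\w\s]", text)

def same_sentence(a, b):
    # Two strings match if their token sequences are identical,
    # regardless of the spacing around punctuation.
    return tokenize(a) == tokenize(b)

print(same_sentence("I do not eat cake, ice cream, or donuts",
                    "I do not eat cake , ice cream , or donuts"))  # True
```

Note that this tokenizer yields the same 11 tokens the trace above shows for the cake sentence.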

Ideally I would like the user to be able to type a non-exact match to an option and have CS discern which option the user means. For instance:

1. Twins with the same initials
2. Alabama Child Care Program
3. Newborn claim
4. Group 1245
5. Place of treatment 1, 2, or 3

I would love for the user to just type “twins” and get option 1, or “place of treatment” and get option 5.
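As an illustration only (plain Python, not CS, and the scoring rule is my own invention): picking the option with the largest word overlap already handles both of those cases.

```python
import re

OPTIONS = [
    "Twins with the same initials",
    "Alabama Child Care Program",
    "Newborn claim",
    "Group 1245",
    "Place of treatment 1, 2, or 3",
]

def words(text):
    # Lowercased word set; punctuation is ignored entirely.
    return set(re.findall(r"\w+", text.lower()))

def best_option(user_text, options):
    # Rank options by token overlap with the user's input; return the
    # 1-based index of the unique best match, or None if nothing matches
    # or the top score is tied (ambiguous).
    user = words(user_text)
    scores = [len(user & words(opt)) for opt in options]
    top = max(scores)
    if top == 0 or scores.count(top) > 1:
        return None
    return scores.index(top) + 1

print(best_option("twins", OPTIONS))               # 1
print(best_option("place of treatment", OPTIONS))  # 5
```

Returning None on ties leaves room to fall back to a disambiguation prompt instead of guessing.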



  [ # 5 ]

1) I thought that the variables were:
  a) $xxxx perm global
  b) $_xxxx perm local
  c) $$xxxx volley global
  d) $$_xxxx volley local ... I guess this is the same as $_xxxx
  e) _n match variable
  f) %systemVariable (see the system variables manual)
  g) @1-20 fact sets

a) yes
b) there is no such meaning, because “local” implies that when you leave the local context everything disappears. When you execute a topic and it generates an output, it ALWAYS returns to top level. The system may return to that topic next volley, but it doesn’t start from where it left off (except for rejoinders). So d) means merely $$, and the _ has no meaning whatsoever.

3) having revised the manual I am no longer accountable for ice_cream

4) There are multiple ways of “creating” strings; strings are the same AFTER creation. a) is not possible: you can’t say $var = my text (only the “my” is assigned). b) is a text string with double quotes, which are kept. c) is not a text string; it is merely the word ‘text. d) there are two forms of dynamic strings (equivalent to sprintf in C): ^” and ^’. The latter makes it easy to use ” inside your string without having to escape it.



  [ # 6 ]

More on what you are trying to do, later.


  [ # 7 ]

I must confess that despite your 6-7 readings of all the docs, there are many things to miss unless you have a use for them. CS embeds multiple “languages” within it, including JavaScript AND the query language; you can customize queries into facts.


  [ # 8 ]


Thanks for the clarification on Strings and variables. Hopefully your clarification here will make these topics easier for others.

Yes, there are a lot of things in CS that are easy to miss. I had thought of a JavaScript function for the comparison, but I have avoided using JS with CS because there is no good way to debug or trace the JS functions. However, I may reevaluate that decision. Do you have any plans to allow fact sets / JSON to be passed into the JavaScript code?

As to the heart of the matter...

I already have the ability for the options to have a separate CS pattern. I will continue to test the %originalsentence against each option first.

Then I am going to ^tokenize() the %originalsentence from the client, loop through each option, ^tokenize() it, and use ^uniquefacts() to determine a match.

In the future, I am thinking of having the data conversion program use CS to look at a set of options and “figure out” a unique CS pattern for each.

I will post a solution(s) soon.

CS is tremendously powerful.


  [ # 9 ]

Stephen, we have code to do what you are trying to do: match partial input from users against an option.
That code is too big to post here and contains aspects that are specific to our environment, but the essence is fairly similar to what you have described.

We tokenize each option string and the input to create a set of concepts for each unique word. We add word variations (canonicals and punctuation variations) to those concepts. We also do a bit of spell correction on the input given the options (not all of the words in the options will be real words, e.g. product names).

We can then use the various fact set functions like ^intersectfacts() and ^uniquefacts() on those concepts to determine the overlap between an option and the input.

We compute TF-IDF scores, and ultimately a cosine similarity to determine the best match, or if there are several close matches then present a disambiguation menu.
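The TF-IDF/cosine step described above can be sketched in a few lines of Python. This is my own illustration of the idea, not the poster’s actual code; in particular, the smoothed idf formula is an assumption.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"\w+", text.lower())

def tfidf(doc, df, n_docs):
    # Smoothed TF-IDF weights for one token list, as a sparse dict.
    tf = Counter(doc)
    return {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(user_text, options):
    # Build document frequencies over the options, vectorize everything,
    # and return (1-based index, score) of the most similar option.
    docs = [tokens(o) for o in options]
    df = Counter()
    for d in docs:
        df.update(set(d))
    vecs = [tfidf(d, df, len(docs)) for d in docs]
    user = tfidf(tokens(user_text), df, len(docs))
    scores = [cosine(user, v) for v in vecs]
    i = max(range(len(scores)), key=scores.__getitem__)
    return i + 1, scores[i]

opts = ["Twins with the same initials", "Newborn claim",
        "Place of treatment 1, 2, or 3"]
print(best_match("place of treatment", opts))  # best index is 3
```

A real version would add the canonical/spell-correction variants mentioned above before scoring, and present a disambiguation menu when the top scores are close.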

