I am considering further work on POS tagging to generate data from documents. I have not read anything about the document reader aspect of CS; I am more concerned with the POS tagger.
The documentation says that functions like partofspeech and decodepos can return the 64-bit POS info of the word at a location. I assume this means a location in an input sentence, but since the arguments are either a role or a location, I am not sure how to pass in the sentence. It says to look in Dictionary.h for more on the bit info, but I have searched my CS dir recursively and cannot find this file.
I have used :prepare on input like “It is a mammal belonging to the horse family”. The POS section inaccurately tags “belonging” in its noun form, but the concept set recognizes both the noun and verb aspects once it recognizes the canonical form. Is there any way to get this double recognition when working with the POS parser?
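To make clear what I mean by "64 bit pos info": as I understand it, one integer carries several part-of-speech flags at once, so a single word could have both a noun bit and a verb bit set. Here is a minimal Python sketch of that idea. The flag names and bit positions are hypothetical and chosen only for illustration; the real values are whatever the engine source defines, which I have not been able to find.

```python
from enum import IntFlag


class PosBits(IntFlag):
    # HYPOTHETICAL bit positions, for illustration only.
    # The actual values are defined by the engine, not here.
    NOUN = 1 << 0
    VERB = 1 << 1
    ADJECTIVE = 1 << 2
    PREPOSITION = 1 << 3


def describe(bits: int) -> list[str]:
    """Return the names of every flag that is set in the bitmask."""
    return [flag.name for flag in PosBits if bits & flag]


# A word that can be either a noun or a verb would have both bits set.
ambiguous = PosBits.NOUN | PosBits.VERB
print(describe(ambiguous))
```

If the value returned for a word really is a mask like this, then testing individual bits should show every possible reading of the word, not just the one the tagger chose.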
I am posting the full prepare output here:
TokenControl: DO_SUBSTITUTE_SYSTEM DO_NUMBER_MERGE DO_PROPERNAME_MERGE DO_DATE_MERGE DO_SPELLCHECK DO_PARSE
Original User Input: it is a mammal belonging to the horse family
Tokenized into: it is a mammal belonging to the horse family
Actual used input: it is a mammal belonging to the horse family
Xref: 1:it 2:is o5 3:a >5 4:mammal >5 5:belonging 6:to 7:the >9 8:horse >9 9:family
Fragments: 1:it 2:is 3:a 4:mammal 5:belonging 6:to 7:the 8:horse 9:family
Tagged POS 9 words: it (MAINSUBJECT Pronoun_subject) is/be (MAINVERB Verb_present_3ps) a (Determiner) mammal (Adjective_noun) belonging/belong (MAINOBJECT Noun_singular) to (Particle Preposition) the/a (Determiner) horse (Adjective_noun) family (APPOSITIVE Noun_singular)
MainSentence: Subj: it Verb: is Obj: [ a mammal] belonging PRESENT
Concepts:
1: it raw= +~pronoun(1) +~pronoun_subject(1) +~pronoun_bits(1) +~kindergarten(1) +~mainsubject(1) +it(1) +~it_words(1) //
1: it canonical= //
2: is raw= +~verb_present_3ps(2) +~verb_bits(2) +~verb(2) +~kindergarten(2) +~mainverb(2) +is(2) +~linkingverb(2) +~auxverblist(2)
. +~wordnetpropogate(2) +~equals(2) //
2: be canonical= +be(2) +~tobe(2) +~be_verbs(2) +~states_of_being(2) +~static_verbs(2) +~usefulfactverb(2) //
3: a raw= +~determiner(3) +~determiner_bits(3) +~kindergarten(3) +a(3) +~determinerlist(3) +~vowels(3) +~letters(3) //
3: a canonical= //
4: mammal raw= +~adjective(4) +~adjective_noun(4) +~grade3_4(4) +mammal(4) +~animals_generic(4) +~mammals(4) +~beings(4) +~tool(4)
. +~animate_thing(4) +~objects(4) +~nounlist(4) +~animals(4) +~rideable(4) +~functions(4) +~eatable(4) +~burnable(4) +~animal_kingdoms(4)
. +being~1(4) +~nounroot(4) //
4: mammal canonical= //
5: belonging raw= +~noun_abstract(5) +~noun(5) +~noun_singular(5) +~singular(5) +~normal_noun_bits(5) +~noun_bits(5)
. +~kindergarten(5) +~mainobject(5) +belonging(5) +~feeling_attached(5) +~feeling_words(5) +~emotions(5) +~sensations(5)
. +~attributes(5) +~nounlist(5) +~goodness(5) +~nounroot(5) //
5: belong canonical= +belong(5) +~own(5) +~possess(5) +~possession_verbs(5) +~social_verbs(5) +~animate_verbs(5) +~verbs(5)
. +~active_verbs(5) +~use_intentionverbs(5) +~static_verbs(5) +~do_with_titles(5) //
6: to raw= +~lowercase_title(6) +~particle(6) +~preposition(6) +~kindergarten(6) +~locationword(6) +~locatedentity(6)
. +~there(6) +to(6) +~directionpreposition(6) +~spacepreposition(6) +~prepositionroot(6) +~focus(6) +~directions(6) //
6: to canonical= //
7: the raw= +~lowercase_title(7) +~determiner(7) +~determiner_bits(7) +~kindergarten(7) +the(7) +~determinerlist(7) //
7: a canonical= +a(7) +~vowels(7) +~letters(7) //
8: horse raw= +~adjective(8) +~adjective_noun(8) +~kindergarten(8) +horse(8) +~sizes(8) +~soundmaker(8) +~vehicles_land(8)
. +~vehicle(8) +~tool(8) +~rideable(8) +~functions(8) +~enterable(8) +~auto_dealer(8) +~store_type(8) +~store(8) +~attributes(8)
. +~nounlist(8) +~artifacts(8) +~objects(8) +~human_data(8) +~herbivore(8) +~hobbies_animals(8) +~hobby(8)
. +~entertainment_stuff(8) +~pet_animals(8) +~pet_store(8) +~animals(8) +~eatable(8) +~burnable(8) +~beings(8) +~animate_thing(8)
. +~animals_generic(8) +~animal_kingdoms(8) +~mammals(8) +being~1(8) +~nounroot(8) //
8: horse canonical= //
9: family raw= +~noun_abstract(9) +~noun(9) +~noun_singular(9) +~singular(9) +~normal_noun_bits(9) +~noun_bits(9)
. +~kindergarten(9) +~appositive(9) +~sentenceend(9) +family(9) +~related_list(9) +~societal_data(9) +~human_data(9)
. +~stronggoodness(9) +~goodness(9) +~nounroot(9) +~life_taxonomy(9) +being~1(9) //
9: family canonical= //
sequences=
+it_be(1-2)
+belong_to(5-6)
+to(2) +~directionpreposition(2) +~spacepreposition(2) +~prepositionroot(2) +~focus(2)
. +~directions(2)
After parse TokenFlags: PRESENT USERINPUT
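For reference, this is roughly the "double recognition" I am after, done as a workaround outside CS by scraping the Concepts listing. It is only a sketch in Python: the function name and the regular expressions are my own, and they assume the layout of the :prepare output shown in this post (entries ending in `//`, continuation lines starting with a dot). The excerpt is the word 5 entry copied from that output.

```python
import re
from collections import defaultdict

# Word 5 entries copied from the :prepare output in this post.
PREPARE_EXCERPT = """\
5: belonging raw= +~noun_abstract(5) +~noun(5) +~noun_singular(5) +~singular(5) +~normal_noun_bits(5) +~noun_bits(5)
. +~kindergarten(5) +~mainobject(5) +belonging(5) +~feeling_attached(5) +~feeling_words(5) +~emotions(5) +~sensations(5)
. +~attributes(5) +~nounlist(5) +~goodness(5) +~nounroot(5) //
5: belong canonical= +belong(5) +~own(5) +~possess(5) +~possession_verbs(5) +~social_verbs(5) +~animate_verbs(5) +~verbs(5)
. +~active_verbs(5) +~use_intentionverbs(5) +~static_verbs(5) +~do_with_titles(5) //
"""

# One entry: "<index>: <word> raw= +mark(n) +mark(n) ... //" (or "canonical=").
ENTRY_RE = re.compile(r"(\d+): (\S+) (raw|canonical)=(.*?)//", re.DOTALL)
# One mark: "+name(position)" or "+name(start-end)"; capture only the name.
MARK_RE = re.compile(r"\+(\S+?)\(\d+(?:-\d+)?\)")


def concepts_by_position(text):
    """Map word position -> {'raw': set of marks, 'canonical': set of marks}."""
    result = defaultdict(lambda: {"raw": set(), "canonical": set()})
    for index, _word, kind, body in ENTRY_RE.findall(text):
        result[int(index)][kind].update(MARK_RE.findall(body))
    return dict(result)


word5 = concepts_by_position(PREPARE_EXCERPT)[5]
print("noun reading (raw):", "~noun" in word5["raw"])
print("verb reading (canonical):", "~verbs" in word5["canonical"])
```

This shows both readings of “belonging” are present in the concept marks even though the tagger itself settled on Noun_singular, but I would much rather get the same information from the POS parser directly than by parsing debug output.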