In English-language terminology, the term part-of-speech tagging (POS tagging) is used, literally “part-of-speech markup”. In fact, morphological tags include not only the part of speech but also the lemma and the values of the grammatical categories characteristic of that part of speech.
Automatic morphological analysis is a dedicated module of an automatic language-analysis system that analyzes word forms at the morphological level.
This is the main type of markup: first, most large corpora are morphologically annotated corpora; second, morphological analysis is regarded as the basis for further kinds of analysis, syntactic and semantic; and third, advances in computational morphology make it possible to annotate large corpora automatically with a high degree of accuracy.
One of the main components of correct word parsing is the set of morpheme databases. When the program starts, the dictionaries are loaded and search over them is optimized.
Every word is analyzed from its beginning to its end. The program tries to find a sequence of morphemes that belongs to a particular part of speech. For example, according to the morpheme database, the verbal suffix “-l-” (indicative mood, past tense) can be followed by the endings “-a”, “-o”, “-i”, or a zero ending. Parsing is considered successful if the whole word has been split into morphemes in accordance with the rules of the Russian language and no unparsed letters remain.
The program accumulates all possible parses and selects the optimal one. To do this, it uses a system of morpheme weights: each morpheme or group of morphemes is assigned a weight, and the parse with the greatest total weight is considered optimal.
For example, interjections have a higher weight than nouns, so that an obviously false analysis confusing an interjection with a noun is not chosen as optimal (the word okhrana, “protection”, has the root khran, not the interjection okh). The weight of a parse may decrease if it contains many roots, because multi-root words are less common in Russian than single-root words. If a morpheme consists of many characters, its weight increases (as in dostoprimechatelnost, “attraction”, so that the system does not keep singling out the prefix do and the roots sto, “hundred”, and mech, “sword”).
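The weighting idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny transliterated morpheme inventory, the rule that a morpheme's weight is the square of its length, and the fixed penalty for each extra root. Part-of-speech weights are not modeled, and this is not the program's actual algorithm.

```python
# Hypothetical toy inventory (transliterated); a real system would load
# large morpheme dictionaries at start-up.
MORPHEMES = {
    "prefix": {"o"},
    "root": {"khran", "okh", "ran"},
    "suffix": {"l"},
    "ending": {"a", "o", "i"},
}

# Which morpheme kinds may follow which (an assumed, simplified grammar).
NEXT = {
    "start": ("prefix", "root"),
    "prefix": ("prefix", "root"),
    "root": ("root", "suffix", "ending"),
    "suffix": ("suffix", "ending"),
    "ending": (),
}


def segmentations(word, state="start"):
    """Yield every split of `word` into (kind, morpheme) pairs with no letters left over."""
    if not word:
        # Running out of letters after a root or suffix counts as a zero ending.
        if state in ("root", "suffix", "ending"):
            yield []
        return
    for kind in NEXT[state]:
        for morpheme in MORPHEMES[kind]:
            if word.startswith(morpheme):
                for rest in segmentations(word[len(morpheme):], kind):
                    yield [(kind, morpheme)] + rest


def weight(parse):
    """Assumed scoring: longer morphemes weigh more, extra roots are penalized."""
    total = sum(len(morpheme) ** 2 for _, morpheme in parse)
    roots = sum(1 for kind, _ in parse if kind == "root")
    return total - 5 * (roots - 1)


def best_parse(word):
    """Return the highest-weight parse, or None if the word cannot be fully parsed."""
    options = list(segmentations(word))
    return max(options, key=weight) if options else None


print(best_parse("okhrana"))
```

For okhrana (“protection”) the sketch finds two complete parses, o + khran + a and okh + ran + a, and prefers the first because it contains a single long root.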
After parsing, the program generates a report file that lists every word with its parsing options and shows the morphemes of each word explicitly. It also computes statistics for the input text, for example which roots occur in the text and how often.
The analysis performed by the morphological module of a natural language processing system can take the following forms:
1. normalization of word forms (lemmatization), i.e. reduction of different word forms to a single representation, the base form or lemma;
2. stemming, another kind of normalization in which different word forms are reduced to the same stem or, more precisely, “pseudo-stem” (for some tasks, including web search, it is enough to reduce different derivatives to the same stem, for example the adjective photographic and the noun photography, since documents containing either the phrase photographic portrait or the phrase portrait photography will satisfy the user's query);
3. part-of-speech tagging (POS tagging), i.e. indication of the part of speech of each word form in the text;
4. full morphological analysis, i.e. assignment of grammatical characteristics to a word form.
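The idea of a “pseudo-stem” can be illustrated with a deliberately naive suffix stripper in Python. The suffix list and the minimum stem length are assumptions chosen only to cover the photographic / photography example; real stemmers use much richer rule sets.

```python
# Hypothetical suffix list, just enough for the example in the text.
SUFFIXES = ("ic", "y", "s")
MIN_STEM_LENGTH = 3  # assumed guard against stripping very short words


def pseudo_stem(word):
    """Strip the longest matching suffix and return the remaining pseudo-stem."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def stems(phrase):
    """Reduce a phrase to the set of pseudo-stems of its words."""
    return {pseudo_stem(w) for w in phrase.split()}


print(stems("photographic portrait") == stems("portrait photography"))
```

Both phrases reduce to the same set of pseudo-stems, which is why a search engine that indexes stems can match either phrase against the other.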
In 1980, an annotated version of the Brown Corpus appeared, in which word forms were lemmatized, their surface syntactic functions were marked, and so on.
The morphological markup of the Brown Corpus looks as follows:
the_AT jury_NN further_RB said_VBD in_IN term-end_NN presentments_NNS that_CS the_AT *city_NP *executive_NP *committee_NP ,_, which_WDT had_HVD over-all_JJ charge_NN of_IN the_AT election_NN ,_, deserves_VBZ the_AT praise_NN and_CC thanks_NNS of_IN the_AT *city_NP of_IN *atlanta_NP for_IN the_AT manner_NN in_IN which_WDT the_AT election_NN was_BEDZ conducted_VBN
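A line in this word_TAG notation is easy to split back into (word form, tag) pairs. The Python sketch below is only an illustration: the function names are hypothetical, and it assumes that tokens are separated by spaces and that the tag follows the last underscore of each token.

```python
from collections import Counter


def parse_tagged(line):
    """Split a 'word_TAG word_TAG ...' line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            continue  # token carries no tag; skip it
        pairs.append((word, tag))
    return pairs


def tag_counts(line):
    """Count how often each tag occurs in the line."""
    return Counter(tag for _, tag in parse_tagged(line))


sample = "the_AT jury_NN further_RB said_VBD ,_, the_AT election_NN"
print(parse_tagged(sample))
print(tag_counts(sample).most_common(2))
```

Using `rpartition` rather than `split` keeps punctuation tokens such as `,_,` intact, because only the last underscore is treated as the separator.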
Let us give an example of the morphological markup of a fragment of Russian text, “They called for Vespers. Solemn rumble of bells”, in XML format produced by the AOT tagger (Fig. 1).
The presented entry uses the tags
Syntactic markup is the result of parsing performed on the basis of morphological analysis data. This type of markup describes the syntactic links between lexical units and various syntactic constructions (for example, a subordinate clause, a verb phrase, etc.).
Fig. 1. An example of morphological markup of a Russian text
(for a list of grammemes, see Appendix 3)
Unlike morphological markup, the ways of representing syntactic structure and syntactic relations are far less unified. There is a variety of syntactic theories and formalisms:
• dependency grammar;
• immediate constituent grammar;
• grammar of structural schemes;
• the traditional syntactic doctrine of sentence parts;
• semantic syntax, etc.
Syntactic analysis of Russian is most often represented by dependency structures. Figure 2 shows an example of a dependency tree visualization.
Long ago, in the city of Babylon, the people began to build a huge tower which seemed to reach the heavens soon.
Fig. 2. An example of syntactic parsing
(dependency grammar, ETAP-3 system)
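A dependency structure can be stored as a list of head indices, one per token. The Python sketch below is an illustration under stated assumptions: the head assignments for a fragment of the example sentence were made by hand for this example and are not the output of ETAP-3, and relation labels are omitted.

```python
# Fragment of the example sentence; positions are 1-based, head 0 marks the root.
TOKENS = ["the", "people", "began", "to", "build", "a", "huge", "tower"]
HEADS = [2, 3, 0, 5, 3, 8, 8, 5]  # hand-assigned, illustrative only


def is_tree(heads):
    """True if there is exactly one root and every token reaches it without a cycle."""
    if heads.count(0) != 1:
        return False
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False  # walked into a cycle
            seen.add(node)
            node = heads[node - 1]
    return True


def dependents(heads, head):
    """Return the 1-based positions of all tokens governed by `head`."""
    return [i for i, h in enumerate(heads, start=1) if h == head]


print(is_tree(HEADS))
print([TOKENS[i - 1] for i in dependents(HEADS, 3)])
```

Here the verb began is the root, and its direct dependents in this hand-made analysis are people and build.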
Semantic tags most often denote the semantic categories to which a given word or phrase belongs, together with narrower subcategories that specify its meaning. Semantic markup of corpora covers the specification of word meanings, the resolution of homonymy and synonymy, the categorization of words, and the identification of thematic classes, causation features, evaluative and derivational characteristics, etc.
The Russian National Corpus (RNC) offers its own version of semantic markup. In this corpus, each word form is assigned labels of three types:
1) class (proper name, reflexive pronoun, etc.);
2) lexico-semantic characteristics (thematic class of the lexeme, signs of causation, evaluation, etc.);
3) derivational characteristics (“diminutive”, “deadjectival adverb”, etc.).
The lexico-semantic tags proper are grouped into the following fields:
• taxonomy (thematic class of a lexeme) – for nouns, adjectives, verbs and adverbs;
• mereology (indication of the relations “part – whole” and “element – set”) – for concrete and non-concrete nouns;
• topology (topological status of the designated object) – for concrete nouns;
• causation – for verbs;
• auxiliary status – for verbs;
• evaluation – for concrete and non-concrete nouns, adjectives and adverbs.
Derivational characteristics are of several types:
• morpho-semantic derivational features (for example, “caritive”, “semelfactive”);
• the category of the base word (for example, a deverbal noun or a deadjectival adverb);
• lexico-semantic (taxonomic) type of the base word (for example, an adverb derived from an adjective of size);
• morphological type of word formation (substantivization, compounding) (for more details, see http://ruscorpora.ru, section “Semantics”).
There are other types of markup, in particular:
• anaphoric markup, which captures referential links, for example those of pronouns;
• prosodic markup. Prosodic corpora use tags to indicate stress and intonation. In corpora of spoken language, prosodic markup is often accompanied by so-called discourse markup, which indicates pauses, repetitions, slips of the tongue, etc.