API Reference (auto-generated)¶
Morphological Analyzer¶
- class pymorphy2.analyzer.MorphAnalyzer(path=None, result_type=<class 'pymorphy2.analyzer.Parse'>, units=None, probability_estimator_cls=<class 'pymorphy2.analyzer.SingleTagProbabilityEstimator'>)[source]¶
Morphological analyzer for the Russian language.
For a given word it can find all possible inflectional paradigms and thus compute all possible tags and normal forms.
The analyzer uses morphological word features and a lexicon (a dictionary compiled from XML available at OpenCorpora.org); for unknown words a heuristic algorithm is used.
Create a MorphAnalyzer object:
>>> import pymorphy2
>>> morph = pymorphy2.MorphAnalyzer()
MorphAnalyzer uses dictionaries from pymorphy2-dicts package (which can be installed via pip install pymorphy2-dicts).
Alternatively (e.g. if you have your own precompiled dictionaries), either set the PYMORPHY2_DICT_PATH environment variable to the dictionary path, or pass the path argument to the pymorphy2.MorphAnalyzer constructor:
>>> morph = pymorphy2.MorphAnalyzer('/path/to/dictionaries')
By default, methods of this class return parsing results as Parse namedtuples. This has performance implications under CPython, so if you need maximum speed, pass result_type=None to make the analyzer return plain unwrapped tuples:
>>> morph = pymorphy2.MorphAnalyzer(result_type=None)
- DEFAULT_UNITS = [
      [pymorphy2.units.by_lookup.DictionaryAnalyzer,
       pymorphy2.units.abbreviations.AbbreviatedFirstNameAnalyzer,
       pymorphy2.units.abbreviations.AbbreviatedPatronymicAnalyzer],
      pymorphy2.units.by_shape.NumberAnalyzer,
      pymorphy2.units.by_shape.PunctuationAnalyzer,
      [pymorphy2.units.by_shape.RomanNumberAnalyzer,
       pymorphy2.units.by_shape.LatinAnalyzer],
      pymorphy2.units.by_hyphen.HyphenSeparatedParticleAnalyzer,
      pymorphy2.units.by_hyphen.HyphenAdverbAnalyzer,
      pymorphy2.units.by_hyphen.HyphenatedWordsAnalyzer,
      pymorphy2.units.by_analogy.KnownPrefixAnalyzer,
      [pymorphy2.units.by_analogy.UnknownPrefixAnalyzer,
       pymorphy2.units.by_analogy.KnownSuffixAnalyzer],
      pymorphy2.units.unkn.UnknAnalyzer,
  ]¶
- ENV_VARIABLE = u'PYMORPHY2_DICT_PATH'¶
- iter_known_word_parses(prefix=u'')[source]¶
Return an iterator over parses of dictionary words that start with a given prefix (the default empty prefix means “all words”).
- parse(word)[source]¶
Analyze the word and return a list of pymorphy2.analyzer.Parse namedtuples:
Parse(word, tag, normal_form, para_id, idx, _score) (or plain tuples if result_type=None was passed to the constructor).
- word_is_known(word, strict_ee=False)[source]¶
Check if a word is in the dictionary. Pass strict_ee=True if the word is guaranteed to have correct е/ё letters.
Note
Dictionary words are not always correct words; the dictionary also contains incorrect forms which are commonly used. So for spellchecking tasks this method should be used with extra care.
Analyzer units¶
Dictionary analyzer unit¶
Analogy analyzer units¶
This module provides analyzer units that analyze unknown words by looking at how similar known words are analyzed.
- class pymorphy2.units.by_analogy.KnownPrefixAnalyzer(morph)[source]¶
Parse the word by checking if it starts with a known prefix and parsing the remainder.
Example: псевдокошка -> (псевдо) + кошка.
Analyzer units for unknown words with hyphens¶
- class pymorphy2.units.by_hyphen.HyphenAdverbAnalyzer(morph)[source]¶
Detect adverbs that start with “по-”.
Example: по-западному
- class pymorphy2.units.by_hyphen.HyphenSeparatedParticleAnalyzer(morph)[source]¶
Parse the word by analyzing it without a particle after a hyphen.
Example: смотри-ка -> смотри + “-ка”.
Note
This analyzer doesn’t remove particles from the result, so for normalization you may need to handle particles at the tokenization level.
Analyzer units that analyze non-word tokens¶
- class pymorphy2.units.by_shape.LatinAnalyzer(morph)[source]¶
This analyzer marks Latin words with the “LATN” tag. Example: “pdf” -> LATN
Tagset¶
Utils for working with grammatical tags.
Wrapper class for OpenCorpora.org tags.
Warning
In order to work properly, the class has to be globally initialized with actual grammemes (using the _init_grammemes method).
Pymorphy2 initializes it when loading a dictionary, so it may not be a good idea to use this class directly. If possible, use morph_analyzer.TagClass instead.
Example:
>>> from pymorphy2 import MorphAnalyzer
>>> morph = MorphAnalyzer()
>>> Tag = morph.TagClass  # get an initialized Tag class
>>> tag = Tag('VERB,perf,tran plur,impr,excl')
>>> tag
OpencorporaTag('VERB,perf,tran plur,impr,excl')
Tag instances have attributes for accessing grammemes:
>>> print(tag.POS)
VERB
>>> print(tag.number)
plur
>>> print(tag.case)
None
Available attributes are: POS, animacy, aspect, case, gender, involvement, mood, number, person, tense, transitivity and voice.
You may check if a grammeme is in tag or if all grammemes from a given set are in tag:
>>> 'perf' in tag
True
>>> 'nomn' in tag
False
>>> 'Geox' in tag
False
>>> set(['VERB', 'perf']) in tag
True
>>> set(['VERB', 'perf', 'sing']) in tag
False
In order to fight typos, for unknown grammemes an exception is raised:
>>> 'foobar' in tag
Traceback (most recent call last):
...
ValueError: Grammeme is unknown: foobar
>>> set(['NOUN', 'foo', 'bar']) in tag
Traceback (most recent call last):
...
ValueError: Grammemes are unknown: {'bar', 'foo'}
This also works for attributes:
>>> tag.POS == 'plur'
Traceback (most recent call last):
...
ValueError: 'plur' is not a valid grammeme for this attribute. Valid grammemes: ...
Return the Latin representation of a tag_or_grammeme string.
Cyrillic representation of this tag
Replace rare cases (loc2/voct/...) with common ones (loct/nomn/...).
A frozenset with grammemes for this tag.
A frozenset with Cyrillic grammemes for this tag.
Return the Cyrillic representation of a tag_or_grammeme string.
Return a new set of grammemes with required grammemes added and incompatible grammemes removed.
Command-Line Interface¶
Usage:
pymorphy dict compile <DICT_XML> [--out <PATH>] [--force] [--verbose] [--min_ending_freq <NUM>] [--min_paradigm_popularity <NUM>] [--max_suffix_length <NUM>]
pymorphy dict download_xml <OUT_FILE> [--verbose]
pymorphy dict mem_usage [--dict <PATH>] [--verbose]
pymorphy dict make_test_suite <XML_FILE> <OUT_FILE> [--limit <NUM>] [--verbose]
pymorphy dict meta [--dict <PATH>]
pymorphy prob download_xml <OUT_FILE> [--verbose]
pymorphy prob estimate_cpd <CORPUS_XML> [--out <PATH>] [--min_word_freq <NUM>]
pymorphy _parse <IN_FILE> <OUT_FILE> [--dict <PATH>] [--verbose]
pymorphy -h | --help
pymorphy --version
Options:
-v --verbose Be more verbose
-f --force Overwrite target folder
-o --out <PATH> Output folder name [default: dict]
--limit <NUM> Min. number of words per gram. tag [default: 100]
--min_ending_freq <NUM> Prediction: min. number of suffix occurrences [default: 2]
--min_paradigm_popularity <NUM> Prediction: min. number of lexemes for the paradigm [default: 3]
--max_suffix_length <NUM> Prediction: max. length of prediction suffixes [default: 5]
--min_word_freq <NUM> P(t|w) estimation: min. word count in source corpus [default: 1]
--dict <PATH> Dictionary folder path
Utilities for OpenCorpora Dictionaries¶
- class pymorphy2.opencorpora_dict.wrapper.Dictionary(path)[source]¶
OpenCorpora dictionary wrapper class.
- build_paradigm_info(para_id)[source]¶
Return a list of (prefix, tag, suffix) tuples representing the paradigm.
- build_stem(paradigm, idx, fixed_word)[source]¶
Return word stem (given a word, paradigm and the word index).
- iter_known_words(prefix=u'')[source]¶
Return an iterator over (word, tag, normal_form, para_id, idx) tuples for dictionary words that start with a given prefix (the default empty prefix means “all words”).
- word_is_known(word, strict_ee=False)[source]¶
Check if a word is in the dictionary. Pass strict_ee=True if the word is guaranteed to have correct е/ё letters.
Note
Dictionary words are not always correct words; the dictionary also contains incorrect forms which are commonly used. So for spellchecking tasks this method should be used with extra care.
Various Utilities¶
- pymorphy2.tokenizers.simple_word_tokenize(text)[source]¶
Split text into tokens. Hyphenated words are kept as single tokens.
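The documented behaviour can be approximated with a short regular expression. The sketch below is a hypothetical reimplementation for illustration (hence the `_sketch` suffix), not pymorphy2's actual source:

```python
import re

# Words (hyphens kept inside them) and individual punctuation marks become
# separate tokens; whitespace is discarded.
_TOKEN_RE = re.compile(r"[\w-]+|[^\w\s]")

def simple_word_tokenize_sketch(text):
    return _TOKEN_RE.findall(text)

print(simple_word_tokenize_sketch("hello, world-wide"))
# ['hello', ',', 'world-wide']
```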
- pymorphy2.shapes.is_latin(token)[source]¶
Return True if all letters in the token are Latin and there is at least one Latin letter in the token:
>>> is_latin('foo')
True
>>> is_latin('123-FOO')
True
>>> is_latin('123')
False
>>> is_latin(':)')
False
>>> is_latin('')
False
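The examples above can be reproduced by a small sketch (a hypothetical reimplementation of the documented behaviour, not pymorphy2's actual source):

```python
def is_latin_sketch(token):
    # True only if the token has at least one Latin letter and every
    # alphabetic character in it is a Latin letter.
    has_latin = any('a' <= ch.lower() <= 'z' for ch in token)
    all_latin = all('a' <= ch.lower() <= 'z' for ch in token if ch.isalpha())
    return has_latin and all_latin
```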
- pymorphy2.shapes.is_punctuation(token)[source]¶
Return True if a word contains only spaces and punctuation marks and there is at least one punctuation mark:
>>> is_punctuation(', ')
True
>>> is_punctuation('..!')
True
>>> is_punctuation('x')
False
>>> is_punctuation(' ')
False
>>> is_punctuation('')
False
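A possible way to implement this check is via Unicode general categories; the following is a hypothetical sketch matching the examples above, not pymorphy2's actual source:

```python
import unicodedata

def is_punctuation_sketch(token):
    # Unicode general categories starting with "P" denote punctuation.
    def punct(ch):
        return unicodedata.category(ch).startswith('P')
    return (bool(token)
            and any(punct(ch) for ch in token)
            and all(ch.isspace() or punct(ch) for ch in token))
```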
- pymorphy2.shapes.is_roman_number(token)[source]¶
Return True if token looks like a Roman number:
>>> is_roman_number('II')
True
>>> is_roman_number('IX')
True
>>> is_roman_number('XIIIII')
False
>>> is_roman_number('')
False
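One common way to express this check is the classic Roman-numeral regular expression; the sketch below is hypothetical and may differ from pymorphy2's actual implementation:

```python
import re

# Thousands, hundreds, tens, and units groups of a well-formed Roman numeral.
ROMAN_RE = re.compile(r'M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})')

def is_roman_number_sketch(token):
    return bool(token) and ROMAN_RE.fullmatch(token) is not None
```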
- pymorphy2.shapes.restore_capitalization(word, example)[source]¶
Make the capitalization of word match that of example:
>>> restore_capitalization('bye', 'Hello')
'Bye'
>>> restore_capitalization('half-an-hour', 'Minute')
'Half-An-Hour'
>>> restore_capitalization('usa', 'IEEE')
'USA'
>>> restore_capitalization('pre-world', 'anti-World')
'pre-World'
>>> restore_capitalization('123-do', 'anti-IEEE')
'123-DO'
>>> restore_capitalization('123--do', 'anti--IEEE')
'123--DO'
If the alignment fails, the remainder is lower-cased:
>>> restore_capitalization('foo-BAR-BAZ', 'Baz-Baz')
'Foo-Bar-baz'
>>> restore_capitalization('foo', 'foo-bar')
'foo'
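The behaviour illustrated above can be sketched as follows. This is a hypothetical reimplementation that reproduces the documented examples, not pymorphy2's actual source:

```python
def _same_case(word, example):
    # Copy the case pattern of `example` onto `word`.
    if example.islower():
        return word.lower()
    if example.isupper():
        return word.upper()
    if example.istitle():
        return word.title()
    return word

def restore_capitalization_sketch(word, example):
    if '-' in example:
        # Align hyphen-separated segments pairwise; lower-case any
        # unmatched remainder of the word.
        example_parts = example.split('-')
        results = []
        for i, part in enumerate(word.split('-')):
            if i < len(example_parts):
                results.append(_same_case(part, example_parts[i]))
            else:
                results.append(part.lower())
        return '-'.join(results)
    return _same_case(word, example)
```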
- pymorphy2.shapes.restore_word_case(word, example)[source]¶
This function has been renamed to restore_capitalization.
- pymorphy2.utils.combinations_of_all_lengths(it)[source]¶
Return an iterable with all possible combinations of items from it:
>>> for comb in combinations_of_all_lengths('ABC'):
...     print("".join(comb))
A
B
C
AB
AC
BC
ABC
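This behaviour can be sketched with the standard library's itertools (a hypothetical reimplementation matching the example above, not pymorphy2's actual source):

```python
from itertools import chain, combinations

def combinations_of_all_lengths_sketch(it):
    # Chain together the r-combinations for every length from 1 to len(items).
    items = list(it)
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1))

print(["".join(c) for c in combinations_of_all_lengths_sketch('ABC')])
# ['A', 'B', 'C', 'AB', 'AC', 'BC', 'ABC']
```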
- pymorphy2.utils.download_bz2(url, out_fp, chunk_size=262144, on_chunk=<lambda>)[source]¶
Download a bz2-encoded file from url and write it to out_fp file.
- pymorphy2.utils.json_read(filename, **json_options)[source]¶
Read an object from a json file filename
- pymorphy2.utils.json_write(filename, obj, **json_options)[source]¶
Create file filename with obj serialized to JSON
- pymorphy2.utils.largest_group(iterable, key)[source]¶
Find a group of largest elements (according to key).
>>> s = [-4, 3, 5, 7, 4, -7]
>>> largest_group(s, abs)
[7, -7]
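The same behaviour can be sketched in a few lines (a hypothetical reimplementation matching the example above, not pymorphy2's actual source):

```python
def largest_group_sketch(iterable, key):
    # Collect all elements whose key equals the maximum key, preserving order.
    items = list(iterable)
    if not items:
        return []
    best = max(key(x) for x in items)
    return [x for x in items if key(x) == best]
```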