Metadata-Version: 2.1
Name: snowballstemmer
Version: 2.2.0
Summary: This package provides 29 stemmers for 28 languages generated from Snowball algorithms.
Home-page: https://github.com/snowballstem/snowball
Author: Snowball Developers
Author-email: snowball-discuss@lists.tartarus.org
License: BSD-3-Clause
Keywords: stemmer
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: Arabic
Classifier: Natural Language :: Basque
Classifier: Natural Language :: Catalan
Classifier: Natural Language :: Danish
Classifier: Natural Language :: Dutch
Classifier: Natural Language :: English
Classifier: Natural Language :: Finnish
Classifier: Natural Language :: French
Classifier: Natural Language :: German
Classifier: Natural Language :: Greek
Classifier: Natural Language :: Hindi
Classifier: Natural Language :: Hungarian
Classifier: Natural Language :: Indonesian
Classifier: Natural Language :: Irish
Classifier: Natural Language :: Italian
Classifier: Natural Language :: Lithuanian
Classifier: Natural Language :: Nepali
Classifier: Natural Language :: Norwegian
Classifier: Natural Language :: Portuguese
Classifier: Natural Language :: Romanian
Classifier: Natural Language :: Russian
Classifier: Natural Language :: Serbian
Classifier: Natural Language :: Spanish
Classifier: Natural Language :: Swedish
Classifier: Natural Language :: Tamil
Classifier: Natural Language :: Turkish
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Database
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Text Processing :: Linguistic
Description-Content-Type: text/x-rst
License-File: COPYING

Snowball stemming library collection for Python
===============================================

Python 3 (>= 3.3) is supported.  We no longer actively support Python 2 as the
Python developers stopped supporting it at the start of 2020.  Snowball 2.1.0
was the last release to officially support Python 2.

What is Stemming?
-----------------

Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps *connection*, *connections*, *connective*,
*connected*, and *connecting* to *connect*.  So a search for *connected*
would also find documents which only have the other forms.

This stem form is often a word itself, but this is not always the case as this
is not a requirement for text search systems, which are the intended field of
use.  We also aim to conflate words with the same meaning, rather than all
words with a common linguistic root (so *awe* and *awful* don't have the same
stem), and over-stemming is more problematic than under-stemming so we tend not
to stem in cases that are hard to resolve.  If you want to always reduce words
to a root form and/or get a root form which is itself a word then Snowball's
stemming algorithms likely aren't the right answer.

How to use library
------------------

The ``snowballstemmer`` module has two functions.

The ``snowballstemmer.algorithms`` function returns a list of available
algorithm names.
The ``snowballstemmer.stemmer`` function takes an algorithm name and returns a
``Stemmer`` object.

``Stemmer`` objects have a ``Stemmer.stemWord(word)`` method and a
``Stemmer.stemWords(word[])`` method.

.. code-block:: python

   import snowballstemmer

   # Create a stemmer for the chosen algorithm, then stem a list of words.
   stemmer = snowballstemmer.stemmer('english')
   print(stemmer.stemWords("We are the world".split()))

Automatic Acceleration
----------------------

`PyStemmer <https://pypi.org/project/PyStemmer/>`_ is a wrapper module for
Snowball's ``libstemmer_c`` and should provide results 100% compatible with
**snowballstemmer**.

**PyStemmer** is faster because it wraps generated C versions of the stemmers;
**snowballstemmer** uses generated Python code and is slower but offers a pure
Python solution.

If PyStemmer is installed, ``snowballstemmer.stemmer`` returns a ``PyStemmer``
``Stemmer`` object which provides the same ``Stemmer.stemWord()`` and
``Stemmer.stemWords()`` methods.

Benchmark
~~~~~~~~~

This is a crude benchmark which measures the time for running each stemmer on
every word in its sample vocabulary (10,787,583 words over 26 languages).  It's
not a realistic test of normal use as a real application would do much more
than just stemming.  It's also skewed towards the stemmers which do more work
per word and towards those with larger sample vocabularies.

* Python 2.7 + **snowballstemmer** : 13m00s (15.0 * PyStemmer)
* Python 3.7 + **snowballstemmer** : 12m19s (14.2 * PyStemmer)
* PyPy 7.1.1 (Python 2.7.13) + **snowballstemmer** : 2m14s (2.6 * PyStemmer)
* PyPy 7.1.1 (Python 3.6.1) + **snowballstemmer** : 1m46s (2.0 * PyStemmer)
* Python 2.7 + **PyStemmer** : 52s

For reference the equivalent test for C runs in 9 seconds.

These results are for Snowball 2.0.0.  They're likely to evolve over time as
the code Snowball generates for both Python and C continues to improve (for
a much older test over a different set of stemmers using Python 2.7,
**snowballstemmer** was 30 times slower than **PyStemmer**, or 9 times slower
with **PyPy**).
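
As a rough way to get comparable timings on your own data, the sketch below
times a ``stemWords`` call over a word list.  The helper name
``time_stemming``, the repeat count and the sample words are illustrative
assumptions and are not part of Snowball's own benchmark, so the figures it
prints will not match those above.  If **PyStemmer** is installed, the object
returned by ``snowballstemmer.stemmer`` is the accelerated one, so that is
what gets measured.

```python
import time

try:
    import snowballstemmer
except ImportError:  # the timing helper below still works with any callable
    snowballstemmer = None


def time_stemming(stem_words, words, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs.

    `stem_words` is any callable that accepts a list of words
    (for example the stemWords method of a Stemmer object).
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        stem_words(words)
        best = min(best, time.perf_counter() - start)
    return best


# Illustrative sample only: a handful of English word forms, repeated.
SAMPLE_WORDS = (
    "connection connections connective connected connecting".split() * 2000
)

if snowballstemmer is not None:
    stemmer = snowballstemmer.stemmer("english")
    elapsed = time_stemming(stemmer.stemWords, SAMPLE_WORDS)
    print(f"{len(SAMPLE_WORDS)} words stemmed in {elapsed:.3f}s")
```

Taking the best of a few runs reduces noise from other processes; for a
meaningful comparison, use a word list drawn from your real workload.
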
The message to take away is that if you're stemming a lot of words you should
either install **PyStemmer** (which **snowballstemmer** will then automatically
use for you as described above) or use PyPy.

The TestApp example
-------------------

The ``testapp.py`` example program allows you to run any of the stemmers on a
sample vocabulary.

Usage::

   testapp.py <algorithm> "sentences ... "

.. code-block:: bash

   $ python testapp.py English "sentences... "
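
For readers who want the same behaviour from their own code, here is a
minimal sketch of a script with the same command-line shape.  It is not the
source of ``testapp.py``; the function names ``stem_sentence`` and ``main``
are hypothetical.  It also assumes that ``snowballstemmer.algorithms()``
reports lower-case names, which is why the argument is lower-cased before
the lookup.

```python
def stem_sentence(stem_word, sentence):
    """Split `sentence` on whitespace and stem each token.

    `stem_word` is any callable mapping one word to its stem
    (for example the stemWord method of a Stemmer object).
    """
    return [stem_word(token) for token in sentence.split()]


def main(argv):
    """Hypothetical entry point: argv = [program, algorithm, sentence]."""
    if len(argv) != 3:
        print(f'usage: {argv[0]} <algorithm> "sentences ... "')
        return 2

    # Imported here so the helper above can be used without the package.
    import snowballstemmer

    # Assumption: algorithm names are reported in lower case.
    algorithm = argv[1].lower()
    if algorithm not in snowballstemmer.algorithms():
        print(f"unknown algorithm: {argv[1]}")
        return 1

    stemmer = snowballstemmer.stemmer(algorithm)
    print(" ".join(stem_sentence(stemmer.stemWord, argv[2])))
    return 0
```

To use it as a script, call ``main(sys.argv)`` after importing ``sys`` and
pass the return value to ``sys.exit``.
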