Using PunktSentenceTokenizer in NLTK

PunktSentenceTokenizer is the class behind the default sentence tokenizer provided in NLTK, i.e. sent_tokenize(). It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79

Given a paragraph with several sentences, for example:

>>> from nltk.corpus import state_union
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '

You can use sent_tokenize():

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(train_text[11])
['Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', 'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
...     print(sent)
...     print('--------')
... 
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world. 
--------

sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages. The languages with pre-trained models available in NLTK are:

~/nltk_data/tokenizers/punkt$ ls
czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle    PY3                turkish.pickle
estonian.pickle  italian.pickle  README

Given a text in another language, do this:

>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "

>>> for sent in sent_tokenize(german_text, language='german'):
...     print(sent)
...     print('---------')
... 
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. 
---------
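
The German example shows why a trained, language-specific model matters: "17." and "19." are ordinals, not sentence ends. For contrast only, here is a minimal sketch of a naive splitter (a hypothetical rule, not what Punkt does) that breaks after every period followed by whitespace. It uses only the standard library.

```python
import re

# Second sentence of the German example above.
german_sentence = (
    "Über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der "
    "südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten."
)

# Naive rule (illustrative assumption): split on whitespace that
# directly follows a period.
naive_sentences = re.split(r"(?<=\.)\s+", german_sentence)

for piece in naive_sentences:
    print(repr(piece))
```

The naive rule cuts this single sentence into three fragments, whereas the German Punkt model in the session above keeps it intact.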

To train your own Punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and the training data format for NLTK Punkt.

PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before it can be used [1]. NLTK already includes a pre-trained version of the PunktSentenceTokenizer.

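Note that if you initialize the tokenizer without any arguments, it starts from default, untrained parameters rather than the pre-trained English model (which is what sent_tokenize() loads). The defaults are still enough for simple text: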

In [1]: import nltk
In [2]: tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
In [3]: txt = """ This is one sentence. This is another sentence."""
In [4]: tokenizer.tokenize(txt)
Out[4]: [' This is one sentence.', 'This is another sentence.']

You can also supply your own training data to train the tokenizer before using it. The Punkt tokenizer uses an unsupervised algorithm, which means you simply train it on plain, unannotated text.

from nltk.tokenize import PunktSentenceTokenizer
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)  # train_text: one plain-text string
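
To make the training step concrete, here is a minimal sketch that trains on a short made-up string and then splits a new text. The training text is a hypothetical stand-in. In practice you would pass a much larger sample from your own domain, and the parameters learned from a sample this small are not meaningful.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical training text: plain, unannotated prose in a single string.
train_text = (
    "The committee met on Monday. It reviewed three proposals. "
    "Two of them were approved. The third was sent back for revision. "
    "The next meeting is planned for spring."
)

# Passing a string to the constructor trains the tokenizer on it;
# no sentence labels are required.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

sentences = custom_sent_tokenizer.tokenize("Hello world. This is fine.")
for sent in sentences:
    print(sent)
```

The same tokenize() call works whether the tokenizer was trained this way or loaded from a pre-trained model.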

In most cases the pre-trained version is perfectly adequate, so you can simply call sent_tokenize() instead of training your own tokenizer.

So "what does all this have to do with POS tagging"? The NLTK POS tagger works on tokenized sentences, so you need to split your text into sentences and then into word tokens before you can POS-tag it.
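
The preprocessing described here can be sketched as follows. This stops short of tagging, because the tagger needs a separately downloaded model. WordPunctTokenizer is swapped in as the word tokenizer only because it is a pure regular-expression tokenizer that needs no data files. That choice is an assumption of this sketch, not part of the original answer.

```python
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = "This is one sentence. This is another sentence."

# Untrained default parameters are enough for this simple text.
sentence_splitter = PunktSentenceTokenizer()
# Stand-in word tokenizer (assumption): regex-based, needs no data files.
word_splitter = WordPunctTokenizer()

# One list of word tokens per sentence.
tokenized_sentences = [
    word_splitter.tokenize(sentence)
    for sentence in sentence_splitter.tokenize(text)
]

for tokens in tokenized_sentences:
    print(tokens)
```

Each inner list is a tokenized sentence, which is the input shape a tagger such as nltk.pos_tag expects.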

See the NLTK documentation.

[1] Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection"