NLTK Stanford Segmenter, how to configure CLASSPATH

I'm trying to use the Stanford Segmenter part of the NLTK tokenize package. However, I'm running into problems with the basic test. Running the following:

```python
# -*- coding: utf-8 -*-
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

seg = StanfordSegmenter()
seg.default_config('zh')
sent = u'这是斯坦福中文分词器测试'
print(seg.segment(sent))
```

results in this error: Error

I got as far as adding …

```python
import os
javapath = "C:/Users/User/Folder/stanford-segmenter-2017-06-09/*"
os.environ['CLASSPATH'] = javapath
```

… at the top of my code, but that didn't seem to help.

How do I get the segmenter to run properly?

Note: this solution would only work for:

  • NLTK v3.2.5 (v3.2.6 will have an even simpler interface)
  • Stanford CoreNLP (version >= 2016-10-31)

First, you have to install Java 8 properly, and if Stanford CoreNLP works on the command line, the Stanford CoreNLP API in NLTK v3.2.5 works as follows.
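To confirm the Java prerequisite from Python, a quick check might look like this (a sketch; it only inspects whatever `java` happens to be on your PATH):

```python
import shutil
import subprocess

# CoreNLP from this era requires Java 8 (version string "1.8.x").
java = shutil.which("java")
if java is None:
    print("No `java` executable found on PATH -- install Java 8 first.")
else:
    # Note: `java -version` writes its output to stderr, not stdout.
    result = subprocess.run([java, "-version"],
                            capture_output=True, text=True)
    print(result.stderr.splitlines()[0])
```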

Note: You must start the CoreNLP server in the terminal BEFORE using the new CoreNLP API in NLTK.
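Since the server has to be up first, a small helper can verify it is reachable before any NLTK calls. This is a sketch using only the standard library; the URL and port are whatever you started the server with (the server answers plain HTTP requests on its port):

```python
import urllib.request
import urllib.error

def corenlp_server_ready(url="http://localhost:9000", timeout=2):
    """Return True if something answers HTTP at `url`, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused / timeout: the server is not up yet.
        return False
```

Call `corenlp_server_ready()` right before constructing the taggers; if it returns False, start (or wait for) the `StanfordCoreNLPServer` process first.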

English

In the terminal:

```shell
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-preload tokenize,ssplit,pos,lemma,parse,depparse \
-status_port 9000 -port 9000 -timeout 15000
```

In Python:

```python
>>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
>>> stpos, stner = CoreNLPPOSTagger(), CoreNLPNERTagger()
>>> stpos.tag('What is the airspeed of an unladen swallow ?'.split())
[(u'What', u'WP'), (u'is', u'VBZ'), (u'the', u'DT'), (u'airspeed', u'NN'), (u'of', u'IN'), (u'an', u'DT'), (u'unladen', u'JJ'), (u'swallow', u'VB'), (u'?', u'.')]
>>> stner.tag('Rami Eid is studying at Stony Brook University in NY'.split())
[(u'Rami', u'PERSON'), (u'Eid', u'PERSON'), (u'is', u'O'), (u'studying', u'O'), (u'at', u'O'), (u'Stony', u'ORGANIZATION'), (u'Brook', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'in', u'O'), (u'NY', u'O')]
```

Chinese

In the terminal:

```shell
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
wget http://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-10-31-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-chinese.properties \
-preload tokenize,ssplit,pos,lemma,ner,parse \
-status_port 9001 -port 9001 -timeout 15000
```

In Python:

```python
>>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
>>> from nltk.tokenize.stanford import CoreNLPTokenizer
>>> stpos, stner = CoreNLPPOSTagger('http://localhost:9001'), CoreNLPNERTagger('http://localhost:9001')
>>> sttok = CoreNLPTokenizer('http://localhost:9001')

>>> sttok.tokenize(u'我家没有电脑。')
['我家', '没有', '电脑', '。']

# Without segmentation (the input to `raw_string_parse()` is a list of single-char strings)
>>> stpos.tag(u'我家没有电脑。')
[('我', 'PN'), ('家', 'NN'), ('没', 'AD'), ('有', 'VV'), ('电', 'NN'), ('脑', 'NN'), ('。', 'PU')]
# With segmentation
>>> stpos.tag(sttok.tokenize(u'我家没有电脑。'))
[('我家', 'NN'), ('没有', 'VE'), ('电脑', 'NN'), ('。', 'PU')]

# Without segmentation (the input to `raw_string_parse()` is a list of single-char strings)
>>> stner.tag(u'奥巴马与迈克尔·杰克逊一起去杂货店购物。')
[('奥', 'GPE'), ('巴', 'GPE'), ('马', 'GPE'), ('与', 'O'), ('迈', 'O'), ('克', 'PERSON'), ('尔', 'PERSON'), ('·', 'O'), ('杰', 'O'), ('克', 'O'), ('逊', 'O'), ('一', 'NUMBER'), ('起', 'O'), ('去', 'O'), ('杂', 'O'), ('货', 'O'), ('店', 'O'), ('购', 'O'), ('物', 'O'), ('。', 'O')]
# With segmentation
>>> stner.tag(sttok.tokenize(u'奥巴马与迈克尔·杰克逊一起去杂货店购物。'))
[('奥巴马', 'PERSON'), ('与', 'O'), ('迈克尔·杰克逊', 'PERSON'), ('一起', 'O'), ('去', 'O'), ('杂货店', 'O'), ('购物', 'O'), ('。', 'O')]
```
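The per-character "without segmentation" outputs above follow from plain Python behavior: iterating over a string yields single characters, so a raw string passed where a token list is expected becomes one "token" per character. No server is needed to see this:

```python
# Iterating a raw Python string yields one element per character, so an
# unsegmented Chinese sentence handed to a tagger that expects a token
# list is treated as single-character tokens.
sentence = u'我家没有电脑。'
print(list(sentence))   # ['我', '家', '没', '有', '电', '脑', '。']

# A properly segmented token list keeps multi-character words intact.
segmented = [u'我家', u'没有', u'电脑', u'。']
print(len(list(sentence)), len(segmented))   # 7 4
```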

German

In the terminal:

```shell
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
wget http://nlp.stanford.edu/software/stanford-german-corenlp-2016-10-31-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-german.properties

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-german.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9002 -port 9002 -timeout 15000
```

In Python:

```python
>>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
>>> stpos, stner = CoreNLPPOSTagger('http://localhost:9002'), CoreNLPNERTagger('http://localhost:9002')
>>> stpos.tag('Ich bin schwanger'.split())
[('Ich', 'PPER'), ('bin', 'VAFIN'), ('schwanger', 'ADJD')]
>>> stner.tag('Donald Trump besuchte Angela Merkel in Berlin.'.split())
[('Donald', 'I-PER'), ('Trump', 'I-PER'), ('besuchte', 'O'), ('Angela', 'I-PER'), ('Merkel', 'I-PER'), ('in', 'O'), ('Berlin', 'I-LOC'), ('.', 'O')]
```

Spanish

In the terminal:

```shell
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
wget http://nlp.stanford.edu/software/stanford-spanish-corenlp-2016-10-31-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-spanish.properties

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-spanish.properties \
-preload tokenize,ssplit,pos,ner,parse \
-status_port 9003 -port 9003 -timeout 15000
```

In Python:

```python
>>> from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger
>>> stpos, stner = CoreNLPPOSTagger('http://localhost:9003'), CoreNLPNERTagger('http://localhost:9003')
>>> stner.tag(u'Barack Obama salió con Michael Jackson .'.split())
[(u'Barack', u'PERS'), (u'Obama', u'PERS'), (u'sali\xf3', u'O'), (u'con', u'O'), (u'Michael', u'PERS'), (u'Jackson', u'PERS'), (u'.', u'O')]
>>> stpos.tag(u'Barack Obama salió con Michael Jackson .'.split())
[(u'Barack', u'np00000'), (u'Obama', u'np00000'), (u'sali\xf3', u'vmis000'), (u'con', u'sp000'), (u'Michael', u'np00000'), (u'Jackson', u'np00000'), (u'.', u'fp')]
```

French

In the terminal:

```shell
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
wget http://nlp.stanford.edu/software/stanford-french-corenlp-2016-10-31-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-french.properties

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-french.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9004 -port 9004 -timeout 15000
```

In Python:

```python
>>> from nltk.tag.stanford import CoreNLPPOSTagger
>>> stpos = CoreNLPPOSTagger('http://localhost:9004')
>>> stpos.tag('Je suis enceinte'.split())
[(u'Je', u'CLS'), (u'suis', u'V'), (u'enceinte', u'NC')]
```

Arabic

In the terminal:

```shell
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip && cd stanford-corenlp-full-2016-10-31
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2016-10-31-models.jar
wget https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-serverProperties StanfordCoreNLP-arabic.properties \
-preload tokenize,ssplit,pos,parse \
-status_port 9005 -port 9005 -timeout 15000
```

In Python:

```python
>>> from nltk.tag.stanford import CoreNLPPOSTagger
>>> from nltk.tokenize.stanford import CoreNLPTokenizer
>>> sttok = CoreNLPTokenizer('http://localhost:9005')
>>> stpos = CoreNLPPOSTagger('http://localhost:9005')
>>> text = u'انا حامل'
>>> stpos.tag(sttok.tokenize(text))
[('انا', 'DET'), ('حامل', 'NC')]
```