A natural language processing library built on PyTorch and TorchText!

High-Level Workflow

Define the NLP task

Extend the Model class and implement the forward() and loss() methods, returning predictions and the loss respectively (see the sketch after this workflow)

Use the HParams class to easily define hyperparameters

Define a data function that uses the TorchText API to return dataset iterators, vocabularies, etc. See conll.py for an example

Set up the Evaluator and Trainer classes with the model, dataset iterators, and metrics. See ner.py for details

Run the trainer for the desired number of epochs, along with early stopping criteria

Use the evaluator to evaluate the trained model on a specific dataset split

Run inference on the trained model
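As a rough sketch of the first three steps (the import paths, the Model constructor, and the exact forward()/loss() argument lists are assumptions; see the taggers in the repository for the real interfaces):

import torch.nn as nn
import torch.nn.functional as F

# Import paths below are assumptions; check the repository layout
from torchnlp.common.hparams import HParams
from torchnlp.common.model import Model

class SimpleTagger(Model):
    # Hypothetical tagger used only to illustrate the workflow;
    # the base-class constructor signature is an assumption
    def __init__(self, hparams, vocab_size, num_tags):
        super().__init__(hparams)
        self.embed = nn.Embedding(vocab_size, hparams.hidden_size)
        self.proj = nn.Linear(hparams.hidden_size, num_tags)

    def forward(self, inputs):
        # Return per-token tag scores (the predictions)
        return self.proj(self.embed(inputs))

    def loss(self, inputs, targets):
        # Return the training loss for a batch
        logits = self.forward(inputs)
        return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

# Hyperparameters are defined as simple keyword arguments (assumed HParams usage)
hparams = HParams(hidden_size=128, dropout=0.2, batch_size=100)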

Components

Model: Handles loading and saving of models, along with the associated hyperparameters;

HParams: Defines hyperparameters;

Trainer: Trains a given model on a dataset. Supports features such as predefined learning rate decay and early stopping (a sketch follows this list);

Evaluator: Evaluates a trained model;

get_input_processor_words: Use during inference to quickly convert input strings into a format the model can process.
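Conceptually the pieces wire together as sketched below. Every import path and constructor signature here is an assumption made purely for illustration; ner.py shows how the real NER task sets this up.

# All paths and signatures below are assumptions; see ner.py for the actual wiring
from torchnlp.common.train import Trainer          # assumed import path
from torchnlp.common.evaluation import Evaluator   # assumed import path

# model is a Model subclass (e.g. the SimpleTagger sketched above);
# train_iter / val_iter are the TorchText iterators returned by the data function
model = SimpleTagger(hparams, vocab_size=10000, num_tags=9)     # illustrative sizes
evaluator = Evaluator(model, val_iter, metrics=['acc', 'f1'])   # assumed signature
trainer = Trainer(model, train_iter, evaluator=evaluator)       # assumed signature
trainer.train(num_epochs=100)                                   # assumed signature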

支持的模型

transformer.Encoder, transformer.Decoder: Transformer network implementation from Attention Is All You Need

CRF: Conditional Random Field layer, which can be used as the final output layer

TransformerTagger: Sequence tagging model implemented using the Transformer network and CRF

BiLSTMTagger: Sequence tagging model implemented using bidirectional LSTMs and CRF

Installation

TorchNLP requires a minimum of Python 3.5 and PyTorch 0.4.0 to run. Check the PyTorch site for installation steps. Clone this repository and install other dependencies like TorchText:

pip install -r requirements.txt

Go to the root of the project and check for integrity with PyTest:

pytest

Install this project:

python setup.py install

Usage

NER Task

The NER task can be run on any dataset that conforms to the CoNLL 2003 format. To use the CoNLL 2003 NER dataset, place the dataset files in the following directory structure within your workspace root:

.data
  |
  |---conll2003
        |
        |---eng.train.txt
        |---eng.testa.txt
        |---eng.testb.txt

eng.testa.txt is used as the validation dataset and eng.testb.txt is used as the test dataset.
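Each file is expected to be in the standard CoNLL 2003 column format: one token per line followed by its POS tag, chunk tag, and NER tag, with a blank line between sentences. For example:

U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
.         .    O     O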

Start the NER module in the python shell which sets up the imports:

python -i -m torchnlp.ner

Task: Named Entity Recognition

Available models:

TransformerTagger

Sequence tagger using the Transformer network (https://arxiv.org/pdf/1706.03762.pdf)
Specifically it uses the Encoder module. For character embeddings (per word) it uses
the same Encoder module above which an additive (Bahdanau) self-attention layer is added

BiLSTMTagger

Sequence tagger using bidirectional LSTM. For character embeddings per word
uses (unidirectional) LSTM

Available datasets:

conll2003: Conll 2003 (Parser only. You must place the files)

Train the Transformer model on the CoNLL 2003 dataset:

train('ner-conll2003', TransformerTagger, conll2003)

The first argument is the task name; the same task name must be used during evaluation and inference. By default the train function uses the F1 metric with a window of 5 epochs to perform early stopping. To change the early stopping criterion, set the PREFS global variable as follows:

PREFS.early_stopping='lowest_3_loss'

This will now use validation loss as the stopping criterion, with a window of 3 epochs. Model files are saved under a taskname-modelname directory; in this case it is ner-conll2003-TransformerTagger.

Evaluate the trained model on the testb dataset split:

evaluate('ner-conll2003', TransformerTagger, conll2003, 'test')

It will display metrics like accuracy, sequence accuracy, F1, etc.

Run the trained model interactively for the ner task:

interactive('ner-conll2003', TransformerTagger)

Ctrl+C to quit
Tom went to New York
I-PER O O I-LOC I-LOC

You can similarly train the bidirectional LSTM CRF model by using the BiLSTMTagger class. Customizing hyperparameters is quite straightforward. Let's look at the hyperparameters for TransformerTagger:

h2 = hparams_transformer_ner()
h2

Hyperparameters:
filter_size=128
optimizer_adam_beta2=0.98
learning_rate=0.2
learning_rate_warmup_steps=500
input_dropout=0.2
embedding_size_char=16
dropout=0.2
hidden_size=128
optimizer_adam_beta1=0.9
embedding_size_word=300
max_length=256
attention_dropout=0.2
relu_dropout=0.2
batch_size=100
num_hidden_layers=1
attention_value_channels=0
attention_key_channels=0
use_crf=True
embedding_size_tags=100
learning_rate_decay=noam_step
embedding_size_char_per_word=100
num_heads=4
filter_size_char=64

Now let’s disable the CRF layer:

h2.update(use_crf=False)

Hyperparameters:
filter_size=128
optimizer_adam_beta2=0.98
learning_rate=0.2
learning_rate_warmup_steps=500
input_dropout=0.2
embedding_size_char=16
dropout=0.2
hidden_size=128
optimizer_adam_beta1=0.9
embedding_size_word=300
max_length=256
attention_dropout=0.2
relu_dropout=0.2
batch_size=100
num_hidden_layers=1
attention_value_channels=0
attention_key_channels=0
use_crf=False
embedding_size_tags=100
learning_rate_decay=noam_step
embedding_size_char_per_word=100
num_heads=4
filter_size_char=64

Use it to re-train the model:

train('ner-conll2003-nocrf', TransformerTagger, conll2003, hparams=h2)

The hyperparameters are saved along with the model, so there is no need to pass the HParams object during evaluation. Also note that by default it will not overwrite any existing model directories (it renames them instead). To change that behavior, set the PREFS variable:

PREFS.overwrite_model_dir = True

The PREFS variable is automatically persisted in prefs.json

Chunking Task

The CoNLL 2000 dataset is available for the chunking task. It is downloaded automatically from the public repository, so there is no need to download it manually.

Start the Chunking task:

python -i -m torchnlp.chunk

Train the Transformer model:

train('chunk-conll2000', TransformerTagger, conll2000)

No validation partition is provided in the repository, so 10% of the training set is used for validation.

Evaluate the model on the test set:

evaluate('chunk-conll2000', TransformerTagger, conll2000, 'test')

Standalone Usage

The transformer.Encoder, transformer.Decoder, and CRF modules can be imported independently, as they only depend on PyTorch:

from torchnlp.modules.transformer import Encoder
from torchnlp.modules.transformer import Decoder
from torchnlp.modules.crf import CRF
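For example, the CRF layer can be dropped into an ordinary PyTorch model as the output layer. The constructor argument below (the number of output tags) is an assumption; check the module's docstring for the exact signature:

from torchnlp.modules.crf import CRF

# Assumed constructor: takes the number of output tags
# (check torchnlp/modules/crf.py for the real signature)
crf = CRF(num_tags=9)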
