torchtext is a great library: it puts a layer of abstraction over the usually very heavy data-loading component of NLP projects, making work with complex datasets a breeze. Sadly, since it is built on top of PyTorch, using it with Keras is not directly possible.

I wrote a little wrapper library called Keras ❤ torchtext (keras-loves-torchtext) to make torchtext work with Keras.

The approach may be considered a bit dirty and inefficient, as it requires converting torch tensors to numpy arrays, but the gain is a huge increase in productivity when working with NLP datasets in Keras.
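Conceptually, the wrapper does something like the following. This is a minimal sketch with a hypothetical `wrap_iterator` helper, not kltt's actual implementation: it pulls batches from a torchtext iterator, converts the requested fields from torch tensors to numpy arrays, and yields `(inputs, targets)` tuples in the format Keras generators expect.

```python
import numpy as np

def wrap_iterator(iterator, x_fields, y_fields):
    """Yield (inputs, targets) numpy batches from a torchtext-style iterator.

    Hypothetical sketch: each batch exposes the field names as attributes
    holding torch tensors, which we convert via .numpy().
    """
    for batch in iterator:
        xs = [getattr(batch, f).numpy() for f in x_fields]
        ys = [getattr(batch, f).numpy() for f in y_fields]
        # Unwrap single-element lists so Keras sees plain arrays
        yield (xs[0] if len(xs) == 1 else xs,
               ys[0] if len(ys) == 1 else ys)
```

A generator like this can be passed directly to `model.fit_generator`; the real library additionally handles things like axis permutation and length reporting.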

Ultimately, loading and processing a dataset like the IMDB movie review dataset can be as simple as the following:*

from keras.layers import *
from keras.models import *
from torchtext import data, datasets
from kltt import WrapIterator

# Fields define how raw text and labels are processed
text_field = data.Field(fix_length=100)  # pad/truncate every review to 100 tokens
label_field = data.Field(sequential=False, unk_token=None)

train_set, test_set = datasets.IMDB.splits(text_field, label_field)
train_it, test_it = data.BucketIterator.splits([train_set, test_set], [32] * 2, repeat=True)

text_field.build_vocab(train_set, max_size=10000)
label_field.build_vocab(train_set)

# Wrap the torchtext iterators so they yield numpy batches for Keras;
# permute swaps the (seq_len, batch) axes to (batch, seq_len)
train_data, test_data = WrapIterator.wraps([train_it, test_it], ['text'], ['label'],
                                           permute={'text': (1, 0)})

model = Sequential()
model.add(Embedding(len(text_field.vocab), 300))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

model.fit_generator(iter(train_data), steps_per_epoch=len(train_data), epochs=3)
loss, acc = model.evaluate_generator(iter(test_data), steps=len(test_data))

Have fun!

* Using Keras' own loading functions for the IMDB dataset is not much more complicated, but those functions rely on pre-processed data, whereas torchtext generalizes much better because it allows defining complex processing pipelines on raw text data.
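To illustrate that difference, here is a rough, self-contained sketch in plain Python (not torchtext's API) of the kind of raw-text pipeline a torchtext `Field` encapsulates: tokenization, vocabulary building, numericalization, and padding. Keras' `imdb.load_data` hands you the already-numericalized result, so none of these steps can be customized there.

```python
def tokenize(text):
    # Trivial whitespace tokenizer; torchtext lets you plug in any tokenizer
    return text.lower().split()

def build_vocab(texts):
    # Reserve index 0 for padding and 1 for unknown tokens
    vocab = {'<pad>': 0, '<unk>': 1}
    for t in texts:
        for tok in tokenize(t):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab, fix_length):
    # Map tokens to ids, then truncate/pad to a fixed length
    ids = [vocab.get(tok, vocab['<unk>']) for tok in tokenize(text)]
    ids = ids[:fix_length]
    return ids + [vocab['<pad>']] * (fix_length - len(ids))
```

Each of these stages is configurable per-`Field` in torchtext, which is exactly what makes it reusable across datasets that ship as raw text.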
