In His Shoes - Using Sequence Models To Generate Sentence Forecasts

Autocomplete features in messaging applications and search boxes on websites are common examples of systems that utilise sentence forecasting. The underlying assumption in these models is that it is possible to predict the next word in a sequence of words using some form of context such as the preceding sequence of words, browsing history, and time of day.

In this post I explore a simple approach to building a sentence forecasting model, using the preceding words as input data. I train recurrent neural networks (RNNs), specifically networks that use long short-term memory (LSTM) layers.

I take inspiration and code from a few blog posts from Jason Brownlee and Will Koehrsen, so credit to these guys.

I also do not go into the theory behind these models, but for more information on the fundamentals of neural networks, I encourage you to check out a previous blog post I have written. For some material on RNNs and LSTMs, I can recommend a couple of well-written posts here.

Data Collection

My initial aim for this project was to build models that could replicate text from a wide variety of content, so that I could compare the writing styles of different publications (BBC vs. Daily Mail), different books (Alice in Wonderland vs. The Doors of Perception), and Tweets from different people (Trump vs. Elon Musk). However, I was limited by the amount of time I could spend training models, so I focussed on using this open dataset of BBC articles.

The dataset provides article text split by section of the BBC website, and I thought these sections would give enough variation to expose the different themes the models learn.

Data Preparation

Tokenising and Word<>Index Mappings

To convert the raw text into a format I can train a model on, I first convert the raw text documents into arrays of integers using the Tokenizer object from the Keras module of TensorFlow. This does several useful things for us: 1. it performs some string processing (converting all text to lowercase and filtering out some non-alphanumeric characters), 2. it builds a word-to-index lookup, and 3. it applies that lookup to the string data. The result is a sequence of integers representing each input string, and with these we can begin to form a training / validation set.

    # Train Tokenizer and Apply
    from tensorflow.keras.preprocessing.text import Tokenizer
    
    # Train
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text)
    
    # Apply Tokenizer to Documents (convert words to numbers)
    sequences = tokenizer.texts_to_sequences(text)
    
    word_idx = tokenizer.word_index
    num_words = len(word_idx) + 1
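
As a quick sanity check (assuming text is the list of article strings loaded from the dataset), the resulting mappings can be inspected directly:

    # The most frequent words get the lowest indexes
    print(list(word_idx.items())[:5])
    
    # Each document is now a list of word indexes
    print(sequences[0][:10])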

Sliding Windows

At this point, I pass a sliding window across the sentences and extract sequences of words to generate training examples. The sliding window method, in this case, works by moving a fixed-length window of n words across the input text and extracting the first (n - 1) words as the features and the final, nth word as the label. In a document with N words, this produces a total of (N - n + 1) training samples.

Sliding window

    # Create Sliding Window
    sequence_len = 10
    
    features = []
    labels = []
    
    for sequence in sequences:
        for i in range(len(sequence) - sequence_len):
            window = sequence[i:sequence_len + i + 1]
            
            features.append(window[:-1])
            labels.append(window[-1])
    
    print(f'There are {len(features)} sequences.')
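
The training step further down expects X_train, y_train, X_test and y_test, so here is a minimal sketch of how these can be produced from the windows: the features become a matrix of word indexes, the labels are one-hot encoded (to match the categorical_crossentropy loss used later), and a validation set is held out. The 80/20 split and the use of scikit-learn's train_test_split are illustrative choices, not necessarily what I used.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.utils import to_categorical
    
    # Features: (num_samples, sequence_len) matrix of word indexes
    X = np.array(features)
    
    # Labels: one-hot vectors over the vocabulary, stored as int8 to save memory
    y = to_categorical(labels, num_classes=num_words).astype(np.int8)
    
    # Hold out a validation set (80/20 split chosen for illustration)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)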

Word Embeddings

When I build my model, I know that I want the input layer of the network to be an Embedding layer. Word embeddings are a way to represent words as vectors, where the relationships between words can, in some way, be described by the relationships between their vectors. Two commonly used sets of pre-trained word embeddings are GloVe and Word2Vec; these are typically trained on a large corpus of text (think Wikipedia or Common Crawl) and provide an n-dimensional vector for each word. Basic arithmetic operations between these vectors can describe real-world relationships between the corresponding words.

Word embeddings

To use the embeddings within the network, the vectors are reordered so that they align with the word indexes we will feed into the network. The resulting embedding matrix is used to initialise the Embedding layer.

    import numpy as np

    # Load the pre-trained GloVe vectors from disk (one word plus its vector per row)
    glove_vectors = np.loadtxt(glove_vector_filepath, dtype='str', comments=None)
    
    vectors = glove_vectors[:, 1:].astype('float')
    words = glove_vectors[:, 0]
    print('GloVe vectors loaded with dimension {}'.format(vectors.shape[1]))

    word_lookup = {word: vector for word, vector in zip(words, vectors)}

    # Create Empty Index 
    embedding_matrix = np.zeros((num_words, vectors.shape[1]))

    # Fill Index with Embeddings
    not_found = 0
    for word, i in word_idx.items():
        # Look up the word embedding
        vector = word_lookup.get(word, None)

        # Record in matrix (row i is the tokenizer's index for this word)
        if vector is not None:
            embedding_matrix[i, :] = vector
        else:
            not_found += 1

    print(f'There were {not_found} words without pre-trained embeddings.')

    # Normalize and convert nan to 0
    embedding_matrix = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1).reshape((-1, 1))

    embedding_matrix = np.nan_to_num(embedding_matrix)
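
To sanity-check the earlier claim about vector arithmetic, here is a small sketch using the word_lookup dictionary above; the classic king - man + woman ≈ queen example assumes those words appear in the GloVe file being used.

    # Cosine similarity between two embedding vectors
    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    # The combined vector should land close to the vector for 'queen'
    target = word_lookup['king'] - word_lookup['man'] + word_lookup['woman']
    print(cosine_sim(target, word_lookup['queen']))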

Model Design, Training and Evaluation

I build the network using the Keras Sequential API from TensorFlow 2. This lets me quickly initialise a network, add a variety of layers, define an optimiser, and train the model efficiently. When training the model, I take advantage of EarlyStopping, which automatically interrupts training if a monitored metric does not improve after a set number of epochs. I also use ModelCheckpoint, which saves the model during training so that progress is not lost if the kernel crashes.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
    from tensorflow.keras.models import load_model
    from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

    # Define the model
    model = Sequential()

    model.add(
        Embedding(
            input_dim=num_words,
            output_dim=embedding_matrix.shape[1],
            weights=[embedding_matrix],
            trainable=True,
        )
    )

    model.add(LSTM(256, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))
    model.add(Dropout(0.2))

    model.add(LSTM(256))
    model.add(Dropout(0.5))

    model.add(Dense(128, activation='relu'))

    model.add(Dense(num_words, activation='softmax')) # output layer

    # Compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    # Train the model
    callbacks = [
        EarlyStopping(monitor='val_accuracy', patience=25),
        ModelCheckpoint(model_filepath, save_best_only=True, save_weights_only=False, monitor='val_accuracy')
    ]

    history = model.fit(
        X_train, 
        y_train, 
        epochs=epochs, 
        batch_size=2048, 
        validation_data=(X_test, y_test), 
        verbose=1,
        callbacks=callbacks
    )

Example model summary

I train models using a cloud notebook instance in Google Colab, as this allows me to train on a cloud GPU, which decreases training time from ~120 seconds to ~15 seconds per epoch, an 8x improvement compared to training on my 2019 MacBook Pro 13" (1.4GHz quad-core i5, 8GB RAM).

When using Google Colab, I ran into the problem of running out of RAM, and I discovered this was caused by a few large variables that were no longer used and by large NumPy matrices stored as floats. The following code helped to solve this issue.

    # Convert labels from float to int8 to reduce object size
    import numpy as np
    from scipy.sparse import csr_matrix

    labels = csr_matrix(labels).astype(np.int8)
    labels = labels.toarray()


    # Delete unused variables
    import gc
    gc.enable()
    del labels
    gc.collect()

Hyperparameter Tuning: Stage 1

To find the best model for each dataset, I perform a series of training and marginal model-redesign steps, keeping the parameters that optimise validation accuracy. The results of this process, applied to the BBC Technology dataset, are summarised in the tables below; a rough sketch of the sweep loop follows the parameter list.

In this stage I change the following set of parameters:

  • Input sequence length
  • Number of LSTM layers
  • Bidirectional final LSTM layer
  • Trainable (updateable) embedding matrix
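
A rough sketch of how such a sweep can be organised is below. Here build_model is a hypothetical helper that assembles a network from a configuration dictionary along the lines of the Sequential definition above; changing the sequence length would also require regenerating the sliding windows.

    # Hypothetical sweep over a few hand-picked model configurations
    configs = [
        {'lstm_layers': 1, 'bidirectional': False, 'trainable_embeddings': False},
        {'lstm_layers': 1, 'bidirectional': True,  'trainable_embeddings': False},
        {'lstm_layers': 1, 'bidirectional': False, 'trainable_embeddings': True},
    ]
    
    results = []
    for config in configs:
        model = build_model(config)  # hypothetical helper, not shown
        history = model.fit(X_train, y_train, epochs=epochs, batch_size=2048,
                            validation_data=(X_test, y_test), verbose=0)
        results.append((config, max(history.history['val_accuracy'])))
    
    # Keep the configuration with the best validation accuracy
    best_config, best_val_acc = max(results, key=lambda r: r[1])
    print(best_config, best_val_acc)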

LSTM Layers   Bidirectional   Trainable Embeddings   Sequence Length   Validation Loss   Validation Accuracy
1             False           False                  50                6.4511            0.1130
2             False           False                  50                6.5666            0.1039
1             True            False                  50                6.3906            0.1154
2             True            False                  50                6.5250            0.1065
1             False           True                   50                6.6767            0.1770
1             True            True                   50                6.9115            0.1669
2             True            True                   20                6.8365            0.1451
1             False           True                   20                6.6482            0.1757

Hyperparameter Tuning: Stage 2

I noticed that the train accuracy was consistently higher than the validation accuracy (and the train loss lower than the validation loss), suggesting that the model was overfitting, so I tried altering the dropout within the LSTM layer, with the following results.

LSTM Layers   Bidirectional   Trainable Embeddings   Sequence Length   LSTM Dropout   LSTM Recurrent Dropout   Validation Loss   Validation Accuracy
1             False           True                   50                0.1            0.1                      6.5235            0.1715
1             False           True                   50                0.2            0.2                      6.4229            0.1706
1             False           True                   50                0.5            0.5                      6.2247            0.1638
1             False           True                   20                0.1            0.1                      6.4449            0.1725
1             False           True                   20                0.2            0.2                      6.2899            0.1706
1             False           True                   20                0.5            0.5                      6.1667            0.1719

Hyperparameter Tuning: Stage 3

I could not conclude that using dropout within the LSTM layer had any clear effect, so I decided to redesign the model with multiple LSTM layers and dropout layers between them. I also experimented with the number of cells within each LSTM layer. The following results keep the embedding layer trainable, and I reduce the sequence length to 10 words.

LSTM Layers   LSTM Cells per Layer   Dropout Rate(s)   Validation Loss   Validation Accuracy
1             64                     0.2               8.9918            0.2271
1             256                    0.2               10.7950           0.2354
1             256                    0.5               8.9682            0.2153
2             64                     0.2, 0.5          6.9549            0.1490
2             256                    0.2, 0.5          7.4581            0.1683
2             256                    0.5, 0.5          7.3286            0.1650

Results after 100 epochs

LSTM Layers   LSTM Cells per Layer   Dropout Rate(s)   Validation Loss   Validation Accuracy
1             64                     0.2               16.4351           0.3229
1             256                    0.2               22.5922           0.3373
1             256                    0.5               18.8511           0.3329
2             64                     0.2, 0.5          8.8127            0.2230
2             256                    0.2, 0.5          10.8240           0.2854
2             256                    0.5, 0.5          10.4388           0.2666

Results after 500 epochs

Training history

Final Model Results and Baseline Comparison

To get a baseline against which to benchmark the trained networks, I came up with a few simpler approaches to predicting the following word. One approach is to guess the next word at random; the others build a mapping from every n-gram to the most frequent word that follows it. The numbers in the table below are accuracy scores on a validation set, where the most-frequent-follower model is labelled 'Conditional Guess' and the LSTM column refers to the best model I found, trained to convergence.

Dataset          Random Guess   Conditional Guess (n=1)   Conditional Guess (n=2)   Conditional Guess (n=3)   LSTM
BBC Technology   0.01           0.10                      0.23                      0.29                      0.38
BBC Politics     0.001          0.09                      0.16                      0.17                      0.19
BBC Business     0.001          0.08                      0.11                      0.09                      0.15
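
For reference, a minimal sketch of the Conditional Guess baseline, assuming the tokenised sequences from earlier, might look like the following; the exact counting and evaluation details of my implementation may differ.

    from collections import Counter, defaultdict
    
    n = 2  # length of the conditioning n-gram
    
    # Count which word follows each n-gram in the training sequences
    follower_counts = defaultdict(Counter)
    for sequence in sequences:
        for i in range(len(sequence) - n):
            ngram = tuple(sequence[i:i + n])
            follower_counts[ngram][sequence[i + n]] += 1
    
    # Predict the most frequent follower of a given n-gram, or nothing if unseen
    def conditional_guess(ngram):
        if ngram in follower_counts:
            return follower_counts[ngram].most_common(1)[0][0]
        return None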

Conclusion

Interpreting the results, the models are generally able to correctly predict the following word between one fifth and one third of the time. For a model trained in a few hours on a fairly small dataset with limited tuning, this is already better than my own human-level performance.


Sample Output

The following quotes are some examples of input text fed into the model, sequences generated by the best-performing model, and the actual sequences that follow.
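
Continuations like these can be produced by repeatedly feeding the model its own predictions. A minimal sketch of such a generation loop, assuming the trained model, the fitted tokenizer, and the 10-word sequence_len from earlier, is shown below; I take the argmax at each step here, although sampling from the softmax distribution is also an option.

    import numpy as np
    
    # Reverse lookup: index -> word
    idx_word = {idx: word for word, idx in word_idx.items()}
    
    def generate(seed_text, num_new_words=20):
        # Start from the tokenised seed text
        sequence = tokenizer.texts_to_sequences([seed_text])[0]
        for _ in range(num_new_words):
            # Use the last sequence_len words as input, shaped (1, window_length)
            window = np.array(sequence[-sequence_len:]).reshape(1, -1)
            probs = model.predict(window, verbose=0)[0]
            next_idx = int(np.argmax(probs))  # greedy choice; sampling also works
            sequence.append(next_idx)
        return ' '.join(idx_word.get(idx, '') for idx in sequence)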

BBC Tech

and the current generation of mobiles using flash technology can store data for phones but some readers of the file sharing and both yahoo and bt in europe in the
and the current generation of mobiles using flash technology can
store up to one gigabyte of music enough for 250 songs we are working in the hard disk area and
cut between 6 000 and 7 000 jobs and reduce costs up from looking for to a international based development in california said iptv was scrapped in japan in 1976
cut between 6 000 and 7 000 jobs and reduce costs by 5bn £2 7bn a year analysts had warned recently that the airline might have to seek chapter 11

BBC Politics

the new year in the meantime we will be studying the announcement on his spending plans on the same after a meeting on labour's media media lord woolf for labour's
the new year in the meantime we will be studying
the judgment carefully to see whether it is possible to modify our legislation to address the concerns raised by the
executive faces more than 1 000 similar claims for damages from their final crisis with senior police officers and more with half their people across custody and then in many
executive faces more than 1 000 similar claims for damages law firm tods murray where he is a partner mr mcletchie said he has taken advice from holyrood officials about

BBC Business

retirement under 65 employers will no longer be able to advise for the companies said mr rubinsohn is due to delivering improvements and the us's biggest manufacturers could be above
retirement under 65 employers will no longer be able to
force workers to retire before 65 unless they can justify it the government has announced that firms will be barred
surging and there are many companies who will continue with forecasts by 90 of its known manufacturing making currently frozen goods the firm recently indicated that the merger for an
surging and there are many companies who will continue with
existing trading relationships christian aid has called on british firms not to simply cut and run but look after their