Introduction

You might have noticed that, despite the frequency with which we encounter sequential data in the real world, there isn’t a huge amount of content online showing how to build simple LSTMs from the ground up using the Pytorch functional API. Even the LSTM example on Pytorch’s official documentation only applies it to a natural language problem, which can be disorienting when trying to get these recurrent models working on time series data. In this article, we’ll set a solid foundation for constructing an end-to-end LSTM, from tensor input and output shapes to the LSTM itself. 

This article is structured with the goal of being able to implement any univariate time-series LSTM. We begin by examining the shortcomings of traditional neural networks for these tasks, and why an LSTM’s input is differently shaped to simple neural nets. We’ll then intuitively describe the mechanics that allow an LSTM to “remember.” With this approximate understanding, we can implement a Pytorch LSTM using a traditional model class structure inheriting from nn.Module, and write a forward method for it. We use this to see if we can get the LSTM to learn a simple sine wave. Finally, we attempt to write code to generalise how we might initialise an LSTM based on the problem at hand, and test it on our previous examples.

Simplest possible neural network

Let’s suppose we have the following time-series data. Rather than using complicated recurrent models, we’re going to treat the time series as a simple input-output function: the input is the time, and the output is the value of whatever dependent variable we’re measuring. This is essentially just simplifying a univariate time series.

You might be wondering whether there’s any difference between the problem we’ve outlined above and an actual sequential modelling approach to time series problems (as used in LSTMs). The difference is in the recurrency of the solution. Here, we’re simply passing in the current time step and hoping the network can output the function value. However, in recurrent neural networks, we not only pass in the current input, but also previous outputs. In this way, the network can learn dependencies between previous function values and the current one. Here, the network has no way of learning these dependencies, because we simply don’t input previous outputs into the model.

Let’s suppose that we’re trying to model the number of minutes Klay Thompson will play in his return from injury. Steve Kerr, the coach of the Golden State Warriors, doesn’t want Klay to come back and immediately play heavy minutes. Instead, he will start Klay with a few minutes per game, and ramp up the amount of time he’s allowed to play as the season goes on. We’re going to be Klay Thompson’s physio, and we need to predict how many minutes per game Klay will be playing in order to determine how much strapping to put on his knee. 

Thus, the number of games since returning from injury (representing the input time step) is the independent variable, and Klay Thompson’s number of minutes in the game is the dependent variable. Suppose we observe Klay for 11 games, recording his minutes per game in each outing to get the following data.

import numpy as np

X = [x for x in range(11)]
y = [1.6*x + 4 + np.random.normal(10, 1) for x in X]
X, y

([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [8.059610387807004,
  11.05288064074008,
  11.353963162111054,
  13.816355592580631,
  14.13887152857681,
  15.694474605527,
  15.684278885945714,
  15.532815595076784,
  18.247200671926283,
  20.520472619447048,
  20.127253834627])

Here, we’ve generated the minutes per game as a linear relationship with the number of games since returning. We’re going to use 9 samples for our training set, and 2 samples for validation.

X_train = X[:9]
y_train = y[:9]
X_val = X[9:]
y_val = y[9:]
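
Note that the later listings feed X_train, y_train, X_val and y_val straight into the model and the loss function, which expect float tensors of shape (n, 1) rather than Python lists. A minimal conversion step (not shown in the original listing) might look like this:

import torch

X_train = torch.tensor(X_train, dtype=torch.float32).unsqueeze(1) # (9, 1)
y_train = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1) # (9, 1)
X_val = torch.tensor(X_val, dtype=torch.float32).unsqueeze(1)     # (2, 1)
y_val = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)     # (2, 1)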

We know that the relationship between game number and minutes is linear. However, we’re still going to use a non-linear activation function, because that’s the whole point of a neural network. (Otherwise, this would just turn into linear regression: the composition of linear operations is just a linear operation.) As per usual, we use nn.Sequential to build our model with one hidden layer, with 13 hidden neurons.

import torch.nn as nn

seq_model = nn.Sequential(
    nn.Linear(1, 13),
    nn.Tanh(),
    nn.Linear(13, 1))
seq_model

>>> Sequential(
  (0): Linear(in_features=1, out_features=13, bias=True)
  (1): Tanh()
  (2): Linear(in_features=13, out_features=1, bias=True)
)

We now need to write a training loop, as we always do when using gradient descent and backpropagation to force a network to learn. To remind you, each training step has several key tasks:

  1. Compute the forward pass through the network by applying the model to the training examples.
  2. Calculate the loss based on the defined loss function, which compares the model output to the actual training labels.
  3. Backpropagate the derivative of the loss with respect to the model parameters through the network. This is done by calling .backward() on the loss, after setting the current parameter gradients to zero with .zero_grad().
  4. Update the model parameters by subtracting the gradient times the learning rate. This is done with our optimiser, using optimiser.step().

def training_loop(n_epochs, optimiser, model, loss_fn, X_train, X_val, y_train, y_val):
    for epoch in range(1, n_epochs + 1):
        output_train = model(X_train) # forwards pass
        loss_train = loss_fn(output_train, y_train) # calculate loss
        output_val = model(X_val) 
        loss_val = loss_fn(output_val, y_val)
        
        optimiser.zero_grad() # set gradients to zero
        loss_train.backward() # backwards pass
        optimiser.step() # update model parameters
        if epoch == 1 or epoch % 10000 == 0:
            print(f"Epoch {epoch}, Training loss {loss_train.item():.4f},"
                  f" Validation loss {loss_val.item():.4f}")

Now, all we need to do is instantiate the required objects, including our model, our optimiser, our loss function and the number of epochs we’re going to train for.

import torch.optim as optim

optimiser = optim.SGD(seq_model.parameters(), lr=1e-3)
training_loop(
    n_epochs = 500000, 
    optimiser = optimiser,
    model = seq_model,
    loss_fn = nn.MSELoss(),
    X_train = X_train,
    X_val = X_val, 
    y_train = y_train,
    y_val = y_val)
    
>>> Epoch 1, Training loss 422.8955, Validation loss 372.3910
    ...
    Epoch 500000, Training loss 0.0007, Validation loss 299.8014

As we can see, the model is likely overfitting significantly (which could be addressed with many techniques, such as regularisation, lowering the number of model parameters, or enforcing a linear model form): the training loss is essentially zero, while the validation loss remains huge. Due to the inherent random variation in our dependent variable, the minutes played taper off towards the last few games, leading the model to believe that the relationship more resembles a log curve than a straight line.
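
If you want to see this shape for yourself, here is a quick sketch of how you might plot the fitted curve against the observations (the plotting code is my own addition; it assumes matplotlib is available):

import matplotlib.pyplot as plt

t = torch.arange(0, 11, dtype=torch.float32).unsqueeze(1)  # all 11 game numbers as a (11, 1) tensor
plt.plot(X, y, 'o', label="observed minutes")
plt.plot(t.squeeze().numpy(), seq_model(t).detach().squeeze().numpy(), label="model fit")
plt.xlabel("games since return")
plt.ylabel("minutes played")
plt.legend()
plt.show()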

Although it wasn’t very successful, this initial neural network is a proof-of-concept that we can model sequential data with nothing more sophisticated than feeding in the time step as an input. However, without more information about the past, and without the ability to store and recall this information, model performance on sequential data will be extremely limited.

Intuition behind LSTMs

The simplest neural networks make the assumption that the relationship between the input and output is independent of previous output states. They assume that the function shape can be learnt from the input alone. In cases such as sequential data, this assumption is not true. The function value at any one particular time step can be thought of as directly influenced by the function value at past time steps; there is a temporal dependency between such values. Long short-term memory networks, or LSTMs, are a form of recurrent neural network that are excellent at learning such temporal dependencies.

The key to LSTMs is the cell state, which allows information to flow from one cell to another. This represents the LSTM’s memory, which can be updated, altered or forgotten over time. The components of the LSTM that do this updating are called gates, which regulate the information contained by the cell. Gates can be viewed as combinations of neural network layers and pointwise operations.

If you don’t already know how LSTMs work, the maths is straightforward and the fundamental LSTM equations are available in the Pytorch docs. There are many great resources online, such as this one. As a quick refresher, here are the four main steps each LSTM cell undertakes:

  1. Decide what information to remove from the cell state because it is no longer relevant. This is controlled by a neural network layer (with a sigmoid activation function) called the forget gate. We feed the previous cell’s output and the current input into the forget gate, which outputs a number between 0 and 1 for each state value, determining how much to forget. 
  2. Update the cell state with new information. An NN layer called the input gate takes the concatenation of the previous cell’s output and the current input and decides what to update. A tanh layer takes the same concatenation and creates a vector of new candidate values that could be added to the state. 
  3. Update the old cell state to create a new cell state. We multiply the old state by the forget values determined in Step 1, forgetting the things we decided to forget earlier. Then we add the new candidate values from Step 2, scaled by how much the input gate decided to update each state value. This new cell state is finished for this cell; we can pass it directly to the next cell in the model.
  4. Generate the model output based on the updated cell state, the previous output and the current input. We pass the concatenation of the previous output and the current input through a sigmoid layer called the output gate, then pointwise multiply the result with the tanh of the updated cell state. This gives the cell control over how much of the state appears in the output, which becomes both the model prediction and the hidden state passed onwards. (These four steps are written out as equations just below.)
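
For reference, these four steps correspond to the standard LSTM cell equations (the same form given in the Pytorch docs, up to how the bias terms are grouped), where \sigma is the sigmoid function and \odot denotes pointwise multiplication:

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)           (forget gate, Step 1)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)           (input gate, Step 2)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)    (candidate values, Step 2)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t     (new cell state, Step 3)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)           (output gate, Step 4)
h_t = o_t \odot \tanh(c_t)                          (new hidden state and output, Step 4)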

Note that we give the output twice in the diagram above. One of these outputs is to be stored as a model prediction, for plotting etc. The other is passed to the next LSTM cell, much as the updated cell state is passed to the next LSTM cell.

Pytorch LSTM

Our problem is to see if an LSTM can “learn” a sine wave. This is actually a relatively famous (read: infamous) example in the Pytorch community. It’s the only example on Pytorch’s Examples Github repository of an LSTM for a time-series problem. However, the example is old, and most people find that the code either doesn’t compile for them, or won’t converge to any sensible output. (A quick Google search gives a litany of Stack Overflow issues and questions just on this example.) Here, we’re going to break down and alter their code step by step.

Generating the data

We begin by generating a sample of 100 different sine waves, each with the same frequency and amplitude but beginning at slightly different points on the x-axis.

N = 100 # number of samples
L = 1000 # length of each sample (number of values for each sine wave)
T = 20 # width of the wave
x = np.empty((N,L), np.float32) # instantiate empty array
x[:] = np.arange(L) + np.random.randint(-4*T, 4*T, N).reshape(N,1)
y = np.sin(x/1.0/T).astype(np.float32)

Let’s walk through the code above. N is the number of samples; that is, we are generating 100 different sine waves. Many people intuitively trip up at this point. Since we are used to training a neural network on individual data points, such as the simple Klay Thompson example from above, it is tempting to think of N here as the number of points at which we measure the sine function. This is wrong; we are generating N different sine waves, each with a multitude of points. The LSTM network learns by examining not one sine wave, but many.

Next, we instantiate an empty array x. Think of this array as a sample of points along the x-axis. The array has 100 rows (representing the 100 different sine waves), and each row is 1000 elements long (representing L, or the granularity of the sine wave i.e. the number of distinct sampled points in each wave). We then fill x by filling each row with the first 1000 integers and adding a random integer offset in a range governed by T, where x[:] is just syntax to add the integers along rows. Note that we must reshape this second random integer to shape (N, 1) in order for Numpy to be able to broadcast it to each row of x. Finally, we simply apply the Numpy sine function to x, and let broadcasting apply the function to each sample in each row, creating one sine wave per row. We cast it to type float32. We can pick any individual sine wave and plot it using Matplotlib. Let’s pick the first sampled sine wave at index 0.
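
For example, a minimal plotting sketch (assuming matplotlib is available) might be:

import matplotlib.pyplot as plt

plt.plot(np.arange(L), y[0]) # the first of the N generated sine waves
plt.xlabel("time step")
plt.ylabel("y")
plt.show()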

To build the LSTM model, we only need one nn module for the recurrent part of the network: nn.LSTMCell. First, we’ll present the entire model class (inheriting from nn.Module, as always), and then walk through it piece by piece.

class LSTM(nn.Module):
    def __init__(self, hidden_layers=64):
        super(LSTM, self).__init__()
        self.hidden_layers = hidden_layers
        # lstm1, lstm2, linear are all layers in the network
        self.lstm1 = nn.LSTMCell(1, self.hidden_layers)
        self.lstm2 = nn.LSTMCell(self.hidden_layers, self.hidden_layers)
        self.linear = nn.Linear(self.hidden_layers, 1)
        
    def forward(self, y, future_preds=0):
        outputs, n_samples = [], y.size(0)
        h_t = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        c_t = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        h_t2 = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        c_t2 = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        
        for input_t in y.split(1, dim=1):
            # input_t has shape (N, 1): one time step for every sequence in the batch
            h_t, c_t = self.lstm1(input_t, (h_t, c_t)) # first layer's hidden and cell states
            h_t2, c_t2 = self.lstm2(h_t, (h_t2, c_t2)) # new hidden and cell states
            output = self.linear(h_t2) # output from the last FC layer
            outputs.append(output)
            
        for i in range(future_preds):
            # this only generates future predictions if we pass in future_preds>0
            # mirrors the code above, using last output/prediction as input
            h_t, c_t = self.lstm1(output, (h_t, c_t))
            h_t2, c_t2 = self.lstm2(h_t, (h_t2, c_t2))
            output = self.linear(h_t2)
            outputs.append(output)
        # transform list to tensor    
        outputs = torch.cat(outputs, dim=1)
        return outputs

Initialisation

The key step in the initialisation is the declaration of a Pytorch LSTMCell. You can find the documentation here. The cell has three main parameters:

  • input_size: the number of expected features in the input x.
  • hidden_size: the number of features in the hidden state h.
  • bias: this defaults to true, and in general we leave it that way.

Some of you may be aware of a separate torch.nn class called LSTM. The distinction between the two is not really relevant here, but just know that LSTMCell is more flexible when it comes to defining our own models from scratch using the functional API.

Keep in mind that the parameters of the LSTM cell are different from the inputs. The parameters here largely govern the shape of the expected inputs, so that Pytorch can set up the appropriate structure. The inputs are the actual training examples or prediction examples we feed into the cell. We define two LSTM layers using two LSTM cells. Much like a convolutional neural network, the key to setting up input and hidden sizes lies in the way the two layers connect to each other. For the first LSTM cell, we pass in an input of size 1. Recall why this is so: in an LSTM, we don’t need to pass in a sliced array of inputs. We don’t need a sliding window over the data, as the memory and forget gates take care of the cell state for us. We don’t need to specifically hand feed the model with old data each time, because of the model’s ability to recall this information. This is what makes LSTMs so special.

We then give this first LSTM cell a hidden size governed by the variable hidden_layers, which we set when we declare our class. This number is rather arbitrary; here, we pick 64. As mentioned above, this becomes an output of sorts which we pass to the next LSTM cell, much like in a CNN: the output size of the last step becomes the input size of the next step. In this cell, we thus have an input of size hidden_size, and also a hidden layer of size hidden_size. We then pass this output of size hidden_size to a linear layer, which itself outputs a scalar of size one. We are outputting a scalar, because we are simply trying to predict the function value y at that particular time step.
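
To make that chain of sizes concrete, here is a small illustrative trace for a single time step (the batch size of 8 is a toy number of my own, not from the article):

cell1 = nn.LSTMCell(1, 64)    # input_size=1, hidden_size=64
cell2 = nn.LSTMCell(64, 64)   # takes the first cell's hidden state as its input
linear = nn.Linear(64, 1)     # maps the final hidden state to a scalar prediction
x_t = torch.randn(8, 1)       # one time step for a batch of 8 sequences
h1, c1 = torch.zeros(8, 64), torch.zeros(8, 64)
h2, c2 = torch.zeros(8, 64), torch.zeros(8, 64)
h1, c1 = cell1(x_t, (h1, c1)) # h1 has shape (8, 64)
h2, c2 = cell2(h1, (h2, c2))  # h2 has shape (8, 64)
print(linear(h2).shape)       # torch.Size([8, 1])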

One of the most important things to keep in mind at this stage of constructing the model is the input and output size: what am I mapping from and to?

Forward method

In the forward method, once the individual layers of the LSTM have been instantiated with the correct sizes, we can begin to focus on the actual inputs moving through the network. An LSTM cell takes the following inputs: input, (h_0, c_0).

  • input: a tensor of inputs of shape (batch, input_size), where we declared input_size in the creation of the LSTM cell.
  • h_0: a tensor containing the initial hidden state for each element in the batch, of shape (batch, hidden_size).
  • c_0: a tensor containing the initial cell state for each element in the batch, of shape (batch, hidden_size).

To link the two LSTM cells (and the second LSTM cell with the linear, fully-connected layer), we also need to know what an LSTM cell actually outputs: a tuple of two tensors, (h_1, c_1).

  • h_1: a tensor containing the next hidden state for each element in the batch, of shape (batch, hidden_size).
  • c_1: a tensor containing the next cell state for each element in the batch, of shape (batch, hidden_size).

Here, our batch size is given by the first dimension of our input: 100 curves in the full dataset, or 97 once we hold out three for testing; hence, we take n_samples = y.size(0). Since we know the shapes of the hidden and cell states are both (batch, hidden_size), we can instantiate a tensor of zeros of this size, and do so for both of our LSTM cells.

The next step is arguably the most difficult. We must feed in an appropriately shaped tensor: here, that is a tensor of m points per sequence, where m is our training size on each sequence. We use the Pytorch split() method (documentation here) with split_size_or_sections=1, which splits the tensor into chunks of size 1 along the chosen dimension. We want to feed the model one time step at a time for the whole batch, so we split along the time dimension, i.e. the columns, which is dimension 1.

It’s always a good idea to check the output shape when we’re vectorising an array in this way. Suppose we choose three sine curves for the test set, and use the rest for training. We can check what our training input will look like in our split method:

a = torch.from_numpy(y[3:, :-1])
b = a.split(1, dim=1)
b[0].shape

>>> torch.Size([97, 1])

So, at each time step, we’re passing in a batch of 97 samples, each with a single feature; the trailing dimension of size 1 represents that single input feature. (Pytorch usually operates in this way. Even if we’re passing in a single image to the world’s simplest CNN, Pytorch expects a batch of images, and so we have to use unsqueeze().) We then output a new hidden and cell state. As we know from above, the hidden state output is used as input to the next LSTM cell. The hidden state output from the second cell is then passed to the linear layer.

Great - we’ve completed our model predictions based on the actual points we have data for. But the whole point of an LSTM is to predict the future shape of the curve, based on past outputs. So, in the next stage of the forward pass, we’re going to predict the next future time steps. Recall that in the previous loop, we calculated the output to append to our outputs array by passing the second LSTM output through a linear layer. This variable is still in operation - we can access it and pass it to our model again. This is good news, as we can predict the next time step in the future, one time step after the last point we have data for. The model takes its prediction for this final data point as input, and predicts the next data point.

output = self.linear(h_t2)
....
for i in range(future_preds):
    h_t, c_t = self.lstm1(output, (h_t, c_t))

We then do this again, with the prediction now being fed as input to the model. In total, we do this future_preds number of times, to produce a curve of length future_preds, in addition to the 999 predictions we’ve already made on the points we actually have data for.

The last thing we do is concatenate the array of scalar tensors representing our outputs, before returning them. That’s it! We’ve built an LSTM which takes in a certain number of inputs, and, one by one, predicts a certain number of time steps into the future.
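
As a quick sanity check (my own addition, not part of the original example), we can push a small random batch through an untrained instance and confirm that the output length equals the input length plus future_preds:

check_model = LSTM()
dummy = torch.randn(5, 20)               # 5 sequences, 20 time steps each
out = check_model(dummy, future_preds=3)
print(out.shape)                         # torch.Size([5, 23]): 20 outputs plus 3 future predictions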

Training the model

Defining a training loop in Pytorch is quite homogeneous across a variety of common applications. However, in our case, we can’t really gain an intuitive understanding of how the model is converging by examining the loss alone. Yes, a low loss is good, but there have been plenty of times when I’ve gone to look at the model outputs after achieving a low loss and seen absolute garbage predictions. This is usually due to a mistake in my plotting code, or even more likely a mistake in my model declaration. Thus, the most useful tool we can apply to model assessment and debugging is plotting the model predictions at each training step to see if they improve.

Our first step is to figure out the shape of our inputs and our targets. We know that our data y has the shape (100, 1000). That is, 100 different sine curves of 1000 points each. Next, we want to figure out what our train-test split is. We’ll save 3 curves for the test set, and so indexing along the first dimension of y we can use the last 97 curves for the training set.

Now comes time to think about our model input. One at a time, we want to input the last time step and get a new time step prediction out. To do this, we input the first 999 samples from each sine wave, because inputting the full 1000 would lead to predicting the 1001st time step, which we can’t validate because we don’t have data for it. Similarly, for the training target, we use the same 97 sine waves, but start at the 2nd sample in each wave, taking the last 999 samples; this is because we need a previous time step to actually input to the model - we can’t input nothing. Hence, the starting index for the target in the second dimension (representing the samples in each wave) is 1. This gives us two arrays of shape (97, 999).

# y = (100, 1000)
train_input = torch.from_numpy(y[3:, :-1]) # (97, 999)
train_target = torch.from_numpy(y[3:, 1:]) # (97, 999)

The test input and test target follow very similar reasoning, except this time, we index only the first three sine waves along the first dimension. Everything else is exactly the same, as we would expect: apart from the batch input size (97 vs 3) we need to have the same input and outputs for train and test sets.

test_input = torch.from_numpy(y[:3, :-1]) # (3, 999)
test_target = torch.from_numpy(y[:3, 1:]) # (3, 999)

We now need to instantiate the main components of our training loop: the model itself, the loss function, and the optimiser. The model is simply an instance of our LSTM class, and the loss function we will use for what amounts to a regression problem is nn.MSELoss(). The only thing different to normal here is our optimiser. Instead of Adam, we will use what is called a limited-memory BFGS algorithm, which essentially boils down to estimating an inverse of the Hessian matrix as a guide through the variable space. You don’t need to worry about the specifics, but you do need to worry about the difference between optim.LBFGS and other optimisers. We’ll cover that in the training loop below.

model = LSTM()
criterion = nn.MSELoss()
optimiser = optim.LBFGS(model.parameters(), lr=0.08)

You might be wondering why we’re bothering to switch from a standard optimiser like Adam to this relatively unknown algorithm. An LBFGS solver is a quasi-Newton method which uses the inverse of the Hessian to estimate the curvature of the parameter space. In sequential problems, the parameter space is characterised by an abundance of long, flat valleys, which means that the LBFGS algorithm often outperforms other methods such as Adam, particularly when there is not a huge amount of data.

Finally, we get around to constructing the training loop. Fair warning, as much as I’ll try to make this look like a typical Pytorch training loop, there will be some differences. These are mainly in the function we have to pass to the optimiser, closure, which represents the typical forward and backward pass through the network. We update the weights with optimiser.step() by passing in this function. According to Pytorch, the function closure is a callable that reevaluates the model (forward pass), and returns the loss. So this is exactly what we do.

def training_loop(n_epochs, model, optimiser, loss_fn, 
                  train_input, train_target, test_input, test_target):
    for i in range(n_epochs):
        def closure():
            optimiser.zero_grad()
            out = model(train_input)
            loss = loss_fn(out, train_target)
            loss.backward()
            return loss
        optimiser.step(closure)
        with torch.no_grad():
            future = 1000
            pred = model(test_input, future_preds=future)
            # use all pred samples, but only go to 999
            loss = loss_fn(pred[:, :-future], test_target)
            y = pred.detach().numpy()
        # draw figures
        plt.figure(figsize=(12,6))
        plt.title(f"Step {i+1}")
        plt.xlabel("x")
        plt.ylabel("y")
        plt.xticks(fontsize=20)
        plt.yticks(fontsize=20)
        n = train_input.shape[1] # 999
        def draw(yi, colour):
            plt.plot(np.arange(n), yi[:n], colour, linewidth=2.0)
            plt.plot(np.arange(n, n+future), yi[n:], colour+":", linewidth=2.0)
        draw(y[0], 'r')
        draw(y[1], 'b')
        draw(y[2], 'g')
        plt.savefig("predict%d.png"%i, dpi=200)
        plt.close()
        # print the loss
        out = model(train_input)
        loss_print = loss_fn(out, train_target)
        print("Step: {}, Loss: {}".format(i, loss_print))

The training loop starts out much as other garden-variety training loops do. However, notice that the typical steps of forward and backwards pass are captured in the function closure. This is just an idiosyncrasy of how the optimiser function is designed in Pytorch. We return the loss in closure, and then pass this function to the optimiser during optimiser.step(). And that’s pretty much it for the training step.

Next, we want to plot some predictions, so we can sanity-check our results as we go. To do this, we need to take the test input, and pass it through the model. This is where the future_preds parameter we included in the model itself is going to come in handy. Recall that passing some non-negative integer future_preds to the forward pass through the model will give us future predictions after the last output from the actual samples. This allows us to see whether the model generalises into future time steps. We then detach this output from the current computational graph and store it as a numpy array.

Finally, we write some simple code to plot the model’s predictions on the test set at each epoch. There are only three test sine curves, so we only need to call our draw function three times (we’ll draw each curve in a different colour). The dotted lines indicate future predictions, and the solid lines indicate predictions within the current range of the data.

The predictions clearly improve over time, as well as the loss going down. Our model works: by the 8th epoch, the model has learnt the sine wave. However, if you keep training the model, you might see the predictions start to do something funny. This is because, at each time step, the LSTM relies on outputs from the previous time step. If the prediction changes slightly for the 1001st prediction, this will perturb the predictions all the way up to prediction 2000, resulting in a nonsensical curve. There are many ways to counter this, but they are beyond the scope of this article. The best strategy right now would be to watch the plots to see if this error accumulation starts happening. Then, you can either go back to an earlier epoch, or train past it and see what happens.

If you’re having trouble getting your LSTM to converge, here’s a few things you can try:

  • Lower the number of model parameters (maybe even down to 15) by changing the size of the hidden layer. This reduces the model search space.
  • Try downsampling from the first LSTM cell to the second by reducing the hidden_size passed to the second cell. You could also do this from the second LSTM cell to the linear fully-connected layer.
  • Add batch normalisation, which standardises the activations passed between layers; this tends to stabilise training and give the loss a smoother topography.
  • Add dropout, which zeros out a random fraction of neuronal outputs across the whole model at each epoch. This generates slightly different models each time, meaning the model is forced to rely on individual neurons less.

If you implement the last two strategies, remember to call model.train() to instantiate the regularisation during training, and turn off the regularisation during prediction and evaluation using model.eval().
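
For instance, here is a minimal sketch of the dropout suggestion, applying dropout to the second cell’s hidden state just before the linear layer (the class name, dropout rate and placement are my own choices, not from the original):

class LSTMWithDropout(nn.Module):
    def __init__(self, hidden_layers=64, p=0.2):
        super().__init__()
        self.hidden_layers = hidden_layers
        self.lstm1 = nn.LSTMCell(1, hidden_layers)
        self.lstm2 = nn.LSTMCell(hidden_layers, hidden_layers)
        self.dropout = nn.Dropout(p) # zeroes a random fraction of h_t2 while training
        self.linear = nn.Linear(hidden_layers, 1)

    def forward(self, y, future_preds=0):
        outputs, n_samples = [], y.size(0)
        h_t = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        c_t = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        h_t2 = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        c_t2 = torch.zeros(n_samples, self.hidden_layers, dtype=torch.float32)
        for input_t in y.split(1, dim=1):
            h_t, c_t = self.lstm1(input_t, (h_t, c_t))
            h_t2, c_t2 = self.lstm2(h_t, (h_t2, c_t2))
            output = self.linear(self.dropout(h_t2)) # dropout applied before the output layer
            outputs.append(output)
        for i in range(future_preds):
            h_t, c_t = self.lstm1(output, (h_t, c_t))
            h_t2, c_t2 = self.lstm2(h_t, (h_t2, c_t2))
            output = self.linear(self.dropout(h_t2))
            outputs.append(output)
        return torch.cat(outputs, dim=1)

model = LSTMWithDropout()
model.train() # dropout active during training
model.eval()  # dropout disabled for prediction and evaluation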

Automating model construction

This whole exercise is pointless if we still can’t apply an LSTM to other shapes of input. Let’s generate some new data, except this time, we’ll randomly generate the number of curves and the samples in each curve. We won’t know what the actual values of these parameters are, and so this is a perfect way to see if we can construct an LSTM based on the relationships between input and output shapes.

N = np.random.randint(50, 200) # number of samples
L = np.random.randint(800, 1200) # length of each sample (number of values for each sine wave)
T = np.random.randint(10, 30) # width of the wave
x = np.empty((N,L), np.float32) # instantiate empty array
x[:] = np.arange(L) + np.random.randint(-4*T, 4*T, N).reshape(N,1)
y = np.cos(np.sin(x/1.0/T)**2).astype(np.float32)

We can then derive the input and output shapes by determining the percentage of samples (i.e. curves) we’d like to use for the training set.

train_prop = 0.95
train_samples = round(N * train_prop) 
test_samples = N - train_samples

The input and output shapes thus become:

# y = (N, L)
train_input = torch.from_numpy(y[test_samples:, :-1]) # (train_samples, L-1)
train_target = torch.from_numpy(y[test_samples:, 1:]) # (train_samples, L-1)
test_input = torch.from_numpy(y[:test_samples, :-1]) # (test_samples, L-1)
test_target = torch.from_numpy(y[:test_samples, 1:]) # (test_samples, L-1)

You can verify that this works by running these inputs and targets through the LSTM (hint: make sure you instantiate a variable for future_preds based on the length of the input).
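
For example (the choice of prediction horizon here is mine, not the article’s), you might set future_preds from the sequence length and run the test batch through the model:

future_preds = L - 1                                # predict as far ahead again as the data we pass in
pred = model(test_input, future_preds=future_preds)
print(pred.shape)                                   # (test_samples, (L - 1) + future_preds)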

Let’s see if we can apply this to the original Klay Thompson example. We need to generate more than one set of minutes if we’re going to feed it to our LSTM. That is, we’re going to generate 100 different hypothetical sets of minutes that Klay Thompson played in 100 different hypothetical worlds. We’ll feed 95 of these in for training, and plot three of the remaining five to see how our model is learning.

N = 100 # number of theoretical series of games
L = 11 # number of games in each series
x = np.empty((N,L), np.float32) # instantiate empty array
x[:] = np.arange(L)
y = (1.6*x + 4).astype(np.float32)

# add some noise
for i in range(len(y)):
    y[i] += np.random.normal(10, 1)
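
Reusing the splitting code from the previous section with train_prop = 0.95 (so 95 training series and 5 test series, of which we plot three), the reshaped inputs and targets would be:

train_prop = 0.95
train_samples = round(N * train_prop)                  # 95
test_samples = N - train_samples                       # 5
train_input = torch.from_numpy(y[test_samples:, :-1])  # (95, 10)
train_target = torch.from_numpy(y[test_samples:, 1:])  # (95, 10)
test_input = torch.from_numpy(y[:test_samples, :-1])   # (5, 10)
test_target = torch.from_numpy(y[:test_samples, 1:])   # (5, 10)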

After using the code above to reshape the inputs and outputs based on L and N, we run the model and achieve the following:

training_loop(n_epochs = 10,
              model = model,
              optimiser = optimiser,
              loss_fn = criterion,
              train_input = train_input,
              train_target = train_target,
              test_input = test_input,
              test_target = test_target)

Step: 0, Loss: 7.279130458831787
Step: 1, Loss: 1.4280033111572266
Step: 2, Loss: 0.5719900727272034
Step: 3, Loss: 0.31642651557922363
Step: 4, Loss: 0.20579740405082703
Step: 5, Loss: 0.15880005061626434
Step: 6, Loss: 0.1370033472776413
Step: 7, Loss: 0.11637765169143677
Step: 8, Loss: 0.09049651771783829
Step: 9, Loss: 0.06212589889764786

This gives us the following images (we only show the first and last):

Very interesting! Initially, the LSTM also thinks the curve is logarithmic. Whilst it figures out that the curve is linear on the first 11 games after a bit of training, it insists on providing a logarithmic curve for future games. What is so fascinating about that is that the LSTM is right - Klay can’t keep linearly increasing his game time, as a basketball game only goes for 48 minutes, and most processes such as this are logarithmic anyway. Obviously, there’s no way that the LSTM could know this, but regardless, it’s interesting to see how the model ends up interpreting our toy data. A future task could be to play around with the hyperparameters of the LSTM to see if it is possible to make it learn a linear function for future time steps as well. Additionally, I like to create a Python class to store all these functions in one spot. Then, you can create an object with the data, and you can write functions which read the shape of the data, and feed it to the appropriate LSTM constructors.
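
As one hedged sketch of such a wrapper (the class name and defaults are my own), reusing the LSTM class and training_loop defined earlier:

class TimeSeriesLSTMRunner:
    """Reads the shape of the data and wires up the model, optimiser and loss accordingly."""
    def __init__(self, y, train_prop=0.95, hidden_layers=64, lr=0.08):
        N, L = y.shape                              # infer the number of curves and their length
        test_samples = N - round(N * train_prop)
        self.train_input = torch.from_numpy(y[test_samples:, :-1])
        self.train_target = torch.from_numpy(y[test_samples:, 1:])
        self.test_input = torch.from_numpy(y[:test_samples, :-1])
        self.test_target = torch.from_numpy(y[:test_samples, 1:])
        self.model = LSTM(hidden_layers=hidden_layers)
        self.optimiser = optim.LBFGS(self.model.parameters(), lr=lr)
        self.loss_fn = nn.MSELoss()

    def fit(self, n_epochs=10):
        training_loop(n_epochs=n_epochs, model=self.model, optimiser=self.optimiser,
                      loss_fn=self.loss_fn,
                      train_input=self.train_input, train_target=self.train_target,
                      test_input=self.test_input, test_target=self.test_target)

# usage: TimeSeriesLSTMRunner(y).fit(n_epochs=10)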

Conclusion

In summary, creating an LSTM for univariate time series data in Pytorch doesn’t need to be overly complicated. However, the lack of available resources online (particularly resources that don’t focus on natural language forms of sequential data) makes it difficult to learn how to construct such recurrent models. Hopefully, this article provided guidance on setting up your inputs and targets, writing a Pytorch class for the LSTM forward method, defining a training loop with the quirks of our new optimiser, and debugging using visual tools such as plotting.

If you would like to learn more about the maths behind the LSTM cell, I highly recommend this article which sets out the fundamental equations of LSTMs beautifully (I have no connection to the author). I also recommend attempting to adapt the above code to multivariate time-series. All the core ideas are the same - you just need to think about how you might expand the dimensionality of the input.