So I wanted to post this here to verify whether I've actually made a working LSTM. I worked primarily from this guide.
I implemented Adam from the research paper to optimize it.
But I'm not sure I'm using a stochastic objective function, since I'm computing the gradient for each item in the training sequence. Would that just be a batch size of 1?
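For reference, here's the per-parameter update I understood from the Adam paper, as a minimal NumPy sketch (function and variable names are my own, not from the paper or my repo):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update per Kingma & Ba, Algorithm 1.
    m = b1 * m + (1 - b1) * grad          # biased first-moment estimate
    v = b2 * v + (1 - b2) * grad**2       # biased second-moment estimate
    m_hat = m / (1 - b1**t)               # bias-corrected first moment
    v_hat = v / (1 - b2**t)               # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

My understanding is that calling this once per training item is exactly stochastic gradient descent with a batch size of 1, but I'd appreciate confirmation.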
Here is the stack overflow post
script.py on the GitHub prints the average loss for each training epoch. The loss can be seen decreasing by about half, but it seems to get stuck there.
The loop continues until convergence, but not only do I not know how to test for convergence, the loss also stays pretty high, so I don't think it's converging anyway. The convergence analysis in the Adam paper went over my head.
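In case it's useful context for answers: the only convergence test I could think of is a simple heuristic on the epoch losses, nothing like the paper's analysis. The `tol` and `patience` values here are arbitrary guesses on my part:

```python
def has_converged(losses, tol=1e-4, patience=3):
    # Heuristic stopping rule: declare convergence when the average epoch
    # loss has improved by less than `tol` for `patience` epochs in a row.
    if len(losses) < patience + 1:
        return False
    recent = losses[-(patience + 1):]
    return all(prev - curr < tol for prev, curr in zip(recent, recent[1:]))
```

Is something like this reasonable, or is there a better-established criterion?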
lstm.py uses a slightly different notation system than the guide above. For feedforward in the cell state, the guide uses h(gate) but I use (gate)t (gate at timestep t).
ht_ and ct_ are the hidden state ht and the cell state c from the previous timestep. I know there is also a ct gate activation, but I was trying to make a distinction between the cell state gate activation and the fully activated cell state.
The notation in the cell backprop is similar, except that the guide writes dh(gate) for the gate activation and I just write d(gate); notice that dhc is replaced with dct while dc remains the same.
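To make the notation concrete, here's how I understand one forward timestep in my naming scheme. This is a minimal sketch, not the actual code from lstm.py; the dict-of-weights layout and shapes are just assumptions for the example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, ht_, ct_, W, U, b):
    # One LSTM timestep in the post's notation: ht_ and ct_ are the
    # previous hidden and cell states; ft, it, ot are gate activations;
    # ct is the cell-state gate activation (candidate), and c is the
    # fully activated (new) cell state.
    z = {g: W[g] @ x + U[g] @ ht_ + b[g] for g in ("f", "i", "o", "c")}
    ft = sigmoid(z["f"])     # forget gate
    it = sigmoid(z["i"])     # input gate
    ot = sigmoid(z["o"])     # output gate
    ct = np.tanh(z["c"])     # candidate cell state ("ct gate activation")
    c = ft * ct_ + it * ct   # new cell state
    ht = ot * np.tanh(c)     # new hidden state
    return ht, c
```

If this doesn't match the standard equations, that would explain a lot about my loss curve.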
I was hoping the community could help me verify that my model is a correct LSTM. I also don't know how to sample from the model to generate text.
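For the sampling part, this is the general loop I've gathered from reading around: feed the softmax of the output back in as the next input, one character at a time. The `step_fn(x, h, c) -> (logits, h, c)` interface here is a placeholder for whatever forward function lstm.py actually exposes, so treat this as a sketch:

```python
import numpy as np

def sample_text(step_fn, h, c, seed_idx, vocab_size, length, rng=None):
    # Sample `length` character indices from the model, starting from a
    # one-hot seed. `step_fn` is assumed to run one timestep and return
    # (logits, new_h, new_c) -- adapt to the real forward function.
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(vocab_size)
    x[seed_idx] = 1.0                      # one-hot encode the seed char
    out = [seed_idx]
    for _ in range(length):
        logits, h, c = step_fn(x, h, c)
        p = np.exp(logits - logits.max())  # numerically stable softmax
        p /= p.sum()
        idx = rng.choice(vocab_size, p=p)  # draw the next character index
        x = np.zeros(vocab_size)
        x[idx] = 1.0                       # feed the sample back in
        out.append(idx)
    return out
```

Is this roughly the right approach, or am I missing something about how sampling is supposed to work?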