So I wanted to post this here to verify whether I’ve actually made a correct LSTM. I worked primarily from this guide.

I implemented ADAM from the research paper to optimize it.

But I’m not sure I’m using a stochastic objective function as I’m computing the gradient for each item in the sequence. Would that just be a batch size of 1?
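
On the batch-size question: as far as I can tell, yes. If each update uses the gradient from a single item (or a single sequence), that is a mini-batch of size 1, and the objective is still stochastic because every step sees a different sample. Below is a minimal sketch of one Adam step following the update rule and default hyperparameters from the Adam paper. The function name `adam_step`, the flat-list parameter layout, and the toy one-weight problem are made up for illustration and are not taken from the code in this post.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place Adam update on flat lists of floats. t is the 1-based step count."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # running mean of gradients
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m[i] / (1 - beta1 ** t)              # bias correction for early steps
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params, m, v

# Toy problem (hypothetical): learn w so that w * x matches y = 3 * x.
# Each update uses the gradient of ONE sample, i.e. a batch size of 1.
samples = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0), (0.5, 1.5)]
w, m, v = [0.0], [0.0], [0.0]
t = 0
for epoch in range(500):
    for x, y in samples:
        t += 1
        grad = [2 * (w[0] * x - y) * x]   # d/dw of (w*x - y)^2 for this one sample
        adam_step(w, grad, m, v, t, lr=0.05)
```

Summing or averaging the per-item gradients over several sequences before calling the update would turn this into a larger mini-batch; the Adam step itself does not change.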

Here is the Stack Overflow post. The code on the GitHub prints out the average loss for each training epoch. The loss can be seen to drop by about half, but it seems to get stuck there.
The loop continues until convergence, but I don’t know how to test for convergence, and the loss seems to stay pretty high, so I don’t think it is converging anyway. The convergence analysis in the ADAM paper went over my head. My code uses a slightly different notation system than the guide above. For feedforward in the cell state, the guide uses h(gate), but I use (gate)t (the gate at timestep t).
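
On the convergence question: a common practical test, separate from the theoretical analysis in the paper, is to stop when the average epoch loss has not improved by more than a small tolerance for several epochs in a row. The sketch below assumes the training loop already has one average loss per epoch; `has_converged`, `tol`, `patience`, and the synthetic loss curve are hypothetical names and values chosen for illustration.

```python
def has_converged(epoch_losses, tol=1e-4, patience=5):
    """True when the best loss in the last `patience` epochs is not at least
    `tol` lower than the best loss seen before them."""
    if len(epoch_losses) <= patience:
        return False                      # not enough history yet
    best_before = min(epoch_losses[:-patience])
    best_recent = min(epoch_losses[-patience:])
    return best_before - best_recent < tol

# Synthetic loss history (hypothetical): drops by about half, then plateaus
history = []
stopped_at = None
for epoch in range(100):
    avg_loss = 2.0 + 2.0 * (0.5 ** epoch)   # stand-in for the real average epoch loss
    history.append(avg_loss)
    if has_converged(history):
        stopped_at = epoch
        break
```

Note that this only detects a plateau. A plateau at a high loss means training has stalled, not that the model fits the data well.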
ct_ and ht_ are the cell state c and the hidden state h from the previous timestep. I know there is also a ct gate activation, but I was trying to make a distinction between the cell state gate activation and the fully activated cell state.
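
For comparison, here is a minimal sketch of one forward step of a standard LSTM cell written with the names described above (ft, it, ot for the gates at timestep t, ht_ and ct_ for the previous states). It is my reading of the textbook equations, not the code from this post: the `params` dictionary layout, the name `ct_gate` for the candidate activation, and the choice to concatenate ht_ with xt are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """W @ v + b for a list-of-rows matrix W and list vectors b, v."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(xt, ht_, ct_, params):
    """One forward step. ht_ and ct_ are the previous hidden and cell states."""
    z = ht_ + xt                                    # list concatenation: [ht_, xt]
    ft = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]         # forget gate
    it = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]         # input gate
    ot = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]         # output gate
    ct_gate = [math.tanh(a) for a in affine(params["Wc"], params["bc"], z)]  # candidate
    ct = [f * c + i * g for f, c, i, g in zip(ft, ct_, it, ct_gate)]         # new cell state
    ht = [o * math.tanh(c) for o, c in zip(ot, ct)]                          # new hidden state
    return ht, ct

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Tiny example: hidden size 2, input size 3, all weights and biases zero
H, X = 2, 3
params = {k: zeros(H, H + X) for k in ("Wf", "Wi", "Wo", "Wc")}
params.update({k: [0.0] * H for k in ("bf", "bi", "bo", "bc")})
ht, ct = lstm_step([1.0, 0.0, 0.0], [0.0, 0.0], [1.0, -1.0], params)
```

With all-zero weights every sigmoid gate is 0.5 and the candidate is 0, so the new cell state is just half the old one, which makes a quick sanity check.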

The notation in cell backprop is similar, except the guide writes dh(gate) for the gate activation and I just write d(gate). Notice that dhc is replaced with dct, while dc remains the same.
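
Whatever notation is used, a numerical gradient check is a common way to test backprop: nudge each parameter by a small amount, measure how the loss changes, and compare that with the analytic gradient. The sketch below shows the idea on a stand-in loss with a single tanh unit, not on the LSTM itself; `numerical_grad`, `max_relative_error`, and the example loss are hypothetical helpers for illustration.

```python
import math

def numerical_grad(loss_fn, params, eps=1e-5):
    """Central-difference estimate of d loss / d params[i] for a flat list of floats."""
    grads = []
    for i in range(len(params)):
        old = params[i]
        params[i] = old + eps
        plus = loss_fn(params)
        params[i] = old - eps
        minus = loss_fn(params)
        params[i] = old                   # restore the original value
        grads.append((plus - minus) / (2 * eps))
    return grads

def max_relative_error(analytic, numeric):
    return max(abs(a - n) / max(abs(a) + abs(n), 1e-12)
               for a, n in zip(analytic, numeric))

# Stand-in example: loss = tanh(w0 * x + w1)^2 with x fixed
x = 0.7

def loss_fn(p):
    return math.tanh(p[0] * x + p[1]) ** 2

def analytic_grad(p):
    h = math.tanh(p[0] * x + p[1])
    dz = 2 * h * (1 - h * h)              # d loss / d pre-activation
    return [dz * x, dz]

p = [0.5, -0.1]
err = max_relative_error(analytic_grad(p), numerical_grad(loss_fn, p))
```

A relative error far above roughly 1e-6 for some parameter usually points at a mistake in the backprop term for that parameter.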

I was hoping the community could help me verify that my model is a correct LSTM. Also, I don’t know how to sample from the model and generate text.
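
On generating text: the usual approach for a character-level model is to feed in a seed character, turn the output scores into probabilities with a softmax, draw the next character from that distribution, and feed it back in. The sketch below assumes the model exposes a step function that takes a character index and a state and returns one score per character plus the new state. That interface, `sample_text`, and the toy model are all hypothetical.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    top = max(scaled)                     # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_text(step_fn, state, seed_index, vocab, length, temperature=1.0, rng=random):
    """Generate `length` characters by feeding each sampled character back in."""
    index, out = seed_index, []
    for _ in range(length):
        logits, state = step_fn(index, state)     # one forward step of the model
        probs = softmax(logits, temperature)
        index = rng.choices(range(len(vocab)), weights=probs)[0]
        out.append(vocab[index])
    return "".join(out)

# Toy stand-in for a trained model: strongly prefers the next letter in "abc"
vocab = ["a", "b", "c"]

def toy_step(index, state):
    logits = [0.0] * len(vocab)
    logits[(index + 1) % len(vocab)] = 50.0
    return logits, state

text = sample_text(toy_step, None, seed_index=0, vocab=vocab, length=6)
```

A temperature below 1 makes the samples more conservative, and a temperature above 1 makes them more varied.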
