Had an idea for a mechanism.

Attention mechanisms assume you can learn to attend to a specific element of a sequence. What about problems where it's more practical to eliminate one element at a time, recursively, until only one remains?

For example, with 10 elements:

  • t=0: compute attention scores over all elements, use them to weight the input, then zero out the argmax element

  • t=1: the input is still 10 elements, but one is now a zero vector. The function that produces the attention weights should have an easier time eliminating an element than at t=0, because the effective size of the set is reduced.
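The elimination loop described above can be sketched roughly as follows. This is a minimal NumPy illustration, not a trained model: `score_fn` stands in for whatever learned function produces one score per element, and the weighting/zeroing scheme follows the bullet points literally.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax; exp(-inf) = 0 masks eliminated elements.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def eliminate_until_one(X, score_fn):
    """Repeatedly score the set, weight the input by the attention
    weights, and zero out the argmax element, until one survives.

    X        : (n, d) array of element vectors.
    score_fn : maps an (n, d) array to n scores (hypothetical stand-in
               for a learned scoring function).
    Returns the index of the surviving element.
    """
    X = X.astype(float).copy()
    n = X.shape[0]
    alive = np.ones(n, dtype=bool)
    while alive.sum() > 1:
        scores = score_fn(X)                      # one score per element
        scores = np.where(alive, scores, -np.inf) # eliminated elements can't win
        w = softmax(scores)                       # attention weights
        X = X * w[:, None]                        # weight the input
        k = int(np.argmax(w))                     # element to eliminate
        X[k] = 0.0                                # zero out the argmax element
        alive[k] = False
    return int(np.argmax(alive))                  # index of the survivor
```

Each pass the eliminated element becomes a zero vector, so the next scoring pass sees an effectively smaller set, matching the t=1 intuition above.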

Is this a realistic idea, or is there a better way to accomplish it?

Source: https://www.reddit.com/r//comments/9nxia1/d__softmax_for_attention_by_/

