Had an idea for an attention mechanism.
Attention mechanisms assume you can learn to attend to a specific element of a sequence. What about problems where it's more practical to eliminate one element at a time, iteratively, until only one remains?
For example, with 10 elements:
t=0: compute attention scores via softmax over all elements, use them to weight the input, then zero out the argmax element
t=1: the input is still 10 elements, but one of them is now a zero vector. The function that produces the attention weights should have an easier time eliminating an element than at t=0, because the effective size of the set has shrunk.
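A minimal numpy sketch of that loop, assuming a fixed scoring vector `w` as a stand-in for whatever learned function produces the attention logits. One assumed detail the description leaves open: the argmax for elimination is restricted to surviving elements, so a zero vector can't be "eliminated" twice.

```python
import numpy as np

def eliminate_until_one(x, w):
    """Iteratively attend, weight the input, and zero out the argmax element.

    x: (n, d) input sequence.
    w: (d,) scoring vector -- a stand-in for a learned scoring function.
    Returns the per-step attention-weighted inputs and the index of the
    single surviving element.
    """
    x = x.astype(float).copy()
    n = x.shape[0]
    alive = np.ones(n, dtype=bool)
    weighted_steps = []
    for t in range(n - 1):
        scores = x @ w                         # attention logits over all n slots
        a = np.exp(scores - scores.max())
        a /= a.sum()                           # softmax over all elements
        weighted_steps.append(a[:, None] * x)  # attention-weighted input at step t
        # Assumed detail: pick the argmax among survivors only, so a
        # previously zeroed element can't be selected again.
        kill = int(np.argmax(np.where(alive, a, -np.inf)))
        x[kill] = 0.0                          # that element is now a zero vector
        alive[kill] = False
    return weighted_steps, int(np.flatnonzero(alive)[0])

x = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
w = np.array([1.0, 1.0])
steps, survivor = eliminate_until_one(x, w)  # survivor → 1 after two eliminations
```

Note that in this literal version the zeroed rows still receive softmax mass (their logit is exactly 0); masking their scores to -inf before the softmax would remove them from the distribution entirely, which may be closer to what you want in practice.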
Is this a realistic idea or is there a better way to accomplish it?