Interesting developments in subquadratic alternatives to self-attention-based Transformers for long-sequence modeling (32k tokens and more).
Hyena Hierarchy: Towards Larger Convolutional Language Models
https://arxiv.org/abs/2302.10866
They propose replacing the quadratic self-attention layers with the Hyena operator, built by interleaving implicitly parameterized long 1D convolutions with data-controlled (element-wise) gating. A minimal sketch of the implicit-convolution idea is below.
#DeepLearning #LLMs #PaperThread
1/n
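
A minimal sketch, not the authors' code: an implicitly parameterized long convolution where a small MLP maps positional features to the kernel values, applied via FFT so the cost is O(L log L) rather than attention's O(L^2). The module name `ImplicitLongConv1d` and all hyperparameters are illustrative assumptions; the real Hyena operator also interleaves these convolutions with gating, which is omitted here.

```python
# Sketch only: implicitly parameterized long convolution (PyTorch).
# Not the Hyena authors' implementation; names and sizes are assumptions.
import math
import torch
import torch.nn as nn


class ImplicitLongConv1d(nn.Module):
    def __init__(self, d_model: int, max_len: int, pos_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.d_model = d_model
        self.max_len = max_len
        # Sinusoidal positional features for kernel positions t = 0..max_len-1.
        t = torch.arange(max_len).float().unsqueeze(1)                     # (L, 1)
        freqs = torch.exp(torch.arange(0, pos_dim, 2).float() * (-math.log(10000.0) / pos_dim))
        pos = torch.zeros(max_len, pos_dim)
        pos[:, 0::2] = torch.sin(t * freqs)
        pos[:, 1::2] = torch.cos(t * freqs)
        self.register_buffer("pos", pos)                                   # (L, pos_dim)
        # The kernel is *implicit*: an MLP maps position features -> kernel values,
        # so parameter count does not grow with the kernel length.
        self.kernel_mlp = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )
        self.bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), with seq_len <= max_len.
        B, L, D = x.shape
        k = self.kernel_mlp(self.pos[:L])                                  # (L, D) implicit kernel
        # Causal convolution via FFT; zero-pad to 2L to avoid circular wrap-around.
        n = 2 * L
        x_f = torch.fft.rfft(x.transpose(1, 2), n=n)                       # (B, D, n//2 + 1)
        k_f = torch.fft.rfft(k.transpose(0, 1), n=n)                       # (D, n//2 + 1)
        y = torch.fft.irfft(x_f * k_f, n=n)[..., :L]                       # (B, D, L)
        return y.transpose(1, 2) + self.bias


# Usage: a drop-in token-mixing layer in place of self-attention (sketch only).
x = torch.randn(2, 1024, 256)
layer = ImplicitLongConv1d(d_model=256, max_len=4096)
print(layer(x).shape)  # torch.Size([2, 1024, 256])
```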