Historically, T5 models have been good when you fine-tune them into task-specific models (translation, summarization, etc.).
A decoder gets you the autoregressive generation you'd use for an LLM.
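For concreteness, roughly what that split looks like with the Hugging Face transformers library (the checkpoints and prompts here are just illustrative picks, not a claim about any particular setup):

```python
# Minimal sketch: task-prefixed generation with an encoder-decoder (T5-style)
# model vs. plain continuation with a decoder-only (GPT-style) model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM

# Encoder-decoder: steered toward a specific task via a prefix.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
ids = t5_tok("translate English to German: The cat is black.", return_tensors="pt")
print(t5_tok.decode(t5.generate(**ids, max_new_tokens=32)[0], skip_special_tokens=True))

# Decoder-only: autoregressive continuation of the prompt, LLM-style.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt_tok("The cat is black and", return_tensors="pt")
print(gpt_tok.decode(gpt.generate(**ids, max_new_tokens=20)[0], skip_special_tokens=True))
```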
Beyond that, there's the advantage that small LLMs train better, but IMHO they kinda hit a wall a year or two ago. E.g. the original Gemma 3 small models were short-context and text-only.
As far as I understand, you have to pay for that with roughly 2x inference cost at runtime.
(Would be happy to be corrected on any of the above; I maintain a multi-platform app that has llama.cpp inference in addition to standard LLMs, and I do embeddings locally, so I'm operating from a practical understanding more than an ML PhD.)
The issue is that they're generally harder to train (you need input/output pairs as a training dataset) and don't naturally generalize as well.
Decoder-only models also do this; the only difference is that they use masked attention.
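To make the "masked attention" bit concrete, a toy sketch in plain PyTorch (not any real model's code): the same scaled dot-product attention, with and without the causal mask.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=False):
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if causal:
        t = scores.size(-1)
        future = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))  # hide future tokens
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(1, 5, 8)                     # 5 tokens, hidden size 8
enc_style = attention(x, x, x)               # encoder: every token attends to every token
dec_style = attention(x, x, x, causal=True)  # decoder: token i only attends to tokens <= i
```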
The original transformer paper from Google was encoder-decoder, but then the encoder-only BERT was hot, and then the decoder-only GPT was hot; now encoder-decoder is hot again!
Decoders are good at generative tasks - chatbots etc.
Encoders are good at summarization.
Encoder-decoders are better at summarization. It's a step towards "understanding" (quotes needed).
Encoder-decoder comes from the original transformer implementation way back in 2017. If you look at figure 1 of that paper, you'll see what the first transformer ever looked like.
Since that time, different implementations of transformers use either just the encoder portion, just the decoder portion, or both. It's a deep topic, so it's hard to summarize here, but Gemini explains it really well! Hope this gets you started on some prompting to learn more.
An encoder-decoder model splits input and output. This makes sense for translation tasks, summarization, etc. They're good when there's a clear separation between "understand the task" and "complete the task", but you can use them for anything really. An example would be sending "Translate to English: Le chat est noir." to the encoder: the encoder processes everything in a single step, that is, it understands the task as a whole, then the output of the encoder is fed to the decoder, and the decoder runs one token at a time.
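If it helps, a rough sketch of that split with a T5 checkpoint via Hugging Face transformers (greedy decoding; the checkpoint and prompt are just examples): the encoder runs once over the full prompt, then the decoder loop emits one token per step.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

enc_in = tok("translate English to French: The cat is black.", return_tensors="pt")
enc_out = model.get_encoder()(**enc_in)            # one pass over the whole input

dec_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):                                # decoder: one token at a time
    logits = model(encoder_outputs=enc_out, decoder_input_ids=dec_ids).logits
    next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
    dec_ids = torch.cat([dec_ids, next_id], dim=-1)
    if next_id.item() == model.config.eos_token_id:
        break
print(tok.decode(dec_ids[0], skip_special_tokens=True))
```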
GPT ditches the encoder altogether and just runs the decoder with some slight changes. This makes it more parameter-efficient, but it tends to hallucinate more due to past tokens containing information that might be wrong. You can see it as the encoder running on each token as it is read/generated.
Edit: On re-read I noticed it might not be clear what I mean by past tokens containing wrong information. I mean that for each token the model generates a hidden state, and those states don't change, so for example an input of 100 tokens will have 100 hidden states. The states are generated all at once in the encoder model, and one token at a time in decoder models. Since the decoder doesn't have the full information yet, the hidden state will contain extra information that might not have anything to do with the task, or might even confuse it.
For example, say you give the model the task "Please translate this to chinese: Thanks for the cat, he's cute. I'm trying to send it to my friend in hong kong." An enc-dec model would read the whole thing at once and understand that you mean Cantonese. But a decoder-only model would "read" it one token at a time and could trip in several places: 1. assume "chinese" means Mandarin Chinese, not Cantonese; 2. assume that the text after "cute." is something to also translate and not a clarification. This would leave several tokens' worth of extra information that could confuse the model. Models are trained with this in mind, so they're used to tokens having lots of different meanings embedded in them and having later tokens narrow down the meanings, but it might still cause models to ignore certain tokens, or hallucinate.
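A hedged sketch of the "past hidden states don't change" point, using GPT-2 via Hugging Face transformers as a stand-in decoder-only model: once a token's key/value states are cached, later steps reuse them as-is rather than revising them in light of newer context.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Please translate this to chinese: Thanks for the cat,", return_tensors="pt")
out = model(**ids, use_cache=True)
past = out.past_key_values        # per-layer key/value states for every token read so far

# Feed only the next token; the cached states for earlier tokens are reused unchanged,
# never revisited once later context arrives.
next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
out2 = model(input_ids=next_id, past_key_values=past, use_cache=True)
```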
don't care. prove effective context length or gtfo.
I get not trying to cannibalize Gemma, but that's weird. A 540M multimodal model that performs well on queries would be useful, and "just post-train it yourself" is not always an option.
From here it looks like it still is long context and multimodal though?
> Inputs and outputs
> Input:
> - Text string, such as a question, a prompt, or a document to be summarized
> - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
> - Total input context of 128K tokens
> Output:
> - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
> - Total output context up to 32K tokens
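Back-of-the-envelope on those numbers (assuming "128K" means 131,072 tokens; the card may well mean 128,000):

```python
TOKENS_PER_IMAGE = 256
INPUT_CONTEXT = 128 * 1024                        # assumption: 128K = 131,072 tokens

images_only = INPUT_CONTEXT // TOKENS_PER_IMAGE               # 512 images with no text at all
with_some_text = (INPUT_CONTEXT - 8_000) // TOKENS_PER_IMAGE  # ~480 images plus ~8K text tokens
print(images_only, with_some_text)
```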
Even if you did fine-tune the model with text and images, you could run into issues if your image descriptions differ from the ones it was trained with. You could probably work around that by getting the model to describe the images itself, but you'd still need to audit the results to correct any issues or add whatever you are training for.
You can also run into overfitting if your data doesn't include enough variation compared to the training set the original model had access to.
Using different training parameters could also affect the model's capabilities. Just knowing things like the input context isn't enough.
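On the "have the model describe the images, then audit" idea above, a rough sketch of bootstrapping captions for later review (the captioning checkpoint, paths, and file layout are placeholders, not what Gemma itself was trained against):

```python
import json
from pathlib import Path
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

with open("captions_to_audit.jsonl", "w") as f:
    for img in Path("my_images").glob("*.jpg"):
        caption = captioner(str(img))[0]["generated_text"]
        # Dump for manual review/correction before it goes into a fine-tuning set.
        f.write(json.dumps({"image": str(img), "caption": caption}) + "\n")
```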
On the other hand, RL on deployed systems looks promising as a way to essentially JIT-optimize models. Experiments with model routers and agentic RAG have shown good results.
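As a toy illustration of what "JIT-optimizing" a deployed setup could look like, a bandit-style router sketch (model names and the reward signal are made up; real routers are fancier):

```python
import random

MODELS = ["small-local-model", "big-hosted-model"]    # hypothetical backends
stats = {m: {"reward": 0.0, "n": 1} for m in MODELS}  # start n at 1 to avoid div-by-zero
EPSILON = 0.1

def route(query: str) -> str:
    # Explore occasionally, otherwise pick the model with the best average reward so far.
    if random.random() < EPSILON:
        return random.choice(MODELS)
    return max(MODELS, key=lambda m: stats[m]["reward"] / stats[m]["n"])

def record_feedback(model: str, reward: float) -> None:
    # Reward could come from thumbs up/down, task success checks, latency, cost, etc.
    stats[model]["reward"] += reward
    stats[model]["n"] += 1
```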
Edit: Just noticed
> Also note pre-training and post-training benchmarks are different, so scores are not comparable across plots.
The paper gives more details about the specific benchmarks and the scores obtained in them: https://arxiv.org/html/2512.14856v1#S4