Why Notation Matters
A practical observation on LLMs that is both encouraging and worrisome: it is surprising how much NOTATION matters. Simply how you write something down can make it much easier or harder for a transformer to grasp the actual meaning behind it. It’s a little like an LLM that can read a quantum physics book, but only if it’s written in 18-point Comic Sans to look like a Dr. Seuss book.
Three examples: (1/3) The paper “Language Models can be Logical Solvers” generates training data by first running first-order logic problems (“Charlie is green. All green, white people are nice. True, false or unknown: Charlie is not green?”) through a solver (think of the good old logic programming language Prolog), and then fine-tuning an LLM on the solver’s step-by-step trace output. While out-of-the-box LLMs are surprisingly bad at executing logic statements, the fine-tuned models become good at it. But here’s the catch: if you fine-tune on the natural-language form “Charlie is green” instead of the symbolic form “Green(Charlie, True)”, the fine-tuned LLM’s accuracy astonishingly drops from 85% to 60%.
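To make the notation gap concrete, here is a minimal sketch (my own illustration, not the paper’s actual data pipeline) that renders the same toy problem in both notations. The helper name to_symbolic and the exact string formats are assumptions for illustration, not taken from the paper.

```python
def to_symbolic(subject: str, predicate: str, value: bool) -> str:
    """Render a fact like 'Charlie is green' as 'Green(Charlie, True)'."""
    return f"{predicate.capitalize()}({subject.capitalize()}, {value})"

# The same toy problem in the two notations compared by the ablation.
natural_language = (
    "Charlie is green. All green, white people are nice. "
    "True, false or unknown: Charlie is not green?"
)
symbolic = (
    to_symbolic("Charlie", "green", True) + " "
    + "Nice(x, True) :- Green(x, True), White(x, True). "  # Prolog-style rule
    + "True, false or unknown: " + to_symbolic("Charlie", "green", False) + "?"
)

print(natural_language)
print(symbolic)
# Fine-tuning on the symbolic rendering is what yields the high accuracy;
# swapping in the natural-language rendering is the change that hurts.
```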
(2/3) The paper “Transformers Can Achieve Length Generalization But Not Robustly” deals with the famous issue that when you train a transformer on lots of multi-digit additions (like 123 + 456 = 579), it does get good at adding numbers, but only up to the length of the longest numbers in its training data. (E.g., if you train with 10-digit numbers, the transformer will start making mistakes once you give it 11-digit numbers and above.) This paper manages to train a transformer on numbers of up to 40 digits and get it to be good at 100-digit numbers. How? Partly through notation tricks, as in the sketch below: 1) by reversing the digits of each number (21 + 43 = 64 instead of 12 + 34 = 46), and 2) by adding index hints (a4b2 + a3b9 = a8b1 instead of 42 + 39 = 81). Reversing works because that’s really how you add numbers by hand (you start with the least significant digits), but the fact that you need index hints as “tokenized training wheels” is just wild.
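Here is a minimal sketch of the two notation tricks (my own illustration; the paper’s actual scheme samples hint letters and handles padding differently): one function reverses the digits of every number, the other tags each digit position with a hint letter.

```python
import string

def reversed_format(a: int, b: int) -> str:
    """Write an addition with every number's digits reversed,
    least-significant digit first: 12 + 34 = 46 -> '21 + 43 = 64'."""
    def rev(n: int) -> str:
        return str(n)[::-1]
    return f"{rev(a)} + {rev(b)} = {rev(a + b)}"

def index_hint_format(a: int, b: int) -> str:
    """Tag each digit position with the same hint letter across both
    operands and the result: 42 + 39 = 81 -> 'a4b2 + a3b9 = a8b1'.
    (Here numbers are zero-padded to the width of the sum; the paper's
    hint scheme differs in its details.)"""
    width = len(str(a + b))
    hints = string.ascii_lowercase[:width]
    def tag(n: int) -> str:
        return "".join(h + d for h, d in zip(hints, str(n).zfill(width)))
    return f"{tag(a)} + {tag(b)} = {tag(a + b)}"

print(reversed_format(12, 34))    # 21 + 43 = 64
print(index_hint_format(42, 39))  # a4b2 + a3b9 = a8b1
```

Both tricks do the same kind of work: they make the digit-by-digit alignment explicit in the token stream instead of forcing the transformer to infer which positions correspond to each other.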