A Survey on Evaluation of Large Language Models (Jul 2023, added 7/14/23)
Survey paper that categorizes dozens of recent papers on LLM applications and evaluation. This is a useful “paper tree”:
The paper walks through many of these papers individually, but offers little cross-paper synthesis of insights. Here is a useful table of LLM benchmarks:
Performance Comparison of Various Models (May 2023, added 5/11/23)
Great table from Yao Fu (here):
Google’s PaLM 2 (announced on 5/10/23): GSM8K = 80.7, MATH = 34.3, MMLU = 78-81 (from their technical report).
Open-source LLM leaderboard by Hugging Face (live, added 5/11/23)
Very useful (here): a constantly updated Hugging Face Space that tracks the performance of open-source LLMs on various important benchmarks. As of 5/11, Llama-65B is leading (essentially all the best models are Llama or its instruction-tuned derivatives, including at the 7B scale!).
“Chatbot Arena”: Elo Rating for LLMs (live, added 5/15/23)
Cool site by Ion Stoica, who led the development of Vicuna (an LLM instruction-tuned from Llama), here: the best hands-on way to compare the chat capabilities of various models, by facing them off against each other. Constantly updated.
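The Elo mechanics behind such a pairwise leaderboard are simple. A minimal sketch of one rating update after a head-to-head battle (the K-factor of 32 and the starting rating of 1000 are illustrative assumptions, not Chatbot Arena’s actual parameters):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one battle; ratings are zero-sum."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (s_a - e_a)
    r_b_new = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Two models start at 1000; model A wins one battle.
ra, rb = update_elo(1000, 1000, a_wins=True)  # → (1016.0, 984.0)
```

Beating a much higher-rated model moves the ratings more than beating an equal one, which is what makes the aggregate ranking converge from noisy crowd votes.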
General Notes
Three alternatives to set up document Q&A:
Langchain: able to string together various LLM API calls into a coherent framework
LlamaIndex (previously GPTIndex):
Offers data connectors to sources (APIs, PDFs, docs, SQL etc.)
Provides indices over your unstructured and structured data for use with LLMs. These help with:
Storing context in an easy-to-access format for prompt insertion
Dealing with prompt limitations (e.g. 4096 tokens for Davinci) when the context is too big
Dealing with text splitting
Provides users an interface to query the index (feed in an input prompt) and obtain a knowledge-augmented output
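The prompt-limitation point above is the core trick: split documents into chunks, retrieve only the chunks relevant to the question, and insert those into the prompt. A minimal library-free sketch, using keyword overlap as a stand-in for the index (real LlamaIndex indices use embedding similarity; all function names here are illustrative, not LlamaIndex’s API):

```python
def split_into_chunks(text: str, chunk_size: int = 200) -> list[str]:
    """Naive splitter: fixed-size word windows (real splitters respect sentence boundaries)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def retrieve(chunks: list[str], question: str, top_k: int = 2) -> list[str]:
    """Rank chunks by keyword overlap with the question (embeddings in practice)."""
    q_terms = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q_terms & set(c.lower().split())))
    return scored[:top_k]

def build_prompt(chunks: list[str], question: str) -> str:
    """Insert only the retrieved context so the prompt fits the model's token limit."""
    context = "\n---\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

doc = ("The Davinci model has a 4096 token context window. " * 3
       + "Llama was released by Meta. " * 3)
question = "What is the context window?"
prompt = build_prompt(retrieve(split_into_chunks(doc, 10), question), question)
```

The prompt now contains only the chunks mentioning the context window, not the whole document, which is exactly how these frameworks fit large corpora into a 4096-token budget.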
Useful Tools
LMQL: an SQL-like programming language for language model interaction. You wrap simple control and select statements around your typical LLM prompt, and the language handles the rest, such as extracting the salient pieces of information you actually want from the model’s response. Very useful for saving time on processing LLM output strings.