Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face


When building applications with large language models (LLMs), quality is not the only consideration: speed and cost are just as essential.

For consumer apps and chat experiences, fast responses are key to keeping users engaged. Users expect near-instant responses, and any delay can directly reduce engagement. For more complex applications involving tool use or agentic systems, speed and cost matter even more and can become the bottleneck that limits overall system capability. When a single user request triggers a sequence of LLM calls, the time spent waiting on each call adds up, and so does the cost.

For this reason, Artificial Analysis (@ArtificialAnlys) has launched a new leaderboard that evaluates price, speed, and quality, now available on Hugging Face.

Click here to check out the leaderboard!

The LLM Performance Leaderboard aims to provide comprehensive metrics that help AI engineers choose the LLM and API provider best suited to their AI applications.

When choosing which AI technology to use, engineers need to consider quality, price, and responsiveness (latency and throughput). The leaderboard brings these three aspects together in one place, covering both proprietary and open models, so that decisions can be made more efficiently.

Metric coverage

The leaderboard includes the following key metrics:

  • Quality: A simplified index for comparing the quality and accuracy of different models, based on the MMLU, MT-Bench, and HumanEval scores reported by each model's author, together with the Chatbot Arena ranking.
  • Context window: The maximum number of tokens (input and output combined) that the LLM can handle at one time.
  • Pricing: The prices providers charge for model inference queries. The leaderboard reports input and output prices per token, as well as a “blended” price that lets hosting providers be compared on a single figure. Blended pricing assumes a 3:1 ratio, where the input is three times as long as the output.
  • Throughput: The speed at which an endpoint outputs tokens during inference, measured in tokens per second (often abbreviated “TPS”). The median, 5th, 25th, 75th, and 95th percentile values are reported for the past 14 days.
  • Latency: The time an endpoint takes to start responding after receiving a request, known as Time to First Token (“TTFT”) and measured in seconds. The median, 5th, 25th, 75th, and 95th percentile values are reported for the past 14 days.
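
The 3:1 pricing described above can be read as a weighted average of the input and output prices. The sketch below assumes exactly that interpretation; the function name `blended_price` and the example prices are hypothetical and are not taken from the leaderboard.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted average of input and output prices for a given input:output ratio.

    Prices can be in any consistent unit, for example USD per 1M tokens.
    """
    total_parts = input_parts + output_parts
    return (input_price * input_parts + output_price * output_parts) / total_parts


# Hypothetical prices in USD per 1M tokens, for illustration only.
example = blended_price(input_price=10.0, output_price=30.0)
print(example)  # 15.0
```

Under this assumption, a model with cheap input tokens but expensive output tokens ends up with a blended price much closer to its input price, because three quarters of the assumed traffic is input.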

For more detailed definitions, please see our full methodology page.
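
As a rough illustration of the throughput and latency definitions, the sketch below derives both from three timestamps of a single request. The `RequestTiming` type and its field names are hypothetical, and measuring throughput only over the period after the first token arrives is an assumption; the methodology page is the authoritative source for how the leaderboard computes these values.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    sent_at: float          # seconds: when the request was sent
    first_token_at: float   # seconds: when the first token arrived
    last_token_at: float    # seconds: when the final token arrived
    output_tokens: int      # number of tokens in the response


def ttft_seconds(t: RequestTiming) -> float:
    """Latency: time from sending the request to receiving the first token."""
    return t.first_token_at - t.sent_at


def output_tokens_per_second(t: RequestTiming) -> float:
    """Throughput: output tokens divided by the time spent generating them."""
    generation_time = t.last_token_at - t.first_token_at
    if generation_time <= 0:
        raise ValueError("last token must arrive after the first token")
    return t.output_tokens / generation_time


# Hypothetical measurement, for illustration only.
timing = RequestTiming(sent_at=0.0, first_token_at=0.5, last_token_at=4.5, output_tokens=200)
print(ttft_seconds(timing))              # 0.5
print(output_tokens_per_second(timing))  # 50.0
```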

Test workload

The leaderboard lets you explore performance under several different workloads, six combinations in total:

  • Varying prompt length: ~100 tokens, ~1K tokens, ~10K tokens.
  • Parallel queries: a single query and 10 parallel queries.
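
The six combinations are simply every pairing of the three prompt sizes with the two concurrency levels. A minimal sketch of that enumeration follows; the variable names are illustrative only.

```python
from itertools import product

prompt_lengths = [100, 1_000, 10_000]  # approximate prompt sizes, in tokens
parallel_queries = [1, 10]             # single query vs. 10 parallel queries

# Every pairing of prompt size and concurrency level.
workloads = list(product(prompt_lengths, parallel_queries))

for tokens, parallel in workloads:
    print(f"~{tokens} prompt tokens, {parallel} parallel queries")

print(len(workloads))  # 6
```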


We test every API endpoint on the leaderboard 8 times a day, and the figures shown are the median of measurements over the past 14 days. Detailed percentile breakdowns are also available.
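
At 8 tests a day over 14 days, each endpoint contributes roughly 112 measurements per workload. The sketch below shows one way to reduce such a sample to the median and percentile values the leaderboard reports, using Python's standard `statistics` module. The `summarize` helper and the choice of the "inclusive" interpolation method are assumptions, not the leaderboard's documented procedure.

```python
import statistics


def summarize(samples):
    """Return the median and the 5th, 25th, 75th, and 95th percentiles."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(samples, n=20, method="inclusive")
    return {
        "p5": cuts[0],
        "p25": cuts[4],
        "median": statistics.median(samples),
        "p75": cuts[14],
        "p95": cuts[18],
    }


# Synthetic measurements (0, 1, ..., 100), for illustration only.
summary = summarize(list(range(101)))
print(summary)
```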

Quality metrics are currently collected and reported per model, using the results provided by each model's creator. Stay tuned: we will begin publishing our own independent quality evaluations for each endpoint.

For more detailed definitions, please see our full methodology page.

Highlights (May 2024; see the leaderboard for the latest)

  • The language model market has become far more complex over the past year. Launches that have shaken up the market in the past two months alone include Anthropic's Claude 3 series and open models such as Databricks' DBRX, Cohere's Command R Plus, Google's Gemma, Microsoft's Phi-3, Mistral's Mixtral 8x22B, and Meta's Llama 3.
  • Price and speed vary considerably between models and providers. From Claude 3 Opus to Llama 3 8B, prices differ by a factor of 300, which is more than two orders of magnitude!
  • API providers are launching new models faster. Within 48 hours, seven providers were offering Llama 3 models, reflecting both the demand for new open-source models and the competition among API providers.
  • Key models to watch at different quality levels include:
    • High quality, but typically more expensive and slower: GPT-4 Turbo and Claude 3 Opus.
    • Moderate quality, price, and speed: Llama 3 70B, Mixtral 8x22B, Command R+, Gemini 1.5 Pro, DBRX.
    • Lower quality, but much faster and cheaper: Llama 3 8B, Claude 3 Haiku, Mixtral 8x7B.

Use Case: Speed and cost are as important as quality

In some cases, design patterns that make multiple requests to a faster, cheaper model can not only lower costs but also deliver better overall system quality than a single larger model.

For example, imagine a chatbot that needs to browse the web and extract relevant information from recent news articles. One strategy is to use a large, high-quality model, such as GPT-4 Turbo, to run the search and then read and process the top handful of articles. Another strategy is to use a smaller, faster model, such as Llama 3 8B, to read and extract key information from dozens of web pages in parallel, and then use GPT-4 Turbo to evaluate and summarize the most relevant results. The second strategy is more cost-effective, even though it reads ten times more content, and may produce higher-quality results.
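
To make the cost argument concrete, here is a back-of-the-envelope comparison of the two strategies. Every number in it is hypothetical: the per-token prices, the article length, the extract length, and the article counts (5 versus 50) are placeholders chosen only to show the arithmetic, not real provider prices or measurements.

```python
# All prices and token counts are hypothetical, for illustration only.
LARGE = {"input": 10.0, "output": 30.0}  # USD per 1M tokens, large model
SMALL = {"input": 0.10, "output": 0.10}  # USD per 1M tokens, small model


def call_cost(prices, input_tokens, output_tokens):
    """Cost in USD for the given total input and output token counts."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000


ARTICLE_TOKENS = 2_000  # assumed average article length
EXTRACT_TOKENS = 100    # assumed length of each extracted summary
ANSWER_TOKENS = 500     # assumed length of the final answer

# Strategy 1: the large model reads 5 articles and writes the answer.
strategy_1 = call_cost(LARGE, 5 * ARTICLE_TOKENS, ANSWER_TOKENS)

# Strategy 2: the small model reads 50 articles, then the large model
# reads only the 50 short extracts and writes the answer.
extraction = call_cost(SMALL, 50 * ARTICLE_TOKENS, 50 * EXTRACT_TOKENS)
synthesis = call_cost(LARGE, 50 * EXTRACT_TOKENS, ANSWER_TOKENS)
strategy_2 = extraction + synthesis

print(f"Strategy 1: ${strategy_1:.4f}")
print(f"Strategy 2: ${strategy_2:.4f}")
```

With these placeholder numbers the second strategy costs less despite covering ten times as many articles, because the expensive model only ever sees the short extracts. Whether that holds in practice depends on real prices and on how much text the small model passes along.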
