AI Research

Time Series Evals with OpenAI o1-preview

We benchmarked o1-preview on our hardest eval task: time series trend evaluations. This post compares its performance against GPT-4o-mini, Claude 3.5 Sonnet, and GPT-4o.

Prompt Caching Benchmarking

We compare the performance and cost savings of prompt caching on Anthropic vs. OpenAI.
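
The post itself carries the benchmark numbers; as a rough, hedged sketch of what is being compared, the snippet below shows how a cached request differs between the two SDKs: Anthropic caching is opt-in via a cache_control marker on a content block, while OpenAI caches long, repeated prompt prefixes automatically and reports hits in the usage fields. The system prompt and model names here are illustrative placeholders, not the ones used in the post.

```python
# Sketch: opt-in prompt caching (Anthropic) vs. automatic prefix caching (OpenAI).
# LONG_SYSTEM_PROMPT stands in for a large, reused prefix (caching generally
# requires on the order of 1,024+ tokens).
import anthropic
import openai

LONG_SYSTEM_PROMPT = "..."  # placeholder for reusable instructions/context

# Anthropic: mark the reusable block with cache_control so it is written to,
# then read from, the prompt cache on subsequent calls.
anthropic_client = anthropic.Anthropic()
a_resp = anthropic_client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the document."}],
)
print(a_resp.usage)  # cache_creation_input_tokens / cache_read_input_tokens

# OpenAI: no flag needed; repeated long prefixes are cached automatically and
# hits appear under usage.prompt_tokens_details.cached_tokens.
openai_client = openai.OpenAI()
o_resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "Summarize the document."},
    ],
)
print(o_resp.usage)
```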


Multi-Agent Systems: Swarm

We compare and contrast OpenAI's experimental Swarm repo against other popular multi-agent frameworks: Autogen and CrewAI.

Testing Generation in RAG

Testing the generation stage of RAG across GPT-4 and Claude 2.1.

Instrumenting LLMs with OTel

Lessons learned from our journey to one million downloads of our OpenTelemetry wrapper, OpenInference.
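
For readers new to the library, a minimal setup looks roughly like the snippet below: register an OpenTelemetry tracer provider, then let the OpenInference instrumentor wrap the OpenAI client so each LLM call is emitted as a span. The project name and collector endpoint are placeholders (a local Phoenix instance is assumed in this sketch).

```python
# Sketch: auto-instrumenting OpenAI calls with OpenInference + OTel.
# Assumes `arize-phoenix-otel` and `openinference-instrumentation-openai`
# are installed and a Phoenix collector is running locally.
import openai
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register(
    project_name="ai-research-demo",              # hypothetical project name
    endpoint="http://localhost:6006/v1/traces",   # local Phoenix collector
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, ordinary SDK calls are captured as OTel spans.
client = openai.OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, tracing!"}],
)
```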

Comparing Agent Frameworks

We built the same agent in LangGraph, LlamaIndex Workflows, CrewAI, Autogen, and pure code. See how each framework compares.
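
The post walks through the full agent in each framework; as a flavor of what the framework and pure-code versions have in common, here is a deliberately tiny LangGraph sketch with a single LLM-calling node. The state schema, node, and model name are illustrative and not the agent built in the post.

```python
# Sketch: a one-node LangGraph "agent" -- state in, LLM call, state out.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from openai import OpenAI

class AgentState(TypedDict):
    question: str
    answer: str

client = OpenAI()

def call_model(state: AgentState) -> dict:
    # Single step: answer the question with an LLM call (no tools, no routing).
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": state["question"]}],
    )
    return {"answer": resp.choices[0].message.content}

graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_edge(START, "agent")
graph.add_edge("agent", END)
app = graph.compile()

print(app.invoke({"question": "What is prompt caching?"})["answer"])
```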
