AI Research

Time Series Evals with OpenAI o1-preview

We benchmarked o1-preview on our hardest eval task: time series trend evaluations. This post compares its performance against GPT-4o-mini, Claude 3.5 Sonnet, and GPT-4o.

Prompt Caching Benchmarking

We compare the performance and cost savings of prompt caching on Anthropic vs. OpenAI.
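
The post itself carries the benchmark numbers; as a rough, hedged sketch of what is being compared, the snippet below shows how a cached request differs between the two SDKs: Anthropic caching is opt-in via a cache_control marker on a content block, while OpenAI caches long, repeated prompt prefixes automatically and reports hits in the usage fields. The system prompt and model names here are illustrative placeholders, not the ones used in the post.

```python
# Sketch: opt-in prompt caching (Anthropic) vs. automatic prefix caching (OpenAI).
# LONG_SYSTEM_PROMPT stands in for a large, reused prefix (caching generally
# requires on the order of 1,024+ tokens).
import anthropic
import openai

LONG_SYSTEM_PROMPT = "..."  # placeholder for reusable instructions/context

# Anthropic: mark the reusable block with cache_control so it is written to,
# then read from, the prompt cache on subsequent calls.
anthropic_client = anthropic.Anthropic()
a_resp = anthropic_client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the document."}],
)
print(a_resp.usage)  # cache_creation_input_tokens / cache_read_input_tokens

# OpenAI: no flag needed; repeated long prefixes are cached automatically and
# hits appear under usage.prompt_tokens_details.cached_tokens.
openai_client = openai.OpenAI()
o_resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "Summarize the document."},
    ],
)
print(o_resp.usage)
```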


Multi-Agent Systems: Swarm

We compare and contrast OpenAI's experimental Swarm repo against other popular multi-agent frameworks: Autogen and CrewAI.

Testing Generation in RAG

Testing the generation stage of RAG across GPT-4 and Claude 2.1.

Instrumenting LLMs with OTel

Lessons learned from our journey to one million downloads of our OpenTelemetry wrapper, OpenInference.
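
For readers new to the library, a minimal setup looks roughly like the snippet below: register an OpenTelemetry tracer provider, then let the OpenInference instrumentor wrap the OpenAI client so each LLM call is emitted as a span. The project name and collector endpoint are placeholders (a local Phoenix instance is assumed in this sketch).

```python
# Sketch: auto-instrumenting OpenAI calls with OpenInference + OTel.
# Assumes `arize-phoenix-otel` and `openinference-instrumentation-openai`
# are installed and a Phoenix collector is running locally.
import openai
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register(
    project_name="ai-research-demo",              # hypothetical project name
    endpoint="http://localhost:6006/v1/traces",   # local Phoenix collector
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, ordinary SDK calls are captured as OTel spans.
client = openai.OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, tracing!"}],
)
```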

Comparing Agent Frameworks

We built the same agent in LangGraph, LlamaIndex Workflows, CrewAI, Autogen, and pure code. See how each framework compares.
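
The post walks through the full agent in each framework; as a flavor of what the framework and pure-code versions have in common, here is a deliberately tiny LangGraph sketch with a single LLM-calling node. The state schema, node, and model name are illustrative and not the agent built in the post.

```python
# Sketch: a one-node LangGraph "agent" -- state in, LLM call, state out.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END
from openai import OpenAI

class AgentState(TypedDict):
    question: str
    answer: str

client = OpenAI()

def call_model(state: AgentState) -> dict:
    # Single step: answer the question with an LLM call (no tools, no routing).
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": state["question"]}],
    )
    return {"answer": resp.choices[0].message.content}

graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_edge(START, "agent")
graph.add_edge("agent", END)
app = graph.compile()

print(app.invoke({"question": "What is prompt caching?"})["answer"])
```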
