Research

Published work and things I tried that didn't necessarily work out.

Published

Calibration Failures in Retrieval-Augmented Generation Systems

2024-01

We show that RAG systems exhibit systematic overconfidence when retrieved context contradicts the model's parametric knowledge, and propose a lightweight calibration intervention.

Read paper ↗

Experiments

Things I tried, including the failures.

Using LLMs as automated paper reviewers

abandoned

Tried to use GPT-4 to pre-screen papers for a workshop. The reviews were superficially plausible but consistently missed domain-specific errors that any expert would catch. Abandoned after testing on 20 papers.