Language models are great with words, but budgets need numbers. Our benchmark revealed a gap that general-purpose AI still can't close.