OpenAI unveils o1, a model that can fact-check itself

Kyle Wiggers 10:02 AM PDT • September 12, 2024

ChatGPT maker OpenAI has announced its next major product release: A generative AI model code-named Strawberry, officially called OpenAI o1.

To be more precise, o1 is actually a family of models. Two are available Thursday in ChatGPT and via OpenAI’s API: o1-preview and o1-mini, a smaller, more efficient model aimed at code generation.

You’ll have to be subscribed to ChatGPT Plus or Team to see o1 in the ChatGPT client. Enterprise and educational users will get access early next week.

Note that the o1 chatbot experience is fairly bare-bones at present. Unlike GPT-4o, o1’s forebear, o1 can’t browse the web or analyze files yet. The model does have image-analyzing features, but they’ve been disabled pending additional testing. And o1 is rate-limited; weekly limits are currently 30 messages for o1-preview and 50 for o1-mini.

Another downside: o1 is expensive. Very expensive. In the API, o1-preview costs $15 per 1 million input tokens and $60 per 1 million output tokens. That’s 3x the cost of GPT-4o for input and 4x for output. (Tokens are bits of raw data; 1 million tokens is equivalent to around 750,000 words.)
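To make the pricing concrete, here is a rough sketch of the arithmetic. The o1-preview prices and the 750,000-words-per-million-tokens rule of thumb come from the figures above; the GPT-4o prices are back-calculated from the stated 3x and 4x ratios rather than taken from a price list, and the function names and the sample request size are purely illustrative.

```python
# Prices in USD per 1 million tokens. The o1-preview figures are the ones
# quoted in the article; the GPT-4o figures are an assumption, derived by
# dividing by the article's stated 3x (input) and 4x (output) ratios.
O1_PREVIEW = {"input": 15.00, "output": 60.00}
GPT_4O = {"input": O1_PREVIEW["input"] / 3, "output": O1_PREVIEW["output"] / 4}

TOKENS_PER_MILLION = 1_000_000
WORDS_PER_MILLION_TOKENS = 750_000  # rough equivalence quoted in the article


def words_to_tokens(words):
    """Estimate a token count from a word count (rule of thumb only)."""
    return round(words * TOKENS_PER_MILLION / WORDS_PER_MILLION_TOKENS)


def request_cost(prices, input_tokens, output_tokens):
    """Return the USD cost of one request under the given price table."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / TOKENS_PER_MILLION


if __name__ == "__main__":
    # Hypothetical request: a 1,500-word prompt and a 750-word answer.
    prompt_tokens = words_to_tokens(1_500)
    answer_tokens = words_to_tokens(750)
    for name, prices in (("o1-preview", O1_PREVIEW), ("GPT-4o", GPT_4O)):
        cost = request_cost(prices, prompt_tokens, answer_tokens)
        print(f"{name}: ${cost:.3f}")
```

Under these assumptions, the sample request works out to roughly 9 cents on o1-preview versus about 2.5 cents on GPT-4o. Real bills depend on how the actual tokenizer splits the text, so treat the word-to-token conversion as an estimate.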

OpenAI says it plans to bring o1-mini access to all free users of ChatGPT but hasn’t set a release date. We’ll hold the company to it.

Chain of reasoning

OpenAI o1 avoids some of the reasoning pitfalls that normally trip up generative AI models because it can effectively fact-check itself by spending more time considering all parts of a question. What makes o1 “feel” qualitatively different from other generative AI models is its ability to “think” before responding to queries, according to OpenAI.

When given additional time to “think,” o1 can reason through a task holistically — planning ahead and performing a series of actions over an extended period of time that help the model arrive at an answer. This makes o1 well-suited for tasks that require synthesizing the results of multiple subtasks, like detecting privileged emails in an attorney’s inbox or brainstorming a product marketing strategy.

In a series of posts on X on Thursday, Noam Brown, a research scientist at OpenAI, said that “o1 is trained with reinforcement learning.” This teaches the system “to ‘think’ before responding via a private chain of thought” through rewards when o1 gets answers right and penalties when it does not, he said.

Brown added that OpenAI used a new optimization algorithm and training dataset containing “reasoning data” and scientific literature specifically tailored for reasoning tasks. “The longer [o1] thinks, the better it does,” he said.

TechCrunch wasn’t offered the opportunity to test o1 before its debut; we’ll get our hands on it as soon as possible. But according to a person who did have access — Pablo Arredondo, VP at Thomson Reuters — o1 is better than OpenAI’s previous models (e.g., GPT-4o) at things like analyzing legal briefs and identifying solutions to problems in LSAT logic games.

“We saw it tackling more substantive, multi-faceted, analysis,” Arredondo told TechCrunch. “Our automated testing also showed gains against a wide range of simple tasks.”

In a qualifying exam for the International Mathematical Olympiad (IMO), a high school math competition, o1 correctly solved 83% of problems while GPT-4o solved only 13%, according to OpenAI. (That’s less impressive when you consider that Google DeepMind’s recent AI earned the equivalent of a silver medal on problems from the actual IMO contest.) OpenAI also says that o1 reached the 89th percentile of participants (better than DeepMind’s flagship system AlphaCode 2, for what it’s worth) in the online programming challenge rounds known as Codeforces.

In general, o1 should perform better on problems in data analysis, science, and coding, OpenAI says. (GitHub, which tested o1 with its AI coding assistant GitHub Copilot, reports that the model is adept at optimizing algorithms and app code.) And, at least per OpenAI’s benchmarking, o1 improves over GPT-4o in its multilingual skills, especially in languages like Arabic and Korean.

Ethan Mollick, a professor of management at Wharton, shared his impressions of o1 in a post on his personal blog after using it for a month. On a challenging crossword puzzle, o1 did well, he said, getting all the answers correct (despite hallucinating a new clue).
