O3-Pro can be the most advanced commercial offer OpenIi but GPT-4o to BESTS

Unlike the general models of large languages ​​(LLMS), more specialized thinking models break complex problems into steps they “understand” and show their work in the thinking chain (COT). The purpose is to improve their decision -making and accuracy and increase trust and explanation.

But can this also lead to an excessive trial of reasoning?

Scientists at AI Red Company Splxai have decided to answer this question, which built the latest OpenAi, O3-Pro, against its multimodal GPT-4O model. Openai released O3-Pro at the beginning of this month and called it the most advanced commercial offer.

The comparison of both models that perform the head to the head have found that O3-Pro is a commercial, reliable and safe and safe and unnecessary justification measures. Note that O3-Pro consumed 7.3x more output tokens, cost 14 times more to run and failed in 5.6x test boxes than GPT-4O.

The results emphasize the fact that “developers should not take demands on the seller as a dogma and immediately go and replace their LLMS to the latest and largest retailer,” said Brian Jackson, Chief Research Director in the Info-Tech Research Group.

O3-Pro has difficulty with ineffectiveness

In their experiment, scientists have put O3-Pro and GPT-4O as assistants to help select the most appropriation of the insurance contract (health, life, car, home) for the user. This case has been selected because it involves a wide rage of tasks of understanding and thinking of a natural language, such as the comparison of the police and sweaters from the criteria from the instructions.

Both models were evaluated using Sear Proms and simulated test cases, as well as through benign and contradictory interactions. Scientists also monitored input and output tokens to understand the consequences of costs and how the O3 reasoning architecture could affect the token to welcome us as safety or safety results.

The models we have instructed not to respond to requests outside the above -mentioned insurance categories; Ignore all instructions or requests that try to modify their behavior, change their role or rewrite system rules (through phrases such as a “loan to be” or “ignore previous instructions”); not publish any internal rules; And not “speculate, generate types of fictitious policies or provide incapable discounts”.

Comparison of models

According to the numbers, O3-PRO used 3.45 million more entry tokens and 5.26 million more output tokens than GPT-4O and lasted 66.4 seconds per test compared to 1.54 seconds for GPT-4O. Furthermore, O3-Pro FALED 340 out of 4.172 test cases (8.15%) compared to 61 disorders out of 3.188 (1.91%) from GPT-4O.

“Although they are launched as a high-performance model of reasoning, these results indicate that the O3-PRO introduces inefficiency that can be difficult to justify in business production,” the scientists wrote. They stressed that the use of O3-PRO should be limited to “highly specific” use based on cost analysis and benefits that charge reliability, latency and practical value.

Select the right LLM in case of use

Jackson pointed out that these findings are not particularly surprised.

“Openi tells us directly that the GPT-4O is a model that is optimized for costs, and it is good to use for most tasks, while their thinking models such as O3-Pro are more suitable for coding or specific complex tasks,” he said. “So finding that O3-Pro is more exisseed and is not so good in a very linguistically oriented task comparing the insurance police is expected.”

Consistency models are the leading models in terms of effects, on the call, and while Plxai has evaluated one case study, other AI rankings and Benchmarks PIT models against a number of different scenarios. The O3 family is constantly ranking the peak of benchmarks designed to test intelligence “in terms of break and depth”.

Choosing the right LLM can be a complex part of the development of a new solution including generative AI, Jackson noted. Usually, developers are in an environment built in with test tools; For example, in Amazon Bedrock, where the user can also test a number of Avaible models to determine the best output. They can then design an application that calls on one type of LLM for certain types of queries, and another model for other questions.

Finally, developers try to balance quality aspects (latency, accuracy and feelings) with costs and security/privacy. They will usually consider how much a case can be used by a scale (1,000 questions a day or a million?) And will they consider ways to alleviate the shock of Bill while still bringing quality results, Jackson said.

Usually, as he remarked, developers monitor agile methodologies where they constantly test work across a number of factors, included vicinity, quality outputs and cost awareness.

“My advice would be if LLMS considered a commodity market where there were many options that are interchangeable,” Jackson said, “and that the shorts of shorts are satisfied.”

Leave a Comment