Why is GPT-4 Important and What has Changed?

Table of Contents

What Has Changed in GPT-4?

The release of GPT-4, OpenAI’s most recent big language model, has been announced. This model is a sizable multimodal model that can produce text outputs from both image and text inputs.

A substantial development in artificial intelligence, particularly in natural language processing, may be seen with the recent release of GPT-4. GPT-4 was created to increase model “alignment,” or the capacity to carry out user goals, while simultaneously improving model veracity and producing output that is less obscene or hazardous.

In order to produce more individualized outcomes, GPT-4 will be more creative when it comes to generating creative writings, including screenplays, poems, and song compositions. It will also be better at imitating users’ writing styles.

Due to its ability to analyze the elements of images, it will be able to provide captions and give responses.

Performance upgrades

As one may anticipate, GPT-4 models outperform GPT-3.5 models in terms of the veracity of the responses. With GPT-4 scoring 40% higher than GPT-3.5 on OpenAI’s internal factual performance benchmark, the percentage of “hallucinations,” when the model commits factual or reasoning errors, is reduced.

Also, it enhances “steerability,” or the capacity to modify behavior in response to user demands. You may instruct it to write, for instance, in a different tone, style, or voice. Attempt prompts that begin, “You are a garrulous data expert,” or “You are a terse data expert,” and have it walk you through a data science idea.

The model’s adherence to guardrails is another enhancement. It is better at declining requests to perform things that are wrong or unsavoury.

Performance Benchmarks for GPT-4

OpenAI assessed GPT4 by mimicking tests made for humans, such as the SAT for college admission and the Uniform Bar Examination for lawyers. The outcomes demonstrated that GPT4 performed at a human-level on a number of academic and professional standards.

Simulated exams	GPT-4estimated percentile	GPT-4 (no vision)estimated percentile	GPT-3.5estimated percentile
Uniform Bar Exam (MBE+MEE+MPT)¹	298 / 400~90th	298 / 400~90th	213 / 400~10th
LSAT	163~88th	161~83rd	149~40th
SAT Evidence-Based Reading & Writing	710 / 800~93rd	710 / 800~93rd	670 / 800~87th
SAT Math	700 / 800~89th	690 / 800~89th	590 / 800~70th
Graduate Record Examination (GRE) Quantitative	163 / 170~80th	157 / 170~62nd	147 / 170~25th
Graduate Record Examination (GRE) Verbal	169 / 170~99th	165 / 170~96th	154 / 170~63rd
Graduate Record Examination (GRE) Writing	4 / 6~54th	4 / 6~54th	4 / 6~54th
USABO Semifinal Exam 2020	87 / 15099th–100th	87 / 15099th–100th	43 / 15031st–33rd
USNCO Local Section Exam 2022	36 / 60	38 / 60	24 / 60
Medical Knowledge Self-Assessment Program	75%	75%	53%
Codeforces Rating	392below 5th	392below 5th	260below 5th
AP Art History	586th–100th	586th–100th	586th–100th
AP Biology	585th–100th	585th–100th	462nd–85th
AP Calculus BC	443rd–59th	443rd–59th	10th–7th

GPT-4 was also put through its paces on established machine learning benchmarks by OpenAI, where it surpassed both the majority of cutting-edge models that may have been customized for the benchmark and existing big language models.

These standards covered 57 different disciplines with multiple-choice questions, common sense analysis of real -world situations, elementary school science multiple-choice questions, and more.

Benchmark	GPT-4 Evaluated few-shot	GPT-3.5 Evaluated few-shot	LM SOTA Best external LM evaluated few-shot	SOTA Best external model (includes benchmark-specific training)
MMLU Multiple-choice questions in 57 subjects (professional & academic)	86.4% 5-shot	70.0% 5-shot	70.7% 5-shot U-PaLM	75.2% 5-shot Flan-PaLM
HellaSwag Commonsense reasoning around everyday events	95.3% 10-shot	85.5% 10-shot	84.2% LLAMA (validation set)	85.6% ALUM
AI2 Reasoning Challenge (ARC) Grade-school multiple choice science questions. Challenge-set.	96.3% 25-shot	85.2% 25-shot	84.2% 8-shot PaLM	85.6% ST-MOE
WinoGrande Commonsense reasoning around pronoun resolution	87.5% 5-shot	81.6% 5-shot	84.2% 5-shot PALM	85.6% 5-shot PALM
HumanEval Python coding tasks	67.0% 0-shot	48.1% 0-shot	26.2% 0-shot PaLM	65.8% CodeT + GPT-3.5
DROP (f1 score) Reading comprehension & arithmetic.	80.9 3-shot	64.1 3-shot	70.8 1-shot PaLM	88.4 QDGAT

Overall, GPT-4’s more realistic results show that OpenAI’s efforts to create AI models with increasingly sophisticated skills are making substantial strides.

Visual inputs

In contrast to the text-only default, GPT-4 can accept a prompt with both text and graphics, allowing the user to specify any vision or language task. In more detail, it produces text outputs (natural language, code, etc.) from inputs that contain a mixture of text and images. GPT-4 shows comparable capabilities across a variety of domains, including documents with text and images, schematics, or screenshots. Moreover, test-time strategies like few-shot and chain-of-thought prompting, which were created for language models that simply use text, can be added to it. Picture inputs remain a research preview and are not accessible to the general public.

Benchmark	GPT-4 Evaluated few-shot	Few-shot SOTA	SOTA Best external model (includes benchmark-specific training)
VQAv2 VQA score (test-dev)	77.2% 0-shot	67.6% Flamingo 32-shot	84.3% PaLI-17B
TextVQA VQA score (val)	78.0% 0-shot	37.9% Flamingo 32-shot	71.8% PaLI-17B
ChartQA Relaxed accuracy (test)	78.5%^A	–	58.6% Pix2Struct Large
AI2 Diagram (AI2D) Accuracy (test)	78.2% 0-shot	–	42.1% Pix2Struct Large
DocVQA ANLS score (test)	88.4% 0-shot (pixel-only)	–	88.4% ERNIE-Layout 2.0
Infographic VQA ANLS score (test)	75.1% 0-shot (pixel-only)	–	61.2% Applica.ai TILT
TVQA Accuracy (val)	87.3% 0-shot	–	86.5% MERLOT Reserve Large
LSMDC Fill-in-the-blank accuracy (test)	45.7% 0-shot	31.0% MERLOT Reserve 0-shot	52.9% MERLOT

Limitations

The GPT-4 has comparable restrictions to preceding GPT models despite its capabilities. Most significantly, it still lacks complete reliability. When utilizing language model outputs, especially in high-stakes situations and with the precise protocol corresponding to the requirements of a given use-case, extreme caution should be exercised.

Even though it is still a problem, GPT-4 has a 40% higher internal adversarial factuality evaluation score than our most recent GPT-3.5.

In general, GPT4 does not learn from its experience and is unaware of events that have taken place after the bulk of its data is shut off (September 2021).

It occasionally exhibits simple reasoning flaws that do not seem to be consistent with its proficiency in so many other areas, or it may be unduly trusting when accepting blatantly fraudulent claims from a user.

However, it occasionally makes mistakes when solving complex issues, much like people do, for example, adding security flaws to the code it generates.

When it is likely to make a mistake, GPT-4 can also be confidently inaccurate in its predictions and neglect to double-check its work. It’s interesting to note that the pre-trained base model is well tuned (its predicted confidence in an answer generally matches the probability of being correct). The calibration is nonetheless lowered by present post-training procedure.

Left: The pre-trained GPT-4 model’s calibration plot on a subset of the MMLU data. The likelihood that the model’s prediction will be accurate closely matches its level of confidence. Perfect calibration is represented by the dotted diagonal line. Right: Calibration plot for the same MMLU subset using the post-trained PPO GPT-4 model. Our current procedure significantly degrades calibration.

Access Methods for GPT-4

Via ChatGPT, OpenAI is making text input for GPT-4 available. For the time being, ChatGPT Plus subscribers can access it. For the GPT-4 API, a waiting list exists.

The ability to input images has not yet been made publicly available.

To enable anybody to report flaws in their models and direct future improvements, OpenAI has made OpenAI Evals, a platform for automated evaluation of AI model performance, available for use by anyone.

Value Centric Innovation

Value Centric Innovation

Why is GPT-4 Important and What has Changed?

What Has Changed in GPT-4?

Performance upgrades

Performance Benchmarks for GPT-4

Visual inputs

Limitations

Access Methods for GPT-4

aivalutric

Related Posts

7 Proven Strategies for Answer Engine Optimization (AEO) in 2025

7 Groundbreaking Ways Diffusion LLMs (DLLMs) Are Set to Transform AI Forever

Other Story

7 Proven Strategies for Answer Engine Optimization (AEO) in 2025

7 Groundbreaking Ways Diffusion LLMs (DLLMs) Are Set to Transform AI Forever

8 Reasons AI Engineers Can’t Stop Talking About Model Context Protocol (MCP)

7 Ways Cloudflare Just Made Building AI Apps & Agents Incredibly Easy

7 Key Reasons Why Prime Video Cut UI Latency 7.6x by Switching to Rust

Best AI Research Tools Compared: Google Co-Scientist vs. OpenAI Deep Research vs. Perplexity Deep Research