GPT-4 is more creative when generating writing such as screenplays, poems, and song compositions, and it is better at imitating a user’s writing style, which lets it produce more individualized results.
Because it can analyze the contents of an image, it can also caption images and answer questions about them.
Performance upgrades
Performance Benchmarks for GPT-4
| Simulated exam | GPT-4 (estimated percentile) | GPT-4, no vision (estimated percentile) | GPT-3.5 (estimated percentile) |
|---|---|---|---|
| Uniform Bar Exam (MBE+MEE+MPT) | 298 / 400 (~90th) | 298 / 400 (~90th) | 213 / 400 (~10th) |
| LSAT | 163 (~88th) | 161 (~83rd) | 149 (~40th) |
| SAT Evidence-Based Reading & Writing | 710 / 800 (~93rd) | 710 / 800 (~93rd) | 670 / 800 (~87th) |
| SAT Math | 700 / 800 (~89th) | 690 / 800 (~89th) | 590 / 800 (~70th) |
| Graduate Record Examination (GRE) Quantitative | 163 / 170 (~80th) | 157 / 170 (~62nd) | 147 / 170 (~25th) |
| Graduate Record Examination (GRE) Verbal | 169 / 170 (~99th) | 165 / 170 (~96th) | 154 / 170 (~63rd) |
| Graduate Record Examination (GRE) Writing | 4 / 6 (~54th) | 4 / 6 (~54th) | 4 / 6 (~54th) |
| USABO Semifinal Exam 2020 | 87 / 150 (99th–100th) | 87 / 150 (99th–100th) | 43 / 150 (31st–33rd) |
| USNCO Local Section Exam 2022 | 36 / 60 | 38 / 60 | 24 / 60 |
| Medical Knowledge Self-Assessment Program | 75% | 75% | 53% |
| Codeforces Rating | 392 (below 5th) | 392 (below 5th) | 260 (below 5th) |
| AP Art History | 5 (86th–100th) | 5 (86th–100th) | 5 (86th–100th) |
| AP Biology | 5 (85th–100th) | 5 (85th–100th) | 4 (62nd–85th) |
| AP Calculus BC | 4 (43rd–59th) | 4 (43rd–59th) | 1 (0th–7th) |
| Benchmark | GPT-4 (evaluated few-shot) | GPT-3.5 (evaluated few-shot) | LM SOTA (best external LM evaluated few-shot) | SOTA (best external model, includes benchmark-specific training) |
|---|---|---|---|---|
| MMLU: Multiple-choice questions in 57 subjects (professional & academic) | 86.4% (5-shot) | 70.0% (5-shot) | 70.7% | 75.2% |
| HellaSwag: Commonsense reasoning around everyday events | 95.3% (10-shot) | 85.5% (10-shot) | 84.2% | 85.6% |
| AI2 Reasoning Challenge (ARC): Grade-school multiple-choice science questions, challenge set | 96.3% (25-shot) | 85.2% (25-shot) | 84.2% | 85.6% |
| WinoGrande: Commonsense reasoning around pronoun resolution | 87.5% (5-shot) | 81.6% (5-shot) | 84.2% | 85.6% |
| HumanEval: Python coding tasks | 67.0% (0-shot) | 48.1% (0-shot) | 26.2% | 65.8% |
| DROP (F1 score): Reading comprehension & arithmetic | 80.9 (3-shot) | 64.1 (3-shot) | 70.8 | 88.4 |
Overall, GPT-4’s stronger results across these exams and benchmarks suggest that OpenAI’s efforts to build AI models with increasingly sophisticated skills are making substantial progress.
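To put a number on these gains, the GPT-4 and GPT-3.5 columns of the benchmark table can be compared directly. The snippet below is a minimal sketch: the `SCORES` dictionary and the `gains` helper are illustrative names of our own, the row labels are shortened from the table descriptions, and the figures are copied from the table as printed. DROP is an F1 score rather than an accuracy, but it sits on the same 0 to 100 scale.

```python
from __future__ import annotations

# Scores copied from the benchmark table as (GPT-4, GPT-3.5).
# Keys are shortened row descriptions, not official benchmark identifiers.
SCORES = {
    "57-subject multiple choice": (86.4, 70.0),
    "Everyday-event commonsense": (95.3, 85.5),
    "Grade-school science (challenge set)": (96.3, 85.2),
    "Pronoun-resolution commonsense": (87.5, 81.6),
    "Python coding tasks": (67.0, 48.1),
    "DROP (F1)": (80.9, 64.1),
}


def gains(gpt4: float, gpt35: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = gpt4 - gpt35
    relative = 100.0 * absolute / gpt35
    return round(absolute, 1), round(relative, 1)


if __name__ == "__main__":
    for name, (gpt4, gpt35) in SCORES.items():
        absolute, relative = gains(gpt4, gpt35)
        print(f"{name}: +{absolute} points ({relative}% relative)")
```

For the 57-subject multiple-choice row, for example, this works out to a gain of 16.4 points, or roughly 23.4% relative to GPT-3.5. The Python coding row shows the largest relative jump of the six.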
Visual inputs
Unlike the text-only setting, GPT-4 can accept a prompt containing both text and images, which lets the user specify any vision or language task. Specifically, it generates text outputs (natural language, code, etc.) from inputs that mix text and images. Across a variety of domains, including documents with text and images, schematics, or screenshots, GPT-4 shows capabilities comparable to those it has on text-only inputs. It can also be combined with test-time techniques that were developed for text-only language models, such as few-shot and chain-of-thought prompting. Image inputs remain a research preview and are not available to the general public.
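As a rough illustration of what a mixed text-and-image prompt with few-shot examples might look like, here is a minimal sketch. Everything in it is hypothetical: `TextPart`, `ImagePart`, and `build_few_shot_prompt` are our own names, the file paths are placeholders, and the structure is not OpenAI’s request format, since image input is not publicly available. The trailing "Let's think step by step." is just one commonly used chain-of-thought cue, chosen here as an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class TextPart:
    text: str


@dataclass(frozen=True)
class ImagePart:
    path: str  # hypothetical local file path; the image is not loaded here


Part = Union[TextPart, ImagePart]


def build_few_shot_prompt(
    examples: List[Tuple[str, str, str]],
    query_image: str,
    question: str,
    chain_of_thought: bool = False,
) -> List[Part]:
    """Interleave worked (image, question, answer) examples with the final query."""
    parts: List[Part] = []
    for image_path, example_question, example_answer in examples:
        parts.append(ImagePart(image_path))
        parts.append(TextPart(f"Q: {example_question}\nA: {example_answer}"))
    # The final query repeats the pattern but leaves the answer open.
    parts.append(ImagePart(query_image))
    suffix = " Let's think step by step." if chain_of_thought else ""
    parts.append(TextPart(f"Q: {question}\nA:{suffix}"))
    return parts


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        examples=[("chart_a.png", "What kind of chart is this?", "A bar chart.")],
        query_image="chart_b.png",
        question="What kind of chart is this?",
        chain_of_thought=True,
    )
    for part in prompt:
        print(part)
```

With one worked example this is a 1-shot prompt. Passing an empty `examples` list gives the 0-shot setting reported for most of the vision benchmarks.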
| Benchmark | GPT-4 (evaluated few-shot) | Few-shot SOTA | SOTA (best external model, includes benchmark-specific training) |
|---|---|---|---|
| VQAv2: VQA score (test-dev) | 77.2% (0-shot) | 67.6% | 84.3% |
| TextVQA: VQA score (val) | 78.0% (0-shot) | 37.9% | 71.8% |
| ChartQA: Relaxed accuracy (test) | 78.5% | – | 58.6% |
| AI2 Diagram (AI2D): Accuracy (test) | 78.2% (0-shot) | – | 42.1% |
| DocVQA: ANLS score (test) | 88.4% (0-shot, pixel-only) | – | 88.4% |
| Infographic VQA: ANLS score (test) | 75.1% (0-shot, pixel-only) | – | 61.2% |
| TVQA: Accuracy (val) | 87.3% (0-shot) | – | 86.5% |
| LSMDC: Fill-in-the-blank accuracy (test) | 45.7% (0-shot) | 31.0% | 52.9% |
Limitations
Despite its capabilities, GPT-4 has limitations similar to those of earlier GPT models. Most significantly, it is still not fully reliable. Language model outputs should be used with great care, especially in high-stakes situations, and the exact protocol should match the needs of the specific use case.
Although reliability is still a problem, GPT-4 scores 40% higher than OpenAI’s most recent GPT-3.5 on OpenAI’s internal adversarial factuality evaluations.

Access Methods for GPT-4
OpenAI is making GPT-4’s text input capability available through ChatGPT. For now, it is accessible to ChatGPT Plus subscribers, and there is a waiting list for the GPT-4 API.
Image input has not yet been made publicly available.
OpenAI has also made OpenAI Evals, a framework for automated evaluation of AI model performance, available to everyone, so that anyone can report shortcomings in its models and help guide future improvements.
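To give a feel for what automated evaluation means in practice, here is a minimal sketch of an exact-match check. It is not the OpenAI Evals API: `exact_match_accuracy`, `toy_model`, and `SAMPLES` are hypothetical names, and the "model" is a stand-in lookup table rather than a real GPT-4 call. The idea is simply to run a set of prompts through a model and count how often the output matches an ideal answer.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    model: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of prompts whose normalized output equals the ideal answer."""
    total = 0
    correct = 0
    for prompt, ideal in samples:
        total += 1
        # Normalize whitespace and case so trivial differences do not count as errors.
        if model(prompt).strip().lower() == ideal.strip().lower():
            correct += 1
    return correct / total if total else 0.0


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; answers from a fixed lookup table."""
    canned = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "unknown")


# Hypothetical evaluation set of (prompt, ideal answer) pairs.
SAMPLES = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "paris"),
    ("Largest planet?", "Jupiter"),
]

if __name__ == "__main__":
    print(f"accuracy: {exact_match_accuracy(toy_model, SAMPLES):.2f}")
```

Swapping `toy_model` for a function that calls a real model would turn this into a small regression check. A failing sample is then a concrete, reportable example of a model flaw.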
