Code Red: A large-scale study on ChatGPT-generated code vulnerabilities

Recent observations have shown that Large Language Models (LLMs), when employed for code generation, can inadvertently produce vulnerable code. This issue has been highlighted in various small-scale studies, including blog posts and research articles, demonstrating its potential risks. However, the broader implications for software engineering have remained unclear, even as many developers now rely on AI tools for code generation and auto-completion. ChatGPT is currently at the forefront as a leading proprietary LLM technology. Understanding the extent to which it generates vulnerable code is therefore a crucial question that needs addressing.

A recent large-scale study by Tihanyi et al. has revealed some concerning findings about C programs generated by GPT-3.5: these programs exhibit vulnerabilities 51.24% of the time across various programming scenarios, posing a significant threat to software safety and security. Their research paper, titled “The FormAI Dataset: Generative AI in Software Security through the Lens of Formal Verification” [1], goes beyond these alarming statistics. It introduces a method for assessing the propensity of various Large Language Models (LLMs) to generate vulnerable code. This blog post delves into the crucial insights of this paper.

The research questions raised in the paper were the following:
• RQ1: How likely is purely LLM-generated code to contain vulnerabilities on the first output when using simple zero-shot text-based prompts?
• RQ2: What are the most typical vulnerabilities LLMs introduce when generating code? 

To accurately address these questions, it was essential to examine a substantial corpus of AI-generated code. Before this study, all extensive C program databases were either sourced from real-world projects authored by human developers or synthetically designed to include specific vulnerabilities for machine learning or related objectives. The study introduced the FormAI dataset, the first large-scale AI-generated database focusing on vulnerable C programs. Merely generating C programs without classifying their vulnerabilities would not suffice to validate the research hypotheses. The methodology adopted in the paper was as follows:

  1. Generate a diverse set of C program codes for a wide range of programming tasks.
  2. Reduce redundancy and code replication with a novel zero-shot prompting template.
  3. Perform vulnerability classification without introducing false positives.

To facilitate the study, a specialized prompting technique was devised to interact with ChatGPT through API calls for various programming tasks, as shown in Figure 1. These tasks, referred to as [Type], encompass a wide range of applications such as Wi-Fi strength checkers, terminal-based games, scientific calculators, cryptography protocols, and more, randomly chosen from a pool of 200 coding tasks. To enhance diversity, each prompt was also paired with a [Style] selected randomly from a collection of 100 different coding styles. A minimal sketch of this kind of prompt assembly follows Figure 1.

Figure 1. Dynamic code generation prompt
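
To make the [Type]/[Style] mechanism concrete, here is a minimal sketch of how such a prompt could be assembled. The task and style pools below are small hypothetical placeholders (the study drew from 200 tasks and 100 styles), and the exact prompt wording used in the paper differs.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical placeholder pools; the study used 200 [Type] tasks
   and 100 [Style] instructions. */
static const char *types[]  = { "Wi-Fi signal strength checker",
                                "terminal-based game",
                                "scientific calculator" };
static const char *styles[] = { "multi-threaded", "recursive", "modular" };

int main(void) {
    srand((unsigned)time(NULL));
    const char *type  = types[rand() % (sizeof types / sizeof *types)];
    const char *style = styles[rand() % (sizeof styles / sizeof *styles)];

    char prompt[256];
    /* Assemble the dynamic prompt; each API call receives a fresh
       [Type]/[Style] combination. */
    snprintf(prompt, sizeof prompt,
             "Create a C program that implements a %s, written in a %s style.",
             type, style);

    puts(prompt);  /* in the study, prompts like this go to GPT-3.5-turbo */
    return 0;
}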

The process of dataset creation can be seen in Figure 2. In this case, the LLM module is GPT-3.5-turbo. After the dataset had been populated with compilable C programs, the important task of vulnerability classification and detection began. Given the dataset's extensive size, manually labelling it would have been impractical, so the authors opted to use the Efficient SMT-based Bounded Model Checker (ESBMC) to perform the vulnerability labelling. This tool can formally verify the presence of specific vulnerabilities through symbolic execution. A constraint of this approach is its dependency on the available computational resources, which limits the search depth to a predefined bound; owing to the method's intensive resource demands, it can only identify vulnerabilities within this set range. The sketch after Figure 2 illustrates why this bound matters.

Figure 2. AI-driven Dataset Generation and Vulnerability Labeling with Program Classification by the BMC Module
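
As a simple illustration (our own hypothetical example, not a sample from the dataset), consider a program whose violation only becomes reachable after many loop iterations; a bounded model checker that does not unwind the loop far enough never observes the violating state:

#include <stdio.h>

int main(void) {
    int buf[10];
    for (int i = 0; i < 20; i++) {
        buf[i] = i;  /* the out-of-bounds write first occurs at i == 10 */
    }
    /* With an unwind bound smaller than 11, a bounded model checker such
       as ESBMC never reaches the violating iteration and reports no
       counterexample, even though the array-bounds violation is real. */
    printf("%d\n", buf[0]);
    return 0;
}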

This means that individual C programs in the FormAI dataset might contain even more vulnerabilities than those detected. The dataset encompasses over 8.8 million lines of C code, averaging 79 lines per sample, with 47-line programs being the most common. An interesting property of the generated dataset is that ChatGPT's use of the most common and least frequent C keywords is similar to that of both human-written and synthetic datasets, as shown in Figure 3.

Figure 3. C Keyword frequency in FormAI, SARD, and BigVul

Counting the 32 C keywords with a token-based frequency counter and normalizing their frequency per million lines shows that the distribution of if-statements, loops, and variables is similar to real-world projects. This is likely because GPT models are trained on actual GitHub projects and human-written code.
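
For illustration, here is a minimal sketch of such a token-based counter (our own hypothetical reconstruction, not the authors' tooling). It tallies a handful of the 32 C keywords and normalizes per million lines; unlike a real lexer, it does not skip comments or string literals, and it ignores a token left unterminated at EOF.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Small sample of the 32 C keywords; extend for a full count. */
static const char *keywords[] = { "if", "else", "for", "while", "return", "int" };
#define NKW (sizeof keywords / sizeof keywords[0])

int main(void) {
    long counts[NKW] = {0};
    long lines = 1;
    char tok[64];
    size_t len = 0;
    int c;

    while ((c = getchar()) != EOF) {
        if (c == '\n') lines++;
        if (isalnum(c) || c == '_') {        /* accumulate an identifier token */
            if (len < sizeof tok - 1) tok[len++] = (char)c;
        } else if (len > 0) {                /* token ended: check the keyword list */
            tok[len] = '\0';
            for (size_t i = 0; i < NKW; i++)
                if (strcmp(tok, keywords[i]) == 0) counts[i]++;
            len = 0;
        }
    }
    for (size_t i = 0; i < NKW; i++)         /* normalize per million lines */
        printf("%-8s %12.1f per million lines\n",
               keywords[i], counts[i] * 1e6 / (double)lines);
    return 0;
}

Running it over the concatenated corpus (for example, cat *.c | ./kwcount) yields per-keyword frequencies directly comparable to Figure 3.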
Define Σ as the set of all C samples, Σ = {c₁, c₂, …, c₁₁₂₀₀₀}. The vulnerabilities identified by the ESBMC module in the FormAI dataset were classified into four main categories:

  • VS ⊆ Σ: the set of samples for which verification was successful (no vulnerabilities have been detected within the bound k); 
  • VF ⊆ Σ: the set of samples for which verification failed (a counterexample demonstrating a vulnerability was found);
  • TO ⊆ Σ: the set of samples for which the verification process was not completed within the provided time frame (as a result, the status of these files remains uncertain); 
  • ER ⊆ Σ: the set of samples for which the verification status resulted in an error. 

The category VF is the most interesting for our purposes, as it signifies that one or more vulnerabilities were found. In such cases, ESBMC found a “counterexample” during symbolic execution. The category VF was divided into nine subcategories (a short illustrative snippet follows the list):

  • BOF: Buffer overflow on scanf()/fscanf()
  • DFN: Dereference failure: NULL pointer
  • DFA: Dereference failure: array bounds violated
  • ARO: Arithmetic overflow
  • ABV: Array bounds violated
  • DFI: Dereference failure: invalid pointer
  • DFF: Dereference failure: forgotten memory
  • OTV: Other vulnerabilities
  • DBZ: Division by zero
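
As a concrete illustration (a hypothetical example in the spirit of the dataset, not a sample from it), the following short program exhibits the most frequent category, a scanf()-based buffer overflow (BOF), together with a division by zero (DBZ):

#include <stdio.h>

int main(void) {
    char name[8];
    int divisor;

    /* BOF: scanf("%s") writes past name[] for any input longer than
       7 characters (cf. CWE-120/CWE-787). */
    scanf("%s", name);

    /* DBZ: the divisor is never checked against zero before dividing
       (cf. CWE-369). */
    scanf("%d", &divisor);
    printf("100 / %d = %d\n", divisor, 100 / divisor);
    return 0;
}

A field-width specifier (scanf("%7s", name)) and a zero check on divisor would remove both counterexamples.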

By linking the vulnerabilities to Common Weakness Enumeration (CWE) identifiers, 41 unique CWEs were identified. Among these, eight CWE identifiers are featured in MITRE's Top 25 CWEs for 2022.
In sequence, these are: CWE-787 (1st), CWE-20 (4th), CWE-125 (5th), CWE-416 (7th), CWE-476 (11th), CWE-190 (13th), CWE-119 (19th), and CWE-362 (22nd). Within FormAI, it is common for a single file to contain several vulnerabilities. Moreover, a single vulnerability might be associated with several CWEs. Table 1 offers an overview, listing the identified CWEs along with the frequency of their occurrences within the dataset.

#Vulns   Vuln.   Associated CWE numbers
88,049   BOF     CWE-20, CWE-120, CWE-121, CWE-125, CWE-129, CWE-131, CWE-628, CWE-676, CWE-680, CWE-754, CWE-787
31,829   DFN     CWE-391, CWE-476, CWE-690
24,702   DFA     CWE-119, CWE-125, CWE-129, CWE-131, CWE-755, CWE-787
23,312   ARO     CWE-190, CWE-191, CWE-754, CWE-680, CWE-681, CWE-682
11,088   ABV     CWE-119, CWE-125, CWE-129, CWE-131, CWE-193, CWE-787, CWE-788
9,823    DFI     CWE-416, CWE-476, CWE-690, CWE-822, CWE-824, CWE-825
5,810    DFF     CWE-401, CWE-404, CWE-459
1,620    OTV     CWE-119, CWE-125, CWE-158, CWE-362, CWE-389, CWE-401, CWE-415, CWE-459, CWE-416, CWE-469, CWE-590, CWE-617, CWE-664, CWE-662, CWE-685, CWE-704, CWE-761, CWE-787, CWE-823, CWE-825, CWE-843
1,567    DBZ     CWE-369

Table 1. The vulnerabilities identified by ESBMC, linked to Common Weakness Enumeration identifiers.

The dataset comprises three distinct files:

  • FormAI dataset C samples-V1.zip – contains all 112,000 C files.
  • FormAI dataset classification-V1.zip – contains a CSV file with the original code and the vulnerability classification.
  • FormAI dataset human readable-V1.csv – a human-readable version without the code.

The dataset can be accessed on both GitHub and IEEE DataPort:
GitHub: https://github.com/FormAI-Dataset/
IEEE DataPort: https://dx.doi.org/10.21227/vp9n-wv96

The paper drew the following conclusions:

  • RQ1: How likely is purely LLM-generated code to contain vulnerabilities on the first output when using simple zero-shot text-based prompts?
    Answer: At least 51.24% of the samples from the 112,000 C programs contain vulnerabilities. This indicates that GPT-3.5 often produces vulnerable code. Therefore, one should exercise caution when considering its output for real-world projects. 
  • RQ2: What are the most typical vulnerabilities LLMs introduce when generating code?
    Answer: For GPT-3.5, Arithmetic Overflow, Array Bounds Violation, Buffer Overflow, and various Dereference Failure issues were among the most common vulnerabilities. Several of these correspond to entries in MITRE's Top 25 list of CWEs.

Authors

Tamas György Bisztray, Vasileios Mavroeidis

University of Oslo (UiO)

[1] Tihanyi et al., “The FormAI Dataset: Generative AI in Software Security through the Lens of Formal Verification.” https://dl.acm.org/doi/10.1145/3617555.3617874