
Verifying Collections-C Arrays

Kevin Solmssen

Formal verification has gained significant traction in recent years for ensuring software correctness and reliability, especially in critical industrial applications. However, existing solutions either sacrifice usability, by requiring extensive program annotation, or expressiveness, leaving certain desired properties impossible to specify.

TPot offers an automated verification framework that attempts to strike a middle ground by supporting strong specifications without requiring exhaustive annotations. In this project, we assess TPot’s capabilities on an unverified codebase: the array data structures of the Collections-C library. First, we ported existing symbolic tests to TPot, evaluating their expressiveness. We then augmented these tests with more general invariants, proving correctness over arbitrarily large arrays and an unbounded sequence of API calls. Our results show that TPot’s proof annotations remain within reasonable effort compared to the complexity of the verified code. Along the way, we discovered a previously undetected integer overflow bug in the library’s resize function, highlighting TPot’s practical value in detecting subtle errors.

Introduction
#

In recent years, formal verification techniques have seen increasing use in ensuring the correctness and security of software [@bornholdt_2021; @leblanc_2023], especially in critical industrial applications [@li_2021; @nyberg_2018]. These methods offer strong guarantees about software behavior, helping to identify bugs, vulnerabilities, and security risks.

Full-fledged interactive verification frameworks can be time-intensive and require significant expertise, making them less practical for many real-world software projects. In contrast, unit testing and lightweight validation techniques, while more accessible, fail to provide the same level of assurance.

TPot [@tpot] has been shown to be a practical automated verifier that strikes a balance between expressiveness and usability on previously verified codebases. However, these claims have yet to be tested on unverified codebases.

In this project, we apply TPot to an unverified industrial code repository containing commonly used data structures written in C, evaluating TPot’s claims while strengthening the codebase’s guarantees and making it more reliable for critical uses. We also compare TPot to other verification frameworks such as Gillian [@gillian_2021] and CN [@cn_2023], discussing their differences in expressiveness, structure, and user experience to better understand their respective strengths and weaknesses.

Background
#

Software verification exists on a spectrum, ranging from simple unit testing where individual code components are tested in isolation to full-fledged interactive theorem proving, which requires developers to manually construct correctness proofs for each function. Between these extremes, automated verification frameworks attempt to balance rigor and usability, reducing the manual effort required while still providing strong guarantees of correctness.

The complexity of formal verification has historically limited its adoption: proofs require specialized knowledge and are time-intensive to produce [@klein_2009], deterring developers from using them.

Several verification frameworks have emerged to address different trade-offs in expressiveness and usability. Among them, the CN verifier, Gillian, and TPot represent distinct approaches to automated verification.

The CN verifier [@cn_2023] offers a high level of expressiveness, supporting general recursive predicates and full universal quantification. These features make it particularly well-suited for verifying complex data structures and systems with intricate invariants. For instance, CN allows for the verification of functions manipulating recursive data structures such as doubly linked lists, ensuring that desired properties of the structure hold. However, this expressiveness comes at a cost: CN requires extensive annotation, necessitating the specification of all internal functions. This significantly increases the time and expertise required to use CN effectively.

In contrast, the Gillian verifier [@gillian_2021] is designed for ease of use, focusing on compositional symbolic execution and modular reasoning. Gillian allows developers to verify properties with less effort than CN by automating more of the reasoning over code components. The trade-off is weaker guarantees: while Gillian effectively verifies properties over a bounded number of function calls, it struggles to comprehensively verify properties that must hold over an arbitrary number of function calls.

TPot [@tpot] is a novel verification framework for single-threaded C systems code that claims a level of expressiveness similar to CN’s while keeping a developer’s manual effort low. Built on the KLEE symbolic execution engine, TPot allows a developer to write "proof-oriented tests" (POTs), which can state and verify stronger formal specifications than symbolic tests. Unlike CN, TPot does not require developers to specify all internal functions, making verification more accessible. However, to achieve this ease of use and these strong guarantees, specifications written in TPot take longer to verify than with other verifiers.

While TPot sacrifices some expressiveness, such as general recursive predicates and full universal quantification, its approach ensures a practical middle ground. This makes TPot particularly useful in CI pipelines and for component-level specifications where verification time is a reasonable trade-off for reduced annotation overhead.

TPot has demonstrated promising results when applied to five codebases that had already been previously verified by state-of-the-art provers such as VeriFast [@verifast2008], RefinedC [@refinedC_2021], CN [@cn_2023], and Serval [@serval_2019]. These results highlight TPot’s ability to reduce manual effort while maintaining verification times that are low enough for integration into CI pipelines. Encouraged by these findings, we now aim to extend this evaluation to an unverified codebase, assessing whether TPot can uphold its efficiency and if TPot’s limitations in expressiveness pose a significant constraint.

Design
#

Verifying Collections-C array
#

For our verification efforts, we selected the array data structure from the Collections-C library [@collections_c]. The library, which has almost 3K stars on GitHub, also provides other real-world data structures including lists, treetables, hashtables, and queues. This choice was driven by our objective of verifying a real-world, practical systems codebase, as well as by the symbolic tests already written for the codebase using Gillian [@gillian-code], which let us compare our efforts against prior work.

Furthermore, the Collections-C library’s arrays employ dynamic memory management, which introduces a level of complexity well-suited for testing TPot’s verification capabilities. Dynamic memory managed arrays allocate memory at runtime to accommodate varying sizes of stored elements. While this provides flexibility and efficiency for a wide range of applications, it also introduces challenges in ensuring correctness and safety. These challenges include:

  • Memory Safety Concerns: Dynamic memory management involves allocation and deallocation operations, which are prone to common errors such as memory leaks, dangling pointers, and double-free errors. Verifying that these operations are correctly implemented and that memory is safely managed is non-trivial.

  • Pointer Arithmetic and Aliasing: Dynamic arrays often involve extensive pointer manipulations, which can lead to aliasing issues and subtle bugs that are harder to detect and verify compared to statically allocated arrays.

The Collections-C array data structure is a dynamically resizable array that efficiently manages memory allocation and deallocation. Many programming languages provide this data structure in their standard libraries; C++, for example, calls it a vector. It consists of a buffer that stores elements, along with metadata such as size (current number of elements) and capacity (allocated memory slots). The expansion factor (exp_factor) determines how the array grows when it reaches capacity. Additionally, the structure includes function pointers for memory management (mem_alloc, mem_calloc, mem_free), allowing customization of allocation strategies. This design provides flexibility while maintaining efficient memory usage.
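Based on this description, the structure looks roughly as follows (a sketch assembled from the fields named above; the library’s actual definition may differ in detail):

typedef struct cc_array_s {
    size_t   size;        /* current number of elements */
    size_t   capacity;    /* allocated memory slots */
    float    exp_factor;  /* growth factor applied when capacity is reached */
    void   **buffer;      /* element storage */

    void *(*mem_alloc)  (size_t size);                /* customizable  */
    void *(*mem_calloc) (size_t blocks, size_t size); /* allocation    */
    void  (*mem_free)   (void *block);                /* strategies    */
} CC_Array;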

The Collections-C array API includes many functions for managing and manipulating arrays, of which we study four key functions as part of our verification efforts (a usage sketch follows the list):

  • cc_array_add: Appends a new element to the array. If the array is at full capacity, an internal expand_capacity function is called using the defined expansion factor to allocate a new buffer which is large enough for the added element.

  • cc_array_remove: Removes the first occurrence of a specified element from the array. If the element is found, subsequent elements are shifted left to maintain contiguous storage.

  • cc_array_reverse: Reverses the order of elements in the array. This operation swaps elements symmetrically from the beginning to the end of the buffer, ensuring that all elements are repositioned correctly while preserving the size and capacity.

  • cc_array_get_at: Retrieves the element stored at a specified index. The function ensures that the requested index is within bounds, returning the corresponding element if valid or an error if out of bounds.
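To make the API concrete, here is a small, hypothetical usage sketch of these four functions (we assume cc_array_new and cc_array_destroy as the library’s constructor and destructor; error handling is abbreviated):

#include "cc_array.h"   /* Collections-C array header; name may vary by version */
#include <assert.h>

int main(void) {
    CC_Array *arr;
    if (cc_array_new(&arr) != CC_OK)    /* create an empty array */
        return 1;

    int a = 1, b = 2;
    cc_array_add(arr, &a);              /* contents: [&a]     */
    cc_array_add(arr, &b);              /* contents: [&a, &b] */
    cc_array_reverse(arr);              /* contents: [&b, &a] */

    void *out;
    if (cc_array_get_at(arr, 0, &out) == CC_OK)
        assert(*(int *)out == 2);       /* first element is now b */

    void *removed;
    cc_array_remove(arr, &a, &removed); /* contents: [&b] */

    cc_array_destroy(arr);
    return 0;
}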

Previous Collections-C Verification Efforts
#

The Collections-C library has already been partially verified using the Gillian verifier [@gillian_2021], which focuses on compositional symbolic execution and modular reasoning. However, Gillian’s verification case study of Collections-C is limited in scope: it verifies the library’s data structures only for scenarios with a small number of elements and API calls as seen in the following example, where three symbolic integers are added to the dynamic array with three calls to the library’s API.

int main() {
    CC_Array *arr;
    int stat = cc_array_new(&arr);

    int a = __builtin_annot_intval("symb_int", a);
    int b = __builtin_annot_intval("symb_int", b);
    int c = __builtin_annot_intval("symb_int", c);

    cc_array_add(arr, &a);
    cc_array_add(arr, &b);
    cc_array_add(arr, &c);

    int *ar, *br, *cr;
    cc_array_get_at(arr, 0, (void **)&ar);
    cc_array_get_at(arr, 1, (void **)&br);
    cc_array_get_at(arr, 2, (void **)&cr);

    ASSERT(a == *ar);
    ASSERT(b == *br);
    ASSERT(c == *cr);
}

where cc_array_new initializes an empty array and __builtin_annot_intval initializes a symbolic integer.

This type of test is unable to provide guarantees regarding the state of the array after an arbitrary number of function calls. This limitation arises because Gillian’s verification is constrained by the bounded nature of symbolic execution, making it unsuitable for reasoning about general properties that hold indefinitely over multiple function invocations.

While useful for identifying basic correctness properties, this limited scope does not address the more comprehensive properties necessary for real-world usage, such as handling larger datasets or guaranteeing properties after an arbitrary number of function calls.

With our project, we extend and strengthen the verification of the Collections-C array library. Unlike Gillian, TPot provides a mechanism to verify properties of the arrays over broader and more realistic scenarios, including:

  • Verification with arbitrary data structure size: Ensuring correctness and memory safety when arrays handle large numbers of elements, a critical feature for practical applications.

  • Guaranteeing Stronger Properties: Beyond basic safety, we focus on verifying properties such as dynamic resizing correctness and functional correctness.

  • Robustness Across API Sequences: Guaranteeing a valid state of the data structure regardless of the number of operations performed on it through various API calls.

Collections-C Array Invariants
#

To ensure the correctness of the dynamically managed arrays in the Collections-C library, we define and verify key invariants that encapsulate the expected properties of a valid Collection-C array. TPot allows for the specification of global invariants that must hold over an arbitrary sequence of API calls. Through inductive reasoning, TPot ensures that these invariants are preserved across multiple calls. In the implementation section, we leverage TPot to verify these invariants under various scenarios, including resizing operations involving capacity increases, element insertions and deletions that modify the array size, and other library API calls.

By formally verifying these invariants, we extend the guarantees provided by previous verification efforts, demonstrating not only that the array functions correctly for small data sizes but that it maintains correctness and safety at any size. This ensures the reliability and robustness of the Collections-C array library in practical applications.

Implementation
#

Before diving into the formal verification of the Collections-C array data structure using TPot, we began by converting the Collections-C array symbolic tests originally written with Gillian into TPot and measuring the performance differences. This provided insight into how TPot handles verification tasks compared to an alternative framework and allowed us to familiarize ourselves with TPot and formal verification in general.

The symbolic tests written with Gillian were designed to verify array operations using compositional symbolic execution. We rewrote these tests in TPot, maintaining the same verification objectives but leveraging TPot’s proof-oriented test methodology.

Array Invariant
#

Having established a foundational understanding of TPot and observed its efficiency, we proceeded with a comprehensive verification of the Collections-C array data structure, ensuring correctness across various API functions and scenarios.

We applied formal verification to these API functions to ensure their correctness under different scenarios. Each verification was conducted using specifications that assert the desired properties while also guaranteeing that the following TPot invariant holds:

bool inv__valid_array() {
    if (!names_obj(arr, CC_Array))                      return false;
    if (arr->capacity <= 0)                             return false;
    if (arr->capacity > CC_MAX_ELEMENTS)                return false;
    if (arr->size > arr->capacity)                      return false;
    if (arr->size < 0)                                  return false;
    if (!names_obj(arr->buffer, void *[arr->capacity])) return false;
    return true;
}

  • Valid Object Type: The array object (arr) must point to an object as large as sizeof(CC_Array), which is verified using names_obj(arr, CC_Array).

  • Capacity Constraints: The array’s capacity must be greater than 0 and less than or equal to a defined maximum CC_MAX_ELEMENTS. These constraints ensure that the array capacity is properly maintained and avoids exceeding platform-defined limits. The library sets this value to the maximum possible size_t, which triggers the bug we discuss later. We verify the invariant by setting CC_MAX_ELEMENTS to MAX_SIZE_T/16, where MAX_SIZE_T is the largest possible value of type size_t; a sketch of this definition follows the list.

  • Size Constraints: The size of the array, representing the number of elements currently stored, must lie within the range $[0, capacity]$. This prevents buffer overflows and ensures no invalid indices are accessed.

  • Buffer Validation: The array’s buffer must be a valid pointer to an array capable of holding up to capacity elements of the specified array element type.
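As referenced in the capacity bullet above, the bound we verify against can be sketched as a single definition (hypothetical naming for the verification harness; the library’s own CC_MAX_ELEMENTS is effectively the maximum size_t):

#include <stdint.h>

/* Verification-time bound on the array capacity; SIZE_MAX is C's
 * standard name for the largest size_t value (MAX_SIZE_T in the text). */
#define CC_MAX_ELEMENTS (SIZE_MAX / 16)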

Verification of cc_array_add
#

The function cc_array_add appends an element to the end of an array, expanding its capacity when necessary.

To verify its correctness, we wrote the following POT, which specifies that after calling cc_array_add with a given value, the last entry in the underlying array buffer corresponds to that value:

void spec__inv_add() {
    any(void *, a);
    size_t size_orig = arr->size;

    int stat = cc_array_add(arr, a);

    if (stat == CC_OK) {
        assert(arr->size == size_orig + 1);
        assert(arr->buffer[arr->size-1] == a);
    }
}

To verify that the array is resized correctly, we modify the type of the expansion factor from a float to an integer and fix its value to the default of 2. This is necessary because KLEE unsoundly concretizes symbolic floats, which would underapproximate our specification.

This formal verification confirms that the append operation behaves as expected, preserving memory safety and ensuring data integrity.

Verification of cc_array_remove
#

The cc_array_remove function removes the first occurrence of a given value from the array. We wrote the following POT, specifying that the array’s size decreases by one when the remove call succeeds.

void spec__inv_remove() {
    any(void *, a); void *out;
    size_t size_orig = arr->size;

    int stat = cc_array_remove(arr, a, &out);
    if (stat == CC_OK) {
        assert(arr->size + 1 == size_orig);
    }
}

Furthermore, we wrote an axiomatic specification guaranteeing that the removed element is the same one previously added using the cc_array_add function.

void spec__inv_remove_ax() {
    any(void *, a); void *out;
    size_t size_orig = arr->size;

    int stat = cc_array_add(arr, a);

    stat = cc_array_remove(arr, a, &out);
    if (stat != CC_OK) return;
    assert(a == out);
    assert(size_orig == arr->size);
}

We add an implementation of memmove for TPot to verify the remove function, replacing the call in the Collections-C library with a simple loop that copies one memory region to the other. This is necessary because TPot does not assume the correct behavior of certain standard C library functions and instead verifies their implementations. The TPot library provides implementations for some commonly used functions such as memcpy and memset, but not for memmove. Our implementation is not fully general, however: memmove must support overlapping source and destination regions, which our simple copy does not handle in all cases.
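A minimal sketch of the replacement, under the assumption that the destination never starts above the source (which holds for cc_array_remove’s left shift):

#include <stddef.h>

/* Simplified memmove substituted for verification: a plain forward byte
 * copy. Unlike the real memmove, this is only correct when the regions do
 * not overlap or when dst lies below src; cc_array_remove shifts elements
 * left, so dst < src holds at that call site. */
void *memmove(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}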

By verifying this property, we ensure that cc_array_remove correctly identifies and removes the intended element without affecting other elements or causing unintended side effects.

Verification of cc_array_reverse
#

The function cc_array_reverse inverts the order of elements within the array. To verify this behavior, we specified the following POT:

bool reverse_prev_elems_copied(uint64_t *unused, int64_t i,
                               void **buffer, size_t size) {
    if (i < 0 || i >= size) return true;
    let_old(void *, old_buffer_entry, buffer[size-1-i]);
    return buffer[i] == old_buffer_entry;
}

void spec__inv_reverse() {
    size_t size_original = arr->size;
    size_t capacity_original = arr->capacity;
    cc_array_reverse(arr);
    assert(forall_elem_(arr->buffer, &reverse_prev_elems_copied,
                        arr->buffer, arr->size));
    assert(arr->size == size_original);
    assert(arr->capacity == capacity_original);
}

ensuring that the elements in the reversed array appear in the exact opposite order from their original arrangement, and that the size and capacity of the array remain unchanged.

To verify this property, we add a new API call to TPot, let_old, which binds the pre-call value of a given memory location to a fresh local variable. This allows us to state properties relating the state of the array before and after the reverse function is called.

Verification of cc_array_get_at
#

The function cc_array_get_at retrieves the value at a given index within the array. To ensure its correctness, we specified properties verifying that:

  • Retrieving an element at a valid index correctly returns the stored value.

  • Attempting to access an index greater than or equal to size results in an out-of-bounds error code.

This verification ensures that cc_array_get_at maintains memory safety by preventing invalid memory accesses, which could lead to crashes or security vulnerabilities.
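For illustration, a POT for these two properties, written in the style of our earlier specifications, might look as follows (a sketch; we assume CC_ERR_OUT_OF_RANGE is the error code Collections-C returns for invalid indices):

void spec__inv_get_at() {
    any(size_t, idx);
    void *out;

    int stat = cc_array_get_at(arr, idx, &out);

    if (idx < arr->size) {
        /* A valid index succeeds and returns the stored element. */
        assert(stat == CC_OK);
        assert(out == arr->buffer[idx]);
    } else {
        /* An out-of-bounds index yields an error code. */
        assert(stat == CC_ERR_OUT_OF_RANGE);
    }
}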

Loop Invariants and TPot Loop Unrolling
#

In our implementation, we leveraged TPot’s loop unrolling capability to simplify the conversion of the Gillian test cases. Specifically, we did not add explicit loop invariants when verifying the symbolic tests for the Collections-C array functions. Instead, we relied on TPot’s ability to handle bounded loops through automatic unrolling, allowing us to verify these cases without the need for additional annotations.

However, for verifying our stronger array invariants, we found it necessary to introduce explicit loop invariants. Without these invariants, TPot could not establish correctness properties over all loop iterations, making verification infeasible for operations that modify array contents.

For example, when verifying the cc_array_reverse function, we introduced an invariant using the let_old construct. This allowed us to specify that each element at index i was swapped with its corresponding element at index size - 1 - i, ensuring correctness across all loop iterations. Similarly, for cc_array_add, we specified an invariant ensuring that only the last entry of the array was modified while all other elements remained unchanged.

Evaluation
#

Annotation Overhead
#

To assess the verification effort required for the Collections-C array library, we measured the annotation overhead in terms of specification lines, global definitions, and other proof-related constructs, similar to the original TPot evaluation. The results of our analysis are summarized in Table 1 below.

For the Collections-C array verification, we observed an annotation overhead of 104 lines, with 94 lines dedicated to specifying API functions and related definitions and 10 lines for global invariants and global data structure predicates. Of these, 31 lines were categorized as "syntax-only," meaning they were purely syntactic or stylistic rather than necessary for the correctness proofs. The total annotation overhead relative to the codebase came to 98%. A more meaningful metric, however, is the proof-to-code ratio, which excludes syntax-only lines and stands at 68%.

Interestingly, the annotation overhead we observed aligns closely with the proof-to-code ratios reported in the original TPot paper for already verified codebases. In that study, TPot’s proof-to-code ratio for previously verified software ranged from 35% to 82%, depending on the complexity of the codebase. Our result of 68% falls within this range, reinforcing the idea that TPot keeps annotation requirements on unverified codebases as low as on already verified projects. For a more direct comparison, a specification of our verified properties written for another verifier such as CN would be required.

Execution Time
#

To evaluate TPot’s performance, we measured the total verification time required for the converted Gillian test cases on the Collections-C array and compared the results to Gillian’s performance. Averaged over 10 runs, the Gillian test cases took 22.47 minutes to complete, whereas TPot completed the same set in 5.54 minutes, reducing verification time by approximately 75.3%. We were not able to reproduce the execution times reported by the original Gillian authors [@gillian_2021], indicating a possible discrepancy in those results.

Further evaluation of TPot’s verification capabilities on the Collections-C array invariants yielded promising results. We evaluated execution time on a Linux desktop computer with a 12th Gen Intel Core i7-12700F, running TPot in a Docker container. The execution times for verifying various properties of the array data structure are summarized in Table 2 below. The total verification time across all tested specifications amounted to 7.4 minutes. This duration is reasonable given the complexity of the verification tasks and aligns with expectations for formal verification within a CI pipeline.

Table 1: Annotation overhead for the Collections-C array verification.

                    # Lines
Specifications      94
Globals             10
Total               104
Syntax-only         31
Total %             98%
Proof-to-code %     68%

Table 2: TPot verification time per specification.

                    Time [s]
spec_get_at         8
spec_index_of       11
spec_remove         51
spec_reverse        70
spec_add            289
Total               429

Memory Allocation Overflow Bugs
#

Through our verification efforts, we uncovered a critical bug in the expand_capacity function of the Collections-C array library. This internal function is responsible for dynamically resizing the underlying buffer when additional capacity is required. However, our analysis revealed a vulnerability that could lead to the new buffer not having enough allocated space due to integer overflow.

The problematic code is:

void **new_buff = malloc(new_capacity * sizeof(void*));

This statement allocates memory for the resized array. However, if new_capacity is larger than MAX_SIZE_T / 8, multiplying it by sizeof(void*) (which is typically 8 bytes on 64-bit systems) overflows, causing a smaller-than-expected buffer to be allocated.

The bug is triggered by calling cc_array_add on an array that has already reached its current capacity (i.e., size == capacity). In this scenario, the library attempts to allocate a new buffer inside expand_capacity to accommodate additional elements, using a growth factor (commonly doubling the existing capacity). If the new capacity is large enough that multiplying it by sizeof(void*) exceeds the maximum representable size in size_t, the result overflows to a smaller value than intended. Consequently, malloc returns a buffer that is much smaller than required, and subsequent writes beyond the allocated boundary can corrupt the heap, lead to security vulnerabilities or trigger other undefined behavior.

This type of bug was not uncovered by previous symbolic testing efforts, such as those conducted using Gillian, due to the bounded nature of symbolic execution and the absence of verification over unbounded sequences of API calls. Since the bug only occurs when size == capacity and capacity >= MAX_SIZE_T / 8, numerous calls to cc_array_add must be performed for the bug to manifest itself. Symbolic testing over small, bounded executions never exercised the conditions required for overflow.

Additionally, finding this overflow bug using TPot led us to manually identify a similar issue in the cc_array_new_conf function, which initializes a new array given a user-supplied configuration. In this function, the user can specify a capacity value, which is directly used for allocating the buffer’s memory of the fresh array. If the provided capacity exceeds MAX_SIZE_T / 8, the same integer overflow occurs, leading to a smaller-than-expected buffer allocation.

To further illustrate this issue, we provide an implementation that reproduces the bug.
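The following stand-alone sketch mirrors the problematic allocation (hypothetical code, not the library’s own implementation):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t capacity = SIZE_MAX / 8 + 1;    /* array already at capacity */
    size_t new_capacity = capacity * 2;    /* growth factor of 2 */

    /* new_capacity * sizeof(void *) wraps around on 64-bit systems;
     * for this choice of capacity it wraps to exactly 0. */
    size_t alloc_bytes = new_capacity * sizeof(void *);
    printf("slots requested: %zu, bytes allocated: %zu\n",
           new_capacity, alloc_bytes);

    void **new_buff = malloc(alloc_bytes); /* far smaller than required */
    /* Writing new_capacity elements into new_buff would overflow the heap. */
    free(new_buff);
    return 0;
}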

These findings highlight the importance of automated verification tools like TPot in uncovering deep-seated memory allocation vulnerabilities that might otherwise go undetected through conventional testing methods.

Timeline
#

The project’s timeline deviated from the initial expectations in a few key areas.

The first phase, familiarization with TPot and formal verification, proceeded as planned over the expected two-week period. Gaining an understanding of TPot’s proof-oriented tests (POTs) and symbolic execution, as well as reviewing documentation from other verification frameworks, was straightforward. No significant delays occurred at this stage.

The second phase, searching for an unverified codebase, also stayed within its allotted three weeks for identifying and evaluating potential C codebases. We evaluated several projects, such as SQLite [@sqlite-code] and memcached [@memcached-code]. However, the Collections-C library was selected as the best fit: it met the complexity criteria and already had symbolic tests written in Gillian, which provided a useful reference for our verification efforts.

The third phase, verification and usability evaluation, took longer than expected. Originally scheduled for seven weeks, it ultimately extended to about nine. The primary reason for this delay was the complexity of working with TPot’s proof primitives as a first-time user of formal verification. A major challenge was debugging verification failures: understanding why a POT did not pass often took significant time, as the failure messages were not always immediately clear. Writing loop invariants proved particularly challenging, since verifiers require precisely stated invariants, and ensuring that the verification captured all relevant properties of the data structure was more difficult than anticipated. The initial plan was to apply TPot to multiple codebases, but due to these difficulties, the project verified only one: the Collections-C library.

Conclusion
#

In this project, we evaluated the effectiveness of TPot as a formal verification framework for previously unverified codebases by applying it to the Collections-C array library. Our findings demonstrate that TPot provides a promising balance between usability and expressiveness. Unlike Gillian’s case study, which is limited to verifying small-scale symbolic executions, TPot allows for the verification of array operations over arbitrary numbers of function calls while maintaining reasonable annotation overhead and execution time. Furthermore, our verification efforts uncovered a critical integer overflow vulnerability in the expand_capacity function, highlighting the practical benefits of applying automated verification to real-world software.

One of the key strengths of TPot observed in our study is its ability to verify a broad range of properties, including memory safety, functional correctness, and structural invariants. These strong properties, however, come at the cost of longer verification times, which we consider acceptable when the verification tasks are part of a CI pipeline. Additionally, while TPot reduces the need for developers to specify all internal functions manually, one modification to the original code, changing the expansion factor to an integer, was necessary to accommodate KLEE’s constraints.