Dynamic Arrays#
When creating a low-level array in a computer system, the precise size of that array must be explicitly declared in order for the system to properly allocate a consecutive piece of memory for its storage. For example, Figure 5.11 displays an array of 12 bytes that might be stored in memory locations 2146 through 2157.
![https://i.ibb.co/ggpDSqz/array12.png](https://i.ibb.co/ggpDSqz/array12.png)
An array of 12 bytes allocated in memory locations 2146 through 2157#
Because the system might dedicate neighboring memory locations to store other data, the capacity of an array cannot trivially be increased by expanding into subsequent cells. In the context of representing a Python tuple or str instance, this constraint is no problem. Instances of those classes are immutable, so the correct size for an underlying array can be fixed when the object is instantiated.
Python’s list class presents a more interesting abstraction. Although a list has a particular length when constructed, the class allows us to add elements to the list, with no apparent limit on the overall capacity of the list. To provide this abstraction, Python relies on an algorithmic sleight of hand known as a dynamic array.
The first key to providing the semantics of a dynamic array is that a list instance maintains an underlying array that often has greater capacity than the current length of the list. For example, while a user may have created a list with five elements, the system may have reserved an underlying array capable of storing eight object references (rather than only five). This extra capacity makes it easy to append a new element to the list by using the next available cell of the array.
If a user continues to append elements to a list, any reserved capacity will eventually be exhausted. In that case, the class requests a new, larger array from the system, and initializes the new array so that its prefix matches that of the existing smaller array. At that point in time, the old array is no longer needed, so it is reclaimed by the system. Intuitively, this strategy is much like that of the hermit crab, which moves into a larger shell when it outgrows its previous one.
We give empirical evidence that Python's list class is based upon such a strategy. The source code for our experiment is displayed in Code Fragment 5.1, and a sample output of that program is given in Code Fragment 5.2. We rely on a function named getsizeof that is available from the sys module. This function reports the number of bytes that are being used to store an object in Python. For a list, it reports the number of bytes devoted to the array and other instance variables of the list, but not any space devoted to elements referenced by the list.
```python
import sys                          # provides getsizeof function
n = 27
data = []
for k in range(n):                  # note: must fix choice of n
    a = len(data)                   # number of elements
    b = sys.getsizeof(data)         # actual size in bytes
    print('Length: {0:3d}; Size in bytes: {1:4d}'.format(a, b))
    data.append(None)               # increase length by one
```
```
Length:   0; Size in bytes:   56
Length:   1; Size in bytes:   88
Length:   2; Size in bytes:   88
Length:   3; Size in bytes:   88
Length:   4; Size in bytes:   88
Length:   5; Size in bytes:  120
Length:   6; Size in bytes:  120
Length:   7; Size in bytes:  120
Length:   8; Size in bytes:  120
Length:   9; Size in bytes:  184
Length:  10; Size in bytes:  184
Length:  11; Size in bytes:  184
Length:  12; Size in bytes:  184
Length:  13; Size in bytes:  184
Length:  14; Size in bytes:  184
Length:  15; Size in bytes:  184
Length:  16; Size in bytes:  184
Length:  17; Size in bytes:  248
Length:  18; Size in bytes:  248
Length:  19; Size in bytes:  248
Length:  20; Size in bytes:  248
Length:  21; Size in bytes:  248
Length:  22; Size in bytes:  248
Length:  23; Size in bytes:  248
Length:  24; Size in bytes:  248
Length:  25; Size in bytes:  312
Length:  26; Size in bytes:  312
```
In evaluating the results of the experiment, we draw attention to the first line of output from Code Fragment 5.2. We see that an empty list instance already requires a certain number of bytes of memory (56 on our system). In fact, each object in Python maintains some state, for example, a reference to denote the class to which it belongs. Although we cannot directly access private instance variables for a list, we can speculate that in some form it maintains state information akin to:
- _n: The number of actual elements currently stored in the list.
- _capacity: The maximum number of elements that could be stored in the currently allocated array.
- _A: The reference to the currently allocated array (initially None).
As soon as the first element is inserted into the list, we detect a change in the underlying size of the structure. In particular, we see the number of bytes jump from 56 to 88, an increase of exactly 32 bytes. Our experiment was run on a 64-bit machine architecture, meaning that each memory address is a 64-bit number (i.e., 8 bytes). We speculate that the increase of 32 bytes reflects the allocation of an underlying array capable of storing four object references. This hypothesis is consistent with the fact that we do not see any underlying change in the memory usage after inserting the second, third, or fourth element into the list.
After the fifth element has been added to the list, we see the memory usage jump from 88 bytes to 120 bytes. If we assume the original base usage of 56 bytes for the list, the total of 120 suggests an additional 64 = 8 × 8 bytes that provide capacity for up to eight object references. Again, this is consistent with the experiment, as the memory usage does not increase again until the ninth insertion. At that point, the 184 bytes can be viewed as the original 56 plus an additional 128-byte array to store 16 object references. The 17th insertion pushes the overall memory usage to 248 = 56 + 192 = 56 + 24 × 8, hence enough to store up to 24 object references.
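The arithmetic above can be automated. Assuming a 64-bit CPython build (8 bytes per object reference) and that the fixed overhead of a list equals the size of an empty list, the capacity implied by getsizeof can be recovered directly. This is a sketch under those assumptions; the helper name implied_capacity is ours, not part of any library:

```python
import sys

BYTES_PER_REF = 8                  # assumption: 64-bit CPython build
BASE = sys.getsizeof([])           # assumption: fixed overhead of any list

def implied_capacity(lst):
    '''Estimate how many references the list's underlying array can hold.'''
    return (sys.getsizeof(lst) - BASE) // BYTES_PER_REF

data = []
for k in range(27):
    data.append(None)
    # The inferred capacity is never smaller than the actual length.
    assert implied_capacity(data) >= len(data)
```

Printing implied_capacity(data) after each append reproduces the capacity sequence (4, 8, 16, 24, ...) deduced above.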
Because a list is a referential structure, the result of getsizeof for a list instance only includes the size for representing its primary structure; it does not account for memory used by the objects that are elements of the list. In our experiment, we repeatedly append None to the list, because we do not care about the contents, but we could append any type of object without affecting the number of bytes reported by getsizeof(data).
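A quick check of this claim: in CPython, two lists grown by the same sequence of appends end up with the same underlying capacity, so getsizeof reports the same value for both regardless of how large their elements are.

```python
import sys

a, b = [], []
for k in range(100):
    a.append(None)           # tiny elements
    b.append('x' * 1000)     # much larger elements
# Same growth history, hence the same primary structure size.
assert sys.getsizeof(a) == sys.getsizeof(b)
```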
If we were to continue such an experiment for further iterations, we might try to discern the pattern for how large of an array Python creates each time the capacity of the previous array is exhausted (see Exercises R-5.2 and C-5.13). Before exploring the precise sequence of capacities used by Python, we continue in this section by describing a general approach for implementing dynamic arrays and for performing an asymptotic analysis of their performance.
Implementing a Dynamic Array#
Although the Python list class provides a highly optimized implementation of dynamic arrays, it is instructive to see how such a class might be implemented.
The key is to provide means to grow the array A that stores the elements of a list. Of course, we cannot actually grow that array, as its capacity is fixed. If an element is appended to a list at a time when the underlying array is full, we perform the following steps:

1. Allocate a new array B with larger capacity.
2. Set B[i] = A[i], for \(i = 0,...,n−1\), where n denotes the current number of items.
3. Set A = B, that is, we henceforth use B as the array supporting the list.
4. Insert the new element in the new array.
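These steps can be sketched in isolation before building a full class. The grow function below is our own illustrative helper (not part of ctypes); it performs steps 1 and 2, leaving steps 3 and 4 to the caller:

```python
import ctypes

def grow(A, n, new_capacity):
    '''Return a new low-level array of the given capacity whose first n
    cells hold the same references as A.'''
    B = (new_capacity * ctypes.py_object)()   # step 1: allocate larger array
    for i in range(n):                        # step 2: copy existing references
        B[i] = A[i]
    return B

A = (4 * ctypes.py_object)()                  # a full array of capacity 4
for i in range(4):
    A[i] = i * i
A = grow(A, 4, 8)                             # step 3: reassign reference A
A[4] = 16                                     # step 4: insert the new element
```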
![https://i.ibb.co/gdktfcN/array13.png](https://i.ibb.co/gdktfcN/array13.png)
An illustration of the three steps for “growing” a dynamic array: (a) create new array B; (b) store elements of A in B; (c) reassign reference A to the new array. Not shown is the future garbage collection of the old array, or the insertion of the new element.#
The remaining issue to consider is how large of a new array to create. A commonly used rule is for the new array to have twice the capacity of the existing array that has been filled.
The code cell below offers a concrete implementation of dynamic arrays in Python. Our DynamicArray class is designed using ideas described in this section. While consistent with the interface of a Python list class, we provide only limited functionality: an append method, accessors __len__ and __getitem__, and the mutators insert and remove. Support for creating low-level arrays is provided by a module named ctypes. Because we will not typically use such a low-level structure in the remainder of this book, we omit a detailed explanation of the ctypes module. Instead, we wrap the necessary command for declaring the raw array within a private utility method _make_array. The hallmark expansion procedure is performed in our nonpublic _resize method.
```python
import ctypes                                       # provides low-level arrays

class DynamicArray:
    '''A dynamic array class akin to a simplified Python list.'''

    def __init__(self):
        '''Create an empty array.'''
        self._n = 0                                 # count actual elements
        self._capacity = 1                          # default array capacity
        self._A = self._make_array(self._capacity)  # low-level array

    def __len__(self):
        '''Return number of elements stored in the array.'''
        return self._n

    def __getitem__(self, k):
        '''Return element at index k.'''
        if not 0 <= k < self._n:
            raise IndexError('invalid index')
        return self._A[k]                           # retrieve from array

    def append(self, obj):
        '''Add object to end of the array.'''
        if self._n == self._capacity:               # not enough room
            self._resize(2 * self._capacity)        # so double capacity
        self._A[self._n] = obj
        self._n += 1

    def _resize(self, c):                           # nonpublic utility
        '''Resize internal array to capacity c.'''
        B = self._make_array(c)                     # new (bigger) array
        for k in range(self._n):                    # for each existing value
            B[k] = self._A[k]
        self._A = B                                 # use the bigger array
        self._capacity = c

    def _make_array(self, c):                       # nonpublic utility
        '''Return new array with capacity c.'''
        return (c * ctypes.py_object)()             # see ctypes documentation

    def insert(self, k, value):
        '''Insert value at index k, shifting subsequent values rightward.'''
        # (for simplicity, we assume 0 <= k <= n in this version)
        if self._n == self._capacity:               # not enough room
            self._resize(2 * self._capacity)        # so double capacity
        for j in range(self._n, k, -1):             # shift rightmost first
            self._A[j] = self._A[j-1]
        self._A[k] = value                          # store newest element
        self._n += 1

    def remove(self, value):
        '''Remove first occurrence of value (or raise ValueError).'''
        # (for simplicity, we do not consider shrinking the dynamic array in this version)
        for k in range(self._n):
            if self._A[k] == value:                 # found a match!
                for j in range(k, self._n - 1):     # shift others to fill gap
                    self._A[j] = self._A[j+1]
                self._A[self._n - 1] = None         # help garbage collection
                self._n -= 1                        # we have one less item
                return                              # exit immediately
        raise ValueError('value not found')         # only reached if no match
```
Amortized Analysis of Dynamic Arrays#
In this section, we perform a detailed analysis of the running time of operations on dynamic arrays. We use the big-Omega notation introduced in Section 3.3.1 to give an asymptotic lower bound on the running time of an algorithm or step within it.
The strategy of replacing an array with a new, larger array might at first seem slow, because a single append operation may require Ω(n) time to perform, where n is the current number of elements in the array. However, notice that by doubling the capacity during an array replacement, our new array allows us to add n new elements before the array must be replaced again. In this way, there are many simple append operations for each expensive one (see Figure 5.13). This fact allows us to show that performing a series of operations on an initially empty dynamic array is efficient in terms of its total running time.
Using an algorithmic design pattern called amortization, we can show that per- forming a sequence of such append operations on a dynamic array is actually quite efficient.
![https://i.ibb.co/KL5QG1q/array14.png](https://i.ibb.co/KL5QG1q/array14.png)
Running times of a series of append operations on a dynamic array.#
Proposition 1: Doubling strategy \(\rightarrow O(n)\)#
Let S be a sequence implemented by means of a dynamic array with initial capacity one, using the strategy of doubling the array size when full. The total time to perform a series of n append operations in S, starting from S being empty, is O(n).
The proof of this result is based on the following observations:
- The first append operation takes O(1) time.
- The next two append operations take O(1) time each.
- The next four append operations take O(1) time each.
- The next eight append operations take O(1) time each.
- The next sixteen append operations take O(1) time each.
- The next thirty-two append operations take O(1) time each.
In general, the next \(2^i\) append operations take O(1) time each, for \(i = 0, 1, 2, ...\), because each resize event that occurs during that phase copies at most \(2^i\) elements. The total time to perform the first \(2^i\) append operations is therefore O(\(2^i\)). The total time to perform the first n append operations is dominated by the cost of the resize events for \(i = 0, 1, 2, ..., k\), where \(2^k ≤ n < 2^{k+1}\). This cost is at most \(2^0 + 2^1 + 2^2 + ... + 2^k = 2^{k+1} − 1 ≤ 2n − 1 = O(n)\).
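The bound can be checked numerically with a small simulation of the doubling strategy. This is our own sketch (the function name is ours); it counts only the element copies performed during resize events, which the proof shows stay below 2n:

```python
def total_copy_cost(n):
    '''Simulate n appends with capacity doubling (initial capacity 1) and
    return the total number of element copies performed during resizes.'''
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:      # array is full: resize before appending
            copies += size        # copying every existing element costs size
            capacity *= 2
        size += 1
    return copies

for n in (10, 1000, 10**6):
    assert total_copy_cost(n) < 2 * n    # total resize work is O(n)
```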
Geometric Increase in Capacity#
Although the proof of the preceding proposition relies on the array being doubled each time we expand, the O(1) amortized bound per operation can be proven for any geometrically increasing progression of array sizes (see Section 2.4.2 for discussion of geometric progressions). When choosing the geometric base, there exists a trade-off between run-time efficiency and memory usage. With a base of 2 (i.e., doubling the array), if the last insertion causes a resize event, the array essentially ends up twice as large as it needs to be. If we instead increase the array by only 25% of its current size (i.e., a geometric base of 1.25), we do not risk wasting as much memory in the end, but there will be more intermediate resize events along the way. Still it is possible to prove an O(1) amortized bound, albeit with a larger constant factor per operation than in the doubling case (see Exercise C-5.15). The key to the performance is that the amount of additional space is proportional to the current size of the array.
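The trade-off can be observed by generalizing the earlier simulation to an arbitrary growth factor. This is a sketch under our own naming; the 8n bound asserted for factor 1.25 is a deliberately loose constant consistent with the geometric-series argument (the true constant is roughly factor/(factor − 1) ≈ 5):

```python
def copy_cost(n, factor):
    '''Total element copies during n appends when a full array of capacity c
    is replaced by one of capacity max(c + 1, int(c * factor)).'''
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size
            capacity = max(capacity + 1, int(capacity * factor))
        size += 1
    return copies

n = 10**5
assert copy_cost(n, 2.0) < 2 * n                 # doubling: few copies, more slack
assert copy_cost(n, 1.25) < 8 * n                # base 1.25: more copies, still O(n)
assert copy_cost(n, 1.25) > copy_cost(n, 2.0)    # more intermediate resize events
```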
Beware of Arithmetic Progression#
To avoid reserving too much space at once, it might be tempting to implement a dynamic array with a strategy in which a constant number of additional cells are reserved each time an array is resized. Unfortunately, the overall performance of such a strategy is significantly worse. At an extreme, an increase of only one cell causes each append operation to resize the array, leading to a familiar \(1 + 2 + 3 + ··· + n\) summation and \(Ω(n^2)\) overall cost. Using increases of 2 or 3 at a time is slightly better, as portrayed in the figure below, but the overall cost remains quadratic.
![https://i.ibb.co/Rgq1THy/array15.png](https://i.ibb.co/Rgq1THy/array15.png)
Running times of a series of append operations on a dynamic array using arithmetic progression of sizes. (a) Assumes increase of 2 in size of the array, while (b) assumes increase of 3.#
Using a fixed increment for each resize, and thus an arithmetic progression of intermediate array sizes, results in an overall time that is quadratic in the number of operations, as shown in the following proposition. Intuitively, even an increase of 1000 cells per resize becomes insignificant for large data sets.
Proposition 2: Fixed increment \(\rightarrow Ω(n^2)\)#
Performing a series of n append operations on an initially empty dynamic array using a fixed increment with each resize takes \(Ω(n^2)\) time.
The proof of this result is based on the following observations:
Let \(c > 0\) represent the fixed increment in capacity that is used for each resize event. During the series of \(n\) append operations, time will have been spent initializing arrays of size \(c, 2c, 3c, ..., mc\) for \(m = \lceil{n}/c\rceil\), and therefore, the overall time is proportional to \(c + 2c + 3c + ··· + mc = c \cdot \frac{m(m+1)}{2}\).
Since \(m \geq n/c\), this quantity is at least \(\frac{n^2}{2c}\). With \(c\) a constant, performing the n append operations therefore takes \(Ω(n^2)\) time.
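A simulation confirms the quadratic growth: with any fixed increment, doubling the number of appends roughly quadruples the total copying work (our own sketch; the ratio bounds in the assertions are deliberately loose):

```python
def fixed_increment_cost(n, c):
    '''Total element copies during n appends when each resize adds
    exactly c cells of capacity.'''
    capacity, size, copies = 0, 0, 0
    for _ in range(n):
        if size == capacity:        # full: resize by a fixed increment c
            copies += size
            capacity += c
        size += 1
    return copies

for c in (1, 10, 1000):
    ratio = fixed_increment_cost(20000, c) / fixed_increment_cost(10000, c)
    assert 3.5 < ratio < 4.5        # quadratic: cost scales with n**2
```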
Python’s List Class#
Python's list class uses a form of dynamic array for its storage, but it follows neither a pure geometric progression nor an arithmetic progression of capacities. Despite this, Python's implementation of the append method exhibits amortized constant-time behavior. We can demonstrate this fact experimentally.
A single append operation typically executes so quickly that it would be difficult for us to accurately measure the time elapsed at that granularity, although we should notice some of the more expensive operations in which a resize is performed. We can get a more accurate measure of the amortized cost per operation by performing a series of n append operations on an initially empty list and determining the average cost of each.
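Such an experiment might be sketched as follows (the function name compute_average is ours; perf_counter from the standard library is used as a monotonic clock):

```python
from time import perf_counter

def compute_average(n):
    '''Perform n appends to an empty list and return the average time per append.'''
    data = []
    start = perf_counter()          # record the time just before the loop
    for k in range(n):
        data.append(None)
    end = perf_counter()            # record the time just after the loop
    return (end - start) / n        # average time per operation

for n in (10**4, 10**5, 10**6):
    print('{0:8d} appends: {1:.3e} seconds each'.format(n, compute_average(n)))
```

As n increases, the reported average should remain roughly constant, consistent with O(1) amortized behavior, although the exact timings vary by machine.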