Arrays (under the hood)

9. Arrays (under the hood)#

An Array is simply a contiguous block of memory, where each cell in the block holds a value directly or reference to an object.

This block of memory is often visualized as a numbered sequence of cells, where the number denotes the position of a cell within the sequence.

https://i.ibb.co/FnWvV70/array3.png — Fig. 9.1 A higher-level abstraction for the string portrayed in figure before.#

Array serve as low-level data structures used internally by Python to implement higher-level data structures such as lists, tuples, and strings. To accurately describe the way in which Python represents the sequence types, we must first discuss aspects of the low-level computer architecture.

Each byte (8 bits) in the computer’s memory has a unique memory address, which is a number that identifies the byte’s location in memory. In this way, the computer system can refer to the data in “byte #2150” versus the data in “byte #2157,” for example.

Memory addresses are typically coordinated with the physical layout of the memory system, and so we often portray the numbers in sequential fashion. Figure below provides such a diagram, with the designated memory address for each byte.

https://i.ibb.co/DLyLmTw/array1.png — Fig. 9.2 A representation of a portion of a computer’s memory, with individual bytes labeled with consecutive memory addresses.#

Despite the sequential nature of the numbering system, computer hardware is designed, in theory, so that any byte of the main memory can be efficiently accessed based upon its memory address. In this sense, we say that a computer’s main memory performs as Random Access Memory (RAM).

That is, in theory, it is just as easy to retrieve byte #8675309 as it is to retrieve byte #309. Using the notation for asymptotic analysis, we say that any individual byte of memory can be stored or retrieved in \(O(1)\) time.

In general, a programming language keeps track of the association between an variable name and the memory address in which the associated value is stored. For example, identifier x might be associated with one value stored in memory, while y is associated with another value stored in memory.

A group of related values can be stored one after another in a contiguous portion of the computer’s memory i.e. an array. For example, a text string is stored as an ordered sequence of individual characters. On most computers, Python internally represents each character with 16 bits (i.e., 2 bytes). Therefore, a six-character string, such as SAMPLE , would be stored in 12 consecutive bytes of memory, as diagrammed in Figure below.

https://i.ibb.co/fQGDLsG/array2.png — Fig. 9.3 A Python string embedded as an array of characters in the computer’s memory. We assume that each Unicode character of the string requires two bytes of memory. The numbers below the entries are indices into the string.#

We describe this as an array of six characters, even though it requires 12 bytes of memory. We will refer to each location within an array as a cell, and will use an integer index to describe its location within the array, with cells numbered starting with 0, 1, 2, and so on. For example, in Figure above, the cell of the array with index 4 has contents L and is stored in bytes 2154 and 2155 of memory.

Each cell of an array must use the same number of bytes. This requirement is what allows an arbitrary cell of the array to be accessed in constant time based on its index. In particular, if one knows the memory address at which an array starts (e.g., \(2146\) in figure above), the number of bytes per element (e.g., 2 for a Unicode character), and a desired index within the array, the appropriate memory address can be computed using the calculation,

\[ \text{start} + (\text{cellsize} \times \text{index}) \]

By this formula, the cell at index 0 begins precisely at the start of the array, the cell at index 1 begins precisely \(\text{cellsize}\) bytes beyond the start of the array, and so on. As an example, cell 4 of Figure above begins at memory location \(2146 + (2 \times 4)\) = 2146 + 8 = 2154.

9.1. Static Arrays#

A static array is a type of array whose size is fixed at the time of its creation. Immutable sequences in Python, such as tuple and str instances, are examples of static arrays.

When creating a low-level array in a computer system, the precise size of that array must be explicitly declared in order for the system to properly allocate a consecutive piece of memory for its storage. For example, the figure below displays an array of 12 bytes that might be stored in memory locations 2146 through 2157.

https://i.ibb.co/ggpDSqz/array12.png — Fig. 9.4 An array of 12 bytes allocated in memory locations 2146 through 2157#

Because the system might dedicate neighboring memory locations to store other data, the capacity of an array cannot trivially be increased by expanding into subsequent cells.

In the context of representing a Python tuple or str instance, this constraint is no problem. Instances of those classes are immutable, so the correct size for an underlying array can be fixed when the object is instantiated.

Memory Usage of Python Objects

sys.getsizeof(..) function returns the size of an object in bytes.

Note that empty lists, strings etc. have some initial memory overhead associated with them. After that, the memory usage increases as you add elements to the object.

Generally, the size of an empty list is 56 bytes, while the size of an empty tuple is 40 bytes. When you add one element to the list, its size increases to 64 bytes, and so on.

import sys

print(sys.getsizeof([]))  # 56 (empty list has 56 bytes overhead, 8 bytes for every element after that)
print(sys.getsizeof(()))  # 40 (empty tuple has 40 bytes overhead, 8 bytes for every element after that)
print(sys.getsizeof(""))  # 49 (empty string has 49 bytes overhead, 1 byte for every character after that)

Python allows you to query the actual number of bytes being used for the primary storage of any object. This is done using the getsizeof function of the sys module.

Note that in the code cell below, the memory usage of the string type variable txt increases linearly with the number of characters in the string.

import sys 

txt = ""

for i in range(10):
    print("Number of characters: ", len(txt), "\t Memory (in bytes):", sys.getsizeof(txt))
    txt += "0"

Number of characters:  0 	 Memory (in bytes): 49
Number of characters:  1 	 Memory (in bytes): 50
Number of characters:  2 	 Memory (in bytes): 51
Number of characters:  3 	 Memory (in bytes): 52
Number of characters:  4 	 Memory (in bytes): 53
Number of characters:  5 	 Memory (in bytes): 54
Number of characters:  6 	 Memory (in bytes): 55
Number of characters:  7 	 Memory (in bytes): 56
Number of characters:  8 	 Memory (in bytes): 57
Number of characters:  9 	 Memory (in bytes): 58

9.2. Referential Arrays#

Instead, Python represents a list or tuple instance using an internal storage mechanism of an array of object references. At the lowest level, what is stored is a consecutive sequence of memory addresses at which the elements of the sequence reside. A high-level diagram of such a list is shown in figure below.

https://i.ibb.co/CJmjccS/array4.png — Fig. 9.5 An array storing references to strings.#

Although the relative size of the individual elements may vary, the number of bits used to store the memory address of each element is fixed (e.g., 64-bits per address). In this way, Python can support constant-time access to a list or tuple element based on its index.

As a result, two lists of the same length will consume the same amount of memory, regardless of the sizes of the elements that they store.

import sys 

data  = ['Rene' , 'Joseph' , 'Janet' , 'Jonas' , 'Helen' , 'Virginia']

data2 = ['r', 'j', 'j', 'j', 'h', 'v']

sys.getsizeof(data) == sys.getsizeof(data2)

True

Note that in the code cell above, the size of the two variables and data and data2 is the same because they both contain an equal number of memory addresses, even though the individual strings they reference have different lengths.

Referential arays are generally not as memory efficient as static arrays.

Note that memory addresses are typically 64 bits (8 bytes) in size on modern computers. This means that the memory overhead for a list of \(n\) elements is \(8n\) bytes.

As another case study, suppose we wish to store a sequence of one million, 64-bit integers. In theory, we might hope to use only 64 million bits. However, we estimate that a Python list will use four to five times as much memory. Each element of the list will result in a 64-bit memory address being stored in the primary array, and an int instance being stored elsewhere in memory.

data = [1, 2, 3, 4]

print(sys.getsizeof(data)) # 120 = 56 (overhead) + 32 (4 addresses, each address is 4 bytes) + 32 (4 integers, each integer is 8 bytes)

data2 = [data, data, data, data]

print(sys.getsizeof(data2)) # 88 = 56 (overhead) + 32 (4 addresses, each address is 8 bytes). The integers themselves are not primary storage in data2.

120
88

sys.getsizeof("a")

In Figure above, we characterize a list of strings that are the names of the patients in a hospital. It is more likely that a medical information system would manage more comprehensive information on each patient, perhaps represented as an in- stance of a Patient class. From the perspective of the list implementation, the same principle applies: The list will simply keep a sequence of references to those ob- jects. Note as well that a reference to the None object can be used as an element of the list to represent an empty bed in the hospital.

The fact that lists and tuples are referential structures is significant to the se- mantics of these classes. A single list instance may include multiple references to the same object as elements of the list, and it is possible for a single object to be an element of two or more lists, as those lists simply store references back to that object. As an example, when you compute a slice of a list, the result is a new list instance, but that new list has references to the same elements that are in the original list, as portrayed in Figure below.

https://i.ibb.co/X2Fp4Yx/array5.png — Fig. 9.6 The result of the command `temp = primes[3:6]`#

When the elements of the list are immutable objects, as with the integer in- stances in Figure 5.5, the fact that the two lists share elements is not that significant, as neither of the lists can cause a change to the shared object. If, for example, the command temp[2] = 15 were executed from this configuration, that does not change the existing integer object; it changes the reference in cell 2 of the temp list to reference a different object. The resulting configuration is shown in Figure below.

https://i.ibb.co/yQftP2V/array6.png — Fig. 9.7 The result of the command `temp[2] = 15` upon the configuration portrayed in previous figure.#

The same semantics is demonstrated when making a new list as a copy of an existing one, with a syntax such as backup = list(primes). This produces a new list that is a shallow copy, in that it references the same elements as in the first list. With immutable elements, this point is moot. If the contents of the list were of a mutable type, a deep copy, meaning a new list with new elements, can be produced by using the deepcopy function from the copy module.

As a more striking example, it is a common practice in Python to initialize an array of integers using a syntax such as counters = [0] 8. This syntax produces a list of length eight, with all eight elements being the value zero. Technically, all eight cells of the list reference the same object, as shown below.

https://i.ibb.co/f4mVB8Y/array7.png — Fig. 9.8 The result of the command `data = [0]*8`.#

At first glance, the extreme level of aliasing in this configuration may seem alarming. However, we rely on the fact that the referenced integer is immutable. Even a command such as counters[2] += 1 does not technically change the value of the existing integer instance. This computes a new integer, with value 0 + 1, and sets cell 2 to reference the newly computed value. The resulting configuration is shown in Figure 5.8

https://i.ibb.co/QpmfZ9L/array8.png — Fig. 9.9 The result of the command `counters[2] += 1` upon the configuration portrayed in previous figure.#

As a final manifestation of the referential nature of lists, we note that the extend command is used to add all elements from one list to the end of another list. The extended list does not receive copies of those elements, it receives references to those elements. Figure 5.9 portrays the effect of a call to extend.

https://i.ibb.co/1ZZFWTp/array9.png — Fig. 9.10 The effect of command `primes.extend(extras)`, shown in light gray.#

9.3. Compact Arrays#

A compact array is one in which the cells of the array directly contain the data values only.

In other words, the array is storing the bits that represent the primary data (characters, in the case of strings) directly, rather than storing references to the data.

Strings in Python are represented using an array of characters and NOT an array of references.

https://i.ibb.co/SdfVPnj/array10.png — Fig. 9.11 A compact array representation of a string.#

Compact arrays have several advantages over referential structures in terms of computing performance.

Most significantly, the overall memory usage will be much lower for a compact structure because there is no overhead devoted to the explicit storage of the sequence of memory references.

That is, a referential structure will typically use 64-bits for the memory address stored in the array, on top of whatever number of bits are used to represent the object that is considered the element.

Also, each Unicode character stored in a compact array within a string typically requires 2 bytes (8 bits). If each character were stored independently as a one-character string, there would be significantly more bytes used.

Another important advantage to a compact structure for high-performance computing is that the primary data are stored consecutively in memory. Note well that this is not the case for a referential structure. That is, even though a list maintains careful ordering of the sequence of memory addresses, where those elements reside in memory is not determined by the list. Because of the workings of the cache and memory hierarchies of computers, it is often advantageous to have data stored in memory near other data that might be used in the same computations.

Primary support for compact arrays is in a module named array. That module defines a class, also named array, providing compact storage for arrays of primitive data types. A portrayal of such an array of integers is shown in Figure 5.10.

https://i.ibb.co/w6j8Qk5/array11.png — Fig. 9.12 Integers stored compactly as elements of a Python array.#

The public interface for the array class conforms mostly to that of a Python list. However, the constructor for the array class requires a type code as a first parameter, which is a character that designates the type of data that will be stored in the array. As a tangible example, the type code, 'i' , designates an array of (signed) integers, typically represented using at least 16-bits each.

from array import array

primes = array('i' , [2, 3, 5, 7, 11, 13, 17, 19])

The type code allows the interpreter to determine precisely how many bits are needed per element of the array. The type codes supported by the array module, as shown in Table 5.1, are formally based upon the native data types used by the C programming language (the language in which the the most widely used distri- bution of Python is implemented). The precise number of bits for the C data types is system-dependent, but typical ranges are shown in the table.

Type Code	C Type	Minimum Size in Bytes
‘b’	signed char	1
‘B’	unsigned char	1
‘u’	Py_UNICODE	2
‘h’	signed short	2
‘H’	unsigned short	2
‘i’	signed int	2
‘I’	unsigned int	2
‘l’	signed long	4
‘L’	unsigned long	4
‘f’	float	4
‘d’	double	8

The array module does not provide support for making compact arrays of user- defined data types. Compact arrays of such structures can be created with the lower- level support of a module named ctypes.

9.4. Dynamic Arrays#

Python’s list class presents a more interesting abstraction. Although a list has a particular length when constructed, the class allows us to add elements to the list, with no apparent limit on the overall capacity of the list. To provide this abstraction, Python relies on an algorithmic sleight of hand known as a Dynamic Array.

The first key to providing the semantics of a dynamic array is that a list instance maintains an underlying array that often has greater capacity than the current length of the list. For example, while a user may have created a list with five elements, the system may have reserved an underlying array capable of storing eight object references (rather than only five). This extra capacity makes it easy to append a new element to the list by using the next available cell of the array.

If a user continues to append elements to a list, any reserved capacity will eventually be exhausted. In that case, the class requests a new, larger array from the system, and initializes the new array so that its prefix matches that of the existing smaller array. At that point in time, the old array is no longer needed, so it is reclaimed by the system. Intuitively, this strategy is much like that of the hermit crab, which moves into a larger shell when it outgrows its previous one.

We give empirical evidence that Python’s list class is based upon such a strategy. We rely on a function named getsizeof that is available from the sys module. This function reports the number of bytes that are being used to store an object in Python. For a list, it reports the number of bytes devoted to the array and other instance variables of the list, but not any space devoted to elements referenced by the list.

import sys  # provides getsizeof function

n = 27
data=[]
for k in range(n):          # note: must fix choice of n 
    a = len(data)           # number of elements
    b = sys.getsizeof(data) # actual size in bytes 
    print('Length: {0:3d}; Size in bytes: {1:4d}'.format(a, b))
    data.append(None)       # increase length by one

Length:   0; Size in bytes:   56
Length:   1; Size in bytes:   88
Length:   2; Size in bytes:   88
Length:   3; Size in bytes:   88
Length:   4; Size in bytes:   88
Length:   5; Size in bytes:  120
Length:   6; Size in bytes:  120
Length:   7; Size in bytes:  120
Length:   8; Size in bytes:  120
Length:   9; Size in bytes:  184
Length:  10; Size in bytes:  184
Length:  11; Size in bytes:  184
Length:  12; Size in bytes:  184
Length:  13; Size in bytes:  184
Length:  14; Size in bytes:  184
Length:  15; Size in bytes:  184
Length:  16; Size in bytes:  184
Length:  17; Size in bytes:  248
Length:  18; Size in bytes:  248
Length:  19; Size in bytes:  248
Length:  20; Size in bytes:  248
Length:  21; Size in bytes:  248
Length:  22; Size in bytes:  248
Length:  23; Size in bytes:  248
Length:  24; Size in bytes:  248
Length:  25; Size in bytes:  312
Length:  26; Size in bytes:  312

In evaluating the results of the experiment, we draw attention to the first line of output from Code Fragment 5.2. We see that an empty list instance already requires a certain number of bytes of memory (72 on our system). In fact, each object in Python maintains some state, for example, a reference to denote the class to which it belongs. Although we cannot directly access private instance variables for a list, we can speculate that in some form it maintains state information akin to:

_n: The number of actual elements currently stored in the list.
_capacity: The maximum number of elements that could be stored in the currently allocated array.
_A: The reference to the currently allocated array (initially None)

As soon as the first element is inserted into the list, we detect a change in the underlying size of the structure. In particular, we see the number of bytes jump from 72 to 104, an increase of exactly 32 bytes. Our experiment was run on a 64-bit machine architecture, meaning that each memory address is a 64-bit number (i.e., 8 bytes). We speculate that the increase of 32 bytes reflects the allocation of an underlying array capable of storing four object references. This hypothesis is consistent with the fact that we do not see any underlying change in the memory usage after inserting the second, third, or fourth element into the list.

After the fifth element has been added to the list, we see the memory usage jump from 104 bytes to 136 bytes. If we assume the original base usage of 72 bytes for the list, the total of 136 suggests an additional 64 = 8 × 8 bytes that provide capacity for up to eight object references. Again, this is consistent with the experiment, as the memory usage does not increase again until the ninth insertion. At that point, the 200 bytes can be viewed as the original 72 plus an additional 128-byte array to store 16 object references. The 17th insertion pushes the overall memory usage to 272 = 72 + 200 = 72 + 25 × 8, hence enough to store up to 25 element references

Because a list is a referential structure, the result of getsizeof for a list instance only includes the size for representing its primary structure; it does not account for memory used by the objects that are elements of the list. In our experiment, we repeatedly append None to the list, because we do not care about the contents, but we could append any type of object without affecting the number of bytes reported by getsizeof(data).

The strategy of replacing an array with a new, however it can be proven that the amortized cost (average-case time complexity) of each append operation is O(1) and worst-case time complexity is O(n) for a dynamic array with geometric progression.

https://i.ibb.co/Rgq1THy/array15.png — Fig. 9.13 Running times of a series of append operations on a dynamic array using arithmetic progression of sizes. (a) Arithmetic progression: n + 2 (b) Geometric progression: 3n#

Using a fixed increment for each resize, and thus an arithmetic progression of intermediate array sizes, results in an overall time that is quadratic in the number of operations, as shown in the following proposition. Intuitively, even an increase in 1000 cells per resize will become insignificant for large data sets.

Python’s list class is using a form of dynamic arrays for its storage but does neither uses a pure geometric progression nor an arithmetic progression.

Despite this, Python’s implementation of the append method exhibits amortized constant-time behavior. We can demonstrate this fact experimentally.

A single .append operation typically executes so quickly that it would be difficult for us to accurately measure the time elapsed at that granularity, although we should notice some of the more expensive operations in which a resize is performed. We can get a more accurate measure of the amortized cost per operation by performing a series of n append operations on an initially empty list and determining the average cost of each.

from time import time 

def compute_average(n):
    """Perform n appends to an empty list and return average time elapsed."""
    data = [ ]
    start = time()
    for _ in range(n):
        data.append(None)
    end = time( )
    return (end - start) / n

for i in range(1, 8):
    n = 10**i
    avg = compute_average(n)
    print("Avg time: ", avg, "\tInserts: ", n)

Avg time:  9.5367431640625e-08 	Inserts:  10
Avg time:  7.152557373046875e-08 	Inserts:  100
Avg time:  5.7935714721679686e-08 	Inserts:  1000
Avg time:  5.671977996826172e-08 	Inserts:  10000
Avg time:  5.377054214477539e-08 	Inserts:  100000
Avg time:  5.0510883331298826e-08 	Inserts:  1000000

Avg time:  5.0329303741455076e-08 	Inserts:  10000000

9.4.1. Implementation#

Although the Python list class provides a highly optimized implementation of dynamic arrays, it is instructive to see how such a class might be implemented.

The key is to provide means to grow the array A that stores the elements of a list. Of course, we cannot actually grow that array, as its capacity is fixed. If an element is appended to a list at a time when the underlying array is full, we perform the following steps:

Allocate a new array B with larger capacity.
Set B[i] = A[i], for \(i = 0,...,n−1\), where n denotes current number of items.
Set A = B, that is, we henceforth use B as the array supporting the list.
Insert the new element in the new array.

https://i.ibb.co/gdktfcN/array13.png — Fig. 9.14 An illustration of the three steps for “growing” a dynamic array: (a) create new array `B`; (b) store elements of `A` in `B`; (c) reassign reference A to the new array. Not shown is the future garbage collection of the old array, or the insertion of the new element.#

The remaining issue to consider is how large of a new array to create. A com- monly used rule is for the new array to have twice the capacity of the existing array that has been filled.

The code cell below offers a concrete implementation of dynamic arrays in Python. Our DynamicArray class is designed using ideas described in this section. While consistent with the interface of a Python list class, we provide only limited functionality in the form of an append method, and accessors __len__ and __getitem__. Support for creating low-level arrays is provided by a module named ctypes. Because we will not typically use such a low-level structure in the remain- der of this book, we omit a detailed explanation of the ctypes module. Instead, we wrap the necessary command for declaring the raw array within a private utility method _make_array. The hallmark expansion procedure is performed in our nonpublic _resize method.

import ctypes 

class DynamicArray: 
    '''A dynamic array class akin to a simplified Python list.'''
    
    def __init__(self):
        '''Create an empty array.'''
        self._n = 0                                    # count actual elements
        self._capacity = 1                             # default array capacity
        self._A = self._make_array(self._capacity)     # low-level array
        
    def __len__(self):
        '''Return number of elements stored in the array.'''
        return self._n
    
    def __getitem__(self, k):
        '''Return element at index k.'''
        if not 0 <= k < self._n:
            raise IndexError('invalid index')
        return self._A[k]                              # retrieve from array
    
    def append(self, obj):
        '''Add object to end of the array.'''
        if self._n == self._capacity:                  # not enough room
            self._resize(2 * self._capacity)           # so double capacity
        self._A[self._n] = obj
        self._n += 1
        
    def _resize(self, c):                              # nonpublic utitity
        '''Resize internal array to capacity c.'''
        B = self._make_array(c)                        # new (bigger) array
        for k in range(self._n):                       # for each existing value
            B[k] = self._A[k]
        self._A = B                                    # use the bigger array
        self._capacity = c
        
    def _make_array(self, c):                          # nonpublic utitity
        '''Return new array with capacity c.'''
        return (c * ctypes.py_object)()                # see ctypes documentation
    
    def insert(self, k, value):
        '''Insert value at index k, shifting subsequent values rightward.'''
        # (for simplicity, we assume 0 <= k <= n in this verion)
        if self._n == self._capacity:                  # not enough room
            self._resize(2 * self._capacity)           # so double capacity
        for j in range(self._n, k, -1):                # shift rightmost first
            self._A[j] = self._A[j-1]
        self._A[k] = value                             # store newest element
        self._n += 1
        
    def remove(self, value):
        '''Remove first occurrence of value (or raise ValueError).'''
        # (for simplicity, we do not consider shrinking the dynamic array in this version)

9.5. Summary#

Static vs. Dynamic Arrays: Static arrays have fixed size, while dynamic arrays can grow in size. In Python, tuple and str are examples of static arrays, while list is an example of a dynamic array.

Compact vs. Referential Arrays: Compact arrays store the data directly, while referential arrays store references to the data. In Python, str and array are examples of compact arrays, while list and tuple are examples of referential arrays.

-	Static (Fixed Size)	Dynamic (Variable Size)
Compact (Store Data directly)	`str`,`array`	-
Referential (Store References)	`tuple`	`list`