LINEAR ALGEBRA

Scalars

Most everyday mathematics consists of manipulating numbers one at a time. Formally, we call these values scalars.

For example, the temperature in Greenville is a balmy \(72\) degrees Fahrenheit. If you wanted to convert the temperature to Celsius you would evaluate the expression \(c = \frac{5}{9}(f - 32)\), setting \(f = 72\). In this equation, the values \(5\), \(9\), and \(32\) are constant scalars. The variables \(c\) and \(f\) in general represent unknown scalars.

We denote scalars by ordinary lower-cased letters (e.g. \(x\), \(y\), and \(z\)) and the space of all (continuous) real-valued scalars by \(\mathbb{R}\). The expression \(x \in \mathbb{R}\) is a formal way to say that \(x\) is a real-valued scalar. The symbol \(\in\) (pronounced “in”) denotes membership in a set. For example, \(x, y \in {0, 1}\) indicates that \(x\) and \(y\) are variables that can only take on values of \(0\) or \(1\).

Scalars in Python are represented by numeric types such as int and float.

x = 3
y = 2

print("x+y:", x+y, "x-y:", x-y, "x*y:", x*y, "x/y:", x/y, "x**y:", x**y)

Vectorized Operations

Vector based programming (also known as array based programming) is a programming paradigm that uses operations on arrays to execute tasks. This is in contrast to scalar based programming where operations are performed on individual elements of an array.

Vectorization operates at the level of individual instructions sent to a processor within each node. For instance, in the illustration shown here, the instruction is to add 5 to a column of numbers and copy the results to a new column B. With vectorization, all the data elements in that column are transformed simultaneously, i.e. the instruction to add 5 is applied to multiple pieces of data at the same time. This paradigm is sometimes referred to as Single Instruction Multiple Data (or SIMD).

We can think of vectorization as subdividing the work into smaller chunks that can be handled independently by different computational units at the same time.

In this course, we will be minimizing the use of loops.


1. Serial / Sequential Operations	2. Vectorized Operations

This is orders of magnitude faster than the conventional sequential model where each piece of data is handled one after the other in sequence.

Vectorized operations are also known as SIMD (Single Instruction Multiple Data) operations in the context of computer architecture. In contrast, scalar operations are known as SISD (Single Instruction Single Data) operations.

With vectorization, performing the same operation on a modern intel CPU is 16 times faster than the sequential mode. The performance gains on GPUs with thousands of computational cores is even greater. However, despite these remarkable performance benefits, most analytical code out there is written in the slower sequential mode. This is not a surprise, since until about a decade ago, CPU and GPU hardware could not really support vectorization for data analysis. So most implementations had to be sequential.

The last 10 years, however, have seen the rise of new technologies like CUDA from NVidia and advanced vector extensions from Intel that have dramatically shifted our ability to apply vectorization. Because of the power of vectorization, some traditional vendors now make claims about including vectorization in their offerings. But shifting to this new vectorized paradigm is not easy, since all of your code needs to be written from scratch to utilize these capabilities.

Vectorization can only be applied in situations when operations at individual elements are independent of each other. For example, if we want to add two arrays, we can do so by adding each element of the first array to the corresponding element of the second array. This is a vectorized operation.

However, for problems such as the Fibonacci sequence, where the value of an element depends on the values of the previous two elements, we cannot vectorize the operation. Similarly, finding minimum or maximum of an array cannot be vectorized.

Dimensions and Shapes

Dimensionality, in the context of data, refers to the number of axes or directions in which data can be represented. The most common dimensions are 0, 1, 2, and n.

Scalars (0-dimensional data; values) are single numbers. They can be integers, real numbers, or complex numbers. Scalars are the simplest objects in linear algebra. In Python, we can represent scalars using the built-in int and float data types. For example, 3 and 3.0 are both scalars.

Vectors (1-dimensional data, collection of values) are one-dimensional arrays of scalars. They are used to represent quantities that have both magnitude and direction. In native Python, we can represent vectors using lists or tuples. For example, [1, 2, 3] is a vector.

Matrices (2-dimensional data, collection of vectors) are two-dimensional arrays of scalars. They are used to represent linear transformations from one vector space to another. In native Python, we can represent matrices using lists of lists. For example, [[1, 2], [3, 4]] is a matrix.

Tensors (n-dimensional data, collection of matrices) are n-dimensional arrays of scalars. They are used to represent multi-dimensional data.

Tabular (2-dimensional) Data

Tables are one of the most common ways to organize data. This is in large part due to the simplicity and flexibility of tables. Tables allow us to represent each observation, or instance of collecting data from an individual, as its own row. We can record distinct characteristics, or features, of each observation in separate columns.

To see this in action, we’ll explore the elections dataset, which stores information about political candidates who ran for president of the United States in various years.

The first few rows of elections dataset in CSV format are as follows:

Year,Candidate,Party,Popular vote,Result,%\n
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204\n
1824,John Quincy Adams,Democratic-Republican,113142,win,42.78987796\n
1828,Andrew Jackson,Democratic,642806,win,56.20392707\n
1828,John Quincy Adams,National Republican,500897,loss,43.79607293\n
1832,Andrew Jackson,Democratic,702735,win,54.57478905\n

This dataset is stored in Comma Separated Values (CSV) format. CSV files due to their simplicity and readability are one of the most common ways to store tabular data. Each line in a CSV file (file extension: .csv) represents a row in the table. In other words, each row is separated by a newline character \n. Within each row, each column is separated by a comma ,, hence the name Comma Separated Values.

DataFrame, Series and Index

There are three fundamental data structures in pandas:

Series: 1D labeled array data; best thought of as columnar data
DataFrame: 2D tabular data with rows and columns
Index: A sequence of row/column labels

DataFrames, Series, and Indices can be represented visually in the following diagram, which considers the first few rows of the elections dataset.

Notice how the DataFrame is a two-dimensional object – it contains both rows and columns. The Series above is a singular column of this DataFrame, namely, the Result column. Both contain an Index, or a shared list of row labels (here, the integers from 0 to 4, inclusive).

`.shape` attribute

.shape is an attribute of a DataFrame that returns a tuple representing the dimensions of the DataFrame.

elections.shape

(182, 6)

The first element of the tuple is the number of rows, and the second element is the number of columns.