NumPy is an open source package (i.e. extension library) for the Python programming language, originally developed by Travis Oliphant. It primarily provides:
- N-dimensional array data structures (some might call these tensors...) well suited for numeric computation.
- Sophisticated "broadcasting" operations that allow efficient application of mathematical functions and operators over entire arrays of data, e.g. calling `sin(x)` on a NumPy array `x` returns the element-wise sine of the numbers contained in `x`.
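As a minimal sketch of that element-wise behavior (using the conventional `np` import introduced below):
import numpy as np
x = np.array([0.0, np.pi / 2, np.pi])  # a small array of angles
np.sin(x)                              # element-wise sine: [0.0, 1.0, ~1.2e-16]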
If you've heard that Python is slow for numerical computations, it's primarily due to "duck typing". Duck typing is a colloquial phrase that refers to the fact that in Python, we don't declare variable types (e.g. string, float, int, etc.) when we create variables; the types are inferred at runtime. In summary,
Duck Typing:
- If it looks like a duck, then it is a duck.
- a.k.a. dynamic typing
- Dynamic typing requires lots of metadata around a variable.
This causes runtime overhead leading to poor performance in numerical computations.
Solution: NumPy data structures
- Data structures, as objects, that have a single type and contiguous storage.
- Common functionality with implementation in C.
Contiguous storage means that the array data is stored in a continuous "chunk" in memory, i.e. the elements of an array are next to each other in memory in the order they appear in the array. This improves performance by avoiding "cache misses" and other low-level performance issues.
Most NumPy operations can be expected to perform at a level very-close to what you would expect from compiled C code.
The fact that NumPy arrays are objects makes them slightly different than arrays in C code. NumPy arrays have attributes that can be changed and queried, e.g. the shape or data type of an array.
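For instance, a quick sketch of querying a few of these attributes:
import numpy as np
a = np.arange(6, dtype=np.float64)
a.dtype                  # dtype('float64'): every element shares a single type
a.shape                  # (6,)
a.strides                # (8,): consecutive elements are 8 bytes apart in memory
a.flags['C_CONTIGUOUS']  # True: the data sits in one contiguous block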
To demonstrate the efficacy of NumPy, we'll perform the same operation, i.e. adding 1 to an array containing a million numbers, using a Python list comprehension and then using a NumPy array.
We use the `%timeit` magic function from IPython, which runs the command that follows it a number of times and reports some statistics on how long it takes to execute. First, the Python list comprehension
%timeit [i+1 for i in range(1000000)]
And now the NumPy equivalent. Here the `+ 1` is broadcast across the array, i.e. each element has `1` added to it.
import numpy as np
%timeit np.arange(1000000) + 1
Here we see that adding 1 to a million numbers in NumPy is significantly faster than using a Python list comprehension (which itself is much faster than a `for` loop would be in pure Python). The same pattern holds for a more expensive operation, such as computing the square of the sine of each number:
import math
%timeit [math.sin(i) ** 2 for i in range(1000000)]
%timeit np.sin(np.arange(1000000)) ** 2
It's idiomatic to import NumPy with the statement `import numpy as np` and use the abbreviated `np` namespace.
import numpy as np
x = np.array([2, 3, 11])                       # from a Python list of integers
x = np.array([[1, 2.], [0, 0], [1 + 1j, 2.]])  # nested lists; a complex dtype is inferred
x = np.arange(-10, 10, 2, dtype=float)         # evenly spaced values with an explicit dtype
x = np.linspace(1., 4., 6)                     # 6 evenly spaced values from 1.0 to 4.0
x = np.fromfile('foo.dat')                     # raw binary data read from a file
NumPy arrays can be created from regular Python lists of floats, integers, strings, etc., and the types will be inferred. However, it's also possible (and not a bad idea, to enhance readability/clarity) to specify the data type explicitly using the optional keyword `dtype`. There are several other ways to create arrays, e.g. with `arange`, `linspace`, etc.
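A few more common constructors, shown as a brief sketch:
import numpy as np
np.zeros((2, 3))       # 2 x 3 array of zeros
np.ones(4, dtype=int)  # array of ones with an explicit integer dtype
np.eye(3)              # 3 x 3 identity matrix
np.full((2, 2), 7.0)   # 2 x 2 array filled with a constant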
import numpy as np
x = np.arange(9).reshape(3,3)
x
Here we create a one-dimensional array with the `np.arange()` function that contains the integers 0 through 8. Then we reshape it to be a 3 x 3 array. Keep in mind that the memory is still contiguous and the reshape operation merely offers a different 'view' of the data. It also allows us to use multiple indices and special notation to retrieve parts of the data, e.g.
x[:,0]
An explanation of the notation used in the operation above: the comma (`,`) separates the dimensions of the array in the indexing operation. The colon (`:`) represents all of that dimension. So in this operation, the colon signifies getting all rows, i.e. everything along the $0^\mbox{th}$ dimension of the array. Along the second dimension, i.e. the columns, the `0` selects the $0^\mbox{th}$ indexed column.
Getting parts of an array with this type of notation is called slicing in NumPy.
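A couple more slices in the same notation, as a quick sketch (using the 3 x 3 array `x` from above):
x[1, :]      # the entire second row: array([3, 4, 5])
x[:, 1]      # the entire second column: array([1, 4, 7])
x[0:2, 1:3]  # the 2 x 2 block in the upper-right corner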
x.shape
`shape` is an attribute of the `x` NumPy array object. We can see that it has been changed from what it would have been originally, i.e. `(9,)`, before the `reshape()` method was called.
Here is another, more sophisticated slicing operation.
y = x[::2, ::2]
y
The equivalent, more explicit syntax for this slicing operation might read:
y = x[0:3:2, 0:3:2]
The first `0:3` selects entries from the $0^\mbox{th}$ index up to (but not including) index 3 along the $0^\mbox{th}$ dimension, i.e. all rows of this 3 x 3 array. The final `:2` indicates the step, i.e. get every $2^\mbox{nd}$ entry along the dimension.
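A quick check, as a sketch, that the two spellings select the same entries:
np.array_equal(x[::2, ::2], x[0:3:2, 0:3:2])  # True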
It's possible to use indexing to specify a single entry in an array, e.g.
y[0,0] = 100
y
x
Note that this operation also changes the value of the (0,0) entry in the original array.
x[0,0]
This is because the operation `y = x[::2, ::2]` is only a shallow copy, or "view", into that portion of `x`. If you want to make an actual copy of the data contained in `x` into `y`, use the `copy()` method.
y = x[::2, ::2].copy()
y[0,0] = 200
print("x = \n\n {}\n".format(x))
print("y = \n\n {}".format(y))
A common approximation for computing derivatives of functions is a finite difference approximation
$$ \frac{\mathrm{d}}{\mathrm{d}x}y(x) \approx \frac{y(x + \Delta x) - y(x)}{\Delta x} $$
or, for discrete data,
$$ \frac{\mathrm{d}}{\mathrm{d}x}y(x_i) \approx \frac{y(x_i + \Delta x_{i+1/2}) - y(x_i)}{\Delta x_{i+1/2}} $$
where $\Delta x_{i + 1/2} = x_{i+1} - x_{i}$. This operation can be accomplished efficiently and compactly using NumPy's slicing and vectorized algebraic operations.
x = np.arange(0, 40, 2)                             # sample points
y = x ** 2                                          # a quadratic function of x
dy_dx = (y[1:] - y[:-1]) / (x[1:] - x[:-1]); dy_dx  # forward differences
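As a quick sanity check (a small sketch), for $y = x^2$ the forward difference is exactly $x_{i+1} + x_i$, i.e. twice the midpoint between neighboring sample points:
midpoints = (x[1:] + x[:-1]) / 2
np.allclose(dy_dx, 2 * midpoints)  # True: matches the analytic derivative 2x evaluated at the midpoints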
A plot shows that the derivative of the quadratic curve is a straight line, as expected.
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(x, y, 'k-', label=r'$y(x)$')
plt.plot((x[1:] + x[:-1]) / 2, dy_dx, 'r-', label=r'$\frac{\mathrm{d}}{\mathrm{d}x} y(x)$')
plt.legend();
Broadcasting is a term that is used when performing arithmetic operations on arrays of different shapes. The example of adding 1 to an array with one million entries shown earlier was a simple example of broadcasting. A more complex operation is shown below.
First we create 3 arrays, each with 800 rows and 600 columns of random numbers. Then we combine these arrays into a multidimensional array with the shape (3, 800, 600). Transposing this array gives us a 600 by 800 array (possibly representing the individual pixels of an image), with each entry containing three numbers (possibly representing the RGB color definition of that pixel).
red = np.random.rand(800,600)
blue = np.random.rand(800, 600)
green = np.random.rand(800, 600)
rgb = np.array([red, blue, green])
rgb.shape
We can scale the color values of each pixel by multiplying by another, smaller array. This smaller array is broadcast across the transpose of the `rgb` array along the trailing dimension.
rgb.T * np.array([0.5, 0.5, 0.25])
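Broadcasting lines up trailing dimensions, so the length-3 scale array is applied along the last axis of the (600, 800, 3) transpose and the result keeps that shape. A quick check, as a sketch:
scaled = rgb.T * np.array([0.5, 0.5, 0.25])
scaled.shape  # (600, 800, 3): each pixel's three color values are scaled independently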
Of course the arrays are quite large so inspecting the individual numbers is difficult. Below are the two arrays plotted as images.
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(1,2, figsize=(10,12) )
ax[0].imshow(rgb.T);
ax[0].set_title('Original');
ax[1].imshow(rgb.T * np.array([1., 1., 0.25]));
ax[1].set_title('Scaled');
Fancy indexing is special syntax, specific to NumPy, that allows for extracting parts of arrays. Starting with an array `x`:
x = np.arange(9); x
We saw earlier that you can select individual parts of arrays using indexing. The following selects and returns the 4th entry of the array `x`, corresponding to the index 3.
x[3]
This functionality is extended by giving a list of indices. The following selects the first, fourth, and last (-1 index) entries in the array `x`.
x[[0,3,-1]]
xx = [1, 3, 5, 7]  # a plain Python list, not a NumPy array
xx[[2, 3]]         # raises TypeError: plain lists do not support fancy indexing
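By contrast, converting the list to a NumPy array first makes the same expression work (a small sketch):
np.array(xx)[[2, 3]]  # array([5, 7])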
Fancy indexing can also be used in multiple dimensions.
y = x.reshape((3,3)); y
Now using the 0 index returns the entire first row.
y[0]
Providing a list along the first dimension returns the entire second and third rows.
y[[1,2]]
Here is a more complicated example, where the indices along each dimension are provided in a tuple of lists. This can be interpreted as requesting the $(0,0), (1,1),$ and $(2,0)$ entries.
y[([0,1,2],[0,1,0])]
Boolean arrays can be combined with fancy indexing to provide a powerful way to do advanced data manipulation and cleaning. We'll start by creating a random array of 10 integers drawn from -10 (inclusive) up to 10 (exclusive).
x = np.random.randint(-10, 10, 10); x
We can use a broadcast Boolean operation to create a Boolean array as follows. In this case, everywhere in the array that the entry is greater than zero we get a `True`, and a `False` otherwise.
ind = x > 0; ind
Now we can use this Boolean array to return only the values of `x` that are greater than zero.
x[ind]
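Boolean masks can also be combined element-wise, e.g. with `&` for a logical "and"; a brief sketch using the same `x`:
x[(x > -5) & (x < 5)]  # only the entries strictly between -5 and 5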
The NumPy `where()` function gives a powerful way to do selective data manipulation, e.g. we can replace any values of the array `x` that are not positive with `0` (a variant using `np.nan` is sketched after the next cell).
y = np.where(ind, x, 0); y
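The same pattern with `np.nan` as the replacement value upcasts the result to a floating point array (a sketch):
np.where(ind, x, np.nan)  # non-positive entries become nan; the result is a float array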
Further Reading
Further reading on NumPy is available in the official NumPy documentation.