NumpyBasics

From CASA Guides
Jump to: navigation, search

Back to the PythonOverview.

Introduction to NumPy

The built in python collections leave something to be desired for serious astronomical number crunching. Almost all current astronomical data processing inside python takes advantage of the numpy libraries. These are third party routines packaged with CASA that allow efficient definition and manipulation of matrices (or images or data cubes). This CASAguide will show you how to import this key module, build an array, do basic math, and cleverly access the contents of the array.

An overriding concern to keep in mind as you explore numpy is that numpy is fast and python is slow. What we mean is that the same operation using one of numpy's built in (C) functions will run orders of magnitude faster than using python to carry out the same operation via looping. If you have worked with arrays in IDL, MatLab, or a similar language before this approach should be familiar.

Importing NumPy

NumPy is not part of the basic python distribution. If you are not using CASA then you will need to download and install numpy. CASA includes numpy as pat of its basic distribution but keeps it as a separate module (similar to the "os" or "math" modules in the basic python distribution).

Once it is correctly installed, you can import numpy and gain access to its functionality in the following way.

import numpy as np

Note that this imports numpy as "np". Had we just typed "import numpy" you would have access to the functionality but substituting "numpy" for "np" throughout.

Now we can use numpy functions like so:

print( np.arange(10) )

"arange" is analogous to the basic python "range" but generates a numpy array.

Just to see how these arrays work, type:

print( np.arange(10) + 5 )

Notice how the 5 is broadcast to each element in the array. Similarly, we can write a bunch of powers of 2 by:

print( 2**np.arange(10) )

Your First Numpy Array

Before we get going, type np. and hit the tab key. You might be a bit freaked out by the large list (~560 possibilites at time of writing), but it's good to know that's there. Also remember that you can get help on each of these functions. Use ?np.median to bring up the same content as help np.median but with some useful header info.

First, let's make a simple numpy array and figure out how to get its basic properties. We can do this by hand like so:

ra = np.array([[0,1,2],[3,4,5]])
print(ra)

Or use the arange function and the reshape method to make the same array like so:

ra = np.arange(6).reshape(2,3)
print(ra)

We now have an array stored in the variable "ra". Let's get its basic properties. Again it may be useful to type ra. and tap <tab> first.

Get the shape of the array like so:

ra.shape

And its total number of elements like so:

ra.size

Its number of dimensions:

ra.ndim

Notice that we're not using () here, these are attributes of the array, ndim and size are ints, shape is a tuple, as we can see from:

type(ra.shape)

Now get the data type of our array

ra.dtype

Notice that numpy has a slightly different nomenclature for its data types than baseline python.

Also notice that some of the loosey-goosey casting from basic python is gone. Let's try to shove a float into our integer array.

print(ra)
ra[0,0] = 5.75
print(ra)

Notice that it got floored and entered as in int.

Now let's do something with our array, say square every element

ra**2

Or add it to itself

ra + ra

Notice it's doing element-by-element manipulation, the aforementioned broadcasting.

5 * ra

Array Creation

Now that we've seen some array basics, let's look a bit closer at how we can make them. There's a lot here, we'll just look at a couple of approaches.

We already saw the by-hand approach

ra = np.array([[0,1,2],[3,4,5]])
print(ra)
ra = np.array([0,1,2,3,4,5])
print(ra)

See how the square brackets control the shape? If you forget them entirely you'll get an unfortunate error.

We also saw the arange approach, which generates a sequential array of integers. It takes stop, start, and step as arguments like so:

ra = np.arange(5,10,1)
print(ra)
ra = np.arange(5,10,3)
print(ra)

It stops at the last element under 10

ra = np.arange(5,10.1,1.0)
print(ra)

See how the float step made the array a float?

ra = np.arange(5,10.1,1.0)
print(ra)

We can also give an explicity type to most array creation functions like so:

ra = np.arange(0,10,1,dtype=np.float32)
print(ra)

Otherwise it would have been an int

And there are many ways of doing similar things, a more stable float version of arange is linspace:

ra = np.linspace(1.0,2.0,11)
print(ra)

Which covers 1.0 to 2.0 inclusive with 11 elements.

We can also make arrays full of ones or zeros just specifying the size:

ra = np.zeros((3,3))
print(ra)
ra = np.zeros((3,3,3),dtype=np.bool)
print(ra)

And we can mimic one array with another

ara = np.zeros((3,3))
print(ara)
bra = np.zeros_like(ara)
print(bra)
bra = np.ones_like(ara)
print(bra)

Numpy will also implictly create arrays from array math:

ara = np.ones((3,3))
bra = ara + ara
print(bra)
cra = (ara == ara)
print(cra)

Of course, what we really want is to shove our astronomical data into a numpy array. We'll see several ways to do this in subsequent guides. For now just keep in mind that these 3 x 3 arrays of floats are like 4096 x 4096 CCD images or 3- or 4-dimensional radio data cubes.

Copies and Views: That Mutable/Immutable Stuff

Numpy arrays are objects and as a result you need to be a bit careful copying them. If you simply do this

dra = ara

it does not make a new array.

In detail try this:

test = np.array([1,2,3,4,5])
b = test
print(b)
test[2] = 6
print(b)

Notice how by changing test, we also change b!

In the above examples, dra just points at ara and b just points at test. The two in fact share the same data, so that changing one changes the other. You can check this via:

dra is ara
b is test

This gets a bit complicated but we would have been okay with:

b = test.copy()

or

dra = ara*1.0

so just be aware that there can be referencing issues with the sense that if you do not explicitly copy an array then you can end up with two views of the same array and not two truly independent arrays.

Math With Numpy

Here we'll see do basic math with numpy arrays. As above, we need to be sure to have numpy imported. Bring it in as np to make the examples below work.

import numpy as np

Basic Math

Numpy gives you the ability to efficiently and compactly do math on whole arrays. By default this is the element-wise manipulation that we saw earlier:

ara = np.arange(10).reshape(5,2)
print ara
bra = ara**2
print bra
bra = ara*3.0
print bra
cra = ara + bra
print cra

Note that to combine arrays in this way, you need them to have the same shape, not just the same number of elements. Because ara is two dimensional this doesn't work:

dra = np.arange(10)
print ara + dra

But you can reshape one array to match the other if they have the same number of elements

print ara + dra.reshape(ara.shape)

Though of course this only works if you have the same number of elements to start with.

Fancier Operations

Numpy can do more complex element-wise operations, for example trigonometric calculations:

from math import pi
ara = np.arange(11)/10.*2.0*pi
print ara
print np.sin(ara)
print np.cos(ara)

But note that you do need to be using the numpy not the math trigonometric functions. Math chokes on numpy arrays. For example try:

import math
ara = np.arange(11)/10.*2.0*pi
print math.sin(ara)

Not good! There is a way around this, using the the "map" functionality to compactly operate on each element of ara:

bra = np.array(map(math.sin, ara))
bra
print bra == np.sin(ara)

But this isn't ideal, it's still python-speed not numpy-speed. So do use the numpy functions when available.

Collapsing arrays

Numpy also has functions to also "collapse" arrays in useful ways. For example, take the sum of all elements:

ara = np.arange(11)
print np.sum(ara)

or the median value:

print np.median(ara)

You can take the mininum and maximum like so:

print np.min(ara), np.max(ara)

These collapsing functions can also operate selectively along one or more axes. For example, take the world's simplest data cube (z, y, x):

ara = np.arange(50).reshape(2,5,5)

Collapse it along the first axis (e.g., to make a moment map):

im = np.sum(ara,axis=0)
print im.shape

Get the maximum along the first axis (e.g., to make a peak intensity map):

im = np.max(ara,axis=0)
print im.shape

To collapse the cube along the second and third axes (e.g., to make a spectrum) you can use two successive sums:

spec = np.sum(np.sum(ara,axis=2),axis=1)
print spec

Or you can use the apply_over_axes function, which exists for exactly this purpose:

spec2 = np.apply_over_axes(np.sum,ara,[1,2])
print spec2

Note that apply_over_axes preserves the number dimensions of the original so for a totally equivalent call you'd want:

spec2 = (np.apply_over_axes(np.sum,ara,[1,2])).flatten()

The flatten makes the array one-d. The complementary function here is apply_along_axis

Numpy can also do cumulative sums for you via

ara = np.arange(11)
print ara
print np.cumsum(ara)

Masking and Element-wise Logic

Truth statements also operate element-wise in numpy. For example try

ara = np.arange(11)
print ara > 5
bra = np.arange(11)*2.0 - 5.0
print ara > bra

You can see that you get an element-wise comparison.

This is very useful for data analysis. A common radio astronomy trick is to make a "mask" that considers only part of an image, defined subject by some condition. Here we mask "ara" where it is less than "bra":

mask = ara > bra

and we now have an array of Trues and Falses called mask. Booleans multiply as ones and zeros, so that multiply a mask times an array gives zero for values that are "False" in the mask:

print ara*mask

Linear Algebra

You have probably noticed that so far we aren't really working with matrices. We're working with n-dimensional data sets and broadcasting operations to each datum. Numpy also includes linear algebra functionality. The simplest example is a vector dot product:

avec = [0, 1, 2]
bvec = [1, 3, 0]
print np.dot(avec, bvec)

The dot function in fact does general matrix multiplication. So to multiply a matrix by a vector:

ara = [[1,3],[2,2]]
vec = [1,4]
print np.dot(ara,vec)

Or to multiply two matrices together:

ara = [[1,3],[2,2]]
bra = [[1,2],[4,0]]
print np.dot(ara,bra)

It's similarly simple to transpose a matrix (and note that transpose is very important for you mental health as you combine numpy and CASA images):

ara = np.array([[0,1],[2,3]])
print np.transpose(ara)

You can also take the inverse or the trace:

print ara
print np.invert(ara)
print np.trace(ara)

You can solve systems of linear equations. To solve this system of equations:

 3 x + y = 9

 x + 2 y = 8

you would call solve inside the linear algebra sub-module like so:

mat = np.array([[3, 1],[1,2]])
vec = np.array([9,8])
print np.linalg.solve(mat,vec)

Slicing and Iteration With Numpy

Getting at parts of the array in clever ways can make the difference between lightning-fast code and taking forever to do anything (either because you have to write a million lines of code or because the program is slow). Numpy has very powerful slicing and iteration capabilities. Again make sure that we have numpy as imported as np before stepping through these examples:

import numpy as np

Slicing

(Note that unless you copy explicitly, slicing just points - returning 'views'. Use .copy() to get a copy.)

The most basic slicing is that you can just index an array using square brackets to get the value at a particular point. Let's define a basic 5 by 5 array and have a look at different ways to access the elements.

ara = np.arange(25).reshape(5,5)
print ara
print ara[0,0]
print ara[1,0]
print ara[0,2]

(Also note that last index refers to the fastest moving part of the array. This is important to keep in mind.)

The syntax y:x gets you elements y through x, excluding x. :x means "everything until x" so that

print ara[:3,0]

prints elements 0, 1, and 2 in column 0. Meanwhile just a : gets everything so that

print ara[:,0]

prints the whole column. Try

print ara[0,:]
print ara[0]

and joining three numbers by colons, y:x:z, gives you y through x not including x and stepping by z each time. So, switching to a one-d array here, we have:

bra = np.arange(16)
print bra
print bra[4:12]
print bra[4:12:2]

print elements 4-12 and 4-12 stepping by twos. Remember that you never get that last element (we saw this with lists already)!

Similar to lists, you can do negative indexing from the end of the column or row.

print ara
print ara[-1,0]

You can leave off the later axes of an array and it will implictly place : in these. If you just want to get at the last axis of an array, you can use the ellipsis. For example, define a 3 dimensional array and try to index the last spot:

ara = np.arange(125).reshape(5,5,5)
print ara[...,0]
print ara[:,:,0]

the last two are the same.

Indexing With Arrays

You can also index arrays with other arrays. For example:

ara = np.arange(15)*2
bra = np.array([1,5,12])
print ara[bra]

but note that the results may not be intuitive for higher dimensional arrays. For example, this:

ara = np.arange(25).reshape(5,5)
bra = np.array([[1,2],[1,3]])
print ara
print bra
print ara[bra]

is filling in an array of size bra with rows pulled from ara.

that is you get bra[0,0] set to hold ara[1]:

print bra[0,0]
print ara[1]

which is a 1-d array and so the whole thing grows by a dimension

print ara[bra].shape

this can be useful but requires some careful thinking for high-dimension arrays.

Possibly more intuitive, you can index an array with a tuple of lists (or a tuple of arrays):

ara = np.arange(25).reshape(5,5)
ind = ([0,1,2], [0,3,4])

and it does what you might expect, pairing the first element of each list to give you a series of values at these coordinates as a one-d numpy array.

print ara[ind]
print ara[0,0], ara[1,3], ara[2,4]

Another trick is to index with a matched-size boolean array. This will return the values from the places where the indexing array was set to True.

ara = np.arange(25).reshape(5,5)
bra = ara > 15
print bra
print ara[bra]

This is also potentially very useful, especially in combination with astronomical masking and/or the indices function (see below).

Finding the Elements of an Array That Meet Some Condition

There are a few really indispensible more advanced tricks for image processing stuff. First and foremost, you need to be able to define some condition and extract the elements of an array where that condition is met. At the most basic level, this means that you want to get back an array of coordinates where some user-defined condition is "True". IDL does this with "where", numpy does it most directly with the .nonzero() method (or indexing via boolean arrays). ".nonzero()" gives you back the set of indices where the array calling the method is not zero (remember that False is zero). So:

ara = np.arange(25).reshape(5,5)
ara = ara*(ara > 15)
ind = ara.nonzero()
print ara, ind
print ara[ind]

You can skip the middle step and just call .nonzero() on the (ara > 15) ... remember that this has made a matched-size boolean array.

ara = np.arange(25).reshape(5,5)
ind = (ara > 15).nonzero()
print ara, ind
print ara[ind]

This is your IDL "where."

Numpy's actual where command is a little different. It lets you do lightning-fast search and replaces. The statement:

ara = np.arange(25).reshape(5,5)
bra = np.where(ara > 15, ara, 100)

tells numpy: "where ara > 15 use ara else use 100" and the result is returned as a new array. This new array looks like:

print bra

Indices

A final useful bit of functionality for handling data cubes is that the indices function generates arrays of indices to match a specific array shape. You call it on the shape of the array:

ara = np.ones((5,5,3))
ind = np.indices(ara.shape)

and notice that it has an extra dimension compared to ara. This axis is ndim long and tells you which axis you are pulling the index from (e.g., ind[0,x,y,z] is the index of x,y,z along the first dimension ind[1,x,y,z] is the index along the second dimension, and so on). So:

print ind.shape

And this gives the index along the first dimension:

print ind[0], ind[0].shape

This gives the index along second dimension:

print ind[1], ind[1].shape

and so on:

print ind[2], ind[2].shape

You could, at the cost of memory, more or less completely replicate .nonzero() with indices and boolean array indexing.