Sunday, November 22, 2020

Python | Tutorial: Introduction to Numpy

While this post has (almost) nothing to do with atmospheric science, it is an important introduction to the Numpy package, which is how you will be storing and working with data in Python.  This tutorial is heavily geared towards beginners and those who want to strengthen their background understanding of Numpy.  If you're already comfortable using Numpy to work with data, there probably isn't a whole lot new in this tutorial for you to learn, but who knows.

As stated in the introduction to Python post, Python stores vector data in lists, tuples, and dictionaries.  Numpy is an all encompassing package that transforms these types into numerical arrays and allows you to carry out efficient, vectorized, and higher order numerical functions.  To demonstrate using a simple example, if you wanted to add to vectors stored as Python lists:

list1=[5,5,5]
list2=[5,5,5]

list3=list1+list2

You end up with the result:

list3=[5, 5, 5, 5, 5, 5]

Which is not addition, but rather list concatenation.  To actually add the numbers together to get a vector of length three with "10s" for each value, you need to loop through the list.  To demonstrate how to do this with lists here is simple function that uses the zip function to access both lists at once:

def list_add(list1,list2):
    list3=[]
    for i,j in zip(list1,list2):
        list3.append(i+j)
    return list3

In the above function, a third list (creatively named "list3") is initialized with no values, and a "for loop" is used to loop through the input lists and add the list values together, storing them in list3.


list3=list_add(list1,list2)

result: [10, 10, 10]

which is the result we were looking for.  Great, we can now add vectors together using lists.  Although, addition is a pretty simple thing, and you can probably imagine the "list_add" function starts to get really complex as you grow to multidimensional arrays.  Enter Numpy.  Numpy allows you to define arrays, which behave much more intuitively for data analysis.  I fact, I don't think it's too much of a stretch to say that, almost 90-95% of the data structures you will ever work with in Python will involve using Numpy arrays.  Building off of the above example, if we repeat the exercise, only replacing "list1" and "list2" with Numpy arrays, call them "array1" and "array2":


import numpy as np

array1=np.array([5,5,5])
array2=np.array([5,5,5])

print(array1+array2)

result: [10, 10, 10]

Much simpler, in contrast to adding the two lists together, which required defining a function and using a for loop, simply adding the two arrays gives the correct solution.  Additionally, since Numpy is optimized, it is way more efficient to use Numpy array operations than it is to use loops.

Numpy array intrinsic attributes and functions

What's great about Numpy, is that it takes full advantage of Python object oriented framework when storing data.  Specifically, when you define a Numpy array, it isn't only a block of numbers, it's an entire data object with attributes and functions attached to it that make it more intuitive to work with.  To briefly demonstrate, a couple of the more useful attributes, a two-dimensional array will be defined using the np.ones function (which defines an array of ones given specific dimensions):

ones=np.ones((10,5)) #Note passing the array dimensions as a list instead of a tuple works too

This will define an array of 5 columns and 10 rows, we can confirm that using the shape attribute:

print(ones.shape)
 
result: (10,5)

You can also find out what kind of array it is using the dype attribute (e.g., is it an integer or floating point array).

print(ones.dype)
 
result: float64

You can also easily transpose the array:

ones_transpose=ones.T
print(ones.shape,ones_transpose.shape)
 
result:(10,5),(5,10)

Or define a new array with the same size and attributes:

twos=np.ones_like(ones)*2

This is just a small sampling of the features and functions intrinsic to any defined Numpy array, that I happen to find most useful.  There are dozens more that I won't describe here.

Numpy: My Top 5 Numpy functions for Atmospheric Science Research.

Perhaps this heading should be at the very top of this blog post, since the previous sections mostly dealt with trying to provide a cursory explanation of what makes Numpy distinct and powerful.  These are the 5 functions I use the most, not including the numpy.ma (masking) package which is extremely useful, but needs a dedicated blog post to adequately describe.
  1. Numpy linspaceand arange

    Two functions for the price of one.  Basically, these functions allow you to quickly create vectors that span a range.  The difference between the two functions is best summarized as follows: 
    • arange takes up to three arguments: start, end, increment
    • linspace also takes up to three arguments: start, end, vector length 

  2. lines=np.linspace(0,10,11)
    arange=np.arange(0,10,2)
    
    print(lines)
    print(arange)
    
    result (linspace): [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
    result (arange): [0 2 4 6 8]
    

    A few things to unpack from this example, note that the arange vector stops at 8, despite the second argument being 10.  This is because the non-inclusive nature of Python for the right hand side:  Basically, arange is incrementing 2 and stops before it hits 10.  Linspace on the other hand just builds a range interpolating between 0 and 10 that has 11 values.
  3. Numpy Meshgrid

    I use meshgrid a lot when I need a 2D spatial array for mapping purposes, but only have two 1D vectors that define my coordinate system.  I don't know of a much better way to describe it beyond that, so I'll just throw up an example.

    lon=np.linspace(0,360,100)
    lat=np.linspace(-90,90,50)
    
    lon,lat=np.meshgrid(lon,lat)
    
    print(lon.shape)
    
    result: (50,100)
    

    In this example, the 1D vectors are rebroadcasted into 2D arrays taking on the shape of the original 1D vectors in each dimension.  Now, you have 2D coordinate information that is often required for certain functions (e.g., 2D interpolation, or regridding) Be warned: when using this function, if you redefine lat/lon as 2D arrays (as in this example), and you for some reason try to run meshgrid again, for instance as part of a loop or in a notebook setting, it's very easy to run out of memory. 

  4. Numpy Concatenate

    Concatenate is a great way to build an array without knowing it's dimensions up front.  For example, lets say you have a collection of netCDF files with gridded data, and all of these files have matching spatial coordinates, but each file has a different number of time variables, and different time coordinates.  If you want to put all of this information into an array, you can't simply define the array size up front, you need to add to the time dimension as you loop through each file.  Concatenate allows you to simply take an array and append it to another array along a specified dimension.  To simplify the example here: 
    array1=np.ones((5,50,40))
    array2=np.ones((3,50,40))
    array3=np.ones((7,50,40))
    
    array_total=np.concatenate((array1,array2),axis=0)
    array_total=np.concatenate((array_total,array3),axis=0)
    
    print(np.shape(array_total))
    
    result: (15,50,40)
    

    See that the end result has 15 in the 0 axis dimension (5+3+7).  Note that ALL axes except the specified axis must match, or the function will crash. While concatentate (and it's relative functions hstack,vstack and dstack) are convenient, beware, it is still much better to define your entire array up front and load data into it if you know the array size.  This is because Concatentate is relatively inefficient because it requires that the entire array is reallocated into memory every time you run it, which can get really slow for large arrays.
  5. Numpy Mean and Nanmean

    These are pretty self explanatory.  These functions simple take an average of an array, allowing you to specify an axis as well.  The "axis" command is really handy if you want to do a time-average while preserving the spatial dimensions, for instance.

    arr1=np.linspace(0,20,5)
    print(np.mean(arr1))
    
    result: 10.0
    
    ## set the 3rd index to a nan
    arr1[3]=np.nan
    print(np.nanmean(arr1))
    
    result: 8.75  

    As you noticed, once you change the 3rd index value from 15 to a nan, the function now takes the mean 0,5,10,20 (which is 8.75) instead of 0,5,10,15,20 (which is 10).  However, if you try to run plain "mean" on an array with nans in it, the returned value will also be a nan.
  6. Numpy diff and gradient


    Theses functions are pretty similar, in that they allow you to take a gradient of a function.  diff is the simpliest of the two functions listed here, and just returns Δx along a specified dimension.  This will reduce your array length along the specified dimension by 1.  Gradient is similar, but comes with a few extra bells and whistles:
    • Gradient preserves the array length by using numerical approximations along the boundaries
    • Allows you to specify a sample distance for each axis (i.e., the dx for dy/dx)
    • If no "axis" argument is supplied will compute the gradient in ALL dimensions and return a tuple of length n-dimensions with each index corresponding to the gradient along that index.
    Most often, I use "gradient" especially if I need to multiply the gradient against something because I can be somewhat lazy, and I don't want to have to deal with different sized arrays/vectors. The example I will provide will be show how to use these functions to approximate the derivative of "sin(x)" which is analytically (of course) "cos(x)".  Note that this example uses matplotlib.

    import numpy as np
    from matplotlib import pyplot as plt
    
    x=np.linspace(0,2*np.pi,40)
    array=np.sin(x)
    dx=x[1]-x[0]
    
    xstag=(x[:-1]+x[1:])/2.
    
    plt.figure()
    plt.plot(x,np.gradient(array,dx),label='d/dx(sin(x)) (gradient)',lw=4.4)
    plt.plot(x,array,label='Sin(x)')
    plt.plot(xstag,np.diff(array)/dx,label="d/dx(sin(x)) (diff)",ls='--',color='r')
    plt.legend()
    plt.show()
    

    This will generate the following figure: 
    sin(x) and it's approximated derivative (cos(x))





    I played with the line colors and thicknesses to make it easier to differentiate the "diff" approximation from the "gradient" approximation.  Note that to correctly plot the "diff" function I had to create a new version of "x" called "xstag" which basically shifted the x-center points 0.5dx, and shortened the array length (n) by one simply by averaging the first n-1 array elements with the last n-1 array elements.  Also note, that I obtained dx by simply differencing the first two array elements.


Final Notes:

While this post had little to do with atmospheric science, I hope it provided a useful introduction to the Numpy object class.  For more information on Numpy and its capabilities please visit the official site.

1 comment:

  1. This comment has been removed by a blog administrator.

    ReplyDelete