As stated in the introduction to Python post, Python stores vector data in lists, tuples, and dictionaries. Numpy is an all encompassing package that transforms these types into numerical arrays and allows you to carry out efficient, vectorized, and higher order numerical functions. To demonstrate using a simple example, if you wanted to add to vectors stored as Python lists:
list1=[5,5,5] list2=[5,5,5] list3=list1+list2
You end up with the result:
list3=[5, 5, 5, 5, 5, 5]
Which is not addition, but rather list concatenation. To actually add the numbers together to get a vector of length three with "10s" for each value, you need to loop through the list. To demonstrate how to do this with lists here is simple function that uses the zip function to access both lists at once:
def list_add(list1,list2): list3=[] for i,j in zip(list1,list2): list3.append(i+j) return list3
In the above function, a third list (creatively named "list3") is initialized with no values, and a "for loop" is used to loop through the input lists and add the list values together, storing them in list3.
list3=list_add(list1,list2) result: [10, 10, 10]
which is the result we were looking for. Great, we can now add vectors together using lists. Although, addition is a pretty simple thing, and you can probably imagine the "list_add" function starts to get really complex as you grow to multidimensional arrays. Enter Numpy. Numpy allows you to define arrays, which behave much more intuitively for data analysis. I fact, I don't think it's too much of a stretch to say that, almost 90-95% of the data structures you will ever work with in Python will involve using Numpy arrays. Building off of the above example, if we repeat the exercise, only replacing "list1" and "list2" with Numpy arrays, call them "array1" and "array2":
import numpy as np
array1=np.array([5,5,5]) array2=np.array([5,5,5]) print(array1+array2) result: [10, 10, 10]
Much simpler, in contrast to adding the two lists together, which required defining a function and using a for loop, simply adding the two arrays gives the correct solution. Additionally, since Numpy is optimized, it is way more efficient to use Numpy array operations than it is to use loops.
Numpy array intrinsic attributes and functions
What's great about Numpy, is that it takes full advantage of Python object oriented framework when storing data. Specifically, when you define a Numpy array, it isn't only a block of numbers, it's an entire data object with attributes and functions attached to it that make it more intuitive to work with. To briefly demonstrate, a couple of the more useful attributes, a two-dimensional array will be defined using the np.ones function (which defines an array of ones given specific dimensions):ones=np.ones((10,5)) #Note passing the array dimensions as a list instead of a tuple works too
This will define an array of 5 columns and 10 rows, we can confirm that using the shape attribute:
print(ones.shape)
result: (10,5)
You can also find out what kind of array it is using the dype attribute (e.g., is it an integer or floating point array).
print(ones.dype)
result: float64
You can also easily transpose the array:
ones_transpose=ones.T print(ones.shape,ones_transpose.shape)
result:(10,5),(5,10)
Or define a new array with the same size and attributes:
twos=np.ones_like(ones)*2
This is just a small sampling of the features and functions intrinsic to any defined Numpy array, that I happen to find most useful. There are dozens more that I won't describe here.
Numpy: My Top 5 Numpy functions for Atmospheric Science Research.
Perhaps this heading should be at the very top of this blog post, since the previous sections mostly dealt with trying to provide a cursory explanation of what makes Numpy distinct and powerful. These are the 5 functions I use the most, not including the numpy.ma (masking) package which is extremely useful, but needs a dedicated blog post to adequately describe.Numpy linspaceand arange
Two functions for the price of one. Basically, these functions allow you to quickly create vectors that span a range. The difference between the two functions is best summarized as follows:
- arange takes up to three arguments: start, end, increment
- linspace also takes up to three arguments: start, end, vector length
Numpy Meshgrid
I use meshgrid a lot when I need a 2D spatial array for mapping purposes, but only have two 1D vectors that define my coordinate system. I don't know of a much better way to describe it beyond that, so I'll just throw up an example.
lon=np.linspace(0,360,100) lat=np.linspace(-90,90,50) lon,lat=np.meshgrid(lon,lat) print(lon.shape) result: (50,100)
In this example, the 1D vectors are rebroadcasted into 2D arrays taking on the shape of the original 1D vectors in each dimension. Now, you have 2D coordinate information that is often required for certain functions (e.g., 2D interpolation, or regridding) Be warned: when using this function, if you redefine lat/lon as 2D arrays (as in this example), and you for some reason try to run meshgrid again, for instance as part of a loop or in a notebook setting, it's very easy to run out of memory.Numpy Concatenate
Concatenate is a great way to build an array without knowing it's dimensions up front. For example, lets say you have a collection of netCDF files with gridded data, and all of these files have matching spatial coordinates, but each file has a different number of time variables, and different time coordinates. If you want to put all of this information into an array, you can't simply define the array size up front, you need to add to the time dimension as you loop through each file. Concatenate allows you to simply take an array and append it to another array along a specified dimension. To simplify the example here:
array1=np.ones((5,50,40)) array2=np.ones((3,50,40)) array3=np.ones((7,50,40)) array_total=np.concatenate((array1,array2),axis=0) array_total=np.concatenate((array_total,array3),axis=0) print(np.shape(array_total)) result: (15,50,40)
See that the end result has 15 in the 0 axis dimension (5+3+7). Note that ALL axes except the specified axis must match, or the function will crash. While concatentate (and it's relative functions hstack,vstack and dstack) are convenient, beware, it is still much better to define your entire array up front and load data into it if you know the array size. This is because Concatentate is relatively inefficient because it requires that the entire array is reallocated into memory every time you run it, which can get really slow for large arrays.Numpy Mean and Nanmean
These are pretty self explanatory. These functions simple take an average of an array, allowing you to specify an axis as well. The "axis" command is really handy if you want to do a time-average while preserving the spatial dimensions, for instance.
arr1=np.linspace(0,20,5) print(np.mean(arr1)) result: 10.0 ## set the 3rd index to a nan arr1[3]=np.nan print(np.nanmean(arr1)) result: 8.75
As you noticed, once you change the 3rd index value from 15 to a nan, the function now takes the mean 0,5,10,20 (which is 8.75) instead of 0,5,10,15,20 (which is 10). However, if you try to run plain "mean" on an array with nans in it, the returned value will also be a nan.Numpy diff and gradient
Theses functions are pretty similar, in that they allow you to take a gradient of a function. diff is the simpliest of the two functions listed here, and just returns Δx along a specified dimension. This will reduce your array length along the specified dimension by 1. Gradient is similar, but comes with a few extra bells and whistles:
- Gradient preserves the array length by using numerical approximations along the boundaries
- Allows you to specify a sample distance for each axis (i.e., the dx for dy/dx)
- If no "axis" argument is supplied will compute the gradient in ALL dimensions and return a tuple of length n-dimensions with each index corresponding to the gradient along that index.
import numpy as np from matplotlib import pyplot as plt x=np.linspace(0,2*np.pi,40) array=np.sin(x) dx=x[1]-x[0] xstag=(x[:-1]+x[1:])/2. plt.figure() plt.plot(x,np.gradient(array,dx),label='d/dx(sin(x)) (gradient)',lw=4.4) plt.plot(x,array,label='Sin(x)') plt.plot(xstag,np.diff(array)/dx,label="d/dx(sin(x)) (diff)",ls='--',color='r') plt.legend() plt.show()
This will generate the following figure:sin(x) and it's approximated derivative (cos(x))
lines=np.linspace(0,10,11) arange=np.arange(0,10,2) print(lines) print(arange) result (linspace): [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.] result (arange): [0 2 4 6 8]
A few things to unpack from this example, note that the arange vector stops at 8, despite the second argument being 10. This is because the non-inclusive nature of Python for the right hand side: Basically, arange is incrementing 2 and stops before it hits 10. Linspace on the other hand just builds a range interpolating between 0 and 10 that has 11 values.
Final Notes:
While this post had little to do with atmospheric science, I hope it provided a useful introduction to the Numpy object class. For more information on Numpy and its capabilities please visit the official site.
This comment has been removed by a blog administrator.
ReplyDelete