Sunday, November 22, 2020

Python | Tutorial: Introduction to reading netCDF files

Introduction

NetCDF (Network Common Data Format) is an extremely useful way to store, transfer, subset, and otherwise work with gridded data sets.  This format is commonly encountered in the atmospheric sciences, climate, and oceanography communities.  It is much easier to work with than the "grib" data format, as it allows metadata attributes to be attached directly to the data file.  Finally, there are numerous stand-alone programs (e.g., ncview, nco) and bindings (Fortran, Python, IDL, Matlab, GrADS; see here for my post on netCDF with GrADS) that make it easy to work with NetCDF data regardless of platform or programming language.  Because NetCDF is so common, and because most of the blog posts on this site require some knowledge of how to work with NetCDF data, I'm going to provide a short tutorial on how to read and work with NetCDF files in Python.  In this tutorial, you will learn how to:
  • Open a NetCDF file using "netCDF4"
  • Query the file attributes, dimensions, and variables
  • Convert NetCDF numeric time to Python "datetime" objects
  • Extract monthly temperature anomaly
  • Interface with the NOMADS OPeNDAP data server
Important Note: There are two well-supported packages (that I'm aware of) that can open and translate NetCDF data: netCDF4 and xarray.  This tutorial focuses on the netCDF4 package; perhaps I'll post an xarray tutorial later on.  If you're already reading and working with NetCDF data using xarray, and you have no intention of changing, then maybe this tutorial isn't for you.

Opening and Querying a NetCDF file

First off, if you are going to work with the NetCDF data, you will need to have the NetCDF4 package installed and imported.

Once you have that package installed, we'll need to grab some data.  For this tutorial, I'll be using monthly mean air temperature data from the NCEP reanalysis.  To get this data, go to this website and download the "air.mon.mean.nc" file and place it in the same directory as the script you'll be working with.  Once that is done, you're ready to proceed with the tutorial.

The first step is to import the relevant packages:

import numpy as np
import netCDF4 as NC
from matplotlib import pyplot as plt 
from cartopy import crs as ccrs
import datetime as DT

I included the Cartopy import to plot the data on a map.  If you don't have this package and don't want to install it, that's fine: simply remove that line from the import section and skip the mapping portion of the tutorial.

Once the packages are imported, set the path to the file, and use the "Dataset" function to open the data file:

## Since file is in current folder, the path is just the file name
ncdata_path='air.mon.mean.nc'
ncdata=NC.Dataset(ncdata_path,'r')
print(ncdata)

This will result in the following output to the screen:

root group (NETCDF4_CLASSIC data model, file format HDF5):
    description: Data from NCEP initialized reanalysis (4x/day).  These are the 0.9950 sigma level values
    platform: Model
    Conventions: COARDS
    NCO: 20121012
    history: Thu May  4 20:11:16 2000: ncrcat -d time,0,623 /Datasets/ncep.reanalysis.derived/surface/air.mon.mean.nc air.mon.mean.nc
Thu May  4 18:11:50 2000: ncrcat -d time,0,622 /Datasets/ncep.reanalysis.derived/surface/air.mon.mean.nc ./surface/air.mon.mean.nc
Mon Jul  5 23:47:18 1999: ncrcat ./air.mon.mean.nc /Datasets/ncep.reanalysis.derived/surface/air.mon.mean.nc /dm/dmwork/nmc.rean.ingest/combinedMMs/surface/air.mon.mean.nc
/home/hoop/crdc/cpreanjuke2farm/cpreanjuke2farm Mon Oct 23 21:04:20 1995 from air.sfc.gauss.85.nc
created 95/03/13 by Hoop (netCDF2.3)
Converted to chunked, deflated non-packed NetCDF4 2014/09
    title: monthly mean air.sig995 from the NCEP Reanalysis
    References: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.derived.html
    dataset_title: NCEP-NCAR Reanalysis 1
    dimensions(sizes): lat(73), lon(144), time(863)
    variables(dimensions): float32 lat(lat), float32 lon(lon), float64 time(time), float32 air(time,lat,lon)
    groups: 

Note that a lot of the information above is probably not relevant to you, but you can see some pieces of information in there that might be useful.  One thing you'll notice is that the printed output contains colon (:) separated text; these lines correspond to the global metadata attributes attached to the file, and they can be accessed individually using the following syntax.
For example, accessing the file "description":

print(ncdata.description)

Returns:

Data from NCEP initialized reanalysis (4x/day).  These are the 0.9950 sigma level values

You'll notice this matches the text after the colon following the word "description".

Though, it can be kind of burdensome to print the entire netCDF file just to see what's in it before you can start extracting data.  One way to find out what attributes the file has, without dumping the entire file to the screen, is the "ncattrs" method, a shorthand for NetCDF (nc) attributes (attrs):

print(ncdata.ncattrs())

Returns:

['description', 'platform', 'Conventions', 'NCO', 'history',
'title', 'References', 'dataset_title']

You can use the "getncattr" method to return the value of a given attribute.  Following the example above:

print(ncdata.getncattr('description'))

Returns:

Data from NCEP initialized reanalysis (4x/day).  These are the 0.9950 sigma level values

You may be asking the question: why would I use the "getncattr" method to get an attribute instead of just using the shorthand "ncdata.description"?  The advantage is that getncattr takes the attribute name as a string, whereas the shorthand does not, which makes it more versatile in case you don't know what the attributes are ahead of time.

Loading NetCDF variables and Dimensions

Dimensions:

The dimensions structure includes the dimensions used in the file.  In geographic datasets, these are typically x, y, z, and time.  x and y are often longitude and latitude, but not always (for example, WRF data has the x and y dimensions as "west_east" and "south_north").  z is usually a vertical coordinate (sigma, height, pressure, etc.), and time is usually a floating-point value referenced to a specific starting time.  The file dimensions can be printed simply:

print(ncdata.dimensions)

Returns:

OrderedDict([('lat', <class 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 73), 
('lon', <class 'netCDF4._netCDF4.Dimension'>: name = 'lon', size = 144), 
('time', <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 863)])

What you get is a dictionary of dimensions with names and sizes: in this instance, latitude, longitude, and time.  The key point about dimensions is that EACH variable's dimensions will be some combination of the entries in this dimensions dictionary.

Variables:

Similarly, the variables can be accessed:

for v in ncdata.variables:
    print(v)

Returns:
lat
lon
time
air

Similar to the dimensions object, the variables object is a dictionary whose entries can be accessed using keys.  To demonstrate, I'm going to access the "time" variable and convert the time values into Python datetime objects.
Printing the time variable:

print(ncdata.variables['time'])

Returns:
<class 'netCDF4._netCDF4.Variable'>
float64 time(time)
    long_name: Time
    delta_t: 0000-01-00 00:00:00
    avg_period: 0000-01-00 00:00:00
    prev_avg_period: 0000-00-01 00:00:00
    standard_name: time
    axis: T
    units: hours since 1800-01-01 00:00:0.0
    actual_range: [1297320. 1927008.]
unlimited dimensions: time
current shape = (863,)
filling on, default _FillValue of 9.969209968386869e+36 used

This tells us that the time data is stored as a 64-bit (double) float, in hours since 1800-01-01 00 UTC.  We also see that its only dimension is the "time" dimension.  From this, we can use the datetime strptime and timedelta functions to convert the floating-point values into a list of datetime objects.  Knowing the string format of the "units" attribute, the starting time object is parsed as follows:

starttime=DT.datetime.strptime(ncdata.variables['time'].units,
        'hours since %Y-%m-%d %H:%M:0.0')

Now, pulling the time data into a Numpy array, and using list iteration, we can build a list of datetime objects:

time_values=ncdata.variables['time'][:]
strptimes=[starttime+DT.timedelta(hours=i) for i in time_values]

Okay, now that we have a list of times, let's use it to get a time index for a specific input time.  For example: March 1997:

choice='1997-03-01'
choice_strp=DT.datetime.strptime(choice,'%Y-%m-%d')
tidx=np.argmin(np.abs(np.array(strptimes)-choice_strp))

print("Index:%i, time:"%tidx, strptimes[tidx])

Returns:

Index:590, time: 1997-03-01 00:00:00

Now, finally, let's plot the data!  Looking back at the variables, we see lat, lon, and air variables as well.  Each variable can be printed to find out its metadata but, for brevity, I'll just tell you that lon and lat are each 1-dimensional arrays that vary with their corresponding dimensions, and air is air temperature varying in (time, lat, lon).

Accessing the data for each variable is simple:

t2=ncdata.variables['air'][tidx,:]
lon=ncdata.variables['lon'][:]
lat=ncdata.variables['lat'][:]

and plotting it:

plt.figure(figsize=(10,7))
plt.pcolormesh(lon,lat,t2,cmap='jet')
plt.show()
Simple air temperature plot

Or, if you want to include the Cartopy mapping:

plt.figure(figsize=(10,7))
ax=plt.subplot(111,projection=ccrs.PlateCarree())
Z=ax.pcolormesh(lon,lat,t2,cmap='jet')
plt.colorbar(Z)
ax.coastlines()
plt.show()

With a map and colorbar.

Finally, since 1997 was a famous El Niño year, we can plot the annual temperature anomaly by passing a tuple of time indices to the time axis of the data array and plotting the annual mean temperature anomaly.

month_idx=[i for i in range(len(strptimes)) if strptimes[i].strftime('%Y')=='1997']
nino_mean=np.mean(ncdata.variables['air'][tuple(month_idx),:],axis=0)
total_mean=np.mean(ncdata.variables['air'][:],axis=0)

plt.figure(figsize=(10,7))
ax=plt.subplot(111,projection=ccrs.PlateCarree())
Z=ax.pcolormesh(lon,lat,nino_mean-total_mean,cmap='coolwarm',vmin=-2.5,vmax=2.5)
plt.title("Annual Mean Temperature Anomaly: 1997")
plt.colorbar(Z)
ax.coastlines()
plt.show()

Basically, we build a tuple of indices that matches where the time array indicates 1997, then average the temperature data over all of those indices to get the 1997 mean temperature.  We then compute a total mean, which takes the mean along the entire time axis, and difference the two fields.  This results in the following image:
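The same averaging logic can be sketched with a toy numpy array standing in for the 'air' variable (the shapes and index values here are made up for illustration):

```python
import numpy as np

# Toy (time, lat, lon) array: 4 "months" on a 2x3 grid, random values.
rng = np.random.default_rng(0)
air = rng.normal(size=(4, 2, 3))

year_idx = [1, 2]                              # time indices for the chosen year
year_mean = air[year_idx, :, :].mean(axis=0)   # mean over the selected times
total_mean = air.mean(axis=0)                  # mean over the full record
anomaly = year_mean - total_mean               # per-gridpoint anomaly

print(anomaly.shape)
```

Averaging with axis=0 collapses the time dimension, so the anomaly keeps the (lat, lon) shape of the grid and can be passed straight to pcolormesh.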
1997 ENSO event shown in the 0.995 sigma NCEP reanalysis temperature field

Interfacing with the NOMADS server

One great aspect of the NetCDF library is that you can access near-real-time model data from the NOAA NOMADS data server.  This section is a short add-on, since there isn't a whole lot new to add, but it will show you how to access near-real-time model data.

First, a couple of simple lines of code are required to a) get today's date, and b) define a file path to the data.  In this example, I'll use the 0.25-degree GFS data; explore the NOMADS server for more options.  Assuming you have imported the packages above:

today=DT.datetime.now().strftime('%Y%m%d')
input_file='http://nomads.ncep.noaa.gov:80/dods/gfs_0p25/gfs%s/gfs_0p25_00z'%today
gfsdata=NC.Dataset(input_file,'r')

Now, you are welcome (encouraged?) to query the file using the same techniques as above, but for brevity, I'll just write down the code to store the lat/lon values into arrays (which, similarly, are 1-D arrays corresponding to x and y).  Note that time is stored in a very similar way as in the NCEP reanalysis file and can be handled in an identical way, but won't be demonstrated here.

lons=gfsdata.variables['lon'][:]
lats=gfsdata.variables['lat'][:] 

Then the data can be extracted and plotted (in this example, 2 meter temperature at time index 6):

t2=gfsdata.variables['tmp2m'][6,:]

plt.figure(figsize=(11,9))
ax=plt.subplot(111,projection=ccrs.PlateCarree())
plt.pcolormesh(lons,lats,t2,cmap='jet')
plt.colorbar()
plt.title(gfsdata.variables['tmp2m'].long_name)
ax.coastlines()
plt.show()

Note that the above example gets the figure title directly from the NetCDF metadata.

Finally, it's always a good practice to close your NetCDF files when you are done using them:

gfsdata.close()

Note that when you close a NetCDF file, any data that you have loaded into a Numpy array is still accessible.  However!  Any variables that are just pointing to the NetCDF object are no longer accessible.  For example:

t2_pointer=gfsdata.variables['tmp2m']
t2_values=gfsdata.variables['tmp2m'][6,:]

gfsdata.close()
 
t2_values ## Still here!
t2_pointer ## GONE!

Wrap up

That's it for the introduction to NetCDF tutorial.  There is a lot left to be discovered and learned when it comes to working with NetCDF data using Python, but I've tried to provide a basic overview of how to open, query, and load data from netCDF files with an eye toward the more common applications.  There is a lot of great documentation out there if you want to learn more; a good place to start is the Unidata Documentation.

So get out there and start working with NetCDF data!
