Data Cleansing

Data cleansing, or preprocessing, is a required step prior to data analysis. The quality and value of analysis results or model output often depend on the quality of the input data. We cleanse our data using a parsimonious set of processes, explained below.

Identifying and Removing Outliers

We identify outliers in our hyperspectral data by applying two criteria that are defined by (1) the allowable range for instantaneous data values and (2) the allowable change in reflectance values within a limited time period. We exclude hyperspectral reflectance data that are identified as outliers from the dataset for subsequent data processing and analyses.
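The two criteria can be sketched for a single spectral band as follows. This is a minimal illustration and not the project's actual code: the function name `flag_outliers` and the threshold values (`valid_range`, `max_change`, `window_s`) are assumptions chosen for the example.

```python
import numpy as np

def flag_outliers(reflectance, times_s, valid_range=(0.0, 1.0),
                  max_change=0.1, window_s=300.0):
    """Return a boolean mask that is True where a sample is flagged.

    reflectance : 1-D reflectance series for one band
    times_s     : 1-D timestamps in seconds, in ascending order
    All thresholds are hypothetical placeholders.
    """
    r = np.asarray(reflectance, dtype=float)
    t = np.asarray(times_s, dtype=float)
    lo, hi = valid_range

    # Criterion 1: instantaneous value outside the allowable range.
    mask = (r < lo) | (r > hi) | ~np.isfinite(r)

    # Criterion 2: change relative to the last accepted sample exceeds
    # the allowable change within the time window.
    last = None  # index of the last accepted sample
    for i in range(r.size):
        if mask[i]:
            continue
        if last is not None:
            dt = t[i] - t[last]
            if 0 < dt <= window_s and abs(r[i] - r[last]) > max_change:
                mask[i] = True
                continue
        last = i
    return mask

# Samples where the mask is True would be excluded from later processing.
r = np.array([0.10, 0.11, 0.60, 0.12, 1.50, 0.11])
t = np.arange(6) * 60.0
clean = r[~flag_outliers(r, t)]
```

Comparing each sample with the last accepted one, rather than with its immediate neighbor, keeps a single spike from also flagging the valid sample that follows it. The sketch does assume the first in-range sample is trustworthy.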

Removing Sun-angle Effect (BRDF Models)

The dependence of an object's spectral reflectance on the direction of incoming light is described by the bidirectional reflectance distribution function (BRDF). We developed a prototype methodology for minimizing sun-angle effects on hyperspectral reflectance collected from the land surface by modeling how reflectance values change with sun angle. To preserve the integrity of the measurements, we developed an independent BRDF model for each of the 2,150 spectral bands.

Diurnal course of reflectance values at 531 nm before and after sun-angle effect removal
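One way to picture per-band trend removal is sketched below. The form of the project's BRDF models is not described here, so this example substitutes a simple stand-in: a low-order polynomial in solar zenith angle fitted independently to each band, with the data then normalized to a reference angle. The function name, the polynomial degree, and the reference angle are all assumptions.

```python
import numpy as np

def remove_sun_angle_trend(reflectance, zenith_deg,
                           ref_zenith_deg=30.0, degree=2):
    """Detrend each band against solar zenith angle (illustrative only).

    reflectance : array of shape (n_times, n_bands)
    zenith_deg  : array of shape (n_times,), solar zenith angle in degrees
    The polynomial trend is a stand-in, not the project's BRDF model.
    """
    refl = np.asarray(reflectance, dtype=float)
    sza = np.asarray(zenith_deg, dtype=float)

    # np.polyfit fits every column of a 2-D y independently, which gives
    # one model per band. Result shape: (degree + 1, n_bands).
    coeffs = np.polyfit(sza, refl, degree)

    powers = np.arange(degree, -1, -1)           # highest power first
    trend = (sza[:, None] ** powers) @ coeffs    # modeled value per sample
    at_ref = (ref_zenith_deg ** powers) @ coeffs # modeled value at ref angle

    # Subtract the angle-dependent trend and restore the reference level.
    return refl - trend + at_ref
```

Fitting each band separately mirrors the independent-per-band design described above. With 2,150 bands, `reflectance` would simply have 2,150 columns.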

Removing Cloud Effects

Unlike sun angles, sky conditions are highly dynamic and difficult to predict. In addition, the effects of cloud cover on spectral reflectance appear to interact intricately with sun angle. We are currently studying this complex relationship between cloud cover and the hyperspectral reflectance of the land surface.

Ever-changing sky conditions greatly affect the spectral reflectance of the land surface.
Dynamic sky conditions from early morning to evening on August 7, 2016

Matching Temporal Frequency

Consistency of scale across datasets often influences the choice of methods and the accuracy of the output. We average the hyperspectral reflectance values over 30-minute intervals so that their temporal frequency matches that of the meteorological data (e.g., carbon flux, heat flux, and soil moisture) and the model output (e.g., gross primary production [GPP] and ecosystem respiration [Reco]) derived using the eddy covariance method. Matching these frequencies helps us get the best results from our analyses.

Hyperspectral data averaged over 30 minutes
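The averaging step can be sketched with pandas, assuming the spectra are held in a DataFrame indexed by timestamp with one column per band. The column names and the labeling of each interval by its start time are assumptions; eddy covariance products sometimes label intervals by their end time, in which case the `label` and `closed` arguments of `resample` would need adjusting.

```python
import pandas as pd

def average_to_half_hour(spectra):
    """Average reflectance over 30-minute bins.

    spectra : DataFrame with a DatetimeIndex and one column per band.
    Bins are labeled by their start time (pandas default).
    """
    return spectra.resample("30min").mean()

# Hypothetical example: six scans taken 10 minutes apart, two bands.
index = pd.date_range("2016-08-07 10:00", periods=6, freq="10min")
spectra = pd.DataFrame(
    {"b531": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     "b570": [2.0, 2.0, 2.0, 4.0, 4.0, 4.0]},
    index=index,
)
half_hourly = average_to_half_hour(spectra)
```

The resulting table has one row per 30-minute interval and can be joined to the flux data on the timestamp index.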

Calculating Spectral Indices

A spectral index is a mathematical transformation of spectral reflectance values into a single value that correlates with a phenomenon in the real world. From a large pool of spectral indices, we compute approximately 20 that are potentially effective for studying ecosystem functions. We continue to add new indices to our study as we discover their effectiveness.
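As an illustration of how such a transformation works, the sketch below computes two widely used normalized-difference indices, NDVI and PRI. Which indices the project actually computes is not listed here, so these two are examples only. The helper names, the band centers chosen for NDVI, and the 1 nm wavelength grid are assumptions.

```python
import numpy as np

def band_at(wavelengths_nm, reflectance, target_nm):
    """Return reflectance at the band closest to target_nm."""
    wl = np.asarray(wavelengths_nm, dtype=float)
    idx = int(np.argmin(np.abs(wl - target_nm)))
    return np.asarray(reflectance, dtype=float)[..., idx]

def normalized_difference(a, b):
    """Generic (a - b) / (a + b) index."""
    return (a - b) / (a + b)

def ndvi(wavelengths_nm, reflectance, red_nm=670.0, nir_nm=800.0):
    """Normalized Difference Vegetation Index (band centers assumed)."""
    return normalized_difference(
        band_at(wavelengths_nm, reflectance, nir_nm),
        band_at(wavelengths_nm, reflectance, red_nm),
    )

def pri(wavelengths_nm, reflectance):
    """Photochemical Reflectance Index from the 531 and 570 nm bands."""
    return normalized_difference(
        band_at(wavelengths_nm, reflectance, 531.0),
        band_at(wavelengths_nm, reflectance, 570.0),
    )

# Hypothetical spectrum on a 1 nm grid with 2,150 bands.
wl = np.arange(350, 2500)
spectrum = np.full(wl.size, 0.10)
spectrum[wl == 670] = 0.05
spectrum[wl == 800] = 0.45
spectrum[wl == 570] = 0.12
```

Because `band_at` indexes the last axis, the same functions accept a single spectrum or a time series of spectra with shape `(n_times, n_bands)`.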

Data Visualization

Other than displaying a colorful data cube for instantaneously collected hyperspectral data, there are few options for visualizing hyperspectral reflectance data collected over time at high temporal frequency. There is also no established tool for displaying hyperspectral data together with simultaneously collected meteorological and biological measurements so that we can visually identify patterns and associations between the data types. In collaboration with the School of Informatics and Computing at Indiana University-Purdue University Indianapolis, we have been developing a set of visualization tools.

We use these interactive visualization tools to inspect overall data quality, identify problematic data points, and explore patterns of association between hyperspectral reflectance and physical measurements. The resulting information and knowledge are useful for developing and testing our hypotheses. We continue to improve the visualization tools and add new functionality to facilitate our discovery in ecosystem sciences.

3D visualization of diurnal hyperspectral reflectance signatures (unprocessed).

Data Analysis

We collect unprecedented types and volumes of data from the land surface. We use this rich dataset to explore analytics for big, complex ecological time-series data, to address some of the most interesting science questions, and to untangle mysteries in ecosystem function research. Our current approach includes:

  • Seasonal pattern analysis of GPP based on mid-day average data, using a coupled regression model (see graphs below)
  • Nested multiple-scale time-series analysis of diurnal data across the growing season, using a neural network approach supported by the conditional inference tree method
 Poster: Feature Selection for High Dimensional Time Series Forecasting with Artificial Neural Networks, Paul Tarpey, Yuki Hamada
Seasonal course of spectral indices and GPP of 2015