Multivariate Calibration Techniques for Infrared Spectroscopy Data
This thesis addresses the use of multivariate statistical methods for predicting soil properties from infrared spectroscopy data. Methods for analyzing complex data, and analysis tools that address the computational complexity of working with soil spectroscopy data, were developed. Infrared spectroscopy provides soil scientists with a new tool for assessing soil quality rapidly and cheaply, which opens up new possibilities for monitoring soil quality or fertility in landscapes. However, spectroscopy techniques generate high-dimensional datasets that are complex to process, analyze and interpret. Translating the generated data into a form in which a soil can be classified as coming from a fertile or a poor area therefore requires knowledge and correct use of multivariate statistical techniques. In addition, novel approaches to developing predictive models using mixed effects linear regression, partial least squares (PLS) regression and random forest regression were used and tested. The study used mid-infrared (MIR) spectroscopy data for soil samples collected from western Kenya. Exploratory data analysis methods were used to assess the distribution of the different soil properties, which were analyzed on the 10% of samples that had reference data. Principal component analysis (PCA) score plots computed from the spectra were used to screen for extreme spectral outliers. The dataset was split into two: (i) a training set consisting of two-thirds of the soil samples with both MIR data and soil property reference data acquired using conventional methods and (ii) a testing set, which was used to assess the predictive power of the fitted models. The predictive performance of the models was evaluated using bias and root mean square error of prediction (RMSEP). Further, residual regression kriging was used to investigate the spatial dependence of the model residuals.
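As an illustration of the two evaluation statistics named above, the following is a minimal sketch rather than the code used in the thesis, which was written in R. It is given in Python for convenience. The function names, the sign convention for bias (predicted minus observed) and the example values are assumptions made only for this sketch.

```python
import math


def bias(observed, predicted):
    """Mean prediction error on the testing set.

    Assumed sign convention: predicted minus observed, so a positive
    value means the model over-predicts on average.
    """
    errors = [p - o for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)


def rmsep(observed, predicted):
    """Root mean square error of prediction on the testing set."""
    squared = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical reference values and model predictions for four test samples.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 2.5, 4.5]

print(bias(observed, predicted))
print(rmsep(observed, predicted))
```

Bias summarizes systematic over- or under-prediction, while RMSEP summarizes the overall size of the prediction errors, so the two are usually reported together.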
The methods were tested further on a larger dataset of samples collected from 60 different sites across Africa. Because of the large number of samples in this dataset, pattern recognition methods were used to search for local subspaces. The cosine of the angle between pairs of spectra, the hit quality index, archetype analysis and self-organizing maps were computed to identify and group similar spectra. The code for these methods was written in the R statistical software. A major achievement of this work was the adaptation and development of tools and methods fully customized for data of this type. For instance, a function that reads raw spectral measurements directly from instruments reduced the processing steps and time required. Improved predictions for aluminum, copper and boron from the hybrid method combining PLS with regression kriging of residuals showed that accounting for their spatial dependence can reduce model residuals. Another achievement is the successful partitioning of spectral datasets into groups which, upon evaluation, revealed innate fertility levels. This is particularly useful because it can be used to rapidly assess soil condition and to make decisions or recommendations on optimal land use types without the use of chemicals in the laboratory.
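Two of the spectral similarity measures mentioned above, the cosine of the angle between a pair of spectra and the hit quality index, can be sketched as follows. This is an illustrative Python sketch, not the R code developed in the thesis. The hit quality index is defined in more than one way in the literature; the squared-cosine form used here is an assumption, as are the function names and the toy spectra.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hit_quality_index(a, b):
    """Hit quality index, assumed here to be the squared cosine:
    (a . b)^2 / ((a . a) * (b . b)).
    Values near 1 indicate spectra with very similar shape.
    """
    dot = sum(x * y for x, y in zip(a, b))
    return (dot * dot) / (sum(x * x for x in a) * sum(y * y for y in b))


# Hypothetical absorbance values at three wavenumbers.
spectrum_a = [1.0, 2.0, 3.0]
spectrum_b = [2.0, 4.0, 6.0]   # same shape as spectrum_a, scaled by 2
spectrum_c = [3.0, 0.0, -1.0]  # orthogonal to spectrum_a

print(cosine_similarity(spectrum_a, spectrum_b))
print(hit_quality_index(spectrum_a, spectrum_c))
```

Both measures ignore overall scaling of a spectrum, so samples are grouped by spectral shape rather than by absolute intensity.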