Introduction
Chemometrics is the statistical processing of analytical chemistry data with various numerical techniques in order to extract information. The technique utilized for the Foxboro NMR product is partial least squares analysis, which reduces the large amount of spectral data obtained on a process stream into principal components (factors) that describe the spectrum/measured-parameter correlation in a data-reduced manner.
Chemometric Modeling of Refinery Streams
The data processing routines that are commonly used with near-infrared spectroscopy in the analysis of refinery streams are being used to process the spectral data from the Foxboro NMR instrument. A series of mathematical manipulations of the data is used with a previously developed calibration model to predict the value of the fuel property of interest.
RAW DATA, X → Chemometrics: f(X) → MON, RON, Benzene, T90, Cetane, etc.
Mathematical Tools
Chemometric data analysis routines consist of proven mathematical tools from the statistics and engineering literature. Tools range from baseline correction and smoothing to multivariate analysis techniques such as partial least squares regression (PLS), principal components regression (PCR), and neural networks. The detection of outliers and new and different samples will be handled by hierarchical cluster analysis followed by Mahalanobis distance testing of single-cluster subsets of the spectral data.
Why Use Chemometrics?
The complex relationship of fuel content to fuel property often requires a complex solution. Consider the property of research octane number, or RON. The RON value of a particular fuel depends in a complex way on the chemical composition of that fuel. Aromatics tend to increase RON, while long and very short saturated chains tend to decrease the RON value of a fuel. Each particular chemical species contributes uniquely to the overall RON value of a fuel. Some types of molecules strongly influence the RON value (negatively or positively), while others have a more modest influence. Many fuel and distillate properties are a weighted function of the array of fuel constituents. Chemometric techniques provide a means for extracting complex property information from subtle variations in the fuel spectrum that arise from variations in the array of chemical constituents present in the fuel sample.
Long Term Modeling Approach
The ultimate goal in process modeling with chemometrics is to develop the “global” multivariate calibration model. The global calibration model would be invariant through time on one instrument and across different instruments. Little progress has been made on the development of global models on near IR instruments over the past 10 years. The global model would require that the calibration developed on one instrument be transferred to other instruments and through time on a single instrument without a significant loss of accuracy. This is a very challenging problem because of the complexity of the model being transported.
Foxboro NMR recognizes that chemometric process models must be developed and maintained. The strategy currently being implemented involves updating on-line models with new data points not yet described in the calibration models. This step is performed when a shift in the process is indicated by the appearance of outliers arising from chemical variation of the process stream. The frequency of maintenance for a given calibration is expected to decrease with time as the process variation is revealed and incorporated into the model database.
Chemometric Regression Methods
Model Building Using Soft Modeling Methods
Modeling in the absence of a direct quantitative physical understanding of the relation between the measurement variables (spectra) and the physical or chemical properties (e.g., RON) to be determined requires a different approach than modeling based on well-defined physical systems. Take for example the property of Research Octane Number, RON. RON is known to depend on the distribution of chain lengths and fractional content of aliphatics, and on the distribution and fractional content of aromatic species. Even if the exact impact (hard model) of each molecular type on the octane number were quantitatively known, accounting for the hundreds of chemical species that make up a gasoline stream would require an exhaustive level of analysis to compute the hard model for the fuel octane. Soft modeling with principal components instead makes use of the statistical correlation of the data variation with the property variation in developing a regression model.
Soft modeling was originally used in the behavioral sciences in an attempt to extract expected behaviors from complex multivariable observations. Modeling in the behavioral sciences is complicated by the selection of observations that are related to the behaviors of interest. Fortunately, the intuitive link between spectroscopic measurements and physical or chemical properties is much easier to establish. For example, it is known that 7-9 carbon straight chain aliphatics decrease octane and that aromatics increase octane. Aliphatics with a high degree of branching also increase the octane value of a fuel. Spectroscopy provides information on the chemical structure variation contained in the fuel mixture. Therefore chemical intuition can be used to support the development of regression models that relate the spectral responses to the octane number of fuels.
Multivariate Regression Models
Many engineers and scientists are familiar with the concept of linear regression. In linear regression a single dependent variable, y, is regressed against a single independent variable, x. The form of the regression model is:

y = bx + int (1)
Equation 1 requires 2 pairs of (x,y) data (2 points to define a line) since there are two unknowns to be solved, b and int. This requirement, that the number of data samples meet or exceed the number of unknowns to be determined, is a very important concept that must be satisfied in order to determine meaningful regression solutions. The solution to Equation 1 is obtained by regressing known values of y against the corresponding known values of x. The unknown free parameters to be solved are the slope, b, and the intercept, int.
In spectrometry, a line is used to develop a calibration between concentration, c, and absorbance, x, according to Beer's law:

c = bx + int (2)
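As an illustration, the slope and intercept of Equation 2 can be solved by ordinary least squares. The following is a minimal sketch in Python, with made-up absorbance/concentration pairs standing in for real calibration data:

    import numpy as np

    x = np.array([0.10, 0.21, 0.42, 0.79])   # absorbances (hypothetical)
    c = np.array([1.0, 2.0, 4.0, 8.0])       # known concentrations (hypothetical)

    b, intercept = np.polyfit(x, c, deg=1)   # least squares solution of Equation 2
    c_new = b * 0.50 + intercept             # predict concentration of a new sample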
It is possible to extend the regression relation to multiple concentration and absorbance variables. A vector is denoted by using a boldface lower case character, and a matrix is denoted by a boldface upper case character. The concentration vector includes the concentrations of n sample constituents, a 1 by n dimensional vector:

c = xB + int (3)
Note that equation 3 has a 1 by n vector of concentrations (1 sample with n constituents), a 1 by m vector of spectral measurements, x, and an m by n matrix of regression coefficients (slopes) to be solved: (1 by m)(m by n) gives a 1 by n dimensional concentration vector.
If c and x are mean-centered, the intercept term is zero:

c = xB (4)
Equation 4 is the form of a multivariate predictive model that is used in the estimation of the property or concentration values of a sample with measured spectral response x. The predictive model is defined by B, the matrix of regression coefficients. A multivariate calibration model must be calculated to obtain B. In the calibration sequence, multiple samples are required to ensure a unique solution for B, and the variables in the expression are all matrices:

C = XB (5)

B = (X^T X)^-1 X^T C (6)
where the superscript T is used to denote the matrix transpose, and the superscript -1 is used to denote the matrix inverse.
It is important to note that the matrix B is dimensioned as m by n: the number of wavelengths, m, by the number of constituents, n. Equations 5-6 describe the multiple linear regression model, MLR. If it is desired to increase the available spectral information in the model, more spectral wavelengths are included. One of the problems with MLR is that the size of the B matrix of unknowns grows rapidly as more spectral wavelengths are included in the regression model. This means that the number of calibration samples with known property/concentration values must also grow rapidly as more wavelengths are included in the model. The failure to use an adequate number of calibration samples can result in a catastrophic failure of the model in the prediction mode. Another problem with MLR is that, for spectral data that exhibit subtle variations with the typical process variation, the matrix inverse step is poorly conditioned. A poorly conditioned system will lead to large errors in the computation of the regression coefficient matrix B, and to poor prediction accuracy. A poorly conditioned calibration matrix will also lead to models that are extremely unreliable in predicting on samples with spectra that are dissimilar to the spectra contained in the calibration set data. Dissimilar spectra are likely to be encountered with a changeover in blending feedstock or with formulation (winter/summer) changes in the product.
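The MLR solution of Equation 6, and the conditioning problem described above, can be demonstrated numerically. The sketch below uses simulated stand-ins for X (samples by wavelengths) and C (samples by constituents); two nearly collinear wavelength channels are enough to make the inverse unstable:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 10))                    # 30 spectra, 10 wavelengths
    X[:, 1] = X[:, 0] + 1e-6 * rng.normal(size=30)   # nearly collinear channels
    C = rng.normal(size=(30, 2))                     # 2 constituent concentrations

    print(np.linalg.cond(X.T @ X))           # very large: poorly conditioned
    B = np.linalg.solve(X.T @ X, X.T @ C)    # Equation 6; unreliable here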
Latent Variable Based Soft Models
The aforementioned difficulties with MLR are addressed with latent variable regression methods. A latent variable is defined as a variable that is not directly observable, but is related to the observable variable. The observable variables, usually spectral intensities or absorbances, are used to generate latent variables. A latent variable, t, is the result of a weighted linear combination of the observable variable vector elements, x.
t = p1x1 + p2x2 + p3x3 + p4x4 + p5x5 + … + pmxm (7)
Thus, information from m wavelength measurements can be compressed into 1 latent variable. The weighting coefficients, pi, in Equation 7 are called the loadings, and p is the loadings vector. In practice, more than one score variable is required to capture the relevant chemical variance of complex samples. The spectral matrix is eigen-decomposed into a number of scores and loadings vectors, and some analysis is required to determine the number of scores that are needed to capture the chemical variation inherent in the chemical system. The number of relevant scores kept is typically between 5 and 10 when calibrating on petroleum distillate streams. Most of the methods of eigen-decomposing spectral data yield orthogonal or nearly orthogonal sets of scores. Orthogonalization of the spectral data addresses the problem of inverting a poorly conditioned spectral calibration matrix, and the relatively small number of score terms used can dramatically reduce the number of free parameters to be solved. This means that fewer samples are required to over-determine the calibration model. For example, the use of 500 wavelength measurements to determine 5 constituents would require 500*5 or 2500 unique sample spectra. If 10 scores were used in place of the raw spectra, 10*5 or 50 samples would be the number of samples needed to exactly determine the chemical system. Experience has shown that reliable models should be over-determined by 2-3 times; therefore 100-150 samples (as opposed to 50) would be a good starting point for modeling a system with 10 significant scores.
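The compression of Equation 7 can be sketched for a whole spectral matrix at once. In the Python fragment below the loadings are obtained from a singular value decomposition of a simulated, mean-centered spectral matrix; keeping 10 loadings vectors compresses 500 wavelength variables into 10 scores per sample:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 500))     # 100 spectra, 500 wavelengths (simulated)
    Xc = X - X.mean(axis=0)             # mean-center the spectra

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:10].T                       # loadings matrix, 500 x 10
    T = Xc @ P                          # scores: 100 samples x 10 latent variables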
Building of PLS/PCR Models
The two most common latent variable modeling methods are principal components regression, PCR, and partial least squares, PLS. They are essentially equivalent in modeling accuracy with the following exceptions:
- PLS may have an advantage in modeling systems that contain uncorrected baseline variation or pink noise.
- PCR may have an advantage in systems that contain more than 6 significant score variables.
The basic difference between the two is that the PLS algorithm uses information in the dependent variable matrix of properties/concentrations to direct the decomposition of the spectral matrix into loadings vectors and scores. The PCR algorithm, in contrast, decomposes the spectral matrix sequentially in the direction of maximum spectral variance, subtracts the contribution along that maximal axis of variance from the spectral matrix, and then repeats the process along the direction of maximum variance in the residual spectral matrix that is orthogonal to the previously determined loadings vectors. This process continues until the variance in the spectral matrix is completely eigen-decomposed.
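The algorithmic difference can be seen in how the two models are assembled. A sketch with scikit-learn, on simulated data: PLSRegression uses the property vector y to direct the decomposition, while the PCR pipeline decomposes X by variance alone and then regresses:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(2)
    X = rng.normal(size=(60, 200))                         # 60 calibration spectra
    y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=60)   # synthetic property

    pls = PLSRegression(n_components=6).fit(X, y)          # y directs decomposition
    pcr = make_pipeline(PCA(n_components=6),
                        LinearRegression()).fit(X, y)      # variance-only decomposition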
The following discussion on process modeling specifics will focus on the use of PCR, though the choice of method actually used in process modeling will be determined by prediction accuracy.
Model Development Environment
Proprietary algorithms used in the generation of calibration models, calibration transfer, and spectrum pre-processing have been, and will continue to be, developed in the MATLAB programming environment. The PLS and PCR modeling program used to develop the process chemometric models is Galactic Grams PLS/IQ.
Principal Components Regression, PCR
Consider fuel spectra that are arranged in rows of a matrix of spectral data, X, with measured property values for each fuel sample represented in the property matrix Y. The development of the PCR calibration model is performed as follows:
First, the row spectral matrix X is decomposed into an orthogonal basis set of scores, T (projections), and loadings vectors contained in the matrix P, as in Equation 8:

X = TP^T (8)

The set of loadings vectors makes up the basis for the new coordinate system. The new coordinate system is used to define the NMR spectrum in terms of new variables that are fewer in number than the frequency range variables that make up the original NMR frequency coordinate system. The redefinition of coordinates is analogous to the conversion from rectangular to polar coordinates, though coordinate changes in a PCR decomposition only require linear transformations. The values of the spectral intensities in the new coordinate system are the spectral scores, T.
Selection of Principal Components
The number of coordinate (loadings) axes chosen to represent the spectral data (the number of principal component scores) is a critical decision that will be made using a variety of criteria. The inclusion of too many scores leads to good fits of the model but sub-optimal prediction due to the inclusion of excess noise in the calibration model. The inclusion of too few scores leads to a model with excessive bias, or systematic error, in fitting and prediction. There are a number of methods (like PRESS, the indicator function, and the F-test) that have been used to determine the optimal number of scores, but since some of these methods are statistically based and include statistical assumptions of data homogeneity, the most reliable method of determining the optimal number of scores is expert inspection of the data structure. A careful study of the spectral calibration matrix is conducted, and the residual errors fitted with the addition of each principal component are examined. An unusually large projection of a single sample on a single loadings axis can be indicative of one unusual sample dominating the score. In this case the score will be excluded from the model. In process modeling, one score component fewer than the apparent optimal number is often used to define models, as it is easier to compensate for model bias than for model noise.
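Of the statistical criteria mentioned above, PRESS (the prediction residual error sum of squares) is straightforward to sketch: leave-one-out predictions are accumulated for each candidate number of scores, and the resulting error curve is inspected. The data here are simulated, and in practice the curve would be examined by an expert rather than minimized blindly:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_predict
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 120))                          # simulated calibration spectra
    y = X[:, :4].mean(axis=1) + 0.05 * rng.normal(size=50)  # synthetic property values

    for k in range(1, 11):
        pcr = make_pipeline(PCA(n_components=k), LinearRegression())
        y_cv = cross_val_predict(pcr, X, y, cv=LeaveOneOut())
        press = np.sum((y - y_cv) ** 2)                     # PRESS for k scores
        print(k, round(press, 3))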
Once the spectral scores of the calibration set spectra are obtained, they are used along with the known matrix of property values, Y, to solve for the matrix of regression coefficients, B, which defines the PCR calibration model:

B = (T^T T)^-1 T^T Y (9)
The calibration model is defined by the loadings matrix, P, and by the matrix of regression coefficients, B. Once the model is defined, the forward prediction step can be used in the evaluation of fuel properties as follows:
First, obtain the vector of scores, ti, for the ith fuel sample by projecting the sample spectrum, xi, onto the basis set, or loadings matrix, of the calibration model, P:

ti = xiP (10)
The predictions of the fuel properties, ci, for the ith sample are obtained by right-hand matrix multiplication of the scores vector by the matrix of regression coefficients, as in Equation 11:

ci = tiB (11)
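Equations 8-11 can be collected into a single calibrate-and-predict sketch. Everything below is simulated for illustration; a real calibration would use measured fuel spectra in X and lab property values in Y:

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(80, 300))            # calibration spectra in rows (simulated)
    Y = rng.normal(size=(80, 3))              # lab property values (simulated)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)

    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    P = Vt[:8].T                              # loadings, Equation 8
    T = (X - x_mean) @ P                      # scores, Equation 8
    B = np.linalg.solve(T.T @ T, T.T @ (Y - y_mean))   # coefficients, Equation 9

    x_new = rng.normal(size=300)              # spectrum of a prediction sample
    t_new = (x_new - x_mean) @ P              # projection, Equation 10
    y_hat = t_new @ B + y_mean                # property predictions, Equation 11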
Sampling Requirements
There are well-defined statistical sampling guidelines for modeling multivariable systems. A statistical design of the simplest type, a two-level design, requires two samples for a line (a univariate system: 1 independent variable), four for a plane (2 independent variables), eight for a cube (3 variables), and so on. The formula for a statistical design of this type is 2^n samples, where n is the number of variables in the system. Samples are strategically chosen in a high-low format in each variable, with all combinations of high-low in each dimension accounted for. This design provides for interpolation of all samples that fall inside the (hyper)volume bounded by the calibration set measurements. Failure to use a statistical calibration design means that extrapolation of the calibration model along one of the loadings axes is likely to occur in the prediction mode.
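For a concrete picture of the 2^n rule, the high-low corners of a three-variable design can be enumerated directly (a sketch, with the two levels coded as -1 and +1):

    from itertools import product

    design = list(product((-1, +1), repeat=3))   # all high-low combinations
    print(len(design))                           # 2**3 = 8 corner samples
    print(design[0])                             # e.g. (-1, -1, -1)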
Statistically designed calibration models are not practical in many process applications because the extremes of the process are undesirable, and therefore extreme samples are typically unavailable. This fact necessitates an alternate strategy for building process calibration models:
An initial group of samples is selected containing as much of the process variation as possible. The model is built on these samples and the future prediction samples are evaluated in the prediction mode.
If a prediction sample appears to represent a variation that is not present in the calibration set, the sample is captured, evaluated with the reference method (e.g. octane engine) and the model is rebuilt with the inclusion of that sample in the calibration set.
Approximately 50-75 fuel samples, each with lab determined property values, are needed to initially define a model containing 6-8 significant score components.
Model Adjustments
An adjustment of the model is dictated by the (automated flag) detection of process fuel spectra that are dissimilar to the existing fuel spectra in the model database. The differences could be due to considerable changes in the chemical composition of the fuels, or to malfunction of the sampling hardware or spectrometer. The detection of an unusual sample triggers a lab test, a graphical evaluation of the sample spectrum and, if necessary, a recalculation of the chemometric regression model including the new sample(s). These situations can arise from process changes, unusual process disruptions/maintenance, changes in regulations concerning fuel content, or from the absence of some (seasonal) variants of the fuel composition. In the case of regulation changes, some of the older fuels will be excluded from the model database so that the model reflects current blending targets. Model adjustments are commonly accepted among refiners that use spectrometry as a part of a unified process control strategy. It is felt, however, that model adjustments for NMR-based models will be much fewer in number once a wide range of process conditions has been included in the model.
The “outlier” sample in question would be linked to a cluster in the calibration set, and a Mahalanobis distance (a Euclidean distance scaled by the covariance matrix, C) would be used to determine whether the sample belongs to the identified cluster.
The Mahalanobis distance of normally distributed spectral data is known to follow the χ² (chi-squared) distribution, permitting a test against a 95% confidence limit, assuming that the data are homogeneous and normally distributed. This test also assumes that enough data have been collected to obtain an accurate estimate of the sample population covariance matrix, C. In the real world of refinery measurements, the data are clustered according to stream and the assumption of homogeneity fails. The normality of the data may also be in doubt, but the Mahalanobis distance is somewhat tolerant of modest deviations from normality.
To prevent the use of the Mahalanobis distance on a heterogeneous (multiple cluster) data set, the calibration set data will be partitioned into individual cluster groups and a class covariance matrix will be calculated for each cluster. This step will be performed using hierarchical clustering to establish the number of clusters, and then self-organizing clustering to assign cluster membership. The Mahalanobis distance of the sample will be calculated for each cluster, and the sample will be evaluated with the χ² test statistic at the 95% confidence level to determine if the sample belongs to an existing cluster. If it is determined that the sample does not belong to an existing cluster and that there was no hardware malfunction, the model would be rebuilt with inclusion of the outlier sample in the calibration data set.
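The per-cluster test can be sketched as follows. The cluster scores are simulated here; a deployed system would use the cluster assignments and covariance matrices established from the calibration database:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(5)
    scores = rng.normal(size=(200, 6))        # scores of one calibration cluster (simulated)
    mu = scores.mean(axis=0)
    C = np.cov(scores, rowvar=False)          # class covariance matrix

    t_new = rng.normal(size=6)                # scores of the questioned sample
    diff = t_new - mu
    d2 = diff @ np.linalg.solve(C, diff)      # squared Mahalanobis distance
    is_outlier = d2 > chi2.ppf(0.95, df=6)    # test against the 95% chi-squared limit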
Process Spectrometers
Near Infrared, NIR
The most widely used process analyzers on the market use near-infrared technology. The strengths of near-infrared technology include:
- low sampling error
- high signal to noise ratio
- (multiplexed) fiber optic sampling devices
- relatively long maintenance intervals
Among the weaknesses of NIR technology are:
- weak analytical signal variance of overtone and combination band vibrations
- heavily overlapped near IR spectral bands require the use of more complex mathematics to extract chemical and physical information
- baseline drift that may be as large as 50 times the thermal noise of the detector
- fiber optic sampling devices that can lead to shifting backgrounds
The heavily overlapped spectral bands that exist in the near-infrared region require sophisticated multivariate mathematics in order to take advantage of the high signal to noise ratio of the spectral measurement. Unfortunately, the reliability of multivariate process models depends on the removal of baseline drift in the spectra of the standard and measurement samples as compared to the reference background, and on compensation for gradual changes in optical alignment or monochromator mechanicals. Simple offsets or ramps in the baseline spectrum are easy to remove, while complex, nonlinear baselines due to variations in source output or detector phasing errors are more difficult to remove. Derivative methods are typically used to correct for baseline variations, though complex baselines are difficult to remove by derivative transformations, and high frequency noise is enhanced relative to the lower frequency signal components in the derivative transformation process. The failure to effectively remove these baselines, or the enhancement of high frequency noise through derivative transformations, can lead to a substantial loss in the ability to model NIR data accurately with multivariate regression techniques.
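The effect of a derivative pre-treatment can be sketched on a synthetic band: a Savitzky-Golay first derivative removes a constant offset (a second derivative would also remove a linear ramp), while visibly amplifying the high-frequency noise relative to the band:

    import numpy as np
    from scipy.signal import savgol_filter

    x = np.linspace(0, 1, 500)
    band = np.exp(-((x - 0.5) / 0.02) ** 2)      # synthetic absorbance band
    baseline = 0.3 + 0.5 * x                     # offset plus linear ramp
    noise = 0.001 * np.random.default_rng(6).normal(size=500)
    spectrum = band + baseline + noise

    d1 = savgol_filter(spectrum, window_length=15, polyorder=2, deriv=1)
    d2 = savgol_filter(spectrum, window_length=15, polyorder=3, deriv=2)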
The complexity of the near-IR spectra of motor gas blending streams can lead to 6-10 significant principal components in a PLS or PCR calibration model. That is, the data contain 6-10 significant, mutually orthogonal coordinate axes. A simple two-level experimental calibration design would contain 2^n calibration samples of designed composition to prevent extrapolation of the regression model. Because refinery process streams do not generally contain samples of designed (extreme) composition, it will be necessary to extrapolate the calibration model in the prediction mode, with the increased risk of large prediction errors. For these reasons, the detection of outliers is used to prevent large extrapolations of the regression model.
Nuclear Magnetic Resonance, NMR
High resolution Nuclear Magnetic Resonance spectroscopy is a relative newcomer to process chemistry, though it is highly exploited in petroleum product research. Proton NMR, in particular, contains a great deal of information about the chemical composition of complex sample mixtures. Peak position is fundamentally related to the connectivity of the nuclei, and splitting patterns (sometimes represented as broadness of features) contain information about the connectivity of neighboring nuclei. FT-NMR spectra obtained at small tip angles are linear and additive in nature. Additivity and the highly resolved nature of the NMR analytical spectra lead to better-conditioned and more reliable calibration models. The NMR spectral differences between different diesel fuels are quite pronounced. Diesel fuels are readily distinguishable by the ratio of the CH2 to CH3 peaks (directly related to average alkane/aromatic side chain length) and by the broadness of the aromatic features from 7-8 PPM (vs. TMS). Narrow aromatic features indicate fuels low in polynuclear aromatics, while a broad aromatic signature is representative of fuels that contain significant percentages of polynuclear aromatics. Aromatics content and aliphatic/aromatic side chain length are directly related to the fuel cetane number and other properties. The chemical information is so well resolved in NMR spectra that early research into the cetane determination of fuels by NMR was performed without chemometrics (ratios of integrated frequency intervals were used), with analytical reproducibilities of 1.3 cetane numbers. NMR is free of the baseline drift issues that plague near-IR. NMR instruments do not have moving parts that can fail mechanically, so the maintenance interval is expected to be even longer than that for near-IR. A permanently installed flow-through probe means that mechanical positioning and alignment issues do not limit calibration accuracy through time. Finally, the NMR measurement is not susceptible to light scattering and other optical disturbances common in optical spectroscopy.
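The integrated-interval approach mentioned above reduces to simple numerical integration over chemical shift windows. The sketch below uses a simulated proton spectrum, and the ppm limits of the CH2 and CH3 windows are illustrative assumptions, not recommended values:

    import numpy as np

    ppm = np.linspace(0, 10, 2000)                     # chemical shift axis (vs TMS)
    spectrum = (np.exp(-((ppm - 1.25) / 0.15) ** 2)    # peak in the CH2 region
                + 0.6 * np.exp(-((ppm - 0.90) / 0.10) ** 2))  # peak in the CH3 region

    ch2_win = (ppm > 1.05) & (ppm < 1.55)              # illustrative integration windows
    ch3_win = (ppm > 0.60) & (ppm < 1.05)
    ratio = (np.trapz(spectrum[ch2_win], ppm[ch2_win])
             / np.trapz(spectrum[ch3_win], ppm[ch3_win]))   # tracks average chain length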
Another area of concern with NMR is the maintenance of magnetic field homogeneity. Foxboro NMR has automated shimming routines and is currently testing a method for “on the fly” analysis of field homogeneity. Testing conducted to date indicates that field homogeneity may not be as important as might be suspected: integral areas remain the same under changing field homogeneity, so the use of integrated spectral intervals of 0.1 to 0.5 PPM should deliver accurate models that are not significantly dependent on minor variations in the magnetic field homogeneity. The signal-to-noise ratio of NMR can be enhanced by averaging multiple measurements obtained from repeated pulsing, improving as the square root of the number of scans, though the signal to noise is not likely to meet or exceed that of NIR. Finally, another opportunity for extracting even more information from the petroleum distillate samples may exist in the use of multiple pulse techniques.
NMR vs. NIR
NIR process instruments have been implemented in hundreds of refineries worldwide. The benefits of process NIR measurement include a high signal to noise ratio, long pathlengths that provide good statistical samples, fiber optic multiplexing capability, maintenance intervals that are superior to those of the last generation of gas chromatographs, and moderate cost. NIR may have displaced gas chromatography as the process instrument of choice because of the maintenance downtime advantage alone.
The strengths of NMR lie in the relation of the measurement signal to first principles and in the high resolution of the chemical information contained in the signals. NIR depends heavily on chemometric techniques to extract information from its heavily overlapped spectral signals. Unlike NIR, the NMR spectra of protons attached to the key functional groups of aromatics, aliphatics, and olefins are mutually orthogonal. Even polynuclear aromatics are easily distinguishable from single ring aromatics in the NMR frequency spectrum. These key facts should have a large impact on the accessibility of the chemical information that is required to characterize petroleum distillate streams. Also, the simplicity of a latent variable calibration model is due not only to the number of principal score components used in the model, but to the distribution of chemical information among the scores. Because specific NMR spectral information (like aromatic information) is more resolved than NIR spectral information, it will be distributed in a smaller number of component scores even if the overall number of scores used is the same. A dramatic improvement in the accessibility of the chemical information will lead to process models that are simpler, easier to maintain, and more robust when extrapolating in the prediction mode or detecting outliers. NMR technology is initially more expensive than NIR, but the lack of moving parts should lead to maintenance intervals that are even greater than those for NIR.
Lastly, a new figure of merit should be defined for process chemistry: the ratio of signal variance to noise, which is more reliable than the conventional signal to noise ratio in determining the precision and accuracy of property predictions. Consider a univariate measurement/calibration for simplicity. If the NIR signal is 0.4 AU and the noise is 0.001 AU, but the signal variance is 0.01 AU over a range of 20 octane numbers, then the signal to noise is 400:1 and the signal variance to noise (for the 20 octane number range) is 10:1. A standard analytical calculation using signal to noise would suggest a 1 part in 400 error in a typical octane value (89 gives about 0.2 octane numbers), but the actual signal variation would suggest a 1 part in 10 error over the 20 octane range, or 2 octane numbers. For this reason, the signal to noise of NIR signals is a misleading figure of merit that will not be useful in comparison with the NMR signal, which varies about 15 to 20 percent over a 20 octane range. Furthermore, the noise statistics generated by NIR spectrometer manufacturers do not include intermediate or long-term baseline drift through time! Direct comparisons of NIR and NMR on fuel analysis will reveal a more quantifiable expectation of spectrometer performance.
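The worked example above reduces to a few lines of arithmetic, shown here as a sketch with the same illustrative numbers:

    signal, noise, variance = 0.4, 0.001, 0.01     # AU, the NIR example above
    octane_range, typical_octane = 20, 89

    snr = signal / noise                           # 400:1, conventional figure of merit
    svnr = variance / noise                        # 10:1 over the 20-octane range
    print(typical_octane / snr)                    # ~0.2 octane: optimistic error estimate
    print(octane_range / svnr)                     # 2.0 octane: realistic error estimate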