The analysis of chemical data has undergone a profound transformation, from early basic statistical methods to the modern era of machine learning (ML) and artificial intelligence (AI). This progression is particularly evident in the field of spectroscopy, where multivariate analysis techniques, such as classical regression, principal component analysis (PCA), and partial least squares (PLS), laid the foundation for today’s more advanced or automated ML calibration modeling techniques. This “Chemometrics in Spectroscopy” column traces the historical and technical development of these methods, emphasizing their application in calibrating spectrophotometers for predicting the chemical or physical properties of measured samples—particularly in near-infrared (NIR), infrared (IR), Raman, and atomic spectroscopy—and explores how AI and deep learning are reshaping the spectroscopic landscape. In this two-part series, we look back into the history of chemometrics (Part I) and then peer into the future to estimate where chemometrics might be going (Part II).
To be clear, manuscripts specifically describing the comprehensive history and development of chemometrics for different types of analytical chemical data analysis have been published previously (1–4). Adding to these chronicles of the historical development of chemometrics are the early papers and comments by Svante Wold and Bruce Kowalski (5–8). However, we begin our account of statistical methods in analytical chemistry with an article by John Mandel, titled “Efficient Statistical Methods in Chemistry” and published in the American Chemical Society (ACS) journal Industrial & Engineering Chemistry Analytical Edition in 1945 (9). This article was followed by a series of reviews in the ACS journal Analytical Chemistry, titled “Statistical Methods in Chemistry,” from 1956 to 1968 (10–16). That series was followed by another set of regular review articles in Analytical Chemistry, “Statistical and Mathematical Methods in Analytical Chemistry,” published in 1972 and 1976 (17,18). In 1978, Kowalski wrote a review paper titled “Chemometrics” that was published in Analytical Letters (19), which catalyzed the most recent Fundamental Reviews series titled “Chemometrics,” published in Analytical Chemistry from 1980 to 2012 (20–36). Since 2013, there have been no similar updates.
Chemometrics emerged in the 1960s, driven by the broader accessibility of scientific computing and rapidly evolving computational tools. Initially developed alongside other computational chemistry disciplines, such as quantum chemistry, bioinformatics, and chemoinformatics, chemometrics focused on applying statistical and multivariate methods to chemical data analysis. Early pioneers, including Wold and Kowalski, formalized the field, leading to the establishment of dedicated journals, software, and conferences during the 1980s. However, unlike bioinformatics, which gained prominence with the Human Genome Project, chemometrics has remained somewhat of a niche discipline, primarily supported by industry and small research groups rather than large first-tier academic departments.
The field’s growth was hampered by short-term funding and a lack of solid academic infrastructure. Despite its lower visibility compared to other informatics fields, chemometrics remains well-established, with extensive resources and skilled practitioners. Its future depends on recognizing the importance of expert-driven data analysis, particularly as industries face increasingly complex and colossal multivariate data sets. Ensuring quality and reliability in chemometric practices will continue to be crucial to maintaining its relevance in scientific and industrial applications.
A published two-part series of articles explored the early history of chemometrics from 1972 onward through interviews with pioneers such as Kowalski, Wold, and Désiré-Luc Massart, alongside notable contributors from the 1970s, including Olav H. J. Christie, Sergio Clementi, Philip K. Hopke, Harald Martens, Steven D. Brown, and Stanley N. Deming. In that series, Paul Geladi and Kim Esbensen interviewed several selected chemometricians on the origins, use, and future of chemometrics, publishing the results in two parts: the interviews themselves (2) and a discussion (3). The interviews address key themes, including the origins of chemometrics, personal contributions, and biographical details. Interviewees provided insights into influential early literature, highlighting significant works that shaped the discipline’s foundation.
A central theme is the collaborative emergence of chemometrics, with no single defining moment marking its inception. Kowalski and Wold’s formation of the Chemometrics Society in 1974 is considered pivotal. Before this, multivariate methods and early computational tools had already been applied to analytical chemistry in the 1960s (1–16), but they lacked formal organization into a well-defined field of research. The need for chemometrics arose from the growing challenge of extracting meaningful and actionable information from vast data sets generated by computerized instruments. Researchers realized that advanced mathematical techniques and a deeper understanding of these techniques were essential to transform raw chemical data into practical knowledge.
The Geladi and Esbensen interview articles include a curated reference list of early chemometrics literature, offering valuable historical resources. The additional interviews in the appendix provide diverse perspectives, underscoring the field’s broad and collaborative history. This retrospective emphasizes how chemometrics grew from merely a niche interest into an essential tool for modern data analysis (2,3).
Chemometrics is a branch of analytical science that infers chemical properties from measurements by means of mathematical methods (4). The term was in use in Europe by the mid-1960s and was reportedly used by Svante Wold in a 1971 grant application (5,6). During this period, the International Chemometrics Society was being formed by Wold and Kowalski, who later described the term “chemometrics” in more detail by 1975 (7,8). Its development was driven by the rise of modern computer-driven chemical instruments capable of generating and storing sizable amounts of digital data. Chemometrics emphasized multivariate data analysis to create models, iteratively improving understanding through predictive validation. This approach enabled faster and more cost-effective insights, often uncovering hidden relationships in data. Despite its advantages—such as real-time information extraction, improved data resolution, and enhanced process knowledge—the complexity and lack of standardized practices hindered widespread academic adoption of chemometrics. Today, chemometrics has profoundly impacted fields such as drug discovery through combinatorial chemistry and materials development through a much improved understanding of chemical processes.
As was previously described (4), chemometrics holds the potential to fundamentally transform the intellectual framework of problem-solving. By adopting a chemometrics-based methodology, scientific problem-solving focuses on interpreting data to develop hypotheses, or data models, with a deeper connection to reality. This exploratory process can be outlined as follows: 1) chemical instrumentation is used to measure phenomena or processes, producing data quickly and cost-effectively; 2) multivariate analysis is performed on the data; 3) analysis is iterated as needed; 4) a chemical or predictive model is computed and validated; and 5) a multivariate understanding of the underlying process is derived. This explicit methodology avoids rigid or ritualistic thinking, instead emphasizing numerous affordable measurements, occasional simulations, and detailed multivariate chemometric analysis. It represents a genuine paradigm shift, leveraging repeated experimentation and multivariate techniques to view the world through a multidimensional lens. Here, mathematics functions less as a tool for direct modeling and more as an investigative instrument—a “data microscope”—to explore, organize, and uncover hidden relationships within complex data sets to supplement the domain knowledge of the practitioner using these tools (4).
As an example of the Chemometrics reviews published in Analytical Chemistry, the 17th in the series, and the 15th bearing the title “Chemometrics,” was published in 2008 (34); it covered the most significant developments in the field from January 2006 through December 2007. These reviews evaluated key advancements, noting the increasing challenge of comprehensive literature referencing because of the field’s extremely rapid growth. Coverage extended beyond core chemometrics to topics such as image analysis, signal processing, and bioinformatics, indicating the widespread adoption of multivariate analysis across many disciplines.
Key developments in the early 2000s included advances in quantitative structure-activity relationships (QSAR) and machine learning methods, such as support vector machines (SVM) and random forests (RF), resulting in enhanced predictive modeling. Reviews on in silico tools and their applications, especially in toxicity and biological activity prediction, further underscore the expanding scope of chemometrics at this time (34).
Despite progress, long-standing challenges remain, such as calibration transfer and the constant quest to improve signal-to-noise ratios for all types of measurement data. Industry standards continue to advance, supported by organizations like ASTM International and others. The early reviews conclude that while method development remains steady, applications are growing, driven by the growing need to extract meaningful insights from complex chemical data (34).
In the mid-20th century, the rapid expansion of spectroscopic and atomic instrumentation created a need for more sophisticated data analysis. Traditional univariate methods were inadequate for handling the complex, multidimensional overlapping signals typical of spectroscopic data. Multivariate analysis provided a way to extract meaningful information from these data sets by considering multiple variables simultaneously.
Simple linear regression (SLR), also referred to as classical least squares (CLS), was the starting point; the underlying least-squares fitting was reportedly used by Isaac Newton around 1700 and was later formalized in more modern times by Carl Friedrich Gauss (37). In spectroscopy, the use of SLR or CLS involves correlating one dependent variable (such as an analyte concentration or constituent parameter) with one independent variable (a spectral measurement signal, such as absorbance, reflectance, emission, or fluorescence). However, CLS is only effective for quantitative analysis when the spectroscopic signals are linearly related to the analyte concentration in the sample. This linear relationship is generally described by the Beer-Lambert law (38,39).
Quantitative spectroscopy in the early part of the twentieth century was performed using what was referred to as a calibration curve or working curve. This curve was derived either graphically (some of us remember the diligent use of graph paper) or mathematically by measuring the spectral response for a number of reference samples of known concentration at an appropriate single wavelength and plotting (or computing) the log10 of the percentage composition (as the abscissa) versus log10(Ix/I0) (as the ordinate). Note that I0 is the intensity of the incident light, and Ix is the intensity of the transmitted light, that is, the light exiting the sample after interaction with it through processes such as absorption, scattering, emission, fluorescence, or reflection (40,41). The classic text of reference (40), published in 1948 by faculty of the Spectroscopy Laboratory at MIT, is excellent for understanding the early history of spectroscopic measurements.
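For readers who want to see the arithmetic behind such a working curve, the following minimal Python sketch fits a straight line to the log-log relationship described above and then inverts it to estimate an unknown sample. All concentrations and intensities are invented for illustration, and the numpy-based fitting simply stands in for the graph-paper plotting of the era.

```python
# Minimal sketch of a classical single-wavelength "working curve," following the
# log-log plotting described above. All concentrations and intensities are
# invented for illustration only.
import numpy as np

conc = np.array([1.0, 2.0, 4.0, 8.0, 16.0])    # known reference concentrations (% composition)
I0 = 100.0                                      # incident light intensity
Ix = np.array([89.0, 79.5, 63.5, 40.5, 16.5])   # transmitted light intensity for each standard

x = np.log10(conc)                              # abscissa: log10 of percentage composition
y = np.log10(Ix / I0)                           # ordinate: log10(Ix/I0)

slope, intercept = np.polyfit(x, y, deg=1)      # fit the straight-line working curve

# Invert the fitted curve to estimate the concentration of an unknown sample
y_unknown = np.log10(55.0 / I0)                 # hypothetical transmitted intensity of 55.0
conc_unknown = 10 ** ((y_unknown - intercept) / slope)
print(f"Estimated concentration: {conc_unknown:.2f}%")
```

A modern routine replaces the manual plotting, but the underlying calibration logic is the same as the early single-wavelength procedure described above.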
CLS was demonstrated for spectroscopic quantitative analysis, as described in reference (42). A basic but detailed description of Beer’s law, CLS, and multivariate analysis in spectroscopy is given in reference (43). It is anyone’s guess who first used automated computation for linear regression in place of the more typical graph-paper plotting of calibration curves, but one would surmise that this quickly became a logical, basic, and broadly accepted practice in the early 20th century as computational tools became available. The technical difficulty then was in achieving high-quality and reproducible spectroscopic data rather than in generating the calibration curves once the appropriate data were measured. However, the acknowledgement that spectroscopic data typically involve numerous interdependent spectral features was a major intellectual leap. Multiple linear regression (MLR) extended CLS to handle multiple variables but proved sensitive to collinearity, a common issue in spectral data. Collinearity in spectroscopic data refers to high correlation or linear dependence between two or more spectral measurement variables (for example, absorbance at different wavelengths), where one variable can be closely predicted from the others. This redundancy often arises from overlapping spectral features and complicates multivariate analysis, leading to unstable or unreliable calibration models. There are many excellent references describing the complexities of applying MLR to spectroscopic data; reference (44) devotes 1092 pages to such descriptions.
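To make the collinearity problem concrete, the hedged sketch below simulates a handful of spectra whose absorbances at adjacent wavelengths are nearly proportional, then fits an inverse MLR model by ordinary least squares; the very large condition number flags the unstable coefficients. The data, dimensions, and noise level are invented solely for illustration.

```python
# Hedged sketch of inverse multiple linear regression (MLR) on simulated spectra,
# illustrating how collinear wavelength channels destabilize the fit. All data,
# dimensions, and noise levels are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 30, 5
conc = rng.uniform(0.0, 10.0, n_samples)                    # reference concentrations

# Absorbances at adjacent wavelengths that are nearly proportional to one another
# (overlapping bands), plus a little measurement noise -> highly collinear columns
X = np.outer(conc, np.full(n_wavelengths, 0.1))
X += 0.001 * rng.standard_normal((n_samples, n_wavelengths))
X = np.column_stack([np.ones(n_samples), X])                # prepend an intercept column

# Ordinary least-squares solution of conc ≈ X @ b
b, *_ = np.linalg.lstsq(X, conc, rcond=None)

print("Condition number of X:", np.linalg.cond(X))          # very large -> ill-conditioned
print("MLR coefficients:", b)                               # large, mutually offsetting values
```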
Principal component analysis (PCA) was introduced by Karl Pearson in 1901 and later formalized for the analysis of ecological, sociological, and psychological data. For chemical data analysis, chemometrics uses PCA to transform high-dimensional data into a smaller set of orthogonal variables known as principal components (PCs). Each PC captures the maximum variance possible, making PCA invaluable for reducing noise and for identifying or classifying patterns in spectroscopic data. In spectrophotometer calibration, PCA helps in dimensionality reduction and pre-processing, providing cleaner data for downstream regression models (45).
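As an illustration of this dimensionality reduction, the sketch below applies PCA to simulated two-component spectra and reports how much variance the first few components capture; the simulated bands, noise level, and the use of scikit-learn are assumptions made only for this example.

```python
# Illustrative sketch: PCA compressing simulated two-component spectra into a few
# orthogonal principal components. The simulated bands and scikit-learn tooling
# are assumptions made only for this example.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
wavelengths = np.linspace(1100, 2500, 200)     # arbitrary NIR-like wavelength axis (nm)
n_samples = 40

# Spectra built from two overlapping Gaussian bands with varying intensities plus noise
c1, c2 = rng.uniform(0, 1, (2, n_samples))
band1 = np.exp(-((wavelengths - 1450) / 60) ** 2)
band2 = np.exp(-((wavelengths - 1930) / 80) ** 2)
spectra = (np.outer(c1, band1) + np.outer(c2, band2)
           + 0.01 * rng.standard_normal((n_samples, wavelengths.size)))

pca = PCA(n_components=3)
scores = pca.fit_transform(spectra)            # per-sample scores on each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
```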
Developed by Herman Wold in the 1970s, PLS became a cornerstone of chemometrics because of its ability to handle collinearity and extract relevant information from noisy data. Unlike PCA, which focuses on maximizing variance, PLS aims to maximize the covariance between predictors (spectral data) and response variables (analyte concentrations). This makes PLS particularly suited for quantitative spectroscopic analysis, where the goal is to predict chemical concentrations from spectral data (46). PLS in its various forms has become the workhorse for multivariate quantitative analysis and discriminant analysis.
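The following hedged sketch shows this idea in practice: a small PLS calibration built on simulated spectra containing an analyte band and an interfering band, evaluated by cross-validation. The simulated data, the choice of three latent variables, and the scikit-learn tooling are illustrative assumptions, not a prescription from the cited literature.

```python
# Hedged sketch: a small PLS calibration relating simulated spectra to a reference
# analyte value, with cross-validation to assess predictive ability.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n_samples, n_channels = 60, 150
conc = rng.uniform(0.0, 5.0, n_samples)                  # reference analyte concentrations

# Spectra containing an analyte band plus an interfering band and measurement noise
axis = np.arange(n_channels)
analyte_band = np.exp(-((axis - 60) / 12) ** 2)
interferent_band = np.exp(-((axis - 90) / 15) ** 2)
X = (np.outer(conc, analyte_band)
     + np.outer(rng.uniform(0.0, 3.0, n_samples), interferent_band)
     + 0.02 * rng.standard_normal((n_samples, n_channels)))

pls = PLSRegression(n_components=3)                      # three latent variables (arbitrary choice)
predicted = cross_val_predict(pls, X, conc, cv=5).ravel()

rmsecv = np.sqrt(np.mean((predicted - conc) ** 2))       # root-mean-square error of cross-validation
print(f"RMSECV: {rmsecv:.3f}")
```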
Chemometrics emerged in the late 1970s and early 1980s as a discipline combining multivariate statistics, computational tools, and chemistry. The need to calibrate spectrophotometers for accurate quantitative predictions drove much of this development. For NIR, IR, and Raman spectroscopy, chemometrics provided methods to transform complex spectral signals into reliable concentration estimates. See the three-part review articles titled “Review of Chemometrics Applied to Spectroscopy: 1985–95,” from the journal Applied Spectroscopy Reviews, for a comprehensive treatment of this subject (47–49).
Characterized by the ability of NIR energy to penetrate relatively deeply into samples with minimal sample preparation, NIR spectroscopy relies heavily on PLS calibration to deconvolute overlapping overtone and combination bands. Typically, preprocessing is applied to spectra to compensate for random noise, light scattering, reflection, and sample compression variations, as illustrated in the sketch below. In its earliest applications, NIR quantitative analysis relied on simpler, rudimentary calibration mathematics (50–53).
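While the text does not name a specific pre-treatment, one widely used scatter-correction example is the standard normal variate (SNV) transform; the minimal sketch below, using arbitrary simulated spectra, shows its row-wise centering and scaling. SNV is offered here only as an example of the kind of preprocessing mentioned above, not as the method used in the cited early applications.

```python
# Minimal sketch of one common scatter-correction pre-treatment, the standard
# normal variate (SNV) transform. The spectra below are arbitrary simulated data.
import numpy as np

def snv(spectra: np.ndarray) -> np.ndarray:
    """Center each spectrum (row) to zero mean and scale it to unit standard deviation."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

rng = np.random.default_rng(3)
# Simulated spectra (rows = samples, columns = wavelengths) with a per-sample
# multiplicative scatter effect applied to each row
raw = rng.uniform(0.2, 0.8, (5, 100)) * rng.uniform(0.8, 1.2, (5, 1))
corrected = snv(raw)

print(corrected.mean(axis=1))   # ~0 for each corrected spectrum
print(corrected.std(axis=1))    # ~1 for each corrected spectrum
```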
Mid-infrared (MIR or IR) spectra, rich in fundamental vibrations, benefit from PCA and PLS to extract meaningful chemical information from sometimes noisy data. Early AOAC-approved IR methods were simple, relying on the Beer-Lambert law. These methods calculated concentrations using fixed calibration constants without employing multivariate regression. In these early methods, a linear relationship (akin to an SLR or CLS single-variable regression) was established between absorbance and component concentration by applying simple single-variable linear regression (54).
IR spectroscopy adopted multivariate regression methods following their earlier use in atomic spectroscopy and NIR. Regression is mentioned in the 1980 review of infrared spectroscopy in the journal Analytical Chemistry, which reads, “Wavelength selection and calibration is accomplished by multivariate regression methods, using reference samples whose composition has been determined by other methods.” Multiple regression initially gained traction for determining the nitrogen content in grains, partly in response to environmental concerns associated with the disposal of sulfuric acid, a byproduct of Kjeldahl nitrogen analysis. The spectroscopic approach also offered the advantage of faster results. NIR diffuse reflectance proved more practical than MIR measurements because NIR sources provide higher intensity, detectors exhibit greater sensitivity, and the resolution requirements in NIR spectroscopy are often low enough to allow simple spectral isolation techniques, such as optical filters, to be used in constructing instrumentation (55).
ASTM E1655-97, published by ASTM International (formerly the American Society for Testing and Materials), first appeared as a standard formalizing best practices for multivariate calibration in IR and NIR spectroscopy, ensuring consistency and reliability across different instruments and applications. ASTM E1655 is an international standard practice for infrared multivariate quantitative analysis. It was originally approved in 1997, and the most recent edition was approved in 2017 (56). The practice was initially developed by Subcommittee E13.11, and the original draft was co-authored by Jerry Workman (then of Perkin-Elmer) and Jim Brown (then of Exxon), who received a formal recognition award from ASTM for this work.
Raman spectra, often complex because of fluorescence interference and weak signal intensity, require multivariate techniques like PLS and orthogonal signal correction to isolate meaningful signals. One early review of Raman spectroscopy used for process applications described the use of chemometrics from 1992 to 1997, covering the practical implementation of Raman-based instrumentation, which had only become advanced enough for such uses within this timeframe. Over this period, advancements in Raman instrumentation and the development of mathematical techniques for extracting quantitative data significantly improved its utility. Key public-domain applications included monitoring hard carbon coatings on computer hard disks, analyzing chemical compositions (for example, mixtures, solvent separations, and reactions like polymerization, hydrogenation, and curing), assessing gas compositions, studying fermentation processes, evaluating polymorphism in pharmaceuticals, and understanding polymer morphology. This detailed review highlights the then-current state of the field and its potential, providing a selective overview of developments (57).
In this Part I of our series chronicling the history and future of chemometrics, we have reflected on what has come before us. In Part II, we will attempt to peer into the future and see if we can predict where chemometrics might be going.
Jerome Workman, Jr. serves on the Editorial Advisory Board of Spectroscopy and is the Executive Editor for LCGC and Spectroscopy. He is the co-host of the Analytically Speaking podcast and has published multiple reference text volumes, including the three-volume Academic Press Handbook of Organic Compounds, the five-volume The Concise Handbook of Analytical Spectroscopy, the 2nd edition of Practical Guide and Spectral Atlas for Interpretive Near-Infrared Spectroscopy, the 2nd edition of Chemometrics in Spectroscopy, and the 4th edition of The Handbook of Near-Infrared Analysis.●
Howard Mark serves on the Editorial Advisory Board of Spectroscopy, and runs a consulting service, Mark Electronics, in Suffern, New York. Direct correspondence to: SpectroscopyEdit@mmhgroup.com ●