The Marriage of Near-Infrared Spectroscopy with AI: The Small-Sample Breakthrough

Scientists demonstrate a self-supervised learning framework that dramatically improves near-infrared spectroscopy classification results, even with minimal labeled data.

Artificial intelligence (AI) and NIR for Classification © putilov_denis - stock.adobe.com

Near-infrared (NIR) spectroscopy, a cornerstone of non-destructive analysis, has long been valued for its simplicity, speed, and efficiency. Yet its effectiveness often hinges on large databases, complex preprocessing, and skilled feature selection, all of which demand significant domain expertise. Researchers at Fujian Agriculture and Forestry University have introduced an approach to overcome these limitations: a convolutional neural network (CNN)-based self-supervised learning (SSL) framework designed to perform well even with small datasets. Published in Analytical Methods, the study aims to streamline spectral analysis by automating feature extraction and reducing reliance on labor-intensive data labeling (1–3).

Overcoming Challenges in NIR Spectroscopy

NIR spectroscopy commonly operates within the 780–2526 nm wavelength range, exploiting absorption patterns of hydrogen-containing groups like O–H, C–H, and N–H. Despite its versatility in analyzing organic molecules, challenges such as broad, overlapping peaks and low signal-to-noise ratios often complicate direct data interpretation. Traditional machine learning (ML) methods, which rely heavily on preprocessing, feature selection, and model construction, risk signal distortion and information loss (1–3).

Deep learning has emerged as a promising alternative, but its reliance on large labeled datasets has limited its adoption in NIR spectroscopy, where labeling is costly and time-consuming. To address this gap, the Fujian team (Rongyue Zhao, Wangsen Li, Jinchai Xu, Linjie Chen, Xuan Wei, and Xiangzeng Kong, affiliated with the School of Future Technology and the College of Mechanical and Electrical Engineering) developed a novel SSL framework to extract critical spectral features with minimal human intervention (1).

The Self-Supervised Learning Framework

The proposed SSL model comprises two stages: pre-training and fine-tuning. During pre-training, the model utilizes pseudo-labeled data to learn intrinsic spectral features, setting initial parameters without requiring human-labeled samples. Fine-tuning then optimizes these parameters using a smaller set of labeled data. By leveraging this two-stage process, the model reduces the need for preprocessing while enhancing classification accuracy (1).
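The article does not say how the pseudo-labels are produced, so the following Python sketch is only an illustration of one common self-supervised recipe, not the authors' method. It assumes a "pretext task" in which each unlabeled spectrum is transformed in one of several known ways and the index of the transform serves as the pseudo-label. The function name `make_pretext_dataset`, the four transforms, and the synthetic spectra are all hypothetical.

```python
import numpy as np

def make_pretext_dataset(spectra, seed=0):
    """Build pseudo-labeled training data from unlabeled spectra.

    Assumption (not from the study): the pseudo-label is the index of the
    transform applied to the spectrum, so no human labeling is needed.
    """
    rng = np.random.default_rng(seed)
    transforms = [
        lambda s: s,                                         # 0: unchanged
        lambda s: s[::-1],                                   # 1: reversed wavelength axis
        lambda s: np.roll(s, s.size // 4),                   # 2: circular shift
        lambda s: s + rng.normal(0.0, 0.05, size=s.shape),   # 3: additive noise
    ]
    inputs, pseudo_labels = [], []
    for s in spectra:
        for k, transform in enumerate(transforms):
            inputs.append(transform(s))
            pseudo_labels.append(k)
    return np.asarray(inputs), np.asarray(pseudo_labels)

# Synthetic stand-in for unlabeled NIR spectra: 10 samples, 200 wavelength points.
unlabeled = np.random.default_rng(1).random((10, 200))
X_pre, y_pre = make_pretext_dataset(unlabeled)
print(X_pre.shape, y_pre.shape)
```

In a two-stage setup of the kind described above, a network would first be trained to predict `y_pre` from `X_pre`; its learned weights would then initialize the classifier that is fine-tuned on the small human-labeled set.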

To validate the framework, the researchers applied it to their proprietary dataset of three tea tree varieties and to three publicly available datasets covering mango, tablet, and coal samples. The reported classification accuracies were high across all four datasets (1):

  • Tea Dataset: Achieved a classification accuracy of 99.12%.
  • Mango Dataset: Reached an accuracy of 97.83% for four mango varieties, utilizing NIR data collected from a Fourier transform near-infrared spectrometer.
  • Tablet Dataset: Attained 98.14% accuracy in categorizing pharmaceutical samples by active substance concentration.
  • Coal Dataset: Recorded an accuracy of 99.89%, demonstrating robustness across varied coal types and acquisition conditions.

Performance Insights

The framework’s transformative potential is evident in comparative experiments. When tested with only 5% of labeled data, the SSL model outperformed traditional ML methods by a substantial margin. Even as labeled data availability increased, the SSL approach maintained superior accuracy, displaying its efficiency and adaptability (1).
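To make the 5% setting concrete, the sketch below shows one way to draw a small labeled subset from a dataset. The article does not describe the sampling procedure, so drawing the same fraction from every class (stratified sampling) is an assumption, and `stratified_labeled_subset` and the toy labels are hypothetical.

```python
import numpy as np

def stratified_labeled_subset(labels, fraction, seed=0):
    """Return sorted indices covering roughly `fraction` of each class.

    Assumption (not from the study): at least one sample is kept per class
    so that every class is represented during fine-tuning.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        n_keep = max(1, int(round(fraction * idx.size)))
        chosen.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(chosen))

# Toy example: three classes with 100 samples each.
labels = np.repeat([0, 1, 2], 100)
labeled_idx = stratified_labeled_subset(labels, fraction=0.05)
print(len(labeled_idx))  # 15 indices, 5 per class
```

The remaining samples would be treated as unlabeled for pre-training or held out for evaluation, depending on the experimental design.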

Additionally, ablation studies confirmed the critical role of the pre-training phase, which enhanced model performance by up to 10.41%. The researchers attribute this success to the model’s ability to extract both local and global spectral features during pre-training, ensuring consistent generalization across datasets (1).

Implications for Spectral Analysis

This study highlights SSL’s potential to address long-standing challenges in spectral analysis. By automating feature extraction and minimizing data-labeling requirements, the CNN-based SSL framework reduces dependency on domain expertise while improving model reliability. The implications extend beyond NIR spectroscopy, offering a blueprint for advancing small-sample analyses in diverse fields, from agriculture to pharmaceutical products and environmental monitoring (1).

“Our results demonstrate that SSL can significantly enhance spectral analysis, even under the constraints of limited data availability,” the authors concluded. “This framework not only advances the capabilities of NIR spectroscopy but also opens doors for broader applications of SSL in analytical science” (1).

The study’s authors emphasize that this work sets the stage for more automated and scalable approaches to spectroscopy. By combining deep learning with self-supervised methods, the researchers have expanded what is possible with NIR spectroscopy for classification, marking a step toward smarter, more efficient NIR analytical techniques (1).

References

(1) Zhao, R.; Li, W.; Xu, J.; Chen, L.; Wei, X.; Kong, X. A CNN-Based Self-Supervised Learning Framework for Small-Sample Near-Infrared Spectroscopy Classification. Anal. Methods 2025, 13 Jan. DOI: 10.1039/D4AY01970A.

(2) Yang, J.; Xu, J.; Zhang, X.; Wu, C.; Lin, T.; Ying, Y. Deep Learning for Vibrational Spectral Analysis: Recent Progress and a Practical Guide. Anal. Chim. Acta 2019, 1081, 6–17. DOI: 10.1016/j.aca.2019.06.012.

(3) Yang, L.; Sun, Q. Recognition of the Hardness of Licorice Seeds Using a Semi-Supervised Learning Method and Near-Infrared Spectral Data. Chemom. Intell. Lab. Syst. 2012, 114, 109–115. DOI: 10.1016/j.chemolab.2012.03.010.
