Open Access | Editors' Choice | Article

PPO-GPR: A Custom Proximal Policy Optimization Tool for Active Reinforcement Learning


ACS Engineering Au

Cite this: ACS Eng. Au 2026, XXXX, XXX, XXX-XXX
https://doi.org/10.1021/acsengineeringau.5c00122
Published April 7, 2026

© 2026 The Authors. Published by American Chemical Society. This publication is licensed under CC-BY-NC-ND 4.0.

Abstract


Efficient data selection is critical in domains where data acquisition is expensive and time-consuming, such as material science. In this work, we introduce a novel active learning framework that integrates proximal policy optimization (PPO) with Gaussian process regression (GPR) to strategically select informative data points and thereby enhance predictive modeling. Leveraging the inherent stability and sample efficiency of PPO, achieved through a clipped surrogate objective, the framework guides data acquisition via a custom-designed Gymnasium environment tailored for GPR. In this environment, the PPO agent dynamically chooses data points based on their potential to improve the GPR’s performance, as measured by the R2 score, while preventing redundancy through an action masking mechanism. We apply the proposed methodology to predict the selectivity of methane (CH4) over higher alkanes in metal–organic frameworks (MOFs), focusing on CuBTC and IRMOF-1. The framework is evaluated using both ternary and quaternary gas mixtures, where the performance of the GPR is assessed through metrics such as R2, mean absolute error (MAE), and root mean squared error (RMSE). Across CuBTC and IRMOF-1 in ternary and quaternary hydrocarbon mixtures, PPO-guided acquisition achieves 77–86% data savings relative to full GCMC grids, typically querying only ∼14–23% of the candidate pool while the clipped-update PPO policy converges stably by focusing selections in the pressure–temperature–composition regions where selectivity changes most rapidly. This work shows the potential of combining advanced reinforcement learning techniques with regression models to accelerate material discovery and optimize gas separation processes.


Special Issue

Published as part of ACS Engineering Au special issue “AI and Machine Learning in Chemical Engineering: Breakthroughs and Applications”.

1. Introduction


In many real-world applications, especially in domains like material science, acquiring high-quality data can be both expensive and time-consuming. This is especially true in quantum and molecular modeling workflows. For instance, generating adsorption and selectivity labels can require thousands of grand canonical Monte Carlo (GCMC) steps across a combinatorial space of pressures, temperatures, and mixture compositions. Metal–organic frameworks (MOFs) exemplify this challenge: their modular chemistry and tunable pore structures create an enormous design space. (1−5) Over the past decade, MOFs have become a leading material for adsorption-based gas storage and separations, with extensive work demonstrating how pore size, topology, and chemical functionality can be engineered to modulate host–guest interactions and enable selective uptake. (6−8)
Within separations, multicomponent hydrocarbons remain a particularly demanding regime, and considerable progress has been made in designing MOFs for multicomponent hydrocarbon processes, including separations spanning light and heavy alkanes, where performance can depend strongly on mixture nonideality and on how adsorption sites saturate under operating conditions. (9) At the same time, the scale of the MOF landscape has pushed the field toward high-throughput computational screening, enabled by computation-ready structural databases and standardized cleaning/processing pipelines that make it feasible to evaluate thousands of experimentally reported structures in silico. (10,11) Screening studies and subsequent reviews have highlighted both the promise of these approaches and the central bottleneck they face: even with efficient molecular simulation engines, exhaustive coverage of operating conditions (e.g., pressure/temperature grids coupled to multidimensional mixture compositions) quickly becomes prohibitive, motivating strategies that can learn accurate surrogate models from far fewer labeled points. (12−14)
Active learning (AL) (15−26) addresses this bottleneck by selecting new training samples adaptively, prioritizing those expected to maximize information gain or reduce model uncertainty rather than sampling uniformly. In materials and chemical discovery more broadly, AL and closely related Bayesian optimization ideas have become key components of closed-loop workflows, where models guide which experiments or simulations to run next, thereby accelerating discovery under constrained budgets. (15,27) Gaussian process regression (GPR) is especially well-suited to this setting because it provides not only predictions but also uncertainty estimates, which can be converted into principled acquisition signals for data selection. (29−32) However, the effectiveness of a GPR-based surrogate depends critically on whether the training set spans the “right” regions of the data domain. AL offers a promising solution, and when combined with advanced reinforcement learning (RL) (17,33−38) techniques, it can significantly enhance predictive modeling while reducing data collection burdens.
Reinforcement learning has recently begun to appear in inverse design settings for MOFs, where the “action” is proposing new reticular building-block combinations/structures and the “reward” is a separation-relevant objective (e.g., DAC performance). (39) One of the most robust RL algorithms is proximal policy optimization (PPO). (40−42) Renowned for its stability and sample efficiency, PPO employs a clipped surrogate objective that allows for incremental policy updates without large, destabilizing changes. These characteristics make PPO particularly well-suited for AL scenarios, where the aim is to dynamically select the most informative samples for training.
To further bridge this gap, and building on our past works, (19−21,38) this work introduces an innovative methodology that integrates PPO with GPR within an AL framework. First, we detail the theoretical underpinnings and operational strengths of PPO. Building on this, we then demonstrate how the data selection capabilities of PPO can be harnessed to improve a GP model. This integration is achieved through a custom-designed environment where the PPO agent strategically selects the most informative data points, thereby enhancing the predictive accuracy of the GP model while reducing the volume of training data required. We then transition to its practical application in predicting the selectivity of CH4 over higher alkanes in MOFs, using grand canonical Monte Carlo (GCMC) simulations. We focus on the ability of materials like CuBTC and IRMOF-1 to separate CH4 from higher alkanes in both ternary and quaternary gas mixtures.

2. Background


2.1. Reinforcement Learning: Value and Policy Functions

Reinforcement learning (RL) is a paradigm within machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. (35) Two foundational concepts in RL are value functions and policy functions, each offering distinct approaches to learning optimal behaviors.

2.1.1. Value Functions

A value function estimates the expected cumulative rewards an agent can obtain from a particular state or state-action pair. Specifically, the state value function V(s) assesses the desirability of being in the state s, while the action-value function Q(s, a) evaluates the expected rewards of taking action a in the state s. Value-based methods, such as Q-Learning (43−47) and Deep Q-Networks (DQN), (33,44,48) focus on learning these value functions. Once the value function is accurately estimated, the optimal policy, i.e., the strategy that dictates the best action to take in each state, is implicitly derived by selecting actions that maximize the estimated value.

2.1.2. Policy Functions

In contrast, a policy function π(a|s) directly maps states to actions, specifying the probability of taking each action in each state. Policy-based methods, such as policy gradient methods (49−51) and proximal policy optimization (PPO), (40,41,52,53) focus on optimizing this mapping to maximize cumulative rewards. By adjusting the policy parameters, these methods enable the agent to learn a strategy that maximizes expected rewards through trial and error. Policy-based approaches are particularly advantageous in environments with continuous or large action spaces, where value-based methods might struggle due to the vast number of possible actions.

2.1.3. Actor-Critic Methods

Hybrid approaches like actor-critic methods (54−58) combine the strengths of both value-based and policy-based approaches. An actor network updates the policy directly, while a critic network estimates the value function to guide the policy updates. This combination allows for more stable and efficient learning, leveraging the direct optimization of policy methods while benefiting from the evaluative insights of value functions.

2.1.4. Q-Learning and the Bellman Equation

Q-Learning is a quintessential value-based RL algorithm that aims to learn the optimal action-value function Q*(s, a), representing the maximum expected cumulative reward achievable by taking action a in state s and following the optimal policy thereafter. (43−47) The Bellman Equation underpins Q-Learning, defining the recursive relationship for optimal value functions
Q^*(s,a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s',a') \,\middle|\, s, a \right] \quad (1)
Here, r is the immediate reward received after taking action a in state s, γ is the discount factor, and s′ is the next state resulting from action a. The Q-Learning update rule iteratively refines Q(s, a) as follows
Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s',a') \right] \quad (2)
where α is the learning rate. This rule combines the current estimate with the latest information to approach the optimal action-value function Q*(s, a).
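As an illustrative sketch of this tabular update rule (the function name and toy dimensions below are ours, not part of the paper's code):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step implementing eq 2.
    Q is an (n_states, n_actions) array of action-value estimates."""
    td_target = r + gamma * np.max(Q[s_next])            # r + gamma * max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * td_target  # eq 2
    return Q

# usage: a 5-state, 2-action toy table
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
```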

2.1.5. Proximal Policy Optimization (PPO)

Unlike Q-learning, PPO is a policy-based RL algorithm that simplifies and stabilizes policy optimization, making it highly effective for solving high-dimensional tasks. PPO balances exploration and exploitation by constraining policy updates to remain within a certain proximity to the previous policy, preventing drastic changes that could destabilize learning. The primary objective of PPO is to maximize the expected cumulative reward
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] \quad (3)
where γ ∈ [0, 1] is the discount factor, rt is the reward at time step t, and τ represents a trajectory of states, actions, and rewards. The gradient of J(θ) with respect to the policy parameters θ is given by
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t \right] \quad (4)
where At is the advantage function, defined as
A_t = Q(s_t, a_t) - V(s_t) \quad (5)
To ensure stable policy updates, PPO introduces a clipped surrogate objective
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta) A_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t \right) \right] \quad (6)
where rt(θ) = πθ(at|st)/πθold(at|st) is the probability ratio between the new and old policies, and ϵ controls the clipping range. This mechanism prevents the policy from changing too drastically, ensuring more stable and reliable learning.
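A minimal NumPy sketch of eq 6 (written with the sign flipped so it can be minimized; array names are illustrative):

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate objective of eq 6, negated for minimization.
    Inputs are per-sample arrays over one rollout batch."""
    ratio = np.exp(log_prob_new - log_prob_old)         # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))     # maximize L^CLIP
```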
2.1.5.1. Value Function Approximation
PPO jointly trains a value function Vϕ(s) to estimate the expected return from state s
V_\phi(s) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_t = s \right] \quad (7)
This value function serves as a baseline for the advantage function. The value function is optimized by minimizing the mean squared error between the estimated and actual returns
L^{\mathrm{VF}}(\phi) = \mathbb{E}_t\left[ \left( V_\phi(s_t) - \hat{R}_t \right)^2 \right] \quad (8)
where \hat{R}_t is the discounted return.
2.1.5.2. Generalized Advantage Estimation (GAE)
To reduce variance in advantage estimates, PPO employs generalized advantage estimation (GAE)
A_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l} \quad (9)
where \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) is the temporal-difference error and λ ∈ [0, 1] controls the trade-off between bias and variance.
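A sketch of the backward-recursive GAE computation over one finite trajectory (names are illustrative; `values` carries one extra bootstrap entry V(s_T)):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation (eq 9).
    rewards: length-T array; values: length-(T+1) array incl. bootstrap."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # eq 9 recursion
        adv[t] = running
    return adv
```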
2.1.5.3. Entropy Regularization
Entropy regularization encourages exploration by penalizing overly deterministic policies
L^{\mathrm{ENT}} = \beta \sum_a \pi_\theta(a \mid s) \log \pi_\theta(a \mid s) \quad (10)
where β is a regularization coefficient that controls the strength of the penalty. Table 1 below summarizes the RL methods discussed in this paper.
Table 1. Comparison of Value-Based and Policy-Based Methods

| aspect | value-based methods | policy-based methods |
| --- | --- | --- |
| primary focus | estimating value functions (V(s), Q(s, a)) | directly learning the policy (π(a|s)) |
| policy derivation | indirectly derived by selecting actions with the highest values | explicitly learned and optimized |
| action space | best for discrete or small action spaces | suited for continuous or large action spaces |
| exploration | relies on exploration strategies (e.g., ε-greedy) | can incorporate stochastic policies |
| sample efficiency | generally more sample-efficient | typically requires more samples |
| common algorithms | Q-Learning, DQN | REINFORCE, (59−61) PPO, A3C, Actor-Critic |
| use cases | games with discrete actions (e.g., Atari) | robotics, control tasks with continuous actions |

3. Methods


3.1. Proximal Policy Optimization with Custom Gaussian Process Regression Environment

We integrate PPO with our GPR surrogate via a custom environment, GPR_Env, implemented against the standard Gym/Gymnasium API. (62,63) Adopting this interface lets the agent interact with the surrogate in a plug-and-play manner while we define rewards that capture the predictive improvement from newly acquired points.
For the PPO implementation itself, we use Stable-Baselines3 (SB3), (64) specifically the SB3-Contrib MaskablePPO variant, so we can pass action masks that rule out already-selected indices. This preserves the core PPO behavior (clipped surrogate objective; default γ, λ, and clip_range) while adding invalid-action masking. GPR_Env thus mediates the interaction between the PPO agent and the Gaussian process model (GPModel), enabling strategic data point selection to enhance the model's predictive performance.
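As a minimal sketch of this wiring (assuming GPR_Env follows the Gymnasium API described below and exposes an action_mask() helper returning the availability vector; the constructor arguments and method names are illustrative, not the paper's verified implementation):

```python
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

# Hypothetical wiring of MaskablePPO to the custom environment.
def mask_fn(env):
    return env.action_mask()  # boolean vector: True = index still selectable

env = ActionMasker(GPR_Env(prior_data, test_data), mask_fn)
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```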

3.1.1. Environment Design

The GPR_Env environment simulates the data selection process by allowing the PPO agent to choose data points that will be used to train the GPModel. The environment is initialized with prior data (initial training set) and test data (available pool for selection). The agent’s objective is to select data points that maximize the predictive performance of the GPModel while minimizing the number of required data points.

3.1.2. State Representation (st)

At each time step t, the state st includes:
  • Availability Vector: A binary vector indicating the availability of each data point in the test set (1 for available, 0 for selected).

  • Performance Metrics: Current performance metrics of the GPModel, specifically the R2 score.

  • Selection History: A history of selected data points up to the current step, providing context for the agent’s decision-making process.

3.1.3. Action Space (A)

The action space is discrete, with each action at corresponding to selecting a specific data point index from the available pool. Action masking ensures that the agent cannot select data points that have already been chosen, maintaining the integrity of the selection process. This prevents redundant selections and promotes diversity in the training set.

3.1.4. Reward Function (rt)

The reward at each time step is based on the improvement in the GPModel’s R2 score
r_t = \alpha \left( R^2_t - R^2_{t-1} \right) \quad (11)
where α is a scaling factor. Positive rewards are assigned when the R2 score improves, encouraging the agent to select data points that enhance model performance. Nonimprovements yield zero rewards, while significant improvements may yield higher rewards. This incentivizes the agent to make selections that have a meaningful impact on the model’s accuracy.

3.2. Termination Conditions

An episode in the GPR_Env environment terminates when one of the following conditions is met (a minimal step() sketch combining these conditions with the reward of eq 11 follows the list):
  • Target Performance: The GPModel achieves or exceeds a target R2.

  • Exhausted Data Pool: All data points in the test set have been selected.

  • Maximum Steps: A predefined maximum number of steps is reached, preventing excessively long training episodes.
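The sketch below shows how these pieces could fit into a single Gymnasium-style step() method; the attribute and method names (gp_model, add_point, score, _observe) are illustrative assumptions, not the paper's exact code.

```python
class GPREnvStepSketch:
    """Fragment of a Gymnasium-style environment illustrating eq 11 and
    the termination conditions above; supporting attributes are assumed
    to be set in __init__/reset (availability mask, pools, counters)."""

    def step(self, action):
        self.availability[action] = 0                  # mask out the chosen index
        self.gp_model.add_point(self.pool_X[action], self.pool_y[action])
        r2_new = self.gp_model.score()                 # current R^2

        reward = self.alpha * (r2_new - self.r2_prev)  # eq 11
        self.r2_prev = r2_new
        self.steps += 1

        terminated = (
            r2_new >= self.target_r2                   # target performance reached
            or not self.availability.any()             # data pool exhausted
        )
        truncated = self.steps >= self.max_steps       # step budget hit
        return self._observe(), reward, terminated, truncated, {}
```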

3.3. Gaussian Process Regression Model (GPModel)

The GPModel class encapsulates the functionality required for data preprocessing, model training, incremental updates, and performance evaluation. In this work, we study the selectivity of CH4 over C2H6 and C3H8 in the ternary mixture and over C2H6, C3H8, and C4H10 in the quaternary mixture. For the ternary study, the input features to the GPModel are pressure, temperature, and the molar compositions of C2H6 and C3H8; for the quaternary study, they are pressure, temperature, and the molar compositions of C2H6, C3H8, and C4H10. In both cases, the molar composition of CH4 is not an input feature, while the selectivity of CH4 over the higher alkanes is the target feature. The kernels used in this study are discussed in Section 4.2.

3.4. Grand Canonical Monte Carlo

We computed mixture adsorption with grand canonical Monte Carlo (μVT) in CuBTC (HKUST-1) and IRMOF-1 (MOF-5), both modeled as rigid frameworks. The rigid MOFs were described using the Universal Force Field, (65) and framework charges were not considered. We considered two studies: ternary mixtures with methane, ethane, and propane as adsorbates, and quaternary mixtures that additionally include n-butane. The adsorbates were described using the Transferable Potentials for Phase Equilibria (TraPPE) force field with Lorentz–Berthelot mixing rules. (65,66) Each GCMC state point was equilibrated for 5 × 10^4 cycles and sampled for 5 × 10^5 production cycles using RASPA. (67) The μVT move set comprised translation (0.5), rotation (0.5), reinsertion (0.5), regrowth for chain molecules (0.5), identity change (1.0), and swap (1.0), where the values in parentheses denote relative move probabilities.
In the Supporting Information (SI), we provide some adsorption isotherms of the ternary and quaternary mixtures across the two MOFs: CuBTC and IRMOF-1, across three temperatures (200, 300, and 400 K) and molar compositions. These are shown in Figures S1–S4.

4. Use Cases: Ternary and Quaternary Selectivity of CH4 over Higher Alkanes in MOFs


In this work, we applied the PPO-GPR tool to the ternary and quaternary cases separately. The following sections apply to each case study in turn.

4.1. Feature Engineering

Input features include:
  • Pressure (X1): Measured in bar, ranging from 10^−4 to 100 bar. A logarithmic transformation was applied to this feature to stabilize variance.

  • Temperature (X2): Ranging from 200 to 400 K. Normalized using mean and standard deviation to ensure numerical stability and consistent scaling.

  • Mole Fractions (X4, X5, X6 (for quaternary)): Representing the mole fractions of ethane (C2H6), propane (C3H8), and butane (C4H10, for the quaternary case), respectively.

The target variable is the selectivity S of CH4 over the higher alkanes, given by
S_{\mathrm{CH_4}/k} = \frac{q_{\mathrm{CH_4}} / y_{\mathrm{CH_4}}}{\sum_k q_k / \sum_k y_k}, \qquad k \in \{\mathrm{C_2H_6}, \mathrm{C_3H_8}\,[,\ \mathrm{C_4H_{10}}]\} \quad (12)
where qk is the adsorption uptake of component k, and yk is its mole fraction in the gas mixture. The selectivity values are log-transformed and normalized to ensure consistency with the input features and to facilitate the GPModel’s ability to capture nonlinear relationships.
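As a concrete illustration of eq 12 (a sketch; the dictionary-based interface and example numbers are ours, not the paper's):

```python
def ch4_selectivity(q, y, heavy=("C2H6", "C3H8")):
    """Selectivity of CH4 over the heavier alkanes (eq 12) from GCMC
    uptakes q and gas-phase mole fractions y, both dicts keyed by
    species; add "C4H10" to `heavy` for the quaternary case."""
    q_heavy = sum(q[k] for k in heavy)
    y_heavy = sum(y[k] for k in heavy)
    return (q["CH4"] / y["CH4"]) / (q_heavy / y_heavy)

# usage with hypothetical uptakes (mol/kg) for an equimolar ternary feed
q = {"CH4": 0.8, "C2H6": 2.4, "C3H8": 4.1}
y = {"CH4": 1 / 3, "C2H6": 1 / 3, "C3H8": 1 / 3}
print(ch4_selectivity(q, y))  # < 1 here: the heavier alkanes outcompete CH4
```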

4.2. Model Initialization and Training

The GPModel employs a composite kernel combining rational quadratic (RQ) (29,68−70) and Matérn kernels (71,72) to capture complex, nonlinear relationships in the data. The RQ and Matérn kernels were tested separately on a smaller data set (see Section 2 of the SI), across the same pressure and mole fraction range, and the best results were attained using the composite kernel. The kernel combination used is
k(x, x') = k_{\mathrm{RQ}}(x, x') + k_{\mathrm{Matern}}(x, x') \quad (13)
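A minimal scikit-learn sketch of this composite kernel (the library choice, ν value, and white-noise term are our assumptions; the paper does not specify them):

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    Matern, RationalQuadratic, WhiteKernel,
)

# Composite RQ + Matérn kernel of eq 13, plus an assumed noise term;
# kernel hyperparameters are refit at every incremental retraining.
kernel = RationalQuadratic() + Matern(nu=2.5) + WhiteKernel(noise_level=1e-5)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
# gp.fit(X_prior, y_prior)
# mu, std = gp.predict(X_query, return_std=True)
```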
The initial data was constructed at two pressure anchors to span the adsorption range, a low-pressure point (X1 = 1 × 10^−4 bar) and a high-pressure point (X1 = 100 bar), crossed with temperatures from 200 to 400 K in 20 K increments (X2). For each (pressure, temperature) pair we enumerated representative gas compositions by varying the methane, ethane, and propane mole fractions (X3, X4, X5) together with n-butane (X6, for the quaternary case) such that the mole fractions sum to unity; the grid includes equimolar (0.25 each), corner-biased cases (e.g., CH4-rich or C4H10-rich), and intermediate mixtures to cover the composition space. The total prior data amounts to 2.3% of the total pool. Upon selection of new data points by the PPO agent, the model is incrementally retrained with the updated data set.
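A sketch of this grid construction for the quaternary case (the small composition library below is illustrative; the paper's actual grid is larger):

```python
import itertools
import numpy as np

pressures = [1e-4, 100.0]               # bar: low- and high-pressure anchors
temperatures = np.arange(200, 401, 20)  # K, 20 K increments
compositions = [                        # (CH4, C2H6, C3H8, C4H10), sum to 1
    (0.25, 0.25, 0.25, 0.25),           # equimolar
    (0.70, 0.10, 0.10, 0.10),           # CH4-rich corner
    (0.10, 0.10, 0.10, 0.70),           # C4H10-rich corner
    (0.40, 0.30, 0.20, 0.10),           # intermediate mixture
]
prior = [(p, t, *x) for p, t, x in
         itertools.product(pressures, temperatures, compositions)]
```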

4.3. Model Evaluation

The GPModel performance is evaluated using the R2 score, MAE, and RMSE on both the testing and unlabeled data sets. These metrics provide a comprehensive assessment of the model’s predictive accuracy and reliability. Predictions are inverse-transformed to the original scale before calculating the performance metrics.
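A sketch of this evaluation step (assuming a fitted StandardScaler-like y_scaler and a natural-log target transform, both our assumptions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true_t, y_pred_t, y_scaler):
    """Inverse-transform (un-standardize, then undo the log) before
    computing R^2, MAE, and RMSE on the original selectivity scale."""
    y_true = np.exp(y_scaler.inverse_transform(y_true_t.reshape(-1, 1)).ravel())
    y_pred = np.exp(y_scaler.inverse_transform(y_pred_t.reshape(-1, 1)).ravel())
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
    }
```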

4.4. AL Framework Implementation

The AL framework operates through iterative interactions between the PPO agent and the GPR_Env environment. The process is designed to optimize data selection, enhance model performance, and ensure efficient resource utilization.

4.4.1. Initialization

1. GPModel Training: An initial GPModel is trained using the prior data, establishing a baseline for performance.

2. PPO Agent Setup: The PPO agent is initialized with the specified policy and value networks, configured with the defined hyperparameters to facilitate effective learning.

3. Environment Configuration: The GPR_Env environment is set up with all data points available for selection, providing a comprehensive pool from which the agent can draw.

4.4.2. AL Loop

1. State Observation: At each step, the agent observes the current state st, which includes the available data points and the current performance of the GPModel.

2. Action Selection: Based on the observed state, the agent selects an action at, corresponding to the index of a specific data point in the test set.

3. Environment Update: The selected data point is added to the GPModel’s training set, and the model is retrained to incorporate the latest information.

4. Reward Assignment: The agent receives a reward rt based on the improvement in the GPModel’s R2 score, incentivizing selections that enhance predictive performance.

5. Policy and Value Function Update: Using the PPO algorithm, the agent updates its policy and value function parameters to maximize the expected cumulative rewards, refining its data selection strategy over time.

4.4.3. Termination

The AL loop terminates when the GPModel achieves the target R2 score, all data points have been selected, or the maximum number of steps is reached. This ensures that the learning process is both efficient and effective, preventing unnecessary computational overhead.

4.5. Training and Evaluation Scripts

4.5.1. Training Script (train.py)

This script orchestrates the AL process by initializing the environment, configuring the PPO agent, and managing the interaction loop. It monitors the GPModel’s performance metrics, logs training progress, and ensures that the agent’s learning trajectory aligns with the study’s objectives.

4.5.2. Evaluation Script (evaluate.py)

Post-training, this script evaluates the PPO agent’s effectiveness by simulating multiple AL episodes. It records the sequence of selected data points, assesses the GPModel’s performance improvement, and compiles the results for comprehensive analysis.

4.5.3. Exporting Final Data set (export_prior_from_idx.py)

This script consolidates the data points selected by the PPO agent into a final training data set. The GPModel is retrained with this optimized data set, and predictions are made on the unlabeled data to validate the model’s generalization capabilities. The script also generates detailed reports of the final performance metrics, facilitating a thorough understanding of the framework’s efficacy.

4.6. Data Management and Preprocessing

The data set comprises measurements of gas adsorption in MOFs under varying conditions. Each data point includes pressure, temperature, and the mole fractions of CH4, C2H6, and C3H8 (plus C4H10 in the quaternary case), along with the corresponding selectivity values. The data is divided into prior, test, and unlabeled sets to ensure a balanced and representative distribution.

4.6.1. Normalization and Transformation

To ensure numerical stability and consistent scaling, the following preprocessing steps are applied (a minimal sketch follows the list):
  • Pressure (X1): Logarithmic transformation followed by normalization. This transformation stabilizes variance and captures relationships between pressure and selectivity.

  • Temperature (X2): Normalized using mean and standard deviation to center the data and scale it to unit variance.

  • Mole Fractions (X4, X5, and X6 for the quaternary case): Normalized to have zero mean and unit variance, ensuring that all features contribute equally to the model’s predictions.

  • Selectivity (y): Log-transformed and normalized to align with the input features’ scaling and to facilitate the GPModel’s ability to capture nonlinear relationships.
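A minimal sketch of these transformations (assuming base-10 log for pressure, natural log for selectivity, and a NumPy feature matrix whose first column is pressure; these conventions are our assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def preprocess(X, y):
    """Log-transform pressure and selectivity, then standardize all
    features and the target to zero mean and unit variance."""
    Xt = X.astype(float)
    Xt[:, 0] = np.log10(Xt[:, 0])      # pressure column
    yt = np.log(y)                     # selectivity target
    x_scaler, y_scaler = StandardScaler(), StandardScaler()
    Xt = x_scaler.fit_transform(Xt)
    yt = y_scaler.fit_transform(yt.reshape(-1, 1)).ravel()
    return Xt, yt, x_scaler, y_scaler
```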

4.6.2. Data Splitting

The data is split into three subsets:
  • Prior Data (Prior.csv): This initial training set contains data points as described in Section 4.2, ensuring broad coverage of adsorption conditions and providing the GPModel with diverse training samples.

  • Test Data (Test.csv): Serving as the pool for AL, this subset contains data points available for the PPO agent to select, enabling the agent to iteratively enhance the GPModel’s performance.

  • Unlabeled Data (Unlabeled.csv): This subset is used for final prediction assessments without known selectivity values, allowing for an unbiased evaluation of the GPModel’s generalization capabilities.

To reiterate, the prior data is carefully curated to encompass all temperatures and molar compositions at the specified pressure values, as previously stated. The remaining data is split into Test.csv and Unlabeled.csv in a 35:65 ratio, ensuring that the agent’s final model can generalize effectively to unseen data.
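A one-line sketch of this split (`remaining` is an assumed DataFrame holding all rows not in Prior.csv; the random seed is arbitrary):

```python
from sklearn.model_selection import train_test_split

# 35% of the non-prior rows become the AL pool, 65% stay unlabeled.
test_df, unlabeled_df = train_test_split(remaining, train_size=0.35,
                                         random_state=0)
test_df.to_csv("Test.csv", index=False)
unlabeled_df.to_csv("Unlabeled.csv", index=False)
```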

4.7. Model Training and Updating

The GPModel serves as the backbone of the predictive framework, modeling the complex relationships between input features and MOF selectivity. The GPModel is initially trained using the prior data set, establishing a baseline for performance. As the PPO agent selects additional data points only from the test data Test.csv, these points are appended to the training set, and the GPModel undergoes retraining. This incremental learning approach allows the model to continuously incorporate new, informative data, enhancing its predictive accuracy and robustness. Each retraining iteration involves optimizing the kernel parameters and noise variance to fit the updated data set, ensuring that the model adapts to the newly introduced information.

4.8. Hyperparameter Configuration

4.8.1. PPO Tunable Hyperparameters

We used a learning rate of 3 × 10^−4, which is the Stable-Baselines3 (SB3) PPO default and a standard choice for steady, well-behaved updates. For the rollout length and optimization minibatching, we intentionally set a compact configuration (n_steps = 10 and batch_size = 10) to keep policy updates tightly coupled to the environment feedback in our custom GPR_Env. The clipping parameter for PPO’s surrogate objective was left at the SB3 default, clip_range = 0.2, which constrains policy changes and is widely recommended in reference implementations. The discount factor followed the SB3 default γ = 0.99, emphasizing long-horizon returns, and we used generalized advantage estimation with the standard setting λ = 0.95 to balance bias and variance during advantage computation.
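Written out as an SB3 constructor call (a sketch; `env` is the masked GPR_Env from Section 3.1):

```python
from sb3_contrib import MaskablePPO

model = MaskablePPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,  # SB3 default
    n_steps=10,          # compact rollout, tightly coupled to GPR feedback
    batch_size=10,
    clip_range=0.2,      # SB3 default
    gamma=0.99,          # SB3 default
    gae_lambda=0.95,     # standard GAE setting
)
```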

4.8.2. GPModel Hyperparameters

  • Kernel Parameters: Length scales and variances for the composite rational quadratic and Matérn kernels. These parameters are crucial for capturing nonlinear and multiscale relationships inherent in the data.

  • Noise Variance: Represents observation noise in the GPModel, accounting for measurement uncertainties and ensuring robust predictions.

4.8.3. Environment Parameters

  • Batch Size: Number of data points added to the GPModel at each step (set to 1 for granular selection). This fine-grained selection promotes targeted improvements in the model.

  • Maximum Steps: Determined based on the total number of available data points (e.g., 100 steps). This parameter prevents excessively long training episodes, balancing thoroughness with computational efficiency.

5. Results


We applied the PPO-integrated AL framework to predict the selectivity of CH4 over C2H6, C3H8, and C4H10 in both CuBTC and IRMOF-1 MOFs. The framework was evaluated on both ternary and quaternary gas mixtures, with performance metrics including R2, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE), with respect to the ground truth selectivity computed from the GCMC data, reported for testing and unlabeled data sets.

5.1. Performance on Ternary Mixtures

For the ternary mixture comprising CH4, C2H6, and C3H8, the PPO agent effectively selected data points that significantly improved the GPModel’s predictive performance. Table 2 summarizes the performance metrics.
Table 2. Performance Metrics of the RL Model for CuBTC and IRMOF-1 on Various Data Sets for the Ternary Mixture^a

| data set | R2 | MAE | RMSE |
| --- | --- | --- | --- |
| CuBTC testing data set (ternary mixture) | 0.980 | 0.002 | 0.010 |
| CuBTC unlabeled data set (ternary mixture) | 0.967 | 0.004 | 0.013 |
| IRMOF-1 testing data set (ternary mixture) | 0.983 | 0.002 | 0.011 |
| IRMOF-1 unlabeled data set (ternary mixture) | 0.840 | 0.007 | 0.029 |

^a These results are for the selectivity of CH4 over C2H6 and C3H8 in the MOFs.

The results for the ternary mixtures demonstrate the efficacy of the PPO-integrated AL framework in enhancing the GPModel’s predictive accuracy. In both CuBTC and IRMOF-1, the R2 scores on the testing data sets reached 0.980 and 0.983, respectively, indicating a high degree of correlation between the predicted and actual selectivity values. This level of accuracy signifies that the model effectively captures the underlying relationships between the input features and MOF selectivity.
Moreover, performance on the unlabeled data sets, with R2 scores of 0.967 for CuBTC and 0.840 for IRMOF-1, underscores the model’s strong generalization capabilities. The higher R2 score for CuBTC suggests that the framework is particularly effective for this MOF. The slightly lower R2 score for IRMOF-1, while still substantial, indicates room for further optimization, potentially through additional data points or alternative feature representations.
Notably, the lower R2 on the IRMOF-1 unlabeled pool is driven by a concentrated subset of operating conditions (low T, low–mid P, and propane-rich compositions) where the model error increases (Figure 4), rather than by a different sampled P–T–composition domain.
An important analysis is to show the points chosen by the PPO-GPR agent in terms of the input feature space (pressure, temperature, and the molar compositions of ethane and propane) for the predictive modeling of selectivity in the ternary mixture. Figure 1 shows the distribution of the input feature space for the CuBTC MOF in the ternary case of selectivity of CH4 over the higher alkanes.


Figure 1. Overlaid distributions of the full design pool (“All”, blue) versus PPO–GPR selections (“Selected”, orange) for CuBTC under the ternary CH4/C2H6/C3H8 case.

The pool is strongly skewed toward low pressure, and the agent likewise concentrates selections in the lowest-pressure bin while still sampling across the mid–high range. In contrast, temperature and composition selections closely track the availability of their discrete levels (nearly uniform across temperatures; higher counts at composition levels that appear more often), indicating that conditional on composition and temperature, the agent’s main leverage for informative sampling in CuBTC comes from exploring pressure.
Also, it is important to show how the model performs across various ranges of the unlabeled data set. The feature space here includes only pressure, temperature, and the molar compositions of C2H6 and C3H8, since the molar composition of CH4 is not an input feature of the GPModel. For the ternary mixtures, we stratify composition by binning ethane (rows) and propane (columns) into three ranges each, using quantile-based edges so that panels have comparable sample counts. If quantiles collapse (e.g., repeated values), we fall back to uniform spacing over the observed range, as sketched below. Within every (ethane, propane) panel we compute the RMSE on a pressure–temperature grid: pressure is binned uniformly, while temperature uses its discrete simulation levels. Panel titles report the exact composition interval for each axis.
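A sketch of the binning rule just described (function name and bin-count handling are illustrative):

```python
import numpy as np
import pandas as pd

def composition_bins(series, n_bins=3):
    """Quantile-based bin edges with a uniform-spacing fallback when
    quantiles collapse onto repeated values."""
    edges = np.unique(np.quantile(series, np.linspace(0, 1, n_bins + 1)))
    if len(edges) < n_bins + 1:  # quantiles collapsed
        edges = np.linspace(series.min(), series.max(), n_bins + 1)
    return pd.cut(series, bins=edges, include_lowest=True)
```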
In Figure 2, the P–T RMSE heatmaps of the unlabeled-set predictions (stratified by ethane × propane) show that the model is generally accurate across most of the grid, with localized pockets of larger error concentrated at low temperatures (∼200–280 K) and mid pressures (∼15–35 bar). These pockets are most visible when ethane is on the low side (≈0.10–0.20) and propane is moderate–high (≈0.30–0.70). As temperature rises above ∼320 K or pressure moves away from that midpressure band (very low <10 bar or higher >60 bar), errors flatten out. When ethane increases (≈0.40–0.60), the error field becomes more uniform and generally lower across pressures. Consistent with the optimization metric used in the PPO-GPR loop, the R2 heatmap in Figure S9 exhibits the same regime dependence: regions of elevated RMSE coincide with localized reductions in R2, while the remainder of the grid maintains R2 values near unity.


Figure 2. RMSE heatmaps for CuBTC (ternary). Each panel shows the prediction error over pressure (bar) × temperature (K) for a fixed composition bin (labels atop each panel). Color encodes RMSE between model predictions and true selectivity.

The sampling plots in Figure 1 explain a lot of this pattern as the PPO-GPR policy also focuses on the lowest-pressure bin while still sprinkling points at mid–high pressures; temperature is essentially uniform, and composition selections mostly track availability. The higher-error cells in the heatmaps align with parts of the space that are comparatively under-represented (midpressure, low-temperature, low-ethane/high-propane panels), so the model is being asked to interpolate with less support there. In contrast, densely sampled regions (very low P and the most common composition levels) exhibit consistently low error, indicating the surrogate is reliable for screening in those regimes. Note that Figure 1 reports raw counts; because the candidate pool is highly skewed toward low pressures, the smaller PPO-selected counts can be visually compressed near the axis. Moreover, PPO-GPR is not intended to sample proportionally to the pool distribution; instead, it optimizes an information-seeking acquisition policy (with action masking to prevent redundant selections). We also present this data in log-scale in the SI, see Figure S5.
The candidate pool is heavily skewed to very low pressures (as seen in Figure 3), but the PPO-GPR agent still spreads selections across the pressure range while keeping a large share at the lowest bin, exactly where the heatmaps say errors are higher. Temperature is offered nearly uniformly, and the agent’s picks mirror that, which gives good coverage of the low-T regime where errors peak. Ethane/propane are on discrete grids; selections track availability with a mild emphasis on common propane levels (≈0.2–0.3), again overlapping the moderate-error regions. Overall, the agent does place many points in the challenging corner (low T, low–mid P, propane-rich), which is desirable for improving the surrogate; additional targeted sampling specifically around ∼220–260 K and ∼15–35 bar at higher propane could further reduce the remaining hot spots. We also present this data in the SI, see Figure S6.


Figure 3. Overlaid distributions of the full design pool (“All”, blue) versus PPO–GPR selections (“Selected”, orange) for IRMOF-1 under the ternary CH4/C2H6/C3H8 case.

The RMSE heatmap of the unlabeled data set in Figure 4 (pressure–temperature panels conditioned on ethane × propane ranges) shows the lowest errors over most mid–high pressures (≥35–40 bar) and moderate-to-high temperatures (≥280–300 K), largely independent of composition. By contrast, several panels develop a clear error “ridge” at low temperatures (≈200–240 K) and low–mid pressures (≈15–35 bar). The hotspot is most pronounced when propane is high (≈0.40–0.70) and ethane is low–medium (≈0.10–0.40): e.g., a bright patch near ∼225 K and ∼20 bar. In short, IRMOF-1 selectivity is hardest for the model under low-temperature, moderately compressed conditions, especially at propane-rich compositions. In Figure S10, we present the R2 heatmap, which shows behavior similar to the RMSE.


Figure 4. RMSE heatmaps for IRMOF-1 (ternary). Each panel shows the prediction error over pressure (bar) × temperature (K) for a fixed composition bin.

We compare the selectivity isotherms at selected temperatures (200, 300, and 400 K) across several molar compositions of the ternary mixtures. Figure 5 compares ternary mixture selectivity isotherms in the CuBTC MOF obtained from ground-truth GCMC simulations (solid blue) against the PPO-GPR predictions (dashed orange) across nine representative operating conditions spanning three temperatures (200, 300, and 400 K) and three distinct feed compositions per temperature. Overall, the PPO-GPR model reproduces the pressure dependence of selectivity with strong fidelity across the full range of conditions, capturing both the magnitude and the trend of the GCMC curves. The closest agreement is observed at intermediate-to-high pressures, where the predicted isotherms frequently overlap the GCMC results, particularly at 300–400 K. The primary deviations occur most notably at 200 K in the midpressure region.


Figure 5. Adsorption selectivity isotherms in CuBTC comparing the PPO-GPR predicted selectivity to the ground truth across three temperatures (T) and molar compositions (x). “x” lists the molar compositions of CH4, C2H6, and C3H8.

For IRMOF-1, Figure 6 compares selectivity isotherms from GCMC simulations (solid blue) with PPO-GPR predictions (dashed orange) for nine representative ternary mixture conditions spanning 200–400 K and multiple feed compositions. Across all panels, PPO-GPR reproduces the overall pressure dependence of selectivity and closely matches the GCMC values over much of the measured pressure range, including the gradual, monotonic increases observed at 300–400 K. The largest discrepancies occur at low pressures, particularly at 200 K, where GCMC selectivity changes abruptly at the onset of adsorption and exhibits small fluctuations; in these cases, PPO-GPR typically transitions more smoothly and may slightly overshoot or lag the GCMC curve before converging at higher pressures. At intermediate-to-high pressures, the agreement is strongest, with the PPO-GPR curves frequently overlapping the GCMC results and correctly capturing composition- and temperature-dependent trends. Overall, these results indicate that PPO-GPR provides an accurate surrogate for IRMOF-1 selectivity isotherms across diverse operating conditions, with residual error concentrated in the most sensitive low-pressure regime.


Figure 6. Adsorption selectivity isotherms in IRMOF-1 comparing the PPO-GPR predicted selectivity to the ground truth across three temperatures (T) and molar compositions (x). “x” lists the molar compositions of CH4, C2H6, and C3H8.

5.2. Performance on Quaternary Mixtures

Extending the mixture to include C4H10, the PPO agent continued to demonstrate robust performance in data selection, further enhancing the GPModel’s accuracy. Table 3 presents the performance metrics for quaternary mixtures.
Table 3. Performance Metrics of the RL Model for CuBTC and IRMOF-1 on Various Data Sets for the Quaternary Mixture^a

| data set | R2 | MAE | RMSE |
| --- | --- | --- | --- |
| CuBTC testing data set (quaternary mixture) | 0.981 | 0.002 | 0.009 |
| CuBTC unlabeled data set (quaternary mixture) | 0.956 | 0.004 | 0.013 |
| IRMOF-1 testing data set (quaternary mixture) | 0.980 | 0.002 | 0.010 |
| IRMOF-1 unlabeled data set (quaternary mixture) | 0.912 | 0.007 | 0.021 |

^a These results are for the selectivity of CH4 over C2H6, C3H8, and C4H10 in the MOFs.

The inclusion of C4H10 in the gas mixtures introduces additional complexity to the selectivity prediction task. Despite this increased complexity, the PPO-integrated AL framework maintained high predictive performance. In the case of CuBTC, the R2 score on the testing data set improved slightly to 0.981, while IRMOF-1 achieved an R2 score of 0.980. These results indicate that the framework remains effective even as the number of components in the gas mixture increases.
The unlabeled data sets for quaternary mixtures exhibited R2 scores of 0.956 for CuBTC and 0.912 for IRMOF-1. These scores reflect the model’s ability to generalize to more complex scenarios, where additional variables influence selectivity. The high R2 scores affirm the framework’s robustness and its capacity to handle multicomponent gas mixtures effectively.
Furthermore, the MAE and RMSE values across quaternary mixtures remained consistently low, reinforcing the model’s precision in predicting selectivity. The ability to maintain low error metrics despite the increased complexity of the gas mixtures underscores the strength of the AL approach in selecting data points that provide maximal information gain for the GPModel.
The trained GPModels were employed to predict the selectivity of CH4 over higher alkanes in unseen (unlabeled) data sets. The high R2 scores on these data sets indicate the model’s strong generalization capabilities, making it a reliable tool for predicting MOF selectivity in practical applications.
For this quaternary case, we present in figures the regions sampled by the RL agent and show heatmaps of the predictive model performance. To reiterate, the input features for the quaternary study were pressure, temperature, and the molar compositions of C2H6, C3H8, and C4H10; the molar composition of CH4 is not included as a feature in the GPModel. We also use the same grid over ethane (rows) and propane (columns) and render the P–T metric heatmap inside each cell. Because introducing a third composition axis would overfragment the data, butane is not independently binned; instead, we annotate each panel with the butane span present (min–max). This keeps panel occupancy balanced while still exposing composition-dependent performance.
Figure 7 shows how the PPO–GPR policy allocates samples in CuBTC when a fourth component (n-butane) is introduced. The extra compositional degree of freedom fragments the pool, so each discrete composition level is sparser than in the ternary grids. The policy reacts by keeping strong coverage at very low pressure, while maintaining a broader tail into intermediate pressures (≈10–60 bar) than in the ternary plots, ensuring at least some information from less populated pressure regions. Temperature again lies on a near-uniform grid (200–400 K), but with the sampling budget now spread over an expanded composition space, selections appear almost purely proportional to availability.


Figure 7. CuBTC, quaternary CH4/C2H6/C3H8/C4H10. Overlaid histograms of the design pool (“All”, blue) and PPO–GPR selections (“Selected”, orange).

Compositionally, the policy shows mild, systematic emphasis of midrange heavy-hydrocarbon levels. For propane, selections peak around propane composition ≈0.20–0.30, and for butane around ≈0.20–0.40, with de-emphasis of the butane extreme (≈0.60) and of very low ethane fractions outside the most common settings. This pattern differs from the ternary case, where composition closely mirrored availability: in the quaternary system the agent leans toward moderate propane/butane loadings, conditions that typically sharpen contrast with methane without saturating pores, while it preserves broad coverage elsewhere. Overall, pressure remains the main lever, but the fourth component induces a noticeable midcomposition preference and a wider pressure tail to hedge against the increased sparsity of the design space. In Figure S7, we present the same data in log-scale.
The RMSE heatmap in Figure 8, over pressure–temperature slices stratified by molar composition, is mostly dark (low error), with isolated “warm” patches. Those higher-error pockets concentrate at low temperature (≈210–250 K) and low–mid pressure (≈15–35 bar), and they become most pronounced when the C2H6 fraction is higher (≈0.28–0.40) together with moderate–high C3H8 (≈0.30–0.35). Depending on the panel, the error either persists or amplifies when butane is low–moderate (≈0.20–0.40); when butane is higher (≈0.35–0.60) but ethane/propane are low, the model is steadier. In contrast, the model is consistently accurate at higher temperatures (≥280–300 K) and higher pressures (≥50 bar) across nearly all composition ranges. We also show the R2 heatmap in the SI (Figure S11).


Figure 8. RMSE heatmaps for CuBTC (quaternary). Each panel shows the prediction error over pressure (bar) × temperature (K) for a fixed composition bin of C2H6/C3H8/C4H10.

Figure 9 summarizes how the PPO–GPR policy distributes samples for IRMOF-1 across pressure, temperature, and the C2H6/C3H8/C4H10 mole fractions. The design pool is again dominated by very low pressures, and the agent allocates the largest share of selections there; however, compared to CuBTC, IRMOF-1 shows a flatter tail into intermediate pressures, indicating the policy maintains steadier coverage up to tens of bar rather than concentrating only at the extreme low end. Temperature sampling is essentially proportional to the uniform grid (200–400 K). This data is also presented in log-scale in the SI, see Figure S8.


Figure 9. IRMOF-1, quaternary CH4/C2H6/C3H8/C4H10. Overlaid histograms of the design pool (“All”, blue) and PPO–GPR selections (“Selected”, orange).

For the quaternary composition, IRMOF-1 exhibits a clear midcomposition emphasis among the heavy components: selections are most frequent at moderate propane (≈0.20–0.30) and butane (≈0.20–0.40) levels, with fewer choices at the butane extreme (≈0.60). Ethane is sampled broadly but with a slight preference for commonly occurring mid values. This pattern suggests that, in IRMOF-1, the policy balances the pressure lever with targeted exploration of mid heavy-hydrocarbon loadings, likely where contrasts with methane are informative without entering highly saturated regimes.
The P–T RMSE error maps sliced by (C2H6/C3H8/C4H10) in Figure 10 show a consistent pattern: errors are generally small across most of the grid, with hotspots at roughly P ≈ 15–35 bar and T ≈ 230–300 K, where loadings change steeply. These hotspots intensify when the mixture is richer in the heavier alkanes (higher C3H8 and especially moderate C4H10 bins), and they diminish at higher temperatures (>300 K) or at higher pressures (>50–60 bar) where behavior is smoother. In other words, the model performs best in the high-P/high-T regions and in light-alkane-lean mixtures, and it is most challenged in the mid-P, mid-T regime where selectivity is most sensitive to composition. We also present the R2 heat maps for this system in Figure S12.


Figure 10. RMSE heatmaps for IRMOF-1 (quaternary). Each panel shows the prediction error over pressure (bar) × temperature (K) for a fixed composition bin of C2H6/C3H8/C4H10.

For CuBTC under quaternary mixtures, Figure 11 presents selectivity isotherms predicted by PPO-GPR (orange) against GCMC reference values (blue) across nine representative conditions spanning 200–400 K and distinct feed compositions. Across all panels, PPO-GPR closely reproduces both the magnitude and pressure dependence of the GCMC selectivity, including the rapid rise at low pressure followed by a near-plateau at 200 K and the more gradual, monotonic increases observed at 300–400 K. The largest discrepancies are confined to the low-pressure onset region, most visible at 200 K where selectivity is highly sensitive to small differences in competitive uptake and the GCMC curves show sharper transitions and local fluctuations; in these cases, PPO-GPR tends to smooth the transition and may exhibit mild overshoot or lag before converging. At intermediate and high pressures, the agreement is strong, with the PPO-GPR curves frequently overlapping the GCMC data and capturing composition-dependent differences in both plateau levels and curvature. Overall, the figure demonstrates that PPO-GPR provides an accurate surrogate for CuBTC quaternary-mixture selectivity isotherms across a broad range of thermodynamic conditions, while maintaining the correct trends with temperature, composition, and pressure.


Figure 11. Adsorption selectivity isotherms in CuBTC comparing the PPO-GPR predicted selectivity to the ground truth across three temperatures (T) and molar compositions (x). “x” lists the molar compositions of CH4, C2H6, C3H8, and C4H10.

For IRMOF-1 under quaternary mixtures, Figure 12 compares selectivity isotherms obtained from GCMC simulations (blue) with PPO-GPR predictions (orange) across nine representative conditions spanning 200–400 K and three distinct feed compositions per temperature. Overall, PPO-GPR closely reproduces the GCMC selectivity profiles across the full pressure range, capturing both the absolute selectivity levels and their pressure dependence. At 200 K, the model accurately recovers the rapid low-pressure rise and subsequent near-plateau behavior, with only minor differences in the onset region where competitive adsorption leads to sharp transitions and small fluctuations in the GCMC data. At 300 K, PPO-GPR matches the step-like increase in selectivity observed near intermediate pressures and remains consistent with GCMC at higher pressures. At 400 K, where selectivity varies more smoothly and often exhibits curvature (including an initial drop followed by a gradual increase with pressure), the PPO-GPR curves nearly overlap the GCMC results, indicating strong agreement in both trend and magnitude. Taken together, these results demonstrate that PPO-GPR provides a reliable surrogate for IRMOF-1 quaternary selectivity isotherms across diverse temperatures and compositions, with residual discrepancies primarily confined to the most sensitive low-pressure onset regime.


Figure 12. Adsorption selectivity isotherms in IRMOF-1 comparing the PPO-GPR predicted selectivity to the ground truth across three temperatures (T) and molar compositions (x). “x” lists the molar compositions of CH4, C2H6, C3H8, and C4H10.

One of the features of this PPO-integrated AL framework is its ability to achieve high predictive performance with a reduced number of training data points with respect to the full GCMC data points, as seen in Table 4 for both the cases of the ternary and quaternary mixtures. PPO-GPR achieves 77–86% data savings across all studies (CuBTC/IRMOF-1; ternary/quaternary). In practical terms, the policy typically selects only ∼14–23% of the candidate pool (only the prior + test data) to achieve comparable model fidelity. Importantly, for benchmarking fairness, all labels used during the RL loop were precomputed: the prior set initializes training, and only the test set is queried on demand, but both were simulated in advance. Under this setup, the PPO-GPR optimization itself adds only ∼5 h on 16 CPU cores, whereas generating the complete (prior + test + unlabeled) ternary and quaternary data sets required roughly 48 and 53 days, respectively, when run as one RASPA (67) job per data point, on a single CPU node.
Table 4. Performance in Terms of Data Savings of the PPO-GPModel Compared to the Total GCMC Data Simulations

| case | % data saving with respect to full GCMC |
| --- | --- |
| CuBTC (ternary mixture) | 86 |
| IRMOF-1 (ternary mixture) | 81 |
| CuBTC (quaternary mixture) | 82 |
| IRMOF-1 (quaternary mixture) | 77 |
Crucially, the efficiency gain is not only about fewer points, but also better points. Traditional exhaustive campaigns collect large, indiscriminate data sets to guarantee coverage, which is costly. In contrast, our AL policy targets samples with high expected information gain, focusing queries on the P–T–composition regimes where selectivity varies rapidly and avoiding redundant regions. This targeted acquisition improves the quality of the training set: each newly “revealed” label contributes meaningfully, yielding models that are robust and generalizable across operating conditions, as shown in the previous sections.

6. Conclusion


This study presents a novel AL framework that integrates PPO with GPR to predict the selectivity of methane (CH4) over higher alkanes in MOFs. By leveraging RL to strategically select the most informative data points, the framework significantly enhances the predictive accuracy of the GPModel while reducing the volume of required training data. The successful application of this methodology to CuBTC and IRMOF-1 in both ternary and quaternary gas mixtures underscores its potential for accelerating material discovery and optimization in gas separation technologies.
The comprehensive results demonstrate that the PPO-integrated AL framework consistently achieves high R2 scores and low error metrics across both testing and unseen data sets, validating the model’s reliability and generalization capabilities. The framework’s ability to maintain performance even as the complexity of the gas mixtures increases highlights its robustness and adaptability.
Furthermore, the framework’s data efficiency offers substantial practical advantages. By minimizing the number of required data points, it reduces experimental and computational costs, making it an attractive option for industrial applications where resources may be constrained. This efficiency does not come at the expense of performance; instead, the strategic selection of data points ensures that the model remains highly accurate and reliable.
The integration of PPO with GPR exemplifies the powerful synergy between RL and regression models in addressing complex predictive tasks in material science. This approach not only streamlines the data collection process but also enhances the depth and breadth of the model’s understanding, paving the way for more sophisticated and efficient predictive frameworks.
In conclusion, the PPO-integrated AL framework represents a significant advancement in the field of MOF selectivity prediction. Its combination of high predictive accuracy, data efficiency, and scalability makes it an asset for advancing gas separation technologies, contributing to more sustainable and efficient energy solutions.

Data Availability


All codes and data can be found via GitHub here: https://github.com/theOsaroJ/PPO_GPR

Supporting Information


The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsengineeringau.5c00122.

  • Mixture adsorption isotherms for CuBTC and IRMOF-1 under ternary and quaternary conditions (Figures S1–S4); Gaussian process kernel testing and comparison (RQ, Matérn, and composite RQ+Matérn) with performance metrics (Table S1); log-scale sampling/distribution plots comparing the full design pool versus PPO–GPR-selected prior for CuBTC and IRMOF-1, ternary and quaternary cases (Figures S5–S8); R2 heatmaps over pressure–temperature grids for CuBTC and IRMOF-1, ternary and quaternary cases (Figures S9–S12) (PDF)


Author Information

The authors declare no competing financial interest.

Acknowledgments

E.O. thanks the Lucy Family Institute for Data and Society and The Patrick and Jana Eilers Graduate Student Fellowship for Energy Related Research at the University of Notre Dame. Y.J.C. gratefully acknowledges NSF CAREER Award No. CBET-2143346 and NSF Award No. CBET-2347040. The authors also thank the Center for Research Computing at the University of Notre Dame for computational resources.

References

This article references 72 other publications.

  1. Ji, Z.; Wang, H.; Canossa, S.; Wuttke, S.; Yaghi, O. M. Pore Chemistry of Metal–Organic Frameworks. Adv. Funct. Mater. 2020, 30, 2000238. DOI: 10.1002/adfm.202000238
  2. Furukawa, H.; Cordova, K. E.; O’Keeffe, M.; Yaghi, O. M. The Chemistry and Applications of Metal-Organic Frameworks. Science 2013, 341, 1230444. DOI: 10.1126/science.1230444
  3. Kaskel, S. Progress in Advanced Characterization of MOFs. In The Chemistry of Metal-Organic Frameworks: Synthesis, Characterization, and Applications; Wiley, 2016; pp 575–822.
  4. James, S. L. Metal-organic frameworks. Chem. Soc. Rev. 2003, 32, 276. DOI: 10.1039/b200393g
  5. Jones, C. W. Metal-Organic Frameworks and Covalent Organic Frameworks: Emerging Advances and Applications. JACS Au 2022, 2, 1504–1505. DOI: 10.1021/jacsau.2c00376
  6. Langmi, H. W.; Ren, J.; North, B.; Mathe, M.; Bessarabov, D. Hydrogen storage in metal-organic frameworks: A review. Electrochim. Acta 2014, 128, 368–392. DOI: 10.1016/j.electacta.2013.10.190
  7. Baumann, A. E.; Burns, D. A.; Liu, B.; Thoi, V. S. Metal-organic framework functionalization and design strategies for advanced electrochemical energy storage devices. Commun. Chem. 2019, 2, 114. DOI: 10.1038/s42004-019-0184-6
  8. Mao, H.; Tang, J.; Day, G. S. A scalable solid-state nanoporous network with atomic-level interaction design for carbon dioxide capture. Sci. Adv. 2022, 8, abo6849. DOI: 10.1126/sciadv.abo6849
  9. Wang, L.; Huang, H.; Zhang, X. Designed metal-organic frameworks with potential for multi-component hydrocarbon separation. Coord. Chem. Rev. 2023, 484, 215111. DOI: 10.1016/j.ccr.2023.215111
  10. Zhao, G.; Brabson, L. M.; Chheda, S. CoRE MOF DB: A curated experimental metal-organic framework database with machine-learned properties for integrated material-process screening. Matter 2025, 8, 102140. DOI: 10.1016/j.matt.2025.102140
  11. Chung, Y. G.; Haldoupis, E.; Bucior, B. J. Advances, Updates, and Analytics for the Computation-Ready, Experimental Metal–Organic Framework Database: CoRE MOF 2019. J. Chem. Eng. Data 2019, 64, 5985–5998. DOI: 10.1021/acs.jced.9b00835
  12. Colón, Y. J.; Snurr, R. Q. High-throughput computational screening of metal-organic frameworks. Chem. Soc. Rev. 2014, 43, 5735–5749. DOI: 10.1039/C4CS00070F
  13. Osaro, E.; Colón, Y. J. Intelligent screening of porous materials: A review of active-learning approaches in MOF research. Chem. Phys. Rev. 2025, 6, 041307. DOI: 10.1063/5.0295283
  14. Wang, Z.; Zhou, T.; Sundmacher, K. Interpretable machine learning for accelerating the discovery of metal-organic frameworks for ethane/ethylene separation. Chem. Eng. J. 2022, 444, 136651. DOI: 10.1016/j.cej.2022.136651
  15. Lookman, T.; Balachandran, P. V.; Xue, D.; Yuan, R. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. npj Comput. Mater. 2019, 5, 21. DOI: 10.1038/s41524-019-0153-8
  16. Cohn, D. A.; Ghahramani, Z.; Jordan, M. I. Active Learning with Statistical Models. J. Artif. Intell. Res. 1996, 4, 129–145. DOI: 10.1613/jair.295
  17. Gubaev, K.; Podryabinkin, E. V.; Shapeev, A. V. Machine learning of molecular properties: Locality and active learning. J. Chem. Phys. 2018, 148, 241727. DOI: 10.1063/1.5005095
  18. Ren, P.; Xiao, Y.; Chang, X. A Survey of Deep Active Learning. ACM Comput. Surv. 2022, 54, 1–40. DOI: 10.1145/3472291
  19. Osaro, E.; Mukherjee, K.; Colón, Y. J. Active Learning for Adsorption Simulations: Evaluation, Criteria Analysis, and Recommendations for Metal–Organic Frameworks. Ind. Eng. Chem. Res. 2023, 62, 13009–13024. DOI: 10.1021/acs.iecr.3c01589
  20. Mukherjee, K.; Osaro, E.; Colón, Y. J. Active learning for efficient navigation of multi-component gas adsorption landscapes in a MOF. Digital Discovery 2023, 2, 1506–1521. DOI: 10.1039/D3DD00106G
  21. Osaro, E.; Fajardo-Rojas, F.; Cooper, G. M.; Gómez-Gualdrón, D.; Colón, Y. J. Active learning of alchemical adsorption simulations; towards a universal adsorption model. Chem. Sci. 2024, 15, 17671–17684. DOI: 10.1039/D4SC02156H
  22. Osaro, E.; LaCapra, M.; Colón, Y. J. Harmonizing Adsorption and Diffusion in Active Learning Campaigns of Gas Separations in a MOF. J. Phys. Chem. C 2025, 129, 9877–9891. DOI: 10.1021/acs.jpcc.5c00922
  23. Osaro, E.; Bakare, A.; Colón, Y. J. Multi-method material selection for adsorption using Bayesian approaches. Commun. Mater. 2025, 6, 215. DOI: 10.1038/s43246-025-00933-w
  24. Gantzler, N.; Deshwal, A.; Doppa, J. R.; Simon, C. M. Multi-fidelity Bayesian optimization of covalent organic frameworks for xenon/krypton separations. Digital Discovery 2023, 2, 1937–1956. DOI: 10.1039/D3DD00117B
  25. He, G.-F.; Zhang, P.; Yin, Z.-Y. Active learning inspired multi-fidelity probabilistic modelling of geomaterial property. Comput. Methods Appl. Mech. Eng. 2024, 432, 117373. DOI: 10.1016/j.cma.2024.117373
  26. Hernandez-Garcia, A.; Saxena, N.; Jain, M.; Liu, C.-H.; Bengio, Y. Multi-Fidelity Active Learning with GFlowNets. 2024.
  27. Wang, A.; Liang, H.; McDannald, A.; Takeuchi, I.; Kusne, A. G. Benchmarking active learning strategies for materials optimization and discovery. Oxford Open Mater. Sci. 2022, 2, itac006. DOI: 10.1093/oxfmat/itac006
  28. Kusne, A. G.; Yu, H.; Wu, C. On-the-fly closed-loop materials discovery via Bayesian active learning. Nat. Commun. 2020, 11, 5966. DOI: 10.1038/s41467-020-19597-w
  29. Deringer, V. L.; Bartók, A. P.; Bernstein, N. Gaussian Process Regression for Materials and Molecules. Chem. Rev. 2021, 121, 10073–10141. DOI: 10.1021/acs.chemrev.1c00022
  30. Rasmussen, C. E. Gaussian Processes in machine learning. In Lecture Notes in Computer Science, Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; Springer, 2004; Vol. 3176, pp 63–71. DOI: 10.1007/978-3-540-28650-9_4
  31. Rasmussen, C. E.; Williams, C. K. I. Gaussian Processes for Machine Learning; The MIT Press, 2005. DOI: 10.7551/mitpress/3206.001.0001
  32. Hensman, J.; Fusi, N.; Lawrence, N. D. Gaussian Processes for Big Data. 2013. https://arxiv.org/pdf/1309.6835
  33. Li, Y. Deep Reinforcement Learning: An Overview. 2017. https://arxiv.org/abs/1701.07274
  34. Sui, F.; Guo, R.; Zhang, Z.; Gu, G. X.; Lin, L. Deep Reinforcement Learning for Digital Materials Design. ACS Mater. Lett. 2021, 3, 1433–1439. DOI: 10.1021/acsmaterialslett.1c00390
  35. Sutton, R. S. Introduction: The Challenge of Reinforcement Learning. In Reinforcement Learning; Springer US: Boston, MA, 1992; pp 1–3. DOI: 10.1007/978-1-4615-3618-5_1
  36. Kaelbling, L. P.; Littman, M. L.; Moore, A. W. Reinforcement Learning: A Survey. J. Artif. Intell. Res. 1996, 4, 237–285. DOI: 10.1613/jair.301
  37. Peters, M.; Ketter, W.; Saar-Tsechansky, M.; Collins, J. A reinforcement learning approach to autonomous decision-making in smart electricity markets. Mach. Learn. 2013, 92, 5–39. DOI: 10.1007/s10994-013-5340-0
  38. Osaro, E.; Colón, Y. J. Optimizing the prediction of adsorption in metal–organic frameworks leveraging Q-learning. AIChE J. 2024, 70, 18611. DOI: 10.1002/aic.18611
  39. Park, H.; Majumdar, S.; Zhang, X.; Kim, J.; Smit, B. Inverse design of metal–organic frameworks for direct air capture of CO2 via deep reinforcement learning. Digital Discovery 2024, 3, 728. DOI: 10.1039/D4DD00010B
  40. Zhuang, Z.; Lei, K.; Liu, J.; Wang, D.; Guo, Y. Behavior Proximal Policy Optimization. 2023.
  41. Gu, Y.; Cheng, Y.; Chen, C. L. P.; Wang, X. Proximal Policy Optimization With Policy Feedback. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 4600–4610. DOI: 10.1109/TSMC.2021.3098451
  42. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. 2017.
  43. Kostrikov, I.; Nair, A.; Levine, S. Offline Reinforcement Learning with Implicit Q-Learning. 2021. https://arxiv.org/abs/2110.06169
  44. Tan, F.; Yan, P.; Guan, X. Deep Reinforcement Learning: From Q-Learning to Deep Q-Learning. In Neural Information Processing; Springer, 2017; pp 475–483. DOI: 10.1007/978-3-319-70093-9_50
  45. Jang, B.; Kim, M.; Harerimana, G.; Kim, J. W. Q-Learning Algorithms: A Comprehensive Classification and Applications. IEEE Access 2019, 7, 133653–133667. DOI: 10.1109/ACCESS.2019.2941229
  46. Clifton, J.; Laber, E. Q-Learning: Theory and Applications. Annu. Rev. Stat. Appl. 2020, 7, 279–301. DOI: 10.1146/annurev-statistics-031219-041220
  47. Watkins, C. J. C. H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. DOI: 10.1007/BF00992698
  48. Li, S. E. Deep Reinforcement Learning. In Reinforcement Learning for Sequential Decision and Optimal Control; Springer Nature Singapore: Singapore, 2023; pp 365–402. DOI: 10.1007/978-981-19-7784-8_10
  49. Sumiea, E. H.; Abdulkadir, S. J.; Alhussian, H. S. Deep deterministic policy gradient algorithm: A systematic review. Heliyon 2024, 10, e30697. DOI: 10.1016/j.heliyon.2024.e30697
  50. Li, S.; Wu, Y.; Cui, X. Robust Multi-Agent Reinforcement Learning via Minimax Deep Deterministic Policy Gradient. Proc. AAAI Conf. Artif. Intell. 2019, 33, 4213–4220. DOI: 10.1609/aaai.v33i01.33014213
  51. Tan, H. Reinforcement Learning with Deep Deterministic Policy Gradient. In 2021 International Conference on Artificial Intelligence, Big Data and Algorithms (CAIBDA); IEEE, 2021; pp 82–85. DOI: 10.1109/CAIBDA53561.2021.00025
  52. Zhang, J.; Zhang, Z.; Han, S.; Lü, S. Proximal policy optimization via enhanced exploration efficiency. Inf. Sci. 2022, 609, 750–765. DOI: 10.1016/j.ins.2022.07.111
  53. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. 2017. https://arxiv.org/abs/1707.06347
  54. Zhong, C.; Lu, Z.; Gursoy, M. C.; Velipasalar, S. A Deep Actor-Critic Reinforcement Learning Framework for Dynamic Multichannel Access. IEEE Trans. Cogn. Commun. Netw. 2019, 5, 1125–1139. DOI: 10.1109/TCCN.2019.2952909
  55. Gruslys, A.; Dabney, W.; Azar, M. G. The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning. 2017. https://arxiv.org/abs/1704.04651
  56. Chen, R.; Goldberg, J. H. Actor-critic reinforcement learning in the songbird. Curr. Opin. Neurobiol. 2020, 65, 1–9. DOI: 10.1016/j.conb.2020.08.005
  57. Han, M.; Zhang, L.; Wang, J.; Pan, W. Actor-Critic Reinforcement Learning for Control With Stability Guarantee. IEEE Robot. Autom. Lett. 2020, 5, 6217–6224. DOI: 10.1109/LRA.2020.3011351
  58. Grondman, I.; Busoniu, L.; Lopes, G. A. D.; Babuska, R. A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients. IEEE Trans. Syst., Man, Cybern. 2012, 42, 1291–1307. DOI: 10.1109/TSMCC.2012.2218595
  59. Williams, R. J.; Peng, J. Function Optimization using Connectionist Reinforcement Learning Algorithms. Connect. Sci. 1991, 3, 241–268. DOI: 10.1080/09540099108946587
  60. Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256.
  61. Zhang, J.; Kim, J.; O’Donoghue, B.; Boyd, S. Sample Efficient Reinforcement Learning with REINFORCE. Proc. AAAI Conf. Artif. Intell. 2021, 35, 10887–10895. DOI: 10.1609/aaai.v35i12.17300
  62. Brockman, G.; Cheung, V.; Pettersson, L. OpenAI Gym. 2016. https://arxiv.org/abs/1606.01540
  63. Towers, M.; Kwiatkowski, A.; Terry, J. Gymnasium: A Standard Interface for Reinforcement Learning Environments. 2024. https://arxiv.org/abs/2407.17032
  64. Raffin, A.; Hill, A.; Gleave, A. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8.
  65. Rappé, A. K.; Casewit, C. J.; Colwell, K. S.; Goddard, W. A.; Skiff, W. M. UFF, a Full Periodic Table Force Field for Molecular Mechanics and Molecular Dynamics Simulations. J. Am. Chem. Soc. 1992, 114, 10024–10035. DOI: 10.1021/ja00051a040
  66. Martin, M. G.; Siepmann, J. I. Transferable Potentials for Phase Equilibria. 1. United-Atom Description of n-Alkanes. J. Phys. Chem. B 1998, 102, 2569–2577. DOI: 10.1021/jp972543+
  67. Dubbeldam, D.; Calero, S.; Ellis, D. E.; Snurr, R. Q. RASPA: Molecular simulation software for adsorption and diffusion in flexible nanoporous materials. Mol. Simul. 2016, 42, 81–101. DOI: 10.1080/08927022.2015.1010082
  68. Gheytanzadeh, M.; Baghban, A.; Habibzadeh, S. Towards estimation of CO2 adsorption on highly porous MOF-based adsorbents using gaussian process regression approach. Sci. Rep. 2021, 11, 15710. DOI: 10.1038/s41598-021-95246-6
  69. Dudek, A.; Baranowski, J. Gaussian Processes for Signal Processing and Representation in Control Engineering. Appl. Sci. 2022, 12, 4946. DOI: 10.3390/app12104946
  70. Wilson, A.; Adams, R. Gaussian process kernels for pattern discovery and extrapolation. In 30th International Conference on Machine Learning, ICML 2013; ICML, 2013; Vol. 28, pp 2104–2112.
  71. Melkumyan, A.; Ramos, F. Multi-kernel Gaussian Processes. In IJCAI International Joint Conference on Artificial Intelligence, 2011; pp 1408–1413. DOI: 10.5591/978-1-57735-516-8/IJCAI11-238
  72. Deshwal, A.; Doppa, J. R. Combining Latent Space and Structured Kernels for Bayesian Optimization over Combinatorial Spaces. In Advances in Neural Information Processing Systems; Curran Associates, Inc., 2021; Vol. 10, pp 8185–8200.


