[
    {
        "id": 1,
        "chunk": "# Research AI in Chemical Engineering - Review",
        "category": " Introduction"
    },
    {
        "id": 2,
        "chunk": "# Toward Next-Generation Heterogeneous Catalysts: Empowering Surface Reactivity Prediction with Machine Learning  \n\nXinyan Liu \\*, Hong-Jie Peng  \n\nInstitute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China",
        "category": " Abstract"
    },
    {
        "id": 3,
        "chunk": "# ARTICLE INFO",
        "category": " Introduction"
    },
    {
        "id": 4,
        "chunk": "# ABSTRACT  \n\nArticle history:   \nReceived 5 January 2023   \nRevised 27 May 2023   \nAccepted 17 July 2023   \nAvailable online 5 January 2024  \n\nKeywords:   \nMachine learning   \nHeterogeneous catalysis   \nChemisorption   \nTheoretical simulation   \nMaterials design   \nHigh-throughput screening  \n\nHeterogeneous catalysis remains at the core of various bulk chemical manufacturing and energy conversion processes, and its revolution necessitates the hunt for new materials with ideal catalytic activities and economic feasibility. Computational high-throughput screening presents a viable solution to this challenge, as machine learning (ML) has demonstrated its great potential in accelerating such processes by providing satisfactory estimations of surface reactivity with relatively low-cost information. This review focuses on recent progress in applying ML in adsorption energy prediction, which predominantly quantifies the catalytic potential of a solid catalyst. ML models that leverage inputs from different categories and exhibit various levels of complexity are classified and discussed. At the end of the review, an outlook on the current challenges and future opportunities of ML-assisted catalyst screening is supplied. We believe that this review summarizes major achievements in accelerating catalyst discovery through ML and can inspire researchers to further devise novel strategies to accelerate materials design and, ultimately, reshape the chemical industry and energy landscape.  \n\n© 2024 THE AUTHORS. Published by Elsevier LTD on behalf of Chinese Academy of Engineering and Higher Education Press Limited Company. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).",
        "category": " Abstract"
    },
    {
        "id": 5,
        "chunk": "# 1. Introduction  \n\nPositioned at the heart of the chemical industry, catalytic reactions are involved in the processes of over $80\\%$ of all manufactured products [1]. Among various catalysis scenarios, heterogeneous catalysis using solid catalysts receives exceptional attention due to its high scalability for bulk manufacturing and its outstanding advantages in product separation and catalyst recycling [2,3]. Contemporary industrialized heterogeneous catalytic processes, such as methane reforming [4], ammonia synthesis [5], hydrocarbon cracking [6], and a variety of selective hydrogenation/dehydrogenation reactions [7–10], are mostly thermochemical and usually require high-temperature and/or high-pressure conditions to shift the chemical equilibria and modulate the reaction rates. Moreover, these conventional processes rely heavily on the use of fossil resources as reactants and for energy inputs, as well as on precious metals (e.g., Pt, Pd, Ru, and Rh) as catalysts, thereby deviating from the goal of global sustainability [11,12]. Therefore, it is imperative to design new catalytic reactions and processes that are more energetically efficient, environmentally friendly, and economically favorable. Along with the continuous advancement of human civilization, the journey to hunt for such processes and corresponding key materials never ceases.  \n\nThe rapid development of renewable energy technology, such as photovoltaics, has spurred this journey by enabling large-scale and low-cost “green” electricity generation [11,13]. To better utilize surplus electricity, one of the most well-known initiatives is to replace fossil-fuel-derived “grey” hydrogen with “green” hydrogen, the production of which relies on key technology such as electrochemical water splitting [14,15]. 
Similar concepts of green electricity-to-chemical energy conversion have also been implemented in carbon dioxide reduction reactions ($\\mathrm{CO_{2}RR}$) [16–22] and ammonia electrosynthesis [23–26]. In turn, renewably synthesized hydrogen, carbon-containing fuels, and ammonia are attractive feeds for fuel cells/engines or raw materials for the chemical industry, helping to close the fossil-resource-free loops of carbon and nitrogen. The rational design of highly efficient and earth-abundant catalytic materials plays a central role in achieving this goal, prior to subsequent reaction engineering and scaling up. Unfortunately, current catalytic materials are still far from satisfactory in terms of efficiency and/or scalability [27–30]. Innovations in next-generation catalyst design are therefore in high demand.  \n\nThe design, optimization, and further development of novel catalytic materials traditionally rely on Edisonian trial-and-error processes (Fig. 1, Scheme 1). However, the efficiency of such processes is limited, as it usually takes decades to discover and commercialize a new catalyst. Furthermore, as it is impossible to exhaust all, or even the majority, of the abundant candidate space of both compositions and structures, a more efficient methodology to navigate through this space remains indispensable. In fact, the flourishing of computational methods and theoretical modeling (e.g., density functional theory (DFT) calculations) has enabled another path that can replace tedious experimental exploration (Fig. 1, Scheme 2) [31–34]. It has been revealed that the surface reaction rates on a solid catalyst can be correlated to the surface bond energies of adsorbed species present in the reaction network (including transition states (TSs)), which are accessible through state-of-the-art computations [34–36]. 
Thus, it is possible to conduct “virtual” experiments on computers to assess the catalytic activity of a material by calculating the relevant energies. When reaction rates are reformulated as functions of only one or two descriptor(s), the high-dimensional problem of searching for candidates with desirable catalytic performance can be further collapsed to the hunt for catalysts exhibiting optimal descriptor values, where the descriptor is often a physical or chemical property that can be calculated or measured [36]. This so-called descriptor-based approach opens up new possibilities for the high-throughput computational screening of undiscovered catalysts. Among various electronic and geometric descriptors, the adsorption energies of surface species are frequently adopted, as (1) they can be obtained via computations and (2) the calculation results can be verified through accurate calorimetric experiments [37]. More importantly, the adsorption-energy-based activity map can be viewed as a quantitative implementation of the classical Sabatier principle, providing a rational understanding of trends in heterogeneous catalysis [38]. Although establishing activity maps helps to expedite the discovery of novel catalysts, acquiring energetic descriptors through modeling is still computationally demanding on a large scale, especially considering the enormous compositional and/or structural heterogeneities when searching for multicomponent and/or multisite catalytic materials. To explore the vast material space for heterogeneous catalyst screening, it is therefore vital to develop ways to obtain surface adsorption strengths more efficiently and effectively.  \n\n![](images/006d64419e29d97f7818d9d68cfb233aa868a769c1dd455a0bf8e198dcca4d3c.jpg)  \nFig. 1. Schematic illustration of three common schemes for catalyst screening. 
Scheme 1 refers to a conventional Edisonian trial-and-error process, where potential candidates (numbering $N_{\\mathrm{E}}$) are selected for synthesis, characterization, and performance evaluation. Based on the results, new candidates may need to be reselected from the material space. Scheme 2 represents a conventional computational descriptor-based approach, where the surface reactivities of more materials (numbering $N_{\\mathrm{T}}$, where $N_{\\mathrm{T}}$ may be orders of magnitude larger than $N_{\\mathrm{E}}$) are evaluated through simulation. Potential candidates are then screened by combining these results with an activity map established from theoretical trends such as scaling relations. Compared with Scheme 1, far fewer potential candidates are subjected to experimental validation. Scheme 3 refers to a machine learning (ML)-aided approach, where the large body of simulations in Scheme 2 is replaced with predictions from ML models. The resulting insights can be used to further improve the model and the theoretical understanding. Dashed arrows in the figure represent processes that are time-consuming or resource-intensive, while solid arrows refer to those that are relatively fast and cheap.  \n\nOver the past decades, the rapid development of computer science and artificial intelligence (AI), along with the establishment of comprehensive databases, has enabled numerous possibilities for applying AI in chemistry and materials sciences for experiments, characterizations, and modeling [39–54]. Incorporating advanced machine learning (ML) models in catalyst design and screening makes it possible to directly predict the surface reactivity from fewer or less computationally expensive properties, with huge potential for improvements in cost and accuracy (Fig. 1, Scheme 3). Consequently, the acceleration of the entire screening process can be envisioned. 
In addition, unveiling hidden patterns and correlations through ML offers alternative opportunities to further our physical understanding of catalytic systems and obtain fresh perspectives on catalyst design [55]. In this context, the application of ML for adsorption energy prediction and high-throughput catalyst screening, while still in its infancy, has already demonstrated its huge potential in enabling a paradigm shift in the discovery of new materials for emerging catalytic processes. Thus, summarizing the latest advances in ML-empowered high-throughput catalyst screening and proposing promising directions remains beneficial and necessary for future research.  \n\nUnlike the existing reviews covering the many aspects of ML application in catalysis research [56–66], this review has a particular focus on the data-driven prediction of adsorption energies, as the surface reactivity dominantly quantifies the catalytic potential of a solid catalyst. In addition, we highlight efforts to combine ML models with experimental exploration. In this review, we first categorize ML models according to the inputs adopted, namely ab initio or non-ab initio features, and discuss related research progress in two consecutive sections. In each section, works targeting systems with different levels of complexity are summarized, along with the physical understandings these works might supply. Next, ML-guided experimental catalyst discovery is showcased, based on either the ML model’s predictive power or interpretable insights. Finally, we provide an outlook on current challenges and future opportunities in ML-assisted catalyst screening. As this is a focus review, we do not discuss the general principles and common models of ML, or the application of ML in other aspects of catalysis research such as high-throughput experimentation and ML-accelerated theoretical modeling; detailed information on these topics is already available in other reviews [57–61].",
        "category": " Introduction"
    },
    {
        "id": 6,
        "chunk": "# 2. ML with ab initio features",
        "category": " Introduction"
    },
    {
        "id": 7,
        "chunk": "# 2.1. Features based on calculated adsorption energies",
        "category": " Materials and methods"
    },
    {
        "id": 8,
        "chunk": "# 2.1.1. Adsorption scaling relations  \n\nAs discussed in the previous section, the idea of high-throughput screening can be realized through the availability of material descriptors (e.g., adsorption energy and electronic structure) from ab initio calculations and the proposal of the descriptor-based approach. The main idea of this approach is the dimension reduction brought by so-called scaling relations, which project reaction energetics onto a few properties [35,36,38]. In brief, it has been found that the adsorption energies of different adsorbates that bind to the surface through the same atom(s) tend to scale with each other, usually in a linear fashion. The foundation of this scaling relation lies in the d-band model initially proposed by Hammer and Nørskov [67] to explain why gold is the noblest of the transition metals, which is now a well-established and widely recognized quantitative theory for catalysis after years of continuous research and development [68]. In the d-band theory, the chemisorption abilities of transition metal surfaces can be well described by the energy distribution of the d-band of the corresponding metal surfaces, which is mostly quantified using the average energy of the band, namely the d-band center. The adsorption of similar species on these surfaces therefore tends to correlate when the species’ adsorption energies rely only on the adsorbate valence and metallic d-band properties. Given the ubiquity of chemical bond formation between transition metal sites and adsorbates in heterogeneous catalysis, such a relation has been found to hold across a broad range of materials.  \n\nIn a foundational work, Abild-Pedersen et al. [69] found that the adsorption energies of hydrogen-containing molecules, $\\mathrm{AH}_{x}$, correlated linearly with the adsorption energy of atom A (A = C, N, O, and S). 
The mean absolute error (MAE) was reported to be only $0.13\\ \\mathrm{eV}$ when such linear relations were applied to describe the adsorption strengths of hydrogenated species over a range of pure metals. The successful prediction of the adsorption energies of hydrogenated species based on their atomic counterparts simplifies the estimation of the reaction energies of dehydrogenation and hydrogenation reactions, and can also be established in other more sophisticated reactions. For example, Chowdhury et al. [70] investigated the adsorption energies of surface species involved in the decarboxylation and decarbonylation of propionic acid over eight flat monometallic transition-metal surfaces (the (111) surfaces of Ni, Pt, Pd, Ru, Rh, Re, Cu, and Ag). They found that multivariate linear scaling relations with a combination of descriptors (i.e., the adsorption energies of $\\mathrm{CHCHCO}^{*}$, $\\mathrm{OH}^{*}$, and $\\mathrm{C}^{*}$, where \\* refers to the adsorbed species) yielded exceptionally accurate results, with an MAE of $0.12\\ \\mathrm{eV}$, which could not be outperformed by any other nonlinear models. It is only when the training dataset is incomplete (i.e., contains a random subset of adsorption energies) that kernel-based nonlinear ML models start to become superior. Although this comparison accentuates the effectiveness of linear scaling relations in rationalizing a complete and large dataset, it also points out the inadequacy of linear models in predicting adsorption energies from a limited dataset.  \n\nWhile scaling relations are generally an effective and efficient way to largely reduce the reaction intermediate space to a few descriptors, several challenges remain when applying scaling relations in high-throughput catalyst screening. First, scaling relations usually only apply to similar adsorbates that bind through the same atom, with accuracies limited to around $0.1$–$0.2\\ \\mathrm{eV}$. 
Second, stemming from the d-band theory, scaling relations work quite well for pure and alloyed transition metals; however, although a variety of scaling relations have been successfully established for inorganic compounds such as oxides, some of these apply only to limited systems with specificities in either composition or crystal structure [71,72]. Third, for complex reactions involving large organic molecules (e.g., alkanes containing more than three carbon atoms), the possibilities of single descriptors or descriptor pairs start to explode, interfering with the determination of a good catalyst, as the as-constructed activity maps are highly dependent on the chosen descriptor(s). For example, Wang et al. [73] showcased the importance of descriptor engineering using the selective dehydrogenation of propane to propylene. They found that the adoption of both $\\mathrm{CH}_{3}\\mathrm{CHCH}_{2}^{*}$ and $\\mathrm{CH}_{3}\\mathrm{CH}_{2}\\mathrm{CH}^{*}$ bindings as descriptors not only resulted in an overall MAE lower than $0.09\\ \\mathrm{eV}$ for all scaling relations but also enabled the greatest differentiation of elemental metals. Nevertheless, the use of such an approach to determine descriptors often requires the input of external knowledge (e.g., $\\mathrm{CH}_{3}\\mathrm{CHCH}_{2}^{*}$ as the selectivity-determining species in the reaction showcased above). Developing strategies that do not rely on significant domain input or human intuition is therefore highly desirable.",
        "category": " Results and discussion"
    },
    {
        "id": 9,
        "chunk": "# 2.1.2. Improving scaling relations through ML  \n\nGiven the challenges outlined above, numerous efforts have been devoted to improving scaling relations. In this regard, Mamun et al. [74] proposed a Bayesian framework to extend the single-descriptor linear scaling relation to a multi-descriptor linear regression model. The Bayesian information criterion (BIC) was adopted as the model evidence to select the best model, providing a statistical rationalization of the descriptor selection regarding how many and which descriptors should be employed to yield the best bias-variance trade-off (Fig. 2(a)). In an attempt to further improve the prediction accuracy, the researchers also leveraged Gaussian process regression (GPR) to predict the residual of the selected model (i.e., residual learning; Fig. 2(b)). When applied to the (111) or (100) facet of 2035 binary alloy materials in their $\\mathrm{A1}$, $\\mathrm{L1_{0}}$, and $\\mathrm{L1_{2}}$ Strukturbericht designations and six typical hydrogen-containing adsorbates ($\\mathrm{CH}^{*}$, $\\mathrm{CH}_{2}^{*}$, $\\mathrm{CH}_{3}^{*}$, $\\mathrm{OH}^{*}$, $\\mathrm{NH}^{*}$, and $\\mathrm{SH}^{*}$), the as-devised framework demonstrated an impressive performance, with a test MAE of $0.1\\ \\mathrm{eV}$, which is comparable with standard DFT error. This is a promising example of how ML can improve model fidelity and yield more accurate adsorption energy predictions than conventional linear scaling relations.  \n\nSimilarly, García-Muelas and López [75] reported the application of a statistical principal component analysis (PCA) and principal component regression (PCR) model to the DFT-computed adsorption strengths of 71 $\\mathrm{C}_{1}$–$\\mathrm{C}_{2}$ species on 12 close-packed metal surfaces (Cu, Ag, Au, Ni, Pd, Pt, Rh, Ir, Ru, Os, Zn, and Cd). 
As a common method for dimension reduction in unsupervised learning, PCA revealed that the majority of the thermochemistry of a given metal can be sufficiently estimated with two principal components (PCs) constructed from the formation energies of three predictors ($\\mathrm{O}^{*}$, $\\mathrm{OH}^{*}$, and $\\mathrm{CCHOH}^{*}$). One component represents the affinity of a metal to form covalent bonds with an intermediate, while the other describes the ionicity of the metal–adsorbate bond (Figs. 2(c) and (d)). The inclusion of the second component was found to be the key in extending the adsorbate thermochemistry predictions on transition metals to beyond conventional d-band theory, especially for adsorbates or metals with almost-filled valence shells or d-bands. A subsequent PCR further confirmed this finding, exhibiting an MAE of $0.12\\ \\mathrm{eV}$ on the validation set. This model was also applied to single-atom and near-surface alloy systems. With a minimum of DFT energy evaluations (around 1800), a full set of 31 000 formation energies was predicted with high accuracy (MAE = $0.19\\ \\mathrm{eV}$). The high predictive power of statistical learning based on PCA/PCR was thereby demonstrated.  \n\n![](images/1186e024790259ee878a732474f4444e7d2478c86e07e0b30dca8c85e9bc3156.jpg)  \nFig. 2. ML models applied to improve upon scaling relations. (a) BIC plotted against the number of parameters (# of descriptors) used for $\\mathrm{NH}^{*}$ for a synthetic dataset containing 100 data points. The red line connects the minimum of each descriptor (the BIC envelope), while the blue star indicates the best model (with the lowest BIC value). (b) Parity plot showing residual learning using scaling relation model-predicted chemisorption energies of $\\mathrm{NH}^{*}$ ($\\Delta E_{\\mathrm{GP}}$) plotted against the DFT-computed chemisorption energies ($\\Delta E_{\\mathrm{DFT}}$) for the testing set. The uncertainty in the prediction is shown with the color bar on the right. 
RMSE: root-mean-square error. (c) Descriptors $(t_{i1},t_{i2})$ from PCA for metals. (d) Descriptors $(w_{1j},w_{2j})$ from PCA for adsorbates. The color scale in part (d) measures the robustness of each species being a predictor; those marked in brown are more suitable predictors and those marked in yellow are the least suitable. $i_{j}$: a dimensionless value quantifying the relative contribution of species-$j$-related descriptors to prediction error. (a, b) Reproduced from Ref. [74] with permission; (c, d) reproduced from Ref. [75] with permission.",
        "category": " Results and discussion"
    },
    {
        "id": 10,
        "chunk": "# 2.1.3. Estimating activation energies through ML  \n\nThe theoretical justification of estimating the activity of a solid catalyst through its adsorption energies relies on the existence of the Brønsted–Evans–Polanyi (BEP) relation, which states that the activation energy of an elementary step is positively correlated with its reaction energy [76]. There are cases, however, in which the linear BEP relations fail to capture the catalytic trend [77–79]. Thus, it remains more desirable yet challenging to directly predict activation energies and assess the influence brought by other parameters besides reaction energy. Based on an open-access database, CatApp [80], which contains a set of DFT-calculated reaction energies and activation energies for a large number of elementary steps on single-crystal metal surfaces, including those with low symmetry such as stepped (211) surfaces, Takahashi and Miyazato [81] attempted to implement ML algorithms in conventional BEP relations in order to improve the accuracy in predicting activation energies. In addition to reaction energies, other features describing the catalyst, surface plane, reactants, and product were considered in nonlinear models such as random forest and support vector regression, resulting in better accuracy than linear models.  \n\nSimilarly, Artrith et al. [82] demonstrated an MAE of $0.20\\ \\mathrm{eV}$ (lower than an MAE of $0.35\\ \\mathrm{eV}$ through BEP approximation) when predicting the TS energies of various C–C and C–O scission steps involved in ethanol reforming, using a set of ab initio (e.g., reaction energies) and non-ab initio (e.g., electronegativity and nearest-neighbor distance of chemical species) features in the ML model. 
The TS energies predicted in this model were further adopted as features in a second model based on a smaller experimental database, enabling the direct prediction of ethanol reforming activity/selectivity without the need to know detailed reaction mechanisms or establish theoretical activity/selectivity volcano maps. These works provide methods for the rapid estimation of activation energy, although their transferability to catalysts beyond transition metals/alloys and to reactions beyond thermochemical reactions requires further demonstration.  \n\nIn sum, this subsection focused on improvements upon the traditional linear scaling relations that have been extensively relied on in conventional catalyst screening. One obvious advantage of the approaches discussed above lies in their physical rationality, as the theoretical foundation of linear scaling relations is fairly solid. However, these approaches all utilize features related to adsorption energies, which require DFT relaxations and are expensive to obtain. Furthermore, the adsorption energy is already an overall reflection of many geometric and electronic structural factors, whose contributions are challenging to understand and disentangle from a fundamental perspective. Therefore, it is still desirable to incorporate features that are formulated directly from the material electronic structure for adsorption energy prediction, which is discussed in the next subsection.",
        "category": " Results and discussion"
    },
    {
        "id": 11,
        "chunk": "# 2.2. Features based on calculated electronic structure properties  \n\nAside from the adsorption energies of some basic species, the ab initio electronic structure properties can also be calculated and employed as informative features for the ML-enabled estimation of adsorption energies. In this section, we discuss works that leverage electronic structure features, which not only present stronger potential for generalization but could also lead to physical understandings of specific heterogeneous catalytic processes.",
        "category": " Results and discussion"
    },
    {
        "id": 12,
        "chunk": "# 2.2.1. Formulated electronic structure properties  \n\nThe incorporation of domain knowledge, such as the d-band theory, can help researchers identify and formulate suitable electronic structure properties as feature inputs. Along this line, Ma et al. [83] and Li et al. [84] evaluated several characteristics of the d-band distribution and the local Pauling electronegativity, which reflects the delocalized sp-states, as features in neural network (NN) models to predict $\\mathrm{CO}^{*}$ binding energies on (100)- and (111)-terminated multi-metallic alloys for $\\mathrm{CO_{2}RR}$ catalyst screening (Fig. 3(a)). The root-mean-square errors (RMSEs) for the predictions were approximately $0.1$–$0.2\\ \\mathrm{eV}$, depending on the surface models. Similarly, with a target space holding various C–, N–, and O–containing adsorbates over different facets ((100), (111), and (211)) of 11 transition metals with a face-centered cubic (fcc) bulk structure (Co, Rh, Ir, Ni, Pd, Pt, Ru, Os, Cu, Ag, and Au), Praveen and Comas-Vives [85] devised a single ML model capable of predicting the adsorption strengths of multiple adsorbates simultaneously. With features related to the properties of the active sites, the elements involved in direct bonding, and electronic structure properties obtained from DFT calculations of free adsorbates and clean metal surfaces, the researchers trained an extreme gradient boosting (XGBoost) regressor that remained effective for adsorption energy prediction, with MAEs for the training and testing sets of 0.074 and $0.174\\ \\mathrm{eV}$, respectively.  \n\nA more important aspect of leveraging electronic features in ML-based adsorption energy prediction is to assist in the identification of the most influential features. 
Understanding why these features are important can prevent researchers from taking ML-based analyses at face value and allow for the identification of the principal factors determining surface catalytic chemistry, as well as potential ways to tailor better catalysts. The study by Praveen and Comas-Vives [85] mentioned above suggests that the most important features are electronic properties, primarily from the adsorbate and then from the metal, according to their feature importance analysis. Aside from feature importance ranking, a Bayesian learning approach (called Bayeschem) has been proposed to bridge the complexity of electronic descriptors [86]. Built upon the well-established d-band theory and a Newns–Anderson-type Hamiltonian for capturing the essential physics of chemisorption processes, a model optimized with pristine transition-metal data demonstrated impressive prediction accuracies ($\\sim0.1$–$0.2\\ \\mathrm{eV}$) and uncertainty quantifications for adsorbates such as $\\mathrm{O}^{*}$ and $\\mathrm{OH}^{*}$ at a diverse range of atomically tailored metal sites. More importantly, insights into the orbital-wise nature of chemical bonding at adsorption sites with d-state characteristics ranging from bulk-like semi-elliptic bands to free-atom-like discrete energy levels can be naturally drawn from the model.  \n\nBeyond pure metallic systems, ML methods have also been found to be efficient in describing the reactivity of metal compound catalysts. For example, Göltl et al. [87] adopted an ML genetic algorithm (GA) to analyze the correlation between various DFT-calculated electronic structure properties and $\\mathrm{CO}^{*}/\\mathrm{NO}^{*}$ adsorption strengths on transition metal sites (Cu, Ni, Co, and Fe) in zeolites (SSZ-13 and mordenite). 
Through this analysis, the position of the s orbital, the number of valence electrons of the active site, and the highest occupied molecular orbital (HOMO)–lowest unoccupied molecular orbital (LUMO) gap of the adsorbate were found to be the most important electronic descriptors. Moreover, this work pointed out the importance of capturing site reconstruction in adsorption prediction. Similarly, molecular-orbital-based analysis was performed to quantify the interactions between a variety of small molecules and the surfaces of group 13 metal oxides [88]. The HOMO energies of the adsorbates and the surface energies of the oxide surfaces were identified as two major factors governing the solid–adsorbate interactions in such systems.  \n\n![](images/5b766a2c1424c28d6b095dff5e3df60dc2d4dec4c379625282ff1a5109c1dd37.jpg)  \nFig. 3. ML models with features manually crafted from electronic structures. (a) Normalized sensitivity coefficient obtained by analyzing the network response to perturbations of input features. (b) Example of sure independence screening and sparsifying operator adsorption energy prediction for C at a hexagonal close-packed (hcp)-s site of an IrRu alloy using a data-driven descriptor. The tabulated primary features are calculated as averages over the three metal atoms (two Ir atoms and one Ru atom) making up the IrRu hcp-s site. The shown fitting coefficients are specific for C. The definition of variables can be found in Ref. [93]. Part (a) reproduced from Ref. [83] with permission; Part (b) reproduced from Ref. [93] with permission.  \n\nThe application of ML-based predictive models has also been extended to the screening of single-atom catalysts (SACs) [89–91]. In this regard, Chen et al. 
[92] constructed a comprehensive dataset comprising 1060 atomically dispersed metal/nonmetal co-doped graphene systems as model carbon-supported SACs for $\\mathrm{CO_{2}RR}$, as well as an ML model based on XGBoost and simple features, revealing that the Pauling electronegativity and covalent radius of central metal atoms are more important features than the metal d-electron number. These understandings obtained for zeolites, oxides, or SACs are generally quite different from those gained from transition metals, highlighting the great opportunities to leverage ML to disclose unique catalytic chemistry beyond transition metals.  \n\nIn addition to the identification of the main factors affecting the interactions between adsorbates and surfaces, ML models exhibit the capability to construct new descriptors from explicit expressions of these influential factors. For example, Andersen et al. [93] proposed so-called “data-driven” descriptors, whose predictive power was shown to extend over a wide range of adsorbates, multi-metallic transition metal surfaces, and facets. Identified using the recently developed compressed sensing method sure independence screening and sparsifying operator (SISSO), the descriptors are expressed as nonlinear functions of the intrinsic properties of the clean catalyst surface, including the coordination numbers and d-band moments (Fig. 3(b)). The good agreement between DFT-calculated and SISSO-predicted adsorption strengths demonstrates the effectiveness of new descriptors over scaling relations, as well as the possibility of extending them to broader material spaces.",
        "category": " Results and discussion"
    },
    {
        "id": 13,
        "chunk": "# 2.2.2. Raw electronic structure properties  \n\nWhile the aforementioned works adopt statistical features computed from electronic structure properties such as the d-band center or width, it is also possible to construct frameworks that directly digest raw electronic structural data such as the density of states (DOS). For example, Fung et al. [94] leveraged the DOS of catalytic surfaces for adsorption prediction, using the same dataset reported by Mamun et al. [74]. Unlike the previous work by Mamun et al. [74], Fung et al. [94] additionally computed the DOS of the surfaces. A convolutional neural network (CNN) model, which has been widely utilized in image processing and characterization, was adopted to automatically extract information from the raw DOS data without the need for external knowledge (Fig. 4(a)), yielding a low test MAE on the order of $0.1\\ \\mathrm{eV}.$ In addition, with the incorporation of domain knowledge, the as-devised model (referred to as ‘‘DOSnetº) supplied physically meaningful guidance through occlusion sensitivity analyses, by which the energetic responses to perturbations on electronic structures could be well estimated. This CNN-aided framework can thus potentially accelerate the discovery of new catalysts by enabling the exploration of an electronic structure space without adsorption energy calculations. As only a single calculation is required for each catalytic surface, DOSnet will exhibit even greater potential in computational savings and high-throughput screening when investigating surfaces containing a large quantity of unique adsorption sites (e.g., highentropy alloy (HEA) surfaces).  \n\nIn an attempt to obtain more interpretable features and descriptors, further engineering of DOS can be performed. For example, an automated framework was proposed to obtain accurate and interpretable descriptors of chemical activity for metal alloys and oxides using unsupervised ML (Fig. 4(b)) [95]. 
PCA was first adopted to identify a lower-dimensional basis of the DOS matrix, which consisted of PC descriptors. Models leveraging different features (namely, the traditional electronic descriptors, the full DOS, and 10 PC descriptors with top scores) were compared for $\\mathrm{C}^{*}$, $\\mathrm{O}^{*}$, $\\mathrm{N}^{*}$, and $\\mathrm{H}^{*}$ adsorption energy predictions on layered alloys; the PC-based models exhibited the most accurate results, with RMSEs smaller than those of the other two models by a factor of about two. In addition to prediction accuracies, this model is endowed with physical interpretability via the signal reconstruction of electronic-structure patterns captured by PC descriptors; thus, it provides suggestions on potential design motifs for future catalysts and establishes a link between the material’s geometric and catalytic properties.  \n\nThe importance and indispensable role of electronic-structure-related features in adsorption prediction is clearly demonstrated by the works discussed above. In addition to providing great predictive power, these features make nontrivial contributions to the model interpretability, through which fundamental understandings of the most influential electronic structural factors can be acquired and consequent objective catalyst design can be further enabled. However, the computational burden is a major concern in these approaches, as obtaining ab initio features can be expensive, especially in large systems. Realizing accurate adsorption prediction with only non-ab initio features is more appealing, in this sense. Such approaches are discussed in the following section.",
        "category": " Results and discussion"
    },
    {
        "id": 14,
        "chunk": "# 3. ML with non-ab initio features  \n\nThe central role of electronic structures in determining adsorbate–surface interactions makes it natural to include related features for adsorption energy predictions. However, acquiring these features often requires ab initio calculations, especially for unexplored new materials that cannot be found in existing databases. The resulting increase of the computational cost is obviously undesirable, especially given the aim for high-throughput screening in a material space with unlimited possibilities of crystallographic orientations, surface compositions, and binding sites (e.g., HEAs and high-entropy metal compounds). Therefore, there has been a strong tendency to realize adsorption predictions using only lowcost features that do not require new ab initio calculations. For example, Toyao et al. [96] were the first to adopt 12 readily available elemental properties (EPs; e.g., surface energy, melting point, and group in the periodic table) as features in ML models for predicting the adsorption energies of $\\mathrm{CH}_{4}$ -related species $\\mathrm{^CH_{3}^{*}}$ , ${\\mathsf{C H}}_{2}^{*}$ , $\\mathrm{CH^{*}}$ , $C^{*}$ , and $\\mathsf{H}^{*}$ ) on copper $(\\mathsf{C u})$ -based alloys, realizing decent accuracy with MAEs $<0.3\\ \\mathrm{eV}$ . Once non-ab initio features are further rationally engineered to yield better model performance, we can anticipate a boost in new catalyst discovery, as time-consuming DFT calculations will no longer be heavily relied on.",
        "category": " Results and discussion"
    },
    {
        "id": 15,
        "chunk": "# 3.1. Physically inspired non-ab initio features  \n\nThe implication of well-established theory in a predictive model is a general strategy when engineering simple, non-ab initio features with physical rationality. Aiming to predict $C0^{*}$ binding energies on alloys, Noh et al. [97] proposed a framework leveraging active learning (AL) and kernel ridge regression. More specifically, they adopted the d-band width calculated from linear muffin-tin orbital (LMTO) theory to account for the local coordination environment and the geometric mean of electronegativity to describe adsorbate renormalization. Demonstrated mostly on the (100) facets of subsurface alloy systems in an fcc bulk structure (Fig. 5(a)), the automated framework yields an impressive prediction MAE of only $0.05\\ \\mathrm{eV}$ when only adopting LMTO-derived features, which instills confidence in applying this model to screen for ideal subsurface alloys to catalyze $C0_{2}\\tt R R$ (Fig. 5(b)).  \n\nLeveraging tree-based models, Esterhuizen et al. [98] proposed a generalized additive model (iGAM) to investigate perturbations brought by strain or the ligand effect (Figs. 5(c)–(e)). The chemisorption of species representative of both electron-rich ( $\\mathrm{\\Phi_{oH}*}$ and ${\\mathsf{C l}}^{*}$ ) and electron-poorer $\\cdot\\mathrm{~o}^{*}$ and $S^{*}$ ) adsorbates on the (111) facets of subsurface metal alloys were focused on. Aside from its superior predictive capabilities (in general, with training RMSEs $<0.032\\mathrm{eV}$ and testing $\\mathrm{RMSES}<0.065\\mathrm{eV}.$ ), the iGAM model can provide further information, as it forces the model fit through construction to be a linear combination of different functions, where each function is only dependent on one feature of interest. 
In this case, the chemisorption strength was found to be impacted by three crucial site-related features: the strain in the surface layer, the number of d-electrons in the ligand metal, and the size of the ligand atom.  \n\n![](images/a9c66733a3b044f5d900c1edeaae4c5e36acbd544c83470cff50420e25d2238a.jpg)  \nFig. 4. ML models with features automatically formulated from DOS. (a) General schematic of the DOSnet model. The site-projected DOS of a surface atom serves as the input (light blue), which goes through a series of convolutional layers (green), followed by fully connected layers (red), and a final output layer. For additional atoms, the same convolutional layers are used with shared weights before being merged with the fully connected layers. Conv: convolutional and Fc: fully connected. (b) Workflow for automating electronic-structure descriptor identification using PCA. PCA identifies a lower dimensional basis (i.e., the PCs) of a DOS matrix to yield PC score descriptors. The electronic-structure effects captured in each descriptor can be analyzed and interpreted by reconstructing the DOS from the descriptors. Part (a) reproduced from Ref. [94] with permission; part (b) reproduced from Ref. [95] with permission.  \n\nOther than the manually selected features, new features can be constructed through ML. For example, the SISSO method was found to be effective in assembling initial features whose values are readily available in existing databases into new combinations, thereby either enlarging the feature space for chemisorption prediction on different metal alloys [99] or deriving more accurate descriptors for Pt-based oxygen reduction reaction (ORR) catalysts [100]. Insights on the critical physical concepts that control the chemisorption process on metal surfaces can also be further extracted.  \n\n![](images/036164d3bbef1b5f1e8bab772d388b059e8f6d1c7dc15a405e2c09b3702ff70d.jpg)  \nFig. 5. ML models built with non-ab initio features targeted at simple facets. 
(a) Three subsurface alloy models: (i) X@M, (ii) M–X@M, and (iii) $\\mathrm{M}_{3}\\mathrm{X}$@M, where the blue and black balls denote M and X metals, respectively. (b) The performance of various ML models with different descriptors: without an ab initio d-band center and with a d-band center. (c) The functions that make up iGAM models to predict the target property, $y$, are ensembles of decision trees. (d) An elbow plot for a $k$-medoids clustering analysis combined with silhouette coefficient analysis is used to select the optimal number of clusters. (e) The nine features considered are positively and negatively correlated to various degrees, based on the Pearson correlation coefficient. The definition of variables can be found in Refs. [97,98]. Parts (a, b) reproduced from Ref. [97] with permission; Parts (c–e) reproduced from Ref. [98] with permission.  \n\nIn the above work, the features were mostly formulated using known theories or domain knowledge. However, an inverse approach can be used based on previous theoretical models. For example, based on a unified empirical model [101] that correlates adsorption strength with a few electronic structure parameters including the d-band center, the number of p electrons, and the matrix coupling element between the adsorbate and the metal states, Montemore et al. [102] first predicted these parameters using ML and then derived the adsorption energies of a broad range of species (C, N, O, OH, H, S, K, and F) on flat metal and alloy surfaces with the predicted parameters as inputs to the empirical model, achieving an MAE of $0.29\\ \\mathrm{eV}$. Given the large ranges of the adsorbates and surfaces in this study, this model can be deemed to be general and reusable. 
Nevertheless, through a comparison between the two approaches, we note that these physically inspired models may present the dilemma of lower model accuracy or less generalizability, and such a balance often depends on how well the established theory works with the target chemical space.",
        "category": " Results and discussion"
    },
    {
        "id": 16,
        "chunk": "# 3.2. Enhanced representation of surfaces and molecules  \n\nThe works described above mostly focus on a single or a few adsorption sites, along with simple adsorbates. This might be sufficient for describing the activities of simple flat facets such as (111) and (100), which exhibit relatively high symmetry. However, as has been well established in many catalytic reactions, steppedlike surfaces are much more reactive and make major contributions to the overall activities [103,104]. Modeling catalytic reactions on these surfaces presents greater challenges, due to the broken surface symmetry and the resulting increase in surface heterogeneity. To accommodate various possible binding sites, traditional screening typically relies on the introduction of geometric descriptors [105,106] or the establishment of multiple site-specific activity maps [107,108]. On the other hand, emerging catalytic applications such as biomass [109,110] and plastic valorization [111,112] often require the description of interactions between large molecules and catalytic surfaces. Explicitly obtaining either the site-specific structure-activity relationships or the surface adsorption/reaction energetics involving large molecules adds up to a heavy computational burden. In this regard, ML is extremely suitable for overcoming this hurdle, once the enhanced representation of complex surfaces, molecules, or catalytic systems under more realistic conditions is implemented.",
        "category": " Results and discussion"
    },
    {
        "id": 17,
        "chunk": "# 3.2.1. Enhanced representation of complex surfaces  \n\nAs mentioned above, a prediction on stepped alloy surfaces serves as an example of a scenario in which the increased structural diversity of the catalytic surfaces must be considered. This scenario can be rather simple if the host metal remains unchanged, such as when predicting $\\mathsf{H}^{*}$ adsorption on stepped silver $(\\mathsf{A g})$ alloys. An ML model yielded an MAE as low as $0.014\\mathrm{eV}$ while only using non-ab initio features relative to the dopant atoms, without deliberate consideration of local geometric variations [113]. However, ML cannot work well with appropriate surface representation if the alloy composition is more variable. Saxena et al. [114] compared several ML models in predicting $C^{*}$ and $0^{*}$ binding energies on the (211) surfaces of $\\mathsf{A}_{3}\\mathsf{B}$ alloys with some common non-ab initio feature inputs, obtaining RMSEs of $0.31{-}0.38\\ \\mathrm{eV}$ depending on the surface termination and the adsorbate. However, the vast number of site possibilities on a (211) surface were not considered, leading to a prediction accuracy that was incomparable with those of the aforementioned models on simpler surfaces. Taking a step further, our group focused on the (211) surfaces of binary $\\mathbf{L}1_{2}$ -type alloys across 37 common metal and metalloid elements with sitespecific binding configurations, generating a rich library of site motifs and yielding a comprehensive dataset containing about 2000 adsorption energies [115]. With the inclusion of only low cost, non-ab initio features encoding both the electronic structure properties and the coordinate-based geometric information of the surface sites, our models demonstrated satisfactory prediction accuracies, with test MAEs of 0.14 and $\\phantom{-}0.18\\mathrm{eV}$ for $C^{*}$ and $0^{*}$ binding, respectively. 
Furthermore, interpretable physical insights could be extracted from the feature importance distributions and Kullback–Leibler divergence analysis, showing the most probable structural and compositional characteristics of an ideal alloy catalyst for a specific reaction. The proposed models were further validated through DFT calculations and microkinetic modeling, with low-temperature methanol synthesis as a test reaction and a $\\mathrm{Cu}_{3}\\mathrm{Pd}$ alloy as a promising candidate identified by ML. In principle, due to its simplicity, the use of this model as a rapid screening tool prior to any detailed theoretical or experimental investigations is readily applicable to other reactions that are well described by $\\mathrm{C}^{*}$ and $\\mathrm{O}^{*}$ binding strengths. Other coordinate-based geometric representations, such as the generalized coordination number, have also been found to be effective in improving the prediction accuracy of ML models based on non-ab initio electronic structure features [116–118].  \n\nThe above examples tend to focus on a system consisting of only one or two elements; however, it is also beneficial to realize effective adsorption evaluation across a broader spectrum of elements. Thus, prediction on HEA surfaces serves as another example of a scenario in which compositional heterogeneity plays an interesting role. For example, Batchelor et al. [119] explored HEAs composed of five elements (Ir, Pd, Pt, Rh, and Ru) as candidate catalysts for ORR, in which the adsorption strengths of $\\mathrm{O}^{*}$ and $\\mathrm{OH}^{*}$ were targeted. The researchers constructed a very simple linear model that leveraged parameterizations based solely on the nearest-neighbor compositions to the binding sites. Three and five types of atomic zones in (111)-type HEAs were classified for $\\mathrm{OH}^{*}$ and $\\mathrm{O}^{*}$ adsorption, respectively (Fig. 6(a)). 
By adopting the adsorption energies on a random subset of available binding sites as the training set, the model exhibited impressive prediction accuracy, with RMSEs of 0.063 and $0.076\\ \\mathrm{eV}$ for $\\mathrm{OH}^{*}$ and $\\mathrm{O}^{*}$ adsorption, respectively, on other possible sites. More importantly, the as-developed model was then applied to optimize the HEA composition, offering a design platform for the discovery of novel alloys by promoting sites with exceptional catalytic activities (Fig. 6(b)).  \n\nA similar concept of site representation was adopted for screening bimetallic or HEA catalysts for either $\\mathrm{CO}_{2}$ hydrogenation to methanol [120] or the hydrogen evolution reaction (HER) [121]. The use of distance-based descriptors as an alternative to the nearest-neighbor information was found to contribute to the accurate prediction of $\\mathrm{H}^{*}$ adsorption on multi-metallic surfaces [122]. Nevertheless, the prediction of multi-metallic or HEA catalysts is mainly limited to (111) or (100) model surfaces at present. Accurate predictive models capable of encompassing both structural and compositional variations (e.g., HEA catalysts with non-ideal flat surfaces) are still lacking and require future development.  \n\nThe coordinate-based representation method further enables the AL-based fully automated theoretical framework to guide the DFT calculations of desirable energetic descriptors, as demonstrated by Tran and Ulissi [123]. More specifically, these researchers proposed a fingerprinting method to represent the adsorption site numerically (Fig. 6(c)). This method describes each element type coordinated with the adsorbates using a vector of four numbers: the atomic number; the Pauling electronegativity; the number of atoms of the element coordinated with the adsorbate, as determined by the Voronoi tessellation; and the median adsorption energy between the adsorbate and the pure element ($\\Delta E$). 
Having enumerated all possible binding sites over 1499 different intermetallic combinations across 31 elements, the researchers were able to identify 54 candidates with surfaces having near-optimal $\\mathrm{CO}^{*}$ binding for electrochemical $\\mathrm{CO_{2}RR}$ and 102 candidates with ideal $\\mathrm{H}^{*}$ binding for the HER (Figs. 6(d) and (e)). The prediction MAEs were reported to be 0.29 and $0.24\\ \\mathrm{eV}$ for $\\mathrm{CO}^{*}$ and $\\mathrm{H}^{*}$, respectively. This proposed framework is a successful example of combining flexibility, automation, and ML guidance to enable holistic analyses across numerous adsorption sites, surfaces, and material spaces and the consequent acceleration of theoretical discovery. It should be noted that, although the AL framework basically adopted non-ab initio features (except for $\\Delta E$), additional DFT calculations were iteratively performed to verify the prediction and generate new DFT data for model retraining.  \n\nCompared with coordinate-based methods, graph-based deep learning (DL) methods have advantages in high-level feature representations [124]. With the same dataset as that used in Ref. [123], Back et al. [125] demonstrated lower MAEs of $0.15\\ \\mathrm{eV}$ with CNNs that were built on top of the graph representation and used only initial structures as inputs. Even more impressive prediction accuracies (i.e., test MAEs of 0.116 and $0.085\\ \\mathrm{eV}$ for $\\mathrm{CO}^{*}$ and $\\mathrm{H}^{*}$ binding, respectively) were achieved with an ensemble of crystal graph CNNs (CGCNNs) and a labeling method representing the binding site atoms of the unrelaxed bare surface geometry [126]. The site labeling method (Fig. 7(a)) enables the complete removal of DFT-based surface relaxation by generating unrelaxed surface structures from relaxed bulk structures that are computationally cheaper or even readily available in open-source databases such as Ref. [127]. 
In principle, such a universal method can be applied to any DL-based adsorption prediction model without modification. These works demonstrate that the combination of a novel site description method and advanced ML algorithms provides a viable solution for the high-throughput prediction of complex catalytic surfaces, significantly extending the search space from single-crystal model catalysts to more practical ones.  \n\n![](images/5fd739c26b178c28a59db7df66967edfe82e8d921b079b11afe830ba5b1ab2d5.jpg)  \nFig. 6. ML models with a coordination-based method for representing complex surfaces. (a) Parameterization of the surface configurations using nearest neighbors for $^{*}\\mathrm{OH}$ on-top and $^{*}\\mathrm{O}$ fcc hollow-site binding. (b) Activities (As) of reengineered compositions of the HEA IrPdPtRhRu; distribution of adsorption energies for $\\mathrm{Ir}_{20}\\mathrm{Pd}_{20}\\mathrm{Pt}_{20}\\mathrm{Rh}_{20}\\mathrm{Ru}_{20}$, $\\mathrm{Ir}_{10.2}\\mathrm{Pd}_{32.0}\\mathrm{Pt}_{9.3}\\mathrm{Rh}_{19.6}\\mathrm{Ru}_{28.9}$, $\\mathrm{Pd}_{81.7}\\mathrm{Ru}_{18.3}$, and $\\mathrm{Ir}_{17.5}\\mathrm{Pt}_{82.5}$ (global maximum activity). $\\Delta E_{\\mathrm{pred}}$: predicted adsorption energy. (c) Fingerprint of the coordination site, where the adsorption sites are reduced to numerical representations (namely, fingerprints), and these fingerprints are used as model features. $Z$: the atomic number of the element; $\\chi$: the Pauling electronegativity of the element; CN: the number of atoms of the element coordinated with the adsorbate; $\\Delta E$: the median adsorption energy between the adsorbate and the pure element. (d) A t-distributed stochastic neighbor embedding visualization of all the adsorption sites simulated with DFT, where the adsorption energy values are in units of eV. $\\Delta E_{\\mathrm{H}}$: H adsorption energy. 
(e) Normalized distribution of low-coverage $\\mathrm{H}^{*}$ adsorption values calculated by the DFT workflow; dashed lines indicate the $0.1\\ \\mathrm{eV}$ range around the optimal $\\mathrm{H}^{*}$ adsorption value of $-0.27\\ \\mathrm{eV}$. Parts (a, b) reproduced from Ref. [119] with permission; Parts (d, e) reproduced from Ref. [123] with permission.  \n\nWhen combined with different ML methods or modules, graph-based representations also provide a promising strategy for increasing the interpretability of features extracted from electronic structure properties such as DOS. Wang et al. [128] directly infused the famous d-band theory into DL, obtaining a framework capable of supplying physical insights from learned data by design. This so-called theory-infused NN (TinNet) approach contains two sequential components: a convolutional-NN-based regression module that encodes the atomic and electronic structural information from the raw data; and a theory module that takes outputs from the regression module and predicts the adsorption properties of a metal site (Fig. 7(b)). The effectiveness of TinNet was demonstrated with representative simple adsorbates such as $\\mathrm{OH}^{*}$ and $\\mathrm{O}^{*}$. With an MAE of $0.118\\ \\mathrm{eV}$, the prediction performance was among the best in comparison with existing models or algorithms such as GPR [74], Bayeschem [86], DOSnet [94], and CGCNN. In addition to having a prediction performance on par with purely data-driven ML methods, TinNet allows for the decomposition of d-contributed adsorption energy into Pauli repulsion and orbital hybridization, a detailed analysis of which sheds light on potential paths to tailor novel motifs with desired catalytic properties.  \n\n![](images/9864dfbe3ebfeab62fa11adcb18bbaec9e202dedd0eb2d7ec8baf33f9239841e.jpg)  \nFig. 7. ML models with a graph-based method to represent complex surfaces. 
(a) Creation of the labeled-site representation for training and its application to real systems. For training, a covalent radius is used to identify the interaction between the surface and the adsorbate in the relaxed geometry; then, the binding-site atoms in the unrelaxed surface geometry are substituted with their pseudoelement counterpart. For applications, surface atoms of the unrelaxed surface geometry are identified by alpha shape, and top, bridge, and hollow sites are identified using graph theory. Specifically, $d$ refers to the distance between atoms $i$ and $j$, whereas $d_{\\mathrm{cov1}}$ and $d_{\\mathrm{cov2}}$ refer to the covalent radii of atoms $i$ and $j$, respectively. (b) Schematic illustration of the TinNet. Information flows from the graphical representation of a given adsorbate–substrate system to the adsorption energy, the projected DOS onto the adsorbate frontier orbital(s), and the d-band moments of the adsorption site. Circles and squares in the regression module represent neurons and feature maps, respectively. The definition of variables can be found in Refs. [126,128]. Part (a) reproduced from Ref. [126] with permission; Part (b) reproduced from Ref. [128] with permission.",
        "category": " Results and discussion"
    },
    {
        "id": 18,
        "chunk": "# 3.2.2. Enhanced representation of complex molecules  \n\nSince the interaction between surfaces and molecules plays a central role in heterogeneous catalysis, the numbers of both possible adsorption configurations and possible reaction pathways increase drastically when the target reactions involve larger molecules. Thus, the explicit calculation of all adsorption energies can be very resource- and time-consuming. As has been wellestablished and demonstrated in general molecular ML for organic synthesis or drug discovery, many molecular representation methods have been directly implemented in predictive ML models for catalysis [129–133]. For example, Li et al. [134] compared different combinations of methods, including EP [96] and Coulomb matrix [129] representations for surfaces, as well as extended connectivity fingerprint (ECFP) [130], spectral London Axilrod–Teller–Muto (SLATM) [131], and bags-of-bonds (BOB) [132] representations for adsorbates, and found that the EP $^+$ SLATM combination yielded the lowest MAE of approximately $0.18\\mathrm{eV}$ for 68 adsorbates on four low-index metal facets $\\mathsf{\\Gamma}(\\mathbf{u}(111)$ , Pt(111), $\\mathsf{P d}(111)$ , and ${\\mathrm{Ru}}(0001))$ . The researchers further extended the simple surfaces to broader transition metal/alloy surfaces and made a change in various representation methods [123,126,133] by replacing the atomic number with the elemental group and periods, thereby achieving an MAE of about $0.05\\mathrm{eV}$ for $\\mathsf{H}^{*}$ binding prediction and MAEs of about $0.1\\ \\mathrm{eV}$ for other strong binding adsorbates $(\\mathsf C^{*},\\mathsf N^{*},0^{*}$ , and $S^{*}$ ) [135]. Using molecular fingerprints based on simplified molecular input line entry system (SMILES) notation (Fig. 8(a)) [136,137], Chowdhury et al. 
[137] constructed multiple filter-based NN models to extrapolate from a $\\mathrm{C}_{4}$ dataset to a $\\mathrm{C}_{2}/\\mathrm{C}_{3}$ dataset on Pt(111), where $\\mathrm{C}_{2}$–$\\mathrm{C}_{4}$ refer to species made up of two to four carbon atoms. The SMILES-based representation was demonstrated to lower the extrapolation MAE by approximately 20% compared with coordinate-based ones. Similar feature engineering has also helped to predict and compare the adsorption energies of ring and chain species on metal surfaces [138]. Both works demonstrate the effectiveness of SMILES notation in encoding complex molecular structures in predictive ML models.  \n\nSimilar to surface representation, graph-based methods enable enhanced and efficient molecular representation due to their conveniently readable and extendable data structure. For example, various graph-based methods such as graph NN (GNN) have been employed to represent up to 315 $\\mathrm{C}_{1}/\\mathrm{C}_{2}$ surface intermediates and TSs on Rh(111) for syngas-to-ethanol conversion [139]. The best RMSE and MAE for adsorption energy prediction were found to be 0.19 and $0.15\\ \\mathrm{eV}$, respectively, and the error for activation energy prediction was lower than those of conventional BEP relations. Very recently, the superiority of GNN in representing complex molecules was substantiated by Pablo-García et al. [140], who demonstrated the construction of a well-balanced chemically diverse dataset and a new GNN architecture called graph-based adsorption on a metal energy (GAME)–neural network (Net) (Fig. 8(b)). Their dataset is very comprehensive, containing closed-shell $\\mathrm{C}_{1-4}$ molecules with functional groups including N, O, S, and $\\mathrm{C}_{6-10}$ aromatic rings (3315 entries). The optimal adsorption configuration and position of all the molecules were explored through DFT calculations after extensive sampling. Only the lowest energy configurations were included in the dataset. 
A molecule adsorbed on a close-packed metal surface was further represented as an integral graph to train GAME–Net, consisting of fully connected layers, convolutional layers, and a pooling layer. The strong predictive power of GAME–Net was demonstrated by a low MAE of $0.18\\ \\mathrm{eV}$ on the test set and a computation time six orders of magnitude shorter than that of DFT. The model could even be directly adopted to predict larger plastic and biomass molecules with up to 30 heteroatoms, which were not present in the initial dataset for training, yielding an MAE of $0.016\\ \\mathrm{eV}$ per atom that showed the model’s promising accuracy. Although this model still has a few limitations, such as the requirement of highly symmetric surfaces (i.e., only close-packed pure metal is considered) and neglect of lateral effects, the simplicity and generality of this model make it a useful tool for the fast screening of catalytic materials for unique applications that cannot be easily simulated by traditional methods such as DFT.",
        "category": " Results and discussion"
    },
    {
        "id": 19,
        "chunk": "# 3.2.3. Enhanced representation of catalytic systems under more realistic conditions  \n\nWhile the above works focus on model catalytic systems such as single-crystal surfaces with low coverage of adsorbates, efforts to leverage ML in order to better describe and predict more practical catalytic systems also benefit from enhanced representation. For example, the importance of accurate surface representation is further demonstrated by the prediction of practical catalytic materials beyond single-crystal model surfaces, such as nanoparticles (NPs) and small clusters. With a focus on describing the catalytic NO decomposition performance of RhAu alloy NPs (Fig. 9(a)), Jinnouchi and Asahi [141] proposed a universal ML scheme to investigate reaction activities based on local atomic configurations. To evaluate the structural similarities, the researchers adopted a socalled smooth overlap atomic position (SOAP) similarity kernel, which consists of overlap integrals between three-dimensional (3D) atomic distributions within a cutoff radius from different surface sites. The success of this model demonstrates the fact that the adsorbate binding is rather local and the prediction accuracy can be systematically improved by increasing the number of DFT data to cover all possible local structures. Similar conclusions were drawn when a research group combined SOAP descriptors with ML models to predict $\\mathsf{H}^{*}$ adsorption on a variety of $\\mathsf{M o S}_{2}$ and Cu–Au nanoclusters [142].  \n\nAdvanced local structure representation can then be assembled using various global structure generation methods into ML pipelines for predicting structurally diverse practical catalytic systems. Chen et al. [143] devised an NN model to identify the active sites on gold (Au) NPs and dealloyed $\\mathsf{A u}_{3}\\mathsf{F e}$ NPs for $C O_{2}\\mathrm{RR}$ to CO. 
The researchers focused on a performance indicator called the $a$-value, which can be expressed as $a = \\Delta E_{\\mathrm{CO}} - 1.4423\\Delta E_{\\mathrm{HOCO}}$, where $\\Delta E_{\\mathrm{CO}}$ and $\\Delta E_{\\mathrm{HOCO}}$ represent the adsorption energies of CO and the surface carboxyl ($\\mathrm{HOCO}^{*}$), respectively. Both energies can be obtained by means of quantum mechanics (QM). Using ReaxFF [144], a force field developed for reactive systems, the researchers first constructed a $10\\ \\mathrm{nm}$ Au NP, which contained more than 10 000 surface sites. Then, features based on the interatomic distances between the Au atoms were leveraged to describe the extremely irregular and disordered Au surfaces, yielding RMSEs of approximately 0.05 and $0.06\\ \\mathrm{eV}$ for the $\\Delta E_{\\mathrm{CO}}$ and $\\Delta E_{\\mathrm{HOCO}}$ predictions, respectively. The catalytic activity of the whole surface was further mapped to illustrate the desirable site geometries of the NPs (Fig. 9(b)) and guide the design of high-performance electrocatalysts for $\\mathrm{CO}_{2}\\mathrm{RR}$. A similar ML-QM-ReaxFF framework was applied to study $\\mathrm{CO}_{2}\\mathrm{RR}$ on Au NPs while considering solvation effects and roughened Cu surfaces, demonstrating the good versatility of this strategy [145,146].  \n\nDifferent site representation and initial structure generation methods can be considered to further modify the workflow. By leveraging the fingerprint labeling method [126], Gu et al. [147] integrated the force field, DFT, ML, and kinetic Monte Carlo in an end-to-end multiscale simulation framework to elucidate the alkaline HER kinetics of jagged platinum (Pt) nanowires. This framework not only achieved a high prediction accuracy for $\\mathsf{H}^{*}$ adsorption energies, with an MAE $< 0.05\\ \\mathrm{eV}$, but also offered insights into the autobifunctional alkaline HER mechanism. 
It also suggested structural motifs of highly active Pt catalysts for alkaline HER. Similarly focusing on HER catalysts but with an amorphous system, Zhang et al. [148] adopted a GA optimization method implemented in the Universal Structure Predictor: Evolutionary Xtallography (USPEX) code to obtain over 600 amorphous surface structures of $\\mathrm{Ni}_{2}\\mathrm{P}$. Non-ab initio features relying only on the local chemical environment were utilized to predict the frozen adsorption energies of $\\mathrm{H}^{*}$, with an $\\mathrm{RMSE}<0.1$ eV. However, we note that the $\\mathsf{H}^{*}$ adsorption energy consists of a frozen term and a relaxation term. The prediction of the latter, which accounts for the energy change upon site and surface deformation, still requires ab initio features, in accordance with prior discussions on the zeolite system [87].  \n\n![](images/3d3b4da66008320927e8e121ea62d796dcdce78aacf52aa89c3ff5c0d70a69c3.jpg)  \nFig. 8. ML models with enhanced representation of complex molecules. (a) SMILES-based molecular fingerprint for the surface species $\\mathrm{CH}_{3}\\mathrm{CHCOO}$. Here, $\\mathsf{C}_{0}$ denotes a saturated carbon. $\\mathsf{C}_{1}$, $\\mathsf{C}_{2}$, and $\\mathsf{C}_{3}$ denote carbon atoms with one, two, and three free valences, respectively. Similarly, $\\mathsf{O}_{0}$ is a saturated oxygen, whereas $\\mathsf{O}_{1}$ is an oxygen atom with one free valence. (b) Schematic illustration of the workflow for GAME–Net. (b–i)–(b–iv) Starting from the DFT functional group (FG) dataset containing small adsorbates, the sample adsorption systems are transformed to their corresponding graph representation to train the proposed GNN architecture. BM: big molecules. The final purpose is to use GAME–Net to estimate the adsorption energy of big molecules $C_{<23}$ on metal surfaces present in the big molecule dataset, thus avoiding the use of computationally expensive DFT calculations. 
Here, $E_{i,\\mathrm{GNN}}$ refers to the GNN-predicted proxy energy of a molecule $i$ adsorbed on a surface. Part (a) reproduced from Ref. [137] with permission; Part (b) reproduced from Ref. [140] with permission.  \n\nAnother aspect of practical catalytic complexity stems from lateral effects such as adsorbate–adsorbate interactions and solvation. Explicitly accounting for these effects in ab initio simulations, however, is often extremely computationally demanding. For example, identifying the optimal binding configuration on a surface at high coverage normally requires enumerating all possible binding configurations and then computing the energy of each configuration using DFT. The exploration of such a large space of atomistic configurations could take orders of magnitude more time than a single calculation at low coverage. To address this challenge, the Greeley group developed an ML-based surrogate model, named the adsorbate chemical environment-based graph convolution neural network (ACE–GCN), to replace expensive DFT calculations in determining the atomistic configurations of high-coverage catalytic surfaces (Fig. 9(c)) [149]. This model was based on the SurfGraph algorithm, which allows for the conversion of atomistic configurations to undirected graph representations [150]. The graph representations were further split into subgraphs for featurization and model training. This splitting into subgraphs is the key to explicitly accounting for the local environment of the adsorbate so that subtle atomistic interplay such as adsorbate–adsorbate interaction can be accurately captured. 
Illustrated by $\\mathrm{OH}^{*}$ adsorption on a stepped Pt(221) surface, the ACE–GCN not only enabled the use of a mixed training dataset (high-coverage data obtained on both Pt(221) and Pt(100) surfaces) to improve the model’s reliability in ranking the most likely adsorption configurations but also successfully identified energetically favorable and unfavorable high-coverage (corresponding to $1/2$ monolayer) $\\mathrm{OH}^{*}$ adsorption configurations on Pt(221) with $96\\%$ fewer DFT relaxations (Fig. 9(d)).  \n\n![](images/c88d1da07961a2bd1b4e70ab1b9169452ddeb33cdec26ac8103a134e69073400.jpg)  \nFig. 9. ML models with an enhanced representation of catalytic systems under more realistic conditions. (a) Atomic distributions and binding energies ($E_{\\mathrm{b}}$) of N, O, and NO with the surface sites on a $\\mathrm{Rh}_{1-x}\\mathrm{Au}_{x}$ NP with $x=0.19$ and $d=5\\ \\mathrm{nm}$ ($x$ refers to the atomic fraction of Au in the NP and $d$ refers to the diameter of the NP). (b) $a$-value mapping and catalytic activity visualization for a dealloyed Au surface. Each single site is given an $a$-value based on NN prediction. These $a$-values are then mapped back onto the particle to visualize the catalytic activity of the whole surface. As indicated in the color bar, the red sites are inactive, while the blue sites are active. “Surface defect” and “Step under 111” sites are highlighted as two representative highly active sites, corresponding to surface Au(111) atoms with one or two missing atoms around the center site and undercoordinated Au(111) atoms near steps, respectively. (c) An ACE–GCN algorithm used to encode and train high-coverage adsorbate configurations. 
(d) Screening high-coverage $\\mathsf{OH}^{*}$ configurations on $\\mathsf{Pt}(221)$: (left) scatter plots for the average $\\mathsf{OH}^{*}$ binding energies of unrelaxed configurations, as predicted by ACE–GCN, with respect to DFT-relaxed energies of the corresponding structures. Among all configurations ($N$ refers to the number of configurations), 213 out of 5855 configurations remain undissociated after DFT relaxation. A representative area of the chemical space relevant for unstable and stable configurations is depicted on the scatter plots, marked as “(i)” and “(ii)”; (right) representative stable and unstable atomic configurations from the (i) and (ii) regions depicted in the scatter plots. Part (a) reproduced from Ref. [141] with permission; Part (b) reproduced from Ref. [143] with permission; Parts (c, d) reproduced from Ref. [149] with permission.  \n\nA rigorous description of catalytic systems that embraces the complexities originating from both nanostructured catalysts and realistic reaction conditions is rarely reported, except for a very recent study by Cao and Mueller [151], who adopted a machine-learned cluster expansion method to map ORR activity on Pt–Ni alloy nanoparticles. Nevertheless, accelerating the in situ theoretical description of practical catalytic systems using ML and advanced representation methods is a promising direction.",
        "category": " Results and discussion"
    },
    {
        "id": 20,
        "chunk": "# 4. ML-guided experimental catalyst discovery  \n\nAn accurate estimation of the adsorbate binding strength helps lay the foundation for efficient high-throughput catalyst screening and catalyst design, the effectiveness of whichÐof courseÐstill requires experimental validation. In this section, we present a few examples of the successful development of highly active catalysts under ML guidance to further demonstrate the significance of ML methods in accelerating experimental catalyst discovery.  \n\nFor example, Zhong et al. [152] adopted the AL framework discussed above [123] to investigate CO adsorption strengths on alloy surfaces. Based on insights obtained from a scaling-derived volcano map, which indicated that the optimal CO binding for $C0_{2}\\tt R R$ should be around 0.67 eV [107], the researchers examined a wide range of alloys to identify the ideal catalysts that exhibit adsorption strengths around that value. As illustrated by its tdistributed stochastic neighbor embedding (t-SNE) diagram [153], the Cu–Al alloy presents multiple sites and surface orientations with near-optimal CO binding, demonstrating its great potential for efficient and selective $C O_{2}\\mathrm{RR}$ catalysis. This was later confirmed with a synthesized $\\mathsf{C u\\mathrm{-}A l}$ catalyst, which efficiently reduces ${\\mathsf{C O}}_{2}$ to ethylene with the highest reported Faradaic efficiency of over $80\\%$ . Similarly, ML has been verified to be effective in designing alloy catalysts for nitrogen-related chemistries such as ammonia oxidation. For example, adopting the aforementioned TinNet framework [128], Pillai et al. [154] explored the immense design space of ternary Pt alloy nanostructures (Figs. 10(a) and (b)). With a training dataset of ab initio data, concurrent predictions of site reactivity, surface stability, and catalyst synthesizability descriptors can be realized. 
An AL workflow showed $\\mathrm{Pt}_{3}\\mathrm{Ru}$–M (M = Fe, Co, or Ni) alloys to be promising iridium (Ir)-free candidates, and their catalytic potential was confirmed by the corresponding experimentally synthesized nanocubes, which exhibited higher activities than state-of-the-art Pt catalysts and their bimetallic alloy counterparts (Figs. 10(c) and (d)). The great potential of ML in guiding and accelerating the experimental exploration of catalysts in a vast chemical space such as that of a multi-metallic system was thereby established.  \n\nIn addition to its use in high-throughput screening, ML’s attractive capability to supply valuable physical insights for experimental catalyst design has been established. Along this line, Zhai et al. [155] devised an NN model correlating the ORR activity of perovskite oxides to nine ionic descriptors, including the ionic Lewis acid strengths (ISAs) on A- and B-sites, which were later confirmed to be the most influential features according to the feature importance ranking. Tuning the ISAs of perovskites is therefore suggested as a viable approach for optimizing perovskites’ ORR activity. Experimental characterization has revealed that decreased A-site and increased B-site ISAs can considerably improve the surface exchange kinetics of perovskite oxides. Based on this premise, four perovskite oxides were synthesized, whose superior catalytic performance substantiated the effectiveness of ML-derived catalyst design principles. Similarly, machine-learned insights through Bayeschem [86] were found to be effective in discovering novel catalysts for the electrochemical nitrate reduction reaction ($\\mathrm{NO}_{3}\\mathrm{RR}$) that break the adsorption-energy scaling limitations posed by conventional catalysts [156]. 
More specifically, Bayeschem was used to determine that the non-scaling behavior originated from site-specific Pauli repulsion interactions of the metal $\\mathsf{d}$-states with the adsorbate frontier orbitals and could be realized on (100)-type sites, where $^{*}\\mathrm{N}$ and $^{*}\\mathrm{NO}_{3}$ exhibited different degrees of orbital overlap with subsurface metal atoms. As a result, tuning the subsurface elements in ordered B2 intermetallics became a rational strategy to optimize the $\\mathrm{NO}_{3}\\mathrm{RR}$ performance. This strategy was further verified by synthesizing and testing monodisperse ordered B2 CuPd nanocubes with (100)-like surface orientations, which displayed a high Faradaic efficiency of $92.5\\%$ for $\\mathrm{NO}_{3}\\mathrm{RR}$ to ammonia and higher ammonia yield rates than Cu or Pd. This success in translating machine-learned insights into rational experimental catalyst design principles sheds light on ML-guided new catalyst discovery aside from direct computational high-throughput screening.  \n\n![](images/83b1bde2d0a9a3ec579a107c2974f1c05ebcd6caf34839185837d5a0121a3b37.jpg)  \nFig. 10. ML-guided experimental catalyst discovery. (a) An AL workflow for accelerating catalytic materials discovery. (b) The ammonia oxidation reaction activity map at 0.3 V vs a reversible hydrogen electrode (RHE) with solid markers showing promising ternary Pt alloy electrocatalysts predicted from the workflow. $\\Delta E_{*\\mathrm{N}_{\\mathrm{b}}}$: nitrogen adsorption at bridge site, $\\Delta E_{*\\mathrm{N}_{\\mathrm{h}}}$: nitrogen adsorption at hollow site. The activity is quantified using the turnover frequency of $\\mathsf{N}_{2}$ ($\\mathrm{TOF}_{\\mathrm{N}_{2}}$). (c) High-angle annular dark-field scanning transmission electron microscope image and the corresponding energy dispersive spectroscopic elemental mapping of Pt, Ru, and Co. 
(d) Electrocatalytic performance testing of Pt, $\\mathrm{Pt}_{3}\\mathrm{Ir}$, $\\mathrm{Pt}_{3}\\mathrm{Ru}$, and $\\mathrm{Pt}_{3}\\mathrm{Ru}_{1/2}\\mathrm{Co}_{1/2}$ nanocubes via cyclic voltammetry with a rotating speed of $900\\ \\mathrm{r}{\\cdot}\\mathrm{min}^{-1}$ in Ar-saturated $1.0\\ \\mathrm{mol}{\\cdot}\\mathrm{L}^{-1}$ KOH + $0.1\\ \\mathrm{mol}{\\cdot}\\mathrm{L}^{-1}$ $\\mathrm{NH}_{3}$ under ambient conditions. The measured current density was normalized to the mass of Pt (i.e., $\\mathrm{A}{\\cdot}\\mathrm{g}_{\\mathrm{Pt}}^{-1}$) in Pt-based electrocatalysts. Reproduced from Ref. [154] with permission.",
        "category": " Results and discussion"
    },
    {
        "id": 21,
        "chunk": "# 5. Summary and outlook  \n\nThe search for efficient catalysts for the next-generation chemical industry will continue to be a research hotspot for decades to come. As a rising field that is still in its infancy, ML-aided surface reactivity evaluation has already demonstrated its huge potential to enable a paradigm shift in high-throughput catalyst screening. Considering the progress that has already been achieved, we point out two major propellants (Fig. 11) in the development of ML models for adsorption energy prediction:  \n\n(1) The construction and curation of datasets. Rather than generating a completely new set of training data points from scratch, many works leverage datasets from previous papers or public data repositories to devise novel models for binding strength prediction. For example, the datasets reported in Refs. [74,84,93,123] have been widely adopted in other works, which present fresh perspectives by tackling these published data from a different angle. Public data repositories such as CatApp [80] and Catalysis-Hub.org [157] maintained by the SUNCAT center at the Stanford Linear Accelerator Center (SLAC) have also been frequently used. The reuse of the same dataset for the demonstration of different ML models enables objective performance comparison, where the establishment of appropriate benchmarks encourages the development of more accurate and robust models. With the aim of constructing extensive datasets for heterogeneous catalysis, Fundamental AI Research at Meta AI (originally Facebook AI) and Carnegie Mellon University’s Department of Chemical Engineering launched the Open Catalyst (OC) project in 2020. Its original dataset, OC2020, consists of 1.28 million DFT relaxations ${\\sim}260$ million single-point evaluations), spanning across 55 elements, 82 adsorbates, and unary/binary/ternary inorganic materials [158]. 
The release of such a large-scale dataset is undoubtedly beneficial in attracting broader interest and bringing the research community together to address open challenges in developing generalizable ML models for catalyst discovery [159].  \n\n(2) The implementation and improvement of matter representation. As demonstrated in Section 3.2, ML model accuracy is largely dependent on an appropriate representation of surfaces and molecules, whose role becomes even more predominant when modeling the catalytic activities of structurally or compositionally complex systems such as nanoparticles and HEAs. Given the ubiquity of site diversity that results from likely catalyst reconstruction under realistic conditions, it is therefore crucial to rationalize and optimize matter representation. DL-based approaches have recently exhibited great potential in sophisticated matter representation [124–126,140,150,160]. Their representations are more expressive than hand-crafted ones and are expected to be compatible with large-scale datasets, as revealed by a comparative study on the OC2020 dataset [159].  \n\nDespite the impressive achievements that have been made so far, accessing adsorption strengths directly through ML still presents the following nontrivial challenges (Fig. 11):  \n\n(1) Generalizability. As many previous works have mostly focused on systems based on specific chemistries and material compositions (e.g., predominantly metal alloys) with limited demonstration of their generalizability, it remains a “holy grail” task in this field to develop a universal model that can operate across the abundant space of materials and molecular adsorbates. Similar to AI/ML model optimization in other fields, a model’s predictive capability generally improves as the amount of data increases. Unfortunately, this improvement is neither simple nor readily scalable. 
As revealed by the OC team [158] using current baseline models, the scaling between the dataset size and model performance is more difficult for catalysis datasets than for datasets of organic small molecules and inorganic materials. Innovations in ML models are therefore greatly needed to overcome this hurdle.  \n\n![](images/f5c3fabaa8e81930576896d9994a8a8a6de4300ebcaa640d0ce4a20c63b57778.jpg)  \nFig. 11. Two major propellants and five future challenges in developing ML-assisted approaches for adsorption energy prediction.  \n\n(2) Efficiency. Given access to large-scale datasets, the next task is to enhance model efficiency. This usually relies on the utilization of low-cost features (e.g., using only the graphic information of initial atomistic structures, as in OC2020 tasks [158]) and the improvement of prediction accuracy. As the ultimate goal is to identify materials with desirable properties within an almost unlimited candidate space, the adoption of computationally costly information is not preferable. On the other hand, the prediction accuracy of ML models remains essential, since inadequate results eventually lead to a waste of time and resources, which diminishes the goal of accelerated material screening. Unfortunately, reducing the cost and improving the accuracy often result in a dilemma, as demonstrated by the comparison between models using ab initio and non-ab initio features. It is therefore vital to carefully and delicately balance these two demands.  \n\n(3) Complexity. Despite the desirable efforts that have been made to predict adsorption energies for species involved in complicated reaction networks or on complex catalytic surfaces, training datasets are mostly obtained on idealized surfaces with simple assumptions such as a high vacuum, low adsorbate coverage, and single surface species. 
These approximations, however, can be too crude and may deviate substantially from the actual reaction conditions, especially for the electrocatalytic reactions used in a wide swath of future clean-energy-related applications. In addition to some common complexities introduced by, for example, species co-adsorption or adsorbate–adsorbate interaction [107,161], these electrocatalytic reactions embrace additional complications stemming from the inherent electrochemical interfaces, which can lead to profound solvation and charge separation effects [162–164]. The prediction results of ML models will not be as useful and impactful if these complexities cannot be well captured, despite the potentially satisfactory prediction accuracies such models might be able to achieve [149].  \n\n(4) Reliability. The energetic data in most current databases are obtained through generalized gradient approximation (GGA)-level DFT computation. Consequently, the accuracies of ML models built upon these data are also restricted to such a level. More sophisticated methods such as meta-GGA or hybrid functionals are capable of supplying more reliable results, but they usually incur an enormous computational burden at the same time, making it impractical to construct datasets with these methods. In addition, some systems, such as those with spin polarization or strong electron correlation (e.g., magnetic 3d metal oxides), require the delicate tuning of DFT parameters to yield physically sensible results, presenting another hurdle in the formulation of large-scale datasets. For example, the OC2020 dataset simply considers no spin polarization for all systems [158]. This inconsistency in computational methods introduces additional uncertainties when adopting databases from different sources. Uncertainty quantification therefore remains necessary. 
Developing reliable methods to accelerate high-precision DFT simulations or to provide accurate DFT surrogates is another valuable direction, in which ML has already demonstrated its great potential [165–169]. A discussion on this aspect, however, lies beyond the scope of this review.  \n\n(5) Interpretability. Improving a model’s interpretability helps to better exploit its predictive power. Beyond merely obtaining a few promising candidates, it is also of paramount significance to acquire fresh understandings and new principles to aid in the design of better catalysts through objective optimization. Most previous works have adopted pure data-driven approaches, which yield impressively low prediction errors but provide limited interpretability. Post-training analysis is therefore a common yet effective way to extract more physical insights from such models. Alternatively, it is even more ideal to intentionally weave mechanistic understandings into the ML framework, in which case the physical rationality of the model can be automatically ensured and the model’s interpretability will come naturally. More importantly, merging interpretability into ML models can help to partially address the reliability concern, as experts can try to rationalize the derived interpretations and compare them with known physics [55].  \n\nWe note that the above challenges can be highly entangled, and that there might not be a single ideal ML model capable of overcoming all obstacles simultaneously. Instead, we envision a hierarchical workflow that leverages multiple ML models with unique strengths in different aspects, in which the overall mission of high-throughput screening is decomposed into a sequential task consisting of steps with different requirements for accuracy, complexity, and scalability. For example, pure data-driven ML models can first be employed to rapidly navigate through the vast material space with simple assumptions and compromised prediction accuracies. 
Given appropriate uncertainty quantification, it would still be possible to locate the subspace enclosing possible promising candidates. Next, highly reliable prediction and knowledge extraction could be enabled by focusing on this specific subspace while utilizing ML models that accommodate smaller datasets, leverage more accurate computational methods, compile more realistic approximations, and exhibit greater interpretability. Finally, the obtained physical insights could be further applied to reexamine the entire material space in an attempt to search for potential missing candidates that align well with the extracted patterns. In sum, despite the many challenges presented by the application of ML for surface reactivity prediction and high-throughput catalyst screening, we believe that this remains an extremely promising field with great potential to improve computational science, accelerate materials design, and ultimately reshape the future chemical industry and energy landscape.",
        "category": " Conclusions"
    },
    {
        "id": 22,
        "chunk": "# Acknowledgment  \n\nThis work was supported by the National Natural Science Foundation of China (22109020 and 22109082).",
        "category": " References"
    },
    {
        "id": 23,
        "chunk": "# Compliance with ethics guidelines  \n\nXinyan Liu and Hong-Jie Peng declare that they have no conflict of interest or financial conflicts to disclose.",
        "category": " Results and discussion"
    },
    {
        "id": 24,
        "chunk": "# References  \n\n[1] Catlow CR, Davidson M, Hardacre C, Hutchings GJ. Catalysis making the world a better place. Philos Trans R Soc A Eng Sci 2016;374(2061):20150089.   \n[2] Schlögl R. Heterogeneous catalysis. Angew Chem Int Ed Engl 2015;54 (11):3465–520.   \n[3] Wang AQ, Li J, Zhang T. Heterogeneous single-atom catalysis. Nat Rev Chem 2018;2(6):65–81.   \n[4] Rostrup-Nielsen JR, Sehested J, Nørskov JK. Hydrogen and synthesis gas by steam- and ${\\mathsf{C O}}_{2}$ reforming. Adv Catal 2002;47:65–139.   \n[5] Wang QR, Guo JP, Chen P. Recent progress towards mild-condition ammonia synthesis. J Energy Chem 2019;36:25–36. [6] Vogt ETC, Weckhuysen BM. Fluid catalytic cracking: recent developments on the grand old lady of zeolite catalysis. Chem Soc Rev 2015;44(20):7342–70.   \n[7] Jiang X, Nie X, Guo X, Song C, Chen JGG. Recent advances in carbon dioxide hydrogenation to methanol via heterogeneous catalysis. Chem Rev 2020;120 (15):7984–8034.   \n[8] Tomishige K, Nakagawa Y, Tamura M. Taming heterogeneous rhenium catalysis for the production of biomass-derived chemicals. Chin Chem Lett 2020;31(5):1071–7.   \n[9] Schwach P, Pan X, Bao X. Direct conversion of methane to value-added chemicals over heterogeneous catalysts: challenges and prospects. Chem Rev 2017;117(13):8497–520.   \n[10] Dai Y, Gao X, Wang Q, Wan X, Zhou C, Yang Y. Recent progress in heterogeneous metal and metal oxide catalysts for direct dehydrogenation of ethane and propane. Chem Soc Rev 2021;50(9):5590–630.   \n[11] Seh ZW, Kibsgaard J, Dickens CF, Chorkendorff I, Nørskov JK, Jaramillo TF. Combining theory and experiment in electrocatalysis: insights into materials design. Science 2017;355(6321):eaad4998.   \n[12] Bullock RM, Chen JGG, Gagliardi L, Chirik PJ, Farha OK, Hendon CH, et al. Using nature’s blueprint to expand catalysis with Earth-abundant metals. Science 2020;369(6505):eabc3183.   \n[13] Chu S, Cui Y, Liu N. The path towards sustainable energy. 
Nat Mater 2016;16(1):16–22.   \n[14] Nikolaidis P, Poullikkas A. A comparative overview of hydrogen production processes. Renew Sustain Energy Rev 2017;67:597–611.   \n[15] Lagadec MF, Grimaud A. Water electrolysers with closed and open electrochemical systems. Nat Mater 2020;19(11):1140–50.   \n[16] Zhang L, Zhao ZJ, Gong J. Nanostructured materials for heterogeneous electrocatalytic $\\mathrm{CO}_{2}$ reduction and their related reaction mechanisms. Angew Chem Int Ed Engl 2017;56(38):11326–53.   \n[17] Gao DF, Aran-Ais RM, Jeon HS, Roldan CB. Rational catalyst and electrolyte design for $\\mathrm{CO}_{2}$ electroreduction towards multicarbon products. Nat Catal 2019;2(3):198–210.   \n[18] Nitopi S, Bertheussen E, Scott SB, Liu X, Engstfeld AK, Horch S, et al. Progress and perspectives of electrochemical $\\mathrm{CO}_{2}$ reduction on copper in aqueous electrolyte. Chem Rev 2019;119(12):7610–72.   \n[19] Ross MB, De Luna P, Li YF, Dinh CT, Kim D, Yang P, et al. Designing materials for electrochemical carbon dioxide recycling. Nat Catal 2019;2(8):648–58.   \n[20] Liu XY, Li BQ, Ni B, Wang L, Peng HJ. A perspective on the electrocatalytic conversion of carbon dioxide to methanol with metallomacrocyclic catalysts. J Energy Chem 2022;64:263–75.   \n[21] Zhu Z, Li Z, Wang J, Li R, Chen H, Li Y, et al. Improving $\\mathrm{NiN}_{x}$ and pyridinic N active sites with space-confined pyrolysis for effective $\\mathrm{CO}_{2}$ electroreduction. eScience 2022;2(4):445–52.   \n[22] Gao ZQ, Li JJ, Zhang ZC, Hu WP. Recent advances in carbon-based materials for electrochemical $\\mathrm{CO}_{2}$ reduction reaction. Chin Chem Lett 2022;33(5):2270–80.   \n[23] Chen JG, Crooks RM, Seefeldt LC, Bren KL, Bullock RM, Darensbourg MY, et al. Beyond fossil fuel-driven nitrogen transformations. Science 2018;360(6391):eaar6611.   \n[24] Suryanto BHR, Du HL, Wang DB, Chen J, Simonov AN, MacFarlane DR. 
Challenges and prospects in the catalysis of electroreduction of nitrogen to ammonia. Nat Catal 2019;2(4):290–6.   \n[25] Andersen SZ, Čolić V, Yang S, Schwalbe JA, Nielander AC, McEnaney JM, et al. A rigorous electrochemical ammonia synthesis protocol with quantitative isotope measurements. Nature 2019;570(7762):504–8.   \n[26] Cui XY, Tang C, Zhang Q. A review of electrocatalytic reduction of dinitrogen to ammonia under ambient conditions. Adv Energy Mater 2018;8(22):1800369.   \n[27] Jiao Y, Zheng Y, Jaroniec M, Qiao SZ. Design of electrocatalysts for oxygen- and hydrogen-involving energy conversion reactions. Chem Soc Rev 2015;44(8):2060–86.   \n[28] McCrory CCL, Jung S, Ferrer IM, Chatman SM, Peters JC, Jaramillo TF. Benchmarking hydrogen evolving reaction and oxygen evolving reaction electrocatalysts for solar water splitting devices. J Am Chem Soc 2015;137(13):4347–57.   \n[29] Shao M, Chang Q, Dodelet JP, Chenitz R. Recent advances in electrocatalysts for oxygen reduction reaction. Chem Rev 2016;116(6):3594–657.   \n[30] Kibsgaard J, Chorkendorff I. Considerations for the scaling-up of water splitting catalysts. Nat Energy 2019;4(6):430–3.   \n[31] Nørskov JK, Bligaard T, Rossmeisl J, Christensen CH. Towards the computational design of solid catalysts. Nat Chem 2009;1(1):37–46.   \n[32] Bruix A, Margraf JT, Andersen M, Reuter K. First-principles-based multiscale modelling of heterogeneous catalysis. Nat Catal 2019;2(8):659–70.   \n[33] Chen BWJ, Xu L, Mavrikakis M. Computational methods in heterogeneous catalysis. Chem Rev 2021;121(2):1007–48.   \n[34] Motagamwala AH, Dumesic JA. Microkinetic modeling: a tool for rational catalyst design. Chem Rev 2021;121(2):1049–76.   \n[35] Greeley J. Theoretical heterogeneous catalysis: scaling relationships and computational catalyst design. Annu Rev Chem Biomol Eng 2016;7(1):605–35.   \n[36] Zhao ZJ, Liu SH, Zha SJ, Cheng DF, Studt F, Henkelman G, et al. 
Theory-guided design of catalytic materials using scaling relationships and reactivity descriptors. Nat Rev Mater 2019;4(12):792–804.   \n[37] Campbell CT. Energies of adsorbed catalytic intermediates on transition metal surfaces: calorimetric measurements and benchmarks for theory. Acc Chem Res 2019;52(4):984–93.   \n[38] Medford AJ, Vojvodic A, Hummelshoj JS, Voss J, Abild-Pedersen F, Studt F, et al. From the Sabatier principle to a predictive theory of transition-metal heterogeneous catalysis. J Catal 2015;328:36–42.   \n[39] Butler KT, Davies DW, Cartwright H, Isayev O, Walsh A. Machine learning for molecular and materials science. Nature 2018;559(7715):547–55.   \n[40] Tshitoyan V, Dagdelen J, Weston L, Dunn A, Rong Z, Kononova O, et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 2019;571(7763):95–8.   \n[41] Zhou T, Song Z, Sundmacher K. Big data creates new opportunities for materials research: a review on methods and applications of machine learning for materials design. Engineering 2019;5(6):1017–26.   \n[42] Chen A, Zhang X, Zhou Z. Machine learning: accelerating materials development for energy storage and conversion. InfoMat 2020;2(3):553–76.   \n[43] Liu Y, Guo BR, Zou XX, Li YJ, Shi SQ. Machine learning assisted materials design and discovery for rechargeable batteries. Energy Storage Mater 2020;31:434–50.   \n[44] Chen X, Liu X, Shen X, Zhang Q. Applying machine learning to rechargeable batteries: from the microscale to the macroscale. Angew Chem Int Ed   \n[45] Li JZ, Huang XB, Pianetta P, Liu YJ. Machine-and-data intelligence for synchrotron science. Nat Rev Phys 2021;3(12):766–8.   \n[46] Xu S, Li J, Cai P, Liu X, Liu B, Wang X. Self-improving photosensitizer discovery system via Bayesian search with first-principle simulations. J Am Chem Soc 2021;143(47):19769–77.   \n[47] Li SN, Liu YJ, Chen D, Jiang Y, Nie ZW, Pan F. Encoding the atomic structure for machine learning in materials science. 
Wiley Interdiscip Rev Comput Mol Sci 2022;12(1):e1558.   \n[48] Lombardo T, Duquesnoy M, El-Bouysidy H, Årén F, Gallo-Bueno A, Jørgensen PB, et al. Artificial intelligence applied to battery research: hype or reality? Chem Rev 2022;122(12):10899–969.   \n[49] Liu XY, Zhang XQ, Chen X, Zhu GL, Yan C, Huang JQ, et al. A generalizable, data-driven online approach to forecast capacity degradation trajectory of lithium batteries. J Energy Chem 2022;68:548–55.   \n[50] Lin M, Xiong J, Su M, Wang F, Liu X, Hou Y, et al. A machine learning protocol for revealing ion transport mechanisms from dynamic NMR shifts in paramagnetic battery materials. Chem Sci 2022;13(26):7863–72.   \n[51] Wang X, Jiang S, Hu W, Ye S, Wang T, Wu F, et al. Quantitatively determining surface-adsorbate properties from vibrational spectroscopy with interpretable machine learning. J Am Chem Soc 2022;144(35):16069–76.   \n[52] Oliveira JCA, Frey J, Zhang SQ, Xu LC, Li X, Li SW, et al. When machine learning meets molecular synthesis. Trends Chem 2022;4(10):863–85.   \n[53] Liu X, Peng HJ, Li BQ, Chen X, Li Z, Huang JQ, et al. Untangling degradation chemistries of lithium–sulfur batteries through interpretable hybrid machine learning. Angew Chem Int Ed Engl 2022;61(48):e202214037.   \n[54] Yao ZP, Lum YW, Johnston A, Mejia-Mendoza LM, Zhou X, Wen YG, et al. Machine learning for a sustainable energy future. Nat Rev Mater 2022;8 (3):202–15.   \n[55] Esterhuizen JA, Goldsmith BR, Linic S. Interpretable machine learning for knowledge generation in heterogeneous catalysis. Nat Catal 2022;5(3):175–84.   \n[56] Medford AJ, Kunz MR, Ewing SM, Borders T, Fushimi R. Extracting knowledge from data through catalysis informatics. ACS Catal 2018;8(8):7403–29.   \n[57] Lamoureux PS, Winther KT, Torres JAG, Streibel V, Zhao M, Bajdich M, et al. Machine learning for computational heterogeneous catalysis. ChemCatChem 2019;11(16):3581–601.   \n[58] Toyao T, Maeno Z, Takakusagi S, Kamachi T, Takigawa I, Shimizu K. 
Machine learning for catalysis informatics: recent applications and prospects. ACS Catal 2020;10(3):2260–97.   \n[59] Gu GH, Choi C, Lee Y, Situmorang AB, Noh J, Kim YH, et al. Progress in computational and machine-learning methods for heterogeneous small-molecule activation. Adv Mater 2020;32(35):1907865.   \n[60] Ma SC, Liu ZP. Machine learning for atomic simulation and activity prediction in heterogeneous catalysis: current status and future. ACS Catal 2020;10(22):13213–26.   \n[61] Xu J, Cao XM, Hu P. Perspective on computational reaction prediction using machine learning methods in heterogeneous catalysis. Phys Chem Chem Phys 2021;23(19):11155–79.   \n[62] Chen LT, Zhang X, Chen A, Yao S, Hu X, Zhou Z. Targeted design of advanced electrocatalysts by machine learning. Chin J Catal 2022;43(1):11–32.   \n[63] Cao L. Recent advances in the application of machine-learning algorithms to predict adsorption energies. Trends Chem 2022;4(4):347–60.   \n[64] Li H, Jiao Y, Davey K, Qiao SZ. Data-driven machine learning for understanding surface structures of heterogeneous catalysts. Angew Chem Int Ed 2023;62(9):e202216383.   \n[65] Mou TY, Pillai HS, Wang SW, Wan MY, Han X, Schweitzer NM, et al. Bridging the complexity gap in computational heterogeneous catalysis with machine learning. Nat Catal 2023;6(2):122–36.   \n[66] Yang H, He ZQ, Zhang MD, Tan XJ, Sun K, Liu HY, et al. Reshaping the material research paradigm of electrochemical energy storage and conversion by machine learning. EcoMat 2023;5(5):e12330.   \n[67] Hammer B, Nørskov JK. Why gold is the noblest of all the metals. Nature 1995;376(6537):238–40.   \n[68] Nørskov JK, Studt F, Abild-Pedersen F, Bligaard T. Fundamental concepts in heterogeneous catalysis. Hoboken: John Wiley & Sons, Inc.; 2014.   \n[69] Abild-Pedersen F, Greeley J, Studt F, Rossmeisl J, Munter TR, Moses PG, et al. Scaling properties of adsorption energies for hydrogen-containing molecules on transition-metal surfaces. 
Phys Rev Lett 2007;99(1):016105.   \n[70] Chowdhury AJ, Yang WQ, Walker E, Mamun O, Heyden A, Terejanu GA. Prediction of adsorption energies for chemical species on metal catalyst surfaces using machine learning. J Phys Chem C 2018;122(49):28142–50.   \n[71] Man IC, Su HY, Calle-Vallejo F, Hansen HA, Martinez JI, Inoglu NG, et al. Universality in oxygen evolution electrocatalysis on oxide surfaces. ChemCatChem 2011;3(7):1159–65.   \n[72] Latimer AA, Kulkarni AR, Aljama H, Montoya JH, Yoo JS, Tsai C, et al. Understanding trends in C–H bond activation in heterogeneous catalysis. Nat Mater 2017;16(2):225–9.   \n[73] Wang T, Cui XJ, Winther KT, Abild-Pedersen F, Bligaard T, Nørskov JK. Theory-aided discovery of metallic catalysts for selective propane dehydrogenation to propylene. ACS Catal 2021;11(10):6290–7.   \n[74] Mamun O, Winther KT, Boes JR, Bligaard T. A Bayesian framework for adsorption energy prediction on bimetallic alloy catalysts. npj Comput Mater 2020;6(1):177.   \n[75] García-Muelas R, López N. Statistical learning goes beyond the d-band model providing the thermochemistry of adsorbates on transition metals. Nat Commun 2019;10(1):4687.   \n[76] Bligaard T, Nørskov JK, Dahl S, Matthiesen J, Christensen CH, Sehested J. The Brønsted–Evans–Polanyi relation and the volcano curve in heterogeneous catalysis. J Catal 2004;224(1):206–17.   \n[77] Yu L, Abild-Pedersen F. Bond order conservation strategies in catalysis applied to the $\\mathrm{NH}_{3}$ decomposition reaction. ACS Catal 2017;7(1):864–71.   \n[78] Peng HJ, Tang MT, Liu XY, Schlexer Lamoureux P, Bajdich M, Abild-Pedersen F. The role of atomic carbon in directing electrochemical $\\mathrm{CO}_{2}$ reduction to multicarbon products. Energy Environ Sci 2021;14(1):473–82.   \n[79] Cheng YL, Hsieh CT, Ho YS, Shen MH, Chao TH, Cheng MJ. 
Examination of the Brønsted–Evans–Polanyi relationship for the hydrogen evolution reaction on transition metals based on constant electrode potential density functional theory. Phys Chem Chem Phys 2022;24(4):2476–81.   \n[80] Hummelshøj JS, Abild-Pedersen F, Studt F, Bligaard T, Nørskov JK. CatApp: a web application for surface chemistry and heterogeneous catalysis. Angew Chem Int Ed Engl 2012;51(1):272–4.   \n[81] Takahashi K, Miyazato I. Rapid estimation of activation energy in heterogeneous catalytic reactions via machine learning. J Comput Chem 2018;39(28):2405–8.   \n[82] Artrith N, Lin ZX, Chen JG. Predicting the activity and selectivity of bimetallic metal catalysts for ethanol reforming using machine learning. ACS Catal 2020;10(16):9438–44.   \n[83] Ma X, Li Z, Achenie LEK, Xin H. Machine-learning-augmented chemisorption model for $\\mathrm{CO}_{2}$ electroreduction catalyst screening. J Phys Chem Lett 2015;6 (18):3528–33.   \n[84] Li Z, Wang SW, Chin WS, Achenie LE, Xin HL. High-throughput screening of bimetallic catalysts enabled by machine learning. J Mater Chem A 2017;5 (46):24131–8.   \n[85] Praveen CS, Comas-Vives A. Design of an accurate machine learning algorithm to predict the binding energies of several adsorbates on multiple sites of metal surfaces. ChemCatChem 2020;12(18):4611–7.   \n[86] Wang S, Pillai HS, Xin H. Bayesian learning of chemisorption for bridging the complexity of electronic descriptors. Nat Commun 2020;11(1):6132.   \n[87] Göltl F, Muller P, Uchupalanun P, Sautet P, Hermans I. Developing a descriptor-based approach for CO and NO adsorption strength to transition metal sites in zeolites. Chem Mater 2017;29(15):6434–44.   \n[88] Liu C, Li YX, Takao M, Toyao T, Maeno Z, Kamachi T, et al. Frontier molecular orbital based analysis of solid–adsorbate interactions over group 13 metal oxide surfaces. J Phys Chem C 2020;124(28):15355–65.   \n[89] Jyothirmai MV, Roshini D, Abraham BM, Singh JK. 
Accelerating the discovery of g-$\\mathrm{C_{3}N_{4}}$-supported single atom catalysts for hydrogen evolution reaction: a combined DFT and machine learning strategy. ACS Appl Energy Mater 2023;6(10):5598–606.   \n[90] Liu TY, Zhao X, Liu XF, Xiao WJ, Luo ZJ, Wang WT, et al. Understanding the hydrogen evolution reaction activity of doped single-atom catalysts on two-dimensional $\\mathrm{GaPS}_{4}$ by DFT and machine learning. J Energy Chem 2023;81:93–100.   \n[91] Sun H, Li YZ, Gao LY, Chang MY, Jin XR, Li BY, et al. High throughput screening of single atomic catalysts with optimized local structures for the electrochemical oxygen reduction by machine learning. J Energy Chem 2023;81:349–57.   \n[92] Chen A, Zhang X, Chen LT, Yao S, Zhou Z. A machine learning model on simple features for $\\mathrm{CO}_{2}$ reduction electrocatalysts. J Phys Chem C 2020;124(41):22471–8.   \n[93] Andersen M, Levchenko SV, Scheffler M, Reuter K. Beyond scaling relations for the description of catalytic materials. ACS Catal 2019;9(4):2752–9.   \n[94] Fung V, Hu G, Ganesh P, Sumpter BG. Machine learned features from density of states for accurate adsorption energy prediction. Nat Commun 2021;12(1):88.   \n[95] Esterhuizen JA, Goldsmith BR, Linic S. Uncovering electronic and geometric descriptors of chemical activity for metal alloys and oxides using unsupervised machine learning. Chem Catal 2021;1(4):923–40.   \n[96] Toyao T, Suzuki K, Kikuchi S, Takakusagi S, Shimizu K, Takigawa I. Toward effective utilization of methane: machine learning prediction of adsorption energies on metal alloys. J Phys Chem C 2018;122(15):8315–26.   \n[97] Noh J, Back S, Kim J, Jung Y. Active learning with non-ab initio input features toward efficient $\\mathrm{CO}_{2}$ reduction catalysts. Chem Sci 2018;9(23):5152–9.   \n[98] Esterhuizen JA, Goldsmith BR, Linic S. Theory-guided machine learning finds geometric structure–property relationships for chemisorption on subsurface alloys. Chem 2020;6(11):3100–17.   
\n[99] Wang TR, Li JC, Shu W, Hu SL, Ouyang RH, Li WX. Machine-learning adsorption on binary alloy surfaces for catalyst screening. Chin J Chem Phys 2020;33(6):703–11.   \n[100] Zhang X, Wang Z, Lawan AM, Wang JH, Hsieh CY, Duan CR, et al. Data-driven structural descriptor for predicting platinum-based alloys as oxygen reduction electrocatalysts. InfoMat 2023;5(6):e12406.   \n[101] Montemore MM, Medlin JW. A unified picture of adsorption on transition metals through different atoms. J Am Chem Soc 2014;136(26):9272–5.   \n[102] Montemore MM, Nwaokorie CF, Kayode GO. General screening of surface alloys for catalysis. Catal Sci Technol 2020;10(13):4467–76.   \n[103] Somorjai GA, Park JY. Molecular surface chemistry by metal single crystals and nanoparticles from vacuum to high pressure. Chem Soc Rev 2008;37(10):2155–62.   \n[104] Nørskov JK, Bligaard T, Hvolbaek B, Abild-Pedersen F, Chorkendorff I, Christensen CH. The nature of the active site in heterogeneous metal catalysis. Chem Soc Rev 2008;37(10):2163–71.   \n[105] Calle-Vallejo F, Loffreda D, Koper MTM, Sautet P. Introducing structural sensitivity into adsorption–energy scaling relations by means of coordination 2015;7(5):403–10.   \n[106] Calle-Vallejo F, Tymoczko J, Colic V, Vu QH, Pohl MD, Morgenstern K, et al. Finding optimal surface sites on heterogeneous catalysts by counting nearest neighbors. Science 2015;350(6257):185–9.   \n[107] Liu X, Xiao J, Peng H, Hong X, Chan K, Nørskov JK. Understanding trends in electrochemical carbon dioxide reduction rates. Nat Commun 2017;8(1):15438.   \n[108] Choksi TS, Roling LT, Streibel V, Abild-Pedersen F. Predicting adsorption properties of catalytic descriptors on bimetallic nanoalloys with site-specific precision. J Phys Chem Lett 2019;10(8):1852–9.   \n[109] Sheldon RA. Green and sustainable manufacture of chemicals from biomass: state of the art. Green Chem 2014;16(3):950–63.   \n[110] Mondelli C, Gözaydın G, Yan N, Pérez-Ramírez J. 
Biomass valorisation over metal-based solid catalysts from nanoparticles to single atoms. Chem Soc Rev 2020;49(12):3764–82.   \n[111] Vollmer I, Jenks MJF, Roelands MCP, White RJ, Van Harmelen T, de Wild P, et al. Beyond mechanical recycling: giving new life to plastic waste. Angew Chem Int Ed Engl 2020;59(36):15402–23.   \n[112] Zhou H, Wang Y, Ren Y, Li ZH, Kong XG, Shao MF, et al. Plastic waste valorization by leveraging multidisciplinary catalytic technologies. ACS Catal 2022;12(15):9307–24.   \n[113] Hoyt RA, Montemore MM, Fampiou I, Chen W, Tritsaris G, Kaxiras E. Machine learning prediction of H adsorption energies on Ag alloys. J Chem Inf Model 2019;59(4):1357–65.   \n[114] Saxena S, Khan TS, Jalid F, Ramteke M, Haider MA. In silico high throughput screening of bimetallic and single atom alloys using machine learning and ab initio microkinetic modelling. J Mater Chem A 2020;8(1):107–23.   \n[115] Liu XY, Cai C, Zhao WH, Peng HJ, Wang T. Machine learning-assisted screening of stepped alloy surfaces for C1 catalysis. ACS Catal 2022;12(8):4252–60.   \n[116] Yang Z, Gao W, Jiang Q. A machine learning scheme for the catalytic activity of alloys with intrinsic descriptors. J Mater Chem A 2020;8(34):17507–15.   \n[117] Zong X, Vlachos DG. Exploring structure–sensitive relations for small species adsorption using machine learning. J Chem Inf Model 2022;62(18):4361–8.   \n[118] Yang J, Wang Z, Liu Z, Wang Q, Wen Y, Zhang A, et al. Rational ensemble design of alloy catalysts for selective ammonia oxidation based on machine learning. J Mater Chem A 2022;10(47):25238–48.   \n[119] Batchelor TAA, Pedersen JK, Winther SH, Castelli IE, Jacobsen KW, Rossmeisl J. High-entropy alloys as a discovery platform for electrocatalysis. Joule 2019;3(3):834–45.   \n[120] Roy D, Mandal SC, Pathak B. Machine learning-driven high-throughput screening of alloy-based catalysts for selective $\\mathrm{CO}_{2}$ hydrogenation to methanol. ACS Appl Mater Interfaces 2021;13(47):56151–63.  
 \n[121] Pandit NK, Roy D, Mandal SC, Pathak B. Rational designing of bimetallic/trimetallic hydrogen evolution reaction catalysts using supervised machine learning. J Phys Chem Lett 2022;13(32):7583–93.   \n[122] Zhang X, Li KP, Wen B, Ma J, Diao DF. Machine learning accelerated DFT research on platinum-modified amorphous alloy surface catalysts. Chin Chem Lett 2023;34(5):107833.   \n[123] Tran K, Ulissi ZW. Active learning across intermetallics to guide discovery of electrocatalysts for $\\mathrm{CO}_{2}$ reduction and $\\mathrm{H}_{2}$ evolution. Nat Catal 2018;1(9):696–703.   \n[124] Xie T, Grossman JC. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys Rev Lett 2018;120(14):145301.   \n[125] Back S, Yoon J, Tian N, Zhong W, Tran K, Ulissi ZW. Convolutional neural network of atomic surface structures to predict binding energies for high-throughput screening of catalysts. J Phys Chem Lett 2019;10(15):4401–8.   \n[126] Gu GH, Noh J, Kim S, Back S, Ulissi Z, Jung Y. Practical deep-learning representation for fast heterogeneous catalyst screening. J Phys Chem Lett 2020;11(9):3185–91.   \n[127] Jain A, Ong SP, Hautier G, Chen W, Richards WD, Dacek S, et al. Commentary: the materials project: a materials genome approach to accelerating materials innovation. APL Mater 2013;1(1):011002.   \n[128] Wang SH, Pillai HS, Wang S, Achenie LEK, Xin H. Infusing theory into deep learning for interpretable reactivity prediction. Nat Commun 2021;12(1):5288.   \n[129] Hansen K, Montavon G, Biegler F, Fazli S, Rupp M, Scheffler M, et al. Assessment and validation of machine learning methods for predicting molecular atomization energies. J Chem Theory Comput 2013;9(8):3404–19.   \n[130] Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model 2010;50(5):742–54.   \n[131] Huang B, Von Lilienfeld OA. Quantum machine learning using atom-in-molecule-based fragments selected on the fly. 
Nat Chem 2020;12(10):945–51.   \n[132] Hansen K, Biegler F, Ramakrishnan R, Pronobis W, Von Lilienfeld OA, Müller KR, et al. Machine learning predictions of molecular properties: accurate many-body potentials and nonlocality in chemical space. J Phys Chem Lett 2015;6(12):2326–31.   \n[133] Christensen AS, Bratholm LA, Faber FA, Von Lilienfeld OA. FCHL revisited: faster and more accurate quantum machine learning. J Chem Phys 2020;152(4):044107.   \n[134] Li X, Chiong R, Hu Z, Cornforth D, Page AJ. Improved representations of heterogeneous carbon reforming catalysis using machine learning. J Chem Theory Comput 2019;15(12):6882–94.   \n[135] Li X, Chiong R, Page AJ. Group and period-based representations for improved machine learning prediction of heterogeneous alloy catalysts. J Phys Chem   \n[136] Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988;28(1):31–6.   \n[137] Chowdhury AJ, Yang W, Abdelfatah KE, Zare M, Heyden A, Terejanu GA. A multiple filter based neural network approach to the extrapolation of adsorption energies on metal surfaces for catalysis applications. J Chem Theory Comput 2020;16(2):1105–14.   \n[138] Chowdhury AJ, Yang WQ, Heyden A, Terejanu GA. Comparative study on the machine learning-based prediction of adsorption energies for ring and chain species on metal catalyst surfaces. J Phys Chem C 2021;125(32):17742–8.   \n[139] Wang BC, Gu TJ, Lu YJ, Yang B. Prediction of energies for reaction intermediates and transition states on catalyst surfaces using graph-based machine learning models. Mol Catal 2020;498:111266.   \n[140] Pablo-García S, Morandi S, Vargas-Hernández RA, Jorner K, Ivković Z, López N, et al. Fast evaluation of the adsorption energy of organic molecules on metals via graph neural networks. Nat Comput Sci 2023;3(5):433–42.   \n[141] Jinnouchi R, Asahi R. 
Predicting catalytic activity of nanoparticles by a DFT-aided machine-learning algorithm. J Phys Chem Lett 2017;8(17):4279–83.   \n[142] Jager MOJ, Morooka EV, Canova FF, Himanen L, Foster AS. Machine learning hydrogen adsorption on nanoclusters through structural descriptors. npj Comput Mater 2018;4:37.   \n[143] Chen Y, Huang Y, Cheng T, Goddard III WA. Identifying active sites for $\\mathrm{CO}_{2}$ reduction on dealloyed gold surfaces by combining machine learning with multiscale simulations. J Am Chem Soc 2019;141(29):11651–7.   \n[144] Van Duin ACT, Dasgupta S, Lorant F, Goddard WA. ReaxFF: a reactive force field for hydrocarbons. J Phys Chem A 2001;105(41):9396–409.   \n[145] Naserifar S, Chen YL, Kwon S, Xiao H, Goddard III WA. Artificial intelligence and QM/MM with a polarizable reactive force field for next-generation electrocatalysts. Matter 2021;4(1):195–216.   \n[146] Jiang K, Huang YF, Zeng GS, Toma FM, Goddard III WA, Bell AT. Effects of surface roughness on the electrochemical reduction of $\\mathrm{CO}_{2}$ over Cu. ACS Energy Lett 2020;5(4):1206–14.   \n[147] Gu GH, Lim J, Wan C, Cheng T, Pu H, Kim S, et al. Autobifunctional mechanism of jagged Pt nanowires for hydrogen evolution kinetics via end-to-end simulation. J Am Chem Soc 2021;143(14):5355–63.   \n[148] Zhang JW, Hu PJ, Wang HF. Amorphous catalysis: machine learning driven high-throughput screening of superior active site for hydrogen evolution reaction. J Phys Chem C 2020;124(19):10483–94.   \n[149] Ghanekar PG, Deshpande S, Greeley J. Adsorbate chemical environment-based machine learning framework for heterogeneous catalysis. Nat Commun 2022;13(1):5788.   \n[150] Deshpande S, Maxson T, Greeley J. Graph theory approach to determine configurations of multidentate and high coverage adsorbates for heterogeneous catalysis. npj Comput Mater 2020;6(1):79.   \n[151] Cao L, Mueller T. Catalytic activity maps for alloy nanoparticles. J Am Chem Soc 2023;145(13):7352–60.   
\n[152] Zhong M, Tran K, Min Y, Wang C, Wang Z, Dinh CT, et al. Accelerated discovery of $\\mathrm{CO}_{2}$ electrocatalysts using active machine learning. Nature 2020;581(7807):178–83.   \n[153] Van der Maaten L. Accelerating t-SNE using tree-based algorithms. J Mach Learn Res 2014;15(1):3221–45.   \n[154] Pillai HS, Li Y, Wang SH, Omidvar N, Mu Q, Achenie LEK, et al. Interpretable design of Ir-free trimetallic electrocatalysts for ammonia oxidation with graph neural networks. Nat Commun 2023;14(1):792.   \n[155] Zhai S, Xie HP, Cui P, Guan DQ, Wang J, Zhao SY, et al. A combined ionic Lewis acid descriptor and machine-learning approach to prediction of efficient oxygen reduction electrodes for ceramic fuel cells. Nat Energy 2022;7(9):866–75.   \n[156] Gao Q, Pillai HS, Huang Y, Liu S, Mu Q, Han X, et al. Breaking adsorption–energy scaling limitations of electrocatalytic nitrate reduction on intermetallic CuPd nanocubes by machine-learned insights. Nat Commun 2022;13(1):2338.   \n[157] Winther KT, Hoffmann MJ, Boes JR, Mamun O, Bajdich M, Bligaard T. Catalysis-Hub.org, an open electronic structure database for surface reactions. Sci Data 2019;6(1):75.   \n[158] Chanussot L, Das A, Goyal S, Lavril T, Shuaibi M, Riviere M, et al. Open catalyst 2020 (OC20) dataset and community challenges. ACS Catal 2021;11(10):6059–72.   \n[159] Kolluru A, Shuaibi M, Palizhati A, Shoghi N, Das A, Wood B, et al. Open challenges in developing generalizable large-scale machine-learning models for catalyst discovery. ACS Catal 2022;12(14):8572–81.   \n[160] Chen C, Ye WK, Zuo YX, Zheng C, Ong SP. Graph networks as a universal machine learning framework for molecules and crystals. Chem Mater 2019;31(9):3564–72.   \n[161] Yang N, Medford AJ, Liu X, Studt F, Bligaard T, Bent SF, et al. Intrinsic selectivity and structure sensitivity of rhodium catalysts for $\\mathrm{C}_{2+}$ oxygenate production. J Am Chem Soc 2016;138(11):3705–14.   \n[162] Sundararaman R, Vigil-Fowler D, Schwarz K. 
Improving the accuracy of atomistic simulations of the electrochemical interface. Chem Rev 2022;122(12):10651–74.   \n[163] Liu X, Schlexer P, Xiao J, Ji Y, Wang L, Sandberg RB, et al. pH effects on the electrochemical reduction of $\\mathrm{CO}_{2}$ towards $\\mathrm{C}_{2}$ products on stepped copper. Nat Commun 2019;10(1):32.   \n[164] Peng HJ, Tang MT, Halldin Stenlid J, Liu X, Abild-Pedersen F. Trends in oxygenate/hydrocarbon selectivity for electrochemical $\\mathrm{CO}_{2}$ reduction to $\\mathrm{C}_{2}$ products. Nat Commun 2022;13(1):1399.   \n[165] Faber FA, Hutchison L, Huang B, Gilmer J, Schoenholz SS, Dahl GE, et al. Prediction errors of molecular machine learning models lower than hybrid DFT error. J Chem Theory Comput 2017;13(11):5255–64.   \n[166] Bogojeski M, Vogt-Maranto L, Tuckerman ME, Müller KR, Burke K. Quantum chemical accuracy from density functional approximations via machine learning. Nat Commun 2020;11(1):5223.   \n[167] Bisbo MK, Hammer B. Efficient global structure optimization with a machine-learned surrogate model. Phys Rev Lett 2020;124(8):086102.   \n[168] Behler J. First principles neural network potentials for reactive simulations of large molecular and condensed systems. Angew Chem Int Ed Engl 2017;56(42):12828–40.   \n[169] Friederich P, Häse F, Proppe J, Aspuru-Guzik A. Machine-learned potentials for next-generation matter simulations. Nat Mater 2021;20(6):750–61.",
        "category": " References"
    }
]