[
    {
        "id": 1,
        "chunk": "# HTA - An open-source software for assigning heads and tails to SMILES in polymerization reactions  \n\nBrenda de Souza Ferrari,∗ Ronaldo Giro,† Mathias Steiner ‡  \n\nJanuary 15, 2025",
        "category": " Abstract"
    },
    {
        "id": 2,
        "chunk": "# Abstract  \n\nArtificial Intelligence (AI) techniques are transforming the computational discovery and design of polymers. The key enablers for polymer informatics are machine-readable molecular string representations of the building blocks of a polymer, i.e., the monomers. In monomer strings, such as SMILES, symbols at the head and tail atoms indicate the locations of bond formation during polymerization. Since the linking of monomers determines a polymer’s properties, the performance of AI prediction models will, ultimately, be limited by the accuracy of the head and tail assignments in the monomer SMILES. Considering the large number of polymer precursors available in chemical databases, reliable methods for the automated assignment of head and tail atoms are needed. Here, we report a method for assigning head and tail atoms in monomer SMILES by analyzing the reactivity of their functional groups. In a reference data set containing 206 polymer precursors, the HeadTailAssign (HTA) algorithm has correctly predicted the polymer class of 204 monomer SMILES, representing an accuracy of $99\\%$. The head and tail atoms were correctly assigned to 187 monomer SMILES, representing an accuracy of $91\\%$. The HTA code is available for validation and reuse at https://github.com/IBM/HeadTailAssign.  \n\nKeywords: Polymers, Cheminformatics, Quantum Chemistry, Materials Science, Materials Discovery  \n\n![](images/7a24989afe25ae09249bb56d2202a268b7de60ee95a669bef3e3a64da9372509.jpg)",
        "category": " Abstract"
    },
    {
        "id": 3,
        "chunk": "# INTRODUCTION  \n\nPolymers are versatile materials with a wide range of applications$^{1-9}$. Their properties are mainly determined by the way in which the repeat units, or monomers, are connected within the polymer structure. Typically, there are two preferential binding sites per repeat unit, and the respective atomic positions in the structure are labeled “head” and “tail”$^{10}$. During polymerization, repeat units might connect head-to-tail, head-to-head, or tail-to-tail$^{11}$. Depending on how the repeat units are connected, intermolecular interactions between the polymer chains in the material can significantly alter the physical and chemical properties of the polymer$^{12,13}$.  \n\nIn polymer informatics$^{14}$, machine learning (ML) techniques are based on machine-readable representations of a polymer’s repeat units, e.g., the Simplified Molecular-Input Line-Entry System (SMILES)$^{15,16}$. Polymer-SMILES (p-SMILES) is an extension of SMILES in which symbols such as “\\*” indicate the polymerization points of the repeat units. Alternative representations include the “Hierarchical Editing Language for Macromolecules” (HELM)$^{17}$, the “INternational CHemical Identifier” (InChI)$^{18}$, and CurlySMILES$^{19}$. In general, these string representations are limited to homopolymers and are not suitable for capturing the stochastic nature of polymers, such as encoding randomly branched polymers, as is the case for CurlySMILES. More recently, BigSMILES$^{20}$ has been applied to represent polymers in string format. In BigSMILES, special characters such as “\\$” and the brackets “<” and “>” indicate the position of head and tail bonds between repeat units. As an advancement, BigSMILES can encode copolymers and enables topological representations of polymeric chains in complex polymers. However, BigSMILES strings provide only a qualitative description of a molecular ensemble$^{21}$.
To fully characterize a polymer, a probability and a weight must be assigned to each polymer constituent. By providing a standard format for digitizing data, PolyDAT$^{21}$ serves as a quantitative extension of BigSMILES.  \n\nAlthough string-format representations of polymers require the tagging of head and tail atoms within the repeat units, computational tools that automatically identify these positions do not yet exist. The Open Parser for Systematic IUPAC Nomenclature (OPSIN)$^{22}$ interprets organochemical nomenclature efficiently and returns the corresponding SMILES string as output. If an IUPAC polymer name is given as input, OPSIN returns the modified SMILES with head and tail atoms tagged. However, this method is limited to polymers whose nomenclature is already established. The Monomers-to-Polymers tool (M2P) uses known chemical reactions to build polymer chains from monomers$^{23}$. This approach is limited to cases where a comparison of polymer chains and repeat units reveals the positions of heads and tails.  \n\nIn this work, we report a method for assigning head and tail atoms in monomer SMILES with the objective of obtaining polymer repeat units with bond locations. Our HeadTailAssign (HTA) algorithm quantifies the reactivity of the functional groups of the monomer structure and assigns special characters to those positions in the output SMILES (see Fig. 1). In the following, we will outline the computational workflow.",
        "category": " Introduction"
    },
    {
        "id": 4,
        "chunk": "# METHODS",
        "category": " Materials and methods"
    },
    {
        "id": 5,
        "chunk": "# Algorithm  \n\nThe HTA algorithm identifies the head and tail positions of the repeat units of a polymer by analyzing the nucleophilicity of its functional groups; see the flow chart in Fig. 2. In short, the algorithm identifies functional groups within the monomer SMILES and rank-orders their reactivity based on quantum chemical calculations. On the basis of this information, the algorithm then assigns the most likely polymerization mechanism, as well as the positions of head and tail atoms.  \n\nThe input, which is provided as a csv file, contains the polymer name and either a reaction SMILES or a monomer SMILES. If a reaction SMILES is provided, HTA identifies the monomer by evaluating the Chemical Similarity between reactants and products. In short, all molecular entities are separated and their fingerprints are calculated using RDKFingerprint within RDKit$^{24}$. The fingerprints are then compared using the Tanimoto similarity$^{24,25}$. Finally, HTA selects the reactant SMILES with the highest similarity score as the monomer SMILES.  \n\nIn HTA, data processing is performed by three modules: Assigner, Gamess, and Extractor; see Fig. 2. The Assigner performs classification tasks and tags head and tail atoms. Gamess performs the quantum chemical calculations. The Extractor retrieves information from the output files of the quantum chemical simulations.  \n\nThe Assigner itself performs three operations: Get Class, Get Mechanism, and Get Head Tail. These operations define the polymer class, the polymerization mechanism, and the head and tail positions, respectively. For identifying the polymer class, the monomer SMILES is compared with the SMARTS$^{26}$ of the most common functional groups that define a polymer class. For example, polyamide contains two functional groups: amide and carboxylic acid. If one of the groups is detected as a nucleophilic site, the molecule is classified as a polyamide.  
\n\nIn the current version (2.0.0) of HTA, Get Class is able to classify five polymer classes: polyvinyl, polyamide, polyester, polyether, and polyurethane. The functional groups representative of each class are shown in Fig. 3. The most common functional group that promotes vinyl polymerization is the alkene. However, in some instances, the alkyne group is also involved.  \n\nFor polyamide, a copolymer or a cyclic monomer is required for polymerization. We account for this by means of a primary amine and a carboxylic acid, as well as a primary amine and an acyl halide. In the case of a cyclic monomer, we have chosen a secondary amide and a heterocycle monomer to broaden the classification of the polyamide class.  \n\nFor polyester formation, either a copolymer or a cyclic monomer is required. In the first case, monomers are represented by an aliphatic alcohol group and a carboxylic acid group, respectively. In the second case, they are represented by a heterocycle group and a carboxylic acid group, respectively.  \n\nPolyether polymerization requires the opening of a ring in the presence of an ether group. Therefore, the two groups were implemented in HTA to represent the polyether class. Finally, for polyurethane formation, the presence of two monomers is necessary, one with an alcohol group and the other with an isocyanate group.  \n\nIf a monomer contains functional groups compatible with multiple class definitions, it is initially assigned to all matching classes. The quantum-chemical calculations then identify the functional group with the highest reactivity, i.e. the most likely polymerization site.  \n\nTo quantify the reactivity of functional groups within the monomers, we have applied the concept of the nucleophilicity index, which is based on natural orbitals for atomic populations. 
The atomic index of nucleophilicity involving the highest occupied molecular orbital (HOMO) is defined as$^{27}$:  \n\n$$\nR_{X}={\\frac{\\sum_{\\alpha}^{X}|C_{\\alpha,n}|^{2}}{(1-\\epsilon_{n,n})}}={\\frac{\\sum_{\\alpha}^{X}|C_{\\alpha}|^{2}}{(1-\\epsilon^{\\star})}}\n$$  \n\nwhere $R_{X}$ is the nucleophilicity index of atom $X$, $C_{\\alpha,n}$ is the Molecular Orbital (MO) expansion coefficient of the $\\alpha$th atomic orbital on the $n$th MO, $\\epsilon_{n,n}$ and $\\epsilon^{\\star}$ denote the HOMO energy, $X$ is the atom index, $\\alpha$ is the index of the atomic orbital, and $n$ is the index of the MO.  \n\nWe have calculated the nucleophilicity index $R_{X}$ with the STO-3G basis set by applying Mulliken’s population analysis method$^{28-30}$. All quantum state functions were calculated at the SCF / RHF theory level using the standard ab initio quantum-chemistry package GAMESS US$^{31}$, version 2021 R2.  \n\nWe have generated the GAMESS input file with the Gamess module. In short, the module converts the SMILES string into a 3D coordinate file using the Python library Pybel$^{32}$, a Python wrapper for the OpenBabel$^{33}$ toolkit. The 3D coordinate file is generated by OpenBabel with geometry optimization using the classical Universal Force Field$^{34}$ with a maximum of 5000 optimization steps. The 3D coordinate file specifies the coordinates and chemical identity of each atom within the monomer. Finally, the Gamess module constructs the GAMESS input file by merging the keywords with the 3D coordinates in xyz format.  \n\nAll information related to Mulliken’s population of the HOMO is extracted from the GAMESS output file using the Extractor. The module calculates $R_{X}$ for each of the $X$ atoms of the monomer unit and ranks $R_{X}$ in descending order. 
In a next step, monomers containing two or more functional groups compatible with existing polymer definitions are classified as follows: the functional group with the highest $R_{X}$ is identified as the monomer’s polymerization site and the monomer is assigned to the respective polymer class.  \n\nThe polymerization mechanism is then obtained from the output of the Get Class routine. The Get Mechanism routine recognizes the class name and assigns a pre-defined mechanism. For instance, if the polymer class is identified as “polyamide”, the routine assigns the “polycondensation” mechanism to the polymer. This process is straightforward for all classes, except for the vinyl mechanism, in which subcategories exist. For example, a pro-vinyl monomer can polymerize through radical polymerization, cationic polymerization, or anionic polymerization$^{35}$. The likelihood depends on polymerization initiator, solvent, and polymerization stereochemistry.  \n\nAn option for identifying the most likely mechanism subcategory is to find the polymerization initiator. If the input is a reaction SMILES, the algorithm can detect the presence of an initiator in the reaction path by means of Chemical Similarity.  \n\nAfter classes and mechanisms are assigned, the HTA algorithm identifies the positions of the head and tail atoms in the monomer SMILES, which are labeled with the symbols “\\*:1” for head and “\\*:2” for tail, respectively. For each polymer class, the algorithm contains information about the organic function in which the most nucleophilic atom is located. For example, in the case of vinyl polymers, polymerization should occur at the double bonds and, in some cases, at the triple bond. Using the atom mappings, the nucleophilic atom is selected as head by convention. In the case where the electrophilic atom occurs in the same organic function, which is the case in vinyl polymerization, the tail is selected from the same organic function. 
In polyamides, the tail atom is located within a different organic function. In that case, HTA selects the organic function with the electrophilic atom and, by using atom mappings, assigns the tail atom accordingly. For some classes, such as polyethers and polyamides, the monomers may be structured as a cycle or a macrocycle. In those cases, the cycle is opened by SMILES manipulation, and structural errors are checked with a dedicated sanitization process. SMILES sanitization ensures that a valid molecular structure can be generated from a SMILES string$^{36}$. In this work, we use the term “sanitization” in the context of manipulating SMILES strings using regular-expression (Regex) patterns$^{37}$.  \n\nThe SMILES representation is treated as a sequence of letters without considering the connections between the atoms. For molecules with aliphatic structures, such as vinyl precursors, this simplification leads to acceptable results. However, for complex molecules, such as cyclic precursors, the connections between atoms should be accounted for.  \n\nFinally, the HTA results are compiled in csv format. They include polymer name, reaction with and without atom mappings, assignment results for monomer, polymer class, polymerization mechanism, as well as head and tail atoms.",
        "category": " Materials and methods"
    },
    {
        "id": 6,
        "chunk": "# Data  \n\nThe validation data set contains 206 data entries in total, with 149 polymers in the vinyl class, 17 in the polyamide class, 25 in the polyester class, 12 in the polyether class, and 3 in the polyurethane class.  \n\nThe 57 polymer names with polymer SMILES that belong to the polyamide, polyester, polyether, and polyurethane classes were found at Polymerdatabase.com$^{38,39}$. The 149 polymer names and (some of the) polymer SMILES that belong to the vinyl class were taken from reference$^{40}$. To complete the data entries, we have created the missing polymer SMILES either from scratch or, alternatively, by consulting OPSIN$^{41}$. For validation of the HTA algorithm, we have modified the data set by transforming polymer products into precursors, or monomers, on a case-by-case basis. The algorithm could then be tested for detecting the reaction centers of polymerization.",
        "category": " Materials and methods"
    },
    {
        "id": 7,
        "chunk": "# Validation  \n\nFor HTA validation, we have tested monomers belonging to the following classes: polyamide, polyester, polyether, and polyurethane. In addition, we have considered monomers that undergo vinyl polymerization, which could be radical, cationic, or anionic.  \n\nSpecifically, we have compared the true head and tail positions of SMILES with the positions predicted by HTA. The head and tail positions were considered as unique tokens and the difference between heads and tails was not taken into account.  \n\nTo compare the results for homopolymers, both the ground-truth and the predicted data sets are sorted by polymer name and by the monomers with heads and tails assigned (mon-HTA). The canonicalization of the SMILES is performed using RDKit, ensuring that the labeling is unambiguous. The comparison of the SMILES strings reveals whether each mon-HTA entry has the same canonical SMILES in the ground-truth and the predicted data set. Since the number of entries in the validation data set is small, we have also visually compared each individual molecular structure. The results are compiled as a Boolean series in the HTA output file, while the ground-truth and predicted structures are visualized as a png image file.",
        "category": " Materials and methods"
    },
    {
        "id": 8,
        "chunk": "# RESULTS AND DISCUSSION  \n\nWe have validated the performance of the HTA algorithm (see Fig. 2) with a data set containing 206 polymer precursors. The validation data set is described in the Methods section, and the link to the data repository is provided in the Data Availability section.  \n\nWe first evaluate the computational efficiency of the HTA algorithm. Performing the HTA assessment of the full data set required a compute time of roughly 40 minutes on a personal computer (11th Gen Intel Core i5-1135G7, Intel Iris Xe Graphics, 16 GB DDR4 memory). This corresponds to about 10 seconds per monomer SMILES, including the quantum-chemical simulations, which indicates that the HTA algorithm could be used for processing larger data sets.  \n\nIn Fig. 4, we present four polymer classification examples representing Polyamide, Polyvinyl, Polyether, and Polyurethane. With a reaction SMILES as input, the algorithm performs an initial assessment of Chemical Similarity. Because the validation data set does not contain any polymerization reactions, the algorithm continues with the polymer classification task.  \n\nIn the example shown in Fig. 4a, the monomer polymerizes to Nylon 10 by means of a polycondensation process. The algorithm’s Assigner routine identifies two functional groups in the monomer: an amino group and a carboxylic acid group. By accessing the dictionary that maps the functional groups to polymer classes, the algorithm verifies that both groups indicate the polyamide class and the polycondensation mechanism. The head assignment is performed by finding the atom mapping for the nitrogen of the amino group and the tail assignment is performed by finding the atom mapping for the carbon of the carboxylic acid group.  \n\nIn the second example shown in Fig. 4b, the HTA algorithm detects a vinyl group and an amide group. 
On the basis of the functional group selections that define each class, the HTA algorithm cannot match the monomer with a single polymer class. Therefore, the algorithm has to prioritize the functional groups for polymer class assignment. In this case, the vinyl group is selected. The first reason is that the vinyl group has a higher nucleophilicity index. The second reason is that the amide group is not mapped to any polymer class implemented in HTA. Finally, the head and tail positions are assigned to the carbon atoms forming the double bond in the vinyl group, and the structure is sanitized accordingly.  \n\nIn the third example, which is presented in Fig. 4c, the monomer is an epoxy heterocycle. For assignment of head and tail, the monomer has to undergo a ring-opening process. Because there is only one functional group in the SMILES string, the assignments of both polymer class and mechanism are straightforward. The head and tail atoms are then assigned as in the previous example.  \n\nIn the fourth example, shown in Fig. 4d, a copolymer is represented with polyurethane precursors. In this case, the algorithm identifies the relevant functional groups of each monomer, i.e. the isocyanate groups and hydroxyl groups, and groups them together for assigning the polymer class. The algorithm can now identify that those monomers belong to the polyurethane class and polymerize through polycondensation. The heads and tails are then assigned to each monomer as two separate entities.  \n\nWe have validated the HTA results by comparing them with the ground truth; the results are shown in Fig. 5. Of the 206 polymer precursors in the data set, HTA has correctly predicted the polymer class for 204 of them, which represents an accuracy of $99.0\\%$, as shown in Fig. 5a.  \n\nThe two monomers of the polyester class that were misclassified, the precursors of poly(caprolactone) and poly(4-hydroxybutyrate), are displayed in the inset of Fig. 5a. 
Both precursors are heterocycles; however, the algorithm has identified them as cyclic monomers and assigned them to the polyether class. Consequently, the head and tail positions were incorrectly assigned as well. To improve prediction accuracy, a future version of the HTA algorithm should include a definition that cyclic precursors can generate polyester oligomers.  \n\nThe validation of the head and tail assignment (see Fig. 5b) reveals that the algorithm has correctly assigned the positions in 187 cases, representing an accuracy of $90.8\\%$. Within the polyurethane class, all monomers were correctly assigned.  \n\nWithin the polyvinyl class, incorrect head and tail assignments occurred in two of 149 monomers. A possible explanation is the presence of large groups connected by one of the double bonds in their structures. Polymerization of vinyl monomers follows the polyaddition mechanism in which the reactive site, i.e. the double bond, is attacked by an initiator. The initiator breaks the double bond by forming a single bond with one of the carbon atoms. The second carbon atom remains available to grow the polymer chain$^{42}$. The attack of the double bond follows chemical rules, and there are situations in which the most nucleophilic atom is not available as a reactive site. As shown in Fig. 6a, the structures of Poly(2-t-butyl-1,4-butadiene) and Poly(2-bromo-1,4-butadiene) contain t-butyl and bromine groups, respectively. These groups act as electron donors, increasing the electron population of the vicinal carbon atoms. However, because of their bulkiness, they might also increase the steric hindrance in the region. The current version of the HTA algorithm does not incorporate specific rules for predicting steric hindrance. As a result, the head and tail positions were simply assigned to the region with the highest nucleophilicity.  
\n\nIn some cases, we have observed that the head-and-tail assignment is correct but the sanitization of the SMILES structure is incorrect, in particular if a ring-opening process is involved. We show the example of the poly(3-hydroxybutyrate) polymer with its heterocycle precursor in Fig. 6b. The head and tail positions were correctly assigned to the oxygen atom of the oxetane ring and to the carbon atom of the carbonyl group. However, the ring-opening process performed by the HTA algorithm generated an incorrect SMILES that cannot be visualized.  \n\nIn another example shown in Fig. 6b, the polylactic acid polymer was correctly classified as polyester. However, the polymer head was incorrectly assigned to the carbon atom next to the hydroxyl group. Although the carbonyl and hydroxyl groups were correctly identified as the most nucleophilic regions, the sanitization process removed the hydroxyl group from the tail position but left the hydroxyl group in the structure, leading to incorrect head assignment.  \n\nSuch sanitization issues occurred mainly in the polyether class, in which all precursors, except polyacetal, are cyclic structures. As shown in Fig. 5b, over $90\\%$ of the structures were incorrectly assigned due to issues associated with the ring-opening process. In the case of the poly(hexamethylene oxide) polymer (see Fig. 6c), the sanitization process did not produce a SMILES structure for visual evaluation. We have observed improper SMILES sequences in 2 cases and an improper ring-opening process in 8 cases. In the case of poly(propylene glycol), the sanitization step has generated a proper SMILES structure; however, the ring-opening process was performed in a manner that has led to an incorrect assignment of the tail position.  \n\nThe same issue was observed in the sole instance in which an incorrect assignment was made within the polyamide class. 
In the example shown in Fig. 6d, the ring in the precursor of Nylon 3 was incorrectly sanitized and generated a false structure with the amide bond intact. In general, the opening of the heterocyclic ring is the most significant challenge of the validation process. Future extensions of the HTA algorithm will require a robust sanitization process for complex monomers, such as cyclic precursors. A potential pathway could be the representation of precursor molecules as graphs during the sanitization phase, with atoms designated as nodes and bonds as edges. The graph representation would allow for the assignment of the bond to be broken, indicated by the edges to be deleted. In addition, atoms or groups of atoms could be deleted or added by indicating the respective nodes. For enhancing the accuracy of the head and tail assignment, we suggest considering the HOMO and LUMO orbitals as the nucleophilic and electrophilic sites, respectively. In addition, considering the comprehensive chemical information provided by the frontier orbitals may be beneficial.  \n\nBeyond the methodological limitations discussed above, the lack of polymer data outside the polyvinyl class has posed a severe limitation for testing and validating the HTA algorithm. We hope that by making the initial data set publicly available, the computational chemistry community can contribute more data to each polymer class. In addition, improving the existing chemical rules and adding new polymerization classes should enhance the usefulness of the HTA algorithm.",
        "category": " Results and discussion"
    },
    {
        "id": 9,
        "chunk": "# SUMMARY & CONCLUSIONS  \n\nWe have reported HTA, an algorithm for assigning head and tail atoms in monomer SMILES based on the reactivity of their functional groups. In a reference data set of monomer SMILES, the HTA algorithm has correctly predicted the polymer class with an accuracy of $99\\%$. The head and tail atoms were correctly assigned with an accuracy of $91\\%$.  \n\nFuture extensions of the HTA algorithm will require a robust SMILES sanitization process for complex monomers. For enhancing the accuracy of the head-and-tail assignments, we suggest including an analysis of LUMO and frontier orbitals in the quantum chemical simulation process. A refinement of the implemented chemical rules and the addition of new polymerization classes should lead to further HTA performance enhancements. To overcome the data bottleneck, we encourage researchers to contribute more data to each polymer class of the initial data set.",
        "category": " Conclusions"
    },
    {
        "id": 10,
        "chunk": "# ACKNOWLEDGMENTS  \n\nWe thank Matteo Manica and Teodoro Laino (both IBM Research) for their support in the application of HTA.",
        "category": " References"
    },
    {
        "id": 11,
        "chunk": "# DATA AVAILABILITY  \n\nThe data set “input.csv” for validating the HTA algorithm is available under the doi: 10.24435/materialscloud:tx-b9 at:  \n\nhttps://archive.materialscloud.org/record/2025.6  \n\nThe HTA output file “output hta.csv” is available under the doi: 10.24435/materialscloud:tx-b9 at:  \n\nhttps://archive.materialscloud.org/record/2025.6",
        "category": " References"
    },
    {
        "id": 12,
        "chunk": "# CODE AVAILABILITY  \n\nThe HTA source code is available at https://github.com/IBM/HeadTailAssign.",
        "category": " References"
    },
    {
        "id": 13,
        "chunk": "# References  \n\n1. M. Y. Yuhazri, A. J. Zulfikar and A. Ginting, IOP Conference Series: Materials Science and Engineering, 2020, 1003, 012135.   \n2. C.-T. Chen and K. S. Suslick, Coordination Chemistry Reviews, 1993, 128, 293–322.   \n3. Y. K. Sung and S. W. Kim, Biomaterials Research, 2020, 24,.   \n4. F. Sabbagh and B. S. Kim, Journal of Controlled Release, 2022, 341, 132–146.   \n5. C. Chen, H. Ou, R. Liu and D. Ding, Advanced Materials, 2019, 32, 1806331.   \n6. Q. Zheng, Z. Duan, Y. Zhang, X. Huang, X. Xiong, A. Zhang, K. Chang and Q. Li, Molecules, 2023, 28, 5091.   \n7. S. Behera and P. A. Mahanwar, Polymer-Plastics Technology and Materials, 2019, 59, 341–356.   \n8. K. Sampathkumar, K. X. Tan and S. C. J. Loo, iScience, 2020, 23, 101055.   \n9. J. Chen, Y. Zhu, J. Huang, J. Zhang, D. Pan, J. Zhou, J. E. Ryu, A. Umar and Z. Guo, Polymer Reviews, 2020, 61, 157–193.   \n10. R. W. Lenz, Organic chemistry of synthetic high polymers, Interscience Publishers, New York, 1967.   \n11. A. A. Askadskii, Computational materials science of polymers, Cambridge Int Science Publishing, 2003.   \n12. Q. Wang, R. Takita, Y. Kikuzaki and F. Ozawa, Journal of the American Chemical Society, 2010, 132, 11420–11421.   \n13. N. Sazali, H. Ibrahim, A. S. Jamaludin, M. A. Mohamed, W. N. W. Salleh and M. N. Z. Abidin, IOP Conference Series: Materials Science and Engineering, 2020, 788, 012047.   \n14. L. Chen, G. Pilania, R. Batra, T. D. Huan, C. Kim, C. Kuenneth and R. Ramprasad, Materials Science and Engineering: R: Reports, 2021, 144, 100595.   \n15. D. Weininger, Journal of Chemical Information and Computer Sciences, 1988, 28, 31–36.   \n16. D. Weininger, A. Weininger and J. L. Weininger, Journal of Chemical Information and Computer Sciences, 1989, 29, 97–101.   \n17. T. Zhang, H. Li, H. Xi, R. V. Stanton and S. H. Rotstein, Journal of Chemical Information and Modeling, 2012, 52, 2796–2806.   \n18. S. R. Heller, A. McNaught, I. Pletnev, S. Stein and D. 
Tchekhovskoi, Journal of Cheminformatics, 2015, 7,.   \n19. A. Drefahl, Journal of Cheminformatics, 2011, 3,.   \n20. T.-S. Lin, C. W. Coley, H. Mochigase, H. K. Beech, W. Wang, Z. Wang, E. Woods, S. L. Craig, J. A. Johnson, J. A. Kalow, K. F. Jensen and B. D. Olsen, ACS Central Science, 2019, 5, 1523–1531.   \n21. T.-S. Lin, N. J. Rebello, H. K. Beech, Z. Wang, B. El-Zaatari, D. J. Lundberg, J. A. Johnson, J. A. Kalow, S. L. Craig and B. D. Olsen, Journal of Chemical Information and Modeling, 2021, 61, 1150–1163.   \n22. D. M. Lowe, P. T. Corbett, P. Murray-Rust and R. C. Glen, Journal of Chemical Information and Modeling, 2011, 51, 739–753.   \n23. N. Wilson, P. St. John and M. Crowley, m2p (Monomers to Polymers), 2020, https://www.osti.gov/doecode/biblio/44795.   \n24. RDKit: Open-source cheminformatics, https://www.rdkit.org, DOI: 10.5281/zenodo.591637.   \n25. T. T. Tanimoto, Elementary mathematical theory of classification and prediction, International Business Machines Corp., 1958.   \n26. D. C. I. System, Daylight Theory Manual, Daylight Chemical Information System, 2011.   \n27. D. W. Szczepanik and J. Mrozek, Journal of Chemistry, 2013, 2013,.   \n28. R. S. Mulliken, The Journal of Chemical Physics, 1955, 23, 1833–1840.   \n29. R. S. Mulliken, The Journal of Chemical Physics, 1955, 23, 1841–1846.   \n30. R. S. Mulliken, The Journal of Chemical Physics, 1955, 23, 2343–2346.   \n31. G. M. J. Barca, C. Bertoni, L. Carrington, D. Datta, N. De Silva, J. E. Deustua, D. G. Fedorov, J. R. Gour, A. O. Gunina, E. Guidez, T. Harville, S. Irle, J. Ivanic, K. Kowalski, S. S. Leang, H. Li, W. Li, J. J. Lutz, I. Magoulas, J. Mato, V. Mironov, H. Nakata, B. Q. Pham, P. Piecuch, D. Poole, S. R. Pruitt, A. P. Rendell, L. B. Roskop, K. Ruedenberg, T. Sattasathuchana, M. W. Schmidt, J. Shen, L. Slipchenko, M. Sosonkina, V. Sundriyal, A. Tiwari, J. L. Galvez Vallejo, B. Westheimer, M. Wloch, P. Xu, F. Zahariev and M. S. 
Gordon, The Journal of Chemical Physics, 2020, 152, 154102.   \n32. N. M. O'Boyle, C. Morley and G. R. Hutchison, Chemistry Central Journal, 2008, 2,.   \n33. N. M. O'Boyle, M. Banck, C. A. James, C. Morley, T. Vandermeersch and G. R. Hutchison, Journal of Cheminformatics, 2011, 3,.   \n34. A. K. Rappe, C. J. Casewit, K. S. Colwell, W. A. Goddard and W. M. Skiff, Journal of the American Chemical Society, 1992, 114, 10024–10035.   \n35. P. Bruice, Organic Chemistry, Pearson/Prentice Hall, 2004.   \n36. The RDKit Book - Molecular Sanitization, https://www.rdkit.org/docs/RDKit_Book.html#molecular-sanitization, Accessed: 2025-01-08.   \n37. G. Van Rossum and F. L. Drake, Python 3 Reference Manual, CreateSpace, Scotts Valley, CA, 2009.   \n38. Polymerdatabase.com, https://www.polymerdatabase.com/main.html, Accessed: 2023-05-09.   \n39. Wayback Machine of Polymerdatabase.com, https://web.archive.org/web/20230324233129/http://polymerdatabase.com/polymer%20index/home.html, Accessed: 2023-05-09.   \n40. J. Bicerano, Prediction of polymer properties, CRC Press, 2002.   \n41. D. M. Lowe, P. T. Corbett, P. Murray-Rust and R. C. Glen, Chemical name to structure: OPSIN, an open source solution, 2011.   \n42. T. A. Saleh and V. K. Gupta, in Synthesis of Nanomaterial–Polymer Membranes by Polymerization Methods, Elsevier, 2016, pp. 135–160.   \n43. Quantum chemistry with Python, https://pyscf.org/.   \n44. Q. Sun, X. Zhang, S. Banerjee, P. Bao, M. Barbry, N. S. Blunt, N. A. Bogdanov, G. H. Booth, J. Chen, Z.-H. Cui, J. J. Eriksen, Y. Gao, S. Guo, J. Hermann, M. R. Hermes, K. Koh, P. Koval, S. Lehtola, Z. Li, J. Liu, N. Mardirossian, J. D. McClain, M. Motta, B. Mussard, H. Q. Pham, A. Pulkin, W. Purwanto, P. J. Robinson, E. Ronca, E. R. Sayfutyarova, M. Scheurer, H. F. Schurkus, J. E. T. Smith, C. Sun, S.-N. Sun, S. Upadhyay, L. K. Wagner, X. Wang, A. White, J. D. Whitfield, M. J. Williamson, S. Wouters, J. Yang, J. M. Yu, T. Zhu, T. C. Berkelbach, S. Sharma, A. Y. Sokolov and G. K.-L. 
Chan, The Journal of Chemical Physics, 2020, 153, 024109.   \n45. Q. Sun, T. C. Berkelbach, N. S. Blunt, G. H. Booth, S. Guo, Z. Li, J. Liu, J. D. McClain, E. R. Sayfutyarova, S. Sharma, S. Wouters and G. K. Chan, WIREs Computational Molecular Science, 2017, 8, e1340.   \n46. Q. Sun, Journal of Computational Chemistry, 2015, 36, 1664–1671.  \n\nFigure 1: Visual representation of the HeadTailAssign (HTA) method with poly(isobutyl acrylate) as an example. Isobutyl acrylate is shown on the left and poly(isobutyl acrylate) on the right. Based on quantum chemical predictions of nucleophilicity, the HTA method identifies the atomic locations at which polymerization reactions occur and assigns head and tail positions. The 3D molecular structure visualization was generated with RDKit $^{24}$ . Starting from a SMILES string, hydrogen atoms were added and conformers were generated with a distance-geometry-based conformer generator. Finally, the structure was optimized with the UFF force field, and the canonical HOMO was calculated with PySCF $^{43-46}$ at the SCF/RHF level of theory with the STO-3G basis set.  \n\nFigure 3: Functional groups for the automated polymer class assignment with the HTA algorithm. Functional groups mapped to (a) Polyvinyls, (b) Polyamides, (c) Polyesters, (d) Polyethers, and (e) Polyurethanes.  \n\nFigure 4: Representative examples of automated polymer classification and head/tail assignments with the HTA algorithm. (a) Nylon10 - Poly(decano-10-lactam), (b) Polyacrylamide, (c) Poly(ethylene glycol), and (d) Poly[(diethylene glycol)-alt-(1,6-hexamethylene diisocyanate)]. The symbol “\\*:1” indicates a head atom; the symbol “\\*:2” indicates a tail atom. Different colors indicate different functional groups: orange - amino, blue - carboxylic acid, red - vinyl, brown - amide, pink - ether heterocycle, yellow - isocyanate, green - hydroxyl.  \n\nFigure 5: Comparison between HTA predictions and ground-truth data. 
(a) Predicted polymer classes (orange) and ground-truth data (green). An example of an incorrect HTA prediction (misclassification) is shown with the respective canonical SMILES representation. (b) HTA-based head and tail assignments (red) and ground-truth data (blue). The symbol “\\*:1” indicates a head atom; the symbol “\\*:2” indicates a tail atom.  \n\nFigure 6: Representative examples of incorrect head/tail assignments by the HTA algorithm in the class of (a) Polyvinyl, (b) Polyester, (c) Polyether, and (d) Polyamide. The symbol “\\*” in the canonical SMILES representations indicates incorrect structure sanitization. The symbol “\\*:1” indicates a head atom; the symbol “\\*:2” indicates a tail atom. N/A indicates not applicable, since no valid molecular structure was generated.  \n\n![](images/5b342b9a36c6f1caa4bfae51d138dc9185b5904b1c82662be0841bfbbb442112.jpg)  \nFigure 1  \n\n![](images/83f3e3e3fe7024d9eb4d8d92c29c11323e41dd6c687f73e165e2f5cd20acf11f.jpg)  \nFigure 2  \n\n![](images/a70ce56e58f6558e267ee492c8bbd060128c7fb914001d318c689457a60d7981.jpg)  \nFigure 3  \n\n![](images/9bbab022167cda4aa9e2d8dcba7aa73dd0e4df1fa9dbc552b8ff54afe75c1732.jpg)  \nFigure 4  \n\n![](images/979b9b7e625dd6419aa5380b36ba75404cf73369d74ebeb00844d5a65c3a968e.jpg)  \nFigure 5  \n\n![](images/8f1c1fdd4c3479a0a1af09cf7074bb17851f5b9e0bc1d94d9fbd96506b54b9f7.jpg)  \nFigure 6",
        "category": " References"
    }
]