wl-hydrophilic-polymer/task2/task2-chunks/╔┘╤∙▒╛SI.json

[
    {
        "id": 1,
        "chunk": "# Nature Computational Science",
        "category": " References"
    },
    {
        "id": 2,
        "chunk": "# Harnessing large language models for data-scarce learning of polymer properties  \n\nIn the format provided by the authors and unedited  \n\nSupplementary Algorithm 1: The key steps of the two-phase training strategy.",
        "category": " Materials and methods"
    },
    {
        "id": 3,
        "chunk": "# 1: LLM encoder pretraining:  \n\n2: Use a large dataset of unlabeled SMILES representations of polymers to pretrain an LLM encoder $\\tilde{\\mathcal{M}}_{\\mathrm{encode}}$.   \n3: Phase-1 supervised pretraining:   \n4: Use physics-based hypothetical polymer generation methods, such as group contribution (GC), to generate a large dataset of physically meaningful synthetic polymer structures $\\{X_{i}\\}_{i=1}^{S_{GC}}$ with the correlation of fundamental thermophysical properties;   \n5: Build a physics-based model $\\mathcal{M}_{\\mathrm{physics}}$ of the real-world physical process by leveraging the fundamental properties calculated from the physically meaningful synthetic polymers;   \n6: Construct a physically meaningful synthetic dataset $\\mathcal{D}_{GC}:=\\{(X_{i},\\mathcal{M}_{\\mathrm{physics}}(X_{i}))\\}_{i=1}^{S_{GC}}$;   \n7: Apply supervised pretraining to the LLM decoder/predictor using the synthetic dataset $\\mathcal{D}_{GC}$, to obtain an LLM decoder $\\tilde{\\mathcal{M}}_{\\mathrm{decode}}$ with a physically consistent initial state;   \n8: Phase-2 finetuning:   \n9: Collect a (usually small) set of high-fidelity measurements from experiments, denoted as $\\mathcal{D}_{HF}:=\\{(X_{i}^{HF},Y_{i}^{HF})\\}_{i=1}^{S_{HF}}$;   \n10: Split the high-fidelity experimental dataset $\\mathcal{D}_{HF}$ into a training set $\\mathcal{D}_{HF}^{\\mathrm{train}}$ and a test set $\\mathcal{D}_{HF}^{\\mathrm{test}}$, and finetune the phase-1 LLM $\\tilde{\\mathcal{M}}_{\\mathrm{decode}}$ using $\\mathcal{D}_{HF}^{\\mathrm{train}}$;   \n11: Obtain the final physics-guided LLM, and report the prediction accuracy on the test dataset $\\mathcal{D}_{HF}^{\\mathrm{test}}$.",
        "category": " Materials and methods"
    }
]
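
The two-phase strategy in Supplementary Algorithm 1 can be illustrated with a toy sketch. This is not the authors' implementation: it skips the encoder pretraining step, replaces the LLM decoder with a three-parameter linear regressor, and uses invented stand-ins for everything else. The functions `physics_model` and `true_property`, the feature tuples, the dataset sizes, and the learning rates are all hypothetical assumptions chosen only to show the flow of phase-1 pretraining on synthetic labels followed by phase-2 finetuning on a small high-fidelity set.

```python
import random


def physics_model(x):
    # Hypothetical stand-in for M_physics: a cheap, approximate property model.
    return 2.0 * x[0] - 1.0 * x[1]


def true_property(x):
    # Hypothetical "experimental" ground truth, close to the physics model
    # but not identical to it.
    return 2.2 * x[0] - 0.9 * x[1] + 0.3


def predict(w, x):
    # Linear predictor standing in for the LLM decoder/predictor.
    return w[0] * x[0] + w[1] * x[1] + w[2]


def gd_fit(w, data, lr, epochs):
    # Full-batch gradient descent on mean squared error, starting from w.
    w = list(w)
    n = len(data)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, y in data:
            err = predict(w, x) - y
            grad[0] += 2.0 * err * x[0] / n
            grad[1] += 2.0 * err * x[1] / n
            grad[2] += 2.0 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


def sample_inputs(rng, n):
    # Hypothetical two-dimensional features standing in for polymer structures.
    return [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(n)]


def run_two_phase(seed=0):
    rng = random.Random(seed)

    # Phase 1: build the synthetic dataset D_GC from the physics model
    # and pretrain the predictor on it.
    d_gc = [(x, physics_model(x)) for x in sample_inputs(rng, 500)]
    w_phase1 = gd_fit([0.0, 0.0, 0.0], d_gc, lr=0.1, epochs=200)

    # Phase 2: collect a small high-fidelity dataset D_HF, split it into
    # train and test, and finetune starting from the phase-1 weights.
    d_hf = [(x, true_property(x)) for x in sample_inputs(rng, 20)]
    d_train, d_test = d_hf[:10], d_hf[10:]
    w_final = gd_fit(w_phase1, d_train, lr=0.1, epochs=200)

    # Report test error before and after finetuning.
    return mse(w_phase1, d_test), mse(w_final, d_test)


if __name__ == "__main__":
    err_phase1, err_final = run_two_phase()
    print("test MSE after phase 1:", err_phase1)
    print("test MSE after phase 2:", err_final)
```

In this toy setting the phase-1 weights already sit near the experimental relationship, so the small high-fidelity training set only has to correct the remaining gap. That is the intuition behind starting finetuning from a physically consistent initial state.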