wl-hydrophilic-polymer/task2/task2-chunks/╔┘╤∙▒╛SI.json
2025-05-08 11:50:00 +08:00

[
{
"id": 1,
"chunk": "# nature computational science",
"category": " References"
},
{
"id": 2,
"chunk": "# Harnessing large language models for data-scarce learning of polymer properties \n\nIn the format provided by the authors and unedited \n\nSupplementary Algorithm 1 The key steps of the two-phase training strategy.",
"category": " Materials and methods"
},
{
"id": 3,
"chunk": "# 1: LLM encoder pretraining: \n\n2: Use a large dataset of unlabeled SMILES representations of polymers to pretrain an LLM encoder $\\tilde{\\mathcal{M}}_{\\mathrm{encode}}$. \n3: Phase-1 supervised pretraining: \n4: Use physics-based hypothetical polymer generation methods, such as group contribution (GC), to generate a large dataset of physically meaningful synthetic polymer structures $\\{X_{i}\\}_{i=1}^{S_{GC}}$ with the correlation of fundamental thermophysical properties; \n5: Build a physics-based model $\\mathcal{M}_{\\mathrm{physics}}$ of the real-world physical process by leveraging the fundamental properties calculated from the physically meaningful synthetic polymers; \n6: Construct a physically meaningful synthetic dataset $\\mathcal{D}_{GC}:=\\{(X_{i},\\mathcal{M}_{\\mathrm{physics}}(X_{i}))\\}_{i=1}^{S_{GC}}$; \n7: Apply supervised pretraining to the LLM decoder/predictor using the synthetic dataset $\\mathcal{D}_{GC}$ to obtain an LLM decoder $\\tilde{\\mathcal{M}}_{\\mathrm{decode}}$ with a physically consistent initial state; \n8: Phase-2 finetuning: \n9: Collect a (usually small) set of high-fidelity measurements from experiments, denoted as $\\mathcal{D}_{HF}:=\\{(X_{i}^{HF},Y_{i}^{HF})\\}_{i=1}^{S_{HF}}$; \n10: Split the high-fidelity experimental dataset $\\mathcal{D}_{HF}$ into a training set $\\mathcal{D}_{HF}^{\\mathrm{train}}$ and a test set $\\mathcal{D}_{HF}^{\\mathrm{test}}$, and finetune the phase-1 LLM $\\tilde{\\mathcal{M}}_{\\mathrm{decode}}$ using $\\mathcal{D}_{HF}^{\\mathrm{train}}$; \n11: Obtain the final physics-guided LLM, and report the prediction accuracy on the test dataset $\\mathcal{D}_{HF}^{\\mathrm{test}}$.",
"category": " Materials and methods"
}
]