update readme and add exp logs
33 docs/experiment_log.md Normal file
@@ -0,0 +1,33 @@
## Experiment log
### Preliminary results
Resources: [wandb](https://wandb.ai/peterjin/Search-R1-open)

The preliminary experiment is conducted only on the Natural Questions (NQ) dataset, with PPO and a small number of training steps.
### v0.1
Resources: [wandb](https://wandb.ai/peterjin/Search-R1-nq_hotpotqa_train), [docs](https://github.com/PeterGriffinJin/Search-R1/tree/main/scripts/nq_hotpotqa), [scripts](https://github.com/PeterGriffinJin/Search-R1/tree/main/scripts/nq_hotpotqa/v0.1)

We extend the experiments from NQ to seven datasets, with both the PPO and GRPO methods. These runs still use a small number of training steps and a large learning-rate warm-up ratio.
### v0.2
Resources: [wandb](https://wandb.ai/peterjin/Search-R1-v0.2), [docs](https://github.com/PeterGriffinJin/Search-R1/tree/main/scripts/nq_hotpotqa), [scripts](https://github.com/PeterGriffinJin/Search-R1/tree/main/scripts/nq_hotpotqa/v0.2)

We fix several bugs, including [retrieved token masking](https://github.com/PeterGriffinJin/Search-R1/pull/21) and [GRPO sample indexing](https://github.com/PeterGriffinJin/Search-R1/commit/9ec2fa9892fbf0315d0c67b4dc08ae8f6cf5f378).
The former substantially improves the stability of RL training.
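
For context, the point of retrieved token masking is that passages returned by the search engine are spliced into the rollout but are not generated by the policy, so they should not receive policy-gradient loss. Below is a minimal sketch of the idea, assuming a per-token loss mask and precomputed retrieved-token spans; the function name and span-based interface are illustrative, not the repository's actual implementation.

```python
import torch

def mask_retrieved_tokens(loss_mask: torch.Tensor,
                          retrieved_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Zero out loss-mask entries that cover retrieved passages.

    loss_mask:       (seq_len,) tensor, 1 where a token receives policy-gradient loss.
    retrieved_spans: [(start, end), ...] token-index ranges of retrieved text.
    """
    masked = loss_mask.clone()
    for start, end in retrieved_spans:
        masked[start:end] = 0  # these tokens were not sampled from the policy
    return masked
```
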
We then adjust the training scripts, increasing the number of training steps and decreasing the learning-rate warm-up ratio to obtain better performance, and conduct experiments on LLMs of different scales (3B, 7B, 14B).
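
To make the warm-up adjustment concrete, here is a minimal sketch of a linear warm-up schedule of the kind being tuned; the step count, ratio, and learning rate in the example are hypothetical, not the values used in the v0.2 scripts.

```python
def warmup_lr(step: int, base_lr: float, total_steps: int, warmup_ratio: float) -> float:
    """Linearly ramp the learning rate over the first warmup_ratio fraction of training."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr  # any post-warm-up decay is omitted for brevity

# Hypothetical numbers: with more total steps and a smaller ratio, the ramp
# ends earlier relative to training, so most updates use the full learning rate.
print(warmup_lr(step=10, base_lr=1e-6, total_steps=1000, warmup_ratio=0.05))
```
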
### v0.3
Ongoing, stay tuned!