sci-gui-agent-benchmark/mm_agents/gui_som/data_preparation/README.md

1. Fetch the URLs from the Majestic Million list and save them to `majestic_million.csv`:
```bash
python3 majestic_million.py
```
2. Run the Scrapy spider to crawl the saved URLs and collect the data:
```bash
python3 scrapy_crawler.py
```
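
As a rough illustration of how the CSV from step 1 might feed the crawler in step 2, the sketch below turns rows of a domain list into start URLs. It is not taken from `majestic_million.py` or `scrapy_crawler.py`: the `Domain` column name, the `https` scheme, and the helper `domains_to_urls` are all assumptions made for the example, so check the actual header of your `majestic_million.csv` before relying on it.

```python
import csv
import io


def domains_to_urls(csv_file, limit=None, scheme="https"):
    """Yield one URL per row of a CSV that has a `Domain` column.

    Hypothetical helper: the column name and the scheme are assumptions.
    `limit` caps the number of rows read (None reads every row).
    """
    reader = csv.DictReader(csv_file)
    for i, row in enumerate(reader):
        if limit is not None and i >= limit:
            break
        domain = row["Domain"].strip()
        if domain:  # skip rows with an empty domain field
            yield f"{scheme}://{domain}"


# Small in-memory sample standing in for majestic_million.csv (made-up rows).
sample = io.StringIO(
    "GlobalRank,Domain\n"
    "1,example.com\n"
    "2,example.org\n"
    "3,example.net\n"
)
urls = list(domains_to_urls(sample, limit=2))
print(urls)  # ['https://example.com', 'https://example.org']
```

To try it on the real file, open it with `open("majestic_million.csv", newline="")` and pass the file object in place of `sample`. A small `limit` is handy for a test crawl before running over the full list.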