strong compute hosted a hackathon on ARC-AGI and on finetuning deepseek.
it was inspired by this tweet, which came with the prompt: "In pure three.js, without downloading any assets or textures, visualize a spaceship launching from the surface of earth and reaching the surface of the moon."
i wanted to finetune deepseek to generate three.js code.
first i had to understand GRPO. these resources helped (a minimal sketch of the core idea follows the list):
- Group Relative Policy Optimization (GRPO) - Formula and Code - YouTube
- DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge
- Bite: How Deepseek R1 was trained
- Recent reasoning research: GRPO tweaks, base model RL, and data curation
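the short version of GRPO as i understand it: for each prompt you sample a group of completions, score each one with a reward function, and normalize every reward against the group's mean and std to get an advantage, so there's no separate value model like in PPO. here's a minimal sketch of that advantage step (my own illustration, not code from the resources above):

```python
import numpy as np

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage: each completion's reward is normalized
    against the mean/std of the other completions sampled for the same prompt."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 completions sampled for one prompt, scored by the reward function
print(grpo_advantages([0.2, 0.9, 0.4, 0.5]))
# completions above the group mean get pushed up, the rest get pushed down
```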
then i had to figure out how to actually finetune deepseek. these posts were useful (a training-loop sketch follows the list):
- Fine-tune Deepseek-R1 with a Synthetic Reasoning Dataset
- Fine-Tuning Qwen-0.5B and Llama3.2-1B with GRPO & LLM-J to Outperform OpenAI’s O1-Preview in Q&A - Datawizz
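both posts converge on the same overall shape: a prompt dataset, one or more reward functions, and a GRPO trainer. here's a hedged sketch of that loop using trl's GRPOTrainer; the model id, dataset, and config values are placeholders rather than what i actually ran:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# placeholder prompt dataset; in my case, prompts asking for three.js scenes
dataset = Dataset.from_list([
    {"prompt": "In pure three.js, visualize a spaceship launching from earth to the moon."},
])

def length_reward(completions, **kwargs):
    # toy reward just to show the interface: prefer non-trivially-short completions
    return [min(len(c) / 2000.0, 1.0) for c in completions]

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # assumed distill checkpoint
    reward_funcs=[length_reward],
    args=GRPOConfig(output_dir="grpo-threejs",
                    num_generations=4,
                    max_completion_length=1024),
    train_dataset=dataset,
)
trainer.train()
```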
i also needed data curation, so i used curator by bespokelabs to generate synthetic data with claude 3.7 sonnet with extended thinking.
they're hosting a Reasoning Datasets Competition that i can submit to
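the curation step itself is small: subclass curator's LLM wrapper, describe how to build the prompt for each seed idea and how to parse the response, then map it over the seeds. the class and method names below follow my reading of the bespokelabs curator docs and the model id is a placeholder, so treat the details as assumptions rather than the exact pipeline i ran:

```python
from bespokelabs import curator

class ThreeJSCurator(curator.LLM):
    def prompt(self, input: dict) -> str:
        # ask for the reasoning plus the full scene for each seed idea
        return (f"Write a complete three.js scene for: {input['idea']}. "
                "Think through the scene graph, camera, lighting and animation, "
                "then give the full code.")

    def parse(self, input: dict, response: str) -> dict:
        return {"idea": input["idea"], "completion": response}

# model id is a placeholder for claude 3.7 sonnet with extended thinking
generator = ThreeJSCurator(model_name="claude-3-7-sonnet-20250219")
dataset = generator([{"idea": "a spaceship launching from earth to the moon"}])
```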
the next essential piece is the reward functions. these were the main pointers (an example of the reward signature a GRPO trainer expects follows the list):
- Reasoning - GRPO & RL | Unsloth Documentation
- will brown on X: "a beautiful reward function https://t.co/kb5xvRQNtR"
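the common thread in both: a reward is just a function that takes a batch of completions and returns one score per completion, and you can stack several of them. a hedged sketch of a format-style reward in that shape (the tags and the regex are illustrative, not the exact check i used):

```python
import re

# reward completions that keep reasoning inside <think>...</think> and then
# put the three.js code in a fenced block (illustrative heuristic)
THINK_THEN_CODE = re.compile(r"<think>.*?</think>.*?```(?:js|javascript)?\n.*?```", re.DOTALL)

def format_reward(completions, **kwargs):
    return [1.0 if THINK_THEN_CODE.search(c) else 0.0 for c in completions]
```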
from what i've gathered, and using claude 3.7 sonnet with extended thinking on the web to draft the rewards (with the three.js code generated by deepseek and claude as a reference), i ended up with the combined reward below:
combined_score = (
      0.12 * syntax_reward       # Basic correctness
    + 0.08 * reasoning_reward    # Quality of explanation
    + 0.05 * format_reward       # Proper formatting
    + 0.15 * length_reward       # Appropriate length
    + 0.15 * creativity_reward   # Creative solutions
    + 0.15 * animation_reward    # Animation quality
    + 0.10 * performance_reward  # Performance optimizations
    + 0.08 * responsive_reward   # Responsive design
    + 0.10 * interaction_reward  # User interaction
    + 0.02 * rouge_reward        # Similarity to reference if available
)
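the weights sum to 1, so presumably each component is a per-completion score in [0, 1] and the weighted sum is computed per completion. for concreteness, here's roughly the shape of the syntax component; the real heuristics were claude-drafted and more involved, so read this as an illustration rather than the exact code:

```python
def syntax_reward(completions, **kwargs):
    # crude proxy for "plausibly runnable three.js": check for the building
    # blocks every scene needs (illustrative, not the exact checks i used)
    required = ["THREE.Scene", "PerspectiveCamera", "WebGLRenderer", "requestAnimationFrame"]
    return [sum(tok in c for tok in required) / len(required) for c in completions]
```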
but this seems overengineered. i need to tweak this more.