finetuning deepseek

April 18, 2025


strong compute hosted a hackathon for ARC AGI and for finetuning deepseek

i was inspired by this tweet, which used the prompt: "In pure three.js, without downloading any assets or textures, visualize a spaceship launching from the surface of earth and reaching the surface of the moon."

i wanted to finetune deepseek to generate three.js code.

first i had to understand GRPO (group relative policy optimization), the RL algorithm deepseek used to train r1.

then came figuring out how to actually finetune deepseek.
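the core idea of GRPO, as i understand it: sample a group of completions per prompt, score each one with the reward function, and use the group-normalized reward as the advantage, so no separate value model is needed. a minimal sketch of just the advantage computation (this is my own stripped-down illustration, not deepseek's implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each completion's reward
    by the mean and std of its group (one group = the samples for one prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    if std == 0:
        return [0.0 for _ in rewards]  # all rewards equal -> no learning signal
    return [(r - mean) / std for r in rewards]

# e.g. four completions for one prompt, scored by the reward function
advs = grpo_advantages([0.2, 0.5, 0.8, 0.5])
```

completions scoring above the group mean get a positive advantage and are pushed up; below-mean ones are pushed down.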

for data curation, i used curator by bespokelabs to generate synthetic data with claude 3.7 sonnet with extended thinking.
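curator wraps the actual model calls; the rough shape of the pipeline looks something like this (a hedged sketch with a stubbed model call — the function and variable names here are hypothetical, not curator's real API):

```python
import json

# hypothetical stand-in for the hosted model call
# (in practice this would be claude 3.7 sonnet with extended thinking)
def generate_completion(prompt: str) -> dict:
    return {"reasoning": "...", "code": "// three.js scene for: " + prompt}

# seed scene ideas, varied to get diverse synthetic examples
SCENE_IDEAS = [
    "a spaceship launching from the surface of earth and reaching the moon",
    "a rotating low-poly planet with orbiting satellites",
    "a particle-based fireworks display over a city skyline",
]

def build_dataset(ideas):
    records = []
    for idea in ideas:
        prompt = (
            "In pure three.js, without downloading any assets or textures, "
            f"visualize {idea}."
        )
        completion = generate_completion(prompt)
        records.append({"prompt": prompt, **completion})
    return records

dataset = build_dataset(SCENE_IDEAS)
print(json.dumps(dataset[0], indent=2))
```

each record keeps the reasoning trace alongside the code, since the point is a reasoning dataset, not just prompt/code pairs.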

bespokelabs is also hosting a Reasoning Datasets Competition that i can submit this dataset to.

the next essential piece is the reward functions.

to come up with the rewards, i used claude 3.7 sonnet with extended thinking on the web, giving it the three.js code generated by deepseek and claude as references. i ended up with the weights below:

```python
combined_score = (
      0.12 * syntax_reward       # basic correctness
    + 0.08 * reasoning_reward    # quality of explanation
    + 0.05 * format_reward       # proper formatting
    + 0.15 * length_reward       # appropriate length
    + 0.15 * creativity_reward   # creative solutions
    + 0.15 * animation_reward    # animation quality
    + 0.10 * performance_reward  # performance optimizations
    + 0.08 * responsive_reward   # responsive design
    + 0.10 * interaction_reward  # user interaction
    + 0.02 * rouge_reward        # similarity to reference, if available
)
```
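to sanity-check the weights, here's a sketch of the combiner with the per-criterion rewards stubbed out (i'm assuming each one returns a score in [0, 1]; the individual reward implementations are the real work and aren't shown here):

```python
# the weights from above; they sum to 1.0, so combined_score stays in [0, 1]
WEIGHTS = {
    "syntax": 0.12, "reasoning": 0.08, "format": 0.05, "length": 0.15,
    "creativity": 0.15, "animation": 0.15, "performance": 0.10,
    "responsive": 0.08, "interaction": 0.10, "rouge": 0.02,
}

assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

def combined_score(rewards: dict) -> float:
    """Weighted sum of per-criterion rewards, each expected in [0, 1].
    Missing criteria (e.g. no reference for rouge) default to 0."""
    return sum(WEIGHTS[name] * rewards.get(name, 0.0) for name in WEIGHTS)

# a completion that maxes out every criterion scores 1.0
perfect = combined_score({name: 1.0 for name in WEIGHTS})
```

one thing this makes obvious: the 0.02 rouge weight barely moves the score, which is part of why the whole thing feels overengineered.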

but this seems overengineered. i need to tweak this more.