strong compute hosted a hackathon on ARC-AGI and on finetuning deepseek.
it was inspired by this tweet, which came with the prompt: "In pure three.js, without downloading any assets or textures, visualize a spaceship launching from the surface of earth and reaching the surface of the moon."
i wanted to finetune deepseek to generate three.js code.
first i had to understand GRPO. these resources helped (a minimal sketch of the core idea follows the list):
- Group Relative Policy Optimization (GRPO) - Formula and Code - YouTube
- DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge
- Bite: How Deepseek R1 was trained
- Recent reasoning research: GRPO tweaks, base model RL, and data curation
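the short version of GRPO as i understand it: for each prompt you sample a group of completions, score each one with a reward function, and normalize every reward against the group's mean and std to get an advantage, so there's no separate value model like in PPO. here's a minimal sketch of that advantage step (my own illustration, not code from the resources above):

```python
import numpy as np

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage: each completion's reward is normalized
    against the mean/std of the other completions sampled for the same prompt."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 completions sampled for one prompt, scored by the reward function
print(grpo_advantages([0.2, 0.9, 0.4, 0.5]))
# completions above the group mean get pushed up, the rest get pushed down
```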
then i had to figure out how to actually finetune deepseek. these posts were useful (a training-loop sketch follows the list):
- Fine-tune Deepseek-R1 with a Synthetic Reasoning Dataset
- Fine-Tuning Qwen-0.5B and Llama3.2-1B with GRPO & LLM-J to Outperform OpenAI’s O1-Preview in Q&A - Datawizz
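both posts converge on the same overall shape: a prompt dataset, one or more reward functions, and a GRPO trainer. here's a hedged sketch of that loop using trl's GRPOTrainer; the model id, dataset, and config values are placeholders rather than what i actually ran:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# placeholder prompt dataset; in my case, prompts asking for three.js scenes
dataset = Dataset.from_list([
    {"prompt": "In pure three.js, visualize a spaceship launching from earth to the moon."},
])

def length_reward(completions, **kwargs):
    # toy reward just to show the interface: prefer non-trivially-short completions
    return [min(len(c) / 2000.0, 1.0) for c in completions]

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # assumed distill checkpoint
    reward_funcs=[length_reward],
    args=GRPOConfig(output_dir="grpo-threejs",
                    num_generations=4,
                    max_completion_length=1024),
    train_dataset=dataset,
)
trainer.train()
```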
i also needed data curation, so i used curator by bespokelabs to generate synthetic data with claude 3.7 sonnet with extended thinking.
they're hosting a Reasoning Datasets Competition that i can submit to
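the curation step itself is small: subclass curator's LLM wrapper, describe how to build the prompt for each seed idea and how to parse the response, then map it over the seeds. the class and method names below follow my reading of the bespokelabs curator docs and the model id is a placeholder, so treat the details as assumptions rather than the exact pipeline i ran:

```python
from bespokelabs import curator

class ThreeJSCurator(curator.LLM):
    def prompt(self, input: dict) -> str:
        # ask for the reasoning plus the full scene for each seed idea
        return (f"Write a complete three.js scene for: {input['idea']}. "
                "Think through the scene graph, camera, lighting and animation, "
                "then give the full code.")

    def parse(self, input: dict, response: str) -> dict:
        return {"idea": input["idea"], "completion": response}

# model id is a placeholder for claude 3.7 sonnet with extended thinking
generator = ThreeJSCurator(model_name="claude-3-7-sonnet-20250219")
dataset = generator([{"idea": "a spaceship launching from earth to the moon"}])
```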
the next essential piece is the reward functions. these were the main pointers (an example of the reward signature a GRPO trainer expects follows the list):
- Reasoning - GRPO & RL | Unsloth Documentation
- will brown on X: "a beautiful reward function https://t.co/kb5xvRQNtR"
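the common thread in both: a reward is just a function that takes a batch of completions and returns one score per completion, and you can stack several of them. a hedged sketch of a format-style reward in that shape (the tags and the regex are illustrative, not the exact check i used):

```python
import re

# reward completions that keep reasoning inside <think>...</think> and then
# put the three.js code in a fenced block (illustrative heuristic)
THINK_THEN_CODE = re.compile(r"<think>.*?</think>.*?```(?:js|javascript)?\n.*?```", re.DOTALL)

def format_reward(completions, **kwargs):
    return [1.0 if THINK_THEN_CODE.search(c) else 0.0 for c in completions]
```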
from what i've gathered, and using claude 3.7 sonnet with extended thinking on the web to draft the rewards (with the three.js code generated by deepseek and claude as a reference), i ended up with the combined reward below:
combined_score = (
      0.12 * syntax_reward       # Basic correctness
    + 0.08 * reasoning_reward    # Quality of explanation
    + 0.05 * format_reward       # Proper formatting
    + 0.15 * length_reward       # Appropriate length
    + 0.15 * creativity_reward   # Creative solutions
    + 0.15 * animation_reward    # Animation quality
    + 0.10 * performance_reward  # Performance optimizations
    + 0.08 * responsive_reward   # Responsive design
    + 0.10 * interaction_reward  # User interaction
    + 0.02 * rouge_reward        # Similarity to reference if available
)
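the weights sum to 1, so presumably each component is a per-completion score in [0, 1] and the weighted sum is computed per completion. for concreteness, here's roughly the shape of the syntax component; the real heuristics were claude-drafted and more involved, so read this as an illustration rather than the exact code:

```python
def syntax_reward(completions, **kwargs):
    # crude proxy for "plausibly runnable three.js": check for the building
    # blocks every scene needs (illustrative, not the exact checks i used)
    required = ["THREE.Scene", "PerspectiveCamera", "WebGLRenderer", "requestAnimationFrame"]
    return [sum(tok in c for tok in required) / len(required) for c in completions]
```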
but this seems overengineered. i need to tweak this more.