i was wondering if there's a way to know which data actually matters before you even train. like, can you look at your dataset and say "these 2k examples are worth more than those 10k"?
does more data always = better model?
research shows it's a power law, not linear:
100 samples → loss = 10
1,000 samples (10x more) → loss = 5 (not 1)
10,000 samples (10x more) → loss = 2.5 (not 0.5)
these diminishing returns hold across seven orders of magnitude.
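here's roughly what that curve looks like as code - a minimal sketch, with the exponent picked to match the toy numbers above rather than taken from any specific paper:

```python
# sketch: loss as a power law in dataset size, loss(n) = c * n^(-alpha)
# alpha ~= 0.3 reproduces the toy numbers above (loss halves per 10x data);
# real exponents depend on the task, model, and data distribution
def power_law_loss(n_samples, c=40.0, alpha=0.301):
    return c * n_samples ** (-alpha)

for n in [100, 1_000, 10_000, 100_000]:
    print(f"{n:>7} samples -> loss ~ {power_law_loss(n):.1f}")
# 100 -> 10.0, 1,000 -> 5.0, 10,000 -> 2.5: every 10x of data only halves the loss
```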
what about finetuning? since the base model already knows a lot and you're just teaching it something specific, does the same rule apply?
yes but you might only need 20-50% of your data to get 95% performance. so which 20-50%?
j morris showed that models have a capacity limit. GPT-style models memorize ~3.6 bits per parameter.
this means a 1B parameter model can only memorize ~450MB of information. that's your budget.
training on more data doesn't increase the budget. it just spreads it thinner.
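the arithmetic behind that number, if you want to sanity-check it:

```python
# back-of-the-envelope memorization budget using the ~3.6 bits/parameter estimate
bits_per_param = 3.6
params = 1_000_000_000  # 1B parameters

budget_bits = bits_per_param * params
budget_mb = budget_bits / 8 / 1_000_000  # bits -> bytes -> MB
print(f"memorization budget: ~{budget_mb:.0f} MB")  # ~450 MB

# the budget depends only on the model size, not the dataset size:
# 10x more training data just gives each example a thinner slice of the same ~450 MB
```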
when you exceed capacity, the model is forced to generalize instead of memorize. this explains grokking - that moment when performance suddenly jumps.
so the question becomes: which data fills the budget?
if you have lots of data, keep hard examples. easy ones are redundant.
if you have little data, keep easy examples. hard ones might just be noise.
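a minimal sketch of that heuristic, assuming you score difficulty with per-example loss from some reference model (that's my choice of proxy here, not the only one):

```python
import numpy as np

def prune_by_difficulty(losses, keep_fraction, data_rich=True):
    """keep a subset of examples using per-example loss as a difficulty score.

    data_rich=True  -> keep the hardest examples (easy ones are redundant)
    data_rich=False -> keep the easiest examples (hard ones may be noise)
    """
    n_keep = int(len(losses) * keep_fraction)
    order = np.argsort(losses)   # easiest (lowest loss) first
    if data_rich:
        return order[-n_keep:]   # indices of the hardest examples
    return order[:n_keep]        # indices of the easiest examples

# toy usage: pretend these are per-example losses from a reference model
losses = np.random.rand(10_000)
keep_idx = prune_by_difficulty(losses, keep_fraction=0.3, data_rich=True)
print(len(keep_idx), "examples kept")
```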
someone showed you can discard ~20% of ImageNet without hurting performance, potentially breaking power law scaling.
how do you actually do this though?
there's information bottleneck theory - find the maximally compressed representation of the input that still preserves the information relevant to the output. keep only the data that tells you something useful.
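for reference, the standard way that objective gets written down (this is the generic information bottleneck lagrangian, not anything specific to data pruning):

```latex
% find a representation T of the input X that is as compressed as possible
% while preserving the information relevant to the output Y
\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)
```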
practical methods exist:
- coreset selection - find a small weighted subset that approximates the full dataset
- geometry-based pruning - preserve the structure of the feature space
- uncertainty-based - keep what the model is uncertain about
- error-based - keep high-loss examples
 
the problem: most of these don't scale well, and the best ones are expensive to compute.
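to make one of these concrete, here's a rough sketch of the uncertainty-based flavor - it also shows where the cost comes from, since you need a reference model's predictions on every example before you can prune anything:

```python
import numpy as np

def select_by_uncertainty(probs, keep_fraction=0.3):
    """probs: (n_examples, n_classes) predicted probabilities from a reference model.
    returns indices of the examples with the highest predictive entropy."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    n_keep = int(len(probs) * keep_fraction)
    return np.argsort(entropy)[-n_keep:]  # the most uncertain examples

# toy usage with random "predictions" standing in for a real model's outputs
probs = np.random.dirichlet(np.ones(10), size=5_000)
keep_idx = select_by_uncertainty(probs, keep_fraction=0.2)
print(len(keep_idx), "examples kept")
```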
there's also this idea of four scaling regimes. basically you ask two questions:
- is the bottleneck your data or your model?
- is the problem noise or lack of detail?

the second question maps to two kinds of limitation:
- variance-limited: error comes from noise in limited samples (like photos in a dark room)
- resolution-limited: the model can't capture fine-grained patterns (like a pixelated image)

crossing those two answers with the data/model bottleneck is what gives you the four regimes.
 
knowing which regime you're in tells you if more data helps or if you need something else.
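one rough way to probe this (my own simplification, not a method from the scaling-regimes work): fit the exponent of your loss-vs-data curve. if i'm remembering the result right, variance-limited error falls roughly like 1/n, so an exponent near 1 points that way, while a much smaller exponent suggests you're resolution-limited and more data alone helps slowly:

```python
import numpy as np

def fit_scaling_exponent(n_samples, losses):
    """fit loss ~ c * n^(-alpha) with a linear regression in log-log space."""
    log_n, log_l = np.log(n_samples), np.log(losses)
    slope, log_c = np.polyfit(log_n, log_l, deg=1)
    return -slope, np.exp(log_c)

# toy usage: the loss curve from the example at the top of this post
n = np.array([100, 1_000, 10_000])
loss = np.array([10.0, 5.0, 2.5])
alpha, c = fit_scaling_exponent(n, loss)
print(f"alpha ~ {alpha:.2f}")  # ~0.30 here: far from 1, so more data helps, but slowly
```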
in more recent work, j morris shows that embeddings from different models converge to similar representation geometries.
if there's a universal geometry, maybe there's an optimal compression of training data that fills that structure efficiently.
there's also a ton of research on synthetic data that can fit into the equation as well. a rabbit hole that i would love to dive into some other time.