Just how much data do you need for machine learning anyways? And how clean? / by Chris Shaffer

Wormtongue: It will take a number beyond reckoning, thousands.

Saruman: Tens of thousands.

Wormtongue: But, my lord, there is no such data set.


Just as a heads-up, if you’re expecting an authoritative answer to this question, you’re out of luck - entire books could be written about this topic, cover a tiny fraction of it, and be out of date before they finished printing.

What this post will provide, with a few examples, is a way to think about how to begin answering that question for a particular problem domain without resorting to mysticism.

If you do some searching and reading, you’ll find a few statements more or less universally repeated:

  1. It depends.

  2. It’s a lot of data. Thousands or tens of thousands for a relatively small or simple problem. Millions or more for complex ones.

  3. The training data needs to be diverse enough to be representative of what you expect the model to handle.

  4. Ten times the number of facets (columns) in your data set, give or take, is a good starting point.


Hey, that last one was almost concrete. Let’s drill into that one.

Here’s one tutorial that walks through building a sentiment analysis model: https://www.digitalocean.com/community/tutorials/how-to-train-a-neural-network-for-sentiment-analysis

Here are the bits we care about for our question:

How many columns?

There are 9,998 unique words in their data set. They’re treating each input text as an n-dimensional vector, in which n is the number of distinct words. A coordinate in 3-dimensional space is defined as [x, y, z]. A coordinate in this space is defined as [a, aardvark, anteater, apple, 9,994 more words …]. This is par for the course with text analysis, including LLMs.

The result they’re looking for is a simple “positive” or “negative” sentiment. So, 9,998 inputs and 1 output. (Round to 10,000)
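To make that “each text is a ~10,000-dimensional vector” idea concrete, here’s a minimal sketch using scikit-learn’s CountVectorizer. This is not the tutorial’s code - just one common way to get one column per unique word:

# A minimal sketch of bag-of-words vectorization: one column per unique word,
# 1 if the word appears in the review and 0 if it doesn't.
# (Not the tutorial's exact pipeline - just the idea.)
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "I loved this movie, the acting was great",
    "terrible plot, I hated every minute",
]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # the "columns": one per unique word
print(X.shape)                             # (2 reviews, n unique words)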

How many examples does the 10x rule imply we’ll need?

10 × 10,000 = 100,000

How many examples do they actually have?

40,000 in their training set, and 10,000 in their testing set.

So, we have a little under half the data our rule of thumb implies we’d need.

But hey, the rule of thumb didn’t specify data types. They’re outputting a single boolean value. Positive or negative, 0 or 1. They’re not trying to assign a numerical score or predict the next word. Maybe we can get away with half what we’re supposed to need.

Accuracy: 86.59%

Good enough for government - er, Google - work.

A good engineer with some time on their hands might squeeze a few more percentage points out of that. Depending on your use case, 90% might be great! Or, it might be unacceptable.

Since we’re pretty far below the recommendation, it’s likely that either this code or an improved version would fare better with more data. But none exists. We might need to get creative - pulling in similar data from other domains or adopting some more advanced data collection and preprocessing techniques.


What about something simpler? Fitting a line to a scatter plot?

The implication here is that, if I have a two-dimensional space (x and y - one input, one output), I’d need 10 or 20 data points to fit that line accurately with a neural net?

Yes.

But an 8th grader or a TI-81 from 1990 can solve that with only two points.

Also, yes.

To be fair, algebra students and graphing calculators could only solve that equation with two data points if they were told, for instance, they were looking for a linear relationship. There are innumerable curves that could fit two points when you include quadratic, exponential, etc. equations.
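Assuming numpy, the two-point fit looks something like this - note that deg=1 is us telling it up front that the answer is a straight line:

# Fitting a line through two points - fine, as long as you already know
# the relationship is linear.
import numpy as np

x = np.array([1.0, 3.0])
y = np.array([2.0, 8.0])

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares fit of degree 1
print(f"y = {slope:.1f}x + {intercept:.1f}")  # y = 3.0x + -1.0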

(Obligatory) How does ChatGPT do?

Spoiler alert: it’s not an improvement over more “traditional” machine learning techniques.


The correct answer is, of course, “there’s not enough information to determine”. It gave an answer that could be correct (if I’d told it to look for an exponential equation, or just any equation that fits those two points), though, so maybe it just needs some clarification.

Now it’s doing an objectively worse job than a 30-year-old calculator

A good ol’ regression, ‘tis not.


Back to something we can use

I’m going to train my own model. For the sake of illustration, I’m going to train it to recognize two made-up words.

blork: like, enjoy, feel positively toward

blorp: dislike, don’t enjoy, feel negatively toward

Training Data

I blork steak | Positive
I blork chicken | Positive
I blork corn | Positive
I blork apples | Positive
blork | Positive
I blorp corn | Negative
I blorp apples | Negative
I blorp steak | Negative
I blorp chicken | Negative
blorp | Negative

Testing Data

> I blork sandwiches
Negative

Cut short because it’s already wrong. That jibes with what we’ve been told. We have 6 words in our input vector. We’d need 60 examples to be able to reliably learn this simple relationship.

This is absurd. Now what?

If we can’t reduce the number of data points necessary to train a 6-parameter model, can we reduce the number of parameters?

You might hear this referred to as “preprocessing” - you figure out what’s relevant, strip out what’s irrelevant, and maybe approximate or abstract the parts that are only partially relevant.

Since I want it to learn two words, for the purposes of this experiment, I’m going to do some extreme preprocessing.

Training Data

I blork chicken | Positive
You blork steak | Positive
I blorp apples | Negative
You blorp corn | Negative

… 30 examples. Basically, I replace every word that’s not the two I’m learning with “noun”. I don’t care to teach my model about foods, just blork and blorp.

After preprocessing (what actually gets fed to the ML part of the code)

noun blork noun | Positive
noun blork noun | Positive
noun blorp noun | Negative
noun blorp noun | Negative
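The replacement step itself is only a few lines. Here’s a rough sketch (not the exact code I ran, but the same idea):

# Extreme preprocessing: keep the two words we care about, collapse
# everything else to a placeholder token.
KEEP = {"blork", "blorp"}

def preprocess(sentence):
    return " ".join(
        word if word in KEEP else "noun"
        for word in sentence.lower().split()
    )

print(preprocess("I blork sandwiches"))       # noun blork noun
print(preprocess("You blorp something-new"))  # noun blorp noun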

Testing Data

> I blork sandwiches
Positive
> You blorp word-you've-never-seen-before
Negative
> Person-you've-never-seen-before blork xyz
Positive

… Numerous more examples.

Perfect 100%

I’m feeling pretty clever. Until I try tweaking the question format.

> This burger makes me feel blorp
Positive

See also: over-fitting

It learned to fill in the blank for “noun __ noun” but not for “noun noun noun noun noun __”

(Those aren’t all nouns, oops. But again, I’m not trying to teach it parts of speech.)

That illustrates our diversity principle above - we gave it a very uniform training set, allowing it to answer a uniform set of questions - but not much else.

This is a silly example

My goal here is so narrow, I’d be better served with a Find or a Contains. But then again, every JavaScript tutorial is a to-do list or something equally impractical, right? We’ll circle back to this.

Some more realistic examples

Image recognition

If the number of images I need to train a model is a function of the number of pixels I have, reduce the resolution. Reducing the length and width each by a factor of 3 reduces the number of pixels by a factor of 9. That’s almost an order of magnitude less training data (and compute) required!

Again, just as with my silly example, there’s a tradeoff: it’ll perform much worse with extreme zoom-in. Generally, this is an acceptable trade-off, but it’s there.
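As a sketch of what that looks like in practice (assuming Pillow and a hypothetical cat.jpg):

# Downsampling before training: shrinking each dimension by 3x leaves
# roughly 1/9th of the pixels (inputs) per image.
from PIL import Image

img = Image.open("cat.jpg")          # hypothetical input image
width, height = img.size
small = img.resize((width // 3, height // 3))
small.save("cat_small.jpg")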

Text Cleaning

Remove HTML from text scraped from the internet using classic libraries. This doesn’t really have a tradeoff aside from the obvious one: the resulting model can’t handle HTML - you have to preprocess the inputs in the same way - and therefore create a program that’s entirely oblivious to the existence of HTML.
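For example, with BeautifulSoup (one of those classic libraries):

# Strip the HTML before the text ever reaches the model.
from bs4 import BeautifulSoup

raw = "<div><p>I <b>blork</b> this product!</p></div>"
clean = BeautifulSoup(raw, "html.parser").get_text(separator=" ", strip=True)
print(clean)  # I blork this product!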

Lemmatization and Tokenization

Coming up with more clever ways to split up words can make a big difference in machine learning. Replace every word with its root word, and (potentially) treat its suffix or conjugation as a separate token.

Example: run, running, eat, eating, stand, standing (6 different tokens) becomes run, run -ing, eat, eat -ing, stand, stand -ing (4 different tokens)

This could yield fewer dimensions and better represent the underlying concepts… though it requires some piece of code, developed and deployed before you do any ML, to impart some basic understanding of the language.
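Here’s a minimal sketch using NLTK’s Porter stemmer (one option among many). It only collapses words onto a shared root; splitting the suffix off as its own token, as described above, takes a bit more work:

# Collapse inflected forms onto a shared root so "run" and "running"
# stop being separate dimensions.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["run", "running", "eat", "eating", "stand", "standing"]
print([stemmer.stem(w) for w in words])
# ['run', 'run', 'eat', 'eat', 'stand', 'stand']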

And lots more

These are just a few popular examples. Every data set will have a few obvious ones and a few non-obvious ones, each with obvious and non-obvious tradeoffs.


Over to you, Sam

Let’s try this example with ChatGPT. It decidedly does not require as many examples as we’d assumed earlier to learn our made-up words.

So, is it simply better than previous neural networks? Does it break the rules?

Or is something else going on that might be hinted at by some follow-up questions?



It nailed “blork” and “blorp” but it flopped on “blark”. Why?

It has just as much training data, following the exact same format. It doesn’t, on its face, appear to be a more difficult problem.

Just a guess

There’s no training data on blorks and blorps aside from what I’m giving it. But ChatGPT does have basically the entire internet as training data. And there is a lot of training data that takes this form, if not these words.

Children are frequently taught grammar using make-believe vocabulary (I assume there’s some theory behind this along the lines of “separate those two concepts or you’re at risk of leaning on memorization”). If you’ve ever read Dr. Seuss, you’re familiar with this.

Kids might be asked real questions about the made-up Lorax, but the teachers and tests rarely pose the questions using the made-up vocabulary they just learned. (“Do you see the Lorax on this page?” versus “Glimpse the Lorax?”)

Whatever the reason, it clearly has to do with the enormous data set ChatGPT was trained on. It didn’t train on just my two sentences; it trained on billions, no doubt including some elementary school lessons.

Furthermore, it’s likely to do with the proprietary data set OpenAI has curated. Remember, OpenAI has an army of data entry employees at their disposal. The open-source LLMs couldn’t get this one at all. When ChatGPT easily nails a question that another LLM can’t get anywhere close to, a good guess is that it has to do with that proprietary data and/or preprocessing.

GPT-4-kids in a trenchcoat

It’s been rumored that “ChatGPT is a bunch of machine-learning models in a trenchcoat”. I don’t have any insider knowledge here, but it’s a plausible theory. This has been a technique for human-impersonating AI at least as far back as IBM’s Watson.

A very fuzzy view of that architecture: there’s some ML that categorizes the question. Depending on what sort of question it is, it gets passed to another code module - maybe that’s an LLM, maybe it’s another piece of ML, maybe it’s a glorified regex, maybe it’s a UNIX command-line tool from 1980. Maybe an LLM splits it up, hands it off to multiple other not-LLMs, and then another LLM synthesizes the answer from their outputs.
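To be clear about what that would mean, here’s a toy sketch of the routing idea. This is pure speculation about ChatGPT - it’s just an illustration of the architecture being described, with call_llm as a stand-in for whatever model sits behind it:

# A toy "trenchcoat": classify the request, then route it to whichever
# module handles that category. Purely illustrative.
import re

def call_llm(prompt):
    # stand-in for an actual LLM call
    return "(LLM answer for: " + prompt + ")"

def classify(prompt):
    # stand-in for "some ML that categorizes the question"
    return "math" if re.fullmatch(r"[\d\s\+\-\*/\.\(\)]+", prompt) else "chat"

def route(prompt):
    if classify(prompt) == "math":
        return str(eval(prompt))  # the "calculator" module (toy only - eval is unsafe)
    return call_llm(prompt)

print(route("123456789 + 987654321"))    # 1111111110
print(route("Do you blork sandwiches?")) # falls through to the LLM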

Evidence in favor

This is arithmetic, not token prediction


If you give ChatGPT a simple addition problem, with long strings of digits, it gets it right. Now, this is an easy problem for a calculator, but it’s not how LLMs are supposed to work. They’re supposed to predict the next token based on a pattern of tokens they’ve seen in the past. It’s highly unlikely that ChatGPT has seen these numbers before and is predicting based on them.

In this reading, ChatGPT should not be able to solve this problem. Unless it’s recognizing that it’s looking at a math problem and sending it to a calculator (for which it’s an easily solvable problem), rather than processing it with the LLM.

Sure enough, other LLMs cannot answer these correctly.

Evidence against

What it can do with addition and subtraction, it cannot do with multiplication and division. This is what we should expect out of an LLM: “doing multiplication” correctly only when you give it familiar numbers and incorrectly when you give it brand new numbers.

If it were simply passing the problem off to a calculator, it would have an easy time with both addition and multiplication.

One explanation is that, if you treat each digit as a token, with a gazillion math sites as training data, it “learned” addition. There are 19 possible results from adding two digits, but 37 from multiplying them. Order of complexity for a neural network to teach itself math based on pattern recognition: 19^n versus 37^n. That’s a lot of room for it to have enough examples of addition, but not enough of multiplication.
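(You can count those outcomes yourself:)

# Distinct outcomes a model has to learn: sums vs. products of two digits.
digits = range(10)
sums = {a + b for a in digits for b in digits}
products = {a * b for a in digits for b in digits}
print(len(sums), len(products))  # 19 37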

Training on your questions

Remember that anything you type into the ChatGPT prompt is usable as training data. Also keep in mind that they released the GUI publicly (largely) as a marketing tactic. They’ve explicitly told us they’re going to look at the questions we ask and try to use that to better answer similar questions in the future - there’s no clause in the terms of service that says “except for trick questions”.

I lean toward “preprocessing and crafting training data” as the answer to “how does ChatGPT handle popular trick questions that other LLMs can’t?” more than toward the trenchcoat theory (which OpenAI denies), myself. But either way, they are internally looking for questions that trip up their model and doing something in response that falls outside of the neural net proper.

It’s also at least technologically possible that they’re playing games - arbitrarily sending math questions either to a calculator or allowing the LLM to bomb them - to impress “thought leaders” without tipping off people who will dig deeper. I normally would dismiss a theory like this out of hand as paranoia … but let he who doesn’t have a collection of eyeballs and a doomsday bunker cast the first accusation of paranoia.

Takeaways

Data is the differentiator

Every advance in AI makes data more important and more valuable. You may have noticed a recurring theme above: training a neural network is not incredibly hard with the right training data but basically impossible without it. Getting the data right is a function of engineering and design that’s upstream of machine learning.

Depending on your use case, you may be able to get clean data for free, purchase it, or build that data set internally. For most business situations, the machine learning bits are largely commoditized - anyone who raises a seed round can build and train a neural network. A well-curated data set provides a moat not so easily reproduced.

A well-structured database can reduce the cost of building ML by 99% or more

You can plot a line with a handful of points if you have them in a clean, structured (read: integers in SQL or CSV) format. It might take thousands of examples if you have prose describing the same data.

It takes all of the reviews on IMDB to train an okay sentiment analysis model, but it only takes a few rows of data for Netflix to tell you that you might like Star Trek: Deep Space Nine if you liked the original Star Trek and Star Trek: The Next Generation.

Data collection and review processes are critical

No matter how good your feature engineering, preprocessing, or data modeling is, you will still need a lot of data. You’ll need processes to collect that data, standardize it, and ultimately feed it into your ML tools. Even in the best case scenario, 1% of a ridiculous amount of data is still a large amount of data.

One thing I didn’t cover in this sprawling post is the impact of incorrect data on training times, volumes required, and accuracy. That will also come with a lot of “it depends”, but one thing is for certain: it’s not 1:1 - every bad data point will require more than one good one to outweigh it. If you expect 90% accuracy, 10:1 is probably a good guess. If you expect 99% accuracy, 100:1. That means that filtering, review, and quality control take on increasing importance.

You’re also likely to need review processes to identify and correct mistakes your ML makes. There will need to be some thoughtful design around this to ensure the ML isn’t more trouble than it’s worth. What’s the accuracy? How does that compare to a human? What are the consequences to getting one wrong? Is it possible to identify likely mistakes in an automated way? Do we have to review everything? If so, what was the point?

Don’t bother with machine learning if there’s a clear algorithm

As the examples of long multiplication and division demonstrate, even the planet’s most advanced neural network, given all of the available data on the internet, might not be able to learn an elementary-school-level algorithm.

We’ve demonstrated that we can steer a neural network to learn what we want it to. We don’t need the power of an LLM to train a neural network to learn long multiplication with a tailor-built training set; we can do that on our laptops (though that neural net won’t be able to do much else). As with our toy “sentiment analysis” example that we taught two words to, the question is “why?”

We don’t need a neural net to do any of these tasks. It’s far more work to teach one than it is to … just write the code. Teaching a neural net to learn something that could be accomplished with a few lines of procedural code is valuable for teaching purposes. It’s not supposed to be practical.

This should remind us to stick to using machine-learning techniques where they excel - “I know it when I see it”-style problems where there’s not an algorithm we can coherently express.