Estimating Data Amount Before You See It

"How much data do you need to build this?" drops like a rock in every initial data-related project meeting. (The other one is "How good can your model get?", which I might write about later.)
While it is also common to play the "Uno Reverse" card here and ask, "How much data do you have?", more often than not this counter only delays answering the actual question: "Is this project feasible?"
Leaving that question unanswered might delay the project's start or cancel it outright. I think it is possible to answer it reasonably accurately while informing everyone of the caveats.
Whose job is it to assess feasibility?
Spoiler alert: everyone’s.
The general tendency is to assume that, before someone from the business comes in with an idea, they should already have thought through whether machine learning makes sense in that area. In practice, though:
- From the outside, listening to stories of machine learning projects, it is hard to gauge how much work it takes to build or maintain one.
- Everyone thinks that they have a lot of data[1], and that the data is well maintained.
- An annoying day-to-day process might feel like it costs much more than it actually does; the cost-benefit analysis might not hold up to scrutiny.
I don't want to get into the details of every aspect of a project here. The amount of data alone represents one of the first huge barriers. Some potential problems:
- Not enough data
- Not enough good-quality data
- Not feasible to extract the data
- Not enough labels or ground truth information
Any of these could stop the project in its tracks, and as the specialist, it is my responsibility to help the business work through these issues and reach a "Go or No-Go" decision on the project. It is good practice to have some guidelines anyway.[2]
The Questions to Ask
When you search for an answer to the question "How much data is needed for a Machine Learning project?", the first answer you find is: 10 times the number of features. In most cases, you don't know the number of features you'll use until well into the project.
The other common answer is to base it on which machine learning algorithm you'll use. Same problem: you might know the general family of models you'll use, but the actual best-performing one won't be known for a long time.[3]
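For what it's worth, here is that first rule of thumb as plain arithmetic, a minimal sketch with made-up feature counts. The catch is exactly the one above: at this stage, the feature count is the thing you don't know.

```python
# The "10x the number of features" rule of thumb as plain arithmetic.
# Feature counts here are made up; before the project starts you don't actually know them.

def rule_of_thumb_rows(n_features: int, multiplier: int = 10) -> int:
    """Rows suggested by the '10 times the number of features' heuristic."""
    return n_features * multiplier

for n_features in (20, 100, 500):  # hypothetical feature counts
    print(f"{n_features} features -> {rule_of_thumb_rows(n_features):,} rows")
```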
What is the scope of usage?
Before answering any questions, consider how frequently the model will be used and how frequently the underlying data changes. Businesses often need one-time analyses, or analyses whose underlying features change very slowly. I find it very hard to be honest with myself at this stage; businesses find it even harder.
If you're going to classify customers for a type of outreach program, does your customer base really change that much? Can you just do this once a year, maybe with smaller monthly batches for new users?
The data that you'll need will be significantly influenced by how often the solution is used and how long the project will last.
If I think this is just a one-time analysis, I can usually get a good enough result with far less data.[4]
Is time an important dimension?
It generally is. There's almost always a dependency on time. Sales increase during holiday periods, traffic decreases during work hours, and markets are only open during certain times of the day.
If there is seasonality, the model will need at least two full cycles of it to begin modeling its effects. This means:
- If yearly seasons affect your model, you'll need at least two years of data.
- If you have monthly patterns (ones that don't depend on weather or holidays, or where those periods are not relevant), you'd need a couple of months.
This doesn't mean that you can't start modeling without this data at all, but if these effects are critical to the business outcome, "let's wait until we have enough data" is a valid answer.
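As a rough illustration, the "two full cycles" arithmetic looks like this; a minimal sketch where the cycle lengths and data granularities are assumptions, not project numbers.

```python
# Back-of-envelope: how many observations do two full seasonal cycles imply?
# Cycle lengths and granularities below are illustrative assumptions.

def observations_for_cycles(cycle_days: float, granularity_days: float, n_cycles: int = 2) -> int:
    """Rows needed to cover n_cycles of a seasonal cycle at a given data granularity."""
    return int(n_cycles * cycle_days / granularity_days)

# Yearly seasonality observed with daily data: two full years of rows.
print(observations_for_cycles(cycle_days=365, granularity_days=1))     # 730
# Weekly pattern observed with hourly data: two full weeks of rows.
print(observations_for_cycles(cycle_days=7, granularity_days=1 / 24))  # 336
```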
How many critical use cases to cover?
In many of the projects I have worked on, there were critical use cases the model had to cover. Can the data cover all or most of them? I tend to state an amount of data that would reliably give the model enough examples of each case to learn from, for example 1,000-10,000 labeled instances per case. I could go higher or lower depending on the use case.
The message I am trying to convey to the project stakeholders is that I need quality data that can answer the question at hand. They might have millions of log entries in their databases, but if those don't contain enough of the critical cases, there is no reason to proceed.
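A rough coverage check along these lines might look like the sketch below; the case names, counts, and target are all made up for illustration.

```python
# A rough coverage check: does each critical case have enough labeled examples?
# The case names, counts, and target below are hypothetical.

TARGET_PER_CASE = 1_000  # lower end of the 1,000-10,000 range mentioned above

available = {            # hypothetical label counts pulled from the source systems
    "late_delivery": 12_400,
    "damaged_item": 950,
    "wrong_address": 85,
}

for case, count in available.items():
    status = "ok" if count >= TARGET_PER_CASE else "short"
    print(f"{case}: {count:,} labeled examples -> {status}")

# Millions of raw log rows don't help if a critical case like "wrong_address"
# has only a handful of labeled examples.
```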
Other Questions
This is by no means a comprehensive list, but here are some generally useful questions:
- How frequently do the positive instances occur?
- Are there any parts of the process that are not recorded?
- Can a human do the job themselves?
- Is the data used actively in any process?
- Have there been any major changes in the process?
- Have there been any major changes in data collection?
This list could go on and on. Clarifying how the outputs will be used is crucial for any estimation, so I try to be as thorough as possible here.
Coming Up With An Answer
I want to stress again that I am talking about the stage before the project begins. There is no way to know the actual amount of data needed, but we can assess the project's feasibility.
Two more relevant places to go before estimating:
- If I have done a similar project, I always mention how much data was involved and whether it was enough at the time.
- If there's a similar paper or project available online, I use it as a reference point.
These two help ground the estimation I am about to provide. The rest is somewhere between art, alchemy, and a bit of accumulated experience. Roughly:
- Find the base
- Refine
- Iterate
Find the base
First, I look for the base of the estimate in the problem itself. The base reflects how the project's business operates.
I'll give examples from two fictitious projects: a retail sales prediction and a sensor failure root cause prediction.
Retail sales prediction: Seasonality is likely to be very strong, so my base would be time—at least two years, however many data points that would be.
Sensor failure root cause prediction: My base would be the number of root causes I'd have to predict. A period that covers 10,000–15,000 examples of each root cause would be a great start, along with however many negative data points it includes.
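To make the two bases concrete, here is a minimal sketch of the arithmetic; every number (store counts, SKUs, root causes) is an assumption for illustration, not a figure from a real project.

```python
# A minimal sketch of the two base estimates above. All numbers are assumptions.

# Retail sales prediction: the base is time, at least two yearly cycles.
stores, skus, years = 50, 200, 2                  # hypothetical scope
retail_base_rows = stores * skus * years * 365
print(f"retail base: ~{retail_base_rows:,} daily store-SKU rows")

# Sensor failure root cause prediction: the base is examples per root cause.
root_causes = 6                                   # hypothetical number of root causes
low, high = 10_000, 15_000
print(f"sensor base: {root_causes * low:,}-{root_causes * high:,} labeled failures, plus negatives")
```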
Refine
Here, I take into account more details of the project and update the initial estimations. Of course, I could receive numerous and contradictory signals from different project members; listening to and distilling these is part of the process.
Continuing from the previous examples:
Retail sales prediction:
- If the effect of holiday and seasonal periods is very strong, and the team insists both that capturing it is critical and that the tool has to perform well from the start, I could increase the number of years needed, maybe to 3 or 4.
- If the team says we don't need very strong performance at the start, just solid performance on a couple of key products, several months of data covering those products would suffice.
Sensor failure root cause prediction:
- If the team says that some cases are only captured in unstructured sources that are harder to parse and are missing for some instances, we would need more data, which means collecting 2-3 times more cases for those instances.
- If the team says that data for only 3 of the root causes is enough for a minimum viable product, we can get started with data for only those.
Adding a contingency multiplier here is fine, but be careful not to add too much: if the number is too high, or too costly to extract, it might signal to the team that the project is unfeasible. Negotiation might be necessary here.
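One way to keep the refinement honest is to treat it as explicit multipliers on the base, as in the sketch below; both the base and the factors are illustrative judgment calls, not fixed rules.

```python
# Refinement as multipliers on the base estimate, with a modest contingency.
# Base and factors are illustrative judgment calls, not fixed rules.

base_cases = 60_000                  # e.g. the lower end of the sensor base above

refinements = {
    "unstructured_sources": 2.5,     # harder-to-parse cases: collect 2-3x more
    "mvp_scope": 0.5,                # only 3 of 6 root causes needed for the MVP
}

estimate = base_cases
for reason, factor in refinements.items():
    estimate *= factor

contingency = 1.2                    # keep this modest; a huge buffer scares the business
print(f"refined estimate: ~{int(estimate * contingency):,} cases")   # ~90,000
```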
Iterate
No answer is final until the project is completed. I make sure to let the team know that the answer is not final, and that I will update the estimations until we have an experimental result.
Keep listening to people. Every time you get new information, go back and update your estimation and let everyone know.
Every update is a chance to revisit the "Go or No-Go" decision for the project. If the situation seems impossible, there's no need to push. Avoid the sunk-cost fallacy.[5]
To Recap
- It is the Data Scientist's responsibility to assess the feasibility of the data side of the project.
- Listen to the requirements carefully, clarifying them with the intention of providing an accurate "Go or No-Go" assessment.
- Try to understand the core data requirements and the drivers of data variability, refining them with statements from subject matter experts.
- Provide, but don't commit to, a number that sounds reasonable. Adding some contingency over your initial estimation is fine, but too much of it can scare the business.
- Iterate and keep the option to change the initial "Go or No-Go" decision based on new information.
I was once told that the business had too much data to analyse, and was given 120 lines. Apparently, they were adding a line a day and thought it was a lot. ↩︎
It is important to emphasise that these are just rough estimates. Data people are generally hesitant to commit to these things, for very good reason. But it is very common across industries for projects to run 3-5x over budget, and I don't think the data budget is any more under control than the monetary one. It is OK to make estimates, and it is OK to miss them by quite a large margin from time to time. ↩︎
Building a solution around the algorithm sounds like a horrible idea to me. Some applications do impose constraints (image classification can't really get away from CNNs), but when thinking about a project, I try to think about the problem, not the solution. ↩︎
Just because this is a one-time analysis doesn't make the project any less valuable. ↩︎
One of my favourite sayings in Turkish: Zararin neresinden donsen kardir, it is a profit to stop losing at any point. ↩︎