Mistakes made by new data scientists are disappointingly similar to those still made by professionals
Over the past year, I’ve had the opportunity to introduce many students to data science. Between teaching an MBA course on Big Data at ESCP and working with students at the Politecnico di Torino to develop an algorithm to forecast trends in fashion as part of a business challenge for CLIK (Connection Lab and Innovation Kitchen), I’ve been able to share what we do with students who may not have otherwise considered its value. Seeing them understand why data science is the future has been gratifying.
In the process, I’ve become much more attuned to the types of mistakes people make in applying, developing, and using data science when they start out. Sometimes these are purely technical errors, but more often, they stem from deeper misunderstandings about how we can best leverage the ever-growing amounts of data available.
Mistakes are normal and even desirable when you’re developing an algorithm. You learn from those mistakes and create a better final product. As a student or a new data scientist, mistakes are critical to developing skills.
But repeating those mistakes over and over? That’s a different story.
The more I work with them, the more I realize that the mistakes new data scientists make are disappointingly similar to the errors made by professionals. Instead of learning from our early missteps, data scientists are digging in and refusing to learn critical lessons that make the difference between incremental gains from data-driven decisions and systemic transformation.
There are three critical mistakes data scientists continue to make, and we must learn from them to do our jobs better:
Mistake #1: Assuming near-perfect data
Every professional data scientist knows on an intellectual level that you’re never going to have perfect data. Data is going to be incomplete, dirty, error-prone, and full of noise. Dealing with messy data is a daily part of our work and our most common complaint.
When you’re first starting, you have less practical experience with this reality. When I gave my MBA students an assignment that required choosing datasets to use, they defaulted to the most relevant information possible. On the surface, that makes sense: the more relevant the data, the bigger its impact.
In practice, this assumption magnifies every problem in the data. With only a handful of highly relevant datasets, algorithms don’t have enough examples to learn from and overcome the errors those datasets contain. Unlike traditional regression models, which struggle with massive datasets, machine learning thrives on more data. Had the data been perfect, my students’ narrow choices might have produced ideal outcomes; in the real world, they simply did not have enough data to achieve their goals.
Data scientists in the field should know better. After all, they spend so much time cleaning datasets that it’s impossible to ignore!
Unfortunately, many data scientists fear introducing too much noise and default to safer choices. They see the diminishing returns from each additional dataset and decide that the effort isn’t worth it.
This is a fatal mistake.
For example, in supply chain forecasting, the most critical signals come from data like product features, transactions, availability, and traffic. Information from social media, demographics, and weather provides less value.
But not zero value. For a proof of concept, the most relevant data may be the most valuable. In production, however, secondary influences have a significant impact, especially when you consider the monetary value of a 1% gain in margin for a massive international retailer.
Data scientists are right to be careful when selecting their data, but I still see far too many diminish their results by playing it safe. It may not seem the same on the face of it, but they are making the same mistake as my students: assuming that their chosen data is close enough to perfect to suffice.
Ultimately, more data is better. Let the machines learn from everything you feed them and discover ways to compensate for errors themselves. The result is more robust, more agile models that give us better recommendations.
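As a toy illustration of this point (a minimal sketch with invented numbers, not from the article): a simple least-squares fit recovers the true underlying relationship more reliably as the dataset grows, even though every individual data point is just as noisy. The noise doesn’t get cleaner; the model just has more evidence to average it away.

```python
import random

def fit_slope(n, true_slope=2.0, noise=5.0, seed=0):
    """Least-squares slope fitted to n noisy (x, y) points."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0.0, noise) for x in xs]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Average absolute error over 30 runs: the per-point noise is identical,
# but the larger sample lets the fit cancel the noise out.
trials = range(30)
small_err = sum(abs(fit_slope(20, seed=s) - 2.0) for s in trials) / 30
large_err = sum(abs(fit_slope(2000, seed=s) - 2.0) for s in trials) / 30
```

With 2,000 points instead of 20, the estimated slope lands far closer to the true value, despite each observation being equally noisy, which is the intuition behind feeding models more data rather than only the cleanest data.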
Mistake #2: Misunderstanding what drives your industry
These days, industry knowledge is just as vital as technical knowledge for any data scientist. This knowledge guides us to ask the right questions and leverage the data to solve the most compelling problems. When we lack this insight, we get stuck.
Take the CLIK challenge, for example. Students with practically no knowledge of how the fashion industry operates were expected to find innovative solutions to its production needs. Understandably, many struggled at the beginning to pinpoint what fashion companies actually needed from trend forecasting. Is it the colours, the styles, or some other factor that companies need to understand? This lack of familiarity with how the fashion production cycle works confused the students at first.
In the first drafts, many of the proposed projects were interesting but ultimately fell short of a practical solution that would work in the fashion industry. They had the data, and the science was valid, but the goal was flawed. Even successful implementation would not significantly help fashion companies better plan production and purchasing. Students had to re-group.
In a Challenge like this, knowledge gaps are understandable and expected. The same can’t be said of professional data scientists working full-time in a particular industry. They should have a deeper understanding of the needs of the industry and companies they serve.
And yet, far too many data scientists just don’t know where the crucial pains are for their clients.
I was recently discussing this exact issue with the CEO of a major Russian grocery chain. The chain had long used demand forecasting and retail planning software considered best-in-class to turn its data into more effective replenishment. The results were generally acceptable, but over time the software increasingly revealed a fundamental flaw: it systematically created overstocks.
In grocery, targeted local promotions are a common strategy for achieving a competitive advantage. As the CEO explained:
“It’s difficult to predict the impact that our weekly promotions will have on demand. Some discounts exponentially increase sales while others have only a marginal impact. To ensure sufficient product availability, we tend to overstock promotional items. These overstocks create significant waste and cyclical markdown pressure. We needed a more sustainable solution.”
Stockouts of the promotional items are bad for the brand, yet regular overstocks create significant waste. Poor replenishment of promotional items is unsustainable and costly: a core supply chain pain point haunting the grocery industry.
Despite this, many demand forecasting systems fail to focus on this area, instead chasing gains elsewhere. Simply addressing this disconnect makes huge gains possible: in this case, a 23% improvement in inventory efficiency.
In a business where even a few percentage points can have a massive impact on the bottom line, these numbers are overwhelming. Yet they came from redirecting a technically sound system that had been targeting the wrong problem. Disregarding the integral role of industry knowledge is a mistake that skilled data scientists make far too often, with disastrous results.
If you are using all the data and a carefully designed algorithm to address a secondary concern, you replicate a damaging mistake. As data scientists, we have to do better.
Mistake #3: Attempting to predict rather than drive outcomes
The final mistake and the one I see most often: a predictive rather than prescriptive approach.
New data scientists hear ‘forecasting’, and they naturally assume that prediction is their primary goal. Linguistically, that is the most precise definition of their work. From a strategic standpoint, however, beginners lack the context to understand that their true goal is to improve KPIs, not forecast accuracy. I’ve seen many new data scientists proud of achieving high accuracy, blissfully unaware that the ROI from that precision is practically non-existent for the user.
This naivety is normal when you start out, but sadly far too few data scientists outgrow this mindset. I have seen companies proudly tout their accuracy when business is relatively normal, only to be blindsided by the failure of their model in the face of a crisis.
During the early stages of the Covid-19 pandemic, for example, many data scientists using predictive AI saw their models fall apart. Those that continued to deliver actionable advice and allowed companies to pivot successfully to reduce the impact of the crisis? Almost always prescriptive models.
Far too often, AI models are designed primarily to predict the future under stable conditions rather than to achieve future goals optimally. I have written in past articles about why this is a problem: under real-world uncertainty, pure prediction limits your returns.
The core of this error lies in business philosophy: a company that predicts outcomes instead of driving them is far too passive.
The most successful companies are not focused on discovering what will happen and adapting to it. They are market leaders who want to know how to make their goals happen.
Predictive analytics limits the value created from your effort. Predictive recommendations tell you the best response to likely events; prescriptive recommendations tell you how to achieve your goals, no matter what roadblocks stand in your way.
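To make the distinction concrete, here is a minimal sketch with invented unit economics (not the author’s model): for a promotional item, a purely predictive approach stocks the point forecast, while a prescriptive approach chooses the order quantity that maximizes expected margin given the asymmetric costs of stockouts and overstocks, the classic newsvendor critical fractile.

```python
import random
import statistics

# Hypothetical unit economics, invented for illustration.
PRICE, COST, SALVAGE = 10.0, 4.0, 1.0  # selling price, unit cost, markdown value

rng = random.Random(0)
# 10,000 simulated demand scenarios for one promotional item.
demand = [max(0.0, rng.gauss(100.0, 30.0)) for _ in range(10_000)]

def expected_profit(qty, scenarios):
    """Average profit of ordering `qty` units across demand scenarios."""
    total = 0.0
    for d in scenarios:
        sold = min(qty, d)
        total += sold * PRICE + (qty - sold) * SALVAGE - qty * COST
    return total / len(scenarios)

# Predictive: order the point forecast (mean demand).
predictive_qty = statistics.mean(demand)

# Prescriptive: order the profit-maximizing quantile (newsvendor fractile).
underage = PRICE - COST    # margin lost per unit of unmet demand
overage = COST - SALVAGE   # loss per unsold unit
fractile = underage / (underage + overage)
prescriptive_qty = sorted(demand)[int(fractile * len(demand))]

profit_predictive = expected_profit(predictive_qty, demand)
profit_prescriptive = expected_profit(prescriptive_qty, demand)
```

Under these made-up economics, where a lost sale costs more than an unsold unit, the prescriptive quantity deliberately overshoots the forecast, and the gap in expected profit between the two order quantities is the value a purely predictive approach leaves on the table.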
Data scientists have to stop making the same mistake of believing that they are building prediction engines. Their true role is to automate decisions and drive better outcomes with a prescriptive approach.
How to finally learn from our mistakes
We can’t keep spinning our wheels, making these mistakes over and over again. As data scientists, we’re holding our field back with these bad habits.
Data scientists have to recognize how they are mirroring the mistakes they made early in their careers, adjust, and finally move past these errors. The first time we make a mistake, it’s a good thing: we can learn and do better next time. When we make the same mistake repeatedly without ever noticing it, we have a problem. The only way to learn from our mistakes is to embrace a new path.
We should have learned these lessons already: it’s time to progress past the mistakes of beginners. We’ll surely make new mistakes as we move forward. Embrace them! They’ll bring us towards greater successes rather than keeping us stuck.
About the author
Fabrizio Fantini is the brain behind Evo. His 2009 PhD in Applied Mathematics, proving how simple algorithms can outperform even the most expensive commercial airline pricing software, is the basis for the core scientific research behind our solutions. He holds an MBA from Harvard Business School and has previously worked for 10 years at McKinsey & Company.
He is thrilled to help clients create value and loves building powerful but simple-to-use solutions. His ideal software has no user manual but enables users to stand on the shoulders of giants.