Okay, so the title of this article is somewhat apocalyptic, and I don’t wish ill on anyone. I am rooting for you and hope your project succeeds beyond expectations. The purpose of this article is not to put a voodoo curse on you and assure your project’s failure. Rather, it is a visit to the most common reasons why data science projects fail, in the hope of helping you avoid these pitfalls.
Asking the wrong question
If you ask the wrong questions, you will get the wrong answers. An example comes to mind from the financial industry and the problem of fraud identification. The initial question might be “Is this particular transaction fraudulent or not?”. To make that determination, you will need a dataset that contains examples of fraudulent and non-fraudulent transactions. Most likely this dataset will be generated with human help, i.e., the labeling of the data will be done by a group of subject matter experts (SMEs) trained to detect fraud. However, the experts will label the data based on the fraudulent behavior they have witnessed in the past, so a model trained on it will only catch fraud that fits the old patterns. If a bad actor finds a new way to commit fraud, the system will be unable to detect it. A potentially better question to ask could be “Is this transaction anomalous or not?”. A model framed this way would not look only for transactions that have proven to be fraudulent in the past; it would look for transactions that don’t fit the “normal” signature of a transaction. Even the most sophisticated fraud detection systems rely on humans to further analyze predicted fraudulent transactions and verify the model results, and one side effect of the anomaly-based approach is that it will more than likely generate more false positives than the previous model.
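As a minimal sketch of the anomaly-framed question, an unsupervised detector such as scikit-learn’s IsolationForest can flag transactions that don’t fit the “normal” signature; the feature columns, contamination rate, and synthetic data below are illustrative assumptions, not a production recipe.

```python
# Minimal sketch: flag anomalous transactions instead of replaying known fraud labels.
# The features (amount, hour of day, distance from home) are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=[50, 14, 5], scale=[20, 4, 3], size=(1000, 3))
odd = rng.normal(loc=[900, 3, 400], scale=[100, 1, 50], size=(10, 3))
transactions = np.vstack([normal, odd])

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(transactions)

# -1 marks transactions that do not fit the "normal" signature; a human analyst
# would still review these, which is why more false positives are expected.
flags = detector.predict(transactions)
print("flagged as anomalous:", int((flags == -1).sum()))
```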
Another favorite example of mine for this kind of failure also comes from the world of finance, courtesy of the investing legend Peter Lynch.
During his tenure at the Fidelity Magellan fund, Lynch trounced the market overall and beat it in most years, racking up a 29 percent annualized return. But Lynch himself points out a fly in the ointment. He calculated that the average investor in his fund made only about 7 percent during the same period. When Lynch would have a setback, for example, the money would flow out of the fund through redemptions. Then when he got back on track it would flow back in, having missed the recovery.
Which would have been a better question to answer?
- What will be the performance of the Fidelity Magellan fund for the next year?
- What will be the number of purchases or redemptions for the Fidelity Magellan fund next year?
Trying to use it to solve the wrong problem
A common mistake made here is not focusing on the business use case. When shaping your requirements, there should be a laser focus on one question: “If we solve this problem, will it substantially add value to the business?”. As you break the problem down into subtasks, the initial tasks should be the ones that answer this question. As an example, let’s say you come up with a great idea for an Artificial Intelligence product and now you want to start selling it. Let’s say it’s a service where you upload a full-body photograph to a website and the AI program determines your measurements so it can then create a tailored suit that fits your body type and size. Let’s brainstorm some of the tasks that would need to be performed to accomplish this project.
- Develop the AI/ML technology to determine body measurements from the photograph.
- Design and create a website and a phone app to interact with your customers.
- Perform a feasibility study to determine if there is a market for this offering.
As technologists, we are eager to design and code, so we might be tempted to start working on the first two tasks. As you can imagine, it would be a horrible mistake to perform the feasibility study only after completing the first two tasks, and then discover that there is no market for our product. The feasibility study should come first.
Not having enough data
Some of my projects have been in the life sciences domain, and one problem we have run into is the inability to obtain certain data at any price. The life sciences industry is very sensitive about storing and transmitting protected health information (PHI), so most available datasets scrub this information out. In some cases, this information would have been relevant and would have enhanced model results. For example, an individual’s location might have a statistically significant impact on their health: someone from Mississippi might have a higher probability of having diabetes than someone from Connecticut. But since this information might not be available, we won’t be able to use it.
Another example comes from the financial industry. Some of the most interesting and relevant datasets can be found in this domain, but much of this information is very sensitive and closely guarded, so access to it can be highly restricted. Without that access, relevant results will not be possible.
Not having the right data
Using faulty or dirty data can lead to bad predictions even if you have the best models. In supervised learning, we use data that has been previously labeled. In many cases, this labeling is done by a human, which can lead to errors. An extreme hypothetical example: even a model that would otherwise achieve perfect accuracy cannot overcome inaccurate labels. Think of the MNIST dataset for a moment. When we run our models against it, we assume that the human labeling of the images was 100% accurate. Now imagine a third of the digits are mislabeled. How difficult do you think it would be to produce any kind of decent results, regardless of how good your model is? The old adage of garbage in, garbage out is alive and well in the data science domain.
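To make the garbage-in, garbage-out point concrete, here is a minimal sketch using scikit-learn’s small digits dataset as a stand-in for MNIST. It flips roughly a third of the training labels and compares test accuracy; the exact numbers you get are illustrative only.

```python
# Minimal sketch: corrupt roughly a third of the training labels and compare
# test accuracy against a model trained on the clean labels.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
noisy = y_train.copy()
flip = rng.choice(len(noisy), size=len(noisy) // 3, replace=False)
noisy[flip] = rng.integers(0, 10, size=len(flip))  # replace with random digits

clean_model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
noisy_model = LogisticRegression(max_iter=5000).fit(X_train, noisy)

print("trained on clean labels:", clean_model.score(X_test, y_test))
print("trained on noisy labels:", noisy_model.score(X_test, y_test))
```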
Having too much data
In theory, you can never have too much data (as long as it’s correct data). In practice, even though there have been tremendous advances in storage and computing costs and performance, we are still bound by the physical constraints of time and space. So currently, one of the most important jobs a data scientist has is to judiciously pick out the data sources and features they think will have an impact on model predictions. As an example, let’s assume we are trying to predict baby birth weights. Intuitively, the mother’s age seems like a relevant feature to include, her name is probably not relevant, and her address might be. Another example that comes to mind is the MNIST dataset: most of the information in the MNIST images is in the center of the image, so we can probably get away with removing a border around the images without losing much information. Again, in this example, human intervention and intuition were needed to determine that removing a certain number of border pixels would have a minimal impact on predictions. One last approach is dimensionality reduction, using techniques such as Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE). Determining which features are going to be relevant before you run the models is still a hard problem for computers, but it is a field that is ripe with possibilities for automation. In the meantime, having too much data remains a potential trap that can derail your data science project.
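As a minimal sketch of that last idea, here is PCA applied to scikit-learn’s digits dataset (a small stand-in for MNIST), showing how a fraction of the original 64 pixel features can retain most of the variance; the component counts are arbitrary choices.

```python
# Minimal sketch: how much variance do a handful of principal components keep?
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 1797 samples x 64 pixel features

for n in (10, 20, 30):
    pca = PCA(n_components=n).fit(X)
    kept = pca.explained_variance_ratio_.sum()
    print(f"{n} components retain {kept:.0%} of the pixel variance")
```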
Hiring the wrong people
You wouldn’t trust your doctor to fix your car, and you shouldn’t trust your mechanic to perform your colonoscopy. If you have a small data science practice, you might have no choice but to rely on one or a few people to perform all tasks, from data gathering and procurement, data cleanup and munging, and feature engineering and generation, to model selection and deploying the model in production. But as your team grows, you should consider hiring specialists for each of these tasks. The skills required to be an ETL development expert do not necessarily overlap with the expertise of a Natural Language Processing (NLP) expert. In addition, for certain industries – biotech and finance come to mind – having deep domain knowledge might be valuable and even critical. However, pairing a subject matter expert (SME) with a data scientist who has good communication skills might be a suitable alternative. As your data science practice grows, staffing the right specialized resources is a tricky balancing act, but having the right talent pool is one of the most important keys to your practice’s success.
Using the wrong tools
There are many examples that come to mind here. One common pitfall is the proverbial “I have a hammer and everything looks like a nail now”. A more industry-specific example: you recently sent your team to train on MySQL, and now that they are back, you need to set up an analytics pipeline. With the training fresh in their minds, they suggest using their shiny new tool. However, based on the amount of data your pipeline will be processing and the analytics you need to run on the results, this might be the wrong choice for the job. Traditional relational databases can become difficult to scale and slow to query once individual tables grow very large. In that case, a better alternative might be a NoSQL offering like MongoDB or a highly scalable columnar data warehouse such as AWS Redshift.
Not having the right model
A model is a simplified representation of reality. These simplifications are made to discard unnecessary fluff, noise, and detail. A good model allows its users to focus on the specific aspect of reality that is important in a particular domain. For example, in a marketing application, keeping attributes such as customer email and address might be important, whereas in a medical setting a patient’s height, weight, and blood type might matter more. These simplifications are rooted in assumptions; the assumptions may hold under certain circumstances but not in others. This suggests that a model that works well under one scenario may not work in another.
There is a famous result known as the “No Free Lunch” (NFL) theorem, which states that there is no one model that works best for every problem. The assumptions of a good model in one domain may not hold in another, so it is not uncommon in data science to iterate over multiple models, trying to find the one that fits a given situation best. This is especially true in supervised learning. Validation or cross-validation is commonly used to assess the predictive accuracy of multiple models of varying complexity and find the most suitable one. In addition, a model that works well could be trained using multiple algorithms – for example, linear regression could be fit using the normal equation or using gradient descent.
Depending on the use case, it is critical to ascertain the trade-offs between speed, accuracy, and complexity of different models and algorithms and to use the model that works best for a given domain.
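As a minimal sketch of that last point, here is the same linear regression fit two ways with plain NumPy; the synthetic data, learning rate, and iteration count are arbitrary, and the intent is only to illustrate the speed-versus-scalability trade-off between a closed-form solve and an iterative one.

```python
# Minimal sketch: one linear regression, two training algorithms.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Normal equation: exact in one shot, but solving X^T X grows costly with many features.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: iterative and approximate, but scales to settings
# where the closed form is impractical. Learning rate and steps are arbitrary.
w_gd = np.zeros(3)
lr = 0.05
for _ in range(2000):
    grad = X.T @ (X @ w_gd - y) / len(y)
    w_gd -= lr * grad

print("normal equation :", np.round(w_closed, 3))
print("gradient descent:", np.round(w_gd, 3))
```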
Not having the right yardstick
In machine learning, it is critical to be able to evaluate the performance of a trained model. It is essential to measure how well the model performs against both the training data and the test data. This information is used to select the model, tune its hyperparameters, and determine whether the model is ready for production use.
To measure model performance, it is of the utmost importance to choose the best evaluation metrics for the task at hand. There is plenty of literature on metric selection, so we won’t delve too deeply into this topic, but some factors to keep in mind when selecting metrics are:
- The type of machine learning problem – supervised learning, unsupervised learning, or reinforcement learning.
- The type of supervised learning task – binary classification, multiclass classification, or regression.
- The dataset itself – if the dataset is imbalanced, accuracy can be misleading and a different metric, such as precision, recall, or F1, might be more suitable (see the sketch after this list).
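As a minimal sketch of the imbalanced-data point, the example below scores a lazy “always predict the majority class” model on a synthetic dataset with roughly 2% positives; the class ratio is an illustrative assumption.

```python
# Minimal sketch: on imbalanced data, accuracy looks great while recall and F1
# reveal that the model never catches a single positive case.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.02).astype(int)  # ~2% positive class (e.g., fraud)
y_pred = np.zeros_like(y_true)                  # always predicts the majority class

print("accuracy:", accuracy_score(y_true, y_pred))                 # misleadingly high
print("recall  :", recall_score(y_true, y_pred, zero_division=0))  # 0.0
print("f1      :", f1_score(y_true, y_pred, zero_division=0))      # 0.0
```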
Conclusion
There are many ways to fail and only one best way to solve a problem. The definition of success might also be measured against different yardsticks. Is the solution being sought a quick and dirty patch? Are you looking for a best-in-class solution? Is the emphasis on model training speed, inference endpoint availability, or the accuracy of the model? Maybe finding that a solution did not work can be considered a win, because now the focus can shift to other alternatives. I would love to hear your thoughts on this topic. Which other ways have you found to fail? What solutions have worked best for you? Feel free to discuss below or reach out on LinkedIn. Happy coding!