Learn how to fail in Data Science

Failure is a fact of life. Michael Jordan, arguably the greatest basketball player of all time, missed 12,345 shots. Even the best basketball players make fewer than half of their shots. To use an example from my own experience, I can rarely write more than 10 lines of code without a compilation or logic error on the first pass.

For most problems, the number of correct solutions is much smaller than the number of incorrect ones. To take a very basic example, the only correct answer to 2 plus 2 is 4, but there is an infinite number of incorrect answers.

At the enterprise level, many IT projects across an organization never make it to production. For a project to succeed, many things need to go right, such as:

  • The right tools need to be provided.
  • The right folks need to be in place for the different roles in the project.
  • There needs to be good communication between the business and the technology team to gather the appropriate requirements and to solve the right problem.
  • Useful datasets need to be available or generated.

If even one of these elements is missing, the whole project could be in jeopardy. Have I depressed you enough? How can we overcome this? We have a few options. Let’s explore them using basketball analogies.

Option 1: Don’t take the shot.

You can decide not to take the shot. You are guaranteed not to fail, but you are also guaranteed not to succeed. Not much of an option. Let’s keep looking.

Option 2: Only shoot layups.

You can also make sure that you only take easy shots. Don’t shoot any threes, and only shoot when no one is guarding you. This approach is not much better than the first option. You want to be able to tackle some hard problems and not just the easy ones. Let’s keep going.

Option 3: Get better.

You can practice your shooting and make a higher percentage of baskets by getting better. The data science equivalent might be to hone your craft and become more adept at applying the best algorithms for a specific set of problems. This is always a good option. You should always strive to become better and stay up to date with the latest technologies and tools. But we are not really breaking any new ground here. This is an obvious answer.

Option 4: Shoot like crazy.

Finally, one last option, and one that might not be as intuitive, is to shoot like crazy and not care too much whether the shot goes in. Keep shooting. If you do this on the basketball court, your teammates might call you a ball hog and not invite you back. But in data science, this might not be a bad idea. After all, at the heart of most machine learning algorithms, we are iterating through many possible answers and trying to converge on the “best” one. Furthermore, many of today’s frameworks have “AutoML” libraries that iterate through a set of machine learning algorithms to find out which one produces the best result. Two frameworks that come to mind with this kind of capability are h2o.ai and TensorFlow, but they are not the only two. Also, with the constant drop in CPU and GPU pricing, we can afford to “fail quickly”. We can try different sets of features to see which ones work best. With the advent of “robotic chemists”, we can run more experiments faster than ever before, which affords us the luxury of failing faster (and eventually coming up with some answers).
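To make "shoot like crazy" a little more concrete, here is a minimal sketch of trying every feature subset and keeping whichever scores best on held-out data. Everything in it is an assumption made for illustration: the synthetic dataset, the tiny nearest-centroid "model", and the function names (make_dataset, fit_centroids, search_feature_subsets) are hypothetical and not taken from any AutoML library. Real AutoML tools search over algorithms and hyperparameters as well, and do it far more cleverly.

```python
import itertools
import random


def make_dataset(n_rows, seed=0):
    """Hypothetical synthetic data: features 0 and 1 carry signal, feature 2 is noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        label = rng.randint(0, 1)
        informative_a = label * 2.0 + rng.gauss(0, 1.0)
        informative_b = label * 1.5 + rng.gauss(0, 1.0)
        noise = rng.gauss(0, 5.0)
        rows.append(([informative_a, informative_b, noise], label))
    return rows


def fit_centroids(rows, features):
    """Nearest-centroid 'model': the per-class mean of each selected feature."""
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in rows if y == label]
        centroids[label] = [sum(x[f] for x in members) / len(members) for f in features]
    return centroids


def accuracy(centroids, rows, features):
    """Fraction of rows whose closest centroid matches the true label."""
    correct = 0
    for x, y in rows:
        def dist(label):
            return sum((x[f] - c) ** 2 for f, c in zip(features, centroids[label]))
        predicted = min(centroids, key=dist)
        correct += predicted == y
    return correct / len(rows)


def search_feature_subsets(train, holdout, n_features):
    """Try every non-empty feature subset; most attempts 'miss', and that is fine."""
    results = []
    for size in range(1, n_features + 1):
        for features in itertools.combinations(range(n_features), size):
            model = fit_centroids(train, features)
            results.append((accuracy(model, holdout, features), features))
    results.sort(reverse=True)  # best hold-out accuracy first
    return results


data = make_dataset(400)
train, holdout = data[:300], data[300:]
results = search_feature_subsets(train, holdout, n_features=3)
for score, features in results:
    print(f"features={features} holdout accuracy={score:.2f}")
```

Each attempt here costs a fraction of a second, so most of them can fail without anyone minding. One caveat: the more shots you take against the same hold-out set, the more likely the winner just got lucky, so it is worth confirming the best candidate on data that none of the attempts has seen.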

So, go ahead. Give yourself permission to fail but make sure to set up your work environment to be able to fail fast.

Tips to fail fast

  • Make sure your “code, compile, test” cycle is as efficient and as fast as possible.
  • Consider using continuous integration and continuous delivery tools such as Jenkins.
  • Automate your software tests.
  • Learn test-driven development.
  • Convince your boss to get you lots of RAM and a high-powered CPU or GPU for your laptop.
  • Take advantage of cloud services built specifically for AI/ML: SageMaker on AWS, Cloud TPUs for TensorFlow from Google, and the Microsoft AI Platform in Azure, to name a few.
  • Embed continuous A/B testing in your production environments and put processes in place to measure the results.
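
As one possible way to measure an A/B test, here is a minimal sketch of a two-proportion z-test using only the Python standard library. The function name and the visitor and conversion counts are hypothetical, chosen purely for illustration. It is a rough first check, assuming reasonably large samples, not a full experimentation process.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Return (z, two-sided p-value) for the difference in rates between B and A."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled rate under the assumption that A and B perform the same.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: variant A converts 100 of 1000 visitors, variant B 150 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z={z:.2f}, p-value={p_value:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be chance alone. If you check results continuously while the test is still running, you will see more false positives than the p-value implies, so decide on the sample size up front where you can.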