One of the many hats I wear is that of a data scientist. In my opinion, the most important job of a data scientist is asking questions. More specifically, it is asking the right question or framing the right hypothesis. Correctly framing your research question can be the difference between making a life-changing breakthrough and wasting years on a useless experiment.
Asking the right question may seem trivial. Yet too often, when we get an interesting dataset, we jump straight into exploratory data analysis (EDA) without stopping to think about what the best questions to ask of the data are.
Let’s look at an example. An interesting financial domain to analyze, and one that I have spent a good amount of time thinking about, is loan default prediction. Given a particular set of features for a borrower and a loan, what is the likelihood that the borrower will default on the loan? Intuitively, certain features are going to be quite important: the borrower’s FICO score, how much money they make, the size of the loan, and so on. For a long time, the company Lending Club released much of the data for each of the loans that it had ever made. It appears that they stopped releasing it, but you can find a historical version of the dataset on Kaggle.
Now, from the previous paragraph, you may think we already have our question. An initial candidate question is:
Given a set of features for a borrower, will they default or not?
Are we done? I don’t think so. In some instances, a few fraudulent loans will slip through the cracks. In those cases, the borrower will likely never make any payments. The profile of such a borrower will differ from that of a borrower who intended to repay the loan but later suffered a financial hardship that caused them to default. Lending Club normally issues loans with terms of 36 or 60 months. So, a better question may be:
Given a set of features for a borrower, in which month (1-36 or 1-60) will the borrower default?
That makes the question tighter, although it changes the experiment from a binary classification problem into a multinomial one.
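To make the difference between these two framings concrete, here is a minimal Python sketch of how the target label could be derived under each question. The field names (`term_months`, `default_month`) are hypothetical names chosen for illustration, not actual Lending Club columns, and encoding a fully repaid loan as class 0 is my own assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Loan:
    # Hypothetical fields for illustration; not real Lending Club column names.
    term_months: int               # 36 or 60
    default_month: Optional[int]   # month of default, or None if fully repaid


def binary_label(loan: Loan) -> int:
    """First question: did the borrower default (1) or not (0)?"""
    return 0 if loan.default_month is None else 1


def month_label(loan: Loan) -> int:
    """Second question: in which month did the borrower default?

    Assumption: class 0 means "never defaulted"; classes 1..term_months
    are the month in which the default happened.
    """
    if loan.default_month is None:
        return 0
    if not 1 <= loan.default_month <= loan.term_months:
        raise ValueError("default_month must fall within the loan term")
    return loan.default_month


# Made-up loans, one per scenario discussed above.
loans = [
    Loan(term_months=36, default_month=None),  # repaid in full
    Loan(term_months=36, default_month=1),     # no payments made: possible fraud
    Loan(term_months=60, default_month=41),    # hardship late in the loan
]

binary_labels = [binary_label(loan) for loan in loans]
month_labels = [month_label(loan) for loan in loans]
print(binary_labels)
print(month_labels)
```

Notice that the binary label cannot tell the second and third borrowers apart, while the month label can.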
Can this question be improved? Yes. As the quote from J. Paul Getty goes:
“If you owe the bank $100, that’s your problem. If you owe the bank $100 million, that’s the bank’s problem.”
Lenders are going to gain or lose more money from the higher-balance loans than from the lower-balance ones. So, an even better question is:
Given a set of features for a borrower, how much money will we get back?
If the borrower makes all the payments, we will get the full principal back plus interest. In some cases, even though the borrower may default late in the life of the loan, we may still get the principal back and some of the interest. In some cases, we will take a loss on the principal. So, the experiment now goes from a multinomial classification problem to a regression problem.
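A minimal sketch of what that regression target could look like follows. The function name `net_return` and its parameters are hypothetical, the dollar amounts are made up, and defining the outcome as total payments received minus principal is a simplifying assumption that ignores fees, recoveries, and the time value of money.

```python
def net_return(principal: float, total_received: float) -> float:
    """Profit (positive) or loss (negative) on a single loan.

    Simplifying assumption: outcome = total payments received - principal.
    """
    return total_received - principal


# Made-up numbers for illustration only.
examples = {
    "paid in full": net_return(10_000.0, 11_500.0),
    "late default": net_return(10_000.0, 10_400.0),
    "small early default": net_return(1_000.0, 200.0),
    "large early default": net_return(100_000.0, 20_000.0),
}

for name, value in examples.items():
    print(f"{name}: {value:+,.2f}")
```

Both early defaults would carry the same label in a classification setup, but as a regression target one is a loss of 800 and the other a loss of 80,000, which is the distinction the lender actually cares about.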
Our results will now be able to provide a much more accurate answer about how much profit or loss these loans will generate. We can continue to refine this question even further. For example, certain features, such as race or sex, are illegal to use, so we can add these caveats to our question. But by now the main concept should be clear. We want to spend a decent amount of time crafting an appropriate question, because doing so will save an immense amount of time in data collection, data cleansing, and other steps later in the process.
An insight that I hope you will find interesting is that the concept of carefully crafting questions and judiciously setting goals applies to many other aspects of our careers and lives, not just to data science.
One of my other hats is coaching some of my colleagues and helping them get certified in Amazon Web Services (AWS). During my coaching calls, I normally make this statement:
“If you come back and tell me that you scored 95/100 on the exam, I will congratulate you on your achievement, but I will also let you know that you actually failed.” The passing score for most of the AWS certifications is around 75%. At our firm, many of my colleagues have at least one AWS certification. For that reason, to stand out from the crowd you want to obtain 3 or more of these certifications. If you score 95% on one of the exams, you probably could have taken and passed it months earlier, and you could already be well on your way to a second certification.
Regardless of your current occupation or role, whether you are a doctor, an architect, a parent, or a hardware engineer, make sure you regularly dedicate some time to refining your question and your goal. You may find that this time is well spent and saves you a copious amount of time later on.
Don’t be the person who spends years of their life trying to accomplish a goal or climbing a mountain, only to find out later that they climbed the wrong mountain. What is your question?
If you want to learn more about AWS, check out my latest book, “AWS for Solution Architects”, which recently launched.