Four More Questions You Might Get in a Data Science Interview
Part 2 of a miniseries crafted by somebody with experience as a data science interviewer
Hello again, friends! Welcome back for the continuation of our series on interview questions you might get in a data science interview. In case you missed the first post, you can check it out at this link.
I’m going to keep this introduction short and sweet since I give a better, more thorough explanation in the original post. As a very quick recap on how these posts are structured: across all four questions, you will see two subsections, “Motivation” and “Potential Answers.”
Also, one last reminder: the questions I’ve crafted across both posts run the full gamut of data science skills. Depending on the specific role you apply for, you may or may not encounter questions similar to these, but I still wanted to provide a broad range of questions to cover that “full stack” of data science skills.
Alrighty, I promised a short and sweet introduction, so let’s jump into the questions!
1. Across various cloud platforms (including AWS SageMaker), it is possible to deploy a model directly from a Jupyter notebook. Can you explain why this is not a preferable pattern, and what you might consider doing instead?
Motivation: This is one of those questions that has less to do with “pure” data science and more to do with software engineering in general, and it’s also a question where the final answer matters less than how you reason your way to it. In short, the reason this is not a preferable pattern is that it doesn’t adhere to the core software engineering tenets of efficiency and resiliency. What I mean by that is that this “deploy in a notebook” process is a clunky one, especially if you want to scale it across hundreds of models.
This “deploy in a notebook” pattern is highly inefficient from many perspectives. I’ll rapid-fire those inefficiencies in the bullets below:
Potential Answers: I started to slip in some of the answers as part of the “Motivation” section, but just to make sure you’re extra clear on what those are, let’s reiterate them here. Again, there’s no “one size fits all” answer, so allow me to share a number of different options for answers your interviewer would like to hear:
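To make one of those alternatives concrete, here is a minimal, hypothetical sketch of deploying the model as its own version-controlled service rather than from a notebook cell: the model artifact is serialized during training, and a small, standalone script serves it behind a REST endpoint that a CI/CD pipeline can build and deploy on its own. Everything here (the FastAPI framework choice, the “model.pkl” file name, the /predict route) is an illustrative assumption, not something any particular cloud platform prescribes.

```python
# A minimal sketch (not a full deployment pipeline): the model is trained and
# serialized elsewhere, and this standalone, version-controlled script serves it
# as a REST API. "model.pkl" and the /predict route are illustrative assumptions.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the serialized model once at startup rather than inside a notebook cell.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictionRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictionRequest):
    # Wrap the single observation in a list because scikit-learn style models
    # expect a 2D array of shape (n_samples, n_features).
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}
```

Because a script like this lives in source control like any other application code, it can be reviewed, tested, and redeployed automatically across hundreds of models, which is exactly where the notebook pattern falls apart.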
2. Your company has deployed hundreds of models and wants to ensure each remains performant by producing a consistently expected range of inferences. What is the concept called when the model’s performance degrades over time, and what steps might you take to ensure this degradation does not occur amongst this large number of models?
Motivation: Because businesses use the inferential output of predictive models to support business decisions, you naturally want that inferential output to be as accurate as possible. A model that produces adverse results not only leads to poor business decision making, but it can also have some pretty severe ramifications. For example, if Tesla’s self-driving algorithms stopped working properly, that could literally cost people their lives if Tesla cars got into accidents as a result of poor algorithm performance.
While you could have a person manually checking each model to ensure it is performing correctly, the question specifically notes that hundreds of models need to remain performant. Obviously, that becomes too unwieldy to manage with manual tests. All that said, the motivation here is to ensure the candidate understands what model performance degradation is and how to manage it at a more automated level. We’ll discuss specific potential answers in the next section, but I wouldn’t worry too much about memorizing all the potential metrics I list down there. Rather, I’d focus more on the general concept that you need to protect the performance of your models in an automated fashion.
Potential Answers: There’s only one real answer to the question about what this concept is called, and that is model drift. Model drift is the concept that because the data inputs fed into a model may change over time from the originally expected range of values, the model may not perform optimally the longer it remains untouched in production. (This is not to be confused with encountering totally new values, which we will discuss with the next question.)
From an automation perspective, the way to assess model drift over time is by capturing various metrics that quantify it numerically. There isn’t “one metric to rule them all.” If you want to assess the quality of a binary classification model, you might look at metrics like precision, recall, ROC AUC, or F1 score. For a regression model, you might look at metrics like root mean squared error (RMSE), R-squared, or mean absolute error (MAE). It might also behoove you to run data quality metrics, like the population stability index (PSI), to ensure that more recent data hasn’t drifted too far away from the original training data.
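Since PSI isn’t something you’ll find in scikit-learn out of the box, here is a minimal sketch of one common way to compute it, assuming you’ve held onto a sample of the original training data to compare against recent production data. The choice of 10 bins and the small epsilon for empty bins are conventional but arbitrary, and rule-of-thumb interpretation thresholds (e.g., PSI above roughly 0.2 signaling meaningful drift) vary by team.

```python
# A minimal sketch of the population stability index (PSI) between a training
# ("expected") distribution and a recent production ("actual") distribution.
# The bin count and the epsilon for empty bins are common but arbitrary choices.
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    # Derive bin edges from the training data so both distributions are bucketed identically.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid division by zero / log of zero for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Example: compare a training feature against the same feature from recent scoring data.
rng = np.random.default_rng(42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
recent_feature = rng.normal(loc=0.3, scale=1.2, size=10_000)  # shifted distribution
print(population_stability_index(training_feature, recent_feature))
```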
Based on the outcomes of these metrics, you could build something like a “tiered” automation for handling model drift, perhaps a “stoplight”-like tier system. Models in the green tier would be good to go. Models in the yellow tier would produce an automated notification to the model owner with some message like “Hey, we’re seeing a little iffy-ness here. You might want to look into this.” And for models in the red tier, you might want to have an automated retrain / redeploy mechanism in place. That last tier gets into the very popular concept called MLOps, and while I’m not going to discuss it here, I think your interviewer would be impressed if you could articulate MLOps well, particularly in this context.
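Here is a minimal sketch of what that stoplight logic could look like if you key it off a single drift metric like PSI. The thresholds and the notify/retrain actions are illustrative assumptions; in practice, the red tier would hand off to whatever MLOps tooling handles your retraining and redeployment.

```python
# A minimal sketch of a "stoplight" drift policy. The PSI thresholds (0.1 and 0.25)
# and the action names are illustrative assumptions, not industry-mandated cutoffs.
from enum import Enum

class DriftTier(str, Enum):
    GREEN = "green"    # good to go
    YELLOW = "yellow"  # notify the model owner
    RED = "red"        # trigger automated retrain / redeploy

def classify_drift(psi: float, warn_threshold: float = 0.1, critical_threshold: float = 0.25) -> DriftTier:
    if psi >= critical_threshold:
        return DriftTier.RED
    if psi >= warn_threshold:
        return DriftTier.YELLOW
    return DriftTier.GREEN

def handle_model(model_name: str, psi: float) -> None:
    tier = classify_drift(psi)
    if tier is DriftTier.YELLOW:
        print(f"[notify] {model_name}: drift looks a little iffy (PSI={psi:.2f}), please review.")
    elif tier is DriftTier.RED:
        print(f"[retrain] {model_name}: drift is severe (PSI={psi:.2f}), kicking off retrain/redeploy.")

# Example run across a few hypothetical models.
for name, psi in [("churn_model", 0.05), ("pricing_model", 0.18), ("fraud_model", 0.40)]:
    handle_model(name, psi)
```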
3. A model you deployed to production has been generating an expected range of inferences just fine for months now. You come into the office one morning and discover that the model has started producing radically different inferences in the last 24 hours. Why might the model all of a sudden be generating different inferences, and what might you do to fix this?
Motivation: At first blush, this might sound like another software engineering question since we’re talking about dealing with a malfunctioning model in production, but that is actually not the case here. Notice that in this scenario, the model is still producing inferences, meaning that the software solution undergirding the model is operating just fine. Now, if I had phrased the question differently and noted that the model was not producing any inferences at all, then yeah, that would probably be more of a software engineering problem.
Now, you might be thinking that this is a model drift issue like we covered with the last question, but that’s not exactly the problem here. The real root cause is most likely a data quality problem. Model drift is more typically associated with longer periods of time, whereas the question here notes an overnight shift. Again, this is likely a data quality issue where some upstream source probably made a radical new change to the data they are sending to your model. For example, let’s say you have a model trained to forecast prices for ice cream sales. If the model is trained only on a dataset where the flavors are “chocolate” or “vanilla”, then your model is going to freak out if your upstream source starts sending info about new flavors like “strawberry” and “cookie dough” and “rum raisin” and “rocky road” overnight.
So basically, your interviewer is looking to see that you understand that data quality can radically alter the performance of your model via the introduction of totally new, totally unexpected values. Now let’s jump into how you might handle this situation.
Potential Answers: This is one of those questions where the range of potential answers is actually quite limited. Specifically, I can only think of two feasible options here:
Actually, the best solution would probably be a combination of both options above. You could suggest the second option as a quick band-aid measure for now, which would buy you some time to formally retrain the model with the new values. Then, when the new model is trained and properly validated, you can deploy it and ask your upstream source to release the floodgates on records with the new values!
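Sticking with the ice cream example, here is a minimal sketch of that band-aid measure: screen incoming records against the categorical values the model was actually trained on, score the ones it knows how to handle, and hold the rest back until the retrained model ships. The “flavor” column and the known-values set are illustrative assumptions.

```python
# A minimal sketch of the "band-aid" option: hold back records whose categorical
# values the model never saw in training, and let the rest flow through for scoring.
# The column name "flavor" and the known-values set are illustrative assumptions.
import pandas as pd

KNOWN_FLAVORS = {"chocolate", "vanilla"}  # values present in the original training data

def split_by_known_values(df: pd.DataFrame, column: str, known_values: set) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (scoreable_rows, held_back_rows) based on whether the column's value was seen in training."""
    is_known = df[column].isin(known_values)
    return df[is_known], df[~is_known]

incoming = pd.DataFrame({
    "flavor": ["chocolate", "rocky road", "vanilla", "rum raisin"],
    "units_sold": [120, 45, 98, 12],
})

scoreable, held_back = split_by_known_values(incoming, "flavor", KNOWN_FLAVORS)
print(f"Scoring {len(scoreable)} rows now; holding back {len(held_back)} rows until the retrained model ships.")
```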
4. You are preparing to deploy a model into a production environment as a consumable software solution (e.g. a real-time API or batch cronjob). Aside from creating the solution itself, what considerations might you take into account when deploying the solution to your production environment?
Motivation: This last question embraces the software engineering side of data science. As a software product, your modeling solution will need to adhere to the same best practices as any other software-engineered product. Specifically, the interviewer is going to look for evidence that you can perform at least three specific categories of activities: automating the deployment, creating automated tests, and ensuring best security practices.
Potential Answers: Of all the questions in this post, this one probably has the widest range of possible answers. In the “Motivation” subsection, we talked about three specific categories of activities that the interviewer will be looking for. Let’s rapid-fire bulleted lists for each of these categories.
Starting off with automating the deployment:
The next category is creating automated tests. Before sharing a handful of options here, I should note that I actually have a full, separate blog post specifically dedicated to many different tests you can create for a model deployment.
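Without rehashing that whole post, here is a minimal sketch of what a couple of automated tests could look like with pytest. The “model.pkl” artifact, the feature count, and the assumption that the model exposes predict / predict_proba are all illustrative stand-ins for your real deployment.

```python
# A minimal sketch of automated tests for a deployed (or about-to-be-deployed) model.
# It assumes a serialized model at "model.pkl" and a known feature count of 4; both
# are illustrative stand-ins for whatever your real artifact and schema look like.
import pickle

import numpy as np
import pytest

N_FEATURES = 4

@pytest.fixture(scope="module")
def model():
    with open("model.pkl", "rb") as f:
        return pickle.load(f)

def test_model_produces_inferences(model):
    # The model should return one prediction per input row without raising.
    sample = np.zeros((3, N_FEATURES))
    predictions = model.predict(sample)
    assert len(predictions) == 3

def test_predictions_within_expected_range(model):
    # For a binary classifier exposing predict_proba, probabilities must sit in [0, 1].
    sample = np.zeros((1, N_FEATURES))
    probabilities = model.predict_proba(sample)
    assert np.all((probabilities >= 0) & (probabilities <= 1))
```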
Finally, the last category is ensuring best security practices. We unfortunately live in a world with many bad actors, so we need to properly secure our modeling solutions to protect both the integrity of the model and the data it handles, which could very well be customer data.
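As one small, concrete example of that last category, here is a minimal sketch of keeping credentials out of the codebase by pulling them from environment variables (or, better yet, a dedicated secrets manager) at runtime rather than hardcoding them. The variable names below are illustrative assumptions.

```python
# A minimal sketch of one security practice: keeping credentials out of the codebase
# by reading them from environment variables (or a secrets manager) at runtime.
# The variable names below are illustrative assumptions.
import os

def get_database_credentials() -> dict:
    user = os.environ.get("MODEL_DB_USER")
    password = os.environ.get("MODEL_DB_PASSWORD")
    if not user or not password:
        # Fail fast and loudly rather than falling back to a hardcoded default.
        raise RuntimeError("Database credentials are not set in the environment.")
    return {"user": user, "password": password}
```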