Finding the Needle in the Haystack

Sometimes accuracy is not the metric we need. One alternative is sensitivity, which measures, of those who are actually targets, how many the model correctly identifies. Sensitivity can be the metric of choice over accuracy when you are dealing with a rare event such as a terrorist attack or even student retention. It is always important to understand which metrics you are optimising your models on.

Michael DeWitt https://michaeldewittjr.com
06-09-2019

One of the challenges in any kind of prediction problem is understanding the impact of a) not identifying the target and b) falsely identifying the target. To put it into context, if you are trying to use an algorithm to identify those with Ebola, what is the risk of missing someone (they could infect others with the disease) versus identifying someone who does not have the disease as having it (they get quarantined and have their life disrupted)? Which is worse? That isn’t a statistical question; it is a contextual, even an ethical, question (ok, yes, you could also apply a loss function here and use it to find a global minimum should one exist, but the choice of loss function is still a choice).

Typically we use two terms to talk about this problem: sensitivity (of all “targets”, how many did the model correctly identify?) and specificity (of those who were not targets, what proportion were correctly identified as not a target?).
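As a concrete illustration, here is a minimal sketch of how the two quantities fall out of a confusion matrix. The counts are completely made up for illustration and have nothing to do with the data below.


# Hypothetical confusion-matrix counts, purely for illustration
true_positive  <- 30    # targets the model flagged
false_negative <- 10    # targets the model missed
true_negative  <- 900   # non-targets the model cleared
false_positive <- 60    # non-targets the model flagged

sensitivity <- true_positive / (true_positive + false_negative)   # 0.75
specificity <- true_negative / (true_negative + false_positive)   # ~0.94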

Data Generating Process

As always, I want to build some simulated data to understand this problem. Let’s assume that we are trying to identify methanol levels in moonshine. Moonshine is typically homemade, illicit, high-alcohol spirits. The alcohol level is increased through distillation: ethanol, the desired alcohol, boils at roughly 78 deg C, while methanol boils at around 65 deg C. Methanol has some toxic side effects, so you really don’t want any in your cocktail. So let’s make some fake data with some measurements.


library(tibble)

n <- 500

# Simulated process measurements for n batches
pot_temp         <- rnorm(n, 78, 10)   # pot/still temperature (deg C)
mash_wt          <- rnorm(n, 50, 2)    # mash weight
ambient_temp     <- rnorm(n, 30, 7)    # ambient temperature
ambient_humidity <- rnorm(n, 82, 12)   # ambient humidity

# In this simulation, batches run at lower pot temperatures end up with
# more methanol
methanol_content <- ifelse(pot_temp < 65,
                           .5 * mash_wt - ambient_humidity * ambient_humidity / 1000,
                           .2 * mash_wt - ambient_humidity * ambient_humidity / 1000)

moonshine_data <- tibble(pot_temp,
                         mash_wt,
                         ambient_temp,
                         ambient_humidity,
                         methanol_content)

Let’s also assume that methanol content greater than 17 is toxic (these numbers are completely made up).
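A minimal sketch of how that flag could be added and tallied; the column name `toxic` and the dplyr approach are my assumptions, not necessarily what the original code used.


library(dplyr)

# Flag a batch as toxic when methanol content exceeds the (made-up) threshold
moonshine_data <- moonshine_data %>%
  mutate(toxic = as.integer(methanol_content > 17))

moonshine_data %>%
  count(toxic) %>%
  mutate(percent = 100 * n / sum(n))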

So let’s see what we have in our data:

Table 1: Incidence of Toxicity in Total Data Set

  Toxicity   Count   Percent
  0          466     93.2%
  1          34      6.8%

Yikes! We have a rare-case problem. Given our data, the outcome we are targeting doesn’t happen very often. So let’s see what happens when we try to model this:


library(dplyr)

# 70/30 train-test split
dat_training <- sample_frac(moonshine_data, .7)
dat_testing  <- setdiff(moonshine_data, dat_training)

We could build a very simple binomial regression, but a model that simply guessed that every sample will pass would be correct more than 90% of the time!

Table 2: Incidence of Toxicity in Training Data

  Toxicity   Count   Percent
  0          323     92.3%
  1          27      7.7%
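The metrics table further down labels the model an elastic net fit, and the coefficient printout has the shape glmnet produces, so here is a minimal sketch of how such a penalized logistic regression might be fit. This is a reconstruction, not the original code; in particular the mixing parameter alpha = 0.5 is an assumption.


library(dplyr)
library(glmnet)

# Predictor matrix and 0/1 outcome for the training data
x_train <- as.matrix(select(dat_training, pot_temp, mash_wt,
                            ambient_temp, ambient_humidity))
y_train <- dat_training$toxic

# Cross-validate the penalty for a binomial (logistic) elastic net
cv_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0.5)

# Penalty selected by cross-validation; coef(cv_fit, s = "lambda.min")
# returns the coefficients at that penalty
cv_fit$lambda.min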

[1] 0.001986915

Let’s see what our model extracted:


5 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)      30.80897725
ambient_humidity -0.08563842
ambient_temp     -0.02641847
mash_wt           0.29155874
pot_temp         -0.59700118

          
             Truth
  Prediction Positive Negative
    Positive        3        2
    Negative        4      141

Table 3: Estimates from Elastic Net Fit

  .metric    .estimator   .estimate
  accuracy   binary       0.96
  kap        binary       0.48
  sens       binary       0.43
  spec       binary       0.99
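To make the evaluation concrete, here is a sketch of how a confusion matrix and these metrics could be computed on the test set with yardstick, continuing the glmnet reconstruction above; the object and column names are my own, not the original code’s.


library(yardstick)

# Predictor matrix for the held-out data and predicted classes at the chosen penalty
x_test   <- as.matrix(select(dat_testing, pot_temp, mash_wt,
                             ambient_temp, ambient_humidity))
pred_lab <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "class")[, 1]

# Recode 0/1 to labeled factors, with "Positive" (toxic) as the first level
results <- tibble(
  truth    = factor(if_else(dat_testing$toxic == 1, "Positive", "Negative"),
                    levels = c("Positive", "Negative")),
  estimate = factor(if_else(pred_lab == "1", "Positive", "Negative"),
                    levels = c("Positive", "Negative"))
)

conf_mat(results, truth = truth, estimate = estimate)
metric_set(accuracy, kap, sens, spec)(results, truth = truth, estimate = estimate)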

So most of the estimates look good, but the sensitivity tells us the model correctly identified fewer than half of the toxic batches. Is this good enough? For a consumer with no knowledge of the potential risk, I don’t think so. The next step would be to tune the model to improve the sensitivity until it reaches an acceptable level. This could be as simple as accepting a higher false positive rate. We can look at this trade-off with the ROC curve.
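One way to explore that trade-off, continuing the sketch above, is to work with the predicted probabilities directly: the ROC curve traces sensitivity against the false positive rate as the cutoff moves, and lowering the cutoff below 0.5 buys sensitivity at the cost of more false positives. Again, this is illustrative, not the original code.


# Predicted probability of the toxic class on the test set
p_toxic <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "response")[, 1]

# ROC curve: yardstick expects the probability of the first factor level ("Positive");
# autoplot(roc_data) would draw the curve
roc_data <- roc_curve(results %>% mutate(.pred_Positive = p_toxic),
                      truth = truth, .pred_Positive)

# Lowering the cutoff trades false positives for sensitivity
cutoff  <- 0.2
flagged <- factor(if_else(p_toxic > cutoff, "Positive", "Negative"),
                  levels = c("Positive", "Negative"))
sens_vec(results$truth, flagged)
spec_vec(results$truth, flagged)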

Figure 1: ROC Curve for Fitted Model

Still not great. So then what do we do? Accept a higher false positive rate? Or do we build a better model with different data? What if we could inform the distiller about the process parameters that matter? Then we could fix the problem at the source. Perhaps that’s the best bet.
