How Discerning is the Technical Challenge in GBBO?

My wife and I love to watch the Great British Bake Off on Netflix. The competition is for the most part collegial in general and all around feel good television, especially at night. After watching several seasons of the show, a lingering question came to mind: how good are the judges at estimating talent?

The format of the show is composed of three rounds; the first and third rounds have themes/genre of baked good that the contestants knew about in advance and could decide (and practice) what they wanted to make. The second/middle round is composed of a “technical” challenge where the bakers have all given the same ingrediants and instructions and asked to make something of which they had no prior knowledge. Unlike the other rounds, the judges judge each bake blindly in the technical (and of course the contestants make the same dish).

Enter the Pyschometrics

This set-up is perfect for understanding how well the judges can estimate “ability” to use a psychometric term. Because we have contestants facing the same exact challenge and being judged blindly, we can use pyschometric tools to judge the “ability” of the baker and the “difficulty” of the challenge. There is a lot of noise on these measures due to the fact that contestants are eliminated after each show meaning that they do not get a chance at each challenge, but it will give a little bit of insight the ability of the bakers. We can then compare the outcomes of each round with the judged “technical” ability.

IRT and CRM

High stakes tests like the GRE and GMAT use something called Item Response Theory (IRT) to measure “ability.” The tests work by matching item difficulty (or how hard a question is) to the test-taker’s latent ability (tendency to get the right answer). Test takers should get the correct answer for those items where their ability is greater than the item difficulty, should get those items wrong where the difficulty is greater than their ability, and some distribution due to the measurement error in both ability and difficulty.

Not mentioned here, but a random component for guessing can be added. The literature for IRT is immense and there are many different models (2PL,3PL, Rasch, among others).

The key contribution of IRT over classical test theory (in my opinion) is that there is some latent noise in the test question/item.

IRT typically required a single “correct” answer. When we are looking at the rating of Bakers from 1 to N bakers, we need to observe the continuous data of the data. Enter the Continuous Response Models which allow us to use the principles of IRT for continuous data. In particular we will use Samejima’s continuous response model for the ranking of contestants.

Analysis Plan

So now we can lay out our analysis plan:

  • Get the baking results
  • Run the CRM to understand and rank baker ability
  • Compare the modelled ability to the actual results

Getting the Data

The first part in this analysis is getting the data. Luckily, Wikipedia, the grandest resource on the interweb, provides these data in a regular pattern.

First, we will load the usual suite of packages for webscraping and analysis.

library(tidyverse)
library(rvest)
library(data.table)
library(cmdstanr)

To run an initial test, I am just going to pull Season 3.

url <- "https://en.m.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_3)"
content <- read_html(url)
tables <- content %>% html_table(fill = TRUE)

We can see that there are 18 available. The second table gives us the biographies of the contestants:

knitr::kable(tables[[2]])
BakerAgeOccupationHometownLinks
Brendan Lynch63Recruitment consultantSutton Coldfield44
Cathryn Dresser27Shop assistantPease Pottage, West Sussex55
Danny Bryden45Intensive care consultantSheffield66
James Morton21Medical studentHillswick, Shetland Islands77
John Whaite22Law studentWigan88
Manisha Parmar27Nursery nurseLeicester
Natasha Stringer36MidwifeTamworth, Staffordshire
Peter Maloney43Sales managerWindsor, Berkshire
Ryan Chong38PhotographerBristol99
Sarah-Jane Willis28Vicar’s wifeBewbush, West Sussex55
Stuart Marston-Smith26PE teacherLichfield, Staffordshire1010
Victoria Chester50CEO of the charity PlantlifeSomerset1111

With a little bit of work, we can turn the third table into a nice representation of the results.

tables[[3]] %>%
  as.data.table() %>%
  .[-1,1:11] %>%
  setNames(c("baker", sprintf("%s",1:10))) %>%
  melt(id.vars = "baker") %>%
  .[,round_num:=as.numeric(variable)] %>%
  filter(!value %in% c("", "SB")) %>%
  mutate(perf = sprintf("%s %s", value, round_num))->performance

knitr::kable(performance)
bakervariablevalueround_numperf
Brendan1SAFE1SAFE 1
James1SAFE1SAFE 1
Danny1SAFE1SAFE 1
Cathryn1HIGH1HIGH 1
Ryan1SAFE1SAFE 1
Sarah-Jane1SAFE1SAFE 1
Manisha1SAFE1SAFE 1
Stuart1LOW1LOW 1
Peter1SAFE1SAFE 1
Natasha1OUT1OUT 1
Brendan2HIGH2HIGH 2
James2HIGH2HIGH 2
Danny2SAFE2SAFE 2
Cathryn2SAFE2SAFE 2
Ryan2SAFE2SAFE 2
Sarah-Jane2SAFE2SAFE 2
Manisha2SAFE2SAFE 2
Stuart2LOW2LOW 2
Victoria2LOW2LOW 2
Peter2OUT2OUT 2
Brendan3LOW3LOW 3
Danny3SAFE3SAFE 3
Cathryn3SAFE3SAFE 3
Ryan3SAFE3SAFE 3
Sarah-Jane3SAFE3SAFE 3
Manisha3SAFE3SAFE 3
Stuart3HIGH3HIGH 3
Victoria3OUT3OUT 3
James4SAFE4SAFE 4
Danny4HIGH4HIGH 4
Cathryn4SAFE4SAFE 4
Ryan4LOW4LOW 4
Sarah-Jane4SAFE4SAFE 4
Manisha4LOW4LOW 4
Stuart4OUT4OUT 4
Brendan5HIGH5HIGH 5
James5SAFE5SAFE 5
Danny5LOW5LOW 5
Cathryn5SAFE5SAFE 5
Sarah-Jane5LOW5LOW 5
Manisha5OUT5OUT 5
James6LOW6LOW 6
Danny6HIGH6HIGH 6
Cathryn6SAFE6SAFE 6
Ryan6SAFE6SAFE 6
Sarah-Jane6LOW6LOW 6
Brendan7HIGH7HIGH 7
James7LOW7LOW 7
Cathryn7LOW7LOW 7
Ryan7OUT7OUT 7
Sarah-Jane7OUT7OUT 7
Brendan8SAFE8SAFE 8
Danny8LOW8LOW 8
Cathryn8OUT8OUT 8
Brendan9HIGH9HIGH 9
Danny9OUT9OUT 9
Brendan10Runner-up10Runner-up 10
James10Runner-up10Runner-up 10

Now the more challenging part is to parse all of the results. I am going to use some loops and index variables because I can’t think of a more expediant way to do it.

Importantly, each baker will appear for as many challenges in which they participated. This means someone who was eliminated after the first show will only have one record (enter measurement error) and those who participated in later rounds will appear multiple times.

technicals <- list()
z <- 1
for(i in seq_along(tables)){
  x <- tables[[i]]

  interesting <- grepl(pattern = "Baker|Technical", names(x))

  if(sum(interesting)<2){
    next()
  }

  y <- x[,interesting]

  names(y) <- c("baker", "technical")

  y$technical_no <- z
$
  technicals[[i]] <- y
  z <- 1+z

}

out_long<- do.call(rbind, technicals)

setDT(out_long)

knitr::kable(head(out_long,10))
bakertechnicaltechnical_no
Brendan10th1
Cathryn5th1
Danny7th1
James2nd1
John11th1
Manisha6th1
Natasha12th1
Peter3rd1
Ryan8th1
Sarah-Jane1st1

Now with a little nore parsing we can extract the result and associated rank of the bakers.

out_long[,rank := as.numeric(stringr::str_extract(technical, "\\d+"))]

out_long[,rank_ordered:=12 - rank]

out_long[,baker_id := as.integer(as.factor(baker))]

knitr::kable(head(out_long,10))
bakertechnicaltechnical_norankrank_orderedbaker_id
Brendan10th11021
Cathryn5th1572
Danny7th1753
James2nd12104
John11th11115
Manisha6th1666
Natasha12th11207
Peter3rd1398
Ryan8th1849
Sarah-Jane1st111110

Modelling the Data

In completely transparency, I utilized code from https://cengiz.me/posts/crm-stan/ which provided an excellent starting point for the analysis.

The code is lightly modified (just to tighten some priors) because of the

writeLines(readLines("irt.stan"))
 // From https://cengiz.me/posts/crm-stan/
 data{
    int  J;                    //  number of items
    int  I;                    //  number of individuals
    int  N;                //  number of observed responses
  array [N] int  item;          //  item id
  array[N] int  id;            //  person id
  array[N] real Y;             //  vector of transformed outcome
 }

  parameters {

    vector[J] b;                 // vector of b parameters forJ items
      real mu_b;                 // mean of the b parameters
      real<lower=0> sigma_b;     // standard dev. of the b parameters

    vector<lower=0>[J] a;       // vector of a parameters for J items
      real mu_a;                 // mean of the a parameters
      real<lower=0> sigma_a;     // standard deviation of the a parameters

    vector<lower=0>[J] alpha;   // vector of alpha parameters for J items
      real mu_alpha;             // mean of alpha parameters
      real<lower=0> sigma_alpha; // standard deviation of alpha parameters

    vector[I] theta;             // vector of theta parameters for I individuals
  }

  model{

     mu_b     ~ normal(0,5);
     sigma_b  ~ normal(0,1);
         b    ~ normal(mu_b,sigma_b);

     mu_a    ~ normal(0,5);
     sigma_a ~ normal(0,2.5);
         a   ~ normal(mu_a,sigma_a);

     mu_alpha ~ normal(0,5);
     sigma_alpha ~ cauchy(0,2.5);
         alpha   ~ normal(mu_alpha,sigma_alpha);

     theta   ~ normal(0,1);      // The mean and variance of theta is fixed to 0 and 1
                                 // for model identification

      for(i in 1:N) {
        Y[i] ~ normal(alpha[item[i]]*(theta[id[i]]-b[item[i]]),alpha[item[i]]/a[item[i]]);
       }
   }

Now we just compile the model and format our data:

mod <- cmdstan_model("irt.stan")

dat_stan <- list(
  J = length(unique(out_long$technical_no)),
$  I = length(unique(out_long$baker_id)),
$  N = nrow(out_long),
  item = out_long$technical_no,
$  id = out_long$baker_id,
$  Y = out_long$rank_ordered
$)

baker_list <- out_long %>%
  select(baker_id,baker) %>%
  unique()

We can then fit the model with our data.

fit <- mod$sample(dat_stan,
$                  parallel_chains = 4,
                  max_treedepth = 15, adapt_delta = .99, refresh = 0)
Running MCMC with 4 parallel chains...

Chain 3 finished in 73.7 seconds.
Chain 4 finished in 79.1 seconds.
Chain 2 finished in 80.1 seconds.
Chain 1 finished in 111.4 seconds.

All 4 chains finished successfully.
Mean chain execution time: 86.1 seconds.
Total execution time: 111.5 seconds.

We’re interested here in theta which represents the ability of the bakers.

combined_out <- fit$summary(variables = "theta") %>%
$  mutate(baker_id = as.numeric(stringr::str_extract(variable, "\\d+"))) %>%
  left_join(baker_list)

out_come_with_rank  <- combined_out %>%
  arrange(desc(median)) %>%
  mutate(outcome_modelled = row_number()) %>%
  select(outcome_modelled, baker) %>%
  left_join(performance) %>%
  mutate(outcome_realized = case_when(
    value == "Runner-up"~2,
    value == "WINNER"~1,
    TRUE~13-round_num
  ))

knitr::kable(out_come_with_rank)
outcome_modelledbakervariablevalueround_numperfoutcome_realized
1James1SAFE1SAFE 112
1James2HIGH2HIGH 211
1James4SAFE4SAFE 49
1James5SAFE5SAFE 58
1James6LOW6LOW 67
1James7LOW7LOW 76
1James10Runner-up10Runner-up 102
2Brendan1SAFE1SAFE 112
2Brendan2HIGH2HIGH 211
2Brendan3LOW3LOW 310
2Brendan5HIGH5HIGH 58
2Brendan7HIGH7HIGH 76
2Brendan8SAFE8SAFE 85
2Brendan9HIGH9HIGH 94
2Brendan10Runner-up10Runner-up 102
3Danny1SAFE1SAFE 112
3Danny2SAFE2SAFE 211
3Danny3SAFE3SAFE 310
3Danny4HIGH4HIGH 49
3Danny5LOW5LOW 58
3Danny6HIGH6HIGH 67
3Danny8LOW8LOW 85
3Danny9OUT9OUT 94
4JohnNANANANANA
5Cathryn1HIGH1HIGH 112
5Cathryn2SAFE2SAFE 211
5Cathryn3SAFE3SAFE 310
5Cathryn4SAFE4SAFE 49
5Cathryn5SAFE5SAFE 58
5Cathryn6SAFE6SAFE 67
5Cathryn7LOW7LOW 76
5Cathryn8OUT8OUT 85
6Victoria2LOW2LOW 211
6Victoria3OUT3OUT 310
7Sarah-Jane1SAFE1SAFE 112
7Sarah-Jane2SAFE2SAFE 211
7Sarah-Jane3SAFE3SAFE 310
7Sarah-Jane4SAFE4SAFE 49
7Sarah-Jane5LOW5LOW 58
7Sarah-Jane6LOW6LOW 67
7Sarah-Jane7OUT7OUT 76
8Ryan1SAFE1SAFE 112
8Ryan2SAFE2SAFE 211
8Ryan3SAFE3SAFE 310
8Ryan4LOW4LOW 49
8Ryan6SAFE6SAFE 67
8Ryan7OUT7OUT 76
9Peter1SAFE1SAFE 112
9Peter2OUT2OUT 211
10Manisha1SAFE1SAFE 112
10Manisha2SAFE2SAFE 211
10Manisha3SAFE3SAFE 310
10Manisha4LOW4LOW 49
10Manisha5OUT5OUT 58
11Natasha1OUT1OUT 112
12Stuart1LOW1LOW 112
12Stuart2LOW2LOW 211
12Stuart3HIGH3HIGH 310
12Stuart4OUT4OUT 49

Now we can see what the correlation of ability performance is:

(correlation_analysis <-out_come_with_rank %>%
  select(
    outcome_modelled,outcome_realized
  ) %>%
  as.matrix() %>%
  cor() %>%
  .[1,2] %>%
  round(.,2))
[1] NA

Not too bad! It would seem that there is evidence that the performance in the technical is correlated with the final result (thank goodness).

combined_out %>%
  ggplot(aes(reorder(baker,median), median))+
  geom_pointrange(aes(ymin = q5, ymax =q95))+
  geom_point()+
  coord_flip()+
  theme_classic() +
  geom_text(data = performance,
             aes(x = baker, y = 0,
                 label = perf), inherit.aes = FALSE,
            hjust = 0,nudge_x = .2 , size = 2)+
  labs(
    title = "Who Was the Most Skilled Baker in Season 3?",
    subtitle = glue::glue("Using Samejima’s Continuous Response Model (CRM)\nBased on Technical Round Performance\nTechnical Performance Correlation to Final Results {correlation_analysis}"),
    caption = glue::glue("Data: Wikipedia\n See {url}"),
    x = NULL,
    y = "Skill"
  ) ->p
p
out_come_with_rank %>%
  select(baker,outcome_modelled,outcome_realized) %>%
  mutate(color_use= ifelse(outcome_modelled >outcome_realized,
                           "Better than Skill", "Worse than Skill")) %>%
  ggplot(aes(y = reorder(baker, outcome_modelled)))+
  geom_point(aes(x = outcome_modelled), color = "orange")+
  geom_point(aes(x = outcome_realized), color = "blue")+
  geom_segment(aes(x = outcome_modelled,
                   xend = outcome_realized, yend = baker,
                   color = color_use))+
  theme_classic()+
  labs(
    title = "Comparison Between Outcome and Modelled Skill",
    color = "Outcome",
    y = NULL,
    x = "Rank"
  )+
  scale_color_manual(values = c("green", "red"))+
  scale_x_continuous(breaks = seq(1,12,1))->p2

p2

Next Steps

This analysis only covers one season. It would be neat to come back and do all of the seasons to get a feel for the level of difficulty of the different rounds (i.e., was the technical in round 3 of similar difficulty in each season). Additionally it would be neat to see if this relationship between the technical score and final outcome held up in each season.

Reuse

Citation

BibTeX citation:
@online{dewitt2021,
  author = {Michael E. DeWitt},
  title = {How Discerning is the Technical Challenge in GBBO?},
  date = {2021-09-23},
  url = {https://michaeldewittjr.com/blog/2021-09-23-how-discerning-is-the-technical-challenge-in-gbbo/},
  langid = {en}
}
For attribution, please cite this work as:
Michael E. DeWitt. September 23, 2021. "How Discerning is the Technical Challenge in GBBO?". https://michaeldewittjr.com/blog/2021-09-23-how-discerning-is-the-technical-challenge-in-gbbo/.