CKC Pizza Festival 2024

Table of Contents¶

  • Background
    • The History
    • The Pizza
    • The Judges
    • Logistics
  • Data
    • Raw Scores
    • Normalized Scores
  • Judges Behavior
    • Total Scores
    • Score Distributions
  • Evaluating The Winner
    • Eliminating Bias
    • Scoring Method #1: Average
    • Scoring Method #2: Median
    • Scoring Method #3: Average Minus BestWorst
    • Scoring Method #4: Median Minus BestWorst
    • Scoring Method #5: Average N Best Judges
    • Scoring Method #6: Average N Worst Judges
  • Final Results
    • Standings
    • Testing for Bias
      • Categorial Statistical Tests
      • Linear Regression Model

Background

The History

Back to Top

The first annual Chase-McKeever-McCrystal (CKC) Pizza Festival was held on August 31, 2024 in Brookfield, Connecticut. This new family tradition was established by James McKeever, long-standing rival of Bradford Chase, patriarch of the Chase Family. The earliest recorded mention of the CKC pizza festival is a text message from James to select members of the CKC family on the morning of June 25, 2024. James was inspired to host his own event after being deterred by the steep ticket prices to other festivals. An excerpt of the text is provided here for future historical relevance:

...I would like to host the first annual pizza festival…Before I go over the logistics that would be involved i would like you to know how this idea came about...I thought it would be a lot cheaper to take the list and pick 5 or 6 places and order 1 pie each from these places. I'm going to print up the complete list of pizza places participating and then try to map out the locations and delegate people's responsibility...

After more than two months of meticulous planning and coordination, on the morning of Saturday, August 31, 2024, pizza pies from 12 esteemed pizzerias across the New York metropolitan area converged on Brookfield, Connecticut.

The Pizza

Back to Top

Jack McKeever, youngest son of James, and his partner Madeline, brought pies from four pizzerias deep in the heart of Brooklyn. Benjamin Chase, James’ favorite nephew, brought pies from three pizzerias nestled in downtown New Haven. Michael McKeever, eldest son of James, procured two pizzerias from the outskirts of Naugatuck. Erica McCrystal, lone daughter of James, accompanied by her husband Billy and their three kids, chauffeured pizza from three pizzerias all the way from the mystical isles of New Jersey. By 3:00PM on August 31st, a total of 12 pizzerias were represented in James’ kitchen, pizza boxes covering every available surface. It was a sight to behold. A formal list of the pizzerias is included below for documentation:

  1. L&B Spumoni Gardens, 2725 86th Street, Brooklyn, NY, 11223

  2. Di Fara Pizza, 1424 Avenue J, Brooklyn, NY, 11230

  3. Luigi’s Pizza, 686 5th Avenue, Brooklyn, NY, 11225

  4. Luigi’s Pizza, 4704 5th Avenue, Brooklyn, NY, 11220

  5. Frank Pepe Pizzeria Napoletana, 157 Wooster Street, New Haven, CT, 06511

  6. Zeneli Pizzeria e Cucina Napoletana, 138 Wooster Street, New Haven, CT, 06511

  7. Est Est Est Pizza & Restaurant, 1176 Chapel Street, New Haven, CT, 06511

  8. Angeloni’s Restaurant and Pizza, 6 Brookside Avenue, Caldwell, NJ, 07006

  9. Calabria Pizzeria & Restaurant, 588 South Livingston Avenue, Livingston, NJ, 07039

  10. Fuoco Apizza, 461 West Main Street, Cheshire, CT, 06410

  11. Tony D’s, 3 Hanford Place, Caldwell, NJ, 07006

  12. Luna Woodfired Pizza, 384 North Main Street, Naugatuck, CT, 06770

The Judges

Back to Top

The 11 judges for the first annual CKC pizza-rating festival are officially introduced into the ledger here:

  1. James McKeever, founder

  2. Kristi McKeever, wife of James

  3. Michael McKeever, son of James

  4. Jack McKeever, son of James

  5. Madeline Droher, partner of Jack

  6. Erica McCrystal, daughter of James

  7. Billy McCrystal, partner of Erica

  8. Benjamin Chase, nephew of James

  9. Kelly, sister of James

  10. Joe, husband of Kelly

  11. Sharon, friend of Kristi

Logistics

Back to Top

With all the pizzas and judges gathered in a small kitchen in Brookfield, it was time for the festival to commence. James was well prepared; he had purchased a portable pizza oven several weeks in advance, a tool that proved invaluable throughout the day. Kristi McKeever, wife of James, spearheaded pizza cutting and preparation, and Erica took an efficient command of the supply chain route between the kitchen and the pizza oven. Michael and Billy served as pizza chefs, ensuring each pizza was reheated to the same standard. With the pizzas warm and battle stations manned, the rest of the festival was in the hands of the judges.

To preserve unbiased judging, Kristi secretly labeled each pie with a number, and the pies were brought out in a random order to the rest of the judges. For example, Tony D’s pizza might have been introduced as pizza #4. None of the other judges knew the true identity of any of the pizzas when they tried them. Each pizza was brought out one at a time, with the slices cut into bite-sized pieces. Kristi ensured each piece had a good amount of cheese, sauce, and crust, to maintain fair judging conditions. This aspect makes the CKC pizza-rating festival unique; a pizza is judged anonymously on its ability to convey flavor and enjoyment in a single bite without the distraction of how the pie looks or how the slices are cut. Each judge tried a piece of pizza, wrote down their ratings, and moved on to the next one. After pizza from all 12 pizzerias were rated, the judges’ duties were complete.

The Data

Raw Scores

Back to Top

Not all judges used the same scoring rubric for rating their pizza. Some took a more analytical approach with up to 10 categories, while others used 5 categories. Some judges used decimal places, some did not. Some rated their pizza out of 10, some out of 20, and some out of 100. Some judges might have just gone with their gut and thrown out a number. For the purposes of scoring the pizzas and comparing their performances, each judge provided a final score, converted to a 4-20 scale, for each pizzeria. In this festival, 20 was the highest score a pizzeria could receive and 4 was the lowest possible score. At the end of the rating portion of the festival, we are left with 121 scores; each of the 11 judges provided a score for each of the 12 pizzerias. These raw scores are provided below for the historical record:

In [128]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import scipy.stats as stats
import statsmodels.formula.api as sm
import random

names = pd.read_csv("CKC_Names.8.31.2024.csv")
scores = pd.read_csv("CKC_Ratings.8.31.2024.csv")
ratings = pd.merge(scores, names, "left", left_on = "Pizza ID", right_on = "ID")
ratings.rename(columns={"Name" : "Pizzeria"}, inplace = True)
ratings_output = ratings[["Judge", "Score", "Pizzeria"]].copy()
output = ratings_output.pivot(index = "Pizzeria", columns = "Judge", values = "Score")
display(output)
Judge Ben Billy Erica Jack Jim Joe Kelly Kristi Maddie Mike Sharon
Pizzeria
Angeloni's 14.2 11.0 12.0 13.4 13.0 8.0 7.0 11.0 9.5 15.2 9.0
Calibria 18.0 11.5 14.5 16.0 17.0 16.0 13.0 6.0 14.5 12.2 20.0
Di Fara 11.4 14.5 18.0 18.6 9.0 18.0 17.0 9.0 14.5 14.4 15.0
Est Est Est 18.2 16.0 17.5 16.4 18.0 16.0 19.0 11.0 15.0 14.6 18.0
Frank Pepe 13.8 14.0 10.0 17.6 17.0 18.0 15.0 9.0 12.0 13.2 14.0
Fuoco 13.6 17.0 9.0 17.4 18.0 13.0 4.0 11.0 13.0 16.4 4.0
L&B 17.6 16.5 18.0 17.0 13.0 14.0 12.0 13.0 14.5 13.6 14.0
Luigi's 4704 16.2 12.5 12.5 16.8 16.0 17.0 17.0 11.0 13.5 15.6 17.0
Luigi's 686 12.8 14.0 14.5 11.8 11.0 10.0 12.0 7.0 10.0 13.8 16.0
Luna 12.8 13.5 16.0 12.6 16.0 15.0 9.0 9.0 10.0 14.8 14.0
Tony D 16.4 19.0 7.0 12.2 17.0 16.0 9.0 9.0 11.0 15.8 13.0
Zeneli 10.4 13.0 17.0 15.8 10.0 7.0 10.0 5.0 11.0 13.2 5.0

Normalized Scores

Back to Top

In addition to the raw scores shown above, it will be valuable to analyze the judges' scores relative to their personal scoring distribution. For instance, if one judge only gave scores between 10 and 15 and another judge only gave scores between 16 and 20, it is possible they had significantly different interpretations of the scoring rubrics. In other words, one judge's 15 might be equivalent to another judge's 10. To account for this, we can normalize each judge's list of scores by fixing their lowest score at 0, fixing their highest score at 100, and distributing the rest of their scores linearly between 0 and 100. This transformation will ensure all judge's scores are distributed between 0 and 100 regardless of how widely or narrowly they used the original scale of 4 to 20. This means the normalized scores are a better representation of how each pizzeria scored according to each judge's standards. An analysis of these scores will provide additional insight into the judges' behavior and alternative metrics by which to rank the pizzas.

In [95]:
ratings_norm = pd.DataFrame()
for judge in ratings["Judge"].unique():
    temp = ratings[ratings["Judge"] == judge].copy()
    max = temp["Score"].max()
    min = temp["Score"].min()
    temp["Score"] = round(100 / (max - min) * (temp["Score"] - min), 2)
    ratings_norm = pd.concat([ratings_norm, temp])
ratings_output = ratings_norm[["Judge", "Score", "Pizzeria"]].copy()
output = ratings_output.pivot(index = "Pizzeria", columns = "Judge", values = "Score").astype(int)
display(output)
Judge Ben Billy Erica Jack Jim Joe Kelly Kristi Maddie Mike Sharon
Pizzeria
Angeloni's 48 0 45 23 44 9 20 75 0 71 31
Calibria 97 6 68 61 88 81 60 12 90 0 100
Di Fara 12 43 100 100 0 100 86 50 90 52 68
Est Est Est 100 62 95 67 100 81 100 75 100 57 87
Frank Pepe 43 37 27 85 88 100 73 50 45 23 62
Fuoco 41 75 18 82 100 54 0 75 63 100 0
L&B 92 68 100 76 44 63 53 100 90 33 62
Luigi's 4704 74 18 50 73 77 90 86 75 72 80 81
Luigi's 686 30 37 68 0 22 27 53 25 9 38 75
Luna 30 31 81 11 77 72 33 50 9 61 62
Tony D 76 100 0 5 88 81 33 50 27 85 56
Zeneli 0 25 90 58 11 0 40 0 27 23 6

Judges Behavior

Let's start by examining the judges scoring tendencies:

Total Scores

Back to Top

In [96]:
ax = sns.barplot(ratings.groupby("Judge")["Score"].sum().sort_values(), ec = "black")
ax.bar_label(ax.containers[0])
plt.title("Total Raw Score by Judge")
plt.xticks(fontsize = 9)
display()
No description has been provided for this image

The maximum possible points a single judge could award was 240 (20 points across 12 pizzas). We can see Kristi was by far the harshest critic, handing out the least points overall at a measely 111 or 46%. There was a tight clustering between Billy, Mike, Jim, and Ben, who all awarded around 175 points or 73%. The most generous judge was Jack, who handed out a whopping 186 points or 78%. To put this into perspective, Kristi's average score was less than 10 while Jack's average score was higher than 15. Let's take a look at the normalized score totals to learn more:

In [97]:
ax = sns.barplot(ratings_norm.groupby("Judge")["Score"].sum().astype(int).sort_values(), ec = "black")
ax.bar_label(ax.containers[0])
plt.title("Total Normalized Score by Judge")
plt.xticks(fontsize = 9)
display()
No description has been provided for this image

Billy by far had the lowest normalized score total which indicates he ranked most pizzas much lower than his favorite. Few pizzas came close to the 19/20 he gave to Tony D's. On the other end of the spectrum, Joe had the highest normalized score total which indicates he ranked most pizzas similar to his favorite. Indeed, we see he ranked 6 pizzas with a score between 16 and 18, while his lowest came in at a 7. There is a large group with similar score totals, including Maddie, Mike, Kristi, Kelly, Jack, and Ben, indicating these judges had similar gaps between their favorite and least favorite pizzas. Similar to Joe, both Jim and Erica ranked many pizzas close to their favorite; Jim gave 7 pizzas a score between 16 and 18, and Erica gave 5 pizzas a score between 16 and 18.

Score Distributions

Back to Top

In [98]:
sns.boxplot(data = ratings.sort_values("Judge"), x = "Judge", y = "Score")
ax.bar_label(ax.containers[0])
plt.title("Raw Score Distribution by Judge")
plt.xticks(fontsize = 9)
display()
No description has been provided for this image

The maximum possible points for a single pizza was 20. Sharon was the only judge to award a pizza a perfect score and one of only two judges (Kelly) to award a pizza the lowest possible score of 4. Kristi's lowest score was a 5, adding her to the trio of judges with a strong disdain for a particular pizzeria. We can see Mike's scores were all between 12 and 16, far and away the judge with the least sensitive palette. Kelly and Erica also showcased strong opinions, with Kelly giving a low score of 4 and a high score of 19, and Erica giving a low score of 5 and a high score of 18. Kristi's scores were all significantly lower than average, ranging from 5 to 13 with 75% of her scores between 9 and 11. Ben, Billy, Jack, Jim, and Joe had similar score ranges. Billy and Kelly were the only judges to give a score of 19.

In [99]:
sns.boxplot(data = ratings_norm.sort_values("Judge"), x = "Judge", y = "Score")
ax.bar_label(ax.containers[0])
plt.title("Normalized Score Distribution by Judge")
plt.xticks(fontsize = 9)
display()
No description has been provided for this image

The distributions of normalized scores for all judges are fixed between 0 and 100 by design, but we can still learn more about the judge's behavior from these. Kristi and Sharon gave most pizzas a score around half of their highest score. Kristi's highest score was a 13 and she gave 7 pizzerias a score between 5 and 9. Maddie and Jack had the most diverse set of scores within their scoring ranges. Sharon's two lowest scores were outliers even within her own distribution; she gave two low scores of 4 and 5 despite giving most pizzerias a score between 13 and 20. Jim and Joe had the highest median scores by far, indicating they ranked most pizzerias near their highest score. On the other hand, Ben, Kelly, Kristi, Maddie, and Mike had medians near 50%, which suggests their scores were well distributed. Billy's median score is lower than 40%, reinforcing Tony D's as his runaway favorite.

Evaluating the Winner

Eliminating Bias

Back to Top

The question everyone wants answered is: which pizzeria was the best? This is not an easy question to answer well. There are many different scoring metrics we could use to determine the final rankings of these 12 pizzerias; the average score, the median score, the top five best scores, etc. There is not necessarily one scoring metric that is better than another. To maintain the integrity of this family tradition, we will establish a high standard of rigor from the very first CKC Pizza Festival. To find the best pizzeria, we will calculate a variety of different scoring metrics and aggregate them. For each scoring method, we will calculate the results using both the raw scores and the normalized scores. This will prevent criticisms such as “that pizza only won because you used the average not the median!”, or “that pizza would’ve scored higher if it didn’t have that one bad score!”. We will also conduct statistical analyses to estimate whether or not the order of pizzas impacted the judge’s scores. In summary, we have a 3-pronged system for acheiving an undisputable pizza champion:

  • Multiple scoring methods to eliminate bias due to the data distribution
  • Calculating each scoring method with raw and normalized scores to eliminate bias due to the judges' behavior
  • Statistical analyses to test for bias due to the format of the contest.

Let’s begin.

In [100]:
final_results = pd.DataFrame(ratings["Pizzeria"].drop_duplicates())

Scoring Method #1: Average

Back to Top

The first scoring method will be a simple average of all the judges' scores:

Raw Scores¶

In [101]:
results = ratings.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Average Raw Scores")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 20])
display()

text = "Raw Average"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
No description has been provided for this image

Normalized Scores¶

In [102]:
results = ratings_norm.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt= "%.1f")
plt.title("Average Normalized Scores")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 100])
display()

text = "Norm Average"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
No description has been provided for this image

Scoring Method #2: Median

Back to Top

The second scoring method will be a simple median of all judges' scores:

Raw Scores¶

In [103]:
results = ratings.groupby("Pizzeria")["Score"].median().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Median Raw Scores")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 20])
display()

text = "Raw Median"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
No description has been provided for this image

Normalized Scores¶

In [104]:
results = ratings_norm.groupby("Pizzeria")["Score"].median().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt= "%.1f")
plt.title("Median Normalized Scores")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 100])
display()

text = "Norm Median"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
No description has been provided for this image

Scoring Method #3 - Average Minus BestWorst

Back to Top

The third scoring method will take the average score without the highest and lowest score(s) for each pizzeria. This will eliminate bias that might have been introduced from overly harsh or kind judges.

Raw Scores¶

In [105]:
temp = ratings.copy()

g = ratings.groupby(['Pizzeria'])['Score'].transform('max')
temp = temp[~(temp['Score'] == g)]

g = temp.groupby(['Pizzeria'])['Score'].transform('min')
temp = temp[~(temp['Score'] == g)]

results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Average Raw Scores Minus BestWorst")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 20])
display()

text = "Raw Average MBW"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
No description has been provided for this image

Normalized Scores¶

In [106]:
temp = ratings_norm.copy()

g = ratings.groupby(['Pizzeria'])['Score'].transform('max')
temp = temp[~(temp['Score'] == g)]

g = temp.groupby(['Pizzeria'])['Score'].transform('min')
temp = temp[~(temp['Score'] == g)]

results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Average Normalized Scores Minus BestWorst")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 100])
display()

text = "Norm Average MBW"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
No description has been provided for this image

Scoring Method #4: Raw Median Minus BestWorst

Back to Top

The fourth scoring method will take the median score without the highest and lowest score(s) for each pizzeria. This will eliminate bias that might have been introduced from overly harsh or kind judges.

Raw Scores¶

In [107]:
temp = ratings.copy()

g = ratings.groupby(['Pizzeria'])['Score'].transform('max')
temp = temp[~(temp['Score'] == g)]

g = temp.groupby(['Pizzeria'])['Score'].transform('min')
temp = temp[~(temp['Score'] == g)]

results = temp.groupby("Pizzeria")["Score"].median().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Median Raw Scores Minus BestWorst")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 20])
display()

text = "Raw Median MBW"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
No description has been provided for this image

Normalized Scores¶

In [108]:
temp = ratings_norm.copy()

g = ratings.groupby(['Pizzeria'])['Score'].transform('max')
temp = temp[~(temp['Score'] == g)]

g = temp.groupby(['Pizzeria'])['Score'].transform('min')
temp = temp[~(temp['Score'] == g)]

results = temp.groupby("Pizzeria")["Score"].median().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Median Normalized Scores Minus BestWorst")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 100])
display()

text = "Normalized Median MBW"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
No description has been provided for this image

Scoring Method #5: Average N Best Judges

Back to Top

The fifth scoring method will take the average score for each pizzeria with different sets of judges. We will calculate the average scores with the 3 best judges for each pizzeria, then the 4 best judges for each pizzeria, etc. This scoring method will provide insight into how the pizzerias performed among the judges that liked them the most and will eliminate any bias from groups of harsh judges. For brevity, the final rankings are shown for each scoring method instead of the full scores.

Raw Scores¶

In [109]:
aggregate_results = pd.DataFrame()
for i in range (3, 10):
    temp = ratings.copy()
    temp = temp.groupby('Pizzeria')['Score'].nlargest(i).reset_index(level = 1, drop = True).reset_index()
    results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
    results = results.rename(columns = {"Score" : f'{i} Best Raw'})
    for col in results.columns:
        if col != "Pizzeria":
            results[col] = results[col].rank(method = 'dense', ascending = False).astype(int)
    try:
        aggregate_results = pd.merge(left = aggregate_results, right = results, left_on = "Pizzeria", right_on = "Pizzeria")
    except:
        aggregate_results = results
display(aggregate_results)

final_results = pd.merge(left = final_results, right = aggregate_results, left_on = "Pizzeria", right_on = "Pizzeria")
Pizzeria 3 Best Raw 4 Best Raw 5 Best Raw 6 Best Raw 7 Best Raw 8 Best Raw 9 Best Raw
0 Est Est Est 1 1 1 1 1 1 1
1 Calibria 2 3 2 2 2 4 4
2 Di Fara 3 2 3 3 3 2 3
3 Frank Pepe 4 8 8 7 5 6 6
4 L&B 4 4 6 5 4 5 5
5 Fuoco 5 5 7 8 6 8 8
6 Tony D 5 6 4 6 5 7 7
7 Luigi's 4704 6 7 5 4 3 3 2
8 Luna 7 9 9 9 7 9 9
9 Zeneli 8 10 11 11 9 11 12
10 Luigi's 686 9 11 10 10 8 10 10
11 Angeloni's 10 12 12 12 10 12 11

Normalized Scores¶

In [110]:
aggregate_results = pd.DataFrame()
for i in range (3, 10):
    temp = ratings_norm.copy()
    temp = temp.groupby('Pizzeria')['Score'].nlargest(i).reset_index(level = 1, drop = True).reset_index()
    results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
    results = results.rename(columns = {"Score" : f'{i} Best Normalized'})
    for col in results.columns:
        if col != "Pizzeria":
            results[col] = results[col].rank(method = 'dense', ascending = False).astype(int)
    try:
        aggregate_results = pd.merge(left = aggregate_results, right = results, left_on = "Pizzeria", right_on = "Pizzeria")
    except:
        aggregate_results = results
display(aggregate_results)

final_results = pd.merge(left = final_results, right = aggregate_results, left_on = "Pizzeria", right_on = "Pizzeria")
Pizzeria 3 Best Normalized 4 Best Normalized 5 Best Normalized 6 Best Normalized 7 Best Normalized 8 Best Normalized 9 Best Normalized
0 Di Fara 1 2 2 2 2 4 4
1 Est Est Est 1 1 1 1 1 1 1
2 L&B 2 3 3 3 3 2 3
3 Calibria 3 4 4 4 4 3 5
4 Fuoco 4 5 6 5 6 6 6
5 Tony D 5 6 5 7 7 7 7
6 Frank Pepe 6 7 8 8 8 8 8
7 Luigi's 4704 7 8 7 6 5 5 2
8 Luna 8 9 9 9 9 9 9
9 Luigi's 686 9 11 11 11 11 11 10
10 Angeloni's 10 10 10 10 10 10 11
11 Zeneli 11 12 12 12 12 12 12

Scoring Method #6: Average N Worst Judges

Back to Top

The sixth scoring method, similar to the fifth method, will show how pizzerias performed among the judges that liked them the least. This will eliminate any bias due to groups of overly kind judges.

Raw Scores¶

In [111]:
aggregate_results = pd.DataFrame()
for i in range (3, 10):
    temp = ratings.copy()
    temp = temp.groupby('Pizzeria')['Score'].nsmallest(i).reset_index(level = 1, drop = True).reset_index()
    results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
    results = results.rename(columns = {"Score" : f'{i} Worst Raw'})
    for col in results.columns:
        if col != "Pizzeria":
            results[col] = results[col].rank(method = 'dense', ascending = False).astype(int)
    try:
        aggregate_results = pd.merge(left = aggregate_results, right = results, left_on = "Pizzeria", right_on = "Pizzeria")
    except:
        aggregate_results = results
display(aggregate_results)

final_results = pd.merge(left = final_results, right = aggregate_results, left_on = "Pizzeria", right_on = "Pizzeria")
Pizzeria 3 Worst Raw 4 Worst Raw 5 Worst Raw 6 Worst Raw 7 Worst Raw 8 Worst Raw 9 Worst Raw
0 Est Est Est 1 1 1 1 1 1 1
1 L&B 2 2 2 3 3 3 3
2 Luigi's 4704 3 3 3 2 2 2 2
3 Frank Pepe 4 4 5 5 6 6 6
4 Calibria 5 6 6 6 5 5 5
5 Di Fara 6 5 4 4 4 4 4
6 Luna 7 7 7 7 7 7 7
7 Luigi's 686 8 8 8 8 9 9 9
8 Tony D 9 9 9 9 8 8 8
9 Angeloni's 10 10 10 10 11 11 11
10 Fuoco 11 11 11 11 10 10 10
11 Zeneli 11 12 12 12 12 12 12

Normalized Scores¶

In [112]:
aggregate_results = pd.DataFrame()
for i in range (3, 10):
    temp = ratings_norm.copy()
    temp = temp.groupby('Pizzeria')['Score'].nsmallest(i).reset_index(level = 1, drop = True).reset_index()
    results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
    results = results.rename(columns = {"Score" : f'{i} Worst Normalized'})
    for col in results.columns:
        if col != "Pizzeria":
            results[col] = results[col].rank(method = 'dense', ascending = False).astype(int)
    try:
        aggregate_results = pd.merge(left = aggregate_results, right = results, left_on = "Pizzeria", right_on = "Pizzeria")
    except:
        aggregate_results = results
display(aggregate_results)

final_results = pd.merge(left = final_results, right = aggregate_results, left_on = "Pizzeria", right_on = "Pizzeria")
Pizzeria 3 Worst Normalized 4 Worst Normalized 5 Worst Normalized 6 Worst Normalized 7 Worst Normalized 8 Worst Normalized 9 Worst Normalized
0 Est Est Est 1 1 1 1 1 1 1
1 Luigi's 4704 2 2 2 2 2 2 2
2 L&B 3 3 3 3 3 3 3
3 Frank Pepe 4 4 4 5 6 6 6
4 Di Fara 5 5 5 4 4 4 4
5 Luna 6 6 8 9 9 9 9
6 Tony D 7 8 7 8 8 7 7
7 Luigi's 686 8 10 10 10 10 10 10
8 Calibria 9 7 6 6 5 5 5
9 Fuoco 10 9 9 7 7 8 8
10 Angeloni's 11 11 11 11 11 11 11
11 Zeneli 12 12 12 12 12 12 12

Final Results

With all the scoring methods complete it's time to tabulate the final results.

Standings

Back to Top

It's time to report the final results for the 12 pizzerias of the first annual CKC festival. We will calculate the final score for each pizzera using all 36 scoring methods covered above. Each pizzeria will receive 12 points if they came in first place for a scoring method, 11 points if they came in second, 10 points if they came in third, etc. This means across all 36 scoring methods, the highest possible score for a single pizzeria is 432 points. The final scores will be normalized between 0 and 100 to provide a more relative comparison between the pizzerias.

In [113]:
results = final_results.copy()

results.index = results["Pizzeria"]
results.drop("Pizzeria", axis = 1, inplace = True)
results = len(results.index.unique()) + 1 - results
results["Score"] = results.sum(axis = 1)
results = results[["Score"]].copy()
results['Score'] = ((results['Score'] - results['Score'].min()) / (results['Score'].max() - results['Score'].min())) * 100

ax = sns.barplot(results.sort_values("Score", ascending = False), x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt= "%.1f")
plt.title("Final Scores of the 2024 CKC Pizza Rating Festival")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 105])
display()
No description has been provided for this image

Testing For Bias

Back to Top

One concern with the format of the first annual CKC pizza-rating festival is that the pizzas were brought out in a specifc order. Perhaps the judges were more favorable in the beginning and harsher towards the end. To maintain the integrity of this family tradition, we will conduct rigorous statistical analyses to investigate if there is any detectable bias related to the order of the pizzas. Let's start with a visual:

In [115]:
pizza_ranks = results["Score"].copy()
pizza_orders = ratings[["Pizzeria", "Pizza ID"]].copy()
pizza_orders.index = pizza_orders["Pizzeria"]
pizza_orders.drop("Pizzeria", axis = 1, inplace = True)
pizza_orders = pizza_orders["Pizza ID"].copy()
In [146]:
pizza_data = pd.merge(pizza_orders, pizza_ranks, left_on = "Pizzeria", right_on = "Pizzeria")
sns.scatterplot(data = pizza_data, x = "Pizza ID", y = "Score")
plt.title("Final Scores for Pizzerias by Order in Tasting Lineup")

def label_point(x, y, val, ax):
    a = pd.concat({'x': x, 'y': y, 'val': val}, axis=1)
    for i, point in a.iterrows():
        ax.text(point['x']+0.1, point['y'] - 5, str(point['val']), fontsize = 9)

label_point(pizza_data["Pizza ID"], pizza_data["Score"], pizza_data.index.to_series(), plt.gca()) 

display()
No description has been provided for this image

We can see in the graph above the order of the top 5 pizzerias: 1st, 2nd, 6th, 7th, and 12th. There does not seem to be a visually obvious correlation between overall scores and the pizzeria being earlier or later in the lineup. Let's move on to more concrete testing:

Categorical Statistical Tests¶

Back to Top

We will start by conducting statistical tests to determine if there is a significant difference between the final scores of the pizzerias in the top half of the lineup and the bottom half of the lineup. We will run a variety of tests for completeness.

T-Tests¶

Final Scores¶
In [147]:
first_half = pizza_data[pizza_data["Pizza ID"] <= 6]["Score"].copy()
second_half = pizza_data[pizza_data["Pizza ID"] > 6]["Score"].copy()

t_stat, p_value = stats.ttest_ind(first_half, second_half)
print("T-statistic:", t_stat)
print("P-value:", p_value)
T-statistic: 1.5328059077494744
P-value: 0.1277537008396266

Since the p-value is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.

All Raw Scores¶
In [148]:
first_half = ratings[ratings["Pizza ID"] <= 6]["Score"].copy()
second_half = ratings[ratings["Pizza ID"] > 6]["Score"].copy()

t_stat, p_value = stats.ttest_ind(first_half, second_half)
print("T-statistic:", t_stat)
print("P-value:", p_value)
T-statistic: 0.39358413005468457
P-value: 0.6945328720113341

Since the p-value is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.

All Normalized Scores¶
In [149]:
first_half = ratings_norm[ratings_norm["Pizza ID"] <= 6]["Score"].copy()
second_half = ratings_norm[ratings_norm["Pizza ID"] > 6]["Score"].copy()

t_stat, p_value = stats.ttest_ind(first_half, second_half)
print("T-statistic:", t_stat)
print("P-value:", p_value)
T-statistic: 0.7516878960838435
P-value: 0.4535974942323605

Since the p-value is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.

Mann-Whitney U Tests¶

Final Scores¶
In [150]:
first_half = pizza_data[pizza_data["Pizza ID"] <= 6]["Score"].copy()
second_half = pizza_data[pizza_data["Pizza ID"] > 6]["Score"].copy()

t_stat, p_value = stats.mannwhitneyu(first_half, second_half, method="exact")
print("T-statistic:", t_stat)
print("P-value:", p_value)
T-statistic: 2541.0
P-value: 0.09909440016666508

Since the p-value is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.

All Raw Scores¶
In [151]:
first_half = ratings[ratings["Pizza ID"] <= 6]["Score"].copy()
second_half = ratings[ratings["Pizza ID"] > 6]["Score"].copy()

t_stat, p_value = stats.mannwhitneyu(first_half, second_half, method="exact")
print("T-statistic:", t_stat)
print("P-value:", p_value)
T-statistic: 2292.0
P-value: 0.606593148588398

Since the p-value is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.

All Normalized Scores¶
In [154]:
first_half = ratings_norm[ratings_norm["Pizza ID"] <= 6]["Score"].copy()
second_half = ratings_norm[ratings_norm["Pizza ID"] > 6]["Score"].copy()

t_stat, p_value = stats.mannwhitneyu(first_half, second_half, method="exact")
print("T-statistic:", t_stat)
print("P-value:", p_value)
T-statistic: 2340.0
P-value: 0.46359335005995844

Since the p-value is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.

Linear Regression Model¶

Back to Top

Finally, we will conduct a linear regression analysis on the original judges' scores to determine if there is a significant relationship between the pizzeria's order in the lineup and its scores.

Raw Scores¶

In [155]:
ratings.rename({"Pizza ID": "ID"})
result = sm.ols(formula="Score ~ ID", data=ratings).fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Score   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.7307
Date:                Sun, 24 Nov 2024   Prob (F-statistic):              0.394
Time:                        10:44:27   Log-Likelihood:                -351.14
No. Observations:                 132   AIC:                             706.3
Df Residuals:                     130   BIC:                             712.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     13.9566      0.647     21.573      0.000      12.677      15.237
ID            -0.0751      0.088     -0.855      0.394      -0.249       0.099
==============================================================================
Omnibus:                        7.375   Durbin-Watson:                   1.563
Prob(Omnibus):                  0.025   Jarque-Bera (JB):                7.629
Skew:                          -0.587   Prob(JB):                       0.0221
Kurtosis:                       2.915   Cond. No.                         15.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Since the p-value on the coefficient is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.

Normalized Scores¶

In [156]:
ratings_norm.rename({"Pizza ID": "ID"})
result = sm.ols(formula="Score ~ ID", data=ratings_norm).fit()
print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Score   R-squared:                       0.016
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.120
Date:                Sun, 24 Nov 2024   Prob (F-statistic):              0.148
Time:                        10:44:31   Log-Likelihood:                -642.89
No. Observations:                 132   AIC:                             1290.
Df Residuals:                     130   BIC:                             1296.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     62.7568      5.899     10.639      0.000      51.087      74.427
ID            -1.1670      0.801     -1.456      0.148      -2.753       0.419
==============================================================================
Omnibus:                       32.406   Durbin-Watson:                   1.851
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                8.019
Skew:                          -0.253   Prob(JB):                       0.0181
Kurtosis:                       1.904   Cond. No.                         15.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Since the p-value on the coefficient is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.