CKC Pizza Festival 2024
Background
The History
The first annual Chase-McKeever-McCrystal (CKC) Pizza Festival was held on August 31, 2024, in Brookfield, Connecticut. This new family tradition was established by James McKeever, long-standing rival of Bradford Chase, patriarch of the Chase Family. The earliest recorded mention of the CKC Pizza Festival is a text message from James to select members of the CKC family on the morning of June 25, 2024. James was inspired to host his own event after being deterred by the steep ticket prices of other festivals. An excerpt of the text is provided here for future historical relevance:
...I would like to host the first annual pizza festival…Before I go over the logistics that would be involved i would like you to know how this idea came about...I thought it would be a lot cheaper to take the list and pick 5 or 6 places and order 1 pie each from these places. I'm going to print up the complete list of pizza places participating and then try to map out the locations and delegate people's responsibility...
After more than two months of meticulous planning and coordination, on the morning of Saturday, August 31, 2024, pizza pies from 12 esteemed pizzerias across the New York metropolitan area converged on Brookfield, Connecticut.
The Pizza
Jack McKeever, youngest son of James, and his partner Madeline, brought pies from four pizzerias deep in the heart of Brooklyn. Benjamin Chase, James’ favorite nephew, brought pies from three pizzerias nestled in downtown New Haven. Michael McKeever, eldest son of James, procured pies from two pizzerias on the outskirts of Naugatuck. Erica McCrystal, lone daughter of James, accompanied by her husband Billy and their three kids, chauffeured pizza from three pizzerias all the way from the mystical isles of New Jersey. By 3:00PM on August 31st, a total of 12 pizzerias were represented in James’ kitchen, pizza boxes covering every available surface. It was a sight to behold. A formal list of the pizzerias is included below for documentation:
L&B Spumoni Gardens, 2725 86th Street, Brooklyn, NY, 11223
Di Fara Pizza, 1424 Avenue J, Brooklyn, NY, 11230
Luigi’s Pizza, 686 5th Avenue, Brooklyn, NY, 11215
Luigi’s Pizza, 4704 5th Avenue, Brooklyn, NY, 11220
Frank Pepe Pizzeria Napoletana, 157 Wooster Street, New Haven, CT, 06511
Zeneli Pizzeria e Cucina Napoletana, 138 Wooster Street, New Haven, CT, 06511
Est Est Est Pizza & Restaurant, 1176 Chapel Street, New Haven, CT, 06511
Angeloni’s Restaurant and Pizza, 6 Brookside Avenue, Caldwell, NJ, 07006
Calabria Pizzeria & Restaurant, 588 South Livingston Avenue, Livingston, NJ, 07039
Fuoco Apizza, 461 West Main Street, Cheshire, CT, 06410
Tony D’s, 3 Hanford Place, Caldwell, NJ, 07006
Luna Woodfired Pizza, 384 North Main Street, Naugatuck, CT, 06770
The Judges
The 11 judges for the first annual CKC pizza-rating festival are officially introduced into the ledger here:
James McKeever, founder
Kristi McKeever, wife of James
Michael McKeever, son of James
Jack McKeever, son of James
Madeline Droher, partner of Jack
Erica McCrystal, daughter of James
Billy McCrystal, partner of Erica
Benjamin Chase, nephew of James
Kelly, sister of James
Joe, husband of Kelly
Sharon, friend of Kristi
Logistics
With all the pizzas and judges gathered in a small kitchen in Brookfield, it was time for the festival to commence. James was well prepared; he had purchased a portable pizza oven several weeks in advance, a tool that proved invaluable throughout the day. Kristi McKeever, wife of James, spearheaded pizza cutting and preparation, and Erica took an efficient command of the supply chain route between the kitchen and the pizza oven. Michael and Billy served as pizza chefs, ensuring each pizza was reheated to the same standard. With the pizzas warm and battle stations manned, the rest of the festival was in the hands of the judges.
To preserve unbiased judging, Kristi secretly labeled each pie with a number, and the pies were brought out in a random order to the rest of the judges. For example, Tony D’s pizza might have been introduced as pizza #4. None of the other judges knew the true identity of any of the pizzas when they tried them. Each pizza was brought out one at a time, with the slices cut into bite-sized pieces. Kristi ensured each piece had a good amount of cheese, sauce, and crust, to maintain fair judging conditions. This aspect makes the CKC pizza-rating festival unique; a pizza is judged anonymously on its ability to convey flavor and enjoyment in a single bite without the distraction of how the pie looks or how the slices are cut. Each judge tried a piece of pizza, wrote down their ratings, and moved on to the next one. After pizza from all 12 pizzerias were rated, the judges’ duties were complete.
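The blind-labeling step described above can be sketched in a few lines of Python (a hypothetical illustration; Kristi's actual assignment was done by hand, and the pizzeria names here are just a sample from the roster):

```python
import random

# Sample of pizzerias from the festival roster (illustrative subset)
pizzerias = ["Tony D's", "Di Fara", "L&B Spumoni Gardens", "Est Est Est"]

# Shuffle a copy so each pie receives a secret number unrelated to its identity
lineup = pizzerias.copy()
random.shuffle(lineup)

# Map each pizzeria to its anonymous pizza number (1-based serving order)
secret_labels = {name: i + 1 for i, name in enumerate(lineup)}
```

Only the labeler keeps `secret_labels`; the judges see nothing but the numbers.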
The Data
Raw Scores
Not all judges used the same scoring rubric for rating their pizza. Some took a more analytical approach with up to 10 categories, while others used 5 categories. Some judges used decimal places, some did not. Some rated their pizza out of 10, some out of 20, and some out of 100. Some judges might have just gone with their gut and thrown out a number. For the purposes of scoring the pizzas and comparing their performances, each judge provided a final score for each pizzeria, converted to a 4-20 scale. In this festival, 20 was the highest score a pizzeria could receive and 4 was the lowest possible score. At the end of the rating portion of the festival, we are left with 132 scores: each of the 11 judges provided a score for each of the 12 pizzerias. These raw scores are provided below for the historical record:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import scipy.stats as stats

# Load the judge roster and the raw ratings, then attach pizzeria names to each score
names = pd.read_csv("CKC_Names.8.31.2024.csv")
scores = pd.read_csv("CKC_Ratings.8.31.2024.csv")
ratings = pd.merge(scores, names, how = "left", left_on = "Pizza ID", right_on = "ID")
ratings.rename(columns = {"Name" : "Pizzeria"}, inplace = True)

# Pivot to a pizzerias-by-judges grid for display
ratings_output = ratings[["Judge", "Score", "Pizzeria"]].copy()
output = ratings_output.pivot(index = "Pizzeria", columns = "Judge", values = "Score")
display(output)
Pizzeria | Ben | Billy | Erica | Jack | Jim | Joe | Kelly | Kristi | Maddie | Mike | Sharon |
---|---|---|---|---|---|---|---|---|---|---|---|
Angeloni's | 14.2 | 11.0 | 12.0 | 13.4 | 13.0 | 8.0 | 7.0 | 11.0 | 9.5 | 15.2 | 9.0 |
Calibria | 18.0 | 11.5 | 14.5 | 16.0 | 17.0 | 16.0 | 13.0 | 6.0 | 14.5 | 12.2 | 20.0 |
Di Fara | 11.4 | 14.5 | 18.0 | 18.6 | 9.0 | 18.0 | 17.0 | 9.0 | 14.5 | 14.4 | 15.0 |
Est Est Est | 18.2 | 16.0 | 17.5 | 16.4 | 18.0 | 16.0 | 19.0 | 11.0 | 15.0 | 14.6 | 18.0 |
Frank Pepe | 13.8 | 14.0 | 10.0 | 17.6 | 17.0 | 18.0 | 15.0 | 9.0 | 12.0 | 13.2 | 14.0 |
Fuoco | 13.6 | 17.0 | 9.0 | 17.4 | 18.0 | 13.0 | 4.0 | 11.0 | 13.0 | 16.4 | 4.0 |
L&B | 17.6 | 16.5 | 18.0 | 17.0 | 13.0 | 14.0 | 12.0 | 13.0 | 14.5 | 13.6 | 14.0 |
Luigi's 4704 | 16.2 | 12.5 | 12.5 | 16.8 | 16.0 | 17.0 | 17.0 | 11.0 | 13.5 | 15.6 | 17.0 |
Luigi's 686 | 12.8 | 14.0 | 14.5 | 11.8 | 11.0 | 10.0 | 12.0 | 7.0 | 10.0 | 13.8 | 16.0 |
Luna | 12.8 | 13.5 | 16.0 | 12.6 | 16.0 | 15.0 | 9.0 | 9.0 | 10.0 | 14.8 | 14.0 |
Tony D | 16.4 | 19.0 | 7.0 | 12.2 | 17.0 | 16.0 | 9.0 | 9.0 | 11.0 | 15.8 | 13.0 |
Zeneli | 10.4 | 13.0 | 17.0 | 15.8 | 10.0 | 7.0 | 10.0 | 5.0 | 11.0 | 13.2 | 5.0 |
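The conversion of each judge's personal rubric onto the common 4-20 scale described above is, in effect, a linear rescale. A minimal sketch, assuming a judge's rubric ran over some known range (the judges performed their own conversions, so the exact mapping each one used is not recorded):

```python
def to_festival_scale(score, low, high):
    """Linearly map a score from a judge's own [low, high] rubric onto the 4-20 festival scale."""
    return 4 + 16 * (score - low) / (high - low)

# A 75 on a 0-100 rubric lands at 16.0 on the festival scale
print(to_festival_scale(75, 0, 100))
```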
Normalized Scores
In addition to the raw scores shown above, it will be valuable to analyze the judges' scores relative to their personal scoring distributions. For instance, if one judge only gave scores between 10 and 15 and another judge only gave scores between 16 and 20, it is possible they had significantly different interpretations of the scoring rubrics. In other words, one judge's 15 might be equivalent to another judge's 10. To account for this, we can normalize each judge's list of scores by fixing their lowest score at 0, fixing their highest score at 100, and distributing the rest of their scores linearly between 0 and 100. This transformation will ensure all judges' scores are distributed between 0 and 100 regardless of how widely or narrowly they used the original scale of 4 to 20. This means the normalized scores are a better representation of how each pizzeria scored according to each judge's standards. An analysis of these scores will provide additional insight into the judges' behavior and alternative metrics by which to rank the pizzas.
ratings_norm = pd.DataFrame()
for judge in ratings["Judge"].unique():
    temp = ratings[ratings["Judge"] == judge].copy()
    # Min-max normalize this judge's scores onto a 0-100 scale
    high = temp["Score"].max()
    low = temp["Score"].min()
    temp["Score"] = round(100 / (high - low) * (temp["Score"] - low), 2)
    ratings_norm = pd.concat([ratings_norm, temp])
ratings_output = ratings_norm[["Judge", "Score", "Pizzeria"]].copy()
output = ratings_output.pivot(index = "Pizzeria", columns = "Judge", values = "Score").astype(int)
display(output)
Pizzeria | Ben | Billy | Erica | Jack | Jim | Joe | Kelly | Kristi | Maddie | Mike | Sharon |
---|---|---|---|---|---|---|---|---|---|---|---|
Angeloni's | 48 | 0 | 45 | 23 | 44 | 9 | 20 | 75 | 0 | 71 | 31 |
Calibria | 97 | 6 | 68 | 61 | 88 | 81 | 60 | 12 | 90 | 0 | 100 |
Di Fara | 12 | 43 | 100 | 100 | 0 | 100 | 86 | 50 | 90 | 52 | 68 |
Est Est Est | 100 | 62 | 95 | 67 | 100 | 81 | 100 | 75 | 100 | 57 | 87 |
Frank Pepe | 43 | 37 | 27 | 85 | 88 | 100 | 73 | 50 | 45 | 23 | 62 |
Fuoco | 41 | 75 | 18 | 82 | 100 | 54 | 0 | 75 | 63 | 100 | 0 |
L&B | 92 | 68 | 100 | 76 | 44 | 63 | 53 | 100 | 90 | 33 | 62 |
Luigi's 4704 | 74 | 18 | 50 | 73 | 77 | 90 | 86 | 75 | 72 | 80 | 81 |
Luigi's 686 | 30 | 37 | 68 | 0 | 22 | 27 | 53 | 25 | 9 | 38 | 75 |
Luna | 30 | 31 | 81 | 11 | 77 | 72 | 33 | 50 | 9 | 61 | 62 |
Tony D | 76 | 100 | 0 | 5 | 88 | 81 | 33 | 50 | 27 | 85 | 56 |
Zeneli | 0 | 25 | 90 | 58 | 11 | 0 | 40 | 0 | 27 | 23 | 6 |
Judges' Behavior
Let's start by examining the judges' scoring tendencies:
Total Scores
ax = sns.barplot(ratings.groupby("Judge")["Score"].sum().sort_values(), ec = "black")
ax.bar_label(ax.containers[0])
plt.title("Total Raw Score by Judge")
plt.xticks(fontsize = 9)
display()
The maximum possible points a single judge could award was 240 (20 points across 12 pizzas). We can see Kristi was by far the harshest critic, handing out the fewest points overall at a measly 111, or 46%. There was a tight clustering between Billy, Mike, Jim, and Ben, who all awarded around 175 points, or 73%. The most generous judge was Jack, who handed out a whopping 186 points, or 78%. To put this into perspective, Kristi's average score was less than 10 while Jack's average score was higher than 15. Let's take a look at the normalized score totals to learn more:
ax = sns.barplot(ratings_norm.groupby("Judge")["Score"].sum().astype(int).sort_values(), ec = "black")
ax.bar_label(ax.containers[0])
plt.title("Total Normalized Score by Judge")
plt.xticks(fontsize = 9)
display()
Billy had by far the lowest normalized score total, which indicates he rated most pizzas much lower than his favorite. Few pizzas came close to the 19/20 he gave to Tony D's. On the other end of the spectrum, Joe had the highest normalized score total, which indicates he rated most pizzas close to his favorite. Indeed, we see he gave 6 pizzas a score between 16 and 18, while his lowest came in at a 7. There is a large group with similar score totals, including Maddie, Mike, Kristi, Kelly, Jack, and Ben, indicating these judges had similar gaps between their favorite and least favorite pizzas. Similar to Joe, both Jim and Erica rated many pizzas close to their favorite; Jim gave 7 pizzas a score between 16 and 18, and Erica gave 5 pizzas a score between 16 and 18.
Score Distributions
ax = sns.boxplot(data = ratings.sort_values("Judge"), x = "Judge", y = "Score")
plt.title("Raw Score Distribution by Judge")
plt.xticks(fontsize = 9)
display()
The maximum possible score for a single pizza was 20. Sharon was the only judge to award a pizza a perfect score and one of only two judges (along with Kelly) to award a pizza the lowest possible score of 4. Kristi's lowest score was a 5, adding her to the trio of judges with a strong disdain for a particular pizzeria. Mike's scores were all between 12 and 16, making him far and away the judge with the least sensitive palate. Kelly and Erica also showcased strong opinions, with Kelly giving a low score of 4 and a high score of 19, and Erica giving a low score of 7 and a high score of 18. Kristi's scores were all significantly lower than average, ranging from 5 to 13 with most of her scores between 9 and 11. Ben, Billy, Jack, Jim, and Joe had similar score ranges. Billy and Kelly were the only judges to give a score of 19.
ax = sns.boxplot(data = ratings_norm.sort_values("Judge"), x = "Judge", y = "Score")
plt.title("Normalized Score Distribution by Judge")
plt.xticks(fontsize = 9)
display()
The distributions of normalized scores for all judges are fixed between 0 and 100 by design, but we can still learn more about the judges' behavior from them. Kristi and Sharon gave most pizzas a score around half of their highest score. Kristi's highest score was a 13 and she gave 7 pizzerias a score between 5 and 9. Maddie and Jack had the most diverse sets of scores within their scoring ranges. Sharon's two lowest scores were outliers even within her own distribution; she gave two low scores of 4 and 5 despite giving most pizzerias a score between 13 and 20. Jim and Joe had the highest median scores by far, indicating they rated most pizzerias near their highest score. On the other hand, Ben, Kelly, Kristi, Maddie, and Mike had medians near 50%, which suggests their scores were well distributed. Billy's median score is lower than 40%, reinforcing Tony D's as his runaway favorite.
Evaluating the Winner
Eliminating Bias
The question everyone wants answered is: which pizzeria was the best? This is not an easy question to answer well. There are many different scoring metrics we could use to determine the final rankings of these 12 pizzerias; the average score, the median score, the top five best scores, etc. There is not necessarily one scoring metric that is better than another. To maintain the integrity of this family tradition, we will establish a high standard of rigor from the very first CKC Pizza Festival. To find the best pizzeria, we will calculate a variety of different scoring metrics and aggregate them. For each scoring method, we will calculate the results using both the raw scores and the normalized scores. This will prevent criticisms such as “that pizza only won because you used the average not the median!”, or “that pizza would’ve scored higher if it didn’t have that one bad score!”. We will also conduct statistical analyses to test whether the order of the pizzas impacted the judges’ scores. In summary, we have a 3-pronged system for achieving an indisputable pizza champion:
- Multiple scoring methods to eliminate bias due to the data distribution
- Calculating each scoring method with raw and normalized scores to eliminate bias due to the judges' behavior
- Statistical analyses to test for bias due to the format of the contest.
Let’s begin.
final_results = pd.DataFrame(ratings["Pizzeria"].drop_duplicates())
Scoring Method #1: Average
The first scoring method will be a simple average of all the judges' scores:
Raw Scores
results = ratings.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Average Raw Scores")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 20])
display()
text = "Raw Average"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
Normalized Scores
results = ratings_norm.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt= "%.1f")
plt.title("Average Normalized Scores")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 100])
display()
text = "Norm Average"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
Scoring Method #2: Median
The second scoring method will be a simple median of all judges' scores:
Raw Scores
results = ratings.groupby("Pizzeria")["Score"].median().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Median Raw Scores")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 20])
display()
text = "Raw Median"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
Normalized Scores
results = ratings_norm.groupby("Pizzeria")["Score"].median().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt= "%.1f")
plt.title("Median Normalized Scores")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 100])
display()
text = "Norm Median"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
Scoring Method #3: Average Minus BestWorst
The third scoring method will take the average score without the highest and lowest score(s) for each pizzeria. This will eliminate bias that might have been introduced from overly harsh or kind judges.
Raw Scores
temp = ratings.copy()
g = ratings.groupby(['Pizzeria'])['Score'].transform('max')
temp = temp[~(temp['Score'] == g)]
g = temp.groupby(['Pizzeria'])['Score'].transform('min')
temp = temp[~(temp['Score'] == g)]
results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Average Raw Scores Minus BestWorst")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 20])
display()
text = "Raw Average MBW"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
Normalized Scores
temp = ratings_norm.copy()
g = temp.groupby(['Pizzeria'])['Score'].transform('max')
temp = temp[~(temp['Score'] == g)]
g = temp.groupby(['Pizzeria'])['Score'].transform('min')
temp = temp[~(temp['Score'] == g)]
results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Average Normalized Scores Minus BestWorst")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 100])
display()
text = "Norm Average MBW"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
Scoring Method #4: Median Minus BestWorst
The fourth scoring method will take the median score without the highest and lowest score(s) for each pizzeria. This will eliminate bias that might have been introduced from overly harsh or kind judges.
Raw Scores
temp = ratings.copy()
g = ratings.groupby(['Pizzeria'])['Score'].transform('max')
temp = temp[~(temp['Score'] == g)]
g = temp.groupby(['Pizzeria'])['Score'].transform('min')
temp = temp[~(temp['Score'] == g)]
results = temp.groupby("Pizzeria")["Score"].median().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Median Raw Scores Minus BestWorst")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 20])
display()
text = "Raw Median MBW"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
Normalized Scores
temp = ratings_norm.copy()
g = temp.groupby(['Pizzeria'])['Score'].transform('max')
temp = temp[~(temp['Score'] == g)]
g = temp.groupby(['Pizzeria'])['Score'].transform('min')
temp = temp[~(temp['Score'] == g)]
results = temp.groupby("Pizzeria")["Score"].median().sort_values(ascending = False).reset_index()
ax = sns.barplot(results, x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt = "%.1f")
plt.title("Median Normalized Scores Minus BestWorst")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 100])
display()
text = "Norm Median MBW"
results[text] = results["Score"].rank(ascending = False)
final_results = pd.merge(left = final_results, right = results[["Pizzeria", text]], left_on = "Pizzeria", right_on = "Pizzeria")
Scoring Method #5: Average N Best Judges
The fifth scoring method will take the average score for each pizzeria with different sets of judges. We will calculate the average scores with the 3 best judges for each pizzeria, then the 4 best judges for each pizzeria, etc. This scoring method will provide insight into how the pizzerias performed among the judges that liked them the most and will eliminate any bias from groups of harsh judges. For brevity, the final rankings are shown for each scoring method instead of the full scores.
Raw Scores
aggregate_results = pd.DataFrame()
for i in range(3, 10):
    # Average of each pizzeria's i highest raw scores, then rank pizzerias on that average
    temp = ratings.copy()
    temp = temp.groupby('Pizzeria')['Score'].nlargest(i).reset_index(level = 1, drop = True).reset_index()
    results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
    results = results.rename(columns = {"Score" : f'{i} Best Raw'})
    for col in results.columns:
        if col != "Pizzeria":
            results[col] = results[col].rank(method = 'dense', ascending = False).astype(int)
    if aggregate_results.empty:
        aggregate_results = results
    else:
        aggregate_results = pd.merge(left = aggregate_results, right = results, on = "Pizzeria")
display(aggregate_results)
final_results = pd.merge(left = final_results, right = aggregate_results, left_on = "Pizzeria", right_on = "Pizzeria")
 | Pizzeria | 3 Best Raw | 4 Best Raw | 5 Best Raw | 6 Best Raw | 7 Best Raw | 8 Best Raw | 9 Best Raw |
---|---|---|---|---|---|---|---|---|
0 | Est Est Est | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | Calibria | 2 | 3 | 2 | 2 | 2 | 4 | 4 |
2 | Di Fara | 3 | 2 | 3 | 3 | 3 | 2 | 3 |
3 | Frank Pepe | 4 | 8 | 8 | 7 | 5 | 6 | 6 |
4 | L&B | 4 | 4 | 6 | 5 | 4 | 5 | 5 |
5 | Fuoco | 5 | 5 | 7 | 8 | 6 | 8 | 8 |
6 | Tony D | 5 | 6 | 4 | 6 | 5 | 7 | 7 |
7 | Luigi's 4704 | 6 | 7 | 5 | 4 | 3 | 3 | 2 |
8 | Luna | 7 | 9 | 9 | 9 | 7 | 9 | 9 |
9 | Zeneli | 8 | 10 | 11 | 11 | 9 | 11 | 12 |
10 | Luigi's 686 | 9 | 11 | 10 | 10 | 8 | 10 | 10 |
11 | Angeloni's | 10 | 12 | 12 | 12 | 10 | 12 | 11 |
Normalized Scores
aggregate_results = pd.DataFrame()
for i in range(3, 10):
    # Average of each pizzeria's i highest normalized scores, then rank pizzerias on that average
    temp = ratings_norm.copy()
    temp = temp.groupby('Pizzeria')['Score'].nlargest(i).reset_index(level = 1, drop = True).reset_index()
    results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
    results = results.rename(columns = {"Score" : f'{i} Best Normalized'})
    for col in results.columns:
        if col != "Pizzeria":
            results[col] = results[col].rank(method = 'dense', ascending = False).astype(int)
    if aggregate_results.empty:
        aggregate_results = results
    else:
        aggregate_results = pd.merge(left = aggregate_results, right = results, on = "Pizzeria")
display(aggregate_results)
final_results = pd.merge(left = final_results, right = aggregate_results, left_on = "Pizzeria", right_on = "Pizzeria")
 | Pizzeria | 3 Best Normalized | 4 Best Normalized | 5 Best Normalized | 6 Best Normalized | 7 Best Normalized | 8 Best Normalized | 9 Best Normalized |
---|---|---|---|---|---|---|---|---|
0 | Di Fara | 1 | 2 | 2 | 2 | 2 | 4 | 4 |
1 | Est Est Est | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | L&B | 2 | 3 | 3 | 3 | 3 | 2 | 3 |
3 | Calibria | 3 | 4 | 4 | 4 | 4 | 3 | 5 |
4 | Fuoco | 4 | 5 | 6 | 5 | 6 | 6 | 6 |
5 | Tony D | 5 | 6 | 5 | 7 | 7 | 7 | 7 |
6 | Frank Pepe | 6 | 7 | 8 | 8 | 8 | 8 | 8 |
7 | Luigi's 4704 | 7 | 8 | 7 | 6 | 5 | 5 | 2 |
8 | Luna | 8 | 9 | 9 | 9 | 9 | 9 | 9 |
9 | Luigi's 686 | 9 | 11 | 11 | 11 | 11 | 11 | 10 |
10 | Angeloni's | 10 | 10 | 10 | 10 | 10 | 10 | 11 |
11 | Zeneli | 11 | 12 | 12 | 12 | 12 | 12 | 12 |
Scoring Method #6: Average N Worst Judges
The sixth scoring method, similar to the fifth method, will show how pizzerias performed among the judges that liked them the least. This will eliminate any bias due to groups of overly kind judges.
Raw Scores
aggregate_results = pd.DataFrame()
for i in range(3, 10):
    # Average of each pizzeria's i lowest raw scores, then rank pizzerias on that average
    temp = ratings.copy()
    temp = temp.groupby('Pizzeria')['Score'].nsmallest(i).reset_index(level = 1, drop = True).reset_index()
    results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
    results = results.rename(columns = {"Score" : f'{i} Worst Raw'})
    for col in results.columns:
        if col != "Pizzeria":
            results[col] = results[col].rank(method = 'dense', ascending = False).astype(int)
    if aggregate_results.empty:
        aggregate_results = results
    else:
        aggregate_results = pd.merge(left = aggregate_results, right = results, on = "Pizzeria")
display(aggregate_results)
final_results = pd.merge(left = final_results, right = aggregate_results, left_on = "Pizzeria", right_on = "Pizzeria")
 | Pizzeria | 3 Worst Raw | 4 Worst Raw | 5 Worst Raw | 6 Worst Raw | 7 Worst Raw | 8 Worst Raw | 9 Worst Raw |
---|---|---|---|---|---|---|---|---|
0 | Est Est Est | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | L&B | 2 | 2 | 2 | 3 | 3 | 3 | 3 |
2 | Luigi's 4704 | 3 | 3 | 3 | 2 | 2 | 2 | 2 |
3 | Frank Pepe | 4 | 4 | 5 | 5 | 6 | 6 | 6 |
4 | Calibria | 5 | 6 | 6 | 6 | 5 | 5 | 5 |
5 | Di Fara | 6 | 5 | 4 | 4 | 4 | 4 | 4 |
6 | Luna | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
7 | Luigi's 686 | 8 | 8 | 8 | 8 | 9 | 9 | 9 |
8 | Tony D | 9 | 9 | 9 | 9 | 8 | 8 | 8 |
9 | Angeloni's | 10 | 10 | 10 | 10 | 11 | 11 | 11 |
10 | Fuoco | 11 | 11 | 11 | 11 | 10 | 10 | 10 |
11 | Zeneli | 11 | 12 | 12 | 12 | 12 | 12 | 12 |
Normalized Scores
aggregate_results = pd.DataFrame()
for i in range(3, 10):
    # Average of each pizzeria's i lowest normalized scores, then rank pizzerias on that average
    temp = ratings_norm.copy()
    temp = temp.groupby('Pizzeria')['Score'].nsmallest(i).reset_index(level = 1, drop = True).reset_index()
    results = temp.groupby("Pizzeria")["Score"].mean().sort_values(ascending = False).reset_index()
    results = results.rename(columns = {"Score" : f'{i} Worst Normalized'})
    for col in results.columns:
        if col != "Pizzeria":
            results[col] = results[col].rank(method = 'dense', ascending = False).astype(int)
    if aggregate_results.empty:
        aggregate_results = results
    else:
        aggregate_results = pd.merge(left = aggregate_results, right = results, on = "Pizzeria")
display(aggregate_results)
final_results = pd.merge(left = final_results, right = aggregate_results, left_on = "Pizzeria", right_on = "Pizzeria")
 | Pizzeria | 3 Worst Normalized | 4 Worst Normalized | 5 Worst Normalized | 6 Worst Normalized | 7 Worst Normalized | 8 Worst Normalized | 9 Worst Normalized |
---|---|---|---|---|---|---|---|---|
0 | Est Est Est | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | Luigi's 4704 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
2 | L&B | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
3 | Frank Pepe | 4 | 4 | 4 | 5 | 6 | 6 | 6 |
4 | Di Fara | 5 | 5 | 5 | 4 | 4 | 4 | 4 |
5 | Luna | 6 | 6 | 8 | 9 | 9 | 9 | 9 |
6 | Tony D | 7 | 8 | 7 | 8 | 8 | 7 | 7 |
7 | Luigi's 686 | 8 | 10 | 10 | 10 | 10 | 10 | 10 |
8 | Calibria | 9 | 7 | 6 | 6 | 5 | 5 | 5 |
9 | Fuoco | 10 | 9 | 9 | 7 | 7 | 8 | 8 |
10 | Angeloni's | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
11 | Zeneli | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
Final Results
With all the scoring methods complete, it's time to tabulate the final results.
Standings
It's time to report the final results for the 12 pizzerias of the first annual CKC Pizza Festival. We will calculate the final score for each pizzeria using all 36 scoring methods covered above. Each pizzeria will receive 12 points if they came in first place for a scoring method, 11 points if they came in second, 10 points if they came in third, etc. This means across all 36 scoring methods, the highest possible score for a single pizzeria is 432 points. The final scores will be normalized between 0 and 100 to provide a more relative comparison between the pizzerias.
results = final_results.copy()
results.index = results["Pizzeria"]
results.drop("Pizzeria", axis = 1, inplace = True)
results = len(results.index.unique()) + 1 - results
results["Score"] = results.sum(axis = 1)
results = results[["Score"]].copy()
results['Score'] = ((results['Score'] - results['Score'].min()) / (results['Score'].max() - results['Score'].min())) * 100
ax = sns.barplot(results.sort_values("Score", ascending = False), x = "Pizzeria", y = "Score", ec = "black")
ax.bar_label(ax.containers[0], fmt= "%.1f")
plt.title("Final Scores of the 2024 CKC Pizza Rating Festival")
plt.xticks(fontsize = 8, rotation = 45)
plt.ylim([0, 105])
display()
Testing For Bias
One concern with the format of the first annual CKC pizza-rating festival is that the pizzas were brought out in a specific order. Perhaps the judges were more favorable in the beginning and harsher towards the end. To maintain the integrity of this family tradition, we will conduct rigorous statistical analyses to investigate whether there is any detectable bias related to the order of the pizzas. Let's start with a visual:
pizza_ranks = results["Score"].copy()
# One row per individual rating, keyed by pizzeria, carrying the pizza's lineup position
pizza_orders = ratings.set_index("Pizzeria")["Pizza ID"].copy()
# Attach each pizzeria's final score to every one of its ratings
pizza_data = pd.merge(pizza_orders, pizza_ranks, left_on = "Pizzeria", right_on = "Pizzeria")
sns.scatterplot(data = pizza_data, x = "Pizza ID", y = "Score")
plt.title("Final Scores for Pizzerias by Order in Tasting Lineup")
def label_point(x, y, val, ax):
    # Annotate each point with its pizzeria name, slightly offset from the marker
    a = pd.concat({'x': x, 'y': y, 'val': val}, axis=1)
    for i, point in a.iterrows():
        ax.text(point['x'] + 0.1, point['y'] - 5, str(point['val']), fontsize = 9)
label_point(pizza_data["Pizza ID"], pizza_data["Score"], pizza_data.index.to_series(), plt.gca())
display()
We can see in the graph above that the top 5 pizzerias occupied the 1st, 2nd, 6th, 7th, and 12th positions in the lineup. There does not seem to be a visually obvious correlation between a pizzeria's overall score and its being earlier or later in the lineup. Let's move on to more concrete testing:
Categorical Statistical Tests
We will start by conducting statistical tests to determine if there is a significant difference between the final scores of the pizzerias in the top half of the lineup and the bottom half of the lineup. We will run a variety of tests for completeness.
T-Tests
Final Scores
first_half = pizza_data[pizza_data["Pizza ID"] <= 6]["Score"].copy()
second_half = pizza_data[pizza_data["Pizza ID"] > 6]["Score"].copy()
t_stat, p_value = stats.ttest_ind(first_half, second_half)
print("T-statistic:", t_stat)
print("P-value:", p_value)
T-statistic: 1.5328059077494744 P-value: 0.1277537008396266
Since the p-value is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.
All Raw Scores
first_half = ratings[ratings["Pizza ID"] <= 6]["Score"].copy()
second_half = ratings[ratings["Pizza ID"] > 6]["Score"].copy()
t_stat, p_value = stats.ttest_ind(first_half, second_half)
print("T-statistic:", t_stat)
print("P-value:", p_value)
T-statistic: 0.39358413005468457 P-value: 0.6945328720113341
Since the p-value is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.
All Normalized Scores¶
first_half = ratings_norm[ratings_norm["Pizza ID"] <= 6]["Score"].copy()
second_half = ratings_norm[ratings_norm["Pizza ID"] > 6]["Score"].copy()
t_stat, p_value = stats.ttest_ind(first_half, second_half)
print("T-statistic:", t_stat)
print("P-value:", p_value)
T-statistic: 0.7516878960838435
P-value: 0.4535974942323605
Since the p-value is greater than 0.05, there is no evidence of a significant difference in mean normalized scores between the first and second halves of the lineup.
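The t-tests above assume the scores in each half are approximately normally distributed. A Shapiro-Wilk test on each group is one way to check that assumption, and also motivates the nonparametric Mann-Whitney tests that follow. This check was not run in the original notebook, and placeholder scores stand in for the real `ratings` data so the sketch runs on its own.

```python
import numpy as np
from scipy import stats

# Placeholder raw scores: 66 ratings per half (the real data has 132 total)
rng = np.random.default_rng(1)
first_half = rng.normal(14, 3, size=66)
second_half = rng.normal(13, 3, size=66)

# Shapiro-Wilk: a small p-value suggests the sample departs from normality
for name, sample in [("first half", first_half), ("second half", second_half)]:
    w, p = stats.shapiro(sample)
    print(f"{name}: W = {w:.3f}, p = {p:.3f}")
```

If either half clearly fails the normality check, the Mann-Whitney results below are the more trustworthy of the two families of tests.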
Mann-Whitney U Tests¶
Final Scores¶
first_half = pizza_data[pizza_data["Pizza ID"] <= 6]["Score"].copy()
second_half = pizza_data[pizza_data["Pizza ID"] > 6]["Score"].copy()
u_stat, p_value = stats.mannwhitneyu(first_half, second_half, method="exact")
print("U-statistic:", u_stat)
print("P-value:", p_value)
U-statistic: 2541.0
P-value: 0.09909440016666508
Since the p-value is greater than 0.05, there is no evidence of a significant difference in final-score distributions between the first and second halves of the lineup.
All Raw Scores¶
first_half = ratings[ratings["Pizza ID"] <= 6]["Score"].copy()
second_half = ratings[ratings["Pizza ID"] > 6]["Score"].copy()
u_stat, p_value = stats.mannwhitneyu(first_half, second_half, method="exact")
print("U-statistic:", u_stat)
print("P-value:", p_value)
U-statistic: 2292.0
P-value: 0.606593148588398
Since the p-value is greater than 0.05, there is no evidence of a significant difference in raw-score distributions between the first and second halves of the lineup.
All Normalized Scores¶
first_half = ratings_norm[ratings_norm["Pizza ID"] <= 6]["Score"].copy()
second_half = ratings_norm[ratings_norm["Pizza ID"] > 6]["Score"].copy()
u_stat, p_value = stats.mannwhitneyu(first_half, second_half, method="exact")
print("U-statistic:", u_stat)
print("P-value:", p_value)
U-statistic: 2340.0
P-value: 0.46359335005995844
Since the p-value is greater than 0.05, there is no evidence of a significant difference in normalized-score distributions between the first and second halves of the lineup.
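A p-value alone does not convey how large the half-vs-half difference is. The rank-biserial correlation, computed directly from the Mann-Whitney U statistic as r = 1 - 2U/(n₁n₂), gives an effect size on a -1 to 1 scale. This is an addition, not part of the original analysis, and the placeholder scores below stand in for the real data.

```python
import numpy as np
from scipy import stats

# Placeholder scores for the two halves of the lineup (66 ratings each)
rng = np.random.default_rng(2)
first_half = rng.normal(70, 10, size=66)
second_half = rng.normal(65, 10, size=66)

u_stat, p_value = stats.mannwhitneyu(first_half, second_half)
n1, n2 = len(first_half), len(second_half)

# Rank-biserial correlation: 0 means no tendency for either half to outscore
# the other; values near +/-1 mean one half's scores dominate the other's.
r = 1 - 2 * u_stat / (n1 * n2)
print(f"U = {u_stat:.1f}, rank-biserial r = {r:.3f}")
```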
Linear Regression Model¶
Finally, we will conduct a linear regression analysis on the original judges' scores to determine if there is a significant relationship between the pizzeria's order in the lineup and its scores.
Raw Scores¶
ratings = ratings.rename(columns={"Pizza ID": "ID"})
result = sm.ols(formula="Score ~ ID", data=ratings).fit()
print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Score   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.7307
Date:                Sun, 24 Nov 2024   Prob (F-statistic):              0.394
Time:                        10:44:27   Log-Likelihood:                -351.14
No. Observations:                 132   AIC:                             706.3
Df Residuals:                     130   BIC:                             712.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     13.9566      0.647     21.573      0.000      12.677      15.237
ID            -0.0751      0.088     -0.855      0.394      -0.249       0.099
==============================================================================
Omnibus:                        7.375   Durbin-Watson:                   1.563
Prob(Omnibus):                  0.025   Jarque-Bera (JB):                7.629
Skew:                          -0.587   Prob(JB):                       0.0221
Kurtosis:                       2.915   Cond. No.                         15.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Since the p-value on the coefficient is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.
Normalized Scores¶
ratings_norm = ratings_norm.rename(columns={"Pizza ID": "ID"})
result = sm.ols(formula="Score ~ ID", data=ratings_norm).fit()
print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Score   R-squared:                       0.016
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.120
Date:                Sun, 24 Nov 2024   Prob (F-statistic):              0.148
Time:                        10:44:31   Log-Likelihood:                -642.89
No. Observations:                 132   AIC:                             1290.
Df Residuals:                     130   BIC:                             1296.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     62.7568      5.899     10.639      0.000      51.087      74.427
ID            -1.1670      0.801     -1.456      0.148      -2.753       0.419
==============================================================================
Omnibus:                       32.406   Durbin-Watson:                   1.851
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                8.019
Skew:                          -0.253   Prob(JB):                       0.0181
Kurtosis:                       1.904   Cond. No.                         15.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Since the p-value on the coefficient is greater than 0.05, there is no evidence of a significant relationship between the position of the pizzeria in the lineup and its scoring.