Ingredients Of a Feather Go Together

May 4, 2021

At least I hope they do…

Background

Over quarantine, I became obsessed with the YouTube channel Tasty. Watching the disembodied hand cook made my mouth water every time. Yet, when I tried to do the same I quickly ran into four big problems.

1. I don’t have 2 hours to make 1 dish.

2. I don’t want to buy the entire store every time I want to make something.

3. I don’t have the hand-eye coordination.

4. Cutting onions makes me cry every time.

I can never solve the first one because I am always busy doing OIDD 245 hw. The third one is just because I lack cooking skills. The last one is solvable, but wearing glasses doesn’t seem to do anything for me. Thus, by process of elimination, I decided to tackle the second problem.

My goal for this project is to create a tool that takes in a list of ingredients that you have in your refrigerator and returns a list of ingredients that work well with them. This way I can buy a few ingredients that all work well together.

Overview

The general pipeline of my project looks something like this:

You may notice that I switch between R and Python. Other than to show off that I know both, I chose to switch because Python has a library (nltk) that helps me clean the ingredients better.

Scraping

Scraping was pretty straightforward. I used the read_html command in R to get the links to each of the recipes on the first 20 pages, then used read_html on each recipe page to get the ingredients and place each ingredient into a separate entry. Since the ingredients that work together depend on the cuisine and the dish, I decided to focus on one relatively simple type of food: salad.
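For illustration, the second half of that step (pulling each ingredient out of a recipe page) could look roughly like this in Python. This is only a sketch using the standard library: the HTML snippet and the "ingredient" class name are made up, the real page structure is not shown here, and the actual scraping was done in R.

```python
from html.parser import HTMLParser

# Hypothetical markup: the class name "ingredient" is an assumption
# for illustration only, not the real structure of any recipe site.
SAMPLE_PAGE = """
<ul>
  <li class="ingredient">1 cup chopped red onion</li>
  <li class="ingredient">2 tablespoons olive oil</li>
  <li class="other">Advertisement</li>
</ul>
"""

class IngredientParser(HTMLParser):
    """Collects the text of every <li class="ingredient"> element."""

    def __init__(self):
        super().__init__()
        self.in_ingredient = False
        self.ingredients = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "ingredient":
            self.in_ingredient = True
            self.ingredients.append("")  # start a new entry

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_ingredient = False

    def handle_data(self, data):
        if self.in_ingredient:
            self.ingredients[-1] += data

parser = IngredientParser()
parser.feed(SAMPLE_PAGE)
ingredients = [text.strip() for text in parser.ingredients]
print(ingredients)
```

Each ingredient line ends up as its own entry, which is the shape the cleaning step needs.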

Cleaning

Almost immediately I realized just how hard this project was.

Here is a screenshot of the ingredients right after I scraped them.

As you can see, there are many modifiers for each ingredient. However, I don’t care what happens to the onion. Pulverize it or juice it for all I care; I just want to know that they put onions into the dish. The same goes for the measurements: I don’t really care how much of each ingredient is put into the salad. I somehow need to identify which words correspond to the ingredients and which correspond to the modifiers.

The way I tackled this problem was to create a frequency dictionary of every word and look at the top 100 words. Some ingredients showed up only once, so I don’t think they would help make a good prediction. I manually excluded a bunch of words like ‘cup’, ‘teaspoon’, ‘tablespoons’, ‘tablespoon’, ‘cups’, ‘can’, and ‘teaspoons’ that have no chance of being an ingredient. Then I realized that only nouns would be ingredients, so I used an NLP package in Python (nltk) to isolate only the nouns in each ingredient. Finally, I created n-grams for some of the ingredients because some modifiers are necessary: in cayenne pepper, I can’t remove pepper or cayenne without losing the meaning.
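A minimal sketch of the counting part of this step, assuming made-up ingredient lines and a hypothetical hand-written list of two-word ingredients. It only covers the frequency dictionary, the unit exclusion, and the n-gram joining; the noun filtering with nltk is left out, so modifiers like "chopped" still show up here.

```python
from collections import Counter

# Toy ingredient lines, made up for illustration.
raw_lines = [
    "1 cup chopped red onion",
    "2 tablespoons olive oil",
    "1 teaspoon cayenne pepper",
    "1 onion, thinly sliced",
]

# Measurement words that can never be an ingredient, excluded by hand.
UNITS = {"cup", "cups", "teaspoon", "teaspoons", "tablespoon", "tablespoons", "can"}
# Hypothetical list of two-word ingredients that must stay together.
BIGRAMS = {("cayenne", "pepper"), ("olive", "oil")}

def tokenize(line):
    """Lowercase a line, keep alphabetic words, drop units, join known bigrams."""
    words = [w.strip(",.").lower() for w in line.split()]
    words = [w for w in words if w.isalpha() and w not in UNITS]
    tokens, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in BIGRAMS:
            tokens.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

# Frequency dictionary over every token, then keep the 100 most common.
word_counts = Counter(tok for line in raw_lines for tok in tokenize(line))
top_words = [word for word, _ in word_counts.most_common(100)]
print(word_counts.most_common(3))
```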

Here is a picture of the cleaned data set.

Model

Generally there are 3 types of machine learning models: supervised, unsupervised, and reinforcement learning. This problem falls under the unsupervised learning category, as there is no real answer for which ingredients work well together. Finding a model that would give me what I want was hard, since I don’t really see how k-nearest neighbors or clustering could help. If I plot the input in a 100-dimensional space, what would it mean to go to the nearest cluster? Here is a heat map of the frequency of each ingredient.

heat map of frequency of each ingredient

Although there are some clusters, they don’t seem very well defined, so I don’t think any of the clustering techniques would work. However, when I graphed the frequency of each ingredient in sorted order, I noticed something strange.

I thought this graph looked oddly familiar, and then I realized that it follows Zipf’s law. If you don’t know, Zipf’s law says that if you rank the words of a language from most to least common, the frequency of each word is roughly proportional to 1 over its rank, so the graph looks close to y = 1/x. This pattern has been observed in many natural languages, so when it applied to the ingredients in salads I was a little shook. I wanted to see if it was just a coincidence.

As you can see, the frequency of ingredients matches the graph of 1/x relatively well. This was when I realized I could treat each recipe as a document and the ingredients as words. This would create a corpus that I could use NLP techniques on, since the ingredients formed a pseudo-language.
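A rough sketch of that comparison, assuming a handful of made-up recipes. It ranks the ingredients by frequency and puts each observed count next to the 1/x-style prediction (the top frequency divided by the rank). The toy numbers are only there to show the mechanics, not to reproduce the real data.

```python
from collections import Counter

# Toy recipes, made up for illustration.
recipes = [
    ["onion", "salt", "pepper", "tomato"],
    ["onion", "salt", "pepper"],
    ["onion", "salt", "lettuce"],
    ["onion", "bacon"],
]

counts = Counter(ing for recipe in recipes for ing in recipe)
ranked = counts.most_common()  # sorted by frequency, most common first
top_freq = ranked[0][1]

# Zipf's law predicts that the frequency at rank r is roughly top_freq / r.
comparison = []
for rank, (ingredient, freq) in enumerate(ranked, start=1):
    comparison.append((rank, ingredient, freq, top_freq / rank))

for rank, ingredient, freq, predicted in comparison:
    print(f"{rank:>2} {ingredient:<8} observed={freq} zipf={predicted:.2f}")
```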

If we consider the corpus of ingredients as a language, then I can use Markov chains, since given an initial sentence a Markov chain can predict the next words through a probability distribution. However, in a normal Markov chain the order of the words matters, while in my case I don’t really care if the onion comes before or after the garlic, just that they are in the same recipe, which suggests that they go well together. So I modified the Markov chain just a little: instead of creating a probability distribution for each word based on the next word, I created a probability distribution for each word based on the entire recipe, building a 100x100 matrix from scratch.

heat map of the ingredient frequency matrix

You may notice that there is a really dark line running along the diagonal. This doesn’t matter too much because I will be limiting the predicted ingredients to not include the input ingredients. Basically, for each ingredient (row), this graph shows how many recipes have both that ingredient and the ingredient in the column. So now the Markov chain model is basically done.
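Building a matrix like this can be sketched in a few lines of plain Python. The recipes below are made up, and the vocabulary is tiny instead of the top 100 ingredients, but the idea is the same: every pair of ingredients in the same recipe adds one to the matching cell, regardless of order.

```python
from itertools import permutations

# Toy data, made up for illustration; the real matrix is 100x100.
recipes = [
    ["onion", "garlic", "tomato"],
    ["onion", "bacon", "almond"],
    ["onion", "garlic"],
]
vocab = sorted({ing for recipe in recipes for ing in recipe})
index = {ing: i for i, ing in enumerate(vocab)}

# cooc[i][j] = number of recipes containing both ingredient i and ingredient j.
n = len(vocab)
cooc = [[0] * n for _ in range(n)]
for recipe in recipes:
    present = set(recipe) & set(vocab)   # ignore order and duplicates
    for a in present:
        cooc[index[a]][index[a]] += 1    # diagonal: recipes containing a
    for a, b in permutations(present, 2):
        cooc[index[a]][index[b]] += 1    # both (a, b) and (b, a) get counted

print(cooc[index["onion"]][index["garlic"]])  # 2
```

The diagonal holds the largest value in each row because every recipe that contains an ingredient trivially contains it alongside itself, which is where the dark line in the heat map comes from.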

Prediction

To make a prediction with the model, I took the input list, found the row vectors that correspond to each input ingredient, and added them up to get the final vector. (One important thing to note is that every input ingredient must be one of the 100 ingredients. If it isn’t, then I have no data for the ingredient and it will throw an error.) Then I divided each number by the total to get percentages. This gives a probability distribution, and I simply choose a couple of ingredients at random based on those probabilities. We want to randomize the results because if we always picked the ingredients with the highest percentages, every prediction would include the most commonly used ingredients like onions, salt, and pepper.
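The prediction step can be sketched like this. The counts are made-up numbers for four ingredients, and the function name suggest and its parameters are hypothetical. Picking without replacement (zeroing out each ingredient once it is chosen) is an assumption on my part so the same ingredient is not suggested twice.

```python
import random

# Hypothetical co-occurrence counts for four ingredients (made-up numbers).
vocab = ["onion", "garlic", "bacon", "almond"]
cooc = [
    [3, 2, 1, 1],
    [2, 2, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 1],
]
index = {ing: i for i, ing in enumerate(vocab)}

def suggest(inputs, k=2, rng=random):
    """Sample up to k distinct ingredients, weighted by co-occurrence with the inputs."""
    # Add up the rows of the input ingredients (raises KeyError if one is unknown).
    scores = [0] * len(vocab)
    for ing in inputs:
        row = cooc[index[ing]]
        scores = [s + r for s, r in zip(scores, row)]
    # Never suggest something that is already in the input list.
    for ing in inputs:
        scores[index[ing]] = 0
    picks = []
    for _ in range(k):
        total = sum(scores)
        if total == 0:
            break  # nothing left with a positive probability
        probs = [s / total for s in scores]  # normalize to a distribution
        choice = rng.choices(range(len(vocab)), weights=probs)[0]
        picks.append(vocab[choice])
        scores[choice] = 0  # sample without replacement
    return picks

print(suggest(["bacon"], k=2, rng=random.Random(0)))
```

With bacon as the input, only onion and almond have a positive probability in this toy matrix, so those are the only two ingredients that can come back.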

Because this is unsupervised, there is no correct answer, and the only way to verify that it works is to eyeball it, since the results are completely subjective. Here are some example inputs and outputs.

Overall these seem like pretty good combinations of ingredients. I will definitely try putting bacon and almonds together next time I make a salad. I know these are relatively reasonable because the only ingredients with a positive probability are ones that show up alongside the input ingredients in a recipe, so they are all acceptable answers. One important thing to note is that ingredients that share a lot of recipes with the input ingredients will be more likely to show up in the prediction.

Improvements

One way I could improve this model is by including more recipes, since I only used 4000 this time. In addition, adding more types of recipes might change the results as well. For instance, if you want to make a burger, the ingredients would be wildly different from salads, and thus the combinations would be different as well. One feature I would want to add is to also recommend a recipe to make. This wouldn’t be too hard since I could just compute the cosine similarity between the input ingredients and each recipe and return the recipe that is the closest.
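That recipe recommendation idea could look something like the sketch below. It is not built yet, so everything here is an assumption: the recipe names and ingredients are made up, each recipe is turned into a 0/1 vector over the vocabulary, and closest_recipe is a hypothetical helper.

```python
import math

# Toy data, made up for illustration.
vocab = ["onion", "garlic", "bacon", "almond", "tomato"]
recipes = {
    "tomato salad": ["onion", "garlic", "tomato"],
    "bacon almond salad": ["onion", "bacon", "almond"],
}

def to_vector(ingredients):
    """Binary vector: 1 if the ingredient is present, else 0."""
    present = set(ingredients)
    return [1 if ing in present else 0 for ing in vocab]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_recipe(have):
    """Return the name of the recipe most similar to the ingredients on hand."""
    query = to_vector(have)
    return max(recipes, key=lambda name: cosine(query, to_vector(recipes[name])))

print(closest_recipe(["bacon", "almond"]))  # bacon almond salad
```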

Sources

https://www.allrecipes.com/
https://www.nltk.org/