Ingredients Of a Feather Go Together

At least I hope they do…


1. I don’t have 2 hours to make 1 dish.

2. I don’t want to buy the entire store every time I want to make something.

3. I don’t have the hand-eye coordination.

4. Cutting onions makes me cry every time.

I can never solve the first one because I am always busy doing OIDD 245 homework. The third one is just because I lack cooking skills. The last one should be solvable, but wearing glasses doesn’t seem to do anything for me. Thus, by process of elimination, I decided to tackle the second problem.

My goal for this project is to create a tool that takes in a list of ingredients you have in your refrigerator and returns a list of additional ingredients that work well with it. That way I can buy a few ingredients that all go well together.


You may notice that I switch between R and Python. Beyond showing off that I know both, I chose to switch because Python has a library (nltk) that helps me clean the ingredients better.


credits: allrecipes

Scraping was pretty straightforward: I used the read_html command in R to get the links to each of the recipes on the first 20 pages, then called read_html on each page to pull the ingredients and place each one into a separate entry. I decided to focus on one type of food, since the ingredients that work together depend on the cuisine and dish, so I went with a relatively simple one: salad.
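The post did this step with R’s read_html (from rvest). As a rough sketch of the extraction half in Python, here is the same idea using only the standard library’s HTML parser on a hard-coded snippet; the markup and the "ingredient" class name are made up for illustration, since allrecipes’ real page structure differs.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration; the real scrape fetched
# allrecipes pages, whose class names differ.
PAGE = """
<ul>
  <li class="ingredient">1 cup chopped red onion</li>
  <li class="ingredient">2 tablespoons olive oil</li>
  <li class="ingredient">1 head romaine lettuce</li>
</ul>
"""

class IngredientParser(HTMLParser):
    """Collect the text of every <li class="ingredient"> element."""

    def __init__(self):
        super().__init__()
        self.in_ingredient = False
        self.ingredients = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "ingredient") in attrs:
            self.in_ingredient = True

    def handle_data(self, data):
        if self.in_ingredient and data.strip():
            # One list entry per ingredient line, as in the post.
            self.ingredients.append(data.strip())
            self.in_ingredient = False

parser = IngredientParser()
parser.feed(PAGE)
print(parser.ingredients)
```

In the real pipeline this would run once per recipe link collected from the listing pages.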


Here is a screenshot of the ingredients right after I scraped them.

As you can see, there are many modifiers for each ingredient. However, I don’t care what happens to the onion (pulverize it or juice it for all I care); I just want to know that onion went into the dish. The same goes for the measurements: I don’t really care how much of each ingredient goes into the salad. I somehow need to identify which words correspond to ingredients and which correspond to modifiers.

The way I tackled this problem was to create a frequency dictionary of every word and look at the top 100, since some ingredients showed up only once and wouldn’t help make a good prediction. I manually excluded words like ‘cup’, ‘teaspoon’, ‘tablespoons’, ‘tablespoon’, ‘cups’, ‘can’, and ‘teaspoons’ that have no chance of being an ingredient. Then I realized that only nouns can be ingredients, so I found an NLP package in Python (nltk) to isolate the nouns in each ingredient line. Finally, I created n-grams for some of the ingredients, because some modifiers are necessary: in ‘cayenne pepper’, I can’t remove either ‘cayenne’ or ‘pepper’ without losing the meaning.
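The frequency-dictionary and measurement-word steps can be sketched as below. The sample lines are toy stand-ins for the scraped data, and the noun filtering (done with nltk’s part-of-speech tagger in the post) and bigram joining are only noted in comments here.

```python
from collections import Counter

# Toy sample lines; the real data was scraped from ~20 pages of salad recipes.
raw = [
    "1 cup chopped red onion",
    "2 tablespoons olive oil",
    "1 teaspoon cayenne pepper",
    "1 cup diced onion",
]

# Measurement words excluded by hand, as in the post.
stopwords = {"cup", "cups", "teaspoon", "teaspoons",
             "tablespoon", "tablespoons", "can"}

# Count every word that is not a number or a measurement word.
counts = Counter(
    w for line in raw for w in line.lower().split()
    if w not in stopwords and not w.isdigit()
)

# The post then kept only nouns (via nltk's pos_tag) and joined
# necessary bigrams like "cayenne pepper"; here we just print counts.
print(counts.most_common(3))
```

The top-100 cutoff in the post would simply be `counts.most_common(100)` on the full corpus.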

Here is a picture of the cleaned data set


heat map of frequency of each ingredient

Although there are some clusters, they don’t seem very well defined, so I don’t think any of the clustering techniques would work. However, when I graphed the frequency of each ingredient in sorted order, I noticed something strange.

I thought this graph looked oddly familiar, and then I realized that it follows Zipf’s law. If you don’t know it, Zipf’s law says that every language follows the same rule: if you graph the frequency of its most common words against their rank, you get a curve close to y = 1/x. This holds for every known language in the world, so when it applied to the ingredients in salads I was a little shook. I wanted to see if it was just a coincidence.

As you can see, the frequency of ingredients matches the graph of 1/x relatively well. This was when I realized I could treat each recipe as a document and the ingredients as words. That would give me a corpus I could apply NLP techniques to, since the ingredients form a pseudo-language.
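The Zipf check amounts to comparing each sorted frequency against f(rank) = f1 / rank. A minimal sketch, using made-up illustrative frequencies rather than the real ingredient counts:

```python
# Toy frequencies sorted descending; the real ones came from the
# salad-recipe corpus.
freqs = [100, 48, 33, 26, 20, 17, 14, 12, 11, 10]

# Zipf's law predicts frequency ~= f1 / rank.
f1 = freqs[0]
predicted = [f1 / rank for rank in range(1, len(freqs) + 1)]

# Relative error of each observed frequency vs. the Zipf prediction.
errors = [abs(obs - pred) / pred for obs, pred in zip(freqs, predicted)]
print(max(errors))
```

A small maximum relative error means the curve hugs 1/x, which is what the sorted-frequency graph showed.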

Thus, if we treat the corpus of ingredients as a language, I can use Markov chains: given an initial state, a Markov chain predicts the next word through a probability distribution. In a normal Markov chain, however, the order of the words matters, while in my case I don’t care whether the onion comes before or after the garlic, only that they appear in the same recipe, which suggests they go well together. So I modified the Markov chain formula a little: instead of creating a probability distribution for each word based on the next word, I created a probability distribution for each word based on the entire recipe, building a 100x100 matrix from scratch.
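The order-free modification boils down to a symmetric co-occurrence matrix: count how many recipes contain each pair of ingredients. A minimal sketch with toy recipes (the real matrix was 100x100 over the top salad ingredients):

```python
from collections import defaultdict
from itertools import combinations

# Toy recipes as ingredient sets; order inside a recipe is ignored.
recipes = [
    {"onion", "garlic", "tomato"},
    {"onion", "garlic", "olive oil"},
    {"tomato", "basil", "olive oil"},
]

# Instead of word -> next-word transitions, count pairwise
# co-occurrence within each recipe, in both directions.
cooccur = defaultdict(int)
for recipe in recipes:
    for a, b in combinations(recipe, 2):
        cooccur[a, b] += 1
        cooccur[b, a] += 1

print(cooccur["onion", "garlic"])
```

Normalizing each row of this matrix gives the per-ingredient probability distribution the post describes.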

heat map of the ingredient frequency matrix

You may notice a really dark line running through the diagonal. This doesn’t matter too much, because I will exclude the input ingredients from the predictions. With that, the Markov chain model is basically done. The heat map shows, for each ingredient (row), how many recipes contain both that ingredient and the column ingredient.
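Prediction then just sums each candidate’s co-occurrence counts with the input ingredients and ranks the candidates, skipping the inputs themselves (which is why the dark diagonal is harmless). A sketch on toy data; the function name `suggest` and the tiny recipe set are made up for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Toy co-occurrence counts built from a handful of recipe sets;
# the real matrix covered the top ~100 salad ingredients.
recipes = [
    {"onion", "garlic", "tomato"},
    {"onion", "garlic", "olive oil"},
    {"tomato", "basil", "olive oil"},
]
cooccur = defaultdict(int)
for recipe in recipes:
    for a, b in combinations(recipe, 2):
        cooccur[a, b] += 1
        cooccur[b, a] += 1

def suggest(inputs, vocab):
    # Score each candidate by total co-occurrence with the inputs,
    # excluding the inputs themselves (masking the diagonal).
    scores = {
        c: sum(cooccur[i, c] for i in inputs)
        for c in vocab if c not in inputs
    }
    return sorted(scores, key=scores.get, reverse=True)

vocab = {"onion", "garlic", "tomato", "olive oil", "basil"}
print(suggest({"onion"}, vocab))
```

Ingredients that share more recipes with the inputs score higher, matching the behavior noted at the end of the post.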


List of ingredients

Because this is unsupervised, there is no correct answer, and the only way to verify that it works is to eyeball the results, which are completely subjective. Here are some example inputs and outputs.

Overall these seem like pretty good combinations of ingredients; I will definitely try putting bacon and almonds together next time I make a salad. I know the results are at least plausible because the only ingredients with positive probability are ones that appear alongside the input ingredients in some recipe, so they are all acceptable answers. One important thing to note is that ingredients sharing many recipes with the input ingredients are more likely to show up in the prediction.