Chapter 2 Data sources

2.1 Background

After several times of brainstorming, our group found all three of us interested in food security and healthy diet. Ellen found a website called Open Food Facts - Word, which provides free open database of food products with ingredients, allergens, nutrition facts and all the tidbits of information we can find on product labels. With rich content we can obtain, we reached a consensus of working on the final project related to FOODS.

One thing to be noted about the website is that: all data are contributed by the public:

“Open Food Facts is a food products database made by everyone, for everyone. You can use it to make better food choices, and as it is open data, anyone can re-use it for any purpose.” (Quoted)

2.2 About the Dataset

The original thought was to scrape all items on this website until we found the dataset is available on this link https://world.openfoodfacts.org/data.

The dataset contains 163 variables(features) and 356,001 records, Basically we can divide those variables into the following fields:

2.2.1 General information

Including information that uniquely identify the products. Those information include product code, url, etc.

code	url	creator	creator_t	creator_datetime
last_modified_t	last_modified_datetime	product_name	generic_name	quantity

2.2.2 Tags

This part contains tags information that tell us about things, for example, where the product is from? What is its brands? Where it is sold to?

packaging	packaging_tags	brands	brands_tags	categories	categories_tags
categories_fr	origins	origins_tags	manufacturing_places	manufacturing_places_tags	labels
labels_tags	labels_fr	emb_codes	emb_codes_tags	first_packaging_code_geo	cities
cities_tags	purchase_places	stores	countries	countries_tags	countries_fr

2.2.3 Ingredients

As the name suggests, it contains ingredients of the product.

ingredients_text

traces

traces_tags

2.2.4 nutrition facts

Also as the name suggests, it contains the nutrition facts that is visible almost in any food products. One thing need to be noted in this part, all variables in this part are with a suffix of "_100g" which means the amount of nutriment for 100g or 100ml of product.

Because there are too many variables in this part, so, I just list few of which.

energy_100g

energy-kj_100g

energy-kcal_100g

proteins_100g

…

2.2.4.1 Others

There is no general fields for variables in this part.

serving_size	no_nutriments	additives_n	additives
additives_tags	ingredients_from_palm_oil_n	ingredients_from_palm_oil	ingredients_from_palm_oil_tags
ingredients_that_may_be_from_palm_oil_n	ingredients_that_may_be_from_palm_oil	ingredients_that_may_be_from_palm_oil_tags	nutrition_grade_fr
main_category	main_category_fr	image_url	image_small_url

2.3 Observation

Due to the large amount of variables that can be observed from this dataset, we hope by some means we can eliminate part of the variables. Luckily, one observation that can be generated from this dataset is that we found that there are a lot of duplicate variables that end with a certain suffix like brands and brands-tag. They basically describe the same information but with different recording methods. For example in brands an item is “Bob’s Red Mill”, while in brands-tag that item becomes “bob-s-red-mill”. That situation happens to most of the variables with suffix like _tag, _fr, _t, _date,… Some may even use different measurements like energy_100g, energy-kj_100g and energy-kcal_100g. If we just ignore it, that will undoubtedly add to our burden when processing the data.

Therefore, when we choose variables, we avoid choosing ones with those suffixes.