Chapter 2 Data sources
2.1 Background
After several brainstorming sessions, our group found that all three of us were interested in food security and healthy diets. Ellen found a website called Open Food Facts - World, which provides a free, open database of food products with ingredients, allergens, nutrition facts, and all the other tidbits of information found on product labels. Given the rich content available, we reached a consensus to base our final project on food.
One thing to note about the website is that all of its data are contributed by the public:
“Open Food Facts is a food products database made by everyone, for everyone. You can use it to make better food choices, and as it is open data, anyone can re-use it for any purpose.” (Quoted)
2.2 About the Dataset
Our original plan was to scrape every item on the website, until we found that the full dataset can be downloaded from https://world.openfoodfacts.org/data.
The dataset contains 163 variables (features) and 356,001 records. The variables can be roughly divided into the following groups:
2.2.1 General information
This group includes information that uniquely identifies a product, such as the product code and URL.
code | url | creator | created_t | created_datetime |
last_modified_t | last_modified_datetime | product_name | generic_name | quantity |
2.2.3 Ingredients
As the name suggests, this group contains the ingredients of the product.
ingredients_text | traces | traces_tags |
2.2.4 Nutrition facts
Also as the name suggests, this group contains the nutrition facts that are printed on almost every food product. Note that all variables in this group carry the suffix "_100g", which means the amount of the nutrient per 100 g or 100 ml of product.
Because this group contains too many variables to list in full, we show only a few of them.
energy_100g | energy-kj_100g | energy-kcal_100g | proteins_100g | … |
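Since every variable in this group shares the "_100g" suffix, the group can be picked out of the header mechanically. The following is a minimal sketch, assuming the export is a tab-separated text file; the sample rows and the helper names nutrition_columns and read_nutrition are hypothetical and only serve the illustration.

```python
import csv
import io

# Hypothetical two-record sample; the real export is assumed to be
# tab-separated with one header row, like this string.
SAMPLE_TSV = (
    "code\tproduct_name\tenergy_100g\tproteins_100g\n"
    "0001\tOat flakes\t1500\t13.5\n"
    "0002\tMystery snack\t\t2\n"
)


def nutrition_columns(header):
    """Return the column names that end with the "_100g" suffix."""
    return [name for name in header if name.endswith("_100g")]


def read_nutrition(tsv_text):
    """Parse tab-separated text and keep only the "_100g" columns.

    Values are converted to float; empty cells become None.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    cols = nutrition_columns(reader.fieldnames)
    rows = []
    for record in reader:
        rows.append({c: float(record[c]) if record[c] else None for c in cols})
    return cols, rows


cols, rows = read_nutrition(SAMPLE_TSV)
```

Empty cells are kept as None rather than 0 so that a missing value is not mistaken for a measured zero.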
2.2.4.1 Others
The variables in this part do not fall under any common group.
serving_size | no_nutriments | additives_n | additives |
additives_tags | ingredients_from_palm_oil_n | ingredients_from_palm_oil | ingredients_from_palm_oil_tags |
ingredients_that_may_be_from_palm_oil_n | ingredients_that_may_be_from_palm_oil | ingredients_that_may_be_from_palm_oil_tags | nutrition_grade_fr |
main_category | main_category_fr | image_url | image_small_url |
2.3 Observation
Because the dataset contains so many variables, we hoped to find some way to eliminate part of them. Luckily, we noticed that many variables are duplicates of one another that differ only by a suffix, such as brands and brands_tags. They describe essentially the same information but record it in different ways. For example, an item that appears in brands as “Bob’s Red Mill” becomes “bob-s-red-mill” in brands_tags. The same holds for most variables with suffixes such as _tags, _fr, _t, _datetime, … Some variables even record the same quantity in different units, such as energy_100g, energy-kj_100g and energy-kcal_100g. Ignoring this redundancy would undoubtedly add to our burden when processing the data.
Therefore, when choosing variables, we avoided those with these suffixes.
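The suffix-based selection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the suffix tuple follows the spellings used in the variable lists of section 2.2, the column list is a small hypothetical sample, and to_tag is a hypothetical helper that merely approximates the tag form from the “Bob’s Red Mill” example, not the website's own rule.

```python
import re

# Suffixes treated as redundant. This tuple is an assumption based on the
# variable lists in section 2.2; adjust it to the columns actually present.
REDUNDANT_SUFFIXES = ("_tags", "_fr", "_t", "_datetime")


def to_tag(value):
    """Approximate the tag form of a value: lowercase, with every run of
    non-alphanumeric characters replaced by a single hyphen.
    Hypothetical helper, not the website's own normalization."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def drop_redundant(columns, suffixes=REDUNDANT_SUFFIXES):
    """Return the column names that do not end with a redundant suffix."""
    return [name for name in columns if not name.endswith(suffixes)]


# Hypothetical sample of column names taken from the lists above.
columns = [
    "brands", "brands_tags", "main_category", "main_category_fr",
    "last_modified_t", "last_modified_datetime", "ingredients_text",
]
kept = drop_redundant(columns)
```

A pure suffix rule is blunt: it would also drop a column such as nutrition_grade_fr, so any suffixed variable that is still wanted needs an explicit exception.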