Hadas Bar - Blog - The three W’s of data collection

When we design a new big-data system, we must begin with precise data collection. Data collection starts with three questions: Who, What, and Where.

WHO

The first question is WHO will deliver the information; in other words, what are the data sources? In the DFTB project, we get reports from stakeholders across the supply chain: farmers, food processors, distributors, and supermarkets. Each stakeholder provides both manual and automatic information to the system. The manual data includes logistics information, while the automatic data comes from smart devices and IoT sensors and includes readings such as temperature and humidity. In addition, we can get information from government inspectors who follow standards such as ISPM 31 (one of the International Standards for Phytosanitary Measures). At the end of the supply chain, the end user provides feedback on the fruit they bought.

WHAT

Now that we know who will supply the data, we need to see WHAT information is relevant. How do we know which parameters are interesting to collect? Do we want to collect every piece of information from the field? Do we want the same parameters for all fruits?

As part of the DFTB project, we did a preliminary study on the subject. Hadas Bar studied avocados grown in Israel and found that the relevant parameters, in this case, are:

The percentage of fat in the fruit.
Key events in the fruit’s lifecycle: when was it picked? Was it refrigerated? How long ago was it taken out of refrigeration? Etc.
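As a rough illustration, the two parameters above could be captured in a record like the one below. This is only a sketch, not the DFTB data model: the class names, the event names ("picked", "refrigerated", "removed_from_refrigeration"), the batch identifier, and all the values are made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class LifecycleEvent:
    name: str            # hypothetical event name, e.g. "picked"
    timestamp: datetime  # when the event happened


@dataclass
class AvocadoRecord:
    batch_id: str        # hypothetical identifier for a batch of fruit
    fat_percent: float   # percentage of fat in the fruit
    events: List[LifecycleEvent] = field(default_factory=list)

    def last_event(self, name: str) -> Optional[LifecycleEvent]:
        """Return the most recent event with the given name, if any."""
        matches = [e for e in self.events if e.name == name]
        return max(matches, key=lambda e: e.timestamp) if matches else None

    def time_out_of_refrigeration(self, now: datetime) -> Optional[timedelta]:
        """How long ago was the fruit taken out of refrigeration?"""
        removed = self.last_event("removed_from_refrigeration")
        return now - removed.timestamp if removed else None


# Illustrative values only; they are not measurements from the study.
record = AvocadoRecord("batch-001", 18.5, [
    LifecycleEvent("picked", datetime(2021, 3, 1, 8, 0)),
    LifecycleEvent("refrigerated", datetime(2021, 3, 1, 14, 0)),
    LifecycleEvent("removed_from_refrigeration", datetime(2021, 3, 5, 9, 0)),
])
print(record.time_out_of_refrigeration(datetime(2021, 3, 6, 9, 0)))  # 1 day, 0:00:00
```

Keeping the events as a timestamped list, rather than as fixed fields, means questions such as "was it refrigerated?" or "how long ago was it picked?" can be answered from the same data.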

WHERE

The last question we must ask ourselves is WHERE all this data is stored and processed. The platform selection process must consider the amount of data gathered and the basic structure of the data, generally referred to as the data schema. In the DFTB project, we selected the Ethereum blockchain network based on the data we are collecting, its size, and its sensitivity. The stakeholders who use this data were also an essential factor in this selection.
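One property that makes a blockchain attractive for sensitive supply-chain data is that stored records are tamper-evident. The toy hash chain below illustrates that idea only; it is not Ethereum code, it has no network or consensus, and the function names and sample readings are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(payload: dict, prev_hash: str) -> str:
    """Hash the payload together with the hash of the previous entry."""
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Add a new entry that is linked to the last entry in the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "prev": prev_hash,
        "hash": entry_hash(payload, prev_hash),
    })


def is_valid(chain: list) -> bool:
    """Recompute every hash and check that the links are unbroken."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["payload"], entry["prev"]):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
append(chain, {"stage": "farm", "temperature_c": 6.0})         # made-up reading
append(chain, {"stage": "distributor", "temperature_c": 5.5})  # made-up reading
print(is_valid(chain))  # True

chain[0]["payload"]["temperature_c"] = 2.0  # someone edits a stored record
print(is_valid(chain))  # False
```

Because each entry's hash covers the previous entry's hash, changing any stored record invalidates the check from that point on. The sketch says nothing about whether a value was correct when it was first appended.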

Data reliability

After answering these crucial questions, the system must consider the authenticity and correctness of the data it receives and stores. By using blockchain, we ensure that no one can change the data once it is stored. Even so, the data is only immutable after insertion. Human error, or even intentional fraud, can still occur at the moment of insertion. The reliability of our data rests on two things:

The mutual interest between all stakeholders ensures they provide accurate data.
Multiple data sources - we gather data from end users, IoT, smart devices, and government inspectors.

How it works: At each stage of the supply chain, the stakeholder has an interest in making sure that the data received from the previous stakeholder is correct. Then, in case of negative feedback or refund requests, the whole supply chain will look for the stakeholder that caused the issue.

Next, we cross-check all our data sources and give the end user only verified information. In the end, if the buyer is unhappy, they can write feedback about the fruit, and the system can learn from it.
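The post does not describe how the cross-check is implemented, so the following is only one possible sketch. It assumes that several sources report the same quantity (here, a temperature) and treats a value as verified when at least two sources agree with the median within a tolerance. The function name, source names, tolerance, and readings are all hypothetical.

```python
from statistics import median


def cross_check(readings: dict, tolerance: float) -> dict:
    """Compare values reported by different sources for the same quantity.

    readings maps a source name to the value it reported.
    A reading is an outlier if it differs from the median by more than
    the tolerance; the result is verified if at least two sources agree.
    """
    if len(readings) < 2:
        # A single source cannot be cross-checked against anything.
        return {"verified": False, "consensus": None, "outliers": sorted(readings)}
    consensus = median(readings.values())
    outliers = sorted(
        source for source, value in readings.items()
        if abs(value - consensus) > tolerance
    )
    agreeing = len(readings) - len(outliers)
    return {"verified": agreeing >= 2, "consensus": consensus, "outliers": outliers}


# Made-up readings, in degrees Celsius, for the same shipment.
result = cross_check(
    {"iot_sensor": 5.4, "distributor_report": 5.5, "inspector": 9.0},
    tolerance=0.5,
)
print(result)  # {'verified': True, 'consensus': 5.5, 'outliers': ['inspector']}
```

A source that lands in the outliers list is not necessarily the one at fault, but it gives the supply chain a starting point when looking for the cause of a problem.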

In short: first we identify the data sources, then we decide what data we need, and finally we choose the right platform to store and process this data.

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement № 818182
