1 Executive Summary
Big data is the result of the ever-increasing number of devices, sensors, and people interacting and producing data. But in the interest of privacy and efficiency, should machine learning (ML) projects always strive to work with such huge datasets? Asking this question helps guide the development of ML on its road to maturity, so that it delivers results efficiently while maintaining high accuracy and efficacy.
Researchers often must work with small datasets rather than Big Data because that is all their resource and time constraints permit. Beyond this practical reality that many researchers and businesses face regarding the availability and volume of data, privacy and data minimization trends seen in other industries suggest that a preference for gathering, storing, and using smaller amounts of data may eventually emerge in ML as well. Small data may provide benefits in particular contexts, such as in industries that face challenges in collecting high volumes of data to train ML models, in scenarios where the computing power during training or operations should be reduced, or when adding ML capabilities to mobile and IoT devices adds significant value.
Small datasets are not yet the standard in ML, but it is worth considering whether your ML needs would be better served by a small data approach. Using small datasets in ML does raise issues, and they should not be taken lightly. But there are technological developments that can maintain high accuracy of ML models when using small datasets.
Small data refers to smaller datasets, but also to determining correlations on a smaller scale, such as within an individual’s behaviors as collected from that individual’s devices. Closely intertwined with and dependent on IoT, this data paradigm sees value in maintaining the context of the individual rather than simply amassing large quantities of data from the increasing number of smart devices.
The research on small data that pertains to ML was conducted primarily in the context of image classification models, where small datasets can be considered to have fewer than 100 images per class, although there may still be a high number of classes. The information provided here should equip decision-makers to have informed conversations with technical experts and vendors and to develop solutions that are well aligned with business needs.