Big data fanfare abounds: we continually hear announcements that more data was created last year than in the entire prior history of humanity, and that data creation is on a two-year doubling cycle. Cheaper, faster storage has been the historical answer to this ever-growing capacity to generate data, but it is not necessarily the best solution. Much collected data is already thrown away without ever being stored (e.g., CCTV footage, real-time surgery video, and raw genome sequencing output). Much of the data that is stored remains unused, never cleaned into a human-usable form, because doing so is costly and challenging (de-duplication being a prime example).
Turning big data into smart data means moving away from data fundamentalism: the idea that data must be collected, and that data collection is an end in itself rather than a means. Advancement comes from smart data, not more data; from being able to cleanly extract and use the salient aspects of data (e.g., the 'diffs,' such as identifying the relevant genomic polymorphisms from a whole-genome sequence), rather than just generating data and discarding it or mindlessly storing it.
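The 'diffs' idea can be sketched in a few lines. This is a toy illustration only, loosely analogous to variant calling in genomics (real pipelines are far more involved): given a reference sequence and a sample, keep just the positions where they differ instead of storing the whole sample. The function name and sequences here are hypothetical.

```python
def extract_variants(reference, sample):
    """Return (position, base) pairs where sample differs from reference.

    Storing only these diffs, rather than the full sequence, is the
    'smart data' move: retain the salient deltas, discard the rest.
    """
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

# Toy sequences for illustration (not real genomic data).
reference = "ACGTACGTAC"
sample    = "ACGTTCGTAA"
print(extract_variants(reference, sample))  # [(4, 'T'), (9, 'A')]
```

Two single-base diffs stand in for an entire sample here; at genome scale, the same principle shrinks gigabytes of raw sequence to a compact set of polymorphisms.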