One area of long-standing data interest is mortgage statistics, since mis-estimating prepayments can cost investors billions of dollars; this raises the question of why prepayment risk is still being mis-estimated. Regardless, mortgage data is one of the fastest-growing kinds of data, by both rows and columns tracked, growing at more than twice the rate of Moore's law on a log chart (Moore's law reflects the hardware on which the data is stored and manipulated, with algorithms somewhat filling the gap). This suggests the goal should be smart data rather than merely big data.
There is much talk about all types of data growing (and about data science being the fastest-growing job category), but size is surely the most basic attribute of big data. Far more relevant is the value big data provides through its use. For example, how has having more rows and columns in mortgage-tracking spreadsheets improved prepayment prediction, if at all?
Like genomics, many big data problems are in the early stages of 'the diffs': not yet knowing which parts of the data are salient to keep out of the 99% that may be useless. 'The diffs' are the differences between a sample data set and the reference (normal) data set; they constitute the salient information and allow the rest of the data to be discarded.
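The diff idea can be sketched in a few lines of Python: store only the positions where a sample differs from a shared reference, and throw the rest away, since the full sample can always be rebuilt from the reference plus its diffs. This is a toy illustration of the principle, not how genomic or mortgage pipelines actually encode their data.

```python
def diff_against_reference(reference, sample):
    """Return only the positions where sample differs from reference.

    Keeping just these 'diffs' lets the rest of the sample be
    discarded, because it is identical to the reference."""
    return {i: s for i, (r, s) in enumerate(zip(reference, sample)) if r != s}


def reconstruct(reference, diffs):
    """Rebuild the full sample from the reference plus its diffs."""
    return [diffs.get(i, r) for i, r in enumerate(reference)]


reference = list("GATTACA")
sample = list("GATCACA")

diffs = diff_against_reference(reference, sample)
# Only one position differs, so only one entry is stored: {3: 'C'}
assert diffs == {3: "C"}
assert reconstruct(reference, diffs) == sample
```

The storage saving is exactly the point made above: if 99% of a sample matches the reference, the diff dictionary holds roughly 1% of the data while losing nothing.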