What is the definition of Big Data? “Data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods,” according to our friends at SAS[i]. But where is the line for “too large, fast and complex”? Is it data sets that are hard to process with a desktop or laptop and Microsoft Excel? Or with a standard server running a relational database? What counts as a data set that is growing fast? Is complex data simply data with a lot of columns and variables? And how do inexpensive cloud computing and storage move the line for “too large and complex”?
“Big Data” is one of those terms that many people use differently and imprecisely, so it really doesn’t mean much of anything. And, in reality, it’s “Big Insights” — the predictive analytics that lead to real business value — that people are after, not just “Data”.
One of our favorite tongue-in-cheek definitions of Big Data comes from the excellent article, “Critical Questions for Big Data” by Danah Boyd and Kate Crawford[ii]. “Big Data: a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology that provokes extensive utopian and dystopian rhetoric.”
To be clear – we believe in using data, statistics, data analysis, machine learning, and artificial intelligence to glean actionable insights. Yet we see a ton of hype and misinformation around these topics. And the term that seems least helpful, and most likely to provoke utopian or dystopian rhetoric, is “Big Data.”
More importantly, having data of any size is usually not the point. You may not need as much data as you think to realize valuable insights. In fact, we’re seeing that the amount of data needed to get AI/ML insights is getting smaller while tools and models are getting ‘smarter’. Smarter AI/ML tools are becoming more efficient in their analysis and rely less on huge volumes of data for their predictions. Rather than asking how big your data is, ask questions like these:
· What hypotheses are important to test, and what data is there to support them?
· Is the size of your data adequate for the number of outcomes, features, and possible values for those features? High data dimensionality can be a curse.
· Do you need to hold back sets of data from training to test and validate your model against? We hope yes, so you’ll want to have a large enough data set for holdouts.
· Are your data science teams getting caught up in and slowed down by the pipeline or ETL/ELT processes, rather than testing hypotheses quickly? You may need to push for quicker pilots, rather than perfectly structured data sets.
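To make the holdout question above concrete, here is a minimal sketch of splitting a data set into training, validation, and test sets using only Python's standard library. The function name `split_holdouts`, the 70/15/15 proportions, and the fixed seed are our own illustrative assumptions, not a recommendation for any particular project.

```python
import random


def split_holdouts(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test lists.

    The fractions and seed are illustrative assumptions; choose values
    that suit your own data and problem.
    """
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("holdout fractions must be non-negative and sum to less than 1")

    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_holdouts(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

With 1,000 rows, only 700 are left for training once the holdouts are set aside, which is why it is worth checking whether the data set is large enough before you start.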
If you’re getting frustrated by slow or no progress towards actual insights from your data science initiatives, or if you’re looking to get started with data science at your company, please let us know how we can help. Getting these projects started, unstuck and on track is what we do. Email us at email@example.com