When I hear organizations talk about their data, they love to present the tools they're using to handle and wrangle it. You've probably heard terms like Hadoop, Spark, Shark, PostgreSQL, MySQL, MongoDB, and, rarely, Excel. (If you haven't, there's a good list to look up on Wikipedia.)
I won’t argue that taming data doesn’t take good tools, but what I will argue is that the tools you use depend on the scale of your data.
I like to think in rough categories of data scale: small data, which fits comfortably in memory on a single machine (Excel, Python, and R territory); medium data, which outgrows memory but still fits on one machine's disk (where a relational database shines); and big data, which won't fit on a single machine at all (where distributed tools like Hadoop come in).
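If you want a back-of-the-envelope version of that taxonomy, here's a sketch in Python. The file name is hypothetical, and the cutoffs are illustrative rules of thumb that shift with your hardware, not hard limits:

```python
import os

# Hypothetical file; the thresholds below are rough rules of thumb,
# not hard limits -- adjust them for your own machines.
size_gb = os.path.getsize("data.csv") / 1e9

if size_gb < 2:        # comfortably fits in memory on a laptop
    print("small data: Excel / Python / R will do")
elif size_gb < 1000:   # fits on a single machine's disk
    print("medium data: reach for a relational database")
else:
    print("big data: distributed tools like Hadoop start to earn their keep")
```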
Now, I'm a big believer in the Pareto principle, which suggests that of all the tech companies out there, only about 20% (or fewer) actually need tools suited to big data. Here's a look at some counts from Indeed.com that roughly confirm that relationship:
So what does that mean for the tools you adopt? First, just because your data has outgrown Excel/Python/PHP/R/memory doesn't mean it's time to adopt Hadoop and hire a team to set it up. It means you should look into something like a relational database to interact with and investigate your data. Ideally you're already thinking about how to transform your data into something spreadsheet-shaped anyway, which makes an RDBMS a natural fit.
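To make that concrete, here's a minimal sketch of that workflow in Python using SQLite, about the lightest-weight relational database there is. The file `orders.csv` and its columns are made up for illustration; the point is the shape of the workflow, not the schema:

```python
import csv
import sqlite3

# Load a CSV that has outgrown the spreadsheet into a relational database.
conn = sqlite3.connect("orders.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)"
)

# "orders.csv" and its columns are hypothetical, for illustration only.
with open("orders.csv", newline="") as f:
    reader = csv.DictReader(f)
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        ((row["order_id"], row["customer"], float(row["amount"])) for row in reader),
    )
conn.commit()

# The payoff: ad hoc questions become one query instead of spreadsheet gymnastics.
for customer, total in conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY SUM(amount) DESC LIMIT 10"
):
    print(customer, total)

conn.close()
```

SQLite here just stands in for any relational database; with PostgreSQL or MySQL the sketch is nearly identical, modulo the connection setup and the parameter placeholder syntax.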
Of the relational databases I mentioned above, PostgreSQL and MySQL are both free, so the only cost you'd incur is the machines to host them and the time to set them up. The other big point in their favor: odds are someone on your team, or at least in your company, already knows how to use one and could start today.
All that said, there's definitely a place for tools like Hadoop, but whether you need them comes down to the specifics of your setup and how quickly your dataset is growing.