When I see organizations using and talking about their data, they love to present the tools they’re using to handle and wrangle it. You’ve probably heard terms like Hadoop, Spark, Shark, PostgreSQL, MySQL, MongoDB, and, rarely, Excel. (If you haven’t, there’s a good list to look up on Wikipedia.)
I won’t argue that taming data doesn’t take good tools, but what I will argue is that the tools you use depend on the scale of your data.
I like to think of the following rough categories of data scale:
- Small data – dataset fits in RAM (anywhere from 1 MB to 8 GB)
- Medium data – dataset fits on a single hard drive (8 GB to 1 TB)
- Big data – dataset takes multiple hard drives to store (anything above 1 TB)
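The categories above can be sketched as a simple helper. This is just an illustration of the rough cut-offs (8 GB for RAM, 1 TB for a single drive), not hard rules; the function name and thresholds are my own, not a standard.

```python
def data_scale(size_bytes):
    """Roughly classify a dataset by storage size.

    Thresholds are the rough cut-offs from the categories above:
    8 GB for "fits in RAM", 1 TB for "fits on one drive".
    """
    GB = 1024 ** 3
    if size_bytes <= 8 * GB:
        return "small"   # fits in RAM: Excel/Python/R can load it directly
    elif size_bytes <= 1024 * GB:
        return "medium"  # fits on one disk: a relational database works well
    else:
        return "big"     # multiple drives: distributed tools become relevant

print(data_scale(500 * 1024 ** 2))  # a 500 MB CSV -> small
print(data_scale(200 * 1024 ** 3))  # a 200 GB event log -> medium
```

In practice you’d feed this something like `os.path.getsize()` on your raw files to get a first-pass answer before picking tools.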
Now, I’m a big believer in the Pareto principle, which should lead me to believe that of all the tech companies out there, only about 20% (or fewer) need the tools suited for big data. Here’s a look at some job-posting counts from Indeed.com that roughly bear out that relationship:
- Spark – 8,701
- Hadoop – 13,723
- Oracle Database – 27,177
- MySQL – 21,770
- PostgreSQL – 4,285
- Microsoft Access – 67,538
So what does that mean for the tools you adopt? First, just because your data has grown too big for Excel/Python/PHP/R/memory doesn’t mean it’s time to adopt Hadoop and hire a team to set it up. It means you should look into something like a relational database to interact with and investigate your data. Ideally you’re already thinking about how to transform your data into something like a spreadsheet, which means an RDBMS is a natural fit.
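As a minimal sketch of that RDBMS route, here is Python’s built-in sqlite3 answering the kind of question you’d otherwise handle with a pivot table. The table name, columns, and values are made up for illustration:

```python
import sqlite3

# An in-memory database for the example; point sqlite3.connect() at a
# file path once the data no longer fits comfortably in a spreadsheet.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.5), ("east", 42.25)],
)

# The spreadsheet-shaped question: total sales per region.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)
```

SQLite is the zero-setup end of the spectrum; the same query works essentially unchanged on MySQL or PostgreSQL once you need a shared server.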
Of the four databases listed above, two (MySQL and PostgreSQL) are free, so the only costs you’d incur would be the machines to host them and the time spent setting them up. The other big advantage is that someone on your team, or elsewhere in your company, likely already knows how to use one today.
All that said, there’s definitely a place for tools like Hadoop, but whether they’re right for you will depend heavily on your specific situation and how your dataset is growing.