Getting Started With Tabular Data

May 6, 2015    data-science data databases analysis

I look at data every day. If I had to go back to a past version of myself to give him advice, I’d offer this: make it a rule to fit your data into a box.

There are plenty of mathematical techniques out there for analyzing data, but to apply them effectively, your data needs to fit the following format:

  • Data consists of rows and columns
  • Your data should be viewable using any common spreadsheet application
  • Each row represents an instance of data (in other words, each row represents one object under study, be it a person, a spammy email, or a photograph with a face to be recognized)
  • Each column represents a feature, something that we can use to describe the instance (this could be a person’s height, the number of occurrences of the word “FREE” in a spammy email, or the length of a detected edge in a picture of a face)
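
To make the format concrete, here is a minimal sketch in Python using only the standard library. The people, the column names, and the `to_csv` helper are all made up for illustration; the point is only that each dict is one row (an instance) and each key is one column (a feature), and that the result opens in any spreadsheet application.

```python
import csv
import io

# Hypothetical instances: each dict is one row, each key is one column (feature).
people = [
    {"name": "Ada", "height_cm": 168, "age": 36},
    {"name": "Grace", "height_cm": 170, "age": 45},
]


def to_csv(rows):
    """Serialize a list of same-keyed dicts as CSV text a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()    # first line: the column (feature) names
    writer.writerows(rows)  # one line per instance
    return buffer.getvalue()


table = to_csv(people)
print(table)
```

If your data can be written out this way without losing what you care about, it already fits the box.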

When you encounter some new data, it’s best to strive to fit it into that framework. Do you have a pile of server logs that you’re analyzing? Figure out the instances you’re studying (likely unique visitors), then tie each log entry to a particular visitor (to be represented by a row) and describe that visitor via features (like how many times they visited in a week, what User-Agent they use, which sites they’ve clicked on).
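
As a rough sketch of that server-log idea, the Python below collapses log entries into one row per visitor. It assumes the logs have already been parsed into tuples and that something (a cookie, an IP address) identifies each visitor; the field names, the sample entries, and the `visitors_table` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical, already-parsed log entries: (visitor_id, day, user_agent, path)
log_entries = [
    ("v1", "2015-05-04", "Firefox", "/home"),
    ("v2", "2015-05-04", "Chrome", "/home"),
    ("v1", "2015-05-05", "Firefox", "/pricing"),
    ("v1", "2015-05-05", "Firefox", "/home"),
]


def visitors_table(entries):
    """Collapse log entries into one row (a dict) per visitor."""
    grouped = defaultdict(list)
    for visitor_id, day, user_agent, path in entries:
        grouped[visitor_id].append((day, user_agent, path))

    rows = []
    for visitor_id in sorted(grouped):
        hits = grouped[visitor_id]
        rows.append({
            "visitor_id": visitor_id,
            "hits": len(hits),                                 # total log entries
            "days_active": len({day for day, _, _ in hits}),   # distinct days seen
            "user_agent": hits[0][1],  # assumes one User-Agent per visitor
            "distinct_paths": len({path for _, _, path in hits}),
        })
    return rows


rows = visitors_table(log_entries)
for row in rows:
    print(row)
```

Four log lines become two rows, one per visitor, and each feature is a column you could sort or filter in a spreadsheet.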

How about trying to analyze the content of a blog post to determine what category it falls under? Then it depends on whether you’re trying to categorize each individual blog post or a whole blog (including all of its content as relevant evidence for the category). Suppose you decide you want to classify individual posts. Then you’d start off with some approaches that count the occurrences of particular words, banking on the theory that most tech-related posts will mention token words like “gadget,” “app,” or even “apple” with regularity. If you want to classify an entire blog, you might begin by summarizing the text of the about page in a similar manner, along with summarizing all of the available tags (and their counts) for the blog as a whole.
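
The word-counting approach for individual posts might look something like the Python sketch below. The two sample posts, the tiny vocabulary, and the `word_count_row` helper are invented for illustration; a real classifier would use a much larger vocabulary and smarter tokenization.

```python
import re
from collections import Counter

# The token words mentioned above; a real vocabulary would be far longer.
TOKEN_WORDS = ["gadget", "app", "apple"]

# Hypothetical posts, keyed by a made-up post id.
posts = {
    "post-1": "This new gadget ships with an app. The app is free.",
    "post-2": "We baked an apple pie on Sunday.",
}


def word_count_row(post_id, text, vocabulary=TOKEN_WORDS):
    """Turn one post into one row: a count for each word in the vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())  # crude lowercase tokenizer
    counts = Counter(words)
    row = {"post_id": post_id}
    for word in vocabulary:
        row[word] = counts[word]  # Counter returns 0 for words never seen
    return row


rows = [word_count_row(post_id, text) for post_id, text in posts.items()]
for row in rows:
    print(row)
```

Each post is now a row and each token word is a column. Note that the second post scores on “apple” even though it is about pie, which is why raw counts are only a starting point.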

All in all, one of the most helpful tips for someone just starting out with data is to turn whatever problem is in front of them into a spreadsheet-like model. More often than not, the world is messy and doesn’t give you information in this format, so the main task (and the best next action) is to summarize the data as instances of study that have relevant features.