The Power of Perspective


I used to be a person who would get jealous of others, namely of their technical ability. If I thought that someone I was working with was better at math or programming than I was, it’d drive me to get better at both. I’d pour myself into books on the relevant subjects to try to enhance my ability. I’d work on projects to try to get familiar with advanced techniques. I’d be lying if I said this didn’t help me become a better programmer and analyst, but it definitely increased my stress levels more than it needed to.

I think I was missing the point the entire time. Two things hadn’t occurred to me. First, I hadn’t realized that my own perspective was incredibly valuable. Second, the people I was jealous of weren’t even doing work that I truly wanted to do. Think about that for a moment.

“There will always be people who are better than you at something.” At least that’s what I keep hearing people say about life, work, and career progress. So, if that’s the case, how did your boss get hired? Or the CEO of a public company? Couldn’t the company have just found someone better to do the job? Almost certainly. But I’m betting they got the job not because of raw ability in any particular skill, but because of the perspective they bring to the table. Your viewpoint is an extremely valuable asset. How you think about a situation or problem is more distinctive than you realize, and if your boss isn’t using your perspective to sharpen his or her own view, both of you are losing out.

On the other point, you have to ask yourself whether you’re really doing work that you want to do. Will mastering the skills you’re working on get you to the job that you want? Additionally, I’ve been in situations where a colleague was working on the project that I wanted to work on. The project. Yet every time this happened, it was because I never voiced my interest in working on it. And half the time, the person assigned the work didn’t want to do it nearly as much as I did.

In essence, don’t ever overlook these neglected assets: how you see the world (and your work), and your unique desire to pursue a particular kind of work.


Quiet time to think


Multitasking is a fallacy. Most of the time when we think we’re optimally getting things done by working on multiple tasks or even multiple projects, we’re selling ourselves short.

You’ve all worked with a programmer like this: the person who freaks out or makes a snide remark each time you walk up with a question because you didn’t “use the proper channels. Put it in an email or a ticket, and if you walk up to me again with your problems I’ll…”

Now, aside from handling the situation extremely poorly…this person isn’t completely wrong. Programming requires a lot of mental overhead to solve complex problems, and interruptions can easily double how long a given task takes.

Think about an analogy in physical labor, say painting. If you’ve ever painted the interior of a house, you know that the easiest parts are the long flat walls–you just roll away at them. It’s the edges that are the pain in the ass. You’ll spend an hour edging a room and then just ten minutes rolling the walls around them. It’s a lot of work getting the details right.

And so it is with data analysis. If you start with an unlabeled data file, say a simple CSV, you have to invest time in investigating it, just looking at it. You’re not sure whether there are any relationships in front of you, so you need to plot columns against one another and run correlations. Aha, you think you’ve found something and want to look further, maybe at how the relationships change when you introduce some non-linearity, when…someone comes up to your desk with a question or a request. Of course it doesn’t take the 2 minutes you originally thought it would and turns into 30 minutes of rabbit trails and deeper diving into whatever the problem was. By the time you get back to your real work, you’ve forgotten what you wanted to look into next; you’ve lost the moment of insight you needed to make a breakthrough.
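For context, that first pass is simple enough to sketch in pandas; this is a minimal version, assuming a hypothetical unlabeled file named data.csv:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the unlabeled CSV and take a first look at the raw values.
df = pd.read_csv("data.csv")
print(df.head())
print(df.describe())

# Run correlations between all numeric columns...
print(df.corr(numeric_only=True))

# ...and plot every column against every other to eyeball relationships.
pd.plotting.scatter_matrix(df, figsize=(8, 8))
plt.show()
```

The code is the cheap part; the expensive part is the chain of hunches you build up while staring at its output, and that’s exactly what an interruption destroys.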

Now, the point isn’t necessarily “Don’t help people out.” It’s anything but that. The point is that your time is inherently valuable, and your uninterrupted quiet time alone is even more valuable and worth cherishing. You’ll have to work at getting it back in any way you can.

Try out:

  • Setting “office hours” when people can get help from you. (But be careful to always stay open to emergency situations should they come up.)
  • Checking email less often–email isn’t an instant messaging service, it’s a way to implement deferred communication. You’re not doing anyone any favors by responding to emails within 5 minutes of their hitting your inbox; you’re just setting the expectation that every quick note dashed off to you deserves instant gratification.
  • Learning to say “No” or “I’m busy.” It’s really okay to tell someone that you’re in the middle of something.

Getting up and running with Python virtual environments


Python is a great tool to have available for all sorts of tasks, including data analysis and machine learning. It’s a great language to start with if you’re a beginner, and there are loads of tutorials out there. So, if you’re a neophyte Pythonista, head over to one of those and come back here later.

Additionally, plenty of great developers have been working on tools that just get the job done, including pandas for wrangling your data (and turning it into something that looks like a spreadsheet), as well as Scikit-Learn for running anything from basic statistics to more complex learning algorithms on your data.

I’ve used Python long enough to have made most of the mistakes you can make with it, and the best piece of advice I have for anyone getting started is to use a virtual environment. You see, Python has some built-in tools that let you download and use other people’s code so you can leverage their work in your own analyses. Most of the time, this happens without a problem. But sometimes, say when a developer updates his or her package in a way that breaks how you’re using it, you’ll want to stick with the old version until you can try out the upgrade. Virtual environments provide a sandbox that keeps different versions of Python modules separate so they can’t conflict with one another.

In fact, I’d suggest you do this in almost all contexts.

  • Are you starting a new project and have no idea where it’s heading? Use a virtual environment.
  • Are you setting up a production server to deploy and run your Python code? Use a virtual environment.
  • Are you writing a research paper that analyzes data you’ll eventually publish? Use a virtual environment, then share how it’s set up so other people can reproduce your results.
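If you just want the concept without any third-party tooling, here’s a minimal sketch using the venv module that ships with modern Python 3 (the directory name analysis-env is just an example):

```python
import venv

# Create an isolated environment in ./analysis-env, with pip included
# so you can install packages into the sandbox without touching the
# system-wide Python.
builder = venv.EnvBuilder(with_pip=True)
builder.create("analysis-env")

# Activate it from your shell afterwards:
#   source analysis-env/bin/activate   (Mac OS / Linux)
#   analysis-env\Scripts\activate      (Windows)
# Anything you pip-install while it's active stays inside analysis-env.
```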

So how do you go about managing virtual environments day to day? If you’re using Mac OS or a Linux distribution, one of my favorite tools is pyenv, which works quite seamlessly once you’ve installed some dependencies (like the tools that actually build Python). Here’s the original guide I started off with, and I still use it as a reference if I run into any issues. The thing I really love about pyenv is that it lets you install and manage different Python versions as well.

On Windows, the experience is a bit different, but I think this guide is great for getting started. That article focuses on setting up virtual environments, but the earlier ones in the series should help out with the installation process. Looking around, it seems that Python 3.3 ships with a tool that lets Windows users switch between different Python versions, which is very nice to have. I haven’t checked it out yet, but I look forward to it.

In essence, if you haven’t tried virtual environments yet, get started as soon as you can. It’s worth the time invested in setting one up and understanding a few things under the hood (like how the command-line PATH variable and PYTHONPATH work). All in all, I’ve never regretted setting one up for even the simplest of tasks, and I’ve almost always cursed myself when I didn’t use one.

You probably need a database


When I see organizations using and talking about their data, they love to present the tools they’re using to handle and wrangle it. You’ve probably heard terms like Hadoop, Spark, Shark, PostgreSQL, MySQL, MongoDB, and rarely Excel. (If you haven’t, there’s a good list to look up on Wikipedia.)

I won’t argue that taming data doesn’t take good tools. What I will argue is that the right tools depend on the scale of your data.

I like to think of the following rough categories of data scale:

  • Small data–the dataset fits in RAM (anywhere from 1 MB to 8 GB)
  • Medium data–the dataset fits on a single hard drive (8 GB to 1 TB)
  • Big data–the dataset takes multiple hard drives to store (anything above 1 TB)

Now, I’m a big believer in the Pareto principle, which leads me to believe that of all the tech companies out there, only about 20% (or fewer) need tools suited for big data. Here’s a look at some job-posting counts from Indeed.com that roughly bear out that relationship:

  • Spark – 8,701
  • Hadoop – 13,723
  • Oracle Database – 27,177
  • MySQL – 21,770
  • PostgreSQL – 4,285
  • Microsoft Access – 67,538

So what does that mean for the tools you adopt? First, as soon as your data gets too big for Excel/Python/PHP/R/memory, it doesn’t mean it’s time to adopt Hadoop and hire a team to set it up. It means you should look into something like a relational database to interact with and investigate your data. Ideally you’re already thinking about how to transform your data into something like a spreadsheet, which makes an RDBMS a natural fit.
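As a sketch of how low the barrier is, here’s what that first step might look like with SQLite, the relational database that ships with Python’s standard library; the file names and the user_id column are hypothetical:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("events.db")

# Stream the CSV into the database in chunks so the full dataset
# never has to fit into RAM at once.
for chunk in pd.read_csv("events.csv", chunksize=100_000):
    chunk.to_sql("events", conn, if_exists="append", index=False)

# Let the database do the heavy lifting; only the summary comes back.
summary = pd.read_sql_query(
    "SELECT user_id, COUNT(*) AS visits FROM events GROUP BY user_id",
    conn,
)
print(summary.head())
```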

Of the four databases I listed above, two (MySQL and PostgreSQL) are free, so the only costs you’d incur are the machines to host them and the time to set them up. The other big advantage is that someone on your team or in your company likely already knows how to start using one right now.

All that said, there’s definitely a place for tools like Hadoop, but it’ll be very specific to your implementation and how your dataset is growing.

Getting started with tabular data


I look at data every day. If I had to go back to a past version of myself to give him advice, I’d offer this: make it a rule to fit your data into a box.

There are plenty of mathematical techniques out there for analyzing data but to effectively apply them to your particular data, your data needs to fit the following format:

  • Data consists of rows and columns
  • Your data should be viewable using any common spreadsheet application
  • Each row represents an instance of data (in other words, each row represents one object under study be it a person, a spammy email, or a photograph with a face to be recognized)
  • Each column represents a feature, or something we can use to describe the instance (this could be a person’s height, the number of occurrences of the word “FREE” in a spammy email, or the length of a detected edge in a picture of a face)

When you encounter some new data, strive to fit it into that framework. Do you have a pile of server logs you’re analyzing? Figure out the instances you’re studying (likely unique visitors), then tie each log entry to a particular visitor (to be represented by a row) and describe that visitor via features (like how many times they visited in a week, what User-Agent they use, which pages they’ve clicked on).
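Here’s a minimal sketch of that reshaping with pandas; the log file name and column names (visitor_id, timestamp, user_agent, path) are made up, so adapt them to whatever your logs actually contain:

```python
import pandas as pd

logs = pd.read_csv("access_log.csv", parse_dates=["timestamp"])

# Collapse many log entries into one row per visitor (the instance),
# with one column per feature describing that visitor.
visitors = logs.groupby("visitor_id").agg(
    visits=("timestamp", "count"),
    unique_pages=("path", "nunique"),
    user_agent=("user_agent", "first"),
)
print(visitors.head())
```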

How about trying to analyze the content of a blog post to determine what category it falls under? It depends on whether you’re trying to categorize each individual post or a whole blog (including all of its content as relevant evidence for the category). Suppose you decide to classify individual posts. Then you’d start off with approaches that count occurrences of particular words, banking on the theory that most tech-related posts mention token words like “gadget,” “app,” or even “apple” with regularity. If you want to classify an entire blog, you might begin by summarizing the text of the About page in a similar manner, along with summarizing all of the available tags (and their counts) for the blog as a whole.
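If you went the individual-post route, a sketch of those word counts might look like this with scikit-learn’s CountVectorizer; the sample posts and the three token words are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "The new gadget syncs with the app on all your apple devices",
    "My sourdough starter finally doubled in size overnight",
]

# Each row is a post (an instance); each column counts one token word
# (a feature) that we bet signals a tech-related post.
vectorizer = CountVectorizer(vocabulary=["gadget", "app", "apple"])
counts = vectorizer.fit_transform(posts)
print(counts.toarray())  # [[1 1 1]
                         #  [0 0 0]]
```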

So, all in all, one of the most helpful tips for anyone just starting out with data is to turn whatever problem is in front of them into a spreadsheet-like model. More often than not, the world is messy and doesn’t hand you information in this format, so the main task (and the best next action) is to summarize the data as instances of study described by relevant features.