Nudging and Data Science

I’ve recently been reading a great book on how people make decisions and what organizations can do to help folks make better choices. That book is Nudge.

What is a nudge?

The authors describe a nudge as anything that can influence the way we make decisions. Take the serial-position effect, for instance: the idea that order matters in a series of items, and that we’re more likely to recall the first or last option in a list simply because of its position. This would be a nudge if you later chose the first movie from a list a friend had recommended mostly because it was the first one to come to mind in the store.

The fact that humans have these biases is an indicator that we don’t always act rationally. In cases where we haven’t had enough experience to learn from our decisions, we need a bit of help finding the most appropriate option for our needs. Most people decide what type of health care plan they need, or at what rate to contribute to their retirement plans, only a few times in their lives, so there isn’t much opportunity to learn.

All in all, the book is a great read, and much of it is an explanation of how proper nudges have excellent applications in areas like health care and making financial decisions.

How does Data Science fit in?

So, why bring in Data Science? Well, lately companies have been looking to the fields of Machine Learning and Statistics to determine how to make better business decisions, and those same methods can play an important role in defining the right nudges to use.

The authors emphasize that proper nudges should a) offer a default option that is stacked in favor of most people and b) make it easy to stray from the default as needed.

When I think about those two, a few things come to mind. In Machine Learning, a mathematical optimization takes data about outcomes and selects the best set of choices. And recommender systems are designed to, when given a few hints, offer up suggestions of similar items.

In the case of deciding on the most favorable default option, that decision should be based on the available data. The authors discuss health care and Medicare Part D, and the fact that the government randomly assigned plans, thereby leaving most people in a sub-optimal situation. Given the available data, one approach would have been to survey citizens about their prescription needs and then select a default plan from the available options in a way that minimizes some quantity, such as the median cost per participant.
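
To make that concrete, here’s a sketch in Python of picking a default plan by minimizing median cost. The plan names and cost figures are entirely invented for illustration.

```python
from statistics import median

# Hypothetical survey results: estimated annual out-of-pocket cost for each
# surveyed citizen under each candidate plan (all names and numbers invented).
costs_by_plan = {
    "Plan A": [1200, 300, 950, 400],
    "Plan B": [800, 700, 600, 1500],
    "Plan C": [500, 450, 2000, 350],
}

# Pick as the default the plan that minimizes the median cost per participant.
default_plan = min(costs_by_plan, key=lambda plan: median(costs_by_plan[plan]))
```

Swapping `median` for a mean or a high percentile changes which trade-off the default optimizes, and that choice itself is something the data should inform.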

Additionally, the authors describe a tool for Medicare Part D that let someone input their prescriptions and then suggested a plan to choose. One of the difficulties with this system was that it rarely gave the same answer, even with the same inputs, because the plans changed over time. This gave people a false sense of which plan was right for them. A better approach would have been to recommend a handful of appropriate plans by matching the drug information against the available plans. When presented with hundreds of options, people have a difficult time making a choice that will work, but if those hundreds could be winnowed down to the three to five most appropriate ones, people would have an easier time weighing the pros and cons.
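
A winnowing step like that could look like the sketch below, where plans are ranked by how many of a person’s drugs they cover and then by premium; the plan data and drug names are hypothetical.

```python
# Hypothetical plan data: the drugs each plan covers and its monthly premium.
plans = {
    "Plan A": ({"metformin", "lisinopril"}, 40),
    "Plan B": ({"metformin"}, 25),
    "Plan C": ({"metformin", "lisinopril", "atorvastatin"}, 60),
    "Plan D": (set(), 10),
}

def shortlist(plans, drugs, k=3):
    """Return the k plans covering the most of `drugs`, cheapest first on ties."""
    ranked = sorted(
        plans,
        key=lambda name: (-len(plans[name][0] & drugs), plans[name][1]),
    )
    return ranked[:k]

top = shortlist(plans, {"metformin", "lisinopril"})
```

Handing someone `top` to weigh by hand is a very different experience from handing them the full menu of plans.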

Obviously, there is still plenty of constructive work to be done in supporting any nudge. And I believe that the tools that Data Scientists use day-to-day are valuable to keep in mind in these efforts.

The Power of Perspective

I used to be a person who would get jealous of others, namely of their technical ability. If I thought that the person I was working with was better at math or programming than me, it drove me to get better at both. I’d pour myself into books on the relevant subjects to try to enhance my ability. I’d work on projects to get familiar with advanced techniques. I’d be lying if I said this didn’t help me become a better programmer or analyst, but it definitely increased my stress levels more than it needed to.

I think I was missing the point the entire time. I lost sight of two factors that hadn’t occurred to me. First, I hadn’t realized that my own perspective was incredibly valuable. Second, I had forgotten that the people I was jealous of weren’t even doing work that I truly wanted to do. Think about that for a moment.

“There will always be people who are better than you at something.” At least that’s what I keep hearing when it comes to life, work, and career progress. So, if that’s the case, then how did your boss get hired? Or the CEO of a public company? Couldn’t they have just found someone better to do the job? Almost certainly, but I’m betting they weren’t hired for raw ability in any particular skill; it’s the perspective they bring to the table. Your viewpoint is an extremely valuable asset. How you think about a situation or problem is more distinctive than you realize, and if your boss isn’t using your perspective to enhance his or her own view, both of you are losing out.

On the other point, you have to ask yourself if you’re really doing work that you want to do. Will mastering the skills you’re working on get you to the job that you want? Additionally, I’ve been in situations where a colleague was working on the project that I wanted to work on. The project. Every time this happened, it was because I never voiced my interest in working on it. And half the time, the person assigned the work didn’t want to do it nearly as much as I did.

In essence, don’t ever overlook your most neglected assets, especially how you see the world (and your work) and your unique desire to pursue a particular kind of work.

Quiet time to think

Multitasking is a fallacy. Most of the time when we think we’re optimally getting things done by working on multiple tasks or even multiple projects, we’re selling ourselves short.

You’ve all worked with a programmer like this: the person who freaks out or makes a snide remark each time you walk up with a question and didn’t “use the proper channels. Put it in an email or a ticket, and if you walk up to me again with your problems I’ll…”

Now, aside from handling the situation extremely poorly…this person isn’t completely wrong. Programming requires a lot of mental overhead to solve complex problems, and interruptions can easily double how long a given task takes.

Think about an analogy in physical labor, say painting. If you’ve ever painted the interior of a house, you know that the easiest parts are the long flat walls–you just roll away at them. It’s the edges that are the pain in the ass. You’ll spend an hour edging a room and then only ten minutes rolling the walls around it. It’s a lot of work getting the details right.

And so it is with data analysis. If you start with an unlabeled data file, say a simple CSV, you have to invest time in investigating it and just looking at it. You’re not sure whether there are any relationships in front of you, so you plot columns against one another and run correlations. Aha, you think you’ve found something and want to look further, maybe at how the relationships change when you introduce some non-linearity, when…someone comes up to your desk with a question or a request. Of course this doesn’t take the two minutes you originally thought it would; it turns into thirty minutes of rabbit trails and deeper diving into whatever the problem was. By the time you get back to your real work, you’ve forgotten what you wanted to look into next, and you’ve lost that moment of insight you needed to make a breakthrough.

Now, the point isn’t “Don’t help people out.” It’s anything but that. The point is that your time is inherently valuable, and your uninterrupted quiet time alone is even more valuable and worth cherishing. You’ll have to work at getting it back in any way you can.

Try out:

  • Setting “office hours” where people can get help from you. (But be careful to always be open to emergency situations should they come up.)
  • Checking email less often–email isn’t an instant messaging service; it’s a way to implement deferred communication. You’re not doing anyone any favors by responding to emails within five minutes of their hitting your inbox; you’re just setting expectations that others will get instant gratification from every quick note they jot off to you.
  • Learning to say “No” or “I’m busy.” It’s really okay to tell someone that you’re in the middle of something.

Getting up and running with Python virtual environments

Python is a great tool to have available for all sorts of tasks, including data analysis or machine learning. It’s a great language to start off with if you’re a beginner, and there are loads of tutorials out there.  So, if you’re a neophyte Pythonista, head over there and come back here later.

Additionally, plenty of great developers have been working on tools that just get the job done, including pandas for wrangling your data (and turning it into something that looks like a spreadsheet), as well as Scikit-Learn for running anything from basic statistics to more complex learning algorithms on your data.

I’ve used Python for long enough to have made a lot of the mistakes there are with it, but the best piece of advice I have for anyone getting started is to use a virtual environment. You see, Python has some built-in tools that let you download and use other people’s code so you can leverage their work in your own analyses. Most of the time, this happens without a problem. But sometimes, say when a developer changes and updates his or her package in a way that breaks the way you’re using it, you’ll want to stick with the old version until you can try out the upgrade. Virtual environments provide a sandbox that allows you to keep different versions of Python modules separate so they can’t conflict with one another.
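
If you want to see the mechanics without leaving Python, the standard-library `venv` module can build one programmatically; this is just a sketch of what the usual `python -m venv` command does under the hood (the directory name is arbitrary).

```python
import os
import subprocess
import tempfile
import venv

# Create a throwaway virtual environment (the location is arbitrary).
env_dir = os.path.join(tempfile.mkdtemp(), "demo-env")
venv.create(env_dir, with_pip=False)

# The environment gets its own interpreter whose sys.prefix points into the
# sandbox, so packages installed there can't conflict with other projects.
# (On Windows the interpreter lives under Scripts\ instead of bin/.)
env_python = os.path.join(env_dir, "bin", "python")
prefix = subprocess.check_output(
    [env_python, "-c", "import sys; print(sys.prefix)"], text=True
).strip()
```

Day to day you’d just run `python -m venv demo-env` and `source demo-env/bin/activate` in a shell rather than scripting it like this.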

In fact, I’d actually suggest you do this in almost all contexts.

  • Are you starting a new project and have no idea where it’s heading? Use a virtual environment.
  • Are you setting up a production server so you can deploy and run your Python code? Use a virtual environment.
  • Are you writing a research paper that analyzes some data that you’ll eventually publish? Use a virtual environment, and then share how it’s set up with other people so they can reproduce your results.

So how do you go about using a virtual environment? If you’re using Mac OS or a Linux distribution, one of my favorite tools is pyenv, which works quite seamlessly after you’ve installed some dependencies (like the tools that actually build Python). Here’s the original guide I started off with, and I still use it as a reference if I run into any issues. The thing I really love about pyenv is that it lets you install and manage different Python versions as well.

On Windows, the experience is a bit different, but I think this guide is great for getting started. That article focuses on setting up virtual environments, but the earlier ones should help out with the installation process. Looking around, it seems that Python 3.3 added a tool that allows Windows users to switch between different Python versions, which is very nice to have. I haven’t checked it out yet, but I look forward to it.

In essence, if you haven’t tried out virtual environments yet, get started as soon as you can. It’s worth the time invested in getting one set up and understanding a few things under the hood (like how the command-line PATH variable and PYTHONPATH work). All in all, I’ve never regretted setting one up for even the simplest of tasks and have almost always cursed myself when I didn’t use one.

You probably need a database

When I see organizations using and talking about their data, they love to present the tools they’re using to handle and wrangle it. You’ve probably heard terms like Hadoop, Spark, Shark, PostgreSQL, MySQL, MongoDB, and rarely Excel. (If you haven’t, there’s a good list to look up on Wikipedia.)

I won’t argue that taming data doesn’t take good tools, but what I will argue is that the tools you use depend on the scale of your data.

I like to think of the following rough categories of data scale:

  • Small data–dataset fits in RAM (anywhere from 1 MB to 8 GB)
  • Medium data–dataset fits on a single hard drive (8 GB to 1 TB)
  • Big data–dataset takes multiple hard drives to store (anything above 1 TB)
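
Those cutoffs are rough and machine-dependent, but they’re easy to encode; this little helper just restates the list above.

```python
GB = 1024 ** 3
TB = 1024 ** 4

def data_scale(size_bytes):
    """Rough tooling category for a dataset of the given size in bytes."""
    if size_bytes <= 8 * GB:
        return "small"   # fits in RAM
    if size_bytes <= 1 * TB:
        return "medium"  # fits on a single hard drive
    return "big"         # takes multiple hard drives
```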

Now, I’m a big believer in the Pareto principle, which leads me to believe that of all the tech companies out there, only about 20% (or fewer) need the tools suited for big data. Here’s a look at some job-posting counts from Indeed.com that roughly confirm that relationship:

  • Spark – 8,701
  • Hadoop – 13,723
  • Oracle Database – 27,177
  • MySQL – 21,770
  • PostgreSQL – 4,285
  • Microsoft Access – 67,538

So what does that mean for the tools you adopt? First, just because your data is too big for Excel/Python/PHP/R/memory doesn’t mean it’s time to adopt Hadoop and hire a team to set it up. It means you should look into using something like a relational database to interact with and investigate your data. Ideally you’re thinking about how to transform your data into something like a spreadsheet anyway, which makes an RDBMS a natural fit.
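
For small and medium data, even the SQLite engine that ships with Python will take you surprisingly far. A minimal sketch of loading rows and investigating them with plain SQL (the table and values are made up):

```python
import sqlite3

# An in-memory database; point connect() at a file path for persistent data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT, seconds REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [("alice", "/home", 12.0), ("alice", "/docs", 30.5), ("bob", "/home", 3.2)],
)

# Ask questions in SQL instead of standing up a cluster.
per_visitor = conn.execute(
    "SELECT visitor, COUNT(*), SUM(seconds) FROM visits "
    "GROUP BY visitor ORDER BY visitor"
).fetchall()
```

The same `GROUP BY` habit transfers directly to MySQL or PostgreSQL once a single file stops being enough.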

Of the databases I listed above, two (MySQL and PostgreSQL) are free, so the only costs you’d incur would be the machines to host one and the time spent setting it up. The other main reason is that it’s likely someone on your team or in your company already knows how to start using one now.

All that said, there’s definitely a place for tools like Hadoop, but it’ll be very specific to your implementation and how your dataset is growing.

Getting started with tabular data

I look at data every day. If I had to go back to a past version of myself to give him advice, I’d offer this: make it a rule to fit your data into a box.

There are plenty of mathematical techniques out there for analyzing data but to effectively apply them to your particular data, your data needs to fit the following format:

  • Data consists of rows and columns
  • Your data should be viewable using any common spreadsheet application
  • Each row represents an instance of data (in other words, each row represents one object under study, be it a person, a spammy email, or a photograph with a face to be recognized)
  • Each column represents a feature, something we can use to describe the instance (this could be a person’s height, the number of occurrences of the word “FREE” in a spammy email, or the length of a detected edge in a picture of a face)

When you encounter some new data, it’s best to strive to fit it into that framework. Do you have a pile of server logs that you’re analyzing? Figure out the instances you’re studying (likely unique visitors), then tie each log entry to a particular visitor (to be represented by a row) and describe that visitor via features (like how many times they visited in a week, what User-Agent they use, which pages they’ve clicked on).
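
Here’s a sketch of that log-to-table transformation with made-up log entries: each visitor becomes one row, and each derived quantity becomes one column.

```python
from collections import defaultdict

# Hypothetical pre-parsed log entries: (visitor_id, user_agent, url).
log = [
    ("v1", "Firefox", "/pricing"),
    ("v1", "Firefox", "/docs"),
    ("v2", "Chrome", "/pricing"),
    ("v1", "Firefox", "/pricing"),
]

# One row per visitor (the instance), one key per feature (the columns).
table = defaultdict(lambda: {"visits": 0, "user_agent": None, "pricing_views": 0})
for visitor, agent, url in log:
    row = table[visitor]
    row["visits"] += 1
    row["user_agent"] = agent
    row["pricing_views"] += url == "/pricing"
```

The result drops straight into a CSV, a database table, or a pandas DataFrame, which is the whole point of the rows-and-columns rule.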

How about trying to analyze the content of a blog post to determine what category it falls under? That depends on whether you’re trying to categorize each particular blog post or a whole blog (including all of its content as relevant evidence for the category). Suppose you decide to classify individual posts. Then you’d start off with approaches that count the occurrences of particular words, banking on the theory that most tech-related posts will mention token words like “gadget,” “app,” or even “apple” with regularity. If you want to classify an entire blog, you might begin by summarizing the text of the about page in a similar manner, along with summarizing all of the available tags (and their counts) for the blog as a whole.
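
That word-counting idea is just a crude bag-of-words; a minimal version, with an invented token list and example post:

```python
# Hypothetical token words that hint a post is tech-related.
TECH_TOKENS = {"gadget", "app", "apple"}

def tech_score(text):
    """Count how often the token words appear in a post."""
    words = text.lower().split()
    return sum(words.count(token) for token in TECH_TOKENS)

post = "The new app makes your gadget feel like an app from apple"
```

A real classifier would compare such counts against a per-category baseline (naive Bayes is the classic next step), but the shape of the data is the same: one row per post, one column per word count.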

So, all in all, one of the most helpful tips for someone just starting out in anything related to data is to turn whatever problem they have in front of them into a spreadsheet-like model. More often than not, the world is messy and doesn’t give you information in this format, so the main task (and the best next action) is to summarize the data as instances of study that have relevant features.