Four useful books for learning Data Science

I was listening to an old episode of Partially Derivative, a podcast on data science and the news. One of the hosts mentioned that we’re now living in the “golden age of data science instruction” and learning materials. I couldn’t agree more. Seemingly every month, publishers release another book on the subject, and people are writing exciting blog posts about what they’re learning and doing. I wanted to outline a few of the books that helped me along the way, in the order I approached them. Hopefully, you can use them to gain a broader perspective of the field, and perhaps as a resource to pass on to others trying to learn.

Learning from Data

I first found Learning from Data through Caltech’s course on the subject. I still think it’s an excellent text but I’m not sure if I would recommend it to the absolute beginner. (To someone who is just coming to the subject, I would probably recommend the next choice down on the list.)

However, I have a Master’s degree in mathematics so I was familiar with the background material in linear algebra and probability as well as the notation used. Learning from Data taught me that there was actual mathematical theory behind a lot of the algorithms employed in data science.

Most algorithms are chosen for their pragmatic application, but they also have features in and of themselves (such as how they bound the space of possible hypotheses about the data) that can help determine their effectiveness on data. There’s also a general theory for how to approach the analysis of these algorithms. At the time of reading, a lot of it was still a bit over my head, but it got me incredibly curious about the field itself.

Now, understanding a few things about the theory is great, but most of the time, people want to know what it can actually do.

Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management

I’ll admit to only having had a cursory understanding of what was possible before I read Data Mining Techniques. I knew that the most widely used algorithms were used for assessing risk, like credit scores. However, I didn’t know much about how you could make gains in the world of marketing using data science techniques.

I appreciated that the authors have a lot of experience in the field, especially experience that predates most of the growth in big data these days. This book makes it clear that many of the most useful algorithms have been around and in use for decades. The authors also offer some explanations from the direct marketing case (print magazines and physical mail) that I hadn’t considered, such as ranking algorithms, which were originally used to prioritize a list of people to contact because of the high costs of mailing paper to people.

More than anything, I liked the breadth of the topics, since they cover just about every form of marketing algorithm and do a great job of giving you a high level view of why they matter.

You won’t walk away from this book knowing how to implement everything the authors talk about, but you will get a sense for which algorithms are suited to particular tasks.

This book gave me a better way to think through the initial phases of a project, but I still needed some help in learning how to communicate about data and how to fit it directly into the business context.

Data Science for Business

I read through this one while I was on vacation (yes, I know, I’m that type of geek). That didn’t stop me from soaking up a lot of information from it about how data science applies to a company trying to use these models. Most of the book is focused on helping you think through how to operationalize the process of running and managing a data science project and what outcomes you might expect from the effort.

Beyond that, I think it taught me how to communicate better about data at a company. Being able to talk about the many months it will take to bring a project to fruition, and to weigh it against alternatives, is the bread and butter of working at a company that wants to make money. Moreover, if you believe that a particular project is the right choice, you need to be able to back up that choice by communicating its benefits.

I want to say that this is a very “bottom-line” type of book, but that’s a perspective worth hearing some of the time. Data science doesn’t always have to be about the hottest technique or the biggest technology if your priorities include keeping your costs below your revenue. However, I still didn’t learn much about getting my hands dirty with data on a day-to-day basis. For that, I had to rely on the final book I present.

Applied Predictive Modeling

This is a book on predictive modeling in R and on using a package the author developed for that purpose. This isn’t simply someone tooting their own horn, because caret is a quality piece of software. Overall, I think that even if you don’t end up using R as your go-to tool for analyzing data, you’ll still learn a lot from this book. It thoroughly demonstrates the power caret can offer you in a project, to the point that you’ll seek the same functionality in your tool of choice (or, hopefully, build its equivalent).

Caret is a package that offers a consistent interface for just about any predictive task (classification or regression) that you could ask for. One issue some people have with R packages is that the interfaces for algorithms aren’t very consistent. Learning how to use one package won’t always carry over to a completely different package. Caret addresses that by giving you the same way to set up a modeling task for many different algorithms. It also automates several tasks, like:

  • Data splitting into training and test sets
  • Data transformations like normalization or power transforms
  • Model tuning and parameter selection

Essentially, it makes working in R a lot like using Scikit Learn (an excellent library itself) but with many more options and model implementations.
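To make the comparison concrete, here’s a minimal sketch of those same three automated tasks on the scikit-learn side, using a synthetic dataset (the particular model and parameter grid are just illustrative choices, not anything from the book):

```python
# Sketch of the three tasks caret automates, in scikit-learn:
# data splitting, transformation, and parameter tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 2. Normalize features as one step of a pipeline
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())])

# 3. Tune the regularization parameter with cross-validation
search = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```

The pipeline pattern is the key similarity: one consistent way to set up splitting, preprocessing, and tuning, regardless of which algorithm sits at the end.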

So that’s all you need, right? Just read a couple of books and you’re on your way? Not quite. You’ll actually have to apply some of this and learn from it. Perhaps next time you’re in a meeting discussing priorities for your company, you will need to frame the conversation about your next data project and direct the data effort toward your business goals (Data Science for Business). When you’re brainstorming things you could try to predict and use in a marketing campaign, you will need to outline the possible techniques and what they could offer you (Data Mining Techniques). If you’re evaluating candidate algorithms for their ability to perform the task accurately, you will need to gauge their effectiveness from a theoretical (Learning from Data) and practical (Applied Predictive Modeling) standpoint.

I hope this helps you apply data science at work and gives you perspective in the field. Also, if you’re not a follower on Twitter, please follow me @mathcass.


Nudging and Data Science


I’ve recently been reading a great book on how people make decisions and what organizations can do to help folks make better choices. That book is Nudge.

What is a nudge?

The authors describe a nudge as anything that can influence the way we make decisions. Take the primacy effect, for instance: the idea that order matters in a series of items. We’re more likely to recall the first or last option in a list simply because of its position. It would be a nudge if you later chose the first movie from a list a friend had recommended, mostly because it was the first one to come to mind in the store.

The fact that humans have these biases is an indicator that we don’t always act rationally. In cases where we haven’t had enough experience to learn from our decisions, we need a bit of help finding the most appropriate option for our needs. Most people decide what type of health care plan they need, or at what rate to contribute to their retirement plans, only a few times in their lives, so there isn’t much opportunity to learn at all.

All in all, the book is a great read, and much of it is an explanation of how proper nudges have excellent applications in areas like health care and making financial decisions.

How does Data Science fit in?

So, why bring in Data Science? Well, lately companies have been looking to the fields of Machine Learning and Statistics to determine how to make better business decisions, and these methods can play an important role in helping define the right nudges to use.

The authors emphasize that proper nudges should a) offer a default option that is stacked in favor of most people and b) make it easy to stray from the default as needed.

When I think about those two, a few things come to mind. In Machine Learning, mathematical optimization takes data about outcomes and selects the best set of choices. And recommender systems are designed, given a few hints, to offer up suggestions of similar items.

In the case of deciding on the most favorable default option, that decision should be made based on the available data. The authors talk about health care and Medicare Part D, and the fact that the government randomly assigned plans, thereby leaving most people in a sub-optimal situation. Given the available data, one approach would have been to survey citizens about their prescription needs and then select a default plan in a way that minimizes some variable, such as the median cost per participant.

Additionally, the authors describe a tool for Medicare Part D that let people input their prescriptions and suggested a plan to choose. One of the difficulties with this system was that it rarely gave the same answer, even with the same inputs, because the plans would change over time. This gave people a false sense of which plan was good for them. A better approach would have been to recommend appropriate plans by taking the drug information and matching it to available plans. When presented with hundreds of options, people have a difficult time making a choice that will work, but if those hundreds could be winnowed down to the three to five most appropriate ones, people would have an easier time weighing the pros and cons.
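As a toy illustration of that winnowing idea, here’s a sketch that ranks hypothetical plans by how well they cover a person’s prescriptions; every plan name, drug, and premium below is invented for illustration:

```python
# Rank invented drug plans by prescription coverage, breaking ties
# by lower monthly premium, and keep only the top few.
def top_plans(prescriptions, plans, k=3):
    def score(plan):
        covered = len(prescriptions & plan["covers"])
        return (-covered, plan["premium"])  # more coverage, then cheaper
    return sorted(plans, key=score)[:k]

plans = [
    {"name": "Plan A", "covers": {"statin", "insulin"}, "premium": 40},
    {"name": "Plan B", "covers": {"statin"}, "premium": 25},
    {"name": "Plan C", "covers": {"statin", "insulin", "beta-blocker"},
     "premium": 55},
]

# Winnow the full list down to the two most appropriate plans
best = top_plans({"statin", "insulin"}, plans, k=2)
print([p["name"] for p in best])  # → ['Plan A', 'Plan C']
```

A real system would score on actual out-of-pocket cost rather than a simple coverage count, but the shape is the same: turn many options into a short, ranked list a person can actually weigh.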

Obviously, there is still plenty of constructive work to be done in supporting any nudge. And I believe that the tools that Data Scientists use day-to-day are valuable to keep in mind in these efforts.

The Power of Perspective


I used to be a person who would get jealous of others, namely of their technical ability. If I thought that the person I was working with was better at math or programming than me, it would drive me to get better at both. I’d pour myself into books on the relevant subjects to try to enhance my ability. I’d work on projects to try to get familiar with advanced techniques. I’d be lying if I said this didn’t help me become a better programmer or analyst, but it definitely increased my stress levels more than necessary.

I think I was missing the point the entire time. I lost sight of two factors that hadn’t occurred to me. First, I hadn’t realized that my own perspective was incredibly valuable. Second, I had forgotten that the people I was jealous of weren’t even doing work that I truly wanted to do. Think about that for a moment.

“There will always be people who are better than you at something.” At least that’s what I keep hearing people say when it comes to life, work, and career progress. So, if that’s the case, then how did your boss get hired? Or the CEO of a public company? Couldn’t they have just found someone better to do the job? I’m almost sure of it, but I’m betting it’s not because of raw ability in any particular skill; rather, it’s because of the perspective they bring to the table. Your viewpoint is an extremely valuable asset. How you think about a situation or problem is more distinctive than you realize, and if your boss isn’t using your perspective to enhance his or her own view, both of you are losing out.

On the other point, you have to ask yourself if you’re really doing work that you want to do. Will mastering the skills you’re working on get you to the job that you want? Additionally, I’ve been in situations where a colleague was working on the project that I wanted to work on. The project. However, every time this has happened, it’s because I never voiced my interest in working on it. And half the time, the person assigned the work didn’t want to do it nearly as much as I did.

In essence, don’t overlook these neglected assets, especially when it comes to sharing how you see the world (and your work) and your unique desire to pursue a particular kind of work.

Quiet time to think


Multitasking is a fallacy. Most of the time, when we think we’re optimally getting things done by working on multiple tasks or even multiple projects, we’re selling ourselves short.

You’ve all worked with a programmer like this: the person who freaks out or makes a snide remark each time you walk up with a question and didn’t “use the proper channels. Put it in an email or a ticket, and if you walk up to me again with your problems I’ll…”

Now, aside from handling the situation extremely poorly…this person isn’t completely wrong. Programming requires a lot of mental overhead to solve complex problems, and interruptions can double how long a given task takes.

Think about an analogy in physical labor, say painting. If you’ve ever painted the interior of a house, you know that the easiest parts are the long flat walls–you just roll away at them. It’s the edges that are the pain in the ass. You’ll spend an hour edging a room and then only ten minutes rolling the walls. It’s a lot of work getting the details right.

And so it is with data analysis. If you start with an unlabeled data file, say a simple CSV, you have to invest time in investigating it and just looking at it. You’re not sure whether there are any relationships in front of you, so you plot columns against one another and run correlations. Aha, you think you’ve found something and want to look further, maybe at how the relationships change when you introduce some non-linearity, when…someone comes up to your desk with a question or a request. Of course, this doesn’t take the two minutes you originally thought it would; it turns into 30 minutes of rabbit trails and deeper diving into whatever the problem was. By the time you get back to your real work, you’ve forgotten what you wanted to look into next; you’ve lost that moment of insight you needed to make a breakthrough.

Now, the point isn’t “Don’t help people out.” It’s anything but that. The point is that your time is inherently valuable, and your uninterrupted quiet time is even more valuable and worth cherishing. You’ll have to work at getting it back in any way you can.

Try out:

  • Setting “office hours” when people can get help from you. (But be careful to always be open to emergency situations should they come up.)
  • Checking email less often–email isn’t an instant messaging service; it’s a way to implement deferred communication. You’re not doing anyone any favors by responding to emails within 5 minutes of them hitting your inbox; you’re setting the expectation that others can get instant gratification from quick notes jotted off to tell you something.
  • Learning to say “No” or “I’m busy.” It’s really okay to tell someone that you’re in the middle of something.

Getting up and running with Python virtual environments


Python is a great tool to have available for all sorts of tasks, including data analysis or machine learning. It’s a great language to start off with if you’re a beginner, and there are loads of tutorials out there.  So, if you’re a neophyte Pythonista, head over there and come back here later.

Additionally, plenty of great developers have been working on tools that just get the job done, including pandas for wrangling your data (and turning it into something that looks like a spreadsheet), as well as Scikit-Learn for running anything from basic statistics to more complex learning algorithms on your data.

I’ve used Python long enough to have made most of the mistakes you can make with it, and the best piece of advice I have for anyone getting started is to use a virtual environment. You see, Python has some built-in tools that let you download and use other people’s code so you can leverage their work in your own analyses. Most of the time, this happens without a problem. But sometimes, say when a developer updates a package in a way that breaks how you’re using it, you’ll want to stick with the old version until you can try out the upgrade. Virtual environments provide a sandbox that keeps different versions of Python modules separate so they can’t conflict with one another.

In fact, I’d actually suggest you do this in almost all contexts.

  • Are you starting a new project and have no idea where it’s heading? Use a virtual environment.
  • Are you setting up a production server so you can deploy and run your Python code? Use a virtual environment.
  • Are you writing a research paper that analyzes some data that you’ll eventually publish? Use a virtual environment and then share how it’s set up with other people so they can reproduce your results
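As a quick taste of what a virtual environment is, Python 3.3+ ships a venv module in the standard library. Most people invoke it from the shell as `python -m venv env`, but the same thing can be driven from Python itself:

```python
# Create a virtual environment with the stdlib venv module.
# This is the programmatic equivalent of `python -m venv <dir>`.
import os
import tempfile
import venv

target = os.path.join(tempfile.mkdtemp(), "env")
venv.create(target, with_pip=False)  # with_pip=True also bootstraps pip

# The environment gets its own interpreter and site-packages
# directory, so packages installed there can't conflict with
# other projects on the same machine.
print(os.path.isdir(target))
```

After activating the environment (via its `activate` script), `pip install` puts packages inside it rather than system-wide, which is the whole point of the sandbox.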

So how do you go about using a virtual environment? If you’re using Mac OS or a Linux distribution, one of my favorite tools is pyenv, which works quite seamlessly after you’ve installed some dependencies (like the tools that actually build Python). Here’s the original guide I started off with, and I still use it as a reference if I run into any issues. The thing I really love about pyenv is that it lets you install and manage different Python versions as well.

On Windows, the experience is a bit different, but I think this guide is great for getting started. That article focuses on setting up virtual environments, but the earlier ones should help with the installation process. Looking around, it seems that Python 3.3 added a tool that lets Windows users switch between different Python versions, which is very nice to have. I haven’t checked it out yet, but I look forward to it.

In essence, if you haven’t tried virtual environments yet, get started as soon as you can. It’s worth the time invested in setting one up and understanding a few things under the hood (like how the command-line PATH variable and PYTHONPATH work). All in all, I’ve never regretted setting one up for even the simplest of tasks, and I’ve almost always cursed myself when I didn’t use one.

You probably need a database


When I see organizations using and talking about their data, they love to present the tools they’re using to handle and wrangle it. You’ve probably heard terms like Hadoop, Spark, Shark, PostgreSQL, MySQL, MongoDB, and rarely Excel. (If you haven’t, there’s a good list to look up on Wikipedia.)

I won’t argue that taming data doesn’t take good tools, but what I will argue is that the tools you use depend on the scale of your data.

I like to think of the following rough categories of data scale:

  • Small data–dataset fits in RAM (anywhere from 1 MB to 8 GB)
  • Medium data–dataset fits on a single hard drive (8 GB to 1 TB)
  • Big data–dataset takes multiple hard drives to store (anything above 1 TB)

Now, I’m a big believer in the Pareto principle, which leads me to believe that of all the tech companies out there, only about 20% (or fewer) need tools suited for big data. Here’s a look at some counts that roughly confirm that relationship:

  • Spark – 8,701
  • Hadoop – 13,723
  • Oracle Database – 27,177
  • MySQL – 21,770
  • PostgreSQL – 4,285
  • Microsoft Access – 67,538

So what does that mean for the tools you adopt? First, just because your data is too big for Excel/Python/PHP/R/memory doesn’t mean it’s time to adopt Hadoop and hire a team to set it up. It means you should look into using something like a relational database to interact with and investigate your data. Ideally, you’re already thinking about how to transform your data into something like a spreadsheet, which means an RDBMS is a natural fit.
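To make that concrete, here’s a small sketch using SQLite, a zero-setup relational database included in Python’s standard library; the visits table and its values are invented for illustration:

```python
# Load rows into SQLite and investigate them with SQL instead of
# holding everything in a spreadsheet or in application memory.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real datasets
conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT, seconds INT)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", [
    ("alice", "/home", 30),
    ("alice", "/pricing", 90),
    ("bob", "/home", 12),
])

# Aggregations that would strain a spreadsheet are one query here
for visitor, total in conn.execute(
        "SELECT visitor, SUM(seconds) FROM visits GROUP BY visitor"):
    print(visitor, total)
```

The same queries work nearly unchanged against PostgreSQL or MySQL once the dataset outgrows a single file, which is exactly the graceful scaling path a relational database buys you.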

Of the relational databases I listed above, two are free, so the only costs you’d incur would be the machines to host them and the time to set them up. The other main reason is that someone on your team, or in your company, likely already knows how to start using one now.

All that said, there’s definitely a place for tools like Hadoop, but it’ll be very specific to your implementation and how your dataset is growing.

Getting started with tabular data


I look at data every day. If I had to go back to a past version of myself to give him advice, I’d offer this: make it a rule to fit your data into a box.

There are plenty of mathematical techniques out there for analyzing data but to effectively apply them to your particular data, your data needs to fit the following format:

  • Data consists of rows and columns
  • Your data should be viewable using any common spreadsheet application
  • Each row represents an instance of data (in other words, each row represents one object under study, be it a person, a spammy email, or a photograph with a face to be recognized)
  • Each column represents a feature, or something we can use to describe the instance (this could be a person’s height, the number of occurrences of the word “FREE” in a spammy email, or the length of a detected edge in a picture of a face)

When you encounter new data, strive to fit it into that framework. Do you have a pile of server logs to analyze? Figure out the instances you’re studying (likely unique visitors), then tie each log entry to a particular visitor (represented by a row) and describe that visitor via features (like how many times they visited in a week, what User-Agent they use, or which pages they’ve clicked on).
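A minimal sketch of that roll-up, with invented log entries and feature names:

```python
# Roll raw log entries up into one feature row per visitor.
from collections import defaultdict

logs = [
    {"visitor": "v1", "url": "/a", "agent": "Firefox"},
    {"visitor": "v1", "url": "/b", "agent": "Firefox"},
    {"visitor": "v2", "url": "/a", "agent": "Chrome"},
]

rows = defaultdict(lambda: {"visits": 0, "agent": None, "urls": set()})
for entry in logs:
    row = rows[entry["visitor"]]
    row["visits"] += 1            # feature: visit count
    row["agent"] = entry["agent"]  # feature: last-seen User-Agent
    row["urls"].add(entry["url"])  # feature: distinct pages seen

# Each visitor is now one instance with feature columns
for visitor, row in sorted(rows.items()):
    print(visitor, row["visits"], row["agent"], sorted(row["urls"]))
```

From here the `rows` mapping drops straight into a spreadsheet, a CSV, or a pandas DataFrame: one row per visitor, one column per feature.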

How about analyzing the content of a blog post to determine what category it falls under? That depends on whether you’re trying to categorize each particular blog post or a whole blog (including all of its content as relevant evidence for the category). Suppose you decide you want to classify individual posts. Then you’d start with approaches that count the occurrences of particular words, banking on the theory that most tech-related posts will regularly mention token words like “gadget,” “app,” or even “apple.” If you want to classify an entire blog, you might begin by summarizing the text of its about page in a similar manner, along with summarizing all of the available tags (and their counts) for the blog as a whole.
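As a toy sketch of that word-counting step, with an invented post and keyword list:

```python
# Turn a blog post into word-count features for classification.
from collections import Counter

keywords = ["gadget", "app", "apple"]
post = "this new gadget ships with an app and the app syncs to apple"

counts = Counter(post.lower().split())
features = {w: counts[w] for w in keywords}
print(features)  # → {'gadget': 1, 'app': 2, 'apple': 1}
```

Each post becomes one row, each keyword one column, and you’re back in the instances-and-features box that the rest of the toolkit expects. (Real text would also need punctuation stripping and tokenization beyond a plain `split`.)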

So, all in all, one of the most helpful tips for someone just starting out in anything related to data is to turn whatever problem is in front of them into a spreadsheet-like model. More often than not, the world is messy and doesn’t give you information in this format, so the main task (and the best next action) is to summarize the data as instances of study with relevant features.