Using Jupyter on Remote Servers

As a data scientist, it really helps to have a powerful computer nearby when you need it. Even with an i7 laptop and 16GB of RAM, you’ll sometimes find yourself needing more power. Whether your task is compute-bound or memory-bound, you’ll eventually find yourself looking to the cloud for more resources. Today I’ll outline how to be more effective when you have to compute remotely.

I like to refer folks to this great article on setting up SSH configs. Not only will a good SSH configuration file simplify the way you access servers, but it can also streamline the way you work on them.
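
For instance, a minimal entry in your ~/.ssh/config might look like the following (the alias and placeholders are illustrative; adjust them to your own server). The LocalForward line sets up the same port forwarding discussed below:

Host <alias>
    HostName <remote host>
    User <username>
    LocalForward 8888 127.0.0.1:8888

With an entry like that, a plain ssh <alias> both connects you and sets up the tunnel in one step.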

I find Jupyter to be a superb resource for writing reports and displaying graphics of data. It essentially lets you run code in your web browser. However, one issue with using it on a remote machine is that you may not be able to reach the interface, because the server blocks the port it runs on from the public web (a good thing for security, since it prevents others from seeing your work). You can work around this by using SSH’s ability to forward ports.

To do that, first you’ll need to log into your remote machine:

ssh -L 8888:127.0.0.1:8888 <remote host>

That means you’re connecting to your remote host as usual, except that any time you access port 8888 on your local machine, the connection is forwarded to port 8888 on the remote machine (the 127.0.0.1 in the command is the remote machine’s loopback address, which is where Jupyter will be listening).

Then, you’ll need to start Jupyter:

cd <project>
jupyter notebook
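
Since the remote machine usually has no web browser, you can also tell Jupyter not to try to open one and pin the port explicitly (these are standard Jupyter options):

jupyter notebook --no-browser --port=8888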

Finally, head to the URL http://localhost:8888 in your local browser, and you’ll be accessing the remotely running copy of your notebook.

Here’s a screenshot of what that should look like.

[Screenshot: the Jupyter notebook interface served from the remote machine, viewed in a local browser]

Note the highlighted line on the right. There isn’t a web browser installed on my remote machine, but I was still able to access this notebook using my local computer.

Whether you want to load a 50GB data frame into Pandas or use n_jobs=-1 in Scikit Learn, you should find yourself more able to do your work.
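
For reference, here is what that scikit-learn parameter looks like in practice (the choice of model is arbitrary):

from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 asks scikit-learn to use every core available on the machine
clf = RandomForestClassifier(n_jobs=-1)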

Follow me on Twitter @mathcass.

Bloom Filters in Application

Today we’re going to talk about what a Bloom filter is and discuss some of the applications in data science. In a later post, we’ll build a simple implementation with the goal of learning more about how they work.

What is a Bloom Filter?

A Bloom filter is a probabilistic data structure. Let’s break that term down. Any time you hear the word “probabilistic,” the first thing that should come to mind is “error”: the structure sometimes gives wrong answers. When you hear “data structure,” you should think about “space,” more specifically storage space or memory.

Bloom filters are designed to answer questions of set membership, that is, “is this item one of X?” Here are some simple questions you might be working with if you were considering them:

  • Have we seen this email address sign up to our site recently?
  • Is this a product the user has bought before?
  • Is the IP address these registrations are coming from on a whitelist of IPs we can trust?

Basically, it compresses simple set membership information into a smaller memory footprint at the cost of a bit of error. Importantly, the error is one-sided: a Bloom filter can return a false positive (“probably in the set”), but never a false negative (if it says an item isn’t there, it definitely isn’t).

By design, Bloom filters implement only two operations, add and contains. So once you’ve added a member, you can’t remove it, and you can’t query for a list of the elements it holds.
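
To make that concrete, here is a minimal sketch of the idea in Python (the bit-array size, number of hashes, and hashing scheme are purely illustrative, not a tuned implementation; the later post will cover the real details):

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a plain integer doubling as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different salts
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def contains(self, item):
        # False means "definitely not present"; True means "probably present"
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add("user@example.com")
print(bf.contains("user@example.com"))     # True
print(bf.contains("someone@example.com"))  # False (with high probability)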

So, why would you be okay with error in data in exchange for using less memory? We can answer that question (as well as better understand the use-case of set membership) with a few useful examples.

How are Bloom filters used?

It would be hard not to mention this Medium article on the subject, because it clarifies how Bloom filters apply to a data science task. At Medium, they have some version of a distributed data store. Distributing data helps you scale out information across servers, but it can also increase the variance in your response times (read: more requests might take longer to finish). Most web companies focus on making the user experience as pleasant as possible, so they value response time. In this case, one important piece of data was the set of articles a given user had read before.

At the risk of retelling an already well-told story, suffice it to say that they used a Bloom filter to avoid recommending articles a user had already read. This was a case where they took a data science model (the story recommendation engine) and augmented it with a component that could use compressed information (the Bloom filter) to keep the user from seeing the same article twice. Moreover, even though there’s potential for error in that filter, the error is negligible (in the sense that the user’s experience isn’t hindered by it). To summarize, because they could efficiently represent the set of “articles this user has read before,” and because that representation had a defined error rate, they could improve their user experience.

As another example, think about buying items on Amazon.com. Amazon has almost any item imaginable and probably also distributes its data for scale. They’re still able to tell you right on the product page whether you’ve bought an item before and when. I don’t have any insight into what’s going on behind the scenes, but this is another perfect place for a Bloom filter, either at the user level (holding the set of products someone has bought) or at the product level (holding the set of all users who have bought the product). A negative match (which will be correct 100% of the time) means you don’t need to perform the database lookup to see whether someone bought the item. A positive match (which will be comparatively rare) is the only time you need to confirm against the actual transaction data.
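
As a rough sketch of that lookup pattern (the function and the db helper here are hypothetical, and bloom stands for a filter like the one sketched earlier):

def has_user_bought(user_id, product_id, bloom, db):
    key = f"{user_id}:{product_id}"
    if not bloom.contains(key):
        return False  # a negative match is definitive, so skip the database entirely
    # A positive match only means "probably," so confirm against the real transaction data
    return db.lookup_purchase(user_id, product_id)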

Finally, I wanted to point out a use-case that I found by perusing various implementations of the tool. Bloom filters can also be used to track time-dependent information (or various forms of time series data). One thing you could do is store aggregate level information (like whether someone bought a particular product in a given time horizon like the last 30 or 90 days) in a Bloom filter. Then, based on that information, you can make modeling decisions like what sort of ads you show this person.

I hope this post helped you learn a little bit about Bloom filters. In a later post, I’ll go into some detail on how they’re implemented with a focus on pedagogy. In the meantime, follow me on WordPress or Twitter for updates.

Additional links to check out

  • A Python package from Moz on using Redis as a backend (they were using it to ensure they didn’t crawl the same websites multiple times)
  • Another Python implementation optimized for scaling
  • A very detailed article of several other probabilistic data structures
  • A great bitly post on the subject as well as their own implementation (which also supports removing set members)

Four useful books for learning Data Science

I was listening to an old episode of Partially Derivative, a podcast on data science and the news. One of the hosts mentioned that we’re now living in the “golden age of data science instruction” and learning materials. I couldn’t agree more with this statement. Each month, most publishers seem to have another book on the subject, and people are writing exciting blog posts about what they’re learning and doing.

I wanted to outline a few of the books that helped me along the way, in the order I approached them. Hopefully, you can use them to gain a broader perspective of the field and perhaps as a resource to pass on to others trying to learn.

Learning from Data

I first found Learning from Data through Caltech’s course on the subject. I still think it’s an excellent text but I’m not sure if I would recommend it to the absolute beginner. (To someone who is just coming to the subject, I would probably recommend the next choice down on the list.)

However, I have a Master’s degree in mathematics so I was familiar with the background material in linear algebra and probability as well as the notation used. Learning from Data taught me that there was actual mathematical theory behind a lot of the algorithms employed in data science.

Most algorithms are chosen for their pragmatic application, but they also have features in and of themselves (such as how they bound the space of possible hypotheses about the data) that can help determine their effectiveness on data. There’s also a general theory for how to approach the analysis of these algorithms. At the time of reading, a lot of it was still a bit over my head, but it got me incredibly curious about the field itself.

Now, understanding a few things about the theory is great, but most of the time, people want to know what it can actually do.

Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management

I’ll admit to having had only a cursory understanding of what was possible before I read Data Mining Techniques. I knew that the most widely used algorithms were applied to assessing risk, as with credit scores. However, I didn’t know much about how you could make gains in the world of marketing using data science techniques.

I appreciated that the authors have a lot of experience in the field, especially experience that predates most of the growth in big data these days. This book makes it clear that many of the most useful algorithms have been around and in use for decades. The authors also offer some explanations from the direct marketing case (print magazines and physical mail) that I hadn’t considered, such as ranking algorithms, which were originally used to prioritize a list of people to contact because of the high costs of mailing paper to people.

More than anything, I liked the breadth of the topics, since they cover just about every form of marketing algorithm and do a great job of giving you a high level view of why they matter.

You won’t walk away from this book knowing how to implement everything they talk about, but you will get a sense for which algorithms are suited to particular tasks.

This book gave me a better way to think through the initial phases of a project, but I still needed some help in learning how to communicate about data and how to fit it directly into the business context.

Data Science for Business

I read through this one while I was on vacation (yes, I know, I’m that type of geek). That didn’t stop me from soaking up a lot of information from it about how data science applies to a company trying to use these models. Most of the book is focused on helping you think through how to operationalize the process of running and managing a data science project and what outcomes you might expect from the effort.

Beyond that, I think it taught me how to communicate better about data at a company. Being able to talk about the many months it will take to bring a project to fruition and weigh it against alternatives is the bread and butter of working at a company that wants to make money. Moreover, if you believe that a particular project is the right choice, you need to be able to back up that choice by communicating about the benefits.

I want to say that this is a very “bottom-line” type of book, but that’s a perspective worth hearing some of the time. Data science doesn’t always have to be about the hottest technique or the biggest technology if your priorities include keeping your costs below your revenue. However, I still didn’t learn much about getting my hands dirty with the data on a day-to-day basis. For that, I had to rely on the final book on this list.

Applied Predictive Modeling

This is a book on predictive modeling in R and on using a package that the author developed for doing that. This isn’t simply someone tooting their own horn, because caret is a quality piece of software. Overall, I think that even if you don’t end up using R as your go-to tool for analyzing data, you’ll still learn a lot from this book. It thoroughly demonstrates the power caret can offer you in a project, to the point that you’ll seek the same functionality in your tool of choice (or hopefully build its equivalent for the rest of us).

Caret is a package that offers a consistent interface for just about any predictive task (classification or regression) you could ask for. One issue some people have with R packages is that the interfaces to different algorithms aren’t very consistent: learning how to use one package won’t always carry over to a completely different one. Caret addresses that by giving you the same way to set up a modeling task for many different algorithms. It also automates several tasks, like:

  • Data splitting into training and test sets
  • Data transformations like normalization or power transforms
  • Model tuning and parameter selection

Essentially, it makes working in R a lot like using Scikit Learn (an excellent library itself) but with many more options and model implementations.
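
For a rough sense of the comparison, here is what that workflow looks like on the scikit-learn side (the dataset and model below are arbitrary choices for illustration, not examples from the book):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Data splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Data transformation (normalization) chained with the model
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Model tuning and parameter selection via cross-validation
search = GridSearchCV(pipeline, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))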

So that’s all you need, right? Just read a couple of books and you’re on your way? Not quite. You’ll actually have to apply some of this and learn from it. Perhaps next time you’re in a meeting discussing priorities for your company, you will need to frame the conversation about your next data project and direct the data effort toward your business goals (Data Science for Business). When you’re brainstorming possible things that you could try to predict and use in a marketing campaign, you will need to outline the possible techniques and what they could offer you (Data Mining Techniques). If you’re evaluating candidate algorithms for their ability to perform the task accurately, you will need to gauge their effectiveness from a theoretical (Learning from Data) and practical (Applied Predictive Modeling) standpoint.

I hope this helps you apply data science at work and gives you perspective in the field. Also, if you’re not a follower on Twitter, please follow me @mathcass.

Nudging and Data Science

I’ve recently been reading a great book on how people make decisions and what organizations can do to help folks make better choices. That book is Nudge.

What is a nudge?

The authors describe a nudge as anything that can influence the way we make decisions. Take the primacy effect, for instance: the idea that order matters in a series of items, and that we’re more likely to recall the first (or last) options in a list simply because of their positions. Order becomes a nudge if, say, you later choose the first movie from a list a friend recommended mostly because it was the first one to come to mind at the store.

The fact that humans have these biases is an indicator that we don’t always act rationally. In cases where we haven’t had enough experience to learn from our decisions, we need a bit of help finding the most appropriate option for our needs. Most people only decide what type of health care plan they need, or at what rate to contribute to their retirement plans, a few times in their lives, so there isn’t much opportunity to learn at all.

All in all, the book is a great read, and much of it is an explanation of how proper nudges have excellent applications in areas like health care and making financial decisions.

How does Data Science fit in?

So, why bring in Data Science? Well, lately companies have been looking to the fields of Machine Learning and Statistics to figure out how to make better business decisions, and these methods can play an important role in defining the right nudges to use.

The authors emphasize that proper nudges should a) offer a default option that is stacked in favor of most people and b) make it easy to stray from the default option as needed.

When I think about those two requirements, a few things come to mind. In Machine Learning, mathematical optimization takes data about outcomes and selects the best set of choices. And recommender systems are designed to take a few hints and offer up suggestions of similar items.

In the case of deciding on the most favorable default option, that decision should be based on the available data. The authors talk about health care and Medicare Part D, and the fact that the government randomly assigned plans, thereby leaving most people in a sub-optimal situation. An approach to this problem, given the available data, would have been to survey citizens about their prescription needs and then select a default plan from the available options in a way that minimizes some measure, such as the median cost per participant.
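
As a toy illustration of that idea (the plan names and cost figures below are entirely made up), choosing such a default might look like this:

import statistics

# Hypothetical survey results: estimated annual prescription cost for each
# surveyed person under each available plan
costs_by_plan = {
    "Plan A": [120, 80, 300, 45, 200],
    "Plan B": [95, 150, 210, 60, 110],
    "Plan C": [200, 70, 180, 90, 160],
}

# Pick the default plan that minimizes the median cost per participant
default_plan = min(costs_by_plan, key=lambda plan: statistics.median(costs_by_plan[plan]))
print(default_plan)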

Additionally, the authors describe a tool for Medicare Part D that let someone enter their prescriptions and then assigned them a plan to choose. One of the difficulties with this system was that it rarely gave the same answer, even with the same inputs, because the plans would change over time. This gave people a false sense of which plan was good for them. A better approach would have been to recommend a handful of appropriate plans by matching the drug information against the available options. When presented with hundreds of options, people have a difficult time making a choice that will work, but if those hundreds could be winnowed down to the 3-5 most appropriate ones, they would have an easier time weighing the pros and cons.

Obviously, there is still plenty of constructive work to be done in supporting any nudge. And I believe that the tools that Data Scientists use day-to-day are valuable to keep in mind in these efforts.