Using Deep Learning is Easier Than You Think


I came across a great article on using the Deep Learning Python package tflearn to perform inference on some classic datasets in Machine Learning like the MNIST dataset and the CIFAR-10 dataset. As it turns out, these types of models have been around for quite a while for various tasks in image recognition. The particular case of the CIFAR-10 dataset was solved by a neural network very similar to the one from the mentioned post. The general idea of using convolutional neural networks dates back to Yann LeCun’s paper from 1998 in digit recognition.

Since I have a lot of experience working with data but not a lot of experience working with deep learning algorithms, I wondered how easy it would be to adapt these methods to a new, somewhat related, problem: going through thousands of my photos to identify pictures of my cat. Turns out, it was easier than I thought.

[Photo: this is definitely a cat photo]

Training a similar neural network on my own visual data just amounted to connecting up the inputs and the outputs properly. In particular, any input image has to be re-scaled down (or up) to 32×32 pixels. Similarly, the output must be binary, representing membership in one of the two classes.
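For reference, wiring up a comparable network in tflearn looks roughly like the sketch below. The layer sizes are illustrative rather than the exact configuration I used; the important parts are the 32×32×3 input and the two-unit softmax output.

import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

# Input layer matches the re-scaled 32x32 RGB images
net = input_data(shape=[None, 32, 32, 3])
net = conv_2d(net, 32, 3, activation='relu')
net = max_pool_2d(net, 2)
net = conv_2d(net, 64, 3, activation='relu')
net = max_pool_2d(net, 2)
net = fully_connected(net, 256, activation='relu')
# Two output units: cat and not_cat
net = fully_connected(net, 2, activation='softmax')
net = regression(net, optimizer='adam', loss='categorical_crossentropy')

model = tflearn.DNN(net)
# X and Y come from the dataset step described below
# model.fit(X, Y, n_epoch=20, validation_set=0.1, show_metric=True)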

The main difficulty involves creating your dataset. This really just means going through your images and classifying a subset of them by hand. For my own run at this, all I did was create a directory like:

images/
    cat/
    not_cat/

I put any cat photos I found into the cat directory while putting any non-cat photographs in the other folder. I tried to keep the same number of images in both directories to try to avoid any class imbalance problems. Then again, this wasn’t as much of a concern since roughly half my photos are cat photos anyway.

From there, tflearn has a helper function that builds an HDF5 dataset from a directory of images. The X and Y values from that data structure can be used as the inputs to the deep learning model.
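The helper I'm referring to is tflearn's build_hdf5_image_dataset. A minimal sketch of pointing it at the folder layout above (file names here are just placeholders) looks like:

import h5py
from tflearn.data_utils import build_hdf5_image_dataset

# Walk images/cat and images/not_cat, resize everything to 32x32,
# and write the result out as an HDF5 file
build_hdf5_image_dataset('images/', image_shape=(32, 32),
                         mode='folder', output_path='cat_dataset.h5',
                         categorical_labels=True, normalize=True)

h5f = h5py.File('cat_dataset.h5', 'r')
X, Y = h5f['X'], h5f['Y']  # these feed straight into model.fit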

Using around 400 images (roughly 200 per class), my classifier achieved about 85% accuracy on a validation set. For my purposes, namely automatically tagging potential photos of my cat, this was accurate enough. Any effort to increase the accuracy would probably involve some combination of:

  • adding more training data by putting images into my class folders
  • changing the shape of the network by adding more layers or more nodes per layer
  • using a pre-trained model to bootstrap the learning process

That’s all it really takes. If you know a bit of Python and can sort a few of your photos into folders based on their categories, you can get started using sophisticated deep learning algorithms on your own images.

You can find the code for this on my GitHub account. If you want to chat or reach out at all, follow me on Twitter @mathcass.


Communicate about your work


I gave a lightning talk earlier this month at the PyData Atlanta Meetup. I've given hour-long talks on technical subjects before, but I had never done anything quite that concise. That freaked me out quite a bit. I wanted to reflect on why it's always a good idea to communicate more about what you do.

No matter how mundane or "been done before" you believe your work is, there's value in showing it to others because some people will learn from it. In machine learning, some methods are designed to try every permutation of a set of options and choose the one with the best performance. As the complexity of a model of your data grows, that kind of exhaustive search inevitably breaks down and you need to apply heuristics to the problem. What people don't mention is that these heuristics can come from anywhere, whether it's a research paper, a book, a mentor, or even a five-minute talk you saw on a Thursday night.

Like any good person who over-prepares for things, I read up a bit on it beforehand, which helped me come to this conclusion (and helped me think through public speaking in general). Here are some resources:

Oh yeah. I think my talk went pretty well. Here’s a link to my Google Drive slides or a PDF copy. If you’d like to chat about data or about my work, feel free to reach out to me via email or on Twitter.

Eigenlayouts


For a long time, I've been interested in web technology. In high school, I read Jesse Liberty's Complete Idiot's Guide to a Career in Computer Programming, learning about Perl, CGI (the Common Gateway Interface), HTML, and other technologies. It wasn't until I finished a degree in mathematics that I really started learning the basics, namely HTML, CSS, and JavaScript.

At that point, folks were just starting to come out of the dark ages of table-based layouts and experimenting with separating content (HTML) from presentation (CSS) from behavior (JavaScript). A popular discussion was over what the best type of layout was; I remember reading debates about left-handed vs. right-handed vs. centered pages. (The latter is still a pain to implement, but the web has come a long way since then.) Those discussions stuck with me and motivate the study I'm running right now.

In mathematics, people have been using fundamental matrix algebra to accomplish some very interesting things. One application is image decomposition: breaking images down into simpler, more basic components. A space of vectors (coordinates that can represent data points) has "eigenvectors," characteristic vectors that can be combined to reconstruct the other vectors you have in front of you. In facial recognition, applying this technique to images of faces yields Eigenfaces, patterns that are common to many of the images.
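As a rough illustration of the idea (with random arrays standing in for real screenshots), the decomposition amounts to stacking the flattened images into a matrix and pulling out its principal directions:

import numpy as np

# Stand-in data: 100 grayscale "screenshots", each 80x120 pixels
images = np.random.rand(100, 80, 120)

# Flatten each image into a row vector and center the data
flat = images.reshape(len(images), -1)
centered = flat - flat.mean(axis=0)

# The right singular vectors of the centered matrix play the role of
# "eigenlayouts"; each one reshapes back into an image-sized pattern
_, _, components = np.linalg.svd(centered, full_matrices=False)
first_eigenlayout = components[0].reshape(80, 120)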

Coming back to the idea of website layouts, I reasoned there must be a way to see this in real web data somehow. If we took images of the most popular websites on the web, what would their "eigenlayouts" look like overall? Would certain layouts (left, right, or centered) just pop out of the data? To answer that question, we need some data, we need to analyze it, and then we need to interpret it.

Data Retrieval

To run the analysis on these websites, we need to turn them into images. For that, I turned to PhantomJS, a headless browser geared toward rendering websites. For the sake of having a portable solution (able to run just about anywhere without much bootstrapping), I used a community-submitted Docker image that did the job nicely. Basic usage has you specifying the website you want to render and the output filename for the image; you can additionally pass in arguments for the page resolution. I went with 1200 by 800 pixels (the resolution passed to the rasterizer below) because it's a sensible minimum that people still design websites for.

When I need to run a lot of commands and don't need a whole programming language, I typically turn to GNU make for defining a pipeline of work and parallelizing it.

IMAGEDIR := images/
DOMAINDIR := domains/

DOMAINLIST := $(shell find $(DOMAINDIR) -type f)
IMAGELIST := $(DOMAINLIST:=.png)
IMAGELIST := $(IMAGELIST:$(DOMAINDIR)%=$(IMAGEDIR)%)

# Docker volume requires an absolute path
VOLUME := $(abspath $(IMAGEDIR))

all: $(IMAGELIST)

%.png:
    mkdir -p $(IMAGEDIR)
    $(eval DOMAIN := $(patsubst $(IMAGEDIR)%.png,%,$@))
    sudo docker run -t --rm -v $(VOLUME):/raster-output herzog31/rasterize http://$(DOMAIN) "$(notdir $@)" 1200px*800px 1.0

# This rule will create 10,000 files in the $(DOMAINDIR) directory
mkdomains:
    mkdir -p $(DOMAINDIR)
    cut -d, -f2 top-1m.csv | head -10000 | xargs -n1 -I{} touch "$(DOMAINDIR){}"

The gist of this is that you first run make mkdomains to create a directory domains/ filled with dummy targets for each domain you want to look up (10,000 of them take up about 80MB of space). Then, running make uses that seeded directory of domains to pull each one down and drop the image file in the images/ directory. You can file all of this under "stupid Makefile hacks."

Next Steps

So far, this covers a very small portion of the 80% of data science that’s cleaning and munging data. The next blog post will focus on loading and analyzing this image data in Python using the scikit-image library.

If you liked this post, please consider subscribing to updates and following me on Twitter.

Memory Profiling in Python


Data Scientists often need to sharpen their tools. If you use Python for analyzing data or running predictive models, here’s a tool to help you avoid those dreaded out-of-memory issues that tend to come up with large datasets.

Enter memory_profiler for Python

This memory profiler was designed to assess the memory usage of Python programs. It's cross-platform and should work on any modern Python version (2.7 and up).

To use it, you’ll need to install it (using pip is the preferred way).

pip install memory_profiler

Once it’s installed, you’ll be able to use it as a Python module to profile memory usage in your program. To hook into it, you’ll need to do a few things.

First you’ll need to decorate the methods for which you want a memory profile. If you’re not familiar with a decorator, it’s essentially a way to wrap a function you define within another function. In the case of memory_profiler, you’ll wrap your functions in the @profile decorator to get deeper information on their memory usage.

If your function looked like this before:

def my_function():
    """Runs my function"""
    return None

then the @profile decorated version would look like:

@profile
def my_function():
    """Runs my function"""
    return None

It works by running your program within a special context, so it can measure and store the relevant statistics. To invoke it, run your command with the flag -m memory_profiler. That looks like:

python -m memory_profiler <your-program>

Profiling results

To see what the results look like, I put together a few sample code snippets.

While these examples are contrived, they illustrate how tracing memory usage in a program can help you debug problems in your code.
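The source behind the profile below was roughly the following (reconstructed from the output itself, with big.csv standing in for any file large enough to cause trouble):

import pandas as pd

@profile
def infinite_loading():
    """Exceeds available memory"""
    train = pd.read_csv('big.csv')
    while True:
        new = pd.read_csv('big.csv')
        train = pd.concat([train, new])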

Filename: run.py
Line #    Mem usage    Increment   Line Contents
================================================
     8   77.098 MiB    0.000 MiB   @profile
     9                             def infinite_loading():
    10                                 """Exceeds available memory"""
    11 2935.297 MiB 2858.199 MiB       train = pd.read_csv('big.csv')
    12 2935.297 MiB    0.000 MiB       while True:
    13 4968.926 MiB 2033.629 MiB           new = pd.read_csv('big.csv')
    14                                     train = pd.concat([train, new])


Traceback (most recent call last):
...
MemoryError

Above we have a pretty obvious logical error: we're loading a file into memory and repeatedly appending its data onto another data structure. The point, though, is that you'll get a summary of usage even if your program dies with an out-of-memory exception.

When should you think about profiling?

Premature optimization is the root of all evil – Donald Knuth

It's easy to get carried away with optimization. Honestly, it's best not to start off by immediately profiling your code; it's better to wait until you actually need it. Most of the time, I follow this workflow:

First, try to solve the problem as best you can on a smaller sample of the actual dataset (the key is to use a sample small enough that a run takes seconds, rather than minutes or hours).
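With pandas, for example, one quick way to get such a sample (big.csv is a placeholder name) is to read only part of the file:

import pandas as pd

# Prototype against the first 100,000 rows instead of the whole file
sample = pd.read_csv('big.csv', nrows=100000)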

Then, include your entire dataset to see how that runs. At this point, based on your sample runs you should have a) an idea of how long the full dataset should take to run and b) an idea of how much memory it will use. Keep that in mind.

Next, monitor it as it runs. For simplicity, there are three possible outcomes:

  • It finishes successfully
  • It runs out of memory
  • It’s taking too long to run

You should start thinking about profiling your code if you hit either of the latter two cases. If memory is the problem, running the memory profiler will show you which objects are taking up more memory than you expect.

From there, you can take a look at whether you need to encode your variables differently. For example, maybe you’re interpreting a numeric variable as a string and thus using more RAM. Or it could be time to offload your work to a larger server with enough space.
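As a sketch of that kind of re-encoding fix (the column name and file are hypothetical), pandas can report per-column memory usage and downcast a column that came in as strings:

import pandas as pd

df = pd.read_csv('big.csv')
print(df.memory_usage(deep=True))  # per-column usage in bytes

# A numeric column stored as strings ("object" dtype) wastes RAM;
# converting it to a proper numeric type shrinks it considerably
df['user_id'] = pd.to_numeric(df['user_id'], downcast='integer')
print(df.memory_usage(deep=True))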

If the algorithm is taking too long, there are a number of options to try out, which I’ll cover in a later post.

Concluding remarks

You just saw how to run some basic memory profiling on your Python programs. Running out of memory while analyzing a dataset is one of the primary hurdles people encounter in practice. The memory_profiler package isn't the only one available, so check out some of the others in the Further Reading section below.

If you liked this post, please share it on Twitter or Facebook and follow me @mathcass.

Using Jupyter on Remote Servers


As a data scientist, it really helps to have a powerful computer nearby when you need it. Even with an i7 laptop and 16GB of RAM, you'll sometimes need more power. Whether your task is compute- or memory-constrained, you'll eventually find yourself looking to the cloud for more resources. Today I'll outline how to be more effective when you have to compute remotely.

I like to refer folks to this great article on setting up SSH configs. Not only will a good SSH configuration file simplify the way you access servers, it can also help you streamline the way you work on them.
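For example, an entry like the one below (the host name and address are placeholders) gives you a short alias and bakes in the port forwarding we'll use in a moment:

Host workhorse
    HostName workhorse.example.com
    User your-username
    LocalForward 8888 127.0.0.1:8888

With that in your ~/.ssh/config, a plain ssh workhorse opens the same tunnel as the -L flag shown below.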

I find Jupyter to be a superb tool for writing reports and displaying data graphics; it essentially lets you run code in your web browser. However, one issue with using it on a remote machine is that you may not be able to reach the interface, because the server blocks the necessary port from the public web (a good thing for security, since it prevents others from seeing your work). You can work around this by using SSH's ability to forward ports.

To do that, first you’ll need to log into your remote machine:

ssh -L 8888:127.0.0.1:8888 <remote host>

That connects you to the remote host, and any traffic sent to port 8888 on your local machine (127.0.0.1) gets forwarded to port 8888 on the remote machine.

Then, you’ll need to start Jupyter:

cd <project>
jupyter notebook

Finally, point your local browser at http://localhost:8888 and you'll be looking at the notebook running on the remote machine.

Here’s a screenshot of what that should look like.

[Screenshot of the notebook session in a local browser]

Note the highlighted line on the right: there isn't a web browser installed on my remote machine, but I was still able to access this notebook from my local computer.

Whether you want to load a 50GB data frame into pandas or use n_jobs=-1 in scikit-learn, you should find yourself much more able to do your work.

Follow me on Twitter @mathcass.

Bloom Filters in Application


Today we’re going to talk about what a Bloom filter is and discuss some of the applications in data science. In a later post, we’ll build a simple implementation with the goal of learning more about how they work.

What is a Bloom Filter?

A Bloom filter is a probabilistic data structure. Let’s break that term down. Any time you hear the word “probabilistic” the first thing that should come to mind is “error.” That is, it sometimes has errors. When you hear “data structure” you should think about “space,” more specifically storage space or memory.

Bloom filters are designed to answer questions of set membership, that is, “is this item one of X?” Here are some simple questions you might be working with if you were considering them:

  • Have we seen this email address sign up to our site recently?
  • Is this a product the user has bought before?
  • Is the IP address these registrations are coming from on a whitelist of IPs we can trust?

Basically, it compresses simple set membership information into a smaller memory footprint at the cost of a bit of error.

By design, Bloom filters only implement two types of operations, add and contains. So, once you’ve added a member to it, you wouldn’t be able to remove it. Additionally, you wouldn’t be able to query for a list of elements in it.

So, why would you be okay with error in data in exchange for using less memory? We can answer that question (as well as better understand the use-case of set membership) with a few useful examples.
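To make the add/contains behavior concrete, here's a minimal, unoptimized sketch in Python (a real implementation would size the bit array and pick the number of hash functions from the error rate you're willing to accept):

import hashlib

class BloomFilter(object):
    def __init__(self, size=1000, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Simulate several hash functions by salting one hash
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(('%d:%s' % (seed, item)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def contains(self, item):
        # False means definitely absent; True means probably present
        return all(self.bits[pos] for pos in self._positions(item))

emails = BloomFilter()
emails.add('user@example.com')
print(emails.contains('user@example.com'))   # True
print(emails.contains('someone@else.com'))   # False (almost certainly)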

How are Bloom filters used?

It would be hard not to mention the Medium engineering article on the subject, because it clarifies how Bloom filters apply to a data science task. At Medium, they run some version of a distributed data store. Distributing data helps you scale information out across servers, but it also tends to increase the variance in response times (read: more requests might take longer to finish). Most web companies focus on making the user experience as pleasant as possible, so they care about response time. In this case, one important request was for the set of articles a particular user had already read.

At the risk of retelling an already well-told story, suffice it to say that they used a Bloom filter to avoid recommending articles the user had already read. This was a case where they had a data science model (the story recommendation engine) and augmented it with a component that could use compressed information (the Bloom filter) to keep the user from seeing the same article twice. Even though there's potential for error in that filter, the error is negligible, in the sense that the user's experience isn't hindered by it. To summarize, because they could efficiently represent the set of "articles this user has read before," and because that representation had a defined error rate, they could improve the user experience.

As another example, think about buying items on Amazon.com. Amazon has almost any item imaginable and probably also distributes its data for scale. Yet it can still tell you, right on the product page, whether you've bought an item before and when. I don't have any insight into what's going on behind the scenes, but this is another perfect place for a Bloom filter, either at the user level (holding the set of products someone has bought) or at the product level (holding the set of all users who have bought the product). A negative match (which is correct 100% of the time) means you don't need to perform a database lookup to see whether someone bought the item. A positive match (which may occasionally be a false positive) is the only case where you go confirm against the actual transaction data.

Finally, I wanted to point out a use-case that I found by perusing various implementations of the tool. Bloom filters can also be used to track time-dependent information (or various forms of time series data). One thing you could do is store aggregate level information (like whether someone bought a particular product in a given time horizon like the last 30 or 90 days) in a Bloom filter. Then, based on that information, you can make modeling decisions like what sort of ads you show this person.

I hope this post helped you learn a little bit about Bloom filters. In a later post, I’ll go into some detail on how they’re implemented with a focus on pedagogy. In the meantime, follow me on WordPress or Twitter for updates.

Additional links to check out

  • A Python package from Moz on using Redis as a backend (they were using it to ensure they didn’t crawl the same websites multiple times)
  • Another Python implementation optimized for scaling
  • A very detailed article of several other probabilistic data structures
  • A great bitly post on the subject as well as their own implementation (which also supports removing set members)

Four useful books for learning Data Science

I was listening to an old episode of Partially Derivative, a podcast on data science and the news. One of the hosts mentioned that we're now living in the "golden age of data science instruction" and learning materials. I couldn't agree more with this statement. Each month, most publishers seem to have another book on the subject, and people are writing exciting blog posts about what they're learning and doing.

I wanted to outline a few of the books that helped me along the way, in the order I approached them. Hopefully, you can use them to gain a broader perspective of the field and perhaps as a resource to pass on to others trying to learn.

Learning from Data

I first found Learning from Data through Caltech’s course on the subject. I still think it’s an excellent text but I’m not sure if I would recommend it to the absolute beginner. (To someone who is just coming to the subject, I would probably recommend the next choice down on the list.)

However, I have a Master’s degree in mathematics so I was familiar with the background material in linear algebra and probability as well as the notation used. Learning from Data taught me that there was actual mathematical theory behind a lot of the algorithms employed in data science.

Most algorithms are chosen for their pragmatic application, but they also have features in and of themselves (such as how they bound the space of possible hypotheses about the data) that can help determine their effectiveness on data. There’s also a general theory for how to approach the analysis of these algorithms. At the time of reading, a lot of it was still a bit over my head, but it got me incredibly curious about the field itself.

Now, understanding a few things about the theory is great, but most of the time, people want to know what it can actually do.

Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management

I'll admit to having had only a cursory understanding of what was possible before I read Data Mining Techniques. I knew that the most widely used algorithms were for assessing risk, like credit scoring. However, I didn't know much about how you could make gains in the world of marketing using data science techniques.

I appreciated that the authors have a lot of experience in the field, especially experience that predates most of the growth in big data these days. This book makes it clear that many of the most useful algorithms have been around and in use for decades. The authors also offer some explanations from the direct marketing case (print magazines and physical mail) that I hadn’t considered, such as ranking algorithms, which were originally used to prioritize a list of people to contact because of the high costs of mailing paper to people.

More than anything, I liked the breadth of the topics, since they cover just about every form of marketing algorithm and do a great job of giving you a high level view of why they matter.

You won't walk away from this book knowing how to implement everything they talk about, but you will get a sense for which algorithms are suited to particular tasks.

This book gave me a better way to think through the initial phases of a project, but I still needed some help in learning how to communicate about data and how to fit it directly into the business context.

Data Science for Business

I read through this one while I was on vacation (yes, I know, I’m that type of geek). That didn’t stop me from soaking up a lot of information from it about how data science applies to a company trying to use these models. Most of the book is focused on helping you think through how to operationalize the process of running and managing a data science project and what outcomes you might expect from the effort.

Beyond that, I think it taught me how to communicate better about data at a company. Being able to talk about the many months it will take to bring a project to fruition, and to weigh it against alternatives, is the bread and butter of working at a company that wants to make money. Moreover, if you believe that a particular project is the right choice, you need to be able to back up that choice by communicating its benefits.

I want to say that this is a very "bottom-line" type of book, but that's a perspective worth hearing some of the time. Data science doesn't always have to be about the hottest technique or the biggest technology if your priorities include keeping your costs below your revenue. However, I still didn't learn much about getting my hands dirty with the data on a day-to-day basis. For that, I had to rely on the final book I present.

Applied Predictive Modeling

This is a book on predictive modeling in R and on using a package the author developed for that purpose. It isn't simply someone tooting their own horn, because caret is a quality piece of software. Overall, I think that even if you don't end up using R as your go-to tool for analyzing data, you'll still learn a lot from this book. It thoroughly demonstrates the power caret can offer you in a project, to the point that you'll seek the same functionality in your tool of choice (or hopefully build its equivalent for the rest of us).

Caret is a package that offers a consistent interface for just about any predictive task (classification or regression) that you could ask for. One issue some people have with R packages is that the interface for algorithms isn’t very consistent. Learning how to use one package won’t always lead to the same understanding in a completely different package. Caret addresses that by giving you the same way to set up a modeling task for many different algorithms. Moreover, it also automates several tasks like:

  • Data splitting into training and test sets
  • Data transformations like normalization or power transforms
  • Model tuning and parameter selection

Essentially, it makes working in R a lot like using scikit-learn (an excellent library itself) but with many more options and model implementations.
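For comparison, here's roughly what those three automated steps look like when wired up by hand in scikit-learn (the dataset here is a synthetic stand-in):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Data splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Data transformation (normalization) chained with the model
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(solver='liblinear'))])

# Model tuning and parameter selection via cross-validated grid search
search = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))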

So that's all you need, right? Just read a couple of books and you're on your way? Not quite. You'll actually have to apply some of this and learn from it. Perhaps next time you're in a meeting discussing priorities for your company, you'll need to frame the conversation about your next data project and direct the effort toward your business goals (Data Science for Business). When you're brainstorming things you could try to predict and use in a marketing campaign, you'll need to outline the possible techniques and what they could offer you (Data Mining Techniques). If you're evaluating candidate algorithms for their ability to perform the task accurately, you'll need to gauge their effectiveness from a theoretical (Learning from Data) and practical (Applied Predictive Modeling) standpoint.

I hope this helps you apply data science at work and gives you perspective in the field. Also, if you’re not a follower on Twitter, please follow me @mathcass.