Bayes’ Theorem is the FizzBuzz of Data Science


The other day, I did a technical interview that involved applying Bayes’ Theorem to a simple example. It stumped me.  And it left me feeling empathy for folks who have had trouble with the FizzBuzz interview question.

It doesn’t reflect day-to-day work

FizzBuzz depends on understanding a few concepts, like conditional execution, the modulus operator, divisibility of numbers, and common denominators. Every programmer should be familiar with the modulus operator and its relationship to divisibility, but knowing about it doesn’t mean it’s part of your bread and butter. The day-to-day of software engineering usually takes place at the higher level of understanding good design patterns, parsing requirements, and using the APIs of your team’s platform or framework. Diving down to a lower level to reproduce divisibility from simple mathematics is a shift in perspective, and it takes several mental cycles to get right if you’re out of practice.

It’s stressful to problem solve on the spot

Unless you’re made of steel, you’ll probably get some form of jitters during an interview and this can hurt the way you solve problems.

In 2005, Beilock & Carr published a paper comparing performance on math problems between high working-memory (HWM) and low working-memory undergraduate students. The HWM group, which performed very well on simple mathematical problems under baseline, low-stress conditions, performed significantly worse under high-stress conditions. (Incidentally, Beilock & Carr used a modular arithmetic task in their experiment, the same concept that is integral to FizzBuzz.)

Add the stress of trying to explain your solution to an interviewer and this is a recipe for a meltdown.

Experience teases out edge cases, but experience fades with time

One of the really tricky parts about FizzBuzz is that it involves an edge case that can trip people up. Namely, when the number is a multiple of 15, it’s possible to print out all three of “Fizz,” “Buzz,” and “FizzBuzz” if you’re not careful.
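To make that concrete, here’s a minimal sketch (my own illustration, not anything from the interview) of why the ordering of the checks matters: the multiple-of-15 case has to be handled before the individual divisibility checks, otherwise three separate if statements will each fire.

def fizzbuzz(n):
    """Return the FizzBuzz output for a single number."""
    if n % 15 == 0:        # handle the combined case first
        return "FizzBuzz"
    elif n % 3 == 0:
        return "Fizz"
    elif n % 5 == 0:
        return "Buzz"
    return str(n)

for i in range(1, 101):
    print(fizzbuzz(i))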

The problem with this is that edge cases aren’t over-arching principles or theoretical concepts; they’re anomalies. And people learn how to deal with anomalies by experiencing them. At work, edge cases get discovered in a handful of ways, the most likely being a) a developer on the team has come across it before, or b) there’s a deliberate, likely time-consuming effort to enumerate possible cases and test them. As I’ve mentioned above, case (a) isn’t very likely because the edge cases of FizzBuzz aren’t top-of-mind, and case (b) presents problems because of the stress of interviewing.

Bayes’ Theorem

Bayes’ Theorem (or Bayes’ Rule) is just tricky enough to behave like FizzBuzz in these situations. It’s certainly something that every data scientist should know, but it isn’t something he or she would use every day. It’s complicated enough to be relegated to the class of abstract math problems, which take a hit during stressful situations. And finally, for edge cases in applying Bayes’ Theorem (like whether two events are independent), it’s difficult and unlikely for individuals to come up with suitable examples to test on the spot.
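For reference, the rule itself is compact; the tricky part is mapping a word problem onto it:

P(A | B) = P(B | A) × P(A) / P(B)

And the independence edge case mentioned above is exactly the situation where the formula collapses to P(A | B) = P(A).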


Using Deep Learning is Easier Than You Think


I came across a great article on using the Deep Learning Python package tflearn to perform inference on some classic Machine Learning datasets like MNIST and CIFAR-10. As it turns out, these types of models have been around for quite a while for various tasks in image recognition. The particular case of the CIFAR-10 dataset was solved by a neural network very similar to the one from the mentioned post. The general idea of using convolutional neural networks dates back to Yann LeCun’s 1998 paper on digit recognition.

Since I have a lot of experience working with data but not a lot of experience working with deep learning algorithms, I wondered how easy it would be to adapt these methods to a new, somewhat related, problem: going through thousands of my photos to identify pictures of my cat. Turns out, it was easier than I thought.


This is definitely a cat photo

Training a similar neural network on my own visual data just amounted to connecting up the inputs and the outputs properly. In particular, any input image has to be re-scaled down (or up) to 32×32 pixels. Similarly, your output must be binary and should represent membership of either of the two classes.
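For the sake of illustration, here is roughly what that wiring looks like, modeled on tflearn’s published CIFAR-10 example. The layer sizes and hyperparameters below are placeholder choices rather than a record of the exact network I trained, and X and Y are the labeled data produced in the next step.

import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

# Inputs: 32x32 RGB images; outputs: two classes (cat / not_cat)
network = input_data(shape=[None, 32, 32, 3])
network = conv_2d(network, 32, 3, activation='relu')
network = max_pool_2d(network, 2)
network = conv_2d(network, 64, 3, activation='relu')
network = max_pool_2d(network, 2)
network = fully_connected(network, 512, activation='relu')
network = dropout(network, 0.5)
network = fully_connected(network, 2, activation='softmax')
network = regression(network, optimizer='adam',
                     loss='categorical_crossentropy')

model = tflearn.DNN(network)
# X and Y come from the labeled dataset described below
model.fit(X, Y, n_epoch=20, validation_set=0.1, shuffle=True, show_metric=True)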

The main difficulty involves creating your dataset. This really just means going through your images and classifying a subset of them by hand. For my own run at this, all I did was create a directory like:

images/
    cat/
    not_cat/

I put any cat photos I found into the cat directory while putting any non-cat photographs in the other folder. I tried to keep the same number of images in both directories to try to avoid any class imbalance problems. Then again, this wasn’t as much of a concern since roughly half my photos are cat photos anyway.

From there, tflearn has a helper function that creates an HDF5 dataset from your directory of images. The X & Y values from that data structure can be used as the inputs to the deep learning model.
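The helper I’m referring to is tflearn.data_utils.build_hdf5_image_dataset; a sketch of the call looks something like this (the exact arguments may differ a bit between tflearn versions):

import h5py
from tflearn.data_utils import build_hdf5_image_dataset

# Walk images/, treating each subfolder (cat/, not_cat/) as a class,
# resize everything to 32x32, and write the result to an HDF5 file
build_hdf5_image_dataset('images/', image_shape=(32, 32),
                         mode='folder', output_path='dataset.h5',
                         categorical_labels=True, normalize=True)

h5f = h5py.File('dataset.h5', 'r')
X, Y = h5f['X'], h5f['Y']   # fed to model.fit() above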

By using around 400 images (roughly 200 for each class), my classifier achieved about an 85% accuracy rate on a validation set of data. For my purposes, namely just automatically tagging potential photos of my cat, this was accurate enough. Any effort to increase the accuracy of this would probably involve some combination of:

  • adding more training data by putting images into my class folders
  • changing the shape of the network by adding more layers or more nodes per layer
  • using a pre-trained model to bootstrap the learning process

That’s all it really takes. If you know a bit of Python and can sort a few of your photos into folders based on their categories, you can get started using sophisticated deep learning algorithms on your own images.

You can find the code for this on my account at Github. If you want to chat or reach out at all, follow me on Twitter @mathcass.

Communicate about your work


I gave a lightning talk earlier this month at the PyData Atlanta Meetup. I’ve given hour-long talks on technical subjects before, but I hadn’t done anything quite that concise, and that fact freaked me out quite a bit. I wanted to reflect on why it’s always a good idea to communicate more about what you do.

No matter how mundane or “been done before” you believe your work is, there’s value in showing it to others, because some people will learn from it. In machine learning, some methods are designed to try all possible permutations of a set of options and choose the one with the best performance. As the complexity of a model of your data grows, this kind of exhaustive search inevitably breaks down and you need to apply some heuristics to the problem. What people don’t mention is that these heuristics can come from anywhere, whether it’s a research paper, a book, a mentor, or even a five minute talk you saw on a Thursday night.

Like any good person who over-prepares for things, I read up a bit beforehand, which helped me come to this conclusion (and helped me think through public speaking in general). Here are some resources:

Oh yeah. I think my talk went pretty well. Here’s a link to my Google Drive slides or a PDF copy. If you’d like to chat about data or about my work, feel free to reach out to me via email or on Twitter.

Eigenlayouts


For a long time, I’ve been interested in web technology. In high school, I read Jesse Liberty’s Complete Idiot’s Guide to a Career in Computer Programming, learning about Perl, CGI (the Common Gateway Interface), HTML, and other technologies. It wasn’t until I finished a degree in mathematics that I really started learning the basics, namely HTML, CSS, and JavaScript.

At that point, folks were just starting to come out of the dark ages of table-based layouts and experimenting with separating content (HTML) from presentation (CSS) from behavior (JavaScript). A popular discussion was over what the best type of layout was. I remember reading debates about left-handed vs. right-handed vs. centered pages. (The latter is still a pain to implement, but the web has come a long way since then.) Those discussions stuck with me and motivate the study I’m running right now.

In mathematics, people have been using fundamental matrix algebra to accomplish some very interesting things. One application is image decomposition: breaking images into simpler, more basic components. A space of vectors (coordinates that can represent data points) has “eigenvectors,” a small set of representative vectors that can be combined to reconstruct the other vectors you have in front of you. In facial recognition, applying this technique to images of faces yields Eigenfaces, patterns that are common to many of the images.
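As a rough sketch of that idea (the names images, height, and width are placeholders here; the actual analysis is the subject of the next post), the eigenface-style components fall out of a PCA/SVD of the flattened, mean-centered images:

import numpy as np

# images: an (n_samples, height * width) array of flattened grayscale screenshots
centered = images - images.mean(axis=0)

# Rows of Vt are the principal components; reshaped, they are the "eigenlayouts"
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
eigenlayouts = Vt.reshape(-1, height, width)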

Coming back to the idea of website layouts, I reasoned there must be a way to see this in real web data somehow. If we took images of the most popular websites on the web, what would their “eigenlayouts” look like overall? Would certain layouts (left, right, or centered) just pop out of the data? Well, to answer that question, we need some data, we need to analyze it, and then we need to interpret it.

Data Retrieval

To run the analysis on these websites, we need to turn them into images. For that, I turned to PhantomJS, a headless browser geared toward rendering websites. For the sake of having a portable solution (able to run just about anywhere without needing too much bootstrapping), I decided to use a community-submitted Docker image that I found, which did the job nicely. Basic usage has you specifying the website you want to render as well as the output filename for the image. You can additionally pass in arguments for the page resolution. I went with 800px by 1200px because it’s a sensible minimum that people design websites for.

When I need to run a lot of commands and don’t need a whole programming language, I typically turn to GNU make for defining a pipeline of work and for parallelizing it.

IMAGEDIR := images/
DOMAINDIR := domains/

DOMAINLIST := $(shell find $(DOMAINDIR) -type f)
IMAGELIST := $(DOMAINLIST:=.png)
IMAGELIST := $(IMAGELIST:$(DOMAINDIR)%=$(IMAGEDIR)%)

# Docker volume requires an absolute path
VOLUME := $(abspath $(IMAGEDIR))

all: $(IMAGELIST)

%.png:
    mkdir -p $(IMAGEDIR)
    $(eval DOMAIN := $(patsubst $(IMAGEDIR)%.png,%,$@))
    sudo docker run -t --rm -v $(VOLUME):/raster-output herzog31/rasterize http://$(DOMAIN) "$(notdir $@)" 1200px*800px 1.0

# This rule will create 10,000 dummy files in the $(DOMAINDIR) directory,
# one per domain taken from the second column of top-1m.csv
mkdomains:
    mkdir -p $(DOMAINDIR)
    cut -d, -f2 top-1m.csv | head -10000 | xargs -n1 -I{} touch "$(DOMAINDIR){}"

The gist of this is that you first run make mkdomains to create a directory, domains/, filled with dummy targets for each domain you want to look up (10,000 of them take up about 80MB of space). Then, running make uses that seeded directory of domains to pull each one down and drop the image file in the images/ directory. You can file all of this under “stupid Makefile hacks.”

Next Steps

So far, this covers a very small portion of the 80% of data science that’s cleaning and munging data. The next blog post will focus on loading and analyzing this image data in Python using the scikit-image library.

If you liked this post, please consider subscribing to updates and following me on Twitter.

Memory Profiling in Python


Data Scientists often need to sharpen their tools. If you use Python for analyzing data or running predictive models, here’s a tool to help you avoid those dreaded out-of-memory issues that tend to come up with large datasets.

Enter memory_profiler for Python

This memory profiler was designed to assess the memory usage of Python programs. It’s cross-platform and should work on any modern Python version (2.7 and up).

To use it, you’ll need to install it (using pip is the preferred way).

pip install memory_profiler

Once it’s installed, you’ll be able to use it as a Python module to profile memory usage in your program. To hook into it, you’ll need to do a few things.

First you’ll need to decorate the methods for which you want a memory profile. If you’re not familiar with a decorator, it’s essentially a way to wrap a function you define within another function. In the case of memory_profiler, you’ll wrap your functions in the @profile decorator to get deeper information on their memory usage.

If your function looked like this before:

def my_function():
    """Runs my function"""
    return None

then the @profile decorated version would look like:

@profile
def my_function():
    """Runs my function"""
    return None

It works because your program runs within a special context where memory_profiler can measure and store line-by-line statistics for the decorated functions. To invoke it, run your command with the flag -m memory_profiler. That looks like:

python -m memory_profiler <your-program>

Profiling results

To see what the results look like, I produced some sample code snippets.

While these examples are contrived, they illustrate how tracing memory usage in a program can help you debug problems in your code.

Filename: run.py
Line #    Mem usage    Increment   Line Contents
================================================
     8   77.098 MiB    0.000 MiB   @profile
     9                             def infinite_loading():
    10                                 """Exceeds available memory"""
    11 2935.297 MiB 2858.199 MiB       train = pd.read_csv('big.csv')
    12 2935.297 MiB    0.000 MiB       while True:
    13 4968.926 MiB 2033.629 MiB           new = pd.read_csv('big.csv')
    14                                     train = pd.concat([train, new])


Traceback (most recent call last):
...
MemoryError

Above we have a pretty obvious logical error, namely we’re loading a file into memory and repeatedly appending its data onto another data structure. However, the point here is that you’ll get a summary of usage even if your program dies because of an out-of-memory exception.

When should you think about profiling?

Premature optimization is the root of all evil – Donald Knuth

It’s easy to get carried away with optimization. Honestly, it’s best not to start off by immediately profiling your code. It’s often better to wait for an occasion when you need help. Most of the time, I follow this workflow:

First, try to solve the problem as best as you can on a smaller sample of the actual dataset (the key here is to use a small enough dataset so that you have seconds between when it starts and finishes, rather than minutes or hours).

Then, include your entire dataset to see how that runs. At this point, based on your sample runs you should have a) an idea of how long the full dataset should take to run and b) an idea of how much memory it will use. Keep that in mind.

Next, you have to monitor it running, which (for simplicity) could mean one of three outcomes:

  • It finishes successfully
  • It runs out of memory
  • It’s taking too long to run

You should start thinking about profiling your code if you encounter either of the latter two cases. In the case of memory overuse, running the memory profiler will show you which lines are allocating more memory than you expect.

From there, you can take a look at whether you need to encode your variables differently. For example, maybe you’re interpreting a numeric variable as a string and thus using more RAM. Or it could be time to offload your work to a larger server with enough space.
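As a quick, hypothetical illustration of that kind of re-encoding (the column names here are made up), pandas makes it easy to check per-column memory and downcast where it’s safe:

import pandas as pd

df = pd.read_csv('big.csv')
print(df.memory_usage(deep=True))  # per-column bytes, including string contents

# Hypothetical columns: a numeric ID that was read in as strings, and a
# low-cardinality text column that shrinks considerably as a category
df['user_id'] = pd.to_numeric(df['user_id'], downcast='integer')
df['state'] = df['state'].astype('category')

print(df.memory_usage(deep=True))  # compare the footprint after re-encoding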

If the algorithm is taking too long, there are a number of options to try out, which I’ll cover in a later post.

Concluding remarks

You just saw how to run some basic memory profiling in your Python programs. Running out of memory while analyzing a dataset is one of the primary hurdles that people encounter in practice. The memory_profiler package isn’t the only one available, so check out some of the others in the Further Reading section below.

If you liked this post, please share it on Twitter or Facebook and follow me @mathcass.

Using Jupyter on Remote Servers


As a data scientist, it really helps to have a powerful computer nearby when you need it. But even with an i7 laptop with 16GB of RAM, you’ll sometimes need more power. Whether your task is compute-constrained or memory-constrained, you’ll eventually find yourself looking to the cloud for more resources. Today I’ll outline how to be more effective when you have to compute remotely.

I like to refer folks to this great article on setting up SSH configs. Not only will a good SSH configuration file simplify the way you access servers, it can also help you streamline the way you work on them.
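As a minimal example (the host name, user, and key path below are placeholders), an entry in ~/.ssh/config can bundle the connection details, and even the port forwarding used later in this post, behind a short alias:

Host workhorse
    HostName workhorse.example.com
    User cass
    IdentityFile ~/.ssh/id_rsa
    # Forward local port 8888 to the same port on the remote machine
    LocalForward 8888 127.0.0.1:8888

With that in place, ssh workhorse does everything the longer command below does.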

I find Jupyter to be a superb resource for writing reports and displaying graphics of data. It essentially lets you run code in your web browser. However, one issue with using it on a remote machine is that you may not be able to access the interface, because the server blocks the port needed to see it on the web (this is a great thing for security and prevents others from seeing your work). There’s a way to work around this by using SSH’s ability to forward ports.

To do that, first you’ll need to log into your remote machine:

ssh -L 8888:127.0.0.1:8888 <remote host>

That means you’re connecting to your remote host, and any time you access port 8888 on your local machine, SSH forwards the connection to port 8888 on the remote machine (the 127.0.0.1 in the command refers to the remote machine itself).

Then, you’ll need to start Jupyter:

cd <project>
jupyter notebook

Finally, head to the url http://localhost:8888 to find yourself accessing the remotely running copy of your notebook.

Here’s a screenshot of what that should look like.


Note the highlighted line on the right. There isn’t a web browser installed on my remote machine, but I was still able to access this notebook from my local computer.

Whether you want to load a 50GB data frame into Pandas or use n_jobs=-1 in scikit-learn, you should find yourself better able to do your work.

Follow me on Twitter @mathcass.

Bloom Filters in Application


Today we’re going to talk about what a Bloom filter is and discuss some of the applications in data science. In a later post, we’ll build a simple implementation with the goal of learning more about how they work.

What is a Bloom Filter?

A Bloom filter is a probabilistic data structure. Let’s break that term down. Any time you hear the word “probabilistic” the first thing that should come to mind is “error.” That is, it sometimes has errors. When you hear “data structure” you should think about “space,” more specifically storage space or memory.

Bloom filters are designed to answer questions of set membership, that is, “is this item one of X?” Here are some simple questions you might be working with if you were considering them:

  • Have we seen this email address sign up to our site recently?
  • Is this a product the user has bought before?
  • Are the registrations we’re seeing from this IP address on a whitelist of IPs that we can trust?

Basically, it compresses simple set membership information into a smaller memory footprint at the cost of a bit of error.

By design, Bloom filters only implement two types of operations, add and contains. So, once you’ve added a member to it, you wouldn’t be able to remove it. Additionally, you wouldn’t be able to query for a list of elements in it.
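Just to make that interface concrete, here is a toy sketch in Python (a realistic implementation, and how to size the bit array and number of hashes, is the subject of the later post):

import hashlib

class BloomFilter:
    """A toy Bloom filter: m bits and k hash functions derived from md5."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions by salting the item with an index
        for i in range(self.k):
            digest = hashlib.md5("{}:{}".format(i, item).encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def contains(self, item):
        # A False here is definitive; a True may be a false positive
        return all(self.bits[pos] for pos in self._positions(item))

So bf = BloomFilter(); bf.add('user@example.com'); bf.contains('user@example.com') returns True, while an address that was never added usually, though not always, returns False.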

So, why would you be okay with error in data in exchange for using less memory? We can answer that question (as well as better understand the use-case of set membership) with a few useful examples.

How are Bloom filters used?

It would be hard not to mention this Medium article on the subject, because it clarifies how Bloom filters apply to a data science task. At Medium, they have some version of a distributed data store. Distributing data helps you scale information out across servers, but it also tends to increase the variance of your response times (read: more requests might take longer to finish). Most web companies focus on making the user experience as pleasant as possible, so they value response time. In this case, one particularly important request was for the set of articles a given user had read before.

At the risk of retelling an already well-told story, suffice it to say that they used a Bloom filter to avoid recommending articles the user had already read. This was a case where they had a data science model (the story recommendation engine) and augmented it with a component that could use compressed information (the Bloom filter) to keep the user from seeing the same article twice. Moreover, even though there’s potential for error in that filter, the error is negligible (in the sense that the user’s experience isn’t hindered by it). To summarize, because they could efficiently represent the set of “articles this user has read before,” and because it had a defined error rate, they could improve their user experience.

As another example, think about buying items on Amazon.com. Amazon has almost any item imaginable and probably also distributes its data for scale. They’re still able to tell you, right on the product page, whether you’ve bought an item before and when. I don’t have any insight into what’s going on behind the scenes, but this is another perfect place for a Bloom filter, either at the user level (holding the set of products someone has bought) or at the product level (holding the set of all users who have bought the product). A negative match (which will be correct 100% of the time) means you don’t need to perform the database lookup to see if someone bought the item. A positive match (which will be comparatively rare, and may occasionally be a false positive) is the only time you go confirm against the actual transaction data.

Finally, I wanted to point out a use case I found by perusing various implementations of the tool. Bloom filters can also be used to track time-dependent information (or various forms of time series data). For instance, you could store aggregate-level information, like whether someone bought a particular product within a given time horizon (the last 30 or 90 days), in a Bloom filter. Then, based on that information, you can make modeling decisions like what sort of ads to show that person.

I hope this post helped you learn a little bit about Bloom filters. In a later post, I’ll go into some detail on how they’re implemented with a focus on pedagogy. In the meantime, follow me on WordPress or Twitter for updates.

Additional links to check out

  • A Python package from Moz on using Redis as a backend (they were using it to ensure they didn’t crawl the same websites multiple times)
  • Another Python implementation optimized for scaling
  • A very detailed article of several other probabilistic data structures
  • A great bitly post on the subject as well as their own implementation (which also supports removing set members)