Bayes’ Theorem is the FizzBuzz of Data Science


The other day, I did a technical interview that involved applying Bayes’ Theorem to a simple example. It stumped me, and it left me feeling empathy for folks who have had trouble with the FizzBuzz interview question.

It doesn’t reflect day-to-day work

FizzBuzz depends on understanding a few concepts, like conditional execution, the modulus operator, divisibility of numbers, and common denominators. Every programmer should be familiar with the modulus operator and its relationship to divisibility, but knowing about it doesn’t mean it’s part of your bread and butter. The day-to-day of software engineering usually takes place at the higher level of understanding good design patterns, parsing requirements, and using the APIs of a team’s platform or framework. Diving down to a lower level to reproduce divisibility from simple mathematics is a shift in perspective, and it takes several mental cycles to get right if you’re out of practice.

It’s stressful to problem solve on the spot

Unless you’re made of steel, you’ll probably get some form of jitters during an interview and this can hurt the way you solve problems.

In 2005, Beilock & Carr published a paper comparing how undergraduate students with high working memory (HWM) and low working memory performed on math problems. The HWM group, which performed very well on simple mathematical problems under baseline, low-stress conditions, performed significantly worse under high-stress conditions. (Incidentally, Beilock & Carr used a modular arithmetic task in their experiment, the same concept that is integral to FizzBuzz.)

Add the stress of trying to explain your solution to an interviewer and this is a recipe for a meltdown.

Experience teases out edge cases, but experience fades with time

One of the really tricky parts about FizzBuzz is that it involves an edge case that can trip people up. Namely, when the number is 15, it’s possible to print out all three of “Fizz,” “Buzz,” and “FizzBuzz” if you’re not careful.
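Here is a minimal sketch of one common way to handle that edge case: check divisibility by 15 before checking 3 or 5, and return as soon as one branch matches. The function name `fizzbuzz` and the choice to return strings rather than print them are just assumptions for illustration.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:  # divisible by both 3 and 5, so this check must come first
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


# The first fifteen lines of output, ending with the edge case at 15
lines = [fizzbuzz(i) for i in range(1, 16)]
```

Because each branch returns immediately, 15 can only ever produce one word. Using independent `if` statements that each print is how all three words end up on the screen.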

The problem with this is that edge cases aren’t over-arching principles or theoretical concepts; they’re anomalies. And people learn how to deal with anomalies by experiencing them. At work, edge cases get discovered in a handful of ways, most likely because either a) a developer on the team has come across the case before or b) there’s a deliberate, likely time-consuming effort to enumerate possible cases and test them out. As I’ve mentioned above, case (a) isn’t very likely because the edge cases of FizzBuzz aren’t top-of-mind, and case (b) presents problems because of the stress of interviewing.

Bayes’ Theorem

Bayes’ Theorem (or Bayes’ Rule) is just tricky enough to behave like FizzBuzz in these situations. It’s certainly something that every data scientist should know, but it isn’t something he or she would use every day. It’s complicated enough to be relegated to the class of abstract math problems, which take a hit during stressful situations. And finally, for edge cases in applying Bayes’ Theorem (like whether two events are independent), it’s difficult and unlikely for individuals to come up with suitable examples to test immediately.
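For reference, here is a small worked sketch of the rule, P(A|B) = P(B|A) P(A) / P(B), using a made-up diagnostic test. The numbers (a 1% base rate, a 90% true-positive rate, a 5% false-positive rate) and the function name `posterior` are assumptions chosen for illustration, not the example from my interview.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) via Bayes' Theorem, expanding P(B) over A and not-A."""
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# 1% of people have the condition; the test fires for 90% of them
# and for 5% of everyone else.
p_sick_given_positive = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)

# Edge case: the test fires equally often either way, so the result is
# independent of the condition and the posterior equals the prior.
p_uninformative = posterior(prior=0.01, likelihood=0.30, false_positive_rate=0.30)
```

With these numbers the posterior works out to 2/13, or about 15%, which is the kind of counterintuitive result that is easy to fumble under pressure. The second call shows the independence edge case: the evidence tells you nothing, and you get the 1% prior back.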


Using Deep Learning is Easier Than You Think


I came across a great article on using the Deep Learning Python package tflearn to perform inference on some classic datasets in Machine Learning like the MNIST dataset and the CIFAR-10 dataset. As it turns out, these types of models have been around for quite a while for various tasks in image recognition. The particular case of the CIFAR-10 dataset was solved by a neural network very similar to the one from the mentioned post. The general idea of using convolutional neural networks dates back to Yann LeCun’s 1998 paper on digit recognition.

Since I have a lot of experience working with data but not a lot of experience working with deep learning algorithms, I wondered how easy it would be to adapt these methods to a new, somewhat related, problem: going through thousands of my photos to identify pictures of my cat. Turns out, it was easier than I thought.

[Image: 2012-12-30 09.58.28.jpg]

This is definitely a cat photo

Training a similar neural network on my own visual data just amounted to connecting up the inputs and the outputs properly. In particular, any input image has to be re-scaled down (or up) to 32×32 pixels. Similarly, your output must be binary and should represent membership of either of the two classes.

The main difficulty involves creating your dataset. This really just means going through your images and classifying a subset of them by hand. For my own run at this, all I did was create a directory like:


I put any cat photos I found into the cat directory while putting any non-cat photographs in the other folder. I tried to keep the same number of images in both directories to try to avoid any class imbalance problems. Then again, this wasn’t as much of a concern since roughly half my photos are cat photos anyway.
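As a rough sketch of what that folder layout buys you, the snippet below walks a root directory with one subfolder per class and turns it into file paths and integer labels, using only the standard library. This is not the tflearn helper; the function names `label_images` and `class_counts`, the list of image extensions, and the idea that folders are named after their class are all assumptions for illustration.

```python
import os

# File extensions treated as images (an assumption; adjust for your photos)
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def label_images(root):
    """Return (paths, labels, class_names) for a root with one subfolder per class."""
    class_names = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    paths, labels = [], []
    for label, name in enumerate(class_names):
        folder = os.path.join(root, name)
        for filename in sorted(os.listdir(folder)):
            if os.path.splitext(filename)[1].lower() in IMAGE_EXTENSIONS:
                paths.append(os.path.join(folder, filename))
                labels.append(label)
    return paths, labels, class_names


def class_counts(labels, class_names):
    """Count images per class to spot class imbalance before training."""
    return {name: labels.count(i) for i, name in enumerate(class_names)}
```

Calling `class_counts` on the result is a quick way to confirm that the two folders hold roughly the same number of images.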

From there, tflearn has a helper method that lets you create an HDF5 dataset from your directory of images with a simple function. The X & Y values from that data structure can be used as the inputs to the deep learning model.

By using around 400 images (roughly 200 for each class), my classifier achieved about an 85% accuracy rate on a validation set of data. For my purposes, namely just automatically tagging potential photos of my cat, this was accurate enough. Any effort to increase the accuracy of this would probably involve some combination of:

  • adding more training data by putting images into my class folders
  • changing the shape of the network by adding more layers or more nodes per layer
  • using a pre-trained model to bootstrap the learning process

That’s all it really takes. If you know a bit of Python and can sort a few of your photos into folders based on their categories, you can get started using sophisticated deep learning algorithms on your own images.

You can find the code for this on my account at Github. If you want to chat or reach out at all, follow me on Twitter @mathcass.

Communicate about your work


I gave a lightning talk earlier this month to the PyData Atlanta Meetup. I’ve given hour-long talks on technical subjects before, but I hadn’t done anything quite that concise. This fact freaked me out quite a bit. I wanted to reflect a bit on why it’s always a good idea to communicate more about what you do.

No matter how mundane or “been done before” you believe your work is, there’s value in showing it to others because some people will learn from it. In machine learning, some methods are designed to try all possible permutations of a set of options to choose the one with the best performance. As the complexity of a model of your data grows, inevitably this tree search method breaks down and you need to apply some heuristics to the problem. What people don’t mention is that these heuristics can come from anywhere, whether it’s a research paper, a book, a mentor, or even a five-minute talk you saw on a Thursday night.

Like any good person who over-prepares for things, I read up a bit on it which helped me come to this conclusion (and that helped me think through public speaking in general). Here are some resources:

Oh yeah. I think my talk went pretty well. Here’s a link to my Google Drive slides or a PDF copy. If you’d like to chat about data or about my work, feel free to reach out to me via email or on Twitter.



For a long time, I’ve been interested in web technology. In high school, I read Jesse Liberty’s Complete Idiot’s Guide to a Career in Computer Programming and learned about Perl, CGI (common gateway interface), HTML, and other technologies. It wasn’t until I finished a degree in mathematics that I really started learning the basics, namely HTML, CSS, and JavaScript.

At that point, folks were just starting to come out of the dark ages of table-based layouts and experimenting with separating content (HTML) from presentation (CSS) from behavior (JavaScript). A popular discussion was over what the best type of layout was. I remember reading discussions of left-handed vs right-handed vs centered pages. (The latter is still a pain to implement, but the web has come a long way since then.) Those discussions stuck with me and motivate the study I’m running right now.

In mathematics, people have been using fundamental matrix algebra to help them accomplish some very interesting things. One application is in image decomposition: basically, breaking images into simpler, more basic components. A collection of vectors (or coordinates that can represent data points) has “eigenvectors” associated with it, which are vectors that you can use to reconstruct the other vectors that you have in front of you. In facial recognition, applying this technique to images of faces yields Eigenfaces, patterns that are common to many of the images.
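To make the eigenvector idea a little more concrete, here is a toy sketch in pure Python. It treats each “image” as a short list of pixel values, builds the covariance matrix of the pixels, and uses power iteration to find the dominant eigenvector. The three-pixel images, the function names, and the choice of power iteration are all assumptions for illustration; a real analysis would use a numerical library rather than hand-rolled loops.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def covariance_matrix(rows):
    """Sample covariance between pixels; each row is one flattened image."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    vector = normalize([1.0] * len(matrix))
    for _ in range(iterations):
        vector = normalize(mat_vec(matrix, vector))
    eigenvalue = sum(a * b for a, b in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector


# Four toy "images" of three pixels: the first two pixels brighten together,
# the third never changes.
images = [[1, 1, 0], [2, 2, 0], [3, 3, 0], [4, 4, 0]]
eigenvalue, component = top_eigenvector(covariance_matrix(images))
```

Here the dominant component puts equal weight on the first two pixels and none on the third, which is the pattern the four images share. Eigenfaces (and eigenlayouts) are the same idea with thousands of pixels instead of three.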

Coming back to the idea of website layouts, I reasoned there must be a way to see this in real web data somehow. If we took images of the most popular websites on the web, how would their “eigenlayouts” look overall? Would certain layouts (like left or right or center) just pop out of the data somehow? Well, to answer that question, we need some data, we need to analyze it, and then we need to interpret it.

Data Retrieval

To run the analysis on these websites, we need to turn them into images. For that, I turned to PhantomJS, a headless browser geared toward rendering websites. For the sake of having a portable solution (able to run just about anywhere without needing too much bootstrapping), I decided to use a community-submitted Docker image that did the job nicely. Basic usage has you specifying the website you want to render as well as the output filename for the image. You can additionally pass in arguments for your webpage resolution. I went with 800px by 1200px because it’s a sensible minimum that people are creating websites for.

When I need to run a lot of commands and don’t need a whole programming language, I typically turn to GNU make for defining a pipeline of work and for parallelizing it out.

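IMAGEDIR := images/
DOMAINDIR := domains/

DOMAINLIST := $(shell find $(DOMAINDIR) -type f)
# One PNG per seeded domain file (assumed; the original target lines are missing)
IMAGES := $(patsubst $(DOMAINDIR)%,$(IMAGEDIR)%.png,$(DOMAINLIST))

# Docker volume requires an absolute path
VOLUME := $(abspath $(IMAGEDIR))

.PHONY: all mkdomains

# Assumed default target: render every seeded domain
all: $(IMAGES)

# Assumed pattern rule: images/<domain>.png is built from domains/<domain>
$(IMAGEDIR)%.png: $(DOMAINDIR)%
	mkdir -p $(IMAGEDIR)
	$(eval DOMAIN := $(patsubst $(IMAGEDIR)%.png,%,$@))
	sudo docker run -t --rm -v $(VOLUME):/raster-output herzog31/rasterize http://$(DOMAIN) "$(notdir $@)" "1200px*800px" 1.0

# This rule will create 10,000 files in the $(DOMAINDIR) directory
mkdomains:
	mkdir -p $(DOMAINDIR)
	cut -d, -f2 top-1m.csv | head -10000 | xargs -I{} touch "$(DOMAINDIR){}"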

The gist of this is that first you run make mkdomains to create a directory domains/ filled with dummy targets of each domain you want to look up (10,000 take up about 80MB of space). Then, running make will use that seeded directory of domains to pull each one down and drop the image file in the images/ directory. You can file all of this under “stupid Makefile hacks.”
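If the shell pipeline in the seeding step is hard to read, here is roughly the same logic as a Python sketch. It assumes that top-1m.csv has rows shaped like `rank,domain` (which is what `cut -d, -f2` implies); the function names `top_domains` and `image_path` are hypothetical.

```python
import csv


def top_domains(csv_file, limit):
    """Return the first `limit` domains from rows shaped like `rank,domain`."""
    domains = []
    for row in csv.reader(csv_file):
        if len(domains) >= limit:
            break
        if len(row) >= 2:
            domains.append(row[1])
    return domains


def image_path(domain, image_dir="images/"):
    """Mirror the Makefile's naming: images/<domain>.png."""
    return f"{image_dir}{domain}.png"
```

You could call `top_domains(open("top-1m.csv", newline=""), 10000)` to get the same list of names that the make rule turns into dummy files.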

Next Steps

So far, this covers a very small portion of the 80% of data science that’s cleaning and munging data. The next blog post will focus on loading and analyzing this image data in Python using the scikit-image library.

If you liked this post, please consider subscribing to updates and following me on Twitter.

Memory Profiling in Python


Data Scientists often need to sharpen their tools. If you use Python for analyzing data or running predictive models, here’s a tool to help you avoid those dreaded out-of-memory issues that tend to come up with large datasets.

Enter memory_profiler for Python

This memory profiler was designed to assess the memory usage of Python programs. It’s cross platform and should work on any modern Python version (2.7 and up).

To use it, you’ll need to install it (using pip is the preferred way).

pip install memory_profiler

Once it’s installed, you’ll be able to use it as a Python module to profile memory usage in your program. To hook into it, you’ll need to do a few things.

First you’ll need to decorate the methods for which you want a memory profile. If you’re not familiar with a decorator, it’s essentially a way to wrap a function you define within another function. In the case of memory_profiler, you’ll wrap your functions in the @profile decorator to get deeper information on their memory usage.

If your function looked like this before:

def my_function():
    """Runs my function"""
    return None

then the @profile decorated version would look like:

@profile
def my_function():
    """Runs my function"""
    return None

It works because your program runs within a special context, so it can measure and store relevant statistics. To invoke it, run your command with the flag -m memory_profiler. That looks like:

python -m memory_profiler <your-program>

Profiling results

To show what the results look like, I put together some sample code snippets.

While these examples are contrived, they illustrate how tracing memory usage in a program can help you debug problems in your code.

Line #    Mem usage    Increment   Line Contents
     8   77.098 MiB    0.000 MiB   @profile
     9                             def infinite_loading():
    10                                 """Exceeds available memory"""
    11 2935.297 MiB 2858.199 MiB       train = pd.read_csv('big.csv')
    12 2935.297 MiB    0.000 MiB       while True:
    13 4968.926 MiB 2033.629 MiB           new = pd.read_csv('big.csv')
    14                                     train = pd.concat([train, new])

Traceback (most recent call last):

Above we have a pretty obvious logical error, namely we’re loading a file into memory and repeatedly appending its data onto another data structure. However, the point here is that you’ll get a summary of usage even if your program dies because of an out-of-memory exception.

When should you think about profiling?

Premature optimization is the root of all evil – Donald Knuth

It’s easy to get carried away with optimization. Honestly, it’s best not to start off by immediately profiling your code. It’s often better to wait for an occasion when you need help. Most of the time, I follow this workflow:

First, try to solve the problem as best as you can on a smaller sample of the actual dataset (the key here is to use a small enough dataset so that you have seconds between when it starts and finishes, rather than minutes or hours).

Then, include your entire dataset to see how that runs. At this point, based on your sample runs you should have a) an idea of how long the full dataset should take to run and b) an idea of how much memory it will use. Keep that in mind.

Next, you have to monitor it while it runs, which (for simplicity) leads to one of three outcomes:

  • It finishes successfully
  • It runs out of memory
  • It’s taking too long to run

You should start thinking about profiling your code if you encounter either of the latter cases. In the case of overuse of memory, it will help to run the memory profiler to see which objects are taking up more memory than you expect.

From there, you can take a look at whether you need to encode your variables differently. For example, maybe you’re interpreting a numeric variable as a string and thus using more RAM. Or it could be time to offload your work to a larger server with enough space.
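Here is a small sketch of the encoding point using only the standard library: the same 10,000 numbers stored as Python strings versus as a packed array of doubles. The sample values and the helper name `total_size_of_strings` are assumptions for illustration; with Pandas the same idea shows up as a column's dtype.

```python
import sys
from array import array


def total_size_of_strings(values):
    """Bytes used by a list of strings: the list itself plus each string object."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


# The same numbers, once as text and once as packed 8-byte floats
as_strings = [str(i * 1.5) for i in range(10000)]
as_floats = array("d", (float(s) for s in as_strings))

string_bytes = total_size_of_strings(as_strings)
float_bytes = sys.getsizeof(as_floats)
```

Comparing `string_bytes` with `float_bytes` shows the numeric encoding using a fraction of the memory, since every string carries its own object overhead on top of its characters.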

If the algorithm is taking too long, there are a number of options to try out, which I’ll cover in a later post.

Concluding remarks

You just saw how to run some basic memory profiling in your Python programs. Running out of memory while analyzing a particular dataset is one of the primary hurdles that people encounter in practice. The memory_profiler package isn’t the only one available so check out some of the others in the Further Reading section below.
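As one example of an alternative, Python 3 ships with the tracemalloc module in its standard library. The sketch below is not memory_profiler; it is a minimal way to measure the peak memory allocated while a function runs. The helper names `peak_memory_bytes` and `build_rows` are hypothetical, and `build_rows` is just a stand-in for loading data.

```python
import tracemalloc


def build_rows(n):
    """Allocate a list of n small lists (a stand-in for loading data)."""
    return [[i, i * 2] for i in range(n)]


def peak_memory_bytes(func, *args):
    """Run func(*args) and return (result, peak bytes allocated while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
    finally:
        tracemalloc.stop()
    return result, peak


rows, peak = peak_memory_bytes(build_rows, 10000)
```

Unlike the line-by-line report above, this gives a single number per call, which can be enough to tell which step of a pipeline is the expensive one.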

If you liked this post, please share it on Twitter or Facebook and follow me @mathcass.

Using Jupyter on Remote Servers


As a data scientist, it really helps to have a powerful computer nearby when you need it. Even with an i7 laptop with 16GB of RAM in it, you’ll sometimes find yourself needing more power. Whether your task is compute or memory constrained, though, you’ll find yourself looking to the cloud for more resources. Today I’ll outline how to be more effective when you have to compute remotely.

I like to refer folks to this great article on setting up SSH configs. Not only will a good SSH configuration file simplify the way you access servers, it can also help you streamline the way you work on them.

I find Jupyter to be a superb resource for writing reports and displaying graphics of data. It essentially lets you run code in your web browser. However, one issue with using it on a remote machine is that you may not be able to access the interface because the server is blocking the port you need to reach it over the web (this is a great thing for security and prevents others from seeing your work). There’s a way to work through this by using SSH’s ability to forward ports.

To do that, first you’ll need to log into your remote machine:

ssh -L 8888:localhost:8888 <remote host>

That means you’re connecting to your remote host, except any time you access port 8888 on your local machine, the connection will be forwarded to the remote machine’s port 8888.

Then, you’ll need to start Jupyter:

cd <project>
jupyter notebook

Finally, head to the url http://localhost:8888 to find yourself accessing the remotely running copy of your notebook.
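If the page doesn’t load, it helps to know whether anything is listening on the local end of the tunnel at all. Here is a minimal sketch of that check using Python’s standard socket module; the function name `port_is_open` is hypothetical, and port 8888 is simply the one used in the commands above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `port_is_open("localhost", 8888)` on your local machine while the SSH session is up should return True. If it returns False, the tunnel isn’t up, so check the ssh command before debugging Jupyter itself.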

Here’s a screenshot of what that should look like.

[Screenshot from 2016-01-27 21:54:01]

Note the highlighted line on the right. There isn’t a web browser installed on my remote machine but I was still able to access this notebook by using my local computer.

Whether you want to load a 50GB data frame into Pandas or use n_jobs=-1 in Scikit Learn, you should find yourself more able to do your work.

Follow me on Twitter @mathcass.

Four useful books for learning Data Science

I was listening to an old episode of Partially Derivative, a podcast on data science and the news. One of the hosts mentioned that we’re now living in the “golden age of data science instruction” and learning materials. I couldn’t agree more with this statement. Each month, most publishers seem to have another book on the subject and people are writing exciting blog posts about what they’re learning and doing.

I wanted to outline a few of the books that helped me along the way, in the order I approached them. Hopefully, you can use them to gain a broader perspective of the field and perhaps as a resource to pass on to others trying to learn.

Learning from Data

I first found Learning from Data through Caltech’s course on the subject. I still think it’s an excellent text but I’m not sure if I would recommend it to the absolute beginner. (To someone who is just coming to the subject, I would probably recommend the next choice down on the list.)

However, I have a Master’s degree in mathematics so I was familiar with the background material in linear algebra and probability as well as the notation used. Learning from Data taught me that there was actual mathematical theory behind a lot of the algorithms employed in data science.

Most algorithms are chosen for their pragmatic application, but they also have features in and of themselves (such as how they bound the space of possible hypotheses about the data) that can help determine their effectiveness on data. There’s also a general theory for how to approach the analysis of these algorithms. At the time of reading, a lot of it was still a bit over my head, but it got me incredibly curious about the field itself.

Now, understanding a few things about the theory is great, but most of the time, people want to know what it can actually do.

Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management

I’ll admit to only having had a cursory understanding of what was possible before I read Data Mining Techniques. I knew that the most widely used algorithms were used for assessing risk, like credit scores. However, I didn’t know much about how you could make gains in the world of marketing using data science techniques.

I appreciated that the authors have a lot of experience in the field, especially experience that predates most of the growth in big data these days. This book makes it clear that many of the most useful algorithms have been around and in use for decades. The authors also offer some explanations from the direct marketing case (print magazines and physical mail) that I hadn’t considered, such as ranking algorithms, which were originally used to prioritize a list of people to contact because of the high costs of mailing paper to people.

More than anything, I liked the breadth of the topics, since they cover just about every form of marketing algorithm and do a great job of giving you a high level view of why they matter.

You won’t walk away from this book knowing how to implement everything they talk about, but you will get a sense for which algorithms are suited for particular tasks.

This book gave me a better way to think through the initial phases of a project, but I still needed some help in learning how to communicate about data and how to fit it directly into the business context.

Data Science for Business

I read through this one while I was on vacation (yes, I know, I’m that type of geek). That didn’t stop me from soaking up a lot of information from it about how data science applies to a company trying to use these models. Most of the book is focused on helping you think through how to operationalize the process of running and managing a data science project and what outcomes you might expect from the effort.

Beyond that, I think it taught me how to communicate better about data at a company. Being able to talk about the many months it will take to bring a project to fruition and weigh it against alternatives is the bread and butter of working at a company that wants to make money. Moreover, if you believe that a particular project is the right choice, you need to be able to back up that choice by communicating about the benefits.

I want to say that this is a very “bottom-line” type of book, but that’s okay to hear about some of the time. Data science doesn’t always have to be about the hottest technique or the biggest technology if your priorities include keeping your costs below your revenue. However, I still didn’t learn much about getting my hands dirty with the data on a day-to-day basis. For that, I had to rely on the final book I present.

Applied Predictive Modeling

This is a book on predictive modeling in R and on using a package that the author developed for doing that. This isn’t simply about someone tooting their own horn because caret is a quality piece of software. Overall, I think that even if you don’t end up using R as your go-to tool for analyzing data, you’ll still learn a lot from this book. It thoroughly demonstrates the power caret can offer you in a project, to the point that you’ll seek the same functionality in your tool of choice (or hopefully build its equivalent for us).

Caret is a package that offers a consistent interface for just about any predictive task (classification or regression) that you could ask for. One issue some people have with R packages is that the interface for algorithms isn’t very consistent. Learning how to use one package won’t always lead to the same understanding in a completely different package. Caret addresses that by giving you the same way to set up a modeling task for many different algorithms. Moreover, it also automates several tasks like:

  • Data splitting into training and test sets
  • Data transformations like normalization or power transforms
  • Model tuning and parameter selection

Essentially, it makes working in R a lot like using Scikit Learn (an excellent library itself) but with many more options and model implementations.
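To give a feel for the first two items on that list, here is a bare-bones Python sketch of a train/test split followed by a normalization that is fit on the training data only. It uses just the standard library and is not how caret or Scikit Learn implement these steps; the function names `train_test_split` and `fit_standardizer` and the toy data are assumptions for illustration.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_standardizer(values):
    """Return a function that z-scores values using the training mean and std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return lambda x: (x - mean) / std


values = [float(v) for v in range(1, 101)]
train, test = train_test_split(values)
scale = fit_standardizer(train)  # fit on the training set only
train_scaled = [scale(v) for v in train]
test_scaled = [scale(v) for v in test]  # reuse the training mean and std
```

The part worth noticing is that the test set is scaled with statistics learned from the training set. Keeping track of that by hand across many models is exactly the kind of bookkeeping a consistent interface takes off your plate.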

So that’s all you need, right? Just read a couple of books and you’re on your way? Not quite. You’ll actually have to apply some of this and learn from it. Perhaps next time you’re in a meeting discussing priorities for your company, you will need to frame the conversation about your next data project and direct the data effort toward your business goals (Data Science for Business). When you’re brainstorming possible things that you could try to predict and use in a marketing campaign, you will need to outline the possible techniques and what they could offer you (Data Mining Techniques). If you’re evaluating candidate algorithms for their ability to perform the task accurately, you will need to gauge their effectiveness from a theoretical (Learning from Data) and practical (Applied Predictive Modeling) standpoint.

I hope this helps you apply data science at work and gives you perspective in the field. Also, if you’re not a follower on Twitter, please follow me @mathcass.