Communicate about your work

I gave a lightning talk earlier this month at the PyData Atlanta Meetup. I’ve given hour-long talks on technical subjects before, but I hadn’t done anything quite that concise. That fact freaked me out quite a bit, and I wanted to reflect on why it’s always a good idea to communicate more about what you do.

No matter how mundane or “been done before” you believe your work is, there’s value in showing it to others because some people will learn from it. In machine learning, some methods are designed to try every combination of a set of options and pick the one with the best performance. As the complexity of a model of your data grows, this exhaustive search inevitably breaks down and you need to apply some heuristics to the problem. What people don’t mention is that these heuristics can come from anywhere, whether it’s a research paper, a book, a mentor, or even a five-minute talk you saw on a Thursday night.
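As a minimal sketch of that exhaustive approach (with a made-up option grid and scoring function standing in for a real model’s hyperparameters and validation score, not any particular library):

```python
from itertools import product

# Hypothetical option grid; in practice these would be model hyperparameters.
options = {
    "depth": [1, 2, 3],
    "learning_rate": [0.1, 0.5],
}

def score(depth, learning_rate):
    # Stand-in for a validation score; pretend the sweet spot
    # is depth=2 with learning_rate=0.1.
    return -abs(depth - 2) - abs(learning_rate - 0.1)

# Try every combination of options and keep the best one.
best = max(
    (dict(zip(options, combo)) for combo in product(*options.values())),
    key=lambda params: score(**params),
)
print(best)  # {'depth': 2, 'learning_rate': 0.1}
```

With three depths and two learning rates this only tries six combinations, but the number of combinations multiplies with every new option you add, which is exactly where the exhaustive approach stops scaling and heuristics take over.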

Like any good person who over-prepares for things, I read up a bit on public speaking, which helped me come to this conclusion (and helped me think through public speaking in general). Here are some resources:

Oh yeah. I think my talk went pretty well. Here’s a link to my Google Drive slides or a PDF copy. If you’d like to chat about data or about my work, feel free to reach out to me via email or on Twitter.

Eigenlayouts

For a long time, I’ve been interested in web technology. In high school, I read Jesse Liberty’s Complete Idiot’s Guide to a Career in Computer Programming, learning about Perl, CGI (Common Gateway Interface), HTML, and other technologies. It wasn’t until I finished a degree in mathematics that I really started learning the basics, namely HTML, CSS, and JavaScript.

At that point, folks were just starting to come out of the dark ages of table-based layouts and experimenting with separating content (HTML) from presentation (CSS) from behavior (JavaScript). A popular discussion was over what the best type of layout was. I remember reading debates over left-handed vs. right-handed vs. centered pages. (The latter is still a pain to implement, but the web has come a long way since then.) Those discussions stuck with me and motivated the study I’m running right now.

In mathematics, people have been using fundamental matrix algebra to accomplish some very interesting things. One application is image decomposition: breaking images into simpler, more basic components. A collection of vectors (coordinates that can represent data points) has associated “eigenvectors,” a small set of directions that act as building blocks: each vector in your data can be reconstructed as a weighted combination of them. In facial recognition, applying this technique to images of faces yields Eigenfaces, patterns that are common to many of the images.
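A minimal sketch of that idea, assuming NumPy and a tiny synthetic data matrix in place of real screenshots: center the images, take the SVD (whose right singular vectors are the eigenvectors of the data’s covariance matrix), and reconstruct each image from the top component.

```python
import numpy as np

# Four tiny 2x2 "images", each flattened into a row of a data matrix.
images = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.9, 0.1, 0.9, 0.1],
    [0.0, 1.0, 0.0, 1.0],
    [0.1, 0.9, 0.1, 0.9],
])

# Center the data, then take the SVD; the rows of Vt are the
# eigenvectors of the covariance matrix -- the "eigenimages".
mean = images.mean(axis=0)
centered = images - mean
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Reconstruct each image using only the top eigenvector.
k = 1
weights = centered @ Vt[:k].T      # project onto the top eigenvector
approx = mean + weights @ Vt[:k]   # rebuild as mean + weighted component

print(np.allclose(approx, images))  # True: this toy data is rank-1
```

The toy matrix here is deliberately rank-1, so one eigenvector reconstructs it exactly; real screenshot data would need more components, and the hope is that the leading ones look like recognizable layouts.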

Coming back to the idea of website layouts, I reasoned there must be a way to see this in real web data somehow. If we took images of the most popular websites on the web, what would their “eigenlayouts” look like overall? Would certain layouts (left, right, or centered) just pop out of the data? Well, to answer that question, we need some data, we need to analyze it, and then we need to interpret it.

Data Retrieval

To run the analysis on these websites, we need to turn them into images. For that, I turned to PhantomJS, a headless browser geared toward rendering websites. For the sake of having a portable solution (able to run just about anywhere without much bootstrapping), I used a community-submitted Docker image I found that did the job nicely. Basic usage has you specify the website you want to render and the output filename for the image. You can additionally pass arguments for the page resolution. I went with 1200px by 800px because it’s a sensible minimum that people design websites for.

When I need to run a lot of commands and don’t need a whole programming language, I typically turn to GNU make for defining a pipeline of work and for parallelizing it.

IMAGEDIR := images/
DOMAINDIR := domains/

DOMAINLIST := $(shell find $(DOMAINDIR) -type f)
IMAGELIST := $(DOMAINLIST:=.png)
IMAGELIST := $(IMAGELIST:$(DOMAINDIR)%=$(IMAGEDIR)%)

# Docker volume requires an absolute path
VOLUME := $(abspath $(IMAGEDIR))

all: $(IMAGELIST)

%.png:
    mkdir -p $(IMAGEDIR)
    $(eval DOMAIN := $(patsubst $(IMAGEDIR)%.png,%,$@))
    sudo docker run -t --rm -v $(VOLUME):/raster-output herzog31/rasterize http://$(DOMAIN) "$(notdir $@)" 1200px*800px 1.0

# This rule will create 10,000 files in the $(DOMAINDIR) directory
mkdomains:
    mkdir -p $(DOMAINDIR)
    cut -d, -f2 top-1m.csv | head -10000 | xargs -n1 -I{} touch "$(DOMAINDIR){}"

The gist of this is that you first run make mkdomains to create a directory domains/ filled with dummy targets for each domain you want to look up (10,000 take up about 80MB of space). Then, running make uses that seeded directory of domains to pull each one down and drop the image file in the images/ directory. You can file all of this under “stupid Makefile hacks.”

Next Steps

So far, this covers a very small portion of the 80% of data science that’s cleaning and munging data. The next blog post will focus on loading and analyzing this image data in Python using the scikit-image library.

If you liked this post, please consider subscribing to updates and following me on Twitter.