— In the News —
Kaggle, the San Francisco startup that has spent the past five years organizing data prediction competitions, has cut about a third of its staff. Here's what that means for the data science community.
It's no secret that there's a lot of money in sports, and there's a lot of data too. If you're interested in both, opportunities abound. This is a nice overview of recent developments in sports analytics.
— Tools and Techniques —
Great tutorial by Lynn Cherny. Using Fifty Shades of Grey as a vehicle, this tutorial goes from labeled text to machine classification - first with NLTK and then the Python machine learning library scikit-learn. The explanations are clear, it's entertaining, and there's a repo of code to back it up. Highly recommended.
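To give a flavor of the "labeled text to machine classification" idea, here is a minimal sketch of a bag-of-words naive Bayes classifier. It is not the tutorial's code: it uses only the Python standard library as a stand-in for the classifiers NLTK and scikit-learn provide, and the tiny training set and the helper names (`tokenize`, `train`, `classify`) are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labeled_texts):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_texts:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # Start from the log prior: how common this label is overall.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids log(0).
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled examples, made up for this sketch.
TRAINING = [
    ("the plot was gripping and the writing was sharp", "positive"),
    ("a gripping story with sharp dialogue", "positive"),
    ("dull plot and clumsy writing", "negative"),
    ("clumsy dialogue and a dull story", "negative"),
]

model = train(TRAINING)
print(classify(model, "sharp and gripping"))
print(classify(model, "dull and clumsy"))
```

A real project would lean on a library for tokenization, feature extraction, and evaluation on held-out data, which is the ground the tutorial covers.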
If you haven't heard about Apache Spark yet, you will soon. Spark is an open-source cluster computing framework that avoids much of the disk I/O cost of Hadoop MapReduce by keeping working data in memory where it can. This lets users load large datasets into a cluster's memory and query them repeatedly. It's very fast and is especially well suited to iterative algorithms, like those used in machine learning. This is a great tutorial to get started with Spark in Python.
A gentle introduction to building a neural net with Python's NumPy library, with sample code.
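For readers who want a taste before clicking through, here is a rough sketch of what a small NumPy neural net can look like. It is not the linked article's code: the XOR task, the single hidden layer, and every name and hyperparameter below (`train_xor`, `hidden_units`, `learning_rate`, `epochs`, `seed`) are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation, squashing values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def train_xor(hidden_units=4, learning_rate=0.5, epochs=5000, seed=0):
    """Train a one-hidden-layer network on XOR with batch gradient descent."""
    rng = np.random.default_rng(seed)

    # The four XOR input pairs and their targets.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    # Randomly initialised weights, zero biases.
    W1 = rng.normal(size=(2, hidden_units))
    b1 = np.zeros((1, hidden_units))
    W2 = rng.normal(size=(hidden_units, 1))
    b2 = np.zeros((1, 1))

    losses = []
    for _ in range(epochs):
        # Forward pass.
        hidden = sigmoid(X @ W1 + b1)
        output = sigmoid(hidden @ W2 + b2)
        losses.append(float(np.mean((output - y) ** 2)))

        # Backward pass for squared error; constant factors are
        # folded into the learning rate.
        d_output = (output - y) * output * (1 - output)
        d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)

        # Gradient descent updates.
        W2 -= learning_rate * (hidden.T @ d_output)
        b2 -= learning_rate * d_output.sum(axis=0, keepdims=True)
        W1 -= learning_rate * (X.T @ d_hidden)
        b1 -= learning_rate * d_hidden.sum(axis=0, keepdims=True)

    # Predictions come from the last forward pass, before the final update.
    return output, losses


predictions, losses = train_xor()
print("first loss:", losses[0])
print("last loss:", losses[-1])
print(predictions.round(3))
```

How closely the predictions approach 0, 1, 1, 0 depends on the random initialisation and the number of epochs, so treat the settings above as a starting point rather than tuned values.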
— Resources —
Linear algebra is a crucial prerequisite for most data-related work. If you're just starting out or want a refresher, there are some really great resources here.
DARPA's mission is to create breakthrough technologies for national security, so it's no surprise that they're into data. But you might be surprised to discover that they're helping to develop a substantial open-source library of "tools and techniques to process and analyze large sets of imperfect, incomplete data." There's a lot to explore here...
— Data Viz —
Measles is back in the US – and it's spreading.
This story has been making headlines around the world lately. And the response by the news industry has been consistent: create interactive data visualizations! This is an awesome opportunity to see a variety of visual approaches that come out of digging into the same story. All four of these are really great:
Ever wonder how to show the data you're not so sure about? Visualizing uncertainty is a hard problem. Andy Kirk of Visualizing Data put together this great collection of references, papers and examples to explore the issue and help you decide how to handle it in your next project.