— In the News —
It might seem easy to sort good science from bad, but the reality is complicated. For results to be believable, everything needs to be out in the open, including the raw data, the code, and the details of how the data was processed and analyzed. From there, decision-makers have to figure out how to interpret the uncertainties. This article in FiveThirtyEight explores the issues and how science is being turned against itself.
Great article in Distill that explores how AI-powered user interfaces can give people new tools for reasoning. This is a worthwhile longread that includes interactives to play with.
— Profiles —
Wes McKinney built the basics of pandas in 2008 and made the project public in 2009. The following year, it started being discovered and Wes made the decision to drop out of grad school to work on pandas full-time. That turned out to be an important move for Wes and, for the data science community, it was pivotal. Here's the backstory to what has become one of the most important tools in data science.
— Tools and Techniques —
If you're new to machine learning, definitely check out this slide deck by Jason Mayes at Google. It's a fun and well-organized approach to learning, and you decide how deep to go. Do you want the green pill or the blue pill?
SQL window functions offer a lot of flexibility in cases where you might otherwise be tempted to write hack-ish workarounds. Window functions are readable, performant, and easy to debug. This tutorial describes what they are and shows how to use them.
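For a quick taste of what window functions look like, here is a minimal sketch that is not taken from the linked tutorial. It assumes Python's built-in sqlite3 module backed by SQLite 3.25 or newer (the first release with window functions), and the `sales` table, its columns, and the sample rows are all made up for illustration. It computes a per-region running total and a per-region rank without collapsing rows the way GROUP BY would.

```python
import sqlite3

# Hypothetical sample data: (region, day, amount)
SAMPLE_ROWS = [
    ("east", 1, 10),
    ("east", 2, 30),
    ("east", 3, 20),
    ("west", 1, 5),
    ("west", 2, 15),
]


def running_totals(rows):
    """Return each row with a per-region running total and amount rank."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        query = """
            SELECT region,
                   day,
                   amount,
                   -- running total within each region, ordered by day
                   SUM(amount) OVER (
                       PARTITION BY region ORDER BY day
                   ) AS running_total,
                   -- rank within each region, largest amount first
                   RANK() OVER (
                       PARTITION BY region ORDER BY amount DESC
                   ) AS amount_rank
            FROM sales
            ORDER BY region, day
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in running_totals(SAMPLE_ROWS):
        print(row)
```

Every input row survives in the output, and the two OVER clauses add the running total and the rank alongside it. Getting the same result without window functions typically takes a self-join or a correlated subquery.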
The typical analytics stack is moving away from monolithic solutions that try to do everything. This post on Mode's blog explores a better approach.
Even if you don't work with it every day, Excel is a widely used data tool that's worth being familiar with. This tutorial shows how to extract data from Excel files into pandas so you can work with it, and how to write data back to Excel files so your boss can understand it. It covers a variety of Excel topics, including multiple sheets, headers, skipping records, reading a subset of columns, and pivot tables.
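As a rough idea of the round trip described above, here is a small sketch that is not taken from the tutorial. It assumes pandas and the openpyxl engine are installed, and the sheet names, column names, and values are invented for illustration. It writes two sheets to an in-memory workbook, then reads them back while skipping a row and selecting a subset of columns.

```python
import io

import pandas as pd


def excel_round_trip():
    """Write two sheets to an in-memory .xlsx file and read them back."""
    # Hypothetical data, for illustration only.
    sales = pd.DataFrame(
        {"region": ["east", "west"], "units": [10, 5], "notes": ["a", "b"]}
    )
    targets = pd.DataFrame({"region": ["east", "west"], "target": [12, 6]})

    buffer = io.BytesIO()
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        # startrow=1 leaves a blank first row, like a title row in a real file.
        sales.to_excel(writer, sheet_name="sales", index=False, startrow=1)
        targets.to_excel(writer, sheet_name="targets", index=False)

    # Read one sheet: skip the blank row and keep only two columns.
    buffer.seek(0)
    sales_back = pd.read_excel(
        buffer,
        sheet_name="sales",
        skiprows=1,
        usecols=["region", "units"],
        engine="openpyxl",
    )

    # sheet_name=None returns a dict of DataFrames keyed by sheet name.
    buffer.seek(0)
    all_sheets = pd.read_excel(buffer, sheet_name=None, engine="openpyxl")
    return sales_back, all_sheets


if __name__ == "__main__":
    sales_back, all_sheets = excel_round_trip()
    print(sales_back)
    print(sorted(all_sheets))
```

With a real file, you would pass a path such as a local .xlsx filename in place of the in-memory buffer.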
— Deep Learning —
A key research team at Google just released a research paper that shows how machine-learned indexes can replace B-Trees, Hash Indexes, and Bloom Filters. According to the paper, the learned indexes can be significantly faster than current methods, require far less space, and are well suited to running on GPUs. Ultimately, "replacing core components of a data management system through learned models has far reaching implications for future systems designs." This is ground-breaking work.
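To make the core idea concrete, here is a toy sketch. It is not the paper's method: a single least-squares line stands in for the learned model, and the class and method names are hypothetical. The model predicts where a key sits in a sorted array, the largest prediction error seen while building is recorded, and a lookup only has to binary-search that small window around the prediction.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: a fitted line predicts a key's position."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("keys must not be empty")
        self.keys = sorted(keys)
        n = len(self.keys)

        # Least-squares fit of position ~ slope * key + intercept.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum(
            (k - mean_key) * (i - mean_pos) for i, k in enumerate(self.keys)
        )
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key

        # Worst-case prediction error over the stored keys.
        self.max_err = max(
            abs(i - self._predict(k)) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        """Predicted position, clamped to a valid array index."""
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None."""
        pos = self._predict(key)
        lo = max(0, pos - self.max_err)
        hi = min(len(self.keys), pos + self.max_err + 1)
        # Binary search only inside the error window around the prediction.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


if __name__ == "__main__":
    index = LinearLearnedIndex([3, 8, 15, 16, 23, 42, 57, 91])
    print(index.lookup(23), index.lookup(50))
```

The better the model fits the key distribution, the smaller the error window and the less searching each lookup needs. The model itself is only a slope, an intercept, and an error bound, which hints at where the space savings come from.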
Twitter was on fire last week with comments and announcements from the thirty-first Annual Conference on Neural Information Processing Systems (NIPS 2017). Here are the highlights, including lots of linked references:
— Data Viz —
There are many types of maps for displaying data. Common strategies often focus on a single variable, but those don't always work when you have multiple variables to present at the same time. In this post, Jim Vallandingham explores the options for multivariate data presentation.
— In Case You Missed It —
Be sure to catch the most popular links from last week's issue...
— About —
Data Elixir is curated and maintained by @lonriesberg. If some awesome person forwarded this issue to you, subscribe for free at dataelixir.com and get it delivered every week.