— In the News —
This story got a lot of attention around the web this week. It tells the tale of an Ivy League scientist who massaged low-quality data into headline-grabbing results that went viral big time. But it's more than one researcher's story. In a lot of ways, this is a story about how misaligned incentives and a lack of reproducibility are breaking science.
New York City’s council recently passed a bill to enforce "algorithmic accountability." Nobody knows how that will work yet, but this post from an organization called AI Now offers a good starting point for making algorithms and their impacts more easily understandable. This effort is the first of its kind in the U.S. and will be closely watched.
— Profiles —
By pitting neural networks against one another, Ian Goodfellow has created a powerful AI tool. Now he, and the rest of us, must face the consequences.
— Sponsored Link —
Driverless AI speeds up data science workflows by automating feature engineering, model tuning, ensembling, and model deployment. Use Driverless AI to avoid common mistakes such as under- or overfitting, data leakage, or improper model validation. Try Driverless AI today: request a free 21-day trial.
— Tools and Techniques —
Here's a nice overview of Gartner’s 2018 report, "Magic Quadrant for Data Science and Machine Learning Platforms." From ~100 companies that sell data science software, Gartner selected 16 of the most important to rate on their vision and ability to execute. This overview by Robert A. Muenchen includes key developments and a link to an in-depth analysis.
This is a great post from Cloudera's engineering team that shows how to build a production-scale recommendation system. It covers overall system design, machine learning considerations, data transport, and model deployment.
Here's part 2 of Robert Chang's series about data engineering from the perspective of a data scientist. This part covers data modeling, star schema, data partitioning, Airflow, and ETL best practices. This series is a great introduction to the engineering concepts that data scientists should be familiar with. In case you missed it, here's part 1 >>
How much engineering should data scientists really need to know? The answer isn't always clear, but this new tool from Stitch Fix suggests it should be "as little as possible." The tool is called "Flotilla" and its purpose is to make it easy for data scientists to define and run containerized jobs without support from Engineering. It looks useful and, if you're working on a team, the "Philosophy" section at the end of this post will likely stir up some discussion.
People don't generally think about fonts when they think about data, but fonts are important for making sure that data is communicated correctly. This article offers suggestions for selecting fonts, along with considerations for size, weight, width, and spacing for a variety of use cases.
This new RStudio addin from David Ranzolin makes it easier to see what your code is doing at each step, particularly within the tidyverse. Highlight any pipe sequence and the addin generates numbered View tabs, so you can inspect the output of your code step by step.
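If you work outside R, the general idea of inspecting a pipeline one step at a time can be sketched in a few lines of Python. This is only an illustration of the concept, not the addin's implementation: the function `run_pipeline_with_snapshots` and the three toy steps are hypothetical names made up for this example.

```python
from typing import Any, Callable, List, Tuple


def run_pipeline_with_snapshots(
    value: Any, steps: List[Callable[[Any], Any]]
) -> List[Tuple[int, str, Any]]:
    """Apply each step in order and record the output after every step.

    Hypothetical helper for illustration only; it mimics the idea of
    numbered per-step views, not the addin's actual behavior.
    """
    snapshots = []
    for number, step in enumerate(steps, start=1):
        value = step(value)
        snapshots.append((number, step.__name__, value))
    return snapshots


# Three toy transformation steps (assumed for the example).
def keep_even(xs):
    return [x for x in xs if x % 2 == 0]


def square(xs):
    return [x * x for x in xs]


def total(xs):
    return sum(xs)


snapshots = run_pipeline_with_snapshots([1, 2, 3, 4, 5, 6], [keep_even, square, total])

# Print one numbered "view" per step, in pipeline order.
for number, name, output in snapshots:
    print(f"View {number} ({name}): {output}")
```

Each entry in `snapshots` holds the step number, the step's name, and that step's output, so an unexpected result can be traced back to the first step where the data stopped looking right.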