— In the News —
This story gets right to the heart of why ethics matter so much in data science. It's essentially a profile of the data scientist behind the Cambridge Analytica technology used to weaponize massive amounts of Facebook data, and it reads like a thriller.
The General Data Protection Regulation (GDPR) goes into effect on May 25, and even though it's a European regulation, it will affect businesses throughout the world. This article in Wired offers a good overview of what to expect and, if you have a business, what you need to know.
A recent survey of over 16,000 data professionals reveals the most common challenges data scientists face in the workplace. It's a short article with some useful insights.
— Sponsored Link —
In November of 2017, a group of leaders from realms including data science, data journalism, academia, analytics, and the Semantic Web gathered to contemplate a fascinating question: What is the most effective, ethical, and modern approach to data teamwork?
The Manifesto for Data Practices describes a simple, powerful, and attainable model for improving data teamwork in any organization. Please read and sign it if you support the vision.
— Tools and Techniques —
Ever get stuck trying to reproduce a model? Maybe even your own model! This post by Pete Warden explores why it's so hard to get reproducibility right, why it matters, and ultimately, what the data science community needs.
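Reproducibility involves far more than code, but one concrete first step is pinning every source of randomness before training. A minimal Python sketch of that step (illustrative only, not code from the post):

    import random
    import numpy as np

    SEED = 42
    random.seed(SEED)     # Python's built-in RNG
    np.random.seed(SEED)  # NumPy's global RNG
    # Deep learning frameworks keep their own RNGs and need their own
    # seeds, e.g. torch.manual_seed(SEED) in PyTorch.

Seeds alone won't get you full reproducibility, which is part of the post's point, but they remove the easiest source of run-to-run drift.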
This collection of notebooks demonstrates basic machine learning algorithms in plain Python. All algorithms are implemented from scratch without using additional machine learning libraries. The goal here is to show how the algorithms work.
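In that same from-scratch spirit, here's what a complete algorithm can look like in plain Python: a tiny k-nearest-neighbors classifier using nothing beyond the standard library (a hypothetical sketch for illustration, not code taken from the notebooks):

    from collections import Counter
    import math

    def euclidean(a, b):
        """Straight-line distance between two equal-length feature vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_predict(train_X, train_y, query, k=3):
        """Predict the label of `query` by majority vote among its k
        nearest training points."""
        # Pair each training point with its distance to the query.
        distances = sorted(
            (euclidean(x, query), label) for x, label in zip(train_X, train_y)
        )
        # Vote among the k closest labels.
        top_k = [label for _, label in distances[:k]]
        return Counter(top_k).most_common(1)[0][0]

    # Tiny usage example with two obvious clusters.
    X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    y = ["a", "a", "a", "b", "b", "b"]
    print(knn_predict(X, y, (2, 2)))  # -> "a"
    print(knn_predict(X, y, (9, 9)))  # -> "b"

Everything, the distance function, the neighbor search, the vote, is visible in a dozen lines, which is exactly why from-scratch implementations make such good teaching tools.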
If you've been working with any of the popular deep learning frameworks, you'll appreciate this GitHub repo that bills itself as a "Rosetta Stone" for deep learning frameworks.
— Datasets —
Digg Reader is a news aggregator that's collected most of what's been published online over the past five years. That amounts to ~12 billion pieces of content for a total of ~27 TB of data. The dataset could lend itself to a variety of uses, including research into fake news and advertising and, on a practical level, figuring out how to work with data at that scale. Check out the thread for ideas and access info.
— Data Viz —
This new project by Claus Wilke, a book on data visualization being written in the open, is definitely worth checking out. Chapters are posted as they're completed, and they're very well done.
In this tutorial, Timo Grossenbacher shows how to create categorical spatial interpolations from a set of georeferenced points using ggplot2 and the kknn package. The technique is very resource-intensive, so he also describes how he parallelized the processing across multiple CPU cores to increase performance. The end result is amazing.
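The tutorial itself is in R (ggplot2 plus kknn), but the core idea translates to any language: assign each cell of a regular grid the majority category of its k nearest georeferenced points, and since every cell is independent, fan the grid out across CPU cores. A hypothetical Python sketch of that pattern (not the author's code):

    from collections import Counter
    from multiprocessing import Pool

    # Toy labeled points: (x, y, category). In the tutorial these are
    # georeferenced survey points.
    POINTS = [(0.1, 0.2, "red"), (0.3, 0.8, "blue"), (0.9, 0.5, "red"),
              (0.7, 0.9, "blue"), (0.5, 0.1, "green")]
    K = 3

    def classify_cell(cell):
        """Majority category among the K points nearest to this grid cell."""
        cx, cy = cell
        nearest = sorted(POINTS,
                         key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
        votes = Counter(label for _, _, label in nearest[:K])
        return cx, cy, votes.most_common(1)[0][0]

    if __name__ == "__main__":
        # A 50x50 interpolation grid over the unit square.
        grid = [(i / 49, j / 49) for i in range(50) for j in range(50)]
        # Independent cells make this embarrassingly parallel.
        with Pool() as pool:
            interpolated = pool.map(classify_cell, grid)
        print(interpolated[:3])

Because the per-cell lookups share no state, Pool.map parallelizes them with no extra bookkeeping, which is essentially the same trick Grossenbacher uses to make the interpolation tractable.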