— In the News —
This New York Times investigation is a must-read deep dive into a data industry that's perfectly legal but should alarm us all. Via apps on our phones, dozens of companies log the movements of tens of millions of people and can show where we spend our days, who we spent last night with, where we worship, and whether we visit a psychiatrist. Follow the links for info about locking down your phone.
Medical applications are a hot growth area for AI, but in the "fail fast and fix things later" world of entrepreneurial tech, serious mistakes are being made.
— Sponsored Link —
To maximize the potential of ML models, high-quality data is key. When measuring the quality of labeled data, consider IoU (intersection over union), accuracy, recall, precision, and F1 score. An enterprise-grade labeling platform that employs a holistic annotation approach can scale quality annotation to complex tasks without compromising accuracy.
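For readers who want to see how those label-quality metrics are computed, here is a minimal sketch in plain Python. It assumes you have a trusted reference ("gold") labeling to compare annotators against; the function names `box_iou` and `label_quality` and the sample data are made up for illustration and do not come from any particular labeling platform.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned bounding boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_quality(reference: Sequence[int], labels: Sequence[int],
                  positive: int = 1) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 of labels against a reference set."""
    if len(reference) != len(labels):
        raise ValueError("reference and labels must have the same length")
    pairs = list(zip(reference, labels))
    tp = sum(1 for r, l in pairs if r == positive and l == positive)
    fp = sum(1 for r, l in pairs if r != positive and l == positive)
    fn = sum(1 for r, l in pairs if r == positive and l != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative data only: a gold labeling and one annotator's labels.
gold = [1, 1, 0, 0, 1]
annotator = [1, 0, 0, 1, 1]
print(label_quality(gold, annotator))
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

IoU applies to region annotations such as bounding boxes, while the other four apply to class labels. Which metric matters most depends on the task: recall if missed positives are costly, precision if false alarms are.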
— Tools and Techniques —
In his latest post, David Robinson starts with a SQL interview question and shows how to make the solution as computationally efficient as possible using a tidyverse approach. This is a good think-through of different approaches.
Great post on the Stitch Fix tech blog that uses a common marketing question to show how being driven by causal information is more effective than being naively data-driven.
Smart cities are powered by sensors that measure things like air pollution, traffic congestion, temperature, humidity and road quality. Covering a large area with fixed sensors is expensive, but what if sensors could be mobile? How many taxis would it take to effectively "scan" an entire city? This is a nice exploration of the problem and how to measure urban sensing.
— Resources —
This free course by Kristen Kehrer starts with simple database queries and continues through data cleaning and feature engineering. Includes videos, cheat sheets and an interactive SQL browser for following along.
This collection of Jupyter notebooks showcases best practices and examples for common scenarios involving text and language. The intention is to significantly reduce development time for both researchers and practitioners.
— Data Viz —
Consider the last few visualizations you encountered outside of scientific publications. Did they depict uncertainty? Probably not. Here's what Jessica Hullman discovered when she asked why.
Matplotlib is notoriously difficult to learn, but if you're already familiar with it, this is a great guide to help you level up your skills.
— Career —
This post is a hard reality check on the path to machine learning research. There's a lot of opportunity here, but the competition and stakes are high. This follow-up thread from Julian Togelius is also worthwhile.