ISSUE 445 · July 25, 2023
Test-Driven Data Analysis (TDDA) is an approach to quality for data and for data analysis pipelines. In this podcast interview, Nick Radcliffe introduces the philosophy and the associated Python library, TDDA, that he's been developing since 2016.
Code for Thought podcast
RudderStack Profiles takes the SQL grunt work out of building customer profiles. You specify the customer traits, then Profiles runs the joins and computations for you to create complete profiles, so you can build better models, faster. And that’s just one use case. Get all the details.
Especially with the popularity of LLMs, vector databases are all the rage these days. Which should you choose? Here's a great overview of nine popular options for Python, including strengths of each, sample code, and useful links along the way.
One of the biggest superpowers of visualization is to lead a reader’s eye to the data you want to emphasize. By using color to create a visual hierarchy, you can decide what your readers see first, second, third, and last. This is a great post with lots of examples along the way.
Datawrapper | Lisa Charlotte Muth
In this online course, you'll learn how to develop and deploy an end-to-end data application with SQL, Python and Jupyter notebooks. Covers exploratory data analysis, SQL basics, workflow reproducibility, data pipelines, deployment, and more.
Nice introduction to Cloud-Based Geospatial Analysis using Google's Earth Engine and the geemap Python package. Covers the basics of Earth Engine data types and how to visualize, analyze, and export Earth Engine data in a Jupyter environment using geemap.
SciPy 2023 | Qiusheng Wu and Steve Greenberg
Awesome salary calculator, based on a compensation prediction model that was built for a Kaggle competition in 2022. This was made specifically for people who work with data and there are a lot of options, including a variety of positions, locations, experience levels, and more.
Jobs In Data
This cookbook for R users offers a side-by-side comparison of polars, R base, dplyr, tidyr and data.table for common tasks and problems.
Exhaustive and concise — a truly Pythonic cheat sheet for the Python programming language.
This is a collection of 8 challenging puzzles about training large language models (or really any NN) on many, many GPUs. The goal of these puzzles is to get hands-on experience with the key primitives and to understand the goals of memory efficiency and compute pipelining.
GitHub | Sasha Rush