If you have a favorite data podcast, cast your vote here and we'll report the top picks in an upcoming issue of the newsletter 👉
In this session from last week's Microsoft Build, Andrej Karpathy describes the pipeline for training bots like ChatGPT. From there, he dives into practical techniques for using GPT effectively, including prompting, finetuning, tools, and things to expect. This is a great talk, but if you're short on time, see Alex Volkov's notes 👉
Microsoft Build | Andrej Karpathy — 43 minutes
Forcing data through a rectangle shapes the way we solve problems (e.g. dimensional fact tables, OLAP cubes). But most data isn't rectangular; it's hierarchical. In this talk, Lloyd Tabb describes a new data programming language that transcends the rectangle paradigm and breaks long-held misconceptions about the way we analyze data.
Data Council | Lloyd Tabb — 34 minutes
Data science teams face many challenges when trying to optimize their processes and ship research results and machine learning models faster. Datalore has become a game-changing solution for data teams across industries, enabling ergonomic data access, effortless collaboration, and easy reporting via Jupyter notebooks. Try Datalore for free
Vega-Lite is a high-level language for rapidly creating interactive visualizations. It includes support for a variety of data and visual transformations and doesn't need a lot of code. This multi-part tutorial introduces Vega-Lite and offers a variety of step-by-step examples.
Observable | Jon E. Froehlich
There are plenty of data formats supported by Pandas. Which should you choose and why?
Python⇒Speed | Itamar Turner-Trauring
The sink() function in R diverts R output to an external connection. This is useful for tasks such as exporting data to a file, logging R output, or debugging code. Here's how it works.
Steve’s Data Tips and Tricks
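For readers more at home in Python, here is a rough analogue of the same idea: temporarily diverting printed output somewhere other than the console. This is a minimal sketch using the standard library's contextlib.redirect_stdout, not a port of sink() itself. The capture_output and report names are hypothetical, chosen only for illustration.

```python
import contextlib
import io


def capture_output(func):
    """Run func() and return everything it printed to stdout as a string."""
    buffer = io.StringIO()
    # While the with-block is active, print() writes to buffer, not the console.
    with contextlib.redirect_stdout(buffer):
        func()
    # On exit, stdout is restored automatically.
    return buffer.getvalue()


def report():
    # Hypothetical stand-in for any code that prints results.
    print("rows: 3")
    print("mean: 2.0")


captured = capture_output(report)
```

To log to a file instead, the StringIO buffer could be swapped for an open file handle.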
As ChatGPT and other LLMs get thrust into the mainstream, more people outside of ML and NLP circles are trying to better understand Attention and the Transformer. Here are some answers to common questions, with a focus on conveying the intuition.
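To make the intuition concrete, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. It is not taken from the linked article, and the toy vectors are made up for illustration. The query is scored against each key, the scores are turned into weights with a softmax, and the output is the weighted average of the values.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Toy data (made up): three tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
```

Here the query lines up with the first and third keys, so those tokens receive more weight than the second and the output leans toward their values.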
The Google Advanced Data Analytics Professional Certificate is a 7-course series that focuses on building regression and machine learning models, applying statistical methods to investigate data, creating data visualizations, and communicating insights from data analysis to stakeholders. The series is run by Coursera and is free to get started.
Data Elixir Partner
Language models fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. You may have heard of "Chain of Thought" prompting to help overcome these issues. According to this paper, "Tree of Thoughts" prompting works much better on such tasks.
arXiv | Shunyu Yao, et al.
Researchers at Meta have shown that a remarkably capable LLM can be produced by fine-tuning a pretrained model on only 1,000 carefully curated examples. This could be a game-changer for researchers and small-scale developers.
arXiv | Chunting Zhou, et al.
A nice collection of data science interview questions and answers. There are 100+ questions here, covering machine learning, statistics, probability, Python, SQL, and more.
GitHub | Youssef Hosni