Data Elixir

ISSUE 450 · August 20, 2023

 

Posts & Tutorials

Handle big, ugly and bad CSV files

The CSV file format is one of the most common formats for storing and exchanging data, but it has issues. In this deep dive, Andrea Borruso explores the format, its problems, how to use DuckDB to analyze CSV files, and finally, why the Parquet format is a good alternative. Note that the link goes to an automated English translation, which is pretty good but not perfect. The original is in Italian.
Andrea Borruso

 

R for Sign Language Linguistics

A nice introduction to sign language data and how to work with it using R. Most people don't give much thought to sign language data, but there's actually a lot going on in this space, including an international data science workshop next month called Autumn School.
Carl “Calle” Börstell

 

How to design useful color keys

A carefully designed color key can mean the difference between readers glancing at your visualization and deciding it’s too hard to figure out, and readers actually reading it. This post shows how to create useful, truthful, easily skimmable color keys, starting with simple tricks and ending with a collection of complex, clever, and fun color keys.
Datawrapper | Lisa Charlotte Muth

 

Sponsored Link

Amazon Bedrock offers access to multiple generative AI models

The emergence of open-source LLMs has raised the risk of toxic outputs. Amazon Bedrock, the latest step in the company’s ongoing effort to democratize ML, uses Amazon’s Titan FM to help customers detect and remove harmful content in inputs and filter model outputs.

 
 
 

Reach Data Elixir readers by sponsoring an issue.

 

Tools & Code

Code Llama, a state-of-the-art LLM for coding

Code Llama is a new state-of-the-art LLM designed to generate code from text prompts. It can also generate text about code, produce code completions, and help debug code, all of which can help you write more robust and well-documented software. It's free to use and works with a variety of languages, including Python, C++, JavaScript, and more.
Meta AI

 

Dataherald

Dataherald is an open-source SQL engine that understands natural language. It's designed for enterprise-level Q&A and can be hosted locally, giving business users who don't know SQL the ability to answer ad-hoc questions on their own.
GitHub | Dataherald

 

Papers

Computational reproducibility of Jupyter notebooks

In a review of 27,271 Jupyter notebooks associated with 3,467 publications, only 1,203 notebooks ran without any errors. Of those, only 879 produced the expected results. This paper dives into the issues, highlights trends, and offers suggestions to improve Jupyter-related workflows.
arXiv | Sheeba Samuel and Daniel Mietchen

 

A Survey on LLM-based Autonomous Agents

A great survey paper exploring the landscape and the possibilities for using large language models to power autonomous agents. It covers the construction of LLM-based agents and summarizes applications in the social sciences, natural sciences, and engineering.
arXiv | Lei Wang, et al.

 

Resources

Geographic Data Science with Python

This book covers the tools, methods, and theory for solving geographic problems with data. It starts with a "Building Blocks" section that lays the groundwork for geographic thinking and then dives into a variety of topics in spatial data, mapping, and spatial statistics. Free to download.
Sergio J. Rey, et al.

 

Python for Data Science

This new online book will teach you how to load, transform, visualize, and understand your data using Python. The book is inspired by R for Data Science and assumes that readers have some coding experience but are new to data science.
Arthur Turrell

 
 


 

Data Elixir, LLC
P.O. Box 21255
Boulder, CO 80308

Data Elixir® is curated and maintained by Lon Riesberg. If you have questions or suggestions, send a note!