ISSUE 450 · August 20, 2023

Posts & Tutorials

Handle big, ugly and bad CSV files
The CSV file format is one of the most common formats for storing and exchanging data, but it has issues. In this deep dive, Andrea Borruso explores the format, its problems, how to use DuckDB to analyze CSV files, and finally, why the Parquet format is a good alternative. Note that the link goes to an automated English translation, which is pretty good but not perfect. The original is in Italian.

R for Sign Language Linguistics
A nice introduction to sign language data and how to work with it using R. Most people don't give much thought to sign language data, but there's actually a lot going on in this space, including an international data science workshop next month called Autumn School.

How to design useful color keys
A carefully designed color key can mean the difference between readers glancing at your visualization and deciding it’s too hard to figure out, and readers actually reading it. This post shows how to create useful, truthful, easily skimmable color keys, starting with simple tricks and ending with a collection of complex, clever, and fun color keys.

Sponsored Link

Amazon Bedrock offers access to multiple generative AI models
The emergence of open-source LLMs has brought with it the potential for toxic outputs. Amazon Bedrock, the latest step in the company’s ongoing effort to democratize ML, uses Amazon’s Titan FM to help customers detect and remove harmful content in inputs and filter model outputs.

Tools & Code

Code Llama, a state-of-the-art LLM for coding
Code Llama is a new state-of-the-art LLM designed to generate code from text prompts. It can also generate text about code, produce code completions, and help debug code, all of which can help you write more robust and well-documented software. It's free to use and works with a variety of languages, including Python, C++, JavaScript and more.

Dataherald
Dataherald is an open-source SQL engine that understands natural language. It's designed for enterprise-level Q&A and can be hosted locally, giving non-SQL business users the ability to answer ad hoc questions on their own.

Papers

Computational reproducibility of Jupyter notebooks
Of the 27,271 Jupyter notebooks reviewed, which were associated with 3,467 publications, only 1,203 ran without any errors. And of those, only 879 produced the expected results. This paper dives into the issues, highlights trends, and offers suggestions to improve Jupyter-related workflows.

A Survey on LLM-based Autonomous Agents
A great survey paper exploring the landscape and the possibilities for using large language models to power autonomous agents. It covers the construction of LLM-based agents, as well as a summary of applications in the social sciences, natural sciences, and engineering.

Resources

Geographic Data Science with Python
This book covers the tools, methods, and theory for solving geographic problems with data. It starts with a "Building Blocks" section that lays the groundwork for geographic thinking and then dives into a variety of topics in spatial data, mapping, and spatial statistics. Free to download.

Python for Data Science
This new online book will teach you how to load, transform, visualize, and understand your data using Python. The book is inspired by R for Data Science and assumes that readers have some coding experience but are new to data science.