ISSUE 388 · May 24, 2022

In the News

Using ML to Help Protect the Great Barrier Reef
In spite of the costs, machine learning has been successfully used in a variety of conservation projects around the world. Here's an inside look at how the Great Barrier Reef Foundation leveraged the latest technologies to survey, monitor and map reefs at scale.

Organizations

Don’t just run your data team like a product team, run it like a company that needs to scale
Data teams are always under-resourced, but simultaneously can be seen as an already expensive investment. Here are some ideas for getting the support your data team needs.

Sponsored Link

How to Capture Advantages by Investing in High-Quality Training Data
At the enterprise level, machine learning requires either large amounts of training data or a smaller set of extremely high quality data, as well as the infrastructure to support high data volumes. Consequently, labeling data through robust software or in partnership with an annotation service provider is critical to project success. Read more.

Tutorials, Projects & Opinions

How random forests really work
In this notebook tutorial, Jeremy Howard from fast.ai shows how Random Forests work, by building one from scratch, and then using it to submit to a Kaggle competition.

Visualizing multicollinearity in Python
Multicollinearity is when two or more features are correlated with each other in a dataset and it's important to identify and understand it prior to training predictive models. This post explores three ways to visualize multicollinearity, including pros/cons of each.

Marginalia
In the world of statistics, “marginal” means “additional,” or what happens to outcome variable y when explanatory variable x changes a little. This isn't short but it's a gentle introduction to all things marginal and how they work: marginal effects, marginal slopes, average marginal effects, marginal effects at the mean, and more.
Unlock Secret Knowledge from Data Experts for $10
Packt's Spring Sale is on and for a limited period, all eBooks and Videos are only $10. Our products are available as PDF, ePub, and MP4 files for you to download and keep forever. All the practical content you need - by developers for developers.

Resources

Software Development Resources for Data Scientists
Great collection of resources that will help data teams create reproducible and production-ready code and tools. This is a crowd-sourced collection covering project structure, automated testing, reproducible environments, and version control.

Mathematics for Machine Learning
This is a tightly curated collection of free books, videos, and papers for learning mathematics for machine learning. Covers all levels.

Code & Tools

LineaPy
LineaPy is a Python package for data scientists that makes it easy to go from prototype to production. Just add two lines of code and LineaPy will automatically capture, analyze, and transform messy data science code to production data pipelines. No refactoring or new tools needed.

NannyML
NannyML is an open-source Python library that estimates real-world model performance (without access to targets), detects data drift, and links data drift alerts to changes in model performance. It's easy to use, model-agnostic and supports all tabular binary classification use cases.

Obsidian Dataview
Dataview is a data index and query language over Markdown files. It's designed as an Obsidian plugin and will give you superpowers with your Obsidian Vaults. If you're not familiar with it, Obsidian is a free graph knowledge base that works on top of a local folder of Markdown files and is great for things like note taking, book development, ideation, etc.