— In the News —
How machines think has a lot to teach us about how we think.
A data scientist at Twitter reflects on what a "data scientist" really does.
This is more than a history of data science at the New York Times. This talk by Chris Wiggins dives into the roots of "data science" as a discipline and where things are going. Great talk.
Whether you're part of a startup or an established organization, you'll find insights here for deciding which data is important, organizing a team, making decisions with data, and scaling data science to reach all areas of your organization.
Great article about why science is so easy to get wrong. It's very well written and, with the p-value hacking simulator included, it's really a must-read.
Here's how Slack sifted through massive amounts of data to find the signal in the noise as it grew to over 1 million daily users.
— Tools and Techniques —
Fantastic slide deck by Jake VanderPlas. This deck replaces some of the theory and jargon of statistics with intuitive computational approaches. There are some fundamentals you need to know, but the overall theme here is that if you can write a for-loop, you can do statistical analysis.
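To give a flavor of the "for-loop" idea, here is a minimal sketch of a shuffle (permutation) test in plain Python. It is not taken from the deck; the function name `shuffle_test`, its parameters, and the sample numbers are all made up for illustration.

```python
import random
from statistics import mean


def shuffle_test(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Pretend the group labels are meaningless: reshuffle and re-split.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Fraction of shuffles that look at least as extreme as the real split.
    return at_least_as_extreme / n_shuffles


if __name__ == "__main__":
    # Made-up scores for two groups.
    a = [84, 72, 57, 46, 63, 76, 99, 91]
    b = [81, 69, 74, 61, 56, 87, 69, 65]
    print(shuffle_test(a, b))
```

A small result suggests the observed gap would rarely arise if the labels did not matter; no distribution tables or formulas are needed, only repeated shuffling.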
Awesome interactive visualization that demonstrates the basics of machine learning. This is a MUST PLAY-WITH article!
This is a great introduction to popular data mining algorithms. Each section includes a description of the algorithm, related terms, common use cases, and linked references.
Trey Causey describes what data scientists really need to know about writing code for quality, understandability, and reusability. It's a short list of essential practices and is a MUST READ.
Think you're a good problem solver? This short quiz may surprise you.
Great tutorial that teaches backpropagation via a simple Python example. This is very well explained and includes worthwhile suggestions for further learning.
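For readers who want a taste before clicking through, here is a minimal sketch of backpropagation on a tiny network with one sigmoid hidden unit and one linear output. It is not the tutorial's code; the function names, starting weights, and learning rate are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    h = sigmoid(w1 * x)  # hidden activation
    y = w2 * h           # linear output
    return h, y


def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2


def backprop(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    h, y = forward(w1, w2, x)
    d_y = y - target          # dL/dy
    d_w2 = d_y * h            # dL/dw2
    d_h = d_y * w2            # dL/dh
    d_z = d_h * h * (1 - h)   # through the sigmoid
    d_w1 = d_z * x            # dL/dw1
    return d_w1, d_w2


def train(w1, w2, x, target, lr=0.5, steps=200):
    """Plain gradient descent on a single made-up training example."""
    for _ in range(steps):
        d_w1, d_w2 = backprop(w1, w2, x, target)
        w1 -= lr * d_w1
        w2 -= lr * d_w2
    return w1, w2


if __name__ == "__main__":
    w1, w2 = train(0.3, -0.2, x=1.5, target=0.8)
    print(loss(0.3, -0.2, 1.5, 0.8), loss(w1, w2, 1.5, 0.8))
```

Real networks do the same thing with matrices instead of single numbers, but the backward pass follows the same pattern: multiply local derivatives from the output back toward the inputs.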
Overview of key streaming algorithms and how to work with them. This article was adapted from a talk given by Ted Dunning, the Chief Applications Architect for MapR.
— Resources —
This 10-page PDF is a comprehensive set of notes summarizing important probability concepts, formulas, and distributions, with examples, stories, and solved problems.
Nice collection of curated IPython notebooks for Data Science. These notebooks cover topics for Spark, Hadoop MapReduce, HDFS, AWS, Kaggle, scikit-learn, matplotlib, pandas, NumPy, SciPy, and various command-line tools.
Awesome collection of public datasets on GitHub. There are currently links to 356 datasets, covering a broad range of domains. This collection is well organized, is regularly updated, and has been starred 6700 times!
This is a free HTML version of a new Python book that's been getting rave reviews. It's billed as a guide to "practical programming for total beginners," so you won't find things like NumPy and SciPy here, but you will find practical stuff like web scraping, parsing PDFs and Word docs, updating spreadsheets, working with email, scheduling tasks, pattern matching, and manipulating images.
16 free data science books covering statistics, Python, machine learning, the data science process, and more.
Curated collection of YouTube videos for learning about machine learning, neural nets, and deep learning. There are a lot of great picks here and some are just a few minutes long.
— Data Viz —
Fantastic collection of 55 useful tools for data visualization. This collection is curated by the folks at Datavisualization.ch and is well organized and easy to search.
I don't typically link to course syllabi but this one is particularly worthwhile. This is for Jeffrey Heer's Data Visualization course at the University of Washington. The course is very well organized and most everything is available, including required readings, course slides, and assignments. This is a fantastic collection of Data Visualization MUST READS and, with the assignments and slides too, it's a great resource for self-study.