Linear Digressions

By Ben Jaffe and Katie Malone

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.

Re-Release: Hurricane Forecasting
Now that hurricane season is upon us again (and we are on vacation), we thought a look back on our hurricane forecasting episode was prudent. Stay safe out there.
Jun 18, 2018
GDPR
By now, you have probably heard of GDPR, the EU's new data privacy law. It's the reason you've been getting so many emails about everyone's updated privacy policy. In this episode, we talk about some of the potential ramifications of GDPR in the world of data science.
Jun 11, 2018
Git for Data Scientists
If you're a data scientist, chances are good that you've heard of git, which is a system for version controlling code. Chances are also good that you're not quite as up on git as you want to be--git has a strong following among software engineers but, in our anecdotal experience, data scientists are less likely to know how to use this powerful tool. Never fear: in this episode we'll talk through some of the basics, and what does (and doesn't) translate from version control for regular software to version control for data science software.
Jun 03, 2018
Analytics Maturity
Data science and analytics are hot topics in business these days, but for a lot of folks looking to bring data into their organization, it can be hard to know where to start and what it looks like when they're succeeding. That was the motivation for writing a whitepaper on the analytics maturity of an organization, and that's what we're talking about today. In particular, we break it down into five attributes of an organization that contribute (or not) to their success in analytics, and what each of those attributes means and why it matters.  Whitepaper here:
May 20, 2018
SHAP: Shapley Values in Machine Learning
Shapley values in machine learning are an interesting and useful enough innovation that we figured hey, why not do a two-parter? Our last episode focused on explaining what Shapley values are: they define a way of assigning credit for outcomes across several contributors, originally to understand how impactful different actors are in building coalitions (hence the game theory background) but now they're being cross-purposed for quantifying feature importance in machine learning models. This episode centers on the computational details that allow Shapley values to be approximated quickly, and a new package called SHAP that makes all this innovation accessible.
May 13, 2018
Game Theory for Model Interpretability: Shapley Values
As machine learning models get into the hands of more and more users, there's an increasing expectation that a black box isn't good enough: users want to understand why the model made a given prediction, not just what the prediction itself is. This is motivating a lot of work into feature importance and model interpretability tools, and one of the most exciting new ones is based on Shapley Values from game theory. In this episode, we'll explain what Shapley Values are and how they make for a cool approach to feature importance for machine learning.
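To make the idea concrete, here's a minimal sketch (not from the episode) of computing exact Shapley values for a tiny two-player game by averaging each player's marginal contribution over every ordering; the payoff numbers are made up for illustration:

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over all orderings in which the coalition could be assembled."""
    contrib = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            after = value(frozenset(coalition))
            contrib[p] += after - before
    return {p: c / len(orderings) for p, c in contrib.items()}

# Hypothetical payoffs: A and B are worth more together than apart.
payoffs = {frozenset(): 0, frozenset({"A"}): 10, frozenset({"B"}): 20,
           frozenset({"A", "B"}): 50}
phi = shapley_values(["A", "B"], payoffs.get)
# The values sum to the grand-coalition payoff (the "efficiency" property).
```

In the feature-importance setting, "players" become features and the value function becomes a model's prediction with a subset of features present, which is why exact computation blows up combinatorially and approximations matter.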
May 07, 2018
AutoML
If you were a machine learning researcher or data scientist ten years ago, you might have spent a lot of time implementing individual algorithms like decision trees and neural networks by hand. If you were doing that work five years ago, the algorithms were probably already implemented in popular open-source libraries like scikit-learn, but you still might have spent a lot of time trying different algorithms and tuning hyperparameters to improve performance. If you're doing that work today, scikit-learn and similar libraries don't just have the algorithms nicely implemented--they have tools to help with experimentation and hyperparameter tuning too. Automated machine learning is here, and it's pretty cool.
Apr 30, 2018
CPUs, GPUs, TPUs: Hardware for Deep Learning
A huge part of the ascent of deep learning in the last few years is related to advances in computer hardware that make it possible to do the computational heavy lifting required to build models with thousands or even millions of tunable parameters. This week we'll pretend to be electrical engineers and talk about how modern machine learning is enabled by hardware.
Apr 23, 2018
A Technical Introduction to Capsule Networks
Last episode we talked conceptually about capsule networks, the latest and greatest computer vision innovation to come out of Geoff Hinton's lab. This week we're getting a little more into the technical details, for those of you ready to have your mind stretched.
Apr 16, 2018
A Conceptual Introduction to Capsule Networks
Convolutional nets are great for image classification... if this were 2016. But it's 2018 and Canada's greatest neural networker Geoff Hinton has some new ideas, namely capsule networks. Capsule nets are a completely new type of neural net architecture designed to do image classification on far fewer training cases than convolutional nets, and they're posting results that are competitive with much more mature technologies. In this episode, we'll give a light conceptual introduction to capsule nets and get geared up for a future episode that will do a deeper technical dive.
Apr 09, 2018
Convolutional Neural Nets
If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neural net. This episode is all about the architecture and implementation details of convolutional networks, and the tricks that make them so good at image tasks.
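The core operation is simpler than it sounds; here's a rough numpy sketch (not from the episode) of sliding a small kernel over an image, with a made-up edge-detector kernel applied to an image with a hard vertical edge:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image
    and take a weighted sum at each position. This is the core op that
    conv layers repeat with many learned kernels."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 4x4 image that is dark on the left and bright on the right,
# and a tiny kernel that responds to horizontal changes in brightness.
img = np.array([[0, 0, 1, 1]] * 4, dtype=float)
edge_kernel = np.array([[1.0, -1.0]])
response = conv2d(img, edge_kernel)
# The response is nonzero only at the column where the edge sits.
```

Real conv layers add padding, strides, many channels, and learned kernel weights, but the sliding-window weighted sum is the same.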
Apr 02, 2018
Google Flu Trends
It's been a nasty flu season this year. So we were remembering a story from a few years back (but not covered yet on this podcast) about when Google tried to predict flu outbreaks faster than the Centers for Disease Control by monitoring searches and looking for spikes in searches for flu symptoms, doctors appointments, and other related terms. It's a cool idea, but after a few years it turned into a cautionary tale of what can go wrong after Google's algorithm systematically overestimated flu incidence for almost 2 years straight. Relevant link:
Mar 26, 2018
How to pick projects for a professional data science team
This week's episode is for data scientists, sure, but also for data science managers and executives at companies with data science teams. These folks all think very differently about the same question: what should a data science team be working on? And how should that decision be made? That's the subject of a talk that I (Katie) gave at Strata Data in early March, about how my co-department head and I select projects for our team to work on. We have several goals in data science project selection at Civis Analytics (where I work), which can be summarized under "balance the best attributes of bottom-up and top-down decision-making." We achieve this balance, or at least get pretty close, using a process we've come to call the Idea Factory (after a great book about Bell Labs). This talk is about that process, how it works in the real world of a data science company and how we see it working in the data science programs of other companies. Relevant links:
Mar 19, 2018
Autoencoders
Autoencoders are neural nets that are optimized for creating outputs that... look like the inputs to the network. Turns out this is a not-too-shabby way to do unsupervised machine learning with neural nets.
Mar 12, 2018
When Private Data Isn't Private Anymore
After all the back-patting around making data science datasets and code more openly available, we figured it was time to also dump a bucket of cold water on everyone's heads and talk about the things that can go wrong when data and code are a little too open. In this episode, we'll talk about two interesting recent examples: a de-identified medical dataset in Australia that was re-identified so specific celebrities and athletes could be matched to their medical records, and a series of military bases that were spotted in a public fitness tracker dataset.
Mar 05, 2018
What makes a machine learning algorithm "superhuman"?
A few weeks ago, we podcasted about a neural network that was being touted as "better than doctors" in diagnosing pneumonia from chest x-rays, and how the underlying dataset used to train the algorithm raised some serious questions. We're back again this week with further developments, as the author of the original blog post pointed us toward new results. All in all, there's a lot more clarity now around how the authors arrived at their original "better than doctors" claim, and a number of adjustments and improvements as the original result was de/re-constructed. Anyway, there are a few things that are cool about this. First, it's a worthwhile follow-up to a popular recent episode. Second, it goes *inside* an analysis to see what things like imbalanced classes, outliers, and (possible) signal leakage can do to real science. And last, it raises a really interesting question in an age when computers are often claimed to be better than humans: what do those claims really mean? Relevant links:
Feb 26, 2018
Open Data and Open Science
One interesting trend we've noted recently is the proliferation of papers, articles and blog posts about data science that don't just tell the result--they include data and code that allow anyone to repeat the analysis. It's far from universal (for a timely counterpoint, read this article <>), but we seem to be moving toward a new normal where data science conclusions are expected to be shown, not just told. Relevant links:
Feb 19, 2018
Defining the quality of a machine learning production system
Building a machine learning system and maintaining it in production are two very different things. Some folks over at Google wrote a paper that shares their thoughts around all the items you might want to test or check for your production ML system. Relevant links:
Feb 12, 2018
Auto-generating websites with deep learning
We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurrent neural nets that generate descriptions and captions of the images. Our episode today tells a similar tale, except today we're talking about a blog post where the author fed in wireframes of a website design and asked the neural net to generate the HTML and CSS that would actually build a website that looks like the wireframes. If you're a programmer who thinks your job is challenging enough that you're automation-proof, guess again... Link to blog post:
Feb 04, 2018
The Case for Learned Index Structures, Part 2: Hash Maps and Bloom Filters
Last week we started the story of how you could use a machine learning model in place of a data structure, and this week we wrap up with an exploration of Bloom Filters and Hash Maps. Just like last week, when we covered B-trees, we'll walk through both the "classic" implementation of these data structures and how a machine learning model could create the same functionality.
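For reference, here's a minimal sketch of the "classic" Bloom filter (not the learned version): k hash functions set bits in a bit array, so membership tests can return false positives but never false negatives. The bit-array size and hash count below are arbitrary illustration values:

```python
import hashlib

class BloomFilter:
    """Classic Bloom filter: k hash functions each set one bit per item.
    "Maybe present" answers can be wrong; "absent" answers never are."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k independent-ish hash positions by salting one hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["btree", "hashmap", "bloom"]:
    bf.add(word)
# Everything added is always reported present; absent items are
# *probably* reported absent (false positives are possible).
```

The learned-index idea replaces the hash functions with a model that predicts membership, backed by a small overflow filter to eliminate false negatives.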
Jan 29, 2018
The Case for Learned Index Structures, Part 1: B-Trees
Jeff Dean and his collaborators at Google are turning the machine learning world upside down (again) with a recent paper about how machine learning models can be used as surprisingly effective substitutes for classic data structures. In this first part of a two-part series, we'll go through a data structure called b-trees. The structural form of b-trees makes them efficient for searching, but if you squint at a b-tree and look at it a little bit sideways then the search functionality starts to look a little bit like a regression model--hence the relevance of machine learning models. If this sounds kinda weird, or we lost you at b-tree, don't worry--lots more details in the episode itself.
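The "search looks like regression" intuition can be sketched in a few lines (our toy illustration, not the paper's implementation): fit a model that predicts a key's position in a sorted array, then do a small bounded search around the prediction. The error bound here is an arbitrary assumption for synthetic uniform keys:

```python
import numpy as np

# A sorted array of keys; a B-tree would navigate to a position for us.
keys = np.sort(np.random.RandomState(0).uniform(0, 1000, size=10_000))
positions = np.arange(len(keys))

# The learned-index idea in miniature: fit position ~ key.
slope, intercept = np.polyfit(keys, positions, deg=1)

def lookup(key, error_bound=500):
    """Predict the key's position, then binary-search a small window
    around the prediction to find the exact slot."""
    guess = int(slope * key + intercept)
    lo = max(0, guess - error_bound)
    hi = min(len(keys), guess + error_bound)
    return lo + int(np.searchsorted(keys[lo:hi], key))

# lookup(k) lands on the exact position of k in the sorted array.
```

The model is effectively learning the cumulative distribution of the keys, which is why it works so well on smoothly distributed data.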
Jan 22, 2018
Challenges with Using Machine Learning to Classify Chest X-Rays
Another installment in our "machine learning might not be a silver bullet for solving medical problems" series. This week, we have a high-profile blog post that has been making the rounds for the last few weeks, in which a neural network trained to visually recognize various diseases in chest x-rays is called into question by a radiologist with machine learning expertise. As it seemingly always does, it comes down to the dataset that's used for training--medical records assume a lot of context that may or may not be available to the algorithm, so it's tough to make something that actually helps (in this case) predict disease that wasn't already diagnosed.
Jan 15, 2018
The Fourier Transform
The Fourier transform is one of the handiest tools in signal processing for dealing with periodic time series data. Using a Fourier transform, you can break apart a complex periodic function into a bunch of sine and cosine waves, and figure out what the amplitude, frequency and offset of those component waves are. It's a really handy way of re-expressing periodic data--you'll never look at a time series graph the same way again.
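You can see the decomposition in action with a few lines of numpy (our illustration; the frequencies and amplitudes are made up): build a signal from two sine waves, transform it, and the component waves show up as peaks at exactly their frequencies and amplitudes:

```python
import numpy as np

# A signal made of two sines: 5 Hz (amplitude 1) and 12 Hz (amplitude 0.5),
# sampled at 100 Hz for one second.
fs = 100
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

# rfft: FFT for real-valued input; rfftfreq labels each bin in Hz.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
amplitudes = 2 * np.abs(spectrum) / len(signal)

# The two component waves stand out as peaks at 5 Hz and 12 Hz.
peaks = freqs[amplitudes > 0.1]
```

The phase (offset) of each component is recoverable too, from `np.angle(spectrum)`.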
Jan 08, 2018
Statistics of Beer
What better way to kick off a new year than with an episode on the statistics of brewing beer?
Jan 02, 2018
Re - Release: Random Kanye
We have a throwback episode for you today as we take the week off to enjoy the holidays. This week: what happens when you have a Markov chain that generates mashups of Kanye West lyrics and Bible verses? Exactly what you think.
Dec 24, 2017
Debiasing Word Embeddings
When we covered the Word2Vec algorithm for embedding words, we mentioned parenthetically that the word embeddings it produces can sometimes be a little bit less than ideal--in particular, gender bias from our society can creep into the embeddings and give results that are sexist. For example, occupational words like "doctor" and "nurse" are more highly aligned with "man" or "woman," which can create problems because these word embeddings are used in algorithms that help people find information or make decisions. However, a group of researchers has released a new paper detailing ways to de-bias the embeddings, so we retain gender info that's not particularly problematic (for example, "king" vs. "queen") while correcting bias.
Dec 18, 2017
The Kernel Trick and Support Vector Machines
Picking up after last week's episode about maximal margin classifiers, this week we'll go into the kernel trick and how that (combined with maximal margin algorithms) gives us the much-vaunted support vector machine.
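A quick way to see the kernel trick paying off (our sketch, using scikit-learn's SVC on a synthetic dataset): two concentric rings that no straight line can separate, where a linear maximal margin classifier fails but an RBF kernel succeeds:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: not linearly separable in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A plain maximal margin classifier is helpless here...
linear = SVC(kernel="linear").fit(X, y)

# ...but the RBF kernel implicitly maps the points into a space
# where the rings become linearly separable.
rbf = SVC(kernel="rbf").fit(X, y)

# linear.score(X, y) hovers near chance; rbf.score(X, y) is near perfect.
```

The kernel never computes the high-dimensional mapping explicitly; it only needs inner products between points, which is the whole trick.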
Dec 11, 2017
Maximal Margin Classifiers
Maximal margin classifiers are a way of thinking about supervised learning entirely in terms of the decision boundary between two classes, and defining that boundary in a way that maximizes the distance from any given point to the boundary. It's a neat way to think about statistical learning and a prerequisite for understanding support vector machines, which we'll cover next week--stay tuned!
Dec 04, 2017
Re - Release: The Cocktail Party Problem
Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!
Nov 27, 2017
Clustering with DBSCAN
DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters, you can specify "dense" regions in your data, and grow those regions out organically to find clusters. In particular, it can fit irregularly-shaped clusters, and it can also identify outlier points that don't belong to any of the clusters. Pretty cool!
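The two parameters in question are a neighborhood radius (`eps`) and a density threshold (`min_samples`); here's a short scikit-learn sketch on made-up data with two blobs and one stray point:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(42)
# Two dense blobs plus one far-away outlier point.
blob1 = rng.normal(loc=[0, 0], scale=0.2, size=(50, 2))
blob2 = rng.normal(loc=[5, 5], scale=0.2, size=(50, 2))
outlier = np.array([[10.0, -10.0]])
X = np.vstack([blob1, blob2, outlier])

# eps: neighborhood radius; min_samples: points needed to call it "dense".
labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)

# Two clusters are grown out of the dense regions, and the stray point
# gets the special label -1, meaning "noise / belongs to no cluster".
```

Note that unlike k-means you never specify the number of clusters; it falls out of the density structure.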
Nov 20, 2017
The Kaggle Survey on Data Science
Want to know what's going on in data science these days?  There's no better way than to analyze a survey with over 16,000 responses that was recently released by Kaggle.  Kaggle asked practicing and aspiring data scientists about themselves, their tools, how they find jobs, what they find challenging about their jobs, and many other questions.  Then Kaggle released an interactive summary of the data, as well as the anonymized dataset itself, to help data scientists understand the trends in the data.  In this episode, we'll go through some of the survey toplines that we found most interesting and counterintuitive.
Nov 13, 2017
Machine Learning: The High Interest Credit Card of Technical Debt
This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the idea of technical debt: the inefficiencies that crop up in the code when you're trying to go fast. You take shortcuts, hard-code variable values, skimp on the documentation, and generally write not-that-great code in order to get something done quickly, and then end up paying for it later on. This is technical debt, and it's particularly easy to accrue with machine learning workflows. That's the premise of this episode's paper.
Nov 06, 2017
Improving Upon a First-Draft Data Science Analysis
There are a lot of good resources out there for getting started with data science and machine learning, where you can walk through starting with a dataset and ending up with a model and set of predictions. Think something like the homework for your favorite machine learning class, or your most recent online machine learning competition. However, if you've ever tried to maintain a machine learning workflow (as opposed to building it from scratch), you know that taking a simple modeling script and turning it into clean, well-structured and maintainable software is way harder than most people give it credit for. That said, if you're a professional data scientist (or want to be one), this is one of the most important skills you can develop. In this episode, we'll walk through a workshop Katie is giving at the Open Data Science Conference in San Francisco in November 2017, which covers building a machine learning workflow that's more maintainable than a simple script. If you'll be at ODSC, come say hi, and if you're not, here's a sneak preview!
Oct 30, 2017
Survey Raking
It's quite common for survey respondents not to be representative of the larger population from which they are drawn. But if you're a researcher, you need to study the larger population using data from your survey respondents, so what should you do? Reweighting the survey data, so that things like demographic distributions look similar between the survey and general populations, is a standard technique and in this episode we'll talk about survey raking, a way to calculate survey weights when there are several distributions of interest that need to be matched.
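Raking is iterative proportional fitting: cycle through the margins, rescaling weights so each one matches its population target until they all do at once. A toy sketch (our own, with made-up respondents and targets):

```python
import numpy as np

def rake(weights, memberships, targets, n_iters=50):
    """Iteratively rescale survey weights so each margin (e.g. gender,
    age group) matches its population target. Classic raking / IPF."""
    w = weights.astype(float).copy()
    for _ in range(n_iters):
        for groups, target in zip(memberships, targets):
            for group, goal in target.items():
                mask = groups == group
                w[mask] *= goal / w[mask].sum()
    return w

# Four hypothetical respondents, described by gender and age group.
gender = np.array(["m", "m", "f", "f"])
age = np.array(["young", "old", "young", "old"])
weights = np.ones(4)

# Population margins to match: 40% men / 60% women, 70% young / 30% old.
w = rake(weights, [gender, age],
         [{"m": 0.4, "f": 0.6}, {"young": 0.7, "old": 0.3}])
# After raking, the weighted margins hit both targets simultaneously.
```

With more margins the cycling matters: fixing one margin disturbs the others, and the iterations converge toward weights satisfying them all.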
Oct 23, 2017
Happy Hacktoberfest
It's the middle of October, so you've already made two pull requests to open source repos, right? If you have no idea what we're talking about, spend the next 20 minutes or so with us talking about the importance of open source software and how you can get involved. You can even get a free t-shirt! Hacktoberfest main page:
Oct 16, 2017
Re - Release: Kalman Runners
In honor of the Chicago marathon this weekend (and due in large part to Katie recovering from running in it...) we have a re-release of an episode about Kalman filters, which is part algorithm part elaborate metaphor for figuring out, if you're running a race but don't have a watch, how fast you're going. Katie's Chicago race report: miles 1-13: light ankle pain, lovely cool weather, the most fun EVAR miles 13-17: no more ankle pain but quads start getting tight, it's a little more effort miles 17-20: oof, really tight legs but still plenty of gas in the tank. miles 20-23: it's warmer out now, legs hurt a lot but running through Pilsen and Chinatown is too fun to notice mile 24: ugh cramp everything hurts miles 25-26.2: awesome crowd support, really tired and loving every second Final time: 3:54:35
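In the running metaphor, a Kalman filter fuses a motion model ("I keep moving at roughly my current pace") with noisy observations (mile markers spotted late). A minimal constant-velocity sketch (our illustration; the pace, noise levels, and tuning values are made up):

```python
import numpy as np

# State = [position, speed]; we only ever *observe* noisy positions.
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # motion model: x += v * dt
H = np.array([[1.0, 0.0]])              # measurement picks out position
Q = np.eye(2) * 1e-4                    # process noise (pace drifts a bit)
R = np.array([[0.25]])                  # measurement noise (variance)

x = np.array([[0.0], [0.0]])            # start: position 0, unknown speed
P = np.eye(2)                           # initial uncertainty

true_speed = 0.15                       # miles per minute (made-up pace)
rng = np.random.RandomState(7)
for step in range(1, 61):
    z = true_speed * step * dt + rng.normal(0, 0.5)  # noisy mile marker
    # Predict: roll the motion model forward.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update: blend in the measurement, weighted by relative confidence.
    innovation = z - (H @ x)[0, 0]
    S = (H @ P @ H.T + R)[0, 0]
    K = P @ H.T / S                      # Kalman gain
    x = x + K * innovation
    P = (np.eye(2) - K @ H) @ P

estimated_speed = x[1, 0]   # converges toward the true pace of 0.15
```

The neat part is that speed is never measured directly; it's inferred from how the positions evolve.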
Oct 09, 2017
Neural Net Dropout
Neural networks are complex models with many parameters and can be prone to overfitting.  There's a surprisingly simple way to guard against this: randomly destroy connections between hidden units, also known as dropout.  It seems counterintuitive that undermining the structural integrity of the neural net makes it robust against overfitting, but in the world of neural nets, weirdness is just how things go sometimes. Relevant links:
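The mechanics are genuinely simple; here's a sketch of "inverted" dropout (one common formulation), which zeroes random units during training and rescales the survivors so the expected activation is unchanged:

```python
import numpy as np

def dropout_forward(activations, drop_prob, rng, train=True):
    """Inverted dropout: during training, zero each unit with probability
    drop_prob and rescale survivors by 1/keep_prob so the expected
    activation stays the same. At inference time, do nothing."""
    if not train or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
hidden = np.ones((1000, 100))           # stand-in hidden-layer activations
dropped = dropout_forward(hidden, drop_prob=0.5, rng=rng)

# Roughly half the units are zeroed, but the mean activation is preserved,
# so downstream layers see the same scale at train and test time.
```

Because each forward pass samples a different mask, the network can't rely on any single hidden unit, which is the regularizing effect.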
Oct 02, 2017
Disciplined Data Science
As data science matures as a field, it's becoming clearer what attributes a data science team needs to have to elevate their work to the next level. Most of our episodes are about the cool work being done by other people, but this one summarizes some thinking Katie's been doing herself around how to guide data science teams toward more mature, effective practices. We'll go through five key characteristics of great data science teams, which we collectively refer to as "disciplined data science," and why they matter.
Sep 25, 2017
Hurricane Forecasting
It's been a busy hurricane season in the Southeastern United States, with millions of people making life-or-death decisions based on the forecasts around where the hurricanes will hit and with what intensity. In this episode we'll deconstruct those models, talking about the different types of models, the theory behind them, and how they've evolved through the years.
Sep 18, 2017
Finding Spy Planes with Machine Learning
There are law enforcement surveillance aircraft circling over the United States every day, and in this episode, we'll talk about how some folks at BuzzFeed used public data and machine learning to find them.  The fun thing here, in our opinion, is the blend of intrigue (spy planes!) with tech journalism and a heavy dash of publicly available and reproducible analysis code so that you (yes, you!) can see exactly how BuzzFeed identifies the surveillance planes.
Sep 11, 2017
Data Provenance
Software engineers are familiar with the idea of versioning code, so you can go back later and revive a past state of the system.  For data scientists who might want to reconstruct past models, though, it's not just about keeping the modeling code.  It's also about saving a version of the data that made the model.  There are a lot of other benefits to keeping track of datasets, so in this episode we'll talk about data lineage or data provenance.
Sep 04, 2017
Adversarial Examples
Even as we rely more and more on machine learning algorithms to help with everyday decision-making, we're learning more and more about how they're frighteningly easy to fool sometimes. Today we have a roundup of a few successful efforts to create robust adversarial examples, including what it means for an adversarial example to be robust and what this might mean for machine learning in the future.
Aug 28, 2017
Jupyter Notebooks
This week's episode is just in time for JupyterCon in NYC, August 22-25... Jupyter notebooks are probably familiar to a lot of data nerds out there as a great open-source tool for exploring data, doing quick visualizations, and packaging code snippets with explanations for sharing your work with others. If you're not a data person, or you are but you haven't tried out Jupyter notebooks yet, here's your nudge to go give them a try. In this episode we'll go back to the old days, before notebooks, and talk about all the ways that data scientists like to work that weren't particularly well-suited to the command line + text editor setup, and talk about how notebooks have evolved over their lifetime to become even more powerful and well-suited to the data scientist's workflow.
Aug 21, 2017
Curing Cancer with Machine Learning is Super Hard
Today, a dispatch on what can go wrong when machine learning hype outpaces reality: a high-profile partnership between IBM Watson and MD Anderson Cancer Center has recently hit the rocks as it turns out to be tougher than expected to cure cancer with artificial intelligence.  There are enough conflicting accounts in the media to make it tough to say exactly what went wrong, but it's a good chance to remind ourselves that even in a post-AI world, hard problems remain hard.
Aug 14, 2017
KL Divergence
Kullback-Leibler divergence, or KL divergence, is a measure of information loss when you try to approximate one distribution with another distribution.  It comes to us originally from information theory, but today underpins other, more machine-learning-focused algorithms like t-SNE.  And boy oh boy can it be tough to explain.  But we're trying our hardest in this episode!
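A tiny worked example (ours, with made-up distributions) makes the "information loss" reading concrete, and also shows a famously tricky property: KL is not symmetric, so approximating p with q costs a different amount than approximating q with p:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in bits: the expected extra information needed when
    events drawn from p are encoded with a code optimized for q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

fair = [0.5, 0.5]      # a fair coin
biased = [0.9, 0.1]    # a heavily biased coin

forward = kl_divergence(fair, biased)    # approximate fair with biased
backward = kl_divergence(biased, fair)   # approximate biased with fair
# Both are positive, they differ (KL is not a distance metric),
# and KL of a distribution with itself is exactly zero.
```

This sketch assumes strictly positive probabilities; handling zeros in p or q takes extra care in real implementations.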
Aug 07, 2017
Sabermetrics
It's moneyball time! SABR (the Society for American Baseball Research) is the world's largest organization of statistics-minded baseball enthusiasts, who are constantly applying the craft of scientific analysis to figure out which baseball teams and players are the best. It can be hard to objectively measure sports greatness, but baseball has a data-rich history and plenty of nerdy fans interested in analyzing that data. In this episode we'll dissect a few of the metrics from standard baseball and compare them to related metrics from Sabermetrics, so you can nerd out more effectively at your next baseball game.
Jul 31, 2017
What Data Scientists Can Learn from Software Engineers
We're back again with friend of the pod Walt, former software engineer extraordinaire and current data scientist extraordinaire, to talk about some best practices from software engineering that are ready to jump the fence over to data science.  If last week's episode was for software engineers who are interested in becoming more like data scientists, then this week's episode is for data scientists who are looking to improve their game with best practices from software engineering.
Jul 24, 2017
Software Engineering to Data Science
Data scientists and software engineers often work side by side, building out and scaling technical products and services that are data-heavy but also require a lot of software engineering to build and maintain.  In this episode, we'll chat with a Friend of the Pod named Walt, who started out as a software engineer but works as a data scientist now.  We'll talk about that transition from software engineering to data science, and what special capabilities software engineers have that data scientists might benefit from knowing about (and vice versa).
Jul 17, 2017
Re-Release: Fighting Cholera with Data, 1854
This episode was first released in November 2014. In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera outbreak in London killed scores of people, a doctor named John Snow used it as a chance to study whether the cause might be very small organisms that were spreading through the water supply (the prevailing theory at the time was miasma, or “bad air”). By tracing the geography of all the deaths from the outbreak, Snow was practicing elementary data science--and stumbled upon one of history’s most famous outliers. In this episode, we’ll tell you more about this single data point, a case of cholera that cracked the case wide open for Snow and provided critical validation for the germ theory of disease.
Jul 10, 2017
Re-Release: Data Mining Enron
This episode was first released in February 2015. In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad apples had been cooking the books, and billions of dollars and thousands of jobs disappeared. In the aftermath, surprisingly, one of the greatest datasets in all of machine learning was born--the Enron emails corpus. Hundreds of thousands of emails amongst top executives were made public; there's no realistic chance any dataset like this will ever be made public again. But the dataset that was released has gone on to immortality, serving as the basis for a huge variety of advances in machine learning and other fields.
Jul 02, 2017
Factorization Machines
What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.
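The heart of a factorization machine is modeling every pairwise feature interaction through low-dimensional latent vectors, and there's a clever algebraic identity that makes the pairwise sum linear-time instead of quadratic. A sketch of the prediction step with random made-up weights:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction. The pairwise term
    sum_{i<j} <v_i, v_j> x_i x_j is computed with the O(k*n) identity:
    0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]."""
    linear = w0 + w @ x
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

rng = np.random.RandomState(0)
n_features, k = 6, 3
x = rng.rand(n_features)               # one (hypothetical) feature vector
w0, w = 0.1, rng.rand(n_features)      # global bias and linear weights
V = rng.rand(n_features, k)            # one k-dim latent vector per feature

fast = fm_predict(x, w0, w, V)
# Brute-force enumeration of all pairs gives the same answer.
brute = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(n_features)
                         for j in range(i + 1, n_features))
```

Because interactions are mediated by shared latent vectors, the model can estimate the effect of feature pairs it never saw together, which is exactly what sparse recommendation data needs.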
Jun 26, 2017
Anscombe's Quartet
Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different. It's easy to think that having a good set of summary statistics (like mean, variance and correlation) can tell you everything important about a dataset, or at least enough to know if two datasets are extremely similar or extremely different, but Anscombe's Quartet will always be standing behind you, laughing at how silly that idea is. Anscombe's Quartet was devised in 1973 as an example of how summary statistics can be misleading, but today we can even do one better: the Datasaurus Dozen is a set of twelve datasets, all extremely visually distinct, that have the same summary stats as a source dataset that, there's no other way to put this, looks like a dinosaur. It's an example of how datasets can be generated to look like almost anything while still preserving arbitrary summary statistics. In other words, Anscombe's Quartets can be generated at will and we all should be reminded to visualize our data (not just compute summary statistics) if we want to claim to really understand it.
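You can verify the punchline yourself with the quartet's published values (reproduced below from Anscombe's 1973 paper): four very differently shaped datasets, one set of summary statistics.

```python
import numpy as np

# Anscombe's Quartet: same x-mean, y-mean, and correlation; wildly
# different shapes when plotted (linear, curved, outlier, vertical).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

stats = {name: (round(float(np.mean(x)), 2), round(float(np.mean(y)), 2),
                round(float(np.corrcoef(x, y)[0, 1]), 2))
         for name, (x, y) in quartet.items()}
# All four datasets report x mean 9.0, y mean 7.5, correlation 0.82.
# The stats agree to two decimals -- but seriously, plot them.
```

Scatter-plot the four pairs and the identical numbers become four completely different stories.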
Jun 19, 2017
Traffic Metering Algorithms
Originally released in June 2016. This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and clog up. If you're someone who listens to podcasts while commuting, and especially if your area has on-ramp metering, you'll never look at highway access control the same way again (yeah, we know this is super nerdy; it's also super awesome).
Jun 12, 2017
Page Rank
The year: 1998.  The size of the web: 150 million pages.  The problem: information retrieval.  How do you find the "best" web pages to return in response to a query?  A graduate student named Larry Page had an idea for how it could be done better and created a search engine as a research project.  That search engine was called Google.<img src="" height="1" width="1" alt=""/>
Jun 05, 2017
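The core idea behind that search engine, PageRank, fits in a few lines: a page is important if important pages link to it. Here's a toy sketch by power iteration; the link graph below is made up for illustration:

```python
# Minimal PageRank by power iteration on a made-up link graph.
links = {
    "a": ["b", "c"],   # page "a" links to "b" and "c"
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],        # "c" collects the most in-links
}

def pagerank(links, damping=0.85, iters=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        # Every page starts with the "random jump" share...
        new = {p: (1 - damping) / n for p in pages}
        # ...then each page splits its rank among the pages it links to.
        for p, outs in links.items():
            share = rank[p] / len(outs)
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # "c" comes out on top
```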
Fractional Dimensions
We chat about fractional dimensions, and what the actual heck those are.<img src="" height="1" width="1" alt=""/>
May 29, 2017
Things You Learn When Building Models for Big Data
As more and more data gets collected seemingly every day, and data scientists use that data for modeling, the technical limits associated with machine learning on big datasets keep getting pushed back.  This week is a first-hand case study in using scikit-learn (a popular python machine learning library) on multi-terabyte datasets, which is something that Katie does a lot for her day job at Civis Analytics.  There are a lot of considerations for doing something like this--cloud computing, artful use of parallelization, considerations of model complexity, and the computational demands of training vs. prediction, to name just a few.<img src="" height="1" width="1" alt=""/>
May 22, 2017
How to Find New Things to Learn
If you're anything like us, you a) always are curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it very digestible). We hope this podcast is a part of the solution for you, but if you're looking to go farther (who isn't?) then we have a few new resources that are presenting high-quality content in a fresh, accessible way. Boring old PDFs full of inscrutable math notation, your days are numbered!<img src="" height="1" width="1" alt=""/>
May 15, 2017
Federated Learning
As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that's being supplied as users interact with the algorithm? In other words, how do we do machine learning when the training dataset is distributed across many devices, imbalanced, and the usage associated with any one user needs to be obscured somewhat to protect the privacy of that user? Enter Federated Learning, a set of related algorithms from Google that are designed to help out in exactly this scenario. If you've used keyboard shortcuts or autocomplete on an Android phone, chances are you've encountered Federated Learning even if you didn't know it.<img src="" height="1" width="1" alt=""/>
May 08, 2017
Word2Vec
Word2Vec is probably the go-to algorithm for vectorizing text data these days.  Which makes sense, because it is wicked cool.  Word2Vec has it all: neural networks, skip-grams and bag-of-words implementations, a multiclass classifier that gets swapped out for a binary classifier, made-up dummy words, and a model that isn't actually used to predict anything (usually).  And all that's before we get to the part about how Word2Vec allows you to do algebra with text.  Seriously, this stuff is cool.<img src="" height="1" width="1" alt=""/>
May 01, 2017
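If you're curious what the skip-gram half of Word2Vec actually trains on, here's a sketch of the windowing step that turns a sentence into (center, context) training pairs; the sentence and window size are toy choices, and the neural net itself is left out:

```python
# Generate skip-gram (center, context) pairs -- the raw training examples
# that Word2Vec's skip-gram model learns from.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:4])
```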
Feature Processing for Text Analytics
It seems like every day there's more and more machine learning problems that involve learning on text data, but text itself makes for fairly lousy inputs to machine learning algorithms.  That's why there are text vectorization algorithms, which re-format text data so it's ready for using for machine learning.  In this episode, we'll go over some of the most common and useful ways to preprocess text data for machine learning.<img src="" height="1" width="1" alt=""/>
Apr 24, 2017
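As a taste of what vectorization means in practice, here's a minimal bag-of-words sketch (real pipelines add smarter tokenization, TF-IDF weighting, and more):

```python
# A minimal bag-of-words vectorizer: each document becomes a vector of
# word counts over a shared vocabulary.
def fit_vocabulary(docs):
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def vectorize(doc, vocab):
    vec = [0] * len(vocab)
    for w in doc.lower().split():
        if w in vocab:                # unknown words are simply dropped
            vec[vocab[w]] += 1
    return vec

docs = ["the cat sat", "the dog sat on the cat"]
vocab = fit_vocabulary(docs)          # {'cat': 0, 'dog': 1, 'on': 2, ...}
print([vectorize(d, vocab) for d in docs])
```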
Education Analytics
This week we'll hop into the rapidly developing industry around predictive analytics for education. For many of the students who eventually drop out, data science is showing that there might be early warning signs that the student is in trouble--we'll talk about what some of those signs are, and then dig into the meatier questions around discrimination, who owns a student's data, and correlation vs. causation. Spoiler: we have more questions than we have answers on this one. Bonus appearance from Maeby the dog, who isn't a data scientist but does like to steal food off the counter.<img src="" height="1" width="1" alt=""/>
Apr 17, 2017
A Technical Deep Dive on Stanley, the First Self-Driving Car
In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car to drive itself 140 miles across the desert.  Lidar?  You betcha!  Drive-by-wire?  Of course!  Probabilistic terrain reconstruction?  Absolutely!  All this and more this week on Linear Digressions.<img src="" height="1" width="1" alt=""/>
Apr 10, 2017
An Introduction to Stanley, the First Self-Driving Car
In October 2005, 23 cars lined up in the desert for a 140 mile race.  Not one of those cars had a driver.  This was the DARPA grand challenge to see if anyone could build an autonomous vehicle capable of navigating a desert route (and if so, whose car could do it the fastest); the winning car, Stanley, now sits in the Smithsonian Museum in Washington DC as arguably the world's first real self-driving car.  In this episode (part one of a two-parter), we'll revisit the DARPA grand challenge from 2005 and the rules and constraints of what it took for Stanley to win the competition.  Next week, we'll do a deep dive into Stanley's control systems and overall operation and what the key systems were that allowed Stanley to win the race.<img src="" height="1" width="1" alt=""/>
Apr 03, 2017
Feature Importance
Figuring out which features actually matter in a model is harder than you might first guess.  When a human makes a decision, you can just ask them--why did you do that?  But with machine learning models, not so much.  That's why we wanted to talk a bit about both regularization (again) and also other ways that you can figure out which features have the biggest impact on the predictions of your model.<img src="" height="1" width="1" alt=""/>
Mar 27, 2017
Space Codes!
It's hard to get information to and from Mars.  Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge.  The messages you do pass have to traverse millions of miles, which provides ample opportunity for the message to get corrupted or scrambled.  How, then, can you encode messages so that errors can be detected and corrected?  How does the decoding process allow you to actually find and correct the errors?  In this episode, we'll talk about three pieces of the process (Reed-Solomon codes, convolutional codes, and Viterbi decoding) that allow the scientists at NASA to talk to our rovers on Mars.<img src="" height="1" width="1" alt=""/>
Mar 20, 2017
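Reed-Solomon and convolutional codes are too involved for a few lines, but their much simpler cousin, the repetition code, shows the same principle: add redundancy so the receiver can vote out errors. A toy sketch:

```python
# Error correction at its simplest: a 3x repetition code. Reed-Solomon and
# convolutional codes are far more efficient, but the principle -- add
# redundancy so the receiver can detect and correct errors -- is the same.
def encode(bits):
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(coded):
    out = []
    for i in range(0, len(coded), 3):
        triple = coded[i:i + 3]
        out.append(1 if sum(triple) >= 2 else 0)   # majority vote
    return out

message = [1, 0, 1, 1]
sent = encode(message)
sent[4] ^= 1          # one bit flipped somewhere between here and Mars
print(decode(sent) == message)   # the vote corrects the flip
```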
Finding (and Studying) Wikipedia Trolls
You may be shocked to hear this, but sometimes, people on the internet can be mean.  For some of us this is just a minor annoyance, but if you're a maintainer or contributor of a large project like Wikipedia, abusive users can be a huge problem.  Fighting the problem starts with understanding it, and understanding it starts with measuring it; the thing is, for a huge website like Wikipedia, there can be millions of edits and comments where abuse might happen, so measurement isn't a simple task.  That's where machine learning comes in: by building an "abuse classifier," and pointing it at the Wikipedia edit corpus, researchers at Jigsaw and the Wikimedia foundation are for the first time able to estimate abuse rates and curate a dataset of abusive incidents.  Then those researchers, and others, can use that dataset to study the pathologies and effects of Wikipedia trolls.<img src="" height="1" width="1" alt=""/>
Mar 13, 2017
A Sprint Through What's New in Neural Networks
Advances in neural networks are moving fast enough that, even though it seems like we talk about them all the time around here, it also always seems like we're barely keeping up.  So this week we have another installment in our "neural nets: they so smart!" series, talking about three topics.  And all the topics this week were listener suggestions, too!<img src="" height="1" width="1" alt=""/>
Mar 06, 2017
Stein's Paradox
When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should you estimate it: use measurements of the individual, or get some extra information from the group? The James-Stein estimator tells you how to combine individual and group information to make predictions that, taken over the whole group, are more accurate than if you treated each individual, well, individually.<img src="" height="1" width="1" alt=""/>
Feb 27, 2017
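For the curious, here's a sketch of the James-Stein estimator in its simplest setting (unit-variance observations, shrinking toward zero, with the usual positive-part tweak); the numbers are invented:

```python
# James-Stein shrinkage, unit-variance case: shrink every individual
# estimate toward zero by a factor computed from the whole group.
def james_stein(z):
    p = len(z)                           # needs at least 3 estimates
    s = sum(v * v for v in z)
    shrink = max(0.0, 1 - (p - 2) / s)   # "positive-part" variant
    return [shrink * v for v in z]

raw = [1.2, -0.8, 2.5, 0.3, -1.7]        # made-up individual estimates
shrunk = james_stein(raw)
print(shrunk)   # same signs, smaller magnitudes
```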
Empirical Bayes
Say you're looking to use some Bayesian methods to estimate parameters of a system. You've got the normalization figured out, and the likelihood, but the prior... what should you use for a prior? Empirical Bayes has an elegant answer: look to your previous experience, and use past measurements as a starting point in your prior. Scratching your head about some of those terms, and why they matter? Lucky for you, you're standing in front of a podcast episode that unpacks all of this.<img src="" height="1" width="1" alt=""/>
Feb 20, 2017
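Here's a toy sketch of the batting-average flavor of empirical Bayes: fit a Beta prior to the whole league by the method of moments, then use it to shrink one player's small-sample average. All the numbers are invented:

```python
# Empirical Bayes for batting averages: fit a Beta prior to the league,
# then compute one player's posterior mean. All numbers are made up.
league = [(160, 500), (150, 480), (170, 510), (145, 470), (155, 490)]
rates = [h / ab for h, ab in league]            # (hits, at-bats) -> rate

m = sum(rates) / len(rates)                     # prior mean
v = sum((r - m) ** 2 for r in rates) / len(rates)   # prior variance
# Method-of-moments fit of a Beta(alpha, beta) prior:
common = m * (1 - m) / v - 1
alpha, beta = m * common, (1 - m) * common

def posterior_mean(hits, at_bats):
    # A weighted blend of the player's own rate and the league prior.
    return (alpha + hits) / (alpha + beta + at_bats)

print(round(posterior_mean(6, 10), 3))   # six-for-ten hot streak, shrunk hard
```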
Endogenous Variables and Measuring Protest Effectiveness
Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers? It's a tricky question to answer, since usually we need randomly distributed treatments (e.g. big protests) to understand causality, but there's no reason to believe that big protests are actually randomly distributed. In other words, protest size is endogenous to legislative response, and understanding cause and effect is very challenging. So, what to do? Well, at least in the case of studying Tea Party protest effectiveness, researchers have used rainfall, of all things, to understand the impact of a big protest. In other words, rainfall is the instrumental variable in this analysis that cracks the scientific case open. What does rainfall have to do with protests? Do protests actually matter? What do we mean when we talk about endogenous and instrumental variables? We wouldn't be very good podcasters if we answered all those questions here--you gotta listen to this episode to find out.<img src="" height="1" width="1" alt=""/>
Feb 13, 2017
Calibrated Models
Remember last week, when we were talking about how great the ROC curve is for evaluating models? How things change... This week, we're exploring calibrated risk models, because that's a kind of model that seems like it would benefit from some nice ROC analysis, but in fact the ROC AUC can steer you wrong there.<img src="" height="1" width="1" alt=""/>
Feb 06, 2017
Rock the ROC Curve
This week: everybody's favorite WWII-era classifier metric! But it's not just for winning wars, it's a fantastic go-to metric for all your classifier quality needs.<img src="" height="1" width="1" alt=""/>
Jan 30, 2017
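If you want to compute the ROC AUC yourself, the cleanest route is its probabilistic reading: the chance that a randomly chosen positive outscores a randomly chosen negative. A sketch:

```python
# ROC AUC computed directly from its probabilistic meaning: the chance
# that a random positive example outscores a random negative example.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise "wins"; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(scores, labels))   # one positive-negative pair is misordered
```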
Ensemble Algorithms
If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many ok models, and then bring them all together and use them in combination to make your final predictions, you've just created an ensemble model. It feels a little bit like cheating, like you just got something for nothing, but the results don't lie: algorithms like Random Forests and Gradient Boosting Trees (two types of ensemble algorithms) are some of the strongest out-of-the-box algorithms for classic supervised classification problems. What makes a Random Forest random, and what does it mean to gradient boost a tree? Have a listen and find out.<img src="" height="1" width="1" alt=""/>
Jan 23, 2017
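The "something for nothing" intuition is easy to demo with a majority vote: three models that each err on different examples can be perfect together. A toy sketch (all the labels are made up):

```python
# Majority-vote ensembling: models that make *different* mistakes can
# outvote each other's errors.
from collections import Counter

def majority_vote(predictions):
    # predictions: one list of labels per model, aligned by example
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

model_a = [1, 0, 0, 1, 0]   # errs on example 2
model_b = [1, 1, 1, 0, 0]   # errs on examples 1 and 3
model_c = [0, 0, 1, 1, 1]   # errs on examples 0 and 4
truth   = [1, 0, 1, 1, 0]

print(majority_vote([model_a, model_b, model_c]))   # matches truth exactly
```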
How to evaluate a translation: BLEU scores
As anyone who's encountered a badly translated text could tell you, not all translations are created equal. Some translations are smooth, fluent and sound like a poet wrote them; some are jerky, non-grammatical and awkward. When a machine is doing the translating, it's awfully easy to end up with a robotic-sounding text; as the state of the art in machine translation improves, though, a natural question to ask is: according to what measure? How do we quantify a "good" translation? Enter the BLEU score, which is the standard metric for quantifying the quality of a machine translation. BLEU rewards translations that have large overlap with human translations of sentences, with some extra heuristics thrown in to guard against weird pathologies (like full sentences getting translated as one word, redundancies, and repetition). Nowadays, if there's a machine translation being evaluated or a new state-of-the-art system (like the Google neural machine translation we've discussed on this podcast before), chances are that there's a BLEU score going into that assessment.<img src="" height="1" width="1" alt=""/>
Jan 16, 2017
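Here's a sketch of the clipping idea at the heart of BLEU, using the classic pathological example; real BLEU also combines several n-gram orders and adds a brevity penalty:

```python
# The core of BLEU: n-gram precision with clipping, so a candidate can't
# rack up points by repeating one matching word over and over.
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    # Each candidate word only gets credit up to its count in the reference.
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / sum(cand.values())

# Without clipping this degenerate "translation" would score a perfect 7/7;
# with clipping it scores 2/7, since the reference contains "the" twice.
print(clipped_unigram_precision("the the the the the the the",
                                "the cat is on the mat"))
```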
Zero Shot Translation
Take Google-size data, the flexibility of a neural net, and all (well, most) of the languages of the world, and what you end up with is a pile of surprises. This episode is about some interesting features of Google's new neural machine translation system, namely that with minimal tweaking, it can accommodate many different languages in a single neural net, that it can do a half-decent job of translating between language pairs it's never been explicitly trained on, and that it seems to have its own internal representation of concepts that's independent of the language those concepts are being represented in. Intrigued? You should be...<img src="" height="1" width="1" alt=""/>
Jan 09, 2017
Google Neural Machine Translation
Recently, Google swapped out the backend for Google Translate, moving from a statistical phrase-based method to a recurrent neural network. This marks a big change in methodology: the tried-and-true statistical translation methods that have been in use for decades are giving way to a neural net that, across the board, appears to be giving more fluent and natural-sounding translations. This episode recaps statistical phrase-based methods, digs into the RNN architecture a little bit, and recaps the impressive results that are making us all sound a little better in our non-native languages.<img src="" height="1" width="1" alt=""/>
Jan 02, 2017
Data and the Future of Medicine : Interview with Precision Medicine Initiative researcher Matt Might
Today we are delighted to bring you an interview with Matt Might, computer scientist and medical researcher extraordinaire and architect of President Obama's Precision Medicine Initiative. As the Obama Administration winds down, we're talking with Matt about the goals and accomplishments of precision medicine (and related projects like the Cancer Moonshot) and what he foresees as the future marriage of data and medicine. Many thanks to Matt, our friends over at Partially Derivative (hi, Jonathon!) and the White House for arranging this opportunity to chat. Enjoy!<img src="" height="1" width="1" alt=""/>
Dec 26, 2016
Special Crossover Episode: Partially Derivative interview with White House Data Scientist DJ Patil
We have the pleasure of bringing you a very special crossover episode this week: our friends at Partially Derivative (another great podcast about data science, you should check it out) recently interviewed White House Chief Data Scientist DJ Patil. We think DJ's message about the importance and impact of data science is worth spreading, so it's our pleasure to bring it to you today. A huge thanks to Jonathon Morgan and Partially Derivative for sharing this interview with us--enjoy! Relevant links:<img src="" height="1" width="1" alt=""/>
Dec 18, 2016
How to Lose at Kaggle
Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of the contest is also something that a lot of us have experienced. It's not just bad luck: a very specific kind of overfitting, common on popular competitions, can take someone who is in the top few spots in the final days of a contest and bump them down hundreds of slots in the final tally.<img src="" height="1" width="1" alt=""/>
Dec 12, 2016
Attacking Discrimination in Machine Learning
Imagine there's an important decision to be made about someone, like a bank deciding whether to extend a loan, or a school deciding to admit a student--unfortunately, we're all too aware that discrimination can sneak into these situations (even when everyone is acting with the best of intentions!). Now, these decisions are often made with the assistance of machine learning and statistical models, but unfortunately these algorithms pick up on the discrimination in the world (it sneaks in through the data, which can capture inequities, which the algorithms then learn) and reproduce it. This podcast covers some of the most common ways we can try to minimize discrimination, and why none of those ways is perfect at fixing the problem. Then we'll get to a new idea called "equality of opportunity," which came out of Google recently and takes a pretty practical and well-aimed approach to machine learning bias.<img src="" height="1" width="1" alt=""/>
Dec 05, 2016
Recurrent Neural Nets
This week, we're doing a crash course in recurrent neural networks--what the structural pieces are that make a neural net recurrent, how that structure helps RNNs solve certain time series problems, and the importance of forgetfulness in RNNs. Relevant links:<img src="" height="1" width="1" alt=""/>
Nov 28, 2016
Stealing a PIN with signal processing and machine learning
Want another reason to be paranoid when using the free coffee shop wifi? Allow us to introduce WindTalker, a system that cleverly combines a dose of signal processing with a dash of machine learning to (potentially) steal the PIN from your phone transactions without ever having physical access to your phone. This episode has it all, folks--channel state information, ICMP echo requests, low-pass filtering, PCA, dynamic time warps, and the PIN for your phone.<img src="" height="1" width="1" alt=""/>
Nov 21, 2016
Neural Net Cryptography
Cryptography used to be the domain of information theorists and spies. There's a new player now: neural networks. Given the task of communicating securely, neural networks are inventing new encryption methods that, as best we can tell, are unlike anything humans have ever seen before. Relevant links:<img src="" height="1" width="1" alt=""/>
Nov 14, 2016
Deep Blue
In 1997, Deep Blue was the IBM algorithm/computer that did what no one, at the time, thought possible: it beat the world's best chess player. It turns out, though, that one of the most important moves in the matchup, where Deep Blue psyched out its opponent with a weird move, might not have been so inspired after all. It might have been nothing more than a bug in the program, and it changed computer science history. Relevant links:<img src="" height="1" width="1" alt=""/>
Nov 07, 2016
Organizing Google's Datasets
If you're a data scientist, there's a good chance you're used to working with a lot of data. But there's a lot of data, and then there's Google-scale amounts of data. Keeping all that data organized is a Google-sized task, and as it happens, they've built a system for that organizational challenge. This episode is all about that system, called Goods, and in particular we'll dig into some of the details of what makes this so tough. Relevant links:<img src="" height="1" width="1" alt=""/>
Oct 31, 2016
Fighting Cancer with Data Science: Followup
A few months ago, Katie started on a project for the Vice President's Cancer Moonshot surrounding how data can be used to better fight cancer. The project is all wrapped up now, so we wanted to tell you about how that work went and what changes to cancer data policy were suggested to the Vice President. See for links to the reports discussed on this episode.<img src="" height="1" width="1" alt=""/>
Oct 24, 2016
The 19-year-old determining the US election
Sick of the presidential election yet? We are too, but there's still almost a month to go, so let's just embrace it together. This week, we'll talk about one of the presidential polls, which has been kind of an outlier for quite a while. This week, the NY Times took a closer look at this poll, and was able to figure out the reason it's such an outlier. It all goes back to a 19-year-old African American man, living in Illinois, who really likes Donald Trump... Relevant Links: followup article from LA Times, released after recording:<img src="" height="1" width="1" alt=""/>
Oct 17, 2016
How to Steal a Model
What does it mean to steal a model? It means someone (the thief, presumably) can re-create the predictions of the model without having access to the algorithm itself, or the training data. Sound far-fetched? It isn't. If that person can ask for predictions from the model, and he (or she) asks just the right questions, the model can be reverse-engineered right out from under you. Relevant links:<img src="" height="1" width="1" alt=""/>
Oct 09, 2016
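To see how few questions "just the right questions" can be, here's a sketch of the simplest case: a secret linear model falls to d + 1 queries of its prediction API. The secret weights are made up, and numpy is assumed:

```python
# "Stealing" a linear model: if the secret model is w . x + b with d
# weights, then d + 1 queries pin down every parameter exactly.
import numpy as np

rng = np.random.default_rng(0)
secret_w, secret_b = np.array([2.0, -1.0, 0.5]), 3.0   # the thief can't see these

def query(x):
    # Stand-in for the victim's prediction API: returns predictions only.
    return secret_w @ x + secret_b

d = 3
X = rng.normal(size=(d + 1, d))          # d + 1 probe inputs
y = np.array([query(x) for x in X])      # observed predictions
# Solve [X | 1] @ [w, b] = y for the unknown parameters.
A = np.hstack([X, np.ones((d + 1, 1))])
stolen = np.linalg.solve(A, y)
print(stolen)   # recovers [2, -1, 0.5, 3] (up to float error)
```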
Regularization
Lots of data is usually seen as a good thing. And it is a good thing--except when it's not. In a lot of fields, a problem arises when you have many, many features, especially if there's a somewhat smaller number of cases to learn from; supervised machine learning algorithms break, or learn spurious or un-interpretable patterns. What to do? Regularization can be one of your best friends here--it's a method that penalizes overly complex models, which keeps the dimensionality of your model under control.<img src="" height="1" width="1" alt=""/>
Oct 03, 2016
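Ridge regression is the classic example: an L2 penalty added to least squares, with a closed-form solution. A sketch (numpy assumed, data invented) showing the coefficients shrink as the penalty grows:

```python
# Ridge regression: least squares plus an L2 penalty, solved in closed
# form. Larger penalties pull the coefficients toward zero.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, 2.0, 0.0, -1.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    d = X.shape[1]
    # Solves (X'X + lam * I) w = X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in [0.0, 1.0, 100.0]:
    w = ridge(X, y, lam)
    print(lam, round(float(np.linalg.norm(w)), 3))   # norm shrinks with lam
```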
The Cold Start Problem
You might sometimes find that it's hard to get started doing something, but once you're going, it gets easier. Turns out machine learning algorithms, and especially recommendation engines, feel the same way. The more they "know" about a user, like what movies they watch and how they rate them, the better they do at suggesting new movies, which is great until you realize that you have to start somewhere. The "cold start" problem will be our focus in this episode, both the heuristic solutions that help deal with it and a bit of realism about the importance of skepticism when someone claims a great solution to cold starts. Relevant links:<img src="" height="1" width="1" alt=""/>
Sep 26, 2016
Open Source Software for Data Science
If you work in tech, software or data science, there's an excellent chance you use tools that are built upon open source software. This is software that's built and distributed not for a profit, but because everyone benefits when we work together and share tools. Tim Head of scikit-optimize chats with us further about what it's like to maintain an open source library, how to get involved in open source, and why people like him need people like you to make it all work.<img src="" height="1" width="1" alt=""/>
Sep 19, 2016
Scikit + Optimization = Scikit-Optimize
We're excited to welcome a guest, Tim Head, who is one of the maintainers of the scikit-optimize package. With all the talk about optimization lately, it felt appropriate to get in a few words with someone who's out there making it happen for python. Relevant links:<img src="" height="1" width="1" alt=""/>
Sep 12, 2016
Two Cultures: Machine Learning and Statistics
It's a funny thing to realize, but data science modeling is usually about either explainability, interpretation and understanding, or it's about predictive accuracy. But usually not both--optimizing for one tends to compromise the other. Leo Breiman was one of the titans of both kinds of modeling, a statistician who helped bring machine learning into statistics and vice versa. In this episode, we unpack one of his seminal papers from 2001, when machine learning was just beginning to take root, and talk about how he made clear what machine learning could do for statistics and why it's so important. Relevant links:<img src="" height="1" width="1" alt=""/>
Sep 05, 2016
Optimization Solutions
You've got an optimization problem to solve, and a less-than-forever amount of time in which to solve it. What to do? Use a heuristic optimization algorithm, like a hill climber or simulated annealing--we cover both in this episode! Relevant link:<img src="" height="1" width="1" alt=""/>
Aug 29, 2016
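Here's a bare-bones simulated annealing loop; the bumpy function, starting point, and cooling schedule are all toy choices:

```python
# Bare-bones simulated annealing on a bumpy 1-D function. Early on, high
# "temperature" lets the search accept uphill moves and escape local
# minima; as it cools, it settles into a basin.
import math, random

def f(x):
    return x * x + 10 * math.sin(x)    # bumpy, with its global minimum near x = -1.3

random.seed(42)
x = 8.0                                # deliberately bad starting point
temp = 10.0
while temp > 1e-3:
    candidate = x + random.uniform(-1, 1)
    delta = f(candidate) - f(x)
    # Always accept improvements; sometimes accept uphill moves.
    if delta < 0 or random.random() < math.exp(-delta / temp):
        x = candidate
    temp *= 0.99                       # cool down gradually
print(round(x, 2), round(f(x), 2))     # far better than the starting point
```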
Optimization Problems
If modeling is about predicting the unknown, optimization tries to answer the question of what to do, what decision to make, to get the best results out of a given situation. Sometimes that's straightforward, but sometimes... not so much. What makes an optimization problem easy or hard, and what are some of the methods for finding optimal solutions to problems? Glad you asked! May we recommend our latest podcast episode to you?<img src="" height="1" width="1" alt=""/>
Aug 22, 2016
Multi-level modeling for understanding DEADLY RADIOACTIVE GAS
Ok, this episode is only sort of about DEADLY RADIOACTIVE GAS. It's mostly about multilevel modeling, which is a way of building models with data that has distinct, related subgroups within it. What are multilevel models used for? Elections (we can't get enough of 'em these days), understanding the effect that a good teacher can have on their students, and DEADLY RADIOACTIVE GAS. Relevant links:<img src="" height="1" width="1" alt=""/>
Aug 15, 2016
How Polls Got Brexit "Wrong"
Continuing the discussion of how polls do (and sometimes don't) tell us what to expect in upcoming elections--let's take a concrete example from the recent past, shall we? The Brexit referendum was, by and large, expected to shake out for "remain", but when the votes were counted, "leave" came out ahead. Everyone was shocked (SHOCKED!) but maybe the polls weren't as wrong as the pundits like to claim. Relevant links:<img src="" height="1" width="1" alt=""/>
Aug 08, 2016
Election Forecasting
Not sure if you heard, but there's an election going on right now. Polls, surveys, and projections abound, as far as the eye can see. How to make sense of it all? How are the projections made? Which are some good ones to follow? We'll be your trusty guides through a crash course in election forecasting. Relevant links:<img src="" height="1" width="1" alt=""/>
Aug 01, 2016
Machine Learning for Genomics
Genomics data is some of the biggest #bigdata, and doing machine learning on it is unlocking new ways of thinking about evolution, genomic diseases like cancer, and what really makes each of us different from everyone else. This episode touches on some of the things that make machine learning on genomics data so challenging, and the algorithms designed to do it anyway.<img src="" height="1" width="1" alt=""/>
Jul 25, 2016
Climate Modeling
Hot enough for you? Climate models suggest that it's only going to get warmer in the coming years. This episode unpacks those models, so you understand how they work. A lot of the episodes we do are about fun studies we hear about, like "if you're interested, this is kinda cool"--this episode is much more important than that. Understanding these models, and taking action on them where appropriate, will have huge implications in the years to come. Relevant links:<img src="" height="1" width="1" alt=""/>
Jul 18, 2016
Reinforcement Learning Gone Wrong
Last week’s episode on artificial intelligence gets a huge payoff this week—we’ll explore a wonderful couple of papers about all the ways that artificial intelligence can go wrong. Malevolent actors? You bet. Collateral damage? Of course. Reward hacking? Naturally! It’s fun to think about, and the discussion starting now will have reverberations for decades to come.<img src="" height="1" width="1" alt=""/>
Jul 11, 2016
Reinforcement Learning for Artificial Intelligence
There’s a ton of excitement about reinforcement learning, a form of semi-supervised machine learning that underpins a lot of today’s cutting-edge artificial intelligence algorithms. Here’s a crash course in the algorithmic machinery behind AlphaGo, and self-driving cars, and major logistical optimization projects—and the robots that, tomorrow, will clean our houses and (hopefully) not take over the world…<img src="" height="1" width="1" alt=""/>
Jul 03, 2016
Differential Privacy: how to study people without being weird and gross
Apple wants to study iPhone users' activities and use it to improve performance. Google collects data on what people are doing online to try to improve their Chrome browser. Do you like the idea of this data being collected? Maybe not, if it's being collected on you--but you probably also realize that there is some benefit to be had from the improved iPhones and web browsers. Differential privacy is a set of policies that walks the line between individual privacy and better data, including even some old-school tricks that scientists use to get people to answer embarrassing questions honestly. Relevant links:<img src="" height="1" width="1" alt=""/>
Jun 27, 2016
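One of those old-school tricks is randomized response, and it's simple enough to simulate: each respondent flips coins so any single answer is deniable, yet the population rate is still recoverable. A sketch with invented numbers:

```python
# Randomized response: coin flips give every respondent plausible
# deniability, but the aggregate still reveals the true rate.
import random

random.seed(7)
true_rate = 0.30     # fraction who would truthfully answer "yes" (made up)
n = 100_000

yes = 0
for _ in range(n):
    truth = random.random() < true_rate
    if random.random() < 0.5:
        yes += truth                      # first coin heads: answer honestly
    else:
        yes += random.random() < 0.5      # tails: answer with a coin flip
# P(yes) = p/2 + 1/4, so invert to estimate the true rate p:
estimate = 2 * (yes / n - 0.25)
print(round(estimate, 2))
```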
How the sausage gets made
Something a little different in this episode--we'll be talking about the technical plumbing that gets our podcast from our brains to your ears. As it turns out, it's a multi-step bucket brigade process of RSS feeds, links to downloads, and lots of hand-waving when it comes to trying to figure out how many of you (listeners) are out there.<img src="" height="1" width="1" alt=""/>
Jun 20, 2016
SMOTE: makin' yourself some fake minority data
Machine learning on imbalanced classes: surprisingly tricky. Many (most?) algorithms tend to just assign the majority class label to all the data and call it a day. SMOTE is an algorithm for manufacturing new minority class examples for yourself, to help your algorithm better identify them in the wild. Relevant links:<img src="" height="1" width="1" alt=""/>
Jun 13, 2016
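The heart of SMOTE fits in a few lines: make a synthetic minority example by interpolating between a real minority point and one of its minority-class neighbors. (The full algorithm picks among k nearest neighbors; the points here are made up.)

```python
# The interpolation step at the heart of SMOTE: a synthetic minority
# point placed somewhere along the segment between two real ones.
import random

def smote_point(a, b, rng=random):
    gap = rng.random()                   # how far along the segment to go
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]

random.seed(3)
minority = [[1.0, 2.0], [1.5, 2.5], [0.8, 1.9]]   # made-up minority class
synthetic = smote_point(minority[0], minority[1])
print(synthetic)   # lands between the two parent points
```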
Conjoint Analysis: like AB testing, but on steroids
Conjoint analysis is like AB testing, but more bigger more better: instead of testing one or two things, you can test potentially dozens of options. Where might you use something like this? Well, if you wanted to design an entire hotel chain completely from scratch, and to do it in a data-driven way. You'll never look at Courtyard by Marriott the same way again. Relevant link:<img src="" height="1" width="1" alt=""/>
Jun 06, 2016
Traffic Metering Algorithms
This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and clog up. If you're someone who listens to podcasts while commuting, and especially if your area has on-ramp metering, you'll never look at highway access control the same way again (yeah, we know this is super nerdy; it's also super awesome). Relevant links:<img src="" height="1" width="1" alt=""/>
May 30, 2016
Um Detector 2: The Dynamic Time Warp
One tricky thing about working with time series data, like the audio data in our "um" detector (remember that? because we barely do...), is that sometimes events look really similar but one is a little bit stretched and squeezed relative to the other. Besides having an amazing name, the dynamic time warp is a handy algorithm for aligning two time series sequences that are close in shape, but don't quite line up out of the box. Relevant link:<img src="" height="1" width="1" alt=""/>
May 23, 2016
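The dynamic time warp itself is a compact dynamic program; here's a sketch:

```python
# Dynamic time warping: a small DP that aligns two sequences which are
# similar in shape but locally stretched or compressed.
def dtw(a, b):
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # b pauses
                                 cost[i][j - 1],      # a pauses
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# A sequence and its locally "stretched" twin align with zero cost:
print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))
```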
Inside a Data Analysis: Fraud Hunting at Enron
It's storytime this week--the story, from beginning to end, of how Katie designed and built the main project for Udacity's Intro to Machine Learning class, when she was developing the course. The project was to use email and financial data to hunt for signatures of fraud at Enron, one of the biggest cases of corporate fraud in history; that description makes the project sound pretty clean but getting the data into the right shape, and even doing some dataset merging (that hadn't ever been done before), made this project much more interesting to design than it might appear. Here's the story of what a data analysis like this looks like...from the inside.<img src="" height="1" width="1" alt=""/>
May 16, 2016
What's the biggest #bigdata?
Data science is often mentioned in the same breath as big data. But how big is big data? And who has the biggest big data? CERN? Youtube? ... Something (or someone) else? Relevant link:
May 09, 2016
Data Contamination
Supervised machine learning assumes that the features and labels used for building a classifier are isolated from each other--basically, that you can't cheat by peeking. Turns out this can be easier said than done. In this episode, we'll talk about the many (and diverse!) cases where label information contaminates features, ruining data science competitions along the way. Relevant links:<img src="" height="1" width="1" alt=""/>
May 02, 2016
Model Interpretation (and Trust Issues)
Machine learning algorithms can be black boxes--inputs go in, outputs come out, and what happens in the middle is anybody's guess. But understanding how a model arrives at an answer is critical for interpreting the model, and for knowing if it's doing something reasonable (one could even say... trustworthy). We'll talk about a new algorithm called LIME that seeks to make any model more understandable and interpretable. Relevant Links:<img src="" height="1" width="1" alt=""/>
Apr 25, 2016
Updates! Political Science Fraud and AlphaGo
We've got updates for you about topics from past shows! First, the political science scandal of the year 2015 has a new chapter; we'll remind you about the original story and then dive into what has happened since. Then, we've got an update on AlphaGo, and his/her/its much-anticipated match against the human champion of the game Go. Relevant Links:
Apr 18, 2016
Ecological Inference and Simpson's Paradox
Simpson's paradox is the data science equivalent of looking through one eye and seeing a very clear trend, and then looking through the other eye and seeing the very clear opposite trend. In one case, you see a trend one way in a group, but then breaking the group into subgroups gives the exact opposite trend. Confused? Scratching your head? Welcome to the tricky world of ecological inference. Relevant links:<img src="" height="1" width="1" alt=""/>
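A quick numeric illustration (these are the classic made-up kidney-stone numbers, not data from the episode): a treatment can win in every subgroup and still lose in the aggregate, when the subgroup sizes are lopsided.

```python
# (successes_A, trials_A, successes_B, trials_B) per subgroup
groups = {
    "small stones": (81, 87, 234, 270),
    "large stones": (192, 263, 55, 80),
}

sa = ta = sb = tb = 0
for name, (a_s, a_t, b_s, b_t) in groups.items():
    print(f"{name}: A={a_s / a_t:.0%}  B={b_s / b_t:.0%}")  # A wins in both groups
    sa, ta, sb, tb = sa + a_s, ta + a_t, sb + b_s, tb + b_t

print(f"overall: A={sa / ta:.0%}  B={sb / tb:.0%}")  # ...yet B wins overall
```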
Apr 11, 2016
Discriminatory Algorithms
Sometimes when we say an algorithm discriminates, we mean it can tell the difference between two types of items. But in this episode, we'll talk about another, more troublesome side to discrimination: algorithms can be... racist? Sexist? Ageist? Yes to all of the above. It's an important thing to be aware of, especially when doing people-centered data science. We'll discuss how and why this happens, and what solutions are out there (or not). Relevant Links:<img src="" height="1" width="1" alt=""/>
Apr 04, 2016
Recommendation Engines and Privacy
This episode started out as a discussion of recommendation engines, like Netflix uses to suggest movies. There's still a lot of that in here. But a related topic, which is both interesting and important, is how to keep data private in the era of large-scale recommendation engines--what mistakes have been made surrounding supposedly anonymized data, how data ends up de-anonymized, and why it matters for you. Relevant links:<img src="" height="1" width="1" alt=""/>
Mar 28, 2016
Neural nets play cops and robbers (AKA generative adversarial networks)
One neural net is creating counterfeit bills and passing them off to a second neural net, which is trying to distinguish the real money from the fakes. Result: two neural nets that are better than either one would have been without the competition. Relevant links:<img src="" height="1" width="1" alt=""/>
Mar 21, 2016
A Data Scientist's View of the Fight against Cancer
In this episode, we're taking many episodes' worth of insights and unpacking an extremely complex and important question--in what ways are we winning the fight against cancer, where might that fight go in the coming decade, and how do we know when we're making progress? No matter how tricky you might think this problem is to solve, the fact is, once you get in there trying to solve it, it's even trickier than you thought.<img src="" height="1" width="1" alt=""/>
Mar 14, 2016
Congress Bots and DeepDrumpf
Hey, sick of the election yet? Fear not, there are algorithms that can automagically generate political-ish speech so that we never need to be without an endless supply of Congressional speeches and Donald Trump twitticisms! Relevant links:<img src="" height="1" width="1" alt=""/>
Mar 11, 2016
Multi-Armed Bandits
Multi-armed bandits: how to take your randomized experiment and make it harder better faster stronger. Basically, a multi-armed bandit experiment allows you to optimize for both learning and making use of your knowledge at the same time. It's what the pros (like Google Analytics) use, and it's got a great name, so... winner! Relevant link:<img src="" height="1" width="1" alt=""/>
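To make the learn-while-earning idea concrete, here's a hedged sketch of one of the simplest bandit policies, epsilon-greedy (the pros use fancier policies like Thompson sampling; the rates below are made up):

```python
import random

def epsilon_greedy(true_rates, steps=10000, eps=0.1, seed=0):
    """Mostly pull the arm that looks best so far; explore a random arm eps of the time."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)    # pulls per arm
    values = [0.0] * len(true_rates)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:                               # explore
            arm = rng.randrange(len(true_rates))
        else:                                                # exploit current best
            arm = max(range(len(true_rates)), key=values.__getitem__)
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values
```

Run it with one clearly better option and most of the traffic ends up on the winning arm, while the experiment keeps collecting a trickle of data on the others.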
Mar 07, 2016
Experiments and Messy, Tricky Causality
"People with a family history of heart disease are more likely to eat healthy foods, and have a high incidence of heart attacks." Did the healthy food cause the heart attacks? Probably not. But establishing causal links is extremely tricky, and extremely important to get right if you're trying to help students, test new medicines, or just optimize a website. In this episode, we'll unpack randomized experiments, like AB tests, and maybe you'll be smarter as a result. Will you be smarter BECAUSE of this episode? Well, tough to say for sure... Relevant link:<img src="" height="1" width="1" alt=""/>
Mar 04, 2016
The reason that neural nets are taking over the world right now is because they can be efficiently trained with the backpropagation algorithm. In short, backprop allows you to adjust the weights of the neural net based on how good of a job the neural net is doing at classifying training examples, thereby getting better and better at making predictions. In this episode: we talk backpropagation, and how it makes it possible to train the neural nets we know and love.<img src="" height="1" width="1" alt=""/>
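As a toy illustration of the forward-then-backward pattern (our own sketch, not from the episode), here's a single sigmoid neuron learning the AND function; the chain rule turns "how wrong was the output" into per-weight adjustments:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, lr = [0.0, 0.0], 0.0, 1.0

for _ in range(5000):
    for (x1, x2), y in data:
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)  # forward pass
        grad = p - y                            # d(loss)/d(z) for log loss + sigmoid
        w[0] -= lr * grad * x1                  # backward pass: chain rule
        w[1] -= lr * grad * x2                  # pushes each weight downhill
        b -= lr * grad

print([round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for (x1, x2), _ in data])  # [0, 0, 0, 1]
```

Real backprop does the same bookkeeping layer by layer through the whole network; this is the one-neuron base case.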
Feb 29, 2016
Text Analysis on the State Of The Union
First up in this episode: a crash course in natural language processing, and important steps if you want to use machine learning techniques on text data. Then we'll take that NLP know-how and talk about a really cool analysis of State of the Union text, which analyzes the topics and word choices of every President from Washington to Obama. Relevant link:<img src="" height="1" width="1" alt=""/>
Feb 26, 2016
Paradigms in Artificial Intelligence
Artificial intelligence includes a number of different strategies for how to make machines more intelligent, and often more human-like, in their ability to learn and solve problems. An ambitious group of researchers is working right now to classify all the approaches to AI, perhaps as a first step toward unifying these approaches and moving closer to strong AI. In this episode, we'll touch on some of the most provocative work in many different subfields of artificial intelligence, and their strengths and weaknesses. Relevant links:
Feb 22, 2016
Survival Analysis
Survival analysis is all about studying how long until an event occurs--it's used in marketing to study how long a customer stays with a service, in epidemiology to estimate the duration of survival of a patient with some illness, and in social science to understand how the characteristics of a war inform how long the war goes on. This episode talks about the special challenges associated with survival analysis, and the tools that (data) scientists use to answer all kinds of duration-related questions.<img src="" height="1" width="1" alt=""/>
Feb 19, 2016
Gravitational Waves
All aboard the gravitational waves bandwagon--with the first direct observation of gravitational waves announced this week, Katie's dusting off her physics PhD for a very special gravity-related episode. Discussed in this episode: what are gravitational waves, how are they detected, and what does this announcement mean for future studies of the universe. Relevant links:<img src="" height="1" width="1" alt=""/>
Feb 15, 2016
The Turing Test
Let's imagine a future in which a truly intelligent computer program exists. How would it convince us (humanity) that it was intelligent? Alan Turing's answer to this question, proposed over 60 years ago, is that the program could convince a human conversational partner that it, the computer, was in fact a human. 60 years later, the Turing Test endures as a gold standard of artificial intelligence. It hasn't been beaten, either--yet. Relevant links:<img src="" height="1" width="1" alt=""/>
Feb 12, 2016
Item Response Theory: how smart ARE you?
Psychometrics is all about measuring the psychological characteristics of people; for example, scholastic aptitude. How is this done? Tests, of course! But there's a chicken-and-egg problem here: you need to know both how hard a test is, and how smart the test-taker is, in order to get the results you want. How to solve this problem, one equation with two unknowns? Item response theory--the data science behind tests like the GRE. Relevant links:
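One of the simplest item response theory models, the Rasch model, handles the two unknowns by putting ability and difficulty on the same scale (the numbers below are purely illustrative):

```python
import math

def p_correct(ability, difficulty):
    # Rasch model: P(correct) depends only on the gap between
    # test-taker ability and question difficulty.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(round(p_correct(2.0, 0.0), 2))  # strong test-taker, easy question: 0.88
print(round(p_correct(1.0, 1.0), 2))  # dead-even matchup: 0.5
print(round(p_correct(0.0, 2.0), 2))  # hard question, weak test-taker: 0.12
```

Fitting both sets of parameters at once from a matrix of right/wrong answers is what makes test calibration a data science problem.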
Feb 08, 2016
As you may have heard, a computer beat a world-class human player in Go last week. As recently as a year ago the prediction was that it would take a decade to get to this point, yet here we are, in 2016. We'll talk about the history and strategy of game-playing computer programs, and what makes Google's AlphaGo so special. Relevant link:<img src="" height="1" width="1" alt=""/>
Feb 05, 2016
Great Social Networks in History
The Medici were one of the great ruling families of Europe during the Renaissance. How did they come to rule? Not through power, or money, or armies, but through the strength of their social network. And speaking of great historical social networks, analysis of the network of letter-writing during the Enlightenment is helping humanities scholars track the dispersion of great ideas across the world during that time, from Voltaire to Benjamin Franklin and everyone in between. Relevant links:
Feb 01, 2016
How Much to Pay a Spy (and a lil' more auctions)
A few small encores on auction theory, and then--how can you value a piece of information before you know what it is? Decision theory has some pointers. Some highly relevant information if you are trying to figure out how much to pay a spy. Relevant links:<img src="" height="1" width="1" alt=""/>
Jan 29, 2016
Sold! Auctions (Part 2)
The Google ads auction is a special kind of auction, one you might not know as well as the famous English auction (which we talked about in the last episode). But if it's what Google uses to sell billions of dollars of ad space in real time, you know it must be pretty cool. Relevant links:<img src="" height="1" width="1" alt=""/>
Jan 25, 2016
Going Once, Going Twice: Auctions (Part 1)
The Google AdWords algorithm is (famously) an auction system for allocating a massive amount of online ad space in real time--with that fascinating use case in mind, this episode is part one in a two-part series all about auctions. We dive into the theory of auctions, and what makes a "good" auction. Relevant links:<img src="" height="1" width="1" alt=""/>
Jan 22, 2016
Chernoff Faces and Minard Maps
A data visualization extravaganza in this episode, as we discuss Chernoff faces (you: "faces? huh?" us: "oh just you wait") and the greatest data visualization of all time, or at least the Napoleonic era. Relevant links:<img src="" height="1" width="1" alt=""/>
Jan 18, 2016
t-SNE: Reduce Your Dimensions, Keep Your Clusters
Ever tried to visualize a cluster of data points in 40 dimensions? Or even 4, for that matter? We prefer to stick to 2, or maybe 3 if we're feeling well-caffeinated. The t-SNE algorithm is one of the best tools on the market for doing dimensionality reduction when you have clustering in mind. Relevant links:<img src="" height="1" width="1" alt=""/>
Jan 15, 2016
The [Expletive Deleted] Problem
The town of [expletive deleted], England, is responsible for the clbuttic [expletive deleted] problem. This week on Linear Digressions: we try really hard not to swear too much. Related links:<img src="" height="1" width="1" alt=""/>
Jan 11, 2016
Unlabeled Supervised Learning--whaaa?
In order to do supervised learning, you need a labeled training dataset. Or do you...? Relevant links:<img src="" height="1" width="1" alt=""/>
Jan 08, 2016
Hacking Neural Nets
Machine learning: it can be fooled, just like you or me. Here's one of our favorite examples, a study into hacking neural networks. Relevant links:<img src="" height="1" width="1" alt=""/>
Jan 05, 2016
Zipf's Law
Zipf's law describes the statistics of how word usage is distributed. As it turns out, this is also strikingly reminiscent of how income is distributed, and populations of cities, and bug reports in software, as well as tons of other phenomena that we all interact with every day. Relevant links:
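In its idealized form, the law says the r-th most common word appears with frequency proportional to 1/r. A quick sketch of what that predicts (a hypothetical 1000-word vocabulary, not real corpus data):

```python
# Under Zipf's law, frequency of rank r is proportional to 1/r; normalizing
# by the harmonic number turns that into a proper probability distribution.
V = 1000
harmonic = sum(1.0 / r for r in range(1, V + 1))
share = [1.0 / (r * harmonic) for r in range(1, V + 1)]

# The single most common word claims a startling slice of all usage...
print(round(share[0], 3))  # ≈ 0.134
# ...and rank * share is constant all the way down the list.
```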
Dec 31, 2015
Indie Announcement
We've gone indie! Which shouldn't change anything about the podcast that you know and love, but we're super excited to keep bringing you Linear Digressions as a fully independent podcast. Some links mentioned in the show:<img src="" height="1" width="1" alt=""/>
Dec 30, 2015
Portrait Beauty
It's Da Vinci meets Skynet: what makes a portrait beautiful, according to a machine learning algorithm. Snap a selfie and give us a listen.<img src="" height="1" width="1" alt=""/>
Dec 27, 2015
The Cocktail Party Problem
Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!<img src="" height="1" width="1" alt=""/>
Dec 18, 2015
A Criminally Short Introduction to Semi Supervised Learning
Because there are more interesting problems than there are labeled datasets, semi-supervised learning provides a framework for getting feedback from the environment as a proxy for labels of what's "correct." Of all the machine learning methodologies, it might also be the closest to how humans usually learn--we go through the world, getting (noisy) feedback on the choices we make and learn from the outcomes of our actions.<img src="" height="1" width="1" alt=""/>
Dec 04, 2015
Thresholdout: Down with Overfitting
Overfitting to your training data can be avoided by evaluating your machine learning algorithm on a holdout test dataset, but what about overfitting to the test data? Turns out it can be done, easily, and you have to be very careful to avoid it. But an algorithm from the field of privacy research shows promise for keeping your test data safe from accidental overfitting.
Nov 27, 2015
The State of Data Science
How many data scientists are there, where do they live, where do they work, what kind of tools do they use, and how do they describe themselves? RJMetrics wanted to know the answers to these questions, so they decided to find out and share their analysis with the world. In this very special interview episode, we welcome Tristan Handy, VP of Marketing at RJMetrics, who will talk about "The State of Data Science Report."<img src="" height="1" width="1" alt=""/>
Nov 10, 2015
Data Science for Making the World a Better Place
There's a good chance that great data science is going on close to you, and that it's going toward making your city, state, country, and planet a better place. Not all the data science questions being tackled out there are about finding the sleekest new algorithm or billion-dollar company idea--there's a whole world of social data science that just wants to make the world a better place to live in.<img src="" height="1" width="1" alt=""/>
Nov 06, 2015
Kalman Runners
The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If you've ever run a marathon, or been a nuclear missile, you probably know all about these challenges already. By the way, we neglected to mention in the episode: Katie's marathon time was 3:54:27!<img src="" height="1" width="1" alt=""/>
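The core move is tiny: blend a prediction with each noisy measurement, weighted by how uncertain each one is. A hedged one-dimensional sketch (illustrative noise parameters, not anything from the episode):

```python
def kalman_1d(measurements, q=0.01, r=1.0):
    """Scalar Kalman filter: q = process noise variance, r = measurement noise variance."""
    x, p = measurements[0], 1.0   # state estimate and its variance
    estimates = [x]
    for z in measurements[1:]:
        p = p + q                 # predict: uncertainty grows between steps
        k = p / (p + r)           # Kalman gain: how much to trust the new reading
        x = x + k * (z - x)       # update: nudge estimate toward the measurement
        p = (1 - k) * p
        estimates.append(x)
    return estimates
```

Feed it jittery GPS pings from a marathon runner and the estimates track the runner's true position more smoothly than the raw readings ever could.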
Oct 29, 2015
Neural Net Inception
When you sleep, the neural pathways in your brain take the "white noise" of your resting brain, mix in your experiences and imagination, and the result is dreams (that is a highly unscientific explanation, but you get the idea). What happens when neural nets are put through the same process? Train a neural net to recognize pictures, and then send through an image of white noise, and it will start to see some weird (but cool!) stuff.<img src="" height="1" width="1" alt=""/>
Oct 23, 2015
Benford's Law
Sometimes numbers are... weird. Benford's Law is a favorite example of this for us--it's a law that governs the distribution of the first digit in certain types of numbers. As it turns out, if you're looking up the length of a river, the population of a country, the price of a stock... not all first digits are created equal.<img src="" height="1" width="1" alt=""/>
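Benford's law predicts first digit d with probability log10(1 + 1/d). Here's a quick check of that claim (our own sketch) against powers of 2, a classic sequence known to follow the law:

```python
import math
from collections import Counter

# Predicted first-digit probabilities under Benford's law
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Observed first digits across the first 3000 powers of 2
first_digits = Counter(int(str(2 ** n)[0]) for n in range(1, 3001))
observed = {d: first_digits[d] / 3000 for d in range(1, 10)}

for d in range(1, 10):
    print(d, round(benford[d], 3), round(observed[d], 3))
```

About 30% of the numbers lead with a 1, and under 5% lead with a 9; not all first digits are created equal.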
Oct 16, 2015
Not to oversell it, but Student's t-test has got to have the most interesting history of any statistical test. Which is saying a lot, right? Add some boozy statistical trivia to your arsenal in this episode.
Oct 07, 2015
PFun with P Values
Doing some science, and want to know if you might have found something? Or maybe you've just accomplished the scientific equivalent of going fishing and reeling in an old boot? Frequentist p-values can help you distinguish between "eh" and "oooh interesting". Also, there's a lot of physics in this episode, nerds.<img src="" height="1" width="1" alt=""/>
Sep 02, 2015
This machine learning algorithm beat the human champions at Jeopardy. What is... Watson?<img src="" height="1" width="1" alt=""/>
Aug 25, 2015
Bayesian Psychics
Come get a little "out there" with us this week, as we use a meta-study of extrasensory perception (or ESP, often used in the same sentence as "psychics") to chat about Bayesian vs. frequentist statistics.<img src="" height="1" width="1" alt=""/>
Aug 18, 2015
Troll Detection
Ever found yourself wasting time reading online comments from trolls? Of course you have; we've all been there (it's 4 AM but I can't turn off the computer and go to sleep--someone on the internet is WRONG!). Now there's a way to use machine learning to automatically detect trolls, and minimize the impact when they try to derail online conversations.<img src="" height="1" width="1" alt=""/>
Aug 07, 2015
Yiddish Translation
Imagine a language that is mostly spoken rather than written, contains many words in other languages, and has relatively little written overlap with English. Now imagine writing a machine-learning-based translation system that can convert that language to English. That's the problem that confronted researchers when they set out to automatically translate between Yiddish and English; the tricks they used help us understand a lot about machine translation.<img src="" height="1" width="1" alt=""/>
Aug 03, 2015
Modeling Particles in Atomic Bombs
In a fun historical journey, Katie and Ben explore the history of the Manhattan Project, discuss the difficulties in modeling particle movement in atomic bombs with only punch-card computers and ingenuity, and eventually come to present-day uses of the Metropolis-Hastings algorithm... mentioning Solitaire along the way.<img src="" height="1" width="1" alt=""/>
Jul 06, 2015
Random Number Generation
Let's talk about randomness! Although randomness is pervasive throughout the natural world, it's surprisingly difficult to generate random numbers. And even if your numbers look random (but actually aren't), it can have interesting consequences on the security of systems, and the accuracy of models and research. In this episode, Katie and Ben talk about randomness, its place in machine learning and computation in general, along with some random digressions of their own.<img src="" height="1" width="1" alt=""/>
Jun 19, 2015
Electoral Insights (Part 2)
Following up on our last episode about how experiments can be performed in political science, now we explore a high-profile case of an experiment gone wrong. An extremely high-profile paper that was published in 2014, about how talking to people can convince them to change their minds on topics like abortion and gay marriage, has been exposed as the likely product of a fraudulently produced dataset. We’ll talk about a cool data science tool called the Kolmogorov-Smirnov test, which a pair of graduate students used to reverse-engineer the likely way that the fraudulent data was generated. But a bigger question still remains—what does this whole episode tell us about fraud and oversight in science?<img src="" height="1" width="1" alt=""/>
Jun 09, 2015
Electoral Insights (Part 1)
The first of our two-parter discussing the recent electoral data fraud case. The results of the study in question were covered widely, including by This American Life (who later had to issue a retraction). Data science for election research involves studying voters, who are people, and people are tricky to study—every one of them is different, and the same treatment can have different effects on different voters. But with randomized controlled trials, small variations from person to person can even out when you look at a larger group. With the advent of randomized experiments in elections a few decades ago, a whole new door was opened for studying the most effective ways to campaign.<img src="" height="1" width="1" alt=""/>
Jun 05, 2015
Falsifying Data
In the first of a few episodes on fraud in election research, we’ll take a look at a case study from a previous Presidential election, where polling results were faked. What are some telltale signs that data fraud might be present in a dataset? We’ll explore that in this episode.<img src="" height="1" width="1" alt=""/>
Jun 01, 2015
Reporter Bot
There’s a big difference between a table of numbers or statistics, and the underlying story that a human might tell about how those numbers were generated. Think about a baseball game—the game stats and a newspaper story are describing the same thing, but one is a good input for a machine learning algorithm and the other is a good story to read over your morning coffee. Data science and machine learning are starting to bridge this gap, taking the raw data on things like baseball games, financial scenarios, etc. and automatically writing human-readable stories that are increasingly indistinguishable from what a human would write. In this episode, we’ll talk about some examples of auto-generated content—you’ll be amazed at how sophisticated some of these reporter-bots can be. By the way, this summary was written by a human. (Or was it?)<img src="" height="1" width="1" alt=""/>
May 20, 2015
Careers in Data Science
Let’s talk money. As a “hot” career right now, data science can pay pretty well. But for an individual person matched with a specific job or industry, how much should someone expect to make? Since Katie was on the job market lately, this was something she’s been researching, and it turns out that data science itself (in particular linear regressions) has some answers. In this episode, we go through a survey of hundreds of data scientists, who report on their job duties, industry, skills, education, location, etc. along with their salaries, and then talk about how this data was fed into a linear regression so that you (yes, you!) can use the patterns in the data to know what kind of salary any particular kind of data scientist might expect.<img src="" height="1" width="1" alt=""/>
May 16, 2015
That's "Dr Katie" to You
Katie successfully defended her thesis! We celebrate her return, and talk a bit about what getting a PhD in Physics is like.<img src="" height="1" width="1" alt=""/>
May 14, 2015
Neural Nets (Part 2)
In the last episode, we zipped through neural nets and got a quick idea of how they work and why they can be so powerful. Here’s the real payoff of that work: In this episode, we’ll talk about a brand-new pair of results, one from Stanford and one from Google, that use neural nets to perform automated picture captioning. One neural net does the object and relationship recognition of the image, a second neural net handles the natural language processing required to express that in an English sentence, and when you put them together you get an automated captioning tool. Two heads are better than one indeed...<img src="" height="1" width="1" alt=""/>
May 11, 2015
Neural Nets (Part 1)
There is no known learning algorithm that is more flexible and powerful than the human brain. That's quite inspirational, if you think about it--to level up machine learning, maybe we should be going back to biology and letting millions of years of evolution guide the structure of our algorithms. This is the idea behind neural nets, which mock up the structure of the brain and are some of the most studied and powerful algorithms out there. In this episode, we’ll lay out the building blocks of the neural net (called neurons, naturally) and the networks that are built out of them. We’ll also explore the results that neural nets get when used to do object recognition in photographs.
May 01, 2015
Inferring Authorship (Part 2)
Now that we’re up to speed on the classic author ID problem (who wrote the unsigned Federalist Papers?), we move onto a couple more contemporary examples. First, J.K. Rowling was famously outed using computational linguistics (and Twitter) when she wrote a book under the pseudonym Robert Galbraith. Second, we’ll talk about a mystery that still endures--who is Satoshi Nakamoto? Satoshi is the mysterious person (or people) behind an extremely lucrative cryptocurrency (aka internet money) called Bitcoin; no one knows who he, she or they are, but we have plenty of writing samples in the form of whitepapers and Bitcoin forum posts. We’ll discuss some attempts to link Satoshi Nakamoto with a cryptocurrency expert and computer scientist named Nick Szabo; the links are tantalizing, but not a smoking gun. “Who is Satoshi” remains an example of attempted author identification where the threads are tangled, the conclusions inconclusive and the stakes high.<img src="" height="1" width="1" alt=""/>
Apr 28, 2015
Inferring Authorship (Part 1)
This episode is inspired by one of our projects for Intro to Machine Learning: given a writing sample, can you use machine learning to identify who wrote it? Turns out that the answer is yes, a person’s writing style is as distinctive as their vocal inflection or their gait when they walk. By tracing the vocabulary used in a given piece, and comparing the word choices to the word choices in writing samples where we know the author, it can be surprisingly clear who is the more likely author of a given piece of text. We’ll use a seminal paper from the 1960s as our example here, where the Naive Bayes algorithm was used to determine whether Alexander Hamilton or James Madison was the more likely author of a number of anonymous Federalist Papers.
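A hedged miniature of that approach (the word samples below are invented for illustration, not the 1960s study's actual data): score a disputed text by which author's word-usage rates make it more probable.

```python
import math
from collections import Counter

def train(samples):
    """Count word usage across an author's known writing samples."""
    counts, total = Counter(), 0
    for text in samples:
        words = text.lower().split()
        counts.update(words)
        total += len(words)
    return counts, total

def log_prob(text, model, vocab):
    counts, total = model
    # Laplace smoothing so an unseen word doesn't zero out the whole product
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in text.lower().split())

hamilton = train(["upon the national government upon which",
                  "upon the energy of the executive"])
madison = train(["whilst the states retain powers",
                 "whilst the federal system endures"])
vocab = set(hamilton[0]) | set(madison[0])

disputed = "upon the powers of the government"
guess = ("Hamilton" if log_prob(disputed, hamilton, vocab)
         > log_prob(disputed, madison, vocab) else "Madison")
print(guess)  # Hamilton
```

The real study worked the same way, just with far richer vocabulary statistics: telltale function words like "upon" and "whilst" did a lot of the discriminating.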
Apr 16, 2015
Statistical Mistakes and the Challenger Disaster
After the Challenger exploded in 1986, killing all 7 astronauts aboard, an investigation into the cause was immediately launched. In the cold temperatures the night before the launch, the o-rings that seal off the fuel tanks from the rocket boosters became inflexible, so they did not seal properly, which led to the fuel tank explosion. NASA knew that there could be o-ring problems, but performed the analysis of their data incorrectly and ended up massively underestimating the risk associated with the cold temperatures. In this episode, we'll unpack the mistakes they made. We'll talk about how they excluded data points that they thought were irrelevant but which actually were critical to recognizing a fatal pattern.<img src="" height="1" width="1" alt=""/>
Apr 06, 2015
Genetics and Um Detection (HMM Part 2)
In part two of our series on Hidden Markov Models (HMMs), we talk to Katie and special guest Francesco about more useful and novel applications of HMMs. We revisit Katie's "Um Detector," and hear about how HMMs are used in genetics research.<img src="" height="1" width="1" alt=""/>
Mar 25, 2015
Introducing Hidden Markov Models (HMM Part 1)
Wikipedia says, "A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states." What does that even mean? In part one of a special two-parter on HMMs, Katie, Ben, and special guest Francesco explain the basics of HMMs, and some simple applications of them in the real world. This episode sets the stage for part two, where we explore the use of HMMs in Modern Genetics, and possibly Katie's "Um Detector."<img src="" height="1" width="1" alt=""/>
Mar 24, 2015
Monte Carlo For Physicists
This is another physics-centered podcast, about an ML-backed particle identification tool that we use to figure out what kind of particle caused a particular blob in the detector. But in this case, as in many cases, it looks hard at the outset to use ML because we don't have labeled training data. Monte Carlo to the rescue! Monte Carlo (MC) data is fake data that we generate for ourselves, usually following certain sets of rules (often a Markov chain; in physics we generate MC according to the laws of physics as we understand them), and since you generated the event, you "know" what the correct label is. Of course, it's a lot of work to validate your MC, but the payoff is that then you can use Machine Learning where you never could before.
Mar 12, 2015
Random Kanye
Ever feel like you could randomly assemble words from a certain vocabulary and make semi-coherent Kanye West lyrics? Or technical documentation, imitations of local newscasters, your politically outspoken uncle, etc.? Wonder no more, there's a way to do this exact type of thing: it's called a Markov Chain, and probably the most powerful way to generate made-up data that you can then use for fun and profit. The idea behind a Markov Chain is that you probabilistically generate a sequence of steps, numbers, words, etc. where each next step/number/word depends only on the previous one, which makes it fast and efficient to computationally generate. Usually Markov Chains are used for serious academic uses, but this ain't one of them: here they're used to randomly generate rap lyrics based on Kanye West lyrics.<img src="" height="1" width="1" alt=""/>
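A hedged sketch of the generator idea (the corpus below is placeholder text, not actual Kanye lyrics): record which word follows which, then random-walk the chain.

```python
import random
from collections import defaultdict

def build_chain(text):
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)  # repeats act as weights: common follow-ups get picked more
    return chain

def generate(chain, start, n=8, seed=42):
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        choices = chain.get(out[-1])
        if not choices:
            break               # dead end: this word never had a successor
        out.append(rng.choice(choices))
    return " ".join(out)

chain = build_chain("we major we major we the best we the greatest")
print(generate(chain, "we"))
```

Each next word depends only on the current one, which is exactly the Markov property, and exactly why the output sounds locally plausible but globally unhinged.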
Mar 04, 2015
Lie Detectors
Often machine learning discussions center around algorithms, or features, or datasets--this one centers around interpretation, and ethics. Suppose you could use a technology like fMRI to see what regions of a person's brain are active when they ask questions. And also suppose that you could run trials where you watch their brain activity while they lie about some minor issue (say, whether the card in their hand is a spade or a club)--could you use machine learning to analyze those images, and use the patterns in them for lie detection? Well you certainly can try, and indeed researchers have done just that. There are important problems though--the images of brains can be high variance, meaning that for any given person, there might not be a lot of certainty about whether they're lying or not. It's also open to debate whether the training set (in this case, test subjects with playing cards in their hands) really generalize well to the more important cases, like a person accused of a crime. So while machine learning has yielded some impressive gains in lie detection, it is not a solution to these thornier scientific issues.<img src="" height="1" width="1" alt=""/>
Feb 25, 2015
The Enron Dataset
In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad apples had been cooking the books, and billions of dollars and thousands of jobs disappeared. In the aftermath, surprisingly, one of the greatest datasets in all of machine learning was born--the Enron email corpus. Hundreds of thousands of emails among top executives were made public; there's no realistic chance any dataset like this will ever be made public again. But the dataset that was released has gone on to immortality, serving as the basis for a huge variety of advances in machine learning and other fields.
Feb 09, 2015
Labels and Where To Find Them
Supervised classification is built on the backs of labeled datasets, but a good set of labels can be hard to find. Great data is everywhere, but the corresponding labels can sometimes be really tricky to come by. Take a few examples we've already covered, like lie detection with an fMRI machine (you have to take pictures of someone's brain while they try to lie, not a trivial task) or automated image captioning (so many images! so many valid labels!). In this episode, we'll dig into this topic in depth, talking about some of the standard ways to get a labeled dataset if your project requires labels and you don't already have them.
Feb 04, 2015
Um Detector 1
So, um... what about machine learning for audio applications? In the course of starting this podcast, we've edited out a lot of "um"'s from our raw audio files. It's gotten to the point now that, when we see the waveform in soundstudio, we can almost identify an "um" by eye. Which makes it an interesting problem for machine learning--is there a way we can train an algorithm to recognize the "um" pattern, too? This has become a little side project for Katie, which is very much still a work in progress. We'll talk about what's been accomplished so far, some design choices Katie made in getting the project off the ground, and (of course) mistakes made and hopefully corrected. We always say that the best way to learn something is by doing it, and this is our chance to try our own machine learning project instead of just telling you about what someone else did!
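The episode doesn't spell out which features the project uses, but a common first step for this kind of audio classification is slicing the waveform into short frames and computing simple per-frame statistics a classifier can learn from. A hedged sketch on synthetic audio (all names and numbers here are illustrative, not from the project):

```python
import numpy as np

def frame_features(signal, rate, frame_ms=25):
    """Slice audio into short frames and compute two simple per-frame
    features: RMS energy and zero-crossing rate."""
    n = int(rate * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    return np.column_stack([rms, zcr])

rate = 8000
t = np.linspace(0, 1, rate, endpoint=False)
hum = 0.5 * np.sin(2 * np.pi * 150 * t)                    # steady low hum
hiss = 0.5 * np.random.default_rng(0).normal(size=rate)    # noisy segment
feats = frame_features(np.concatenate([hum, hiss]), rate)
print(feats.shape)  # one (energy, zcr) row per 25 ms frame
```

The hum and the hiss have similar energy but very different zero-crossing rates, which is exactly the kind of separation a downstream classifier can exploit.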
Jan 23, 2015
Better Facial Recognition with Fisherfaces
Now that we know about eigenfaces (if you don't, listen to the previous episode), let's talk about how it breaks down. Variations that are trivial to humans when identifying faces can really mess up computer-driven facial ID--expressions, lighting, and angle are a few. An algorithm can easily end up optimizing for one of those traits, rather than the underlying question of whether the person is the same (for example, if the training image is me smiling, you may reject an image of me frowning but accidentally approve an image of another woman smiling). Fisherfaces uses a Fisher linear discriminant to find the directions in the data that maximize the separation between classes relative to the variation within each class, rather than maximizing the variation overall like eigenfaces does (we'll unpack this statement), and it is much more robust than our pal eigenfaces when there are shadows, cut-off images, expressions, etc.
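The heart of the two-class Fisher linear discriminant is small enough to write out directly. Here's a sketch on made-up 2-D "face" data, where axis 0 is a high-variance nuisance direction (think lighting) and axis 1 is the subtle direction that actually separates the two identities:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant for two classes: the direction that
    maximizes between-class separation relative to within-class scatter."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two classes' (unnormalized) covariances.
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Both identities vary wildly along axis 0 (nuisance, e.g. lighting) and
# differ only by a small shift along axis 1 (the real identity signal).
X1 = np.hstack([rng.normal(scale=5.0, size=(100, 1)),
                rng.normal(loc=0.0, scale=0.5, size=(100, 1))])
X2 = np.hstack([rng.normal(scale=5.0, size=(100, 1)),
                rng.normal(loc=2.0, scale=0.5, size=(100, 1))])
w = fisher_direction(X1, X2)
# PCA's top component would chase the high-variance nuisance axis;
# the Fisher direction points along the axis that separates the classes.
print(w)
```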
Jan 07, 2015
Facial Recognition with Eigenfaces
A true classic topic in ML: facial recognition is very high-dimensional, meaning that each picture can have millions of pixels, each of which can be a single feature. It's computationally expensive to deal with all these features, and invites overfitting problems. PCA (principal components analysis) is a classic dimensionality reduction tool that compresses these many dimensions into the few that contain the most variation in the data, and those principal components are often then fed into a classic ML algorithm like an SVM. One of the best things about eigenfaces is the great example code that you can find in sklearn--you can distinguish pictures of world leaders yourself in just a few minutes!
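The PCA-then-SVM recipe really is short. Here's the same pipeline run on sklearn's built-in 8x8 digit images as a stand-in for the face data (the sklearn eigenfaces example itself downloads a labeled-faces dataset; the component count below is just a reasonable guess):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in for face images: 8x8 digit images, i.e. 64 pixel features each.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compress 64 pixel dimensions down to 20 principal components, then
# feed those components to an SVM classifier.
model = make_pipeline(PCA(n_components=20, whiten=True), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

Swapping `load_digits` for sklearn's face-fetching helper gives you the world-leaders version from the docs.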
Jan 07, 2015
Stats of World Series Streaks
Baseball is characterized by a high level of equality between teams; even the best teams might only have 55% win percentages (contrast this with college football, where teams go undefeated pretty regularly). In this regime, where 2 outcomes (Giants win/Giants lose) are approximately equally likely, we can model the win/loss chances with a binomial distribution. Using the binomial distribution, we can calculate an interesting little result: what's the chance of the world series going to only 4 games? 5? 6? All the way to 7? Then we can compare to decades' worth of world series data, to see how well the data follows the binomial assumption. The result tells us a lot about sports psychology--if each game is independent of the others, the model says a 4-game sweep should be the rarest outcome (about 12.5%) and 6- or 7-game series the most common (about 31% each). The data shows a different trend: 4 and 7 game series are significantly more likely than 5 or 6. There's a powerful psychological effect at play--everybody loves the 7th game of the world series, or a good sweep. And it turns out that the baseball teams, whether they intend it or not, oblige our love of short (4) and long (7) world series!
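The binomial calculation is short enough to run yourself: a series lasting exactly k games means the winner takes game k plus exactly 3 of the first k-1 games. A sketch:

```python
from math import comb

def series_length_prob(k, p=0.5):
    """Probability a best-of-7 series ends in exactly k games (k = 4..7),
    assuming independent games with one team winning each with prob p.
    Either team can be the winner, hence the two symmetric terms."""
    q = 1 - p
    return comb(k - 1, 3) * (p**4 * q**(k - 4) + q**4 * p**(k - 4))

for k in range(4, 8):
    print(k, series_length_prob(k))
# 4 -> 0.125, 5 -> 0.25, 6 -> 0.3125, 7 -> 0.3125
```

Comparing these predicted fractions against the historical counts of 4-, 5-, 6-, and 7-game series is the whole analysis.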
Dec 17, 2014
Computers Try to Tell Jokes
Computers are capable of many impressive feats, but making you laugh is usually not one of them. Or could it be? This episode will talk about a custom-built machine learning algorithm that searches through text and writes jokes based on what it finds. The jokes are formulaic: they're all of the form "I like my X like I like my Y: Z" where X and Y are nouns, and Z is an adjective that can describe both X and Y. For (dumb) example, "I like my men like I like my coffee: steaming hot." The joke is funny when ZX and ZY are both very common phrases, but X and Y are rarely seen together. So, given a large enough corpus of text, the algorithm looks for triplets of words that fit this description and writes jokes based on them. Are the jokes funny? You be the judge...
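The scoring idea can be sketched with toy counts. All the words and counts below are made up for illustration; a real run would tally them from a large corpus, and the actual paper's scoring function differs in its details:

```python
from itertools import combinations

# Toy corpus statistics: how often each (adjective, noun) phrase appears,
# and how often each noun pair appears near each other.
phrase_counts = {
    ("hot", "coffee"): 50, ("hot", "men"): 30, ("hot", "summer"): 40,
}
noun_cooccurrence = {("coffee", "men"): 1, ("coffee", "summer"): 15}

def joke_score(z, x, y):
    """High when Z describes both X and Y often, but X and Y rarely appear
    together -- the incongruity that makes the joke land."""
    zx = phrase_counts.get((z, x), 0)
    zy = phrase_counts.get((z, y), 0)
    xy = noun_cooccurrence.get(tuple(sorted((x, y))), 0)
    return zx * zy / (1 + xy)

nouns = ["coffee", "men", "summer"]
x, y = max(combinations(nouns, 2), key=lambda pair: joke_score("hot", *pair))
print(f"I like my {x} like I like my {y}: hot.")
```

With these toy counts, "men" and "summer" win: both are often "hot" but rarely mentioned together.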
Nov 26, 2014
How Outliers Helped Defeat Cholera
In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera outbreak in London killed scores of people, a doctor named John Snow used it as a chance to study whether the cause might be very small organisms that were spreading through the water supply (the prevailing theory at the time was miasma, or “bad air”). By tracing the geography of all the deaths from the outbreak, Snow was practicing elementary data science--and stumbled upon one of history’s most famous outliers. In this episode, we’ll tell you more about this single data point, a case of cholera that cracked the case wide open for Snow and provided critical validation for the germ theory of disease.
Nov 22, 2014
Hunting for the Higgs
Machine learning and particle physics go together like peanut butter and jelly--but this is a relatively new development. For many decades, physicists looked through their fairly large datasets using the laws of physics to guide their exploration; that tradition continues today, but as ever-larger datasets are collected, machine learning becomes a more tractable way to deal with the deluge. With this in mind, ATLAS (one of the major experiments at CERN, the European Organization for Nuclear Research and home laboratory of the recently discovered Higgs boson) ran a machine learning contest over the summer, to see what advances could be found by opening up the dataset to non-physicists. The results were impressive--physicists are smart folks, but there are clearly lots of advances yet to be made as machine learning and physics learn from one another. And who knows--maybe more Nobel prizes to win as well!
Nov 16, 2014