Data Science and Speed

I have worked in big data, data engineering, and math for more than a decade. Over the course of my career I have spoken with hundreds of companies, across all sorts of industries, all over the globe. Lately, I have been having a lot of discussions about statistics, predictive analytics, and data science, and watching how practitioners operate. One thing I have learned is that there is a near-infinite number of ways to approach things, some better than others.

What I see all the time are hard-working, smart people and teams who are incredibly busy delivering valuable information to their customers. I see people who have toiled for years to build and expand the skills and experience that are so valuable in today’s marketplace. I see organizations that rely upon these teams for bedrock decisions that can affect tens of thousands of people and millions of dollars.

What I also see are individuals who are so focused on delivering results that they don’t take the time to step back and look at the bigger picture. People who are missing the forest for the trees. People who focus solely on the task at hand, never looking around to see if there is a better way to get their work done. People who have learned how to do things one way and who don’t know, or don’t want to know, about new approaches in their field.

How could this happen with such smart people?

Well, it isn’t terribly difficult to see how it happens. People get set in their ways. They learn something new and then they want to use it, again and again. Organizations learn what works, build processes around it, and don’t want to waste their time doing things the wrong way. In a lot of ways that makes sense. Why focus on the edge cases when there is so much in our sweet spot?

Until we see an outlier. A black swan. An unexpected breakthrough.

Today, there are a huge number of products that will help you do predictive analytics. There is no mystery in the algorithms that any one vendor brings to the table. They are all doing the same regressions, the same clustering, the same machine learning. The real difference is in how fast you can do your analysis.

Your work rate has a massive impact on the precision, accuracy, and usability of your analytics, at every level. Think about this:

  • If you could avoid data movement, then you eliminate the largest time waster in data mining. This is true for any in-database analytics platform, from Teradata to Netezza to SQL Server (see the first sketch after this list).
  • If you could speed up the “janitor work” of cleaning and preparing data, then you remove a significant hurdle to real insight.
  • If you could explore and understand your data in half the time, you have doubled the time you have to experiment with your model.
  • If you could build a model in 10% of the time it usually takes, you have given yourself the leeway to dramatically increase its accuracy. You have time to look at more variables, or even to build a more nuanced ensemble model (see the second sketch below).
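
Here is a rough sketch of that first point about data movement. It uses Python’s built-in sqlite3 as a stand-in for whatever database you actually run, and the table, column, and row counts are invented purely for illustration; the idea carries over to Teradata, Netezza, SQL Server, or anything else that can aggregate in place.

```python
import sqlite3

# Build a tiny in-memory table so the sketch runs end to end. In real life this
# would be a large table living inside your data warehouse, not a toy in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?)",
                 [(float(i),) for i in range(100_000)])

# Option 1: pull every row over to the client, then aggregate there.
rows = conn.execute("SELECT amount FROM transactions").fetchall()
client_side_total = sum(amount for (amount,) in rows)

# Option 2: push the aggregation into the database and move only the answer.
(in_database_total,) = conn.execute(
    "SELECT SUM(amount) FROM transactions").fetchone()

# Same number either way, but option 2 ships one value instead of 100,000 rows.
assert client_side_total == in_database_total
```

Toy numbers, obviously, but scale the row count up to billions and put a network between the client and the database, and option 1 is where your afternoon goes.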

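And here is a similarly hedged sketch of the last point, assuming scikit-learn and one of its bundled public data sets; the specific models are illustrative choices, not a recommendation. The time you don’t spend fitting a single model is time you can spend on a richer one.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = [
    ("single decision tree", DecisionTreeClassifier(random_state=0)),
    ("ensemble of 100 trees", RandomForestClassifier(n_estimators=100, random_state=0)),
]

for name, model in candidates:
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    elapsed = time.perf_counter() - start
    print(f"{name}: mean accuracy {scores.mean():.3f}, fit-and-score time {elapsed:.2f}s")
```

The ensemble will take noticeably longer to fit than the single tree; whether its accuracy gain is worth that time is exactly the kind of question you can only ask when there is slack left in your budget.
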
Speed matters. Performance matters. Focus matters.

In this blog series I will show you how to make your analytics run faster, what kind of impact performance has, and why you should care about speed. I will do this by walking through hands-on examples with publicly available data sets, complete with code and benchmarks. We will also look at real-life business cases to see the true impact that faster analytics have on the big picture.

More to follow. Meanwhile, what impact does speed have on the analytics your organization is doing? Comment below. BTW: if you think that speed doesn’t make a difference to what you do, you probably aren’t thinking about the big picture.