Something I end up talking and thinking about a lot is the performance of production analytics and predictions at scale.
A lot of data scientists and statisticians focus on improving their statistical techniques. They think that if they can just write a slightly more elegant algorithm, everyone will be happy. What most of us neglect, however, is the actual long-term delivery of analytics to the people who will consume the data and make decisions based on the results.
What does a typical “analytic process” look like? Well, first let me define the term as we will be using it. A production analytic process is everything that is needed to deliver predictions to your customers.
Typically, to do modeling or calculation, a process needs to:
- Extract data.
- Transfer data.
- Combine data.
- Transform data.
- Deliver results.
An analytic process consists of the solutions for every problem that stands in the way of you delivering your predictions and analyses instantaneously. If you were running a stopwatch, it would start as soon as the data became available from operational systems and would stop as soon as the customer was able to use the data.
Interestingly, most of the items on our list concern the logistics of the data, not the actual calculations. All the time, I see people who take 12 hours to complete a process where 10+ hours are spent simply extracting data from databases and loading it into them.
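To make the stopwatch idea concrete, here is a rough sketch of how you might time each stage of a process and see how much of the total goes to logistics. The names `stopwatch`, `stage_seconds`, and `logistics_share` are hypothetical helpers made up for this illustration, and the stage labels are assumptions; only the 12-hour and 10-hour figures come from the example above.

```python
import time
from contextlib import contextmanager

# Hypothetical bookkeeping: wall-clock seconds accumulated per named stage.
stage_seconds = {}

@contextmanager
def stopwatch(stage):
    """Add the elapsed time of the wrapped block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_seconds[stage] = stage_seconds.get(stage, 0.0) + elapsed

def logistics_share(timings, logistics_stages):
    """Return the fraction of total time spent in the stages named as logistics."""
    total = sum(timings.values())
    if total == 0:
        return 0.0
    moved = sum(t for stage, t in timings.items() if stage in logistics_stages)
    return moved / total

# The 12-hour process from the text, with 10 hours of extracting and loading.
# The remaining 2 hours are labeled "calculate" purely for illustration.
share = logistics_share(
    {"extract_and_load": 10.0, "calculate": 2.0},
    logistics_stages=("extract_and_load",),
)
print(f"{share:.0%} of the run is data logistics")
```

In a real process you would wrap each step in `with stopwatch("extract"):` and so on, then pass `stage_seconds` to `logistics_share` to see where the hours actually go.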
Teradata has a great video that talks about this in a fun way. Definitely worth a view.
How about that trick, then?
The “1 weird trick” that speeds up your analytics is to avoid data transfer, plain and simple. Skipping data movement is only a ‘trick’ if you assume that you have to move the data to get work done. All too often, statisticians and data scientists assume that the only way to do their work is to extract data from a database and then use their tool of choice. This simply isn’t true.
At Fuzzy Logix, we try to move the calculations to the data. In terms of bytes, you don’t want to move a terabyte of data to the place where a few megabytes of instructions reside. Your data spends most of its useful life in a database. Your calculations should too!
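As a small illustration of moving the calculation to the data, the sketch below computes the same per-region average two ways: by pulling every row into the client, and by letting the database do the aggregation. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for a production database; the `sales` table and its columns are made up for the example, and this is not meant to represent Fuzzy Logix's own product or API.

```python
import sqlite3

# In-memory SQLite stands in for a production database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0), ("west", 15.0)],
)

# Approach 1: extract every row, then calculate in the client.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals, counts = {}, {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount
    counts[region] = counts.get(region, 0) + 1
client_means = {region: totals[region] / counts[region] for region in totals}

# Approach 2: send the calculation to the data; only the results come back.
db_rows = conn.execute(
    "SELECT region, AVG(amount) FROM sales GROUP BY region"
).fetchall()
db_means = dict(db_rows)

print("rows moved to client:", len(rows), "vs", len(db_rows))
conn.close()
```

Both approaches give the same answer, but the second moves one row per region instead of one row per sale. With four rows the difference is trivial; with a terabyte-sized table, that difference is the whole point.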