My thoughts on where our space has been and where it might be going.
Trends in data infrastructure
The good (core principles to keep in future innovation)
Horizontal products
We no longer need to buy a bunch of vertical-specific products to do analytics on specific things; we push data into a warehouse and can then analyze it all together in a common set of tools.
Fast
The modern data stack is fast in two senses. Iteration is fast: connecting new data and exploring it is a snap relative to 2012. Query execution is fast too, as the performance breakthroughs of the MPP database now feed through the entire stack.
Unlimited scale
Using cloud infrastructure, it is now possible to trivially scale up just about as far as you could want to go. Cost, rather than capacity, is now the primary constraint on data processing.
Low overhead
Sophisticated data infrastructures of 2012 required massive overhead investment—infrastructure engineers, data engineers, etc. The modern data stack requires virtually none of this.
United by SQL
In 2012 it wasn’t at all clear which language or API would be the primary way to unite data products. As a result, integrations were spotty and few people had the skills to interface with the data. Today, all components of the modern data stack speak SQL, which allows for easy integrations and opens data access to a broad range of practitioners.
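To make the "common interface" point concrete, here is a minimal sketch in which data from two different source systems lands in one database and is analyzed together with a single SQL query. An in-memory SQLite database stands in for a cloud warehouse, and the table names (`crm_accounts`, `billing_invoices`) and the `revenue_by_account` helper are hypothetical, chosen only for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for a cloud data warehouse.
conn = sqlite3.connect(":memory:")

# Two hypothetical tables, each loaded from a different source system.
conn.executescript("""
    CREATE TABLE crm_accounts (account_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE billing_invoices (
        invoice_id INTEGER PRIMARY KEY,
        account_id INTEGER,
        amount REAL
    );
    INSERT INTO crm_accounts VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO billing_invoices VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")


def revenue_by_account(connection):
    """Join CRM and billing data with one query and total revenue per account."""
    query = """
        SELECT a.name, SUM(i.amount) AS revenue
        FROM crm_accounts AS a
        JOIN billing_invoices AS i ON i.account_id = a.account_id
        GROUP BY a.name
        ORDER BY a.name
    """
    return connection.execute(query).fetchall()


rows = revenue_by_account(conn)
for name, revenue in rows:
    print(name, revenue)
```

Nothing here is specific to CRM or billing data: any tool that can issue SQL can run the same query, which is the sense in which the stack is horizontal.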
The bad (opportunities for future innovation)
Governance is immature
Throwing data into a warehouse and opening transformation and analysis to a broad range of people unlocks potential but can also create chaos. Tooling and best practices are needed to bring trust and context to the modern data stack.
Batch-based
The entire modern data stack is built on batch-based operations: polling and job scheduling. This is great for analytics, but a transition to streaming could unlock tremendous potential for the data pipelines we’re already building…
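As a rough illustration of what "polling and job scheduling" means in practice, the sketch below models a loader that copies only rows newer than a stored cursor each time it runs. The `PollingLoader` class, the `updated_at` field, and the list used as a warehouse are all assumptions made for this example, not any particular product's design.

```python
from dataclasses import dataclass, field


@dataclass
class PollingLoader:
    """Hypothetical batch loader: copies rows newer than the last seen cursor."""

    cursor: int = 0
    warehouse: list = field(default_factory=list)

    def run_once(self, source_rows):
        """One scheduled run: load rows whose updated_at is past the cursor."""
        new_rows = [r for r in source_rows if r["updated_at"] > self.cursor]
        self.warehouse.extend(new_rows)
        if new_rows:
            # Advance the cursor so the next run skips rows already loaded.
            self.cursor = max(r["updated_at"] for r in new_rows)
        return len(new_rows)


source = [{"id": 1, "updated_at": 100}, {"id": 2, "updated_at": 105}]
loader = PollingLoader()

first = loader.run_once(source)    # initial load picks up both rows
source.append({"id": 3, "updated_at": 110})
second = loader.run_once(source)   # only the newly arrived row is copied
third = loader.run_once(source)    # nothing new since the last run
```

The row with `id` 3 exists in the source as soon as it is appended, but it is invisible downstream until the next scheduled run. That delay, bounded by the polling interval, is what a streaming design would aim to remove.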
Data doesn’t feed back into operational tools
The modern data stack is a one-way pipeline today: from data sources to warehouses to some type of data analysis viewed by a human on a screen. But data is about making decisions, and decisions happen in operational tools: messaging, CRM, ecommerce… Without a connection with operational tooling, tremendous value created by these pipelines is being lost.
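One way to picture closing this loop is a small job that reads a computed result out of the warehouse and hands it to an operational tool. This is only a sketch under stated assumptions: SQLite stands in for the warehouse, the `customer_metrics` table and the threshold of 1000 are invented for illustration, and `push` is a placeholder for whatever client a real CRM or messaging tool would provide.

```python
import sqlite3


def sync_high_value_customers(connection, push, threshold=1000):
    """Send warehouse-derived traits to an operational tool via `push`.

    `push` is a hypothetical callable standing in for a real tool's client.
    """
    rows = connection.execute(
        "SELECT email, lifetime_value FROM customer_metrics "
        "WHERE lifetime_value >= ? ORDER BY email",
        (threshold,),
    ).fetchall()
    for email, lifetime_value in rows:
        push({
            "email": email,
            "traits": {"lifetime_value": lifetime_value, "segment": "high_value"},
        })
    return len(rows)


# Assumption: in-memory SQLite stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_metrics (email TEXT PRIMARY KEY, lifetime_value REAL);
    INSERT INTO customer_metrics VALUES
        ('a@example.com', 1500.0),
        ('b@example.com', 200.0);
""")

sent = []  # collects payloads in place of a real API call
count = sync_high_value_customers(conn, sent.append)
```

Here `sent.append` takes the place of an API call, so the payloads can be inspected directly. The point is the direction of flow: the result of an analysis ends up in the tool where the decision is acted on, not only on a dashboard.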
Bridge not yet built to data consumers
Data consumers were actually more self-serve prior to the advent of the modern data stack: Excel skills are widely dispersed through the population of knowledge workers. There has not yet been an analogous interface where all knowledge workers can seamlessly interact with data in the modern data stack in a horizontal way.
Vertical analytical experiences
With consolidation into a centralized data infrastructure, we’ve lost differentiated analytical experiences for specific types of data. Purpose-built experiences for analyzing web and mobile data, sales data, and marketing data are critically important.