ABOUT THIS EPISODE
Data infrastructure is advancing beyond the days of Hadoop MapReduce, single-node databases, and nightly reporting.
Companies are adopting modern data warehouses, streaming data systems, and cloud-specific data tools like BigQuery. Every company with a large amount of data wants to aggregate that data into a data lake and make it available to developers. All of this data can be used to power machine learning models, which can potentially improve any area of a company for which historical data exists.
“Data pipeline” is a term used to describe the process of preparing data, building machine learning models, deploying those models, and tracking the results of those models.
Pachyderm is a company and open source project that is focused on deployment, management, and scalability of data pipelines. Pachyderm allows developers to version data, track the state of data sets, backtest machine learning models, and collaborate on data. It also tackles the very hard problem of machine learning auditability.
Joe Doliner is the CEO of Pachyderm and joins the show to discuss his experience building Pachyderm over the last five years. Data infrastructure has changed a lot in five years, and the world has moved in a direction that has benefited Pachyderm, with more infrastructure moving to containers and more data teams advancing beyond a world of just Hadoop MapReduce.
In today’s show, Joe talks about modern infrastructure, data provenance, and the long-term vision of Pachyderm.