ABOUT THIS EPISODE

Data infrastructure is advancing beyond the days of Hadoop MapReduce, single-node databases, and nightly reporting.

Companies are adopting modern data warehouses, streaming data systems, and cloud-native data tools like BigQuery. Every company with a large amount of data wants to aggregate that data into a data lake and make it available to developers. All of this data can be used to power machine learning models, which can potentially improve any area of a company that has historical data.

“Data pipeline” is a term that describes the end-to-end process of preparing data, building machine learning models, deploying those models, and tracking their results.
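Those four stages can be sketched as plain functions. This is a toy illustration only; every name here is hypothetical, and the "model" is just a mean predictor standing in for real training:

```python
# Toy sketch of the pipeline stages described above: prepare data,
# build a model, "deploy" it, and track its results. All names are
# hypothetical; a real pipeline would run each stage as a managed,
# versioned job rather than an in-process function call.

def prepare(raw):
    # Data preparation: drop records that are not numeric.
    return [x for x in raw if isinstance(x, (int, float))]

def build_model(data):
    # Stand-in "training": the model predicts the training mean.
    mean = sum(data) / len(data)
    return lambda _x: mean

def deploy(model):
    # Deployment here is just handing back a callable.
    return model

def track(model, held_out):
    # Track results: mean absolute error on held-out data.
    return sum(abs(model(x) - x) for x in held_out) / len(held_out)

raw = [1, 2, "bad", 3, 4]
data = prepare(raw)                # [1, 2, 3, 4]
model = deploy(build_model(data))  # predicts 2.5 for every input
error = track(model, [2, 3])       # (0.5 + 0.5) / 2 = 0.5
```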

Pachyderm is a company and open source project that is focused on deployment, management, and scalability of data pipelines. Pachyderm allows developers to version data, track the state of data sets, backtest machine learning models, and collaborate on data. It also tackles the very hard problem of machine learning auditability.
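In Pachyderm, a pipeline is declared as a spec that reads from a versioned input repo and runs a containerized transform, which is what makes provenance tracking possible. A minimal sketch of such a spec, loosely modeled on Pachyderm's public examples (the repo name, image, and command here are assumptions, not a definitive configuration):

```json
{
  "pipeline": { "name": "edges" },
  "input": {
    "pfs": { "repo": "images", "glob": "/*" }
  },
  "transform": {
    "image": "pachyderm/opencv",
    "cmd": ["python3", "/edges.py"]
  }
}
```

Because the input repo is versioned, each output commit of the pipeline can be traced back to the exact input commit that produced it.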

Joe Doliner is the CEO of Pachyderm and joins the show to discuss his experience building Pachyderm over the last five years. Data infrastructure has changed a lot in that time, and the world has moved in a direction that has benefited Pachyderm, with more infrastructure moving to containers and more data teams advancing beyond a world of just Hadoop MapReduce.

In today’s show, Joe talks about modern infrastructure, data provenance, and the long-term vision of Pachyderm.

The post Pachyderm: Data Pipelines with Joe Doliner appeared first on Software Engineering Daily.

