Data Breakthroughs Podcast Shorts | Episode 1: Tackling Data Pipeline Breakages

Welcome to the first episode of the Data Breakthroughs Podcast! The idea behind this series is to bring data professionals and data consumers together to work through real-world data issues. We aim to create a platform where listeners share business problems, which we then discuss in episodes with data professionals matched to them. The goal is to move beyond interviews and instead have practical, entertaining, and challenging sessions where we attempt to solve problems together.

In this inaugural episode, we tackled a common and critical problem: how to keep data pipeline breakages from affecting the dashboards used for decision-making. Our guest, Ilya, has over 20 years of experience in data, starting as a database developer before the term "data engineer" was common, and he underlined how significant this issue is.

Why is this problem so important? Companies ideally use data for decision-making. When data pipelines break or deliver wrong data, people lose trust in that data. Stakeholders may revert to gut feeling or, even worse, bypass the established data systems and teams to run their own analysis on source data, often in tools like Excel. Business stakeholders' trust is easy to lose and hard to regain. A pipeline break is manageable if it is caught before users see it; once it reaches end users, it needs to be actively addressed.

The Situation

Ilya shared a specific example from an IoT startup where he was the first data person, hired to build a team and a platform. Data was collected from multiple sources, including internal ERP and web-shop systems. The teams maintaining those source systems were constantly changing data structures, tables, schemas, and column names as they actively developed new features. These changes frequently broke the data pipelines, leaving Tableau dashboards unavailable or showing incorrect data. This frustrated stakeholders such as the CFO, who saw historical numbers change retroactively or reports contradict each other, and who eventually stopped using the data provided by the data team. This functional setup, with the data team separate from the data-producing teams, highlighted the drawbacks of not embedding data people into domains. The tech stack involved Airflow for ETL, AWS RDS (Postgres) for storage, Tableau for the more important reports, and Redash for self-service analytics.

Key Insights

Based on our brainstorming and discussion, we arrived at several key insights:

  • There is no one-size-fits-all for these complex problems. Different situations and company sizes require different approaches.

  • Data mesh is still relevant as a philosophy. While often misunderstood and improperly implemented, the core idea of decentralizing data and empowering domain teams can be powerful if adapted to the organization's culture and needs, much as organizations adapt the Agile philosophy. It's a cultural shift, not a book to follow step by step.

  • People are 80% of the solution; technology is 20%. Tools alone won't solve problems if people aren't communicating, collaborating, and taking responsibility. Transparency and communication are crucial. Changing people's attitudes can be difficult, but it is often both the root cause and the key to fixing these issues.

Action Items

To address the problem of breaking data pipelines and the resulting loss of trust, we identified several actionable steps:

  1. Talk to each other and create transparency. Data producers need to understand the downstream impact of their changes.

  2. Sit down and document the data structure and expected events. Agree on basic details like required fields and expected values, even if it's just on paper or a simple document.

  3. Implement automated data checks and set up systems to reject data that doesn't meet the agreed-upon structure or quality criteria (see the first sketch after this list).

  4. Utilize a data monitoring tool to be alerted when issues occur, before users discover them. Many ETL tools offer basic alerting (see the second sketch after this list); there are also tools specifically for data quality or data contracts.

  5. Establish a strong data incident process. Since things will inevitably break, having a process for identifying, resolving, and performing root cause analysis helps learn from failures and prevent recurrence. This includes understanding data sensitivity and impact to prioritize fixes.
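As a rough illustration of action item 3, here is a minimal sketch in Python of what such an automated check could look like. The schema, field names, and rules are hypothetical placeholders rather than the actual setup discussed in the episode; the point is simply that the agreed structure lives in code and non-conforming records are rejected instead of flowing into the dashboards.

```python
# Minimal sketch of an automated data check that rejects records which do not
# match an agreed-upon structure. Field names and rules below are hypothetical
# placeholders for whatever the producing and consuming teams agree on.

AGREED_SCHEMA = {
    "order_id": int,      # required
    "customer_id": int,   # required
    "amount_eur": float,  # required, must not be negative
    "created_at": str,    # required, ISO-8601 timestamp expected
}

def validate_record(record: dict) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, expected_type in AGREED_SCHEMA.items():
        value = record.get(field)
        if value is None:
            errors.append(f"missing required field: {field}")
        elif not isinstance(value, expected_type):
            errors.append(f"{field}: got {type(value).__name__}, expected {expected_type.__name__}")
    amount = record.get("amount_eur")
    if isinstance(amount, float) and amount < 0:
        errors.append("amount_eur must not be negative")
    return errors

def split_batch(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into accepted records and rejected records with reasons."""
    accepted, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append({"record": record, "errors": errors})
        else:
            accepted.append(record)
    return accepted, rejected

if __name__ == "__main__":
    batch = [
        {"order_id": 1, "customer_id": 42, "amount_eur": 19.99, "created_at": "2024-05-01T10:00:00"},
        {"order_id": "2", "customer_id": 42, "amount_eur": -5.0, "created_at": None},
    ]
    good, bad = split_batch(batch)
    print(f"accepted {len(good)} record(s), rejected {len(bad)} record(s)")
    for item in bad:
        print(item["errors"])
```

In practice the rejected records would land in a quarantine table or dead-letter queue for the producing team to inspect, rather than silently skewing the numbers downstream.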
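Since the stack in Ilya's example used Airflow, here is an equally rough sketch of basic failure alerting there (action item 4), using Airflow's standard on_failure_callback hook. The webhook URL, DAG name, and task body are hypothetical; a real pipeline would run the actual load logic and point the alert at the team's chat or incident tool.

```python
# Minimal sketch of pipeline failure alerting in Airflow (2.x style).
# The webhook endpoint and the DAG/task contents are hypothetical placeholders.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

ALERT_WEBHOOK = "https://hooks.example.com/data-team-alerts"  # hypothetical endpoint

def notify_data_team(context):
    """Runs when a task fails, so the data team hears about it before dashboard users do."""
    message = (
        f"Pipeline failure: dag={context['dag'].dag_id}, "
        f"task={context['task_instance'].task_id}, "
        f"run={context['run_id']}"
    )
    requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=10)

def load_orders():
    # Placeholder for the real extract/transform/load step.
    raise RuntimeError("upstream schema changed")  # simulate the kind of breakage discussed

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"on_failure_callback": notify_data_team},
) as dag:
    PythonOperator(task_id="load_orders", python_callable=load_orders)
```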

Why You Should Listen

This episode dives deep into a practical, real-world problem faced by many data teams. By listening, you'll gain insights into:

  • The significant impact of broken data pipelines on business trust and decision-making.

  • Common causes of pipeline breakages, particularly due to schema changes from data producers.

  • Potential solutions ranging from organizational approaches (like team structure and collaboration) to technical ones (like data contracts and quality checks).

  • The importance of the "people" aspect in data problems.

  • Concrete, actionable steps you can consider implementing in your organization starting tomorrow.

  • An exploration of concepts like Data Mesh and Data Contracts and their practical implications.

The episode also invites listeners to share their problems and ideas, fostering a collaborative environment for finding breakthroughs. While we didn't find a single magic solution, the discussion provided a robust set of tools and principles to tackle the problem effectively.

You can find links to the episode, the discussion board (Figma board), notes, and other relevant resources in the episode description.

We hope you enjoyed this first episode and look forward to seeing you in the next one!
