
Chapter 10: Data Governance and Security

Building a data platform is not just about technology; it is also about trust. If your users do not trust the data in your platform, they will not use it. And if your platform is not secure, it can become a massive liability for your organization. This is where data governance and data security come in. These two disciplines are often overlooked in the rush to build new data pipelines and implement new technologies, but they are absolutely critical to the long-term success and sustainability of any data initiative.

Data governance is the overall management of the availability, usability, integrity, and security of the data in an enterprise. It is a set of policies, processes, and standards that ensure that data is a trusted and well-managed asset. Data security, on the other hand, is the practice of protecting digital information from unauthorized access, use, disclosure, alteration, or destruction.

This chapter is dedicated to these two crucial topics. We will explore the key pillars of data governance, including data stewardship, data cataloging, data lineage, and data quality. We will look at practical ways to implement data quality at scale using open-source tools like Great Expectations. We will then dive into the world of data security, covering best practices for encryption, access control, and secrets management. Finally, we will discuss the important topic of data privacy and compliance with regulations like GDPR and CCPA. By the end of this chapter, you will have a clear understanding of how to build a data platform that is not only powerful but also trustworthy and secure.

10.1 The Pillars of Data Governance

Data governance is not a single project but an ongoing program. It is a cultural shift that requires the collaboration of business users, data professionals, and IT. A successful data governance program is built on several key pillars: data stewardship, data cataloging, data lineage, and data quality, each of which we will touch on in this chapter.

10.2 Implementing Data Quality at Scale

Manually checking data quality is not a scalable solution. In a modern data platform, you need to have automated data quality checks integrated into your data pipelines. Great Expectations is a popular open-source tool for data validation and documentation that has become a standard for implementing data quality at scale.

How Great Expectations Works

  1. Expectations: You define your data quality rules as “Expectations.” An Expectation is a simple, declarative statement about what you expect from your data. For example:

    • expect_column_values_to_not_be_null("user_id")

    • expect_column_values_to_be_unique("order_id")

    • expect_column_values_to_be_in_set("status", ["shipped", "pending", "cancelled"])

  2. Data Validation: You run these Expectations against your data (e.g., a DataFrame in a Spark job, a table in a database). Great Expectations will then tell you whether your data meets your Expectations.

  3. Data Docs: Great Expectations automatically generates a set of clean, human-readable documentation from your Expectations, which can serve as a data quality report.
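To make the idea concrete, here is a minimal, stdlib-only sketch of what the three example Expectations check. The helper names and sample data are illustrative; a real pipeline would use the Great Expectations library itself rather than hand-rolled checks:

```python
# Hand-rolled illustrations of the three example Expectations.
# Helper names are hypothetical; real pipelines use Great Expectations.

def expect_not_null(rows, column):
    """Every row must have a non-null value in the column."""
    return all(row.get(column) is not None for row in rows)

def expect_unique(rows, column):
    """No two rows may share a value in the column."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))

def expect_in_set(rows, column, allowed):
    """Every value in the column must come from an allowed set."""
    return all(row.get(column) in allowed for row in rows)

orders = [
    {"user_id": 1, "order_id": "A1", "status": "shipped"},
    {"user_id": 2, "order_id": "A2", "status": "pending"},
]

assert expect_not_null(orders, "user_id")
assert expect_unique(orders, "order_id")
assert expect_in_set(orders, "status", {"shipped", "pending", "cancelled"})
```

Each check returns a simple pass/fail result; Great Expectations wraps the same idea in a richer result object with per-row details and summary statistics.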

By integrating Great Expectations into your data pipelines (e.g., as a step in your Airflow DAG), you can automatically validate your data as it flows through your system and prevent bad data from reaching your users.

Another popular tool for data quality is dbt (Data Build Tool). While dbt is primarily a data transformation tool, it has excellent built-in support for data quality testing. You can define simple tests (like not_null and unique) in your dbt models, and dbt will automatically run these tests as part of your transformation pipeline.
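In dbt, these tests are declared in YAML alongside the model definition. A minimal sketch (the model and column names here are illustrative):

```yaml
# models/schema.yml -- model and column names are illustrative
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["shipped", "pending", "cancelled"]
```

When you run `dbt test`, dbt compiles each declaration into a SQL query against the model and fails the run if any rows violate the rule.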

10.3 Data Security Best Practices

Data security is a critical and non-negotiable aspect of data engineering. A data breach can have devastating consequences for a company, including financial loss, reputational damage, and legal penalties. As a data engineer, you have a responsibility to ensure that the data in your platform is secure.

Encryption: The First Line of Defense

Sensitive data should be encrypted both at rest (as it sits on disk or in object storage) and in transit (as it moves over the network). Most cloud storage services and databases support encryption at rest with managed keys out of the box, and all connections between systems should use TLS.

Access Control: The Principle of Least Privilege

The principle of least privilege states that a user should only be given the minimum level of access that they need to do their job. You should not give everyone admin access to your data platform.
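The principle can be sketched as a simple role-to-privilege mapping, where access is denied unless it has been explicitly granted. The role and action names below are hypothetical:

```python
# Hypothetical role-to-privilege mapping illustrating least privilege.
# Each role gets only the actions it needs; nothing is admin by default.
ROLE_PRIVILEGES = {
    "analyst": {"select"},             # read-only access for reporting
    "pipeline": {"select", "insert"},  # ETL jobs can read and append
    "admin": {"select", "insert", "delete", "grant"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role explicitly includes the action."""
    return action in ROLE_PRIVILEGES.get(role, set())

assert is_allowed("analyst", "select")
assert not is_allowed("analyst", "delete")  # denied: not explicitly granted
```

In practice you would express the same idea with your database's native roles and grants, but the design rule is the same: deny by default, grant narrowly.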

Data Masking and Anonymization

For sensitive data, such as personally identifiable information (PII), you may need to implement data masking or anonymization techniques. This involves replacing the sensitive data with realistic but fake data, or removing it altogether.
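Two common techniques can be sketched in a few lines: masking, which hides part of a value while keeping it recognizable, and pseudonymization, which replaces an identifier with an irreversible salted hash. The function names below are illustrative:

```python
import hashlib

def mask_email(email: str) -> str:
    """Keep the first character and the domain; mask the rest of the local part."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

def pseudonymize(value: str, salt: str) -> str:
    """Replace an identifier with a salted, irreversible hash (pseudonymization)."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

masked = mask_email("alice@example.com")  # "a***@example.com"
```

Note that pseudonymized data may still count as personal data under regulations like GDPR, because the same input always maps to the same token; true anonymization requires removing the ability to re-identify individuals altogether.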

Secrets Management

Your data pipelines will need to connect to a variety of different systems, which means you will need to manage a lot of secrets, such as database passwords, API keys, and encryption keys. These secrets should never be hard-coded in your scripts or checked into version control. Instead, you should use a dedicated secrets management tool, such as HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault.
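Whichever tool you choose, the pattern in code is the same: the pipeline looks the secret up at runtime rather than embedding it. A minimal sketch using environment variables (the variable name is illustrative; a production lookup would typically call a secrets manager's API instead):

```python
import os

def get_secret(name: str) -> str:
    """Fetch a secret at runtime; never hard-code it in the script.

    In production this lookup would typically go to a secrets manager
    rather than a plain environment variable.
    """
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Secret {name!r} is not set")
    return value

# Demo only: in a real deployment the orchestrator injects this value.
os.environ["DB_PASSWORD"] = "demo-value"
password = get_secret("DB_PASSWORD")
```

The key property is that the secret lives outside the codebase, so rotating it requires no code change and no redeploy of the pipeline itself.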

10.4 Data Privacy and Compliance

In recent years, there has been a growing global focus on data privacy. Several major regulations, such as the EU's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), have been enacted that have a significant impact on how data engineers build and manage data platforms.

As a data engineer, you need to be aware of these regulations and design your data platforms to be compliant. This includes:

  • Knowing where personal data (PII) lives in your platform, which is where a data catalog and data lineage become essential

  • Supporting data subject rights, such as requests to delete a person's data (the "right to be forgotten")

  • Collecting and retaining only the data you actually need (data minimization)

10.5 Data Catalogs and Discovery: Making Data Usable

As we have discussed, a data catalog is a key component of data governance. It is the tool that makes your data discoverable, understandable, and trustworthy. A good data catalog should provide:

  • Search and discovery, so users can quickly find the datasets relevant to their work

  • Business and technical metadata, including dataset descriptions and schema information

  • Ownership information, so users know who is responsible for each dataset and whom to ask questions

  • Data lineage, showing where data comes from and how it has been transformed

Popular Data Catalog Tools:

  • Amundsen (open-sourced by Lyft)

  • DataHub (open-sourced by LinkedIn)

  • OpenMetadata

  • Apache Atlas

Chapter Summary

In this chapter, we have explored the critical topics of data governance and security. We have learned that building a successful data platform is not just about technology; it is about building a platform that is trusted and secure. We have explored the key pillars of data governance, from data stewardship to data quality, and we have looked at practical ways to implement these concepts using open-source tools. We have also taken a deep dive into the world of data security, covering best practices for encryption, access control, and secrets management. Finally, we have discussed the importance of data privacy and compliance with regulations like GDPR.

With a solid understanding of how to govern and secure our data platform, we are now ready to take a closer look at how to build and manage these platforms on a specific cloud provider. In the next chapter, we will take a deep dive into the world of data engineering on Alibaba Cloud.