
3 Steps for Capturing Data Lineage in GCP & Airflow

Updated: Aug 17, 2023



Data Lineage in Google Cloud Platform: A Comprehensive Guide




The Value of Data Lineage

Before we dive into how data lineage can be implemented in GCP, let's first understand the concept. Data lineage is the practice of tracking the lifecycle of your data as it travels through the various systems in your pipelines, and the ability to do so has many benefits. It assists in debugging data issues, improving data governance, and maintaining compliance with various data regulations. Moreover, it builds confidence in data quality and reliability, which in turn drives data-driven decision-making within organizations.


For example, you may have processes that pull data from an API, a database, or an SFTP server. You may then transform this data and land it in Cloud Storage, and finally trigger a BigQuery Data Transfer Service run to load those files into BigQuery.


By implementing data lineage in your data processes, you gain further insight into where your data started, the transformations applied to it along the way, and where it finally landed.


To better solve these data lineage use cases, GCP offers the Dataplex service. Dataplex has many components, but today we will discuss the ones needed to perform lineage management and metadata management. Let's begin by exploring how we can use Dataplex to better serve our data needs.



Data Catalog in Google Cloud Platform

GCP offers two features within the Dataplex service, Data Catalog and Data Lineage, to provide metadata management and lineage management for your data.


Using the Data Catalog feature within Dataplex, your leadership has an auditable record of all of your Google Cloud assets, such as BigQuery datasets or Cloud Storage buckets. You are also able to create custom assets that allow you to track systems, such as APIs or external systems, that may not be best represented by the standard GCP services.


Once you have created this asset in Data Catalog, you can assign an owner to it and tag it with useful metadata that will allow you to better identify the asset, as sketched below.
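
As an illustration, tagging an entry can be done with the google-cloud-datacatalog Python client. The following is a minimal sketch; the asset_ownership template, its owner field, and the placeholder values are assumptions for this example rather than values from this post.

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Define a tag template with a single "owner" field (illustrative names).
template = datacatalog_v1.TagTemplate(
    display_name="Asset ownership",
    fields={
        "owner": datacatalog_v1.TagTemplateField(
            display_name="Owner",
            type_=datacatalog_v1.FieldType(
                primitive_type=datacatalog_v1.FieldType.PrimitiveType.STRING
            ),
        )
    },
)
template = client.create_tag_template(
    parent="projects/<project-name>/locations/us-east1",
    tag_template_id="asset_ownership",
    tag_template=template,
)

# Attach a tag based on the template to an existing entry.
tag = datacatalog_v1.Tag(
    template=template.name,
    fields={"owner": datacatalog_v1.TagField(string_value="data-team@example.com")},
)
client.create_tag(parent="<entry-resource-name>", tag=tag)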


Data Catalog exposes two resources that enable you to track assets:

  1. Entries

  2. Entry Groups


You can create entries that represent your assets and group those entries into entry groups; the resulting resource hierarchy is shown below.
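
Concretely, an entry lives inside an entry group, which in turn lives under a project and location. Using the names from the steps below, the full resource name of an entry has this shape:

projects/<project-name>/locations/us-east1/entryGroups/api-group-name/entries/<entry-id-name>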


Data Lineage in Google Cloud Platform (GCP)

Now that we are able to create entries within Data Catalog representing our assets, we can capture the lineage of each asset, that is, the journey our data takes throughout our ETL pipelines. For example, we can track where our data originated (an API, a database, etc.), the systems it travels through (Cloud Storage), and how it lands in its final destination (BigQuery, via the BigQuery Data Transfer Service).



Let's consider the three steps needed to capture data lineage in GCP:

  1. Create an Entry Group, which will be used to store all the entries related to our custom API source:

gcloud data-catalog entry-groups create api-group-name --location=us-east1 --display-name=<display-name> --description=<description>
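
If you prefer to create the entry group from code rather than the CLI, here is a minimal sketch using the google-cloud-datacatalog Python client; the project placeholder and the display name and description values are assumptions for this example.

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Equivalent of the gcloud command above: create the entry group.
entry_group = client.create_entry_group(
    parent="projects/<project-name>/locations/us-east1",
    entry_group_id="api-group-name",
    entry_group=datacatalog_v1.EntryGroup(
        display_name="<display-name>",
        description="<description>",
    ),
)
print(entry_group.name)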

  2. Add an Entry to the Entry Group


Here we are creating a custom entry using the fullyQualifiedName attribute, and adding the entry to the entry group:


curl \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-X POST "https://datacatalog.googleapis.com/v1/projects/<project-name>/locations/us-east1/entryGroups/api-group-name/entries?entryId=<entry-id-name>" \
-d '{
  "displayName": "ourCustomAPI",
  "userSpecifiedType": "api",
  "userSpecifiedSystem": "ourCustomAPI",
  "fullyQualifiedName": "custom:api-name"
}'
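
The same request can also be made with the Python client. A minimal sketch, assuming the entry group created in step 1 and the same placeholders as above:

from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Create the custom entry inside the entry group from step 1.
entry = client.create_entry(
    parent="projects/<project-name>/locations/us-east1/entryGroups/api-group-name",
    entry_id="<entry-id-name>",
    entry=datacatalog_v1.Entry(
        display_name="ourCustomAPI",
        user_specified_type="api",
        user_specified_system="ourCustomAPI",
        fully_qualified_name="custom:api-name",
    ),
)
print(entry.name)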


  3. Track Lineage: now, in our Cloud Composer DAG, we can define Airflow inlets and outlets so that lineage is tracked across our Google Cloud components.


For example, we can define lineage between a custom API data source and a Google Cloud Storage bucket with the following code:


from airflow.composer.data_lineage.entities import DataLineageEntity, GCSEntity
from airflow.decorators import task

# Other entity types, such as BigQueryTable, PostgresTable and MySQLTable,
# are also available from the same module.

@task(
    inlets=[
        DataLineageEntity(fully_qualified_name="custom:api-name")
    ],
    outlets=[
        GCSEntity(bucket="<bucket-name>", path="/path")
    ],
)
def sample_task():
    pass


After the above task is run in Cloud Composer, along with an additional task (sketched below) that runs a BigQuery Data Transfer operation to move the data from Cloud Storage to BigQuery, we are able to see the resulting lineage graph in Dataplex.
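
For completeness, that second task might look something like the sketch below. Note that the BigQueryTable constructor arguments shown here are assumptions (these Composer lineage entities are sparsely documented, as noted in the next section), and the actual transfer trigger is elided.

from airflow.composer.data_lineage.entities import BigQueryTable, GCSEntity
from airflow.decorators import task

@task(
    inlets=[
        GCSEntity(bucket="<bucket-name>", path="/path")
    ],
    outlets=[
        # Constructor arguments are assumptions for illustration.
        BigQueryTable(
            project_id="<project-name>",
            dataset_id="<dataset-name>",
            table_id="<table-name>",
        )
    ],
)
def transfer_to_bigquery():
    # Start the BigQuery Data Transfer run here, for example with the
    # google-cloud-bigquery-datatransfer client or the corresponding
    # Airflow operator.
    pass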


Using this lineage graph, we are better able to visualize how our data traveled from our source API, through our Cloud Storage bucket, and into our final BigQuery table.




Special Thanks to the Viral Professional Services Team


Please note, at the time of this writing, several of the imports in the code were not found in any documentation online (GCSEntity, for example). Credit to our Professional Services team for doing their due diligence and finding these hidden gems, as well as for working alongside the Google Cloud team to better understand how to track lineage the correct way.




Conclusion

GCP's Dataplex service offers robust capabilities for managing and understanding your data lineage. By leveraging these features, you can gain a comprehensive view of your data's journey, increase trust in your data, and enhance your organization's data management strategies.


Remember, the journey of a thousand miles begins with a single step. Start your data lineage journey today with GCP and experience the benefits of better data visibility and management. Keep following for more insights on data management and GCP. Happy data journeying!



