Cloud Pub/Sub

Hi,

Is it possible to stream data from Pub/Sub to Cloud Data Quality? I have seen docs saying that we need to create a data lake before Cloud Data Quality can consume the data. I want to consume it directly from a Pub/Sub topic without creating any data lakes. Please suggest.

 

Accepted Solution

You cannot stream data directly from a Google Cloud Pub/Sub topic into a data quality service; some form of intermediate processing or storage is needed first. Dataplex, with its integrated Data Quality capabilities, can then work on the stored data:

Why Intermediate Storage is Important for Data Quality Analysis:

  • Data Structure: Data quality services, including DQ within Dataplex, require data in a structured, queryable format (like tables within BigQuery) to perform profiling and evaluate rules. Pub/Sub messages may carry structured payloads, but a topic is not a table that a DQ service can query.

  • Data Volume and Retention: Data quality services are optimized for analyzing larger datasets to identify patterns, quality issues, and inconsistencies. Pub/Sub focuses on real-time message delivery and retains messages only for a limited time, so it is not suited to comprehensive historical analysis.

  • Data Profiling and Analysis: The essence of data quality services is the in-depth analysis of data. Access to data in a persistent and structured format is crucial for conducting this analysis effectively.
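To illustrate the gap between a message and a table row, here is a minimal sketch that flattens one Pub/Sub push envelope into a flat record. It assumes the payload is a JSON object; the `order_id` and `amount` fields and the `envelope_to_row` helper are made up for the example and are not part of any Google API.

```python
import base64
import json


def envelope_to_row(envelope: dict) -> dict:
    """Flatten one Pub/Sub push envelope into a flat, table-like row."""
    message = envelope["message"]
    # Push delivery base64-encodes the payload in the "data" field.
    payload = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    row = {
        "message_id": message.get("messageId"),
        "publish_time": message.get("publishTime"),
    }
    # Assumption: the payload is a flat JSON object (hypothetical fields).
    row.update(payload)
    return row


# Example envelope with a made-up payload.
example = {
    "message": {
        "data": base64.b64encode(
            json.dumps({"order_id": 7, "amount": 19.5}).encode("utf-8")
        ).decode("ascii"),
        "messageId": "123",
        "publishTime": "2024-01-01T00:00:00Z",
    },
    "subscription": "projects/my-project/subscriptions/my-sub",
}

print(envelope_to_row(example))
```

Only once messages are flattened and stored like this do column-level checks (nulls, ranges, uniqueness) become meaningful.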

Dataplex and Integrated Data Quality (DQ):

  • Dataplex: As a unified data management and governance platform, Dataplex includes automated data quality (DQ) capabilities for monitoring and improving data quality.

  • Integration: Dataplex works with data stored in BigQuery and GCS, so data published to Pub/Sub becomes available for data quality analysis once it lands in either of them.

  • Automated Profiling and Analysis: Once data lands in designated zones within Dataplex (from Pub/Sub via BigQuery or GCS), DQ can profile it and check it against quality rules, surfacing quality issues and recommendations.
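As a rough picture of what rule-based quality checks compute over landed rows, the sketch below evaluates a non-null rule and a range rule and reports the fraction of rows that pass. This is a plain Python illustration, not the Dataplex API; the rule helpers and sample rows are hypothetical.

```python
from typing import Callable, Iterable, Optional

Row = dict
Rule = Callable[[Row], bool]


def non_null(column: str) -> Rule:
    """Rule: the column must be present and not None."""
    return lambda row: row.get(column) is not None


def in_range(column: str, low: float, high: float) -> Rule:
    """Rule: the column must be non-null and within [low, high]."""
    def rule(row: Row) -> bool:
        value: Optional[float] = row.get(column)
        return value is not None and low <= value <= high
    return rule


def pass_ratio(rows: Iterable[Row], rule: Rule) -> float:
    """Fraction of rows that satisfy the rule (1.0 for an empty input)."""
    rows = list(rows)
    if not rows:
        return 1.0
    return sum(1 for row in rows if rule(row)) / len(rows)


# Hypothetical rows, as they might look after landing in a table.
rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": None, "amount": 5.0},
    {"order_id": 4, "amount": -3.0},
]

print(pass_ratio(rows, non_null("order_id")))
print(pass_ratio(rows, in_range("amount", 0, 1000)))
```

A managed service would typically compare each ratio against a configured threshold to decide whether the rule passes.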

Possible Solutions with Dataplex:

  • Pub/Sub to BigQuery + Dataplex: Use Dataflow or a BigQuery subscription to ingest data into BigQuery. Dataplex, coupled with DQ, can then run data quality analysis on the resulting tables.

  • Pub/Sub to Cloud Storage + Dataplex: Write Pub/Sub messages to GCS, for example with a Cloud Storage subscription or a Dataflow job. Dataplex, along with DQ, can then discover, profile, and analyze the data for quality.
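For the BigQuery subscription route, a minimal sketch with the `google-cloud-pubsub` Python client might look like the following. The project, topic, subscription, and table names are placeholders, and the sketch assumes the target table already exists with a schema compatible with the messages and that the Pub/Sub service account is allowed to write to it.

```python
def bigquery_subscription_request(
    project_id: str,
    topic_id: str,
    subscription_id: str,
    table_id: str,
    write_metadata: bool = True,
) -> dict:
    """Build a create_subscription request that writes messages to BigQuery.

    table_id uses the "project.dataset.table" form.
    """
    return {
        "name": f"projects/{project_id}/subscriptions/{subscription_id}",
        "topic": f"projects/{project_id}/topics/{topic_id}",
        "bigquery_config": {
            "table": table_id,
            # Also store message_id, publish_time, etc. in extra columns.
            "write_metadata": write_metadata,
        },
    }


def create_bigquery_subscription(request: dict):
    """Create the subscription (requires google-cloud-pubsub and credentials)."""
    # Imported lazily so the request builder above works without the library.
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    with subscriber:
        return subscriber.create_subscription(request=request)


# Placeholder identifiers; replace with your own before calling
# create_bigquery_subscription(request).
request = bigquery_subscription_request(
    "my-project", "my-topic", "my-bq-sub", "my-project.my_dataset.my_table"
)
print(request)
```

With the subscription in place, messages flow into the table without a separate pipeline, and data quality checks can target that table.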

Considerations:

  • Real-time vs. Batch Processing: Assess the need for real-time data quality checks against the feasibility and cost of implementing real-time processing pipelines.

  • Cost and Complexity: Consider the potential increase in cost and complexity when setting up continuous streaming and processing architectures.

  • Data Volume and Historical Analysis: Tailor your solution to the volume of data and the necessity for historical analysis. Lower data volumes or less critical real-time analysis requirements might be more efficiently handled through batch processing and periodic analysis.
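If batch processing is sufficient, one simple pattern is to buffer incoming rows and hand them off in groups. The sketch below shows that idea with a hypothetical `MicroBatcher` class that flushes by row count; a real pipeline would likely also flush on a timer and write to BigQuery or GCS instead of an in-memory list.

```python
from typing import Callable, List


class MicroBatcher:
    """Collect rows and pass them to a flush callback in fixed-size batches."""

    def __init__(self, max_rows: int, flush: Callable[[List[dict]], None]):
        self.max_rows = max_rows
        self._flush = flush
        self._buffer: List[dict] = []

    def add(self, row: dict) -> None:
        """Buffer one row and flush when the batch is full."""
        self._buffer.append(row)
        if len(self._buffer) >= self.max_rows:
            self.flush_now()

    def flush_now(self) -> None:
        """Flush whatever is buffered, if anything."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush(batch)


# Usage: collect batches in memory to show the grouping.
batches: List[List[dict]] = []
batcher = MicroBatcher(max_rows=2, flush=batches.append)
for i in range(5):
    batcher.add({"id": i})
batcher.flush_now()  # flush the final partial batch
print(batches)
```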

In short, Pub/Sub data cannot be fed directly into a standalone data quality service, but Dataplex offers a framework for integrating Data Quality: route the data from Pub/Sub into BigQuery or GCS so that it is stored in an appropriate format, then let Dataplex run the quality analysis on it.
