We are running AWS Glue Data Quality on Glue 5.0 and need a single evaluation of a DQDL
ruleset to deliver two outputs:
1. Row-level outcomes — so the ETL job can split input rows into good and bad partitions
for downstream processing.
2. Catalog publishing — the run must appear under the catalog table's Data Quality tab in
the Glue console.
Today these capabilities live in two different surfaces, and neither one delivers both:
- EvaluateDataQuality.process_rows (ETL transform): returns row-level outcomes, but does
not publish to the table's DQ tab.
- start_data_quality_ruleset_evaluation_run (boto3 API): publishes to the table's DQ
tab, but does not return row-level outcomes.
To get both, we would have to evaluate the same ruleset twice, doubling the data quality evaluation work.
What we have tried
- Created the ruleset with create_data_quality_ruleset using TargetTable, so the catalog
table is bound to the ruleset.
- Ran process_rows from inside the Glue job with enableDataQualityResultsPublishing and
enableDataQualityCloudWatchMetrics set to true.
- Set dataQualityEvaluationContext to the ruleset name.
- Passed database, tableName, and catalogId in additional_options.
- Set transformation_ctx on the source DynamicFrame to match the table name.
None of the above caused the run to appear under the catalog table's Data Quality tab.
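For reference, the attempts listed above can be sketched roughly as follows. This is a minimal sketch, not our exact job code: the helper names (bind_ruleset_to_table, build_dq_options, split_good_bad) are hypothetical, and the database, tableName, and catalogId keys in additional_options are simply what we tried, not options we know process_rows recognizes. The split assumes the documented rowLevelOutcomes key and DataQualityEvaluationResult column.

```python
def bind_ruleset_to_table(glue, ruleset_name, ruleset_dqdl, database, table_name, catalog_id):
    """Create the ruleset with TargetTable so it is bound to the catalog table.

    `glue` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    """
    return glue.create_data_quality_ruleset(
        Name=ruleset_name,
        Ruleset=ruleset_dqdl,
        TargetTable={
            "DatabaseName": database,
            "TableName": table_name,
            "CatalogId": catalog_id,
        },
    )


def build_dq_options(ruleset_name, database, table_name, catalog_id):
    """Return (publishing_options, additional_options) as passed to process_rows."""
    publishing_options = {
        "dataQualityEvaluationContext": ruleset_name,
        "enableDataQualityResultsPublishing": True,
        "enableDataQualityCloudWatchMetrics": True,
    }
    # ASSUMPTION: these keys are what we tried; we have no confirmation that
    # process_rows reads them.
    additional_options = {
        "database": database,
        "tableName": table_name,
        "catalogId": catalog_id,
    }
    return publishing_options, additional_options


def split_good_bad(frame, ruleset_dqdl, publishing_options, additional_options):
    """Evaluate the ruleset once and split rows into passed and failed DataFrames."""
    # Imported lazily because these modules exist only inside a Glue job runtime.
    from awsglue.transforms import SelectFromCollection
    from awsgluedq.transforms import EvaluateDataQuality

    results = EvaluateDataQuality().process_rows(
        frame=frame,
        ruleset=ruleset_dqdl,
        publishing_options=publishing_options,
        additional_options=additional_options,
    )
    rows = SelectFromCollection.apply(dfc=results, key="rowLevelOutcomes").toDF()
    good = rows.filter(rows["DataQualityEvaluationResult"] == "Passed")
    bad = rows.filter(rows["DataQualityEvaluationResult"] == "Failed")
    return good, bad
```

With this setup the good/bad split works as expected; the missing piece is only the entry under the table's Data Quality tab.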
Running start_data_quality_ruleset_evaluation_run against the same ruleset populates the
table tab, but returns only rule-level pass/fail, with no row-level outcomes.
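The API path can be sketched like this. It is a minimal sketch under stated assumptions: run_catalog_evaluation is a hypothetical helper, the polling interval is arbitrary, and `glue` is assumed to be a boto3 Glue client. It starts the run, waits for a terminal status, and collects the per-rule results, which is all the API gives back.

```python
import time

# Statuses after which the evaluation run will not change any further.
TERMINAL_STATUSES = ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT")


def run_catalog_evaluation(glue, database, table_name, ruleset_name, role_arn, poll_seconds=15):
    """Start a catalog-side evaluation run and return (status, [(rule_name, result), ...])."""
    run_id = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={"GlueTable": {"DatabaseName": database, "TableName": table_name}},
        Role=role_arn,
        RulesetNames=[ruleset_name],
    )["RunId"]

    # Poll until the run reaches a terminal status.
    while True:
        run = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
        if run["Status"] in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)

    # Each result carries rule-level outcomes only (PASS / FAIL / ERROR per rule).
    outcomes = []
    for result_id in run.get("ResultIds", []):
        result = glue.get_data_quality_result(ResultId=result_id)
        for rule in result.get("RuleResults", []):
            outcomes.append((rule["Name"], rule["Result"]))
    return run["Status"], outcomes
```

Called as run_catalog_evaluation(boto3.client("glue"), database, table_name, ruleset_name, role_arn), this makes the run show up under the table tab, but nothing in the response identifies which rows passed or failed.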
Questions
1. Is there a configuration of EvaluateDataQuality.process_rows that publishes results to
the catalog table's Data Quality tab in the same call?
2. Is there a way to get row-level segregation out of
start_data_quality_ruleset_evaluation_run?
3. If this split is intentional (ETL transform for row-level outcomes, API for the
catalog surface), is there a boto3 call that can post-publish an already-computed
evaluation result to the catalog table tab, so we don't have to evaluate the ruleset a
second time?