I have the following two tables:
Table article_description:
| article_code | locale | article_description |
|---|---|---|
| 0001 | EN | SAMPLE DESCRIPTION THAT MAY CONTAIN A KEYWORD ASC |
| 0001 | NL | DUMMY OMSCHRIJVING MET KEYWOORD ASC |
| 1234 | EN | SAMPLE DESCRIPTION ASC THAT MAY CONTAIN A KEYWORD |
| 1234 | NL | DUMMY OMSCHRIJVING ASC MET KEYWOORD |
| 4567 | EN | SAMPLE DESCRIPTION WITHOUT A KEYWORD |
Table article_labels:
| article_code | locale | label_code |
|---|---|---|
| 0001 | EN | QM0029 |
| 0001 | NL | QM0029 |
I would like to identify article_codes that have potentially missing quality label codes by comparing keywords that occur in the article_description, with the quality label(s) present in the other table. (Sadly we lack other more robust information to do a better check.)
One example keyword is "ASC"; each article that has this keyword in the description should have the corresponding label "QM0029". I have about 30 pairs of labels and keywords.
I'm struggling to make a nice maintainable query that can use this logic.
I started by making a query to handle 1 pair using a regex to match the isolated pattern (see code below).
I'd like to get a scaled up query to check for all the pairs of keywords. But I'm not sure how to do this in an effective and efficient manner. The only way I know to do it would be copy pasting the query below multiple times and use UNION ALL, which is not very maintainable. Is there a better way?
Some notes that might help:
I would like to return each article_code that has at least 1 locale with a missing label
Article descriptions are always capitalised; so I don't need case sensitive to search. They should only contain (capitalised) letters and/or numbers
In principle, each article_code has either 0 or 1 quality labels, but robustness to handle multiple labels per article would be a bonus
Theres about 100.000 different article_codes, and 5 different locales. Some degree of efficiency would be nice
I consider a keyword to be present when it is separated by at least one space from other words (can be in the middle, start, or end of the string)
Given the example tables, I would expect the article_code 1234 to be returned, since it has the keyword "ASC" in 2 descriptions, but the corresponding label qm0029 is missing
Please let me know if you have any questions, or have suggestions to improve this post. Additionally, I am by no means very good with SQL yet, so any other improvements/advice would also be greatly appreciated.
Cheers!
WITH article_labels AS (
SELECT article_code, locale, label_code
FROM article_label
WHERE label_code = 'QM0029'
),
article_descriptions AS (
SELECT article_code, locale
FROM article_description
WHERE article_description ~ '(^| )ASC( |$)'
AND article_description IS NOT NULL
)
SELECT DISTINCT t1.article_code, 'QM0029 missing' AS check_value
FROM article_descriptions AS t1
LEFT JOIN article_labels AS t2
ON
t1.article_code = t2.article_code
AND
t1.locale = t2.locale
WHERE t2.label_code IS NULL