Goal - Extract links from email Body
I extract links from Outlook emails, but I need to avoid counting links from the quoted thread (content from previous mails in the thread) or from the signature that clients append below a reply. To do that, I find where the new content ends and the quoted history begins, then scan `href`s only before that point.
I do this on the body content of Outlook emails fetched from Microsoft Graph. Working with Outlook mailboxes does not limit the problem to quotes produced by Outlook: I also want to strip quoted content appended by any other email client or marketing/CRM service.
Why not uniqueBody
When extracting the body from Graph, the first suggestion would be to use the `uniqueBody` property that Graph provides. Unfortunately, it is unreliable: selecting it in a Graph query intermittently fails with `ErrorItemPropertyRequestedFailed` (refer forum query). Surprisingly, requesting `body` for the same message never produced this error. The failure shows up only when `uniqueBody` is requested, regardless of how large the body content is.
ContentType preference: HTML
Extracting links is much more reliable from `href` attributes when the body content is in HTML format. With the body as plain text, it would take too much special-case handling to keep a high success rate, both for extracting all links and for targeting only actual links in the body.
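
For reference, Graph can be asked for the HTML form of `body` explicitly through the `Prefer: outlook.body-content-type` request header. Below is a minimal sketch in Python (the actual implementation is C#). The `/me/messages/{id}` endpoint and the helper name `build_body_request` are assumptions for illustration; the function only builds the request and does not send it.

```python
from urllib.parse import quote
from urllib.request import Request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_body_request(message_id, access_token):
    """Build (but do not send) a GET request for a message's HTML body.

    Hypothetical helper: the endpoint shape is an assumption, adjust it
    to /users/{id}/messages/... if that is what your app uses.
    """
    # Message ids can contain characters such as '=' or '/', so escape them.
    url = f"{GRAPH_BASE}/me/messages/{quote(message_id, safe='')}?$select=body"
    return Request(
        url,
        headers={
            "Authorization": f"Bearer {access_token}",
            # Ask Graph to return body.content as HTML rather than text.
            "Prefer": 'outlook.body-content-type="html"',
        },
    )
```

Selecting only `body` keeps `uniqueBody` out of the query, which is the property that triggered the intermittent error described above.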
Approach: regex
I scan the body content with a single compiled, case-insensitive, leftmost-match regex in C#. The match index is the cutoff, and links are scanned only before it. I also compared regex against HtmlAgilityPack for extracting the links, and regex seemed to be faster.
To build the pattern list, I went through email body content and noted the recurring patterns. When I searched the web for utilities or open-source projects that do this, I came across Talon, available in Python and .NET, but the .NET version seems incomplete for extraction from HTML. So I used it as a reference, picked suitable patterns from the ones it uses, and prepared my own list:
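
The regex-versus-parser comparison could be reproduced roughly as follows. This is a Python sketch, not the C# code that was measured: the standard-library `html.parser` stands in for HtmlAgilityPack, and `HREF_RE`, `links_by_regex` and `links_by_parser` are illustrative names. Timings will differ between runtimes and inputs, so treat it only as a harness.

```python
import re
import timeit
from html.parser import HTMLParser

# href value in double quotes, single quotes, or unquoted.
HREF_RE = re.compile(
    r"""<a\s+(?:[^>]*?\s+)?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def links_by_regex(html):
    # Exactly one of the three groups is set for each match.
    return [next(g for g in m.groups() if g is not None)
            for m in HREF_RE.finditer(html)]


class _HrefCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v is not None)


def links_by_parser(html):
    collector = _HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    sample = '<p><a class="x" href="https://example.com/a">a</a></p>' * 500
    for fn in (links_by_regex, links_by_parser):
        print(fn.__name__, timeit.timeit(lambda: fn(sample), number=20))
```

One behavioural difference to keep in mind: the parser decodes character references in attribute values (`&amp;` becomes `&`), while the regex returns the raw text, so regex results may need `html.unescape` before the URLs are used.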
| Group | Regex pattern |
|---|---|
| append | `…]*?\s+)?id\s*=\s*['"]?appendonsend` |
| reply | `…]*?\s+)?id\s*=\s*['"]?divRplyFwdMsg` |
| signature | `…]*?\s+)?id\s*=\s*['"]?Signature` |
| older Outlook desktop quote marker | `<(?:div\|…` |
| Quotes - Gmail, HubSpot/CRM, Yahoo, Thunderbird | `<(?:div\|…hs_reply\|moz-cite)` |
| Apple Mail | `…]*?type\s*=\s*['"]?cite` |
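
The cutoff-then-scan idea described above can be sketched as follows, in Python for brevity (the actual implementation is C#). The marker patterns here are illustrative assumptions built from the ids and class names in the table plus Gmail's `gmail_quote` class; they are not the exact patterns from my list, and `find_cutoff` and `extract_new_links` are hypothetical names.

```python
import re

# Illustrative markers only (assumptions, not the exact patterns in the table).
QUOTE_MARKERS = [
    # Outlook: append-on-send anchor, reply/forward header, signature block.
    r"""<div\s+(?:[^>]*?\s+)?id\s*=\s*['"]?(?:appendonsend|divRplyFwdMsg|Signature)\b""",
    # Gmail, HubSpot, Thunderbird quote containers, matched by class name.
    r"""<(?:div|blockquote)\s+(?:[^>]*?\s+)?class\s*=\s*['"]?[^'">]*?\b(?:gmail_quote|hs_reply|moz-cite)\b""",
    # Apple Mail (and others): <blockquote type="cite">.
    r"""<blockquote\s+(?:[^>]*?\s+)?type\s*=\s*['"]?cite\b""",
]

# One compiled, case-insensitive alternation; search() returns the leftmost match.
CUTOFF_RE = re.compile("|".join(f"(?:{p})" for p in QUOTE_MARKERS), re.IGNORECASE)

HREF_RE = re.compile(
    r"""<a\s+(?:[^>]*?\s+)?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def find_cutoff(html):
    """Index where quoted history or signature starts, or len(html) if none."""
    match = CUTOFF_RE.search(html)
    return match.start() if match else len(html)


def extract_new_links(html):
    """Return href values that appear before the first quote marker."""
    head = html[:find_cutoff(html)]
    return [next(g for g in m.groups() if g is not None)
            for m in HREF_RE.finditer(head)]
```

For example, given a body where a link in the new content is followed by `<div id="divRplyFwdMsg">` containing another link, `extract_new_links` returns only the first link. If no marker matches, the whole body is scanned, so a missed pattern degrades to over-counting rather than dropping links.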
Questions
Regarding coverage: am I missing any patterns from common senders (Outlook variants, Gmail, Apple Mail, Yahoo, or other marketing/CRM platforms)? Is it really possible to achieve this?
I welcome any help, advice, or suggestions on the steps I followed.