Converting HTML E-mail To Plain Text
The Battle Of Evermore...
OK, I admit it. I've caught the CRM development bug. What started as a harmless bit of fun working on document library integration between CRM & SharePoint has now developed into an obsession. In this post I will describe how to build a plug-in that examines the body of any e-mail promoted from Outlook or delivered by the e-mail router and converts the HTML into plain text.
After a bit of searching, I found a good article which showed how you could use regular expressions to remove unwanted HTML tags leaving just the plain text - Convert HTML to Plain Text. Converting this from C# to VB (my preferred choice of language) and stripping out some of the bits I didn't need, I came up with the following code which forms the basis of this plug-in.
Private Function ConvertHTMLToText(ByVal Source As String) As String
Dim result As String = Source
' Remove formatting that will prevent regex from running reliably
' \r - Matches a carriage return \u000D.
' \n - Matches a line feed \u000A.
' \f - Matches a form feed \u000C.
' For more details see https://msdn.microsoft.com/en-us/library/4edbef7e.aspx
' Replace each run with a single space so that words on adjacent source lines are not joined together
result = Replace(result, "[\r\n\f]+", " ", RegexOptions.IgnoreCase)
' Replace the most commonly used special characters:
result = Replace(result, "&lt;", "<", RegexOptions.IgnoreCase)
result = Replace(result, "&gt;", ">", RegexOptions.IgnoreCase)
result = Replace(result, "&nbsp;", " ", RegexOptions.IgnoreCase)
result = Replace(result, "&quot;", """", RegexOptions.IgnoreCase)
result = Replace(result, "&amp;", "&", RegexOptions.IgnoreCase)
' Remove ASCII character code sequences such as &#nn; and &#nnn;
result = Replace(result, "&#[0-9]{2,3};", String.Empty, RegexOptions.IgnoreCase)
' Remove all other special characters. More can be added - see the following for more details:
' https://www.degraeve.com/reference/specialcharacters.php
' https://www.web-source.net/symbols.htm
result = Replace(result, "&.{2,6};", String.Empty, RegexOptions.IgnoreCase)
' Remove all attributes and whitespace from the <head> tag
result = Replace(result, "< *head[^>]*>", "<head>", RegexOptions.IgnoreCase)
' Remove all whitespace from the </head> tag
result = Replace(result, "< */ *head *>", "</head>", RegexOptions.IgnoreCase)
' Delete everything between the <head> and </head> tags
result = Replace(result, "<head>.*</head>", String.Empty, RegexOptions.IgnoreCase)
' Remove all attributes and whitespace from all <script> tags
result = Replace(result, "< *script[^>]*>", "<script>", RegexOptions.IgnoreCase)
' Remove all whitespace from all </script> tags
result = Replace(result, "< */ *script *>", "</script>", RegexOptions.IgnoreCase)
' Delete everything between each pair of <script> and </script> tags (lazy match, so text between separate blocks is kept)
result = Replace(result, "<script>.*?</script>", String.Empty, RegexOptions.IgnoreCase)
' Remove all attributes and whitespace from all <style> tags
result = Replace(result, "< *style[^>]*>", "<style>", RegexOptions.IgnoreCase)
' Remove all whitespace from all </style> tags
result = Replace(result, "< */ *style *>", "</style>", RegexOptions.IgnoreCase)
' Delete everything between each pair of <style> and </style> tags (lazy match, so text between separate blocks is kept)
result = Replace(result, "<style>.*?</style>", String.Empty, RegexOptions.IgnoreCase)
' Insert tabs in place of <td> tags
result = Replace(result, "< *td[^>]*>", vbTab, RegexOptions.IgnoreCase)
' Insert single line breaks in place of <br> and <li> tags
result = Replace(result, "< *br[^>]*>", vbCrLf, RegexOptions.IgnoreCase)
result = Replace(result, "< *li[^>]*>", vbCrLf, RegexOptions.IgnoreCase)
' Insert double line breaks in place of <p>, <div> and <tr> tags
result = Replace(result, "< *div[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
result = Replace(result, "< *tr[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
result = Replace(result, "< *p[^>]*>", vbCrLf + vbCrLf, RegexOptions.IgnoreCase)
' Remove all remaining HTML tags
result = Replace(result, "<[^>]*>", String.Empty, RegexOptions.IgnoreCase)
' Replace repeating spaces with a single space
result = Replace(result, " +", " ")
' Remove any trailing spaces and tabs from the end of each line
result = Replace(result, "[ \t]+\r\n", vbCrLf)
' Remove any leading whitespace characters
result = Replace(result, "^[\s]+", String.Empty)
' Remove any trailing whitespace characters
result = Replace(result, "[\s]+$", String.Empty)
' Remove extra line breaks if there are more than two in a row
result = Replace(result, "\r\n\r\n(\r\n)+", vbCrLf + vbCrLf)
' That's it.
Return result
End Function
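The function above decodes a handful of named entities by hand and deletes the rest. As an alternative sketch, assuming the plug-in targets .NET Framework 4.0 or later, `System.Net.WebUtility.HtmlDecode` decodes named, decimal and hexadecimal entities in one call (on earlier framework versions, `System.Web.HttpUtility.HtmlDecode` plays a similar role). The module name and sample string below are made up for illustration.

```vb
Imports System
Imports System.Net

Module EntityDecodeSketch
    Sub Main()
        ' Sample input containing named, decimal and hexadecimal entities.
        Dim encoded As String = "Espa&ntilde;a &amp; M&eacute;xico &lt;ok&gt; &#169; &#x20AC;"

        ' HtmlDecode handles all three entity forms in a single pass.
        Dim decoded As String = WebUtility.HtmlDecode(encoded)

        ' decoded now holds: España & México <ok> © €
        Console.WriteLine(decoded)
    End Sub
End Module
```

If this approach is used, it is probably safest to call it after the tags have been stripped, so that a decoded `<` or `>` is not mistaken for markup. Note also that `&nbsp;` decodes to a non-breaking space (U+00A0) rather than an ordinary space, so it may still need its own replacement.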
All that remains is to implement the IPlugin.Execute method. In order to be able to modify the e-mail message before the e-mail activity gets created in the database, I had to figure out which event(s) to intercept. Through a bit of trial and error, I observed that any e-mail promoted from Outlook triggers the "DeliverPromote" event, whereas any incoming e-mail handled by the e-mail router triggers the "DeliverIncoming" event. Interestingly enough, the "Create" event was also called as a child pipeline for these events, but modifying the message here didn't have any effect, even in the pre-processing stage.
Because plug-ins have the potential to introduce significant performance and scalability issues into your environment, it is important to ensure that the code is as efficient as possible. To that end I added checks to ensure that, even if the plug-in is registered on multiple events, the main code will only run if the plug-in:
- is running on the 'DeliverPromote' or 'DeliverIncoming' messages
- is running synchronously
- is running against the 'Email' entity
- is running in the 'pre-processing' stage of the pipeline
- is running in a 'Parent' pipeline
Public Class ConvertHtmlToText
Implements IPlugin
Public Sub Execute(ByVal context As IPluginExecutionContext) Implements IPlugin.Execute
' Exit if any of the following conditions are true:
' 1. plug-in is not running synchronously
' 2. plug-in is not running against the 'Email' entity
' 3. plug-in is not running in the 'pre-processing' stage of the pipeline
' 4. plug-in is not running in a 'Parent' pipeline
' Mode 0 = synchronous, Stage 10 = pre-processing, InvocationSource 0 = parent pipeline
If context.Mode <> 0 OrElse context.PrimaryEntityName <> "email" OrElse context.Stage <> 10 OrElse context.InvocationSource <> 0 Then
Exit Sub
End If
If context.MessageName = "DeliverPromote" OrElse context.MessageName = "DeliverIncoming" Then
For Each item In context.InputParameters.Properties
If (item.Name = "Body") Then
context.InputParameters.Properties.Item("Body") = ConvertHTMLToText(CStr(item.Value))
' Stop enumerating once the body has been replaced
Exit For
End If
Next
End If
End Sub
End Class
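The guard conditions listed earlier can also be sketched as a small function that does not depend on the CRM SDK, which makes them easy to check in isolation. This is only an illustration: the module, constant and function names are hypothetical, and the literal values (0, 10, 0) are simply the ones used in the Execute method above.

```vb
Imports System

Module PluginGuardSketch
    ' Hypothetical names for the literal values used in the Execute method.
    Private Const SynchronousMode As Integer = 0
    Private Const PreProcessingStage As Integer = 10
    Private Const ParentInvocation As Integer = 0

    ' Returns True only when every condition from the list is met.
    Function ShouldConvert(ByVal mode As Integer, ByVal entityName As String, _
                           ByVal stage As Integer, ByVal invocationSource As Integer, _
                           ByVal messageName As String) As Boolean
        If mode <> SynchronousMode Then Return False
        If entityName <> "email" Then Return False
        If stage <> PreProcessingStage Then Return False
        If invocationSource <> ParentInvocation Then Return False
        Return messageName = "DeliverPromote" OrElse messageName = "DeliverIncoming"
    End Function

    Sub Main()
        Console.WriteLine(ShouldConvert(0, "email", 10, 0, "DeliverIncoming")) ' True
        Console.WriteLine(ShouldConvert(0, "email", 10, 0, "Create"))          ' False (wrong message)
        Console.WriteLine(ShouldConvert(1, "email", 10, 0, "DeliverPromote"))  ' False (mode is not 0)
    End Sub
End Module
```

Keeping the checks in one place like this means the cheap comparisons run first and the conversion is only attempted when all of them pass.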
As always, I have included the source code to my project here. Please do bear in mind that I haven't included any error handling or logging, so it's not production-ready. However, it should provide you with a good head start.
This posting is provided "AS IS" with no warranties, and confers no rights.
Comments
Anonymous
March 15, 2011
Hi, I've adapted your plug-in to work with CRM 2011. Here is the code snippet I had to change. I added references to Microsoft.Crm.Sdk.Proxy and Microsoft.Xrm.Sdk:
Public Class ConvertHtmlToText
Implements IPlugin
Public Sub Execute(ByVal serviceProvider As System.IServiceProvider) Implements Microsoft.Xrm.Sdk.IPlugin.Execute
Dim context As Microsoft.Xrm.Sdk.IPluginExecutionContext = DirectCast(serviceProvider.GetService(GetType(Microsoft.Xrm.Sdk.IPluginExecutionContext)), IPluginExecutionContext)
' Exit if any of the following conditions are true:
' 1. plug-in is not running synchronously
' 2. plug-in is not running against the 'Email' entity
' 3. plug-in is not running in the 'pre-processing' stage of the pipeline
' 4. plug-in is not running in a 'Parent' pipeline (now, this is configurable in the registration tool, I guess, because I couldn't find an equivalent)
If Not (context.Mode = 0) Or Not (context.PrimaryEntityName = "email") Or Not (context.Stage = 10) Then ' Or Not (context.InvocationSource = 0)
Exit Sub
End If
If (context.MessageName = "DeliverPromote") Or (context.MessageName = "DeliverIncoming") Then
Try
For Each elemento In context.InputParameters
If (elemento.Key = "Body") Then
Dim contenido As String = CStr(elemento.Value)
context.InputParameters.Item("Body") = ConvertHTMLToText(contenido)
'Throw New System.Exception("The value of key was modified: Value=" + context.InputParameters.Item("Body")) 'CStr(elemento.Value)) ' + elemento.ToString())
Exit For
End If
Next
Catch ex As Exception
Throw New System.Exception("The value of key was modified: " + ex.Message)
End Try
End If
End Sub
Also, I've added these replace statements, because I receive mails in Spanish:
result = Replace(result, "&aacute;", "á", RegexOptions.IgnoreCase)
result = Replace(result, "&eacute;", "é", RegexOptions.IgnoreCase)
result = Replace(result, "&iacute;", "í", RegexOptions.IgnoreCase)
result = Replace(result, "&oacute;", "ó", RegexOptions.IgnoreCase)
result = Replace(result, "&uacute;", "ú", RegexOptions.IgnoreCase)
result = Replace(result, "&Aacute;", "Á", RegexOptions.IgnoreCase)
result = Replace(result, "&Eacute;", "É", RegexOptions.IgnoreCase)
result = Replace(result, "&Iacute;", "Í", RegexOptions.IgnoreCase)
result = Replace(result, "&Oacute;", "Ó", RegexOptions.IgnoreCase)
result = Replace(result, "&Uacute;", "Ú", RegexOptions.IgnoreCase)
result = Replace(result, "&Ntilde;", "Ñ", RegexOptions.IgnoreCase)
result = Replace(result, "&ntilde;", "ñ", RegexOptions.IgnoreCase)
result = Replace(result, " ", vbCrLf, RegexOptions.IgnoreCase)
If you see something wrong let me know, but it's working like a charm. Regards
Anonymous
March 15, 2011
Nice one, Jorge. If I get a chance, I will republish in a new post. I wonder if there is a better way of identifying all language-specific character sets, rather than adding an exception for each character?
Anonymous
March 23, 2011
Hi, it would be great if you could just give me a keyword to google to find out how to implement such code in Dynamics CRM... Thanks a lot!
Anonymous
March 23, 2011
Hi again. I finally managed to implement the code in Dynamics using the plug-in registration tool. Now, I thought there would be a custom step in the workflow area... wrong again :) What do I have to do to remove the HTML tags from my mails? Thank you, regards
Anonymous
March 24, 2011
Hi Nicolas, after you have registered the plug-in, you need to register it against two specific events (steps). You can register these steps in the plug-in registration tool as well:
- Event: DeliverPromote; Entity: email
- Event: DeliverIncoming; Entity: email
Make sure these are registered to run synchronously in the pre-processing stage of the event pipeline. Best regards, Simon
- Anonymous
January 18, 2017
Hi Simon, I have been trying to implement the above code in CRM 365 on-premises. I built it and then used the Registration Tool, but it is not working, so I may have implemented your guide wrongly:
- Event: DeliverPromote; Entity: email
- Event: DeliverIncoming; Entity: email
- Make sure these are registered to run synchronously in the pre-processing stage of the event pipeline
Please advise me how to implement the above three steps.
- Anonymous
Anonymous
March 25, 2011
Hi Simon, thank you very much! Everything works fine! I expected a custom "action" for workflows... Now I know that your plug-in converts all mail messages. I'll keep searching :) Have a nice day. Best regards, Nicolas
Anonymous
April 03, 2012
Once this plug-in is compiled into a DLL, is it then somehow installed in Outlook? Once installed, how is it triggered: by a button? I would like to modify this to trigger when I click "Convert Email to CRM Case" and parse the HTML body to auto-populate the Case fields. Please forgive my foolish questions. Thanks
Anonymous
April 03, 2012
A plug-in only runs when triggered by a CRM event (such as DeliverPromote), and does not show up in the Outlook or Web UI. To be able to use this as part of the "Convert E-mail To Case" function in the Outlook client, you will need to work out which events are triggered, and modify the plug-in to work with those events. Unfortunately I can't check this for the next couple of weeks, as I am on vacation right now. Best regards, Simon