remove HTML coding?

Anonymous
2015-03-09T23:03:56+00:00

Is there an easy way to remove HTML coding in Word for Mac 2011?  Way back in the last century, when Word used some other basic software, a friend of mine wrote a dandy macro that would automatically remove the HTML from any document open in Word.   That macro does not work now.   

The problem is with Visual Basic.   I do not remember whether Visual Basic is what Word used to use, or if it is what Word uses now.  What matters is that my trusty macro no longer works.

Here's how this arises. I like to save a good many texts that I find on the Web. One way to save the text is as a Web Archive. The problem there is that I do my work in Word, not Safari. When I remember, the best way to save such web-based text is to copy it out of Safari and paste it into a Word document.

Sometimes I don't remember.  Or sometimes I find an old document on the hard drive that I saved before Word made the switch to? from? Visual Basic.  When I open such a document in Word, all that HTML coding is there clogging up the text.  Removing all the coding manually is tedious and time-consuming.

I would love to have a macro as in the old days so that I could sit and watch Word do its stuff. It was both efficient and entertaining to watch Word jump back and forth in a document far faster than any human could, removing the coding unnecessary to a text.

Is there any technique for removing HTML coding in Word for Mac 2011?

Answer accepted by question author

John Korchok 232.4K Reputation points Volunteer Moderator
2015-03-10T22:23:06+00:00

Yes, I got the same result when working on a large page that included JavaScript and CSS. The macro ran when I stripped those out. That's what I was referring to when I wrote "the macro ran out of memory on large complex pages". 

The macro is about 20 years old, and web pages were a lot smaller then. The macro works by selecting the entire page and loading it into a string. On a large page that string is too big and overloads the buffer. It's not how I would write a macro today. Frankly, I think it's pretty amazing that 20-year-old code can run at all.
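For anyone who wants to avoid the big string entirely, here is a minimal sketch of one alternative: let Word's own wildcard Find and Replace delete anything that looks like a tag. This is not the code in the template. The name DeleteTagsWithWildcards is made up for illustration, and I'm assuming the wildcard pattern behaves in Word 2011 the way it does in other versions, so please try it on a copy first.

```vba
' Sketch only: delete tag-like text with a wildcard search instead of
' copying the document into a string. Not part of the posted template.
Sub DeleteTagsWithWildcards()
    Dim rng As Range
    Set rng = ActiveDocument.Content

    With rng.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        ' In wildcard mode < and > are special characters,
        ' so each one is escaped with a backslash.
        .Text = "\<*\>"
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchCase = False
        .MatchWildcards = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

This only removes the markup itself. Anything sitting between script or style tags is ordinary text as far as the search is concerned, so it stays behind and would still need to be deleted by hand.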

The other choice you have is to simply open the HTML file using Word's HTML filter instead of its text filter, then save as a Word document. If you're just after the text in the HTML, save the Word file as text and all markup is removed.
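If you have a lot of old files, that open-and-save-as-text route can be scripted too. The following is a rough sketch under a few assumptions: both file paths are placeholders, the colon-separated path style is what I'd expect Word 2011 on the Mac to want, and the name SaveHtmlAsPlainText is my own.

```vba
' Sketch only: open a file through Word's HTML converter, then save
' the result as plain text. Both paths below are hypothetical.
Sub SaveHtmlAsPlainText()
    Dim htmlPath As String
    Dim textPath As String
    Dim doc As Document

    htmlPath = "Macintosh HD:Users:me:Documents:saved-page.html" ' placeholder
    textPath = "Macintosh HD:Users:me:Documents:saved-page.txt"  ' placeholder

    ' Format:=wdOpenFormatWebPages asks Word to treat the file as HTML
    ' rather than showing the raw markup as text.
    Set doc = Documents.Open(FileName:=htmlPath, ConfirmConversions:=False, _
                             ReadOnly:=True, AddToRecentFiles:=False, _
                             Format:=wdOpenFormatWebPages)

    ' Saving as text drops the formatting along with the markup.
    doc.SaveAs FileName:=textPath, FileFormat:=wdFormatText
    doc.Close SaveChanges:=wdDoNotSaveChanges
End Sub
```

Swap wdFormatText for wdFormatDocument if you'd rather keep the formatting that Word's converter produces.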

I'm posting the basic code here, in case someone feels like rewriting it from scratch. The HTMLForm is in the template linked in my previous message. It's nothing fancy, just an interface for choosing which macro to run:

Public SubToRun$

Sub HTMLCleanup()
  Load HTMLForm
  HTMLForm.Show
  Select Case SubToRun$
    Case "DeleteHTML"
      Call DeleteHTML
    Case "FindAmpersand"
      Call FindAmpersand
    Case "FindExtended"
      Call FindExtended
    Case Else
  End Select
End Sub

'Part 1 -- Remove HTML comments
Sub DeleteHTML()
  Application.ScreenUpdating = False
  WordBasic.EditSelectAll
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  InitLen = Len(SelText$)
Loop1:
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  NewLen = Len(SelText$)
  'WordBasic.Print Int(100 - (NewLen * 100 / InitLen)); "%"
  StartPos = InStr(1, SelText$, "<!-")
  EndPos = InStr(1, SelText$, "->")
  If StartPos = 0 Or EndPos = 0 Then GoTo EndSub1
  WordBasic.EditGoTo "\StartOfSel"
  If StartPos <> 1 Then WordBasic.CharRight StartPos - 1
  WordBasic.CharRight EndPos - StartPos + 2, 1
  WordBasic.EditClear
  Count = Count + 1
  WordBasic.ExtendSelection
  WordBasic.EditGoTo "\EndOfDoc"
  WordBasic.Cancel
  GoTo Loop1
EndSub1:
  Application.ScreenUpdating = True
  Beep
  MsgBox Str$(Count) + " '<!-->' codes removed" + Chr$(13) + "End Part 1"

'Part 2 -- Remove HTML proper
  WordBasic.EditSelectAll
  Application.ScreenUpdating = False
Loop2:
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  NewLen = Len(SelText$)
  'WordBasic.Print Int(100 - (NewLen * 100 / InitLen)); "%"
  StartPos = InStr(1, SelText$, "<")
  EndPos = InStr(1, SelText$, ">")
  If StartPos = 0 Or EndPos = 0 Then GoTo EndSub2
  WordBasic.EditGoTo "\StartOfSel"
  If StartPos <> 1 Then WordBasic.CharRight StartPos - 1
  WordBasic.CharRight EndPos - StartPos + 1, 1
  WordBasic.EditClear
  Count = Count + 1
  WordBasic.ExtendSelection
  WordBasic.EditGoTo "\EndOfDoc"
  WordBasic.Cancel
  GoTo Loop2
EndSub2:
  Application.ScreenUpdating = True
  WordBasic.Beep:  Beep
  MsgBox Str$(Count) + " '<>' codes removed" + Chr$(13) + "End Macro"
End Sub

Sub FindAmpersand()
  WordBasic.EditSelectAll
Loop3:
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  StartPos = InStr(1, SelText$, "&")
  EndPos = InStr(1, SelText$, ";")
  If StartPos = 0 Or EndPos = 0 Then GoTo EndSub
  WordBasic.EditGoTo "\StartOfSel"
  If StartPos <> 1 Then WordBasic.CharRight StartPos - 1
  If StartPos > EndPos Then     'Check for ";" not part of HTML code
    WordBasic.ExtendSelection
    WordBasic.EditGoTo "\EndOfDoc"
    WordBasic.Cancel
    GoTo Loop3
  End If
  WordBasic.CharRight EndPos - StartPos + 1, 1
  WordBasic.CharColor 9
  Count = Count + 1
  WordBasic.CharRight 1
  WordBasic.ExtendSelection
  WordBasic.EditGoTo "\EndOfDoc"
  WordBasic.Cancel
  GoTo Loop3
EndSub:
  MsgBox Str$(Count) + " '&;' codes highlighted blue."
End Sub

Sub FindExtended()
  With Selection.Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Replacement.Font.Color = wdColorRed
    .Text = "[^0127-^0255]"
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindContinue
    .Format = True
    .MatchWildcards = True
    .Execute Replace:=wdReplaceAll
  End With
  MsgBox "Extended ASCII characters highlighted red."
End Sub
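If someone does take up the rewrite, one possible starting point is to do the stripping on a plain string in a single function instead of moving the selection around the document. The sketch below keeps the same two passes as DeleteHTML (comments first, then tags). The function name StripHtmlTags is hypothetical. Unlike the macro above, it looks for the full "<!--" and "-->" markers, and it searches for the closing marker only after the opening one, so a stray ">" earlier in the text is left alone.

```vba
' Sketch only: remove HTML comments, then remaining tags, from a string.
' Uses nothing from Word, so it can be tried out in any VBA module.
Function StripHtmlTags(ByVal source As String) As String
    Dim result As String
    Dim startPos As Long
    Dim endPos As Long

    result = source

    ' Pass 1: remove comments of the form <!-- ... -->
    startPos = InStr(1, result, "<!--")
    Do While startPos > 0
        endPos = InStr(startPos + 4, result, "-->")
        If endPos = 0 Then Exit Do          ' unterminated comment: stop
        result = Left$(result, startPos - 1) & Mid$(result, endPos + 3)
        startPos = InStr(startPos, result, "<!--")
    Loop

    ' Pass 2: remove anything from "<" up to the next ">"
    startPos = InStr(1, result, "<")
    Do While startPos > 0
        endPos = InStr(startPos + 1, result, ">")
        If endPos = 0 Then Exit Do          ' unterminated tag: stop
        result = Left$(result, startPos - 1) & Mid$(result, endPos + 1)
        startPos = InStr(startPos, result, "<")
    Loop

    StripHtmlTags = result
End Function
```

For example, StripHtmlTags("<p>Hi</p>") returns "Hi". To apply it to an open document you could assign the result back to ActiveDocument.Content.Text, with the caveat that doing so replaces the whole document text and discards any formatting, so it really only suits files opened as plain text.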


11 additional answers

  1. Anonymous
    2015-03-10T19:28:50+00:00

    When I ran the first two choices in the macro (remove HTML and ampersands), I got this dialog box:

    When I ran the third task in the macro--removing ASCII--the macro changed the color of the ASCII characters from black to red.  That is all it did.  I did not get the dialog box.

    At first I did try running the first step (remove HTML) on a document open as "docx."  The macro gave me the above dialog box.  I then saved the document as a text file and ran the macro again, with the results stated above.

  2. John Korchok 232.4K Reputation points Volunteer Moderator
    2015-03-10T18:48:36+00:00

    You weren't kidding that you've had this macro a while! I haven't worked on WordBasic for 15 years!

    Here is a slightly updated version: Updated HTML Cleaner Template. This template needs to be attached to the HTML file that you're processing. To attach the template to a file:

    1. Open the HTML file in Word as a text file.
    2. Use Word's Tools>Templates and Add-ins command.
    3. Click on the Attach button.
    4. Navigate to the HTML.dotm template file, select it and OK out.
    5. Word will warn about macros. Choose to Enable Macros.

    Now you can run the macros:

    1. In Word, choose Tools>Macro>Macros.
    2. Pick HTMLCleanup, then click on the Run button.
    3. The dialog will open allowing you to run the other three macros.
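    If you find yourself repeating those steps often, they could probably be scripted. This is only a sketch: the template path is a placeholder, the name AttachCleanerAndRun is made up, and the macro would have to live somewhere other than HTML.dotm itself (your Normal template, for instance).

    ```vba
    ' Sketch only: attach the cleaner template to the active document,
    ' then start its entry-point macro. The path below is hypothetical.
    Sub AttachCleanerAndRun()
        Const TEMPLATE_PATH As String = "Macintosh HD:Users:me:Documents:HTML.dotm"

        ' Same effect as Tools>Templates and Add-ins, then Attach.
        ActiveDocument.AttachedTemplate = TEMPLATE_PATH

        ' Same effect as Tools>Macro>Macros, HTMLCleanup, Run.
        Application.Run "HTMLCleanup"
    End Sub
    ```

    Word may still show its macro warning when the template loads, just as it does when you attach the template by hand.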

    I didn't completely rewrite these; I just added the WordBasic object in front of the old WordBasic commands. If you're doing large web pages, it helps if you delete JavaScript and CSS from the pages first to reduce the size. On a couple of tests, the macro ran out of memory on large complex pages.

    I'm not sure if the second two macros are working as originally intended. You would have a better idea of what effect is expected. Please test first.

  3. Anonymous
    2015-03-10T13:32:30+00:00

    REM *** Remove HTML Coding ***
    REM Version 1.1

    Sub MAIN
    Begin Dialog UserDialog 281, 121, "Remove HTML Coding"
         OKButton 167, 85, 65, 20
         CancelButton 18, 85, 65, 20
         OptionGroup  .Option
              OptionButton 10, 5, 177, 18, "Remove HTML Coding"
              OptionButton 10, 25, 203, 18, "Locate Ampersand Codes"
              OptionButton 10, 45, 187, 18, "Locate Extended ASCII"
    End Dialog
    Dim Dlg As UserDialog
    x = Dialog(Dlg)
    If x = 0 Then Goto EndSub
    Select Case Dlg.Option
         Case 0
              Call DeleteHTML
         Case 1
              Call FindAmpersand
         Case 2
              Call FindExtended
    End Select
    EndSub:
    End Sub

    'Part 1 -- Remove HTML comments
    Sub DeleteHTML
    EditSelectAll
    SelText$ = GetText$(GetSelStartPos(), GetSelEndPos())
    InitLen = Len(SelText$)
    ScreenUpdating 0
    Loop1:
    SelText$ = GetText$(GetSelStartPos(), GetSelEndPos())
    NewLen = Len(SelText$)
    Print Int(100 - (NewLen * 100 / InitLen)); "%"
    StartPos = InStr(1, SelText$, "<!-")
    EndPos = InStr(1, SelText$, "->")
    If StartPos = 0 Or EndPos = 0 Then Goto EndSub1
    EditGoTo "\StartOfSel"
    If StartPos <> 1 Then CharRight StartPos - 1
    CharRight EndPos - StartPos + 2, 1
    EditClear
    Count = Count + 1
    ExtendSelection
    EditGoTo "\EndOfDoc"
    Cancel
    Goto Loop1
    EndSub1:
    ScreenUpdating 1
    Beep
    MsgBox Str$(Count) + " '<!-->' codes removed" + Chr$(13) + "End Part 1"

    'Part 2 -- Remove HTML proper
    EditSelectAll
    ScreenUpdating 0
    Loop2:
    SelText$ = GetText$(GetSelStartPos(), GetSelEndPos())
    NewLen = Len(SelText$)
    Print Int(100 - (NewLen * 100 / InitLen)); "%"
    StartPos = InStr(1, SelText$, "<")
    EndPos = InStr(1, SelText$, ">")
    If StartPos = 0 Or EndPos = 0 Then Goto EndSub2
    EditGoTo "\StartOfSel"
    If StartPos <> 1 Then CharRight StartPos - 1
    CharRight EndPos - StartPos + 1, 1
    EditClear
    Count = Count + 1
    ExtendSelection
    EditGoTo "\EndOfDoc"
    Cancel
    Goto Loop2
    EndSub2:
    ScreenUpdating 1
    Beep : Beep
    MsgBox Str$(Count) + " '<>' codes removed" + Chr$(13) + "End Macro"
    End Sub

    Sub FindAmpersand
    EditSelectAll
    Loop:
    SelText$ = GetText$(GetSelStartPos(), GetSelEndPos())
    StartPos = InStr(1, SelText$, "&")
    EndPos = InStr(1, SelText$, ";")
    If StartPos = 0 Or EndPos = 0 Then Goto EndSub
    EditGoTo "\StartOfSel"
    If StartPos <> 1 Then CharRight StartPos - 1
    If StartPos > EndPos Then     'Check for ";" not part of HTML code
         ExtendSelection
         EditGoTo "\EndOfDoc"
         Cancel
         Goto Loop
    End If
    CharRight EndPos - StartPos + 1, 1
    CharColor 9
    Count = Count + 1
    CharRight 1
    ExtendSelection
    EditGoTo "\EndOfDoc"
    Cancel
    Goto Loop
    EndSub:
    MsgBox Str$(Count) + " '&;' codes highlighted blue."
    End Sub

    Sub FindExtended
    EditFindClearFormatting
    EditReplaceClearFormatting
    EditReplaceFont .Color = 13
    EditReplace .Find = "[^0127-^0255]", .Replace = "", .Direction = 0, .MatchCase = 0, .WholeWord = 0, .PatternMatch = 1, .SoundsLike = 0, .ReplaceAll, .Format = 1, .Wrap = 1
    EditReplace .PatternMatch = 0
    EditFindClearFormatting
    EditReplaceClearFormatting
    MsgBox "Extended ASCII characters highlighted red."
    End Sub

  4. John Korchok 232.4K Reputation points Volunteer Moderator
    2015-03-10T00:25:54+00:00

    Visual Basic for Applications is still in Word 2011. Your old macro may just need a tweak or two to get it running again. You can post it here and we'll take a look at it.
