
remove HTML coding?

Anonymous
2015-03-09T23:03:56+00:00

Is there an easy way to remove HTML coding in Word for Mac 2011? Way back in the last century, when Word used some other basic software, a friend of mine wrote a dandy macro that would automatically remove the HTML from any document open in Word. That macro does not work now.

The problem is with Visual Basic.   I do not remember whether Visual Basic is what Word used to use, or if it is what Word uses now.  What matters is that my trusty macro no longer works.

Here's how this arises. I like to save a good many texts that I find on the Web. One way to save the text is as a Web Archive. The problem there is that I do my work in Word, not Safari. When I remember, the best way to save such web-based text is to copy it out of Safari and paste it into a Word document.

Sometimes I don't remember.  Or sometimes I find an old document on the hard drive that I saved before Word made the switch to? from? Visual Basic.  When I open such a document in Word, all that HTML coding is there clogging up the text.  Removing all the coding manually is tedious and time-consuming.

I would love to have a macro as in the old days so that I could sit and watch Word do its stuff. It was both efficient and entertaining to watch Word jump back and forth in a document far faster than any human could, removing the coding unnecessary to a text.

Is there any technique for removing HTML coding in Word for Mac 2011?


Answer accepted by question author

John Korchok 232.4K Reputation points Volunteer Moderator
2015-03-10T22:23:06+00:00

Yes, I got the same result when working on a large page that included JavaScript and CSS. The macro ran when I stripped those out. That's what I was referring to when I wrote "the macro ran out of memory on large complex pages". 

The macro is about 20 years old, and web pages were a lot smaller then. The macro works by selecting the entire page and loading that into a string. On a large page the string is too big and overloads the buffer. It's not how I would write a macro today. Frankly, I think it's pretty amazing that 20-year-old code can run at all.

The other choice you have is to simply open the HTML file using Word's HTML filter instead of its text filter, then save as a Word document. If you're just after the text in the HTML, save the Word file as text and all markup is removed.
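The open-as-HTML, save-as-text route described above could also be scripted. The following is only a sketch: the procedure name `SaveHtmlAsPlainText` and its two path parameters are hypothetical, and it has not been tested on Word for Mac 2011. It assumes that `Documents.Open` with `Format:=wdOpenFormatWebPages` selects Word's HTML converter in your version.

```vba
' A sketch only: the procedure name and both path arguments are hypothetical.
Sub SaveHtmlAsPlainText(ByVal HtmlPath As String, ByVal TextPath As String)
  Dim doc As Document

  ' Ask Word to read the file as a web page rather than as plain text,
  ' without prompting for a converter.
  Set doc = Documents.Open(FileName:=HtmlPath, ConfirmConversions:=False, _
                           Format:=wdOpenFormatWebPages)

  ' Saving as plain text keeps the text and drops the formatting.
  doc.SaveAs FileName:=TextPath, FileFormat:=wdFormatText

  ' Close the document without saving again.
  doc.Close SaveChanges:=wdDoNotSaveChanges
End Sub
```

Pass full paths in whatever form your version of Word expects. Because Word's converter does the parsing, this avoids loading the whole page into a string.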

I'm posting the basic code here, in case someone feels like rewriting this from scratch. The HTMLForm is in the template linked to in my previous message. It's nothing fancy, just an interface for choosing which macro to run:

Public SubToRun$

Sub HTMLCleanup()
  Load HTMLForm
  HTMLForm.Show
  Select Case SubToRun$
    Case "DeleteHTML"
      Call DeleteHTML
    Case "FindAmpersand"
      Call FindAmpersand
    Case "FindExtended"
      Call FindExtended
    Case Else
  End Select
End Sub

'Part 1 -- Remove HTML comments
Sub DeleteHTML()
  Application.ScreenUpdating = False
  WordBasic.EditSelectAll
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  InitLen = Len(SelText$)
Loop1:
  'Reload the remaining text (from the last deletion to the end of the document)
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  NewLen = Len(SelText$)
  'WordBasic.Print Int(100 - (NewLen * 100 / InitLen)); "%"
  StartPos = InStr(1, SelText$, "<!-")
  If StartPos = 0 Then GoTo EndSub1
  'Look for the closing "->" only after the opening "<!-"
  EndPos = InStr(StartPos, SelText$, "->")
  If EndPos = 0 Then GoTo EndSub1
  WordBasic.EditGoTo "\StartOfSel"
  If StartPos <> 1 Then WordBasic.CharRight StartPos - 1
  WordBasic.CharRight EndPos - StartPos + 2, 1
  WordBasic.EditClear
  Count = Count + 1
  WordBasic.ExtendSelection
  WordBasic.EditGoTo "\EndOfDoc"
  WordBasic.Cancel
  GoTo Loop1
EndSub1:
  Application.ScreenUpdating = True
  Beep
  MsgBox Str$(Count) + " '<!-->' codes removed" + Chr$(13) + "End Part 1"

'Part 2 -- Remove HTML proper
  WordBasic.EditSelectAll
  Application.ScreenUpdating = False
Loop2:
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  NewLen = Len(SelText$)
  'WordBasic.Print Int(100 - (NewLen * 100 / InitLen)); "%"
  StartPos = InStr(1, SelText$, "<")
  If StartPos = 0 Then GoTo EndSub2
  'Look for the closing ">" only after the opening "<"
  EndPos = InStr(StartPos, SelText$, ">")
  If EndPos = 0 Then GoTo EndSub2
  WordBasic.EditGoTo "\StartOfSel"
  If StartPos <> 1 Then WordBasic.CharRight StartPos - 1
  WordBasic.CharRight EndPos - StartPos + 1, 1
  WordBasic.EditClear
  Count = Count + 1
  WordBasic.ExtendSelection
  WordBasic.EditGoTo "\EndOfDoc"
  WordBasic.Cancel
  GoTo Loop2
EndSub2:
  Application.ScreenUpdating = True
  WordBasic.Beep:  Beep
  MsgBox Str$(Count) + " '<>' codes removed" + Chr$(13) + "End Macro"
End Sub

Sub FindAmpersand()
  WordBasic.EditSelectAll
Loop3:
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  StartPos = InStr(1, SelText$, "&")
  EndPos = InStr(1, SelText$, ";")
  If StartPos = 0 Or EndPos = 0 Then GoTo EndSub
  WordBasic.EditGoTo "\StartOfSel"
  If StartPos <> 1 Then WordBasic.CharRight StartPos - 1
  If StartPos > EndPos Then     'Check for ";" not part of HTML code
    WordBasic.ExtendSelection
    WordBasic.EditGoTo "\EndOfDoc"
    WordBasic.Cancel
    GoTo Loop3
  End If
  WordBasic.CharRight EndPos - StartPos + 1, 1
  WordBasic.CharColor 9
  Count = Count + 1
  WordBasic.CharRight 1
  WordBasic.ExtendSelection
  WordBasic.EditGoTo "\EndOfDoc"
  WordBasic.Cancel
  GoTo Loop3
EndSub:
  MsgBox Str$(Count) + " '&;' codes highlighted blue."
End Sub

Sub FindExtended()
  With Selection.Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Replacement.Font.Color = wdColorRed
    .Text = "[^0127-^0255]"
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindContinue
    .Format = True
    .MatchWildcards = True
    .Execute Replace:=wdReplaceAll
  End With
  MsgBox "Extended ASCII characters highlighted red."
End Sub
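
For anyone attempting the rewrite, one possible direction is to let Word's own Find and Replace do the work, so the page is never copied into a string. The sketch below is an assumption-laden illustration, not the original author's code: the name `StripTagsWithFind` is hypothetical, and it has not been tested on Word for Mac 2011 or on large pages. It relies on Word's wildcard search, in which `<` and `>` are word-boundary operators and must be escaped with a backslash to match literally.

```vba
' Hypothetical rewrite; a sketch, not the original macro.
Sub StripTagsWithFind()
  With ActiveDocument.Content.Find
    .ClearFormatting
    .Replacement.ClearFormatting
    ' "\<" and "\>" match literal angle brackets; "*" matches the text between them.
    .Text = "\<*\>"
    ' Replace each match with nothing, which deletes the tag.
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindStop
    .Format = False
    .MatchWildcards = True
    .Execute Replace:=wdReplaceAll
  End With
End Sub
```

Like Part 2 of the macro above, this removes only the tags themselves. Script and style text between the tags stays in the document, and a comment that contains ">" would be cut short at that character.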


11 additional answers

  1. Jim G 134K Reputation points MVP Volunteer Moderator
    2015-03-14T18:56:48+00:00

    Great Answers! Nicely done!

  2. John Korchok 232.4K Reputation points Volunteer Moderator
    2015-03-12T15:41:55+00:00

    You can make Word give you a choice of how it will convert a document by following these steps:

    1. In Word, choose the Word menu, then Preferences.
    2. Click on the General tab.
    3. Select the "Confirm conversion at Open" checkbox, then click OK.

    Now when you open an HTML file, you'll get a list of possible formats. By default Word will highlight the HTML choice. The HTML filter will interpret the page styling and create analogous styling in Word.

    When opening HTML, you could also choose to open it as Text, in which case it will just copy all the HTML markup as editable text.
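
    The same preference can probably be switched on from a macro as well. This is a minimal sketch, assuming the `Options.ConfirmConversions` property behaves in Word for Mac 2011 as it does in other versions of Word; the procedure name is hypothetical.

    ```vba
    ' Hypothetical helper: turns on the "Confirm conversion at Open" preference.
    Sub TurnOnConfirmConversion()
      Options.ConfirmConversions = True
    End Sub
    ```

    After it runs, opening an HTML file should bring up the same list of formats described above.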

  3. Anonymous
    2015-03-12T15:19:20+00:00

    Interesting.  I have not heard of an HTML filter in Word.  And "Word Help" does not recognize the term "HTML Filter."    But, of course, there are a great many features hidden in our big, modern applications waiting to be discovered.   

    How do I activate the HTML Filter?

    Thanks for your continuing assistance.

  4. Anonymous
    2015-03-11T00:59:25+00:00

    Very nicely done John... though I remember the old days, I'm not so sure I could reach back and make this old code work as well as you have.

    Richard Gilpin... my professional opinion is... if you need to take this further then you'd probably better contact John privately and work out detailed project specifications and an agreed upon price to continue.
