
remove HTML coding?

Anonymous
2015-03-09T23:03:56+00:00

Is there an easy way to remove HTML coding in Word for Mac 2011?  Way back in the last century, when Word used some other basic software, a friend of mine wrote a dandy macro that would automatically remove the HTML from any document open in Word.   That macro does not work now.   

The problem is with Visual Basic.   I do not remember whether Visual Basic is what Word used to use, or if it is what Word uses now.  What matters is that my trusty macro no longer works.

Here's how this arises. I like to save a good many texts that I find on the Web. One way to save the text is as a Web Archive. The problem there is that I do my work in Word, not Safari. When I remember, the best way to save such web-based text is to copy it out of Safari and paste it into a Word document.

Sometimes I don't remember.  Or sometimes I find an old document on the hard drive that I saved before Word made the switch to? from? Visual Basic.  When I open such a document in Word, all that HTML coding is there clogging up the text.  Removing all the coding manually is tedious and time-consuming.

I would love to have a macro as in the old days so that I could sit and watch Word do its stuff. It was both efficient and entertaining to watch Word jump back and forth in a document far faster than any human could, removing the coding unnecessary to a text.

Is there any technique for removing HTML coding in Word for Mac 2011?

Microsoft 365 and Office | Word | For home | Windows

Locked Question. This question was migrated from the Microsoft Support Community. You can vote on whether it's helpful, but you can't add comments or replies or follow the question.


Answer accepted by question author

John Korchok 232.4K Reputation points Volunteer Moderator
2015-03-10T22:23:06+00:00

Yes, I got the same result when working on a large page that included JavaScript and CSS. The macro ran when I stripped those out. That's what I was referring to when I wrote "the macro ran out of memory on large complex pages". 

The macro is about 20 years old, and web pages were a lot smaller then. The macro works by selecting the entire page and loading that into a string. On a large page the string is too big and overloads the buffer. It's not how I would write a macro today. Frankly, I think it's pretty amazing that 20-year-old code can run at all.
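For anyone who wants to avoid loading the whole page into a string, here is a minimal sketch that leans on Word's own wildcard Find and Replace instead. It is not the original macro, and it has not been tested on Word for Mac 2011. The name StripTagsWithFind is made up for illustration. The pattern `\<*\>` is meant to match a "<", any run of characters, and the next ">".

```vba
Sub StripTagsWithFind()
    ' Sketch only: deletes anything that looks like <...> in the active document.
    ' Assumption: every "<" in the text opens a tag; a stray "<" in ordinary
    ' prose would be swallowed up to the next ">".
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "\<*\>"            ' "<" and ">" are wildcard operators, so they are escaped
        .Replacement.Text = ""     ' replace each match with nothing
        .Forward = True
        .Wrap = wdFindStop
        .Format = False
        .MatchWildcards = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

This removes only the markup itself. The contents of script and style blocks would be left behind as plain text, so work on a copy of the document first.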

The other choice you have is to simply open the HTML file using Word's HTML filter instead of its text filter, then save as a Word document. If you're just after the text in the HTML, save the Word file as text and all markup is removed.
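The same open-as-HTML, save-as-text round trip could probably be scripted along these lines. This is a rough sketch rather than tested code: both paths are hypothetical placeholders, and the path syntax depends on the platform and Word version.

```vba
Sub ConvertHtmlFileToText()
    ' Hypothetical paths: replace both with real locations on your machine.
    Const SourcePath As String = "C:\Temp\page.html"
    Const TargetPath As String = "C:\Temp\page.txt"

    Dim doc As Document

    ' Ask Word to use its HTML converter instead of treating the file as plain text
    Set doc = Documents.Open(FileName:=SourcePath, ConfirmConversions:=False, _
                             Format:=wdOpenFormatWebPages)

    ' Saving as plain text drops all markup and formatting
    doc.SaveAs FileName:=TargetPath, FileFormat:=wdFormatText
    doc.Close SaveChanges:=wdDoNotSaveChanges
End Sub
```

If you want to keep the formatting rather than just the words, save as a Word document instead of text.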

I'm posting the basic code here, in case someone feels like rewriting this from scratch. The HTMLForm is in the template linked to in my previous message. It's nothing fancy, just an interface for choosing which macro to run:

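Public SubToRun$   ' Name of the macro chosen in HTMLForm

Sub HTMLCleanup()
  ' Show the chooser form, then run whichever macro was picked
  Load HTMLForm
  HTMLForm.Show
  Select Case SubToRun$
    Case "DeleteHTML"
      Call DeleteHTML
    Case "FindAmpersand"
      Call FindAmpersand
    Case "FindExtended"
      Call FindExtended
    Case Else
      ' Nothing chosen: do nothing
  End Select
End Sub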

'Part 1 -- Remove HTML comments
Sub DeleteHTML()
  Application.ScreenUpdating = False
  WordBasic.EditSelectAll
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  InitLen = Len(SelText$)
Loop1:
  ' Re-read the text from the current position to the end of the document
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  NewLen = Len(SelText$)
  'WordBasic.Print Int(100 - (NewLen * 100 / InitLen)); "%"
  StartPos = InStr(1, SelText$, "<!-")
  If StartPos = 0 Then GoTo EndSub1
  ' Search for the closing "->" only after the opening "<!-", so that a stray
  ' "->" earlier in the text cannot produce a negative selection length
  EndPos = InStr(StartPos, SelText$, "->")
  If EndPos = 0 Then GoTo EndSub1
  WordBasic.EditGoTo "\StartOfSel"
  If StartPos <> 1 Then WordBasic.CharRight StartPos - 1
  ' Extend the selection over the whole comment, including the final ">"
  WordBasic.CharRight EndPos - StartPos + 2, 1
  WordBasic.EditClear
  Count = Count + 1
  ' Reselect from here to the end of the document for the next pass
  WordBasic.ExtendSelection
  WordBasic.EditGoTo "\EndOfDoc"
  WordBasic.Cancel
  GoTo Loop1
EndSub1:
  Application.ScreenUpdating = True
  Beep
  MsgBox Str$(Count) + " '<!-->' codes removed" + Chr$(13) + "End Part 1"

'Part 2 -- Remove HTML proper
  Count = 0   ' Start a fresh count so the final message reports tags only
  WordBasic.EditSelectAll
  Application.ScreenUpdating = False
Loop2:
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  NewLen = Len(SelText$)
  'WordBasic.Print Int(100 - (NewLen * 100 / InitLen)); "%"
  StartPos = InStr(1, SelText$, "<")
  If StartPos = 0 Then GoTo EndSub2
  ' Search for ">" only after the "<", for the same reason as in Part 1
  EndPos = InStr(StartPos, SelText$, ">")
  If EndPos = 0 Then GoTo EndSub2
  WordBasic.EditGoTo "\StartOfSel"
  If StartPos <> 1 Then WordBasic.CharRight StartPos - 1
  ' Extend the selection over the tag, from "<" through ">"
  WordBasic.CharRight EndPos - StartPos + 1, 1
  WordBasic.EditClear
  Count = Count + 1
  WordBasic.ExtendSelection
  WordBasic.EditGoTo "\EndOfDoc"
  WordBasic.Cancel
  GoTo Loop2
EndSub2:
  Application.ScreenUpdating = True
  WordBasic.Beep:  Beep
  MsgBox Str$(Count) + " '<>' codes removed" + Chr$(13) + "End Macro"
End Sub

Sub FindAmpersand()
  ' Colors each "&...;" entity blue so it can be reviewed by hand
  WordBasic.EditSelectAll
Loop3:
  SelText$ = WordBasic.GetText$(WordBasic.GetSelStartPos(), WordBasic.GetSelEndPos())
  StartPos = InStr(1, SelText$, "&")
  EndPos = InStr(1, SelText$, ";")
  If StartPos = 0 Or EndPos = 0 Then GoTo EndSub
  WordBasic.EditGoTo "\StartOfSel"
  If StartPos <> 1 Then WordBasic.CharRight StartPos - 1
  If StartPos > EndPos Then     'Check for ";" not part of HTML code
    ' The ";" came before the "&": skip ahead and search again from the "&"
    WordBasic.ExtendSelection
    WordBasic.EditGoTo "\EndOfDoc"
    WordBasic.Cancel
    GoTo Loop3
  End If
  ' Select from "&" through ";" and color it
  WordBasic.CharRight EndPos - StartPos + 1, 1
  WordBasic.CharColor 9
  Count = Count + 1
  ' Step past the entity, then reselect to the end of the document
  WordBasic.CharRight 1
  WordBasic.ExtendSelection
  WordBasic.EditGoTo "\EndOfDoc"
  WordBasic.Cancel
  GoTo Loop3
EndSub:
  MsgBox Str$(Count) + " '&;' codes highlighted blue."
End Sub

Sub FindExtended()
  ' Colors characters 127-255 red; the empty replacement text with Format = True
  ' changes only the formatting and leaves the characters in place
  With Selection.Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Replacement.Font.Color = wdColorRed
    .Text = "[^0127-^0255]"
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindContinue
    .Format = True
    .MatchWildcards = True
    .Execute Replace:=wdReplaceAll
  End With
  MsgBox "Extended ASCII characters highlighted red."
End Sub
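
FindAmpersand above only highlights the "&...;" codes so you can fix them by hand. If most of them are the common ones, a small loop of ordinary Find and Replace calls could convert them directly. The following is a sketch under that assumption. The macro name and the short list of entities are chosen for illustration and are not part of the original template.

```vba
Sub ReplaceCommonEntities()
    ' Sketch only: converts a handful of common HTML entities to plain characters.
    Dim entities As Variant, replacements As Variant
    Dim i As Long

    ' "&amp;" is handled last so that text such as "&amp;lt;" ends up as "&lt;"
    ' rather than being decoded twice.
    entities = Array("&nbsp;", "&lt;", "&gt;", "&quot;", "&amp;")
    replacements = Array(" ", "<", ">", """", "&")

    For i = LBound(entities) To UBound(entities)
        With ActiveDocument.Content.Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Text = entities(i)
            .Replacement.Text = replacements(i)
            .Forward = True
            .Wrap = wdFindStop
            .Format = False
            .MatchCase = True
            .MatchWildcards = False
            .Execute Replace:=wdReplaceAll
        End With
    Next i
End Sub
```

Run it after the tags have been removed. Otherwise any "<" and ">" it restores could be mistaken for markup by a later tag-stripping pass.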


11 additional answers

  1. Anonymous
    2015-03-14T21:25:41+00:00

    I found an example.   This is a random extract from a ".txt" file I have been working on:

    "<p style="margin-top:1em;margin-right:0px;margin-bottom:1em;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px">&&&&&&&&On the appeal as to defendant Losinsky, plaintiff Travelers argues that 'medical expense is a special damage, is separate and apart from a bodily injury claim, and the right to recover such special damage is assignable and subject to the principles of subrogation.' So, plaintiff says that 'medical expense stands on the same footing as property damage' and that, on the authority of&<a href="https://apps.fastcase.com/Research/Pages/Document.aspx?LTID=wsJzx3lX0HYVUXStKzzJ0AY8bHPgZco%2fY7P%2f0Z2s07UcTvjV0hV7jSQmgBSWIdzY4%2f58SSJTNhJxjrh65DYyyUs8h24lgiIIJKWLZePfqAkCa3Svw0haW045mtAkOlKp&ECF=General+Exchange+Insurance+Corp.+v.+Young%2c+357+Mo.+1099" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;color:rgb(0, 153, 204)" target="_blank">General Exchange Insurance Corp. v. Young, 357 Mo. 1099</a>,&<a href="https://apps.fastcase.com/Research/Pages/Document.aspx?LTID=wsJzx3lX0HYVUXStKzzJ0AY8bHPgZco%2fY7P%2f0Z2s07UcTvjV0hV7jSQmgBSWIdzY4%2f58SSJTNhJxjrh65DYyyUs8h24lgiIIJKWLZePfqAkCa3Svw0haW045mtAkOlKp&ECF=212+S.W.2d+396" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;color:rgb(0, 153, 204)" target="_blank">212 S.W.2d 396</a>, recovery should be allowed on the subrogation claim under consideration. "

    I can remove the individual HTML manually, but that old Word Basic Macro made it a breeze.

  2. John Korchok 232.4K Reputation points Volunteer Moderator
    2015-03-14T20:00:00+00:00

    HTML is the default, so you don't see a difference. If you chose Text, it would have a completely different appearance. Confirm conversion at open just gives you the option to choose.

  3. Anonymous
    2015-03-14T19:35:55+00:00

    Checking the "Confirm conversion at Open" option does not make a difference. Word opens the HTML file the same way with or without it.

    I see, though, a distinction in the type of document being opened.   What I was looking for was a method of stripping HTML from a document already saved in Word.   That is, I have some documents I had downloaded (or saved in some manner) from the Web, and I saved them as Word files.   

    Why do it that way? Well, because that's the way the old macro worked. The macro worked only on Word files. Save the Web archive into Word, run the macro, and get good, usable written words.

    Now I have some old Word files, and to remove the HTML in those files is tedious and time-consuming. 

    It is probably not too important, however. If one of these old files ever matters enough, I will spend the time to clean it up.
