
Scan for duplicate sentences

Anonymous
2020-03-21T12:16:03+00:00

I have a Word Document which is nearly 3,000 pages long. 

It consists exclusively of circa 20,000 quotations by Carl Jung.

I would like to find out how to scan this document and to find duplicate entries of sentences containing the same quotations.

Any help on this matter will be greatly appreciated.

Thank you,

Lewis

Microsoft 365 and Office | Word | For home | Windows

Locked Question. This question was migrated from the Microsoft Support Community. You can vote on whether it's helpful, but you can't add comments or replies or follow the question.

12 answers

  1. Anonymous
    2020-03-21T13:35:31+00:00

    Hi Sammy,

    I'm an Independent Advisor, here to help you out.

    You can do that with a macro:

    Open the document and press Alt + F11. This opens the Visual Basic Editor (VBE).

    On the left-hand side, under Project, you should see the name of your document. Double-click it and paste the following macro into the window that opens:

    Option Explicit

    Sub Sample()
        Dim MyArray() As String
        Dim n As Long, i As Long
        Dim Col As New Collection
        Dim itm

        n = 0

        '~~> Get all the sentences from the word document in an array
        For i = 1 To ActiveDocument.Sentences.Count
            n = n + 1
            ReDim Preserve MyArray(n)
            MyArray(n) = Trim(ActiveDocument.Sentences(i).Text)
        Next

        '~~> Sort the array so that matching sentences end up next to each other
        SortArray MyArray, 0, UBound(MyArray)

        '~~> Extract Duplicates
        '~~> A sentence is flagged when the next sorted sentence contains it
        For i = 1 To UBound(MyArray)
            If i = UBound(MyArray) Then Exit For
            If InStr(1, MyArray(i + 1), MyArray(i), vbTextCompare) Then
                '~~> Adding a key that already exists raises an error, which is skipped
                On Error Resume Next
                Col.Add MyArray(i), """" & MyArray(i) & """"
                On Error GoTo 0
            End If
        Next i

        '~~> Highlight duplicates
        '~~> Note: Word's Find rejects search text longer than 255 characters
        For Each itm In Col
            Selection.Find.ClearFormatting
            Selection.HomeKey wdStory, wdMove
            Selection.Find.Execute itm
            Do Until Selection.Find.Found = False
                Selection.Range.HighlightColorIndex = wdPink
                Selection.Find.Execute
            Loop
        Next
    End Sub

    '~~> Sort the array (quicksort: partition around the middle element, then recurse)
    Public Sub SortArray(vArray As Variant, i As Long, j As Long)
        Dim tmp As Variant, tmpSwap As Variant
        Dim ii As Long, jj As Long

        ii = i: jj = j: tmp = vArray((i + j) \ 2)

        While (ii <= jj)
            '~~> Skip elements that are already on the correct side of the pivot
            While (vArray(ii) < tmp And ii < j)
                ii = ii + 1
            Wend
            While (tmp < vArray(jj) And jj > i)
                jj = jj - 1
            Wend

            '~~> Swap the out-of-place pair
            If (ii <= jj) Then
                tmpSwap = vArray(ii)
                vArray(ii) = vArray(jj): vArray(jj) = tmpSwap
                ii = ii + 1: jj = jj - 1
            End If
        Wend

        If (i < jj) Then SortArray vArray, i, jj
        If (ii < j) Then SortArray vArray, ii, j
    End Sub

    Close the code window you pasted the macro into. Just below the top menu there is a small play button. Click it, select the macro you just made, and click Run.

    This will then highlight the duplicate sentences in your text in pink.
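
    If you only want a list of the repeats rather than highlighting, a different approach can be sketched as below. This is an assumption on my part and not part of the macro above: it swaps the sort-and-InStr test for exact, case-insensitive counting with a late-bound Scripting.Dictionary (Windows only). The name ListExactDuplicates is made up for illustration.

    ```vba
    ' Sketch only: lists sentences that occur more than once, using exact
    ' (case-insensitive) matching instead of the sort-and-InStr test.
    Sub ListExactDuplicates()
        Dim counts As Object
        Dim sentence As Range
        Dim key As String
        Dim k As Variant

        ' Late-bound dictionary, so no library reference has to be added
        Set counts = CreateObject("Scripting.Dictionary")
        counts.CompareMode = vbTextCompare   ' must be set while the dictionary is empty

        ' For Each avoids repeated Sentences(i) lookups, which can be slow on long documents
        For Each sentence In ActiveDocument.Sentences
            ' Drop the paragraph mark and surrounding spaces before comparing
            key = Trim$(Replace(sentence.Text, vbCr, ""))
            If Len(key) > 0 Then
                counts(key) = counts(key) + 1
            End If
        Next sentence

        ' Print each repeated sentence with its count to the Immediate window (Ctrl+G)
        For Each k In counts.Keys
            If counts(k) > 1 Then Debug.Print counts(k) & "x: " & k
        Next k
    End Sub
    ```

    Because it matches whole sentences only, this sketch will not flag a quotation that appears inside a longer sentence, which the InStr test in the macro above would catch.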

    Best,

    Nick

  2. Anonymous
    2020-03-21T19:04:22+00:00

    Hi Sammy,

    It is really small. I've re-uploaded your screenshot with its location circled :)

  3. Anonymous
    2020-03-21T17:33:44+00:00

    If that still doesn't work, you can add the Developer tab to Word and click the Visual Basic option manually:

    On the File tab, go to Options > Customize Ribbon.

    Under Customize the Ribbon and under Main Tabs, select the Developer check box.

    After you show the tab, the Developer tab stays visible, unless you clear the check box or have to reinstall a Microsoft Office program.

    Best,

    Nick

  4. Anonymous
    2020-03-21T18:19:25+00:00

    On the left hand side, where you see "This Document," double-click it and a smaller window will appear.

    You'll then need to copy and paste the large macro I sent earlier. It starts with "Option Explicit" and ends with "End Sub".

    Once you have added it, close that code window. Just below the top menu there is a small, green play button. Click it, select the macro you just made (it will most likely be the only option), and click Run.

  5. Anonymous
    2020-03-21T14:21:47+00:00

    The macro is definitely the way to go. Sorting will help.

    But you will also have to look through the results carefully, even after the macro has run. Small differences such as extra spaces, especially at the end of a paragraph, could "fool" the macro into treating identical sentences as different.
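
    One way to reduce that risk, sketched here under my own assumptions, is to reduce each sentence to a canonical form before comparing. NormalizeSentence is a hypothetical helper, not something the macro already contains, and the list of characters it replaces is a guess at what a document like this might include.

    ```vba
    ' Hypothetical helper: reduces a sentence to a canonical form so that
    ' spacing and case differences do not hide a duplicate.
    Function NormalizeSentence(ByVal s As String) As String
        ' Turn paragraph marks, line breaks, tabs and non-breaking spaces into plain spaces
        s = Replace(s, vbCr, " ")
        s = Replace(s, vbLf, " ")
        s = Replace(s, vbTab, " ")
        s = Replace(s, Chr(11), " ")     ' manual line break in Word
        s = Replace(s, ChrW(160), " ")   ' non-breaking space

        ' Collapse runs of spaces into a single space
        Do While InStr(s, "  ") > 0
            s = Replace(s, "  ", " ")
        Loop

        ' Ignore leading and trailing spaces and letter case
        NormalizeSentence = LCase$(Trim$(s))
    End Function
    ```

    The normalized text is meant for comparison only. Searching the document for it could fail wherever the original spacing differs, so keep the original sentence text for any Find step.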

    Are all of your quotes single sentences? (Many of mine aren't ... ).

    Are you sure all of your quotes are in the same format, i.e. where the source name appears, how it is formatted, and so on?
