Need help extracting the contents of an online PDF

NAVEEN MOHANDAS 21 Reputation points
2022-08-08T16:41:54.97+00:00

Hi all,

My task is to scrape the contents of an online PDF. I have made some progress and converted the response to a stream.

I am able to write the file when the URL is "http://www.africau.edu/images/default/sample.pdf":

Dim req As WebRequest = WebRequest.Create("http://www.africau.edu/images/default/sample.pdf")
Dim res As WebResponse = req.GetResponse()
Dim dataStream As Stream = res.GetResponseStream()
Dim reader As New StreamReader(dataStream)
Dim filecontents() As Byte = Encoding.UTF8.GetBytes(reader.ReadToEnd())
File.WriteAllBytes("mypdf.pdf", filecontents)

However, I also tested the code with another PDF, "https://efile.fara.gov/docs/7070-Exhibit-AB-20220113-1.pdf", and that file is reported as corrupted when I open it. On further analysis I found that this PDF's streams are written with FlateDecode, so I tried using the DeflateStream class, but still no luck.
Dim buffer As New StringBuilder()
Dim req As WebRequest = WebRequest.Create("https://efile.fara.gov/docs/7070-Exhibit-AB-20220113-1.pdf")
Dim res As WebResponse = req.GetResponse()
Dim dataStream As Stream = res.GetResponseStream()
Dim compressStream As DeflateStream = New DeflateStream(dataStream, CompressionMode.Decompress)
Dim reader As New StreamReader(compressStream)
Console.WriteLine(reader.ReadToEnd())
Dim filecontents() As Byte = Encoding.ASCII.GetBytes(reader.ReadToEnd())
File.WriteAllBytes("mydocument.pdf", filecontents)

Exception
08/08/2022 12:34:49 => [Debug] Execution started for file: Main
08/08/2022 12:34:51 => [Info] pdfpagecount execution started
08/08/2022 12:34:59 => [Error] Invoke code: Exception has been thrown by the target of an invocation.
08/08/2022 12:34:59 => [Info] pdfpagecount execution ended in: 00:00:07
08/08/2022 12:34:59 => [Error] RemoteException wrapping System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> RemoteException wrapping System.IO.InvalidDataException: Found invalid data while decoding.
at System.IO.Compression.Inflater.DecodeDynamicBlockHeader()
at System.IO.Compression.Inflater.Decode()
at System.IO.Compression.Inflater.Inflate(Byte[] bytes, Int32 offset, Int32 length)
at System.IO.Compression.DeflateStream.Read(Byte[] array, Int32 offset, Int32 count)
at System.IO.StreamReader.ReadBuffer()
at System.IO.StreamReader.ReadToEnd()
at UiPathCodeRunner_32007da9d8904936b77f16cd351da8de.Run()
--- End of inner exception stack trace ---
at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
at System.RuntimeType.InvokeMember(String name, BindingFlags bindingFlags, Binder binder, Object target, Object[] providedArgs, ParameterModifier[] modifiers, CultureInfo culture, String[] namedParams)
at UiPath.Activities.System.Utilities.InvokeCode.CompilerRunner.Run(Object[] args)
at UiPath.Activities.System.Utilities.InvokeCode.NetCodeInvoker.Run(String userCode, List`1 inArgs, IEnumerable`1 imps, Object[] args)
at UiPath.Core.Activities.InvokeCode.Execute(CodeActivityContext context)
at System.Activities.CodeActivity.InternalExecute(ActivityInstance instance, ActivityExecutor executor, BookmarkManager bookmarkManager)
at System.Activities.ActivityInstance.Execute(ActivityExecutor executor, BookmarkManager bookmarkManager)
at System.Activities.Runtime.ActivityExecutor.ExecuteActivityWorkItem.ExecuteBody(ActivityExecutor executor, BookmarkManager bookmarkManager, Location resultLocation)
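
A note on the two attempts above, offered as a likely explanation rather than a confirmed diagnosis: `/Filter/FlateDecode` describes compression of individual stream objects inside the PDF file, so the HTTP response body is already the complete PDF and should not need a DeflateStream around it. The corruption in the first attempt more plausibly comes from passing binary data through StreamReader and Encoding.UTF8, which replaces byte sequences that are not valid UTF-8. The sketch below illustrates that effect on a few sample bytes; the module name, the function name `SurvivesTextRoundTrip`, and the sample bytes are made up for illustration.

```vb
Imports System
Imports System.Linq
Imports System.Text

Module BinaryRoundTripDemo
    ' Hypothetical helper: True when the bytes survive a decode/encode
    ' cycle through the given text encoding unchanged.
    Public Function SurvivesTextRoundTrip(data As Byte(), enc As Encoding) As Boolean
        Dim asText As String = enc.GetString(data)
        Dim back As Byte() = enc.GetBytes(asText)
        Return data.SequenceEqual(back)
    End Function

    Sub Main()
        ' "%PDF" followed by four high bytes (illustrative sample, not taken from the real file).
        Dim binarySample As Byte() = {&H25, &H50, &H44, &H46, &HE2, &HE3, &HCF, &HD3}
        ' Pure ASCII content, which is also valid UTF-8.
        Dim asciiSample As Byte() = Encoding.ASCII.GetBytes("%PDF-1.5")

        Console.WriteLine(SurvivesTextRoundTrip(binarySample, Encoding.UTF8)) ' prints False
        Console.WriteLine(SurvivesTextRoundTrip(asciiSample, Encoding.UTF8))  ' prints True
    End Sub
End Module
```

This would also be consistent with the first URL working, presumably because that sample file contains mostly ASCII text, although that was not verified here.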

PDF binary content (shortened)
08/08/2022 10:34:42 => [Debug] %PDF-1.5
%����
57 0 obj
<</Filter/FlateDecode/Length 61>>
stream
x�+�w,�LKL.��� �,H� HLO��srqV034R0 BsK ���������� � � �
endstream
endobj
58 0 obj
<</Filter/FlateDecode/Length 12>>
stream
x�+T T �
endstream
endobj
59 0 obj
<</Filter/FlateDecode/Length 2004>>
stream
x��Xio\� ���
,˳V��Ij 3��� �ѦvR (�"�& ��څ 4 � ���$ /��� *}�;:�빗�i���߽}��o�>�|{�� o�y���߿z�������������������� � ���;�bru�f���� b(] �R���w�����u���h��?� a � ~يq�� ��mn���� Ei J -0��#�:�I Ǫ]"F(� !����n��ɃӇg F�g�a�[5�93�� %K�q V�( �G�� Ng�� |^, 3�����: R�Z'k� �� �z�O=J �$ P�Y�+��0g�H ��� �>������"� > 8�U��|0�B-� ������� F xΉ��� ��� ƍ���o<����ş��bCa��� Ac :t��� ���]7�k��뎆Z̨ g���ఐ��L�|��T� �Â35�c ! �
�q���h��1.Y� ޖG=��GC� nR�b�>k&I�

Appreciate your help

Developer technologies | VB

Accepted answer
  1. Castorix31 90,686 Reputation points
    2022-08-09T13:46:16.48+00:00

    I also tested with code that I use to download images, based on HttpWebRequest,
    and I get the correct byte array:

            Dim bytes As Byte() = Nothing
            Try
                Net.ServicePointManager.Expect100Continue = True
                ' Request TLS 1.2 for the HTTPS connection
                Net.ServicePointManager.SecurityProtocol = Net.SecurityProtocolType.Tls12
                Dim webRequest As Net.HttpWebRequest = CType(Net.WebRequest.Create("https://efile.fara.gov/docs/7070-Exhibit-AB-20220113-1.pdf"), Net.HttpWebRequest)
                webRequest.AllowWriteStreamBuffering = True
                Using webResponse As Net.WebResponse = webRequest.GetResponse()
                    Dim stream As IO.Stream = webResponse.GetResponseStream()
                    Using ms As IO.MemoryStream = New IO.MemoryStream()
                        ' Copy the raw response bytes; no text decoding is involved
                        stream.CopyTo(ms)
                        bytes = ms.ToArray()
                        ' Writing to disk is only a check; bytes already holds the whole PDF
                        IO.File.WriteAllBytes("mypdf3.pdf", bytes)
                    End Using
                End Using
            Catch ex As Exception
                ' Code...
            End Try
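
    If writing to disk is not allowed, the same approach can stay entirely in memory. The following is a minimal sketch under that assumption, not a definitive implementation: the helper names `DownloadBytes` and `LooksLikePdf` are hypothetical and not part of any library, and the check only looks at the `%PDF` marker that a PDF file normally starts with.

    ```vb
    Imports System
    Imports System.IO
    Imports System.Net

    Module PdfBytesSketch
        ' Hypothetical helper: copies the raw response body into memory and returns it.
        Public Function DownloadBytes(url As String) As Byte()
            ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12
            Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
            Using response As WebResponse = request.GetResponse()
                Using body As Stream = response.GetResponseStream()
                    Using ms As New MemoryStream()
                        body.CopyTo(ms) ' raw bytes only, no StreamReader or Encoding
                        Return ms.ToArray()
                    End Using
                End Using
            End Using
        End Function

        ' Hypothetical sanity check: True when the data begins with the ASCII bytes "%PDF".
        Public Function LooksLikePdf(data As Byte()) As Boolean
            Return data IsNot Nothing AndAlso data.Length >= 4 AndAlso
                   data(0) = &H25 AndAlso data(1) = &H50 AndAlso
                   data(2) = &H44 AndAlso data(3) = &H46
        End Function

        Sub Main()
            Dim bytes As Byte() = DownloadBytes("https://efile.fara.gov/docs/7070-Exhibit-AB-20220113-1.pdf")
            Console.WriteLine(LooksLikePdf(bytes))
        End Sub
    End Module
    ```

    The returned array can then be passed to whatever PDF library handles the page extraction, for example through a `New MemoryStream(bytes)`.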
    

2 additional answers

  1. Castorix31 90,686 Reputation points
    2022-08-09T06:14:21.637+00:00

    You can use URLDownloadToFile.

    I tested it with your file and it worked:

    URLDownloadToFile(IntPtr.Zero, "https://efile.fara.gov/docs/7070-Exhibit-AB-20220113-1.pdf", "test.pdf", 0, IntPtr.Zero)  
    

    with these declarations:

    <DllImport("Urlmon.dll", SetLastError:=True, CharSet:=CharSet.Unicode)>  
    Public Shared Function URLDownloadToFile(pCaller As IntPtr, szURL As String, szFileName As String, dwReserved As UInteger, lpfnCB As IntPtr) As HRESULT  
    End Function  
    
    Public Enum HRESULT As Integer  
        S_OK = 0  
        S_FALSE = 1  
        E_NOINTERFACE = &H80004002  
        E_NOTIMPL = &H80004001  
        E_FAIL = &H80004005  
        E_UNEXPECTED = &H8000FFFF  
        E_OUTOFMEMORY = &H8007000E  
    End Enum  
    

  2. NAVEEN MOHANDAS 21 Reputation points
    2022-08-09T12:32:43.41+00:00

    @Castorix31 thank you for looking at my question. Does this function download the file to the local file system? We are not allowed to download files. Instead, would you be able to convert the contents of the file into a byte array or some kind of stream?
    My end goal is to have a byte array of this PDF that I can load and then extract pages from (I will take care of that part :)).

