Copy data from an HTTP endpoint by using Azure Data Factory or Azure Synapse Analytics

APPLIES TO: Azure Data Factory, Azure Synapse Analytics

Tip

Try out Data Factory in Microsoft Fabric, an all-in-one analytics solution for enterprises. Microsoft Fabric covers everything from data movement to data science, real-time analytics, business intelligence, and reporting. Learn how to start a new trial for free!

This article outlines how to use Copy Activity in Azure Data Factory and Azure Synapse to copy data from an HTTP endpoint. It builds on the Copy Activity article, which presents a general overview of Copy Activity.

The differences among the HTTP connector, the REST connector, and the Web table connector are:

  • The REST connector specifically supports copying data from RESTful APIs.
  • The HTTP connector is generic and retrieves data from any HTTP endpoint, for example, to download a file. Before the REST connector became available, you might have used the HTTP connector to copy data from RESTful APIs. That approach is still supported, but it is less functional than the REST connector.
  • The Web table connector extracts table content from an HTML webpage.

Supported capabilities

This HTTP connector is supported for the following capabilities:

| Supported capabilities | IR |
| --- | --- |
| Copy activity (source/-) | ① ② |
| Lookup activity | ① ② |

① Azure integration runtime ② Self-hosted integration runtime

For a list of data stores that are supported as sources/sinks, see Supported data stores.

You can use this HTTP connector to:

  • Retrieve data from an HTTP/S endpoint by using the HTTP GET or POST methods.
  • Retrieve data by using one of the following authentications: Anonymous, Basic, Digest, Windows, or ClientCertificate.
  • Copy the HTTP response as-is or parse it by using supported file formats and compression codecs.

Tip

To test an HTTP request for data retrieval before you configure the HTTP connector, learn about the API specification for header and body requirements. You can use tools like Visual Studio, PowerShell's Invoke-RestMethod, or a web browser to validate.
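
As a rough illustration of this tip, the following Python sketch uses only the standard library to assemble a test request and, optionally, send it before you configure the connector. The endpoint URL, the x-api-key header, and both function names are placeholders assumed for this example; they are not part of the connector.

```python
import urllib.request


def build_test_request(url, method="GET", headers=None, body=None):
    """Assemble an HTTP request object without sending it."""
    data = body.encode("utf-8") if body is not None else None
    return urllib.request.Request(url, data=data, headers=headers or {}, method=method)


def send_test_request(request, timeout_seconds=100):
    """Send the request and return the status code and the first 200 bytes of the body."""
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.status, response.read(200)


# Hypothetical endpoint and header, for illustration only.
request = build_test_request(
    "https://example.com/api/data.csv",
    headers={"x-api-key": "<API key>"},
)
print(request.get_method(), request.full_url)

# Call send_test_request(request) against your real endpoint to inspect the response.
```

Checking the status code and the start of the body this way can help confirm the required headers and body before you author the linked service and dataset.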

Prerequisites

If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private Cloud, you need to configure a self-hosted integration runtime to connect to it.

If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow list.

You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the on-premises network without installing and configuring a self-hosted integration runtime.

For more information about the network security mechanisms and options supported by Data Factory, see Data access strategies.

Get started

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:

Create a linked service to an HTTP source using UI

Use the following steps to create a linked service to an HTTP source in the Azure portal UI.

  1. Browse to the Manage tab in your Azure Data Factory or Synapse workspace, select Linked Services, and then select New.

  2. Search for HTTP and select the HTTP connector.

    Screenshot of the HTTP connector.

  3. Configure the service details, test the connection, and create the new linked service.

    Screenshot of configuration for an HTTP linked service.

Connector configuration details

The following sections provide details about properties you can use to define entities that are specific to the HTTP connector.

Linked service properties

The following properties are supported for the HTTP linked service:

| Property | Description | Required |
| --- | --- | --- |
| type | The type property must be set to HttpServer. | Yes |
| url | The base URL to the web server. | Yes |
| enableServerCertificateValidation | Specify whether to enable server TLS/SSL certificate validation when you connect to an HTTP endpoint. If your HTTPS server uses a self-signed certificate, set this property to false. | No (the default is true) |
| authenticationType | Specifies the authentication type. Allowed values are Anonymous, Basic, Digest, Windows, and ClientCertificate. You can additionally configure authentication headers in the authHeaders property. See the sections that follow this table for more properties and JSON samples for these authentication types. | Yes |
| authHeaders | Additional HTTP request headers for authentication. For example, to use API key authentication, you can set the authentication type to "Anonymous" and specify the API key in the header. | No |
| connectVia | The Integration Runtime to use to connect to the data store. Learn more from the Prerequisites section. If not specified, the default Azure Integration Runtime is used. | No |

Using Basic, Digest, or Windows authentication

Set the authenticationType property to Basic, Digest, or Windows. In addition to the generic properties that are described in the preceding section, specify the following properties:

| Property | Description | Required |
| --- | --- | --- |
| userName | The user name to use to access the HTTP endpoint. | Yes |
| password | The password for the user (the userName value). Mark this field as a SecureString type to store it securely. You can also reference a secret stored in Azure Key Vault. | Yes |

Example

{
    "name": "HttpLinkedService",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "authenticationType": "Basic",
            "url" : "<HTTP endpoint>",
            "userName": "<user name>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
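
The password property can also reference an Azure Key Vault secret instead of an inline SecureString. The Python sketch below builds such a linked service definition as a dictionary and prints it as JSON. It assumes the AzureKeyVaultSecret reference shape that Data Factory uses for secrets stored in Key Vault; the helper name and all argument values are placeholders made up for this example.

```python
import json


def basic_auth_linked_service(url, user_name, key_vault_linked_service, secret_name):
    """Return an HTTP linked service definition whose password comes from Key Vault."""
    return {
        "name": "HttpLinkedService",
        "properties": {
            "type": "HttpServer",
            "typeProperties": {
                "authenticationType": "Basic",
                "url": url,
                "userName": user_name,
                # Assumed shape of a Key Vault secret reference.
                "password": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": key_vault_linked_service,
                        "type": "LinkedServiceReference",
                    },
                    "secretName": secret_name,
                },
            },
        },
    }


definition = basic_auth_linked_service(
    "<HTTP endpoint>",
    "<user name>",
    "<Azure Key Vault linked service name>",
    "<secret name>",
)
print(json.dumps(definition, indent=4))
```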

Using ClientCertificate authentication

To use ClientCertificate authentication, set the authenticationType property to ClientCertificate. In addition to the generic properties that are described in the preceding section, specify the following properties:

| Property | Description | Required |
| --- | --- | --- |
| embeddedCertData | Base64-encoded certificate data. | Specify either embeddedCertData or certThumbprint. |
| certThumbprint | The thumbprint of the certificate that's installed on your self-hosted Integration Runtime machine's cert store. Applies only when the self-hosted type of Integration Runtime is specified in the connectVia property. | Specify either embeddedCertData or certThumbprint. |
| password | The password that's associated with the certificate. Mark this field as a SecureString type to store it securely. You can also reference a secret stored in Azure Key Vault. | No |

If you use certThumbprint for authentication and the certificate is installed in the personal store of the local computer, grant read permissions to the self-hosted Integration Runtime:

  1. Open the Microsoft Management Console (MMC). Add the Certificates snap-in that targets Local Computer.
  2. Expand Certificates > Personal, and then select Certificates.
  3. Right-click the certificate from the personal store, and then select All Tasks > Manage Private Keys.
  4. On the Security tab, add the user account under which the Integration Runtime Host Service (DIAHostService) is running, with read access to the certificate.
  5. The HTTP connector loads only trusted certificates. If you're using a self-signed or nonintegrated CA-issued certificate, to enable trust, the certificate must also be installed in one of the following stores:
    • Trusted People
    • Third-Party Root Certification Authorities
    • Trusted Root Certification Authorities

Example 1: Using certThumbprint

{
    "name": "HttpLinkedService",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "authenticationType": "ClientCertificate",
            "url": "<HTTP endpoint>",
            "certThumbprint": "<thumbprint of certificate>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Example 2: Using embeddedCertData

{
    "name": "HttpLinkedService",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "authenticationType": "ClientCertificate",
            "url": "<HTTP endpoint>",
            "embeddedCertData": "<Base64-encoded cert data>",
            "password": {
                "type": "SecureString",
                "value": "<password of cert>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
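
The embeddedCertData value is the certificate content encoded as Base64. A minimal Python sketch of producing that string follows. The function names are hypothetical, and the sample bytes only stand in for a real certificate file so that the snippet runs on its own.

```python
import base64
from pathlib import Path


def encode_certificate(cert_bytes):
    """Return certificate bytes as a Base64 string for embeddedCertData."""
    return base64.b64encode(cert_bytes).decode("ascii")


def encode_certificate_file(path):
    """Read a certificate file from disk and Base64-encode its contents."""
    return encode_certificate(Path(path).read_bytes())


# Stand-in bytes so the sketch runs without a real certificate file.
sample = encode_certificate(b"not a real certificate")
print(sample)
```

Because the encoded value carries the certificate itself, consider keeping it out of source control and storing any associated password as a SecureString or a Key Vault reference.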

Using authentication headers

In addition, you can configure request headers for authentication along with the built-in authentication types.

Example: Using API key authentication

{
    "name": "HttpLinkedService",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "url": "<HTTP endpoint>",
            "authenticationType": "Anonymous",
            "authHeaders": {
                "x-api-key": {
                    "type": "SecureString",
                    "value": "<API key>"
                }
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article.

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

The following properties are supported for HTTP under location settings in format-based dataset:

| Property | Description | Required |
| --- | --- | --- |
| type | The type property under location in dataset must be set to HttpServerLocation. | Yes |
| relativeUrl | A relative URL to the resource that contains the data. The HTTP connector copies data from the combined URL: [URL specified in linked service][relative URL specified in dataset]. | No |
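
The combined URL described for relativeUrl can be pictured as plain string concatenation. The sketch below assumes exactly that and nothing more: the table does not say whether slashes are normalized, so the example supplies the trailing slash itself. The function name and URLs are illustrative.

```python
def combined_url(linked_service_url, relative_url=None):
    """Concatenate the linked service URL with the dataset's relative URL, if any."""
    return linked_service_url + (relative_url or "")


print(combined_url("https://example.com/files/", "2024/report.csv"))
# https://example.com/files/2024/report.csv
```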

Note

The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is larger than 500 KB, consider batching the payload in smaller chunks.
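
One way to batch a payload is to group records so that each serialized batch stays under the limit. The following Python sketch is an assumption-laden illustration: it treats "around 500 KB" as 500 × 1024 bytes, measures the JSON-encoded size of each batch, and uses a hypothetical helper name.

```python
import json

MAX_PAYLOAD_BYTES = 500 * 1024  # Assumed reading of "around 500 KB".


def batch_records(records, max_bytes=MAX_PAYLOAD_BYTES):
    """Group records into lists whose JSON encoding stays within max_bytes.

    A single record that is larger than max_bytes is emitted alone.
    """
    batches, current = [], []
    for record in records:
        candidate = current + [record]
        if current and len(json.dumps(candidate).encode("utf-8")) > max_bytes:
            # Adding this record would exceed the limit, so start a new batch.
            batches.append(current)
            current = [record]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


records = [{"id": i, "value": "x" * 100} for i in range(10)]
batches = batch_records(records, max_bytes=400)
print(len(batches), [len(batch) for batch in batches])
```

Each batch can then be sent as the request body of a separate call. The sketch re-serializes the growing batch for every record, which is simple but not efficient for very large inputs.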

Example:

{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<HTTP linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, auto retrieved during authoring > ],
        "typeProperties": {
            "location": {
                "type": "HttpServerLocation",
                "relativeUrl": "<relative url>"
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}

Copy Activity properties

This section provides a list of properties that the HTTP source supports.

For a full list of sections and properties that are available for defining activities, see Pipelines.

HTTP as source

Azure Data Factory supports the following file formats. Refer to each article for format-based settings.

The following properties are supported for HTTP under storeSettings settings in format-based copy source:

| Property | Description | Required |
| --- | --- | --- |
| type | The type property under storeSettings must be set to HttpReadSettings. | Yes |
| requestMethod | The HTTP method. Allowed values are Get (default) and Post. | No |
| additionalHeaders | Additional HTTP request headers. | No |
| requestBody | The body for the HTTP request. | No |
| httpRequestTimeout | The timeout (the TimeSpan value) for the HTTP request to get a response. This value is the timeout to get a response, not the timeout to read response data. The default value is 00:01:40. | No |
| maxConcurrentConnections | The upper limit of concurrent connections established to the data store during the activity run. Specify a value only when you want to limit concurrent connections. | No |
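
To make the httpRequestTimeout default concrete, this small Python sketch converts an hh:mm:ss value into seconds. It is a simplification that handles only the three-part form shown in the table, not day components or fractional seconds, and the function name is made up for the example.

```python
from datetime import timedelta


def parse_timespan(value):
    """Parse an hh:mm:ss string such as "00:01:40" into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


print(parse_timespan("00:01:40").total_seconds())  # 100.0
```

In other words, the default of 00:01:40 gives the endpoint 100 seconds to start responding.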

Example:

"activities":[
    {
        "name": "CopyFromHTTP",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Delimited text input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "formatSettings":{
                    "type": "DelimitedTextReadSettings",
                    "skipLineCount": 10
                },
                "storeSettings":{
                    "type": "HttpReadSettings",
                    "requestMethod": "Post",
                    "additionalHeaders": "<header key: header value>\n<header key: header value>\n",
                    "requestBody": "<body for POST HTTP request>"
                }
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]
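
The additionalHeaders value in the example above is a single string in which each header appears as a "key: value" pair followed by a newline. A hedged Python sketch of building that string from a dictionary is shown below; the helper name is hypothetical, and the rejection of embedded line breaks is a precaution added for this example rather than documented connector behavior.

```python
def format_additional_headers(headers):
    """Join header pairs as "key: value" lines, each terminated by a newline."""
    lines = []
    for name, value in headers.items():
        if any(ch in name or ch in value for ch in "\r\n"):
            raise ValueError(f"Header {name!r} contains a line break")
        lines.append(f"{name}: {value}\n")
    return "".join(lines)


value = format_additional_headers(
    {"Connection": "keep-alive", "User-Agent": "Mozilla/5.0"}
)
print(repr(value))
```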

Lookup activity properties

To learn details about the properties, check Lookup activity.

Legacy models

Note

The following models are still supported as-is for backward compatibility. We recommend that you use the new model described in the preceding sections going forward. The authoring UI has switched to generating the new model.

Legacy dataset model

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the dataset must be set to HttpFile. | Yes |
| relativeUrl | A relative URL to the resource that contains the data. When this property isn't specified, only the URL that's specified in the linked service definition is used. | No |
| requestMethod | The HTTP method. Allowed values are Get (default) and Post. | No |
| additionalHeaders | Additional HTTP request headers. | No |
| requestBody | The body for the HTTP request. | No |
| format | If you want to retrieve data from the HTTP endpoint as-is without parsing it, and then copy the data to a file-based store, skip the format section in both the input and output dataset definitions. If you want to parse the HTTP response content during copy, the following file format types are supported: TextFormat, JsonFormat, AvroFormat, OrcFormat, and ParquetFormat. Under format, set the type property to one of these values. For more information, see JSON format, Text format, Avro format, Orc format, and Parquet format. | No |
| compression | Specify the type and level of compression for the data. For more information, see Supported file formats and compression codecs. Supported types: GZip, Deflate, BZip2, and ZipDeflate. Supported levels: Optimal and Fastest. | No |

Note

The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is larger than 500 KB, consider batching the payload in smaller chunks.

Example 1: Using the Get method (default)

{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "HttpFile",
        "linkedServiceName": {
            "referenceName": "<HTTP linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "relativeUrl": "<relative url>",
            "additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
        }
    }
}

Example 2: Using the Post method

{
    "name": "HttpSourceDataInput",
    "properties": {
        "type": "HttpFile",
        "linkedServiceName": {
            "referenceName": "<HTTP linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "relativeUrl": "<relative url>",
            "requestMethod": "Post",
            "requestBody": "<body for POST HTTP request>"
        }
    }
}

Legacy copy activity source model

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the copy activity source must be set to HttpSource. | Yes |
| httpRequestTimeout | The timeout (the TimeSpan value) for the HTTP request to get a response. This value is the timeout to get a response, not the timeout to read response data. The default value is 00:01:40. | No |

Example

"activities":[
    {
        "name": "CopyFromHTTP",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<HTTP input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "HttpSource",
                "httpRequestTimeout": "00:01:00"
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]
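
When comparing the legacy model with the current one, it can help to see where each legacy property ends up. The Python sketch below is an informal mapping derived from the property tables in this article, not an official migration tool: relativeUrl moves to the dataset location, while requestMethod, additionalHeaders, requestBody, and httpRequestTimeout move to the copy source storeSettings. The function name is hypothetical, and format and compression settings are deliberately left out.

```python
def split_legacy_settings(legacy_dataset_props, legacy_source_props):
    """Map legacy HttpFile and HttpSource properties onto the current model's sections."""
    # Dataset side: location settings of a format-based dataset.
    location = {"type": "HttpServerLocation"}
    if "relativeUrl" in legacy_dataset_props:
        location["relativeUrl"] = legacy_dataset_props["relativeUrl"]

    # Copy activity side: storeSettings of a format-based copy source.
    store_settings = {"type": "HttpReadSettings"}
    for key in ("requestMethod", "additionalHeaders", "requestBody"):
        if key in legacy_dataset_props:
            store_settings[key] = legacy_dataset_props[key]
    if "httpRequestTimeout" in legacy_source_props:
        store_settings["httpRequestTimeout"] = legacy_source_props["httpRequestTimeout"]
    return location, store_settings


location, store_settings = split_legacy_settings(
    {
        "relativeUrl": "<relative url>",
        "requestMethod": "Post",
        "requestBody": "<body for POST HTTP request>",
    },
    {"httpRequestTimeout": "00:01:00"},
)
print(location)
print(store_settings)
```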

For a list of data stores that Copy Activity supports as sources and sinks, see Supported data stores and formats.