Use GPT-4 Turbo with Vision

05/06/2024

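GPT-4 Turbo with Vision is a large multimodal model (LMM) developed by OpenAI that can analyze images and provide textual responses to questions about them. It incorporates both natural language processing and visual understanding.

The GPT-4 Turbo with Vision model answers general questions about what's present in images. If you use the Vision enhancement, you can also prompt it with video.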

Tip

To use GPT-4 Turbo with Vision, call the Chat Completion API on a GPT-4 Turbo with Vision model that you've deployed. If you're not familiar with the Chat Completion API, see the GPT-4 Turbo & GPT-4 how-to guide.

GPT-4 Turbo model upgrade

The latest GA release of GPT-4 Turbo is:

gpt-4 Version: turbo-2024-04-09

This replaces the following preview models:

gpt-4 Version: 1106-Preview
gpt-4 Version: 0125-Preview
gpt-4 Version: vision-preview

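Differences between OpenAI and Azure OpenAI GPT-4 Turbo GA models

OpenAI's version of the latest 0409 turbo model supports JSON mode and function calling for all inference requests.
Azure OpenAI's version of the latest turbo-2024-04-09 currently doesn't support JSON mode or function calling for inference requests with image (vision) input. Text-based input requests (requests without image_url and inline images) do support JSON mode and function calling.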
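
As a rough illustration of the restriction above, the following sketch checks whether a message list contains image input before asking for JSON mode. The helper names `has_image_input` and `build_request_options` are hypothetical and not part of any SDK; the `response_format` value is the standard JSON mode setting of the Chat Completions API.

```python
def has_image_input(messages):
    """Return True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        # Plain-string content is text only; array content may mix text and images.
        if isinstance(content, list):
            if any(part.get("type") == "image_url" for part in content):
                return True
    return False


def build_request_options(messages, want_json=False):
    """Hypothetical helper: request JSON mode only for text-only input."""
    options = {"messages": messages, "max_tokens": 100}
    if want_json and not has_image_input(messages):
        options["response_format"] = {"type": "json_object"}
    return options
```

With this guard, a request that includes an image simply omits `response_format` instead of sending a combination that Azure OpenAI doesn't currently support.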

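Differences from gpt-4 vision-preview

Integration of the Azure AI specific Vision enhancements with GPT-4 Turbo with Vision isn't supported for gpt-4 Version: turbo-2024-04-09. This includes Optical Character Recognition (OCR), object grounding, video prompts, and improved handling of your data with images.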

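GPT-4 Turbo provisioned managed availability

gpt-4 Version: turbo-2024-04-09 is available for both standard and provisioned deployments. Currently, the provisioned version of this model doesn't support image/vision inference requests; provisioned deployments of this model accept text input only. Standard model deployments accept both text and image/vision inference requests.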

Region availability

For information on model regional availability, see the model matrix for standard and provisioned deployments.

Deploying GPT-4 Turbo with Vision GA

To deploy the GA model from the Studio UI, select GPT-4 and then choose the turbo-2024-04-09 version from the dropdown menu. The default quota for the gpt-4-turbo-2024-04-09 model is the same as the current quota for GPT-4-Turbo. See the regional quota limits.

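Call the Chat Completion APIs

The following command shows the most basic way to use the GPT-4 Turbo with Vision model with code. If this is your first time using these models programmatically, we recommend starting with the GPT-4 Turbo with Vision quickstart.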

REST requests are described first, followed by the Python SDK.

Send a POST request to https://{RESOURCE_NAME}.openai.azure.com/openai/deployments/{DEPLOYMENT_NAME}/chat/completions?api-version=2024-02-15-preview where

RESOURCE_NAME is the name of your Azure OpenAI resource
DEPLOYMENT_NAME is the name of your GPT-4 Turbo with Vision model deployment

Required headers:

Content-Type: application/json
api-key: {API_KEY}

Body: The following is a sample request body. The format is the same as the Chat Completions API for GPT-4, except that the message content can be an array containing text and images (either a valid HTTP or HTTPS URL to an image, or a base-64-encoded image).

Important

Remember to set a "max_tokens" value, or the return output will be cut off.

Important

When uploading images, there is a limit of 10 images per chat request.

{
    "messages": [ 
        {
            "role": "system", 
            "content": "You are a helpful assistant." 
        },
        {
            "role": "user", 
            "content": [
                {
                    "type": "text",
                    "text": "Describe this picture:"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "<image URL>"
                    }
                }
            ]
        }
    ],
    "max_tokens": 100, 
    "stream": false 
}
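
For readers who want to issue the REST call from a script, here is a minimal sketch that assembles the same request with the Python standard library. The function name `build_chat_request` and the example values are illustrative assumptions; the URL pattern, headers, and body follow the description above. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_request(resource_name, deployment_name, api_key, image_url,
                       api_version="2024-02-15-preview"):
    """Build (but do not send) the POST request described above."""
    url = (
        f"https://{resource_name}.openai.azure.com/openai/deployments/"
        f"{deployment_name}/chat/completions?api-version={api_version}"
    )
    body = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this picture:"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        "max_tokens": 100,  # always set, otherwise the output is cut off
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and parse the JSON response.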

For the Python SDK, define your Azure OpenAI resource endpoint and key.
Enter the name of your GPT-4 Turbo with Vision model deployment.

Create a client object using those values.

from openai import AzureOpenAI

api_base = '<your_azure_openai_endpoint>'  # your endpoint should look like the following https://YOUR_RESOURCE_NAME.openai.azure.com/
api_key = "<your_azure_openai_key>"
deployment_name = '<your_deployment_name>'
api_version = '2024-02-15-preview'  # this might change in the future

client = AzureOpenAI(
    api_key=api_key,  
    api_version=api_version,
    base_url=f"{api_base}openai/deployments/{deployment_name}",
)

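Then call the client's create method. The following code shows a sample request body. The format is the same as the Chat Completions API for GPT-4, except that the message content can be an array containing text and images (either a valid HTTP or HTTPS URL to an image, or a base-64-encoded image).

Important

Remember to set a "max_tokens" value, or the return output will be cut off.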
```
response = client.chat.completions.create(
    model=deployment_name,
    messages=[
        { "role": "system", "content": "You are a helpful assistant." },
        { "role": "user", "content": [  
            { 
                "type": "text", 
                "text": "Describe this picture:" 
            },
            { 
                "type": "image_url",
                "image_url": {
                    "url": "<image URL>"
                }
            }
        ] } 
    ],
    max_tokens=2000 
)
print(response)
```
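
When a prompt refers to several images, the content array grows by one image_url entry per image. The following sketch builds such an array and enforces the limit of 10 images per chat request mentioned in this article; `build_user_content` and `MAX_IMAGES_PER_REQUEST` are hypothetical names introduced for illustration.

```python
MAX_IMAGES_PER_REQUEST = 10  # limit stated in this article


def build_user_content(prompt, image_urls):
    """Return a user-message content array with one text part and N image parts."""
    if len(image_urls) > MAX_IMAGES_PER_REQUEST:
        raise ValueError(
            f"{len(image_urls)} images given; at most "
            f"{MAX_IMAGES_PER_REQUEST} are allowed per chat request"
        )
    content = [{"type": "text", "text": prompt}]
    content.extend(
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    )
    return content
```

The result can be used as the "content" value of the user message in the call shown above.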

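Tip

Use a local image

If you want to use a local image, you can use the following Python code to convert it to base64 so it can be passed to the API. Alternative file conversion tools are available online.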

import base64
from mimetypes import guess_type

# Function to encode a local image into data URL 
def local_image_to_data_url(image_path):
    # Guess the MIME type of the image based on the file extension
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = 'application/octet-stream'  # Default MIME type if none is found

    # Read and encode the image file
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')

    # Construct the data URL
    return f"data:{mime_type};base64,{base64_encoded_data}"

# Example usage
image_path = '<path_to_image>'
data_url = local_image_to_data_url(image_path)
print("Data URL:", data_url)

When your base64 image data is ready, you can pass it to the API in the request body like this:

...
"type": "image_url",
"image_url": {
   "url": "data:image/jpeg;base64,<your_image_data>"
}
...
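
If the image is already in memory (for example, downloaded or generated bytes), the same data URL can be produced without touching the file system. This is a small sketch under that assumption; `image_bytes_to_part` is a hypothetical helper, and the default MIME type is only a placeholder that you should set to match your image.

```python
import base64


def image_bytes_to_part(image_bytes, mime_type="image/jpeg"):
    """Wrap raw image bytes in an image_url content part using a data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }
```

The returned dictionary can be appended to the content array of a user message alongside the text part.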

Output

The API response should look like the following.

{
    "id": "chatcmpl-8VAVx58veW9RCm5K1ttmxU6Cm4XDX",
    "object": "chat.completion",
    "created": 1702439277,
    "model": "gpt-4",
    "prompt_filter_results": [
        {
            "prompt_index": 0,
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "choices": [
        {
            "finish_reason":"stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The picture shows an individual dressed in formal attire, which includes a black tuxedo with a black bow tie. There is an American flag on the left lapel of the individual's jacket. The background is predominantly blue with white text that reads \"THE KENNEDY PROFILE IN COURAGE AWARD\" and there are also visible elements of the flag of the United States placed behind the individual."
            },
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "usage": {
        "prompt_tokens": 1156,
        "completion_tokens": 80,
        "total_tokens": 1236
    }
}

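Every response includes a "finish_reason" field, as shown in the sample above. It has the following possible values:

stop: API returned complete model output.
length: Incomplete model output due to the max_tokens input parameter or the model's token limit.
content_filter: Omitted content due to a flag from our content filters.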
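
The values listed above can be checked in code before the text is used. The sketch below works on the response parsed as a plain dictionary; `summarize_choice` is a hypothetical helper. It reads "finish_reason" and, as an assumption based on the other sample responses in this article, falls back to the "type" inside "finish_details" when that older shape is returned instead.

```python
def summarize_choice(response):
    """Return (text, reason, is_complete) for the first choice of a response dict."""
    choice = response["choices"][0]
    reason = choice.get("finish_reason")
    if reason is None:
        # Fallback for responses that carry finish_details instead (assumption).
        reason = choice.get("finish_details", {}).get("type")
    text = choice.get("message", {}).get("content") or ""
    # Only "stop" means the model returned its complete output.
    return text, reason, reason == "stop"
```

A caller might retry with a larger max_tokens when the reason is "length", and surface a policy message when it is "content_filter".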

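Detail parameter settings in image processing: low, high, auto

The detail parameter in the model offers three choices (low, high, or auto) to adjust the way the model interprets and processes images. The default setting is auto, where the model decides between low and high based on the size of the image input.

low setting: the model doesn't activate the "high res" mode; instead it processes a lower-resolution 512x512 version of the image. This gives quicker responses and lower token consumption for scenarios where fine detail isn't crucial.
high setting: the model activates "high res" mode. Here, the model first views the low-resolution image and then generates detailed 512x512 segments from the input image. Each segment uses double the token budget, allowing for a more detailed interpretation of the image.

For details on how the image parameters affect tokens used and pricing, see "Image tokens (GPT-4 Turbo with Vision)" on the OpenAI overview page.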
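
The sketch below shows where the detail setting goes in a request (inside the image_url object) and gives a rough token estimate. The constants and resizing steps in `estimate_image_tokens` (85 base tokens, 170 tokens per 512x512 segment, fitting within 2048x2048 and then scaling the shortest side down to 768) are assumptions taken from the publicly documented image-token guidance, not from this article, and may change; both function names are hypothetical.

```python
import math


def image_part(url, detail="auto"):
    """Build an image_url content part with an explicit detail setting."""
    if detail not in ("low", "high", "auto"):
        raise ValueError("detail must be 'low', 'high', or 'auto'")
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def estimate_image_tokens(width, height, detail="high"):
    """Rough per-image token estimate under the assumptions stated above."""
    if detail == "low":
        return 85  # assumed flat cost of the low-resolution pass
    # Step 1: shrink to fit inside a 2048 x 2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 pixels.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512 x 512 segments needed to cover the image.
    segments = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * segments
```

Under these assumptions a 2048 x 4096 image in high detail is covered by six segments and comes to 1105 tokens, while the same image in low detail costs 85.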

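Use Vision enhancement with images

GPT-4 Turbo with Vision provides exclusive access to Azure AI Services tailored enhancements. When combined with Azure AI Vision, it enhances your chat experience by giving the chat model more detailed information about visible text in the image and the locations of objects.

The Optical Character Recognition (OCR) integration allows the model to produce higher quality responses for dense text, transformed images, and number-heavy financial documents. It also covers a wider range of languages.

The object grounding integration brings a new layer to data analysis and user interaction, because the feature can visually distinguish and highlight important elements in the images it processes.

Important

To use the Vision enhancement with an Azure OpenAI resource, you need to specify a Computer Vision resource. It must be in the paid (S1) tier and in the same Azure region as your GPT-4 Turbo with Vision resource. If you're using an Azure AI Services resource, you don't need an additional Computer Vision resource.

Caution

Azure AI enhancements for GPT-4 Turbo with Vision are billed separately from the core functionality. Each specific Azure AI enhancement for GPT-4 Turbo with Vision has its own distinct charges. For details, see the special pricing information.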

REST requests are described first, followed by the Python SDK.

Send a POST request to https://{RESOURCE_NAME}.openai.azure.com/openai/deployments/{DEPLOYMENT_NAME}/chat/completions?api-version=2024-02-15-preview where

RESOURCE_NAME is the name of your Azure OpenAI resource
DEPLOYMENT_NAME is the name of your GPT-4 Turbo with Vision model deployment

Required headers:

Content-Type: application/json
api-key: {API_KEY}

The body should look like the following.

The format is similar to that of the Chat Completions API for GPT-4, but the message content can be an array containing strings and images (either a valid HTTP or HTTPS URL to an image, or a base-64-encoded image).

You must also include the enhancements and dataSources objects. enhancements represents the specific Vision enhancement features requested in the chat. It has grounding and ocr properties, each with a boolean enabled property. Use these to request the OCR service and/or the object detection/grounding service. dataSources represents the Computer Vision resource data that's needed for Vision enhancement. It has a type property, which should be "AzureComputerVision", and a parameters property. Set endpoint and key to the endpoint URL and access key of your Computer Vision resource.

Important

Remember to set a "max_tokens" value, or the return output will be cut off.

{
    "enhancements": {
            "ocr": {
              "enabled": true
            },
            "grounding": {
              "enabled": true
            }
    },
    "dataSources": [
    {
        "type": "AzureComputerVision",
        "parameters": {
            "endpoint": "<your_computer_vision_endpoint>",
            "key": "<your_computer_vision_key>"
        }
    }],
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this picture:"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "<image URL>"
                    }
                }
            ]
        }
    ],
    "max_tokens": 100, 
    "stream": false 
}

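For the Python SDK, call the same method as in the previous step, but include the new extra_body parameter. It contains the enhancements and dataSources fields.

enhancements represents the specific Vision enhancement features requested in the chat. It has grounding and ocr fields, each with a boolean enabled property. Use these to request the OCR service and/or the object detection/grounding service.

dataSources represents the Computer Vision resource data that's needed for Vision enhancement. It has a type field, which should be "AzureComputerVision", and a parameters field. Set endpoint and key to the endpoint URL and access key of your Computer Vision resource.

Important

Remember to set a "max_tokens" value, or the return output will be cut off.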

response = client.chat.completions.create(
    model=deployment_name,
    messages=[
        { "role": "system", "content": "You are a helpful assistant." },
        { "role": "user", "content": [  
            { 
                "type": "text", 
                "text": "Describe this picture:" 
            },
            { 
                "type": "image_url",
                "image_url": {
                    "url": "<image URL>"
                }
            }
        ] } 
    ],
    extra_body={
        "dataSources": [
            {
                "type": "AzureComputerVision",
                "parameters": {
                    "endpoint": "<your_computer_vision_endpoint>",
                    "key": "<your_computer_vision_key>"
                }
            }],
        "enhancements": {
            "ocr": {
                "enabled": True
            },
            "grounding": {
                "enabled": True
            }
        }
    },
    max_tokens=2000
)
print(response)

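Output

The chat responses you receive from the model should now include enhanced information about the image, such as object labels and bounding boxes, and OCR results. The API response should look like the following.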

{
    "id": "chatcmpl-8UyuhLfzwTj34zpevT3tWlVIgCpPg",
    "object": "chat.completion",
    "created": 1702394683,
    "model": "gpt-4",
    "choices":
    [
        {
            "finish_details": {
                "type": "stop",
                "stop": "<|fim_suffix|>"
            },
            "index": 0,
            "message":
            {
                "role": "assistant",
                "content": "The image shows a close-up of an individual with dark hair and what appears to be a short haircut. The person has visible ears and a bit of their neckline. The background is a neutral light color, providing a contrast to the dark hair."
            },
            "enhancements":
            {
                "grounding":
                {
                    "lines":
                    [
                        {
                            "text": "The image shows a close-up of an individual with dark hair and what appears to be a short haircut. The person has visible ears and a bit of their neckline. The background is a neutral light color, providing a contrast to the dark hair.",
                            "spans":
                            [
                                {
                                    "text": "the person",
                                    "length": 10,
                                    "offset": 99,
                                    "polygon": [{"x":0.11950000375509262,"y":0.4124999940395355},{"x":0.8034999370574951,"y":0.4124999940395355},{"x":0.8034999370574951,"y":0.6434999704360962},{"x":0.11950000375509262,"y":0.6434999704360962}]
                                }
                            ]
                        }
                    ],
                    "status": "Success"
                }
            }
        }
    ],
    "usage":
    {
        "prompt_tokens": 816,
        "completion_tokens": 49,
        "total_tokens": 865
    }
}
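
The grounding data in a response like the one above can be turned into phrases with pixel coordinates. The following is a sketch only: `grounded_spans` is a hypothetical helper, and it assumes that polygon coordinates are normalized to the range 0 to 1 relative to the image size (which the sample values suggest but this article doesn't state) and that offset is a zero-based character index into the line text.

```python
def grounded_spans(response, image_width, image_height):
    """Return a list of (phrase, pixel_polygon) pairs from a parsed response dict."""
    results = []
    for choice in response.get("choices", []):
        grounding = choice.get("enhancements", {}).get("grounding", {})
        for line in grounding.get("lines", []):
            for span in line.get("spans", []):
                start = span["offset"]
                # Slice the phrase out of the line text using offset and length.
                phrase = line["text"][start:start + span["length"]]
                # Scale the assumed-normalized points to pixel coordinates.
                pixels = [
                    (round(p["x"] * image_width), round(p["y"] * image_height))
                    for p in span["polygon"]
                ]
                results.append((phrase, pixels))
    return results
```

The pixel polygons can then be drawn over the original image with any imaging library to highlight the grounded objects.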

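Every response includes a "finish_details" field. It has the following possible values:

stop: API returned complete model output.
length: Incomplete model output due to the max_tokens input parameter or the model's token limit.
content_filter: Omitted content due to a flag from our content filters.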

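Use Vision enhancement with video

GPT-4 Turbo with Vision provides exclusive access to Azure AI Services tailored enhancements. The video prompt integration uses Azure AI Vision video retrieval to sample a set of frames from a video and create a transcript of the speech in the video. This enables the AI model to give summaries and answers about video content.

Follow these steps to set up a video retrieval system and integrate it with your AI chat model.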

Tip

If you prefer, you can carry out the steps below using a Jupyter notebook instead: the Video chat completions notebook.

Upload videos to Azure Blob Storage

You need to upload your videos to an Azure Blob Storage container. Create a new storage account if you don't have one already.

Once your videos are uploaded, you can get their SAS URLs, which you use to access them in later steps.

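Ensure proper read access

Depending on your authentication method, you may need some extra steps to grant access to the Azure Blob Storage container. If you're using an Azure AI Services resource instead of an Azure OpenAI resource, you need to use a managed identity and grant it read access to Azure Blob Storage.

Using a system-assigned identity

Enable the system-assigned identity on your Azure AI Services resource by following these steps:

From your AI Services resource in the Azure portal, select Resource Management, then Identity, and toggle the status to On.
Assign Storage Blob Data Read access to the AI Services resource: on the Identity page, select Azure role assignments, and then add a role assignment with the following settings:
- Scope: Storage
- Subscription: {your subscription}
- Resource: {select the Azure Blob Storage resource}
- Role: Storage Blob Data Reader
Save your settings.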

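Create a video retrieval index

Get an Azure AI Vision resource in the same region as the Azure OpenAI resource you're using.
Create an index to store and organize the video files and their metadata. The example command below shows how to create an index named my-video-index using the Create Index API. Save the index name to a temporary location; you'll need it in later steps.

Tip

For more detailed instructions on creating a video index, see Do video retrieval using vectorization.

Important

A video index name can be up to 24 characters long, unless it's a GUID, which can be 36 characters.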
```
curl.exe -v -X PUT "https://<YOUR_ENDPOINT_URL>/computervision/retrieval/indexes/my-video-index?api-version=2023-05-01-preview" -H "Ocp-Apim-Subscription-Key: <YOUR_SUBSCRIPTION_KEY>" -H "Content-Type: application/json" --data-ascii "
{
  'metadataSchema': {
    'fields': [
      {
        'name': 'cameraId',
        'searchable': false,
        'filterable': true,
        'type': 'string'
      },
      {
        'name': 'timestamp',
        'searchable': false,
        'filterable': true,
        'type': 'datetime'
      }
    ]
  },
  'features': [
    {
      'name': 'vision',
      'domain': 'surveillance'
    },
    {
      'name': 'speech'
    }
  ]
}"
```
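
The same Create Index call can be prepared from Python. The sketch below mirrors the curl command above using only the standard library and adds a check for the index-name length rule; `is_valid_index_name` and `build_create_index_request` are hypothetical helper names, and `endpoint_host` stands for the <YOUR_ENDPOINT_URL> placeholder. The request is built but not sent.

```python
import json
import urllib.request
import uuid


def is_valid_index_name(name):
    """Up to 24 characters, or a 36-character GUID, per the rule above."""
    if len(name) <= 24:
        return bool(name)
    try:
        return len(name) == 36 and str(uuid.UUID(name)) == name.lower()
    except ValueError:
        return False


def build_create_index_request(endpoint_host, subscription_key, index_name,
                               api_version="2023-05-01-preview"):
    """Build (but do not send) the PUT request that creates a video index."""
    if not is_valid_index_name(index_name):
        raise ValueError(f"invalid video index name: {index_name!r}")
    url = (f"https://{endpoint_host}/computervision/retrieval/indexes/"
           f"{index_name}?api-version={api_version}")
    body = {
        "metadataSchema": {"fields": [
            {"name": "cameraId", "searchable": False,
             "filterable": True, "type": "string"},
            {"name": "timestamp", "searchable": False,
             "filterable": True, "type": "datetime"},
        ]},
        "features": [{"name": "vision", "domain": "surveillance"},
                     {"name": "speech"}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Ocp-Apim-Subscription-Key": subscription_key,
                 "Content-Type": "application/json"},
        method="PUT",
    )
```

Validating the name locally avoids a round trip to the service for a name that the rule above already excludes.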

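Add video files to the index with their associated metadata. The example below shows how to add two video files to the index using SAS URLs with the Create Ingestion API. Save the SAS URLs and documentId values to a temporary location; you'll need them in later steps.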

curl.exe -v -X PUT "https://<YOUR_ENDPOINT_URL>/computervision/retrieval/indexes/my-video-index/ingestions/my-ingestion?api-version=2023-05-01-preview" -H "Ocp-Apim-Subscription-Key: <YOUR_SUBSCRIPTION_KEY>" -H "Content-Type: application/json" --data-ascii "
{
  'videos': [
    {
      'mode': 'add',
      'documentId': '02a504c9cd28296a8b74394ed7488045',
      'documentUrl': 'https://example.blob.core.windows.net/videos/02a504c9cd28296a8b74394ed7488045.mp4?sas_token_here',
      'metadata': {
        'cameraId': 'camera1',
        'timestamp': '2023-06-30 17:40:33'
      }
    },
    {
      'mode': 'add',
      'documentId': '043ad56daad86cdaa6e493aa11ebdab3',
      'documentUrl': 'https://example.blob.core.windows.net/videos/043ad56daad86cdaa6e493aa11ebdab3.mp4?sas_token_here',
      'metadata': {
        'cameraId': 'camera2'
      }
    }
  ]
}"
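
When the list of videos comes from code rather than a hand-written command, the ingestion body can be generated. This sketch assumes each video is described by a small dictionary with the hypothetical keys `document_id`, `sas_url`, and optional `metadata`; the output field names match the request shown above.

```python
import json


def build_ingestion_body(videos):
    """Return the JSON text for a Create Ingestion request body."""
    entries = []
    for video in videos:
        entry = {
            "mode": "add",
            "documentId": video["document_id"],
            "documentUrl": video["sas_url"],
        }
        # Metadata is optional; include it only when provided.
        if video.get("metadata"):
            entry["metadata"] = video["metadata"]
        entries.append(entry)
    return json.dumps({"videos": entries}, indent=2)
```

Because the body is produced by `json.dumps`, it uses standard double-quoted JSON and can be sent as the data of a PUT request to the ingestion URL shown above.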

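After you add video files to the index, the ingestion process starts. It might take some time, depending on the size and number of files. To make sure ingestion is complete before you run searches, you can use the Get Ingestion API to check the status. Wait for this call to return "state" = "Completed" before proceeding to the next step.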
```
curl.exe -v -X GET "https://<YOUR_ENDPOINT_URL>/computervision/retrieval/indexes/my-video-index/ingestions?api-version=2023-05-01-preview&$top=20" -H "ocp-apim-subscription-key: <YOUR_SUBSCRIPTION_KEY>"
```
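
Waiting for the "Completed" state is easy to automate with a polling loop. The sketch below is deliberately independent of any HTTP library: `fetch_state` is a caller-supplied function (hypothetical) that performs the GET call above and returns the state string. Treating "Failed" as a terminal error state is an assumption, as are the default interval and timeout.

```python
import time


def wait_for_ingestion(fetch_state, poll_seconds=10.0, timeout_seconds=1800.0,
                       sleep=time.sleep):
    """Poll fetch_state() until it returns "Completed" or the timeout elapses."""
    waited = 0.0
    while True:
        state = fetch_state()
        if state == "Completed":
            return state
        if state == "Failed":  # assumed name of the terminal failure state
            raise RuntimeError("video ingestion failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"ingestion still {state!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing the sleep function as a parameter keeps the loop testable without real delays.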

Integrate your video index with GPT-4 Turbo with Vision

Prepare a POST request to https://{RESOURCE_NAME}.openai.azure.com/openai/deployments/{DEPLOYMENT_NAME}/chat/completions?api-version=2024-02-15-preview where
- RESOURCE_NAME is the name of your Azure OpenAI resource
- DEPLOYMENT_NAME is the name of your GPT-4 Vision model deployment
Required headers:
- Content-Type: application/json
- api-key: {API_KEY}

Add the following JSON structure in the request body:

{
    "enhancements": {
            "video": {
              "enabled": true
            }
    },
    "dataSources": [
    {
        "type": "AzureComputerVisionVideoIndex",
        "parameters": {
            "computerVisionBaseUrl": "<your_computer_vision_endpoint>",
            "computerVisionApiKey": "<your_computer_vision_key>",
            "indexName": "<name_of_your_index>",
            "videoUrls": ["<your_video_SAS_URL>"]
        }
    }],
    "messages": [ 
        {
            "role": "system", 
            "content": "You are a helpful assistant." 
        },
        {
            "role": "user",
            "content": [
                    {
                        "type": "acv_document_id",
                        "acv_document_id": "<your_video_ID>"
                    },
                    {
                        "type": "text",
                        "text": "Describe this video:"
                    }
                ]
        }
    ],
    "max_tokens": 100
}

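The request includes the enhancements and dataSources objects. enhancements represents the specific Vision enhancement feature requested in the chat. dataSources represents the Computer Vision resource data that's needed for Vision enhancement. It has a type property, which should be "AzureComputerVisionVideoIndex", and a parameters property, which contains your AI Vision and video information.

Fill in all the <placeholder> fields above with your own information: enter the endpoint URLs and keys of your OpenAI and AI Vision resources where appropriate, and retrieve the video index information from the earlier step.
Send the POST request to the API endpoint. It should contain your OpenAI and AI Vision credentials, the name of your video index, and the ID and SAS URL of a single video.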

In your Python script, call the client's create method as in the previous sections, but include the extra_body parameter. Here, it contains the enhancements and data_sources fields. enhancements represents the specific Vision enhancement features requested in the chat. It has a video field, which has a boolean enabled property. Use this to request the video retrieval service.

data_sources represents the external resource data that's needed for Vision enhancement. It has a type field, which should be "AzureComputerVisionVideoIndex", and a parameters field.

Set computerVisionBaseUrl and computerVisionApiKey to the endpoint URL and access key of your Computer Vision resource. Set indexName to the name of your video index. Set videoUrls to a list of SAS URLs of your videos.

Important

Remember to set a "max_tokens" value, or the return output will be cut off.

response = client.chat.completions.create(
    model=deployment_name,
    messages=[
        { "role": "system", "content": "You are a helpful assistant." },
        { "role": "user", "content": [  
            {
                "type": "acv_document_id",
                "acv_document_id": "<your_video_ID>"
            },
            { 
                "type": "text", 
                "text": "Describe this video:" 
            }
        ] } 
    ],
    extra_body={
        "data_sources": [
            {
                "type": "AzureComputerVisionVideoIndex",
                "parameters": {
                    "computerVisionBaseUrl": "<your_computer_vision_endpoint>", # your endpoint should look like the following https://YOUR_RESOURCE_NAME.cognitiveservices.azure.com/computervision
                    "computerVisionApiKey": "<your_computer_vision_key>",
                    "indexName": "<name_of_your_index>",
                    "videoUrls": ["<your_video_SAS_URL>"]
                }
            }],
        "enhancements": {
            "video": {
                "enabled": True
            }
        }
    },
    max_tokens=100
)

print(response)

Important

The content of the "data_sources" object varies depending on which Azure resource type and authentication method you're using. See the following reference:

"data_sources": [
{
    "type": "AzureComputerVisionVideoIndex",
    "parameters": {
    "endpoint": "<your_computer_vision_endpoint>",
    "computerVisionApiKey": "<your_computer_vision_key>",
    "indexName": "<name_of_your_index>",
    "videoUrls": ["<your_video_SAS_URL>"]
    }
}],

"data_sources": [
{
    "type": "AzureComputerVisionVideoIndex",
    "parameters": {
    "indexName": "<name_of_your_index>",
    "videoUrls": ["<your_video_SAS_URL>"]
    }
}],

"data_sources": [
{
    "type": "AzureComputerVisionVideoIndex",
    "parameters": {
        "indexName": "<name_of_your_index>",
        "documentAuthenticationKind": "managedidentity"
    }
}],

Output

The chat responses you receive from the model should include information about the video. The API response should look like the following.

{
    "id": "chatcmpl-8V4J2cFo7TWO7rIfs47XuDzTKvbct",
    "object": "chat.completion",
    "created": 1702415412,
    "model": "gpt-4",
    "choices":
    [
        {
            "finish_reason":"stop",
            "index": 0,
            "message":
            {
                "role": "assistant",
                "content": "The advertisement video opens with a blurred background that suggests a serene and aesthetically pleasing environment, possibly a workspace with a nature view. As the video progresses, a series of frames showcase a digital interface with search bars and prompts like \"Inspire new ideas,\" \"Research a topic,\" and \"Organize my plans,\" suggesting features of a software or application designed to assist with productivity and creativity.\n\nThe color palette is soft and varied, featuring pastel blues, pinks, and purples, creating a calm and inviting atmosphere. The backgrounds of some frames are adorned with abstract, organically shaped elements and animations, adding to the sense of innovation and modernity.\n\nMidway through the video, the focus shifts to what appears to be a browser or software interface with the phrase \"Screens simulated, subject to change; feature availability and timing may vary,\" indicating the product is in development and that the visuals are illustrative of its capabilities.\n\nThe use of text prompts continues with \"Help me relax,\" followed by a demonstration of a 'dark mode' feature, providing a glimpse into the software's versatility and user-friendly design.\n\nThe video concludes by revealing the product name, \"Copilot,\" and positioning it as \"Your everyday AI companion,\" implying the use of artificial intelligence to enhance daily tasks. The final frames feature the Microsoft logo, associating the product with the well-known technology company.\n\nIn summary, the advertisement video is for a Microsoft product named \"Copilot,\" which seems to be an AI-powered software tool aimed at improving productivity, creativity, and organization for its users. The video conveys a message of innovation, ease, and support in daily digital interactions through a visually appealing and calming presentation."
            }
        }
    ],
    "usage":
    {
        "prompt_tokens": 2068,
        "completion_tokens": 341,
        "total_tokens": 2409
    }
}

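Every response includes a "finish_reason" field, as shown in the sample above. It has the following possible values:

stop: API returned complete model output.
length: Incomplete model output due to the max_tokens input parameter or the model's token limit.
content_filter: Omitted content due to a flag from our content filters.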

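Pricing example for video prompts

The pricing for GPT-4 Turbo with Vision is dynamic and depends on the specific features and inputs used. For a comprehensive view of Azure OpenAI pricing, see Azure OpenAI Pricing.

The base charges and additional features are outlined below:

Base pricing for GPT-4 Turbo with Vision is:

Input: $0.01 per 1000 tokens
Output: $0.03 per 1000 tokens

Video prompt integration with the Video Retrieval add-on:

Ingestion: $0.05 per minute of video
Transactions: $0.25 per 1000 queries of Video Retrieval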
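
To make the rates above concrete, here is a small worked calculation in Python. It simply applies the listed prices; `estimate_cost` is a hypothetical helper, the rates are copied from this article and may no longer be current, and actual billing can differ.

```python
INPUT_PER_1K = 0.01        # USD per 1000 input tokens
OUTPUT_PER_1K = 0.03       # USD per 1000 output tokens
INGESTION_PER_MIN = 0.05   # USD per minute of ingested video
QUERY_PER_1K = 0.25        # USD per 1000 Video Retrieval queries


def estimate_cost(prompt_tokens, completion_tokens,
                  video_minutes=0.0, retrieval_queries=0):
    """Estimate the charge in USD for one request plus optional video usage."""
    return (
        prompt_tokens / 1000 * INPUT_PER_1K
        + completion_tokens / 1000 * OUTPUT_PER_1K
        + video_minutes * INGESTION_PER_MIN
        + retrieval_queries / 1000 * QUERY_PER_1K
    )
```

For the video response shown earlier (2068 prompt tokens and 341 completion tokens), the token charge alone is 2.068 x $0.01 + 0.341 x $0.03, or about $0.031. Ingesting a 3-minute video and running one retrieval query would add $0.15 and $0.00025.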
