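Generate responses with the Responses API

10 minutes

The OpenAI Responses API brings together the capabilities of two previously separate APIs (ChatCompletions and Assistants) in a unified experience. It provides stateful, multi-turn response generation, which makes it well suited to conversational AI applications. You can access the Responses API through an OpenAI-compatible client by using either the Foundry SDK or the OpenAI SDK.

Understand the Responses API

The Responses API offers several advantages over traditional chat completions:

Stateful conversations: Maintains conversation context across multiple turns.
Unified experience: Combines the chat completions and Assistants API patterns.
Foundry direct models: Works with models hosted directly in Microsoft Foundry, not only Azure OpenAI models.
Simple integration: Accessed through an OpenAI-compatible client.

Note

The Responses API is the recommended way to generate AI responses in Microsoft Foundry applications. It replaces the older ChatCompletions API for most scenarios.

Generate a simple response

With an OpenAI-compatible client, you can generate a response by calling the responses.create() method.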

# Generate a response using the OpenAI-compatible client
response = openai_client.responses.create(
    model="gpt-4.1",  # Your model deployment name
    input="What is Microsoft Foundry?"
)

# Display the response
print(response.output_text)

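The input parameter accepts a text string that contains your prompt. The model generates a response based on this input.

Understand the response structure

The response object contains several useful properties:

output_text: The generated text response
id: A unique identifier for this response
status: The response status (for example, "completed")
usage: Token usage information (input, output, and total tokens)
model: The model used to generate the response

You can access these properties to handle the response effectively.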

response = openai_client.responses.create(
    model="gpt-4.1",
    input="Explain machine learning in simple terms."
)

print(f"Response: {response.output_text}")
print(f"Response ID: {response.id}")
print(f"Tokens used: {response.usage.total_tokens}")
print(f"Status: {response.status}")
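
The properties listed above can be gathered into a plain dictionary for logging or storage. The following is a minimal sketch: the helper name summarize_response is hypothetical, the stand-in object exists only so the snippet runs offline, and it assumes the usage object exposes input_tokens, output_tokens, and total_tokens.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect commonly used response properties into a plain dict."""
    usage = response.usage
    return {
        "id": response.id,
        "status": response.status,
        "model": response.model,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "total_tokens": usage.total_tokens,
        "text": response.output_text,
    }


# Stand-in object for illustration only; a real response would come from
# openai_client.responses.create(...).
fake_response = SimpleNamespace(
    id="resp_example",
    status="completed",
    model="gpt-4.1",
    output_text="Machine learning lets computers learn from data.",
    usage=SimpleNamespace(input_tokens=12, output_tokens=18, total_tokens=30),
)

summary = summarize_response(fake_response)
print(summary["status"], summary["total_tokens"])
```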

Add instructions

In addition to user input, you can provide instructions (often called a system prompt) that guide the model's behavior.

response = openai_client.responses.create(
    model="gpt-4.1",
    instructions="You are a helpful AI assistant that answers questions clearly and concisely.",
    input="Explain neural networks."
)

print(response.output_text)

Control response generation

You can control response generation with additional parameters.

response = openai_client.responses.create(
    model="gpt-4.1",
    instructions="You are a helpful AI assistant that answers questions clearly and concisely.",
    input="Write a creative story about AI.",
    temperature=0.8,  # Higher temperature for more creativity
    max_output_tokens=200  # Limit response length
)

print(response.output_text)

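temperature: Controls randomness (0.0-2.0). Higher values make the output more creative and varied.
max_output_tokens: Limits the maximum number of tokens in the response.
top_p: An alternative to temperature for controlling randomness.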
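
To build intuition for what temperature does, the sketch below applies temperature scaling to a softmax over a few made-up scores. It illustrates the general sampling concept only and is not the service's implementation; the function name and the logits values are invented for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature."""
    if temperature <= 0:
        # This sketch does not model temperature 0 (purely greedy selection).
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)   # sharper distribution
high = softmax_with_temperature(logits, 1.5)  # flatter distribution

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At the lower temperature, almost all probability goes to the highest-scoring candidate. At the higher temperature the probabilities are more even, which is why sampled output becomes more varied.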

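Work with Foundry direct models

When you use the Foundry SDK or the AzureOpenAI client to connect to a project endpoint, the Responses API works with both Azure OpenAI models and Foundry direct models (such as Microsoft Phi, DeepSeek, or other models hosted directly in Microsoft Foundry).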

# Using a Foundry direct model
response = openai_client.responses.create(
    model="microsoft-phi-4",  # Example Foundry direct model
    instructions="You are a helpful AI assistant that answers questions clearly and concisely.",
    input="What are the benefits of small language models?"
)

print(response.output_text)

Create conversational experiences

For more complex conversational scenarios, you can provide system instructions and build multi-turn conversations.

# First turn in the conversation
response1 = openai_client.responses.create(
    model="gpt-4.1",
    instructions="You are a helpful AI assistant that explains technology concepts clearly.",
    input="What is machine learning?"
)

print("Assistant:", response1.output_text)

# Continue the conversation
response2 = openai_client.responses.create(
    model="gpt-4.1",
    instructions="You are a helpful AI assistant that explains technology concepts clearly.",
    input="Can you give me an example?",
    previous_response_id=response1.id
)

print("Assistant:", response2.output_text)

In practice, an implementation is likely to be built as a loop in which the user interactively enters a message in response to each reply received from the model.

# Track responses
last_response_id = None

# Loop until the user wants to quit
print("Assistant: Enter a prompt (or type 'quit' to exit)")
while True:
    input_text = input('\nYou: ')
    if input_text.lower() == "quit":
        print("Assistant: Goodbye!")
        break

    # Get a response
    response = openai_client.responses.create(
        model="gpt-4.1",  # Your model deployment name
        instructions="You are a helpful AI assistant that explains technology concepts clearly.",
        input=input_text,
        previous_response_id=last_response_id
    )
    assistant_text = response.output_text
    print("\nAssistant:", assistant_text)
    last_response_id = response.id

The output from this example is similar to the following:

Assistant: Enter a prompt (or type 'quit' to exit)

You: What is machine learning?

Assistant: Machine learning is a type of artificial intelligence (AI) that enables computers to learn from data and improve their performance over time without being explicitly programmed. It involves training algorithms on large datasets to recognize patterns, make predictions, or take actions based on those patterns. This allows machines to become more accurate and efficient in their tasks as they are exposed to more data.

You: Can you give me an example?

Assistant: Certainly! Let's look at a simple example of supervised learning—predicting house prices based on features like size, location, and number of rooms.
Imagine you want to build a machine learning model that can predict the price of a house based on various factors.
...
    { the example provided in the model response may be extensive}
...

You: quit

Assistant: Goodbye!

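Each time the user enters new input, the data sent to the model includes the instructions system message, the user's input, and the previous responses received from the model. In this way, the new input is grounded in the context provided by the responses the model generated for earlier inputs.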
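
The response-ID bookkeeping in the loop above can be wrapped in a small helper so that callers only deal with text. This is a sketch under stated assumptions: ChatSession and the offline stand-in client are hypothetical names introduced here, and the only SDK call used is responses.create with the parameters already shown.

```python
from types import SimpleNamespace


class ChatSession:
    """Tracks the last response ID so each turn continues the conversation."""

    def __init__(self, client, model, instructions):
        self.client = client
        self.model = model
        self.instructions = instructions
        self.last_response_id = None  # None starts a new conversation

    def send(self, text):
        response = self.client.responses.create(
            model=self.model,
            instructions=self.instructions,
            input=text,
            previous_response_id=self.last_response_id,
        )
        self.last_response_id = response.id
        return response.output_text


class _FakeResponses:
    """Offline stand-in that records calls; not part of any SDK."""

    def __init__(self):
        self.calls = []

    def create(self, **kwargs):
        self.calls.append(kwargs)
        n = len(self.calls)
        return SimpleNamespace(id=f"resp_{n}", output_text=f"reply {n}")


fake_client = SimpleNamespace(responses=_FakeResponses())
session = ChatSession(fake_client, "gpt-4.1", "You are a helpful AI assistant.")
first = session.send("What is machine learning?")
second = session.send("Can you give me an example?")

# The second call carries the ID returned by the first one.
print(fake_client.responses.calls[1]["previous_response_id"])
```

With a real client, you would pass openai_client in place of the stand-in.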

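Alternative: Manual conversation chaining

You can manage conversations manually by building the message history yourself. This approach gives you more control over which context is included.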

try:
    # Start with initial message
    conversation_history = [
        {
            "type": "message",
            "role": "user",
            "content": "What is machine learning?"
        }
    ]
    
    # First response
    response1 = openai_client.responses.create(
        model="gpt-4.1",
        input=conversation_history
    )
    
    print("Assistant:", response1.output_text)
    
    # Add assistant response to history
    conversation_history += response1.output
    
    # Add new user message
    conversation_history.append({
        "type": "message",
        "role": "user", 
        "content": "Can you give me an example?"
    })
    
    # Second response with full history
    response2 = openai_client.responses.create(
        model="gpt-4.1",
        input=conversation_history
    )
    
    print("Assistant:", response2.output_text)

except Exception as ex:
    print(f"Error: {ex}")

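This manual approach is useful when you need to:

Customize which messages are included in the context
Implement conversation pruning to manage token limits
Store and restore conversation history from a database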
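
Two of the tasks listed above, pruning and persistence, can be sketched with plain Python. The helper names are hypothetical, a JSON file stands in for a database, and the history is assumed to contain only plain dictionaries; SDK output objects appended from a response would need to be converted to dictionaries first.

```python
import json
import os
import tempfile


def trim_history(history, max_messages):
    """Keep only the most recent max_messages entries."""
    if max_messages <= 0:
        return []
    return history[-max_messages:]


def save_history(history, path):
    """Write the history to a JSON file (a stand-in for a database)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(history, f, ensure_ascii=False, indent=2)


def load_history(path):
    """Read a previously saved history back into a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


history = [
    {"type": "message", "role": "user", "content": "What is machine learning?"},
    {"type": "message", "role": "assistant", "content": "It lets computers learn from data."},
    {"type": "message", "role": "user", "content": "Can you give me an example?"},
]

trimmed = trim_history(history, max_messages=2)

path = os.path.join(tempfile.gettempdir(), "conversation_history.json")
save_history(trimmed, path)
restored = load_history(path)
print(len(restored))
```

Trimming by message count is the simplest policy. Dropping the oldest messages loses their context, so a real application might keep a summary of earlier turns instead.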

Retrieve a specific previous response

The Responses API maintains response history, so you can retrieve previous responses.

try:   
   
    # Retrieve a previous response
    response_id = "resp_67cb61fa3a448190bcf2c42d96f0d1a8"  # Example ID
    previous_response = openai_client.responses.retrieve(response_id)
    
    print(f"Previous response: {previous_response.output_text}")

except Exception as ex:
    print(f"Error: {ex}")

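Context window considerations

The previous_response_id parameter chains responses together to maintain conversation context across multiple API calls.

Keep in mind that maintaining conversation history can increase token usage. For a single run, the active context window can include:

System instructions (instructions, safety rules)
The current prompt
Conversation history (previous user and assistant messages)
Tool schemas (functions, OpenAPI specs, MCP tools, and so on)
Tool outputs (search results, code interpreter output, files)
Retrieved memories or documents (memory stores, RAG, file search)

All of these items are concatenated, tokenized, and sent to the model together on every request. The SDK helps you manage state, but it doesn't automatically make token usage cheaper.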
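
To get a feel for how the components above add up, the sketch below estimates tokens per component with a crude rule of thumb of about four characters per token. That ratio is an assumption for illustration and varies by model and language; authoritative counts come from response.usage. The function names and sample strings are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real counts come from response.usage."""
    return -(-len(text) // chars_per_token)  # ceiling division


def estimate_context(parts, chars_per_token=4):
    """Estimate tokens for each context component plus the total."""
    per_part = {
        name: estimate_tokens(text, chars_per_token)
        for name, text in parts.items()
    }
    per_part["total"] = sum(per_part.values())
    return per_part


context_parts = {
    "instructions": "You are a helpful AI assistant.",
    "history": "User: What is machine learning?\n"
               "Assistant: It lets computers learn from data.",
    "prompt": "Can you give me an example?",
}

estimate = estimate_context(context_parts)
print(estimate)
```

Because the history component grows with every turn, its share of the total rises over a long conversation even when each new prompt is short.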

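Create a responsive chat app

A model's response can take some time to generate, depending on factors such as the specific model in use, the size of the context window, and the size of the prompt. Users can become frustrated if the app appears to "freeze" while waiting for a response, so it's important to consider app responsiveness in your implementation.

Streaming responses

For long responses, you can use streaming to receive output incrementally, so users can see the partial response as output becomes available.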

stream = openai_client.responses.create(
    model="gpt-4.1",
    input="Write a short story about a robot learning to paint.",
    stream=True
)

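for event in stream:
    # Print only the incremental text; other event types carry metadata
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)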

If you're tracking conversation history while streaming, you can get the response ID when the stream ends, like this:

stream = openai_client.responses.create(
    model="gpt-4.1",
    input="Write a short story about a robot learning to paint.",
    stream=True
)
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="")
    elif event.type == "response.completed":
        response_id = event.response.id
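
The event handling above can be factored into a reusable function that returns both the full text and the response ID. This is a minimal sketch: consume_stream is a hypothetical helper, and the hand-built events stand in for a real stream so the snippet runs offline. It relies only on the two event types shown above.

```python
from types import SimpleNamespace


def consume_stream(stream, on_text=None):
    """Read every event and return (full_text, response_id)."""
    chunks = []
    response_id = None
    for event in stream:
        if event.type == "response.output_text.delta":
            chunks.append(event.delta)
            if on_text is not None:
                on_text(event.delta)  # e.g. update the UI as text arrives
        elif event.type == "response.completed":
            response_id = event.response.id
    return "".join(chunks), response_id


# Hand-built events for offline illustration only
fake_stream = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="Hello, "),
    SimpleNamespace(type="response.output_text.delta", delta="world"),
    SimpleNamespace(
        type="response.completed",
        response=SimpleNamespace(id="resp_example"),
    ),
]

text, response_id = consume_stream(fake_stream)
print(text, response_id)
```

The returned response_id can then be passed as previous_response_id on the next turn.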

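Async usage

For high-performance applications, you can use the async client to make non-blocking API calls. Async usage is suited to long-running requests or to handling multiple requests concurrently without blocking the application. To use it, import AsyncOpenAI instead of OpenAI, and use await with each API call.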

import asyncio
from openai import AsyncOpenAI

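# token_provider is assumed to be defined earlier in your code;
# it supplies the credential used to authenticate requests
client = AsyncOpenAI(
    base_url="https://<resource-name>.openai.azure.com/openai/v1/",
    api_key=token_provider,
)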

async def main():
    response = await client.responses.create(
        model="gpt-4.1",
        input="Explain quantum computing briefly."
    )
    print(response.output_text)

asyncio.run(main())

Async streaming works in a similar way:

async def stream_response():
    stream = await client.responses.create(
        model="gpt-4.1",
        input="Write a haiku about coding.",
        stream=True
    )
    
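    async for event in stream:
        # Print only the incremental text; other event types carry metadata
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)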

asyncio.run(stream_response())
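
The async client's main benefit is handling several requests concurrently, which can be sketched with asyncio.gather. The helper ask_many and the offline stand-in client are hypothetical names introduced for this example. With a real AsyncOpenAI client, the same function would issue the requests in parallel.

```python
import asyncio
from types import SimpleNamespace


async def ask_many(client, model, prompts):
    """Send all prompts concurrently and return the texts in prompt order."""
    tasks = [client.responses.create(model=model, input=p) for p in prompts]
    responses = await asyncio.gather(*tasks)  # results keep the input order
    return [r.output_text for r in responses]


class _FakeAsyncResponses:
    """Offline stand-in for client.responses; not part of any SDK."""

    async def create(self, model, input):
        await asyncio.sleep(0.01)  # simulate network latency
        return SimpleNamespace(output_text=f"answer to: {input}")


fake_async_client = SimpleNamespace(responses=_FakeAsyncResponses())
answers = asyncio.run(
    ask_many(fake_async_client, "gpt-4.1", ["What is AI?", "What is ML?"])
)
print(answers)
```

If any single request raises an exception, asyncio.gather propagates it by default. Passing return_exceptions=True is one way to collect failures alongside successes.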

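By using the Responses API through the Microsoft Foundry SDK, you can build sophisticated conversational AI applications that maintain context, support multiple model types, and provide a responsive user experience.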
