This article provides a comprehensive guide to deploying a large language model (LLM) application with PAI-EAS (Platform for AI — Elastic Algorithm Service). A deployed service can be used through a WebUI or through API calls. Once deployed, the LLM application can also be integrated with the LangChain framework to incorporate an enterprise knowledge base, as described later in this article. PAI-EAS additionally offers inference acceleration engines such as BladeLLM and vLLM to achieve high concurrency and low latency.

The growing popularity of large models such as GPT has made deploying LLM applications a sought-after service. Numerous open-source large models are available, each excelling in different domains. PAI-EAS simplifies the deployment of these models, making it possible to launch an inference service in under five minutes.
To deploy an EAS service, follow these steps:

1. Click Deploy Service and configure key parameters, such as the service name (e.g., llm_demo001), the deployment method (Deploy Web App by Using Image), and the image (select one from the PAI platform image list).


2. Set the run command to launch the desired large model, for example, Qwen/Qwen-7B-Chat for the 7B-parameter Qwen chat model.
3. Select an appropriate resource group type and configuration; a GPU instance type such as ml.gu7i.c16m60.1-gu30 is recommended for optimal performance.

4. Deploy the service and wait for the model deployment to complete.
Creating an inference service takes some time, so let's move on to the more interesting part and start using the service itself.
Once the service is deployed, use the WebUI to verify model inference. Enter a query such as "Please provide a financial learning plan" and click Send to interact with the model.


Switching Open-Source Models: PAI-EAS makes it easy to switch between models such as Tongyi Qianwen (Qwen), Llama 2, and ChatGLM by updating the run command and instance specification accordingly.
Integrating Business Data with LangChain: LangChain is a framework that combines LLMs with external data. Select LangChain on the WebUI page to integrate your business data, then follow the prompts to upload and vectorize your knowledge base files.
Improving Inference Concurrency and Latency: Add the --backend=vllm parameter to the run command for higher concurrency and lower latency. This is compatible with certain models, such as Qwen, Llama 2, and Baichuan-13B.
Mounting Custom Models: To deploy a custom model, mount it with OSS: upload the model and configuration files to your OSS bucket, then update the service with the corresponding OSS path and run-command parameters.
Obtaining the Service Access Address and Token: Open the service details page from the PAI-EAS model online service page to retrieve the service Token and access address.
HTTP Calls: Use standard HTTP or SSE methods to call the service by sending string or structured requests with the service Token and access address. You can use Python's requests package for this purpose.
WebSocket Calls: WebSocket can keep the connection open across multiple rounds of conversation. Replace http in the service access address with ws and control streaming output with the use_stream_chat parameter.
This guide outlines the entire process of deploying and using an LLM application with PAI-EAS, addressing common issues and providing solutions for seamless integration and improved performance. Whether you’re using the WebUI interface or calling APIs, this guide ensures a smooth deployment and utilization of LLM applications.
LangChain first processes the user's natural-language input data and stores it locally as a knowledge base for the large model. At inference time, each user input is first matched against the local knowledge base for similar entries; the matched knowledge base content and the user input are then fed into the large model together, so it can generate answers customized to the local knowledge base.
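To make that flow concrete, here is a minimal, self-contained sketch of the retrieve-then-generate pattern. It uses simple string similarity as a stand-in for the embedding-based vector search the real knowledge base performs, and retrieve and build_prompt are hypothetical helpers, not PAI-EAS or LangChain APIs:

from difflib import SequenceMatcher

# A simplified local "knowledge base"; the real one stores vectorized documents.
knowledge_base = [
    "EasyCV is an all-in-one computer-vision toolbox based on PyTorch.",
    "PAI-EAS supports the vLLM backend via the --backend=vllm run parameter.",
]

def retrieve(query, top_n=1):
    # Rank knowledge-base entries by rough string similarity to the query;
    # the actual implementation compares embedding vectors instead.
    return sorted(knowledge_base,
                  key=lambda doc: SequenceMatcher(None, query, doc).ratio(),
                  reverse=True)[:top_n]

def build_prompt(query):
    # Prepend the retrieved context so the model can ground its answer in it.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is EasyCV and how can it be used?"))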

For example, upload a README.md file, then click Knowledge Base File Vectorization in the lower-left corner. The following result indicates that the custom data is loaded.

In the input box at the bottom of the WebUI page, enter questions related to business data for a conversation.
For example, enter “What is EasyCV, and how can it be used? Use English to answer my question.” in the input box and click Send. The following figure shows the returned result.

With one click, the PAI-EAS inference acceleration engines BladeLLM and vLLM help you achieve high concurrency and low latency.


To deploy a custom model, you can use OSS to mount it. The procedure is as follows:
The sample model files to be prepared are as follows:

The files must include a config.json configuration file, set up in accordance with the Huggingface or ModelScope model format. For details of the example file, see config.json.
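As an illustration, the model directory could be uploaded to OSS with the oss2 Python SDK (you can also use the ossutil command-line tool). The sketch below is an assumption for illustration only; the credentials, endpoint, bucket name, and paths are all placeholders:

# Hypothetical sketch: upload a local model directory (weights + config.json)
# to an OSS bucket using the oss2 SDK. All identifiers are placeholders.
import os
import oss2

auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<your-bucket>")

local_dir = "./my-model"            # local directory with the model files
remote_prefix = "models/my-model"   # OSS path later mounted by the EAS service

for name in os.listdir(local_dir):
    bucket.put_object_from_file(remote_prefix + "/" + name,
                                os.path.join(local_dir, name))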

Parameter | Description
--- | ---
Model configuration | Click Enter model configuration to configure the model.
Run command | Add the corresponding parameters to the run command. For more information about how to configure run commands for different models, see Mastering Generative AI: Run Llama 2 Models on Alibaba Cloud's PAI.

The client uses the standard HTTP format. When you call the curl command, you can send either of the following two types of requests:
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
Where:
- $authorization: replace with the service Token.
- $host: replace with the service access address.
- chatllm_data.txt: a plain-text file containing the prompt.
curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
Use the chatllm_data.json file to set inference parameters. The content format of the chatllm_data.json file is as follows:
{
    "max_new_tokens": 4096,
    "use_stream_chat": false,
    "prompt": "How to install it?",
    "system_prompt": "Act like you are a programmer with 5+ years of experience.",
    "history": [
        [
            "Can you tell me what's the bladellm?",
            "BladeLLM is a framework for LLM serving, integrated with acceleration techniques like quantization, AI compilation, etc., and supporting popular LLMs like OPT, Bloom, LLaMA, etc."
        ]
    ],
    "temperature": 0.8,
    "top_k": 10,
    "top_p": 0.8,
    "do_sample": true,
    "use_cache": true
}
The parameters are described as follows; add or remove them as appropriate.
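(The descriptions below follow the standard text-generation conventions for these parameters.)

Parameter | Description
--- | ---
prompt | The user input for the current round of dialogue.
system_prompt | A system-level instruction that sets the model's role or behavior.
history | The dialogue history, as a list of [question, answer] pairs.
max_new_tokens | The maximum number of tokens to generate.
use_stream_chat | Whether to return the output as a stream.
temperature | Controls sampling randomness; lower values give more deterministic output.
top_k | Sample only from the k most likely next tokens.
top_p | Sample only from the smallest set of tokens whose cumulative probability exceeds p.
do_sample | Whether to use sampling instead of greedy decoding.
use_cache | Whether to reuse the key-value cache during generation.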

You can also implement your own client based on the Python requests package. The sample code is as follows:
import argparse
import json
from typing import List, Tuple
import requests
def post_http_request(prompt: str,
system_prompt: str,
history: list,
host: str,
authorization: str,
max_new_tokens: int = 2048,
temperature: float = 0.95,
top_k: int = 1,
top_p: float = 0.8,
langchain: bool = False,
use_stream_chat: bool = False) -> requests.Response:
headers = {
"User-Agent": "Test Client",
"Authorization": f"{authorization}"
}
if not history:
history = [
(
"San Francisco is a",
"city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
)
]
pload = {
"prompt": prompt,
"system_prompt": system_prompt,
"top_k": top_k,
"top_p": top_p,
"temperature": temperature,
"max_new_tokens": max_new_tokens,
"use_stream_chat": use_stream_chat,
"history": history
}
    if langchain:
        pload["langchain"] = langchain
response = requests.post(host, headers=headers,
json=pload, stream=use_stream_chat)
return response
def get_response(response: requests.Response) -> Tuple[str, List]:
data = json.loads(response.content)
output = data["response"]
history = data["history"]
return output, history
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--top-k", type=int, default=4)
parser.add_argument("--top-p", type=float, default=0.8)
parser.add_argument("--max-new-tokens", type=int, default=2048)
parser.add_argument("--temperature", type=float, default=0.95)
parser.add_argument("--prompt", type=str, default="How can I get there?")
parser.add_argument("--langchain", action="store_true")
args = parser.parse_args()
prompt = args.prompt
top_k = args.top_k
top_p = args.top_p
use_stream_chat = False
temperature = args.temperature
langchain = args.langchain
max_new_tokens = args.max_new_tokens
host = "EAS service public network address"
authorization = "EAS service public network Token"
print(f"Prompt: {prompt!r}\n", flush=True)
    # You can set the system prompt for the language model within the request.
    system_prompt = "Act like you are a programmer with \
5+ years of experience."
    # Client requests can also carry the conversation history. The client
    # maintains the dialogue records of the current user to support multi-turn
    # dialogue; typically, the history returned by the previous round is reused.
    # The format of history is List[Tuple(str, str)].
history = []
response = post_http_request(
prompt, system_prompt, history,
host, authorization,
max_new_tokens, temperature, top_k, top_p,
langchain=langchain, use_stream_chat=use_stream_chat)
output, history = get_response(response)
print(f" --- output: {output} \n --- history: {history}", flush=True)
# The server returns the results in JSON format, which includes inference results and dialogue history.
In the returned JSON, response holds the generated text and history holds the updated conversation history, which you can pass in the next request.
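For example, a minimal multi-turn exchange can reuse the functions above, feeding each round's returned history into the next request (the prompts here are illustrative):

# Hypothetical multi-turn usage of the client functions defined above.
output, history = get_response(post_http_request(
    "What is EAS?", system_prompt, [], host, authorization))
output, history = get_response(post_http_request(
    "Can you elaborate on that?", system_prompt, history, host, authorization))
print(output)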
The HTTP SSE method is used for streaming calls. Other settings are the same as those for non-streaming calls. For more information, see the following code:
import argparse
import json
from typing import Iterable, List
import requests
def clear_line(n: int = 1) -> None:
LINE_UP = '\033[1A'
LINE_CLEAR = '\x1b[2K'
for _ in range(n):
print(LINE_UP, end=LINE_CLEAR, flush=True)
def post_http_request(prompt: str,
system_prompt: str,
history: list,
host: str,
authorization: str,
max_new_tokens: int = 2048,
temperature: float = 0.95,
top_k: int = 1,
top_p: float = 0.8,
langchain: bool = False,
use_stream_chat: bool = False) -> requests.Response:
headers = {
"User-Agent": "Test Client",
"Authorization": f"{authorization}"
}
if not history:
history = [
(
"San Francisco is a",
"city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
)
]
pload = {
"prompt": prompt,
"system_prompt": system_prompt,
"top_k": top_k,
"top_p": top_p,
"temperature": temperature,
"max_new_tokens": max_new_tokens,
"use_stream_chat": use_stream_chat,
"history": history
}
if langchain:
pload["langchain"] = langchain
response = requests.post(host, headers=headers,
json=pload, stream=use_stream_chat)
return response
def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
for chunk in response.iter_lines(chunk_size=8192,
decode_unicode=False,
delimiter=b"\0"):
if chunk:
data = json.loads(chunk.decode("utf-8"))
output = data["response"]
history = data["history"]
yield output, history
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--top-k", type=int, default=4)
parser.add_argument("--top-p", type=float, default=0.8)
parser.add_argument("--max-new-tokens", type=int, default=2048)
parser.add_argument("--temperature", type=float, default=0.95)
parser.add_argument("--prompt", type=str, default="How can I get there?")
parser.add_argument("--langchain", action="store_true")
args = parser.parse_args()
prompt = args.prompt
top_k = args.top_k
top_p = args.top_p
use_stream_chat = True
temperature = args.temperature
langchain = args.langchain
max_new_tokens = args.max_new_tokens
host = ""
authorization = ""
print(f"Prompt: {prompt!r}\n", flush=True)
system_prompt = "Act like you are programmer with \
5+ years of experience."
history = []
response = post_http_request(
prompt, system_prompt, history,
host, authorization,
max_new_tokens, temperature, top_k, top_p,
langchain=langchain, use_stream_chat=use_stream_chat)
for h, history in get_streaming_response(response):
print(
f" --- stream line: {h} \n --- history: {history}", flush=True)
Each streamed chunk is a JSON object containing the partial response and the history fields.
To better maintain user conversation state, you can also use WebSocket to keep the connection to the service open across one or more rounds of conversation. The code example is as follows:
import os
import time
import json
import struct
from multiprocessing import Process
import websocket
round = 5      # number of prompts sent in the multi-round test
questions = 0  # counts completed answers (<EOS> messages) received
def on_message_1(ws, message):
if message == "<EOS>":
print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
time.time(), message), flush=True)
ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
else:
print("{}".format(time.time()))
print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
time.time(), message), flush=True)
def on_message_2(ws, message):
global questions
print('pid-{} --- message received: {}'.format(os.getpid(), message))
# end the client-side streaming
if message == "<EOS>":
questions = questions + 1
        if questions == round:
ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
def on_message_3(ws, message):
print('pid-{} --- message received: {}'.format(os.getpid(), message))
# end the client-side streaming
ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
def on_error(ws, error):
print('error happened: ', str(error))
def on_close(ws, a, b):
print("### closed ###", a, b)
def on_pong(ws, pong):
print('pong:', pong)
# stream chat validation test
def on_open_1(ws):
print('Opening Websocket connection to the server ... ')
params_dict = {}
params_dict['prompt'] = """Show me a golang code example: """
params_dict['temperature'] = 0.9
params_dict['top_p'] = 0.1
params_dict['top_k'] = 30
params_dict['max_new_tokens'] = 2048
params_dict['do_sample'] = True
raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
# raw_req = f"""To open a Websocket connection to the server: """
ws.send(raw_req)
# end the client-side streaming
# multi-round query validation test
def on_open_2(ws):
global round
print('Opening Websocket connection to the server ... ')
params_dict = {"max_new_tokens": 6144}
params_dict['temperature'] = 0.9
params_dict['top_p'] = 0.1
params_dict['top_k'] = 30
params_dict['use_stream_chat'] = True
params_dict['prompt'] = "您好!"
params_dict = {
"system_prompt":
"Act like you are programmer with 5+ years of experience."
}
raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
ws.send(raw_req)
params_dict['prompt'] = "请使用Python,编写一个排序算法"
raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
ws.send(raw_req)
params_dict['prompt'] = "请转写成java语言的实现"
raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
ws.send(raw_req)
params_dict['prompt'] = "请介绍一下你自己?"
raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
ws.send(raw_req)
params_dict['prompt'] = "请总结上述对话"
raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
ws.send(raw_req)
# Langchain validation test.
def on_open_3(ws):
global round
print('Opening Websocket connection to the server ... ')
params_dict = {}
# params_dict['prompt'] = """To open a Websocket connection to the server: """
params_dict['prompt'] = """Can you tell me what's the MNN?"""
params_dict['temperature'] = 0.9
params_dict['top_p'] = 0.1
params_dict['top_k'] = 30
params_dict['max_new_tokens'] = 2048
params_dict['use_stream_chat'] = False
params_dict['langchain'] = True
raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
ws.send(raw_req)
authorization = ""
host = "ws://" + ""
def single_call(on_open_func, on_message_func, on_close_func=on_close):
ws = websocket.WebSocketApp(
host,
on_open=on_open_func,
on_message=on_message_func,
on_error=on_error,
on_pong=on_pong,
        on_close=on_close_func,
header=[
'Authorization: ' + authorization],
)
# setup ping interval to keep long connection.
ws.run_forever(ping_interval=2)
if __name__ == "__main__":
for i in range(5):
p1 = Process(target=single_call, args=(on_open_1, on_message_1))
p2 = Process(target=single_call, args=(on_open_2, on_message_2))
p3 = Process(target=single_call, args=(on_open_3, on_message_3))
p1.start()
p2.start()
p3.start()
p1.join()
p2.join()
p3.join()
Set authorization to the service Token and host to the service access address, replacing http with ws.
The article serves as an in-depth manual for deploying and integrating large language models (LLMs) using Alibaba Cloud's Platform for AI — Elastic Algorithm Service (PAI-EAS). It addresses the increasing demand for accessible, efficient deployment of sophisticated LLM applications and emphasizes how rapidly an inference service can be established.
Key points covered in the article include switching open-source models, integrating business data with LangChain, improving inference concurrency and latency with vLLM, mounting custom models from OSS, and calling the service over HTTP, SSE, and WebSocket.
The content is rich with command examples, parameter descriptions, and model-specific recommendations that enable users to navigate the complexities of LLM deployment confidently. By covering common use cases and providing solutions to typical challenges, the article ensures that readers can not only deploy LLM applications but also maximize their utility, aligning with their business objectives and technical requirements.
This comprehensive tutorial is a critical resource for data scientists, AI developers, and IT professionals aiming to leverage the power of large language models within the Alibaba Cloud ecosystem. It bridges the gap between advanced AI technologies and practical business applications, enabling organizations to innovate and extract value from cutting-edge language processing capabilities. Contact Alibaba Cloud to explore the world of generative AI and discover how it can transform your applications and business.