Gaudi Backend for Text Generation Inference

Overview

Text Generation Inference (TGI) has been optimized to run on Gaudi hardware via the Gaudi backend for TGI.

Supported Hardware

Tutorial: Getting Started with TGI on Gaudi

Basic Usage

The easiest way to run TGI on Gaudi is to use the official Docker image:

model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
hf_token=YOUR_HF_ACCESS_TOKEN

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    ghcr.io/huggingface/text-generation-inference:3.1.1-gaudi \
    --model-id $model

Once you see the Connected log line, the server is ready to accept requests:

2024-05-22T19:31:48.302239Z INFO text_generation_router: router/src/main.rs:378: Connected
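If you prefer to check readiness programmatically, TGI also exposes a /health endpoint that returns HTTP 200 once the model is loaded. A minimal polling loop (the 5-second interval is arbitrary) looks like this:

until curl --silent --fail 127.0.0.1:8080/health > /dev/null; do
    echo "Waiting for TGI to become ready..."
    sleep 5
done
echo "TGI is ready"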

You can find your Hugging Face access token (YOUR_HF_ACCESS_TOKEN) at https://huggingface.co./settings/tokens. A token is required to access gated models such as Llama 3.1.

Making Your First Request

You can send a request from a separate terminal:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
    -H 'Content-Type: application/json'
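TGI also exposes an OpenAI-compatible Messages API at /v1/chat/completions, which is often more convenient for instruct/chat models. A minimal request looks like this (the model field is required by the API schema; TGI accepts the placeholder value "tgi"):

curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}],"max_tokens":32,"stream":false}' \
    -H 'Content-Type: application/json'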

How-to Guides

How to Run Specific Models

The following models have been validated on Gaudi2:

| Model | Model ID |
| --- | --- |
| Llama2-7B | meta-llama/Llama-2-7b-chat-hf |
| Llama2-70B | meta-llama/Llama-2-70b-chat-hf |
| Llama3-8B | meta-llama/Meta-Llama-3.1-8B-Instruct |
| Llama3-70B | meta-llama/Meta-Llama-3-70B-Instruct |
| Llama3.1-8B | meta-llama/Meta-Llama-3.1-8B-Instruct |
| Llama3.1-70B | meta-llama/Meta-Llama-3.1-70B-Instruct |
| CodeLlama-13B | codellama/CodeLlama-13b-hf |
| Mixtral-8x7B | mistralai/Mixtral-8x7B-Instruct-v0.1 |
| Mistral-7B | mistralai/Mistral-7B-Instruct-v0.3 |
| Falcon-180B | tiiuae/falcon-180B-chat |
| Qwen2-72B | Qwen/Qwen2-72B-Instruct |
| Starcoder2-3b | bigcode/starcoder2-3b |
| Starcoder2-15b | bigcode/starcoder2-15b |
| Starcoder | bigcode/starcoder |
| Gemma-7b | google/gemma-7b-it |
| Llava-v1.6-Mistral-7B | llava-hf/llava-v1.6-mistral-7b-hf |

To run any of these models:

model=MODEL_ID_THAT_YOU_WANT_TO_RUN
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
hf_token=YOUR_ACCESS_TOKEN

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    ghcr.io/huggingface/text-generation-inference:3.1.1-gaudi \
    --model-id $model \
    <text-generation-inference-launcher-arguments>

For the full list of service parameters, refer to the launcher-arguments page.
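You can also print the full argument list directly from the image by passing --help, assuming the image entrypoint is the TGI launcher, as with the main TGI images:

docker run --rm ghcr.io/huggingface/text-generation-inference:3.1.1-gaudi --help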

The validated docker commands can be found in the examples/docker_commands folder.

Note: --runtime=habana --cap-add=sys_nice --ipc=host is required to enable docker to use the Gaudi hardware (more details here).

How to Enable Multi-Card Inference (Sharding)

TGI-Gaudi supports sharding for multi-card inference, allowing you to distribute the load across multiple Gaudi cards.

For example, on a machine with 8 Gaudi cards, you can run:

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    ghcr.io/huggingface/text-generation-inference:3.1.1-gaudi \
    --model-id $model --sharded true --num-shard 8

We recommend always using sharding when running on a multi-card machine.
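Before launching a sharded run, it can be useful to confirm how many Gaudi cards are visible on the host with hl-smi. If you only want to expose a subset of cards to the container, HABANA_VISIBLE_DEVICES is the environment variable commonly used with the Habana container runtime; the 4-card configuration below is an illustrative sketch, not a validated command:

# List the Gaudi devices available on the host
hl-smi

# Illustrative: expose only cards 0-3 to the container and shard across them
docker run --runtime=habana --ipc=host --cap-add=sys_nice \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    -e HABANA_VISIBLE_DEVICES=0,1,2,3 \
    ghcr.io/huggingface/text-generation-inference:3.1.1-gaudi \
    --model-id $model --sharded true --num-shard 4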

How to Use Different Precision Formats

BF16 Precision (Default)

By default, all models run with BF16 precision on Gaudi hardware.

FP8 Precision

TGI-Gaudi supports FP8 precision inference with Intel Neural Compressor (INC).

To run FP8 Inference:

  1. Measure statistics using the Optimum Habana measurement script (a hedged sketch follows this list).
  2. Run the model in TGI with the QUANT_CONFIG setting, e.g. -e QUANT_CONFIG=./quantization_config/maxabs_quant.json.
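The sketch below illustrates step 1 using the text-generation example from Optimum Habana with a maxabs measurement config. The script name, flags, and config paths are assumptions based on the Optimum Habana examples and may differ across versions:

# Assumption: a Gaudi machine with optimum-habana installed (e.g. inside a Habana PyTorch container)
git clone https://github.com/huggingface/optimum-habana
cd optimum-habana/examples/text-generation
pip install -r requirements.txt

# Point QUANT_CONFIG at a maxabs measurement config and run a short generation
# pass so that FP8 statistics are dumped (by default under ./hqt_output)
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
    --model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --use_hpu_graphs --use_kv_cache --bf16 \
    --max_new_tokens 128 --batch_size 1
# For multi-card models such as Llama3.1-70B, the gaudi_spawn.py launcher from
# the examples folder is typically used to run the measurement across all cards

The resulting hqt_output folder is what the FP8 command below mounts into the container.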

The following example command for FP8 inference assumes that measurement has already been performed in step 1 above.

Example for Llama3.1-70B on 8 cards with FP8 precision:

model=meta-llama/Meta-Llama-3.1-70B-Instruct
hf_token=YOUR_ACCESS_TOKEN
volume=$PWD/data   # share a volume with the Docker container to avoid downloading weights every run

docker run -p 8080:80 \
   --runtime=habana \
   --cap-add=sys_nice \
   --ipc=host \
   -v $volume:/data \
   -v $PWD/quantization_config:/usr/src/quantization_config \
   -v $PWD/hqt_output:/usr/src/hqt_output \
   -e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
   -e HF_TOKEN=$hf_token \
   -e MAX_TOTAL_TOKENS=2048 \
   -e BATCH_BUCKET_SIZE=256 \
   -e PREFILL_BATCH_BUCKET_SIZE=4 \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=64 \
   ghcr.io/huggingface/text-generation-inference:3.1.1-gaudi \
   --model-id $model \
   --sharded true --num-shard 8 \
   --max-input-tokens 1024 --max-total-tokens 2048 \
   --max-batch-prefill-tokens 4096 --max-batch-size 256 \
   --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512

How to Run Vision-Language Models (VLMs)

Gaudi supports VLM inference.

Example for Llava-v1.6-Mistral-7B on 1 card:

Start the TGI server via the following command:

model=llava-hf/llava-v1.6-mistral-7b-hf
volume=$PWD/data   # share a volume with the Docker container to avoid downloading weights every run

docker run -p 8080:80 \
   --runtime=habana \
   --cap-add=sys_nice \
   --ipc=host \
   -v $volume:/data \
   -e PREFILL_BATCH_BUCKET_SIZE=1 \
   -e BATCH_BUCKET_SIZE=1 \
   ghcr.io/huggingface/text-generation-inference:3.1.1-gaudi \
   --model-id $model \
   --max-input-tokens 4096 --max-batch-prefill-tokens 16384 \
   --max-total-tokens 8192 --max-batch-size 4

You can then send a request to the server via the following command:

curl -N 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"![](https://huggingface.co./datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":32}}' \
    -H 'Content-Type: application/json'

Note: In Llava-v1.6-Mistral-7B, an image typically accounts for 2000 input tokens. For example, an image of size 512x512 is represented by 2800 tokens. Thus, max-input-tokens must be larger than the number of tokens associated with the image; otherwise the image may be truncated. The default image token value is BASE_IMAGE_TOKENS=2048, which is also the minimum value of max-input-tokens; you can override the environment variable BASE_IMAGE_TOKENS to change it. The warmup generates graphs with input lengths from BASE_IMAGE_TOKENS up to max-input-tokens. For Llava-v1.6-Mistral-7B, max-batch-prefill-tokens is set to 16384, following prefill_batch_size = max-batch-prefill-tokens / max-input-tokens = 16384 / 4096 = 4.
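Remote URLs are not the only option: the image can also be embedded inline as a base64 data URI inside the same markdown-style image tag. The snippet below assumes a local file rabbit.png and GNU base64 (the -w0 flag disables line wrapping):

# Encode a local image and embed it in the prompt as a data URI
img_b64=$(base64 -w0 rabbit.png)
curl -N 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"![](data:image/png;base64,'"$img_b64"')What is this a picture of?\n\n","parameters":{"max_new_tokens":32}}' \
    -H 'Content-Type: application/json'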

How to Benchmark Performance

We recommend using the inference-benchmarker tool to benchmark performance on Gaudi hardware.

This benchmark tool simulates user requests and measures the model's performance under realistic scenarios.

To run it on the same machine, you can do the following:

MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your HF READ token>
# run a benchmark to evaluate the performance of the model for chat use case
# we mount results to the current directory
docker run \
    --rm \
    -it \
    --net host \
    -v $(pwd):/opt/inference-benchmarker/results \
    -e "HF_TOKEN=$HF_TOKEN" \
    ghcr.io/huggingface/inference-benchmarker:latest \
    inference-benchmarker \
    --tokenizer-name "$MODEL" \
    --url http://localhost:8080 \
    --profile chat

Please refer to the inference-benchmarker README for more details.

How to Profile Performance

To collect performance profiling, you need to set the following environment variables:

| Name | Value(s) | Default | Description |
| --- | --- | --- | --- |
| PROF_WAITSTEP | integer | 0 | Control profile wait steps |
| PROF_WARMUPSTEP | integer | 0 | Control profile warmup steps |
| PROF_STEP | integer | 0 | Enable/disable profile; control profile active steps |
| PROF_PATH | string | /tmp/hpu_profile | Define profile folder |
| PROF_RANKS | string | 0 | Comma-separated list of ranks to profile |
| PROF_RECORD_SHAPES | True/False | False | Control record_shapes option in the profiler |

To use these environment variables, add them to your docker run command with the -e flag. For example:

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    -e PROF_WAITSTEP=10 \
    -e PROF_WARMUPSTEP=10 \
    -e PROF_STEP=1 \
    -e PROF_PATH=/tmp/hpu_profile \
    -e PROF_RANKS=0 \
    -e PROF_RECORD_SHAPES=True \
    ghcr.io/huggingface/text-generation-inference:3.1.1-gaudi \
    --model-id $model
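The traces are written inside the container under PROF_PATH, so you will typically also want to mount that directory to the host, for example by adding -v $PWD/hpu_profile:/tmp/hpu_profile to the command above. Assuming the traces use the standard torch.profiler TensorBoard format, they can then be inspected with the PyTorch profiler TensorBoard plugin:

# Install TensorBoard plus the PyTorch profiler plugin and point it at the traces
pip install tensorboard torch-tb-profiler
tensorboard --logdir ./hpu_profile --port 6006
# then open http://localhost:6006 and switch to the PyTorch Profiler view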

Explanation: Understanding TGI on Gaudi

The Warmup Process

To ensure optimal performance, warmup is performed at the beginning of each server run. This process creates queries with various input shapes based on provided parameters and runs basic TGI operations (prefill, decode, concatenate).

Note: Model warmup can take several minutes, especially for FP8 inference. For faster subsequent runs, refer to Disk Caching Eviction Policy.

Understanding Parameter Tuning

Sequence Length Parameters

  • --max-input-tokens is the maximum possible input prompt length. Default value is 4095.
  • --max-total-tokens is the maximum possible total length of the sequence (input and output). Default value is 4096.

Batch Size Parameters

  • For the prefill operation, set --max-batch-prefill-tokens to bs * max-input-tokens, where bs is your expected maximum prefill batch size (a worked example follows this list).
  • For the decode operation, set --max-batch-size to bs, where bs is your expected maximum decode batch size.
  • Note that batch sizes are always padded to the nearest multiple of BATCH_BUCKET_SIZE and PREFILL_BATCH_BUCKET_SIZE.
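As a worked example, suppose your expected maximum prefill batch size bs is 4 and you use the sequence-length limits shown earlier in this guide; the resulting launcher arguments would be (values are illustrative only):

# bs = 4, max-input-tokens = 1024
# --max-batch-prefill-tokens = bs * max-input-tokens = 4 * 1024 = 4096
docker run --runtime=habana --cap-add=sys_nice --ipc=host \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    ghcr.io/huggingface/text-generation-inference:3.1.1-gaudi \
    --model-id $model \
    --max-input-tokens 1024 --max-total-tokens 2048 \
    --max-batch-prefill-tokens 4096 --max-batch-size 32

Because decode batches are padded to the nearest multiple of BATCH_BUCKET_SIZE (8 by default), choosing a --max-batch-size that is already a multiple of 8 avoids padding waste.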

Performance and Memory Parameters

  • PAD_SEQUENCE_TO_MULTIPLE_OF determines the sizes of the input-length buckets. Since warmup creates several graphs for each bucket, it's important to adjust this value proportionally to the input sequence length; otherwise, out-of-memory issues can occur.
  • ENABLE_HPU_GRAPH enables the use of HPU graphs, which is crucial for performance. The recommended value is true.


Reference

This section contains reference information about the Gaudi backend.

Environment Variables

The following table contains the environment variables that can be used to configure the Gaudi backend:

| Name | Value(s) | Default | Description | Usage |
| --- | --- | --- | --- | --- |
| ENABLE_HPU_GRAPH | True/False | True | Enable HPU graphs or not | add -e in docker run command |
| LIMIT_HPU_GRAPH | True/False | True | Skip HPU graph usage for prefill to save memory; set to True for large sequence/decoding lengths (e.g. 300/212) | add -e in docker run command |
| BATCH_BUCKET_SIZE | integer | 8 | Batch size for the decode operation will be rounded to the nearest multiple of this number. This limits the number of cached graphs | add -e in docker run command |
| PREFILL_BATCH_BUCKET_SIZE | integer | 4 | Batch size for the prefill operation will be rounded to the nearest multiple of this number. This limits the number of cached graphs | add -e in docker run command |
| PAD_SEQUENCE_TO_MULTIPLE_OF | integer | 128 | For the prefill operation, sequences will be padded to a multiple of the provided value | add -e in docker run command |
| SKIP_TOKENIZER_IN_TGI | True/False | False | Skip the tokenizer for input/output processing | add -e in docker run command |
| WARMUP_ENABLED | True/False | True | Enable warmup during server initialization to recompile all graphs. This can increase TGI setup time | add -e in docker run command |
| QUEUE_THRESHOLD_MS | integer | 120 | Controls the threshold beyond which requests are considered overdue and handled with priority. Shorter requests are prioritized otherwise | add -e in docker run command |
| USE_FLASH_ATTENTION | True/False | True | Whether to enable Habana Flash Attention, provided that the model supports it. Please refer to https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html?highlight=fusedsdpa#using-fused-scaled-dot-product-attention-fusedsdpa | add -e in docker run command |
| FLASH_ATTENTION_RECOMPUTE | True/False | True | Whether to enable Habana Flash Attention in recompute mode on first-token generation | add -e in docker run command |

Contributing

Contributions to the TGI-Gaudi project are welcome. Please refer to the contributing guide.

Building the Docker Image from Source

To build the Docker image from source:

make -C backends/gaudi image

This builds the image and saves it as tgi-gaudi. You can then run TGI-Gaudi with this image:

model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data
hf_token=YOUR_ACCESS_TOKEN

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    tgi-gaudi \
    --model-id $model

For more details, see the README and the Makefile of the Gaudi backend.
