1. Configure your inference deployment
Select or enter the inference engine container image, tag, and container port.
- healthcheck: HTTP health check endpoint used to verify that the inference engine is ready to accept requests (see the sketch after this list). Default is None.
- command and arguments: Entrypoint command and arguments to run when the container starts. Default is the entrypoint specified in the container image.
- autoscaling: Set the minimum and maximum scale (replica count) for your deployment. We scale your deployment up and down to match traffic based on the max concurrency you set. Max concurrency is the maximum number of in-flight requests per replica. Default is infinity.
- environment variables: Pass any additional environment variables to the container, e.g., HF_TOKEN.
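As an illustration of how these settings fit together, here is a minimal sketch of an inference container built with FastAPI. The /health route, /v1/completions path, port 8080, and HF_TOKEN variable are assumptions chosen for the example, not values required by the CentML Platform.

```python
import os

import uvicorn
from fastapi import FastAPI

app = FastAPI()

# Secrets passed in via the deployment's environment variables (e.g., HF_TOKEN).
HF_TOKEN = os.environ.get("HF_TOKEN")


@app.get("/health")
def health() -> dict:
    """HTTP health check route polled to confirm the engine is ready."""
    return {"status": "ok"}


@app.post("/v1/completions")
def completions(payload: dict) -> dict:
    """Placeholder inference route; a real engine would run the model here."""
    return {"echo": payload}


if __name__ == "__main__":
    # Must match the container port entered in the deployment configuration.
    uvicorn.run(app, host="0.0.0.0", port=8080)
```

If you supply a custom command and arguments, they override the image entrypoint; the health check path and container port you configure should match the route and port the server actually listens on.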
2. Select the cluster and hardware to deploy
By default, CentML Platform provides several managed clusters and GPU instances on which to deploy your inference containers.
3. Monitor your deployment
Once deployed, you can see all your deployments in the listing view along with their current status.

Once the deployment is ready, it is accessible at https://<endpoint_url>.
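To exercise the endpoint, a small client script is enough. The sketch below assumes the routes exposed by the example container above (/health and /v1/completions); substitute your deployment's actual endpoint URL and paths.

```python
import requests

# Replace with the endpoint URL shown for your deployment in the listing view.
ENDPOINT = "https://<endpoint_url>"

# Check the health route to confirm the deployment is ready.
health = requests.get(f"{ENDPOINT}/health", timeout=10)
health.raise_for_status()
print("health:", health.json())

# Send an inference request once the deployment is healthy.
response = requests.post(
    f"{ENDPOINT}/v1/completions",
    json={"prompt": "Hello, world"},
    timeout=60,
)
response.raise_for_status()
print("response:", response.json())
```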
What’s Next
LLM Serving
Explore dedicated public and private endpoints for production model deployments.
Clients
Learn how to interact with the CentML platform programmatically.
Resources and Pricing
Learn more about the CentML platform’s pricing.
Private Inference Endpoints
Learn how to create private inference endpoints.
Submit a Support Request
Agents on CentML
Learn how agents can interact with CentML services.