LLMs in PHP: integrating language models into production systems without rewriting everything
Every team I have talked to in the last two years has had the same conversation: the engineers want to add LLM features, the CTO says Python, and the platform team — who owns a PHP monolith with ten years of business logic — goes quiet. The argument is that ML tooling is Python-first, the LLM SDKs are better in Python, and the talent pool is there.
The argument is also mostly wrong, and the teams that act on it spend six months building a Python microservice that calls their PHP monolith for business logic over HTTP, introducing a network boundary, two deployment pipelines, and a latency budget they did not plan for.
This is what integrating LLMs into a PHP production system actually looks like — not a demo, but a system that has been running under real traffic.
The PHP LLM landscape
The PHP ecosystem has three credible options for LLM integration:
Direct HTTP to the API. Every major LLM provider (OpenAI, Anthropic, Mistral) exposes a REST API. A PHP HTTP client and a JSON decoder are all you technically need. I have used this for simple completions in systems where adding a dependency was harder than writing 40 lines of wrapper code.
LLPhant. The most complete PHP library for production LLM work. It wraps OpenAI and Anthropic, handles streaming, implements RAG (retrieval-augmented generation) patterns, and supports function calling. It is the option I reach for now in new PHP projects.
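Symfony AI integration. The Symfony project maintains first-party AI components under the symfony/ai initiative, developed and versioned separately from the framework rather than shipped inside a specific Symfony release. If you are on Symfony, this is increasingly the right answer: it has proper dependency injection, integrates with the event system, and respects the framework's conventions.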
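The direct-HTTP option can be sketched roughly as follows. This is a minimal sketch, assuming OpenAI's chat completions endpoint and its documented response shape; the function names (`buildChatRequest`, `extractReply`, `sendChatRequest`) are placeholders chosen for this example, and the model name is whatever the caller passes in.

```php
<?php
/** Builds the URL, headers, and JSON body for one chat completion request. */
function buildChatRequest(string $apiKey, string $model, string $system, string $user): array
{
    return [
        'url' => 'https://api.openai.com/v1/chat/completions',
        'headers' => [
            'Authorization: Bearer ' . $apiKey,
            'Content-Type: application/json',
        ],
        'body' => json_encode([
            'model' => $model,
            'messages' => [
                ['role' => 'system', 'content' => $system],
                ['role' => 'user', 'content' => $user],
            ],
        ], JSON_THROW_ON_ERROR),
    ];
}

/** Extracts the assistant text from a raw JSON response; throws if the shape is unexpected. */
function extractReply(string $rawJson): string
{
    $data = json_decode($rawJson, true, 512, JSON_THROW_ON_ERROR);
    $content = $data['choices'][0]['message']['content'] ?? null;
    if (!is_string($content)) {
        throw new RuntimeException('Unexpected response shape from the API');
    }
    return $content;
}

/** Sends the request with ext-curl, using explicit timeouts so a slow API cannot hang the worker. */
function sendChatRequest(array $request, int $timeoutSeconds = 30): string
{
    $ch = curl_init($request['url']);
    curl_setopt_array($ch, [
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => $request['headers'],
        CURLOPT_POSTFIELDS => $request['body'],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT => $timeoutSeconds,
    ]);
    $raw = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    $error = curl_error($ch);

    if ($raw === false || $status >= 400) {
        throw new RuntimeException("LLM request failed (HTTP {$status}): {$error}");
    }
    return extractReply($raw);
}
```

Keeping request building and response parsing separate from the network call means both can be unit-tested without touching the API.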
The benchmark for "production-ready" in LLM integration is: does it handle streaming correctly, does it support function calling, does it let you inject observability, and does it fail gracefully when the API returns a 500. LLPhant clears all four.
What I got wrong on the first deployment
Our first LLM integration was a customer support triage system. The model read incoming tickets and classified them by urgency and department. The PHP code was clean. The deployment was a disaster.
We did not account for API latency in our queue worker timeout. The LLM call averaged 3.2 seconds, and the queue worker's timeout was 30 seconds. Under burst load, with many tickets in flight at once, individual calls ran far past that average and hit the timeout. Each timed-out job was retried, so we were billed twice for the same ticket, and the two attempts sometimes returned different classifications, which broke downstream routing logic.
// What we had:
class TicketTriageJob implements ShouldQueue
{
    public $timeout = 30; // default; did not think about LLM latency

    public function handle(LLMClient $client): void
    {
        $classification = $client->classify($this->ticket->body);
        $this->ticket->update(['department' => $classification->department]);
    }
}
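// What we needed:
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\DB;

class TicketTriageJob implements ShouldQueue, ShouldBeUnique
{
    public $timeout = 120;    // LLM call + processing overhead; keep it below the connection's retry_after
    public $tries = 1;        // never retry: LLM calls are not idempotent
    public $uniqueFor = 3600; // prevent duplicate processing; only honored with ShouldBeUnique

    // Scope the uniqueness lock to one ticket rather than to the whole job class
    public function uniqueId(): string
    {
        return (string) $this->ticket->id;
    }

    public function handle(LLMClient $client): void
    {
        if ($this->ticket->fresh()->triaged_at !== null) {
            return; // already processed by a previous attempt
        }

        $classification = $client->classify($this->ticket->body);

        DB::transaction(function () use ($classification) {
            $this->ticket->update([
                'department' => $classification->department,
                'priority' => $classification->priority,
                'triaged_at' => now(),
            ]);
        });
    }
}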
The non-idempotency of LLM calls is the thing teams consistently underestimate. The model is not guaranteed to return the same output for the same input, so re-running a classification after a partial failure is not safe if downstream systems have already acted on the first result.
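One way to make a retried call safe is to persist the first result under a stable key and reuse it on later attempts. The sketch below is illustrative only: `FirstResultStore` is a hypothetical class, and its in-memory array stands in for a database table or cache with a unique key on the ticket ID.

```php
<?php
/**
 * Remembers the first result produced for each key, so a retried job reuses it
 * instead of calling the model again and possibly getting a different answer.
 */
final class FirstResultStore
{
    /** @var array<string, array> */
    private array $results = [];

    /** Returns the stored result for $key, calling $produce only if none exists yet. */
    public function remember(string $key, callable $produce): array
    {
        if (!array_key_exists($key, $this->results)) {
            $this->results[$key] = $produce();
        }
        return $this->results[$key];
    }
}

// A stand-in "model" that answers differently on every call.
$calls = 0;
$fakeClassify = function () use (&$calls): array {
    $calls++;
    return ['department' => $calls === 1 ? 'billing' : 'support'];
};

$store = new FirstResultStore();
$first = $store->remember('ticket:42', $fakeClassify);
$retry = $store->remember('ticket:42', $fakeClassify); // simulated retry of the same job
```

Here the retry sees the same classification as the first attempt and the model is called once. Across multiple workers the check-then-write would need to be atomic, for example an insert guarded by a unique constraint.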
RAG in production: the index is the product
Retrieval-augmented generation is where PHP LLM integrations get interesting and where the gap with Python narrows to nearly nothing. The heavy work (chunking documents, generating embeddings, populating the vector store) happens at indexing time, not at query time. By query time, you are making two HTTP calls (one to embed the question, one to the chat model) and a database query for the similarity search.
use LLPhant\Embeddings\EmbeddingGenerator\OpenAI\OpenAI3LargeEmbeddingGenerator;
use LLPhant\Embeddings\VectorStores\Doctrine\DoctrineVectorStore;

// $entityManager, $splitter, $documents, and $llm are configured elsewhere.
// The named arguments below are illustrative; check them against your installed LLPhant version.

// Indexing (run once, or on content update)
$generator = new OpenAI3LargeEmbeddingGenerator();
$vectorStore = new DoctrineVectorStore($entityManager, DocumentChunk::class);

foreach ($documents as $doc) {
    $chunks = $splitter->splitDocument($doc, chunkSize: 512, overlap: 64);
    foreach ($chunks as $chunk) {
        $chunk->embedding = $generator->embedText($chunk->content);
    }
    $vectorStore->addDocuments($chunks);
}

// Query time (per user request)
$query = $request->input('question');
$embedding = $generator->embedText($query);

// pgvector cosine similarity: single query, < 20ms on indexed data
$relevant = $vectorStore->similaritySearch($embedding, maxResults: 5, minScore: 0.78);
$context = implode("\n\n", array_map(fn($c) => $c->content, $relevant));

$answer = $llm->chat([
    ['role' => 'system', 'content' => "Answer using only the provided context.\n\n{$context}"],
    ['role' => 'user', 'content' => $query],
]);
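To make the retrieval step concrete, here is a plain-PHP sketch of what a cosine similarity search with a minimum score computes. In the setup above the database does this work; pgvector's `<=>` operator returns cosine distance, which is one minus the similarity calculated here. The function names and the two-dimensional toy vectors are made up for illustration.

```php
<?php
/** Cosine similarity of two equal-length vectors; 0.0 if either vector has zero magnitude. */
function cosineSimilarity(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same dimension');
    }
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }
    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Returns up to $maxResults chunks scoring at least $minScore, best match first.
 * Each chunk is ['content' => string, 'embedding' => float[]].
 */
function topMatches(array $queryEmbedding, array $chunks, int $maxResults, float $minScore): array
{
    $scored = [];
    foreach ($chunks as $chunk) {
        $score = cosineSimilarity($queryEmbedding, $chunk['embedding']);
        if ($score >= $minScore) {
            $scored[] = ['content' => $chunk['content'], 'score' => $score];
        }
    }
    usort($scored, fn(array $x, array $y): int => $y['score'] <=> $x['score']);
    return array_slice($scored, 0, $maxResults);
}

// Toy data: real embeddings have hundreds or thousands of dimensions.
$chunks = [
    ['content' => 'B', 'embedding' => [0.8, 0.6]],
    ['content' => 'C', 'embedding' => [0.0, 1.0]],
    ['content' => 'A', 'embedding' => [1.0, 0.0]],
];
$matches = topMatches([1.0, 0.0], $chunks, 5, 0.78);
```

With these vectors, chunk C is orthogonal to the query and falls below the threshold, so only A and B come back, in that order.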
The 0.78 similarity threshold is not a default — it is tuned. Too low and you retrieve irrelevant context that confuses the model. Too high and you retrieve nothing. We ran 200 sample queries against held-out answers and measured recall at different thresholds before shipping. 0.78 was the point where recall was stable and hallucinations dropped to an acceptable rate.
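A threshold sweep of this kind can be sketched as below. This is not the evaluation harness described above, only an assumed shape for one: each query is a list of candidate chunks with a similarity score and a hand-labelled relevance flag, and the toy numbers are invented for illustration. It measures retrieval recall and precision, which are proxies for answer quality rather than a direct hallucination count.

```php
<?php
/**
 * For one threshold, computes recall (share of relevant chunks retrieved) and
 * precision (share of retrieved chunks that are relevant) across all queries.
 * Each query is a list of candidates: ['score' => float, 'relevant' => bool].
 */
function evaluateThreshold(array $queries, float $threshold): array
{
    $relevantTotal = 0;
    $relevantRetrieved = 0;
    $retrievedTotal = 0;

    foreach ($queries as $candidates) {
        foreach ($candidates as $candidate) {
            $retrieved = $candidate['score'] >= $threshold;
            if ($candidate['relevant']) {
                $relevantTotal++;
                if ($retrieved) {
                    $relevantRetrieved++;
                }
            }
            if ($retrieved) {
                $retrievedTotal++;
            }
        }
    }

    return [
        'recall' => $relevantTotal > 0 ? $relevantRetrieved / $relevantTotal : 0.0,
        'precision' => $retrievedTotal > 0 ? $relevantRetrieved / $retrievedTotal : 0.0,
    ];
}

// Toy labelled data (made up): two queries, three candidate chunks each.
$queries = [
    [
        ['score' => 0.91, 'relevant' => true],
        ['score' => 0.80, 'relevant' => true],
        ['score' => 0.70, 'relevant' => false],
    ],
    [
        ['score' => 0.85, 'relevant' => true],
        ['score' => 0.76, 'relevant' => false],
        ['score' => 0.60, 'relevant' => false],
    ],
];

foreach ([0.65, 0.78, 0.90] as $threshold) {
    $result = evaluateThreshold($queries, $threshold);
    printf("threshold %.2f  recall %.2f  precision %.2f\n", $threshold, $result['recall'], $result['precision']);
}
```

On this toy set the low threshold keeps full recall but admits irrelevant chunks, and the high threshold keeps precision but drops relevant ones. A real sweep would use a finer grid and a labelled set large enough for the curve to be stable.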
Function calling: where PHP fits better than expected
Function calling — the model deciding to invoke a tool and returning structured arguments — is the core mechanism that makes LLM agents practical. PHP is well-suited for this because the "tools" are typically existing domain logic: fetch a customer, check an order status, run a calculation. You already have that code.
// Tool, $llm, and the response methods used here are an illustrative client API, not a specific library.
$tools = [
    Tool::create('get_order_status')
        ->description('Returns the current status and ETA for a given order ID')
        ->parameter('order_id', 'string', 'The order UUID', required: true),
    Tool::create('calculate_refund')
        ->description('Calculates eligible refund amount based on order ID and reason')
        ->parameter('order_id', 'string', 'The order UUID', required: true)
        ->parameter('reason', 'string', 'cancellation | defect | not_received', required: true),
];

$response = $llm->chat($messages, tools: $tools);
$rounds = 0;

// The model may return a tool call rather than text
while ($response->hasToolCalls()) {
    if (++$rounds > 5) {
        throw new RuntimeException('Too many tool-call rounds'); // guard against an endless loop
    }

    // Provider APIs expect the assistant message that requested the tools to precede the results.
    $messages[] = $response->toMessage(); // hypothetical accessor on the illustrative client

    foreach ($response->toolCalls() as $call) {
        $result = match ($call->name) {
            'get_order_status' => $orderService->getStatus($call->arguments['order_id']),
            'calculate_refund' => $refundCalculator->calculate(
                $call->arguments['order_id'],
                $call->arguments['reason']
            ),
            default => throw new UnknownToolException($call->name),
        };
        // Feed the tool result back into the conversation
        $messages[] = ['role' => 'tool', 'tool_call_id' => $call->id, 'content' => json_encode($result)];
    }

    $response = $llm->chat($messages, tools: $tools);
}
The while loop handles multi-step tool use: the model may call get_order_status, decide it needs to call calculate_refund, and only then produce a final answer. In practice, most production agents run 1–3 tool calls per conversation turn. More than that and latency becomes the dominant user experience problem.
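Because the arguments in a tool call are produced by the model, it is worth validating them before they reach domain code. The following is a minimal sketch for the `calculate_refund` tool; the function names are hypothetical, and returning validation errors to the model as the tool result is one possible design, not the only one.

```php
<?php
/**
 * Validates model-supplied arguments for the calculate_refund tool before any
 * domain code runs. Returns a list of problems; an empty list means valid.
 */
function validateRefundArguments(array $arguments): array
{
    $errors = [];
    $allowedReasons = ['cancellation', 'defect', 'not_received'];

    $orderId = $arguments['order_id'] ?? null;
    if (!is_string($orderId) || $orderId === '') {
        $errors[] = 'order_id must be a non-empty string';
    }

    $reason = $arguments['reason'] ?? null;
    if (!is_string($reason) || !in_array($reason, $allowedReasons, true)) {
        $errors[] = 'reason must be one of: ' . implode(', ', $allowedReasons);
    }

    return $errors;
}

/** Builds the JSON content for the tool message: the result if valid, the errors otherwise. */
function refundToolContent(array $arguments, callable $calculate): string
{
    $errors = validateRefundArguments($arguments);
    if ($errors !== []) {
        // The model sees the errors and can correct its arguments on the next round.
        return json_encode(['error' => $errors], JSON_THROW_ON_ERROR);
    }
    return json_encode($calculate($arguments['order_id'], $arguments['reason']), JSON_THROW_ON_ERROR);
}
```

In the loop above, the string returned by `refundToolContent` would take the place of `json_encode($result)` for that tool, with `$calculate` wrapping the existing refund calculator.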
Observability you actually need
The three metrics I track for every LLM integration:
Token usage by endpoint. LLM costs scale with tokens, not requests. A single endpoint that passes a 10,000-token system prompt on every call will dominate your API bill within days.
// After every LLM call
$this->metrics->increment('llm.tokens.prompt', $response->usage()->promptTokens);
$this->metrics->increment('llm.tokens.completion', $response->usage()->completionTokens);
$this->metrics->timing('llm.latency_ms', $response->latencyMs());
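Token counts turn into money with simple arithmetic, which makes it easy to rank endpoints by spend. The sketch below assumes per-million-token pricing; the function names and record shape are hypothetical, and the prices in the usage lines are placeholders rather than any provider's real rates.

```php
<?php
/** Estimates the cost of one call in dollars from its token counts and per-million-token prices. */
function estimateCostUsd(int $promptTokens, int $completionTokens, float $promptPricePerM, float $completionPricePerM): float
{
    return ($promptTokens * $promptPricePerM + $completionTokens * $completionPricePerM) / 1000000;
}

/**
 * Sums estimated cost per endpoint, most expensive first.
 * Each record is ['endpoint' => string, 'prompt' => int, 'completion' => int].
 */
function costByEndpoint(array $records, float $promptPricePerM, float $completionPricePerM): array
{
    $totals = [];
    foreach ($records as $record) {
        $cost = estimateCostUsd($record['prompt'], $record['completion'], $promptPricePerM, $completionPricePerM);
        $totals[$record['endpoint']] = ($totals[$record['endpoint']] ?? 0.0) + $cost;
    }
    arsort($totals); // highest spend first, keys preserved
    return $totals;
}

// Placeholder prices: 2.50 per million prompt tokens, 10.00 per million completion tokens.
$perCall = estimateCostUsd(10000, 0, 2.50, 10.00);
$totals = costByEndpoint([
    ['endpoint' => 'search', 'prompt' => 500, 'completion' => 100],
    ['endpoint' => 'triage', 'prompt' => 10000, 'completion' => 200],
    ['endpoint' => 'triage', 'prompt' => 10000, 'completion' => 150],
], 2.50, 10.00);
```

At those placeholder prices a 10,000-token system prompt costs 0.025 per call before the model writes a single token, which is 2,500 per 100,000 calls.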
Parse failure rate for structured outputs. For classifications, function calls, and JSON extraction, the model will occasionally return malformed output. Track how often parsing fails. If the rate climbs above 2%, your prompt is degrading, the model was silently updated, or the input distribution shifted.
Queue depth before and after. If you are doing LLM work in background jobs, queue depth is the leading indicator of whether your worker count is keeping pace with request volume.
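The parse failure rate from the second metric can be tracked with very little code. This sketch assumes the model is asked to reply with a JSON object containing `department` and `priority` strings; `parseClassification` and `ParseFailureTracker` are hypothetical names, and in production the counters would be emitted to your metrics system rather than held in memory.

```php
<?php
/**
 * Tries to parse a model reply as a classification. Returns the decoded array,
 * or null when the reply is not valid JSON or lacks the expected string fields.
 */
function parseClassification(string $reply): ?array
{
    $data = json_decode($reply, true);
    if (!is_array($data)) {
        return null; // invalid JSON or a bare scalar
    }
    foreach (['department', 'priority'] as $field) {
        if (!isset($data[$field]) || !is_string($data[$field])) {
            return null;
        }
    }
    return $data;
}

/** Counts parse outcomes and reports whether the failure rate exceeds a limit. */
final class ParseFailureTracker
{
    private int $total = 0;
    private int $failures = 0;

    public function record(bool $parsedOk): void
    {
        $this->total++;
        if (!$parsedOk) {
            $this->failures++;
        }
    }

    public function failureRate(): float
    {
        return $this->total === 0 ? 0.0 : $this->failures / $this->total;
    }

    public function exceeds(float $limit): bool
    {
        return $this->failureRate() > $limit;
    }
}

$tracker = new ParseFailureTracker();
$replies = [
    '{"department":"billing","priority":"high"}',
    'Sure! The department is billing.',          // prose instead of JSON
    '{"department":"support"}',                  // missing field
    '{"department":"support","priority":"low"}',
];
foreach ($replies as $reply) {
    $tracker->record(parseClassification($reply) !== null);
}
$shouldAlert = $tracker->exceeds(0.02); // the 2% limit mentioned above
```

Two of the four sample replies fail to parse, so the tracker reports a rate far above the limit. Real traffic would be measured over a rolling window so that one bad minute does not page anyone.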
The rewrite question, answered honestly
Is Python better for LLM work? For pure ML research, training, and fine-tuning — yes, unambiguously. For building LLM-augmented features on top of an existing PHP system: the gap is smaller than the migration cost in almost every case I have evaluated.
The question to ask is not "which language is better for LLMs" but "where does the business logic live that the LLM needs to act on?" If it is in a PHP system with ten years of domain modelling, you are not going to replicate that in six months in a new Python service. You will end up with a thin Python wrapper calling your PHP API, and you will have paid full price for the rewrite without gaining anything that LLPhant could not have done from within PHP.