Dynamic Context Tuning: Smarter Chatbot Context Resolution Without the LLM Overhead



The Problem With "That Product"

Multi-turn conversations are natural for humans but surprisingly tricky for chatbots. When a user asks "What's the warranty on that product?" after discussing a specific item, they expect the chatbot to know what "that product" refers to. This linguistic phenomenon is called anaphora, and resolving it correctly is crucial for natural conversation flow.

The traditional solution is to use an LLM to rewrite the query: send the conversation history and ask GPT-4 or similar to expand "that product" into the actual product name. This works but comes with downsides: 50-200ms latency per query, non-deterministic outputs, and additional API costs.

Dynamic Context Tuning (DCT) solves this problem using embeddings instead of LLM calls.

What is Dynamic Context Tuning?

DCT is an embedding-based approach that maintains a session-scoped entity cache with semantic embeddings. Instead of asking an LLM "what does the user mean by 'that product'?", DCT:

  1. Tracks entities (product names, services, locations, etc.) mentioned in the conversation
  2. Detects anaphoric references using pattern matching
  3. Uses semantic similarity to find the most likely referent
  4. Considers recency, so recent mentions score higher

The result: context resolution in tens of milliseconds end to end (the cache scoring itself takes well under a millisecond; most of the time goes to encoding the query), with predictable behavior and no per-query API costs.

The DCT Scoring Formula

At the heart of DCT is a weighted scoring formula that balances semantic relevance with conversational recency:

Score(entity, query) = α × cosine_similarity(e, q) + (1-α) × recency(e)

Where:

  • α = 0.7 (70% weight on semantic similarity)
  • cosine_similarity = cosine similarity between the query embedding and the entity embedding
  • recency = 1 / (1 + age_in_minutes) (reciprocal decay: 1.0 for a just-mentioned entity, falling toward 0 as it ages)

A practical example:

Cache contains:
1. "Premium Support Package" (mentioned 2 minutes ago)
2. "Basic Starter Plan" (mentioned 5 minutes ago)

User asks: "What's included in that package?"

For "Premium Support Package":
  - similarity = 0.72
  - recency = 1/(1+2) = 0.33
  - score = 0.7 × 0.72 + 0.3 × 0.33 = 0.603

For "Basic Starter Plan":
  - similarity = 0.58
  - recency = 1/(1+5) = 0.17
  - score = 0.7 × 0.58 + 0.3 × 0.17 = 0.457

Winner: "Premium Support Package" (0.603 > threshold 0.5)
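
For reference, this arithmetic is easy to reproduce in code. The snippet below is a standalone sketch of the scoring formula; the similarity values are the ones assumed in the example above, not computed from real embeddings:

ALPHA = 0.7  # weight on semantic similarity, as in the formula above

def dct_score(similarity: float, age_minutes: float, alpha: float = ALPHA) -> float:
    """Weighted DCT score: alpha * similarity + (1 - alpha) * recency."""
    recency = 1.0 / (1.0 + age_minutes)
    return alpha * similarity + (1 - alpha) * recency

print(round(dct_score(0.72, 2), 3))  # Premium Support Package -> ~0.60
print(round(dct_score(0.58, 5), 3))  # Basic Starter Plan      -> ~0.46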

Implementation Deep Dive

Here's a core entity cache implementation using Python and SentenceTransformers:

from dataclasses import dataclass, field
import re
import time
import numpy as np
from sentence_transformers import SentenceTransformer

@dataclass
class CachedEntity:
    """Entity that was mentioned in the conversation."""
    name: str
    embedding: np.ndarray
    timestamp: float = field(default_factory=time.time)
    entity_type: str = "product"

    def recency_score(self) -> float:
        """Recency score using reciprocal decay: 1 / (1 + age in minutes)."""
        age_minutes = (time.time() - self.timestamp) / 60.0
        return 1.0 / (1.0 + age_minutes)


class EntityCache:
    """DCT-based entity cache for resolving anaphoric references."""

    def __init__(
        self,
        max_size: int = 15,
        model_name: str = "paraphrase-multilingual-MiniLM-L12-v2",
        similarity_weight: float = 0.7,
    ):
        self.max_size = max_size
        self.similarity_weight = similarity_weight
        self.recency_weight = 1.0 - similarity_weight
        self.encoder = SentenceTransformer(model_name)
        self._entities: list[CachedEntity] = []

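The snippet above only shows the constructor. Entities enter the cache through an add_entity method; the version below is a minimal sketch (the deduplication and eviction policy are assumptions, not something DCT prescribes):

    def add_entity(self, name: str, entity_type: str = "product") -> None:
        """Embed an entity name and store it, evicting the oldest entry when the cache is full."""
        # If the entity is already cached, just refresh its timestamp
        for entity in self._entities:
            if entity.name.lower() == name.lower():
                entity.timestamp = time.time()
                return
        embedding = self.encoder.encode(name, convert_to_numpy=True)
        self._entities.append(CachedEntity(name=name, embedding=embedding, entity_type=entity_type))
        if len(self._entities) > self.max_size:
            self._entities.pop(0)  # drop the oldest mention
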
The reference resolution method scores all cached entities against the user query:

def resolve_reference(self, query: str, threshold: float = 0.5) -> str | None:
    """Resolve anaphoric reference in query to actual entity name."""
    if not self._contains_reference(query) or not self._entities:
        return None

    query_embedding = self.encoder.encode(query, convert_to_numpy=True)
    best_entity, best_score = None, -1.0

    for entity in self._entities:
        similarity = self._cosine_similarity(query_embedding, entity.embedding)
        recency = entity.recency_score()
        score = self.similarity_weight * similarity + self.recency_weight * recency

        if score > best_score:
            best_score = score
            best_entity = entity

    if best_entity and best_score >= threshold:
        return best_entity.name
    return None

Pattern matching detects references - these patterns can be customized for your domain and languages:

self._reference_patterns = [
    r"\b(this|that|the)\s+(product|item|service|package)\b",
    r"\b(the)\s+(first|second|third|last|previous)\s+(one)?\b",
    r"\bit\b(?!\s+is\s+not)",  # "it" but not "it is not"
]
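
resolve_reference also relies on two small helpers that aren't shown above. A plausible sketch, assuming the pattern list from the previous snippet is assigned in __init__:

    def _contains_reference(self, query: str) -> bool:
        """Return True if the query matches any anaphoric reference pattern."""
        return any(re.search(p, query, re.IGNORECASE) for p in self._reference_patterns)

    @staticmethod
    def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors."""
        denom = float(np.linalg.norm(a) * np.linalg.norm(b))
        return float(np.dot(a, b) / denom) if denom else 0.0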

Automatic Entity Extraction

DCT can automatically extract entities from chatbot responses to populate the cache:

def extract_and_cache_entities(self, text: str) -> None:
    """Extract entity names from text and add to cache."""
    # Domain-specific patterns - customize for your use case
    patterns = [
        r"Product:\s*([^\n,]+)",      # "Product: Name"
        r"Service:\s*([^\n,]+)",      # "Service: Name"
        r"\d+\.\s*\*\*([^*]+)\*\*",   # Markdown bold in lists
        r"\"([A-Z][^\"]{4,})\"",      # Quoted proper nouns
    ]

    for pattern in patterns:
        matches = re.findall(pattern, text, re.IGNORECASE)
        for match in matches:
            entity_name = match.strip()
            if len(entity_name) >= 5:
                self.add_entity(name=entity_name, entity_type="product")

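Put together, a single conversational turn might look like the following usage sketch (the bot response and query are illustrative, and the naive string substitution at the end is only for demonstration):

cache = EntityCache()

# Turn 1: the bot's answer mentions a product, which gets extracted and cached
cache.extract_and_cache_entities("Product: Premium Support Package\nIncludes 24/7 phone support.")

# Turn 2: the follow-up contains an anaphoric reference ("that package")
query = "What's included in that package?"
resolved = cache.resolve_reference(query)
if resolved:
    query = query.replace("that package", resolved)  # naive substitution, for illustration only

print(query)  # expected: "What's included in Premium Support Package?" (if the score clears the threshold)
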
Performance Comparison

The performance gains are significant:

Metric        | DCT       | Local LLM | Cloud LLM API
--------------|-----------|-----------|------------------
Latency       | ~15-60ms  | 50-200ms  | 50-100ms
Accuracy      | ~73-85%   | ~90%      | ~95%
API Cost      | $0        | $0        | ~$0.09/1K queries
Predictable   | Yes*      | No        | No

*DCT is predictable but not fully deterministic: the recency component depends on time, so scores change as entities age. However, unlike LLMs, there's no stochastic sampling - given the same cache state at the same moment, results are reproducible.

The trade-off is clear: DCT sacrifices some accuracy for large latency improvements and predictable behavior. For most chatbot use cases where the entity cache is well populated, 73-85% accuracy is sufficient, and the per-query latency savings compound across multi-turn conversations.

Integration Architecture

Here's how DCT fits into a typical RAG pipeline:

User Query: "What's the warranty on that product?"
        ↓
[1] DCT Entity Cache Lookup (~15-60ms)
    └─ Detect: "that product" pattern found
    └─ Score entities, find best match
        ↓
Enriched: "What's the warranty on Premium Support Package?"
        ↓
[2] RAG Pipeline (Vector Search + LLM)
    └─ Embed query with embedding model
    └─ Search vector database
    └─ Generate response with LLM
        ↓
Response to User
        ↓
[3] Entity Extraction & Caching
    └─ Extract entity names from response
    └─ Add to entity cache for future resolution
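
In code, this is a thin wrapper around whatever RAG entry point already exists. The sketch below assumes a hypothetical rag_pipeline_answer function standing in for your retrieval-plus-generation call:

def answer_with_dct(cache: EntityCache, user_query: str) -> str:
    """Resolve anaphoric references before retrieval, then cache entities from the answer."""
    # [1] Try to enrich the query from the entity cache
    resolved = cache.resolve_reference(user_query)
    enriched_query = f"{user_query} (referring to: {resolved})" if resolved else user_query

    # [2] Run the normal RAG pipeline on the (possibly) enriched query
    response = rag_pipeline_answer(enriched_query)  # hypothetical retrieval + LLM call

    # [3] Feed the response back into the cache for future turns
    cache.extract_and_cache_entities(response)
    return response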

Key Configuration Parameters

When implementing DCT, several parameters significantly influence behavior:

Embedding Model: The model choice determines speed, accuracy, and language support. Multilingual models like paraphrase-multilingual-MiniLM-L12-v2 are more versatile, while specialized English models often provide better accuracy.

Cache Size: How many entities are stored per session. Too small leads to missed references, too large increases scoring time and may match irrelevant old entities.

Similarity Threshold: The minimum score required to accept a match (typically 0.4-0.6). Too low produces false positives, too high misses legitimate references.

Similarity Weight (α): The balance between semantic similarity and recency. A higher α (e.g., 0.8) favors semantically matching entities, while a lower value (e.g., 0.5) weights recently mentioned entities more heavily.
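
These parameters map directly onto the constructor and the resolve_reference call shown earlier. A starting configuration might look like this (the values are illustrative, not tuned recommendations):

cache = EntityCache(
    max_size=15,                                          # entities kept per session
    model_name="paraphrase-multilingual-MiniLM-L12-v2",   # or an English-only model for higher accuracy
    similarity_weight=0.7,                                 # alpha: 70% similarity, 30% recency
)
match = cache.resolve_reference("What about that service?", threshold=0.5)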

When to Use DCT

DCT works best when:

  • You have a bounded set of entity types (products, services, locations)
  • Multi-turn conversations are common
  • Latency is critical for user experience
  • You want predictable, debuggable behavior

Consider LLM-based enrichment when:

  • Queries involve complex reasoning beyond simple reference resolution
  • You need very high accuracy (>90%)
  • Entity types are unbounded or unpredictable

Conclusion

Dynamic Context Tuning provides a lightweight alternative to LLM-based query enrichment. By leveraging embeddings and a simple scoring formula, you can achieve significantly faster context resolution than LLM-based approaches while maintaining reasonable accuracy. The bulk of the time is spent encoding the query (~15-50ms) - the actual cache scoring over all entities takes less than a millisecond. The approach is particularly valuable for production chatbots where latency and predictability matter.

The underlying research paper provides additional depth on the theoretical foundations and evaluation benchmarks.
