Dynamic Context Tuning: Smarter Chatbot Context Resolution Without the LLM Overhead
The Problem With "That Product"

Multi-turn conversations are natural for humans but surprisingly tricky for chatbots. When a user asks "What's the warranty on that product?" after discussing a specific item, they expect the chatbot to know what "that product" refers to. This linguistic phenomenon is called anaphora, and resolving it correctly is crucial for natural conversation flow.
The traditional solution is to use an LLM to rewrite the query: send the conversation history and ask GPT-4 or similar to expand "that product" into the actual product name. This works but comes with downsides: 50-200ms latency per query, non-deterministic outputs, and additional API costs.
Dynamic Context Tuning (DCT) solves this problem using embeddings instead of LLM calls.
DCT is an embedding-based approach that maintains a session-scoped entity cache with semantic embeddings. Instead of asking an LLM "what does the user mean by 'that product'?", DCT:

1. Detects anaphoric references ("that product", "it", "the first one") with lightweight pattern matching.
2. Embeds the user query with a sentence-embedding model.
3. Scores every cached entity by combining semantic similarity with recency.
4. Substitutes the best-scoring entity into the query if its score clears a threshold.
The result: context resolution in tens of milliseconds (the scoring itself is sub-millisecond; most of the time goes to embedding the query), with predictable behavior and no API costs.
At the heart of DCT is a weighted scoring formula that balances semantic relevance with conversational recency:
Score(entity, query) = α × cosine_similarity(e, q) + (1-α) × recency(e)
Where:

- `cosine_similarity(e, q)` is the cosine similarity between the cached entity embedding `e` and the query embedding `q`
- `α` is the similarity weight (0.7 in the examples below)
- `recency(e) = 1 / (1 + age_in_minutes)`, an inverse-decay term that fades as the mention gets older

A practical example:
```
Cache contains:
1. "Premium Support Package" (mentioned 2 minutes ago)
2. "Basic Starter Plan" (mentioned 5 minutes ago)

User asks: "What's included in that package?"

For "Premium Support Package":
  - similarity = 0.72
  - recency = 1/(1+2) = 0.33
  - score = 0.7 × 0.72 + 0.3 × 0.33 = 0.603

For "Basic Starter Plan":
  - similarity = 0.58
  - recency = 1/(1+5) = 0.17
  - score = 0.7 × 0.58 + 0.3 × 0.17 = 0.457

Winner: "Premium Support Package" (0.603 > threshold 0.5)
```
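To sanity-check the arithmetic, here is a minimal sketch of the scoring function with the example's similarity values hard-coded (in practice they come from cosine similarity between real embeddings):

```python
ALPHA = 0.7  # similarity weight used in the example

def score(similarity: float, age_minutes: float) -> float:
    """Weighted combination of semantic similarity and inverse-decay recency."""
    recency = 1.0 / (1.0 + age_minutes)
    return ALPHA * similarity + (1.0 - ALPHA) * recency

print(score(0.72, 2))  # ≈ 0.604 ("Premium Support Package"; 0.603 above uses recency rounded to 0.33)
print(score(0.58, 5))  # ≈ 0.456 ("Basic Starter Plan"; 0.457 above uses recency rounded to 0.17)
```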
Here's a core entity cache implementation using Python and SentenceTransformers:
```python
from dataclasses import dataclass, field
import re  # used by the pattern-matching and extraction methods further below
import time

import numpy as np
from sentence_transformers import SentenceTransformer


@dataclass
class CachedEntity:
    """Entity that was mentioned in the conversation."""
    name: str
    embedding: np.ndarray
    timestamp: float = field(default_factory=time.time)
    entity_type: str = "product"

    def recency_score(self) -> float:
        """Recency score using inverse decay: 1 / (1 + age in minutes)."""
        age_minutes = (time.time() - self.timestamp) / 60.0
        return 1.0 / (1.0 + age_minutes)


class EntityCache:
    """DCT-based entity cache for resolving anaphoric references."""

    def __init__(
        self,
        max_size: int = 15,
        model_name: str = "paraphrase-multilingual-MiniLM-L12-v2",
        similarity_weight: float = 0.7,
    ):
        self.max_size = max_size
        self.similarity_weight = similarity_weight
        self.recency_weight = 1.0 - similarity_weight
        self.encoder = SentenceTransformer(model_name)
        self._entities: list[CachedEntity] = []
```
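Later snippets call `self.add_entity`, which isn't shown here. A minimal sketch of what it might look like, assuming name-based deduplication and oldest-first eviction once `max_size` is exceeded:

```python
def add_entity(self, name: str, entity_type: str = "product") -> None:
    """Embed an entity name and store it in the session cache (illustrative sketch)."""
    # If the entity is already cached, just refresh its timestamp
    for entity in self._entities:
        if entity.name.lower() == name.lower():
            entity.timestamp = time.time()
            return

    embedding = self.encoder.encode(name, convert_to_numpy=True)
    self._entities.append(
        CachedEntity(name=name, embedding=embedding, entity_type=entity_type)
    )

    # Evict the oldest entry once the cache grows past max_size
    if len(self._entities) > self.max_size:
        self._entities.sort(key=lambda e: e.timestamp)
        self._entities.pop(0)
```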
The reference resolution method scores all cached entities against the user query:
```python
def resolve_reference(self, query: str, threshold: float = 0.5) -> str | None:
    """Resolve anaphoric reference in query to actual entity name."""
    if not self._contains_reference(query) or not self._entities:
        return None

    query_embedding = self.encoder.encode(query, convert_to_numpy=True)
    best_entity, best_score = None, -1.0

    for entity in self._entities:
        similarity = self._cosine_similarity(query_embedding, entity.embedding)
        recency = entity.recency_score()
        score = self.similarity_weight * similarity + self.recency_weight * recency

        if score > best_score:
            best_score = score
            best_entity = entity

    if best_entity and best_score >= threshold:
        return best_entity.name
    return None
```
Pattern matching detects references - these patterns can be customized for your domain and languages:
```python
self._reference_patterns = [
    r"\b(this|that|the)\s+(product|item|service|package)\b",
    r"\b(the)\s+(first|second|third|last|previous)\s+(one)?\b",
    r"\bit\b(?!\s+is\s+not)",  # "it" but not "it is not"
]
```
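`resolve_reference` also relies on two small helpers that aren't shown in the article. The names come from the call sites above; everything else in this sketch is an assumption:

```python
def _contains_reference(self, query: str) -> bool:
    """True if the query matches any of the reference patterns."""
    return any(
        re.search(pattern, query, re.IGNORECASE)
        for pattern in self._reference_patterns
    )

@staticmethod
def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.dot(a, b)) / denom if denom else 0.0
```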
DCT can automatically extract entities from chatbot responses to populate the cache:
```python
def extract_and_cache_entities(self, text: str) -> None:
    """Extract entity names from text and add to cache."""
    # Domain-specific patterns - customize for your use case
    patterns = [
        r"Product:\s*([^\n,]+)",     # "Product: Name"
        r"Service:\s*([^\n,]+)",     # "Service: Name"
        r"\d+\.\s*\*\*([^*]+)\*\*",  # Markdown bold in lists
        r"\"([A-Z][^\"]{4,})\"",     # Quoted proper nouns
    ]

    for pattern in patterns:
        matches = re.findall(pattern, text, re.IGNORECASE)
        for match in matches:
            entity_name = match.strip()
            if len(entity_name) >= 5:
                self.add_entity(name=entity_name, entity_type="product")
```
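Putting the pieces together, a session could look roughly like this (the entity name is illustrative):

```python
cache = EntityCache()

# A previous chatbot answer mentioned a product; cache it
cache.extract_and_cache_entities("Product: Premium Support Package")

# The user follows up with an anaphoric reference
resolved = cache.resolve_reference("What's included in that package?")
print(resolved)  # "Premium Support Package" (if the score clears the threshold)
```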
The performance gains are significant:
| Metric | DCT | Local LLM | Cloud LLM API |
|---|---|---|---|
| Latency | ~15-60ms | 50-200ms | 50-100ms |
| Accuracy | ~73-85% | ~90% | ~95% |
| API Cost | $0 | $0 | ~$0.09/1K queries |
| Predictable | Yes* | No | No |
*DCT is predictable but not fully deterministic: the recency component depends on time, so scores change as entities age. However, unlike LLMs, there's no stochastic sampling - given the same cache state at the same moment, results are reproducible.
The trade-off is clear: DCT sacrifices some accuracy for large latency improvements and predictable behavior. For most chatbot use cases where the entity cache is well-populated, 73-85% accuracy is sufficient, and the per-query latency savings compound across multi-turn conversations.
Here's how DCT fits into a typical RAG pipeline:
```
User Query: "What's the warranty on that product?"
        ↓
[1] DCT Entity Cache Lookup (~15-60ms)
    └─ Detect: "that product" pattern found
    └─ Score entities, find best match
        ↓
Enriched: "What's the warranty on Premium Support Package?"
        ↓
[2] RAG Pipeline (Vector Search + LLM)
    └─ Embed query with embedding model
    └─ Search vector database
    └─ Generate response with LLM
        ↓
Response to User
        ↓
[3] Entity Extraction & Caching
    └─ Extract entity names from response
    └─ Add to entity cache for future resolution
```
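In code, the wiring could look something like the sketch below; `rag_pipeline` is a placeholder for whatever retrieval-plus-generation stack you already have, and appending the resolved entity is just one possible enrichment strategy:

```python
from typing import Callable

def handle_turn(
    user_query: str,
    cache: EntityCache,
    rag_pipeline: Callable[[str], str],  # placeholder for your retrieval + generation stack
) -> str:
    """One conversation turn with DCT wrapped around a RAG pipeline (sketch)."""
    # [1] Resolve anaphoric references against the session cache
    entity = cache.resolve_reference(user_query)
    enriched_query = f"{user_query} (referring to: {entity})" if entity else user_query

    # [2] Run the enriched query through the existing RAG pipeline
    response = rag_pipeline(enriched_query)

    # [3] Cache entities mentioned in the response for future turns
    cache.extract_and_cache_entities(response)
    return response
```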
When implementing DCT, several parameters significantly influence behavior:
Embedding Model: The model choice determines speed, accuracy, and language support. Multilingual models like paraphrase-multilingual-MiniLM-L12-v2 are more versatile, while specialized English models often provide better accuracy.
Cache Size: How many entities are stored per session. Too small leads to missed references, too large increases scoring time and may match irrelevant old entities.
Similarity Threshold: The minimum score required to accept a match (typically 0.4-0.6). Too low produces false positives, too high misses legitimate references.
Similarity Weight (α): The balance between semantic similarity and recency. A higher α (e.g., 0.8) favors semantically matching entities, while a lower value (e.g., 0.5) weights recently mentioned entities more heavily.
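As a concrete starting point, a latency-sensitive English-only deployment might be configured like this (the model name and values are illustrative, not recommendations):

```python
cache = EntityCache(
    max_size=15,                    # entities kept per session
    model_name="all-MiniLM-L6-v2",  # compact English-only encoder
    similarity_weight=0.8,          # favor semantic match over recency
)
resolved = cache.resolve_reference(
    "How much does that service cost?",
    threshold=0.55,                 # stricter acceptance to reduce false positives
)
```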
DCT works best when:

- responses mention concrete, cacheable entities (product names, plan names, document titles)
- latency and predictability matter more than the last few points of accuracy
- you want to avoid per-query LLM API costs

Consider LLM-based enrichment when:

- you need the highest possible resolution accuracy (the ~90-95% range in the table above)
- references are more complex than the simple anaphora patterns DCT detects
- the entity cache is sparsely populated because conversations rarely name concrete entities
Dynamic Context Tuning provides a lightweight alternative to LLM-based query enrichment. By leveraging embeddings and a simple scoring formula, you can achieve significantly faster context resolution than LLM-based approaches while maintaining reasonable accuracy. The bulk of the time is spent encoding the query (~15-50ms) - the actual cache scoring over all entities takes less than a millisecond. The approach is particularly valuable for production chatbots where latency and predictability matter.
The underlying research paper provides additional depth on the theoretical foundations and evaluation benchmarks.