Dynamic Context Tuning: Smarter Chatbot Context Resolution Without the LLM Overhead



The Problem With "That Product"

Multi-turn conversations are natural for humans but surprisingly tricky for chatbots. When a user asks "What's the warranty on that product?" after discussing a specific item, they expect the chatbot to know what "that product" refers to. This linguistic phenomenon is called anaphora, and resolving it correctly is crucial for natural conversation flow.

The traditional solution is to use an LLM to rewrite the query: send the conversation history and ask GPT-4 or similar to expand "that product" into the actual product name. This works but comes with downsides: 50-200ms latency per query, non-deterministic outputs, and additional API costs.

Dynamic Context Tuning (DCT) solves this problem using embeddings instead of LLM calls.

What is Dynamic Context Tuning?

DCT is an embedding-based approach that maintains a session-scoped entity cache with semantic embeddings. Instead of asking an LLM "what does the user mean by 'that product'?", DCT:

  1. Tracks entities (product names, services, locations, etc.) mentioned in the conversation
  2. Detects anaphoric references using pattern matching
  3. Uses semantic similarity to find the most likely referent
  4. Considers recency, so recent mentions score higher

The result: context resolution in tens of milliseconds end to end (the cache scoring itself takes well under a millisecond; most of the time goes to encoding the query), with predictable behavior and no per-query API costs.

The DCT Scoring Formula

At the heart of DCT is a weighted scoring formula that balances semantic relevance with conversational recency:

Score(entity, query) = α × cosine_similarity(e, q) + (1-α) × recency(e)

Where:

  • α = 0.7 (70% weight on semantic similarity)
  • cosine_similarity = cosine similarity between the query embedding and the entity embedding
  • recency = 1 / (1 + age_in_minutes) (reciprocal decay: 1.0 for a just-mentioned entity, falling toward 0 as it ages)

A practical example:

Cache contains:
1. "Premium Support Package" (mentioned 2 minutes ago)
2. "Basic Starter Plan" (mentioned 5 minutes ago)

User asks: "What's included in that package?"

For "Premium Support Package":
  - similarity = 0.72
  - recency = 1/(1+2) = 0.33
  - score = 0.7 × 0.72 + 0.3 × 0.33 = 0.603

For "Basic Starter Plan":
  - similarity = 0.58
  - recency = 1/(1+5) = 0.17
  - score = 0.7 × 0.58 + 0.3 × 0.17 = 0.457

Winner: "Premium Support Package" (0.603 > threshold 0.5)
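
For reference, this arithmetic is easy to reproduce in code. The snippet below is a standalone sketch of the scoring formula; the similarity values are the ones assumed in the example above, not computed from real embeddings:

ALPHA = 0.7  # weight on semantic similarity, as in the formula above

def dct_score(similarity: float, age_minutes: float, alpha: float = ALPHA) -> float:
    """Weighted DCT score: alpha * similarity + (1 - alpha) * recency."""
    recency = 1.0 / (1.0 + age_minutes)
    return alpha * similarity + (1 - alpha) * recency

print(round(dct_score(0.72, 2), 3))  # Premium Support Package -> ~0.60
print(round(dct_score(0.58, 5), 3))  # Basic Starter Plan      -> ~0.46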

Implementation Deep Dive

Here's a core entity cache implementation using Python and SentenceTransformers:

from dataclasses import dataclass, field
import re
import time
import numpy as np
from sentence_transformers import SentenceTransformer

@dataclass
class CachedEntity:
    """Entity that was mentioned in the conversation."""
    name: str
    embedding: np.ndarray
    timestamp: float = field(default_factory=time.time)
    entity_type: str = "product"

    def recency_score(self) -> float:
        """Recency score using reciprocal decay: 1 / (1 + age in minutes)."""
        age_minutes = (time.time() - self.timestamp) / 60.0
        return 1.0 / (1.0 + age_minutes)


class EntityCache:
    """DCT-based entity cache for resolving anaphoric references."""

    def __init__(
        self,
        max_size: int = 15,
        model_name: str = "paraphrase-multilingual-MiniLM-L12-v2",
        similarity_weight: float = 0.7,
    ):
        self.max_size = max_size
        self.similarity_weight = similarity_weight
        self.recency_weight = 1.0 - similarity_weight
        self.encoder = SentenceTransformer(model_name)
        self._entities: list[CachedEntity] = []

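The snippet above only shows the constructor. Entities enter the cache through an add_entity method; the version below is a minimal sketch (the deduplication and eviction policy are assumptions, not something DCT prescribes):

    def add_entity(self, name: str, entity_type: str = "product") -> None:
        """Embed an entity name and store it, evicting the oldest entry when the cache is full."""
        # If the entity is already cached, just refresh its timestamp
        for entity in self._entities:
            if entity.name.lower() == name.lower():
                entity.timestamp = time.time()
                return
        embedding = self.encoder.encode(name, convert_to_numpy=True)
        self._entities.append(CachedEntity(name=name, embedding=embedding, entity_type=entity_type))
        if len(self._entities) > self.max_size:
            self._entities.pop(0)  # drop the oldest mention
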
The reference resolution method scores all cached entities against the user query:

def resolve_reference(self, query: str, threshold: float = 0.5) -> str | None:
    """Resolve anaphoric reference in query to actual entity name."""
    if not self._contains_reference(query) or not self._entities:
        return None

    query_embedding = self.encoder.encode(query, convert_to_numpy=True)
    best_entity, best_score = None, -1.0

    for entity in self._entities:
        similarity = self._cosine_similarity(query_embedding, entity.embedding)
        recency = entity.recency_score()
        score = self.similarity_weight * similarity + self.recency_weight * recency

        if score > best_score:
            best_score = score
            best_entity = entity

    if best_entity and best_score >= threshold:
        return best_entity.name
    return None

Pattern matching detects references - these patterns can be customized for your domain and languages:

self._reference_patterns = [
    r"\b(this|that|the)\s+(product|item|service|package)\b",
    r"\b(the)\s+(first|second|third|last|previous)\s+(one)?\b",
    r"\bit\b(?!\s+is\s+not)",  # "it" but not "it is not"
]
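
resolve_reference also relies on two small helpers that aren't shown above. A plausible sketch, assuming the pattern list from the previous snippet is assigned in __init__:

    def _contains_reference(self, query: str) -> bool:
        """Return True if the query matches any anaphoric reference pattern."""
        return any(re.search(p, query, re.IGNORECASE) for p in self._reference_patterns)

    @staticmethod
    def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors."""
        denom = float(np.linalg.norm(a) * np.linalg.norm(b))
        return float(np.dot(a, b) / denom) if denom else 0.0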

Automatic Entity Extraction

DCT can automatically extract entities from chatbot responses to populate the cache:

def extract_and_cache_entities(self, text: str) -> None:
    """Extract entity names from text and add to cache."""
    # Domain-specific patterns - customize for your use case
    patterns = [
        r"Product:\s*([^\n,]+)",      # "Product: Name"
        r"Service:\s*([^\n,]+)",      # "Service: Name"
        r"\d+\.\s*\*\*([^*]+)\*\*",   # Markdown bold in lists
        r"\"([A-Z][^\"]{4,})\"",      # Quoted proper nouns
    ]

    for pattern in patterns:
        matches = re.findall(pattern, text, re.IGNORECASE)
        for match in matches:
            entity_name = match.strip()
            if len(entity_name) >= 5:
                self.add_entity(name=entity_name, entity_type="product")

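Put together, a single conversational turn might look like the following usage sketch (the bot response and query are illustrative, and the naive string substitution at the end is only for demonstration):

cache = EntityCache()

# Turn 1: the bot's answer mentions a product, which gets extracted and cached
cache.extract_and_cache_entities("Product: Premium Support Package\nIncludes 24/7 phone support.")

# Turn 2: the follow-up contains an anaphoric reference ("that package")
query = "What's included in that package?"
resolved = cache.resolve_reference(query)
if resolved:
    query = query.replace("that package", resolved)  # naive substitution, for illustration only

print(query)  # expected: "What's included in Premium Support Package?" (if the score clears the threshold)
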
Performance Comparison

The performance gains are significant:

Metric        | DCT       | Local LLM | Cloud LLM API
--------------|-----------|-----------|------------------
Latency       | ~15-60ms  | 50-200ms  | 50-100ms
Accuracy      | ~73-85%   | ~90%      | ~95%
API Cost      | $0        | $0        | ~$0.09/1K queries
Predictable   | Yes*      | No        | No

*DCT is predictable but not fully deterministic: the recency component depends on time, so scores change as entities age. However, unlike LLMs, there's no stochastic sampling - given the same cache state at the same moment, results are reproducible.

The trade-off is clear: DCT sacrifices some accuracy for large latency improvements and predictable behavior. For most chatbot use cases where the entity cache is well populated, 73-85% accuracy is sufficient, and the per-query latency savings compound across multi-turn conversations.

Integration Architecture

Here's how DCT fits into a typical RAG pipeline:

User Query: "What's the warranty on that product?"
        ↓
[1] DCT Entity Cache Lookup (~15-60ms)
    └─ Detect: "that product" pattern found
    └─ Score entities, find best match
        ↓
Enriched: "What's the warranty on Premium Support Package?"
        ↓
[2] RAG Pipeline (Vector Search + LLM)
    └─ Embed query with embedding model
    └─ Search vector database
    └─ Generate response with LLM
        ↓
Response to User
        ↓
[3] Entity Extraction & Caching
    └─ Extract entity names from response
    └─ Add to entity cache for future resolution
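
In code, this is a thin wrapper around whatever RAG entry point already exists. The sketch below assumes a hypothetical rag_pipeline_answer function standing in for your retrieval-plus-generation call:

def answer_with_dct(cache: EntityCache, user_query: str) -> str:
    """Resolve anaphoric references before retrieval, then cache entities from the answer."""
    # [1] Try to enrich the query from the entity cache
    resolved = cache.resolve_reference(user_query)
    enriched_query = f"{user_query} (referring to: {resolved})" if resolved else user_query

    # [2] Run the normal RAG pipeline on the (possibly) enriched query
    response = rag_pipeline_answer(enriched_query)  # hypothetical retrieval + LLM call

    # [3] Feed the response back into the cache for future turns
    cache.extract_and_cache_entities(response)
    return response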

Key Configuration Parameters

When implementing DCT, several parameters significantly influence behavior:

Embedding Model: The model choice determines speed, accuracy, and language support. Multilingual models like paraphrase-multilingual-MiniLM-L12-v2 are more versatile, while specialized English models often provide better accuracy.

Cache Size: How many entities are stored per session. Too small leads to missed references, too large increases scoring time and may match irrelevant old entities.

Similarity Threshold: The minimum score required to accept a match (typically 0.4-0.6). Too low produces false positives, too high misses legitimate references.

Similarity Weight (α): The balance between semantic similarity and recency. A higher α (e.g., 0.8) favors semantically matching entities, while a lower value (e.g., 0.5) weights recently mentioned entities more heavily.
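
These parameters map directly onto the constructor and the resolve_reference call shown earlier. A starting configuration might look like this (the values are illustrative, not tuned recommendations):

cache = EntityCache(
    max_size=15,                                          # entities kept per session
    model_name="paraphrase-multilingual-MiniLM-L12-v2",   # or an English-only model for higher accuracy
    similarity_weight=0.7,                                 # alpha: 70% similarity, 30% recency
)
match = cache.resolve_reference("What about that service?", threshold=0.5)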

When to Use DCT

DCT works best when:

  • You have a bounded set of entity types (products, services, locations)
  • Multi-turn conversations are common
  • Latency is critical for user experience
  • You want predictable, debuggable behavior

Consider LLM-based enrichment when:

  • Queries involve complex reasoning beyond simple reference resolution
  • You need very high accuracy (>90%)
  • Entity types are unbounded or unpredictable

Conclusion

Dynamic Context Tuning provides a lightweight alternative to LLM-based query enrichment. By leveraging embeddings and a simple scoring formula, you can achieve significantly faster context resolution than LLM-based approaches while maintaining reasonable accuracy. The bulk of the time is spent encoding the query (~15-50ms) - the actual cache scoring over all entities takes less than a millisecond. The approach is particularly valuable for production chatbots where latency and predictability matter.

The underlying research paper provides additional depth on the theoretical foundations and evaluation benchmarks.
