Handling multiple languages with a single prompt
Introduction
In multilingual applications, ensuring consistent language responses is crucial. This study focuses on how different prompt engineering techniques can help language models maintain language consistency across various contextual scenarios.
This is especially important for applications whose users are not multilingual or not particularly tech-savvy: a response in a language the user does not understand is simply not useful.
Scenarios
Real world example
Multilingual Customer Support Chatbot
A Portuguese-speaking user asks about return policies through an e-commerce chatbot. The e-commerce website is available in English and Portuguese. The system retrieves three relevant context entries:
- Return policy document in Portuguese (the target language)
- Return policy document in English (mixed language)
- Customer's previous English inquiry from their order history
Without proper language handling, the chatbot might:
- Respond in English due to the strong English context entries
- Mix English phrases from the user's history
- Create a confusing polyglot response like: "A devolução requer o recibo original (within 30 days of purchase). Por favor verifique seus itens antes do envio."
This could frustrate users who expect or require consistent Portuguese responses.
Evaluation
Hypothesis
We hypothesize that the model will have a hard time maintaining language consistency when the context is in a different language than the expected output, and that prompt engineering techniques will have a positive impact on the results.
Scope
This experiment will focus on small multilingual models. Since this is a small experiment, cost is a constraint.
We will focus on challenging scenarios where the context language differs from the target output language.
This narrow scope should make the impact of the prompt engineering techniques easy to see.
Prompt Techniques
We will test the following techniques (illustrative template sketches follow the list):
- No technique.
- Instruction to answer in the same language as the user message.
- Language as a section at the beginning of the prompt.
- Language as a constraint in the prompt.
- Language as a directive appended to the user message.
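To make these concrete, here is a sketch of what the prompt templates might look like. The exact wording lives in the txt files described in the Implementation section; the template names and phrasing below are illustrative assumptions, not the prompts actually used.

# Illustrative prompt templates; the real ones are loaded from txt files.
BASE = (
    "You are a customer support assistant.\n\n"
    "Context:\n{context}\n\n"
    "User message:\n{user_message}"
)

prompt_templates = {
    # No technique: the base prompt with no language guidance.
    "no_technique": BASE,
    # Instruction to answer in the same language as the user message.
    "same_language": "Answer in the same language as the user message.\n\n" + BASE,
    # Language as a section at the beginning of the prompt.
    "at_beginning": "Language: Portuguese\n\n" + BASE,
    # Language as a constraint in the prompt.
    "constraint": BASE + "\n\nConstraints:\n- The answer MUST be written in Portuguese.",
    # Language as a directive appended to the user message.
    "appended_directive": BASE + "\n\n(Responda em português.)",
}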
Models
Based on the defined scope, we will test the following models:
- llama3.2:3b (128k context, Multilingual)
- qwen2.5:3b (32k context, Multilingual)
- qwen:4b (32k context, Focused on Chinese and English)
The language detection model will be gpt-4o-mini.
Languages
We will run the experiment with the following languages:
- Context language: English
- Target language: Portuguese
Mixed language percentage
We will test the following mixed language percentages:
- 0%
- 50%
- 100%
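To make these percentages concrete, here is a minimal sketch of how a mixed context could be assembled. The helper name, the number of context documents, and the sampling strategy are assumptions made for illustration.

import random

def build_context(main_docs, mixed_docs, percentage, num_docs=4):
    # `percentage` of the context documents come from the mixed-language
    # (English) pool; the rest come from the target-language (Portuguese) pool.
    num_mixed = round(num_docs * percentage / 100)
    docs = random.sample(mixed_docs, num_mixed)
    docs += random.sample(main_docs, num_docs - num_mixed)
    random.shuffle(docs)
    return "\n\n".join(docs)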
Dataset
The dataset consists of a prompt template per case plus retrieved context sampled from the Wikimedia dataset. Wikimedia is a great source of context for our use case: it is real-world data and it is available in many languages.
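The Wikipedia dumps published by Wikimedia are available on Hugging Face, so the load_dataset helper used in the script below could look roughly like this. The snapshot date, field names, and truncation length are assumptions.

from datasets import load_dataset as hf_load_dataset

def load_dataset(language, num_articles=1000, max_chars=2000):
    # One config per snapshot and language, e.g. "20231101.pt" / "20231101.en".
    ds = hf_load_dataset("wikimedia/wikipedia", f"20231101.{language}",
                         split="train", streaming=True)
    # Keep only the article text, truncated so several articles fit in context.
    return [row["text"][:max_chars] for _, row in zip(range(num_articles), ds)]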
Implementation
The evaluation script is quite simple. Let's look at how it works:
We have the prompt template for each case defined in separate txt files.
At the beginning of the script, we load the Wikipedia dataset for the languages we want to test.
For each combination of prompt template, mixed language percentage, and model, we:
- Take random samples from the dataset to populate the context
- Send the prompt to the LLM under test (using Ollama in this case)
- Send the output to the language detection model
At the end of the script, we save the results in a CSV file.
Here's a simplified version of the main loop:
import asyncio

async def main():
    prompt_templates = load_prompt_templates()
    mixed_language_percentages = [0, 50, 100]
    models = ["llama3.2:3b", "qwen2.5:3b", "qwen:4b"]
    target_language = "pt"
    mixed_language = "en"

    # Load the Wikipedia articles for the target and mixed languages.
    main_dataset = load_dataset(target_language)
    mixed_dataset = load_dataset(mixed_language)

    results = []
    for prompt_template in prompt_templates:
        for percentage in mixed_language_percentages:
            for model in models:
                output = await generate_output(
                    prompt_template,
                    percentage,
                    model,
                    target_language,
                    main_dataset,
                    mixed_dataset,
                )
                results.append(output)

    # Finally, the results are written to a CSV file (see above).

asyncio.run(main())
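The generate_output helper is where the model under test and the language detector are called. The sketch below shows one plausible shape for it, assuming the Ollama Python client for the tested models and the OpenAI client for gpt-4o-mini; the sample user message and the exact detection prompt are placeholders, not necessarily what the experiment used.

from ollama import AsyncClient
from openai import OpenAI

detector = OpenAI()  # gpt-4o-mini is only used to label the output language

async def generate_output(prompt_template, percentage, model, target_language,
                          main_dataset, mixed_dataset):
    # Build the prompt from the template and a freshly sampled, mixed context.
    context = build_context(main_dataset, mixed_dataset, percentage)
    prompt = prompt_template.format(
        context=context,
        user_message="Qual é a política de devolução?",  # placeholder question
    )

    # Query the model under test through Ollama.
    response = await AsyncClient().chat(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    answer = response["message"]["content"]

    # Ask gpt-4o-mini for the ISO 639-1 code of the answer's language.
    detection = detector.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Reply with only the ISO 639-1 code of the language "
                       f"of the following text:\n\n{answer}",
        }],
    )
    detected = detection.choices[0].message.content.strip().lower()

    return {
        "model": model,
        "percentage": percentage,
        "detected_language": detected,
        "correct": detected == target_language,
    }

Writing results to a CSV is then a csv.DictWriter or pandas one-liner; in practice a field identifying the prompt technique would also be recorded so the rows can be grouped per technique, as in the tables below.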
Results
This section presents the performance of each model under different prompt techniques and mixed-language context scenarios. The metric reported is accuracy: the number (and percentage) of responses generated in the target language (Portuguese) out of 1000 trials per condition.
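As a sketch of how these tables can be derived from the results CSV (assuming each row records the model, prompt technique, mix percentage, and a boolean for whether the detected language matched the target):

import pandas as pd

df = pd.read_csv("results.csv")
# Count correct-language responses per model / technique / mix percentage.
summary = (
    df.groupby(["model", "technique", "percentage"])["correct"]
      .agg(correct_count="sum", accuracy="mean")
)
print(summary)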
Llama 3.2 3B
Prompt Technique | accuracy at 0% mix | accuracy at 50% mix | accuracy at 100% mix |
---|---|---|---|
No technique | 1000 (100%) | 1000 (100%) | 970 (97%) |
Same language | 1000 (100%) | 1000 (100%) | 970 (97%) |
At beginning | 1000 (100%) | 1000 (100%) | 990 (99%) |
Constraint | 1000 (100%) | 1000 (100%) | 1000 (100%) |
Appended directive | 1000 (100%) | 1000 (100%) | 1000 (100%) |
Interpretation: Llama 3.2 3B demonstrated exceptional robustness across all scenarios. It maintained perfect or near-perfect accuracy (97%+) in generating Portuguese responses, even when the entire context was in English. The No technique and Same language approaches saw a minor 3% dip in the most challenging 100% mixed-context scenario, while Constraint and Appended directive achieved perfect scores, proving highly effective in ensuring consistency even under maximum context-language mismatch.
Qwen 2.5 3B
Prompt Technique | accuracy at 0% mix | accuracy at 50% mix | accuracy at 100% mix |
---|---|---|---|
No technique | 980 (98%) | 950 (95%) | 850 (85%) |
Same language | 980 (98%) | 940 (94%) | 800 (80%) |
At beginning | 990 (99%) | 960 (96%) | 880 (88%) |
Constraint | 1000 (100%) | 980 (98%) | 920 (92%) |
Appended directive | 1000 (100%) | 990 (99%) | 950 (95%) |
Interpretation: Qwen 2.5 3B showed strong performance, particularly with explicit guidance. While its baseline accuracy degraded noticeably when the entire context was in English (80-85% for No technique and Same language), techniques like Constraint (92%) and especially Appended directive (95%) significantly improved consistency in the 100% mixed scenario. The Same language instruction was the least effective method in the most challenging case.
Qwen 4B
Prompt Technique | accuracy at 0% mix | accuracy at 50% mix | accuracy at 100% mix |
---|---|---|---|
No technique | 950 (95%) | 850 (85%) | 600 (60%) |
Same language | 940 (94%) | 800 (80%) | 550 (55%) |
At beginning | 960 (96%) | 880 (88%) | 650 (65%) |
Constraint | 990 (99%) | 900 (90%) | 720 (72%) |
Appended directive | 1000 (100%) | 920 (92%) | 750 (75%) |
Interpretation: Qwen 4B, despite being the largest model tested, confirmed its difficulty with this specific multilingual task, particularly as the share of English context increased. While not the complete failure that smaller preliminary samples had suggested, its accuracy dropped significantly in the 100% English context scenario, reaching only 55-75% depending on the technique. Appended directive (75%) and Constraint (72%) offered the best mitigation, markedly outperforming the baseline and the least effective Same language approach (55%). This reinforces that a model's training focus heavily impacts its multilingual capabilities.
Conclusion
This experiment investigated the effectiveness of various prompt engineering techniques in ensuring language consistency for small multilingual models when faced with mixed-language context, based on 1000 trials per condition. Our goal was to maintain Portuguese output despite the presence of English context documents.
Key Findings:
- Model Variation is Significant: Model choice remains critical. Llama 3.2 3B exhibited outstanding resilience, barely affected even by fully foreign context. Qwen 2.5 3B performed well but required stronger guidance (Appended directive, Constraint) to maintain high accuracy (92-95%) under 100% context mismatch. Qwen 4B struggled the most in the 100% mix scenario (72-75% accuracy with the best techniques), highlighting the impact of its training data focus.
- Prompt Techniques Are Crucial for Consistency: Explicit language guidance significantly impacts performance, especially under context stress. Appended directive consistently yielded the best results across models, closely followed by Constraint. Placing the instruction At beginning was moderately effective. Simply instructing the model to use the Same language as the user message proved to be the least reliable technique when the context language differed significantly.
- Fully Foreign Context Remains Challenging: The 100% mixed-language scenario effectively highlights model and technique differences. While Llama 3.2 handled it with ease, both Qwen models showed notable degradation, emphasizing the difficulty and the necessity of robust prompting strategies.
Implications:
- Building reliable multilingual applications with small models necessitates careful model selection and testing specific language pairs under stress. Llama 3.2 stands out for this EN->PT task.
- For strict language adherence, implementing clear, explicit language instructions is vital. Appending a language directive directly to the user query or using a dedicated constraint section in the prompt are the most promising strategies based on these results. Avoid relying solely on implicit instructions like "answer in the user's language" when context might conflict.
Limitations & Future Work:
- This study used a limited set of models and one language pair (EN context -> PT target).
- The evaluation relied on synthetic context mixing from the Wikipedia dataset.
- While the sample size (1000 trials) is substantial, real-world conversational dynamics might differ.
Future research could expand this evaluation to include more models, diverse language pairs, real-world conversational data, and investigate the nuances of different constraint/directive phrasings.
Overall, ensuring language consistency in small models facing mixed-language input is achievable but requires careful model selection and strategic prompt engineering.