
Evaluating Pro AI translations with Custom AI profiles vs Standard AI/MT

Learn how to evaluate AI translations with Custom AI profiles (RAG) vs Standard AI/MT.

Written by Ilya Krukowski
Updated this week

This guide explains how you can manually compare different types of translations in Lokalise:

  • Pro AI translations using Custom AI profiles (with RAG context)

  • vs Pro AI translations without RAG

  • vs standard AI/MT engines (Google, DeepL)

This evaluation setup helps you understand which option delivers the best quality for your content.

Create an evaluation project: Start with a clean workspace

We recommend creating a brand-new project that you will use only for evaluation. If you'd rather script the setup, a minimal API sketch follows the steps below.

You need to:

  1. Create a new Web and mobile or Ad hoc documents project (e.g., AI Quality Evaluation).

  2. Add or import the source text you want to test (for example, in English). You can:

    1. Copy/move keys from another project using bulk actions

    2. Add keys manually
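The following is a minimal sketch of scripting this setup with the Lokalise API v2 and Python's requests library. The project name, key names, and source strings are placeholders, and the exact fields your content needs may differ, so treat it as a starting point rather than a complete recipe.

```python
import requests

API_TOKEN = "YOUR_LOKALISE_API_TOKEN"  # generate one in your personal profile
BASE_URL = "https://api.lokalise.com/api2"
HEADERS = {"X-Api-Token": API_TOKEN, "Content-Type": "application/json"}

# 1. Create the evaluation project (the name is just an example).
project = requests.post(
    f"{BASE_URL}/projects",
    headers=HEADERS,
    json={"name": "AI Quality Evaluation", "description": "Blind AI/MT comparison"},
)
project.raise_for_status()
project_id = project.json()["project_id"]

# 2. Add a couple of English source keys (sample texts only).
keys = requests.post(
    f"{BASE_URL}/projects/{project_id}/keys",
    headers=HEADERS,
    json={
        "keys": [
            {
                "key_name": "welcome_message",
                "platforms": ["web"],
                "translations": [{"language_iso": "en", "translation": "Welcome to our app!"}],
            },
            {
                "key_name": "checkout_button",
                "platforms": ["web"],
                "translations": [{"language_iso": "en", "translation": "Proceed to checkout"}],
            },
        ]
    },
)
keys.raise_for_status()
print("Created project", project_id)
```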


Add comparison languages: Use multiple variants for testing

To compare different translation outputs of the same language, you need several "variants" of that language. Because a project cannot contain the same locale twice, we recommend adding custom locale codes to the evaluation project.

Note that we recommend using Pro AI with RAG for a "real" language variant — the language that already has previous translations and translation memory entries.

Pro AI and Standard AI/MT can be used for "fake" (custom) language variants.

Example:

Variant   | Locale code                                  | Purpose
Latvian A | lv (or pt_BR, fr_FR, etc.)                   | Pro AI with Custom AI profile (RAG)
Latvian B | lv-test1 (or pt_BR-test1, fr_FR-test1, etc.) | Pro AI (no RAG)
Latvian C | lv-test2 (or pt_BR-test2, fr_FR-test2, etc.) | Standard AI/MT engine (e.g., Google)

You need to:

  • Add as many language variants as you want to compare.

  • Use the correct ISO code as the prefix (e.g., lv), and add any suffix you like (-test1, -v02, etc.).

  • Rename the languages to neutral labels such as Latvian A, Latvian B, Latvian C so reviewers cannot guess which engine produced which translation.

This helps prevent bias during evaluation.
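If you prefer to script this step, the sketch below adds the three variants from the example table via the API. The custom_iso field used for the suffixed codes is an assumption; if it isn't accepted in your setup, add the custom locale codes in the UI instead and only rename the languages afterwards.

```python
import requests

API_TOKEN = "YOUR_LOKALISE_API_TOKEN"
PROJECT_ID = "YOUR_EVALUATION_PROJECT_ID"
BASE_URL = "https://api.lokalise.com/api2"
HEADERS = {"X-Api-Token": API_TOKEN, "Content-Type": "application/json"}

# Three Latvian variants with neutral names so reviewers can't tell
# which engine produced which translation.
# NOTE: custom_iso for the suffixed codes is an assumption; if the API
# rejects it, create these custom languages in the UI instead.
payload = {
    "languages": [
        {"lang_iso": "lv", "custom_name": "Latvian A"},
        {"lang_iso": "lv", "custom_iso": "lv-test1", "custom_name": "Latvian B"},
        {"lang_iso": "lv", "custom_iso": "lv-test2", "custom_name": "Latvian C"},
    ]
}

resp = requests.post(f"{BASE_URL}/projects/{PROJECT_ID}/languages", headers=HEADERS, json=payload)
resp.raise_for_status()
for lang in resp.json().get("languages", []):
    print(lang.get("lang_iso"), "->", lang.get("lang_name"))
```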


Optional: Create custom statuses for linguistic scoring

If you want a structured way to mark quality, you can create custom statuses such as:

  • Acceptable

  • Not acceptable

Reviewers will later use these statuses to evaluate each translation.

Reviewers don’t need to fix translations they think are wrong — they can simply mark them by using a custom status. For example, if something looks incorrect, they can just apply the “Not acceptable” status.


Set up your Custom AI profile: Apply RAG to one language only

If you want to evaluate the impact of Custom AI profiles, you need to apply the profile to only one language variant.

You need to:

  1. Create or open a Custom AI profile.

  2. Add your high-quality past translations as examples (for RAG context).

  3. Assign this profile to one target language within the evaluation project (e.g., Latvian A).

  4. Do not assign it to the other variants.

This ensures that only one variant uses customized RAG-powered AI.


Generate translations: Use automations for consistent output

Once your languages are ready, you can fill all translations in bulk.

You need to:

  1. Go to Automations within the evaluation project.

  2. Create an automation that fills each target language:

    1. Use Pro AI for Latvian A and Latvian B. As long as you've assigned the Custom AI profile only to Latvian A, the B variant won't use RAG.

    2. Use Standard AI/MT for Latvian C (and additional variants if needed).

Automations allow you to generate translations consistently across all keys and languages. They also won't trigger translation scoring, which helps avoid bias during the review.


Invite reviewers: Evaluate quality blindly

Once translations are generated, you can invite your linguists or reviewers.

We recommend:

  • Asking reviewers to evaluate translations without knowing which engine produced them.

    • Note that reviewers are not required to fix incorrect translations.

  • Using custom statuses (e.g., Acceptable / Not acceptable).

  • Adding comments for suggested improvements.

This ensures an unbiased assessment.
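For larger evaluations you can also invite reviewers programmatically. The sketch below uses the contributors endpoint; the email and name are placeholders, and the is_reviewer flag and per-language access fields are assumptions about the payload shape, so double-check them against the API reference before relying on this.

```python
import requests

API_TOKEN = "YOUR_LOKALISE_API_TOKEN"
PROJECT_ID = "YOUR_EVALUATION_PROJECT_ID"
BASE_URL = "https://api.lokalise.com/api2"
HEADERS = {"X-Api-Token": API_TOKEN, "Content-Type": "application/json"}

# Invite one reviewer with access limited to the three Latvian variants.
# is_reviewer and the languages list are assumed payload fields.
payload = {
    "contributors": [
        {
            "email": "reviewer@example.com",
            "fullname": "Evaluation Reviewer",
            "is_admin": False,
            "is_reviewer": True,
            "languages": [
                {"lang_iso": "lv", "is_writable": True},
                {"lang_iso": "lv-test1", "is_writable": True},
                {"lang_iso": "lv-test2", "is_writable": True},
            ],
        }
    ]
}

resp = requests.post(f"{BASE_URL}/projects/{PROJECT_ID}/contributors", headers=HEADERS, json=payload)
resp.raise_for_status()
print("Invited:", [c["email"] for c in resp.json().get("contributors", [])])
```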


Analyze results: Compare the quality of each translation variant

After the evaluation is complete, you can compare the results across all languages.

You need to:

  1. Filter translations by:

    1. Language

    2. Status (Acceptable / Not acceptable)

  2. Count how many translations were marked as acceptable for each variant (see the sketch after this list for a scripted way to tally them).

  3. Review comments to understand typical errors or strengths.
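To tally the scores without clicking through the editor, you can pull translations per language and count statuses in a short script. The sketch below assumes the translations list endpoint accepts a filter_lang_id parameter and that each translation object exposes its custom statuses under custom_translation_statuses; verify both against the API reference, and replace the placeholder language IDs with your own.

```python
import requests

API_TOKEN = "YOUR_LOKALISE_API_TOKEN"
PROJECT_ID = "YOUR_EVALUATION_PROJECT_ID"
BASE_URL = "https://api.lokalise.com/api2"
HEADERS = {"X-Api-Token": API_TOKEN}

def count_acceptable(lang_id: int) -> tuple[int, int]:
    """Return (acceptable, total) for one language variant."""
    acceptable, total, page = 0, 0, 1
    while True:
        resp = requests.get(
            f"{BASE_URL}/projects/{PROJECT_ID}/translations",
            headers=HEADERS,
            params={"filter_lang_id": lang_id, "limit": 500, "page": page},
        )
        resp.raise_for_status()
        translations = resp.json().get("translations", [])
        if not translations:
            break
        for tr in translations:
            total += 1
            # Assumed field name for custom statuses on a translation object.
            statuses = {s.get("title") for s in tr.get("custom_translation_statuses", [])}
            if "Acceptable" in statuses:
                acceptable += 1
        page += 1
    return acceptable, total

# Placeholder language IDs: look them up in the project's language settings
# or via GET /projects/{id}/languages.
variants = {"Latvian A": 111, "Latvian B": 222, "Latvian C": 333}
for name, lang_id in variants.items():
    ok, total = count_acceptable(lang_id)
    print(f"{name}: {ok}/{total} marked Acceptable")
```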

If needed, you can rename the languages to reveal which engine produced each result (e.g., Latvian A → Custom AI, Latvian B → Standard AI, Latvian C → MT).


What you learn: Understanding the impact of customization

By following this workflow, you will see:

  • How much Custom AI profiles (with RAG) improve quality

  • How Pro AI performs without past translation context

  • How Standard AI/MT engines compare to both AI options

  • How close your customized AI results get to previous human translations

This evaluation helps you choose the best translation approach for your product and content domain.
