HTML parsing

Learn how HTML is parsed in "Marketing and support" projects and how to set up custom HTML parsing rules.

Written by Alex Terehov
Updated over a week ago

This feature is available only for the Marketing and support project type.

How it works

When Lokalise imports content from sources like Marketo, Salesforce, Iterable, and others, it first converts the formatted content into HTML, regardless of the content's original format. After this conversion, it splits the translatable content into individual keys. The HTML parsing rules remove unnecessary tags from the texts sent for translation, which streamlines translation and makes translation memory entries more precise.

In particular:

  • Extra tags at the end are removed to clean up the text.

  • Keys that would consist only of iframes and images are not created, shifting the focus to translating textual content.

  • If a paragraph contains a <br> tag for a new line, this is now treated as a cue to generate two separate translation keys: one for the text before the <br> tag and another for the text after, allowing for more accurate translations.

  • Texts within HTML attributes like alt and title are extracted as separate keys. When exporting, Lokalise ensures the original structure of the document is accurately reconstructed. These keys are also given additional context to improve the accuracy and relevance of translations.
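The rules above can be approximated with a short Python sketch built on the standard library's `html.parser`. This is only an illustration and not Lokalise's actual implementation: the class name `KeySketch` and the function `extract_keys` are hypothetical, inline tags are dropped from the key text for brevity, and the real parser handles many more cases.

```python
from html.parser import HTMLParser


class KeySketch(HTMLParser):
    """Rough illustration of the rules above; not Lokalise's actual parser."""

    ATTRS = ("alt", "title")  # attributes whose text becomes its own key

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.keys = []      # extracted keys, in the order they are completed
        self._buffer = []   # text collected for the segment in progress

    def _flush(self):
        text = "".join(self._buffer).strip()
        if text:            # segments without any text produce no key
            self.keys.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":     # <br> ends the current segment
            self._flush()
        for name, value in attrs:
            if name in self.ATTRS and value:
                self.keys.append(value)  # attribute text is a separate key

    def handle_endtag(self, tag):
        if tag == "p":      # a closing paragraph also ends the segment
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def extract_keys(html):
    """Return the list of key texts found in an HTML fragment."""
    parser = KeySketch()
    parser.feed(html)
    parser.close()
    parser._flush()         # emit any text left after the last tag
    return parser.keys
```

For example, `extract_keys('<p>Hello,<br>world</p>')` yields two keys, one for the text before the `<br>` and one for the text after it, while a paragraph that contains only an image without `alt` or `title` text yields none.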

Customize parsing

Only project admins have permissions to adjust the parsing rules.

You can customize the HTML parsing rules on a per-project basis. To learn more about the rules, refer to the Okapi documentation.

To get started, proceed to the project settings and then navigate to the HTML parsing rules tab.

Once you are ready, click Save changes. Click Restore default rules to revert all the changes and load the defaults.

After the rules have been adjusted, they are applied to all new content imports. Previously imported content stays intact, but if you re-import an existing entry, it is recreated according to the new rules.

Technical limitations

  • Content within HTML attributes will be extracted without further parsing.

    • For example, if you have an <a title="<b>test</b>"> tag, the title will be extracted as <b>test</b> without any HTML processing.

  • Invalid HTML, such as duplicate element IDs, may cause the import to fail. Make sure the HTML is valid before importing.

  • The layout might look broken in the Lokalise editor because trailing tags are stripped out, although the exported content is still correct:

    • Input: <span>Hello,<br></span>

    • Visible content in Editor: <span>Hello,

    • Exported content (correct): <span>Hello,<br></span>

  • If the initial markup contains improperly closed tags, the content might be broken:

    • Input: <strong>Hello, <u>Lokalise</strong></u>

    • Visible content in Editor: Hello, <u>Lokalise</strong></u>

    • Exported content (broken): Hello, <u>Lokalise</strong></u>
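Because duplicate IDs and improperly closed tags can break an import or an export, it may help to check the markup before sending it to Lokalise. The following Python sketch is one possible pre-import check, not a Lokalise feature: the names `MarkupCheck` and `check_html` are hypothetical, and the check is stricter than the HTML specification because it also reports closing tags that HTML allows you to omit (for example `</p>` or `</li>`).

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MarkupCheck(HTMLParser):
    """Collects duplicate ids and tags that are closed out of order."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()   # how often each id value occurs
        self.open_tags = []    # stack of currently open elements
        self.problems = []     # human-readable findings

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()   # properly nested closing tag
        else:
            self.problems.append(f"</{tag}> does not match the innermost open tag")


def check_html(html):
    """Return a list of problems; an empty list means none were found."""
    checker = MarkupCheck()
    checker.feed(html)
    checker.close()
    problems = list(checker.problems)
    problems += [f"<{tag}> is never closed" for tag in checker.open_tags]
    problems += [f'id "{value}" is used {count} times'
                 for value, count in checker.ids.items() if count > 1]
    return problems
```

With this sketch, `check_html('<span>Hello,<br></span>')` returns an empty list, while the improperly closed `<strong>Hello, <u>Lokalise</strong></u>` example above produces findings that can be fixed before the import.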
