How do people code in other languages

Specification of the language in HTML

Intended audience: HTML developers (who use editors or scripts), script developers (PHP, JSP, etc.), web project managers and anyone who wants to better understand how to specify the language of a website

Question

How should I specify the language of the content on my HTML pages?

Short answer

Always use a language attribute on the html tag to declare the language of the text on your page. If the page contains content in another language, add a language attribute to the element that wraps that content.

Use the lang attribute for pages served as HTML and the xml:lang attribute for pages served as XML. Use both attributes together for XHTML 1.x and polyglot HTML5 documents.

Use language tags from the IANA Language Subtag Registry. An unofficial lookup tool for language subtags offers a user-friendly interface to the registry.

Use nested elements when the element content and attribute value of an element are in different languages.

Details

Basics

Always use a language attribute on the html element. The declaration is inherited by all other elements and thus also sets the base language for the text in the head element.

Declare the language on the html element, not on the body element, because the body element does not cover the text in the head element.

If the page has content in a language other than the one declared on the html element, set a language attribute on the element that contains this content. This allows you to style or process this content differently.
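As a minimal sketch of the two rules above, the following document declares English as the base language on the html element and marks one paragraph as French. The page content and the style rule are illustrative assumptions, not taken from the original article; the :lang() selector shows one way such content could be styled differently.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <!-- The title inherits lang="en" from the html element -->
  <title>Language declaration example</title>
  <style>
    /* Matches any element whose declared or inherited language is French */
    :lang(fr) { font-style: italic; }
  </style>
</head>
<body>
  <p>This paragraph inherits the base language, English.</p>
  <!-- A different language is declared on the element that wraps the content -->
  <p lang="fr">Ce paragraphe est en français.</p>
</body>
</html>
```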

In a few places this runs into a problem: if the title element contains multilingual text, you cannot mark up the different languages, because the title element allows only text, not markup. The same goes for text in attributes. There is currently no satisfactory solution to this problem.

Choosing the right attribute

If your document is served as HTML (i.e., as text/html), use the lang attribute to specify the language of the document or of a section of text. For example, <html lang="fr"> sets French as the base language.

If you serve XHTML 1.x or polyglot HTML5 documents as text/html, use both the lang attribute and the xml:lang attribute each time you declare a language. The xml:lang attribute is the standard way to specify language information in XML. Make sure that both attributes have the same value.
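A sketch of what this could look like with French as the base language, assuming an XHTML 1.x or polyglot document served as text/html. The namespace declaration is the standard XHTML one; the text content is only illustrative.

```html
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
  <head>
    <title>Exemple</title>
  </head>
  <body>
    <!-- Wherever a language is declared, both attributes carry the same value -->
    <p>Bonjour. <span lang="en" xml:lang="en">Hello.</span></p>
  </body>
</html>
```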

The xml:lang attribute has no effect when the file is processed as HTML, but it takes over the function of the lang attribute when the document is processed as XML. The lang attribute is allowed in XHTML and may also be recognized by browsers. If you use other XML processing tools (such as the lang() function in XSLT), you cannot assume that the lang attribute will be recognized.

If your page is served as XML (i.e., with a media type such as application/xhtml+xml), you do not need the lang attribute. The xml:lang attribute alone is sufficient.
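For comparison, a sketch of the same kind of document when it is served only as application/xhtml+xml, assuming no consumer ever parses it as HTML. The text content is illustrative.

```html
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr">
  <head>
    <title>Exemple</title>
  </head>
  <body>
    <!-- xml:lang alone declares the language when the page is processed as XML -->
    <p>Bonjour. <span xml:lang="en">Hello.</span></p>
  </body>
</html>
```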

What to do if element content and attribute values are in different languages?

Sometimes the text in an attribute value is in a different language than the content of the element. In the upper right corner of this page, for example, there are links to the English original and to translated versions. The link text names the language of the target page in that target language, while the associated title attribute carries a hint in the language of the current page.

If the link to the Spanish version carries the language attribute on the a element itself, the attribute declares that not only the element content but also the text in the title attribute is in Spanish. That is obviously not the case.

Instead, give the text in the other language its own nested element and put the language attribute there, so that the a element, together with its title attribute, inherits the language declaration from the html element.
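The difference can be sketched as follows, assuming a page whose html element declares English. The href value is a hypothetical placeholder, not the address of a real translation.

```html
<!-- Problematic: lang="es" also labels the English title text as Spanish -->
<a lang="es" title="Spanish" href="page.es.html">Español</a>

<!-- Better: only the nested span is Spanish; the a element and its
     title attribute inherit the language of the html element -->
<a title="Spanish" href="page.es.html"><span lang="es">Español</span></a>
```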

What if there is no element to which the attribute can be attached?

If you want to specify the language of content that is not enclosed in markup, wrap it in an element such as span or div and put the language attribute on that element.
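A short sketch of this, using an illustrative sentence in which a single French word has no markup of its own and is therefore wrapped in a span:

```html
<!-- The span exists only to carry the language declaration -->
<p>The French word for cat is <span lang="fr">chat</span>.</p>
```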

Specifying values for language attributes

To make sure that all user agents understand which language you mean, follow the standard for language tags. You may also need to consider how the standard refers to different dialects, e.g. American and British English, which differ significantly in spelling and pronunciation.

The rules for valid values of language attributes are described in the IETF specification BCP 47. In addition to defining how to use simple language tags (such as en for English or fr for French), BCP 47 also describes how to compose language tags that identify regional dialects, scripts and other variants of a language.
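A few language attribute values of the kinds described above, as a sketch. The tags are well-formed BCP 47 tags; the text content is illustrative only.

```html
<!-- Simple language tags -->
<p lang="en">Hello</p>
<p lang="fr">Bonjour</p>

<!-- Language plus region subtag: British and American English -->
<p lang="en-GB">colour</p>
<p lang="en-US">color</p>

<!-- Language plus script subtag: Chinese written in Simplified script -->
<p lang="zh-Hans">你好</p>
```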

For a short but comprehensive introduction to the syntax of language tags according to BCP 47, read Language tags in HTML and XML. For help choosing the right language tag from the many possible subtags and their combinations, read Choosing a Language Tag.

Additional information

Provide metadata about the language of the target audience

If you want to provide metadata that indicates the language of the target audience, rather than the language of the text, let the server send this information in the HTTP Content-Language header. If your target audience speaks multiple languages, the header accepts a comma-separated list of languages.

Here is an example of an HTTP header that identifies the resource as a mix of English, Hindi and Punjabi:

Content-Language: en, hi, pa

This approach does not work if the page is loaded from a hard disk or a similar source rather than from a web server. There is currently no widely supported way to carry this kind of metadata within the page itself.

In the past, many authors used a meta element with the http-equiv attribute set to Content-Language. Because of long-standing confusion and inconsistent implementations of this element, the HTML5 specification declares it non-conforming in HTML. You should therefore no longer use it.

For backward compatibility, HTML5 describes an algorithm that, under certain conditions, guesses the base language of the content from the HTTP Content-Language header or a meta element. However, this is only a fallback for cases where the html tag has no language attribute. If you use the language attribute on the html tag, which you should always do, these fallbacks are irrelevant.

For more information about Content-Language in HTTP and in meta elements, read HTTP headers, meta elements and language information.
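The contrast can be sketched as two separate fragments, with English assumed as the language purely for illustration:

```html
<!-- Non-conforming in HTML5: do not use this to declare the language -->
<meta http-equiv="Content-Language" content="en">

<!-- Instead, declare the language of the text on the html element -->
<html lang="en">
```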

Things unrelated to language information

For the sake of completeness, a few things should be mentioned that are not related to language information.

1. It is not possible to specify the language of the content using CSS.

2. The DOCTYPE declaration that every XHTML file should begin with may contain what looks to some like a language declaration. Such a declaration can contain EN, which stands for “English”. However, this information refers to the language of the schema associated with the document; it has nothing to do with the language of the document itself.

3. Some believe that the natural language can be inferred from the character encoding. However, a character encoding does not identify a natural language unambiguously. That would require a one-to-one relationship between encoding and language, and there is none. A single character encoding can serve many languages: Latin-1 (ISO 8859-1), for example, can encode French, English and many other languages. Conversely, the character encoding used for a language can vary: Arabic, for example, can be encoded in Windows-1256, ISO 8859-6 or UTF-8.

These differences between character encodings are hardly relevant nowadays, however, since all content should be encoded in UTF-8, which covers most languages with a single character encoding.

4. The same applies to text direction. In some scripts, such as Arabic and Hebrew, text is primarily read from right to left, but numbers and text in other scripts are read from left to right. Markup such as the dir attribute is required to set the overall right-to-left context, and in some cases additional markup is needed to display bidirectional text correctly. A language declaration cannot do this.

As with encodings and languages, there is not always a one-to-one relationship between language and script, and therefore none between language and text direction. Azerbaijani, for example, can be written in a right-to-left (Arabic) script or in a left-to-right (Latin or Cyrillic) script, and the language code az is used for all of these script variants. Moreover, direction markup can apply a range of values to a text, whereas a language declaration is a simple switch that is not suited to that task.
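The points about the DOCTYPE, the character encoding and text direction can be seen side by side in one sketch. It assumes an XHTML 1.0 Strict document in Arabic; the Arabic text is illustrative. The language of the text is stated only by lang and xml:lang, while EN, the charset and dir each describe something else.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- EN above refers to the language of the schema, not of this document -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="ar" xml:lang="ar" dir="rtl">
  <head>
    <!-- UTF-8 says how characters are encoded, not which language they are in -->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>مثال</title>
  </head>
  <body>
    <!-- dir="rtl" on the html element sets the direction; lang="ar" does not -->
    <p>مرحبا بالعالم</p>
  </body>
</html>
```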

Further reading