Standards in the Evolution of Translation Technologies

By: Sergio Alasia - Read Time: 11 minutes

Introduction
Character encoding standards
File formats
Process standards
Conclusions

This article illustrates the most commonly used standards in the language and translation industry, and how introducing them can positively affect the evolution of language and translation technologies.

The first part will briefly go through the three main types of standards and how they fit in the processes of translation and software localization.

The second part will analyse the extent to which these standards are implemented by software developers and Language Service Providers in their workflows.

Finally, it offers some thoughts on how standardization and interoperability can improve the quality of interactions in the language service market.

Introduction

As stated by the European Telecommunications Standards Institute (ETSI), the main purpose of standardization is to allow interoperability and freedom of choice for buyers in a multi-vendor, multi-network, multi-service context. Language Service Providers (LSPs) in particular, whose services are in growing demand, cannot afford to fall short of at least some unified industry standards.

LSPs mainly deal with text files, so industry standards should focus on text and its digital representation. In practice, interoperability in the language industry should allow us to create and process documents in different environments and transfer them by different means without losing any of their properties.

So far, this has not always been possible, since market leaders’ policies have often stood in the way of interoperability. However, in the last decade there has been an increasing effort in that direction, and standards are starting to play a constructive role in the evolution of language technologies.

For clarity’s sake, this article will consider three main blocks of standards that directly concern the translation industry:

Character encoding standards
File formats
Process standards

Character encoding standards

Character encoding standards determine how text is represented as an underlying digital code, in order to be transmitted from one computer system to another. There are plenty of different character sets, but few of them are universally supported. Therefore the use of non-standard character sets may lead to unreadable texts when documents are transferred from one system to another. The typical result is a poorly displayed, garbled text, full of question marks, squares and other strange characters.

This is especially problematic in the case of multilingual texts, where different alphabets or even letters and ideograms must coexist. A Russian-English glossary, for instance, should be encoded in a way that both the Cyrillic and the Latin alphabets are readable by any computer, regardless of the system’s locale.

One solution to this problem is the use of the Unicode standard, which provides a unique code point for every character or ideogram of almost all written languages. UTF-8 and UTF-16 are the two most widespread encodings of Unicode, that is, two ways of storing those code points as bytes. Unicode’s ability to represent and handle text expressed in most of the world’s writing systems has led to its extensive use in web pages and therefore in the language service industry as well.
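
As a minimal sketch of the problem described above, the following Python snippet encodes a short Russian-English glossary entry in UTF-8 and then reads it back both correctly and incorrectly. The sample entry and the choice of Latin-1 and ASCII as the "wrong" encodings are illustrative assumptions, not taken from any real workflow.

```python
# Illustrative glossary entry mixing Cyrillic and Latin scripts (sample data).
entry = "словарь = dictionary"

# Each character has one Unicode code point, whatever the system locale.
code_points = [f"U+{ord(ch):04X}" for ch in entry[:3]]

# UTF-8 and UTF-16 serialize the same code points as different byte sequences.
utf8_bytes = entry.encode("utf-8")
utf16_bytes = entry.encode("utf-16-le")

# Decoding with the encoding used for writing restores the text exactly.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding (Latin-1 here) produces garbled text.
garbled = utf8_bytes.decode("latin-1")

# ASCII cannot represent Cyrillic: each letter is replaced by "?".
lossy = entry.encode("ascii", errors="replace").decode("ascii")

print(code_points)
print(len(utf8_bytes), len(utf16_bytes))
print(round_trip == entry)
print(repr(garbled))
print(lossy)
```

The last two lines show the two typical symptoms mentioned earlier: strange characters when the bytes are interpreted with the wrong encoding, and question marks when the target encoding has no slot for the character at all.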

File formats

File formats define the inner structure of files, so that the appropriate software application can properly load, open, process and save them. Companies should use standard formats for written files containing vital information, in order to keep the ownership of such information at all times.

Conversely, if only one or a few commercial applications are available for a company’s documentation authoring and translation, a risky relationship of dependency arises. On top of not adhering to the standards, the software house selling that application may force buyers to update without allowing backwards compatibility. It may even abandon the application or, worse, close down its business completely, leaving users without any updates or support.

Since the days of punched cards, end users have witnessed many such lock-in practices by major software houses seeking to gain or keep their market share, regardless of the quality of their products. In fact, it was the lack of common standards for word processors in the mid-1990s that caused a somewhat painful mass migration of WordPerfect users to Microsoft Word, while Windows was supplanting DOS as the foremost operating system.

The language niche is no exception. SDL Trados is currently the market leader and de facto standard, yet it is far from being the best translation software, at least in terms of compatibility. It is based on Microsoft’s .NET technology, which leaves out all Mac and Linux users; it does not allow older versions to open files created by newer versions; and only recently did it improve its cumbersome and cluttered graphical interface, reminiscent of the early 1990s. Most importantly, its developers have lately treated their paying users as beta testers by releasing a buggy version of their Studio 2011 suite in September 2011, followed only three months later by the first Service Pack, a full 347 MB download.

In the translation industry, the standardization of file formats is especially needed for the various tagged text files that are produced as intermediate and auxiliary by-products of the translation process when it is carried out with a Computer Assisted Translation (CAT) tool.

CAT tools are usually capable of processing at least three types of files: translation memories, bilingual texts and term bases. If the inner structure of such files follows shared specifications that are expressly intended to allow them to be shared among different environments, a company’s multilingual documentation will not depend on just one or a few vendors.

Some of the standards, which were specifically designed for the localization process, had been regulated by the Localization Industry Standards Association (LISA) until its demise on Feb 28th 2011. Later the same year, the European Telecommunications Standards Institute (ETSI) started a Special Interest Group for localization (LIS), aiming at making progress in the development of TBX, TMX, SRX, GMX-V and xml:tm standards, but without achieving relevant outcomes in this direction so far.

Translation memories

Translation memories are usually large database files which contain previously translated texts, their formatting and other properties. Some properties are set by default (e.g. source and target language, date, time, ID of the person or software that performed the translation, etc.), while others can be added as custom attributes.

Each CAT tool has its own way of storing translation memories, but language providers often need to share TMs with one another in order to perform their tasks. Translation Memory eXchange (TMX) is an XML-compliant format designed for the interchange of translation memories among different CAT tools, which allows the database structure to be represented. Again, development stopped several years ago: version 2.0 of the specification was proposed but never implemented.
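
To make the structure more concrete, here is a minimal sketch that reads a tiny TMX document with Python's standard library. The sample content, the tool name and the other header values are hypothetical placeholders; the element names (tu, tuv, seg) follow TMX 1.4 as far as this sketch assumes.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry translation memory; header values are placeholders.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="it"><seg>Salva il file.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Close the window.</seg></tuv>
      <tuv xml:lang="it"><seg>Chiudi la finestra.</seg></tuv>
    </tu>
  </body>
</tmx>
"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(text):
    """Return one {language: segment text} dictionary per translation unit."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline markup.
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(unit)
    return units


units = read_tmx(TMX_SAMPLE)
for unit in units:
    print(unit["en"], "->", unit["it"])
```

Because the format is plain XML, any tool can read such a file with a generic parser, which is the point of an exchange format.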

Bilingual files

Whatever the format of the source file, in most cases translation and related processes are carried out on text-only extractions, which include tags and placeholders to preserve the original print or display layout. These extractions also allow sentences and their translations to coexist side by side. The XML Localisation Interchange File Format (XLIFF) provides a unified structure for such bilingual documents.

XLIFF is used as a “bridge” format which gives the extracted text the appropriate structure. Specific elements and attributes provide the means to define the properties of each pair of segments (source and target), such as source and target language, the extraction tool, etc. Unlike the above formats, the XLIFF standard is being developed by the OASIS consortium. The latest version is 2.0, released in 2014.
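
The sketch below illustrates the source/target pairing on a small XLIFF 2.0 document, read with Python's standard library. The file content and identifiers are hypothetical, and only a handful of elements are handled; a real XLIFF file may carry much more metadata.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction of two sentences, one with an inline tag (pc).
XLIFF_SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
       version="2.0" srcLang="en" trgLang="it">
  <file id="f1">
    <unit id="1">
      <segment>
        <source>Press the <pc id="1">Start</pc> button.</source>
        <target>Premere il pulsante <pc id="1">Start</pc>.</target>
      </segment>
    </unit>
    <unit id="2">
      <segment>
        <source>Wait ten seconds.</source>
      </segment>
    </unit>
  </file>
</xliff>
"""

NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}


def read_xliff(text):
    """Return (source language, target language, [(unit id, source, target)])."""
    root = ET.fromstring(text)
    pairs = []
    for unit in root.iterfind(".//x:unit", NS):
        for segment in unit.iterfind("x:segment", NS):
            source = segment.find("x:source", NS)
            target = segment.find("x:target", NS)
            source_text = "".join(source.itertext())
            # An untranslated segment has no target element yet.
            target_text = "".join(target.itertext()) if target is not None else ""
            pairs.append((unit.get("id"), source_text, target_text))
    return root.get("srcLang"), root.get("trgLang"), pairs


src_lang, trg_lang, pairs = read_xliff(XLIFF_SAMPLE)
print(src_lang, trg_lang)
for unit_id, source_text, target_text in pairs:
    print(unit_id, source_text, "|", target_text)
```

Note that this reader flattens the inline tags into plain text; a translation tool would instead keep them as protected placeholders so the layout can be restored.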

Gettext Portable Object (PO) is another bilingual format, designed specifically for software localization. PO files have a very simple structure with few special attributes; translation editors usually display them in two columns, with the source language on the left and the target language on the right.
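
In the file itself, each entry pairs a msgid line (the source string) with a msgstr line (its translation). The following is a deliberately simplified reader, assuming single-line strings only; it ignores plural forms, contexts, escape sequences and multi-line strings, and the sample catalogue and source path are hypothetical.

```python
# Hypothetical English-to-Spanish catalogue; "#" lines are comments.
PO_SAMPLE = '''\
# Sample catalogue for illustration only.
msgid "Open"
msgstr "Abrir"

#: src/menu.c:42
msgid "Save as..."
msgstr "Guardar como..."
'''


def read_po(text):
    """Return (source, target) pairs from single-line msgid/msgstr entries."""
    entries = []
    source = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("msgid "):
            # Drop the keyword and the surrounding double quotes.
            source = line[len("msgid "):].strip('"')
        elif line.startswith("msgstr ") and source is not None:
            entries.append((source, line[len("msgstr "):].strip('"')))
            source = None
    return entries


entries = read_po(PO_SAMPLE)
for source, target in entries:
    print(f"{source:<15}{target}")
```

The final loop prints the entries in the two-column layout that PO editors typically show.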

Term bases

Term-Base eXchange (TBX), Universal Terminology eXchange (UTX) and Open Lexicon Interchange Format (OLIF) are three formats specifically designed for terminological and lexical data. TBX and OLIF are XML-compliant, while UTX is a simpler tab-delimited text format. All three support glossaries for both human and machine translation. They store pairs of source-target terms, as well as other terminology data, including part of speech, gender, number and more detailed lexical information.

Other file types

Other file types taking part in the computer assisted translation process have their standard formats as well:

The Segmentation Rules eXchange (SRX) format is used to define segmentation rules. Segmentation is the operation that divides the text into chunks, called segments, that can be translated one by one. When a program needs to segment a document, rules are needed to determine where one segment ends and the next one starts. In most cases a full stop will be treated as a segment end mark, but not when it is a dot within an Internet address or part of an acronym, for instance.

Global information management Metrics eXchange (GMX) is a collection of standards intended to provide common means of measuring quantitative aspects of a document, such as word counts, complexity, etc. When a translation agency receives a job request, it needs to estimate the total amount of translation work in order to quote the project. Translation quotes for exactly the same text can vary a lot, because every translation company measures the complexity and length of a text in a different way. With the development and integration of the GMX standards, the translation industry will benefit from verifiable and defined metrics applied to text documents.
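
The rule-based approach can be sketched in a few lines of Python. Each rule below combines a pattern for the text before a candidate position with a pattern for the text after it, plus a flag saying whether to break there; exception rules come first because the first matching rule decides. The rules and the sample sentence are illustrative assumptions, written with Python regular expressions rather than taken from any actual SRX file.

```python
import re

# (is_break, pattern matching the text BEFORE the position, pattern AFTER it)
RULES = [
    (False, r"\b(?:e\.g|i\.e|etc|Mr|Dr)\.$", r"\s"),  # common abbreviations
    (False, r"\b[A-Z]\.$", r"\s"),                    # initials, e.g. "U.S."
    (True, r"[.?!]$", r"\s+"),                        # end-of-sentence mark
]


def segment(text, rules=RULES):
    """Split text into segments; the first rule matching a position decides."""
    segments = []
    start = 0
    for pos in range(1, len(text)):
        before, after = text[:pos], text[pos:]
        for is_break, before_pat, after_pat in rules:
            if re.search(before_pat, before) and re.match(after_pat, after):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # later rules are not consulted for this position
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


sample = ("Visit www.example.com for details. "
          "Dr. Rossi wrote the U.S. guide. See it, e.g. on page 3.")
segments = segment(sample)
for seg in segments:
    print(seg)
```

The dots inside the Internet address never trigger a break because they are not followed by whitespace, while the abbreviation and initial rules veto breaks that the general rule would otherwise allow. Exchanging such rules in a common format lets two tools split the same document into the same segments.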
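
A small sketch shows how much two plausible counting methods can disagree on the same sentence. Neither function implements the GMX specification; both are simple assumptions chosen for illustration, as are the sample sentence and the per-word rate.

```python
import re


def count_whitespace_tokens(text):
    """Count every run of non-space characters as one word."""
    return len(text.split())


def count_alphanumeric_runs(text):
    """Count every run of letters, digits or underscores as one word."""
    return len(re.findall(r"\w+", text))


sample = "State-of-the-art tools don't cost $1,000 (see www.example.com)."

RATE_CENTS_PER_WORD = 10  # hypothetical rate, in cents

count_a = count_whitespace_tokens(sample)
count_b = count_alphanumeric_runs(sample)

print("whitespace tokens: ", count_a, "words,", count_a * RATE_CENTS_PER_WORD, "cents")
print("alphanumeric runs: ", count_b, "words,", count_b * RATE_CENTS_PER_WORD, "cents")
```

Hyphens, apostrophes, numbers and Internet addresses are counted differently by the two methods, so the second quote is twice the first for an identical text. A shared, verifiable counting method removes this source of discrepancy.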

Process standards

The growth of the translation market over the past decade has created a great need for translation service quality standards. On the one hand, demand is growing with the increasing volume of written information and the sheer number of clients not familiar with concepts like localization, internationalization and globalization. On the other hand, the Internet is making it easier than ever to start and run a translation business with very little investment, so much so that the marketplace is getting crowded with new and inexperienced providers.

As a result, governments and other institutions like translation associations have promoted the introduction of quality standards to formally describe all the steps involved in delivering a satisfactory translation service. Whatever the final product, i.e. a commercial contract, documentary subtitles, a product catalogue, a multilingual web page, etc., standards benefit end customers by providing a framework for weighing their experience with their language service provider against a recognized and unbiased criterion.

Since it is very hard to agree on a unique definition and measurement of written translation quality, process standards mainly focus on the overall quality of a traditional full translation workflow, from the service request to the output delivery. In fact, they do not provide specific criteria for translation or project quality, as these are highly subjective. Instead, they set out parameters that LSPs should consider before starting a translation project (human resources, project analysis and quoting, customer specifications and communication), during its execution (terminology management, translation, editing, formatting, proofreading, and quality control) and after delivery (translation memory maintenance, feedback tracking).

Unlike many manufacturers and service suppliers, who have ISO 9001 as the one main international certification available, the translation industry’s best practices are defined by different quality standards depending on the geographical location. In Europe, first the UNI 10574:1996 standard and later UNI EN 15038:2006 aimed to unify the terminology of translation activities, as well as the definition of good practices for the buyer-seller relationship. In North America, Canada’s Language Industry Association (AILIA) has contributed to the development of the National Standard for Translation Services CAN/CGSB 131.10-2008, adapted from Europe’s EN 15038. In the US, the American Translators Association (ATA) endorsed the ASTM F2575 Standard Guide for Quality Assurance in Translation. ISO 17100, introduced in 2015, is globally recognised and finally set an international standard for quality translation services.

Conclusions

Standardization has many enemies in the translation industry. The fact that a telecom-oriented body like ETSI had to come forward to continue the development of the former LISA standard formats should be enough to attest to this: the industry’s own stakeholders, despite their numerous organizations (EUATC, FIT, GALA, ELIA, etc.), showed no apparent will to agree upon a unified set of standards. So far, rather than growing with the technology and guiding its development, the language industry has passively stood by and watched it develop. Moreover, leaving the initiative to a small group of organizations has caused a real lack of consistency, while vendor lock-in practices are threatening the freedom of the market.

However, standards are already coming into play and will play a crucial role in the documentation and translation industry, and their progressive introduction is key to the evolution of translation technology. Although they do not cover all the aspects of translation services yet, standards offer an accepted and acceptable framework to implement better quality processes at all levels. Because they help improve overall translation management, all the stakeholders, including software developers, LSPs and final customers, are responsible for endorsing standardization and enhancing interoperability.

In fact, many software developers are adopting industry standards as a template for designing compliant programs. As a result, LSPs of all sizes can free themselves from the restrictions of commercial software. In addition, by applying process standards, large LSPs can select their translation vendors based on their real qualifications, since having purchased a certain software licence should not be a condition for hiring a language provider.

More importantly, buyers can benefit from improved transparency in the translation market and from better communication with LSPs, regardless of the role that each party plays, which helps them draw up clear specifications. In other words, standards help customers get the best translation service because they promote competition and make it easier for customers to choose the LSP that best fits their needs. Buyers should therefore commit to hiring suppliers who follow the standards.