[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: More on Segmentation in XLIFF
Hi all,

TRADOS has now joined OASIS, and I have just applied for
membership in this group. I noticed Tony’s posting of my
email conversation with him and Yves on the topic of segmentation. Unfortunately
the posted email thread did not contain the most interesting parts of our
exchange, so I take the liberty of enclosing that part in this email (see
below).

Best regards,
Magnus Martikainen
TRADOS Inc.

--------------------------------------
From: Magnus Martikainen [mailto:magnus@trados.com]
Sent: Tue 1/27/2004 8:11 PM
To: Yves Savourel
Cc: Jochen Hummel; Tony Jewtushenko; John Reid

Hi Yves,

Thank you very
much for your extensive response. It is now clear to me that I somewhat
misinterpreted the purpose of XLIFF 1.1 in its role as an interchange file
format for localisation.
I did not realize that it does not yet attempt to address the concept of
segments during localisation phases. I am glad to hear
that there is an interest in addressing this, and I believe that this would be
a very important step towards providing a common standard file format that is
also more useful for processing and interchange of content between different
tools during the most labour-intensive localisation phases: the translation and editing phases. I also definitely
agree with you that supporting translation recycling on different granularity
levels (such as paragraph, sentence, or even phrase) is of vital importance in
particular for future CMS integration capabilities. If you are
interested, as a discussion starter on how segment structure could be
introduced in a future version of XLIFF, here are a couple of very early thoughts
from my side.

Given your sample trans-unit:

  <trans-unit id='100'>
    <source xml:lang='en'>First sentence. Second sentence.</source>
  </trans-unit>

We could allow a segment structure to be introduced for the content of the trans-unit, like this:

  <trans-unit id='100'>
    <segment-group>
      <segment>
        <source xml:lang='en'>First sentence.</source>
      </segment>
      <segment>
        <source xml:lang='en'> </source>
      </segment>
      <segment>
        <source xml:lang='en'>Second sentence.</source>
      </segment>
    </segment-group>
  </trans-unit>

We can allow
<alt-trans> both on the <trans-unit> and <segment> levels, like this:

  <trans-unit id='100'>
    <segment-group>
      <segment>
        <source xml:lang='en'>First sentence.</source>
        <alt-trans> ... </alt-trans>
      </segment>
      <segment>
        <source xml:lang='en'> </source>
      </segment>
      <segment>
        <source xml:lang='en'>Second sentence.</source>
        <alt-trans> ... </alt-trans>
      </segment>
    </segment-group>
    <alt-trans> ... </alt-trans>
  </trans-unit>

It would then be
up to the user interface tools to choose how to present the alt-trans content
to the user depending on which level of content it is associated with, and the
user should have the option to translate the entire trans-unit as one piece, to
translate individual segments as they appear, or even to change the segment
structure while translating, e.g. to merge and split sentences as is sometimes
needed for producing conceptually valid translations.

Some comments on this model:

* I believe it is
important to explicitly recognise that during the phases of localisation (e.g.
between preparation, translation, editing, review) there may be a need for a
(segment) structure inside a trans-unit, as different parts of the trans-unit
content can be in different states during these phases. The segment extension of
the XLIFF standard would be targeted directly at addressing the need for tool
interoperability during these phases. (Attributes available for the
<segment> element, which I have not addressed yet, should also reflect
this.)

* I believe the
segment structure should be on the trans-unit level, and not inside the current
source and target elements. Conceptually it makes more sense to me to define a
segment as having a source and a target, as they have a pretty strong coupling;
in fact this is very similar to the way a trans-unit today has a source and a
target.

* Likewise it is
important to make the distinction between the "hard" boundaries that
the <trans-unit> represents and the "soft" boundaries that the
<segment> represents. The "hard" structure cannot ever be
changed - it is set "in stone" by the filter that produces the XLIFF,
and must remain intact for the backward conversion to work. The
"soft" structure on the other hand can be changed as desired by any
tool without affecting validity or functionality.

* For backward
compatibility reasons the segment structure inside the trans-unit is optional.
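As an illustration of how easily that structure could be stripped, here is a minimal sketch in Python (an XSLT transformation would serve equally well); note that the <segment-group> and <segment> elements are the hypothetical ones proposed above, not part of XLIFF 1.1:

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def flatten_segments(trans_unit):
    """Collapse the hypothetical <segment-group> back into a single <source>.

    <segment-group>/<segment> are the proposed (non-standard) elements from
    the example above; this is illustrative only.
    """
    group = trans_unit.find("segment-group")
    if group is None:
        return  # already unsegmented, nothing to do
    # Concatenate each segment's <source> text in document order.
    text = "".join(seg.findtext("source", default="")
                   for seg in group.findall("segment"))
    lang = group.find("segment/source").get(XML_LANG)
    trans_unit.remove(group)
    source = ET.SubElement(trans_unit, "source")
    if lang:
        source.set(XML_LANG, lang)
    source.text = text

tu = ET.fromstring(
    "<trans-unit id='100'><segment-group>"
    "<segment><source xml:lang='en'>First sentence.</source></segment>"
    "<segment><source xml:lang='en'> </source></segment>"
    "<segment><source xml:lang='en'>Second sentence.</source></segment>"
    "</segment-group></trans-unit>"
)
flatten_segments(tu)
print(ET.tostring(tu, encoding="unicode"))
```

This only covers plain-text segments; real content with inline tags would need the inline elements copied over as well.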
It can also easily be removed when interacting with XLIFF tools not yet
supporting this feature. Hopefully this would be a straightforward operation that could be accomplished e.g. with an XSLT transformation.

* An interesting
question is whether <bpt> and <ept> elements should in this model
be matched only within a <segment> or if they may remain matched within
the scope of the <trans-unit> even if they span multiple segments.
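To make the pairing question concrete, here is a minimal sketch (Python, with a deliberately simplified content model; the helper name straddling_pairs and the per-segment input fragments are mine, not from any spec) that finds <bpt>/<ept> pairs whose two halves end up in different proposed segments:

```python
import xml.etree.ElementTree as ET

def straddling_pairs(segments):
    """Return ids of <bpt>/<ept> pairs whose halves land in different segments.

    `segments` is a list of XML strings, each holding one proposed segment's
    content. Pairing uses the XLIFF 1.1 'rid' attribute, falling back to 'id'.
    """
    opens, closes = {}, {}
    for index, fragment in enumerate(segments):
        # Wrap the fragment so it parses as a single element.
        root = ET.fromstring("<seg>" + fragment + "</seg>")
        for bpt in root.iter("bpt"):
            opens[bpt.get("rid") or bpt.get("id")] = index
        for ept in root.iter("ept"):
            closes[ept.get("rid") or ept.get("id")] = index
    # A pair straddles if its <ept> exists but sits in a different segment.
    return sorted(rid for rid, seg in opens.items()
                  if closes.get(rid) not in (None, seg))

# A bold run spanning the sentence boundary: the rid='1' pair straddles it,
# so both halves would have to become isolated <it> tags.
segs = ["<bpt id='a' rid='1'>&lt;b&gt;</bpt>First sentence.",
        "Second sentence.<ept id='b' rid='1'>&lt;/b&gt;</ept>"]
print(straddling_pairs(segs))  # ['1']
```

Pairs reported this way are exactly the ones that would have to be rewritten as isolated <it> tags if pairing were required to be segment-internal.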
Conceptually, at least from a translation memory point-of-view, it would be
valuable to have them matched within the segments. On the other hand that would
potentially require changing some of them to <it> when introducing
segmentation. (This would by the way be the same if a trans-unit were divided
into multiple parts in XLIFF 1.1, as one of your workarounds suggested.) As a
one-way operation this is OK, but things get complicated if the segmentation is
later changed, or the segments are removed. I'd be very
interested in hearing more of your thoughts on this topic. Thank you again!

Magnus

-----Original Message-----
From: Yves Savourel [mailto:yves@opentag.com]
Sent: Tuesday, January 27, 2004 1:40 PM
To: Magnus Martikainen
Cc: Jochen Hummel; Tony Jewtushenko; John Reid
Subject: RE: Questions on XLIFF

Dear Magnus,

I don't think
I've met you but I certainly know your name and I've heard (good things)
about you. So I'm thankful that you took the time to share your thoughts
about XLIFF. I've CCed two of
the XLIFF TC members in this answer, thinking they may be able to bring
useful insights on this topic: Tony from Oracle (the TC Chair) and John from
Novell (developer using XLIFF). I think there are
different ways to approach the use of XLIFF documents in translation
tools, depending on how much of XLIFF's features are used. So maybe a step-by-step approach could be considered.

--1- Simple Use

The first one is
very simple and pretty much works already. For example, RWS has been
using TagEditor to translate XLIFF files for years now, and we have--almost--no
problems. The reason is that we currently use simple content: some
text inside <trans-unit>, no <alt-trans>. So the issue of segmentation
doesn't exist: TagEditor segments as it wants and we get back clean files our
filters can merge.

We do have to use some workarounds to ensure that:

a) The translation goes into the <target> elements (so we make sure the XLIFF document has a <target> element with the source text).

b) The <trans-unit> elements with the attribute translate='no' are protected (so we add an additional <NTBT> tag inside any <target> with translate='no' and use a DTD settings file where <NTBT> is protected).

It's not very
pretty, but overall the process works.

--2- Segmenting

The real problems
start when XLIFF contains <alt-trans> and/or when there is an assumption from
the producer of the XLIFF document that the translation tool will look at the
<trans-unit> elements as leverageable segments. You wrote:
"One of the main obstacles I see for XLIFF as a generic interchange
format for translatable content is how the format requires segmentation to
be applied at a filtering stage, without allowing it to be changed later in
the process." I think you may
be misled by the name of the element. An XLIFF <trans-unit> does not
pre-suppose any type of segmentation. So it is not quite correct to say that the
segmentation cannot be changed later: nothing prevents an XLIFF document from being
manipulated in any way, as long as it is returned to a form that will be
usable by its original filter. In other words, one could take an extracted
XLIFF document where the <trans-unit> elements contain 'paragraphs', run
it through a utility that will use the segmentation engine of a translation
tool such as Trados' and "re-break" the <trans-unit> elements. This
would give you the equivalent of a pre-segmented Trados RTF file. For example, an
original extraction like this:

  <trans-unit id='100'>
    <source xml:lang='en'>First sentence. Second sentence.</source>
  </trans-unit>

Could be transformed into something like this:

  <group id='100' restype='x-myOriginalTUgroup'>
    <trans-unit id='100-1'>
      <source xml:lang='en'>First sentence. </source>
    </trans-unit>
    <trans-unit id='100-2'>
      <source xml:lang='en'>Second sentence.</source>
    </trans-unit>
  </group>

Or you could also use the <mrk> inline element to do the same thing, like this:

  <trans-unit id='100'>
    <source xml:lang='en'><mrk mtype='phrase'>First sentence. </mrk><mrk mtype='phrase'>Second sentence.</mrk></source>
  </trans-unit>

The problem
(currently) is that XLIFF does not address explicitly the topic of segmentation
at all. There are no guidelines or rules to tell the extractor how to
represent segments, or even what a segment is. Will this be
addressed in a future version of XLIFF? I sure hope so. There are several
possibilities:

- New elements or
a different namespace inside the <source> and <target> element
(currently non-XLIFF namespaces are not allowed there; you have to use <mrk> for
assigning specific information to runs of text).

- SRX could possibly be used at some level, although I'm not sure yet how this would play into
the picture.

- Some guidelines
could also be set to use <group> and <trans-unit> in a certain way to
decompose a 'paragraph' into 'sentences', so any filter could rebuild the
original 'paragraph' and merge it back.

--3- The Core Issue

I think we could
see the whole problem from a different angle. Some of XLIFF's requirements do
not fit into a classic translation tool system because XLIFF is not a source document,
but more like the output of a content management system. In my view TM
tools serve essentially two purposes: first, they make it possible not to retranslate something that was translated in a previous project; and secondly, they allow an existing translation to be re-used when working on a new text. I think it's
important to make a distinction between these two functions. The first one is
a fix for a problem that exists upstream in the process: the fact that we
are not able to know what has changed from one version of the source
document to the next, or at least not able to package it in a way that is usable. But
nowadays, this limitation is slowly disappearing with the use of CMS. More
and more, the customer of translation knows what has changed and does not need the TM tool to provide that function. This said, TM tools are still very
useful because they still provide their second function: re-use of existing translations when translating new text. So, maybe one way
to approach XLIFF is to see it just like a CMS: <trans-unit> elements being holders of text objects (whatever their granularity),
<alt-trans> elements being existing translations of these text objects. Maybe the support
of XLIFF could be done step by step: first addressing the more general issues, which are not linked to CMS-type problems and exist in XML formats other than XLIFF:

a) support for taking the source text from one place (<source>) and putting it in another (<target>), and

b) support for conditional translation (translate='no').

For example, how
would I define DTD settings for this file? (a real-life example):

  <dialogue xml:lang="en-gb">
    <rsrc id="123">
      <component id="456" type="image">
        <data type="text">images/cancel.gif</data>
        <data type="coordinates">12,20,50,14</data>
      </component>
      <component id="789" type="caption">
        <data type="text">Cancel</data>
        <data type="coordinates">12,34,50,14</data>
      </component>
    </rsrc>
  </dialogue>

Here only "Cancel" is to be localised. So the only efficient way to express what is translatable
would be by an XPath expression: "//component[@type='caption']/data[@type='text']". We run into such
issues with many XML documents and have to create XSL templates to
work around TagEditor's limitations. So allowing the DTD settings to be
more flexible in that aspect would help not only in supporting XLIFF,
but more importantly in supporting many other XML formats. Then, we could
look at how <alt-trans> could be supported, and its effects on how <trans-unit> elements should be arranged for that purpose.

So to go back to
your original questions: "Ideally a
file filter should only need to distinguish translatable parts of a file from
non-translatable parts, and leave it at that. Segmentation should be applied
and managed (and possibly also changed) by other tools later in the
process, without affecting the ability of the file filter to convert the
segmented file back to native format."

"1) Is there a convenient and compatible way to support this in XLIFF 1.1?"

Answer: Currently XLIFF does not assume anything specific about segmentation. So
XLIFF filters can do this, and the ones I know actually do exactly that: just separate the text from the code.

"2) Are there already plans on extending the XLIFF standard in the future to better support this?"

I certainly hope
we will be able to come up with a mechanism to integrate segmentation, a
way of re-assembling segmented <trans-unit> elements, or something of that
order. And your collaboration would certainly be very welcome. To conclude, I'd
like to underline that there are more and more cases now where the
granularity of the text to translate cannot be only driven by the translator's
workbench. With CMS we have to take into account the fact that part of the
traditional function of the TM tool can now be done at the document
authoring/management level, in some cases working with 'paragraph' rather than
sentence. I think that ultimately, if we find a way to reconcile
those two concepts, the remaining problems of XLIFF integration in translation
tools will be solved. I think Trados
has more and more experience in working with CMS, so maybe some of that
knowledge could be used for XLIFF as well? That's all I can
think of for now.

Cheers,
-yves

________________________________
From: Magnus Martikainen [mailto:magnus@trados.com]
Sent: Monday, January 26, 2004 8:47 PM
To: Yves Savourel
Cc: Jochen Hummel
Subject: Questions on XLIFF

Hi Yves,

My apologies if
you got another copy of this email - I accidentally hit the wrong key while
typing and it was sent before I had finished it. I tried to recall it, but I
may have been too late. I don't think we
have ever met in person, but I am well aware of your extensive
presence in the localisation industry. You may remember my name, e.g. from the
LISA ITS group. I am the Chief Software Architect in TRADOS. Jochen Hummel
suggested that I contact you directly with some questions about XLIFF - I
hope you don't mind? I have been
looking closely at the XLIFF 1.1 specification lately, amongst other things in
order to see how we can better support it as an interchange format or even as
a natively supported file format for TRADOS in the future. One of the main
obstacles I see for XLIFF as a generic interchange format for translatable
content is how the format requires segmentation to be applied at a
filtering stage, without allowing it to be changed later in the process.

Let me explain: since all content
in XLIFF must be stored inside translation units, a file conversion tool
that produces XLIFF output must decide where to introduce the translation
unit boundaries. While for some file formats there may be natural breaks
(e.g. in software files, which XLIFF seems to be concentrated on), when dealing with larger volumes of running text (e.g. in documentation and help files),
the file conversion tool would have many options on how to break the content
into segments (e.g. based on tags, paragraphs, or sentences). Once the
translation units have been introduced in the XLIFF file there is no way to change
the segmentation, as the process for assembling the XLIFF body and skeleton
is not at all governed by the standard, but is left completely up to
the file conversion tool. If an XLIFF translation unit were changed into two
translation units this would very likely break the conversion of the
translated XLIFF file back to the native file format. Segmentation of
text into sentences (this being the most common type of segmentation used
with translation memories) is a complex task that requires sophisticated
linguistically aware algorithms to produce good results. Translation
memory tools have over the years developed and fine-tuned these algorithms for
different source languages, and this is what has been used to produce the
translation memory content many companies have built up over time. The only
way to achieve maximum recycling against such translation memories is to
use the very same algorithms to identify sentences in the content to be
translated. If the content to
be translated resides in an XLIFF file, translation unit boundaries have
already been set by the file conversion tool, and cannot easily be adapted
to suit the translation memory. As it is unlikely that the file
conversion tool uses the exact same segmentation algorithm as the translation
memory this will lead to reduced translation memory recycling. Even small
differences in segmentation between the translation memory content and the
XLIFF files can lead to big costs. Further, as
segmentation is generated by the file conversion tool, also recycling between
file formats, or even recycling within the same file format when
different file conversion tools have been used, can be seriously affected.

As I see it, the problem is that the notion of a translation unit is enforced upon the
content at a stage in the process long before it is known what would be the
most suitable segmentation for that content. Ideally a file filter
should only need to distinguish translatable parts of a file from
non-translatable parts, and leave it at that. Segmentation should be applied
and managed (and possibly also changed) by other tools later in the
process, without affecting the ability of the file filter to convert the
segmented file back to native format.

My questions:

1) Is there a convenient and compatible way to support this in XLIFF 1.1?

2) Are there
already plans on extending the XLIFF standard in the future to better support
this?

Best regards,

Magnus Martikainen
Chief Software Architect
TRADOS Incorporated
Ph: +1-408-743 3564