delete ligature support for Arabic "la" from the less(1) command line

Ingo Schwarze
Hi,

i have to admit that i am neither able to speak nor to write nor
to understand the Arabic language nor the Arabic script, but here
is my current, probably incomplete understanding of what our less(1)
program is trying to do with Arabic ligatures.

If somebody is reading this who is able to read and write Arabic
or an Indian language heavily using ligatures, feedback is highly
welcome.

Arabic is a cursive script, which means that when writing Arabic,
characters do not map 1:1 to glyphs.  Instead, there are rules about
how adjacent characters attach to each other, forming ligatures.

As an extremely simple example, consider the Arabic adverb "la",
which means the same as the English adverb "no".  It consists of
the two letters U+0644 LAM and U+0627 ALEF, the LAM appearing before
(i.e. to the right of) the ALEF.  However, you do not write both
letters separately.  Instead, the ALEF leans forward (to the left)
and attaches to the LAM, forming the glyph U+FEFB, ARABIC LIGATURE
LAM WITH ALEF ISOLATED FORM.  When displayed in a fixed width font,
that ligature only occupies a single display column just like any
other Arabic or Latin glyph.  The LAM WITH ALEF glyph is not a
double-width glyph like Japanese or Chinese characters typically
are.

So, when this happens, you have four bytes of UTF-8 forming two
Unicode characters, and *together*, these two characters occupy
only one single display column.
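
To make the byte counting concrete, here is a minimal sketch in C that
decodes the four bytes of "la" into two code points.  The helper names
decode_utf8() and count_codepoints() are made up for this illustration
and are not taken from less(1); the decoder only handles sequences of
up to three bytes and does not reject overlong encodings.

```c
#include <stddef.h>

/*
 * Decode one UTF-8 sequence of at most three bytes starting at s.
 * Store the code point in *cp and return the number of bytes used,
 * or 0 if the input is malformed or longer than three bytes.
 * Hypothetical helper for illustration, not code from less(1).
 */
size_t
decode_utf8(const unsigned char *s, size_t len, unsigned int *cp)
{
	if (len >= 1 && s[0] < 0x80) {
		*cp = s[0];
		return 1;
	}
	if (len >= 2 && (s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
		*cp = (s[0] & 0x1fu) << 6 | (s[1] & 0x3fu);
		return 2;
	}
	if (len >= 3 && (s[0] & 0xf0) == 0xe0 &&
	    (s[1] & 0xc0) == 0x80 && (s[2] & 0xc0) == 0x80) {
		*cp = (s[0] & 0x0fu) << 12 | (s[1] & 0x3fu) << 6 |
		    (s[2] & 0x3fu);
		return 3;
	}
	return 0;
}

/* Count the code points in a buffer; return -1 on malformed input. */
int
count_codepoints(const unsigned char *s, size_t len)
{
	unsigned int cp;
	size_t used;
	int n = 0;

	while (len > 0) {
		if ((used = decode_utf8(s, len, &cp)) == 0)
			return -1;
		s += used;
		len -= used;
		n++;
	}
	return n;
}
```

Feeding it the bytes 0xd9 0x84 0xd8 0xa7 yields U+0644 and U+0627, so
count_codepoints() reports two characters for four bytes, while a
terminal that forms the ligature paints them into one column.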

Note that in the default configuration, our xterm(1) is not able
to display Arabic characters at all.  But even when you run
  xterm -fa arabic
or
  xterm -fa fixed
which uses FreeType support instead of the default X toolkit font
support, such that xterm(1) does become able to display single
Arabic characters, it still displays the word "la" incorrectly,
failing to generate the required ligature and instead displaying
the two characters LAM and ALEF separately.

So i installed konsole-18.12.0p1 for testing (which pulls in
ridiculous amounts of dependencies, dozens of them, but oh well,
i guess support for advanced Unicode features isn't trivial).
The konsole(1) program does display the word "la" correctly, as a
ligature.

Now, running less(1) inside konsole(1), i found that columnation
is already subtly broken.  As long as the "la" ligature is visible
on screen, all is fine.  Now scroll to the right until the "la"
appears in the first screen column.  Then scroll one more column
to the right by pressing "1 RIGHTARROW".  Now you see *half* the
ligature, i.e. an isolated ALEF, in the first column of the screen,
even though the Arabic word does not contain an isolated ALEF.
Besides, we just attempted to scroll the "la" off screen, so the
ALEF now appears in the column one to the right of where the "la"
should actually be, and all the rest of the line is shifted one
column to the right, too, so columnation is now off by one.
Scrolling back left, columnation recovers to correct display.

I strongly suspect i broke that during my previous UTF-8 cleanup
work on less(1).

However, LAM WITH ALEF is literally the only ligature that less(1)
supports, together with three variations (with MADDA above, with
HAMZA above, and with HAMZA below).  But there are hundreds of
ligatures in Arabic, see

  https://www.unicode.org/charts/PDF/UFB50.pdf
  https://www.unicode.org/charts/PDF/UFE70.pdf

I have no idea how many of those work in konsole(1) - but i'm sure
none of those, except the four LAM WITH ALEF discussed here, work
with less(1), so i think support for LAM WITH ALEF provided no value
in the first place.  The way it is implemented, with an ad-hoc table
inside less(1) of character combinations that form ligatures, is
just wrong and not sustainable by any stretch of the imagination,
i think.

On top of that, how characters combine in Arabic is strongly context
dependent; even the syllable "la" forms a different ligature depending
on whether it is isolated or at the end of a longer word, and none
of the context dependencies are implemented in less(1) anyway.
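
As a rough sketch of that context dependency, the following C fragment
maps the four LAM + ALEF pairs to their isolated and final presentation
forms from the U+FE70 chart linked above.  The struct, the table and
the function lam_alef_ligature() are hypothetical names invented for
this example, and the joins_prev flag stands in for the real joining
analysis of the preceding letter, which is not attempted here.

```c
#include <stddef.h>

/* One LAM + ALEF pair and its two Unicode presentation forms. */
struct lam_alef_form {
	unsigned int alef;	/* second character of the pair */
	unsigned int isolated;	/* form when nothing joins from before */
	unsigned int final;	/* form when LAM joins a preceding letter */
};

static const struct lam_alef_form lam_alef_forms[] = {
	{ 0x0622, 0xfef5, 0xfef6 },	/* ALEF WITH MADDA ABOVE */
	{ 0x0623, 0xfef7, 0xfef8 },	/* ALEF WITH HAMZA ABOVE */
	{ 0x0625, 0xfef9, 0xfefa },	/* ALEF WITH HAMZA BELOW */
	{ 0x0627, 0xfefb, 0xfefc },	/* ALEF */
};

/*
 * Return the presentation form for the pair (ch1, ch2), or 0 if the
 * pair is not one of the four LAM WITH ALEF combinations.
 * joins_prev is an assumption of this sketch: the caller has to know
 * whether the letter before the LAM connects to it.
 */
unsigned int
lam_alef_ligature(unsigned int ch1, unsigned int ch2, int joins_prev)
{
	size_t i;

	if (ch1 != 0x0644)	/* first character must be LAM */
		return 0;
	for (i = 0; i < sizeof(lam_alef_forms) / sizeof(*lam_alef_forms);
	    i++) {
		if (ch2 == lam_alef_forms[i].alef)
			return joins_prev ? lam_alef_forms[i].final :
			    lam_alef_forms[i].isolated;
	}
	return 0;
}
```

Even this toy version needs an extra input that a plain pair table
cannot provide, which illustrates why extending the existing table
would not get very far.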

And finally, people say the situation in many Indian languages is
even more dire than in Arabic, so what our less(1) tries to do is
almost certainly completely useless for those languages, even if
we expanded the ad-hoc table.

So, i propose to delete support for combining characters into
ligatures from our less(1): at this point, it is only used for
typing at the less prompt anyway (and not for the file displayed),
only for Arabic, and only for the single ligature "la".  If we ever
want better ligature support in the future, i think we would have
to make a fresh start anyway - and i think there are many other
things to do before that.

Note that this only removes support for combining characters into
ligatures that can also stand on their own; support for purely
combining accents like U+0300 COMBINING GRAVE ACCENT and U+3099
COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK remains intact.
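
The width rule that remains after the patch can be sketched roughly as
follows.  This is not the code from charset.c: the two tables are tiny
illustrative subsets chosen for this example, and the names in_table(),
char_width() and string_width() are assumptions.

```c
#include <stddef.h>

struct range {
	unsigned int first;
	unsigned int last;
};

/* Illustrative subsets only; the real tables are far larger. */
static const struct range composing[] = {
	{ 0x0300, 0x036f },	/* combining diacritical marks */
	{ 0x3099, 0x309a },	/* combining kana sound marks */
};
static const struct range wide[] = {
	{ 0x3041, 0x3096 },	/* hiragana letters */
	{ 0x4e00, 0x9fff },	/* CJK unified ideographs */
};

static int
in_table(unsigned int ch, const struct range *tab, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ch >= tab[i].first && ch <= tab[i].last)
			return 1;
	return 0;
}

/* Width of one character: 0 if composing, 2 if wide, 1 otherwise. */
int
char_width(unsigned int ch)
{
	if (in_table(ch, composing, sizeof(composing) / sizeof(*composing)))
		return 0;
	if (in_table(ch, wide, sizeof(wide) / sizeof(*wide)))
		return 2;
	return 1;
}

/* Sum of per-character widths; no pair of characters is special. */
int
string_width(const unsigned int *s, size_t n)
{
	size_t i;
	int w = 0;

	for (i = 0; i < n; i++)
		w += char_width(s[i]);
	return w;
}
```

Under this rule, "e" followed by U+0300 still counts as one column,
whereas LAM followed by ALEF now counts as two at the prompt.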

OK?
  Ingo


Index: charset.c
===================================================================
RCS file: /cvs/src/usr.bin/less/charset.c,v
retrieving revision 1.25
diff -u -p -r1.25 charset.c
--- charset.c 31 Aug 2019 13:44:29 -0000 1.25
+++ charset.c 31 Aug 2019 21:30:25 -0000
@@ -474,13 +474,6 @@ static struct wchar_range comp_table[] =
 };
 
 /*
- * Special pairs, not ranges.
- */
-static struct wchar_range comb_table[] = {
-	{0x0644, 0x0622}, {0x0644, 0x0623}, {0x0644, 0x0625}, {0x0644, 0x0627},
-};
-
-/*
  * Characters with general category values
  * Cc: Other, Control
  * Cf: Other, Format
@@ -825,22 +818,4 @@ is_wide_char(LWCHAR ch)
 {
 	return (is_in_table(ch, wide_table,
 	    (sizeof (wide_table) / sizeof (*wide_table))));
-}
-
-/*
- * Is a character a UTF-8 combining character?
- * A combining char acts like an ordinary char, but if it follows
- * a specific char (not any char), the two combine into one glyph.
- */
-int
-is_combining_char(LWCHAR ch1, LWCHAR ch2)
-{
-	/* The table is small; use linear search. */
-	int i;
-	for (i = 0; i < sizeof (comb_table) / sizeof (*comb_table); i++) {
-		if (ch1 == comb_table[i].first &&
-		    ch2 == comb_table[i].last)
-			return (1);
-	}
-	return (0);
 }
Index: cmdbuf.c
===================================================================
RCS file: /cvs/src/usr.bin/less/cmdbuf.c,v
retrieving revision 1.19
diff -u -p -r1.19 cmdbuf.c
--- cmdbuf.c 28 Jun 2019 13:35:01 -0000 1.19
+++ cmdbuf.c 31 Aug 2019 21:30:26 -0000
@@ -179,19 +179,10 @@ cmd_step_common(char *p, LWCHAR ch, int
 				if (bswidth != NULL)
 					*bswidth = prlen;
 			} else {
-				LWCHAR prev_ch = step_char(&p, -1, cmdbuf);
-				if (is_combining_char(prev_ch, ch)) {
-					if (pwidth != NULL)
-						*pwidth = 0;
-					if (bswidth != NULL)
-						*bswidth = 0;
-				} else {
-					if (pwidth != NULL)
-						*pwidth = is_wide_char(ch)
-						    ? 2 : 1;
-					if (bswidth != NULL)
-						*bswidth = 1;
-				}
+				if (pwidth != NULL)
+					*pwidth = is_wide_char(ch) ? 2 : 1;
+				if (bswidth != NULL)
+					*bswidth = 1;
 			}
 		}
 	}
Index: funcs.h
===================================================================
RCS file: /cvs/src/usr.bin/less/funcs.h,v
retrieving revision 1.24
diff -u -p -r1.24 funcs.h
--- funcs.h 31 Aug 2019 13:44:29 -0000 1.24
+++ funcs.h 31 Aug 2019 21:30:26 -0000
@@ -65,7 +65,6 @@ LWCHAR step_char(char **, int, char *);
 int is_composing_char(LWCHAR);
 int is_ubin_char(LWCHAR);
 int is_wide_char(LWCHAR);
-int is_combining_char(LWCHAR, LWCHAR);
 void cmd_reset(void);
 void clear_cmd(void);
 void cmd_putstr(char *);
Index: less.1
===================================================================
RCS file: /cvs/src/usr.bin/less/less.1,v
retrieving revision 1.56
diff -u -p -r1.56 less.1
--- less.1 20 Aug 2019 11:34:18 -0000 1.56
+++ less.1 31 Aug 2019 21:30:26 -0000
@@ -1804,7 +1804,7 @@ Language for determining the character s
 The character encoding
 .Xr locale 1 .
 It decides which byte sequences form characters, what their display
-width is, and which characters are composing or combining characters.
+width is, and which characters are composing characters.
 .It Ev LESS
 Options which are passed to
 .Nm

Re: delete ligature support for Arabic "la" from the less(1) command line

Evan Silberman
Ingo Schwarze <[hidden email]> wrote:

> I have no idea how many of those work in konsole(1) - but i'm sure
> none of those, except the four LAM WITH ALEF discussed here, work
> with less(1), so i think support for LAM WITH ALEF provided no value
> in the first place.  The way it is implemented, with an ad-hoc table
> inside less(1) of character combinations that form ligatures, is
> just wrong and not sustainable by any stretch of the imagination,
> i think.
>
> On top of that, how characters combine in Arabic is strongly context
> dependent; even the syllable "la" forms a different ligature depending
> on whether it is isolated or at the end of a longer word, and none
> of the context dependencies are implemented in less(1) anyway.
>
> And finally, people say the situation in many Indian languages is
> even more dire than in Arabic, so what our less(1) tries to do is
> almost certainly completely useless for those languages, even if
> we expanded the ad-hoc table.
>
> So, i propose to delete support for combining characters into
> ligatures from our less(1): at this point, it is only used for
> typing at the less prompt anyway (and not for the file displayed),
> only for Arabic, and only for the single ligature "la".  If we ever
> want better ligature support in the future, i think we would have
> to make a fresh start anyway - and i think there are many other
> things to do before that.

I did less practical research than you did when I looked at this bit of
code but your conclusions match mine: this is an attempt at an
implementation of a tiny subset of the vastly complex problem of digital
typesetting of the Arabic alef-bet. Keeping the code is probably worse
than no solution at all, because (as you noted) it's the wrong
implementation in the wrong place and "improving" it by adding more
combination rules here would be a mistake.

--Evan Silberman

Re: delete ligature support for Arabic "la" from the less(1) command line

Mohammadreza Abdollahzadeh
In reply to this post by Ingo Schwarze
Hi Ingo,
Persian is my native language and I think that the major problem that
all RTL (Right-To-Left) languages like Persian and Arabic currently suffer
from is the lack of BiDi (bidirectionality) support in console and terminal
environments like xterm(1). KDE konsole(1) supports bidi, and that's why it
shows ligatures correctly.
I think any attempt to fix such problems must first start with adding bidi
support to xterm and other terminal environments.

best regards.

Re: delete ligature support for Arabic "la" from the less(1) command line

Ingo Schwarze
Hello Mohammadreza,

Mohammadreza Abdollahzadeh wrote on Sun, Sep 01, 2019 at 09:40:16AM +0430:

> Persian is my native language and I think that the major problem that
> all RTL (Right-To-Left) languages like Persian and Arabic currently suffer
> from is the lack of BiDi (bidirectionality) support in console and terminal
> environments like xterm(1). KDE konsole(1) supports bidi, and that's why it
> shows ligatures correctly.
> I think any attempt to fix such problems must first start with adding bidi
> support to xterm and other terminal environments.

Thank you for your feedback!

If i understand correctly, xterm(1) does indeed have that problem.
I prepared a test file that contains, in this order,

 - some Latin characters
 - the Arabic word "la" ("no"), i.e. first LAM, then ALEF
 - some more Latin characters
 - the Arabic word "al" ("the"), i.e. first ALEF, then LAM
 - some final Latin characters

And indeed, xterm(1) does not respect the writing direction of the
individual words.  When cat(1)'ing the file to stdout, both xterm(1)
and konsole(1) show all the words from left to right, but *inside*
each word, konsole(1) uses the correct writing direction: right to
left for Arabic and left to right for Latin.  For example, in the
Arabic word "al", konsole(1) correctly shows the ALEF right of the
LAM, whereas xterm(1) wrongly shows the ALEF left of the LAM.

I'm not entirely sure this has much to do with ligatures, though.
What matters for building ligatures is only the logical ordering,
the ordering in *time* so to speak, i.e. what comes before and what
comes after.  LAM before ALEF has to become the ligature glyph "la",
whereas ALEF before LAM remains two glyphs.  Technically, the
question of ordering in space, whether glyphs are painted onto the
screen right to left or left to right, only comes into play after
characters have already been combined into glyphs.

Actually, now that you bring up the topic, i see another situation
where less(1) causes an issue.  Let's use konsole(1) and not xterm(1)
such that we get the correct writing direction, and let's put the
word "al" onto the screen.  No ligature here, so that part of the
topic is suspended for a moment.  Now let's slowly scroll right in
one-column steps.  All is fine as long as the word "al" is completely
visible on screen.  But when the final letter LAM of "al" is in the
last (leftmost) column of the screen and you scroll right one more
column, something weird happens, even in konsole(1).  You would
expect the final letter LAM to scroll off screen first and the initial
letter ALEF to remain on the screen for a little longer.  Instead,
less(1) incorrectly thinks the *initial* letter of the word scrolls
off screen first, and it tells konsole(1) to display the ALEF in the
leftmost column of the screen while the LAM just went off-screen.
That looks weird because there is no word in that text beginning
with ALEF.

This means that being able to properly view Arabic or Farsi text
with the default OpenBSD terminal emulator and pager would require

 1. bidi support in xterm(1)
    to render Farsi words with the correct writing direction
 2. ligature support in xterm(1)
    to correctly connect letters
 3. bidi support in less(1)
    to correctly scroll parts of words on and off screen, horizontally
 4. ligature support in less(1)
    for correct columnation

As far as i understand, you are saying that the extremely fragmentary
support for item 4 which we happen to have right now is not really
useful without items 1-3, and even when using konsole(1), which does
have items 1 and 2, implementing item 3 before item 4 would make
sense because item 3 is more important.

So my understanding is that you are not objecting to the patch because
the fragmentary support for item 4 is practically useless in isolation.


The following is not related to this patch, but i think it makes
sense to mention it here: regarding the future, i think items 1 and
3 are much easier to support than items 2 and 4 because bidi support,
if i understand correctly, only needs one bit of information per
character because it only needs to know whether the character is
part of a right to left or left to right script, so the complexity
on the libc level, where we want complexity least of all places,
is comparable to other boolean character properties like those
listed in the iswalnum(3) manual page.  Realistically, though,
bidi support would still be a large project, and i don't think it
makes sense to tackle it any time soon.

Ligature support feels much worse than bidi support because the
mapping required is not merely character -> boolean but (character +
character) -> character, which is more complicated than even the
(character + character) -> -1/0/+1 mapping required for collation
support - and we decided that we don't want collation support in
libc because it would cause excessive complexity.  Admittedly,
collations are strongly locale-dependent, while i'm not sure ligatures
are locale-dependent, so with some luck, they might be simpler in
that respect.  But a pair-to-character mapping, even without locale
dependency, still sounds so scary that i doubt we want it in libc
even in the long term.

Thanks, you helped make the big picture a bit clearer for me.

Yours,
  Ingo

Re: delete ligature support for Arabic "la" from the less(1) command line

Ali Farzanrad
Hi Ingo,

Thanks for your effort on Unicode support.  I hope my feedback as a
native Persian speaker will be helpful.

Ingo Schwarze <[hidden email]> wrote:

> If i understand correctly, xterm(1) does indeed have that problem.
> I prepared a test file that contains, in this order,
>
>  - some Latin characters
>  - the Arabic word "la" ("no"), i.e. first LAM, then ALEF
>  - some more Latin characters
>  - the Arabic word "al" ("the"), i.e. first ALEF, then LAM
>  - some final Latin characters
>
> And indeed, xterm(1) does not respect the writing direction of the
> individual words.  When cat(1)'ing the file to stdout, both xterm(1)
> and konsole(1) show all the words from left to right, but *inside*
> each word, konsole(1) uses the correct writing direction: right to
> left for Arabic and left to right for Latin.  For example, in the
> Arabic word "al", konsole(1) correctly shows the ALEF right of the
> LAM, whereas xterm(1) wrongly shows the ALEF left of the LAM.
>

There are many rules.  Each letter / character has a direction by
itself.  For example English letters are LTR (left-to-right), Arabic /
Persian letters are RTL, but some characters, say symbols, have no
direction.  For example, when you write:

    'A' '+' 'B'

It should be displayed as is ('+' is LTR), but when you write:

    'A' ALEF '+' LAM 'B'

The '+' should be displayed on the left side of ALEF (here '+' is RTL):

    'A' LAM '+' ALEF 'B'

I think you need to detect all maximal non-LTR substrings (which don't
start or end with a symbol) inside LTR strings to render them correctly.
There are also RTL / LTR control characters in Unicode which manipulate
this behaviour.
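
A minimal sketch of that rule in C, under loud assumptions: only ASCII
letters count as LTR, the whole Arabic block U+0600 to U+06FF counts as
RTL, everything else is treated as having no direction, and direction
control characters are ignored.  This is not the Unicode Bidirectional
Algorithm, and char_dir() and visual_order() are names invented for the
example.

```c
#include <stddef.h>

enum dir { DIR_LTR, DIR_RTL, DIR_NONE };

/* Grossly simplified classification, for illustration only. */
static enum dir
char_dir(unsigned int ch)
{
	if (ch >= 0x0600 && ch <= 0x06ff)
		return DIR_RTL;
	if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
		return DIR_LTR;
	return DIR_NONE;
}

/*
 * Convert logical order to visual order in place by reversing every
 * maximal substring that contains no LTR character and that starts
 * and ends with an RTL character.
 */
void
visual_order(unsigned int *s, size_t n)
{
	size_t i = 0, j, start, end;
	unsigned int tmp;

	while (i < n) {
		if (char_dir(s[i]) != DIR_RTL) {
			i++;
			continue;
		}
		/* Extend the run; end tracks the last RTL character. */
		start = end = i;
		for (j = i + 1; j < n && char_dir(s[j]) != DIR_LTR; j++)
			if (char_dir(s[j]) == DIR_RTL)
				end = j;
		i = end + 1;
		/* Reverse s[start..end]. */
		while (start < end) {
			tmp = s[start];
			s[start++] = s[end];
			s[end--] = tmp;
		}
	}
}
```

With the input 'A' ALEF '+' LAM 'B' this produces 'A' LAM '+' ALEF 'B',
matching the example above, while 'A' '+' 'B' is left untouched.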

> I'm not entirely sure this has much to do with ligatures, though.
> What matters for building ligatures is only the logical ordering,
> the ordering in *time* so to speak, i.e. what comes before and what
> comes after.  LAM before ALEF has to become the ligature glyph "la",
> whereas ALEF before LAM remains two glyphs.  Technically, the
> question of ordering in space, whether glyphs are painted onto the
> screen right to left or left to right, only comes into play after
> characters have already been combined into glyphs.
>
> Actually, now that you bring up the topic, i see another situation
> where less(1) causes an issue.  Let's use konsole(1) and not xterm(1)
> such that we get the correct writing direction, and let's put the
> word "al" onto the screen.  No ligature here, so that part of the
> topic is suspended for a moment.  Now let's slowly scroll right in
> one-column steps.  All is fine as long as the word "al" is completely
> visible on screen.  But when the final letter LAM of "al" is in the
> last (leftmost) column of the screen and you scroll right one more
> column, something weird happens, even in konsole(1).  You would
> expect the final letter LAM to scroll off screen first and the initial
> letter ALEF to remain on the screen for a little longer.  Instead,
> less(1) incorrectly thinks the *initial* letter of the word scrolls
> off screen first, and it tells konsole(1) to display the ALEF in the
> leftmost column of the screen while the LAM just went off-screen.
> That looks weird because there is no word in that text beginning
> with ALEF.
>

It's a difficult problem.  You need to consider all maximal non-LTR
substrings, and all LTR / RTL modifiers.  Also consider a file with long
RTL lines; users prefer to see the beginning of lines (in all languages,
readers read from the start), so less(1) should display the rightmost part
of each line, and when the user scrolls the text to the right, less(1)
should display the left side of each line.

I think that if xterm had a complete RTL mode with swapped right and
left keys, it might solve many problems.  In your example in RTL xterm,
there will be no right scroll (because of swapped keys) and when you
scroll less(1) to the left, less(1) will correctly scroll off the
initial letter.  Of course it will not work on complex mixed RTL / LTR
texts, but it solves the problem in most common situations.

> This means that being able to properly view Arabic or Farsi text
> with the default OpenBSD terminal emulator and pager would require
>
>  1. bidi support in xterm(1)
>     to render Farsi words with the correct writing direction
>  2. ligature support in xterm(1)
>     to correctly connect letters
>  3. bidi support in less(1)
>     to correctly scroll parts of words on and off screen, horizontally

Given the previous example (a file with long RTL lines), I don't
agree with bidi support in less(1).

>  4. ligature support in less(1)
>     for correct columnation
>
> As far as i understand, you are saying that the extremely fragmentary
> support for item 4 which we happen to have right now is not really
> useful without items 1-3, and even when using konsole(1), which does
> have items 1 and 2, implementing item 3 before item 4 would make
> sense because item 3 is more important.
>
> So my understanding is that you are not objecting to the patch because
> the fragmentary support for item 4 is practically useless in isolation.
>
>
> The following is not related to this patch, but i think it makes
> sense to mention it here: regarding the future, i think items 1 and
> 3 are much easier to support than items 2 and 4 because bidi support,
> if i understand correctly, only needs one bit of information per
> character because it only needs to know whether the character is
> part of a right to left or left to right script, so the complexity
> on the libc level, where we want complexity least of all places,
> is comparable to other boolean character properties like those
> listed in the iswalnum(3) manual page.  Realistically, though,
> bidi support would still be a large project, and i don't think it
> makes sense to tackle it any time soon.
>

As mentioned before, each character has one of three directions:
LTR, RTL, or none.  So it needs at least 2 bits of information.  Also
you need to handle LEFT TO RIGHT MARK, RIGHT TO LEFT MARK, and other
direction control characters.
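
To illustrate the storage cost, here is a small sketch in C of a table
that packs four two-bit direction classes into each byte.  The enum and
the functions set_dir() and get_dir() are hypothetical; nothing like
this exists in libc or less(1) as far as this thread shows.

```c
#include <stddef.h>

/* Three classes fit into two bits; the value 3 stays unused here. */
enum dir { DIR_LTR = 0, DIR_RTL = 1, DIR_NONE = 2 };

/* Store class d for entry idx in a table of four entries per byte. */
void
set_dir(unsigned char *tab, size_t idx, enum dir d)
{
	unsigned int shift = (unsigned int)(idx % 4) * 2;
	unsigned int byte = tab[idx / 4];

	byte &= ~(3u << shift);			/* clear the old class */
	byte |= (unsigned int)d << shift;	/* store the new class */
	tab[idx / 4] = (unsigned char)byte;
}

/* Read back the class of entry idx. */
enum dir
get_dir(const unsigned char *tab, size_t idx)
{
	return (enum dir)((tab[idx / 4] >> ((idx % 4) * 2)) & 3u);
}
```

The spare fourth value could, for instance, mark direction control
characters such as U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT
MARK, though that choice is only a suggestion of this sketch.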

> Ligature support feels much worse than bidi support because the
> mapping required is not merely character -> boolean but (character +
> character) -> character, which is more complicated than even the
> (character + character) -> -1/0/+1 mapping required for collation
> support - and we decided that we don't want collation support in
> libc because it would cause excessive complexity.  Admittedly,
> collations are strongly locale-dependent, while i'm not sure ligatures
> are locale-dependent, so with some luck, they might be simpler in
> that respect.  But a pair-to-character mapping, even without locale
> dependency, still sounds so scary that i doubt we want it in libc
> even in the long term.
>
> Thanks, you helped make the big picture a bit clearer for me.
>
> Yours,
>   Ingo
>
>


Best Regards