[Haskell-cafe] regex-pcre is not working with UTF-8

Discussion:

José Romildo Malaquias

2012-08-18 15:16:43 UTC

Hello.

It seems that regex-pcre has a bug when dealing with UTF-8:

Prelude> :m + Text.Regex.PCRE

Prelude Text.Regex.PCRE> "país:Brasil" =~ "país:(.*)" :: (String,String,String,[String])
("","pa\237s:Brasil","",["rasil"])

Notice the missing 'B' in the result of the regex matching.

With regex-posix this does not happen:

Prelude> :m + Text.Regex.Posix

Prelude Text.Regex.Posix> "país:Brasil" =~ "país:(.*)" ::(String,String,String,[String])
("","pa\237s:Brasil","",["Brasil"])

I hope this bug can be fixed soon.

Is there a bug tracker to report the bug? If so, what is it?

Romildo

Konstantin Litvinenko

2012-08-21 19:25:53 UTC

Post by José Romildo Malaquias
Hello.
I hope this bug can be fixed soon.
Is there a bug tracker to report the bug? If so, what is it?

You need something like this:

let pat = makeRegexOpts (compUTF8 .|. defaultCompOpt) defaultExecOpt
            ("@'(.+?)'@" :: B.ByteString)

and then pat will match correctly.

José Romildo Malaquias

2012-08-21 21:00:06 UTC

Post by Konstantin Litvinenko

Post by José Romildo Malaquias
Hello.
I hope this bug can be fixed soon.
Is there a bug tracker to report the bug? If so, what is it?

You need something like this:
let pat = makeRegexOpts (compUTF8 .|. defaultCompOpt) defaultExecOpt
and then pat will match correctly.

The bug is related to String (not ByteString) in a UTF-8 locale.

Until it is fixed, I am using the workaround of converting the regular
expression and the text to ByteString, doing the matching, and then
converting the results back to String.

Romildo

José Romildo Malaquias

2012-08-22 14:56:58 UTC

I do not have time to test this myself right now. But I will unravel my code a
bit for you.

In November 2011 it worked without problems in my application. Now that
I have resumed developing the application, I am faced with this
behaviour. Since it used to work, I believe it is a bug in
regex-pcre or libpcre.

I believe it may be a problem in the String <-> ByteString conversion. The "base"
library may have changed, and your LOCALE information may be different or may be
used differently by "base".

The (temporary) workaround I found is to convert the strings to
byte-strings before matching, and then convert the results back to
strings. With byte-strings it works well.

That is an excellent sign that it is your LOCALE settings being picked up by
GHC's "base" package, see explanation below.

[...]

I have written an application to test those things. There are 2 source
files: test.hs and seestr.c, which are attached. The test program:
1. shows the getForeignEncoding
2. uses a C function to show the characters from a String (using
withCString) and from a ByteString (using useAsCString)
3. matches a PCRE regular expression using String and ByteString
The test is run twice, with different LANG settings, and its output
follows.

[...]

As can be seen, regular expression matching does not work with
en_US.UTF-8. But it works with en_US.ISO-8859-1.
The test shows that withCString is working as expected too. This
may suggest the problem is really with regex-pcre.

The previous tests were run on a Gentoo Linux system with ghc-7.4.1.

I have also run the tests on Fedora 17 with ghc-7.0.4, which does not
have the bug. The sources are attached. The tests output follows:

$ LANG=en_US.ISO-8859-1 && ./test
testing with String
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4

testing with ByteString
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4

regex : país:(.*)
text : país:Brasil
String match : [["pa\237s:Brasil","Brasil"]]
ByteString match : [["pa\237s:Brasil","Brasil"]]

$ LANG=en_US.UTF-8 && ./test
testing with String
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4

testing with ByteString
code: 70, char: p
code: 61, char: a
code: ffffffed, char:
code: 73, char: s
result: 4

regex : país:(.*)
text : país:Brasil
String match : [["pa\237s:Brasil","Brasil"]]
ByteString match : [["pa\237s:Brasil","Brasil"]]

Clearly withCString has changed from ghc-7.0.4 to ghc-7.4.1. It seems
that with ghc-7.0.4 withCString does not obey the UTF-8 locale and
generates a latin1 C string.
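
The two behaviours can be compared on a single compiler by passing the encoding explicitly instead of relying on the locale. This is only a minimal sketch, assuming a base library that provides GHC.Foreign (whose withCStringLen takes a TextEncoding argument); marshalledLength is a hypothetical helper name, not part of any library.

```haskell
import GHC.Foreign (withCStringLen)
import GHC.IO.Encoding (TextEncoding, latin1, utf8)

-- Marshal a Haskell String with the given encoding and report how many
-- bytes the resulting C string occupies (the NUL terminator is not counted).
-- Hypothetical helper, for illustration only.
marshalledLength :: TextEncoding -> String -> IO Int
marshalledLength enc s = withCStringLen enc s (\(_, n) -> pure n)
```

For the string "país", marshalledLength latin1 should report 4 bytes and marshalledLength utf8 should report 5, since í takes one byte in Latin-1 and two in UTF-8.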

Regards,

Romildo

José Romildo Malaquias

2012-08-23 11:59:52 UTC

Hello.

I think I have an explanation for the problem with regex-pcre, ghc-7.4.2
and UTF-8 strings.

The Text.Regex.PCRE.String module uses the withCString and
withCStringLen from the module Foreign.C.String to pass a Haskell string
to the C library pcre functions that compile regular expressions, and
execute regular expressions to match some text.

Recent versions of ghc have withCString and withCStringLen definitions
that use the current system locale to define the marshalling of a
Haskell string into a NUL-terminated C string using temporary storage.

With a UTF-8 locale the length of the C string will be greater than the
length of the corresponding Haskell string in the presence of
characters outside of the ASCII range. Therefore positions of
corresponding characters in both strings do not match.

In order to compute matching positions, regex-pcre functions use C
strings. But to compute matching strings they use those positions with
Haskell strings.
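
The effect can be reproduced without regex-pcre. The following is a minimal sketch, not the library's code: utf8Width, utf8Length and sliceChars are hypothetical helper names, and sliceChars merely imitates the step described above, where byte positions are applied to a Haskell String.

```haskell
import Data.Char (ord)

-- Number of bytes UTF-8 uses to encode one character.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Length in bytes of a String once it is marshalled as UTF-8.
utf8Length :: String -> Int
utf8Length = sum . map utf8Width

-- Slice a Haskell String with an (offset, length) pair counted in
-- characters. Feeding it byte positions reproduces the reported mismatch.
sliceChars :: (Int, Int) -> String -> String
sliceChars (off, len) = take len . drop off
```

For the text "país:Brasília:Brasil", length gives 20 while utf8Length gives 22. Applying the byte positions (6,9) and (16,6) from the UTF-8 run in this message with sliceChars yields "rasília:B" and "asil", the same wrong groups that the String match prints.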

That gives the mismatch shown earlier and repeated here with the
attached program run on a system with a UTF-8 locale:

$ LANG=en_US.UTF-8 && ./test1
getForeignEncoding: UTF-8

regex : país:(.*):(.*)
text : país:Brasília:Brasil
String matchOnce : Just (array (0,2) [(0,(0,22)),(1,(6,9)),(2,(16,6))])
String match : [["pa\237s:Bras\237lia:Brasil","ras\237lia:B","asil"]]

$ LANG=en_US.ISO-8859-1 && ./test1
getForeignEncoding: ISO-8859-1

regex : país:(.*):(.*)
text : país:Brasília:Brasil
String matchOnce : Just (array (0,2) [(0,(0,20)),(1,(5,8)),(2,(14,6))])
String match : [["pa\237s:Bras\237lia:Brasil","Bras\237lia","Brasil"]]

I see two ways of fixing this bug:

1. make the matching functions compute the text using the C string and
the positions calculated by the C function, and convert the text back
to a Haskell string.

2. map the positions in the C string (if possible) to the corresponding
positions in the Haskell string; this way the current definitions of
the matching functions returning text will just work.
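
Both options can be sketched with base alone. This is only an illustration under stated assumptions, not the fix that went into regex-pcre: the C string is assumed to be UTF-8, the positions are assumed to fall on character boundaries, and sliceViaCString, byteToCharOffset and charSpan are hypothetical names.

```haskell
import Data.Char (ord)
import Foreign.Ptr (plusPtr)
import GHC.Foreign (peekCStringLen, withCStringLen)
import GHC.IO.Encoding (utf8)

-- Number of bytes UTF-8 uses to encode one character.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Option 1: cut the slice out of the marshalled C string and decode it.
-- No bounds checking: (off, len) must lie inside the buffer.
sliceViaCString :: String -> (Int, Int) -> IO String
sliceViaCString s (off, len) =
  withCStringLen utf8 s $ \(ptr, _) ->
    peekCStringLen utf8 (ptr `plusPtr` off, len)

-- Option 2: translate a byte offset into a character offset by counting
-- the characters that start before that byte.
byteToCharOffset :: String -> Int -> Int
byteToCharOffset s b = length (takeWhile (< b) starts)
  where
    starts = scanl (+) 0 (map utf8Width s)

-- Translate a (byte offset, byte length) pair into character units.
charSpan :: String -> (Int, Int) -> (Int, Int)
charSpan s (off, len) = (start, end - start)
  where
    start = byteToCharOffset s off
    end   = byteToCharOffset s (off + len)
```

With the text "país:Brasília:Brasil", charSpan maps the UTF-8 positions (0,22), (6,9) and (16,6) to (0,20), (5,8) and (14,6), which are the positions reported under the ISO-8859-1 locale. Option 2 scans the string once per offset, so a real implementation would probably convert all positions of a match in a single pass.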

I hope this helps in fixing the issue.

Regards,

Romildo

José Romildo Malaquias

2012-08-23 20:39:47 UTC

I have a fix for this bug, and it would be nice if others took a look at
it to see whether it is OK. It is based on the second approach listed in
my previous message.

Romildo