Issues with non-Latin characters in tags

I posted earlier on the bug tracker (issue 185) about a bug that occurs when Unicode characters are used in tags. I am posting here again with a solution and some related points.

In short, if Unicode characters are used in tags, we get invalid tag URLs. For example:

http://localhost:8000/questions/<<<العمل>>>
http://localhost:8000/questions/scope:all/sort:activity-desc/tags:العمل/page:1/<<<العمل>>>

The failure point is in models/questions.py:630:

    # use `<<<` and `>>>` because they cannot be confused with user input
    # - if user accidentialy types <<<tag-name>>> into question title or body,
    # then in html it'll become escaped like this: &lt;&lt;&lt;tag-name&gt;&gt;&gt;
    regex = re.compile(r'<<<(%s)>>>' % const.TAG_REGEX_BARE)

    while True:
        match = regex.search(html)
        if not match:
            break
        seq = match.group(0)  # e.g "<<<my-tag>>>"
        tag = match.group(1)  # e.g "my-tag"
        full_url = search_state.add_tag(tag).full_url()
        html = html.replace(seq, full_url)

    return html

We can solve the immediate issue by compiling the regex with the re.UNICODE flag:

    regex = re.compile(r'<<<(%s)>>>' % const.TAG_REGEX_BARE, re.UNICODE)
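
To illustrate the difference the flag makes, here is a minimal sketch, not the Askbot code itself. It assumes Python 3, where str patterns are already Unicode-aware, so re.ASCII is used to emulate the flag-less Python 2 behaviour of '\w'. The sample html string is made up for the demonstration.

```python
import re

# Same definitions as in const/__init__.py
TAG_CHARS = r'\w+.#-'
TAG_REGEX_BARE = r'[%s]+' % TAG_CHARS
MARKER = r'<<<(%s)>>>' % TAG_REGEX_BARE

# Made-up input: a URL prefix followed by an Arabic tag marker
# (the tag is written with escapes to keep the source ASCII-only).
arabic_tag = '\u0627\u0644\u0639\u0645\u0644'
html = '/questions/<<<%s>>>' % arabic_tag

# re.ASCII emulates what Python 2 does when no flag is given:
# \w only matches [a-zA-Z0-9_], so the marker is never found.
ascii_only = re.compile(MARKER, re.ASCII)

# With re.UNICODE, \w also matches Arabic letters.
unicode_aware = re.compile(MARKER, re.UNICODE)

missed = ascii_only.search(html) is None
found_tag = unicode_aware.search(html).group(1)

print(missed)                   # True: marker left in the URL
print(found_tag == arabic_tag)  # True: marker recognised
```

When the marker is missed, the `<<<...>>>` sequence is never replaced, which is how it ends up in the generated URL.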

But we would like to have more control over the allowable tag characters. In const/__init__.py:87 I can see:

    TAG_CHARS = r'\w+.#-'
    TAG_REGEX_BARE = r'[%s]+' % TAG_CHARS

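This relies on '\w', whose meaning depends on the regex flags: ASCII-only by default (in Python 2), locale-dependent with re.LOCALE, and Unicode-aware with re.UNICODE. Using it with a system locale, i.e. re.compile(..., re.LOCALE), would not be adequate. Testing locally, the definition of alphanumeric in system locales is something of an oddity: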

    >>> import locale
    >>> import string
    >>> locale.setlocale(locale.LC_ALL, 'ar_SA')
    'ar_SA'
    >>> string.letters
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea'
    >>> locale.setlocale(locale.LC_ALL, 'ar_SA.utf8')
    'ar_SA.utf8'
    >>> string.letters
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

So for alphanumeric characters I get either Latin only(!) or Latin plus the local language. Both are undesirable: we would like to have the local language only. So the way out is to use re.UNICODE and have the acceptable tag pattern defined in the admin configuration.

This has been correctly noted in const/__init__.py:80:

    #todo:
    #this probably needs to be language-specific
    #and selectable/changeable from the admin interface
    #however it will be hard to expect that people will type
    #correct regexes - plus this must be an anchored regex
    #to do full string match
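
As a rough sketch of what a language-specific setting could look like: the name ALLOWED_TAG_CHARS and the helper is_legal_tag are hypothetical, and the range U+0621 to U+064A is only my assumption of what "strict Arabic letters" might mean. The code assumes Python 3.4+ for re.fullmatch.

```python
import re

# Hypothetical admin-configurable value: basic Arabic letters
# (U+0621-U+064A, an assumed range), ASCII digits, and the
# punctuation that TAG_CHARS already allows.
ALLOWED_TAG_CHARS = r'\u0621-\u064A0-9+.#-'
TAG_REGEX = re.compile(r'[%s]+' % ALLOWED_TAG_CHARS, re.UNICODE)


def is_legal_tag(tag):
    """Return True if the whole tag consists of allowed characters."""
    # fullmatch anchors at both ends, so the configured value
    # does not need to contain ^ or $ itself.
    return TAG_REGEX.fullmatch(tag) is not None


print(is_legal_tag('\u0627\u0644\u0639\u0645\u0644'))  # Arabic word: True
print(is_legal_tag('html'))                            # Latin letters: False
```

Using fullmatch means the admin only has to supply the character set, which sidesteps part of the concern in the todo about people typing correctly anchored regexes. On Python 2, which has no re.fullmatch, the pattern would have to be wrapped with explicit anchors instead.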

Other issues:

As the code stands now, it allows broken URLs to be generated, which is buggy behavior that slips through the net. So on top of the above, it would be good to have:

  • When a tag is used in code, it should be tested for legality, and an exception thrown if it does not match the legal pattern. (This also helps if the setting for the legal pattern is changed and the system becomes inconsistent because of older, now-illegal tags.)

  • When tags are entered by a user, they should be checked against the legal pattern, and a rejection message generated if they do not match. This is partly implemented and ...
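
The two checks above could share one helper. This is only a sketch under stated assumptions: IllegalTagError and check_tag are hypothetical names, not part of Askbot, and the pattern and message defaults stand in for values that would come from the admin settings. It assumes Python 3.4+ for re.fullmatch.

```python
import re


class IllegalTagError(ValueError):
    """Hypothetical exception for tags that fail the configured pattern."""


def check_tag(tag,
              pattern=r'[\w+.#-]+',
              message='Tag "%s" contains characters that are not allowed'):
    """Return the tag unchanged, or raise IllegalTagError.

    `pattern` and `message` are placeholders for configurable settings.
    """
    if re.fullmatch(pattern, tag, re.UNICODE) is None:
        raise IllegalTagError(message % tag)
    return tag


# In code: fail loudly instead of emitting a broken URL.
check_tag('my-tag')

# On user input: turn the exception into a rejection message.
try:
    check_tag('bad tag')
except IllegalTagError as error:
    print(error)
```

Calling the same function at URL-generation time would also catch older tags that became illegal after a pattern change.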

Basel Shishani
asked 2012-03-14 06:37:40 -0500

1 Answer

Thanks for pointing out the re.UNICODE solution; sorry I overlooked your post for more than two weeks.

Why do you want to control the allowed characters in tags so strictly? In which situation would that actually be useful?

Evgeny
answered 2012-04-10 16:18:42 -0500, updated 2012-04-10 16:22:29 -0500

Comments

One case is tech jargon. For example, if 'HTML' is to be a tag, do you use the English word or a transliterated equivalent?

Another case: the Arabic code page covers other languages as well, including Farsi. Some words have been borrowed from English (e.g. video) and contain sounds not present in Arabic (v). They are commonly written with a close alternative (f for v, b for p), but can also, less commonly, be written with true v and p letters that are in the Arabic code page but outside the strict Arabic range. So basically there are alternatives, and which one is used would be dictated by the allowable range.

Basel Shishani (2012-04-18 05:52:10 -0500)

So basically you would like to enforce a style or a rule for the users when they're thinking up these tags. I understand these could be enforced at a different level, or by convention, but ideally the regex and the error message for unacceptable tags would be configurable.

Basel Shishani (2012-04-18 05:52:52 -0500)