asked 2012-03-14 06:37:40 -0500

Issues with non-Latin chars in tags

I posted earlier on the bug tracking system (ref issue 185) about a bug that appears when Unicode chars are used in tags. I am posting here again with a solution and some related points.

In short, if Unicode chars are used in tags, we get invalid URLs for those tags, with the <<<tag>>> placeholder left in place. For example:

http://localhost:8000/questions/<<<العمل>>>
http://localhost:8000/questions/scope:all/sort:activity-desc/tags:العمل/page:1/<<<العمل>>>

The failure point is in models/questions.py:630

    # use `<<<` and `>>>` because they cannot be confused with user input
    # - if user accidentialy types <<<tag-name>>> into question title or body,
    # then in html it'll become escaped like this: &lt;&lt;&lt;tag-name&gt;&gt;&gt;
    regex = re.compile(r'<<<(%s)>>>' % const.TAG_REGEX_BARE)

    while True:
        match = regex.search(html)
        if not match:
            break
        seq = match.group(0)  # e.g "<<<my-tag>>>"
        tag = match.group(1)  # e.g "my-tag"
        full_url = search_state.add_tag(tag).full_url()
        html = html.replace(seq, full_url)

    return html

We can solve the immediate issue by compiling the regex with re.UNICODE:

    regex = re.compile(r'<<<(%s)>>>' % const.TAG_REGEX_BARE, re.UNICODE)
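
To illustrate the difference the flag makes, here is a minimal sketch. It is written for Python 3, where str patterns are already Unicode-aware, so re.ASCII stands in for the Python 2 behavior without re.UNICODE (that substitution is my assumption, made only to get a runnable example). TAG_CHARS and TAG_REGEX_BARE mirror the values in const/__init__.py, and the html value is one of the broken URLs above.

```python
import re

# Same definitions as in const/__init__.py
TAG_CHARS = r'\w+.#-'
TAG_REGEX_BARE = r'[%s]+' % TAG_CHARS

html = 'http://localhost:8000/questions/<<<العمل>>>'

# re.ASCII: \w matches only [a-zA-Z0-9_] (approximates Python 2 without re.UNICODE)
ascii_regex = re.compile(r'<<<(%s)>>>' % TAG_REGEX_BARE, re.ASCII)
# re.UNICODE: \w matches any Unicode word character
unicode_regex = re.compile(r'<<<(%s)>>>' % TAG_REGEX_BARE, re.UNICODE)

ascii_match = ascii_regex.search(html)
unicode_match = unicode_regex.search(html)

print(ascii_match)             # None: the placeholder is never replaced
print(unicode_match.group(1))  # العمل
```

With the ASCII-only class the search finds nothing, so the loop exits immediately and the placeholder stays in the URL.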

But we would like to have more control over the allowable tag chars. In const/__init__.py:87 I can see:

    TAG_CHARS = r'\w+.#-'
    TAG_REGEX_BARE = r'[%s]+' % TAG_CHARS

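This relies on '\w', whose meaning depends on the flags the regex is compiled with: ASCII only by default, Unicode-aware with re.UNICODE, and locale dependent with re.LOCALE. Using it with the system locale - re.compile(..., re.LOCALE) - would not be adequate. Testing locally, the definition of alphanumeric in system locales is something of an oddity: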

    >>> import locale
    >>> import string
    >>> locale.setlocale(locale.LC_ALL, 'ar_SA')
    'ar_SA'
    >>> string.letters
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea'
    >>> locale.setlocale(locale.LC_ALL, 'ar_SA.utf8')
    'ar_SA.utf8'
    >>> string.letters
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

So for alphanumeric, I'm getting either Latin plus local language (with 'ar_SA'), or Latin only(!) (with 'ar_SA.utf8'). Both are undesirable: we would like to be able to allow the local language only. So the way out is to use re.UNICODE and have the acceptable tag pattern defined in admin/config.
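
As a rough sketch of what a "local language only" pattern could look like, the character class can name explicit Unicode ranges instead of '\w'. The names ARABIC_TAG_CHARS and ARABIC_TAG_REGEX are hypothetical (they do not exist in the code base), and the ranges are chosen for illustration only: U+0621 to U+064A for the basic Arabic letters and U+0660 to U+0669 for the Arabic-Indic digits. It is not meant as a complete definition of the Arabic script.

```python
import re

# Hypothetical setting: basic Arabic letters, Arabic-Indic digits,
# plus the punctuation already allowed in TAG_CHARS ('+', '.', '#', '-').
ARABIC_TAG_CHARS = r'\u0621-\u064A\u0660-\u0669+.#-'

# Anchored at both ends so that the whole tag must match.
ARABIC_TAG_REGEX = re.compile(r'\A[%s]+\Z' % ARABIC_TAG_CHARS, re.UNICODE)

for tag in ('العمل', 'work', 'العمل-work'):
    print(repr(tag), bool(ARABIC_TAG_REGEX.match(tag)))
```

Only the first tag is accepted; anything containing Latin letters is rejected, which is the behavior neither system locale could give.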

This has been correctly noted in const/__init__.py:80:

    #todo:
    #this probably needs to be language-specific
    #and selectable/changeable from the admin interface
    #however it will be hard to expect that people will type
    #correct regexes - plus this must be an anchored regex
    #to do full string match
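
One way to soften the "people will not type correct regexes" concern is to let the admin supply only the character set and have the code add the brackets and anchors, rejecting anything that does not compile. The sketch below assumes a hypothetical helper compile_tag_regex (not part of the code base). It only catches syntax errors; it does not prove that the admin's character set is sensible.

```python
import re


def compile_tag_regex(tag_chars):
    """Build an anchored tag regex from an admin-supplied character set.

    Hypothetical helper: raises ValueError if the resulting pattern
    is not a valid regular expression.
    """
    pattern = r'\A[%s]+\Z' % tag_chars
    try:
        return re.compile(pattern, re.UNICODE)
    except re.error as exc:
        raise ValueError('invalid tag character set %r: %s' % (tag_chars, exc))


# The current default compiles fine and matches whole tags only.
default_regex = compile_tag_regex(r'\w+.#-')
print(bool(default_regex.match('my-tag')))   # True
print(bool(default_regex.match('my tag')))   # False, space is not allowed

# A reversed range is a syntax error and is reported to the caller.
try:
    compile_tag_regex('z-a')
except ValueError as exc:
    print('rejected:', exc)
```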

Other issues:

As the code stands now, it allows broken URLs to be generated, and this buggy behavior slips through the net. So, on top of the above, it would be good to have:

When a tag is used in code, it should be tested for legality, and an exception thrown if it does not match the legal pattern. This also helps if the setting for the legal pattern is changed and the system becomes inconsistent because of older, now-illegal tags.
When tags are entered by a user, they should be checked against the legal pattern, and a rejection message generated if they do not match. This is partly implemented and noted as a todo point (forms.py).
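
The first point could be sketched roughly as follows. IllegalTagError and check_tag are hypothetical names, not existing code, and TAG_REGEX here is simply the current TAG_REGEX_BARE compiled with re.UNICODE. The idea is that code such as the URL builder would call check_tag before using a tag, so that an illegal tag fails loudly instead of producing a broken URL.

```python
import re

# The current character set from const/__init__.py, compiled with re.UNICODE.
TAG_REGEX = re.compile(r'[\w+.#-]+', re.UNICODE)


class IllegalTagError(ValueError):
    """Hypothetical exception for tags that do not match the legal pattern."""


def check_tag(tag, regex=TAG_REGEX):
    """Return the tag unchanged if the whole string matches, else raise."""
    if regex.fullmatch(tag) is None:
        raise IllegalTagError('illegal tag name: %r' % tag)
    return tag


print(check_tag('my-tag'))
print(check_tag('العمل'))
try:
    check_tag('bad tag')
except IllegalTagError as exc:
    print(exc)
```

The same function could back the user-facing check, with the form catching the exception and turning it into a rejection message.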

Regards.