asked 2012-03-14 06:37:40 -0500Basel Shishani
I posted earlier on the bug tracking system re a bug in the tags (ref issue 185) when Unicode chars are used for the tags. I post here again with the a solution and some consequent points.
In short, if Unicode chars are used for tags, we get invalid url's for tags: for example:
The failure point is in models/questions.py:630
# use `<<<` and `>>>` because they cannot be confused with user input # - if user accidentialy types <<<tag-name>>> into question title or body, # then in html it'll become escaped like this: <<<tag-name>>> regex = re.compile(r'<<<(%s)>>>' % const.TAG_REGEX_BARE) while True: match = regex.search(html) if not match: break seq = match.group(0) # e.g "<<<my-tag>>>" tag = match.group(1) # e.g "my-tag" full_url = search_state.add_tag(tag).full_url() html = html.replace(seq, full_url) return html
We can solve the immediate issue by:
regex = re.compile(r'<<<(%s)>>>' % const.TAG_REGEX_BARE, re.UNICODE)
But we would like to have more control on allowable tag chars, I can see in const/__init__.py:87:
TAG_CHARS = r'\w+.#-' TAG_REGEX_BARE = r'[%s]+' % TAG_CHARS
Which uses the locale dependent '\w'. Using this with a system locale - re.compile(...,re.LOCALE) - would not be adequate. Testing locally, the definition of alphanumeric in system locales is something of a an oddity:
>>> import locale >>> import string >>> locale.setlocale(locale.LC_ALL, 'ar_SA') 'ar_SA' >>> string.letters 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea' >>> locale.setlocale(locale.LC_ALL, 'ar_SA.utf8') 'ar_SA.utf8' >>> string.letters 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
So for alphanumeric, I'm getting either Latin only(!), or Latin plus local language. Both are undesired - we would like to have local language only. So the way out is to use 're.UNICODE' and have the acceptable tag pattern defined in admin/config.
This has been correctly noted in const/__init__.py:80:
#todo: #this probably needs to be language-specific #and selectable/changeable from the admin interface #however it will be hard to expect that people will type #correct regexes - plus this must be an anchored regex #to do full string match
The way the code stands now, it allows broken url's to be generated, which is buggy behavior that slips through the net, so on top of the above, would be good to have:
When a tag is used in code, it should be tested for legality, and an exception thrown if it doesn't match the legal pattern. (also good in case the setting for the legal pattern is changed, and the system becomes inconsistent having older illegal tags).
When tags are entered by user, they should be checked whether they match the legal pattern, and rejection message generated - this is partly implemented and noted as a todo point (forms.py).
Thanks for pointing out the
re.UNICODE solution, sorry I have overlooked your post for > 2 weeks.
Why do you want to so strictly control the allowed characters in the tags? In which situation it would actually be useful?
Create your Q&A site at askbot.com. Managed Askbot hosting at just $15/mo. Dedicated hosting, support contracts, consulting services.create your Q&A site
Asked: 2012-03-14 06:37:40 -0500
Seen: 103 times
Last updated: Apr 10 '12