I will expand a bit on this topic. (Please note in the following that I'm new to the Python/Django scene.)

With regard to the handling of URLs and slugs:

  • the majority of top-tier sites do not use URL encoding (percent encoding) for slugs. Some just use English titles; most use article IDs.

  • a few of the top-tier sites do use URL encoding for slugs. Some examples (given in the ugly encoded form):

http://sayedmokhtar.maktoobblog.com/1633941/%D8%AF%D9%83%D8%AA%D9%88%D8%B1-%D8%B3%D9%8A%D8%AF-%D9%85%D8%AE%D8%AA%D8%A7%D8%B1-%D9%8A%D9%83%D8%AA%D8%A8-%D8%AB%D9%88%D8%B1%D8%A9-%D8%B4%D8%B9%D8%A8%D9%8A%D8%A9-%D9%82%D8%A7%D8%AF%D9%85%D8%A9/

http://fr-fr.facebook.com/pages/%D8%A7%D9%84%D8%AB%D9%88%D8%B1%D8%A9-%D8%A7%D9%84%D8%B4%D8%B9%D8%A8%D9%8A%D8%A9-%D8%A7%D9%84%D9%85%D8%BA%D8%B1%D8%A8%D9%8A%D8%A9-%D8%A7%D9%84%D9%82%D8%A7%D8%AF%D9%85%D8%A9/188865831132687

  • a few use Romanized slugification. I couldn't find examples in my quick search, but they are out there.

Using percent encoding with Arabic comes with its pros and cons:

Cons: because of implementation issues and inconsistencies among browsers, problems arise such as:

  • URLs look ugly when copied around.
  • bidi (bidirectional text) issues, where parts of the URL are displayed out of order.

Pros: it may help with SEO.

Now to Askbot: I can see it uses both Romanization for article slugs and percent encoding for the section names in the path segment. I'm not sure where the slugification happens (at the Askbot level or the Django level). Here is a sample URL from a test site, where the encoded segment is the 'questions' section:

http://127.0.1.1:8000/%D8%B3%D8%A4%D8%A7%D9%84/5/brkt-mwzfy-wzrth-lmlyn-fy-mktb-lkhdm-ljtmy
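To make the encoded segment in the sample URL readable, here is a minimal sketch using only Python's standard library (the variable names are mine, not Askbot's). It decodes the first path segment, which turns out to be the Arabic word سؤال ("question"), and then encodes it again to show the round trip:

```python
from urllib.parse import quote, unquote

# First path segment of the sample URL above, in percent-encoded form.
encoded_segment = "%D8%B3%D8%A4%D8%A7%D9%84"

# unquote() interprets the escaped bytes as UTF-8 by default.
decoded_segment = unquote(encoded_segment)

# quote() encodes the text back to UTF-8 bytes and escapes them.
round_trip = quote(decoded_segment)

print(decoded_segment)
print(round_trip == encoded_segment)  # True
```

Each Arabic letter takes two bytes in UTF-8, and each byte becomes a three-character escape, which is why a four-letter word grows to 24 characters in the URL.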

The slugify function is rather crude and removes some characters that don't have Latin counterparts, so it would be better to be able to override it. Informal transliterated writing is in common use for Arabic (and other languages), for example in chat, SMS and in cases where there are technical issues (see http://en.wikipedia.org/wiki/Arabic_chat_alphabet), so the slugification option makes good sense. However, there are conventions for mapping the letters between the two languages, so it also makes sense for different languages to have different mappings or even different transliteration functions.
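As a rough illustration of what per-language mappings could look like, here is a minimal sketch. It is not Askbot's or Django's slugify; the names `ARABIC_CHAT_MAP`, `LANGUAGE_MAPS` and `romanized_slug` are hypothetical, and the mapping table is a small, partial sample of one chat-alphabet convention (real conventions vary by region):

```python
import re

# Hypothetical, partial mapping based on one Arabic chat alphabet convention.
# A real table would cover the whole alphabet and would likely be configurable.
ARABIC_CHAT_MAP = {
    "ا": "a", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "7",
    "خ": "kh", "د": "d", "ر": "r", "س": "s", "ش": "sh", "ط": "6",
    "ع": "3", "ل": "l", "م": "m", "ن": "n", "و": "w", "ي": "y",
    "ء": "2",
}

# One mapping per language code, so each language can follow its own conventions.
LANGUAGE_MAPS = {"ar": ARABIC_CHAT_MAP}


def romanized_slug(title, language_code):
    """Transliterate title with the language's mapping, then build a slug."""
    mapping = LANGUAGE_MAPS.get(language_code, {})
    # Characters without a mapping entry are left as they are at this stage.
    romanized = "".join(mapping.get(char, char) for char in title.lower())
    # Any run of characters outside a-z and 0-9 collapses into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", romanized)
    return slug.strip("-")


print(romanized_slug("لوح جديد", "ar"))  # lw7-jdyd
```

Note that unmapped characters still end up as hyphens, so the quality of the slug depends entirely on how complete the table is. Since Arabic normally leaves short vowels unwritten, a letter-by-letter mapping like this gives consonant-heavy output, much like the slug in the sample URL above.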

I would also like to have the choice to get rid of percent encoding in the path segment of the URL (because of the URL encoding issues mentioned above) and replace it with either a Romanized form or the original English strings (i.e. the 'questions' segment remains the same even when the locale is set to something other than English). (I'm not sure how to get the original English strings with gettext; switching locales, maybe?)
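On the gettext question: as far as I understand it, the message ID passed to gettext is itself the original English string, so keeping the English segment can be as simple as not translating it. Below is a minimal sketch with the standard library's gettext module, not Askbot's code. The function `section_name`, its `keep_english` flag, the `locale` directory and the `django` catalog domain are all assumptions for illustration. In Django itself, the `django.utils.translation.override` context manager could be used to switch the locale temporarily instead.

```python
import gettext


def section_name(msgid, language_code, localedir="locale", keep_english=False):
    """Return the URL section name, translated or in its original English form."""
    if keep_english:
        # The msgid is the original English string, so no lookup is needed.
        return msgid
    # fallback=True returns a NullTranslations object when no catalog is
    # found, and its gettext() hands back the msgid unchanged.
    catalog = gettext.translation(
        "django", localedir=localedir, languages=[language_code], fallback=True
    )
    return catalog.gettext(msgid)


print(section_name("questions", "ar", keep_english=True))  # questions
```

With a compiled Arabic catalog in place, calling the function without `keep_english` would return the translated section name, which is then percent-encoded when it goes into the URL.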

So, to sum up, this is the list of options that different people might consider for multilingual or monolingual non-English sites:

For the path section:

  • use the original English strings: domain.name/questions/...
  • use percent encoding: domain.name/%xx.../...
  • use Romanized form: domain.name/2as2ila/...

For the slug section: (independent from the above)

  • define a corresponding English slug: domain.name/.../samsung-releases-new-pad
  • use percent encoding of title: domain.name/.../%xx...
  • use Romanized slugification of title: domain.name/.../samsung-ta6ra7-law7-jadeed

(Of course, using IDs rather than slugs is also an option.)

Ultimately, when IDNs become common, punycode becomes ubiquitous in browsers and implementation issues get straightened out, we probably won't need all this, or would need it to a lesser degree as a choice.

I appreciate any comments on the above. If need be, I'll post some of the raised points as separate questions.
