International emails

published Sep 08, 2009, last modified Sep 08, 2009

Sending emails with strange characters to strange people, or at least people with non-ascii names.

Every few months at my employer Zest Software we have an evening of "eten en weten". Literally that is Dutch for "eating and knowing". Let's call it "Food for thought". We eat together and several of us hold presentations on subjects that are in some way related to our work. For example: Django, common Dutch language mistakes, how we use subversion, or local site hooks and the many interesting ways in which they can break when migrating from Plone 2.5 to 3. I managed to squeeze that last one into a lightning talk of a few minutes; you really don't want to know. ;-) (In case you do want to know, take a look at Products.Plone3Cleaners).

It is probably about time for a new "eten en weten", so it is probably also about time I uploaded my talk from last time about international emails. I talked about some basic terminology and what can go wrong, pointed to the python email module, and showed how to send a complete message, including some details that you can forget as long as you use the proper methods. After all, foreign languages are difficult enough already:

http://engrishfunny.files.wordpress.com/2009/03/engrish-funny-no-flush.jpg?w=500&h=375

i18n/l10n

Two terms widely used are:

internationalization
i       18         n

localization
l    10    n

Roughly speaking, in a Plone context, internationalization is making sure the content or the UI is translated into several languages. Localization is making sure that 3 May 2009 is 05-03-2009 in the USA and 03.05.2009 in Germany.

These two terms are not really the focus here though. The point is: how do you make sure that an email sent from Plone (or any python application really, if you ignore some details) with a Chinese name as From address, a Japanese name as To address, a Russian Subject and a Korean body text is delivered without errors?

Now do not think: "I live and work in America, I only need ascii." Don't you have Spanish colleagues? Some friends from your year abroad at that French university? A few Chinese clients? You could use only ascii, but you might regret that:

http://punditkitchen.files.wordpress.com/2009/01/political-pictures-hello-usa-copy.jpg

utf-8 is not unicode

Repeat after me: "utf-8 is not unicode", "utf-8 is not unicode", "utf-8 is not unicode":

>>> type('ascii')
<type 'str'>
>>> type('utf-8')
<type 'str'>
>>> type(u'unicode')
<type 'unicode'>
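
The same distinction exists in Python 3 under different names: str is text and bytes is encoded data. A minimal sketch, assuming Python 3 rather than the Python 2 session above:

```python
# Python 3 sketch: text (str) versus encoded data (bytes).
text = 'René'                        # unicode text, 4 characters
as_utf8 = text.encode('utf-8')       # 5 bytes, the é takes two bytes
as_latin1 = text.encode('latin-1')   # 4 bytes, the é takes one byte

# The same text gives different bytes in different encodings, and the
# bytes only become the original text again with the matching decoder.
assert as_utf8 != as_latin1
assert as_utf8.decode('utf-8') == text
assert as_latin1.decode('latin-1') == text

print(type(text).__name__, type(as_utf8).__name__)
```

So utf-8 is one of several ways to turn unicode text into bytes; it is not the text itself.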

Basics

Sending an email in Plone goes something like this:

charset = portal.getProperty(
  'email_charset', 'ISO-8859-1')
mailhost = getToolByName(portal, 'MailHost')

mailhost.send(message = msg,
              mto = address,
              mfrom = mfrom,
              subject = subject,
              charset = charset)

What can go wrong with that?

  • Hard to read headers:

    From: RenXX Artois
    
  • Hard to read body text:

    lettere accentate: ò ùâ
    
  • Unrecognized addresses:

    To: undisclosed recipients
    
  • No email body: C

  • UnicodeDecodeErrors/UnicodeEncodeErrors

Some examples can be found in Poi issues 146 and 161.
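
Two of these failure modes are easy to reproduce outside Plone. This is only an illustrative Python 3 sketch; the variable names are made up for the example:

```python
# Python 3 sketch of two of the failure modes listed above.
name = 'René Artois'

# 1. Hard to read text: UTF-8 bytes that the receiving side
#    interprets as latin-1 turn every accented letter into two
#    strange characters.
garbled = name.encode('utf-8').decode('latin-1')
print(garbled)

# 2. UnicodeEncodeError: forcing non-ascii text into ascii.
try:
    name.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```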

Parsing/formatting addresses

The To and From fields should have something like this:

Maurits van Rees <maurits@example.org>

The standard python email package has nice utilities for this:

>>> from email.Utils import parseaddr
>>> from email.Utils import formataddr
>>> formataddr(('Maurits van Rees',
...             'maurits@example.org'))
'Maurits van Rees <maurits@example.org>'
>>> parseaddr(
...     'Maurits van Rees <maurits@example.org>')
('Maurits van Rees', 'maurits@example.org')

These functions can get confused by strange characters. You can guard against that by parsing the address that you have just formatted and seeing if the parsed information still makes sense:

from_address = portal.getProperty(
  'email_from_address', '')
from_name = portal.getProperty(
  'email_from_name', '')
mfrom = formataddr((from_name, from_address))
if parseaddr(mfrom)[1] != from_address:
    # formataddr probably got confused
    # by special characters.
    mfrom = from_address
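
The same round-trip check can be wrapped in a small helper. This is a sketch for Python 3, where the module is called email.utils; the function name safe_from_header is hypothetical and not part of Plone or the standard library:

```python
from email.utils import formataddr, parseaddr


def safe_from_header(name, address):
    """Return 'Name <address>', or the bare address when the
    formatted value does not parse back to the same address."""
    formatted = formataddr((name, address))
    if parseaddr(formatted)[1] != address:
        # formataddr probably got confused by special characters.
        return address
    return formatted


print(safe_from_header('Maurits van Rees', 'maurits@example.org'))
print(safe_from_header('René Artois', 'rene@example.org'))
```

In Python 3 formataddr already encodes a non-ascii name for use in a header, so the second call prints an encoded name followed by the plain address.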

Character sets

The python email.Charset module has interesting information about how email headers and body text should be encoded depending on the input character set. Some examples (QP is quoted printable):

input         header enc  body enc  output conv
iso-8859-1:   QP          QP        None 
iso-8859-15:  QP          QP        None 
windows-1252: QP          QP        None 
us-ascii:     None        None      None 
big5:         BASE64      BASE64    None 
euc-jp:       BASE64      None      iso-2022-jp 
iso-2022-jp:  BASE64      None      None 
utf-8:        SHORTEST    BASE64    utf-8 
...
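
You can look these values up yourself. The sketch below assumes Python 3, where the module is email.charset; the describe helper and the NAMES mapping are only there for the example. Note that in Python 3 get_output_charset falls back to the input character set where the table above says None:

```python
from email.charset import Charset, QP, BASE64, SHORTEST

# Map the module constants back to readable names.
NAMES = {QP: 'QP', BASE64: 'BASE64', SHORTEST: 'SHORTEST', None: 'None'}


def describe(charset_name):
    """Return (header encoding, body encoding, output charset)."""
    cs = Charset(charset_name)
    return (NAMES[cs.header_encoding],
            NAMES[cs.body_encoding],
            cs.get_output_charset())


for name in ('iso-8859-1', 'us-ascii', 'euc-jp', 'utf-8'):
    print(name, describe(name))
```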

If that does not make sense, perhaps this helps:

http://icanhascheezburger.files.wordpress.com/2008/12/funny-pictures-this-kitten-is-confused.jpg

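This information is used when creating email headers. By default header_encode does not convert its input, so the UTF-8 byte string used here keeps its UTF-8 bytes (=C3=A9) even under the iso-8859-1 label: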

>>> from email.Charset import Charset
>>> latin = Charset('iso-8859-1')
>>> utf = Charset('utf-8')
>>> latin.header_encode('René Artois')
u'=?iso-8859-1?q?Ren=C3=A9_Artois?='
>>> utf.header_encode('René Artois')
'=?utf-8?q?Ren=C3=A9_Artois?='

and encoding body text:

>>> latin.get_body_encoding()
'quoted-printable'
>>> latin.body_encode('René Artois')
'Ren=C3=A9 Artois'
>>> utf.get_body_encoding()
'base64'
>>> utf.body_encode('René Artois')
'UmVuw6kgQXJ0b2lz\n'
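
For comparison, here is a sketch of the same calls in Python 3, where you pass unicode text and the Charset object does the conversion to bytes itself. It also decodes the header again, which is roughly what a mail program does before showing it to you:

```python
from email.charset import Charset
from email.header import decode_header

latin = Charset('iso-8859-1')
utf = Charset('utf-8')
name = 'René Artois'

# Encoded words for use in a header.
latin_header = latin.header_encode(name)
utf_header = utf.header_encode(name)
print(latin_header)
print(utf_header)

# Encoded body text: base64 for utf-8.
utf_body = utf.body_encode(name)
print(repr(utf_body))

# decode_header gives back the raw bytes and the charset label.
raw, charset = decode_header(latin_header)[0]
print(raw.decode(charset))
```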

This may look confusing. Surely if you get an email with a text or subject like this it is unreadable? No, your email program should be smart enough to display this to you in a readable fashion. No need for the funny face:

http://www.alloallo.yoyo.pl/img/01_rene_artois.jpg

Formatting headers

Instead of using email.Charset for formatting headers you normally use the email.Header module:

>>> from email.Header import Header
>>> # The literal is a UTF-8 byte string here, so decode it as UTF-8.
>>> subject = 'Re: René'.decode('utf-8')
>>> subject
u'Re: Ren\xe9'
>>> subject = Header(subject, 'latin-1')
>>> subject
<email.Header.Header instance at 0x...>
>>> print subject
=?iso-8859-1?q?Re=3A_Ren=E9?=
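
A sketch of the same thing in Python 3 (module email.header), with a round trip through decode_header and make_header to check that the subject survives:

```python
from email.header import Header, decode_header, make_header

# In Python 3 the subject is already unicode text.
subject = Header('Re: René', 'iso-8859-1')
encoded = subject.encode()
print(encoded)

# Decoding the encoded header gives the original text back.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```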

Formatting the body

You will need to know which character set the body text has, or at least in which character set it can be encoded without errors. This snippet tries three character sets:

charset = portal.getProperty(
  'email_charset', 'ISO-8859-1')
for body_charset in 'US-ASCII', charset, 'UTF-8':
    try:
        message = message.encode(body_charset)
    except UnicodeError:
        pass
    else:
        break

If the message only contains ascii characters, then at the end of this snippet the message is encoded in ascii and the body_charset variable is 'US-ASCII'.
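
The loop can also be written as a function that returns both the encoded body and the character set that worked. A sketch for Python 3; the name encode_body and the preferred argument are made up for this example:

```python
def encode_body(text, preferred='ISO-8859-1'):
    """Encode text with the first character set that fits.

    Tries ascii first, then the preferred (site) character set,
    and falls back to UTF-8.
    """
    for charset in ('US-ASCII', preferred):
        try:
            return text.encode(charset), charset
        except UnicodeError:
            pass
    return text.encode('UTF-8'), 'UTF-8'


for sample in ('hello', 'René', '日本語'):
    body, body_charset = encode_body(sample)
    print(repr(sample), '->', body_charset)
```

Trying ascii first keeps simple messages as plain as possible; UTF-8 comes last because it can encode practically anything.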

Send it

We have done all the hard work with the Headers so now we can use the 'send' method:

# Create the message.
# 'plain' stands for Content-Type: text/plain
from email.MIMEText import MIMEText
msg = MIMEText(message, 'plain', body_charset)
msg['From'] = email_from
msg['To'] = email_to
msg['Subject'] = subject
msg = msg.as_string()
mailhost = getToolByName(portal, 'MailHost')
mailhost.send(message=msg)
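
Without Plone at hand you can still check that such a message is built correctly. The sketch below assumes Python 3 and only builds and re-parses the message; handing it to MailHost or smtplib is left out. The build_message helper and the example addresses are hypothetical:

```python
from email import message_from_string
from email.header import Header
from email.mime.text import MIMEText
from email.utils import formataddr


def build_message(body, subject, sender, recipient, charset='utf-8'):
    """Return the complete message as an ascii-only string.

    sender and recipient are (name, address) tuples.
    """
    # 'plain' stands for Content-Type: text/plain
    msg = MIMEText(body, 'plain', charset)
    msg['From'] = formataddr(sender)
    msg['To'] = formataddr(recipient)
    msg['Subject'] = Header(subject, charset)
    return msg.as_string()


raw = build_message(
    'Здравствуйте',
    'Re: René',
    ('René Artois', 'rene@example.org'),
    ('Maurits van Rees', 'maurits@example.org'))

# Parse it again, as a receiving mail program would.
parsed = message_from_string(raw)
print(parsed.get_content_charset())
print(parsed.get_payload(decode=True).decode('utf-8'))
```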

Using secureSend

It is easier to use the secureSend method; you do not need the Header class then, as secureSend takes care of that:

email_msg = MIMEText(message, 'plain', body_charset)
mailhost.secureSend(
  message = email_msg,
  mto = email_to,
  mfrom = email_from,
  subject = subject,
  charset = header_charset)

Now international email sending should work:

http://icanhascheezburger.files.wordpress.com/2009/07/funny-pictures-cat-is-indisposed.jpg
