Sunday, November 10, 2013

The story of None, True and False (and an explanation of literals, keywords and builtins thrown in)

I received an interesting question in the mail recently:
What is the difference between keywords and literals? Why are True and False keywords rather than literals in python3?
I was horrified recently to find that assigning to True/False works in python2. So I went digging, and found that True and False were created to be 'constants' like None in PEP 285. Assignment to None was disallowed in 2.4, but not to True/False until python3. Was there a reason None was originally built as a variable rather than a literal?
Let's start with the first question: keywords and literals.

A keyword, in the context of defining the syntax of a language, also known as a reserved word, is something that looks like an identifier in the language, but from the parser's point of view acts like a token of the language. An identifier is defined as a sequence of one or more letters, digits and underscores, not starting with a digit. (This is Python's definition, but many languages, like C or Java, use the same or a very similar definition.)

The important thing to remember about keywords is that a keyword cannot be used to name a variable (or function, class, etc.). Some well-known keywords in Python include 'if', 'while', 'for', 'and', 'or'.
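
A minimal sketch of that rule, assuming Python 3 and the standard-library keyword module. The helper name can_be_assigned is made up for this illustration; it only compiles the statement and never runs it.

```python
import keyword


def can_be_assigned(name):
    """Return True if `name = 0` is accepted by the compiler (hypothetical helper)."""
    try:
        compile(f"{name} = 0", "<example>", "exec")
    except SyntaxError:
        return False
    return True


# Keywords are rejected as assignment targets; an ordinary identifier is not.
for word in ["if", "while", "for", "and", "or", "spam"]:
    print(word, keyword.iskeyword(word), can_be_assigned(word))
```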

A literal, on the other hand, is an element of an expression that describes a constant value. Examples of literals are numbers (e.g. 42, 3.14, or 1.6e-10) and strings (e.g. "Hello, world"). Literals are recognized by the parser, and the exact rules for how literals are parsed are often quite subtle. For example, these are all numeric literals in Python 3:
123
1.0
1.
.01e10
.1e+42
123.456e-100
0xfffe
0o755
but these are not:
. (dot)
e10 (identifier)
0y12 (the literal 0 followed by the identifier y12)
0xffe+10 (the literal 0xffe followed by a plus sign and the number 10; a valid expression, but not a single literal)
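
One way to check examples like these yourself is to ask the standard-library ast module how it parses each string. This is a rough sketch, assuming Python 3.8 or later (where literals show up as ast.Constant nodes); the helper name classify is hypothetical.

```python
import ast


def classify(text):
    """Describe how Python parses `text` as an expression (hypothetical helper)."""
    try:
        node = ast.parse(text, mode="eval").body
    except SyntaxError:
        return "not an expression"
    is_number = (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float, complex))
        and not isinstance(node.value, bool)  # True/False are not numeric literals
    )
    if is_number:
        return "numeric literal"
    # Anything else: report the kind of syntax node, e.g. Name or BinOp.
    return type(node).__name__


for text in ["123", "1.", ".01e10", "0xfffe", "0o755", ".", "e10", "0y12", "0xffe+10"]:
    print(repr(text), "->", classify(text))
```

Note that a string such as 0y12 is not a valid expression at all, so the parser rejects it, whereas 0xffe+10 parses fine as an expression made of two literals and an operator.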
Note the distinction between a constant and a literal. We often write code defining "constants", e.g.
MAX_LEVELS = 15
Here, 15 is a literal, but MAX_LEVELS is not -- it is an identifier, and the all-caps form of the name suggests to the reader that it is probably not changed anywhere in the code, which means that we can consider it a constant -- but this is just a convention, and the Python parser doesn't know about that convention, nor does it enforce it.

On the other hand, the parser won't let you write
15 = MAX_LEVELS
This is because the left-hand side of the assignment operator (=) must be a variable, and a literal is not a variable. (The exact definition of variable is complex, since some things that look like expressions are also considered to be variables, such as d[k], (a, b), and foo.bar -- but not f() or () or 42. This definition of variable is also used by the "del" statement.)
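
The rule about assignment targets can be probed with the built-in compile() function, which parses a statement without executing it (so undefined names such as d or foo do not matter). A minimal sketch, assuming Python 3; the helper name is_valid_statement is made up for this illustration.

```python
def is_valid_statement(source):
    """Return True if `source` compiles as a statement (hypothetical helper)."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


statements = [
    "MAX_LEVELS = 15",   # name on the left: accepted
    "15 = MAX_LEVELS",   # literal on the left: rejected
    "d[k] = 1",          # subscript: accepted
    "(a, b) = 1, 2",     # tuple of names: accepted
    "foo.bar = 1",       # attribute: accepted
    "f() = 1",           # call: rejected
    "del d[k]",          # del uses the same notion of target
    "del f()",           # so this is rejected too
]
for source in statements:
    print(source, "->", is_valid_statement(source))
```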

Now on to None, True and False.

Let's begin with None, because it has always been in the language. (True and False were relatively recent additions -- they first made their appearance in Python 2.2.1, to be precise.) None is a singleton object (meaning there is only one None), used in many places in the language and library to represent the absence of some other value. For example, if d is a dictionary, d.get(k) will return d[k] if it exists, but None if d has no key k. In earlier versions of Python, None was just a "built-in name". The parser had no special knowledge of None -- just like it doesn't have special knowledge of built-in types like int, float or str, or built-in exceptions like KeyError or ZeroDivisionError. All of these are treated by the parser as identifiers, and when your code is being interpreted they are looked up just like any other names (e.g. the functions and variables you define yourself). So from the parser's perspective, the following are treated the same, and the parse tree it produces (<name> = <name>) is the same in each case:
x = None
x = int
x = foobar
On the other hand, the following produce different parse trees (<name> = <literal>):
x = 42
x = 'hello'
because the parser treats numeric and string literals as different from identifiers. Combining this with the earlier MAX_LEVELS example, we can see that if we swap the left and right hand sides, the first three will still be accepted by the parser (<name> = <name>), while the swapped version of the second set will be rejected (<literal> = <name> is invalid).
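
The parse trees can be inspected with the standard-library ast module. The sketch below assumes Python 3.8 or later, so it shows today's behavior rather than the older one described above: int and foobar are still plain names, while None, now a keyword, comes out as a constant just like 42. The helper name value_node is hypothetical.

```python
import ast


def value_node(source):
    """Return the node type name of the right-hand side of an assignment."""
    assignment = ast.parse(source).body[0]
    return type(assignment.value).__name__


for source in ["x = int", "x = foobar", "x = 42", "x = 'hello'", "x = None"]:
    print(source, "->", value_node(source))
```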

The practical consequence is that, if you really want to mess with your readers, you can write code that reassigns built-ins; for example, you could write:
int = float
def parse_string(s):
    return int(s)
print(parse_string('42'))    # Will print '42.0'
Some of you may respond to this with "So what? Reasonable programmers don't write such code." Others may react in the opposite way, saying "Why on earth does the language allow assignment to a built-in name like 'int' at all?!"

The answer is subtle, and has to do with consistency and evolution of the language. I bet that without looking it up you won't be able to give a complete list of all built-in names defined by Python. (I know I can't.) Moreover, I bet that many of you won't recognize every single name on that list. (To see the list, try typing dir(__builtins__) at the Python command prompt.)
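
Outside the interactive prompt, the same list is available through the standard-library builtins module (Python 3), which is more dependable in scripts than the __builtins__ name. A small sketch comparing the number of built-in names with the number of keywords; the variable name builtin_names is just for illustration, and the exact counts depend on the Python version.

```python
import builtins
import keyword

# Public built-in names (skipping dunder/private entries).
builtin_names = [name for name in dir(builtins) if not name.startswith("_")]

print(len(keyword.kwlist), "keywords")
print(len(builtin_names), "public built-in names")
print(sorted(builtin_names)[:10])
```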

Take for example the weird built-ins named copyright, credits or license. They exist so that we can mention them in the greeting shown when you start Python interactively:
Python 3.4.0a4+ (default:0917f6c62c62, Oct 22 2013, 10:55:35)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> credits
Thanks to CWI, CNRI, BeOpen.com, Zope Corporation and a cast of thousands
for supporting Python development.  See www.python.org for more information.
>>> 
In order for this to work, we made them built-ins. But does this mean you shouldn't be allowed to use 'credits' as a variable or parameter name? I think not. Certainly many people don't realize that these esoteric built-ins even exist, and they would be surprised if they were prevented from using them as variable names. From here, it's just a gradual path. Many people write functions or methods with arguments named str or len, or with names like compile or format. Moreover, suppose you wrote some Python 2.5 code where you used bytes as a variable name. In Python 2.6, we added a built-in function named 'bytes' (it's an alias for str, actually). Should your code now be considered invalid? There's no reason for that, and in fact your code will be fine. (Even in Python 3, where bytes is one of the fundamental types.)

On the other hand, you cannot have a variable named 'if' or 'try', because these are reserved words (keywords) that are treated specially by the parser. Because you cannot use these as variable or function names anywhere, ever, in any Python program, everyone using Python has to know about all the reserved words in the language, even if they don't have any need for them. For this reason, we try to keep the list of reserved words small, and the core developers hem and haw a lot before adding a new reserved word to the language.

In fact, many proposed new features have been killed because they would require a new keyword; others have been modified to avoid that need. Also, when we do decide to add a new keyword, we start a deprecation campaign at least one release before the new keyword is introduced, warning developers to choose a different name for their variables. (There's also a trick to allow developers to choose to use the new keyword right away; this is why we have e.g. "from __future__ import with_statement".)

There's no such concern for built-ins. Code that happens to use the name of a new built-in as a variable or function name will continue to function (as long as you don't also try to use the new built-in in the same function). While we still try to be conservative with the introduction of new built-ins, at least we don't have to worry about breaking working code by merely adding something to the language. The (small) price we pay for this is the possibility that some joker intentionally redefines a built-in just to confuse others. But there are tons of other ways to write unreadable code, and I don't see this as a particularly bad problem.
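
A short sketch of both halves of that statement, assuming Python 3. The function names describe and broken are hypothetical. The first function uses bytes as a parameter name and works fine; the second tries to use the built-in len and a local variable called len in the same function, which fails.

```python
import builtins


def describe(bytes):
    # `bytes` is an ordinary parameter here; it shadows the built-in
    # only inside this function.
    return f"{len(bytes)} items"


def broken(data):
    size = len(data)  # fails: `len` is local to this function...
    len = size        # ...because of this assignment further down
    return len


print(describe([1, 2, 3]))
print(bytes is builtins.bytes)  # the built-in itself is untouched

try:
    broken("abc")
except UnboundLocalError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)
```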

So, after this long detour about built-ins vs. keywords, back to None. Why did we eventually make None a reserved word? Frankly, the reasons were perhaps mostly social. Unlike some built-ins and many exceptions, None is so central to using Python that you really can't be using Python without knowing about None. So people were (like our question-asker) "horrified" when they found that assignment to None was actually allowed at all. Worse, there was the concern (whether founded or not) that the way name lookup in Python works, "evaluating" the expression None is slow, because it requires at least two dictionary lookups (all names are looked up in the globals dict before being looked up in the built-ins dict).
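
The lookup order (globals first, then built-ins) can be observed by evaluating an expression against hand-made global namespaces. This is a sketch only, assuming Python 3; the dictionary names are hypothetical, and len stands in for any built-in name.

```python
import builtins

# One namespace shadows the built-in `len` with a global, the other does not.
shadowing_globals = {"__builtins__": builtins, "len": lambda obj: -1}
plain_globals = {"__builtins__": builtins}

shadowed = eval("len('abc')", shadowing_globals)  # found in globals
fallback = eval("len('abc')", plain_globals)      # falls back to built-ins

print(shadowed, fallback)
```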

In the end we decided that there was no downside to making None a keyword (there is no code that actually assigns to it) and it might make some code a tiny bit faster, or catch rare typos. There was still a one-time cost to the developer community (changes to the parser and documentation) but this was small enough that we didn't hesitate very long.

The situation for True/False is a little different. They weren't always part of the language, and many people had invented their own convention. People would define constants named true and false, True and False, or TRUE and FALSE, and use those consistently throughout their code. I don't recall which spelling was most popular, but when we introduced True and False into the language, we definitely did not want to break any packages that were defining their own True and False constants. (One concern was that those packages would have to have a way to continue to run on previous Python versions.)

So, essentially our hand was forced in this case, and we had to introduce True and False as built-in constants, not as keywords. But over time, code defining its own versions of True and False (by whichever name) became more and more frowned upon, and by the time Python 3 came around, when we looked at opportunities for cleaning up the language, we found that it was logical to make True and False keywords, by analogy to None.
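
The end state in Python 3 can be checked directly. A minimal sketch using the standard-library keyword module; the helper name assignment_error is made up, and the exact error text varies between Python versions, so it is printed rather than relied upon.

```python
import keyword


def assignment_error(name):
    """Return the SyntaxError message for `name = 0`, or None if it compiles."""
    try:
        compile(f"{name} = 0", "<example>", "exec")
    except SyntaxError as exc:
        return exc.msg
    return None


# None, True and False are keywords; int is still an ordinary built-in name.
for name in ["None", "True", "False", "int"]:
    print(name, keyword.iskeyword(name), assignment_error(name))
```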

And there you have it. It's all completely logical, once you understand the context. :-) Sorry for the long response; I hope it's been educational.

UPDATE: I still forgot to answer whether None/True/False are literals or keywords. My answer is that they are both. They are keywords because that's how the parser recognizes them. They are literals because that's their role in expressions and because they stand for constant values. One could argue about whether things like {'foo': 42} are literals; personally I'd prefer to give these some other name, because otherwise what would you call {'foo': x+1}? The language reference calls both of these "displays".
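
The distinction shows up in the standard-library ast module as well. The sketch below assumes Python 3.8 or later; the helper name node_type is hypothetical. Note that ast.literal_eval uses a looser notion of "literal" than the one argued for above: it accepts a display whose parts are all constants, and rejects one that contains a name or a computation.

```python
import ast


def node_type(expression):
    """Return the AST node type name for an expression (hypothetical helper)."""
    return type(ast.parse(expression, mode="eval").body).__name__


print(node_type("None"))            # a constant
print(node_type("{'foo': 42}"))     # a dict display
print(node_type("{'foo': x+1}"))    # the same kind of node

print(ast.literal_eval("{'foo': 42}"))
try:
    ast.literal_eval("{'foo': x+1}")
except ValueError as exc:
    print("rejected:", exc)
```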

15 comments:

  1. It is false that this post has not been educational.
    (Forgive me, couldn't resist.)

  2. Moral of the story: When first designing a new language, always make True/False as keywords.

  3. @Bryan Chen:
    That is definitely not the moral of this story. The moral is that languages and their use evolve over time and therefore it is good practice to design and maintain your language to allow for change/improvement.

    In retrospect True/False could have been keywords. By using keywords sparingly and incorporating mechanisms like __future__, the designers left room to fix this.

  4. Having "credits", etc., be variables has always seemed very icky to me. The REPL could have recognised invalid Python as its own commands to provide access to them, similar to SQL REPLs; few expect «python -c 'print credits'» to work.

    Worse, http://bugs.python.org/issue8220 shows how exit() exists without importing sys. I've had others use site.py's exit thinking they were using sys's.

  5. Isn't there an inherent gain for C code that also needs to deal with these values?

  6. Small typo:
    >>> x = 0xfffe
    >>> x
    65534
    >>> x = 0xfffe+10
    >>> x
    65544
    >>>

    Replies
    1. It isn't a literal, but unlike all the others is valid.

  7. Some time ago I asked on Stack Overflow why the methods of Python's built-ins can't be overridden, and why you can't add new methods to built-ins.

    The best answer I got was that if it were allowed, libraries that use these built-ins might give bad results, because it would break the assumptions those libraries made about the built-ins. But since you say yourself that built-in names such as int, str... CAN be overridden, I see no reason why they can't just be extended with new methods, or have their default behaviour overridden. Furthermore, I believe that libraries (external modules) have their own namespace, and once a function is defined inside a module to use a certain object, it can't be overridden from another module:

    module1.py:
    print(int)

    module2.py:
    int = 5
    import module1

    This will print "", (which is the int class string representation), and not "5".

    Here is my question @ SO (it starts with a different question in mind, but very soon I got to the point I made here):
    http://stackoverflow.com/questions/19083160/can-literals-in-python-be-overridden

    Guido, I'd love to get your statement on that matter.

  8. @RB14: It's not the same. When you reassign e.g. 'str', you are messing only with your *own* code (the scope of the assignment). When you redefine the built-in itself, you are messing with *everyone's* code.

    Replies
    1. Well, clearly I didn't think this through. I mistakenly mixed up variables and objects... Thanks for clarifying.

  9. Very educational indeed, thanks Guido.

  10. Tangential question (maybe better asked at StackOverflow):

    I was quite surprised that:

    for x in None:
        print x

    raised: TypeError: 'NoneType' object is not iterable

    My (naive) expectation was the loop would work like 'for x in () :' aka pass right over nothing.

    Why would I do this? Well it's quite common (AFAICT) to return None in functions. It's also common to iterate over the results of a function. It'd be 'nice' if I could:

    for x in function_possibly_returning_None():
        print x

    rather than

    myiterable = function_possibly_returning_None()
    if myiterable:
        for x in myiterable:
            print x

  11. Jason, it seems wrong that a function could either return an iterable, e.g. list, or None. Why would it not return an empty list, tuple, etc? A function returning None to indicate none probably returns a non-iterable for non-none.

  12. To Jason's question, I wonder how the BDFL would view:

    for x in None or (): print(x)

    as a workaround.
