Thursday, October 24, 2013

Why Python uses 0-based indexing

I was asked on Twitter why Python uses 0-based indexing, with a link to a new (fascinating) post on the subject (http://exple.tive.org/blarg/2013/10/22/citation-needed/). I recall thinking about it a lot; ABC, one of Python's predecessors, used 1-based indexing, while C, the other big influence, used 0-based. My first few programming languages (Algol, Fortran, Pascal) used 1-based or variable-based. I think that one of the issues that helped me decide was slice notation.

Let's first look at use cases. Probably the most common use cases for slicing are "get the first n items" and "get the next n items starting at i" (the first is a special case of that for i == the first index). It would be nice if both of these could be expressed as without awkward +1 or -1 compensations.

Using 0-based indexing, half-open intervals, and suitable defaults (as Python ended up having), they are beautiful: a[:n] and a[i:i+n]; the former is long for a[0:n].

Using 1-based indexing, if you want a[:n] to mean the first n elements, you either have to use closed intervals or you can use a slice notation that uses start and length as the slice parameters. Using half-open intervals just isn't very elegant when combined with 1-based indexing. Using closed intervals, you'd have to write a[i:i+n-1] for the n items starting at i. So perhaps using the slice length would be more elegant with 1-based indexing? Then you could write a[i:n]. And this is in fact what ABC did -- it used a different notation so you could write a@i|n.(See http://homepages.cwi.nl/~steven/abc/qr.html#EXPRESSIONS.)

But how does the index:length convention work out for other use cases? TBH this is where my memory gets fuzzy, but I think I was swayed by the elegance of half-open intervals. Especially the invariant that when two slices are adjacent, the first slice's end index is the second slice's start index is just too beautiful to ignore. For example, suppose you split a string into three parts at indices i and j -- the parts would be a[:i], a[i:j], and a[j:].

So that's why Python uses 0-based indexing.

15 comments:

  1. To anyone who prefers 1-based indexing: you are the same people who screwed up the calendar, starting at year 1 and calling the 1900s the 20th century (and arguing that the year 2001 ought to be the start of the next millennium). Haven't you done enough damage? :-)

    ReplyDelete
    Replies
    1. You seriously think it makes more sense to have a year 0? If the FIRST year isn't year 1, then you end up with the SECOND year being year 1. Yeah, that makes sense. NOT.

      Delete
    2. Time has the notation of 1:?? in the second hour of the day. What's your issue with the years again?

      Delete
  2. Dijkstra had a monograph about this: https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html

    Also I should point out that the correct first number is zero, and kids should be taught to count "0, 1, 2, ...". It fixes a lot of problems that you get if you start a 1. In this light, there's no argument at all for starting indexing at 1.

    ReplyDelete
  3. As I pointed out in my G+ post, Dijkstra did not consider start:size as an option. Sadly, that's the way C++ seems to be going (https://www.cilkplus.org/tutorial-array-notation).

    ReplyDelete
    Replies
    1. That Cilk thing goes against the entire STL by not using .start() .end() half-closed intervals. What were these guys thinking, or were they even thinking at all.

      Delete
  4. Y combinator's thread on this topic mention that in 1985 Inside Macintosh explained why graphics coordinates are 0-based: https://news.ycombinator.com/item?id=6601515 (search for "Macintosh").

    ReplyDelete
  5. If you count objects 0, 1, 2 you wind up with a value that's smaller than the number of objects. 0-based indexing works because it is not ordinal.

    ReplyDelete
  6. Your mention of graphics coordinates reminds of another index holy war about the location of the origin: top-left or bottom-left?

    I like ot's description of indices as boundaries rather than labels. I think it gives both the 0th and 1st camps something to hang their hats on :).

    ReplyDelete
  7. Except the indices as boundaries analogy is broken. :-(

    "abcdefghijklmnop"[:3:-1]

    Note that it ends in e, not in d as it should if you consider 3 as a boundary.

    When I first encountered this, I thought it was a bug. Boundaries metaphor was something I deeply embraced.

    Maybe it could stay that way? Guido, can slice semantics with negative stride be changed?

    ReplyDelete
  8. I agree that slices with negative strides are confusing. We can't change them (it would break too much code) but you can usually avoid them by using the reversed() function instead, e.g. ''.join(reversed("abcdefghijklmnop"[3:])). Still not ideal, but how often do you need a string backwards (apart from palindrome puzzles :-)?

    ReplyDelete
  9. You've seen my code on CheckIO. More often than you think. :-P

    But seriously... not only strings, any indexable iterables have same problem. And you haven't seem to have problems with breaking code (especially without any confirmed examples) before. At least in Py4K? :-)

    ReplyDelete
  10. Over on python-ideas we're discussing negative strides now. The best way to use them is to also use negative indexes. E.g. a[::-1] == a[-1:-1-len(a):-1].

    ReplyDelete
  11. This is an old post and interesting, but I think there is a hardware angle you aren't considering in indexing. I'm a EE prof, worked at IBM and Motorola previously. Hardware wants zero indexing. Think about a logical decoder (heart of SRAM memory driving word lines for instance). If you have a four bit decoder, you have options of 0 to 15 as the input so you drive out wordline0 to wordline15. 16 doesn't fit. Also, in a reverse FOR loop with a terminating value of zero, you can use the Z bit (zero flag - NAND of all index bits) for free. Many early assembly languages had instructions for this to avoid an explicit check for termination. Interesting to hear the CS perspective.

    ReplyDelete
  12. Hi Eric, thanks for adding that point. In high school I built and used simple digital logic and in college I used CDC assembler, so I'm well aware of the hardware preference for 0-based indexing. This is also of course where C's 0-base indexing comes from. But Python, priding itself in being a high-level language, needs to be able to motivate its position in this matter without reference to the implementation.

    ReplyDelete

Note: Only a member of this blog may post a comment.