Amazing substring behaviour

In a recent code review at my workplace I found a piece of C# code that contained something along these lines:

string foo = "bar";
string substring = foo.Substring(3);

Clearly index position 3 is beyond the end of the string, so I thought I had found a bug and was about to flag the code. Then it occurred to me: why had the unit tests not failed during the gated check-in build?

Referring to the documentation I found that the String.Substring() method indeed returns an empty string when an index position is specified that is exactly equal to the length of the string.

I’m not new to C# and .NET, so I was quite surprised to find this behaviour in such a basic library function. In a scripting language such as AWK a lax and forgiving API would not surprise me, but in a strongly typed programming language such as C# I expect things to be stricter. Personally I find the behaviour inconsistent and irritating, because references to other illegal index positions such as these

string foo = "bar";
string substring1 = foo.Substring(3, 1);
string substring2 = foo.Substring(4);

both do throw an ArgumentOutOfRangeException!

How do other programming languages behave? I fired up the online compiler Ideone in the browser and made a few comparisons…

C#

using System;

public class Test
{
  public static void Main()
  {
    string foo = "bar";
    string substring1 = foo.Substring(3);     // ok, empty string
    string substring2 = foo.Substring(3, 0);  // ok, empty string
    string substring3 = foo.Substring(3, 1);  // ArgumentOutOfRangeException
  }
}

C++

#include <iostream>
#include <string>

int main()
{
  std::string foo = "bar";
  std::string substring1 = foo.substr(3);      // ok, empty string
  std::string substring2 = foo.substr(3, 0);   // ok, empty string
  std::string substring3 = foo.substr(3, 1);   // ok, empty string

  return 0;
}

Objective-C

#import <Foundation/Foundation.h>

int main()
{
  NSString* foo = @"bar";
  // ok, empty string
  NSString* substring1 = [foo substringFromIndex:3];
  // ok, empty string
  NSString* substring2 = [foo substringWithRange:NSMakeRange(3, 0)];
  // NSRangeException
  NSString* substring3 = [foo substringWithRange:NSMakeRange(3, 1)];

  return 0;
}

Java

import java.util.*;
import java.lang.*;
import java.io.*;

class Ideone
{
  public static void main (String[] args) throws java.lang.Exception
  {
    String foo = "bar";
    String substring1 = foo.substring(3);     // ok, empty string
    String substring2 = foo.substring(3, 3);  // ok, empty string
    String substring3 = foo.substring(3, 4);  // StringIndexOutOfBoundsException
  }
}

Conclusion

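Three out of the four languages that I examined behave the same. Only C++, of all things, is more tolerant than the other languages and doesn’t barf even when a length parameter > 0 is specified: substr() silently clamps the length to whatever remains of the string. A start position greater than the string length still throws std::out_of_range, though.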

And the rationale?

I can only speculate why standard library API designers all over the world should agree that an illegal string index position must be allowed for exactly one border case: when the index position is equal to the string length.
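To make the border case concrete, here is a minimal Java sketch (Java being one of the languages compared above) that probes every start index from 0 up to one past the string length. The class name BoundaryProbe and the helper probe() are made up for this illustration; only String.substring() is real library API.

```java
public class BoundaryProbe
{
  // Hypothetical helper: returns the substring in double quotes,
  // or the simple name of the exception that substring() threw.
  static String probe(String s, int beginIndex)
  {
    try
    {
      return "\"" + s.substring(beginIndex) + "\"";
    }
    catch (IndexOutOfBoundsException e)
    {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args)
  {
    String foo = "bar";
    // Walk from index 0 to length + 1 to see where the behaviour changes
    for (int i = 0; i <= foo.length() + 1; i++)
    {
      System.out.println(i + " -> " + probe(foo, i));
    }
  }
}
```

Indices 0 to 2 yield "bar", "ar" and "r", index 3 yields the empty string, and only index 4 produces a StringIndexOutOfBoundsException.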

One speculation is that it might be convenient for implementing certain loops that iterate over the content of a string.
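As a sketch of what such a loop might look like (this is my guess at a use case, not something the documentation states): enumerating every way to cut a string into a prefix and a suffix requires the cut position to run from 0 up to and including the string length. Because a start index equal to the length is accepted, the last iteration needs no special case. The class name SplitPoints and the method allSplits() are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPoints
{
  // Returns every prefix|suffix split of s, including the two splits
  // where one half is empty.
  static List<String> allSplits(String s)
  {
    List<String> result = new ArrayList<>();
    for (int cut = 0; cut <= s.length(); cut++)
    {
      String prefix = s.substring(0, cut);
      // When cut == s.length() this returns "" instead of throwing
      String suffix = s.substring(cut);
      result.add(prefix + "|" + suffix);
    }
    return result;
  }

  public static void main(String[] args)
  {
    System.out.println(allSplits("bar"));  // prints [|bar, b|ar, ba|r, bar|]
  }
}
```

If substring(3) threw for "bar", the loop would need an extra branch for the final cut position.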

Another speculation is that null-terminated C strings are at work in the background. The C string “foo” looks like this in memory:

characters:       f  o  o  \0
index positions:  0  1  2  3

So one might argue that index position 3 refers to the terminating null byte. But why expose this in the API of a programming language’s standard library when that programming language does not also expose the concept of the string terminating null byte?

It’s a mystery.