Team Oriented Tokenizing

Tuesday, October 12, 2004

The trouble with the strtok routine is that it uses hidden per-process data to keep track of the tokenizing: if you call some other code after the first call to strtok, you can never be sure that this code won't start a tokenizing session of its own, invalidating your sequence.
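To make this concrete, here's a minimal sketch of such a collision. The names count_words and count_records are made up for the illustration; the only assumption is that some innocent-looking helper uses strtok internally.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: counts the words in a line, in place, using strtok. */
static int count_words(char *line)
{
  int n = 0;
  char *w;
  for (w = strtok(line, " "); w != NULL; w = strtok(NULL, " "))
    n++;
  return n;
}

/* Hypothetical caller: splits text at ';' and hands each record to
   count_words. Returns the number of records the outer loop got to see. */
static int count_records(char *text)
{
  int records = 0;
  char *rec;
  for (rec = strtok(text, ";"); rec != NULL; rec = strtok(NULL, ";"))
  {
    records++;
    count_words(rec);   /* starts its own session and overwrites strtok's state */
  }
  return records;
}
```

Called on a writable copy of "a b;c d;e f", count_records reports one record instead of three: when count_words is done, strtok's saved position sits at the end of the first record, so the outer loop's next strtok(NULL, ";") finds nothing more.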

But I'm not calling any other code while using strtok!

But someone else who's calling your code may be in the middle of a tokenizing party; then that someone has the problem.

So what, it's her problem if she calls code that she doesn't control! RTM!

It's this kind of response that gives me the creeps. Shed some blame, and everything's all right if it's not your fault. Or, to put it more favourably, if you understand the problem, it's just about solved. True, but the point is not to have an intellectual exercise in API mastery, but to avoid problems in the first place. The source of a bug is rarely so obvious when all you've got is a core dump on the customer's MVS machine.

So let's find an alternative to strtok. You could use some library to split a string into a list. But such a list would be allocated dynamically, which means you have to take care to clean it up (especially during error conditions). It's also less efficient.
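For comparison, here's roughly what the list approach costs. This is a sketch only: token_list, token_list_split and token_list_free are hypothetical names, not taken from any particular library. Note how much of the code is about allocation failures and cleanup.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical list type: an array of heap-allocated token copies. */
typedef struct
{
  char **items;
  size_t count;
} token_list;

/* Frees everything token_list_split allocated; safe on a partly built list. */
static void token_list_free(token_list *list)
{
  size_t i;
  for (i = 0; i < list->count; i++)
    free(list->items[i]);
  free(list->items);
  list->items = NULL;
  list->count = 0;
}

/* Splits string at sep, skipping empty tokens. Returns 0 on success,
   -1 if an allocation failed (the list is then already cleaned up). */
static int token_list_split(token_list *list, const char *string, char sep)
{
  list->items = NULL;
  list->count = 0;
  while (*string != '\0')
  {
    const char *end;
    char **grown;
    char *copy;
    size_t len;
    if (*string == sep) { string++; continue; }
    end = strchr(string, sep);
    len = (end != NULL) ? (size_t)(end - string) : strlen(string);
    grown = realloc(list->items, (list->count + 1) * sizeof *grown);
    if (grown == NULL) { token_list_free(list); return -1; }
    list->items = grown;
    copy = malloc(len + 1);
    if (copy == NULL) { token_list_free(list); return -1; }
    memcpy(copy, string, len);
    copy[len] = '\0';
    list->items[list->count++] = copy;
    string += len;
  }
  return 0;
}
```

Every caller has to remember token_list_free on every path out of the function, including the error paths, which is exactly the bookkeeping I'd like to avoid.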

Since I love macro wars (tongue-in-cheek), here's a first try at using just the stack for a recursion-safe tokenizer loop:

#define TOKENS_FOREACH(SZCUR, SIZECUR, STRING, SEP)                 \
{                                                                   \
  char * __psz = (STRING);                                          \
  char * __pTmp = NULL;                                             \
  while(*__psz == (SEP))                                            \
    __psz++;                                                        \
  __pTmp = strchr(__psz, (SEP));                                    \
  if(NULL == __pTmp && strlen(__psz) > 0)                           \
    __pTmp = __psz + strlen(__psz);                                 \
  while(__pTmp != NULL)                                             \
  {                                                                 \
    char SZCUR [ SIZECUR ] = {0};                                   \
    if((__pTmp - __psz) >= (SIZECUR))                               \
    {                                                               \
      fprintf(stderr, "TOKENS_FOREACH: "                            \
              "Buffer too small: size <%d>, token length <%d>.\n",  \
              (int)(SIZECUR), (int)(__pTmp - __psz));               \
      break;                                                        \
    }                                                               \
    strncpy(SZCUR, __psz, __pTmp - __psz);                          \
    SZCUR[__pTmp - __psz] = '\0';                                   \
    {                                                 /* end macro */

#define TOKENS_FOREACH_END(SEP)                                     \
    }                                                               \
    while(*__pTmp == (SEP))                                         \
      __pTmp++;                                                     \
    if(*__pTmp != '\0')                                             \
    {                                                               \
      __psz = __pTmp;                                               \
      __pTmp = strchr(__psz, (SEP));                                \
      if(NULL == __pTmp)                                            \
        __pTmp = __psz + strlen(__psz);                             \
    }                                                               \
    else                                                            \
      __pTmp = NULL;                                                \
  }                                                                 \
}                                                     /* end macro */

You may object to the use of macros to introduce new control-flow elements, but I'd say in this case it's worth it. You might want to adjust the error handling to your standards. Note also that the thing only works with a single separator character, and that the macros need <stdio.h> and <string.h>. You can use these macros like this:

TOKENS_FOREACH(szCur, 100, pszString, ' ')
{
  printf("%s\n", szCur);
}
TOKENS_FOREACH_END(' ')

How do we test the macros? To me, the most important thing when testing (and not just when testing) is to look left and right and consider as many cases as possible. One such case is a NULL pointer for the STRING parameter; the macros don't handle that. Otherwise, we should test empty strings, strings with leading and trailing separator chars, tokens separated by more than one separator, and one-element strings:

""
"  "
" A"
"A "
" A "
"A"
"A B"
"A  B"
" A   B  C D"

It's these "special cases" that most often don't get tested.
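Here's one way the cases above could be turned into a small table-driven check. It's a sketch: collect_tokens and run_token_tests are names I made up, the expected values come from walking through the macros by hand, and the macros are repeated (with the token length cast to int for the error message) only so that the file compiles on its own.

```c
#include <stdio.h>
#include <string.h>

#define TOKENS_FOREACH(SZCUR, SIZECUR, STRING, SEP)                 \
{                                                                   \
  char * __psz = (STRING);                                          \
  char * __pTmp = NULL;                                             \
  while(*__psz == (SEP))                                            \
    __psz++;                                                        \
  __pTmp = strchr(__psz, (SEP));                                    \
  if(NULL == __pTmp && strlen(__psz) > 0)                           \
    __pTmp = __psz + strlen(__psz);                                 \
  while(__pTmp != NULL)                                             \
  {                                                                 \
    char SZCUR [ SIZECUR ] = {0};                                   \
    if((__pTmp - __psz) >= (SIZECUR))                               \
    {                                                               \
      fprintf(stderr, "TOKENS_FOREACH: "                            \
              "Buffer too small: size <%d>, token length <%d>.\n",  \
              (int)(SIZECUR), (int)(__pTmp - __psz));               \
      break;                                                        \
    }                                                               \
    strncpy(SZCUR, __psz, __pTmp - __psz);                          \
    SZCUR[__pTmp - __psz] = '\0';                                   \
    {                                                 /* end macro */

#define TOKENS_FOREACH_END(SEP)                                     \
    }                                                               \
    while(*__pTmp == (SEP))                                         \
      __pTmp++;                                                     \
    if(*__pTmp != '\0')                                             \
    {                                                               \
      __psz = __pTmp;                                               \
      __pTmp = strchr(__psz, (SEP));                                \
      if(NULL == __pTmp)                                            \
        __pTmp = __psz + strlen(__psz);                             \
    }                                                               \
    else                                                            \
      __pTmp = NULL;                                                \
  }                                                                 \
}                                                     /* end macro */

/* Hypothetical helper: joins the tokens of pszString with '|' into pszOut
   and returns the number of tokens found. The macro only reads the string
   but declares its cursor as char *, hence the cast. */
static int collect_tokens(const char *pszString, char *pszOut, size_t cbOut)
{
  int nTokens = 0;
  pszOut[0] = '\0';
  TOKENS_FOREACH(szCur, 100, (char *)pszString, ' ')
  {
    /* room for the text so far, a '|', the token and the terminator */
    if (strlen(pszOut) + strlen(szCur) + 2 <= cbOut)
    {
      if (nTokens > 0)
        strcat(pszOut, "|");
      strcat(pszOut, szCur);
    }
    nTokens++;
  }
  TOKENS_FOREACH_END(' ')
  return nTokens;
}

/* Runs the cases listed above; returns the number of failed cases. */
static int run_token_tests(void)
{
  static const struct
  {
    const char *pszInput;
    int nExpected;
    const char *pszJoined;
  } aCases[] = {
    { "",            0, ""        },
    { "  ",          0, ""        },
    { " A",          1, "A"       },
    { "A ",          1, "A"       },
    { " A ",         1, "A"       },
    { "A",           1, "A"       },
    { "A B",         2, "A|B"     },
    { "A  B",        2, "A|B"     },
    { " A   B  C D", 4, "A|B|C|D" }
  };
  int nFailed = 0;
  size_t i;
  for (i = 0; i < sizeof aCases / sizeof aCases[0]; i++)
  {
    char szJoined[64];
    int nTokens = collect_tokens(aCases[i].pszInput, szJoined, sizeof szJoined);
    if (nTokens != aCases[i].nExpected
        || strcmp(szJoined, aCases[i].pszJoined) != 0)
    {
      fprintf(stderr, "FAILED: \"%s\" gave %d token(s): \"%s\"\n",
              aCases[i].pszInput, nTokens, szJoined);
      nFailed++;
    }
  }
  return nFailed;
}
```

Call run_token_tests() from main and use its return value as the exit status. The NULL-pointer case is deliberately left out of the table, since the macros would crash on it.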