[Collections]

Introduction

Sunday, April 1, 2001

In this topic, I'll talk about how you collect things in VB.NET. There are significant changes in the way we deal with arrays, dictionaries, and other collections. Much has changed for the better, but if you're looking for VBA.Collection, you might be a little disappointed. I'll talk about approaches to help you out with that.

New classes

You'll find the relevant .NET framework classes in the System.Collections namespace in mscorlib.dll and in System.Collections.Specialized in System.dll. When you create a new project, both should be referenced by default. If you want to try things out, you might want to import the whole bunch into your code file. There's also a very familiar class in the namespace Microsoft.VisualBasic (in the library Microsoft.VisualBasic.dll), but I'll get to that later.

In VB5/6 all we had was Collection, Dictionary, and the humble array. Although Collection's interface was pretty dull, it was, thanks to the hacks and workarounds possible, the premier workhorse for our collection tasks (you'll find my own general-purpose workaround in the VB6Core component). Dictionary was a half-baked copy of Perl's associative array. It was tempting to replace Collection with it (in VBScript, Collection didn't even exist), but it was fundamentally different. Arrays weren't objects; we'll see later how they work now.

Now, all collections are classes. There are several interfaces that they implement, so even though the object browser seems more bloated now, you can handle the stuff easily. When you create your own collections, you'll primarily implement ICollection and IEnumerator. But mostly, you'll just inherit from an existing class (there are also special MustInherit / Abstract classes). It beats the heck out of delegating to Collection and setting the proc ID of NewEnum to negative four. Sorting now means calling a method, perhaps providing a comparer, but you don't need to write your own sort algorithm.
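To make the point about sorting concrete, here's a minimal sketch using ArrayList. The LengthComparer class and the sample names are my own inventions for illustration, not part of the framework; the Sort overloads themselves are the ones ArrayList provides.

```vbnet
Imports System
Imports System.Collections

' Hypothetical comparer: orders strings by length, then by ordinal value
Class LengthComparer
   Implements IComparer

   Public Function Compare(ByVal x As Object, ByVal y As Object) As Integer _
     Implements IComparer.Compare
      Dim a As String = CStr(x)
      Dim b As String = CStr(y)
      If a.Length <> b.Length Then
         Return a.Length.CompareTo(b.Length)
      End If
      Return String.CompareOrdinal(a, b)
   End Function
End Class

Module SortDemo
   Sub Main()
      Dim names As New ArrayList()
      names.Add("Sue")
      names.Add("Alexander")
      names.Add("John")

      names.Sort()                       ' default order, via IComparable
      Console.WriteLine(CStr(names(0)))  ' Alexander

      names.Sort(New LengthComparer())   ' custom order, shortest first
      Console.WriteLine(CStr(names(0)))  ' Sue
   End Sub
End Module
```

The comparer is the only code you write yourself; the sort algorithm stays inside the framework.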

Classes ready to use

So let's see what we got. Here's a short overview of the most important classes; I grouped them by the major characteristics:

Base classes

As I mentioned, there are also a number of abstract (MustInherit) classes you can use. Here are three examples you might find useful:

Interfaces

There are several interfaces used with collections. Mostly, you don't need to implement them, but you should definitely consider using them. Here's a short description of most:

What do we do with all these interfaces? Well, it appears more confusing than it really is. When you create your own collection from scratch, you're just concerned with ICollection and IEnumerator. Otherwise you can extend existing classes. If you just use them, you'll find that they all work similarly, but watch out for enumeration order and keys vs. values.

Arrays

Nature of arrays

Arrays are reference types deriving from the class System.Array. An array of a reference type stores references, whereas an array of values stores the values directly: there is no boxing when you have an array of, say, Doubles; however, an array of System.Object or of System.ValueType can store references pointing to boxed values. You cannot assign an array of values to a reference of type "Array of System.Object" or even "Array of System.ValueType", because the elements in the array aren't boxed.

For copying an array, use the Shared Copy method of the System.Array class.
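Here's a small sketch of Array.Copy in action. The variable names are just for illustration; the point is that the target is an independent array once the copy is done.

```vbnet
Imports System

Module CopyDemo
   Sub Main()
      Dim source() As Integer = New Integer() {1, 2, 3}
      Dim target() As Integer = New Integer(source.Length - 1) {}

      ' Copy all elements; target is a separate array afterwards
      Array.Copy(source, target, source.Length)
      target(0) = 99

      Console.WriteLine(source(0))   ' 1
      Console.WriteLine(target(0))   ' 99
   End Sub
End Module
```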

Declarations and initialization

You use arrays differently than before. Now, you can only specify the size of the array, not its lower bound (Option Base is gone, "n To m" is gone, too). When you put a number in the parens after the identifier, it identifies the upper bound of the array (like in VB6 - think end-point-inclusive, which is unusual), so a declaration with 9 in the parens gives you ten elements, indexed 0 through 9. ReDim cannot change the number of dimensions in the array. You cannot use ReDim to initially declare an array.

Declaring array references

There is now a new way to declare an array; the following two lines mean the same:

Dim animals() As String
Dim animals As String()

The new syntax lets you think: "String array". Strictly speaking, the preceding two lines of code only declare references, which are initialized to Nothing.

Creating arrays by specifying the upper bound

Specifying the size of an array in the declaration actually means declaring a reference, creating an array of the given size, and assigning it to the reference.

Dim flowers(9) As CFlower
Dim flowers As CFlower() = New CFlower(9) {}

You cannot combine declaration and array creation like you can do when instantiating a class (using the "As New" syntax). Similarly, the ReDim statement translates like this:

ReDim flowers(19)
flowers = New CFlower(19) {}

The last line is more in sync with what's really happening, so I find this syntax preferable. ReDim does not resize an existing array, it creates a new one; ReDim Preserve does the same, plus copies the original contents to the new array. The operation assigns the new array to the reference, so although arrays are reference types, the change is not reflected in other references to the original array.
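A quick sketch of what that means in practice. The two variables are hypothetical; the second one simply keeps pointing at the original array after ReDim Preserve has replaced the first.

```vbnet
Imports System

Module ReDimDemo
   Sub Main()
      Dim first() As Integer = New Integer() {1, 2, 3}
      Dim second() As Integer = first      ' both refer to the same array

      ReDim Preserve first(4)              ' new 5-element array, contents copied

      Console.WriteLine(first.Length)      ' 5
      Console.WriteLine(second.Length)     ' 3
      Console.WriteLine(first Is second)   ' False
   End Sub
End Module
```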

The fact that VB uses parens everywhere necessitates the curly braces after the "New" statement, because the array creation expression must be distinguished from a constructor call. If VB.NET allowed (or even required) the "LBound To UBound" syntax, things would be much clearer.

people = new CPerson[100];    // C# array syntax
people = New CPerson(0 To 99) ' I wish ...
people = New CPerson(99) {}   ' sad VB.NET reality

Creating arrays by inline initialization

Also, arrays can now be initialized right away (although then you don't specify a size, you simply initialize all elements, and the compiler figures out the size needed):

Dim days() As String = New String() {"Monday", "Tuesday", "Wednesday", _
                       "Thursday", "Friday", "Saturday", "Sunday"}

Multidimensional arrays

Multidimensional arrays work like this:

Dim ints As Integer(,) ' just a reference
Dim ints(1, 1) As Integer
Dim ints As Integer(,) = New Integer(1, 1) {}
Dim ints As Integer(,) = New Integer(,) {{1, 2}, {3, 4}}

Note that you can also explicitly create an empty array (by using the "New" syntax and removing the upper bound literals).

Ragged arrays

Create arrays of arrays this way:

Dim shorts As Short()()                     ' just a reference
Dim shorts(1)() As Short                    ' can only dim top rank
Dim shorts As Short()() = New Short(1)() {} ' ditto (slots are Nothing)
Dim shorts As Short()() = New Short()() {New Short() {}, New Short() {}}

Again, you can create empty arrays, but don't confuse this with Nothing references in the second and third lines.
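The difference between an empty inner array and a Nothing slot is easy to check. This is only a sketch with made-up variable names, using the same two creation forms as in the listing.

```vbnet
Imports System

Module JaggedDemo
   Sub Main()
      ' Two slots, each still Nothing
      Dim slotsOnly As Short()() = New Short(1)() {}
      ' Two slots, each holding an empty array
      Dim filled As Short()() = New Short()() {New Short() {}, New Short() {}}

      Console.WriteLine(slotsOnly.Length)         ' 2
      Console.WriteLine(slotsOnly(0) Is Nothing)  ' True
      Console.WriteLine(filled(0) Is Nothing)     ' False
      Console.WriteLine(filled(0).Length)         ' 0
   End Sub
End Module
```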

Mixing array types

Dim chars()(,) As Char ' just a reference
Dim chars(1)(,) As Char
Dim chars()(,) As Char = New Char(1)(,) {}
Dim chars()(,) As Char = New Char()(,) {New Char(,) {}, New Char(,) {}}

Enough said.

Array bounds

Anyway, now that all arrays start at zero, I might add that all collections now start at zero as well. In my opinion, the default should be one, for sane people start counting at one, and Basic is a language for sane people. I also think it's gratuitous nonsense to remove user-defined bounds from the language. VB loses some easy flexibility just because of harmonizing with C and Java. There is, however, something to be said in favor of having a consistent starting point for indexing both arrays and collections (as a default or as a convention, anyway) - I just wish we had other choices as well, for those cases where it's convenient and where cross-language interop doesn't matter.

Array covariance

For now, see "Changes in VB.NET".

Arrays in structures

For now, see "Changes in VB.NET".

Using collections

Let's get familiar with the new collections. You'll find that their methods and properties are a lot more powerful than VBA.Collection's gang of four. But, when you browse the help files, you'll also notice that there is a heavy flavour of the dictionary/hashtable/map philosophy: keys everywhere, but not necessarily a defined order of entries.

Let's start with enumerations (other operations should be fairly easy). You know how For-Each works (assuming a magic collection called NCollection):

Dim col As New NCollection("John Doe", "Sue Doe"), s As String
For Each s In col
   Console.WriteLine(s)
Next s

For-Each is a nice feature that other languages lack, but now VBers can get their hands dirty, too, if they like. First, get an enumerator object:

Dim ce As IEnumerator
ce = col.GetEnumerator()
Do While ce.MoveNext
   Console.WriteLine(ce.Current)
Loop

An interesting aspect is that you call MoveNext before you deal with the very first item. With Do-Loop, you could test at the end of the loop, but it just works differently (just consider an empty collection). Ah, forgot to say, MoveNext returns True as long as the current position is valid.
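To see why MoveNext comes first, here's a sketch that uses a plain ArrayList instead of the magic NCollection (that substitution is mine, so the block runs on its own).

```vbnet
Imports System
Imports System.Collections

Module MoveNextDemo
   Sub Main()
      Dim empty As New ArrayList()
      Dim e As IEnumerator = empty.GetEnumerator()
      ' Nothing to visit: the very first MoveNext already fails
      Console.WriteLine(e.MoveNext())   ' False

      Dim one As New ArrayList()
      one.Add("only item")
      e = one.GetEnumerator()
      Console.WriteLine(e.MoveNext())   ' True, now positioned on the item
      Console.WriteLine(e.Current)      ' only item
      Console.WriteLine(e.MoveNext())   ' False, past the end
   End Sub
End Module
```

A loop that tested at the end would have to touch Current before knowing whether there is anything to read, which is exactly what the empty case forbids.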

If you're using Dictionary or Hashtable, or anything else that gives disproportionate importance to keys, the enumerator has three more properties: Key, Value, and Entry. The latter is of type DictionaryEntry, which is a structure consisting of Key and Value. It can be convenient to pass that around, but it's shorter to use Key or Value directly; this explains the ostensible redundancy. As I've mentioned, you can also call Keys or Values, and then use the enumerator you get from the interface (ICollection) that these properties return to iterate with a "normal" IEnumerator.
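Here's a minimal sketch of both approaches with a Hashtable. The sample data is invented, and I've left out expected output because a Hashtable makes no promise about enumeration order.

```vbnet
Imports System
Imports System.Collections

Module DictionaryEnumDemo
   Sub Main()
      Dim ages As New Hashtable()
      ages.Add("John", 42)
      ages.Add("Sue", 39)

      ' Hashtable.GetEnumerator returns an IDictionaryEnumerator
      Dim de As IDictionaryEnumerator = ages.GetEnumerator()
      Do While de.MoveNext()
         Dim entry As DictionaryEntry = de.Entry
         Console.WriteLine("{0} = {1}", de.Key, de.Value)
         Console.WriteLine("{0} = {1}", entry.Key, entry.Value) ' same pair
      Loop

      ' Values only, through the ICollection returned by Values
      Dim ve As IEnumerator = ages.Values.GetEnumerator()
      Do While ve.MoveNext()
         Console.WriteLine(ve.Current)
      Loop
   End Sub
End Module
```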

Rolling your own

The most important thing is implementing a way for clients to enumerate. They can use For-Each loops on collections, but of course you're interested in what goes on behind the scenes. How do For-Each loops work? The loop body is executed exactly as many times as there are entries in the collection; the client need not know how many. The power of that control structure also stems from the fact that you automatically get a reference to (or a copy of, depending on the types) each object in the collection. That's more efficient if you need to access several properties of an object (also note that with inheritance on collections and the high-level nature of the framework, there can be many levels of indirection, slowing down member access).

The For-Each control structure calls a method (traditionally named [_NewEnum], EnumObjects, or GetEnumerator) on the collection; this method returns an object called an enumerator. For-Each knows that this method is called GetEnumerator because that's a standard stipulated by the IEnumerable interface, which all collections implement (by way of implementing ICollection, which derives from it). GetEnumerator creates a new enumerator object; the enumerator object knows about the collection object because it's created by that collection object. Here's an implementation of GetEnumerator:

Class NCollection
   Implements IEnumerable

   Public Function GetEnumerator() As IEnumerator _
     Implements IEnumerable.GetEnumerator
       Return New NCollectionEnumerator(Me)
   End Function

End Class

Here, the collection class NCollection implements the IEnumerable interface, which has only one member, GetEnumerator (note it could also implement ICollection). GetEnumerator is typed as IEnumerator, because clients (using For-Each loops) expect it that way. So the object returned must implement that interface. Here, NCollectionEnumerator does that. When the enumerator object is constructed, it's passed a reference to the NCollection instance, so the enumerator knows how to talk to it. Here's NCollectionEnumerator:

Class NCollectionEnumerator
   Implements IEnumerator

   ' store reference to collection; current position
   Private m_Collection As NCollection
   Private m_Pos As Integer

   ' on init, pass ref to the collection we walk through
   ' make sure init pos is invalid
   Sub New(ByVal col As NCollection)
      m_Collection = col
      m_Pos = -1
   End Sub

   Public ReadOnly Property Current As Object _
     Implements IEnumerator.Current
      Get
         If m_Pos < 0 Or m_Pos > m_Collection.UpperBound Then
               Throw New InvalidOperationException
            Else
               Return m_Collection.Item(m_Pos)
         End If
      End Get
   End Property

   Public Function MoveNext() As Boolean Implements IEnumerator.MoveNext
      m_Pos += 1
      Return CType(m_Pos >= 0 And m_Pos < m_Collection.Count, Boolean)
   End Function
   
   Public Sub Reset() Implements IEnumerator.Reset
      m_Pos = -1
   End Sub

End Class   

Note that our NCollection class is zero-based. When the enumerator is created, the current position must be -1, for enumerators work that way - by convention, the user calls MoveNext first (as far as a convention proposed by the help files of a Beta version goes; one may argue that the rules come with the interface description, but they're not enforced by the language).

A new (and potentially better) Collection

So how do we resurrect VBA.Collection? No, I'm not kidding. It had some characteristics that no class in System.Collections has in that combination:

If you study the classes in System.Collections carefully, you'll find that none has all of them. On the other hand, the new ones have features like inserting or removing entire sections, sorting, or checking for existence that VBA.Collection missed (I hardly need to advocate that).

In any case, it's worth analysing your collection needs and deciding on a case-by-case basis which collection to use, or whether to create a new class. If you only want to map keys and values, use a dictionary. If you don't care about keys, but want to sort a few items, try using an ArrayList, or derive a class from CollectionBase. In any case, time it. After all, this is still a Beta and well, we all need to learn new things.

But there are cases when the distinct features of VBA.Collection are just what you need. So why not use the one in the compatibility namespace? Because it's not good enough. You didn't like it in VB5/6, so if it is to serve you now, you want to extend it. Also, it exposes a weakly (Object) typed Item property; this means that the only way to create a strongly typed collection is by way of delegation. Another issue is that the lower bound for indexing is one, which is a sane choice in itself, but it's incompatible with every other collection in the .NET framework.

Base class: NameObjectCollectionBase

This base knows keys (well, sort of, but eventually it will) and indices. It's got a little problem with the enumerator, though (I'll explain). We'll start with a generic derivation, one with "Object"-type elements that we can immediately use in VB6-style code. You can easily change that to any other type (of course, you could also switch to C++ which has a language feature called "templates" ..., but that's not why you're here). A better strategy, however, is to leave type-specific members (such as "Item" and "Add" [renamed, of course]) with protected access, allowing the creation of strongly typed collections while reusing other implementation details in this class (such as the GetEnumerator override), but that's an exercise for the reader.

So let's do some inheritance:

Public Class NCollection
   Inherits System.Collections.Specialized.NameObjectCollectionBase

   ' constructors are not inherited
   Sub New()
      MyBase.New
   End Sub

End Class

NOCB has got many protected methods it expects us to call; so much of the code looks like delegation, but it's not. Whenever you inherit, there are some members that you override; all in all NOCB offers a good balance between code reuse and flexibility. We'll also add new features. For example, now that all collections are zero-based, which we'll stick to in this exercise, you might want to add an UpperBound property that eases use in For-Next loops (many MFC collections have a corresponding method):

Public ReadOnly Property UpperBound As Integer
   Get
      Return MyBase.Count - 1
   End Get
End Property

Potentially, you can allow user-defined bounds as well, offsetting the indices behind the scenes.
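Here's a rough sketch of that offsetting idea. To keep it self-contained I've wrapped an ArrayList rather than extending NCollection, and the OffsetList class with its LowerBound property is entirely hypothetical; the same arithmetic would go into Item and UpperBound of any zero-based collection.

```vbnet
Imports System
Imports System.Collections

' Hypothetical list whose first valid index is LowerBound instead of zero
Class OffsetList
   Private m_Items As New ArrayList()
   Private m_LowerBound As Integer

   Sub New(ByVal lowerBound As Integer)
      m_LowerBound = lowerBound
   End Sub

   Public Sub Add(ByVal value As Object)
      m_Items.Add(value)
   End Sub

   Public ReadOnly Property LowerBound() As Integer
      Get
         Return m_LowerBound
      End Get
   End Property

   Public ReadOnly Property UpperBound() As Integer
      Get
         Return m_LowerBound + m_Items.Count - 1
      End Get
   End Property

   ' Translate the caller's index to the zero-based index used internally
   Default Public Property Item(ByVal index As Integer) As Object
      Get
         Return m_Items(index - m_LowerBound)
      End Get
      Set(ByVal value As Object)
         m_Items(index - m_LowerBound) = value
      End Set
   End Property
End Class

Module OffsetDemo
   Sub Main()
      Dim months As New OffsetList(1)
      months.Add("January")
      months.Add("February")

      Dim i As Integer
      For i = months.LowerBound To months.UpperBound
         Console.WriteLine("{0}: {1}", i, months(i))  ' 1: January, 2: February
      Next i
   End Sub
End Module
```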

Let's stick with the properties. Remember how to set the default property in VB6? Still know how you passed Variants to Item? That you had to implement Get/Let/Set properties (if your collection allowed modifying existing variant items)? Here's the new Item property:

Public Default Overloads Property Item(ByVal index As Integer) As Object
   Get
      Return MyBase.BaseGet(index)
   End Get   
   Set
      MyBase.BaseSet(index, value)
   End Set   
End Property
Public Default Overloads Property Item(ByVal sKey As String) As Object
   Get
      Return MyBase.BaseGet(sKey)
   End Get
   Set
      MyBase.BaseSet(sKey, value)
   End Set
End Property

Here are some methods. Note that NOCB allows for duplicate keys; we make them unique in the Add method (maybe there are other scenarios where you indeed choose to use duplicate names):

Public Sub Add(ByVal value As Object, _
               Optional ByVal sKey As String = Nothing)
   ' entries without a key are always accepted; real keys must be unique
   If sKey Is Nothing OrElse MyBase.BaseGet(sKey) Is Nothing Then
          MyBase.BaseAdd(sKey, value)
      Else
          Throw New ArgumentException
   End If
End Sub
Public Overloads Sub Remove(ByVal sKey As String)
   MyBase.BaseRemove(sKey)
End Sub
Public Overloads Sub Remove(ByVal index As Integer)
   MyBase.BaseRemove(index)
End Sub
Public Sub Clear()
   MyBase.BaseClear
End Sub
Public Function ExistsKey(ByVal sKey As String) As Boolean
   Return Not (MyBase.BaseGet(sKey) Is Nothing)
End Function

The enumerator

We also have to provide an enumerator to use with For-Each. But the one our base class has is Public, so the clients can just use that, right?

But when they iterate with For-Each, they'll find it prints the keys. Again, even NOCB is a collection of the dictionary style, to some extent. So maybe we can override the GetEnumerator method?
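You can see the key-printing behaviour with a stripped-down subclass. KeyDemoCollection is a throwaway class I've made up for this sketch; it does nothing but expose BaseAdd.

```vbnet
Imports System
Imports System.Collections.Specialized

' Minimal hypothetical subclass, just enough to add entries
Class KeyDemoCollection
   Inherits NameObjectCollectionBase

   Public Sub Add(ByVal key As String, ByVal value As Object)
      MyBase.BaseAdd(key, value)
   End Sub
End Class

Module KeyEnumDemo
   Sub Main()
      Dim col As New KeyDemoCollection()
      col.Add("first", "John Doe")
      col.Add("second", "Sue Doe")

      ' The inherited enumerator yields the keys, not the stored values
      Dim item As Object
      For Each item In col
         Console.WriteLine(item)     ' first, then second
      Next item
   End Sub
End Module
```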

In Beta 1, it was not marked as "Overridable"; in Beta 2, we were free to use our own implementation:

Public Overrides Function GetEnumerator() As IEnumerator
   Return New NCollectionEnumerator(Me)
End Function 

This worked in Beta 2, but the sad news is that in the final version, they've backpedaled - you can't override GetEnumerator if you derive from NameObjectCollectionBase.

So if you're interested in creating a collection from scratch, check out the Lists topic, or see the Lists project from my Gregor.NET series.