[Garbage Collection]

Introduction

Wednesday, May 1, 2002

VB.NET brings us a new beast called the garbage collector. It promises to resolve circular references, but the price is that we lose deterministic finalization of our objects. I'll talk about different memory management concepts here, then see how the new GC delivers, and finally, I'll dispense some unsolicited advice about how to collect resources in a timely fashion.

Memory management

Let's see what methods different programming languages and environments apply to manage memory. I'll take a high-level, conceptual viewpoint here, focusing on the programmer's tasks. The idea of automatic garbage collection caused a lot of excitement about Java, and I'll compare Java's ways with classic VB's concepts.

Basically, I'll distinguish four different ways of handling memory. You might come up with a different number, but it's me who has FTP access to this site.

Automatic allocation and deallocation

This is what we're most familiar with. Declare a variable, and you've got an instance. The variable might not be initialized to some sensible value, but the compiler will make room for those four bytes an integer takes up.

Whether your instance lives on the stack or not isn't so important (conceptually), but it's the most common scenario. The same principle applies to other variables (like static and extern variables in C [even though the storage class is not considered "auto", and of course lifetime is as long as that of the program], and class static [Shared in VB.NET] variables). In fact, it applies to any variable, as far as the memory associated with the variable itself is concerned.

In this scenario, the scope of the variable determines its lifetime; you determine the scope by choosing an appropriate place of declaration. You can think of the variable as the instance - "the Integer" is your variable in the source code, and it denotes those four bytes that hold the value as well. This is called value semantics.

Does this concept apply to primitives only? No. In a well-known pointer language, everything is a value type:

int main(){
	CPerson joe, sue;	// create two instances
	joe = sue;		// shallow copy of instance
}				// destroy both instances

There is nothing special about objects that requires that we talk to them via references. In VB, Java, and Delphi, primitives are value types and classes are reference types, but this is a design choice (albeit a reasonable one), not a necessity.

In conclusion, this model leaves nothing to worry about as far as garbage collection is concerned.
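To make the point about scope and lifetime a bit more tangible, here is a minimal sketch in C++. The instance counter `alive` and the function `count_inside_scope` are my own additions for illustration; they are not part of any library.

```cpp
#include <string>
#include <utility>

// Hypothetical class that counts its live instances.
struct CPerson {
    static int alive;   // number of instances currently in existence
    std::string name;
    explicit CPerson(std::string n) : name(std::move(n)) { ++alive; }
    CPerson(const CPerson& other) : name(other.name) { ++alive; }
    CPerson& operator=(const CPerson&) = default;
    ~CPerson() { --alive; }
};
int CPerson::alive = 0;

// Returns the number of live instances seen inside the inner scope.
int count_inside_scope() {
    int seen = 0;
    {
        CPerson joe("Joe"), sue("Sue");  // two instances, no 'new' involved
        joe = sue;                       // copies the value; still two instances
        seen = CPerson::alive;
    }                                    // both destructors run here
    return seen;
}
```

Inside the braces there are two instances; once the scope is left, the counter is back at zero without any cleanup code on our part.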

Explicit creation and destruction

This model is in stark contrast to the first one. The programmer decides when to create the instance. He decides when to destroy it. Consequently, declaration scope is not a matter of concern here. This allows access to the instance from any point in the program (theoretically). To take advantage of this flexibility, the "variable" must conceptually be different from the instance. At this point, pointers come into play. Various variables do not hold the value, but only the instance's address. The instances are actually allocated on the heap.

Here, there is no memory "associated" with any variable. Memory can only be accessed indirectly through pointers or references (but excluding C++'s definition of the latter term).

One might say, stop, this is the reference model of Java or VB! Now, references are in fact pointers, though they're more restrictive (you cannot perform pointer arithmetic on them, and casting is limited). But let's see how the pure pointer model works, first:

CPerson * pJoe, * pSue;
pJoe = new CPerson();   // create Joe
pSue = new CPerson();   // create Sue
pJoe = pSue;            // copy pointer value (address)
delete pJoe;            // destroy Sue

Here, we create two instances. We can access those instances by pointers only. If we assign pSue to pJoe, we end up with two pointers to Sue. Joe, on the other hand, becomes unreferenced; there is no way to destroy him because we have thrown away the address.

After the delete, on the other hand, the two pointers to Sue are useless; if we invoke one of CPerson's methods via either of them, the result is undefined. Undefined means, more often than not, you'll crash:

pSue->RaiseHell();	// problem: Sue is dead

So the problem with this is that you might throw away address information that you still need to free memory; vice versa, you might end up with pointers containing some address that is no longer valid.

Conclusion: the value of the pointer does not tell you anything about the state of the object.
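For comparison, here is a sketch of the same sequence done carefully. Nothing in the language enforces this order; it is pure programmer discipline. The counter `alive` and the function name `careful_cleanup` are assumptions made for this example.

```cpp
// Hypothetical class that counts its live instances.
struct CPerson {
    static int alive;
    CPerson() { ++alive; }
    ~CPerson() { --alive; }
};
int CPerson::alive = 0;

// Returns the number of instances left over after cleanup.
int careful_cleanup() {
    CPerson* pJoe = new CPerson();  // create Joe
    CPerson* pSue = new CPerson();  // create Sue
    delete pJoe;                    // destroy Joe while we still know his address
    pJoe = pSue;                    // now two pointers to Sue
    delete pSue;                    // destroy Sue exactly once
    pSue = nullptr;                 // by convention, reset every pointer
    pJoe = nullptr;                 // that referred to her
    return CPerson::alive;
}
```

Resetting the pointers by hand is only a convention: a third copy of Sue's address somewhere else in the program would still be dangling.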

Reference counting, COM style

Wouldn't it be better if pointers or references were more in sync with the object they point to? This is certainly so, and in C++, that's the programmer's job description. But the obvious question is: can this be automated?

Before we get to that, let's deal a little longer with the hard stuff. The heading to this subsection hardly looks promising, after all. Anyway, COM was not primarily designed to make memory management easier. Rather, it's about reusing pieces of software called components, specifically, reusing compiled code (binary reuse). And a compiled component doesn't know about the pointers in the client code; it may run in a different address space (as an out-of-process server), even on a different machine.

In COM, object lifetime is managed by reference counting on a per-object basis. The object keeps the reference count in a private variable. The client increases or decreases the reference count by calling AddRef or Release, respectively. Both come with the infamous IUnknown interface. When the reference count has fallen to zero, the object frees itself from memory.

So when you copy a pointer, you call AddRef. When a pointer is no longer used, you call Release. Sometimes you can skip one or both calls, but using a COM interface in C++ isn't fun. It does not ease memory management, it adds new complexities. Ah, and don't even think you can create COM objects in the same manner as the CPerson objects above.
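The protocol described above can be sketched with an ordinary C++ class. Mind that this is not real COM: `CCounted` is a hypothetical class that merely borrows the method names AddRef and Release, and the `destroyedFlag` parameter exists only so the demo can observe the destruction.

```cpp
// Hypothetical object with COM-style lifetime management (not the real IUnknown).
class CCounted {
public:
    explicit CCounted(bool* destroyedFlag) : m_refs(1), m_destroyed(destroyedFlag) {}
    unsigned long AddRef() { return ++m_refs; }
    unsigned long Release() {
        unsigned long remaining = --m_refs;
        if (remaining == 0) {
            delete this;   // the object frees itself
        }
        return remaining;
    }
private:
    ~CCounted() { *m_destroyed = true; }  // private: clients must go through Release
    unsigned long m_refs;
    bool* m_destroyed;
};

// Walks through the copy / release protocol; returns true if the object
// was destroyed exactly when the last reference was released.
bool demo_manual_refcount() {
    bool destroyed = false;
    CCounted* p1 = new CCounted(&destroyed);  // count = 1
    CCounted* p2 = p1;                        // copying the pointer ...
    p2->AddRef();                             // ... requires AddRef: count = 2
    p1->Release();                            // count = 1, still alive
    bool aliveAfterFirst = !destroyed;
    p2->Release();                            // count = 0, object deletes itself
    return aliveAfterFirst && destroyed;
}
```

Forget one AddRef and the object dies too early; forget one Release and it never dies. That is the bookkeeping a compiler can take off your hands.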

But there must be something positive about reference counting. Well, there's not, unless it's being automated by the compiler:

Dim joe As CPerson, sue As CPerson
Set joe = New CPerson	' create Joe
Set sue = New CPerson	' create Sue
Set joe = sue		' Joe goes away
Set joe = Nothing
Set sue = Nothing	' now, Sue goes away

If we throw away our reference to Joe (by assigning Sue to it), Joe's reference count falls to zero. Sue is now referenced two times, and the only way we can get rid of her is by releasing both references (VB will call Release behind the scenes). There is no such thing as "delete" in a protective language.

What about making the dead Sue raise hell, as above? The reference is set to Nothing, otherwise Sue would still be alive. Partying on a Nothing reference isn't so bad in VB:

sue.RaiseHell		' runtime error: object variable not set

An important rule in a high level language is: if you assign a reference to an existing object, this reference is guaranteed to be valid unless you set it to Nothing (nil or null) or unless it goes out of scope. You cannot destroy an instance without resetting your variables (leaving scope is the same, so you can rely on that, too).

Of course, all this magic only works because somebody made it work in the first place. Classic VB is child-safe, in this respect. Objects terminate predictably, and there's no need for a garbage collector. The downside is that this system takes the narrow view: mechanically counting references, classic VB offers no help when your objects are trapped in a reference cycle.
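The reference cycle problem is easy to reproduce with any automated counting scheme. The sketch below uses C++'s std::shared_ptr as a stand-in for VB's automated reference counting (it is not COM, and `CNode`, `alive` and `leaked_after_cycle` are names I made up for the example). The std::weak_ptr is only there so the demo can break the cycle by hand afterwards and clean up after itself.

```cpp
#include <memory>

// Hypothetical node type; 'alive' counts instances not yet destroyed.
struct CNode {
    static int alive;
    std::shared_ptr<CNode> partner;  // counted reference to another node
    CNode() { ++alive; }
    ~CNode() { --alive; }
};
int CNode::alive = 0;

// Returns how many nodes are still alive after all outside references are gone.
int leaked_after_cycle() {
    std::weak_ptr<CNode> observer;   // does not contribute to the count
    {
        auto joe = std::make_shared<CNode>();
        auto sue = std::make_shared<CNode>();
        joe->partner = sue;          // Joe references Sue
        sue->partner = joe;          // Sue references Joe: a cycle
        observer = joe;
    }   // joe and sue go out of scope; each count drops from 2 to 1, never to 0
    int stillAlive = CNode::alive;
    if (auto j = observer.lock()) {
        j->partner.reset();          // break the cycle by hand so both can be freed
    }
    return stillAlive;
}
```

Both nodes survive the end of the scope although nobody can reach them anymore. Only after one of the two references is cleared manually do the counts fall to zero.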

Garbage collection

The garbage collection schemes of Java and VB.NET uphold the promise that a reference, once valid, stays that way until specifically reset. Also, there are friendly runtime exceptions instead of general protection faults when you try to call a method via a null (Nothing) reference.

What's different is the technique behind this. Classes do not need to implement IUnknown. Objects do not keep a reference count. Rather, the runtime environment starts from the references it can see directly (on the execution stack and in static variables), follows them from object to object, and treats every object it cannot reach as garbage. The garbage collector runs on demand when memory runs short. This global, big-picture approach allows it to destroy objects trapped in reference cycles.
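To illustrate why a global view copes with cycles, here is a toy mark-and-sweep collector. It is a deliberately naive sketch and not how the .NET runtime actually works (the real collector is generational and compacts memory, among other things). `ToyHeap`, `ToyObject` and `collect_cycle_demo` are names invented for this example, and objects refer to each other by index rather than by address.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy heap object: it may reference other objects by index into the heap.
struct ToyObject {
    std::vector<std::size_t> refs;
    bool marked = false;
};

class ToyHeap {
public:
    // Allocates an object and returns its index, used here as its "reference".
    std::size_t allocate() {
        m_objects.push_back(std::make_unique<ToyObject>());
        return m_objects.size() - 1;
    }
    void link(std::size_t from, std::size_t to) { m_objects[from]->refs.push_back(to); }
    bool isAlive(std::size_t i) const { return m_objects[i] != nullptr; }

    // Marks everything reachable from the roots, then frees the rest.
    // Returns the number of objects collected.
    std::size_t collect(const std::vector<std::size_t>& roots) {
        std::vector<std::size_t> pending(roots);
        while (!pending.empty()) {                       // mark phase
            std::size_t i = pending.back();
            pending.pop_back();
            if (!m_objects[i] || m_objects[i]->marked) continue;
            m_objects[i]->marked = true;
            for (std::size_t r : m_objects[i]->refs) pending.push_back(r);
        }
        std::size_t freed = 0;
        for (auto& obj : m_objects) {                    // sweep phase
            if (!obj) continue;
            if (obj->marked) { obj->marked = false; }    // survivor: reset for next run
            else { obj.reset(); ++freed; }               // unreachable: free it
        }
        return freed;
    }
private:
    std::vector<std::unique_ptr<ToyObject>> m_objects;
};

// Builds one rooted object plus an unreachable cycle b <-> c, then collects.
std::size_t collect_cycle_demo() {
    ToyHeap heap;
    std::size_t a = heap.allocate();
    std::size_t b = heap.allocate();
    std::size_t c = heap.allocate();
    heap.link(b, c);
    heap.link(c, b);           // b and c reference each other
    return heap.collect({a});  // only a is referenced from a root
}
```

The two objects in the cycle still reference each other, but since no path leads to them from a root, the sweep frees both. A per-object counter could never have figured that out.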

The downside is that there is no guarantee as to when an object is freed from memory. It is only guaranteed that it will eventually be collected, but it won't happen right after all references have gone away. In classic VB, destruction was instantaneous:

Dim bob As CPerson
Set bob = New CPerson
Set bob = Nothing

' class CPerson
Private Sub Class_Terminate()
	' ...
End Sub

The terminator sub would be hit immediately after Bob was killed by releasing the last (and only) reference. This allowed some cool tricks:

' class CHourglass
Private Sub Class_Initialize()
	Screen.MousePointer = vbHourglass
End Sub
Private Sub Class_Terminate()
	Screen.MousePointer = vbDefault
End Sub

' somewhere else
Sub LongTime()
	Dim hg As CHourglass
	Set hg = New CHourglass
	' ... long-lasting stuff
	' ... don't worry about the mouse pointer
End Sub

Don't try this in VB.NET. You'll end up with an hourglass pointer even if the task has been finished. Of course, Screen.MousePointer doesn't work (as of now), but don't confuse me with facts.

Summing up, garbage collection poses some problems. Your objects aren't freed in a timely, predictable manner. The big problem here is the freeing of scarce resources. When an object terminates, you can use the finalizer; but the real message is that the user of an object needs to call a method like "Close" or "Dispose" when he is finished with it:

Class CResourceWrapper
    Implements System.IDisposable

    Private m_Handle As IntPtr ' OS handle

    Public Sub New()
        m_Handle = CreateOSObject()
    End Sub

    ' needs to be called by the client
    Public Sub Dispose() Implements System.IDisposable.Dispose
        Me.CleanUp()
        ' avoid running Finalize if Dispose has been called
        GC.SuppressFinalize(Me)
    End Sub

    Protected Overrides Sub Finalize()
        ' better clean up late than never
        Me.CleanUp()
        ' allow base class to clean up as well
        MyBase.Finalize()
    End Sub

    Private Sub CleanUp()
        ' don't clean up twice, and allow for partially constructed objects
        If Not m_Handle.Equals(IntPtr.Zero) Then
            ' free your handles here
            CloseHandle(m_Handle)
            ' only dispose once
            m_Handle = IntPtr.Zero
        End If
    End Sub

    Public Sub Foo()
        ' protect disposed objects
        If m_Handle.Equals(IntPtr.Zero) Then
            Throw New ObjectDisposedException("CResourceWrapper")
        End If
        ' ...
    End Sub

End Class

Of course, you don't know when Sub Finalize will be called. Using a finalizer also causes a performance penalty. But be sure to run the cleanup code from the finalizer as well, just in case the client forgot to do the following:

Module M

    Public Sub Main(ByVal args() As String)
        Dim x As CResourceWrapper = Nothing
        Try
            x = New CResourceWrapper()
            ' ... use the object
        Finally
            If Not (x Is Nothing) Then
                x.Dispose()
            End If
        End Try
    End Sub

End Module

Helping the GC

Normally, it's not a problem if an object isn't freed right away. The garbage collector may take its time before it runs, but it will do its work when memory goes south. In fact, its activity level is related to the amount of free memory. Just using memory won't lock down the machine, although there are still performance considerations related to memory management, even in a garbage-collected environment (which I don't cover here).

What is far more important is to free up resources other than memory that an object uses. Some resources just aren't available in an unlimited number, no matter how powerful and memory-rich the machine is. Think about database connections. Or internet/server connection handles. Socket connections. File handles. Handles to global device contexts. GDI objects like brushes or pens. In fact, any handle to an object from the OS. If an object uses and owns one of these, it should release them as quickly as possible.

Another unmanaged resource is - unmanaged memory, which you can allocate and free with the Shared methods of the System.Runtime.InteropServices.Marshal class. And, coming full circle, COM objects, that is, unless you use a Runtime Callable Wrapper (RCW). Here, manual resource management means freeing an object directly, or decrementing its reference count, respectively.

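Releasing unmanaged resources may also happen "in the middle" of an object's lifetime. Say, you have a class that draws something on the Desktop window (the entire screen), and you need to keep this object alive because it watches out for some mouse activity, in order to use the global device context again at some point. Such an object should release the device context after each drawing operation and acquire it again when it is needed, rather than hold on to it for its entire lifetime.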

The lesson is this: if there is deterministic finalization available, the cleanup of resources like GDI handles can be coupled with the destruction of wrapper objects (using a destructor, or, in classic VB parlance, Sub Class_Terminate). But while this sort of coupling is taken for granted by many who come from a C++ or VB.Classic background, it is by no means the only way to go. For example, using classic VB's traditional file I/O required the same kind of manual work that is expected now by the .NET runtime environment.
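For readers who haven't seen the destructor flavour of this coupling, here is how the hourglass trick might look in C++. This is only a sketch: `g_cursor` and the `Cursor` enum are stand-ins I invented for whatever OS call actually changes the mouse pointer.

```cpp
// Stand-in for the global cursor state; a real program would call an OS API here.
enum class Cursor { Default, Hourglass };
Cursor g_cursor = Cursor::Default;

// Scope guard: sets the hourglass on construction, restores it on destruction.
class CHourglass {
public:
    CHourglass() : m_previous(g_cursor) { g_cursor = Cursor::Hourglass; }
    ~CHourglass() { g_cursor = m_previous; }
    CHourglass(const CHourglass&) = delete;             // copying a guard makes no sense
    CHourglass& operator=(const CHourglass&) = delete;
private:
    Cursor m_previous;
};

// Returns the cursor that was active while the "long task" ran.
Cursor long_time() {
    CHourglass hg;             // hourglass on
    Cursor during = g_cursor;  // ... long-lasting stuff would go here
    return during;
}                              // destructor restores the cursor, even on exceptions
```

The cleanup is tied to the closing brace, which is exactly the guarantee a garbage-collected environment does not give you.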

In the .NET framework, there is a design pattern around the System.IDisposable interface, which has been mentioned above. Any object that has a Dispose method (sometimes called "Close") needs to be disposed. Think in pairs: if you create such an object, write the disposal code right away, and be sure to use a Try/Finally block.

Resources and references

So you need to make some effort at manual resource management. Unfortunately, no one helps you count the references to an object. Mostly, you'll just have one reference to an object of this kind. You use the thing in some procedure, and you're done. Dealing with scarce resources is usually a short-lived activity. So it goes like this:

Dim ftp As New CFtpConnection("hersite.com", "anonymous", "a@b.c")
ftp.Open()
ftp.DownLoadFile("/dateme.txt", "C:\Temp\sally.txt")
ftp.Close()

Note that I have left out exception handling for brevity. The call to the Close method must be enclosed in a Finally clause.

More open doors

It's easy as long as the design stays simple. Sticking with our internet client example, let's imagine an object model built around Win32's internet handles hierarchy. There is a CInternetConnection class (which wraps the root internet handle required by every app that uses WinInet.dll) that has a method that returns a valid FTP connection object (wrapping an FTP connection handle):

Dim ic As New CInternetConnection()
Dim ftp As CFtpConnection = ic.OpenFtp("hersite.com", "guest", "guest")
ftp.UploadFile("C:\Temp\joe.jpg", "/candidates/joe.jpg") 
ftp.Close()

Now evidently, it's the client's job to close the FTP connection. But what about the parent internet connection? The OpenFtp method has opened this one, too, and of course the method has returned before the client can even use the connection:

Public Function OpenFtp(ByVal sHost As String, _
                        ByVal sUser As String, _
                        ByVal sPwd As String) As CFtpConnection
    Me.Open()
    Dim ftp As New CFtpConnection(Me, sHost, sUser, sPwd)
    ftp.Open()
    Return ftp
End Function

So the client must make yet another call, because the serving object can't know when the client has finished.

When to close the gate

Now imagine an application that uses several HTTP connections (think of something that gets stock quotes from different servers and displays them on the screen). Like FTP connections, these require a valid internet handle. So when an HTTP connection is closed, what happens to the parenting internet connection?

This is where you're back at square one. You need to count references manually. It's easy if all the connections are returned from a single method, the connection class is not instantiatable by the client (you can achieve this by declaring Friend constructors only), and if the client can't call CHttpConnection's Open method (again, make it a Friend):

Public Function OpenHttp(ByVal sUrl As String) As CHttpConnection
    Me.Open()
    Dim http As New CHttpConnection(Me, sUrl)
    http.Open()	' client can't use this method
    m_ConnectionCount += 1
    Return http
End Function

Consequently, when the client tries to call CInternetConnection's Close method, there's a way to check:

Public Sub Close()
    If m_ConnectionCount = 0 Then
        ' ... clean up internet handle
    End If
End Sub

Alternatively, you could close all child HTTP connections when the client attempts to close the parent internet connection.

Conversely, if the client closes an HTTP connection, this object will call this method of the internet connection:

' CHttpConnection
Public Sub Close()
    ' ... clean up HTTP connection
    m_InternetConnection.Release()
End Sub

' CInternetConnection
Friend Sub Release()
    m_ConnectionCount -= 1
    If m_ConnectionCount = 0 Then
        Me.Close()
    End If
End Sub

Note that this is a usage counting scheme that isn't really concerned with the number of references. Rather, it counts the number of connection objects that actually use the internet handle. This makes sense, because connection objects themselves might live on; you can bill them as "FTP servers" with persisting properties (like UserName and Password), and create a TreeView that displays nodes representing them, even if there is no active connection.

The important thing here is that the duration of resource usage is not identical to an object's lifetime. You need to count the number of objects which currently use a given resource manually. As an aside, if resource usage were identical to an object's lifetime, you could use Shared variables in the CHttpConnection class for counting instances, but this doesn't apply here.
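The usage counting scheme boils down to very little code. Here is a language-neutral sketch of it in C++, with the actual handle replaced by a simple flag. The class names follow the example above, but the members (`acquire`, `release`, `isOpen`, `users`) and the function `usage_count_demo` are assumptions made for illustration.

```cpp
// Hypothetical parent that owns one shared "internet handle".
class CInternetConnection {
public:
    bool isOpen() const { return m_open; }
    int users() const { return m_connectionCount; }
    void acquire() {                 // called when a child connection opens
        if (!m_open) m_open = true;  // open the handle on first use
        ++m_connectionCount;
    }
    void release() {                 // called when a child connection closes
        if (m_connectionCount > 0 && --m_connectionCount == 0) {
            m_open = false;          // last user gone: close the handle
        }
    }
private:
    bool m_open = false;
    int m_connectionCount = 0;
};

// Child object: it can be opened and closed repeatedly during its lifetime.
class CHttpConnection {
public:
    explicit CHttpConnection(CInternetConnection& parent) : m_parent(parent) {}
    ~CHttpConnection() { close(); }  // deterministic in C++, unlike a .NET finalizer
    void open()  { if (!m_active) { m_parent.acquire(); m_active = true; } }
    void close() { if (m_active)  { m_parent.release(); m_active = false; } }
private:
    CInternetConnection& m_parent;
    bool m_active = false;
};

// Returns true if the handle stays open while in use and closes with the last user.
bool usage_count_demo() {
    CInternetConnection ic;
    CHttpConnection quotes1(ic), quotes2(ic);
    quotes1.open();
    quotes2.open();
    quotes1.close();
    bool openWithOneUser = ic.isOpen() && ic.users() == 1;
    quotes2.close();                            // last active user closes
    bool closedButObjectsAlive = !ic.isOpen();  // quotes1 and quotes2 still exist
    return openWithOneUser && closedButObjectsAlive;
}
```

Note that the counter follows open and close calls, not constructors and destructors: both connection objects are still around when the handle is closed.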

Much of the code can be wrapped up in a class library, so things can be easy to use. Incidentally, it's not all that new. Our internet client example would look very similar in VB6, provided we intended to allow the CHttpConnection objects to live on beyond their use of OS connection handles.