Working With Text In WebEdit.NET

Sunday, August 7, 2005

Text is universal. It's a good choice both for data formats and for interfaces to software (user input, programming). In fact, the most powerful techniques for working with computers are those that rely on the ubiquity of that simple, universal format.

Textual commands are better than graphical input, for a number of reasons, just as a textual programming interface is preferable to graphical, or "visual", programming. Textual data has advantages over binary data as well.

If everything is "in text", it's easy to treat code as data, to handle data like code, to generate code, and to automate things. It's not a big step from interactive use of software to scripting. Text is future-proof, easy to store and to transfer over the network, simple to understand, and much simpler for explaining a process than a series of screen shots. Textual information is often good for communication as well, but I'm not going to dwell on that aspect.

Unfortunately, the real world is different. We're hooked on slick user interfaces, and I'm no exception. For some activities, namely content creation, GUI programs are indeed a good choice, because such creative activities are impossible to automate, and a "rich" interface is often helpful in the presentation of complex data.

So, we want to have the best of both worlds, but we're not sure how to get there. One approach is to expose object models to scripting, another is to base every action on a textual command, and build the GUI on top of that. The second approach is preferable. Belatedly, I realize that WebEdit.NET isn't a pure example of it, even though textual interfaces are available for a number of tasks.

Competing approaches for data formats, user interaction, and programming are another fact of life not to be ignored. Let's just take a look at data formats we have to deal with. I'll take a high-level view here:

There's plain text, the command line interface, and shell scripts. In other words, UNIX. It's close to the ideal outlined above, although many see this model as inadequate.
Based on text, there's semi-structured hierarchical data like XML (and that includes XHTML, by the way). It isn't as simple as plain, line-oriented text files, but it still shares most of their advantages. On the bright side, it's very expressive, and can be used as a programming language as well.
Another world is that of tabular, and by extension, relational data. SQL - which, thankfully, is textual - is well designed, easy to use, and widely supported. It could play better with plain text tools, however.
The fourth world is that of objects, mostly statically typed. It's by far the most complex kind of data. OO, somewhat arbitrarily, states that code should be bundled with data, and there's a lot of protocol - yes, Sir, you need to override both Equals and GetHashCode if you want to have a really well-designed class.

Alas, different worlds. How do we bridge the gaps? The typical technique is wrapping: COM Interop, P/Invoke, JNI, Perl ODBC, ADO with text drivers, the DataDocs AddIn. The approach is practical, and we can get things done, but there's a high price to pay, namely, the deterioration of systems into immense complexity.

The trouble is, integrating different systems is a challenge, and coders like these sorts of challenges - See, I wrapped that up once, now it's easy for you to use. I don't regret integrating the Windows Shell Namespace into my own .NET application framework, and that prototypical DLL I wrote so that some Java app could export data into a Word document got me a lot of hacking fun.

But for all the practicality and the fun of solving real-world problems, let's not forget that the better approach is to keep things simple, and to resolve the disparities at a lower level. That is a real challenge. It must be pursued in different places - in data formats, at the language level, and in programming as well as user interfaces.

What's my tack on it? I'll try to use simple text formats when possible, while integrating data with easy-to-use APIs. There are a lot of useful tools, but they're often specific to a certain domain. For example, we have grep for record-oriented text files, SELECT ... WHERE for data tables, and XPath for XML. Each of these tools is powerful for its own type of data, but they all solve the same selection problem, which is a duplication of effort.

Anyway, text processing in WebEdit.NET. With the command model and the code interpreter, the interface is textual. Scripts can be readily changed and extended, so different data can be integrated and is accessible from the same textual interface. The data ultimately is plain text, because WebEdit is a text editor, but more importantly, because text is simple and universal.

So, script functions convert data among the different formats. Data is stored in variables - or, if you need non-linear processing ("ad-hoc manipulation") - in document windows.

By now, I've been theorizing more than enough. For the remainder of this post, I'll give examples of text manipulation scripts, and talk about a few other aspects of it. What follows is a progress report - I'm not there yet.

The Line Parser

The first thing is about how the editor presents record-oriented text files. I use my own pseudo-format, dubbed the "ToDoList language", which is defined in the language configuration file (WebEditLanguages.xml). Basically, there is syntax coloring for lines starting with certain tokens:

- default (note, thing to do) 
? maybe (not) 
! important (high priority item) 
* under way (currently being worked on) 
| delegated (in the "pipe") 
\ postponed (later, maybe never) 
+ biggie (death march project) 
# done (good!)

If a line starts with a certain token, for example, an exclamation mark, then that line is assigned a certain range kind, the format of which is customizable. A line may be continued with a trailing backslash. You can change, add or remove rules for these mappings (Tools/Settings, then Editing/Languages, then ToDoList/RangeKindMap).
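
The token-to-range-kind mapping can be sketched outside the editor. The Python below is only an illustration, not WebEdit.NET's implementation: RANGE_KINDS and classify_lines are hypothetical names, and it assumes that a continued line inherits the kind of the line it continues.

```python
# Hypothetical token table mirroring the ToDoList tokens listed above.
RANGE_KINDS = {
    "-": "default",
    "?": "maybe",
    "!": "important",
    "*": "under way",
    "|": "delegated",
    "\\": "postponed",
    "+": "biggie",
    "#": "done",
}

def classify_lines(lines):
    """Return (kind, line) pairs; kind is None when no token matches.

    Assumption: a line ending in a backslash continues onto the next
    line, which then inherits the kind of the line it continues.
    """
    result = []
    previous_kind = None
    continuing = False
    for line in lines:
        if continuing:
            kind = previous_kind
        else:
            kind = RANGE_KINDS.get(line[:1])
        result.append((kind, line))
        continuing = line.endswith("\\")
        previous_kind = kind
    return result
```

For example, classify_lines(["! fix bug \\", "  in parser", "# shipped"]) tags the first two lines as "important" and the third as "done".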

Grepping Lines

Record-oriented text files that follow the ToDoList format can be grepped with the following utility:

grepLines(sStart)
{
    loop(i in Vars.TextLines){
        sLine = Vars.TextLines[i];
        if(sLine.StartsWith(sStart)){
            tr(sLine);
            while(sLine.EndsWith("\\")){
                i = Mat.Add(i, 1);
                sLine = Vars.TextLines[i];
                tr(sLine);
            }
        }
    }
}

The grepLines function traces its output to the Console window. You could also paste these lines into a new document, or return a collection of strings to the caller (examples below - just be aware of the different output options).
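
For comparison, here is the same idea as a Python sketch that returns the matches instead of tracing them. The function name grep_lines is made up for this example, and the bounds check on continuation lines is an addition of mine, so that a trailing backslash on the last line does not run past the end of the input.

```python
def grep_lines(lines, start):
    """Collect lines starting with `start`, plus their continuation lines."""
    matches = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(start):
            matches.append(line)
            # Follow trailing-backslash continuations, stopping at end of input.
            while line.endswith("\\") and i + 1 < len(lines):
                i += 1
                line = lines[i]
                matches.append(line)
        i += 1
    return matches
```

Returning a list keeps the output option open: the caller can print the result, join it into a new document, or pass it to another function.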

Accessing Document Text

The Vars module has a number of properties for that. You can get all the text of a document as a string or a string array, or retrieve the current selection in the same formats (see the Text, TextLines, TextSelection, TextSelectionLines properties).

There is also support for column-oriented text extraction:

Vars.TextSelectionColumns copies columns starting at the lowest column index (at either selection boundary - whichever is smaller) through the highest column index (at either selection boundary - whichever is greater) into a string array, for all lines.
Vars.TextSelectionCells copies such columns only for those lines that are part of the selection.
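
The difference between the two properties is easier to see in code. This Python sketch is an approximation under stated assumptions: selection_columns is a hypothetical name, selection boundaries are given as (line index, column index) pairs, and the column range is treated as half-open, which may not match how the editor counts the last column.

```python
def selection_columns(lines, start, end, selected_lines_only=False):
    """Slice a column range out of each line.

    start and end are (line_index, column_index) selection boundaries.
    Assumption: the column range is half-open, [lowest, highest).
    """
    (line_a, col_a), (line_b, col_b) = start, end
    lo, hi = min(col_a, col_b), max(col_a, col_b)
    if selected_lines_only:
        # "Cells" behavior: restrict to the lines covered by the selection.
        first, last = min(line_a, line_b), max(line_a, line_b)
        lines = lines[first:last + 1]
    # "Columns" behavior: slice every line that is left.
    return [line[lo:hi] for line in lines]
```

With selected_lines_only left at False the sketch behaves like the Columns variant (all lines); with True it behaves like the Cells variant (selected lines only).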

Transforming Strings

If you have an array of strings (or any ICollection of strings having an indexer - .NET Console's loop construct is handy here), the following function executes a callback for each string and stores the result back in place:

transformStrings(aStr, cb)
{
    loop(i in aStr){
        aStr[i] = cb(aStr[i]);
    }
    aStr;
}

Such a callback could trim each string and wrap it in single quotes. I use that to generate value lists for SQL IN operators from database output (if for some reason a subselect is not an option):

trimSingleQuoteString(s)
{
    string.Concat('\'', s.Trim(), '\'');
}
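
A rough Python equivalent of the two functions, extended to build the complete value list. All names are hypothetical. Doubling embedded single quotes is my addition (the script above does not do it); it is the standard SQL way to escape a quote inside a string literal.

```python
def transform_strings(strings, callback):
    """Apply callback to each string and return a new list."""
    return [callback(s) for s in strings]

def trim_single_quote(s):
    """Trim whitespace and wrap in single quotes, doubling embedded quotes."""
    return "'" + s.strip().replace("'", "''") + "'"

def sql_in_list(values):
    """Build the parenthesized value list for a SQL IN operator."""
    return "(" + ", ".join(transform_strings(values, trim_single_quote)) + ")"
```

This is meant for ad-hoc queries typed into an editor. In application code, parameterized queries are the safer choice.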

Creating HTML From Plain Text

Suppose you have a text file with paragraphs, one per line, and need to transform it into HTML:

paragraphLines(asLines)
{
    sb = new StringBuilder();
    foreach(sLine in asLines){
        if(Flow.IsString(sLine)){
            sb.Append("<p>");
            sb.Append(Environment.NewLine);
            sb.Append(sLine);
            sb.Append(Environment.NewLine);
            sb.Append("</p>");
            sb.Append(Environment.NewLine);
            sb.Append(Environment.NewLine);
        }
    }
    sb.ToString();
}
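
The same transformation as a Python sketch. Two assumptions: I take Flow.IsString to mean "a non-empty string", and I escape HTML special characters in each line with the standard library's html.escape, which the script above does not do.

```python
import html

def paragraph_lines(lines):
    """Wrap each non-blank line in <p>...</p>, escaping HTML special characters."""
    parts = []
    for line in lines:
        if line.strip():  # skip empty and whitespace-only lines
            parts.append("<p>\n" + html.escape(line) + "\n</p>\n\n")
    return "".join(parts)
```

Without the escaping step, a line such as "a < b" would produce markup that a browser may misread as the start of a tag.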

By the way, I just had to escape the angle brackets in the previous code sample - I selected the code, and executed the following line:

code:Vars.TextSelection = Parse.EncodeEntities(Vars.TextSelection)

Creating HTML From Table Data

We have seen the tableLines function in a previous entry. It assumes that columns are separated by tabs, because that is how documents loaded from databases - with the DataDocs AddIn - were formatted.

Well, I've beefed up DataDocs in the meantime. In the data source config file (DataDocs.xml), the output formatting can be explicitly specified on a per-DataItem basis (separated vs. fixed width, alignment and overflow handling):

<dataitem ...>
  <tableformatter type="DataDocs.CDataTableFormatter"
      rowformat="FixedWidth" delimiter="	">
    <columnformat type="Gregor.Core.CColumnFormatInfo"
        width="20" alignment="Auto" overflow="Limit" isdefault="1" />
    <columnformat type="Gregor.Core.CColumnFormatInfo"
        width="10" alignment="Left" overflow="Blackout" isdefault="0" />
    <columnformat type="Gregor.Core.CColumnFormatInfo"
        width="10" alignment="Left" overflow="Blackout" isdefault="0" />
  </tableformatter>
</dataitem>

You can also use code (the bulk of the implementation is in Gregor.Core - see the CTableFormatter class and related types):

item = DataDocs.CDataDocsConnector.Instance.DataManager.Connections[0].Items[1];
item.TableFormatter.RowFormat = TableRowFormat.FixedWidth;
item.TableFormatter.ColumnFormats.Add(new CColumnFormatInfo(20));
item.TableFormatter.ColumnFormats[0].Alignment = CellAlignment.Center;
item.TableFormatter.ColumnFormats[0].Overflow = CellOverflow.Truncate;

If no formatter is set on a data item, a default formatter is used (tab-separated columns).
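
To illustrate what fixed-width formatting with alignment and overflow handling involves, here is a small Python sketch. It does not reproduce the Gregor.Core classes: the function names, the alignment names and the overflow modes ("truncate" cuts the value, "mark" fills the cell with # characters) are my own assumptions.

```python
def format_cell(value, width, alignment="left", overflow="truncate"):
    """Pad or cut one cell to exactly `width` characters.

    The alignment and overflow names are illustrative, not DataDocs values.
    """
    text = str(value)
    if len(text) > width:
        if overflow == "truncate":
            text = text[:width]
        else:  # "mark": signal that the value did not fit
            text = "#" * width
    if alignment == "right":
        return text.rjust(width)
    if alignment == "center":
        return text.center(width)
    return text.ljust(width)

def format_row(values, widths, fixed_width=True, delimiter="\t"):
    """Format a row as fixed-width cells, or join raw values with a delimiter."""
    if not fixed_width:
        return delimiter.join(str(v) for v in values)
    return "".join(format_cell(v, w) for v, w in zip(values, widths))
```

Calling format_row with fixed_width=False corresponds to the default described above: plain tab-separated columns.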

Code Assistance

There's code assistance in Gregor.Editing for things like generating event handler skeletons or interface implementation stubs. But some coding techniques are just too dirty for my pristine application framework - chuckle:

createEnumSwitch(sEnumType)
{
    sb = new StringBuilder();
    sb.Append("switch(...){");
    sb.Append(Environment.NewLine);
    tpEnum = Reflect.FindType(sEnumType);
    flags = Bytes.CombineBitFlags(BindingFlags.Public, BindingFlags.Static);
    foreach(fi in tpEnum.GetFields(flags)){
        sb.Append("    case ");
        sb.Append(fi.Name);
        sb.Append(':');
        sb.Append(Environment.NewLine);
        sb.Append("        ...;");
        sb.Append(Environment.NewLine);
        sb.Append("        break;");
        sb.Append(Environment.NewLine);
    }
    sb.Append("    default:");
    sb.Append(Environment.NewLine);
    sb.Append("        throw new System.ComponentModel.InvalidEnumArgumentException(\"...\");");
    sb.Append(Environment.NewLine);
    sb.Append("}");
    sb.Append(Environment.NewLine);
    sb.ToString();
}

Note: make sure you execute code:using System.Reflection first.
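
The generator idea carries over to other environments. This Python sketch emits the same kind of C#-style skeleton from a Python enum; create_enum_switch and Color are hypothetical names used only for the demonstration. Unlike the script above, it qualifies each case label with the type name, since C# requires that.

```python
import enum

def create_enum_switch(enum_type):
    """Generate a C#-style switch skeleton with one case per enum member."""
    lines = ["switch(...){"]
    for member in enum_type:  # iteration yields members in definition order
        lines.append(f"    case {enum_type.__name__}.{member.name}:")
        lines.append("        ...;")
        lines.append("        break;")
    lines.append("    default:")
    lines.append('        throw new System.ComponentModel.InvalidEnumArgumentException("...");')
    lines.append("}")
    return "\n".join(lines) + "\n"

class Color(enum.Enum):  # hypothetical enum used only for the demonstration
    RED = 1
    GREEN = 2
```

Calling create_enum_switch(Color) returns the skeleton as a string, ready to be pasted at the cursor and filled in by hand.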