New Version of Expat with UTF-16 Support

Article ID: 53061

I've been maintaining a port of the Expat open-source XML parser for the System i and providing ILE RPG language bindings for it. The fact that Expat outputs its results in the UTF-8 flavor of Unicode makes it tricky to use. Because RPG doesn't understand UTF-8, it requires you to delve into Coded Character Set Identifier (CCSID) translation APIs, which in turn makes the process more complicated.

This week, I found a feature that lets Expat write its output in UTF-16 instead! At first, this doesn't sound very exciting; however, because RPG natively supports UTF-16, this ability facilitates parsing XML with Expat from your RPG programs.

This article explains how to use the UTF-16 build of Expat in your RPG programs.

UTF-16 vs. UTF-8

Before I go too far into this article, I want to clarify that I'm discussing the format of data output from Expat. This has no bearing on the input formats that Expat can read. The XML files that Expat reads can be in several different formats, including US-ASCII, ISO-8859-1, UTF-8, and UTF-16. However, the output is always the same. In previous versions of my Expat port for i5/OS, it always sent the output to your program in UTF-8. In the newer versions of my port, it outputs UTF-16 instead.

Expat has to be told at compile time that you want it to output UTF-16 instead of UTF-8. Depending on which one you choose, Expat compiles a different section of code and uses it when the data is parsed. Unfortunately, this means that you can't switch between UTF-16 and UTF-8 at runtime. If you have existing programs that use the UTF-8 version of Expat, don't install the UTF-16 version on top of the old one, because your existing programs will no longer work properly. Instead, I recommend keeping the older version around and installing the new one in a separate library until you can upgrade any existing programs to use UTF-16.

UTF-16 support has been available for a while — but I only recently learned about it. Expat itself supported UTF-16 in older versions (e.g., 1.95.8), but because I didn't know about it, I didn't provide it in the System i port of Expat until version 2.0.0. This version was released by the Expat project in January 2006, but I've only just gotten around to uploading the System i version to my Web site. You can now download that updated version from the following link:
http://www.scottklement.com/expat/

I keep going on and on about UTF-16 support. What does that mean exactly? Unicode can be encoded in several ways. In UTF-8, the length of each character varies. It can be as short as 1 byte or as long as 4 bytes (the original design allowed sequences of up to 6). In UTF-16, nearly every character in everyday use is exactly 2 bytes long; only characters outside the Basic Multilingual Plane need a 4-byte surrogate pair. Each encoding has its advantages and disadvantages. One advantage of UTF-8 is that the most common, invariant characters have the same hex values in UTF-8 that they have in ASCII. Another advantage is that every Unicode character fits in a single UTF-8 sequence, whereas a single 2-byte UTF-16 unit can hold only 65,536 different values.

However, UTF-8 has disadvantages as well. Because the length of each character varies, it's hard to know how big to make each field. If you translate a 30-character field to UTF-8, the output could be as short as 30 bytes or, at 4 bytes per character, as long as 120 bytes. You don't know what it'll be! What if you want to take a substring of a field? In EBCDIC, the fifth character starts in the fifth byte. In UTF-16, the fifth character starts in the ninth byte because each character takes up 2 bytes (leaving aside the rare surrogate pairs). In UTF-8, however, it's hard to say. It could be anywhere from the fifth to the 17th byte. And the length of that character can vary, so you don't know exactly how many bytes to extract. In order to know, you have to read the string of bytes from front to back to find the length of each character. Granted, there are libraries of code that do this for you, but it's not as efficient as reading characters from fixed positions in the string.
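The byte arithmetic above can be made concrete with a small sketch. It is written in Python rather than RPG so that it can be run anywhere; the function names and the accented sample string are hypothetical and only serve the illustration.

```python
# Compare how many bytes the same text occupies in UTF-8 and UTF-16.
# "utf-16-be" is used so that no byte-order mark is added to the output.

def byte_lengths(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


def utf8_offset(text, index):
    """0-based byte offset in UTF-8 of the character at 0-based `index`.

    The only way to find it is to measure every character in front of it.
    """
    return sum(len(ch.encode("utf-8")) for ch in text[:index])


def utf16_offset(index):
    """0-based byte offset in UTF-16, assuming only 2-byte characters."""
    return index * 2


plain = "Acme, Inc."
accented = "Caf\u00e9, Inc."     # hypothetical sample; e-acute takes 2 bytes in UTF-8

print(byte_lengths(plain))       # (10, 20)
print(byte_lengths(accented))    # (11, 20)
print(utf8_offset(accented, 4))  # 5  (fifth character starts in the sixth byte)
print(utf16_offset(4))           # 8  (fifth character starts in the ninth byte)
```

Both strings are 10 characters long, yet the UTF-8 length changes as soon as one character falls outside ASCII, while the UTF-16 length stays at 20 bytes.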

For my shop, UTF-16's biggest advantage is that we can use it directly in RPG via the "C" (Unicode Character) data type. Although the RPG manuals refer to this data type as UCS2, it technically can be used for any character set that uses the same encoding scheme, as long as the System i provides a CCSID for that scheme. CCSID 1200 is UTF-16, and CCSID 13488 is UCS2.

How Expat Works

Expat's author refers to Expat as a "stream-oriented XML parser." The idea is that you feed data into Expat as a stream, one chunk at a time. Because you don't have to load the entire file into memory at once, Expat needs less memory than parsers that require the whole document. For example, if you want to process a 200 MB XML document, far less memory is required to work with 32 KB at a time than to load the entire 200 MB at once. Memory that your program doesn't use remains available for other purposes, which helps the whole system run better.
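As a rough illustration of chunked feeding, here is a minimal sketch in Python, whose standard-library xml.parsers.expat module wraps the same Expat library. It is not the RPG port; the function name parse_in_chunks, the in-memory sample document, and the small chunk size are assumptions made for the example.

```python
import io
import xml.parsers.expat


def parse_in_chunks(stream, chunk_size=32 * 1024):
    """Feed a binary stream to Expat one chunk at a time.

    Returns the number of start tags seen, just to show the parser ran.
    """
    count = 0

    def on_start(name, attrs):
        nonlocal count
        count += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)   # False: more data is coming
    parser.Parse(b"", True)          # True: that was the last chunk
    return count


# Hypothetical in-memory "file"; a real program would open a stream file.
doc = b"<Custs>" + b"<Cust><name>Acme, Inc.</name></Cust>" * 1000 + b"</Custs>"
print(parse_in_chunks(io.BytesIO(doc), chunk_size=64))   # 2001
```

Only one chunk is held by the calling code at any moment, no matter how large the document is.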

When you feed these "chunks" of data into Expat, it scans through them looking for "events." An event is a point in the XML document at which you might want your program to receive data from Expat so that it can process the data.

For example, consider the following XML document:

<Cust number="1234">
   <name>Acme, Inc.</name>
   <balance>123.45</balance>
</Cust>

There are three important events: the start of an XML tag, the end of an XML tag, and any character data found between the tags.

If you feed the preceding XML data into Expat, it reads through the data and finds the following events:

  1. Expat reads <Cust number="1234">. It knows that this is the start tag for the Cust element, and the start of an element is an event! Expat calls a subprocedure in your program that's designated to handle start events.

    Expat passes your subprocedure the following parameters (in UTF-16, of course):

    • name = Cust
    • attr(1) = number
    • attr(2) = 1234
    • attr(3) = *NULL

  2. Expat reads <name>. This is another start tag for an element, so it calls your start handler again, this time with the following parameters:

    • name = name
    • attr(1) = *NULL

  3. Expat now finds some character data. This is a different event. Expat calls your character data event handler and passes it the following parameters:

    • string = Acme, Inc.
    • length = 10

  4. Next, Expat reads </name>, which it knows is the end tag for the name element. It calls your end event handler with these parameters:

    • name = name

  5. Then it sees the start element for the <balance> tag and calls the start element handler again, passing these parameters:

    • name = balance
    • attr(1) = *NULL

  6. Then, there's character data, so Expat calls the character data handler with these parameters:

    • string = 123.45
    • length = 6

  7. Next, Expat finds the end tag for the balance element and calls your end handler:

    • name = balance

  8. Finally, Expat finds the end of the Cust element and calls the end handler one last time:

    • name = Cust

Expat does this for every chunk of data that you pass to it. It reads through the chunk of data and calls your routines for every event that it finds in the data. The events that you commonly look for are the start element, character data, and end element events that I discussed in the preceding example.
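The event sequence can be observed directly with a short sketch. It uses Python's standard-library xml.parsers.expat binding to Expat rather than the RPG port, so attributes arrive as a dictionary instead of a pointer array; the names DOC and list_events are hypothetical.

```python
import xml.parsers.expat

# The sample document, written without whitespace between the tags.
DOC = (
    '<Cust number="1234">'
    '<name>Acme, Inc.</name>'
    '<balance>123.45</balance>'
    '</Cust>'
)


def list_events(xml_text):
    """Parse xml_text and return the events in the order Expat reports them."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("chardata", data))
    parser.Parse(xml_text, True)     # True: the whole document is in this one call
    return events


for event in list_events(DOC):
    print(event)
```

The printed tuples follow the same order as the eight steps listed above: start, start, character data, end, and so on.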

Because Expat individually processes each chunk of data that you pass it, you might be wondering what happens if Expat gets only part of an element at a particular time.

  • If a chunk of data ends in the middle of an XML tag, Expat remembers the data that you've sent so far and waits until the whole start or end tag has been read before calling your handlers.
  • If a chunk of data ends in the middle of character data, Expat calls your event handler twice. It calls it at the end of the chunk of data and passes any characters that it has processed so far, and then Expat calls the event handler again when it processes the character data in the next chunk. If the character data is very long and needs to be processed in many chunks, Expat calls your character data handler repeatedly for each chunk of data.
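Both boundary cases can be tried out with a small experiment, again sketched with Python's standard-library xml.parsers.expat binding rather than the RPG port. The function name feed and the way the sample document is split are assumptions chosen so that one split lands inside a tag and the other inside character data.

```python
import xml.parsers.expat


def feed(chunks):
    """Feed chunks to Expat in order; log (chunk number, event, value)."""
    log = []
    current = 0
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: log.append((current, "start", name))
    parser.EndElementHandler = lambda name: log.append((current, "end", name))
    parser.CharacterDataHandler = lambda data: log.append((current, "chardata", data))
    for current, chunk in enumerate(chunks, start=1):
        is_last = current == len(chunks)
        parser.Parse(chunk, is_last)
    return log


# The first split falls inside the <name> tag, the second inside its text.
for entry in feed([b"<Cust><na", b"me>Acme, ", b"Inc.</name></Cust>"]):
    print(entry)
```

The start event for name should not appear until the chunk that completes the tag has been fed, and the text should arrive in more than one character data call. A handler therefore has to append each piece to what it already holds rather than overwrite it.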

Feeding XML to Expat

Here's an example of the steps required to feed an XML document to Expat. In this example, the XML data is coming from a variable in my RPG program:

     H DFTACTGRP(*NO)
     H CCSID(*UCS2:1200) BNDDIR('EXPAT')

     FQSYSPRT   O    F  132        PRINTER OFLIND(Overflow)

      /copy expat_h

     D rc              s             10I 0
     D errCode         s             10I 0

     D XMLdata         s            200C
     D len             s             10I 0
      /free

         XMLdata = %UCS2(
           '<?xml version="1.0" encoding="UTF-16"?>'
         + '<Cust number="1234">'
         +    '<name>Acme, Inc.</name>'
         +    '<balance>123.45</balance>'
         + '</Cust>' );

         len = %len(%trimr(XMLDATA)) * 2;

         parser = XML_ParserCreate(*OMIT);

         XML_SetStartElementHandler (parser: %paddr(start)   );
         XML_SetEndElementHandler   (parser: %paddr(end)     );
         XML_SetCharacterDataHandler(parser: %paddr(chardata));

         rc = XML_Parse( parser : %addr(XMLdata): len: 1);
         if (rc = XML_STATUS_ERROR);
            errCode = XML_GetErrorCode(parser);
            PrintMe = 'Parse error at line '
                    + %char(XML_GetCurrentLineNumber(parser))
                    + ','
                    + %char(XML_GetCurrentColumnNumber(parser))
                    + ': '
                    + %str(XML_ErrorString(errCode));
            except Print;
         endif;

         XML_ParserFree(parser);

         *inlr = *on;

Here are a few notes about the preceding example code:

  1. The H-spec specifies CCSID(*UCS2:1200). This tells the RPG compiler that all my UCS2 fields (data type C in RPG) are in CCSID 1200, which is the CCSID for UTF-16.

  2. The H-spec also specifies BNDDIR('EXPAT') to tell my program where to find the Expat service program.

  3. In the D-specs, I use /COPY to bring in prototypes and other useful definitions needed to call the EXPAT routines.

  4. The first statement in the free-format calc specs is a multi-line EVAL statement (but because it's free format, the word "EVAL" is optional). It uses the %UCS2() BIF to convert the EBCDIC XML data into UTF-16 and assigns it to the XMLDATA variable.

  5. The length of XMLDATA is calculated by trimming off trailing blanks (just like with an ordinary alphanumeric field) and calling the %LEN() built-in function (BIF). The result has to be multiplied by 2 because Expat expects me to tell it the length as a number of bytes rather than a number of characters.

  6. Note that XMLDATA could be an ASCII string or a UTF-8 string instead of UTF-16. However, it cannot be EBCDIC, because Expat has no built-in support for EBCDIC input. I decided to make it UTF-16 because it's easy to do in RPG without a need to call an API.

  7. The XML_ParserCreate() subprocedure is part of Expat, and it creates a "parser object." In reality, this is nothing more than a temporary space in memory that Expat uses to store its work variables.

  8. The XML_SetStartElementHandler, XML_SetEndElementHandler, and XML_SetCharacterDataHandler subprocedures tell Expat which subprocedures to call when an event occurs. I've written subprocedures named start, end, and chardata, and I want Expat to call these when it finds a start element, end element, and character data event, respectively. More about those later.

  9. The XML_Parse() routine returns 1 if all is well or 0 if it fails. XML_STATUS_ERROR is a named constant that has a value of 0.

  10. When an error occurs, I can call XML_GetCurrentLineNumber to get the line of the XML document that had an error, XML_GetCurrentColumnNumber to get the column number (within that line) in which the error was found, and XML_GetErrorCode to retrieve the error number of the error that occurred. XML_ErrorString converts an error number into a human-readable error message, and I like to print that out to help with debugging.

  11. I call the XML_Parse() subprocedure to parse one chunk of XML data. I pass the work space (called "parser") as well as the address and length of the chunk of data that I want it to parse. The last parameter to XML_Parse() tells Expat whether this is the last chunk of data. Passing a value of 1 means that it's the last chunk, whereas a value of 0 means that there's more to come.

    Although I have only one chunk in this example program, I could call XML_Parse() in a loop, passing more data each time. That's useful when you're parsing a file. More about that later.

  12. After I feed all the chunks of my XML document (only one in this case) to Expat via the XML_Parse subprocedure, Expat is done searching for events and has already called my event-handling subprocedures. That means I'm done! The XML_ParserFree() subprocedure frees up the temporary space that I reserved when I called XML_ParserCreate(), because it's no longer needed.
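The error-reporting steps in the notes above can be sketched as follows. This uses Python's standard-library xml.parsers.expat binding, not the RPG port: in Python a failed parse raises an ExpatError that carries the error code, line, and column, where the RPG code checks a return value and calls separate routines. The function name parse_or_report and the broken sample document are hypothetical.

```python
import xml.parsers.expat


def parse_or_report(xml_bytes):
    """Parse one complete document; return "OK" or a description of the error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)    # True: this is the only (last) chunk
    except xml.parsers.expat.ExpatError as err:
        # ErrorString turns the numeric error code into readable text.
        message = xml.parsers.expat.ErrorString(err.code)
        return "Parse error at line %d,%d: %s" % (err.lineno, err.offset, message)
    return "OK"


print(parse_or_report(b"<Cust><name>Acme, Inc.</name></Cust>"))
print(parse_or_report(b"<Cust><name>Acme, Inc.</Cust>"))   # </name> is missing
```

The second call reports a mismatched tag along with the position at which Expat noticed the problem.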

Start Event Handler

Each time an event (e.g., an XML tag or the data that lies between them) is found, Expat calls your routines. It's up to you to write code to do something with them. In the preceding example, I defined subprocedures that Expat calls whenever it finds an event.

  • The start subprocedure is called when a start tag is found for an XML element.
  • The end subprocedure is called when an end tag is found for an XML element.
  • The chardata subprocedure is called when characters are found between the elements.

When Expat calls these "handlers," it passes them parameters telling them about the events it found. Here's an example of a start handler:

     P start           B
     D start           PI
     D   usrdta                        *   value
     D   elem                     16383C   options(*varsize) const
     D   attr                          *   dim(32767) options(*varsize)

     D elemName        s            100C   varying
     D attrName        s            100C   varying
     D attrVal         s            100C   varying
     D len             s             10I 0
     D x               s             10I 0
     D data            s          16383C   based(p_data)

      /free

         len = %scan(U'0000': elem) - 1;
         elemName = %subst(elem:1:len);

         PrintMe = 'elemName = ' + %char(elemName);
         except Print;

         x = 1;
         dow attr(x) <> *NULL;

            p_data = attr(x);
            len = %scan(U'0000': data) - 1;
            attrName = %subst(data:1:len);

            p_data = attr(x+1);
            len = %scan(U'0000': data) - 1;
            attrVal  = %subst(data:1:len);

            PrintMe = 'attrName = ' + %char(attrName)
                    + ' attrVal = ' + %char(attrVal);
            except Print;

            x = x + 2;
         enddo;

      /end-free
     P                 E

This subprocedure is a part of your code, not part of Expat. It's something that you write to tell Expat what to do with the XML data.

The first parameter that Expat passes to you is for user-defined data. This is something that I could define in my program if I wanted to — but I didn't need it, so for the time being, it's ignored.

The second parameter is the name of the start tag that Expat finds. There's a catch, however. Expat passes this name as a C-style string! That means that it's a variable length value. To detect the end of the string, I have to search for a character made up of hex zeroes. I can't use the %STR() BIF for this because %STR() is for single-byte character strings, not for UTF-16. Instead, I use the %SCAN() BIF, followed by the %SUBST() BIF to take every character before the u'0000' (hex zeroes) character and save it to my elemName variable.

         len = %scan(U'0000': elem) - 1;
         elemName = %subst(elem:1:len);

Note that elemName is defined using data type C, which is UCS2. Remember, Expat sent me a UTF-16 string, not an EBCDIC one. When I save it to elemName, I'm keeping it in a UTF-16 string.
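For readers who want to see the scan-then-substring idea outside RPG, here is a minimal sketch in Python. It assumes big-endian UTF-16 and works on a made-up byte buffer; the function name read_utf16_cstring is hypothetical.

```python
def read_utf16_cstring(buffer, offset=0):
    """Decode a null-terminated UTF-16 string that starts at `offset`.

    Mirrors the %SCAN/%SUBST pair: look for the U+0000 terminator one
    2-byte unit at a time, then keep everything in front of it.
    """
    end = offset
    while end + 2 <= len(buffer):
        if buffer[end:end + 2] == b"\x00\x00":
            return buffer[offset:end].decode("utf-16-be")
        end += 2
    raise ValueError("no U+0000 terminator found")


# Hypothetical memory image: "Cust" + terminator, followed by leftover bytes.
memory = "Cust".encode("utf-16-be") + b"\x00\x00" + b"\xff\xff\xff\xff"
print(read_utf16_cstring(memory))   # Cust
```

Stepping two bytes at a time matters. A plain byte search for two zero bytes could match across the boundary between two characters, for example the low byte of U+0100 followed by the high byte of an ASCII letter.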

For the sake of example, all this program does is print out the parameters it receives. That makes it easy to run this program and see the events that Expat encounters and what it passes to each of your event handlers.

To print the element name, I convert it from UTF-16 to EBCDIC characters using the %char() BIF, and then I print it using a program-described print file.

         PrintMe = 'elemName = ' + %char(elemName);
         except Print;

That sure is simpler than having to run the string through iconv() to convert it from UTF-8 to EBCDIC!

The remainder of this sample start element handler loops through the array of attributes. Attributes are those name="value" codes that can appear in an XML start tag. In the XML sample, I have number="1234" in the <Cust> tag.

Expat passes attributes to the start handler as an array of pointers, where each pointer points to a C-style string in UTF-16 format. Expat first passes the attribute name, then its value, then the next name, then the next value, and so on. I can determine when I reach the end of the attributes by looking for a *NULL pointer.

If you look at the code, you'll see that I have a field named data that's based on a pointer named p_data. I set p_data to each of the attributes in turn and extract the C-style string from data the same way that I extracted the element name earlier.
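The name, value, name, value layout can also be seen through Python's standard-library xml.parsers.expat binding, which is a different interface from the RPG port. Setting ordered_attributes makes it hand over attributes as a flat list, much like the pointer array, except that the list's length takes the place of the *NULL terminator. The function names and the second attribute in the sample are hypothetical.

```python
import xml.parsers.expat


def attribute_pairs(flat):
    """Walk a [name, value, name, value, ...] list two entries at a time."""
    pairs = []
    x = 0
    while x < len(flat):
        pairs.append((flat[x], flat[x + 1]))
        x += 2
    return pairs


def start_tag_attributes(xml_text):
    """Return {element name: [(attr name, attr value), ...]} for each start tag."""
    found = {}

    def on_start(name, attrs):
        found.setdefault(name, attribute_pairs(attrs))

    parser = xml.parsers.expat.ParserCreate()
    parser.ordered_attributes = True   # report attributes as a flat list
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return found


print(start_tag_attributes('<Cust number="1234" region="west"><name/></Cust>'))
```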

Once again, I print each attribute's name and value for the sake of demonstration.

End Event Handler

When Expat finds the end tag for an XML element, it calls the end handler. In my sample program, this is a subprocedure called end. Like the start handler, it's passed the user data parameter and the element name as a C-style string in UTF-16 format.

Here's a sample end subprocedure that extracts the element name and prints it out:

     P end             B
     D end             PI
     D   usrdta                        *   value
     D   elem                     16383C   options(*varsize) const
     D elemName        s            100C   varying
      /free
          len = %scan(U'0000': elem) - 1;
          elemName = %subst(elem:1:len);

          PrintMe = 'end of element ' + %char(elemName);
          except Print;
      /end-free
     P                 E

The end element handler is usually where you map your XML data into variables in your program, because it's the point at which you know that there isn't any forthcoming data for the given element. For the time being, though, all I'm doing is printing the parameters passed to it so that you can get an idea of the flow of events.

Character Data Handler

Every time characters are found between the XML tags, the character data handler is called, and Expat passes as many characters as are available to it. Expat tries to collect as many characters as possible before calling the handler. Expat saves the characters up until it reaches another event (i.e., the start or end of an element) or until it reaches the end of the chunk of data that you passed into XML_Parse. When any of these things happens, Expat calls your character data handler and passes the characters as a parameter.

Because you know that Expat calls this handler at the end of every chunk, the second parameter to the character data handler should be at least as large as the chunk size that you passed into XML_Parse — but that's not a problem for this example!

Here's a sample character data handler that prints the data that it receives:

     P chardata        B
     D chardata        PI
     D   usrdta                        *   value
     D   string                   16383C   const options(*varsize)
     D   len                         10I 0 value

     D data            s          16383C   varying
      /free
            data = %subst(string:1:len);

            PrintMe = 'chardata = ' + %char(data);
            except Print;
      /end-free
     P                 E

As you can see, there are three parameters passed to the character data handler. These are the user data parameter, which I'm not using for now, the character string, and the length of the character string.

Unlike the start and end handlers, the data passed to the character data handler is not a null-terminated C-style string. Instead, the length is passed as a separate parameter. That makes it easy to use the %SUBST() BIF to copy it to another variable.

Testing It Out

The sample code that I've detailed here is for a program called VARDEMO1, and it's included with the Expat download on my Web site. I encourage you to give it a try. Running it gives you a good feel for how the events work. If I run it with the sample XML document from the start of this article, this is what it outputs to the spooled file:

elemName = Cust                   
attrName = number attrVal = 1234  
elemName = name                   
chardata = Acme, Inc.             
end of element name               
elemName = balance                
chardata = 123.45                 
end of element balance            
end of element Cust

I suggest that you look at the preceding code and compare it to this output to see whether you understand how the event handlers are called and what they do.

Where Am I?

Each time your start and end element handlers are called, they're passed the name of the element. However, they don't tell you the name of the "parent" elements (the elements that the current one is inside). For example, in the preceding XML document, the <name> element is inside the <Cust> element. It's up to you to keep track of that so you know where you are in the document.

Furthermore, you want a way to save all the data received by the character data handler until you reach the end of the element. At first glance, it might seem like it makes sense to associate this data with the value of the most recent start handler, but if you think it through, you'll see that it won't work.

With the XML document that I've used so far, just looking at the last start element handled would be fine. But what if you have the following document?

<Cust number="1234">
   <name>Acme, Inc.</name>
   <balance>123.45</balance>
   123 Main St.
   Anywhere, USA 54321
</Cust>

In this example, the address ("123 Main St.", "Anywhere, USA 54321") is character data that belongs to the <Cust> element. However, the most recent call to the start element handler was for the <balance> element!

How do you solve the problem? You put the data on a stack. Each time a new start tag is found, you put that start tag on top of the stack. Each time an end tag is found, you remove the top entry from the stack.

If you're unfamiliar with the concept of a stack, think of it as a stack of shoe boxes that you might put on your office floor. When you start processing the XML document, you write the word "Cust" on the first box and set it on the floor. Next, you label a box with "name" and put it on top of the Cust box. Now you come across "Acme, Inc." You put it into the "name" box because it's the highest one on the stack. Then you reach the end of "name," so you remove the top box from the stack. Now "Cust" is back on top. Next you see "balance" so you label another shoe box with the word "balance" and put that on top of the stack. When you see the data "123.45" you put it in the "balance" box because it's on top. When you reach the end tag for "balance," you take it off the stack, so that "Cust" is back on top. When you read in "123 Main St." it goes in the "Cust" box because that's what's on top of the stack.

This paradigm works well for keeping track of where you are in an XML document and saving the character data that corresponds to the current element. Coding a stack is easy using a multiple-occurrence data structure (MODS) and a numeric variable to keep track of the current depth.

     D depth           s             10I 0 inz(0)
     D Stack           ds                  occurs(16)
     D   elemPath                   256C   varying
     D   elemVal                  16383C   varying

Each time a new start element is encountered, I increase the depth by adding 1 to the depth variable and setting the current occurrence of Stack to that depth.

Each time we're done with an element, at the end of the end element handler, I subtract one from the depth and set the occurrence accordingly.

I want my start handler to increase the depth, add its name to the list of the parents that preceded it, and store that on the stack. I also want my start handler to clear the value for this level of the stack so that it's empty when the character data handler is first called. Here's a sample start handler that does that:

     P start           B
     D start           PI
     D   usrdta                        *   value
     D   elem                     16383C   options(*varsize) const
     D   attr                          *   dim(32767) options(*varsize)

     D elemName        s            256C   varying
     D attrName        s            256C   varying
     D attrVal         s            100C   varying
     D len             s             10I 0
     D x               s             10I 0
     D data            s          16383C   based(p_data)

      /free

         if (depth = 0);
            elemName = %ucs2('/');
         else;
            elemName = elemPath + %ucs2('/');
         endif;

         len = %scan(U'0000': elem) - 1;
         elemName = elemName + %subst(elem:1:len);

         depth = depth + 1;
         %occur(stack) = depth;
         elemPath = elemName;
         elemVal  = %ucs2('');

         x = 1;
         dow attr(x) <> *NULL;

            p_data = attr(x);
            len = %scan(U'0000': data) - 1;
            attrName = elemPath + %ucs2('/@') + %subst(data:1:len);

            p_data = attr(x+1);
            len = %scan(U'0000': data) - 1;
            attrVal  = %subst(data:1:len);

            PrintMe = %char(attrName) + '='+ %char(attrVal);
            except Print;

            x = x + 2;
         enddo;

      /end-free
     P                 E

As you can see, the start handler begins by retrieving the parent element's name (because that's currently on the top of the stack) and storing it in the elemName variable followed by a slash. It then gets the element name passed by Expat and adds that to the path. This results in /Cust the first time, then /Cust/name, then /Cust/balance, and so on.

After the element name with its parent elements prepended is calculated, I increase the depth and add that element name to the stack and also clear the value of the element at the top of the stack.

When the character data handler is called, all I have to do is add the character data to the stuff currently at the top of the stack:

     P chardata        B
     D chardata        PI
     D   usrdta                        *   value
     D   string                   16383C   const options(*varsize)
     D   len                         10I 0 value
      /free
            elemVal = elemVal + %subst(string:1:len);
      /end-free
     P                 E

When the end element handler is called, elemVal will contain all the data (or at least the first 16,383 characters of it!) and elemPath will contain the current element. This makes it possible to map those values into an array or print them out or do whatever you want to do with them.

Here's a sample end element handler that just prints the data out:

     P end             B
     D end             PI
     D   usrdta                        *   value
     D   elem                     16383C   options(*varsize) const
      /free

         if (elemVal <> %ucs2(''));
            PrintMe = %char(elemPath) + '=' + %char(elemVal);
            except Print;
         endif;

         depth = depth - 1;
         if (depth > 0);
            %occur(stack) = depth;
         endif;
      /end-free
     P                 E

When you run this on the XML data shown earlier, you get the following results:

 /Cust/@number=1234                   
 /Cust/name=Acme, Inc.                
 /Cust/balance=123.45                 
 /Cust=123 Main St.                   
       Anywhere, USA 54321

The preceding sample code (the one that uses the stack) is from a sample program called VARDEMO2 included with the Expat source code download.
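For comparison, the same stack technique can be sketched in a few lines of Python using the standard-library xml.parsers.expat binding. This is not VARDEMO2. The function name flatten is hypothetical, a Python list stands in for the multiple-occurrence data structure, and unlike the RPG code this sketch trims leading and trailing whitespace from each value.

```python
import xml.parsers.expat


def flatten(xml_text):
    """Return a list of (path, value) pairs using a stack of open elements."""
    stack = []      # one [path, collected text] entry per open element
    results = []

    def start(name, attrs):
        parent = stack[-1][0] if stack else ""
        path = parent + "/" + name
        stack.append([path, ""])            # push, with an empty value
        for attr_name, attr_val in attrs.items():
            results.append((path + "/@" + attr_name, attr_val))

    def chardata(data):
        if stack:
            stack[-1][1] += data            # text belongs to the top entry

    def end(name):
        path, value = stack.pop()           # pop; the parent is on top again
        if value.strip():
            results.append((path, value.strip()))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chardata
    parser.Parse(xml_text, True)
    return results


DOC = (
    '<Cust number="1234">\n'
    '   <name>Acme, Inc.</name>\n'
    '   <balance>123.45</balance>\n'
    '   123 Main St.\n'
    '   Anywhere, USA 54321\n'
    '</Cust>'
)

for path, value in flatten(DOC):
    print(path + "=" + value)
```

Because the address text is appended to whatever entry is on top of the stack, it ends up under /Cust even though the last start tag seen was balance.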

Reading XML from the IFS and More

For the sake of this article, I wanted to demonstrate how XML data can be parsed from a variable in your program and how the new UTF-16 support works. In the past, I've written demonstrations of how to feed Expat data from a stream file, so I won't rehash that here.

Instead, I recommend that you look at the CHARDATA2 and USERDATA sample programs included in the QRPGLESRC file in the Expat download. I've upgraded those programs to use UTF-16 as well, and they demonstrate reading data from the IFS and printing it or mapping it to an array in your program.

Now that you understand some of the basics of Expat, experimenting with the sample programs can really help you learn a lot. Be sure to get your copy of Expat from my site at:
http://www.scottklement.com/expat/

Previous Expat Articles

I previously wrote a series of articles about using Expat from RPG. Those articles used the UTF-8 output from Expat, sometimes treating it as normal ASCII and translating it with the QDCXLATE API, and other times treating it as UTF-8 and converting it with the iconv() API. Although that adds some extra complexity, you still might find those articles interesting. Here are links to those articles:
Using Expat from an RPG Program, Part 1
Using Expat from an RPG Program, Part 2
Using Expat from an RPG Program, Part 3
Using Expat from an RPG Program, Part 4
