Using the Expat XML Parser from an RPG Program, Part 3

Article ID: 20100

Expat is an open source XML parser. In the past two issues of this newsletter, I've shown how to use it from your RPG programs, including how to read data from the IFS and feed it to Expat, and how to get the results from the parser using special callback routines called "handlers."

WHY DO I NEED AN XPATH?

As explained in the last article, it helps to implement a stack to keep track of which XML element name you're working with. Once you know that, it's possible to copy the data that has been parsed out of the XML document to a variable in your program, for example:

         select;
         when stack(depth) = 'name';
           myNameVar = val;
         when stack(depth) = 'addrLine1';
           myAddrVar = val;
         when stack(depth) = 'city';
           myCityVar = val;
         when stack(depth) = 'state';
           myStateVar = val;
         when stack(depth) = 'zipCode';
           myZipVar = val;
         endsl;

There's a problem with this logic, however. An XML document can use the same element name in more than one place, with a different meaning in each. For example, consider the following XML that I borrowed from the iSeries Network's Web site:

<?xml version="1.0" ?>
<rss version="2.0">
  <channel>
     <title>iSeries Network News Headlines</title>
     <link>http://iseries.pentontech.com/t?ctl=6784:103C2</link>
     <copyright>Copyright - Penton Media 2005</copyright>
. . .
     <item>
       <title>IBM's Strategy for RPG: Do They Have It Right?</title>
       <link>http://iseries.pentontech.com/t?ctl=6780:103C2</link>
     </item>
     <item>
       <title>Original Software Debuts Data Privacy Protector</title>
       <link>http://iseries.pentontech.com/t?ctl=677E:103C2</link>
     </item>
. . .
  </channel>
</rss>

Look at the "title" tag, above. It's used for both the channel title and the title of each item. The same is true of the "link" tag; it's used to identify the link to the channel as well as the link to each individual news item. How can a program that's parsing this XML document keep track of which title or link it has received?

The answer is something that's referred to as an "Xpath." This is a path that includes not only the element's name, but also the names of the elements that it's "inside."

For example, in the document above, the Xpath for the title of the channel would be "/rss/channel/title", whereas the Xpath for the title of each news item would be "/rss/channel/item/title". If you use the Xpath rather than the bare element name to map the values, you'll be able to differentiate between the different title tags in the document.
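
To see the idea in isolation, here is a minimal sketch in Python rather than RPG, using Python's standard binding to Expat (xml.parsers.expat). It is not the XPATH sample program. The function name collect_paths and the trimmed copy of the RSS document are my own for illustration. Each start tag pushes the parent's path plus the new element name, so the two title elements end up under different paths.

```python
import xml.parsers.expat

# Trimmed copy of the RSS sample above (link elements left out for brevity).
RSS = b"""<?xml version="1.0" ?>
<rss version="2.0">
  <channel>
    <title>iSeries Network News Headlines</title>
    <item>
      <title>Original Software Debuts Data Privacy Protector</title>
    </item>
  </channel>
</rss>"""


def collect_paths(document):
    """Return (path, text) pairs for every element that contains text."""
    stack = []   # one [path, text_pieces] entry per open element
    found = []

    def start(name, attrs):
        # The new path is the parent's path plus this element's name.
        parent = stack[-1][0] if stack else ""
        stack.append([parent + "/" + name, []])

    def chardata(data):
        # Expat may deliver one value in several pieces, so collect them.
        stack[-1][1].append(data)

    def end(name):
        path, pieces = stack.pop()
        text = "".join(pieces).strip()
        if text:
            found.append((path, text))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chardata
    parser.Parse(document, True)
    return found


for path, text in collect_paths(RSS):
    print(path, "=", text)
```

Comparing against the full path, rather than against the last element name, is what lets a program tell the channel title from an item title.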

IMPLEMENTING XPATH WITH EXPAT

To do that with Expat, I've written a sample program named XPATH. In this program, I've enlarged the length of each entry of the array that I've used for my stack, since an Xpath name is longer than an ordinary element name would be. The new definition follows:

     D stack           s           1024A   varying dim(50)

In last week's version of this program (the program called CHARDATA2), the element name was assigned to the stack with the following line of code:

                                    
    stack(depth) = %trimr(elemName);

To extend this code so that it'll store an XPath instead of a simple element name, I've changed the code to this:

         if (depth = 1);                                           
            stack(depth) = '/' + %trimr(elemName);                 
         else;                                                     
            stack(depth) = stack(depth-1) + '/' + %trimr(elemName);
         endif;                                                    

Those minor changes are all that's necessary for the element names to appear as an Xpath instead of a simple element name. If you run the program entitled XPATH that's included in this week's source code, you'll find that it now prints the values from the sample XML document as follows:

  /invoice/ShipTo/name = Scott Klement              
  /invoice/ShipTo/address/addrLine1 = 123 Sesame St 
  /invoice/ShipTo/address/city = New York           
  /invoice/ShipTo/address/state = NY                
  /invoice/ShipTo/address/zipCode = 54321           

That makes it possible to save the data into variables using the code that follows:

   . . .
         when stack(depth) = '/invoice/ShipTo/name';
           myNameVar = val;
         when stack(depth) = '/invoice/ShipTo/address/addrLine1';
           myAddrVar = val;
   . . .

USER DATA INSTEAD OF GLOBAL VARIABLES

The next issue that I wanted to solve is that the program uses global variables. The "stack" and "depth" fields are global to the entire program -- and that makes it more difficult to write re-usable code. Furthermore, if you wanted to save the address into variables, such as the myNameVar and myAddrVar variables in the above code snippet, you'd have to declare those variables as global. It becomes difficult to keep track of what the various routines change and what they do not.

Fortunately, Expat can optionally pass "user data" to each of the handler functions. There's an XML_SetUserData() subprocedure in Expat where you can give Expat the address of a variable in your program that should be passed to each handler routine.

Unfortunately, there's only one such address. You can't tell it to pass two variables! This isn't too big of a problem, however, since you can put everything you need to pass into a single data structure and pass that. The following code illustrates this:

                                                   
     D mydata          ds                  qualified  
     D   depth                       10I 0 inz        
     D   stack                     1024A   varying inz
     D                                     dim(50)    

. . . code to create a parser, open the file, etc. goes here . . .

        XML_SetUserData(p: %addr(mydata));

. . . code to feed the document to XML_Parse() goes here . . .

Now that this has been declared, mydata will be passed as the first parameter to each of the handler functions. For example, the prototype and procedure interface for the chardata() sample from last week will now look like this:

      D chardata        PI                                         
      D   d                                 likeds(mydata)         
      D   string                   65535A   const options(*varsize)
      D   len                         10I 0 value          

Whenever I want to reference data in the mydata structure, I can reference it as the "d" parameter in the chardata() routine. The following code shows this:

         select;                                                      
         when  d.stack(d.depth) = '/invoice/ShipTo/name';             
           myNameVar = newval;                                    
         when  d.stack(d.depth) = '/invoice/ShipTo/address/addrLine1';
           myAddrVar = newval;                                   
       . . .

In the sample program named USERDATA (also included with this week's sample code), I've demonstrated this technique. I've also put the addresses and other information that's parsed out of the XML document into the same data structure so that no global variables are needed in the handler functions.
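
The same pattern can be sketched in Python's binding to Expat. This is an illustration only, not the USERDATA program. The binding does not expose XML_SetUserData(), so the sketch swaps in functools.partial to hand one shared state object to each handler as its first parameter, in the role that "d" plays above. The ParseState class, the FIELDS mapping, and the invoice document (whose shape I've assumed from the paths printed earlier) are all hypothetical.

```python
import xml.parsers.expat
from dataclasses import dataclass, field
from functools import partial

# Assumed document shape, reconstructed from the paths printed by XPATH.
INVOICE = b"""<invoice>
  <ShipTo>
    <name>Scott Klement</name>
    <address>
      <addrLine1>123 Sesame St</addrLine1>
      <city>New York</city>
    </address>
  </ShipTo>
</invoice>"""

# Hypothetical mapping from a path to the field that receives its value.
FIELDS = {
    "/invoice/ShipTo/name": "name",
    "/invoice/ShipTo/address/addrLine1": "addr",
    "/invoice/ShipTo/address/city": "city",
}


@dataclass
class ParseState:
    """Everything the handlers share, in the role of the mydata structure."""
    stack: list = field(default_factory=list)
    values: dict = field(default_factory=dict)


def start(d, name, attrs):
    parent = d.stack[-1] if d.stack else ""
    d.stack.append(parent + "/" + name)


def end(d, name):
    d.stack.pop()


def chardata(d, data):
    target = FIELDS.get(d.stack[-1])
    if target is not None:
        # Expat may deliver one value in several pieces, so append.
        d.values[target] = d.values.get(target, "") + data


def parse_invoice(document):
    d = ParseState()   # local to this call, so no globals are needed
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = partial(start, d)
    parser.EndElementHandler = partial(end, d)
    parser.CharacterDataHandler = partial(chardata, d)
    parser.Parse(document, True)
    return d.values


print(parse_invoice(INVOICE))
```

Because the state is created inside parse_invoice(), two documents can be parsed independently without the handlers stepping on each other's data.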

NEED MORE ENCODINGS? LET THE ISERIES DO IT!

When I started this series, I received several questions about encodings. People have XML documents that are encoded in all sorts of different ways! I received questions about everything from Windows-1252 to the Big5 encoding popular in Asia.

Sadly, the code that I've been showing you assumes that the XML document is in US-ASCII, and uses the system-supplied QTCPEBC table to translate the ASCII to EBCDIC for use in the RPG programs. I must admit that I did that because it was easy, not because it was the best way!

To improve upon that, I decided to take a look at the encoding types that Expat supports. It has native support for ISO-8859-1 (Latin-1), US-ASCII, UTF-8, and UTF-16. Those are the only encodings built into Expat!

Since the iSeries supports a lot more encodings than that, I decided to have it translate the document to UTF-8, let Expat do its parsing in UTF-8, and then have the iSeries convert it to the job's native CCSID after the parsing is complete.

To do that, I first changed the open() API. I told it that the data in the stream file is text and that I'd like to receive that text in the UTF-8 (CCSID 1208) encoding:

         fd = open( '/tmp/testdoc.xml'
                  : O_RDONLY+O_TEXTDATA+O_CCSID
                  : 0
                  : 1208 );

Now that the data will be in UTF-8, I need to tell Expat that the data will be in this format and that it should ignore the encoding that's declared at the start of the document. I can do that by supplying an encoding to the XML_ParserCreate() API, but that string must already be in ASCII! So I created a constant called UTF8 that has the hex values of the characters "UTF-8" in ASCII and passed that as a parameter:

     D UTF8            c                   x'5554462d38'

     . . .
         p = XML_ParserCreate(UTF8);
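
A quick way to check the hex constant is to decode it. The following Python sketch is only a check and not part of the sample programs. It uses code page 037 as a stand-in for an EBCDIC job CCSID, which is an assumption. Your job may use a different one.

```python
# The bytes from the UTF8 constant above.
UTF8_NAME = bytes.fromhex("5554462d38")

# Interpreted as ASCII, the bytes spell the encoding name Expat expects.
ascii_name = UTF8_NAME.decode("ascii")

# The same five characters typed as an ordinary literal in an EBCDIC
# program are different bytes (cp037 here is an assumed code page).
ebcdic_name = "UTF-8".encode("cp037")

print(ascii_name)
print(ebcdic_name != UTF8_NAME)
```

That difference is why the constant is written in hex instead of as the literal 'UTF-8'.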

When Expat sends this information back to me, I'll use the iconv() API to translate from CCSID 1208 to the job's CCSID. Supplying a special CCSID value of 0 when the conversion is opened tells the system to use the job's current CCSID.

The final sample program that I've included this week is called XLATEICONV, and it demonstrates using the iconv() API instead of the QTCPEBC table to translate the XML data.
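
The overall flow (convert to UTF-8, parse as UTF-8, convert the results afterward) can be sketched in Python's binding to Expat. This is a rough analogy and not the XLATEICONV program. Python's codecs stand in for the open() conversion and for iconv(), code page 037 stands in for the job's CCSID, and the function name parse_as_utf8 and the one-line document are made up for illustration.

```python
import xml.parsers.expat

# A document stored in Windows-1252; byte 0xE9 is an accented "e" there.
RAW = b'<?xml version="1.0" encoding="windows-1252"?><name>Jos\xe9</name>'


def parse_as_utf8(raw, source_encoding):
    """Transcode the document to UTF-8, parse it, and return its text."""
    # Role of open() with CCSID 1208: hand the parser UTF-8 bytes.
    utf8 = raw.decode(source_encoding).encode("utf-8")

    pieces = []
    # Supplying an encoding tells Expat to ignore the one that is
    # declared inside the document itself.
    parser = xml.parsers.expat.ParserCreate(encoding="UTF-8")
    parser.CharacterDataHandler = pieces.append
    parser.Parse(utf8, True)
    return "".join(pieces)


name = parse_as_utf8(RAW, "cp1252")

# Role of iconv(): convert the parsed value to an EBCDIC code page.
# cp037 is only an assumed stand-in for the job's CCSID.
ebcdic = name.encode("cp037")
print(name, ebcdic.hex())
```

The parser only ever sees UTF-8, so the set of encodings the program can accept is limited by the conversion step and not by Expat.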

You can download the sample code for this article from the following link: http://iseries.pentontech.com/t?ctl=6779:103C2

I published more sample code that uses the iconv() API in the article entitled "How to Convert Data to UTF-8" in the March 18, 2004, issue of this newsletter. If you have a professional account with the iSeries Network, you can read that article at the following link: http://iseries.pentontech.com/t?ctl=6773:103C2