In last week's newsletter, I introduced the open source Expat XML parser. I demonstrated how to create handlers for the starting and ending elements of the XML tags in a document. In this article, I will take it a bit further and demonstrate how to read the character data from an XML document.
As demonstrated in the previous article, you register "handlers" with Expat to get the output from the parsing function. These handlers are subprocedures that you write in your program. You tell Expat to call them when it finds a particular thing in the XML source. For example, Expat's XML_SetStartElementHandler() API can be called to register a subprocedure that's called when a starting XML element is encountered in the document.
In response to last week's article, I received questions from readers asking me to explain how handlers work and what the %paddr() BIF does in conjunction with setting a handler.
The handler routines utilize a feature of ILE called a "procedure pointer." Think about how a computer runs a program. That program is a series of machine opcodes and data that are loaded into the computer's memory and executed. If executable code is loaded into memory, it has to have an address, or location, in memory where it can be found. A procedure pointer is a variable that contains this address.
You can use this procedure pointer to call a subprocedure. Consider the following sample program:
     H DFTACTGRP(*NO)
     D TestSubproc     PR
     D   Data                        50A   const
     D MyPointer       s               *   procptr
     D CallPtr         PR                  Extproc(MyPointer)
     D   Message                     50A   const
     C                   eval      MyPointer = %paddr(TestSubproc)
     c                   callp     CallPtr('Olives are good.')
     c                   eval      *inlr = *on
     P TestSubproc     B
     D TestSubproc     PI
     D   Data                        50A   const
     D wait            s              1A
     c     data          dsply                   wait
     P                 E
In this example, calling CallPtr() actually runs the TestSubproc() subprocedure. The CallPtr prototype is tied to MyPointer by the Extproc keyword, and MyPointer holds the address that the %paddr() BIF returned for TestSubproc().
To take that a step further, what would happen if you passed MyPointer as a parameter to a subprocedure in a service program? The service program would need its own prototype, like CallPtr(), that calls through the pointer it receives. When the service program made that call, TestSubproc() would run, since its address is what %paddr() returned.
Well, that's exactly what Expat does. You tell it which subprocedure to call for a particular event by passing the subprocedure's address, obtained with the %paddr() BIF, and Expat calls that subprocedure at the appropriate time.
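Here's a minimal sketch of that idea, with the "service program" reduced to a subprocedure in the same module so that it can be compiled on its own. RunCallback(), ShowMsg(), HandlerPtr, and CallIt() are hypothetical names made up for this illustration; none of them are part of Expat.

```rpgle
     H DFTACTGRP(*NO)

      * RunCallback() stands in for a routine in a service program.
      * All names here are hypothetical; none are part of Expat.
     D RunCallback     PR
     D   Handler                       *   procptr value
     D ShowMsg         PR
     D   Data                        50A   const

      /free
         // Hand the address of ShowMsg() to RunCallback(), much as
         // you hand the address of a handler to Expat.
         RunCallback(%paddr(ShowMsg));
         *inlr = *on;
      /end-free

     P RunCallback     B
     D RunCallback     PI
     D   Handler                       *   procptr value
     D HandlerPtr      s               *   procptr
     D CallIt          PR                  extproc(HandlerPtr)
     D   Message                     50A   const
      /free
         // Copy the parameter into the pointer that the CallIt
         // prototype is tied to, then call through it.
         HandlerPtr = Handler;
         if (HandlerPtr <> *null);
            CallIt('Called through a pointer');
         endif;
      /end-free
     P                 E

     P ShowMsg         B
     D ShowMsg         PI
     D   Data                        50A   const
     D wait            s              1A
     c     data          dsply                   wait
     P                 E
```

RunCallback() never refers to ShowMsg() by name. It only knows the address it was given, which is why the same routine could call any subprocedure whose parameters match the CallIt prototype.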
Last week I demonstrated handlers that are called for the starting and ending element of an XML document. Consider the following document:
<?xml version="1.0"?>
<invoice id="54-12343">
  <ShipTo>
    <name>Scott Klement</name>
    <address type="residence">
      <addrLine1>123 Sesame St</addrLine1>
      <city>New York</city>
      <state>NY</state>
      <zipCode>54321</zipCode>
    </address>
  </ShipTo>
  . . .
</invoice>
The subprocedure that you registered for the start of an element would be called for the <invoice>, <ShipTo>, <name>, <address>, etc., XML tags. The subprocedure registered for the end of an element would be called for each of the </name>, </addrLine1>, etc., XML tags.
To get the character data that's located in between these tags, you register a character data handler by calling Expat's XML_SetCharacterDataHandler() API. This handler will be called whenever Expat encounters character data between the tags of the document.
In the example above, it will be called with the string "Scott Klement", then with the string "123 Sesame St", and so on. To try this out, I've created a new program called CHARDATA1 that extends the OUTLINE program from last week. In the new program, I've added a line that sets a character data handler, as follows:
XML_SetCharacterDataHandler(p: %paddr(chardata));
Here's the code for the CharData() subprocedure that Expat will call when character data is received:
     P chardata        B
     D chardata        PI
     D   data                          *   value
     D   string                   65535A   const options(*varsize)
     D   len                         10I 0 value
     D x               s             10I 0
     D val             s            132A
     D newval          s            132A   varying
      /free
       if (len < 1);
          return;
       endif;
       val = %subst(string:1:len);
       QDCXLATE( len
               : val
               : 'QTCPEBC' );
       for x = 1 to len;
          // test the translated (EBCDIC) byte, not the raw ASCII input
          if ( %subst(val:x:1) >= x'40' );
             newval = newval + %subst(val:x:1);
          endif;
       endfor;
       if (%len(newval)<1 or newval = *blanks);
          return;
       endif;
       printme = %subst(blanks: 1: depth)
               + 'Char: '
               + newval;
       except print;
      /end-free
     P                 E
Note that there's special code in this subprocedure that strips out any unprintable characters (EBCDIC characters less than x'40' are unprintable) and code that prevents the data from being printed if it's blank. The reason for this is that Expat will call the character data handler for all character data, even if I don't want it to!
If you look back at the XML data that I posted above, there are carriage return and line feed characters following the <invoice> tag. Expat calls the character data handler for those characters, even though I wouldn't want to print them on the outline. Likewise, it calls CharData() for the blank spaces before the <ShipTo> tag, and I don't really want to print those. That's why the CharData() routine will strip them out.
If you run the CHARDATA1 program, the outline will now look like this:
invoice id="54-12343"
ShipTo
name
Char: Scott Klement
address type="residence"
addrLine1
Char: 123 Sesame St
city
Char: New York
state
Char: NY
zipCode
Char: 54321
Now that you have a way to receive the character data, you need a way to keep track of the XML element that it belongs to. This is important because the goal of parsing an XML document in RPG is usually to load the fields in the document into variables in your program. To know which variable to load a value into, you have to know which tag that value belongs to.
Unfortunately, this isn't as simple as it sounds. You can't simply assume that the last tag you received in the starting element handler will be the one that the character data belongs to. Consider the following XML:
<para>
  <title>Life is good</title>
  On March 14, 2005, IBM and COMMON presented Scott Klement with the iSeries Innovation Award in the Intellectual category.
</para>
As Expat cycles through this XML document, it'll first call the start element handler for <para> (which is short for "paragraph"). It will then call the start element handler for <title>, then the character data handler with the string "Life is good", then the end element handler for </title>, and finally the character data handler with the text of the paragraph.
The problem is, when you get the text for the paragraph, it should belong to the <para> element. But the last element that the start handler was called for was actually the <title> element!
To keep track of the elements properly, you have to implement a stack. Each time the start element handler is called, you have to put a new element on the stack. Each time the end element handler is called, you have to take the top element off the stack so that the top of the stack once again refers to the enclosing tag.
The easiest way to do this in RPG is with an array and a variable that keeps track of which array element is on "top" of the stack. To do that, I've defined the following in the mainline of my CHARDATA2 program:
     D stack           s             50A   varying dim(50)
     D Depth           s             10I 0
To add elements to the stack, I've changed my start element handler, a subprocedure called start(), to read as follows:
     P start           B
     D start           PI
     D   data                          *   value
     D   elem                          *   value
     D   attr                          *   dim(32767)
     D                                     options(*varsize)
     D elemName        s             50A
      /free
       depth = depth + 1;
       elemName = %str(elem);
       QDCXLATE( %len(%trimr(elemName))
               : elemName
               : 'QTCPEBC' );
       stack(depth) = %trimr(elemName);
      /end-free
     P                 E
If you look at the code above, you'll see that each time a starting element is found in the XML document, it adds a new level to the stack by adding 1 to the DEPTH variable. The element name can then be stored in the array at this new depth. When the ending element handler is called, all I have to do to remove that element name from the stack is subtract 1 from the DEPTH variable, as follows:
     P end             B
     D end             PI
     D   data                          *   value
     D   elem                          *   value
      /free
       depth = depth - 1;
      /end-free
     P                 E
I've modified my character data handler so that it'll print out each element in the XML document and its value. It does this by taking the element name that's currently on top of the stack and printing it next to the value that it receives for the character data. It now looks like this:
     P chardata        B
     D chardata        PI
     D   data                          *   value
     D   string                   65535A   const options(*varsize)
     D   len                         10I 0 value
     D x               s             10I 0
     D val             s            132A
     D newval          s            132A   varying
      /free
       if (len < 1);
          return;
       endif;
       val = %subst(string:1:len);
       QDCXLATE( len
               : val
               : 'QTCPEBC' );
       newval = '';
       for x = 1 to len;
          if ( %subst(val:x:1) >= x'40' );
             newval = newval + %subst(val:x:1);
          endif;
       endfor;
       if (%len(newval)<1 or newval = *blanks);
          return;
       endif;
       printme = stack(depth) + ' = ' + newval;
       except print;
      /end-free
     P                 E
An excerpt from the output of this routine follows:
name = Scott Klement
addrLine1 = 123 Sesame St
city = New York
state = NY
zipCode = 54321
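The same stack also sorts out the <para> example from earlier. The following is a minimal simulation rather than a real parse: startElem(), endElem(), and showText() are hypothetical stand-ins for the three handlers, and the mainline calls them by hand in the order described for the <para> document. The text is already EBCDIC and shortened to fit a DSPLY message, so no QDCXLATE call is needed.

```rpgle
     H DFTACTGRP(*NO)

      * Stand-ins for the handlers, called by hand instead of by
      * Expat. All three procedure names are hypothetical.
     D stack           s             50A   varying dim(50)
     D depth           s             10I 0 inz(0)

     D startElem       PR
     D   elemName                    50A   varying const
     D endElem         PR
     D showText        PR
     D   text                        30A   varying const

      /free
         startElem('para');
         startElem('title');
         showText('Life is good');    // displays: title = Life is good
         endElem();                   // top of stack is 'para' again
         showText('On March 14');     // displays: para = On March 14
         endElem();
         *inlr = *on;
      /end-free

     P startElem       B
     D startElem       PI
     D   elemName                    50A   varying const
      /free
         depth = depth + 1;
         stack(depth) = elemName;
      /end-free
     P                 E

     P endElem         B
     D endElem         PI
      /free
         depth = depth - 1;
      /end-free
     P                 E

     P showText        B
     D showText        PI
     D   text                        30A   varying const
     D msg             s             52A
      /free
         msg = stack(depth) + ' = ' + text;
         dsply msg;
      /end-free
     P                 E
```

Because endElem() drops 'title' before the paragraph text arrives, that text is attributed to 'para' rather than to the last tag that was opened.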
If you wanted to, you could add code to the CharData() routine that would map these values into variables so that you could put them on a screen, use them for calculations, or do whatever you need to do. All you'd have to do is add a SELECT group to the CharData() routine. For example, you might do the following:
       select;
       when stack(depth) = 'name';
          myNameVar = val;
       when stack(depth) = 'addrLine1';
          myAddrVar = val;
       when stack(depth) = 'city';
          myCityVar = val;
       when stack(depth) = 'state';
          myStateVar = val;
       when stack(depth) = 'zipCode';
          myZipVar = val;
       endsl;
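One caveat when mapping values this way: the Expat documentation notes that a single block of contiguous text may be delivered in more than one call to the character data handler. If that happens, a plain assignment in the handler keeps only the last piece. One way to cope, sketched below for simple elements without mixed content, is to append each piece to a buffer and do the mapping when the element ends. As before, this is a simulation with hypothetical names (startElem(), addText(), endElem(), textBuf), and the mainline stands in for Expat.

```rpgle
     H DFTACTGRP(*NO)

      * Hypothetical stand-ins for the handlers; the mainline plays
      * the part of Expat. The data is already EBCDIC.
     D stack           s             50A   varying dim(50)
     D depth           s             10I 0 inz(0)
     D textBuf         s            132A   varying
     D myNameVar       s             50A

     D startElem       PR
     D   elemName                    50A   varying const
     D addText         PR
     D   text                       132A   varying const
     D endElem         PR

      /free
         // Simulate <name>Scott Klement</name> arriving in two pieces
         startElem('name');
         addText('Scott ');
         addText('Klement');
         endElem();
         dsply myNameVar;             // displays: Scott Klement
         *inlr = *on;
      /end-free

     P startElem       B
     D startElem       PI
     D   elemName                    50A   varying const
      /free
         depth = depth + 1;
         stack(depth) = elemName;
         textBuf = '';                // start collecting fresh text
      /end-free
     P                 E

     P addText         B
     D addText         PI
     D   text                       132A   varying const
      /free
         textBuf = textBuf + text;    // append, don't overwrite
      /end-free
     P                 E

     P endElem         B
     D endElem         PI
      /free
         // The element is complete, so its text can be mapped now
         select;
         when stack(depth) = 'name';
            myNameVar = textBuf;
         endsl;
         textBuf = '';
         depth = depth - 1;
      /end-free
     P                 E
```

Clearing the buffer in startElem() is what limits this sketch to simple elements. A document like the <para> example, where text follows a nested element, would need a separate buffer for each stack level.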
In the next article in this series, I'll take this concept of mapping variables a bit further by explaining how to calculate an "XPath" from the element names and how to eliminate the need for global variables to represent the stack.
You can download the code for this article.
The source code for my iSeries port of the Expat XML parser can be found on my Web site at the following link: http://www.scottklement.com/expat/
The previous article in this series can be viewed from the iSeries Network Web site at the following link: systeminetwork.com/article/using-expat-xml-parser-rpg-program-part-1