Use Case: Getting a Single Element

do- edited this page Dec 27, 2021 · 15 revisions

Input

Consider the following XML given as a string named xmlSource:

<?xml version="1.0"?><SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header/>
  <SOAP-ENV:Body>
    <SendRequestRequest xmlns="urn://some-schema" xmlns:ns0="urn://another-schema">
      <SenderProvidedRequestData Id="Ue7e71ce1-7ce3-4ca5-a689-1a8f2edbb1af">
        <MessageID>3931cda8-3245-11ec-b0bc-000c293433a0</MessageID>
        <ns0:MessagePrimaryContent>
          <ExportDebtRequestsRequest>
            <!-- ... and so on ... -->

Problem

Find the first SenderProvidedRequestData element and get it as a JavaScript object with the following structure:

{
  Id: "Ue7e71ce1-7ce3-4ca5-a689-1a8f2edbb1af",
  MessageID: "3931cda8-3245-11ec-b0bc-000c293433a0",
  MessagePrimaryContent: {
    ExportDebtRequestsRequest: {
      // and so on
    }
  }
}

Basic Solution

const {XMLReader, XMLNode} = require ('xml-toolkit')

const data = await new XMLReader ({
  filterElements : 'SenderProvidedRequestData', 
  map            : XMLNode.toObject ({})
}).process (xmlSource).findFirst ()

Explanation

Here:

  • an XMLReader is created
    • with the filterElements option, which tells it to ignore everything before the first SenderProvidedRequestData element occurs,
    • and with the map option, which requests the XMLNode.toObject transformation;
  • the process method implicitly creates an XMLLexer instance and performs all the necessary piping;
  • the asynchronous findFirst method waits for the first result, then frees all the resources used and returns that result.

Q&A

Are namespace references lost from the output?

By default, yes (which is standard practice in XML to JSON mapping).

But they are easy to reveal by passing a custom getName to XMLNode.toObject. For example:

map: XMLNode.toObject ({
  getName: (localName, namespaceURI) => `{${namespaceURI}}${localName}`,
  //...
}),
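The effect of such a callback can be checked on its own, without running the parser. The sketch below is plain JavaScript and does not call xml-toolkit; the name clarkName and the fallback for a missing namespace are illustrative assumptions, not something the library is known to require:

```javascript
// Hypothetical standalone version of the getName callback shown above.
// Assumption: for an element without a namespace, namespaceURI may be
// null or undefined, so the braces are omitted in that case.
const clarkName = (localName, namespaceURI) =>
  namespaceURI ? `{${namespaceURI}}${localName}` : localName

console.log (clarkName ('MessageID', 'urn://some-schema')) // {urn://some-schema}MessageID
console.log (clarkName ('MessageID', null))                // MessageID
```

With the unguarded template from the snippet above, the second call would produce '{null}MessageID', which may or may not be acceptable for a given application.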

How to specify the XML namespace of the element sought?

By supplying filterElements in the form of a function mapping XMLNode to Boolean:

filterElements: e =>
  e.localName    === 'SenderProvidedRequestData' &&
  e.namespaceURI === 'urn://some-schema'
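Since such a filter is an ordinary function, it can be tried out on plain objects before being passed to XMLReader. In this sketch the objects are stand-ins carrying only the two properties read above (localName and namespaceURI); they are not real XMLNode instances, and the name isWanted is illustrative:

```javascript
const isWanted = e =>
  e.localName    === 'SenderProvidedRequestData' &&
  e.namespaceURI === 'urn://some-schema'

// Stand-ins for XMLNode: only the properties the predicate reads
const candidates = [
  {localName: 'SenderProvidedRequestData', namespaceURI: 'urn://some-schema'},
  {localName: 'SenderProvidedRequestData', namespaceURI: 'urn://another-schema'},
  {localName: 'MessageID',                 namespaceURI: 'urn://some-schema'},
]

console.log (candidates.map (isWanted)) // [ true, false, false ]
```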

How to find an element by its attribute values, parent element properties etc.?

Same as above: by expressing all the necessary conditions in the filterElements function. See the XMLNode page for more details on its properties.
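When several conditions pile up, one possible way to keep the filter readable is to build it from small predicates. The helper names below (allOf, hasName, hasNamespace, atLevel) are hypothetical and not part of xml-toolkit; the sketch relies only on the localName, namespaceURI and level properties mentioned on this page, and assumes level counts nesting depth from 0 at the root:

```javascript
// Hypothetical helper: true only when every given predicate accepts the node
const allOf = (...predicates) => e => predicates.every (p => p (e))

const hasName      = name => e => e.localName    === name
const hasNamespace = uri  => e => e.namespaceURI === uri
const atLevel      = n    => e => e.level        === n

// Envelope (0) > Body (1) > SendRequestRequest (2) > SenderProvidedRequestData (3)
const filterElements = allOf (
  hasName      ('SenderProvidedRequestData'),
  hasNamespace ('urn://some-schema'),
  atLevel      (3)
)

// Stand-in object, not a real XMLNode
console.log (filterElements ({
  localName    : 'SenderProvidedRequestData',
  namespaceURI : 'urn://some-schema',
  level        : 3
})) // true
```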

Is filterElements needed to get the root document element?

Yes, it's necessary anyway.

Without filterElements, the first object emitted by XMLReader will represent the <?xml version="1.0"?> prolog, not an element at all. If the prolog is missing, it will be the root element, but without any inner content (corresponding to the StartElement event instead of EndElement).

To get the root element without mentioning its name, one can use:

filterElements: e => e.level === 0

What if the desired element is missing?

The result will be null.

What if the element has neither attributes nor children?

The result will be a string representing its text content. For instance, in the example above:

const id = await new XMLReader ({
  filterElements : 'MessageID', 
  map            : XMLNode.toObject ({})
}).process (xmlSource).findFirst () // will be '3931cda8-3245-11ec-b0bc-000c293433a0'

If the string is empty (zero length), null will be returned, as if the element were missing.
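Because a missing element and an empty one both yield null, calling code that needs the value may want to fail early. A minimal sketch: the helper name required is hypothetical, and the two promises merely stand in for what findFirst () would return:

```javascript
// Hypothetical helper: awaits a lookup and rejects when it yielded null
const required = async (lookup, what) => {
  const value = await lookup
  if (value == null) throw new Error (`${what} is missing or empty`)
  return value
}

// Stand-ins for `...process (xmlSource).findFirst ()`
const found   = Promise.resolve ('3931cda8-3245-11ec-b0bc-000c293433a0')
const missing = Promise.resolve (null)

required (found,   'MessageID').then  (id => console.log (id))
required (missing, 'MessageID').catch (e  => console.log (e.message))
```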

Why are some whitespace characters lost?

Using filterElements implicitly turns on the stripSpace option, which causes all (merged) text nodes to be trimmed.

For example:

<Poem>
Onion juice in the eyes
Is the reason she cries.  
</Poem>

(source) will be translated to 'Onion juice in the eyes\nIs the reason she cries.', not to '\nOnion juice in the eyes\nIs the reason she cries.\n'.

Note: the inner line feed is guaranteed to be preserved. Usually, this is the desired behavior.
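The difference can be reproduced with plain string operations. This sketch does not call the library; it assumes that, for this input, the trimming behaves like String.prototype.trim:

```javascript
// Untrimmed text content of <Poem>, as quoted above
const raw = '\nOnion juice in the eyes\nIs the reason she cries.\n'

// Leading and trailing whitespace goes away, the inner line feed stays
const stripped = raw.trim ()

console.log (JSON.stringify (stripped)) // "Onion juice in the eyes\nIs the reason she cries."
```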

But the application developer is always free to explicitly set

stripSpace: false,
filterElements: // ...

and to process all the characters read from the XML source on their own.

How to get multiple text nodes interleaved with child elements?

XMLNode.toObject cannot properly handle such data structures (typical of documentation and other loosely structured XML that is not bound to a database).

Without a map option set, the result will be an XMLNode instance. All the parsed content is available from there. See the corresponding docs for more details on the API.

What if the XML source is so large it doesn't fit in memory?

To find a small piece of a huge XML, one can specify a readable stream instead of a string as xmlSource. The process will stop right after finding the element, and all the resources will be released immediately.

If the desired element is missing, the source will be scanned completely, but with only a minimal memory buffer in use.

But when the root element is required, XMLReader has no other option than to build the complete document tree.
