Jeff Levesque edited this page Nov 22, 2015 · 32 revisions

##Overview

XML is a markup language used to encode data in a document. It is both human-readable and machine-readable. A benefit of this markup language is that it has no predefined tags: the author of an XML document may create whatever tags are needed to fit the logical structure of the data.

###Sample Document

<?xml version='1.0'?>

<!-- Sample Dataset-->
<dataset>
  <observation>
    <dependent-variable>James Blonde</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>0034773019</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>

  <observation>
    <dependent-variable>Boston Powers</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>007000007</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>

  ...
</dataset>
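
As a minimal sketch of how such a document might be consumed, the following reads the two observations shown above with Python's standard `xml.etree.ElementTree` module. The function name `read_observations` and the returned tuple layout are illustrative assumptions, not part of the original document, and the elided `...` is left out of the embedded sample.

```python
import xml.etree.ElementTree as ET

# The two observations from the sample document (the elided "..." is omitted).
SAMPLE = """<?xml version='1.0'?>
<!-- Sample Dataset-->
<dataset>
  <observation>
    <dependent-variable>James Blonde</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>0034773019</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>
  <observation>
    <dependent-variable>Boston Powers</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>007000007</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>
</dataset>
"""


def read_observations(xml_text):
    """Return a list of (dependent_variable, {label: value}) tuples.

    Hypothetical helper: all values are kept as strings, exactly as
    they appear in the document.
    """
    root = ET.fromstring(xml_text)  # root is the <dataset> element
    rows = []
    for obs in root.findall("observation"):
        dependent = obs.findtext("dependent-variable")
        features = {
            iv.findtext("label"): iv.findtext("value")
            for iv in obs.findall("independent-variable")
        }
        rows.append((dependent, features))
    return rows


if __name__ == "__main__":
    for dependent, features in read_observations(SAMPLE):
        print(dependent, features)
```

Because XML carries no type information by itself, values such as the SSN keep their leading zeros as text; any conversion to numbers would be a decision made by the consuming program.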

###XML Declaration

An XML document may begin with an optional declaration. If one is used, nothing may precede it, not even whitespace or comments.

Generally, an XML declaration is as follows:

<?xml version='1.0'?>

where the version attribute indicates the XML version being used. A second, optional attribute may be defined in the same declaration: the encoding attribute indicates the character encoding used in the XML document:

<?xml version='1.0' encoding='UTF-8'?>

The XML standard requires all XML software to understand both UTF-8 and UTF-16. When the encoding attribute is not defined, and nothing else (such as a byte order mark) indicates otherwise, the XML document defaults to UTF-8.

Note: an XML declaration is case sensitive, and cannot begin as <?XML ..?>.
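
The placement and case rules above can be checked with a short sketch. It assumes Python's standard `xml.etree.ElementTree` parser as the judge of acceptability; the helper name `is_well_formed` and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Declaration is the very first thing in the document: accepted.
ok = "<?xml version='1.0'?><dataset/>"
# The declaration is optional, so omitting it is also accepted.
no_declaration = "<dataset/>"
# Whitespace before the declaration: rejected.
leading_space = " <?xml version='1.0'?><dataset/>"
# A comment before the declaration: rejected.
leading_comment = "<!-- note --><?xml version='1.0'?><dataset/>"
# Upper-case target: rejected, since the declaration is case sensitive.
upper_case = "<?XML version='1.0'?><dataset/>"

if __name__ == "__main__":
    for name in ("ok", "no_declaration", "leading_space",
                 "leading_comment", "upper_case"):
        print(name, is_well_formed(globals()[name]))
```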

###XML Document

An XML document is syntactically similar to an HTML document. The difference is one of purpose: HTML was designed to display data (presentation), whereas XML was designed to describe data, with a focus on what the data means.

XML syntax requirements:

  • An XML document must have exactly one root element (see <dataset> above)
  • The root element encapsulates all other elements
  • XML element names are case sensitive
  • Every opening tag must have a corresponding closing tag (an element with no content may instead be written in the self-closing form <xxx/>)
  • A closing tag must contain a slash (i.e. </xxx>)
  • XML elements may be nested, provided they do not overlap
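
To make the requirements above concrete, the sketch below feeds a few deliberately broken documents to Python's standard `xml.etree.ElementTree` parser and reports the parser's message for each. The helper name `parse_error` and the sample strings are assumptions made for illustration; the exact wording of the messages depends on the underlying parser.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_text):
    """Return None if xml_text parses, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as exc:
        return str(exc)


CASES = {
    # One root element that encapsulates everything: accepted.
    "single root": "<dataset><observation></observation></dataset>",
    # Two top-level elements, so there is no single root: rejected.
    "two roots": "<dataset></dataset><dataset></dataset>",
    # Names are case sensitive, so </Dataset> does not close <dataset>.
    "case mismatch": "<dataset></Dataset>",
    # <observation> is opened but never closed.
    "missing closing tag": "<dataset><observation></dataset>",
    # Closing tag written without the slash is read as a new opening tag.
    "missing slash": "<dataset><dataset>",
}

if __name__ == "__main__":
    for name, text in CASES.items():
        print(f"{name}: {parse_error(text) or 'well-formed'}")
```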

####DTD Validation

A document type definition (DTD) defines the following properties:

  • which elements are allowed in the XML document
  • which attributes each element is allowed to have
  • the ordering and nesting of these elements

A DTD is declared within the DOCTYPE declaration, which follows the XML declaration.

The following is an example of an inline definition:

<?xml version='1.0' encoding='UTF-8'?>

<!DOCTYPE documentelement [definition]>

while the following is an example of an external definition:

<?xml version="1.0"?> 

<!DOCTYPE documentelement SYSTEM "https://localhost/schema.dtd">

In both cases the declarations themselves are the same. They either replace definition inside the square brackets (inline), or form the contents of schema.dtd (external), as follows:

<!ELEMENT dataset (observation+)>
<!ELEMENT observation (dependent-variable,independent-variable+)>
<!ELEMENT dependent-variable (#PCDATA)>
<!ELEMENT independent-variable (label,value)>
<!ELEMENT label (#PCDATA)>
<!ELEMENT value (#PCDATA)>

The above DTD defines the following structure:

  • a dataset contains at least one observation
  • an observation contains one dependent-variable, followed by at least one independent-variable
  • a dependent-variable contains parsed character data (#PCDATA), that is, text
  • an independent-variable contains a label, followed by a value
  • both label and value contain parsed character data (#PCDATA)

Note: if observation+ were replaced with observation*, a dataset could contain zero or more observations.
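
The structure described above can be approximated in code. Python's standard `xml.etree.ElementTree` parser does not validate against a DTD, so the following is only a hand-written sketch that mirrors the rules listed above; it is not a DTD validator, and the function names `child_tags` and `conforms` are hypothetical. Real DTD validation would need a validating parser, such as the third-party lxml library.

```python
import xml.etree.ElementTree as ET


def child_tags(element):
    """Return the tag names of an element's direct children, in order."""
    return [child.tag for child in element]


def conforms(xml_text):
    """Hand-written structural check mirroring the DTD rules (a sketch)."""
    root = ET.fromstring(xml_text)
    if root.tag != "dataset":
        return False
    observations = list(root)
    # dataset (observation+): one or more observation children, nothing else
    if not observations or any(o.tag != "observation" for o in observations):
        return False
    for obs in observations:
        tags = child_tags(obs)
        # observation (dependent-variable,independent-variable+)
        if len(tags) < 2 or tags[0] != "dependent-variable":
            return False
        if any(tag != "independent-variable" for tag in tags[1:]):
            return False
        # dependent-variable holds text only, so it has no child elements
        if len(obs[0]) != 0:
            return False
        for iv in obs[1:]:
            # independent-variable (label,value): exactly these, in order
            if child_tags(iv) != ["label", "value"]:
                return False
            # label and value hold text only
            if any(len(leaf) != 0 for leaf in iv):
                return False
    return True


VALID = (
    "<dataset><observation>"
    "<dependent-variable>James Blonde</dependent-variable>"
    "<independent-variable><label>Salary</label><value>88500</value>"
    "</independent-variable>"
    "</observation></dataset>"
)

if __name__ == "__main__":
    print(conforms(VALID))
    print(conforms("<dataset/>"))  # no observation, fails observation+
```

Relaxing the `not observations` test would correspond to replacing observation+ with observation* in the DTD.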
