Improved documentation of PicaDecoder.

cboehme · cboehme · commit bd300862874c · 2013-11-03T11:32:49.000+01:00
Added an in-depth description of the parser behaviour.
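The parsing rules documented in this commit can be illustrated with a small standalone sketch. It is not the actual PicaDecoder implementation: the class name PicaSketch and its parse method are hypothetical, it records the receiver calls as plain strings instead of calling a StreamReceiver, and it always behaves as if skipEmptyFields were true.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the documented rules; not the real PicaDecoder. */
public final class PicaSketch {

    // Record marker (0x1d), field marker (0x1e) and field end marker (0x0a)
    // are all treated as field separators.
    private static final String FIELD_SEPARATORS = "[\\x1d\\x1e\\x0a]";
    // The subfield marker (0x1f) separates the field name and the subfields.
    private static final String SUBFIELD_MARKER = "\u001f";

    private PicaSketch() {
    }

    /** Returns the receiver calls a single record would trigger, as strings. */
    public static List<String> parse(String record) {
        List<String> calls = new ArrayList<>();
        for (String field : record.split(FIELD_SEPARATORS)) {
            // Limit -1 keeps trailing empty subfields so they can be filtered explicitly.
            String[] parts = field.split(SUBFIELD_MARKER, -1);
            // Spaces in the field name are not part of the entity name.
            String name = parts[0].replace(" ", "");
            List<String> literals = new ArrayList<>();
            for (int i = 1; i < parts.length; i++) {
                if (parts[i].isEmpty()) {
                    continue; // a subfield without a name is ignored
                }
                // The first character is the subfield name, the rest is the value.
                literals.add("literal(" + parts[i].charAt(0) + ", "
                        + parts[i].substring(1) + ")");
            }
            if (literals.isEmpty()) {
                continue; // assumption: fields without parsed subfields are skipped
            }
            calls.add("startEntity(" + name + ")");
            calls.addAll(literals);
            calls.add("endEntity()");
        }
        return calls;
    }

    public static void main(String[] args) {
        for (String call : parse("028A \u001faAndy\u001fdWarhol\u001e")) {
            System.out.println(call);
        }
    }
}
```

For the example input used in the Javadoc, the sketch yields startEntity(028A), literal(a, Andy), literal(d, Warhol) and endEntity(), while an input such as "003@ " followed by 0x1f and 0x1e yields no calls at all.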
diff --git a/src/main/java/org/culturegraph/mf/stream/converter/bib/PicaDecoder.java b/src/main/java/org/culturegraph/mf/stream/converter/bib/PicaDecoder.java
@@ -23,28 +23,83 @@
 
 
 /**
- * Parses a PICA+ record with UTF8 encoding assumed.
+ * <p>Parses pica+ records. The parser only parses single records.
+ * A string containing multiple records must be split into
+ * individual records before passing it to {@code PicaDecoder}.</p>
+ * 
+ * <p>The parser is designed to accept any string as valid input and
+ * to parse pica plain format as well as normalised pica. To
+ * achieve this, the parser behaves as follows:</p>
+ * 
+ * <ul>
+ * <li>Fields are separated by record markers (0x1d), field
+ * markers (0x1e) or field end markers (0x0a).</li>
+ * <li>The field name and the first subfield are separated by
+ * a subfield marker (0x1f).</li>
+ * <li>The parser assumes that the input starts with a field
+ * name.</li>
+ * <li>The parser assumes that the end of the input marks
+ * the end of the current field and the end of the record.
+ * </li>
+ * <li>Subfields are separated by subfield markers (0x1f).</li>
+ * <li>The first character of a subfield is the name of the
+ * subfield.</li>
+ * <li>To handle input with multiple field and subfield separators
+ * following each other directly (for instance 0x0a and 0x1e), it
+ * is assumed that field names, subfields, subfield names or
+ * subfield values can be empty.</li>
+ * </ul>
+ * 
+ * <p>Please note that the record marker is treated as a field
+ * delimiter and not as a record delimiter. Records need to be
+ * separated prior to parsing them.</p>
+ * 
+ * <p>As the behaviour of the parser may result in unnamed fields or
+ * subfields or fields with no subfields, the {@code PicaDecoder}
+ * automatically filters empty fields and subfields:</p>
+ * 
+ * <ul>
+ * <li>Subfields without a name are ignored (such subfields cannot
+ * have any value because then the first character of the value
+ * would be the subfield name).</li>
+ * <li>Subfields which only have a name but no value are always
+ * parsed.</li>
+ * <li>Unnamed fields are only parsed if they contain subfields
+ * that are not ignored.</li>
+ * <li>Named fields containing no subfields or only ignored subfields
+ * are only parsed if {@code skipEmptyFields} is set to {@code false};
+ * otherwise they are ignored.</li>
+ * <li>Input containing only whitespace (spaces and tabs) is
+ * completely ignored.</li>
+ * </ul>
+ * 
+ * <p>The {@code PicaDecoder} calls {@code receiver.startEntity} and
+ * {@code receiver.endEntity} for each parsed field and
+ * {@code receiver.literal} for each parsed subfield. Spaces in the
+ * field name are not included in the entity name. The input
+ * "028A \x1faAndy\x1fdWarhol\x1e" would produce the following
+ * sequence of calls:</p>
  * 
- * For each field in the stream the module calls:
  * <ol>
- * <li>receiver.startEntity</li>
- * <li>receiver.literal for each subfield of the field</li>
- * <li>receiver.endEntity</li>
+ * <li>receiver.startEntity("028A")</li>
+ * <li>receiver.literal("a", "Andy")</li>
+ * <li>receiver.literal("d", "Warhol")</li>
+ * <li>receiver.endEntity()</li>
  * </ol>
  * 
- * Spaces in the field name are not included in the entity name.
- * 
- * Empty subfields are skipped. For instance, processing the following input
- * would NOT produce an empty literal: 003@ \u001f\u001e. The parser also
- * skips unnamed fields without any subfields.
+ * <p>The content of subfield 003@$0 is used for the record id. If
+ * {@code ignoreMissingIdn} is false and field 003@$0 is not found
+ * in the record, a {@link MissingIdException} is thrown.</p>
  * 
- * If {@code ignoreMissingIdn} is false and field 003@$0 is not found in the
- * record a {@link MissingIdException} is thrown.
+ * <p>The parser assumes that the input is UTF-8 encoded. The parser
+ * does not support other pica encodings.</p>
  * 
  * @author Christoph Böhme
  * 
  */
-@Description("Parses a PICA+ record with UTF8 encoding assumed.")
+@Description("Parses pica+ records. The parser only parses single records. " +
+		"A string containing multiple records must be split into " +
+		"individual records before passing it to PicaDecoder.")
 @In(String.class)
 @Out(StreamReceiver.class)
 public final class PicaDecoder
@@ -144,7 +199,7 @@ private boolean recordIsEmpty() {
 	/**
 	 * Searches the record for the sequence specified in {@code ID_FIELD}
 	 * and returns all characters following this sequence until the next
-	 * control character (see {@link PicaConstants} is found or the end of
+	 * control character (see {@link PicaConstants}) is found or the end of
 	 * the record is reached. Only the first occurrence of the sequence is
 	 * processed, later occurrences are ignored.
 	 *