Commit 330808d

Update README
1 parent d07b0f6 commit 330808d

1 file changed

+117
-36
lines changed

README.md

Lines changed: 117 additions & 36 deletions
@@ -7,22 +7,23 @@
77

88
# scriptotek/marc
99

10-
This is a small package that provides a simple interface to parsing
10+
This is a small package that provides a simple interface for working with
1111
MARC records using the [File_MARC package](https://github.com/pear/File_MARC).
12-
13-
The package has only been tested with XML encoded MARC 21.
14-
It should likely support everything File_MARC supports, but that
15-
remains to be tested.
12+
It should work with both Binary MARC and MARCXML (with or without namespaces),
13+
but not the various Line mode MARC formats. Records can be edited using the
14+
editing capabilities of File_MARC.
1615

1716
## Installation using Composer:
1817

1918
```
2019
composer require scriptotek/marc dev-master
2120
```
2221

23-
## Usage examples
22+
## Reading records
2423

25-
### Records from a file or string
24+
Records are loaded into a `Collection` object using
25+
`Collection::fromFile` or `Collection::fromString`,
26+
which autodetects if the data is Binary MARC or XML:
2627

2728
```php
2829
use Scriptotek\Marc\Collection;
@@ -32,34 +33,30 @@ foreach ($collection->records as $record) {
3233
echo $record->getField('250')->getSubfield('a') . "\n";
3334
}
3435
```
35-
It should detect if the data is Binary MARC or XML.
36-
If you have the data as a string, use
37-
`Collection::fromFile()` instead.
38-
39-
### Records from SRU/OAI-PMH response
4036

41-
The package makes it easy to handle records from an SRU or OAI/PMH response.
37+
The package will extract MARC records from any container XML,
38+
so you can load an SRU or OAI-PMH response directly:
4239

4340
```php
44-
$response = file_get_contents('http://lx2.loc.gov:210/NLSBPH?' . http_build_query(array(
41+
$response = file_get_contents('http://lx2.loc.gov:210/lcdb?' . http_build_query(array(
4542
'operation' => 'searchRetrieve',
43+
'recordSchema' => 'marcxml',
4644
'version' => '1.1',
47-
'query' => 'dc.publisher=CNIB%20AND%20dc.date=2005',
4845
'maximumRecords' => '10',
49-
'recordSchema' => 'marcxml'
50-
));
46+
'query' => 'bath.isbn=0761532692',
47+
)));
5148

52-
$collection = Collection::fromSruResponse($response);
49+
$collection = Collection::fromString($response);
5350
foreach ($collection->records as $record) {
54-
echo $record->getField('250')->getSubfield('a') . "\n";
51+
echo $record->getField('245')->getSubfield('a') . "\n";
5552
}
5653

5754
```
5855

59-
### Using MARC spec
56+
## Querying with MARC spec
6057

61-
To easily look up a MARC (sub)field, you can use the MARC spec syntax provided
62-
by the [php-marc-spec package](https://github.com/MARCspec/php-marc-spec):
58+
Using the `Record::get()` method you can query a record using the MARC spec
59+
syntax provided by the [php-marc-spec package](https://github.com/MARCspec/php-marc-spec):
6360

6461
```php
6562
use Scriptotek\Marc\Collection;
@@ -71,31 +68,115 @@ foreach ($collection->records as $record) {
7168
}
7269
```
7370

74-
### Convenience methods for handling common fields
71+
## Convenience methods on the Record class
72+
73+
The `Record` class of File_MARC has been extended with a few
74+
convenience methods to make handling of some everyday tasks easier.
75+
76+
### getType()
77+
78+
Returns either 'Bibliographic', 'Authority' or 'Holdings' based on the
79+
value of leader position 06 (the seventh character, counting from one).
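
As a rough illustration of the idea behind `getType()`, the sketch below maps the MARC 21 "Type of record" code at leader position 06 to one of the three labels. The function name `recordTypeFromLeader` is hypothetical and this is not the package's own implementation; it only assumes the standard MARC 21 code lists for bibliographic, authority and holdings records.

```php
<?php
// Hypothetical helper, NOT the package's getType() implementation.
// Assumes the standard MARC 21 "Type of record" codes at Leader/06.
function recordTypeFromLeader(string $leader): ?string
{
    if (strlen($leader) < 7) {
        return null; // leader too short to contain position 06
    }
    $code = $leader[6]; // zero-based offset 6 = Leader/06

    if (strpos('acdefgijkmoprt', $code) !== false) {
        return 'Bibliographic';
    }
    if ($code === 'z') {
        return 'Authority';
    }
    if (strpos('uvxy', $code) !== false) {
        return 'Holdings';
    }
    return null; // some other record type (e.g. classification)
}

echo recordTypeFromLeader('99999cam a2299999 u 4500'), "\n"; // Bibliographic
```

Returning `null` for unknown codes is a choice made for this sketch only; the package may behave differently.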
7580

76-
The `Record` class has been extended with a few convenience methods to make
77-
handling of everyday tasks easier, in the spirit of
78-
[pymarc](https://github.com/edsu/pymarc). These generally make some
79-
assumptions, for instance that a compound subject string should be joined using
80-
a colon character.
81-
These assumptions may or may not meet *your* expectations. You should inspect
82-
the relevant field class before using it.
81+
### Handlers for specific fields
82+
83+
Hopefully this list will grow larger over time:
84+
85+
* `getIsbns()`
86+
* `getSubjects()`
87+
* `getTitle()`
88+
89+
Each of these methods returns an array of one of the corresponding field classes (located in `src/Fields`).
90+
For instance `getIsbns()` returns an array of `Scriptotek\Marc\Isbn` objects. All the field classes
91+
implement at minimum a `__toString()` method so you can easily get a string representation of the field
92+
for presentation purposes, like so:
8393

8494
```php
8595
use Scriptotek\Marc\Record;
8696

87-
$source = '<?xml version="1.0" encoding="UTF-8" ?>
88-
<record xmlns="info:lc/xmlns/marcxchange-v1">
97+
$record = Record::from('<?xml version="1.0" encoding="UTF-8" ?>
98+
<record xmlns="http://www.loc.gov/MARC21/slim">
8999
<leader>99999cam a2299999 u 4500</leader>
90100
<controlfield tag="001">98218834x</controlfield>
91101
<datafield tag="020" ind1=" " ind2=" ">
92102
<subfield code="a">8200424421</subfield>
93103
<subfield code="q">h.</subfield>
94104
<subfield code="c">Nkr 98.00</subfield>
95105
</datafield>
96-
</record>';
97-
98-
$record = Record::from($source);
106+
</record>');
99107
echo $record->isbns[0];
100-
101108
```
109+
110+
Notice that we used `isbns` instead of `getIsbns()`. In the same way, you can request `$record->subjects` instead of `$record->getSubjects()`, etc. This is made possible using [a little bit of PHP magic](https://github.com/scriptotek/php-marc/blob/master/src/Fields/Field.php#L19).
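
The property-style access can be pictured with PHP's `__get()` magic method. The class below is a hypothetical stand-in written for this illustration (it is not the package's `Record` class, and the package's actual code may differ): reading an undefined property such as `isbns` is forwarded to the matching `getIsbns()` method.

```php
<?php
// Hypothetical stand-in class, only to illustrate the __get() technique.
class MagicRecord
{
    private $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function getIsbns(): array
    {
        return $this->data['isbns'] ?? [];
    }

    // Called by PHP when an inaccessible or undefined property is read.
    public function __get($name)
    {
        $method = 'get' . ucfirst($name);
        if (method_exists($this, $method)) {
            return $this->$method();
        }
        throw new LogicException("Undefined property: $name");
    }
}

$record = new MagicRecord(['isbns' => ['8200424421']]);
echo $record->isbns[0], "\n"; // 8200424421
```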
111+
112+
*But* providing a single, *general* string representation that makes sense in all cases
113+
can sometimes be quite a challenge. The general string representation might not fit your
114+
specific need.
115+
116+
Take the `Title` class based on `245`. The string representation doesn't include data
117+
from `$h` (medium) or `$c` (statement of responsibility, etc.), since that's probably
118+
not the kind of info most non-librarians would expect to see in a "title". But it currently
119+
does include everything contained in `$a` and `$b` (except any final `/` ISBD marker),
120+
which means it doesn't make any attempt to remove parallel titles.<sup id="a1">[1](#f1)</sup>
121+
It also includes text from `$n` (part number) and `$p` (part title), but some other
122+
subfields like `$f`, `$g` and `$k` are currently ignored since I haven't really decided
123+
whether to include them or not.
124+
125+
I would love to remove the ending dot that is present
126+
in records with explicit ISBD markers, but that's not trivial for the same reason
127+
identifying parallel titles is not<sup id="a1">[1](#f1)</sup> – there's just no safe
128+
way to tell if the final dot is an ISBD marker or part of the title.<sup id="a2">[2](#f2)</sup>
129+
Since explicit ISBD markers are included in records catalogued in the American tradition,
130+
but not in records catalogued in the British tradition, a mix of records from both traditions
131+
will look silly.
132+
133+
I hope this makes clear that you need to check if the assumptions and simplifications made
134+
in the string representation methods make sense for *your* project or not. It's also not
135+
unlikely that some methods make false assumptions based on (my) incomplete knowledge of
136+
cataloguing rules/practice. A developer given just a few MARC records might for instance assume
137+
that `300 $a` is a subfield for "number of pages".<sup id="a3">[3](#f3)</sup> A quick glance
138+
at e.g. [LC's MARC documentation](https://www.loc.gov/marc/bibliographic/bd300.html) would
139+
be enough to prove that wrong, but in other cases it's harder to avoid making false assumptions
140+
without deep familiarity with cataloguing rules and practices.
141+
142+
There are also cases where different traditions conflict, and you just have to make a choice.
143+
Subject subfields, for instance, have to be joined using some kind of glue.
144+
[LCSHs](https://en.wikipedia.org/wiki/Library_of_Congress_Subject_Headings) are
145+
ordinarily presented as strings glued together with em-dashes or double en-dashes
146+
(`650 $aPhysics $xHistory $y20th century` is presented as `Physics--History--20th century`).
147+
But in other subject heading systems colons are used as the glue (`Physics : History : 20th century`).
148+
This package defaults to colon, but you can change that by setting `Subject::$glue = '--'` or whatever.
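
To make the effect of the glue concrete, here is a minimal, self-contained sketch. The function `joinSubjectParts` is hypothetical and only mimics the joining step with plain `implode()`; it does not use the package's `Subject` class.

```php
<?php
// Hypothetical helper: joins subject subfield values with a configurable glue.
function joinSubjectParts(array $parts, string $glue = ' : '): string
{
    return implode($glue, $parts);
}

$parts = ['Physics', 'History', '20th century'];

echo joinSubjectParts($parts), "\n";       // Physics : History : 20th century
echo joinSubjectParts($parts, '--'), "\n"; // Physics--History--20th century
```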
149+
150+
## Notes
151+
152+
<b id="f1">1</b> That might change in the future. But even if I decide to remove parallel titles,
153+
I'm not really sure how to do it in a safe way. Parallel titles are identified by a leading `=`
154+
ISBD marker. If the marker is at the end of subfield `$a`, we can be certain it's an ISBD marker,
155+
but since the `$a` and `$b` subfields are not repeatable, multiple titles are just added to the
155+
`$b` subfield. So if we encounter an `=` sign in the middle of `$b` somewhere, how can we
157+
tell if it's an ISBD marker or just an equal sign that is part of the title (like in the fictitious book
158+
`"$aEating the right way : The 2 + 2 = 5 diet"`)? Some kind of escaping would have made that clear,
159+
but the ISBD principles don't seem to call for that, leaving us completely in the dark!
160+
*That* is seriously annoying :weary: [↩](#a1)
161+
162+
<b id="f2">2</b> [According to](http://www.loc.gov/marc/bibliographic/bd245.html)
163+
ISBD principles "field 245 ends with a period, even when another mark of punctuation is present,
164+
unless the last word in the field is an abbreviation, initial/letter, or data that ends with final
165+
punctuation." Determining if something is "an abbreviation, initial/letter, or data that ends with
166+
final punctuation" is certainly not trivial; I would guess that machine learning would be needed
167+
for a highly successful implementation. [↩](#a2)
168+
169+
<b id="f3">3</b> Our old OPAC used to output something like
170+
"Number of pages: One video disc (DVD)…" for DVDs – the developers had apparently just assumed that the
171+
content of `300 $a` could be represented as "number of pages" in all cases. While that sounds silly, getting
172+
the *number* of pages (for documents that actually have pages) from MARC records can be ridiculously hard;
173+
you can safely extract the number from strings like `149 p.` (English), `149 s.` (Norwegian), etc., but you
174+
must ignore the numbers in strings like `10 boxes`, `11 v.` (volumes) etc. So for a start you need a
175+
list of valid abbreviations for "pages" in all relevant languages. Then there are the more complicated cases
176+
like `1 score (16 p.)` – at first sight it looks like we can tokenize that into (number, unit) pairs, like
177+
`("1 score", "16 p.")` and only accept the item(s) having an allowed unit (like `p.`). But then suddenly
178+
comes a case like `"74 p. of ill., 15 p."`, which we would turn into `("74 p. of ill.", "15 p.")`, accepting
179+
`15 p.`, not the correct `74 p.`. So we bite the bullet and start writing rules; if a valid match is found
180+
at the start of the string, then accept it, else if …, else try tokenization, etc. It quickly becomes messy
181+
and it will certainly fail in some cases. Sad to say, after a few years in the library, I still haven't
182+
figured out a general way to extract the number of pages a document has using library data. [↩](#a3)
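
The first two rules described in note 3 could be sketched roughly as below. The function `guessPageCount` and its default unit list are hypothetical and deliberately incomplete: only `p` and `s` are treated as "pages" abbreviations. As the note says, an approach like this will still fail on some records.

```php
<?php
// Hypothetical heuristic, not part of the package and far from complete.
// $units lists abbreviations for "pages" (assumed here: English "p", Norwegian "s").
function guessPageCount(string $extent, array $units = ['p', 's']): ?int
{
    $unitPattern = implode('|', array_map('preg_quote', $units));

    // Rule 1: accept "<number> <unit>." when it occurs at the start of the string.
    if (preg_match('/^\s*(\d+)\s*(?:' . $unitPattern . ')\./', $extent, $m)) {
        return (int) $m[1];
    }

    // Rule 2: otherwise take the first "<number> <unit>." pair found anywhere.
    if (preg_match('/\b(\d+)\s*(?:' . $unitPattern . ')\./', $extent, $m)) {
        return (int) $m[1];
    }

    return null; // no recognisable page count, e.g. "10 boxes" or "11 v."
}

foreach (['149 p.', '11 v.', '1 score (16 p.)', '74 p. of ill., 15 p.'] as $extent) {
    echo $extent, ' => ', var_export(guessPageCount($extent), true), "\n";
}
```

Checking for a match at the start of the string first is what makes `74 p. of ill., 15 p.` yield 74 rather than 15.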
