Commit ecab227

docs: Update README
1 parent fd2f16f commit ecab227

1 file changed

+112
-69
lines changed

README.md

Lines changed: 112 additions & 69 deletions
@@ -8,23 +8,30 @@
88

99
# scriptotek/marc
1010

11-
This is a small package that provides a simple interface for working with
12-
MARC records using the [File_MARC package](https://github.com/pear/File_MARC).
13-
It should work with both Binary MARC and MARCXML (with or without namespaces),
14-
but not the various Line mode MARC formats. Records can be edited using the
15-
editing capabilities of File_MARC.
11+
A small PHP package providing a simple interface to work with MARC21 records
12+
on top of the excellent [File_MARC package](https://github.com/pear/File_MARC).
13+
14+
Works with both Binary MARC and MARCXML (namespaced or not), but not the various
15+
Line mode MARC formats. Records can be edited using the editing capabilities of
16+
File_MARC.
17+
18+
Note that version 0.3.0 introduced a few breaking changes. See
19+
[releases](https://github.com/scriptotek/php-marc/releases) for more information.
1620

1721
## Installation using Composer:
1822

23+
If you have [Composer](https://getcomposer.org/) installed, the package can
24+
be installed by running
25+
1926
```
20-
composer require scriptotek/marc dev-master
27+
composer require scriptotek/marc
2128
```
2229

2330
## Reading records
2431

2532
Use `Collection::fromFile` or `Collection::fromString` to read one or more
26-
MARC records from a file or string. The methods autodetect if the data is
27-
Binary MARC or XML (namespaced or not).
33+
MARC records from a file or string. The methods autodetect the data format
34+
(Binary MARC or MARCXML) and whether the XML is namespaced or not.
2835

2936
```php
3037
use Scriptotek\Marc\Collection;
@@ -35,8 +42,12 @@ foreach ($collection as $record) {
3542
}
3643
```
3744

38-
The package can extract MARC records from any container XML, so you can load
39-
an SRU or OAI-PMH response directly:
45+
The `$collection` object is an iterator. If you would rather have a normal array,
46+
for instance in order to count the number of records, you can get that from
47+
`$collection->toArray()`.
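
The distinction matters because an iterator can be looped over but not counted directly. The following is a minimal sketch using only built-in PHP, where the generator `fakeRecords()` is a hypothetical stand-in for a collection; it does not show how the package implements `toArray()`.

```php
<?php
// Hypothetical stand-in for a record collection: a generator is an
// iterator, so it can be looped over but not passed to count().
function fakeRecords(): Generator
{
    yield 'record 1';
    yield 'record 2';
    yield 'record 3';
}

// Materialize the iterator into a plain, zero-indexed array, then count it.
$records = iterator_to_array(fakeRecords(), false);
$total = count($records);

echo $total . "\n";
```
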
48+
49+
The loader can extract MARC records from any container XML, so you can pass
50+
in an SRU or OAI-PMH response directly:
4051

4152
```php
4253
$response = file_get_contents('http://lx2.loc.gov:210/lcdb?' . http_build_query([
@@ -51,14 +62,12 @@ $records = Collection::fromString($response);
5162
foreach ($records as $record) {
5263
...
5364
}
54-
5565
```
5666

5767
If you only have a single record, you can also use `Record::fromFile` or
5868
`Record::fromString`. These use the `Collection` methods under the hood,
5969
but return a single `Record` object.
6070

61-
6271
```php
6372
use Scriptotek\Marc\Record;
6473

@@ -105,26 +114,43 @@ $record->query('250$a')->text();
105114

106115
## Convenience methods on the Record class
107116

108-
The `Record` class of File_MARC has been extended with a few
109-
convenience methods to make handling of some everyday tasks easier.
117+
The `Record` class extends `File_MARC_Record` with a few convenience methods to
118+
get data from commonly used fields. Each of these methods, except `getType()`,
119+
returns an object or an array of objects of one of the field classes (located in
120+
`src/Fields`). For instance `getIsbns()` returns an array of
121+
`Scriptotek\Marc\Isbn` objects. All the field classes implement at minimum a
122+
`__toString()` method so you can easily get a string representation of the field
123+
for presentation purposes.
124+
125+
Note that all the get methods can also be accessed as attributes thanks to a
126+
little PHP magic (`__get`). So instead of calling `$record->getId()`, you can
127+
use the shorthand variant `$record->id`.
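
For readers unfamiliar with `__get`, the pattern can be sketched roughly as follows. `DemoRecord` and its `$controlNumber` property are hypothetical names used for illustration only; the package's own implementation may differ in detail.

```php
<?php
// Hypothetical class, not the package's actual Record implementation.
class DemoRecord
{
    private $controlNumber;

    public function __construct(string $controlNumber)
    {
        $this->controlNumber = $controlNumber;
    }

    public function getId(): string
    {
        return $this->controlNumber;
    }

    // PHP calls __get() when an undefined or inaccessible property is read.
    public function __get(string $name)
    {
        // Map the property name to a getter: "id" becomes "getId".
        $method = 'get' . ucfirst($name);
        if (method_exists($this, $method)) {
            return $this->$method();
        }
        return null;
    }
}

$demo = new DemoRecord('12149361');
echo $demo->id . "\n"; // same value as $demo->getId()
```

In this sketch, a property name with no matching getter simply yields `null` rather than raising an error.
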
110128

111-
### getType()
129+
### type
112130

113-
Returns either 'Bibliographic', 'Authority' or 'Holdings' based on the
114-
value of the sixth character in the leader.
131+
`$record->getType()` or `$record->type` returns either 'Bibliographic', 'Authority'
132+
or 'Holdings' based on the value of LDR/06 (type of record) in the leader.
133+
See `Marc21.php` for supporting constants.
115134

116-
### Handlers for specific fields
135+
```php
136+
if ($record->type == Marc21::BIBLIOGRAPHIC) {
137+
// ...
138+
}
139+
```
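
As a rough illustration of what such a lookup involves, the sketch below maps the MARC 21 type-of-record code (Leader/06, counting from zero) to the three record kinds. The function name `guessRecordType()` is hypothetical, and the package's own logic and constants may be organized differently.

```php
<?php
// Illustrative only: the function name is hypothetical and this is not
// the package's implementation.
function guessRecordType(string $leader): ?string
{
    // Leader/06 is zero-indexed, i.e. the seventh character of the leader.
    $code = substr($leader, 6, 1);

    if ($code === 'z') {
        return 'Authority';
    }
    if (in_array($code, ['u', 'v', 'x', 'y'], true)) {
        return 'Holdings';
    }
    if (in_array($code, str_split('acdefgijkmoprt'), true)) {
        return 'Bibliographic';
    }

    // Unknown code or leader too short.
    return null;
}

echo guessRecordType('00714cam a2200205 a 4500') . "\n";
```
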
140+
141+
### catalogingForm
117142

118-
Hopefully this list will grow larger over time:
143+
`$record->getCatalogingForm()` or `$record->catalogingForm` returns the value
144+
of LDR/18. See `Marc21.php` for supporting constants.
119145

120-
* `getIsbns()`
121-
* `getSubjects()`
122-
* `getTitle()`
146+
### id
123147

124-
Each of these methods returns an array of one of the corresponding field classes (located in `src/Fields`).
125-
For instance `getIsbns()` returns an array of `Scriptotek\Marc\Isbn` objects. All the field classes
126-
implements at minimum a `__toString()` method so you can easily get a string representation of the field
127-
for presentation purpose, like so:
148+
`$record->getId()` or `$record->id` returns the record id from the 001 control field.
149+
150+
### isbns
151+
152+
`$record->getIsbns()` or `$record->isbns` returns an array of `Isbn` objects from
153+
020 fields.
128154

129155
```php
130156
use Scriptotek\Marc\Record;
@@ -139,67 +165,84 @@ $record = Record::fromString('<?xml version="1.0" encoding="UTF-8" ?>
139165
<subfield code="c">Nkr 98.00</subfield>
140166
</datafield>
141167
</record>');
142-
echo $record->isbns[0];
168+
$isbn = $record->isbns[0];
169+
170+
// Get the string representation of the field:
171+
echo $isbn . "\n"; // '8200424421'
172+
173+
// Get the value of $q using the standard File_MARC interface:
174+
echo $isbn->getSubfield('q')->getData() . "\n"; // 'h.'
175+
176+
// or using the shorthand `sf()` method from the Field class:
177+
echo $isbn->sf('q') . "\n"; // 'h.'
178+
```
179+
180+
### title
181+
182+
`$record->getTitle()` or `$record->title` returns a `Title` object from the 245
183+
field, or null if no such field is present.
184+
185+
Beware that the default string representation may or may not fit your needs.
186+
It's currently a concatenation of `$a` (title), `$b` (remainder of title),
187+
`$n` (part number) and `$p` (part title). For the remaining subfields like `$f`,
188+
`$g` and `$k`, I haven't decided whether to handle them or not.
189+
190+
Parallel titles are unfortunately encoded in such a way that there's no way I'm
191+
aware of to identify them in a secure manner, meaning there's also no secure way
192+
to remove them if you don't want to include them.<sup id="a1">[1](#f1)</sup>
193+
194+
I'm trimming off any final '`/`' ISBD marker. I would have loved to be able to
195+
also trim off final dots, but that's not trivial for the same reason identifying
196+
parallel titles is not<sup id="a1">[1](#f1)</sup> – there's just no safe way to
197+
tell if the final dot is an ISBD marker or part of the title.<sup
198+
id="a2">[2](#f2)</sup> Since explicit ISBD markers are included in records
199+
catalogued in the American tradition, but not in records catalogued in the
200+
British tradition, a mix of records from both traditions will look silly.
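
To make the trimming rule concrete, here is a minimal sketch of removing a final `/` marker while leaving final dots untouched. The helper name `trimTrailingSlash()` is hypothetical; it is not taken from the package.

```php
<?php
// Hypothetical helper, not the package's implementation.
function trimTrailingSlash(string $title): string
{
    // Remove one final "/" together with the whitespace around it.
    // Final dots are deliberately left alone, since they may be part
    // of the title.
    return preg_replace('/\s*\/\s*$/', '', $title);
}

echo trimTrailingSlash('Eating the right way : the 2 + 2 = 5 diet /') . "\n";
```
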
201+
202+
### subjects
203+
204+
`$record->getSubjects($vocabulary, $tag)` or `$record->subjects` returns an array of `Subject`
205+
objects from all [the 6XX fields](http://www.loc.gov/marc/bibliographic/bd6xx.html).
206+
The `getSubjects()` method has two optional arguments you can use to limit by
207+
vocabulary and/or tag.
208+
209+
```php
210+
foreach ($record->getSubjects('mesh', Subject::TOPICAL_TERM) as $subject) {
211+
echo "{$subject->vocabulary} {$subject->type} {$subject}";
212+
}
143213
```
144214

145-
Notice that we used `isbns` instead of `getIsbns()`. In the same way, you can request `$record->subjects` instead of `$record->getSubjects()`, etc. This is made possible using [a little bit of PHP magic](https://github.com/scriptotek/php-marc/blob/master/src/Fields/Field.php#L19).
146-
147-
*But* providing a single, *general* string representation that makes sense in all cases
148-
can sometimes be quite a challenge. The general string representation might not fit your
149-
specific need.
150-
151-
Take the `Title` class based on `245`. The string representation doesn't include data
152-
from `$h` (medium) or `$c` (statement of responsibility, etc.), since that's probably
153-
not the kind of info most non-librarians would expect to see in a "title". But it currently
154-
does include everything contained in `$a` and `$b` (except any final `/` ISBD marker),
155-
which means it doesn't make any attempt of removing parallel titles.<sup id="a1">[1](#f1)</sup>
156-
It also includes text from `$n` (part number) and `$p` (part title), but yet some other
157-
subfields like `$f`, `$g` and `$k` are currently ignored since I haven't really decided
158-
whether to include them or not.
159-
160-
I would love to remove the ending dot that is present
161-
in records with explicit ISBD markers, but that's not trivial for the same reason
162-
identifying parallel titles is not<sup id="a1">[1](#f1)</sup> – there's just no safe
163-
way to tell if the final dot is an ISBD marker or part of the title.<sup id="a2">[2](#f2)</sup>
164-
Since explicit ISBD markers are included in records catalogued in the American tradition,
165-
but not in records catalogued in the British tradition, a mix of records from both traditions
166-
will look silly.
167-
168-
I hope this makes clear that you need to check if the assumptions and simplifications made
169-
in the string representation methods makes sense to *your* project or not. It's also not
170-
unlikely that some methods make false assumptions based on (my) incomplete knowledge of
171-
cataloguing rules/practice. A developer given just a few MARC records might for instance assume
172-
that `300 $a` is a subfield for "number of pages".<sup id="a3">[3](#f3)</sup> A quick glance
173-
at e.g. [LC's MARC documentation](https://www.loc.gov/marc/bibliographic/bd300.html) would
174-
be enough to prove that wrong, but in other cases it's harder to avoid making false assumptions
175-
without deep familiarity with cataloguing rules and practices.
176-
177-
There's also cases where different traditions conflict, and you just have to make a choice.
178-
Subject subfields, for instance, have to be joined using some kind of glue.
179-
[LCSHs](https://en.wikipedia.org/wiki/Library_of_Congress_Subject_Headings) are
180-
ordinarily presented as strings glued together with em-dashes or double en-dashes
181-
(`650 $aPhysics $xHistory $yHistory` is presented as `Physics--History--20th century`).
182-
But in other subject heading systems colons are used as the glue (`Physics : History : 20th century`).
183-
This package defaults to colon, but you change that by setting `Subject::glue = '--'` or whatever.
215+
The string representation of this field makes use of the constant `Subject::glue`
216+
to glue subject components together. The default value is a space-padded colon,
217+
making `Physics : History : 20th century` the string representation of
218+
`650 $aPhysics $xHistory $y20th century`. If you prefer the "LCSH-way" of
219+
`Physics--History--20th century`, just set `Subject::glue = '--'`.
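
The effect of a configurable glue can be sketched with a small self-contained class. `DemoSubject` is a hypothetical stand-in, and storing the glue in a public static property is an assumption made for this example, not a statement about how the package's `Subject` class is written.

```php
<?php
// Hypothetical class for illustration; not the package's Subject class.
class DemoSubject
{
    // Assumption: a static property, so one change affects all instances.
    public static $glue = ' : ';

    private $parts;

    public function __construct(array $parts)
    {
        $this->parts = $parts;
    }

    public function __toString(): string
    {
        return implode(self::$glue, $this->parts);
    }
}

$subject = new DemoSubject(['Physics', 'History', '20th century']);
$colonForm = (string) $subject;  // Physics : History : 20th century

DemoSubject::$glue = '--';
$lcshForm = (string) $subject;   // Physics--History--20th century
```
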
184220

185221
## Notes
186222

223+
It's unfortunately easy to err when trying to present data from MARC records in
224+
end user applications. A developer learning by example might for instance assume
225+
that `300 $a` is a subfield for "number of pages".<sup id="a3">[3](#f3)</sup> A
226+
quick glance at e.g. [LC's MARC
227+
documentation](https://www.loc.gov/marc/bibliographic/bd300.html) would be
228+
enough to prove that wrong, but in other cases it's harder to avoid making false
229+
assumptions without deep familiarity with cataloguing rules and practices.
230+
187231
<b id="f1">1</b> That might change in the future. But even if I decide to remove parallel titles,
188232
I'm not really sure how to do it in a safe way. Parallel titles are identified by a leading `=`
189233
ISBD marker. If the marker is at the end of subfield `$a`, we can be certain it's an ISBD marker,
190234
but since the `$a` and `$c` subfields are not repeatable, multiple titles are just added to the
191235
`$c` subfield. So if we encounter an `=` sign in the middle of `$c` somewhere, how can we
192236
tell if it's an ISBD marker or just an equal sign part of the title (like in the fictive book
193237
`"$aEating the right way : The 2 + 2 = 5 diet"`)? Some kind of escaping would have made that clear,
194-
but the ISBD principles doesn't seem to call for that, leaving us completely in the dark!
238+
but the ISBD principles don't seem to call for that, leaving us completely in the dark.
195239
*That* is seriously annoying :weary: [](#a1)
196240

197241
<b id="f2">2</b> [According to](http://www.loc.gov/marc/bibliographic/bd245.html)
198242
ISBD principles "field 245 ends with a period, even when another mark of punctuation is present,
199243
unless the last word in the field is an abbreviation, initial/letter, or data that ends with final
200244
punctuation." Determining if something is "an abbreviation, initial/letter, or data that ends with
201-
final punctuation" is certainly not trivial, I would guess that machine learning would be needed
202-
for a highly successful implementation [](#a2)
245+
final punctuation" is certainly not an easy task for anything but humans and AI. [](#a2)
203246

204247
<b id="f3">3</b> Our old OPAC used to output something like
205248
"Number of pages: One video disc (DVD)…" for DVDs – the developers had apparently just assumed that the
