@@ -1746,6 +1746,146 @@ The Brown Corpus uses the tagged corpus reader:
17461746 >>> brown.tagged_paras()
17471747 [[[('The', 'AT'), ...]], [[('The', 'AT'), ...]], ...]
17481748
1749+ Categorized Markdown Corpus Reader
1750+ ==================================
1751+
1752+ This corpus reader class provides additional methods for extracting
1753+ markdown features such as links, images, lists, block quotes, and sections.
1754+
1755+ First, let's make a test corpus:
1756+
1757+ >>> root = make_testcorpus(ext='.md',
1758+ ... a="""\
1759+ ... # Section One
1760+ ... Here's the first sentence of section one. Then the second sentence.
1761+ ...
1762+ ... First section, second paragraph. Let's add a [link](https://example.com).
1763+ ...
1764+ ... # Section Two
1765+ ... This section is more fun. It contains an ![image](https://example.com/image.png) followed by a list:
1766+ ...
1767+ ... 1. First list item
1768+ ... 2. Second list item
1769+ ... """,
1770+ ... b="""\
1771+ ... This is the second file. It starts without a section, but then adds one.
1772+ ...
1773+ ... # Section 1
1774+ ... This section has a sub-section!
1775+ ...
1776+ ... ## Section 1a
1777+ ... And here's a quote:
1778+ ...
1779+ ... > Carpe diem
1780+ ...
1781+ ... HTML tags <em>are</em> removed.
1782+ ... """)
1783+
1784+ Now, import the ``CategorizedMarkdownCorpusReader`` class.
1785+
1786+ >>> from nltk.corpus.reader.markdown import CategorizedMarkdownCorpusReader
1787+
1788+ Note that this class requires the following Python packages:
1789+
1790+ - ``markdown-it-py``
1791+ - ``mdit-py-plugins``
1792+ - ``mdit-plain``
1793+
1794+ The corpus provides the usual methods, such as ``words()``, ``sents()``,
1795+ and ``paras()``. Each of these methods accepts file IDs either as a
1796+ Python list or as a comma-separated string.
1797+
1798+ >>> corpus = CategorizedMarkdownCorpusReader(root, ['a.md', 'b.md'])
1799+ >>> corpus.fileids()
1800+ ['a.md', 'b.md']
1801+ >>> corpus.words()
1802+ ['Section', 'One', 'Here', "'", 's', 'the', 'first', ...]
1803+ >>> corpus.words('b.md')
1804+ ['This', 'is', 'the', 'second', 'file', '.', 'It', ...]
1805+ >>> corpus.words('a.md, b.md') == corpus.words(['a.md', 'b.md'])
1806+ True
1807+
1808+ Here are some methods specific to the
1809+ ``CategorizedMarkdownCorpusReader`` class to retrieve markdown features:
1810+
1811+ >>> corpus.links()
1812+ [Link(label='link', href='https://example.com', title=None)]
1813+ >>> corpus.images()
1814+ [Image(label='image', src='https://example.com/image.png', title=None)]
1815+ >>> corpus.lists()
1816+ [List(is_ordered=True, items=['First list item', 'Second list item'])]
1817+ >>> corpus.blockquotes()
1818+ [MarkdownBlock(content='Carpe diem')]
1819+
1820+ The corpus can also be broken down into sections based on markdown headings:
1821+
1822+ >>> corpus.sections('a.md')
1823+ [MarkdownSection(content='Section One\n\nHer...'), MarkdownSection(content='Section Two\n\nThi...')]
1824+ >>> for s in corpus.sections():
1825+ ... print(f"{s.heading} (level {s.level})")
1826+ ...
1827+ Section One (level 1)
1828+ Section Two (level 1)
1829+ Section 1 (level 1)
1830+ Section 1a (level 2)
1831+
1832+ Categories
1833+ ----------
1834+
1835+ The ``CategorizedMarkdownCorpusReader`` reads document metadata from YAML
1836+ front matter. This metadata is optional; in the examples below, the ``tags``
1837+ list supplies the categories for each document.
1838+
1839+ Let's create another test corpus, this time with some metadata:
1840+
1841+ >>> del_testcorpus(root)
1842+ >>> root = make_testcorpus(ext='.md',
1843+ ... a="""\
1844+ ... ---
1845+ ... tags:
1846+ ... - tag1
1847+ ... - tag2
1848+ ... ---
1849+ ... Document A: category metadata.
1850+ ... """,
1851+ ... b="""\
1852+ ... ---
1853+ ... author: NLTK
1854+ ... tags:
1855+ ... - tag2
1856+ ... - tag3
1857+ ... ---
1858+ ... Document B: additional metadata.
1859+ ... """,
1860+ ... c="""\
1861+ ... Document C: no metadata.
1862+ ... """)
1863+
1864+ Load the new corpus and see the ``metadata()`` and ``categories()``
1865+ methods in action:
1866+
1867+ >>> fileids = ['a.md', 'b.md', 'c.md']
1868+ >>> corpus = CategorizedMarkdownCorpusReader(root, fileids)
1869+ >>> corpus.metadata()
1870+ [{'tags': ['tag1', 'tag2']}, {'author': 'NLTK', 'tags': ['tag2', 'tag3']}]
1871+ >>> for fid in fileids:
1872+ ... print(fid, corpus.metadata(fid))
1873+ ...
1874+ a.md [{'tags': ['tag1', 'tag2']}]
1875+ b.md [{'author': 'NLTK', 'tags': ['tag2', 'tag3']}]
1876+ c.md []
1877+ >>> corpus.categories()
1878+ ['tag1', 'tag2', 'tag3']
1879+ >>> corpus.categories('a.md')
1880+ ['tag1', 'tag2']
1881+
1882+ The ``fileids()`` method also accepts categories and returns all file
1883+ IDs that match any of the specified categories:
1884+
1885+ >>> corpus.fileids('tag2')
1886+ ['a.md', 'b.md']
1887+ >>> del_testcorpus(root)
1888+
17491889Verbnet Corpus Reader
17501890=====================
17511891
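
The new section states that the reader requires ``markdown-it-py``, ``mdit-py-plugins``, and ``mdit-plain``. A minimal sketch of checking for them up front is given here, so that a missing package is reported by name. The helper ``missing_modules`` is hypothetical, and the import names ``markdown_it``, ``mdit_py_plugins``, and ``mdit_plain`` are assumptions about what those distributions install; neither appears in the diff.

```python
from importlib.util import find_spec

# Assumed top-level import names for the three distributions listed in the
# doctest (markdown-it-py, mdit-py-plugins, mdit-plain); not taken from the diff.
ASSUMED_MODULES = ("markdown_it", "mdit_py_plugins", "mdit_plain")


def missing_modules(names=ASSUMED_MODULES):
    """Return the names in `names` that cannot be located on the import path."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All assumed modules are importable.")
```

A check along these lines could run before the ``CategorizedMarkdownCorpusReader`` import in a test setup, so that the markdown doctests can be skipped when the optional packages are absent.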