
Commit 175929b

elespike and tomaarsen authored
Fix a bug that left out the last section/heading. (nltk#3098)
* Fix a bug that left out the last section/heading.
* Add doctests for CategorizedMarkdownCorpusReader
* Add CI & test dependencies

  Required for Markdown corpus tests

Co-authored-by: Tom Aarsen <[email protected]>
1 parent 63a63b1 commit 175929b


4 files changed: 155 additions, 5 deletions


nltk/corpus/reader/markdown.py

Lines changed: 7 additions & 5 deletions
@@ -319,15 +319,17 @@ def lists(self, fileids=None, categories=None):
 
     def section_reader(self, stream):
         section_blocks, block = list(), list()
-        in_heading = False
         for t in self.parser.parse(stream.read()):
             if t.level == 0 and t.type == "heading_open":
-                if block:
+                if not block:
+                    block.append(t)
+                else:
                     section_blocks.append(block)
-                    block = list()
-                in_heading = True
-            if in_heading:
+                    block = [t]
+            elif block:
                 block.append(t)
+        if block:
+            section_blocks.append(block)
         return [
             MarkdownSection(
                 block[1].content,
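
The effect of this hunk can be seen in isolation with a minimal sketch. It assumes a hypothetical `Token` namedtuple that models only the `type` and `level` attributes the loop inspects; it is not the markdown-it-py token class, and the two functions are illustrative copies of the loop before and after the change, not the NLTK method itself.

```python
from collections import namedtuple

# Hypothetical stand-in for parser tokens: only the two attributes
# that the grouping loop looks at are modelled here.
Token = namedtuple("Token", ["type", "level"])


def group_sections_old(tokens):
    """Loop as it was before the commit: the last open block is never flushed."""
    section_blocks, block = [], []
    in_heading = False
    for t in tokens:
        if t.level == 0 and t.type == "heading_open":
            if block:
                section_blocks.append(block)
                block = []
            in_heading = True
        if in_heading:
            block.append(t)
    return section_blocks


def group_sections_new(tokens):
    """Loop as it is after the commit: the trailing block is appended at the end."""
    section_blocks, block = [], []
    for t in tokens:
        if t.level == 0 and t.type == "heading_open":
            if not block:
                block.append(t)
            else:
                section_blocks.append(block)
                block = [t]
        elif block:
            # Tokens before the first heading are skipped because block is empty.
            block.append(t)
    if block:
        section_blocks.append(block)
    return section_blocks


tokens = [
    Token("paragraph_open", 0),  # preamble before any heading
    Token("heading_open", 0),
    Token("inline", 1),
    Token("heading_close", 0),
    Token("heading_open", 0),
    Token("inline", 1),
    Token("heading_close", 0),
    Token("paragraph_open", 0),
]

old_sections = group_sections_old(tokens)
new_sections = group_sections_new(tokens)
print(len(old_sections))  # 1: the second heading's block is dropped
print(len(new_sections))  # 2: both headings produce a section
```

In both versions, content that precedes the first heading is not assigned to any section; the commit only changes what happens to the block that is still open when the token stream ends.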

nltk/test/corpus.doctest

Lines changed: 140 additions & 0 deletions
@@ -1746,6 +1746,146 @@ The Brown Corpus uses the tagged corpus reader:
     >>> brown.tagged_paras()
     [[[('The', 'AT'), ...]], [[('The', 'AT'), ...]], ...]
 
+Categorized Markdown Corpus Reader
+==================================
+
+This corpus reader class provides additional methods to select features
+present in markdown documents.
+
+First, let's make a test corpus:
+
+    >>> root = make_testcorpus(ext='.md',
+    ...     a="""\
+    ...     # Section One
+    ...     Here's the first sentence of section one. Then the second sentence.
+    ...
+    ...     First section, second paragraph. Let's add a [link](https://example.com).
+    ...
+    ...     # Section Two
+    ...     This section is more fun. It contains an ![image](https://example.com/image.png) followed by a list:
+    ...
+    ...     1. First list item
+    ...     2. Second list item
+    ...     """,
+    ...     b="""\
+    ...     This is the second file. It starts without a section, but then adds one.
+    ...
+    ...     # Section 1
+    ...     This section has a sub-section!
+    ...
+    ...     ## Section 1a
+    ...     And here's a quote:
+    ...
+    ...     > Carpe diem
+    ...
+    ...     HTML tags <em>are</em> removed.
+    ...     """)
+
+Now, import the ``CategorizedMarkdownCorpusReader`` class.
+
+    >>> from nltk.corpus.reader.markdown import CategorizedMarkdownCorpusReader
+
+Note that this class requires the following Python packages:
+
+- ``markdown-it-py``
+- ``mdit-py-plugins``
+- ``mdit-plain``
+
+The corpus provides usual methods like ``words()``, ``sents()``,
+``paras()``, etc. Each of these methods accepts a list of file IDs
+which can be a Python list or a comma-separated string.
+
+    >>> corpus = CategorizedMarkdownCorpusReader(root, ['a.md', 'b.md'])
+    >>> corpus.fileids()
+    ['a.md', 'b.md']
+    >>> corpus.words()
+    ['Section', 'One', 'Here', "'", 's', 'the', 'first', ...]
+    >>> corpus.words('b.md')
+    ['This', 'is', 'the', 'second', 'file', '.', 'It', ...]
+    >>> corpus.words('a.md, b.md') == corpus.words(['a.md', 'b.md'])
+    True
+
+Here are some methods specific to the
+``CategorizedMarkdownCorpusReader`` class to retrieve markdown features:
+
+    >>> corpus.links()
+    [Link(label='link', href='https://example.com', title=None)]
+    >>> corpus.images()
+    [Image(label='image', src='https://example.com/image.png', title=None)]
+    >>> corpus.lists()
+    [List(is_ordered=True, items=['First list item', 'Second list item'])]
+    >>> corpus.blockquotes()
+    [MarkdownBlock(content='Carpe diem')]
+
+The corpus can also be broken down into sections based on markdown headings:
+
+    >>> corpus.sections('a.md')
+    [MarkdownSection(content='Section One\n\nHer...'), MarkdownSection(content='Section Two\n\nThi...')]
+    >>> for s in corpus.sections():
+    ...     print(F"{s.heading} (level {s.level})")
+    ...
+    Section One (level 1)
+    Section Two (level 1)
+    Section 1 (level 1)
+    Section 1a (level 2)
+
+Categories
+----------
+
+The ``CategorizedMarkdownCorpusReader`` relies on YAML front matter to
+read metadata defined in markdown documents. This metadata is optional,
+and may define one or more categories for each document.
+
+Let's create another test corpus, this time with some metadata:
+
+    >>> del_testcorpus(root)
+    >>> root = make_testcorpus(ext='.md',
+    ...     a="""\
+    ...     ---
+    ...     tags:
+    ...     - tag1
+    ...     - tag2
+    ...     ---
+    ...     Document A: category metadata.
+    ...     """,
+    ...     b="""\
+    ...     ---
+    ...     author: NLTK
+    ...     tags:
+    ...     - tag2
+    ...     - tag3
+    ...     ---
+    ...     Document B: additional metadata.
+    ...     """,
+    ...     c="""\
+    ...     Document C: no metadata.
+    ...     """)
+
+Load the new corpus and see the ``metadata()`` and ``categories()``
+methods in action:
+
+    >>> fileids = ['a.md', 'b.md', 'c.md']
+    >>> corpus = CategorizedMarkdownCorpusReader(root, fileids)
+    >>> corpus.metadata()
+    [{'tags': ['tag1', 'tag2']}, {'author': 'NLTK', 'tags': ['tag2', 'tag3']}]
+    >>> for fid in fileids:
+    ...     print(fid, corpus.metadata(fid))
+    ...
+    a.md [{'tags': ['tag1', 'tag2']}]
+    b.md [{'author': 'NLTK', 'tags': ['tag2', 'tag3']}]
+    c.md []
+    >>> corpus.categories()
+    ['tag1', 'tag2', 'tag3']
+    >>> corpus.categories('a.md')
+    ['tag1', 'tag2']
+
+The ``fileids()`` method also accepts categories and returns all file
+IDs that match any of the specified categories:
+
+    >>> corpus.fileids('tag2')
+    ['a.md', 'b.md']
+    >>> del_testcorpus(root)
+
 Verbnet Corpus Reader
 =====================
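
The "match any of the specified categories" behaviour that the new doctest describes for `fileids()` can be illustrated without NLTK. The sketch below is an assumption-laden illustration, not the reader's implementation: `fileids_for` is a hypothetical helper, and the `metadata` dict simply restates the front matter of the three test documents, with `tags` treated as the category field as in the doctest.

```python
# Front matter of the three doctest documents, restated as plain dicts.
# c.md has no front matter, so its metadata is empty.
metadata = {
    "a.md": {"tags": ["tag1", "tag2"]},
    "b.md": {"author": "NLTK", "tags": ["tag2", "tag3"]},
    "c.md": {},
}


def fileids_for(categories, metadata):
    """Hypothetical helper: file IDs whose tags overlap the requested categories."""
    wanted = set(categories)
    return sorted(
        fid
        for fid, meta in metadata.items()
        if wanted & set(meta.get("tags", []))
    )


print(fileids_for(["tag2"], metadata))          # ['a.md', 'b.md']
print(fileids_for(["tag1", "tag3"], metadata))  # ['a.md', 'b.md']
print(fileids_for(["tag4"], metadata))          # []
```

A document without front matter, such as `c.md`, never matches a category query under this reading.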

requirements-ci.txt

Lines changed: 4 additions & 0 deletions
@@ -1,9 +1,13 @@
 click
 gensim>=4.0.0
+markdown-it-py
 matplotlib
+mdit-plain
+mdit-py-plugins
 pytest
 pytest-mock
 pytest-xdist[psutil]
+pyyaml
 regex
 scikit-learn
 tqdm

tox.ini

Lines changed: 4 additions & 0 deletions
@@ -30,6 +30,10 @@ deps =
     joblib
     tqdm
     matplotlib
+    markdown-it-py
+    mdit-py-plugins
+    mdit-plain
+    pyyaml
 
 changedir = nltk/test
 commands =
