Commit b29d4f8

Update to 0.1.4-alpha001
- Remove reference to net5
- Update supported dotnet versions in README
- Make ObjectExtractor static and add PdfPigExtensionsTests
- Update PdfPig NuGet package to 0.1.9-alpha-20231019-c6e2d
- Seal classes, make clipper internal
2 parents fe6e6e5 + 4eef723 commit b29d4f8

40 files changed: 5160 additions & 4654 deletions

README.md

Lines changed: 1 addition & 5 deletions
@@ -5,7 +5,7 @@
 ![Linux](https://github.com/BobLd/tabula-sharp/workflows/Linux/badge.svg)
 ![Mac OS](https://github.com/BobLd/tabula-sharp/workflows/Mac%20OS/badge.svg)
 
-- Supports .NET 5, .NET Core 3.1, .NET Standard 2.0, .NET Framework 4.5, 4.51, 4.52, 4.6, 4.61, 4.62, 4.7
+- Supports .NET 6, .NET Core 3.1, .NET Standard 2.0, .NET Framework 4.52, 4.6, 4.61, 4.62, 4.7
 - No java bindings
 
 NuGet packages available on the [releases](https://github.com/BobLd/tabula-sharp/releases) page and on www.nuget.org:
@@ -56,7 +56,3 @@ using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() {
 ![example](images/stream-us-018.png)
 ## Lattice mode - SpreadsheetExtractionAlgorithm
 ![example](images/lattice-eu-004.png)
-
-# HELP WANTED
-- The original java implementation uses STR trees in [`RectangleSpatialIndex`](https://github.com/tabulapdf/tabula-java/blob/master/src/main/java/technology/tabula/RectangleSpatialIndex.java). This is not the case here so it might be a bit slower. Any help implementing a similar approach is welcome.
-

Tabula.Csv/Tabula.Csv.csproj

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
 <Project Sdk="Microsoft.NET.Sdk">
 
   <PropertyGroup>
-    <TargetFrameworks>netcoreapp3.1;netstandard2.0;net452;net46;net461;net462;net47;net5.0;net6.0</TargetFrameworks>
+    <TargetFrameworks>netcoreapp3.1;netstandard2.0;net452;net46;net461;net462;net47;net6.0</TargetFrameworks>
     <Description>Extract tables from PDF files (port of tabula-java using PdfPig). Csv and Tsv writers.</Description>
     <PackageProjectUrl>https://github.com/BobLd/tabula-sharp</PackageProjectUrl>
-    <Version>0.1.3</Version>
+    <Version>0.1.4-alpha001</Version>
     <Authors>BobLd</Authors>
     <PackageTags>pdf, extract, table, tabula, pdfpig, parse, extraction, csv, tsv, excel, export</PackageTags>
     <PackageLicenseExpression>MIT</PackageLicenseExpression>
@@ -22,7 +22,7 @@
   </ItemGroup>
 
   <ItemGroup>
-    <PackageReference Include="CsvHelper" Version="27.2.1" />
+    <PackageReference Include="CsvHelper" Version="30.0.1" />
   </ItemGroup>
 
   <ItemGroup>

Tabula.Json/Tabula.Json.csproj

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
 <Project Sdk="Microsoft.NET.Sdk">
 
   <PropertyGroup>
-    <TargetFrameworks>netcoreapp3.1;netstandard2.0;net452;net46;net461;net462;net47;net5.0;net6.0</TargetFrameworks>
+    <TargetFrameworks>netcoreapp3.1;netstandard2.0;net452;net46;net461;net462;net47;net6.0</TargetFrameworks>
     <Description>Extract tables from PDF files (port of tabula-java using PdfPig). Json writer.</Description>
     <PackageProjectUrl>https://github.com/BobLd/tabula-sharp</PackageProjectUrl>
-    <Version>0.1.3</Version>
+    <Version>0.1.4-alpha001</Version>
     <Company>BobLd</Company>
     <Authors>BobLd</Authors>
     <PackageTags>pdf, extract, table, tabula, pdfpig, parse, extraction, json, export</PackageTags>
@@ -22,7 +22,7 @@
   </ItemGroup>
 
   <ItemGroup>
-    <PackageReference Include="Newtonsoft.Json" Version="13.0.1" />
+    <PackageReference Include="Newtonsoft.Json" Version="13.0.3" />
   </ItemGroup>
 
   <ItemGroup>

Tabula.Tests/PdfPigExtensionsTests.cs

Lines changed: 462 additions & 0 deletions
Large diffs are not rendered by default.

Tabula.Tests/Tabula.Tests.csproj

Lines changed: 12 additions & 6 deletions
@@ -9,12 +9,18 @@
   </PropertyGroup>
 
   <ItemGroup>
-    <PackageReference Include="CsvHelper" Version="27.2.1" />
-    <PackageReference Include="Microsoft.NET.Test.Sdk" Version="16.5.0" />
-    <PackageReference Include="Newtonsoft.Json" Version="13.0.1" />
-    <PackageReference Include="xunit" Version="2.4.0" />
-    <PackageReference Include="xunit.runner.visualstudio" Version="2.4.0" />
-    <PackageReference Include="coverlet.collector" Version="1.2.0" />
+    <PackageReference Include="CsvHelper" Version="30.0.1" />
+    <PackageReference Include="Microsoft.NET.Test.Sdk" Version="17.6.2" />
+    <PackageReference Include="Newtonsoft.Json" Version="13.0.3" />
+    <PackageReference Include="xunit" Version="2.4.2" />
+    <PackageReference Include="xunit.runner.visualstudio" Version="2.4.5">
+      <PrivateAssets>all</PrivateAssets>
+      <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
+    </PackageReference>
+    <PackageReference Include="coverlet.collector" Version="6.0.0">
+      <PrivateAssets>all</PrivateAssets>
+      <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
+    </PackageReference>
   </ItemGroup>
 
   <ItemGroup>

Tabula.Tests/TestObjectExtractor.cs

Lines changed: 9 additions & 20 deletions
@@ -23,8 +23,7 @@ public void TestCanReadPDFWithOwnerEncryption()
         {
             using (PdfDocument pdf_document = PdfDocument.Open("Resources/S2MNCEbirdisland.pdf"))
             {
-                ObjectExtractor oe = new ObjectExtractor(pdf_document);
-                PageIterator pi = oe.Extract();
+                PageIterator pi = ObjectExtractor.Extract(pdf_document);
                 int i = 0;
                 while (pi.MoveNext())
                 {
@@ -39,9 +38,8 @@ public void TestGoodPassword()
         {
             using (PdfDocument pdf_document = PdfDocument.Open("Resources/encrypted.pdf", new ParsingOptions() { Password = "userpassword" }))
             {
-                ObjectExtractor oe = new ObjectExtractor(pdf_document);
                 List<PageArea> pages = new List<PageArea>();
-                PageIterator pi = oe.Extract();
+                PageIterator pi = ObjectExtractor.Extract(pdf_document);
                 while (pi.MoveNext())
                 {
                     pages.Add(pi.Current);
@@ -55,8 +53,7 @@ public void TestTextExtractionDoesNotRaise()
         {
             using (PdfDocument pdf_document = PdfDocument.Open("Resources/rotated_page.pdf", new ParsingOptions() { ClipPaths = true }))
             {
-                ObjectExtractor oe = new ObjectExtractor(pdf_document);
-                PageIterator pi = oe.Extract();
+                PageIterator pi = ObjectExtractor.Extract(pdf_document);
 
                 Assert.True(pi.MoveNext());
                 Assert.NotNull(pi.Current);
@@ -69,8 +66,7 @@ public void TestShouldDetectRulings()
         {
             using (PdfDocument pdf_document = PdfDocument.Open("Resources/should_detect_rulings.pdf", new ParsingOptions() { ClipPaths = true }))
             {
-                ObjectExtractor oe = new ObjectExtractor(pdf_document);
-                PageIterator pi = oe.Extract();
+                PageIterator pi = ObjectExtractor.Extract(pdf_document);
 
                 PageArea page = pi.Next();
                 IReadOnlyList<Ruling> rulings = page.GetRulings();
@@ -87,8 +83,7 @@ public void TestDontThrowNPEInShfill()
         {
             using (PdfDocument pdf_document = PdfDocument.Open("Resources/labor.pdf", new ParsingOptions() { ClipPaths = true }))
             {
-                ObjectExtractor oe = new ObjectExtractor(pdf_document);
-                PageIterator pi = oe.Extract();
+                PageIterator pi = ObjectExtractor.Extract(pdf_document);
                 Assert.True(pi.MoveNext());
 
                 PageArea p = pi.Current;
@@ -103,8 +98,7 @@ public void TestExtractOnePage()
             {
                 Assert.Equal(2, pdf_document.NumberOfPages);
 
-                ObjectExtractor oe = new ObjectExtractor(pdf_document);
-                PageArea page = oe.Extract(2);
+                PageArea page = ObjectExtractor.Extract(pdf_document, 2);
 
                 Assert.NotNull(page);
             }
@@ -117,8 +111,7 @@ public void TestExtractWrongPageNumber()// throws IOException
             {
                 Assert.Equal(2, pdf_document.NumberOfPages);
 
-                ObjectExtractor oe = new ObjectExtractor(pdf_document);
-                Assert.Throws<IndexOutOfRangeException>(() => oe.Extract(3));
+                Assert.Throws<IndexOutOfRangeException>(() => ObjectExtractor.Extract(pdf_document, 3));
             }
         }
 
@@ -127,9 +120,7 @@ public void TestTextElementsContainedInPage()
         {
             using (PdfDocument pdf_document = PdfDocument.Open("Resources/cs-en-us-pbms.pdf", new ParsingOptions() { ClipPaths = true }))
             {
-                ObjectExtractor oe = new ObjectExtractor(pdf_document);
-
-                PageArea page = oe.ExtractPage(1);
+                PageArea page = ObjectExtractor.ExtractPage(pdf_document, 1);
 
                 foreach (TextElement te in page.GetText())
                 {
@@ -143,9 +134,7 @@ public void TestDoNotNPEInPointComparator()
         {
             using (PdfDocument pdf_document = PdfDocument.Open("Resources/npe_issue_206.pdf", new ParsingOptions() { ClipPaths = true }))
             {
-                ObjectExtractor oe = new ObjectExtractor(pdf_document);
-
-                PageArea p = oe.ExtractPage(1);
+                PageArea p = ObjectExtractor.ExtractPage(pdf_document, 1);
                 Assert.NotNull(p);
             }
         }

Tabula.Tests/TestSpreadsheetExtractor.cs

Lines changed: 14 additions & 14 deletions
@@ -212,7 +212,7 @@ public void TestSpanningCells()
             PageArea page = UtilsForTesting.GetPage("Resources/spanning_cells.pdf", 1);
             string expectedJson = UtilsForTesting.LoadJson("Resources/json/spanning_cells.json");
             SpreadsheetExtractionAlgorithm se = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = se.Extract(page);
+            IReadOnlyList<Table> tables = se.Extract(page);
             Assert.Equal(2, tables.Count);
 
             var expectedJObject = (JArray)JsonConvert.DeserializeObject(expectedJson);
@@ -268,7 +268,7 @@ public void TestSpanningCellsToCsv()
             PageArea page = UtilsForTesting.GetPage("Resources/spanning_cells.pdf", 1);
             string expectedCsv = UtilsForTesting.LoadCsv("Resources/csv/spanning_cells.csv");
             SpreadsheetExtractionAlgorithm se = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = se.Extract(page);
+            IReadOnlyList<Table> tables = se.Extract(page);
             Assert.Equal(2, tables.Count);
 
             StringBuilder sb = new StringBuilder();
@@ -281,7 +281,7 @@ public void TestIncompleteGrid()
         {
             PageArea page = UtilsForTesting.GetPage("Resources/china.pdf", 1);
             SpreadsheetExtractionAlgorithm se = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = se.Extract(page);
+            IReadOnlyList<Table> tables = se.Extract(page);
             Assert.Equal(2, tables.Count);
         }
 
@@ -290,7 +290,7 @@ public void TestNaturalOrderOfRectanglesDoesNotBreakContract()
         {
             PageArea page = UtilsForTesting.GetPage("Resources/us-017.pdf", 2);
             SpreadsheetExtractionAlgorithm se = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = se.Extract(page);
+            IReadOnlyList<Table> tables = se.Extract(page);
 
             string expected = "Project,Agency,Institution\r\nNanotechnology and its publics,NSF,Pennsylvania State University\r\n\"Public information and deliberation in nanoscience and\rnanotechnology policy (SGER)\",Interagency,\"North Carolina State\rUniversity\"\r\n\"Social and ethical research and education in agrifood\rnanotechnology (NIRT)\",NSF,Michigan State University\r\n\"From laboratory to society: developing an informed\rapproach to nanoscale science and engineering (NIRT)\",NSF,University of South Carolina\r\nDatabase and innovation timeline for nanotechnology,NSF,UCLA\r\nSocial and ethical dimensions of nanotechnology,NSF,University of Virginia\r\n\"Undergraduate exploration of nanoscience,\rapplications and societal implications (NUE)\",NSF,\"Michigan Technological\rUniversity\"\r\n\"Ethics and belief inside the development of\rnanotechnology (CAREER)\",NSF,University of Virginia\r\n\"All centers, NNIN and NCN have a societal\rimplications components\",\"NSF, DOE,\rDOD, and NIH\",\"All nanotechnology centers\rand networks\""; // \r\n
 
@@ -325,7 +325,7 @@ public void TestSpreadsheetWithNoBoundingFrameShouldBeSpreadsheet()
             SpreadsheetExtractionAlgorithm se = new SpreadsheetExtractionAlgorithm();
             bool isTabular = se.IsTabular(page);
             Assert.True(isTabular);
-            List<Table> tables = se.Extract(page);
+            IReadOnlyList<Table> tables = se.Extract(page);
 
             StringBuilder sb = new StringBuilder();
             (new CSVWriter()).Write(sb, tables[0]);
@@ -337,7 +337,7 @@ public void TestExtractSpreadsheetWithinAnArea()
         {
             PageArea page = UtilsForTesting.GetAreaFromPage("Resources/puertos1.pdf", 1, new PdfRectangle(30.32142857142857, 793 - 554.8821428571429, 546.7964285714286, 793 - 273.9035714285714)); // 273.9035714285714f, 30.32142857142857f, 554.8821428571429f, 546.7964285714286f);
             SpreadsheetExtractionAlgorithm se = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = se.Extract(page);
+            IReadOnlyList<Table> tables = se.Extract(page);
             Table table = tables[0];
             Assert.Equal(15, table.Rows.Count);
 
@@ -417,7 +417,7 @@ public void TestShouldDetectASingleSpreadsheet()
         {
             PageArea page = UtilsForTesting.GetAreaFromPage("Resources/offense.pdf", 1, new PdfRectangle(16.44, 792 - 680.85, 597.84, 792 - 16.44)); // 68.08f, 16.44f, 680.85f, 597.84f);
             SpreadsheetExtractionAlgorithm bea = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = bea.Extract(page);
+            IReadOnlyList<Table> tables = bea.Extract(page);
             Assert.Single(tables);
         }
 
@@ -426,7 +426,7 @@ public void TestExtractTableWithExternallyDefinedRulings()
         {
             PageArea page = UtilsForTesting.GetPage("Resources/us-007.pdf", 1);
             SpreadsheetExtractionAlgorithm bea = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = bea.Extract(page, EXTERNALLY_DEFINED_RULINGS.ToList());
+            IReadOnlyList<Table> tables = bea.Extract(page, EXTERNALLY_DEFINED_RULINGS.ToList());
             Assert.Single(tables);
             Table table = tables[0];
             Assert.Equal(18, table.Cells.Count);
@@ -458,7 +458,7 @@ public void TestAnotherExtractTableWithExternallyDefinedRulings()
         {
             PageArea page = UtilsForTesting.GetPage("Resources/us-024.pdf", 1);
             SpreadsheetExtractionAlgorithm bea = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = bea.Extract(page, EXTERNALLY_DEFINED_RULINGS2.ToList());
+            IReadOnlyList<Table> tables = bea.Extract(page, EXTERNALLY_DEFINED_RULINGS2.ToList());
             Assert.Single(tables);
             Table table = tables[0];
 
@@ -472,7 +472,7 @@ public void TestSpreadsheetsSortedByTopAndRight()
             PageArea page = UtilsForTesting.GetPage("Resources/sydney_disclosure_contract.pdf", 1);
 
             SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = sea.Extract(page);
+            IReadOnlyList<Table> tables = sea.Extract(page);
             for (int i = 1; i < tables.Count; i++)
             {
                 Assert.True(tables[i - 1].Top >= tables[i].Top); // Assert.True(tables[i - 1].getTop() <= tables[i].getTop());
@@ -485,7 +485,7 @@ public void TestDontStackOverflowQuicksort()
             PageArea page = UtilsForTesting.GetPage("Resources/failing_sort.pdf", 1);
 
             SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = sea.Extract(page);
+            IReadOnlyList<Table> tables = sea.Extract(page);
             for (int i = 1; i < tables.Count; i++)
             {
                 Assert.True(tables[i - 1].Top >= tables[i].Top); //Assert.True(tables[i - 1].getTop() <= tables[i].getTop());
@@ -497,7 +497,7 @@ public void TestRTL()
         {
             PageArea page = UtilsForTesting.GetPage("Resources/arabic.pdf", 1);
             SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = sea.Extract(page);
+            IReadOnlyList<Table> tables = sea.Extract(page);
             // Assert.Equal(1, tables.size());
             Table table = tables[0];
 
@@ -528,7 +528,7 @@ public void TestRealLifeRTL()
         {
             PageArea page = UtilsForTesting.GetPage("Resources/mednine.pdf", 1);
             SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = sea.Extract(page);
+            IReadOnlyList<Table> tables = sea.Extract(page);
             Assert.Single(tables);
             Table table = tables[0];
             var rows = table.Rows;
@@ -580,7 +580,7 @@ public void TestSpreadsheetExtractionIssue656()
             string expectedCsv = UtilsForTesting.LoadCsv("Resources/csv/Publication_of_award_of_Bids_for_Transport_Sector__August_2016.csv");
 
             SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
-            List<Table> tables = sea.Extract(page);
+            IReadOnlyList<Table> tables = sea.Extract(page);
             Assert.Single(tables);
             Table table = tables[0];
 

Tabula.Tests/TestTableDetection.cs

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ public class TestTableDetection
 
         //private static Level defaultLogLevel;
 
-        private class TestStatus
+        private sealed class TestStatus
         {
             public int numExpectedTables;
             public int numCorrectlyDetectedTables;

Tabula.Tests/TestWriters.cs

Lines changed: 3 additions & 3 deletions
@@ -20,7 +20,7 @@ private Table GetTable()
             return bea.Extract(page)[0];
         }
 
-        private List<Table> GetTables()
+        private IReadOnlyList<Table> GetTables()
         {
             PageArea page = UtilsForTesting.GetPage("Resources/twotables.pdf", 1);
             SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
@@ -144,7 +144,7 @@ public void TestCSVSerializeInfinity()
         public void TestJSONSerializeTwoTables()
         {
             string expectedJson = UtilsForTesting.LoadJson("Resources/json/twotables.json");
-            List<Table> tables = this.GetTables();
+            IReadOnlyList<Table> tables = this.GetTables();
 
             StringBuilder sb = new StringBuilder();
             (new JSONWriter()).Write(sb, tables);
@@ -178,7 +178,7 @@ public void TestJSONSerializeTwoTables()
         public void TestCSVSerializeTwoTables()
         {
             string expectedCsv = UtilsForTesting.LoadCsv("Resources/csv/twotables.csv");
-            List<Table> tables = this.GetTables();
+            IReadOnlyList<Table> tables = this.GetTables();
 
             /*
             StringBuilder sb = new StringBuilder();

Tabula.Tests/TestsIcdar2013.cs

Lines changed: 1 addition & 2 deletions
@@ -15,8 +15,7 @@ public void Eu004()
         {
             using (PdfDocument document = PdfDocument.Open("Resources/icdar2013-dataset/competition-dataset-eu/eu-004.pdf", new ParsingOptions() { ClipPaths = true }))
             {
-                ObjectExtractor oe = new ObjectExtractor(document);
-                PageArea page = oe.Extract(3);
+                PageArea page = ObjectExtractor.Extract(document, 3);
 
                 var detector = new SimpleNurminenDetectionAlgorithm();
                 var regions = detector.Detect(page);
