Commit c4cdf13

Added IOException throwing readFully and readBody (#2327)
For historical reasons (specifically that a response's body used to be fully buffered during `execute`), `Connection#body()` does not throw a checked IOException. This change adds a new `Connection#readBody()` which does throw a checked exception, and a partner `Connection#readFully()` method to do the internal buffering. Clarified relevant documentation.
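The relationship between the checked and unchecked methods described above can be sketched without jsoup on the classpath. `SketchResponse` and `ReadBodySketch` below are hypothetical stand-ins written for illustration, not jsoup classes; the sketch assumes only that the body arrives as an `InputStream` that is buffered once.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a response object; NOT jsoup's implementation.
final class SketchResponse {
    private final InputStream bodyStream;
    private byte[] byteData; // null until the body has been buffered

    SketchResponse(InputStream bodyStream) {
        this.bodyStream = bodyStream;
    }

    // Checked variant: buffers the stream into memory; a no-op once buffered.
    SketchResponse readFully() throws IOException {
        if (byteData == null) {
            try (InputStream in = bodyStream) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                }
                byteData = out.toByteArray();
            }
        }
        return this;
    }

    // Checked variant: buffer first, then decode.
    String readBody() throws IOException {
        readFully();
        return body();
    }

    // Legacy-style variant: same work, but an IOException surfaces unchecked.
    String body() {
        try {
            readFully();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(byteData, StandardCharsets.UTF_8);
    }
}

final class ReadBodySketch {
    public static void main(String[] args) throws IOException {
        SketchResponse ok = new SketchResponse(
            new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8)));
        System.out.println(ok.readBody()); // prints "hello"

        // A stream that always fails, simulating a read timeout.
        InputStream failing = new InputStream() {
            @Override public int read() throws IOException {
                throw new IOException("simulated timeout");
            }
        };
        try {
            new SketchResponse(failing).body();
        } catch (UncheckedIOException e) {
            System.out.println("unchecked: " + e.getCause().getMessage());
        }
    }
}
```

With the checked methods, the compiler forces the caller to handle the failure at the call site. With `body()`, the same failure only shows up at runtime.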
1 parent ec7fcd3 commit c4cdf13

8 files changed: 192 additions and 19 deletions.

CHANGES.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@
 * Custom tags (defined via the `TagSet`) in a foreign namespace (e.g. SVG) can be configured to parse as data tags.
 * Added `NodeVisitor#traverse(Node)` to simplify node traversal calls (vs. importing `NodeTraversor`).
 * The HTML parser now allows the specific text-data type (Data, RcData) to be customized for known tags. (Previously, that was only supported on custom tags.) [#2326](https://github.com/jhy/jsoup/issues/2326).
+* Added `Connection#readFully()` as a replacement for `Connection#bufferUp()` with an explicit IOException. Similarly, added `Connection#readBody()` over `Connection#body()`. Deprecated `Connection#bufferUp()`. [#2327](https://github.com/jhy/jsoup/pull/2327)
 
 ### Bug Fixes
 * The contents of a `script` in a `svg` foreign context should be parsed as script data, not text. [#2320](https://github.com/jhy/jsoup/issues/2320)

src/main/java/org/jsoup/Connection.java

Lines changed: 55 additions & 10 deletions
@@ -907,42 +907,87 @@ interface Response extends Base<Response> {
         @Nullable String contentType();
 
         /**
-         * Read and parse the body of the response as a Document. If you intend to parse the same response multiple
-         * times, you should {@link #bufferUp()} first.
-         * @return a parsed Document
-         * @throws IOException on error
+         Read and parse the body of the response as a Document. If you intend to parse the same response multiple times,
+         you should {@link #readFully()} first, which will buffer the body into memory.
+
+         @return a parsed Document
+         @throws IOException if an IO exception occurs whilst reading the body.
+         @see #readFully()
          */
         Document parse() throws IOException;
 
         /**
-         * Get the body of the response as a plain string.
-         * @return body
+         Reads the response body and returns it as a plain String.
+
+         @return body
+         @throws IOException if an IO exception occurs whilst reading the body.
+         @since 1.21.1
+         */
+        default String readBody() throws IOException {
+            throw new UnsupportedOperationException();
+        }
+
+        /**
+         Get the body of the response as a plain String.
+
+         <p>Will throw an UncheckedIOException if the body has not been buffered and an error occurs whilst reading the
+         body; use {@link #readFully()} first to buffer the body and catch any exceptions explicitly. Or more simply,
+         {@link #readBody()}.</p>
+
+         @return body
+         @throws UncheckedIOException if an IO exception occurs whilst reading the body.
+         @see #readBody()
+         @see #readFully()
          */
         String body();
 
         /**
-         * Get the body of the response as an array of bytes.
-         * @return body bytes
+         Get the body of the response as an array of bytes.
+
+         <p>Will throw an UncheckedIOException if the body has not been buffered and an error occurs whilst reading the
+         body; use {@link #readFully()} first to buffer the body and catch any exceptions explicitly.</p>
+
+         @return body bytes
+         @throws UncheckedIOException if an IO exception occurs whilst reading the body.
+         @see #readFully()
          */
         byte[] bodyAsBytes();
 
+        /**
+         Read the body of the response into a local buffer, so that {@link #parse()} may be called repeatedly on the same
+         connection response. Otherwise, once the response is read, its InputStream will have been drained and may not be
+         re-read.
+
+         <p>Subsequent calls to methods that consume the body, such as {@link #parse()}, {@link #body()},
+         {@link #bodyAsBytes()}, will not need to read the body again, and will not throw exceptions.</p>
+         <p>Calling {@link #readBody()} has the same effect.</p>
+
+         @return this response, for chaining
+         @throws IOException if an IO exception occurs during buffering.
+         @since 1.21.1
+         */
+        default Response readFully() throws IOException {
+            throw new UnsupportedOperationException();
+        }
+
         /**
          * Read the body of the response into a local buffer, so that {@link #parse()} may be called repeatedly on the
          * same connection response. Otherwise, once the response is read, its InputStream will have been drained and
          * may not be re-read.
          * <p>Calling {@link #body() } or {@link #bodyAsBytes()} has the same effect.</p>
          * @return this response, for chaining
          * @throws UncheckedIOException if an IO exception occurs during buffering.
+         * @deprecated use {@link #readFully()} instead (for the checked exception). Will be removed in a future version.
          */
         Response bufferUp();
 
         /**
          Get the body of the response as a (buffered) InputStream. You should close the input stream when you're done
          with it.
-         <p>Other body methods (like bufferUp, body, parse, etc) will generally not work in conjunction with this method,
+         <p>Other body methods (like readFully, body, parse, etc) will generally not work in conjunction with this method,
          as it consumes the InputStream.</p>
          <p>Any configured max size or maximum read timeout applied to the connection will not be applied to this stream,
-         unless {@link #bufferUp()} is called prior.</p>
+         unless {@link #readFully()} is called prior.</p>
          <p>This method is useful for writing large responses to disk, without buffering them completely into memory
          first.</p>
          @return the response body input stream
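In the hunk above, the new interface methods are declared as `default` methods that throw `UnsupportedOperationException`. A plausible reading, which the commit itself does not state, is that this keeps third-party implementations of the interface compiling. The sketch below shows that general Java pattern using hypothetical types (`SketchBody`, `LegacyBody`, `UpdatedBody`), none of which exist in jsoup.

```java
import java.io.IOException;

// Hypothetical interface standing in for an API that gains a method later.
interface SketchBody {
    String body(); // pre-existing abstract method

    // Added later as a default method, so older implementers still compile.
    default String readBody() throws IOException {
        throw new UnsupportedOperationException();
    }
}

// An implementer written before readBody() existed; it needs no changes.
final class LegacyBody implements SketchBody {
    @Override public String body() {
        return "legacy";
    }
}

// An implementer updated to support the new checked method.
final class UpdatedBody implements SketchBody {
    @Override public String body() {
        return "updated";
    }

    @Override public String readBody() throws IOException {
        return body();
    }
}
```

Calling `readBody()` on a `LegacyBody` fails at runtime instead of at compile time. That is the trade-off of this approach compared with adding a new abstract method.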

src/main/java/org/jsoup/helper/HttpConnection.java

Lines changed: 28 additions & 7 deletions
@@ -1006,24 +1006,45 @@ private ControllableInputStream prepareParse() {
             return streamer;
         }
 
-        private void prepareByteData() {
+        /**
+         Reads the bodyStream into byteData. A no-op if already executed.
+         */
+        @Override
+        public Connection.Response readFully() throws IOException {
             Validate.isTrue(executed, "Request must be executed (with .execute(), .get(), or .post() before getting response body");
             if (bodyStream != null && byteData == null) {
                 Validate.isFalse(inputStreamRead, "Request has already been read (with .parse())");
                 try {
                     byteData = DataUtil.readToByteBuffer(bodyStream, req.maxBodySize());
-                } catch (IOException e) {
-                    throw new UncheckedIOException(e);
                 } finally {
                     inputStreamRead = true;
                     safeClose();
                 }
             }
+            return this;
+        }
+
+        /**
+         Reads the body, but throws an UncheckedIOException if an IOException occurs.
+         @throws UncheckedIOException if an IOException occurs
+         */
+        private void readByteDataUnchecked() {
+            try {
+                readFully();
+            } catch (IOException e) {
+                throw new UncheckedIOException(e);
+            }
+        }
+
+        @Override
+        public String readBody() throws IOException {
+            readFully();
+            return body();
         }
 
         @Override
         public String body() {
-            prepareByteData();
+            readByteDataUnchecked();
             Validate.notNull(byteData);
             // charset gets set from header on execute, and from meta-equiv on parse. parse may not have happened yet
             String body = (charset == null ? UTF_8 : Charset.forName(charset))
@@ -1034,7 +1055,7 @@ public String body() {
 
         @Override
         public byte[] bodyAsBytes() {
-            prepareByteData();
+            readByteDataUnchecked();
             Validate.notNull(byteData);
             Validate.isTrue(byteData.hasArray()); // we made it, so it should
 
@@ -1053,15 +1074,15 @@ public byte[] bodyAsBytes() {
 
         @Override
         public Connection.Response bufferUp() {
-            prepareByteData();
+            readByteDataUnchecked();
             return this;
         }
 
         @Override
         public BufferedInputStream bodyStream() {
             Validate.isTrue(executed, "Request must be executed (with .execute(), .get(), or .post() before getting response body");
 
-            // if we have read to bytes (via buffer up), return those as a stream.
+            // if we have read to bytes (via readFully), return those as a stream.
             if (byteData != null) {
                 return new BufferedInputStream(
                     new ByteArrayInputStream(byteData.array(), 0, byteData.limit()),

src/main/java/org/jsoup/parser/CharacterReader.java

Lines changed: 6 additions & 0 deletions
@@ -16,6 +16,8 @@
 
 /**
  CharacterReader consumes tokens off a string. Used internally by jsoup. API subject to changes.
+ <p>If the underlying reader throws an IOException during any operation, the CharacterReader will throw an
+ {@link UncheckedIOException}. That won't happen with String / StringReader inputs.</p>
  */
 public final class CharacterReader implements AutoCloseable {
     static final char EOF = (char) -1;
@@ -81,6 +83,10 @@ private void bufferUp() {
         doBufferUp(); // structured so bufferUp may become an intrinsic candidate
     }
 
+    /**
+     Reads into the buffer. Will throw an UncheckedIOException if the underlying reader throws an IOException.
+     @throws UncheckedIOException if the underlying reader throws an IOException
+     */
     private void doBufferUp() {
         /*
         The flow:

src/main/java/org/jsoup/parser/Parser.java

Lines changed: 33 additions & 1 deletion
@@ -61,11 +61,26 @@ private Parser(Parser copy) {
         settings = new ParseSettings(copy.settings);
         trackPosition = copy.trackPosition;
     }
-
+
+    /**
+     Parse the contents of a String.
+
+     @param html HTML to parse
+     @param baseUri base URI of document (i.e. original fetch location), for resolving relative URLs.
+     @return parsed Document
+     */
     public Document parseInput(String html, String baseUri) {
         return parseInput(new StringReader(html), baseUri);
     }
 
+    /**
+     Parse the contents of a Reader.
+
+     @param inputHtml HTML to parse
+     @param baseUri base URI of document (i.e. original fetch location), for resolving relative URLs.
+     @return parsed Document
+     @throws java.io.UncheckedIOException if an I/O error occurs in the Reader
+     */
     public Document parseInput(Reader inputHtml, String baseUri) {
         try {
             lock.lock(); // using a lock vs synchronized to support loom threads
@@ -75,10 +90,27 @@ public Document parseInput(Reader inputHtml, String baseUri) {
         }
     }
 
+    /**
+     Parse a fragment of HTML into a list of nodes. The context element, if supplied, supplies parsing context.
+
+     @param fragment the fragment of HTML to parse
+     @param context (optional) the element that this HTML fragment is being parsed for (i.e. for inner HTML).
+     @param baseUri base URI of document (i.e. original fetch location), for resolving relative URLs.
+     @return list of nodes parsed from the input HTML.
+     */
     public List<Node> parseFragmentInput(String fragment, @Nullable Element context, String baseUri) {
         return parseFragmentInput(new StringReader(fragment), context, baseUri);
     }
 
+    /**
+     Parse a fragment of HTML into a list of nodes. The context element, if supplied, supplies parsing context.
+
+     @param fragment the fragment of HTML to parse
+     @param context (optional) the element that this HTML fragment is being parsed for (i.e. for inner HTML).
+     @param baseUri base URI of document (i.e. original fetch location), for resolving relative URLs.
+     @return list of nodes parsed from the input HTML.
+     @throws java.io.UncheckedIOException if an I/O error occurs in the Reader
+     */
     public List<Node> parseFragmentInput(Reader fragment, @Nullable Element context, String baseUri) {
         try {
             lock.lock();

src/main/java/org/jsoup/parser/TokeniserState.java

Lines changed: 1 addition & 1 deletion
@@ -1659,7 +1659,7 @@ else if (r.matches('>')) {
     abstract void read(Tokeniser t, CharacterReader r);
 
     static final char nullChar = '\u0000';
-    // char searches. must be sorted, used in inSorted. MUST update TokenisetStateTest if more arrays are added.
+    // char searches. must be sorted, used in inSorted. MUST update TokeniserStateTest if more arrays are added.
     static final char[] attributeNameCharsSorted = new char[]{'\t', '\n', '\f', '\r', ' ', '"', '\'', '/', '<', '=', '>', '?'};
     static final char[] attributeValueUnquoted = new char[]{nullChar, '\t', '\n', '\f', '\r', ' ', '"', '&', '\'', '<', '=', '>', '`'};
 

src/test/java/org/jsoup/integration/ConnectIT.java

Lines changed: 51 additions & 0 deletions
@@ -129,6 +129,42 @@ public void slowReadOk() throws IOException {
         assertEquals("outatime", h1.text());
     }
 
+    @Test void readFullyThrowsOnTimeout() throws IOException {
+        // tests that response.readFully excepts on timeout
+        boolean caught = false;
+        Connection.Response res = Jsoup.connect(SlowRider.Url).timeout(3000).execute();
+        try {
+            res.readFully();
+        } catch (IOException e) {
+            caught = true;
+        }
+        assertTrue(caught);
+    }
+
+    @Test void readBodyThrowsOnTimeout() throws IOException {
+        // tests that response.readBody excepts on timeout
+        boolean caught = false;
+        Connection.Response res = Jsoup.connect(SlowRider.Url).timeout(3000).execute();
+        try {
+            res.readBody();
+        } catch (IOException e) {
+            caught = true;
+        }
+        assertTrue(caught);
+    }
+
+    @Test void bodyThrowsUncheckedOnTimeout() throws IOException {
+        // tests that response.body unchecked excepts on timeout
+        boolean caught = false;
+        Connection.Response res = Jsoup.connect(SlowRider.Url).timeout(3000).execute();
+        try {
+            res.body();
+        } catch (UncheckedIOException e) {
+            caught = true;
+        }
+        assertTrue(caught);
+    }
+
     @Test
     public void infiniteReadSupported() throws IOException {
         Document doc = Jsoup.connect(SlowRider.Url)
@@ -249,6 +285,21 @@ public void noLimitAfterFirstRead() throws IOException {
         }
     }
 
+    @Test public void bodyStreamConstrainedViaReadFully() throws IOException {
+        int cap = 5 * 1024;
+        String url = FileServlet.urlTo("/htmltests/large.html"); // 280 K
+        try (BufferedInputStream stream = Jsoup
+            .connect(url)
+            .maxBodySize(cap)
+            .execute()
+            .readFully()
+            .bodyStream()) {
+
+            ByteBuffer cappedRead = DataUtil.readToByteBuffer(stream, 0);
+            assertEquals(cap, cappedRead.limit());
+        }
+    }
+
     @Test public void bodyStreamConstrainedViaBufferUp() throws IOException {
         int cap = 5 * 1024;
         String url = FileServlet.urlTo("/htmltests/large.html"); // 280 K

src/test/java/org/jsoup/integration/ConnectTest.java

Lines changed: 17 additions & 0 deletions
@@ -393,6 +393,17 @@ public void postFiles(String url) throws IOException {
         */
     }
 
+    @Test
+    public void multipleParsesOkAfterReadFully() throws IOException {
+        Connection.Response res = Jsoup.connect(echoUrl).execute().readFully();
+
+        Document doc = res.parse();
+        assertTrue(doc.title().contains("Environment"));
+
+        Document doc2 = res.parse();
+        assertTrue(doc2.title().contains("Environment"));
+    }
+
     @Test
     public void multipleParsesOkAfterBufferUp() throws IOException {
         Connection.Response res = Jsoup.connect(echoUrl).execute().bufferUp();
@@ -842,6 +853,12 @@ public void maxBodySizeInReadToByteBuffer() throws IOException {
         assertEquals(200 * 1024, mediumRes.body().length());
         assertEquals(actualDocText, largeRes.body().length());
         assertEquals(actualDocText, unlimitedRes.body().length());
+
+        assertEquals(actualDocText, defaultRes.readBody().length());
+        assertEquals(50 * 1024, smallRes.readBody().length());
+        assertEquals(200 * 1024, mediumRes.readBody().length());
+        assertEquals(actualDocText, largeRes.readBody().length());
+        assertEquals(actualDocText, unlimitedRes.readBody().length());
     }
 
     @Test void formLoginFlow() throws IOException {
