Cstrchunk #4

Merged
tomasnykodym merged 2 commits into h2oai:master from avati:cstrchunk on Jul 8, 2014

Conversation

@avati
Contributor

@avati avati commented Jul 7, 2014

String-based chunk support. The set contains two patches:

  1. Initial patch to create a string-typed chunk
  2. Additional patch from @bghill with bug fixes to the first patch and wider integration with the parser etc.

avati and others added 2 commits July 7, 2014 16:26
Conflicts:
	h2o-core/src/main/java/water/fvec/CStrChunk.java
	h2o-core/src/main/java/water/fvec/NewChunk.java
@bghill
Contributor

bghill commented Jul 8, 2014

The only catch with putting this in the main h2o-dev stream is lines 666-670 of ParsedData2.java. If the Enum maps max out, this code spills over to using strings instead of inserting NAs (the old behavior). Since string/enum unification between chunks isn't included in this patch, anything that tests that behavior will fail. If that isn't an issue, then merge away.

Commenting out the above-mentioned lines is handy for testing: all items are then read in as strings, and basic string vector functionality can be tested. Putting them back in lets you see the standard enum behavior again (minus overflow NAs).

I hope to finish the string/enum unification tonight.
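The spill-over Brandon describes can be sketched in a few lines. This is a hypothetical illustration of the two behaviors (old: overflow becomes NA; patched: overflow becomes a string), not the actual code at ParsedData2.java:666-670; `MAX_ENUM_SIZE` and `addToken` are made-up names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the real H2O parser code.
public class EnumOverflowSketch {
    static final int MAX_ENUM_SIZE = 3;              // tiny cap for illustration
    final Map<String, Integer> enumMap = new HashMap<>();
    final List<String> out = new ArrayList<>();      // what each parsed row became

    // spillToStrings == false reproduces the old behavior (overflow -> NA);
    // spillToStrings == true reproduces the patched behavior (overflow -> string).
    void addToken(String tok, boolean spillToStrings) {
        Integer id = enumMap.get(tok);
        if (id == null && enumMap.size() < MAX_ENUM_SIZE) {
            id = enumMap.size();                     // still room: assign new enum id
            enumMap.put(tok, id);
        }
        if (id != null)          out.add("enum#" + id);
        else if (spillToStrings) out.add("str:" + tok);  // string-chunk path
        else                     out.add("NA");          // old overflow path
    }
}
```

With a tiny cap, the fourth distinct token takes the overflow path, which makes the behavioral difference between the two modes easy to test in isolation.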

@cliffclick
Contributor

If you're that close, I might wait on Anand's pull & take 'em all.

fyi, we've been hacking the "gradle test" and "make check" paths - both
of 'em - and there are some pre-existing known failures in h2o-dev. See
if you can get your tests to run in isolation, but I'll look as soon as
something's available.

Cliff


@cliffclick
Contributor

Tomas - I'm poking at work between setting up for this trip.
Can you take the lead on pulling Anand's changes?
If Brandon's are ready, please review & pull them also.

We can handle the H2O1 integration later (if ever; let's work out the
bugs in h2o-dev, then we can look at how clean (or not) it goes into H2O1).

Thanks
Cliff


tomasnykodym merged commit cdec352 into h2oai:master on Jul 8, 2014
@tomasnykodym
Contributor

OK, I merged it in, but I have two comments:

a) the chunk is not backed directly by its serialized bytes as other chunks are (it uses two separate arrays internally, and both are allocated/copied during read()), so it will double the required memory.

b) do we want the interface to return String (which requires copying memory for each accessed row), or should we use ValueString, as in the parser, to avoid the memcopy/allocation?

@avati
Contributor Author

avati commented Jul 8, 2014

The interface was driven by the first requirement (Mahout), in which case
String was needed (to create a Scala Array of Strings for each Chunk).
Unfortunately it is expensive.


@cliffclick
Contributor

Good catch. Yes, that's a fail.
It's part of the complex/subtle contract on Chunks - no separate
ser/deser step, always only raw bytes - to avoid the massive
copy/allocation costs in moving data around.

So yes, please, flip CStrChunk to a pile-o-bytes, and have the interface
take & return ValueStrings instead.
We can have a 2nd convenience interface that returns a newly allocated
ValueString, but the bulk/efficient one will just fill in a pre-existing
ValueString - allowing you to iterate over all strings without doing
any allocation. Ability to do this is a major design requirement
(controls GC costs).
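The contract Cliff describes can be sketched as follows. This is an illustrative sketch, not the real water.fvec.CStrChunk or water.parser.ValueString: `Val` is a stand-in holder and `atStr` an assumed method name. The point is that the flat byte array is both the serialized form and the working data, and the bulk accessor fills a caller-supplied holder so iteration allocates nothing.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the real water.fvec.CStrChunk / water.parser.ValueString.
public class CStrChunkSketch {
    // Stand-in for ValueString: a (buf, off, len) view into shared bytes.
    static class Val {
        byte[] buf; int off, len;
        Val set(byte[] b, int o, int l) { buf = b; off = o; len = l; return this; }
        @Override public String toString() { return new String(buf, off, len, StandardCharsets.UTF_8); }
    }

    final byte[] mem;   // one flat "pile-o-bytes": the serialized form IS the data
    final int[]  offs;  // per-row start offsets; offs[nrows] == mem.length

    CStrChunkSketch(String[] rows) {
        offs = new int[rows.length + 1];
        byte[][] enc = new byte[rows.length][];
        int n = 0;
        for (int i = 0; i < rows.length; i++) {
            enc[i] = rows[i].getBytes(StandardCharsets.UTF_8);
            n += enc[i].length;
        }
        mem = new byte[n];
        int p = 0;
        for (int i = 0; i < rows.length; i++) {
            offs[i] = p;
            System.arraycopy(enc[i], 0, mem, p, enc[i].length);
            p += enc[i].length;
        }
        offs[rows.length] = p;
    }

    // Bulk/efficient path: fill a pre-existing holder, zero allocation per row.
    Val atStr(Val v, int row) { return v.set(mem, offs[row], offs[row + 1] - offs[row]); }

    // Convenience path: allocates a fresh String (the expensive variant).
    String atStr(int row) { return atStr(new Val(), row).toString(); }
}
```

Reusing one `Val` across a whole chunk touches only the backing array, which is the stated design requirement: per-row allocation on hot paths would dominate GC cost.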

Brandon, are you up to doing this? Can you coordinate w/Tomas & Anand
(as I'm only sorta on-grid now)?

Thanks,
Cliff


@cliffclick
Contributor

Yes, but no. Internally we'll use ValueString.
We can add an expensive convenience interface, but always supply the
"main" interface as a no-allocation path. It might not be the "main"
interface in terms of places where it's used... but it must be possible
to do so in all the hot paths. Imagine, e.g., "grep" for an address.
Complete fail (probably 10x slower) to allocate all strings. And
storing as-strings blows the all-big-data-is-in-big-primitive-arrays
invariant, leading to GC costs.

We'll look at a double-storage option (both strings & bytes) if String
ops become popular.
Cleaner can drop the Strings on demand, but FullGC will run slower.
Internally we'll use ValueString, as it's only slightly more headache
and a lot faster.
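The grep example above can be made concrete: with the flat byte layout, a substring scan walks the backing array directly and never materializes a String per row. A minimal sketch under illustrative names (`countMatches` and the `offs` layout are assumptions for this example, not H2O API):

```java
// Illustrative only: count rows containing a byte pattern by scanning raw bytes.
// Rows are laid out back-to-back in mem; row r spans offs[r] .. offs[r+1].
public class GrepSketch {
    static int countMatches(byte[] mem, int[] offs, byte[] pat) {
        int hits = 0;
        for (int r = 0; r + 1 < offs.length; r++)
            if (contains(mem, offs[r], offs[r + 1], pat)) hits++;
        return hits;
    }

    // Naive substring search over mem[lo, hi); allocation-free by construction.
    static boolean contains(byte[] mem, int lo, int hi, byte[] pat) {
        outer:
        for (int i = lo; i + pat.length <= hi; i++) {
            for (int j = 0; j < pat.length; j++)
                if (mem[i + j] != pat[j]) continue outer;
            return true;
        }
        return false;
    }
}
```

The same scan done through a String-returning accessor would allocate one object per row, which is exactly the "probably 10x slower" GC pressure Cliff is warning about.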

Cliff


wendycwong added a commit that referenced this pull request Jan 19, 2017
# This is the 1st commit message:
PUBDEV-3793-python-api-misc.  Added unit test for h2o.init().

# This is the commit message #2:

PUBDEV-3793-python-api-h2o-cluster-commands.  Added pyunit test for h2o.ls().

# This is the commit message #3:

PUBDEV-3797: H2O cluster apis.  Added pyunit skeleton test files.

# This is the commit message #4:

PUBDEV-3797: api from H2O cluster.  Completed h2o.init() tests.

# This is the commit message #5:

PUBDEV-3797: H2O MODULE API tests.  pyunit tests API complete.

# This is the commit message #6:

PUBDEV-3797: H2O module api tests.  Added more tests.

# This is the commit message #7:

PUBDEV-3797: API test for python client in H2O Module.  Added all tests.  Ready for review.

# This is the commit message #8:

PUBDEV-3797: Python API H2O Module.  Incorporated Pasha comments and minor code cleanup.

# This is the commit message #9:

PUBDEV-3797: Python API tests for H2O  Module.  Fixed pyunit tests to use more general commands and hopefully pass all the tests.

# This is the commit message #10:

PUBDEV-3797: api for H2O Module.  Fixed h2oinit.py failure.  Thanks Pasha.

# This is the commit message #1:

PUBDEV-3797: api test for H2O Module.  Clean up h2oinit.py test.

# This is the commit message #2:

PUBDEV-3797: test H2O Module API.  Added more print for pyunit_h2oinit.py.