Cstrchunk #4

Merged
tomasnykodym merged 2 commits into h2oai:master from avati:cstrchunk on Jul 8, 2014

Conversation

@avati
Contributor

@avati avati commented Jul 7, 2014

String-based chunk support. The set contains two patches:

  1. Initial patch to create a string-typed chunk
  2. Additional patch from @bghill with bug fixes to the first patch and wider integration with the parser etc.

avati and others added 2 commits July 7, 2014 16:26
Conflicts:
	h2o-core/src/main/java/water/fvec/CStrChunk.java
	h2o-core/src/main/java/water/fvec/NewChunk.java
@bghill
Contributor

bghill commented Jul 8, 2014

The only catch with putting this in the main h2o-dev stream is lines 666-670 of ParsedData2.java. If the Enum maps max out, this code spills over to using strings instead of inserting NAs (the old behavior). Since string/enum unification between chunks isn't included in this patch, anything that tests that behavior will fail. If that isn't an issue, then merge away.

Commenting out the above-mentioned lines is handy for testing: all items are then read in as strings, and basic string vector functionality can be tested. Putting them back in lets you see the standard enum behavior again (minus overflow NAs).

I hope to finish the string/enum unification tonight.
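The spill-over Brandon describes can be sketched in a few lines. This is a hypothetical illustration of the two behaviors (old: overflow becomes NA; patched: overflow becomes a string), not the actual code at ParsedData2.java:666-670; `MAX_ENUM_SIZE` and `addToken` are made-up names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the real H2O parser code.
public class EnumOverflowSketch {
    static final int MAX_ENUM_SIZE = 3;              // tiny cap for illustration
    final Map<String, Integer> enumMap = new HashMap<>();
    final List<String> out = new ArrayList<>();      // what each parsed row became

    // spillToStrings == false reproduces the old behavior (overflow -> NA);
    // spillToStrings == true reproduces the patched behavior (overflow -> string).
    void addToken(String tok, boolean spillToStrings) {
        Integer id = enumMap.get(tok);
        if (id == null && enumMap.size() < MAX_ENUM_SIZE) {
            id = enumMap.size();                     // still room: assign new enum id
            enumMap.put(tok, id);
        }
        if (id != null)          out.add("enum#" + id);
        else if (spillToStrings) out.add("str:" + tok);  // string-chunk path
        else                     out.add("NA");          // old overflow path
    }
}
```

With a tiny cap, the fourth distinct token takes the overflow path, which makes the behavioral difference between the two modes easy to test in isolation.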

@cliffclick
Contributor

If you're that close, I might wait on Anand's pull & take 'em all.

fyi, we've been hacking the "gradle test" and "make check" paths - both
of 'em - and there are some pre-existing known failures in h2o-dev. See
if you can get your tests to run in isolation, but I'll look as soon as
something's available.

Cliff


@cliffclick
Contributor

Tomas - I'm poking at work between setting up for this trip.
Can you take the lead on pulling Anand's changes?
If Brandon's are ready, please review & pull them also.

We can handle the H2O1 integration later (if ever; let's work out the
bugs in h2o-dev, then we can look at how clean (or not) it goes into H2O1).

Thanks
Cliff


tomasnykodym merged commit cdec352 into h2oai:master on Jul 8, 2014
@tomasnykodym
Contributor

OK, I merged it in, but I have two comments:

a) the chunk is not backed directly by its serialized bytes as other chunks are (it uses two separate arrays internally, and both are allocated/copied during read()), so it will double the required memory.

b) do we want the interface to return String (which requires copying memory for each accessed row), or should we use ValueString, as in the parser, to avoid the memcopy/allocation?

@avati
Contributor Author

avati commented Jul 8, 2014

The interface was driven by the first requirement (Mahout), in which case
String was needed (to create a Scala Array of Strings for each Chunk).
Unfortunately it is expensive.


@cliffclick
Contributor

Good catch. Yes, that's a fail.
It's part of the complex/subtle contract on Chunks - no separate
ser/deser step, always only raw bytes - to avoid the massive
copy/allocation costs in moving data around.

So yes, please, flip CStrChunk to a pile-o-bytes, and have the interface
take & return ValueStrings instead.
We can have a 2nd convenience interface that returns a newly allocated
ValueString, but the bulk/efficient one will just fill in a pre-existing
ValueString - allowing you to iterate over all strings without doing
any allocation. Ability to do this is a major design requirement
(controls GC costs).
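The contract Cliff describes can be sketched as follows. This is an illustrative sketch, not the real water.fvec.CStrChunk or water.parser.ValueString: `Val` is a stand-in holder and `atStr` an assumed method name. The point is that the flat byte array is both the serialized form and the working data, and the bulk accessor fills a caller-supplied holder so iteration allocates nothing.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the real water.fvec.CStrChunk / water.parser.ValueString.
public class CStrChunkSketch {
    // Stand-in for ValueString: a (buf, off, len) view into shared bytes.
    static class Val {
        byte[] buf; int off, len;
        Val set(byte[] b, int o, int l) { buf = b; off = o; len = l; return this; }
        @Override public String toString() { return new String(buf, off, len, StandardCharsets.UTF_8); }
    }

    final byte[] mem;   // one flat "pile-o-bytes": the serialized form IS the data
    final int[]  offs;  // per-row start offsets; offs[nrows] == mem.length

    CStrChunkSketch(String[] rows) {
        offs = new int[rows.length + 1];
        byte[][] enc = new byte[rows.length][];
        int n = 0;
        for (int i = 0; i < rows.length; i++) {
            enc[i] = rows[i].getBytes(StandardCharsets.UTF_8);
            n += enc[i].length;
        }
        mem = new byte[n];
        int p = 0;
        for (int i = 0; i < rows.length; i++) {
            offs[i] = p;
            System.arraycopy(enc[i], 0, mem, p, enc[i].length);
            p += enc[i].length;
        }
        offs[rows.length] = p;
    }

    // Bulk/efficient path: fill a pre-existing holder, zero allocation per row.
    Val atStr(Val v, int row) { return v.set(mem, offs[row], offs[row + 1] - offs[row]); }

    // Convenience path: allocates a fresh String (the expensive variant).
    String atStr(int row) { return atStr(new Val(), row).toString(); }
}
```

Reusing one `Val` across a whole chunk touches only the backing array, which is the stated design requirement: per-row allocation on hot paths would dominate GC cost.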

Brandon, are you up to doing this? Can you coordinate w/Tomas & Anand
(as I'm only sorta on-grid now)?

Thanks,
Cliff


@cliffclick
Contributor

Yes, but no. Internally we'll use ValueString.
We can add an expensive convenience interface, but always supply the
"main" interface as a no-allocation path. It might not be the "main"
interface in terms of places where it's used... but it must be possible
to do so in all the hot paths. Imagine, e.g., "grep" for an address.
Complete fail (probably 10x slower) to allocate all strings. And
storing as-strings blows the all-big-data-is-in-big-primitive-arrays
invariant, leading to GC costs.

We'll look at a double-storage option (both strings & bytes) if String
ops become popular.
Cleaner can drop the Strings on demand, but FullGC will run slower.
Internally we'll use ValueString, as it's only slightly more headache
and a lot faster.
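The grep example above can be made concrete: with the flat byte layout, a substring scan walks the backing array directly and never materializes a String per row. A minimal sketch under illustrative names (`countMatches` and the `offs` layout are assumptions for this example, not H2O API):

```java
// Illustrative only: count rows containing a byte pattern by scanning raw bytes.
// Rows are laid out back-to-back in mem; row r spans offs[r] .. offs[r+1].
public class GrepSketch {
    static int countMatches(byte[] mem, int[] offs, byte[] pat) {
        int hits = 0;
        for (int r = 0; r + 1 < offs.length; r++)
            if (contains(mem, offs[r], offs[r + 1], pat)) hits++;
        return hits;
    }

    // Naive substring search over mem[lo, hi); allocation-free by construction.
    static boolean contains(byte[] mem, int lo, int hi, byte[] pat) {
        outer:
        for (int i = lo; i + pat.length <= hi; i++) {
            for (int j = 0; j < pat.length; j++)
                if (mem[i + j] != pat[j]) continue outer;
            return true;
        }
        return false;
    }
}
```

The same scan done through a String-returning accessor would allocate one object per row, which is exactly the "probably 10x slower" GC pressure Cliff is warning about.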

Cliff


wendycwong added a commit that referenced this pull request Jan 19, 2017
# This is the 1st commit message:
PUBDEV-3793-python-api-misc.  Added unit test for h2o.init().

# This is the commit message #2:

PUBDEV-3793-python-api-h2o-cluster-commands.  Added pyunit test for h2o.ls().

# This is the commit message #3:

PUBDEV-3797: H2O cluster apis.  Added pyunit skeleton test files.

# This is the commit message #4:

PUBDEV-3797: api from H2O cluster.  Completed h2o.init() tests.

# This is the commit message #5:

PUBDEV-3797: H2O MODULE API tests.  pyunit tests API complete.

# This is the commit message #6:

PUBDEV-3797: H2O module api tests.  Added more tests.

# This is the commit message #7:

PUBDEV-3797: API test for python client in H2O Module.  Added all tests.  Ready for review.

# This is the commit message #8:

PUBDEV-3797: Python API H2O Module.  Incorporated Pasha comments and minor code cleanup.

# This is the commit message #9:

PUBDEV-3797: Python API tests for H2O  Module.  Fixed pyunit tests to use more general commands and hopefully pass all the tests.

# This is the commit message #10:

PUBDEV-3797: api for H2O Module.  Fixed h2oinit.py failure.  Thanks Pasha.

# This is the commit message #1:

PUBDEV-3797: api test for H2O Module.  Clean up h2oinit.py test.

# This is the commit message #2:

PUBDEV-3797: test H2O Module API.  Added more print for pyunit_h2oinit.py.