
HPCC-33806 Add support of building indexes to dafilesrv#20197

Open
jpmcmu wants to merge 2 commits into hpcc-systems:candidate-9.14.x from jpmcmu:HPCC-33806

Conversation


jpmcmu (Contributor) commented Jul 23, 2025

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@github-actions

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-33806

Jirabot Action Result:
Workflow Transition To: Merge Pending
Updated PR

if (compression == "default")
{
flags |= HTREE_COMPRESSED_KEY;
compression = "";
jpmcmu (Contributor Author)

Is there a better way to determine the "default" compression format?

Member

this code should probably use translateToCompMethod(compression)

Member

Not really - it is only a very small subset of compression types.

Member

Realistically I think you should always set htree_compressed_key and then pass through the compression as is. Row compression is not used outside the regression suite.

I would change the check in keybuild.cpp:

if (!isEmptyString(compression))

to

if (!isEmptyString(compression) && !strsame(compression, "lzw") && !strsame(compression, "default"))

Which will allow lzw to be explicitly defined if we ever change the default.
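The proposed condition can be sketched as a small self-contained predicate. The `isEmptyString`/`strsame` helpers here are local stand-ins for the jlib functions of the same names, included only so the sketch compiles on its own:

```cpp
#include <cstring>

// Stand-ins for jlib's isEmptyString/strsame, so this sketch is self-contained.
static bool isEmptyString(const char *s) { return !s || !*s; }
static bool strsame(const char *a, const char *b) { return strcmp(a, b) == 0; }

// Sketch of the revised keybuild.cpp check: treat "lzw" and "default" the same
// as an empty compression string, so HTREE_COMPRESSED_KEY is always set and
// only genuinely non-default methods are passed through.
static bool useExplicitCompression(const char *compression)
{
    return !isEmptyString(compression)
        && !strsame(compression, "lzw")
        && !strsame(compression, "default");
}
```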

helper->setIndexMeta("_nodeSize", std::to_string(nodeSize));
}

if (config.hasProp("noSeek"))
jpmcmu (Contributor Author)

Does it make sense to expose these options? I tried to match what was exposed to ECL.

Member

I think this should be defaulting to true - it should be true for blob storage, and doesn't really harm to be true for other systems.


inline void processRow(const void *row, uint64_t rowSize)
{
unsigned __int64 fpos = 0;
jpmcmu (Contributor Author)

Setting the fpos correctly here is a bit odd; it would definitely need to come from the incoming record, but an fpos may not always make sense, and because the datasets are often projected it isn't easy to reliably get the fpos of a read dataset.

Member

fpos is only applicable if building an index of a base dataset (where the fpos values refer to offsets in the flat file).
Not sure there's any need to support it.

Member

Unfortunately ECL has weird semantics to reuse the fileposition field if the last field in the payload is numeric.

I think this is ok as it is, but the index will only be generally readable if it is defined with the FILEPOSITION(FALSE) attribute (if the last field is a numeric value).

If you want to be able to create all keys then you will need to do some horrible transformations to read the integer value of the last field, put it into the fileposition field.
Again create a separate jira to revisit.

jpmcmu requested review from ghalliday and jakesmith (July 23, 2025 13:18)

jpmcmu commented Jul 23, 2025

@ghalliday @jakesmith Still working on a few things here especially in relation to the TLK / publishing, but writing of an index is working.

virtual void write(size32_t sz, const void *rowData) override
{
size32_t rowOffset = 0;
while(rowOffset < sz)
jpmcmu (Contributor Author)

Need to handle partial records here.

Member

I think it should be illegal to write partial records to this function. Otherwise you have some notable complications - the call to find the row size needs protecting if the row is partial.
For the moment throw an error if rowOffset > sz
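A minimal sketch of that guard, assuming a hypothetical `rowSizeAt()` helper in place of the real row-size lookup (names here are illustrative, not the dafsserver API):

```cpp
#include <cstddef>
#include <stdexcept>

using size32_t = unsigned;

// Illustrative stand-in for the real fixed/variable row-size lookup;
// the sketch assumes fixed 8-byte rows.
static size32_t rowSizeAt(const void *rowData, size32_t offset)
{
    (void)rowData; (void)offset;
    return 8;
}

// Walk whole rows in the buffer, rejecting a partial trailing record
// instead of trying to buffer it across calls.
static unsigned writeRows(size32_t sz, const void *rowData)
{
    unsigned rowsProcessed = 0;
    size32_t rowOffset = 0;
    while (rowOffset < sz)
    {
        size32_t rowSize = rowSizeAt(rowData, rowOffset);
        if (rowOffset + rowSize > sz)
            throw std::runtime_error("partial record passed to write()");
        rowOffset += rowSize;
        rowsProcessed++;
    }
    return rowsProcessed;
}
```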

jakesmith (Member) left a comment

@jpmcmu - looks good in general. Please see comments.

}
};


Member

trivial: leave one line, consistent with other spacing between classes.

translator.setown(createRecordTranslator(outRecord, inRecord));
}

virtual bool getIndexMeta(size32_t & lenName, char * & name, size32_t & lenValue, char * & value, unsigned idx)
Member

trivial: add 'override'

if (config.hasProp("noSeek"))
{
bool noSeek = config.getPropBool("noSeek");
helper->setIndexMeta("_noSeek", noSeek ? "true" : "false");
Member

trivial: can use boolToStr

return true;
}

void setIndexMeta(const std::string& name, const std::string& value)
Member

picky: nicer if virtuals of IHThorIndexWriteArg kept together.

Member

doesn't getWidth() need to be implemented with the count of meta fields for getIndexMeta to be callable?

if (idx >= indexMetaData.size())
return false;

auto it = indexMetaData.begin();
Member

would a std::vector of a pair of std::string's be more suitable?
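A sketch of what that could look like. `IndexMetaStore` and its method names are illustrative, not the actual dafsserver code; the point is that a vector of name/value pairs keeps insertion order and makes the by-index lookup in getIndexMeta trivial:

```cpp
#include <string>
#include <utility>
#include <vector>

// Index metadata as an ordered list of name/value pairs, so lookup by
// ordinal position is O(1) rather than advancing a map iterator.
class IndexMetaStore
{
    std::vector<std::pair<std::string, std::string>> entries;
public:
    void set(const std::string &name, const std::string &value)
    {
        entries.emplace_back(name, value);
    }
    bool get(unsigned idx, std::string &name, std::string &value) const
    {
        if (idx >= entries.size())
            return false;
        name = entries[idx].first;
        value = entries[idx].second;
        return true;
    }
    unsigned width() const { return (unsigned)entries.size(); }
};
```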


~CRemoteIndexWriteActivity()
{
if (builder != nullptr && helper != nullptr)
Member

there is no alternative at the moment, but when the file is closed (StreamCmd::CLOSE), it should call through to the activity to close, so we don't depend on dtors to do this kind of work.

For now, it would be worth adding a try/catch - as any unhandled exception at this point (within a dtor) will cause the process to exit.
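The suggested guard could look roughly like this. `IndexWriteGuard` and `finishBuild()` are illustrative stand-ins for the real activity/builder close-out logic:

```cpp
#include <stdexcept>

// Destructor-time cleanup wrapped in try/catch: an exception escaping a
// destructor terminates the process, so any failure here is swallowed.
class IndexWriteGuard
{
    bool finished = false;
public:
    void finishBuild()
    {
        // In the real code this could throw (e.g. on a write failure).
        finished = true;
    }
    bool isFinished() const { return finished; }
    ~IndexWriteGuard()
    {
        try
        {
            if (!finished)
                finishBuild();
        }
        catch (...)
        {
            // Never let an exception escape a destructor.
        }
    }
};
```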

Member

Related to this, I think you are going to need to serialize back the last row, so that when the client has finished writing all parts of an index, it can use those last parts to create the TLK.
The response from StreamCmd::CLOSE could be extended to return structured info that contains this serialized row data.

if (builder != nullptr && helper != nullptr)
{
Owned<IPropertyTree> metadata;
metadata.setown(createPTree("metadata", ipt_fast));
Member

trivial: ^ could be on 1 line: Owned<IPropertyTree> metadata = createPTree("metadata");

not worth diverging away from default to specify ipt_fast in this case, it's the default anyway.

size32_t rowOffset = 0;
while(rowOffset < sz)
{
const RtlRecord& inputRecordAccessor = inMeta->queryRecordAccessor(true);
Member

could be done once and stored as member.

}

extern jlib_decl void toLower(std::string & value);
extern jlib_decl void trim(std::string & value);
Member

could do with a comment.. what does it do? Looks like it trims leading whitespace only, not trailing.


std::string compression = config.queryProp("compressed", "default");
toLower(compression);
trim(compression);
Member

curious why this string might need leading spaces trimmed? (vs any other string)

Member

I agree, I'm not sure why you would trim this field.

ghalliday marked this pull request as ready for review July 29, 2025 11:25
Copilot AI review requested due to automatic review settings July 29, 2025 11:25
ghalliday (Member)

Converting to non-draft - since it's ready to review, and to allow Copilot to run.

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for building indexes to the dafilesrv (data file server) component, enabling remote index creation capabilities as part of HPCC-33806. The implementation includes a new index write activity class and supporting infrastructure.

  • Adds a new CRemoteIndexWriteActivity class to handle remote index building operations
  • Implements helper classes and utilities for index metadata management and record transformation
  • Adds configuration support for index compression, node size, and other index-specific options

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File descriptions:

  • system/jlib/jstring.hpp: Adds declaration for new trim utility function
  • system/jlib/jstring.cpp: Implements trim function for string whitespace removal
  • fs/dafsserver/dafsserver.cpp: Adds complete index writing functionality with new classes and TAKindexwrite support
  • esp/services/ws_dfu/ws_dfuService.cpp: Adds missing break statement and index file descriptor configuration

{
value.erase(value.begin(), std::find_if(value.begin(), value.end(), [](unsigned char ch) {
return !std::isspace(ch);
}));
Copilot AI Jul 29, 2025

The trim function only removes leading whitespace but not trailing whitespace. A complete trim implementation should remove both leading and trailing whitespace. Consider using value.erase(std::find_if(value.rbegin(), value.rend(), [](unsigned char ch) { return !std::isspace(ch); }).base(), value.end()) to also remove trailing whitespace.

Suggested change (add a trailing-whitespace erase after the existing leading trim):

    }));
    value.erase(std::find_if(value.rbegin(), value.rend(), [](unsigned char ch) {
        return !std::isspace(ch);
    }).base(), value.end());
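Putting the leading and trailing erases together, a self-contained version of the suggested full trim might look like this (`trimBoth` is an illustrative name; the in-tree function is `trim`):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Strip both leading and trailing whitespace in place. The unsigned char
// cast avoids undefined behaviour in std::isspace for negative char values.
static void trimBoth(std::string &value)
{
    value.erase(value.begin(), std::find_if(value.begin(), value.end(),
        [](unsigned char ch) { return !std::isspace(ch); }));
    value.erase(std::find_if(value.rbegin(), value.rend(),
        [](unsigned char ch) { return !std::isspace(ch); }).base(), value.end());
}
```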

size32_t indexRowSize = helper->transform(rowBuilder, row, this, fpos);

// Key builder checks for duplicate records so we can just check for sortedness
if (memcmp(prevRowBuffer.get(), rowBuffer.get(), helper->getKeyedSize()) > 0)
Copilot AI Jul 29, 2025

The comparison uses memcmp which performs byte-wise comparison, but this may not be correct for all data types. For complex types like strings with different encodings or numeric types with different byte representations, this could give incorrect sort order results. Consider using a proper record comparison function that understands the data types.

void openFileStream()
{
if (!recursiveCreateDirectoryForFile(fileName))
throw createDafsExceptionV(DAFSERR_cmdstream_openfailure, "Failed to create dirtory for file: '%s'", fileName.get());
Copilot AI Jul 29, 2025

Typo in error message: 'dirtory' should be 'directory'.

Suggested change
throw createDafsExceptionV(DAFSERR_cmdstream_openfailure, "Failed to create dirtory for file: '%s'", fileName.get());
throw createDafsExceptionV(DAFSERR_cmdstream_openfailure, "Failed to create directory for file: '%s'", fileName.get());

throw MakeStringException(99, "Index maximum record length (%d) exceeds 32k internal limit", maxDiskRecordSize);

rowBuffer.allocateN(maxDiskRecordSize, true);
prevRowBuffer.allocateN(maxDiskRecordSize, true);
Copilot AI Jul 29, 2025

The prevRowBuffer is allocated with the full maxDiskRecordSize but only helper->getKeyedSize() bytes are used in the comparison. Consider allocating only the needed size for the keyed portion to reduce memory usage, especially for records with large non-keyed portions.

Suggested change
prevRowBuffer.allocateN(maxDiskRecordSize, true);
prevRowBuffer.allocateN(helper->getKeyedSize(), true);

}

~CRemoteIndexWriteActivity()
{
Copilot AI Jul 29, 2025

The destructor performs complex operations including calling builder->finish() which could potentially throw exceptions. Destructors should not throw exceptions as this can lead to undefined behavior. Consider moving the finish() logic to a separate cleanup method that can be called explicitly before destruction.

Suggested change (wrap the cleanup in a try/catch and move the finish logic into a separate method):

    {
        try
        {
            cleanup();
        }
        catch (...)
        {
            // Log the exception or handle it appropriately
            // Avoid propagating exceptions from the destructor
        }
        close();
    }
    void cleanup()
    {
ghalliday (Member) left a comment

@jpmcmu This looks broadly right, but various comments.

virtual unsigned getFlags() { return flags; }
virtual size32_t transform(ARowBuilder & rowBuilder, const void * row, IBlobCreator * blobs, unsigned __int64 & filepos)
{
// Seems like an UnexpectedVirtualFieldCallback could be used but what about blobs?
Member

Please create a separate jira for supporting blobs. It will need changes to the translator, including a new virtual in the callback interface.



maxRecordSizeSeen = indexRowSize;

processed++;
memcpy(prevRowBuffer.get(), rowBuffer.get(), maxDiskRecordSize);
Member

Only need to save keyedSize. I suspect benefit of catching invalid input data outweighs the cost.

size32_t indexRowSize = helper->transform(rowBuilder, row, this, fpos);

// Key builder checks for duplicate records so we can just check for sortedness
if (memcmp(prevRowBuffer.get(), rowBuffer.get(), helper->getKeyedSize()) > 0)
Member

cache helper->getKeyedSize() in a member variable.

flags &= ~USE_TRAILING_HEADER;
}

size32_t fileposSize = hasTrailingFileposition(helper->queryDiskRecordSize()->queryTypeInfo()) ? sizeof(offset_t) : 0;
Member

Throw an error if it has a trailing fileposition - will require changes elsewhere.

}
};

class CRemoteIndexWriteHelper : public CThorIndexWriteArg
Member

Does this class actually provide any benefit? It is legal to call createKeyBuilder() with null for the helper.
I suspect it adds complication with no benefit.
I don't think there is currently a way of adding bloom filters without a helper, but it would be better to add virtuals to allow that, and apply the values directly from a property tree.

Long term that is the direction that disk write is going for many options.


unsigned nodeSize = NODESIZE;
if (config.hasProp("nodeSize"))
{
nodeSize = config.getPropInt("nodeSize");
Member

I think this is an example of some code that is made more complicated by having the helper.


