Commit f28961c

lordgamez authored and fgerlits committed

MINIFICPP-2594 Add XMLReader controller service and upgrade pugixml library to v1.15

Signed-off-by: Ferenc Gerlits <fgerlits@gmail.com>

Closes #1995
1 parent 9ce51ca commit f28961c

14 files changed: +613 -93 lines changed


CMakeLists.txt

Lines changed: 4 additions & 0 deletions
@@ -381,6 +381,10 @@ if (ENABLE_ALL OR ENABLE_PROMETHEUS OR ENABLE_GRAFANA_LOKI OR ENABLE_CIVET)
 endif()
 
 ## Add extensions
+
+# PugiXML required for standard processors and WEL extension
+include(PugiXml)
+
 file(GLOB extension-directories "extensions/*")
 foreach(extension-dir ${extension-directories})
     if (IS_DIRECTORY ${extension-dir} AND EXISTS ${extension-dir}/CMakeLists.txt)

CONTROLLERS.md

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,7 @@ limitations under the License.
 - [SSLContextService](#SSLContextService)
 - [UpdatePolicyControllerService](#UpdatePolicyControllerService)
 - [VolatileMapStateStorage](#VolatileMapStateStorage)
+- [XMLReader](#XMLReader)
 
 
 ## AWSCredentialsService
@@ -332,3 +333,21 @@ In the list below, the names of required properties appear in bold. Any other pr
 | Name | Default Value | Allowable Values | Description |
 |-----------------|---------------|------------------|--------------------------------|
 | Linked Services | | | Referenced Controller Services |
+
+
+## XMLReader
+
+### Description
+
+Reads XML content and creates Record objects. Records are expected in the second level of XML data, embedded in an enclosing root tag. Types for records are inferred automatically based on the content of the XML tags. For timestamps, the format is expected to be ISO 8601 compliant.
+
+### Properties
+
+In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.
+
+| Name | Default Value | Allowable Values | Description |
+|-----------------------------|---------------|------------------|-------------|
+| Field Name for Content | | | If tags with content (e. g. <field>content</field>) are defined as nested records in the schema, the name of the tag will be used as name for the record and the value of this property will be used as name for the field. If the tag contains subnodes besides the content (e.g. <field>content<subfield>subcontent</subfield></field>), or a node attribute is present, we need to define a name for the text content, so that it can be distinguished from the subnodes. If this property is not set, the default name 'value' will be used for the text content of the tag in this case. |
+| **Parse XML Attributes** | false | true<br/>false | When this property is 'true' then XML attributes are parsed and added to the record as new fields, otherwise XML attributes and their values are ignored. |
+| Attribute Prefix | | | If this property is set, the name of attributes will be prepended with a prefix when they are added to a record. |
+| **Expect Records as Array** | false | true<br/>false | This property defines whether the reader expects a FlowFile to consist of a single Record or a series of Records with a "wrapper element". Because XML does not provide for a way to read a series of XML documents from a stream directly, it is common to combine many XML documents by concatenating them and then wrapping the entire XML blob with a "wrapper element". This property dictates whether the reader expects a FlowFile to consist of a single Record or a series of Records with a "wrapper element" that will be ignored. |

LICENSE

Lines changed: 0 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -2354,29 +2354,6 @@ This product bundles 'zlib' within 'OpenCV' under the following license:
 Comments) 1950 to 1952 in the files http://tools.ietf.org/html/rfc1950
 (zlib format), rfc1951 (deflate format) and rfc1952 (gzip format).
 
-This product bundles 'TinyXml2' within 'AWS SDK for C++' under a zlib license:
-
-Original code by Lee Thomason (www.grinninglizard.com)
-
-This software is provided 'as-is', without any express or implied
-warranty. In no event will the authors be held liable for any
-damages arising from the use of this software.
-
-Permission is granted to anyone to use this software for any
-purpose, including commercial applications, and to alter it and
-redistribute it freely, subject to the following restrictions:
-
-1. The origin of this software must not be misrepresented; you must
-not claim that you wrote the original software. If you use this
-software in a product, an acknowledgment in the product documentation
-would be appreciated but is not required.
-
-2. Altered source versions must be plainly marked as such, and
-must not be misrepresented as being the original software.
-
-3. This notice may not be removed or altered from any source
-distribution.
-
 
 This product bundles 'cJSON' within 'AWS SDK for C++' under an MIT license:

NOTICE

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -43,7 +43,6 @@ THIRD PARTY COMPONENTS
 This software includes third party software subject to the following copyrights:
 - Very fast, header-only/compiled, C++ logging library from spdlog - Copyright (c) 2016 Gabi Melman
 - An open-source formatting library for C++ from fmt - Copyright (c) 2012 - present, Victor Zverovich
-- XML parsing and utility functions from TinyXml2 - Lee Thomason
 - JSON parsing and utility functions from JsonCpp - Copyright (c) 2007-2010 Baptiste Lepilleur
 - OpenSSL build files for cmake used for Android Builds - Copyright (C) 2007-2012 LuaDist and Copyright (C) 2013 Brian Sidebotham
 - Android tool chain cmake build files - Copyright (c) 2010-2011, Ethan Rublee and Copyright (c) 2011-2014, Andrey Kamaev
@@ -78,6 +77,7 @@ This software includes third party software subject to the following copyrights:
 - llhttp - Copyright Fedor Indutny, 2018.
 - benchmark - Copyright 2015 Google Inc.
 - llama.cpp - Copyright (c) 2023-2024 The ggml authors
+- pugixml - Copyright (C) 2003, by Kristen Wegner (kristen@tima.net)
 
 The licenses for these third party components are included in LICENSE.txt

cmake/BundledPugiXml.cmake

Lines changed: 0 additions & 59 deletions
This file was deleted.

cmake/PugiXml.cmake

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
# Licensed to the Apache Software Foundation (ASF) under one
2+
# or more contributor license agreements. See the NOTICE file
3+
# distributed with this work for additional information
4+
# regarding copyright ownership. The ASF licenses this file
5+
# to you under the Apache License, Version 2.0 (the
6+
# "License"); you may not use this file except in compliance
7+
# with the License. You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing,
12+
# software distributed under the License is distributed on an
13+
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
# KIND, either express or implied. See the License for the
15+
# specific language governing permissions and limitations
16+
# under the License.
17+
include(FetchContent)
18+
19+
set(PUGIXML_BUILD_TESTS OFF CACHE BOOL "" FORCE)
20+
21+
FetchContent_Declare(
22+
pugixml
23+
URL https://github.com/zeux/pugixml/archive/refs/tags/v1.15.tar.gz
24+
URL_HASH SHA256=b39647064d9e28297a34278bfb897092bf33b7c487906ddfc094c9e8868bddcb
25+
)
26+
FetchContent_MakeAvailable(pugixml)

extensions/standard-processors/CMakeLists.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,7 @@ target_include_directories(minifi-standard-processors PUBLIC "${CMAKE_SOURCE_DIR
2727

2828
include(RangeV3)
2929
include(Asio)
30-
target_link_libraries(minifi-standard-processors ${LIBMINIFI} Threads::Threads range-v3 asio)
30+
target_link_libraries(minifi-standard-processors ${LIBMINIFI} Threads::Threads range-v3 asio pugixml)
3131

3232
include(Coroutines)
3333
enable_coroutines()
Lines changed: 207 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,207 @@
1+
/**
2+
* Licensed to the Apache Software Foundation (ASF) under one or more
3+
* contributor license agreements. See the NOTICE file distributed with
4+
* this work for additional information regarding copyright ownership.
5+
* The ASF licenses this file to You under the Apache License, Version 2.0
6+
* (the "License"); you may not use this file except in compliance with
7+
* the License. You may obtain a copy of the License at
8+
*
9+
* http://www.apache.org/licenses/LICENSE-2.0
10+
*
11+
* Unless required by applicable law or agreed to in writing, software
12+
* distributed under the License is distributed on an "AS IS" BASIS,
13+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
* See the License for the specific language governing permissions and
15+
* limitations under the License.
16+
*/
17+
18+
#include "XMLReader.h"
19+
20+
#include <algorithm>
21+
#include <ranges>
22+
23+
#include "core/Resource.h"
24+
#include "utils/TimeUtil.h"
25+
#include "utils/gsl.h"
26+
27+
namespace org::apache::nifi::minifi::standard {
28+
29+
namespace {
30+
bool hasChildNodes(const pugi::xml_node& node) {
31+
return std::ranges::any_of(node, [] (const pugi::xml_node& child) {
32+
return child.type() == pugi::node_element;
33+
});
34+
}
35+
36+
void addRecordFieldToObject(core::RecordObject& record_object, const std::string& name, const core::RecordField& field) {
37+
auto it = record_object.find(name);
38+
if (it == record_object.end()) {
39+
record_object.emplace(name, field);
40+
return;
41+
}
42+
43+
if (std::holds_alternative<core::RecordArray>(it->second.value_)) {
44+
std::get<core::RecordArray>(it->second.value_).emplace_back(field);
45+
return;
46+
}
47+
48+
core::RecordArray array;
49+
array.emplace_back(it->second);
50+
array.emplace_back(field);
51+
it->second = core::RecordField(std::move(array));
52+
}
53+
} // namespace
54+
55+
void XMLReader::writeRecordField(core::RecordObject& record_object, const std::string& name, const std::string& value, bool write_pcdata_node) const {
56+
// If the name is the value set in the Field Name for Content property, we should only add this value to the RecordObject if we are writing a plain character data node.
57+
if (!write_pcdata_node && name == field_name_for_content_) {
58+
return;
59+
}
60+
61+
if (value == "true" || value == "false") {
62+
addRecordFieldToObject(record_object, name, core::RecordField(value == "true"));
63+
return;
64+
} else if (auto date = utils::timeutils::parseDateTimeStr(value)) {
65+
addRecordFieldToObject(record_object, name, core::RecordField(*date));
66+
return;
67+
} else if (auto date = utils::timeutils::parseRfc3339(value)) {
68+
addRecordFieldToObject(record_object, name, core::RecordField(*date));
69+
return;
70+
}
71+
72+
if (std::ranges::all_of(value, ::isdigit)) {
73+
try {
74+
uint64_t value_as_uint64 = std::stoull(value);
75+
addRecordFieldToObject(record_object, name, core::RecordField(value_as_uint64));
76+
return;
77+
} catch (const std::exception&) {
78+
}
79+
}
80+
81+
if (value.starts_with('-') && std::ranges::all_of(value | std::views::drop(1), ::isdigit)) {
82+
try {
83+
int64_t value_as_int64 = std::stoll(value);
84+
addRecordFieldToObject(record_object, name, core::RecordField(value_as_int64));
85+
return;
86+
} catch (const std::exception&) {
87+
}
88+
}
89+
90+
try {
91+
auto value_as_double = std::stod(value);
92+
addRecordFieldToObject(record_object, name, core::RecordField(value_as_double));
93+
return;
94+
} catch (const std::exception&) {
95+
}
96+
97+
addRecordFieldToObject(record_object, name, core::RecordField(value));
98+
}
99+
100+
void XMLReader::parseNodeElement(core::RecordObject& record_object, const pugi::xml_node& node) const {
101+
gsl_Expects(node.type() == pugi::node_element);
102+
if (parse_xml_attributes_ && node.first_attribute()) {
103+
core::RecordObject child_record_object;
104+
for (const pugi::xml_attribute& attr : node.attributes()) {
105+
writeRecordField(child_record_object, attribute_prefix_ + attr.name(), attr.value());
106+
}
107+
parseXmlNode(child_record_object, node);
108+
addRecordFieldToObject(record_object, node.name(), core::RecordField(std::move(child_record_object)));
109+
return;
110+
}
111+
112+
if (hasChildNodes(node)) {
113+
core::RecordObject child_record_object;
114+
parseXmlNode(child_record_object, node);
115+
addRecordFieldToObject(record_object, node.name(), core::RecordField(std::move(child_record_object)));
116+
return;
117+
}
118+
119+
writeRecordField(record_object, node.name(), node.child_value());
120+
}
121+
122+
void XMLReader::parseXmlNode(core::RecordObject& record_object, const pugi::xml_node& node) const {
123+
std::string pc_data_value;
124+
for (pugi::xml_node child : node.children()) {
125+
if (child.type() == pugi::node_element) {
126+
parseNodeElement(record_object, child);
127+
} else if (child.type() == pugi::node_pcdata) {
128+
pc_data_value.append(child.value());
129+
}
130+
}
131+
132+
if (!pc_data_value.empty()) {
133+
writeRecordField(record_object, field_name_for_content_, pc_data_value, true);
134+
}
135+
}
136+
137+
void XMLReader::addRecordFromXmlNode(const pugi::xml_node& node, core::RecordSet& record_set) const {
138+
core::RecordObject record_object;
139+
parseXmlNode(record_object, node);
140+
core::Record record(std::move(record_object));
141+
record_set.emplace_back(std::move(record));
142+
}
143+
144+
bool XMLReader::parseRecordsFromXml(core::RecordSet& record_set, const std::string& xml_content) const {
145+
pugi::xml_document doc;
146+
if (!doc.load_string(xml_content.c_str())) {
147+
logger_->log_error("Failed to parse XML content: {}", xml_content);
148+
return false;
149+
}
150+
151+
if (expect_records_as_array_) {
152+
pugi::xml_node root = doc.first_child();
153+
for (pugi::xml_node record_node : root.children()) {
154+
addRecordFromXmlNode(record_node, record_set);
155+
}
156+
return true;
157+
}
158+
159+
pugi::xml_node root = doc.first_child();
160+
if (!root.first_child()) {
161+
logger_->log_info("XML content does not contain any records: {}", xml_content);
162+
return true;
163+
}
164+
addRecordFromXmlNode(root, record_set);
165+
return true;
166+
}
167+
168+
void XMLReader::onEnable() {
169+
auto parseBoolProperty = [this](std::string_view property_name) -> bool {
170+
if (auto property_value_str = getProperty(property_name); property_value_str && !property_value_str->empty()) {
171+
if (auto property_value = parsing::parseBool(*property_value_str)) {
172+
return *property_value;
173+
}
174+
throw Exception(PROCESS_SCHEDULE_EXCEPTION, fmt::format("Invalid value for {} property: {}", property_name, *property_value_str));
175+
}
176+
return false;
177+
};
178+
179+
field_name_for_content_ = getProperty(FieldNameForContent.name).value_or("value");
180+
parse_xml_attributes_ = parseBoolProperty(ParseXMLAttributes.name);
181+
attribute_prefix_ = getProperty(AttributePrefix.name).value_or("");
182+
expect_records_as_array_ = parseBoolProperty(ExpectRecordsAsArray.name);
183+
}
184+
185+
nonstd::expected<core::RecordSet, std::error_code> XMLReader::read(io::InputStream& input_stream) {
186+
core::RecordSet record_set{};
187+
const auto read_result = [this, &record_set](io::InputStream& input_stream) -> size_t {
188+
std::string content;
189+
content.resize(input_stream.size());
190+
const auto read_ret = input_stream.read(as_writable_bytes(std::span(content)));
191+
if (io::isError(read_ret)) {
192+
logger_->log_error("Failed to read XML data from input stream");
193+
return io::STREAM_ERROR;
194+
}
195+
if (!parseRecordsFromXml(record_set, content)) {
196+
return io::STREAM_ERROR;
197+
}
198+
return read_ret;
199+
}(input_stream);
200+
if (io::isError(read_result)) {
201+
return nonstd::make_unexpected(std::make_error_code(std::errc::invalid_argument));
202+
}
203+
return record_set;
204+
}
205+
206+
REGISTER_RESOURCE(XMLReader, ControllerService);
207+
} // namespace org::apache::nifi::minifi::standard
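
Reviewer note: `addRecordFieldToObject` above handles repeated tag names by promoting the existing field to an array on the second occurrence and appending on later ones. The sketch below shows that promotion pattern in isolation, under simplifying assumptions: `Scalar`, `Array`, `Field`, `Object` and `addField` are hypothetical stand-ins for the `core::Record*` types, and values are plain strings only.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical, simplified stand-ins for core::RecordField, core::RecordArray and core::RecordObject.
using Scalar = std::string;
using Array = std::vector<Scalar>;
using Field = std::variant<Scalar, Array>;
using Object = std::unordered_map<std::string, Field>;

// Adds a value under `name`, turning the field into an array once the name repeats.
inline void addField(Object& object, const std::string& name, const Scalar& value) {
  auto it = object.find(name);
  if (it == object.end()) {
    // First occurrence: store the value as a scalar field.
    object.emplace(name, value);
    return;
  }

  if (auto* existing_array = std::get_if<Array>(&it->second)) {
    // Third and later occurrences: append to the existing array.
    existing_array->push_back(value);
    return;
  }

  // Second occurrence: wrap the stored scalar and the new value in an array.
  Array array;
  array.push_back(std::get<Scalar>(it->second));
  array.push_back(value);
  it->second = std::move(array);
}
```

With this shape, calling `addField(record, "item", "a")` and then `addField(record, "item", "b")` leaves a two-element array under `item`, while a tag that appears only once stays a scalar. A consumer therefore has to accept both shapes for the same field name, depending on how often the tag occurred in the input.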
