[WIP] support line delimited data #8
Conversation
lib/logstash/codecs/csv.rb
```ruby
  CONVERTERS.freeze

  def register
    @buffer = FileWatch::BufferedTokenizer.new(@delimiter)
```
Using the buffered tokenizer is tricky here, because CSV data can have literal newlines as long as they are quoted. By pre-tokenizing before handing off to the CSV parser that has the context, we effectively prevent legal literal newlines from being used.
Additionally, input plugins that already use the BufferedTokenizer (e.g. File Input) will strip the newlines from their input before passing off each entry to the codec.
I think another approach would be to use CSV::parse instead of CSV::parse_line, and then iterate over the resulting entries.
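A minimal sketch of that difference, using Ruby's stdlib CSV (the data here is illustrative, not from this PR):

```ruby
require 'csv'

data = "1,\"a\nb\"\n2,c\n"

# parse_line reads only the first record; the quoted newline is preserved,
# but the second record is silently dropped:
CSV.parse_line(data)  # => ["1", "a\nb"]

# parse returns every record:
CSV.parse(data)       # => [["1", "a\nb"], ["2", "c"]]
```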
TIL line breaks inside CSV values 😮
CSV in practice is a giant mess, because it developed organically without a formal specification over the course of decades. When one product encountered an edge case, they came up with a solution to their problem but often solved it in a way that introduced new and weirder edge cases (e.g., with no agreed-upon escape sequence, some solved the comma- and newline-in-field problem by adding quoting, which made quote characters magical and precluded our ability to pre-tokenize records).
Ruby's CSV implementation is pretty robust to the variety of data under the CSV umbrella, giving us options like `row_sep` to control the record delimiter, `col_sep` to control the field delimiter, `quote_char` to control how it understands quoted sequences, etc.
I included a recommendation in my previous review to use `CSV::parse` instead of `CSV::parse_line`, because it is capable of handling multiple entries but otherwise remains the same (the `parse_line` variant simply ignores any additional records). We can still pipe through the "delimiter" option to CSV's `row_sep` parameter, and it should handle quoted literal row separators for us.
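A minimal sketch of piping a delimiter through to `row_sep` (the `--` delimiter is hypothetical, chosen only to make the quoting visible):

```ruby
require 'csv'

# "--" stands in for the codec's delimiter option; the quoted field keeps
# its literal "--" because the parser has the quoting context
data = "a|\"b--c\"--d|e--"
CSV.parse(data, :col_sep => "|", :row_sep => "--")
# => [["a", "b--c"], ["d", "e"]]
```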
Yes, I did test `CSV::parse` and it does work correctly in handling both line breaks in columns and line breaks at end of line. It does not help for streaming input scenarios though: that is where the BufferedTokenizer is useful, but the BufferedTokenizer breaks the line-breaks-in-columns case. Will recap and continue the discussion in the main thread.
lib/logstash/codecs/csv.rb
```diff
  def parse(line, &block)
    begin
-     values = CSV.parse_line(data, :col_sep => @separator, :quote_char => @quote_char)
+     values = CSV.parse_line(line, :col_sep => @separator, :quote_char => @quote_char)
```
What about something like:

```ruby
CSV.parse(line, :col_sep => @separator, :quote_char => @quote_char).each do |values|
  next if values.nil?
  ## same implementation, using `next` instead of `return` when finding column names
end
```
@yaauie good catch about the line break that can be part of a quoted CSV value. This is a tricky one; it is not standardized and it seems like many implementations do not support it either, though the Ruby CSV library does. The problem we face here, as you pointed out (and this relates to the long-standing streaming vs line-oriented data question), is that with the BufferedTokenizer a column containing a line break will be split into two lines, so line breaks in columns will not work with the BufferedTokenizer. On the other hand, if we don't use the BufferedTokenizer then this codec will not work with streaming inputs like the tcp input. And as you also pointed out, when used with the file input, line breaks will already be processed (but note that if using the file input, line breaks in columns will not work either, regardless of the csv codec implementation). This is in fact very similar to the problem described in logstash-plugins/logstash-codec-multiline#63 where I suggested introducing a …
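To illustrate the split (a minimal sketch, not from the original thread):

```ruby
require 'csv'

chunk = "1,\"line one\nline two\"\n"

# pre-tokenizing on "\n" (what the BufferedTokenizer effectively does)
# yields two fragments, neither of which is valid CSV on its own:
chunk.split("\n")  # => ["1,\"line one", "line two\""]

# parsing the whole chunk keeps the quoted newline intact:
CSV.parse(chunk)   # => [["1", "line one\nline two"]]
```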
We also need to add a test that handles several independent lines without their trailing newline, to ensure that this plugin will continue to work as expected in the real world, since codecs are not guaranteed to be handed byte sequences that end with newlines (e.g., line-oriented inputs like the File Input strip the delimiter before handing off each sequence to the codec).
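A minimal spec sketch of that scenario (file name and expectations are assumptions, not from this PR):

```ruby
# spec/codecs/csv_spec.rb (hypothetical)
require "logstash/codecs/csv"

describe LogStash::Codecs::CSV do
  subject(:codec) { LogStash::Codecs::CSV.new }

  it "decodes independent lines handed off without their trailing newline" do
    events = []
    # line-oriented inputs (e.g. the file input) strip the delimiter first
    ["a,b,c", "d,e,f"].each do |line|
      codec.decode(line) { |event| events << event }
    end
    expect(events.size).to eq(2)
  end
end
```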
@yaauie let me recap the problems & alternatives we have:

Use cases

Solutions

Any other solution suggestions? If we decide to move forward with something like (1), we could also plan the following:

WDYT?
It is unfortunate that …. The other option is to push this new config to the "offending" inputs, providing them a way to declaratively include the delimiters. Since the current behaviour of this codec is to handle each line as an event, emitting a …. Here, I think something like:

```diff
diff --git a/lib/logstash/codecs/csv.rb b/lib/logstash/codecs/csv.rb
index 07d6416..186c8d5 100644
--- a/lib/logstash/codecs/csv.rb
+++ b/lib/logstash/codecs/csv.rb
@@ -21,6 +21,8 @@ class LogStash::Codecs::CSV < LogStash::Codecs::Base
   # Optional.
   config :separator, :validate => :string, :default => ","
 
+  config :delimiter, :validate => :string, :default => "\n"
+
   # Define the character used to quote CSV fields. If this is not specified
   # the default is a double quote `"`.
   # Optional.
@@ -109,14 +111,20 @@ class LogStash::Codecs::CSV < LogStash::Codecs::Base
   end
 
   def decode(data)
-    data = @converter.convert(data)
-    begin
-      values = CSV.parse_line(data, :col_sep => @separator, :quote_char => @quote_char)
+    data_io = StringIO.new(@converter.convert(data))
+    data_io.close_write
+    ack_position = 0
+    csv = CSV.new(data_io, :col_sep => @separator, :row_sep => @delimiter, :quote_char => @quote_char)
+
+    loop do
+      values = csv.readline
+      ack_position = data_io.pos
+      break if values.nil?
 
       if (@autodetect_column_names && @columns.empty?)
         @columns = values
         @logger.debug? && @logger.debug("Auto detected the following columns", :columns => @columns.inspect)
-        return
+        next
       end
 
       decoded = {}
@@ -130,10 +138,11 @@ class LogStash::Codecs::CSV < LogStash::Codecs::Base
       end
 
       yield LogStash::Event.new(decoded)
-    rescue CSV::MalformedCSVError => e
-      @logger.error("CSV parse failure. Falling back to plain-text", :error => e, :data => data)
-      yield LogStash::Event.new("message" => data, "tags" => ["_csvparsefailure"])
     end
+  rescue CSV::MalformedCSVError => e
+    data_io.seek(ack_position)
+    @logger.error("CSV parse failure. Falling back to plain-text", :error => e, :data => data)
+    yield LogStash::Event.new("message" => data_io.read, "tags" => ["_csvparsefailure"])
   end
 
   def encode(event)
```
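If I read the sketch right, tracking `ack_position` means that when `CSV::MalformedCSVError` is raised mid-buffer, the rescue seeks back to the end of the last successfully parsed record and emits only the unconsumed remainder as the `_csvparsefailure` plain-text event, instead of re-emitting rows that already produced events.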
@yaauie I don't think we actually need to re-add line breaks. You probably saw that in logstash-plugins/logstash-codec-multiline#63, but I think a better way would be to simply use or not use the BufferedTokenizer.

Do we agree on this strategy? If we do, I'll go ahead and refactor for this and then we can iterate review on the implementation details.
The subject of this PR is "support line delimited data"; from the specs you added, I take this to mean "when …".

This goal can be achieved without adding …. From what I can see, the only scenario in which adding a …
@yaauie obviously the understanding of the problem has evolved and the description and specs have not yet followed; that's why I tried to recap my understanding of the problem and submitted possible solutions. As I tried to explain, by using …, this IMO would provide a simple and cleaner path forward with what we have today, until we come up with a new processing framework (milling or else) at some point in the future.

Let me know if there are any objections to this plan.
Opened elastic/logstash#11885 for the broader discussion |
This is on hold until we conclude elastic/logstash#11885 |
- add new `delimiter` option with `\n` as default
- add new `input_type` config option for either `line` based data or `stream` based data, with `line` as default:
  - `input_type => line`: each data chunk provided by the input is considered a complete CSV line or multi-line document. In this mode line breaks in columns are supported. This will typically be used for inputs like `file` or `http`.
  - `input_type => stream`: CSV data can be incomplete in each data chunk and span multiple chunks, with the `delimiter` option identifying the data boundary. This will typically be used for inputs like `stdin` or `tcp`.

TODO

- use `StringIO` and `CSV.new` to better control parsing exceptions
- `line` and `stream` input types
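A sketch of how the proposed options might be exercised (option names as proposed above; none of this is implemented yet, and the chunk boundaries are illustrative):

```ruby
require "logstash/codecs/csv"

# hypothetical: input_type and delimiter are the options proposed in this PR
codec = LogStash::Codecs::CSV.new(
  "input_type" => "stream", # data may arrive in partial chunks (tcp, stdin)
  "delimiter"  => "\n"      # record boundary used to re-assemble stream data
)

events = []
# in stream mode, a record may span chunks; nothing would be emitted until
# the delimiter completes it
codec.decode("1,a")       { |event| events << event } # buffered, no event yet
codec.decode(",b\n2,c\n") { |event| events << event } # completes records 1 and 2
```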