[WIP] use BufferedTokenizer and configurable line delimiter #63
@jsvd @guyboertje I would appreciate your input! |
Initial impressions: looks good. I think we should discuss this POC in EAH. I have no intentions of raising the original Milling concept. Perhaps we can talk about a pluggable boundary detector setting in true codecs. |
@guyboertje my goal here is to offer a solution with what we have today. I am +1 on investigating for a better solution which could be applied to all inputs and codecs but -1 on waiting for it to be fleshed out. I believe this proposal is simple enough & BWC to be considered today. My guess is that whatever we decide for better codecs/inputs boundary detection, it will probably be a 7.0 feature. |
What would be the potential problem in moving forward with this solution? |
@colinsurprenant |
This issue is critical for me now, as the only way to collect my multiline text logs is to upload files as raw TCP. Is the following the correct way to use the code from this repo?
|
And there is another question that may seem unrelated. If I use |
@sovetov you should be able to use the plugin from the repo by editing the |
@sovetov I am not sure I understand your second question about the On the other hand, with streaming input, you can use the |
bump, any objection to moving this forward @guyboertje @jsvd ? |
Let's merge it. |
I faced this issue and to limit it while having multiline for stacktraces, I used mutate to remove new lines only for inputs different than stacktraces. |
There is another use case where a codec does not interact well with line detection by the input. That is UTF-16. The file input will read half a character when it consumes the \n, leaving the rest of the file effectively flipped from UTF-16BE to UTF-16LE. |
Hi, I've been banging my head into a wall trying to understand why my lines were being broken mid-line. I'm really glad I finally found this, as it seems to be the fix. Is there any ETA for merge/release? |
Depending on what we decide in logstash-plugins/logstash-codec-csv#8 I'll followup here. |
Opened elastic/logstash#11885 for the broader discussion |
Closing; we can reopen once consensus is reached on how to solve this. |
NOTE: This is a WIP to discuss the proposed strategy of using the `BufferedTokenizer` and a configurable line delimiter to extract lines instead of using a hardcoded split on `\n`.
This is essentially a reboot of #26; it would solve #14, #37, #38, #57, and logstash-plugins/logstash-input-stdin#16, and replace #6.
The Problem
For historical reasons, and because of the ambiguity between line-oriented and streaming inputs in our input/codec architecture, the `multiline` codec in its current state is an in-between for handling line-oriented and streaming data. It was actually meant for handling streaming line-delimited data, since it does a `split("\n")` on the input and thus assumes blobs of line-delimited text. But this is useless for input that is already line-delimited, and also useless for text-bytes input, as it does not properly support lines that span data blocks.
Proposal
To correctly handle streaming input for delimited data, using the `BufferedTokenizer` and adding a configurable line delimiter will provide functionality similar to the `line` codec.
Also, adding a `streaming_input` config option (with a `false` default for BWC) will preserve current behaviour. Using `true` would provide support for streaming inputs such as `stdin`, `tcp`, `udp`. I believe this is a pragmatic proposal in today's ambiguous input/codec architecture. My last attempt at solving this was in 2016 and it was suggested we wait on the Milling concept to land. I do not think we need to wait for that to make it work in a practical way in our current imperfect architecture.
Current WIP State
Using `streaming_input => false` (default) will keep current behaviour.
Using `streaming_input => true` will make it work with streaming inputs such as `stdin`, `tcp`, `udp`.
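To illustrate why splitting each data block on `\n` breaks lines that span blocks, and how buffering across blocks avoids it, here is a minimal Ruby sketch. `SketchTokenizer` is a hypothetical stand-in written for this example; it is not the `BufferedTokenizer` class used by the plugin, and the chunk contents are made up.

```ruby
# Hypothetical stand-in for a buffered tokenizer (an assumption for
# illustration, not the Logstash BufferedTokenizer implementation).
class SketchTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = +""
  end

  # Append a chunk and return only the complete lines seen so far.
  # The trailing partial line, if any, stays in the buffer.
  def extract(chunk)
    @buffer << chunk
    # A limit of -1 keeps a trailing empty string, so we can tell a chunk
    # that ends exactly on a delimiter from one that ends mid-line.
    parts = @buffer.split(@delimiter, -1)
    @buffer = parts.pop || +""
    parts
  end

  # Return whatever partial line remains and clear the buffer.
  def flush
    rest = @buffer
    @buffer = +""
    rest
  end
end

# Made-up data blocks, as a streaming input such as tcp might deliver them.
chunks = ["first li", "ne\nsecond line\nthi", "rd line\n"]

# Per-block split: lines that cross a block boundary come out broken.
naive = chunks.flat_map { |c| c.split("\n") }

# Buffered extraction: partial lines are held until their delimiter arrives.
tokenizer = SketchTokenizer.new("\n")
buffered = chunks.flat_map { |c| tokenizer.extract(c) }
leftover = tokenizer.flush
```

With these chunks, the per-block split yields fragments such as `"first li"` and `"ne"`, while the buffered version yields the three whole lines. The delimiter is a constructor argument here only to mirror the idea of a configurable line delimiter; the actual option name in the plugin is not stated above.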