
[WIP] use BufferedTokenizer and configurable line delimiter #63

Closed

Conversation

@colinsurprenant (Contributor) commented Sep 24, 2018

NOTE: This is a WIP to discuss the proposed strategy of using the BufferedTokenizer and a configurable line delimiter to extract lines, instead of a hardcoded split on \n.

This is essentially a reboot of #26; it would solve #14, #37, #38, #57, and logstash-plugins/logstash-input-stdin#16, and replace #6.

The Problem

For historical reasons, and because of the ambiguity between line-oriented and streaming inputs in our input/codec architecture, the multiline codec in its current state is an in-between for handling line-oriented and streaming data. It was originally meant for handling streaming line-delimited data, since it does a split("\n") on the input and thus assumes blobs of line-delimited text. But this is both redundant for already line-delimited input and broken for raw byte streams, as it does not properly support lines that span data blocks.
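The failure mode described above can be shown in a few lines (illustrative only, not plugin code): when a line straddles two reads, splitting each block independently emits the two halves as separate "lines".

```ruby
# A line that straddles two data blocks is broken in two by a per-block split.
chunks = ["first li", "ne\nsecond line\n"]
chunks.flat_map { |blob| blob.split("\n") }
# => ["first li", "ne", "second line"]   -- "first line" arrives as two events
```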

Proposal

To correctly handle streaming input for delimited data, using the BufferedTokenizer and adding a configurable line delimiter will provide functionality similar to the line codec.

Also, adding a streaming_input config option (defaulting to false for backward compatibility) will preserve current behaviour; setting it to true would add support for streaming inputs such as stdin, tcp, and udp. I believe this is a pragmatic proposal in today's ambiguous input/codec architecture. My last attempt at solving this was in 2016, and it was suggested we wait for the Milling concept to land. I do not think we need to wait for that to make this work in a practical way in our current, imperfect architecture.

Current WIP State

  • Using streaming_input => false (default) will keep current behaviour.

  • Using streaming_input => true will make it work with streaming inputs such as stdin, tcp, and udp.
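The core of the proposed approach can be sketched as follows. This is a minimal, self-contained illustration of what a BufferedTokenizer does, not the actual Logstash internals or plugin code: it accumulates partial data between calls and only yields complete, delimiter-terminated entities, so lines spanning data blocks are reassembled correctly.

```ruby
# Illustrative BufferedTokenizer sketch (class name and API are assumptions,
# not the real Logstash implementation).
class BufferedTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = +""
  end

  # Append a chunk and return only the complete lines seen so far;
  # a trailing partial line stays buffered until more data arrives.
  def extract(data)
    @buffer << data
    lines = @buffer.split(@delimiter, -1)
    @buffer = lines.pop # keep the incomplete remainder
    lines
  end

  # Return whatever is left, e.g. on EOF / codec flush.
  def flush
    remainder, @buffer = @buffer, +""
    remainder
  end
end

tok = BufferedTokenizer.new
tok.extract("foo\nba")  # => ["foo"]   ("ba" stays buffered)
tok.extract("r\nbaz")   # => ["bar"]
tok.flush               # => "baz"
```

With streaming_input => true, the codec would feed incoming blocks through such a tokenizer before applying the multiline pattern matching; with the default false, behaviour is unchanged.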

@colinsurprenant
Contributor Author

@jsvd @guyboertje I would appreciate your input!

@guyboertje
Contributor

Initial impressions: looks good. I think we should discuss this POC in EAH. I have no intentions of raising the original Milling concept. Perhaps we can talk about a pluggable boundary detector setting in true codecs.

@colinsurprenant
Contributor Author

@guyboertje my goal here is to offer a solution with what we have today. I am +1 on investigating a better solution that could be applied to all inputs and codecs, but -1 on waiting for it to be fleshed out. I believe this proposal is simple enough, and backward compatible, to be considered today. My guess is that whatever we decide for better codec/input boundary detection, it will probably be a 7.0 feature.

@colinsurprenant
Contributor Author

What would be the potential problem in moving forward with this solution?

@guyboertje
Contributor

@colinsurprenant
None I can see, bar a test or two to verify the streaming flag.

@sovetov

sovetov commented Oct 16, 2018

This issue is critical for me now, as the only way to collect my multiline text logs is to upload files over raw TCP.

Is the following the correct way to use the code from this repo?

  • Clone this repo (or my fork):
  • Edit /usr/share/logstash/Gemfile
- gem "logstash-codec-multiline"
+ gem "logstash-codec-multiline", :path => "/home/george/logstash-codec-multiline"
  • Run bin/logstash-plugin install --no-verify (where does the plugin name go?)
  • Restart logstash.

@sovetov

sovetov commented Oct 16, 2018

And there is another question that may seem unrelated.

If I use the json codec and upload over a raw TCP stream, will it work? If so, how is this issue solved there? Just based on the syntax of JSON?

@colinsurprenant
Contributor Author

@sovetov you should be able to use the plugin from the repo by editing the Gemfile in the Logstash home directory as you suggested, but there is no need to run bin/logstash-plugin ... afterwards; just restart Logstash.

@colinsurprenant
Contributor Author

@sovetov I am not sure I understand your second question about the json codec.
Are you asking if the json codec will support streaming data? No, it won't: the json codec expects a complete and valid JSON object when decoding, and will not work well with a tcp or udp stream.

On the other hand, with a streaming input you can use the json_lines codec, which uses a BufferedTokenizer and has a configurable delimiter.
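The json_lines idea can be sketched as: buffer the byte stream, cut it on a configurable delimiter, and JSON-parse each complete line. The buffering below is hand-rolled for illustration, and the class name is an assumption; the real codec uses Logstash's BufferedTokenizer internally.

```ruby
require "json"

# Illustrative json_lines-style decoder: delimiter-based framing plus
# per-line JSON parsing. Not the actual plugin code.
class JsonLinesDecoder
  def initialize(delimiter: "\n")
    @delimiter = delimiter
    @buffer = +""
  end

  def decode(data)
    @buffer << data
    lines = @buffer.split(@delimiter, -1)
    @buffer = lines.pop # incomplete trailing line waits for more data
    lines.map { |line| JSON.parse(line) }
  end
end

dec = JsonLinesDecoder.new
dec.decode(%Q({"a":1}\n{"b))  # => [{"a"=>1}]   (partial object stays buffered)
dec.decode(%Q(":2}\n))        # => [{"b"=>2}]
```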

@colinsurprenant
Contributor Author

Bump: any objection to moving this forward, @guyboertje @jsvd?

@guyboertje
Contributor

Let's merge it. :shipit:

@remiville

I faced this issue, and to limit it while keeping multiline for stacktraces, I used mutate to remove newlines only for inputs other than stacktraces.
Please merge the fix.

@TheVastyDeep

There is another use case where a codec does not interact well with line detection by the input: UTF-16. The file input will consume half a character when it consumes the \n, leaving the rest of the file effectively flipped from UTF-16BE to UTF-16LE.
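This hazard is easy to demonstrate (assumed scenario, not plugin code): in UTF-16LE every code unit is two bytes, and "\n" encodes as 0x0A 0x00, so splitting the raw byte stream on the single byte 0x0A strands the trailing 0x00 and shifts every subsequent byte by one.

```ruby
# "ab" and "cd" in UTF-16LE, joined by a UTF-16LE newline, then split on
# the raw single byte 0x0A -- as a byte-oriented line splitter would do.
line1  = "ab".encode("UTF-16LE")
line2  = "cd".encode("UTF-16LE")
stream = line1.b + "\n".encode("UTF-16LE").b + line2.b

first, rest = stream.split("\n".b, 2)
first.bytes  # => [97, 0, 98, 0]       ("ab" survives intact)
rest.bytes   # => [0, 99, 0, 100, 0]   (stray 0x00 prefix misaligns "cd")
```

The leftover 0x00 prefix pairs each following high byte with the wrong low byte, which is exactly the BE/LE flip described above.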

@imnotteixeira

Hi, I've been banging my head against a wall trying to understand why my lines were being broken mid-line. I'm really glad I finally found this, as it seems to be the fix. Is there any ETA for merge/release?

@colinsurprenant
Contributor Author

Depending on what we decide in logstash-plugins/logstash-codec-csv#8, I'll follow up here.

@colinsurprenant
Contributor Author

Opened elastic/logstash#11885 for the broader discussion.

@colinsurprenant
Contributor Author

Closing; we can reopen when consensus is reached on how to solve this.
