
[WIP] use BufferedTokenizer and configurable line delimiter #63

Closed

Conversation

@colinsurprenant (Contributor) commented Sep 24, 2018

NOTE: This is a WIP to discuss the proposed strategy of using the BufferedTokenizer and a configurable line delimiter to extract lines, instead of a hardcoded split on \n.

This is essentially a reboot of #26; it would solve #14, #37, #38, #57, and logstash-plugins/logstash-input-stdin#16, and replace #6.

The Problem

For historical reasons, and because of the ambiguity between line-oriented and streaming inputs in our input/codec architecture, the multiline codec in its current state is an in-between for handling line-oriented and streaming data. It was originally meant for handling streaming line-delimited data, since it does a split("\n") on the input and thus assumes blobs of line-delimited text. But this is both redundant for already line-delimited input and broken for raw byte streams, as it does not properly support lines that span data blocks.
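The failure mode described above can be shown in a few lines (illustrative only, not plugin code): when a line straddles two reads, splitting each block independently emits the two halves as separate "lines".

```ruby
# A line that straddles two data blocks is broken in two by a per-block split.
chunks = ["first li", "ne\nsecond line\n"]
chunks.flat_map { |blob| blob.split("\n") }
# => ["first li", "ne", "second line"]   -- "first line" arrives as two events
```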

Proposal

To correctly handle streaming input for delimited data, using the BufferedTokenizer and adding a configurable line delimiter will provide functionality similar to the line codec.

Also, adding a streaming_input config option (defaulting to false for backward compatibility) will preserve current behaviour; setting it to true would add support for streaming inputs such as stdin, tcp, and udp. I believe this is a pragmatic proposal in today's ambiguous input/codec architecture. My last attempt at solving this was in 2016, and it was suggested we wait for the Milling concept to land. I do not think we need to wait for that to make this work in a practical way in our current, imperfect architecture.

Current WIP State

  • Using streaming_input => false (default) will keep current behaviour.

  • Using streaming_input => true will make it work with streaming inputs such as stdin, tcp, and udp.
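The core of the proposed approach can be sketched as follows. This is a minimal, self-contained illustration of what a BufferedTokenizer does, not the actual Logstash internals or plugin code: it accumulates partial data between calls and only yields complete, delimiter-terminated entities, so lines spanning data blocks are reassembled correctly.

```ruby
# Illustrative BufferedTokenizer sketch (class name and API are assumptions,
# not the real Logstash implementation).
class BufferedTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = +""
  end

  # Append a chunk and return only the complete lines seen so far;
  # a trailing partial line stays buffered until more data arrives.
  def extract(data)
    @buffer << data
    lines = @buffer.split(@delimiter, -1)
    @buffer = lines.pop # keep the incomplete remainder
    lines
  end

  # Return whatever is left, e.g. on EOF / codec flush.
  def flush
    remainder, @buffer = @buffer, +""
    remainder
  end
end

tok = BufferedTokenizer.new
tok.extract("foo\nba")  # => ["foo"]   ("ba" stays buffered)
tok.extract("r\nbaz")   # => ["bar"]
tok.flush               # => "baz"
```

With streaming_input => true, the codec would feed incoming blocks through such a tokenizer before applying the multiline pattern matching; with the default false, behaviour is unchanged.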

@colinsurprenant
Contributor Author

@jsvd @guyboertje I would appreciate your input!

@guyboertje
Contributor

Initial impressions: looks good. I think we should discuss this POC in EAH. I have no intentions of raising the original Milling concept. Perhaps we can talk about a pluggable boundary detector setting in true codecs.

@colinsurprenant
Contributor Author

@guyboertje my goal here is to offer a solution with what we have today. I am +1 on investigating a better solution that could be applied to all inputs and codecs, but -1 on waiting for it to be fleshed out. I believe this proposal is simple enough, and backward compatible, to be considered today. My guess is that whatever we decide for better codec/input boundary detection, it will probably be a 7.0 feature.

@colinsurprenant
Contributor Author

What would be the potential problem in moving forward with this solution?

@guyboertje
Contributor

@colinsurprenant
None I can see, bar a test or two to verify the streaming flag.

@sovetov

sovetov commented Oct 16, 2018

This issue is critical for me now, as the only way to collect my multiline text logs is to upload files over raw TCP.

Is the following the correct way to use the code from this repo?

  • Clone this repo (or my fork):
  • Edit /usr/share/logstash/Gemfile
- gem "logstash-codec-multiline"
+ gem "logstash-codec-multiline", :path => "/home/george/logstash-codec-multiline"
  • Run bin/logstash-plugin install --no-verify (where does the plugin name go?)
  • Restart logstash.

@sovetov

sovetov commented Oct 16, 2018

And there is another question that may seem unrelated.

If I use the json codec and upload over a raw TCP stream, will it work? If so, how is this issue solved there? Just based on the syntax of JSON?

@colinsurprenant
Contributor Author

@sovetov you should be able to use the plugin from the repo by editing the Gemfile in the Logstash home directory as you suggested, but there is no need to run bin/logstash-plugin ... afterwards; just restart Logstash.

@colinsurprenant
Contributor Author

@sovetov I am not sure I understand your second question about the json codec.
Are you asking if the json codec will support streaming data? No, it won't: the json codec expects a complete and valid JSON object when decoding, and will not work well with a tcp or udp stream.

On the other hand, with a streaming input you can use the json_lines codec, which uses a BufferedTokenizer and has a configurable delimiter.
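The json_lines idea can be sketched as: buffer the byte stream, cut it on a configurable delimiter, and JSON-parse each complete line. The buffering below is hand-rolled for illustration, and the class name is an assumption; the real codec uses Logstash's BufferedTokenizer internally.

```ruby
require "json"

# Illustrative json_lines-style decoder: delimiter-based framing plus
# per-line JSON parsing. Not the actual plugin code.
class JsonLinesDecoder
  def initialize(delimiter: "\n")
    @delimiter = delimiter
    @buffer = +""
  end

  def decode(data)
    @buffer << data
    lines = @buffer.split(@delimiter, -1)
    @buffer = lines.pop # incomplete trailing line waits for more data
    lines.map { |line| JSON.parse(line) }
  end
end

dec = JsonLinesDecoder.new
dec.decode(%Q({"a":1}\n{"b))  # => [{"a"=>1}]   (partial object stays buffered)
dec.decode(%Q(":2}\n))        # => [{"b"=>2}]
```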

@colinsurprenant
Contributor Author

Bump: any objection to moving this forward, @guyboertje @jsvd?

@guyboertje
Contributor

Let's merge it. :shipit:

@remiville

I faced this issue, and to limit it while keeping multiline for stacktraces, I used mutate to remove newlines only for inputs other than stacktraces.
Please merge the fix.

@TheVastyDeep

There is another use case where a codec does not interact well with line detection by the input: UTF-16. The file input will consume half a character when it consumes the \n, leaving the rest of the file effectively flipped from UTF-16BE to UTF-16LE.
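This hazard is easy to demonstrate (assumed scenario, not plugin code): in UTF-16LE every code unit is two bytes, and "\n" encodes as 0x0A 0x00, so splitting the raw byte stream on the single byte 0x0A strands the trailing 0x00 and shifts every subsequent byte by one.

```ruby
# "ab" and "cd" in UTF-16LE, joined by a UTF-16LE newline, then split on
# the raw single byte 0x0A -- as a byte-oriented line splitter would do.
line1  = "ab".encode("UTF-16LE")
line2  = "cd".encode("UTF-16LE")
stream = line1.b + "\n".encode("UTF-16LE").b + line2.b

first, rest = stream.split("\n".b, 2)
first.bytes  # => [97, 0, 98, 0]       ("ab" survives intact)
rest.bytes   # => [0, 99, 0, 100, 0]   (stray 0x00 prefix misaligns "cd")
```

The leftover 0x00 prefix pairs each following high byte with the wrong low byte, which is exactly the BE/LE flip described above.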

@imnotteixeira

Hi, I've been banging my head against a wall trying to understand why my lines were being broken mid-line. I'm really glad I finally found this, as it seems to be the fix. Is there any ETA for merge/release?

@colinsurprenant
Contributor Author

Depending on what we decide in logstash-plugins/logstash-codec-csv#8, I'll follow up here.

@colinsurprenant
Contributor Author

Opened elastic/logstash#11885 for the broader discussion.

@colinsurprenant
Contributor Author

Closing; we can reopen when consensus is reached on how to solve this.
