
Remove split2, refactor Rows for better performance #108

Closed · wants to merge 5 commits

Conversation

slvrtrn (Contributor) commented Oct 2, 2022

After checking out https://github.com/go-faster/ch-bench, I wrote a simple test for our client:

import { createClient } from '@clickhouse/client'
;(async () => {
  const client = createClient()
  const rows = await client.query({
    query: 'SELECT number FROM system.numbers_mt LIMIT 500000000',
    format: 'TabSeparated',
  })
  const start = +new Date()
  for await (const _ of rows.stream()) {
    //
  }
  const end = +new Date()
  console.info(`Execution time: ${end - start} ms`)
})()

I ended up with an "amazing" result of ~280 seconds to execute this bit of code (Ryzen 9 5950X).

So, I did the following:

  • Replaced split2 with a simple self-written solution
  • Replaced a Row class created on every iteration with just a slim interface
  • Replaced return type of stream() method with AsyncGenerator<Row, void>, and now we have a proper row type here:
  for await (const row of rows.stream()) {
    await row.text() // <-- `row` is actually of type `Row` instead of `any`
  }

Before: 285 seconds
After: 96 seconds

Still far from perfect (I will investigate more), but that's a start.
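For reference, a minimal sketch of what such a self-written split2 replacement could look like (an assumed shape, not the PR's exact code): decode incoming byte chunks, split on newlines, and carry any partial trailing line over to the next chunk.

```typescript
// Assumed sketch of a split2 replacement: turn a byte stream into lines,
// carrying a partial trailing line over to the next chunk.
async function* splitLines(
  source: AsyncIterable<Uint8Array>
): AsyncGenerator<string, void> {
  const decoder = new TextDecoder()
  let buffer = ''
  for await (const chunk of source) {
    buffer += decoder.decode(chunk, { stream: true })
    let idx: number
    while ((idx = buffer.indexOf('\n')) !== -1) {
      yield buffer.slice(0, idx)
      buffer = buffer.slice(idx + 1)
    }
  }
  buffer += decoder.decode() // flush any bytes the decoder buffered
  if (buffer.length > 0) {
    yield buffer // trailing line without a newline terminator
  }
}
```

The key design point is that no per-row objects or intermediate streams are allocated; each chunk is scanned once and only string slices are produced.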

@slvrtrn slvrtrn requested a review from mshustov October 2, 2022 23:25
}
expect(last).toBe('9999')
})

Contributor Author:

Looks like we don't need this anymore. It can be paused (or not consumed) just on the application level.

})
} else {
expect(await selectPromise).toEqual(undefined)
}
Contributor Author:

Surprisingly enough, this is no longer the case for Node.js 18.x.

¯\_(ツ)_/¯

Member:

You removed destroy(), which was what caused the error. Maybe that's the reason?

}
expect(last).toBe('9999')
})

Contributor Author:

Probably not required anymore with async iterators. It can be paused (not consumed) on the application level, as it is a lazy evaluation.
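The lazy-evaluation point can be illustrated with a hypothetical generator (not library code): an async generator only runs when the consumer pulls a value, so "pausing" is simply not calling next().

```typescript
// Illustration only: an async generator is suspended at each yield until the
// consumer requests the next value, so not consuming it is an implicit pause.
async function* numbers(): AsyncGenerator<number, void> {
  let i = 0
  while (true) {
    yield i++ // suspended here until the consumer calls next()
  }
}

;(async () => {
  const it = numbers()
  console.log(await it.next()) // { value: 0, done: false }
  console.log(await it.next()) // { value: 1, done: false }
  // No further next() calls: the generator stays suspended indefinitely.
})()
```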

callback()
},
objectMode: true,
})
Contributor Author:

Not the best implementation for this (haven't tested how it behaves with large files), but at least it works without split2.

if (!line.length) {
return
} else {
const json = JSON.parse(line)
Contributor Author:

I think we should support already serialized data somehow. It looks strange to parse it here and then stringify it again in the library.

Member:

Let's insert data in CSV but query in JSONCompactEachRow? It doesn't change the logic much but simplifies the example

slvrtrn (Contributor) commented Oct 3, 2022

After some digging in the Node.js issues I found this, and indeed, the for await...of loop is about 30% slower than the .on('data', callback) approach on my machine.

Using current release version, with the old stream implementation with split2, this code

import { createClient } from '@clickhouse/client';
(async () => {
  const client = createClient()
  const rows = await client.query({
    query: 'SELECT number FROM system.numbers_mt LIMIT 50000000',
    format: 'TabSeparated'
  })
  const start = +new Date()
  const stream = rows.stream()
  stream.on('data', (_) => {
    //
  })
  await new Promise((resolve) => {
    stream.on('end', () => {resolve()})
  })
  const end = +new Date()
  console.info(`Execution time: ${end - start} ms`)
  await client.close()
})()

executes in 20 seconds, while this

import { createClient } from '@clickhouse/client';
(async () => {
  const client = createClient()
  const rows = await client.query({
    query: 'SELECT number FROM system.numbers_mt LIMIT 50000000',
    format: 'TabSeparated'
  })
  const start = +new Date()
  for await (const _ of rows.stream()) {
    //
  }
  const end = +new Date()
  console.info(`Execution time: ${end - start} ms`)
  await client.close()
})()

takes 30 seconds to finish.

So it might be beneficial to return Stream.Readable instead of an AsyncGenerator.
Only need to figure out how to make it faster :)
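One possible way to keep an AsyncGenerator internally while still exposing the faster event API is Node's built-in Readable.from. This is only a sketch with a stand-in generator, not the client's code:

```typescript
import { Readable } from 'stream'

// Stand-in row generator, for illustration only.
async function* rowsGenerator(): AsyncGenerator<string, void> {
  for (let i = 0; i < 3; i++) {
    yield `row-${i}`
  }
}

// Readable.from wraps any async iterable in a Readable (object mode by
// default), so consumers can use .on('data') without the generator changing.
const stream = Readable.from(rowsGenerator())
stream.on('data', (row: string) => console.log(row))
stream.on('end', () => console.log('done'))
```

A Readable produced this way still supports for await...of as well, so both consumption styles remain available to the caller.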

slvrtrn (Contributor) commented Oct 3, 2022

So, with just the allocations removed, keeping the Stream in place

return Stream.pipeline(
  this._stream,
  split((row: string) => ({ // <- no `new Row` here
    text: row, // <- this is not a function anymore
    json<T>() {
      return decode(row, 'JSON')
    },
  })),
  function pipelineCb(err) {
    if (err) {
      console.error(err)
    }
  }
)

using this code (50M numbers)

import { createClient } from '../src'
void (async () => {
  const client = createClient({
    compression: {
      request: false,
      response: false,
    },
  })
  const rows = await client.query({
    query: 'SELECT number FROM system.numbers_mt LIMIT 50000000',
    format: 'TabSeparated',
  })
  const start = +new Date()
  const stream = rows.stream()
  stream.on('data', (_) => {
    //
  })
  await new Promise((resolve) => {
    stream.on('end', () => {
      resolve(0)
    })
  })
  const end = +new Date()
  console.info(`Execution time: ${end - start} ms`)
})()

we can get as fast as ~3.5-4 seconds on my machine.
The current release takes ~18 seconds to execute the same code.

On 500M records instead of 50M, that's ~37-38 seconds vs ~280-300 seconds.

let decodedChunk = ''
for await (const chunk of this._stream) {
decodedChunk += textDecoder.decode(chunk, { stream: true })
let idx = 0
Member:

Nit: it isn't changed before this point, so let's move the declaration lower:

const idx = decodedChunk.indexOf('\n')

},
}
} else {
break
Member:

Let's add a special case at the top to remove a nesting level.

if (idx === -1) break;
const line = decodedChunk.slice(0, idx);
...

yield {
/**
* Returns a string representation of a row.
*/
Member:

These comments belong on the Row interface declared below: https://github.com/ClickHouse/clickhouse-js/pull/108/files#diff-b13826a37e4f93783b49eaca9c60dc1d124ee9d6b331be22244f41cc7bb09d39R94-R96
Otherwise the IDE doesn't show them in the method signature hint.

}
}
)
}
textDecoder.decode() // flush
Member:

Shouldn't it be consumed? The method typings show it as

    /**
     * Returns the result of running encoding's decoder. The method can be invoked zero or more times with options's stream set to true, and then once without options's stream (or set to false), to process a fragmented input. If the invocation without options's stream (or set to false) has no input, it's clearest to omit both arguments.
     *
     * ```
     * var string = "", decoder = new TextDecoder(encoding), buffer;
     * while(buffer = next_chunk()) {
     *   string += decoder.decode(buffer, {stream:true});
     * }
     * string += decoder.decode(); // end-of-queue
     * ```
     *
     * If the error mode is "fatal" and encoding's decoder returns error, throws a TypeError.
     */
    decode(input?: BufferSource, options?: TextDecodeOptions): string;
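A small standalone sketch of why that return value matters: the final decode() call flushes any bytes the decoder buffered across chunks, so discarding its result can silently drop the tail of the input.

```typescript
// Sketch (not the PR's code): the final decode() flushes buffered bytes,
// so its return value must be appended, not discarded.
const decoder = new TextDecoder()
let text = ''
// '€' is 0xE2 0x82 0xAC; the input ends mid-character so the flush has output.
const input = new Uint8Array([0xe2, 0x82, 0xac, 0xe2, 0x82])
text += decoder.decode(input, { stream: true }) // yields '€', buffers the tail
text += decoder.decode() // flush: emits U+FFFD for the truncated tail
```

In non-fatal mode the flush replaces the incomplete trailing sequence with U+FFFD; with a complete input it simply returns an empty string, so appending it is always safe.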


/**
* Returns a string representation of a row.
*/
text(): string {
Member:

Are you sure it's more lightweight than a Row class instance?
The method emits an object literal with 2 methods vs. a class with 2 methods in a prototype

class Row {
    text(){}
}
(new Row()).hasOwnProperty('text') // false
const objectLiteral = { 
    text(){}
}
objectLiteral.hasOwnProperty('text') // true

Contributor Author:

With the Row class:

split((row: string) => new Row(row, 'JSON')),

ts-node --transpile-only --project tsconfig.dev.json examples/many_numbers.ts
Execution time: 1957 ms

With the object literal:

split((row: string) => ({
  text: row,
  json<T>() {
    return decode(row, 'JSON')
  }
})),

ts-node --transpile-only --project tsconfig.dev.json examples/many_numbers.ts
Execution time: 394 ms

slvrtrn (Contributor) commented Oct 3, 2022

I will create a new PR with an updated implementation.

@slvrtrn slvrtrn closed this Oct 3, 2022
@slvrtrn slvrtrn deleted the performance branch October 4, 2022 19:19