Enable request body streaming with an IO object #409

janko · 2017-05-01T09:03:24Z

This change adds the ability to stream content of an IO object (an object that responds to #read) into the request body. This is convenient when you want to send large amounts of data as the request body, but you don't want to load all of it into memory.

HTTP.post("http://example.org/upload", body: File.open("video.mp4"))
HTTP.post("http://example.org/upload", body: StringIO.new("file data"))

It was already possible to send the request body in chunks by passing an Enumerable to :body, but that can (and should) only be used for making "Transfer-Encoding: chunked" requests.

This will also allow us to extend form_data.rb to stream the multipart-encoded request, instead of having to load the whole body into memory before writing it to the socket.
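To make the memory argument concrete, here is a minimal sketch of what streaming an IO into a socket amounts to. It is not the code from this PR: the `stream_body` helper, the 16 KB default, and the StringIO standing in for the socket are all assumptions made for illustration.

```ruby
require "stringio"

# Hypothetical helper: copies everything from +io+ to +socket+ in fixed-size
# pieces, so at most +chunk_size+ bytes of the body are held in memory at once.
# Returns the number of bytes written.
def stream_body(io, socket, chunk_size = 16 * 1024)
  written = 0
  while (chunk = io.read(chunk_size)) # nil at EOF ends the loop
    socket.write(chunk)
    written += chunk.bytesize
  end
  written
end

body   = StringIO.new("x" * 40_000) # stands in for File.open("video.mp4")
socket = StringIO.new               # stands in for a TCP socket
bytes_written = stream_body(body, socket)
```

The only thing the helper needs from the body is `#read(length)`, which is why any IO-like object can be passed in.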

ixti · 2017-05-01T12:26:21Z

lib/http/request/writer.rb

-          @request_header << "#{Headers::CONTENT_LENGTH}: 0"
-        elsif @body.is_a?(Enumerable) && CHUNKED != @headers[Headers::TRANSFER_ENCODING]
-          raise(RequestError, "invalid transfer encoding")
+        unless @headers[Headers::CONTENT_LENGTH]


It's better to use a guard here:

return if @headers[Headers::CONTENT_LENGTH]

I agree, updated in abe1b08.

ixti · 2017-05-01T12:30:45Z

lib/http/request/writer.rb

+          if @body.is_a?(String)
+            @request_header << "#{Headers::CONTENT_LENGTH}: #{@body.bytesize}"
+          elsif @body.respond_to?(:read)
+            @request_header << "#{Headers::CONTENT_LENGTH}: #{@body.size}"


If we are checking respond_to?(:read), shouldn't we also check respond_to?(:size) here? I believe the best option would be:

if @body.respond_to?(:bytesize)
  @request_header << "#{Headers::CONTENT_LENGTH}: #{@body.bytesize}"
elsif @body.respond_to?(:size) && @body.respond_to?(:read)
  @request_header << "#{Headers::CONTENT_LENGTH}: #{@body.size}"
elsif @body.nil?
  @request_header << "#{Headers::CONTENT_LENGTH}: 0"
elsif ...
end

@httprb/core please take a look if you see any gotchas here

The reason I wasn't checking whether the IO object responds to #size is that I wanted #size to be required when Content-Length isn't passed in, because the Content-Length header is required for non-chunked requests (I would document this requirement in the wiki).

I mean you are checking for the #read method but then calling #size, which might not be available:

reader, writer = IO.pipe
reader.respond_to? :read # => true
reader.respond_to? :size # => false

Oh. I got what you mean. I guess then we should raise an exception:

elsif @body.respond_to? :read
  raise "IO object must respond to #size" unless @body.respond_to? :size
  # ...

Yes, when the read end of an IO.pipe is passed in and Content-Length is not set explicitly, the request will fail with a NoMethodError because pipes don't respond to #size. This can be solved by passing Content-Length explicitly:

reader, writer = IO.pipe
writer.write("content")
writer.close
HTTP.headers("Content-Length" => 7).post("http://example.org", body: reader)

If the exception isn't raised when Content-Length is absent and the IO doesn't respond to #size, the user could accidentally send the request without a Content-Length, which isn't a valid HTTP request (unless it's a chunked request).
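The rule being discussed here can be sketched as a small helper. This is only an illustration of the behaviour under discussion, not the code in writer.rb: `content_length_for` is a hypothetical name, and `ArgumentError` is used in place of whatever error class the library raises.

```ruby
require "stringio"

# Hypothetical helper: returns the value for the Content-Length header,
# or raises when it cannot be determined from the body.
def content_length_for(body)
  if body.nil?
    0
  elsif body.is_a?(String)
    body.bytesize # bytes, not characters
  elsif body.respond_to?(:read)
    unless body.respond_to?(:size)
      raise ArgumentError, "IO object must respond to #size"
    end
    body.size
  else
    raise ArgumentError, "unsupported body type: #{body.class}"
  end
end

string_length = content_length_for("h\u00e9llo")           # 6, the accented "e" is two bytes in UTF-8
io_length     = content_length_for(StringIO.new("data"))  # 4

reader, writer = IO.pipe
begin
  content_length_for(reader)
rescue ArgumentError => e
  error_message = e.message # "IO object must respond to #size"
ensure
  reader.close
  writer.close
end
```

Raising early keeps a request without Content-Length (and without chunked encoding) from ever reaching the socket.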

The error is currently just a NoMethodError, which probably doesn't communicate the problem well to the user. Would it help if I improved the error message? Or do you disagree with this behaviour in general?

I saw your last comment only after I posted the previous one, great, I'll update the error message then.

Updated in e53c71f

ixti

LGTM!

ixti · 2017-05-01T14:36:39Z

Let's wait for input (if any) of @httprb/core and then I'll fix rubocop and merge!

tarcieri · 2017-05-01T15:08:08Z

IO is Enumerable:

http://ruby-doc.org/core-2.0.0/IO.html

I'm curious why this wasn't working with the existing Enumerable support

janko · 2017-05-01T15:38:58Z

@tarcieri While IO objects can be streamed using their Enumerable interface, there are several limitations of this:

First, HTTP.rb's Enumerable support implies Transfer-Encoding: chunked requests, so currently it's not possible to use it for streaming the body of a regular request.

Second, IO#each iterates over lines of the file, which means that if we have a big binary file which doesn't contain a single newline, the first "line" will be the whole content, so the whole file will potentially be loaded into memory during upload.

Third, by relying only on #read and #size, we allow a much broader array of inputs that aren't necessarily descendants of IO, such as:

Tempfile (it's just a delegator to the underlying File object, so it doesn't implement Enumerable)
ActionDispatch::Http::UploadedFile (a delegator to the underlying Tempfile object)
FormData::Multipart which I plan to modify to respond to #read for streaming multipart uploads
Rack input of any web server (Unicorn::StreamInput, Passenger::TeeInput, ...)
Shrine::UploadedFile from the Shrine gem (and also IO objects created by data_uri and rack_file plugins)
Down::ChunkedIO from the Down gem
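The second and third points can be checked with the standard library alone. The snippet below is a sketch for illustration (the 50 KB payload and 16 KB read size are arbitrary choices): it compares line-based iteration with size-based reads, and shows that a Tempfile is not an Enumerable even though it responds to #read and #size.

```ruby
require "stringio"
require "tempfile"

binary = "\x00".b * 50_000 # 50 KB with no newline anywhere

# Line-based iteration yields the whole content as a single "line".
lines = StringIO.new(binary).each.to_a
lines.size           # 1
lines.first.bytesize # 50000

# Size-based reads keep every piece bounded.
io = StringIO.new(binary)
chunks = []
while (chunk = io.read(16 * 1024))
  chunks << chunk
end
chunks.map(&:bytesize) # [16384, 16384, 16384, 848]

# Tempfile delegates to a File, so it is not an Enumerable,
# yet it has the two methods that matter here.
tempfile = Tempfile.new("upload")
duck = [
  tempfile.is_a?(Enumerable),  # false
  tempfile.respond_to?(:read), # true
  tempfile.respond_to?(:size)  # true
]
tempfile.close! # close and delete the temporary file
```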

ixti · 2017-05-01T15:47:40Z

We definitely need to improve work with IO bodies... so that it will allow streaming chunked data of undetermined length (e.g. stream of JSON documents)

janko · 2017-05-01T15:50:31Z

We definitely need to improve work with IO bodies... so that it will allow streaming chunked data of undetermined length (e.g. stream of JSON documents)

Hmm, isn't that what we already have with the Enumerable support?

ixti · 2017-05-01T16:24:20Z

@janko-m I guess so:

require "http"
require "yaml"

rd, wd = IO.pipe

writer = Thread.new(wd) do |io|
  Thread.stop

  puts "READER: Start feeding data"

  %w(a b c).each do |char|
    io.write(%({"#{char}":#{rand}}\n))
    sleep 1
  end

  puts "READER: All data was feeded"

  io.close
end

reader = Thread.new(rd) do |io|
  client   = HTTP.headers({ "Transfer-Encoding" => "chunked" })

  puts "WRITER: start post request"

  response = client.post("http://httpbin.org/post", :body => io).flush

  puts "WRITER: post request finished"

  puts "== RESPONSE ==\n#{response.parse.to_yaml}"
end

sleep 0.1 until "sleep" == writer.status

writer.wakeup
reader.join

ixti · 2017-05-01T16:31:13Z

Ignore me! It's httpbin that doesn't understand chunked requests. I tried it with netcat and everything is correct:

$ ncat -l 8080
POST / HTTP/1.1
Transfer-Encoding: chunked
Connection: close
Host: localhost:8080
User-Agent: http.rb/2.2.2

19
{"a":0.6089059717403343}

19
{"b":0.7655504586106395}

19
{"c":0.6511286447469131}

0
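The framing visible in the netcat output can be reproduced in a few lines. This is a sketch of standard HTTP/1.1 chunked framing rather than http.rb's own code; `encode_chunk` and `last_chunk` are hypothetical helper names.

```ruby
# Hypothetical helper: frames one piece of data as an HTTP/1.1 chunk,
# i.e. the size in hexadecimal, CRLF, the data, CRLF.
def encode_chunk(data)
  "#{data.bytesize.to_s(16)}\r\n#{data}\r\n"
end

# The zero-length chunk that terminates a chunked body.
def last_chunk
  "0\r\n\r\n"
end

documents = [
  %({"a":0.6089059717403343}\n),
  %({"b":0.7655504586106395}\n),
  %({"c":0.6511286447469131}\n)
]

# Each document is 25 bytes including its trailing newline, and 25 is 19 in hex.
wire = documents.map { |doc| encode_chunk(doc) }.join + last_chunk
```

Because each document already ends with a newline, the CRLF that closes the chunk shows up as the blank line after every document in the output above.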

britishtea · 2017-05-01T16:50:06Z

I think HTTP::Request::Writer#validate_body_type! should also be updated. Currently it will raise when given a body that responds to #read and #size but is not Enumerable. Perhaps change the test such that it doesn't provide an Enumerable?

Looks like the else clause in #send_request can also be removed, since #validate_body_type! is called before.

Could bodies with a #read but without a #size (or a #size returning nil) be handled as Enumerable bodies are right now?

ixti · 2017-05-01T16:56:45Z

IMHO it's worth creating a Request::Body wrapper class that will encapsulate and hide away all the logic of content-length calculation, type validation and streaming.

janko · 2017-05-01T17:22:06Z

I think HTTP::Request::Writer#validate_body_type! should also be updated. Currently it will raise when given a body that responds to #read and #size but is not Enumerable. Perhaps change the test such that it doesn't provide an Enumerable?

You're right, the current code wouldn't work with an IO object which isn't Enumerable, I need to add a test that uses something other than a StringIO.

Could bodies with a #read but without a #size (or a #size returning nil) be handled as Enumerable bodies are right now?

You mean, that if a body doesn't respond to #size, that it implies a Transfer-Encoding: chunked request (with the corresponding encoding)? Yeah, we could convert such an IO into an Enumerable, and have the same case clause which handles Enumerable handle it.

Enumerator.new do |yielder|
  loop do
    data = @io.read(Connection::BUFFER_SIZE, buffer ||= String.new)
    break if data.nil?
    yielder << data
  end
end

IMHO it's worth creating a Request::Body wrapper class that will encapsulate and hide away all the logic of content-length calculation, type validation and streaming.

I'm sold 💰

janko · 2017-05-02T07:57:52Z

I extracted body-related behaviour into a Request::Body class, which deals with validating input, determining the size of the body, and retrieving content to be written to the socket. It doesn't know anything about request headers, chunked encodings or sockets.

During this refactoring, I implemented @britishtea's suggestion of allowing Transfer-Encoding: chunked to be used with an IO object. With the extraction of Request::Body it was simpler to generalize how we treat the scenario when Transfer-Encoding: chunked is set, to support chunked requests using any body type, not just Enumerable. I like this change.

I also modified tests to match all of the content that was written to the socket, not just prefixes and suffixes, so that I have greater confidence that I didn't break anything. I also tested streaming to the socket with both Enumerable and non-Enumerable IOs, to verify that Enumerable IOs are treated as IOs, not Enumerables. I'm not sure whether this is a breaking change, because it changes how the chunked request body is split up for Enumerable IOs, but then again the way the body is split into chunks shouldn't matter, as long as all of the content is sent.
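A rough sketch of such a wrapper, under stated assumptions, might look like the following. `BodySketch`, its 16 KB default and its use of `ArgumentError` are hypothetical and only illustrate the responsibilities described above (validation, size, and an #each that reads IOs by size even when they are also Enumerable); this is not the class from this PR.

```ruby
require "stringio"

# Hypothetical stand-in for a Request::Body-style wrapper.
class BodySketch
  def initialize(source, chunk_size = 16 * 1024)
    @source = source
    @chunk_size = chunk_size
    validate!
  end

  # Number of bytes in the body; only defined for nil, String and sized IOs.
  def size
    if @source.nil?
      0
    elsif @source.is_a?(String)
      @source.bytesize
    elsif @source.respond_to?(:read)
      raise ArgumentError, "IO object must respond to #size" unless @source.respond_to?(:size)
      @source.size
    else
      raise ArgumentError, "cannot determine size of #{@source.class}"
    end
  end

  # Yields the content piece by piece, whatever the source type is.
  def each(&block)
    return enum_for(:each) unless block_given?

    if @source.is_a?(String)
      yield @source
    elsif @source.respond_to?(:read)
      # Checked before Enumerable so that Enumerable IOs are read by size,
      # not line by line.
      while (data = @source.read(@chunk_size))
        yield data
      end
    elsif @source.is_a?(Enumerable)
      @source.each(&block)
    end
  end

  private

  def validate!
    return if @source.nil? || @source.is_a?(String)
    return if @source.respond_to?(:read) || @source.is_a?(Enumerable)
    raise ArgumentError, "body of wrong type: #{@source.class}"
  end
end

from_string = BodySketch.new("abc").each.to_a                     # ["abc"]
from_io     = BodySketch.new(StringIO.new("abcdef"), 4).each.to_a # ["abcd", "ef"]
from_enum   = BodySketch.new(%w[a b]).each.to_a                   # ["a", "b"]
empty_size  = BodySketch.new(nil).size                            # 0
```

A writer built on top of this only has to decide whether to frame each yielded piece as a chunk or write it as is.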

I see now that I've triggered quite a few Rubocop offences due to my style, I will let you correct or ignore offences which suit your preference. Investigating the JRuby failure now.

janko · 2017-05-02T08:53:39Z

I opened a bug report for the JRuby failure: jruby/jruby#4583.

Normally this bug would be harmless, but we are reading chunks using a buffer (to avoid creating a new string for each chunk), so each newly read chunk overwrites the previous one. This means that if we request the first chunk, and JRuby starts executing the code after the first yield (even though we did not tell it to), it will read the next chunk and overwrite the yielded first one.

I think at this point it probably makes more sense to work around this by not using a buffer, even at the cost of more string objects being allocated. This wouldn't introduce a performance regression, since this is a new feature.
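The snippet below does not reproduce the JRuby bug itself; it is a sketch showing why a reused buffer is fragile whenever a later read happens before an earlier chunk has been consumed. The helper names are hypothetical.

```ruby
require "stringio"

# Reads into one reused buffer; every stored element is the same String object,
# so each read replaces what earlier "chunks" appear to contain.
def chunks_with_shared_buffer(io, size)
  buffer = String.new
  chunks = []
  chunks << buffer while io.read(size, buffer)
  chunks
end

# Allocates a new String per read, so earlier chunks are never touched again.
def chunks_without_buffer(io, size)
  chunks = []
  while (chunk = io.read(size))
    chunks << chunk
  end
  chunks
end

shared = chunks_with_shared_buffer(StringIO.new("abcdefgh"), 3)
fresh  = chunks_without_buffer(StringIO.new("abcdefgh"), 3)

shared.map(&:object_id).uniq.size # 1: all three entries alias the buffer
fresh                             # ["abc", "def", "gh"]
```

The second variant allocates one string per chunk, which is the trade-off accepted above.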

tarcieri · 2017-05-02T15:31:11Z

lib/http/request/body.rb

-          chunk = String.new # rubocop:disable Style/EmptyLiteral
-          yield chunk while @body.read(BUFFER_SIZE, chunk)
+          loop do
+            data = @body.read(BUFFER_SIZE) or break


Personally I'd prefer something like:

while (data = @body.read(BUFFER_SIZE))
  yield data
end

It better communicates intent, IMO, despite assigning in a branch.

@tarcieri I would also prefer that, but Ruby prints a warning when you attempt to use an assignment in a conditional (Ruby thinks that you wanted to use ==). It happens even when warnings aren't enabled, just by running ruby -e 'nil if a = 1'.

Try the exact code I gave (with parens). It does not print a warning.

That is wonderful, I'll update the code then 😃

@tarcieri Updated.

tarcieri · 2017-05-02T17:32:52Z

RuboCop ProTip(TM): rubocop -a (auto-correct)

By changing all components to use an IO object as a base, we can implement a common IO interface for all components, which delegates to the underlying IO object. This enables streaming multipart data into the request body, avoiding loading the whole multipart data into memory when File parts are backed by File objects. See httprb/http#409 for the new streaming API.

This is a squashed commit of [PR#12][] with some tiny cleanups applied on top of that.

[PR#12]: #12

Allow any IO object in FormData::File
-------------------------------------

Previously we allowed only File and StringIO objects as an input to `FormData::File`, but we can generalize that to any IO object that responds to `#read` and `#size` (which includes `Tempfile`, `ActionDispatch::Http::UploadedFile` etc).

Open File for given path in binary mode
---------------------------------------

That way different operating systems won't attempt to convert newline characters to their internal representation, instead the file content will always be retrieved byte-for-byte as is.

Officially support Pathname in FormData::File.new
-------------------------------------------------

Previously Pathname was implicitly supported, though extracting filename wasn't working. With the recent refactoring this stopped working, so we make the Pathname support explicit.

Make all components into IO objects
-----------------------------------

By changing all components to use an IO object as a base, we can implement a common IO interface for all components, which delegates to the underlying IO object. This enables streaming multipart data into the request body, avoiding loading the whole multipart data into memory when File parts are backed by File objects. See httprb/http#409 for the new streaming API.

Make CompositeIO convert strings to StringIOs
---------------------------------------------

By delegating handling strings to CompositeIO we can remove a lot of the StringIO.new clutter when instantiating CompositeIO objects.

Use a buffer when reading IO files in CompositeIO
-------------------------------------------------

This way we're not creating a new string for each chunk read, instead each chunk will be read into an existing string object (a "buffer"), replacing any previous content.
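The idea of a composite IO that wraps strings in StringIOs and reads across several parts can be sketched as follows. This is an illustration under assumptions, not the http-form_data implementation: `CompositeIOSketch` is a hypothetical name, and unlike the commit above it does not reuse a buffer.

```ruby
require "stringio"

# Hypothetical sketch: presents several IOs (or strings) as one readable stream.
class CompositeIOSketch
  def initialize(parts)
    # Strings are wrapped so that every part responds to #read and #size.
    @ios = parts.map { |part| part.is_a?(String) ? StringIO.new(part) : part }
    @index = 0
  end

  # Total number of bytes across all parts.
  def size
    @ios.sum(&:size)
  end

  # Returns up to +length+ bytes, or nil once every part is exhausted.
  # With no length, returns all remaining content (possibly "").
  def read(length = nil)
    data = String.new
    while @index < @ios.size && (length.nil? || data.bytesize < length)
      remaining = length && (length - data.bytesize)
      chunk = @ios[@index].read(remaining)
      if chunk.nil? || chunk.empty?
        @index += 1 # current part is exhausted, move on to the next one
      else
        data << chunk.b # append as binary to avoid encoding conflicts
      end
    end
    return nil if length && length > 0 && data.empty?
    data
  end
end

io = CompositeIOSketch.new(["ab", StringIO.new("cde"), "f"])
total  = io.size    # 6
first  = io.read(4) # "abcd", spans the first two parts
second = io.read(4) # "ef"
third  = io.read(4) # nil, everything has been read
```

Since the object responds to #read and #size, it could be handed to a streaming writer like any other IO.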

janko · 2017-05-18T03:09:56Z

I updated http-form_data to the prerelease version that has the streaming capabilities, and removed to_s so that it's used as an IO by the HTTP::Request::Writer. I also removed assigning Content-Length from HTTP::Client, as it will automatically be inferred from HTTP::FormData#size by the HTTP::Request::Writer, or omitted if Transfer-Encoding is chunked.

I was thinking whether we need some integration tests between HTTP and HTTP::FormData, but considering that both gems are in my opinion very well-tested, and that the only thing HTTP needs here is #read and #size (which both HTTP::FormData::Multipart and HTTP::FormData::Urlencoded implement), I thought that it's not necessary.

This PR should be complete now.

ixti · 2017-05-19T01:29:45Z

Please rebase this branch and I'll be happy to merge it down

This change adds the ability to stream content of an IO object (an object that responds to #read) into the request body. This is convenient when you want to send large amounts of data as the request body, but you don't want to load all of it into memory.

HTTP.post("http://example.org/upload", body: File.open("video.mp4"))
HTTP.post("http://example.org/upload", body: StringIO.new("file data"))

It was already possible to send the request body in chunks by passing an Enumerable to :body, but that's only defined for "Transfer-Encoding: chunked" requests.

This will also allow us to extend form_data.rb to stream the multipart-encoded request, instead of having to load the whole body into memory before writing it to the socket.

Previously, creating chunked requests was only possible with an Enumerable body. If "Transfer-Encoding: chunked" was set but the body wasn't an Enumerable, the request would be built as if it were a regular request, which meant including the "Content-Length" header and writing the request body as is. If the body was an Enumerable but "Transfer-Encoding: chunked" wasn't set, an error would be raised.

This commit enables sending chunked requests with any body type, by abstracting how the request body is read into an #each method. If "Transfer-Encoding: chunked" is detected, the yielded chunks are encoded accordingly before writing; otherwise the chunks are written as is. This allows making a streaming chunked request using an IO object, where the file is written in chunks of 16 KB, which wasn't previously possible.

As described in jruby/jruby#4583, JRuby starts executing code after the first `yield` even though we requested only the first element, resulting in the first chunk being overwritten with the second chunk before it was even returned. We work around this by not using a buffer string, so each retrieved chunk is a new string; even if JRuby immediately retrieves the second chunk, it won't affect the first chunk.

janko · 2017-05-19T05:16:52Z

@ixti Done!

ixti · 2017-05-19T16:32:15Z

Awesome! Thanks!

@ixti

pkgsrc changes:
- sort DEPENDS

Upstream changes (from CHANGES.md):

## 3.0.0 (2017-10-01)

* Drop support of Ruby `2.0` and Ruby `2.1`. ([@ixti])
* [#410](httprb/http#410) Infer `Host` header upon redirects. ([@janko-m])
* [#409](httprb/http#409) Enables request body streaming on any IO object. ([@janko-m])
* [#413](httprb/http#413), [#414](httprb/http#414) Fix encoding of body chunks. ([@janko-m])
* [#368](httprb/http#368), [#357](httprb/http#357) Fix timeout issue. ([@HoneyryderChuck])

janko force-pushed the enable-request-body-streaming branch from 68c2f43 to c4798f5 Compare May 1, 2017 09:28

ixti self-assigned this May 1, 2017

ixti requested changes May 1, 2017

ixti approved these changes May 1, 2017

tarcieri reviewed May 2, 2017

janko mentioned this pull request May 2, 2017

Enable request body streaming httprb/form_data#12

Closed

janko added 7 commits May 19, 2017 14:51

Use a guard to avoid a method-level conditional

c6ac374

Require that IO object responds to #size

68d1c33

Test request writing with a non-Enumerable IO

f958b68

Fix a typo in tests

5ab4706

janko added 4 commits May 19, 2017 14:51

Use assignment in conditional without a warning

0f4a07f

Adhere to the code style

d998099

Update YARD documentation with allowed body type

2690a24

Utilize HTTP::FormData streaming

bb4479f

janko force-pushed the enable-request-body-streaming branch from 3cd6931 to bb4479f Compare May 19, 2017 04:52

ixti merged commit 4445835 into httprb:master May 19, 2017

janko deleted the enable-request-body-streaming branch May 19, 2017 16:44
