
Edge pull, the system often encounters errors ret=1018 (Device or resource busy) and ret=1018 (No such file or directory). #511

Closed
gqf2008 opened this issue Oct 26, 2015 · 27 comments

gqf2008 commented Oct 26, 2015

Logs:

[2015-10-26 13:36:08.153][error][13948][3998][16] http post on_play uri failed. client_id=3998, url=http://127.0.0.1/lcs/api/rtmp/on_play/shnh-edge2-live1.evideocloud.net, request={"action":"on_play","client_id":3998,"ip":"222.88.95.177","vhost":"live1.evideocloud.net","app":"live","stream":"dxhdbh__gBML4E6B40Lv","pageUrl":""}, response=, code=0, ret=1018(Device or resource busy)
[2015-10-26 13:36:08.153][error][13948][3998][16] hook client on_play failed. url=http://127.0.0.1/lcs/api/rtmp/on_play/shnh-edge2-live1.evideocloud.net, ret=1018(Device or resource busy)
[2015-10-26 13:36:08.153][error][13948][3998][16] http hook on_play failed. ret=1018(Device or resource busy)
[2015-10-26 13:36:08.153][error][13948][3998][16] stream service cycle failed. ret=1018(Device or resource busy)
[2015-10-26 13:36:08.154][error][13948][3998][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:08.342][error][13948][4000][104] rtmp handshake failed. ret=1008(Connection reset by peer)
[2015-10-26 13:36:09.032][error][13948][3990][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:09.032][error][13948][3990][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:09.251][error][13948][4002][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:09.251][error][13948][4002][16] http post on_connect uri failed. client_id=4002, url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, request={"action":"on_connect","client_id":4002,"ip":"222.186.130.3","vhost":"live1.evideocloud.net","app":"live","tcUrl":"rtmp://live1.evideocloud.net:1935/live","pageUrl":""}, response=, code=0, ret=1018(Device or resource busy)
[2015-10-26 13:36:09.251][error][13948][4002][16] hook client on_connect failed. url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, ret=1018(Device or resource busy)
[2015-10-26 13:36:09.251][error][13948][4002][16] check vhost failed. ret=1018(Device or resource busy)
[2015-10-26 13:36:10.341][error][13948][3973][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:36:12.165][error][13948][4020][104] rtmp handshake failed. ret=1008(Connection reset by peer)
[2015-10-26 13:36:13.365][error][13948][4022][104] rtmp handshake failed. ret=1008(Connection reset by peer)
[2015-10-26 13:36:14.103][error][13948][4008][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:14.103][error][13948][4008][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:14.572][error][13948][4005][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:14.572][error][13948][4005][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:14.614][error][13948][4024][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:14.614][error][13948][4024][16] http post on_connect uri failed. client_id=4024, url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, request={"action":"on_connect","client_id":4024,"ip":"222.186.130.3","vhost":"live1.evideocloud.net","app":"live","tcUrl":"rtmp://live1.evideocloud.net:1935/live","pageUrl":""}, response=, code=0, ret=1018(Device or resource busy)
[2015-10-26 13:36:14.614][error][13948][4024][16] hook client on_connect failed. url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, ret=1018(Device or resource busy)
[2015-10-26 13:36:14.614][error][13948][4024][16] check vhost failed. ret=1018(Device or resource busy)
[2015-10-26 13:36:15.211][error][13948][3914][62] rtmp handshake failed. ret=1011(Timer expired)
[2015-10-26 13:36:15.256][error][13948][4026][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:15.256][error][13948][4026][16] http post on_connect uri failed. client_id=4026, url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, request={"action":"on_connect","client_id":4026,"ip":"222.186.130.3","vhost":"live1.evideocloud.net","app":"live","tcUrl":"rtmp://live1.evideocloud.net:1935/live","pageUrl":""}, response=, code=0, ret=1018(Device or resource busy)
[2015-10-26 13:36:15.256][error][13948][4026][16] hook client on_connect failed. url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, ret=1018(Device or resource busy)


[2015-10-26 13:38:13.494][error][13948][4448][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:13.565][error][13948][4452][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:13.767][error][13948][4393][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:13.783][error][13948][4374][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:13.855][error][13948][4385][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:14.494][error][13948][4448][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:14.565][error][13948][4452][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:14.767][error][13948][4393][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:14.783][error][13948][4374][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:14.855][error][13948][4385][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:15.494][error][13948][4448][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:15.565][error][13948][4452][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:15.767][error][13948][4393][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:15.783][error][13948][4374][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:15.855][error][13948][4385][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)

gqf2008 changed the title from "The system often encounters ret=1018 (Device or resource busy) errors, check HTTP_OKK" to "The system often encounters ret=1018 (Device or resource busy) errors" Oct 26, 2015
gqf2008 changed the title from "The system often encounters ret=1018 (Device or resource busy) errors" to "The system often encounters ret=1018 (Device or resource busy) and ret=1018 (No such file or directory) errors" Oct 26, 2015
gqf2008 changed the title from "The system often encounters ret=1018 (Device or resource busy) and ret=1018 (No such file or directory) errors" to "Edge pull, the system often encounters ret=1018 (Device or resource busy) and ret=1018 (No such file or directory) errors" Oct 26, 2015
winlinvip (Member) commented Oct 26, 2015

#define ERROR_ST_CONNECT                    1018

This error means "cannot connect to the server".


winlinvip added the Bug label Oct 26, 2015
winlinvip added this to the srs 2.0 release milestone Oct 26, 2015
winlinvip (Member) commented Oct 26, 2015

Please specify the version, environment, and reproduction method.


gqf2008 (Author) commented Oct 26, 2015

Version: 2.0.195
Environment: CentOS 6.2, 64-bit

Reproduction method:

  1. One origin (publishing) server handles publishing only and allows only two edge servers to pull streams. The origin server hooks only the connect, publish, and close callbacks.
  2. When a playback request arrives, the two edge servers pull the stream from the origin server. The edge servers hook only the connect, play, and close callbacks.
  3. At first we suspected a problem in our hook API implementation, so the hooks were disabled entirely; the logs, however, showed the same error when connecting to port 1935 of the origin server.
  4. The two edge servers sit behind an F5 (triangle transmission), which performs port probing on port 1935.
  5. Concurrent connections on each edge server never exceed 100.
  6. The problem can be reproduced by repeatedly clicking play and stop in the VLC player; VLC then reports that it cannot connect to the backend. (A sketch of the implied edge configuration follows below.)
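For reference, here is a hedged sketch of the edge vhost configuration this reproduction implies. It is reconstructed from the log lines above; the actual config file was not posted, and the on_close URL is an assumption following the on_connect/on_play pattern.

# Hypothetical edge vhost config (reconstructed, not the reporter's file).
vhost live1.evideocloud.net {
    # Edge mode: pull the stream from the origin on demand.
    mode        remote;
    origin      192.168.190.34:1935;
    http_hooks {
        enabled     on;
        on_connect  http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net;
        on_play     http://127.0.0.1/lcs/api/rtmp/on_play/shnh-edge2-live1.evideocloud.net;
        # Assumed to follow the same URL pattern as the hooks above.
        on_close    http://127.0.0.1/lcs/api/rtmp/on_close/shnh-edge2-live1.evideocloud.net;
    }
}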


gqf2008 (Author) commented Oct 26, 2015

According to the technical references I consulted, after a socket is set to non-blocking, calling recv before the data packet has arrived produces this error; the program should ignore it and keep looping to read. I hope this information helps in fixing the issue. Thank you.
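For illustration, a minimal sketch (not SRS code; the helper name is hypothetical) of the retry pattern described above: on a non-blocking socket, recv() fails with EAGAIN/EWOULDBLOCK until data arrives, and the caller should wait and retry instead of treating it as fatal.

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

// Hypothetical helper: read from a non-blocking socket, treating
// EINTR and EAGAIN/EWOULDBLOCK as "try again" rather than as errors.
ssize_t recv_nonblock(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = recv(fd, buf, len, 0);
        if (n >= 0) {
            return n; // data received, or 0 on orderly shutdown.
        }
        if (errno == EINTR) {
            continue; // interrupted by a signal; simply retry.
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            // No data yet: wait until the fd becomes readable instead of
            // busy-looping; this is essentially what ST's st_read() does.
            struct pollfd pfd = { fd, POLLIN, 0 };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR) {
                return -1;
            }
            continue;
        }
        return -1; // a real error.
    }
}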


winlinvip (Member) commented Oct 27, 2015

Well, let me take a look. The log seems to say it cannot connect to your HTTP callback server; it shouldn't be a recv problem at this point.


gqf2008 (Author) commented Oct 27, 2015

After disabling the hooks it is still the same: connecting to the origin server also reports 1018 (No such file or directory).


winlinvip (Member) commented Oct 27, 2015

Is the publishing (origin) server SRS?


gqf2008 (Author) commented Oct 29, 2015

Yes.


gqf2008 (Author) commented Oct 29, 2015

When a connection on the edge server hits the 1018 (No such file or directory) error, clients can still connect to the edge SRS and play, but with frame drops and stuttering.


forvim commented Nov 4, 2015

To add to this: I am using version 2.0.197 with SRS set up as an edge node (mode remote;).
The origin stream itself always plays normally. However, when accessing the SRS edge node, there is a more than 50% chance of an error if the player disconnects and reconnects immediately; if the disconnection lasts more than 10 seconds, there is no error.

Tracing the code, the error occurs in
SrsEdgeIngester::connect_server -> srs_socket_connect -> st_connect -> st_netfd_poll -> st_poll, and the returned errno is ENOENT (No such file or directory). Packet capture shows that communication on the wire is normal.

The problem seems to arise when the origin connection is closed within the few seconds of nobody watching after the last client disconnects. If a new playback starts during this window, the origin connection is prone to failure, the failure then repeats in a loop, and different vhosts affect each other as well. This problem does not occur in version 1.0.


forvim commented Nov 4, 2015

Also, the same stream of the same app sometimes has two origin connections at the same time; the log

edge pull connected

appears twice. After that log, the origin connection can core dump during the handshake, at

if (hs_bytes->s0s1s2[0] != 0x03) {

with the call stack:

SrsComplexHandshake::handshake_with_server
SrsRtmpClient::handshake
SrsEdgeIngester::cycle()

At the time only the location was recorded and the core file was not saved. This should be the same issue that causes the edge-to-origin connection failures described above.


winlinvip (Member) commented Nov 5, 2015

Two simultaneous origin-fetch connections for the same stream will definitely cause a crash.


forvim commented Nov 5, 2015

To be precise: if the same stream on the edge node is repeatedly disconnected and quickly reconnected, from about the third attempt onward pulling the stream enters a loop of errors.
For a different stream, there are several ENOENT (No such file or directory) origin-fetch failures, after which it returns to normal.
Occasionally, when two origin fetches for the same stream happen at once, the program crashes.
It looks like an issue in the origin-fetch control logic.


winlinvip (Member) commented Nov 5, 2015

The Dragon God says this problem is a threading issue. I have assigned this bug to him.


winlinvip added a commit that referenced this issue Nov 6, 2015
winlinvip (Member) commented
fixed in 2.0.199

forvim commented Nov 13, 2015

Version 2.0.199 does fix the issues above, but a new problem emerged that occurs less frequently: when multiple edge streams disconnect almost simultaneously, closing the SRS thread causes a core dump.
#0 0x00000000004b3eb0 in internal::SrsThread::thread_cycle (this=0x3d292d0) at src/app/srs_app_thread.cpp:239
#1 0x00000000004b3f0f in internal::SrsThread::thread_fun (arg=0x3d292d0) at src/app/srs_app_thread.cpp:247
#2 0x000000000053703e in _st_thread_main () at sched.c:327
#3 0x00000000005377d8 in st_thread_create (start=0x537f52 <st_usleep+202>, arg=0x7fb94f755b40, joinable=32697, stk_size=1285439024) at sched.c:591
#4 0x00000000004b3959 in internal::SrsThread::start (this=<error reading variable: Cannot access memory at address 0x7ffffffd9>) at src/app/srs_app_thread.cpp:109


winlinvip (Member) commented Nov 13, 2015

@zhengfl Please summon the Dragon God.


ghost commented May 26, 2016

May I ask: if I download the latest 2.0 release version now, will this problem still exist?

winlinvip (Member) commented Sep 1, 2016

You can try the latest version, which is 209: https://github.com/ossrs/srs/tree/2.0release#history.


winlinvip (Member) commented Sep 1, 2016

It seems like it was fixed in 2.0.203.


winlinvip (Member) commented Sep 1, 2016

The ENOENT should be caused by a runaway thread, with the state set by other threads.
Once this thread synchronization issue is resolved, there should be no more problems.


winlinvip (Member) commented Sep 1, 2016

This is because close(stfd) did not close the fd correctly, which forced SRS to disable a feature: disconnecting all client connections when deleting a vhost. That feature triggers the ENOENT issue.

The fd must not be blocked in read or write while it is being closed.
In other words, an fd may only be closed by one thread, and that thread must finish its cleanup before closing the fd.
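A minimal sketch of such a guarded close (illustration only; the srs_close_stfd and srs_assert names are taken from haofz's report further down, not from the actual patch):

// The owning thread closes the fd exactly once, after every reader and
// writer has been stopped. st_netfd_close() returning -1 means some other
// thread is still blocked on the fd, which we treat as a fatal bug.
// Note: no double-close protection yet; see the report and fix below.
void srs_close_stfd(st_netfd_t& stfd)
{
    int err = st_netfd_close(stfd);
    srs_assert(err != -1);
}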


winlinvip (Member) commented
2.0.211 fixed

winlinvip (Member) commented Sep 2, 2016

Flying FD

A "flying FD" is an fd that runs away because it was not closed properly. Once an fd flies, it can leak memory and fds, or even cause an fd to be mysteriously closed out from under another user. Therefore an fd must not fly: the return value of close(stfd) must be 0, which we can enforce with an assert.

How can we ensure that close(stfd) succeeds? At the moment stfd is closed, nothing may be waiting to read or write it. Consider a single thread reading and writing stfd:

int osfd = ...; // create and open osfd.
st_netfd_t stfd = st_netfd_open_socket(osfd);
st_read(stfd, ...);
st_write(stfd, ...);
assert(0 == st_netfd_close(stfd)); // safely close it.

A single thread cannot be reading or writing at the moment it closes stfd. However, when multiple threads are involved, for example one thread that receives data and another that sends, processes, and eventually exits, a separate receiving thread is created:

int osfd = ...;
st_netfd_t stfd = st_netfd_open_socket(osfd);

st_thread_t tid = st_thread_create(function(){
    st_read(stfd, ...); // blocks here.
});

st_write(stfd, ...);
assert(0 == st_netfd_close(stfd)); // fails and crashes: stfd is being read (EBUSY).

If the receiving thread is still running, stfd is in the EBUSY state and cannot be closed. To close it safely, the thread must be interrupted first:

st_thread_interrupt(tid);
assert(0 == st_netfd_close(stfd)); // safely close stfd.

Therefore, in SRS, whenever a thread is reading or writing stfd, that thread must be stopped before stfd is closed. For example, in the forwarder:

void SrsForwarder::on_unpublish()
{
    // @remark we must stop the thread then safely close the fd.
    pthread->stop();
    sdk->close();
}

If the order is reversed, that is, stfd is closed first and the thread stopped afterwards, it will crash.


winlinvip (Member) commented Sep 2, 2016

st_thread_interrupt interrupts st_read and st_write.

ssize_t st_read(_st_netfd_t *fd, void *buf, size_t nbyte, st_utime_t timeout)
{
    ssize_t n;

    while ((n = read(fd->osfd, buf, nbyte)) < 0) {
        if (errno == EINTR) { // A system interruption; ignore it and retry.
            continue;
        }

        if (!_IO_NOT_READY_ERROR) {
            return -1;
        }

        /* Wait until the socket becomes readable */
        if (st_netfd_poll(fd, POLLIN, timeout) < 0) {
            return -1; // While blocked here, st_thread_interrupt makes poll return -1 (EINTR).
        }
    }

    return n;
}

If the read system call is interrupted by a signal, ST simply retries it; that is not a problem.
If the thread is blocked (waiting in poll for the fd to become readable), st_thread_interrupt causes that poll to return -1 with errno=EINTR, so the blocked st_read/st_write exits and the fd can then be closed safely.
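Putting the pieces together, a minimal sketch (names assumed; the reading thread must have been created joinable) of the stop-then-close sequence this implies:

// Interrupt the thread blocked in st_read/st_write, wait for it to fully
// exit, and only then close the fd, so the close can never hit EBUSY.
void stop_and_close(st_thread_t tid, st_netfd_t& stfd)
{
    st_thread_interrupt(tid);          // the poll inside st_read returns -1 (EINTR).
    st_thread_join(tid, NULL);         // wait until the reader has exited.
    assert(0 == st_netfd_close(stfd)); // safe now: nothing is using stfd.
    stfd = NULL;
}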


haofz (Contributor) commented Sep 2, 2016

Brother Winlin:
I just looked at your modification of srs_close_stfd(st_netfd_t& stfd). After compiling and testing, the program crashes at srs_assert(err != -1), with err equal to -1.
The reason is that forwarder->on_unpublish() is called twice; on the second call, the close() system call inside srs_close_stfd returns -1.

void SrsSource::destroy_forwarders()
{
    std::vector<SrsForwarder*>::iterator it;
    for (it = forwarders.begin(); it != forwarders.end(); ++it) {
        SrsForwarder* forwarder = *it;
        forwarder->on_unpublish();
        srs_freep(forwarder); // The SrsForwarder destructor calls on_unpublish() again.
    }
    forwarders.clear();
}

The stack trace is as follows:
First time: err=0

#0  st_netfd_close (fd=0x928670) at io.c:183
#1  0x00000000004c02d6 in srs_close_stfd (stfd=@0x9081a0) at src/app/srs_app_st.cpp:247
#2  0x00000000004a811c in SrsForwarder::close_underlayer_socket (this=0x908180)
    at src/app/srs_app_forward.cpp:336
#3  0x00000000004a70ee in SrsForwarder::on_unpublish (this=0x908180) at src/app/srs_app_forward.cpp:156
#4  0x00000000004964fa in SrsSource::destroy_forwarders (this=0x904a40)
    at src/app/srs_app_source.cpp:2771
#5  0x00000000004947d3 in SrsSource::on_unpublish (this=0x904a40, is_edge=false)
    at src/app/srs_app_source.cpp:2373

Second time: err=-1

#0  st_netfd_close (fd=0x928670) at io.c:183
#1  0x00000000004c02d6 in srs_close_stfd (stfd=@0x9081a0) at src/app/srs_app_st.cpp:247
#2  0x00000000004a811c in SrsForwarder::close_underlayer_socket (this=0x908180)
    at src/app/srs_app_forward.cpp:336
#3  0x00000000004a70ee in SrsForwarder::on_unpublish (this=0x908180) at src/app/srs_app_forward.cpp:156
#4  0x00000000004a67bd in SrsForwarder::~SrsForwarder (this=0x908180, __in_chrg=<value optimized out>)
    at src/app/srs_app_forward.cpp:71
#5  0x00000000004a69ee in SrsForwarder::~SrsForwarder (this=0x908180, __in_chrg=<value optimized out>)
    at src/app/srs_app_forward.cpp:80
#6  0x000000000049651f in SrsSource::destroy_forwarders (this=0x904a40)
    at src/app/srs_app_source.cpp:2772
#7  0x00000000004947d3 in SrsSource::on_unpublish (this=0x904a40, is_edge=false)
    at src/app/srs_app_source.cpp:2373


winlinvip (Member) commented
fixed in 49853d2
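A hedged guess at the shape of that fix (the authoritative change is commit 49853d2; this sketch only illustrates making the close idempotent, so the destructor's second on_unpublish() becomes harmless):

// Hypothetical sketch, not the actual commit: guard against a second call
// and drop the reference after a successful close.
void srs_close_stfd(st_netfd_t& stfd)
{
    if (!stfd) {
        return; // already closed by an earlier on_unpublish(); nothing to do.
    }
    int err = st_netfd_close(stfd);
    srs_assert(err != -1); // must succeed: no thread may still be using stfd.
    stfd = NULL; // a repeated close is now a no-op instead of a crash.
}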

winlinvip changed the title to the English translation "Edge pull, the system often encounters errors ret=1018 (Device or resource busy) and ret=1018 (No such file or directory)." Jul 25, 2023
winlinvip added the TransByAI (Translated by AI/GPT) label Jul 25, 2023
winlinvip added a commit that referenced this issue Jul 24, 2024
…4126)

1. Should always stop coroutine before close fd, see #511, #1784
2. When edge forwarder coroutine quit, always set the error code.
3. Do not unpublish if invalid state.

---------

Co-authored-by: Jacob Su <[email protected]>
winlinvip added a commit that referenced this issue Jul 24, 2024
1. Should always stop coroutine before close fd, see #511, #1784
2. When edge forwarder coroutine quit, always set the error code.
3. Do not unpublish if invalid state.

---------

Co-authored-by: Jacob Su <[email protected]>