gossip: connectivity logging is extremely spammy in 128 node cluster #30088
I think we can remove the every-minute logging of gossip connectivity. The other logging only occurs when gossip connectivity changes. It shouldn't be changing frequently, and we should figure out why it is.
Yeah, I'll check out what's going on with gossip after the tpc-c runs are done, or more likely I'll just spin up a cluster using smaller nodes later on to focus on the gossip in a more cost-effective way. However, there will always be a lot of logging triggered every time one or more nodes get restarted, as the gossip network adjusts to the change(s) and stabilizes, which will certainly add up during an event like a rolling upgrade.
On a 3 node cluster we gossip at ~2KB/sec. On a 9 node cluster we gossip at ~12KB/sec. On a 27 node cluster we gossip at ~55KB/sec. None of these numbers scale up to 37GB gossiped over 2 hours. I think there is something to investigate there.
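As a back-of-the-envelope check using only the numbers in this thread: ~55KB/sec sustained for 2 hours is roughly 55KB × 7,200s ≈ 0.4GB, and scaling that linearly from 27 to 128 nodes still gives under 2GB, so the ~37GB observed is one to two orders of magnitude beyond what linear growth from these idle-cluster rates would predict.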
We both gossip the cluster-id as well as include it in every gossip `Request`.
On a 3 node cluster, over the course of 1 minute we send/receive the following keys via gossip:
The second column is the number of times the key was sent/received. The third column is the size of the gossiped value times the number of times it was gossiped (providing an indication of what fraction of the gossiped bytes it is consuming). Note that the sizes above are only measuring …
We need to gossip the sentinel frequently, but there is no need to be gossiping the cluster-id and first-range so fast.
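For illustration only, a minimal Go sketch of the distinction being drawn here (the types and names are made up, not the actual gossip package): infos with a TTL, like the sentinel, have to be re-gossiped before they expire, while infos that never expire, like the cluster ID, don't need periodic re-sending at all.

```go
package main

import (
	"fmt"
	"time"
)

// info is a gossiped key/value pair. A zero TTL means the info never expires.
type info struct {
	value   []byte
	addedAt time.Time
	ttl     time.Duration // 0 => never expires
}

// needsRegossip reports whether an info must be re-sent soon to keep it alive.
// Non-expiring infos (e.g. the cluster ID) never need periodic re-gossip, while
// expiring infos (e.g. the sentinel) must be refreshed before their TTL lapses.
func (i info) needsRegossip(now time.Time, refreshMargin time.Duration) bool {
	if i.ttl == 0 {
		return false
	}
	return now.After(i.addedAt.Add(i.ttl - refreshMargin))
}

func main() {
	now := time.Now()
	store := map[string]info{
		"cluster-id": {value: []byte("dead-beef"), addedAt: now.Add(-time.Hour)},                     // never expires
		"sentinel":   {value: []byte("ok"), addedAt: now.Add(-5 * time.Second), ttl: 6 * time.Second}, // short TTL
	}
	for key, i := range store {
		fmt.Printf("%s needs re-gossip: %v\n", key, i.needsRegossip(now, 2*time.Second))
	}
}
```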
The numbers I posted above were the sent/received gossip infos from a single node. The mathematically inclined will notice that we sent/received 138 `gossip-clients:1` keys. I looked into this a bit more, and it turns out we sent 138 keys during that minute time frame. But something is off here, because we are only supposed to send that key every 2 seconds, so we should have gossiped it 60 times. A bit more instrumentation shows that every time …
I'm pretty sure the problem is due to how we update the remote's high water stamps.
The previous handling of high water stamps left a gap during which an info could be resent over a connection multiple times. The general gossip send loop is to wait for a new info to be added to the local info store, to compute a delta between the local info store and the remote info store using the remote's high water stamps, and to update the remote's high water stamps when the remote sends a reply on the connection. The problem here is that updating the remote's high water stamps when the remote sends a reply leaves a window where an info added to the local store would be sent multiple times (every time an additional info was added) until the remote replied.

The fix is to keep track of a per-connection high water stamp map and ratchet the contained stamps both when we receive a reply from the remote and when we send the remote an info. The result is a 30-40% reduction in gossip traffic (bytes sent/received) as measured on idle 3, 9 and 27 node clusters.

See cockroachdb#30088

Release note: None
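To make the fix concrete, here is a simplified sketch of the mechanism described in that commit message (hypothetical names, not the real code): the sender ratchets a per-connection high water stamp map when it sends an info, not only when the remote acknowledges, so an info is not re-sent on that connection while waiting for a reply.

```go
package main

import "fmt"

// info carries a value plus the originating node's timestamp ("stamp").
type info struct {
	nodeID int64
	stamp  int64
	value  string
}

// conn tracks, per peer connection, the highest stamp per node that we have
// already sent (or that the peer has acknowledged). Ratcheting this map on
// send closes the window in which an info could be re-sent multiple times
// before the peer's reply arrives.
type conn struct {
	highWater map[int64]int64 // nodeID -> highest stamp sent/acked
}

// delta returns the infos the peer has not seen yet and ratchets the
// per-connection high water stamps as if they had been sent.
func (c *conn) delta(local []info) []info {
	var out []info
	for _, i := range local {
		if i.stamp > c.highWater[i.nodeID] {
			out = append(out, i)
			c.highWater[i.nodeID] = i.stamp // ratchet on send, not only on reply
		}
	}
	return out
}

// onReply ratchets the stamps again from the peer's acknowledged view,
// which can only move them forward.
func (c *conn) onReply(acked map[int64]int64) {
	for nodeID, stamp := range acked {
		if stamp > c.highWater[nodeID] {
			c.highWater[nodeID] = stamp
		}
	}
}

func main() {
	c := &conn{highWater: map[int64]int64{}}
	local := []info{{nodeID: 1, stamp: 10, value: "a"}, {nodeID: 2, stamp: 5, value: "b"}}

	fmt.Println(len(c.delta(local))) // 2: both infos are new to this connection
	// A second send before any reply arrives no longer re-sends the same infos.
	fmt.Println(len(c.delta(local))) // 0
	c.onReply(map[int64]int64{1: 10, 2: 5}) // reply can only move stamps forward
}
```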
See #30126. I think that is an explanation for the gigabytes of gossip traffic.
30106: gossip: remove unnecessary gossip traffic r=spencerkimball,a-robinson,bdarnell a=petermattis

The previous handling of high water stamps left a gap during which an info could be resent over a connection multiple times. The general gossip send loop is to wait for a new info to be added to the local info store, to compute a delta between the local info store and the remote info store using the remote's high water stamps, and to update the remote's high water stamps when the remote sends a reply on the connection. The problem here is that updating the remote's high water stamps when the remote sends a reply leaves a window where an info added to the local store would be sent multiple times (every time an additional info was added) until the remote replied.

The fix is to keep track of a per-connection high water stamp map and ratchet the contained stamps both when we receive a reply from the remote and when we send the remote an info. The result is a 30-40% reduction in gossip traffic (bytes sent/received) as measured on idle 3, 9 and 27 node clusters.

See #30088

Release note: None

Co-authored-by: Peter Mattis <[email protected]>
Make gossip.Request.{Addr,ClusterID} nullable and only send them on the first request over a connection. Remove gossip.Response.Addr, which was unused.

Only gossip the cluster ID once per server. The cluster ID never expires, so there was no need to be re-gossiping it frequently.

Keep track of which high water stamps have been sent to the remote side of a connection and only send stamps that are newer than those already sent. This has a dramatic impact on gossip traffic as a cluster gets larger, as usually only 1 or 2 stamps have changed on each gossip message, which results in reducing an O(n) map size to a constant.

In a local 27-node cluster, these changes reduce gossip traffic by ~60%.

Fixes cockroachdb#30126
See cockroachdb#30088

Release note: None
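A rough sketch of the request trimming described above, with illustrative stand-in types rather than the real protobufs: per-connection state remembers whether the address and cluster ID have been sent and which high water stamps the peer has already been told, so later requests carry only the handful of stamps that actually changed.

```go
package main

import "fmt"

// request is a stand-in for the gossip request proto: addr and clusterID are
// optional (nil once already sent on this connection), and highWater carries
// only the stamps the peer has not been told about yet.
type request struct {
	addr      *string
	clusterID *string
	highWater map[int64]int64
}

// clientConn is per-connection sender state.
type clientConn struct {
	addr, clusterID string
	sentFirst       bool
	sentStamps      map[int64]int64 // nodeID -> highest stamp already sent to peer
}

// buildRequest assembles the next outgoing request from the full local
// high water map, including only entries that moved since the last send.
func (c *clientConn) buildRequest(local map[int64]int64) request {
	req := request{highWater: map[int64]int64{}}
	if !c.sentFirst {
		// Addr and cluster ID only need to go out once per connection.
		req.addr, req.clusterID = &c.addr, &c.clusterID
		c.sentFirst = true
	}
	for nodeID, stamp := range local {
		if stamp > c.sentStamps[nodeID] {
			req.highWater[nodeID] = stamp
			c.sentStamps[nodeID] = stamp
		}
	}
	return req
}

func main() {
	c := &clientConn{addr: "n1:26257", clusterID: "dead-beef", sentStamps: map[int64]int64{}}
	local := map[int64]int64{1: 10, 2: 20, 3: 30}

	r1 := c.buildRequest(local)
	fmt.Println(r1.addr != nil, len(r1.highWater)) // true 3: first request carries everything

	local[2] = 21 // only one node's stamp advances
	r2 := c.buildRequest(local)
	fmt.Println(r2.addr != nil, len(r2.highWater)) // false 1: later requests stay small
}
```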
As discussed on cockroachdb#30088. It can be absolutely massive on large clusters, and isn't needed because we still log the connectivity info whenever it changes.

Release note: None
30035: builder: upgrade to go1.11 r=bdarnell,raduberinde,nvanbenschoten a=benesch

Go 1.11 brings us some nice performance improvements on single-node clusters. See #29827 for benchmark results.

Fix #29827.

Release note: None

30285: gossip: Don't include gossip connectivity in periodic status logs r=a-robinson a=a-robinson

As discussed on #30088. It can be absolutely massive on large clusters, and isn't needed because we still log the connectivity info whenever it changes.

Release note: None

I still think that logging on every change in connectivity is going to be a mess in large clusters during rolling restarts even if gossip is otherwise perfectly stable, but this is useful regardless of whether we do anything about that.

Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Alex Robinson <[email protected]>
@a-robinson Do you think there is anything else to address here for 2.1? My understanding is that gossip stabilized on your 100+ node cluster and this spam just affected startup of the cluster.
Unfortunately, that doesn't seem to be the case. I saved the latest log file from the cluster before taking it down, and all the log files have a lot of the connectivity logs even though no nodes were being added/removed around that time. Two nodes were down due to the bad_alloc issue, but they had been down for well over an hour before the time period these logs cover. Let me know if you'd like to see them for yourself. I imagine the …

Also, orthogonal to the thrashing, the logging during startup is excessive. We log every time we get new connectivity info that isn't identical to the last thing we logged, but during startup a node can get gossip info piece by piece, leading to log output like the below that keeps going on and on until all the nodes' info is received:
[gossip connectivity log output omitted]
I've been tossing around the idea of rate-limiting this logging to happen no more than once a minute, which would alleviate most of the pain. Would that still accomplish most of the original goal? I realize we've discovered some gossip problems lately, but I'm not really convinced that this information belongs in non-verbose logs.
It's true that the gossip connectivity logs might be fighting yesterday's war. Perhaps we should re-add them to the once-per-minute logging, but only output the message if it has changed since the last log line.
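A minimal sketch of that proposal (a hypothetical helper, not the actual CockroachDB logging code): keep building the connectivity string in the periodic loop, but only emit it when it differs from the last emitted value and at most once per interval.

```go
package main

import (
	"log"
	"time"
)

// connectivityLogger emits a connectivity summary only when it has changed
// since the last emitted line, and no more often than every interval.
type connectivityLogger struct {
	interval   time.Duration
	last       string
	lastLogged time.Time
}

// maybeLog logs the summary if it changed and the rate limit allows it.
func (l *connectivityLogger) maybeLog(now time.Time, summary string) {
	if summary == l.last {
		return // unchanged: stay quiet
	}
	if now.Sub(l.lastLogged) < l.interval {
		return // changed, but we logged recently; wait for a later tick
	}
	log.Printf("gossip connectivity: %s", summary)
	l.last, l.lastLogged = summary, now
}

func main() {
	cl := &connectivityLogger{interval: time.Minute}
	now := time.Now()
	cl.maybeLog(now, "n2 -> n1; n3 -> n1")                    // logged: first change
	cl.maybeLog(now.Add(5*time.Second), "n2 -> n1; n3 -> n2") // suppressed: within a minute
	cl.maybeLog(now.Add(2*time.Minute), "n2 -> n1; n3 -> n2") // logged: changed and past the limit
}
```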
Fixes cockroachdb#30088

This will still probably result in logging it once a minute on large clusters due to the inherent instability caused by connection culling, but once a minute is a lot more bearable than the current state of things.

Release note: None
30730: gossip: Log connectivity no more than once a minute r=a-robinson a=a-robinson

Fixes #30088

This will still probably result in logging it once a minute on large clusters due to the inherent instability caused by connection culling, but once a minute is a lot more bearable than the current state of things.

Release note: None

Co-authored-by: Alex Robinson <[email protected]>
I spun up a 128 node cluster for tpc-c 50k testing (it was meant to be / will soon be 120, but my brain accidentally rounded up when creating it), and the gossip connectivity logs are overwhelming. First of all, they're huge, taking up almost my entire screen with just one log line:
Secondly, they're being logged a few times per second, making it impossible to read anything else out of the logs.
The second point is probably indicative of some amount of instability in the gossip network that should be investigated in its own right. However, even if it were only being logged once a minute as part of the gossip status, this would be quite annoying. The entire "gossip status" message literally doesn't fit on the screen of my 15" MacBook:
Also, wow that's a lot of data being transmitted over gossip (the server indicates 25GB sent / 12GB received). I suspect a lot of it is the connectivity map being resent in its entirety over and over.
It'd be reasonable to say we should first understand/fix the fact that there are frequent changes happening in the gossip network. I'm still worried about how things will look after that, though.
@petermattis