[DPE-3721] - chore: add cruise-control to snap #38

Merged
merged 17 commits into from
May 30, 2024

Conversation

@marcoppenheimer marcoppenheimer commented Feb 6, 2024

Review Notes
  • Charmed Kafka needs changes to support internal clients for the Metrics Reporter, so this won't work until they're added. I have tested it with a locally patched charm.
  • This doesn't have tests, and won't run without additional config. We will need to add tests and a default config that works with them, so that the snap app can work stand-alone.

MVP CruiseControl addition to Snap

The final layout will be similar to Kafka, as follows:

  • $SNAP/opt/cruise-control/bin - Modified start script from upstream, and Snap wrapper for setting default env-vars

  • $SNAP/opt/cruise-control/libs - All build/libs and build/dependant-libs for running CruiseControl. It also includes the libs for cruise-control-metrics-reporter; it is not yet clear whether these are needed

  • $SNAP_DATA/etc/cruise-control - Config files. May include:

    • log4j.properties - Bundled with snap
    • cruisecontrol.properties - Added by charm
    • cruisecontrol.jaas - Added by charm
    • capacity.json - Added by charm
  • $SNAP_COMMON/var/log/cruise-control - Dir for logs from Log4j2

  • $SNAP_COMMON/var/lib/cruise-control - Data dir, but will likely not be used
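The writable part of this layout could be prepared at install time along the following lines. This is only a sketch: `create_cc_dirs` is a hypothetical helper name, and the real install hook in the snap may do this differently.

```shell
# Hypothetical helper (not taken from the PR): create the writable
# cruise-control directories listed in the layout above.
create_cc_dirs() {
    local snap_data="$1" snap_common="$2"
    mkdir -p \
        "${snap_data}/etc/cruise-control" \
        "${snap_common}/var/log/cruise-control" \
        "${snap_common}/var/lib/cruise-control"
}

# Inside a snap hook, snapd sets SNAP_DATA and SNAP_COMMON.
# Outside of a snap this guard makes the snippet a no-op.
if [ -n "${SNAP_DATA:-}" ] && [ -n "${SNAP_COMMON:-}" ]; then
    create_cc_dirs "$SNAP_DATA" "$SNAP_COMMON"
fi
```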

Minor changes to existing Kafka snap

We unset KAFKA_LOG4J_OPTS now in kafka/bin/bin-wrapper.bash. This avoids very distracting and mostly unnecessary logging when running bin-scripts locally.
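As a rough illustration of the idea (the function name `run_bin_script` is made up here, and the actual bin-wrapper.bash may be structured differently):

```shell
# Sketch of a bin wrapper: drop the daemon's logging options before
# running a one-off CLI script, then run the script with its arguments.
run_bin_script() {
    # Without this, the script would inherit the daemon's Log4j settings
    # from the environment and log far more than is useful locally.
    unset KAFKA_LOG4J_OPTS
    "$@"
}
```

It would be called with the wrapped script as its first argument, for example `run_bin_script "$SNAP/opt/kafka/bin/kafka-topics.sh" --help`.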

We also now take the STAGE'd CruiseControl libs and add the metrics-reporter JARs to Kafka's libs.

Heavily modified kafka-cruise-control-start.sh

CruiseControl's default 'start-the-daemon' bash script expects that it's running directly from within the CruiseControl repo after being built from source. As a result, it hard-codes a bunch of paths that we have modified to get it to fit in the Snap layout.

tl;dr - We change base_dir and set $CLASSPATH to include files in a single /opt/cruise-control/libs dir. If we compare the diff to upstream:

30c30
< base_dir=$(dirname $0)
---
> base_dir=$(dirname $0)/..
40,63d39
< # run ./gradlew copyDependantLibs to get all dependant jars in a local dir
< shopt -s nullglob
< for dir in "$base_dir"/cruise-control/build/dependant-libs;
< do
<   if [ -z "$CLASSPATH" ] ; then
<     CLASSPATH="$dir/*"
<   else
<     CLASSPATH="$CLASSPATH:$dir/*"
<   fi
< done
< 
< if [ -z "$CLASSPATH" ]; then
<   CLASSPATH="$base_dir/cruise-control/build/libs/*"
< else
<   CLASSPATH="$CLASSPATH:$base_dir/cruise-control/build/libs/*"
< fi
< 
< if [ -z "$CLASSPATH" ]; then
<   CLASSPATH="$base_dir/cruise-control-metrics-reporter/build/libs/*"
< else
<   CLASSPATH="$CLASSPATH:$base_dir/cruise-control-metrics-reporter/build/libs/*"
< fi
< shopt -u nullglob
< 
143a120,125
> 
> # classpath addition for release
> for file in "$base_dir"/libs/*;
> do
>     CLASSPATH="$CLASSPATH":"$file"
> done
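For readers who want to try the "classpath addition for release" step in isolation, here is a minimal sketch. `build_classpath` is a hypothetical name, and unlike the loop in the diff this variant avoids a leading `:` when the classpath starts out empty.

```shell
# Sketch: build a classpath from every file in a single libs directory.
# $1 = libs directory, $2 = optional existing classpath to extend.
build_classpath() {
    local libs_dir="$1" classpath="${2:-}" file
    for file in "$libs_dir"/*; do
        # Skip the unexpanded glob pattern when the directory is empty.
        [ -e "$file" ] || continue
        # Only add the ':' separator when there is something before it.
        classpath="${classpath:+$classpath:}$file"
    done
    printf '%s\n' "$classpath"
}
```

In the start script this would be used as `CLASSPATH="$(build_classpath "$base_dir/libs" "$CLASSPATH")"`.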

SASL Auth with JAAS

In order to pick up auth for Kafka + ZooKeeper, similar to Kafka we need to set -Djava.security.auth.login.config in $KAFKA_OPTS, pointing to a path for some cruisecontrol.jaas. For Charmed Kafka, this will be in /etc/environment most likely.

An example file would look like:

# ZooKeeper auth
Client {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    username="relation-6"
    password="0NMZsBGhd6yoQLVhUNu03LliVdB4HErW";
};

# Kafka auth
KafkaClient {
    org.apache.kafka.common.security.scram.ScramLoginModule required
    username="admin"
    password="GwNjX8CTd0VHXBEoVlUz1ObWB3LaAOi8";
};
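Wiring such a file into the JVM could look roughly like this. The helper name `jaas_opts` is hypothetical, and the file location is an assumption based on the layout described earlier; the charm may place it elsewhere.

```shell
# Sketch: append the JAAS login config flag to whatever KAFKA_OPTS
# already contains. $1 = path to the cruisecontrol.jaas file.
jaas_opts() {
    local jaas_file="$1"
    printf '%s\n' "${KAFKA_OPTS:+$KAFKA_OPTS }-Djava.security.auth.login.config=${jaas_file}"
}

# Assumed location; falls back to the snap data path when SNAP_DATA is unset.
KAFKA_OPTS="$(jaas_opts "${SNAP_DATA:-/var/snap/charmed-kafka/current}/etc/cruise-control/cruisecontrol.jaas")"
export KAFKA_OPTS
```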

Hard-coded logging path override

We provide a default Log4j2 config packaged with the snap that takes -Dcruisecontrol.logs.dir as a Java config, as the default config file provided by upstream is hard-coded to a path we don't want to use.

8c4
< property.filename=./logs
---
> property.filename=${sys:cruisecontrol.logs.dir}

NOTE: CruiseControl uses Log4J2, which has a slightly different syntax to what is already present in the snap for Kafka itself

-Dcruisecontrol.logs.dir is set in $KAFKA_LOG4J_OPTS in start-wrapper.bash, using the snap daemon environment variable $LOG_DIR.

We also pass -Dlog4j.configurationFile there, pointing to the file in $SNAP_DATA/etc/cruise-control.
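Put together, the two flags might be assembled as follows. This is a sketch rather than the contents of start-wrapper.bash: `log4j_opts` is a hypothetical helper, and the fallback paths are assumptions for running outside the snap.

```shell
# Sketch: build the Log4j2 system properties for CruiseControl.
# $1 = directory for log files, $2 = directory holding log4j.properties.
log4j_opts() {
    local log_dir="$1" conf_dir="$2"
    printf '%s\n' "-Dcruisecontrol.logs.dir=${log_dir} -Dlog4j.configurationFile=${conf_dir}/log4j.properties"
}

# LOG_DIR comes from the snap daemon environment; the defaults below are
# assumed fallbacks based on the layout described in this PR.
KAFKA_LOG4J_OPTS="$(log4j_opts \
    "${LOG_DIR:-${SNAP_COMMON:-/var/snap/charmed-kafka/common}/var/log/cruise-control}" \
    "${SNAP_DATA:-/var/snap/charmed-kafka/current}/etc/cruise-control")"
export KAFKA_LOG4J_OPTS
```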

Minimal properties that need to be set over defaults

11c11,13
< bootstrap.servers=localhost:9092
---
> bootstrap.servers=10.137.147.101:9092,10.137.147.207:9092,10.137.147.208:9092
> sasl.mechanism=SCRAM-SHA-512
> security.protocol=SASL_PLAINTEXT
90c92
< capacity.config.file=config/capacityJBOD.json
---
> capacity.config.file=/var/snap/charmed-kafka/current/etc/cruise-control/capacity.json
177c179
< zookeeper.connect=localhost:2181/
---
> zookeeper.connect=10.137.147.206:2181,10.137.147.231:2181,10.137.147.46:2181/kafka
180c182
< zookeeper.security.enabled=false
---
> zookeeper.security.enabled=true 

Proof it works (probably)

~  ❯ curl http://localhost:9090/kafkacruisecontrol/state
MonitorState: {state: RUNNING(9.600% trained), NumValidWindows: (4/5) (80.000%), NumValidPartitions: 117/117 (100.000%), flawedPartitions: 0}
ExecutorState: {state: NO_TASK_IN_PROGRESS}
AnalyzerState: {isProposalReady: true, readyGoals: [NetworkInboundUsageDistributionGoal, CpuUsageDistributionGoal, PotentialNwOutGoal, LeaderReplicaDistributionGoal, NetworkInboundCapacityGoal, LeaderBytesInDistributionGoal, DiskCapacityGoal, ReplicaDistributionGoal, RackAwareGoal, TopicReplicaDistributionGoal, NetworkOutboundCapacityGoal, CpuCapacityGoal, DiskUsageDistributionGoal, NetworkOutboundUsageDistributionGoal, ReplicaCapacityGoal]}
AnomalyDetectorState: {selfHealingEnabled:[], selfHealingDisabled:[DISK_FAILURE, BROKER_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT], selfHealingEnabledRatio:{DISK_FAILURE=0.0, BROKER_FAILURE=0.0, GOAL_VIOLATION=0.0, METRIC_ANOMALY=0.0, TOPIC_ANOMALY=0.0, MAINTENANCE_EVENT=0.0}, recentGoalViolations:[], recentBrokerFailures:[], recentMetricAnomalies:[], recentDiskFailures:[], recentTopicAnomalies:[], recentMaintenanceEvents:[], metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:0.00 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds, DISK_FAILURE:0.00 milliseconds, TOPIC_ANOMALY:0.00 milliseconds}, meanTimeToStartFix:0.00 milliseconds, numSelfHealingStarted:0, numSelfHealingFailedToStart:0, ongoingAnomalyDuration=0.00 milliseconds}, ongoingSelfHealingAnomaly:None, balancednessScore:100.000}
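The same check could be scripted, for example as a smoke test. The sketch below only inspects the first line of the response shown above; `monitor_is_running` is a hypothetical name, and treating RUNNING as the only healthy state is an assumption.

```shell
# Sketch: read a /state response on stdin and succeed only when the
# monitor reports the RUNNING state.
monitor_is_running() {
    grep -q '^MonitorState: {state: RUNNING'
}

# Example against a live instance (endpoint taken from the curl call above):
#   curl -sf http://localhost:9090/kafkacruisecontrol/state | monitor_is_running
```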

@marcoppenheimer marcoppenheimer marked this pull request as ready for review March 30, 2024 20:49

@Batalex Batalex left a comment

I like how clean the diff is, especially in the installation hook and the snapcraft file.
Do we need to add CC's license in the directory with the same name?

Out of scope question, should we remove kafka-export license since it is not part of the snap?


@deusebio deusebio left a comment

Changes look sane. I only have a couple of points I'd like to follow up on before approving.

@marcoppenheimer marcoppenheimer changed the title WIP - chore: add cruise-control to snap [DPE-3721] - chore: add cruise-control to snap May 28, 2024

marcoppenheimer commented May 28, 2024

Have re-requested review as it was a bit stale.
Since last review:

chore: add licenses

chore: add default cruisecontrol.properties file

chore: update log4j2 configuration

  • sys seems to be necessary
  • Removed console appender from writing to syslogs

Have also opened a PR on the Kafka repo using the KAFKA_LOG4J_OPTS where necessary - canonical/kafka-operator#201


@deusebio deusebio left a comment

I still have a few comments I'd like to get addressed. Also, thanks for investigating the main lookup for variable handling. I'm a bit disappointed that it does not work :(, but that's all right! 🤷‍♂️

But before merging, I also believe it would be very important to:

  1. Start Cruise Control in CI and run the curl command (that you mention in the PR description) to make sure the service is up and running. The defaults in the CC properties file, e.g. the ZooKeeper and Kafka endpoints, should be such that the plain "deployment" works out of the box (like what we have in CI for Kafka and ZooKeeper).

  2. Add the same steps to the README.md, to keep the documentation up to date.


# Loggers
logger.cruisecontrol.name=com.linkedin.kafka.cruisecontrol
logger.cruisecontrol.level=${main:cruisecontrol.log.level}

todo update with sys if main was not working


Why was this resolved? Can you clarify where and why we sometimes use sys and sometimes use main?


@Batalex Batalex left a comment

LGTM

@marcoppenheimer marcoppenheimer requested a review from deusebio May 30, 2024 00:13

@zmraul zmraul left a comment

Very nice! I hope the script to start cc wasn't a pain to get working 😬

Comment on lines 19 to 20
appender.kafkaCruiseControlAppender.policies.time.type=TimeBasedTriggeringPolicy
appender.kafkaCruiseControlAppender.policies.time.interval=1

I suggest keeping the same approach as we recently used with Kafka:

Suggested change
appender.kafkaCruiseControlAppender.policies.time.type=TimeBasedTriggeringPolicy
appender.kafkaCruiseControlAppender.policies.time.interval=1
appender.kafkaCruiseControlAppender.policies.size.type = SizeBasedTriggeringPolicy
appender.kafkaCruiseControlAppender.policies.size.size=100MB
appender.kafkaCruiseControlAppender.strategy.type = DefaultRolloverStrategy
appender.kafkaCruiseControlAppender.strategy.max = 10


@zmraul zmraul May 30, 2024

After checking, there are up to 6 log files between Kafka and CC. The log output of a fully running cluster might be too much if each unit is going to have up to 3 GB (uncompressed) of logs.

Comment on lines 29 to 30
appender.operationAppender.policies.time.type=TimeBasedTriggeringPolicy
appender.operationAppender.policies.time.interval=1

Suggested change
appender.operationAppender.policies.time.type=TimeBasedTriggeringPolicy
appender.operationAppender.policies.time.interval=1
appender.operationAppender.policies.size.type = SizeBasedTriggeringPolicy
appender.operationAppender.policies.size.size=100MB
appender.operationAppender.strategy.type = DefaultRolloverStrategy
appender.operationAppender.strategy.max = 10

Comment on lines 39 to 40
appender.requestAppender.policies.time.type=TimeBasedTriggeringPolicy
appender.requestAppender.policies.time.interval=1

Suggested change
appender.requestAppender.policies.time.type=TimeBasedTriggeringPolicy
appender.requestAppender.policies.time.interval=1
appender.requestAppender.policies.size.type = SizeBasedTriggeringPolicy
appender.requestAppender.policies.size.size=100MB
appender.requestAppender.strategy.type = DefaultRolloverStrategy
appender.requestAppender.strategy.max = 10

    interface: content
    source:
      read:
        - $SNAP_COMMON/var/log/cruise-control

apps:
  daemon:

nit: I would change this to be kafka. Makes things explicit once we start having multiple apps on the snap.

Contributor Author

Let's leave it for now and update it to something useful when we have KRaft and need to also update it everywhere in the charms and docs, but I see your point.


@deusebio deusebio left a comment

LGTM! Thanks!

@marcoppenheimer marcoppenheimer merged commit c266f9c into 3/edge May 30, 2024
2 checks passed
@marcoppenheimer marcoppenheimer deleted the feat/cruise_control branch May 30, 2024 15:13