-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor download logic #2651
Refactor download logic #2651
Conversation
@@ -60,7 +60,7 @@ | |||
"runtimeSettings": [ | |||
{ | |||
"handlerSettings": { | |||
"protectedSettingsCertThumbprint": "4037FBF5F1F3014F99B5D6C7799E9B20E6871CB3", | |||
"protectedSettingsCertThumbprint": "BD447EF71C3ADDF7C837E84D630F3FAC22CCD22F", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The older test data is encrypted using an algorithm that is no longer supported on OpenSSL 3.0. All the changes in the data directory are to update the test data with new certificates pulled from a current live VM.
You will see updates of the certificate thumbprints throughout this PR, as well as the certificate data itself.
Codecov Report
@@ Coverage Diff @@
## develop #2651 +/- ##
===========================================
- Coverage 72.03% 71.97% -0.06%
===========================================
Files 103 103
Lines 15760 15692 -68
Branches 2502 2486 -16
===========================================
- Hits 11352 11295 -57
+ Misses 3891 3881 -10
+ Partials 517 516 -1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
|
||
in_vm_gs_metadata = InVMGoalStateMetaData(find(xml_doc, "InVMGoalStateMetaData")) | ||
self._activity_id = self._string_to_id(in_vm_gs_metadata.activity_id) | ||
self._correlation_id = self._string_to_id(in_vm_gs_metadata.correlation_id) | ||
self._created_on_timestamp = self._ticks_to_utc_timestamp(in_vm_gs_metadata.created_on_ticks) | ||
|
||
def _fetch_extensions_on_hold(self, xml_doc, wire_client): | ||
@staticmethod | ||
def _fetch_extensions_on_hold(xml_doc, wire_client): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main change in this function is that I moved the download logic to the WireClient class (method fetch_artifacts_profile_blob) to make it consistent with other downloads.
I also ensured that we send telemetry for the cases that we assume OnHold is False because there is an error fetching the artifacts profile)
@@ -81,51 +80,42 @@ def _parse_extensions_config(self, xml_text, wire_client): | |||
self._status_upload_blob_type = getattrib(status_upload_node, "statusBlobType") | |||
logger.verbose("Extension config shows status blob type as [{0}]", self._status_upload_blob_type) | |||
|
|||
self._on_hold = self._fetch_extensions_on_hold(xml_doc, wire_client) | |||
self._on_hold = ExtensionsGoalStateFromExtensionsConfig._fetch_extensions_on_hold(xml_doc, wire_client) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made _fetch_extensions_on_hold() a static function, per pycharm's suggestion.
@@ -291,7 +291,7 @@ def _parse_extensions(self, vm_settings): | |||
# "settingsSeqNo": 0, | |||
# "settings": [ | |||
# { | |||
# "protectedSettingsCertThumbprint": "4037FBF5F1F3014F99B5D6C7799E9B20E6871CB3", | |||
# "protectedSettingsCertThumbprint": "BD447EF71C3ADDF7C837E84D630F3FAC22CCD22F", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of a global update. See my comment for the changes in the "data" directory
import azurelinuxagent.common.conf as conf | ||
import azurelinuxagent.common.logger as logger | ||
import azurelinuxagent.common.utils.textutil as textutil | ||
from azurelinuxagent.common import conf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are fixes for pylint warnings on Python 3.10
@@ -127,25 +130,6 @@ def get_ext_handler_pkgs(self, ext_handler): | |||
def get_goal_state(self): | |||
return self.client.get_goal_state() | |||
|
|||
def _download_ext_handler_pkg_through_host(self, uri, destination): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were refactored and moved to WireClient, where the other download methods are defined
return self.decode_config(resp.read()) | ||
|
||
def fetch_cache(self, local_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The agent used to have a disk cache for the goal state. We stopped using that cache a long time ago, but this function was still here and had a caller. Removed the function and modified the caller to use read_file directly.
uri, headers = host.get_artifact_request(uri) | ||
response, _ = self.fetch(uri, headers, use_proxy=False, retry_codes=restutil.HGAP_GET_EXTENSION_ARTIFACT_RETRY_CODES) | ||
return response | ||
def fetch_artifacts_profile_blob(self, uri): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the refactoring of the download logic begins. Most of the logic is in _download_with_fallback_channel() and _download_using_appropriate_channel(). fetch_artifacts_profile_blob(), fetch_manifest() and download_extension() just define the functions to use for the direct and HGAP call and then invoke _download_with_fallback_channel()
|
||
def stream(self, uri, destination, headers=None, use_proxy=None, max_retry=None): | ||
raise ExtensionDownloadError("Failed to download {0} from all URIs. Last error: {1}".format(download_type, ustr(most_recent_error)), code=ExtensionErrorCodes.PluginManifestDownloadError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I extended the error message to also include the errors from the direct and HGAP channels
os.remove(destination) | ||
except Exception as exception: | ||
logger.warn("Can't delete {0}: {1}", destination, ustr(exception)) | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now all these functions report errors consistently (using exceptions )
""" | ||
max_retry indicates the maximum number of retries for the HTTP request; None indicates that the default value should be used | ||
""" | ||
def _fetch_response(self, uri, headers=None, use_proxy=None, retry_codes=None, ok_codes=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the max_retry parameter, since it is no longer used. All downloads have the same strategy in terms of timeouts, retries and error reporting
""" | ||
try: | ||
ret = host_func() | ||
if ret in (None, False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was trying to handle the different error values (False, None, exceptions) of the different downloads. Now all functions use exceptions so this is no longer needed
@@ -871,9 +874,6 @@ def _call_hostplugin_with_container_check(self, host_func): | |||
try: | |||
ret = host_func() | |||
|
|||
if ret in (None, False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing -- now the code uses exceptions
host_func = lambda: self._download_ext_handler_pkg_through_host(uri, destination) | ||
|
||
try: | ||
success = self.client.send_request_using_appropriate_channel(direct_func, host_func) is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a bug that would not detect errors downloading extensions. The stream() method used for the download use to return False on error and this is comparing against None.
This is no longer an issue, since the refactored code now uses exceptions to report errors
@@ -743,10 +754,7 @@ def _fetch_response(self, uri, headers=None, use_proxy=None, max_retry=None, ret | |||
msg = "Fetch failed: {0}".format(error) | |||
logger.warn(msg) | |||
report_event(op=WALAEventOperation.HttpGet, is_success=False, message=msg, log_event=False) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method use to return None for errors, or raise an exception for InvalidContainerError and ResourceGoneError. Now it always reports errors with exceptions
@@ -900,70 +900,36 @@ def _call_hostplugin_with_container_check(self, host_func): | |||
log_event=True) | |||
raise | |||
|
|||
def __send_request_using_host_channel(self, host_func): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here: __send_request_using_host_channel() and __send_request_using_direct_channel() attempted to provide a uniform interface to download report errors. They are no longer needed, since all the download functions use exceptions.
@@ -1160,9 +1126,12 @@ def get_header_for_xml_content(self): | |||
} | |||
|
|||
def get_header_for_cert(self): | |||
trans_cert_file = os.path.join(conf.get_lib_dir(), | |||
TRANSPORT_CERT_FILE_NAME) | |||
content = self.fetch_cache(trans_cert_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was using fetch_cache(), which is a leftover from older agent versions that cached the goal state on disk. I removed that function and this code invokes read_file directly
@@ -1181,10 +1150,6 @@ def get_on_hold(self): | |||
return self.get_goal_state().extensions_goal_state.on_hold | |||
|
|||
def upload_logs(self, content): | |||
host_func = lambda: self._upload_logs_through_host(content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_call_hostplugin_with_container_check() implements retry logic when the HGAP returns 410 (ResourceGone). Those retries are needed only for the extensionsArtifact API. Upload log does not neeed this.
if not self.protocol.download_ext_handler_pkg(source_uri, target_file): | ||
raise Exception("Failed to download extension package from {0}".format(source_uri)) | ||
except Exception as exception: | ||
self.logger.info("Error downloading extension package: {0}", ustr(exception)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cleanup logic is now part of WireClient.stream()
@@ -1274,33 +1259,8 @@ def download(self): | |||
self.logger.info("The existing extension package is invalid, will ignore it.") | |||
|
|||
if not package_exists: | |||
downloaded = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Download extension used to have its own retry logic. No longer needed. All the download logic is now in 1 single place and is consistent across artifact blob, manifests, and extension packages
@@ -3302,9 +3304,13 @@ def manifest_location_handler(url, **kwargs): | |||
with mock_wire_protocol(self.test_data, http_get_handler=manifest_location_handler) as protocol: | |||
ext_handlers = protocol.get_goal_state().extensions_goal_state.extensions | |||
|
|||
with self.assertRaises(ExtensionDownloadError): | |||
protocol.client.fetch_manifest(ext_handlers[0].manifest_uris, | |||
timeout_in_minutes=0, timeout_in_ms=200) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout_in_minutes and timeout_in_ms parameters were used by test code only, so I removed them. Now the test code uses _DOWNLOAD_TIMEOUT instead.
@@ -35,6 +36,8 @@ def setUp(self): | |||
ext_handler.version = "1.0.0" | |||
|
|||
protocol = WireProtocol("http://Microsoft.CPlat.Core.RunCommandLinux/foo-bar") | |||
protocol.client.get_host_plugin = Mock() | |||
protocol.client.get_artifact_request = Mock(return_value=(None, None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mocks in this class have very internal knowledge of the code they are testing. They need rewriting, but for now I just defined a mock HGAP to account for the code I refactored.
@@ -95,44 +98,52 @@ def _assert_download_and_expand_succeeded(self): | |||
self.assertTrue(os.path.exists(self._get_extension_package_file()), "The extension package was not downloaded to the expected location") | |||
self.assertTrue(os.path.exists(self._get_extension_command_file()), "The extension package was not expanded to the expected location") | |||
|
|||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behaviours here used to be in implemented by WireProtocol.download_ext_handler_pkg, but now they are implemented by WireClient.stream
update_handler._report_status(exthandlers_handler) | ||
update_handler._report_status(exthandlers_handler) | ||
update_handler._report_status(exthandlers_handler) | ||
self.assertEqual(1, logger_warn.call_count, "UpdateHandler._report_status() should report only 1 WARNING when there are multiple errors within the same goal state") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code assumes there will be only 1 warning in the log. Other components could also log warnings (I added a couple). I fixed the code to filter out the wanings that are not relevant to these tests
@@ -838,50 +835,18 @@ def http_get_handler(url, *_, **kwargs): | |||
self.assertTrue(self.is_host_plugin_extension_artifact_request(urls[3]), "The retry request should have been over the host channel") | |||
self.assertFalse(HostPluginProtocol.is_default_channel, "The default channel should not have changed") | |||
|
|||
def test_upload_logs_should_not_refresh_plugin_when_first_attempt_succeeds(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upload logs does not need the retry logic for 410. I removed that code and its corresponding tests below
@staticmethod | ||
def _set_and_fail_helper_channel_functions(fail_direct=False, fail_host=False): | ||
def direct_func(*_): | ||
direct_func.counter += 1 | ||
if direct_func.fail: | ||
return None | ||
raise Exception("Direct channel failed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mocks need to be updated to raise exceptions instead of reporting errors with None or False
with mock_wire_protocol(mockwiredata.DATA_FILE) as protocol: | ||
# Scenario #1: Direct channel is default | ||
HostPluginProtocol.is_default_channel = False | ||
direct_func, host_func = self._set_and_fail_helper_channel_functions(fail_direct=True, fail_host=True) | ||
|
||
# Assert we keep trying both channels, but the default channel doesn't change | ||
for iteration in range(5): | ||
ret = protocol.client.send_request_using_appropriate_channel(direct_func, host_func) | ||
self.assertEqual(None, ret) | ||
with self.assertRaises(HttpError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now errors raise exceptions
@narrieta Looks to me you haven't touch the agent artifacts and manifest download logic. it has its own logic inside the update.py. Is that something we want that way or have a plan to change that? |
@nagworld9 - Good point. The agent manifest should be covered by this PR. The Agent package is a lot trickier to address given the current code. I will refactor that on a separate PR. |
The code to download the artifact profile blob, manifests (agent and extensions), and extension ZIPs followed different code paths with different implementations of the retry logic, error reporting, etc.
This change refactors all that code into a common path.
The main difficulty for this refactoring is that the underlying code to do the network requests was using different functions that handle errors in different ways: some functions would return None, some would return False, some would raise exceptions and some would do a combination of these. In the past, that inconsistent error handling has produced issues in the Agent. While doing this refactoring I noticed that the extension download code was checking for None, while the underlying implementation was returning False, so errors downloading extensions were being ignored.
The refactored code reports errors consistently, by raising exceptions.
As part of these changes, I added error information to the telemetry event that reports download errors to include the error messages from the direct and HostGAPlugin calls. This makes monitoring telemetry data simpler. Sample error message:
This refactoring will allow us to implement other changes needed in the agent much more easily: we need to add extra headers to the HostGAPlugin requests for Fast Track goal states, and we need to keep track of metrics for HostGAPlugin reliability.
While I was working on these changes I updated my workstation to Ubuntu 22.04, which comes with Python 3.10.4 and OpenSSL 3.0.2.
Pylint has additional checks on 3.10, and this PR includes fixes for those warnings on the files that I touched.
Some of the older test data for certificates are encrypted using a deprecated algorithm that is no longer supported by OpenSSL 3.0. This PR also updates those test data with samples taken from a current VM.