Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor download logic #2651

Merged
merged 3 commits into from
Aug 17, 2022
Merged

Refactor download logic #2651

merged 3 commits into from
Aug 17, 2022

Conversation

narrieta
Copy link
Member

The code to download the artifact profile blob, manifests (agent and extensions), and extension ZIPs followed different code paths with different implementations of the retry logic, error reporting, etc.

This change refactors all that code into a common path.

The main difficulty for this refactoring is that the underlying code to do the network requests was using different functions that handle errors in different ways: some functions would return None, some would return False, some would raise exceptions and some would do a combination of these. In the past, that inconsistent error handling has produced issues in the Agent. While doing this refactoring I noticed that the extension download code was checking for None, while the underlying implementation was returning False, so errors downloading extensions were being ignored.

The refactored code reports errors consistently, by raising exceptions.

As part of these changes, I added error information to the telemetry event that reports download errors to include the error messages from the direct and HostGAPlugin calls. This makes monitoring telemetry data simpler. Sample error message:

2022-08-15T21:37:23.876662Z ERROR ExtHandler ExtHandler Event: name=Microsoft.Azure.Extensions.CustomScript, op=Download, message=Failed to download artifacts: [ExtensionDownloadError] Failed to download extension package from all URIs. 
Last error: [HttpError] Download failed both on the primary and fallback channels. 
Primary: [[HttpError] [HTTP Failed] GET https://umsalvlwk4lfvxs4w05v.blob.core.windows.net2/5237dd14-0aad-f051-0fad-1e33e1b63091/5237dd14-0aad-f051-0fad-1e33e1b63091_2.1.7.zip -- IOError [Errno -2] Name or service not known -- 6 attempts made] 
Fallback: [[ProtocolError] Fetch failed from [http://168.63.129.16:32526/extensionArtifact]: [HTTP Failed] [400: Bad Request] b''], duration=0

This refactoring will allow us to implement other changes needed in the agent much more easily: we need to add extra headers to the HostGAPlugin requests for Fast Track goal states, and we need to keep track of metrics for HostGAPlugin reliability.

While I was working on these changes I updated my workstation to Ubuntu 22.04, which comes with Python 3.10.4 and OpenSSL 3.0.2.

Pylint has additional checks on 3.10, and this PR includes fixes for those warnings on the files that I touched.

Some of the older test data for certificates are encrypted using a deprecated algorithm that is no longer supported by OpenSSL 3.0. This PR also updates those test data with samples taken from a current VM.

@@ -60,7 +60,7 @@
"runtimeSettings": [
{
"handlerSettings": {
"protectedSettingsCertThumbprint": "4037FBF5F1F3014F99B5D6C7799E9B20E6871CB3",
"protectedSettingsCertThumbprint": "BD447EF71C3ADDF7C837E84D630F3FAC22CCD22F",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The older test data is encrypted using an algorithm that is no longer supported on OpenSSL 3.0. All the changes in the data directory are to update the test data with new certificates pulled from a current live VM.

You will see updates of the certificate thumbprints throughout this PR, as well as the certificate data itself.

@codecov
Copy link

codecov bot commented Aug 15, 2022

Codecov Report

Merging #2651 (06c2deb) into develop (e2b61fe) will decrease coverage by 0.05%.
The diff coverage is 92.30%.

@@             Coverage Diff             @@
##           develop    #2651      +/-   ##
===========================================
- Coverage    72.03%   71.97%   -0.06%     
===========================================
  Files          103      103              
  Lines        15760    15692      -68     
  Branches      2502     2486      -16     
===========================================
- Hits         11352    11295      -57     
+ Misses        3891     3881      -10     
+ Partials       517      516       -1     
Impacted Files Coverage Δ
...protocol/extensions_goal_state_from_vm_settings.py 83.45% <ø> (ø)
azurelinuxagent/common/protocol/wire.py 78.54% <90.24%> (-0.04%) ⬇️
...ol/extensions_goal_state_from_extensions_config.py 92.05% <100.00%> (+0.82%) ⬆️
azurelinuxagent/ga/exthandlers.py 85.69% <100.00%> (-0.18%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.


in_vm_gs_metadata = InVMGoalStateMetaData(find(xml_doc, "InVMGoalStateMetaData"))
self._activity_id = self._string_to_id(in_vm_gs_metadata.activity_id)
self._correlation_id = self._string_to_id(in_vm_gs_metadata.correlation_id)
self._created_on_timestamp = self._ticks_to_utc_timestamp(in_vm_gs_metadata.created_on_ticks)

def _fetch_extensions_on_hold(self, xml_doc, wire_client):
@staticmethod
def _fetch_extensions_on_hold(xml_doc, wire_client):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main change in this function is that I moved the download logic to the WireClient class (method fetch_artifacts_profile_blob) to make it consistent with other downloads.

I also ensured that we send telemetry for the cases that we assume OnHold is False because there is an error fetching the artifacts profile)

@@ -81,51 +80,42 @@ def _parse_extensions_config(self, xml_text, wire_client):
self._status_upload_blob_type = getattrib(status_upload_node, "statusBlobType")
logger.verbose("Extension config shows status blob type as [{0}]", self._status_upload_blob_type)

self._on_hold = self._fetch_extensions_on_hold(xml_doc, wire_client)
self._on_hold = ExtensionsGoalStateFromExtensionsConfig._fetch_extensions_on_hold(xml_doc, wire_client)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made _fetch_extensions_on_hold() a static function, per pycharm's suggestion.

@@ -291,7 +291,7 @@ def _parse_extensions(self, vm_settings):
# "settingsSeqNo": 0,
# "settings": [
# {
# "protectedSettingsCertThumbprint": "4037FBF5F1F3014F99B5D6C7799E9B20E6871CB3",
# "protectedSettingsCertThumbprint": "BD447EF71C3ADDF7C837E84D630F3FAC22CCD22F",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of a global update. See my comment for the changes in the "data" directory

import azurelinuxagent.common.conf as conf
import azurelinuxagent.common.logger as logger
import azurelinuxagent.common.utils.textutil as textutil
from azurelinuxagent.common import conf
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are fixes for pylint warnings on Python 3.10

@@ -127,25 +130,6 @@ def get_ext_handler_pkgs(self, ext_handler):
def get_goal_state(self):
return self.client.get_goal_state()

def _download_ext_handler_pkg_through_host(self, uri, destination):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were refactored and moved to WireClient, where the other download methods are defined

return self.decode_config(resp.read())

def fetch_cache(self, local_file):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The agent used to have a disk cache for the goal state. We stopped using that cache a long time ago, but this function was still here and had a caller. Removed the function and modified the caller to use read_file directly.

uri, headers = host.get_artifact_request(uri)
response, _ = self.fetch(uri, headers, use_proxy=False, retry_codes=restutil.HGAP_GET_EXTENSION_ARTIFACT_RETRY_CODES)
return response
def fetch_artifacts_profile_blob(self, uri):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the refactoring of the download logic begins. Most of the logic is in _download_with_fallback_channel() and _download_using_appropriate_channel(). fetch_artifacts_profile_blob(), fetch_manifest() and download_extension() just define the functions to use for the direct and HGAP call and then invoke _download_with_fallback_channel()


def stream(self, uri, destination, headers=None, use_proxy=None, max_retry=None):
raise ExtensionDownloadError("Failed to download {0} from all URIs. Last error: {1}".format(download_type, ustr(most_recent_error)), code=ExtensionErrorCodes.PluginManifestDownloadError)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extended the error message to also include the errors from the direct and HGAP channels

os.remove(destination)
except Exception as exception:
logger.warn("Can't delete {0}: {1}", destination, ustr(exception))
raise
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now all these functions report errors consistently (using exceptions )

"""
max_retry indicates the maximum number of retries for the HTTP request; None indicates that the default value should be used
"""
def _fetch_response(self, uri, headers=None, use_proxy=None, retry_codes=None, ok_codes=None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the max_retry parameter, since it is no longer used. All downloads have the same strategy in terms of timeouts, retries and error reporting

"""
try:
ret = host_func()
if ret in (None, False):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was trying to handle the different error values (False, None, exceptions) of the different downloads. Now all functions use exceptions so this is no longer needed

@@ -871,9 +874,6 @@ def _call_hostplugin_with_container_check(self, host_func):
try:
ret = host_func()

if ret in (None, False):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing -- now the code uses exceptions

host_func = lambda: self._download_ext_handler_pkg_through_host(uri, destination)

try:
success = self.client.send_request_using_appropriate_channel(direct_func, host_func) is not None
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bug that would not detect errors downloading extensions. The stream() method used for the download use to return False on error and this is comparing against None.

This is no longer an issue, since the refactored code now uses exceptions to report errors

@@ -743,10 +754,7 @@ def _fetch_response(self, uri, headers=None, use_proxy=None, max_retry=None, ret
msg = "Fetch failed: {0}".format(error)
logger.warn(msg)
report_event(op=WALAEventOperation.HttpGet, is_success=False, message=msg, log_event=False)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method use to return None for errors, or raise an exception for InvalidContainerError and ResourceGoneError. Now it always reports errors with exceptions

@@ -900,70 +900,36 @@ def _call_hostplugin_with_container_check(self, host_func):
log_event=True)
raise

def __send_request_using_host_channel(self, host_func):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here: __send_request_using_host_channel() and __send_request_using_direct_channel() attempted to provide a uniform interface to download report errors. They are no longer needed, since all the download functions use exceptions.

@@ -1160,9 +1126,12 @@ def get_header_for_xml_content(self):
}

def get_header_for_cert(self):
trans_cert_file = os.path.join(conf.get_lib_dir(),
TRANSPORT_CERT_FILE_NAME)
content = self.fetch_cache(trans_cert_file)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was using fetch_cache(), which is a leftover from older agent versions that cached the goal state on disk. I removed that function and this code invokes read_file directly

@@ -1181,10 +1150,6 @@ def get_on_hold(self):
return self.get_goal_state().extensions_goal_state.on_hold

def upload_logs(self, content):
host_func = lambda: self._upload_logs_through_host(content)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_call_hostplugin_with_container_check() implements retry logic when the HGAP returns 410 (ResourceGone). Those retries are needed only for the extensionsArtifact API. Upload log does not neeed this.

if not self.protocol.download_ext_handler_pkg(source_uri, target_file):
raise Exception("Failed to download extension package from {0}".format(source_uri))
except Exception as exception:
self.logger.info("Error downloading extension package: {0}", ustr(exception))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cleanup logic is now part of WireClient.stream()

@@ -1274,33 +1259,8 @@ def download(self):
self.logger.info("The existing extension package is invalid, will ignore it.")

if not package_exists:
downloaded = False
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Download extension used to have its own retry logic. No longer needed. All the download logic is now in 1 single place and is consistent across artifact blob, manifests, and extension packages

@@ -3302,9 +3304,13 @@ def manifest_location_handler(url, **kwargs):
with mock_wire_protocol(self.test_data, http_get_handler=manifest_location_handler) as protocol:
ext_handlers = protocol.get_goal_state().extensions_goal_state.extensions

with self.assertRaises(ExtensionDownloadError):
protocol.client.fetch_manifest(ext_handlers[0].manifest_uris,
timeout_in_minutes=0, timeout_in_ms=200)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout_in_minutes and timeout_in_ms parameters were used by test code only, so I removed them. Now the test code uses _DOWNLOAD_TIMEOUT instead.

@@ -35,6 +36,8 @@ def setUp(self):
ext_handler.version = "1.0.0"

protocol = WireProtocol("http://Microsoft.CPlat.Core.RunCommandLinux/foo-bar")
protocol.client.get_host_plugin = Mock()
protocol.client.get_artifact_request = Mock(return_value=(None, None))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mocks in this class have very internal knowledge of the code they are testing. They need rewriting, but for now I just defined a mock HGAP to account for the code I refactored.

@@ -95,44 +98,52 @@ def _assert_download_and_expand_succeeded(self):
self.assertTrue(os.path.exists(self._get_extension_package_file()), "The extension package was not downloaded to the expected location")
self.assertTrue(os.path.exists(self._get_extension_command_file()), "The extension package was not expanded to the expected location")

@staticmethod
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaviours here used to be in implemented by WireProtocol.download_ext_handler_pkg, but now they are implemented by WireClient.stream

update_handler._report_status(exthandlers_handler)
update_handler._report_status(exthandlers_handler)
update_handler._report_status(exthandlers_handler)
self.assertEqual(1, logger_warn.call_count, "UpdateHandler._report_status() should report only 1 WARNING when there are multiple errors within the same goal state")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code assumes there will be only 1 warning in the log. Other components could also log warnings (I added a couple). I fixed the code to filter out the wanings that are not relevant to these tests

@@ -838,50 +835,18 @@ def http_get_handler(url, *_, **kwargs):
self.assertTrue(self.is_host_plugin_extension_artifact_request(urls[3]), "The retry request should have been over the host channel")
self.assertFalse(HostPluginProtocol.is_default_channel, "The default channel should not have changed")

def test_upload_logs_should_not_refresh_plugin_when_first_attempt_succeeds(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upload logs does not need the retry logic for 410. I removed that code and its corresponding tests below

@staticmethod
def _set_and_fail_helper_channel_functions(fail_direct=False, fail_host=False):
def direct_func(*_):
direct_func.counter += 1
if direct_func.fail:
return None
raise Exception("Direct channel failed")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mocks need to be updated to raise exceptions instead of reporting errors with None or False

with mock_wire_protocol(mockwiredata.DATA_FILE) as protocol:
# Scenario #1: Direct channel is default
HostPluginProtocol.is_default_channel = False
direct_func, host_func = self._set_and_fail_helper_channel_functions(fail_direct=True, fail_host=True)

# Assert we keep trying both channels, but the default channel doesn't change
for iteration in range(5):
ret = protocol.client.send_request_using_appropriate_channel(direct_func, host_func)
self.assertEqual(None, ret)
with self.assertRaises(HttpError):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now errors raise exceptions

@narrieta narrieta merged commit 3aebcdd into Azure:develop Aug 17, 2022
@narrieta narrieta deleted the download branch August 17, 2022 18:14
@nagworld9
Copy link
Contributor

@narrieta Looks to me you haven't touch the agent artifacts and manifest download logic. it has its own logic inside the update.py. Is that something we want that way or have a plan to change that?

@narrieta
Copy link
Member Author

narrieta commented Sep 1, 2022

@nagworld9 - Good point. The agent manifest should be covered by this PR. The Agent package is a lot trickier to address given the current code. I will refactor that on a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants