extract_msg should provide easy access to useful attachment info #256

Closed
2 of 3 tasks
gwiedeman opened this issue May 24, 2022 · 11 comments
Labels
  • Complete: This feature has been fully implemented.
  • enhancement
  • Partially Accepted: The feature request has been accepted in part, and may fully be accepted later.

Comments

@gwiedeman
Contributor

Bug Metadata

  • Version of extract_msg: 0.30.10
  • Your python version: Python 3.9.12
  • How did you launch extract_msg?
    • My command line or
    • I used the extract_msg package

Describe the bug
extract_msg is a wonderful library for reading detailed information from MSG files. However, it does not seem to provide easy access to attachments' mime type or content-disposition/attachment method (whether it is a regular attachment or an "inline" embedded image). The .renderingPosition attribute does not seem to be helpful, as it seems to return 4294967295 for both inline and attachment content dispositions.

What code did you use or can we use to reproduce this error?

This is kind of what I'm expecting:

import extract_msg

mail = extract_msg.openMsg("path/to/message.msg")
for mailAttachment in mail.attachments:
    print(mailAttachment.contentType)
    print(mailAttachment.contentDisposition)
# Hoped-for output:
# image/png
# inline
# application/pdf
# attachment

According to the libpff docs (which I think apply to msg files as well), the entry types should be 0x370e for the mime type and possibly 0x3705 for attachment method, but I'm unsure about the second. 0x370e appears to work using the _ensureSet() method:

mail = extract_msg.openMsg("path/to/message.msg")
for mailAttachment in mail.attachments:
    # 0x370e is the entry type that holds the attachment's mime type.
    print(mailAttachment._ensureSet('_contentType', '__substg1.0_370e'))
# Output:
# image/png
# application/pdf

I think I can put in a PR to add content type if that would be helpful. It seems straightforward.

However, I'm not sure how to access the content disposition/attachment method. Doing it the same way returns None for me. mailAttachment._ensureSetProperty('_attachmentMethod', '37050003') seems to return 1 for both inline and attachment. Looking at the libpff docs, I would expect inline attachments to return 2 or perhaps 5? Not sure if the example I'm emailing is representative.
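One likely reason the two lookups above behave differently: in the MSG layout, variable-length values (strings, binary) get their own `__substg1.0_` stream named after the property ID and type, while fixed-length values such as 32-bit integers are stored in the shared properties stream and have no stream of their own. A minimal sketch of that naming scheme follows. The helper names are hypothetical and are not part of extract_msg.

```python
from typing import Optional

# A subset of MAPI property type codes.
PT_LONG = 0x0003     # 32-bit integer, fixed length
PT_STRING8 = 0x001E  # 8-bit string, variable length
PT_UNICODE = 0x001F  # UTF-16 string, variable length
PT_BINARY = 0x0102   # byte blob, variable length

VARIABLE_LENGTH_TYPES = {PT_STRING8, PT_UNICODE, PT_BINARY}


def property_tag(prop_id: int, prop_type: int) -> str:
    """Return the 8-hex-digit tag for a property, e.g. '37050003'."""
    return f"{prop_id:04X}{prop_type:04X}"


def stream_name(prop_id: int, prop_type: int) -> Optional[str]:
    """Return the stream name for a variable-length property.

    Fixed-length properties have no stream of their own, so this
    hypothetical helper returns None for them.
    """
    if prop_type not in VARIABLE_LENGTH_TYPES:
        return None
    return "__substg1.0_" + property_tag(prop_id, prop_type)


print(stream_name(0x370E, PT_UNICODE))  # mime type stored as a Unicode string
print(stream_name(0x3705, PT_LONG))     # attach method has no stream
```

Under this reading, the mime type (0x370E) can be found by stream name, whereas the attach method (0x3705, a 32-bit integer) has to be looked up by its tag in the properties stream.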

Is there a message.msg file you want to share to help us reproduce this?

  • [] Uploaded message (drag and drop on this window)
  • Emailed message as an attachment to admins: Example file for Issue 256

Traceback

n/a

Screenshots
n/a

Additional context
I'm not sure how MSGs reference or handle inline attachments. For MBOX/EML it seems to be denoted in a Content-Disposition header, which is what I'm referring to, but it's very possible that MSGs don't manage this the way I'm expecting.

@TheElementalOfDestruction
Collaborator

I believe we had a long discussion about this on the Discord, with the result effectively being "it's not feasible" given how inconsistent the format is. Much of the documentation is "SHOULD" instead of "MUST", meaning that you can never rely on things to be concrete. Half of the time you would just get "unknown" for what you are asking.

A lot of the time, the way the html figures out how to display content is by having tags that include it, but I haven't managed to fully figure out the RTF. It also seems to be trying to use some kind of tag system for this as well. This system seems to be the only reliable detection method, and it's rather hard to work with. I'll take a look at your specific file and see what info it may have though.

@TheElementalOfDestruction
Collaborator

I'd like to use one of my own test files as a brief example of things that would not easily be feasible. Giving the content type would, simply put, require data analysis to determine the type of data. The content type of the attachment is not a requirement, nor is a file extension.

File extension is effectively the worst way to determine file type, as multiple formats could share the same extension or the file could just have a bad format to begin with. If the file provides some kind of identifier for the data then we could just return that. If all other methods fail, we either need to return "unknown" or use an analysis of the bytes of the file to determine the type. Of course, trying to manually implement that myself would be a terrible idea, so I would have to use something that already exists, adding another dependency to the module, a less than ideal solution. I could do it, of course, but I have to determine if it is worth it.
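To make "analysis of the bytes" concrete, here is a deliberately tiny sketch of the fallback order described above: use the identifier the file declares if there is one, otherwise look at leading byte signatures, otherwise report nothing. The function names and the signature table are illustrative assumptions. As the comment says, real detection should be delegated to an existing library rather than a hand-rolled table like this one.

```python
from typing import Optional

# Illustration only: a handful of well-known leading byte signatures.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
)


def sniff_mimetype(data: bytes) -> Optional[str]:
    """Guess a mime type from the first bytes, or return None."""
    for magic, mimetype in _SIGNATURES:
        if data.startswith(magic):
            return mimetype
    return None


def resolve_mimetype(declared: Optional[str], data: bytes) -> Optional[str]:
    """Prefer the declared type; fall back to sniffing; else None."""
    return declared or sniff_mimetype(data)


print(resolve_mimetype(None, b"%PDF-1.7 ..."))
print(resolve_mimetype(None, b"no known signature"))
```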

@TheElementalOfDestruction
Collaborator

(By the way, sorry to use so many messages, but this happens frequently where I send a message and go away only to keep thinking about something and go look for more information)

So here is a module that I could potentially use, however it requires an external library to be installed as well. The python module is here: https://pypi.org/project/mimetypes-magic/

This module can get a mimetype or mimetype-like string for a data stream or file, which we could then return as the content type, should it be recognized. I'm not sure how it handles unknown types yet, but that is something that can be figured out. Of course, I'd like to give at least partial access to the interface so people can customize some of the behavior, so that would take a bit to set up. If you think it would be worth it I can start the process of adding it.

Aside from that, here is what I could find about your attachments. All of your attachments do have mimetype properties, accessible with Attachment._getTypedData('370E'). This could, of course, be mapped to a temporary property that is accessible as long as the stream is there or if we add support with that module.

I've confirmed that the PNG that I would expect to not have a null rendering position does in fact have that, but looking at the bodies the reason for this is likely because of how it is actually rendered. I can see clearly that the HTML data simply uses a tag to insert the image into the text, although the way the RTF is doing it is not clear to me. I assume it is also using a tag to insert it. My guess is that the rendering position being anything else would mean the bodies would have 2 of the same image.

In addition, it is made explicit that it will be rendered in HTML and RTF, as its attachment flags property has the value 4, attRenderedInBody. This is, depending on whether it consistently exists, our best chance at actually detecting if something is rendered. However, the documentation says that if the value is 0 or absent, it is effectively saying "anything goes" and we will not be able to really determine anything. In fact, both of the other attachments are specifically missing this property. At best this property can confirm things, but the lack of it confirms nothing. Here is a copy of the exact documentation regarding this property:
[image: screenshot of the documentation for the attachment flags property]
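The "can confirm, but cannot rule out" reading of the attachment flags can be written down as a three-way answer. This is only a sketch: the value 4 for attRenderedInBody comes from the comment above, the other two flag values are my recollection of the same documentation table, and the function name is hypothetical rather than anything extract_msg exposes.

```python
from enum import IntFlag
from typing import Optional


class AttachFlags(IntFlag):
    # Values 1 and 2 are assumed from the attachment flags documentation;
    # value 4 (attRenderedInBody) is the one cited in this thread.
    INVISIBLE_IN_HTML = 0x1
    INVISIBLE_IN_RTF = 0x2
    RENDERED_IN_BODY = 0x4


def rendered_in_body(flags: Optional[int]) -> Optional[bool]:
    """Return True if the flags confirm rendering, otherwise None.

    A missing or zero value means "anything goes", so this never
    returns False: absence of the flag confirms nothing.
    """
    if flags is not None and flags & AttachFlags.RENDERED_IN_BODY:
        return True
    return None


print(rendered_in_body(4))     # confirmed
print(rendered_in_body(None))  # unknown
```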

TL;DR: A lot of information can be found, but the lack of such information generally means that anything could be the case. If you are feeling frustrated at this, this is what I have to deal with when trying to implement features for this module.

@TheElementalOfDestruction added the enhancement and Partially Accepted labels on Jun 2, 2022
@TheElementalOfDestruction
Collaborator

I'm doing what I can to add access to some of this, but let me clarify a few things you mentioned in the initial post.

  1. You mentioned attach method, but it's not what you think. For context, this is what it is:
    [image: screenshot of the documentation for the attach method property]
    To be specific, it refers to how you access the data, rather than how the data is going to be used. Attach flags is more likely to help but again is absolutely no guarantee. This property is actually used internally by the attachment class. A value of 5 means it's an embedded msg file. A value of 1 is the most typical, and I've confirmed that many attachments that are rendered or not will both use 1.
  2. You are right about the mimetype being in 0x370E, but it doesn't have to be set, as I mentioned.
  3. The following is a list of mime properties. Looking at it, it looks like ContentId being set may actually be the best way to tell if something is being rendered in the body. I've exclusively seen things referenced in the body (unless they are application specific like outlook signatures, in which case everything goes out the window) through the cid, a property that can already be accessed.
    [image: screenshot of the list of mime properties]
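For reference, the attach method values discussed in point 1 can be named with a small enum. Only the meanings of 1 (the typical case) and 5 (embedded msg file) are confirmed in this thread. The remaining members are filled in from my reading of the attach method documentation and should be treated as assumptions, as should the helper name.

```python
from enum import IntEnum


class AttachMethod(IntEnum):
    # 1 and 5 are confirmed above; the rest are assumed from the docs.
    NONE = 0
    BY_VALUE = 1           # data stored in the attachment itself (most typical)
    BY_REFERENCE = 2
    BY_REFERENCE_ONLY = 4
    EMBEDDED_MESSAGE = 5   # the attachment is an embedded msg file
    STORAGE = 6
    BY_WEB_REFERENCE = 7


def describe_attach_method(value: int) -> str:
    """Return a readable name for an attach method value."""
    try:
        return AttachMethod(value).name
    except ValueError:
        return f"UNKNOWN({value})"


print(describe_attach_method(1))
print(describe_attach_method(5))
```

Note that none of these values says anything about inline versus regular display, which matches the observation that both kinds of attachment report 1.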

@TheElementalOfDestruction
Collaborator

Partial support for this has been implemented. You can now access more of the fields that the msg reports directly in v0.31.0.

@gwiedeman
Contributor Author

This looks awesome, thank you for tackling this @TheElementalOfDestruction! The updated attributes work great. I looked at 0x3705 for a larger PST dataset we have and I agree it's not really useful at all and that matching it via the RTF/HTML is the way to go. I also think it's fine to just return None for missing mimeTypes and ContentIDs. Seems reasonable to me that users/downstream can try mimetypes-magic or similar if they need more than None. Thanks as always for maintaining!

@TheElementalOfDestruction
Collaborator

So, I have a bit of an update on this, and I need to cover a few things so just bear with me.

This next version I'm working on (0.35.0) is the first where we actually implement optional dependencies for some of the more advanced features. This is one of those features. The interface shouldn't be changing much depending on how you are using it (if you are accessing properties on Message, Attachment, or any of the other classes, they should still be the same, but they have been made more consistent, so there are now additional properties you can access that are a lot more universal) but I'm actually going to be trying to add an optional dependency setting for getting a mimetype on something where it hasn't been set by allowing the data to be analyzed.

I'm not sure which module exactly we are going to be using, but it will probably be the one I sent. Issues with getting the mimetype will be logged but fully suppressed otherwise, so users will only notice files are giving mime types more often rather than getting any errors. The logging is there so users can see an issue is happening if they look at the logs and report them for fixing.
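The "logged but suppressed" behavior described above could look roughly like the following. This is a sketch under assumptions, not the module's actual code: `safe_mimetype` and its `detector` parameter are hypothetical names, and the detector stands in for whichever optional library ends up being used.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_mimetype(
    declared: Optional[str],
    data: bytes,
    detector: Optional[Callable[[bytes], Optional[str]]] = None,
) -> Optional[str]:
    """Return the declared mime type, else try the optional detector.

    Any error raised by the detector is logged and swallowed, so the
    caller only ever sees a mime type or None.
    """
    if declared:
        return declared
    if detector is None:
        return None
    try:
        return detector(data)
    except Exception:
        logger.exception("Mimetype detection failed; returning None.")
        return None


# A detector that always fails still yields None instead of an error.
print(safe_mimetype(None, b"data", lambda d: 1 / 0))
```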

I've also decided to take a much closer look at how MSG files handle inline attachments, and things jumped to a whole new level of weird when it comes to Content IDs... you see, the bodies can refer to content ids that just don't exist, or don't exist in the way they should. I have two examples that greatly illustrate what I mean:

  1. MSG files with Outlook signatures involving an image will save in a custom, currently unreadable format. While the image is stored as a bitmap, parsing it in any reasonable way to figure out where to put it and stuff like that has come to a standstill. However, the RTF body will actually refer to it by a Content ID... a content ID not actually listed in the attachment. I checked every stream of the attachment and the Content ID is not listed in any of them, leaving me to wonder just how I am supposed to figure it out. However, this one is a bit special in that it does actually have a rendering position... something the other attachments that are not inline also have.
  2. MSG files may have references to Content IDs that are for files in embedded messages. However, this is actually invalid, and outlook itself will completely fail to render these. Why this happens is actually a complete mystery.
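Both cases above come down to a body referring to Content IDs that no top-level attachment declares. For an HTML body, a rough way to spot this is to collect the `cid:` references and compare them with the Content IDs the attachments report. The sketch below uses only the standard library and a simple regular expression. The function names are hypothetical, and it does not attempt the RTF body.

```python
import re
from typing import Iterable, Optional, Set, Tuple

# Matches cid:<content id> up to a quote, whitespace, '>' or ')'.
_CID_REF = re.compile(r"""cid:([^"'\s>)]+)""", re.IGNORECASE)


def referenced_cids(html: str) -> Set[str]:
    """Return every Content ID the HTML body refers to."""
    return set(_CID_REF.findall(html))


def classify_cids(
    html: str, attachment_cids: Iterable[Optional[str]]
) -> Tuple[Set[str], Set[str]]:
    """Split body references into (matched, dangling) Content IDs."""
    referenced = referenced_cids(html)
    known = {cid for cid in attachment_cids if cid}
    return referenced & known, referenced - known


html = '<img src="cid:image001.png@01D8"><img src="cid:missing@sig">'
matched, dangling = classify_cids(html, ["image001.png@01D8", None])
print(matched)   # attachments that are referenced in the body
print(dangling)  # references with no matching attachment
```

A matched Content ID is a reasonable hint that an attachment is displayed inline, while a dangling one corresponds to the signature and embedded-message cases described in the list.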

So as you can see, things are changing and becoming a lot more complicated. The new optional dependencies should be installable by doing extract-msg[mime] or something similar when telling pip or the requirements to install, but I'm not sure yet. I'll let you know when I figure that out.

@TheElementalOfDestruction added labels on Jun 28, 2022: In Progress (This issue or feature request has been confirmed or approved, respectively, and is being worked on.) and Nearly Implemented (This feature has nearly been implemented and will be available in the next version.)
@TheElementalOfDestruction added the Complete label and removed the In Progress and Nearly Implemented labels on Jul 11, 2022
@TheElementalOfDestruction
Collaborator

Alright, extract_msg now has optional modules that can be installed to add functionality. Installing with pip install extract-msg[all] will enable all optional modules. For additional mimetype stuff you want the mime module specifically (pip install extract-msg[mime]).
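Downstream code that wants to behave sensibly with or without the optional extras can check whether a module is importable before relying on it. This is a generic standard-library sketch. The helper name is hypothetical, and which module the mime extra actually installs is not stated in this thread, so the name you pass in is up to you to confirm.

```python
import importlib.util


def has_optional_module(module_name: str) -> bool:
    """Return True if a top-level module can be imported.

    Only top-level names are handled here; dotted names can raise
    if the parent package is missing.
    """
    return importlib.util.find_spec(module_name) is not None


# 'json' ships with Python, so this is True on any install.
print(has_optional_module("json"))
```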

@grahamperrin

This comment was marked as resolved.

@TheElementalOfDestruction
Collaborator

What's required for the image to appear is 2 things: the msg file actually contains the reference needed for that image (if it's not attached to the file, there isn't anything to be done) and that you are using the prepared html instead of plain html, as prepared will insert the image data into the body. This uses required dependencies and not any optional dependencies (only uses bs4)
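To illustrate what "insert the image data into the body" means, here is a simplified sketch that swaps `cid:` image sources for base64 data URIs. It is not how extract_msg does it: the comment above says the real implementation uses bs4, whereas this stand-in uses a regular expression so that it runs with the standard library alone. The function name and the shape of the `images` mapping are assumptions.

```python
import base64
import re
from typing import Dict, Tuple

# Captures: opening 'src="' part, the Content ID, and the closing quote.
_SRC_CID = re.compile(r"""(src\s*=\s*["'])cid:([^"']+)(["'])""", re.IGNORECASE)


def inline_cid_images(html: str, images: Dict[str, Tuple[str, bytes]]) -> str:
    """Replace src="cid:..." with a data URI when the image is known.

    `images` maps a Content ID to (mime type, raw bytes). References
    with no matching entry are left untouched.
    """

    def replace(match: "re.Match[str]") -> str:
        cid = match.group(2)
        if cid not in images:
            return match.group(0)
        mimetype, data = images[cid]
        encoded = base64.b64encode(data).decode("ascii")
        return f"{match.group(1)}data:{mimetype};base64,{encoded}{match.group(3)}"

    return _SRC_CID.sub(replace, html)


html = '<img src="cid:logo@x"><img src="cid:gone@y">'
print(inline_cid_images(html, {"logo@x": ("image/png", b"\x89PNG")}))
```

The second image keeps its `cid:` source, which mirrors the first requirement above: if the referenced image is not in the file, there is nothing to insert.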

@grahamperrin

Thanks! Now I see, --prepared-html in the help. Apologies for the noise.


3 participants