extract_msg should provide easy access to useful attachment info #256
Comments
I believe we had a long discussion about this on the Discord, with the result effectively being "it's not feasible" given how inconsistent the format is. Much of the documentation says "SHOULD" instead of "MUST", meaning that you can never rely on things being concrete. Half of the time you would just get "unknown" for what you are asking. A lot of the time, the HTML body determines how to display content by having tags that include it, but I haven't managed to fully figure out the RTF. It also seems to use some kind of tag system for this. That tag system seems to be the only reliable detection method, and it's rather hard to work with. I'll take a look at your specific file and see what info it may have though.
I'd like to use one of my own test files as a brief example of things that would not easily be feasible. Giving the content type would, simply put, require data analysis to determine the type of data. Neither the content type of the attachment nor a file extension is required to be present. File extension is effectively the worst way to determine file type, as multiple formats could share the same extension or the file could just have a bad format to begin with. If the file provides some kind of identifier for the data then we could just return that. If all other methods fail, we either need to return "unknown" or use an analysis of the bytes of the file to determine the type. Of course, trying to manually implement that myself would be a terrible idea, so I would have to use something that already exists, which adds another dependency to the module, a less than ideal solution. I could do it, of course, but I have to determine if it is worth it.
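To make the "analysis of the bytes" idea concrete, here is a minimal sketch of signature-based detection using only the standard library. The function name sniff_mimetype and the signature table are hypothetical illustrations, not part of extract_msg, and the table is deliberately incomplete. A real detector has to handle far more formats, which is why an existing library is the more sensible route.

```python
from typing import Optional

# Leading byte signatures for a few common formats (deliberately incomplete).
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    # ZIP is also the container for other formats, so this guess can be too coarse.
    (b"PK\x03\x04", "application/zip"),
]


def sniff_mimetype(data: bytes) -> Optional[str]:
    """Hypothetical helper: guess a MIME type from the leading bytes, or return None."""
    for signature, mimetype in _SIGNATURES:
        if data.startswith(signature):
            return mimetype
    # Unknown data: the caller decides whether to report "unknown".
    return None
```

Returning None rather than a default string keeps "we could not tell" distinct from an actual detected type.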
(By the way, sorry to use so many messages, but this happens frequently: I send a message and go away, only to keep thinking about something and go look for more information.)
So here is a module that I could potentially use, although it requires an external library to be installed as well. The Python module is here: https://pypi.org/project/mimetypes-magic/ This module can get a mimetype or mimetype-like string for a data stream or file, which we could then return as the content type, should it be recognized. I'm not sure how it handles unknown types yet, but that is something that can be figured out. Of course, I'd like to give at least partial access to the interface so people can customize some of the behavior, so that would take a bit to set up. If you think it would be worth it I can start the process of adding it.
Aside from that, here is what I could find about your attachments. All of your attachments do have mimetype properties, accessible with
I've confirmed that the PNG that I would expect to not have a null rendering position does in fact have one, but looking at the bodies, the reason for this is likely how it is actually rendered. I can see clearly that the HTML data simply uses a tag to insert the image into the text, although the way the RTF is doing it is not clear to me. I assume it is also using a tag to insert it. My guess is that the rendering position being anything else would mean the bodies would have 2 of the same image. In addition, it is made explicit that it will be rendered in HTML and RTF, as the attachment flags for it is the value 4,
TL;DR: A lot of information can be found, but the lack of such information generally means that anything could be the case. If you are feeling frustrated at this, this is what I have to deal with when trying to implement features for this module.
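As a rough illustration of how the two values mentioned above could be interpreted, here is a small sketch. The constant names and the describe_attachment helper are hypothetical, and the flag meanings follow my reading of the MS-OXCMSG description of the attachment flags property, so treat them as an assumption rather than as extract_msg's API.

```python
# 4294967295 is 0xFFFFFFFF, the "null" rendering position discussed in this issue.
NO_RENDERING_POSITION = 0xFFFFFFFF

# Flag bits as I understand them from MS-OXCMSG (an assumption, not confirmed here).
ATT_INVISIBLE_IN_HTML = 0x1
ATT_INVISIBLE_IN_RTF = 0x2
ATT_RENDERED_IN_BODY = 0x4


def describe_attachment(rendering_position: int, attach_flags: int) -> dict:
    """Hypothetical helper: decode the raw integers into readable booleans."""
    return {
        "has_rendering_position": rendering_position != NO_RENDERING_POSITION,
        "invisible_in_html": bool(attach_flags & ATT_INVISIBLE_IN_HTML),
        "invisible_in_rtf": bool(attach_flags & ATT_INVISIBLE_IN_RTF),
        "rendered_in_body": bool(attach_flags & ATT_RENDERED_IN_BODY),
    }
```

Under these assumptions, a null rendering position combined with a flags value of 4 reads as "no positional rendering, but rendered in the body", which matches the inline PNG described above.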
Partial support for this has been implemented. You can now access more of the fields that the msg reports directly in v0.31.0.
This looks awesome, thank you for tackling this @TheElementalOfDestruction! The updated attributes work great. I looked at
So, I have a bit of an update on this, and I need to cover a few things, so just bear with me. This next version I'm working on (0.35.0) is the first where we actually implement optional dependencies for some of the more advanced features. This is one of those features. The interface shouldn't change much depending on how you are using it: if you are accessing properties on Message, Attachment, or any of the other classes, they should still be the same, but they have been made more consistent, so there are now additional properties you can access that are a lot more universal. I'm also going to try adding an optional dependency for getting a mimetype where it hasn't been set, by allowing the data to be analyzed. I'm not sure which module exactly we are going to be using, but it will probably be the one I sent. Issues with getting the mimetype will be logged but fully suppressed otherwise, so users will only notice that files are giving mime types more often rather than getting any errors. The logging is there so users can see that an issue is happening if they look at the logs, and report it for fixing.
I've also decided to take a much closer look at how MSG files handle inline attachments, and things jumped to a whole new level of weird when it comes to Content IDs. You see, the bodies can refer to content IDs that just don't exist, or don't exist in the way they should. I have two examples that greatly illustrate what I mean:
So as you can see, things are changing and becoming a lot more complicated. The new optional dependencies should be installable by doing extract-msg[mime] or something similar when telling pip or the requirements to install, but I'm not sure yet. I'll let you know when I figure that out.
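The "logged but fully suppressed" behavior for mimetype detection described in this update could look roughly like the sketch below. The function detect_mimetype and its detector parameter are hypothetical: the detector stands in for whatever optional module ends up being used, and nothing here reflects the actual extract_msg code.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def detect_mimetype(
    data: bytes,
    stored: Optional[str],
    detector: Optional[Callable[[bytes], Optional[str]]] = None,
) -> Optional[str]:
    """Hypothetical helper: prefer the stored type, then fall back to analysis."""
    if stored:
        # The file reported a mimetype itself, so no analysis is needed.
        return stored
    if detector is None:
        # The optional dependency is not installed; behave as before.
        return None
    try:
        return detector(data) or None
    except Exception:
        # Log the failure for bug reports, but never raise to the caller.
        logger.exception("Mimetype detection failed")
        return None
```

With this shape, callers without the optional dependency see no change, and callers with it only ever see more attachments reporting a type.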
Alright, extract_msg now has optional modules that can be installed to add functionality. Installing with
Two things are required for the image to appear: the msg file must actually contain the reference needed for that image (if it's not attached to the file, there isn't anything to be done), and you must use the prepared HTML instead of the plain HTML, as the prepared version inserts the image data into the body. This uses only required dependencies (bs4), not any optional ones.
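To illustrate what "inserting the image data into the body" can mean, here is a sketch that rewrites cid: references as base64 data URIs. It is not extract_msg's implementation: the library uses bs4, while this example uses a regular expression from the standard library to stay self-contained, and the embed_inline_images name and the attachments mapping are hypothetical.

```python
import base64
import re
from typing import Dict, Tuple

# Matches src="cid:..." or src='cid:...' and remembers which quote was used.
_CID_SRC = re.compile(r'src=(["\'])cid:([^"\']+)\1', re.IGNORECASE)


def embed_inline_images(html: str, attachments: Dict[str, Tuple[str, bytes]]) -> str:
    """Hypothetical helper: attachments maps content ID -> (mimetype, raw bytes)."""

    def replace(match: "re.Match[str]") -> str:
        quote, content_id = match.group(1), match.group(2)
        found = attachments.get(content_id)
        if found is None:
            # The body refers to a content ID with no attachment: leave it alone.
            return match.group(0)
        mimetype, data = found
        encoded = base64.b64encode(data).decode("ascii")
        return f"src={quote}data:{mimetype};base64,{encoded}{quote}"

    return _CID_SRC.sub(replace, html)
```

References whose content ID has no matching attachment are left untouched, which mirrors the point above that nothing can be done when the image is not in the file.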
Thanks! Now I see, |
Bug Metadata
Describe the bug
extract_msg is a wonderful library for reading detailed information from MSG files. However, it does not seem to provide easy access to attachments' mime type or content-disposition/attachment method (whether it is a regular attachment or an "inline" embedded image). The .renderingPosition attribute does not seem to be helpful, as it seems to return 4294967295 for both inline and attachment content dispositions.
What code did you use or can we use to reproduce this error?
This is kind of what I'm expecting:
According to the libpff docs (which I think apply to msg files as well), the entry types should be 0x370e for the mime type and possibly 0x3705 for attachment method, but I'm unsure about the second. 0x370e appears to work using the _ensureSet() method:
I think I can put in a PR to add content type if that would be helpful. It seems straightforward.
However I'm not sure how to access the content disposition/attachment method. Doing it the same way returns None for me. mailAttachment._ensureSetProperty('_attachmentMethod', '37050003') seems to return 1 for both inline and attachment. Looking at the libpff docs, I would expect inline attachments to return 2 or perhaps 5? Not sure if the example I'm emailing is representative.
Is there a message.msg file you want to share to help us reproduce this?
Traceback
n/a
Screenshots
n/a
Additional context
I'm not sure how MSGs reference or handle inline attachments. For MBOX/EML it seems to be denoted in a Content-Disposition header, which is what I'm referring to, but it's very possible that MSGs don't manage this the way I'm expecting.
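For readers following the property identifiers in this report, the sketch below splits an eight-digit tag such as 37050003 into its property ID and type, and lists attach method values as I understand them from MS-OXCMSG. The helper name and the value table are assumptions for illustration, not something confirmed in this thread or taken from extract_msg.

```python
from typing import Tuple

# Attach method values per my reading of MS-OXCMSG (an assumption, not verified here).
ATTACH_METHODS = {
    0: "none",
    1: "by value",
    2: "by reference",
    4: "by reference only",
    5: "embedded message",
    6: "OLE storage",
    7: "by web reference",
}


def split_property_tag(tag: str) -> Tuple[int, int]:
    """Hypothetical helper: split '37050003' into (property ID, property type)."""
    if len(tag) != 8:
        raise ValueError(f"expected 8 hex digits, got {tag!r}")
    return int(tag[:4], 16), int(tag[4:], 16)


prop_id, prop_type = split_property_tag("37050003")
print(hex(prop_id), hex(prop_type), ATTACH_METHODS.get(1))
```

If that table is right, a value of 1 for both files would only say that the data is stored in the message itself. That describes storage rather than disposition, and it would explain why inline and regular attachments report the same number.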