unpack requires a buffer of 16 bytes #274
Comments
Yeah, this isn't related to the size of the msg file. Looks like something went wrong with the main properties stream, most likely because it's misaligned with what was expected. Let's confirm what the problem is by simply logging the size of the properties stream before it errors so we know why it broke. Unfortunately you'll have to edit one of the files for this test, but you can revert the change immediately after. Your traceback shows the path of the file that contains this line:
    streams = divide(self.__stream[skip:], 16)
Insert the following line after that, right before the for loop:
    logger.warning(len(self.__stream))
When you run extract_msg again, this will add a log message immediately before the traceback that contains a number. Let me know what that number is. Thanks |
I added the _logger.warning(len(self._stream)) code right before the for loop and below is the output: |
Yep, the alignment was off. For a message it should be divisible by 16 but yours was only divisible by 8. I'll check to see if I got the details wrong but I believe my implementation was right. |
Confirmed, it's parsing the header correctly. Looks like the data in your msg file is blatantly malformed, and I don't know why. Can you tell me anything about it like what program made it and if outlook can open it properly? |
It is an email chain conversation between our executive and a client. It also opens properly in Outlook. |
Did Outlook make the file? Anyway, you should probably change that log to output the stream itself instead of the size and send that. The properties stream doesn't contain sensitive info. The most it has is random date properties. I need to see what format it is using and why. Also, to confirm: did the number from the log print more than once for the email, or did it error immediately after the first log? |
Sorry, apparently I need to make a correction because I screwed up. 628 actually aligns to 4 bytes, not 8 or 16, making this file weird as all heck. I actually checked it manually, and I can see that it isn't misaligned (the properties are exactly where they should be, the header is valid, etc.). It just, for whatever reason, has 4 extra null bytes at the end. I'm looking into what might cause this and whether the standard considers it acceptable, so I know how best to handle it. |
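The arithmetic behind that correction can be checked directly. This is only a sketch: the 32-byte header size is an assumption (it is the size the MS-OXMSG layout gives for a top-level message properties stream and is not stated in this thread), and `HEADER_SIZE`, `ENTRY_SIZE`, and `describe_alignment` are illustrative names, not part of extract_msg.

```python
# Illustrative check of the stream length reported in this thread.
STREAM_LEN = 628   # value quoted in the comment above
HEADER_SIZE = 32   # ASSUMPTION: header size for a top-level message
ENTRY_SIZE = 16    # each property entry is expected to be 16 bytes


def describe_alignment(length, header=HEADER_SIZE, entry=ENTRY_SIZE):
    """Return (whole entries, leftover bytes) for the data after the header."""
    return divmod(length - header, entry)


entries, leftover = describe_alignment(STREAM_LEN)
print(f"628 % 16 = {STREAM_LEN % 16}, 628 % 8 = {STREAM_LEN % 8}, 628 % 4 = {STREAM_LEN % 4}")
print(f"{entries} whole entries, {leftover} trailing bytes")
```

Under that assumption the leftover is 4 bytes, which matches the 4 extra null bytes described above.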
Nothing is mentioned in the docs, so my guess is that because everything is aligned properly, Outlook manages to read the properties, fails to read the trailing bytes, silently gives up having already parsed all the data it needs, and as such everything looks fine. So that's what I'll do: I'll add a check to make sure each chunk is 16 bytes, and if it isn't then I'll just pretend it doesn't exist. I'll bundle this fix into 0.35.0, which is pretty close to being done and has a lot of improvements and bug fixes. No idea why Outlook did this tbh, and I'd actually recommend you try to report it to Microsoft as it seems like a bug in Outlook. |
If you want to have a fix immediately, you can replace the following lines:
    for st in streams:
        prop = createProp(st)
        self.__props[prop.name] = prop
With this:
    for st in streams:
        if len(st) == 16:
            prop = createProp(st)
            self.__props[prop.name] = prop
        else:
            logger.warning(f'Found stream from divide that was not 16 bytes: {st}. Ignoring.') |
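For anyone who wants to see the failure and the workaround in isolation, here is a minimal standalone sketch. It does not use extract_msg: `divide` below is a stand-in written for this example, and `ENTRY_FORMAT`, `parse_entries`, and the 32-byte `skip` default are assumptions chosen only so that each entry is 16 bytes.

```python
import logging
import struct

logger = logging.getLogger(__name__)

# HYPOTHETICAL 16-byte layout (two uint32 + one uint64), for illustration only.
ENTRY_FORMAT = '<IIQ'
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 16


def divide(data, size):
    """Stand-in helper: split data into chunks of `size` bytes; the last may be short."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def parse_entries(stream, skip=32):
    """Unpack every full 16-byte entry after the header and skip a short tail."""
    parsed = []
    for chunk in divide(stream[skip:], ENTRY_SIZE):
        if len(chunk) == ENTRY_SIZE:
            parsed.append(struct.unpack(ENTRY_FORMAT, chunk))
        else:
            # Without this branch, struct.unpack raises struct.error on the tail.
            logger.warning('Ignoring %d trailing bytes: %r', len(chunk), chunk)
    return parsed


# A 32-byte header, two entries, then 4 stray null bytes at the end.
sample = bytes(32) + struct.pack(ENTRY_FORMAT, 1, 2, 3) * 2 + bytes(4)
print(parse_entries(sample))
```

Skipping the short chunk rather than raising mirrors the choice made in the fix above: the full entries are still parsed, and the stray bytes are only reported in the log.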
Thanks for sharing the fix. |
Odd. I'd like to turn on the debug logging and have you send me a copy of the set of log messages. To do this from the command line, simply add In addition, there are 2 other things I would like to check. The first is whether using a save type other than the default (I would recommend either RTF or HTML) causes data to show up at all (I just need to know if it does, not the full details of the data). The second is, if you open it in Outlook and go to the print preview, which fields of the header (things like To, From, etc.) appear at the top? I don't need the data in those fields, just which ones appear. If Outlook shows a field, it means it has accessible data that the module is failing to access. Thanks |
Unlikely that that caused any issue. To be clear, the html contained the header that looked correct? Additionally, to be clear, was the header section of the output from extract-msg populated with the actual data when you saved plain text? Also, I see why your output looked so bad. Two streams were completely absent from the file: plain text body and compressed RTF body. If the plain text body isn't found, the program may try to generate it from the RTF if possible. But the RTF body wasn't there. As such, plain text just doesn't output anything. |
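The fallback order described in that comment can be sketched roughly as follows. This is not the module's code: `resolve_plain_body` and the `rtf_to_text` converter argument are hypothetical names used only to show the order of the checks.

```python
from typing import Callable, Optional


def resolve_plain_body(
    plain: Optional[str],
    rtf: Optional[bytes],
    rtf_to_text: Callable[[bytes], str],
) -> Optional[str]:
    """Prefer the plain text body, then try the RTF body, else return None."""
    if plain is not None:
        return plain
    if rtf is not None:
        # HYPOTHETICAL converter supplied by the caller.
        return rtf_to_text(rtf)
    # Both streams are absent, as in the file discussed here: nothing to output.
    return None


# With neither stream present there is no body to return.
print(resolve_plain_body(None, None, lambda data: data.decode('ascii')))
```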
Alright, I misunderstood the issue a bit. I thought the header just contained the field names but no data. Yeah, just a case of no plain text body being available and no current method for extracting plain text out of the HTML. In addition, I've added a bit of code in the last commit that will improve the error handling for such a scenario where the body stream doesn't exist and can't be generated. |
Can we expect a fix for this issue in the upcoming release, or will the fix only be improved error handling? |
The fix for the properties stream is there, as well as better error handling. Aside from that, nothing else. Changing things to add it once I figure out the best way will be easy, as only MessageBase actually needs to be changed and then all of the saveable classes that use a body will be updated with that code. For better tracking, I recommend making that a specific feature request as it is separate from the original issue of this post. |
Next release now contains what may be the finalized code for version 0.35.0 if you would like to try that out and see if it works properly. I think everything should be working correctly, I'm just still running some tests on it to make sure everything is in working order. |
All of the fixes for this are now done in 0.35.0. I created a new feature request for generating the plain text body from the HTML body where possible, #278. Let me know if the main bug from this was not resolved. |
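As a rough idea of what generating a plain text body from an HTML body could look like, here is a sketch built only on the standard library's `html.parser`. It is not the approach taken in #278 or anywhere in extract_msg; `html_to_text` and `_TextExtractor` are hypothetical names, and real email HTML would need more care (tables, links, encodings).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style and breaking lines on block tags."""

    SKIP = {'script', 'style'}
    BLOCK = {'p', 'div', 'br', 'tr', 'li'}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._parts.append('\n')

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Strip each line and drop the empty ones.
        lines = (line.strip() for line in ''.join(self._parts).splitlines())
        return '\n'.join(line for line in lines if line)


def html_to_text(html):
    """Return a plain text approximation of an HTML body."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text('<body><p>Hello</p><p>World &amp; all</p></body>'))
```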
Bug Metadata
Describe the bug
I have hundreds of emails from which I want to extract the message details, but for some of them I encounter the error below:
struct.error: unpack requires a buffer of 16 bytes
[ If applicable ]
What code did you use or can we use to reproduce this error?
I ran the command below from the command line.
I cannot share the email file because of compliance concerns, but I can share its size, which is 55 KB.
I have also observed that some emails even bigger than 100 KB are extracted successfully, so I don't think it is due to the email size.
Is there a message.msg file you want to share to help us reproduce this?
Traceback
Screenshots
Additional context