unpack requires a buffer of 16 bytes #274

Closed
2 of 4 tasks
akr1991 opened this issue Jul 6, 2022 · 20 comments
Labels
In Progress This issue or feature request has been confirmed or approved, respectively, and is being worked on.

Comments

@akr1991

akr1991 commented Jul 6, 2022

In order to get your bug addressed in a timely manner, or at all 😃, please fill out the below bug report. Please try to make it as easy as possible for us to understand what is going on. We may close out any bugs or issues without warning that are not complete or coherent.

In the bug template below, anything in [square brackets] should be filled out or removed if the item doesn't apply.

Should you encounter an error that has not already been reported, please do the following when reporting it:
Bug Metadata

  • Version of extract_msg: [0.34.3]
  • Your python version: Python [3.6.7]
  • How did you launch extract_msg?
    • My command line or
    • I used the extract_msg package

Describe the bug
I have hundreds of emails from which I want to extract the message details, but for some of them I encounter the error below:
struct.error: unpack requires a buffer of 16 bytes

[ If applicable ]
What code did you use or can we use to reproduce this error?

I ran the command below from the command line.
I cannot share the email file because of compliance concerns, but I can share its size, which is 55 KB.
I have also observed that some emails bigger than 100 KB are extracted successfully, so I don't think the failure is due to email size.

python -m extract_msg "error-email.msg"

Is there a message.msg file you want to share to help us reproduce this?

  • Uploaded message (drag and drop on this window)
  • Emailed message as an attachment to admins: [Enter Subject Line Here]

Traceback

[Put your traceback here]

Screenshots
image

Additional context
[Add any other context about the problem here.]

@TheElementalOfDestruction
Collaborator

Yeah, this isn't related to the size of the msg file. Looks like something went wrong with the main properties stream, most likely because it's misaligned with what was expected.

Let's confirm what the problem is by simply logging the size of the properties stream before it errors so we know why it broke. Unfortunately you'll have to edit one of the files for this test, but you can revert the change immediately after. In your traceback there is a path for properties.py. On line 54 you'll see

streams = divide(self.__stream[skip:], 16)

Insert the following line after that, right before the for loop:

logger.warning(len(self.__stream))

When you run extract_msg again, this will add a log message immediately before the traceback that contains a number. Let me know what that number is.

Thanks

@akr1991
Author

akr1991 commented Jul 7, 2022

I added the _logger.warning(len(self._stream)) code right before the for loop and below is the output:
2022-07-07 13:07:17,855 - extract_msg.properties - WARNING - 628

@TheElementalOfDestruction
Collaborator

Yep, the alignment was off. For a message it should be divisible by 16 but yours was only divisible by 8. I'll check to see if I got the details wrong but I believe my implementation was right.

@TheElementalOfDestruction
Collaborator

TheElementalOfDestruction commented Jul 7, 2022

Confirmed, it's parsing the header correctly. Looks like the data in your msg file is blatantly malformed, and I don't know why.

Can you tell me anything about it like what program made it and if outlook can open it properly?

@akr1991
Author

akr1991 commented Jul 7, 2022

It is an email chain conversation between our executive and a client. It also opens properly in Outlook.

@TheElementalOfDestruction
Collaborator

Did outlook make the file?

Anyways, you should probably change that log to output the stream itself instead of the size and send that. The properties stream doesn't contain sensitive info. The most it has is random date properties. I need to see what format it is using and why.

Also, to confirm, the number for the log, did that print more than once for the email or did it error immediately after the first log?

@akr1991
Author

akr1991 commented Jul 7, 2022

Yes, the file was made by Outlook.

Please find below the output of the stream:
2022-07-07 14:08:48,600 - extract_msg.properties - WARNING - b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x01\xff\x0f\x06\x00\x00\x00H\x00\x00\x00\x00\x00\x00\x00\x02\x01\xf6\x0f\x06\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x03\x00\r4\x02\x00\x00\x008\x00\x05\x00\x00\x00\x00\x00\x03\x00\x0f4\x02\x00\x00\x008\x00\x05\x00\x00\x00\x00\x00\x0b\x00\x02\x00\x02\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x0b\x00\x1b\x0e\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\xde?\x06\x00\x00\x00\xe9\xfd\x00\x00\x00\x00\x00\x00\x02\x01\x13\x10\x06\x00\x00\x00\xad\xb8\x00\x00\x00\x00\x00\x00\x0b\x00\x1f\x0e\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x00\x06\x0e\x06\x00\x00\x00\x00\x9f\xc7\x8e\xcdI\xd8\x01@\x009\x00\x06\x00\x00\x00\x00\x9f\xc7\x8e\xcdI\xd8\x01\x03\x00\xf4\x0f\x06\x00\x00\x00\x07\x00\x00\x00\x00\x00\x00\x00\x03\x00\xf7\x0f\x06\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\xfe\x0f\x06\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x1f\x007\x00\x06\x00\x00\x00\xb8\x00\x00\x00\x00\x00\x00\x00\x1f\x00\x1d\x0e\x06\x00\x00\x00\xb0\x00\x00\x00\x00\x00\x00\x00\x1f\x00=\x00\x06\x00\x00\x00\n\x00\x00\x00\x00\x00\x00\x00\x1f\x00p\x00\x06\x00\x00\x00\xb0\x00\x00\x00\x00\x00\x00\x00@\x00\x070\x06\x00\x00\x00\x17\xb8\xdf\x97\xcdI\xd8\x01@\x00\x080\x06\x00\x00\x00\x17\xb8\xdf\x97\xcdI\xd8\x01\x03\x00&\x00\x06\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x17\x00\x06\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x08?\x06\x00\x00\x00\t\x04\x00\x00\x00\x00\x00\x00\x03\x00\x07\x0e\x06\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x80\x10\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x1f\x00#@\x06\x00\x00\x004\x00\x00\x00\x00\x00\x00\x00\x1f\x008@\x06\x00\x00\x004\x00\x00\x00\x00\x00\x00\x00\x1f\x00"@\x06\x00\x00\x00\n\x00\x00\x00\x00\x00\x00\x00\x02\x01\x19\x0c\x06\x00\x00\x00\x8a\x00\x00\x00\x00\x00\x00\x00\x1f\x00\x1f\x0c\x06\x00\x00\x004\x00\x00\x00\x00\x
00\x00\x00\x1f\x00\x1a\x0c\x06\x00\x00\x004\x00\x00\x00\x00\x00\x00\x00\x1f\x00\x1e\x0c\x06\x00\x00\x00\n\x00\x00\x00\x00\x00\x00\x00\x1f\x00\x04\x0e\x02\x00\x00\x00<\x00\x00\x00\x00\x00\x00\x00\x1f\x00\x03\x0e\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x1f\x00\x02\x0e\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x1f\x00\x1a\x00\x06\x00\x00\x00\x12\x00\x00\x00\x00\x00\x00\x00\x03\x00\x08\x0e\x06\x00\x00\x00y\xbe\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

The log number was printed only once. Please find the screenshot below. I logged both the size and the stream contents.
Screenshot 2022-07-07 141432

@TheElementalOfDestruction
Collaborator

Sorry, apparently I need to make a correction because I screwed up. 628 actually aligns to 4 bytes, not 8 or 16, making this file weird as all heck.

I actually checked it manually, and I can see that it isn't misaligned (the properties are exactly where they should be, the header is valid, etc.) It just, for whatever reason, has 4 extra null bytes at the end. I'm looking into what might cause this and whether this is considered acceptable for the standard to know how best to handle it.
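
The arithmetic behind that correction can be checked quickly. This is only a sketch: the 628 comes from the log output earlier in the thread, while the 32-byte header size is an assumption about a top-level message's properties stream (it matches the dump above, where the first property entry starts at offset 32).

```python
STREAM_LEN = 628  # value printed by the log line earlier in this thread
HEADER_LEN = 32   # assumption: header size of a top-level message's properties stream
ENTRY_LEN = 16    # each property entry is expected to be 16 bytes

# Remainder of the full stream length against each candidate alignment.
remainders = {n: STREAM_LEN % n for n in (16, 8, 4)}

# Number of whole entries after the header, and the bytes left over.
entries, leftover = divmod(STREAM_LEN - HEADER_LEN, ENTRY_LEN)

print(remainders)         # {16: 4, 8: 4, 4: 0}
print(entries, leftover)  # 37 4
```

So the stream holds 37 complete entries followed by 4 stray bytes, which is the short final chunk that triggers the unpack error.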

@TheElementalOfDestruction
Collaborator

TheElementalOfDestruction commented Jul 7, 2022

Nothing is mentioned in the docs, so my guess is that, because everything is aligned properly, Outlook manages to read all the properties, fails to read the trailing bytes, and fails silently because it has already parsed all the data it needs, so everything looks fine. So that's what I'll do: I'll add a check to make sure each entry is exactly 16 bytes, and if it isn't, I'll just pretend it doesn't exist. I'll bundle this fix into 0.35.0, which is pretty close to being done and has a lot of improvements and bug fixes.

No idea why outlook did this tbh, and I'd actually recommend you try to report it to Microsoft as it seems like a bug in outlook.

@TheElementalOfDestruction TheElementalOfDestruction added the In Progress This issue or feature request has been confirmed or approved, respectively, and is being worked on. label Jul 7, 2022
@TheElementalOfDestruction
Collaborator

If you want to have a fix immediately, you can replace the following lines:

for st in streams:
    prop = createProp(st)
    self.__props[prop.name] = prop

With this:

for st in streams:
    if len(st) == 16:
        prop = createProp(st)
        self.__props[prop.name] = prop
    else:
        logger.warning(f'Found stream from divide that was not 16 bytes: {st}. Ignoring.')
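
For anyone who wants to see the effect outside the library, here is a minimal standalone sketch of the same guard. Nothing in it is extract_msg code: `divide` and `parse_entries` are hypothetical stand-ins, the `<HHIQ` layout (type, id, flags, 8-byte value) is a simplifying assumption about the 16-byte entry, and the 32-byte header skip is assumed as well. The sample entry is the last one from the dump posted above.

```python
import logging
import struct

logger = logging.getLogger(__name__)

# Assumed 16-byte entry layout: property type, property id, flags, raw 8-byte value.
ENTRY = struct.Struct('<HHIQ')


def divide(data, size):
    """Hypothetical stand-in for the library helper: split data into size-byte chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def parse_entries(stream, skip=32):
    """Parse 16-byte entries after the header, ignoring a short trailing chunk."""
    props = {}
    for chunk in divide(stream[skip:], ENTRY.size):
        if len(chunk) != ENTRY.size:
            logger.warning('Ignoring %d trailing byte(s): %r', len(chunk), chunk)
            continue
        prop_type, prop_id, flags, value = ENTRY.unpack(chunk)
        props['%04X%04X' % (prop_id, prop_type)] = (flags, value)
    return props


# 32-byte placeholder header, the final entry from the dump, then 4 stray null bytes.
last_entry = b'\x03\x00\x08\x0e\x06\x00\x00\x00y\xbe\x00\x00\x00\x00\x00\x00'
stream = bytes(32) + last_entry + bytes(4)
props = parse_entries(stream)
print(props)  # {'0E080003': (6, 48761)}
```

Without the length check, `ENTRY.unpack` would be called on the 4-byte remainder and raise `struct.error`, which is the failure reported in this issue.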

TheElementalOfDestruction added a commit that referenced this issue Jul 7, 2022
@akr1991
Author

akr1991 commented Jul 7, 2022

Thanks for sharing the fix.
I added this fix to the properties.py file. The error is gone now, but the email is still not extracted correctly.
It only extracted the 7 lines below, not the body of the email:
From:
To:
Cc:
Bcc:
Subject:
Date:
---------------

@TheElementalOfDestruction
Collaborator

Odd. I'd like to turn on the debug logging and have you send me a copy of the set of log messages. To do this from the command line, simply add --verbose as an option somewhere and it will print out a lot more messages. Of course, I recommend you take a cursory glance at it to strip any sensitive information it might have before sending it, but I don't think there should be any. These log messages will tell me a lot about the structure of the file and what the module was trying to access that it couldn't find (as it looks like the body and header properties were not found at all).

In addition, there are 2 other things I would like to check. The first is whether using a save type other than the default (I would recommend either RTF or HTML) causes data to show up at all (I just need to know if it does, not the full details of the data). The second is, if you open it in outlook and go to the print preview, which fields of the header (things like To, From, etc.) appear at the top? I don't need the data in those fields, just which ones appear. If outlook shows a field, it means it has accessible data that the module is failing to access.

Thanks

@akr1991
Author

akr1991 commented Jul 8, 2022

Output from --verbose log
image

Also as requested :

  1. I saved the file as HTML and in HTML file everything is showing up
  2. I opened the file in outlook and went to print preview and there also everything is showing up. PFB snapshot.

image

In the email trail messages I see an image link which is not showing up and appears as below. Can this cause any issue?
image

@TheElementalOfDestruction
Collaborator

Unlikely that that caused any issue. To be clear, the html contained the header that looked correct?

Additionally, to be clear, was the header section of the output from extract-msg populated with the actual data when you saved plain text?

Also, I see why your output looked so bad. Two streams were completely absent from the file: plain text body and compressed RTF body. If the plain text body isn't found, the program may try to generate it from the RTF if possible. But the RTF body wasn't there. As such, plain text just doesn't output anything.
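
The fallback described above can be sketched roughly as follows. This is not the module's actual code: `pick_plain_body` and the `rtf_to_text` callback are hypothetical names used only to illustrate the order of the checks.

```python
from typing import Callable, Optional


def pick_plain_body(plain: Optional[str],
                    rtf: Optional[bytes],
                    rtf_to_text: Callable[[bytes], str]) -> Optional[str]:
    """Return the plain text body, falling back to text generated from the RTF body."""
    if plain is not None:
        return plain             # plain text stream exists, use it directly
    if rtf is not None:
        return rtf_to_text(rtf)  # hypothetical converter supplied by the caller
    return None                  # neither stream exists, so there is nothing to output


# The file in this issue has neither stream, so the result is empty.
result = pick_plain_body(None, None, lambda data: data.decode('ascii', 'replace'))
print(result)  # None
```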

@akr1991
Author

akr1991 commented Jul 8, 2022

> Unlikely that that caused any issue. To be clear, the html contained the header that looked correct?

Yes

> Additionally, to be clear, was the header section of the output from extract-msg populated with the actual data when you saved plain text?

Yes

@TheElementalOfDestruction
Collaborator

Alright, I misunderstood the issue a bit. I thought the header just contained the field names but no data.

Yeah, just a case of no plain text body being available and no current method for extracting plain text out of the HTML. In addition, I've added a bit of code in the last commit that will improve the error handling for such a scenario where the body stream doesn't exist and can't be generated.
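
Until something like that exists in the module, a rough workaround is to save the message as HTML and strip the tags yourself. The sketch below uses only the standard library's `html.parser`. It is an illustration, not the approach the module uses or will use, and the `_TextExtractor` and `html_to_text` names are made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping head/script/style content."""

    SKIP = {'script', 'style', 'head'}
    BLOCK = {'p', 'div', 'br', 'tr', 'li', 'h1', 'h2', 'h3'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._parts.append('\n')  # block-level tags start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (' '.join(line.split()) for line in ''.join(self._parts).splitlines())
        return '\n'.join(line for line in lines if line)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ('<html><head><style>p{}</style></head>'
          '<body><p>Hello <b>world</b></p><p>Bye</p></body></html>')
print(html_to_text(sample))
```

Real Outlook HTML is much messier than this sample (nested tables, conditional comments), so treat the output as a best-effort approximation.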

@akr1991
Author

akr1991 commented Jul 8, 2022

Can we expect a fix for this issue in the upcoming release, or will the fix only improve the error handling?

@TheElementalOfDestruction
Collaborator

The fix for the properties stream is there, as well as better error handling. Aside from that, nothing else. Adding plain text extraction from the HTML will be easy once I figure out the best way to do it, as only MessageBase actually needs to be changed and then all of the saveable classes that use a body will be updated with that code.

For better tracking, I recommend making that a specific feature request as it is separate from the original issue of this post.

@TheElementalOfDestruction
Collaborator

Next release now contains what may be the finalized code for version 0.35.0 if you would like to try that out and see if it works properly. I think everything should be working correctly, I'm just still running some tests on it to make sure everything is in working order.

@TheElementalOfDestruction
Collaborator

All of the fixes for this are now done in 0.35.0. I created a new feature request for generating the plain text body from the HTML body where possible, #278. Let me know if the main bug from this was not resolved.
