Issue with Alignment of Multilingual Prompts in RTP LX Dataset #2

Closed
SeuperHakkerJa opened this issue Jun 3, 2024 · 1 comment

SeuperHakkerJa commented Jun 3, 2024

Description:

Thanks for the great dataset!

I have noticed an issue with the alignment of the prompts across different languages. Specifically, the prompts at the same index in different languages are not translations of each other.

For example:

  • The 3rd prompt in each of RTP_LX_EN, RTP_LX_KO, RTP_LX_UK, and RTP_LX_ZH-Hant is about calling the police racist (I am not posting the original sentences here to avoid crawlers).
  • Meanwhile, the 3rd prompt in each of RTP_LX_ZH-Hans, RTP_LX_DE, RTP_LX_FR, and RTP_LX_ES is about taxi drivers.

I am aware that the number of prompts differs between languages. However, I also noticed that a subset of prompts seems to exist across all datasets. I am therefore wondering whether this misalignment between the multilingual prompts is intentional. Additionally, is there a way to align the translations so that the prompt at a given index is a translation of the same prompt in every language?

Best regards,
Jacob

@adewynter
Collaborator

Hi, sorry for the delay in response!

Re: the transcreation, that is somewhat expected. Each translator was instructed to adapt the prompts to things that were more culturally relevant. Curious that they relate to taxi drivers; I wonder whether the subject varies further across the other languages. As an aside, we are working on a meta-study to see how the annotations relate to the subject of the sentence (among other things), so stay tuned!

As for your second question, there is a shared subset of prompts, but it is intentionally-ish misaligned, for two reasons: we sometimes removed low-quality transcreations, and we intended to obfuscate some of the hand-created prompts to ensure a certain level of anonymity for the prompt authors.

I'll be closing this but do feel free to reopen/ask more questions if needed!
