Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

【Hackathon 7th No.21】RFC文档 新增 reset_peak_memory_stats/reset_max_memory_allocated/memory_stats API #1022

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Qin-sx
Copy link

@Qin-sx Qin-sx commented Nov 19, 2024

new file:   rfcs/APIs/20241119_api_design_for_reset_peak_memory_stats_reset_max_memory_allocated_memory_stats.md

希望可以与paddle开发人员讨论一下函数的具体实现细节

	new file:   rfcs/APIs/20241119_api_design_for_reset_peak_memory_stats_reset_max_memory_allocated_memory_stats.md
Copy link

paddle-bot bot commented Nov 19, 2024

你的PR提交成功,感谢你对开源项目的贡献!
请检查PR提交格式和内容是否完备,具体请参考示例模版
Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

@CLAassistant
Copy link

CLAassistant commented Nov 19, 2024

CLA assistant check
All committers have signed the CLA.


# 四、对比分析

由于 PyTorch 中的函数结构更符合飞浆的要求,因此可以参考 PyTorch 中的函数实现。
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

飞浆-》飞桨

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

抱歉,已修改

# 七、影响面

## 需要进一步讨论的问题
1. 是否对`Stat`类进行修改。
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以修复Stat类,可以评估下修改 或 不修改,两种方案的优缺点。

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我目前认为可以直接在Stat类中加入新的函数。我认为现在不需要使用友元函数或者友元类来修改Stat类中的私有变量。不知道飞桨对这种类似的修改是否有要求?


## 需要进一步讨论的问题
1. 是否对`Stat`类进行修改。
2. 目前`StatRegistry`类中是否只注册了`Allocated`和`Reserved`。
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

还需要注册哪些信息?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为我目前对飞桨的内存池还不太了解,希望可以和飞桨技术人员交流一下内存池的架构。
PyTorch的内存池比较复杂,信息更多。如新的commit中引用的struct DeviceStats类所示,Pytorch的内存池中有更多的类似活动内存块和非活动内存块,大内存池(>1MB)和小内存池(<1MB)等信息。

## 需要进一步讨论的问题
1. 是否对`Stat`类进行修改。
2. 目前`StatRegistry`类中是否只注册了`Allocated`和`Reserved`。
3. 是否参考Pytorch注册更多的内存信息标签。
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以列举一下有哪些内存信息标签。这些新注册的标签是否会被使用到。因为飞桨单侧有行覆盖率的要求,如果注册了标签,但没有使用,单侧会无法通过。

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为对飞桨内存池还没有深入了解。我认为第一版函数可以按照此RFC文档中方案设计:

  1. reset_peak_memory_stats函数:将allocated memory和reserved memory中的peak值改为current值
  2. reset_max_memory_allocated函数:参考Pytorch,调用reset_peak_memory_stats函数
  3. MemoryStats函数:返回包含以下信息的字典
{
  {"memory.allocated.peak", 0},
  {"memory.allocated.current", 0},
  {"memory.reserved.peak", 0},
  {"memory.reserved.current", 0}
}

请问这种设计能否满足目前的需求?

	modified:   rfcs/APIs/20241119_api_design_for_reset_peak_memory_stats_reset_max_memory_allocated_memory_stats.md
@luotao1
Copy link
Collaborator

luotao1 commented Dec 3, 2024

comment by @From00

  • 感觉这个方案可能有些问题,仅修改stats类的peak_value_无法达到重置max_memory的目的。飞桨的这套内存池统计方案,是专门针对显存读写的高并发场景设计的,和PyTorch的实现不太一致。统计的数据实际是分散存储在各个线程里的,需要同时重置各个线程的数据才行,否则stats.peak_value_重置后数据也会很快被其它线程恢复。
  • 《应用于计算分支的更新峰值的方法、装置、设备和介质》,可以通过这份公开专利,看一下这块具体的实现逻辑 https://patentimages.storage.googleapis.com/86/71/69/a6a95a1bcb9b05/CN114610575B.pdf
  • 如果有需要讨论的话,包括上面公开专利的技术交底书,请如流加我,我会拉群和研发 @From00 一起讨论:
image

@Qin-sx
Copy link
Author

Qin-sx commented Dec 3, 2024

收到,谢谢,已添加

	modified:   rfcs/APIs/20241119_api_design_for_reset_peak_memory_stats_reset_max_memory_allocated_memory_stats.md
	modified:   rfcs/APIs/20241119_api_design_for_reset_peak_memory_stats_reset_max_memory_allocated_memory_stats.md
@luotao1
Copy link
Collaborator

luotao1 commented Dec 12, 2024

同学你好,我们review后,评估了下: @Qin-sx

  • 想将题目改成:【新增paddle.device.cuda.reset_max_memory_allocated和paddle.device.cuda.reset_max_memory_reserved】这样torch的reset_peak_memory_stats、reset_max_memory_allocated、reset_max_memory_cached、reset_max_memory_reserved 四个都有对应了
  • 难度不变,依然2星

这样你的设计文档和代码,改改应该可以在本期黑客松中完成。
请问你是否同意这样的修改,如果同意的话,我们将会同步修改下题目。然后你重新改下RFC、PR等。

	modified:   rfcs/APIs/20241119_api_design_for_reset_peak_memory_stats_reset_max_memory_allocated_memory_stats.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants