Antman

[Paper Notes] TGS Paper Reading Notes

May 6, 2023

20 minutes

antman gpu

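Abstract

Containers are widely used for resource management in data centers.
- A common practice for supporting deep learning (DL) training in container clouds is to statically bind entire GPUs to containers.
- Because DL jobs in production have diverse resource demands, a large number of GPUs are underutilized.
- As a result, GPU clusters have low GPU utilization, and queueing leads to long job completion times.
We propose TGS (Transparent GPU Sharing), a system that provides transparent GPU sharing for DL training in container clouds.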

[Code Analysis] TensorFlow Session Execution Analysis

December 23, 2022

6 minutes

Tensorflow GPU Antman

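The TensorFlow kernel launch process

This post walks through how a session executes and how Antman modifies that execution path.

Function call chain: Run() --> RunInternal() --> RunAsync() --> ScheduleReady() --> Process()

Antman modifies direct_session.cc so that a middleware framework runs before and after each session execution.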
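The "middleware before and after session execution" idea can be illustrated with a small Python sketch. This is not Antman's C++ code: the class names SessionRunRegistry and SessionRunAction are borrowed from the diagram in the Antman post, while the method names `before_run`, `after_run`, `register`, and `run_with_actions`, and the `RecordingAction` helper, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List


class SessionRunAction:
    """Hypothetical base class: one piece of middleware around a session run."""

    def before_run(self) -> None:
        pass

    def after_run(self) -> None:
        pass


class SessionRunRegistry:
    """Hypothetical registry that stores actions and wraps a run with them."""

    def __init__(self) -> None:
        self._actions: List[SessionRunAction] = []

    def register(self, action: SessionRunAction) -> None:
        self._actions.append(action)

    def run_with_actions(self, run_internal: Callable[[], object]) -> object:
        # Pre-run hooks fire in registration order.
        for action in self._actions:
            action.before_run()
        try:
            # Stand-in for the real session step (RunInternal() in the call chain).
            return run_internal()
        finally:
            # Post-run hooks fire in reverse order, even if the step raised.
            for action in reversed(self._actions):
                action.after_run()


class RecordingAction(SessionRunAction):
    """Toy action that records when it is called (assumption, for the demo)."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def before_run(self) -> None:
        self.log.append(f"{self.name}:before")

    def after_run(self) -> None:
        self.log.append(f"{self.name}:after")
```

In this sketch, an action such as memory adjustment or statistics dumping would subclass SessionRunAction and be registered once; every later session step is then wrapped without changing the step itself.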

[Code Analysis] Antman's Modifications to TensorFlow

December 4, 2022

8 minutes

Tensorflow GPU Antman

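Antman's code changes to TensorFlow

The diagram below shows the overall relationships. There are two main implementations: GPUResourceManagement on the memory side and GpuOpManager on the compute side.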

graph TD
A>gpu_resource_manage_file]
B[SessionRunRegistry]
C[SessionRunAction]
D[Executor]
E[GPUResourceManagement]
F[GPU Statistic]
G[GpuOpManager]
H[GpuUsageAdjustment]
I(dump gpu statistic)
J[GPU Process State]
K[GPUVMemAllocator]
L[GPUAdjustableAllocator]
A -->|FileListener| E
B -->|Register| E
E -->|need_to_adjust_memory_| H
H -->|new| L
H -->|get| K
C -->|Derive| E
C -->|Derive| F
B -->|Register| F
F -->|need_to_dump_statistics_| I
B -->|Run| C
J -->|maybe_create_gpu_vmem_allocator| K
D -->|run thread| G
E -->|GetEstimatedIdleTime| G

GPUVMemAllocator

GPUVMemAllocator can allocate host memory as a fallback for GPU memory, so that the job does not fail with an out-of-memory (OOM) error.
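The fallback idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the actual GPUVMemAllocator: the `FixedPool`, `FallbackAllocator`, and `OutOfMemory` names are hypothetical, and memory is modeled as a simple byte counter rather than real device or host buffers.

```python
class OutOfMemory(Exception):
    """Raised when a pool cannot satisfy a request."""


class FixedPool:
    """Hypothetical memory pool with a fixed capacity in bytes."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.used = 0

    def allocate(self, nbytes: int):
        if self.used + nbytes > self.capacity:
            raise OutOfMemory(f"{self.name}: cannot allocate {nbytes} bytes")
        self.used += nbytes
        # Return a tag saying where the allocation landed.
        return (self.name, nbytes)


class FallbackAllocator:
    """Try device memory first; fall back to host memory on OOM."""

    def __init__(self, device_pool: FixedPool, host_pool: FixedPool) -> None:
        self.device_pool = device_pool
        self.host_pool = host_pool

    def allocate(self, nbytes: int):
        try:
            return self.device_pool.allocate(nbytes)
        except OutOfMemory:
            # Device is full: serve the request from host memory instead.
            # If the host pool is also full, OutOfMemory propagates.
            return self.host_pool.allocate(nbytes)
```

A request that no longer fits on the device is served from the host pool, while smaller requests that still fit keep going to the device. Host-backed tensors are slower to access, so this design trades speed for not crashing.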

李志轩 | Tweakzx