Opening
System and algorithm co-design for distributed machine learning
Implementing ML algorithms is not easy. (Many stacks are involved.)
Model sizes > GB
More machines -> better throughput
-> but the number of iterations to converge can increase (how to distribute?)
Building today's machine learning systems -> building future machine learning systems
How to parallelize an ML program?
- Differences between an ML program and a classic program:
- Error tolerance -> an error can be corrected in the next iteration
- Asymmetric convergence (different parameters converge at different rates)
How to distribute? - Chopping: partition the model along its dependency structure
More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
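A minimal single-process sketch of the bounded-staleness rule from the SSP paper above: the fastest worker may run at most `staleness` clock ticks ahead of the slowest one. This assumes Python threads, and the `SSPClock` class and its method names are my own illustration, not the paper's implementation.

```python
# Sketch of the Stale Synchronous Parallel (SSP) rule: a worker blocks
# whenever it is more than `staleness` clocks ahead of the slowest worker.
import threading

class SSPClock:
    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def advance(self, worker_id):
        """Worker finished one iteration; wait until within the staleness bound."""
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()
            # The slowest worker never waits, so progress is always possible.
            while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                self.cond.wait()

def worker(clock, wid, iterations):
    for _ in range(iterations):
        # ... compute a gradient on the local data shard, push the update ...
        clock.advance(wid)

if __name__ == "__main__":
    clock = SSPClock(num_workers=4, staleness=2)
    threads = [threading.Thread(target=worker, args=(clock, i, 10)) for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("final clocks:", clock.clocks)
```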
What to communicate?
- Communicate partial results (still with convergence guarantees, surprisingly) -> hybrid updates (see the sketch below)
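A small numpy sketch of the "communicate partial results" idea, using plain data-parallel gradient averaging as the example (my own illustration, not code from the talk): each worker sends only the gradient computed on its data shard, and the server combines these partial results; the raw data never moves.

```python
import numpy as np

def partial_gradient(w, X, y):
    """Least-squares gradient on one worker's data shard."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.01 * rng.normal(size=1000)

shards = np.array_split(np.arange(1000), 4)   # 4 workers, 4 equal data shards
w = np.zeros(10)
for step in range(200):
    # Each worker contributes a partial result; with equal-sized shards their
    # average equals the full-batch gradient, so convergence is preserved.
    grads = [partial_gradient(w, X[idx], y[idx]) for idx in shards]
    w -= 0.1 * np.mean(grads, axis=0)

print("parameter error:", np.linalg.norm(w - w_true))
```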
Current machine learning frameworks:
Faster isn’t always better
Program optimization:
- JIT
- Ahead-of-time compilation
- Whole-program optimization (good for TPUs; see the sketch below)
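A short sketch of JIT / whole-program compilation, assuming JAX as the example framework (my choice, not named in this part of the notes): `jax.jit` traces the whole Python function once, so XLA sees the entire computation and can fuse operations before running it on CPU, GPU, or TPU.

```python
import jax
import jax.numpy as jnp

def predict(w, b, x):
    # Several ops that the compiler is free to fuse into fewer kernels.
    return jnp.tanh(x @ w + b)

predict_jit = jax.jit(predict)   # the whole function is compiled as one program

w = jnp.ones((128, 64))
b = jnp.zeros(64)
x = jnp.ones((32, 128))

out = predict_jit(w, b, x)       # first call triggers tracing + XLA compilation
print(out.shape)                 # (32, 64)
```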
TPU v2
A surprising development is TPU v2: TPUs can now talk to each other directly, without going through the host (server), much like using an FPGA on an RDMA NIC to accelerate certain computations. -> networked together
Reduced precision
Much of the time, machine learning computation does not need high precision. Giving up some precision can speed up processing and improve transistor utilization.
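A tiny numpy illustration of the trade-off (my own example): float16 uses half the bytes of float32, and a matrix product computed in float16 is still close to the float32 result, just less accurate.

```python
import numpy as np

a = np.random.default_rng(0).normal(size=(1024, 1024)).astype(np.float32)
b = np.random.default_rng(1).normal(size=(1024, 1024)).astype(np.float32)

full = a @ b                                                 # float32 reference
low = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

print("bytes per element:", np.float16().nbytes, "vs", np.float32().nbytes)
print("relative error:", np.linalg.norm(full - low) / np.linalg.norm(full))
```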
Using reinforcement learning to improve system performance, e.g. device placement
TVM: an end-to-end IR stack for deep learning systems
IR: intermediate representation
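A minimal sketch of what "end-to-end" means here, assuming TVM's classic tensor-expression API (`te.placeholder`, `te.compute`, `te.create_schedule`, `tvm.build`); newer TVM releases favor TensorIR, so the exact API may differ. The flow is: declare the computation as IR, schedule it, compile for a target backend, and run the generated kernel.

```python
import numpy as np
import tvm
from tvm import te

# Declare the computation as an IR expression: C[i] = A[i] + B[i].
n = 1024
A = te.placeholder((n,), name="A", dtype="float32")
B = te.placeholder((n,), name="B", dtype="float32")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Lower the IR through a schedule and compile it for a concrete backend.
s = te.create_schedule(C.op)
fadd = tvm.build(s, [A, B, C], target="llvm")

# Run the generated kernel on the CPU and check the result.
dev = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
b = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
c = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
fadd(a, b, c)
np.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())
```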
ChainerMN
https://github.com/chainer/chainermn
“Why are you using 1000 GPUs? (What network is so complex?)”
- “Because our CEO is crazy.”
Services of ML