AI × Systems @ SOSP'17

Opening

System and algorithm co-design for distributed machine learning

Implementing ML algorithms is not easy. (Many stacks involved.)

Model sizes > GB

More machines -> better throughput

  -> but the number of iterations to converge can also increase (so how should we distribute?)

Building today's machine learning systems -> building the machine learning systems of the future

How do we parallelize an ML program?

  • Differences between ML programs and classic programs
  • Error tolerance -> an error can be fixed in the next iteration
  • Asymmetric convergence

How to distribute? - Chop the model along its dependency structure (see the sketch below)
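
One reading of "chopping by model dependency" (my own toy illustration, not code from the talk): group parameters into blocks and only update mutually independent blocks in the same parallel round.

```python
# Toy illustration: parameter blocks with a conflict graph; blocks that do
# not conflict can be updated in the same parallel round.
dependencies = {          # block -> blocks it must not be updated alongside
    "w0": {"w1"},
    "w1": {"w0"},
    "w2": set(),
}

def schedule_rounds(deps):
    """Greedily pack mutually independent blocks into parallel rounds."""
    remaining = set(deps)
    rounds = []
    while remaining:
        this_round = []
        for block in sorted(remaining):
            if all(block not in deps[other] and other not in deps[block]
                   for other in this_round):
                this_round.append(block)
        remaining -= set(this_round)
        rounds.append(this_round)
    return rounds

print(schedule_rounds(dependencies))   # -> [['w0', 'w2'], ['w1']]
```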

More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

What to communicate?

  • Communicate partial results (convergence guarantees still hold!) -> hybrid updates (see the sketch below)
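
A minimal sketch of the bounded-staleness idea behind SSP (a toy clock abstraction, not the paper's parameter-server code): each worker may run ahead of the slowest worker by at most `staleness` iterations before it has to wait.

```python
import threading

class ToySSPClock:
    """Toy bounded-staleness barrier (illustration only): worker i may start
    iteration c only while c - min(all worker clocks) <= staleness."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def tick(self, worker_id):
        """Call at the end of each iteration; blocks if too far ahead."""
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()
            while self.clocks[worker_id] - min(self.clocks) > self.staleness:
                self.cond.wait()
```

With staleness 0 this collapses to bulk-synchronous execution; a larger bound lets fast workers keep computing on slightly stale parameters, which the error-tolerance property above lets the algorithm absorb.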

Today's machine learning frameworks:


Faster isn’t always better

Program optimization:

  • JIT
  • Ahead-of-time compilation
  • Whole-program optimization (good for TPUs; see the sketch below)
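
As a concrete (hedged) example of JIT plus whole-program compilation, the style XLA applies when targeting TPUs; JAX is used here purely as an illustration and was not named on this slide:

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # The whole function is traced and compiled as one program, so the
    # matmul, tanh, and mean can be fused and optimized together.
    pred = jnp.tanh(x @ w)
    return jnp.mean((pred - y) ** 2)

loss_jit = jax.jit(loss)   # whole-program JIT via XLA (CPU/GPU/TPU backends)

w = jnp.ones((8, 1))
x = jnp.ones((4, 8))
y = jnp.zeros((4, 1))
print(loss_jit(w, x, y))
```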

TPU v2

The surprising part is the development of TPU v2: TPUs can now interact and communicate with each other without going through the host (server), much like using FPGAs on RDMA NICs to accelerate certain computations. -> networked together

Reduced precision

Much of the time, machine learning does not need high-precision arithmetic. Dropping some precision can speed up processing and make better use of the transistors.
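
A quick NumPy illustration of the trade-off (my example, not from the talk): the same matrix product in float16 stays close to the float32 result, while halving memory traffic.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((512, 512)).astype(np.float32)
b = rng.standard_normal((512, 512)).astype(np.float32)

full = a @ b                                                   # float32 reference
half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

rel_err = np.abs(full - half).max() / np.abs(full).max()
print(f"max relative error from float16: {rel_err:.2e}")       # small for many ML uses
```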

Using reinforcement learning to improve system performance, e.g. device placement


TVM: an end-to-end IR stack for deep learning systems

IR: intermediate representation
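
A minimal sketch of the declare / schedule / compile flow the TVM stack exposes, following its tensor-expression tutorials; the exact module paths have moved between releases, so treat the names below as an assumption.

```python
# Sketch of TVM's declare -> schedule -> lower/compile flow (names follow the
# tensor-expression tutorials and may differ by TVM version).
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")   # high-level compute declaration

s = te.create_schedule(C.op)                   # scheduling (loop structure) decisions
fadd = tvm.build(s, [A, B, C], target="llvm")  # lower through the IR stack to machine code
```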


ChainerMN

https://github.com/chainer/chainermn

“Why are you using 1000 GPUs? (Is your network really that complex?)”

  • “Because our CEO is crazy.”
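
A minimal sketch of data-parallel training with ChainerMN (API names follow its documentation from that period and may have changed; the tiny MLP is my own example network):

```python
import chainer
import chainer.functions as F
import chainer.links as L
import chainermn

class MLP(chainer.Chain):
    """A tiny example network (illustration only)."""
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 100)
            self.l2 = L.Linear(100, 10)

    def __call__(self, x):
        return self.l2(F.relu(self.l1(x)))

comm = chainermn.create_communicator()   # one MPI process per GPU
device = comm.intra_rank                 # GPU id within this node

model = L.Classifier(MLP())
model.to_gpu(device)

# Wrapping the optimizer all-reduces gradients across workers at each step.
optimizer = chainermn.create_multi_node_optimizer(chainer.optimizers.Adam(), comm)
optimizer.setup(model)
```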

Services of ML

http://mwtds.azurewebsites.net/