Detailed algorithm for the Tenstorrent partition stage

MbjYjbpivj · August 12, 2024, 12:09pm

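Is there a detailed description of the algorithm behind TT's partition stage? How are the subgraphs of a network model assigned to individual cores for computation, and what algorithm is used to form the subgraphs?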

mingzhe · August 15, 2024, 2:26am

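According to this diagram, the graph-optimization algorithms are expected to be implemented in the BUDA front end. The project is still being set up; if you need a reference for graph-partitioning algorithms in the meantime, the flow dialect in iree has related optimizations.

In addition, at the data level, ttmlir performs some tensor layout conversion so that data of different sizes can be split into the uniform data format and size that the chip accepts. For details, see the tensor layout chapter of 深入TT-MLIR: Tenstorrent的mid-level编译器软件栈 (A deep dive into TT-MLIR: Tenstorrent's mid-level compiler software stack).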

saturn · August 15, 2024, 2:33am

The TT partition algorithm is implemented in the Buda front end, specifically in tt-buda/blob/main/pybuda/csrc/balancer/balancer.cpp.

github.com

tenstorrent/tt-buda/blob/main/pybuda/csrc/balancer/balancer.cpp#L405

    std::shared_ptr<BalancerSolution> run_balancer_and_placer(
        Graph* graph, BalancerConfig& config, std::shared_ptr<BalancerCacheCollection> cache_collection)
    {
        log_info("Running Balancer with Policy: {}", config.policy_type);
        PROFILE_SCOPE();

        log_debug(LogBalancer, "{}", config);
        LegalOpModels valid_op_models = legalizer::get_legal_op_models(graph, config, cache_collection);
        legalizer::GraphSolver graph_solver = get_graph_solver(config, cache_collection, graph, valid_op_models);
        BalancerPolicySolution balancer_policy_solution = run_policy(graph, config, graph_solver);
        update_ops_on_selected_op_models(graph, balancer_policy_solution.graph_solver_solution.selected_op_models);

        auto const& [op_models, block_shape_map, output_host_tms, cut_edges] =
            legalizer::resolve_block_shapes(graph, config, balancer_policy_solution.graph_solver_solution);

        if (balancer_policy_solution.placer_solution.has_value())
        {
            insert_input_queues(balancer_policy_solution.placer_solution.value(), graph, op_models);
        }
        else

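The function get_legal_op_models sets grid_shape(x,y), the number of Tensix cores a node will occupy, based on factors such as tensor size, data flow direction, and the L1 memory limit, and stores every legal op_model for each node.
The function run_policy then selects a suitable op_model for each node according to the configured policy. Once the op_model of every node is fixed, the subgraphs are fixed as well.
I'll post a more detailed analysis here later.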
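
As a rough illustration of the two steps described above (not the actual tt-buda implementation), the sketch below enumerates candidate grid shapes for one node, keeps those whose per-core footprint fits an L1 limit, and then lets a policy pick one. All names (`OpModel`, `legal_op_models`, `pick_smallest_grid`), the cost model, and the numbers are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class OpModel:
    """Hypothetical stand-in for one candidate placement of a node."""
    grid_r: int     # rows of cores
    grid_c: int     # columns of cores
    l1_tiles: int   # tiles each core must hold in L1
    cycles: int     # estimated execution cycles

    @property
    def cores(self) -> int:
        return self.grid_r * self.grid_c


def legal_op_models(total_tiles, total_cycles, max_r, max_c, l1_limit_tiles):
    """Enumerate grid shapes; keep those that split evenly and fit in L1."""
    models = []
    for r, c in product(range(1, max_r + 1), range(1, max_c + 1)):
        cores = r * c
        if total_tiles % cores:          # assumption: work must divide evenly
            continue
        per_core = total_tiles // cores
        if per_core > l1_limit_tiles:    # assumption: simple L1 capacity check
            continue
        models.append(OpModel(r, c, per_core, total_cycles // cores))
    return models


def pick_smallest_grid(models):
    """One possible policy: fewest cores, ties broken by fewer cycles."""
    return min(models, key=lambda m: (m.cores, m.cycles))


models = legal_op_models(total_tiles=64, total_cycles=8000,
                         max_r=8, max_c=2, l1_limit_tiles=16)
chosen = pick_smallest_grid(models)
print(len(models), chosen)
```

A different policy would only need a different `key` function; the set of legal candidates stays the same.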

saturn · August 16, 2024, 2:19am

run_balancer_and_placer: a worked example

Taking resnet50 as an example, let's analyze the pybuda balancer and placer algorithms. The command is:

python third_party/buda-model-demos/model_demos/cv_demos/resnet/pytorch_resnet.py

In the log file, look up the legal op_models produced by the key function get_legal_op_models mentioned above. For the node conv2d_0.dc.conv2d.3.dc.conv2d.1.dc.matmul.11, some of the legal op_models are:

# op_model 1
.grid_shape = GridShape{.r = 4, .c = 1}
.op_type = matmul,
.data_format = Float16_b,
.math_fidelity = HiFi3,
.t_stream_factor = TStreamFactor{.dir = TStreamDir::R, .r = 2, .c = 1},
.cached_execution_cycles = 86240,
.input_buffers = {
[0] = BufferModel{.block_shape = BlockShape{.t = 2, .mblock_m = 7, .mblock_n = 1, .ublock = UBlockShape{.rt = 7, .ct = 4}}, .buffer_factor = 2, .l1_size_tiles = 56, .data_format = Float16_b}
[1] = BufferModel{.block_shape = BlockShape{.t = 2, .mblock_m = 1, .mblock_n = 2, .ublock = UBlockShape{.rt = 4, .ct = 1}}, .buffer_factor = 2, .l1_size_tiles = 16, .data_format = Float16_b}
},
.output_buffers = {
[0] = BufferModel{.block_shape = BlockShape{.t = 2, .mblock_m = 7, .mblock_n = 2, .ublock = UBlockShape{.rt = 7, .ct = 1}}, .buffer_factor = 2, .l1_size_tiles = 196, .data_format = Float16_b}
},
# op_model 2
.grid_shape = GridShape{.r = 8, .c = 1}
.op_type = matmul,
.data_format = Float16_b,
.math_fidelity = HiFi3,
.t_stream_factor = TStreamFactor{.dir = TStreamDir::R, .r = 7, .c = 1},
.cached_execution_cycles = 43120,
.input_buffers = {
[0] = BufferModel{.block_shape = BlockShape{.t = 7, .mblock_m = 1, .mblock_n = 1, .ublock = UBlockShape{.rt = 7, .ct = 4}}, .buffer_factor = 2, .l1_size_tiles = 56, .data_format = Float16_b}
[1] = BufferModel{.block_shape = BlockShape{.t = 7, .mblock_m = 1, .mblock_n = 2, .ublock = UBlockShape{.rt = 4, .ct = 1}}, .buffer_factor = 2, .l1_size_tiles = 16, .data_format = Float16_b}
},
.output_buffers = {
[0] = BufferModel{.block_shape = BlockShape{.t = 7, .mblock_m = 1, .mblock_n = 2, .ublock = UBlockShape{.rt = 7, .ct = 1}}, .buffer_factor = 2, .l1_size_tiles = 28, .data_format = Float16_b}
},

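The op_model fields are interpreted as follows:

GridShape: the shape of the grid of Tensix cores the op occupies; {.r = 4, .c = 1} is 4 rows by 1 column, i.e. 4 cores
op_type: the operator type
t_stream_factor: the streaming factor; it gives the direction in which data flows and is used to slice the input tensor into smaller pieces
cached_execution_cycles: the execution time of the operator, in cycles
input_buffers, output_buffers: the shapes of the input and output data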
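
As a sanity check on how these fields relate, the output-buffer sizes in the two logged op_models are consistent with `l1_size_tiles = mblock_m * mblock_n * ublock.rt * ublock.ct * buffer_factor`. This relation is inferred from these two log entries only, not taken from the tt-buda source, and the input buffers evidently follow a different rule. The helper name below is hypothetical.

```python
def output_l1_tiles(mblock_m, mblock_n, ublock_rt, ublock_ct, buffer_factor):
    """Tiles held in L1 for an output buffer, under the inferred relation."""
    return mblock_m * mblock_n * ublock_rt * ublock_ct * buffer_factor


# Output buffer fields copied from the log above.
op_model_1 = output_l1_tiles(mblock_m=7, mblock_n=2, ublock_rt=7,
                             ublock_ct=1, buffer_factor=2)
op_model_2 = output_l1_tiles(mblock_m=1, mblock_n=2, ublock_rt=7,
                             ublock_ct=1, buffer_factor=2)
print(op_model_1, op_model_2)  # 196 28, matching .l1_size_tiles in the log
```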

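The log above is not complete. In total, 201 op_models were tried for this node, of which 68 were legal. The legal GridShapes range from the smallest (1,1) to the largest (8,2), and the execution time drops from 344176 cycles to 23912 cycles.

An op_model can be rejected for the following reasons: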
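
To put these numbers in perspective, the short calculation below is plain arithmetic on the figures quoted above, with hypothetical variable names. It computes the speedup of the largest grid over the smallest and the resulting per-core efficiency, and checks that the two logged op_models have the same cores × cycles product.

```python
cycles_1x1 = 344176   # GridShape (1,1), from the log
cycles_8x2 = 23912    # GridShape (8,2), from the log

speedup = cycles_1x1 / cycles_8x2
efficiency = speedup / (8 * 2)   # speedup divided by the number of cores used
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")

# The two op_models shown earlier: doubling the rows halves the cycles.
core_cycles_4x1 = 4 * 1 * 86240
core_cycles_8x1 = 8 * 1 * 43120
print(core_cycles_4x1 == core_cycles_8x1)
```

Going from 1 core to 16 cores gives roughly a 14.4x speedup, so scaling is close to linear but not perfect.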

#define OpModelFailureReasons                                                               \
    X(NoFailure, "No Failure - Valid OpModel")                                              \
    X(IllegalStackForGrid, "Illegal stack for grid dimension")                              \
    X(UserAccessPreventsStreaming, "User access prevents streaming")                        \
    X(OperandAccessPreventsStreaming, "Producer access prevents streaming")                 \
    X(OperandAndUserAccessPreventsStreaming, "Producer and User access prevents streaming") \
    X(IllegalStreaming, "Illegal streaming")                                                \
    X(L1UsageOverMaxLimit, "L1 usage > L1 Max")                                             \
    X(ExceededDramChannelCapacity, "Exceeded DRAM channel capacity")                        \
    X(InputBufferAllocationFailure, "Failed to allocate input buffers")                     \
    X(PaddingConstraintsNotSatisfied, "Padding constraints not satisfied")                  \
    X(IllegalSparseMatmul, "Illegal sparse matmul")                                         \
    X(MaxFailureReason, "")

In essence, these checks verify that memory usage stays within its limits and that the data streaming is legal.
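
A minimal sketch of what such a check might look like, using a few of the reason names from the macro above. The function `check_op_model`, its parameters, and the byte limits are hypothetical and much simpler than the real legalizer.

```python
from enum import Enum


class OpModelFailureReason(Enum):
    # Subset of the reasons listed in the macro above.
    NoFailure = "No Failure - Valid OpModel"
    IllegalStreaming = "Illegal streaming"
    L1UsageOverMaxLimit = "L1 usage > L1 Max"
    ExceededDramChannelCapacity = "Exceeded DRAM channel capacity"


def check_op_model(l1_usage_bytes, l1_max_bytes, streaming_ok=True):
    """Return the first failure found, or NoFailure (hypothetical logic)."""
    if not streaming_ok:
        return OpModelFailureReason.IllegalStreaming
    if l1_usage_bytes > l1_max_bytes:
        return OpModelFailureReason.L1UsageOverMaxLimit
    return OpModelFailureReason.NoFailure


candidates = {"small": 400_000, "large": 2_000_000}   # made-up L1 usage
results = {name: check_op_model(usage, l1_max_bytes=1_000_000)
           for name, usage in candidates.items()}
legal = [name for name, reason in results.items()
         if reason is OpModelFailureReason.NoFailure]
print(legal)
```

Recording a reason for every rejected candidate, rather than a plain boolean, is what makes a summary such as "201 tried, 68 legal" easy to break down by cause.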

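Looking at the key function run_policy_nlp, besides checking that the output and input data formats of adjacent nodes match, it also selects the op_model whose execution time is the closest match, so that the execution times of the nodes stay as close to each other as possible.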
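
The selection step described above can be sketched as follows. This illustrates the idea of picking the compatible candidate closest to a target cycle count; it is not the code of run_policy_nlp. The function name, the compatibility test (comparing a hypothetical `fmt` field with the producer's output format), and the target value are assumptions. The cycle counts are taken from the log, while the third candidate's format is made up.

```python
def pick_closest(candidates, target_cycles, producer_fmt):
    """Among format-compatible candidates, pick the one nearest the target."""
    compatible = [c for c in candidates if c["fmt"] == producer_fmt]
    if not compatible:
        raise ValueError("no op_model is compatible with the producer")
    return min(compatible, key=lambda c: abs(c["cycles"] - target_cycles))


candidates = [
    {"grid": (4, 1), "cycles": 86240, "fmt": "Float16_b"},
    {"grid": (8, 1), "cycles": 43120, "fmt": "Float16_b"},
    {"grid": (8, 2), "cycles": 23912, "fmt": "Float16"},   # made-up format
]
best = pick_closest(candidates, target_cycles=40000, producer_fmt="Float16_b")
print(best["grid"])
```

With a shared target for all nodes, each node ends up near the same cycle count, which keeps any single node from dominating the pipeline.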

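In summary, the pybuda graph is compiled from tvm. I did not find any step that splits the large graph into smaller subgraphs and then merges the nodes inside each subgraph. Instead, the balancer/placer analyzes every node in the graph directly (by default each node occupies at least one Tensix core), collects its legal op_models, and then picks the final op_model from among them using a suitable policy. The policy can be to use the fewest Tensix cores, or to make the execution times of all nodes as close as possible.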