Contributor

@waliwali777 waliwali777 commented Nov 11, 2024

PR types

Others

PR changes

Others

Description

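The online auto-parallel PaddleNLP-CI-gpt-3 CI job randomly crashes with exit -6 after the tests have finished running. Previously, the CI was terminated and the user was told to fix it with a re-run. Because the random crash occurs fairly often, this places a heavy burden on users. This PR therefore addresses the problem with an automatic re-run:

  1. Running a test_case requires a subprocess to write the set of test functions to function.txt; the parent process reads the function names from function.txt and then invokes each one in a subprocess. The parent process can therefore capture the ExitCode of every test.
  2. A subprocess returning ExitCode = 250 indicates the random exit -6 crash. The test is automatically re-run once; if the problem recurs, the log only notes that this test hit the random crash multiple times, and the CI is not interrupted but continues with the remaining tests.
  3. After a test_case finishes, the following are tallied for the current case: total number of tests, successful tests, tests that failed to run, tests that failed the precision check, and tests that crashed randomly.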
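
The three steps above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: the function names `run_one_test` and `run_case_list`, the counter variables, and the value of `PRECISION_FAIL_CODE` are all hypothetical. Only the exit code 250 and the idea of reading test names from a list file come from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the auto re-run flow; names are illustrative only.

RANDOM_CRASH_CODE=250    # from the PR description: marks the random "exit -6" crash
PRECISION_FAIL_CODE=2    # ASSUMPTION: the description does not say how precision failures are signalled

total=0; passed=0; run_failed=0; precision_failed=0; random_crashed=0

# Run one test function in a subshell so the parent can capture its exit code.
# If the code marks a random crash, re-run the test exactly once.
run_one_test() {
    local test_name=$1
    local exit_code=0
    ( "$test_name" ) || exit_code=$?
    if [ "$exit_code" -eq "$RANDOM_CRASH_CODE" ]; then
        echo "[re-run] $test_name hit the random crash, running it once more"
        exit_code=0
        ( "$test_name" ) || exit_code=$?
    fi
    return "$exit_code"
}

# Read test function names (one per line) from a list file, run each one,
# and tally the results without aborting on a repeated random crash.
run_case_list() {
    local list_file=$1
    local test_name exit_code
    # Read the list on fd 3 so that tests cannot consume it through stdin.
    while IFS= read -r test_name <&3; do
        [ -n "$test_name" ] || continue
        total=$((total + 1))
        exit_code=0
        run_one_test "$test_name" || exit_code=$?
        case "$exit_code" in
            0) passed=$((passed + 1)) ;;
            "$RANDOM_CRASH_CODE")
                random_crashed=$((random_crashed + 1))
                echo "[warn] $test_name hit the random crash again, continuing" ;;
            "$PRECISION_FAIL_CODE") precision_failed=$((precision_failed + 1)) ;;
            *) run_failed=$((run_failed + 1)) ;;
        esac
    done 3< "$list_file"
    echo "total=$total passed=$passed run_failed=$run_failed" \
         "precision_failed=$precision_failed random_crashed=$random_crashed"
}
```

With a list file such as function.txt, a call like `run_case_list function.txt` would run every listed function and print one summary line. Running each test in a subshell is what lets a test call `exit` without taking the parent script down with it.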


paddle-bot bot commented Nov 11, 2024

Thanks for your contribution!


codecov bot commented Nov 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.97%. Comparing base (b5e3f0c) to head (77d8675).
Report is 240 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9400      +/-   ##
===========================================
- Coverage    53.01%   52.97%   -0.04%     
===========================================
  Files          678      676       -2     
  Lines       108787   107838     -949     
===========================================
- Hits         57668    57124     -544     
+ Misses       51119    50714     -405     


function llama_dygraph_auto_bs8_fp32_DP2() {
echo "=========== $FUNCNAME run begin ==========="
export_env
cd ${llama_case_path}

Why does every function call export_env and cd ${llama_case_path} again? I suggest running them once as a hook in run_ci.sh.

llama_align_dy2st_fthenb_and_vpp_auto_bs2_fp32_DP1-MP1-PP4
llama_align_dygraph_dy2st_pir_auto_pp_bs2_bf16_DP1-MP1-PP4
)
restore_func $fun_list

I suggest merging restore_llama_case_list_auto_func and llama_case_list_auto into one, so that only a single list is maintained.


@wawltor wawltor left a comment


LGTM

@wawltor wawltor merged commit 3fe6aba into PaddlePaddle:develop Nov 19, 2024
10 of 12 checks passed
3 participants