Investigating a bcrypt Deadlock

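Our product manager reported that the application frequently stops responding. I captured a dump file on his machine, brought it back, and analyzed it with WinDbg. It turned out to be an instructive session.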

Debugging process

Load symbols

>!analyze -v
wow64cpu!CpupSyscallStub+0x9:

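The wow64cpu frame shows that this is a 32-bit process running under WOW64 on 64-bit Windows, so switch the debugger to the 32-bit view: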

>.load wow64exts
>!sw
>!analyze -v

STACK_TEXT:
002294a8 76a43d3c 00000000 002294ec 396663fe ntdll_771d0000!ZwDelayExecution+0x15
00229510 5169f801 00000001 00000000 00035086 KERNELBASE!SleepEx+0x65
00229528 519377cd 0dd20834 0dd20830 00229650 clr!EESleepEx+0x4f
00229538 517afa11 00000000 39676e87 025cfb70 clr!__DangerousSwitchToThread+0x72
00229650 517afad5 00390da0 39676e37 00229700 clr!ThreadNative::StartInner+0x2c1
002296e0 79946049 00229700 00000000 025cfb50 clr!ThreadNative::Start+0x6a
002296f8 79945fe4 00000001 00229738 0802a1c7 mscorlib_ni!System.Threading.Thread.Start(System.Threading.StackCrawlMark ByRef)+0x61
00229704 0802a1c7 023dbe24 025cfb40 00000000 mscorlib_ni!System.Threading.Thread.Start()+0x18

On inspection, this does not look like the scene of the incident. Let's look at the other threads:

>~*kb
// some output omitted here
......
417 Id: 1480.3cfc Suspend: 0 Teb: 7ea61000 Unfrozen
# ChildEBP RetAddr Args to Child
00 3849f5dc 7721eb4e 00000250 00000000 00000000 ntdll_771d0000!NtWaitForSingleObject+0x15
01 3849f640 7721ea32 00000000 00000000 00000000 ntdll_771d0000!RtlpWaitOnCriticalSection+0x13e
02 3849f668 77209aa9 772d20c0 4f52bc5c 7ea63000 ntdll_771d0000!RtlEnterCriticalSection+0x150
03 3849f6fc 7720984c 3849f76c 4f52bde8 00000000 ntdll_771d0000!LdrpInitializeThread+0xc6
04 3849f748 77209879 3849f76c 771d0000 00000000 ntdll_771d0000!_LdrpInitialize+0x1ad
05 3849f758 00000000 3849f76c 771d0000 00000000 ntdll_771d0000!LdrInitializeThunk+0x10

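The dump has more than 400 threads (the listing runs up to thread 417), and most of them have this same stack: blocked in LdrpInitializeThread, waiting on a critical section. This looks like a deadlock that keeps newly created threads from finishing initialization, so they never run and never exit.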

>!locks
Scanned 9 critical sections

Nothing useful there. Let's dump the critical sections in detail:

>!cs -s -l -o
DebugInfo = 0x772d4380
Critical section = 0x772d20c0 (ntdll_771d0000!LdrpLoaderLock+0x0)
LOCKED
LockCount = 0x187
WaiterWoken = No
OwningThread = 0x00001a0c
RecursionCount = 0x1
LockSemaphore = 0x250
SpinCount = 0x00000000
OwningThread DbgId = ~47s
OwningThread Stack =
ChildEBP RetAddr Args to Child
184fd76c 7721eb4e 00000ed8 00000000 00000000 ntdll_771d0000!NtWaitForSingleObject+0x15 (FPO: [3,0,0])
184fd7d0 7721ea32 00000000 00000000 00000000 ntdll_771d0000!RtlpWaitOnCriticalSection+0x13e (FPO: [Non-Fpo])
184fd7f8 6f882f8e 6f894060 00000000 0dd63b7c ntdll_771d0000!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
184fd820 6f882dc8 06f640c8 00000000 0dd63b78 bcrypt+0x2f8e
184fd874 72269a42 184fd894 72269a70 00000000 bcrypt+0x2dc8
184fd898 7659c167 722a49cc 184fd8c0 721aea6e msxml6!AutoInitSalt::AutoInitSalt+0x1f (FPO: [Non-Fpo]) (CONV: thiscall)
184fd8a4 721aea6e 721aea9c 721aeb90 00000001 msvcrt!_initterm+0x13 (FPO: [Non-Fpo])
184fd8c0 72191456 72190000 00000000 00000000 msxml6!_CRT_INIT+0xc3 (FPO: [Non-Fpo]) (CONV: stdcall)
184fd920 721ae3fe 72190000 00000001 00000000 msxml6!__DllMainCRTStartup+0x9e (FPO: [Non-Fpo]) (CONV: stdcall)
184fd940 77209344 72190000 00000001 00000000 msxml6!InitDllMain+0x90 (FPO: [Non-Fpo]) (CONV: stdcall)
184fd960 7720fde1 7219135c 72190000 00000001 ntdll_771d0000!LdrpCallInitRoutine+0x14
184fda54 7720ea5e 00000000 6f549168 184fdbf8 ntdll_771d0000!LdrpRunInitializeRoutines+0x26f (FPO: [Non-Fpo])
184fdbc8 7724d39f 184fdc38 184fdbf8 0dee0974 ntdll_771d0000!LdrpLoadDll+0x472 (FPO: [Non-Fpo])
184fdc04 76a42e0f 00000000 184fdc58 184fdc38 ntdll_771d0000!LdrLoadDll+0xc7 (FPO: [Non-Fpo])
184fdc4c 75d29c67 00000000 00000000 00002008 KERNELBASE!LoadLibraryExW+0x233 (FPO: [Non-Fpo])
184fdc68 75d29bea 00000000 184fdce4 00002008 ole32!LoadLibraryWithLogging+0x16 (FPO: [Non-Fpo]) (CONV: stdcall)
184fdc8c 75d29ad6 184fdce4 184fdcb0 184fdcb4 ole32!CClassCache::CDllPathEntry::LoadDll+0xaf (FPO: [Non-Fpo]) (CONV: stdcall)
184fdcbc 75d28fde 184fdce4 184fdfcc 184fdcdc ole32!CClassCache::CDllPathEntry::Create_rl+0x37 (FPO: [Non-Fpo]) (CONV: stdcall)
184fdf08 75d28eb3 00000001 184fdfcc 184fdf38 ole32!CClassCache::CClassEntry::CreateDllClassEntry_rl+0xd4 (FPO: [Non-Fpo]) (CONV: thiscall)
184fdf50 75d28db9 00000001 071100e8 184fdf7c ole32!CClassCache::GetClassObjectActivator+0x224 (FPO: [Non-Fpo]) (CONV: stdcall)
${$ntdllwsym}!RtlpStackTraceDataBase is NULL. Probably the stack traces are not enabled.
-----------------------------------------
DebugInfo = 0x070be3a8
Critical section = 0x6f894060 (bcrypt+0x14060)
LOCKED
LockCount = 0x1
WaiterWoken = No
OwningThread = 0x000021e4
RecursionCount = 0x1
LockSemaphore = 0xED8
SpinCount = 0x00000000
OwningThread DbgId = ~18s
OwningThread Stack =
ChildEBP RetAddr Args to Child
0b80e2bc 7721eb4e 00000250 00000000 00000000 ntdll_771d0000!NtWaitForSingleObject+0x15 (FPO: [3,0,0])
0b80e320 7721ea32 00000000 00000000 0b80e388 ntdll_771d0000!RtlpWaitOnCriticalSection+0x13e (FPO: [Non-Fpo])
0b80e348 77200329 772d20c0 7c9ba944 6f88275c ntdll_771d0000!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
0b80e3e4 77200262 6f840000 0b80e420 00000000 ntdll_771d0000!LdrGetProcedureAddressEx+0x159 (FPO: [Non-Fpo])
0b80e400 76a41f7c 6f840000 0b80e420 00000000 ntdll_771d0000!LdrGetProcedureAddress+0x18 (FPO: [Non-Fpo])
0b80e428 6f8826e6 6f840000 6f88275c 0dd7a980 KERNELBASE!GetProcAddress+0x44 (FPO: [Non-Fpo])
0b80e43c 6f882fbc 6f840000 0dd7a980 00000000 bcrypt+0x26e6
0b80e470 6f882dc8 0dd7a980 00000000 0dd63c20 bcrypt+0x2fbc
0b80e4c4 6f8564b4 0b80e4f0 76b6b79c 00000000 bcrypt+0x2dc8
0b80e4f4 6f856445 00000002 0b80e5b4 0b80e618 bcryptprimitives!CheckSignaturePadding+0x44 (FPO: [Non-Fpo])
0b80e530 6f886380 0dd7b1b8 0b80e5b4 0b80e618 bcryptprimitives!MSCryptRsaVerifySignature+0x90 (FPO: [Non-Fpo])
0b80e574 76b8515e 0dc37230 0b80e5b4 0b80e618 bcrypt+0x6380
0b80e5a8 76b8511a 0dc37230 76b6b79c 0b80e618 crypt32!I_CryptCNGSignAndEncodeHash+0x18d (FPO: [Non-Fpo])
0b80e5e0 76b84fc5 00000001 0044b478 00000000 crypt32!I_CryptCNGVerifyEncodedSignature+0xba (FPO: [Non-Fpo])
0b80e668 76b84eb4 76b6bf8c 00000001 0044b478 crypt32!I_CryptCNGVerifyCertificateSignedContent+0x15d (FPO: [Non-Fpo])
0b80e6d4 76b65db2 00000000 00000001 00000002 crypt32!CryptVerifyCertificateSignatureEx+0x242 (FPO: [Non-Fpo])
0b80e72c 76b65b7c 0de850b0 0ddb6358 00000000 crypt32!ChainGetSubjectStatus+0x90 (FPO: [Non-Fpo])
0b80e758 76b655a3 0de850b0 00000000 0ddb6358 crypt32!CCertIssuerList::CreateElement+0x51 (FPO: [Non-Fpo])
0b80e790 76b694fc 0de850b0 0de85128 07138910 crypt32!CCertIssuerList::AddIssuer+0x87 (FPO: [Non-Fpo])
0b80e7bc 76b6605d 00000002 0de850b0 0de85128 crypt32!CChainPathObject::FindAndAddIssuersFromCacheByMatchType+0x87 (FPO: [Non-Fpo])
${$ntdllwsym}!RtlpStackTraceDataBase is NULL. Probably the stack traces are not enabled.

There are many more entries, but the first two are enough to crack the case.

Thread 18:

0b80e348 77200329 772d20c0 7c9ba944 6f88275c ntdll_771d0000!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
It owns critical section 0x6f894060 (bcrypt+0x14060) and is waiting to enter critical section 0x772d20c0 (LdrpLoaderLock).

Thread 47:

184fd7f8 6f882f8e 6f894060 00000000 0dd63b7c ntdll_771d0000!RtlEnterCriticalSection+0x150 (FPO: [Non-Fpo])
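It owns critical section 0x772d20c0 (LdrpLoaderLock) and is waiting to enter critical section 0x6f894060 (bcrypt+0x14060). Each thread holds the lock the other one needs, which is a deadlock.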
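The circular wait can also be checked mechanically. Below is a minimal Python sketch, assuming the `!cs` output has already been reduced by hand to two small tables (who owns each critical section, and which one each thread is blocked on). The names `owner_of`, `waiting_on`, and `find_cycle` are hypothetical helpers for illustration, not part of WinDbg.

```python
# Owner of each critical section, taken from the "OwningThread DbgId" fields.
owner_of = {
    0x772D20C0: 47,  # ntdll LdrpLoaderLock
    0x6F894060: 18,  # bcrypt+0x14060
}

# Critical section each thread is blocked on, taken from the first
# argument of the RtlEnterCriticalSection frame in its stack.
waiting_on = {
    47: 0x6F894060,
    18: 0x772D20C0,
}


def find_cycle(start, owner_of, waiting_on):
    """Follow thread -> wanted lock -> owning thread until a thread repeats.

    Returns the list of threads forming the cycle (first thread repeated
    at the end), or None if the chain ends without a cycle.
    """
    path = [start]
    current = start
    while current in waiting_on:
        owner = owner_of.get(waiting_on[current])
        if owner is None:
            return None  # lock is free or its owner is unknown
        if owner in path:
            return path[path.index(owner):] + [owner]
        path.append(owner)
        current = owner
    return None


cycle = find_cycle(47, owner_of, waiting_on)
print(cycle)
```

Starting from any thread that is merely queued on the loader lock (thread 417, for example) leads into the same 47 and 18 cycle, which is why the remaining threads never make progress.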

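Looking again at the entry owned by thread 47, LockCount = 0x187 means 391 threads are blocked on this lock, which roughly matches the thread count. It is therefore fairly safe to conclude that the hang is caused by critical section 0x772d20c0 (the loader lock) never being released. Both stacks pass through bcrypt. Thread 47 reaches it from a static initializer in msxml6 (AutoInitSalt) that runs during DLL initialization while the loader lock is held. Thread 18 holds bcrypt's own lock and calls GetProcAddress, which needs the loader lock. This lock-order inversion points to a deadlock in Microsoft's bcrypt module.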
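The hex-to-decimal step and the owner lookup are easy to script when there are many entries to sift through. Below is a rough Python sketch, assuming the `!cs` text has been saved and trimmed to the fields of interest. The `parse_sections` helper and its field names are hypothetical, and LockCount is taken at face value as the number of blocked threads, as in the text.

```python
import re

# Trimmed excerpt of the !cs output shown earlier.
cs_output = """\
Critical section = 0x772d20c0 (ntdll_771d0000!LdrpLoaderLock+0x0)
LOCKED
LockCount = 0x187
OwningThread = 0x00001a0c
OwningThread DbgId = ~47s
-----------------------------------------
Critical section = 0x6f894060 (bcrypt+0x14060)
LOCKED
LockCount = 0x1
OwningThread = 0x000021e4
OwningThread DbgId = ~18s
"""


def parse_sections(text):
    """Extract address, symbol, LockCount and owner thread from each entry."""
    sections = []
    # Entries are separated by a line made only of dashes.
    for block in re.split(r"^-+$", text, flags=re.MULTILINE):
        addr = re.search(r"Critical section\s*=\s*(0x[0-9a-fA-F]+)\s*\((.+?)\)", block)
        count = re.search(r"LockCount\s*=\s*(0x[0-9a-fA-F]+)", block)
        owner = re.search(r"OwningThread DbgId\s*=\s*~(\d+)s", block)
        if addr and count and owner:
            sections.append({
                "address": int(addr.group(1), 16),
                "symbol": addr.group(2),
                "lock_count": int(count.group(1), 16),
                "owner": int(owner.group(1)),
            })
    return sections


sections = parse_sections(cs_output)
for s in sections:
    print(f"{s['symbol']}: owner ~{s['owner']}, LockCount {s['lock_count']}")
```

Sorting the parsed entries by `lock_count` puts the most contended lock first, which here is the loader lock with 391.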

A search online confirmed that other people have run into the same problem:
https://social.technet.microsoft.com/Forums/Lync/en-US/dee65a4a-ed42-426b-8540-427d2875154f/excel-365-may-experience-a-deadlock-while-opening-encrypted-spreadsheet?forum=Office2016ITPro
It is a relief that our client code is not at fault, but a deadlock inside Microsoft's crypto module still feels, well, not great.

Postscript

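A crypto utility module that takes locks internally is a puzzling design. We have seen this problem on Win7 and have had no reports on Win10 so far. In the end, this aspect of the .NET Framework design draws deserved criticism: the runtime cannot be shipped separately from the system and depends on it too heavily. First, anything wrong in the user's own environment can make the software misbehave. Second, some problems appear only in certain environments, so they hit some users and not others, and the framework itself cannot be changed. No wonder it never really took off. Fortunately Microsoft came around in time, and .NET Core 3 is finally back on the right track.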

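WPF is certainly productive to develop with, but given the variety of system environments users have, .NET has a fundamental weakness as long as developers cannot build and patch it themselves. Finding a problem and being unable to fix it is frustrating, and Microsoft's confidence in its own framework will eventually hurt it. We will see whether .NET 5 brings any real breakthrough.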