Logs for PR #927 (2026-01-22T22:45:22.534831+00:00):

=== STATUS: Programs executed successfully: main_matrix_transpose, main_matrix_multiply ===
=== main_matrix_transpose stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 11.8432 sec (CUDA: 0.112519 sec, OpenCL: 0.706046 sec, Vulkan: 11.0246 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using CUDA API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
algorithm times (in seconds) - 10 values (min=0.0239883 10%=0.0239894 median=0.0240359 90%=0.0258209 max=0.0258209)
median effective algorithm bandwidth: 41.6045 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
algorithm times (in seconds) - 10 values (min=0.00817342 10%=0.00817681 median=0.00818267 90%=0.0082992 max=0.0082992)
median effective algorithm bandwidth: 122.209 GB/s
=== main_matrix_multiply stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.308016 sec (CUDA: 0.124573 sec, OpenCL: 0.038077 sec, Vulkan: 0.145307 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using CUDA API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=11.97 10%=11.97 median=11.97 90%=11.97 max=11.97)
algorithm GFlops: 1.43454 GFlops
algorithm effective memory bandwidth: 0.0045687 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
algorithm times (in seconds) - 10 values (min=0.171345 10%=0.172939 median=0.174058 90%=0.329976 max=0.329976)
algorithm GFlops: 98.6536 GFlops
algorithm effective memory bandwidth: 0.314191 GB/s
relative differences with CPU: 8388608 values (min=0 10%=8.67401e-08 median=4.71637e-07 90%=2.07923e-06 max=3.12559)
median relative difference with CPU: 4.71637e-07
99% percentile relative difference with CPU: 1.95534e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
algorithm times (in seconds) - 10 values (min=0.152764 10%=0.152767 median=0.152778 90%=0.155259 max=0.155259)
algorithm GFlops: 112.395 GFlops
algorithm effective memory bandwidth: 0.357955 GB/s
relative differences with CPU: 8388608 values (min=0 10%=8.67415e-08 median=4.71645e-07 90%=2.07943e-06 max=6.30526)
median relative difference with CPU: 4.71645e-07
99% percentile relative difference with CPU: 1.95739e-05