Logs for PR #1047 (2026-02-26T03:57:29.817619+00:00):

=== STATUS: Programs executed successfully: main_matrix_transpose, main_matrix_multiply ===
=== main_matrix_transpose stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 8.72174 sec (CUDA: 0.11048 sec, OpenCL: 0.8089 sec, Vulkan: 7.80229 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
Kernels compilation done in 3.46203 seconds
algorithm times (in seconds) - 10 values (min=0.028699 10%=0.0297832 median=0.0298146 90%=3.4951 max=3.4951)
median effective algorithm bandwidth: 33.5407 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
Kernels compilation done in 0.0991758 seconds
algorithm times (in seconds) - 10 values (min=0.00847714 10%=0.00847844 median=0.00849441 90%=0.107754 max=0.107754)
median effective algorithm bandwidth: 117.724 GB/s
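
The two bandwidth figures above appear consistent with counting one read and one write of the 512 MB matrix, dividing by the median time, and treating 1 GB as 1024 MB. This is inferred from the logged numbers only, not from the benchmark source. The function name below is hypothetical, and the sketch is just a way to cross-check the log:

```python
def transpose_bandwidth_gbps(matrix_mb: float, median_seconds: float) -> float:
    """Effective bandwidth, assuming the matrix is read once and written once."""
    # Assumption: 1 GB = 1024 MB, which is what makes the logged values line up.
    total_gb = 2.0 * matrix_mb / 1024.0
    return total_gb / median_seconds


# Median times copied from the log lines above.
naive = transpose_bandwidth_gbps(512, 0.0298146)
local = transpose_bandwidth_gbps(512, 0.00849441)

print(f"naive: {naive:.3f} GB/s")
print(f"local memory: {local:.3f} GB/s")
print(f"speedup: {local / naive:.2f}x")
```

Using the median rather than the mean keeps the first, warm-up-dominated iteration (the 3.4951 s and 0.107754 s maxima) from distorting the result.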
=== main_matrix_multiply stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.320729 sec (CUDA: 0.125026 sec, OpenCL: 0.0377321 sec, Vulkan: 0.157902 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=11.9457 10%=11.9457 median=11.9457 90%=11.9457 max=11.9457)
algorithm GFlops: 1.43746 GFlops
algorithm effective memory bandwidth: 0.00457799 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
Kernels compilation done in 0.107665 seconds
algorithm times (in seconds) - 10 values (min=0.061319 10%=0.0617409 median=0.0631913 90%=0.172276 max=0.172276)
algorithm GFlops: 271.738 GFlops
algorithm effective memory bandwidth: 0.865428 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
Kernels compilation done in 0.111219 seconds
algorithm times (in seconds) - 10 values (min=0.0288905 10%=0.0315843 median=0.031748 90%=0.136915 max=0.136915)
algorithm GFlops: 540.869 GFlops
algorithm effective memory bandwidth: 1.72255 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
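
The GFlops values in this run appear consistent with counting H x W x (2K - 1) operations (K multiplications and K - 1 additions per output element), dividing by the median time, and treating 1 GFlop as 1e9 operations. This is inferred from the logged numbers, not from the benchmark source. The function name is hypothetical:

```python
def matmul_gflops(h: int, w: int, k: int, seconds: float) -> float:
    """GFlops for C = A x B, assuming 2K - 1 operations per output element."""
    ops = h * w * (2 * k - 1)  # assumption: K multiplies + (K - 1) adds
    return ops / seconds / 1e9


# Sizes and median times copied from the log above.
H, W, K = 2048, 4096, 1024
cpu = matmul_gflops(H, W, K, 11.9457)
naive = matmul_gflops(H, W, K, 0.0631913)
local = matmul_gflops(H, W, K, 0.031748)

print(f"CPU with OpenMP: {cpu:.5f} GFlops")
print(f"01 naive: {naive:.3f} GFlops")
print(f"02 using local memory: {local:.3f} GFlops")
print(f"local memory vs naive: {local / naive:.2f}x")
```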