Note: This article is reproduced from serverfault.com.

cuda indexing nvidia pycuda

indexing - Is Unique Thread Id guaranteed for each Kernel Call in CUDA?

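Posted on 2020-12-06 12:31:34

I recently started working with CUDA. I have experience with multithreaded and multiprocess coding in C++, Java and Python.

With PyCuda I see example code like this,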

import pycuda.autoinit  # creates a CUDA context on import
from pycuda.compiler import SourceModule

ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
     int i = threadIdx.x;
     outvec[i] = scalar*vec[i];
}
""")

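The thread ID itself seems to take part in the logic of the code. The question then is whether there will be enough thread IDs to cover my entire array (whose indices I obviously need in order to reach every element), and what happens if I change the size of the array.

Will the indexing always be between 0 and N?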

Questioner: BBSysDyn


Paul G. 2020-12-15 19:24:06

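In CUDA the thread ID is only unique within a so-called thread block, which means that your example kernel only does the right thing when a single block is doing the work. Early examples probably do this to ease you into the ideas, but in terms of performance it is generally a very bad thing to do:

With one block, you can only utilize one of the many streaming multiprocessors (SMs) in a GPU, and even that SM can only hide memory access latency when it has enough parallel work to do while waiting.

A single thread block also limits the number of threads, and therefore the problem size, unless your kernel contains a loop so that every thread can compute more than one element.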
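To make the point about the question's kernel concrete, here is a minimal CPU-side sketch in plain Python (not PyCUDA) that mimics what `int i = threadIdx.x;` does under different launch shapes. The function name `run_original_kernel` and the `None` marker for untouched elements are assumptions made for illustration. On a real GPU the out-of-range accesses would be undefined behaviour rather than a tidy list, and current NVIDIA GPUs cap a block at 1024 threads.

```python
def run_original_kernel(vec, scalar, grid_dim, block_dim):
    """CPU model of the question's kernel, which indexes with threadIdx.x only."""
    outvec = [None] * len(vec)   # None marks an element that was never written
    out_of_bounds = []           # indices a real GPU would access illegally
    for block_idx in range(grid_dim):        # every block repeats the same IDs
        for thread_idx in range(block_dim):
            i = thread_idx                   # int i = threadIdx.x;
            if i < len(vec):
                outvec[i] = scalar * vec[i]
            else:
                out_of_bounds.append(i)
    return outvec, out_of_bounds


vec = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# One block with as many threads as elements: every element is computed.
one_block, _ = run_original_kernel(vec, 2.0, grid_dim=1, block_dim=8)
print(one_block)   # [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# Two blocks of four threads: both blocks write elements 0..3, the rest is untouched.
two_blocks, _ = run_original_kernel(vec, 2.0, grid_dim=2, block_dim=4)
print(two_blocks)  # [2.0, 4.0, 6.0, 8.0, None, None, None, None]
```

So the index is not automatically "between 0 and N": it runs from 0 to blockDim.x - 1, whatever the array size happens to be.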

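Kernel execution is organized hierarchically. For simplicity, let's restrict ourselves to one-dimensional indexing: a kernel is executed on a so-called grid of gridDim.x thread blocks, each containing blockDim.x threads that are numbered within their block by threadIdx.x; each block is in turn numbered by blockIdx.x.

To get a unique ID for a thread (in a way that lets the hardware load elements from an array ideally), you have to take blockIdx.x * blockDim.x + threadIdx.x. If every thread is to compute more than one element, you use a loop of the form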
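The one-dimensional hierarchy described above can be sketched on the CPU as two nested loops. This is only a model: the generator name `launch_coordinates` is hypothetical, and a real GPU gives no guarantee about the order in which blocks run.

```python
def launch_coordinates(grid_dim, block_dim):
    """Yield (blockIdx.x, threadIdx.x) for every thread of a 1D launch."""
    for block_idx in range(grid_dim):          # 0 .. gridDim.x - 1
        for thread_idx in range(block_dim):    # 0 .. blockDim.x - 1, restarts per block
            yield block_idx, thread_idx


coords = list(launch_coordinates(grid_dim=2, block_dim=3))
print(coords)       # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(len(coords))  # 6, i.e. gridDim.x * blockDim.x threads in total
```

The second component restarts at 0 in each block, which is why threadIdx.x on its own cannot identify a thread across the whole grid.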

for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < InputSize; i += gridDim.x * blockDim.x) {
    /* ... */
}

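This is called a grid-stride loop, because gridDim.x * blockDim.x is the total number of threads working on the kernel. Different strides (especially letting each thread work on consecutive elements: stride = 1) might work, but will be much slower due to the non-ideal memory access pattern.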
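As a rough sketch of how the question's kernel might look with a grid-stride loop, the block below holds an adapted kernel source as a string (it is not compiled here) together with a plain-Python model of the same index arithmetic. The extra `n` parameter, the `const` qualifier and the helper names `grid_stride_indices` and `run_grid_stride` are assumptions added for illustration, not part of the original code.

```python
# Adapted kernel source (assumption: an extra int n carries the array length).
KERNEL_SRC = r"""
__global__ void scalar_multiply_kernel(float *outvec, float scalar,
                                       const float *vec, int n)
{
    // Start at the global thread ID and step by the total number of threads.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        outvec[i] = scalar * vec[i];
    }
}
"""


def grid_stride_indices(block_idx, thread_idx, grid_dim, block_dim, n):
    """Indices one thread visits in a 1D grid-stride loop over n elements."""
    start = block_idx * block_dim + thread_idx   # unique global thread ID
    stride = grid_dim * block_dim                # total number of threads
    return list(range(start, n, stride))


def run_grid_stride(vec, scalar, grid_dim, block_dim):
    """CPU model of the kernel above; also counts how often each element is written."""
    n = len(vec)
    outvec = [None] * n
    writes = [0] * n
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            for i in grid_stride_indices(block_idx, thread_idx, grid_dim, block_dim, n):
                outvec[i] = scalar * vec[i]
                writes[i] += 1
    return outvec, writes


# Ten elements, but only 2 blocks * 3 threads = 6 threads.
print(grid_stride_indices(0, 0, 2, 3, 10))  # [0, 6]
print(grid_stride_indices(1, 2, 2, 3, 10))  # [5]
out, writes = run_grid_stride([float(x) for x in range(10)], 2.0, 2, 3)
print(writes)                               # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

In this model every element is written exactly once whatever the launch shape, so changing the array size no longer requires changing the kernel. If the string were compiled with PyCUDA, the scalar arguments would be passed as NumPy scalars such as `numpy.float32` and `numpy.int32`.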
