
Why is GPU computation slower than the CPU..?

Date: 2014-04-23

Source: Internet

Code:

#include <amp.h>
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <iterator>
#include <random>
using namespace concurrency;
using namespace std;
using namespace std::chrono;

// print/show helpers for the poster's custom multiarraycase container
// (the container itself is not included in the post)
template <typename type, size_t Rank, typename Allocator>
void print(const multiarraycase<type, Rank, Allocator> &a, ostream &output)
{
    // "typename" is required here because base_array is a dependent type
    typedef typename multiarraycase<type, Rank, Allocator>::base_array base_array;
    __if_exists(base_array::type)   // MSVC extension: taken for nested (multi-dimensional) arrays
    {
        for (size_t i = 0; i < a.size(0); ++i)
            print(a[i], output);
        output << "\n";
    }
    __if_not_exists(base_array::type)   // leaf level: print each value rounded to 2 decimal places
    {
        std::transform(a.begin(), a.end(), std::ostream_iterator<type>(output, "\t"),
            [](type value)
            {
                return (type)round(value * 100) / 100;
            });
        output << "\n";
    }
}

template <typename type, size_t Rank, typename Allocator>
void show(const multiarraycase<type, Rank, Allocator> &a)
{
    print(a, std::cout);
}

int main()   // main must return int, not void
{
    default_random_engine generator;
    auto GetRandom = bind(uniform_int_distribution<int>(1, 9), std::ref(generator));

    high_resolution_clock::time_point record;
    duration<double> time_diff;

    size_t N = 10000000;
    multiarray<float> a(N), b(N), c(N);   // poster's custom host-side container

    // array_views wrap the host buffers so the AMP runtime can move them to the GPU
    array_view<const float, 1> _a((int)a.size(), a.data());
    array_view<const float, 1> _b((int)b.size(), b.data());
    array_view<float, 1> _c((int)c.size(), c.data());

    while (true)
    {
        system("cls");

        for (size_t i = 0; i < N; ++i)
        {
            a[i] = GetRandom();
            b[i] = GetRandom();
        }

        record = high_resolution_clock::now();
        _a.refresh();   // tell the runtime the underlying host data has changed
        _b.refresh();
        parallel_for_each(
            // Define the compute domain, which is the set of threads that are created.
            _c.extent,
            // Define the code to run on each thread on the accelerator.
            [=](index<1> idx) restrict(amp)
            {
                _c[idx] = _a[idx] + _b[idx];
            }
        );
        _c.synchronize();   // copy the result back to the host buffer
        time_diff = duration_cast<duration<double>>(high_resolution_clock::now() - record);
        cout << "GPU: " << time_diff.count() << endl;

        record = high_resolution_clock::now();
        for (size_t i = 0; i < N; ++i)
        {
            c[i] = a[i] + b[i];
        }
        time_diff = duration_cast<duration<double>>(high_resolution_clock::now() - record);
        cout << "CPU: " << time_diff.count() << endl;

        cout << endl;

        system("pause");
    }
}
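A note on the measurement itself: the first parallel_for_each in a process also pays for C++ AMP runtime initialization and for JIT-compiling the kernel, so the first GPU timing is inflated. A minimal warm-up sketch (not in the original post) that could run once before the timed loop, reusing the same _a/_b/_c array_views:

Code:

// Warm-up: run the kernel once, untimed, so runtime startup and shader
// JIT compilation are not charged to the first GPU measurement.
parallel_for_each(_c.extent, [=](index<1> idx) restrict(amp)
{
    _c[idx] = _a[idx] + _b[idx];
});
_c.synchronize();   // block until the warm-up actually finishes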

Author: Susan﹏汪汪   Posted: 2014-04-23

This doesn't look like OpenCL, and it isn't CUDA either, so why do you think the GPU is being used at all?

Author: 烟民比食屎9更贱   Posted: 2014-04-23

Quote: Originally posted by 烟民比食屎9更贱 on 2014-4-16 12:05 AM
This doesn't look like OpenCL, and it isn't CUDA either, so why do you think the GPU is being used at all?
It's Visual Studio's C++ AMP feature.

Author: Susan﹏汪汪   Posted: 2014-04-23

Quote: Originally posted by Susan﹏汪汪 on 2014-4-16 12:06 AM

It's Visual Studio's C++ AMP feature.

This AMP library makes me rather doubt its performance; hardly anyone seems to talk about it. Also, does your hardware actually meet all of its requirements? Have you used GPU-Z to check whether the GPU load goes up?

C++ Accelerated Massive Parallelism (C++ AMP) is a library implemented on DirectX 11 and an open specification from Microsoft for implementing data parallelism directly in C++. It is intended to make programming GPUs easy for the developer by supporting a range of expertise from none (in which case the system does its best) to being more finely controllable, but still portable. Code that cannot be run on GPUs will fall back onto one or more CPUs instead and use SSE instructions. The Microsoft implementation is included in Visual Studio 2012, including debugger and profiler support. Support for other platforms and hardware may become available from Microsoft or other compiler or hardware vendors.
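One way to settle the "is it really using the GPU" question without GPU-Z: ask the C++ AMP runtime which accelerator it picked and whether that accelerator is the software emulator. A minimal check (a sketch using the documented concurrency::accelerator API):

Code:

#include <amp.h>
#include <iostream>

int main()
{
    using namespace concurrency;

    accelerator acc;   // the default accelerator the AMP runtime will use
    std::wcout << L"default: " << acc.description
               << (acc.is_emulated ? L" (emulated)" : L" (hardware)") << std::endl;

    // List every accelerator the runtime can see.
    for (const accelerator &a : accelerator::get_all())
        std::wcout << a.description
                   << (a.is_emulated ? L" (emulated)" : L" (hardware)") << std::endl;
    return 0;
}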

Author: 烟民比食屎9更贱   Posted: 2014-04-23

Quote: Originally posted by 烟民比食屎9更贱 on 2014-4-16 03:05 AM
This AMP library makes me rather doubt its performance; hardly anyone seems to talk about it. Also, does your hardware actually meet all of its requirements? Have you used GPU-Z to check whether the GPU load goes up?

C++ Accelerated Massive Parallelism (C++ AMP) is a library implemented on ...
VS can show the GPU call stack directly.

Author: Susan﹏汪汪   Posted: 2014-04-23

Data dependency?

Author: a8d7e8   Posted: 2014-04-23

Quote: Originally posted by a8d7e8 on 2014-4-16 01:33 PM
Data dependency?
I don't understand what you're asking.

Author: Susan﹏汪汪   Posted: 2014-04-23

Forget it, my fault.
Quote: Originally posted by Susan﹏汪汪 on 2014-4-16 13:43

I don't understand what you're asking.



Author: a8d7e8   Posted: 2014-04-23

Quote: Originally posted by Susan﹏汪汪 on 2014-4-16 04:06 AM

VS can show the GPU call stack directly.

I suspect that's simulated. Have you actually looked at the real-time GPU load?

Author: 烟民比食屎9更贱   Posted: 2014-04-23

Quote: Originally posted by 烟民比食屎9更贱 on 2014-4-16 10:53 PM
I suspect that's simulated. Have you actually looked at the real-time GPU load?
It can't be simulated... simulating the GPU call stack would be an idiotic design for a C++ compiler.

Author: Susan﹏汪汪   Posted: 2014-04-23

Doesn't anyone have a reasonable explanation?
My guess is that it comes down to data copying.

Looking at the computation in #1:
Code:
c[i] = a[i] + b[i]
The GPU should be able to chew through a large block of data in just one or two compute passes... call it O(1).
If the CPU does the computation, it needs O(N) time.

But the problem is... the data must first be copied into dedicated GPU memory before any computation can run.
Code:
_a.refresh();
_b.refresh();

_c.synchronize();
It took a lot of testing to discover that when computing on the GPU, the three functions above must be called to update the data in dedicated GPU memory.

But this dedicated GPU memory has to be updated by the CPU,
so the updates alone cost O(3N) time.

And the measurement matches: the GPU version takes fully three times as long as the CPU version.
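To confirm that the copies dominate, the transfer and kernel phases can be timed separately. A hedged sketch (not from the original post) using concurrency::array, which makes the host-to-GPU and GPU-to-host copies explicit; it assumes the host data lives in plain std::vector<float>s:

Code:

#include <amp.h>
#include <chrono>
#include <iostream>
#include <vector>

// Sketch: time copy-in, kernel, and copy-back separately.
void timed_add(const std::vector<float> &a, const std::vector<float> &b,
               std::vector<float> &c)
{
    using namespace concurrency;
    using namespace std::chrono;
    const int N = (int)a.size();

    auto t0 = high_resolution_clock::now();
    array<float, 1> ga(N, a.begin()), gb(N, b.begin());   // host -> GPU copies
    array<float, 1> gc(N);
    auto t1 = high_resolution_clock::now();

    parallel_for_each(gc.extent, [&](index<1> idx) restrict(amp)
    {
        gc[idx] = ga[idx] + gb[idx];
    });
    gc.accelerator_view.wait();                            // wait for the kernel to finish
    auto t2 = high_resolution_clock::now();

    copy(gc, c.begin());                                   // GPU -> host copy
    auto t3 = high_resolution_clock::now();

    std::cout << "copy-in:   " << duration<double>(t1 - t0).count() << "\n"
              << "kernel:    " << duration<double>(t2 - t1).count() << "\n"
              << "copy-back: " << duration<double>(t3 - t2).count() << "\n";
}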


So,
since the GPU has such a large startup overhead... it's better to push the entire computation into the GPU in one go up front.
So I ran a bolder test:

CPU computing the convolution of a large block of data with a fast algorithm... time: 0.04
GPU computing the convolution by the direct method... time: 0.07
The gap closes a great deal... which shows the GPU's raw compute speed really is faster than the CPU's.

The catch is... designing a fast FFT or convolution algorithm for the GPU is already a real hassle:
First... the GPU only accepts simple data types (the standard C++ complex type can still be used... but you have to rewrite the GPU-specific functions yourself).
Second... the GPU does not support recursion (because every GPU function must be invoked inline); see the direct-convolution sketch below.
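For reference, the direct GPU convolution described above might look roughly like the following C++ AMP sketch (hypothetical, not the poster's actual test code). The inner loop is plain iteration, since restrict(amp) code cannot recurse:

Code:

#include <amp.h>
#include <vector>

// Sketch: direct 1-D convolution on the GPU (valid region only).
// out[i] = sum_j in[i + j] * k[j], with out.size() == in.size() - k.size() + 1.
void convolve_direct(const std::vector<float> &in, const std::vector<float> &k,
                     std::vector<float> &out)
{
    using namespace concurrency;
    const int n = (int)out.size();
    const int m = (int)k.size();

    array_view<const float, 1> gin((int)in.size(), in);
    array_view<const float, 1> gk(m, k);
    array_view<float, 1> gout(n, out);
    gout.discard_data();   // output is write-only; skip the host -> GPU copy

    parallel_for_each(gout.extent, [=](index<1> idx) restrict(amp)
    {
        float acc = 0.0f;
        for (int j = 0; j < m; ++j)   // plain loop: recursion is not allowed in restrict(amp) code
            acc += gin[idx[0] + j] * gk[j];
        gout[idx] = acc;
    });
    gout.synchronize();    // copy the result back into out
}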

Author: Susan﹏汪汪   Posted: 2014-04-23