Why is GPU computation slower than the CPU..?
Date: 2014-04-23
#include <amp.h>        // C++ AMP: array_view, parallel_for_each, index
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <iterator>
#include <random>
using namespace concurrency;
using namespace std;
using namespace std::chrono;

// multiarray / multiarraycase are the poster's own container types.
template <typename type, size_t Rank, typename Allocator>
void print(const multiarraycase<type, Rank, Allocator> &a, ostream &output)
{
    typedef typename multiarraycase<type, Rank, Allocator>::base_array base_array;
    // MSVC-specific __if_exists: recurse while the element is itself an array.
    __if_exists(base_array::type)
    {
        for (size_t i = 0; i < a.size(0); ++i)
            print(a[i], output);
        output << "\n";
    }
    __if_not_exists(base_array::type)
    {
        // Leaf level: print each value rounded to two decimal places.
        std::transform(a.begin(), a.end(), std::ostream_iterator<type>(output, "\t"),
            [](type value)
            {
                return (type)round(value * 100) / 100;
            });
        output << "\n";
    }
}
template <typename type, size_t Rank, typename Allocator>
void show(const multiarraycase<type, Rank, Allocator> &a)
{
    print(a, std::cout);
}
int main()
{
    default_random_engine generator;
    auto GetRandom = bind(uniform_int_distribution<int>(1, 9), std::ref(generator));
    high_resolution_clock::time_point record;
    duration<double> time_diff;
    size_t N = 10000000;
    multiarray<float> a(N), b(N), c(N);
    // Wrap the host buffers in array_views so the accelerator can access them.
    array_view<const float, 1> _a(a.size(), a.data());
    array_view<const float, 1> _b(b.size(), b.data());
    array_view<float, 1> _c(c.size(), c.data());
    while (true)
    {
        system("cls");
        for (size_t i = 0; i < N; ++i)
        {
            a[i] = GetRandom();
            b[i] = GetRandom();
        }
        record = high_resolution_clock::now();
        // Tell the array_views that the host data changed, so it is re-copied to the GPU.
        _a.refresh();
        _b.refresh();
        parallel_for_each(
            // Define the compute domain, which is the set of threads that are created.
            _c.extent,
            // Define the code to run on each thread on the accelerator.
            [=](index<1> idx) restrict(amp)
            {
                _c[idx] = _a[idx] + _b[idx];
            }
        );
        // Copy the result back to host memory (blocks until the kernel is done).
        _c.synchronize();
        time_diff = duration_cast<duration<double>>(high_resolution_clock::now() - record);
        cout << "GPU: " << time_diff.count() << endl;
        record = high_resolution_clock::now();
        for (size_t i = 0; i < N; ++i)
        {
            c[i] = a[i] + b[i];
        }
        time_diff = duration_cast<duration<double>>(high_resolution_clock::now() - record);
        cout << "CPU: " << time_diff.count() << endl;
        cout << endl;
        system("pause");
    }
}
Author: Susan﹏汪汪 Posted: 2014-04-23
Author: 烟民比食屎9更贱 Posted: 2014-04-23
This doesn't look like OpenCL, and it isn't CUDA either, so what makes you think it's using the GPU?
Author: Susan﹏汪汪 Posted: 2014-04-23
It's the C++ AMP feature in Visual Studio.
C++ Accelerated Massive Parallelism (C++ AMP) is a library implemented on DirectX 11 and an open specification from Microsoft for implementing data parallelism directly in C++. It is intended to make programming GPUs easy for the developer by supporting a range of expertise from none (in which case the system does its best) to being more finely controllable, but still portable. Code that cannot be run on GPUs will fall back onto one or more CPUs instead and use SSE instructions. The Microsoft implementation is included in Visual Studio 2012, including debugger and profiler support. Support for other platforms and hardware may become available from Microsoft or other compiler or hardware vendors.
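The CPU fallback mentioned there is the Direct3D WARP software device. For reference, C++ AMP also lets you pick the accelerator explicitly instead of relying on the default; a minimal sketch of my own, using only documented AMP calls:

#include <amp.h>
#include <vector>

int main()
{
    using namespace concurrency;

    // direct3d_warp is the CPU/SSE software fallback mentioned above;
    // direct3d_ref is the much slower reference emulator.
    accelerator warp(accelerator::direct3d_warp);

    std::vector<float> v(1024);
    array_view<float, 1> data(int(v.size()), v);

    // Run the kernel on WARP's default view rather than the default accelerator.
    parallel_for_each(warp.default_view, data.extent, [=](index<1> idx) restrict(amp)
    {
        data[idx] = idx[0] * 2.0f;   // trivial kernel, just to exercise the device
    });
    data.synchronize();              // copy results back into v
}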
Author: 烟民比食屎9更贱 Posted: 2014-04-23
This AMP lib makes me rather doubt its performance; hardly anyone seems to talk about it. Also, does your hardware actually meet all its requirements? Did you use GPU-Z to check whether the GPU load goes up?
Author: Susan﹏汪汪 Posted: 2014-04-23
Author: a8d7e8 Posted: 2014-04-23
Data dependency?

Author: Susan﹏汪汪 Posted: 2014-04-23
Not sure what you're asking.

Author: a8d7e8 Posted: 2014-04-23
VS can show the GPU stack directly.
Author: 烟民比食屎9更贱 Posted: 2014-04-23
I suspect that's being emulated. Have you looked at the actual real-time GPU load first?
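A quick way to settle that from code, besides GPU-Z, is to enumerate the accelerators C++ AMP sees and check the is_emulated flag; a small sketch using only documented AMP calls:

#include <amp.h>
#include <iostream>

int main()
{
    using namespace concurrency;

    // List every accelerator; is_emulated marks software devices
    // (WARP, the reference rasterizer, the plain CPU fallback).
    for (const accelerator &acc : accelerator::get_all())
    {
        std::wcout << acc.description
                   << L" | emulated: " << (acc.is_emulated ? L"yes" : L"no")
                   << L" | path: " << acc.device_path << std::endl;
    }

    // This is what parallel_for_each uses unless told otherwise.
    std::wcout << L"default: " << accelerator().description << std::endl;
}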
Author: Susan﹏汪汪 Posted: 2014-04-23

汪汪's guess is that it's the data-copy problem.
Looking at the computation in #1:
if the CPU does it... it takes O(N) time.
But the problem is... the data must first be copied into the GPU's dedicated memory before the computation can run, i.e.
_a.refresh();
_b.refresh();
and
_c.synchronize();
And this GPU-dedicated memory has to be updated by the CPU,
so the updates alone already take O(3N) time.
The measurements match: the GPU computation time is fully three times the CPU's.
So,
since the GPU's startup overhead is so large... you might as well shove the whole computation into the GPU in one go up front.
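One way to test that is to keep the operands resident on the accelerator with concurrency::array, so the copy-in happens once, outside the timed region. A rough sketch of the idea (the names and the constant fills are mine, not the code from #1):

#include <amp.h>
#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    using namespace concurrency;
    const int N = 10000000;

    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);

    // One-off copy into accelerator-resident memory (outside the timed region).
    array<float, 1> ga(N, a.begin());
    array<float, 1> gb(N, b.begin());
    array<float, 1> gc(N);

    auto t0 = std::chrono::high_resolution_clock::now();
    parallel_for_each(gc.extent, [&ga, &gb, &gc](index<1> idx) restrict(amp)
    {
        gc[idx] = ga[idx] + gb[idx];
    });
    gc.accelerator_view.wait();   // parallel_for_each is asynchronous; wait for the kernel
    auto t1 = std::chrono::high_resolution_clock::now();

    std::cout << "kernel only: "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";

    copy(gc, c.begin());          // copy-back, which can be timed separately
}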
汪汪 then ran a bolder test:
the CPU computes the convolution of a large block of data with a fast algorithm... time: 0.04
while the GPU computes the convolution by the direct method... time: 0.07
The gap narrows a lot... which shows the GPU's raw arithmetic really is faster than the CPU's...
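For reference, the direct method here presumably means one GPU thread per output sample, O(N·K) in total; a hypothetical sketch of such a kernel (the names and the padding convention are my own, not 汪汪's code):

#include <amp.h>
#include <vector>

// Direct 1-D convolution: out[i] = sum over k of in[i + k] * w[k].
// Assumes 'in' carries K-1 extra padding samples at the end.
void convolve_direct(const std::vector<float> &in,
                     const std::vector<float> &w,
                     std::vector<float> &out)
{
    using namespace concurrency;
    const int K = int(w.size());

    array_view<const float, 1> vin(int(in.size()), in);
    array_view<const float, 1> vw(K, w);
    array_view<float, 1> vout(int(out.size()), out);
    vout.discard_data();   // output is fully overwritten, so skip its copy-in

    parallel_for_each(vout.extent, [=](index<1> i) restrict(amp)
    {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += vin[i[0] + k] * vw[k];
        vout[i] = acc;
    });
    vout.synchronize();
}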
The catch is... designing a fast FFT or convolution algorithm for the GPU is already a real hassle:
first... the GPU only accepts simple data types (the C++ standard complex type can still be used... but you have to rewrite GPU-specific functions for it yourself; see the sketch after this list),
second... the GPU doesn't support recursion (because every GPU function must be invoked inline).
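On the first point, the usual workaround is a plain struct of floats with its operations marked restrict(amp); a minimal hypothetical sketch (cfloat and cmul are my names, nothing standard):

// std::complex<float>'s member functions are not amp-callable, so a
// hand-rolled value type is used inside restrict(amp) code instead.
struct cfloat
{
    float re, im;
};

// Complex multiply, callable from both CPU and GPU code.
inline cfloat cmul(cfloat a, cfloat b) restrict(amp, cpu)
{
    cfloat r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}

On the second point, the standard dodge is an iterative, bit-reversal FFT, which needs no recursion at all.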
Author: Susan﹏汪汪 Posted: 2014-04-23