I've used Julia for some time, but I still know little about it, especially about parallel computing. I want to build a new array from existing data. Since the array is very large, I want to do it in parallel, and my code is as follows:
ψ1 = Array{Complex}(undef, D)
ψ = rand(Complex{Float64},D)
Threads.@threads for k = 1:D
ψ1[k] = @views sum(ψ[GetBasis(k - 1, N)])
end
I run it with julia -t 4, but it turns out to be very slow compared to the following non-parallel code:
ψ1 = [@views sum(ψ[GetBasis(k - 1, N)]) for k=1:D]
I have no idea why this happens. GetBasis() is just a function that generates an Array{Int64,1} of length N.
I would like to ask how I could improve the first code, or whether there is some way to modify the second code so it also runs in parallel. The array can be very large, and I want to find a way to speed this up...
Thanks a lot, and I look forward to your replies!
The complete code is as follows:
function GetBasis(n, N)
xxN = collect(0:N-1)
BI = BitFlip.(n,xxN).+1
end
function BitFlip(n, i)
n = n ⊻ (1 << i)
end
N=24
D=2^N
ψ1 = Array{Complex}(undef, D)
ψ = rand(Complex{Float64},D)
Threads.@threads for k = 1:D
ψ1[k] = @views sum(ψ[GetBasis(k - 1, N)])
end
ψ2 = [@views sum(ψ[GetBasis(k - 1, N)]) for k=1:D]
When everything is done properly, the multithreaded code is indeed faster.
function GetBasis(n, N)
BI = BitFlip.(n,0:N-1).+1
end
function BitFlip(n, i)
n = n ⊻ (1 << i)
end
const N=24
const D=2^N
const ψ = rand(Complex{Float64},D)
const ψ1 = Vector{Complex{Float64}}(undef, D)
using BenchmarkTools
Threads.nthreads() # should return 4 or more
# set JULIA_NUM_THREADS environment variable
Now testing:
julia> GC.gc()
julia> @btime ψ2 = [@views sum(ψ[GetBasis(k - 1, N)]) for k=1:D];
5.591 s (16777218 allocations: 4.50 GiB)
julia> GC.gc()
julia> @btime Threads.@threads for k = 1:D
@inbounds ψ1[k] = @views sum(ψ[GetBasis(k - 1, N)])
end
2.293 s (16777237 allocations: 4.25 GiB)
Note the amount of memory allocated by this code: you need to run the garbage collector before each benchmark, and the results will be less meaningful if your machine has less than 16 GB of RAM.
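The benchmarks above run in global scope, which is why the consts matter. As a sketch (repeating the BitFlip/GetBasis definitions so the snippet is self-contained), the same threaded loop can be wrapped in a function; this makes it type stable without any const globals:

```julia
# Bit-flip helpers, as in the original code
BitFlip(n, i) = n ⊻ (1 << i)
GetBasis(n, N) = BitFlip.(n, 0:N-1) .+ 1

# Wrapping the threaded loop in a function gives type stability
# without needing `const` globals.
function threaded_sum(ψ, N)
    D = length(ψ)
    ψ1 = Vector{Complex{Float64}}(undef, D)
    Threads.@threads for k in 1:D
        @inbounds ψ1[k] = @views sum(ψ[GetBasis(k - 1, N)])
    end
    return ψ1
end
```

Called as threaded_sum(ψ, N), this returns the same result as the comprehension version, and @btime threaded_sum(ψ, N) can then be benchmarked without the globals.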
Thank you very much! The performance is much better now! May I ask why I should define ψ and ψ1 as const? Also, I tried studying N=26, and in that case the parallel code is again slower than the non-parallel one. I'm not sure why; my machine has 16 GB of RAM.
For N=26 you are running into garbage collection. You would need 64 GB of RAM to test the performance for N=26.
Regarding const: if you are benchmarking code that lives outside of a function, you need to make it type stable. When you put that code inside a function, you can remove the consts. See docs.julialang.org/en/v1/manual/performance-tips for details.
Then, does it mean that, in case the RAM of my machine is not enough, the parallel code may be slower than the non-parallel one? If so, is there some way to improve it? I also want to run it on a GPU, but I could not find a function similar to @threads in Julia's CUDA package...
For GPU computing, look at juliagpu.github.io/CUDA.jl/stable/usage/array. Again, you will run into memory problems. "The parallel code may be slower than the non-parallel one": it depends on how the garbage collector behaves. You would need to benchmark, but in that case the benchmarks would need to be run many hundreds of times, due to the high variance caused by the garbage collector.
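One way to reduce the garbage-collection pressure itself (a sketch, not benchmarked here) is to avoid allocating the index vector at all: accumulate the N bit-flipped entries directly inside the loop, which computes the same thing as indexing with GetBasis but without materializing a temporary Array{Int64,1} on every iteration:

```julia
# Allocation-free variant: instead of building the index vector from
# GetBasis and indexing with it, sum the N bit-flipped entries directly.
# This removes the per-iteration Vector{Int64} allocation, so the
# garbage collector has far less work to do.
function threaded_sum_noalloc(ψ, N)
    D = length(ψ)
    ψ1 = Vector{Complex{Float64}}(undef, D)
    Threads.@threads for k in 1:D
        n = k - 1
        s = zero(Complex{Float64})
        for i in 0:N-1
            # (n ⊻ (1 << i)) + 1 is exactly the index GetBasis produces
            @inbounds s += ψ[(n ⊻ (1 << i)) + 1]
        end
        @inbounds ψ1[k] = s
    end
    return ψ1
end
```

On the GPU side, the CUDA.jl array interface linked above should support indexing a CuArray with an array of indices (a "gather"), so a similar per-bit accumulation can be expressed with broadcasts over CuArrays, though the same memory limits apply there as well.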