Question: Golang: why using goroutines to parallelize calls ends up being slower?

Question

Golang: why using goroutines to parallelize calls ends up being slower?

Answers 2
Added at 2017-01-01 20:01
Tags
Question

I have two versions of merge sort implementation. The first is a "normal" version and the second uses goroutines that parallelize the work being done on each subset of the slice in each step of the recursion.

One would assume that being able to parallelize this work would make the concurrent implementation faster: if I need to work on slice A and slice B, then working on them concurrently should be faster than doing so synchronously.

Now I'm assuming something is wrong with either my implementation of my understanding, because my concurrent version ends up being 13-14x slower than the sync version.

Can anyone point me in the right direction as to what I'm missing?

"Normal" (synchronous implementation):

// MergeSort sorts the slice s using Merge Sort Algorithm
func MergeSort(s []int) []int {
    if len(s) <= 1 {
        return s
    }

    n := len(s) / 2

    var l []int
    var r []int

    l = MergeSort(s[:n])
    r = MergeSort(s[n:])

    return merge(l, r)
}

"Concurrent" version:

// MergeSortMulti sorts the slice s using Merge Sort Algorithm
func MergeSortMulti(s []int) []int {
    if len(s) <= 1 {
        return s
    }

    n := len(s) / 2

    wg := sync.WaitGroup{}
    wg.Add(2)

    var l []int
    var r []int

    go func() {
        l = MergeSortMulti(s[:n])
        wg.Done()
    }()

    go func() {
        r = MergeSortMulti(s[n:])
        wg.Done()
    }()

    wg.Wait()
    return merge(l, r)
}

Both use the same merge function:

func merge(l, r []int) []int {
    ret := make([]int, 0, len(l)+len(r))
    for len(l) > 0 || len(r) > 0 {
        if len(l) == 0 {
            return append(ret, r...)
        }
        if len(r) == 0 {
            return append(ret, l...)
        }
        if l[0] <= r[0] {
            ret = append(ret, l[0])
            l = l[1:]
        } else {
            ret = append(ret, r[0])
            r = r[1:]
        }
    }
    return ret
}

This is my benchmarking code:

package msort

import "testing"

var a []int

func init() {
    for i := 0; i < 1000000; i++ {
        a = append(a, i)
    }
}
func BenchmarkMergeSortMulti(b *testing.B) {
    for n := 0; n < b.N; n++ {
        MergeSortMulti(a)
    }
}

func BenchmarkMergeSort(b *testing.B) {
    for n := 0; n < b.N; n++ {
        MergeSort(a)
    }
}

It reveals that the concurrent version is a lot slower than the normal synchronous version:

BenchmarkMergeSortMulti-8              1    1711428093 ns/op
BenchmarkMergeSort-8                  10     131232885 ns/op
Answers to

Golang: why using goroutines to parallelize calls ends up being slower?

nr: #1 dodano: 2017-01-01 21:01

This is because you spawn tons of goroutines which preempt when calling wg.Wait(). Scheduler has no idea which one to pick, it can pick randomly blocked ones till it meets one that finally can return and unblock another one. When I limited number of concurrent calls to MergeSortMulti it became roughly 3x faster than synchronous version.

This code isn't beautiful, but it is a proof.

// MergeSortMulti sorts the slice s using Merge Sort Algorithm
func MergeSortMulti(s []int) []int {
    if len(s) <= 1 {
        return s
    }

    n := len(s) / 2

    wg := sync.WaitGroup{}
    wg.Add(2)

    var l []int
    var r []int

    const N = len(s)
    const FACTOR = 8  // ugly but easy way to limit number of goroutines

    go func() {
        if n < N/FACTOR {
            l = MergeSort(s[:n])
        } else {
            l = MergeSortMulti(s[:n])
        }
        wg.Done()
    }()

    go func() {
        if n < N/FACTOR {
            r = MergeSort(s[n:])
        } else {
            r = MergeSortMulti(s[n:])
        }
        wg.Done()
    }()

    wg.Wait()
    return merge(l, r)
}

Results will be different on your machine, but:

FACTOR = 4:

BenchmarkMergeSortMulti-8             50          33268370 ns/op
BenchmarkMergeSort-8                  20          91479573 ns/op

FACTOR = 10000

BenchmarkMergeSortMulti-8             20          84822824 ns/op
BenchmarkMergeSort-8                  20         103068609 ns/op

FACTOR = N/4

BenchmarkMergeSortMulti-8              3         352870855 ns/op
BenchmarkMergeSort-8                  10         129107177 ns/op

Bonus: You can do also use semaphore to limit the number of goroutines which is a bit slower on my machine (select is used to avoid dead-lock):

var sem = make(chan struct{}, 100)

// MergeSortMulti sorts the slice s using Merge Sort Algorithm
func MergeSortMulti(s []int) []int {
    if len(s) <= 1 {
        return s
    }

    n := len(s) / 2

    wg := sync.WaitGroup{}
    wg.Add(2)

    var l []int
    var r []int

    select {
    case sem <- struct{}{}:
        go func() {
            l = MergeSortMulti(s[:n])
            <-sem
            wg.Done()
        }()
    default:
        l = MergeSort(s[:n])
        wg.Done()
    }

    select {
    case sem <- struct{}{}:
        go func() {
            r = MergeSortMulti(s[n:])
            <-sem
            wg.Done()
        }()
    default:
        r = MergeSort(s[n:])
        wg.Done()
    }

    wg.Wait()
    return merge(l, r)
}

It yields:

BenchmarkMergeSortMulti-8             30          36741152 ns/op
BenchmarkMergeSort-8                  20          90843812 ns/op
nr: #2 dodano: 2017-01-04 13:01

Your assumption is not correct:

One would assume that being able to parallelize this work would make the concurrent implementation faster: if I need to work on slice A and slice B, then working on them concurrently should be faster than doing so synchronously.

All parallel software falls under Amdahl's Law (on Wikipedia), which I could paraphrase as 'the sequential setup is not free'.

This is of course especially so when only using a single CPU core. However, it still matters even with multiple cores, for which the choreography and distribution of units of work across cores needs to be thought through if high performance is the aim. Fortunately, kopiczko's answer provides some good tips in the specific case in question.

This has been a research topic for decades: see e.g. an old (but still relevant) summary of tricks of the trade in "Practical Parallel Processing: An Introduction to Problem Solving in Parallel" by Tidmus and Chalmers.

Source Show
◀ Wstecz