At work, part of a program I'm working on has to clean just over 200,000 records. Cleaning the records one at a time was taking way too long, so I decided to use some of the parallelization skills I picked up last summer to see if I could cut the run time of the cleaning step. I went with 4 threads to spread the work across our poor processor. Here are some of the results.
Cleaning 5,000 records:
Iterative Clean Average Time: 1.1864 seconds
4-Threads Clean Average Time: 1.3628 seconds
Cleaning 50,000 records:
Iterative Clean Average Time: 209.28 seconds
4-Threads Clean Average Time: 39.23 seconds
For the smaller number of records, the threaded clean is actually a bit slower, presumably because the overhead of setting up and coordinating the threads outweighs the savings. For the larger number of records, though, there was a serious drop in run time. I then wanted to measure just how much of a performance gain I achieved.
If you perform the following calculation:
(100) * (39.23 / 209.28)
That says that for 50,000 records the threaded cleaning procedure runs in about 18.75% of the original time, a speedup of roughly 5.3x. This percentage was a bit shocking: with 4 threads each taking on 25% of the work, I would expect the run time to drop to about 25% of the original at best. Perhaps I need to time more trials (I only did 3 for each), but either way, I think I’m going to go with the parallel cleaning procedure.
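For anyone who wants to redo the arithmetic, here is a minimal sketch in Python (picked purely for illustration, not necessarily the language the cleaning program is written in). The function and variable names are my own. The only real inputs are the two average times measured for the larger run, and the 25% figure assumes the work splits perfectly across the 4 threads.

```python
def percent_of_original(parallel_s: float, serial_s: float) -> float:
    """Parallel run time expressed as a percentage of the serial run time."""
    return 100.0 * parallel_s / serial_s


def speedup(serial_s: float, parallel_s: float) -> float:
    """How many times faster the parallel run is than the serial run."""
    return serial_s / parallel_s


# Average times measured for the larger run (seconds).
serial_s = 209.28    # iterative clean
parallel_s = 39.23   # 4-thread clean
threads = 4

pct = percent_of_original(parallel_s, serial_s)
factor = speedup(serial_s, parallel_s)

# Best case if the work divides perfectly across the threads (an assumption).
ideal_pct = 100.0 / threads

print(f"Threaded run takes {pct:.2f}% of the iterative time")
print(f"Speedup: {factor:.2f}x (ideal with {threads} threads: {threads}x)")
print(f"Ideal percentage with {threads} threads: {ideal_pct:.0f}%")
```

Since the measured percentage comes in below the ideal 25%, the speedup is better than 4x, which is exactly why it makes sense to rerun with more than 3 trials before reading too much into it.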