I wrote a parallel ray tracer in Haskell, building off a very old reference implementation. I rewrote it for current Haskell and GHC versions, parallelized it, and made a few other small improvements.

The only parallelizable part is:

    {- returns a color array representing the image -}
    getImage :: Int -> Resolution -> Scene -> [Color]
    getImage d r@(rx,ry) s =
      [ image (fromIntegral x, fromIntegral (-y)) | y <- [-(ry-1)..0], x <- [0..(rx-1)] ]
      where
        image = rayTrace d r s

Why? We are creating a very large number of rays, each of which has its own set of computations to do against the scene data. The computations for a single ray are just a few recursive operations, which are shallow relative to the number of rays. The recursion depth for a single ray is unlikely to be much more than 50, while there may be 1920*1080 rays to compute, for example.

First some benchmarks:
For timing I used the getCPUTime function. To keep the measurement accurate, I do not take the final reading until the ray tracing has fully executed and the output file has been written.
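The measurement harness might have looked roughly like this sketch; `renderAndWrite` and `picosToSecs` are hypothetical stand-ins I've named for illustration, not the real pipeline:

```haskell
import System.CPUTime (getCPUTime)
import Text.Printf (printf)

-- getCPUTime reports picoseconds of CPU time used by the program.
picosToSecs :: Integer -> Double
picosToSecs p = fromIntegral p / 1e12

main :: IO ()
main = do
    start <- getCPUTime
    renderAndWrite            -- hypothetical: the full trace plus the file write
    end   <- getCPUTime
    printf "CPU time: %.2f s\n" (picosToSecs (end - start))
  where
    -- stand-in workload for the real ray trace and image output
    renderAndWrite = writeFile "out.txt" (show (sum [1 .. 100000 :: Integer]))
```

One caveat with this approach: getCPUTime sums CPU time across all cores, so for parallel runs a wall-clock measure (as ThreadScope reports) is the fairer comparison.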

On a single core I used the following .trc files:
ex1.trc
ex2.trc
ex3.trc
ex4.trc
ex5.trc

On the sequential version of the code, the difference between the two cpuTime readings, in seconds, averaged over 3 trials for each:

1. 5.9 seconds
2. 3.2 seconds
3. 5.9 seconds
4. 12.2 seconds (lots of lighting interactions with the plane)
5. 8.4 seconds

Note that these times are without eventlogging enabled at compilation. With it turned on, the program takes significantly longer: ex4.trc takes about 19.5 seconds to run, and about 20 if the log file is actually written.

From here on, times come from ThreadScope's reporting. Here is the eventlog for ex4.trc, without any parallelism in the code: 20.25 seconds.

There can be huge variability in the running time of similar-looking scenes; it depends mostly on the resolution of the image and the arrangement of objects in the scene. A section of the image with lots of refraction and shadow effects will be more expensive to calculate, which could become a problem if the heaviest section of an image is forced onto a single CPU. None of my attempts try to systematically resolve this, but one strategy worked pretty well regardless.

The first thing I tried was a simple parList with rdeepseq on the list of rays, à la the lecture slides. To parallelize the code, I rewrote the list comprehension to use rdeepseq with the parList strategy. I chose rdeepseq since the rays are independent of each other, and they should just be evaluated immediately, since their positions in the list are needed for the file write.
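That first attempt might look like the following sketch. The type aliases and the `rayTrace` body are placeholder assumptions so the example stands alone; the point is the `using`/`parList rdeepseq` line, which requires `Color` to have an `NFData` instance (here `Double`, which does):

```haskell
import Control.Parallel.Strategies (parList, rdeepseq, using)

-- Stand-ins for the document's types and tracer (assumptions, not the real code):
type Resolution = (Int, Int)
type Scene      = ()
type Color      = Double

rayTrace :: Int -> Resolution -> Scene -> (Double, Double) -> Color
rayTrace _ _ _ (x, y) = x + y   -- placeholder for the real per-ray computation

getImage :: Int -> Resolution -> Scene -> [Color]
getImage d r@(rx, ry) s =
  [ image (fromIntegral x, fromIntegral (-y))
  | y <- [-(ry - 1) .. 0], x <- [0 .. rx - 1] ]
  `using` parList rdeepseq      -- one spark per pixel: far too fine-grained
  where
    image = rayTrace d r s
```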

rdeepseq results for ex4.trc, on 1 to 4 cores, in seconds:
19.04, 18.12, 16.55, 17.54

What happened? We got some very modest improvements at first, but then things started getting worse again. Looking at the ThreadScope output on 4 cores, we see good parallelization at the beginning of the run, for a bit over a second, and then everything gets forced onto a single core. About 1.9 million sparks are needlessly created and never evaluated.

What went wrong? A plain parList with rdeepseq is not managing our cores properly; everything gets tossed onto core 2. Using parList to evaluate every ray in parallel is a very bad idea: individual rays are cheap, comparable to the cost of a spark itself, so we are making the mistake of far too fine-grained parallelization.

To remedy this, I rewrote the parallel section to use parBuffer with rdeepseq instead. I split the list of rays into 512-element chunks and evaluated each of those chunks in parallel.

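A sketch of how that section might look. The 512-element chunking matches the text; the stand-in types, the hand-rolled `chunksOf`, and the parBuffer window of 64 are my assumptions:

```haskell
import Control.Parallel.Strategies (parBuffer, rdeepseq, using)

-- Stand-ins for the document's types and tracer (assumptions, not the real code):
type Resolution = (Int, Int)
type Scene      = ()
type Color      = Double

rayTrace :: Int -> Resolution -> Scene -> (Double, Double) -> Color
rayTrace _ _ _ (x, y) = x + y   -- placeholder for the real per-ray computation

-- Split a list into fixed-size chunks (like chunksOf from the `split` package).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)

getImage :: Int -> Resolution -> Scene -> [Color]
getImage d r@(rx, ry) s =
  -- one spark per 512-pixel chunk, with up to 64 chunks sparked ahead
  concat (chunksOf 512 pixels `using` parBuffer 64 rdeepseq)
  where
    image  = rayTrace d r s
    pixels = [ image (fromIntegral x, fromIntegral (-y))
             | y <- [-(ry - 1) .. 0], x <- [0 .. rx - 1] ]
```

Compared with parList, parBuffer also avoids forcing the whole spark list into existence at once, which matters when there are millions of pixels.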

This worked significantly better and got a nice speedup. Even in the single-thread case, it is an improvement over the old version. See the following times for ex4.trc, on 1 to 12 threads, in seconds:
12.76, 7.39, 5.4, 4.50, 4.26, 4.01, 4.09, 4.15, 3.92, 3.85, 3.75, 4.09

The efficiency measure from ThreadScope hovered around 86 to 88 percent for the first few cores, then dipped into the 70s. Surprisingly, the total time continued to decrease through the final 12-thread run, apart from a brief hiccup around the 6-8 thread mark. This may have to do with my computer having 6 physical cores, with hyperthreading providing 6 additional logical ones. It may also just be random chance in what else happened to be running on my PC.
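To sanity-check the thread counts against the hardware, GHC's runtime can report both; this small sketch prints the capability count (set with `+RTS -N`) next to the logical processor count, with a hypothetical `report` helper for formatting:

```haskell
import GHC.Conc (getNumCapabilities, getNumProcessors)

-- Format the capability/processor report (helper name is my own).
report :: Int -> Int -> String
report caps procs =
  "capabilities: " ++ show caps ++ " / processors: " ++ show procs

main :: IO ()
main = do
  caps  <- getNumCapabilities  -- HECs the RTS is using, set with +RTS -N<n>
  procs <- getNumProcessors    -- logical CPUs, including hyperthreads
  putStrLn (report caps procs)
```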

Overall we saw a nice, though not optimal, speedup as we added more cores once we parallelized properly using parBuffer. Further experimentation could be done on changing the chunk size, or on attempting to spread difficult rays around rather than dumping them on a single core. The actual tracing could also be improved using certain acceleration data structures.

Would these results still hold if the renderer were more complicated?
Generally yes, though everything would take much longer. I only implemented planes and spheres, which have very simple intersection calculations, and a very simple lighting model.
