
Programming Assignment III: OpenMP Programming

Parallel Programming by Prof. Yi-Ping You

Due date: 23:59, October 30, Thursday, 2025

The purpose of this assignment is to familiarize yourself with OpenMP programming.


Table of contents

  1. Programming Assignment III: OpenMP Programming
    1. 0. Using Workstation
    2. 1. Part 1: Parallelizing Conjugate Gradient Method with OpenMP
    3. Part 2: Parallelizing PageRank Algorithm with OpenMP
      1. 2.1 Background: Representing Graphs
      2. 2.2 Task 1: Implementing Page Rank
      3. 2.3 Task 2: Parallel Breadth-First Search (“Top Down”)
      4. 2.4 Task 3: “Bottom-Up” BFS
      5. 2.5 Task 4: Hybrid BFS
    4. 3. Requirements
      1. 3.1 Part 1 Requirements
      2. 3.2 Part 2 Requirements
    5. 4. Grading Policy
    6. 5. Evaluation Platform
    7. 6. Submission
    8. 7. References

We provide workstations dedicated to this course. You can access the workstations by ssh or use your own environment to complete the assignment.

Hostname                        IP
hpclogin[01-03].cs.nycu.edu.tw  140.113.17.[101-103]

Login example:

ssh <username>@hpclogin[01-03].cs.nycu.edu.tw

Additionally, we have configured the environment for you, which MUST be activated before running the assignment code using the following command in bash:

module load pp

You can leave the environment by typing module unload pp.

Please download the source code and unzip it:

wget https://nycu-sslab.github.io/PP-f25/assignments/HW3/HW3.zip
unzip HW3.zip -d HW3
cd HW3

0. Using Workstation

In this assignment, you’ll need to use run -c to allocate physical threads for running your multi-threaded program on the dedicated server. If you don’t specify the number of physical threads, only 1 physical thread will emulate all the threads you use, which may result in poor performance.

You can allocate up to 6 threads on the server. However, since resources are shared among all students, please allocate only as many threads as you need. Here is an example of how to run a program with 4 allocated physical threads:

run -c 4 -- ./cg

1. Part 1: Parallelizing Conjugate Gradient Method with OpenMP

The conjugate gradient method is an algorithm for numerically solving specific systems of linear equations. It is commonly used to solve partial differential equations and is also applied to certain optimization problems. For more information, you can refer to Wikipedia.

Please navigate to the part1 folder:

cd part1

You can build and run the CG program by:

make; run -c 1 -- ./cg

In this assignment, you will parallelize a serial implementation of the conjugate gradient method using OpenMP. The serial implementation can be found in cg.c and cg_impl.c. For more details about the code, please refer to the README file.

To effectively parallelize the conjugate gradient method, consider using profiling tools to analyze performance differences before and after your modifications. Simply adding a parallel-for directive to every for loop may not yield efficient results. Instead, you should use the profiling tools mentioned in Programming Assignment ZERO to better understand how your changes impact performance.

Here are some suggestions to help you complete this part:

  • Try to find the hot spots of the program.
  • Potential improvements may come from increasing data parallelism, optimizing code structure, reducing page faults, or other refinements.

See the requirements to complete this section.

You can also run our grader via: run -c 4 -- ./cg_grader, which reports the correctness of your program and a performance score.

Part 2: Parallelizing PageRank Algorithm with OpenMP

image

In this part, you will implement two graph processing algorithms: Breadth-First Search (BFS) and a basic implementation of Page Rank. A well-optimized implementation of this assignment should be able to run these algorithms on graphs with hundreds of millions of edges on a multi-core machine in just a few seconds.

Please enter the part2 folder:

cd part2

Read the following sections and accomplish the requirements of this part.

2.1 Background: Representing Graphs

The starter code operates on directed graphs, which are implemented in common/graph.h and common/graph_internal.h. We recommend starting by understanding the graph representation in these files. In this implementation, a graph is represented as an array of edges (both outgoing_edges and incoming_edges), where each edge is stored as an integer representing the destination vertex ID. The edges are sorted by their source vertex, meaning the source vertex is implicitly defined. This design provides a compact representation of the graph and allows it to be stored contiguously in memory. For example, to iterate over the outgoing edges of all nodes in the graph, you can use the following code, which leverages helper functions defined in common/graph.h and implemented in common/graph_internal.h:

for (int i = 0; i < num_nodes(g); i++) {
    // Vertex is an alias for int; Vertex * points into g.outgoing_edges[].
    const Vertex *start = outgoing_begin(g, i);
    const Vertex *end = outgoing_end(g, i);
    for (const Vertex *v = start; v != end; v++)
        printf("Edge %d %d\n", i, *v);
}

2.2 Task 1: Implementing Page Rank

As a simple warm up exercise to help you get comfortable with the graph data structures and familiarize yourself with some OpenMP basics, we ask you to start by implementing a basic version of the well-known Page Rank algorithm.

Please take a look at the pseudocode provided to you in the page_rank() function, located in page_rank/page_rank.cpp. Your task is to implement the function, parallelizing the code using OpenMP. As with any algorithm, begin by identifying independent work and determining any necessary synchronization.

You can test your implementation, checking correctness and performance against the staff reference solution using:

run -c 4 -- ./pr <PATH_TO_GRAPH_DIRECTORY>/com-orkut_117m.graph

If you are working on our workstation machines, we’ve placed a copy of this graph and several others at /data/pp/hw3/test.

Some interesting real-world graphs include:

  • com-orkut_117m.graph
  • oc-pokec_30m.graph
  • soc-livejournal1_68m.graph

Some useful synthetic, but large graphs include:

  • random_500m.graph
  • rmat_200m.graph

There are also some very small graphs for testing. If you look in the /tools directory of the starter code, you’ll notice a useful program called graph_tools.cpp that can be used to make your own graphs as well.

By default, the pr program runs your page rank algorithm with an increasing number of threads (so you can assess speedup trends). However, since runtimes at low core counts can be long, you can explicitly specify the number of threads to only run your code under a single configuration.

run -c 4 -- ./pr <PATH_TO_GRAPH_FILENAME> 4

Your code should handle cases where there are no outgoing edges by distributing the probability mass on such vertices evenly among all the vertices in the graph. That is, your code should work as if there were edges from such a node to every node in the graph (including itself). The comments in the starter code describe how to handle this case.

You can also run our grader via: run -c 4 -- ./pr_grader <PATH_TO_GRAPH_DIRECTORY>, which reports the correctness of your program and a performance score for four specific graphs.

We highly recommend reading the paper “Direction-Optimizing Breadth-First Search”, which proposed a hybrid method combining top-down and bottom-up algorithms for breadth-first search, before you continue with the following sections. In Sections 2.3, 2.4, and 2.5, you will implement a top-down, a bottom-up, and a hybrid method, respectively.

2.3 Task 2: Parallel Breadth-First Search (“Top Down”)

Breadth-first search (BFS) is a common algorithm that you’ve almost certainly seen in a prior algorithms class.

Please familiarize yourself with the function bfs_top_down() in breadth_first_search/bfs.cpp, which contains a sequential implementation of BFS. The code uses BFS to compute the distance to vertex 0 for all vertices in the graph. You may wish to familiarize yourself with the graph structure defined in common/graph.h as well as the simple array data structure VertexSet (breadth_first_search/bfs.h), which is an array of vertices used to represent the current frontier of BFS.

You can run bfs using:

run -c 1 -- ./bfs <PATH_TO_GRAPHS_DIRECTORY>/rmat_200m.graph

(as with page rank, bfs’s first argument is a graph file, and an optional second argument is the number of threads.)

When you run bfs, you’ll see the execution time and the frontier size printed for each step in the algorithm. Correctness will pass for the top-down version (since we’ve given you a correct sequential implementation), but it will be slow. (Note that bfs will report failures for the “bottom-up” and “hybrid” versions of the algorithm, which you will implement later in this assignment.)

In this part of the assignment your job is to parallelize a top-down BFS. As with page rank, you’ll need to focus on identifying parallelism, as well as inserting the appropriate synchronization to ensure correctness. We wish to remind you that you should not expect to achieve near-perfect speedups on this problem (we’ll leave it to you to think about why!).

Tips/Hints:

  • Always start by considering what work can be done in parallel.
  • Some part of the computation may need to be synchronized, for example, by wrapping the appropriate code within a critical region using #pragma omp critical. However, in this problem you can get by with a single atomic operation called compare and swap. You can read about GCC’s implementation of compare and swap, which is exposed to C code as the function __sync_bool_compare_and_swap. If you can figure out how to use compare-and-swap for this problem, you will achieve much higher performance than using a critical region.
  • Are there conditions where it is possible to avoid using compare_and_swap? In other words, when you know in advance that the comparison will fail?
  • There is a preprocessor macro VERBOSE that makes it easy to disable the per-step timing printouts in your solution (see the top of breadth_first_search/bfs.cpp). In general, these printfs occur infrequently enough (only once per BFS step) that they do not notably impact performance, but if you want to disable them during timing, you can use this #define as a convenience.

2.4 Task 3: “Bottom-Up” BFS

Think about what behavior might cause a performance problem in the BFS implementation from Part 2.3. An alternative implementation of a breadth-first search step may be more efficient in these situations. Instead of iterating over all vertices in the frontier and marking all vertices adjacent to the frontier, it is possible to implement BFS by having each vertex check whether it should be added to the frontier! Basic pseudocode for the algorithm is as follows:

for(each vertex v in graph)
    if(v has not been visited &&
       v shares an incoming edge with a vertex u on the frontier)
            add vertex v to frontier;

This algorithm is sometimes referred to as a “bottom-up” implementation of BFS, since each vertex looks “up the BFS tree” to find its ancestor. (As opposed to being found by its ancestor in a “top-down” fashion, as was done in Part 2.3.)

Please implement a bottom-up BFS to compute the shortest path to all the vertices in the graph from the root (see bfs_bottom_up() in breadth_first_search/bfs.cpp). Start by implementing a simple sequential version. Then parallelize your implementation.

Tips/Hints:

  • It may be useful to think about how you represent the set of unvisited nodes. Do the top-down and bottom-up versions of the code lend themselves to different implementations?
  • How do the synchronization requirements of the bottom-up BFS change?

2.5 Task 4: Hybrid BFS

Notice that in some steps of the BFS, the “bottom-up” BFS is significantly faster than the top-down version, while in other steps the top-down version is significantly faster. This suggests that you could achieve a major performance improvement by dynamically choosing between your “top-down” and “bottom-up” formulations based on the size of the frontier or other properties of the graph. If you want a solution competitive with the reference one, your implementation will likely need this dynamic optimization. Please provide your solution in bfs_hybrid() in breadth_first_search/bfs.cpp.

Tips/Hints:

  • If you used different representations of the frontier in Parts 2.3 and 2.4, you may have to convert between these representations in the hybrid solution. How might you efficiently convert between them? Is there an overhead in doing so?

3. Requirements

3.1 Part 1 Requirements

You will modify only cg_impl.c to improve the performance of the CG program, such as inserting OpenMP pragmas to parallelize parts of the program. Of course, you may add any other variables or functions if necessary.

3.2 Part 2 Requirements

You will modify only bfs.cpp and page_rank.cpp to implement and parallelize the algorithms with OpenMP.

4. Grading Policy

NO CHEATING!! You will receive no credit if you are found cheating.

We will judge your code with the following aspects:

  • Correctness: Your parallelized program should run faster than the original (serial) program.
  • Scalability: We will evaluate your program with 2, 3, 4 or more threads. Your program is expected to be scalable.
  • Performance: We will compare your code with the TA’s solution. Your performance should be close to or better than that of the TA’s implementation.

Total of 100%:

  • Part 1 (20%): Refer to cg/grade.c.
  • Part 2 (80%): page_rank accounts for 16% and breadth_first_search accounts for 64%. Refer to page_rank/grade.cpp and breadth_first_search/grade.cpp.

5. Evaluation Platform

Your program should be able to run on UNIX-like OS platforms. We will evaluate your programs on the dedicated workstations.

The workstations are based on Debian 12.9 with Intel(R) Core(TM) i5-10500 CPU @ 3.10GHz processors and GeForce GTX 1060 6GB. g++-12, clang++-11, and CUDA 12.8 have been installed.

You can run the graders (built from the source code) on the workstation. We will use the same grader to judge your code:

run -c 4 -- ./cg_grader
run -c 4 -- ./pr_grader   /data/pp/hw3/test
run -c 4 -- ./bfs_grader  /data/pp/hw3/test

6. Submission

All your files should be organized in the following hierarchy and zipped into a .zip file, named HW3_xxxxxxx.zip, where xxxxxxx is your student ID.

Directory structure inside the zipped file:

  • HW3_xxxxxxx.zip (root)
    • cg_impl.c
    • bfs.cpp
    • page_rank.cpp

Zip the files with the following command:

zip HW3_xxxxxxx.zip cg_impl.c bfs.cpp page_rank.cpp

Please run the provided testing script test_hw3 before submitting your assignment.

Run test_hw3 in a directory that contains your HW3_xxxxxxx.zip file on the workstation. test_hw3 checks whether the ZIP hierarchy is correct and runs the graders to check your answers, although for reference only.

Be aware that test_hw3 runs a very heavy workload that results in high CPU usage. We recommend first using cg_grader, pr_grader, and bfs_grader to check your programs’ correctness, and running test_hw3 only before your submission.

test_hw3 <student_id>

You will get NO POINT if your ZIP’s name is wrong or the ZIP hierarchy is incorrect.

You will get a 5-point penalty if you include unnecessary files (e.g., object files, .vscode, __MACOSX).

Be sure to upload your zipped file to the new E3 e-Campus system by the due date.

7. References


