Before you begin tuning for performance:
Understand your transaction profile. What kinds of transactions are being replicated? Do these transactions affect the same rows and tables? Are these transactions liable to conflict if applied in parallel? Is the transaction profile constant, or does it change, perhaps with the time of day or month. A clear understanding of your transaction profile helps you select those parameters and settings that will be most useful.
Tune the replicate database to handle contentions. Most primary databases have been tuned to minimize contentions through the use of clustered indexes, partitioning, row-level locking, and so on. Make sure that your replicate database has been tuned similarly.
Define a set of repeatable transactions that accurately reflect your replication environment. Tuning your parallel DSI environment is an iterative process. You will need to set parameters, run a test, measure performance, compare against previous measurements, and repeat until you have maximized your results.
First, reset the dsi_serialization_method parameter. Set the dsi_serialization_method parameter to no_wait to enable maximum parallelism. Then attempt to reduce contentions by testing other parameters. Because the wait_for_commit (the default) setting supplies minimal parallelism and therefore minimal benefit, only reset dsi_serialization_method to wait_for_commit after all attempts to reduce contention using the no_wait setting have failed to increase performance.
Set the dsi_num_threads parameter correctly. The dsi_num_threads parameter defines the total number of DSI executor threads; the dsi_num_large_xact_threads parameter defines the total number of DSI executor threads reserved for large transactions. Thus, the total number of DSI executor threads (dsi_num_threads) equals the number of DSI threads reserved for large transactions plus the number of threads available for small transactions.
To begin, try setting dsi_num_threads to 5, and dsi_num_large_xact threads to 2. After selecting a dsi_serialization_method and a dsi_partitioning_rule:
Increase dsi_num_threads if contention does not increase, or
Decrease dsi_num_threads if contention does not decrease.
Make sure that dsi_num_threads is greater than the default, and that the value for dsi_num_threads is greater than that for dsi_num_large_xact_threads.