CU2CL Release Notes

v0.7.0b - The Multi-File Release

Summary
- Major revision - Dramatically improved support for multi-file codebases
- Overhauled invocation - now via Clang's libTooling interface
  - Now invoke once with all files - was once per file
  - Supports sharing of global OpenCL state and better propagation of cross-file translations
- Overhauled OpenCL rewriting behavior - now via Clang Replacements
  - More suitable for cross-file translations (rewrites agnostic of originating source file)
  - Provides deduplication for identically-generated OpenCL = fewer "multiply-defined" errors
- Other upgrades
  - Clang 3.2 --> Clang 3.4 (provides libTooling and Replacement features)
  - Embedded default overrides for CUDA 5.5+ headers (further simplified invocation)
  - cu2cl_util.c/h/cl family of utility files encapsulates shared CU2CL emulation functions
    - Streamlines translated applications - only user code now
  - Improved OpenCL boilerplate generation and encapsulation
    - Global init/cleanup in cu2cl_util - cl_kernels/cl_programs in per-CUDA-file init/cleanup

Full Notes
The major feature of this release is the overhaul of two core functionalities of the translator: how it is invoked, and how generated OpenCL is collated into final output. Both were overhauled to take advantage of new features provided by Clang 3.4 that allow for increased ease of use, improved scalability to more complex codebases, and usability improvements in the generated OpenCL code. The internal mechanisms which generate raw translated OpenCL strings from AST insights remain much the same.

The first overhaul switches the tool from operating as a Clang plugin to operating as a Clang refactoring tool. Rather than launching an instance of the Clang compiler which triggers the CU2CL plugin, CU2CL now exists as its own executable that calls into the Clang library. While seemingly subtle, this change has several benefits:
 - Dramatically simplified invocation: There is no longer a need to supply complex operands to clang to force loading of the CU2CL shared library or provide it arguments. One now simply needs to provide the source files, standard compiler arguments (include directories, defines, etc.), and any command line options.
 - Improved compatibility with packaged versions of Clang: As long as the external Clang API remains consistent, CU2CL should be less sensitive to minor revisions. Similarly, since a library interface is used, CU2CL is not as tightly coupled to having a locally-compiled install of Clang. (However, Clang is a rapidly moving target; its major revisions are still likely to cause hiccups.)
 - Most importantly, the tool construction allows CU2CL to internally maintain persistent state across multiple independent translation units. That is, without needing to manage any state on-disk, CU2CL can be aware of translations which originated from one CUDA file provided to the invocation while working on subsequent files provided to the same invocation. This functionality is critical to ensuring high-fidelity translations of complex, multi-file CUDA codebases.
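
The persistent-state point can be illustrated with a toy model that involves no Clang at all. It is only a sketch under stated assumptions: SharedState, InputFile, and processAll are hypothetical names, and the "kernel owner" table stands in for whatever information the translator actually carries between files.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for state kept in memory for the whole invocation.
struct SharedState {
    // kernel name -> name of the file in which its definition was seen
    std::map<std::string, std::string> kernelOwner;
};

// Hypothetical, already-parsed view of one input file.
struct InputFile {
    std::string name;
    std::vector<std::string> definedKernels;
    std::vector<std::string> launchedKernels;
};

// Walk all files in one process. Definitions recorded while handling an
// earlier file are visible when a later file launches the same kernel.
// Returns a description of each launch resolved to a different file.
std::vector<std::string> processAll(const std::vector<InputFile>& files,
                                    SharedState& state) {
    std::vector<std::string> crossFile;
    for (const InputFile& f : files) {
        assert(!f.name.empty());  // every input needs a name to act as a key
        for (const std::string& k : f.definedKernels) {
            state.kernelOwner[k] = f.name;
        }
        for (const std::string& k : f.launchedKernels) {
            auto it = state.kernelOwner.find(k);
            if (it != state.kernelOwner.end() && it->second != f.name) {
                crossFile.push_back(f.name + " launches " + k +
                                    " defined in " + it->second);
            }
        }
    }
    return crossFile;
}
```

In this toy version a launch only resolves if the defining file was processed earlier in the same call. With one process per file, as in the old plugin model, the table would start empty every time and no cross-file launch could resolve without on-disk state.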

The second overhaul switches the mechanism through which individual CUDA source elements are replaced with their corresponding OpenCL strings. Prior versions operated on a single translation unit at a time, so it was suitable to write edited lines directly to a Clang Rewriter object, which handles both the replacement of CUDA strings and output file generation. However, a Rewriter is bound to a Clang SourceManager object, which does not usually persist across multiple source files, so Rewriters were unsuitable for managing rewrites spanning a range of input files. Staged rewrites are therefore now stored in the SourceManager-oblivious Replacement format, and only applied to Rewriters for output after all files have been processed. The Replacement data structure provides other conveniences which have aided in the intelligent merging of rewrites originating from separate translation units:
 - Automatic deduplication: This primarily benefits headers and other files which are shared between multiple translation units, where multiple identical rewrites can be generated for a given CUDA structure. (Deduplication is performed in two stages: once within each translation unit, and once globally after Replacements from all TUs have been merged.)
 - Merging of generated OpenCL: When multiple input files generate non-overlapping OpenCL rewrites in a single source file, they can now be merged such that the output file is now generated exactly once. This prevents earlier issues wherein an overlapping output file would be generated by a later translation and destroy the earlier version of the output file.
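
The deduplication and merging behavior described above can be sketched without Clang. The Edit struct below is a simplified stand-in for a Replacement (file path, byte offset, length, replacement text) and is not the Clang class itself; mergeAll and applyTo are hypothetical helpers. Keeping edits in an ordered set means identical edits produced by two translation units for a shared header collapse into one entry.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Simplified stand-in for a Replacement: identified purely by file path and
// byte range, so it does not depend on any per-file SourceManager.
struct Edit {
    std::string file;
    unsigned offset;
    unsigned length;
    std::string text;
    bool operator<(const Edit& o) const {
        return std::tie(file, offset, length, text) <
               std::tie(o.file, o.offset, o.length, o.text);
    }
};

using EditSet = std::set<Edit>;  // a set drops exact duplicates on insert

// Global stage: merge the per-translation-unit sets into one deduplicated set.
EditSet mergeAll(const std::vector<EditSet>& perTU) {
    EditSet merged;
    for (const EditSet& s : perTU) {
        merged.insert(s.begin(), s.end());
    }
    return merged;
}

// Apply every edit for `file` to `contents`, highest offset first so that
// lower offsets stay valid. Assumes the edits do not overlap.
std::string applyTo(const std::string& file, std::string contents,
                    const EditSet& edits) {
    for (auto it = edits.rbegin(); it != edits.rend(); ++it) {
        if (it->file != file) continue;
        assert(it->offset + it->length <= contents.size());
        contents.replace(it->offset, it->length, it->text);
    }
    return contents;
}
```

Because all edits for a file are collected before any output is written, the file is produced once from the merged set rather than being overwritten by each translation unit in turn.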

The release also includes a few other notable upgrades, which are enabled in part by the aforementioned overhauls:
 - cu2cl_util.c/h/cl family of files: All utility code is now bundled into a single family of utility files, streamlining translated code so that it solely contains user code - not CU2CL emulation functions.
 - Some manual overrides that were previously needed at invocation to handle CUDA 5.5+ header files have now been embedded by default, further simplifying tool usage.
 - Improved boilerplate generation and locality: Global OpenCL boilerplate (device selection and context/queue creation/destruction) is now encapsulated in __cu2cl_Init() and __cu2cl_Cleanup() functions, rather than being injected into a (possibly absent) main() function. Local boilerplate (program and kernel build) is now encapsulated in per-file _Init() and _Cleanup() functions, which are automatically triggered by the global calls.
 - Globally-visible variables: Certain OpenCL variables which must be shared across all objects composing an executable are now globally shared via extern qualifiers - while retaining ownership in the source file they were generated from. This allows a single OpenCL instance to persist across multiple separately-compiled object files, preserving commonplace build semantics. (Previously, to share this state it was largely necessary to have a single top-level file which #included all other project files, and all variables were "owned" by the top-level file, rather than the source file they came from.)
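
The init/cleanup layering and extern sharing described in the last two items can be sketched in a single file, with comments marking what would live in separate files. This is an illustration only: every name is hypothetical, plain ints stand in for OpenCL handles so the sketch builds without an OpenCL SDK, and the cleanup order shown is a choice made for the sketch rather than something stated in these notes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// --- Shared header (hypothetical): declarations seen by every object file ---
extern int g_context;                      // stand-in for a shared handle; 0 = not created
extern std::vector<std::string> g_events;  // records call order, for illustration
void global_init();
void global_cleanup();
void fileA_init();
void fileA_cleanup();
void fileB_init();
void fileB_cleanup();

// --- Utility source (hypothetical): owns the shared state, drives the hooks ---
int g_context = 0;
std::vector<std::string> g_events;

void global_init() {
    g_context = 1;  // stand-in for device selection and context/queue creation
    g_events.push_back("global_init");
    fileA_init();   // per-file hooks are triggered by the global call
    fileB_init();
}

void global_cleanup() {
    fileB_cleanup();  // release per-file objects before the shared state
    fileA_cleanup();
    g_context = 0;
    g_events.push_back("global_cleanup");
}

// --- Translated file A (hypothetical): owns its own program/kernel stand-in ---
int fileA_program = 0;
void fileA_init() {
    assert(g_context != 0);  // the per-file build needs the shared context
    fileA_program = 1;
    g_events.push_back("fileA_init");
}
void fileA_cleanup() {
    fileA_program = 0;
    g_events.push_back("fileA_cleanup");
}

// --- Translated file B (hypothetical): same pattern, separate ownership ---
int fileB_program = 0;
void fileB_init() {
    assert(g_context != 0);
    fileB_program = 1;
    g_events.push_back("fileB_init");
}
void fileB_cleanup() {
    fileB_program = 0;
    g_events.push_back("fileB_cleanup");
}
```

Each variable is defined in exactly one place and reached elsewhere through an extern declaration, so the pieces could be compiled as separate objects and linked without a top-level file that #includes everything.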