.. Copyright 2022 NWChemEx-Project
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
.. http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.

.. _cache_design:

############
Cache Design
############

As a consequence of our :ref:`memoization_design` strategy, we have decided to
implement an associative container, called "the cache", to address the
remaining memoization considerations. As was briefly noted in the memoization
section, and explicitly called out in :ref:`design_checkpoint_restart`,
checkpoint/restart (C/R) will also be handled by the cache.

.. note::

   We use the convention that "cache" or "the cache" refers to the concept and
   that ``Cache`` refers to the class.

********************
Cache Considerations
********************

The considerations for the cache are the union of the memoization
considerations not addressed in the :ref:`memoization_design` section and the
checkpoint/restart (C/R) considerations not addressed in
:ref:`design_checkpoint_restart`. The memoization considerations focus on
using the cache to avoid recomputing quantities, whereas the C/R
considerations focus on how we should save the cache so that it can be
reloaded and reused at a later time.

In the :ref:`memoization_design` section it was decided that at the core of
the cache is a high-performance associative array. The remainder of this
section enumerates considerations pertaining to how one can unify the
considerations in the :ref:`memoization_design` section with the associative
array concept.

- Since memoization occurs per type ``T`` (``T`` derived from ``ModuleBase``)
  the actual cache will be a nested associative array (``T`` maps to an
  associative array ``A`` and ``A`` in turn maps the inputs to results); see
  the sketch after this list.
- Will likely want to compress values and/or dump them to disk.
- What are the keys? Input values? Hashes? Universally unique IDs (UUIDs)?
- How many entries do we expect in a module's cache? 10s? 100s? 1000s?

  - Cache entries are ultimately calls to a module. In a given run, a module
    probably gets called fewer than 10 times, maybe in the 100s for an
    iterative process or a very fundamental, frequently called module.
  - 1000+ calls is not unfathomable, but probably rare enough to justify
    treating it separately.

- There are a number of associative array types; which one(s) do we want?

  - Certain structures excel at determining if an entry exists. Others excel
    at adding entries.

- Checksums for data validity?
- Needs to operate in a distributed environment.

  - Each process gets its own cache?
  - For traditional memoization, distributed objects will stay distributed.
  - For C/R it is less clear:

    - Could do a checkpoint per process.
    - Could do a single, synchronized checkpoint.
    - Could do something in between.
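For concreteness, the nested layout can be sketched as two levels of standard
associative arrays. This is a minimal illustration only; the type names
(``InputSet``, ``ResultSet``) are placeholders, not the actual PluginPlay
types, and a real implementation would use a higher-performance container.

.. code-block:: cpp

   #include <map>
   #include <string>
   #include <typeindex>

   // Placeholder input/result containers; in practice these would be the
   // module's bound inputs and computed results.
   using InputSet  = std::map<std::string, double>;
   using ResultSet = std::map<std::string, double>;

   // Inner associative array: maps a module's inputs to its results.
   using ModuleCache = std::map<InputSet, ResultSet>;

   // Outer associative array: maps the module's C++ type to its cache.
   using Cache = std::map<std::type_index, ModuleCache>;

   // Memoization then amounts to two lookups:
   bool is_memoized(const Cache& c, std::type_index t, const InputSet& in) {
       auto mod = c.find(t); // have we seen this module type before?
       return mod != c.end() && mod->second.count(in) > 0;
   }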
*********************
Cache Implementations
*********************

At its heart the cache involves associative arrays, and conceptually we may be
able to use an existing high-performance C/C++ library to implement it. This
section explores the available options, their features, and ultimately their
suitability for our ``Cache`` class.

.. note::

   The notes on these libraries are primarily based off of documentation. I do
   not have experience with any of the libraries, so the descriptions may be
   inaccurate. Activity, star, and watcher information was accurate as of
   March 2022 and may have changed since then.

C++ Associative Container Libraries
===================================

First up are the C++ libraries on GitHub with the ``associative-array`` tag:

- HArray

  - https://github.com/Bazist/HArray
  - Uses a trie (instead of a hash table or binary tree)
  - Prefix compression
  - Save/load to disk
  - GPLv3
  - Appears to be abandoned (last commit July 2021). 88 stars and 12 watchers.

- AVL Array

  - https://github.com/mpaland/avl_array
  - Fixed-size array (designed for embedded systems)
  - MIT License
  - Appears abandoned (last commit March 2020). 28 stars and 4 watchers.

- uCDB

  - https://github.com/JulStrat/uCDB
  - Designed for Arduino (apparently a competitor of Raspberry Pi)
  - Seems like it can only read from existing databases
  - Not licensed.
  - Active development. 5 stars and 1 watcher.

- Vista

  - https://github.com/breese/vista
  - Implements various span classes (including an associative array one)
  - Operates on existing contiguous memory.
  - Does not appear to allow modification.
  - No license.
  - Likely abandoned (last commit May 2020). 4 stars and 3 watchers.

- ceres

  - https://github.com/ceresBakalite/ceres
  - Looks like an example project.
  - Appears to just wrap ``std::map``
  - GPLv3
  - Likely abandoned (last commit April 2021). 0 stars and 1 watcher.

C Libraries for Associative Containers
======================================

Next are C libraries on GitHub with the ``associative-array`` tag:

- libdict

  - https://github.com/fmela/libdict
  - Looks like implementations of common data structures with C APIs (no
    special features)
  - BSD
  - Likely abandoned (last commit May 2019). 255 stars and 28 watchers.

- thmap

  - https://github.com/rmind/thmap
  - Concurrent implementation of a trie-hash map
  - BSD
  - Likely abandoned (last commit April 2020). 63 stars and 7 watchers.

- rhashmap

  - https://github.com/rmind/rhashmap
  - Hash map that uses Robin Hood hashing
  - BSD
  - Last commit September 2021. 40 stars and 5 watchers.

- cdict

  - https://github.com/mkostoevr/cdict
  - Seems like a limited-functionality replacement for ``std::map`` (in C)
  - MIT
  - Last commit November 2021. 2 stars and 1 watcher.

- core-array

  - https://github.com/Johns-Q/core-array
  - Appears to be a simple map class which can only have integer keys and
    integer values.
  - AGPLv3
  - Last commit September 2021. 1 star and 1 watcher.

- c_vector_map

  - https://github.com/michael105/c_vector_map
  - Maps strings to integers.
  - Documentation says it was designed/optimized for a highly specific use
    case.
  - BSD
  - Last commit April 2021. 0 stars and 1 watcher.

- Data-Structures-C

  - https://github.com/chivington/Data-Structures-C
  - Appears to be someone's toy project to learn about data structures.
  - Documentation states it's not production ready.
  - DO_WHATEVER_YOU_WANT license
  - Likely abandoned (last commit July 2019). 0 stars and 0 watchers.

Databases
=========

While we've focused on associative arrays up to this point, what we're after
can also be considered a database. The advantage of moving to databases is
that we get some real heavy-hitters contributing software. Putting the cart
before the horse, our current caching strategy will rely on databases for a
number of the considerations. The full design discussion is lengthy and we
defer that conversation to :ref:`database_design`.
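Whatever backend is chosen there, it is worth sketching the minimal key-value
contract the cache needs from it, whether that is an in-memory map or an
on-disk database. The interface below is illustrative only and is not the
actual ``Database`` API.

.. code-block:: cpp

   #include <optional>

   // Illustrative contract only. Real backends would layer on compression,
   // dumping to disk, checksums, and distributed synchronization per the
   // considerations above.
   template<typename Key, typename Value>
   class KeyValueBackend {
   public:
       virtual ~KeyValueBackend() = default;

       // Has this key been seen before? (This drives memoization.)
       virtual bool count(const Key& key) const = 0;

       // Store a key/value pair, possibly spilling the value to disk.
       virtual void insert(Key key, Value value) = 0;

       // Retrieve the value for a key, if present.
       virtual std::optional<Value> at(const Key& key) const = 0;

       // Flush state to permanent storage (the C/R hook).
       virtual void dump() = 0;
   };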
**************
Cache Strategy
**************

Ultimately much of the ``Cache``'s implementation will be punted to an
underlying class, ``Database``. The design of that class, and how it addresses
the ``Cache`` considerations, is beyond our current scope, but can be found
:ref:`here <database_design>`. This discussion focuses on the design of the
API wrapping the ``Database`` class.

.. _fig_cache_design:

.. figure:: assets/cache_design.png

   Main design points of the ``Cache`` and ``ModuleCache`` classes.

:numref:`fig_cache_design` summarizes the design of the ``Cache`` class. Each
``ModuleManager`` instance can have at most one ``Cache`` instance. This
ensures there is a single source of truth for the program. The ``Cache`` will
be hierarchical since memoization only makes sense for modules of the same C++
type. The ``Cache`` refers to the entire cache object, whereas ``ModuleCache``
refers to the cache for all modules of the same C++ type. Thus ``Cache`` can
be thought of as a map from C++ type to ``ModuleCache`` instances. In lieu of
the actual C++ module type, we define a concept called the module ID. For now
the module ID is simply the module key that a module was originally registered
under, but in the future it may be expanded to include, for example, version
information.

Each ``ModuleCache`` is thought of as a map from a set of inputs to a set of
results. Under the hood of ``ModuleCache`` we use the ``Database`` class to
store the actual data. From the user's perspective the ``Database`` class
implements the ``Cache`` class, and ``ModuleCache`` instances are slices of
the ``Database``. We accomplish this by having the keys of the database be
pairs whose first element is the module ID and whose second element is the
input set (values in the database are the result sets). ``Database`` is
implemented by a polymorphic PIMPL. Additional PIMPLs can be added to cover
different database backends and/or nested for more complicated C/R scenarios.
It is worth noting that in this design the ``Cache`` class is very similar to
the ``Database`` class and could, at this stage, just be a ``Database``;
however, we maintain separate classes because ``Cache`` is the public API and
``Database`` will be an implementation detail.

So far this strategy does not address how iterative modules will be memoized,
nor will the database help on this issue. The primary problem to solve is how
to avoid needing to store all of the intermediate results. Our current
strategy is to more or less punt and make module developers do it. More
specifically, the ``Cache`` class will maintain a second module to
``ModuleCache`` mapping (these ``ModuleCache`` instances will actually be held
as instances of the derived ``UserCache`` class). Module developers are able
to access the ``UserCache`` through the ``ModuleBase`` class and are free to
cache/uncache whatever values they want, as the sketch below illustrates.
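For illustration, here is how an iterative module might use its per-module
cache to checkpoint and resume. ``UserCache`` is stubbed out with a toy
in-memory map, and the member names (``count``, ``at``, ``insert``) are
hypothetical stand-ins rather than the real API.

.. code-block:: cpp

   #include <cmath>
   #include <map>
   #include <string>

   // Toy stand-in for the real UserCache, which wraps the same Database as
   // the ModuleCache and tags its entries (see the discussion below).
   struct UserCache {
       std::map<std::string, double> data;
       bool count(const std::string& k) const { return data.count(k) > 0; }
       double at(const std::string& k) const { return data.at(k); }
       void insert(const std::string& k, double v) { data[k] = v; }
   };

   // Toy iterative "module": fixed-point iteration x -> cos(x). On restart
   // it resumes from the last checkpointed iterate, not the initial guess.
   double run_iterative(UserCache& cache, double guess, double tol = 1e-10) {
       double x = cache.count("last_iterate") ? cache.at("last_iterate")
                                              : guess;
       double next = std::cos(x);
       while (std::abs(next - x) > tol) {
           x = next;
           cache.insert("last_iterate", x); // checkpoint the intermediate
           next = std::cos(x);
       }
       return next;
   }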
The use of the ``UserCache`` with iterative modules has a potential pitfall.
In general the ``UserCache`` will wrap the same ``Database`` as the
``ModuleCache`` (this facilitates using a single C/R backend). Furthermore,
the inputs to the iterative module are often the same as the inputs used to
tag intermediate results (and the intermediate results are often the same
type as the module's results). In turn, restarting an iterative module that
has not converged may lead to the iterative module being memoized to the last
intermediate even though it has not converged. To prevent this, ``UserCache``
tags its entries before dumping them, whereas ``ModuleCache`` does not. When
attempting to memoize an iterative module, tagged cache entries will not be
considered. If no untagged entries are found, control goes into the module.
Inside the module, the user checks the ``UserCache`` instance (which in turn
only considers tagged entries) and the module is able to restart using the
last intermediate.

Cache Considerations Addressed
==============================

This strategy addresses the following cache considerations:

- Each class ``T`` deriving from ``ModuleBase`` is assumed to wrap a unique
  algorithm. When memoizing, it thus only makes sense to consider results
  computed by other ``T`` instances.

  - Addressed by ``Cache`` vs. ``ModuleCache``.

- Failing to memoize when appropriate will affect performance.

  - There is only so much we can do here. Ultimately it is up to the database
    backend to determine if a set of inputs has been seen or not. Conceivably,
    the database could use special comparisons to determine if two objects are
    equal.

- Memoizing when not appropriate will compromise the integrity of the
  scientific answer.

  - Again, this is up to the database backend. In our initial design two
    objects must compare value equal to be considered the same. If value
    equality is implemented correctly there is no reason why false memoization
    should occur.

- Memoizing iterative (recursive) functions requires special considerations.

  - Addressed by having ``UserCache``.

- Comparing objects can be expensive (think about distributed tensors).

  - It is our opinion that optimizing value comparison is easier than trying
    to implement a general comparison solution based on hashing or the like.
    Object developers are of course free to use hashing inside their object's
    value comparisons as an optimization (sketched after this list).

- Module developers may want to do their own C/R.

  - Addressed by having a ``UserCache``. Module developers can put whatever
    they want in that cache and have it be checkpointed.
  - Module developers are also free to ignore the ``UserCache`` and do
    something else if they would like.

- Since memoization occurs per type ``T`` (``T`` derived from ``ModuleBase``)
  the actual cache will be a nested associative array (``T`` maps to an
  associative array ``A`` and ``A`` in turn maps the inputs to results).

  - The ``Cache`` and ``ModuleCache`` classes address this.

- What are the keys? Input values? Hashes? Universally unique IDs (UUIDs)?

  - From the perspective of the ``Cache`` the keys are input values. The
    ``Database`` implementation is free to map the input values to proxy
    objects such as UUIDs if it so likes.

- How many entries do we expect in a module's cache? 10s? 100s? 1000s?

  - Cache entries are ultimately calls to a module. In a given run, a module
    probably gets called fewer than 10 times, maybe in the 100s for an
    iterative process or a very fundamental, frequently called module.
  - 1000+ calls is not unfathomable, but probably rare enough to justify
    treating it separately.
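To close out the point about expensive comparisons, an object can cache a
hash of its data and use it to short-circuit its value comparison. The class
below is a hypothetical example, not a PluginPlay type; the hash lives inside
``operator==`` as an optimization rather than replacing value equality.

.. code-block:: cpp

   #include <cstddef>
   #include <functional>
   #include <vector>

   // Hypothetical large object whose operator== rejects cheaply via a
   // cached hash before falling back to an element-wise comparison. Equal
   // hashes still require the full comparison, so correctness does not
   // depend on the hash being collision-free.
   class BigTensor {
   public:
       explicit BigTensor(std::vector<double> data)
           : m_data(std::move(data)), m_hash(hash_of(m_data)) {}

       friend bool operator==(const BigTensor& a, const BigTensor& b) {
           if (a.m_hash != b.m_hash) return false; // cheap rejection
           return a.m_data == b.m_data;            // expensive confirmation
       }

   private:
       static std::size_t hash_of(const std::vector<double>& v) {
           std::size_t h = 0;
           for (double x : v) // toy hash combiner; use a real one in practice
               h ^= std::hash<double>{}(x) + 0x9e3779b9 + (h << 6) + (h >> 2);
           return h;
       }

       std::vector<double> m_data;
       std::size_t m_hash;
   };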