###############
Hacking on MATE
###############

.. important:: This page is intended for the developers of MATE only -- it is not relevant for users.

********
Overview
********

Hello and welcome! 👋

If you are new to MATE, please check out:

#. The :doc:`overview`
#. The :doc:`quickstart`

Then continue reading this page to get a tour of the codebase, learn about our build workflow and engineering practices, and find other tips for developers.

**********************
A tour of the codebase
**********************

Below is some context on what some of the files and directories contain, along with what you might want to know about them. See the :doc:`architecture` documentation for additional information about MATE's components.

- ``Dockerfile`` and ``dev-requirements.txt``: These files encode the dependencies necessary to build and run ``mate``. MATE is released in the ``mate-dist`` Docker image, which contains only the minimal set of dependencies needed to run ``mate``, while ``mate-dev`` contains all the dependencies needed for developing ``mate`` (see also :ref:`Docker Images and Compositions `). Any time the content of these files changes, you will need to pull the corresponding image from the GitHub Container Registry or rebuild it locally (see :doc:`quickstart`).

- ``shake/exe/Mate.hs``: We use the `Shake <https://shakebuild.com/>`_ build system to build MATE; this is the file where the build system is defined. You can build MATE from inside the ``mate-dev`` container by running ``./shake.sh bdist``, which builds the binary distribution of MATE. This will create or update files in the ``.out`` output directory. If you want to change, add, or inspect any of the build system's flags, ``Mate.hs`` is the file to check out.

- ``.github/``: This directory contains the CI pipelines.

- ``frontend``: Files related to the MATE REST API, task substrate, and the legacy ``mate`` command line interface (CLI). For more CLI documentation, check out :doc:`the MATE CLI overview `. The ``mate`` executable lives at ``frontend/mate/bin/mate``. When you build ``mate`` by running ``./shake.sh bdist`` in the ``mate-dev`` image, the ``frontend/mate`` directory gets copied to ``.out/bdist/local/lib/python3.8/site-packages/mate``.

- ``frontend/mate-common``: This is an independent package for MATE components that are shared by the MATE server, consumers of the ``mate`` Python package, and the REST client.

- ``frontend/mate-rest-client``: This is an independent Python binding for MATE's REST API.

- ``frontend/mate/mate/cpg``: This is where the query interface implementation lives. This is where Python classes corresponding to CPG elements (such as Nodes) are defined, along with the specification of their attributes. There are many different types of nodes in a hierarchical relationship, where child node types inherit the attributes of their parents. We use `SQLAlchemy <https://www.sqlalchemy.org/>`_ to implement the queries.

- ``frontend/mate/mate/poi``: This is the home of the POI (point of interest) framework code as well as all automated POI analyses. For more info, see the POI docs.

- ``frontend/mate/mate/context/cli.py``: This file defines all the command line options and uses the ``argparse`` library to construct a context object. This context object shows up throughout the legacy CLI code in ``frontend/mate``, usually bound to a variable named ``ctxt``.

- ``test/postgres``: This directory contains unit tests for features related to querying the CPG using the Python query interface.
  See :doc:`testing` for more details on running the various test suites as well as writing new unit tests.

- ``llvm``: This directory contains our "middle-end" LLVM passes, which are run with ``opt``. ``llvm/MATE/ASTGraphWriter.cpp`` is the pass that traverses the program under analysis and writes out the nodes and edges of our CPG into a JSONL file. These JSONL files are then post-processed to create the database that the query interface queries against. The pointer analysis is also implemented as a middle-end pass and passes data into the ``ASTGraphWriter``.

- ``submodules``: MATE contains a number of other projects as submodules, which reside in the ``submodules`` directory. Whenever the code changes to rely on an updated version of a submodule, you'll need to get the later version by running ``git submodule update --init --recursive``.

- ``ui-client``: This directory contains the MATE UI client. See the :ref:`hacking:Hacking on the MATE UI` section below.

- ``doc``: This directory contains our documentation (in addition to the documentation contained in docstrings!). For more info, see the :ref:`hacking:Documentation` section below.

********************
Building and Running
********************

If you followed the :doc:`quickstart`, you should have an initial build of MATE available. To re-build MATE with any changes you've made, just re-run the ``bdist`` and ``mate-dist`` steps:

.. code-block:: bash

   docker run --rm -v $(pwd):/mate -it mate-dev ./shake.sh -j bdist
   docker build --target dist -t mate-dist .

.. note:: A "release" build of MATE can be produced by passing ``--release-build`` to ``./shake.sh``. This *may* result in an even more memory-intensive build phase.

**********************
Hacking on the MATE UI
**********************

MATE's current UI is a web application, based on React and written in TypeScript. It is decoupled from the rest of the MATE codebase, using only the REST API exposed by the MATE server.

For local development, the ``docker-compose.ui.dev.yml`` compose file provides hot reloading of the UI. You can run it in conjunction with the main ``docker-compose.yml`` file to bring up the entire MATE stack at once:

.. code-block:: bash

   docker-compose -f docker-compose.yml -f docker-compose.ui.dev.yml up

Alternatively, if you already have the core MATE services running, you can run just the ``ui`` service:

.. code-block:: bash

   docker-compose -f docker-compose.yml -f docker-compose.ui.dev.yml up ui

*************
Documentation
*************

Documentation for MATE is written in `ReStructured Text`_ and processed by the `Sphinx`_ documentation system. Documentation source files are located within the ``doc/`` subdirectory of the MATE repository.

.. _ReStructured Text: https://www.sphinx-doc.org/en/main/usage/restructuredtext/basics.html#external-links
.. _Sphinx: https://www.sphinx-doc.org/

Building
========

The build system provides a top-level target for building the documentation::

   $> ./shake.sh doc

In addition to rendering the documentation, the ``doc`` target also pre-generates the autodoc sources, including the :ref:`CPG `, :ref:`API `, and :ref:`CLI ` references.

Rendered documentation is placed in ``doc/_build/`` for each of the output formats, all of which are built each time.
To view the built HTML, simply open the root document in your browser of choice::

   $> $BROWSER doc/_build/html/index.html

Alternatively, the built manual page may be viewed directly in the terminal::

   $> MANPATH=doc/_build/man man mate

Writing
=======

The steps to writing more documentation are simple: navigate the `ToC Tree`_ starting from ``doc/index.rst`` to the most appropriate location for your contributions, and edit away. You may need to create new ``.rst`` files to hold your text; make sure to insert them into the appropriate ToC trees as you go.

.. _ToC Tree: https://www.sphinx-doc.org/en/1.8/usage/restructuredtext/directives.html#directive-toctree

Once you've made your edits, rebuild, and make sure your work shows up in all the relevant indices and tables.

Deployment
==========

Documentation from the ``main`` branch of the repository is automatically rendered and deployed to the web by CI. Your edits will need to be merged into ``main`` before they show up.

The JSON schemata
=================

The JSON schemata in ``./frontend/mate-common/mate_common/schemata`` describe the valid contents of the JSON blobs stored in the database. The ``edges.json`` and ``nodes.json`` files specify each kind of node and edge in the CPG, along with their required and optional fields. This information can also be found in ``frontend/mate-common/mate_common/models/cpg_types/mate.py``, which codifies these structures so they can be used in the query interface.

Style
=====

Capitalization conventions
--------------------------

- MATE is the name of our project; ``mate`` is the name of the legacy CLI tool and the primary Python package
- Docker, Python, SQLAlchemy, etc. are proper names and need capitalization
- JSON, SQL, POI, etc. are abbreviations and need capitalization

Linking
-------

Wikipedia has `sensible guidelines on linking <https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking>`_, and the MATE documentation follows similar conventions. In particular:

- Mentions of unfamiliar or novel concepts like the CPG should include internal links to the main documentation page on the concept (in this case, :doc:`cpg`).
- Links should generally appear at most once per page.

Abbreviations
-------------

For an appropriate balance of clarity and concision, most abbreviations, like POI for point of interest and CPG for code property graph, should be introduced explicitly once per page and referred to by the abbreviation thereafter.

*****************************
Adding a new model to the CPG
*****************************

Considerations
==============

Before committing to adding a new CPG model, consider the following:

#. Does your feature need its own (node or edge) kind and corresponding model, or can it be attached to a pre-existing model?
#. If adding a new node kind, will new edge kinds be required?

Changes
=======

Enumerations
------------

In ``frontend/mate/mate/cpg/types/mate.py``, update the ``NodeKind`` or ``EdgeKind`` enum (or both, if adding both nodes and edges) to contain your new kinds.

Nodes
^^^^^

If adding a new node whose kind is ``Foo``, you might add the following to ``NodeKind``:

.. code-block:: python

   FOO = "Foo"

Once you've updated the ``NodeKind`` enum, you should also:

* Update the ``NODE_PROVENANCE`` map to reflect the MATE component responsible for the node's contents.
* If appropriate, update one or more of the node sets (e.g. ``CONSTANT_NODES``, ``INSTRUCTION_NODES``, etc.) that the new node belongs to. Not all nodes have a set, so this step may not be necessary.
* Update the ``NodeJSON`` enum in the same file to list the new node's expected top-level attributes. Attribute names are unique, so you shouldn't repeat any enumeration values that happen to already exist. For example, if ``Foo`` is expected to contain ``bar`` and ``baz``, you might add:

  .. code-block:: python

     BAR = "bar"
     BAZ = "baz"

Edges
^^^^^

If adding a new edge whose kind is ``FooToBar``, you might add the following to ``EdgeKind``:

.. code-block:: python

   FOO_TO_BAR = "FooToBar"

Once you've updated the ``EdgeKind`` enum, you should also update one or more of the edge sets (e.g. ``POINTS_TO``, ``DATA_FLOW_FORWARD_THIN``, etc.) that the new edge belongs to. Not all edges have a set, so this step may not be necessary.

Schemata
--------

Next, you'll need to update the JSON schemata to reflect your new node and/or edge kinds. For new nodes, edit the ``frontend/mate-common/mate_common/schemata/nodes.json`` file. For new edges, edit ``frontend/mate-common/mate_common/schemata/edges.json``.

New relationships
^^^^^^^^^^^^^^^^^

If your changes include new relationships between nodes (including new edges), then you'll likely need to edit two files:

* ``frontend/mate-common/mate_common/schemata/relationships.json``: Describes the cardinality of each ``EdgeKind``, i.e. ``one-to-one``, ``one-to-many``, etc.
* ``frontend/mate-common/mate_common/schemata/endpoints.json``: Describes the valid node-node relationships as connected by edges.

Adding the model
----------------

Finally, we need to add the actual model and connect it to the above enumerations.

Adding Nodes
^^^^^^^^^^^^

For nodes, add the model to the appropriate file under ``frontend/mate/mate/cpg/models/node``. The appropriate file is determined by the node's origin: DWARF-related nodes go under ``dwarf.py``, analysis-related nodes go under ``analysis``, and the various program ASTs go under ``ast/``.

For example, if ``Foo`` is a binary-level program feature, we might add the following to ``frontend/mate/mate/cpg/models/node/ast/bin.py``:

.. code-block:: python

   class Foo(NodeMixin):
       _kind = NodeKind.FOO

See below for examples of adding relationships via the ``edge_relationship`` API.

Our final step is to expose our model as an attribute on ``BaseCPG``, which is defined in ``frontend/mate/mate/cpg/models/core/cpg.py``:

.. code-block:: python

   class BaseCPG:
       Foo: Type[Foo]

   # further down, in _attach_node_models:
   self.Foo = make_node_class(self.Node, _Foo)

   # even further down:
   self._node_model_classes = [
       self.ASMBlock,
       ...
       self.Foo,
   ]

Adding Edges
^^^^^^^^^^^^

For edges, you don't need to modify the ``BaseCPG``. However, if you'd like attributes to show up on nodes based on their connecting edges, you'll need to describe their relationships using ``edge_relationship``. For example, here's how an ``ASMGlobalVariable`` is connected to its ``DWARFType``:

.. code-block:: python

   @declared_attr
   def dwarf_type(cls) -> RelationshipProperty:
       return cls.edge_relationship(
           EdgeKind.HAS_DWARF_TYPE, cls.cpg.DWARFType, backref="global_variables"
       )

``backref`` works the same as in SQLAlchemy, meaning that this code produces two new APIs:

.. code-block:: python

   global_var.dwarf_type  # => DWARFType
   some_type.global_variables  # => [ASMGlobalVariable]

******************
Adding a POI Query
******************

To write an analysis, you need to define two classes:

1. a ``POI`` class, which is a record describing the point of interest
2. an ``Analysis`` class, which contains a ``run`` method with the logic of your analysis

``Analysis`` subclasses yield instances of ``POI`` subclasses as soon as they are computed. Here are all of the contractual requirements for POIs and Analyses:

- POIs must be JSON serializable.
- The ``POI`` base class provides default ``to_json`` and ``from_json`` methods, which call ``json.dumps`` and ``json.loads``, respectively. If your POI subclass has simple fields (such as strings), this will be sufficient. If your POI has more complex fields (such as objects), you must define your own ``to_json`` and ``from_json`` methods that describe how to serialize and deserialize instances of your subclass.
- Analyses must provide a ``run`` method with exactly the expected signature.
- Analyses should yield instances of the POI subclass they are returning as soon as possible.
- Analysis scripts must contain exactly one instance of an ``Analysis`` subclass.
- Analyses may log using calls to ``logger.{log_level}``, but **MUST NOT** print.
- The analysis module itself **MUST** provide an ``analysis_background`` field, in Markdown, that will be rendered as context for each POI produced.
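Putting those requirements together, here is a minimal sketch of what an analysis module might look like. The import path, the ``UncheckedFoo``/``UncheckedFooAnalysis`` names, and the exact ``run`` signature are assumptions for illustration; consult the existing analyses under ``frontend/mate/mate/poi`` for the real API.

.. code-block:: python

   # A hypothetical analysis module; names and signatures are illustrative.
   import logging

   from mate.poi import POI, Analysis  # import path is an assumption

   logger = logging.getLogger(__name__)

   # Required: Markdown that is rendered as context for each POI produced.
   analysis_background = """
   This (hypothetical) analysis flags calls to `foo` whose argument is unchecked.
   """


   class UncheckedFoo(POI):
       """A record describing one unchecked call to `foo`.

       Only simple (string) fields are used here, so the default
       ``to_json``/``from_json`` implementations suffice.
       """

       def __init__(self, location: str) -> None:
           self.location = location


   class UncheckedFooAnalysis(Analysis):
       # The exact ``run`` signature is an assumption; match the real base class.
       def run(self, session, graph):
           logger.info("looking for unchecked calls to foo")  # log, never print
           # ... query the CPG here, yielding each POI as soon as it is found:
           yield UncheckedFoo(location="example.c:42")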
Here's a checklist of things to think about when adding a POI query:

- Add a test program and accompanying test harness. Ideally, this program will have true positives, false positives (where static information is insufficient or the query could stand to be improved), true negatives, and false negatives.
- Add the new query to the tests in ``tests/integration/poi/test_pois.py``.
- Test running the POI query on some mid-size programs to ensure it's sufficiently performant.
- Consider whether there are opportunities to push work from Python into the database (which is generally much faster).
- Consider the directionality of your queries: are there reasons to think that forward or reverse queries might be more performant or precise?
- Consider how your queries do or don't make use of context sensitivity in the CPG; context sensitivity can help build much more precise queries!

****************************
Querying the SQL DB Directly
****************************

It can be instructive to directly query the SQL database that stores the MATE CPG and other runtime state. After bringing up the MATE DB with ``docker-compose``, try the following to list the tables in the database (replacing ``mate_db_1`` with the appropriate container name):

.. code:: bash

   docker exec -it mate_db_1 psql --username=mate --command='\dt'

You can then attempt the following queries, replacing ``nodes`` or ``edges`` with some specific build's node or edge table (which will look something like ``edges_9b71d8a6f09c43989785682196f98bfd``).

Listing Nodes/Edges
===================

You can list all the nodes or edges in a graph by issuing direct table scans. Use ``LIMIT`` to only see a sample.

.. code:: sql

   SELECT * FROM nodes LIMIT 20;

.. code:: sql

   SELECT * FROM edges LIMIT 20;

Filtering on Node/Edge Attributes
=================================

To restrict your listing to specific attributes, add a ``WHERE`` clause and use PostgreSQL's JSON operators (``->`` and ``->>``).

.. code:: sql

   SELECT * FROM edges WHERE (attributes ->> 'edge_kind')::int = 11 LIMIT 20;

Documentation on how to manipulate JSON can be found in the `PostgreSQL documentation <https://www.postgresql.org/docs/current/functions-json.html>`__.

Joins
=====

To join a known number of tables, list them in the ``FROM`` clause and correlate them in the ``WHERE`` clause.

.. code:: sql

   SELECT sources.attributes ->> 'llvm_type' AS source_type,
          targets.attributes ->> 'llvm_type' AS target_type
   FROM nodes AS sources, edges, nodes AS targets
   WHERE sources.uuid = edges.source
     AND edges.target = targets.uuid
     AND sources.attributes ->> 'llvm_type' = 'i32'
     AND (edges.attributes ->> 'edge_kind')::int = 11;
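Filtering like the above can often be expressed through the Python query interface instead (and, per the POI checklist above, still pushed into the database). Here is a rough sketch, assuming ``session`` and ``graph`` objects obtained as in the shell example in the next section, and assuming the node models expose their JSON ``attributes`` column; both are assumptions, not verbatim MATE API.

.. code-block:: python

   # A sketch of the attribute-filtering query above, via the SQLAlchemy-based
   # query interface. ``graph.Node.attributes`` and its JSON accessor are
   # assumptions for illustration.
   i32_nodes = (
       session.query(graph.Node)
       .filter(graph.Node.attributes["llvm_type"].as_string() == "i32")
       .limit(20)
       .all()
   )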
*****************
Running the Shell
*****************

MATE provides the capability to open an IPython shell which automatically connects to a running database instance. This shell can be run as follows::

   docker-compose -f docker-compose.yml -f docker-compose.clients.yml run shell

Once inside the shell, the database object ``db`` is available and preconfigured, and can be used to make queries using the MATE query language. For example::

   >>> session = db.new_session()
   >>> session.query(db.Build).count()
   1
   >>> build = session.query(db.Build).one()
   >>> print(build.uuid)
   "e4e8e8e5e9c848fd959463d568e30194"
   >>> graph = session.graph_from_build(build)
   >>> session.query(graph.Node).count()
   909
   >>> session.query(graph.Edge).count()
   2045

***********************
Internal state machines
***********************

MATE executes a variety of asynchronous tasks internally, the majority of which are implemented in ``mate.tasks`` (i.e., under ``frontend/mate/mate/tasks``). Nearly all tasks are modeled as state machines, with a limited number of valid transitions between states. Each is shown below, with terminal states highlighted in red.

.. image:: /schemata/statemachine.AnalysisTaskState.png
   :alt: State transition diagram for AnalysisTaskState nodes
   :target: ./schemata/statemachine.AnalysisTaskState.png

.. image:: /schemata/statemachine.BuildState.png
   :alt: State transition diagram for BuildState nodes
   :target: ./schemata/statemachine.BuildState.png

.. image:: /schemata/statemachine.CompilationState.png
   :alt: State transition diagram for CompilationState nodes
   :target: ./schemata/statemachine.CompilationState.png

.. image:: /schemata/statemachine.MantiserveTaskState.png
   :alt: State transition diagram for MantiserveTaskState nodes
   :target: ./schemata/statemachine.MantiserveTaskState.png

****************
Pointer Analysis
****************

Running Souffle
===============

You can run the pointer analysis in the Souffle interpreter with a Shake target::

   ./shake.sh run-souffle -- -- 1-callsite file.ll

The output will be stashed in ``.out/cache/pointer-analysis``.

N.B. this runs the analysis in "debug" mode, invoking both the unification analysis and the subset analysis and including all relevant asserts. The individual pointer analyses can be run (without debug asserts) via the ``run-souffle-unification`` and ``run-souffle-subset`` targets, respectively.

Profiling
---------

Souffle has a `history `_ of `issues `_ `with `_ its profiler, so YMMV depending on the most recent release.

You can profile Souffle using the ``run-souffle`` target above, with the addition of the ``-p`` flag (if you built Souffle from source, you'll have to modify the ``run-souffle`` Shake target to use your patched version of Souffle).

.. code-block:: bash

   ./shake.sh run-souffle -- -- 1-callsite your-prog.bc -c -p your-prog-profile.json
   souffleprof -j your-prog-profile.json

Debugging the pointer analysis results
======================================

:ref:`cclyzer++ ` stores its CSV results in a compressed format. As a result, the CSV intermediates produced by running a CPG build with ``debug_pointer_analysis=true`` are not initially human readable.
To decompress the CSV intermediates for manual debugging, you can use ``gunzip``:

.. code-block:: bash

   # -c emits to stdout
   gunzip -c /path/to/intermediates/whatever.csv.gz

The results produced by the pointer analysis itself (from cclyzer++'s FactGenerator) are similarly compressed, and can be decompressed with the same ``gunzip`` invocation.

Plots
=====

To evaluate the impact of different context sensitivity strategies, there is some code to collect and plot statistics generated by the pointer analysis in ``llvm/PointerAnalysis/stats``. Here's an example of how to use the program to evaluate three choices of context sensitivity across a few different input programs. The example works when run inside the ``mate-dev`` image, in the repository root:

.. code-block:: bash

   pip install -r llvm/PointerAnalysis/stats/requirements.txt

   python3 llvm/PointerAnalysis/stats/stats.py -vv collect \
       --fact-generator-exe="$PWD/.out/build/llvm/PointerAnalysis/factgen-exe" \
       --extra-souffle-arguments="-j8" \
       --extra-souffle-arguments="-L .out/build/llvm" \
       --extra-souffle-arguments="-lFunctors" \
       --context-sensitivity="insensitive" \
       --context-sensitivity="1-callsite" \
       --context-sensitivity="2-callsite" \
       path/to/output/directory \
       llvm/PointerAnalysis/datalog/statistics.project \
       program1.bc \
       program2.ll \
       program3.ll

   python3 llvm/PointerAnalysis/stats/stats.py -vv report \
       path/to/output/directory \
       path/to/report/directory

************************************
Publishing the MATE packages to PyPI
************************************

.. note::

   These steps can be performed automatically via the ``pydists`` target in the ``shake`` build:

   .. code-block:: console

      docker run --rm -v $(pwd):/mate -it mate-dev:latest ./shake.sh -j pydists

   When run this way, all distribution files will be written to ``.out/build/pydist/``.

.. note:: These steps will become mostly obsolete once MATE is open sourced.

.. note:: You need to have a PyPI account (and be an owner/publisher of the relevant package) to follow these steps.

In general, there are three steps to publishing one of the ``frontend/`` packages:

#. *Install packaging dependencies*

   On your local host:

   .. code-block:: console

      $ pip install -U setuptools wheel build twine

#. *Build the package*

   Using ``mate-common`` as an example:

   .. code-block:: console

      $ cd frontend/mate-common
      # build both source and binary (wheel) distributions
      $ python -m build --sdist --wheel

   .. note::

      By default, ``build`` will create ``build/`` and ``dist/`` folders in ``$CWD``. This will confuse our linting tools (``mypy`` and ``bellybutton``), so you may want to remove them before running any linting actions.

   Then, confirm that the packages were built as expected:

   .. code-block:: console

      $ ls dist/*

#. *Publish the package files*

   From the same directory as above:

   .. code-block:: console

      $ twine upload dist/*

   .. note:: ``twine`` will prompt you for your PyPI password, unless you have it (or an API token) preconfigured.

   .. note:: Make sure that ``dist/`` has nothing other than the new distribution files in it; preexisting distribution files will be rejected by PyPI.

.. _history:

*******
History
*******

MATE was originally developed as part of the DARPA CHESS program. MATE may contain some amount of code for, documentation of, or references to integration with other systems ("TAs", "challenges", "challenge broker"). Furthermore, MATE was originally closed-source and developed on a private GitLab instance hosted at Galois, Inc., and may contain references to Galois infrastructure or processes.
We've tried our best to remove superfluous code, comments, and documentation in the open-sourcing process.

Several components were de-prioritized during development, such as:

- The trace-collection and analysis machinery
- The patcher
- Directed symbolic execution with Manticore

Furthermore, some features of the CI system and development flow have not been ported over to GitHub.

Contributors
============

The following individuals contributed to the initial development of MATE during the DARPA CHESS program:

- Aaron Bembenek
- Alex Cameron
- Andrei Stefanescu
- Andrew Kent
- Ankita Singh
- Annie Cherkaev
- Artem Dinaburg
- Ben Davis
- Ben Selfridge
- Boyan Milanov
- Brad Swain
- Carson Harmon
- Eric Hennenfent
- Eric Kilmer
- Jason Graalum
- Karl Smeltzer
- Kevin Zhang
- Langston Barrett
- Michelle Cheatham
- Niki Carroll
- P.C. Shyamshankar
- Richard Jones
- Rory Mackie
- Scott Moore
- Sonya Schriner
- Stephen Chong
- Ted Hille
- Trent Brunson
- Weston Hopkins
- William Woodruff

*******************
CI and Git Branches
*******************

.. note::

   Since :ref:`MATE is not actively developed by Galois, Inc. `, this section is somewhat :ref:`historical ` in nature.

Background
==========

MATE has a large test suite, ranging from unit tests of small pure functions to whole-system integration tests running in Docker Compose. However, more tests running on every commit are not *always* better. We have to weigh the additional assurance granted by running more tests against the time they take to run. The central concerns to consider when striking this balance are:

- Level of assurance for deliverables to the client
- Depth of features in deliverables to the client
- Developer productivity and happiness

(None of these concerns correspond directly to more/slower tests or fewer/faster tests.)

Implementation
==============

Development of MATE happens in merge requests against the ``main`` branch, which has a suite of tests that run on every commit. MATE has enough long-running tests that it's not feasible to run all of them against every commit before merging to ``main``. Therefore, ``main`` is periodically merged into the ``stable`` branch, and the following script decides when to run a longer, more complete suite of tests:

.. literalinclude:: ../../../ci/on-stable.sh
   :language: sh

Specifically, when the above script exits with 0:

- The environment variable ``MATE_INTEGRATION_TESTS`` is set to ``1`` before running ``pytest``, which causes more tests to be run, with slightly different settings,
- All C++ components of MATE are built with the Clang Static Analyzer,
- All C++ components of MATE are built with LLVM's Address Sanitizer and Undefined Behavior Sanitizer, and the test suite is run on top of each of these builds, and
- The ``--release-build`` flag is passed to Shake (see :ref:`Building Releases `).

Building Releases
=================

Releases are cut from the ``stable`` branch. The primary release artifact is the ``mate-dist`` Docker image, which is built by CI and uploaded to the Galois-maintained Artifactory instance.

Release builds are built with Shake's ``--release-build`` flag, which:

- Disables assertions, so that fewer errors are fatal
- Turns on more aggressive optimization flags, which cause build times to skyrocket but significantly improve performance

The effect of this flag is predominantly implemented in ``llvm/CMakeLists.txt`` and ``llvm/PointerAnalysis/CMakeLists.txt``.

*********************
Development practices
*********************
.. note::

   Since :ref:`MATE is not actively developed by Galois, Inc. `, this section is largely :ref:`historical ` in nature.

This section collects the practices that the whole MATE team has agreed to uphold. These practices continue to develop as we work together. The two major goals of these practices are:

1. To ease the maintenance of MATE in the long term
2. To reduce the potential for friction and inefficiency as a result of mismatched expectations

To attend to (2), we need to agree on how consistently we uphold each of these practices. In the following sections, the word *consistently* will prefix practices that we do almost always, and *aspirationally* will be used for ones that aren't necessarily expected, but are nice to do. Some of these are *invariably* enforced, meaning that compliance is guaranteed by some automated mechanism (usually a GitLab setting).

Documenting development practices
=================================

*Consistently*: documenting our development practices in writing ensures we're all on the same page. Changes to the practices should be reflected in this document, and should be approved by the whole team.

Using GitLab
============

MRs
---

*Invariably*: Code is added to the main branch exclusively through merge requests (MRs). In practice, this means developing on a branch off of main and opening an MR based on that branch.

*Aspirationally*: MRs link to relevant :ref:`issues `, especially ones that should be considered "done" once the MR is merged. This helps us track exactly when and how issues got fixed.

.. _mr-review:

Assigning reviewers
^^^^^^^^^^^^^^^^^^^

*Consistently*: Every MR is assigned to one or more collaborators for :ref:`review `. The assignee tends to be, but doesn't have to be, someone familiar with that part of the codebase. When assigning the reviewers, the MR author should ping them in the ``~review-requests`` channel on Mattermost, and should do so every time the MR is ready for another round of reviews. One approval is required before merging. The assignee should review within a business day, or reassign the MR to someone else. Unless the author needs specific feedback from the assignee(s), it's OK to merge if a collaborator other than the assignee(s) reviews the MR.

Squashing
^^^^^^^^^

*Aspirationally*: To preserve a clean Git history (see also :ref:`git`), we often squash the many commits added to an MR during the review/test cycle into one commit on the main branch. However, it is not a requirement to squash commits, and in fact it can be detrimental to a meaningful Git history to squash together unrelated changes (e.g. one commit that is purely organizational with one that changes functionality or fixes a bug). MR authors should use their best judgement, and :ref:`label the MR appropriately `.

*Consistently*: As a rule, each commit on a non-squashed MR should pass CI, i.e. each commit on the main branch should always build and pass tests.

Rebasing
^^^^^^^^

*Invariably*: Every MR has to be rebased against the current main branch (GitLab conveniently provides a button for doing this). This ensures that the MR is run with the latest tests, and that another recently merged MR didn't subtly conflict with it. Unfortunately, it also means that all the tests need to run again, and they can take a while.

*Aspirationally*: After rebasing a branch, the rebaser should cancel any ongoing CI runs for that same MR. This helps reduce load on the CI runners, speeding up other builds.
Merging
^^^^^^^

*Consistently*: The author of the MR hits the merge button after it has been approved. This prevents merging in code that may be approved, but that the author may wish to make more changes to before merging. Reviewers can hit the merge button for trivial changes, but should generally defer to the author. :ref:`The section on MR labels ` describes how the author can indicate that the MR is ready to be merged.

.. _mr-labels:

MR Labels
^^^^^^^^^

*Aspirationally*: When an MR has been approved by a reviewer and is ready to merge, the author should add the ``MR-ready/squash`` or ``MR-ready/no-squash`` label, indicating that the MR is ready to be merged, and that other collaborators should rebase it and merge it if another MR is merged first. See also :ref:`the section on issue labels `.

WIP MRs
^^^^^^^

*Aspirationally*: If your MR is not ready for a full review and merge, but you'd like someone to have a head start on looking at your progress, prefix the title of the MR with "WIP:", then remove this prefix and notify the assignee(s) when it is ready for review.

Issues
^^^^^^

*Aspirationally*: We use issues to track bugs, plan features, and as a forum for discussions. Creating issues and assigning them to yourself helps others know what you're up to, and prevents duplication of effort.

.. _issue-labels:

Issue Labels
^^^^^^^^^^^^

*Aspirationally*: There are a few labels that indicate the priority of issues. These are helpful for searching and sorting issues, especially in :ref:`milestones `. Collaborators should create and use additional GitLab labels however they find helpful.

.. _milestones:

Milestones
^^^^^^^^^^

*Aspirationally*: We assign issues and MRs to milestones to track our progress, ensure necessary functionality and fixes get merged before releases, and align our development with the Statement of Work.

.. _code-review:

Code review
===========

We have no specific practices surrounding code review (other than :ref:`a review is required for each MR `). See also :ref:`further-reading`.

Testing
=======

*Aspirationally*: We generally expect that code added to MATE has at least one associated unit test, i.e., a test that specifically exercises the new functionality, independently from as much other code as possible. These tests run in CI on every merge request.

.. _git:

Using Git
=========

Commit messages
---------------

*Aspirationally*: We strive for meaningful commit messages. It's especially important to review the commit message that GitLab will add to your MR when squashing (sometimes it makes a weird choice).

Linting and formatting
======================

*Invariably*: All code introduced in MATE passes a suite of linters and formatters. These tools are integrated with the Shake build system, so developers can run the following:

.. code-block:: bash

   ./shake.sh lint
   ./shake.sh format

to lint and auto-format their code, respectively.

Automatically formatting the code has a few advantages:

- It ensures consistency in style across our codebase, hopefully making the code more readable.
- It saves developer time, because you need to spend less time fixing up your hastily-written scripts to look nice.
- It reduces friction during code reviews, because disagreement about how code should be formatted is front-loaded into the selection and configuration of a formatting tool.

Linting shares many of the same advantages with respect to readability, but also helps us avoid common bugs, such as type errors.

Mypy
----

*Aspirationally*: When possible, adding Mypy-compatible type annotations makes the code easier to refactor, communicates the author's intentions, and catches some type errors at lint-time (rather than at run-time). Sometimes it's not possible, or not worth the effort, to add type annotations to code, due to the limitations of Mypy's type system (for example, in ``mate_query.cpg.models``).
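As a small illustration of the kind of annotations Mypy checks, here is a hypothetical (non-MATE) helper; ``typing.List`` and ``typing.Optional`` are used for Python 3.8 compatibility:

.. code-block:: python

   # A hypothetical, fully annotated helper; Mypy can verify that callers
   # handle the ``None`` case and pass arguments of the right types.
   from typing import List, Optional


   def first_with_prefix(names: List[str], prefix: str) -> Optional[str]:
       """Return the first name starting with ``prefix``, or ``None``."""
       for name in names:
           if name.startswith(prefix):
               return name
       return None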
.. _further-reading:

Further reading
===============

Galwegians interested in learning more about internally-recommended software engineering practices should take a look at `the relevant wiki page `_. In particular, engineers at Galois have identified the following pages as good resources for code reviewers.

- `Palantir's blog post, "Code Review Best Practices" `_
- `Karl E. Wiegers' essay, "Humanizing Peer Reviews" `_