How does CodeQL analyze code?


Implementing code scanning with CodeQL requires an understanding of how the tool analyzes code.

CodeQL analysis consists of three steps:

  1. Preparing the code, by creating a CodeQL database.
  2. Running CodeQL queries against the database.
  3. Interpreting the query results.

In this unit, you'll learn about each of these three steps.

Database creation

To create a database, CodeQL first extracts a single relational representation of each source file in the codebase.

For compiled languages, extraction works by monitoring the normal build process. Each time the build invokes a compiler on a source file, the extractor makes a copy of that file and collects all relevant information about the source code. This information includes syntactic data about the abstract syntax tree and semantic data about name binding and type information.

For interpreted languages, the extractor runs directly on the source code, resolving dependencies to give an accurate representation of the codebase.
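To make the idea of a relational representation more concrete, here's a toy sketch in Python. It is not CodeQL's extractor, and the row layout is an assumption made up for illustration. It only shows how a source file can be turned into rows of facts (here, function definitions and calls with their line numbers) that could later be queried.

```python
import ast


def extract_rows(source: str, filename: str = "example.py"):
    """Return sorted (table, filename, name, line) rows for a Python source string.

    The table names and row layout are hypothetical, not CodeQL's schema.
    """
    rows = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # Syntactic fact: a function is defined at this line.
            rows.append(("function", filename, node.name, node.lineno))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Syntactic fact: a simple named call occurs at this line.
            rows.append(("call", filename, node.func.id, node.lineno))
    return sorted(rows)


# A tiny sample "codebase" used only for this illustration.
SAMPLE = "def greet(name):\n    print(name)\n\ngreet('world')\n"

if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

A real extractor also records semantic information, such as which definition each name binds to, which this sketch leaves out.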

There is one extractor for each language supported by CodeQL to ensure that the extraction process is as accurate as possible. For multi-language codebases, databases are generated one language at a time.

After extraction, all the data required for analysis (relational data, copied source files, and a language-specific database schema that specifies the mutual relations in the data) is imported into a single directory, known as a CodeQL database.
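As a minimal sketch of this step, the following Python helper assembles a `codeql database create` command line. The directory names and the build command are hypothetical, and the helper itself is an illustration rather than part of CodeQL. It passes a build command only when one is given, which matches the difference between compiled and interpreted languages described above, and it builds one command per language.

```python
import shlex


def database_create_command(database_dir, language, source_root=".", build_command=None):
    """Build the argument list for `codeql database create` (one language per database)."""
    args = [
        "codeql", "database", "create", database_dir,
        f"--language={language}",
        f"--source-root={source_root}",
    ]
    if build_command is not None:
        # Compiled languages: CodeQL monitors this build to extract the code.
        args.append(f"--command={build_command}")
    return args


if __name__ == "__main__":
    # Hypothetical multi-language codebase: one database per language.
    plans = [
        ("db-java", "java", "mvn clean package"),
        ("db-javascript", "javascript", None),
    ]
    for database_dir, language, build in plans:
        command = database_create_command(database_dir, language, build_command=build)
        print(shlex.join(command))
```

The script only prints the commands. To run them, you could pass each argument list to `subprocess.run` on a machine where the CodeQL CLI is installed.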

Query execution

After you’ve created a CodeQL database, one or more queries are executed against it. CodeQL queries are written in a specially designed object-oriented query language called QL.

You can run the queries checked out from the CodeQL repo (or custom queries that you’ve written yourself) using the CodeQL for VS Code extension or the CodeQL CLI.
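If you use the CodeQL CLI, running queries might look like the following sketch, which assembles a `codeql database analyze` command in Python. The database directory, query path, and output file name are hypothetical. The result format is set to SARIF here as an assumption; choose whichever format your tools expect.

```python
import shlex


def database_analyze_command(database_dir, queries, output_path, result_format="sarif-latest"):
    """Build the argument list for `codeql database analyze`."""
    return [
        "codeql", "database", "analyze", database_dir,
        *queries,  # query files, directories, or query packs to run
        f"--format={result_format}",
        f"--output={output_path}",
    ]


if __name__ == "__main__":
    command = database_analyze_command(
        "db-java",                       # hypothetical database directory
        ["my-queries/EmptyBlock.ql"],    # hypothetical custom query
        "results.sarif",                 # hypothetical output file
    )
    print(shlex.join(command))
```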

Query results

The final step converts the results produced during query execution into a form that's more meaningful in the context of the source code. In other words, the results are interpreted so that they highlight the potential issue that each query is designed to find.

Screenshot of CodeQL query results.

Queries contain metadata properties that indicate how the results should be interpreted. For instance, some queries display a simple message at a single location in the code. Others display a series of locations that represent steps along a data-flow or control-flow path, along with a message explaining the significance of the result. Queries that don’t have metadata are not interpreted; their results are output as a table and not displayed in the source code.
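The following sketch shows what this metadata can look like and how it influences interpretation. The sample query and its `@id` are made up for illustration, and the small Python parser is a simplification rather than CodeQL's own logic. It reads the `@kind` property from the query's header comment and maps it to the presentation styles described above. It handles only single-line properties and only the `problem` and `path-problem` kinds.

```python
import re

# Hypothetical sample query; the metadata lives in the leading /** ... */ comment.
SAMPLE_QUERY = '''/**
 * @name Empty block
 * @description Finds empty blocks (hypothetical example).
 * @kind problem
 * @id example/empty-block
 * @problem.severity warning
 */
import java

from BlockStmt b
where b.getNumStmt() = 0
select b, "This block is empty."
'''


def parse_metadata(query_text):
    """Return the @property values found in the first /** ... */ comment."""
    match = re.search(r"/\*\*(.*?)\*/", query_text, re.DOTALL)
    if match is None:
        return {}
    metadata = {}
    for line in match.group(1).splitlines():
        tag = re.match(r"\s*\*?\s*@([\w.]+)\s+(.*)", line)
        if tag:
            metadata[tag.group(1)] = tag.group(2).strip()
    return metadata


def interpretation(metadata):
    """Map the @kind property to how results would be presented (simplified)."""
    kind = metadata.get("kind")
    if kind == "problem":
        return "alert: a message at a single location"
    if kind == "path-problem":
        return "path alert: a message plus a series of locations"
    return "uninterpreted: results are output as a table"


if __name__ == "__main__":
    metadata = parse_metadata(SAMPLE_QUERY)
    print(metadata)
    print(interpretation(metadata))
```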

Following interpretation, results are output for code review and triaging. In CodeQL for Visual Studio Code, interpreted query results are automatically displayed in the source code. You can output results generated by the CodeQL CLI into a number of different formats for use with different tools.
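As an example of using CLI output with another tool, here's a minimal sketch that summarizes results from a SARIF file, assuming SARIF is the format you chose. The sample document below is hand-written for illustration and is not real CodeQL output. The field names follow the SARIF 2.1.0 structure (`runs`, `results`, `message`, `locations`).

```python
import json

# Hand-written sample in SARIF 2.1.0 shape; all values are hypothetical.
SAMPLE_SARIF = json.dumps({
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "CodeQL"}},
        "results": [{
            "ruleId": "example/empty-block",
            "message": {"text": "This block is empty."},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": "src/Main.java"},
                    "region": {"startLine": 12},
                }
            }],
        }],
    }],
})


def summarize(sarif_text):
    """Return one 'file:line: [rule] message' string per result location."""
    sarif = json.loads(sarif_text)
    lines = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            message = result["message"]["text"]
            rule = result.get("ruleId", "?")
            for location in result.get("locations", []):
                physical = location["physicalLocation"]
                uri = physical["artifactLocation"]["uri"]
                # A region may be absent; fall back to line 1 in that case.
                line = physical.get("region", {}).get("startLine", 1)
                lines.append(f"{uri}:{line}: [{rule}] {message}")
    return lines


if __name__ == "__main__":
    for line in summarize(SAMPLE_SARIF):
        print(line)
```

To summarize a real results file, read it with `open(path).read()` and pass the text to `summarize`.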