cxx-langstat

cxx-langstat is a clang-based tool to analyze the adoption and prevalence of language features in C/C++ codebases. Using clang's ASTMatchers library, it analyzes source code at the AST (abstract syntax tree) level to gain insights into the usage of high-level programming constructs. By finding all occurrences of a feature or construct in code and counting them, we learn how popular and prevalent it is.

Gaining insights is achieved in two steps, called "stages":

  • Emitting features from code: in this stage, the ASTs of the code are considered. By finding all instances of a feature (e.g. all variables that are constexpr) we can write a human-readable JSON file that contains all occurrences of a feature that interests us.
  • Emitting statistics from features: By using the occurrences from the JSON file from before, we can compute statistics e.g. by counting them.

This separation of the computation of statistics into two steps aids debugging (human-readable features) and avoids recomputation as we can compute new statistics from already extracted features, avoiding reparsing to get the AST or rematching of the AST.
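
The second stage can be pictured with a minimal sketch. It assumes a hypothetical in-memory `Feature` record and a hypothetical `countByKind` helper; neither is part of cxx-langstat, which stores features in JSON files instead. The point is only that a statistic is derived from already extracted occurrences, without parsing or matching again.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one extracted feature occurrence; the real tool
// writes occurrences to a JSON file instead.
struct Feature {
    std::string kind;      // e.g. "constexpr variable"
    std::string location;  // e.g. "a.cpp:1"
};

// Stage 2 sketch: derive a statistic (a count per kind) from stored features
// without touching the source code or the AST again.
std::map<std::string, int> countByKind(const std::vector<Feature>& features) {
    std::map<std::string, int> counts;
    for (const Feature& f : features) {
        ++counts[f.kind];
    }
    return counts;
}
```

A different statistic, for example a count per file, could be computed from the same `Feature` list without repeating the first stage.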

Apart from the analyses that the tool comes with (see below), it also has an API that allows you to register and execute new analyses.

cxx-langstat was developed as part of my Bachelor's thesis at ETH Zurich, see here for the full text.

Instructions

Requirements

Building

  1. Clone/download cxx-langstat project
  2. Download the single-include json.hpp from JSON for Modern C++ and put it in cxx-langstat/include/nlohmann or use one of the other suggested integration methods
  3. mkdir build && cd build
  4. cmake -G "<generator>" -DCMAKE_CXX_COMPILER=<C++ compiler> ../ (clang++, clang++-11 etc.)
  5. ninja or make to build the binary

Testing

The LLVM Integrated Tester (lit) is used to test whether features are correctly extracted from source/AST files. Install it using pip: pip install lit
From the build directory, run lit test -s. Use -vv to see why individual test cases fail.

Running

Basic usage: cxx-langstat [options]
Options:

  • -analyses=<string> Accepts a string of comma-separated shorthands of the analyses, e.g. -analyses=msa,ula will run MoveSemanticsAnalysis and UtilityLibAnalysis.
  • -in, -indir: specify input file or directory
  • -out, -outdir: specify output file or directory
  • -emit-features: compute features from source or AST files
  • -emit-statistics: compute statistics from features
  • -parallel or -j: number of parallel instances to use, works for -emit-features only

Example use cases:

Single file

To analyze a single source or AST file:

  1. Extract features:
    cxx-langstat -analyses=<> -emit-features -in Helloworld.cpp -out Helloworld.cpp.json
  2. Compute statistics:
    cxx-langstat -analyses=<> -emit-statistics -in Helloworld.cpp.json -out Helloworld.json

Whole project

To analyze a complete software project, specify its root using the -indir flag:

  1. Extract features:
    cxx-langstat -analyses=<> -emit-features -indir MyProject/ -outdir features/
    Because of -emit-features, this automatically considers all code files (.cpp, .h, .ast etc.).
  2. Compute statistics:
    cxx-langstat -analyses=<> -emit-statistics -indir features/ -out stats.json
    Because of -emit-statistics, this automatically considers all .json files.

Emitting features for a project creates one JSON file per input file, so make sure the output directory exists before running the command. Computing statistics creates a single JSON file.

Adding new analyses

A script and instructions for doing so will be merged into main soon.

Implemented Analyses

Algorithm Library Analysis (ALA)

Anecdotal evidence suggests that the C++ Standard Library algorithms are rarely used, which motivates an analysis to check this claim. ALA finds and counts calls to function templates from the STL algorithms. Currently it covers only the non-modifying sequence operations and the minimum/maximum operations; the C++20 std::ranges algorithms aren't considered.
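
To illustrate the kind of code ALA looks at, here is a small sketch that contrasts a hand-written loop with calls into the two covered categories. The function names are made up for this example, and the comments describe what one would expect the analysis to count, not verified tool output.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hand-written loop: no algorithm call, so there is nothing for ALA to count.
bool containsNegativeLoop(const std::vector<int>& v) {
    for (int x : v) {
        if (x < 0) return true;
    }
    return false;
}

// std::any_of is a non-modifying sequence operation.
bool containsNegative(const std::vector<int>& v) {
    return std::any_of(v.begin(), v.end(), [](int x) { return x < 0; });
}

// std::max_element is one of the minimum/maximum operations.
// Precondition: v must not be empty.
int largest(const std::vector<int>& v) {
    return *std::max_element(v.begin(), v.end());
}
```

Both `containsNegativeLoop` and `containsNegative` compute the same result; only the second one contributes a call to a Standard Library algorithm.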

Constexpr Analysis (CEA)

Finds and counts how often variables, functions and if-statements are (not) constexpr. This could help us learn how prevalent compile-time constructs are. However, we should probably distinguish between constructs that aren't constexpr and those that can't be. Currently, only some trivial conditions that prohibit constexpr-ness are checked, leading me to believe that we will underestimate the popularity of the keyword.
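
The three constructs and their non-constexpr counterparts look as follows. This is a sketch with made-up names that requires C++17 for `if constexpr`; `increment` is an example of a function that cannot be evaluated at compile time because it modifies a non-constexpr global.

```cpp
#include <cassert>
#include <type_traits>

constexpr int kLimit = 64;  // constexpr variable
int counter = 0;            // non-constexpr variable

constexpr int square(int x) { return x * x; }  // constexpr function
int increment() { return ++counter; }          // non-constexpr function

template <typename T>
int describe(T) {
    if constexpr (std::is_integral_v<T>) {  // constexpr if (C++17)
        return 1;
    } else {
        return 0;
    }
}

int clampToLimit(int x) {
    if (x > kLimit) {  // ordinary, non-constexpr if
        return kLimit;
    }
    return x;
}

// square(8) is evaluated at compile time.
static_assert(square(8) == kLimit, "square(8) should equal kLimit");
```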

Container Library Analysis (CLA)

CLA reports variable declarations whose type is a C++ Standard Library container: array, vector, forward_list, list, map, multimap, set, multiset, unordered_map, unordered_multimap, unordered_set, unordered_multiset, queue, priority_queue, stack, deque.
"Variable declarations" include member variables and function parameters. Other occurrences of containers are not considered.

Cyclomatic Complexity Analysis (CCA)

For each explicit (not compiler-generated) function declaration that has a body (i.e. is defined), CCA calculates the so-called cyclomatic complexity. Intuitively, this metric, developed by Thomas J. McCabe, measures for a "section of source code the number of independent paths within it, where linearly-independent means that each path has at least one edge that is not in the other paths." (https://en.wikipedia.org/wiki/Cyclomatic_complexity)
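
As a worked example, McCabe's formula for a control-flow graph with E edges, N nodes and P connected components is M = E - N + 2P. The sketch below applies it to a small function; `cyclomaticComplexity` and `countPositives` are illustrative names and do not show how CCA is implemented.

```cpp
#include <cassert>

// McCabe's formula: M = E - N + 2P.
int cyclomaticComplexity(int edges, int nodes, int components) {
    return edges - nodes + 2 * components;
}

// Control-flow graph of this function: 6 nodes (init, loop condition, if,
// increment of count, increment of i, return) and 7 edges, in one component.
// Equivalently, it has two decision points, so the complexity is 2 + 1 = 3.
int countPositives(const int* data, int n) {
    int count = 0;
    for (int i = 0; i < n; ++i) {  // decision point 1
        if (data[i] > 0) {         // decision point 2
            ++count;
        }
    }
    return count;
}
```

For `countPositives`, `cyclomaticComplexity(7, 6, 1)` yields 3, which matches counting the decision points and adding one.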

Function Parameter Analysis (FPA)

FPA extracts and counts the parameters of functions, function templates and their instantiations and specializations. This gives us insights about the commonness of the different kinds of parameters: by value, non-const lvalue ref, const lvalue ref, rvalue ref, forwarding ref.
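
The five kinds of parameters can be told apart in a short sketch. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// By value.
std::size_t byValue(std::string s) { return s.size(); }

// Non-const lvalue reference.
void byLvalueRef(std::string& s) { s += "!"; }

// Const lvalue reference.
std::size_t byConstLvalueRef(const std::string& s) { return s.size(); }

// Rvalue reference.
std::string byRvalueRef(std::string&& s) { return std::move(s); }

// T&& with a deduced T is a forwarding reference, not an rvalue reference:
// it binds to both lvalues and rvalues.
template <typename T>
std::size_t byForwardingRef(T&& s) {
    return byValue(std::forward<T>(s));
}
```

Note that `byRvalueRef` and `byForwardingRef` both spell their parameter with `&&`; only the deduced template parameter makes the latter a forwarding reference.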

Loop Depth Analysis (LDA)

LDA computes the depth of each loop and counts how common each depth is. Example of depth 2:

for(;;){
  do{
    doSomething();
  }while(true);
}

Currently the matchers for this analysis grow exponentially with the maximum loop depth to look for, which is not (yet) a problem since depths >5 are rare. Still, switching to a dominator tree-based approach might be favorable.

Loop Kind Analysis (LKA)

Extracts and counts for, while, do-while and range-based for loops in C++. This is especially interesting for investigating the adoption of range-based for since C++11. Limitation: not all occurrences of "traditional" loops can be converted to range-based loops, e.g. infinite loops, which might lead to LKA underestimating the popularity of range-based for.
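
For reference, the four loop kinds side by side, in a sketch with a made-up function that sums a vector once per loop kind:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the elements of v four times, once with each loop kind.
int sumAllKinds(const std::vector<int>& v) {
    int sum = 0;

    for (std::size_t i = 0; i < v.size(); ++i) {  // for
        sum += v[i];
    }

    std::size_t j = 0;
    while (j < v.size()) {  // while
        sum += v[j++];
    }

    std::size_t k = 0;
    do {  // do-while: the body runs at least once, so guard the access
        if (k < v.size()) sum += v[k];
    } while (++k < v.size());

    for (int x : v) {  // range-based for (since C++11)
        sum += x;
    }

    return sum;
}
```

Here every traditional loop could be rewritten as a range-based for, which is not true in general.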

Move Semantics Analysis (MSA)

  • Counts uses of std::move and std::forward.
  • For each type, counts how often by-value parameters at function call sites were constructed by copy or by move, respectively.
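
The difference between the two kinds of call sites can be made visible with a sketch. `Tracked` is a hypothetical type that records how it was constructed; it is not part of the tool.

```cpp
#include <cassert>
#include <utility>

// Hypothetical type that counts its copy and move constructions.
struct Tracked {
    static int copies;
    static int moves;
    Tracked() = default;
    Tracked(const Tracked&) { ++copies; }
    Tracked(Tracked&&) noexcept { ++moves; }
};
int Tracked::copies = 0;
int Tracked::moves = 0;

// By-value parameter: each call site constructs it by copy or by move.
void consume(Tracked) {}

void callSites() {
    Tracked t;
    consume(t);             // parameter copy-constructed from an lvalue
    consume(std::move(t));  // parameter move-constructed; one use of std::move
}
```

After one call to `callSites`, the counters show one copy and one move for the by-value parameter of type `Tracked`.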

Template Parameter Analysis (TPA)

For the different kinds of templates (class, function, variable, alias), TPA counts the different kinds of template parameters (non-type template, type template and template template parameters) and reports whether the template employs a template parameter pack. Note that TPA finds template parameter packs, but gives no information about function parameter packs.
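
A sketch with one template of each kind, covering all three kinds of template parameters and a template parameter pack. All names are made up for this example, and the variable template requires C++14.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Class template: one type parameter and one non-type parameter.
template <typename T, std::size_t N>
struct Buffer {
    std::array<T, N> data{};
};

// Function template with a template parameter pack. It has no function
// parameters at all, so there is no function parameter pack here.
template <typename... Ts>
constexpr std::size_t countTypes() { return sizeof...(Ts); }

// Class template with a template template parameter and a type parameter.
template <template <typename...> class Container, typename T>
struct Wrapper {
    Container<T> items;
};

// Variable template with a non-type parameter.
template <int V>
constexpr int doubled = 2 * V;

// Alias template with a type parameter.
template <typename T>
using Pair = Buffer<T, 2>;
```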

Template Instantiation Analysis (TIA)

Reports instantiations of templates, and counts how often certain class and function template instantiations were used. In the case of variable templates, the instantiation is reported but not (yet) counted due to oddities in clang's matchers.
Since the data types and functions analyzed in ALA, CLA and ULA are mostly templates, these analyses are based on TIA.

using Analysis (UA)

C++11 introduced type aliases (using keyword), which are similar to typedefs but can additionally be templated. This analysis aims to find out whether programmers have shifted from typedefs to aliases. It reports usage figures for typedefs, aliases, "typedef templates" (an idiom used to get around the typedef limitation mentioned above) and alias templates.

Pre-C++11:

// Regular typedef
typedef std::vector<int> IntVector;
// "typedef template" idiom
template<typename T>
struct TVector {
  // doesn't have to be named 'type'
  typedef std::vector<T> type;
};

Alias:

// alias
using IntVector = std::vector<int>;
// alias template
template<typename T>
using TVector = std::vector<T>;

Utility Library Analysis (ULA)

Similar to the Container Library Analysis, ULA analyzes the usage of certain class template types, namely the following C++ utilities: pair, tuple, bitset, unique_ptr, shared_ptr and weak_ptr.
It would be interesting to extend ULA to check whether auto_ptr is still used in C++.

Variable Template Analysis (VTA)

C++14 added variable templates. Previously, one used either class templates with a static data member or constexpr function templates returning the desired value. Here we analyze whether programmers have transitioned to the new concept by reporting usage of the three constructs.

Class template with static member:

template<typename T>
class Widget {
public:
    static T data;
};

Constexpr function template:

template<typename T>
constexpr T f1(){
    T data;
    return data;
}

Variable template (since C++14):

template<typename T>
T data;

Attributions

  • add_new_analysis.py and AnalysisList are derived from LLVM clang-tidy's add_new_check.py and GlobList, respectively, which are distributed under the Apache License v2.0 with LLVM Exceptions.
  • Parts of Driver.cpp and the MatchingExtractor were learned from and inspired by Peter Goldsborough's clang-useful tutorial (talk, code), and the CYC computation in CyclomaticComplexityAnalysis was derived from it.
  • Code to walk through directories used in Runner.cpp was copied from here.