
· 10 min read
Simon Marlow

I just uploaded glean-0.2.0.0 to Hackage, along with releases of the Haskell Thrift compiler and other dependencies.

Since version 0.1.0.0, Glean has been installable using plain cabal install which vastly improves on the previous complex build process. For full details, see Building Glean From Source, but to summarise: on a recent Linux distro, with GHC 9.2-9.6 and cabal 3.6+, install some prerequisite system packages (listed in the building docs above), and then just

cabal install glean

The build takes a while, partly because one of the dependencies is a cabal-packaged copy of the "folly" C++ library (folly-clib) and cabal doesn't currently build C++ files in parallel.

Changes in 0.2.0.0

Some pretty big things have landed:

  • Glean now comes with a generic LSP server, glean-lsp, which supports common IDE operations like go-to-definition, go-to-references, hover documentation, and symbol search. This means you can index a software project with Glean and then browse it using VS Code (for example). I'll give a couple of worked examples below showing step-by-step how to do this for some real world codebases.

  • Glean has a new experimental DB backend based on LMDB. LMDB is much smaller and simpler than RocksDB, and in most of our benchmarks it performed around 30-40% better. We're still investigating some performance issues encountered with very large indexing jobs, though. Currently it's still not possible to build Glean without the RocksDB dependency, but we do intend to fix this in the future.

  • Added a new Haskell indexer that consumes .hie files directly, and collects much richer data than the old indexer - in particular it indexes local variables and collects type information for all variable occurrences, which appears on hover with glean-lsp and VS Code.

  • We're now also releasing the C++ indexer as a Cabal package along with Glean: glean-clang, so Glean can be used to index C++ projects out of the box.

Examples

Here are a couple of things you can play with, once you've built and installed Glean.

Index LLVM + Clang and browse it in VS Code

Clone an LLVM source tree:

git clone https://github.com/llvm/llvm-project.git

Configure, including Clang. This step also produces the compile_commands.json file that Glean will later use during indexing:

cd llvm-project/llvm
mkdir build && cd build
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DCMAKE_BUILD_TYPE=Debug -DLLVM_ENABLE_PROJECTS=clang ..

Next, build LLVM. This step is required because LLVM includes a lot of generated code which is produced as part of the build process, so to index the source files we need to ensure all the generated code has been built first.

cmake --build . -j12

Go and get a coffee. Or two. (Beware: even with 32GB of RAM this build tends to OOM my laptop, so you might want to reduce -j12 to something lower.) Next, we can index the project using Glean's C++ indexer.

If you haven't already installed Glean's C++ indexer, do that now:

cabal install glean-clang

Next we'll run the indexer. We'll store the resulting DB in llvm-project/gleandb for now.

cd ..
glean --db-root gleandb index cpp-cmake --db llvm/1 --cdb-dir "$(pwd)/llvm/build" . -j12

Go and get another coffee... this is essentially running the compiler over all the C++ code again. It should need no more than 16GB or so with 12 indexer processes running in parallel.

Note that you need to do this from the llvm-project directory: this ensures that the filenames in the Glean DB are relative to that directory, which is what glean-lsp expects. (Storing the data under the wrong filenames is the most common cause of things not working when we connect up the full IDE/LSP/Glean stack.)

Next we need to set up VS Code and glean-lsp. There are full instructions for glean-lsp in its README, but here's specifically how to set it up for LLVM using the index we just created.

First install glean-lsp if you haven't already:

cabal install glean-lsp

To use this LSP server with VS Code, you need a generic LSP client such as Generic LSP Client (v2). Install that extension in VS Code, and then create llvm-project/.vscode/settings.json:

{
  "glean-lsp": {
    "repo": "llvm"
  },
  "glspc.server.command": "glean-lsp",
  "glspc.server.commandArguments": ["--db-root", "gleandb"],
  "glspc.server.languageId": [
    "cpp", "c"
  ]
}

Now in VS Code, "Open Folder" and select the llvm-project folder. If you have another C++ extension installed, it probably makes sense to disable it for this folder, otherwise you'll see responses from both extensions for things like go-to-definition.

Open a source file, e.g. llvm-project/clang/include/clang/AST/Decl.h. You should have code navigation features available: holding down Ctrl while moving the mouse around should underline identifiers, and clicking on an identifier should jump to its definition. You should be able to right-click "Go to References" on a definition to find references throughout the whole LLVM + Clang tree instantly, and Ctrl+T for symbol search should work. Hovering the mouse over an identifier should show its type.

If things aren't working, then the first place to look for problems is in the output window for the Generic LSP Client: show the Output window, and then select Generic LSP Client from the dropdown on the right.

You can also open the DB in Glean's shell to check that it looks right:

glean shell --db-root llvm-project/gleandb --db llvm

Try e.g. :stat to see the contents, and try src.File _ to show known source files.

Download a DB of Stackage and try some queries

You can download a DB of Stackage 21.21 and try some queries. This DB was produced by building ~3000 packages in Stackage 21.21 and then producing a Glean DB from the .hie files; for more details see Indexing Hackage: Glean vs. hiedb.

Unpack the DB:

mkdir /tmp/glean && tar xf glean-stackage-21.21.tar -C /tmp/glean

and start the Glean shell:

$ glean shell --db-root /tmp/glean
Glean Shell, built on 2025-07-14 13:39:35.711312749 UTC, from rev <unknown>
Using local DBs from rocksdb:/tmp/glean
type :help for help.
>

Load the DB:

> :db stackage/1
stackage>

Let's see what's in it:

stackage> :stat
hs.ClassDecl.3
  count: 6503
  size:  350074 (341.87 kiB) 0.0309%
hs.ConstrDecl.3
  count: 89371
  size:  4048652 (3.86 MiB) 0.3569%
hs.DataDecl.3
  count: 40711
  size:  2017999 (1.92 MiB) 0.1779%
...
Total: 21735709 facts (1.06 GiB)

Let's find the class declaration for Hashable. First we have to find its name:

stackage> hs.Name { occ = { name = "Hashable" }}
{
  "id": 11500325,
  "key": {
    "occ": { "id": 11923, "key": { "name": "Hashable", "namespace_": 3 } },
    "mod": {
      "id": 733072,
      "key": {
        "name": { "id": 733071, "key": "Language.Preprocessor.Cpphs.SymTab" },
        "unit": { "id": 560159, "key": "cpphs-1.20.9.1-inplace" }
      }
    },
    "sort": { "external": { } }
  }
}
...
5 results, 20 facts, 7.40ms, 316816 bytes, 914 compiled bytes

We got 5 results, and only one of them was the one we wanted. So let's restrict the query to find only results in the hashable package:

stackage> hs.Name { occ = { name = "Hashable" }, mod = { unit = "hashable".. }}
{
  "id": 11924,
  "key": {
    "occ": { "id": 11923, "key": { "name": "Hashable", "namespace_": 3 } },
    "mod": {
      "id": 11922,
      "key": {
        "name": { "id": 11920, "key": "Data.Hashable.Class" },
        "unit": { "id": 11921, "key": "hashable-1.4.3.0-inplace" }
      }
    },
    "sort": { "external": { } }
  }
}

1 results, 5 facts, 1.18ms, 353848 bytes, 1489 compiled bytes

OK, now let's find the class declaration:

stackage> hs.ClassDecl { name = { occ = { name = "Hashable" }, mod = { unit = "hashable".. }}}
{
  "id": 19072033,
  "key": {
    "name": {
      "id": 11924,
      "key": {
        "occ": { "id": 11923, "key": { "name": "Hashable", "namespace_": 3 } },
        "mod": {
          "id": 11922,
          "key": {
            "name": { "id": 11920, "key": "Data.Hashable.Class" },
            "unit": { "id": 11921, "key": "hashable-1.4.3.0-inplace" }
          }
        },
        "sort": { "external": { } }
      }
    },
    "methods": [
      {
        ...
      }

1 results, 15 facts, 6.45ms, 514328 bytes, 1777 compiled bytes

Let's find the method names of the class:

stackage> (C.methods[..]).name.occ.name where C = hs.ClassDecl { name = { occ = { name = "Hashable" }, mod = { unit = "hashable".. }}}
{ "id": 21736733, "key": "hashWithSalt" }
{ "id": 21736734, "key": "hash" }

And finally, let's see how many instances in Stackage 21.21 provide a definition of hashWithSalt:

stackage> :count I where B = hs.InstanceBind { name = { occ = { name = "hashWithSalt" }, mod = { unit = "hashable".. }}}; hs.InstanceBindToDecl { bind = B, decl = { inst = I }}; 

267 results, 267 facts, 26.40ms, 644736 bytes, 2462 compiled bytes

To see where these instance declarations are:

stackage> I.loc where B = hs.InstanceBind { name = { occ = { name = "hashWithSalt" }, mod = { unit = "hashable".. }}}; hs.InstanceBindToDecl { bind = B, decl = { inst = I }}
{
  "id": 21736733,
  "key": {
    "file": { "id": 3982208, "key": "text-latin1-0.3.1/src/Text/Latin1.hs" },
    "span": { "start": 2662, "length": 157 }
  }
}
{
  "id": 21736734,
  "key": {
    "file": { "id": 11054592, "key": "shake-0.19.7/src/General/Thread.hs" },
    "span": { "start": 526, "length": 87 }
  }
}
{
  "id": 21736735,
  "key": {
    "file": { "id": 5786370, "key": "strict-tuple-0.1.5.3/src/Data/Tuple/Strict/T6.hs" },
    "span": { "start": 1887, "length": 265 }
  }
}
...

There are also some example queries in an earlier blog post (however, the schema for Haskell has changed in a few ways since that post, so some of the queries might not work exactly as written).

Index your own Haskell code

To index the code of a Cabal package, add the following to your cabal.project:

package *
ghc-options:
-fwrite-ide-info
-hiedir .hiefiles

Then

$ cabal build
$ glean index haskell-hie --db-root /tmp/glean --db mydb/1 .hiefiles

and then you can query the new DB in the shell:

$ glean shell --db-root /tmp/glean --db mydb/1

Run a Glass server and make some simple queries

Glass is a "symbol server": it provides a higher-level interface to the Glean data, with operations like documentSymbols for finding all the symbols in a file, and findReferences for finding all the references to a symbol. I used Glass to connect VS Code to Glean in the previous blog post.

Glass makes requests to a Glean server, so we need to start both glean-server and glass-server, like this:

$ glean-server --db-root /tmp/glean --port 12345

and in another terminal:

$ glass-server --service localhost:12345 --port 12346

Then we can make requests using glass-democlient, for example to list the symbols in the file src/Data/Aeson.hs in the aeson-2.1.2.1 package:

$ glass-democlient --service localhost:12346 list stackage/aeson-2.1.2.1/src/Data/Aeson.hs
stackage/hs/aeson/Data/Aeson/var/eitherDecodeFileStrict
stackage/hs/aeson/Data/Aeson/var/eitherDecodeFileStrict%27
stackage/hs/aeson/Data/Aeson/var/eitherDecodeStrict
stackage/hs/aeson/Data/Aeson/var/fp/4335/2
stackage/hs/aeson/Data/Aeson/var/eitherDecodeStrict%27
stackage/hs/aeson/Data/Aeson/tyvar/a/6101/50
stackage/hs/aeson/Data/Aeson/tyvar/a/6563/56
stackage/hs/aeson/Data/Aeson/var/encodeFile
stackage/hs/aeson/Data/Aeson/tyvar/a/7047/61
...

Each of those symbols is a "Symbol ID", which is a string that uniquely identifies a particular symbol to Glass. Using the Symbol ID we can find all the references to a symbol:


· 8 min read
Simon Marlow

This post describes how Glean supports incremental indexing of source code, to maintain an up-to-date index of a large repository using minimal resources. This is an overview of the problems and some aspects of how we solved them; for the technical details of the implementation see Implementation Notes: Incrementality.

Background

Indexing a large amount of source code can take a long time. It's not uncommon for very large indexing jobs to take multiple hours. Furthermore, a monolithic indexing job produces a large DB that can be slow to ship around, for example to replicate across a fleet of Glean servers.

Source code changes often, and we would like to keep the index up to date, but redoing the monolithic indexing job for every change is out of the question. Instead, we would like to update a monolithic index with just the changes between the base revision and the current revision, which should be faster than a full repository indexing job because we only have to look at the things that have changed. The goal is to index the changes in O(changes) rather than O(repository), or as close to that as we can get.

How incrementality works

To produce a modified DB, we hide a portion of the data in the original DB, and then stack a DB with the new data on top. Like this:

Incremental stack

The user asks for the DB "new", and they can then work with the data in exactly the same way as they would for a single DB. The fact that the DB is a stack is invisible to a user making queries. Furthermore, the DB "old" still exists and can be used simultaneously, giving us access to multiple versions of the DB at the same time. We can even have many different versions of "new", each replacing a different portion of "old".
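A toy model of this stacking behaviour (illustrative only: the fact IDs, the hidden set, and the lookup function here are invented for the sketch, not Glean's actual API):

```python
# Toy model of a Glean DB stack: queries against "new" see the new facts,
# plus any facts in "old" that the stack has not hidden. The "old" DB is
# never modified, so it remains usable on its own.

class StackedDB:
    def __init__(self, base, new_facts, hidden):
        self.base = base          # dict: fact id -> fact (the "old" DB)
        self.new = new_facts      # dict: fact id -> fact (the increment)
        self.hidden = hidden      # fact ids hidden in the base

    def lookup(self, fact_id):
        if fact_id in self.new:
            return self.new[fact_id]
        if fact_id in self.base and fact_id not in self.hidden:
            return self.base[fact_id]
        return None

old = {1: "f(x) at foo.c:10", 2: "g(y) at bar.c:3"}
# Re-indexing bar.c: hide fact 2 from the old DB and stack the new fact.
new = StackedDB(old, {3: "g(y) at bar.c:5"}, hidden={2})

assert new.lookup(1) == "f(x) at foo.c:10"   # unchanged fact still visible
assert new.lookup(2) is None                 # hidden by the increment
assert new.lookup(3) == "g(y) at bar.c:5"    # new fact
assert old[2] == "g(y) at bar.c:3"           # "old" itself is untouched
```

Note that nothing in `old` changes: many different increments could be stacked on the same base, each with its own hidden set.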

All of the interesting stuff is in how we hide part of the data in the old DB. So how does that work?

When facts are added to a Glean DB, the producer of the facts can label facts with a unit. A unit is just a string; Glean doesn't impose any particular meaning on units so the indexer can use whatever convention it likes, but typically a unit might be a filename or module name. For example, when indexing a file F, the indexer would label all the facts it produces with the unit F.

To hide some of the data in a DB, we specify which units to exclude from the base DB, like this:

glean create --repo <new> --incremental <old> --exclude A,B,C

would create a DB <new> that stacks on top of <old>, hiding units A, B and C.

So to index some code incrementally, we would first decide which files need to be reindexed, create an incremental DB that hides those files from the base DB and then add the new facts.

To implement hiding correctly, Glean has to remember which facts are owned by which units. But it's not quite that simple, because facts can refer to each other, and the DB contents must be valid (which has a formal definition but informally means "no dangling references"). For example, if we have facts x and y, where x refers to y:

Fact dependency

and we hide unit B, then y must still be visible, otherwise the reference from x would be dangling and the DB would not be valid.

So after the indexer has finished producing facts, Glean propagates all the units through the graph of facts, resulting in a mapping from facts to ownership sets.
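Here's a small sketch of that propagation step (a toy model, not Glean's implementation). A fact's ownership set ends up containing its own unit plus the units of everything that transitively refers to it, so a fact is hidden only if all of its owners are excluded:

```python
# Sketch of propagating units through the fact graph. We iterate to a
# fixpoint: whenever fact x refers to fact y, y inherits x's owners.

from collections import defaultdict

def propagate(direct_owner, refs):
    """direct_owner: fact -> unit it was indexed under.
    refs: fact -> list of facts it refers to."""
    owners = defaultdict(set)
    for fact, unit in direct_owner.items():
        owners[fact].add(unit)
    changed = True
    while changed:                      # fixpoint over the reference graph
        changed = False
        for fact, targets in refs.items():
            for t in targets:
                before = len(owners[t])
                owners[t] |= owners[fact]
                changed |= len(owners[t]) != before
    return owners

# x (unit A) refers to y (unit B), as in the diagram above.
owners = propagate({"x": "A", "y": "B"}, {"x": ["y"]})
assert owners["y"] == {"A", "B"}
# Hiding unit B leaves y visible, because owner A survives:
assert owners["y"] - {"B"}
```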

Fact dependency with ownership sets

It turns out that while there are lots of facts, there are relatively few distinct ownership sets. Furthermore, facts produced together tend to have the same ownership set, and we can use this to store the mapping from facts to ownership sets efficiently. To summarise:

  • Ownership sets are assigned unique IDs and stored in the DB using Elias Fano Coding
  • The mapping from facts to ownership sets is stored as an interval map

As a result, keeping all this information only adds about 7% to the DB size.
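A toy model of the interval map (the Elias-Fano coding of the sets themselves isn't modelled here): because consecutive runs of fact IDs tend to share an ownership set, we only need to store the first fact ID of each run.

```python
# Sketch of the fact -> ownership-set mapping as an interval map:
# a sorted list of (first_fact_id, ownership_set_id) run boundaries,
# looked up by binary search.

import bisect

class IntervalMap:
    def __init__(self, runs):
        # runs: sorted list of (first_fact_id, ownership_set_id)
        self.starts = [s for s, _ in runs]
        self.set_ids = [i for _, i in runs]

    def lookup(self, fact_id):
        # Find the last run starting at or before fact_id.
        i = bisect.bisect_right(self.starts, fact_id) - 1
        return self.set_ids[i]

# Fact IDs 1..99 -> set 0, 100..499 -> set 1, 500.. -> set 0 again:
# three entries cover any number of facts.
m = IntervalMap([(1, 0), (100, 1), (500, 0)])
assert m.lookup(42) == 0
assert m.lookup(250) == 1
assert m.lookup(500) == 0
```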

What about derived facts?​

Derived facts must also have ownership sets, because we have to know which derived facts to hide. When is a derived fact visible? When all of the facts that it was derived from are visible.

For example, if we have a fact r that was derived from p and q:

Derived fact

The ownership set of r is { P } & { Q }, indicating that it should be visible if both P and Q are visible. Note that facts might be derived from other derived facts, so these ownership expressions can get arbitrarily large. Normalising them to disjunctive normal form would be possible, but we haven't found that to be necessary so far.
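Here's a sketch of how such ownership expressions could be evaluated (illustrative, not Glean's actual representation): a leaf is an ownership set, visible if any of its units survives, and a conjunction is visible only if all of its operands are.

```python
# Sketch of evaluating an ownership expression against a set of
# excluded units. A leaf is an ownership set; ("and", [e1, e2, ...])
# is the conjunction produced by derivation.

def visible(expr, excluded):
    if isinstance(expr, set):            # ownership set, e.g. { P }
        return bool(expr - excluded)     # visible if any unit survives
    op, children = expr
    assert op == "and"
    return all(visible(c, excluded) for c in children)

# r derived from p (owned by { P }) and q (owned by { Q }): { P } & { Q }.
r = ("and", [{"P"}, {"Q"}])
assert visible(r, excluded=set())        # nothing excluded: r visible
assert not visible(r, excluded={"Q"})    # hide Q: r must be hidden too
```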

Performance

There are three aspects to performance:

  • Indexing performance. We measured the impact of computing ownership at indexing time to be 2-3% for Python, for Hack it was in the noise, and we don't expect other languages to be significantly different.
  • Query performance. Initially query performance for an incremental stack was much slower because we have to calculate the visibility of every fact discovered during a query. However, with some caching optimisations we were able to get the overhead to less than 10% for "typical" queries, of the kind that Glass does. Queries that do a lot of searching may be impacted by around 3x, but these are not typically found in production use cases.
  • Incremental derivation performance. We would like derivation in the incremental DB to take time proportional to the number of facts in the increment. We implemented incremental derivation for some kinds of query; optimising queries to achieve this in general is a hard problem that we'll probably return to next year.

Stacked incremental DBs

So far we have only been considering how to stack a single increment on top of a base DB. What if we want to create deeper stacks?

Stacked incremental database

The DB "newer" stacks on top of "new", and hides some more units. So there are now portions of both "new" and "old" that need to be hidden (the darker grey boxes), in addition to the original portion of "old" that we hid (light grey box).

As before, we might have multiple versions of "newer" stacked on top of the same "new", and in general these DB stacks form a tree. All the intermediate nodes of the tree are usable simultaneously: no data is being modified, only shared and viewed differently depending on which node we choose as the top of our stack.

One interesting aspect that arises when we consider how to track ownership of facts in this model is fact dependencies across the boundaries between DBs. For instance, suppose we have

Stacked dependency

If we hide B, then considering the ownership data for "old" alone would tell us that y is invisible. But we must make it visible, because x depends on it. So when computing the ownership data for "new", we have to transitively propagate ownership to facts in the base DB(s), store that in "new", and consult it when deciding which facts in "old" are visible.
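A sketch of that arrangement (a toy model with invented names): "new" stores an overlay of extra owners for facts that live in the base DB, and visibility consults both the base DB's own ownership data and the overlay.

```python
# Sketch of cross-DB ownership: when "new" is created, ownership is
# propagated into the base DB and recorded as an overlay stored in "new".

def visible(fact, excluded, base_owners, overlay):
    owners = base_owners.get(fact, set()) | overlay.get(fact, set())
    return bool(owners - excluded)

base_owners = {"y": {"B"}}   # as recorded in "old"
overlay = {"y": {"A"}}       # recorded in "new": x (unit A, in "new") refers to y

# Hiding B: "old" alone would hide y, but the overlay keeps it visible.
assert not visible("y", {"B"}, base_owners, {})
assert visible("y", {"B"}, base_owners, overlay)
```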

Derived facts in stacked incremental DBs

A fact might be derived from multiple facts in different DBs in the stack, and we have to represent its ownership correctly. Therefore

  • There must be a single namespace of ownership sets for the whole DB stack. That is, stacked DBs add more sets (the alternative, copying ownership sets from the base DB, doesn't seem attractive).
  • Since a fact may be owned multiple different ways (see previous section) we have to take this into account when computing the ownership expression for a derived fact.

This is the most complex diagram in the post (phew!):

Stacked derived

Here the dashed arrow means "derived from" and the solid arrow means "refers to".

The fact d should be visible if both x and y are visible. The ownership of x is {A} and y is {B,C} (because it is referred to from z which has owner B), so the final owner of d is {A} && {B,C}.

Tracking all this shouldn't be too expensive, but it's tricky to get right!