Skip to content

Failure to properly handle return type metadata? #65

@sogaiu

Description

@sogaiu
Owner

As reported by @rrudakov in this clojure-ts-mode issue, the current tree-sitter-clojure grammar might be seen to not properly handle return type metadata from a semantic perspective.

As a concrete example (slightly edited from the original), consider:

(defn to-string
  ^String
  [arg]
  (.toString arg))

^String here refers to the return type of the function. A straight-forward interpretation might consider the metadata to not be for the argument vector [arg]. (Possibly one could choose to see it that way though as argument information and return type information have to do with the input(s) and output(s) of a function and in some languages end up being expressed right next to each other.)

For reference, at clojure.org, this is currently documented at the Type Hints section of the Java Interop page:

For function return values, the type hint can be placed before the parameter vector:

(defn hinted-single ^String [])

-> #user/hinted-single

(defn hinted
  (^String [])
  (^Integer [a])
  (^java.util.List [a & args]))

-> #user/hinted

In a comment in the aforementioned issue, an alternative approach to parsing metadata was mentioned:

I think having metadata as standalone nodes would solve a lot of code navigation issues at least in Emacs.

The clojure-ts-mode issue had this comment as well:

Expressions with metadata (BTW, I really don't like how the grammar works with metadata, it's pain to work with and causes more problems than benefits)

Additional details were elaborated upon in this comment.

So the proposed alternative approach for handling metadata may have multiple motivations.

I suspect this would involve a significant reworking of the grammar. May be creating a new grammar would allow additional improvements to be incorporated such as this issue as well as an additional point made about open nodes [1].

[1] Also made in this comment.

Activity

NoahTheDuke

NoahTheDuke commented on May 18, 2025

@NoahTheDuke

I suspect this would involve a significant reworking of the grammar.

Why would that be?

sogaiu

sogaiu commented on May 18, 2025

@sogaiu
OwnerAuthor

I'd like to jot down some further thoughts and the results of some investigations before attempting to answer:

I suspect this would involve a significant reworking of the grammar.

Why would that be?


One thing to note is that "the return type being specifiable via metadata that appears before the argument vector" appears to be a "post-read" thing in Clojure. AFAIU, the "ordinary" (other?) metadata cases are handled in LispReader.java during read or similar.

I think this line in core.clj may be where the identified return type case happens. (I'm not quite sure about this yet and might look into it further.)

Assuming this is correct for the moment, I think it's not unreasonable for the parse results from tree-sitter to associate the metadata with the argument vector because the metadata in question does actually live on the argument vector (at least initially -- I haven't checked whether it goes away or is removed).


On a different note, I thought about doing a "standalone metadata nodes" grammar a bit more and I think I may have tried this before [1]. I'm not sure what the results were unfortunately.

I thought it might have adverse impacts on things like the handling of #_ and what it corresponds to, but I'm not sure. Perhaps it can be made to work.


[1] ...though I'm not having luck turning up any code for that.

sogaiu

sogaiu commented on May 18, 2025

@sogaiu
OwnerAuthor

the metadata in question does actually live on the argument vector (at least initially -- I haven't checked whether it goes away or is removed).

I think the following is consistent with the supposition above about the "return type metadata" living on the argument vector (some lines added to improve readability):

$ clj
Clojure 1.12.0

user=> (def res (read-string "(defn to-string ^String [arg] (.toString arg))"))
#'user/res

user=> (prn res)
(defn to-string [arg] (.toString arg))
nil

user=> (binding [*print-meta* true] (prn res))
(defn to-string ^String [arg] (.toString arg))
nil

user=> (binding [*print-meta* true] (prn (nth res 2)))
^String [arg]
nil

So although it might seem a bit weird, I'm inclined to say that there is a kind of consistency and correctness to the way metadata is handled currently even in the return type case.

sogaiu

sogaiu commented on May 18, 2025

@sogaiu
OwnerAuthor

To return to this question:

I suspect this would involve a significant reworking of the grammar.

Why would that be?

Looking over grammar.js, there are more than 10 node types that currently "include" metadata. They are:

  • sym_lit
  • list_lit
  • map_lit
  • vec_lit
  • set_lit
  • read_cond_lit
  • splicing_read_cond_lit
  • ns_map_lit
  • var_quoting_lit
  • evaling_lit
  • tagged_or_ctor_lit
  • derefing_lit
  • quoting_lit
  • syn_quoting_lit
  • unquote_splicing_lit
  • unquoting_lit

At minimum I think one approach might include the following changes:

  • The node types mentioned above might all change to not contain metadata
  • A replacement metadata node would be allowed to appear pretty much anywhere (like _ws, comment, and dis_expr perhaps)

Whether one considers this a significant reworking or not, it doesn't seem like a compatible change nor is it a change to a small number of node types.


Assuming a second grammar / parser that implements the sort of idea described above is feasible...IIUC a number of editors support the use of more than one parser per buffer (at least to handle "injections", for example clojure-ts-mode seems to do this for docstrings and regular expressions) so it might be possible to use two parsers alongside each other for non-injection activities.

One idea is for a second (newer) parser to handle certain tasks that are awkward for the first parser and gradually migrate existing functionality over to the newer parser's functionality if that seems worth it. Alternatively, one might just decide to use both.

I think at least in Emacs it might be a practical path.

As mentioned elsewhere [1], if one were going to make a new grammar / parser though, it might be good to collect other "failings" or "wishes" related to the current parser.


[1] At least in this comment (string_contents node) and in this one (related to the open field).

rrudakov

rrudakov commented on May 18, 2025

@rrudakov

I thought it might have adverse impacts on things like the handling of #_ and what it corresponds to, but I'm not sure. Perhaps it can be made to work.

Currently in the dis_expr kind of absorbs the following node including metadata:

(defn foo
  #_^{:bla "meta"}
  [arg]
  body)
(list_lit open: (
  value: (sym_lit name: (sym_name))
  value: (sym_lit name: (sym_name))
  (dis_expr marker: #_
   value: 
    (vec_lit
     meta: 
      (meta_lit marker: ^
       value: 
        (map_lit open: {
         value: (kwd_lit marker: : name: (kwd_name))
         value: (str_lit) close: }))
     open: [
     value: (sym_lit name: (sym_name))
     close: ]))
  value: (sym_lit name: (sym_name))
  close: ))

And if there is more than one dis_expr, they are nested, which is interesting, but I guess valid:

(if ^boolean (= 2 2)
  #_#_true
  false)
(list_lit open: (
  value: (sym_lit name: (sym_name))
  value: 
   (list_lit
    meta: 
     (meta_lit marker: ^
      value: (sym_lit name: (sym_name)))
    open: (
    value: (sym_lit name: (sym_name))
    value: (num_lit) value: (num_lit) close: ))
  (dis_expr marker: #_
   (dis_expr marker: #_ value: (bool_lit))
   value: (bool_lit))
  close: ))

If meta_lit was a standalone node, the dis_expr node would have to skip zero or more meta_lit nodes and wrap the first non-meta node, is it possible to implement in the grammar?


I think the following is consistent with the supposition above about the "return type metadata" living on the argument vector

Looks like you're right, I also tried to following in REPL:

user=> (defn to-string #_ ^String [arg] (.toString arg))
Syntax error macroexpanding clojure.core/defn at (REPL:1:1).
.toString - failed: vector? at: [:fn-tail :arity-n :bodies :params] spec: :clojure.core.specs.alpha/param-list
(.toString arg) - failed: vector? at: [:fn-tail :arity-1 :params] spec: :clojure.core.specs.alpha/param-list
sogaiu

sogaiu commented on May 18, 2025

@sogaiu
OwnerAuthor

More later, but regarding this:

And if there is more than one dis_expr, they are nested, which is interesting, but I guess valid

To give some background for this, some people make use of "stacked" #_ as part of the ability to conveniently disable bits of code like in the following cases:

(let [x 1
      #_ #_ y 2]
  x)
{:a 1
 :b 2
 #_ #_ :c 3}

It's easier to put the multiple #_ instances together (and remove them) than to try to interleave them.

To spell things out a bit, in the latter example, interleaving would look like:

{:a 1
 :b 2
 #_ :c #_ 3}

which achieves the same end but is more awkward to put into place as well as remove.

AFAIU, the current grammar "understands" this "stacked" usage...at least the intent was for it do so (^^;

sogaiu

sogaiu commented on May 18, 2025

@sogaiu
OwnerAuthor

If meta_lit was a standalone node, the dis_expr node would have to skip zero or more meta_lit nodes and wrap the first non-meta node, is it possible to implement in the grammar?

I'm not sure -- with tree-sitter, it is oftern trial and error.

What I wrote above:

A replacement metadata node would be allowed to appear pretty much anywhere (like _ws, comment, and dis_expr perhaps)

may be close to what you described. If that worked, it might yield a similar result to the idea of dis_expr skipping "zero or more meta_lit nodes".

dis_expr currently looks like:

dis_expr: $ =>
seq(field('marker', "#_"),
    repeat($._gap),
    field('value', $._form)),

It already "skips" zero or more _gap nodes which are effectively _ws, comment, and/or dis_expr nodes:

_gap: $ =>
choice($._ws,
       $.comment,
       $.dis_expr),

In case unknown / unfamiliar, in tree-sitter-land, it's often the case that one prioritizes parsing correct code over trying to forbid incorrect code so the idea of allowing metadata nodes to be pretty much anywhere isn't as strange as it might seem.


As to whether this alternative approach can be made to work (and work well), I think someone would have to try. FWIW, the kinds of issues I have encountered before include:

  • Unable to get the grammar to be accepted by the tree-sitter tooling (the equivalent of "won't compile")
  • Error rates are too high [1] (the equivalent of too many test failures)
  • Parsing is too slow

[1] I found this with at least one of the original grammars -- which is one of the motivations for creating yet another grammar.

sogaiu

sogaiu commented on May 18, 2025

@sogaiu
OwnerAuthor

I also tried to following in REPL

Thanks for the confirmation. Good to see another method of verifying.

sogaiu

sogaiu commented on May 18, 2025

@sogaiu
OwnerAuthor

I tried making some of the discussed changes and pushed de52a1f6 to a new branch for experimenting with standalone metadata nodes.

rrudakov

rrudakov commented on May 18, 2025

@rrudakov

I tried making some of the discussed changes and pushed de52a1f6 to a new branch for experimenting with standalone metadata nodes.

Thank you! I'll check it out.

rrudakov

rrudakov commented on May 18, 2025

@rrudakov

I tried making some of the discussed changes and pushed de52a1f6 to a new branch for experimenting with standalone metadata nodes.

At the first glance, besides few minor issues, it looks great! Navigation works as expected. I'll try to do some more experiments with clojure-ts-mode in a separate branch and I'll post here if I find anything broken.

Thank you @sogaiu !

sogaiu

sogaiu commented on May 19, 2025

@sogaiu
OwnerAuthor

Good to hear it seems to be more or less working :)

Curious about the "few minor issues", any hints about those?

sogaiu

sogaiu commented on May 19, 2025

@sogaiu
OwnerAuthor

BTW, I managed to find an older attempt at simplifying the grammar in various ways -- it was in a separate repository.

I've cleaned it up a bit and made a few changes.

Notable differences include:

  • #_ and metadata (actually ^ and #^) can appear anywhere whitespace and comments can
  • As it predates the support for picking apart namespaced symbols and keywords, symbols and keywords are "atomic"
  • No fields
  • set_lit and anon_fn_lit start with #{ and #( respectively (i.e. single anonymous nodes consisting of two characters each instead of two nodes where one is associated with # and the other with an opening delimiter)
  • It has a different name -- within emacs lisp files it should be referred to via 'clojure-simple instead of 'clojure. This was done to allow co-existence of the grammars.
rrudakov

rrudakov commented on May 19, 2025

@rrudakov

Good to hear it seems to be more or less working :)

Curious about the "few minor issues", any hints about those?

Sorry, it probably was confusing. There are no issues with the grammar itself. I meant that clojure-ts-mode didn't work with the modified grammar out of the box. I had to modify a few queries here and there and cleanup some things related to skipping metadata.

rrudakov

rrudakov commented on May 19, 2025

@rrudakov

BTW, I managed to find an older attempt at simplifying the grammar in various ways -- it was in a separate repository.

I've cleaned it up a bit and made a few changes.

Notable differences include:

  • #_ and metadata (actually ^ and #^) can appear anywhere whitespace and comments can
  • As it predates the support for picking apart namespaced symbols and keywords, symbols and keywords are "atomic"
  • No fields
  • set_lit and anon_fn_lit start with #{ and #( respectively (i.e. single anonymous nodes consisting of two characters each instead of two nodes where one is associated with # and the other with an opening delimiter)
  • It has a different name -- within emacs lisp files it should be referred to via 'clojure-simple instead of 'clojure. This was done to allow co-existence of the grammars.

Not sure it would be a good idea to use 2 different Clojure grammars for the same file. Also it might be a Emacs specific problem and there might be a solution (see the discussion https://debbugs.gnu.org/cgi/bugreport.cgi?bug=78458). I'll try to add sexp-default "thing" and see if it makes things better.

114 remaining items

rrudakov

rrudakov commented on Jul 8, 2025

@rrudakov

So may be for clojure-ts-mode, the way the unstable-2025* grammars behave currently is ok for completing symbol names?

clojure-ts-mode completion function doesn't analyze syntax at point, Tree-sitter is only used to query all potential candidates in the buffer. Prefix is taken using Emacs' builtin facilities (like symbol-at-point or something like this).

sogaiu

sogaiu commented on Jul 8, 2025

@sogaiu
OwnerAuthor

Ah, the (bounds-of-thing-at-point 'symbol) bit, I think I see.

Thanks for the explanation.


On a related note, it looks like there is some use of treesit-thing-at-point in clojure-ts-mode in a few places.

I suppose those kinds of things might work better when there are no errors...


I think there is some sense to recognizing "name"-less symbols (i.e. a symbol with just a namespace part and the slash delimiter) for editor operations so probably I should see if I can get that to work.

I have some nagging thoughts about how that might have adverse effects on how we're doing tests ATM and the symbol and keyword stuff is kind of messy, but may be it's best to try to see what happens...

sogaiu

sogaiu commented on Jul 9, 2025

@sogaiu
OwnerAuthor

I think there is some sense to recognizing "name"-less symbols (i.e. a symbol with just a namespace part and the slash delimiter) for editor operations so probably I should see if I can get that to work.

I've made some attempts and have something to report.


Obvious changes I tried to get things like clojure.string/ to be treated as a symbol without breaking other things didn't go very well.

One example change was:

_sym_qualified: $ =>
prec(1, seq(field("namespace", alias(SYMBOL, $.sym_ns)),
            field("delimiter", NS_DELIMITER),
            field("name", optional(alias(SYMBOL_NAMESPACED_NAME, $.sym_name))))),

This kind of change typically resulted in symbol-handling of the following sorts (from the corpus tests):

  • deeper map

    {:paths ["src"]
     :deps {clj-kondo/clj-kondo {:mvn/version "2020.09.09"}}}
            ^^^^^^^^^^---------

    The first symbol in the inner map gets parsed as two symbols: clj-kondo/ and clj-kondo

  • symbol with prefix

    clojure.string/blank?
    ^^^^^^^^^^^^^^^------

    Similar results with what should be a single symbol getting parsed as two: clojure.string/ and blank?

  • namespaced division symbol

    clojure.core//
    ^^^^^^^^^^^^^-

    Similarly: clojure.core/ and /


On a hunch, I tried effectively reverting handling of symbol parsing into pieces such as a namespace portion and a name portion (which dannyfreeman added in v0.10.0).

With that approach, I've had success properly parsing all the cases I've tested with. Parsing the clojars release jars samples seemed to go ok too [1].

The changes to grammar.js for this approach essentially boil down to the straight-forward [2]:

sym_lit: $ =>
choice("/",
       SYMBOL,
       seq(SYMBOL, NS_DELIMITER, SYMBOL),
       seq(SYMBOL, NS_DELIMITER)),

So ATM I know how to either:

  1. Achieve treating something that ends in a / that's not quite a symbol as a symbol, but only if giving up parsing of symbols into pieces (so forgo straight-forward finer-grained highlighting of symbols) -OR-

  2. Have the grammar tease apart symbols into pieces, but pass on recognizing something that ends in a / that's not quite a symbol as a symbol.


I'll keep exploring to try to get things to work out / understand the current arrangement better. Hopefully, it's just something I'm missing 😅


[1] In fact, it turned up one more problematic Clojure file...though possibly this could be a matter of opinion?

[2] Plus the addition of a conflict and removal of some bits that become unneeded.

sogaiu

sogaiu commented on Jul 9, 2025

@sogaiu
OwnerAuthor

I'll keep exploring to try to get things to work out / understand the current arrangement better. Hopefully, it's just something I'm missing 😅

After a bit more exploration I think I understand what was going on.

It looks like it may be doable.

The unstable-20250709 branch has the two apparently successful attempts. Likely the more interesting one is in this commit as it preserves "perception" of symbols into components.

NoahTheDuke

NoahTheDuke commented on Jul 9, 2025

@NoahTheDuke

[1] In fact, it turned up one more problematic Clojure file...though possibly this could be a matter of opinion?

I wouldn't worry too much about people doing very weird pre-processing of clojure files, especially in projects that haven't been touched in 13 years.

sogaiu

sogaiu commented on Jul 10, 2025

@sogaiu
OwnerAuthor

people doing very weird pre-processing of clojure files

Interesting...now that you put it that way, may be that is what's going on there!

I've seen plain prose within (comment ...) and I think also in #_ ( ... ), but not something that looks like it's intended to be manipulated programmatically (e.g. like template constructs).

Yeah, I'm not too worried, it's more about being aware of what kinds of things exist in the wild. Being able to programmatically ignore "invalid" Clojure in the context of testing against samples is something I have an interest in so I like to take a somewhat close look at what turns up [1].

Thanks for taking a look and sharing your observations, much appreciated!


[1] For reference, here is a list of some samples that I expect not to parse currently along with descriptions of "why".

sogaiu

sogaiu commented on Jul 10, 2025

@sogaiu
OwnerAuthor

It looks like it may be doable.

I thought I'd investigate whether something similar is doable for keywords.

Adapting the approach for symbols, I'm currently getting the following (one typical result from a particular corpus test):

For:

:clojure.core//

Actual result:

(source
  (kwd_lit
    (kwd_ns))
  (sym_lit
    (sym_name)))

What we would like:

(source
  (kwd_lit
    (kwd_ns)
    (kwd_name)))

So we're getting a keword followed by a symbol instead of a single keyword.

ATM, I don't know how to prevent this kind of thing from occurring.

When it was just symbols, there was a way to arrange for it by tweaking precedences, but my attempts so far at adapting that idea haven't worked out once keywords are also involved (may keep trying a bit though).

I suspect that allowing "partial" constructs like the proposed "not-quite-a-symbol-that-ends-with-a-slash" may result in more of these sorts of situations partly due to the way the grammar takes on explicit handling of whitespace [1].

FWIW, we do that handling by emptying extras, defining our "start symbol" like:

source: $ =>
repeat(choice($._form,
              $._gap)),

and locating _gap and _ws at various spots of the grammar [2].

I suspect that that approach might make it impractical to express the idea of "A cannot be followed by B" in some situations, but I'm still mulling things over.

It may be that trying to get "partial" constructs to work with a grammar with this sort of approach is problematic.


[1] AFAIU, we handle whitespace (and similar) explicitly (this is not the "default" for tree-sitter) in order to get dis_expr (and now _metadata_lit) to work in the way they do.

[2] One example is here.

sogaiu

sogaiu commented on Jul 11, 2025

@sogaiu
OwnerAuthor

I think I got the "keywords ending in /" bit working together with "symbols ending in /".

However, I think these investigations have made it clearer that:

allowing "partial" constructs like the proposed "not-quite-a-symbol-that-ends-with-a-slash" may result in more of these sorts of situations

By "these sorts of situations", I mean parse results where "partial" constructs get "glued together" with things which they shouldn't be resulting in parses like:

(source
  (kwd_lit
    (kwd_ns))
  (sym_lit
    (sym_name)))

for something like :clojure.core//.

ATM I'm not comfortable with looking after things in this state [1], but I'll keep thinking about it.


[1] Suppose at a later date we discover there is some undesirable parse result which we don't figure out how to "disallow".

What do we do then?...especially after various parties have become dependent on "partial" construct behavior (e.g. to implement completion in specific ways).

sogaiu

sogaiu commented on Jul 11, 2025

@sogaiu
OwnerAuthor

On a side note, I've noticed that tests have started to take somewhat longer with recent changes. To run the parse tests across the ~200,000 files from clojars release jars, the timings were recently:

  • 131,173 ms (v0.0.13 on master)

  • 148,466 ms (f8dd406 on unstable-20250629)

  • 159,109 ms (596e768 on unstable-20250709 (partial symbols and keywords))

Perhaps it is indicative of a decrease in performance.

My guess at this point is that this probably doesn't matter.

NoahTheDuke

NoahTheDuke commented on Jul 11, 2025

@NoahTheDuke

I think it is indicative of performance loss but it's fairly minimal: ~0.65ms per file for master, ~0.8ms per file on partial syms/kws. I wouldn't let that keep you from implementing these features.

NoahTheDuke

NoahTheDuke commented on Jul 11, 2025

@NoahTheDuke

Suppose at a later date we discover there is some undesirable parse result which we don't figure out how to "disallow".

What are you imagining here?

sogaiu

sogaiu commented on Jul 14, 2025

@sogaiu
OwnerAuthor

I'm sorry my previous explanation did not adequately communicate my concerns.

I've spent some time looking for actual examples, and I think I found one [1].

I think the recent changes being explored here (the "partial" symbols / keywords) break the existing handling of the new CLJ-2807 syntax which the v0.0.13 branch did work with:

String/1

should parse as:

(source [0, 0] - [1, 0]
  (sym_lit [0, 0] - [0, 8]
    namespace: (sym_ns [0, 0] - [0, 6])
    name: (sym_name [0, 7] - [0, 8])))

but with the changes it parses as:

(source [0, 0] - [1, 0]
  (sym_lit [0, 0] - [0, 7]
    namespace: (sym_ns [0, 0] - [0, 6]))
  (num_lit [0, 7] - [0, 8]))

[1] On a side note, ATM, I don't think it is practical to try to exhaustively find these kinds of instances manually or by reasoning -- we were just lucky here.

We could try generative / property-based testing with test.check to proactively look for issues now that phronmophobic has made it much more practical / shown how to use tree-sitter grammars from Clojure directly [1] [2].

I have some generator code written for some portions of Clojure that I used via libpython-clj which could be adapted / reused -- though it's not close to comprehensive.

I'm not sure it is worth the effort to go in this direction, but I'll give it some thought.


[1] Clobber has these bits that show how one can use tree-sitter grammars from Clojure via tree-sitter-ng.

[2] Although abandoned, also this.

sogaiu

sogaiu commented on Jul 14, 2025

@sogaiu
OwnerAuthor

I think it is indicative of performance loss but it's fairly minimal

Yes, I agree.

Thanks for thinking about it and providing feedback 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @bbatsov@phronmophobic@NoahTheDuke@rrudakov@sogaiu

        Issue actions

          Failure to properly handle return type metadata? · Issue #65 · sogaiu/tree-sitter-clojure