Description
As reported by @rrudakov in this clojure-ts-mode issue, the current tree-sitter-clojure grammar might be seen to not properly handle return type metadata from a semantic perspective.
As a concrete example (slightly edited from the original), consider:
(defn to-string
^String
[arg]
(.toString arg))^String here refers to the return type of the function. A straight-forward interpretation might consider the metadata to not be for the argument vector [arg]. (Possibly one could choose to see it that way though as argument information and return type information have to do with the input(s) and output(s) of a function and in some languages end up being expressed right next to each other.)
For reference, at clojure.org, this is currently documented at the Type Hints section of the Java Interop page:
For function return values, the type hint can be placed before the parameter vector:
(defn hinted-single ^String [])
-> #user/hinted-single
(defn hinted
(^String [])
(^Integer [a])
(^java.util.List [a & args]))
-> #user/hinted
In a comment in the aforementioned issue, an alternative approach to parsing metadata was mentioned:
I think having metadata as standalone nodes would solve a lot of code navigation issues at least in Emacs.
The clojure-ts-mode issue had this comment as well:
Expressions with metadata (BTW, I really don't like how the grammar works with metadata, it's pain to work with and causes more problems than benefits)
Additional details were elaborated upon in this comment.
So the proposed alternative approach for handling metadata may have multiple motivations.
I suspect this would involve a significant reworking of the grammar. May be creating a new grammar would allow additional improvements to be incorporated such as this issue as well as an additional point made about open nodes [1].
[1] Also made in this comment.
Activity
NoahTheDuke commentedon May 18, 2025
Why would that be?
sogaiu commentedon May 18, 2025
I'd like to jot down some further thoughts and the results of some investigations before attempting to answer:
One thing to note is that "the return type being specifiable via metadata that appears before the argument vector" appears to be a "post-read" thing in Clojure. AFAIU, the "ordinary" (other?) metadata cases are handled in LispReader.java during
reador similar.I think this line in core.clj may be where the identified return type case happens. (I'm not quite sure about this yet and might look into it further.)
Assuming this is correct for the moment, I think it's not unreasonable for the parse results from tree-sitter to associate the metadata with the argument vector because the metadata in question does actually live on the argument vector (at least initially -- I haven't checked whether it goes away or is removed).
On a different note, I thought about doing a "standalone metadata nodes" grammar a bit more and I think I may have tried this before [1]. I'm not sure what the results were unfortunately.
I thought it might have adverse impacts on things like the handling of
#_and what it corresponds to, but I'm not sure. Perhaps it can be made to work.[1] ...though I'm not having luck turning up any code for that.
sogaiu commentedon May 18, 2025
I think the following is consistent with the supposition above about the "return type metadata" living on the argument vector (some lines added to improve readability):
So although it might seem a bit weird, I'm inclined to say that there is a kind of consistency and correctness to the way metadata is handled currently even in the return type case.
sogaiu commentedon May 18, 2025
To return to this question:
Looking over
grammar.js, there are more than 10 node types that currently "include" metadata. They are:sym_litlist_litmap_litvec_litset_litread_cond_litsplicing_read_cond_litns_map_litvar_quoting_litevaling_littagged_or_ctor_litderefing_litquoting_litsyn_quoting_litunquote_splicing_litunquoting_litAt minimum I think one approach might include the following changes:
_ws,comment, anddis_exprperhaps)Whether one considers this a significant reworking or not, it doesn't seem like a compatible change nor is it a change to a small number of node types.
Assuming a second grammar / parser that implements the sort of idea described above is feasible...IIUC a number of editors support the use of more than one parser per buffer (at least to handle "injections", for example
clojure-ts-modeseems to do this for docstrings and regular expressions) so it might be possible to use two parsers alongside each other for non-injection activities.One idea is for a second (newer) parser to handle certain tasks that are awkward for the first parser and gradually migrate existing functionality over to the newer parser's functionality if that seems worth it. Alternatively, one might just decide to use both.
I think at least in Emacs it might be a practical path.
As mentioned elsewhere [1], if one were going to make a new grammar / parser though, it might be good to collect other "failings" or "wishes" related to the current parser.
[1] At least in this comment (
string_contentsnode) and in this one (related to theopenfield).rrudakov commentedon May 18, 2025
Currently in the
dis_exprkind of absorbs the following node including metadata:And if there is more than one
dis_expr, they are nested, which is interesting, but I guess valid:If
meta_litwas a standalone node, thedis_exprnode would have to skip zero or moremeta_litnodes and wrap the first non-meta node, is it possible to implement in the grammar?Looks like you're right, I also tried to following in REPL:
sogaiu commentedon May 18, 2025
More later, but regarding this:
To give some background for this, some people make use of "stacked"
#_as part of the ability to conveniently disable bits of code like in the following cases:{:a 1 :b 2 #_ #_ :c 3}It's easier to put the multiple
#_instances together (and remove them) than to try to interleave them.To spell things out a bit, in the latter example, interleaving would look like:
{:a 1 :b 2 #_ :c #_ 3}which achieves the same end but is more awkward to put into place as well as remove.
AFAIU, the current grammar "understands" this "stacked" usage...at least the intent was for it do so (^^;
sogaiu commentedon May 18, 2025
I'm not sure -- with tree-sitter, it is oftern trial and error.
What I wrote above:
may be close to what you described. If that worked, it might yield a similar result to the idea of
dis_exprskipping "zero or more meta_lit nodes".dis_exprcurrently looks like:It already "skips" zero or more
_gapnodes which are effectively_ws,comment, and/ordis_exprnodes:In case unknown / unfamiliar, in tree-sitter-land, it's often the case that one prioritizes parsing correct code over trying to forbid incorrect code so the idea of allowing metadata nodes to be pretty much anywhere isn't as strange as it might seem.
As to whether this alternative approach can be made to work (and work well), I think someone would have to try. FWIW, the kinds of issues I have encountered before include:
[1] I found this with at least one of the original grammars -- which is one of the motivations for creating yet another grammar.
sogaiu commentedon May 18, 2025
Thanks for the confirmation. Good to see another method of verifying.
sogaiu commentedon May 18, 2025
I tried making some of the discussed changes and pushed de52a1f6 to a new branch for experimenting with standalone metadata nodes.
rrudakov commentedon May 18, 2025
Thank you! I'll check it out.
rrudakov commentedon May 18, 2025
At the first glance, besides few minor issues, it looks great! Navigation works as expected. I'll try to do some more experiments with
clojure-ts-modein a separate branch and I'll post here if I find anything broken.Thank you @sogaiu !
sogaiu commentedon May 19, 2025
Good to hear it seems to be more or less working :)
Curious about the "few minor issues", any hints about those?
sogaiu commentedon May 19, 2025
BTW, I managed to find an older attempt at simplifying the grammar in various ways -- it was in a separate repository.
I've cleaned it up a bit and made a few changes.
Notable differences include:
#_and metadata (actually^and#^) can appear anywhere whitespace and comments canset_litandanon_fn_litstart with#{and#(respectively (i.e. single anonymous nodes consisting of two characters each instead of two nodes where one is associated with#and the other with an opening delimiter)'clojure-simpleinstead of'clojure. This was done to allow co-existence of the grammars.rrudakov commentedon May 19, 2025
Sorry, it probably was confusing. There are no issues with the grammar itself. I meant that
clojure-ts-modedidn't work with the modified grammar out of the box. I had to modify a few queries here and there and cleanup some things related to skipping metadata.rrudakov commentedon May 19, 2025
Not sure it would be a good idea to use 2 different Clojure grammars for the same file. Also it might be a Emacs specific problem and there might be a solution (see the discussion https://debbugs.gnu.org/cgi/bugreport.cgi?bug=78458). I'll try to add
sexp-default"thing" and see if it makes things better.114 remaining items
rrudakov commentedon Jul 8, 2025
clojure-ts-modecompletion function doesn't analyze syntax at point, Tree-sitter is only used to query all potential candidates in the buffer. Prefix is taken using Emacs' builtin facilities (likesymbol-at-pointor something like this).sogaiu commentedon Jul 8, 2025
Ah, the
(bounds-of-thing-at-point 'symbol)bit, I think I see.Thanks for the explanation.
On a related note, it looks like there is some use of
treesit-thing-at-pointinclojure-ts-modein a few places.I suppose those kinds of things might work better when there are no errors...
I think there is some sense to recognizing "name"-less symbols (i.e. a symbol with just a namespace part and the slash delimiter) for editor operations so probably I should see if I can get that to work.
I have some nagging thoughts about how that might have adverse effects on how we're doing tests ATM and the symbol and keyword stuff is kind of messy, but may be it's best to try to see what happens...
sogaiu commentedon Jul 9, 2025
I've made some attempts and have something to report.
Obvious changes I tried to get things like
clojure.string/to be treated as a symbol without breaking other things didn't go very well.One example change was:
This kind of change typically resulted in symbol-handling of the following sorts (from the corpus tests):
deeper map
{:paths ["src"] :deps {clj-kondo/clj-kondo {:mvn/version "2020.09.09"}}} ^^^^^^^^^^---------The first symbol in the inner map gets parsed as two symbols:
clj-kondo/andclj-kondosymbol with prefix
Similar results with what should be a single symbol getting parsed as two:
clojure.string/andblank?namespaced division symbol
Similarly:
clojure.core/and/On a hunch, I tried effectively reverting handling of symbol parsing into pieces such as a namespace portion and a name portion (which dannyfreeman added in v0.10.0).
With that approach, I've had success properly parsing all the cases I've tested with. Parsing the clojars release jars samples seemed to go ok too [1].
The changes to
grammar.jsfor this approach essentially boil down to the straight-forward [2]:So ATM I know how to either:
Achieve treating something that ends in a
/that's not quite a symbol as a symbol, but only if giving up parsing of symbols into pieces (so forgo straight-forward finer-grained highlighting of symbols) -OR-Have the grammar tease apart symbols into pieces, but pass on recognizing something that ends in a
/that's not quite a symbol as a symbol.I'll keep exploring to try to get things to work out / understand the current arrangement better. Hopefully, it's just something I'm missing 😅
[1] In fact, it turned up one more problematic Clojure file...though possibly this could be a matter of opinion?
[2] Plus the addition of a conflict and removal of some bits that become unneeded.
sogaiu commentedon Jul 9, 2025
After a bit more exploration I think I understand what was going on.
It looks like it may be doable.
The unstable-20250709 branch has the two apparently successful attempts. Likely the more interesting one is in this commit as it preserves "perception" of symbols into components.
NoahTheDuke commentedon Jul 9, 2025
I wouldn't worry too much about people doing very weird pre-processing of clojure files, especially in projects that haven't been touched in 13 years.
sogaiu commentedon Jul 10, 2025
Interesting...now that you put it that way, may be that is what's going on there!
I've seen plain prose within
(comment ...)and I think also in#_ ( ... ), but not something that looks like it's intended to be manipulated programmatically (e.g. like template constructs).Yeah, I'm not too worried, it's more about being aware of what kinds of things exist in the wild. Being able to programmatically ignore "invalid" Clojure in the context of testing against samples is something I have an interest in so I like to take a somewhat close look at what turns up [1].
Thanks for taking a look and sharing your observations, much appreciated!
[1] For reference, here is a list of some samples that I expect not to parse currently along with descriptions of "why".
sogaiu commentedon Jul 10, 2025
I thought I'd investigate whether something similar is doable for keywords.
Adapting the approach for symbols, I'm currently getting the following (one typical result from a particular corpus test):
For:
Actual result:
What we would like:
So we're getting a keword followed by a symbol instead of a single keyword.
ATM, I don't know how to prevent this kind of thing from occurring.
When it was just symbols, there was a way to arrange for it by tweaking precedences, but my attempts so far at adapting that idea haven't worked out once keywords are also involved (may keep trying a bit though).
I suspect that allowing "partial" constructs like the proposed "not-quite-a-symbol-that-ends-with-a-slash" may result in more of these sorts of situations partly due to the way the grammar takes on explicit handling of whitespace [1].
FWIW, we do that handling by emptying
extras, defining our "start symbol" like:and locating
_gapand_wsat various spots of the grammar [2].I suspect that that approach might make it impractical to express the idea of "A cannot be followed by B" in some situations, but I'm still mulling things over.
It may be that trying to get "partial" constructs to work with a grammar with this sort of approach is problematic.
[1] AFAIU, we handle whitespace (and similar) explicitly (this is not the "default" for tree-sitter) in order to get
dis_expr(and now_metadata_lit) to work in the way they do.[2] One example is here.
sogaiu commentedon Jul 11, 2025
I think I got the "keywords ending in /" bit working together with "symbols ending in /".
However, I think these investigations have made it clearer that:
By "these sorts of situations", I mean parse results where "partial" constructs get "glued together" with things which they shouldn't be resulting in parses like:
for something like
:clojure.core//.ATM I'm not comfortable with looking after things in this state [1], but I'll keep thinking about it.
[1] Suppose at a later date we discover there is some undesirable parse result which we don't figure out how to "disallow".
What do we do then?...especially after various parties have become dependent on "partial" construct behavior (e.g. to implement completion in specific ways).
sogaiu commentedon Jul 11, 2025
On a side note, I've noticed that tests have started to take somewhat longer with recent changes. To run the parse tests across the ~200,000 files from clojars release jars, the timings were recently:
131,173 ms (v0.0.13 on master)
148,466 ms (f8dd406 on unstable-20250629)
159,109 ms (596e768 on unstable-20250709 (partial symbols and keywords))
Perhaps it is indicative of a decrease in performance.
My guess at this point is that this probably doesn't matter.
NoahTheDuke commentedon Jul 11, 2025
I think it is indicative of performance loss but it's fairly minimal: ~0.65ms per file for master, ~0.8ms per file on partial syms/kws. I wouldn't let that keep you from implementing these features.
NoahTheDuke commentedon Jul 11, 2025
What are you imagining here?
sogaiu commentedon Jul 14, 2025
I'm sorry my previous explanation did not adequately communicate my concerns.
I've spent some time looking for actual examples, and I think I found one [1].
I think the recent changes being explored here (the "partial" symbols / keywords) break the existing handling of the new CLJ-2807 syntax which the v0.0.13 branch did work with:
should parse as:
but with the changes it parses as:
[1] On a side note, ATM, I don't think it is practical to try to exhaustively find these kinds of instances manually or by reasoning -- we were just lucky here.
We could try generative / property-based testing with
test.checkto proactively look for issues now that phronmophobic has made it much more practical / shown how to use tree-sitter grammars from Clojure directly [1] [2].I have some generator code written for some portions of Clojure that I used via libpython-clj which could be adapted / reused -- though it's not close to comprehensive.
I'm not sure it is worth the effort to go in this direction, but I'll give it some thought.
[1] Clobber has these bits that show how one can use tree-sitter grammars from Clojure via tree-sitter-ng.
[2] Although abandoned, also this.
sogaiu commentedon Jul 14, 2025
Yes, I agree.
Thanks for thinking about it and providing feedback 👍