diff --git a/README.md b/README.md
index bcd1f8f370c..6fcd7eb078d 100644
--- a/README.md
+++ b/README.md
@@ -5,13 +5,110 @@
 [![Build status](https://ci.appveyor.com/api/projects/status/j56x1hbje8rdg6xk/branch/master?svg=true)](https://ci.appveyor.com/project/matklad/libsyntax2/branch/master)
+libsyntax2.0 is an **experimental** parser of the Rust language,
+intended for the use in IDEs.
+[RFC](https://github.com/rust-lang/rfcs/pull/2256).
-libsyntax2.0 is an **experimental** implementation of the corresponding [RFC](https://github.com/rust-lang/rfcs/pull/2256).
-See [`docs`](./docs) folder to learn how libsyntax2 works, and check
-[`CONTRIBUTING.md`](./CONTRIBUTING.md) if you want to contribute!
-**WARNING** everything is in a bit of a flux recently, the docs are obsolete,
-see the recent work on red/green trees.
+## Quick Start
+
+```
+$ cargo test
+$ cargo parse < crates/libsyntax2/src/lib.rs
+```
+
+
+## Trying It Out
+
+This installs experimental VS Code plugin
+
+```
+$ cargo install-code
+```
+
+It's better to remove existing Rust plugins to avoid interference.
+Warning: plugin is not intended for general use, has a lot of rough
+edges and missing features (notably, no code completion). That said,
+while originally libsyntax2 was developed in IntelliJ, @matklad now
+uses this plugin (and thus, libsyntax2) to develop libsyntax2, and it
+doesn't hurt too much :-)
+
+
+### Features:
+
+* syntax highlighting (LSP does not have API for it, so impl is hacky
+  and sometimes falls back to the horrible built-in highlighting)
+
+* commands (`ctrl+shift+p` or keybindings)
+  - **Show Rust Syntax Tree** (use it to verify that plugin works)
+  - **Rust Extend Selection** (works with multiple cursors)
+  - **Rust Matching Brace** (knows the difference between `<` and `<`)
+  - **Rust Parent Module**
+  - **Rust Join Lines** (deals with trailing commas)
+
+* **Go to symbol in file**
+
+* **Go to symbol in workspace** (no support for Cargo deps yet)
+
+* code actions:
+  - Flip `,` in comma separated lists
+  - Add `#[derive]` to struct/enum
+  - Add `impl` block to struct/enum
+  - Run tests at caret
+
+* **Go to definition** ("correct" for `mod foo;` decls, index-based for functions).
+
+
+## Code Walk-Through
+
+### `crates/libsyntax2`
+
+- `yellow`, red/green syntax tree, heavily inspired [by this](https://github.com/apple/swift/tree/ab68f0d4cbf99cdfa672f8ffe18e433fddc8b371/lib/Syntax)
+- `grammar`, the actual parser
+- `parser_api/parser_impl` bridges the tree-agnostic parser from `grammar` with `yellow` trees
+- `grammar.ron` RON description of the grammar, which is used to
+  generate `syntax_kinds` and `ast` modules.
+- `algo`: generic tree algorithms, including `walk` for O(1) stack
+  space tree traversal (this is cool) and `visit` for type-driven
+  visiting the nodes (this is double plus cool, if you understand how
+  `Visitor` works, you understand libsyntax2).
+
+
+### `crates/libeditor`
+
+Most of IDE features live here. Unlike `libanalysis`, `libeditor` is
+single-file and is basically a bunch of pure functions.
+
+
+### `crates/libanalysis`
+
+A stateful library for analyzing many Rust files as they change.
+`WorldState` is a mutable entity (clojure's atom) which holds current
+state, incorporates changes and hands out `World`s --- immutable
+consistent snapshots of `WorldState`, which actually power analysis.
+
+
+### `crates/server`
+
+An LSP implementation which uses `libanalysis` for managing state and
+`libeditor` for actually doing useful stuff.
+
+
+### `crates/cli`
+
+A CLI interface to libsyntax
+
+### `crates/tools`
+
+Code-gen tasks, used to develop libsyntax2:
+
+- `cargo gen-kinds` -- generate `ast` and `syntax_kinds`
+- `cargo gen-tests` -- collect inline tests from grammar
+- `cargo install-code` -- build and install VS Code extension and server
+
+### `code`
+
+VS Code plugin
 ## License
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
deleted file mode 100644
index 6b4434396aa..00000000000
--- a/docs/ARCHITECTURE.md
+++ /dev/null
@@ -1,93 +0,0 @@
-# Design and open questions about libsyntax
-
-
-The high-level description of the architecture is in RFC.md. You might
-also want to dig through https://github.com/matklad/fall/ which
-contains some pretty interesting stuff built using similar ideas
-(warning: it is completely undocumented, poorly written and in general
-not the thing which I recommend to study (yes, this is
-self-contradictory)).
-
-## Tree
-
-The centerpiece of this whole endeavor is the syntax tree, in the
-`tree` module. Open questions:
-
-- how to best represent errors, to take advantage of the fact that
-  they are rare, but to enable fully-persistent style structure
-  sharing between tree nodes?
-
-- should we make red/green split from Roslyn more pronounced?
-
-- one can layout nodes in a single array in such a way that children
-  of the node form a continuous slice. Seems nifty, but do we need it?
-
-- should we use SoA or AoS for NodeData?
-
-- should we split leaf nodes and internal nodes into separate arrays?
-  Can we use it to save some bits here and there? (leaves don't need
-  first_child field, for example).
-
-
-## Parser
-
-The syntax tree is produced using a three-staged process.
-
-First, a raw text is split into tokens with a lexer (the `lexer` module).
-Lexer has a peculiar signature: it is an `Fn(&str) -> Token`, where token
-is a pair of `SyntaxKind` (you should have read the `tree` module and RFC
-by this time! :)) and a len. That is, lexer chomps only the first
-token of the input. This forces the lexer to be stateless, and makes
-it possible to implement incremental relexing easily.
-
-Then, the bulk of work, the parser turns a stream of tokens into
-stream of events (the `parser` module; of particular interest are
-the `parser/event` and `parser/parser` modules, which contain parsing
-API, and the `parser/grammar` module, which contains actual parsing code
-for various Rust syntactic constructs). Note that parser **does not**
-construct a tree right away. This is done for several reasons:
-
-* to decouple the actual tree data structure from the parser: you can
-  build any data structure you want from the stream of events
-
-* to make parsing fast: you can produce a list of events without
-  allocations
-
-* to make it easy to tweak tree structure. Consider this code:
-
-  ```
-  #[cfg(test)]
-  pub fn foo() {}
-  ```
-
-  Here, the attribute and the `pub` keyword must be the children of
-  the `fn` node. However, when parsing them, we don't yet know if
-  there would be a function ahead: it very well might be a `struct`
-  there. If we use events, we generally don't care about this *in
-  parser* and just spit them in order.
-
-* (Is this true?) to make incremental reparsing easier: you can reuse
-  the same rope data structure for all of the original string, the
-  tokens and the events.
-
-
-The parser also does not know about whitespace tokens: it's the job of
-the next layer to assign whitespace and comments to nodes. However,
-parser can remap contextual tokens, like `>>` or `union`, so it has
-access to the text.
-
-And at last, the TreeBuilder converts a flat stream of events into a
-tree structure. It also *should* be responsible for attaching comments
-and rebalancing the tree, but it does not do this yet :)
-
-## Validator
-
-Parser and lexer accept a lot of *invalid* code intentionally. The
-idea is to post-process the tree and do proper error reporting,
-literal conversion and quick-fix suggestions. There is no
-design/implementation for this yet.
-
-
-## AST
-
-Nothing yet, see `AstNode` in `fall`.
diff --git a/docs/RFC.md b/docs/RFC.md
deleted file mode 100644
index 2bd9f18f1d8..00000000000
--- a/docs/RFC.md
+++ /dev/null
@@ -1,494 +0,0 @@
-- Feature Name: libsyntax2.0
-- Start Date: 2017-12-30
-- RFC PR: (leave this empty)
-- Rust Issue: (leave this empty)
-
-
->I think the lack of reusability comes in object-oriented languages,
->not functional languages. Because the problem with object-oriented
->languages is they’ve got all this implicit environment that they
->carry around with them. You wanted a banana but what you got was a
->gorilla holding the banana and the entire jungle.
->
->If you have referentially transparent code, if you have pure
->functions — all the data comes in its input arguments and everything
->goes out and leave no state behind — it’s incredibly reusable.
->
-> **Joe Armstrong**
-
-# Summary
-[summary]: #summary
-
-The long-term plan is to rewrite libsyntax parser and syntax tree data
-structure to create a software component independent of the rest of
-rustc compiler and suitable for the needs of IDEs and code
-editors. This RFC is the first step of this plan, whose goal is to
-find out if this is possible at least in theory. If it is possible,
-the next steps would be a prototype implementation as a crates.io
-crate and a separate RFC for integrating the prototype with rustc,
-other tools, and eventual libsyntax removal.
-
-Note that this RFC does not propose to stabilize any API for working
-with rust syntax: the semver version of the hypothetical library would
-be `0.1.0`. It is intended to be used by tools, which are currently
-closely related to the compiler: `rustc`, `rustfmt`, `clippy`, `rls`
-and hypothetical `rustfix`. While it would be possible to create
-third-party tools on top of the new libsyntax, the burden of adapting
-to breaking changes would be on authors of such tools.
-
-
-# Motivation
-[motivation]: #motivation
-
-There are two main drawbacks with the current version of libsyntax:
-
-* It is tightly integrated with the compiler and hard to use
-  independently
-
-* The AST representation is not well-suited for use inside IDEs
-
-
-## IDE support
-
-There are several differences in how IDEs and compilers typically
-treat source code.
-
-In the compiler, it is convenient to transform the source
-code into Abstract Syntax Tree form, which is independent of the
-surface syntax. For example, it's convenient to discard comments,
-whitespaces and desugar some syntactic constructs in terms of the
-simpler ones.
-
-In contrast, IDEs work much closer to the source code, so it is
-crucial to preserve full information about the original text. For
-example, IDE may adjust indentation after typing a `}` which closes a
-block, and to do this correctly, IDE must be aware of syntax (that is,
-that `}` indeed closes some block, and is not a syntax error) and of
-all whitespaces and comments. So, IDE suitable AST should explicitly
-account for syntactic elements, not considered important by the
-compiler.
-
-Another difference is that IDEs typically work with incomplete and
-syntactically invalid code. This boils down to two parser properties.
-First, the parser must produce syntax tree even if some required input
-is missing. For example, for input `fn foo` the function node should
-be present in the parse, despite the fact that there are no parameters
-or body. Second, the parser must be able to skip over parts of input
-it can't recognize and aggressively recover from errors. That is, the
-syntax tree data structure should be able to handle both missing and
-extra nodes.
-
-IDEs also need the ability to incrementally reparse and relex source
-code after the user types. A smart IDE would use syntax tree structure
-to handle editing commands (for example, to add/remove trailing commas
-after join/split lines actions), so parsing time can be very
-noticeable.
-
-
-Currently rustc uses the classical AST approach, and preserves some of
-the source code information in the form of spans in the AST. It is not
-clear if this structure can fulfill all IDE requirements.
-
-
-## Reusability
-
-In theory, the parser can be a pure function, which takes a `&str` as
-an input, and produces a `ParseTree` as an output.
-
-This is great for reusability: for example, you can compile this
-function to WASM and use it for fast client-side validation of syntax
-on the rust playground, or you can develop tools like `rustfmt` on
-stable Rust outside of rustc repository, or you can embed the parser
-into your favorite IDE or code editor.
-
-This is also great for correctness: with such simple interface, it's
-possible to write property-based tests to thoroughly compare two
-different implementations of the parser. It's also straightforward to
-create a comprehensive test suite, because all the inputs and outputs
-are trivially serializable to human-readable text.
-
-Another benefit is performance: with this signature, you can cache a
-parse tree for each file, with trivial strategy for cache invalidation
-(invalidate an entry when the underlying file changes). On top of such
-a cache it is possible to build a smart code indexer which maintains
-the set of symbols in the project, watches files for changes and
-automatically reindexes only changed files.
-
-Unfortunately, the current libsyntax is far from this ideal. For
-example, even the lexer makes use of the `FileMap` which is
-essentially a global state of the compiler which represents all known
-files. As a data point, it turned out to be easier to move `rustfmt`
-into the main `rustc` repository than to move libsyntax outside!
-
-
-# Guide-level explanation
-[guide-level-explanation]: #guide-level-explanation
-
-Not applicable.
-
-
-# Reference-level explanation
-[reference-level-explanation]: #reference-level-explanation
-
-It is not clear if a single parser can accommodate the needs of the
-compiler and the IDE, but there is hope that it is possible. The RFC
-proposes to develop libsyntax2.0 as an experimental crates.io crate. If
-the experiment turns out to be a success, the second RFC will propose
-to integrate it with all existing tools and `rustc`.
-
-Next, a syntax tree data structure is proposed for libsyntax2.0. It
-seems to have the following important properties:
-
-* It is lossless and faithfully represents the original source code,
-  including explicit nodes for comments and whitespace.
-
-* It is flexible and allows to encode arbitrary node structure,
-  even for invalid syntax.
-
-* It is minimal: it stores small amount of data and has no
-  dependencies. For instance, it does not need compiler's string
-  interner or literal data representation.
-
-* While the tree itself is minimal, it is extensible in a sense that
-  it is possible to associate arbitrary data with certain nodes in a
-  type-safe way.
-
-
-It is not clear if this representation is the best one. It is heavily
-inspired by [PSI] data structure which is used in [IntelliJ] based IDEs
-and in the [Kotlin] compiler.
-
-[PSI]: http://www.jetbrains.org/intellij/sdk/docs/reference_guide/custom_language_support/implementing_parser_and_psi.html
-[IntelliJ]: https://github.com/JetBrains/intellij-community/
-[Kotlin]: https://kotlinlang.org/
-
-
-## Untyped Tree
-
-The main idea is to store the minimal amount of information in the
-tree itself, and instead lean heavily on the source code for the
-actual data about identifier names, constant values etc.
-
-All nodes in the tree are of the same type and store a constant for
-the syntactic category of the element and a range in the source code.
-
-Here is a minimal implementation of this data structure with some Rust
-syntactic categories
-
-
-```rust
-#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
-pub struct NodeKind(u16);
-
-pub struct File {
-    text: String,
-    nodes: Vec<NodeData>,
-}
-
-struct NodeData {
-    kind: NodeKind,
-    range: (u32, u32),
-    parent: Option<u32>,
-    first_child: Option<u32>,
-    next_sibling: Option<u32>,
-}
-
-#[derive(Clone, Copy)]
-pub struct Node<'f> {
-    file: &'f File,
-    idx: u32,
-}
-
-pub struct Children<'f> {
-    next: Option<Node<'f>>,
-}
-
-impl File {
-    pub fn root<'f>(&'f self) -> Node<'f> {
-        assert!(!self.nodes.is_empty());
-        Node { file: self, idx: 0 }
-    }
-}
-
-impl<'f> Node<'f> {
-    pub fn kind(&self) -> NodeKind {
-        self.data().kind
-    }
-
-    pub fn text(&self) -> &'f str {
-        let (start, end) = self.data().range;
-        &self.file.text[start as usize..end as usize]
-    }
-
-    pub fn parent(&self) -> Option<Node<'f>> {
-        self.as_node(self.data().parent)
-    }
-
-    pub fn children(&self) -> Children<'f> {
-        Children { next: self.as_node(self.data().first_child) }
-    }
-
-    fn data(&self) -> &'f NodeData {
-        &self.file.nodes[self.idx as usize]
-    }
-
-    fn as_node(&self, idx: Option<u32>) -> Option<Node<'f>> {
-        idx.map(|idx| Node { file: self.file, idx })
-    }
-}
-
-impl<'f> Iterator for Children<'f> {
-    type Item = Node<'f>;
-
-    fn next(&mut self) -> Option<Node<'f>> {
-        let next = self.next;
-        self.next = next.and_then(|node| node.as_node(node.data().next_sibling));
-        next
-    }
-}
-
-pub const ERROR: NodeKind = NodeKind(0);
-pub const WHITESPACE: NodeKind = NodeKind(1);
-pub const STRUCT_KW: NodeKind = NodeKind(2);
-pub const IDENT: NodeKind = NodeKind(3);
-pub const L_CURLY: NodeKind = NodeKind(4);
-pub const R_CURLY: NodeKind = NodeKind(5);
-pub const COLON: NodeKind = NodeKind(6);
-pub const COMMA: NodeKind = NodeKind(7);
-pub const AMP: NodeKind = NodeKind(8);
-pub const LINE_COMMENT: NodeKind = NodeKind(9);
-pub const FILE: NodeKind = NodeKind(10);
-pub const STRUCT_DEF: NodeKind = NodeKind(11);
-pub const FIELD_DEF: NodeKind = NodeKind(12);
-pub const TYPE_REF: NodeKind = NodeKind(13);
-```
-
-Here is a rust snippet and the corresponding parse tree:
-
-```rust
-struct Foo {
-    field1: u32,
-    &
-    // non-doc comment
-    field2:
-}
-```
-
-
-```
-FILE
-  STRUCT_DEF
-    STRUCT_KW
-    WHITESPACE
-    IDENT
-    WHITESPACE
-    L_CURLY
-    WHITESPACE
-    FIELD_DEF
-      IDENT
-      COLON
-      WHITESPACE
-      TYPE_REF
-        IDENT
-    COMMA
-    WHITESPACE
-    ERROR
-      AMP
-    WHITESPACE
-    FIELD_DEF
-      LINE_COMMENT
-      WHITESPACE
-      IDENT
-      COLON
-      ERROR
-    WHITESPACE
-    R_CURLY
-```
-
-Note several features of the tree:
-
-* All whitespace and comments are explicitly accounted for.
-
-* The node for `STRUCT_DEF` contains the error element for `&`, but
-  still represents the following field correctly.
-
-* The second field of the struct is incomplete: `FIELD_DEF` node for
-  it contains an `ERROR` element, but nevertheless has the correct
-  `NodeKind`.
-
-* The non-documenting comment is correctly attached to the following
-  field.
-
-
-## Typed Tree
-
-It's hard to work with this raw parse tree, because it is untyped:
-node containing a struct definition has the same API as the node for
-the struct field. But it's possible to add a strongly typed layer on
-top of this raw tree, and get a zero-cost AST. Here is an example
-which adds type-safe wrappers for structs and fields:
-
-```rust
-// generic infrastructure
-
-pub trait AstNode<'f>: Copy + 'f {
-    fn new(node: Node<'f>) -> Option<Self>;
-    fn node(&self) -> Node<'f>;
-}
-
-pub fn child_of_kind<'f>(node: Node<'f>, kind: NodeKind) -> Option<Node<'f>> {
-    node.children().find(|child| child.kind() == kind)
-}
-
-pub fn ast_children<'f, A: AstNode<'f>>(node: Node<'f>) -> Box<Iterator<Item=A> + 'f> {
-    Box::new(node.children().filter_map(A::new))
-}
-
-// AST elements, specific to Rust
-
-#[derive(Clone, Copy)]
-pub struct StructDef<'f>(Node<'f>);
-
-#[derive(Clone, Copy)]
-pub struct FieldDef<'f>(Node<'f>);
-
-#[derive(Clone, Copy)]
-pub struct TypeRef<'f>(Node<'f>);
-
-pub trait NameOwner<'f>: AstNode<'f> {
-    fn name_ident(&self) -> Node<'f> {
-        child_of_kind(self.node(), IDENT).unwrap()
-    }
-
-    fn name(&self) -> &'f str { self.name_ident().text() }
-}
-
-
-impl<'f> AstNode<'f> for StructDef<'f> {
-    fn new(node: Node<'f>) -> Option<Self> {
-        if node.kind() == STRUCT_DEF { Some(StructDef(node)) } else { None }
-    }
-    fn node(&self) -> Node<'f> { self.0 }
-}
-
-impl<'f> NameOwner<'f> for StructDef<'f> {}
-
-impl<'f> StructDef<'f> {
-    pub fn fields(&self) -> Box<Iterator<Item=FieldDef<'f>> + 'f> {
-        ast_children(self.node())
-    }
-}
-
-
-impl<'f> AstNode<'f> for FieldDef<'f> {
-    fn new(node: Node<'f>) -> Option<Self> {
-        if node.kind() == FIELD_DEF { Some(FieldDef(node)) } else { None }
-    }
-    fn node(&self) -> Node<'f> { self.0 }
-}
-
-impl<'f> FieldDef<'f> {
-    pub fn type_ref(&self) -> Option<TypeRef<'f>> {
-        ast_children(self.node()).next()
-    }
-}
-
-impl<'f> NameOwner<'f> for FieldDef<'f> {}
-
-
-impl<'f> AstNode<'f> for TypeRef<'f> {
-    fn new(node: Node<'f>) -> Option<Self> {
-        if node.kind() == TYPE_REF { Some(TypeRef(node)) } else { None }
-    }
-    fn node(&self) -> Node<'f> { self.0 }
-}
-```
-
-Note that although AST wrappers provide a type-safe access to the
-tree, they are still represented as indexes, so clients of the syntax
-tree can easily associate additional data with AST nodes by storing
-it in a side-table.
-
-
-## Missing Source Code
-
-The crucial feature of this syntax tree is that it is just a view into
-the original source code. And this poses a problem for the Rust
-language, because not all compiled Rust code is represented in the
-form of source code! Specifically, Rust has a powerful macro system,
-which effectively allows to create and parse additional source code at
-compile time. It is not entirely clear that the proposed parsing
-framework is able to handle this use case, and it's the main purpose
-of this RFC to figure it out. The current idea for handling macros is
-to make each macro expansion produce a triple of (expansion text,
-syntax tree, hygiene information), where hygiene information is a side
-table, which colors different ranges of the expansion text according
-to the original syntactic context.
-
-
-## Implementation plan
-
-This RFC proposes huge changes to the internals of the compiler, so
-it's important to proceed carefully and incrementally. The following
-plan is suggested:
-
-* RFC discussion about the theoretical feasibility of the proposal,
-  and the best representation for the syntax tree.
-
-* Implementation of the proposal as a completely separate crates.io
-  crate, by refactoring existing libsyntax source code to produce a
-  new tree.
-
-* A prototype implementation of the macro expansion on top of the new
-  syntax tree.
-
-* Additional round of discussion/RFC about merging with the mainline
-  compiler.
-
-
-# Drawbacks
-[drawbacks]: #drawbacks
-
-- No harm will be done as long as the new libsyntax exists as an
-  experiment on crates.io. However, actually using it in the compiler
-  and other tools would require massive refactorings.
-
-- It's difficult to know upfront if the proposed syntax tree would
-  actually work well in both the compiler and IDE. It may be possible
-  that some drawbacks will be discovered during implementation.
-
-
-# Rationale and alternatives
-[alternatives]: #alternatives
-
-- Incrementally add more information about source code to the current
-  AST.
-
-- Move the current libsyntax to crates.io as is. In the past, there
-  were several failed attempts to do that.
-
-- Explore alternative representations for the parse tree.
-
-- Use parser generator instead of hand written parser. Using the
-  parser from libsyntax directly would be easier, and hand-written
-  LL-style parsers usually have much better error recovery than
-  generated LR-style ones.
-
-# Unresolved questions
-[unresolved]: #unresolved-questions
-
-- Is it at all possible to represent Rust parser as a pure function of
-  the source code? It seems like the answer is yes, because the
-  language and especially macros were cleverly designed with this
-  use-case in mind.
-
-
-- Is it possible to implement macro expansion using the proposed
-  framework? This is the main question of this RFC. The proposed
-  solution of synthesizing source code on the fly seems workable: it's
-  not that different from the current implementation, which
-  synthesizes token trees.
-
-
-- How to actually phase out current libsyntax, if libsyntax2.0 turns
-  out to be a success?
diff --git a/docs/TESTS.md b/docs/TESTS.md
deleted file mode 100644
index a9d32d1d410..00000000000
--- a/docs/TESTS.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# libsyntax2.0 testing infrastructure
-
-Libsyntax2.0 tests are in the `tests/data` directory. Each test is a
-pair of files, an `.rs` file with Rust code and a `.txt` file with a
-human-readable representation of syntax tree.
-
-The test suite is intended to be independent from a particular parser:
-that's why it is just a list of files.
-
-The test suite is intended to be progressive: that is, if you want to
-write a Rust parser, you can TDD it by working through the test in
-order. That's why each test file begins with the number. Generally,
-tests should be added in order of the appearance of corresponding
-functionality in libsyntax2.0. If a bug in parser is uncovered, a
-**new** test should be created instead of modifying an existing one:
-it is preferable to have a gazillion of small isolated test files,
-rather than a single file which covers all edge cases. It's okay for
-files to have the same name except for the leading number. In general,
-test suite should be append-only: old tests should not be modified,
-new tests should be created instead.
-
-Note that only `ok` tests are normative: `err` tests test error
-recovery and it is totally ok for a parser to not implement any error
-recovery at all. However, for libsyntax2.0 we do care about error
-recovery, and we do care about precise and useful error messages.
-
-There are also so-called "inline tests". They appear as the comments
-with a `test` header in the source code, like this:
-
-```rust
-// test fn_basic
-// fn foo() {}
-fn function(p: &mut Parser) {
-    // ...
-}
-```
-
-You can run `cargo collect-tests` command to collect all inline tests
-into `tests/data/inline` directory. The main advantage of inline tests
-is that they help to illustrate what the relevant code is doing.
-
-
-Contribution opportunity: design and implement testing infrastructure
-for validators.
diff --git a/docs/TOOLS.md b/docs/TOOLS.md
deleted file mode 100644
index f8754c06fe4..00000000000
--- a/docs/TOOLS.md
+++ /dev/null
@@ -1,36 +0,0 @@
-# Tools used to implement libsyntax
-
-libsyntax uses several tools to help with development.
-
-Each tool is a binary in the [tools/](../tools) package.
-You can run them via `cargo run` command.
-
-```
-cargo run --package tools --bin tool
-```
-
-There are also aliases in [.cargo/config](../.cargo/config),
-so the following also works:
-
-```
-cargo tool
-```
-
-
-## Tool: `gen`
-
-This tool reads a "grammar" from [grammar.ron](../grammar.ron) and
-generates the `syntax_kinds.rs` file. You should run this tool if you
-add new keywords or syntax elements.
-
-
-## Tool: `parse`
-
-This tool reads rust source code from the standard input, parses it,
-and prints the result to stdout.
-
-
-## Tool: `collect-tests`
-
-This tool collects inline tests from comments in libsyntax2 source code
-and places them into `tests/data/inline` directory.