spiking incremental java compilation


spiking incremental java compilation

Adam Murdoch
Hi,

Just some thoughts on how we might spike a solution for incremental java compilation, to see if it’s worthwhile and what the effort might be:

The goal is to improve the Java compile tasks, so that they do less work for certain kinds of changes. Here, ‘less work’ means compiling fewer source files, and also touching fewer output files so that consumers of the task output can also do less work. It doesn’t mean compiling the *fewest* possible source files - just fewer than we do now.

The basic approach comes down to keeping track of dependencies between source files and the other compilation inputs - where inputs are source files, the compile classpath, the compile settings, and so on. Then, when an input changes, we would recompile the source files that depend on that input. Currently, we assume that every source file depends on every input, so that when an input changes we recompile everything.

Note that we don’t necessarily need to track dependencies at a fine-grained level. For example, we may track dependencies between packages rather than classes, or we may continue to assume that every source file depends on every class in the compile classpath.

A basic solution would look something like:

1. Determine which inputs have changed.
2. If the compile settings have changed, or if we don’t have any history, then schedule every source file for compilation, and skip to #5.
3. If a class in the compile classpath has changed, then schedule for compilation every source file that depends on this class.
4. If a source file has changed, then schedule for compilation every source file that depends on the classes of the source file.
5. For each source file scheduled for compilation, remove the previous output for that source file.
6. Invoke the compiler.
7. For each successfully compiled source file, extract the dependency information for the classes in the source file and persist this for next time.

For the above, “depends on” includes indirect dependencies.
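
The seven steps above can be sketched roughly as follows. This is a minimal illustration, not Gradle code: the names are mine, and the transitive dependency lookup is passed in as a function so that steps 5-7 stay out of the picture.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Rough sketch of steps 2-4: deciding which source files to schedule.
// "dependentsOf" is assumed to return the transitive set of source files
// that depend on a given input (a classpath class or a source file).
class IncrementalCompilePlan {

    static Set<String> filesToCompile(boolean settingsChanged,
                                      boolean historyAvailable,
                                      Set<String> allSources,
                                      Set<String> changedClasspathClasses,
                                      Set<String> changedSources,
                                      Function<String, Set<String>> dependentsOf) {
        // Step 2: settings changed, or no history -> schedule everything.
        if (settingsChanged || !historyAvailable) {
            return new HashSet<>(allSources);
        }
        Set<String> scheduled = new HashSet<>();
        // Step 3: a changed classpath class invalidates its dependents.
        for (String cls : changedClasspathClasses) {
            scheduled.addAll(dependentsOf.apply(cls));
        }
        // Step 4: a changed source file invalidates itself, plus the
        // dependents of the classes it contains.
        for (String source : changedSources) {
            scheduled.add(source);
            scheduled.addAll(dependentsOf.apply(source));
        }
        // Steps 5-7 (delete stale outputs, invoke the compiler, persist the
        // extracted dependencies) would consume this set.
        return scheduled;
    }
}
```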

Steps #1 and #2 are already covered by the incremental task API, at least enough to spike this.

Step #3 isn’t quite as simple as it is described above:
- First, we can ignore changes to a class with a given name if a class with the same name appears earlier on the classpath (which, for this purpose, includes the source files).
- If a class is removed, this counts as a ‘change’, so that we recompile any source files that used to depend on this class.
- If a class is added before some other class with the same name in the classpath, then we recompile any source files that used to depend on the old class.
- Dependencies can travel through other classes in the classpath, or source files, or a combination of both (e.g. a source class depends on a classpath class depends on a source class depends on a classpath class).
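
The shadowing rules above can be sketched by resolving each class name to its first occurrence on the (source-files-first) classpath and diffing that effective view between builds. An illustrative model, not Gradle code: each classpath entry is a map from class name to a content hash, in classpath order.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// "First occurrence wins": a change to a class only matters if that class is
// the first occurrence of its name on the classpath. Additions, removals,
// and reorderings all fall out of comparing the effective view.
class EffectiveClasspath {

    // Resolve each class name to the hash of its first occurrence.
    static Map<String, String> resolve(List<Map<String, String>> entries) {
        Map<String, String> visible = new LinkedHashMap<>();
        for (Map<String, String> entry : entries) {
            for (Map.Entry<String, String> cls : entry.entrySet()) {
                visible.putIfAbsent(cls.getKey(), cls.getValue());
            }
        }
        return visible;
    }

    // A class has effectively "changed" (including being shadowed or exposed
    // by an addition or removal) when its resolved hash differs.
    static Set<String> changedClasses(List<Map<String, String>> before,
                                      List<Map<String, String>> after) {
        Map<String, String> oldVisible = resolve(before);
        Map<String, String> newVisible = resolve(after);
        Set<String> allNames = new HashSet<>(oldVisible.keySet());
        allNames.addAll(newVisible.keySet());
        Set<String> changed = new HashSet<>();
        for (String name : allNames) {
            if (!Objects.equals(oldVisible.get(name), newVisible.get(name))) {
                changed.add(name);
            }
        }
        return changed;
    }
}
```

Removals and additions then need no special cases: they simply show up as a different effective hash for the name in question.
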

Step #4 is similar to step #3.

For a spike, it might be worth simply invalidating everything when the compile classpath changes, and only dealing with changes in the source files.

For step #7 we have three basic approaches for extracting the dependencies:

The first approach is to use ASM to extract the dependencies from the byte code after compilation. The upside is that this is very simple to implement and very fast. We have an implementation already that we use in the tooling API (ClasspathInferer - but it’s mixed in with some other stuff). It also works for things that we only have the byte code for.

The downside is that it’s lossy: the compiler inlines constants into the byte code and discards source-only annotations. We also don’t easily know what type of dependency it is (is it an implementation detail, or is it visible in the API of the class?).

Both of these downsides can be addressed. For example, we might treat a class with a constant field, or a class for a source-only annotation, as a dependency of every source file, so that when one of these things changes, we would recompile everything. And to determine the type of dependency, we just need to dig deeper into the byte code.

The second approach is to use the compiler API that we are already using to invoke the compiler to query the dependencies during compilation. The upside is that we get the full source dependency information. The downsides are that we have to use a Sun-specific extension of the compiler API to do this, and it’s a very complicated API, which makes it fiddly to get right.

The third approach is to parse and analyse the source separately from compilation.

I’d probably try out the first option, as it’s the simplest to implement and probably the fastest at execution time.

There are some issues around making this efficient.

First, we need to make the persistence mechanism fast. For the spike, let’s assume we can do this. I would just keep the state in some static field somewhere and not bother with persistence.

Second, we need to make the calculation of affected source files fast. One option is to calculate this when something changes rather than each time we run the compilation task, so that we keep, basically, a map from input file to the closure of all source files affected by that input file.
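
That map could be built by reversing the direct dependency graph and caching one traversal per input, along these lines (an illustrative sketch, with class names standing in for files):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// For every input, precompute the closure of source classes affected by a
// change to it. Built once when the dependency information changes, so that
// scheduling at compile time becomes a cheap lookup.
class AffectedClosureIndex {

    private final Map<String, Set<String>> affectedClosure = new HashMap<>();

    AffectedClosureIndex(Map<String, Set<String>> directDeps) {
        // Reverse the edges: input -> its direct dependents.
        Map<String, Set<String>> dependents = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : directDeps.entrySet()) {
            for (String dep : e.getValue()) {
                dependents.computeIfAbsent(dep, k -> new HashSet<>()).add(e.getKey());
            }
        }
        // One traversal per input, cached for later lookups.
        for (String input : dependents.keySet()) {
            Set<String> closure = new HashSet<>();
            Deque<String> queue = new ArrayDeque<>(dependents.get(input));
            while (!queue.isEmpty()) {
                String s = queue.poll();
                if (closure.add(s)) {
                    queue.addAll(dependents.getOrDefault(s, Set.of()));
                }
            }
            affectedClosure.put(input, closure);
        }
    }

    Set<String> affectedBy(String input) {
        return affectedClosure.getOrDefault(input, Set.of());
    }
}
```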

Third, we need to keep the dependency graph as small as we can. So, we might play around with tracking dependencies between packages rather than classes. We should also ignore dependencies that are not visible to the consumer, so that we don’t traverse the dependencies of method bodies, or private elements.
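
Collapsing the class graph to a package graph is mechanical; something like this sketch (hypothetical helper, assuming fully qualified class names):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Shrink the dependency graph by collapsing class-level edges to
// package-level edges. Edges within a single package are dropped, which is
// where most of the reduction comes from.
class PackageGraph {

    static Map<String, Set<String>> collapse(Map<String, Set<String>> classDeps) {
        Map<String, Set<String>> packageDeps = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : classDeps.entrySet()) {
            String fromPackage = packageOf(e.getKey());
            for (String dep : e.getValue()) {
                String toPackage = packageOf(dep);
                if (!fromPackage.equals(toPackage)) {
                    packageDeps.computeIfAbsent(fromPackage, k -> new HashSet<>())
                               .add(toPackage);
                }
            }
        }
        return packageDeps;
    }

    static String packageOf(String className) {
        int i = className.lastIndexOf('.');
        return i < 0 ? "" : className.substring(0, i);
    }
}
```

The trade-off is exactly the one described above: fewer nodes and edges to store and traverse, at the cost of recompiling every class in an affected package.
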

Finally, we should ignore changes that are not visible to the consumer, so that we ignore changes to method bodies, private elements of a class, the annotations of classes, debug info and so on. This is relatively easy for changes to the compile classpath. For changes to source files, it’s a bit trickier, as we don’t know what’s changed until we compile the source file. We could, potentially, compile in two passes - first the source files that have changed, and then the source files that have not changed but depend on those that have. Something, potentially, to play with as part of a spike.
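
One way to decide whether a recompiled class changed visibly is to hash only its consumer-visible surface and compare against the previous hash. A reflection-based illustration; a real implementation would read this from the byte code and would need to cover more than is shown here (generic signatures, thrown exceptions, and so on):

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// A "visible API" signature: only public/protected members and supertypes
// contribute, so changes to method bodies, private members and the like
// produce the same signature and need not trigger downstream recompilation.
class ApiSignature {

    static String of(Class<?> cls) {
        List<String> parts = new ArrayList<>();
        parts.add("super:" + (cls.getSuperclass() == null ? "-" : cls.getSuperclass().getName()));
        for (Class<?> iface : cls.getInterfaces()) {
            parts.add("iface:" + iface.getName());
        }
        for (Method m : cls.getDeclaredMethods()) {
            if (visible(m.getModifiers())) {
                parts.add("method:" + m.getName()
                        + Arrays.toString(m.getParameterTypes())
                        + "->" + m.getReturnType().getName());
            }
        }
        for (Field f : cls.getDeclaredFields()) {
            if (visible(f.getModifiers())) {
                parts.add("field:" + f.getName() + ":" + f.getType().getName());
            }
        }
        Collections.sort(parts); // declaration order must not matter
        return String.join(";", parts);
    }

    private static boolean visible(int modifiers) {
        return Modifier.isPublic(modifiers) || Modifier.isProtected(modifiers);
    }

    // Two versions that differ only in private details -> same signature.
    static class V1 { public int visible() { return 1; } private int hidden; }
    static class V2 { public int visible() { return 2; } private long other; }
}
```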


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com




Re: spiking incremental java compilation

Adam Murdoch

On 20 Dec 2013, at 2:37 pm, Adam Murdoch <[hidden email]> wrote:

<snip>

Both these downsides can be addressed: for example, we might treat a class with a constant field or a class for a source-only annotation as a dependency of every source file, so that when one of these things changes, we would recompile everything.

We can fine-tune this: for constants we only need to recompile classes where the constant appears in the class’s constant pool, and probably only if the class has code.


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com




Re: spiking incremental java compilation

Luke Daley-2
In reply to this post by Adam Murdoch

On 20 Dec 2013, at 3:37, Adam Murdoch wrote:

> <snip>
>
> Second, we need to make the calculation of affected source files fast.
> One option is to calculate this when something changes rather than
> each time we run the compilation task, so that we keep, basically, a
> map from input file to the closure of all source files affected by
> that input file.

This is a direction we are no doubt going to go into anyway.

> Third, we need to keep the dependency graph as small as we can. So, we
> might play around with tracking dependencies between packages rather
> than classes.

Will be interesting to see how this works in the real world on nasty
code bases where packages are monolithic and have lots of dependencies.

> We should also ignore dependencies that are not visible to the
> consumer, so that we don’t traverse the dependencies of method
> bodies, or private elements.

What do you mean here?

> Finally, we should ignore changes that are not visible to the
> consumer, so that we ignore changes to method bodies, private elements
> of a class, the annotations of classes, debug info and so on. This is
> relatively easy for changes to the compile classpath. For changes to
> source files, it’s a bit trickier, as we don’t know what’s
> changed until we compile the source file. We could, potentially,
> compile in two passes - first source files that have changed and then
> second source files that have not change but depend on those that
> have. Something, potentially, to play with as part of a spike.


I'm pretty dubious about all of this. Looks to me like a difficult thing
to pull off outside of the compiler. I'm sure we can get something
working, but whether it's reliable enough and fast enough is another
question (hopefully answered by the spike). I also wonder whether
investing into more fine grained parallelism and coarser avoidance (e.g.
ignoring non visible classpath changes) wouldn't be more fruitful and
more generally applicable.

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email



Re: spiking incremental java compilation

Adam Murdoch

On 22 Dec 2013, at 12:49 am, Luke Daley <[hidden email]> wrote:


<snip>

Second, we need to make the calculation of affected source files fast. One option is to calculate this when something changes rather than each time we run the compilation task, so that we keep, basically, a map from input file to the closure of all source files affected by that input file.

This is a direction we are no doubt going to go into anyway.

Third, we need to keep the dependency graph as small as we can. So, we might play around with tracking dependencies between packages rather than classes.

Will be interesting to see how this works in the real world on nasty code bases where packages are monolithic and have lots of dependencies.

It’s all about trade-offs. The idea of tracking package dependencies is to make the graphs smaller, with fewer nodes and edges, meaning less work to figure out the things that definitely are up-to-date, at the cost of compiling some source files that might not be out-of-date. We’d have to measure and see.

The point was really that we don’t necessarily need to build a graph of individual source files to make compilation better, and given how fast the Java compiler is, it might be a better trade-off to chunk the source files to some degree.


We should also ignore dependencies that are not visible to the consumer, so that we don’t traverse the dependencies of method bodies, or private elements.

What do you mean here?

There are lots of dependencies of a class that aren’t visible at compile time to any consumer of that class. For example:

- A class that is only referenced in a method body.
- A class that is only referenced in a signature of a private element
- Annotations (except @Deprecated)

So, if we have something like:

class A extends B { }

class B { void someMethod() { new C(); } }

Then C is not visible to A via B, and when C changes we shouldn’t recompile A, but we should recompile B.

However, when a dependency may be visible to a consumer, then we should traverse that dependency:

class A extends B { }

class B extends C { }

Then C is visible to A via B and when C changes we should recompile both A and B.
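
The two examples above can be sketched as a graph whose edges are labelled API (visible in signatures, e.g. ‘extends’) or implementation-only (method bodies, private elements). A change to C must recompile every direct dependent of any class whose API transitively exposes C. An illustrative model, not Gradle code:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Dependency edges carry visibility. In "A extends B; B's method body uses C"
// only B recompiles when C changes; in "A extends B; B extends C" both do.
class VisibilityAwareInvalidation {

    final Map<String, Set<String>> apiDependents = new HashMap<>(); // Y -> classes exposing Y in their API
    final Map<String, Set<String>> allDependents = new HashMap<>(); // Y -> all direct dependents

    void addDependency(String from, String to, boolean visibleInApi) {
        allDependents.computeIfAbsent(to, k -> new HashSet<>()).add(from);
        if (visibleInApi) {
            apiDependents.computeIfAbsent(to, k -> new HashSet<>()).add(from);
        }
    }

    Set<String> toRecompile(String changed) {
        // Classes whose API (transitively) exposes the changed class.
        Set<String> apiExposing = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(changed));
        while (!queue.isEmpty()) {
            for (String dep : apiDependents.getOrDefault(queue.poll(), Set.of())) {
                if (apiExposing.add(dep)) {
                    queue.add(dep);
                }
            }
        }
        // Recompile every direct dependent of the change and of any class
        // whose signatures mention it, plus those exposing classes themselves.
        Set<String> result = new HashSet<>(allDependents.getOrDefault(changed, Set.of()));
        for (String cls : apiExposing) {
            result.add(cls);
            result.addAll(allDependents.getOrDefault(cls, Set.of()));
        }
        return result;
    }
}
```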



Finally, we should ignore changes that are not visible to the consumer, so that we ignore changes to method bodies, private elements of a class, the annotations of classes, debug info and so on. This is relatively easy for changes to the compile classpath. For changes to source files, it’s a bit trickier, as we don’t know what’s changed until we compile the source file. We could, potentially, compile in two passes - first the source files that have changed, and then the source files that have not changed but depend on those that have. Something, potentially, to play with as part of a spike.


I'm pretty dubious about all of this. Looks to me like a difficult thing to pull off outside of the compiler. I'm sure we can get something working, but whether it's reliable enough and fast enough is another question (hopefully answered by the spike).

We can make it reliable. The byte code format makes it quite easy and very fast to extract the compile-time dependencies of a type - you slurp up the CONSTANT_Class entries in the constant pool. You have to invalidate more source files than strictly necessary when something the compiler inlines, such as a constant, changes. But this is just part of the trade-off.
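
Slurping up the CONSTANT_Class entries really is only a few dozen lines of plain class-file parsing, no ASM required. A sketch, with tag sizes per the JVM class file format and error handling kept minimal:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Extract a class's compile-time type references by walking the constant
// pool and collecting CONSTANT_Class entries. Entry sizes are fixed per
// tag, so this is a single cheap linear pass over the pool.
class ClassReferences {

    static Set<String> of(byte[] classFile) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(classFile));
            if (in.readInt() != 0xCAFEBABE) {
                throw new IllegalArgumentException("not a class file");
            }
            in.readInt();                            // minor + major version
            int poolCount = in.readUnsignedShort();  // entries + 1
            Map<Integer, String> utf8 = new HashMap<>();
            List<Integer> classNameIndexes = new ArrayList<>();
            for (int i = 1; i < poolCount; i++) {
                int tag = in.readUnsignedByte();
                switch (tag) {
                    case 1 -> utf8.put(i, in.readUTF());                    // Utf8
                    case 7 -> classNameIndexes.add(in.readUnsignedShort()); // Class
                    case 8, 16, 19, 20 -> in.skipBytes(2); // String, MethodType, Module, Package
                    case 15 -> in.skipBytes(3);            // MethodHandle
                    case 3, 4 -> in.skipBytes(4);          // Integer, Float
                    case 9, 10, 11, 12, 17, 18 -> in.skipBytes(4); // member refs, NameAndType, (Invoke)Dynamic
                    case 5, 6 -> { in.skipBytes(8); i++; } // Long, Double occupy two slots
                    default -> throw new IllegalArgumentException("unknown tag " + tag);
                }
            }
            Set<String> names = new TreeSet<>();
            for (int index : classNameIndexes) {
                names.add(utf8.get(index)); // internal names, e.g. java/lang/Object
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```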

We can also make it reliable using the compiler API during compilation. It’s just more fiddly, and doesn’t work when we aren’t responsible for compilation.

Fast is another story, and that’s something for the spike to answer. I’m pretty confident.

Something to remember is that the goal here is not just about making compilation faster - it’s also about reducing the impact on things that use the compiled classes. So, it would be totally fine if we actually made compilation slightly slower, provided we saw a nice reduction in the average total build time.


I also wonder whether investing into more fine grained parallelism and coarser avoidance (e.g. ignoring non visible classpath changes) wouldn't be more fruitful and more generally applicable.

The coarser avoidance is part of this work. It has to be to make it efficient.


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com




Re: spiking incremental java compilation

Hans Dockter-2
In reply to this post by Luke Daley-2

On Saturday, December 21, 2013, Luke Daley wrote:

<snip>

I'm pretty dubious about all of this. Looks to me like a difficult thing to pull off outside of the compiler. I'm sure we can get something working, but whether it's reliable enough and fast enough is another question (hopefully answered by the spike). I also wonder whether investing into more fine grained parallelism and coarser avoidance (e.g. ignoring non visible classpath changes) wouldn't be more fruitful and more generally applicable.

There are very different scenarios where this is helpful.

- There are some large Gradle installations out there with massive single source trees with compile times of 2 minutes and more. 
- You also have to see this in the context of our upcoming continuous mode and deeper IDE integration, where, on the one hand, Gradle compilation is used with high frequency, and on the other, we don't have to scan the filesystem and can keep the graph in memory.

Last but not least, it is long overdue that we make this graph available for other important functionality:

- Unused elements in the classpath
- Enforcing API compatibility between modules
- Incremental testing
- A lot of other interesting analytics tasks

We have postponed this for years with the valid argument that, from a pure performance perspective, there are bigger fish to fry. I think that argument is still valid. But considering the other extremely important functionality we can get out of this, and given that we will tackle fine-grained parallelism soon in any case, I think it is time to tackle this.

In fact, I would like to see at least some of the non-performance-related stories developed from the very beginning.

Hans


Re: spiking incremental java compilation

Hans Dockter-2
In reply to this post by Adam Murdoch


On Friday, December 20, 2013, Adam Murdoch wrote:
<snip>

Note that we don’t necessarily need to track dependencies at a fine-grained level. For example, we may track dependencies between packages rather than classes,

That is, I think, the approach of the incremental compile functionality used in the OpenJDK build.

Hans
 

Re: spiking incremental java compilation

Adam Murdoch
In reply to this post by Hans Dockter-2

On 24 Dec 2013, at 6:12 pm, Hans Dockter <[hidden email]> wrote:


<snip>

Last but not least, it is long overdue that we make this graph available for other important functionality:

- Unused elements in the classpath
- Enforcing API compatibility between modules
- Incremental Testing
- A lot of other interesting analytics tasks

- real conflict detection
- some interesting api-implementation separation options
- better handling for some interesting ways of structuring code, such as all source files in the same physical source directory, while keeping separation between the logically separate things.
- possibly what you meant by ‘analytics tasks’, but this allows us to answer questions such as ‘which other teams use this method and where?’ or do things like ‘test only the downstream dependencies that are affected by this change’.


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com




Re: spiking incremental java compilation

Niclas Hedhman-2

For "only testing downstream dependencies" you will have a challenge, since tests often involve Reflection (via Spring, for instance), and it might become hard for people to set this up correctly.

In the "performance space", I would like to suggest that you add some type of 'profiler' that can collect statistics for you (perhaps optionally) and have people send those back to you for analysis.

I know, for instance, that what our projects at work suffer from is completely different from what my open source projects suffer from. And you should probably get yourself a better/wider view of what is really going on "out there" in real numbers, rather than benchmarking and sampling a few select projects.


Cheers
Niclas








--
Niclas Hedhman, Software Developer
http://www.qi4j.org - New Energy for Java

I  live here; http://tinyurl.com/2qq9er
I  work here; http://tinyurl.com/2ymelc
I relax here; http://tinyurl.com/2cgsug

Re: spiking incremental java compilation

Hans Dockter-2


On Sunday, December 29, 2013, Niclas Hedhman wrote:

For "only testing downstream dependencies" you will have a challenge, since tests often involve Reflection (via Spring, for instance), and it might become hard for people to set this up correctly.

In the "performance space", I would like to suggest that you add some type of 'profiler' that can collect statistics for you (perhaps optionally) and have people send those back to you for analysis.

That is something we are working on. 
 

I know, for instance, that what our projects at work suffer from is completely different from what my open source projects suffer from. And you should probably get yourself a better/wider view of what is really going on "out there" in real numbers, rather than benchmarking and sampling a few select projects.

I agree. That would make a lot of sense.

Hans
 



Re: spiking incremental java compilation

Szczepan Faber-2
In reply to this post by Adam Murdoch
I’d probably try out the first option, as it’s the simplest to implement and probably the fastest at execution time.

Why do you think it is the fastest at execution time? Does fiddling with the compiler API and getting dependency tree info increase the overhead?

Cheers!
--
Szczepan Faber
Principal engineer@gradle; Founder@mockito

Re: spiking incremental java compilation

Adam Murdoch

On 15 Jan 2014, at 3:58 am, Szczepan Faber <[hidden email]> wrote:

I’d probably try out the first option, as it’s the simplest to implement and probably the fastest at execution time.

Why do you think it is the fastest at execution time? Does fiddling with the compiler API and getting dependency tree info increase the overhead?

It’s just a guess, based on the fact that it’s simpler and streaming, and that asm is fast where we use it for this kind of thing (e.g. test detection). We’d have to measure and see.


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com




Re: spiking incremental java compilation

Szczepan Faber-2
In reply to this post by Adam Murdoch



On Fri, Dec 20, 2013 at 4:37 AM, Adam Murdoch <[hidden email]> wrote:
Hi,

Just some thoughts on how we might spike a solution for incremental java compilation, to see if it’s worthwhile and what the effort might be:

The goal is to improve the Java compile tasks, so that they do less work for certain kinds of changes. Here, ‘less work’ means compiling fewer source files, and also touching fewer output files so that consumers of the task output can also do less work. It doesn’t mean compiling the *fewest* possible number of source files - just fewer than we do now.

The basic approach comes down to keeping track of dependencies between source files and the other compilation inputs - where inputs are source files, the compile classpath, the compile settings, and so on. Then, when an input changes, we would recompile the source files that depend on that input. Currently, we assume that every source file depends on every input, so that when an input changes we recompile everything.

Note that we don’t necessarily need to track dependencies at a fine-grained level. For example, we may track dependencies between packages rather than classes, or we may continue to assume that every source file depends on every class in the compile classpath.

A basic solution would look something like:

1. Determine which inputs have changed.
2. If the compile settings have changed, or if we don’t have any history, then schedule every source file for compilation, and skip to #5.
3. If a class in the compile classpath has changed, then schedule for compilation every source file that depends on this class.
4. If a source file has changed, then schedule for compilation every source file that depends on the classes of the source file.
5. For each source file scheduled for compilation, remove the previous output for that source file.
6. Invoke the compiler.
7. For each successfully compiled source file, extract the dependency information for the classes in the source file and persist this for next time.

For the above, “depends on” includes indirect dependencies.
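Put together, the scheduling part of steps #1 to #4 could be sketched roughly as below. This is only a sketch against assumed data structures - the class name, the `dependents` map (class name to the source files that directly use it) and the `classesOf` map (source file to the classes it declares) are all illustrative, not anything that exists in Gradle - and the output removal and compiler invocation of steps #5 to #7 are left out:

```java
import java.util.*;

// Sketch of the scheduling steps: walk changed classes transitively, because a
// recompiled source file's own classes count as "changed" for its dependents,
// which is how "depends on" comes to include indirect dependencies.
public class RecompileScheduler {

    public static Set<String> schedule(
            Set<String> allSourceFiles,
            boolean settingsChangedOrNoHistory,      // step 2
            Set<String> changedClasspathClasses,     // step 3
            Set<String> changedSourceFiles,          // step 4
            Map<String, Set<String>> dependents,     // class name -> dependent source files
            Map<String, Set<String>> classesOf) {    // source file -> declared classes

        if (settingsChangedOrNoHistory) {
            return new HashSet<>(allSourceFiles);    // no history: recompile everything
        }

        Deque<String> changedClasses = new ArrayDeque<>(changedClasspathClasses);
        for (String source : changedSourceFiles) {
            changedClasses.addAll(classesOf.getOrDefault(source, Set.of()));
        }

        Set<String> toCompile = new HashSet<>(changedSourceFiles);
        Set<String> seenClasses = new HashSet<>(changedClasses);

        while (!changedClasses.isEmpty()) {
            String clazz = changedClasses.removeFirst();
            for (String source : dependents.getOrDefault(clazz, Set.of())) {
                if (toCompile.add(source)) {
                    // this source file will be recompiled, so its classes are
                    // effectively changed too and must be propagated
                    for (String declared : classesOf.getOrDefault(source, Set.of())) {
                        if (seenClasses.add(declared)) {
                            changedClasses.addLast(declared);
                        }
                    }
                }
            }
        }
        return toCompile;
    }
}
```

For example, with A.java declaring class A, B.java using A, and C.java using B, a change to A.java would schedule all three files.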

Steps #1 and #2 are already covered by the incremental task API, at least enough to spike this.

Step #3 isn’t quite as simple as it is described above:
- Firstly, we can ignore changes for a class with a given name, if a class with the same name appears before it in the classpath (this includes the source files).
- If a class is removed, this counts as a ‘change’, so that we recompile any source files that used to depend on this class.
- If a class is added before some other class with the same name in the classpath, then we recompile any source files that used to depend on the old class.
- Dependencies can travel through other classes in the classpath, or source files, or a combination of both (e.g. a source class depends on a classpath class depends on a source class depends on a classpath class).
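The shadowing rule in the first points above could look something like this sketch, which models each classpath entry (including the source set, as entry 0) simply as the set of class names it provides:

```java
import java.util.*;

// Sketch of the classpath shadowing rule: a change to `className` in classpath
// entry `changedEntry` only matters if no earlier entry also provides a class
// with that name, since the earlier entry wins at resolution time.
public class ShadowingCheck {

    /** entries are in classpath order; each maps to the class names it provides. */
    public static boolean changeIsVisible(List<Set<String>> entries,
                                          int changedEntry,
                                          String className) {
        for (int i = 0; i < changedEntry; i++) {
            if (entries.get(i).contains(className)) {
                return false; // masked by an earlier entry
            }
        }
        return true;
    }
}
```

The removal and addition cases described above then fall out of running this check against both the old and the new classpath ordering.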

Re whole #3 step

Do you have some thoughts on how to figure out what classes have changed in the compile classpath, given that the classpath consists mostly of jars and the incremental API currently only tells us that jar X has changed? I can imagine that we have this information for project dependencies, because project dependencies are 'internal' jars, also compiled incrementally, so we should have all the information. This may be tricky for 'external' jars, though. Are you after a generic solution that works for internal and external jars? If so, do you have some thoughts on how to approach it?

In what ways do you see the incremental task API improving here?

Step #4 is similar to step #3.

For a spike, it might be worth simply invalidating everything when the compile classpath changes, and just deal with changes in the source files.

For step #7 we have three basic approaches for extracting the dependencies:

The first approach is to use asm to extract the dependencies from the byte code after compilation. The upside is that this is very simple to implement and very fast. We have an implementation already that we use in the tooling API (ClasspathInferer  - but it’s mixed in with some other stuff). It also works for things that we only have the byte code for.

The downside is that it’s lossy: the compiler inlines constants into the byte code and discards source-only annotations. We also don’t easily know what type of dependency it is (is it an implementation detail, or is it visible in the API of the class?).

Both these downsides can be addressed: for example, we might treat a class with a constant field, or a class for a source-only annotation, as a dependency of every source file, so that when one of these things changes, we would recompile everything. And to determine the type of dependency, we just need to dig deeper into the byte code.

The second approach is to use the compiler API (which we already use to invoke the compiler) to query the dependencies during compilation. The upside is that we get the full source dependency information. The downsides are that we have to use a sun-specific extension of the compiler API to do this, and it’s a very complicated API, which makes it fiddly to get right.

What compiler API do you mean? What is the downside of using the sun-specific extension here?
 
The third approach is to parse and analyse the source separately from compilation.

Do you mean the Compiler Tree API here?

Perhaps we can mix the approaches. E.g. for certain classes we could go for source analysis.

Source analysis might be interesting from the standpoint of the daemon's background activity. Imagine the daemon works out the class dependencies as the developer edits the java code.
 
I’d probably try out the first option, as it’s the simplest to implement and probably the fastest at execution time.

There are some issues around making this efficient. 
First, we need to make the persistence mechanism fast. For the spike, let’s assume we can do this. I would just keep the state in some static field somewhere and not bother with persistence.

We could use daemon's state as we do for task history cache.
 
Second, we need to make the calculation of affected source files fast. One option is to calculate this when something changes rather than each time we run the compilation task, so that we keep, basically, a map from input file to the closure of all source files affected by that input file.
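Such a precomputed map could look like this sketch, where the closure for every input class is derived once from the same assumed direct-dependency maps as above and then reused on every build, so the per-build lookup is a single map access:

```java
import java.util.*;

// Sketch of a precomputed affected-files map: for each input class, the full
// closure of source files that a change to it would invalidate. Rebuilt only
// when the dependency graph itself changes, not on every compile task run.
public class AffectedClosure {

    public static Map<String, Set<String>> build(
            Map<String, Set<String>> dependents,   // class -> source files that use it
            Map<String, Set<String>> classesOf) {  // source file -> classes it declares
        Map<String, Set<String>> closure = new HashMap<>();
        for (String clazz : dependents.keySet()) {
            Set<String> affected = new LinkedHashSet<>();
            Deque<String> work = new ArrayDeque<>(List.of(clazz));
            Set<String> seen = new HashSet<>(work);
            while (!work.isEmpty()) {
                for (String src : dependents.getOrDefault(work.removeFirst(), Set.of())) {
                    if (affected.add(src)) {
                        // a recompiled source file propagates through its own classes
                        for (String declared : classesOf.getOrDefault(src, Set.of())) {
                            if (seen.add(declared)) work.addLast(declared);
                        }
                    }
                }
            }
            closure.put(clazz, affected);
        }
        return closure;
    }
}
```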

Third, we need to keep the dependency graph as small as we can. So, we might play around with tracking dependencies between packages rather than classes. We should also ignore dependencies that are not visible to the consumer, so that we don’t traverse the dependencies of method bodies, or private elements.

Finally, we should ignore changes that are not visible to the consumer, so that we ignore changes to method bodies, private elements of a class, the annotations of classes, debug info and so on. This is relatively easy for changes to the compile classpath. For changes to source files, it’s a bit trickier, as we don’t know what’s changed until we compile the source file. We could, potentially, compile in two passes - first the source files that have changed, and then the source files that have not changed but depend on those that have. Something, potentially, to play with as part of a spike.

Can you elaborate on why it's easy to obtain this information from the compile classpath?
 


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com






--
Szczepan Faber
Principal engineer@gradle; Founder@mockito

Re: spiking incremental java compilation

Szczepan Faber-2
In reply to this post by Adam Murdoch
Both these downsides can be addressed: for example, we might treat a class with a constant field or a class for a source-only annotation as a dependency of every source file, so that when one of these things changes, we would recompile everything.

We can fine-tune this: for constants we only need to recompile classes where the constant appears in the class’s constant pool, and probably only if the class has code.

This is a lovely idea. I'm a bit worried about the constant-inlining limitation because of those legacy codebases obsessed with turning every literal into a constant. This trick can potentially mitigate it nicely.
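The check could be prototyped with plain JDK classes by walking a class file's constant pool directly, as in this sketch. One caveat: small numeric constants are inlined as iconst/bipush/sipush instructions rather than pool entries, so a scan like this only catches pool-resident constants; the sketch collects String constants as the simplest case:

```java
import java.io.*;
import java.util.*;

// Sketch of the constant-pool check: report every CONSTANT_String value in a
// .class file. If a changed String constant does not appear here, this class
// did not inline it and need not be recompiled for that change.
public class ConstantPoolScanner {

    public static Set<String> stringConstants(byte[] classFile) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(classFile));
        if (in.readInt() != 0xCAFEBABE) throw new IOException("not a class file");
        in.readUnsignedShort(); // minor version
        in.readUnsignedShort(); // major version
        int count = in.readUnsignedShort();
        String[] utf8 = new String[count];
        List<Integer> stringRefs = new ArrayList<>();
        for (int i = 1; i < count; i++) {
            int tag = in.readUnsignedByte();
            switch (tag) {
                case 1:  utf8[i] = in.readUTF(); break;                 // CONSTANT_Utf8
                case 8:  stringRefs.add(in.readUnsignedShort()); break; // CONSTANT_String
                case 7: case 16: case 19: case 20:
                    in.skipBytes(2); break;                             // 2-byte entries
                case 15: in.skipBytes(3); break;                        // MethodHandle
                case 3: case 4: case 9: case 10: case 11: case 12:
                case 17: case 18:
                    in.skipBytes(4); break;                             // 4-byte entries
                case 5: case 6:
                    in.skipBytes(8); i++; break;                        // Long/Double take 2 slots
                default: throw new IOException("unknown tag " + tag);
            }
        }
        Set<String> result = new LinkedHashSet<>();
        for (int ref : stringRefs) result.add(utf8[ref]);
        return result;
    }
}
```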

-- 
Szczepan Faber
Principal engineer@gradle; Founder@mockito

Re: spiking incremental java compilation

Adam Murdoch
In reply to this post by Szczepan Faber-2

On 16 Jan 2014, at 11:44 am, Szczepan Faber <[hidden email]> wrote:




On Fri, Dec 20, 2013 at 4:37 AM, Adam Murdoch <[hidden email]> wrote:
Hi,

Just some thoughts on how we might spike a solution for incremental java compilation, to see if it’s worthwhile and what the effort might be:

The goal is to improve the Java compile tasks, so that they do less work for certain kinds of changes. Here, ‘less work’ means compiling fewer source files, and also touching fewer output files so that consumers of the task output can also do less work. It doesn’t mean compiling the *fewest* possible number of source files - just fewer than we do now.

The basic approach comes down to keeping track of dependencies between source files and the other compilation inputs - where inputs are source files, the compile classpath, the compile settings, and so on. Then, when an input changes, we would recompile the source files that depend on that input. Currently, we assume that every source file depends on every input, so that when an input changes we recompile everything.

Note that we don’t necessarily need to track dependencies at a fine-grained level. For example, we may track dependencies between packages rather than classes, or we may continue to assume that every source file depends on every class in the compile classpath.

A basic solution would look something like:

1. Determine which inputs have changed.
2. If the compile settings have changed, or if we don’t have any history, then schedule every source file for compilation, and skip to #5.
3. If a class in the compile classpath has changed, then schedule for compilation every source file that depends on this class.
4. If a source file has changed, then schedule for compilation every source file that depends on the classes of the source file.
5. For each source file scheduled for compilation, remove the previous output for that source file.
6. Invoke the compiler.
7. For each successfully compiled source file, extract the dependency information for the classes in the source file and persist this for next time.

For the above, “depends on” includes indirect dependencies.

Steps #1 and #2 are already covered by the incremental task API, at least enough to spike this.

Step #3 isn’t quite as simple as it is described above:
- Firstly, we can ignore changes for a class with a given name, if a class with the same name appears before it in the classpath (this includes the source files).
- If a class is removed, this counts as a ‘change’, so that we recompile any source files that used to depend on this class.
- If a class is added before some other class with the same name in the classpath, then we recompile any source files that used to depend on the old class.
- Dependencies can travel through other classes in the classpath, or source files, or a combination of both (e.g. a source class depends on a classpath class depends on a source class depends on a classpath class).

Re whole #3 step

Do you have some thoughts on how to figure out what classes have changed in the compile classpath, given that the classpath consists mostly of jars and the incremental API currently only tells us that jar X has changed? I can imagine that we have this information for project dependencies, because project dependencies are 'internal' jars, also compiled incrementally, so we should have all the information. This may be tricky for 'external' jars, though. Are you after a generic solution that works for internal and external jars? If so, do you have some thoughts on how to approach it?

Not at this stage. To start with, it’ll be the compile task’s problem to scan the jars that have changed to decide which particular classes have changed or not, and to take care of persisting whatever it needs to make this decision.


In what ways do you see the incremental task API improving here?

After we’ve made some progress on this and the incremental native compilation, we can roll some stuff back into the incremental task API.


Step #4 is similar to step #3.

For a spike, it might be worth simply invalidating everything when the compile classpath changes, and just deal with changes in the source files.

For step #7 we have three basic approaches for extracting the dependencies:

The first approach is to use asm to extract the dependencies from the byte code after compilation. The upside is that this is very simple to implement and very fast. We have an implementation already that we use in the tooling API (ClasspathInferer  - but it’s mixed in with some other stuff). It also works for things that we only have the byte code for.

The downside is that it’s lossy: the compiler inlines constants into the byte code and discards source-only annotations. We also don’t easily know what type of dependency it is (is it an implementation detail, or is it visible in the API of the class?).

Both these downsides can be addressed: for example, we might treat a class with a constant field, or a class for a source-only annotation, as a dependency of every source file, so that when one of these things changes, we would recompile everything. And to determine the type of dependency, we just need to dig deeper into the byte code.

The second approach is to use the compiler API (which we already use to invoke the compiler) to query the dependencies during compilation. The upside is that we get the full source dependency information. The downsides are that we have to use a sun-specific extension of the compiler API to do this, and it’s a very complicated API, which makes it fiddly to get right.

What compiler API do you mean? What is the downside of using the sun-specific extension here?

The compiler tree API (the com.sun.source packages). The downside is that it's not guaranteed to be stable and it's not guaranteed to be present.

 
The third approach is to parse and analyse the source separately from compilation.

Do you mean the Compiler Tree API here?

No, I meant something like the source parser we use to extract the documentation metadata for the DSL generation. In fact, this might be a better option than using the compiler tree API if we go for source analysis - it already does much of the work, and it happens to understand Groovy too.
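As an illustration of how little is needed for a first cut at source analysis, a naive scanner could pull the import statements out of a source file. This is only an approximation (it misses same-package references, wildcard imports, and fully qualified names used in code), but it shows the kind of information a real source parser would produce:

```java
import java.util.*;
import java.util.regex.*;

// Rough sketch of source-side dependency extraction: collect the type names
// named in import statements of a .java source.
public class ImportScanner {

    private static final Pattern IMPORT =
            Pattern.compile("^\\s*import\\s+(?:static\\s+)?([\\w.]+)\\s*;", Pattern.MULTILINE);

    public static Set<String> importedTypes(String source) {
        Set<String> types = new LinkedHashSet<>();
        Matcher m = IMPORT.matcher(source);
        while (m.find()) {
            types.add(m.group(1));
        }
        return types;
    }
}
```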


Perhaps we can mix the approaches. E.g. for certain classes we could go for source analysis.

Source analysis might be interesting from the standpoint of the daemon's background activity. Imagine the daemon works out the class dependencies as the developer edits the java code.

A nice idea. There is certainly no reason we couldn’t mix the approaches (apart from the cost of implementing two approaches instead of one).

 
I’d probably try out the first option, as it’s the simplest to implement and probably the fastest at execution time.

There are some issues around making this efficient. 
First, we need to make the persistence mechanism fast. For the spike, let’s assume we can do this. I would just keep the state in some static field somewhere and not bother with persistence.

We could use daemon's state as we do for task history cache.
 
Second, we need to make the calculation of affected source files fast. One option is to calculate this when something changes rather than each time we run the compilation task, so that we keep, basically, a map from input file to the closure of all source files affected by that input file.

Third, we need to keep the dependency graph as small as we can. So, we might play around with tracking dependencies between packages rather than classes. We should also ignore dependencies that are not visible to the consumer, so that we don’t traverse the dependencies of method bodies, or private elements.

Finally, we should ignore changes that are not visible to the consumer, so that we ignore changes to method bodies, private elements of a class, the annotations of classes, debug info and so on. This is relatively easy for changes to the compile classpath. For changes to source files, it’s a bit trickier, as we don’t know what’s changed until we compile the source file. We could, potentially, compile in two passes - first the source files that have changed, and then the source files that have not changed but depend on those that have. Something, potentially, to play with as part of a spike.

Can you elaborate on why it's easy to obtain this information from the compile classpath?

Because it’s in byte code format. All the structural stuff has already been figured out for us and described in the byte code and we just need to diff the structure to decide whether the changes to the class are visible to the consumers or not. When the change is not visible, we don’t need to recompile the consumer source files.

For source files, all we have is some text so we don’t know whether the change can safely be ignored or not. We either need to recompile all consumer source files regardless of whether they need it, or we need to preprocess the source file to decide if the change is visible to consumers or not.
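The structural diff described here could be sketched with reflection standing in for byte code parsing: extract the consumer-visible members of a class as a set of strings, and treat the class as 'changed' only if that set differs between the old and new version. (A real implementation would parse the byte code of both jar versions instead of loading classes, but the idea is the same.)

```java
import java.lang.reflect.*;
import java.util.*;

// Sketch of a "visible structure" signature: the public and protected members
// of a class as strings. Private members, method bodies and debug info never
// show up here, so changes to them compare as equal.
public class ApiSignature {

    public static SortedSet<String> of(Class<?> type) {
        SortedSet<String> api = new TreeSet<>();
        for (Field f : type.getDeclaredFields()) {
            if (visible(f.getModifiers())) {
                api.add("field " + f.getType().getName() + " " + f.getName());
            }
        }
        for (Method m : type.getDeclaredMethods()) {
            if (visible(m.getModifiers())) {
                api.add("method " + m.getReturnType().getName() + " " + m.getName()
                        + Arrays.toString(m.getParameterTypes()));
            }
        }
        return api;
    }

    private static boolean visible(int modifiers) {
        return Modifier.isPublic(modifiers) || Modifier.isProtected(modifiers);
    }
}
```

If the signature sets of the old and new class versions are equal, the change is invisible to consumers and their source files need not be recompiled.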


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com