Lichess on Scala3 – Help needed

ulularem · on Dec 5, 2022

Is the JVM running out of code/JIT cache space?

We had a similar problem recently in vanilla Java: monolith-like servers would seem to eventually "go bad" for no real discernable reason after hours/days of uptime. It turned out we needed to increase the amount of code-cache size the JVM was allowed to use.

Though supposedly the cache should have kept working (in the LRU-like way many caches operate), we observed that formerly-fine parts of the affected servers also seemed to be unreasonably slow, as if the whole code/JIT caching behavior had been disabled completely.

cldellow · on Dec 5, 2022

The jstat output also shows that CCS is at 99.21% usage, which could support this theory.

At a previous company, we operated some Scala services and ran into an issue like this. I forget if it was triggered from a JDK update or a Scala update. Scala (at least the 2.x series) generated a lot of classes, so there was a lot of memory pressure on this part of the system.

IIRC, we increased the limit and it resolved the issue.

I feel like there's a flag you can pass to see JIT invocations, which might also help validate this as the problem. -XX:+PrintCompilation maybe?

Caveats: It's been a long time, so my memory may be faulty, and this may not apply any longer.

agilob · on Dec 5, 2022

>The jstat output also shows that CCS is at 99.21% usage, which could support this theory.

If the code cache is full, the cache sweeper will have more work to do, will run slower, and this cache is a linked list (afair), any attempt to create another C1/C2 optimised code will cause the allocator to treverse the list, try to find enough contiguous space, and fail, triggering an attempt to fragment the space. Occasionally, removing some less frequently used compiled caches.

This process isn't your normal GC process. If it runs out of memory and nothing can be removed, you're at plateau of how fast code can execute, but your JVM is consuming more CPU cycles, that means, you're losing overall performance. There is no OOM error here, it all fails and slows down silently. No exceptions, no logs, nothing but wasted CPU cycles. This is one of the worst aspects of JVM to monitor and tune. I don't know of any promethues-like metric exporters that can be used here, like in any GC activity or stack/heap metrics.

As I stated in another comment, try

`-XX:+PrintCompilation` and `-XX:+PrintInlining` and `-XX:+LogCompilation`. When this turns out to be filled, try increasing `ReservedCodeCacheSize`. This is out of your non-heap area.

Tsarbomb · on Dec 6, 2022

Stupid question, if you already are at the edge of max heap size for allowing compressed OOPs, can increasing ReservedCodeCacheSize kick you into 64 bit uncompressed land?

agilob · on Dec 13, 2022

I'm not sure if I understand the question and context, but reserved code isn't in heap, but in "non-heap area", non-heap area has many sections, so you could potentially cause OOM on non-heap somewhere, or just see another regression in performance, due to poor heap vs non-heap ratio, then you change default -XX:MaxRAMPercentage=25, to something lower like -XX:MaxRAMPercentage=22 (depends on your total available memory).

xxs · on Dec 5, 2022

>The jstat output also shows that CCS is at 99.21% usage, which could support this theory.

If code cache has run out, the process effectively runs in interpreted mode. I'd wonder however how they would have so much code. Still, they should just run a profiler or any java monitoring tool.

munificent · on Dec 5, 2022

> I'd wonder however how they would have so much code.

They probably didn't author that much code, but Scala 3 may have many language features that its compiler desugars to large amounts of generated code.

(For example, when C# added support for anonymous functions, they initially did so by compiling each lambda to a generated class with a field for each local variable that the function closed over.)

xxs · on Dec 5, 2022

They still need to use the functions quite a bit to trigger at c1. Normally Java doesn't compile immediately but after enough iterations.

Also if scala does that for real, it'd eat the inline budget effectively killing the performance.

toast0 · on Dec 6, 2022

> Normally Java doesn't compile immediately but after enough iterations.

I've never done perf work with JVM, but ...

Is that iteration count reduced over time at all? If not, and from other behavior described, it seems like this could explain the 48 hours of fine, and then spikey garbage. Compilation cache is nearly filled, some bit(s) of code get called often enough to be compiled after 48 hours, finding space in the cache takes lots of cpu, eventually something more useful is evicted, but then you have even more cache pressure because those more useful items will come back in less than 48 hours, and more things will come in as they hit their required 49,50, etc worth of iterations.

If it's inexepensive to increase the cache size, seems like something reasonable to try, but the 48 hour period of stability makes testing difficult. I'm assuming the only realistic test system is production, cause that's how it usually is.

xxs · on Dec 6, 2022

>If it's inexepensive to increase the cache size, seems like something reasonable to try

absolutely, you just never let that thing drop below 80%.

My advice would be to also run C2 compilation at 100-1000 cycles (e.g. -XX:CompileThreshold=100) - it's a much slower startup (but they do have lots of CPUs), and it may not generate a good perf guided code but they will have a good idea how much code cache to dedicate.

xendo · on Dec 6, 2022

They are using JDK17, which segments the cache into profiled and non-profiled code. Both at around 120MB by default, it’s not that difficult to hit.

ulularem · on Dec 5, 2022

To expand a bit more here (and I'll probably slightly flub some terminology): the JVM of course does just-in-time hotspot compilation of frequently-executed code. When it does so, it caches the optimized code for later executions. There are values set for how many times code is executed before it's optimized in this way.

A monolith-like server (contains a lot of code to cache), that's been up for hours/days (has eventually triggered many disparate code paths enough to kick-off the optimization) in a new version of a language (speculatively may contain more code to optimize than the previous version) all seem like factors pointing to this potential situation.

whizzter · on Dec 5, 2022

This definitely sounds like a plausible explanation if the codegen for Scala3 has changed to enable more dynamism and that in turn makes some functions/patterns far larger.

It seems the place to inform them might be on their email or the discord so join up there and suggest it there?

Gonna be fun to hear the solution later.

Floegipoky · on Dec 5, 2022

Do you think that was driven by the code cache repeatedly filling and flushing? And bumping the size resolved it because it wasn't filling anymore? Did you experiment with disabling flushing?

My team is scaling up a Java service and we don't have a lot of institutional experience operating such systems, so I'm really interested in JVM tuning "case studies".

michaelt · on Dec 5, 2022

There's a JVM option - -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation which does what it sounds like.

The output might be instructive if you want to monitor the compilation behaviour more closely. And there are tools like JITWatch if you want to get into even more detail.

djexjms · on Dec 10, 2022

In case you didn't know, this was the correct answer. Nice troubleshooting!

morsch · on Dec 5, 2022

Try looking at it with async-profiler. This can be done in production. I discovered performance problems in unexpected (and some expected) places with it in the past. It may be more helpful if it's your application code that's to blame, though, less so if it's the JVM itself.

https://github.com/jvm-profiling-tools/async-profiler

kelseyfrog · on Dec 5, 2022

I'm surprised to find this comment at the bottom of the comment section. My advice, like yours, is to profile performance issues before getting into the hypothesis-change-measure loop. Like you said, it quickly bifurcates the problem space into application code and the JVM and eliminates entire classes of performance problems.

It's important to point out too that while most people assume cpu profiling, the JVM and hence most profiling tools also have memory profiling which can be helpful in diagnosing problems (especially ones related to GC). I hope the querent ends up profiling and shares the results. It would be both fun and productive to investigate the results collaboratively.

morsch · on Dec 5, 2022

The cool thing is that async-profiler is lightweight enough to run it in production, even against a server that's already under heavy load (in my experience, ymmv). Oh and it's free. We have a cron job running it three times a day (alloc and cpu, both).

Of course the report is rudimentary compared to heavyweight profilers.

Matthias247 · on Dec 5, 2022

+1. Continuous profiling should help figuring out the root cause of such issues. I have only used an in-house tool for Java for this so far, so I don't have a recommendation for a specific one. But the linked one looks reasonable, and there might be other tools too.

It might not even require continuous profiling. One alternative approach is to capture some minutes of profiling data after startup (when CPU usage looks good), and some minutes after the CPU spikes occur. Then compare those two and check for major differences.

michaelt · on Dec 5, 2022

> I don't think the garbage collector is to blame. During the worst times, when the JVM almost maxed out 90 CPUs, the GC was only using 3s of CPU time per minute.

The graph linked only shows 3s in young gen GC - you should check the time spent in the old gen GC too.

You can get loads of time spent in GC even without running out of memory - running out of other resources like file handles or network connections will also trigger a full GC in the hopes of freeing some up.

If you've got 1000 file handles available, one process that uses 100 per second and doesn't leak, and another that uses 1 per hour and leaks it, after 900 hours everything will look fine, then after 1000 hours you'll run out - and the symptoms will manifest in the first process, not the second.

Admittedly, there's a text file of jstats output linked which doesn't show any full GC happening, so maybe this is nothing...

zug_zug · on Dec 5, 2022

Would a way to test this hypothesis be to switch garbage-collecting algorithm?

agilob · on Dec 5, 2022

G1GC should be completely suitable, you can try with ZGC or Shenondoah. Both have some memory "penalty", each object takes a bit more memory, so with change you will see 5-15% increase memory usage. This would be normal.

G1GC should be fine, so enable GC logging, analyse them using:

https://www.tagtraum.com/gcviewer.html

or

https://gceasy.io/

You can log GC for long time with rotation, something like this https://dzone.com/articles/try-to-avoid-xxusegclogfilerotati...

For the GC analysing you will be looking for tenured generation, so must add this flag: -XX:+PrintTenuringDistribution

You would be looking for GC major and GC evacuation times. Major GCs are STW and take more time, so overall the goal is to eliminate them as much as possible.

I usually find it very important to have charts of heap usage. Overall heap allocated (all regions) vs complete heap size. The same for non-heap area.

randoglando · on Dec 5, 2022

STW GC is only for ParallelGC right? Does not apply to G1GC, ZGC or Shenandoah?

eonwe · on Dec 5, 2022

G1 GC totally has stw mode.

CraigJPerry · on Dec 5, 2022

At a cursory glance of the thread dumps i see the time spent in the server compiler (C2) is very very high. It might be worth exploring if that’s expected?

Alternatively if you just want a quick way to rule it out you could turn off tiered compilation, i had to google the option: -XX:-TieredCompilation

agilob · on Dec 5, 2022

C1 vs C2 is best shot here IMO. Except a normal memory leak of course.

Problem with JVM is these compilation stages are difficult to monitor and tune. They require logging, parsing logs and trial and error approach.

OP definitely worth logging tiered compilation `-XX:+PrintCompilation` and `-XX:+PrintInlining` and `-XX:+LogCompilation`. When this turns out to be filled, emptied and filled again, try increasing `ReservedCodeCacheSize`

xxs · on Dec 5, 2022

not having the tiered compilation would switch off c1, not c2. Tiered compilation is mostly an issue if the code generated remains c1 (the dumb compiler) and never promotes to c2... or if there is an OSR (on stack replacement issue - bug)

flip note: "java -XX:+PrintFlagsFinal -version", to see all available flags and their values, included the ones based on ergonomics.

0xFFFE · on Dec 5, 2022

It appears the problem existed before the lila3 was deployed on 11/22. If you notice the GC graph. The number of GC cycles/minute kept increasing gradually starting on 11/10 and almost doubling on 11/21. The 11/22 deployment of lila3 reset the graph and since you have been restarting everyday since, we can't see the growth of more than a day. My wild hunch is a code push on 11/10 causing a memory leak, worth checking in my opinion.

reidrac · on Dec 5, 2022

I haven't looked at the graphs, but they updated netty to 4.1.85.Final on 2022-11-10.

https://netty.io/news/2022/11/09/4-1-85-Final.html mentions that a potential memory leak was fixed. In includes a debug log warning of the leak; but enabling debug logging may be a no-no.

Perhaps could be worth reverting that and see if there's any change. Sounds cheap and harmless.

agilob · on Dec 5, 2022

Netty is a super complex and also super poorly documented project. I did weeks of exploration and found these JVM args:

      -Dio.netty.allocator.numDirectArenas=0
      -Dio.netty.noPreferDirect=true
      -Dio.netty.noUnsafe=true

work pretty well for us on any HTTP server. They slightly reduce performance as HTTP pool is weaker, but deceases memory usage by 25-40%, also eliminated one of a few memory leaks in an older version of KeyCloak.

nickspag · on Dec 5, 2022

there were no deployments near 11/10, according to that graph - they also say in the blog that scala2 could go for two weeks without restart, so theyre presumably aware of some sort of memory management issues they're just okay with it.

gavinray · on Dec 5, 2022

I assume the author tried asking in the Scala discord?

https://discord.com/invite/scala

Most of the core community hangs out there, and some of the folks that contribute to the compiler too. If there's someone that knows, they're either on the Discord or the forums.

ajkjk · on Dec 5, 2022

Hm why assume that? would never have thought to do that.

gavinray · on Dec 5, 2022

I dunno, every language/tech thing has a Discord nowadays (mostly). If I need help with something I usually go there first.

Even for niche things like D language, the Discord is the place to go. I learned Scala 3 and D mostly through the help of folks from their Discord servers guiding me.

jraph · on Dec 5, 2022

The use of Discord in free software communities never ceases to depress and disappoint me. I hope this fad dies soon. [1]

I would not risk assuming maintainers of an open source project to have the reflex of jumping into Discord to ask questions.

Anyway, D has a forum and an IRC channel. Scala has a Discourse.

The nice thing about forums is that problems and solutions are searchable by other people in the future. I learned many things by myself thanks to this. I would not like to live in a world where you need to engage with people all the time, asking the same questions again and again, to use some tool or some programming language.

[1] https://drewdevault.com/2022/03/29/free-software-free-infras...

vips7L · on Dec 5, 2022

Discord is also searchable by other people in the future. Forum channels are exactly the use case you describe: https://support.discord.com/hc/en-us/articles/6208479917079-...

kevincox · on Dec 5, 2022

Searchable within Discord. It can't be found by general search engines and can't be archived. It's a walled garden.

InGoodFaith · on Dec 5, 2022

You might be interested in Linen to make your discord (and slack) searchable outside of the walled garden (can also use to archive too).

https://github.com/linen-dev/linen.dev

https://news.ycombinator.com/item?id=31494908

GrumpySloth · on Dec 6, 2022

And yet I can’t access several month old threads. When I try to open them, they’re in a perpetually-loading state, with gray rectangles in place of text. Old-style forums are a lot more durable and more neatly organised.

nszceta · on Dec 6, 2022

Discord search is fuzzy, limited by time, and you are unable to delete old messages without a hack which puts your account at risk of a permanent ban. Discord is terrible.

ornicar · on Dec 6, 2022

Thank you so much for all the support, thoughts, and advice.

I just published an update on the blog post: https://lichess.org/@/thibault/blog/lichess-on-scala3-help-n...

timkofu · on Dec 6, 2022

Thanks for sharing this.

ketzo · on Dec 5, 2022

Off-topic response, but I had no idea Lichess did community-driven development like this. Very cool.

netsuitebitch · on Dec 5, 2022

It's a bit rare.

Sebguer · on Dec 5, 2022

It's open-source, so why wouldn't it?

ketzo · on Dec 5, 2022

well for one, I didn’t know Lichess was open source.

But for another, even for an open-source project, I love the super public-but-detailed approach to this bug solving. It almost feels like a bounty.

_siis · on Dec 6, 2022

There are various types of open source projects.

Some are more closed, some are more open.

Lichess historically has been closed, and they have only reached out to the community for help in the past as a last ditch effort when everything else has failed.

Its not that detailed either.

gowld · on Dec 5, 2022

On the one hand, it's written on the home page. https://lichess.org/

On the other hand, there is a lot of chesscom-style clutter there no one wants (animations of random other games)

AzzieElbab · on Dec 7, 2022

Looks like they figured it out after one of the compiler guys looked at the code https://discord.com/channels/280713822073913354/104933911236... ------ So I took a look at the code and the generate bytecode to try to see if the Scala 3 compiler was generating significantly more code. I don't see anything too weird, but the one thing that stands out is all the inline given Conversion in the codebase. I assume the idea is to use inline for performance but I'm afraid it's counter productive here: you end up inlining the anonymous class creation, so every usage of a conversion ends up creating a new anonymous class. So I recommend just dropping the inline keyword when defining a given ... =

facorreia · on Dec 7, 2022

It seems that this commit fixed it:

https://github.com/lichess-org/lila/commit/c4e53798d171ab71b...

sideeffffect · on Dec 9, 2022

Nope, it was wrongly configured code cache

https://lichess.org/@/thibault/blog/lichess-on-scala3-help-n...

https://old.reddit.com/r/scala/comments/zh3o6x/lichess_runni...

gopalv · on Dec 6, 2022

There's a lot of other theories in this thread, but the StringBuilder lock is what I can't get over from the stack traces.

  at java.lang.AbstractStringBuilder.ensureCapacityInternal(java.base@17/AbstractStringBuilder.java:228)
 - parking to wait for  <0x0000000088eee648> (a scala.concurrent.impl.ExecutionContextImpl$$anon$3)
 at java.lang.AbstractStringBuilder.append(java.base@17/AbstractStringBuilder.java:582)

That's a lock on a scala object in the middle of a java lang call-stack.

How is that lock happening?

The comments of ensureCapacityInternal explicitly says

    /**
     * This method has the same contract as ensureCapacity, but is
     * never synchronized.
     */
    private void ensureCapacityInternal(int minimumCapacity) {

randoglando · on Dec 6, 2022

I believe that is just an artifact of there being no memory to allocate, and so ensureCapacityInternal is blocked.

cadbox1 · on Dec 6, 2022

Deploy daily.

We had a similar issue in a Java application and couldn't work out the root cause so we decided to redeploy daily. No problems after that.

aexl · on Dec 6, 2022

Problem with that is that you will probably have a downtime for several minutes (or at least many seconds) each day. That's not optimal for a site where at any time of the day thousands of people are playing chess...

nonethewiser · on Dec 5, 2022

How much does lichess make from donations? Does it really cover their $400k / year costs?

https://docs.google.com/spreadsheets/d/1Si3PMUJGR9KrpE5lngSk...

Scarblac · on Dec 5, 2022

Yes, they have no other income.

TheFattestNinja · on Dec 6, 2022

Slightly related: how can I educate myself on the topics of JVM internals, performance optimization and non-functional analysis like these (efficiently) ? Anyone has a recommended path/list of resources that is not just spending all time googling topics?

cldellow · on Dec 6, 2022

My path was: operate a high-volume service. It will fall over and force you to learn (unfortunately, through Googling).

Less facetiously... I found searching for blogs by quantitative traders useful. You can write JVM code that has excellent latency and throughput characteristics, but you need to know the sharp edges to avoid.

A strategy that I think I used: find high performance collections libraries (trove, koloboke, fastutil). You'll likely need these anyway. Then find their authors, and go hunting for their blogs. They'll often list pitfalls.

It's been years, so I don't remember the names of people I read, otherwise I'd share them -- sorry!

kaba0 · on Dec 6, 2022

I have also been on a lookout for something like that for a long time. What I know of as good are Optimizing Java book by Evans, and the JVM anatomy quarks.

homarp · on Dec 9, 2022

solution discussed here https://news.ycombinator.com/item?id=33920399

tmd83 · on Dec 5, 2022

Does anyone know what they are generating jvm.memory.allocation.sum data from?

_siis · on Dec 6, 2022

There's no real way to tell what the problem is from the information they provided. They would need to generate a flamegraph to identify what's going on.

These type of issues aren't that uncommon with monolithic codebases, and releases every morning aren't good practice.

I tried to reach out to their email but I expect they still have my email on a auto-discard rule from when they banned me years ago without reason, response, due process, etc following posting an issue on their github which appears to now have been removed.

I don't see them getting much additional help given I'm not the only one that was treated that way. The project in general is fairly closed and has been and how they treat the people that want to help, and the lack of controls for auditing people with privileged access. Not something I'd participate in moving forward with what I know.

They can figure out how to use a flamegraph themselves.

I was professional and they weren't so I have no impetus to help them in the least.

I really wish Lichess would stop showing up in my HN feed. Its been almost 4 or 5 years, but you always remember the people that wrong you.

_siis · on Dec 6, 2022

No responses and negative downvotes for describing an actual experience, and providing a productive and professional answer despite that negative experience. Talk about unreasonable snowflakes.

Just goes to show anyone that matters how far the HN community has fallen in recent years.

You want to de-amp and be unreasonable by marginalizing me, I won't contribute at all, you can learn how things are without my help. Whole attempt at manipulation is saf, with the idiot controlling the bots in the majority.

Stripping people of agency and voice like that is very evil, and anyone participating in that will eventually have that negative karma come back at them, and it will be magnitudes worse the longer it takes.

If you can't be civil or reasonable, don't do anything.