The difference is that Java's memory model was designed from the ground up to work with multiple threads. Python's semantics are that only 1 thread is running at any point in time. Keeping that illusion while running multiple threads in parallel is what causes the huge overhead. The alternative is to do away with CPython semantics, but that would break a ton of existing code.