
Saturday, December 28, 2024

Debugging Spark Scala/ Java components

In continuation of the earlier post on debugging Pyspark, here we show how to debug the Spark Scala/ Java side. Spark is a distributed processing environment whose core runs on the JVM, and it has APIs for connecting from different languages like Python & Java. The high level Pyspark Architecture is shown here.

For debugging the Spark Scala/ Java components, which run within the JVM, it's easy to make use of the Java Tooling Options for remote debugging from any compatible IDE such as Idea (Eclipse no longer supports Scala). A few points to remember:

  • Multiple JVMs in Spark: Since Spark is a distributed application, it involves several components like the Master/ Driver, Slave/ Worker & Executor. In a real world, truly distributed setting, each of these components runs in its own separate JVM on separate physical machines. So be clear about which component you actually want to debug & set up the Tooling options accordingly, targeting that specific JVM instance (see the sketch after this list).

  • Two-way connectivity between IDE & JVM: There should be two-way network connectivity between the IDE (debugger) & the running JVM instance being debugged.

  • Debugging Locally: Debugging is mostly a dev stage activity & done locally. So it may be better to debug on a Spark cluster running locally. This could be either a Spark Standalone cluster or Spark run locally (master=local[n]/ local[*]).
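For illustration, each component can be pointed at its own debugger via its own conf key in spark-defaults.conf. The two keys below are standard Spark settings; the ports are just example values & should match whatever the IDE debugger is listening on.

    # Example only: one debug agent per component JVM, on separate ports
    spark.driver.extraJavaOptions    -agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=n
    spark.executor.extraJavaOptions  -agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5006,suspend=n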

Steps:

Environment: Ubuntu 20.04 with Java 8, Spark/ Pyspark (ver 2.1.0), Python 3.5, IntelliJ Idea (ver 2024.3), Maven 3.6

(I) Idea Remote JVM Debugger
In Idea > Run/ Debug Config > Edit > Remote JVM Debug.

  • Start Debugger in Listen to Remote JVM Mode
  • Enable Auto Restart
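With Listen mode, Idea itself opens the debug port (5005 by default) & waits for the target JVM to connect. The target JVM is then pointed back at Idea with a JDWP agent string of roughly the form used in the steps below:

    -agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=n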

(II)(a) Debug Spark Standalone cluster
Key features of the Spark Standalone cluster are:

  • Separate JVMs for Master, Slave/ Worker, Executor
  • All could run on a single dev box, provided enough resources (Mem, CPU) are available
  • Scripts inside the SPARK_HOME/sbin folder like start-master.sh, start-slave.sh (start-worker.sh), etc. to start the services

In order to debug, let's say, an Executor, a Spark Standalone cluster could be started off with 1 Master, 1 Worker & 1 Executor.

    # Start Master (Check http://localhost:8080/ to get Master URL/ PORT)
    ./sbin/start-master.sh 

    # Start Slave/ Worker
    ./sbin/start-slave.sh spark://<MASTER_HOST>:<MASTER_PORT>

    # Add the JVM debug agent via spark.executor.extraJavaOptions in spark-defaults.conf
    spark.executor.extraJavaOptions  -agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=n

    # The value could instead be passed as a conf to the SparkContext in the Python script:
    from pyspark import SparkContext
    from pyspark.conf import SparkConf
    confVals = SparkConf()
    confVals.set("spark.executor.extraJavaOptions","-agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=y")
    sc = SparkContext(master="spark://localhost:7077",appName="PythonStreamingStatefulNetworkWordCount1",conf=confVals)

(II)(b) Debug locally with master="local[n]"

  • In this case a local Spark cluster is spun up via scripts like spark-shell, spark-submit, etc. located inside the bin/ folder
  • The different components (Master, Worker, Executor) all run as threads within one JVM, where n is the number of worker threads (e.g. set n=2); a minimal driver sketch follows below
  • Export JAVA_TOOL_OPTIONS beforehand in the terminal from which the Pyspark script will be run

        export JAVA_TOOL_OPTIONS="-agentlib:jdwp=transport=dt_socket,server=n,suspend=n,address=5005"
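For reference, a minimal local-mode driver script (a hypothetical sketch; the app name & data are illustrative) that keeps all components in a single debuggable JVM with two threads could look like:

    # Sketch only: Master, Worker & Executor all run as threads in this one JVM
    from pyspark import SparkContext

    sc = SparkContext(master="local[2]", appName="LocalDebugSketch")
    rdd = sc.parallelize(["hello world", "hello spark"])
    counts = (rdd.flatMap(lambda line: line.split(" "))
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))
    print(counts.collect())
    sc.stop()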

(III) Execute PySpark Python script
    python3.5 ${SPARK_HOME}/examples/src/main/python/streaming/network_wordcount.py localhost 9999

This should start off the Pyspark job & connect the Executor JVM to the waiting Idea Remote debugger instance for debugging. Note that the streaming example expects something (e.g. netcat: nc -lk 9999) to be listening on localhost 9999 & feeding it lines of text.

Wednesday, September 12, 2012

Remotely Debug Solr Cloud in Eclipse Using JPDA, JDWP, JVMTI & JDI


The acronyms first:
JPDA - Java Platform Debug Architecture
JDWP - Java Debug Wire Protocol
JVMTI - JVM Tool Interface
JDI - Java Debug Interface

To debug any open source Java project such as Solr using Eclipse, we rely on the JDWP feature available within any standard JVM. You can get a lot more info about the terms and architecture here.

At a high level the concept is that there is a JVM to be debugged, the debuggee (Solr), & a client side debugger JVM (Eclipse). The two communicate over JDWP. Thanks to the standardized wire protocol, the client may even be a non-JVM application that speaks the protocol.

One of the two JVMs acts as the debugging server (the one that waits for the client to connect). The other JVM acts as the debugging client which connects to the debugger server, to start the debugging process.
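In JDWP agent terms this role is chosen via the server option (illustrative strings; the ports match the two setups described in these posts):

    # JVM is the debugging server: it waits for a debugger to attach (Solr below)
    -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000

    # JVM is the debugging client: it connects out to an already-listening debugger (Spark post above)
    -Xrunjdwp:transport=dt_socket,server=n,suspend=n,address=localhost:5005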

In our case, to keep things simple let Solr be the debugger server, while Eclipse can be the debugger client. The configurations then are as follows.

On Solr side (assuming Solr Cloud):

java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -Djetty.port=7200 -Dhost=myhost -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -Djava.util.logging.config.file=etc/logging.properties -DnumShards=3 -DzkHost=zk1:2171 -jar start.jar

Note: Since we have set suspend=y, the Solr side will stay suspended until the Eclipse debugger client has connected.

On Eclipse side:
Go to Run > Debug Configurations > Remote Java Application
Then choose Connection Type: Standard (Socket Attach). Host: localhost (or the IP). Port: 8000 (the same as set above).

Also in Eclipse you should have checked out the Solr source code from Solr trunk as a project. This will allow you to put breakpoints at appropriate locations to help with the debugging. So go on, give this a shot and happy debugging!