The WebSphere Notes

The WebSphere Notes (www.webspherenotes.com) is a blog with my study notes on WebSphere Application Server administration and the WebSphere Portal Server developer and administration certifications.

Sunil has been in the IT industry for 10 years; he worked with IBM Software Labs, where he was part of the WebSphere Portal Server development team for 4 years, and is now working for Ascendant Technology. Sunil has been working with WebSphere Portal since 2003. He is the author of the book "Java Portlets 101" and more than 25 articles, and he runs a popular blog about portlet development and administration (http://wpcertification.blogspot.com).

Latest blog posts

  • How to use custom delimiter character while reading file in Spark

    Monday, April 18, 2016

    I wanted to figure out how to get Spark to read a text file and break it into records based on a custom delimiter instead of '\n'; these are my notes on how to do that. Spark's input/output is based on MapReduce's InputFormat and OutputFormat. For example, when you call SparkContext.textFile() it actually uses TextInputFormat to read the file. The advantage of this approach is that you can do everything TextInputFormat can do; in particular, by default TextInputFormat breaks the file into records on the '\n' character, and that delimiter is configurable.
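
    Here is a minimal sketch of that approach: set the textinputformat.record.delimiter Hadoop property and read the file through newAPIHadoopFile() instead of textFile(). The "input.txt" path and the ',' delimiter are placeholders I picked for illustration.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object CustomDelimiterExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("CustomDelimiter").setMaster("local[*]"))
        // Tell TextInputFormat to end records at ',' instead of the default '\n'
        val hadoopConf = new Configuration(sc.hadoopConfiguration)
        hadoopConf.set("textinputformat.record.delimiter", ",")
        val records = sc.newAPIHadoopFile("input.txt", classOf[TextInputFormat],
            classOf[LongWritable], classOf[Text], hadoopConf)
          .map { case (_, text) => text.toString } // keep the record body, drop the byte-offset key
        records.collect().foreach(println)
        sc.stop()
      }
    }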

  • Difference between reduce() and fold() method on Spark RDD

    Wednesday, February 17, 2016

    When you call the fold() method on an RDD it can return a different result than you would normally expect, so I wanted to figure out how fold() actually works, and I built this simple application. The first thing the application does is create a simple RDD with the 8 values from 1 to 8, divided into 3 partitions: sparkContext.parallelize(List(1,2,3,4,5,6,7,8),3).
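
    A small sketch of the difference, using the same RDD; the zero value of 10 is my choice for illustration. fold() applies the zero value once per partition and once more when the partition results are merged.

    import org.apache.spark.{SparkConf, SparkContext}

    object FoldVsReduce {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("FoldVsReduce").setMaster("local[*]"))
        val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8), 3)
        // reduce() just combines the elements: 1 + 2 + ... + 8 = 36
        println(rdd.reduce(_ + _))
        // fold() also adds the zero value once per partition and once more
        // when merging the partition results: 36 + (10 * 3) + 10 = 76
        println(rdd.fold(10)(_ + _))
        sc.stop()
      }
    }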

  • How to parse fillable PDF form in Java

    Wednesday, February 17, 2016

    I wanted to figure out how to parse a fillable PDF form in Java so that I could do some processing on it, so I built this sample PDFFormParsingPOC project that uses the Apache PDFBox library. In a simple Java class I first read the PDF file and parse it into a PDDocument. Then I can get all the fields in the PDF form by calling PDDocument.getDocumentCatalog().getAcroForm().getFields() and iterate through them.
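
    A minimal sketch of that call sequence (written in Scala rather than the post's Java, and assuming PDFBox 2.x; "form.pdf" is a placeholder path):

    import java.io.File
    import org.apache.pdfbox.pdmodel.PDDocument
    import scala.collection.JavaConverters._

    object PDFFormFields {
      def main(args: Array[String]): Unit = {
        val document = PDDocument.load(new File("form.pdf"))
        try {
          val acroForm = document.getDocumentCatalog.getAcroForm
          // getAcroForm() returns null when the PDF has no fillable form
          if (acroForm != null) {
            for (field <- acroForm.getFields.asScala) {
              println(s"${field.getFullyQualifiedName} = ${field.getValueAsString}")
            }
          }
        } finally {
          document.close()
        }
      }
    }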

  • Invoking Python from Spark Scala project

    Saturday, February 13, 2016

    When you're developing Spark code, you have the option of writing it in Scala, Java, or Python. In some cases you might want to mix the languages you use. I wanted to try that out, so I built this simple Spark program that passes control to Python to perform a transformation (all it does is append the word "python " in front of every line).
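
    One common way to do this kind of handoff, sketched here with RDD.pipe(), which streams each record through an external command's stdin/stdout; prepend.py is a hypothetical script that echoes every input line with "python " in front.

    import org.apache.spark.{SparkConf, SparkContext}

    object PipeToPython {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PipeToPython").setMaster("local[*]"))
        val lines = sc.parallelize(List("first line", "second line"))
        // pipe() forks the command once per partition and streams each
        // record through its stdin/stdout, one line at a time
        val transformed = lines.pipe("python prepend.py")
        transformed.collect().foreach(println) // "python first line", "python second line"
        sc.stop()
      }
    }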

  • How to use HBase sink with Flume

    Monday, February 1, 2016

    I wanted to figure out how to use HBase as a target for Flume, so I created this sample configuration, which reads events from netcat and writes them to HBase.

    1. The first step is to create the test table in HBase with CF1 as the column family. Every time Flume gets an event, it will write it to the CF1 column family of the test table:

    create 'test','CF1'
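
    For reference, a minimal sketch of what such a Flume configuration could look like; the agent and component names, the port, and the reliance on the default HBase event serializer are my assumptions, not the post's exact file.

    # netcat source -> memory channel -> HBase sink (writes to test/CF1)
    agent1.sources = netcatSrc
    agent1.channels = memCh
    agent1.sinks = hbaseSink

    agent1.sources.netcatSrc.type = netcat
    agent1.sources.netcatSrc.bind = localhost
    agent1.sources.netcatSrc.port = 44444
    agent1.sources.netcatSrc.channels = memCh

    agent1.channels.memCh.type = memory

    agent1.sinks.hbaseSink.type = hbase
    agent1.sinks.hbaseSink.table = test
    agent1.sinks.hbaseSink.columnFamily = CF1
    agent1.sinks.hbaseSink.channel = memCh
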
  • Reading content of file into String in scala

    Monday, January 25, 2016

    One of the common requirements is to read the content of a file into a String. At runtime you would want to read the content of a config file at a particular path, but during testing you would want to read the content of a file on the classpath.
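
    A minimal sketch covering both cases with scala.io.Source; both paths are hypothetical.

    import scala.io.Source

    object ReadFileToString {
      // Runtime case: read from an absolute path on disk
      def fromPath(path: String): String = {
        val source = Source.fromFile(path)
        try source.mkString finally source.close()
      }

      // Test case: read a resource that sits on the classpath
      def fromClasspath(resource: String): String = {
        val source = Source.fromInputStream(getClass.getResourceAsStream(resource))
        try source.mkString finally source.close()
      }

      def main(args: Array[String]): Unit = {
        println(fromPath("/etc/myapp/app.conf"))   // hypothetical runtime path
        println(fromClasspath("/test-app.conf"))   // hypothetical test resource
      }
    }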

  • Hello Apache Tika

    Saturday, January 23, 2016

    Apache Tika is a nice framework that lets you extract the content of a file; for example, you can extract the content of a PDF, Word document, or Excel spreadsheet as a string. It also lets you extract metadata about the file, such as when it was created, its author, etc. I built this sample application to play with Tika. You can try it by giving it the full path of the file whose content you want to extract.
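
    A small sketch of both uses, assuming the Tika facade class and AutoDetectParser; "sample.pdf" is a placeholder path.

    import java.io.{File, FileInputStream}
    import org.apache.tika.Tika
    import org.apache.tika.metadata.Metadata
    import org.apache.tika.parser.AutoDetectParser
    import org.apache.tika.sax.BodyContentHandler

    object HelloTika {
      def main(args: Array[String]): Unit = {
        val file = new File("sample.pdf")
        // One-liner: extract the file's full text content as a String
        println(new Tika().parseToString(file))

        // Lower-level parse that also captures metadata (author, dates, ...)
        val handler = new BodyContentHandler()
        val metadata = new Metadata()
        val stream = new FileInputStream(file)
        try new AutoDetectParser().parse(stream, handler, metadata)
        finally stream.close()
        metadata.names().foreach(n => println(s"$n = ${metadata.get(n)}"))
      }
    }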

  • Flume to Spark Streaming - Pull model

    Saturday, January 23, 2016

    In this post I will demonstrate how to stream data from Flume into Spark using Spark Streaming. When it comes to streaming data from Flume to Spark, you have 2 options.

    1. Push model: Spark listens on a particular port for Avro events, and Flume connects to that port and publishes events
    2. Pull model: you use a special Spark sink in Flume that keeps collecting published data, and Spark pulls that data at a certain frequency

    I built this simple configuration in which I could send events to Flume over netcat; Flume would take those events and send them to Spark as well as print...
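
    A sketch of the Spark side of the pull model, assuming the spark-streaming-flume integration; the host and port are placeholders that must match the org.apache.spark.streaming.flume.sink.SparkSink configured in the Flume agent.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object FlumePullModel {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("FlumePullModel").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))
        // Pull model: connect to the SparkSink hosted by the Flume agent
        // and poll it for batches of events at each batch interval
        val flumeStream = FlumeUtils.createPollingStream(ssc, "localhost", 9999)
        flumeStream
          .map(sparkFlumeEvent => new String(sparkFlumeEvent.event.getBody.array()))
          .print()
        ssc.start()
        ssc.awaitTermination()
      }
    }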

  • Setting up local repository for maven

    Friday, January 22, 2016

    A couple of days ago I was working with my colleague on setting up a cluster in AWS for a Spark lab. One problem we ran into is that every time you start a Spark build it downloads a bunch of dependencies (in our case around 200 MB, mostly because of the complexity of our dependencies). We thought that if every student had to download all the dependencies, it would take a lot of time and cost money in network bandwidth. The way we ended up solving this issue was to first run the Maven build for the first user, say user1.
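
    One way to share that populated repository with the remaining users, sketched here as an assumption rather than our exact setup, is to point their Maven at user1's local repository through settings.xml; the path below is a placeholder.

    <!-- ~/.m2/settings.xml for each additional user -->
    <settings>
      <!-- reuse the dependencies user1's build already downloaded -->
      <localRepository>/home/user1/.m2/repository</localRepository>
    </settings>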

  • Monitoring HDFS directory for new files using Spark Streaming

    Sunday, January 17, 2016

    I wanted to build a simple Spark Streaming application that monitors a particular directory in HDFS and, whenever a new file shows up, prints its content to the console. I built this HDFSFileStream.scala. In this program, after creating a StreamingContext, I call sparkStreamingContext.textFileStream(<directoryName>) on it. Once a new file appears in the directory, fileRDD.count() returns more than 0, and then I invoke processNewFile().
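
    A minimal sketch of that program; the HDFS directory URL is a placeholder, and the processing is inlined here instead of a separate processNewFile().

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object HDFSFileStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("HDFSFileStream").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))
        // textFileStream() produces, per batch, an RDD with the lines of
        // every file that newly appeared under the monitored directory
        val fileStream = ssc.textFileStream("hdfs://localhost:9000/user/spark/input")
        fileStream.foreachRDD { fileRDD =>
          if (fileRDD.count() > 0) {           // a new file showed up in this batch
            fileRDD.collect().foreach(println) // print its content to the console
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }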