Friday, February 15, 2013

Solr Cell, Tika And Pages

With Solr Cell, aka Tika, you get the power to index content from within  a wide set of digital files such as Pdfs, Office, Text, etc.

Tika however doesn't naturally offer any demarcations for page boundaries. So you can search for content matches from a file, but not for specific pages from within these files.

Among several different ways to solve this problem, one way could be to index each page of the file as a separate document in Solr and do a field collapsing/ result grouping on the search results by a common file identifier shared by all pages of the file.

Since there could be performance overheads with result grouping, another way is to index the combined file as one solr document (of type Combined) & each page as a separate solr document (of type Page) with a common file identifier. The search can then be performed initially against the combined document (type:combined AND text:abc) to identify files that match & then against the corresponding page type document (type:page AND file-id:123 AND text:abc) to identify pages.

Wednesday, February 6, 2013

Mocking AWS ELB Behaviour Locally For Testing

Once hosted out of Amazon, you make use of the AWS Elastic Load Balancer (ELB) for balancing load across your EC2's within or acroos Availability Zones (AZ). Since code gets developed and tested locally (outside of Amazon), at times you might want to test load balancer scenarios before deploying to production. Here's one way to mock up the load balancer behaviour for local testing.

Use Apache (you could very well use something like Nginx instead) in a reverse proxy, load balancer set up via mod_proxy & mod_proxy_balancer.  Fairly simple for anyone with slight experience with configuring Apache. We used Apache as a load balancer front-end to IIS on local, exactly the way ELB would load balance in front of production IIS.

Additionally, since ELB was also an SSL end point for our production servers, we set up Apache to be the SSL end point (via mod_ssl) on local. Apache was configured to listen on port 443 (using a self-signed certificate), and would forward all traffic from port 443 to backend IIS on port 80.

Once we had that set-up going, we were quickly able to reproduce an issue with application generated Secure cookies not getting set properly across client request/ response. Once we had the fix on the local (which was to set the flag on the cookies in the request, not response) the same worked flawlessly on the AWS as well.