Saturday, March 10, 2012

Apache Solr integration with Talend ETL custom components


Solr/Talend components documentation

This tutorial explains how to use the custom SOLR components for TALEND.
These components can be found in the Exchange section of the TALEND Forge.



About SOLR (from the SOLR website) :
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. For more information, the wiki and tutorials can be found here : http://lucene.apache.org/solr/

Installation of SOLR components

The components pack is available in the Exchange section of the TALEND Forge.
Download each component and add them all to your TALEND installation.

In Talend, you can register a custom folder for user components.
Go to : Preferences -> Talend -> Components and select your custom user folder




Add these 5 components to your custom user folder :
  1. tSOLRConnection : defines the connection to your SOLR server
  2. tSOLRCommit : commits your data modifications to your SOLR core
  3. tSOLRRollback : rolls back your data modifications in your SOLR core
  4. tSOLRInput : queries your SOLR server
  5. tSOLROutput : adds or deletes data in your SOLR server

Version of SOLR

All components use the SolrJ Java client API.
Each component folder contains the solr-solrj-X.X.X.jar library (and its other jar dependencies).
I have tested the components with SOLR version 3.3.0. If you want to work with another version, you should :

  1. replace, in each component folder, the solrj jar and its jar dependencies with the version you use
  2. if the jar names change, update the jar names in each tSOLRXXXX_java.xml component definition file (in the <CODEGENERATION> <IMPORTS> section)



 tSOLRConnection component

This component defines the connection to the SOLR server by setting properties on CommonsHttpSolrServer :

baseURL : the URL of the SOLR server. With a multi-core SOLR configuration, this is the URL of the core you want to use.
allowCompression : sets allowCompression
connectionTimeout : sets connectionTimeout on the underlying HttpConnectionManager
maxTotalConnection : sets maxTotalConnection on the underlying HttpConnectionManager
connectionManagerTimeout : sets soTimeout (read timeout) on the underlying HttpConnectionManager
defaultMaxConnectionsPerHost : sets maxConnectionsPerHost on the underlying HttpConnectionManager
followRedirects : sets followRedirects
maxRetries : sets the maximum number of retries to attempt in the event of transient errors
allow http solr authentication : if checked, two new properties appear for defining the authentication :

  • login : the login to use
  • password : the password to use
Authentication is based on HTTP basic auth.
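
To make "HTTP basic auth" concrete, here is a minimal sketch of how a login and password become the Authorization header sent with each request. It uses only the Java standard library. The class name, method name and credentials are hypothetical, and this is not taken from the component's source: the component handles this step for you once the checkbox is ticked.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative helper (hypothetical name), not part of the tSOLRConnection component.
class BasicAuthHeader {

    // Builds the value of the HTTP "Authorization" header for basic auth:
    // the word "Basic", a space, then base64("login:password").
    static String headerValue(String login, String password) {
        String credentials = login + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Hypothetical credentials, for illustration only.
        System.out.println("Authorization: " + headerValue("solr_user", "solr_pass"));
    }
}
```

Since the credentials are only base64-encoded, not encrypted, basic auth is best used over HTTPS or on a trusted network.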





 tSOLROutput component

This component adds data to, or deletes data from, the SOLR server.

Connection : a closed list of the tSOLRConnection components defined in the job. It lets you select the connection to use.
Schema : the schema for this component. The schema must map the SOLR fields in which you want to set data.
Actions : the action to perform on the SOLR server. You can choose between these values :

  • Delete : deletes data from the SOLR server. In this case, you must define a key in your schema that maps to the unique identifier of a SOLR document so that it can be deleted (the id field in a basic SOLR configuration).
  • Delete all before insert : deletes all data in SOLR before inserting the new data
  • Insert : inserts data into the SOLR server
  • Update : there is no update mode. The reason is that SOLR does not allow updating only some fields of a document (as an RDBMS would). To update a document, you must send all of its fields together with its unique identifier. If the document is already present in SOLR (based on the unique identifier), it is replaced with the new values. So, to update a document, just use the Insert mode, define the unique identifier field for SOLR and fill in all the fields.
Batch size : each row to insert is added to a document list before being sent to the SOLR server (batch insert). With a large number of documents, you can get an OutOfMemoryError. This property sets the maximum number of documents in the list to avoid that problem. When the list reaches its maximum capacity, the rows are inserted and a new list is created for the next rows. 0 means no limit on the batch list.
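
The batch size behaviour described above can be sketched in plain Java. This is only an illustration of the buffering logic under my own assumptions, not the component's generated code. BatchBuffer and its sink are hypothetical names, and the sink stands in for whatever call actually sends the documents to SOLR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative buffer (hypothetical name) showing the batch-size logic only.
class BatchBuffer<T> {
    private final int batchSize;           // 0 means no limit
    private final Consumer<List<T>> sink;  // stand-in for "send this list of documents to SOLR"
    private List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Adds one row; sends the list as soon as it reaches the configured capacity.
    void add(T row) {
        pending.add(row);
        if (batchSize > 0 && pending.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered and starts a new list for the next rows.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(pending);
        pending = new ArrayList<>();
    }

    public static void main(String[] args) {
        BatchBuffer<String> buffer =
                new BatchBuffer<>(2, batch -> System.out.println("sending " + batch));
        for (String row : new String[] {"doc1", "doc2", "doc3"}) {
            buffer.add(row);
        }
        buffer.flush(); // send the last, partially filled list
    }
}
```

With a batch size of 0 nothing is sent until the final flush, which is why very large inputs can exhaust memory.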





Example of a defined schema





Important : for multi-valued SOLR fields, you can add data by declaring the corresponding schema columns with the List type


 tSOLRCommit component

This component commits the changes made on the SOLR server.
Connection : a closed list of the tSOLRConnection components defined in the job. It lets you select the connection to use.
Optimize index after commit : when checked, asks SOLR to optimize the index after the commit

Important : SOLR does not support "transaction isolation" like you'll get from an RDBMS.
With concurrent inserts and commits, all pending changes are simply held in a single queue, and any commit will commit all of them.
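
The note above can be illustrated with a toy model. This is a deliberately simplified sketch written for this tutorial (all names are hypothetical) and says nothing about how SOLR is implemented internally. It only shows the consequence: pending changes from every client share one queue, so a commit or a rollback issued by one job applies to all of them.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical) of a core with a single shared queue of pending changes.
class PendingQueueModel {
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    // Any job can add a document; it lands in the shared pending queue.
    void add(String document) {
        pending.add(document);
    }

    // A commit makes ALL pending documents visible, whoever added them.
    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    // A rollback discards ALL pending documents, whoever added them.
    void rollback() {
        pending.clear();
    }

    List<String> committed() {
        return committed;
    }

    public static void main(String[] args) {
        PendingQueueModel core = new PendingQueueModel();
        core.add("document added by job A");
        core.add("document added by job B");
        core.commit(); // issued by job A, but job B's document is committed as well
        System.out.println(core.committed());
    }
}
```

In practice this means two jobs writing to the same core at the same time cannot commit or roll back independently of each other.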



 tSOLRRollback component

This component rolls back the changes made on the SOLR server.

Connection : a closed list of the tSOLRConnection components defined in the job. It lets you select the connection to use.



 tSOLRInput component

This component is used to query the SOLR server.

Connection : a closed list of the tSOLRConnection components defined in the job. It lets you select the connection to use.
Schema : the schema for this component. The schema must map the SOLR fields from which you want to retrieve data.
Solr query : the SOLR query. Defaults to "*:*" (retrieve all documents). For more information on SOLR queries, please refer to the SOLR documentation.
Rows : the maximum number of rows retrieved per page. SOLR uses pagination to return results. If you get an OutOfMemoryError on a query that retrieves a lot of documents, decrease this value.
Solr filters : used to define one or more filter queries (fq param in SOLR). Do not wrap the filter names and values of this table in quotes (").
For more information on SOLR queries, please refer to the SOLR documentation.
Solr params : used to append additional params to the query sent to SOLR. Useful for defining facets, for example. Do not wrap the param names and values of this table in quotes (").
For more information on SOLR params, please refer to the SOLR documentation.
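
To show how these settings fit together, here is a minimal sketch that assembles the query, rows, filters and extra params into a SOLR request query string using only the Java standard library. It is an illustration under my own assumptions, not the component's code (the component goes through SolrJ). The class name is hypothetical and "category" is a made-up field name. q, rows, fq, facet and facet.field are standard SOLR request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper (hypothetical name) that builds a SOLR request query string.
class SolrQueryString {

    // query -> q, rows -> rows, each filter -> one fq param, then the extra params as given.
    static String build(String query, int rows, List<String> filters, Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        append(sb, "q", query);
        append(sb, "rows", String.valueOf(rows));
        for (String filter : filters) {
            append(sb, "fq", filter);
        }
        for (Map.Entry<String, String> param : params.entrySet()) {
            append(sb, param.getKey(), param.getValue());
        }
        return sb.toString();
    }

    private static void append(StringBuilder sb, String name, String value) {
        if (sb.length() > 0) {
            sb.append('&');
        }
        sb.append(encode(name)).append('=').append(encode(value));
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("facet", "true");
        params.put("facet.field", "category"); // "category" is a hypothetical field
        System.out.println(build("*:*", 10, Arrays.asList("category:book"), params));
    }
}
```

Each row of the Solr filters table becomes its own fq parameter, and each row of the Solr params table is appended as a plain name/value pair.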



The component automatically detects facet results in the SOLR query response.

The variables defined with 'AFTER' availability are :



FACETS_FIELD : contains the facet results if any facet field has been defined in the query.
The type of this variable is : List<org.apache.solr.client.solrj.response.FacetField>
FACETS_QUERY : contains the facet results if any facet query has been defined in the query.
The type of this variable is : List<java.util.Map<String,Integer>>
FIELDS_STAT : contains the stats results if the SOLR stats component has been enabled in the query.
The type of this variable is : List<java.util.Map<String, org.apache.solr.client.solrj.response.FieldStatsInfo>>
NB_LINE_FACETS_FLOW : the number of facet results found
NB_LINE_ROWS_FLOW : the standard number of lines found on the main rows flow

For more information on these SOLR features, please refer to the SOLR documentation.
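
As an illustration, here is a minimal sketch of reading the FACETS_QUERY variable in a later component (a tJava, for example). It assumes the usual Talend convention of storing AFTER variables in globalMap under the key componentName_VARIABLE, and a component named tSOLRInput_1. The helper class is hypothetical, and a plain HashMap stands in for Talend's globalMap so that the sketch runs on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper (hypothetical name) for reading the FACETS_QUERY variable.
class FacetQueryReader {

    // Flattens List<Map<String,Integer>> into a single "facet query -> count" map.
    @SuppressWarnings("unchecked")
    static Map<String, Integer> readFacetQueries(Map<String, Object> globalMap, String componentName) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Object value = globalMap.get(componentName + "_FACETS_QUERY"); // assumed key convention
        if (value == null) {
            return counts; // no facet query was defined
        }
        for (Map<String, Integer> entry : (List<Map<String, Integer>>) value) {
            counts.putAll(entry);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for Talend's globalMap, filled with made-up facet results.
        Map<String, Object> globalMap = new HashMap<>();
        Map<String, Integer> facet = new HashMap<>();
        facet.put("price:[0 TO 10]", 3);
        List<Map<String, Integer>> facets = new ArrayList<>();
        facets.add(facet);
        globalMap.put("tSOLRInput_1_FACETS_QUERY", facets);

        System.out.println(readFacetQueries(globalMap, "tSOLRInput_1"));
    }
}
```

The same pattern applies to FACETS_FIELD and FIELDS_STAT, with the element types listed above.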





Scenario (Test case)

Below is an illustration of the use of these components :
  1. Define the connection to the SOLR server
  2. Add data from an XML file
  3. Commit the changes (or roll back if any error occurs)
  4. Query the data inserted in step 2
  5. Delete the documents returned by the query
  6. Commit the changes (or roll back if any error occurs)








18 comments:

  1. Hello,

    thanks a lot for your development. The docu is also great.
    Unfortunately, all the indexed data gets logged (tSolrOutput) on the console without a possibility to deselect this option (as a non-programmer). This slows down the process enormously if you start the job from TOS.

    1. Hello, can you give me the logs that appear on the TOS console?
      The only logs should appear when you do not specify a key in the schema for the delete action on tSOLROutput (unless I forgot to comment out or delete other logs).

      Thanks

    2. Hello Sébastian,

      can I send you an e-mail, cause I'm only allowed to post 4092 chars. This is not much for the log output, but it generally looks like:
      DEBUG org.apache.commons.httpclient.params.DefaultHttpParams - Set parameter http.useragent = Jakarta Commons-HttpClient/3.1
      DEBUG org.apache.commons.httpclient.params.DefaultHttpParams - Set parameter http.protocol.version = HTTP/1.1

      ...
      search_title search_creatorcontrib

      ...
      DEBUG org.apache.commons.httpclient.MultiThreadedHttpConnectionManager - HttpConnectionManager.getConnection: config = HostConfiguration[host=http://192.168.178.50:8080], timeout = 100

  2. hi! thank you for contributing this to the talend community! i'm trying out these Solr components, against a Solr 3.5.0 instance. First, I tried your components with the included JAR files, but got an error on the tSOLROutput component stating:
    "Exception in component tSOLROutput_1
    org.apache.solr.common.SolrException: Bad Request

    Bad Request

    request: http://kopdevsolr.advanceweb.com:8080/careers/jobPost/update?wt=javabin&version=2
    ...
    ...
    "

    I'm guessing i just need to use Solr 3.5.0 JAR files instead? In your writeup, you do mention replacing the JAR files with the correct version, but should i retain the JAR file names from 3.5.0, or should i rename those JAR files to match the names that come with the components?

    1. ah, nevermind. i found my problem once i looked in the solr log, i had a field in my Talend job that i had not specified in my schema.xml file.
      I also deleted all of the JAR files in each component directory and copied in the equivalent from my Solr 3.5.0 build, without renaming any files. That seems to work fine, too.
      thank you for the components!

    2. Yes, I forgot to specify this (and I've updated the post) : if you change the solr client jar (and so the name of the lib), you must update its name, and the names of all dependency jars that change too, in each tSOLRXXXX_java.xml (in the <CODEGENERATION> <IMPORTS> section).
      I will add this to the post.
      It's strange that it works fine without this.
      The 3.3.0 lib works with a 3.5.0 solr server for standard features, but working with different versions is not very clean.

    3. well, maybe i updated the wrong component folders with the new JARs. When i initially installed these components, I went through the Exchange link in TOS. But, upon trying to use the components, I was getting an error about missing JAR files. So, in re-reading your documentation, I realized I needed to do the Preferences -> Talend -> Components step, too. I pointed this to the "plugins/org.talend.designer.components.exchange_5.0.1.r74687/downloaded" folder under my main TOS directory. All of the components are subdirectories under this main folder. These are the component directories that I put the 3.5.0 jar files into, too. I removed the old jar files and copied in the new ones.

      However, i'm noticing these components in 2 other TOS subdirectories, one under "plugins/org.talend.designer.components.localprovided_5.0.1.r74687/components/ext/user" and the other under "plugins/org.talend.designer.components.localprovider_5.0.1.r74687/components/ext/exchange"

      So, maybe I'm not pointing to the correct directory in my Preferences->Talend->Components setting and it's using the JAR files from one of the other 2 directories?

    4. To be sure, replace the jars with those of your solr version in the user components directory that you defined in TOS.
      Go to Preferences -> Talend -> Components and press "Apply" to reload the user components folder.
      The modifications to the components will be propagated to the other folders you describe.
      Don't forget to update the jar names in each component if you changed them.

  3. I have installed the solr components in my talend studio and dragged the solrconnection component into a job. While compiling, it displays an error that org.apache.solr cannot be resolved to a variable. Can you solve this?

    1. strange, which is your talend version ?

    2. I am using talend 5.0r version.

    3. can u send me the output of the solr search job ? bcoz i tried u r job but fininputxml is displaying erros? and also tel me how to use xpath query in inputfilexml.

      Thanks in advance.

    4. please give me your mail at sebastien[dot]jaussaud[at]gmail[dot]com and i will send you a talend export of the job example.

  4. Hi Sébastien,

    I am testing your components and for me they work very well, nice job!

    I have a question about highlighting: is your tSolrInput component able to get the highlighting field ?

    Thanks in advance,
    Maxime

  5. Hi,

    I am trying to load data from a mysql table into SOLR. Keep getting a Bad Request Error with incorrect field. Am i missing anything? Should i set any configuration/schema on the SOLR side? I just followed your steps for the tsolroutput. Instead of xml, i used a mysql source. Did not set anything on the solr side. I guess the jars are fine as i am able to connect to SOLR and able to see the log for the error on the admin page.

  6. Hello! my multi-value field is defined as a List in the schema but it shows up as a comma separated string in tLogRow. Also, I was wondering how do I normalise a List datatype in the schema.

    Thank you very much for your great work!

    Martin

  7. I am interested in knowing more about Talend 5.1.1 and SOLR 4.7.0 connection. Is this accepted and/or supported by Talend? Thanks, Jennifer
