6/21/2018

Oracle TIMESTAMP for use with Solr's Data Import Handler


The purpose of this article is to walk through how to set up an Oracle table with a TIMESTAMP column and use that column as the predicate for Solr's Data Import Handler (DIH).  In addition, you'll likely want to use your TIMESTAMP column as a data point in Solr for sorting, faceting, etc.

The first thing I thought about was how to store the data in Oracle so that the native format would support Solr's DIH without any conversions.  The idea is that imports will run more quickly by avoiding function calls in the SELECT statement.  That said, I found there was no way to do this, and that datetime format conversions are required.  This is a simple test that helped me get familiar with the intricacies of working with dates, including using dates for the delta imports.  For real-world projects we'll likely have to deal with a variety of datetime formats, and come up with creative solutions for tables that do not have timestamps.

For this testing, the first thing you'll want to do is create a column with a TIMESTAMP datatype. Because the defaults in Solr are UTC, I've proceeded in the same fashion.  I created a test table as follows:

CREATE TABLE "[schema]"."SOLR_TEST" 
( 
  "ID"         NUMBER GENERATED ALWAYS AS IDENTITY, 
  "TIME_STAMP" TIMESTAMP (6) DEFAULT SYS_EXTRACT_UTC(SYSTIMESTAMP), 
  "CATEGORY"   VARCHAR2(255 BYTE), 
  "TYPE"       VARCHAR2(255 BYTE), 
  "SERVERNAME" VARCHAR2(255 BYTE), 
  "CODE"       VARCHAR2(255 BYTE), 
  "MSG"        VARCHAR2(255 BYTE) 
);


The ID and TIME_STAMP columns will be populated automatically.  "ID" will get a number starting at 1 and increasing by 1 with each insert.  "TIME_STAMP" will get a UTC date and time value generated at the time of insert.  An example value is as follows:
'19-JUN-18 09.23.58.692808000 PM' 
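That default Oracle display format can be parsed outside the database as well. Here's a minimal sketch (assuming Python, which is not part of this setup; note that Python's %f directive only accepts six fractional digits, so the nine-digit fractional seconds get truncated first):

```python
import re
from datetime import datetime

# Oracle's default TIMESTAMP display, e.g. '19-JUN-18 09.23.58.692808000 PM'
raw = "19-JUN-18 09.23.58.692808000 PM"

# Python's %f accepts at most 6 fractional digits, so trim the trailing three.
trimmed = re.sub(r"(\.\d{6})\d*", r"\1", raw)

ts = datetime.strptime(trimmed, "%d-%b-%y %I.%M.%S.%f %p")
print(ts)  # 2018-06-19 21:23:58.692808
```

This is just a sanity check on the format itself; the actual conversion in this article happens inside the Oracle queries below.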

DIH Delta Query

Information on getting started with the DIH can be found here.

We'll need to configure our deltaQuery to compare our Oracle TIME_STAMP column with Solr's ${dataimporter.last_index_time} variable.  An example value for this variable is as follows:
'2018-06-19 20:31:20' 

So, we need to configure the conversion.  From a strictly Oracle perspective, the following query converts Solr's last_index_time string into a TIMESTAMP that can be compared with the "TIME_STAMP" column defined above:

SELECT TO_TIMESTAMP('2018-06-19 20:31:20', 'YYYY-MM-DD HH24:MI:SS') 
FROM   dual; 

Based on this the deltaQuery can therefore be configured, in the dataConfig file, as such:

deltaQuery="SELECT ID FROM SOLR_TEST WHERE TIME_STAMP > TO_TIMESTAMP('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')"

DIH Query and Managed Schema

In order to import your TIME_STAMP column for use in Solr, you'll need to conform to Solr's preferred date format.  This is detailed here.  An example value is as follows:
'2018-06-19T21:09:39Z' 
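That target format can be sanity-checked outside the database too. As a minimal sketch (again assuming Python, purely for illustration), formatting a UTC datetime in Solr's ISO-8601 style looks like this:

```python
from datetime import datetime

# A UTC timestamp as it might come out of the SOLR_TEST table
ts = datetime(2018, 6, 19, 21, 9, 39)

# Solr's canonical date representation: YYYY-MM-DDThh:mm:ssZ
solr_value = ts.strftime("%Y-%m-%dT%H:%M:%SZ")
print(solr_value)  # 2018-06-19T21:09:39Z
```

The trailing "Z" is a literal marking the value as UTC, which is why it appears quoted in the Oracle format strings below.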

Again, from a strictly Oracle perspective the following query will result in a format that matches the Solr timestamp format:

SELECT TO_CHAR(TO_TIMESTAMP('19-JUN-18 09.23.58.692808000 PM'), 
               'YYYY-MM-DD"T"HH24:MI:SS"Z"') 
FROM   dual;

Because our Oracle "TIME_STAMP" column is already a TIMESTAMP datatype, we can skip the TO_TIMESTAMP function in our DIH query, which looks like this:

query="SELECT ID, to_char(TIME_STAMP, 'YYYY-MM-DD&quot;T&quot;HH24:MI:SS&quot;Z&quot;') AS TIME_STAMP, CATEGORY, TYPE, SERVERNAME, CODE, MSG FROM SOLR_TEST"

You'll notice &quot; in there several times.  This is the XML encoding for the double quote.  I've seen conflicting documentation on what, in the data config file, needs to be XML-encoded and what does not.  All I can say is be aware of this as a potential issue, and that the in-line double quotes above do need to be encoded.
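To see exactly what that encoding does to the TO_CHAR format string, here is a minimal sketch (assuming Python's xml.sax.saxutils, which is not part of this setup):

```python
from xml.sax.saxutils import escape

# The Oracle format string as it appears inside the SQL, before XML encoding
fmt = 'YYYY-MM-DD"T"HH24:MI:SS"Z"'

# escape() handles &, <, and > by default; the extra entities map
# adds the double-quote replacement needed inside an XML attribute.
encoded = escape(fmt, {'"': "&quot;"})
print(encoded)  # YYYY-MM-DD&quot;T&quot;HH24:MI:SS&quot;Z&quot;
```

The encoded form is what must appear inside the query attribute of the data config file.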

I did test the '>' symbol in the deltaQuery using both the raw '>' and the encoded '&gt;'.  Both worked.

This formatting is done specifically to get Solr to recognize these values as dates so that they can subsequently be used in sorting and faceting.  That said, here's an excerpt of the definitions for the related fields from the "managed-schema" file:

...
<field name="id" type="tint" indexed="true" stored="true" required="true" multiValued="false" docValues="true" />
<field name="time_stamp" type="date" indexed="true" stored="true" multiValued="false" docValues="true" />
<field name="category" type="string" indexed="true" stored="true" multiValued="false" />
<field name="type" type="string" indexed="true" stored="true" multiValued="false" />
<field name="servername" type="string" indexed="true" stored="true" multiValued="false" />
<field name="code" type="string" indexed="true" stored="true" multiValued="false" />
<field name="msg" type="string" indexed="true" stored="true" multiValued="false" />
...
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0" />
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0" />
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
...

You'll notice that docValues is set to true for id and time_stamp.  Here's why.

In addition here's the full data config file:
<dataConfig> 
  <dataSource driver="oracle.jdbc.OracleDriver" url="jdbc:oracle:thin:...[put your connection string here]" />
  <document>
    <entity name="SOLR_TEST"
            query="SELECT ID, to_char(TIME_STAMP, 'YYYY-MM-DD&quot;T&quot;HH24:MI:SS&quot;Z&quot;') AS TIME_STAMP, CATEGORY, TYPE, SERVERNAME, CODE, MSG FROM SOLR_TEST"
            deltaImportQuery="SELECT ID, to_char(TIME_STAMP, 'YYYY-MM-DD&quot;T&quot;HH24:MI:SS&quot;Z&quot;') AS TIME_STAMP, CATEGORY, TYPE, SERVERNAME, CODE, MSG FROM SOLR_TEST WHERE ID = ${dataimporter.delta.ID}"
            deltaQuery="SELECT ID FROM SOLR_TEST WHERE TIME_STAMP > TO_TIMESTAMP('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')">
      <field column="ID" name="id" />
      <field column="TIME_STAMP" name="time_stamp" />
      <field column="CATEGORY" name="category" />
      <field column="TYPE" name="type" />
      <field column="SERVERNAME" name="servername" />
      <field column="CODE" name="code" />
      <field column="MSG" name="msg" />
    </entity>
  </document>
</dataConfig>

You'll notice that the "ID" coming from Oracle is capitalized and the "id" in Solr is lower case.  This DOES make a difference in the data config queries.  Specifically, my delta imports were failing until I changed "dataimporter.delta.id" to "dataimporter.delta.ID".  The deltaImportQuery uses the results of the deltaQuery, so the name and capitalization need to match.

P.S.

I should probably mention that when running your delta-import, you'll most likely want to set clean=false; otherwise you'll lose those documents you loaded with your full-import.
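As a sketch of what that request might look like (assuming a local Solr node and a hypothetical collection name, both for illustration only), the clean parameter rides along on the dataimport URL:

```python
from urllib.parse import urlencode

# Hypothetical Solr host and collection name, for illustration only.
base = "http://localhost:8983/solr/solr_test/dataimport"

# clean=false preserves documents loaded by the earlier full-import.
params = {"command": "delta-import", "clean": "false", "commit": "true"}
url = base + "?" + urlencode(params)
print(url)  # http://localhost:8983/solr/solr_test/dataimport?command=delta-import&clean=false&commit=true
```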


For your reference:

How Solr likes datetime:
https://lucene.apache.org/solr/guide/6_6/working-with-dates.html#working-with-dates

Oracle datetime formatting:
https://docs.oracle.com/database/121/SQLRF/sql_elements004.htm#CDEHIFJA

The date Solr uses for delta comparisons:
${dataimporter.last_index_time}
This is also saved in a file called "dataimport.properties" which, in my SolrCloud configuration, is found in ZooKeeper under the 'configs' folder for the collection.

2 comments:

  1. The clean parameter defaults to false on delta-import. It defaults to true on full-import. If you're finding that delta-import deletes all your documents without clean=false, that's a bug. You haven't indicated which version of Solr you're running.

    1. The Solr admin console for 6.6.2 defaults to clean (a checked checkbox) for delta-import.
