Endeca Baseline - Index process fails immediately on the RepositoryExport (ATG-Endeca) phase

Possible symptom in the log:

/atg/commerce/endeca/index/ProductCatalogSimpleIndexingAdmin --- java.lang.RuntimeException: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 30000 ms

This usually indicates a connectivity problem between ATG and Endeca CAS, either a configuration issue or a networking issue.

Check:
1. Verify that the CASHostName and CASPort properties are set correctly in the following components (see the properties sketch after this list):


/atg/endeca/index/SchemaDocumentSubmitter

/atg/endeca/index/DataDocumentSubmitter

/atg/endeca/index/DimensionDocumentSubmitter

Also check (and modify if needed) the /atg/endeca/index/IndexingApplicationConfiguration component (in ATG).

2. Ping the target Endeca CAS server from the command line on the ATG box to verify that it is reachable on the network.

3. Make sure DNS resolution is not misconfigured such that the CAS hostname resolves to the IP address of a server other than the actual CAS server.
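
As a sketch for check 1, the CAS connection settings can be overridden in localconfig. The host name and port below are placeholders (the default CAS service port is typically 8500), and the exact property names should be confirmed against the live component in dyn/admin:

   # localconfig/atg/endeca/index/IndexingApplicationConfiguration.properties
   # Placeholder host/port; verify the property names and values in dyn/admin
   CASHostName=cas.example.com
   CASPort=8500

The same host and port must resolve correctly from the ATG box (checks 2 and 3) and should match what the document submitter components listed above use at runtime.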

Endeca data model design - Pros/Cons

Problem Statement:
Consider ~30,000 SKUs that are each represented uniquely across up to 8,000 stores. Each store can have up to 20 unique fields for each product, including various prices, sale prices, coupon codes, and on-hand inventory, all of which are specific to every store.

This totals up to:

30,000 SKUs * 8,000 stores * 20 unique fields = 4.8 billion cells of data that need to be stored in Endeca.

Solution:

There are two viable data models:

1) A “Wide” model that adds store-specific attributes to each base product record. This equates to 30,000 rows of data (one per product), where each row has 160,000 columns (8,000 stores * 20 fields).

2) A “Multi-Record” model that creates a full copy of each product record with one store’s data attached. This equates to 240 million rows (one per product-store combination), where each row has 20 columns of data.
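
To make the two shapes concrete, here is a hedged sketch of how a single SKU might look under each model; the property names (price_store_0001, onHand_store_0001, storeId, and so on) are purely illustrative, not names from an actual schema:

   # “Wide” model: one record per SKU, store-specific values flattened into property names
   record.id=sku30001
   price_store_0001=19.99
   onHand_store_0001=12
   ...
   price_store_8000=21.49
   onHand_store_8000=0

   # “Multi-Record” model: one record per SKU/store pair
   record.id=sku30001-store0001
   sku.id=sku30001
   storeId=0001
   price=19.99
   onHand=12

In the multi-record model the per-store records are typically rolled back up into one logical product at query time (aggregate records), which is why query and display logic stays simple at the cost of index size.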

Wide record pros and cons:

Pros:
1) Simple, performant queries

Cons:
1) High indexing time
2) Complex process to dynamically create attributes
3) Complex dimension mapping for precise values like price
4) Indexing scales poorly for >100k properties
5) Complex display logic

Multi-record pros and cons:

Pros:
1) Simple, performant queries
2) Simple indexing logic
3) Simple dimension mapping for precise values
4) Simple attribute display logic

Cons:
1) Large index size: memory and disk footprint
2) Possible run-time performance issues from inadequate memory

How to check the Endeca Dgraph health?

A quick way of checking the health of a Dgraph or an Agraph is by accessing the URL:

http://DgraphServerNameOrIP:DgraphPort/admin?op=ping


The Dgraph quickly returns a lightweight HTML response page with the following content:

dgraph host:port responding at date/time

The Dgraph ping is the recommended mechanism for performing automated health checks with a load balancer.
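
For an automated probe (for example, from a load balancer health monitor), the same URL can be requested with curl; the host and port are placeholders:

   curl -s "http://DgraphServerNameOrIP:DgraphPort/admin?op=ping"
   # Expected body: dgraph host:port responding at date/time
   # No response, or a non-200 status, means the Dgraph is down or unresponsive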

Endeca Forge State: how to restore forge state?

Restoring state is a simple matter of copying the appropriate files from a backup directory to the state directory prior to the next baseline/delta update.

To revert to a previous version of state:


1. Locate the backup of the data/state directory that you wish to roll back to.

2. For autogen state, copy the autogen* files into data/state.

3. For delta update state, copy the appropriate delta_state*.bin file into data/state.
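
A hedged sketch of the copy commands, assuming the backups live under a data/state_backups/<timestamp> directory alongside data/state (adjust both paths to your environment):

   # 1. Pick the backup to roll back to
   ls data/state_backups/
   # 2. Restore autogen state
   cp data/state_backups/20240101/autogen* data/state/
   # 3. Restore delta update state
   cp data/state_backups/20240101/delta_state*.bin data/state/

Run this before kicking off the next baseline/delta update so Forge picks up the restored state.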

Tips to improve performance of an ATG-Endeca integration environment (versions 3.1.1 and 3.1.2)

1) Make sure the patch below is applied:

Patch 17342677 - Reduces the number of supplemental objects that are returned with queries and fixes an XML parser locking problem.

2) Check the properties being returned by Endeca (apply the Endeca setSelection feature).
In the Assembler you can select which properties are returned with the search results; include only the properties that the application actually needs. Here is the ATG component path:
/atg/endeca/assembler/cartridge/handler/config/ResultsListConfig.properties
Refer to http://ravihonakamble.blogspot.com/2015/06/endeca-select-feature-aka-set-selection.html

3) Disable Endeca preview on production.
Use the /dyn/admin/nucleus/atg/endeca/assembler/cartridge/manager/AssemblerSettings/ component and set previewEnabled=false (see the properties sketch after this list).

4) Set sub-records per aggregate record to one.
In the /atg/endeca/assembler/cartridge/handler/config/ResultsListConfig component:

   # For aggregate records, sets the number of sub records that should be included in the results
   subRecordsPerAggregateRecord=ONE


5) Ensure non-Endeca URLs do not hit the Assembler.
Use the ignoreRequestURIPattern property of the /atg/endeca/assembler/AssemblerPipelineServlet component to set URL patterns that should bypass the Assembler pipeline (see the properties sketch after this list).
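
A minimal localconfig sketch covering tips 3 and 5. The previewEnabled setting comes straight from the tip above; the regex shown for ignoreRequestURIPattern is only an illustrative assumption and must be adapted to your own URL space:

   # localconfig/atg/endeca/assembler/cartridge/manager/AssemblerSettings.properties
   previewEnabled=false

   # localconfig/atg/endeca/assembler/AssemblerPipelineServlet.properties
   # Example pattern only: bypass the Assembler for static assets and dyn/admin requests
   ignoreRequestURIPattern=.*[.](css|js|gif|jpg|png|ico)$|^/dyn/admin/.*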

   

ProductCatalogOutputConfig cleaning process taking a long time

Problem statement:
The Endeca baseline index takes a long time during the cleaning stage of the ProductCatalogOutputConfig component.
Here is the screenshot of the component from ATG Dynamo Administration:
[Screenshot: ProductCatalogOutputConfig - Cleaning]
Cause:
Any change to a product/SKU is queued in /atg/search/repository/IncrementalItemQueueRepository, and once a baseline/partial update extracts those changes it deletes the queued items one by one.

Solution:
A baseline index indexes all products/SKUs and has no dependency on the IncrementalItemQueueRepository data. Before every baseline update, run the query below to delete the product/SKU rows from the srch_update_queue table:

DELETE FROM srch_update_queue WHERE item_desc_name = 'sku' OR item_desc_name = 'product'