Scaling Apache SolrCloud at Numerator

Scaling Apache SolrCloud at Numerator by reindexing to Time Routed Aliases.

Overview

Numerator’s AdIntel business helps brands and agencies make impactful advertising decisions. We provide insights by transcribing and storing advertisements run across different media types and media channels in North America.

We store transcribed advertisements in an Apache SolrCloud cluster and leverage its full-text searching and faceting capabilities for analytics.

  • Total collection size: 700 GB; ~450 million documents (advertisements) covering the current year plus the last 4 years.
  • Data updates: 200K documents indexed (a mix of inserts and updates) every hour. Hourly indexing is limited to advertisements within ±3 months of the current month.
  • Data replication and sharding: NRT (Near Real-Time) replication in the SolrCloud cluster, with no sharding.

We have multiple performance and scalability issues in the current SolrCloud setup.

Issue 1

High query response times (more than 10 seconds): We identified that all slow queries were using leading wildcard searches.

Example query: all_text:*school (note the leading wildcard * at the beginning).

Solution: Modify the index analyzer of the text field so that leading-wildcard queries no longer have to scan the entire term dictionary.
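
The exact analyzer change isn't spelled out here, but a common Solr approach for leading wildcards is ReversedWildcardFilterFactory, which additionally indexes each token reversed so that *school can be answered by a fast trailing-wildcard lookup on the reversed term. Below is a minimal sketch using the Schema API; the field type name text_rev, the all_text field settings, and the filter choice are assumptions, not confirmed details from our setup:

import requests

# Sketch (assumed approach): add a field type whose *index* analyzer applies
# solr.ReversedWildcardFilterFactory. Tokens are indexed both as-is and
# reversed, so Solr can rewrite a leading-wildcard query into a trailing-
# wildcard search internally. Names below are hypothetical.
schema_url = "http://solrHost:8983/solr/source_collection/schema"

requests.post(schema_url, json={
    "add-field-type": {
        "name": "text_rev",
        "class": "solr.TextField",
        "indexAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                {"class": "solr.LowerCaseFilterFactory"},
                # withOriginal keeps the un-reversed token in the index too.
                {"class": "solr.ReversedWildcardFilterFactory",
                 "withOriginal": "true", "maxPosAsterisk": "2"},
            ],
        },
        # The query analyzer stays un-reversed; Solr handles reversing
        # wildcard queries itself when this filter is present at index time.
        "queryAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [{"class": "solr.LowerCaseFilterFactory"}],
        },
    }
})

# Switch the text field to the new type; this is one of the changes that
# requires a full reindex.
requests.post(schema_url, json={
    "replace-field": {"name": "all_text", "type": "text_rev"}
})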

Issue 2

High query response times during hourly indexing: The current SolrCloud setup uses NRT replication, which means every node handles both indexing and querying, so query response times suffer while hourly indexing is in progress.

Solution: Set up SolrCloud with a combination of TLOG and PULL replicas for better indexing and query performance, at the cost of making search slightly less real-time. See Shards and Indexing Data in SolrCloud.
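
A sketch of what this can look like with the Collections API (the collection name, config name, and replica counts below are illustrative): TLOG replicas maintain the transaction log and do the indexing, while PULL replicas only copy finished segments from the leader and serve queries.

import requests

# Hypothetical example: 2 TLOG replicas (one is elected leader and indexes)
# plus 2 PULL replicas (replicate the index from the leader, query-only).
requests.get("http://solrHost:8983/solr/admin/collections", params={
    "action": "CREATE",
    "name": "ads",
    "collection.configName": "ads_config",
    "numShards": 1,
    "tlogReplicas": 2,
    "pullReplicas": 2,
})

# Keep query traffic off the indexing replicas by preferring PULL replicas.
requests.get("http://solrHost:8983/solr/ads/select", params={
    "q": "all_text:school",
    "shards.preference": "replica.type:PULL",
})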

Issue 3

Scalability and future data growth: We plan to support the storage and analysis of more than 4 years of historical advertisement data, and the current single-shard setup may not scale to support that growth. This was not as pressing as the two issues above, but still relevant, since sharding improves scalability and can positively affect both indexing and query performance.

Solution: Shard the data using SolrCloud's Time Routed Aliases feature, since every advertisement is time-stamped with its run date and hourly indexing is limited to advertisements that ran in the last 3 months or are booked for the next 3 months.
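
A sketch of creating such an alias with the Collections API (the alias name, config name, and monthly interval are assumptions, chosen to match the monthly routing used later in this post):

import requests

# Hypothetical example: a Time Routed Alias that routes each document into a
# monthly collection based on its run_date field. Solr creates the underlying
# collections on demand as documents arrive for new time windows.
requests.get("http://solrHost:8983/solr/admin/collections", params={
    "action": "CREATEALIAS",
    "name": "alias_name",                 # what the app indexes and queries
    "router.name": "time",
    "router.field": "run_date",
    "router.start": "2018-01-01T00:00:00Z",
    "router.interval": "+1MONTH",         # one collection per month
    # Allow advertisements booked up to ~3 months in the future.
    "router.maxFutureMs": 3 * 30 * 24 * 3600 * 1000,
    "create-collection.collection.configName": "ads_config",
    "create-collection.numShards": 1,
})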

All roads lead to Rome (Solr reindex)

I mean, to fix Issue 1 and Issue 3 we need to reindex, just like several other Solr changes that require a full reindex.

The current SolrCloud setup takes more than 30 days to reindex the existing 450 million documents from the source database (Postgres).

It means that to roll out performance and scalability changes in production with minimal downtime and keep costs in check, we have to Conquer Rome. Not literally :).

Conquer Rome

Current reindex process

The existing process reindexes all documents from the source database (Postgres) via an external indexing process. It is slow due to:

  • Network latency.
  • Serialization and deserialization cycles.
  • More and larger segments in a single collection.

New reindex process

We explored a completely different reindex process that does not depend on the source DB or an external reindexing process. Instead, it creates daemons on a SolrCloud node that use Streaming Expressions to stream documents from the existing collection into the new one.

The new reindex process was faster because:

  • No external network traffic; all data movement stays local to the cluster.
  • Fewer and smaller segments per collection mean less I/O and faster segment merges.

Example Python script that creates the daemons using Streaming Expressions:

from datetime import date

import requests
from dateutil.relativedelta import relativedelta

date_format = "%Y-%m-%dT%H:%M:%SZ"

# Create one daemon per month of data. Each daemon's topic() stream reads that
# month's documents from the source collection (tracking its position in a
# checkpoint collection), update() indexes them into the Time Routed Alias,
# and commit() commits after every batch.
for y in range(2018, 2023):
    for m in range(1, 13):
        start_date = date(y, m, 1)
        end_date = start_date + relativedelta(months=1)
        # Solr query to filter ADs that ran in a given month, e.g. for Jan 2018:
        # run_date:[2018-01-01T00:00:00Z TO 2018-02-01T00:00:00Z}
        # (the doubled "}}" in the f-string emits the literal "}" of the
        # exclusive upper bound)
        query = f"run_date:[{start_date.strftime(date_format)} TO {end_date.strftime(date_format)}}}"

        daemon_expr = f"""
                        commit(
                            alias_name,
                            batchSize="2000",
                            update(
                                alias_name,
                                batchSize="2000",
                                topic(
                                    checkpoint_collection,
                                    source_collection,
                                    id="topic_{y}{m:02d}",
                                    q="{query}",
                                    fl="*",
                                    rows="2000",
                                    initialCheckpoint="0"
                                )
                            )
                        )
                        """
        url = "http://solrHost:8983/solr/worker_collection/stream"
        # terminate="true" stops the daemon once topic() returns no new
        # documents, i.e. when the month has been fully reindexed.
        expr = f"""
                daemon(
                    id="daemon_{y}{m:02d}",
                    terminate="true",
                    {daemon_expr}
                )
                """
        params = {"expr": expr}
        response = requests.post(url, params=params)
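
Because terminate="true" is set, each daemon shuts itself down once its topic() stream stops returning new documents. While the daemons run, their progress can be checked through the same /stream handler. A small sketch; the list and kill daemon-admin actions are standard /stream parameters, but the response handling below is illustrative:

import requests

url = "http://solrHost:8983/solr/worker_collection/stream"

# List the daemons running on this worker collection, with their state and
# iteration counts.
response = requests.post(url, params={"action": "list"})
for doc in response.json()["result-set"]["docs"]:
    print(doc)

# A misbehaving daemon can be stopped and removed by id, for example:
# requests.post(url, params={"action": "kill", "id": "daemon_201801"})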

The new reindex process finished in 2 days in the test environment.

After all the changes and a full reindex in the test environment, the results were encouraging:

  • 10 times faster queries (~1 second vs 10 seconds).
  • 15 times faster reindex (2 days vs 30 days).
  • 12 times faster forced segment merge (5 mins vs 1 hr).
  • 3 times less storage required (collection size 350 GB vs 1 TB).

It is critical to revisit your search engine or analytical database every 3-5 years and re-architect it for performance and scalability as data and traffic grow.

If you would like to conquer Rome :), I mean solve data and performance challenges, Numerator is hiring. Join our team!
