
Saturday, May 15, 2021

Large Data Volume (LDV)

Data is one of the key elements of any system, and as a system ages, its data volume grows. These large data volumes (LDV) can lead to performance issues, including slower queries, slower search and list views, and slower sandbox refreshes. When we deal with large data volumes, there are three broad activities to take care of:
  1. Querying and searching the data - this includes selective queries, skinny tables, etc.
  2. Loading the data - use lean loading.
  3. Performing data deletion and extraction - hard delete, PK chunking.
Following best practices can address these issues early, whether you are searching, loading, deleting, or extracting data. Let's talk about possible data issues and their resolutions, along with the different tools used to address them.

What is Data Skew?

Data skew primarily refers to a non-uniform distribution in a dataset. Whenever data is distributed non-uniformly, it leads to performance issues. We need to architect our design to avoid data skew.

Problems due to Data Skew

  1. Record Locking - You're updating a large number of contacts under the same account in multiple threads. For each update, the system locks both the contact being changed and its parent account to maintain integrity in the database. Even though each lock is held for a very short time, because all the updates are trying to lock the same account, there's a high risk an update will fail because a previous one is still holding the lock on the account.
  2. Sharing Issues - There's a similar dynamic when it comes to sharing. Depending on how you have sharing configured, when you do something that seems simple, like changing the owner of an account, you may need to examine every one of the account's child records and adjust their sharing, as well. That may include recalculating the role hierarchy and sharing rules. And if we're talking about hundreds of thousands of child records, that can eat up tons of time.

Types of Data Skew

There are three different types of data skew:
  1. Ownership Skew
  2. Account Data Skew
  3. Lookup Skew 

Ownership Skew

  • When a large number of records of the same object type are owned by a single user, that imbalance causes ownership skew (a detection query sketch follows this list).
  • When the skewed owner exists in the role hierarchy, operations like deletes or owner updates must remove sharing from the old owner and all parent users within the role hierarchy, and from all users given access by sharing rules. That’s why ownership changes tend to be one of the most costly transactional changes in the system.
  • Example - an integration user creating lots of records without reassigning them to the right owners, or a customer assigning all of his or her unassigned leads to a dummy user.
  • In some cases, ownership skew simply can't be avoided. In those cases, it's best to ensure the skewed owner doesn't have a role or, if a role is required, to keep that user at the top of the role hierarchy. That way, you take the user and their records away from the role hierarchy and its associated sharing rule calculations.
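A quick way to spot potential ownership skew is an aggregate SOQL query that counts records per owner, as in the sketch below. The object and the 10,000-record cutoff are purely illustrative, not official limits; on very large objects you would run this kind of check through a report, a batch job, or the Bulk API rather than anonymous Apex, to stay within query row limits.

    // Illustrative check for ownership skew: count Account records per owner
    // and flag owners holding an unusually large number of them.
    AggregateResult[] skewedOwners = [
        SELECT OwnerId, COUNT(Id) recordCount
        FROM Account
        GROUP BY OwnerId
        HAVING COUNT(Id) > 10000
    ];
    for (AggregateResult ar : skewedOwners) {
        Id ownerId = (Id) ar.get('OwnerId');
        Integer total = (Integer) ar.get('recordCount');
        System.debug(ownerId + ' owns ' + total + ' accounts');
    }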

Account Data Skew

  • Certain Salesforce objects, like accounts and opportunities, have special data relationships that maintain parent and child record access under private sharing models. Too many child records associated with the same parent object in one of these relationships cause account data to skew.

Lookup Skew

  • Lookup skew happens when a very large number of records are associated with a single record in the lookup object. Because you can place lookup fields on any object in Salesforce, lookup skew can create problems for any object within your organization.

Index

An index is used to improve the speed of data retrieval. If you often search a table or sort its records by a particular field, you can speed up these operations by creating an index for the field. The following fields are indexed by default:
  • Primary keys (ID, Name, and Owner fields).
  • Foreign keys (lookup or master-detail relationship fields).
  • Audit dates (such as SystemModStamp). 
  • Custom fields marked as External ID or Unique.
You can use the Query Plan tool in the Developer Console to see whether your query uses an index or not.
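If you prefer checking this programmatically, the REST API's query resource also accepts an explain parameter that returns the plans the optimizer considered; a rough sketch follows, where the API version and the URL-encoded SOQL are illustrative only.

    GET /services/data/v58.0/query/?explain=SELECT+Id+FROM+Account+WHERE+Industry='Energy'

The response lists each candidate plan with fields such as relativeCost and cardinality, so you can see whether an index is expected to drive the query.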

Force.com Query Optimizer

There are some common symptoms users complain about: unresponsive reports, dashboards taking a long time to load, list views taking forever to display, slowness on Lightning pages, and frequent Apex/SOQL errors and timeouts. If you have seen these in your org, then you may have a SOQL problem. The Force.com query optimizer optimizes the SOQL used to fetch data, and it is applied to SOQL queries, reports, and list views.



  • The Force.com query optimizer is an engine that sits between your SOQL, reports, and list views, and the database itself. Because of salesforce.com’s multitenancy, the optimizer gathers its own statistics instead of relying on the underlying database statistics.  
  • Pre-computed statistics contain the following information:
    • Row Count
    • User Visibility
    • Custom index
    • Owner row count

Working with Very Large SOQL Queries

Use SOQL for loops

Your SOQL query sometimes returns so many sObjects that the limit on heap size is exceeded and an error occurs. To resolve this, use a SOQL for loop instead, since it processes multiple batches of records using internal calls to query and queryMore.
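A minimal sketch of a SOQL for loop (the object, fields, and filter are placeholders):

    // SOQL for loop: records are retrieved in batches of 200 through internal
    // query/queryMore calls, so the full result set never sits in the heap at once.
    for (List<Account> accounts : [SELECT Id, Name FROM Account WHERE CreatedDate = LAST_N_DAYS:365]) {
        for (Account acc : accounts) {
            // process each record here
        }
    }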

Use selective queries

Whenever you work in production, query selectivity plays a crucial role; note that a filter on an indexed field alone does not guarantee a selective query. A query is selective when:
  • one of the query filters is on an indexed field, and
  • the query filter reduces the resulting number of rows below a system-defined threshold.
The performance of the SOQL query improves when two or more filters used in the WHERE clause meet these conditions. The Force.com query optimizer decides whether a query is selective or not.

The selectivity threshold is 10% of the records for the first million records and less than 5% of the records after the first million records, up to a maximum of 333,333 records. In some circumstances, for example with a query filter on an indexed standard field, the threshold may be higher. Also, the selectivity threshold is subject to change.
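As a rough worked example using the numbers above: for an object with 2 million records, the threshold for a custom-indexed filter would be about 10% of the first million plus 5% of the second million, i.e. 100,000 + 50,000 = 150,000 records, so a filter returning more rows than that would be treated as non-selective.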

Common Causes of Non-Selective SOQL Queries

  • Performing Large Data Loads
    • Large data loads and deletions can affect query performance. The Force.com query optimizer uses the total number of records (including soft deleted) as part of the calculation for its selectivity threshold.
    • If you are deleting huge volumes of data, it is recommended to use hard delete.
    • Example - suppose your query fetches cards (a custom object) with Status = Available, and you have just loaded around 300K available cards, bringing the total count to 500K. The Force.com query optimizer will no longer consider this a selective query. You may need to load the cards with a status of On Hold instead and write a batch job that gradually flips them to Available while staying under the threshold.
  • Using Leading % Wildcards - A LIKE condition with a leading % wildcard does not use an index: SELECT Id FROM Account WHERE Name LIKE '%Acme%'. Note: within a report/list view, the CONTAINS clause translates into '%string%'.
  • Using NOT and != - When your filter uses != or NOT—which includes using NOT EQUALS/CONTAINS for reports, even if the field is indexed—the Force.com query optimizer can’t use the index to drive the query. For better performance, filter using = or IN, and the reciprocal values.
  • Using Complex Joins -Complex AND/OR conditions and sub-queries require the Force.com query optimizer to produce a query that is optimized for the join, but might not perform as well as multiple issued queries would. This is especially true with the OR condition. For Force.com to use an index for an OR condition, all of the fields in the condition must be indexed and meet the selectivity threshold. If the fields in the OR condition are in multiple objects, and one or more of those fields does not meet a selectivity threshold, the query can be expensive.
  • Avoiding nulls - We should avoid null values in query filters; if you can't avoid them, create a formula field that converts the null value to some defined value and filter on that field in your SOQL (see the sketch after this list).
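To make these pitfalls concrete, here is a sketch using the hypothetical Card__c object from the earlier example, with a non-selective filter rewritten into a more selective form. Region_Key__c stands in for a formula field that maps a null region to the literal value 'NONE'; all field names here are assumptions for illustration.

    // Non-selective version: leading wildcard, negative filter, and a null check.
    //   SELECT Id FROM Card__c
    //   WHERE Name LIKE '%gold%' AND Status__c != 'Closed' AND Region__c = null

    // More selective rewrite, assuming Status__c and Region_Key__c are indexed:
    // drop the leading wildcard, replace != with the positive values you want,
    // and filter on the formula field that turns null regions into 'NONE'.
    List<Card__c> cards = [
        SELECT Id
        FROM Card__c
        WHERE Name LIKE 'gold%'
          AND Status__c IN ('Available', 'On Hold')
          AND Region_Key__c = 'NONE'
    ];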

Best Practice to avoid non-selective query

To ensure that your queries are selective, avoid the common pitfalls above. Also, if you feel your query is fine but needs a custom index, you can contact Salesforce Customer Support and they can assist you in creating one. Note - a custom index can't be created on these types of fields: multi-select picklists, currency fields in a multicurrency organization, long text fields, some formula fields, and binary fields (fields of type blob, file, or encrypted text).

  • Understand your schema and have proper indexes created.
  • Apply as many filters as possible to reduce the result set.
  • Minimize the amount of records in the Recycle Bin.

Remember that NOT operations and LIKE conditions with a leading % wildcard do not use indexes, and complex joins might perform better as separate queries. 

Bulk Queries

Use bulk queries to fetch large amounts of data efficiently.

  • A bulk query can retrieve up to 15 GB of data, divided into fifteen 1 GB files.
  • Bulk API query supports both query and queryAll operations. The queryAll operation returns records that have been deleted because of a merge or delete. 

How Are Bulk Queries Processed?

  • When a bulk query is processed, Salesforce attempts to execute the query. If the query doesn’t execute within the standard two-minute timeout limit, the job fails and a QUERY_TIMEOUT error is returned. If this happens, rewrite a simpler query and resubmit the batch.
  • If the query succeeds, Salesforce attempts to retrieve the results. If the results exceed the 1 GB file size limit or take longer than 10 minutes to retrieve, the completed results are cached and another attempt is made. After 15 attempts, the job fails and the error message Retried more than fifteen times is returned. If this happens, consider using the PK Chunking header to split the query results into smaller chunks. If the attempts succeed, the results are returned and stored for seven days.

Skinny Tables

Let's say you followed coding best practices and worked with Salesforce Customer Support to place custom indexes wherever appropriate, but you’re still encountering performance problems. Users are complaining about their reports and dashboards timing out, and the SOQL called from your Visualforce page is performing slower and slower. If you desperately need to further improve performance, there’s a special, powerful solution: skinny tables.

  • A skinny table is a custom table in the Force.com platform that contains a subset of fields from a standard or custom base Salesforce object. Force.com can have multiple skinny tables if needed; the platform maintains them, keeps them in sync with their source tables, and keeps them completely transparent to you.
  • Skinny tables can be a convenient means of remedying performance issues. But they might not accommodate all use cases, or improve performance more than reading from the base Salesforce object with efficient indexes. They come with side effects you should understand because they might restrict or burden your business processes. Here are a few things to consider before implementing skinny tables:
    • Skinny tables are skinny. To ensure optimal performance, they contain only the minimum set of fields required to fulfill specific business use cases. If you later decide to add a field to your report or SOQL query, you must contact Salesforce Customer Support to recreate the table.
    • For Full sandboxes: Skinny tables are copied to your Full sandbox orgs. For other types of sandboxes: Skinny tables aren’t copied to your sandbox organizations. To have production skinny tables activated for sandbox types other than Full sandboxes, contact Salesforce Customer Support.
    • Skinny tables are custom tables in the underlying Force.com database. They don’t have the dynamic metadata flexibility you find in the base object. If you alter a field type (for example, change a number field to a text field) the skinny table becomes invalid, and you must contact Salesforce Customer Support to create a new skinny table.

Perform Data Deletes and Extracts

Deletion and extraction are key processes in LDV. We might use delete and extract operations to implement an archiving process, possibly relying on a third-party tool to take care of it. When you have very large data volumes, you must follow best practices to keep the business running smoothly. In this topic, we will look at the key aspects of data deletion and extraction.

Bulk API

Bulk API is very useful when it comes to LDV. You can get more detail on the Bulk API from my Integration API blog post.

Soft vs. Hard Deletion

  • Whenever you delete data in Salesforce, by default it goes to the Recycle Bin for 15 days. This deletion is called a soft delete. When data is soft deleted, it still affects database performance because it is still living in the org, and deleted records have to be excluded from any queries.
  • Bulk API supports a hard delete (physical delete) option, which allows records to bypass the Recycle Bin and immediately become available for deletion. Using Bulk API’s hard delete function is a recommended strategy for deleting large data volumes to free up space sooner and keep extraneous material from affecting performance. Note that the hard delete option is disabled by default and must be enabled by an administrator.
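As a rough sketch, this is what creating a Bulk API 2.0 ingest job with the hard delete operation might look like; the API version and object name are illustrative, and the running user needs the Bulk API Hard Delete permission enabled.

    POST /services/data/v58.0/jobs/ingest
    Content-Type: application/json

    {
      "object" : "Card__c",
      "operation" : "hardDelete",
      "contentType" : "CSV"
    }

The CSV data uploaded to the job then only needs an Id column listing the records to remove.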

Chunking Data

  • When extracting data with Bulk API, queries are split into 100,000 record chunks by default—you can use the chunkSize header field to configure smaller chunks or larger ones up to 250,000. Larger chunk sizes use up fewer Bulk API batches but may not perform as well. You might need to experiment a bit to determine the optimal chunk size.
  • At extremely high volumes—hundreds of millions of records—defining these chunks by filtering on field values may not be practical. The number of rows that are returned may be higher than the selectivity threshold of Salesforce’s query optimizer. The result could be a full table scan and slow performance, or even failure. Then you need to employ a different strategy.

Using PK Chunking

  • So if attribute filtering doesn’t help you break the data up into small enough chunks, what can you do? Use PK Chunking to handle extra-large data set extracts. PK stands for Primary Key—the object’s record ID—which is always indexed. PK chunking splits bulk queries on very large tables into chunks based on the record IDs of the queried records.
  • You can use PK chunking with most standard objects. It's supported for Account, Campaign, CampaignMember, Case, Contact, Lead, LoginHistory, Opportunity, Task, and User, as well as all custom objects. To enable the feature, specify the header Sforce-Enable-PKChunking on the job request for your Bulk API query (see the example after this list).
  • Salesforce recommends that you enable PK chunking when querying tables with more than 10 million records or when a bulk query constantly times out.
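The header itself is a single line on the Bulk API job request; the chunk size below is illustrative, and the optional parent parameter is only needed when chunking an object (such as a share table) by its parent object.

    Sforce-Enable-PKChunking: chunkSize=100000; parent=Account

Salesforce then splits the query into chunks by adding record ID range filters to each one, so every chunk stays within the optimizer's comfort zone.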

Truncation

  • If you want to delete records in a sandbox org’s custom objects immediately, you can try truncating those custom objects. Truncating custom objects is a fast way to permanently remove all the records from a custom object while keeping the object and its metadata intact for future use.
  • Truncating is useful, for example, if you have created a custom object and filled it with test records. When you’re done with the test data, you can truncate the object to purge the test records, but keep the object and put it into production. This is much faster than batch-deleting records and possibly recreating the object.

Lean Loading

Whether we're talking about an LDV migration or ongoing large data sync operations, a smart strategy is loading lean—including only the data and configuration you need to meet your business-critical operations. Lean loading means:
  • Identifying the business-critical operations before moving users to Salesforce.
  • Identifying the minimal data set and configuration required to implement those operations.
  • Defining a data and configuration strategy based on the requirements you’ve identified.
  • Loading the data as quickly as possible to reduce the scope of synchronization.

Anti-Patterns for LDV

Design patterns in the context of computer science are proven, highly effective approaches to solving common challenges in software development. So it makes sense that an anti-pattern is the opposite of a design pattern: a software implementation of some sort that is suboptimal. There are a few anti-patterns that cause a lot of problems:

  1. Formula Fields Unknowingly Slow Reports
    • Formula fields are a great tool when used in the proper context while understanding the scalability and response time implications for their usage. However, without a proper understanding of how they function, formula fields can sometimes come back to bite you.
    • Use a trigger to populate denormalized related fields that facilitate blazing report/query performance without runtime joins (see the sketch after this list).
  2. Full Database Backups are Slow
    • Use incremental data backups (only backing up the data that is new or updated) instead of full backups. Also, use queries that filter records using SystemModStamp (a standard, indexed field on all objects) rather than the LastModifiedDate field, which is not indexed (see the example after this list).
    • A full reload integration operation might use an unnecessarily high amount of Salesforce server resources, which could result in degradation in organization-wide processing performance, which could impact Bulk API jobs (or other data loading jobs), Dashboard and Report performance, Apex code (@future or batch Apex), and even general page performance.
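A minimal sketch of the denormalization approach from item 1, assuming a hypothetical custom field Account_Industry__c on Contact that mirrors the parent account's Industry so reports and queries can filter the contact directly instead of joining at runtime:

    // Hypothetical trigger: copy the parent Account's Industry onto each Contact.
    trigger ContactDenormalize on Contact (before insert, before update) {
        Set<Id> accountIds = new Set<Id>();
        for (Contact c : Trigger.new) {
            if (c.AccountId != null) {
                accountIds.add(c.AccountId);
            }
        }
        Map<Id, Account> parents = new Map<Id, Account>(
            [SELECT Id, Industry FROM Account WHERE Id IN :accountIds]
        );
        for (Contact c : Trigger.new) {
            if (c.AccountId != null && parents.containsKey(c.AccountId)) {
                c.Account_Industry__c = parents.get(c.AccountId).Industry;
            }
        }
    }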
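And a sketch of the incremental extract from item 2, filtering on the indexed SystemModStamp field; the lastRunTime value is a placeholder for wherever you persist the timestamp of the previous successful backup run.

    // Incremental extract: only fetch rows changed since the last backup run,
    // filtering on the indexed SystemModstamp field rather than LastModifiedDate.
    Datetime lastRunTime = Datetime.now().addDays(-1); // placeholder watermark
    List<Account> changedAccounts = [
        SELECT Id, Name, SystemModstamp
        FROM Account
        WHERE SystemModstamp > :lastRunTime
    ];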

LastModifiedDate and SystemModStamp?

Let’s recap what LastModifiedDate and SystemModStamp dates are. They are both system fields that store date and time values for each record.

  • LastModifiedDate is automatically updated whenever a user creates or updates the record. LastModifiedDate can be imported with any back-dated value if your business requires preserving original timestamps when migrating data into Salesforce.
  • SystemModStamp is strictly read-only. Not only is it updated when a user updates the record, but also when automated system processes update the record. Because of this behavior, it creates a difference in stored value where ‘LastModifiedDate <= SystemModStamp’ but never ‘LastModifiedDate > SystemModStamp’.

Approaches for Tuning Force.com Performance

  • Long-Term Approaches - keep your org lean:
    • Only store what's needed.
    • Know your expected data growth rate and design accordingly.
    • Define criteria for data archiving and purging.
  • Short-Term Approaches
    • Use skinny tables.

What is Reference Data?

Reference data, as we know, is data used to categorize other data within applications and databases. We usually refer to this data as look-up, code-table, or domain values. These code tables are usually characterized by code and value pairs. Think of it as the set of allowed values for a given field.