Update: With Spring ’18, there is a new feature that will allow for taking advantage of mass checking records that meet duplicate criteria and for duplicate record set management. For more information, check out the release notes! https://releasenotes.docs.salesforce.com/en-us/spring18/release-notes/rn_sales_data_quality_duplicate_jobs.htm
So you’re excited to take advantage of the new duplicate management features, either because it is very tightly integrated into lightning (duplicate checks an records meet criteria, duplicate checks on load of records and duplicate checks on lead convert) or you’re looking to prevent bad data entry. Congratulations, you’ve taken the first step towards a cleaner database!
Imagine: you’ve worked on your duplicate rules and matching rules and have them just how you want them in your developer environment. You deploy them to UAT per your usual change process. You activate them and you receive an email saying that the duplicate rule failed to activate because there are too many records that meet the criteria. You’ve checked your criteria over and over again. It’s right. You merge a few records that seem to have a high level of matching. Try activating again and it fails. You repeat this process multiple times and are dismayed and wonder how to go about cleaning in a more systematic way.
Good news! You will get to learn from my experience and hopefully it will save you from encountering it.
There are many reasons why the duplicate rules may be failing to activate:
- Selectivity – Check this first. Make sure that you’re filtering your duplicate rules off of fields that are most commonly populated, that you’re matching off of more than just account name, etc. If that’s okay, then it may be one of the below.
- Bad list loads – Did someone fill in the system with dirty lead data? Did they forget to scrub emails for common fake emails that people put in forms?
- Bad integrations – Do you have an integration that does not have proper field mapping?
- Web-to-Lead forms – These leads come straight in and if someone spams them, you could be sitting on many leads that all promote outlet stores, male enhancements, you get the idea.
- Bad Process – Admins never do wrong right? Perhaps we do… Perhaps we required information to be entered before people have it available and that leads to dummy data being entered.
The good thing is, once you see these issues, you can start to correct them, put appropriate processes in place to prevent them and work on cleaning your data once they have been addressed (or concurrently, though this may be harder).
There are many tools that can help you through this process, such as DemandTools to speed up merging. However, I’m going to focus on the tool that I used to solve this issue. And the best part is that it is completely free. The tool is SQLite, which is a lite-weight database that you can use to load data.
Now, I bet you’re saying, well why not just use aggregate queries? The problem is that there are governor limits that prevent querying large amounts of data in aggregate. You’ll likely get an error that says something along the lines of “Limit batch size. Aggregate query does not support queryMore().” That means that you have a few options. If you have wave, you can easily use that tool to pull the queries, but the data sets only refresh so often, making it not real time, but a tool to really prevent the problem from occurring in the future, or diagnosing the root cause.
For SQLite, you can store up 140TB of data in the database. What that means is that you’ll likely only be limited by the file size of your machine that is doing the processing. If you don’t have a very large hard drive, you could always use AWS and spin up a machine.
For installation, if you’re a Mac user, you’re in luck, you don’t have to do anything! Just open up terminal and type the command “sqlite3” and you’re set. If you’re on windows, you’ll need to go to the SQLite download page to continue.
You’ll want to spin up a fresh dev org for these exercises (because the wave DEs start with more data and duplicates!) and I’ll go through simple dashboards that you can create. I’ll be using that standard data to walk you through the techniques I used to start identifying issues and cleaning up.
Before attempting any of this for your system, make sure that you have a backup either from your weekly export or through a third-party provider.
You should become familiar with data loader settings.
- Bulk API – You will use this for pulling out hundreds of thousands / millions of records at a go.
- A few things worth noting, you only get 10,000 bulk API batches processed during a day. Make sure you take note of where you typically sit for this number and how large your data load batches will be. The worst thing you could have happen is lock the bulk API for updates if this number runs out. (This didn’t happen to me, but I could see someone who didn’t know better having this happen, so it’s important to be aware of!)
- You cannot insert nulls with the bulk API. If you need to clear fields, you will need to use the SOAP API.
- SOAP API – You will use this for inserting nulls or updating a smaller set of records (less than 100 thousand) in a go. A simple word of caution, again, make sure you review all spreadsheets and updates before inserting nulls, deleting & updating.
Always, always have a backup of data before proceeding!
Now that you’ve got dataloader all set up the way you want, pull all the data down. We’ll be doing these exercises with Contact and Account data. Make the following queries in dataloader.
SELECT Id, Name, Type, ParentId, BillingStreet, BillingCity, BillingState, BillingPostalCode, BillingCountry, Phone, Website, OwnerId, Owner.Name, Owner.IsActive, CreatedDate, CreatedById, CreatedBy.Name, LastModifiedDate, LastModifiedById, LastModifiedBy.Name FROM Account
The contact extract will come later.
Now, we’re going to insert some Accounts a few more times, just to artificially create duplicates. Select a few accounts to reinsert for this exercise. If you want your results to match exactly to mine, I chose to reinsert the below:
|United Oil & Gas, UK|
|United Oil & Gas, Singapore|
|Burlington Textiles Corp of America|
|Pyramid Construction Inc.|
- Run your export again for accounts.
- Go to your terminal and type “sqlite3”.
- Change the mode to read CSVs by typing “.mode csv”.
- Insert the accounts and contacts:
.import /Users/yourFileLocation/yourAccountFileName.csv Account
- Run a query to determine how many accounts have the exact same name. Note that the name is changed to LOWER() to somewhat normalize. While this is not a perfect method for finding duplicates, if the threshold is over 100, your duplicate rules will fail. I’ll go over creating a more accurate count by adding multiple fields to your criteria to create a more accurate composite key:
SELECT COUNT(LOWER(Name)), LOWER(Name) FROM Account GROUP BY LOWER(Name) HAVING COUNT(LOWER(Name)) > 1;
You should get the following results:
2,"burlington textiles corp of america" 2,"edge communications" 2,genepoint 2,"pyramid construction inc." 2,"united oil & gas, singapore" 2,"united oil & gas, uk"
Pretty cool right? If you have a bunch of accounts that have the same name, you can always adjust the HAVING number up to limit the results based on the amount returned.
Now, we want to take a closer look at the accounts that are duplicates in a flat file. Run the following:
.output /Users/yourFileLocation/yourFileName.csv .headers on SELECT * FROM Account WHERE LOWER(Name) IN (SELECT LOWER(Name) FROM Account GROUP BY LOWER(Name) HAVING COUNT(LOWER(Name)) > 1) ORDER BY LOWER(Name);
The output command says where to output the file. The headers on tag turns on the headers for the file (you’ll definitely want these!) If you messed up and didn’t turn the headers on and rerun the query, it will just append the results to the existing file. If this is the case, make sure to delete the file, turn headers on, rerun the output command and the query. The ORDER BY LOWER(Name) tag is important as this will group like accounts together in your extract.
We will be switching between .output /FileLocation.csv format and .output stdout. stdout will output the results back to the console. This ensures that you aren’t appending results to your output files.
To see the total count of potential duplicates where the count is greater than 2. You would use the following:
.output stdout SELECT COUNT(LOWER(Name)) FROM Account WHERE LOWER(Name) IN (SELECT LOWER(Name) FROM Account GROUP BY LOWER(Name) HAVING COUNT(LOWER(Name)) > 1);
What is being done here is a subselect in the where clause where we are returning all results that meet the criteria of being a potential duplicate and then counting those rows. This is useful to get the total if you have a large number of results that you expect.
The concept of a composite key is to take two fields and combine them. This is useful for determining the odds of that composite key in that matching rule being a failing key. Again, this will only return exact matches. If you will not allow for blanks to be treated as duplicates, you will need to ensure in your where clause that you exclude null values in your search.
First a simple composite key for searching on Account Type and Account Name:
SELECT COUNT(LOWER(Name || Type)), LOWER(Name || Type) FROM Account GROUP BY LOWER(Name || Type) HAVING COUNT(LOWER(Name || Type)) > 1;
If you used the same accounts, you should get the below results:
2,"burlington textiles corp of americacustomer - direct" 2,"edge communicationscustomer - direct" 2,"genepointcustomer - channel" 2,"pyramid construction inc.customer - channel" 2,"united oil & gas, singaporecustomer - direct" 2,"united oil & gas, ukcustomer - direct"
You follow the same process in the previous section to use a where clause to limit just to these outputs.
In SQLite using the double bar (||) is a concatenation operator. The rest of the syntax is similar to what we used before. Easy right!?
What if you wanted to exclude records that have null values? Let’s give that a go! First, go to the burlington textiles corp account and remove “customer – direct” from the type field for the one that we inserted with no contacts. Redownload the account information or run a SQLite update statement on that account record with the following syntax:
UPDATE Account SET Type = null WHERE Id = '[your record id]';
Important! Do not forget the where clause. If you did, then you just updated the entire table! Bad news bears (and you’ll have to redownload the data). There are better database management techniques that could be explained here but since the data is readily available and this is being used only for this express purpose, I believe that this would detract from the post.
Now run the below:
SELECT COUNT(LOWER(Name || Type)), LOWER(Name || Type) FROM Account WHERE Type != '' GROUP BY LOWER(Name || Type) HAVING COUNT(LOWER(Name || Type)) > 1;
Notice the only difference? The where clause. Burlington Textiles should now drop from the list.
2,"edge communicationscustomer - direct" 2,"genepointcustomer - channel" 2,"pyramid construction inc.customer - channel" 2,"united oil & gas, singaporecustomer - direct" 2,"united oil & gas, ukcustomer - direct"
Now, doing this on millions of rows may take some time and is limited by your processing power but I found it to be quite quick all the same.
Okay, so where do the Contacts come into play? Well, say you have multiple Accounts with similar contact names and you want to see what values there are that have the same unique email address, last name + company name (fuzzy), etc. Well, for the latter, you need to look just beyond the contact object and have to pull in account data.
For the contact object, remember how we did not insert the same contact information? Now is the time to do that. The easiest and fastest way would be to simply clone existing contacts to the duplicate accounts. Here are the records I cloned:
Pat Stumuller at Pyramid Construction
Jack Rogers at Burlington Textiles
Remember how I mentioned the contact extract would come later? Here it is! Run the following with dataloader and import your contact file.
SELECT Id, AccountId, LastName, FirstName, Name, Phone, Email, Title, Department, LeadSource, OwnerId, Owner.Name, CreatedDate, CreatedById, CreatedBy.Name, LastModifiedDate, LastModifiedById,LastModifiedBy.Name FROM Contact
To import to SQLite:
.import /Users/yourFileLocation/yourContactFileName.csv Contact
Now for the good part! Let’s see how many potential duplicates we have based on last name and company name (in this case exact).
SELECT COUNT(LOWER(a.Name || c.LastName)), LOWER(a.Name || c.LastName) FROM Contact c INNER JOIN Account a ON a.Id = c.AccountId GROUP BY LOWER(a.Name || c.LastName) HAVING COUNT(LOWER(a.Name || c.LastName)) > 1;
Building on our previous queries, it’s pretty much the same. A few things to call out. This takes advantage of what is called aliasing. The ability to name a table something else for the purposes of the query. In this case, “a” = “Account” and “c” = “Contact”. Less typing = happy!
The other thing to mention is that this is taking advantage of what is called an inner join. This says I only want to see instances where there are both a Contact and a matching Account. There are other kinds of joins but you can read more about those here.
Your output should be like the below:
2,"burlington textiles corp of americarogers" 2,"pyramid construction inc.stumuller"
That’s exactly what we want! Now let’s export the contact data and account data. There are two ways to do this, depending on what you want to see.
If you want all contact and account columns (I’d only advise this if your tables are very skinny):
SELECT c.*, a.* FROM Contact c INNER JOIN Account a ON a.Id = c.AccountId WHERE c.Id IN (SELECT c.Id FROM Contact c INNER JOIN Account a ON a.Id = c.AccountId GROUP BY LOWER(a.Name || c.LastName) HAVING COUNT(LOWER(a.Name || c.LastName)) > 1);
If you want to select specific attributes:
SELECT c.Id, c.FirstName, c.LastName, c.Email, a.Name, a.Id,a.OwnerId, a.'Owner.Name' FROM Contact c INNER JOIN Account a ON a.Id = c.AccountId WHERE c.Id IN (SELECT c.Id FROM Contact c INNER JOIN Account a ON a.Id = c.AccountId GROUP BY LOWER(a.Name || c.LastName) HAVING COUNT(LOWER(a.Name || c.LastName)) > 1);
Last bit to point out in the above. Notice that related fields in our query are enclosed in quotes. This will need to be done if you don’t extract all users and join with the user table.
For the latter, your output should be:
ID,FIRSTNAME,LASTNAME,EMAIL,NAME,ID,OWNERID,OWNER.NAME 0031N00001JzJwAQAV,Jack,Rogers,email@example.com,"Burlington Textiles Corp of America", 0011N00001BfQT9QAN,0051N000005KaIeQAK,"Jeremiah Dohn" 0031N00001JzJwBQAV,Pat,Stumuller,firstname.lastname@example.org,"Pyramid Construction Inc.", 0011N00001Br7y2QAB,0051N000005KaIeQAK,"Jeremiah Dohn"
Voila! You’ve taken the first few steps on the journey to identifying duplicates and getting your system clean!