Forgot password? | Forgot username? | Register

Duplicate spotter Bash script

Duplicate spotter Bash script

Hi

I've found this very useful for cleaning up the cruft that has accumulated in our database over many years, migrations and imports. I'm sharing it in the hope that others may find it useful too.

This comes from a support request where Helena in the Melbourne office gave me the TexQL command for a particular search and suggested it could also be run as a shell script. I've just generalised the search to save typing out the column names repeatedly.

The actual search used is:

Code:


select count(record), $query
from
(
        nest
        (
                select $query from $TABLE
        )
        on $query
        forming record
)
where count(record) > $MAXNOTRETURNED;

Everything else is error-checking and building the variables

It allows you to, say, look in the Parties module and return all the duplicates occurrences of SummaryData, NamFirst and NamLast.

It doesn't tell you what records are duplicates, it shows you what values are duplicated and you can then do a search using those values to get them up and have a check-see if you've got good candidates for merging.

The script is used like this (in a shell on the EMu server):
duplicates.sh -t [the name of the module] -n [the minimum number of duplications that should be reported(optional)] [field1] [field2] [...] [fieldn]

To illustrate, this is an excerpt checking to see the occurrences where the Name of an organisation, its address and the summary data occur more than three times. The results are returned with the number of duplications first and then the column values in the order they were entered.

Code:


$duplicates.sh -t eparties -n 3 NamOrganisation AddPhysStreet SummaryData
columns searched: NamOrganisation AddPhysStreet SummaryData (3 columns)

(4,'CSIRO',NULL,'CSIRO')
(4,'Elder Expedition',NULL,'Elder Expedition')
(4,'Fisheries Officer',NULL,'Fisheries Officer')

I added the -n option to prevent getting disheartened at the amount of duplicates. It allows you to work on the worst offenders first by putting in a high number and then when you've sorted them out put in a lower number and so on.

I use the script in combination with a very full list view and lots of clicking through tabs in records.

I can't overemphasise that it is up to you to confirm that records which appear to be duplicates really are!

I assume no responsibility and make no guarantees, use at your own risk, read through the script and verify it does what I've said it does etc. but I do hope it's of use to some of you. It's attached as duplicates.sh, you'll need to remove the first line -added so the forum uploader wouldn't reject it- and change the permissions as appropriate

Regards
Keith

Attachments:
Keith Maguire
South Australian Museum
useravatar
Offline
11 Posts
Website 
Administrator has disabled public posting. Please login or register in order to proceed.
There are 0 guests and 0 other users also viewing this topic