Loading temporary data into the registry database, and how to do pass-thru SQL

Sometimes you want to do something that involves a lot of data from the registry database, but you do not want to pull all that data into R — maybe you are only filtering out some subset of data with a join on a small set of keys that you have as a data frame in R, or you need to do a join of two large tables/views in kosmos, and wants to use the indexes on those tables to speed up the operation.

The example data

The example code uses the LMED data in CRIME3: the prescription registry for the year 2015 is about 25M rows, and the entire LMED (years 2005 to 2017) dataset is 800M rows, for a total of about 350GB of data. We will select a (very) small subset of ATC codes from the 2015 data, and then get data for those ATC codes from the entire dataset.

The example is somewhat contrived, since the whole operation could be run in the database, without fetching the small subset to R/Python — but this is to demonstrate how to write a temporary table .

The connection

The VDI environment, as well as the compute servers, have a ODBC data source named kosmos that we will use.

With R

Preparations

The following has been tested in R 4.2, both on vector and in RStudio on a Windows machine.

You need two packages: DBI and odbc. Both are in CRAN, and regularly updated.

Code

First connect to the database:

> dbhandle <- DBI::dbConnect(odbc::odbc(), dsn = "kosmos")

… then run the first query in the database:

> query <- DBI::dbSendQuery(dbhandle, 
+   'select distinct ATC, LAKMED from MGRCRIME3.MGRCRIME3.V_LMED2015 where TKOST > 500000')
> result <- DBI::dbFetch(query)
> DBI::dbClearResult(query)
> result

      ATC                                     LAKMED
1 J05AX14       Daklinza, filmdragerad tablett 60 mg
2 R07AX02      Kalydeco, filmdragerad tablett 150 mg
3   J05AX Harvoni, filmdragerad tablett 90 mg/400 mg
4 J05AX15       Sovaldi, filmdragerad tablett 400 mg

Upload the results to a temporary table in the registry database — the #-mark tells the database engine to make it temporary:

> DBI::dbWriteTable(dbhandle, "#expensive_2015", result)

… and use it in a second query:

> query_2 <- DBI::dbSendQuery(dbhandle, 
+   'select distinct a.LAKMED from MGRCRIME3.MGRCRIME3.V_LMED a join #expensive_2015 e on e.atc = a.atc')
> result_2 <- DBI::dbFetch(query_2)
> DBI::dbClearResult(query_2)
> DBI::dbDisconnect(dbhandle)
> result_2
                                              LAKMED
1               Daklinza, filmdragerad tablett 30 mg
2         Harvoni, filmdragerad tablett 90 mg/400 mg
3              Kalydeco, filmdragerad tablett 150 mg
4               Exviera, filmdragerad tablett 250 mg
5               Sovaldi, filmdragerad tablett 400 mg
6 Viekirax, filmdragerad tablett 12,5 mg/75 mg/50 mg
7               Daklinza, filmdragerad tablett 60 mg

An alternative without (visible) pass-thru SQL

There is more than one way to do this exact operation, and this specific example can also be acheived with dplyr/dbplyr. This works with local data too, you just have to be careful with the order of inputs, since dplyr will copy the second input to a temp table in the database, regardless of which is the local one, and which is smaller.

> library(dplyr)
> library(dbplyr)

> crime3 <- DBI::dbConnect(odbc::odbc(),
+                          dsn = "kosmos", database = "MGRCRIME3")
> lmed15 <- tbl(crime3, in_schema("MGRCRIME3", "V_LMED2015"))
> lmed <- tbl(crime3, in_schema("MGRCRIME3", "V_LMED"))
>
> result <- lmed %>% 
+   # either in the database
+   # inner_join(lmed15 %>% filter(tkost > 500000) %>% select(ATC)) %>%
+   # or with local data
+   inner_join(data.frame("ATC" = c("J05AX", "J05AX14", "J05AX15", "R07AX02")), 
+              copy = TRUE) %>% 
+   select("LAKMED") %>% distinct()
Joining, by = "ATC"
> result
# Source:   SQL [7 x 1]
# Database: Microsoft SQL Server 13.00.7016[meb\joelus@MEB-SQL02\KOSMOS/MGRCRIME3]
  LAKMED                                            
  <chr>                                             
1 Harvoni, filmdragerad tablett 90 mg/400 mg        
2 Daklinza, filmdragerad tablett 60 mg              
3 Sovaldi, filmdragerad tablett 400 mg              
4 Exviera, filmdragerad tablett 250 mg              
5 Viekirax, filmdragerad tablett 12,5 mg/75 mg/50 mg
6 Kalydeco, filmdragerad tablett 150 mg             
7 Daklinza, filmdragerad tablett 30 mg   

> # this must come *after* we use the result, since dplyr/dbplyr is being "lazy"
> DBI::dbDisconnect(dbhandle)

With Python

Preparation

You need the pyodbc package — and for convenience the following code also uses pandas

Code

Connecting to the database:

import pandas as pd
import pyodbc 
  
conn = pyodbc.connect('DSN=kosmos;'
                      'Database=MGRCRIME3;')

… and run the first query in the database:

df = pd.read_sql_query('select distinct ATC, LAKMED from MGRCRIME3.V_LMED2015 where TKOST > 500000', conn)

print(df)

       ATC                                      LAKMED
0  J05AX14        Daklinza, filmdragerad tablett 60 mg
1    J05AX  Harvoni, filmdragerad tablett 90 mg/400 mg
2  R07AX02       Kalydeco, filmdragerad tablett 150 mg
3  J05AX15        Sovaldi, filmdragerad tablett 400 mg

Create a temporary table:

cursor = conn.cursor()

cursor.execute('''CREATE TABLE #expensive_2015 (atc varchar(15),lakmed varchar(100))''')

conn.commit()

… load the dataframe into that:

for index, row in df.iterrows():
     cursor.execute("INSERT INTO #expensive_2015 (atc,lakmed) values(?,?)", row.ATC, row.LAKMED)
conn.commit()

And finally, run the join query:

df2=pd.read_sql_query('select distinct a.LAKMED from MGRCRIME3.V_LMED a join #expensive_2015 e on e.atc = a.atc', conn)

print(df2)

                                              LAKMED
0         Harvoni, filmdragerad tablett 90 mg/400 mg
1               Daklinza, filmdragerad tablett 30 mg
2               Sovaldi, filmdragerad tablett 400 mg
3               Exviera, filmdragerad tablett 250 mg
4              Kalydeco, filmdragerad tablett 150 mg
5               Daklinza, filmdragerad tablett 60 mg
6  Viekirax, filmdragerad tablett 12,5 mg/75 mg/5...