Sometimes you need to work with a lot of data from the registry database without pulling it all into R: perhaps you are filtering out a subset with a join on a small set of keys that you hold as a data frame in R, or you need to join two large tables/views in kosmos and want the indexes on those tables to speed up the operation.
The example data
The example code uses the LMED data in CRIME3: the prescription registry for the year 2015 is about 25M rows, and the entire LMED dataset (years 2005 to 2017) is 800M rows, for a total of about 350GB of data. We will select a (very) small subset of ATC codes from the 2015 data, and then get data for those ATC codes from the entire dataset.
The example is somewhat contrived, since the whole operation could be run in the database without fetching the small subset to R/Python, but it demonstrates how to write a temporary table.
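The round-trip pattern itself (small query, upload the keys, join in the database) can be sketched self-contained with Python's standard-library sqlite3 module. This is an illustration only: the table, column names, and values below are made up, and SQLite spells temporary tables `CREATE TEMP TABLE` where SQL Server uses a `#` prefix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# stand-in for the big registry table (hypothetical toy data)
conn.execute("CREATE TABLE lmed (atc TEXT, lakmed TEXT, tkost REAL)")
conn.executemany("INSERT INTO lmed VALUES (?, ?, ?)", [
    ("J05AX14", "Daklinza 60 mg", 600000),
    ("J05AX14", "Daklinza 30 mg", 100),
    ("N02BE01", "Alvedon 500 mg", 10),
])

# step 1: fetch a small set of keys from an expensive filter
keys = conn.execute(
    "SELECT DISTINCT atc FROM lmed WHERE tkost > 500000").fetchall()

# step 2: upload the keys to a temporary table ...
conn.execute("CREATE TEMP TABLE expensive (atc TEXT)")
conn.executemany("INSERT INTO expensive VALUES (?)", keys)

# step 3: ... and do the join inside the database, not in Python
rows = conn.execute(
    "SELECT DISTINCT l.lakmed FROM lmed l "
    "JOIN expensive e ON e.atc = l.atc").fetchall()
# rows now holds both Daklinza strengths, and nothing else
```

The same three steps are what the R and Python sections below do against kosmos.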
The connection
The VDI environment, as well as the compute servers, has an ODBC data source named kosmos that we will use.
With R
Preparations
The following has been tested in R 4.2, both on vector and in RStudio on a Windows machine.
You need two packages: DBI and odbc. Both are on CRAN and regularly updated.
Code
First connect to the database:
> dbhandle <- DBI::dbConnect(odbc::odbc(), dsn = "kosmos")
… then run the first query in the database:
> query <- DBI::dbSendQuery(dbhandle,
+ 'select distinct ATC, LAKMED from MGRCRIME3.MGRCRIME3.V_LMED2015 where TKOST > 500000')
> result <- DBI::dbFetch(query)
> DBI::dbClearResult(query)
> result
ATC LAKMED
1 J05AX14 Daklinza, filmdragerad tablett 60 mg
2 R07AX02 Kalydeco, filmdragerad tablett 150 mg
3 J05AX Harvoni, filmdragerad tablett 90 mg/400 mg
4 J05AX15 Sovaldi, filmdragerad tablett 400 mg
Upload the results to a temporary table in the registry database; the # prefix tells the database engine to make it temporary:
> DBI::dbWriteTable(dbhandle, "#expensive_2015", result)
… and use it in a second query:
> query_2 <- DBI::dbSendQuery(dbhandle,
+ 'select distinct a.LAKMED from MGRCRIME3.MGRCRIME3.V_LMED a join #expensive_2015 e on e.atc = a.atc')
> result_2 <- DBI::dbFetch(query_2)
> DBI::dbClearResult(query_2)
> DBI::dbDisconnect(dbhandle)
> result_2
LAKMED
1 Daklinza, filmdragerad tablett 30 mg
2 Harvoni, filmdragerad tablett 90 mg/400 mg
3 Kalydeco, filmdragerad tablett 150 mg
4 Exviera, filmdragerad tablett 250 mg
5 Sovaldi, filmdragerad tablett 400 mg
6 Viekirax, filmdragerad tablett 12,5 mg/75 mg/50 mg
7 Daklinza, filmdragerad tablett 60 mg
An alternative without (visible) pass-thru SQL
There is more than one way to do this exact operation, and this specific example can also be achieved with dplyr/dbplyr. This works with local data too, but you have to be careful with the order of inputs: dplyr will copy the second input to a temporary table in the database, regardless of which input is the local one, and which is smaller.
> library(dplyr)
> library(dbplyr)
> crime3 <- DBI::dbConnect(odbc::odbc(),
+ dsn = "kosmos", database = "MGRCRIME3")
> lmed15 <- tbl(crime3, in_schema("MGRCRIME3", "V_LMED2015"))
> lmed <- tbl(crime3, in_schema("MGRCRIME3", "V_LMED"))
>
> result <- lmed %>%
+ # either in the database
+ # inner_join(lmed15 %>% filter(tkost > 500000) %>% select(ATC)) %>%
+ # or with local data
+ inner_join(data.frame("ATC" = c("J05AX", "J05AX14", "J05AX15", "R07AX02")),
+ copy = TRUE) %>%
+ select("LAKMED") %>% distinct()
Joining, by = "ATC"
> result
# Source: SQL [7 x 1]
# Database: Microsoft SQL Server 13.00.7016[meb\joelus@MEB-SQL02\KOSMOS/MGRCRIME3]
LAKMED
<chr>
1 Harvoni, filmdragerad tablett 90 mg/400 mg
2 Daklinza, filmdragerad tablett 60 mg
3 Sovaldi, filmdragerad tablett 400 mg
4 Exviera, filmdragerad tablett 250 mg
5 Viekirax, filmdragerad tablett 12,5 mg/75 mg/50 mg
6 Kalydeco, filmdragerad tablett 150 mg
7 Daklinza, filmdragerad tablett 30 mg
> # this must come *after* we use the result, since dplyr/dbplyr is being "lazy"
> DBI::dbDisconnect(crime3)
With Python
Preparations
You need the pyodbc
package — and for convenience the following code also uses pandas
…
Code
Connecting to the database:
import pandas as pd
import pyodbc
conn = pyodbc.connect('DSN=kosmos;'
'Database=MGRCRIME3;')
… and run the first query in the database:
df = pd.read_sql_query('select distinct ATC, LAKMED from MGRCRIME3.V_LMED2015 where TKOST > 500000', conn)
print(df)
ATC LAKMED
0 J05AX14 Daklinza, filmdragerad tablett 60 mg
1 J05AX Harvoni, filmdragerad tablett 90 mg/400 mg
2 R07AX02 Kalydeco, filmdragerad tablett 150 mg
3 J05AX15 Sovaldi, filmdragerad tablett 400 mg
Create a temporary table:
cursor = conn.cursor()
cursor.execute('''CREATE TABLE #expensive_2015 (atc varchar(15), lakmed varchar(100))''')
conn.commit()
… load the dataframe into that:
for index, row in df.iterrows():
    cursor.execute("INSERT INTO #expensive_2015 (atc, lakmed) values (?, ?)",
                   row.ATC, row.LAKMED)
conn.commit()
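The row-by-row loop works, but for larger frames the DB-API `executemany` call batches the inserts into far fewer round trips (pyodbc additionally offers `cursor.fast_executemany = True` to speed this up further). A sketch of the same idea, illustrated here with the standard-library sqlite3 module since it shares the `executemany` interface; the table and data are stand-ins:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"ATC": ["J05AX14", "R07AX02"],
                   "LAKMED": ["Daklinza 60 mg", "Kalydeco 150 mg"]})

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TEMP TABLE expensive_2015 (atc TEXT, lakmed TEXT)")

# itertuples is much cheaper than iterrows, and executemany
# sends the whole batch instead of one execute() per row
cursor.executemany("INSERT INTO expensive_2015 (atc, lakmed) VALUES (?, ?)",
                   df.itertuples(index=False, name=None))
conn.commit()

count = cursor.execute("SELECT count(*) FROM expensive_2015").fetchone()[0]
```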
And finally, run the join query:
df2 = pd.read_sql_query('select distinct a.LAKMED from MGRCRIME3.V_LMED a join #expensive_2015 e on e.atc = a.atc', conn)
print(df2)
LAKMED
0 Harvoni, filmdragerad tablett 90 mg/400 mg
1 Daklinza, filmdragerad tablett 30 mg
2 Sovaldi, filmdragerad tablett 400 mg
3 Exviera, filmdragerad tablett 250 mg
4 Kalydeco, filmdragerad tablett 150 mg
5 Daklinza, filmdragerad tablett 60 mg
6 Viekirax, filmdragerad tablett 12,5 mg/75 mg/5...
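A closing note: a local (#-prefixed) temporary table lives only as long as the session that created it, so closing the connection with `conn.close()` drops #expensive_2015 automatically; there is nothing to clean up. The session scoping can be sketched with the standard-library sqlite3 module (which spells it `CREATE TEMP TABLE`; the file path and table below are made up for illustration):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TEMP TABLE scratch (x INTEGER)")
conn.execute("INSERT INTO scratch VALUES (1)")
# visible inside the session that created it
n = conn.execute("SELECT count(*) FROM scratch").fetchone()[0]
conn.close()

# a new session to the same database no longer sees it
conn2 = sqlite3.connect(path)
try:
    conn2.execute("SELECT count(*) FROM scratch")
    survived = True
except sqlite3.OperationalError:  # "no such table: scratch"
    survived = False
conn2.close()
```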