
Loading temporary data into the registry database, and how to do pass-thru SQL

Sometimes you want to do something that involves a lot of data from the registry database, but you do not want to pull all of that data into R. Maybe you are only filtering out a subset of the data with a join on a small set of keys that you have as a data frame in R, or you need to join two large tables/views in kosmos and want to use the indexes on those tables to speed up the operation.

The example data

The example code uses the LMED data in CRIME3: the prescription registry for the year 2015 is about 25M rows, and the entire LMED (years 2005 to 2017) dataset is 800M rows, for a total of about 350GB of data. We will select a (very) small subset of ATC codes from the 2015 data, and then get data for those ATC codes from the entire dataset.

The example is somewhat contrived, since the whole operation could be run in the database without fetching the small subset to R/Python, but this is to demonstrate how to write to a temporary table.
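The overall pattern (fetch a small key set, upload it to a session-local temporary table, join in the database) can be sketched end-to-end. The snippet below uses an in-memory SQLite database as a self-contained stand-in for kosmos; note that SQLite spells temporary tables `CREATE TEMP TABLE` instead of the `#` prefix used by SQL Server, and the table and column names here are invented for the illustration.

```python
import sqlite3

# In-memory stand-in for the registry database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A toy "large" table playing the role of V_LMED.
cur.execute("CREATE TABLE v_lmed (atc TEXT, lakmed TEXT, tkost REAL)")
cur.executemany("INSERT INTO v_lmed VALUES (?, ?, ?)", [
    ("J05AX14", "Daklinza 60 mg", 600000),
    ("J05AX14", "Daklinza 30 mg", 1000),
    ("N02BE01", "Alvedon 500 mg", 50),
])

# Step 1: pull only the small key set into the client.
keys = cur.execute(
    "SELECT DISTINCT atc FROM v_lmed WHERE tkost > 500000").fetchall()

# Step 2: upload the keys to a session-local temporary table.
cur.execute("CREATE TEMP TABLE expensive (atc TEXT)")
cur.executemany("INSERT INTO expensive VALUES (?)", keys)

# Step 3: join in the database, fetching only the final result.
rows = cur.execute(
    "SELECT DISTINCT a.lakmed FROM v_lmed a "
    "JOIN expensive e ON e.atc = a.atc").fetchall()
print(rows)  # both Daklinza strengths, nothing else
conn.close()
```

The payoff is that only the key set and the final result cross the wire; the large table never leaves the database.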

The connection

The VDI environment, as well as the compute servers, have an ODBC data source named kosmos that we will use.

With R

Preparations

The following has been tested in R 4.2, both on vector and in RStudio on a Windows machine.

You need two packages: DBI and odbc. Both are on CRAN and regularly updated.

Code

First connect to the database:

> dbhandle <- DBI::dbConnect(odbc::odbc(), dsn = "kosmos")

… then run the first query in the database:

> query <- DBI::dbSendQuery(dbhandle, 
+   'select distinct ATC, LAKMED from MGRCRIME3.MGRCRIME3.V_LMED2015 where TKOST > 500000')
> result <- DBI::dbFetch(query)
> DBI::dbClearResult(query)
> result

      ATC                                     LAKMED
1 J05AX14       Daklinza, filmdragerad tablett 60 mg
2 R07AX02      Kalydeco, filmdragerad tablett 150 mg
3   J05AX Harvoni, filmdragerad tablett 90 mg/400 mg
4 J05AX15       Sovaldi, filmdragerad tablett 400 mg

Upload the results to a temporary table in the registry database — the #-mark tells the database engine to make it temporary:

> DBI::dbWriteTable(dbhandle, "#expensive_2015", result)

… and use it in a second query:

> query_2 <- DBI::dbSendQuery(dbhandle, 
+   'select distinct a.LAKMED from MGRCRIME3.MGRCRIME3.V_LMED a join #expensive_2015 e on e.atc = a.atc')
> result_2 <- DBI::dbFetch(query_2)
> DBI::dbClearResult(query_2)
> DBI::dbDisconnect(dbhandle)
> result_2
                                              LAKMED
1               Daklinza, filmdragerad tablett 30 mg
2         Harvoni, filmdragerad tablett 90 mg/400 mg
3              Kalydeco, filmdragerad tablett 150 mg
4               Exviera, filmdragerad tablett 250 mg
5               Sovaldi, filmdragerad tablett 400 mg
6 Viekirax, filmdragerad tablett 12,5 mg/75 mg/50 mg
7               Daklinza, filmdragerad tablett 60 mg

An alternative without (visible) pass-thru SQL

There is more than one way to do this exact operation; this specific example can also be achieved with dplyr/dbplyr. This works with local data too, but you have to be careful with the order of the inputs: dplyr will copy the second input to a temporary table in the database, regardless of which one is local and which is smaller.

> library(dplyr)
> library(dbplyr)

> crime3 <- DBI::dbConnect(odbc::odbc(),
+                          dsn = "kosmos", database = "MGRCRIME3")
> lmed15 <- tbl(crime3, in_schema("MGRCRIME3", "V_LMED2015"))
> lmed <- tbl(crime3, in_schema("MGRCRIME3", "V_LMED"))
>
> result <- lmed %>% 
+   # either in the database
+   # inner_join(lmed15 %>% filter(tkost > 500000) %>% select(ATC)) %>%
+   # or with local data
+   inner_join(data.frame("ATC" = c("J05AX", "J05AX14", "J05AX15", "R07AX02")), 
+              copy = TRUE) %>% 
+   select("LAKMED") %>% distinct()
Joining, by = "ATC"
> result
# Source:   SQL [7 x 1]
# Database: Microsoft SQL Server 13.00.7016[meb\joelus@MEB-SQL02\KOSMOS/MGRCRIME3]
  LAKMED                                            
  <chr>                                             
1 Harvoni, filmdragerad tablett 90 mg/400 mg        
2 Daklinza, filmdragerad tablett 60 mg              
3 Sovaldi, filmdragerad tablett 400 mg              
4 Exviera, filmdragerad tablett 250 mg              
5 Viekirax, filmdragerad tablett 12,5 mg/75 mg/50 mg
6 Kalydeco, filmdragerad tablett 150 mg             
7 Daklinza, filmdragerad tablett 30 mg   

> # this must come *after* we use the result, since dplyr/dbplyr is being "lazy"
> DBI::dbDisconnect(crime3)

With Python

Preparation

You need the pyodbc package, and for convenience the following code also uses pandas.

Code

Connecting to the database:

import pandas as pd
import pyodbc 
  
conn = pyodbc.connect('DSN=kosmos;'
                      'Database=MGRCRIME3;')

… then run the first query in the database:

df = pd.read_sql_query('select distinct ATC, LAKMED from MGRCRIME3.V_LMED2015 where TKOST > 500000', conn)

print(df)

       ATC                                      LAKMED
0  J05AX14        Daklinza, filmdragerad tablett 60 mg
1    J05AX  Harvoni, filmdragerad tablett 90 mg/400 mg
2  R07AX02       Kalydeco, filmdragerad tablett 150 mg
3  J05AX15        Sovaldi, filmdragerad tablett 400 mg

Create a temporary table:

cursor = conn.cursor()

cursor.execute('''CREATE TABLE #expensive_2015 (atc varchar(15), lakmed varchar(100))''')

conn.commit()

… load the dataframe into that:

for index, row in df.iterrows():
    cursor.execute("INSERT INTO #expensive_2015 (atc, lakmed) values (?, ?)", row.ATC, row.LAKMED)
conn.commit()
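Row-by-row `INSERT` loops get slow for anything beyond a few hundred rows. pyodbc's `cursor.executemany` sends the whole batch in one call, and against SQL Server you can additionally set `cursor.fast_executemany = True` before the call for a further speedup. A minimal sketch of the batched form, demonstrated against an in-memory SQLite database so it runs stand-alone (the table name mirrors the example above, minus the `#` prefix, which SQLite does not use):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"ATC": ["J05AX", "J05AX14"],
                   "LAKMED": ["Harvoni", "Daklinza"]})

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TEMP TABLE expensive_2015 (atc TEXT, lakmed TEXT)")

# One batched call instead of one round-trip per row.
# With pyodbc against SQL Server you would set
# cursor.fast_executemany = True before this call.
cursor.executemany("INSERT INTO expensive_2015 (atc, lakmed) VALUES (?, ?)",
                   df.itertuples(index=False, name=None))
conn.commit()

count = cursor.execute("SELECT COUNT(*) FROM expensive_2015").fetchone()[0]
print(count)  # 2
```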

And finally, run the join query:

df2 = pd.read_sql_query('select distinct a.LAKMED from MGRCRIME3.V_LMED a join #expensive_2015 e on e.atc = a.atc', conn)

print(df2)

                                              LAKMED
0         Harvoni, filmdragerad tablett 90 mg/400 mg
1               Daklinza, filmdragerad tablett 30 mg
2               Sovaldi, filmdragerad tablett 400 mg
3               Exviera, filmdragerad tablett 250 mg
4              Kalydeco, filmdragerad tablett 150 mg
5               Daklinza, filmdragerad tablett 60 mg
6  Viekirax, filmdragerad tablett 12,5 mg/75 mg/5...