MIMIC-iv官方SQL查询标注(简单基础篇)

2024-01-07 13:38

本文主要是介绍MIMIC-iv官方SQL查询标注(简单基础篇),希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

说明:
MIMIC的SQL查询确实够复杂,阻碍了很多想利用数据库做研究的人,也让人对已经做出来的研究的确实性产生怀疑 。
今天是几个官方SQL简单基础的查询,循序渐进, 相信能够更好地掌握。

一、单表查询

  1. 基础命令: SELECT 需要你知道列的名字,取一列或者列的一部分数据形成新的列;FROM需要你知道表格的名字;WHERE 限定查询的范围;GROUPBY 确定索引,保证索引列中的值是唯一的(或者理解成这一列是配合聚合函数使用的)。
SELECT column1, column2, ...
FROM table_name
WHERE condition
GROUP BY column1, column2, ...;

其中,column1、column2等是要选择的列的名称,table_name是要选择的表的名称。如果要选择所有列,则可以使用星号(*)代替列名。

  1. 基础命令上结合聚合函数(MAX,MIN等):MIMIC 查询过程中大量使用聚合函数,主要的目的是多行的值唯一化(最常见的情况是多个时间点收集同样的值),至于使用什么样的聚合函数不同的变量也有约定俗成的做法。注意恰当选择GROUPBY列,是聚合函数聚合的依据。
  2. case搜索函数 用在长表中,实现行转列,这种情况是一行中有多种数据,取其中的几类,并转为列。
-- begin query that extracts the data--本查询的目的是查询两个酶的表达
SELECTMAX(subject_id) AS subject_id--MAX是从多条记录中取一条,最新的。, MAX(hadm_id) AS hadm_id, MAX(charttime) AS charttime, le.specimen_id-- 索引列,使用GROUPBY, MAX(CASE WHEN itemid = 51002 THEN value ELSE NULL END) AS troponin_i--这里的MAX一方面是求最大值,另一方面是多条记录转单条记录的操作, MAX(CASE WHEN itemid = 51003 THEN value ELSE NULL END) AS troponin_t--CASE WHEN 这里是纵向转横向的一个操作,很多地方都见过。, MAX(CASE WHEN itemid = 50911 THEN valuenum ELSE NULL END) AS ck_mb
FROM mimic_hosp.labevents le
WHERE le.itemid IN
(-- 51002, -- Troponin I (troponin-I is not measured in MIMIC-IV)-- 52598, -- Troponin I, point of care, rare/poor quality51003, -- Troponin T50911  -- Creatinine Kinase, MB isoenzyme
)
GROUP BY le.specimen_id
--按照每个样本分组,查询结果是specimen_id的值是唯一的,其他列可能有重复值。
--让我想起索引的操作,一系列的数据中是不是总是要有个作为索引呢?
;
  1. 再体会另外一个完全相似的查询语句例子,也是调取lab表格中相关的条目,位于hosp模块中, 表格是labevents,这是一个长表,同类的项目名集中在itemid这一列中,所以会有case when这样的操作。
SELECTMAX(subject_id) AS subject_id, MAX(hadm_id) AS hadm_id, MAX(charttime) AS charttime, le.specimen_id-- lab数据是使用这个作为索引,在其它的表中是其它的列-- convert from itemid into a meaningful column, MAX(CASE WHEN itemid = 51221 THEN valuenum ELSE NULL END) AS hematocrit, MAX(CASE WHEN itemid = 51222 THEN valuenum ELSE NULL END) AS hemoglobin, MAX(CASE WHEN itemid = 51248 THEN valuenum ELSE NULL END) AS mch, MAX(CASE WHEN itemid = 51249 THEN valuenum ELSE NULL END) AS mchc, MAX(CASE WHEN itemid = 51250 THEN valuenum ELSE NULL END) AS mcv, MAX(CASE WHEN itemid = 51265 THEN valuenum ELSE NULL END) AS platelet, MAX(CASE WHEN itemid = 51279 THEN valuenum ELSE NULL END) AS rbc, MAX(CASE WHEN itemid = 51277 THEN valuenum ELSE NULL END) AS rdw, MAX(CASE WHEN itemid = 52159 THEN valuenum ELSE NULL END) AS rdwsd, MAX(CASE WHEN itemid = 51301 THEN valuenum ELSE NULL END) AS wbc
FROM mimic_hosp.labevents le
--经常用到的表格也就那么几个,hosp模块中的itemid(列)常作为限定条件,
--而valuenum(列)和value(列)常被取出作为分析目标。
WHERE le.itemid IN
(51221, -- hematocrit51222, -- hemoglobin51248, -- MCH51249, -- MCHC51250, -- MCV51265, -- platelets51279, -- RBC51277, -- RDW52159, -- RDW SD51301  -- WBC
)
AND valuenum IS NOT NULL--非空,去除缺失值的方法
-- lab values cannot be 0 and cannot be negative
AND valuenum > 0 --去除异常值
GROUP BY le.specimen_id--建立数据索引,为什么是这一列呢?lab数据一般是这一列。
;
  1. 位于icu模块中,表格是chartevents,这也是一个长表(MIMIC中长表有哪些?可以问chatGPT),注意哪一列作为索引使用?聚合函数的类型?
    用到了avg这个函数, 不再是MAX, 猜测是因为连续性指标, 什么指标求平均,什么指标求MAX, 恐怕要根据医学知识来判断。分类的指标一般用MAX
-- This query pivots the vital signs for the entire patient stay.
-- Vital signs include heart rate, blood pressure, respiration rate, and temperature 这里是生命体征的测量, 求的是平均值
selectce.subject_id, ce.stay_id, ce.charttime, AVG(case when itemid in (220045) and valuenum > 0 and valuenum < 300 then valuenum else null end) as heart_rate, AVG(case when itemid in (220179,220050) and valuenum > 0 and valuenum < 400 then valuenum else null end) as sbp, AVG(case when itemid in (220180,220051) and valuenum > 0 and valuenum < 300 then valuenum else null end) as dbp, AVG(case when itemid in (220052,220181,225312) and valuenum > 0 and valuenum < 300 then valuenum else null end) as mbp, AVG(case when itemid = 220179 and valuenum > 0 and valuenum < 400 then valuenum else null end) as sbp_ni, AVG(case when itemid = 220180 and valuenum > 0 and valuenum < 300 then valuenum else null end) as dbp_ni, AVG(case when itemid = 220181 and valuenum > 0 and valuenum < 300 then valuenum else null end) as mbp_ni, AVG(case when itemid in (220210,224690) and valuenum > 0 and valuenum < 70 then valuenum else null end) as resp_rate, ROUND(AVG(case when itemid in (223761) and valuenum > 70 and valuenum < 120 then (valuenum-32)/1.8 -- converted to degC in valuenum callwhen itemid in (223762) and valuenum > 10 and valuenum < 50  then valuenum else null end), 2) as temperature, MAX(CASE WHEN itemid = 224642 THEN value ELSE NULL END) AS temperature_site--分类, AVG(case when itemid in (220277) and valuenum > 0 and valuenum <= 100 then valuenum else null end) as spo2--SELECT内去除异常值, AVG(case when itemid in (225664,220621,226537) and valuenum > 0 then valuenum else null end) as glucoseFROM mimic_icu.chartevents ce --chawhere ce.stay_id IS NOT NULLand ce.itemid in(220045, -- Heart Rate225309, -- ART BP Systolic225310, -- ART BP Diastolic225312, -- ART BP Mean220050, -- Arterial Blood Pressure systolic220051, -- Arterial Blood Pressure diastolic220052, -- Arterial Blood Pressure mean220179, -- Non Invasive Blood Pressure systolic220180, -- Non Invasive Blood Pressure diastolic220181, -- Non Invasive Blood Pressure mean220210, -- Respiratory Rate224690, -- Respiratory Rate (Total)220277, -- SPO2, peripheral-- GLUCOSE, both lab and fingerstick225664, -- Glucose finger stick220621, -- Glucose (serum)226537, -- Glucose (whole blood)-- TEMPERATURE223762, -- "Temperature Celsius"223761,  -- "Temperature Fahrenheit"224642 -- Temperature Site-- 226329 -- Blood Temperature CCO (C)
)--这些编码能不能用在我们自建的数据库中? 可以自己制作或者网上寻找itemid的对照表,方便了解itemid中的内容。
group by ce.subject_id, ce.stay_id, ce.charttime--select中没有使用聚合函数的,就使用GROUPBY?
;

二、多表查询:临时表的使用,join命令

SQL计算并生成新的列(临时表的使用)

mimi_derived这个表中是经过预处理的数据,有大用。

-- This query extracts the serum creatinine baselines of adult patients on each hospital admission.
-- The baseline is determined by the following rules:
--     i. if the lowest creatinine value during this admission is normal (<1.1), then use the value
--     ii. if the patient is diagnosed with chronic kidney disease (CKD), then use the lowest creatinine value during the admission, although it may be rather large.
--     iii. Otherwise, we estimate the baseline using the Simplified MDRD Formula:
--          eGFR = 186 × Scr^(-1.154) × Age^(-0.203) × 0.742Female
--     Let eGFR = 75. Scr = [ 75 / 186 / Age^(-0.203) / (0.742Female) ] ^ (1/-1.154)
WITH p as
(SELECT ag.subject_id, ag.hadm_id, ag.age, p.gender, CASE WHEN p.gender='F' THEN POWER(75.0 / 186.0 / POWER(ag.age, -0.203) / 0.742, -1/1.154)ELSE POWER(75.0 / 186.0 / POWER(ag.age, -0.203), -1/1.154)END AS MDRD_estFROM `physionet-data.mimic_derived.age` agLEFT JOIN `physionet-data.mimic_core.patients` pON ag.subject_id = p.subject_idWHERE ag.age >= 18
)
, lab as
(SELECT hadm_id, MIN(creatinine) AS scr_minFROM `physionet-data.mimic_derived.chemistry`GROUP BY hadm_id
)
, ckd as --建立一个临时表, 用诊断限定查询的范围
(SELECT hadm_id, MAX(1) AS CKD_flagFROM mimic_hosp.diagnoses_icd --另外一个常用的表格,其中的icd_code和icd_version常结合使用。WHERE (SUBSTR(icd_code, 1, 3) = '585'--String数据的处理AND icd_version = 9)OR (SUBSTR(icd_code, 1, 3) = 'N18'AND icd_version = 10)GROUP BY 1-- 1这里指第一列
)
--主查询的功能是罗列查询的列,连接各个表格,较复杂的都在临时表处理好。
SELECT p.hadm_id, p.gender, p.age, lab.scr_min, COALESCE(ckd.ckd_flag, 0) AS ckd, p.MDRD_est, CASE WHEN lab.scr_min<=1.1 THEN scr_minWHEN ckd.ckd_flag=1 THEN scr_minELSE MDRD_est END AS scr_baseline
FROM p
LEFT JOIN lab
ON p.hadm_id = lab.hadm_id
LEFT JOIN ckd
ON p.hadm_id = ckd.hadm_id
;

三、更多的列子

5.单位转换:一列中有不同来源的数据(单位不同),分开处理,再合并,注意列中是否有单位不一致的情况。
并去除超范围的值。
icu模块中重要的表格叫做chartevents,其中的itemid(列)和valuenum是最重要的列。

-- prep height
WITH ht_in AS
(SELECT c.subject_id, c.stay_id, c.charttime-- Ensure that all heights are in centimeters 单位的一个转换,为了后续的合并, ROUND(c.valuenum * 2.54, 2) AS height, c.valuenum as height_origFROM `physionet-data.mimic_icu.chartevents` cWHERE c.valuenum IS NOT NULL-- Height (measured in inches)AND c.itemid = 226707
)
, ht_cm AS--两个临时表用逗号分隔
(SELECT c.subject_id, c.stay_id, c.charttime-- Ensure that all heights are in centimeters, ROUND(c.valuenum, 2) AS heightFROM `physionet-data.mimic_icu.chartevents` cWHERE c.valuenum IS NOT NULL-- Height cmAND c.itemid = 226730
)
-- merge cm/height, only take 1 value per charted row
, ht_stg0 AS
(SELECTCOALESCE(h1.subject_id, h1.subject_id) as subject_id--coalesce()的英文意思是合并的作用是:返回传入的参数中第一个非null的值, COALESCE(h1.stay_id, h1.stay_id) AS stay_id, COALESCE(h1.charttime, h1.charttime) AS charttime, COALESCE(h1.height, h2.height) as heightFROM ht_cm h1FULL OUTER JOIN ht_in h2ON h1.subject_id = h2.subject_idAND h1.charttime = h2.charttime
)
--主查询
SELECT subject_id, stay_id, charttime, height
FROM ht_stg0
WHERE height IS NOT NULL--去掉缺失值
-- filter out bad heights 过滤有可能出错的数据
AND height > 120 AND height < 230;
  1. 这里没有使用临时表的形式
selectstay_id, charttime, sum(urineoutput) as urineoutput
from
(select-- patient identifiersoe.stay_id, oe.charttime-- volumes associated with urine output ITEMIDs-- note we consider input of GU irrigant as a negative volume-- GU irrigant volume in usually has a corresponding volume out-- so the net is often 0, despite large irrigant volumes, casewhen oe.itemid = 227488 and oe.value > 0 then -1*oe.valueelse oe.valueend as urineoutputfrom `physionet-data.mimic_icu.outputevents` oewhere itemid in(226559, -- Foley226560, -- Void226561, -- Condom Cath226584, -- Ileoconduit226563, -- Suprapubic226564, -- R Nephrostomy226565, -- L Nephrostomy226567, -- Straight Cath226557, -- R Ureteral Stent226558, -- L Ureteral Stent227488, -- GU Irrigant Volume In227489  -- GU Irrigant/Urine Volume Out)
) uo
group by stay_id, charttime--outputevents的索引列
;
  1. 综合性的例子:
-- For reference, some common unit conversions:
-- 10^9/L == K/uL == 10^3/uL
WITH blood_diff AS
(
SELECTMAX(subject_id) AS subject_id--分类变量都用MAX, MAX(hadm_id) AS hadm_id, MAX(charttime) AS charttime, le.specimen_id-- lab通常使用这个作为索引-- create one set of columns for percentages, and one set of columns for counts-- we harmonize all count units into K/uL == 10^9/L-- counts have an "_abs" suffix, percentages do not-- absolute counts, MAX(CASE WHEN itemid in (51300, 51301, 51755) THEN valuenum ELSE NULL END) AS wbc, MAX(CASE WHEN itemid = 52069 THEN valuenum ELSE NULL END) AS basophils_abs-- 52073 in K/uL, 51199 in #/uL, MAX(CASE WHEN itemid = 52073 THEN valuenum WHEN itemid = 51199 THEN valuenum / 1000.0 ELSE NULL END) AS eosinophils_abs-- 51133 in K/uL, 52769 in #/uL, MAX(CASE WHEN itemid = 51133 THEN valuenum WHEN itemid = 52769 THEN valuenum / 1000.0 ELSE NULL END) AS lymphocytes_abs-- 52074 in K/uL, 51253 in #/uL, MAX(CASE WHEN itemid = 52074 THEN valuenum WHEN itemid = 51253 THEN valuenum / 1000.0 ELSE NULL END) AS monocytes_abs, MAX(CASE WHEN itemid = 52075 THEN valuenum ELSE NULL END) AS neutrophils_abs-- convert from #/uL to K/uL, MAX(CASE WHEN itemid = 51218 THEN valuenum / 1000.0 ELSE NULL END) AS granulocytes_abs-- percentages, equal to cell count / white blood cell count, MAX(CASE WHEN itemid = 51146 THEN valuenum ELSE NULL END) AS basophils, MAX(CASE WHEN itemid = 51200 THEN valuenum ELSE NULL END) AS eosinophils, MAX(CASE WHEN itemid in (51244, 51245) THEN valuenum ELSE NULL END) AS lymphocytes, MAX(CASE WHEN itemid = 51254 THEN valuenum ELSE NULL END) AS monocytes, MAX(CASE WHEN itemid = 51256 THEN valuenum ELSE NULL END) AS neutrophils-- other cell count percentages, MAX(CASE WHEN itemid = 51143 THEN valuenum ELSE NULL END) AS atypical_lymphocytes, MAX(CASE WHEN itemid = 51144 THEN valuenum ELSE NULL END) AS bands, MAX(CASE WHEN itemid = 52135 THEN valuenum ELSE NULL END) AS immature_granulocytes, MAX(CASE WHEN itemid = 51251 THEN valuenum ELSE NULL END) AS metamyelocytes, MAX(CASE WHEN itemid = 51257 THEN valuenum ELSE NULL END) AS nrbc-- utility flags which determine whether imputation is possible, CASE-- WBC is availableWHEN MAX(CASE WHEN itemid in (51300, 51301, 51755) THEN valuenum ELSE NULL END) > 0-- and we have at least one percentage from the diff-- sometimes the entire diff is 0%, which looks like bad dataAND SUM(CASE WHEN itemid IN (51146, 51200, 51244, 51245, 51254, 51256) THEN valuenum ELSE NULL END) > 0THEN 1 ELSE 0 END AS impute_absFROM mimic_hosp.labevents le
WHERE le.itemid IN
(51146, -- basophils52069, -- Absolute basophil count51199, -- Eosinophil Count51200, -- Eosinophils52073, -- Absolute Eosinophil count51244, -- Lymphocytes51245, -- Lymphocytes, Percent51133, -- Absolute Lymphocyte Count52769, -- Absolute Lymphocyte Count51253, -- Monocyte Count51254, -- Monocytes52074, -- Absolute Monocyte Count51256, -- Neutrophils52075, -- Absolute Neutrophil Count51143, -- Atypical lymphocytes51144, -- Bands (%)51218, -- Granulocyte Count52135, -- Immature granulocytes (%)51251, -- Metamyelocytes51257,  -- Nucleated Red Cells-- wbc totals measured in K/uL51300, 51301, 51755-- 52220 (wbcp) is percentage-- below are point of care tests which are extremely infrequent and usually low quality-- 51697, -- Neutrophils (mmol/L)-- below itemid do not have data as of MIMIC-IV v1.0-- 51536, -- Absolute Lymphocyte Count-- 51537, -- Absolute Neutrophil-- 51690, -- Lymphocytes-- 52151, -- NRBC
)
AND valuenum IS NOT NULL
-- differential values cannot be negative
AND valuenum >= 0
GROUP BY le.specimen_id
)
--以下是主查询
SELECT 
subject_id, hadm_id, charttime, specimen_id, wbc
-- impute absolute count if percentage & WBC is available
, ROUND(CASEWHEN basophils_abs IS NULL AND basophils IS NOT NULL AND impute_abs = 1THEN basophils * wbcELSE basophils_abs
END, 4) AS basophils_abs
, ROUND(CASEWHEN eosinophils_abs IS NULL AND eosinophils IS NOT NULL AND impute_abs = 1THEN eosinophils * wbcELSE eosinophils_abs
END, 4) AS eosinophils_abs
, ROUND(CASEWHEN lymphocytes_abs IS NULL AND lymphocytes IS NOT NULL AND impute_abs = 1THEN lymphocytes * wbcELSE lymphocytes_abs
END, 4) AS lymphocytes_abs
, ROUND(CASEWHEN monocytes_abs IS NULL AND monocytes IS NOT NULL AND impute_abs = 1THEN monocytes * wbcELSE monocytes_abs
END, 4) AS monocytes_abs
, ROUND(CASEWHEN neutrophils_abs IS NULL AND neutrophils IS NOT NULL AND impute_abs = 1THEN neutrophils * wbcELSE neutrophils_abs
END, 4) AS neutrophils_abs, basophils
, eosinophils
, lymphocytes
, monocytes
, neutrophils-- impute bands/blasts?
, atypical_lymphocytes
, bands
, immature_granulocytes
, metamyelocytes
, nrbc
FROM blood_diff
;

总结

从简单查询中,可以学到以下几点:

  • 聚合函数的使用:多行变一行,不同的表GROUPBY的列不一样;
  • Case搜索函数的使用,可以行变列,主要用于几个长表;
  • 体验with临时表
  • 注意数据中可能有单位不相同的情况

这篇关于MIMIC-iv官方SQL查询标注(简单基础篇)的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/580097

相关文章

将sqlserver数据迁移到mysql的详细步骤记录

《将sqlserver数据迁移到mysql的详细步骤记录》:本文主要介绍将SQLServer数据迁移到MySQL的步骤,包括导出数据、转换数据格式和导入数据,通过示例和工具说明,帮助大家顺利完成... 目录前言一、导出SQL Server 数据二、转换数据格式为mysql兼容格式三、导入数据到MySQL数据

MySQL分表自动化创建的实现方案

《MySQL分表自动化创建的实现方案》在数据库应用场景中,随着数据量的不断增长,单表存储数据可能会面临性能瓶颈,例如查询、插入、更新等操作的效率会逐渐降低,分表是一种有效的优化策略,它将数据分散存储在... 目录一、项目目的二、实现过程(一)mysql 事件调度器结合存储过程方式1. 开启事件调度器2. 创

SQL Server使用SELECT INTO实现表备份的代码示例

《SQLServer使用SELECTINTO实现表备份的代码示例》在数据库管理过程中,有时我们需要对表进行备份,以防数据丢失或修改错误,在SQLServer中,可以使用SELECTINT... 在数据库管理过程中,有时我们需要对表进行备份,以防数据丢失或修改错误。在 SQL Server 中,可以使用 SE

mysql外键创建不成功/失效如何处理

《mysql外键创建不成功/失效如何处理》文章介绍了在MySQL5.5.40版本中,创建带有外键约束的`stu`和`grade`表时遇到的问题,发现`grade`表的`id`字段没有随着`studen... 当前mysql版本:SELECT VERSION();结果为:5.5.40。在复习mysql外键约

SQL注入漏洞扫描之sqlmap详解

《SQL注入漏洞扫描之sqlmap详解》SQLMap是一款自动执行SQL注入的审计工具,支持多种SQL注入技术,包括布尔型盲注、时间型盲注、报错型注入、联合查询注入和堆叠查询注入... 目录what支持类型how---less-1为例1.检测网站是否存在sql注入漏洞的注入点2.列举可用数据库3.列举数据库

Oracle查询优化之高效实现仅查询前10条记录的方法与实践

《Oracle查询优化之高效实现仅查询前10条记录的方法与实践》:本文主要介绍Oracle查询优化之高效实现仅查询前10条记录的相关资料,包括使用ROWNUM、ROW_NUMBER()函数、FET... 目录1. 使用 ROWNUM 查询2. 使用 ROW_NUMBER() 函数3. 使用 FETCH FI

数据库oracle用户密码过期查询及解决方案

《数据库oracle用户密码过期查询及解决方案》:本文主要介绍如何处理ORACLE数据库用户密码过期和修改密码期限的问题,包括创建用户、赋予权限、修改密码、解锁用户和设置密码期限,文中通过代码介绍... 目录前言一、创建用户、赋予权限、修改密码、解锁用户和设置期限二、查询用户密码期限和过期后的修改1.查询用

Mysql虚拟列的使用场景

《Mysql虚拟列的使用场景》MySQL虚拟列是一种在查询时动态生成的特殊列,它不占用存储空间,可以提高查询效率和数据处理便利性,本文给大家介绍Mysql虚拟列的相关知识,感兴趣的朋友一起看看吧... 目录1. 介绍mysql虚拟列1.1 定义和作用1.2 虚拟列与普通列的区别2. MySQL虚拟列的类型2

mysql数据库分区的使用

《mysql数据库分区的使用》MySQL分区技术通过将大表分割成多个较小片段,提高查询性能、管理效率和数据存储效率,本文就来介绍一下mysql数据库分区的使用,感兴趣的可以了解一下... 目录【一】分区的基本概念【1】物理存储与逻辑分割【2】查询性能提升【3】数据管理与维护【4】扩展性与并行处理【二】分区的

MySQL中时区参数time_zone解读

《MySQL中时区参数time_zone解读》MySQL时区参数time_zone用于控制系统函数和字段的DEFAULTCURRENT_TIMESTAMP属性,修改时区可能会影响timestamp类型... 目录前言1.时区参数影响2.如何设置3.字段类型选择总结前言mysql 时区参数 time_zon