Data filtering and cleansing

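You can use SPL instructions and SQL functions to filter and cleanse the large volumes of log data that you collect and to standardize the data format. This topic describes common filtering and cleansing scenarios and the related operations.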

Scenario 1: Filter logs (where instruction)

You can use the where instruction to filter logs. The syntax is as follows:

where <bool-expression>

Examples:

Sub-scenario 1: Filter log entries by field content.

Raw logs

# Log 1
__source__:  192.168.0.1
__tag__:__client_ip__:  192.168.0.2
__tag__:__receive_time__:  1597214851
__topic__: app
class:  test_case
id:  7992
test_string:  <function test1 at 0x1027401e0>
# Log 2
__source__:  192.168.0.1
__tag__:__client_ip__:  192.168.0.2
__tag__:__receive_time__:  1597214861
__topic__: web
class:  test_case
id:  7992
test_string:  <function test1 at 0x1027401e0>

SPL statement
Discard logs whose __topic__ field is app.
```
* | where __topic__!='app'
```

Output

__source__:  192.168.0.1
__tag__:__client_ip__:  192.168.0.2
__tag__:__receive_time__:  1597214861
__topic__: web
class:  test_case
id:  7992
test_string:  <function test1 at 0x1027401e0>
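The filtering behavior above can be modeled outside SPL. The following is a minimal Python sketch, not the SPL implementation: the `where` helper and the trimmed-down `logs` list are hypothetical and only mirror the two sample logs.

```python
# Hypothetical in-memory model of the two sample logs; only the fields
# relevant to the filter are kept.
logs = [
    {"__topic__": "app", "id": "7992"},
    {"__topic__": "web", "id": "7992"},
]


def where(entries, predicate):
    """Keep only the entries for which the predicate is true."""
    return [entry for entry in entries if predicate(entry)]


# Equivalent in spirit to: * | where __topic__!='app'
kept = where(logs, lambda entry: entry["__topic__"] != "app")
print(kept)
```

Only the log with the web topic remains, which matches the output shown above.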

Sub-scenario 2: Filter log entries with a regular expression that matches a field value.

Raw logs

# Log 1
__source__:  192.168.0.1
__tag__:__client_ip__:  192.168.0.2
__tag__:__receive_time__:  1597214851
__topic__: app
class:  test_case
id:  7992
test_string:  <function test1 at 0x1027401e0>
server_protocol: test
# Log 2
__source__:  192.168.0.1
__tag__:__client_ip__:  192.168.0.2
__tag__:__receive_time__:  1597214861
__topic__: web
class:  test_case
id:  7992
test_string:  <function test1 at 0x1027401e0>
server_protocol: 14861

SPL statement
Keep only the logs whose server_protocol field is numeric.
```
* | where regexp_like(server_protocol, '\d+')
```

Output

__source__:  192.168.0.1
__tag__:__client_ip__:  192.168.0.2
__tag__:__receive_time__:  1597214861
__topic__: web
class:  test_case
id:  7992
test_string:  <function test1 at 0x1027401e0>
server_protocol: 14861
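One detail worth checking when you adapt this filter is whether the pattern has to match the whole value. The Python sketch below assumes that regexp_like succeeds when the pattern matches any part of the value (an assumption about SPL that you should verify against the function reference). The value `HTTP/1.1` is a hypothetical third sample that is not in the logs above.

```python
import re

# "test" and "14861" come from the sample logs; "HTTP/1.1" is hypothetical.
values = ["test", "14861", "HTTP/1.1"]

# Partial match: true if digits appear anywhere in the value.
partial = [v for v in values if re.search(r"\d+", v)]

# Anchored match: true only if the whole value consists of digits.
anchored = [v for v in values if re.search(r"^\d+$", v)]

print(partial)
print(anchored)
```

Under that assumption, anchoring the pattern with `^` and `$` is the safer choice when the intent is "the field is a number", because an unanchored pattern would also keep values that merely contain digits.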

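Scenario 2: Assign values to missing or empty fields (extend and parse-regexp instructions)

You can use the extend and parse-regexp instructions to assign values to fields that are missing or empty. Examples:

Sub-scenario 1: Assign a value to a field that does not exist or is empty.

* | extend <output>=<expression>, ...

Input data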
```
name:
```
SPL statement: assign a value to the name field
```
* | extend name='lily'
```
Output
```
name:lily
```

Sub-scenario 2: Use a regular expression to extract structured content from a text field.

* | parse-regexp -flags=<flags> <field>, <pattern> as <output>, ...

Input data

content: '10.0.0.0 GET /index.html 15824 0.043'

SPL statement

* | parse-regexp content, '(\S+)' as ip | parse-regexp content, '\S+\s+(\w+)' as method

Output

content: '10.0.0.0 GET /index.html 15824 0.043'
ip: '10.0.0.0'
method: 'GET'
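To see why these two patterns yield ip and method, the extraction can be reproduced with Python's `re` module. This is an illustrative sketch only: `parse_regexp` is a hypothetical helper that returns the first capture group of the first match, which is how the example above behaves.

```python
import re

content = "10.0.0.0 GET /index.html 15824 0.043"


def parse_regexp(text, pattern):
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# (\S+) captures the first run of non-whitespace characters.
ip = parse_regexp(content, r"(\S+)")

# \S+\s+ skips the first token and the whitespace after it,
# then (\w+) captures the second token.
method = parse_regexp(content, r"\S+\s+(\w+)")

# A single pattern with two groups extracts both values in one pass.
both = re.search(r"(\S+)\s+(\w+)", content).groups()

print(ip, method, both)
```

The last line shows that one pattern with several capture groups can do the work of the two chained steps, which is usually easier to maintain as the log format grows.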

Sub-scenario 3: Assign values to multiple fields.

* | extend <output>=<expression> | extend <output1>=<expression> | extend <output2>=<expression>

Input data

__source__:  192.168.0.1
__topic__:
__tag__:
__receive_time__:
id:  7990
test_string:  <function test1 at 0x1020401e0>

SPL statement

Assign values to the __topic__, __tag__, and __receive_time__ fields.

* | extend __topic__='app' | extend __tag__='stu' | extend __receive_time__='1597214851'

Output

__source__:  192.168.0.1
__topic__:  app
__tag__:  stu
__receive_time__:  1597214851
id:  7990
test_string:  <function test1 at 0x1020401e0>

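Scenario 3: Delete and rename fields (project-away and project-rename instructions)

Use the project-away and project-rename instructions to delete and rename fields.

Sub-scenario 1: Delete specific fields.

* | project-away -wildcard-off <field-pattern>, ...

Input data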
```
content: 123
age: 23
name: twiss
```
SPL statement
```
* | project-away age, name
```
Output
```
content: 123
```

Sub-scenario 2: Rename specific fields.

* | project-rename <output>=<field>, ...

Input data
```
content: 123
age: 23
name: twiss
```

SPL statement

* | project-rename new_age=age, new_name=name

Output

content: 123
new_age: 23
new_name: twiss
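The effect of the two instructions on a single log can be pictured as dictionary operations. The sketch below is a rough Python analogy, not the SPL implementation: `project_away` and `project_rename` are hypothetical helpers, and wildcard handling is left out.

```python
log = {"content": "123", "age": "23", "name": "twiss"}


def project_away(entry, *fields):
    """Return a copy of the entry without the listed fields."""
    return {key: value for key, value in entry.items() if key not in fields}


def project_rename(entry, **mapping):
    """Rename fields; keyword arguments are given as new_name=old_name."""
    old_to_new = {old: new for new, old in mapping.items()}
    return {old_to_new.get(key, key): value for key, value in entry.items()}


removed = project_away(log, "age", "name")
renamed = project_rename(log, new_age="age", new_name="name")

print(removed)
print(renamed)
```

Both helpers return a new dictionary and leave fields they were not asked about untouched, which matches the outputs shown for the two sub-scenarios.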

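Scenario 4: Convert log field types

Sub-scenario 1: Use the cast function to convert field types for arithmetic, and the concat function to concatenate strings.

Input data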
```
x: 123
y: 100
```

SPL statement

* | extend a=cast(x as bigint) + cast(y as bigint)| extend b=concat(x, y)

Output
```
x: 123
y: 100
a: 223
b: 123100
```
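The difference between a and b comes down to types. The short Python sketch below assumes that log fields arrive as strings, which is why the SPL statement casts x and y before adding them. It is an analogy, not the SPL implementation.

```python
# Field values as they would arrive if every field is a string (assumption).
x, y = "123", "100"

# Like cast(x as bigint) + cast(y as bigint): numeric addition.
a = int(x) + int(y)

# Like concat(x, y): string concatenation, no conversion involved.
b = x + y

print(a, b)
```

Without the conversion step, the `+` operator on the two string values would concatenate rather than add in Python. SPL may instead raise a type error, so the explicit cast is the safer habit.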

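Sub-scenario 2: Convert a string or datetime value to a standard time. The following example uses the to_unixtime function to convert the datetime in time1 to a Unix timestamp.

Raw log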
```
time1: 2020-09-17 9:00:00
```

SPL statement

Convert the datetime in time1 to a Unix timestamp.

* | extend time1=cast(time1 as TIMESTAMP) | extend new_time=to_unixtime(time1)

Output

time1:  2020-09-17 9:00:00
new_time:  1600333200.0
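The timestamp in the output can be cross-checked with Python's standard library. The sketch assumes that the time1 value is interpreted as UTC, which is consistent with the sample output. If your data carries a different time zone, the result shifts accordingly.

```python
from datetime import datetime, timezone

time1 = "2020-09-17 9:00:00"

# Parse the string, then mark it as UTC (assumption) before converting.
parsed = datetime.strptime(time1, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

# Seconds since 1970-01-01 00:00:00 UTC, as a float.
new_time = parsed.timestamp()

print(new_time)  # 1600333200.0
```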

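Scenario 5: Fill in default values for fields that do not exist (COALESCE expression)

Use the COALESCE expression to fill in a default value for a field that does not exist.

Input data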
```
server_protocol: 100
```
SPL statement
If server_protocol exists, y takes its value. If server_protocol1 does not exist, x is set to 200.
```
* | extend x=COALESCE(server_protocol1, '200') | extend y=COALESCE(server_protocol, '200')
```
Output
```
server_protocol: 100
x: 200
y: 100
```
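COALESCE returns its first non-null argument. The Python sketch below models that rule. It assumes that a field missing from the log evaluates to null, represented here by `None`, and the `coalesce` helper is hypothetical.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


log = {"server_protocol": "100"}

# server_protocol1 is absent, so the default is used.
x = coalesce(log.get("server_protocol1"), "200")

# server_protocol is present, so its value wins over the default.
y = coalesce(log.get("server_protocol"), "200")

print(x, y)
```

Note that in this model an empty string is not null, so a field that exists but is empty would keep its empty value rather than receive the default.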

Scenario 6: Check log content and add fields (where and extend instructions combined)

Combine the where and extend instructions for this purpose.

* | where <bool-expression> | extend <output>=<expression> |...

Example:

Input data
```
status1: 200
status2: 404
```

SPL statement

* | where status1='200'| extend status1_info='normal' | where status2='404'| extend status2_info='error'

Output

status1: 200
status2: 404
status1_info: normal
status2_info: error
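Because each where stage filters the stream before the next stage runs, the chain behaves like a sequence of gates rather than an if/else. The Python sketch below models this reading of the pipeline. It is an illustration under that assumption, and the `run` function is hypothetical.

```python
def run(entry):
    """Model of: where status1='200' | extend ... | where status2='404' | extend ..."""
    # First gate: logs that fail the condition are dropped entirely.
    if entry.get("status1") != "200":
        return None
    entry = {**entry, "status1_info": "normal"}

    # Second gate: applied only to logs that passed the first one.
    if entry.get("status2") != "404":
        return None
    return {**entry, "status2_info": "error"}


passed = run({"status1": "200", "status2": "404"})
dropped = run({"status1": "500", "status2": "404"})

print(passed)
print(dropped)
```

In this model a log with status1 set to 500 never reaches the second stage, so it does not receive status2_info. It is removed from the output altogether.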
